

Statistical Guarantees for the EM Algorithm: From Population to Sample-Based Analysis

Authors: Balakrishnan, Wainwright, and Yu

Chenxi Zhou

Reading Group in Statistical Learning and Data Mining
September 5th, 2017


Outline

» Overview of Expectation-Maximization (EM) Algorithm

» Population Analysis of First-Order EM Algorithm

» Sample Analysis of First-Order EM Algorithm

» Example: Gaussian Mixture Model


Overview of Expectation-Maximization Algorithm


Estimation of Linkage in Genetics

» 197 animals are distributed multinomially into 5 categories in the complete-data formulation; only 4 cell counts are observed, because the first two categories are recorded through their sum

» Observed data: y = (y1, y2, y3, y4) = (125, 18, 20, 34)

with y1 + y2 + y3 + y4 = 197.

» Cell probabilities: (1/2 + θ/4, (1−θ)/4, (1−θ)/4, θ/4) for some θ ∈ [0, 1]


Example (continued)

» Likelihood function: for the observed data,

L(θ | y) ∝ (1/2 + θ/4)^y1 · ((1−θ)/4)^(y2+y3) · (θ/4)^y4

and, for the complete data x = (x1, x2, x3, x4, x5) with x1 + x2 = y1,

L(θ | x) ∝ (1/2)^x1 · (θ/4)^x2 · ((1−θ)/4)^(x3+x4) · (θ/4)^x5

» How to solve this type of incomplete data problem?
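The EM updates for this linkage model have simple closed forms. As an illustrative sketch (not part of the slides), the loop below runs them in Python: the E step splits the expected count of the first observed cell into its two latent sub-cells, and the M step is a binomial-type maximizer.

```python
# Observed multinomial counts y = (y1, y2, y3, y4) for the linkage example
y1, y2, y3, y4 = 125, 18, 20, 34

def em_step(theta):
    """One EM iteration for the genetic linkage model.

    E step: split the first observed cell (probability 1/2 + theta/4)
    into its latent sub-cells with probabilities 1/2 and theta/4.
    M step: closed-form maximizer of the expected complete-data
    log-likelihood, a binomial-type MLE in theta.
    """
    # E step: expected count falling in the theta/4 sub-cell
    x2 = y1 * (theta / 4) / (1 / 2 + theta / 4)
    # M step
    return (x2 + y4) / (x2 + y2 + y3 + y4)

theta = 0.5  # arbitrary initialization in (0, 1)
for _ in range(100):
    theta = em_step(theta)

print(round(theta, 4))  # converges to the MLE, roughly 0.6268
```

Each iteration only requires the four observed counts, which illustrates the low per-iteration cost discussed later.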


What is the EM Algorithm?

Expectation-Maximization (EM) Algorithm is an iterative method that attempts to find the maximum likelihood estimator of a parameter of a parametric probability distribution in incomplete data problems.

Incompleteness:
- Missing data
- Censored or grouped data
- Latent class and latent data structures


Basic Setup

» Let (Y, Z) be a pair of random variables with the joint density function f_θ(y, z), where θ ∈ Ω

» Ω ⊆ R^d

» Ω is a non-empty, compact, convex set

» Observe n i.i.d. copies y1, …, yn of Y

» The corresponding z1, …, zn are missing or latent

» Goal: Estimate θ* by maximizing the observed-data log-likelihood:

ℓ_n(θ) = (1/n) Σ_{i=1}^n log g_θ(yi),  where g_θ(y) = ∫ f_θ(y, z) dz


EM Idea

Unfortunately, maximizing ℓ_n(θ) directly can be hard! But often the complete data log-likelihood

(1/n) Σ_{i=1}^n log f_θ(yi, zi)

is easier to maximize. So we replace the complete data log-likelihood by its conditional expectation given the observed data:

Q_n(θ | θ') = (1/n) Σ_{i=1}^n ∫ k_{θ'}(z | yi) log f_θ(yi, z) dz,

where the expectation is computed with respect to the conditional density k_{θ'}(z | y) = f_{θ'}(y, z) / g_{θ'}(y) at the current iterate θ'.


EM Algorithm

Starting with an initial iterate θ^0 ∈ Ω, iterate the following steps for t = 0, 1, 2, ….

» Expectation Step: Compute the EM surrogate Q_n(· | θ^t):

Q_n(θ | θ^t) = (1/n) Σ_{i=1}^n ∫ k_{θ^t}(z | yi) log f_θ(yi, z) dz

» Maximization Step: Maximize the EM surrogate:

θ^{t+1} = argmax_{θ ∈ Ω} Q_n(θ | θ^t)
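In code, the two steps amount to a simple alternating loop. The skeleton below is a generic sketch: `e_step` and `m_step` are illustrative placeholder callables that a user supplies for a particular model, not names from the paper.

```python
from typing import Callable, TypeVar

Theta = TypeVar("Theta")

def em(theta0: Theta,
       e_step: Callable[[Theta], object],
       m_step: Callable[[object], Theta],
       n_iters: int = 100) -> Theta:
    """Generic EM loop: alternate building the surrogate Q_n(. | theta)
    (E step) and maximizing it (M step), starting from theta0."""
    theta = theta0
    for _ in range(n_iters):
        surrogate = e_step(theta)  # sufficient statistics of Q_n(. | theta)
        theta = m_step(surrogate)  # theta <- argmax of the surrogate
    return theta
```

For models with closed-form M steps (like the linkage example), `e_step` returns expected complete-data counts and `m_step` is a one-line formula.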


Advantages

» Easy to implement

» Requires small storage space

» Low cost per iteration

» Ascent property: ℓ_n(θ^{t+1}) ≥ ℓ_n(θ^t); if {ℓ_n(θ^t)} is bounded above, it converges monotonically to some value ℓ_n(θ̄), where θ̄ is a stationary point of ℓ_n (under the regularity conditions of Wu, 1983)


Drawbacks

» Finding the exact maximizer in the M step can be hard


As for the First Drawback...

» Generalized EM Algorithm: Just choose θ^{t+1} so that

Q_n(θ^{t+1} | θ^t) ≥ Q_n(θ^t | θ^t)

» First-Order EM Algorithm: Assume Q_n(· | θ^t) is differentiable in the first argument at each iteration t. Given a step size α > 0, the updates are

θ^{t+1} = θ^t + α ∇Q_n(θ^t | θ^t),

where the gradient ∇Q_n is taken in the first argument of Q_n(· | θ^t).
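As a sketch of the first-order variant (with an illustrative model and simulated data, not taken from the slides), the following runs gradient EM on a one-dimensional symmetric two-component Gaussian mixture with σ = 1, where ∇Q_n(θ | θ) has the closed form (1/n) Σ_i (2 w_θ(yi) − 1) yi − θ.

```python
import math
import random

random.seed(0)

# Simulate a symmetric two-component GMM: y = z * theta_star + v,
# z uniform on {-1, +1}, v ~ N(0, 1)  (sigma = 1 for simplicity)
theta_star = 2.0
n = 5000
ys = [random.choice([-1, 1]) * theta_star + random.gauss(0, 1) for _ in range(n)]

def grad_Q(theta):
    """Gradient of the sample EM surrogate Q_n(. | theta) at theta itself:
    (1/n) sum_i (2 w_theta(y_i) - 1) y_i - theta, where
    w_theta(y) = 1 / (1 + exp(-2 theta y)) is the posterior
    probability that z_i = +1 given y_i (for sigma = 1)."""
    s = 0.0
    for y in ys:
        w = 1.0 / (1.0 + math.exp(-2.0 * theta * y))
        s += (2.0 * w - 1.0) * y
    return s / n - theta

# First-order EM: gradient ascent on the surrogate with step size alpha
alpha = 1.0               # an illustrative step-size choice
theta = 0.6 * theta_star  # initialize reasonably close to theta_star
for _ in range(50):
    theta = theta + alpha * grad_Q(theta)

print(abs(theta - theta_star))  # small: only the statistical error remains
```

With step size 1 this update coincides with the standard EM update for this model, which is why the iterates converge so quickly here.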


Drawbacks

» Finding the exact maximizer in the M step can be hard

» No guarantee of convergence to the global maximum of ℓ_n (depending on the choice of starting point)


An Example (Murray, 1977; Wu, 1983)

Twelve observations are collected from a bivariate normal distribution with mean (0, 0), correlation coefficient ρ and variances σ1², σ2²; for some observations one of the two coordinates is missing.

The likelihood function has

- two global maxima, symmetric in the sign of the correlation: (σ̂1², σ̂2², ρ̂) and (σ̂1², σ̂2², −ρ̂); and

- a saddle point with correlation ρ = 0.

The EM algorithm starting at any point with ρ^0 = 0 will return the saddle point.


Drawbacks

» Finding the exact maximizer in the M step can be hard

» No guarantee of convergence to the global maximum of ℓ_n (depending on the choice of starting point)

» Convergence of ℓ_n(θ^t) to ℓ_n(θ̄), where θ̄ is a stationary point, does NOT imply convergence of θ^t to θ̄, and Wu (1983) only established conditions for convergence of θ^t to a stationary point


Contributions of Balakrishnan et al. (2017)

» Quantitative characterization of a basin of attraction around θ*

» Where to choose the initialization θ^0 to ensure convergence of the EM iterates to (a small neighborhood of) θ*

» Establishment of the convergence rate and the corresponding conditions

» Establishment of connections between population and sample analysis


Population Version of EM Algorithm

E Step: Compute the following population version surrogate function

Q(θ | θ') = E[ ∫ k_{θ'}(z | Y) log f_θ(Y, z) dz ],

where the outer expectation is over Y drawn from the true model.

M Step:

» Standard EM: θ^{t+1} = argmax_{θ ∈ Ω} Q(θ | θ^t)

» First-Order EM: θ^{t+1} = θ^t + α ∇Q(θ^t | θ^t)


Oracle Surrogate Function and Iterates

» Oracle Surrogate Function

q(θ) := Q(θ | θ*)

» Oracle Iterates

θ̃^{t+1} = θ̃^t + α ∇q(θ̃^t) = θ̃^t + α ∇Q(θ̃^t | θ*)


Oracle Surrogate Function and Iterates

Why do we need them?

» If q satisfies strong concavity and smoothness, then the gradient ascent updates on q achieve a geometric convergence rate to θ*

» The population version first-order EM updates can be viewed as a perturbation of the oracle updates

» Therefore, in the population level analysis of the EM algorithm, we need to control the quantity

‖∇Q(θ | θ) − ∇Q(θ | θ*)‖_2
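The first bullet can be checked numerically. The sketch below (an illustration, not the paper's proof) runs gradient ascent on a strongly concave quadratic with λ = 1, μ = 4 and step size 2/(λ + μ), and verifies the classical per-step contraction factor (μ − λ)/(μ + λ).

```python
# Gradient ascent on q(x) = -0.5 * (lam * x1^2 + mu * x2^2),
# which is lam-strongly concave and mu-smooth; maximizer x* = 0.
lam, mu = 1.0, 4.0
alpha = 2.0 / (lam + mu)
kappa = (mu - lam) / (mu + lam)  # predicted per-step contraction factor (0.6)

def grad(x):
    return (-lam * x[0], -mu * x[1])

x = (1.0, 1.0)
for _ in range(10):
    g = grad(x)
    prev_norm = (x[0] ** 2 + x[1] ** 2) ** 0.5
    x = (x[0] + alpha * g[0], x[1] + alpha * g[1])
    new_norm = (x[0] ** 2 + x[1] ** 2) ** 0.5
    assert new_norm <= kappa * prev_norm + 1e-12  # geometric contraction

print(new_norm)  # sqrt(2) * 0.6**10, roughly 0.0086
```

For this diagonal quadratic the contraction is exact: each coordinate is scaled by ±κ per step, so the error norm shrinks by exactly κ.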


Population Analysis of First-Order EM Algorithm


Recall the population first-order updates are

θ^{t+1} = θ^t + α ∇Q(θ^t | θ^t),

and the oracle updates are

θ̃^{t+1} = θ̃^t + α ∇Q(θ̃^t | θ*).

Condition 1: Gradient Smoothness

For an appropriately small parameter γ ≥ 0,

‖∇Q(θ | θ*) − ∇Q(θ | θ)‖_2 ≤ γ ‖θ − θ*‖_2

for all θ ∈ B(r; θ*) := {θ' ∈ Ω : ‖θ' − θ*‖_2 ≤ r}.


Condition 2: λ-Strong Concavity

There is some λ > 0 such that

q(θ1) − q(θ2) − ⟨∇q(θ2), θ1 − θ2⟩ ≤ −(λ/2) ‖θ1 − θ2‖_2²

or, equivalently,

⟨∇q(θ1) − ∇q(θ2), θ1 − θ2⟩ ≤ −λ ‖θ1 − θ2‖_2²

for all pairs θ1, θ2 ∈ B(r; θ*).


Condition 3: μ-Smoothness

There is some μ ≥ λ such that

q(θ1) − q(θ2) − ⟨∇q(θ2), θ1 − θ2⟩ ≥ −(μ/2) ‖θ1 − θ2‖_2²

or, equivalently,

⟨∇q(θ1) − ∇q(θ2), θ1 − θ2⟩ ≥ −μ ‖θ1 − θ2‖_2²

for all pairs θ1, θ2 ∈ B(r; θ*).


Theorem 1 (General Population-Level Guarantee)

» Suppose that, for some radius r > 0 and a triplet (γ, λ, μ) with 0 ≤ γ < λ ≤ μ, the γ-gradient smoothness, λ-strong concavity and μ-smoothness conditions hold over B(r; θ*);

» Choose the step size α = 2/(μ + λ).

Then, given any initialization θ^0 ∈ B(r; θ*), the population first-order EM iterates satisfy the bound

‖θ^t − θ*‖_2 ≤ κ^t ‖θ^0 − θ*‖_2,  where κ := (μ − λ + 2γ)/(μ + λ) < 1.


Sample Analysis of First-Order EM Algorithm


Recall that the sample first-order EM updates are

θ^{t+1} = θ^t + α ∇Q_n(θ^t | θ^t).

The analysis of the finite sample first-order EM algorithm depends on the empirical process

sup_{θ ∈ B(r; θ*)} ‖∇Q_n(θ | θ) − ∇Q(θ | θ)‖_2.


For a given sample size n and tolerance parameter δ ∈ (0, 1), let ε_n(δ) be the smallest scalar such that

sup_{θ ∈ B(r; θ*)} ‖∇Q_n(θ | θ) − ∇Q(θ | θ)‖_2 ≤ ε_n(δ)

with probability at least 1 − δ.


Theorem 2 (General Sample-Level Guarantee)

» Suppose that, for some radius r > 0 and a triplet (γ, λ, μ) with 0 ≤ γ < λ ≤ μ, the γ-gradient smoothness, λ-strong concavity and μ-smoothness conditions hold over B(r; θ*);

» Choose the step size α = 2/(μ + λ);

» Suppose the sample size n is large enough to ensure that α ε_n(δ) ≤ (1 − κ) r, where κ := (μ − λ + 2γ)/(μ + λ).


Theorem 2 (General Sample-Level Guarantee) (continued)

Then, with probability at least 1 − δ, given any initialization θ^0 ∈ B(r; θ*), the finite-sample first-order EM iterates

satisfy the bound

‖θ^t − θ*‖_2 ≤ κ^t ‖θ^0 − θ*‖_2 + (α / (1 − κ)) ε_n(δ)

for all t = 1, 2, …


Example: Gaussian Mixture Model


Consider the following two-component Gaussian mixture model

Y = Z θ* + V,

where

Z ∈ {−1, +1} with probability 1/2 each, V ∼ N(0, σ² I_d),

and Z and V are independent.

Key Quantity: the signal-to-noise ratio η := ‖θ*‖_2 / σ
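A small simulation illustrates this model and the standard EM update, which for this mixture has the closed form θ^{t+1} = (1/n) Σ_i (2 w_{θ^t}(yi) − 1) yi. The data, dimensions and initialization below are illustrative choices, not from the paper.

```python
import math
import random

random.seed(1)

# Simulate y_i = z_i * theta_star + v_i in d = 2 with sigma = 1
theta_star = [1.5, -1.0]
d, n, sigma = 2, 4000, 1.0
ys = []
for _ in range(n):
    z = random.choice([-1, 1])
    ys.append([z * theta_star[j] + random.gauss(0, sigma) for j in range(d)])

def w(theta, y):
    """Posterior probability that z = +1 given y."""
    dot = sum(t * yi for t, yi in zip(theta, y))
    return 1.0 / (1.0 + math.exp(-2.0 * dot / sigma ** 2))

theta = [0.8 * t for t in theta_star]  # within the ball of radius ||theta_star||/4
for _ in range(50):
    # Standard EM update: theta <- (1/n) sum_i (2 w_theta(y_i) - 1) y_i
    new = [0.0] * d
    for y in ys:
        c = 2.0 * w(theta, y) - 1.0
        for j in range(d):
            new[j] += c * y[j]
    theta = [s / n for s in new]

err = math.sqrt(sum((t - ts) ** 2 for t, ts in zip(theta, theta_star)))
print(err)  # remaining statistical error, small for this n
```

Here η = ‖θ*‖_2/σ ≈ 1.8, so the mixture components are reasonably well separated and EM from a nearby initialization homes in on θ* up to sampling error.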


To analyze the EM algorithm at the population level, i.e., apply Theorem 1, one needs to establish the γ-gradient smoothness, λ-strong concavity and μ-smoothness.

Oracle Function:

q(θ) = Q(θ | θ*) = −(1/2) E[ w_{θ*}(Y) ‖Y − θ‖_2² + (1 − w_{θ*}(Y)) ‖Y + θ‖_2² ],

where the weighting function

w_θ(y) = exp(−‖y − θ‖_2² / (2σ²)) / [ exp(−‖y − θ‖_2² / (2σ²)) + exp(−‖y + θ‖_2² / (2σ²)) ]

is a smooth function.

It is easy to verify that q is strongly-concave and smooth with parameters 1, i.e., λ = μ = 1 (its Hessian is −I_d).


What about gradient smoothness?

Lemma 2

» Let η = ‖θ*‖_2 / σ ≥ ρ for a sufficiently large constant ρ;

» Let the radius be r = ‖θ*‖_2 / 4.

Then, there is a constant γ with γ ≤ exp(−c η²) < 1, for a universal constant c > 0, such that

‖∇Q(θ | θ*) − ∇Q(θ | θ)‖_2 ≤ γ ‖θ − θ*‖_2 for all θ ∈ B(r; θ*).


Corollary 1 (Population result for the first-order EM algorithm for GMM)

» Let η = ‖θ*‖_2 / σ ≥ ρ for a sufficiently large constant ρ;

» Let the radius r = ‖θ*‖_2 / 4;

» Choose the step size α = 2/(μ + λ) = 1.


Corollary 1 (Population result for the first-order EM algorithm for GMM) (continued)

Then, there is a contraction coefficient κ ≤ exp(−c η²) < 1, where c > 0 is a universal constant, such that for any initialization

θ^0 ∈ B(‖θ*‖_2 / 4; θ*), the population first-order EM iterates satisfy

the bound

‖θ^t − θ*‖_2 ≤ κ^t ‖θ^0 − θ*‖_2

for all t = 1, 2, …


Now, we go from the population to the sample-based analysis of this particular model.

At the sample level, we study the random variable

sup_{θ ∈ B(r; θ*)} ‖∇Q_n(θ | θ) − ∇Q(θ | θ)‖_2

over the ball B(r; θ*) with r = ‖θ*‖_2 / 4.
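This supremum can be approximated by Monte Carlo in a one-dimensional instance (an illustrative sketch; the grid, seed and sample sizes are arbitrary choices): stand in for the population gradient with a very large auxiliary sample, and take the maximum deviation over a grid of θ values in the ball.

```python
import math
import random

random.seed(2)

theta_star, sigma = 2.0, 1.0   # 1-D instance for simplicity
r = theta_star / 4             # ball radius from the GMM analysis

def sample(m):
    return [random.choice([-1, 1]) * theta_star + random.gauss(0, sigma)
            for _ in range(m)]

def grad_surrogate(theta, ys):
    """(1/m) sum_i (2 w_theta(y_i) - 1) y_i - theta: the gradient of the
    EM surrogate Q(. | theta) evaluated at theta itself."""
    s = sum((2.0 / (1.0 + math.exp(-2.0 * theta * y / sigma ** 2)) - 1.0) * y
            for y in ys)
    return s / len(ys) - theta

pop = sample(200_000)   # large sample standing in for the population gradient
emp = sample(2_000)     # the actual finite sample of interest

# Approximate the sup over B(r; theta_star) by a grid of candidate thetas
grid = [theta_star - r + 2 * r * k / 20 for k in range(21)]
dev = max(abs(grad_surrogate(t, emp) - grad_surrogate(t, pop)) for t in grid)
print(dev)  # an estimate of the empirical-process supremum
```

Re-running with larger n shrinks this deviation roughly like 1/√n, which is the behavior the quantity ε_n(δ) captures.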


Corollary 4 (Sample-based result for first-order EM guarantees for GMM)

» Let η = ‖θ*‖_2 / σ ≥ ρ for a sufficiently large constant ρ;

» Choose the radius r = ‖θ*‖_2 / 4;

» Choose the step size α = 1;

» Suppose the sample size n is lower bounded as n ≥ c d log(1/δ) for a sufficiently large constant c.


Corollary 4 (Sample-based result for first-order EM guarantees for GMM) (continued)

Then, there is a contraction coefficient κ ≤ exp(−c η²) < 1, where c > 0 is a

universal constant, such that, for any initialization θ^0 ∈ B(‖θ*‖_2 / 4; θ*),

the first-order EM iterates satisfy the bound

‖θ^t − θ*‖_2 ≤ κ^t ‖θ^0 − θ*‖_2 + (c' / (1 − κ)) (‖θ*‖_2 + σ) √(d log(1/δ) / n)

with probability at least 1 − δ.


Summary

» This paper advances our theoretical understanding of the EM algorithm.

» This paper concentrates on how to obtain a near-optimal estimate of θ* using the EM algorithm.

» With the help of optimization theory, this paper establishes the size of the region of attraction in which the initialization should be chosen, as well as the rate of convergence of the EM algorithm.

» This paper also develops techniques that can be used to analyze other iterative algorithms for non-convex problems.

» What's next...
