
Page 1: Advanced Computational Econometrics: Machine Learning

Advanced Computational Econometrics: Machine Learning

Chapter 6: Gaussian Mixture models

Thierry Denœux

July-August 2019

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 1 / 76

Page 2: Advanced Computational Econometrics: Machine Learning

Introduction

Overview

1 Introduction
    Gaussian Mixture Model
    Supervised vs. unsupervised learning
    Maximum likelihood estimation
    Reminder on the EM algorithm

2 Parameter estimation in GMMs
    Unsupervised learning
    Semi-supervised learning
    Mixture Discriminant Analysis

3 Regression models
    Mixture of regressions
    Mixture of experts

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 2 / 76


Page 4: Advanced Computational Econometrics: Machine Learning

Introduction Gaussian Mixture Model

Return to LDA and QDA

In LDA and QDA, we assume that the conditional density of X given Y = k is multivariate Gaussian:

\[
\phi_k(x; \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}}
\exp\!\left( -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right)
\]

(with Σk = Σ in the case of LDA).

The marginal density of X is then a mixture of c Gaussian densities:

\[
p(x) = \sum_{k=1}^c p(x \mid Y = k)\, P(Y = k) = \sum_{k=1}^c \pi_k \phi_k(x; \mu_k, \Sigma_k)
\]

This is called a Gaussian Mixture Model (GMM).

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 4 / 76

Page 5: Advanced Computational Econometrics: Machine Learning

Introduction Gaussian Mixture Model

Gaussian Mixture Models

GMMs are widely used in Machine Learning for
density estimation,
clustering (finding groups in data),
classification (modeling complex-shaped class distributions),
regression (accounting for different linear relations within subgroups of a population),
etc.

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 5 / 76

Page 6: Advanced Computational Econometrics: Machine Learning

Introduction Gaussian Mixture Model

Example with p = 1

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 6 / 76

Page 7: Advanced Computational Econometrics: Machine Learning

Introduction Gaussian Mixture Model

Example with p = 2

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 7 / 76

Page 8: Advanced Computational Econometrics: Machine Learning

Introduction Gaussian Mixture Model

How to generate data from a mixture?

Assume $X \sim \sum_{k=1}^c \pi_k \mathcal{N}(\mu_k, \Sigma_k)$.

How to generate X?
1 Generate $Y \in \{1, \dots, c\}$ with probabilities $\pi_1, \dots, \pi_c$.
2 If $Y = k$, generate $X$ from $p(x \mid Y = k) = \phi_k(x; \mu_k, \Sigma_k)$.

Remark: we can define mixtures of other distributions. In this chapter, we will focus (without loss of generality) on mixtures of normal distributions.

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 8 / 76
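For concreteness, a minimal R sketch of this two-step sampling scheme; the mixture parameters below are illustrative, not taken from the slides.

library(MASS)    # mvrnorm() for multivariate normal draws

set.seed(1)
n     <- 500
prop  <- c(0.3, 0.7)                                  # mixing proportions pi_k
mu    <- list(c(0, 0), c(4, 3))                       # component means (illustrative)
Sigma <- list(diag(2), matrix(c(1, 0.6, 0.6, 1), 2))  # component covariances (illustrative)

y <- sample(1:2, n, replace = TRUE, prob = prop)                 # step 1: draw Y
x <- t(sapply(y, function(k) mvrnorm(1, mu[[k]], Sigma[[k]])))   # step 2: draw X given Y = k
plot(x, col = y, pch = 19)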


Page 10: Advanced Computational Econometrics: Machine Learning

Introduction Supervised vs. unsupervised learning

Supervised learning

In discriminant analysis, we observe both the input vector X and the response (class label) Y for n individuals taken randomly from the population.
The learning set has the form $\mathcal{L}_s = \{(x_i, y_i)\}_{i=1}^n$.
Learning a classifier from such data is called supervised learning.

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 10 / 76

Page 11: Advanced Computational Econometrics: Machine Learning

Introduction Supervised vs. unsupervised learning

Unsupervised learning

In some situations, we observe X, but Y is not observed. We say that Y is a latent variable.
The learning set has the form $\mathcal{L}_{ns} = \{x_i\}_{i=1}^n$.
Estimating the model parameters from such data is called unsupervised learning.
Applications: density estimation, clustering, feature extraction.
Unsupervised learning is usually more difficult than supervised learning, because we have less information to estimate the parameters.

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 11 / 76

Page 12: Advanced Computational Econometrics: Machine Learning

Introduction Supervised vs. unsupervised learning

Supervised vs. unsupervised learning

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 12 / 76

Page 13: Advanced Computational Econometrics: Machine Learning

Introduction Supervised vs. unsupervised learning

Semi-supervised learning

Sometimes, we collect a lot of data, but we can label only a part of them.
Examples: image data from the web, or from sensors on a robot.
The data then have the form $\mathcal{L}_{ss} = \{(x_i, y_i)\}_{i=1}^{n_s} \cup \{x_i\}_{i=n_s+1}^{n}$.
This is called a semi-supervised learning problem.
Semi-supervised learning is intermediate between supervised and unsupervised learning.

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 13 / 76


Page 15: Advanced Computational Econometrics: Machine Learning

Introduction Maximum likelihood estimation

Maximum likelihood: supervised case

In the case of supervised learning of GMMs, the maximum likelihood estimates (MLE) of µk, Σk and πk have simple closed-form expressions.

The likelihood function is

\[
L(\theta; \mathcal{L}_s) = \prod_{i=1}^n p(x_i, y_i) = \prod_{i=1}^n p(x_i \mid Y_i = y_i)\, p(Y_i = y_i)
= \prod_{i=1}^n \prod_{k=1}^c \phi(x_i; \mu_k, \Sigma_k)^{y_{ik}} \pi_k^{y_{ik}}
\]

The log-likelihood function is

\[
\ell(\theta; \mathcal{L}_s) = \sum_{k=1}^c \left\{ \sum_{i=1}^n y_{ik} \log \phi(x_i; \mu_k, \Sigma_k) \right\}
+ \sum_{i=1}^n \sum_{k=1}^c y_{ik} \log \pi_k
\]

The parameters µk, Σk can be estimated separately using the data from class k.

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 15 / 76

Page 16: Advanced Computational Econometrics: Machine Learning

Introduction Maximum likelihood estimation

MLE in the supervised case

We have

\[
\sum_{i=1}^n y_{ik} \log \phi(x_i; \mu_k, \Sigma_k)
= -\frac{1}{2} \sum_{i=1}^n y_{ik} (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)
- \frac{n_k}{2} \log |\Sigma_k| - \frac{n_k p}{2} \log(2\pi)
\]

The derivative w.r.t. µk is $\sum_i y_{ik} \Sigma_k^{-1} (x_i - \mu_k)$. Hence,

\[
\hat{\mu}_k = \frac{1}{n_k} \sum_{i=1}^n y_{ik} x_i
\]

It can be shown that

\[
\hat{\Sigma}_k = \frac{1}{n_k} \sum_{i=1}^n y_{ik} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T
\]

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 16 / 76

Page 17: Advanced Computational Econometrics: Machine Learning

Introduction Maximum likelihood estimation

MLE in the supervised case (continued)

To find the MLE of the πk, we maximize

\[
\sum_{i=1}^n \sum_{k=1}^c y_{ik} \log \pi_k
\]

w.r.t. πk, subject to the constraint $\sum_{k=1}^c \pi_k = 1$. The solution is

\[
\hat{\pi}_k = \frac{n_k}{n}, \quad k = 1, \dots, c
\]

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 17 / 76
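A minimal R sketch of these closed-form estimates, computed class by class on the iris data; the helper code and variable names are illustrative.

data(iris)
X <- as.matrix(iris[, 1:4])
y <- iris$Species

mle_by_class <- lapply(split(as.data.frame(X), y), function(Xk) {
  Xk <- as.matrix(Xk)
  nk <- nrow(Xk)
  mu <- colMeans(Xk)
  list(pi    = nk / nrow(X),                      # pi_hat_k = n_k / n
       mu    = mu,                                # mu_hat_k
       Sigma = crossprod(sweep(Xk, 2, mu)) / nk)  # Sigma_hat_k (divide by n_k, not n_k - 1)
})
str(mle_by_class$setosa)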

Page 18: Advanced Computational Econometrics: Machine Learning

Introduction Maximum likelihood estimation

Maximum likelihood: unsupervised case

In the case of unsupervised learning, the log-likelihood function is

\[
\ell(\theta; \mathcal{L}_{ns}) = \sum_{i=1}^n \log p(x_i)
= \sum_{i=1}^n \log\left( \sum_{k=1}^c \pi_k \phi_k(x_i; \mu_k, \Sigma_k) \right)
\]

We can no longer separate the terms corresponding to each class.
Maximizing the log-likelihood becomes a difficult nonlinear optimization problem, for which no closed-form solution exists.
A powerful method: the Expectation-Maximization (EM) algorithm.

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 18 / 76


Page 20: Advanced Computational Econometrics: Machine Learning

Introduction Reminder on the EM algorithm

Notation

X : observed variables
Y : missing or latent variables
Z : complete data, Z = (X, Y)
θ : unknown parameter
L(θ) : observed-data likelihood, short for L(θ; x) = p(x; θ)
Lc(θ) : complete-data likelihood, short for L(θ; z) = p(z; θ)
ℓ(θ), ℓc(θ) : observed- and complete-data log-likelihoods

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 20 / 76

Page 21: Advanced Computational Econometrics: Machine Learning

Introduction Reminder on the EM algorithm

Notation (continued)

Suppose we seek to maximize L(θ) with respect to θ.
Define Q(θ, θ(t)) to be the expectation of the complete-data log-likelihood, conditional on the observed data X = x. Namely,

\[
Q(\theta, \theta^{(t)}) = E_{\theta^{(t)}}\{\ell_c(\theta) \mid x\}
= E_{\theta^{(t)}}\{\log p(Z; \theta) \mid x\}
= \int \left[ \log p(z; \theta) \right] p(y \mid x; \theta^{(t)})\, dy,
\]

where the last equation emphasizes that Y is the only random part of Z once we are given X = x.

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 21 / 76

Page 22: Advanced Computational Econometrics: Machine Learning

Introduction Reminder on the EM algorithm

The EM Algorithm (reminder)

Start with θ(0) and set t = 0. Then
1 E step: compute Q(θ, θ(t)).
2 M step: maximize Q(θ, θ(t)) with respect to θ. Set θ(t+1) equal to the maximizer of Q.
3 Return to the E step and increment t, unless a stopping criterion has been met, e.g.,

\[
|\ell(\theta^{(t+1)}) - \ell(\theta^{(t)})| \le \epsilon
\]

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 22 / 76
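A generic skeleton of this loop in R, with e_step(), m_step() and loglik() as placeholders for the model-specific computations; it only illustrates the structure and the stopping rule above.

em <- function(theta0, e_step, m_step, loglik, eps = 1e-8, max_iter = 1000) {
  theta <- theta0
  ll    <- loglik(theta)
  for (t in 1:max_iter) {
    stats  <- e_step(theta)        # E step: expectations needed to form Q(., theta^(t))
    theta  <- m_step(stats)        # M step: maximizer of Q
    ll_new <- loglik(theta)
    if (abs(ll_new - ll) <= eps) break   # stopping rule on the observed-data log-likelihood
    ll <- ll_new
  }
  theta
}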



Page 25: Advanced Computational Econometrics: Machine Learning

Parameter estimation in GMMs Unsupervised learning

Old Faithful geyser data

(Figure: scatter plot of waiting vs. eruptions.)

Waiting time between eruptions and duration of the eruption (in min) for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA (272 observations).

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 25 / 76

Page 26: Advanced Computational Econometrics: Machine Learning

Parameter estimation in GMMs Unsupervised learning

General GMM

Let $X = (X_1, \dots, X_n)$ be an i.i.d. sample from a mixture of c multivariate normal distributions $\mathcal{N}(\mu_k, \Sigma_k)$ with proportions πk. The pdf of Xi is

\[
p(x_i; \theta) = \sum_{k=1}^c \pi_k \phi(x_i; \mu_k, \Sigma_k),
\]

where θ is the vector of parameters.

We introduce latent variables $Y = (Y_1, \dots, Y_n)$ such that

\[
Y_i \sim \mathcal{M}(1, \pi_1, \dots, \pi_c), \qquad
p(x_i \mid Y_i = k) = \phi(x_i; \mu_k, \Sigma_k), \quad k = 1, \dots, c.
\]

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 26 / 76

Page 27: Advanced Computational Econometrics: Machine Learning

Parameter estimation in GMMs Unsupervised learning

Observed and complete-data likelihoods

Observed-data likelihood:

\[
L(\theta) = \prod_{i=1}^n p(x_i; \theta) = \prod_{i=1}^n \sum_{k=1}^c \pi_k \phi(x_i; \mu_k, \Sigma_k)
\]

Complete-data likelihood:

\[
L_c(\theta) = \prod_{i=1}^n p(x_i, y_i; \theta) = \prod_{i=1}^n p(x_i \mid y_i; \theta)\, p(y_i \mid \pi)
= \prod_{i=1}^n \prod_{k=1}^c \phi(x_i; \mu_k, \Sigma_k)^{y_{ik}} \pi_k^{y_{ik}},
\]

with $y_{ik} = I(y_i = k)$.

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 27 / 76

Page 28: Advanced Computational Econometrics: Machine Learning

Parameter estimation in GMMs Unsupervised learning

Derivation of function Q

Complete-data log-likelihood:

\[
\ell_c(\theta) = \sum_{i=1}^n \sum_{k=1}^c y_{ik} \log \phi(x_i; \mu_k, \Sigma_k)
+ \sum_{i=1}^n \sum_{k=1}^c y_{ik} \log \pi_k
\]

It is linear in the yik. Consequently, the Q function is simply

\[
Q(\theta, \theta^{(t)}) = \sum_{i=1}^n \sum_{k=1}^c y_{ik}^{(t)} \log \phi(x_i; \mu_k, \Sigma_k)
+ \sum_{i=1}^n \sum_{k=1}^c y_{ik}^{(t)} \log \pi_k
\]

with $y_{ik}^{(t)} = E_{\theta^{(t)}}[Y_{ik} \mid x_i] = P_{\theta^{(t)}}[Y_i = k \mid x_i]$.

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 28 / 76

Page 29: Advanced Computational Econometrics: Machine Learning

Parameter estimation in GMMs Unsupervised learning

EM algorithm

E-step: compute

\[
y_{ik}^{(t)} = P_{\theta^{(t)}}[Y_i = k \mid x_i]
= \frac{\phi(x_i; \mu_k^{(t)}, \Sigma_k^{(t)})\, \pi_k^{(t)}}
       {\sum_{\ell=1}^c \phi(x_i; \mu_\ell^{(t)}, \Sigma_\ell^{(t)})\, \pi_\ell^{(t)}}
\]

M-step: maximize Q(θ, θ(t)). We get

\[
\pi_k^{(t+1)} = \frac{n_k^{(t)}}{n}, \qquad
\mu_k^{(t+1)} = \frac{1}{n_k^{(t)}} \sum_{i=1}^n y_{ik}^{(t)} x_i, \qquad
\Sigma_k^{(t+1)} = \frac{1}{n_k^{(t)}} \sum_{i=1}^n y_{ik}^{(t)} (x_i - \mu_k^{(t+1)})(x_i - \mu_k^{(t+1)})^T,
\]

with $n_k^{(t)} = \sum_{i=1}^n y_{ik}^{(t)}$.

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 29 / 76
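A minimal R sketch of these E and M updates on the Old Faithful data, assuming fully general covariance matrices; the initialization and stopping rule are illustrative choices, and the mclust call on the next slide is the recommended way to fit the model in practice.

library(mvtnorm)                      # dmvnorm() for the multivariate normal density
data(faithful)
X <- as.matrix(faithful); n <- nrow(X)
ncomp <- 2                            # number of components c

set.seed(1)                           # crude initialization from a random partition
z   <- sample(1:ncomp, n, replace = TRUE)
pik <- as.numeric(table(z)) / n
mu    <- lapply(1:ncomp, function(k) colMeans(X[z == k, , drop = FALSE]))
Sigma <- lapply(1:ncomp, function(k) cov(X[z == k, , drop = FALSE]))

loglik <- -Inf
repeat {
  ## E-step: posterior probabilities y_ik
  dens <- sapply(1:ncomp, function(k) pik[k] * dmvnorm(X, mu[[k]], Sigma[[k]]))
  y    <- dens / rowSums(dens)
  ## M-step: update proportions, means and covariances
  nk  <- colSums(y)
  pik <- nk / n
  mu  <- lapply(1:ncomp, function(k) colSums(y[, k] * X) / nk[k])
  Sigma <- lapply(1:ncomp, function(k) {
    Xc <- sweep(X, 2, mu[[k]])
    crossprod(Xc * y[, k], Xc) / nk[k]          # weighted scatter matrix
  })
  ## stopping rule on the observed-data log-likelihood
  newll <- sum(log(rowSums(dens)))
  if (abs(newll - loglik) < 1e-8) break
  loglik <- newll
}
round(pik, 3)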

Page 30: Advanced Computational Econometrics: Machine Learning

Parameter estimation in GMMs Unsupervised learning

GMM with the package mclust

library(mclust)
data(faithful)

faithfulMclust <- Mclust(faithful, G = 2, modelNames = "VVV")
plot(faithfulMclust)

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 30 / 76

Page 31: Advanced Computational Econometrics: Machine Learning

Parameter estimation in GMMs Unsupervised learning

Result

(Figure: classification of the Old Faithful data into two clusters; eruptions on the x-axis, waiting on the y-axis.)

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 31 / 76

Page 32: Advanced Computational Econometrics: Machine Learning

Parameter estimation in GMMs Unsupervised learning

Choosing the number of clusters

In clustering, selecting the number of clusters is often a difficult problem.
This is a model selection problem. We can use the BIC criterion.
(Reminder: BIC = −2ℓ + d log(N), where ℓ is the maximized log-likelihood, d the number of free parameters and N the sample size.)

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 32 / 76
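A hedged sketch of such a BIC comparison with mclust; mclustBIC() is used here to compute the criterion over a range of G, which is one way (among others) to produce the curve shown on the next slide.

library(mclust)
data(faithful)
bicVVV <- mclustBIC(faithful, G = 1:9, modelNames = "VVV")   # BIC for 1 to 9 components
summary(bicVVV)                                              # best models according to BIC
plot(bicVVV)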

Page 33: Advanced Computational Econometrics: Machine Learning

Parameter estimation in GMMs Unsupervised learning

Choosing the number of clusters

plot(faithfulMclust)

(Figure: BIC of the VVV model as a function of the number of components, from 1 to 9.)

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 33 / 76

Page 34: Advanced Computational Econometrics: Machine Learning

Parameter estimation in GMMs Unsupervised learning

Reducing the number of parameters

The general model has $c[p + p(p+1)/2 + 1] - 1$ parameters.
When n is small and/or p is large, we need more parsimonious models (i.e., models with fewer parameters).
Simple approaches:
assume equal covariance matrices (homoscedasticity);
assume the covariance matrices to be diagonal, or scalar.
More flexible approach: use the eigendecomposition of the matrix Σk.

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 34 / 76

Page 35: Advanced Computational Econometrics: Machine Learning

Parameter estimation in GMMs Unsupervised learning

Eigendecomposition Σk

As matrix Σk is symmetric, we can write

\[
\Sigma_k = D_k \Lambda_k D_k^T = \lambda_k D_k A_k D_k^T,
\]

where
$\Lambda_k = \mathrm{diag}(\lambda_{k1}, \dots, \lambda_{kp})$ is a diagonal matrix whose components are the decreasing eigenvalues of Σk;
$D_k$ is an orthogonal matrix ($D_k D_k^T = I$) whose columns are the normalized eigenvectors of Σk;
$A_k$ is a diagonal matrix with $|A_k| = 1$, whose decreasing diagonal values are proportional to the eigenvalues of Σk;
$\lambda_k = \left( \prod_{j=1}^p \lambda_{kj} \right)^{1/p} = |\Sigma_k|^{1/p}$.

Interpretation:
Ak describes the shape of the cluster;
Dk (a rotation matrix) describes its orientation;
λk describes its volume.

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 35 / 76
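A minimal R sketch of this decomposition for a single covariance matrix, using eigen(); the matrix is illustrative.

Sigma <- matrix(c(2, 1.2, 1.2, 1), 2, 2)        # illustrative covariance matrix
e      <- eigen(Sigma)
Dk     <- e$vectors                             # orientation: normalized eigenvectors
lambda <- prod(e$values)^(1 / ncol(Sigma))      # volume: |Sigma|^(1/p)
Ak     <- diag(e$values / lambda)               # shape: diagonal matrix with det(Ak) = 1
all.equal(Sigma, lambda * Dk %*% Ak %*% t(Dk))  # recovers Sigma = lambda * Dk Ak Dk'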

Page 36: Advanced Computational Econometrics: Machine Learning

Parameter estimation in GMMs Unsupervised learning

Example in R2

\[
D = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix},
\qquad
A = \begin{pmatrix} a & 0 \\ 0 & 1/a \end{pmatrix}
\]

D: rotation matrix with angle θ; A: diagonal matrix with diagonal terms a and 1/a.

(Figure: cluster ellipse with center µ, rotated by angle θ.)

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 36 / 76

Page 37: Advanced Computational Econometrics: Machine Learning

Parameter estimation in GMMs Unsupervised learning

Parsimonious models

With this parametrization, the parameters of the GMM are: the centers, volumes, shapes, orientations and proportions.
28 different models:
spherical, diagonal, or arbitrary covariance matrices;
volumes equal or not;
shapes equal or not;
directions equal or not;
proportions equal or not.

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 37 / 76

Page 38: Advanced Computational Econometrics: Machine Learning

Parameter estimation in GMMs Unsupervised learning

The 14 models based on assumptions on variance matrices

Spherical: [λI], [λ_k I]
Diagonal: [λB], [λ_k B], [λB_k], [λ_k B_k]
General: [λDAD'], [λ_k DAD'], [λDA_k D'], [λ_k DA_k D'], [λD_k AD_k'], [λ_k D_k AD_k'], [λD_k A_k D_k'], [λ_k D_k A_k D_k']

(A subscript k on a factor means that this factor varies from one component to another.)

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 38 / 76

Page 39: Advanced Computational Econometrics: Machine Learning

Parameter estimation in GMMs Unsupervised learning

Parsimonious models in mclust

faithfulMclust <- Mclust(faithful)
plot(faithfulMclust)

(Figure: BIC as a function of the number of components, from 1 to 9, for the 14 covariance models EII, VII, EEI, VEI, EVI, VVI, EEE, EVE, VEE, VVE, EEV, VEV, EVV, VVV.)

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 39 / 76

Page 40: Advanced Computational Econometrics: Machine Learning

Parameter estimation in GMMs Unsupervised learning

Best model

Best model: EEE or $\lambda DAD^T$ (ellipsoidal, equal volume, shape and orientation) model with 3 components.

(Figure: classification of the Old Faithful data into 3 clusters; eruptions on the x-axis, waiting on the y-axis.)

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 40 / 76


Page 42: Advanced Computational Econometrics: Machine Learning

Parameter estimation in GMMs Semi-supervised learning

Semi-supervised learning

In semi-supervised learning, the data have the form $\mathcal{L}_{ss} = \{(x_i, y_i)\}_{i=1}^{n_s} \cup \{x_i\}_{i=n_s+1}^{n}$.

Observed-data likelihood:

\[
L(\theta) = \prod_{i=1}^{n_s} p(x_i, y_i; \theta) \prod_{i=n_s+1}^{n} p(x_i; \theta)
= \left( \prod_{i=1}^{n_s} \prod_{k=1}^c \phi(x_i; \mu_k, \Sigma_k)^{y_{ik}} \pi_k^{y_{ik}} \right)
  \left( \prod_{i=n_s+1}^{n} \sum_{k=1}^c \pi_k \phi(x_i; \mu_k, \Sigma_k) \right)
\]

Complete-data likelihood: same as in the unsupervised case,

\[
L_c(\theta) = \prod_{i=1}^n \prod_{k=1}^c \phi(x_i; \mu_k, \Sigma_k)^{y_{ik}} \pi_k^{y_{ik}},
\]

with $y_{ik} = 1$ if $y_i = k$ and $y_{ik} = 0$ otherwise.

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 42 / 76

Page 43: Advanced Computational Econometrics: Machine Learning

Parameter estimation in GMMs Semi-supervised learning

EM algorithm

E-step: compute

\[
y_{ik}^{(t)} =
\begin{cases}
y_{ik}, & i = 1, \dots, n_s \ \text{(fixed)} \\[6pt]
\dfrac{\phi(x_i; \mu_k^{(t)}, \Sigma_k^{(t)})\, \pi_k^{(t)}}
      {\sum_{\ell=1}^c \phi(x_i; \mu_\ell^{(t)}, \Sigma_\ell^{(t)})\, \pi_\ell^{(t)}}, & i = n_s + 1, \dots, n
\end{cases}
\]

M-step: same as in the unsupervised case,

\[
\pi_k^{(t+1)} = \frac{n_k^{(t)}}{n}, \qquad
\mu_k^{(t+1)} = \frac{1}{n_k^{(t)}} \sum_{i=1}^n y_{ik}^{(t)} x_i, \qquad
\Sigma_k^{(t+1)} = \frac{1}{n_k^{(t)}} \sum_{i=1}^n y_{ik}^{(t)} (x_i - \mu_k^{(t+1)})(x_i - \mu_k^{(t+1)})^T,
\]

with $n_k^{(t)} = \sum_{i=1}^n y_{ik}^{(t)}$.

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 43 / 76
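A minimal R sketch of this E-step, assuming dens is the n x c matrix of pi_k * phi(x_i; mu_k, Sigma_k) values as in the earlier unsupervised sketch, and y_lab holds the class labels (1, ..., c) of the first ns observations; labeled rows are simply clamped to their 0/1 indicators.

estep_semisup <- function(dens, y_lab, ns) {
  y <- dens / rowSums(dens)              # posterior probabilities for all observations
  y[1:ns, ] <- 0                         # labeled part: overwrite the posteriors ...
  y[cbind(1:ns, y_lab[1:ns])] <- 1       # ... with the fixed 0/1 class indicators
  y
}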

Page 44: Advanced Computational Econometrics: Machine Learning

Parameter estimation in GMMs Semi-supervised learning

Package upclass

library(upclass)

data(iris)
X <- as.matrix(iris[, -5])
cl <- as.matrix(iris[, 5])
indtrain <- sort(sample(1:150, 110))
Xtrain <- X[indtrain, ]
cltrain <- cl[indtrain]
indtest <- setdiff(1:150, indtrain)
Xtest <- X[indtest, ]
models <- c("EII", "VII", "VEI", "EVI")

fitupmodels <- upclassify(Xtrain, cltrain, Xtest, modelscope = models)
fitupmodels$Best$modelName  # What is the best model?

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 44 / 76


Page 46: Advanced Computational Econometrics: Machine Learning

Parameter estimation in GMMs Mixture Discriminant Analysis

Mixture Discriminant Analysis

GMMs can also be useful in supervised classification.
Here, we model the distribution of X in each class by a GMM:

\[
p(x \mid Y = k) = \sum_{r=1}^{R_k} \pi_{kr} \phi(x; \mu_{kr}, \Sigma_{kr}),
\]

with $\sum_{r=1}^{R_k} \pi_{kr} = \pi_k$.

This method is called Mixture Discriminant Analysis (MDA). It extends LDA.
By varying the number of components in each mixture, we can handle classes of any shape, and obtain arbitrarily complex nonlinear decision boundaries.
We may impose $\Sigma_{kr} = \Sigma$, or any other parsimonious model, to control the complexity of the model.

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 46 / 76

Page 47: Advanced Computational Econometrics: Machine Learning

Parameter estimation in GMMs Mixture Discriminant Analysis

Observed-data likelihood

Observed-data likelihood:

\[
L(\theta) = \prod_{i=1}^n p(x_i, y_i; \theta) = \prod_{i=1}^n p(x_i \mid y_i; \theta)\, p(y_i; \theta)
= \prod_{i=1}^n \prod_{k=1}^c \left( \sum_{r=1}^{R_k} \pi_{kr} \phi(x_i; \mu_{kr}, \Sigma_{kr}) \right)^{y_{ik}} \pi_k^{y_{ik}}
\]

Again, the EM algorithm can be used to estimate the model parameters, separately in each class (see ESL page 449 for details).

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 47 / 76

Page 48: Advanced Computational Econometrics: Machine Learning

Parameter estimation in GMMs Mixture Discriminant Analysis

MDA using package mclust

odd <- seq(from = 1, to = nrow(iris), by = 2)
even <- odd + 1
X.train <- iris[odd, -5]
Class.train <- iris[odd, 5]
X.test <- iris[even, -5]
Class.test <- iris[even, 5]

# general covariance structure selected by BIC
irisMclustDA <- MclustDA(X.train, Class.train)
summary(irisMclustDA, newdata = X.test, newclass = Class.test)

plot(irisMclustDA)

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 48 / 76

Page 49: Advanced Computational Econometrics: Machine Learning

Parameter estimation in GMMs Mixture Discriminant Analysis

Result

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 49 / 76

Page 50: Advanced Computational Econometrics: Machine Learning

Parameter estimation in GMMs Mixture Discriminant Analysis

Result

(Figure: matrix of pairwise scatter plots of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, showing the MclustDA classification.)

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 50 / 76



Page 53: Advanced Computational Econometrics: Machine Learning

Regression models Mixture of regressions

Introductory example

(Figure: 1996 GNP and Emissions Data: CO2 per capita vs. GNP per capita, with points labeled by country code: CAN, MEX, USA, JAP, KOR, AUS, NZ, OST, BEL, CZ, DNK, FIN, FRA, DEU, GRC, HUN, EIRE, ITL, HOL, NOR, POL, POR, ESP, SW, CH, TUR, UK, RUS.)

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 53 / 76

Page 54: Advanced Computational Econometrics: Machine Learning

Regression models Mixture of regressions

Introductory example (continued)

The data in the previous slide do not show any clear linear trend.
However, there seem to be several groups for which a linear model would be a reasonable approximation.
How to identify those groups and the corresponding linear models?

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 54 / 76

Page 55: Advanced Computational Econometrics: Machine Learning

Regression models Mixture of regressions

Formalization

We assume that the response variable Y depends on the input variable X in different ways, depending on a latent variable Z. (Beware: we have switched back to the classical notation for regression models!)
This model is called mixture of regressions or switching regressions. It has been widely studied in the econometrics literature.

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 55 / 76

Page 56: Advanced Computational Econometrics: Machine Learning

Regression models Mixture of regressions

Model

Model:

\[
Y =
\begin{cases}
\beta_1^T X + \varepsilon_1, & \varepsilon_1 \sim \mathcal{N}(0, \sigma_1) \ \text{if } Z = 1, \\
\quad \vdots \\
\beta_c^T X + \varepsilon_c, & \varepsilon_c \sim \mathcal{N}(0, \sigma_c) \ \text{if } Z = c,
\end{cases}
\]

with $X = (1, X_1, \dots, X_p)$, and

\[
P(Z = k) = \pi_k, \quad k = 1, \dots, c.
\]

So,

\[
p(y \mid X = x) = \sum_{k=1}^c \pi_k \phi(y; \beta_k^T x, \sigma_k)
\]

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 56 / 76
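To make the generative model concrete, a minimal R sketch that simulates from a two-component mixture of regressions; the coefficients are illustrative.

set.seed(2)
n    <- 200
prop <- c(0.4, 0.6)                       # P(Z = k)
beta <- cbind(c(1, 0.5), c(8, -0.3))      # columns: (intercept, slope) for each component
sig  <- c(0.5, 1)                         # residual standard deviations

x <- runif(n, 0, 10)
z <- sample(1:2, n, replace = TRUE, prob = prop)
y <- beta[1, z] + beta[2, z] * x + rnorm(n, 0, sig[z])
plot(x, y, col = z)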

Page 57: Advanced Computational Econometrics: Machine Learning

Regression models Mixture of regressions

Observed and complete-data likelihoods

Observed-data likelihood:

\[
L(\theta) = \prod_{i=1}^n p(y_i; \theta) = \prod_{i=1}^n \sum_{k=1}^c \pi_k \phi(y_i; \beta_k^T x_i, \sigma_k)
\]

Complete-data likelihood:

\[
L_c(\theta) = \prod_{i=1}^n p(y_i, z_i; \theta) = \prod_{i=1}^n p(y_i \mid z_i; \theta)\, p(z_i \mid \pi)
= \prod_{i=1}^n \prod_{k=1}^c \phi(y_i; \beta_k^T x_i, \sigma_k)^{z_{ik}} \pi_k^{z_{ik}},
\]

with $z_{ik} = 1$ if $z_i = k$ and $z_{ik} = 0$ otherwise.

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 57 / 76

Page 58: Advanced Computational Econometrics: Machine Learning

Regression models Mixture of regressions

Derivation of function Q

Complete-data log-likelihood:

\[
\ell_c(\theta) = \sum_{i=1}^n \sum_{k=1}^c z_{ik} \log \phi(y_i; \beta_k^T x_i, \sigma_k)
+ \sum_{i=1}^n \sum_{k=1}^c z_{ik} \log \pi_k
\]

It is linear in the zik. Consequently, the Q function is simply

\[
Q(\theta, \theta^{(t)}) = \sum_{i=1}^n \sum_{k=1}^c z_{ik}^{(t)} \log \phi(y_i; \beta_k^T x_i, \sigma_k)
+ \sum_{i=1}^n \sum_{k=1}^c z_{ik}^{(t)} \log \pi_k
\]

with $z_{ik}^{(t)} = E_{\theta^{(t)}}[Z_{ik} \mid y_i] = P_{\theta^{(t)}}[Z_i = k \mid y_i]$.

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 58 / 76

Page 59: Advanced Computational Econometrics: Machine Learning

Regression models Mixture of regressions

EM algorithm

E-step: compute

\[
z_{ik}^{(t)} = P_{\theta^{(t)}}[Z_i = k \mid y_i]
= \frac{\phi(y_i; \beta_k^{(t)T} x_i, \sigma_k^{(t)})\, \pi_k^{(t)}}
       {\sum_{\ell=1}^c \phi(y_i; \beta_\ell^{(t)T} x_i, \sigma_\ell^{(t)})\, \pi_\ell^{(t)}}
\]

M-step: maximize Q(θ, θ(t)). As before, we get

\[
\pi_k^{(t+1)} = \frac{n_k^{(t)}}{n},
\]

with $n_k^{(t)} = \sum_{i=1}^n z_{ik}^{(t)}$.

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 59 / 76

Page 60: Advanced Computational Econometrics: Machine Learning

Regression models Mixture of regressions

M-step: update of the βk and σk

In Q(θ, θ(t)), the term depending on βk is

\[
SS_k = \sum_{i=1}^n z_{ik}^{(t)} (y_i - \beta_k^T x_i)^2.
\]

Minimizing SSk w.r.t. βk is a weighted least-squares (WLS) problem. In matrix form,

\[
SS_k = (y - X\beta_k)^T W_k (y - X\beta_k),
\]

where $W_k = \mathrm{diag}(z_{1k}^{(t)}, \dots, z_{nk}^{(t)})$ is a diagonal matrix of size n.

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 60 / 76

Page 61: Advanced Computational Econometrics: Machine Learning

Regression models Mixture of regressions

M-step: update of the βk and σk (continued)

The solution is the WLS estimate of βk:

\[
\beta_k^{(t+1)} = (X^T W_k X)^{-1} X^T W_k y
\]

The value of σk minimizing Q(θ, θ(t)) is given by the weighted average of the squared residuals,

\[
\sigma_k^{2\,(t+1)} = \frac{1}{n_k^{(t)}} \sum_{i=1}^n z_{ik}^{(t)} (y_i - \beta_k^{(t+1)T} x_i)^2
= \frac{1}{n_k^{(t)}} (y - X\beta_k^{(t+1)})^T W_k (y - X\beta_k^{(t+1)})
\]

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 61 / 76
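A minimal R sketch of this M-step update for one component, assuming X is the design matrix (with a leading column of ones), y the response and zk the current posteriors; in practice lm.wfit(X, y, w = zk) avoids forming the diagonal weight matrix explicitly.

mstep_component <- function(X, y, zk) {
  W      <- diag(zk)                                   # W_k = diag(z_1k, ..., z_nk)
  beta   <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)  # WLS estimate of beta_k
  resid  <- y - X %*% beta
  sigma2 <- sum(zk * resid^2) / sum(zk)                # weighted mean of squared residuals
  list(beta = drop(beta), sigma = sqrt(sigma2), pi = mean(zk))
}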

Page 62: Advanced Computational Econometrics: Machine Learning

Regression models Mixture of regressions

Mixture of regressions using mixtools

library(mixtools)
data(CO2data)
attach(CO2data)

CO2reg <- regmixEM(CO2, GNP)
summary(CO2reg)

plot(GNP, CO2, type = "n", xlab = "GNP per capita", ylab = "CO2 per capita")
ii1 <- CO2reg$posterior[, 1] > 0.5   # observations assigned to component 1
ii2 <- !ii1
text(GNP[ii1], CO2[ii1], country[ii1], col = "red")
text(GNP[ii2], CO2[ii2], country[ii2], col = "blue")
abline(CO2reg$beta[, 1], col = "red")
abline(CO2reg$beta[, 2], col = "blue")

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 62 / 76

Page 63: Advanced Computational Econometrics: Machine Learning

Regression models Mixture of regressions

Best solution in 10 runs

(Figure: CO2 per capita vs. GNP per capita, countries colored by their most probable component, with the two fitted regression lines.)

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 63 / 76

Page 64: Advanced Computational Econometrics: Machine Learning

Regression models Mixture of regressions

Increase of log-likelihood

(Figure: log-likelihood as a function of the EM iterations for this solution.)

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 64 / 76

Page 65: Advanced Computational Econometrics: Machine Learning

Regression models Mixture of regressions

Another solution (with lower log-likelihood)

(Figure: CO2 per capita vs. GNP per capita with the two regression lines of this alternative solution.)

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 65 / 76

Page 66: Advanced Computational Econometrics: Machine Learning

Regression models Mixture of regressions

Increase of log-likelihood

(Figure: log-likelihood as a function of the EM iterations for this solution.)

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 66 / 76


Page 68: Advanced Computational Econometrics: Machine Learning

Regression models Mixture of experts

Making the mixing proportions predictor-dependent

An interesting extension of the previous model is to assume theproportions πk to be partially explained by a vector of concomitantvariables W .If W = X , we can approximate the regression function by differentlinear functions in different regions of the predictor space.In ML, this method is referred to as the mixture of experts method.A useful parametric form for πk that ensures πk ≥ 0 and∑c

k=1 πk = 1 is the multinomial logit (softmax) model

πk(w , α) =exp(αT

k w)∑cl=1 exp(α

Tl w)

with α = (α1, . . . , αc) and α1 = 0.

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 68 / 76
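A minimal R sketch of this softmax gating model; the design matrix and coefficients are illustrative, with the first column of alpha fixed to 0 for identifiability.

softmax_gate <- function(W, alpha) {        # W: n x q concomitant design, alpha: q x c, alpha[, 1] = 0
  eta <- W %*% alpha                        # linear predictors
  exp(eta) / rowSums(exp(eta))              # n x c matrix of mixing proportions
}
W     <- cbind(1, seq(0, 60, by = 1))                # intercept and one concomitant variable
alpha <- cbind(c(0, 0), c(2, -0.1), c(-3, 0.12))     # illustrative coefficients, c = 3 components
pit   <- softmax_gate(W, alpha)
rowSums(pit)[1:3]                                    # each row sums to 1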

Page 69: Advanced Computational Econometrics: Machine Learning

Regression models Mixture of experts

EM algorithm

The Q function is the same as before, except that the πk now depend on the wi and the parameter α:

\[
Q(\theta, \theta^{(t)}) = \sum_{i=1}^n \sum_{k=1}^c z_{ik}^{(t)} \log \phi(y_i; \beta_k^T x_i, \sigma_k)
+ \sum_{i=1}^n \sum_{k=1}^c z_{ik}^{(t)} \log \pi_k(w_i, \alpha)
\]

In the M-step, the update formulas for βk and σk are unchanged.
The last term of Q(θ, θ(t)) can be maximized w.r.t. α using an iterative algorithm, such as the Newton-Raphson procedure. (See remark on next slide.)

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 69 / 76

Page 70: Advanced Computational Econometrics: Machine Learning

Regression models Mixture of experts

Generalized EM algorithm

To ensure convergence of EM, we only need to increase (but not necessarily maximize) Q(θ, θ(t)) at each step.
Any algorithm that chooses θ(t+1) at each iteration to guarantee the above condition (without maximizing Q(θ, θ(t))) is called a Generalized EM (GEM) algorithm.
Here, we can perform a single step of the Newton-Raphson algorithm to maximize

\[
\sum_{i=1}^n \sum_{k=1}^c z_{ik}^{(t)} \log \pi_k(w_i, \alpha)
\]

with respect to α. Backtracking can be used to ensure ascent.

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 70 / 76
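A minimal R sketch of one such GEM update for alpha, using a single gradient-ascent step with backtracking instead of the Newton-Raphson step mentioned above; Z is assumed to hold the current posteriors z_ik^(t), and softmax_gate() is the helper from the earlier sketch.

gem_alpha_step <- function(alpha, W, Z, step = 1) {
  obj  <- function(a) sum(Z * log(softmax_gate(W, a)))     # term of Q depending on alpha
  grad <- t(W) %*% (Z - softmax_gate(W, alpha))            # its gradient (multinomial logit score)
  grad[, 1] <- 0                                           # keep the constraint alpha_1 = 0
  while (obj(alpha + step * grad) < obj(alpha)) step <- step / 2   # backtracking to guarantee ascent
  alpha + step * grad
}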

Page 71: Advanced Computational Econometrics: Machine Learning

Regression models Mixture of experts

Example: motorcycle data

(Figure: motorcycle data, acceleration vs. time.)

library(MASS)
x <- mcycle$times
y <- mcycle$accel
plot(x, y)

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 71 / 76

Page 72: Advanced Computational Econometrics: Machine Learning

Regression models Mixture of experts

Mixture of experts using flexmix

library(flexmix)

K <- 5
res <- flexmix(y ~ x, k = K, model = FLXMRglm(family = "gaussian"),
               concomitant = FLXPmultinom(formula = ~ x))

beta  <- parameters(res)[1:2, ]
alpha <- res@concomitant@coef

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 72 / 76

Page 73: Advanced Computational Econometrics: Machine Learning

Regression models Mixture of experts

Plotting the posterior probabilities

xt <- seq(0, 60, 0.1)
Nt <- length(xt)
plot(x, y)
pit <- matrix(0, Nt, K)
for (k in 1:K) pit[, k] <- exp(alpha[1, k] + alpha[2, k] * xt)
pit <- pit / rowSums(pit)

plot(xt, pit[, 1], type = "l", col = 1)
for (k in 2:K) lines(xt, pit[, k], col = k)

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 73 / 76

Page 74: Advanced Computational Econometrics: Machine Learning

Regression models Mixture of experts

Posterior probabilities

(Figure: motorcycle data, posterior probabilities πk as functions of x for the K = 5 components.)

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 74 / 76

Page 75: Advanced Computational Econometrics: Machine Learning

Regression models Mixture of experts

Plotting the predictions

yhat <- rep(0, Nt)
for (k in 1:K) yhat <- yhat + pit[, k] * (beta[1, k] + beta[2, k] * xt)

plot(x, y, main = "Motorcycle data", xlab = "time", ylab = "acceleration")
for (k in 1:K) abline(beta[1:2, k], lty = 2)
lines(xt, yhat, col = "red", lwd = 2)

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 75 / 76

Page 76: Advanced Computational Econometrics: Machine Learning

Regression models Mixture of experts

Regression lines and predictions

(Figure: motorcycle data with the K component regression lines (dashed) and the mixture-of-experts prediction curve (red), acceleration vs. time.)

Thierry Denœux ACE - Gaussian Mixture models July-August 2019 76 / 76