Introduction to machine learning
Master M2MO (Modélisation Aléatoire)
Stéphane Gaïffas
Unsupervised learning
Marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records
Biology: classification of plants and animals given their features
Insurance: identifying groups of motor insurance policy holders with a high average claim cost, identifying frauds
City-planning: identifying groups of houses according to their house type, value and geographical location
Internet: document classification, clustering weblog data to discover groups of similar access patterns
Marketing
Data: a base of customer data containing their properties and past buying records
Goal: use the customers' similarities to find groups
Two directions:
Clustering: propose an explicit grouping of the customers
Visualization: propose a representation of the customers so that the groups are visible
Supervised learning reminder
We have training data Dn = [(x1, y1), . . . , (xn, yn)]
(xi, yi) i.i.d. with distribution P on X × Y
Construct a predictor f : X → Y using Dn
Loss ℓ(y, f(x)) measures how well f(x) predicts y
Aim is to minimize the generalization error
\mathbb{E}_{X,Y}[\ell(Y, f(X)) \mid D_n] = \int \ell(y, f(x)) \, dP(x, y).
The goal is clear
Predict the label y based on features x
Heard on the street
Supervised learning is solved. Unsupervised learning isn’t
Unsupervised learning
We have training data Dn = [x1, . . . , xn]
Loss: ????
Aim: ????
The goal is unclear. Classical tasks
Clustering: construct groups of data in homogeneous classes
Dimension reduction: construct a map of the data in a low-dimensional space without distorting it too much
Motivations
Interpretation of the groups
Use of the groups in further processing
Clustering
Training data Dn = {x1, . . . , xn} with xi ∈ Rd
Latent groups?
Construct a map f : Rd → {1, . . . , K} which assigns a cluster number to each xi:
f : xi 7→ ki
No ground truth for ki ...
Warning
Choice of K is hard
Roughly two approaches
Partition-based
Model-based
Partition-based clustering
Need to define the quality of a cluster
No obvious measure!
Homogeneity: samples in the same group should be similar
Simple idea
Use a distance as a measure of similarity
Simplest is the Euclidean distance
K-means algorithm
Similar = small Euclidean distance
Fix K ≥ 2, n data points xi ∈ Rd
Find centroids c_1, \dots, c_K that minimize the quantification error
\sum_{i=1}^n \min_{k=1,\dots,K} \| x_i - c_k \|_2^2
Impossible to find the exact solution!
Lloyd (1981) proposes a way of finding local solutions
Choose at random K centroids {c_1, . . . , c_K}
For each k ∈ {1, . . . , K}, find the set C_k of points that are closer to c_k than to any c_{k'} for k' ≠ k
Update the centroids:
c_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i
Repeat the two previous steps until the sets Ck don’t change
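A minimal NumPy sketch of these Lloyd iterations (not from the slides; the function name and the stopping test are my own choices):

```python
import numpy as np

def lloyd_kmeans(X, K, n_iter=100, rng=None):
    """Lloyd's algorithm: alternate assignments and centroid updates."""
    rng = np.random.default_rng(rng)
    n, _ = X.shape
    # Step 0: choose K centroids at random among the data points.
    centroids = X[rng.choice(n, size=K, replace=False)].copy()
    labels = None
    for _ in range(n_iter):
        # Assignment step: C_k = points closer to c_k than to any other centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # the sets C_k did not change
        labels = new_labels
        # Update step: c_k = mean of the points in C_k.
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels
```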
K-means
K-means is very sensitive to the choice of initial points
An easy solution:
Pick a first point at random
Choose the next initial point as the one farthest from the previous ones (see the sketch below)
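A small sketch of this farthest-point initialization (the function name `farthest_point_init` is hypothetical, not from the slides); the returned centroids can then seed the Lloyd iterations above:

```python
import numpy as np

def farthest_point_init(X, K, rng=None):
    """Pick a first centroid at random, then repeatedly take the point
    farthest from the centroids already chosen."""
    rng = np.random.default_rng(rng)
    n, _ = X.shape
    centroids = [X[rng.integers(n)]]
    while len(centroids) < K:
        # Distance from each point to its closest chosen centroid.
        dists = np.min(
            [((X - c) ** 2).sum(axis=1) for c in centroids], axis=0
        )
        centroids.append(X[dists.argmax()])
    return np.array(centroids)
```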
Farthest point initialization
But, very sensitive to outliers
A solution: K-means++
Pick the initial centroids as follows:
1. Pick i uniformly at random in {1, . . . , n}, put c_1 ← x_i and set k ← 1
2. k ← k + 1
3. Sample i ∈ {1, . . . , n} with probability
\frac{\min_{k'=1,\dots,k-1} \| x_i - c_{k'} \|_2^2}{\sum_{i'=1}^n \min_{k'=1,\dots,k-1} \| x_{i'} - c_{k'} \|_2^2}
4. Put c_k ← x_i
5. If k < K, go back to step 2.
Then run K-means starting from these initial centroids
This is between random initialization and farthest-point initialization
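A sketch of the K-means++ seeding described above (function name mine); the resulting centroids replace the random initialization in the Lloyd iterations:

```python
import numpy as np

def kmeanspp_init(X, K, rng=None):
    """K-means++ seeding: each new centroid is sampled with probability
    proportional to its squared distance to the closest centroid so far."""
    rng = np.random.default_rng(rng)
    n, _ = X.shape
    centroids = [X[rng.integers(n)]]          # step 1: uniform pick
    while len(centroids) < K:                 # steps 2-5
        dists = np.min(
            [((X - c) ** 2).sum(axis=1) for c in centroids], axis=0
        )
        probs = dists / dists.sum()
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids)
```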
K-means++: in expectation, it leads to a solution close to the optimum. Define the quantification error
Q_n(c_1, \dots, c_K) = \sum_{i=1}^n \min_{k=1,\dots,K} \| x_i - c_k \|_2^2.
We know (Arthur and Vassilvitskii [2006]) that if c_1, . . . , c_K are the centroids obtained with K-means++, then
\mathbb{E}[Q_n(c_1, \dots, c_K)] \leq 8 (\log K + 2) \min_{c'_1, \dots, c'_K} Q_n(c'_1, \dots, c'_K)
where \mathbb{E} is with respect to the random choice of initial centroids
Model-based clustering
use a model on data with clusters
using a mixture of distributions with different location (mean) parameters
Figure: Gaussian Mixture in dimension 1
Figure: Gaussian Mixture in dimension 2
Figure: Poisson Mixture with components 0.3·P(·|2), 0.2·P(·|5), 0.5·P(·|10), together with the resulting mixture density
Mixture of densities f1, . . . , fK
K ∈ N∗ (number of clusters)
(p_1, . . . , p_K) ∈ (R_+)^K such that \sum_{k=1}^K p_k = 1
Mixture density:
f(x) = \sum_{k=1}^K p_k f_k(x)
Gaussian Mixtures Model
The most widely used example
Put f_k = \varphi_{\mu_k, \Sigma_k} = density of N(\mu_k, \Sigma_k), where we recall
\varphi_{\mu_k, \Sigma_k}(x) = \frac{1}{(2\pi)^{d/2} \sqrt{\det(\Sigma_k)}} \exp\Big( -\frac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \Big)
with \Sigma_k \succ 0
Gaussian Mixtures Model
Statistical model with density
f_\theta(x) = \sum_{k=1}^K p_k \varphi_{\mu_k, \Sigma_k}(x),
Parameter θ = (p_1, . . . , p_K, \mu_1, . . . , \mu_K, \Sigma_1, . . . , \Sigma_K)
Goodness-of-fit is:
R_n(\theta) = -\text{log-likelihood} = -\sum_{i=1}^n \log\Big( \sum_{k=1}^K p_k \varphi_{\mu_k, \Sigma_k}(x_i) \Big)
A local minimizer \hat\theta is typically obtained using an algorithm called the Expectation-Maximization (EM) algorithm
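As an illustration (not part of the slides), the mixture density and the goodness-of-fit R_n(θ) can be evaluated with SciPy along these lines:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, p, mus, Sigmas):
    """Mixture density f_theta(x) = sum_k p_k * phi_{mu_k, Sigma_k}(x)."""
    return sum(
        p_k * multivariate_normal.pdf(x, mean=mu_k, cov=Sigma_k)
        for p_k, mu_k, Sigma_k in zip(p, mus, Sigmas)
    )

def neg_log_likelihood(X, p, mus, Sigmas):
    """R_n(theta) = - sum_i log f_theta(x_i)."""
    return -sum(np.log(gmm_density(x, p, mus, Sigmas)) for x in X)
```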
EM Algorithm
Idea:
there is a hidden structure in the model
knowing this structure, the optimization problem is easier
Indeed, each point X_i belongs to an unknown class k ∈ {1, . . . , K}
Put C_{i,k} = 1 when i belongs to class k, C_{i,k} = 0 otherwise.
We don’t observe {Ci ,k}1≤i≤n,1≤k≤K
We say that these are latent variables
Put C_k = {i : C_{i,k} = 1}, then C_1, . . . , C_K is a partition of {1, . . . , n}.
Generative model:
i belongs to class Ck with probability pk , namely
Ci = (Ci ,1, . . . ,Ci ,K ) ∼M(1, p1, . . . , pK )
[multinomial distribution with parameter (1, p1, . . . , pK )].
X_i ∼ \varphi_{\mu_k, \Sigma_k} if C_{i,k} = 1
The joint distribution of (X, C) is, according to this model,
f_\theta(x, c) = \prod_{k=1}^K \big( p_k \varphi_{\mu_k, \Sigma_k}(x) \big)^{c_k}
[c_k = 1 for only one k and 0 elsewhere] and the marginal density in X is the one of the mixture:
f_\theta(x) = \sum_{k=1}^K p_k \varphi_{\mu_k, \Sigma_k}(x)
This generative model adds a latent variable C, in such a way that the marginal distribution of (X, C) in X is indeed the one of the mixture f_\theta
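A small simulation of this generative model, assuming NumPy and toy parameters of my own choosing: draw C_i from the multinomial, then X_i from the corresponding Gaussian.

```python
import numpy as np

def sample_gmm(n, p, mus, Sigmas, rng=None):
    """Draw (X_i, C_i): C_i ~ M(1, p), then X_i ~ N(mu_k, Sigma_k) given C_{i,k} = 1."""
    rng = np.random.default_rng(rng)
    K = len(p)
    C = rng.choice(K, size=n, p=p)  # latent class labels
    X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in C])
    return X, C

# Example: a two-component mixture in dimension 2 (toy parameters).
X, C = sample_gmm(
    500, p=[0.3, 0.7],
    mus=[np.zeros(2), np.array([3.0, 3.0])],
    Sigmas=[np.eye(2), 0.5 * np.eye(2)],
)
```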
Put X = (X1, . . . ,Xn) and C = (C1, . . . ,Cn)
Do as if we observed the latent variables C
Write a completed likelihood for these "virtual" observations (joint distribution of (X, C)):
L_c(\theta; X, C) = \prod_{i=1}^n \prod_{k=1}^K \big( p_k \varphi_{\mu_k, \Sigma_k}(X_i) \big)^{C_{i,k}}
and the completed log-likelihood:
\ell_c(\theta; X, C) = \sum_{i=1}^n \sum_{k=1}^K C_{i,k} \big( \log p_k + \log \varphi_{\mu_k, \Sigma_k}(X_i) \big).
EM-Algorithm [Dempster et al. (77)]
(E=Expectation, M=Maximization)
Initialize θ(0)
for t = 0, . . . , until convergence, repeat:
1. (E)-step: compute
\theta \mapsto Q(\theta, \theta^{(t)}) = \mathbb{E}_{\theta^{(t)}}\big[ \ell_c(\theta; X, C) \mid X \big]
[Expectation with respect to the latent variables, for the previous value of θ]
2. (M)-step: compute
\theta^{(t+1)} \in \arg\max_{\theta \in \Theta} Q(\theta, \theta^{(t)})
[Maximize this expectation]
(E) and (M) steps often have explicit solutions
Theorem. The sequence \theta^{(t)} obtained using the EM Algorithm satisfies
\ell(\theta^{(t+1)}; X) \geq \ell(\theta^{(t)}; X)
for any t.
At each step, EM increases the likelihood
Initialization will be very important (usually done using K-Means or K-Means++)
Proof. Using the Bayes formula, the density of C | X is f_\theta(c \mid x) = \frac{f_\theta(x, c)}{f_\theta(x)}, so
\ell(\theta; X) = \ell_c(\theta; X, C) - \ell(\theta; C \mid X),
where we put \ell(\theta; c \mid x) = \log f_\theta(c \mid x). Take the expectation w.r.t. P_{\theta^{(t)}}(\cdot \mid X), and obtain
\ell(\theta; X) = \mathbb{E}_{\theta^{(t)}}\big[ \ell_c(\theta; X, C) \mid X \big] - \mathbb{E}_{\theta^{(t)}}\big[ \ell(\theta; C \mid X) \mid X \big] = Q(\theta, \theta^{(t)}) - Q_1(\theta, \theta^{(t)}).
Hence
\ell(\theta^{(t+1)}; X) - \ell(\theta^{(t)}; X) = Q(\theta^{(t+1)}, \theta^{(t)}) - Q(\theta^{(t)}, \theta^{(t)}) - \big( Q_1(\theta^{(t+1)}, \theta^{(t)}) - Q_1(\theta^{(t)}, \theta^{(t)}) \big).
We have Q(\theta^{(t+1)}, \theta^{(t)}) \geq Q(\theta^{(t)}, \theta^{(t)}) using the definition of \theta^{(t+1)}
It remains to check that Q_1(\theta^{(t+1)}, \theta^{(t)}) - Q_1(\theta^{(t)}, \theta^{(t)}) \leq 0:
Q_1(\theta^{(t+1)}, \theta^{(t)}) - Q_1(\theta^{(t)}, \theta^{(t)}) = \mathbb{E}_{\theta^{(t)}}\big[ \ell(\theta^{(t+1)}; C \mid X) - \ell(\theta^{(t)}; C \mid X) \mid X \big]
= \int \log\Big( \frac{f_{\theta^{(t+1)}}(c \mid X)}{f_{\theta^{(t)}}(c \mid X)} \Big) f_{\theta^{(t)}}(c \mid X) \, \mu(dc)
\leq \log \int f_{\theta^{(t+1)}}(c \mid X) \, \mu(dc) = 0,
by Jensen's inequality (the log is concave). This proves \ell(\theta^{(t+1)}; X) \geq \ell(\theta^{(t)}; X) for any t
So what is the EM Algorithm, and where do we use this?
It is an algorithm that makes it possible to optimize a likelihood with missing or latent data
For a mixture distribution, we come up with natural latent variables that simplify the original optimization problem
Consider the completed likelihood
\ell_c(\theta; X, C) = \sum_{i=1}^n \sum_{k=1}^K C_{i,k} \big( \log p_k + \log \varphi_{\mu_k, \Sigma_k}(X_i) \big),
where θ = (p_1, . . . , p_K, \mu_1, . . . , \mu_K, \Sigma_1, . . . , \Sigma_K).
What are the (E) and (M) steps in this case?
(E)-Step
Compute
\mathbb{E}_{\theta^{(t)}}\big[ \ell_c(\theta; X, C) \mid X \big] = \sum_{i=1}^n \sum_{k=1}^K \mathbb{E}_{\theta^{(t)}}[C_{i,k} \mid X] \big( \log p_k + \log \varphi_{\mu_k, \Sigma_k}(X_i) \big),
which is simply:
\mathbb{E}_\theta[C_{i,k} \mid X] = P_\theta(C_{i,k} = 1 \mid X_i) =: \pi_{i,k}(\theta),
where
\pi_{i,k}(\theta) = \frac{p_k \varphi_{\mu_k, \Sigma_k}(X_i)}{\sum_{k'=1}^K p_{k'} \varphi_{\mu_{k'}, \Sigma_{k'}}(X_i)}.
We call πi ,k(θ) the “soft-assignment” of i in class k .
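A sketch of this (E)-step, computing the soft-assignments π_{i,k}(θ) with SciPy (the function name `e_step` is mine):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, p, mus, Sigmas):
    """Soft-assignments pi_{i,k} = p_k phi_{mu_k,Sigma_k}(x_i) / sum_{k'} p_{k'} phi_{mu_{k'},Sigma_{k'}}(x_i)."""
    n, K = X.shape[0], len(p)
    pi = np.zeros((n, K))
    for k in range(K):
        pi[:, k] = p[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
    pi /= pi.sum(axis=1, keepdims=True)  # normalize each row over k
    return pi
```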
Figure: Soft-assignments x \mapsto \frac{p_k \varphi_{\mu_k, \Sigma_k}(x)}{\sum_{k'=1}^K p_{k'} \varphi_{\mu_{k'}, \Sigma_{k'}}(x)}.
(M)-Step: compute
\theta^{(t+1)} = (p_1^{(t+1)}, \dots, p_K^{(t+1)}, \mu_1^{(t+1)}, \dots, \mu_K^{(t+1)}, \Sigma_1^{(t+1)}, \dots, \Sigma_K^{(t+1)})
using:
p_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^n \pi_{i,k}(\theta^{(t)}),
\mu_k^{(t+1)} = \frac{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)}) X_i}{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)})},
\Sigma_k^{(t+1)} = \frac{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)}) (X_i - \mu_k^{(t+1)})(X_i - \mu_k^{(t+1)})^\top}{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)})}.
This is natural: estimation for the means and covariances, weighted by the soft-assignments
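A matching sketch of the (M)-step, re-estimating (p_k, μ_k, Σ_k) from the soft-assignments produced by the `e_step` sketch above:

```python
import numpy as np

def m_step(X, pi):
    """Weighted estimates of (p_k, mu_k, Sigma_k) from soft-assignments pi (shape (n, K))."""
    n, d = X.shape
    Nk = pi.sum(axis=0)                 # effective size of each cluster
    p = Nk / n
    mus = (pi.T @ X) / Nk[:, None]      # weighted means
    Sigmas = []
    for k in range(pi.shape[1]):
        Xc = X - mus[k]
        Sigmas.append((pi[:, k, None] * Xc).T @ Xc / Nk[k])  # weighted covariances
    return p, mus, np.array(Sigmas)
```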
So, for t = 1, . . ., iterate (E) and (M) until convergence:
\pi_{i,k}(\theta^{(t)}) = \frac{p_k^{(t)} \varphi_{\mu_k^{(t)}, \Sigma_k^{(t)}}(X_i)}{\sum_{k'=1}^K p_{k'}^{(t)} \varphi_{\mu_{k'}^{(t)}, \Sigma_{k'}^{(t)}}(X_i)}
p_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^n \pi_{i,k}(\theta^{(t)})
\mu_k^{(t+1)} = \frac{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)}) X_i}{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)})}
\Sigma_k^{(t+1)} = \frac{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)}) (X_i - \mu_k^{(t+1)})(X_i - \mu_k^{(t+1)})^\top}{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)})}.
We obtain an estimator \hat\theta = (\hat p_1, \dots, \hat p_K, \hat\mu_1, \dots, \hat\mu_K, \hat\Sigma_1, \dots, \hat\Sigma_K).
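Putting the two sketches together gives the EM loop below; the simple initialization (uniform weights, random means, empirical covariance, assuming d ≥ 2) is my own choice, and a K-means or K-means++ initialization would be preferable, as noted above:

```python
import numpy as np

def em_gmm(X, K, n_iter=100, rng=None):
    """Run EM for a Gaussian mixture with K components, reusing e_step and m_step above."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # Crude initialization: uniform weights, random data points as means,
    # the empirical covariance for every component.
    p = np.full(K, 1.0 / K)
    mus = X[rng.choice(n, size=K, replace=False)]
    Sigmas = np.array([np.cov(X, rowvar=False) for _ in range(K)])
    for _ in range(n_iter):
        pi = e_step(X, p, mus, Sigmas)   # (E): soft-assignments
        p, mus, Sigmas = m_step(X, pi)   # (M): weighted re-estimation
    return p, mus, Sigmas, pi
```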
Clustering: the maximum a posteriori (MAP) rule
Given \hat\theta, how to assign a cluster number k to a given x ∈ R^d?
Compute the soft-assignments
\pi_k(x) = \frac{\hat p_k \varphi_{\hat\mu_k, \hat\Sigma_k}(x)}{\sum_{k'=1}^K \hat p_{k'} \varphi_{\hat\mu_{k'}, \hat\Sigma_{k'}}(x)}
Assign x to cluster k if \pi_k(x) > \pi_{k'}(x) for any k' ≠ k
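In code, the MAP rule is just an argmax over the soft-assignments, reusing the `e_step` sketch and the fitted parameters from above:

```python
# MAP rule: assign each x_i to the class with the largest soft-assignment
# pi_{i,k}(theta_hat).
labels = e_step(X, p, mus, Sigmas).argmax(axis=1)
```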
The MAP rule
Figure: mixture density f_\theta(x) (left) and soft-assignments \pi_k(x) (right)
Gaussian Mixtures Model
K-Means and GMM don’t work for “embedded” cluster structures
Model-selection
How do I choose the number of clusters K?
This can be done using model-selection
Idea
Minimize a penalization of the goodness-of-fit
\hat K = \arg\min_{K \geq 1} \big\{ -\ell(\hat\theta_K) + \mathrm{pen}(K) \big\}
where
\hat\theta_K = MLE in the mixture model with K clusters
-\ell(\hat\theta_K): goodness-of-fit. If K is too large: overfitting
pen(K ) : penalization
Classical penalization functions
AIC (Akaike Information Criterion)
\hat K = \arg\min_{K \geq 1} \big\{ -\ell(\hat\theta_K; X) + \mathrm{df}(K) \big\}
BIC (Bayesian Information Criterion)
\hat K = \arg\min_{K \geq 1} \big\{ -\ell(\hat\theta_K; X) + \frac{\mathrm{df}(K) \log n}{2} \big\}
where
n: size of data
df(K): degrees of freedom of the mixture model with K clusters
In summary:
AIC: realizes a compromise between "bias" and "variance". It tends to underpenalize
BIC: convergent (if the model is true), namely \hat K → K when n → +∞
What is df(K ) in the Gaussian Mixtures model?
p1, . . . , pK : K − 1 (sum = 1)
µ1, . . . , µK : K × d
Σ1, . . . , ΣK : K × d(d + 1)/2 (symmetric matrices)
Hence
for the model with general covariances:
df(K ) = K − 1 + Kd + Kd(d + 1)/2
when restricting to diagonal matrices:
df(K ) = K − 1 + Kd + Kd .
Many other possibilities for the shapes of the covariances...
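One practical way to carry out this selection (not covered in the slides) is scikit-learn's GaussianMixture, which exposes the BIC directly; a sketch:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_K_bic(X, K_max=10):
    """Fit mixtures with K = 1..K_max and keep the one minimizing the BIC."""
    bics = []
    for K in range(1, K_max + 1):
        gmm = GaussianMixture(n_components=K, covariance_type="full").fit(X)
        # sklearn's BIC is -2 log-likelihood + df(K) log n, i.e. twice the
        # criterion on the slide, with df(K) = K - 1 + Kd + Kd(d+1)/2 for
        # general covariances; the argmin over K is the same.
        bics.append(gmm.bic(X))
    return int(np.argmin(bics)) + 1, bics
```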
Thank you!