
Introduction to machine learning

Master M2MO (Modélisation Aléatoire)

Stéphane Gaïffas

Unsupervised learning

Marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records

Biology: classification of plants and animals given their features

Insurance: identifying groups of motor insurance policyholders with a high average claim cost, identifying fraud

City-planning: identifying groups of houses according to their house type, value and geographical location

Internet: document classification, clustering weblog data to discover groups of similar access patterns

Marketing

Data: a database of customers containing their properties and past buying records

Goal: Use the customers' similarities to find groups

Two directions:

Clustering: propose an explicit grouping of the customers

Visualization: propose a representation of the customers so that the groups are visible

Supervised learning reminder

We have training data Dn = [(x1, y1), . . . , (xn, yn)]

$(x_i, y_i)$ i.i.d. with distribution $P$ on $X \times Y$

Construct a predictor $f : X \to Y$ using $D_n$

Loss $\ell(y, f(x))$ measures how well $f(x)$ predicts $y$

Aim is to minimize the generalization error

$$\mathbb{E}_{X,Y}\big[\ell(Y, f(X)) \mid D_n\big] = \int \ell(y, f(x)) \, dP(x, y).$$

The goal is clear

Predict the label y based on features x

Heard on the street

Supervised learning is solved. Unsupervised learning isn’t

Unsupervised learning

We have training data Dn = [x1, . . . , xn]

Loss: ????

Aim: ????

The goal is unclear. Classical tasks

Clustering: construct groups of data in homogeneous classes

Dimension reduction: construct a map of the data in a low-dimensional space without distorting it too much

Motivations

Interpretation of the groups

Use of the groups in further processing

Clustering

Training data $D_n = \{x_1, \dots, x_n\}$ with $x_i \in \mathbb{R}^d$

Latent groups?

Construct a map $f : \mathbb{R}^d \to \{1, \dots, K\}$ which assigns a cluster number to each $x_i$:

$$f : x_i \mapsto k_i$$

No ground truth for ki ...

Warning

The choice of K is hard

Roughly two approaches

Partition-based

Model-based

Partition-based clustering

Need to define the quality of a cluster

No obvious measure!

Homogeneity: samples in the same group should be similar

Simple idea

Use distance as a measure of similarity

The simplest is the Euclidean distance

K-means algorithm

Similar = small Euclidean distance

K-means algorithm

Fix K ≥ 2, n data points xi ∈ Rd

Find centroids $c_1, \dots, c_K$ that minimize the quantization error

$$\sum_{i=1}^n \min_{k=1,\dots,K} \|x_i - c_k\|_2^2$$

Impossible to find the exact solution!

Lloyd (1981) proposes a way of finding local solutions

Choose $K$ centroids $\{c_1, \dots, c_K\}$ at random

For each $k \in \{1, \dots, K\}$, find the set $C_k$ of points that are closer to $c_k$ than to any $c_{k'}$ for $k' \neq k$

Update the centroids:

$$c_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i$$

Repeat the two previous steps until the sets Ck don’t change
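To make the iterations concrete, here is a minimal NumPy sketch of Lloyd's algorithm; the function name, the random seeding of the centroids, and the stopping test on unchanged assignments are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def lloyd_kmeans(X, K, n_iter=100, seed=0):
    """Minimal sketch of Lloyd's algorithm. X has shape (n, d), K is the number of clusters."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random initialization: pick K distinct data points as the initial centroids
    centroids = X[rng.choice(n, size=K, replace=False)].copy()
    labels = None
    for _ in range(n_iter):
        # Assignment step: send each point to its closest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # the sets C_k did not change: we reached a local solution
        labels = new_labels
        # Update step: each centroid becomes the mean of its cluster
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels
```

As the next slides point out, the result depends heavily on the initial centroids, which motivates better initialization schemes.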

K-means

K-means is very sensitive to the choice of initial points

An easy solution:

Pick a first point at random

Choose the next initial point as the one farthest from the previous ones

Farthest point initialization

But, very sensitive to outliers

A solution: K-means++

Pick the initial centroids as follows:

1. Pick $i$ uniformly at random in $\{1, \dots, n\}$, put $c_1 \leftarrow x_i$, and set $k \leftarrow 1$

2. $k \leftarrow k + 1$

3. Sample $i \in \{1, \dots, n\}$ with probability

$$\frac{\min_{k'=1,\dots,k-1} \|x_i - c_{k'}\|_2^2}{\sum_{i'=1}^n \min_{k'=1,\dots,k-1} \|x_{i'} - c_{k'}\|_2^2}$$

4. Put $c_k \leftarrow x_i$

5. If $k < K$ go back to step 2.

Then use K-means based on these initial centroids

This is in between random initialization and farthest point initialization
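A minimal NumPy sketch of this seeding procedure; the function name and interface are illustrative assumptions.

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """Minimal sketch of K-means++ seeding. X has shape (n, d)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centroids = [X[rng.integers(n)]]                     # step 1: first centroid, uniform at random
    while len(centroids) < K:                            # steps 2-5
        C = np.array(centroids)
        # Squared distance of each point to its closest centroid chosen so far
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()                            # step 3: sampling probabilities
        i = rng.choice(n, p=probs)
        centroids.append(X[i])                           # step 4
    return np.array(centroids)
```

The returned centroids can then serve as the starting point of Lloyd's iterations, for instance the lloyd_kmeans sketch above adapted to accept initial centroids.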

K-means++ leads, in expectation, to a solution close to the optimum. Define the quantization error

$$Q_n(c_1, \dots, c_K) = \sum_{i=1}^n \min_{k=1,\dots,K} \|x_i - c_k\|_2^2.$$

We know (Arthur and Vassilvitskii [2006]) that if $c_1, \dots, c_K$ are the centroids obtained with K-means++, then

$$\mathbb{E}\big[Q_n(c_1, \dots, c_K)\big] \le 8 (\log K + 2) \min_{c'_1, \dots, c'_K} Q_n(c'_1, \dots, c'_K),$$

where the expectation $\mathbb{E}$ is with respect to the random choice of the initial centroids

Model-based clustering

Use a model for data with clusters

using a mixture of distributions with different location (mean) parameters

Figure: Gaussian Mixture in dimension 1


Figure: Gaussian Mixture in dimension 2 (density over the plane spanned by axes [1] and [2])


Figure: Poisson Mixture with components 0.3·P(·|2), 0.2·P(·|5), 0.5·P(·|10), and the resulting mixture distribution

Mixture of densities f1, . . . , fK

K ∈ N∗ (number of clusters)

$(p_1, \dots, p_K) \in (\mathbb{R}_+)^K$ such that $\sum_{k=1}^K p_k = 1$

Mixture density:

$$f(x) = \sum_{k=1}^K p_k f_k(x)$$

Gaussian Mixtures Model

The most widely used example

Put $f_k = \varphi_{\mu_k, \Sigma_k}$ = density of $N(\mu_k, \Sigma_k)$, where we recall

$$\varphi_{\mu_k, \Sigma_k}(x) = \frac{1}{(2\pi)^{d/2} \sqrt{\det(\Sigma_k)}} \exp\Big( -\frac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \Big)$$

with $\Sigma_k \succ 0$

Gaussian Mixtures Model

Statistical model with density

$$f_\theta(x) = \sum_{k=1}^K p_k \varphi_{\mu_k, \Sigma_k}(x),$$

Parameter $\theta = (p_1, \dots, p_K, \mu_1, \dots, \mu_K, \Sigma_1, \dots, \Sigma_K)$

Goodness-of-fit is the negative log-likelihood:

$$R_n(\theta) = -\sum_{i=1}^n \log\Big( \sum_{k=1}^K p_k \varphi_{\mu_k, \Sigma_k}(x_i) \Big)$$

A local minimizer $\hat\theta$ is typically obtained using an algorithm called the Expectation-Maximization (EM) algorithm
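For concreteness, here is a minimal sketch of how the goodness-of-fit $R_n(\theta)$ can be evaluated with NumPy and SciPy; the function name and argument layout are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_neg_log_likelihood(X, weights, means, covs):
    """R_n(theta) = -sum_i log(sum_k p_k * phi_{mu_k, Sigma_k}(x_i)), with X of shape (n, d)."""
    # Density of each point under each Gaussian component, shape (n, K)
    comp_dens = np.column_stack([
        multivariate_normal.pdf(X, mean=means[k], cov=covs[k]) for k in range(len(weights))
    ])
    mixture_dens = comp_dens @ np.asarray(weights)   # f_theta(x_i) for each i, shape (n,)
    return -np.log(mixture_dens).sum()
```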

EM Algorithm

Idea:

there is a hidden structure in the model

knowing this structure, the optimization problem is easier

Indeed, each point $X_i$ belongs to an unknown class $k \in \{1, \dots, K\}$. Put $C_{i,k} = 1$ when $i$ belongs to class $k$, and $C_{i,k} = 0$ otherwise.

We don't observe $\{C_{i,k}\}_{1 \le i \le n, 1 \le k \le K}$

We say that these are latent variables

Put $\mathcal{C}_k = \{i : C_{i,k} = 1\}$, then $\mathcal{C}_1, \dots, \mathcal{C}_K$ is a partition of $\{1, \dots, n\}$.

Generative model:

$i$ belongs to class $\mathcal{C}_k$ with probability $p_k$, namely

$$C_i = (C_{i,1}, \dots, C_{i,K}) \sim \mathcal{M}(1, p_1, \dots, p_K)$$

[multinomial distribution with parameter $(1, p_1, \dots, p_K)$].

$X_i \sim \varphi_{\mu_k, \Sigma_k}$ if $C_{i,k} = 1$
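A short NumPy sketch of this generative model (draw the latent class from the multinomial, then the point from the corresponding Gaussian); the function name is an illustrative assumption.

```python
import numpy as np

def sample_gmm(n, weights, means, covs, seed=0):
    """Generate n points: draw C_i ~ M(1, p_1, ..., p_K), then X_i ~ N(mu_k, Sigma_k)."""
    rng = np.random.default_rng(seed)
    K = len(weights)
    classes = rng.choice(K, size=n, p=weights)   # latent class of each point
    X = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in classes])
    return X, classes
```

Here X plays the role of the observed data, while classes corresponds to the latent variables $C_i$ that are not observed.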

The joint distribution of (X ,C ) is, according to this model

$$f_\theta(x, c) = \prod_{k=1}^K \big( p_k \varphi_{\mu_k, \Sigma_k}(x) \big)^{c_k}$$

[$c_k = 1$ for only one $k$ and $0$ elsewhere] and the marginal density in $x$ is the one of the mixture:

$$f_\theta(x) = \sum_{k=1}^K p_k \varphi_{\mu_k, \Sigma_k}(x)$$

This generative model adds a latent variable $C$, in such a way that the marginal distribution of $(X, C)$ in $X$ is indeed the one of the mixture $f_\theta$

Put X = (X1, . . . ,Xn) and C = (C1, . . . ,Cn)

Act as if we observed the latent variables $C$

Write a completed likelihood for these "virtual" observations (joint distribution of $(X, C)$):

$$L_c(\theta; X, C) = \prod_{i=1}^n \prod_{k=1}^K \big( p_k \varphi_{\mu_k, \Sigma_k}(X_i) \big)^{C_{i,k}}$$

and the completed log-likelihood:

$$\ell_c(\theta; X, C) = \sum_{i=1}^n \sum_{k=1}^K C_{i,k} \big( \log p_k + \log \varphi_{\mu_k, \Sigma_k}(X_i) \big).$$

EM Algorithm [Dempster et al. (77)] (E = Expectation, M = Maximization)

Initialize θ(0)

for t = 0, . . . , until convergence, repeat:

1. (E)-step: compute

$$\theta \mapsto Q(\theta, \theta^{(t)}) = \mathbb{E}_{\theta^{(t)}}\big[ \ell_c(\theta; X, C) \,\big|\, X \big]$$

[Expectation with respect to the latent variables, for the previous value of $\theta$]

2. (M)-step: compute

$$\theta^{(t+1)} \in \operatorname*{arg\,max}_{\theta \in \Theta} Q(\theta, \theta^{(t)})$$

[Maximize this expectation]

(E) and (M) steps often have explicit solutions

Theorem. The sequence $\theta^{(t)}$ obtained using the EM algorithm satisfies

$$\ell(\theta^{(t+1)}; X) \ge \ell(\theta^{(t)}; X)$$

for any t.

At each step, EM increases the likelihood

Initialization will be very important (usually done using K-means or K-means++)

Proof. Using the Bayes formula, the density of $C \mid X$ is $f_\theta(c \mid x) = \frac{f_\theta(x, c)}{f_\theta(x)}$, so

$$\ell(\theta; X) = \ell_c(\theta; X, C) - \ell(\theta; C \mid X),$$

where we put $\ell(\theta; c \mid x) = \log f_\theta(c \mid x)$. Take the expectation w.r.t. $P_{\theta^{(t)}}(\cdot \mid X)$ and obtain

$$\ell(\theta; X) = \mathbb{E}_{\theta^{(t)}}\big[ \ell_c(\theta; X, C) \mid X \big] - \mathbb{E}_{\theta^{(t)}}\big[ \ell(\theta; C \mid X) \mid X \big] = Q(\theta, \theta^{(t)}) - Q_1(\theta, \theta^{(t)}).$$

Hence

$$\ell(\theta^{(t+1)}; X) - \ell(\theta^{(t)}; X) = Q(\theta^{(t+1)}, \theta^{(t)}) - Q(\theta^{(t)}, \theta^{(t)}) - \big( Q_1(\theta^{(t+1)}, \theta^{(t)}) - Q_1(\theta^{(t)}, \theta^{(t)}) \big).$$

We have $Q(\theta^{(t+1)}, \theta^{(t)}) \ge Q(\theta^{(t)}, \theta^{(t)})$ by the definition of $\theta^{(t+1)}$

It remains to check that $Q_1(\theta^{(t+1)}, \theta^{(t)}) - Q_1(\theta^{(t)}, \theta^{(t)}) \le 0$:

$$\begin{aligned}
Q_1(\theta^{(t+1)}, \theta^{(t)}) - Q_1(\theta^{(t)}, \theta^{(t)})
&= \mathbb{E}_{\theta^{(t)}}\big[ \ell(\theta^{(t+1)}; C \mid X) - \ell(\theta^{(t)}; C \mid X) \mid X \big] \\
&= \int \log\Big( \frac{f_{\theta^{(t+1)}}(c \mid X)}{f_{\theta^{(t)}}(c \mid X)} \Big) f_{\theta^{(t)}}(c \mid X) \, \mu(dc) \\
&\underset{\text{(Jensen)}}{\le} \log \int f_{\theta^{(t+1)}}(c \mid X) \, \mu(dc) = 0,
\end{aligned}$$

This proves $\ell(\theta^{(t+1)}; X) \ge \ell(\theta^{(t)}; X)$ for any $t$

So what is the EM Algorithm, and where do we use this?

It is an algorithm that allows one to optimize a likelihood with missing or latent data

For a mixture distribution, we come up with natural latent variables that simplify the original optimization problem

Consider the completed log-likelihood

$$\ell_c(\theta; X, C) = \sum_{i=1}^n \sum_{k=1}^K C_{i,k} \big( \log p_k + \log \varphi_{\mu_k, \Sigma_k}(X_i) \big),$$

where $\theta = (p_1, \dots, p_K, \mu_1, \dots, \mu_K, \Sigma_1, \dots, \Sigma_K)$.

What are the (E) and (M) steps in this case?

(E)-Step

Compute

$$\mathbb{E}_{\theta^{(t)}}\big[ \ell_c(\theta; X, C) \,\big|\, X \big] = \sum_{i=1}^n \sum_{k=1}^K \mathbb{E}_{\theta^{(t)}}[C_{i,k} \mid X] \big( \log p_k + \log \varphi_{\mu_k, \Sigma_k}(X_i) \big),$$

which is simply

$$\mathbb{E}_\theta[C_{i,k} \mid X] = P_\theta(C_{i,k} = 1 \mid X_i) =: \pi_{i,k}(\theta),$$

where

$$\pi_{i,k}(\theta) = \frac{p_k \varphi_{\mu_k, \Sigma_k}(X_i)}{\sum_{k'=1}^K p_{k'} \varphi_{\mu_{k'}, \Sigma_{k'}}(X_i)}.$$

We call $\pi_{i,k}(\theta)$ the "soft-assignment" of $i$ to class $k$.
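A minimal sketch of these (E)-step soft-assignments, assuming a GMM parameterized by weights, means and covariances (illustrative names):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    """Soft-assignments pi_{i,k} = p_k phi_{mu_k,Sigma_k}(X_i) / sum_{k'} p_{k'} phi_{k'}(X_i)."""
    K = len(weights)
    # Weighted component densities p_k * phi_{mu_k, Sigma_k}(X_i), shape (n, K)
    weighted = np.column_stack([
        weights[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k]) for k in range(K)
    ])
    return weighted / weighted.sum(axis=1, keepdims=True)
```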

Figure: Soft-assignments $x \mapsto \frac{p_k \varphi_{\mu_k, \Sigma_k}(x)}{\sum_{k'=1}^K p_{k'} \varphi_{\mu_{k'}, \Sigma_{k'}}(x)}$

(M)-Step: compute

$$\theta^{(t+1)} = \big(p_1^{(t+1)}, \dots, p_K^{(t+1)}, \mu_1^{(t+1)}, \dots, \mu_K^{(t+1)}, \Sigma_1^{(t+1)}, \dots, \Sigma_K^{(t+1)}\big)$$

using:

$$p_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^n \pi_{i,k}(\theta^{(t)}), \qquad
\mu_k^{(t+1)} = \frac{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)}) X_i}{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)})},$$

$$\Sigma_k^{(t+1)} = \frac{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)}) (X_i - \mu_k^{(t+1)})(X_i - \mu_k^{(t+1)})^\top}{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)})}.$$

This is natural: estimation of the means and covariances, weighted by the soft-assignments
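A minimal sketch of these (M)-step updates, taking as input the matrix of soft-assignments $\pi_{i,k}(\theta^{(t)})$ of shape (n, K), for instance the output of the e_step sketch above; the function name and interface are illustrative.

```python
import numpy as np

def m_step(X, pi):
    """M-step updates given soft-assignments pi of shape (n, K) and data X of shape (n, d)."""
    n, d = X.shape
    Nk = pi.sum(axis=0)                        # effective number of points per cluster
    weights = Nk / n                           # p_k^{(t+1)}
    means = (pi.T @ X) / Nk[:, None]           # mu_k^{(t+1)}: soft-assignment-weighted means
    covs = np.empty((len(Nk), d, d))
    for k in range(len(Nk)):
        Xc = X - means[k]                      # center around the new mean mu_k^{(t+1)}
        covs[k] = (pi[:, k, None] * Xc).T @ Xc / Nk[k]   # weighted covariance Sigma_k^{(t+1)}
    return weights, means, covs
```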

So, for t = 1, . . ., iterate (E) and (M) until convergence:

$$\pi_{i,k}(\theta^{(t)}) = \frac{p_k^{(t)} \varphi_{\mu_k^{(t)}, \Sigma_k^{(t)}}(X_i)}{\sum_{k'=1}^K p_{k'}^{(t)} \varphi_{\mu_{k'}^{(t)}, \Sigma_{k'}^{(t)}}(X_i)}$$

$$p_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^n \pi_{i,k}(\theta^{(t)}), \qquad
\mu_k^{(t+1)} = \frac{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)}) X_i}{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)})},$$

$$\Sigma_k^{(t+1)} = \frac{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)}) (X_i - \mu_k^{(t+1)})(X_i - \mu_k^{(t+1)})^\top}{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)})}.$$

We obtain an estimator $\hat\theta = (\hat p_1, \dots, \hat p_K, \hat\mu_1, \dots, \hat\mu_K, \hat\Sigma_1, \dots, \hat\Sigma_K)$.

Clustering: the maximum a posteriori (MAP) rule

Given $\hat\theta$, how do we assign a cluster number $k$ to a given $x \in \mathbb{R}^d$?

Compute the soft-assignments

$$\pi_k(x) = \frac{\hat p_k \varphi_{\hat\mu_k, \hat\Sigma_k}(x)}{\sum_{k'=1}^K \hat p_{k'} \varphi_{\hat\mu_{k'}, \hat\Sigma_{k'}}(x)}$$

Consider

$x$ belongs to cluster $k$ if $\pi_k(x) > \pi_{k'}(x)$ for any $k' \neq k$
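A short sketch of the MAP rule; since the denominator of $\pi_k(x)$ does not depend on $k$, taking the argmax over the weighted component densities is equivalent (the function name is illustrative).

```python
import numpy as np
from scipy.stats import multivariate_normal

def map_clusters(X, weights, means, covs):
    """MAP rule: assign each point to the class with the largest soft-assignment pi_k(x)."""
    weighted = np.column_stack([
        weights[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
        for k in range(len(weights))
    ])
    # The normalizing denominator is common to all k, so the argmax over numerators suffices
    return weighted.argmax(axis=1)
```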

The MAP rule

Figure: Mixture $f_{\hat\theta}(x)$ and soft-assignments $\pi_k(x)$

Gaussian Mixtures Model

K-means and GMM don't work for "embedded" cluster structures

Model-selection

How do I choose the number of clusters K?

This can be done using model-selection

Idea

Minimize a penalization of the goodness-of-fit

$$\hat K = \operatorname*{arg\,min}_{K \ge 1} \big\{ -\ell(\hat\theta_K) + \mathrm{pen}(K) \big\}$$

where

$\hat\theta_K$ = MLE in the mixture model with $K$ clusters

$-\ell(\hat\theta_K)$: goodness-of-fit. If $K$ is too large: overfitting

$\mathrm{pen}(K)$: penalization

Classical penalization functions

AIC (Akaike Information Criterion)

$$\hat K = \operatorname*{arg\,min}_{K \ge 1} \big\{ -\ell(\hat\theta_K; X) + \mathrm{df}(K) \big\}$$

BIC (Bayesian Information Criterion)

$$\hat K = \operatorname*{arg\,min}_{K \ge 1} \Big\{ -\ell(\hat\theta_K; X) + \frac{\mathrm{df}(K) \log n}{2} \Big\}$$

where

$n$: size of data

$\mathrm{df}(K)$: degrees of freedom of the mixture model with $K$ clusters

In summary:

AIC: achieves a compromise between "bias" and "variance". It tends to underpenalize

BIC: convergent (if the model is true), namely $\hat K \to K$ when $n \to +\infty$

What is df(K ) in the Gaussian Mixtures model?

p1, . . . , pK : K − 1 (sum = 1)

µ1, . . . , µK : K × d

Σ1, . . . , ΣK : K × d(d + 1)/2 (symmetric matrices)

Hence

for the model with general covariances:

df(K ) = K − 1 + Kd + Kd(d + 1)/2

when restricting to diagonal matrices:

df(K ) = K − 1 + Kd + Kd .

Many other possibilities for the shapes of the covariances...
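As an illustration, here is a sketch of BIC selection of $K$ using the df(K) formula above; it relies on scikit-learn's GaussianMixture as a stand-in for the EM fit (its score method returns the average log-likelihood per sample), and the helper name is an illustrative assumption. Note that scikit-learn also provides its own aic and bic methods, with a different scaling convention; the sketch below follows the slides' formula instead, and replacing the penalty by df(K) gives AIC.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_K_bic(X, K_max=10):
    """Sketch of BIC model selection with df(K) for general covariances."""
    n, d = X.shape
    best_K, best_crit = None, np.inf
    for K in range(1, K_max + 1):
        gm = GaussianMixture(n_components=K, covariance_type="full").fit(X)
        neg_loglik = -n * gm.score(X)                      # -loglikelihood(theta_K; X)
        df = (K - 1) + K * d + K * d * (d + 1) // 2        # degrees of freedom of the model
        crit = neg_loglik + df * np.log(n) / 2             # BIC penalization
        if crit < best_crit:
            best_K, best_crit = K, crit
    return best_K
```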

Thank you!
