Introduction to machine learning
Master M2MO (Modélisation Aléatoire)
Stéphane Gaïffas
Unsupervised learning
Marketing: finding groups of customers with similar behavior, given a large database of customer data containing their properties and past buying records
Biology: classification of plants and animals given their features
Insurance: identifying groups of motor insurance policyholders with a high average claim cost; identifying fraud
City-planning: identifying groups of houses according to their house type, value and geographical location
Internet: document classification; clustering weblog data to discover groups of similar access patterns
Marketing
Data: a base of customer data containing their properties and past buying records
Goal: use the similarities between customers to find groups
Two directions:
Clustering: propose an explicit grouping of the customers
Visualization: propose a representation of the customers so that the groups are visible
Supervised learning reminder
We have training data $D_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$
$(x_i, y_i)$ i.i.d. with distribution $P$ on $\mathcal{X} \times \mathcal{Y}$
Construct a predictor $f : \mathcal{X} \to \mathcal{Y}$ using $D_n$
Loss $\ell(y, f(x))$ measures how well $f(x)$ predicts $y$
Aim is to minimize the generalization error
$$\mathbb{E}_{X,Y}\big[\ell(Y, f(X)) \mid D_n\big] = \int \ell(y, f(x)) \, dP(x, y).$$
The goal is clear
Predict the label $y$ based on the features $x$
Heard on the street
Supervised learning is solved. Unsupervised learning isn’t
Unsupervised learning
We have training data $D_n = \{x_1, \ldots, x_n\}$
Loss: ????
Aim: ????
The goal is unclear. Classical tasks:
Clustering: construct groups of data in homogeneous classes
Dimension reduction: construct a map of the data in a low-dimensional space without distorting it too much
Motivations
Interpretation of the groups
Use of the groups in further processing
Clustering
Training data $D_n = \{x_1, \ldots, x_n\}$ with $x_i \in \mathbb{R}^d$
Latent groups?
Construct a map $f : \mathbb{R}^d \to \{1, \ldots, K\}$ which assigns a cluster number to each $x_i$:
$$f : x_i \mapsto k_i$$
No ground truth for $k_i$...
Warning
Choice of $K$ is hard
Roughly two approaches
Partition-based
Model-based
Partition-based clustering
Need to define the quality of a cluster
No obvious measure!
Homogeneity: samples in the same group should be similar
Simple idea
Use distance as a measure of similarity
Simplest is the Euclidean distance
K-means algorithm
Similar = small Euclidean distance
K-means algorithm
Fix $K \geq 2$ and $n$ data points $x_i \in \mathbb{R}^d$
Find centroids $c_1, \ldots, c_K$ that minimize the quantization error
$$\sum_{i=1}^n \min_{k=1,\ldots,K} \|x_i - c_k\|_2^2$$
Impossible to find the exact solution in general (the problem is NP-hard)!
Lloyd (1981) proposes a way of finding local solutions
Choose at random $K$ centroids $\{c_1, \ldots, c_K\}$
For each $k \in \{1, \ldots, K\}$, find the set $C_k$ of points that are closer to $c_k$ than to any $c_{k'}$ for $k' \neq k$
Update the centroids:
$$c_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i$$
Repeat the two previous steps until the sets $C_k$ don't change
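To make the iterations concrete, here is a minimal NumPy sketch of Lloyd's algorithm (the names `lloyd`, `X`, `centroids` are illustrative, not from the slides):

```python
import numpy as np

def lloyd(X, centroids, n_iter=100):
    """Lloyd's algorithm: alternate assignments and centroid updates."""
    labels = None
    for _ in range(n_iter):
        # Assignment step: attach each point to its closest centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid becomes the mean of its set C_k
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if (labels == k).any() else centroids[k]
            for k in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):
            break  # the sets C_k no longer change
        centroids = new_centroids
    return centroids, labels
```

Each iteration can only decrease the quantization error, so the algorithm stops at a local solution that depends on the initial centroids.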
K-means
K-means is very sensitive to the choice of initial points
An easy solution:
Pick a first point at random
Choose the next initial point to be the farthest from the previous ones
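A sketch of this farthest-point heuristic, under the same illustrative naming:

```python
import numpy as np

def farthest_point_init(X, K, seed=0):
    """Greedy initialization: each new centroid is the point
    farthest from all centroids chosen so far."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # first point at random
    for _ in range(K - 1):
        # squared distance of each point to its closest current centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        centroids.append(X[d2.argmax()])   # farthest point becomes a centroid
    return np.array(centroids)
```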
Farthest point initialization
But, very sensitive to outliers
A solution: K-means++
Pick the initial centroids as follows:
1. Pick $i$ uniformly at random in $\{1, \ldots, n\}$, put $c_1 \leftarrow x_i$ and set $k \leftarrow 1$
2. $k \leftarrow k + 1$
3. Sample $i \in \{1, \ldots, n\}$ with probability
$$\frac{\min_{k'=1,\ldots,k-1} \|x_i - c_{k'}\|_2^2}{\sum_{i'=1}^n \min_{k'=1,\ldots,k-1} \|x_{i'} - c_{k'}\|_2^2}$$
4. Put $c_k \leftarrow x_i$
5. If $k < K$, go back to step 2
Then run K-means from these initial centroids
This sits between random initialization and farthest point initialization
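A sketch of this seeding scheme (illustrative names again); the only change from farthest-point initialization is that the next centroid is *sampled* rather than taken deterministically:

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """K-means++ seeding: sample each new centroid with probability
    proportional to the squared distance to the closest chosen one."""
    rng = np.random.default_rng(seed)
    n = len(X)
    centroids = [X[rng.integers(n)]]        # step 1: uniform choice
    while len(centroids) < K:               # steps 2-5
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        i = rng.choice(n, p=d2 / d2.sum())  # the D^2 sampling distribution
        centroids.append(X[i])
    return np.array(centroids)
```

Sampling instead of maximizing is what makes the scheme more robust to outliers: an isolated point has a high selection probability but is no longer chosen with certainty.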
K-means++ leads, in expectation, to a solution close to the optimum. Define the quantization error
$$Q_n(c_1, \ldots, c_K) = \sum_{i=1}^n \min_{k=1,\ldots,K} \|x_i - c_k\|_2^2.$$
We know (Arthur and Vassilvitskii [2006]) that if $c_1, \ldots, c_K$ are the centroids obtained with K-means++, then
$$\mathbb{E}[Q_n(c_1, \ldots, c_K)] \leq 8 (\log K + 2) \min_{c_1', \ldots, c_K'} Q_n(c_1', \ldots, c_K'),$$
where $\mathbb{E}$ is with respect to the random choice of the initial centroids
Model-based clustering
use a model on data with clusters
using a mixture of distributions with different location (mean) parameters
Figure: Gaussian Mixture in dimension 1
Figure: Gaussian Mixture in dimension 2 (density mixture over the plane [1 2])
Figure: Poisson mixture with density $0.3\,P(\cdot \mid 2) + 0.2\,P(\cdot \mid 5) + 0.5\,P(\cdot \mid 10)$ (the legend also shows the mixture distribution)
Mixture of densities $f_1, \ldots, f_K$
$K \in \mathbb{N}^*$ (number of clusters)
$(p_1, \ldots, p_K) \in (\mathbb{R}_+)^K$ such that $\sum_{k=1}^K p_k = 1$
Mixture density:
$$f(x) = \sum_{k=1}^K p_k f_k(x)$$
Gaussian Mixtures Model
The most widely used example
Put $f_k = \varphi_{\mu_k, \Sigma_k}$ = density of $N(\mu_k, \Sigma_k)$, where we recall
$$\varphi_{\mu_k, \Sigma_k}(x) = \frac{1}{(2\pi)^{d/2} \sqrt{\det(\Sigma_k)}} \exp\Big( -\frac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \Big)$$
with $\Sigma_k \succ 0$
Gaussian Mixtures Model
Statistical model with density
$$f_\theta(x) = \sum_{k=1}^K p_k \varphi_{\mu_k, \Sigma_k}(x),$$
Parameter $\theta = (p_1, \ldots, p_K, \mu_1, \ldots, \mu_K, \Sigma_1, \ldots, \Sigma_K)$
Goodness-of-fit is the negative log-likelihood:
$$R_n(\theta) = -\sum_{i=1}^n \log\Big( \sum_{k=1}^K p_k \varphi_{\mu_k, \Sigma_k}(x_i) \Big)$$
A local minimizer $\hat\theta$ is typically obtained using the Expectation-Maximization (EM) algorithm
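As a sanity check, $f_\theta$ and $R_n(\theta)$ are easy to evaluate numerically; a sketch using `scipy.stats.multivariate_normal` (parameter names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, p, mus, Sigmas):
    """f_theta(x) = sum_k p_k * phi_{mu_k, Sigma_k}(x)."""
    return sum(p_k * multivariate_normal.pdf(x, mean=mu_k, cov=S_k)
               for p_k, mu_k, S_k in zip(p, mus, Sigmas))

def neg_log_likelihood(X, p, mus, Sigmas):
    """R_n(theta) = -sum_i log f_theta(x_i)."""
    return -sum(np.log(mixture_density(x_i, p, mus, Sigmas)) for x_i in X)
```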
EM Algorithm
Idea:
there is a hidden structure in the model
knowing this structure, the optimization problem is easier
Indeed, each point $X_i$ belongs to an unknown class $k \in \{1, \ldots, K\}$. Put $C_{i,k} = 1$ when $i$ belongs to class $k$, $C_{i,k} = 0$ otherwise.
We don't observe $\{C_{i,k}\}_{1 \leq i \leq n, 1 \leq k \leq K}$
We say that these are latent variables
Put $\mathcal{C}_k = \{i : C_{i,k} = 1\}$; then $\mathcal{C}_1, \ldots, \mathcal{C}_K$ is a partition of $\{1, \ldots, n\}$.
Generative model:
$i$ belongs to class $\mathcal{C}_k$ with probability $p_k$, namely
$$C_i = (C_{i,1}, \ldots, C_{i,K}) \sim \mathcal{M}(1, p_1, \ldots, p_K)$$
[multinomial distribution with parameter $(1, p_1, \ldots, p_K)$]
$X_i \sim \varphi_{\mu_k, \Sigma_k}$ if $C_{i,k} = 1$
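This generative story translates directly into a sampler; a sketch with illustrative names:

```python
import numpy as np

def sample_gmm(n, p, mus, Sigmas, seed=0):
    """Draw the latent class C_i ~ M(1, p_1, ..., p_K), then
    X_i ~ N(mu_k, Sigma_k) on the event {C_{i,k} = 1}."""
    rng = np.random.default_rng(seed)
    ks = rng.choice(len(p), size=n, p=p)   # latent class indices
    X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in ks])
    return X, ks                           # ks is unobserved in practice
```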
The joint distribution of $(X, C)$ is, according to this model,
$$f_\theta(x, c) = \prod_{k=1}^K \big( p_k \varphi_{\mu_k, \Sigma_k}(x) \big)^{c_k}$$
[$c_k = 1$ for only one $k$ and $0$ elsewhere] and the marginal density in $X$ is the one of the mixture:
$$f_\theta(x) = \sum_{k=1}^K p_k \varphi_{\mu_k, \Sigma_k}(x)$$
This generative model adds a latent variable $C$, in such a way that the marginal distribution of $(X, C)$ in $X$ is indeed the one of the mixture $f_\theta$
Put $X = (X_1, \ldots, X_n)$ and $C = (C_1, \ldots, C_n)$
Do as if we observed the latent variables $C$
Write a completed likelihood for these "virtual" observations (joint distribution of $(X, C)$):
$$L_c(\theta; X, C) = \prod_{i=1}^n \prod_{k=1}^K \big( p_k \varphi_{\mu_k, \Sigma_k}(X_i) \big)^{C_{i,k}}$$
and the completed log-likelihood:
$$\ell_c(\theta; X, C) = \sum_{i=1}^n \sum_{k=1}^K C_{i,k} \big( \log p_k + \log \varphi_{\mu_k, \Sigma_k}(X_i) \big).$$
EM-Algorithm [Dempster et al. (77)] (E = Expectation, M = Maximization)
Initialize $\theta^{(0)}$
For $t = 0, \ldots$, until convergence, repeat:
1. (E)-step: compute
$$\theta \mapsto Q(\theta, \theta^{(t)}) = \mathbb{E}_{\theta^{(t)}}\big[ \ell_c(\theta; X, C) \mid X \big]$$
[Expectation with respect to the latent variables, for the previous value of $\theta$]
2. (M)-step: compute
$$\theta^{(t+1)} \in \operatorname*{arg\,max}_{\theta \in \Theta} Q(\theta, \theta^{(t)})$$
[Maximize this expectation]
(E) and (M) steps often have explicit solutions
Theorem. The sequence $\theta^{(t)}$ obtained using the EM algorithm satisfies
$$\ell(\theta^{(t+1)}; X) \geq \ell(\theta^{(t)}; X)$$
for any $t$.
At each step, EM increases the likelihood
Initialization will be very important (usually done using K-means or K-means++)
Proof. Using the Bayes formula, the density of $C \mid X$ is $f_\theta(c \mid x) = \frac{f_\theta(x, c)}{f_\theta(x)}$, so
$$\ell(\theta; X) = \ell_c(\theta; X, C) - \ell(\theta; C \mid X),$$
where we put $\ell(\theta; c \mid x) = \log f_\theta(c \mid x)$. Take the expectation w.r.t. $P_{\theta^{(t)}}(\cdot \mid X)$ and obtain
$$\ell(\theta; X) = \mathbb{E}_{\theta^{(t)}}\big[ \ell_c(\theta; X, C) \mid X \big] - \mathbb{E}_{\theta^{(t)}}\big[ \ell(\theta; C \mid X) \mid X \big] = Q(\theta, \theta^{(t)}) - Q_1(\theta, \theta^{(t)}).$$
Hence
$$\ell(\theta^{(t+1)}; X) - \ell(\theta^{(t)}; X) = Q(\theta^{(t+1)}, \theta^{(t)}) - Q(\theta^{(t)}, \theta^{(t)}) - \big( Q_1(\theta^{(t+1)}, \theta^{(t)}) - Q_1(\theta^{(t)}, \theta^{(t)}) \big).$$
We have $Q(\theta^{(t+1)}, \theta^{(t)}) \geq Q(\theta^{(t)}, \theta^{(t)})$ using the definition of $\theta^{(t+1)}$
Remains to check that $Q_1(\theta^{(t+1)}, \theta^{(t)}) - Q_1(\theta^{(t)}, \theta^{(t)}) \leq 0$:
$$Q_1(\theta^{(t+1)}, \theta^{(t)}) - Q_1(\theta^{(t)}, \theta^{(t)}) = \mathbb{E}_{\theta^{(t)}}\big[ \ell(\theta^{(t+1)}; C \mid X) - \ell(\theta^{(t)}; C \mid X) \mid X \big]$$
$$= \int \log\Big( \frac{f_{\theta^{(t+1)}}(c \mid X)}{f_{\theta^{(t)}}(c \mid X)} \Big) f_{\theta^{(t)}}(c \mid X) \, \mu(dc) \underset{\text{(Jensen)}}{\leq} \log \int f_{\theta^{(t+1)}}(c \mid X) \, \mu(dc) = 0.$$
This proves $\ell(\theta^{(t+1)}; X) \geq \ell(\theta^{(t)}; X)$ for any $t$
So what is the EM Algorithm, and where do we use this?
It is an algorithm that makes it possible to optimize a likelihood with missing or latent data
For a mixture distribution, we come up with natural latent variables that simplify the original optimization problem
Consider the completed log-likelihood
$$\ell_c(\theta; X, C) = \sum_{i=1}^n \sum_{k=1}^K C_{i,k} \big( \log p_k + \log \varphi_{\mu_k, \Sigma_k}(X_i) \big),$$
where $\theta = (p_1, \ldots, p_K, \mu_1, \ldots, \mu_K, \Sigma_1, \ldots, \Sigma_K)$.
What are the (E) and (M) steps in this case?
(E)-Step
Compute
$$\mathbb{E}_{\theta^{(t)}}\big[ \ell_c(\theta; X, C) \mid X \big] = \sum_{i=1}^n \sum_{k=1}^K \mathbb{E}_{\theta^{(t)}}[C_{i,k} \mid X] \big( \log p_k + \log \varphi_{\mu_k, \Sigma_k}(X_i) \big),$$
where, simply,
$$\mathbb{E}_\theta[C_{i,k} \mid X] = \mathbb{P}_\theta(C_{i,k} = 1 \mid X_i) =: \pi_{i,k}(\theta),$$
with
$$\pi_{i,k}(\theta) = \frac{p_k \varphi_{\mu_k, \Sigma_k}(X_i)}{\sum_{k'=1}^K p_{k'} \varphi_{\mu_{k'}, \Sigma_{k'}}(X_i)}.$$
We call $\pi_{i,k}(\theta)$ the "soft-assignment" of $i$ in class $k$.
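In code, the (E)-step is one vectorized normalization; a sketch (illustrative names):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, p, mus, Sigmas):
    """Soft-assignments pi_{i,k}, returned as an (n, K) array."""
    # unnormalized responsibilities p_k * phi_{mu_k, Sigma_k}(x_i)
    pi = np.column_stack([
        p_k * multivariate_normal.pdf(X, mean=mu_k, cov=S_k)
        for p_k, mu_k, S_k in zip(p, mus, Sigmas)
    ])
    return pi / pi.sum(axis=1, keepdims=True)  # rows sum to one
```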
Figure: Soft-assignments $x \mapsto \frac{p_k \varphi_{\mu_k, \Sigma_k}(x)}{\sum_{k'=1}^K p_{k'} \varphi_{\mu_{k'}, \Sigma_{k'}}(x)}$.
(M)-Step: compute
$$\theta^{(t+1)} = (p_1^{(t+1)}, \ldots, p_K^{(t+1)}, \mu_1^{(t+1)}, \ldots, \mu_K^{(t+1)}, \Sigma_1^{(t+1)}, \ldots, \Sigma_K^{(t+1)})$$
using:
$$p_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^n \pi_{i,k}(\theta^{(t)}), \qquad \mu_k^{(t+1)} = \frac{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)}) X_i}{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)})}$$
$$\Sigma_k^{(t+1)} = \frac{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)}) (X_i - \mu_k^{(t+1)})(X_i - \mu_k^{(t+1)})^\top}{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)})}.$$
This is natural: estimation of the means and covariances, weighted by the soft-assignments
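The (M)-step is just weighted empirical moments; a sketch consistent with the `e_step` above:

```python
import numpy as np

def m_step(X, pi):
    """Updated (p, mus, Sigmas) from soft-assignments pi of shape (n, K)."""
    n, K = pi.shape
    p = pi.mean(axis=0)                # p_k = (1/n) sum_i pi_{i,k}
    Nk = pi.sum(axis=0)                # effective cluster sizes
    mus = (pi.T @ X) / Nk[:, None]     # soft-assignment-weighted means
    Sigmas = []
    for k in range(K):
        Xc = X - mus[k]                # center around the new mean
        Sigmas.append((pi[:, k][:, None] * Xc).T @ Xc / Nk[k])
    return p, mus, np.array(Sigmas)
```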
So, for $t = 1, \ldots$, iterate (E) and (M) until convergence:
$$\pi_{i,k}(\theta^{(t)}) = \frac{p_k^{(t)} \varphi_{\mu_k^{(t)}, \Sigma_k^{(t)}}(X_i)}{\sum_{k'=1}^K p_{k'}^{(t)} \varphi_{\mu_{k'}^{(t)}, \Sigma_{k'}^{(t)}}(X_i)}$$
$$p_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^n \pi_{i,k}(\theta^{(t)}), \qquad \mu_k^{(t+1)} = \frac{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)}) X_i}{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)})}$$
$$\Sigma_k^{(t+1)} = \frac{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)}) (X_i - \mu_k^{(t+1)})(X_i - \mu_k^{(t+1)})^\top}{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)})}.$$
We obtain an estimator $\hat\theta = (\hat p_1, \ldots, \hat p_K, \hat\mu_1, \ldots, \hat\mu_K, \hat\Sigma_1, \ldots, \hat\Sigma_K)$.
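Combining the `e_step` and `m_step` sketches above (and reusing their imports), the full EM loop is a few lines; stopping when the log-likelihood stops improving is one possible convergence criterion:

```python
def em_gmm(X, p, mus, Sigmas, n_iter=100, tol=1e-6):
    """Iterate (E) and (M), assuming the e_step/m_step helpers above."""
    prev = -np.inf
    for _ in range(n_iter):
        pi = e_step(X, p, mus, Sigmas)       # (E)-step
        p, mus, Sigmas = m_step(X, pi)       # (M)-step
        ll = np.sum(np.log(sum(              # current log-likelihood
            p_k * multivariate_normal.pdf(X, mean=mu_k, cov=S_k)
            for p_k, mu_k, S_k in zip(p, mus, Sigmas))))
        if ll - prev < tol:                  # EM never decreases ll
            break
        prev = ll
    return p, mus, Sigmas
```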
Clustering: the maximum a posteriori (MAP) rule
Given $\hat\theta$, how to assign a cluster number $k$ to a given $x \in \mathbb{R}^d$?
Compute the soft-assignments
$$\pi_k(x) = \frac{\hat p_k \varphi_{\hat\mu_k, \hat\Sigma_k}(x)}{\sum_{k'=1}^K \hat p_{k'} \varphi_{\hat\mu_{k'}, \hat\Sigma_{k'}}(x)}$$
Consider
$$x \in \mathcal{C}_k \ \text{ if } \ \pi_k(x) > \pi_{k'}(x) \text{ for any } k' \neq k$$
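A direct sketch of the MAP rule; since the denominator of $\pi_k(x)$ does not depend on $k$, the argmax can be taken over the unnormalized scores:

```python
import numpy as np
from scipy.stats import multivariate_normal

def map_cluster(x, p, mus, Sigmas):
    """Assign x to the class with the largest soft-assignment pi_k(x)."""
    scores = [p_k * multivariate_normal.pdf(x, mean=mu_k, cov=S_k)
              for p_k, mu_k, S_k in zip(p, mus, Sigmas)]
    return int(np.argmax(scores))
```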
The MAP rule
Figure: mixture density $f_{\hat\theta}(x)$ (left) and soft-assignments $\pi_k(x)$ (right)
K-means and GMM don't work for "embedded" cluster structures
![Page 48: Introduction to machine learning - Stéphane Gaïffas · Introduction to machine learning ... Unsupervised learning Marketing: nding groups of customers with similar behavior given](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4dc1971cf24623f2b0675/html5/thumbnails/48.jpg)
Model selection
How do I choose the number of clusters $K$?
This can be done using model selection
Idea
Minimize a penalized goodness-of-fit:
$$\hat K = \operatorname*{arg\,min}_{K \geq 1} \big\{ -\ell(\hat\theta_K) + \mathrm{pen}(K) \big\}$$
where
$\hat\theta_K$ = MLE in the mixture model with $K$ clusters
$-\ell(\hat\theta_K)$: goodness-of-fit. If $K$ is too large: overfitting
$\mathrm{pen}(K)$: penalization
![Page 49: Introduction to machine learning - Stéphane Gaïffas · Introduction to machine learning ... Unsupervised learning Marketing: nding groups of customers with similar behavior given](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4dc1971cf24623f2b0675/html5/thumbnails/49.jpg)
Classical penalization functions
AIC (Akaike Information Criterion):
$$\hat K = \operatorname*{arg\,min}_{K \geq 1} \big\{ -\ell(\hat\theta_K; X) + \mathrm{df}(K) \big\}$$
BIC (Bayesian Information Criterion):
$$\hat K = \operatorname*{arg\,min}_{K \geq 1} \Big\{ -\ell(\hat\theta_K; X) + \frac{\mathrm{df}(K) \log n}{2} \Big\}$$
where
$n$: size of the data
$\mathrm{df}(K)$: degrees of freedom of the mixture model with $K$ clusters
![Page 50: Introduction to machine learning - Stéphane Gaïffas · Introduction to machine learning ... Unsupervised learning Marketing: nding groups of customers with similar behavior given](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4dc1971cf24623f2b0675/html5/thumbnails/50.jpg)
In summary:
AIC: realizes a compromise between "bias" and "variance". It tends to underpenalize
BIC: convergent (if the model is true), namely $\hat K \to K$ when $n \to +\infty$
What is $\mathrm{df}(K)$ in the Gaussian Mixtures model?
$p_1, \ldots, p_K$: $K - 1$ (they sum to $1$)
$\mu_1, \ldots, \mu_K$: $K \times d$
$\Sigma_1, \ldots, \Sigma_K$: $K \times d(d+1)/2$ (symmetric matrices)
![Page 51: Introduction to machine learning - Stéphane Gaïffas · Introduction to machine learning ... Unsupervised learning Marketing: nding groups of customers with similar behavior given](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4dc1971cf24623f2b0675/html5/thumbnails/51.jpg)
Hence
for the model with general covariances:
$$\mathrm{df}(K) = K - 1 + Kd + K\frac{d(d+1)}{2}$$
when restricting to diagonal covariance matrices:
$$\mathrm{df}(K) = K - 1 + Kd + Kd.$$
Many other possibilities for the shapes of the covariances...
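In practice one fits the model for each candidate $K$ and keeps the best penalized score; a sketch using scikit-learn's `GaussianMixture` (which runs EM and exposes a `bic` method):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_K_bic(X, K_max=10, seed=0):
    """Fit a GMM by EM for each K and keep the K minimizing BIC."""
    bics = []
    for K in range(1, K_max + 1):
        gmm = GaussianMixture(n_components=K, covariance_type="full",
                              random_state=seed).fit(X)
        # sklearn's bic is -2 log L + df(K) log n, i.e. twice the
        # criterion above; the argmin over K is the same
        bics.append(gmm.bic(X))
    return 1 + int(np.argmin(bics)), bics
```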
![Page 52: Introduction to machine learning - Stéphane Gaïffas · Introduction to machine learning ... Unsupervised learning Marketing: nding groups of customers with similar behavior given](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4dc1971cf24623f2b0675/html5/thumbnails/52.jpg)
Thank you!