Introduction to machine learning
Master M2MO (Modélisation Aléatoire)
Stéphane Gaïffas
Unsupervised learning
Marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records
Biology: classification of plants and animals given their features
Insurance: identifying groups of motor insurance policy holders with a high average claim cost, identifying frauds
City-planning: identifying groups of houses according to their house type, value and geographical location
Internet: document classification, clustering weblog data to discover groups of similar access patterns
Marketing
Data: a base of customer data containing their properties and past buying records
Goal: use the customers' similarities to find groups
Two directions:
Clustering: propose an explicit grouping of the customers
Visualization: propose a representation of the customers so that the groups are visible
Supervised learning reminder
We have training data Dn = [(x1, y1), . . . , (xn, yn)]
(xi, yi) i.i.d. with distribution P on X × Y
Construct a predictor f : X → Y using Dn
Loss ℓ(y, f(x)) measures how well f(x) predicts y
Aim is to minimize the generalization error
\mathbb{E}_{X,Y}[\ell(Y, f(X)) \mid D_n] = \int \ell(y, f(x)) \, dP(x, y).
The goal is clear
Predict the label y based on features x
Heard on the street
Supervised learning is solved. Unsupervised learning isn’t
Unsupervised learning
We have training data Dn = [x1, . . . , xn]
Loss: ????
Aim: ????
The goal is unclear. Classical tasks
Clustering: construct groups of data in homogeneous classes
Dimension reduction: construct a map of the data in a low-dimensional space without distorting it too much
Motivations
Interpretation of the groups
Use of the groups in further processing
Clustering
Training data Dn = {x1, . . . , xn} with xi ∈ Rd
Latent groups?
Construct a map f : Rd → {1, . . . , K} which assigns a cluster number to each xi:
f : xi 7→ ki
No ground truth for ki ...
Warning
Choice of K is hard
Roughly two approaches
Partition-based
Model-based
Partition-based clustering
Need to define the quality of a cluster
No obvious measure!
Homogeneity: samples in the same group should be similar
Simple idea
Use a distance as a measure of similarity
Simplest is the Euclidean distance
K-means algorithm
Similar = small Euclidean distance
Fix K ≥ 2, n data points xi ∈ Rd
Find centroids c_1, \dots, c_K that minimize the quantification error
\sum_{i=1}^n \min_{k=1,\dots,K} \| x_i - c_k \|_2^2
Impossible to find the exact solution!
Lloyd (1981) proposes a way of finding local solutions
Choose at random K centroids {c_1, . . . , c_K}
For each k ∈ {1, . . . , K}, find the set C_k of points that are closer to c_k than to any c_{k'} for k' ≠ k
Update the centroids:
c_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i
Repeat the two previous steps until the sets Ck don’t change
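A minimal NumPy sketch of these Lloyd iterations (not from the slides; the function name and the stopping test are my own choices):

```python
import numpy as np

def lloyd_kmeans(X, K, n_iter=100, rng=None):
    """Lloyd's algorithm: alternate assignments and centroid updates."""
    rng = np.random.default_rng(rng)
    n, _ = X.shape
    # Step 0: choose K centroids at random among the data points.
    centroids = X[rng.choice(n, size=K, replace=False)].copy()
    labels = None
    for _ in range(n_iter):
        # Assignment step: C_k = points closer to c_k than to any other centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # the sets C_k did not change
        labels = new_labels
        # Update step: c_k = mean of the points in C_k.
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels
```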
K-means
K-means is very sensitive to the choice of initial points
An easy solution:
Pick a first point at random
Choose the next initial point as the one farthest from the previous ones (see the sketch below)
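A small sketch of this farthest-point initialization (the function name `farthest_point_init` is hypothetical, not from the slides); the returned centroids can then seed the Lloyd iterations above:

```python
import numpy as np

def farthest_point_init(X, K, rng=None):
    """Pick a first centroid at random, then repeatedly take the point
    farthest from the centroids already chosen."""
    rng = np.random.default_rng(rng)
    n, _ = X.shape
    centroids = [X[rng.integers(n)]]
    while len(centroids) < K:
        # Distance from each point to its closest chosen centroid.
        dists = np.min(
            [((X - c) ** 2).sum(axis=1) for c in centroids], axis=0
        )
        centroids.append(X[dists.argmax()])
    return np.array(centroids)
```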
Farthest point initialization
But, very sensitive to outliers
A solution: K-means++
Pick the initial centroids as follows:
1. Pick i uniformly at random in {1, . . . , n}, put c_1 ← x_i and set k ← 1
2. k ← k + 1
3. Sample i ∈ {1, . . . , n} with probability
\frac{\min_{k'=1,\dots,k-1} \| x_i - c_{k'} \|_2^2}{\sum_{i'=1}^n \min_{k'=1,\dots,k-1} \| x_{i'} - c_{k'} \|_2^2}
4. Put c_k ← x_i
5. If k < K, go back to step 2.
Then run K-means starting from these initial centroids
This is between random initialization and farthest-point initialization
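A sketch of the K-means++ seeding described above (function name mine); the resulting centroids replace the random initialization in the Lloyd iterations:

```python
import numpy as np

def kmeanspp_init(X, K, rng=None):
    """K-means++ seeding: each new centroid is sampled with probability
    proportional to its squared distance to the closest centroid so far."""
    rng = np.random.default_rng(rng)
    n, _ = X.shape
    centroids = [X[rng.integers(n)]]          # step 1: uniform pick
    while len(centroids) < K:                 # steps 2-5
        dists = np.min(
            [((X - c) ** 2).sum(axis=1) for c in centroids], axis=0
        )
        probs = dists / dists.sum()
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids)
```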
K-means++: in expectation, it leads to a solution close to the optimum. Define the quantification error
Q_n(c_1, \dots, c_K) = \sum_{i=1}^n \min_{k=1,\dots,K} \| x_i - c_k \|_2^2.
We know (Arthur and Vassilvitskii [2006]) that if c_1, . . . , c_K are the centroids obtained with K-means++, then
\mathbb{E}[Q_n(c_1, \dots, c_K)] \leq 8 (\log K + 2) \min_{c'_1, \dots, c'_K} Q_n(c'_1, \dots, c'_K)
where \mathbb{E} is with respect to the random choice of initial centroids
Model-based clustering
use a model on data with clusters
using a mixture of distributions with different location (mean) parameters
Figure: Gaussian Mixture in dimension 1
Figure: Gaussian Mixture in dimension 2
Figure: Poisson Mixture with components 0.3·P(·|2), 0.2·P(·|5), 0.5·P(·|10), together with the resulting mixture density
Mixture of densities f1, . . . , fK
K ∈ N∗ (number of clusters)
(p_1, . . . , p_K) ∈ (R_+)^K such that \sum_{k=1}^K p_k = 1
Mixture density:
f(x) = \sum_{k=1}^K p_k f_k(x)
Gaussian Mixtures Model
The most widely used example
Put f_k = \varphi_{\mu_k, \Sigma_k} = density of N(\mu_k, \Sigma_k), where we recall
\varphi_{\mu_k, \Sigma_k}(x) = \frac{1}{(2\pi)^{d/2} \sqrt{\det(\Sigma_k)}} \exp\Big( -\frac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \Big)
with \Sigma_k \succ 0
Gaussian Mixtures Model
Statistical model with density
f_\theta(x) = \sum_{k=1}^K p_k \varphi_{\mu_k, \Sigma_k}(x),
Parameter θ = (p_1, . . . , p_K, \mu_1, . . . , \mu_K, \Sigma_1, . . . , \Sigma_K)
Goodness-of-fit is:
R_n(\theta) = -\text{log-likelihood} = -\sum_{i=1}^n \log\Big( \sum_{k=1}^K p_k \varphi_{\mu_k, \Sigma_k}(x_i) \Big)
A local minimizer \hat\theta is typically obtained using an algorithm called the Expectation-Maximization (EM) algorithm
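As an illustration (not part of the slides), the mixture density and the goodness-of-fit R_n(θ) can be evaluated with SciPy along these lines:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, p, mus, Sigmas):
    """Mixture density f_theta(x) = sum_k p_k * phi_{mu_k, Sigma_k}(x)."""
    return sum(
        p_k * multivariate_normal.pdf(x, mean=mu_k, cov=Sigma_k)
        for p_k, mu_k, Sigma_k in zip(p, mus, Sigmas)
    )

def neg_log_likelihood(X, p, mus, Sigmas):
    """R_n(theta) = - sum_i log f_theta(x_i)."""
    return -sum(np.log(gmm_density(x, p, mus, Sigmas)) for x in X)
```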
EM Algorithm
Idea:
there is a hidden structure in the model
knowing this structure, the optimization problem is easier
Indeed, each point X_i belongs to an unknown class k ∈ {1, . . . , K}
Put C_{i,k} = 1 when i belongs to class k, C_{i,k} = 0 otherwise.
We don’t observe {Ci ,k}1≤i≤n,1≤k≤K
We say that these are latent variables
Put C_k = {i : C_{i,k} = 1}, then C_1, . . . , C_K is a partition of {1, . . . , n}.
Generative model:
i belongs to class Ck with probability pk , namely
Ci = (Ci ,1, . . . ,Ci ,K ) ∼M(1, p1, . . . , pK )
[multinomial distribution with parameter (1, p1, . . . , pK )].
X_i ∼ \varphi_{\mu_k, \Sigma_k} if C_{i,k} = 1
The joint distribution of (X, C) is, according to this model,
f_\theta(x, c) = \prod_{k=1}^K \big( p_k \varphi_{\mu_k, \Sigma_k}(x) \big)^{c_k}
[c_k = 1 for only one k and 0 elsewhere] and the marginal density in X is the one of the mixture:
f_\theta(x) = \sum_{k=1}^K p_k \varphi_{\mu_k, \Sigma_k}(x)
This generative model adds a latent variable C, in such a way that the marginal distribution of (X, C) in X is indeed the one of the mixture f_\theta
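A small simulation of this generative model, assuming NumPy and toy parameters of my own choosing: draw C_i from the multinomial, then X_i from the corresponding Gaussian.

```python
import numpy as np

def sample_gmm(n, p, mus, Sigmas, rng=None):
    """Draw (X_i, C_i): C_i ~ M(1, p), then X_i ~ N(mu_k, Sigma_k) given C_{i,k} = 1."""
    rng = np.random.default_rng(rng)
    K = len(p)
    C = rng.choice(K, size=n, p=p)  # latent class labels
    X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in C])
    return X, C

# Example: a two-component mixture in dimension 2 (toy parameters).
X, C = sample_gmm(
    500, p=[0.3, 0.7],
    mus=[np.zeros(2), np.array([3.0, 3.0])],
    Sigmas=[np.eye(2), 0.5 * np.eye(2)],
)
```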
Put X = (X1, . . . ,Xn) and C = (C1, . . . ,Cn)
Do as if we observed the latent variables C
Write a completed likelihood for these "virtual" observations (joint distribution of (X, C)):
L_c(\theta; X, C) = \prod_{i=1}^n \prod_{k=1}^K \big( p_k \varphi_{\mu_k, \Sigma_k}(X_i) \big)^{C_{i,k}}
and the completed log-likelihood:
\ell_c(\theta; X, C) = \sum_{i=1}^n \sum_{k=1}^K C_{i,k} \big( \log p_k + \log \varphi_{\mu_k, \Sigma_k}(X_i) \big).
EM-Algorithm [Dempster et al. (77)]
(E=Expectation, M=Maximization)
Initialize θ(0)
for t = 0, . . . , until convergence, repeat:
1. (E)-step: compute
\theta \mapsto Q(\theta, \theta^{(t)}) = \mathbb{E}_{\theta^{(t)}}\big[ \ell_c(\theta; X, C) \mid X \big]
[Expectation with respect to the latent variables, for the previous value of θ]
2. (M)-step: compute
\theta^{(t+1)} \in \arg\max_{\theta \in \Theta} Q(\theta, \theta^{(t)})
[Maximize this expectation]
(E) and (M) steps often have explicit solutions
Theorem. The sequence \theta^{(t)} obtained using the EM Algorithm satisfies
\ell(\theta^{(t+1)}; X) \geq \ell(\theta^{(t)}; X)
for any t.
At each step, EM increases the likelihood
Initialization will be very important (usually done using K-Means or K-Means++)
Proof. Using the Bayes formula, the density of C | X is f_\theta(c \mid x) = \frac{f_\theta(x, c)}{f_\theta(x)}, so
\ell(\theta; X) = \ell_c(\theta; X, C) - \ell(\theta; C \mid X),
where we put \ell(\theta; c \mid x) = \log f_\theta(c \mid x). Take the expectation w.r.t. P_{\theta^{(t)}}(\cdot \mid X), and obtain
\ell(\theta; X) = \mathbb{E}_{\theta^{(t)}}\big[ \ell_c(\theta; X, C) \mid X \big] - \mathbb{E}_{\theta^{(t)}}\big[ \ell(\theta; C \mid X) \mid X \big] = Q(\theta, \theta^{(t)}) - Q_1(\theta, \theta^{(t)}).
Hence
\ell(\theta^{(t+1)}; X) - \ell(\theta^{(t)}; X) = Q(\theta^{(t+1)}, \theta^{(t)}) - Q(\theta^{(t)}, \theta^{(t)}) - \big( Q_1(\theta^{(t+1)}, \theta^{(t)}) - Q_1(\theta^{(t)}, \theta^{(t)}) \big).
We have Q(\theta^{(t+1)}, \theta^{(t)}) \geq Q(\theta^{(t)}, \theta^{(t)}) using the definition of \theta^{(t+1)}
It remains to check that Q_1(\theta^{(t+1)}, \theta^{(t)}) - Q_1(\theta^{(t)}, \theta^{(t)}) \leq 0:
Q_1(\theta^{(t+1)}, \theta^{(t)}) - Q_1(\theta^{(t)}, \theta^{(t)}) = \mathbb{E}_{\theta^{(t)}}\big[ \ell(\theta^{(t+1)}; C \mid X) - \ell(\theta^{(t)}; C \mid X) \mid X \big]
= \int \log\Big( \frac{f_{\theta^{(t+1)}}(c \mid X)}{f_{\theta^{(t)}}(c \mid X)} \Big) f_{\theta^{(t)}}(c \mid X) \, \mu(dc)
\leq \log \int f_{\theta^{(t+1)}}(c \mid X) \, \mu(dc) = 0,
by Jensen's inequality (the log is concave). This proves \ell(\theta^{(t+1)}; X) \geq \ell(\theta^{(t)}; X) for any t
So what is the EM Algorithm, and where do we use this?
It is an algorithm that makes it possible to optimize a likelihood with missing or latent data
For a mixture distribution, we come up with natural latent variables that simplify the original optimization problem
Consider the completed likelihood
\ell_c(\theta; X, C) = \sum_{i=1}^n \sum_{k=1}^K C_{i,k} \big( \log p_k + \log \varphi_{\mu_k, \Sigma_k}(X_i) \big),
where θ = (p_1, . . . , p_K, \mu_1, . . . , \mu_K, \Sigma_1, . . . , \Sigma_K).
What are the (E) and (M) steps in this case?
(E)-Step
Compute
\mathbb{E}_{\theta^{(t)}}\big[ \ell_c(\theta; X, C) \mid X \big] = \sum_{i=1}^n \sum_{k=1}^K \mathbb{E}_{\theta^{(t)}}[C_{i,k} \mid X] \big( \log p_k + \log \varphi_{\mu_k, \Sigma_k}(X_i) \big),
which is simply:
\mathbb{E}_\theta[C_{i,k} \mid X] = P_\theta(C_{i,k} = 1 \mid X_i) =: \pi_{i,k}(\theta),
where
\pi_{i,k}(\theta) = \frac{p_k \varphi_{\mu_k, \Sigma_k}(X_i)}{\sum_{k'=1}^K p_{k'} \varphi_{\mu_{k'}, \Sigma_{k'}}(X_i)}.
We call πi ,k(θ) the “soft-assignment” of i in class k .
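A sketch of this (E)-step, computing the soft-assignments π_{i,k}(θ) with SciPy (the function name `e_step` is mine):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, p, mus, Sigmas):
    """Soft-assignments pi_{i,k} = p_k phi_{mu_k,Sigma_k}(x_i) / sum_{k'} p_{k'} phi_{mu_{k'},Sigma_{k'}}(x_i)."""
    n, K = X.shape[0], len(p)
    pi = np.zeros((n, K))
    for k in range(K):
        pi[:, k] = p[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
    pi /= pi.sum(axis=1, keepdims=True)  # normalize each row over k
    return pi
```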
Figure: Soft-assignments x \mapsto \frac{p_k \varphi_{\mu_k, \Sigma_k}(x)}{\sum_{k'=1}^K p_{k'} \varphi_{\mu_{k'}, \Sigma_{k'}}(x)}.
(M)-Step: compute
\theta^{(t+1)} = (p_1^{(t+1)}, \dots, p_K^{(t+1)}, \mu_1^{(t+1)}, \dots, \mu_K^{(t+1)}, \Sigma_1^{(t+1)}, \dots, \Sigma_K^{(t+1)})
using:
p_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^n \pi_{i,k}(\theta^{(t)}),
\mu_k^{(t+1)} = \frac{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)}) X_i}{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)})},
\Sigma_k^{(t+1)} = \frac{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)}) (X_i - \mu_k^{(t+1)})(X_i - \mu_k^{(t+1)})^\top}{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)})}.
This is natural: estimation for the means and covariances, weighted by the soft-assignments
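A matching sketch of the (M)-step, re-estimating (p_k, μ_k, Σ_k) from the soft-assignments produced by the `e_step` sketch above:

```python
import numpy as np

def m_step(X, pi):
    """Weighted estimates of (p_k, mu_k, Sigma_k) from soft-assignments pi (shape (n, K))."""
    n, d = X.shape
    Nk = pi.sum(axis=0)                 # effective size of each cluster
    p = Nk / n
    mus = (pi.T @ X) / Nk[:, None]      # weighted means
    Sigmas = []
    for k in range(pi.shape[1]):
        Xc = X - mus[k]
        Sigmas.append((pi[:, k, None] * Xc).T @ Xc / Nk[k])  # weighted covariances
    return p, mus, np.array(Sigmas)
```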
So, for t = 1, . . ., iterate (E) and (M) until convergence:
\pi_{i,k}(\theta^{(t)}) = \frac{p_k^{(t)} \varphi_{\mu_k^{(t)}, \Sigma_k^{(t)}}(X_i)}{\sum_{k'=1}^K p_{k'}^{(t)} \varphi_{\mu_{k'}^{(t)}, \Sigma_{k'}^{(t)}}(X_i)}
p_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^n \pi_{i,k}(\theta^{(t)})
\mu_k^{(t+1)} = \frac{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)}) X_i}{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)})}
\Sigma_k^{(t+1)} = \frac{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)}) (X_i - \mu_k^{(t+1)})(X_i - \mu_k^{(t+1)})^\top}{\sum_{i=1}^n \pi_{i,k}(\theta^{(t)})}.
We obtain an estimator \hat\theta = (\hat p_1, \dots, \hat p_K, \hat\mu_1, \dots, \hat\mu_K, \hat\Sigma_1, \dots, \hat\Sigma_K).
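Putting the two sketches together gives the EM loop below; the simple initialization (uniform weights, random means, empirical covariance, assuming d ≥ 2) is my own choice, and a K-means or K-means++ initialization would be preferable, as noted above:

```python
import numpy as np

def em_gmm(X, K, n_iter=100, rng=None):
    """Run EM for a Gaussian mixture with K components, reusing e_step and m_step above."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # Crude initialization: uniform weights, random data points as means,
    # the empirical covariance for every component.
    p = np.full(K, 1.0 / K)
    mus = X[rng.choice(n, size=K, replace=False)]
    Sigmas = np.array([np.cov(X, rowvar=False) for _ in range(K)])
    for _ in range(n_iter):
        pi = e_step(X, p, mus, Sigmas)   # (E): soft-assignments
        p, mus, Sigmas = m_step(X, pi)   # (M): weighted re-estimation
    return p, mus, Sigmas, pi
```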
Clustering: the maximum a posteriori (MAP) rule
Given \hat\theta, how to assign a cluster number k to a given x ∈ R^d?
Compute the soft-assignments
\pi_k(x) = \frac{\hat p_k \varphi_{\hat\mu_k, \hat\Sigma_k}(x)}{\sum_{k'=1}^K \hat p_{k'} \varphi_{\hat\mu_{k'}, \hat\Sigma_{k'}}(x)}
Assign x to cluster k if \pi_k(x) > \pi_{k'}(x) for any k' ≠ k
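In code, the MAP rule is just an argmax over the soft-assignments, reusing the `e_step` sketch and the fitted parameters from above:

```python
# MAP rule: assign each x_i to the class with the largest soft-assignment
# pi_{i,k}(theta_hat).
labels = e_step(X, p, mus, Sigmas).argmax(axis=1)
```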
The MAP rule
Figure: mixture density f_\theta(x) (left) and soft-assignments \pi_k(x) (right)
Gaussian Mixtures Model
K-Means and GMM don’t work for “embedded” cluster structures
Model-selection
How do I choose the number of clusters K?
This can be done using model-selection
Idea
Minimize a penalization of the goodness-of-fit
\hat K = \arg\min_{K \geq 1} \big\{ -\ell(\hat\theta_K) + \mathrm{pen}(K) \big\}
where
\hat\theta_K = MLE in the mixture model with K clusters
-\ell(\hat\theta_K): goodness-of-fit. If K is too large: overfitting
pen(K ) : penalization
Classical penalization functions
AIC (Akaike Information Criterion)
\hat K = \arg\min_{K \geq 1} \big\{ -\ell(\hat\theta_K; X) + \mathrm{df}(K) \big\}
BIC (Bayesian Information Criterion)
\hat K = \arg\min_{K \geq 1} \big\{ -\ell(\hat\theta_K; X) + \frac{\mathrm{df}(K) \log n}{2} \big\}
where
n: size of data
df(K): degrees of freedom of the mixture model with K clusters
In summary:
AIC: realizes a compromise between "bias" and "variance". It tends to underpenalize
BIC: convergent (if the model is true), namely \hat K → K when n → +∞
What is df(K ) in the Gaussian Mixtures model?
p1, . . . , pK : K − 1 (sum = 1)
µ1, . . . , µK : K × d
Σ1, . . . , ΣK : K × d(d + 1)/2 (symmetric matrices)
Hence
for the model with general covariances:
df(K ) = K − 1 + Kd + Kd(d + 1)/2
when restricting to diagonal matrices:
df(K ) = K − 1 + Kd + Kd .
Many other possibilities for the shapes of the covariances...
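One practical way to carry out this selection (not covered in the slides) is scikit-learn's GaussianMixture, which exposes the BIC directly; a sketch:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_K_bic(X, K_max=10):
    """Fit mixtures with K = 1..K_max and keep the one minimizing the BIC."""
    bics = []
    for K in range(1, K_max + 1):
        gmm = GaussianMixture(n_components=K, covariance_type="full").fit(X)
        # sklearn's BIC is -2 log-likelihood + df(K) log n, i.e. twice the
        # criterion on the slide, with df(K) = K - 1 + Kd + Kd(d+1)/2 for
        # general covariances; the argmin over K is the same.
        bics.append(gmm.bic(X))
    return int(np.argmin(bics)) + 1, bics
```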
Thank you!