TRANSCRIPT
Introduction Methods & questions Model-based clustering Illustrations Challenges
Clustering: evolution of methods to meet new challenges
C. Biernacki, Journée “Clustering”, Orange Labs, October 20th, 2015
Take home message
cluster
clustering
define both!
Outline
1 Introduction
2 Methods & questions
3 Model-based clustering
4 Illustrations
5 Challenges
A first systematic attempt
Carl von Linné (1707–1778), Swedish botanist, physician, and zoologist
Father of modern taxonomy, based on the most visible similarities between species
Linnaeus’s Systema Naturae (1st ed. in 1735) lists about 10,000 species of organisms (6,000 plants, 4,236 animals)
Interdisciplinary endeavor
Medicine1: diseases may be classified by etiology (cause), pathogenesis (mechanism by which the disease is caused), or by symptom(s). Alternatively, diseases may be classified according to the organ system involved, though this is often complicated since many diseases affect more than one organ.
And so on. . .
1 Nosologie méthodique, dans laquelle les maladies sont rangées par classes, suivant le système de Sydenham, & l’ordre des botanistes, par François Boissier de Sauvages de Lacroix. Paris, Hérissant le fils, 1771
Three main clustering structures
Data set of n individuals x = (x1, . . . , xn), xi described by d variables
Partition into K clusters, denoted by z = (z1, . . . , zn), with zi ∈ {1, . . . , K}
Hierarchy: nested partitions
Block partition: simultaneous partitions of the individuals and of the columns
Clustering is the cluster building process
According to JSTOR, data clustering first appeared in the title of a 1954 article dealing with anthropological data
Clustering needs to be automatic (algorithmic) for complex data: mixed features, large data sets, high-dimensional data. . .
A 1st aim: explanatory task
A clustering for a marketing study
Data: d = 13 demographic attributes (nominal and ordinal variables) of n = 6 876 shopping mall customers in the San Francisco Bay area: SEX (1. Male, 2. Female), MARITAL STATUS (1. Married, 2. Living together, not married, 3. Divorced or separated, 4. Widowed, 5. Single, never married), AGE (1. 14 thru 17, 2. 18 thru 24, 3. 25 thru 34, 4. 35 thru 44, 5. 45 thru 54, 6. 55 thru 64, 7. 65 and Over), etc.
Partition: less than $19,999 (group of “low income”), between $20,000 and $39,999 (group of “average income”), more than $40,000 (group of “high income”)
[Figure: two scatter plots on the 1st and 2nd MCA axes, before and after coloring the customers by group (Low income, Average income, High income)]
A 2nd aim: preprocessing step
Logit model:
Not very flexible, since the borderline is linear
Unbiased ML estimate, but the asymptotic variance ∼ (x′Wx)−1 is influenced by correlations

A clustering may improve logistic regression prediction:
More flexible borderline: piecewise linear
Decreased correlation, hence decreased variance
Mixed features
Large data sets2
2 S. Alelyani, J. Tang and H. Liu (2013). Feature Selection for Clustering: A Review. Data Clustering: Algorithms and Applications, 29
High-dimensional data3
3 S. Alelyani, J. Tang and H. Liu (2013). Feature Selection for Clustering: A Review. Data Clustering: Algorithms and Applications, 29
Genesis of “Big Data”
The Big Data phenomenon mainly originates in the increase of computer and digitalresources at an ever lower cost
Storage cost per MB: $700 in 1981, $1 in 1994, $0.01 in 2013 → price divided by 70,000 in thirty years
Storage capacity of HDDs: ≈1.02 GB in 1982, ≈8 TB today → capacity multiplied by ≈8,000 over the same period
Computer processing speed: 1 gigaFLOPS4 in 1985, 33 petaFLOPS in 2013 → speed multiplied by 33 million
4 FLOPS = FLoating-point Operations Per Second
Digital flow
Digital in 1986: 1% of the stored information, 0.02 EB5
Digital in 2007: 94% of the stored information, 280 EB (multiplied by 14,000)
5 Exabyte
Societal phenomenon
All human activities are impacted by data accumulation
Trade and business: corporate reporting systems, banks, commercial transactions, reservation systems. . .
Governments and organizations: laws, regulations, standardizations, infrastructure. . .
Entertainment: music, video, games, social networks. . .
Sciences: astronomy, physics and energy, genome. . .
Health: medical record databases in the social security system. . .
Environment: climate, sustainable development, pollution, power. . .
Humanities and Social Sciences: digitization of knowledge, literature, history, art, architecture, archaeological data. . .
New data. . . but classical answers6
6 Rexer Analytics’ Annual Data Miner Survey is the largest survey of data mining, data science, and analytics professionals in the industry (survey of 2011)
Outline
1 Introduction
2 Methods & questions
3 Model-based clustering
4 Illustrations
5 Challenges
Clustering of clustering algorithms7
Jain et al. (2004) hierarchically clustered 35 different clustering algorithms into 5 groups, based on the partitions they produce on 12 different datasets.
It is not surprising to see that related algorithms are clustered together.
To visualize the similarity between the algorithms, the 35 algorithms are also embedded in a two-dimensional space obtained from the 35×35 similarity matrix.
7 A.K. Jain (2008). Data Clustering: 50 Years Beyond K-Means.
Popularity of K-means and hierarchical clustering
Even though K-means was first proposed over 50 years ago, it is still one of the most widely used clustering algorithms, for several reasons: ease of implementation, simplicity, efficiency, empirical success. . . and its model-based interpretation (see later)
Within-cluster inertia criterion
Select the partition z minimizing the criterion

WM(z) = ∑i=1..n ∑k=1..K zik ‖xi − x̄k‖²M

where
‖ · ‖M is the Euclidean distance with metric M in Rd
x̄k is the mean of the kth cluster: x̄k = (1/nk) ∑i=1..n zik xi
nk = ∑i=1..n zik is the number of individuals in cluster k
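A minimal numpy sketch of the criterion above (the data and partition are hypothetical toy values; M is taken to be the identity metric):

```python
import numpy as np

def within_cluster_inertia(x, z, K):
    """W(z) = sum_i sum_k z_ik ||x_i - xbar_k||^2, with the identity metric M = I."""
    W = 0.0
    for k in range(K):
        members = x[z == k]            # individuals assigned to cluster k
        if len(members) == 0:
            continue
        center = members.mean(axis=0)  # xbar_k, the mean of cluster k
        W += ((members - center) ** 2).sum()
    return W

# Toy example: n = 4 individuals in R^2, K = 2 clusters
x = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
z = np.array([0, 0, 1, 1])
print(within_cluster_inertia(x, z, 2))  # two tight clusters -> small inertia
```

A general metric M would replace the squared Euclidean norm by (xi − x̄k)′ M (xi − x̄k).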
Ward hierarchical clustering
[Figure: agglomerative hierarchical clustering (Ward’s method) on five points a, b, c, d, e: starting from singletons, merges at d({b},{c}) = 0.5, d({d},{e}) = 2, d({a},{b,c}) = 3, d({a,b,c},{d,e}) = 10; the dendrogram records these merge heights]
Suboptimal optimization of WM(·)
A partition is obtained by cutting the dendrogram
A dissimilarity matrix between pairs of individuals is enough
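A sketch with SciPy (the five points are hypothetical, loosely echoing the figure): Ward linkage builds the nested partitions, and a partition is obtained by cutting the dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five individuals in R^2 (hypothetical, roughly playing the roles of a, b, c, d, e)
x = np.array([[0.0, 0.0], [2.0, 0.1], [2.0, -0.1], [8.0, 0.0], [9.0, 0.0]])

Z = linkage(x, method="ward")               # suboptimal optimization of W(.)
z = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into K = 2 clusters
print(z)
```

Note that `linkage` also accepts a condensed dissimilarity matrix instead of raw coordinates, matching the remark above.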
K -means algorithm
[Figure: the moving-centers (K-means) algorithm on five points a, b, c, d, e: two individuals drawn at random as initial centers, then alternating assignment to the nearest center and recomputation of the centers]
Alternating optimization between the partition and the center of clusters
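The alternation above can be sketched in a few lines of numpy (data, function name, and initialization details are illustrative, not from the talk):

```python
import numpy as np

def kmeans(x, K, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # initialization: K individuals drawn at random as centers
    centers = x[rng.choice(len(x), size=K, replace=False)]
    for _ in range(n_iter):
        # assignment step: each individual goes to its nearest center
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        z = d.argmin(axis=1)
        # center step: recompute each center as the mean of its cluster
        for k in range(K):
            if (z == k).any():
                centers[k] = x[z == k].mean(axis=0)
    return z, centers

x = np.array([[0.0, 0.0], [0.5, 0.0], [10.0, 0.0], [10.5, 0.0]])
z, centers = kmeans(x, K=2)
print(z)  # the two left points and the two right points form the clusters
```

Each step can only decrease the within-cluster inertia, so the alternation converges (to a local optimum).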
The identity metric M: a classical but hazardous choice
[Figure: K-means with the identity metric on two spherical classes (recovered correctly) and on two elongated classes (recovered incorrectly)]
Alternative: estimate M(k) by minimizing WM(k)(z) over (z,M(k))
Effect of the metric M through a real example
Animals represented by 13 Boolean features related to appearance and activity
Large weight on the appearance features compared to the activity features: theanimals were clustered into mammals vs. birds
Large weight on the activity features: partitioning predators vs. non-predators
Both partitions are equally valid, and uncover meaningful structures in the data
The user has to carefully choose his representation to obtain a desired clustering
Semi-supervised clustering8
The user can provide any external information he has on the partition
Pairwise constraints:
A must-link constraint specifies that the two points it connects belong to the same cluster
A cannot-link constraint specifies that the two points it connects do not belong to the same cluster
Attempts to bring constraints from domain ontologies and other external sources into clustering algorithms include the usage of the WordNet ontology, gene ontology, Wikipedia, etc. to guide clustering solutions
8 O. Chapelle et al. (2006); A.K. Jain (2008). Data Clustering: 50 Years Beyond K-Means.
Online clustering
Dynamic data are quite recent: blogs, Web pages, retail chains, credit card transaction streams, network packets received by a router, stock market data, etc.
As the data get modified, the clustering must be updated accordingly: ability to detect emerging clusters, etc.
Often all the data cannot be stored on a disk
This imposes additional requirements on traditional clustering algorithms: rapidly process and summarize the massive amount of continuously arriving data
Data stream clustering is a significant challenge since it is expected to involve single-pass algorithms
Data representation challenge9
The data unit can be crucial for the data clustering task
9 A.K. Jain (2008). Data Clustering: 50 Years Beyond K-Means.
HD clustering: blessing (1/2)
A two-component d-variate Gaussian mixture:
π1 = π2 = 1/2,  X1|z11=1 ∼ Nd(0, I),  X1|z12=1 ∼ Nd(1, I)
Each variable provides its own, equal amount of separation information
Theoretical error decreases when d grows: errtheo = Φ(−√d/2). . .
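The claimed rate errtheo = Φ(−√d/2) can be checked numerically (a sketch using SciPy’s normal CDF for Φ):

```python
from math import sqrt
from scipy.stats import norm

# Bayes error of the equal-proportion mixture N_d(0, I) vs N_d(1, I):
# err_theo = Phi(-sqrt(d)/2), which shrinks as d grows
for d in (1, 2, 10, 100):
    print(d, norm.cdf(-sqrt(d) / 2))
```

For d = 1 this gives Φ(−1/2) ≈ 0.31, and the error decays quickly toward 0 as d increases.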
[Figure: scatter plot of the two components in the (x1, x2) plane, and error rate vs. d (Empirical, Theoretical)]
. . . and the empirical error rate also decreases with d!
HD clustering: blessing (2/2)
[Figure: FDA projections (1st axis FDA vs. 2nd axis FDA) for d = 2, 20, 200, 400: the two clusters separate better and better as d grows]
HD clustering: curse (1/2)
Many variables provide no separation information
Same parameter setting except:
X1|z12 = 1 ∼ Nd ((1 0 . . . 0)′, I)
The groups do not get more separated as d grows: ‖µ2 − µ1‖I = 1. . .
[Figure: scatter plot in the (x1, x2) plane, and error rate vs. d (Empirical, Theoretical)]
. . . thus the theoretical error is constant (= Φ(−1/2)) and the empirical error increases with d
HD clustering: curse (2/2)
Many variables provide redundant separation information
Same parameter setting except:
Xj1 = X11 + N1(0, 1) (j = 2, . . . , d), i.e. variables 2, . . . , d replicate the first variable up to noise

The groups do not get more separated as d grows: ‖µ2 − µ1‖Σ = 1. . .
[Figure: scatter plot in the (x1, x2) plane, and error rate vs. d (Empirical, Theoretical)]
. . . thus errtheo is constant (= Φ(−1/2)) and the empirical error increases (more slowly) with d
Clustering: an ill-posed problem
Questions to be addressed:
What is the best metric M(k)?
How to choose the number K of clusters? WM(z) decreases with K . . .
Are clusters of different sizes well estimated?
How to choose the data unit?
How to select features in a high-dimensional context?
How to deal with mixed data?
. . .
First, answer this: what, formally, is a cluster?
Model-based clustering solution
a cluster ⇐⇒ a distribution
It recasts all previous problems into model design/estimation/selection
Outline
1 Introduction
2 Methods & questions
3 Model-based clustering
4 Illustrations
5 Challenges
Model-based clustering
x = (x1, . . . , xn)  −clustering→  z = (z1, . . . , zn), K clusters
[Figure: the same point cloud in the (X1, X2) plane, before and after coloring by cluster]
Mixture model: well-posed problem
p(x|K; θ) = ∑k=1..K πk p(x; αk) can be used for
x → θ̂ → p(z|x, K; θ̂) → ẑ
x → p(K|x) → K̂
. . .
with θ = (πk, (αk))
Hypothesis of mixture of parametric distributions
cluster k is modelled by a parametric distribution:
Xi|Zik=1 ∼ p(·; αk), i.i.d.
cluster k has probability πk, with ∑k=1..K πk = 1:
Zi ∼ MultK(1, π1, . . . , πK), i.i.d.
The whole mixture parameter θ = (π1, . . . , πK, α1, . . . , αK):

p(xi; θ) = ∑k=1..K πk p(xi; αk)
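Evaluating such a density at a point is direct; a sketch for a univariate two-component Gaussian mixture with hypothetical parameters θ:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma2):
    """Univariate Gaussian density p(x; mu, sigma^2)."""
    return exp(-(x - mu) ** 2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)

def mixture_pdf(x, proportions, means, variances):
    # p(x; theta) = sum_k pi_k p(x; alpha_k)
    return sum(p * normal_pdf(x, m, v)
               for p, m, v in zip(proportions, means, variances))

# Hypothetical theta = (pi_1, pi_2, alpha_1, alpha_2)
theta = dict(proportions=[0.4, 0.6], means=[0.0, 3.0], variances=[1.0, 1.0])
print(mixture_pdf(1.0, **theta))
```

The same sum over components carries over verbatim to multivariate or non-Gaussian components p(·; αk).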
Gaussian mixture
p(·; αk) = Nd(µk, Σk), where αk = (µk, Σk), with µk the mean and Σk the covariance matrix
[Figure: a univariate two-component Gaussian mixture (component 1, component 2, mixture density f(x)) and a bivariate mixture density surface over (x1, x2)]
Parameter = summary + help to understand
cluster k is described by meaningful parameters: cluster size (πk), position (µk) and dispersion (Σk).
The clustering process in mixtures
1 Estimation of θ by θ̂
2 Estimation of the conditional probability that xi ∈ cluster k:

tik(θ̂) = p(Zik = 1|Xi = xi; θ̂) = π̂k p(xi; α̂k) / p(xi; θ̂)

3 Estimation of zi by maximum a posteriori (MAP):

ẑik = I{k = arg maxh=1,...,K tih(θ̂)}
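Steps 2 and 3 can be sketched as follows (univariate Gaussian components, with hypothetical “already estimated” parameters standing in for θ̂):

```python
import numpy as np
from scipy.stats import norm

def posteriors(x, pi, mu, sigma):
    """t_ik = pi_k p(x_i; alpha_k) / p(x_i; theta), for each individual and cluster."""
    K = len(pi)
    # component densities, shape (n, K)
    dens = np.array([[norm.pdf(xi, mu[k], sigma[k]) for k in range(K)] for xi in x])
    num = pi * dens
    return num / num.sum(axis=1, keepdims=True)

x = np.array([-1.0, 0.2, 4.8, 5.1])
pi, mu, sigma = np.array([0.5, 0.5]), np.array([0.0, 5.0]), np.array([1.0, 1.0])
t = posteriors(x, pi, mu, sigma)
z_hat = t.argmax(axis=1)  # MAP rule
print(z_hat)
```

Each row of t sums to 1; keeping t itself (instead of z_hat) gives the fuzzy-clustering reading mentioned later.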
[Figure: data → estimated mixture density surface → partition into 3 clusters]
Estimation of θ by complete-likelihood
Maximize the complete-likelihood over (θ, z)
ℓc(θ; x, z) = ∑i=1..n ∑k=1..K zik ln{πk p(xi; αk)}
Equivalent to traditional methods:

Metric           M = I    M free   Mk free
Gaussian model   [πλI]    [πλC]    [πλkCk]
Bias of θ̂: heavy if clusters are poorly separated
Associated optimization algorithm: CEM (see later)
CEM with [πλI] is strictly equivalent to K-means
CEM is simple and fast (convergence in few iterations)
Estimation of θ by observed-likelihood
Maximize the observed likelihood over θ

ℓ(θ; x) = ∑i=1..n ln p(xi; θ)

Convergence of θ̂
General algorithm for missing data: EM
EM is simple but slower than CEM
Interpretation: it is a kind of fuzzy clustering
Principle of EM and CEM
Initialization: θ0
Iteration no. q:
E step: estimate the probabilities tq = {tik(θq)}
C step (CEM only): classify by setting tq = MAP({tik(θq)})
M step: maximize θq+1 = arg maxθ ℓc(θ; x, tq)
Stopping rule: iteration number or criterion stability
Properties
⊕: simplicity, monotony, low memory requirement
⊖: local maxima (depends on θ0), linear convergence (EM)
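A minimal EM sketch for a univariate two-component Gaussian mixture with known unit variances (the data are simulated; CEM would simply insert a hard MAP classification of t between the E and M steps):

```python
import numpy as np

def em(x, mu_init, n_iter=50):
    """EM for a univariate K-component Gaussian mixture with unit variances."""
    mu = np.asarray(mu_init, dtype=float).copy()
    K = len(mu)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E step: posterior probabilities t_ik
        dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2) / np.sqrt(2 * np.pi)
        t = pi * dens
        t /= t.sum(axis=1, keepdims=True)
        # M step: maximize the expected complete log-likelihood
        pi = t.mean(axis=0)
        mu = (t * x[:, None]).sum(axis=0) / t.sum(axis=0)
    return pi, mu

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 1.0, 300)])
pi, mu = em(x, mu_init=[x.min(), x.max()])  # a simple deterministic theta^0
print(np.sort(mu))  # close to the true means 0 and 5
```

Restarting from several θ0 (as in the six trials of the figure) is the usual guard against the local maxima noted above.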
[Figure: a univariate mixture density; the likelihood surface over (µ1, µ2); six EM runs (trials 1–6) started from different θ0, ending in different local maxima]
Importance of model selection
Model = number of clusters + parametric structure of clusters
Too simple model: bias
[Figure: two classes simulated from the true model [πλkI] and fitted with the too simple model [πλI]]
Too complex model: variance
[Figure: true borderline vs. the borderlines estimated with [πλI] and with [πλkCk]]
Model selection criteria
The most widespread principle
Criterion (to be maximized) = maximum log-likelihood (model–data adequacy) − penalty (“cost” of the model)

criterion   penalty                               interpretation                        user purpose
general criteria in statistics:
AIC         ν                                     model complexity                      prediction
BIC         0.5 ν ln(n)                           model complexity                      identification
specific criterion for the clustering aim:
ICL         0.5 ν ln(n) − ∑i,k ẑik ln tik(θ̂)      model complexity + partition entropy  well-separated clusters
N.B.: in a prediction context, it is also possible to use the predictive error rate
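As a sketch with scikit-learn (simulated data), BIC can be compared across numbers of clusters; note that sklearn’s `bic()` follows the opposite sign convention (a quantity to minimize), and that ICL would add the partition entropy term on top of it:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two well-separated spherical Gaussian groups in R^2
x = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

# sklearn's bic() is -2 log L + nu log(n): smaller is better
bics = {K: GaussianMixture(n_components=K, random_state=0).fit(x).bic(x)
        for K in (1, 2, 3, 4)}
best_K = min(bics, key=bics.get)
print(best_K)
```

On data this well separated, BIC and ICL agree; ICL mainly differs when clusters overlap, since the entropy term then penalizes poorly separated partitions.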
Outline
1 Introduction
2 Methods & questions
3 Model-based clustering
4 Illustrations
5 Challenges
ICL/BIC for acoustic emission control
Data: n = 2 061 event locations in a rectangle of R2 representing the vessel
Model: Diagonal Gaussian mixture + uniform (noise)
Groups: sound locations = vessel defects
[Figure: ICL criterion and BIC criterion vs. the number of clusters, and the resulting partition of the event locations on the vessel]
ICL for prostate cancer data (1/2)
Individuals: n = 475 patients with prostatic cancer, grouped on clinical criteria into two Stages, 3 and 4, of the disease
Variables: d = 12 pre-trial variates were measured on each patient: eight continuous variables (age, weight, systolic blood pressure, diastolic blood pressure, serum haemoglobin, size of primary tumour, index of tumour stage and histolic grade, serum prostatic acid phosphatase) and four categorical variables with various numbers of levels (performance rating, cardiovascular disease history, electrocardiogram code, bone metastases)
Model: conditional independence p(x1; αk) = p(x1cont; αkcont) · p(x1cat; αkcat)
ICL for prostate cancer data (2/2)
Variables                Continuous    Categorical    Mixed
Error (%)                9.46          47.16          8.63
True \ estimated group   1     2       1     2        1     2
Stage 3                  247   26      142   131      252   21
Stage 4                  19    183     120   82       20    182
[Figure: patients on the 1st and 2nd PCA axes (continuous data) and on the 1st and 2nd MCA axes (categorical data)]
BIC for partitioning communes of Wallonia
Data: n = 262 communes of Wallonia in terms of d = 2 fractals at a local level
Model:
Data unit: a one-to-one transformation g(x) = (g(xij), i = 1, . . . , n, j = 1, . . . , d) of the initial data set, where xij is variable j of individual i. Typical standard transformations are g(xij) = xij (identity), g(xij) = exp(xij) or g(xij) = ln(xij)
Mixture: K = 6 (fixed) but all 28 Gaussian models
Result: 6 meaningful groups with g(xij) = exp(xij) (natural for fractals. . . )
Model criterion:

BICg = ℓ(θ̂g; g(x)) − (ν/2) ln n + ln |Hg|,  where Hg is the Jacobian of g
BIC for Gaussian “variable selection”10
Definition

p(x1; θ) = {∑k=1..K πk p(x1S; µk, Σk)} × p(x1U; a + x1R b, C) × p(x1W; u, V)
            (clustering variables)       (redundant variables)  (independent variables)
where
all parts are Gaussians
S: set of variables useful for clustering
U: set of redundant clustering variables, expressed with R ⊆ S
W : set of variables independent of clustering
Trick: variable selection is recast as a particular model selection problem, solved by BIC
10 Raftery and Dean (2006), Maugis et al. (09a), Maugis et al. (09b)
Curve “cookies” example (1/2)
The Kneading dataset comes from Danone Vitapole Paris Research Center and concerns the
quality of cookies and the relationship with the flour kneading process11. There are 115 different
flours for which the dough resistance is measured during the kneading process for 480 seconds.
One obtains 115 kneading curves observed at 241 equispaced instants of time in the interval [0;
480]. The 115 flours produce cookies of different quality: 50 of them have produced cookies of
good quality, 25 produced medium quality and 40 low quality.
11 Leveder et al. (2004)
Curve “cookies” example (2/2)
Using a basis functional model-based design for functional data12
12 Jacques and Preda (2013)
Co-clustering (1/2)
Contingency table: document clustering
Mixture of Medline (1033 medical summaries) and Cranfield (1398 aeronautics summaries)
Rows: 2431 documents
Columns: the words present in the documents (stop words excluded), i.e. 9275 unique words
Data matrix: document × word counts
Poisson model13
Results with 2×2 blocks:

            Medline   Cranfield
Medline     1033      0
Cranfield   0         1398
13 G. Govaert and M. Nadif (2013). Co-clustering. Wiley.
Co-clustering (2/2)
Outline
1 Introduction
2 Methods & questions
3 Model-based clustering
4 Illustrations
5 Challenges
Three kinds of challenges, linked to the user task
Model design: depends on the data, and should incorporate user information
Model estimation: find efficient algorithms meeting the user’s requirements
Model selection (validation): depends again on the user’s purpose
Model-based clustering
It is just a comfortable and rigorous framework
The user keeps his freedom in this world, because of the high flexibility at each level