Clustering: evolution of methods to meet new challenges — C. Biernacki, Journée "Clustering", Orange Labs, October 20th 2015

Page 1

Introduction Methods & questions Model-based clustering Illustrations Challenges

Clustering: evolution of methods to meet new challenges

C. Biernacki

Journée "Clustering", Orange Labs, October 20th 2015

1/54

Page 2

Take home message

cluster

clustering

define both!

Page 3

Outline

1 Introduction

2 Methods & questions

3 Model-based clustering

4 Illustrations

5 Challenges

Page 4

A first systematic attempt

Carl von Linné (1707–1778), Swedish botanist, physician, and zoologist

Father of modern taxonomy, based on the most visible similarities between species

Linnaeus's Systema Naturae (1st ed. in 1735) lists about 10,000 species of organisms (6,000 plants, 4,236 animals)

Page 5

Interdisciplinary endeavor

Medicine [1]: diseases may be classified by etiology (cause), pathogenesis (mechanism by which the disease is caused), or by symptom(s). Alternatively, diseases may be classified according to the organ system involved, though this is often complicated since many diseases affect more than one organ.

And so on. . .

[1] Nosologie méthodique, dans laquelle les maladies sont rangées par classes, suivant le système de Sydenham, & l'ordre des botanistes, par François Boissier de Sauvages de Lacroix. Paris, Hérissant le fils, 1771 ("Methodical nosology, in which diseases are arranged by classes, following the system of Sydenham and the order of the botanists").

Page 6

Three main clustering structures

Data set of n individuals x = (x1, . . . , xn), xi described by d variables

Partition: K clusters denoted by z = (z1, …, zn), with zi ∈ {1, …, K}

Hierarchy: nested partitions

Block partition: simultaneously crossing a partition of the individuals and a partition of the variables (columns)

Page 7

Clustering is the cluster building process

According to JSTOR, data clustering first appeared in the title of a 1954 article dealing with anthropological data

Need to be automatic (algorithms) for complex data: mixed features, large data sets, high-dimensional data…

Page 8

A 1st aim: explanatory task

A clustering for a marketing study

Data: d = 13 demographic attributes (nominal and ordinal variables) of n = 6,876 shopping mall customers in the San Francisco Bay area (SEX (1. Male, 2. Female), MARITAL STATUS (1. Married, 2. Living together, not married, 3. Divorced or separated, 4. Widowed, 5. Single, never married), AGE (1. 14 thru 17, 2. 18 thru 24, 3. 25 thru 34, 4. 35 thru 44, 5. 45 thru 54, 6. 55 thru 64, 7. 65 and Over), etc.)

Partition: annual income less than $19,999 (group of "low income"), between $20,000 and $39,999 (group of "average income"), more than $40,000 (group of "high income")

[Figure: customers projected on the 1st and 2nd MCA axes, colored by income group (low / average / high income)]

Page 9

A 2nd aim: preprocessing step

Logit model:

Not very flexible, since the borderline is linear

Unbiased ML estimate, but its asymptotic variance ∼ (x′Wx)⁻¹ is influenced by correlations

A clustering may improve logistic regression prediction:

More flexible borderline: piecewise linear

Decreased correlation, hence decreased variance
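As a rough sketch of this preprocessing idea (the dataset, the number of clusters and all names below are illustrative choices, not from the talk), one can partition the inputs with K-means and fit one logistic regression per cluster, which yields a piecewise-linear borderline:

```python
# Hypothetical sketch: K-means partition + one logistic regression per cluster,
# giving a piecewise-linear decision borderline on a non-linear toy problem.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 2))
y = (X[:, 0] ** 2 + X[:, 1] > 1).astype(int)   # truly non-linear borderline

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

def fit_local(Xk, yk):
    # A cluster may contain a single class; fall back to a constant predictor.
    if len(np.unique(yk)) < 2:
        return DummyClassifier(strategy="most_frequent").fit(Xk, yk)
    return LogisticRegression().fit(Xk, yk)

models = [fit_local(X[km.labels_ == k], y[km.labels_ == k]) for k in range(4)]

def predict(Xnew):
    ks = km.predict(Xnew)
    return np.array([models[k].predict(x[None, :])[0] for k, x in zip(ks, Xnew)])

acc = (predict(X) == y).mean()
```

More clusters make the borderline more flexible, at the cost of fewer points per local model.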

Page 10

Mixed features

Page 11

Large data sets [2]

[2] S. Alelyani, J. Tang and H. Liu (2013). Feature Selection for Clustering: A Review. Data Clustering: Algorithms and Applications, 29.

Page 12

High-dimensional data [3]

[3] S. Alelyani, J. Tang and H. Liu (2013). Feature Selection for Clustering: A Review. Data Clustering: Algorithms and Applications, 29.

Page 13

Genesis of “Big Data”

The Big Data phenomenon mainly originates in the increase of computer and digitalresources at an ever lower cost

Storage cost per MB: $700 in 1981, $1 in 1994, $0.01 in 2013 → price divided by 70,000 in thirty years

Storage capacity of HDDs: ≈ 1.02 GB in 1982, ≈ 8 TB today → capacity multiplied by 8,000 over the same period

Computer processing speed: 1 gigaFLOPS [4] in 1985, 33 petaFLOPS in 2013 → speed multiplied by 33 million

[4] FLOPS = FLoating-point Operations Per Second.

Page 14

Digital flow

Digital in 1986: 1% of the stored information, 0.02 EB [5]

Digital in 2007: 94% of the stored information, 280 EB (multiplied by 14,000)

[5] EB = exabyte.

Page 15

Societal phenomenon

All human activities are impacted by data accumulation

Trade and business: corporate reporting systems, banks, commercial transactions, reservation systems…

Governments and organizations: laws, regulations, standardizations, infrastructure…

Entertainment: music, video, games, social networks…

Sciences: astronomy, physics and energy, genome…

Health: medical record databases in the social security system…

Environment: climate, sustainable development, pollution, power…

Humanities and Social Sciences: digitization of knowledge, literature, history, art, architecture, archaeological data…

Page 16

New data… but classical answers [6]

[6] Rexer Analytics's Annual Data Miner Survey is the largest survey of data mining, data science, and analytics professionals in the industry (2011 survey).

Page 17

Outline

1 Introduction

2 Methods & questions

3 Model-based clustering

4 Illustrations

5 Challenges

Page 18

Clustering of clustering algorithms [7]

Jain et al. (2004) hierarchically clustered 35 different clustering algorithms into 5 groups, based on their partitions of 12 different datasets.

It is not surprising to see that related algorithms are clustered together.

For a visualization of the similarity between the algorithms, the 35 algorithms are also embedded in a two-dimensional space obtained from the 35×35 similarity matrix.

[7] A.K. Jain (2008). Data Clustering: 50 Years Beyond K-Means.

Page 19

Popularity of K-means and hierarchical clustering

Even though K-means was first proposed over 50 years ago, it is still one of the most widely used algorithms for clustering, for several reasons: ease of implementation, simplicity, efficiency, empirical success… and model-based interpretation (see later)

Page 20

Within-cluster inertia criterion

Select the partition z minimizing the criterion

W_M(z) = sum_{i=1}^n sum_{k=1}^K z_ik ||x_i − x̄_k||²_M

||·||_M is the Euclidean distance with metric M in R^d

x̄_k is the mean of the kth cluster,

x̄_k = (1/n_k) sum_{i=1}^n z_ik x_i

and n_k = sum_{i=1}^n z_ik indicates the number of individuals in cluster k
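In code, the criterion is a direct double sum; the toy data, partition and metric below are invented for illustration:

```python
# Within-cluster inertia W_M(z) = sum_i sum_k z_ik ||x_i - xbar_k||^2_M,
# computed cluster by cluster with NumPy (toy example).
import numpy as np

def within_inertia(X, z, K, M=None):
    n, d = X.shape
    M = np.eye(d) if M is None else M
    W = 0.0
    for k in range(K):
        Xk = X[z == k]
        if len(Xk) == 0:
            continue
        diff = Xk - Xk.mean(axis=0)                   # x_i - xbar_k
        W += np.einsum('ij,jl,il->', diff, M, diff)   # sum_i diff_i' M diff_i
    return W

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
z = np.array([0, 0, 1, 1])
print(within_inertia(X, z, K=2))  # 1.0: each cluster contributes 2 * 0.5^2
```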

Page 21

Ward hierarchical clustering

[Figure: Ward agglomerative hierarchical clustering ("classification hiérarchique ascendante, méthode de Ward") of five points a, b, c, d, e: starting from singletons, the closest groups are merged step by step, d({b},{c}) = 0.5, d({d},{e}) = 2, d({a},{b,c}) = 3, d({a,b,c},{d,e}) = 10, and the merges are summarized in a dendrogram with heights 0.5, 2, 3 and 10]

Suboptimal optimization of W_M(·)

A partition is obtained by cutting the dendrogram

A dissimilarity matrix between pairs of individuals is enough
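Since only pairwise dissimilarities are needed, the whole procedure is a few lines with SciPy; the one-dimensional coordinates below are made up so that the merge order mimics the figure:

```python
# Ward agglomerative clustering of five toy points a, b, c, d, e (hypothetical
# coordinates), followed by a dendrogram cut into two clusters.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.array([[0.0], [2.0], [2.5], [8.0], [9.5]])   # a, b, c, d, e
Z = linkage(X, method="ward")                        # merge history (dendrogram)
z = fcluster(Z, t=2, criterion="maxclust")           # cut into K = 2 clusters
```

With these coordinates the cut recovers {a, b, c} vs. {d, e}.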

Page 22

K -means algorithm

[Figure: the K-means algorithm ("algorithme des centres mobiles") on five points a, b, c, d, e: starting from 2 individuals chosen at random as centers, the algorithm alternates assignment of the points to the nearest center ("affectation aux centres") and recomputation of the centers ("calcul des centres")]

Alternating optimization between the partition and the cluster centers
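This alternating scheme can be written in a few lines of NumPy (a minimal sketch: random initialization from the data, fixed number of iterations, no convergence test, toy data):

```python
# Minimal K-means (Lloyd's algorithm): alternate assignment to the nearest
# center and recomputation of the centers, starting from K random individuals.
import numpy as np

def kmeans(X, K, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        z = d2.argmin(axis=1)                                 # assignment step
        centers = np.array([X[z == k].mean(axis=0) if np.any(z == k)
                            else centers[k] for k in range(K)])  # update step
    return z, centers

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.1, (10, 2)), rng.normal(5.0, 0.1, (10, 2))])
z, centers = kmeans(X, K=2)
```

On these two tight, well-separated blobs the algorithm recovers the two groups.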

Page 23

The identity metric M: a classical but hazardous choice

[Figure: R plots of k-means with the identity metric on two spherical classes ("deux classes sphériques"), where it succeeds, and on two elongated classes ("deux classes allongées"), where it fails]

Alternative: estimate M(k) by minimizing WM(k)(z) over (z,M(k))

Page 24

Effect of the metric M through a real example

Animals represented by 13 Boolean features related to appearance and activity

Large weight on the appearance features compared to the activity features: the animals were clustered into mammals vs. birds

Large weight on the activity features: partitioning predators vs. non-predators

Both partitions are equally valid, and uncover meaningful structures in the data

The user has to carefully choose his representation to obtain a desired clustering

Page 25

Semi-supervised clustering [8]

The user can provide any external information he has on the partition

Pairwise constraints:

A must-link constraint specifies that the point pair connected by the constraint belong to the same cluster

A cannot-link constraint specifies that the point pair connected by the constraint do not belong to the same cluster

Attempts to incorporate constraints derived from domain ontologies and other external sources into clustering algorithms include the usage of the WordNet ontology, gene ontology, Wikipedia, etc. to guide clustering solutions
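One way to make the pairwise constraints concrete is the feasibility test used in constrained K-means variants such as COP-KMeans (a sketch; the constraint pairs below are invented):

```python
# Feasibility test for pairwise constraints: would assigning point i to
# cluster k break a must-link or cannot-link pair? (COP-KMeans-style sketch.)
must_link = [(0, 1)]     # points 0 and 1 must share a cluster
cannot_link = [(0, 2)]   # points 0 and 2 must not share a cluster

def violates(z, i, k):
    """z maps already-assigned point indices to their cluster."""
    for a, b in must_link:
        j = b if a == i else a if b == i else None
        if j is not None and z.get(j) is not None and z[j] != k:
            return True   # must-link partner already sits in another cluster
    for a, b in cannot_link:
        j = b if a == i else a if b == i else None
        if j is not None and z.get(j) == k:
            return True   # cannot-link partner already sits in this cluster
    return False
```

During the assignment step of K-means, point i is then sent to the nearest center k for which `violates(z, i, k)` is false.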

[8] O. Chapelle et al. (2006); A.K. Jain (2008). Data Clustering: 50 Years Beyond K-Means.

Page 26

Online clustering

Dynamic data are quite recent: blogs, Web pages, retail chains, credit card transaction streams, network packets received by a router, stock market data, etc.

As the data get modified, the clustering must be updated accordingly: ability to detect emerging clusters, etc.

Often, all the data cannot be stored on a disk

This imposes additional requirements on traditional clustering algorithms: rapidly process and summarize the massive amount of continuously arriving data

Data stream clustering is a significant challenge, since such settings typically require single-pass algorithms

Page 27

Data representation challenge [9]

The data unit can be crucial for the data clustering task

[9] A.K. Jain (2008). Data Clustering: 50 Years Beyond K-Means.

Page 28

HD clustering: blessing (1/2)

A two-component d-variate Gaussian mixture:

π_1 = π_2 = 1/2,  X_1 | z_11 = 1 ~ N_d(0, I),  X_1 | z_12 = 1 ~ N_d(1, I)

Each variable provides its own, equal amount of separation information

Theoretical error decreases when d grows: err_theo = Φ(−√d / 2)…

[Figure: a sample in the (x1, x2) plane, and empirical vs. theoretical error rate as a function of d = 1, …, 10]

…and the empirical error rate also decreases with d!
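This "blessing" is easy to check numerically. The sketch below plugs the true parameters into the classification rule (rather than estimates, as on the slide), so the simulated error should track Φ(−√d/2); n and the seed are arbitrary choices:

```python
# Bayes error of 0.5*N_d(0, I) + 0.5*N_d(1, I) is Phi(-sqrt(d)/2):
# the optimal rule classifies into group 2 iff sum_j x_j > d/2.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 20000
theo, emp = [], []
for d in (1, 4, 9):
    theo.append(norm.cdf(-np.sqrt(d) / 2))
    z = rng.integers(0, 2, size=n)               # true group labels
    X = rng.normal(size=(n, d)) + z[:, None]     # N_d(0,I) or N_d(1,I)
    zhat = (X.sum(axis=1) > d / 2).astype(int)   # optimal linear rule
    emp.append((zhat != z).mean())
```

Both sequences decrease with d, and the empirical errors stay within sampling noise of the theoretical ones.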

Page 29

HD clustering: blessing (2/2)

[Figure: FDA (factorial discriminant analysis) projections of the two clusters on the 1st and 2nd FDA axes, for d = 2, 20, 200 and 400: the separation improves as d grows]

Page 30

HD clustering: curse (1/2)

Many variables provide no separation information

Same parameter setting except:

X_1 | z_12 = 1 ~ N_d((1, 0, …, 0)′, I)

Groups are not more separated when d grows: ||µ_2 − µ_1||_I = 1…

[Figure: a sample in the (x1, x2) plane, and empirical vs. theoretical error rate as a function of d = 1, …, 10]

…thus the theoretical error is constant (= Φ(−1/2)) and the empirical error increases with d

Page 31

HD clustering: curse (2/2)

Many variables provide redundant separation information

Same parameter setting except:

X_1^j = X_1^1 + N_1(0, 1)  (j = 2, …, d)

Groups are not more separated when d grows: ||µ_2 − µ_1||_Σ = 1…

[Figure: a sample in the (x1, x2) plane, and empirical vs. theoretical error rate as a function of d = 1, …, 10]

…thus err_theo is constant (= Φ(−1/2)) and the empirical error increases (less) with d

Page 32

Clustering: an ill-posed problem

Questions to be addressed:

What is the best metric M(k)?

How to choose the number K of clusters? W_M(z) decreases with K…

Are clusters of different sizes well estimated?

How to choose the data unit?

How to select features in a high-dimensional context?

How to deal with mixed data?

. . .

First, answer this: what, formally, is a cluster?

Model-based clustering solution

a cluster ⇐⇒ a distribution

It recasts all previous problems into model design/estimation/selection

Page 33

Outline

1 Introduction

2 Methods & questions

3 Model-based clustering

4 Illustrations

5 Challenges

Page 34

Model-based clustering

x = (x1, …, xn)  —clustering→  z = (z1, …, zn), K clusters

[Figure: a data set in the (X1, X2) plane, before and after clustering into K clusters]

Mixture model: well-posed problem

p(x|K; θ) = sum_{k=1}^K π_k p(x|K; θ_k), with θ = (π_k, (α_k)), can be used for

x → θ̂ → p(z|x, K; θ̂) → ẑ

x → p(K|x) → K̂

…

Page 35

Hypothesis of mixture of parametric distributions

cluster k is modelled by a parametric distribution:

X_i | Z_ik = 1 ~ p(·; α_k)  (i.i.d.)

cluster k has probability π_k, with sum_{k=1}^K π_k = 1:

Z_i ~ Mult_K(1, π_1, …, π_K)  (i.i.d.)

The whole mixture parameter is θ = (π_1, …, π_K, α_1, …, α_K):

p(x_i; θ) = sum_{k=1}^K π_k p(x_i; α_k)

Page 36

Gaussian mixture

p(·; α_k) = N_d(µ_k, Σ_k), where α_k = (µ_k, Σ_k) gathers the mean µ_k and the covariance matrix Σ_k

[Figure: a univariate two-component Gaussian mixture (the two component densities "composante 1" and "composante 2", and the mixture density "densité mélange"), and the density surface of a bivariate mixture]

Parameter = summary + help to understand

cluster k is described by meaningful parameters: cluster size (π_k), position (µ_k) and dispersion (Σ_k).

Page 37

The clustering process in mixtures

1. Estimation of θ by θ̂

2. Estimation of the conditional probability that x_i ∈ cluster k:

t_ik(θ̂) = p(Z_ik = 1 | X_i = x_i; θ̂) = π̂_k p(x_i; α̂_k) / p(x_i; θ̂)

3. Estimation of z_i by maximum a posteriori (MAP):

ẑ_ik = I{k = argmax_{h=1,…,K} t_ih(θ̂)}

[Figure: data sample → estimated mixture density → partition obtained by MAP into three clusters]
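Steps 2 and 3 above, for a univariate two-component Gaussian mixture whose parameters are assumed already estimated (the values of θ̂ below are hard-coded for illustration):

```python
# Conditional probabilities t_ik and MAP classification for a fitted
# univariate Gaussian mixture (theta-hat is assumed, not estimated here).
import numpy as np
from scipy.stats import norm

pi = np.array([0.4, 0.6])     # estimated proportions pi_k
mu = np.array([-2.0, 3.0])    # estimated means mu_k
sigma = np.array([1.0, 1.0])  # estimated standard deviations

x = np.array([-2.5, -1.0, 2.0, 3.5])
t = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)   # pi_k * p(x_i; alpha_k)
t /= t.sum(axis=1, keepdims=True)                    # t_ik(theta-hat)
z = t.argmax(axis=1)                                 # MAP: z_i = argmax_k t_ik
```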

Page 38

Estimation of θ by complete likelihood

Maximize the complete likelihood over (θ, z):

ℓ_c(θ; x, z) = sum_{i=1}^n sum_{k=1}^K z_ik ln{π_k p(x_i; α_k)}

Equivalent to traditional methods:

Metric         | M = I | M free | M_k free
Gaussian model | [πλI] | [πλC]  | [πλ_kC_k]

Bias of θ̂: heavy if clusters are poorly separated

Associated optimization algorithm: CEM (see later)

CEM with [πλI] is strictly equivalent to K-means

CEM is simple and fast (convergence in few iterations)

Page 39

Estimation of θ by observed likelihood

Maximize the observed likelihood over θ:

ℓ(θ; x) = sum_{i=1}^n ln p(x_i; θ)

Convergence of θ

General algorithm for missing data: EM

EM is simple but slower than CEM

Interpretation: it is a kind of fuzzy clustering

Page 40

Principle of EM and CEM

Initialization: θ^0

Iteration no. q:

E step: estimate the probabilities t^q = {t_ik(θ^q)}

C step (CEM only): classify by setting t^q = MAP({t_ik(θ^q)})

M step: maximize θ^{q+1} = argmax_θ ℓ_c(θ; x, t^q)

Stopping rule: iteration number or criterion stability

Properties

⊕: simplicity, monotonicity, low memory requirement

⊖: local maxima (depends on θ0), linear convergence (EM)
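The E and M steps above, written out for a two-component univariate Gaussian mixture (a minimal sketch: simulated data, a single initialization, a fixed number of iterations, no safeguards):

```python
# EM for a two-component univariate Gaussian mixture: alternate the E step
# (responsibilities t_ik) and the M step (weighted proportions/means/stds).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 200)])

pi = np.array([0.5, 0.5])      # initial theta^0
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])
for _ in range(50):
    # E step: t_ik proportional to pi_k * p(x_i; alpha_k)
    t = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)
    t /= t.sum(axis=1, keepdims=True)
    # M step: closed-form weighted updates
    nk = t.sum(axis=0)
    pi = nk / len(x)
    mu = (t * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((t * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
```

With these well-separated components, the estimates settle near the simulation values (π ≈ (0.6, 0.4), µ ≈ (−2, 3)).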

[Figure: a univariate mixture density, the likelihood surface over (µ1, µ2), and six EM runs ("essais" 1 to 6) started from different initializations and converging to different local maxima]

Page 41

Importance of model selection

Model = number of clusters + parametric structure of clusters

Too simple model: bias

[Figure: a two-class sample from the true model [πλ_kI] fitted with the too simple model [πλI]: the two estimated classes ("classe 1", "classe 2") are biased]

Too complex model: variance

[Figure: the true borderline vs. the borderlines estimated with [πλI] and with [πλ_kC_k]: the too complex model has high variance]

Page 42

Model selection criteria

The most widespread principle:

Criterion (to be maximized) = maximum log-likelihood (model-data adequacy) − penalty ("cost" of the model)

criterion | penalty                                   | interpretation                        | user purpose

General criteria in statistics:
AIC       | ν                                         | model complexity                      | prediction
BIC       | 0.5 ν ln(n)                               | model complexity                      | identification

Specific criterion for the clustering aim:
ICL       | 0.5 ν ln(n) − sum_{i,k} ẑ_ik ln t_ik(θ̂)  | model complexity + partition entropy  | well-separated clusters

N.B.: in a prediction context, it is also possible to use the predictive error rate
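With scikit-learn the table above can be tried directly on simulated data. Note that `GaussianMixture.bic` returns a criterion to minimize (−2 log L + ν ln n, i.e. −2 times the "to be maximized" version); the ICL variant sketched below adds twice the partition entropy term, following the same sign convention:

```python
# Compare BIC and an ICL-style criterion (both to minimize) over K = 1, 2, 3
# on two well-separated simulated Gaussian clusters.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (200, 2)), rng.normal(3.0, 1.0, (200, 2))])

def icl(gm, X):
    t = gm.predict_proba(X)
    # partition entropy term: -sum_i ln t_{i, zhat_i}, zhat = MAP assignment
    ent = -np.log(np.maximum(t.max(axis=1), 1e-300)).sum()
    return gm.bic(X) + 2.0 * ent

bic, icl_ = {}, {}
for K in (1, 2, 3):
    gm = GaussianMixture(n_components=K, n_init=3, random_state=0).fit(X)
    bic[K], icl_[K] = gm.bic(X), icl(gm, X)
```

Both criteria should be minimized at K = 2 here; on overlapping clusters ICL tends to prefer fewer, better-separated clusters than BIC.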

Page 43

Outline

1 Introduction

2 Methods & questions

3 Model-based clustering

4 Illustrations

5 Challenges

Page 44

ICL/BIC for acoustic emission control

Data: n = 2,061 event locations in a rectangle of R² representing the vessel

Model: Diagonal Gaussian mixture + uniform (noise)

Groups: sound locations = vessel defects

[Figure: ICL criterion ("Critère ICL") and BIC criterion ("Critère BIC") as functions of the number of clusters, and the resulting partition of the event locations on the vessel]

Page 45

ICL for prostate cancer data (1/2)

Individuals: n = 475 patients with prostatic cancer, grouped on clinical criteria into two stages of the disease (Stage 3 and Stage 4)

Variables: d = 12 pre-trial variates were measured on each patient, composed of eight continuous variables (age, weight, systolic blood pressure, diastolic blood pressure, serum haemoglobin, size of primary tumour, index of tumour stage and histolic grade, serum prostatic acid phosphatase) and four categorical variables with various numbers of levels (performance rating, cardiovascular disease history, electrocardiogram code, bone metastases)

Model: conditional independence between the continuous and the categorical blocks, p(x_1; α_k) = p(x_1^cont; α_k^cont) · p(x_1^cat; α_k^cat)

45/54

Page 46: Clustering: evolution of methods to meet new challenges · Clustering of clustering algorithms7 Jain et al. (2004) hierarchical clustered 35 different clustering algorithms into

Introduction Methods & questions Model-based clustering Illustrations Challenges

ICL for prostate cancer data (2/2)

Variables              | Continuous | Categorical | Mixed
Error (%)              | 9.46       | 47.16       | 8.63

True \ estimated group | 1 / 2      | 1 / 2       | 1 / 2
Stage 3                | 247 / 26   | 142 / 131   | 252 / 21
Stage 4                | 19 / 183   | 120 / 82    | 20 / 182

[Figure: patients projected on the 1st and 2nd PCA axes (continuous data) and on the 1st and 2nd MCA axes (categorical data)]

Page 47

BIC for partitioning communes of Wallonia

Data: n = 262 communes of Wallonia in terms of d = 2 fractals at a local level

Model:

Data unit: a one-to-one transformation g(x) = (g(x_i^j), i = 1, …, n, j = 1, …, d) of the initial data set. Typically, standard transformations are g(x_i^j) = x_i^j (identity), g(x_i^j) = exp(x_i^j) or g(x_i^j) = ln(x_i^j)

Mixture: K = 6 (fixed) but all 28 Gaussian models

Result: 6 meaningful groups with g(x_i^j) = exp(x_i^j) (natural for fractals…)

Model criterion:

BIC_g = ℓ(θ̂_g; g(x)) − (ν/2) ln n + ln |H_g|,

where H_g is the Jacobian of the transformation g.

Page 48

BIC for Gaussian "variable selection" [10]

Definition:

p(x_1; θ) = { sum_{k=1}^K π_k p(x_1^S; µ_k, Σ_k) } × { p(x_1^U; a + x_1^R b, C) } × { p(x_1^W; u, V) }
            (clustering variables)                  (redundant variables)        (independent variables)

where

all parts are Gaussian

S: the set of variables useful for clustering

U: the set of redundant clustering variables, expressed with R ⊆ S

W: the set of variables independent of the clustering

Trick: variable selection is recast as a particular model selected by BIC

[10] Raftery and Dean (2006); Maugis et al. (2009a); Maugis et al. (2009b).

Page 49

Curve “cookies” example (1/2)

The Kneading dataset comes from the Danone Vitapole Paris Research Center and concerns the quality of cookies and its relationship with the flour kneading process [11]. There are 115 different flours for which the dough resistance is measured during the kneading process for 480 seconds. One obtains 115 kneading curves observed at 241 equispaced instants of time in the interval [0, 480]. The 115 flours produce cookies of different quality: 50 of them produced cookies of good quality, 25 produced medium quality and 40 low quality.

[11] Leveder et al. (2004).

Page 50

Curve “cookies” example (2/2)

Using a basis functional model-based design for functional data [12]

[12] Jacques and Preda (2013).

Page 51

Co-clustering (1/2)

Contingency table: document clustering

Mixture of Medline (1,033 medical summaries) and Cranfield (1,398 aeronautics summaries)

Rows: 2,431 documents

Columns: words present in the documents (stop words excluded), i.e. 9,275 unique words

Data matrix: document × word cross-counts

Poisson model [13]

Results with 2 × 2 blocks:

          | Medline | Cranfield
Medline   | 1033    | 0
Cranfield | 0       | 1398

[13] G. Govaert and M. Nadif (2013). Co-clustering. Wiley.

Page 53

Co-clustering (2/2)

Page 54

Outline

1 Introduction

2 Methods & questions

3 Model-based clustering

4 Illustrations

5 Challenges

Page 54

Three kinds of challenges, linked to the user task

Model design: depends on the data, and should incorporate user information

Model estimation: find efficient algorithms matching the user's requirements

Model selection (validation): depends again on the user's purpose

Model-based clustering

It is just a comfortable and rigorous framework

The user keeps full freedom within it, thanks to its high flexibility at each level
