

Degrees of Freedom and Model Selection for k-means Clustering

David P. Hofmeyr
Department of Statistics and Actuarial Science, Stellenbosch University, 7600, South Africa

arXiv:1806.02034v4 [stat.ML] 21 Feb 2020

Abstract

This paper investigates the model degrees of freedom in k-means clustering. An extension of Stein's lemma provides an expression for the effective degrees of freedom in the k-means model. Approximating the degrees of freedom in practice requires simplifications of this expression, however empirical studies evince the appropriateness of our proposed approach. The practical relevance of this new degrees of freedom formulation for k-means is demonstrated through model selection using the Bayesian Information Criterion. The reliability of this method is validated through experiments on simulated data as well as on a large collection of publicly available benchmark data sets from diverse application areas. Comparisons with popular existing techniques indicate that this approach is extremely competitive for selecting high quality clustering solutions. Code to implement the proposed approach is available in the form of an R package from https://github.com/DavidHofmeyr/edfkmeans.

Keywords: clustering; k-means; model selection; cluster number determination; degrees of freedom; Bayesian Information Criterion; penalised likelihood

1 Introduction

Degrees of freedom arise explicitly in model selection, as a way of accounting for the bias in the model log-likelihood for estimating generalisation performance (Akaike, 1998, Akaike Information Criterion, AIC) and, indirectly, Bayes factors (Schwarz et al., 1978, Bayesian Information Criterion, BIC). In particular, degrees of freedom account for the complexity, or flexibility, of a model by measuring its effective number of parameters. In the context of clustering, model flexibility is varied primarily by different choices of k, the number of clusters. In k-means, clusters are associated with compact collections of points arising around a set of cluster centroids. The optimal centroids are those which minimise the sum of squared distances between each point and its assigned centroid. Using the squared distance connects the k-means objective with the log-likelihood of a simple Gaussian Mixture Model (GMM). Pairing elements of the GMM log-likelihood with AIC and BIC type penalties, based on the number of explicitly estimated parameters, has motivated multiple model selection methods for k-means (Manning et al., 2008; Ramsey et al., 2008; Pelleg et al., 2000). However, it has been observed that these approaches can lead to substantial over-estimation of the number of clusters (Hamerly and Elkan, 2004).

We argue that these simple penalties are inappropriate, and do not account for the entire complexity of the model, and investigate more rigorously the degrees of freedom in the k-means model. The proposed formulation depends not only on the explicit dimension of the model, but also accounts for the uncertainty in the cluster assignments. This is intuitively appealing, as it allows the degrees of freedom to incorporate the difficulty of the clustering problem, which cannot be captured solely by the model dimension. This formulation draws on the work of Tibshirani (2015), and is the first application, of which we are aware, of this approach to the problem of clustering. We validate the proposed formulation by applying it within the BIC to perform model selection for k-means. The approach is found to be extremely competitive with the state-of-the-art on a very large collection of benchmark data sets.

The remainder of the paper is organised as follows. In Section 2 we discuss the k-means model explicitly, and consider its degrees of freedom. We also provide details of how we approximate the degrees of freedom in practice. Section 3 describes our approach for model selection based on the Bayesian Information Criterion and using these approximated degrees of freedom. Section 4 documents the results from a thorough simulation study, as well as comparisons between the proposed approach and popular existing methods on simulated data and on a very large collection of publicly available benchmark data sets. Finally, we give some concluding remarks in Section 5.

2 Degrees of Freedom in the k-means Model

From a probabilistic perspective, the standard modelling assumptions for k-means are that the data arose from a k component Gaussian mixture in d dimensions with equal isotropic covariance matrix, σ²I, and either equal mixing proportions (Manning et al., 2008; Celeux and Govaert, 1992) or sufficiently small σ (Jiang et al., 2012). In this case, and with a slight abuse of notation, one may in general write the likelihood for the data, given model M, which we assume to include all parameters of the underlying distribution which are being estimated, as

\ell(X \mid \mathcal{M}) = \sum_{i=1}^{n} \log\left( \sum_{j=1}^{k} \pi_{ij} \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left( -\frac{\|X_{i\cdot} - \mu_j\|^2}{2\sigma^2} \right) \right).

Here µ₁, ..., µₖ ∈ R^d are the component means, π_{ij} is the probability that the i-th datum arises from the j-th component, and the subscript "i·" is used to denote the i-th row of a matrix. The terms π_{ij} are usually assumed equal for fixed j, and have been used to represent mixing proportions (Manning et al., 2008). Popular formulations of the k-means likelihood (Manning et al., 2008; Ramsey et al., 2008; Pelleg et al., 2000) use the so-called classification likelihood (Fraley and Raftery, 2002), which treats the cluster assignments as true class labels. For example, a simple BIC formulation has been expressed, up to an additive constant, as (Ramsey et al., 2008)

\frac{1}{\sigma^2} \sum_{i=1}^{n} \min_{j \in \{1,\dots,k\}} \|X_{i\cdot} - \mu_j\|^2 + \log(n)\,kd. \qquad (1)

Here only the means are assumed part of the estimation, and hence the model dimension is kd, for k clusters. There is a fundamental mismatch in formulations such as this, however, including those in Manning et al. (2008); Ramsey et al. (2008); Pelleg et al. (2000), between the log-likelihood component and the bias correction term. Specifically, by using the classification likelihood the assumption is that the model is also estimating the assignments of data to clusters. However, without incorporating this added estimation into the model degrees of freedom, the bias of the log-likelihood for estimating generalisation error, and Bayes factors, is severely under-estimated.

In this work a modified formulation is considered which incorporates the cluster assignment into the modelling procedure. We find it convenient to assume that the data matrix X has been generated as

X = \mu + E, \qquad (2)

where the mean matrix µ ∈ R^{n×d} is assumed to have k unique rows and the elements of E ∈ R^{n×d} are independent realisations from a N(0, σ²) distribution. Notice that in this case the log-likelihood may be written as

\ell(X \mid \mu, \sigma) = \sum_{i=1}^{n} \sum_{j=1}^{d} \log\left( \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(X_{i,j} - \mu_{i,j})^2}{2\sigma^2} \right) \right)
= -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \sum_{j=1}^{d} (X_{i,j} - \mu_{i,j})^2 - nd \log(\sigma) + K,

for constant K independent of σ and µ. Note that in this formulation the assignment of data (rows of X) to mixture components is captured implicitly by the k distinct rows of µ. Also notice that if σ is assumed fixed then this is essentially equivalent (up to an additive constant) to the likelihood term in the BIC formulation in (1) above.

For this formulation it is possible to consider estimating pointwise the elements of µ, under the constraint of having k unique rows, using a modelling procedure M : R^{n×d} → R^{n×d}, defined as

\mathcal{M}(X)_{i,j} = \hat{\mu}_{c(i),j} \qquad (3)
\hat{\mu} = \arg\min_{M \in \mathbb{R}^{k \times d}} \sum_{i=1}^{n} \min_{l \in \{1,\dots,k\}} \|X_{i\cdot} - M_{l\cdot}\|^2 \qquad (4)
c(i) = \arg\min_{l \in \{1,\dots,k\}} \|X_{i\cdot} - \hat{\mu}_{l\cdot}\|^2. \qquad (5)

The matrix µ̂ ∈ R^{k×d} estimates the unique rows of µ, and provides an approximation of the maximum likelihood solution under Eq. (2). The indices c(i), i = 1, ..., n indicate the assignments of the data (rows of X) to the different clusters' means (rows of µ̂). With this formulation we are able to address the estimation of the "effective degrees of freedom" (Efron, 1986), given by

df(\mathcal{M}) = \frac{1}{\sigma^2} \sum_{i=1}^{n} \sum_{j=1}^{d} \mathrm{Cov}(\mathcal{M}(X)_{i,j}, X_{i,j}). \qquad (6)

The covariance offers an appealing interpretation in terms of model complexity/flexibility. A more complex model will respond more to variations in the data, in that additional flexibility will allow the model to attempt to "explain" this variation. The covariance between its fitted values and the data will therefore be higher. On the other hand, an inflexible model will, by definition, vary less due to changes in the observations. Furthermore, in numerous simple Gaussian error models there is an exact equality between this covariance and the model dimension. The remainder of this section is concerned with obtaining an appropriate approximation of the effective degrees of freedom for the k-means model. The following two lemmas are useful for obtaining such an estimate.

Lemma 1 Let X = µ + E ∈ R^{n×d}, with µ fixed and E_{i,j} ~ N(0, σ²) with E_{i,j}, E_{k,l} independent for all (i, j) ≠ (k, l). Let f : R^{n×d} → R^{n×d} satisfy the following condition. For all W ∈ R^{n×d} and each i, j, there exists a finite set D^W_{i,j} = ∪_{l=1}^{q} {δ_l} s.t. f, viewed as a univariate function by keeping all other elements of W, {W_{k,l}}_{(k,l)≠(i,j)}, fixed, is Lipschitz on each of (−∞, δ₁), (δ₁, δ₂), ..., (δ_{q−1}, δ_q), and (δ_q, ∞). Then for each i, j, the quantity (1/σ²) Cov(f(X)_{i,j}, X_{i,j}) is equal to

E\left[ \frac{\partial}{\partial X_{i,j}} f(X)_{i,j} \right] + \frac{1}{\sigma} E\left[ \sum_{\delta : X_{i,j}+\delta \in D^X_{i,j}} \phi\left( \frac{X_{i,j}+\delta-\mu_{i,j}}{\sigma} \right) \lim_{\gamma \downarrow\uparrow \delta} f(X + \gamma e_{i,j})_{i,j} \right], \qquad (7)

provided the second term on the right hand side exists. Here φ(x) = (2π)^{−1/2} exp(−x²/2) is the Gaussian density function; e_{i,j} ∈ R^{n×d} has zero entries except in the i, j-th position, where it takes the value one; and

\lim_{\gamma \downarrow\uparrow \delta} f(X + \gamma e_{i,j}) = \lim_{\gamma \downarrow \delta} f(X + \gamma e_{i,j}) - \lim_{\gamma \uparrow \delta} f(X + \gamma e_{i,j})

is the size of the discontinuity at δ.

This result is very similar to (Tibshirani, 2015, Lemma 5), where the regression context is considered. Our proof is given in the appendix. The first term in (7) comes from Stein's influential result (Stein, 1981, Lemma 2) for determining the risk in the estimation of the mean of a Gaussian random variable using a smooth model. Due to the discontinuities in the k-means model, which occur at points where the cluster assignments of some of the data change, the additional covariance at the discontinuity points needs to be accounted for. Consider an X which is close to a point of discontinuity with respect to the i, j-th entry. Conditional on the fact that X is close to such a point, f(X)_{i,j} takes values approximately equal to the left and right limits, depending on whether X_{i,j} is below or above the discontinuity respectively. On a small enough scale each happens with roughly equal probability. After taking into account the probability of being close to the discontinuity point, and taking the limit as X gets arbitrarily close to the discontinuity point, one can arrive at an intuitive justification for the additional term in (7). In the remainder this additional covariance term will be referred to as the excess degrees of freedom.
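As a quick sanity check of (7), consider a toy one-dimensional "model" with a single jump, f(x) = 1{x > δ₀}. The derivative term vanishes almost everywhere, so the entire covariance comes from the discontinuity. A minimal simulation (the choices of µ, σ and δ₀ below are arbitrary) confirms the identity:

    # Toy check of Lemma 1: f has a single jump of size 1 at delta0, so
    # Cov(f(X), X)/sigma^2 should equal (1/sigma) * phi((delta0 - mu)/sigma).
    set.seed(1)
    mu <- 0; sigma <- 1; delta0 <- 0.5
    x <- rnorm(1e6, mu, sigma)
    f <- as.numeric(x > delta0)
    cov(f, x) / sigma^2                    # Monte Carlo estimate, approx. 0.352
    dnorm((delta0 - mu) / sigma) / sigma   # excess term in (7), approx. 0.352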

In the above result the function f may be seen to represent an arbitrary modelling procedure, which takes as argument a data matrix and outputs a matrix of fitted values which represent an estimate of the means of the elements in the data under a Gaussian error model. The next lemma places Lemma 1 in the context of the k-means model, where it is verified that the modelling procedure M, described in Eqs. (3)-(5), satisfies the conditions described above. Notice that in this context, the discontinuities in the model (the δ values in the statement of Lemma 1) correspond with the points at which some of the clustering assignments would change.

Lemma 2 Let M : R^{n×d} → R^{n×d} be defined as

\mathcal{M}(W)_{i,j} = \hat{\mu}_{c(i),j},

where

\hat{\mu} = \arg\min_{M \in \mathbb{R}^{k \times d}} \sum_{i=1}^{n} \min_{j \in \{1,\dots,k\}} \|W_{i\cdot} - M_{j\cdot}\|^2
c(i) = \arg\min_{j \in \{1,\dots,k\}} \|W_{i\cdot} - \hat{\mu}_{j\cdot}\|^2.

Then M satisfies the conditions on the function f in the statement of Lemma 1, and moreover if X = µ + E ∈ R^{n×d}, with µ fixed and E_{i,j} ~ N(0, σ²) with E_{i,j}, E_{k,l} independent for all (i, j) ≠ (k, l), then

E\left[ \sum_{\delta : X_{i,j}+\delta \in D^X_{i,j}} \phi\left( \frac{X_{i,j}+\delta-\mu_{i,j}}{\sigma} \right) \lim_{\gamma \downarrow\uparrow \delta} \mathcal{M}(X + \gamma e_{i,j})_{i,j} \right]

exists and is finite.

One of the most important consequences of (Stein, 1981, Lemma 2), which leads to the first term in (7), is that this term is devoid of any of the parameters of the underlying distribution. An unbiased estimate of this term can be obtained by taking the partial derivatives of the model using the observed data. In the case of k-means one arrives at

\frac{\partial \mathcal{M}(X)_{i,j}}{\partial X_{i,j}} = \frac{\partial \hat{\mu}_{c(i),j}}{\partial X_{i,j}} = \frac{1}{n_{c(i)}},

where n_{c(i)} is the number of data assigned to centroid c(i). Therefore,

\sum_{i=1}^{n} \sum_{j=1}^{d} \frac{\partial \mathcal{M}(X)_{i,j}}{\partial X_{i,j}} = \sum_{j=1}^{d} \sum_{l=1}^{k} \sum_{i : c(i)=l} \frac{\partial \mathcal{M}(X)_{i,j}}{\partial X_{i,j}} = \sum_{j=1}^{d} \sum_{l=1}^{k} \sum_{i : c(i)=l} \frac{1}{n_{c(i)}} = \sum_{j=1}^{d} \sum_{l=1}^{k} n_l \frac{1}{n_l} = kd.

The excess degrees of freedom therefore equals the difference between the effective degrees of freedom and the explicit model dimension, i.e., the number of elements in µ̂. It may therefore be interpreted as the additional complexity in assigning data to clusters. This is intuitively pleasing in light of the fact that this additional covariance directly accounts for the potential assignment of the data to different clusters, in that these are what result in discontinuities in the model.
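The identity Σ ∂M(X)_{i,j}/∂X_{i,j} = kd can be verified numerically with divided differences, provided the perturbation is small enough that no assignments change. The following sketch, with arbitrary simulated data, holds the assignments fixed while recomputing the centroids:

    # Perturbing X[i, j] by eps moves the assigned centroid's j-th
    # coordinate by eps / n_{c(i)}, so the derivatives sum to k * d.
    set.seed(1)
    n <- 100; d <- 2; k <- 3; eps <- 1e-6
    X <- matrix(rnorm(n * d), n, d)
    cl <- kmeans(X, centers = k, nstart = 10)$cluster
    fit <- function(Z) {
      centers <- apply(Z, 2, function(col) tapply(col, cl, mean))
      centers[cl, ]   # fitted values with assignments held fixed at cl
    }
    deriv_sum <- 0
    for (i in 1:n) for (j in 1:d) {
      Xp <- X; Xp[i, j] <- Xp[i, j] + eps
      deriv_sum <- deriv_sum + (fit(Xp)[i, j] - fit(X)[i, j]) / eps
    }
    deriv_sum   # approximately k * d = 6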

2.1 Approximating Excess Degrees of Freedom

The excess degrees of freedom reintroduces the unknown parameters to the degrees of freedom expression. Furthermore, as noted by Tibshirani (2015), it is generally extremely difficult to determine the discontinuity points, making the computation of the excess degrees of freedom very challenging. This is perhaps even more so in the case of clustering. Consider the excess degrees of freedom arising from the i, j-th entry,

\frac{1}{\sigma} E\left[ \sum_{\delta : X_{i,j}+\delta \in D^X_{i,j}} \phi\left( \frac{X_{i,j}+\delta-\mu_{i,j}}{\sigma} \right) \lim_{\gamma \downarrow\uparrow \delta} \mathcal{M}(X + \gamma e_{i,j})_{i,j} \right].

Assume for now that the model parameters, µ and σ², are fixed. We will discuss our approach for accommodating these unknown parameters in the next subsection. Now, recall that the discontinuities D^X_{i,j} are those δ at which the assignment of some of the data changes. That is, those δ for which ∃ m s.t.

\lim_{\gamma \downarrow\uparrow \delta} \arg\min_{l=1,\dots,k} \|(X + \gamma e_{i,j})_{m\cdot} - \hat{\mu}(X + \gamma e_{i,j})_{l\cdot}\| \ne 0.

The fact that discontinuities are determined in terms of the would-be solution, µ̂(X + γe_{i,j}), rather than the observed solution, µ̂(X), is one of the reasons which make determining the discontinuity points extremely challenging. Here we have made explicit the dependence of the estimated means, µ̂, on the data. Indeed, one can construct examples where slight changes in only a single matrix entry can result in reassignments of arbitrarily large subsets of data, resulting in substantial and unpredictable changes in µ̂. We are thus led to making some simplifications. First, we only consider discontinuities w.r.t. the i, j-th entry arising from reassignments of X_{i·}, the corresponding datum. This is a necessary simplification which maintains the intuitive interpretation of the excess degrees of freedom as the covariance arising from reassignments of data. Now, consider the value of δ at which the assignment of X_{i·} changes from c(i) to some l ≠ c(i). Ignoring all other clusters, we find that δ satisfies

\left\| X_{i\cdot} + \delta e_j - \hat{\mu}_{c(i)\cdot} - \frac{\delta}{n_{c(i)}} e_j \right\|^2 = \left\| X_{i\cdot} + \delta e_j - \hat{\mu}_{l\cdot} \right\|^2, \qquad (8)

where e_j is the j-th canonical basis vector for R^d and n_{c(i)} is the size of the c(i)-th cluster. This is a quadratic equation which can easily be solved. A further simplification is adopted here. Rather than considering the paths of X_{i·} through multiple reassignments resulting from varying δ (which quickly become extremely difficult to calculate), the magnitude and location of a discontinuity at a value δ is determined as though no reassignments had occurred for values between zero and δ. Since the corresponding values of δ are generally large, the contributions from the quantities φ((X_{i,j}+δ−µ_{i,j})/σ) lim_{γ↓↑δ} M(X + γe_{i,j})_{i,j} are generally small, and hence we expect the bias induced by this simplification to be relatively small. The excess degrees of freedom for the i, j-th entry is thus approximated using

\frac{1}{\sigma} \sum_{l \ne c(i)} \phi\left( \frac{X_{i,j}+\delta_l-\mu_{i,j}}{\sigma} \right) \lim_{\gamma \downarrow\uparrow \delta_l} \mathcal{M}(X + \gamma e_{i,j})_{i,j}, \qquad (9)

where δ_l is the solution to Eq. (8) with smaller magnitude (when a solution exists). To determine the magnitude of the discontinuities observe that when δ_l < 0, and we assume, as above, that no values δ_l < δ < 0 result in a reassignment of X_{i·}, we have

\lim_{\gamma \downarrow \delta_l} \mathcal{M}(X + \gamma e_{i,j})_{i,j} = \hat{\mu}_{c(i),j} + \frac{\delta_l}{n_{c(i)}},
\lim_{\gamma \uparrow \delta_l} \mathcal{M}(X + \gamma e_{i,j})_{i,j} = \frac{1}{n_l + 1}\left( n_l \hat{\mu}_{l,j} + X_{i,j} + \delta_l \right)
\Rightarrow \lim_{\gamma \downarrow\uparrow \delta_l} \mathcal{M}(X + \gamma e_{i,j})_{i,j} = \hat{\mu}_{c(i),j} - \frac{n_l}{n_l + 1}\hat{\mu}_{l,j} - \frac{X_{i,j}}{n_l + 1} + \delta_l \left( \frac{n_l + 1 - n_{c(i)}}{n_{c(i)}(n_l + 1)} \right). \qquad (10)

If δ_l > 0 then we simply have the negative of the above.

2.1.1 Selecting Appropriate Values for µ and σ² for Estimating Degrees of Freedom

The estimate of excess degrees of freedom depends on the values of µ and σ². It is tempting to use the apparently natural candidates, based on µ̂ and an estimate of the within cluster variance from the model whose degrees of freedom are being estimated itself. However, this is inappropriate for the purpose of comparing models. First, notice that the value of µ̂ will lead to an underestimation of the terms φ((X_{i,j}+δ−µ_{i,j})/σ). This is because the values which result in a reassignment of the corresponding datum occur at the boundaries of the estimated clusters, and hence, on average, at the greatest distances from µ̂. Furthermore, note that smaller values of σ² tend to result in a smaller value of the estimated degrees of freedom, everything else being equal. A model with an over-estimation of k would lead to an underestimation of σ², and hence an artificially low estimated degrees of freedom. Such a model would thus be penalised insufficiently, relative to those with a smaller number of clusters, and hence larger estimate of σ².

We have observed that to estimate the degrees of freedom for a model with k clusters, a reasonable approximation can often be obtained by using the estimated parameters from any larger model (i.e., one with a greater number of clusters). In particular, if we now let M(X; k) be the fitted values from Eqs. (3)-(5), making explicit the number of clusters in the model, then replacing µ and σ with M(X; k′) and \sqrt{\frac{1}{nd}\sum_{i=1}^{n}\sum_{j=1}^{d}(X_{i,j} - \mathcal{M}(X;k')_{i,j})^2} respectively, where k′ > k, provides a reasonable estimate of the degrees of freedom in model M(X; k). It is interesting that the estimate of degrees of freedom is similar for a large range of values k′, provided they are greater than k. Let us consider again the terms in the excess degrees of freedom, i.e., terms of the form

\frac{1}{\sigma} \phi\left( \frac{X_{i,j}+\delta-\mu_{i,j}}{\sigma} \right) \lim_{\gamma \downarrow\uparrow \delta} \mathcal{M}(X + \gamma e_{i,j}; k)_{i,j}.

Now, notice that the term inside φ may be seen as having two components, namely (X_{i,j} − µ_{i,j})/σ and δ/σ. The first of these will tend to be similar for different k′ when a complementary pair of µ and σ is used. Indeed, replacing these with the estimates described above, averaging their squared values over all i, j produces a constant, independent of k′. Furthermore, notice that, in general, δ will have the same sign as X_{i,j} − µ_{i,j}, since δ is the value which causes a change in the assignment of the datum X_{i·} from its nearest cluster mean. The term φ((X_{i,j}+δ−µ_{i,j})/σ) will therefore tend to decrease, in general, when considering all pairs i, j, as k′ increases. Conveniently, this decrease is approximately counteracted by the fact that the terms in the excess degrees of freedom include the factor 1/σ, which increases as k′ increases.

Figure 1 shows the estimated degrees of freedom from k-means models obtained from two of the data sets used in our applications (both data sets are available from the UCI machine learning repository; Bache and Lichman, 2013). For each of the two data sets we have shown the estimated degrees of freedom for the models with 5, 10 and 15 clusters, and for varying k′. There is a very clear dip in the plots where k = k′, caused by underestimation of the degrees of freedom by replacing the unknown parameters with the estimates from the same model. As described above, however, the estimates then become stable for values k′ > k. In practice we simply set k′ = kmax + 1, where kmax is the largest number of clusters under consideration, to estimate the degrees of freedom for all values of k.

[Figure 1: Estimated degrees of freedom for models with k = 5, 10 and 15 clusters, using the estimated parameters from models with a number of clusters, k′, ranging from 1 to 30. There is a clear dip in the estimates when k = k′, but the estimates are stable for k′ > k. Panels: (a) Wine data set; (b) Optical Recognition of Handwritten Digits data set. Plots not reproduced here.]

2.2 Accuracy of the Approximated Degrees of Freedom

Here we briefly report on a short set of simulations designed to assess the accuracy of the degrees of freedom approximation we have introduced. To begin, we quickly recap our approach. To approximate the degrees of freedom in the model M(X; k), i.e., the k-means solution with k clusters, we first compute, for each i, j, those values of δ at which datum X_{i·} would be assigned to another cluster, if shifted in direction e_j. That is, for each l ≠ c(i), we compute δ^{i,j}_l according to Eq. (8), where we have now introduced explicitly into the notation the indices i, j. We then compute the sizes of the model discontinuities at these values of δ, i.e., the values lim_{γ↓↑δ^{i,j}_l} M(X + γe_{i,j})_{i,j}, using Eq. (10). Finally, recalling that the explicit dimension kd accounts for the derivative term in (7), we set

\widehat{df}(\mathcal{M}(X;k)) = kd + \frac{1}{\hat{\sigma}} \sum_{i=1}^{n} \sum_{j=1}^{d} \sum_{l \ne c(i)} \phi\left( \frac{X_{i,j}+\delta^{i,j}_l-\hat{\mu}_{i,j}}{\hat{\sigma}} \right) \lim_{\gamma \downarrow\uparrow \delta^{i,j}_l} \mathcal{M}(X + \gamma e_{i,j})_{i,j}, \qquad (11)

where µ̂ = M(X; k′) and σ̂ = \sqrt{\frac{1}{nd}\sum_{i=1}^{n}\sum_{j=1}^{d}(X_{i,j}-\hat{\mu}_{i,j})^2} for some k′ > k.
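The following R sketch assembles the full approximation, with the quadratic from Eq. (8) solved in closed form and the discontinuity sizes from Eq. (10). It is illustrative only; the edfkmeans package referenced in the abstract should be treated as the reference implementation.

    # Approximate effective degrees of freedom as in Eq. (11), under the
    # simplifications of Section 2.1, with mu-hat and sigma-hat taken from
    # the larger model with k' clusters (Section 2.1.1).
    approx_edf <- function(X, k, kprime = k + 1, nstart = 10) {
      n <- nrow(X); d <- ncol(X)
      km <- kmeans(X, centers = k, nstart = nstart)        # model M(X; k)
      kp <- kmeans(X, centers = kprime, nstart = nstart)   # parameter model M(X; k')
      mu_hat <- km$centers; cl <- km$cluster; nc <- km$size
      mu_par <- kp$centers[kp$cluster, , drop = FALSE]     # stand-in for the unknown means
      sig <- sqrt(mean((X - mu_par)^2))                    # stand-in for the unknown sigma
      excess <- 0
      for (i in 1:n) {
        a <- X[i, ] - mu_hat[cl[i], ]
        r <- 1 - 1 / nc[cl[i]]
        for (l in setdiff(1:k, cl[i])) {
          b <- X[i, ] - mu_hat[l, ]
          for (j in 1:d) {
            # Eq. (8) reduces to the quadratic
            # (r^2 - 1) delta^2 + 2 (r a_j - b_j) delta + (||a||^2 - ||b||^2) = 0
            A <- r^2 - 1; B <- 2 * (r * a[j] - b[j]); C <- sum(a^2) - sum(b^2)
            disc <- B^2 - 4 * A * C
            if (abs(A) < 1e-12 || disc < 0) next
            roots <- (-B + c(-1, 1) * sqrt(disc)) / (2 * A)
            delta <- roots[which.min(abs(roots))]   # smaller-magnitude solution
            # Eq. (10): size of the discontinuity; negated when delta > 0
            jump <- mu_hat[cl[i], j] - nc[l] / (nc[l] + 1) * mu_hat[l, j] -
              X[i, j] / (nc[l] + 1) +
              delta * (nc[l] + 1 - nc[cl[i]]) / (nc[cl[i]] * (nc[l] + 1))
            if (delta > 0) jump <- -jump
            excess <- excess + dnorm((X[i, j] + delta - mu_par[i, j]) / sig) * jump / sig
          }
        }
      }
      k * d + excess   # explicit dimension plus approximate excess degrees of freedom
    }

In practice one would call, e.g., approx_edf(scale(X), k = 5, kprime = 31), matching the choice k′ = kmax + 1 described above.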

Figure 2 shows the results of our simulation study. Data sets of size 1000 were generated under the modelling assumptions in Eq. (2). The number of clusters and dimensions were each set to 5, 10 and 20. The figure shows plots of k against the estimated degrees of freedom based on the above approach, where k′ was set to kmax + 1 = 31. The results from 30 replications are shown. The plots also show direct empirical estimates of the degrees of freedom obtained by estimating the covariance between the model and the data when sampling from the true distribution. That is, we generate multiple data sets according to Eq. (2), apply k-means for each value of k, and compute the corresponding empirical covariance. To compute the direct estimate of degrees of freedom, this covariance is then simply divided by the true value σ². This direct estimate may therefore be seen as our target. For context we also include the plot of kd, corresponding to the naïve degrees of freedom equated with the explicit model dimension.
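A minimal sketch of this direct sampling target follows; the generating means and the replication count are arbitrary illustrative choices.

    # Direct (oracle) estimate of the effective degrees of freedom: simulate
    # replicates from Eq. (2) and estimate sum_{i,j} Cov(M(X)_{i,j}, X_{i,j}) / sigma^2.
    direct_edf <- function(mu, sigma, k, reps = 100, nstart = 10) {
      n <- nrow(mu); d <- ncol(mu)
      Xs <- array(dim = c(reps, n, d)); Fs <- array(dim = c(reps, n, d))
      for (r in 1:reps) {
        X <- mu + matrix(rnorm(n * d, sd = sigma), n, d)
        km <- kmeans(X, centers = k, nstart = nstart)
        Xs[r, , ] <- X
        Fs[r, , ] <- km$centers[km$cluster, ]
      }
      total <- 0
      for (i in 1:n) for (j in 1:d)
        total <- total + cov(Fs[, i, j], Xs[, i, j])
      total / sigma^2
    }
    # Example: 5 true clusters in 5 dimensions, roughly as in Figure 2 (a)
    mu <- matrix(rnorm(5 * 5, sd = 3), 5, 5)[sample(1:5, 1000, replace = TRUE), ]
    direct_edf(mu, sigma = 1, k = 5)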

Given the number of simplifications made, and the difficulty of the problem in the abstract, we find the estimation to be very satisfactory in general. The only exceptions apparent from this simple simulation study arose from the 20 dimensional examples, where the proposed method appears to underestimate the degrees of freedom for values of k greater than the true value. Note that from the point of view of model selection, a relatively larger underestimation of the degrees of freedom for a specific value of k will bias the model selection towards that value of k. It is therefore this apparent negative bias in the estimated degrees of freedom for higher dimensional cases and for values of k greater than the correct value which we find to be most problematic. We discuss a simple heuristic implemented to mitigate this effect in the next subsection, where we summarise our approach for performing model selection using the estimated degrees of freedom.

[Figure 2: Estimated degrees of freedom computed through (i) direct sampling, (ii) the proposed method for approximating effective degrees of freedom, and (iii) the naïve estimate of degrees of freedom, kd. Panels: (a) 5 clusters in 5 dimensions; (b) 10 clusters in 5 dimensions; (c) 20 clusters in 5 dimensions; (d) 5 clusters in 10 dimensions; (e) 10 clusters in 10 dimensions; (f) 20 clusters in 10 dimensions; (g) 5 clusters in 20 dimensions; (h) 10 clusters in 20 dimensions; (i) 20 clusters in 20 dimensions. Plots not reproduced here.]

3 Choosing k Using the BIC

The Bayesian Information Criterion approximates, up to unnecessary constants, the logarithm of the evidence for a model M, i.e., P(M|X), using

-2\ell(X \mid \mathcal{M}) + \log(m)\,df(\mathcal{M}).

Again ℓ(X|M) is the model log-likelihood and here m is the number of independent "residuals" in X. With the modelling assumptions in Eq. (2), the BIC for k-means is therefore, up to an additive constant,

\frac{1}{\sigma^2} \sum_{i=1}^{n} \sum_{j=1}^{d} (X_{i,j} - \mu_{i,j})^2 + nd\log(\sigma^2) + \log(nd)\,df(\mathcal{M}).

Setting µ here to be equal to M(X; k), i.e., the matrix of fitted values from the model, and σ̂² = \frac{1}{nd}\sum_{i=1}^{n}\sum_{j=1}^{d}(X_{i,j}-\hat{\mu}_{i,j})^2 to be the corresponding maximum likelihood estimate of the in-cluster variance, the estimated BIC in the k-means model with k clusters is therefore, up to an additive constant,

nd \log\left( \sum_{i=1}^{n} \sum_{j=1}^{d} (X_{i,j} - \hat{\mu}_{i,j})^2 \right) + \log(nd)\,\widehat{df}(\mathcal{M}(X;k)).

Now, we found in the previous section that the proposed approximation method for the model degrees of freedom has the potential to exhibit negative bias for larger values of d and for k greater than the true number of clusters. To mitigate the effect this has on model selection, we select the number of clusters as the smallest value of k which corresponds to a local minimum in the estimated BIC curve, seen as a function of k. If no such local minima are present, then we select either kmin or kmax; whichever gives the lowest value of the BIC. A similar "first extremum" approach for model selection has also been used by Tibshirani et al. (2001). We also apply a simple local-linear smoothing to the approximated degrees of freedom curves. This mitigates the effect of variation, which is quite pronounced in, e.g., Figure 2 (c) and (f). It also smooths over the short range variation within each estimated curve which is apparent in the proposed estimates, but not present in the curves estimated by direct sampling. Not smoothing over this variation has the potential to induce spurious local minima in the resulting BIC curves, which would not be present were it possible to obtain such direct estimates in practice.
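Putting the pieces together, a minimal sketch of this selection rule is given below. It reuses approx_edf() from Section 2.2, and substitutes a generic degree-1 loess fit for the local-linear smoothing; the smoother is an assumption here, not a statement of the exact procedure used.

    # BIC(k) = nd log(RSS_k) + log(nd) df-hat(k), with a smoothed df curve;
    # choose the smallest k at a local minimum of BIC, else whichever of
    # k = 1 and k = kmax gives the lower BIC.
    select_k <- function(X, kmax = 30) {
      n <- nrow(X); d <- ncol(X)
      rss <- df <- numeric(kmax)
      for (k in 1:kmax) {
        rss[k] <- kmeans(X, centers = k, nstart = 10)$tot.withinss
        df[k] <- approx_edf(X, k, kprime = kmax + 1)
      }
      ks <- seq_len(kmax)
      df <- fitted(loess(df ~ ks, degree = 1))      # stand-in for local-linear smoothing
      bic <- n * d * log(rss) + log(n * d) * df
      mins <- which(diff(sign(diff(bic))) > 0) + 1  # interior local minima of the curve
      if (length(mins) > 0) return(min(mins))
      c(1, kmax)[which.min(bic[c(1, kmax)])]
    }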

4 Experimental Results

In this section we report on the results from experiments conducted to assess the performance of the proposed approach for model selection, using both simulated data and data from real applications. In addition to the proposed approach, we also experimented with the following popular existing methods for model selection:

1. The Gap Statistic (Tibshirani et al., 2001), which is based on approximating, through Monte Carlo simulation, the deviation of the (transformed) within cluster sum of squares from its expected value when the underlying data distribution contains no clusters. Due to high computation time, solutions for the Monte Carlo samples were based on a single initialisation. Using ten initialisations, as for the clustering solutions of the actual data sets, did not produce better results in general on data sets for which this approach terminated in a reasonable amount of time.

2. The method of Pham et al. (2005), which uses the same motivation as the Gap Statistic, but determines the deviation of the sum of squares from its expected value analytically under the assumption that the data distribution meets the standard k-means assumptions. We use "fK" to refer to this method in the remainder.

3. The Silhouette Index (Kaufman and Rousseeuw, 2009), which is based on comparing the average dissimilarity of each point to its own cluster with its average dissimilarity to points in different clusters. Dissimilarity is determined by the Euclidean distance between points.

4. The Jump Statistic (Sugar and James, 2003), which selects the number of clusters based on the first differences in the k-means objective raised to the power −d/2. This statistic is based on rate distortion theory, which approximates the mutual information between the complete data set and the summarisation by the k centroids.

5. The Bayesian Information Criterion with a naïve estimate of the degrees of freedom given by kd. We used exactly the same selection approach as for the proposed method.

The clustering solutions given to each model selection method were the best, in terms of k-means objective, from ten random initialisations for each value of k (exactly the same clustering solutions were given to all selection methods). For all data sets values of k from 1 to 30 were considered. In all cases clustering solutions were obtained using the implementation of k-means provided in R's base stats package (R Core Team, 2013).
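In code, this protocol amounts to the following, where X denotes a standardised data matrix; kmeans() with nstart = 10 keeps the best of ten random initialisations by within cluster sum of squares.

    # One shared clustering solution per value of k, given to all methods
    solutions <- lapply(1:30, function(k) kmeans(X, centers = k, nstart = 10))
    rss <- sapply(solutions, function(km) km$tot.withinss)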

4.1 Simulations

In this section we report results from simulated data sets where the model structure is known and can be reasonably well controlled. We investigate scenarios including (i) when the k-means model assumptions of a Gaussian mixture with equal mixing proportions and equal and spherical covariance matrices are met; (ii) simple deviations from these assumptions including Gaussian mixtures with non-spherical covariances and unequal scale/mixture component density; and (iii) deviations from Gaussianity including slightly non-convex clusters and different tails in the residual distributions. To generate non-convex clusters, we use the approach described in Hofmeyr (2019), using the R package spuds (https://github.com/DavidHofmeyr/spuds). Here, points generated from a Gaussian mixture are given perturbations, and the size of the perturbation is greater the nearer a point is to points from other clusters. This simulation scheme was designed to test more flexible clustering methods, such as spectral clustering. However, k-means is capable of achieving high clustering accuracy when the degree of non-convexity of the clusters is not too substantial. For the reader's interest, Figures 3 and 4 show typical data sets generated from each simulation scheme, for the cases with 10 clusters in 10 dimensions. The figures show the two-dimensional principal component plots of the data. These give some indication of the types of data given to the algorithms, and can be used to infer somewhat the comparative difficulty of the clustering problems.

[Figure 3: Plots of typical simulated data sets from Gaussian mixtures with 10 clusters in 10 dimensions. Plots show data projected onto their first two principal components. Clusters are differentiated by colour and by point character. Panels: (a) assumptions met; (b) varying cluster scale; (c) varying cluster shape. Plots not reproduced here.]

[Figure 4: Plots of typical simulated data sets from non-Gaussian mixtures with 10 clusters in 10 dimensions. Plots show data projected onto their first two principal components. Clusters are differentiated by colour and by point character. Panels: (a) long tails (t3); (b) uniform clusters; (c) non-convex clusters. Plots not reproduced here.]

The results from the simulations are summarised in Tables 1 and 2. For each simulation scheme 30 data sets were generated, and the best performing methods, in terms of quality of solutions selected (see below), are indicated by bold font. In addition, methods whose performance was not significantly different from the best, based on a paired Wilcoxon signed rank test using a p-value threshold of 0.01, are also highlighted. We chose to use a small p-value to retain considerable discrimination in the results among the methods which perform well in general, but not so small that a single or few instances of one method identifying a single extra cluster would lead to it being excluded from the "best performers" for a given simulation scenario. Methods are compared based on their ability to select the correct number of clusters, and also based on the quality of the clustering solutions selected when compared to the ground truth (the ground truth here corresponds to the identities of the mixture components from which the data were generated). For this we use the adjusted Rand index (Hubert and Arabie, 1985, ARI). The Rand index (Rand, 1971) is given as the proportion of pairs of points which are either grouped together in both the clustering solution and the ground truth or assigned to different clusters both in the solution and the ground truth. An adjustment is then applied to normalise this proportion based on its expectation under a random assignment. The clustering accuracy is important to consider since it provides a means for comparing solutions when incorrect values of k are selected. Table 1 shows the results corresponding to data sets generated from Gaussian mixtures. Both the proposed approach, described as BICedf, and the Silhouette Index show very strong performance. The Jump Statistic performs very well when the assumptions are met exactly, but the performance drops dramatically when these assumptions are deviated from. It is worth noting that of the methods compared, the Silhouette Index is the only approach which is incapable of discerning "one cluster" from "more than one cluster". It is possible, therefore, that the performance of this method is in some sense slightly over-estimated, since its fail-cases are not as severe.
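The ARI can be computed directly from the contingency table of the two label vectors; the following is a compact sketch (the mclust package's adjustedRandIndex() provides an equivalent, tested implementation).

    # Adjusted Rand index (Hubert and Arabie, 1985) of label vectors a, b
    ari <- function(a, b) {
      tab <- table(a, b)
      comb2 <- function(x) sum(choose(x, 2))
      idx <- comb2(c(tab))   # sum over cells of choose(n_ij, 2)
      e <- comb2(rowSums(tab)) * comb2(colSums(tab)) / choose(length(a), 2)
      mx <- (comb2(rowSums(tab)) + comb2(colSums(tab))) / 2
      (idx - e) / (mx - e)   # Rand agreement, adjusted for chance
    }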

Table 2 shows the results corresponding to data sets generated from non-Gaussian mixtures. In this case the Silhouette Index enjoys the best performance. The performance of the proposed method is also strong, but significantly below that of the Silhouette in a number of cases, most frequently on the data containing non-convex clusters. The Gap Statistic here showed numerous instances of a failure to identify the presence of clusters. This is an interesting point to note, as in our experiments on data from real applications, the Gap Statistic performs well in general on non-Gaussian data.

4.2 Public Benchmark Data

This section briefly presents results from experiments using a large collection of 28 publicly available data sets associated with real applications, from diverse fields. These are popular benchmark data sets taken from the UCI machine learning repository (Bache and Lichman, 2013), with the exception of the Yeast (https://genome-www.stanford.edu/cellcycle/) and Phoneme (https://web.stanford.edu/~hastie/ElemStatLearn/) data sets. The Synth data set is, as far as the author is aware, the only simulated data set in this collection; it is a popular time-series clustering data set based on short length control-chart simulations. These data sets were chosen since ground-truth label sets are available, which can be used for validation and comparison of clustering solutions. All data sets were standardised to have unit variance in every dimension before applying any clustering.

Table 1: Results from simulated Gaussian mixture data sets. The median of the number of clusters selected by each method (k̂) and the corresponding adjusted Rand index (ARI, ×100) are reported, with the 10th and 90th centiles in parentheses. The quantiles are based on the results from 30 data sets generated for each simulation set-up. Highest performances for each scenario are highlighted, as are those which are not significantly different from the highest based on a paired Wilcoxon signed rank test with p-value threshold of 0.01. Each cell gives k̂(q10,q90)/ARI(q10,q90).

Assumptions met:
  k=5,  d=5:   fK 5(3,5)/98(59,99)    Gap 5(5,5)/98(97,99)    Silh. 5(5,5)/98(97,99)    Jump 5(5,5)/98(97,99)     BIC 24(18,29)/30(25,38)   BICedf 5(5,5)/98(98,99)
  k=5,  d=10:  fK 4(2,5)/74(31,96)    Gap 5(5,5)/94(92,97)    Silh. 5(5,5)/94(92,97)    Jump 5(5,5)/94(92,97)     BIC 19(15,21)/35(32,43)   BICedf 5(5,5)/94(92,97)
  k=5,  d=15:  fK 2(2,4)/29(20,70)    Gap 5(1,5)/85(0,87)     Silh. 5(5,5)/86(83,87)    Jump 5(5,29)/86(21,87)    BIC 12(9,14)/45(39,58)    BICedf 5(1,5)/84(0,87)
  k=10, d=5:   fK 8(2,10)/79(17,97)   Gap 10(1,10)/94(0,98)   Silh. 10(10,10)/96(94,98) Jump 10(10,10)/96(94,98)  BIC 20(17,26)/66(56,74)   BICedf 10(10,11)/96(92,98)
  k=10, d=10:  fK 7(2,10)/62(14,94)   Gap 6(1,10)/44(0,93)    Silh. 10(10,10)/92(90,94) Jump 10(10,10)/92(90,94)  BIC 16(14,17)/73(68,81)   BICedf 10(10,10)/92(89,94)
  k=10, d=15:  fK 2(2,2)/12(11,14)    Gap 1(1,10)/0(0,81)     Silh. 10(10,10)/82(80,85) Jump 10(10,10)/82(80,85)  BIC 11(10,12)/78(74,82)   BICedf 10(10,10)/82(80,85)
  k=15, d=5:   fK 12(2,14)/78(12,90)  Gap 2(1,15)/12(0,95)    Silh. 15(14,15)/93(88,96) Jump 15(14,15)/93(88,96)  BIC 24(20,29)/75(66,84)   BICedf 15(15,16)/94(89,96)
  k=15, d=10:  fK 11(2,15)/66(9,91)   Gap 1(1,15)/0(0,91)     Silh. 15(15,15)/90(85,92) Jump 15(15,15)/90(86,92)  BIC 17(16,18)/86(82,88)   BICedf 15(15,16)/89(86,92)
  k=15, d=15:  fK 2(2,12)/8(7,63)     Gap 1(1,1)/0(0,0)       Silh. 15(15,15)/79(76,81) Jump 15(15,15)/79(75,81)  BIC 15(15,15)/79(76,81)   BICedf 16(15,16)/78(75,80)

Within cluster scale varies:
  k=5,  d=5:   fK 5(2,5)/94(38,99)    Gap 5(5,6)/94(78,99)    Silh. 5(5,5)/96(88,99)    Jump 5(5,29)/94(35,99)    BIC 13(8,19)/69(50,82)    BICedf 5(5,5)/96(90,99)
  k=5,  d=10:  fK 4(2,5)/73(27,97)    Gap 5(5,6)/91(83,97)    Silh. 7(5,13)/87(75,97)   Jump 26(5,29)/59(39,97)   BIC 17(11,23)/64(52,75)   BICedf 5(5,6)/91(85,97)
  k=5,  d=15:  fK 2(2,5)/31(20,82)    Gap 5(5,6)/84(78,93)    Silh. 14(5,28)/73(59,94)  Jump 28(5,30)/54(45,94)   BIC 14(11,17)/64(58,76)   BICedf 5(5,6)/84(78,94)
  k=10, d=5:   fK 9(5,10)/80(45,95)   Gap 10(1,10)/91(0,98)   Silh. 10(10,14)/93(86,98) Jump 10(10,24)/93(71,98)  BIC 19(15,25)/79(71,86)   BICedf 11(10,12)/91(82,96)
  k=10, d=10:  fK 7(2,10)/62(14,92)   Gap 10(1,11)/85(0,92)   Silh. 14(10,24)/85(76,93) Jump 10(10,29)/86(68,93)  BIC 19(13,22)/78(71,85)   BICedf 11(10,12)/86(81,93)
  k=10, d=15:  fK 2(2,8)/14(11,71)    Gap 10(1,10)/76(0,85)   Silh. 23(13,30)/73(66,82) Jump 28(10,30)/69(62,89)  BIC 13(10,15)/77(70,81)   BICedf 11(10,16)/80(72,89)
  k=15, d=5:   fK 13(11,15)/80(70,88) Gap 4(1,15)/26(0,90)    Silh. 15(15,17)/89(85,95) Jump 15(14,18)/89(84,95)  BIC 15(14,15)/85(76,91)   BICedf 16(15,19)/88(81,94)
  k=15, d=10:  fK 11(2,14)/66(10,86)  Gap 1(1,15)/0(0,89)     Silh. 19(15,24)/85(79,91) Jump 15(14,16)/87(77,91)  BIC 20(16,23)/83(77,88)   BICedf 17(15,19)/86(78,90)
  k=15, d=15:  fK 7(2,14)/42(8,72)    Gap 11(1,15)/0(0,82)    Silh. 29(26,30)/74(69,78) Jump 26(15,29)/75(72,82)  BIC 19(16,22)/77(72,81)   BICedf 16(9,21)/77(51,82)

Within cluster shape varies:
  k=5,  d=5:   fK 5(3,5)/98(60,100)   Gap 5(5,6)/98(87,99)    Silh. 5(5,5)/98(95,100)   Jump 24(5,30)/34(26,98)   BIC 15(10,25)/48(31,68)   BICedf 5(5,5)/98(94,100)
  k=5,  d=10:  fK 4(2,5)/76(33,97)    Gap 5(5,8)/94(74,98)    Silh. 5(5,5)/95(92,98)    Jump 28(24,30)/27(25,33)  BIC 20(14,25)/38(29,48)   BICedf 5(5,5)/95(90,98)
  k=5,  d=15:  fK 2(2,5)/30(21,88)    Gap 5(5,5)/86(83,92)    Silh. 5(5,5)/87(83,92)    Jump 29(26,30)/24(22,28)  BIC 18(15,22)/34(29,43)   BICedf 5(5,5)/86(81,92)
  k=10, d=5:   fK 9(7,10)/85(67,97)   Gap 10(1,10)/94(0,97)   Silh. 10(10,10)/95(92,97) Jump 27(10,30)/55(50,94)  BIC 22(17,25)/65(56,79)   BICedf 10(10,11)/94(90,96)
  k=10, d=10:  fK 7(2,10)/67(13,94)   Gap 10(1,10)/90(0,94)   Silh. 10(10,10)/92(88,95) Jump 28(21,30)/51(46,63)  BIC 21(17,25)/63(52,72)   BICedf 11(10,11)/91(86,93)
  k=10, d=15:  fK 2(2,8)/13(11,70)    Gap 6(1,10)/46(0,87)    Silh. 10(10,11)/84(80,88) Jump 28(25,30)/46(42,51)  BIC 19(15,25)/61(52,72)   BICedf 11(10,11)/83(81,87)
  k=15, d=5:   fK 14(10,15)/83(63,90) Gap 6(1,15)/41(0,95)    Silh. 15(14,15)/91(86,96) Jump 15(14,29)/90(70,95)  BIC 22(16,26)/80(72,88)   BICedf 16(15,18)/90(85,95)
  k=15, d=10:  fK 11(2,14)/67(9,84)   Gap 1(1,15)/0(0,91)     Silh. 15(15,15)/91(86,94) Jump 15(15,30)/87(64,94)  BIC 23(18,26)/76(69,85)   BICedf 16(15,17)/88(85,92)
  k=15, d=15:  fK 2(2,12)/8(6,66)     Gap 1(1,15)/0(0,80)     Silh. 15(15,16)/82(78,86) Jump 28(25,29)/62(57,65)  BIC 20(17,23)/72(66,79)   BICedf 16(15,17)/80(76,85)

Table 2: Results from simulated non-Gaussian mixture data sets. The median of the number of clusters selected by each method (k̂) and the corresponding adjusted Rand index (ARI, ×100) are reported, with the 10th and 90th centiles in parentheses. The quantiles are based on the results from 30 data sets generated for each simulation set-up. Highest performances for each scenario are highlighted, as are those which are not significantly different from the highest based on a paired Wilcoxon signed rank test with p-value threshold of 0.01. Each cell gives k̂(q10,q90)/ARI(q10,q90).

Long tails (t3):
  k=5,  d=5:   fK 5(2,5)/92(36,95)    Gap 5(1,5)/92(0,94)     Silh. 6(5,6)/93(90,95)    Jump 5(5,6)/93(90,95)     BIC 21(14,26)/41(34,58)   BICedf 5(5,6)/93(90,95)
  k=5,  d=10:  fK 3(2,5)/52(29,86)    Gap 1(1,5)/0(0,90)      Silh. 6(5,8)/86(83,90)    Jump 27(5,30)/36(32,87)   BIC 9(7,14)/77(62,85)     BICedf 5(5,7)/87(84,90)
  k=5,  d=15:  fK 2(2,4)/26(19,64)    Gap 1(1,1)/0(0,8)       Silh. 6(2,7)/78(23,81)    Jump 28(24,30)/34(30,39)  BIC 13(9,18)/61(43,69)    BICedf 5(1,22)/59(0,81)
  k=10, d=5:   fK 8(2,10)/72(17,90)   Gap 1(1,10)/0(0,90)     Silh. 11(10,12)/90(86,92) Jump 10(10,12)/89(78,92)  BIC 18(12,23)/77(70,89)   BICedf 11(10,12)/90(87,92)
  k=10, d=10:  fK 3(2,9)/15(13,74)    Gap 1(1,2)/0(0,8)       Silh. 13(11,14)/82(80,86) Jump 24(10,30)/68(60,85)  BIC 12(11,14)/83(79,86)   BICedf 11(11,13)/82(80,86)
  k=10, d=15:  fK 2(2,10)/12(10,70)   Gap 1(1,1)/0(0,0)       Silh. 15(11,17)/72(67,75) Jump 28(24,30)/59(53,66)  BIC 15(12,17)/71(67,75)   BICedf 11(10,13)/73(68,75)
  k=15, d=5:   fK 11(2,14)/64(11,82)  Gap 1(1,14)/0(0,83)     Silh. 16(14,18)/85(81,89) Jump 15(14,17)/85(78,89)  BIC 19(16,26)/81(77,88)   BICedf 16(13,18)/85(77,88)
  k=15, d=10:  fK 11(2,15)/60(9,78)   Gap 1(1,1)/0(0,0)       Silh. 19(16,22)/80(76,83) Jump 23(16,28)/78(72,83)  BIC 18(16,22)/80(75,84)   BICedf 17(10,18)/80(48,83)
  k=15, d=15:  fK 13(2,17)/56(7,65)   Gap 1(1,1)/0(0,0)       Silh. 23(19,25)/68(66,74) Jump 28(23,30)/66(62,70)  BIC 19(16,22)/69(65,74)   BICedf 18(15,20)/69(63,72)

Uniform clusters:
  k=5,  d=5:   fK 5(2,5)/100(38,100)  Gap 5(5,5)/100(98,100)  Silh. 5(5,5)/100(98,100)  Jump 5(5,5)/100(99,100)   BIC 25(17,30)/30(25,39)   BICedf 5(5,5)/100(99,100)
  k=5,  d=10:  fK 4(2,5)/76(32,97)    Gap 5(1,5)/96(0,98)     Silh. 5(5,5)/96(94,98)    Jump 5(5,5)/96(95,98)     BIC 18(15,22)/38(31,45)   BICedf 5(5,5)/96(95,98)
  k=5,  d=15:  fK 2(2,5)/28(20,84)    Gap 5(1,5)/86(0,88)     Silh. 5(5,5)/86(84,88)    Jump 5(5,29)/86(20,88)    BIC 13(11,14)/43(39,48)   BICedf 5(1,5)/85(0,88)
  k=10, d=5:   fK 8(2,10)/80(17,99)   Gap 10(1,10)/95(0,99)   Silh. 10(10,10)/99(96,100) Jump 10(10,10)/99(97,100) BIC 21(17,27)/65(53,75)  BICedf 10(10,11)/97(92,99)
  k=10, d=10:  fK 2(2,10)/16(13,95)   Gap 1(1,10)/0(0,95)     Silh. 10(10,10)/94(91,96) Jump 10(10,10)/94(91,96)  BIC 17(14,19)/71(65,79)   BICedf 10(10,11)/93(91,96)
  k=10, d=15:  fK 2(2,2)/12(10,13)    Gap 1(1,2)/0(0,8)       Silh. 10(10,10)/84(81,86) Jump 10(10,10)/83(80,85)  BIC 11(11,13)/79(72,84)   BICedf 10(10,10)/83(79,85)
  k=15, d=5:   fK 12(2,14)/80(13,92)  Gap 14(1,15)/89(0,98)   Silh. 15(14,15)/97(89,98) Jump 15(14,15)/96(90,98)  BIC 23(20,29)/82(68,86)   BICedf 15(15,16)/95(91,98)
  k=15, d=10:  fK 8(2,14)/48(9,89)    Gap 1(1,15)/0(0,93)     Silh. 15(15,15)/93(90,95) Jump 15(14,15)/93(89,94)  BIC 17(16,17)/90(86,92)   BICedf 16(15,16)/92(88,94)
  k=15, d=15:  fK 2(2,13)/8(7,71)     Gap 1(1,1)/0(0,0)       Silh. 15(15,15)/81(78,84) Jump 15(14,15)/81(75,84)  BIC 15(15,15)/81(78,85)   BICedf 15(15,16)/80(76,84)

Non-convex clusters:
  k=5,  d=5:   fK 5(4,5)/82(68,91)    Gap 9(1,11)/68(0,81)    Silh. 6(5,8)/81(71,89)    Jump 26(22,30)/31(28,44)  BIC 22(16,25)/37(31,56)   BICedf 8(6,11)/76(66,85)
  k=5,  d=10:  fK 4(2,5)/50(24,86)    Gap 14(11,18)/49(34,59) Silh. 9(6,12)/68(53,86)   Jump 28(26,30)/27(24,30)  BIC 22(17,30)/31(24,45)   BICedf 9(5,12)/62(42,81)
  k=5,  d=15:  fK 2(2,4)/31(23,60)    Gap 15(1,18)/42(0,55)   Silh. 12(6,16)/50(38,85)  Jump 29(28,30)/25(23,28)  BIC 23(16,29)/30(24,43)   BICedf 9(4,16)/57(43,81)
  k=10, d=5:   fK 10(9,10)/90(85,95)  Gap 10(1,11)/89(0,95)   Silh. 10(10,11)/92(85,95) Jump 10(10,28)/85(57,95)  BIC 19(15,23)/74(63,83)   BICedf 12(11,14)/88(82,93)
  k=10, d=10:  fK 9(8,10)/83(69,95)   Gap 13(1,16)/79(0,87)   Silh. 10(10,14)/91(81,96) Jump 28(26,30)/57(51,60)  BIC 22(16,30)/65(51,81)   BICedf 14(12,21)/85(66,91)
  k=10, d=15:  fK 9(8,10)/82(65,91)   Gap 16(10,19)/74(49,85) Silh. 12(10,13)/88(78,96) Jump 29(26,30)/51(48,56)  BIC 22(17,27)/61(52,72)   BICedf 13(11,17)/81(64,91)
  k=15, d=5:   fK 14(12,15)/88(79,94) Gap 15(1,16)/90(0,96)   Silh. 15(15,16)/93(90,97) Jump 16(15,27)/92(71,97)  BIC 20(16,25)/85(76,92)   BICedf 17(16,19)/91(84,94)
  k=15, d=10:  fK 14(13,15)/93(84,98) Gap 15(12,18)/94(74,99) Silh. 15(15,17)/97(91,99) Jump 28(26,30)/76(70,80)  BIC 20(15,26)/89(75,95)   BICedf 18(15,21)/92(86,97)
  k=15, d=15:  fK 15(12,15)/92(79,99) Gap 16(1,17)/94(0,97)   Silh. 15(15,16)/98(95,99) Jump 28(27,30)/73(70,79)  BIC 23(17,26)/83(74,95)   BICedf 17(16,19)/94(88,97)

Table 3 shows the results of these experiments. The numbers in brackets indicate the true number of clusters, k. For each method the selected number of clusters, k̂, and the adjusted Rand index are reported (two of the data sets offer multiple "ground truth" label sets; the table shows the average performance of each method over the different label sets). For each data set we have also included the "Ideal" k-means solution, which corresponds with the solution that attains the highest ARI value. We find this to be pertinent since when the data distribution deviates substantially from the k-means assumptions it may be that the best k-means solution does not contain the same number of clusters as the ground truth. Furthermore, although it is unlikely that there exists a method which will reliably select the ideal solution, it is also very likely that there exists, theoretically, a method which performs better than any of the methods considered herein. Comparing with the ideal performance therefore gives a bound on how much better it is possible to perform with any model selection technique for k-means. The ideal performance also gives us some indication of the difficulty of the clustering problem. Two immediate take-aways from the table are that the fK method selected two clusters in almost all cases, while the Jump Statistic dramatically over-estimated the number of clusters in all but a few instances. The BIC with naïve setting of the degrees of freedom also over-estimates the number of clusters considerably in general, but not by so large a margin as the Jump Statistic. The Silhouette Index, Gap Statistic and the BIC with the effective degrees of freedom all perform quite consistently well. To better illustrate the overall performance of the methods on these data sets, the results of Table 3 are summarised in Figure 5. The figure shows boxplots of the ARI performance regret, when compared to the ideal performance, normalised for difficulty. That is, for a method M and data set X, the normalised regret is given by

\frac{ARI(\mathrm{Ideal}(X)) - ARI(\mathcal{M}(X))}{ARI(\mathrm{Ideal}(X))}.

The figure also shows the mean of the normalised regret for each method, indicated by a red dot. Here we see that the Gap Statistic and the BIC using the proposed estimate of effective degrees of freedom perform substantially better than the other methods, in general. While the Silhouette Index yields a similar median performance, its instances of poor performance are considerably worse than those of the Gap and proposed BIC variant.

Given the variety and number of the data sets used in these experiments, there is strong evidence that the proposed estimation procedure for the effective degrees of freedom leads to selection of models which enjoy very strong performance when compared with existing techniques.

Table 3: Results from publicly available benchmark data sets. Number of clusters selected by each method (k̂) and corresponding adjusted Rand index (ARI) are reported.

Data set (k)          fK        Gap       Silh.     Jump      BIC       BICedf    Ideal
Wine (3)              2 0.37    3 0.9     3 0.9     30 0.13   11 0.35   3 0.9     3 0.9
Seeds (3)             2 0.48    3 0.77    2 0.48    29 0.12   17 0.2    3 0.77    3 0.77
Ionosphere (2)        2 0.17    8 0.17    4 0.28    30 0.11   12 0.12   4 0.28    3 0.29
Votes (2)             2 0.57    7 0.21    2 0.57    29 0.06   14 0.1    4 0.32    2 0.57
Iris (3)              2 0.57    3 0.62    2 0.57    27 0.14   14 0.3    3 0.62    3 0.62
Libras (15)           2 0.07    13 0.31   18 0.31   29 0.29   16 0.32   16 0.32   20 0.34
Heart (2)             2 0.34    2 0.34    5 0.29    29 0.04   5 0.29    5 0.29    2 0.34
Glass (6)             2 0.19    9 0.17    2 0.19    29 0.13   13 0.24   4 0.2     5 0.24
Mammography (2)       2 0.39    3 0.31    3 0.31    25 0.05   11 0.13   4 0.31    2 0.39
Parkinsons (2)        2 -0.1    7 0.07    2 -0.1    30 0.03   12 0.05   10 0.04   6 0.12
Yeast (5)             2 0.42    8 0.4     2 0.42    29 0.14   10 0.39   12 0.36   4 0.57
Forest (4)            2 0.18    5 0.39    2 0.18    30 0.15   19 0.2    12 0.28   4 0.45
Breast Cancer (2)     2 0.82    9 0.38    2 0.82    30 0.15   17 0.34   4 0.76    2 0.82
Dermatology (6)       2 0.21    6 0.7     3 0.57    28 0.26   9 0.65    6 0.7     5 0.84
Synth (6)             2 0.27    8 0.67    2 0.27    30 0.35   10 0.65   10 0.65   8 0.67
Soy Bean (19)         2 0.05    16 0.43   2 0.05    30 0.42   16 0.43   16 0.43   18 0.56
Olive Oil (3/9)       2 0.4     10 0.49   5 0.67    30 0.19   18 0.17   9 0.5     5 0.67
Bank (2)              2 0.01    3 0.06    18 0.1    26 0.09   24 0.09   21 0.1    5 0.21
Optidigits (10)       2 0.13    17 0.57   20 0.6    30 0.47   18 0.65   18 0.65   18 0.65
Image Seg (7)         2 0.17    14 0.46   6 0.48    28 0.3    14 0.46   14 0.46   9 0.51
MF Digits (10)        2 0.15    18 0.62   9 0.65    1 0       20 0.59   20 0.59   11 0.68
Satellite (6)         3 0.29    12 0.41   3 0.29    30 0.25   16 0.35   16 0.35   7 0.56
Texture (11)          2 0.11    23 0.41   2 0.11    30 0.41   30 0.41   30 0.41   11 0.5
Pen Digits (10)       2 0.13    30 0.45   8 0.45    28 0.46   30 0.45   21 0.53   14 0.64
Phoneme (5)           2 0.16    11 0.45   2 0.16    1 0       21 0.28   21 0.28   5 0.64
Frogs (4/8/10)        2 0.46    17 0.21   3 0.5     25 0.14   17 0.15   15 0.24   4 0.57
Auto (3)              2 -0.04   4 0.13    2 -0.04   26 0.05   24 0.03   4 0.13    5 0.16
Yeast UCI (10)        7 0.19    1 0       6 0.11    9 0.18    4 0.1     9 0.18    7 0.19

[Figure 5: Boxplots of normalised regret (when compared with the ideal performance) for, from left to right, Jump, fK, BIC, Silh., Gap and BICedf. Mean normalised regret is indicated in each case by a red dot. Plots not reproduced here.]

5 Discussion

This work investigated the effective degrees of freedom in the k-means model. We argued that the degrees of freedom estimate based on the number of explicitly estimated parameters is an inappropriate pairing with the so-called classification likelihood for performing model selection for k-means. This is because the classification likelihood assumes the clustering assignment forms part of the estimation, but this added estimation is not accounted for in the model dimension. The proposed formulation accommodates the uncertainty of the class assignments in the degrees of freedom, where an extension of Stein's lemma showed that these uncertainties are appropriately accommodated by considering the size and location of the discontinuities in the k-means model, which correspond precisely to the reassignments of points to different clusters. Evaluating the new degrees of freedom expression is challenging, however a few simplifications allowed us to approximate this value in practice. The approximation was validated through model selection within the Bayesian Information Criterion. Experiments using simulated data, as well as a large collection of publicly available benchmark data sets, suggest that this approach is competitive with popular existing methods for model selection in k-means clustering.

Proofs

Proof of Lemma 1

Let X ~ N(µ, σ²) and consider any g : R → R which is Lipschitz on (−∞, δ) and (δ, ∞) for some δ ∈ R. For each ε > 0 define

g_\varepsilon(x) = \begin{cases} g(x), & x \notin B_\varepsilon(\delta) \\ g(\delta-\varepsilon) + \left[ g(\delta+\varepsilon) - g(\delta-\varepsilon) \right] \frac{x - (\delta-\varepsilon)}{2\varepsilon}, & x \in B_\varepsilon(\delta), \end{cases}

where B_ε(δ) = (δ − ε, δ + ε). Then g_ε is Lipschitz by construction and so by (Candes et al., 2013, Lemma 3.2) we know g_ε is almost differentiable and E[g′_ε(X)²] < ∞, and so by (Stein, 1981, Lemma 2) we have

\frac{1}{\sigma^2} E[(X - \mu) g_\varepsilon(X)] = E[g'_\varepsilon(X)].

But

E[g'_\varepsilon(X)] = E[g'_\varepsilon(X) \mid X \notin B_\varepsilon(\delta)] P(X \notin B_\varepsilon(\delta)) + E[g'_\varepsilon(X) \mid X \in B_\varepsilon(\delta)] P(X \in B_\varepsilon(\delta))
= E[g'(X) \mid X \notin B_\varepsilon(\delta)] P(X \notin B_\varepsilon(\delta)) + \frac{g(\delta+\varepsilon) - g(\delta-\varepsilon)}{2\varepsilon} P(X \in B_\varepsilon(\delta)).

Taking the limit as ε → 0+ gives

\frac{1}{\sigma^2} E[(X - \mu) g(X)] = E[g'(X)] + \left( \lim_{\gamma \downarrow \delta} g(\gamma) - \lim_{\gamma \uparrow \delta} g(\gamma) \right) \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2\sigma^2}(\delta-\mu)^2},

as required. The extension to any g with finitely many such discontinuity points arises from a very simple induction.

We therefore have, for any i, j, that

\frac{1}{\sigma^2} E\left[ (X_{i,j} - \mu_{i,j}) f(X)_{i,j} \,\middle|\, \{X_{k,l}\}_{(k,l) \ne (i,j)} \right] = E\left[ \frac{\partial}{\partial X_{i,j}} f(X)_{i,j} \,\middle|\, \{X_{k,l}\}_{(k,l) \ne (i,j)} \right] + \sum_{\delta : X_{i,j}+\delta \in D(\{X_{k,l}\}_{(k,l) \ne (i,j)})} \frac{1}{\sigma} \phi\left( \frac{X_{i,j}+\delta-\mu_{i,j}}{\sigma} \right) \lim_{\gamma \downarrow\uparrow \delta} f(X + \gamma e_{i,j})_{i,j}.

The result follows from the law of total expectation. □

Proof of Lemma 2

Notice that the discontinuities in M(X)_{i,j} can occur only when there is a change in the assignment of one of the observations. If this occurs at the point X + δe_{i,j}, then it is straightforward to show that

\left| \lim_{\gamma \downarrow\uparrow \delta} \mathcal{M}(X + \gamma e_{i,j})_{i,j} \right| \le \mathrm{Diam}(X) + C|\delta|,

where Diam(X) is the diameter of the rows of X and C is a constant independent of X. There are also clearly finitely many such discontinuities since there are finitely many cluster solutions arising from n data, i.e.,

|D(\{X_{k,l}\}_{(k,l) \ne (i,j)})| \le A,

for some constant A independent of i, j, X. Furthermore |M(X + γe_{i,j})_{i,j} − M(X)_{i,j}| ≤ γ as long as all cluster assignments remain the same, and hence M(X + γe_{i,j})_{i,j} is Lipschitz as a function of γ between points of discontinuity. Finally,

E\left[ \left| \sum_{\delta : X_{i,j}+\delta \in D(\{X_{k,l}\}_{(k,l) \ne (i,j)})} \phi\left( \frac{X_{i,j}+\delta-\mu_{i,j}}{\sigma} \right) \lim_{\gamma \downarrow\uparrow \delta} \mathcal{M}(X + \gamma e_{i,j})_{i,j} \right| \right]
\le E\left[ \sum_{\delta : X_{i,j}+\delta \in D(\{X_{k,l}\}_{(k,l) \ne (i,j)})} \phi\left( \frac{X_{i,j}+\delta-\mu_{i,j}}{\sigma} \right) \left( \mathrm{Diam}(X) + C|\delta| \right) \right]
\le \frac{A}{\sqrt{2\pi}} \left( E[\mathrm{Diam}(X)] + C\, E\left[ |X_{i,j}-\mu_{i,j}| + (X_{i,j}-\mu_{i,j})^2 + 4\sigma^2 \right] \right),

since φ((a−δ)/σ)|δ| is maximised by a δ satisfying |δ| ≤ (|a| + √(a² + 4σ²))/2, and φ is bounded above by 1/√(2π). Now, the tail of the distribution of Diam(X) is similar to that of the distribution of the maximum of n χ random variables with d degrees of freedom. Therefore E[Diam(X)] is clearly finite. The second term above is clearly finite, since X_{i,j} − µ_{i,j} is normally distributed, and hence the expectation in Lemma 2 exists and is finite. □

References

Hirotogu Akaike. Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike, pages 199-213. Springer, 1998.

K. Bache and M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Emmanuel J. Candes, Carlos A. Sing-Long, and Joshua D. Trzasko. Unbiased risk estimates for singular value thresholding and spectral estimators. IEEE Transactions on Signal Processing, 61(19):4643-4657, 2013.

Gilles Celeux and Gerard Govaert. A classification EM algorithm for clustering and two stochastic versions. Computational Statistics & Data Analysis, 14(3):315-332, 1992.

Bradley Efron. How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association, 81(394):461-470, 1986.

Chris Fraley and Adrian E. Raftery. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458):611-631, 2002.

Greg Hamerly and Charles Elkan. Learning the k in k-means. In Advances in Neural Information Processing Systems, pages 281-288, 2004.

David P. Hofmeyr. Improving spectral clustering using the asymptotic value of the normalized cut. Journal of Computational and Graphical Statistics, pages 1-13, 2019.

Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193-218, 1985.

Ke Jiang, Brian Kulis, and Michael I. Jordan. Small-variance asymptotics for exponential family Dirichlet process mixture models. In Advances in Neural Information Processing Systems, pages 3158-3166, 2012.

Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis, volume 344. John Wiley & Sons, 2009.

Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 1st edition, 2008.

Dan Pelleg, Andrew W. Moore, et al. X-means: Extending k-means with efficient estimation of the number of clusters. In ICML, volume 1, pages 727-734, 2000.

Duc Truong Pham, Stefan S. Dimov, and Cuong Du Nguyen. Selection of k in k-means clustering. Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, 219(1):103-119, 2005.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013. URL http://www.R-project.org/.

Stephen A. Ramsey, Sandy L. Klemm, Daniel E. Zak, Kathleen A. Kennedy, Vesteinn Thorsson, Bin Li, Mark Gilchrist, Elizabeth S. Gold, Carrie D. Johnson, Vladimir Litvak, et al. Uncovering a macrophage transcriptional program by integrating evidence from motif scanning and expression dynamics. PLoS Computational Biology, 4(3):e1000021, 2008.

William M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846-850, 1971.

Gideon Schwarz et al. Estimating the dimension of a model. The Annals of Statistics, 6(2):461-464, 1978.

Charles M. Stein. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, pages 1135-1151, 1981.

Catherine A. Sugar and Gareth M. James. Finding the number of clusters in a dataset: An information-theoretic approach. Journal of the American Statistical Association, 98(463):750-763, 2003.

Robert Tibshirani, Guenther Walther, and Trevor Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411-423, 2001.

Ryan J. Tibshirani. Degrees of freedom and model search. Statistica Sinica, pages 1265-1296, 2015.
