

PUB. IRMA, LILLE 2011

Vol. 71, No VII

Model-based clustering of functional data∗

Julien Jacques a,b,c, Cristian Preda a,b,c

Abstract

Model-based clustering for functional data is considered. An alternative to model-based clustering on the functional principal components is proposed, based on an approximation of the density of functional random variables. The EM algorithm is used for parameter estimation and the maximum a posteriori rule provides the clusters. A simulation study and a real-data application illustrate the interest of the proposed methodology.

Résumé

This paper deals with the clustering of functional data. We propose a mixture-model-based procedure, defined from an approximation of the notion of density of a functional random variable. Maximum likelihood estimation of the parameters is carried out with the EM algorithm, and classification is performed by the maximum a posteriori rule. Studies on simulated and real data illustrate the interest of the proposed methodology in comparison with classical approaches.

MSC 2009 subject classifications. 62H30, 62H25, 62M10.

Key words and phrases. Functional data, functional principal component analysis, model-based clustering, random function density, EM algorithm.

∗Preprint. a Laboratoire P. Painlevé, UMR 8524 CNRS, Université Lille I, Bât. M2, Cité Scientifique, F-59655 Villeneuve d'Ascq Cedex, France; b MODAL, INRIA Lille-Nord Europe; c Polytech'Lille.


1 Introduction

Let X be a functional random variable with values in a functional space F. For instance, we take F to be the space of square integrable functions L2([0, T]), T > 0, and X an L2-continuous stochastic process, X = {Xt, t ∈ [0, T]}. Let X1, . . . , Xn be an i.i.d. sample of size n from the same probability distribution as X. Known as functional data (see [16]), the observations of the Xi's correspond to n curves belonging to F.

The aim of model-based clustering is to identify homogeneous groups of data from a mixture density model. More precisely, model-based clustering predicts an unobserved indicator vector Z = (Z1, . . . , ZK) of the K clusters such that, conditionally on belonging to the gth group (Zg = 1), the Xi's come from a common distribution f indexed by group-specific parameters, f(θg).

In the finite-dimensional setting (see for instance [2, 6]), the multivariate probability density function is the main tool for estimating such a model. For functional data, the notion of probability density is not well defined because of the infinite dimension of the data. To overcome this difficulty, a pragmatic solution consists in applying classical clustering tools, designed for the finite-dimensional setting, to the coefficients of the expansion of X on some finite basis of functions. The main drawback of such a method is that the basis expansion is built independently of the clustering objective. Recent works [10, 4] overcome this problem by defining a basis expansion specific to each cluster.

Our work is based on the idea developed in [7], where a "surrogate density" for X is proposed using the Karhunen-Loève expansion (or principal component analysis (PCA)):

X(t) = \mu(t) + \sum_{j=1}^{\infty} C_j \psi_j(t),   (1)

where µ is the mean function of X, C_j = \int_0^T (X_t - \mu(t)) \psi_j(t) \, dt, j ≥ 1, are zero-mean random variables (called principal components) and the ψ_j's form an orthonormal system of eigenfunctions of the covariance operator of X:

\int_0^T \mathrm{Cov}(X_t, X_s) \psi_j(s) \, ds = \lambda_j \psi_j(t), \quad \forall t \in [0, T].

Notice that the principal components C_j are uncorrelated random variables of variance λ_j. Considering the principal components indexed in descending order of the eigenvalues (λ_1 ≥ λ_2 ≥ . . .), let X^{(q)} denote the approximation of X obtained by truncating (1) at the first q terms, q ≥ 1,

X^{(q)}(t) = \mu(t) + \sum_{j=1}^{q} C_j \psi_j(t).   (2)

Page 3: Model-based clustering of functional datamathematiques.univ-lille1.fr/digitalAssets/29/29431_71-7.pdf · Model-based clustering of functional data Julien Jacques a , b , c, Cristian

VII � 3

Then X^{(q)} is the best approximation of X, under the mean square criterion, among all approximations of the same type (linear combinations of deterministic functions of t with random coefficients, [17]). Denoting by ‖·‖ the usual norm on L2([0, T]), we have

E(\|X - X^{(q)}\|^2) = \sum_{j \geq q+1} \lambda_j \quad \text{and} \quad \|X - X^{(q)}\| \xrightarrow[q \to \infty]{m.s.} 0.   (3)
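To make the truncation concrete: when curves are observed on a fine equidistant grid, the eigen-decomposition and the scores can be approximated by discretizing the covariance operator with a rectangle rule. The following Python/NumPy sketch is our own illustration; function and variable names are not from the paper.

```python
import numpy as np

def fpca_truncation(X, t, q):
    """Empirical Karhunen-Loeve truncation X^(q) of curves sampled on an
    equidistant grid t; X is the n x p matrix of curve values."""
    dt = t[1] - t[0]                     # rectangle-rule quadrature weight
    mu = X.mean(axis=0)                  # estimate of the mean function
    Xc = X - mu
    G = (Xc.T @ Xc) / X.shape[0]         # discretized covariance Cov(X_t, X_s)
    vals, vecs = np.linalg.eigh(G * dt)  # eigen-pairs of the integral operator
    idx = np.argsort(vals)[::-1][:q]     # keep the q largest eigenvalues
    lam = vals[idx]
    psi = vecs[:, idx] / np.sqrt(dt)     # eigenfunctions, L2-normalized
    C = Xc @ psi * dt                    # principal component scores c_j(x)
    Xq = mu + C @ psi.T                  # truncated expansion (2)
    return Xq, C, lam, psi
```

The residual variance of the truncation is then approximately the sum of the discarded eigenvalues, as in (3).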

Without loss of generality, we will suppose in the following that X is a zero-mean stochastic process, i.e. µ(t) = 0, ∀t ∈ [0, T].

Based on the approximation of X by X^{(q)}, [7] show that the probability that X belongs to a ball of radius h centered at x ∈ L2([0, T]) can be written as

\log P(\|X - x\| \leq h) = \sum_{j=1}^{q} \log f_{C_j}(c_j(x)) + \xi(h, q(h)) + o(q(h)),   (4)

where f_{C_j} is the probability density of C_j and c_j(x) = \langle x, \psi_j \rangle_{L^2} is the jth principal component score of x. The functions q(h) and ξ are such that q(h) grows to infinity as h decreases to zero and ξ is a constant depending on h and q(h).

Equality (4) suggests using the multivariate probability density of the vector C^{(q)} = (C_1, . . . , C_q) as an approximation of the "density" of X. Moreover, observe that we have, ∀h > 0, x ∈ L2([0, T]),

P(\|X^{(q)} - x\| \leq h - \|X - X^{(q)}\|) \leq P(\|X - x\| \leq h) \leq P(\|X^{(q)} - x\| \leq h + \|X - X^{(q)}\|).   (5)

Relations (3) and (5) also suggest that the probability P(‖X − x‖ ≤ h) can be approximated by P(‖X^{(q)} − x‖ ≤ h).

Let f^{(q)}_X denote the joint probability density of C^{(q)}. If x = \sum_{j \geq 1} c_j(x) \psi_j and x^{(q)} = \sum_{j=1}^{q} c_j(x) \psi_j, then

P(\|X^{(q)} - x\| \leq h) = \int_{D^{(q)}_x} f^{(q)}_X(y) \, dy,   (6)

where D^{(q)}_x = \{ y \in \mathbb{R}^q : \|y - x^{(q)}\|_{\mathbb{R}^q} \leq \sqrt{h^2 - \sum_{j \geq q+1} c_j^2(x)} \}.

When X is a Gaussian process, the principal components C_j are Gaussian and independent. The density f^{(q)}_X is then

f^{(q)}_X(x) = \prod_{j=1}^{q} f_{C_j}(c_j(x)).   (7)

We use the functional defined by (7) to develop our model-based clustering methodology for functional data. Our approach differs from the one consisting in performing classical model-based clustering on the first q principal components of X.

The paper is organized as follows. In Section 2 we define the model underlying the functional data, describe the parameter estimation procedure for the model-based clustering, and discuss the choice of the approximation order q and the definition of the clustering rule. In Section 3 we present a simulation study as well as an application to real data (Danone) and compare our results with those provided by other clustering methods.

2 Model-based clustering for functional data

In the following we suppose that X is a zero-mean Gaussian stochastic process. Let X = (X1, ..., Xn) be an i.i.d. sample of size n of X and Z be a latent categorical random variable of dimension K, 1 ≤ K < ∞, associated with the K clusters the Xi's belong to. For each i = 1, . . . , n, associate with Xi the corresponding categorical variable Zi indicating the group Xi belongs to: Zi = (Zi,1, . . . , Zi,K) ∈ {0, 1}^K is such that Zi,g = 1 if Xi belongs to cluster g, 1 ≤ g ≤ K, and 0 otherwise.

In a clustering setting, the variables Xi are observed but the Zi are not. The goal is to predict the Zi's knowing the Xi's. For this, we define a parametric mixture model based on the approximation (7) of the density of a random function.

2.1 The mixture model

Assume that each pair (Xi, Zi) is an independent realization of the random vector (X, Z), where X has an approximated density depending on the group it belongs to:

f^{(q_g)}_{X|Z_g=1}(x; \Sigma_g) = \prod_{j=1}^{q_g} f_{C_j|Z_g=1}(c_{j,g}(x); \sigma^2_{j,g}),

where q_g is the number of first principal components retained in the approximation (7) for group g, c_{j,g}(x) is the jth principal component score of X|Z_g=1 for X = x, f_{C_j|Z_g=1} its probability density and Σ_g the diagonal matrix diag(σ^2_{1,g}, . . . , σ^2_{q_g,g}).

Conditionally on the group, the probability density f_{C_j|Z_g=1} of the jth principal component of X is the univariate Gaussian density with zero mean (the principal components are centered) and variance σ^2_{j,g}. The vector Z = (Z_1, . . . , Z_K) is assumed to follow a multinomial distribution of order one,

Z \sim \mathcal{M}_1(\pi_1, \ldots, \pi_K),

with π_1, . . . , π_K the mixing probabilities (\sum_{g=1}^{K} \pi_g = 1). Under this model it follows that the unconditional (approximated) density of X is given by

f^{(q)}_X(x; \theta) = \sum_{g=1}^{K} \pi_g \prod_{j=1}^{q_g} f_{C_j|Z_g=1}(c_{j,g}(x); \sigma^2_{j,g}),   (8)


where θ = (π_g, σ^2_{1,g}, . . . , σ^2_{q_g,g})_{1≤g≤K} has to be estimated and q = (q_1, . . . , q_K). As in the finite-dimensional setting, we define an approximated likelihood of the sample of curves X by

l^{(q)}(\theta; X) = \prod_{i=1}^{n} \sum_{g=1}^{K} \pi_g \prod_{j=1}^{q_g} \frac{1}{\sqrt{2\pi}\,\sigma_{j,g}} \exp\left( -\frac{1}{2} \left( \frac{C_{i,j,g}}{\sigma_{j,g}} \right)^2 \right),   (9)

where C_{i,j,g} is the jth principal component score of the curve X_i in group g.
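For illustration, (9) can be evaluated numerically from the per-group score matrices, working in log scale for stability. A minimal sketch under the model's Gaussian assumption; the container names scores, sigma2 and props are hypothetical, not from the paper.

```python
import numpy as np

def approx_loglik(scores, sigma2, props):
    """log l^(q)(theta; X) of (9). scores[g]: n x q_g matrix of scores
    C_{i,j,g}; sigma2[g]: length-q_g positive variances; props: mixing
    proportions pi_1, ..., pi_K."""
    n, K = scores[0].shape[0], len(props)
    logd = np.empty((n, K))
    for g in range(K):
        s2 = np.asarray(sigma2[g])[None, :]
        # sum over j of log N(C_{i,j,g}; 0, sigma^2_{j,g}), plus log pi_g
        logd[:, g] = np.log(props[g]) - 0.5 * (
            np.log(2 * np.pi * s2) + scores[g] ** 2 / s2).sum(axis=1)
    m = logd.max(axis=1, keepdims=True)  # log-sum-exp over the K groups
    return float((m[:, 0] + np.log(np.exp(logd - m).sum(axis=1))).sum())
```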

2.2 Parameter estimation

In the unsupervised context, the estimation of the mixture model parameters is not as straightforward as in the supervised context, since the group indicators Zi are unknown. On the one hand, we need an iterative algorithm which alternates the estimation of the group indicators, the estimation of the PCA scores for each group, and the estimation of the mixture model parameters. On the other hand, the parameter q must be estimated by an empirical method, similar to those used to select the number of components in usual PCA.

2.2.1 Mixture model and component scores estimation

A classical way to maximize a mixture model likelihood when data are missing (here the cluster indicators Zi) is the iterative EM algorithm [8, 12, 13]. In this work we use an EM-like algorithm for the maximization of the approximated likelihood (9). This algorithm includes, between the standard E and M steps, a step in which the principal component scores of each group are updated. The EM algorithm consists in maximizing the approximated completed log-likelihood

L^{(q)}_c(\theta; X, Z) = \sum_{i=1}^{n} \sum_{g=1}^{K} Z_{i,g} \left( \log \pi_g + \sum_{j=1}^{q_g} \log f_{C_j|Z_g=1}(C_{i,j,g}) \right),

which is known to be easier to maximize than its incomplete version (9), and leads to the same estimate. Let θ^{(h)} be the current value of the estimated parameter at step h, h ≥ 1.

E step. As the group indicators Z_{i,g} are unknown, the E step consists in computing the conditional expectation of the approximated completed log-likelihood:

Q(\theta; \theta^{(h)}) = E_{\theta^{(h)}}[L^{(q)}_c(\theta; X, Z) \mid X = x] = \sum_{i=1}^{n} \sum_{g=1}^{K} t_{i,g} \left( \log \pi_g + \sum_{j=1}^{q_g} \log f_{C_j|Z_g=1}(c_{i,j,g}) \right),


where t_{i,g} is the probability that the curve X_i belongs to group g conditionally on C_{i,j,g} = c_{i,j,g}:

t_{i,g} = E_{\theta^{(h)}}[Z_{i,g} \mid X = x] \simeq \frac{\pi_g \prod_{j=1}^{q_g} f_{C_j|Z_g=1}(c_{i,j,g})}{\sum_{l=1}^{K} \pi_l \prod_{j=1}^{q_l} f_{C_j|Z_l=1}(c_{i,j,l})}.   (10)

The approximation in (10) is due to the use of an approximation of the density of X.
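Numerically, the t_{i,g} are obtained by normalizing the same per-group log-densities over groups. A sketch consistent with the previous one (same hypothetical containers):

```python
import numpy as np

def e_step(scores, sigma2, props):
    """Posterior probabilities t_{i,g} of (10), computed in log scale for
    numerical stability; returns an n x K matrix whose rows sum to one."""
    n, K = scores[0].shape[0], len(props)
    logd = np.empty((n, K))
    for g in range(K):
        s2 = np.asarray(sigma2[g])[None, :]
        logd[:, g] = np.log(props[g]) - 0.5 * (
            np.log(2 * np.pi * s2) + scores[g] ** 2 / s2).sum(axis=1)
    logd -= logd.max(axis=1, keepdims=True)  # stabilize before exponentiating
    t = np.exp(logd)
    return t / t.sum(axis=1, keepdims=True)  # normalize over groups
```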

Principal score updating step. The computation of the FPCA eigenfunctions and scores within a given cluster follows [16]. In general, this computation needs some approximation. The most usual one is to assume that the curves admit an expansion into a basis of functions φ = (φ_1, . . . , φ_L). Let Γ be the n × L matrix of expansion coefficients and W = \int \phi \phi' the matrix of inner products between the basis functions. Here, the principal component scores C_{i,j,g} of the curve X_i in group g are updated depending on the current conditional probabilities t_{i,g} computed in the previous E step. This is carried out by weighting the importance of each curve in the construction of the principal components with the conditional probabilities T_g = diag(t_{1,g}, . . . , t_{n,g}). Consequently, the first step consists in centering the curve X_i within group g by subtracting the mean curve computed using the t_{i,g}'s. The principal component scores C_{i,j,g} are then given by

C_{i,j,g} = \lambda_{j,g}^{-1/2} \, \gamma_{i,g} W \beta_{j,g},

where β_{j,g} = W^{-1/2} u_{j,g}, with u_{j,g} and λ_{j,g} the jth eigenvector and eigenvalue, respectively, of the matrix n^{-1} W^{1/2} \Gamma' T_g \Gamma W^{1/2}, and γ_{i,g} the ith row of the group-centered coefficient matrix Γ.
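A sketch of this weighted FPCA update, assuming Γ has already been centered within the group with the t_{i,g}-weighted mean; the symmetric square root W^{1/2} is obtained by eigendecomposition, and all names are illustrative:

```python
import numpy as np

def weighted_fpca(Gamma, W, t_g):
    """Scores C_{i,j,g} = lambda^{-1/2} gamma_i W beta_j for one cluster.
    Gamma: n x L group-centered expansion coefficients; W: L x L Gram
    matrix of the basis; t_g: conditional probabilities t_{i,g}."""
    n = Gamma.shape[0]
    w, V = np.linalg.eigh(W)                # W = V diag(w) V'
    Whalf = V @ np.diag(np.sqrt(w)) @ V.T   # symmetric square root W^{1/2}
    M = Whalf @ Gamma.T @ np.diag(t_g) @ Gamma @ Whalf / n
    lam, U = np.linalg.eigh(M)
    order = np.argsort(lam)[::-1]           # descending eigenvalues
    lam, U = lam[order], U[:, order]
    beta = V @ np.diag(1.0 / np.sqrt(w)) @ V.T @ U   # beta_j = W^{-1/2} u_j
    # guard against numerically zero eigenvalues before rescaling
    scores = Gamma @ W @ beta / np.sqrt(np.maximum(lam, 1e-12))
    return scores, lam, beta
```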

Group-specific dimension q_g estimation step. The estimation of the group-specific dimension q_g is an open problem; it cannot be solved by such a likelihood-based method. Indeed, the approximation (7) of the density is the product of the densities of the first q principal component scores. Therefore, when the densities of the principal components are not too peaked (variance lower than (2π)^{-1} for Gaussian densities), their values are lower than 1, and the likelihood necessarily decreases as q grows. In this work we propose to use, once the group-specific FPCA has been computed, classical empirical criteria such as the proportion of explained variance or the scree test of Cattell [5] in order to select each group-specific dimension q_g.
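With the eigenvalues λ_{1,g} ≥ λ_{2,g} ≥ . . . at hand, the explained-variance criterion reduces to a few lines; the threshold value below is illustrative:

```python
import numpy as np

def select_qg(lam, threshold=0.90):
    """Smallest q_g whose leading eigenvalues explain at least `threshold`
    of the total variance (Cattell's scree test is an alternative)."""
    ratio = np.cumsum(lam) / np.sum(lam)
    return int(np.searchsorted(ratio, threshold) + 1)
```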

M step. The M step consists in computing the mixture model parameters θ^{(h+1)} maximizing Q(θ; θ^{(h)}). It simply leads to the following estimators:

\pi_g^{(h+1)} = \frac{1}{n} \sum_{i=1}^{n} t_{i,g} \quad \text{and} \quad (\sigma^2_{j,g})^{(h+1)} = \lambda_{j,g}, \quad 1 \leq j \leq q_g,


where λ_{j,g} is the variance of the jth principal component of cluster g, already computed in the principal score updating step.

The EM algorithm stops when the difference between the approximated likelihood values of two consecutive steps is lower than a given threshold ε (typically ε = 10^{-6}).
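Putting the pieces together, one iteration of the EM-like algorithm alternates the per-group score update, the M step and the E step until the approximated likelihood stabilizes. A schematic driver reusing the hypothetical helpers sketched above (initialization and data preparation are simplified, not the paper's implementation):

```python
import numpy as np

def em_funclust(Gamma, W, K, threshold=0.90, eps=1e-6, max_iter=200):
    """Skeleton of the EM-like algorithm of Section 2.2.1, relying on the
    sketches e_step, weighted_fpca, select_qg and approx_loglik above."""
    n = Gamma.shape[0]
    rng = np.random.default_rng(0)
    t = rng.dirichlet(np.ones(K), size=n)     # soft random initial partition
    old = -np.inf
    for _ in range(max_iter):
        scores, sigma2 = [], []
        props = t.mean(axis=0)                # M step: pi_g = (1/n) sum_i t_ig
        for g in range(K):
            # center curves within group g with the t_{i,g}-weighted mean
            mean_g = (t[:, [g]] * Gamma).sum(axis=0) / t[:, g].sum()
            S, lam, _ = weighted_fpca(Gamma - mean_g, W, t[:, g])
            qg = select_qg(lam, threshold)    # group-specific dimension
            scores.append(S[:, :qg])          # principal score update step
            sigma2.append(lam[:qg])           # M step: sigma^2_{j,g} = lambda_{j,g}
        ll = approx_loglik(scores, sigma2, props)
        if abs(ll - old) < eps:               # stopping criterion
            break
        old = ll
        t = e_step(scores, sigma2, props)     # E step
    return t, props, scores, sigma2
```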

2.2.2 Model selection

We have provided an EM procedure for fitting the model-based clustering of functional data. However, there remains a discrete parameter to estimate: the number K of clusters. We propose to use an approximation of the BIC criterion [18], built from the approximated log-likelihood (9):

BIC^{(q)} = 2 \log l^{(q)}(\hat{\theta}; X) - \nu \log n,

where ν = 2K − 1 is the number of parameters of the model (mixing proportions and principal score variances) and l^{(q)}(\hat{\theta}; X) is the maximum achieved by the likelihood. The number K of clusters maximizing this criterion could be an appropriate choice.
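The computation is immediate, with ν counted as in the text:

```python
import numpy as np

def bic_q(loglik_max, K, n):
    """Approximated BIC of Section 2.2.2, to be maximized over K."""
    nu = 2 * K - 1          # mixing proportions and score variances
    return 2 * loglik_max - nu * np.log(n)
```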

2.3 Classification step

Once the mixture model parameters have been estimated, we classify the observed curves in order to complete our clustering approach. The group membership can be estimated by the maximum a posteriori (MAP) rule, which consists in assigning a curve x_i to the group g maximizing the conditional probability P(Z_{i,g} = 1 | X_i = x_i). At convergence of the EM algorithm, this probability is given by (10).
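With the matrix of conditional probabilities returned by the final E step, the MAP rule is a row-wise argmax:

```python
import numpy as np

def map_classify(t):
    """MAP rule: assign each curve to the group maximizing the
    t_{i,g} of (10) at EM convergence (t: n x K matrix)."""
    return np.argmax(t, axis=1)
```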

Link with related methods. If the principal component scores of each curve are not computed conditionally on their group (here the FPCA is carried out by group), then our approach corresponds exactly to a Gaussian mixture model on the principal component scores. The closest method to ours is the one proposed in [4] (called fun-HDDC), which assumes, conditionally on the group, a Gaussian mixture model on the coefficients of the eigenfunction expansion. Our approach is different since we assume a Gaussian distribution for the principal component scores, which is true if the curves are sample paths of a Gaussian process. This is a reasonable hypothesis.

3 Numerical experiments

In order to compare our model (denoted in the following by funclust) to other approaches, a simulation study and an application on real data are presented in this section. The simulation study compares funclust to the usual clustering procedures, kmeans and the Gaussian mixture model (GMM, [2, 6], through the R package mclust), applied directly to the FPCA scores. The application on real data consists in clustering Danone kneading curves. We illustrate the accuracy of funclust with respect to usual clustering methods such as HDDC [3], MixtPPCA [19], kmeans, GMM [2, 6] and hierarchical clustering (hclust, R package). All these methods are successively applied to the discretized data, to the expansion coefficients in a natural cubic spline basis, and to the functional PCA scores. For both the simulation study and the application, the number of clusters is assumed to be known.

3.1 Simulation study

In this simulation, the number of clusters is assumed to be known: K = 2. A sample of n = 100 curves is simulated according to the following model inspired by [9, 14]:

Class 1: X(t) = U_1 h_1(t) + U_2 h_2(t) + ε(t), t ∈ [1, 21],

Class 2: X(t) = U_1 h_1(t) + ε(t), t ∈ [1, 21],

where U_1 and U_2 are independent Gaussian variables such that E[U_1] = E[U_2] = 0, Var(U_1) = 1/2, Var(U_2) = 1/12, and ε(t) is a white noise, independent of the U_i's, with Var(ε_t) = 1/12. The functions h_1 and h_2 (plotted in Figure 1) are defined, for t ∈ [1, 21], by h_1(t) = 6 − |t − 7| and h_2(t) = 6 − |t − 15|.
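A sketch of this simulation design (the seed and function names are our own):

```python
import numpy as np

def simulate_curves(n=100, seed=0):
    """Simulate the two-class model of Section 3.1 on the grid
    t = 1, 1.2, ..., 21 (101 equidistant points, equal mixing)."""
    rng = np.random.default_rng(seed)
    t = np.linspace(1, 21, 101)
    h1, h2 = 6 - np.abs(t - 7), 6 - np.abs(t - 15)
    z = rng.integers(0, 2, size=n)                 # labels: 0 = Class 1
    U1 = rng.normal(0.0, np.sqrt(1 / 2), size=n)
    U2 = rng.normal(0.0, np.sqrt(1 / 12), size=n)
    eps = rng.normal(0.0, np.sqrt(1 / 12), size=(n, t.size))
    # Class 1 gets both structural terms, Class 2 only U1 * h1
    X = U1[:, None] * h1 + (z == 0)[:, None] * U2[:, None] * h2 + eps
    return t, X, z
```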

Figure 1 – Plots of the functions h_1(t) (solid line) and h_2(t) (dashed line).

The mixing proportions π_g are chosen to be equal, and the curves are observed at 101 equidistant points (t = 1, 1.2, . . . , 21). Figure 2 plots the simulated curves. The principal components of X are approximated from {X_t}_{t=1,...,21} and are computed using linear spline smoothing (with 30 equidistant knots). For funclust, the group-specific dimensions q_g are estimated such that 90% of the total variance is explained by the first q_g principal components. For the classical clustering procedures, kmeans and the Gaussian mixture model (GMM, [2, 6]), the number of FPCA scores used is selected in the same way. The corresponding dimensions and correct classification rates, averaged over 100 simulations, are given in Table 1.

Page 9: Model-based clustering of functional datamathematiques.univ-lille1.fr/digitalAssets/29/29431_71-7.pdf · Model-based clustering of functional data Julien Jacques a , b , c, Cristian

VII � 9

Figure 2 – Class 1 (left), Class 2 (center) and both classes (right).

method     correct classif. rate   q1     q2
funclust   79.68                   1.88   1.90
GMM        56.58                   1.10
kmeans     54.46                   1.10

Table 1 – Correct classification rates, group-specific dimensions q_g for funclust and number of FPCA scores for GMM and kmeans (averaged over 100 simulations), for the simulation study.

As expected, for this dataset with group-specific principal spaces of different dimensions, funclust outperforms the classical clustering methods of the multivariate setting.

3.2 Application

The dataset we use comes from the Danone Vitapole Paris Research Center; it concerns the quality of cookies and its relationship with the flour kneading process. The kneading dataset is described in detail in [11].

There are 115 different flours for which the dough resistance is measured during the kneading process for 480 seconds. One obtains 115 kneading curves observed at 241 equispaced instants of time in the interval [0, 480]. The 115 flours produce cookies of different quality: 50 of them produced cookies of good quality, 25 of adjustable quality and 40 of bad quality. Figure 3 presents the set of 115 kneading curves.

In a supervised classification context, these data are used in [11, 15, 1] for fitting linear and non-parametric prediction models of cookie quality. From these studies, it appears that it is difficult to discriminate between the three classes, even for supervised classifiers, partly because of the adjustable class.

Let us consider that the 115 kneading curves are sample paths of a second-order stochastic process X. In order to recover the functional nature of the data, each curve is approximated using a cubic B-spline basis expansion with the following 16 knots [11]: 10, 42, 84, 88, 108, 134, 148, 200, 216, 284, 286, 328, 334, 380, 388, 478. Thus, each curve Xi is represented by a set of 18 coefficients. The FPCA of X is then approximated using the smoothed curves (for more details, see [16]). The group-specific dimensions q_g are estimated such that at least 95% of the total variance is explained. The resulting dimensions are q_1 = 2, q_2 = 1, q_3 = 1.
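A possible least-squares fit in Python/SciPy, treating the 16 values above as interior knots with boundary knots at 0 and 480; the exact knot handling in [11], and hence the coefficient count, may differ, so this is only an illustration:

```python
import numpy as np
from scipy.interpolate import make_lsq_spline

k = 3                                    # cubic B-splines
interior = [10, 42, 84, 88, 108, 134, 148, 200,
            216, 284, 286, 328, 334, 380, 388, 478]
# make_lsq_spline expects boundary knots repeated k + 1 times
knots = np.r_[[0.0] * (k + 1), interior, [480.0] * (k + 1)]

def smooth_kneading_curve(time, y):
    """Least-squares cubic B-spline approximation of one kneading curve
    observed at 241 instants; returns the expansion coefficients."""
    spline = make_lsq_spline(time, y, knots, k=k)
    return spline.c
```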

Figure 3 – Kneading data: 115 flours observed during 480 seconds (dough resistance versus time, by quality class: Good, Adjustable, Bad). Left: observed data. Right: smoothed data using cubic B-splines.

Table 2 presents the results obtained with the different clustering methods. Our method funclust performs better than fun-HDDC [4], which, similarly to funclust, considers group-specific subspaces but assumes a Gaussian mixture model on the coefficients of the eigenfunction expansion rather than on the principal scores. The methods from the multivariate finite-dimensional setting are also outperformed by funclust.

2-steps      discretized       spline coeff.    FPCA scores       functional
methods      (241 instants)    (20 splines)     (4 components)    methods
HDDC         66.09             53.91            44.35             fun-HDDC   62.61
MixtPPCA     65.22             64.35            62.61             funclust   67.82
mclust       63.48             50.43            60
kmeans       62.61             62.61            62.61
hclust       63.48             63.48            63.48

Table 2 – Percentage of correct classification for the kneading dataset.

4 Conclusion

In this paper we propose a clustering procedure for functional data based on an approximation of the notion of density of a random function. The main tool is the use of the probability densities of the principal component scores. Assuming that the functional data are sample paths of a Gaussian process, the resulting mixture model extends the finite-dimensional Gaussian mixture model to the infinite-dimensional setting. We defined an EM-like algorithm for the parameter estimation and carried out a simulation study, as well as an application to real data, in order to show the performance of this approach with respect to other clustering procedures. The approximation of the density of a random function, based on the principal component densities, opens numerous perspectives for future work. Indeed, a clustering procedure for multivariate functional data (several curves observed for the same individual) can be defined similarly. The difficult task in such a multivariate functional setting is to define the dependence between the univariate functions. This challenge can be met by the FPCA of multivariate curves [16].

References

[1] A.M. Aguilera, M. Escabias, C. Preda, and G. Saporta. Using basis expansions for estimating functional PLS regression. Applications with chemometric data. Chemometrics and Intelligent Laboratory Systems, 104(2):289–305, 2011.

[2] J.D. Banfield and A.E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, 49:803–821, 1993.

[3] C. Bouveyron, S. Girard, and C. Schmid. High dimensional data clustering. Computational Statistics and Data Analysis, 52:502–519, 2007.

[4] C. Bouveyron and J. Jacques. Model-based clustering of time series in group-specific functional subspaces. Advances in Data Analysis and Classification, in press, 2011.

[5] R. Cattell. The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276, 1966.

[6] G. Celeux and G. Govaert. Gaussian parsimonious clustering models. Pattern Recognition, 28:781–793, 1995.

[7] A. Delaigle and P. Hall. Defining probability density for a distribution of random functions. The Annals of Statistics, 38:1171–1193, 2010.

[8] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

[9] F. Ferraty and P. Vieu. Curves discrimination: a nonparametric approach. Computational Statistics and Data Analysis, 44:161–173, 2003.

[10] G.M. James and C.A. Sugar. Clustering for sparsely sampled functional data. Journal of the American Statistical Association, 98(462):397–408, 2003.

[11] C. Lévéder, P.A. Abraham, E. Cornillon, E. Matzner-Lober, and N. Molinari. Discrimination de courbes de pétrissage. In Chimiométrie 2004, pages 37–43, Paris, 2004.

[12] G. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley-Interscience, New York, 1997.

[13] G. McLachlan and D. Peel. Finite Mixture Models. Wiley-Interscience, New York, 2000.

[14] C. Preda. Regression models for functional data by reproducing kernel Hilbert spaces methods. Journal of Statistical Planning and Inference, 137:829–840, 2007.

[15] C. Preda, G. Saporta, and C. Lévéder. PLS classification of functional data. Computational Statistics, 22(2):223–235, 2007.

[16] J.O. Ramsay and B.W. Silverman. Functional Data Analysis. Springer Series in Statistics. Springer, New York, second edition, 2005.

[17] G. Saporta. Méthodes exploratoires d'analyse de données temporelles. Cahiers du BURO, 37–38, 1981.

[18] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

[19] M.E. Tipping and C. Bishop. Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2):443–482, 1999.