
To combine steady-state genetic algorithm and ensemble learning for data clustering

Yi Hong *, Sam Kwong
Department of Computer Science, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong

Pattern Recognition Letters 29 (2008) 1416–1423, doi:10.1016/j.patrec.2008.02.017

Article history: Received 29 June 2007; received in revised form 28 December 2007; available online 4 March 2008. Communicated by L. Heutte.

Keywords: Clustering analysis; Ensemble learning; Genetic-guided clustering algorithms

Abstract

This paper proposes a data clustering algorithm that combines the steady-state genetic algorithm and the ensemble learning method, termed the genetic-guided clustering algorithm with ensemble learning operator (GCEL). GCEL adopts the steady-state genetic algorithm to perform the search task, but replaces its traditional recombination operator with an ensemble learning operator. GCEL can therefore avoid the problems of clustering invalidity and context insensitivity that affect the traditional recombination operator of genetic algorithms. In addition, GCEL generates its initial population of candidate clustering solutions using the random subspaces method, so fewer fitness evaluations are required to converge. The proposed GCEL is tested on one synthetic and several real data sets. Experimental results demonstrate that GCEL is able to achieve a comparable or better clustering solution with fewer fitness evaluations than several other existing genetic-guided clustering algorithms.


1. Introduction

Clustering algorithms work to classify a set of unlabeled instances into groups such that instances in the same group are more similar to each other, while instances in different groups are more different (Jain et al., 1999; Duda et al., 2001). They can automatically identify intrinsic structures of data and therefore benefit the storage, transmission and processing of data. Clustering algorithms have been widely applied in many fields such as data compression, pattern recognition and machine vision. However, clustering algorithms have several disadvantages (Jain et al., 1999). Among them is that clustering criteria, such as the minimization of the within-cluster variation, are usually high-dimensional, non-linear and multi-modal functions with many local optimal clustering solutions, whereas commonly used hill-climbing search methods can only guarantee a local optimal clustering solution. This disadvantage has motivated the application of more robust heuristic search methods, such as genetic algorithms (GAs), to data clustering. A number of recent studies have demonstrated that genetic-guided clustering algorithms are often able to identify a better clustering solution than those obtained by hill-climbing search methods (Fränti, 2000; Garai and Chaudhuri, 2004; Krishna and Murty, 1999; Kuncheva and Bezdek, 1998; Mitra, 2004; Martínez-Otzeta et al., 2006).


However, genetic-guided clustering algorithms are not free of drawbacks. One problem is that the traditional recombination operator of genetic algorithms suffers from clustering invalidity and context insensitivity (Falkenauer, 1994; Jones and Beltramo, 1991). When clustering invalidity and context insensitivity occur, the recombination operator may disrupt good building blocks and thus significantly degrade the search capability of GAs. There are several approaches to the problems of clustering invalidity and context insensitivity. For example, genetic-guided clustering algorithms can mitigate clustering invalidity by penalizing or repairing infeasible clustering solutions in the population. A widely used approach for avoiding context insensitivity is to remove the recombination operator from genetic-guided clustering algorithms and retain only the mutation operator for perturbing the population (Krishna and Murty, 1999; Lu et al., 2004). However, removing the recombination operator from GAs weakens their search capability. Apart from clustering invalidity and context insensitivity, another problem associated with genetic-guided clustering algorithms is their slow convergence (Krishna and Murty, 1999). A popular approach for speeding up the convergence of genetic-guided clustering algorithms is the one-step Kmeans operator (Krishna and Murty, 1999). However, as suggested in Sheng et al. (2004), the one-step Kmeans operator may restrict the GAs' search capability.

This paper proposes a novel data clustering algorithm that is able to mitigate the above two problems of genetic-guided clustering algorithms. It is termed the genetic-guided clustering algorithm with ensemble learning operator (GCEL). GCEL generates its initial population by using the random subspaces method and replaces the traditional crossover operator with the ensemble learning operator for reproducing new candidate solutions.

The remainder of this paper is divided into six sections. Section 2 briefly outlines the advantages of GCEL. Section 3 introduces the related work for this paper. Section 4 describes GCEL in detail. Experimental results on one synthetic and several real data sets are given in Section 5. Section 6 gives some further illustrations of the performance of GCEL, and Section 7 concludes this paper.

2. Advantages of GCEL

GCEL has two advantages over other existing genetic-guided clustering algorithms. First, GCEL uses the steady-state genetic algorithm to perform the search task, but replaces the traditional recombination operator with an ensemble learning operator. GCEL can therefore mitigate the problems of clustering invalidity and context insensitivity of traditional genetic-guided clustering algorithms. Second, GCEL initializes its population of candidate clustering solutions using the random subspaces method (Ho, 1998); far fewer fitness evaluations are therefore required to converge compared with genetic-guided clustering algorithms whose initial populations are randomly generated. These two advantages enable GCEL to achieve a comparable or better clustering solution with fewer fitness evaluations than several other existing genetic-guided clustering algorithms.

3. Related work

Before describing the genetic-guided clustering algorithm with ensemble learning operator in detail, this section briefly reviews the literature on the steady-state genetic algorithm and the ensemble learning method.

3.1. Steady-state genetic algorithm

GAs are a class of heuristic search methods that loosely mimic Darwinian evolution for solving large-scale complex optimization problems (Goldberg, 1989). Unlike commonly used hill-climbing search methods, where only one candidate solution is kept, GAs maintain a population of candidate solutions during their search. GAs are therefore often able to jump over local optimal solutions and converge to a better solution than the one obtained by hill-climbing search methods. GAs rely on three genetic operators: the selection operator, the recombination operator and the mutation operator. GAs work with these three operators to explore and exploit the coded search space of the objective function.

The original version of GAs uses generational replacement, where the entire population is replaced at each iteration. When the entire population is replaced at each iteration, GAs tend to lose population diversity very quickly and converge to a local optimal solution. From this standpoint, generational GAs are not well suited to problems with a large number of local optimal solutions. Since clustering criteria are usually high-dimensional, non-linear and multi-modal with many local optimal solutions, this paper uses another variant of GAs, commonly known as the steady-state GA (Whitley and Kauth, 1988; Syswerda, 1991; Rogers and Prügel-Bennett, 1999). In a steady-state GA, the populations of two successive iterations overlap significantly, and only one or two candidate solutions are replaced at each iteration. Steady-state GAs therefore maintain population diversity better and are more suitable for the problem of data clustering.

3.2. Ensemble learning

Clustering analysis is known to be an ill-posed combinatorial optimization problem. Many clustering algorithms exist, and their clustering solutions may differ significantly. The ensemble learning method is an effective way to improve the robustness and stability of clustering algorithms: it combines multiple clustering results into a single consensus partition by leveraging their agreement (Strehl and Ghosh, 1999). There are several effective approaches for combining multiple clustering results (Strehl and Ghosh, 1999; Fred and Jain, 2005; Fern and Brodley, 2003). For example, Fred and Jain obtained a number of partitions of a data set by executing the Kmeans clustering algorithm with random initializations and random numbers of clusters; they then obtained the final consensus partition of the data set with an agglomerative clustering algorithm such as the average link clustering algorithm (Fred and Jain, 2005).

It is noted that the function of the ensemble learning method, combining multiple clustering results into a single consensus one, is somewhat similar to that of a recombination operator of GAs, which works to mix different candidate clustering solutions into a new, better one. In addition, commonly used recombination operators of GAs such as the one-point crossover operator do not perform well enough due to the problems of clustering invalidity and context insensitivity. Therefore, in this paper the ensemble learning method is used as the recombination operator of genetic algorithms to reproduce new candidate clustering solutions.

4. Genetic-guided clustering algorithm with ensemble learning operator

4.1. Problem definition

Before going into the details of GCEL, the authors define data clustering as the following combinatorial optimization problem. Let $D = \{x_1, x_2, \ldots, x_n\}$ denote a data set containing $n$ unlabeled instances. Clustering algorithms work to classify these $n$ instances into $K$ groups such that the optimal value of a predefined clustering criterion is achieved. Many clustering criteria exist, and no single one is valid for all kinds of data sets. A popular clustering criterion is the minimization of the within-cluster variation. Provided that each instance $x_j$ has $m$ features, $x_j = (x_{j1}, x_{j2}, \ldots, x_{jm})$, $j = 1, 2, \ldots, n$, the within-cluster variation of a clustering solution $C = \{C_1, C_2, \ldots, C_K\}$ of the data set can be calculated as

$$f(C) = \sum_{j=1}^{n} \sum_{k=1}^{K} d(x_j, C_k) \sum_{l=1}^{m} (x_{jl} - c_{kl})^2, \qquad (1)$$

where

$$c_{kl} = \frac{\sum_{j=1}^{n} d(x_j, C_k)\, x_{jl}}{\sum_{j=1}^{n} d(x_j, C_k)} \qquad (2)$$

for $k = 1, \ldots, K$, $l = 1, \ldots, m$, and

$$d(x_j, C_k) = \begin{cases} 1 & \text{if the instance } x_j \text{ belongs to the group } C_k, \\ 0 & \text{otherwise.} \end{cases} \qquad (3)$$

The above objective function $f(C)$ is usually high-dimensional, non-linear and multi-modal, with many local optimal clustering solutions, whereas commonly used hill-climbing search methods can only guarantee a local optimal clustering solution. Heuristic search methods such as GAs have therefore been widely applied to this combinatorial optimization problem. The following three subsections describe three key components of GCEL: the individual encoding, the ensemble learning operator and the population initialization.
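As a concrete illustration of Eqs. (1)-(3), the within-cluster variation can be computed in a few lines of NumPy. This is a minimal sketch rather than the authors' code; the function name and the convention that labels[j] holds the 0-based cluster index of instance $x_j$ are our own assumptions.

```python
import numpy as np

def within_cluster_variation(X, labels, K):
    """f(C) of Eq. (1): total squared Euclidean distance of every instance
    to the centroid (Eq. (2)) of its cluster. X is an (n, m) array and
    labels an (n,) integer array of cluster assignments."""
    total = 0.0
    for k in range(K):
        members = X[labels == k]         # rows with d(x_j, C_k) = 1, Eq. (3)
        if len(members) == 0:            # an empty cluster contributes nothing
            continue
        centroid = members.mean(axis=0)  # c_k = (c_k1, ..., c_km), Eq. (2)
        total += ((members - centroid) ** 2).sum()
    return total
```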

Page 3: To combine steady-state genetic algorithm and ensemble learning for data clustering

1418 Y. Hong, S. Kwong / Pattern Recognition Letters 29 (2008) 1416–1423

4.2. Individuals encoding

A genetic-guided clustering algorithm maintains a population of coded candidate clustering solutions during its search. Several encoding strategies have been proposed, such as the string-of-group encoding (Krishna and Murty, 1999), the cluster centers encoding (Mitra, 2004) and the linear linkage encoding (Du et al., 2004). However, no conclusion has been drawn on which encoding strategy is best: in algorithms where the recombination operator is easy to perform, fitness evaluations are very time-consuming, while in algorithms where fitness evaluations are simple, the recombination operators are complicated (Krishna and Murty, 1999). This paper adopts the string-of-group encoding strategy because of its simplicity and wide application. In a genetic-guided clustering algorithm with the string-of-group encoding strategy, each candidate clustering solution is coded as an integer string, and the value of an integer in the string represents the label of the group to which the corresponding instance is assigned. For example, if the data set has five instances $\{x_1, x_2, x_3, x_4, x_5\}$, the chromosome (1 2 2 2 1) represents that the instances $\{x_1, x_5\}$ are classified into one group while the instances $\{x_2, x_3, x_4\}$ are classified into the other group, i.e., the partition of the data represented by the chromosome is $\{\{x_1, x_5\}, \{x_2, x_3, x_4\}\}$.
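The string-of-group encoding is easy to decode. The short sketch below is our own illustration, not part of the paper; it recovers the partition from the example chromosome (1 2 2 2 1).

```python
def decode(chromosome):
    """Map a string-of-group chromosome to the partition it encodes."""
    groups = {}
    for j, label in enumerate(chromosome):
        groups.setdefault(label, []).append(f"x{j + 1}")
    return list(groups.values())

print(decode([1, 2, 2, 2, 1]))   # [['x1', 'x5'], ['x2', 'x3', 'x4']]
```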

4.3. Ensemble learning operator

Apart from the encoding strategy, another important component of a genetic-guided clustering algorithm is an effective recombination operator for mixing and reproducing new candidate clustering solutions. Commonly used recombination operators of GAs such as the one-point crossover operator cannot perform well enough due to the problems of clustering invalidity and context insensitivity.

Clustering invalidity occurs when the recombination operator reproduces new clustering solutions whose number of clusters is smaller than the given number of clusters. For example, if the simple one-point crossover operator is executed on the chromosome (1 1 2 2 3 3) and the chromosome (3 1 1 3 2 2), both new clustering solutions (1 1 2 2 2 2) and (3 1 1 3 3 3) have only two clusters, and both of them are invalid.

Apart from clustering invalidity, a more serious problem associated with commonly used recombination operators such as the one-point crossover operator is context insensitivity. Context insensitivity occurs when one clustering solution can be coded by several different chromosomes (Falkenauer, 1994). For example, both the chromosome (1 1 2 2) and the chromosome (2 2 1 1) represent the same clustering solution, where instances $\{x_1, x_2\}$ are classified into one group and instances $\{x_3, x_4\}$ are classified into the other group. In this case, the recombination operator exchanges string blocks of two different chromosomes in the population, but may not exchange their clustering contexts when combining new candidate clustering solutions. For example, the chromosome (1 1 1 2 2 2) and the chromosome (2 2 2 1 1 1) represent the same clustering solution, where instances $\{x_1, x_2, x_3\}$ are classified into one group and instances $\{x_4, x_5, x_6\}$ are classified into the other group. However, their offspring (1 1 1 1 1 1) and (2 2 2 2 2 2) after executing the one-point crossover operator are significantly different from their parents.

The above example shows that the commonly used recombination operator of GAs is only able to mix string blocks of different chromosomes; it is not able to recombine the clustering contexts of different chromosomes into new, better ones. The context insensitivity of the recombination operator often leads to the disruption of good building blocks. If the disruption of good building blocks occurs too frequently, the recombination operator loses its potential and the search of GAs degenerates into a random walk.
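Both failure modes are easy to reproduce. The sketch below is our own illustration; the cut points are chosen to match the examples in the text.

```python
def one_point_crossover(p1, p2, cut):
    """Classic one-point crossover on string-of-group chromosomes."""
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

# Clustering invalidity (first example): with the cut after gene 4,
# both offspring have only two of the three required clusters.
print(one_point_crossover([1, 1, 2, 2, 3, 3], [3, 1, 1, 3, 2, 2], cut=4))
# -> ([1, 1, 2, 2, 2, 2], [3, 1, 1, 3, 3, 3])

# Context insensitivity (second example): the parents encode the SAME
# partition, yet their offspring collapse to single-cluster strings.
print(one_point_crossover([1, 1, 1, 2, 2, 2], [2, 2, 2, 1, 1, 1], cut=3))
# -> ([1, 1, 1, 1, 1, 1], [2, 2, 2, 2, 2, 2])
```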

In this paper, traditional recombination operators of GAs such as the one-point crossover operator are replaced by an ensemble learning operator for reproducing new candidate clustering solutions. Provided that $P^{(s)} = \{I^{(1)}, I^{(2)}, \ldots, I^{(M)}\}$ are $M$ parental clustering solutions and $I^{(i)}_j$ represents the label of the group to which the instance $x_j$ is assigned in the $i$th clustering solution, the ensemble learning operator works to reproduce a new candidate clustering solution $I^{(\mathrm{new})}$ by combining these $M$ clustering solutions without accessing features of the data. Several feasible ensemble strategies have been proposed (Strehl and Ghosh, 1999; Fred and Jain, 2005; Fern and Brodley, 2003). This paper uses the average link agglomerative clustering algorithm for its simplicity (Fred and Jain, 2005). The ensemble learning operator based on the average link agglomerative clustering algorithm obtains a new clustering solution with the following steps. First, each clustering solution $I^{(i)}$ is transformed into a similarity matrix $S^{(i)}$ as follows (Fred and Jain, 2005):

$$S^{(i)}(j_1, j_2) = \begin{cases} 1 & \text{if } I^{(i)}_{j_1} = I^{(i)}_{j_2}, \\ 0 & \text{otherwise}, \end{cases} \qquad (4)$$

where $j_1 = 1, \ldots, n$ and $j_2 = 1, \ldots, n$. Accordingly, $\{S^{(1)}, S^{(2)}, \ldots, S^{(M)}\}$ can be obtained from the $M$ available clustering solutions. Then all similarity matrices $\{S^{(1)}, S^{(2)}, \ldots, S^{(M)}\}$ are combined into a single consensus similarity matrix $S(j_1, j_2)$ (Fred and Jain, 2005):

$$S(j_1, j_2) = \frac{\sum_{i=1}^{M} S^{(i)}(j_1, j_2)}{M}, \qquad (5)$$

where $j_1 = 1, \ldots, n$, $j_2 = 1, \ldots, n$. The value of $S(j_1, j_2)$ represents the frequency with which the instances $x_{j_1}$ and $x_{j_2}$ are classified into the same group in the parental clustering solutions $P^{(s)} = \{I^{(1)}, I^{(2)}, \ldots, I^{(M)}\}$. After the similarity matrix $S$ is calculated, a new similarity matrix $S^{(\mathrm{new})}$ is sampled from $S$ as follows:

$$S^{(\mathrm{new})}(j_1, j_2) = \begin{cases} 1 & \text{if } \mathrm{rand}(1) < S(j_1, j_2), \\ 0 & \text{otherwise}, \end{cases} \qquad (6)$$

where $\mathrm{rand}(1)$ is a random number in $[0, 1]$, $j_1 = 1, 2, \ldots, n$ and $j_2 = 1, 2, \ldots, n$. Lastly, a new clustering solution $I^{(\mathrm{new})}$ is obtained by running the average link agglomerative clustering algorithm on the similarity matrix $S^{(\mathrm{new})}$. One point should be mentioned: the average link agglomerative clustering algorithm classifies data instances based on their distance matrix, in which a small element value indicates that two instances have a high probability of being classified into the same group. Unlike the distance matrix, the similarity matrix describes the similarities among instances, so a small element value indicates that two instances have a small probability of being classified into the same group. The similarity matrix must therefore first be transformed into a distance matrix before the average link agglomerative clustering algorithm is executed.
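The steps above translate directly into code. The following sketch is our own reading of Eqs. (4)-(6), delegating the average link step to SciPy's hierarchical clustering; the symmetrized sampling of Eq. (6) and the function signature are assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def ensemble_learning_operator(parents, K, rng):
    """Combine M parental label strings (an (M, n) integer array)
    into one new clustering solution with K clusters."""
    M, n = parents.shape
    # Eqs. (4)-(5): consensus similarity (co-association) matrix S.
    S = np.zeros((n, n))
    for labels in parents:
        S += (labels[:, None] == labels[None, :])
    S /= M
    # Eq. (6): sample a binary matrix from S, kept symmetric so that
    # S_new(j1, j2) = S_new(j2, j1).
    R = rng.random((n, n))
    R = np.triu(R) + np.triu(R, 1).T
    S_new = (R < S).astype(float)
    # Average link needs distances, so cluster on 1 - S_new
    # (the similarity-to-distance transformation noted above).
    D = 1.0 - S_new
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method='average')
    return fcluster(Z, t=K, criterion='maxclust')   # labels in 1..K
```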

It is noted that the ensemble learning operator can mitigate the problem of context insensitivity. This is because one clustering context has only one similarity matrix, and different chromosomes with the same clustering context share the same similarity matrix. For example, both the chromosome (1 1 1 2 2 2) and the chromosome (2 2 2 1 1 1) have the same clustering context $\{\{x_1, x_2, x_3\}, \{x_4, x_5, x_6\}\}$, which is represented by the same similarity matrix:

$$S = \begin{bmatrix} 1 & 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 & 1 \end{bmatrix}.$$

Therefore, the ensemble learning operator on this similarity matrix does not suffer from the problem of context insensitivity. In addition, since new candidate clustering solutions are directly generated by the average link agglomerative clustering algorithm, whose number of clusters is fixed to the given number of clusters, the ensemble learning operator is also immune to the problem of clustering invalidity.

4.4. Population initialization

Unlike existing genetic-guided clustering algorithms whose initial populations are randomly generated, GCEL initializes its population using the random subspaces method: a subset of features is randomly selected from the full feature set; a clustering solution is then obtained by executing the Kmeans clustering algorithm on the selected features; these two steps iterate until a population of clustering solutions is obtained. The inspiration for using the random subspaces method is that it is known to be an effective way of providing a population of accurate and diverse clustering solutions (Ho, 1998; Skurichina and Duin, 2002; Kuncheva and Hadjitodorov, 2004). To the best of the authors' knowledge, this is the first time the random subspaces method has been adopted to initialize the population of a genetic-guided clustering algorithm. The authors claim that using the random subspaces method can significantly speed up the search of genetic-guided clustering algorithms.
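A sketch of this initialization using scikit-learn's KMeans follows; the paper does not publish its implementation, so the signature is ours. The default of two selected features matches the setting used in the experiments of Section 5.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_population(X, K, N, n_features=2, rng=np.random.default_rng()):
    """Random subspaces initialization: each individual is the label
    string produced by Kmeans on a random subset of the features."""
    n, m = X.shape
    population = np.empty((N, n), dtype=int)
    for i in range(N):
        subspace = rng.choice(m, size=n_features, replace=False)
        population[i] = KMeans(n_clusters=K, n_init=1).fit(X[:, subspace]).labels_
    return population
```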

4.5. Framework of genetic-guided clustering algorithm with ensemble learning operator

GCEL performs the search task in the following steps. First, GCEL obtains a population of candidate clustering solutions by executing the Kmeans clustering algorithm in different feature subspaces of the data. Second, GCEL calculates the fitness values of all clustering solutions in the population and selects a subset of promising clustering solutions, according to their fitness values, to form a selected subpopulation. Third, a new clustering solution is generated by combining all clustering solutions in the selected subpopulation, and its fitness value is calculated. Lastly, GCEL compares the fitness value of the new clustering solution with that of the worst clustering solution in the population; if the new fitness value is lower, the worst clustering solution is replaced by the new one. These steps iterate until the finish condition is met. Algorithm 1 gives the framework of the genetic-guided clustering algorithm with ensemble learning operator. In Algorithm 1, tournament selection and a simple mutation operator are adopted.

Algorithm 1. Genetic-guided clustering algorithm with ensemble learning operator.

// Initialization
(1) $P \leftarrow$ generate $N$ clustering solutions $\{I^{(1)}, \ldots, I^{(N)}\}$ by the random subspaces method;
// Fitness evaluation
(2) $\{f(I^{(1)}), f(I^{(2)}), \ldots, f(I^{(N)})\} \leftarrow$ calculate the fitness values of the clustering solutions in the population $P$;
// Selection operator
(3) $P^{(s)} \leftarrow$ select $M$ ($M < N$) promising clustering solutions from $P$ by tournament selection;
// Ensemble learning operator
(4) $I^{(\mathrm{new})} \leftarrow$ generate a new solution by combining the clustering solutions in $P^{(s)}$ with the ensemble learning operator;
// New candidate solution evaluation
(5) $f(I^{(\mathrm{new})}) \leftarrow$ calculate the fitness value of $I^{(\mathrm{new})}$;
// Replacement operator
(6) if $f(I^{(\mathrm{new})}) < \max\{f(I^{(1)}), f(I^{(2)}), \ldots, f(I^{(N)})\}$, replace the worst clustering solution $I^{(\mathrm{worst})}$ with $I^{(\mathrm{new})}$;
// Check the finish condition
(7) if the finish condition is not met, go to (3).
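Putting the pieces together, Algorithm 1 can be sketched as below, reusing the helper functions from the earlier sketches. The termination test (85% of the population identical, per Section 5) is simplified here to a fixed budget of fitness evaluations; the other defaults follow the paper's experimental settings, and the whole function is an assumed reconstruction rather than the authors' code.

```python
import numpy as np

def gcel(X, K, N=100, M=2, mutation_rate=0.005, max_evals=10000,
         rng=np.random.default_rng()):
    """Steady-state GCEL loop (Algorithm 1), built from the sketches above."""
    pop = init_population(X, K, N)
    fitness = np.array([within_cluster_variation(X, ind, K) for ind in pop])
    evals = N
    while evals < max_evals:
        # Tournament selection (tournament size 2) of M parents.
        parents = []
        for _ in range(M):
            a, b = rng.choice(N, size=2, replace=False)
            parents.append(pop[a] if fitness[a] < fitness[b] else pop[b])
        # Ensemble learning operator; shift fcluster's 1..K labels to 0..K-1.
        child = ensemble_learning_operator(np.array(parents), K, rng) - 1
        # Simple mutation: each gene is reset to a random label with p = 0.005.
        flip = rng.random(child.shape) < mutation_rate
        child[flip] = rng.integers(0, K, size=flip.sum())
        f_child = within_cluster_variation(X, child, K)
        evals += 1
        # Steady-state replacement: the child displaces the worst individual.
        worst = fitness.argmax()
        if f_child < fitness[worst]:
            pop[worst], fitness[worst] = child, f_child
    return pop[fitness.argmin()]
```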

5. Experimental results and their analysis

5.1. Synthetic data

One synthetic data set, X8K5D, was used to test the performance of the ensemble learning operator (X8K5D can be downloaded from http://strehl.com/). The full X8K5D contains 1000 instances sampled from five multivariate Gaussian distributions in 8D space, and each cluster has 200 instances with the same variance 0.1. In the experiments, we randomly selected 100 instances from the data. The following three algorithms were compared:

- Genetic-guided clustering algorithm without recombination operator. In this algorithm, only the mutation operator was used to perturb the population, and its mutation rate was fixed to 0.005.

- Genetic-guided clustering algorithm with one-point crossover operator. In this algorithm, both the one-point crossover and mutation operators were used to perturb the population. Its crossover rate was fixed to 0.8 and its mutation rate to 0.005.

- Genetic-guided clustering algorithm with the ensemble learning operator. In this algorithm, both the ensemble learning operator and the mutation operator were used to perturb the population. Its ensemble size M was fixed to 2 and its mutation rate to 0.005.

All three algorithms were initialized using the random subspaces method, and the steady-state genetic algorithm was adopted to perform the search task; only one new clustering solution was therefore reproduced at each iteration. The qualities of new clustering solutions were evaluated by two criteria: the within-cluster variation and the clustering accuracy calculated by the Rand index method (Rand, 1971). Experimental results are shown in Fig. 1. In Fig. 1a, the accuracies of new clustering solutions generated by the ensemble learning operator increased from 87% to 100%, the accuracies of new clustering solutions generated by the mutation operator increased from 87% to 95%, and the accuracies of new clustering solutions generated by the one-point crossover operator fluctuated around 83%. A similar phenomenon can be observed in Fig. 1b, where the within-cluster variations of new clustering solutions generated by the ensemble learning operator decreased from 25 to 7.5, those generated by the mutation operator decreased from 25 to 11, and those generated by the one-point crossover operator fluctuated around 36. It can be concluded from these experimental results that the commonly used one-point crossover operator caused serious disruption of good building blocks due to the problem of context insensitivity and is therefore not capable of reproducing high-quality clustering solutions. The mutation operator is better than the one-point crossover operator; however, it cannot jump over local optimal clustering solutions, and its final within-cluster variation was 10.17 with an accuracy of around 95%. Among the mutation operator, the one-point crossover operator and the ensemble learning operator, the ensemble learning operator performed best. The solution captured by GCEL converged to the global optimal clustering solution, with a final within-cluster variation of 7.50 and an accuracy of 100%.

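For reference, the Rand index used as the accuracy measure admits a direct pairwise implementation. This quadratic-time sketch of Rand (1971) is ours rather than the paper's.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of instance pairs on which two partitions agree:
    grouped together in both, or separated in both."""
    n = len(labels_a)
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in combinations(range(n), 2)
    )
    return agree / (n * (n - 1) / 2)
```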

[Fig. 1. Experimental results on the synthetic data set: (a) accuracy and (b) within-cluster variation of the new clustering solutions generated by the ensemble learning operator, the one-point crossover operator and mutation alone (without recombination).]

Table 2
Clustering accuracy

Data sets      GCMC (Painho and Fernando, 2000)   GKA (Krishna and Murty, 1999)   GCEL
X8K5D          91.13% ± 5.28%                     94.29% ± 5.46%                  100.0% ± 0.00%
Ionosphere     58.89% ± 0.05%                     58.79% ± 0.25%                  58.89% ± 0.00%
Promoters      63.21% ± 3.36%                     62.69% ± 4.35%                  66.79% ± 1.07%
Segmentation   87.06% ± 0.61%                     86.08% ± 0.72%                  87.86% ± 0.13%
Leukemia       92.77% ± 1.97%                     94.96% ± 0.74%                  95.40% ± 0.55%

Table 3
Within-cluster variation

Data sets      GCMC (Painho and Fernando, 2000)   GKA (Krishna and Murty, 1999)   GCEL
X8K5D          1.333 ± 0.528E+1                   8.110 ± 0.460E+0                7.496 ± 0.000E+0
Ionosphere     2.417 ± 0.003E+3                   2.427 ± 0.006E+3                2.416 ± 0.002E+3
Promoters      7.160 ± 0.004E+3                   7.157 ± 0.002E+3                7.152 ± 0.001E+3
Segmentation   1.857 ± 0.041E+1                   1.862 ± 0.014E+1                1.826 ± 0.004E+1
Leukemia       3.316 ± 0.049E+10                  3.296 ± 0.097E+10               3.268 ± 0.028E+10


5.2. Real data

Four real data sets were selected to test the performance of GCEL. Their names and characteristics are given in Table 1. Note that all selected data sets are high-dimensional data sets with a small number of instances; genetic-guided clustering algorithms with the string-of-group encoding strategy are more suitable for this kind of data set (Demiriz et al., 1999). The parameter settings in our experiments were as follows. The population size was fixed to 100. Tournament selection was adopted with a tournament size of 2. The simple mutation operator was used with a mutation rate of 0.005. When the population of a genetic-guided clustering algorithm was initialized using the random subspaces method, the initial population was generated as follows: a subset of the features was selected, and the instances were classified by executing the Kmeans clustering algorithm on these selected features; these steps iterated for N rounds, yielding a population of N clustering solutions. The number of selected features was fixed to 2 in the experiments.

First, GCEL was compared with two string-of-group coded genetic-guided clustering algorithms: GCMC (Painho and Fernando, 2000) and GKA (Krishna and Murty, 1999). GCMC employs the simple genetic algorithm with standard mutation and recombination operators, while GKA removes the recombination operator from the simple genetic algorithm and adds the one-step Kmeans operator. The ensemble size M of GCEL was fixed to 2 in the experiments. All algorithms terminated when 85% of the individuals in the population were identical. Both clustering accuracy and within-cluster variation were studied. All algorithms were executed for 20 independent runs and their average results are reported.

Table 1
Data sets and their characteristics

Names        Instances   Nominal features   Numeric features   Classes
X8K5D        100         0                  8                  5
Ionosphere   351         0                  34                 2
Promoters    106         57                 0                  2
Colon        100         0                  17                 7
Leukemia     38          0                  999                3

Tables 2 and 3 show the experimental results. In Table 2, GCEL achieved 100.0% accuracy for X8K5D, 58.89% for Ionosphere, 66.79% for Promoters, 87.86% for Segmentation and 95.40% for Leukemia, which are higher than the 91.13%, 58.89%, 63.21%, 87.06% and 92.77% obtained by GCMC and the 94.29%, 58.79%, 62.69%, 86.08% and 94.96% obtained by GKA. In Table 3, the within-cluster variations obtained by GCEL were smaller than those obtained by GCMC and GKA for all data sets. For example, the within-cluster variation of the clustering solution obtained by GCEL for the Promoters data set equaled 7152, smaller than the 7160 obtained by GCMC and the 7157 obtained by GKA. It can be concluded from these experimental results that GCEL is often able to identify a clustering solution with higher clustering accuracy and smaller within-cluster variation than those obtained by other existing genetic-guided clustering algorithms.

To study the potential of the random subspaces method and the clustering ensemble operator, eight genetic-guided clustering algorithms were designed and tested. Their characteristics are given in Table 4. All experiments were executed for 20 independent runs and their average results are reported.

Table 4
Genetic-guided clustering algorithms and their characteristics

Name      Genetic algorithm   Recombination       Initialization   M
GCM       Generational        –                   Random           –
GCMC      Generational        One-point           Random           –
GCMS      Generational        –                   Subspaces        –
GCMCS     Generational        One-point           Subspaces        –
SGCM      Steady-state        –                   Subspaces        –
SGCMC     Steady-state        One-point           Subspaces        –
GCEL-2    Steady-state        Ensemble learning   Subspaces        2
GCEL-10   Steady-state        Ensemble learning   Subspaces        10

The results are shown in Fig. 2. Five phenomena can be observed from this figure: (1) GCMS outperformed GCM, and GCMCS obtained better clustering solutions faster than GCMC. (2) The performance of SGCM was much better than that of GCMS, and SGCMC outperformed GCMCS. (3) The genetic-guided clustering algorithms with a crossover operator obtained comparable or better clustering solutions with fewer fitness evaluations than the genetic-guided clustering algorithms without a crossover operator on all four data sets. (4) Both GCEL-2 and GCEL-10 identified satisfactory clustering solutions for all four data sets with far fewer fitness evaluations than the other genetic-guided clustering algorithms. (5) GCEL-10 was faster than GCEL-2; however, GCEL-2 sometimes achieved better clustering solutions than GCEL-10. These experimental results demonstrate that the random subspaces method is a good method for initializing the population of genetic-guided clustering algorithms and is able to speed up their convergence. In addition, the mutation operator alone is not enough to explore and exploit the search space of high-dimensional data sets, and a crossover operator can often lead to better performance. Lastly, the proposed ensemble learning operator performs better than the commonly used one-point crossover operator for reproducing new candidate clustering solutions.

[Fig. 2. Within-cluster variations versus the number of fitness evaluations on the four data sets for GCM, GCMC, GCMS, GCMCS, SGCM, SGCMC, GCEL-2 and GCEL-10.]

6. Further illustrations

It can be observed from the above experimental results that the improvement in the accuracies of the clustering solutions obtained by GCEL is not always significant compared with those obtained by other existing genetic-guided clustering algorithms. This is because the accuracy of a clustering algorithm depends not only on the search method used, but also heavily on the clustering criterion used. If an improper clustering criterion is adopted, a clustering solution with a better criterion value may even be worse than one with a worse criterion value. The criterion of minimizing the within-cluster variation used in this paper may not be valid for some data sets. For example, for the Segmentation data set the within-cluster variation of the clustering solutions reproduced by the ensemble learning operator decreases with the evolution of the population (see Fig. 2c), yet the accuracy of those solutions also decreases (see Fig. 3c). Another point worth mentioning is that the average link based ensemble learning operator requires more time to reproduce a new candidate clustering solution than the one-point crossover operator. The authors would like to use a more efficient ensemble method, such as that of Viswanath and Jayasurya (2006), to reproduce new candidate clustering solutions in future work to improve the efficiency of GCEL.


[Fig. 3. Accuracies of the solutions reproduced by GCEL on the four data sets.]

7. Conclusion

In this paper, a novel genetic-guided clustering algorithm termed GCEL has been proposed. Experimental results on one synthetic and several real data sets have demonstrated its effectiveness. The characteristics of GCEL are summarized as follows: first, the population of GCEL is initialized using the random subspaces method; second, GCEL replaces the traditional recombination operator of genetic algorithms with an ensemble learning operator; lastly, GCEL uses the steady-state genetic algorithm to perform the search task. These three characteristics enable GCEL to achieve a comparable or better clustering solution with fewer fitness evaluations than other genetic-guided clustering algorithms.

Acknowledgement

This work was supported by Project No. 7002073, City University of Hong Kong. The authors would like to thank the reviewers for their constructive comments.

References

Demiriz, A., Bennett, K.P., Embrechts, M.J., 1999. Semi-supervised clustering using genetic algorithms. In: Proc. Artificial Neural Networks in Engineering.

Duda, R.O., Hart, P.E., Stork, D.G., 2001. Pattern Classification, second ed. Wiley, New York.

Du, J., Korkmaz, E., Alhajj, R., Barker, K., 2004. Novel clustering approach that employs genetic algorithm with new representation scheme and multiple objectives. In: Proc. Internat. Conf. on Data Warehousing and Knowledge Discovery.

Falkenauer, E., 1994. A new representation and operators for genetic algorithms applied to grouping problems. Evolut. Comput. 2, 123–144.

Fern, X.Z., Brodley, C.E., 2003. Clustering ensembles for high dimensional data clustering. In: Proc. Internat. Conf. on Machine Learning, pp. 186–193.

Fränti, P., 2000. Genetic algorithm with deterministic crossover for vector quantization. Pattern Recognition Lett. 21, 61–68.

Fred, A., Jain, A.K., 2005. Combining multiple clusterings using evidence accumulation. IEEE Trans. Pattern Anal. Machine Intell. 27, 835–850.

Garai, G., Chaudhuri, B.B., 2004. A novel genetic algorithm for automatic clustering. Pattern Recognition Lett. 25, 173–187.

Goldberg, D.E., 1989. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA.

Ho, T.K., 1998. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Machine Intell. 20, 832–844.

Jain, A.K., Murty, M.N., Flynn, P.J., 1999. Data clustering: A review. ACM Comput. Surv. 31, 264–323.

Jones, D.R., Beltramo, M.A., 1991. Solving partitioning problems with genetic algorithms. In: Proc. Internat. Conf. on Genetic Algorithms, pp. 442–449.

Krishna, K., Murty, M., 1999. Genetic K-means algorithm. IEEE Trans. Systems Man Cybernet. – Part B 29, 433–439.

Kuncheva, L.I., Bezdek, J.C., 1998. Nearest prototype classification: Clustering, genetic algorithms or random search? IEEE Trans. Systems Man Cybernet. – Part B 28, 160–164.

Kuncheva, L.I., Hadjitodorov, S.T., 2004. Using diversity in cluster ensembles. In: Proc. Systems, Man and Cybernetics, pp. 1214–1219.

Lu, Y., Li, S., Fotouhi, F., Deng, Y., Brown, S.J., 2004. Incremental genetic Kmeans algorithm and its application in gene expression data analysis. BMC Bioinform.

Martínez-Otzeta, J.M., Sierra, B., Lazkano, E., Astigarraga, A., 2006. Classifier hierarchy learning by means of genetic algorithms. Pattern Recognition Lett. 27, 1998–2004.

Mitra, S., 2004. An evolutionary rough partitive clustering. Pattern Recognition Lett. 25, 1439–1449.

Painho, M., Fernando, B., 2000. Using genetic algorithms in clustering problems. In: Proc. Conf. on GeoComputation.

Rand, W.M., 1971. Objective criteria for the evaluation of clustering methods. J. Amer. Statist. Assoc. 66, 846–850.

Rogers, A., Prügel-Bennett, A., 1999. Modelling the dynamics of a steady-state genetic algorithm. In: Proc. Foundations of Genetic Algorithms, pp. 57–68.

Sheng, W., Tucker, A., Liu, X., 2004. Clustering with niching genetic Kmeans algorithm. In: Proc. Genetic and Evolutionary Computation Conf., pp. 162–173.

Skurichina, M., Duin, R., 2002. Bagging, boosting and the random subspace method for linear classifiers. Pattern Anal. Appl. 5, 121–135.

Strehl, A., Ghosh, J., 1999. Clustering ensembles – A knowledge reuse framework for combining multiple partitions. J. Machine Learning Res. 3, 583–617.

Syswerda, G., 1991. A study of reproduction in generational and steady state genetic algorithms. In: Proc. Foundations of Genetic Algorithms, pp. 94–101.

Viswanath, P., Jayasurya, K., 2006. A fast and efficient ensemble clustering method. In: Proc. Internat. Conf. on Pattern Recognition.

Whitley, D., Kauth, J., 1988. GENITOR: A different genetic algorithm. In: Proc. Rocky Mountain Conf. on Artificial Intelligence.