
IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 60, NO. 1, JANUARY 2013 35

An Interactive Approach to Multiobjective Clustering of Gene Expression Patterns

Anirban Mukhopadhyay∗, Senior Member, IEEE, Ujjwal Maulik, Senior Member, IEEE, and Sanghamitra Bandyopadhyay, Senior Member, IEEE

Abstract—Some recent studies have posed the problem of data clustering as a multiobjective optimization problem, where several cluster validity indices are simultaneously optimized to obtain tradeoff clustering solutions. A number of cluster validity measures are available in the literature. However, none of the measures performs equally well on all kinds of datasets. Depending on the dataset properties and its inherent clustering structure, different cluster validity measures perform differently. Therefore, it is important to find the best set of validity indices that should be optimized simultaneously to obtain good clustering results. In this paper, a novel interactive genetic algorithm-based multiobjective approach is proposed that simultaneously finds the clustering solution and evolves the set of validity measures to be optimized. The proposed method interactively takes input from the human decision maker (DM) during execution and adaptively learns from that input to obtain the final set of validity measures along with the final clustering result. The algorithm is applied to clustering real-life benchmark gene expression datasets, and its performance is compared with that of several other existing clustering algorithms to demonstrate its effectiveness. The results indicate that the proposed method outperforms the other algorithms for all the datasets considered here.

Index Terms—Clustering, interactive algorithm, microarray gene expression, multiobjective genetic algorithm, Pareto optimality.

I. INTRODUCTION

CLUSTERING [1] is an important unsupervised data mining tool in which a set of patterns, usually vectors in a multidimensional space, is grouped into K clusters based on some similarity or dissimilarity criterion. In partitional clustering, the aim is to produce a K × n partition matrix U(X) of the given dataset X = {x1, x2, . . . , xn} consisting of n objects. The partition matrix may be represented as U = [ukj], k = 1, . . . , K and j = 1, . . . , n, where ukj is the membership of pattern xj to the kth cluster. In crisp partitioning, ukj = 1 if xj ∈ Ck; otherwise, ukj = 0. On the other hand, for fuzzy partitioning of the

Manuscript received May 6, 2012; revised August 27, 2012; accepted September 17, 2012. Date of publication September 28, 2012; date of current version December 14, 2012. Asterisk indicates corresponding author.

∗A. Mukhopadhyay is with the Department of Computer Science and Engineering, University of Kalyani, Kalyani 741235, West Bengal, India (e-mail: [email protected]).

U. Maulik is with the Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, West Bengal, India (e-mail: [email protected]).

S. Bandyopadhyay is with the Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700108, West Bengal, India (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TBME.2012.2220765

data, 0 ≤ ukj ≤ 1. The objective of clustering techniques is to find a suitable grouping of the input dataset so that some criteria are optimized. Hence, the problem of clustering can be posed as an optimization problem. The objectives to be optimized may represent different characteristics of the clusters, such as compactness, separation, and connectivity. A straightforward way to pose clustering as an optimization problem is to optimize some cluster validity index [2] that reflects the goodness of the clustering solutions. All possible partitionings of the dataset and the corresponding values of the validity index define the complete search space. Traditional partitional clustering techniques, such as K-means and fuzzy C-means (FCM) [1], employ greedy search over this space to optimize the compactness of the clusters. However, they often get stuck at local optima, depending on the choice of the initial cluster centers. Moreover, they optimize a single cluster validity index (compactness in this case) and therefore do not cover different characteristics of the datasets.
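As a concrete illustration of the two partitioning schemes, the following sketch (toy data and hypothetical assignments, not from the paper) builds a crisp K × n partition matrix and contrasts it with a fuzzy one:

```python
import numpy as np

def crisp_partition_matrix(labels, K):
    """Build the K x n crisp partition matrix U, where u_kj = 1
    if point j belongs to cluster k and u_kj = 0 otherwise."""
    n = len(labels)
    U = np.zeros((K, n))
    U[labels, np.arange(n)] = 1.0
    return U

labels = np.array([0, 0, 1, 2, 1])        # toy assignment of n = 5 points
U = crisp_partition_matrix(labels, K=3)   # each column sums to 1

# A fuzzy partition relaxes u_kj to the interval [0, 1]; a point may
# belong to several clusters with graded membership (columns still sum to 1).
U_fuzzy = np.array([[0.7, 0.2, 0.1],
                    [0.3, 0.8, 0.9]])     # hypothetical 2 x 3 memberships
```

In both cases each column of U describes how one point is distributed over the K clusters; the crisp matrix is simply the special case where every column is a one-hot vector.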

To overcome the problem of local optima, global optimization tools such as genetic algorithms (GAs) [3] have been widely used to reach the global optimum of the chosen validity measure. Conventional GA-based clustering techniques [4] use some validity measure as the fitness value. However, as no single validity measure works equally well for different kinds of datasets, it is natural to simultaneously optimize multiple such measures to capture different characteristics of the data. Some recent studies have explored the application of multiobjective optimization (MOO) [5] to clustering [6]–[9]. Unlike single-objective optimization, in MOO the search is performed over a number of, often conflicting, objective functions. The final solution set contains a number of nondominated solutions, none of which can be further improved on any one objective without degrading another.

In the existing approaches to GA-based multiobjective clustering, the algorithms simultaneously optimize two or three chosen cluster validity measures. However, it cannot be guaranteed that this predefined set of objective functions (validity measures) will work equally well for every dataset, since the performance of a cluster validity measure depends heavily on the dataset properties and its inherent clustering structure. Therefore, it is useful to devise a method that evolves the objective functions during the execution of the algorithm rather than fixing them a priori. Motivated by this, in this paper, a novel interactive GA-based approach for multiobjective clustering is proposed.

Interactive GAs are so named because they interact with a human operator, known as the decision maker (DM), during the

0018-9294/$31.00 © 2012 IEEE


execution to evaluate the solutions generated up to the current generation [10], [11]. In the clustering problem, this decision is often subjective and thus very difficult to predict using any objective measure such as a cluster validity index. Hence, in this paper, the proposed interactive multiobjective clustering (IMOC) algorithm starts with a set of cluster validity measures as the initial objective functions and periodically consults the DM to learn which validity measures are more suitable for the dataset being clustered. To reduce the DM's fatigue in ranking all the solutions up to the current generation, only a few important solutions from the current nondominated front are provided to the DM for evaluation. The DM ranks the solutions using visualization as well as available domain knowledge, and IMOC gradually tries to learn from this ranking which objective functions should be optimized to obtain a suitable partitioning of the dataset being clustered. For this purpose, the nondominated sorting GA-II (NSGA-II) [5] is used as the underlying MOO tool. For visualization, the cluster heatmap [12] and the VAT plot [13] are utilized.

IMOC is applied on two real-life microarray gene expression datasets to cluster the genes, and its performance is compared with that of several other existing clustering algorithms, such as K-means [1], FCM [14], hierarchical average linkage [1], SiMM-TS [15], MOGA-SVM [8], and an automatic version of IMOC (described later). Different results are reported to demonstrate that IMOC yields more biologically relevant clusters compared to the other algorithms. Moreover, statistical significance tests are performed to establish the superiority of IMOC over the other clustering approaches.

II. MULTIOBJECTIVE FUZZY CLUSTERING

In this section, we first provide the basic concepts and definitions of MOO. Thereafter, the multiobjective fuzzy clustering method adopted in this paper is briefly described.

A. Multiobjective Optimization

The MOO problem can formally be stated as follows [5]. Find the vector x* = [x1*, x2*, . . . , xn*]^T of decision variables that satisfies a number of equality and inequality constraints and optimizes the vector function f(x) = [f1(x), f2(x), . . . , fk(x)]^T. The constraints define the feasible region F, which contains all the admissible solutions; the vector x* denotes an optimal solution in F. The concept of Pareto optimality is useful in the domain of MOO. A formal definition of Pareto optimality, from the viewpoint of a minimization problem, may be given as follows: a decision vector x* is called Pareto optimal if and only if there is no x that dominates x*, i.e., there is no x such that ∀i ∈ {1, 2, . . . , k}, fi(x) ≤ fi(x*) and ∃i ∈ {1, 2, . . . , k}, fi(x) < fi(x*). Pareto optimality usually admits a set of solutions called nondominated solutions.

There are different approaches to solving multiobjective optimization problems. The nondominated sorting GA-II (NSGA-II) [5] is one of the popular multiobjective GAs. The multiobjective fuzzy clustering scheme [8] considered here uses NSGA-II as the underlying multiobjective framework for developing the proposed interactive clustering algorithm.
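To make the dominance relation concrete, the following sketch (toy objective values of our own, all objectives to be minimized) filters a small solution set down to its nondominated subset:

```python
import numpy as np

def dominates(fa, fb):
    """Solution a dominates b (minimization): a is no worse in every
    objective and strictly better in at least one."""
    fa, fb = np.asarray(fa), np.asarray(fb)
    return bool(np.all(fa <= fb) and np.any(fa < fb))

def nondominated(F):
    """Indices of the nondominated solutions in the objective matrix F."""
    return [i for i, fi in enumerate(F)
            if not any(dominates(fj, fi) for j, fj in enumerate(F) if j != i)]

F = [[1.0, 4.0], [2.0, 3.0], [3.0, 2.0], [2.5, 3.5]]  # two toy objectives
# [2.5, 3.5] is dominated by [2.0, 3.0]; the other three form the front.
front = nondominated(F)
```

The O(n²) pairwise check above is only illustrative; NSGA-II uses a more efficient fast nondominated sorting procedure.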

B. Multiobjective Clustering

The existing GA-based multiobjective clustering algorithms involve two main issues: chromosome representation and the choice of the cluster validity measures to be optimized. There are two popular strategies for chromosome representation: point-based and center-based. In point-based encoding [16], [17], the length of a chromosome equals the number of points. The value assigned to each position (corresponding to each point) is drawn from {1, . . . , K} (K = number of clusters), where K may be fixed or variable. If position i is assigned the value j, then the ith point is assigned to the jth cluster. Point-based encoding techniques are straightforward but suffer from large chromosome lengths and hence slow convergence. Moreover, such techniques may produce highly redundant chromosomes. In center-based encoding [4], [8], the cluster centers are encoded in the chromosomes. Hence, each chromosome is of length K × d, where d is the dimension of the data. Here also, K may vary, resulting in variable-length chromosomes. The advantage of center-based encoding is the shorter chromosome length, which usually yields a faster convergence rate than point-based encoding. Here, we adopt center-based encoding.
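The two encodings can be sketched side by side as follows (toy data and variable names of our own; the nearest-center decoding at the end is the standard way a center-based chromosome induces a partition):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))      # toy dataset: n = 100 points, d = 4
K = 3

# Point-based encoding: one gene per data point, chromosome length n.
point_chrom = rng.integers(0, K, size=len(X))

# Center-based encoding (adopted here): K cluster centers drawn from X,
# chromosome length K * d.
idx = rng.choice(len(X), size=K, replace=False)
center_chrom = X[idx].ravel()

# Decoding a center-based chromosome: nearest-center assignment.
centers = center_chrom.reshape(K, -1)
labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
```

For n = 100 and d = 4, the point-based chromosome has 100 genes while the center-based one has only K · d = 12, which is the convergence advantage noted above.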

The choice of objective functions (cluster validity measures) plays an important role in multiobjective clustering. As discussed, the existing multiobjective clustering algorithms predefine the objective functions to be optimized before the execution of the algorithm begins, and the same objective functions are used for all datasets. In [6] and [17], the authors used two objective functions, viz., cluster variance and connectivity. In [7] and [8], the authors used the XB and Jm indices as the two objectives. The authors in [18] optimized fuzzy compactness and fuzzy separation simultaneously. Three validity indices, viz., Jm, XB, and PBM, were optimized simultaneously in [19].

Unlike the existing approaches, here the best set of objective functions is evolved during the clustering process instead of being defined a priori. Five popular cluster validity measures, viz., the Davies–Bouldin (DB) index [20], the Xie–Beni (XB) index [21], the Jm index [14], the PBM index [2], and the Silhouette (S) index [22], are considered. The first three validity indices are of the minimization type, whereas the last two are of the maximization type. Moreover, among these validity indices, DB and S are crisp indices, whereas XB, Jm, and PBM are fuzzy indices. The fuzzy indices are computed using the fuzzy membership matrix, whereas the crisp indices are computed after defuzzifying the fuzzy membership matrix by assigning each data point to the cluster in which it has the highest membership degree. Starting with these five validity measures, the best set of objective functions is evolved automatically along with the clustering solution. These five indices were chosen for their popularity and because a mixture of crisp and fuzzy validity indices was desired. However, other validity measures, such as the connectedness index [6] and the point symmetry-based index [23], could have been included in the initial list. In fact, the user can start with any set of validity indices of his/her choice.
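As an illustration of how the fuzzy indices are evaluated from the membership matrix, here is a minimal sketch of the Jm and XB indices in their standard textbook forms (function and variable names are ours; U follows the paper's K × n convention):

```python
import numpy as np

def jm_index(X, centers, U, m=2.0):
    """Fuzzy J_m: membership-weighted sum of squared point-to-center
    distances (minimization type)."""
    d2 = ((X[:, None, :] - centers[None]) ** 2).sum(-1)   # n x K distances
    return float(((U.T ** m) * d2).sum())

def xb_index(X, centers, U, m=2.0):
    """Xie-Beni index: compactness divided by n times the minimum
    squared separation between cluster centers (minimization type)."""
    d2 = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
    sep = ((centers[:, None, :] - centers[None]) ** 2).sum(-1)
    np.fill_diagonal(sep, np.inf)                          # ignore self-distance
    return float(((U.T ** m) * d2).sum() / (len(X) * sep.min()))
```

For a crisp partition whose points coincide exactly with their centers, both indices evaluate to 0, their ideal value.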


1) Take as input a dataset X = {x1, x2, . . . , xn}, the number of clusters K, the set of initial objective functions O = {o1, o2, . . . , oM}, the interaction frequency F, the penalty threshold T, the fatigue threshold R, and the GA parameters: number of generations (G), population size (P), crossover probability (pc), and mutation probability (pm).

2) Initialize each chromosome of the initial population by encoding K random points from X as the cluster centers.

3) Initialize an objective importance vector V = {v1, v2, . . . , vM} of length M by setting each vi = 0 for i = 1, . . . , M. Here vi corresponds to objective oi.

4) Set the generation counter gen_count = 1 and the interaction frequency controller f = 1.2.

5) While gen_count ≤ G do:
   a) Evaluate each chromosome in the population by computing each objective function value oi ∈ O, i = 1, . . . , M.
   b) Rank the chromosomes in the population using nondominated sorting.
   c) Perform selection using the crowded binary tournament selection operator.
   d) Perform crossover and mutation to generate the offspring population.
   e) Combine the parent and child populations and replace the parent population by the best members (selected using nondominated sorting and the crowded comparison operator) of the combined population.
   f) If gen_count = ⌈f^F⌉ then:
      i) Set f = ⌈f^F⌉.
      ii) Visualize the top R% solutions (as per crowding distance) from the Rank 1 solutions for the DM to evaluate.
      iii) The DM ranks these clustering solutions as per his/her expertise and domain knowledge; let ra be the DM-specified rank of solution a, where a lower rank means a better solution.
      iv) For each oi ∈ O do:
             For each pair of DM-ranked solutions (a, b):
                If ra < rb and oi(a) > oi(b) then vi = vi − (rb − ra), vi ∈ V.
      v) If max(V) − min(V) > T then set O = O \ {op} and V = V \ {vp}, where p = arg min_i {vi | vi ∈ V}, and set vi = 0 for all vi ∈ V.
   g) Set gen_count = gen_count + 1.

6) Obtain the final clustering solution using the SVM-based majority voting ensemble.

7) Return the final clustering solution and O, the set of evolved objective functions.

Fig. 1. Steps of the proposed IMOC clustering method.
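The DM-feedback update [Steps 5(f)iv–v of Fig. 1] can be sketched in plain Python as follows (variable names are ours; all objectives are assumed to have already been cast as minimization):

```python
def penalize(V, dm_ranks, obj_values, T):
    """One DM interaction: penalize objectives that disagree with the
    DM's ranking; drop the most-penalized objective if the importance
    spread exceeds the penalty threshold T.

    V          -- importance score per objective (a modified copy is returned)
    dm_ranks   -- DM rank of each visualized solution (lower = better)
    obj_values -- obj_values[s][i] = value of objective i on solution s
    Returns (new_V, index_of_dropped_objective_or_None).
    """
    V = list(V)
    S, M = len(dm_ranks), len(V)
    for i in range(M):
        for a in range(S):
            for b in range(S):
                # DM says a is better than b, but objective i disagrees.
                if dm_ranks[a] < dm_ranks[b] and obj_values[a][i] > obj_values[b][i]:
                    V[i] -= dm_ranks[b] - dm_ranks[a]
    drop = None
    if max(V) - min(V) > T:
        drop = min(range(M), key=lambda i: V[i])   # most-penalized objective
        V = [0] * (M - 1)                          # reset scores after removal
    return V, drop
```

For example, with two objectives and two solutions ranked 1 and 2 by the DM, an objective that prefers the lower-ranked solution is penalized by the rank difference; once the spread of scores exceeds T, that objective is removed from O.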

III. PROPOSED IMOC ALGORITHM

In this section, the proposed IMOC algorithm is described in detail. As mentioned before, the proposed technique uses center-based encoding for chromosome representation. The main NSGA-II procedure is modified to incorporate interaction with the DM in order to evolve the best set of objective functions and the clustering simultaneously. The multiobjective optimization problem is modeled as a minimization problem in which all the objective functions are minimized; the maximization objectives PBM and S are therefore recast as 1/PBM and 1 − S, respectively. The final clustering solution is obtained from the nondominated front produced in the final generation using a support vector machine (SVM) classifier (with radial basis function (RBF) kernel) based ensemble method, as in [7] and [8]. The steps of IMOC are shown in Fig. 1.

The crossover and mutation operations are performed in Step 5(d). Here, the conventional single-point crossover [3] is used. For mutation, a randomly chosen cluster center of the chromosome to be mutated is perturbed slightly [4]. The crossover and mutation operations are controlled by the crossover probability pc and the mutation probability pm.

Fig. 2. Example of visualization of the clustering solutions provided to the DM. The first row shows the heatmaps and the second row shows the corresponding VAT representations.

There are three parameters of IMOC not related to the GA: the interaction frequency F, the penalty threshold T, and the fatigue threshold R. The interaction frequency F controls how often the program interacts with the DM and should be greater than 1. Step 5(f) decides whether the program will interact with the DM in the current generation. Here, f is called the interaction frequency controller and is initialized to 1.2 in Step 4. If f = 1.2 and F = 1.2, then the DM first interacts at generation ⌈1.2^1.2⌉ = 2. The next interaction will be at generation ⌈2^1.2⌉ = 3. Similarly, the subsequent DM interactions take place at generations 4, 6, 9, 14, 24, and so on. Note that the gap between two successive interactions increases as the program runs through the generations, i.e., initially IMOC interacts with the DM more often; as the generations progress and the solutions become more optimized, the rate of DM intervention decreases. At each generation in which the DM's intervention is needed, the top R% solutions are provided to the DM for feedback [Step 5(f)ii–v]. The DM ranks these solutions as per his/her expertise, and this ranking is used to penalize the objective functions that do not conform to it [Step 5(f)iv].

The parameters F and R control the DM's fatigue. If F is small and R is large, then the DM has to evaluate more solutions more frequently, and vice versa. To ensure approximately 10 interactions during the complete program execution, we have chosen F = 1.2 for 100 generations; the F value can be chosen accordingly if the number of generations is different. Similarly, we ensure that the DM evaluates at most ten solutions at each interaction. Therefore, the R value is taken to be 20%, since the population size is 50 in our case and the size of the first nondominated front at a particular generation cannot exceed 50. The R value can be adjusted accordingly for different population sizes.
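The interaction schedule just described can be reproduced with a short sketch. The ceiling-based update below is our reading of Step 5(f), chosen because it matches the worked sequence 2, 3, 4, 6, 9, 14, 24 given in the text:

```python
import math

def interaction_generations(G, F=1.2, f0=1.2):
    """Generations at which IMOC consults the DM: interact when the
    generation counter equals ceil(f ** F), then advance f to that
    generation, so the gaps between interactions grow over time."""
    gens, f = [], f0
    for gen in range(1, G + 1):
        if gen == math.ceil(f ** F):
            gens.append(gen)
            f = gen
    return gens

schedule = interaction_generations(100)   # [2, 3, 4, 6, 9, 14, 24, 46, 99]
```

With F = 1.2 and G = 100 this yields nine interactions, consistent with the roughly ten interactions targeted above, and the widening gaps reflect the decreasing rate of DM intervention as the run progresses.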

The penalty threshold T is used to determine whether the importance of an objective function has fallen too far. If the difference between the highest and lowest importance scores is greater than T, then the objective function with the lowest importance score is not considered further. Note that if T is small, this difference may quickly exceed the threshold, and thus the number of objectives to be optimized will decrease quickly. On the contrary, if T is large, it takes more generations for the difference to exceed T, and thus the number of objectives will shrink slowly. While experimenting, we noted that IMOC performs best when T is fixed around the average number of solutions examined by the DM at a particular generation. Thus, the value of T can be taken as |P × R|, where P is the population size and R is the fatigue threshold described earlier. In our case, P = 50 and R = 20%; therefore, T is set to 10. For different values of P and R, T can be chosen accordingly.

For visualization of the clustering solutions for the DM, two techniques are used: the cluster heatmap and the visual assessment of clustering tendency (VAT) representation. In the heatmap representation [12], the image of the data matrix is shown by representing the values of the matrix as colors, where similar values get similar colors. The data points are ordered before plotting so that points belonging to the same cluster are placed one after another; each cluster should therefore show a similar color pattern across all its points. In the VAT representation [13], to visualize a clustering solution, the points are sorted according to the class labels given by the solution, and the distance matrix is reordered accordingly. In the graphical plot of the distance matrix, the boxes lying on the main diagonal represent the cluster structure. Fig. 2 shows the heatmaps (top) and VAT representations (bottom) of five clustering solutions as presented to the DM.
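The label-driven reordering behind the VAT-style plot can be sketched as follows (toy data of our own; in the paper the reordered matrix is rendered as an image for the DM):

```python
import numpy as np

def reordered_distance_image(X, labels):
    """Reorder the pairwise Euclidean distance matrix by cluster label
    so that well-separated clusters appear as low-distance (dark)
    blocks along the main diagonal."""
    order = np.argsort(labels, kind="stable")
    D = np.sqrt(((X[:, None, :] - X[None]) ** 2).sum(-1))
    return D[np.ix_(order, order)]

# Two tight, well-separated toy clusters give a clear 2-block structure.
X = np.array([[0.0, 0.0], [10.0, 10.0], [0.1, 0.0], [10.0, 10.1]])
labels = np.array([0, 1, 0, 1])
img = reordered_distance_image(X, labels)
```

After reordering, within-cluster entries (small distances) cluster into diagonal boxes while between-cluster entries (large distances) fill the off-diagonal blocks, which is exactly the structure the DM inspects.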

The final clustering solution is obtained from the nondominated set of solutions produced in the final generation using the SVM-based majority voting ensemble (Step 6), as in [7] and [8]. The idea is that if a subset of points is almost always clustered together by most of the nondominated solutions, those points may safely be considered properly clustered. Hence, these points may be used to train a classifier (an SVM classifier here), which can thereafter be used to group the remaining low-confidence points. As a result, IMOC returns both the final clustering solution and the set of objective functions suitable for the dataset.
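A simplified sketch of this ensemble idea is given below, using scikit-learn's RBF-kernel SVM. It assumes the cluster labels of the nondominated solutions have already been aligned so that cluster ids are comparable across solutions, and it uses full agreement as the confidence criterion (the papers cited use a majority-vote fraction); both are our simplifications:

```python
import numpy as np
from sklearn.svm import SVC

def svm_voting_ensemble(X, label_matrix):
    """Each row of label_matrix holds one nondominated solution's cluster
    labels (pre-aligned across solutions). Points labeled identically by
    every solution are treated as high-confidence training points for an
    RBF-kernel SVM, which then labels the remaining points."""
    label_matrix = np.asarray(label_matrix)
    consensus = np.all(label_matrix == label_matrix[0], axis=0)
    clf = SVC(kernel="rbf").fit(X[consensus], label_matrix[0][consensus])
    final = label_matrix[0].copy()
    final[~consensus] = clf.predict(X[~consensus])
    return final

# Toy run: three solutions agree on four points and dispute the fifth,
# which lies near the first cluster and is resolved accordingly.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1], [0.2, 0.1]])
L = [[0, 0, 1, 1, 0],
     [0, 0, 1, 1, 1],
     [0, 0, 1, 1, 0]]
final = svm_voting_ensemble(X, L)
```

The trained classifier thus arbitrates only the disputed, low-confidence points, while the consensus points keep their agreed labels.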

IV. EXPERIMENTS AND RESULTS

IMOC is applied on two real-life gene expression datasets, viz., human fibroblasts serum and yeast cell cycle. One of the authors was arbitrarily selected as the DM in each run of IMOC. The performance of IMOC is compared with that of some other existing algorithms, such as K-means [1], FCM [14], hierarchical average linkage [1], SiMM-TS [15], and MOGA-SVM [8]. Besides these, a modified version of IMOC, in which the visualized solutions are ranked automatically instead of by the DM, is used for comparison; we call this algorithm IMOC-Auto. In this method, the VAT images of the visualized solutions are analyzed by the program as follows: first, the VAT image of a solution is converted to grayscale and the pixel values of each diagonal box (corresponding to a cluster) are considered. For a good clustering, the pixel values within a diagonal box are expected to be similar. Therefore, we compute the standard deviation Si of the pixel values in each diagonal box i and find the mean M = (1/k) Σ_{i=1}^{k} Si of these standard deviations over all k boxes. Note that lower values of M indicate better clustering. Hence, the visualized solutions are ranked in ascending order of their M values.
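The IMOC-Auto score can be sketched directly from this description (function name and the box-boundary representation are ours; the grayscale conversion of the VAT image is assumed to have been done already):

```python
import numpy as np

def auto_rank_score(vat_gray, boundaries):
    """IMOC-Auto ranking score M: the mean, over the k diagonal boxes of
    a grayscale VAT image, of the standard deviation of the pixel values
    inside each box. Lower scores indicate cleaner block structure.
    `boundaries` lists each cluster's (start, end) index on the diagonal."""
    stds = [vat_gray[a:b, a:b].std() for a, b in boundaries]
    return float(np.mean(stds))

# A crisp two-block VAT image scores better (lower) than a noisy one.
crisp = np.zeros((6, 6))
crisp[3:, :3] = 1.0
crisp[:3, 3:] = 1.0
noisy = np.random.default_rng(1).random((6, 6))
bounds = [(0, 3), (3, 6)]
```

Ranking the visualized solutions in ascending order of this score then stands in for the DM's judgment in IMOC-Auto.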

A. Datasets for Experiments

Microarray technology has made it possible to study the expression levels of thousands of genes simultaneously over a number of time points. A microarray dataset consisting of g genes and t time points is represented as a g × t matrix M = [mij], where each element mij represents the expression level of the ith gene at the jth time point. Two real-life benchmark microarray gene expression datasets, viz., human fibroblasts serum and yeast cell cycle, are used in the experiments.

1) Human Fibroblasts Serum: This dataset [24] contains the expression levels of 8613 human genes. The dataset has 13 dimensions corresponding to 12 time points (0, 0.25, 0.5, 1, 2, 4, 6, 8, 12, 16, 20, and 24 h) and one unsynchronized sample. A subset of 517 genes whose expression levels changed substantially across the time points is chosen, and the data are then log2-transformed. This dataset can be downloaded from http://www.sciencemag.org/feature/data/984559.shl.

2) Yeast Cell Cycle: The yeast cell cycle dataset was extracted from a dataset that shows the fluctuation of the expression levels of approximately 6000 genes over two cell cycles (17 time points). Out of these 6000 genes, 384 genes identified as cell-cycle regulated are selected [25]. This dataset is publicly available at http://faculty.washington.edu/kayee/cluster.

Both the datasets have been normalized so that each row hasmean 0 and variance 1.
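This row-wise standardization can be expressed as a one-line transform (a routine z-scoring sketch, not code from the paper):

```python
import numpy as np

def normalize_rows(M):
    """Standardize each gene (row) of the expression matrix to mean 0
    and variance 1, so clustering compares expression profiles rather
    than absolute expression levels."""
    M = np.asarray(M, dtype=float)
    return (M - M.mean(axis=1, keepdims=True)) / M.std(axis=1, keepdims=True)

Z = normalize_rows([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
```

After the transform, the two toy genes above have identical profiles, which is the intended effect: genes with the same temporal pattern but different magnitudes become directly comparable.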

B. Performance Metrics

To compare the performance of the clustering algorithms, the biological homogeneity index (BHI) [26] is used. BHI measures the goodness of a clustering solution with respect to a reference set of functional classes. The functional classes are created using two web-based functional annotation tools, viz., DAVID [27] (for the serum dataset) and FatiGO [28] (for the cell cycle dataset). Out of the 517 genes of the serum dataset, 357 genes were annotated by DAVID; these genes are grouped into six overlapping functional classes. For the cell cycle data, the level-5 functional annotation produced by FatiGO is considered, and the functional classes containing at least 12.8% of the genes are selected. This results in five overlapping functional classes containing 181 genes in total.

Suppose two annotated genes x and y belong to the same cluster D produced by some algorithm. Let C(x) and C(y) be the functional classes that contain the genes x and y, respectively. The indicator function I(C(x) = C(y)) takes the value 1 if C(x) matches C(y); in the case of membership in multiple functional classes, any one match is considered sufficient. Since the genes x and y belong to the same cluster in the clustering produced by the algorithm, the two functional classes are expected to match. Thus, the BHI index is defined as

\[
\mathrm{BHI} = \frac{1}{K}\sum_{j=1}^{K}\frac{1}{n_j(n_j-1)}\sum_{x \neq y \in D_j} I\bigl(C(x)=C(y)\bigr). \tag{1}
\]

Here, K is the number of clusters produced by the algorithm and n_j is the number of annotated genes in cluster D_j. A larger value of the BHI index implies that the clustering algorithm produces more biologically homogeneous clusters.
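A direct implementation of the BHI index in (1) might look like the following sketch. The gene identifiers and annotation map are illustrative; following the multiple-membership rule above, a pair of co-clustered genes counts as a match if their class sets share at least one class.

```python
def bhi(clusters, annotations):
    """Biological homogeneity index (BHI).

    clusters: list of clusters, each a list of gene identifiers.
    annotations: dict mapping a gene to the SET of functional
        classes it belongs to; unannotated genes are absent.
    """
    total = 0.0
    for cluster in clusters:
        genes = [g for g in cluster if g in annotations]
        n_j = len(genes)
        if n_j < 2:
            continue  # no gene pairs to evaluate in this cluster
        matches = sum(
            1
            for i, x in enumerate(genes)
            for y in genes[i + 1:]
            if annotations[x] & annotations[y]  # any shared class
        )
        # each unordered pair counted once: n_j * (n_j - 1) / 2 pairs,
        # equivalent to summing over ordered pairs / (n_j * (n_j - 1))
        total += matches / (n_j * (n_j - 1) / 2)
    return total / len(clusters)
```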

C. Input Parameters

The different parameters of IMOC are set as follows: number of generations = 100, population size = 50, crossover probability = 0.8, mutation probability = 0.01, interaction frequency F = 1.2, penalty threshold T = 10, and fatigue threshold R = 20%. IMOC starts with all five validity measures; hence, initially O = {DB, XB, Jm, 1/PBM, 1 − S}. The Euclidean distance measure is used for all the algorithms.

D. Results and Discussion

Each algorithm is executed for K = 2, 3, . . . , 10, and for each K, the average BHI value over 20 runs is considered. The plots of BHI for the serum and cell cycle datasets are shown in Figs. 3 and 4, respectively. The plots reveal that the performance of average linkage, SiMM-TS, and MOGA-SVM is in general better than that of K-means and FCM. This conforms to the findings in [26], [15], and [8], respectively. IMOC-Auto has also given reasonably good values of BHI. However, IMOC is found to provide the maximum average BHI scores for almost all values

Fig. 3. Plots of BHI for different clustering algorithms for serum data.

Fig. 4. Plots of BHI for different clustering algorithms for cell cycle data.

TABLE I: BEST AVERAGE BHI VALUES PRODUCED BY DIFFERENT ALGORITHMS ALONG WITH THE CORRESPONDING NUMBER OF CLUSTERS

of K, outperforming even MOGA-SVM, which is a noninteractive multiobjective clustering method optimizing the XB and Jm indices, and IMOC-Auto, which is an automatic version of IMOC. Hence, these results indicate that IMOC can produce more biologically relevant clusters than the other methods. Moreover, it is found that for the serum and cell cycle datasets, the best combinations of objective functions are [DB, XB] and [XB, PBM, S], respectively. This indicates that the same set of objective functions is not suitable for all datasets.

Table I reports the highest average BHI values produced by the different algorithms along with the corresponding number of clusters for both datasets. For the serum and cell cycle datasets, IMOC yields the maximum BHI scores for K = 6 (0.4530) and K = 5 (0.5725), respectively. These values are better than those provided by the other algorithms, including MOGA-SVM and IMOC-Auto, the two closest contenders. Since the number of functional classes for the two datasets

Fig. 5. Serum data clustered using the proposed IMOC clustering method. (a) Heatmap. (b) Cluster profile plots.

are also 6 and 5, respectively, it appears that IMOC detects the number of clusters reasonably well.

For the purpose of illustration, the heatmap and the cluster profile plots for the best clustering solution (in terms of the BHI index) found by IMOC for the serum dataset are shown in Fig. 5. In the heatmap, the expression value of a gene at a specific time point is represented by coloring the corresponding cell of the data matrix with a color similar to the original color of its spot on the microarray. Shades of red represent higher expression levels, shades of green represent lower expression levels, and colors toward black represent the absence of differential expression. The genes are ordered before plotting so that genes belonging to the same cluster are placed one after another. The cluster boundaries are marked by white blank rows. The six clusters are clearly visible in the heatmap [see Fig. 5(a)]. It is evident from the figure that the expression profiles of the genes within a cluster are similar to each other and produce similar color patterns. The cluster profile plots show, for each cluster, the normalized gene expression values (light green) of the genes of that cluster across the time points. The average expression values of the genes of a cluster over the different time points are also plotted as a black line, together with the standard deviation within the cluster at each time point. Fig. 5(b) shows the cluster profile plots for the serum dataset and demonstrates how the cluster profiles of the different groups of genes differ from each other, while the profiles within a group are reasonably similar. Fig. 6 shows the heatmap and the cluster profile plots for the best clustering produced by IMOC when applied to the cell cycle dataset; this figure also demonstrates the homogeneity of the clusters.

E. Statistical Significance Test

To establish that IMOC is significantly superior to the other algorithms, a nonparametric statistical significance test, Wilcoxon's rank sum test for independent samples [29], is conducted at the 5% significance level. Since the rank sum test does not assume a normal distribution of the input samples, and a normality test is difficult for a small number of samples (20 in our case), we have used the rank sum test instead of a parametric

Fig. 6. Cell cycle data clustered using the proposed IMOC clustering method. (a) Heatmap. (b) Cluster profile plots.

TABLE II: MEDIAN VALUES OF BHI PRODUCED BY DIFFERENT ALGORITHMS ALONG WITH THE CORRESPONDING NUMBER OF CLUSTERS

TABLE III: P-VALUES PRODUCED BY WILCOXON'S RANK SUM TEST COMPARING IMOC WITH OTHER ALGORITHMS

test. Except for average linkage, all the other algorithms are probabilistic in nature, i.e., they may produce different clustering results in different runs depending on the initialization and the subjective evaluation by the DM (in the case of IMOC). It is found that in all the runs, IMOC produces a better BHI index score than the average linkage algorithm. Therefore, the average linkage algorithm is not considered in the statistical test. Six groups, corresponding to the six algorithms (1 IMOC, 2 MOGA-SVM, 3 K-means, 4 FCM, 5 SiMM-TS, 6 IMOC-Auto), are created for each dataset. Each group consists of the BHI index scores produced over 20 runs of the corresponding algorithm for the number of clusters shown in Table I. The median values of each group for both datasets are reported in Table II.

As is evident from Table II, the median BHI scores for IMOC are better than those for the other algorithms. To establish that this superiority is statistically significant, Table III reports the P-values produced by the rank sum test for the comparison of two groups at a time (the group corresponding to IMOC and a group corresponding to another algorithm). Under the null hypothesis, it is assumed that there is no significant difference between the median values of the two groups, whereas the

alternative hypothesis is that there is a significant difference between the median values of the two groups. Note that as this is a multiple comparison test, we have set the P-value threshold to 0.0125 (0.05/4) according to the Bonferroni inequality to achieve an overall 5% significance level. All the P-values reported in the table are much less than 0.0125. This is strong evidence against the null hypothesis, indicating that the better median values of the performance metric produced by IMOC are statistically significant and have not occurred by chance.
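The testing procedure described above (pairwise rank sum tests against IMOC with a Bonferroni-corrected threshold) could be sketched with SciPy as follows. The score lists are hypothetical placeholders for the 20-run BHI samples, and the correction here divides alpha by the number of comparisons actually made.

```python
from scipy.stats import ranksums

def compare_to_baselines(imoc_scores, baseline_groups, alpha=0.05):
    """Wilcoxon rank sum test of IMOC's BHI scores against each
    competing algorithm's scores, with a Bonferroni-corrected
    threshold of alpha / (number of comparisons).

    baseline_groups: dict mapping algorithm name -> list of BHI
    scores over independent runs (hypothetical inputs here).
    Returns {name: (p_value, significant_after_correction)}.
    """
    threshold = alpha / len(baseline_groups)
    results = {}
    for name, scores in baseline_groups.items():
        _, p = ranksums(imoc_scores, scores)  # two-sided P-value
        results[name] = (p, p < threshold)
    return results
```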

V. CONCLUSION

Motivated by the observation that a single or a predefined set of cluster validity measures cannot perform equally well for different types of datasets, in this paper a novel interactive version of multiobjective clustering, called IMOC, is proposed. The proposed algorithm starts with a set of objective functions in the form of validity measures to be optimized simultaneously, and gradually evolves both the clustering and the most suitable subset of the validity measures for the dataset under consideration. For this purpose, IMOC periodically interacts with a human decision maker and uses his/her subjective decisions (based on domain knowledge and expertise) to learn which objective functions are more suitable for the dataset. The performance of IMOC has been demonstrated on two real-life gene expression datasets and compared with that of several other existing clustering algorithms. The results indicate that IMOC produces more biologically significant clusters than the other algorithms, and that this improvement is statistically significant.

In this study, we have acted as the DM during the execution of IMOC, and we did not use any biological property of the clustering solutions to evaluate and rank them; subjective evaluation was performed based on visualization of the solutions. As an interesting direction for future work, the DM could also use biological information while evaluating a clustering solution; IMOC can accommodate this without any change to the core program. Moreover, a detailed study of other cluster validity measures should also be made.

REFERENCES

[1] A. K. Jain and R. C. Dubes, "Data clustering: A review," ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, 1999.

[2] U. Maulik, S. Bandyopadhyay, and A. Mukhopadhyay, Multiobjective Genetic Algorithms for Clustering: Applications in Data Mining and Bioinformatics. New York: Springer-Verlag, 2011.

[3] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning. New York: Addison-Wesley, 1989.

[4] U. Maulik and S. Bandyopadhyay, "Genetic algorithm based clustering technique," Pattern Recognit., vol. 33, pp. 1455–1465, 2000.

[5] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, "A fast and elitist multiobjective genetic algorithm: NSGA-II," IEEE Trans. Evol. Comput., vol. 6, no. 2, pp. 182–197, Apr. 2002.

[6] J. Handl and J. Knowles, "An evolutionary approach to multiobjective clustering," IEEE Trans. Evol. Comput., vol. 11, no. 1, pp. 56–76, Feb. 2007.

[7] A. Mukhopadhyay and U. Maulik, "Unsupervised pixel classification in satellite imagery using multiobjective fuzzy clustering combined with SVM classifier," IEEE Trans. Geosci. Remote Sens., vol. 47, no. 4, pp. 1132–1138, Apr. 2009.

[8] U. Maulik, A. Mukhopadhyay, and S. Bandyopadhyay, "Combining Pareto-optimal clusters using supervised learning for identifying co-expressed genes," BMC Bioinform., vol. 10, no. 1, p. 27, 2009.

[9] A. Mukhopadhyay, U. Maulik, and S. Bandyopadhyay, "Multiobjective genetic algorithm based fuzzy clustering of categorical attributes," IEEE Trans. Evol. Comput., vol. 13, no. 5, pp. 991–1005, Oct. 2009.

[10] H. Takagi, "Interactive evolutionary computation: Fusion of the capabilities of EC optimization and human evaluation," Proc. IEEE, vol. 89, no. 9, pp. 1275–1296, Sep. 2001.

[11] I. Parmee and J. Abraham, Interactive Evolutionary Design. New York: Springer, 2005, pp. 435–458.

[12] W. Shannon, R. Culverhouse, and J. Duncan, "Analyzing microarray data using cluster analysis," Pharmacogenomics, vol. 4, no. 1, pp. 41–51, 2003.

[13] J. C. Bezdek and R. J. Hathaway, "VAT: A tool for visual assessment of (cluster) tendency," in Proc. Int. Joint Conf. Neural Netw., 2002, vol. 3, pp. 2225–2230.

[14] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum, 1981.

[15] S. Bandyopadhyay, A. Mukhopadhyay, and U. Maulik, "An improved algorithm for clustering gene expression data," Bioinformatics, vol. 23, no. 21, pp. 2859–2865, 2007.

[16] Y. Lu, S. Lu, F. Fotouhi, Y. Deng, and S. J. Brown, "Incremental genetic k-means algorithm and its application in gene expression data analysis," BMC Bioinform., vol. 5, no. 1, p. 172, 2004.

[17] J. Handl and J. Knowles, "Multiobjective clustering and cluster validation," in Multi-Objective Machine Learning (Studies in Computational Intelligence, vol. 16). New York: Springer, 2006, pp. 21–47.

[18] A. Mukhopadhyay and U. Maulik, "A multiobjective approach to MR brain image segmentation," Appl. Soft Comput., vol. 11, pp. 872–880, Jan. 2011.

[19] A. Mukhopadhyay, U. Maulik, and S. Bandyopadhyay, "Multiobjective genetic clustering with ensemble among Pareto front solutions: Application to MRI brain image segmentation," in Proc. Int. Conf. Adv. Pattern Recognit., 2009, pp. 236–239.

[20] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-1, no. 2, pp. 224–227, Apr. 1979.

[21] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, no. 8, pp. 841–847, Aug. 1991.

[22] P. Rousseeuw, "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis," J. Comput. Appl. Math., vol. 20, pp. 53–65, 1987.

[23] S. Bandyopadhyay and S. Saha, "A point symmetry-based clustering technique for automatic evolution of clusters," IEEE Trans. Knowl. Data Eng., vol. 20, no. 11, pp. 1441–1457, Nov. 2008.

[24] V. R. Iyer, M. B. Eisen, D. T. Ross, G. Schuler, T. Moore, J. Lee, J. M. Trent, L. M. Staudt, J. J. Hudson, M. S. Boguski, D. Lashkari, D. Shalon, D. Botstein, and P. O. Brown, "The transcriptional program in the response of human fibroblasts to serum," Science, vol. 283, pp. 83–87, 1999.

[25] R. J. Cho, M. J. Campbell, E. A. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. G. Wolfsberg, A. E. Gabrielian, D. Landsman, D. J. Lockhart, and R. W. Davis, "A genome-wide transcriptional analysis of the mitotic cell cycle," Mol. Cell, vol. 2, pp. 65–73, 1998.

[26] S. Datta and S. Datta, "Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes," BMC Bioinform., vol. 7, no. 1, p. 397, 2006.

[27] G. Dennis, B. T. Sherman, D. A. Hosack, J. Yang, W. Gao, H. C. Lane, and R. A. Lempicki, "DAVID: Database for annotation, visualization, and integrated discovery," Genome Biol., vol. 4, no. 5, p. P3, 2004.

[28] F. Al-Shahrour, R. Diaz-Uriarte, and J. Dopazo, "FatiGO: A Web tool for finding significant associations of Gene Ontology terms with groups of genes," Bioinformatics, vol. 20, no. 4, pp. 578–580, 2004.

[29] M. Hollander and D. A. Wolfe, Nonparametric Statistical Methods, 2nd ed. New York: Wiley, 1999.

Authors’ photographs and biographies not available at the time of publication.