
Data Clustering: A Review

A.K. JAIN

Michigan State University

M.N. MURTY

Indian Institute of Science

AND

P.J. FLYNN

The Ohio State University

Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis. However, clustering is a difficult problem combinatorially, and differences in assumptions and contexts in different communities have made the transfer of useful generic concepts and methodologies slow to occur. This paper presents an overview of pattern clustering methods from a statistical pattern recognition perspective, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners. We present a taxonomy of clustering techniques, and identify cross-cutting themes and recent advances. We also describe some important applications of clustering algorithms such as image segmentation, object recognition, and information retrieval.

Categories and Subject Descriptors: I.5.1 [Pattern Recognition]: Models; I.5.3 [Pattern Recognition]: Clustering; I.5.4 [Pattern Recognition]: Applications—Computer vision; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Clustering; I.2.6 [Artificial Intelligence]: Learning—Knowledge acquisition

General Terms: Algorithms

Additional Key Words and Phrases: Cluster analysis, clustering applications, exploratory data analysis, incremental clustering, similarity indices, unsupervised learning

Section 6.1 is based on the chapter “Image Segmentation Using Clustering” by A.K. Jain and P.J. Flynn, Advances in Image Understanding: A Festschrift for Azriel Rosenfeld (K. Bowyer and N. Ahuja, Eds.), 1996 IEEE Computer Society Press, and is used by permission of the IEEE Computer Society.

Authors’ addresses: A. Jain, Department of Computer Science, Michigan State University, A714 Wells Hall, East Lansing, MI 48824; M. Murty, Department of Computer Science and Automation, Indian Institute of Science, Bangalore, 560 012, India; P. Flynn, Department of Electrical Engineering, The Ohio State University, Columbus, OH 43210.

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.

© 2000 ACM 0360-0300/99/0900–0001 $5.00

ACM Computing Surveys, Vol. 31, No. 3, September 1999


1. INTRODUCTION

1.1 Motivation

Data analysis underlies many computing applications, either in a design phase or as part of their on-line operations. Data analysis procedures can be dichotomized as either exploratory or confirmatory, based on the availability of appropriate models for the data source, but a key element in both types of procedures (whether for hypothesis formation or decision-making) is the grouping, or classification of measurements based on either (i) goodness-of-fit to a postulated model, or (ii) natural groupings (clustering) revealed through analysis. Cluster analysis is the organization of a collection of patterns (usually represented as a vector of measurements, or a point in a multidimensional space) into clusters based on similarity.

Intuitively, patterns within a valid cluster are more similar to each other than they are to a pattern belonging to a different cluster. An example of clustering is depicted in Figure 1. The input patterns are shown in Figure 1(a), and the desired clusters are shown in Figure 1(b). Here, points belonging to the same cluster are given the same label. The variety of techniques for representing data, measuring proximity (similarity) between data elements, and grouping data elements has produced a rich and often confusing assortment of clustering methods.

It is important to understand the difference between clustering (unsupervised classification) and discriminant analysis (supervised classification). In supervised classification, we are provided with a collection of labeled (preclassified) patterns; the problem is to label a newly encountered, yet unlabeled, pattern. Typically, the given labeled (training) patterns are used to learn the descriptions of classes which in turn are used to label a new pattern. In the case of clustering, the problem is to group a given collection of unlabeled patterns into meaningful clusters. In a sense, labels are associated with clusters also, but these category labels are data driven; that is, they are obtained solely from the data.

Clustering is useful in several exploratory pattern-analysis, grouping, decision-making, and machine-learning situations, including data mining, document retrieval, image segmentation, and pattern classification. However, in many such problems, there is little prior information (e.g., statistical models) available about the data, and the decision-maker must make as few assumptions about the data as possible. It is under these restrictions that clustering methodology is particularly appropriate for the exploration of interrelationships among the data points to make an assessment (perhaps preliminary) of their structure.

CONTENTS

1. Introduction
   1.1 Motivation
   1.2 Components of a Clustering Task
   1.3 The User's Dilemma and the Role of Expertise
   1.4 History
   1.5 Outline
2. Definitions and Notation
3. Pattern Representation, Feature Selection and Extraction
4. Similarity Measures
5. Clustering Techniques
   5.1 Hierarchical Clustering Algorithms
   5.2 Partitional Algorithms
   5.3 Mixture-Resolving and Mode-Seeking Algorithms
   5.4 Nearest Neighbor Clustering
   5.5 Fuzzy Clustering
   5.6 Representation of Clusters
   5.7 Artificial Neural Networks for Clustering
   5.8 Evolutionary Approaches for Clustering
   5.9 Search-Based Approaches
   5.10 A Comparison of Techniques
   5.11 Incorporating Domain Constraints in Clustering
   5.12 Clustering Large Data Sets
6. Applications
   6.1 Image Segmentation Using Clustering
   6.2 Object and Character Recognition
   6.3 Information Retrieval
   6.4 Data Mining
7. Summary

The term “clustering” is used in several research communities to describe methods for grouping of unlabeled data. These communities have different terminologies and assumptions for the components of the clustering process and the contexts in which clustering is used. Thus, we face a dilemma regarding the scope of this survey. The production of a truly comprehensive survey would be a monumental task given the sheer mass of literature in this area. The accessibility of the survey might also be questionable given the need to reconcile very different vocabularies and assumptions regarding clustering in the various communities.

The goal of this paper is to survey the core concepts and techniques in the large subset of cluster analysis with its roots in statistics and decision theory. Where appropriate, references will be made to key concepts and techniques arising from clustering methodology in the machine-learning and other communities.

The audience for this paper includes practitioners in the pattern recognition and image analysis communities (who should view it as a summarization of current practice), practitioners in the machine-learning communities (who should view it as a snapshot of a closely related field with a rich history of well-understood techniques), and the broader audience of scientific professionals (who should view it as an accessible introduction to a mature field that is making important contributions to computing application areas).

1.2 Components of a Clustering Task

Typical pattern clustering activity involves the following steps [Jain and Dubes 1988]:

(1) pattern representation (optionally including feature extraction and/or selection),

(2) definition of a pattern proximity measure appropriate to the data domain,

(3) clustering or grouping,

(4) data abstraction (if needed), and

(5) assessment of output (if needed).

Figure 2 depicts a typical sequencing of the first three of these steps, including a feedback path where the grouping process output could affect subsequent feature extraction and similarity computations.

Figure 1. Data clustering.

Pattern representation refers to the number of classes, the number of available patterns, and the number, type, and scale of the features available to the clustering algorithm. Some of this information may not be controllable by the practitioner. Feature selection is the process of identifying the most effective subset of the original features to use in clustering. Feature extraction is the use of one or more transformations of the input features to produce new salient features. Either or both of these techniques can be used to obtain an appropriate set of features to use in clustering.

Pattern proximity is usually measured by a distance function defined on pairs of patterns. A variety of distance measures are in use in the various communities [Anderberg 1973; Jain and Dubes 1988; Diday and Simon 1976]. A simple distance measure like Euclidean distance can often be used to reflect dissimilarity between two patterns, whereas other similarity measures can be used to characterize the conceptual similarity between patterns [Michalski and Stepp 1983]. Distance measures are discussed in Section 4.

The grouping step can be performed in a number of ways. The output clustering (or clusterings) can be hard (a partition of the data into groups) or fuzzy (where each pattern has a variable degree of membership in each of the output clusters). Hierarchical clustering algorithms produce a nested series of partitions based on a criterion for merging or splitting clusters based on similarity. Partitional clustering algorithms identify the partition that optimizes (usually locally) a clustering criterion. Additional techniques for the grouping operation include probabilistic [Brailovski 1991] and graph-theoretic [Zahn 1971] clustering methods. The variety of techniques for cluster formation is described in Section 5.

Data abstraction is the process of extracting a simple and compact representation of a data set. Here, simplicity is either from the perspective of automatic analysis (so that a machine can perform further processing efficiently) or it is human-oriented (so that the representation obtained is easy to comprehend and intuitively appealing). In the clustering context, a typical data abstraction is a compact description of each cluster, usually in terms of cluster prototypes or representative patterns such as the centroid [Diday and Simon 1976].

How is the output of a clustering algorithm evaluated? What characterizes a ‘good’ clustering result and a ‘poor’ one? All clustering algorithms will, when presented with data, produce clusters, regardless of whether the data contain clusters or not. If the data does contain clusters, some clustering algorithms may obtain ‘better’ clusters than others. The assessment of a clustering procedure’s output, then, has several facets. One is actually an assessment of the data domain rather than the clustering algorithm itself: data which do not contain clusters should not be processed by a clustering algorithm. The study of cluster tendency, wherein the input data are examined to see if there is any merit to a cluster analysis prior to one being performed, is a relatively inactive research area, and will not be considered further in this survey. The interested reader is referred to Dubes [1987] and Cheng [1995] for information.

Figure 2. Stages in clustering: patterns pass through feature selection/extraction to yield pattern representations; interpattern similarities are then computed and the grouping step produces clusters, with a feedback loop from the grouping output back to feature selection/extraction.

Cluster validity analysis, by contrast, is the assessment of a clustering procedure’s output. Often this analysis uses a specific criterion of optimality; however, these criteria are usually arrived at subjectively. Hence, little in the way of ‘gold standards’ exists in clustering except in well-prescribed subdomains. Validity assessments are objective [Dubes 1993] and are performed to determine whether the output is meaningful. A clustering structure is valid if it cannot reasonably have occurred by chance or as an artifact of a clustering algorithm. When statistical approaches to clustering are used, validation is accomplished by carefully applying statistical methods and testing hypotheses. There are three types of validation studies. An external assessment of validity compares the recovered structure to an a priori structure. An internal examination of validity tries to determine if the structure is intrinsically appropriate for the data. A relative test compares two structures and measures their relative merit. Indices used for this comparison are discussed in detail in Jain and Dubes [1988] and Dubes [1993], and are not discussed further in this paper.

1.3 The User’s Dilemma and the Role of Expertise

The availability of such a vast collection of clustering algorithms in the literature can easily confound a user attempting to select an algorithm suitable for the problem at hand. In Dubes and Jain [1976], a set of admissibility criteria defined by Fisher and Van Ness [1971] is used to compare clustering algorithms. These admissibility criteria are based on: (1) the manner in which clusters are formed, (2) the structure of the data, and (3) sensitivity of the clustering technique to changes that do not affect the structure of the data. However, there is no critical analysis of clustering algorithms dealing with important questions such as

—How should the data be normalized?

—Which similarity measure is appropriate to use in a given situation?

—How should domain knowledge be utilized in a particular clustering problem?

—How can a very large data set (say, a million patterns) be clustered efficiently?

These issues have motivated this survey, and its aim is to provide a perspective on the state of the art in clustering methodology and algorithms. With such a perspective, an informed practitioner should be able to confidently assess the tradeoffs of different techniques, and ultimately make a competent decision on a technique or suite of techniques to employ in a particular application.

There is no clustering technique that is universally applicable in uncovering the variety of structures present in multidimensional data sets. For example, consider the two-dimensional data set shown in Figure 1(a). Not all clustering techniques can uncover all the clusters present here with equal facility, because clustering algorithms often contain implicit assumptions about cluster shape or multiple-cluster configurations based on the similarity measures and grouping criteria used.

Humans perform competitively with automatic clustering procedures in two dimensions, but most real problems involve clustering in higher dimensions. It is difficult for humans to obtain an intuitive interpretation of data embedded in a high-dimensional space. In addition, data hardly follow the “ideal” structures (e.g., hyperspherical, linear) shown in Figure 1. This explains the large number of clustering algorithms which continue to appear in the literature; each new clustering algorithm performs slightly better than the existing ones on a specific distribution of patterns.

It is essential for the user of a clustering algorithm to not only have a thorough understanding of the particular technique being utilized, but also to know the details of the data gathering process and to have some domain expertise; the more information the user has about the data at hand, the more likely the user would be able to succeed in assessing its true class structure [Jain and Dubes 1988]. This domain information can also be used to improve the quality of feature extraction, similarity computation, grouping, and cluster representation [Murty and Jain 1995].

Appropriate constraints on the data source can be incorporated into a clustering procedure. One example of this is mixture resolving [Titterington et al. 1985], wherein it is assumed that the data are drawn from a mixture of an unknown number of densities (often assumed to be multivariate Gaussian). The clustering problem here is to identify the number of mixture components and the parameters of each component. The concept of density clustering and a methodology for decomposition of feature spaces [Bajcsy 1997] have also been incorporated into traditional clustering methodology, yielding a technique for extracting overlapping clusters.

1.4 History

Even though there is an increasing interest in the use of clustering methods in pattern recognition [Anderberg 1973], image processing [Jain and Flynn 1996] and information retrieval [Rasmussen 1992; Salton 1991], clustering has a rich history in other disciplines [Jain and Dubes 1988] such as biology, psychiatry, psychology, archaeology, geology, geography, and marketing. Other terms more or less synonymous with clustering include unsupervised learning [Jain and Dubes 1988], numerical taxonomy [Sneath and Sokal 1973], vector quantization [Oehler and Gray 1995], and learning by observation [Michalski and Stepp 1983]. The field of spatial analysis of point patterns [Ripley 1988] is also related to cluster analysis. The importance and interdisciplinary nature of clustering is evident through its vast literature.

A number of books on clustering have been published [Jain and Dubes 1988; Anderberg 1973; Hartigan 1975; Spath 1980; Duran and Odell 1974; Everitt 1993; Backer 1995], in addition to some useful and influential review papers. A survey of the state of the art in clustering circa 1978 was reported in Dubes and Jain [1980]. A comparison of various clustering algorithms for constructing the minimal spanning tree and the short spanning path was given in Lee [1981]. Cluster analysis was also surveyed in Jain et al. [1986]. A review of image segmentation by clustering was reported in Jain and Flynn [1996]. Comparisons of various combinatorial optimization schemes, based on experiments, have been reported in Mishra and Raghavan [1994] and Al-Sultan and Khan [1996].

1.5 Outline

This paper is organized as follows. Section 2 presents definitions of terms to be used throughout the paper. Section 3 summarizes pattern representation, feature extraction, and feature selection. Various approaches to the computation of proximity between patterns are discussed in Section 4. Section 5 presents a taxonomy of clustering approaches, describes the major techniques in use, and discusses emerging techniques for clustering incorporating non-numeric constraints and the clustering of large sets of patterns. Section 6 discusses applications of clustering methods to image analysis and data mining problems. Finally, Section 7 presents some concluding remarks.

2. DEFINITIONS AND NOTATION

The following terms and notation are used throughout this paper.

—A pattern (or feature vector, observation, or datum) $x$ is a single data item used by the clustering algorithm. It typically consists of a vector of $d$ measurements: $x = (x_1, \ldots, x_d)$.

—The individual scalar components $x_i$ of a pattern $x$ are called features (or attributes).


—$d$ is the dimensionality of the pattern or of the pattern space.

—A pattern set is denoted $\mathcal{X} = \{x_1, \ldots, x_n\}$. The $i$th pattern in $\mathcal{X}$ is denoted $x_i = (x_{i,1}, \ldots, x_{i,d})$. In many cases a pattern set to be clustered is viewed as an $n \times d$ pattern matrix.

—A class, in the abstract, refers to a state of nature that governs the pattern generation process in some cases. More concretely, a class can be viewed as a source of patterns whose distribution in feature space is governed by a probability density specific to the class. Clustering techniques attempt to group patterns so that the classes thereby obtained reflect the different pattern generation processes represented in the pattern set.

—Hard clustering techniques assign a class label $l_i$ to each pattern $x_i$, identifying its class. The set of all labels for a pattern set $\mathcal{X}$ is $\mathcal{L} = \{l_1, \ldots, l_n\}$, with $l_i \in \{1, \ldots, k\}$, where $k$ is the number of clusters.

—Fuzzy clustering procedures assign to each input pattern $x_i$ a fractional degree of membership $f_{ij}$ in each output cluster $j$.

—A distance measure (a specialization of a proximity measure) is a metric (or quasi-metric) on the feature space used to quantify the similarity of patterns.

3. PATTERN REPRESENTATION, FEATURE SELECTION AND EXTRACTION

There are no theoretical guidelines that suggest the appropriate patterns and features to use in a specific situation. Indeed, the pattern generation process is often not directly controllable; the user's role in the pattern representation process is to gather facts and conjectures about the data, optionally perform feature selection and extraction, and design the subsequent elements of the clustering system. Because of the difficulties surrounding pattern representation, it is conveniently assumed that the pattern representation is available prior to clustering. Nonetheless, a careful investigation of the available features and any available transformations (even simple ones) can yield significantly improved clustering results. A good pattern representation can often yield a simple and easily understood clustering; a poor pattern representation may yield a complex clustering whose true structure is difficult or impossible to discern. Figure 3 shows a simple example. The points in this 2D feature space are arranged in a curvilinear cluster of approximately constant distance from the origin. If one chooses Cartesian coordinates to represent the patterns, many clustering algorithms would be likely to fragment the cluster into two or more clusters, since it is not compact. If, however, one uses a polar coordinate representation for the clusters, the radius coordinate exhibits tight clustering and a one-cluster solution is likely to be easily obtained.
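The following is a minimal sketch of this representation effect, assuming synthetic ring-shaped data; the data generation and the choice to compare coordinate spreads are illustrative and not part of the original text.

import numpy as np

# Illustrative ring-shaped data: radius near 5 with small noise, angles over a half circle.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, np.pi, 200)
r = 5.0 + 0.1 * rng.standard_normal(200)

# Cartesian representation: the cluster is curvilinear and not compact.
cartesian = np.column_stack([r * np.cos(theta), r * np.sin(theta)])

# Polar representation: the radius coordinate alone is tightly clustered.
polar = np.column_stack([r, theta])

print("spread of x, y:", cartesian.std(axis=0))   # large spread in both coordinates
print("spread of r, theta:", polar.std(axis=0))   # radius spread is tiny (about 0.1)

A grouping step applied to the radius column of the polar representation would readily return a single tight cluster, whereas the Cartesian representation invites fragmentation.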

A pattern can measure either a physical object (e.g., a chair) or an abstract notion (e.g., a style of writing). As noted above, patterns are represented conventionally as multidimensional vectors, where each dimension is a single feature [Duda and Hart 1973]. These features can be either quantitative or qualitative. For example, if weight and color are the two features used, then (20, black) is the representation of a black object with 20 units of weight. The features can be subdivided into the following types [Gowda and Diday 1992]:

(1) Quantitative features:
    (a) continuous values (e.g., weight);
    (b) discrete values (e.g., the number of computers);
    (c) interval values (e.g., the duration of an event).

(2) Qualitative features:
    (a) nominal or unordered (e.g., color);
    (b) ordinal (e.g., military rank or qualitative evaluations of temperature (“cool” or “hot”) or sound intensity (“quiet” or “loud”)).

Quantitative features can be measured on a ratio scale (with a meaningful reference value, such as temperature), or on nominal or ordinal scales.

One can also use structured features [Michalski and Stepp 1983] which are represented as trees, where the parent node represents a generalization of its child nodes. For example, a parent node “vehicle” may be a generalization of children labeled “cars,” “buses,” “trucks,” and “motorcycles.” Further, the node “cars” could be a generalization of cars of the type “Toyota,” “Ford,” “Benz,” etc. A generalized representation of patterns, called symbolic objects, was proposed in Diday [1988]. Symbolic objects are defined by a logical conjunction of events. These events link values and features in which the features can take one or more values and all the objects need not be defined on the same set of features.

It is often valuable to isolate only the most descriptive and discriminatory features in the input set, and utilize those features exclusively in subsequent analysis. Feature selection techniques identify a subset of the existing features for subsequent use, while feature extraction techniques compute new features from the original set. In either case, the goal is to improve classification performance and/or computational efficiency. Feature selection is a well-explored topic in statistical pattern recognition [Duda and Hart 1973]; however, in a clustering context (i.e., lacking class labels for patterns), the feature selection process is of necessity ad hoc, and might involve a trial-and-error process where various subsets of features are selected, the resulting patterns clustered, and the output evaluated using a validity index. In contrast, some of the popular feature extraction processes (e.g., principal components analysis [Fukunaga 1990]) do not depend on labeled data and can be used directly. Reduction of the number of features has an additional benefit, namely the ability to produce output that can be visually inspected by a human.
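As an illustration of label-free feature extraction, the sketch below projects patterns onto their leading principal components using only numpy; the function name, the choice of two components, and the synthetic input are assumptions made for the example, not details given in the text.

import numpy as np

def pca_extract(patterns, num_components=2):
    """Project an (n, d) pattern matrix onto its top principal components."""
    centered = patterns - patterns.mean(axis=0)          # remove the mean feature vector
    cov = np.cov(centered, rowvar=False)                  # d x d sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)                # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1][:num_components]    # largest-variance directions first
    return centered @ eigvecs[:, order]                   # new, lower-dimensional features

# Example: 100 random 5-dimensional patterns reduced to 2 features for visual inspection.
reduced = pca_extract(np.random.rand(100, 5))
print(reduced.shape)  # (100, 2)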

4. SIMILARITY MEASURES

Since similarity is fundamental to the definition of a cluster, a measure of the similarity between two patterns drawn from the same feature space is essential to most clustering procedures. Because of the variety of feature types and scales, the distance measure (or measures) must be chosen carefully. It is most common to calculate the dissimilarity between two patterns using a distance measure defined on the feature space. We will focus on the well-known distance measures used for patterns whose features are all continuous.

Figure 3. A curvilinear cluster whose points are approximately equidistant from the origin. Different pattern representations (coordinate systems) would cause clustering algorithms to yield different results for this data (see text).

The most popular metric for continuous features is the Euclidean distance

$$d_2(x_i, x_j) = \left( \sum_{k=1}^{d} (x_{i,k} - x_{j,k})^2 \right)^{1/2} = \|x_i - x_j\|_2,$$

which is a special case ($p = 2$) of the Minkowski metric

$$d_p(x_i, x_j) = \left( \sum_{k=1}^{d} |x_{i,k} - x_{j,k}|^p \right)^{1/p} = \|x_i - x_j\|_p.$$

The Euclidean distance has an intuitive appeal as it is commonly used to evaluate the proximity of objects in two or three-dimensional space. It works well when a data set has “compact” or “isolated” clusters [Mao and Jain 1996]. The drawback to direct use of the Minkowski metrics is the tendency of the largest-scaled feature to dominate the others. Solutions to this problem include normalization of the continuous features (to a common range or variance) or other weighting schemes. Linear correlation among features can also distort distance measures; this distortion can be alleviated by applying a whitening transformation to the data or by using the squared Mahalanobis distance

$$d_M(x_i, x_j) = (x_i - x_j)\, S^{-1} (x_i - x_j)^T,$$

where the patterns $x_i$ and $x_j$ are assumed to be row vectors, and $S$ is the sample covariance matrix of the patterns or the known covariance matrix of the pattern generation process; $d_M(\cdot, \cdot)$ assigns different weights to different features based on their variances and pairwise linear correlations. Here, it is implicitly assumed that class conditional densities are unimodal and characterized by multidimensional spread, i.e., that the densities are multivariate Gaussian. The regularized Mahalanobis distance was used in Mao and Jain [1996] to extract hyperellipsoidal clusters. Recently, several researchers [Huttenlocher et al. 1993; Dubuisson and Jain 1994] have used the Hausdorff distance in a point set matching context.
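A brief sketch of these distances, assuming continuous-valued patterns stored as numpy vectors; the function names and the use of a pseudo-inverse for a possibly singular covariance are illustrative choices rather than prescriptions from the text.

import numpy as np

def minkowski(x_i, x_j, p=2):
    """Minkowski distance; p=2 gives the Euclidean distance, p=1 the city-block distance."""
    return np.sum(np.abs(x_i - x_j) ** p) ** (1.0 / p)

def squared_mahalanobis(x_i, x_j, patterns):
    """Squared Mahalanobis distance using the sample covariance of `patterns`."""
    diff = x_i - x_j
    cov = np.cov(patterns, rowvar=False)
    return float(diff @ np.linalg.pinv(cov) @ diff)   # pinv guards against a singular covariance

# Example on a small synthetic pattern set.
patterns = np.random.rand(50, 3)
print(minkowski(patterns[0], patterns[1]))            # Euclidean
print(minkowski(patterns[0], patterns[1], p=1))       # city-block
print(squared_mahalanobis(patterns[0], patterns[1], patterns))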

Some clustering algorithms work on a matrix of proximity values instead of on the original pattern set. It is useful in such situations to precompute all the $n(n-1)/2$ pairwise distance values for the $n$ patterns and store them in a (symmetric) matrix.

Computation of distances between patterns with some or all features being noncontinuous is problematic, since the different types of features are not comparable and (as an extreme example) the notion of proximity is effectively binary-valued for nominal-scaled features. Nonetheless, practitioners (especially those in machine learning, where mixed-type patterns are common) have developed proximity measures for heterogeneous type patterns. A recent example is Wilson and Martinez [1997], which proposes a combination of a modified Minkowski metric for continuous features and a distance based on counts (population) for nominal attributes. A variety of other metrics have been reported in Diday and Simon [1976] and Ichino and Yaguchi [1994] for computing the similarity between patterns represented using quantitative as well as qualitative features.

Patterns can also be represented using string or tree structures [Knuth 1973]. Strings are used in syntactic clustering [Fu and Lu 1977]. Several measures of similarity between strings are described in Baeza-Yates [1992]. A good summary of similarity measures between trees is given by Zhang [1995]. A comparison of syntactic and statistical approaches for pattern recognition using several criteria was presented in Tanaka [1995] and the conclusion was that syntactic methods are inferior in every aspect. Therefore, we do not consider syntactic methods further in this paper.

There are some distance measures reported in the literature [Gowda and Krishna 1977; Jarvis and Patrick 1973] that take into account the effect of surrounding or neighboring points. These surrounding points are called context in Michalski and Stepp [1983]. The similarity between two points $x_i$ and $x_j$, given this context, is given by

$$s(x_i, x_j) = f(x_i, x_j, \mathcal{E}),$$

where $\mathcal{E}$ is the context (the set of surrounding points). One metric defined using context is the mutual neighbor distance (MND), proposed in Gowda and Krishna [1977], which is given by

$$MND(x_i, x_j) = NN(x_i, x_j) + NN(x_j, x_i),$$

where $NN(x_i, x_j)$ is the neighbor number of $x_j$ with respect to $x_i$. Figures 4 and 5 give an example. In Figure 4, the nearest neighbor of A is B, and B's nearest neighbor is A. So, $NN(A, B) = NN(B, A) = 1$ and the MND between A and B is 2. However, $NN(B, C) = 1$ but $NN(C, B) = 2$, and therefore $MND(B, C) = 3$. Figure 5 was obtained from Figure 4 by adding three new points D, E, and F. Now $MND(B, C) = 3$ (as before), but $MND(A, B) = 5$. The MND between A and B has increased by introducing additional points, even though A and B have not moved. The MND is not a metric (it does not satisfy the triangle inequality [Zhang 1995]). In spite of this, MND has been successfully applied in several clustering applications [Gowda and Diday 1992]. This observation supports the viewpoint that the dissimilarity does not need to be a metric.
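The mutual neighbor distance can be computed directly from its definition; the sketch below is an illustrative implementation that assumes Euclidean distance for the underlying neighbor ranking, which the text does not prescribe.

import numpy as np

def neighbor_number(points, i, j):
    """NN(x_i, x_j): rank of x_j among the points sorted by distance from x_i (nearest = 1)."""
    dists = np.linalg.norm(points - points[i], axis=1)
    order = np.argsort(dists)                 # order[0] is x_i itself (distance 0)
    return int(np.where(order == j)[0][0])    # position in the ranking, so the nearest neighbor is 1

def mutual_neighbor_distance(points, i, j):
    """MND(x_i, x_j) = NN(x_i, x_j) + NN(x_j, x_i)."""
    return neighbor_number(points, i, j) + neighbor_number(points, j, i)

# Example: three collinear points; the first two are mutually nearest neighbors.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
print(mutual_neighbor_distance(pts, 0, 1))  # 2
print(mutual_neighbor_distance(pts, 1, 2))  # 3 (the third point's nearest neighbor is the second,
                                            #    but the second point's nearest neighbor is the first)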

Figure 4. A and B are more similar than A and C.

Figure 5. After a change in context, B and C are more similar than B and A.

Watanabe's theorem of the ugly duckling [Watanabe 1985] states:

“Insofar as we use a finite set of predicates that are capable of distinguishing any two objects considered, the number of predicates shared by any two such objects is constant, independent of the choice of objects.”

This implies that it is possible to make any two arbitrary patterns equally similar by encoding them with a sufficiently large number of features. As a consequence, any two arbitrary patterns are equally similar, unless we use some additional domain information. For example, in the case of conceptual clustering [Michalski and Stepp 1983], the similarity between $x_i$ and $x_j$ is defined as

$$s(x_i, x_j) = f(x_i, x_j, \mathcal{C}, \mathcal{E}),$$

where $\mathcal{C}$ is a set of pre-defined concepts. This notion is illustrated with the help of Figure 6. Here, the Euclidean distance between points A and B is less than that between B and C. However, B and C can be viewed as “more similar” than A and B because B and C belong to the same concept (ellipse) and A belongs to a different concept (rectangle). The conceptual similarity measure is the most general similarity measure. We discuss several pragmatic issues associated with its use in Section 5.

Figure 6. Conceptual similarity between points.

5. CLUSTERING TECHNIQUES

Different approaches to clustering data can be described with the help of the hierarchy shown in Figure 7 (other taxonometric representations of clustering methodology are possible; ours is based on the discussion in Jain and Dubes [1988]). At the top level, there is a distinction between hierarchical and partitional approaches (hierarchical methods produce a nested series of partitions, while partitional methods produce only one).

The taxonomy shown in Figure 7 must be supplemented by a discussion of cross-cutting issues that may (in principle) affect all of the different approaches regardless of their placement in the taxonomy.

—Agglomerative vs. divisive: This aspect relates to algorithmic structure and operation. An agglomerative approach begins with each pattern in a distinct (singleton) cluster, and successively merges clusters together until a stopping criterion is satisfied. A divisive method begins with all patterns in a single cluster and performs splitting until a stopping criterion is met.

—Monothetic vs. polythetic: This aspect relates to the sequential or simultaneous use of features in the clustering process. Most algorithms are polythetic; that is, all features enter into the computation of distances between patterns, and decisions are based on those distances. A simple monothetic algorithm reported in Anderberg [1973] considers features sequentially to divide the given collection of patterns. This is illustrated in Figure 8. Here, the collection is divided into two groups using feature $x_1$; the vertical broken line V is the separating line. Each of these clusters is further divided independently using feature $x_2$, as depicted by the broken lines H1 and H2. The major problem with this algorithm is that it generates $2^d$ clusters where $d$ is the dimensionality of the patterns. For large values of $d$ ($d > 100$ is typical in information retrieval applications [Salton 1991]), the number of clusters generated by this algorithm is so large that the data set is divided into uninterestingly small and fragmented clusters.

—Hard vs. fuzzy: A hard clustering algorithm allocates each pattern to a single cluster during its operation and in its output. A fuzzy clustering method assigns degrees of membership in several clusters to each input pattern. A fuzzy clustering can be converted to a hard clustering by assigning each pattern to the cluster with the largest measure of membership.

—Deterministic vs. stochastic: This issue is most relevant to partitional approaches designed to optimize a squared error function. This optimization can be accomplished using traditional techniques or through a random search of the state space consisting of all possible labelings.

—Incremental vs. non-incremental: This issue arises when the pattern set to be clustered is large, and constraints on execution time or memory space affect the architecture of the algorithm. The early history of clustering methodology does not contain many examples of clustering algorithms designed to work with large data sets, but the advent of data mining has fostered the development of clustering algorithms that minimize the number of scans through the pattern set, reduce the number of patterns examined during execution, or reduce the size of data structures used in the algorithm's operations.

A cogent observation in Jain and Dubes [1988] is that the specification of an algorithm for clustering usually leaves considerable flexibility in implementation.

5.1 Hierarchical Clustering Algorithms

The operation of a hierarchical clustering algorithm is illustrated using the two-dimensional data set in Figure 9. This figure depicts seven patterns labeled A, B, C, D, E, F, and G in three clusters. A hierarchical algorithm yields a dendrogram representing the nested grouping of patterns and similarity levels at which groupings change. A dendrogram corresponding to the seven points in Figure 9 (obtained from the single-link algorithm [Jain and Dubes 1988]) is shown in Figure 10. The dendrogram can be broken at different levels to yield different clusterings of the data.

Figure 7. A taxonomy of clustering approaches: clustering divides into hierarchical methods (e.g., single-link and complete-link) and partitional methods (e.g., square-error methods such as k-means, graph-theoretic methods, mixture resolving such as expectation maximization, and mode seeking).

Figure 8. Monothetic partitional clustering.

Most hierarchical clustering algorithms are variants of the single-link [Sneath and Sokal 1973], complete-link [King 1967], and minimum-variance [Ward 1963; Murtagh 1984] algorithms. Of these, the single-link and complete-link algorithms are most popular. These two algorithms differ in the way they characterize the similarity between a pair of clusters. In the single-link method, the distance between two clusters is the minimum of the distances between all pairs of patterns drawn from the two clusters (one pattern from the first cluster, the other from the second). In the complete-link algorithm, the distance between two clusters is the maximum of all pairwise distances between patterns in the two clusters. In either case, two clusters are merged to form a larger cluster based on minimum distance criteria. The complete-link algorithm produces tightly bound or compact clusters [Baeza-Yates 1992]. The single-link algorithm, by contrast, suffers from a chaining effect [Nagy 1968]. It has a tendency to produce clusters that are straggly or elongated. There are two clusters in Figures 12 and 13 separated by a “bridge” of noisy patterns. The single-link algorithm produces the clusters shown in Figure 12, whereas the complete-link algorithm obtains the clustering shown in Figure 13. The clusters obtained by the complete-link algorithm are more compact than those obtained by the single-link algorithm; the cluster labeled 1 obtained using the single-link algorithm is elongated because of the noisy patterns labeled “*”. The single-link algorithm is more versatile than the complete-link algorithm, otherwise. For example, the single-link algorithm can extract the concentric clusters shown in Figure 11, but the complete-link algorithm cannot. However, from a pragmatic viewpoint, it has been observed that the complete-link algorithm produces more useful hierarchies in many applications than the single-link algorithm [Jain and Dubes 1988].

Agglomerative Single-Link Clustering Algorithm

(1) Place each pattern in its own cluster. Construct a list of interpattern distances for all distinct unordered pairs of patterns, and sort this list in ascending order.

(2) Step through the sorted list of distances, forming for each distinct dissimilarity value $d_k$ a graph on the patterns where pairs of patterns closer than $d_k$ are connected by a graph edge. If all the patterns are members of a connected graph, stop. Otherwise, repeat this step.

(3) The output of the algorithm is a nested hierarchy of graphs which can be cut at a desired dissimilarity level forming a partition (clustering) identified by simply connected components in the corresponding graph.

Figure 9. Points falling in three clusters.

Figure 10. The dendrogram obtained using the single-link algorithm.

Figure 11. Two concentric clusters.

Agglomerative Complete-Link Clustering Algorithm

(1) Place each pattern in its own cluster. Construct a list of interpattern distances for all distinct unordered pairs of patterns, and sort this list in ascending order.

(2) Step through the sorted list of distances, forming for each distinct dissimilarity value $d_k$ a graph on the patterns where pairs of patterns closer than $d_k$ are connected by a graph edge. If all the patterns are members of a completely connected graph, stop.

(3) The output of the algorithm is a nested hierarchy of graphs which can be cut at a desired dissimilarity level forming a partition (clustering) identified by completely connected components in the corresponding graph.

Hierarchical algorithms are more versatile than partitional algorithms. For example, the single-link clustering algorithm works well on data sets containing non-isotropic clusters including well-separated, chain-like, and concentric clusters, whereas a typical partitional algorithm such as the k-means algorithm works well only on data sets having isotropic clusters [Nagy 1968]. On the other hand, the time and space complexities [Day 1992] of the partitional algorithms are typically lower than those of the hierarchical algorithms. It is possible to develop hybrid algorithms [Murty and Krishna 1980] that exploit the good features of both categories.

Hierarchical Agglomerative Clustering Algorithm

(1) Compute the proximity matrix containing the distance between each pair of patterns. Treat each pattern as a cluster.

(2) Find the most similar pair of clusters using the proximity matrix. Merge these two clusters into one cluster. Update the proximity matrix to reflect this merge operation.

(3) If all patterns are in one cluster, stop. Otherwise, go to step 2.

Based on the way the proximity matrix is updated in step 2, a variety of agglomerative algorithms can be designed. Hierarchical divisive algorithms start with a single cluster of all the given objects and keep splitting the clusters based on some criterion to obtain a partition of singleton clusters.
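A minimal sketch of this proximity-matrix scheme, assuming Euclidean distances, the single-link (minimum-distance) update rule, and a stopping condition of a fixed number of remaining clusters; the function name and stopping rule are illustrative choices rather than details given above.

import numpy as np

def single_link_agglomerative(patterns, num_clusters):
    """Merge clusters until `num_clusters` remain, using the single-link update rule."""
    clusters = [[i] for i in range(len(patterns))]        # step 1: every pattern is a cluster
    # Proximity matrix of pairwise Euclidean distances.
    prox = np.linalg.norm(patterns[:, None, :] - patterns[None, :, :], axis=2)
    np.fill_diagonal(prox, np.inf)

    while len(clusters) > num_clusters:
        a, b = np.unravel_index(np.argmin(prox), prox.shape)   # step 2: closest pair of clusters
        if a > b:
            a, b = b, a
        clusters[a].extend(clusters.pop(b))                     # merge cluster b into cluster a
        # Single-link update: distance to the merged cluster is the minimum of the two old rows.
        prox[a, :] = np.minimum(prox[a, :], prox[b, :])
        prox[:, a] = prox[a, :]
        prox[a, a] = np.inf
        prox = np.delete(np.delete(prox, b, axis=0), b, axis=1)
    return clusters

# Example: two well-separated groups of 2-D points.
pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
print(single_link_agglomerative(pts, 2))   # e.g., [[0, 1, 2], [3, 4, 5]]

Replacing the minimum with a maximum in the update line gives the complete-link variant, which is one way the choice of matrix update yields different agglomerative algorithms.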

Figure 12. A single-link clustering of a pattern set containing two classes (1 and 2) connected by a chain of noisy patterns (*).

Figure 13. A complete-link clustering of a pattern set containing two classes (1 and 2) connected by a chain of noisy patterns (*).

5.2 Partitional Algorithms

A partitional clustering algorithm obtains a single partition of the data instead of a clustering structure, such as the dendrogram produced by a hierarchical technique. Partitional methods have advantages in applications involving large data sets for which the construction of a dendrogram is computationally prohibitive. A problem accompanying the use of a partitional algorithm is the choice of the number of desired output clusters. A seminal paper [Dubes 1987] provides guidance on this key design decision. The partitional techniques usually produce clusters by optimizing a criterion function defined either locally (on a subset of the patterns) or globally (defined over all of the patterns). Combinatorial search of the set of possible labelings for an optimum value of a criterion is clearly computationally prohibitive. In practice, therefore, the algorithm is typically run multiple times with different starting states, and the best configuration obtained from all of the runs is used as the output clustering.

5.2.1 Squared Error Algorithms. The most intuitive and frequently used criterion function in partitional clustering techniques is the squared error criterion, which tends to work well with isolated and compact clusters. The squared error for a clustering $\mathcal{L}$ of a pattern set $\mathcal{X}$ (containing $K$ clusters) is

$$e^2(\mathcal{X}, \mathcal{L}) = \sum_{j=1}^{K} \sum_{i=1}^{n_j} \| x_i^{(j)} - c_j \|^2,$$

where $x_i^{(j)}$ is the $i$th pattern belonging to the $j$th cluster and $c_j$ is the centroid of the $j$th cluster.

The k-means is the simplest and most commonly used algorithm employing a squared error criterion [McQueen 1967]. It starts with a random initial partition and keeps reassigning the patterns to clusters based on the similarity between the pattern and the cluster centers until a convergence criterion is met (e.g., there is no reassignment of any pattern from one cluster to another, or the squared error ceases to decrease significantly after some number of iterations). The k-means algorithm is popular because it is easy to implement, and its time complexity is $O(n)$, where $n$ is the number of patterns. A major problem with this algorithm is that it is sensitive to the selection of the initial partition and may converge to a local minimum of the criterion function value if the initial partition is not properly chosen. Figure 14 shows seven two-dimensional patterns. If we start with patterns A, B, and C as the initial means around which the three clusters are built, then we end up with the partition {{A}, {B, C}, {D, E, F, G}} shown by ellipses. The squared error criterion value is much larger for this partition than for the best partition {{A, B, C}, {D, E}, {F, G}} shown by rectangles, which yields the global minimum value of the squared error criterion function for a clustering containing three clusters. The correct three-cluster solution is obtained by choosing, for example, A, D, and F as the initial cluster means.

Squared Error Clustering Method

(1) Select an initial partition of the patterns with a fixed number of clusters and cluster centers.

(2) Assign each pattern to its closest cluster center and compute the new cluster centers as the centroids of the clusters. Repeat this step until convergence is achieved, i.e., until the cluster membership is stable.

(3) Merge and split clusters based on some heuristic information, optionally repeating step 2.

k-Means Clustering Algorithm

(1) Choose k cluster centers to coincide with k randomly-chosen patterns or k randomly defined points inside the hypervolume containing the pattern set.


(2) Assign each pattern to the closest cluster center.

(3) Recompute the cluster centers using the current cluster memberships.

(4) If a convergence criterion is not met, go to step 2. Typical convergence criteria are: no (or minimal) reassignment of patterns to new cluster centers, or minimal decrease in squared error.
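A compact sketch of these four steps, with random pattern initialization and a no-reassignment stopping rule; the iteration cap, seed handling, and function name are illustrative additions, not part of the algorithm description above.

import numpy as np

def k_means(patterns, k, max_iter=100, seed=0):
    """Basic k-means: returns (labels, centers) for an (n, d) array of patterns."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k cluster centers to coincide with k randomly chosen patterns.
    centers = patterns[rng.choice(len(patterns), size=k, replace=False)].copy()
    labels = np.full(len(patterns), -1)

    for _ in range(max_iter):
        # Step 2: assign each pattern to the closest cluster center.
        dists = np.linalg.norm(patterns[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when no pattern is reassigned.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: recompute each cluster center as the centroid of its current members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = patterns[labels == j].mean(axis=0)
    return labels, centers

# Example run on two well-separated groups of 2-D patterns.
pts = np.vstack([np.random.rand(20, 2), np.random.rand(20, 2) + 5.0])
labels, centers = k_means(pts, k=2)
print(labels, centers.round(2))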

Several variants [Anderberg 1973] of the k-means algorithm have been reported in the literature. Some of them attempt to select a good initial partition so that the algorithm is more likely to find the global minimum value.

Another variation is to permit splitting and merging of the resulting clusters. Typically, a cluster is split when its variance is above a pre-specified threshold, and two clusters are merged when the distance between their centroids is below another pre-specified threshold. Using this variant, it is possible to obtain the optimal partition starting from any arbitrary initial partition, provided proper threshold values are specified. The well-known ISODATA [Ball and Hall 1965] algorithm employs this technique of merging and splitting clusters. If ISODATA is given the “ellipse” partitioning shown in Figure 14 as an initial partitioning, it will produce the optimal three-cluster partitioning. ISODATA will first merge the clusters {A} and {B,C} into one cluster because the distance between their centroids is small and then split the cluster {D,E,F,G}, which has a large variance, into two clusters {D,E} and {F,G}.

Another variation of the k-means algorithm involves selecting a different criterion function altogether. The dynamic clustering algorithm (which permits representations other than the centroid for each cluster) was proposed in Diday [1973] and Symon [1977], and describes a dynamic clustering approach obtained by formulating the clustering problem in the framework of maximum-likelihood estimation. The regularized Mahalanobis distance was used in Mao and Jain [1996] to obtain hyperellipsoidal clusters.

5.2.2 Graph-Theoretic Clustering. The best-known graph-theoretic divisive clustering algorithm is based on construction of the minimal spanning tree (MST) of the data [Zahn 1971], and then deleting the MST edges with the largest lengths to generate clusters. Figure 15 depicts the MST obtained from nine two-dimensional points. By breaking the link labeled CD with a length of 6 units (the edge with the maximum Euclidean length), two clusters ({A, B, C} and {D, E, F, G, H, I}) are obtained. The second cluster can be further divided into two clusters by breaking the edge EF, which has a length of 4.5 units.
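A small sketch of this MST-based scheme, assuming Euclidean distances, a Prim-style tree construction, and deletion of the k-1 longest tree edges to obtain k clusters; the helper names, the union-by-relabeling step, and the example data are illustrative assumptions, not taken from the text.

import numpy as np

def mst_clusters(points, k):
    """Build an MST with Prim's algorithm, then cut its k-1 longest edges."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

    # Prim's algorithm: grow the tree from vertex 0, recording the edges used.
    in_tree = [0]
    edges = []                      # (length, u, v) for each MST edge
    best = dist[0].copy()           # cheapest connection of each vertex to the tree
    parent = np.zeros(n, dtype=int)
    best[0] = np.inf
    for _ in range(n - 1):
        v = int(np.argmin(best))
        edges.append((best[v], int(parent[v]), v))
        in_tree.append(v)
        best[v] = np.inf
        closer = dist[v] < best
        parent[closer] = v
        best = np.minimum(best, dist[v])
        best[in_tree] = np.inf

    # Delete the k-1 longest MST edges, then label connected components.
    edges.sort()
    keep = edges[: n - k]
    labels = np.arange(n)           # simple union by repeated relabeling
    for _, u, v in keep:
        labels[labels == labels[v]] = labels[u]
    return labels

# Example: nine points forming two tight groups plus one distant point.
pts = np.array([[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11],
                [11, 11], [10.5, 10.5], [30, 30]], dtype=float)
print(mst_clusters(pts, 3))   # three groups of component labels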

The hierarchical approaches are also related to graph-theoretic clustering. Single-link clusters are subgraphs of the minimum spanning tree of the data [Gower and Ross 1969] which are also the connected components [Gotlieb and Kumar 1968]. Complete-link clusters are maximal complete subgraphs, and are related to the node colorability of graphs [Backer and Hubert 1976]. The maximal complete subgraph was considered the strictest definition of a cluster in Augustson and Minker [1970] and Raghavan and Yu [1981]. A graph-oriented approach for non-hierarchical structures and overlapping clusters is presented in Ozawa [1985]. The Delaunay graph (DG) is obtained by connecting all the pairs of points that are Voronoi neighbors. The DG contains all the neighborhood information contained in the MST and the relative neighborhood graph (RNG) [Toussaint 1980].

Figure 14. The k-means algorithm is sensitive to the initial partition.

5.3 Mixture-Resolving and Mode-Seeking Algorithms

The mixture resolving approach to cluster analysis has been addressed in a number of ways. The underlying assumption is that the patterns to be clustered are drawn from one of several distributions, and the goal is to identify the parameters of each and (perhaps) their number. Most of the work in this area has assumed that the individual components of the mixture density are Gaussian, and in this case the parameters of the individual Gaussians are to be estimated by the procedure. Traditional approaches to this problem involve obtaining (iteratively) a maximum likelihood estimate of the parameter vectors of the component densities [Jain and Dubes 1988].

More recently, the Expectation Maximization (EM) algorithm (a general-purpose maximum likelihood algorithm [Dempster et al. 1977] for missing-data problems) has been applied to the problem of parameter estimation. A recent book [Mitchell 1997] provides an accessible description of the technique. In the EM framework, the parameters of the component densities are unknown, as are the mixing parameters, and these are estimated from the patterns. The EM procedure begins with an initial estimate of the parameter vector and iteratively rescores the patterns against the mixture density produced by the parameter vector. The rescored patterns are then used to update the parameter estimates. In a clustering context, the scores of the patterns (which essentially measure their likelihood of being drawn from particular components of the mixture) can be viewed as hints at the class of the pattern. Those patterns, placed (by their scores) in a particular component, would therefore be viewed as belonging to the same cluster.
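A sketch of this rescoring loop for a Gaussian mixture, simplified to spherical (isotropic) components; the fixed iteration count, the variance floor, and all names are illustrative assumptions rather than details from the text.

import numpy as np

def em_gmm(patterns, k, num_iter=50, seed=0):
    """EM for a mixture of k spherical Gaussians; returns means, variances, weights, responsibilities."""
    rng = np.random.default_rng(seed)
    n, d = patterns.shape
    means = patterns[rng.choice(n, size=k, replace=False)].copy()
    variances = np.full(k, patterns.var())
    weights = np.full(k, 1.0 / k)

    for _ in range(num_iter):
        # E-step: rescore each pattern against each component (posterior responsibilities).
        sq = ((patterns[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)   # (n, k)
        log_dens = -0.5 * (sq / variances + d * np.log(2 * np.pi * variances))
        log_post = np.log(weights) + log_dens
        log_post -= log_post.max(axis=1, keepdims=True)                      # numerical stabilization
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: update mixing weights, means, and (spherical) variances from the scores.
        nk = resp.sum(axis=0)                              # effective counts per component
        weights = nk / n
        means = (resp.T @ patterns) / nk[:, None]
        sq = ((patterns[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        variances = (resp * sq).sum(axis=0) / (d * nk) + 1e-9   # small floor for stability

    return means, variances, weights, resp

# Example: two separated blobs; each pattern's cluster is its highest-responsibility component.
pts = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 6.0])
means, variances, weights, resp = em_gmm(pts, k=2)
print(means.round(2), resp.argmax(axis=1)[:5])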

Nonparametric techniques for density-based clustering have also been developed [Jain and Dubes 1988]. Inspired by the Parzen window approach to nonparametric density estimation, the corresponding clustering procedure searches for bins with large counts in a multidimensional histogram of the input pattern set. Other approaches include the application of another partitional or hierarchical clustering algorithm using a distance measure based on a nonparametric density estimate.
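A rough sketch of the histogram-based idea follows, assuming only NumPy; the bin count and the density threshold are illustrative choices rather than values prescribed by the techniques cited above.

# Sketch of density-based clustering via a multidimensional histogram:
# bins with large counts are taken as high-density regions (cluster "cores").
import numpy as np

def dense_bins(X, bins=10, min_count=5):
    """Return the indices of histogram bins whose counts exceed min_count."""
    hist, edges = np.histogramdd(X, bins=bins)
    return np.argwhere(hist >= min_count), edges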

5.4 Nearest Neighbor Clustering

Since proximity plays a key role in our intuitive notion of a cluster, nearest-neighbor distances can serve as the basis of clustering procedures. An iterative procedure was proposed in Lu and Fu [1978]; it assigns each unlabeled pattern to the cluster of its nearest labeled neighbor pattern, provided the distance to that labeled neighbor is below a threshold. The process continues until all patterns are labeled or no additional labelings occur. The mutual neighborhood value (described earlier in the context of distance computation) can also be used to grow clusters from near neighbors.
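The sketch below illustrates this iterative nearest-neighbor labeling idea; the seeding scheme, the threshold, and the function name are assumptions made for the example, not details taken from Lu and Fu [1978].

# Illustrative nearest-neighbor clustering: an unlabeled pattern joins the
# cluster of its nearest labeled neighbor if that neighbor is within a
# threshold distance; the loop stops when no new labelings occur.
import numpy as np

def nn_clustering(X, seeds, threshold):
    """X: (n, d) patterns; seeds: {pattern index: cluster label}."""
    labels = -np.ones(len(X), dtype=int)      # -1 marks an unlabeled pattern
    for i, lab in seeds.items():
        labels[i] = lab
    changed = True
    while changed:
        changed = False
        for i in np.where(labels == -1)[0]:
            labeled = np.where(labels != -1)[0]
            d = np.linalg.norm(X[labeled] - X[i], axis=1)
            if d.min() <= threshold:
                labels[i] = labels[labeled[d.argmin()]]
                changed = True
    return labels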

Figure 15. Using the minimal spanning tree to form clusters.


5.5 Fuzzy Clustering

Traditional clustering approaches generate partitions; in a partition, each pattern belongs to one and only one cluster. Hence, the clusters in a hard clustering are disjoint. Fuzzy clustering extends this notion to associate each pattern with every cluster using a membership function [Zadeh 1965]. The output of such algorithms is a clustering, but not a partition. We give a high-level partitional fuzzy clustering algorithm below.

Fuzzy Clustering Algorithm

(1) Select an initial fuzzy partition of the N objects into K clusters by selecting the N × K membership matrix U. An element u_ij of this matrix represents the grade of membership of object x_i in cluster c_j. Typically, u_ij ∈ [0,1].

(2) Using U, find the value of a fuzzy criterion function, e.g., a weighted squared error criterion function, associated with the corresponding partition. One possible fuzzy criterion function is

E^2(\mathcal{X}, U) = \sum_{i=1}^{N} \sum_{k=1}^{K} u_{ik} \, \lVert x_i - c_k \rVert^2,

where c_k = \sum_{i=1}^{N} u_{ik} x_i is the kth fuzzy cluster center.

Reassign patterns to clusters to reduce this criterion function value and recompute U.

(3) Repeat step 2 until entries in U do not change significantly.
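As an illustration of this scheme, the following sketch implements the standard fuzzy c-means update (the most popular instance, discussed below), with a fuzzifier m > 1; the initialization, tolerance, and parameter names are assumptions made for the example, not values prescribed by the text.

# Compact fuzzy c-means sketch: alternate between recomputing weighted cluster
# centers and updating the membership matrix U until U stabilizes.
import numpy as np

def fcm(X, k, m=2.0, iters=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), k))
    U /= U.sum(axis=1, keepdims=True)                 # each row of U sums to one
    centers = None
    for _ in range(iters):
        Um = U ** m
        centers = Um.T @ X / Um.sum(axis=0)[:, None]  # weighted cluster centers
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / (d ** (2.0 / (m - 1.0)))
        U_new /= U_new.sum(axis=1, keepdims=True)     # updated memberships
        if np.abs(U_new - U).max() < tol:             # step 3: memberships stable
            U = U_new
            break
        U = U_new
    return U, centers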

In fuzzy clustering, each cluster is a fuzzy set of all the patterns. Figure 16 illustrates the idea. The rectangles enclose two "hard" clusters in the data: H_1 = {1,2,3,4,5} and H_2 = {6,7,8,9}. A fuzzy clustering algorithm might produce the two fuzzy clusters F_1 and F_2 depicted by ellipses. The patterns will have membership values in [0,1] for each cluster. For example, fuzzy cluster F_1 could be compactly described as

{(1,0.9), (2,0.8), (3,0.7), (4,0.6), (5,0.55), (6,0.2), (7,0.2), (8,0.0), (9,0.0)}

and F_2 could be described as

{(1,0.0), (2,0.0), (3,0.0), (4,0.1), (5,0.15), (6,0.4), (7,0.35), (8,1.0), (9,0.9)}.

The ordered pairs (i, μ_i) in each cluster represent the ith pattern and its membership value to the cluster, μ_i. Larger membership values indicate higher confidence in the assignment of the pattern to the cluster. A hard clustering can be obtained from a fuzzy partition by thresholding the membership value.

Figure 16. Fuzzy clusters.

Fuzzy set theory was initially applied to clustering in Ruspini [1969]. The book by Bezdek [1981] is a good source for material on fuzzy clustering. The most popular fuzzy clustering algorithm is the fuzzy c-means (FCM) algorithm. Even though it is better than the hard k-means algorithm at avoiding local minima, FCM can still converge to local minima of the squared error criterion. The design of membership functions is the most important problem in fuzzy clustering; different choices include those based on similarity decomposition and centroids of clusters. A generalization of the FCM algorithm was proposed by Bezdek [1981] through a family of objective functions. A fuzzy c-shell algorithm and an adaptive variant for detecting circular and elliptical boundaries were presented in Dave [1992].

5.6 Representation of Clusters

In applications where the number of classes or clusters in a data set must be discovered, a partition of the data set is the end product. Here, a partition gives an idea about the separability of the data points into clusters and whether it is meaningful to employ a supervised classifier that assumes a given number of classes in the data set. However, in many other applications that involve decision making, the resulting clusters have to be represented or described in a compact form to achieve data abstraction. Even though the construction of a cluster representation is an important step in decision making, it has not been examined closely by researchers. The notion of cluster representation was introduced in Duran and Odell [1974] and was subsequently studied in Diday and Simon [1976] and Michalski et al. [1981]. They suggested the following representation schemes:

(1) Represent a cluster of points by their centroid or by a set of distant points in the cluster. Figure 17 depicts these two ideas.

(2) Represent clusters using nodes in a classification tree. This is illustrated in Figure 18.

(3) Represent clusters by using conjunctive logical expressions. For example, the expression [X_1 > 3][X_2 < 2] in Figure 18 stands for the logical statement 'X_1 is greater than 3' and 'X_2 is less than 2'.

Figure 17. Representation of a cluster by points.

Use of the centroid to represent a cluster is the most popular scheme. It works well when the clusters are compact or isotropic. However, when the clusters are elongated or non-isotropic, then this scheme fails to represent them properly. In such a case, the use of a collection of boundary points in a cluster captures its shape well. The number of points used to represent a cluster should increase as the complexity of its shape increases. The two different representations illustrated in Figure 18 are equivalent. Every path in a classification tree from the root node to a leaf node corresponds to a conjunctive statement. An important limitation of the typical use of the simple conjunctive concept representations is that they can describe only rectangular or isotropic clusters in the feature space.

Data abstraction is useful in decision making because of the following:

(1) It gives a simple and intuitive description of clusters which is easy for human comprehension. In both conceptual clustering [Michalski and Stepp 1983] and symbolic clustering [Gowda and Diday 1992] this representation is obtained without using an additional step. These algorithms generate the clusters as well as their descriptions. A set of fuzzy rules can be obtained from fuzzy clusters of a data set. These rules can be used to build fuzzy classifiers and fuzzy controllers.

(2) It helps in achieving data compression that can be exploited further by a computer [Murty and Krishna 1980]. Figure 19(a) shows samples belonging to two chain-like clusters labeled 1 and 2. A partitional clustering like the k-means algorithm cannot separate these two structures properly. The single-link algorithm works well on this data, but is computationally expensive. So a hybrid approach may be used to exploit the desirable properties of both these algorithms (a sketch of this hybrid scheme is given after this list). We obtain 8 subclusters of the data using the (computationally efficient) k-means algorithm. Each of these subclusters can be represented by their centroids as shown in Figure 19(a). Now the single-link algorithm can be applied on these centroids alone to cluster them into 2 groups. The resulting groups are shown in Figure 19(b). Here, a data reduction is achieved by representing the subclusters by their centroids.

(3) It increases the efficiency of the decision making task. In a cluster-based document retrieval technique [Salton 1991], a large collection of documents is clustered and each of the clusters is represented using its centroid. In order to retrieve documents relevant to a query, the query is matched with the cluster centroids rather than with all the documents. This helps in retrieving relevant documents efficiently. Also in several applications involving large data sets, clustering is used to perform indexing, which helps in efficient decision making [Dorai and Jain 1995].
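The hybrid scheme mentioned in item (2) can be sketched as follows, assuming scikit-learn and SciPy are available; the subcluster and cluster counts simply mirror the example above and are not prescribed values.

# Hybrid clustering sketch: k-means produces many small subclusters, and the
# single-link algorithm then groups their centroids; the centroid labels are
# propagated back to the original patterns.
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

def hybrid_chain_clustering(X, n_subclusters=8, n_clusters=2):
    km = KMeans(n_clusters=n_subclusters, n_init=10, random_state=0).fit(X)
    centroids = km.cluster_centers_
    Z = linkage(centroids, method='single')            # single-link on centroids only
    centroid_labels = fcluster(Z, t=n_clusters, criterion='maxclust')
    return centroid_labels[km.labels_]                 # label each original pattern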

5.7 Artificial Neural Networks for Clustering

Artificial neural networks (ANNs) [Hertz et al. 1991] are motivated by biological neural networks. ANNs have been used extensively over the past three decades for both classification and clustering [Sethi and Jain 1991; Jain and Mao 1994]. Some of the features of the ANNs that are important in pattern clustering are:

Figure 18. Representation of clusters by a classification tree or by conjunctive statements (1: [X_1 < 3]; 2: [X_1 > 3][X_2 < 2]; 3: [X_1 > 3][X_2 > 2]).


(1) ANNs process numerical vectors and so require patterns to be represented using quantitative features only.

(2) ANNs are inherently parallel and distributed processing architectures.

(3) ANNs may learn their interconnection weights adaptively [Jain and Mao 1996; Oja 1982]. More specifically, they can act as pattern normalizers and feature selectors by appropriate selection of weights.

Competitive (or winner-take-all) neural networks [Jain and Mao 1996] are often used to cluster input data. In competitive learning, similar patterns are grouped by the network and represented by a single unit (neuron). This grouping is done automatically based on data correlations. Well-known examples of ANNs used for clustering include Kohonen's learning vector quantization (LVQ) and self-organizing map (SOM) [Kohonen 1984], and adaptive resonance theory models [Carpenter and Grossberg 1990]. The architectures of these ANNs are simple: they are single-layered. Patterns are presented at the input and are associated with the output nodes. The weights between the input nodes and the output nodes are iteratively changed (this is called learning) until a termination criterion is satisfied. Competitive learning has been found to exist in biological neural networks. However, the learning or weight update procedures are quite similar to those in some classical clustering approaches. For example, the relationship between the k-means algorithm and LVQ is addressed in Pal et al. [1993]. The learning algorithm in ART models is similar to the leader clustering algorithm [Moor 1988].

Figure 19. Data compression by clustering.

The SOM gives an intuitively appealing two-dimensional map of the multidimensional data set, and it has been successfully used for vector quantization and speech recognition [Kohonen 1984]. However, like its sequential counterpart, the SOM generates a suboptimal partition if the initial weights are not chosen properly. Further, its convergence is controlled by various parameters such as the learning rate and a neighborhood of the winning node in which learning takes place. It is possible that a particular input pattern can fire different output units at different iterations; this brings up the stability issue of learning systems. The system is said to be stable if no pattern in the training data changes its category after a finite number of learning iterations. This problem is closely associated with the problem of plasticity, which is the ability of the algorithm to adapt to new data. For stability, the learning rate should be decreased to zero as iterations progress and this affects the plasticity. The ART models are supposed to be stable and plastic [Carpenter and Grossberg 1990]. However, ART nets are order-dependent; that is, different partitions are obtained for different orders in which the data is presented to the net. Also, the size and number of clusters generated by an ART net depend on the value chosen for the vigilance threshold, which is used to decide whether a pattern is to be assigned to one of the existing clusters or start a new cluster. Further, both SOM and ART are suitable for detecting only hyperspherical clusters [Hertz et al. 1991]. A two-layer network that employs regularized Mahalanobis distance to extract hyperellipsoidal clusters was proposed in Mao and Jain [1994]. All these ANNs use a fixed number of output nodes, which limits the number of clusters that can be produced.

5.8 Evolutionary Approaches for Clustering

Evolutionary approaches, motivated by natural evolution, make use of evolutionary operators and a population of solutions to obtain the globally optimal partition of the data. Candidate solutions to the clustering problem are encoded as chromosomes. The most commonly used evolutionary operators are: selection, recombination, and mutation. Each transforms one or more input chromosomes into one or more output chromosomes. A fitness function evaluated on a chromosome determines a chromosome's likelihood of surviving into the next generation. We give below a high-level description of an evolutionary algorithm applied to clustering.

An Evolutionary Algorithm for Clustering

(1) Choose a random population of solutions. Each solution here corresponds to a valid k-partition of the data. Associate a fitness value with each solution. Typically, fitness is inversely proportional to the squared error value. A solution with a small squared error will have a larger fitness value.

(2) Use the evolutionary operators selection, recombination and mutation to generate the next population of solutions. Evaluate the fitness values of these solutions.

(3) Repeat step 2 until some termination condition is satisfied.

The best-known evolutionary techniques are genetic algorithms (GAs) [Holland 1975; Goldberg 1989], evolution strategies (ESs) [Schwefel 1981], and evolutionary programming (EP) [Fogel et al. 1965]. Out of these three approaches, GAs have been most frequently used in clustering. Typically, solutions are binary strings in GAs. In GAs, a selection operator propagates solutions from the current generation to the next generation based on their fitness. Selection employs a probabilistic scheme so that solutions with higher fitness have a higher probability of getting reproduced.

There are a variety of recombination operators in use; crossover is the most popular. Crossover takes as input a pair of chromosomes (called parents) and outputs a new pair of chromosomes (called children or offspring), as depicted in Figure 20. Figure 20 shows a single-point crossover operation, which exchanges the segments of the parents across a crossover point. For example, in Figure 20, the parents are the binary strings '10110101' and '11001110'. The segments in the two parents after the crossover point (between the fourth and fifth locations) are exchanged to produce the child chromosomes. Mutation takes as input a chromosome and outputs a chromosome by complementing the bit value at a randomly selected location in the input chromosome. For example, the string '11111110' is generated by applying the mutation operator to the second bit location in the string '10111110' (starting at the left). Both crossover and mutation are applied with some prespecified probabilities which depend on the fitness values.
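The following small sketch reproduces the single-point crossover and bit mutation just described; it is an illustration in Python, not code from any of the cited GA systems.

# Single-point crossover and bit mutation on binary string chromosomes,
# matching the worked example in the text (crossover point after the 4th bit).
import random

def crossover(parent1, parent2, point):
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

def mutate(chrom, position=None):
    pos = random.randrange(len(chrom)) if position is None else position
    flipped = '1' if chrom[pos] == '0' else '0'
    return chrom[:pos] + flipped + chrom[pos + 1:]

print(crossover('10110101', '11001110', 4))   # ('10111110', '11000101')
print(mutate('10111110', 1))                  # '11111110', as in the text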

Figure 20. Crossover operation.

GAs represent points in the search space as binary strings, and rely on the crossover operator to explore the search space. Mutation is used in GAs for the sake of completeness, that is, to make sure that no part of the search space is left unexplored. ESs and EP differ from the GAs in solution representation and type of the mutation operator used; EP does not use a recombination operator, but only selection and mutation. Each of these three approaches has been used to solve the clustering problem by viewing it as a minimization of the squared error criterion. Some of the theoretical issues such as the convergence of these approaches were studied in Fogel and Fogel [1994].

GAs perform a globalized search for solutions whereas most other clustering procedures perform a localized search. In a localized search, the solution obtained at the 'next iteration' of the procedure is in the vicinity of the current solution. In this sense, the k-means algorithm, fuzzy clustering algorithms, ANNs used for clustering, various annealing schemes (see below), and tabu search are all localized search techniques. In the case of GAs, the crossover and mutation operators can produce new solutions that are completely different from the current ones. We illustrate this fact in Figure 21. Let us assume that the scalar X is coded using a 5-bit binary representation, and let S_1 and S_2 be two points in the one-dimensional search space. The decimal values of S_1 and S_2 are 8 and 31, respectively. Their binary representations are S_1 = 01000 and S_2 = 11111. Let us apply the single-point crossover to these strings, with the crossover site falling between the second and third most significant bits as shown below.

01 | 000
11 | 111

This will produce a new pair of points or chromosomes S_3 and S_4 as shown in Figure 21. Here, S_3 = 01111 and S_4 = 11000. The corresponding decimal values are 15 and 24, respectively. Similarly, by mutating the most significant bit in the binary string 01111 (decimal 15), the binary string 11111 (decimal 31) is generated. These jumps, or gaps between points in successive generations, are much larger than those produced by other approaches.

Figure 21. GAs perform globalized search.

Perhaps the earliest paper on the use of GAs for clustering is by Raghavan and Birchand [1979], where a GA was used to minimize the squared error of a clustering. Here, each point or chromosome represents a partition of N objects into K clusters and is represented by a K-ary string of length N. For example, consider six patterns (A, B, C, D, E, and F) and the string 101001. This six-bit binary (K = 2) string corresponds to placing the six patterns into two clusters. This string represents a two-partition, where one cluster has the first, third, and sixth patterns and the second cluster has the remaining patterns. In other words, the two clusters are {A,C,F} and {B,D,E} (the six-bit binary string 010110 represents the same clustering of the six patterns). When there are K clusters, there are K! different chromosomes corresponding to each K-partition of the data. This increases the effective search space size by a factor of K!. Further, if crossover is applied on two good chromosomes, the resulting offspring may be inferior in this representation. For example, let {A,B,C} and {D,E,F} be the clusters in the optimal 2-partition of the six patterns considered above. The corresponding chromosomes are 111000 and 000111. By applying single-point crossover at the location between the third and fourth bit positions on these two strings, we get 111111 and 000000 as offspring, and both correspond to an inferior partition. These problems have motivated researchers to design better representation schemes and crossover operators.
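To make the label-string representation concrete, the sketch below decodes such a chromosome into a partition and evaluates a squared-error-based fitness; the helper names and the particular fitness transform are assumptions made for the example.

# Decoding a K-ary label string into clusters and scoring it by squared error.
import numpy as np

def decode(chromosome):
    """'101001' -> {1: [0, 2, 5], 0: [1, 3, 4]} (cluster label -> pattern indices)."""
    clusters = {}
    for idx, label in enumerate(chromosome):
        clusters.setdefault(int(label), []).append(idx)
    return clusters

def squared_error(chromosome, X):
    err = 0.0
    for members in decode(chromosome).values():
        centroid = X[members].mean(axis=0)
        err += ((X[members] - centroid) ** 2).sum()
    return err

def fitness(chromosome, X):
    return 1.0 / (1.0 + squared_error(chromosome, X))   # small error -> large fitness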

In Bhuyan et al. [1991], an improved representation scheme is proposed where an additional separator symbol is used along with the pattern labels to represent a partition. Let the separator symbol be represented by *. Then the chromosome ACF*BDE corresponds to a 2-partition {A,C,F} and {B,D,E}. Using this representation permits them to map the clustering problem into a permutation problem such as the traveling salesman problem, which can be solved by using the permutation crossover operators [Goldberg 1989]. This solution also suffers from permutation redundancy. There are 72 equivalent chromosomes (permutations) corresponding to the same partition of the data into the two clusters {A,C,F} and {B,D,E}.

More recently, Jones and Beltramo [1991] investigated the use of edge-based crossover [Whitley et al. 1989] to solve the clustering problem. Here, all patterns in a cluster are assumed to form a complete graph by connecting them with edges. Offspring are generated from the parents so that they inherit the edges from their parents. It is observed that this crossover operator takes O(K^6 + N) time for N patterns and K clusters, ruling out its applicability on practical data sets having more than 10 clusters. In a hybrid approach proposed in Babu and Murty [1993], the GA is used only to find good initial cluster centers and the k-means algorithm is applied to find the final partition. This hybrid approach performed better than the GA.

A major problem with GAs is their sensitivity to the selection of various parameters such as population size, crossover and mutation probabilities, etc. Grefenstette [1986] has studied this problem and suggested guidelines for selecting these control parameters. However, these guidelines may not yield good results on specific problems like pattern clustering. It was reported in Jones and Beltramo [1991] that hybrid genetic algorithms incorporating problem-specific heuristics are good for clustering. A similar claim is made in Davis [1991] about the applicability of GAs to other practical problems. Another issue with GAs is the selection of an appropriate representation which is low in order and short in defining length.

It is possible to view the clustering problem as an optimization problem that locates the optimal centroids of the clusters directly rather than finding an optimal partition using a GA. This view permits the use of ESs and EP, because centroids can be coded easily in both these approaches, as they support the direct representation of a solution as a real-valued vector. In Babu and Murty [1994], ESs were used on both hard and fuzzy clustering problems, and EP has been used to evolve fuzzy min-max clusters [Fogel and Simpson 1993]. It has been observed that they perform better than their classical counterparts, the k-means algorithm and the fuzzy c-means algorithm. However, all of these approaches suffer (as do GAs and ANNs) from sensitivity to control parameter selection. For each specific problem, one has to tune the parameter values to suit the application.

5.9 Search-Based Approaches

Search techniques used to obtain the optimum value of the criterion function are divided into deterministic and stochastic search techniques. Deterministic search techniques guarantee an optimal partition by performing exhaustive enumeration. On the other hand, the stochastic search techniques generate a near-optimal partition reasonably quickly, and guarantee convergence to an optimal partition asymptotically. Among the techniques considered so far, evolutionary approaches are stochastic and the remainder are deterministic. Other deterministic approaches to clustering include the branch-and-bound technique adopted in Koontz et al. [1975] and Cheng [1995] for generating optimal partitions. This approach generates the optimal partition of the data at the cost of excessive computational requirements. In Rose et al. [1993], a deterministic annealing approach was proposed for clustering. This approach employs an annealing technique in which the error surface is smoothed, but convergence to the global optimum is not guaranteed. The use of deterministic annealing in proximity-mode clustering (where the patterns are specified in terms of pairwise proximities rather than multidimensional points) was explored in Hofmann and Buhmann [1997]; later work applied the deterministic annealing approach to texture segmentation [Hofmann and Buhmann 1998].

The deterministic approaches are typically greedy descent approaches, whereas the stochastic approaches permit perturbations to the solutions in non-locally optimal directions also with nonzero probabilities. The stochastic search techniques are either sequential or parallel, while evolutionary approaches are inherently parallel. The simulated annealing approach (SA) [Kirkpatrick et al. 1983] is a sequential stochastic search technique, whose applicability to clustering is discussed in Klein and Dubes [1989]. Simulated annealing procedures are designed to avoid (or recover from) solutions which correspond to local optima of the objective functions. This is accomplished by accepting with some probability a new solution of lower quality (as measured by the criterion function) for the next iteration. The probability of acceptance is governed by a critical parameter called the temperature (by analogy with annealing in metals), which is typically specified in terms of a starting (first iteration) and final temperature value. Selim and Al-Sultan [1991] studied the effects of control parameters on the performance of the algorithm, and Baeza-Yates [1992] used SA to obtain a near-optimal partition of the data. SA is statistically guaranteed to find the global optimal solution [Aarts and Korst 1989]. A high-level outline of an SA-based algorithm for clustering is given below.

Clustering Based on Simulated Annealing

(1) Randomly select an initial partition P_0 and compute its squared error value, E_{P_0}. Select values for the control parameters, initial and final temperatures T_0 and T_f.

(2) Select a neighbor P_1 of P_0 and compute its squared error value, E_{P_1}. If E_{P_1} is larger than E_{P_0}, then assign P_1 to P_0 with a temperature-dependent probability. Else assign P_1 to P_0. Repeat this step for a fixed number of iterations.

(3) Reduce the value of T_0, i.e., T_0 = cT_0, where c is a predetermined constant. If T_0 is greater than T_f, then go to step 2. Else stop.
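A minimal sketch of this outline is given below; the neighbor-generation move, the cooling constant c, and the temperature settings are illustrative assumptions rather than values recommended by the cited studies.

# Simulated annealing for clustering: a neighbor of a partition is produced by
# moving one randomly chosen pattern to a random cluster; worse partitions are
# accepted with a temperature-dependent probability.
import math
import random
import numpy as np

def squared_error(labels, X, k):
    return sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
               for j in range(k) if np.any(labels == j))

def sa_clustering(X, k, T0=100.0, Tf=0.1, c=0.9, moves_per_T=100, seed=0):
    rng = random.Random(seed)
    labels = np.array([rng.randrange(k) for _ in range(len(X))])
    E = squared_error(labels, X, k)
    while T0 > Tf:
        for _ in range(moves_per_T):
            cand = labels.copy()
            cand[rng.randrange(len(X))] = rng.randrange(k)   # neighboring partition
            E_cand = squared_error(cand, X, k)
            # Accept improvements always, worse solutions with prob. exp(-dE/T).
            if E_cand <= E or rng.random() < math.exp(-(E_cand - E) / T0):
                labels, E = cand, E_cand
        T0 *= c                                              # cool the temperature
    return labels, E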

The SA algorithm can be slow in reaching the optimal solution, because optimal results require the temperature to be decreased very slowly from iteration to iteration.

Tabu search [Glover 1986], like SA, is a method designed to cross boundaries of feasibility or local optimality and to systematically impose and release constraints to permit exploration of otherwise forbidden regions. Tabu search was used to solve the clustering problem in Al-Sultan [1995].


5.10 A Comparison of Techniques

In this section we have examined various deterministic and stochastic search techniques to approach the clustering problem as an optimization problem. A majority of these methods use the squared error criterion function. Hence, the partitions generated by these approaches are not as versatile as those generated by hierarchical algorithms. The clusters generated are typically hyperspherical in shape. Evolutionary approaches are globalized search techniques, whereas the rest of the approaches are localized search techniques. ANNs and GAs are inherently parallel, so they can be implemented using parallel hardware to improve their speed. Evolutionary approaches are population-based; that is, they search using more than one solution at a time, and the rest are based on using a single solution at a time. ANNs, GAs, SA, and Tabu search (TS) are all sensitive to the selection of various learning/control parameters. In theory, all four of these methods are weak methods [Rich 1983] in that they do not use explicit domain knowledge. An important feature of the evolutionary approaches is that they can find the optimal solution even when the criterion function is discontinuous.

An empirical study of the performance of the following heuristics for clustering was presented in Mishra and Raghavan [1994]; SA, GA, TS, randomized branch-and-bound (RBA) [Mishra and Raghavan 1994], and hybrid search (HS) strategies [Ismail and Kamel 1989] were evaluated. The conclusion was that GA performs well in the case of one-dimensional data, while its performance on high dimensional data sets is not impressive. The performance of SA is not attractive because it is very slow. RBA and TS performed best. HS is good for high dimensional data. However, none of the methods was found to be superior to others by a significant margin. An empirical study of k-means, SA, TS, and GA was presented in Al-Sultan and Khan [1996]. TS, GA and SA were judged comparable in terms of solution quality, and all were better than k-means. However, the k-means method is the most efficient in terms of execution time; other schemes took more time (by a factor of 500 to 2500) to partition a data set of size 60 into 5 clusters. Further, GA encountered the best solution faster than TS and SA; SA took more time than TS to encounter the best solution. However, GA took the maximum time for convergence, that is, to obtain a population of only the best solutions, followed by TS and SA. An important observation is that in both Mishra and Raghavan [1994] and Al-Sultan and Khan [1996] the sizes of the data sets considered are small; that is, fewer than 200 patterns.

A two-layer network was employed in Mao and Jain [1996], with the first layer including a number of principal component analysis subnets, and the second layer using a competitive net. This network performs partitional clustering using the regularized Mahalanobis distance. This net was trained using a set of 1000 randomly selected pixels from a large image and then used to classify every pixel in the image. Babu et al. [1997] proposed a stochastic connectionist approach (SCA) and compared its performance on standard data sets with both the SA and k-means algorithms. It was observed that SCA is superior to both SA and k-means in terms of solution quality. Evolutionary approaches are good only when the data size is less than 1000 and for low dimensional data.

In summary, only the k-means algorithm and its ANN equivalent, the Kohonen net [Mao and Jain 1996], have been applied on large data sets; other approaches have been tested, typically, on small data sets. This is because obtaining suitable learning/control parameters for ANNs, GAs, TS, and SA is difficult and their execution times are very high for large data sets. However, it has been shown [Selim and Ismail 1984] that the k-means method converges to a locally optimal solution. This behavior is linked with the initial seed selection in the k-means algorithm. So if a good initial partition can be obtained quickly using any of the other techniques, then k-means would work well even on problems with large data sets. Even though various methods discussed in this section are comparatively weak, it was revealed through experimental studies that combining domain knowledge would improve their performance. For example, ANNs work better in classifying images represented using extracted features than with raw images, and hybrid classifiers work better than ANNs [Mohiuddin and Mao 1994]. Similarly, using domain knowledge to hybridize a GA improves its performance [Jones and Beltramo 1991]. So it may be useful in general to use domain knowledge along with approaches like GA, SA, ANN, and TS. However, these approaches (specifically, the criterion functions used in them) have a tendency to generate a partition of hyperspherical clusters, and this could be a limitation. For example, in cluster-based document retrieval, it was observed that the hierarchical algorithms performed better than the partitional algorithms [Rasmussen 1992].

5.11 Incorporating Domain Constraints in Clustering

As a task, clustering is subjective in nature. The same data set may need to be partitioned differently for different purposes. For example, consider a whale, an elephant, and a tuna fish [Watanabe 1985]. Whales and elephants form a cluster of mammals. However, if the user is interested in partitioning them based on the concept of living in water, then whale and tuna fish are clustered together. Typically, this subjectivity is incorporated into the clustering criterion by incorporating domain knowledge in one or more phases of clustering.

Every clustering algorithm uses some type of knowledge either implicitly or explicitly. Implicit knowledge plays a role in (1) selecting a pattern representation scheme (e.g., using one's prior experience to select and encode features), (2) choosing a similarity measure (e.g., using the Mahalanobis distance instead of the Euclidean distance to obtain hyperellipsoidal clusters), and (3) selecting a grouping scheme (e.g., specifying the k-means algorithm when it is known that clusters are hyperspherical). Domain knowledge is used implicitly in ANNs, GAs, TS, and SA to select the control/learning parameter values that affect the performance of these algorithms.

It is also possible to use explicitly available domain knowledge to constrain or guide the clustering process. Such specialized clustering algorithms have been used in several applications. Domain concepts can play several roles in the clustering process, and a variety of choices are available to the practitioner. At one extreme, the available domain concepts might easily serve as an additional feature (or several), and the remainder of the procedure might be otherwise unaffected. At the other extreme, domain concepts might be used to confirm or veto a decision arrived at independently by a traditional clustering algorithm, or used to affect the computation of distance in a clustering algorithm employing proximity. The incorporation of domain knowledge into clustering consists mainly of ad hoc approaches with little in common; accordingly, our discussion of the idea will consist mainly of motivational material and a brief survey of past work. Machine learning research and pattern recognition research intersect in this topical area, and the interested reader is referred to the prominent journals in machine learning (e.g., Machine Learning, J. of AI Research, or Artificial Intelligence) for a fuller treatment of this topic.


As documented in Cheng and Fu [1985], rules in an expert system may be clustered to reduce the size of the knowledge base. This modification of clustering was also explored in the domains of universities, congressional voting records, and terrorist events by Lebowitz [1987].

5.11.1 Similarity Computation. Conceptual knowledge was used explicitly in the similarity computation phase in Michalski and Stepp [1983]. It was assumed that the pattern representations were available and the dynamic clustering algorithm [Diday 1973] was used to group patterns. The clusters formed were described using conjunctive statements in predicate logic. It was stated in Stepp and Michalski [1986] and Michalski and Stepp [1983] that the groupings obtained by the conceptual clustering are superior to those obtained by the numerical methods for clustering. A critical analysis of that work appears in Dale [1985], and it was observed that monothetic divisive clustering algorithms generate clusters that can be described by conjunctive statements. For example, consider Figure 8. Four clusters in this figure, obtained using a monothetic algorithm, can be described by using conjunctive concepts as shown below:

Cluster 1: [X ≤ a] ∧ [Y ≤ b]

Cluster 2: [X ≤ a] ∧ [Y > b]

Cluster 3: [X > a] ∧ [Y > c]

Cluster 4: [X > a] ∧ [Y ≤ c]

where ∧ is the Boolean conjunction ('and') operator, and a, b, and c are constants.

5.11.2 Pattern Representation. It was shown in Srivastava and Murty [1990] that by using knowledge in the pattern representation phase, as is implicitly done in numerical taxonomy approaches, it is possible to obtain the same partitions as those generated by conceptual clustering. In this sense, conceptual clustering and numerical taxonomy are not diametrically opposite, but are equivalent. In the case of conceptual clustering, domain knowledge is explicitly used in interpattern similarity computation, whereas in numerical taxonomy it is implicitly assumed that pattern representations are obtained using the domain knowledge.

5.11.3 Cluster Descriptions. Typically, in knowledge-based clustering, both the clusters and their descriptions or characterizations are generated [Fisher and Langley 1985]. There are some exceptions, for instance, Gowda and Diday [1992], where only clustering is performed and no descriptions are generated explicitly. In conceptual clustering, a cluster of objects is described by a conjunctive logical expression [Michalski and Stepp 1983]. Even though a conjunctive statement is one of the most common descriptive forms used by humans, it is a limited form. In Shekar et al. [1987], functional knowledge of objects was used to generate more intuitively appealing cluster descriptions that employ the Boolean implication operator. A system that represents clusters probabilistically was described in Fisher [1987]; these descriptions are more general than conjunctive concepts, and are well-suited to hierarchical classification domains (e.g., the animal species hierarchy). A conceptual clustering system in which clustering is done first is described in Fisher and Langley [1985]. These clusters are then described using probabilities. A similar scheme was described in Murty and Jain [1995], but the descriptions are logical expressions that employ both conjunction and disjunction.

An important characteristic of conceptual clustering is that it is possible to group objects represented by both qualitative and quantitative features if the clustering leads to a conjunctive concept. For example, the concept cricket ball might be represented as


(color = red) ∧ (shape = sphere) ∧ (make = leather) ∧ (radius = 1.4 inches),

where radius is a quantitative feature and the rest are all qualitative features. This description is used to describe a cluster of cricket balls. In Stepp and Michalski [1986], a graph (the goal dependency network) was used to group structured objects. In Shekar et al. [1987], functional knowledge was used to group man-made objects. Functional knowledge was represented using and/or trees [Rich 1983]. For example, the function cooking shown in Figure 22 can be decomposed into functions like holding and heating the material in a liquid medium. Each man-made object has a primary function for which it is produced. Further, based on its features, it may serve additional functions. For example, a book is meant for reading, but if it is heavy then it can also be used as a paper weight. In Sutton et al. [1993], object functions were used to construct generic recognition systems.

Figure 22. Functional knowledge.

5.11.4 Pragmatic Issues. Any implementation of a system that explicitly incorporates domain concepts into a clustering technique has to address the following important pragmatic issues:

(1) Representation, availability and completeness of domain concepts.

(2) Construction of inferences using the knowledge.

(3) Accommodation of changing or dynamic knowledge.

In some domains, complete knowledge is available explicitly. For example, the ACM Computing Reviews classification tree used in Murty and Jain [1995] is complete and is explicitly available for use. In several domains, knowledge is incomplete and is not available explicitly. Typically, machine learning techniques are used to automatically extract knowledge, which is a difficult and challenging problem. The most prominently used learning method is "learning from examples" [Quinlan 1990]. This is an inductive learning scheme used to acquire knowledge from examples of each of the classes in different domains. Even if the knowledge is available explicitly, it is difficult to find out whether it is complete and sound. Further, it is extremely difficult to verify soundness and completeness of knowledge extracted from practical data sets, because such knowledge cannot be represented in propositional logic. It is possible that both the data and knowledge keep changing with time. For example, in a library, new books might get added and some old books might be deleted from the collection with time. Also, the classification system (knowledge) employed by the library is updated periodically.

A major problem with knowledge-based clustering is that it has not been applied to large data sets or in domains with large knowledge bases. Typically, the number of objects grouped was less than 1000, and the number of rules used as a part of the knowledge was less than 100. The most difficult problem is to use a very large knowledge base for clustering objects in several practical problems including data mining, image segmentation, and document retrieval.

5.12 Clustering Large Data Sets

There are several applications where it is necessary to cluster a large collection of patterns. The definition of 'large' has varied (and will continue to do so) with changes in technology (e.g., memory and processing time). In the 1960s, 'large' meant several thousand patterns [Ross 1968]; now, there are applications where millions of patterns of high dimensionality have to be clustered. For example, to segment an image of size 500 × 500 pixels, the number of pixels to be clustered is 250,000. In document retrieval and information filtering, millions of patterns with a dimensionality of more than 100 have to be clustered to achieve data abstraction. A majority of the approaches and algorithms proposed in the literature cannot handle such large data sets. Approaches based on genetic algorithms, tabu search and simulated annealing are optimization techniques and are restricted to reasonably small data sets. Implementations of conceptual clustering optimize some criterion functions and are typically computationally expensive.

The convergent k-means algorithm and its ANN equivalent, the Kohonen net, have been used to cluster large data sets [Mao and Jain 1996]. The reasons behind the popularity of the k-means algorithm are:

(1) Its time complexity is O(nkl), where n is the number of patterns, k is the number of clusters, and l is the number of iterations taken by the algorithm to converge. Typically, k and l are fixed in advance and so the algorithm has linear time complexity in the size of the data set [Day 1992].

(2) Its space complexity is O(k + n). It requires additional space to store the data matrix. It is possible to store the data matrix in a secondary memory and access each pattern based on need. However, this scheme requires a huge access time because of the iterative nature of the algorithm, and as a consequence processing time increases enormously.

(3) It is order-independent; for a given initial seed set of cluster centers, it generates the same partition of the data irrespective of the order in which the patterns are presented to the algorithm.

However, the k-means algorithm is sensitive to initial seed selection and even in the best case, it can produce only hyperspherical clusters.

Hierarchical algorithms are more versatile. But they have the following disadvantages:

(1) The time complexity of hierarchical agglomerative algorithms is O(n² log n) [Kurita 1991]. It is possible to obtain single-link clusters using an MST of the data, which can be constructed in O(n log² n) time for two-dimensional data [Choudhury and Murty 1990].

(2) The space complexity of agglomerative algorithms is O(n²). This is because a similarity matrix of size n × n has to be stored. To cluster every pixel in a 100 × 100 image, approximately 200 megabytes of storage would be required (assuming single-precision storage of similarities). It is possible to compute the entries of this matrix based on need instead of storing them (this would increase the algorithm's time complexity [Anderberg 1973]).

Table I lists the time and space complexities of several well-known algorithms. Here, n is the number of patterns to be clustered, k is the number of clusters, and l is the number of iterations.

Table I. Complexity of Clustering Algorithms

Clustering Algorithm        Time Complexity     Space Complexity
leader                      O(kn)               O(k)
k-means                     O(nkl)              O(k)
ISODATA                     O(nkl)              O(k)
shortest spanning path      O(n²)               O(n)
single-link                 O(n² log n)         O(n²)
complete-link               O(n² log n)         O(n²)


A possible solution to the problem of clustering large data sets while only marginally sacrificing the versatility of clusters is to implement more efficient variants of clustering algorithms. A hybrid approach was used in Ross [1968], where a set of reference points is chosen as in the k-means algorithm, and each of the remaining data points is assigned to one or more reference points or clusters. Minimal spanning trees (MST) are obtained for each group of points separately. These MSTs are merged to form an approximate global MST. This approach computes similarities between only a fraction of all possible pairs of points. It was shown that the number of similarities computed for 10,000 patterns using this approach is the same as the total number of pairs of points in a collection of 2,000 points. Bentley and Friedman [1978] contains an algorithm that can compute an approximate MST in O(n log n) time. A scheme to generate an approximate dendrogram incrementally in O(n log n) time was presented in Zupan [1982], while Venkateswarlu and Raju [1992] proposed an algorithm to speed up the ISODATA clustering algorithm. A study of the approximate single-linkage cluster analysis of large data sets was reported in Eddy et al. [1994]. In that work, an approximate MST was used to form single-link clusters of a data set of size 40,000.

The emerging discipline of data mining (discussed as an application in Section 6) has spurred the development of new algorithms for clustering large data sets. Two algorithms of note are the CLARANS algorithm developed by Ng and Han [1994] and the BIRCH algorithm proposed by Zhang et al. [1996]. CLARANS (Clustering Large Applications based on RANdom Search) identifies candidate cluster centroids through analysis of repeated random samples from the original data. Because of the use of random sampling, the time complexity is O(n) for a pattern set of n elements. The BIRCH algorithm (Balanced Iterative Reducing and Clustering) stores summary information about candidate clusters in a dynamic tree data structure. This tree hierarchically organizes the clusterings represented at the leaf nodes. The tree can be rebuilt when a threshold specifying cluster size is updated manually, or when memory constraints force a change in this threshold. This algorithm, like CLARANS, has a time complexity linear in the number of patterns.
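As a hedged example of this style of algorithm, the sketch below clusters a large synthetic pattern set with the BIRCH implementation available in scikit-learn; the threshold and branching factor shown are illustrative settings that control the size of the in-memory summary tree, not values taken from Zhang et al. [1996].

# Clustering a large pattern set with BIRCH: a compact tree of cluster
# summaries is built in one pass, then refined into the final clusters.
import numpy as np
from sklearn.cluster import Birch

X = np.random.default_rng(0).random((100000, 2))   # large synthetic pattern set
birch = Birch(threshold=0.05, branching_factor=50, n_clusters=10)
labels = birch.fit_predict(X)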

The algorithms discussed above work on large data sets, where it is possible to accommodate the entire pattern set in the main memory. However, there are applications where the entire data set cannot be stored in the main memory because of its size. There are currently three possible approaches to solve this problem.

(1) The pattern set can be stored in a secondary memory and subsets of this data clustered independently, followed by a merging step to yield a clustering of the entire pattern set. We call this approach the divide and conquer approach.

(2) An incremental clustering algorithm can be employed. Here, the entire data matrix is stored in a secondary memory and data items are transferred to the main memory one at a time for clustering. Only the cluster representations are stored in the main memory to alleviate the space limitations.

(3) A parallel implementation of a clustering algorithm may be used. We discuss these approaches in the next three subsections.

5.12.1 Divide and Conquer Approach. Here, we store the entire pattern matrix of size n × d in a secondary storage space (e.g., a disk file). We divide this data into p blocks, where an optimum value of p can be chosen based on the clustering algorithm used [Murty and Krishna 1980]. Let us assume that we have n/p patterns in each of the blocks. We transfer each of these blocks to the main memory and cluster it into k clusters using a standard algorithm. One or more representative samples from each of these clusters are stored separately; we have pk of these representative patterns if we choose one representative per cluster. These pk representatives are further clustered into k clusters and the cluster labels of these representative patterns are used to relabel the original pattern matrix. We depict this two-level algorithm in Figure 23. It is possible to extend this algorithm to any number of levels; more levels are required if the data set is very large and the main memory size is very small [Murty and Krishna 1980]. If the single-link algorithm is used to obtain 5 clusters, then there is a substantial savings in the number of computations as shown in Table II for optimally chosen p when the number of clusters is fixed at 5. However, this algorithm works well only when the points in each block are reasonably homogeneous, which is often satisfied by image data.

A two-level strategy for clustering a data set containing 2,000 patterns was described in Stahl [1986]. In the first level, the data set is loosely clustered into a large number of clusters using the leader algorithm. Representatives from these clusters, one per cluster, are the input to the second level clustering, which is obtained using Ward's hierarchical method.

5.12.2 Incremental Clustering. Incremental clustering is based on the assumption that it is possible to consider patterns one at a time and assign them to existing clusters. Here, a new data item is assigned to a cluster without affecting the existing clusters significantly. A high level description of a typical incremental clustering algorithm is given below.

An Incremental Clustering Algorithm

(1) Assign the first data item to a cluster.

(2) Consider the next data item. Either assign this item to one of the existing clusters or assign it to a new cluster. This assignment is done based on some criterion, e.g. the distance between the new item and the existing cluster centroids.

(3) Repeat step 2 till all the data items are clustered.
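A leader-style instance of this outline is sketched below; maintaining a running centroid per cluster and the particular threshold test are illustrative choices, not details mandated by the algorithms surveyed next.

# Incremental (leader-style) clustering in a single pass: each new item joins
# the nearest existing cluster if it is within a distance threshold, otherwise
# it starts a new cluster; only centroids and counts are kept in memory.
import numpy as np

def leader_clustering(stream, threshold):
    centroids, counts, labels = [], [], []
    for item in stream:
        x = np.asarray(item, dtype=float)
        if centroids:
            d = np.linalg.norm(np.vstack(centroids) - x, axis=1)
            j = int(d.argmin())
        if not centroids or d[j] > threshold:
            centroids.append(x.copy())                 # the item starts a new cluster
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            counts[j] += 1
            centroids[j] += (x - centroids[j]) / counts[j]   # running centroid update
            labels.append(j)
    return np.vstack(centroids), labels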

The major advantage with the incremental clustering algorithms is that it is not necessary to store the entire pattern matrix in the memory. So, the space requirements of incremental algorithms are very small. Typically, they are noniterative, so their time requirements are also small. There are several incremental clustering algorithms:

(1) The leader clustering algorithm [Hartigan 1975] is the simplest in terms of time complexity, which is O(nk). It has gained popularity because of its neural network implementation, the ART network [Carpenter and Grossberg 1990]. It is very easy to implement as it requires only O(k) space.

Figure 23. Divide and conquer approach to clustering.

Table II. Number of Distance Computations (n) for the Single-Link Clustering Algorithm and a Two-Level Divide and Conquer Algorithm

n          Single-link     p     Two-level
100        4,950           1     200
500        124,750         2     10,750
1,000      499,500         4     31,500
10,000     49,995,000      10    1,013,750


(2) The shortest spanning path (SSP) algorithm [Slagle et al. 1975] was originally proposed for data reorganization and was successfully used in automatic auditing of records [Lee et al. 1978]. Here, the SSP algorithm was used to cluster 2000 patterns using 18 features. These clusters are used to estimate missing feature values in data items and to identify erroneous feature values.

(3) The cobweb system [Fisher 1987] is an incremental conceptual clustering algorithm. It has been successfully used in engineering applications [Fisher et al. 1993].

(4) An incremental clustering algorithm for dynamic information processing was presented in Can [1993]. The motivation behind this work is that, in dynamic databases, items might get added and deleted over time. These changes should be reflected in the partition generated without significantly affecting the current clusters. This algorithm was used to cluster incrementally an INSPEC database of 12,684 documents corresponding to computer science and electrical engineering.

Order-independence is an important property of clustering algorithms. An algorithm is order-independent if it generates the same partition for any order in which the data is presented. Otherwise, it is order-dependent. Most of the incremental algorithms presented above are order-dependent. We illustrate this order-dependent property in Figure 24, where there are 6 two-dimensional objects labeled 1 to 6. If we present these patterns to the leader algorithm in the order 2,1,3,5,4,6 then the two clusters obtained are shown by ellipses. If the order is 1,2,6,4,5,3, then we get a two-partition as shown by the triangles. The SSP algorithm, cobweb, and the algorithm in Can [1993] are all order-dependent.

Figure 24. The leader algorithm is order-dependent.

5.12.3 Parallel Implementation. Recent work [Judd et al. 1996] demonstrates that a combination of algorithmic enhancements to a clustering algorithm and distribution of the computations over a network of workstations can allow an entire 512 × 512 image to be clustered in a few minutes. Depending on the clustering algorithm in use, parallelization of the code and replication of data for efficiency may yield large benefits. However, a global shared data structure, namely the cluster membership table, remains and must be managed centrally or replicated and synchronized periodically. The presence or absence of robust, efficient parallel clustering techniques will determine the success or failure of cluster analysis in large-scale data mining applications in the future.

6. APPLICATIONS

Clustering algorithms have been used in a large variety of applications [Jain and Dubes 1988; Rasmussen 1992; Oehler and Gray 1995; Fisher et al. 1993]. In this section, we describe several applications where clustering has been employed as an essential step. These areas are: (1) image segmentation, (2) object and character recognition, (3) document retrieval, and (4) data mining.

6.1 Image Segmentation Using Clustering

Image segmentation is a fundamental component in many computer vision applications, and can be addressed as a clustering problem [Rosenfeld and Kak 1982]. The segmentation of the image(s) presented to an image analysis system is critically dependent on the scene to be sensed, the imaging geometry, configuration, and sensor used to transduce the scene into a digital image, and ultimately the desired output (goal) of the system.

The applicability of clustering methodology to the image segmentation problem was recognized over three decades ago, and the paradigms underlying the initial pioneering efforts are still in use today. A recurring theme is to define feature vectors at every image location (pixel) composed of both functions of image intensity and functions of the pixel location itself. This basic idea, depicted in Figure 25, has been successfully used for intensity images (with or without texture), range (depth) images, and multispectral images.

6.1.1 Segmentation. An image segmentation is typically defined as an exhaustive partitioning of an input image into regions, each of which is considered to be homogeneous with respect to some image property of interest (e.g., intensity, color, or texture) [Jain et al. 1995]. If

$\{x_{ij},\ i = 1 \ldots N_r,\ j = 1 \ldots N_c\}$

is the input image with $N_r$ rows and $N_c$ columns and measurement value $x_{ij}$ at pixel $(i, j)$, then the segmentation can be expressed as $\{S_1, \ldots, S_k\}$, with the $l$th segment

$S_l = \{(i_{l1}, j_{l1}), \ldots, (i_{lN_l}, j_{lN_l})\}$

consisting of a connected subset of the pixel coordinates. No two segments share any pixel locations ($S_i \cap S_j = \emptyset$ for all $i \neq j$), and the union of all segments covers the entire image ($\cup_{i=1}^{k} S_i = \{1 \ldots N_r\} \times \{1 \ldots N_c\}$). Jain and Dubes [1988], after Fu and Mui [1981], identified three techniques for producing segmentations from input imagery: region-based, edge-based, or cluster-based.

Consider the use of simple gray level thresholding to segment a high-contrast intensity image. Figure 26(a) shows a grayscale image of a textbook's bar code scanned on a flatbed scanner. Part b shows the results of a simple thresholding operation designed to separate the dark and light regions in the bar code area. Binarization steps like this are often performed in character recognition systems. Thresholding in effect 'clusters' the image pixels into two groups based on the one-dimensional intensity measurement [Rosenfeld 1969; Dunn et al. 1974].

Figure 25. Feature representation for clustering. Image measurements and positions are transformed to features. Clusters in feature space correspond to image segments.


A postprocessing step separates the classes into connected regions. While simple gray level thresholding is adequate in some carefully controlled image acquisition environments, and much research has been devoted to appropriate methods for thresholding [Weszka 1978; Trier and Jain 1995], complex images require more elaborate segmentation techniques.
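To make the connection between thresholding and one-dimensional clustering explicit, the sketch below (an illustration, not the procedure used for Figure 26) binarizes a grayscale image by running a two-class k-means on the pixel intensities; the midpoint between the two cluster means then acts as the threshold, and the synthetic test image is a made-up stand-in.

import numpy as np

def two_means_threshold(gray, iters=20):
    """Binarize a grayscale image by 1-D 2-means clustering of pixel
    intensities; the decision boundary acts as the threshold."""
    vals = gray.astype(float).ravel()
    c_dark, c_light = vals.min(), vals.max()      # initial cluster centers
    for _ in range(iters):
        t = (c_dark + c_light) / 2.0              # midpoint = current threshold
        dark, light = vals[vals <= t], vals[vals > t]
        if dark.size and light.size:
            c_dark, c_light = dark.mean(), light.mean()
    return gray > (c_dark + c_light) / 2.0        # boolean "light pixel" mask

# Illustrative use on a synthetic 8-bit image (values are made up).
img = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)
mask = two_means_threshold(img)
print(mask.mean())   # fraction of pixels assigned to the light cluster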

Many segmenters use measurements which are both spectral (e.g., the multispectral scanner used in remote sensing) and spatial (based on the pixel's location in the image plane). The measurement at each pixel hence corresponds directly to our concept of a pattern.

6.1.2 Image Segmentation Via Clustering. The application of local feature clustering to segment gray-scale images was documented in Schachter et al. [1979]. This paper emphasized the appropriate selection of features at each pixel rather than the clustering methodology, and proposed the use of image plane coordinates (spatial information) as additional features to be employed in clustering-based segmentation. The goal of clustering was to obtain a sequence of hyperellipsoidal clusters starting with cluster centers positioned at maximum density locations in the pattern space, and growing clusters about these centers until a χ² test for goodness of fit was violated. A variety of features were discussed and applied to both grayscale and color imagery.

Figure 26. Binarization via thresholding. (a): Original grayscale image. (b): Gray-level histogram. (c): Results of thresholding.



An agglomerative clustering algorithm was applied in Silverman and Cooper [1988] to the problem of unsupervised learning of clusters of coefficient vectors for two image models that correspond to image segments. The first image model is polynomial for the observed image measurements; the assumption here is that the image is a collection of several adjoining graph surfaces, each a polynomial function of the image plane coordinates, which are sampled on the raster grid to produce the observed image. The algorithm proceeds by obtaining vectors of coefficients of least-squares fits to the data in M disjoint image windows. An agglomerative clustering algorithm merges (at each step) the two clusters that have a minimum global between-cluster Mahalanobis distance. The same framework was applied to the segmentation of textured images, but for such images the polynomial model was inappropriate, and a parameterized Markov Random Field model was assumed instead.

Wu and Leahy [1993] describe the application of the principles of network flow to unsupervised classification, yielding a novel hierarchical algorithm for clustering. In essence, the technique views the unlabeled patterns as nodes in a graph, where the weight of an edge (i.e., its capacity) is a measure of similarity between the corresponding nodes. Clusters are identified by removing edges from the graph to produce connected disjoint subgraphs. In image segmentation, pixels which are 4-neighbors or 8-neighbors in the image plane share edges in the constructed adjacency graph, and the weight of a graph edge is based on the strength of a hypothesized image edge between the pixels involved (this strength is calculated using simple derivative masks). Hence, this segmenter works by finding closed contours in the image, and is best labeled edge-based rather than region-based.
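The following sketch conveys the general graph-theoretic idea, although it replaces the minimum-cut machinery of Wu and Leahy [1993] with a much simpler rule that removes all edges whose similarity falls below a threshold and reports the resulting connected components; the toy graph and threshold are illustrative assumptions.

def threshold_graph_clusters(weights, n, min_sim):
    """Clusters = connected components after removing graph edges whose
    similarity (capacity) falls below `min_sim`.  `weights` maps an edge
    (i, j) with i < j to its similarity."""
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a
    for (i, j), w in weights.items():
        if w >= min_sim:                    # keep only strong edges
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[ri] = rj
    comps = {}
    for v in range(n):
        comps.setdefault(find(v), []).append(v)
    return list(comps.values())

# Toy similarity graph over 5 patterns (weights are illustrative).
w = {(0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.2, (3, 4): 0.95}
print(threshold_graph_clusters(w, 5, min_sim=0.5))   # [[0, 1, 2], [3, 4]] here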

In Vinod et al. [1994], two neural networks are designed to perform pattern clustering when combined. A two-layer network operates on a multidimensional histogram of the data to identify 'prototypes' which are used to classify the input patterns into clusters. These prototypes are fed to the classification network, another two-layer network operating on the histogram of the input data, but trained to have differing weights from the prototype selection network. In both networks, the histogram of the image is used to weight the contributions of patterns neighboring the one under consideration to the location of prototypes or the ultimate classification; as such, it is likely to be more robust when compared to techniques which assume an underlying parametric density function for the pattern classes. This architecture was tested on gray-scale and color segmentation problems.

Jolion et al. [1991] describe a process for extracting clusters sequentially from the input pattern set by identifying hyperellipsoidal regions (bounded by loci of constant Mahalanobis distance) which contain a specified fraction of the unclassified points in the set. The extracted regions are compared against the best-fitting multivariate Gaussian density through a Kolmogorov-Smirnov test, and the fit quality is used as a figure of merit for selecting the 'best' region at each iteration. The process continues until a stopping criterion is satisfied. This procedure was applied to the problems of threshold selection for multithreshold segmentation of intensity imagery and segmentation of range imagery.

Clustering techniques have also been successfully used for the segmentation of range images, which are a popular source of input data for three-dimensional object recognition systems [Jain and Flynn 1993]. Range sensors typically return raster images with the measured value at each pixel being the coordinates of a 3D location in space. These 3D positions can be understood as the locations where rays emerging from the image plane locations in a bundle intersect the objects in front of the sensor.



The local feature clustering concept is particularly attractive for range image segmentation since (unlike intensity measurements) the measurements at each pixel have the same units (length); this would make ad hoc transformations or normalizations of the image features unnecessary if their goal is to impose equal scaling on those features. However, range image segmenters often add additional measurements to the feature space, removing this advantage.

A range image segmentation system described in Hoffman and Jain [1987] employs squared error clustering in a six-dimensional feature space as a source of an "initial" segmentation which is refined (typically by merging segments) into the output segmentation. The technique was enhanced in Flynn and Jain [1991] and used in a recent systematic comparison of range image segmenters [Hoover et al. 1996]; as such, it is probably one of the longest-lived range segmenters which has performed well on a large variety of range images.

This segmenter works as follows. At each pixel $(i, j)$ in the input range image, the corresponding 3D measurement is denoted $(x_{ij}, y_{ij}, z_{ij})$, where typically $x_{ij}$ is a linear function of $j$ (the column number) and $y_{ij}$ is a linear function of $i$ (the row number). A $k \times k$ neighborhood of $(i, j)$ is used to estimate the 3D surface normal $n_{ij} = (n_{ij}^x, n_{ij}^y, n_{ij}^z)$ at $(i, j)$, typically by finding the least-squares planar fit to the 3D points in the neighborhood. The feature vector for the pixel at $(i, j)$ is the six-dimensional measurement $(x_{ij}, y_{ij}, z_{ij}, n_{ij}^x, n_{ij}^y, n_{ij}^z)$, and a candidate segmentation is found by clustering these feature vectors. For practical reasons, not every pixel's feature vector is used in the clustering procedure; typically 1000 feature vectors are chosen by subsampling.

The CLUSTER algorithm [Jain and Dubes 1988] was used to obtain segment labels for each pixel. CLUSTER is an enhancement of the k-means algorithm; it has the ability to identify several clusterings of a data set, each with a different number of clusters. Hoffman and Jain [1987] also experimented with other clustering techniques (e.g., complete-link, single-link, graph-theoretic, and other squared error algorithms) and found CLUSTER to provide the best combination of performance and accuracy. An additional advantage of CLUSTER is that it produces a sequence of output clusterings (i.e., a 2-cluster solution up through a Kmax-cluster solution, where Kmax is specified by the user and is typically 20 or so); each clustering in this sequence yields a clustering statistic which combines between-cluster separation and within-cluster scatter. The clustering that optimizes this statistic is chosen as the best one. Each pixel in the range image is assigned the segment label of the nearest cluster center. This minimum distance classification step is not guaranteed to produce segments which are connected in the image plane; therefore, a connected components labeling algorithm allocates new labels for disjoint regions that were placed in the same cluster. Subsequent operations include surface type tests, merging of adjacent patches using a test for the presence of crease or jump edges between adjacent segments, and surface parameter estimation.
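The CLUSTER program itself is not reproduced here; the sketch below captures the same idea under stated assumptions, sweeping K from 2 up to a user-specified maximum with ordinary k-means and scoring each partition with the Calinski-Harabasz index as a stand-in for CLUSTER's internal statistic combining between-cluster separation and within-cluster scatter.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def best_kmeans_partition(X, k_max=20):
    """Run k-means for K = 2..k_max and keep the partition that maximizes a
    between/within scatter statistic (Calinski-Harabasz here, standing in
    for CLUSTER's internal statistic)."""
    best = None
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = calinski_harabasz_score(X, labels)
        if best is None or score > best[0]:
            best = (score, k, labels)
    return best  # (score, number of clusters, labels)

# Illustrative data: three Gaussian blobs in a 6-D feature space.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(200, 6)) for c in (0.0, 2.0, 4.0)])
score, k, labels = best_kmeans_partition(X, k_max=10)
print(k, round(score, 1))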

Figure 27 shows this processing applied to a range image. Part a of the figure shows the input range image; part b shows the distribution of surface normals. In part c, the initial segmentation returned by CLUSTER and modified to guarantee connected segments is shown. Part d shows the final segmentation produced by merging adjacent patches which do not have a significant crease edge between them. The final clusters reasonably represent distinct surfaces present in this complex object.


The analysis of textured images has been of interest to researchers for several years. Texture segmentation techniques have been developed using a variety of texture models and image operations. In Nguyen and Cohen [1993], texture image segmentation was addressed by modeling the image as a hierarchy of two Markov Random Fields, obtaining some simple statistics from each image block to form a feature vector, and clustering these blocks using a fuzzy K-means clustering method. The clustering procedure here is modified to jointly estimate the number of clusters as well as the fuzzy membership of each feature vector to the various clusters.

A system for segmenting texture images was described in Jain and Farrokhnia [1991]; there, Gabor filters were used to obtain a set of 28 orientation- and scale-selective features that characterize the texture in the neighborhood of each pixel. These 28 features are reduced to a smaller number through a feature selection procedure, and the resulting features are preprocessed and then clustered using the CLUSTER program. An index statistic [Dubes 1987] is used to select the best clustering.

Figure 27. Range image segmentation using clustering. (a): Input range image. (b): Surface normals for selected image pixels. (c): Initial segmentation (19 cluster solution) returned by CLUSTER using 1000 six-dimensional samples from the image as a pattern set. (d): Final segmentation (8 segments) produced by postprocessing.


Minimum distance classification is used to label each of the original image pixels. This technique was tested on several texture mosaics including the natural Brodatz textures and synthetic images. Figure 28(a) shows an input texture mosaic consisting of four of the popular Brodatz textures [Brodatz 1966]. Part b shows the segmentation produced when the Gabor filter features are augmented to contain spatial information (pixel coordinates). This Gabor filter based technique has proven very powerful and has been extended to the automatic segmentation of text in documents [Jain and Bhattacharjee 1992] and segmentation of objects in complex backgrounds [Jain et al. 1997].

Clustering can be used as a preprocessing stage to identify pattern classes for subsequent supervised classification. Taxt and Lundervold [1994] and Lundervold et al. [1996] describe a partitional clustering algorithm and a manual labeling technique to identify material classes (e.g., cerebrospinal fluid, white matter, striated muscle, tumor) in registered images of a human head obtained at five different magnetic resonance imaging channels (yielding a five-dimensional feature vector at each pixel). A number of clusterings were obtained and combined with domain knowledge (human expertise) to identify the different classes. Decision rules for supervised classification were based on these obtained classes. Figure 29(a) shows one channel of an input multispectral image; part b shows the 9-cluster result.

The k-means algorithm was applied to the segmentation of LANDSAT imagery in Solberg et al. [1996]. Initial cluster centers were chosen interactively by a trained operator, and correspond to land-use classes such as urban areas, soil (vegetation-free) areas, forest, grassland, and water. Figure 30(a) shows the input image rendered as grayscale; part b shows the result of the clustering procedure.

6.1.3 Summary. In this section, the application of clustering methodology to image segmentation problems has been motivated and surveyed. The historical record shows that clustering is a powerful tool for obtaining classifications of image pixels. Key issues in the design of any clustering-based segmenter are the choice of pixel measurements (features) and dimensionality of the feature vector (i.e., should the feature vector contain intensities, pixel positions, model parameters, filter outputs?), a measure of similarity which is appropriate for the selected features and the application domain, the identification of a clustering algorithm, the development of strategies for feature and data reduction (to avoid the "curse of dimensionality" and the computational burden of classifying large numbers of patterns and/or features), and the identification of necessary pre- and post-processing techniques (e.g., image smoothing and minimum distance classification). The use of clustering for segmentation dates back to the 1960s, and new variations continue to emerge in the literature. Challenges to the more successful use of clustering include the high computational complexity of many clustering algorithms and their incorporation of strong assumptions (often multivariate Gaussian) about the multidimensional shape of clusters to be obtained. The ability of new clustering procedures to handle concepts and semantics in classification (in addition to numerical measurements) will be important for certain applications [Michalski and Stepp 1983; Murty and Jain 1995].

Figure 28. Texture image segmentation results. (a): Four-class texture mosaic. (b): Four-cluster solution produced by CLUSTER with pixel coordinates included in the feature set.



Figure 29. Multispectral medical image segmentation. (a): A single channel of the input image. (b): 9-cluster segmentation.

Figure 30. LANDSAT image segmentation. (a): Original image (ESA/EURIMAGE/Sattelitbild). (b): Clustered scene.



6.2 Object and Character Recognition

6.2.1 Object Recognition. The use of clustering to group views of 3D objects for the purposes of object recognition in range data was described in Dorai and Jain [1995]. The term view refers to a range image of an unoccluded object obtained from any arbitrary viewpoint. The system under consideration employed a viewpoint-dependent (or view-centered) approach to the object recognition problem; each object to be recognized was represented in terms of a library of range images of that object.

There are many possible views of a 3D object, and one goal of that work was to avoid matching an unknown input view against each image of each object. A common theme in the object recognition literature is indexing, wherein the unknown view is used to select a subset of views of a subset of the objects in the database for further comparison, rejecting all other object views. One of the approaches to indexing employs the notion of view classes; a view class is the set of qualitatively similar views of an object. In that work, the view classes were identified by clustering; the rest of this subsection outlines the technique.

Object views were grouped into classes based on the similarity of shape spectral features. Each input image of an object viewed in isolation yields a feature vector which characterizes that view. The feature vector contains the first ten central moments of a normalized shape spectral distribution, $\bar{H}(h)$, of an object view. The shape spectrum of an object view is obtained from its range data by constructing a histogram of shape index values (which are related to surface curvature values) and accumulating all the object pixels that fall into each bin. By normalizing the spectrum with respect to the total object area, the scale (size) differences that may exist between different objects are removed. The first moment $m_1$ is computed as the weighted mean of $\bar{H}(h)$:

$m_1 = \sum_h h \, \bar{H}(h)$.   (1)

The other central moments, $m_p$, $2 \le p \le 10$, are defined as:

$m_p = \sum_h (h - m_1)^p \, \bar{H}(h)$.   (2)

Then, the feature vector is denoted as $R = (m_1, m_2, \cdots, m_{10})$, with the range of each of these moments being $[-1, 1]$.

Let $\{O_1, O_2, \cdots, O_n\}$ be a collection of $n$ 3D objects whose views are present in the model database. The $i$th view of the $j$th object, $O_j^i$, in the database is represented by $\langle L_j^i, R_j^i \rangle$, where $L_j^i$ is the object label and $R_j^i$ is the feature vector. Given a set of object representations $\mathcal{R}^i = \{\langle L_1^i, R_1^i \rangle, \cdots, \langle L_m^i, R_m^i \rangle\}$ that describes $m$ views of the $i$th object, the goal is to derive a partition of the views, $\mathcal{P}^i = \{C_1^i, C_2^i, \cdots, C_{k_i}^i\}$. Each cluster in $\mathcal{P}^i$ contains those views of the $i$th object that have been adjudged similar based on the dissimilarity between the corresponding moment features of the shape spectra of the views. The measure of dissimilarity between $R_j^i$ and $R_k^i$ is defined as:

$\mathcal{D}(R_j^i, R_k^i) = \sum_{l=1}^{10} (R_{jl}^i - R_{kl}^i)^2$.   (3)
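Equations (1)-(3) translate directly into code; the sketch below assumes the shape spectrum is already available as a normalized histogram over shape-index bin centers (the synthetic spectra shown are purely illustrative).

import numpy as np

def moment_feature_vector(H, h):
    """R = (m1, ..., m10) from a normalized shape spectrum H over bins h:
    m1 = sum_h h*H(h); mp = sum_h (h - m1)^p * H(h), p = 2..10."""
    m1 = np.sum(h * H)
    R = [m1] + [np.sum((h - m1) ** p * H) for p in range(2, 11)]
    return np.asarray(R)

def view_dissimilarity(R1, R2):
    """Equation (3): squared Euclidean distance between moment vectors."""
    return float(np.sum((R1 - R2) ** 2))

# Illustrative spectra over shape-index bins in [-1, 1].
h = np.linspace(-1, 1, 41)
H1 = np.exp(-(h - 0.3) ** 2 / 0.02); H1 /= H1.sum()
H2 = np.exp(-(h + 0.1) ** 2 / 0.05); H2 /= H2.sum()
print(view_dissimilarity(moment_feature_vector(H1, h), moment_feature_vector(H2, h)))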

6.2.2 Clustering Views. A database containing 3,200 range images of 10 different sculpted objects, with 320 views per object, is used [Dorai and Jain 1995].


The range images from 320 possible viewpoints (determined by the tessellation of the view-sphere using the icosahedron) of the objects were synthesized. Figure 31 shows a subset of the collection of views of Cobra used in the experiment.

The shape spectrum of each view is computed and then its feature vector is determined. The views of each object are clustered, based on the dissimilarity measure $\mathcal{D}$ between their moment vectors, using the complete-link hierarchical clustering scheme [Jain and Dubes 1988]. The hierarchical grouping obtained with 320 views of the Cobra object is shown in Figure 32. The view grouping hierarchies of the other nine objects are similar to the dendrogram in Figure 32. This dendrogram is cut at a dissimilarity level of 0.1 or less to obtain compact and well-separated clusters. The clusterings obtained in this manner demonstrate that the views of each object fall into several distinguishable clusters. The centroid of each of these clusters was determined by computing the mean of the moment vectors of the views falling into the cluster.
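A sketch of this view-grouping step, using standard SciPy routines rather than the original implementation, is given below; the random "moment vectors" are an illustrative stand-in for the 320 ten-dimensional feature vectors, and the cut level of 0.1 follows the description above.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def group_views(moment_vectors, cut=0.1):
    """Complete-link hierarchical clustering of view feature vectors,
    with the dendrogram cut at a fixed dissimilarity level."""
    d = pdist(moment_vectors, metric="sqeuclidean")   # Equation (3)
    Z = linkage(d, method="complete")
    labels = fcluster(Z, t=cut, criterion="distance")
    centroids = {c: moment_vectors[labels == c].mean(axis=0)
                 for c in np.unique(labels)}
    return labels, centroids

# Illustrative stand-in for 320 ten-dimensional moment vectors.
rng = np.random.default_rng(2)
views = np.vstack([rng.normal(m, 0.05, size=(40, 10))
                   for m in np.linspace(-0.5, 0.5, 8)])
labels, centroids = group_views(views, cut=0.1)
print(len(centroids), "view classes")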

Dorai and Jain [1995] demonstrated that this clustering-based view grouping procedure facilitates object matching in terms of classification accuracy and the number of matches necessary for correct classification of test views. Object views are grouped into compact and homogeneous view clusters, thus demonstrating the power of the cluster-based scheme for view organization and efficient object matching.

Figure 31. A subset of views of Cobra chosen from a set of 320 views.



6.2.3 Character Recognition. Clustering was employed in Connell and Jain [1998] to identify lexemes in handwritten text for the purposes of writer-independent handwriting recognition. The success of a handwriting recognition system is vitally dependent on its acceptance by potential users. Writer-dependent systems provide a higher level of recognition accuracy than writer-independent systems, but require a large amount of training data. A writer-independent system, on the other hand, must be able to recognize a wide variety of writing styles in order to satisfy an individual user. As the variability of the writing styles that must be captured by a system increases, it becomes more and more difficult to discriminate between different classes due to the amount of overlap in the feature space. One solution to this problem is to separate the data from these disparate writing styles for each class into different subclasses, known as lexemes. These lexemes represent portions of the data which are more easily separated from the data of classes other than that to which the lexeme belongs.

In this system, handwriting is captured by digitizing the (x, y) position of the pen and the state of the pen point (up or down) at a constant sampling rate.

Figure 32. Hierarchical grouping of 320 views of a cobra sculpture. (Dendrogram with dissimilarity levels ranging from 0.0 to 0.25.)


Following some resampling, normalization, and smoothing, each stroke of the pen is represented as a variable-length string of points. A metric based on elastic template matching and dynamic programming is defined to allow the distance between two strokes to be calculated.
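The exact elastic-matching metric of Connell and Jain [1998] is not reproduced here; the sketch below implements a generic dynamic-programming (dynamic time warping style) distance between two variable-length strokes simply to illustrate the kind of metric involved, and the example strokes are made up.

import math

def elastic_distance(a, b):
    """Dynamic-programming (DTW-style) distance between two strokes, each a
    list of (x, y) points; a stand-in for the elastic template-matching
    metric described in the text, not the published definition."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

stroke1 = [(0, 0), (1, 1), (2, 2), (3, 2.5)]
stroke2 = [(0, 0.2), (1.1, 1.0), (2.0, 2.1)]
print(round(elastic_distance(stroke1, stroke2), 3))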

Using the distances calculated in this manner, a proximity matrix is constructed for each class of digits (i.e., 0 through 9). Each matrix measures the intraclass distances for a particular digit class. Digits in a particular class are clustered in an attempt to find a small number of prototypes. Clustering is done using the CLUSTER program described above [Jain and Dubes 1988], in which the feature vector for a digit is its N proximities to the digits of the same class. CLUSTER attempts to produce the best clustering for each value of K over some range, where K is the number of clusters into which the data is to be partitioned. As expected, the mean squared error (MSE) decreases monotonically as a function of K. The "optimal" value of K is chosen by identifying a "knee" in the plot of MSE vs. K.
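Locating the "knee" can be automated in many ways; the sketch below uses one simple heuristic (the K with the largest second difference of the MSE curve), which is an assumption for illustration rather than the procedure used in the cited work, and the MSE values are made up.

def knee_of_curve(mse_by_k):
    """Pick K at the 'knee' of a monotonically decreasing MSE-vs-K curve,
    here the point with the largest second difference (an illustrative
    heuristic only)."""
    ks = sorted(mse_by_k)
    best_k, best_curv = ks[1], float("-inf")
    for i in range(1, len(ks) - 1):
        curv = mse_by_k[ks[i - 1]] - 2 * mse_by_k[ks[i]] + mse_by_k[ks[i + 1]]
        if curv > best_curv:
            best_k, best_curv = ks[i], curv
    return best_k

# Illustrative MSE values for K = 1..8 (made up, decreasing).
mse = {1: 10.0, 2: 6.5, 3: 3.2, 4: 1.1, 5: 1.0, 6: 0.95, 7: 0.92, 8: 0.9}
print(knee_of_curve(mse))   # 4 for these values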

When representing a cluster of digits by a single prototype, the best on-line recognition results were obtained by using the digit that is closest to that cluster's center. Using this scheme, a correct recognition rate of 99.33% was obtained.

6.3 Information Retrieval

Information retrieval (IR) is concerned with automatic storage and retrieval of documents [Rasmussen 1992]. Many university libraries use IR systems to provide access to books, journals, and other documents. Libraries use the Library of Congress Classification (LCC) scheme for efficient storage and retrieval of books. The LCC scheme consists of classes labeled A to Z [LC Classification Outline 1990] which are used to characterize books belonging to different subjects. For example, label Q corresponds to books in the area of science, and the subclass QA is assigned to mathematics. Labels QA76 to QA76.8 are used for classifying books related to computers and other areas of computer science.

There are several problems associated with the classification of books using the LCC scheme. Some of these are listed below:

(1) When a user is searching for books in a library which deal with a topic of interest to him, the LCC number alone may not be able to retrieve all the relevant books. This is because the classification number assigned to the books or the subject categories that are typically entered in the database do not contain sufficient information regarding all the topics covered in a book. To illustrate this point, let us consider the book Algorithms for Clustering Data by Jain and Dubes [1988]. Its LCC number is 'QA 278.J35'. In this LCC number, QA 278 corresponds to the topic 'cluster analysis', J corresponds to the first author's name, and 35 is the serial number assigned by the Library of Congress. The subject categories for this book provided by the publisher (which are typically entered in a database to facilitate search) are cluster analysis, data processing, and algorithms. There is a chapter in this book [Jain and Dubes 1988] that deals with computer vision, image processing, and image segmentation. So a user looking for literature on computer vision and, in particular, image segmentation will not be able to access this book by searching the database with the help of either the LCC number or the subject categories provided in the database. The LCC number for computer vision books is TA 1632 [LC Classification 1990], which is very different from the number QA 278.J35 assigned to this book.


(2) There is an inherent problem in assigning LCC numbers to books in a rapidly developing area. For example, let us consider the area of neural networks. Initially, category 'QP' in the LCC scheme was used to label books and conference proceedings in this area. For example, Proceedings of the International Joint Conference on Neural Networks [IJCNN'91] was assigned the number 'QP 363.3'. But most of the recent books on neural networks are given a number using the category label 'QA'; Proceedings of the IJCNN'92 [IJCNN'92] is assigned the number 'QA 76.87'. Multiple labels for books dealing with the same topic will force them to be placed on different stacks in a library. Hence, there is a need to update the classification labels from time to time in an emerging discipline.

(3) Assigning a number to a new book is a difficult problem. A book may deal with topics corresponding to two or more LCC numbers, and therefore, assigning a unique number to such a book is difficult.

Murty and Jain [1995] describe a knowledge-based clustering scheme to group representations of books, which are obtained using the ACM CR (Association for Computing Machinery Computing Reviews) classification tree [ACM CR Classifications 1994]. This tree is used by the authors contributing to various ACM publications to provide keywords in the form of ACM CR category labels. This tree consists of 11 nodes at the first level. These nodes are labeled A to K. Each node in this tree has a label that is a string of one or more symbols. These symbols are alphanumeric characters. For example, I515 is the label of a fourth-level node in the tree.

6.3.1 Pattern Representation. Each book is represented as a generalized list [Sangal 1991] of these strings using the ACM CR classification tree. For the sake of brevity in representation, the fourth-level nodes in the ACM CR classification tree are labeled using numerals 1 to 9 and characters A to Z. For example, the children nodes of I.5.1 (models) are labeled I.5.1.1 to I.5.1.6. Here, I.5.1.1 corresponds to the node labeled deterministic, and I.5.1.6 stands for the node labeled structural. In a similar fashion, all the fourth-level nodes in the tree can be labeled as necessary. From now on, the dots in between successive symbols will be omitted to simplify the representation. For example, I.5.1.1 will be denoted as I511.

We illustrate this process of representation with the help of the book by Jain and Dubes [1988]. There are five chapters in this book. For simplicity of processing, we consider only the information in the chapter contents. There is a single entry in the table of contents for chapter 1, 'Introduction,' and so we do not extract any keywords from this. Chapter 2, labeled 'Data Representation,' has section titles that correspond to the labels of the nodes in the ACM CR classification tree [ACM CR Classifications 1994], which are given below:

(a) I522 (feature evaluation and selection),

(b) I532 (similarity measures), and

(c) I515 (statistical).

Based on the above analysis, Chapter 2 of Jain and Dubes [1988] can be characterized by the weighted disjunction ((I522 ∨ I532 ∨ I515)(1,4)). The weights (1,4) denote that it is one of the four chapters which plays a role in the representation of the book. Based on the table of contents, we can use one or more of the strings I522, I532, and I515 to represent Chapter 2. In a similar manner, we can represent other chapters in this book as weighted disjunctions based on the table of contents and the ACM CR classification tree. The representation of the entire book, the conjunction of all these chapter representations, is given by (((I522 ∨ I532 ∨ I515)(1,4)) ∧ ((I515 ∨ I531)(2,4)) ∧ ((I541 ∨ I46 ∨ I434)(1,4))).


Currently, these representations are generated manually by scanning the table of contents of books in the computer science area, as the ACM CR classification tree provides knowledge of computer science books only. The details of the collection of books used in this study are available in Murty and Jain [1995].

6.3.2 Similarity Measure. The similarity between two books is based on the similarity between the corresponding strings. Two of the well-known distance functions between a pair of strings are [Baeza-Yates 1992] the Hamming distance and the edit distance. Neither of these two distance functions can be meaningfully used in this application. The following example illustrates the point. Consider three strings I242, I233, and H242. These strings are labels (predicate logic for knowledge representation, logic programming, and distributed database systems) of three fourth-level nodes in the ACM CR classification tree. Nodes I242 and I233 are the grandchildren of the node labeled I2 (artificial intelligence) and H242 is a grandchild of the node labeled H2 (database management). So, the distance between I242 and I233 should be smaller than that between I242 and H242. However, the Hamming distance and the edit distance [Baeza-Yates 1992] both have a value of 2 between I242 and I233 and a value of 1 between I242 and H242. This limitation motivates the definition of a new similarity measure that correctly captures the similarity between the above strings. The similarity between two strings is defined as the ratio of the length of the largest common prefix [Murty and Jain 1995] between the two strings to the length of the first string. For example, the similarity between strings I522 and I51 is 0.5. The proposed similarity measure is not symmetric because the similarity between I51 and I522 is 0.67. The minimum and maximum values of this similarity measure are 0.0 and 1.0, respectively. The knowledge of the relationship between nodes in the ACM CR classification tree is captured by the representation in the form of strings. For example, the node labeled pattern recognition is represented by the string I5, whereas the string I53 corresponds to the node labeled clustering. The similarity between these two nodes (I5 and I53) is 1.0. A symmetric measure of similarity [Murty and Jain 1995] is used to construct a similarity matrix of size 100 × 100 corresponding to the 100 books used in the experiments.
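The asymmetric prefix-based measure defined above is straightforward to implement; the sketch below reproduces the worked values from the text.

def prefix_similarity(s1, s2):
    """Similarity = length of the longest common prefix of s1 and s2,
    divided by the length of the first string (hence not symmetric)."""
    common = 0
    for a, b in zip(s1, s2):
        if a != b:
            break
        common += 1
    return common / len(s1)

print(prefix_similarity("I522", "I51"))   # 0.5
print(prefix_similarity("I51", "I522"))   # 0.67 (approximately)
print(prefix_similarity("I5", "I53"))     # 1.0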

6.3.3 An Algorithm for Clustering Books. The clustering problem can be stated as follows. Given a collection of books, we need to obtain a set of clusters. A proximity dendrogram [Jain and Dubes 1988], using the complete-link agglomerative clustering algorithm for the collection of 100 books, is shown in Figure 33. Seven clusters are obtained by choosing a threshold (t) value of 0.12. It is well known that different values for t might give different clusterings. This threshold value is chosen because the "gap" in the dendrogram between the levels at which six and seven clusters are formed is the largest. An examination of the subject areas of the books [Murty and Jain 1995] in these clusters revealed that the clusters obtained are indeed meaningful. Each of these clusters is represented using a list of (string s, frequency sf) pairs, where sf is the number of books in the cluster in which s is present. For example, cluster C1 contains 43 books belonging to pattern recognition, neural networks, artificial intelligence, and computer vision; a part of its representation R(C1) is given below.

R(C1) = ((B718,1), (C12,1), (D0,2), (D311,1), (D312,2), (D321,1), (D322,1), (D329,1), ..., (I46,3), (I461,2), (I462,1), (I463,3), ..., (J26,1), (J6,1), (J61,7), (J71,1))


These clusters of books and the corresponding cluster descriptions can be used as follows: If a user is searching for books, say, on image segmentation (I46), then we select cluster C1 because its representation alone contains the string I46. Books B2 (Neurocomputing) and B18 (Sensory Neural Networks: Lateral Inhibition) are both members of cluster C1 even though their LCC numbers are quite different (B2 is QA76.5.H4442, B18 is QP363.3.N33).

Four additional books labeled B101, B102, B103, and B104 have been used to study the problem of assigning classification numbers to new books. The LCC numbers of these books are: (B101) Q335.T39, (B102) QA76.73.P356C57, (B103) QA76.5.B76C.2, and (B104) QA76.9D5W44. These books are assigned to clusters based on nearest neighbor classification. The nearest neighbor of B101, a book on artificial intelligence, is B23, and so B101 is assigned to cluster C1. It is observed that the assignment of these four books to the respective clusters is meaningful, demonstrating that knowledge-based clustering is useful in solving problems associated with document retrieval.

6.4 Data Mining

In recent years we have seen ever increasing volumes of collected data of all sorts. With so much data available, it is necessary to develop algorithms which can extract meaningful information from the vast stores. Searching for useful nuggets of information among huge amounts of data has become known as the field of data mining.

Data mining can be applied to relational, transaction, and spatial databases, as well as large stores of unstructured data such as the World Wide Web. There are many data mining systems in use today, and applications include the U.S. Treasury detecting money laundering, National Basketball Association coaches detecting trends and patterns of play for individual players and teams, and categorizing patterns of children in the foster care system [Hedberg 1996]. Several journals have had recent special issues on data mining [Cohen 1996; Cross 1996; Wah 1996].

6.4.1 Data Mining Approaches. Data mining, like clustering, is an exploratory activity, so clustering methods are well suited for data mining. Clustering is often an important initial step of several in the data mining process [Fayyad 1996]. Some of the data mining approaches which use clustering are database segmentation, predictive modeling, and visualization of large databases.

Segmentation. Clustering methods are used in data mining to segment databases into homogeneous groups. This can serve purposes of data compression (working with the clusters rather than individual items), or to identify characteristics of subpopulations which can be targeted for specific purposes (e.g., marketing aimed at senior citizens).

A continuous k-means clustering algorithm [Faber 1994] has been used to cluster pixels in Landsat images [Faber et al. 1994]. Each pixel originally has 7 values from different satellite bands, including infra-red. These 7 values are difficult for humans to assimilate and analyze without assistance. Pixels with the 7 feature values are clustered into 256 groups, then each pixel is assigned the value of the cluster centroid. The image can then be displayed with the spatial information intact. Human viewers can look at a single picture and identify a region of interest (e.g., highway or forest) and label it as a concept. The system then identifies other pixels in the same cluster as an instance of that concept.
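A sketch of this segmentation-by-quantization idea is given below; it uses ordinary (mini-batch) k-means from scikit-learn rather than the continuous k-means algorithm of Faber [1994], and the synthetic seven-band image and cluster count are illustrative assumptions.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def quantize_multispectral(image, n_clusters=256):
    """Cluster multi-band pixel vectors and replace each pixel by the
    centroid of its cluster, keeping the spatial layout intact."""
    h, w, bands = image.shape
    pixels = image.reshape(-1, bands).astype(float)
    km = MiniBatchKMeans(n_clusters=n_clusters, random_state=0, n_init=3).fit(pixels)
    quantized = km.cluster_centers_[km.labels_].reshape(h, w, bands)
    return quantized, km.labels_.reshape(h, w)

# Synthetic 7-band image standing in for a Landsat scene.
rng = np.random.default_rng(3)
scene = rng.integers(0, 255, size=(64, 64, 7))
quantized, label_map = quantize_multispectral(scene, n_clusters=16)
print(label_map.shape, quantized.shape)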

Predictive Modeling. Statistical methods of data analysis usually involve hypothesis testing of a model the analyst already has in mind. Data mining can aid the user in discovering potential hypotheses prior to using statistical tools.


Predictive modeling uses clustering to group items, then infers rules to characterize the groups and suggest models. For example, magazine subscribers can be clustered based on a number of factors (age, sex, income, etc.), then the resulting groups characterized in an attempt to find a model which will distinguish those subscribers that will renew their subscriptions from those that will not [Simoudis 1996].

Visualization. Clusters in large databases can be used for visualization, in order to aid human analysts in identifying groups and subgroups that have similar characteristics. WinViz [Lee and Ong 1996] is a data mining visualization tool in which derived clusters can be exported as new attributes which can then be characterized by the system.

Figure 33. A dendrogram corresponding to 100 books. (Complete-link dendrogram over the 100 books, with dissimilarity levels ranging from 0.0 to 1.0.)


For example, breakfast cereals are clustered according to calories, protein, fat, sodium, fiber, carbohydrate, sugar, potassium, and vitamin content per serving. Upon seeing the resulting clusters, the user can export the clusters to WinViz as attributes. The system shows that one of the clusters is characterized by high potassium content, and the human analyst recognizes the individuals in the cluster as belonging to the "bran" cereal family, leading to a generalization that "bran cereals are high in potassium."

6.4.2 Mining Large Unstructured Databases. Data mining has often been performed on transaction and relational databases which have well-defined fields that can be used as features, but there has been recent research on large unstructured databases such as the World Wide Web [Etzioni 1996].

Examples of recent attempts to classify Web documents using words or functions of words as features include Maarek and Shaul [1996] and Chekuri et al. [1999]. However, relatively small sets of labeled training samples and very large dimensionality limit the ultimate success of automatic Web document categorization based on words as features.

Rather than grouping documents in a word feature space, Wulfekuhler and Punch [1997] cluster the words from a small collection of World Wide Web documents in the document space. The sample data set consisted of 85 documents from the manufacturing domain in 4 different user-defined categories (labor, legal, government, and design). These 85 documents contained 5190 distinct word stems after common words (the, and, of) were removed. Since the words are certainly not uncorrelated, they should fall into clusters where words used in a consistent way across the document set have similar values of frequency in each document.

K-means clustering was used to group the 5190 words into 10 groups. One surprising result was that an average of 92% of the words fell into a single cluster, which could then be discarded for data mining purposes. The smallest clusters contained terms which to a human seem semantically related. The 7 smallest clusters from a typical run are shown in Figure 34.
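The sketch below illustrates clustering words in the document space on a toy corpus; the documents, vocabulary handling, and number of clusters are made-up stand-ins for the 85-document, 5190-stem collection described above.

import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

docs = [
    "patent filing and invention disclosure for the new design",
    "labor union files grievance over contract terms",
    "government regulation of manufacturing design standards",
    "the union contract covers overtime and leave",
]

# Build a word-by-document frequency matrix (rows are words, columns are
# documents); stop words and stemming would normally be applied first.
vocab = sorted({w for d in docs for w in d.split()})
counts = [Counter(d.split()) for d in docs]
W = np.array([[c[w] for c in counts] for w in vocab], dtype=float)

# Cluster the words in document space; words used consistently across the
# same documents tend to end up together.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(W)
for c in range(3):
    print(c, [w for w, l in zip(vocab, labels) if l == c])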

Terms which are used in ordinary contexts, or unique terms which do not occur often across the training document set, will tend to cluster into the large 4000-member group.

Figure 34. The seven smallest clusters found in the document set. These are stemmed words.


This takes care of spelling errors, proper names which are infrequent, and terms which are used in the same manner throughout the entire document set. Terms used in specific contexts (such as file in the context of filing a patent, rather than a computer file) will appear in the documents consistently with other terms appropriate to that context (patent, invent) and thus will tend to cluster together. Among the groups of words, unique contexts stand out from the crowd.

After discarding the largest cluster, the smaller set of features can be used to construct queries for seeking out other relevant documents on the Web using standard Web searching tools (e.g., Lycos, Alta Vista, Open Text).

Searching the Web with terms taken from the word clusters allows discovery of finer grained topics (e.g., family medical leave) within the broadly defined categories (e.g., labor).

6.4.3 Data Mining in Geological Databases. Database mining is a critical resource in oil exploration and production. It is common knowledge in the oil industry that the typical cost of drilling a new offshore well is in the range of $30-40 million, but the chance of that site being an economic success is 1 in 10. More informed and systematic drilling decisions can significantly reduce overall production costs.

Advances in drilling technology and data collection methods have led to oil companies and their ancillaries collecting large amounts of geophysical/geological data from production wells and exploration sites, and then organizing them into large databases. Data mining techniques have recently been used to derive precise analytic relations between observed phenomena and parameters. These relations can then be used to quantify oil and gas reserves.

In qualitative terms, good recoverable reserves have high hydrocarbon saturation that is trapped by highly porous sediments (reservoir porosity) and surrounded by hard bulk rocks that prevent the hydrocarbon from leaking away. A large volume of porous sediments is crucial to finding good recoverable reserves; therefore, developing reliable and accurate methods for estimation of sediment porosities from the collected data is key to estimating hydrocarbon potential.

The general rule of thumb experts use for porosity computation is that it is a quasiexponential function of depth:

$\mathrm{Porosity} = K \cdot e^{-F(x_1, x_2, \ldots, x_m) \cdot \mathrm{Depth}}$.   (4)

A number of factors, such as rock types, structure, and cementation, as parameters of the function F, confound this relationship. This necessitates the definition of proper contexts in which to attempt discovery of porosity formulas. Geological contexts are expressed in terms of geological phenomena, such as geometry, lithology, compaction, and subsidence, associated with a region. It is well known that geological context changes from basin to basin (different geographical areas in the world) and also from region to region within a basin [Allen and Allen 1990; Biswas 1995]. Furthermore, the underlying features of contexts may vary greatly. Simple model matching techniques, which work in engineering domains where behavior is constrained by man-made systems and well-established laws of physics, may not apply in the hydrocarbon exploration domain. To address this, data clustering was used to identify the relevant contexts, and then equation discovery was carried out within each context. The goal was to derive the subset x1, x2, ..., xm from a larger set of geological features, and the functional relationship F that best defined the porosity function in a region.

The overall methodology, illustrated in Figure 35, consists of two primary steps: (i) context definition using unsupervised clustering techniques, and (ii) equation discovery by regression analysis [Li and Biswas 1995]. Real exploration data collected from a region in the Alaska basin was analyzed using the methodology developed.


The data objects (patterns) are described in terms of 37 geological features, such as porosity, permeability, grain size, density, and sorting, the amount of different mineral fragments (e.g., quartz, chert, feldspar) present, the nature of the rock fragments, pore characteristics, and cementation. All these feature values are numeric measurements made on samples obtained from well-logs during exploratory drilling processes.

The k-means clustering algorithm was used to identify a set of homogeneous primitive geological structures (g1, g2, ..., gm). These primitives were then mapped onto the unit code versus stratigraphic unit map. Figure 36 depicts a partial mapping for a set of wells and four primitive structures. The next step in the discovery process identified sections of well regions that were made up of the same sequence of geological primitives. Every sequence defined a context Ci. From the partial mapping of Figure 36, the context C1 = g2 + g1 + g2 + g3 was identified in two well regions (the 300 and 600 series). After the contexts were defined, data points belonging to each context were grouped together for equation derivation. The derivation procedure employed multiple regression analysis [Sen and Srivastava 1990].
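As an illustration of the equation-derivation step, the sketch below fits a log-linear surrogate of Equation (4) within a single context by least squares; the synthetic context data and the particular functional form of F (linear in the selected features) are assumptions for illustration, not the procedure of Li and Biswas [1995].

import numpy as np

def fit_porosity(depth, features, porosity):
    """Least-squares fit of log(porosity) = log K - (b0 + features @ b) * depth,
    a log-linear surrogate for Equation (4) within a single context."""
    # Design matrix: intercept, depth, and depth-scaled features.
    A = np.column_stack([np.ones_like(depth), depth, features * depth[:, None]])
    coef, *_ = np.linalg.lstsq(A, np.log(porosity), rcond=None)
    logK, b0, b = coef[0], -coef[1], -coef[2:]
    return np.exp(logK), b0, b

# Synthetic context data (made up): porosity decays with depth.
rng = np.random.default_rng(4)
depth = rng.uniform(500, 3000, size=138)
feats = rng.normal(size=(138, 2))                  # e.g., grain size, sorting
poro = 0.4 * np.exp(-(4e-4 + 1e-4 * feats[:, 0]) * depth) * rng.lognormal(0, 0.05, 138)
K, b0, b = fit_porosity(depth, feats, poro)
print(round(K, 3), round(b0, 6), np.round(b, 6))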

This method was applied to a data set of about 2600 objects corresponding to sample measurements collected from wells in the Alaskan Basin. The k-means algorithm clustered this data set into seven groups. As an illustration, we selected a set of 138 objects representing a context for further analysis. The features that best defined this cluster were selected, and experts surmised that the context represented a low-porosity region, which was modeled using the regression procedure.

7. SUMMARY

There are several applications where decision making and exploratory pattern analysis have to be performed on large data sets. For example, in document retrieval, a set of relevant documents has to be found among several millions of documents of dimensionality of more than 1000. It is possible to handle these problems if some useful abstraction of the data is obtained and is used in decision making, rather than directly using the entire data set. By data abstraction, we mean a simple and compact representation of the data. This simplicity helps the machine in efficient processing or a human in comprehending the structure in data easily. Clustering algorithms are ideally suited for achieving data abstraction.

In this paper, we have examined various steps in clustering: (1) pattern representation, (2) similarity computation, (3) grouping process, and (4) cluster representation. Also, we have discussed statistical, fuzzy, neural, evolutionary, and knowledge-based approaches to clustering. We have described four applications of clustering: (1) image segmentation, (2) object recognition, (3) document retrieval, and (4) data mining.

Figure 35. Description of the knowledge-based scientific discovery process.



Clustering is a process of grouping data items based on a measure of similarity. Clustering is a subjective process; the same set of data items often needs to be partitioned differently for different applications. This subjectivity makes the process of clustering difficult. This is because a single algorithm or approach is not adequate to solve every clustering problem. A possible solution lies in reflecting this subjectivity in the form of knowledge. This knowledge is used either implicitly or explicitly in one or more phases of clustering. Knowledge-based clustering algorithms use domain knowledge explicitly.

The most challenging step in clustering is feature extraction or pattern representation. Pattern recognition researchers conveniently avoid this step by assuming that the pattern representations are available as input to the clustering algorithm. In small size data sets, pattern representations can be obtained based on previous experience of the user with the problem. However, in the case of large data sets, it is difficult for the user to keep track of the importance of each feature in clustering. A solution is to make as many measurements on the patterns as possible and use them in pattern representation. But it is not possible to use a large collection of measurements directly in clustering because of computational costs. So several feature extraction/selection approaches have been designed to obtain linear or nonlinear combinations of these measurements which can be used to represent patterns. Most of the schemes proposed for feature extraction/selection are typically iterative in nature and cannot be used on large data sets due to prohibitive computational costs.

The second step in clustering is similarity computation. A variety of schemes have been used to compute similarity between two patterns. They use knowledge either implicitly or explicitly.

Figure 36. Area code versus stratigraphic unit map for part of the studied region. (Axes: area code versus stratigraphic unit.)


Most of the knowledge-based clustering algorithms use explicit knowledge in similarity computation. However, if patterns are not represented using proper features, then it is not possible to get a meaningful partition, irrespective of the quality and quantity of knowledge used in similarity computation. There is no universally acceptable scheme for computing similarity between patterns represented using a mixture of both qualitative and quantitative features. Dissimilarity between a pair of patterns is represented using a distance measure that may or may not be a metric.

The next step in clustering is the grouping step. There are broadly two grouping schemes: hierarchical and partitional schemes. The hierarchical schemes are more versatile, and the partitional schemes are less expensive. The partitional algorithms aim at minimizing the squared error criterion function. Motivated by the failure of the squared error partitional clustering algorithms in finding the optimal solution to this problem, a large collection of approaches have been proposed and used to obtain the global optimal solution to this problem. However, these schemes are computationally prohibitive on large data sets. ANN-based clustering schemes are neural implementations of the clustering algorithms, and they share the undesired properties of these algorithms. However, ANNs have the capability to automatically normalize the data and extract features. An important observation is that even if a scheme can find the optimal solution to the squared error partitioning problem, it may still fall short of the requirements because of the possible non-isotropic nature of the clusters.

In some applications, for example in document retrieval, it may be useful to have a clustering that is not a partition. This means clusters are overlapping. Fuzzy clustering and functional clustering are ideally suited for this purpose. Also, fuzzy clustering algorithms can handle mixed data types. However, a major problem with fuzzy clustering is that it is difficult to obtain the membership values. A general approach may not work because of the subjective nature of clustering. It is required to represent clusters obtained in a suitable form to help the decision maker. Knowledge-based clustering schemes generate intuitively appealing descriptions of clusters. They can be used even when the patterns are represented using a combination of qualitative and quantitative features, provided that knowledge linking a concept and the mixed features is available. However, implementations of the conceptual clustering schemes are computationally expensive and are not suitable for grouping large data sets.

The k-means algorithm and its neural implementation, the Kohonen net, are most successfully used on large data sets. This is because the k-means algorithm is simple to implement and computationally attractive because of its linear time complexity. However, it is not feasible to use even this linear time algorithm on large data sets. Incremental algorithms like leader and its neural implementation, the ART network, can be used to cluster large data sets. But they tend to be order-dependent. Divide and conquer is a heuristic that has been rightly exploited by computer algorithm designers to reduce computational costs. However, it should be judiciously used in clustering to achieve meaningful results.

In summary, clustering is an interesting, useful, and challenging problem. It has great potential in applications like object recognition, image segmentation, and information filtering and retrieval. However, it is possible to exploit this potential only after making several design choices carefully.

ACKNOWLEDGMENTS

The authors wish to acknowledge the generosity of several colleagues who read manuscript drafts, made suggestions, and provided summaries of emerging application areas which we have incorporated into this paper.


Gautam Biswas and Cen Li of Vanderbilt University provided the material on knowledge discovery in geological databases. Ana Fred of Instituto Superior Técnico in Lisbon, Portugal provided material on cluster analysis in the syntactic domain. William Punch and Marilyn Wulfekuhler of Michigan State University provided material on the application of cluster analysis to data mining problems. Scott Connell of Michigan State provided material describing his work on character recognition. Chitra Dorai of IBM T.J. Watson Research Center provided material on the use of clustering in 3D object recognition. Jianchang Mao of IBM Almaden Research Center, Peter Bajcsy of the University of Illinois, and Zoran Obradovic of Washington State University also provided many helpful comments. Mario de Figueirido performed a meticulous reading of the manuscript and provided many helpful suggestions.

This work was supported by the National Science Foundation under grant INT-9321584.


Received: March 1997; revised: October 1998; accepted: January 1999
