
An Algorithm for Clustering Tendency Assessment

Yingkang Hu
Georgia Southern University
Department of Mathematical Sciences
Statesboro, GA 30460-8093, USA
[email protected]

Richard J. Hathaway
Georgia Southern University
Department of Mathematical Sciences
Statesboro, GA 30460-8093, USA
[email protected]

Abstract: The visual assessment of tendency (VAT) technique, developed by J.C. Bezdek, R.J. Hathaway and J.M. Huband, uses a visual approach to find the number of clusters in data. In this paper, we develop a new algorithm that processes the numeric output of VAT programs, rather than the gray level images used in VAT, and produces tendency curves. Possible cluster borders appear as high-low patterns on the curves, which can be caught not only by human eyes but also by the computer. Our numerical results are very promising: the program caught cluster structures even in cases where the visual outputs of VAT are virtually useless.

Key–Words: Clustering, similarity measures, data visualization, clustering tendency

1 Introduction

In clustering one partitions a set of objects O = {o1, o2, . . . , on} into c self-similar subsets (clusters) based on available data and some well-defined measure of similarity. But before using a clustering method one has to decide whether there are meaningful clusters, and if so, how many there are. This is because all clustering algorithms will find any number (up to n) of clusters, even if no meaningful clusters exist. The process of determining the number of clusters is called assessing clustering tendency. We refer the reader to Tukey [1] and Cleveland [2] for visual approaches in various data analysis problems, and to Jain and Dubes [3] and Everitt [4] for formal (statistics-based) and informal techniques for cluster tendency assessment. Other interesting articles include [13]–[17]. Recently, research on the visual assessment of tendency (VAT) technique has been quite active; see the original VAT paper by Bezdek and Hathaway [5], and also Bezdek, Hathaway and Huband [6], Hathaway, Bezdek and Huband [7], and Huband, Bezdek and Hathaway [8, 9].

The VAT algorithms apply to relational data, in which each pair of objects in O is represented by a relationship. Most likely, the relationship between oi and oj is given by their dissimilarity Rij (a distance or some other measure; see [11] and [12]). These n2 data items form a symmetric matrix R = [Rij]n×n. If each object oi is represented by a feature vector xi = (xi1, xi2, . . . , xis), where the xij are properties of oi such as height, length, color, etc., the set

X = {x1, x2, . . . , xn} ⊂ IRs

is called the object data representation of the set O. In this case Rij can be computed as the distance between xi and xj measured by some norm or metric in IRs. In this paper, if the data is given as object data X, we will use as Rij the square root of the Euclidean norm of xi − xj, that is,

Rij = √‖xi − xj‖2.

The VAT algorithms reorder (through indexing) the points so that points that are close to one another in the feature space will generally have similar indices. Their numeric output is an ordered dissimilarity matrix (ODM). We will still use the letter R for the ODM; this will not cause confusion, since the ODM is the only information on the data we are going to use. The ODM satisfies

0 ≤ Rij ≤ 1, Rij = Rji and Rii = 0.

The largest element of R is 1 because the algorithms scale the elements of R.
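To make the preprocessing concrete, the following is a minimal Python sketch (not part of the paper) of the two steps just described: building the scaled dissimilarity matrix R from object data, and the VAT reordering itself. The function names and the use of NumPy are our choices, and vat_reorder is our paraphrase of the algorithm in [5], not the authors' code.

```python
import numpy as np

def dissimilarity_matrix(X):
    """R_ij = square root of the Euclidean norm of x_i - x_j,
    scaled so that the largest element of R is 1."""
    n = len(X)
    R = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            R[i, j] = R[j, i] = np.sqrt(np.linalg.norm(X[i] - X[j]))
    return R / R.max()

def vat_reorder(R):
    """VAT reordering (our paraphrase of [5]): start from a row of the
    largest dissimilarity, then repeatedly append the remaining object
    closest to the already-ordered set, in the style of Prim's algorithm.
    Returns the ODM and the permutation."""
    n = R.shape[0]
    i, _ = np.unravel_index(np.argmax(R), R.shape)
    order = [i]
    remaining = set(range(n)) - {i}
    while remaining:
        j = min(remaining, key=lambda q: min(R[p, q] for p in order))
        order.append(j)
        remaining.remove(j)
    return R[np.ix_(order, order)], order
```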

The ODM is displayed as an ordered dissimilarity image (ODI), which is the visual output of VAT. In the ODI the gray level of pixel (i, j) is proportional to the value of Rij: pure black if Rij = 0 and pure white if Rij = 1.
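Rendering the ODI from the ODM is then a one-liner with, for example, matplotlib (our choice of library, not the paper's):

```python
import matplotlib.pyplot as plt

def show_odi(odm):
    """Display the ODM as an ODI: gray level proportional to R_ij,
    pure black at 0 and pure white at 1."""
    plt.imshow(odm, cmap='gray', vmin=0.0, vmax=1.0)
    plt.axis('off')
    plt.show()
```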


Figure 1: Scatterplot of the data set X.

Figure 2: The original order of X.

Figure 3: Dissimilarity image before reordering X.

Figure 4: A data set X ordered by VAT.

Figure 5: ODI using the order in Fig. 4.

The idea of VAT is shown in Fig.1–5. Fig.1 shows a scatterplot of a data set X = {x1, x2, . . . , x20} ⊂ IR2 of 20 points containing three well-defined clusters. Its original order, shown by the broken line in Fig.2 with x1 ≈ (1.3, 9.6) marked by a diamond, is random, as in most applications. The corresponding dissimilarity image in Fig.3 contains no useful visual information about the cluster structure in X. Fig.4 shows the new order of the data set X, with the diamond in the lower left corner representing the new x1 ≈ (−6.1, −1.3) in the ordered data set. Fig.5 gives the corresponding ODI. Now the three clusters are represented by the three well-formed black blocks.

The VAT algorithms are certainly useful, but there is room for improvement. It seems to us that our eyes are not very sensitive to structures in gray level images. One example is given in Fig.6. There are three clusters in the data, as we will show later. The clusters are not well separated, and the ODI from VAT does not reveal any sign of the existence of the structure.

The approach of this paper is to focus on changes in dissimilarities in the ODM, the numeric output of VAT that underlies its visual output, the ODI. The results will be displayed as curves, which we call the tendency curves. The borders of clusters in the ODM (or blocks in the ODI) are shown as certain patterns of peaks and valleys on the tendency curves. The patterns can be caught not only by human eyes but also by the computer. It seems that the computer is more sensitive to these patterns on the curves than human eyes are to them or to the gray level patterns in the ODI. For example, the computer caught the three clusters in the data set that produced the virtually useless ODI in Fig.6.

Figure 6: How many clusters are in this ODI?

2 Tendency Curves

Our approach is to catch possible diagonal blocks in the ordered dissimilarity matrix R by using various averages of distances, which are stored as vectors and displayed as curves. Let n be the number of points in the data; we define

m = 0.05n, M = 5m, w = 3m. (1)

We restrict ourselves to the w-subdiagonal band (excluding the diagonal) of R, as shown in Fig.7. Let ℓi = max(1, i − w); then the i-th "row average" is defined by

r1 = 0,  ri = (1/(i − ℓi)) · Σ_{j=ℓi}^{i−1} Rij,  2 ≤ i ≤ n. (2)

In other words, each ri is the average of the elements of row i in the w-band. The i-th m-row moving average is defined as the average of all elements, in up to m rows above row i inclusive, that fall in the w-band. This corresponds to the region between the two horizontal line segments in Fig.7. We also define the M-row moving average in almost the identical way, with m replaced by M. These will be referred to as the r-curve, the m-curve and the M-curve, respectively.

Figure 7: Sub-diagonal band of the ODM

The idea of the r-curve is simple. Just imagine a horizontal line, representing the current row in the program, moving downward in an ODI such as the one in Fig.5. When it moves out of one diagonal black block and into another, the r-curve should first show a peak, because the numbers to the left of the diagonal element Rii will suddenly increase. It should drop back down rather quickly when the line moves well into the next black block. Therefore the border of two blocks should be represented as a peak on the r-curve if the clusters are well separated.

When the situation is less than ideal, there will be noise, which may destroy possible patterns on the r-curve. That is where the m-curve comes in: it often reveals the pattern beneath the noise. Since the VAT algorithms tend to order outliers near the end, the m-curve tends to move up in the long run, which makes it hard for the program to identify peaks. That is why we introduce the M-curve, which shows long-term trends of the r-curve. The difference of the m- and M-curves, which we call the d-curve, retains the shape of the m-curve but is more horizontal, basically lying on the horizontal axis. Furthermore, the M-curve changes more slowly than the m-curve, so when moving from one block into another in the ODM, it will tend to be lower than the m-curve. As a result, the d-curve will show a valley, most likely below the horizontal axis, after a peak. It is these peak-valley, or high-low, patterns that signal the existence of cluster structures. This will become clear in our examples in the section that follows.
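Under our reading of definitions (1) and (2), the four curves can be computed as in the sketch below; rounding m to an integer and setting the first entry of each curve to zero are our assumptions, not prescriptions from the paper.

```python
import numpy as np

def tendency_curves(R):
    """Compute the r-, m-, M- and d-curves from an ODM R."""
    n = R.shape[0]
    m = max(1, round(0.05 * n))   # eq. (1)
    M, w = 5 * m, 3 * m

    # band[i]: elements of row i that fall in the w-subdiagonal band
    band = [R[i, max(0, i - w):i] for i in range(n)]

    # r-curve, eq. (2): r_1 = 0, r_i = average of row i over the band
    r = np.array([0.0] + [b.mean() for b in band[1:]])

    def moving_average(width):
        # average of all band elements in up to `width` rows
        # above row i, inclusive
        curve = np.zeros(n)
        for i in range(1, n):
            curve[i] = np.concatenate(band[max(1, i - width + 1):i + 1]).mean()
        return curve

    m_curve, M_curve = moving_average(m), moving_average(M)
    return r, m_curve, M_curve, m_curve - M_curve  # d-curve = m - M
```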

3 Numerical Examples

Figure 8: Scatterplot of three normally distributed clusters in IR2, with α = 8.

Figure 9: ODI from VAT for the data set with α = 8.

Figure 10: Tendency curves for the data set with α = 8.

We give one group of examples in IR2 so that we can use their scatterplots to show how well/poorly the clusters are separated. We also give the visual outputs (ODIs) of VAT for comparison. These sets are generated by choosing α = 8, 4, 3, 2, 1 and 0 in the following settings: 2000 points (observations) are generated in three groups from multivariate normal distributions having mean vectors µ1 = (0, α√6/2), µ2 = (−α√2/2, 0) and µ3 = (α√2/2, 0). The probabilities for a point to fall into each of the three groups are 0.35, 0.4 and 0.25, respectively. The covariance matrices for all three groups are I2. Note that µ1, µ2 and µ3 form an equilateral triangle of side length α√2.
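A data set of this kind can be generated along the following lines (a sketch; the seed argument is our addition for reproducibility):

```python
import numpy as np

def generate_data(alpha, n=2000, seed=0):
    """n points from three bivariate normals with identity covariance,
    mixing probabilities (0.35, 0.4, 0.25), and means mu1, mu2, mu3
    forming an equilateral triangle of side alpha*sqrt(2)."""
    rng = np.random.default_rng(seed)
    means = np.array([
        [0.0, alpha * np.sqrt(6) / 2],    # mu1
        [-alpha * np.sqrt(2) / 2, 0.0],   # mu2
        [alpha * np.sqrt(2) / 2, 0.0],    # mu3
    ])
    labels = rng.choice(3, size=n, p=[0.35, 0.4, 0.25])
    return means[labels] + rng.standard_normal((n, 2))
```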

Fig.8–10 for the case α = 8 show what we should look for on the curves. The clusters are very well separated, and the ODI has three black blocks on the diagonal with sharp borders. Our r-curve (the one with "noise") has two vertical rises, and the m-curve (the solid curve going through the r-curve where it is relatively flat) has two peaks, corresponding to the two block borders in the ODI. The M-curve, the smoother, dash-dotted curve, is only interesting in its relative position with respect to the m-curve. That is, it is only useful in generating the d-curve, the difference of these two curves. The d-curve looks almost identical to the m-curve, also having two peaks and two valleys. The major difference is that it lies in the lower part of the figure, around the horizontal axis.

Fig.11–13 show the case α = 4. The clusters are less separated than in the case α = 8, and the slopes of the tendency curves are smaller. There are still two vertical rises on the r-curve, and two peaks followed by two valleys on all the other curves, at the positions where the block borders are in the ODI in Fig.12. What is really different here from the case α = 8 is the wild oscillations near the end of the r-curve, bringing up all three other curves. This corresponds to the small region in the lower-right corner of the ODI, which lacks pattern. Note that no valley follows the third rise or peak. This is understandable because a valley appears when the curve index (the horizontal variable of the graphs) runs into a cluster, shown as a block in the ODI.

Now we know what we should look for: vertical rises or peaks followed by valleys, that is, high-low patterns, on all the tendency curves, except perhaps the M-curve.

Figure 11: Three normally distributed clusters in IR2 with α = 4.

Figure 12: ODI from VAT, α = 4.

Figure 13: Tendency curves for α = 4.

Figure 14: Three normally distributed clusters in IR2 with α = 3.

Figure 15: ODI from VAT, α = 3.

Figure 16: Tendency curves for α = 3.

The case α = 3 is given in Fig.14–16. One can still easily make out the three clusters in the scatterplot, but it is harder to tell to which cluster many points in the middle belong (the memberships are very fuzzy). It is expected that every visual method will have difficulties with them, as evidenced by the lower right corner of the ODI and by the oscillations on the last one fifth of the r-curve. The oscillations bring up the r- and m-curves, but not the d-curve. The d-curve remains almost the same as in the two previous cases, except that the third peak becomes larger and decreases moderately without forming a valley. The two high-low patterns on the m- and d-curves show the existence of three clusters. As we said earlier, it is a valley on the m-curve and, especially, the d-curve that signals the beginning of a new cluster.

We hope by now the reader can see the purpose of the d-curve. Both the m- and M-curves in Fig.16 go up with wild oscillations, but the d-curve always stays low, lying near the horizontal axis. Unlike the other three curves, its values never get too high or too low. This enables us to detect the beginnings of new blocks in an ODM by catching the high-lows on the d-curve. When the d-curve hits a ceiling, set as 0.04, and then a floor, set as 0, the program reports one new cluster. These ceiling and floor values are satisfied in all cases in our numerical experiments where the clusters are reasonably, sometimes only barely, separated. If we lowered the ceiling and raised the floor, we would be able to catch some of the less separated clusters we know we have missed, but it would also increase the chance of "catching" false clusters. We do not like the idea of tuning parameters to particular examples, so we stick to the same ceiling and floor values throughout this paper. In fact, we do not recommend changing the suggested values of the parameters in our program, that is, the values for the ceiling and floor set here, and those for m, M and w given in (1).
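The counting rule just described can be sketched as follows; the loop is our reconstruction of the reported behavior (each ceiling-then-floor hit on the d-curve signals one cluster border, so the number of clusters is one more than the number of hits), not the authors' code.

```python
def count_clusters(d_curve, ceiling=0.04, floor=0.0):
    """Report one new cluster each time the d-curve rises to the
    ceiling and then falls back to the floor."""
    borders = 0
    above = False          # True once the curve has hit the ceiling
    for value in d_curve:
        if not above and value >= ceiling:
            above = True
        elif above and value <= floor:
            borders += 1   # one ceiling-then-floor hit = one border
            above = False
    return borders + 1     # c clusters have c - 1 borders
```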

Figure 17: Three normally distributed clusters in IR2 with α = 2.

Figure 18: ODI from VAT, α = 2.

Figure 19: Tendency curves for α = 2.

The situation in the case α = 2, shown in Fig.17–19, really deteriorates. One can barely make out the three clusters in Fig.17 that are supposed to be there; the ODI in Fig.18 is a mess. In fact, this is the same ODI as the one in Fig.6, shown here again for side-by-side comparison with the scatterplot and the tendency curves. The tendency curves in Fig.19, however, pick up the cluster structure from the ODM. The d-curve has several high-lows, two of them large enough to hit both the ceiling and the floor, with peaks near the 600 and 1000 marks on the horizontal axis, respectively. This example clearly shows that our tendency curves generated from the ODM are more sensitive than the raw block structure in the graphical display (ODI) of the same ODM. The largest advantage of the tendency curves is probably the quantization, which enables the computer, not only human eyes, to catch possible patterns.

Figure 20: Three normally distributed clusters in IR2 with α = 0 (combining into one).

When α goes down to zero, the cluster structure disappears. The scatterplots for α = 0 (Fig.20) and α = 1 (not shown) are almost identical, showing a single cluster in the center. The tendency curves for both cases (Fig.21 and 22) have no high-lows large enough to hit the ceiling and then the floor, which is as it should be. Note that while the other three curves go up when moving to the right, the d-curve, the difference of the m- and M-curves, stays horizontal, which is, again, the reason we introduced it.

Figure 21: Tendency curves for α = 0.

Figure 22: Tendency curves for α = 1.

We also tested our program on many other data sets, including small data sets containing 150 points in IR4. It worked equally well. We also tested two examples from Figures 12 and 13 of Bezdek and Hathaway [5], where the points are regularly arranged on a rectangular grid and along a pair of concentric circles, respectively. Because we could only have speculated from the ODI images produced from these sets before applying our program, it was to our great, and pleasant, surprise that the program worked seamlessly on them, accurately reporting the number of clusters that exist. What we want to emphasize is that we did all this without ever having to modify any suggested parameter values! These tests will be reported in a forthcoming paper.

It is almost a sacred ritual that everybody tries the Iris data in a paper on clustering, so we conclude ours by trying our program on it. The ODI from VAT is given in Fig.23, and the tendency curves are given in Fig.24. The computer caught the large high-low on the left and ignored the small one on the right, correctly reporting two clusters.

Figure 23: ODI for the Iris data.

Figure 24: Tendency curves for the Iris data.

Remark 1 There is more than one version of the Iris data; we are using the "real" version as discussed in [10]. The original object data consists of measurements of four features of each of 150 irises, which are of three, not two, different physical types. Fifty of each type were used to generate the data. Thus it seems there should be three clusters in the data. The reason both the ODI and the tendency curves show only two clusters is that two of the three iris types generate data that overlap in IR4, so many argue that the unlabeled data naturally form two clusters; see [5].

4 Discussion

The original VAT algorithm introduced in 2002 provides a useful visual display of well-separated cluster structure. Two significant weaknesses of the original approach are: (1) there is no straightforward way to automate the procedure so that cluster assessment can be done without the aid of a human interpreter; and (2) the quality of the information in the ODI, to a human observer, deteriorates badly in some cases where there is still a significant separation between clusters. Our proposed use of "tendency curves" derived from the VAT reordering output successfully addresses both of these problems. Regarding (1), while the tendency curves provide much information for detailed human interpretation of cluster structure, they also concentrate and organize the information in a way that allows the use of a simple automated procedure for detecting the number of clusters, involving a simple count of the number of high-low patterns. Very importantly, good choices for the "high" and "low" thresholds are suggested earlier in the paper; the parameters defining the automated procedure do not appear to depend sensitively on the particular problem being solved. Regarding weakness (2), the success of the procedure on data sets such as that shown in Fig.17 demonstrates a great improvement over results obtained from the raw ODIs produced by VAT.

For clearer exposition in this note, we chose to restrict the main discussion to the original VAT procedure. Actually, the tendency curve modification is completely applicable, and easily transferable, to several other cases involving more recent relatives of VAT. For example, the sVAT procedure of Hathaway, Bezdek and Huband [7] is a two-step procedure for assessing cluster tendency in very large data sets. First a representative sample of the original data set is chosen, and then the original VAT is applied to the selected sample. Clearly, the tendency curves could be added as a third step of such a cluster assessment procedure. Significantly, this means the approach described in this note is scalable and applicable to very large data sets.
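As an illustration of this three-step idea (sample, reorder, then tendency curves), here is a hypothetical end-to-end pipeline built from the sketches earlier in this note; plain random sampling stands in for sVAT's more careful sample selection in [7], which we do not reproduce.

```python
import numpy as np

def assess_large_data(X, sample_size=2000, seed=0):
    """Sample, apply VAT, then count clusters from the tendency curves.
    Assumes the sketches dissimilarity_matrix, vat_reorder,
    tendency_curves and count_clusters defined earlier in this note;
    random sampling is only a stand-in for sVAT's sampling step."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    odm, _ = vat_reorder(dissimilarity_matrix(X[idx]))
    _, _, _, d_curve = tendency_curves(odm)
    return count_clusters(d_curve)
```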

Another opportunity to extend the tendency curve assessment approach is the visual cluster validity (VCV) scheme suggested in Hathaway and Bezdek [18]. Whereas VAT involves a pre-clustering activity that uses regular Euclidean distances between points, the VCV scheme is applied post-clustering. The pairwise distances used by VCV are defined indirectly on the basis of how closely pairs of data points fit into the same cluster. In other words, two points that may be far apart in a Euclidean sense, but that both fit closely to a single cluster prototype, which could be an ellipsoid, plane, etc., would be assigned a small pairwise distance. An example of a linear cluster structure and its corresponding VCV ODI from [18] is given in Figs. 25–26. VCV can be used as one tool to visually validate how well the data points fit the assumed cluster structure. Even though cluster validity (done post-clustering) is a fundamentally different activity than cluster tendency assessment (done pre-clustering), the VCV approach again involves applying VAT to a matrix of pairwise distances, and is therefore completely amenable to the application of a tendency curve approach.

Figure 25: Scatterplot of three linear clusters.

A more challenging extension of the use of tendency curves is in the case of the non-square dissimilarity matrices considered in [6]. For rectangular dissimilarity matrices we can calculate row-based tendency curves, as done here, and also (different) column-based tendency curves. We are currently studying exactly how to make the best use of the two sets of curves to assess the underlying co-cluster tendency.

Figure 26: VCV ODI for the Fig. 25 data.

In closing we mention a final thought that may bear no fruit, but at least for now seems intriguing. An often-used tactic in computational intelligence is to find something done efficiently in nature, and then attempt to mimic it computationally in a way that solves some problem of interest. The "high-low" rule for identifying a change in clusters reminds us of the technical analysis schemes used in analyzing market or individual stock pricing charts. (The bible of such graph-based rules is [19].) Is there any "intelligence" in these rules that can be adapted to clustering or other types of data analysis? Moreover, how generally applicable is the fundamental approach taken in this paper, which we see as: (1) converting a large amount of multidimensional data into a univariate data stream; and (2) then extracting some important part of the information in the data stream using some graph-based rules? What is the most useful way to represent the stream information so that it can be understood by humans? Visually? Sonically? Might it be possible to "hear" clusters? This will be researched in the near future.

References:

[1] J.W. Tukey, Exploratory Data Analysis. Reading, MA: Addison-Wesley, 1977.

[2] W.S. Cleveland, Visualizing Data. Summit, NJ: Hobart Press, 1993.

[3] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall, 1988.

[4] B.S. Everitt, Graphical Techniques for Multivariate Data. New York, NY: North Holland, 1978.

[5] J.C. Bezdek and R.J. Hathaway, VAT: A tool for visual assessment of (cluster) tendency. Proc. IJCNN 2002, IEEE Press, Piscataway, NJ, 2002, pp. 2225–2230.

[6] J.C. Bezdek, R.J. Hathaway and J.M. Huband, Visual Assessment of Clustering Tendency for Rectangular Dissimilarity Matrices, IEEE Trans. on Fuzzy Systems, 15 (2007) 890–903.

[7] R.J. Hathaway, J.C. Bezdek and J.M. Huband, Scalable visual assessment of cluster tendency for large data sets, Pattern Recognition, 39 (2006) 1315–1324.

[8] J.M. Huband, J.C. Bezdek and R.J. Hathaway, Revised visual assessment of (cluster) tendency (reVAT). Proc. North American Fuzzy Information Processing Society (NAFIPS), IEEE, Banff, Canada, 2004, pp. 101–104.

[9] J.M. Huband, J.C. Bezdek and R.J. Hathaway, bigVAT: Visual assessment of cluster tendency for large data sets. Pattern Recognition, 38 (2005) 1875–1886.

[10] J.C. Bezdek, J.M. Keller, R. Krishnapuram, L.I. Kuncheva and N.R. Pal, Will the real Iris data please stand up? IEEE Trans. Fuzzy Systems, 7 (1999) 368–369.

[11] I. Borg and J. Lingoes, Multidimensional Similarity Structure Analysis. Springer-Verlag, New York, 1987.

[12] M. Kendall and J.D. Gibbons, Rank Correlation Methods. Oxford University Press, New York, 1990.

[13] Nawara Chansiri, Siriporn Supratid and Chom Kimpan, Image Retrieval Improvement using Fuzzy C-Means Initialized by Fixed Threshold Clustering: a Case Study Relating to a Color Histogram, WSEAS Transactions on Mathematics, 5.7 (2006), 926–931.

[14] Hung-Yueh Lin and Jun-Zone Chen, Applying Neural Fuzzy Method to an Urban Development Criterion for Landfill Siting, WSEAS Transactions on Mathematics, 5.9 (2006), 1053–1059.

[15] Nancy P. Lin, Hung-Jen Chen, Hao-En Chueh, Wei-Hua Hao and Chung-I Chang, A Fuzzy Statistics based Method for Mining Fuzzy Correlation Rules, WSEAS Transactions on Mathematics, 6.11 (2007), 852–858.

[16] Miin-Shen Yang, Karen Chia-Ren Lin, Hsiu-Chih Liu and Jiing-Feng Lirng, A Fuzzy-Soft Competitive Learning Algorithm for Ophthalmological MRI Segmentation, to appear in WSEAS Transactions on Mathematics.

[17] Gita Sastria, Choong Yeun Liong and Ishak Hashim, Application of Fuzzy Subtractive Clustering for Enzymes Classification, to appear in WSEAS Transactions on Mathematics.

[18] R.J. Hathaway and J.C. Bezdek, Visual Cluster Assessment for Prototype Generator Clustering Models, Pattern Recognition Letters, 24 (2003), 1563–1569.

[19] J.J. Murphy, Technical Analysis of the Financial Markets, New York Institute of Finance, New York, 1999.
