
Coupling Clustering and Visualization for Knowledge Discovery from Data

Guénaël Cabanes and Younès Bennani

Abstract— The exponential growth of data generates terabytes of very large databases. The growing number of data dimensions and data objects presents tremendous challenges for effective data analysis and data exploration methods and tools. One solution commonly proposed is the use of a condensed description of the properties and structure of data. Thus, it becomes crucial to have visualization tools capable of representing the data structure, not from the data themselves, but from these condensed descriptions. The purpose of the work described in this paper is to develop and integrate a synergistic visualization of data and knowledge into the knowledge discovery process. We propose a method for describing data from enriched prototypes, segmented using a clustering algorithm. We then introduce a visualization tool that can reveal the structure within and between groups in the data. We show, using several artificial and real databases, the relevance of the proposed method.

I. INTRODUCTION

Due to advances in data acquisition and scientific computing, today's datasets are becoming increasingly complex. With the ability to measure and simulate more processes at finer scales, the number of data dimensions and data objects has grown significantly in today's datasets [1], while the phenomena researchers are able to investigate become increasingly complex. The growing number of data dimensions and data objects presents tremendous challenges for effective data analysis and data exploration methods and tools. Researchers are overwhelmed with data, and standard tools are often insufficient to enable efficient data analysis and, hence, the discovery of information and knowledge from the data. One solution commonly proposed is the use of a condensed description of the properties and structure of the data [2], [3], [4]. Thus, it becomes crucial to have visualization tools capable of representing the data structure, not from the data themselves, but from these condensed descriptions. We address these challenges via a combination of visualization and topographic clustering.

The purpose of our work described in this paper is to develop and integrate a synergistic visualization of data and knowledge into the knowledge discovery process, in order to support the active participation of the user. A major contribution of the proposed approach is the ability to provide data visualizations via maps and graphs. These visualizations provide a comprehensive exploration of the data resources. We propose here a method of describing data from enriched prototypes, based on the learning of a Self-Organizing Map (SOM) [5]. The prototypes of the SOM are segmented using an adapted clustering algorithm. We then introduce a visualization tool that can enhance the structure within and between groups of data.

Guénaël Cabanes and Younès Bennani are with LIPN-CNRS, UMR 7030, 99 Avenue J-B. Clément, 93430 Villetaneuse, France (email: [email protected]).

The remainder of this paper is organized as follows. Section II presents the learning of the data structure to obtain a condensed description. The visualization tool is described in Section III, together with some examples. A conclusion and future work are given in Section IV.

II. LEARNING DATA STRUCTURE

We propose here a method to learn the data structure, based on the automated enrichment and segmentation of a group of prototypes representing the data to be analyzed. We suppose that these prototypes have been previously computed from the data by a suitable algorithm, such as Neural Gas (NG) [6] or the Self-Organizing Map (SOM) [7], [5]. In this paper we focus on the use of the SOM algorithm as the basis of data quantization and representation. A SOM consists of a set of artificial neurons that represent the data structure. These neurons are connected to their neighbors according to topological connections (also called neighborhood connections). The dataset to analyze is used to organize the SOM under topological constraints of the input space. Thus, a correspondence between the input space and the mapping space is built. Two observations that are close in the input space should activate the same neuron or two neighboring neurons of the SOM. Each neuron is associated with a prototype and, to respect the topological constraints, the neighbors of the best match unit of a data point (BMU, the most representative neuron) also update their prototypes to better represent this data point. This update is stronger for neurons that are close map neighbors of the BMU.
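To make the update rule concrete, here is a minimal numpy sketch of one SOM update step; the map size, learning rate and neighborhood radius below are illustrative choices, not values used in the paper.

import numpy as np

def som_update(prototypes, grid, x, lr=0.5, radius=1.0):
    # prototypes: (M, d) prototype vectors; grid: (M, 2) map coordinates.
    # Best Match Unit: the prototype closest to x in the input space.
    bmu = np.argmin(np.linalg.norm(prototypes - x, axis=1))
    # Neighborhood function: decays with distance on the map grid, so close
    # map neighbors of the BMU are updated more strongly than distant ones.
    grid_dist = np.linalg.norm(grid - grid[bmu], axis=1)
    h = np.exp(-grid_dist**2 / (2 * radius**2))
    prototypes += lr * h[:, None] * (x - prototypes)
    return prototypes

# Toy usage: a 5x5 map trained on 1000 random 2-D points.
rng = np.random.default_rng(0)
grid = np.array([(i, j) for i in range(5) for j in range(5)], dtype=float)
prototypes = rng.random((25, 2))
for x in rng.random((1000, 2)):
    som_update(prototypes, grid, x)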

A. Principle

The first step is the learning of the enriched SOM. During the learning, each SOM prototype is extended with novel information extracted from the data. This information will be used in the following step to find clusters in the data and to infer the density function. More specifically, the information added to each prototype is:

• Density mode. A measure of the data density surrounding the prototype (local density). The local density indicates the amount of data present in an area of the input space. We use a Gaussian kernel estimator [8] for this task.

• Local variability. A measure of the variability of the data represented by the prototype. It can be defined as the average distance between the prototype and the represented data.

• Neighborhood. A measure of a prototype's neighborhood. The neighborhood value of two prototypes is the number of data points that are well represented by each one.

The second step is the clustering of the data using density and connectivity information, so as to detect low-density boundaries between clusters. We propose a clustering method that directly uses the information learned during the first stage.

B. Prototypes Enrichment

The enrichment algorithm proceeds in three phases:

Input:
• The distance matrix Dist(w, x) between the M prototypes w and the N data x.

Output:
• The density D_i and the local variability s_i associated with each prototype w_i.
• The neighborhood value v_{i,j} associated with each pair of prototypes w_i and w_j.

Algorithm:

• Density estimate:

D_i = \frac{1}{N} \sum_{k=1}^{N} \frac{e^{-\frac{Dist(w_i, x^{(k)})^2}{2\sigma^2}}}{\sigma\sqrt{2\pi}}

with σ a bandwidth parameter chosen by the user.

• Neighborhood value estimate:
– For each data point x, find the two closest prototypes (BMUs: Best Match Units) u^*(x) and u^{**}(x):

u^*(x) = \arg\min_i \left( Dist(w_i, x) \right)

and

u^{**}(x) = \arg\min_{i \neq u^*(x)} \left( Dist(w_i, x) \right)

– Compute v_{i,j} = the number of data points having i and j as their two first BMUs.

• Local variability estimate: for each prototype w_i, the variability s_i is the mean distance between w_i and the L data points x_w^{(j)} represented by w_i:

s_i = \frac{1}{L} \sum_{j=1}^{L} Dist(w_i, x_w^{(j)})

The proposed method for estimating the density is very similar to that of [9]. It has been shown that, as the number of data points approaches infinity, the estimator D converges asymptotically to the true density function [10]. The choice of the parameter σ is important for good results. If σ is too large, all data will influence the density of all the prototypes, and close prototypes will be associated with similar densities, resulting in a decreased accuracy of the estimate. If σ is too small, a large proportion of the data (those most distant from the prototypes) will not influence the density of the prototypes, which induces a loss of information. A heuristic that seems relevant and gives good results is to define σ as the average distance between a prototype and its nearest neighboring prototype.

At the end of this step, each prototype is associated with a density and a variability value, and each pair of prototypes is associated with a neighborhood value. Much of the information on the data structure is stored in these values. There is no longer any need to keep the data in memory.
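As an illustration, the enrichment step can be sketched in a few lines of numpy; this is our reading of the algorithm above (function and variable names are ours), with σ set by the nearest-neighbor heuristic just described.

import numpy as np

def enrich(prototypes, data):
    # Returns density D (M,), neighborhood values v (M, M), variability s (M,).
    M, N = len(prototypes), len(data)
    # Pairwise distances Dist(w_i, x^(k)) between prototypes and data.
    dist = np.linalg.norm(prototypes[:, None, :] - data[None, :, :], axis=2)
    # Heuristic bandwidth: average distance between a prototype and its
    # nearest neighboring prototype.
    pdist = np.linalg.norm(prototypes[:, None, :] - prototypes[None, :, :], axis=2)
    np.fill_diagonal(pdist, np.inf)
    sigma = pdist.min(axis=1).mean()
    # Local density: Gaussian kernel estimate around each prototype.
    D = (np.exp(-dist**2 / (2 * sigma**2))
         / (sigma * np.sqrt(2 * np.pi))).sum(axis=1) / N
    # Two best matching units (BMUs) of each data point.
    order = np.argsort(dist, axis=0)
    u1, u2 = order[0], order[1]
    # Neighborhood values: count points having (i, j) as their two first BMUs.
    v = np.zeros((M, M))
    for a, b in zip(u1, u2):
        v[a, b] += 1
        v[b, a] += 1
    # Local variability: mean distance between w_i and the data it represents.
    s = np.array([dist[i, u1 == i].mean() if np.any(u1 == i) else 0.0
                  for i in range(M)])
    return D, v, s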

C. Clustering of prototypes

Various prototype-based approaches have been proposed to solve the clustering problem [11], [12], [13]. However, the obtained clustering is never optimal, since part of the information contained in the data is not represented by the prototypes. We propose a new prototype clustering method that uses density and neighborhood information to optimize the clustering. The main idea is that the core part of a cluster can be defined as a region of high density. Then, in most cases, the cluster borders are defined either by low-density regions or by "empty" regions between clusters (i.e., large inter-cluster distances) [14].

At the end of the enrichment process, each set of prototypes linked together by a neighborhood value v > 0 defines a well-separated (i.e., distance-defined) cluster. This is useful to detect borders defined by large inter-cluster distances (Fig. 1(b)). The estimation of the local density (D) is used to detect cluster borders defined by low density. Each cluster is defined by a local maximum of density (density mode, Fig. 1(c)). Thus, a "watershed" method [15] is applied to the prototypes' density within each well-separated cluster to find low-density areas inside these clusters, in order to characterize density-defined sub-clusters (Fig. 1(d)). For each pair of adjacent subgroups, we use a density-dependent index [16] to check whether a low-density area is a reliable indicator of the data structure, or whether it should be regarded as a random fluctuation of the density (Fig. 1(e)). This process is very fast because the number of prototypes is generally small. The combined use of these two types of group definition achieves very good results despite the low number of prototypes in the map, and is able to detect the number of clusters automatically (cf. [17]).

The algorithm proceeds in three steps:

Fig. 1. Example of a sequence of the different stages of the clustering algorithm: (a) data base; (b) sets of connected prototypes; (c) density modes detection; (d) subgroups associated to each mode; (e) merging of irrelevant subgroups: final clusters; (f) data clustering from prototypes clustering.

Input:
• Density values D_i.
• Neighborhood values v_{i,j}.

Output:
• The clusters of prototypes.

1) Extract all groups of connected units. Let P = \{C_i\}_{i=1}^{L} be the L groups of linked prototypes (see Fig. 1(b)):

\forall m \in C_i, \; \exists n \in C_i \text{ such that } v_{m,n} > threshold

In this paper, threshold = 0.

2) For each C_k \in P do:

• Find the set M(C_k) of density maxima (i.e., density modes, see Fig. 1(c)):

M(C_k) = \{ w_i \in C_k \mid D_i \geq D_j, \; \forall w_j \text{ neighbor of } w_i \}

Prototypes w_i and w_j are neighbors if v_{i,j} > threshold.

• Determine the merging threshold matrix (see Fig. 2):

S = [S(i,j)]_{i,j=1 \ldots |M(C_k)|} \quad \text{with} \quad S(i,j) = \left( \frac{1}{D_i} + \frac{1}{D_j} \right)^{-1}

Fig. 2. Threshold computation.

• For each prototype w_i \in C_k, label w_i with one element label(i) of M(C_k), according to an ascending density gradient along the neighborhood. Each label represents a micro-cluster (see Fig. 1(d)).

• For each pair of neighboring prototypes (w_i, w_j) in C_k, if

label(i) \neq label(j)

and if both

D_i > S(label(i), label(j)) \quad \text{and} \quad D_j > S(label(i), label(j)),

then merge the two micro-clusters (Fig. 1(e)).

3) Return the clusters.

The effectiveness of the proposed clustering method has been demonstrated in [17] by testing its performance on 10 databases presenting various clustering difficulties. It was compared to S2L-SOM [18] (which uses only the neighborhood information) and to some traditional two-level methods, in terms of clustering quality (Jaccard and Rand indexes [19]) and stability (a sub-sampling based method [20]). The traditional algorithms selected for comparison are K-means and Ascendant Hierarchical Clustering (AHC), applied (i) to the data and (ii) to the prototypes of the trained SOM. The Davies-Bouldin index [21] was used to determine the best cut of the dendrogram (AHC) or the optimal number K of centroids (K-means). Our algorithm determines the number of clusters automatically and does not need this index. In AHC, the proximity of two clusters was defined as the minimum distance between any two objects in the two different clusters. The results for the external indexes show that, for all the databases, the proposed clustering algorithm is able to find the expected data segmentation and the right number of clusters without any error. This is not the case for the other algorithms when the groups have an arbitrary form, when there is no structure in the data (i.e., only one cluster), or when clusters are in contact. Considering the stability, the new algorithm shows excellent results, whatever the dimension of the data or the shape of the clusters. It is worth noticing that in some cases the clustering obtained by the traditional methods can be extremely unstable.
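For illustration, here is a compact sketch of this segmentation, assuming the arrays D and v produced by the enrichment step. It is our reading of the three steps above (connected components, gradient labeling toward density modes, and micro-cluster merging), not the authors' code.

import numpy as np

def cluster_prototypes(D, v, threshold=0.0):
    # D: densities (M,); v: neighborhood values (M, M). Returns mode labels.
    M = len(D)
    neigh = [np.flatnonzero(v[i] > threshold) for i in range(M)]

    # 1) Groups of connected units: components linked by v > threshold.
    comp = -np.ones(M, dtype=int)
    for i in range(M):
        if comp[i] < 0:
            comp[i], stack = i, [i]
            while stack:
                for j in neigh[stack.pop()]:
                    if comp[j] < 0:
                        comp[j] = i
                        stack.append(j)

    # 2) Label each prototype by ascending density gradient: point to the
    #    densest higher-density neighbor; density modes point to themselves.
    label = np.arange(M)
    for i in range(M):
        higher = [j for j in neigh[i] if comp[j] == comp[i] and D[j] > D[i]]
        if higher:
            label[i] = max(higher, key=lambda j: D[j])
    for i in range(M):                      # follow the chains up to the mode
        while label[i] != label[label[i]]:
            label[i] = label[label[i]]

    # 3) Merge adjacent micro-clusters when the border density exceeds
    #    S(i, j) = (1/D_i + 1/D_j)^-1 on both sides of the border.
    for i in range(M):
        for j in neigh[i]:
            a, b = label[i], label[j]
            if a != b:
                S = 1.0 / (1.0 / D[a] + 1.0 / D[b])
                if D[i] > S and D[j] > S:
                    label[label == b] = a
    return label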

We present here additional tests comparing the new method with other common clustering algorithms that generally perform better than K-means and AHC: DBSCAN [22], CURE [23] and Spectral Clustering [24]. In [25], the authors show that these algorithms fail to resolve some clustering problems, especially when the clusters' shape is not hyper-spherical or when clusters are in contact. Figs. 3 to 5 show that our method succeeds in resolving these kinds of problems (the datasets are the same as in [25]).

Fig. 3. Clustering obtained with (a) DBSCAN and (b) the proposed method.

Fig. 4. Clustering obtained with (a) Spectral Clustering and (b) the proposed method.

Fig. 5. Clustering obtained with (a) Spectral Clustering, (b) CURE and (c) the proposed method.

To summarize, the proposed method presents some interesting qualities in comparison to other clustering algorithms:
• The number of clusters is automatically detected by the algorithm.
• Non-linearly separable clusters and non-hyper-spherical clusters can be detected.
• The algorithm can deal with noise (i.e., touching clusters) by using density estimation.

D. Modeling data distributions

The objective of this step is to estimate the density function, which associates a density value with each point of the input space. An estimate of some values of this function (the D_i) has already been calculated at the positions of the prototypes representing a cluster. An approximation of the function must now be inferred from these values.

The hypothesis here is that this function may be properly approximated as a mixture of Gaussian kernels. Each kernel K is a Gaussian function centered on a prototype. The density function can therefore be written as:

f(x) = \sum_{i=1}^{M} \alpha_i K_i(x)

with

K_i(x) = \frac{1}{N\sqrt{2\pi}\,h_i} \, e^{-\frac{d(w_i, x)^2}{2 h_i^2}}
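For concreteness, evaluating this mixture at an arbitrary point x might look as follows (a minimal sketch; alpha and h are the parameters determined in the remainder of this section, and the names are ours).

import numpy as np

def density(x, W, alpha, h, N):
    # f(x) = sum_i alpha_i K_i(x), with Gaussian kernels centered on the
    # prototypes W (M, d), bandwidths h (M,) and N the number of data points.
    d = np.linalg.norm(W - x, axis=1)
    K = np.exp(-d**2 / (2 * h**2)) / (N * np.sqrt(2 * np.pi) * h)
    return float((alpha * K).sum())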

The most popular method to fit mixture models (i.e., to find the h_i and α_i) is the expectation-maximization (EM) algorithm [26]. However, this algorithm needs to work in the data input space. As we work here on the enriched SOM instead of the dataset, we cannot use the EM algorithm.

Thus, we propose a heuristic to choose h_i:

h_i = \frac{\sum_j \frac{v_{i,j}}{N_i + N_j} \left( s_i N_i + d_{i,j} N_j \right)}{\sum_j v_{i,j}}

where d_{i,j} is the distance between prototypes w_i and w_j, and N_i is the number of data represented by w_i. The idea is that h_i is the standard deviation of the data represented by K_i. These data are also represented by w_i and its neighbors. Then h_i depends on the variability s_i computed for w_i and on the distances d_{i,j} between w_i and its neighbors, weighted by the number of data represented by each prototype and by the connectivity value between w_i and its neighborhood.
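In code, this heuristic might read as follows (a sketch under the same notation; n[i] stands for N_i, the number of data points represented by prototype i, and the function name is ours).

import numpy as np

def bandwidths(s, v, pdist, n):
    # s: variabilities (M,); v: neighborhood values (M, M);
    # pdist: distances d_ij between prototypes (M, M); n: data counts (M,).
    M = len(s)
    h = np.zeros(M)
    for i in range(M):
        j = np.flatnonzero(v[i] > 0)              # neighbors of w_i
        if j.size == 0:
            h[i] = s[i]                           # isolated prototype: fall back
            continue
        w = v[i, j] / (n[i] + n[j])               # v_ij / (N_i + N_j)
        h[i] = (w * (s[i] * n[i] + pdist[i, j] * n[j])).sum() / v[i, j].sum()
    return h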

Now, since the density D_i at each prototype w_i is known (f(w_i) = D_i), a gradient descent method can be used to determine the weights α_i. The α_i are initialized with the values D_i, then these values are gradually adjusted to better fit D_i = \sum_{j=1}^{M} \alpha_j K_j(w_i). To do this, the following criterion is minimized:

R(\alpha) = \frac{1}{M} \sum_{i=1}^{M} \left( \sum_{j=1}^{M} \alpha_j K_j(w_i) - D_i \right)^2

Algorithm:

1) Initialization:

\forall i, \quad \alpha_i = D_i

2) Error calculation:

\forall i, \quad Err(i) = \sum_{j=1}^{M} \alpha_j K_j(w_i) - D_i

3) Coefficient update:

\forall i, \quad \alpha_i(t) = \max \left[ 0 \,;\ \alpha_i(t-1) - \varepsilon \cdot Err(i) \right]

with ε the gradient step; here we use ε = 0.1.

4) While mean(|Err|) > threshold: go to 2; else return the α_i. The threshold is chosen by the user; here we choose 1% of the mean density.
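A minimal sketch of this fitting loop, under the kernel definition given above (ε and the stopping threshold follow the values given in the text; the function name is ours):

import numpy as np

def fit_alpha(W, D, h, N, eps=0.1, max_iter=10000):
    # W: prototypes (M, d); D: target densities (M,); h: bandwidths (M,);
    # N: number of data points. Returns the mixture weights alpha (M,).
    d = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=2)   # d(w_i, w_j)
    # K[i, j] = K_j(w_i): Gaussian kernel centered on w_j, evaluated at w_i.
    K = np.exp(-d**2 / (2 * h[None, :]**2)) / (N * np.sqrt(2 * np.pi) * h[None, :])
    alpha = D.copy()                     # 1) initialization: alpha_i = D_i
    for _ in range(max_iter):
        err = K @ alpha - D              # 2) Err(i) = sum_j alpha_j K_j(w_i) - D_i
        if np.mean(np.abs(err)) <= 0.01 * D.mean():   # 1% of the mean density
            break
        alpha = np.maximum(0.0, alpha - eps * err)    # 3) projected update
    return alpha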

Thus, we obtain a density function that models the dataset represented by the enriched SOM. Some examples of estimated densities are shown in Figs. 6 and 7.

Fig. 6. “Engytime” dataset and the related estimated density function.

Fig. 7. “Rings” dataset and the related estimated density function.

III. VISUALIZATION

A. Description of the visualization process

The clustering is accompanied by a set of information that can be used to complete the analysis of the data. This information comprises the matrix of distances between prototypes and the density matrix, but also the connection values, which can be used to determine the relative importance of each prototype for the representation of the data. It is possible to represent all this information in a single figure for a detailed analysis of the structure of each group and of their relationships:

• The prototypes are projected into a two-dimensional (possibly three-dimensional) space using Sammon's projection, which best preserves the initial distances between prototypes [27].
• The size of the disk representing each prototype is proportional to the density associated with that prototype.
• The color of each prototype depends on the cluster to which it is associated.
• Neighborhood connections (local topology) are represented by a segment connecting the neighboring prototypes.
• The local values of density and variability allow us to estimate the density variations in the representation space. These variations are represented in the form of contour lines. The contour lines in the plane are obtained by projecting the Gaussian mixture into the representation space.

This visualization provides information both on the inter-group structure (number of clusters, similarities between clusters) and on the intra-group structure (local topology, local density and density variations within each cluster, and data variability).
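The figure itself can be sketched with matplotlib as follows. This is an illustrative reconstruction: scikit-learn's MDS is used as a convenient stand-in for Sammon's projection, and the contour lines are obtained by re-evaluating the Gaussian mixture in the 2-D projection space (all names and scaling constants are ours).

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

def plot_enriched_map(W, D, v, labels, alpha, h):
    # 2-D projection of the prototypes (stand-in for Sammon's mapping).
    P = MDS(n_components=2).fit_transform(W)
    fig, ax = plt.subplots()
    # Contour lines: Gaussian mixture re-evaluated on a grid of the plane.
    xs = np.linspace(P[:, 0].min(), P[:, 0].max(), 100)
    ys = np.linspace(P[:, 1].min(), P[:, 1].max(), 100)
    X, Y = np.meshgrid(xs, ys)
    G = np.stack([X.ravel(), Y.ravel()], axis=1)
    dist = np.linalg.norm(G[:, None, :] - P[None, :, :], axis=2)
    f = (alpha * np.exp(-dist**2 / (2 * h**2))).sum(axis=1).reshape(X.shape)
    ax.contour(X, Y, f, levels=8, colors='grey', linewidths=0.5)
    # Neighborhood connections: a segment between each pair of neighbors.
    for i, j in zip(*np.nonzero(np.triu(v) > 0)):
        ax.plot(P[[i, j], 0], P[[i, j], 1], 'k-', lw=0.5)
    # Disks: size proportional to local density, color given by the cluster.
    ax.scatter(P[:, 0], P[:, 1], s=2000 * D / D.max(), c=labels, cmap='tab10')
    plt.show()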

B. Applications

We applied this method to eight artificial and real databases, using a Self-Organizing Map algorithm to learn the prototypes. Table I describes the characteristics of the experimental datasets.

TABLE I
DATASETS DESCRIPTION

Datasets    Type        Size   Dimension
Engytime    Artificial  4096   2
Hepta       Artificial  212    3
Lsun        Artificial  400    2
Rings       Artificial  2500   2
Spirals     Artificial  5000   2
Iris        Real        150    4
Ants        Real        80     11
Children    Real        120    8

Figures 8 to 12 show some examples of visualizations that can be obtained from low-dimensional datasets.

Fig. 8. “Engytime” dataset (left) and its visualization (right).

Fig. 9. “Hepta” dataset (left) and its visualization (right).

Fig. 10. “Lsun” dataset (left) and its visualization (right).

Fig. 11. “Rings” dataset (left) and its visualization (right).

Fig. 12. “Spirals” dataset (left) and its visualization (right).

One notices that the data structure is well preserved by the quantization and clustering algorithm and is well represented by the visualization process. The data density is clearly conveyed by the size of the prototypes and by the level lines. Furthermore, these lines give a two-dimensional view of the general shape of the different clusters and of their relative sizes. The visualization of connections, added to the different colors associated with the prototypes, allows for a visual description of the segmentation of the data into different clusters. Two highly connected clusters indicate groups of data in contact, as in Figure 8, while a lack of connections denotes well-separated groups (for example, in Figure 9). In addition, the visualization is sufficiently detailed to allow the representation of complex data distributions, as illustrated in Figures 11 and 12.

Figures 13 to 15 show some examples of visualizations that can be obtained from real data. The “Iris” data describe three different species of flowers using four features. The “Ants” data describe the activity of each individual in a colony of ants (11 features). Finally, the “Children” data describe the time spent in various play activities by a group of children (8 features).

The visualization of these databases, which are small but of dimension greater than three, illustrates the ability of the visualization method to project the relevant information into a two-dimensional space. For example, the “Iris” data (Fig. 13) are structured into two distinct groups, one of which is further subdivided into two very close subgroups. The three clusters are automatically discovered by the clustering algorithm and correspond to the three distinct species of flowers.

Regarding the “Ants” data (Fig. 14), each cluster detected by the algorithm corresponds to a different behavior and social role within the colony (hunters, nurses, cleaners, guards, etc.). Here there is no clear separation in terms of density between the groups, which means that intermediate behaviors are possible. The existence of these intermediates is known in biology, especially through the presence of generalist ants, which can perform any task depending on the needs of the colony [28].

Finally, the “Children” data [29] (Fig. 15) represent the activities of kindergarten children playing at recess. The data are divided into two fairly well-separated sets of density, each subdivided into two subsets. The central subgroup is itself subdivided into three clusters by the algorithm. It is interesting to note that, overall, the order of the groups from top to bottom corresponds to an increase in the age of the children and in the complexity of their play activities. The yellow group is composed almost exclusively of children in the first year of kindergarten, while the vast majority of children in the last year are in the brown group. The subdivision of the two intermediate years into four clusters reflects individual differences in the dynamics of child development. The decrease in density between the blue group and the green group separates the children spending most of their time in social games (with their peers) from the children playing mostly alone. This indicates that a child who has begun playing with others rarely, if ever, returns to solitary play. All this information is in agreement with the domain knowledge [30], [29].

Fig. 13. Visualization of “Iris” data.

Fig. 14. Visualization of “Ants” data.

Fig. 15. Visualization of “Children” data.

IV. CONCLUSION

In this paper, we proposed a new data structure modeling method based on the learning of prototypes. We also proposed a method for visualizing this structure, able to enhance the data structure within and between groups. We have shown, using some artificial and real examples, the relevance of the proposed method.

Our future work will focus on interactive data mining. Indeed, allowing the user to interact with the proposed visualizations could lead to an exploratory analysis of the finer structure of the data. The addition of information through the display of text labels on the prototypes is also envisaged.

REFERENCES

[1] P. Lyman and H. R. Varian, “How Much Information, 2003,” retrieved from http://www.sims.berkeley.edu/how-much-info-2003.
[2] J. Gehrke, F. Korn, and D. Srivastava, “On computing correlated aggregates over continual data streams,” in Special Interest Group on Management of Data Conference, 2001, pp. 13–24.
[3] G. S. Manku and R. Motwani, “Approximate frequency counts over data streams,” in Very Large Data Base, 2002, pp. 346–357.
[4] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, “A framework for clustering evolving data streams,” in Very Large Data Base, 2003, pp. 81–92.
[5] T. Kohonen, Self-Organizing Maps. Berlin: Springer-Verlag, 2001.
[6] T. M. Martinetz and K. J. Schulten, “A “neural-gas” network learns topologies,” in Artificial Neural Networks, T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, Eds. Amsterdam: Elsevier Science Publishers, 1991, pp. 397–402.
[7] T. Kohonen, Self-Organization and Associative Memory. Berlin: Springer-Verlag, 1984.
[8] B. Silverman, “Using kernel density estimates to investigate multimodality,” Journal of the Royal Statistical Society, Series B, vol. 43, pp. 97–99, 1981.


[9] S. R. Pamudurthy, S. Chandrakala, and C. C. Sakhar, “Local density estimation based clustering,” Proceedings of the International Joint Conference on Neural Networks, pp. 1338–1343, August 2007.
[10] B. W. Silverman, Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, 1986.
[11] E. L. J. Bohez, “Two level cluster analysis based on fractal dimension and iterated function systems (IFS) for speech signal recognition,” IEEE Asia-Pacific Conference on Circuits and Systems, pp. 291–294, 1998.
[12] M. F. Hussin, M. S. Kamel, and M. H. Nagi, “An efficient two-level SOMART document clustering through dimensionality reduction,” in ICONIP, 2004, pp. 158–165.
[13] E. E. Korkmaz, “A two-level clustering method using linear linkage encoding,” in International Conference on Parallel Problem Solving From Nature, Lecture Notes in Computer Science, vol. 4193. Springer-Verlag, 2006, pp. 681–690.
[14] A. Ultsch, “Clustering with SOM: U*C,” in Proceedings of the Workshop on Self-Organizing Maps, 2005, pp. 75–82.
[15] L. Vincent and P. Soille, “Watersheds in digital spaces: An efficient algorithm based on immersion simulations,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, pp. 583–598, 1991.
[16] S.-H. Yue, P. Li, J.-D. Guo, and S.-G. Zhou, “Using greedy algorithm: DBSCAN revisited II,” Journal of Zhejiang University SCIENCE, vol. 5, no. 11, pp. 1405–1412, 2004.
[17] G. Cabanes and Y. Bennani, “A local density-based simultaneous two-level algorithm for topographic clustering,” in Proceedings of the International Joint Conference on Neural Networks, 2008, pp. 1176–1182.
[18] G. Cabanes and Y. Bennani, “A simultaneous two-level clustering algorithm for automatic model selection,” in Proceedings of the International Conference on Machine Learning and Applications (ICMLA’07), 2007, pp. 316–321.
[19] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “Cluster validity methods,” Special Interest Group on Management of Data Record, vol. 31, no. 2–3, pp. 40–45, 19–27, 2002.
[20] A. Ben-Hur, A. Elisseeff, and I. Guyon, “A stability based method for discovering structure in clustered data,” in Pacific Symposium on Biocomputing, vol. 7, 2002, pp. 6–17.
[21] D. L. Davies and D. W. Bouldin, “A cluster separation measure,” IEEE Transactions on Pattern Recognition and Machine Intelligence, vol. 1, no. 2, pp. 224–227, 1979.
[22] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” AAAI Press, 1996, pp. 226–231.
[23] S. Guha and B. Harb, “Approximation algorithms for wavelet transform coding of data streams,” IEEE Transactions on Information Theory, vol. 54, no. 2, pp. 811–830, 2008.
[24] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.
[25] G. Karypis, E.-H. Han, and V. Kumar, “Chameleon: Hierarchical clustering using dynamic modeling,” IEEE Computer, vol. 32, no. 8, pp. 68–75, 1999.
[26] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B, vol. 39, pp. 1–38, 1977.
[27] J. Sammon Jr., “A nonlinear mapping for data structure analysis,” IEEE Transactions on Computers, vol. 18, no. 5, pp. 401–409, May 1969.
[28] B. Hölldobler and E. Wilson, The Ants. Cambridge, MA: Harvard University Press, 1990.
[29] S. Barbu, G. Cabanes, and G. Le Maner-Idrissi, “Boys and girls on the playground: Sex differences in social development are not stable across early childhood,” PLoS ONE, vol. 6, no. 1, p. e16407, 2011.
[30] D. P. Fromberg and D. Bergen, Play from Birth to Twelve: Contexts, Perspectives, and Meanings. New York: Routledge, 2006.