
Density-Based Data Analysis and Similarity Search

Hans-Peter Kriegel, Stefan Brecheisen, Peer Kröger, Martin Pfeifle, Matthias Schubert, and Arthur Zimek

Department “Institute for Informatics”, University of Munich, Oettingenstr. 67, 80538 Munich, Germany

{kriegel,brecheis,kroegerp,pfeifle,schubert,zimek}@dbs.ifi.lmu.de

Abstract. Similarity search in database systems is becoming an increasingly important task in modern application domains such as multimedia, molecular biology, medical imaging, computer aided engineering, marketing and purchasing assistance as well as many others. Furthermore, the feature transformations and distance measures used in similarity search build the foundation of sophisticated data analysis and mining techniques. In this chapter, we show how visualizing cluster hierarchies describing a database of objects can aid the user in the time-consuming task of finding similar objects and discovering interesting patterns. We present related work and explain its shortcomings which led to the development of our new methods. Based on reachability plots, we introduce methods for visually exploring a data set in multiple representations and comparing multiple similarity models. Furthermore, we present a new method for automatically extracting cluster hierarchies from a given reachability plot which allows a user to browse the database for similarity search. We integrated our new method in a prototype which serves two purposes, namely visual data analysis and a new way of object retrieval called navigational similarity search.

1 Introduction

In recent years, an increasing number of database applications has emerged for which efficient and effective similarity search and data analysis is substantial. Important application areas are multimedia, medical imaging, molecular biology, computer aided engineering, marketing and purchasing assistance, etc. [1–8]. In these applications, there usually exist various feature representations and similarity models that can be used to retrieve similar data objects or derive interesting patterns from a given database. Hierarchical clustering was shown to be effective for evaluating similarity models [9, 10]. Especially, the reachability plot generated by OPTICS [11] is suitable for assessing the quality of similarity models and for comparing the meaning of different representations to each other. To further extract patterns and allow new methods of similarity search, cluster extraction algorithms can extract cluster hierarchies representing a concrete categorization of all data objects.

In: Petrushin V. A., Khan L. (eds.): Multimedia Data Mining and Knowledge Discovery, pp. 94–115, 2006. © Springer-Verlag London Ltd 2006

In this chapter, we present methods that employ hierarchical clustering and visual data mining techniques to fulfill various tasks for comparing and evaluating distance models and feature extraction methods. Furthermore, we introduce an algorithm for automatically detecting hierarchical clusters and use this hierarchy for navigational similarity search. In ordinary similarity search systems, a user is usually obliged to provide an example query object to which the retrieved database objects should be similar. In contrast, navigational similarity search allows a user to browse the database using the extracted cluster hierarchy to navigate between groups of similar objects. In order to evaluate our ideas, we developed a research prototype. Its basic functionality is to display the cluster structure of a given data set and to allow navigational similarity search. Furthermore, we integrated two components called VICO and CLUSS. VICO (VIsually Connected Object Orderings) is a tool for evaluating and comparing feature representations and similarity models. The idea of VICO is to compare multiple reachability plots of one and the same data set. CLUSS (CLUster Hierarchies for Similarity Search) is an alternative hierarchical clustering algorithm which was especially developed to generate cluster hierarchies well suited for navigational similarity search.

To sum up, the main topics of this chapter are as follows:

– We describe methods for evaluating data representations and similarity models. Furthermore, we sketch possibilities to visually compare these representations and models.

– We present an alternative approach to the retrieval of similar objects, called navigational similarity search. Unlike conventional similarity queries, the user does not need to provide a query object but can interactively browse the data set.

– We introduce a new cluster recognition algorithm for the reachability plots generated by OPTICS. This algorithm generalizes all the other known cluster recognition algorithms for reachability plots. Although our new algorithm does not need a sophisticated and extensive parameter setting, it outperforms the other cluster recognition algorithms w.r.t. quality and number of recognized clusters and subclusters. The derived cluster hierarchy enables us to employ OPTICS for navigational similarity search.

– We introduce an alternative method for generating cluster hierarchies which provides a more intuitive access to the database for navigational similarity search. The advantage of this method is that each of the derived clusters is described by a set of well selected representative objects giving the user a better impression of the objects contained in the cluster.

The remainder of the chapter is organized as follows: We briefly introduce the clustering algorithm OPTICS in Section 2. In Section 3, we present the main application areas of our new methods for data analysis and navigational similarity search. Section 4 introduces a novel algorithm for extracting cluster hierarchies, together with an experimental evaluation. An alternative way to derive cluster hierarchies for navigational similarity search is presented in Section 5. The chapter concludes in Section 6 with a short summary.

2 Hierarchical Clustering

In the following, we will briefly review the hierarchical density-based clustering algorithm OPTICS which is the foundation of the majority of the methods described in this chapter.

The key idea of density-based clustering is that for each object o of a cluster the neighborhood Nε(o) of a given radius ε has to contain at least a minimum number MinPts of objects. Using the density-based hierarchical clustering algorithm OPTICS yields several advantages due to the following reasons:

– OPTICS is – in contrast to most other algorithms – relatively insensitive to its two input parameters, ε and MinPts. The authors in [11] state that the input parameters just have to be large enough to produce good results.

– OPTICS is a hierarchical clustering method which yields more information about the cluster structure than a method that computes a flat partitioning of the data (e.g. k-means [12]).

– There exists a very efficient variant of the OPTICS algorithm which is based on a sophisticated data compression technique called “Data Bubbles” [13], where we have to trade only very little quality of the clustering result for a great increase in performance.

– There exists an efficient incremental version [14] of the OPTICS algorithm.

OPTICS emerges from the algorithm DBSCAN [15] which computes a flat partitioning of the data. The clustering notion underlying DBSCAN is that of density-connected sets (cf. [15] for more details). It is assumed that there is a metric distance function on the objects in the database (e.g. one of the Lp-norms for a database of feature vectors).

In contrast to DBSCAN, OPTICS does not assign cluster memberships but computes an ordering in which the objects are processed and additionally generates the information which would be used by an extended DBSCAN algorithm to assign cluster memberships. This information consists of only two values for each object, the core-distance and the reachability-distance.

Definition 1 (core-distance). Let o ∈ DB, MinPts ∈ N, ε ∈ R, and MinPts-dist(o) be the distance from o to its MinPts-nearest neighbor. The core-distance of o w.r.t. ε and MinPts is defined as follows:

Core-Dist(o) := ∞ if |Nε(o)| < MinPts, and MinPts-dist(o) otherwise.

Definition 2 (reachability-distance). Let o ∈ DB, MinPts ∈ N and ε ∈ R. The reachability-distance of o w.r.t. ε and MinPts from an object p ∈ DB is defined as follows:

Reach-Dist(p, o) := max (Core-Dist(p), distance(p, o)).
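To make the two distances concrete, the following is a minimal brute-force sketch of Definitions 1 and 2 in Python/NumPy. It is our own illustration, not the authors' implementation; the function names are ours, and the object o is counted as its own neighbor, as is usual in density-based clustering.

import numpy as np

def core_distance(data, o, eps, min_pts):
    """Core-Dist(o): infinity if N_eps(o) holds fewer than MinPts objects,
    otherwise the distance from o to its MinPts-nearest neighbor."""
    dists = np.linalg.norm(data - data[o], axis=1)   # distances from o to all objects
    if np.sum(dists <= eps) < min_pts:               # |N_eps(o)| < MinPts
        return np.inf
    return np.sort(dists)[min_pts - 1]               # MinPts-dist(o), o counts as neighbor 0

def reachability_distance(data, p, o, eps, min_pts):
    """Reach-Dist(p, o) = max(Core-Dist(p), distance(p, o))."""
    return max(core_distance(data, p, eps, min_pts),
               np.linalg.norm(data[p] - data[o]))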

Fig. 1. Illustration of core-distance and reachability-distance (MinPts = 5).

Figure 1 illustrates both concepts: the reachability-distance of p from o equals the core-distance of o, and the reachability-distance of q from o equals the distance between q and o.

The original output of OPTICS is an ordering of the objects, a so-called cluster ordering:

Definition 3 (cluster ordering). Let MinPts ∈ N, ε ∈ R, and CO be a totally ordered permutation of the database objects. Each o ∈ DB has additional attributes o.P, o.C and o.R, where o.P ∈ {1, . . . , |CO|} symbolizes the position of o in CO. We call CO a cluster ordering w.r.t. ε and MinPts if the following three conditions hold:

(1) ∀p ∈ CO: p.C = Core-Dist(p)
(2) ∀o, x, y ∈ CO: o.P < x.P ∧ x.P < y.P ⇒ Reach-Dist(o, x) ≤ Reach-Dist(o, y)
(3) ∀p, o ∈ CO: p.R = min{Reach-Dist(o, p) | o.P < p.P}, where min ∅ = ∞.

Intuitively, Condition (2) states that the order is built on selecting at each position i in CO that object o having the minimum reachability to any object before i. o.C symbolizes the core-distance of an object o in CO whereas o.R is the reachability-distance assigned to object o during the generation of CO. We call o.R the reachability of object o throughout the chapter. Note that o.R is only well-defined in the context of a cluster ordering.
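For illustration, the following is a compact, brute-force O(n²) sketch of how such a cluster ordering can be computed. It is a simplification under our own naming; it omits the seed-list index structures and all optimizations of the original OPTICS implementation.

import numpy as np

def optics_ordering(data, eps, min_pts):
    """Return a cluster ordering plus the o.R and o.C values of each object."""
    n = len(data)
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
    core = np.full(n, np.inf)                    # o.C
    for i in range(n):
        if np.sum(dist[i] <= eps) >= min_pts:    # |N_eps(i)| >= MinPts
            core[i] = np.sort(dist[i])[min_pts - 1]
    reach = np.full(n, np.inf)                   # o.R, inf means "undefined"
    processed = np.zeros(n, dtype=bool)
    order = []
    while len(order) < n:
        rest = np.where(~processed)[0]
        o = rest[np.argmin(reach[rest])]         # smallest reachability first
        processed[o] = True
        order.append(o)
        if np.isfinite(core[o]):                 # o is a core object: update its neighbors
            for p in np.where((dist[o] <= eps) & ~processed)[0]:
                reach[p] = min(reach[p], max(core[o], dist[o, p]))
    return order, reach, core

Plotting reach[order] as a bar chart over the positions 1, …, n gives exactly the reachability plot discussed next.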

The cluster structure can be visualized through so-called reachability plots which are 2D plots generated as follows: the clustered objects are ordered along the x-axis according to the cluster ordering computed by OPTICS and the reachabilities assigned to each object are plotted along the ordinate. An example reachability plot is depicted in Figure 2. Valleys in this plot indicate clusters: objects having a small reachability value are closer and thus more similar to their predecessor objects than objects having a higher reachability value.

Fig. 2. Reachability plot (right) computed by OPTICS for a sample 2-D data set (left).

The reachability plot generated by OPTICS can be cut at any level εcut parallel to the abscissa. It represents the density-based clusters according to the density threshold εcut: a consecutive subsequence of objects having a smaller reachability value than εcut belongs to the same cluster. An example is presented in Figure 2: for a cut at the level ε1 we find two clusters denoted as A and B. Compared to this clustering, a cut at level ε2 would yield three clusters. The cluster A is split into two smaller clusters denoted by A1 and A2 and cluster B decreased its size. Usually, for evaluation purposes, a good value for εcut would yield as many clusters as possible.
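Such a cut is easy to carry out programmatically. The sketch below is our own helper, not part of the original system; it follows the quoted rule literally and collects maximal runs of consecutive positions whose reachability lies below εcut, ignoring the detail that the first object of a valley carries a large reachability and is therefore not included in the run.

def clusters_at_cut(reach_in_order, eps_cut, min_size=2):
    """Flat density-based clusters for a horizontal cut of the reachability plot."""
    clusters, current = [], []
    for pos, r in enumerate(reach_in_order):
        if r < eps_cut:
            current.append(pos)          # reachable from its predecessor at this density
        else:
            if len(current) >= min_size:
                clusters.append(current)
            current = []
    if len(current) >= min_size:         # close a run extending to the end of the plot
        clusters.append(current)
    return clusters

For the plot of Figure 2, a cut at ε1 would return the two runs corresponding to A and B, while the deeper cut at ε2 would return A1, A2 and the shrunken B.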

3 Application Ranges

The introduced methods combine techniques from hierarchical clustering and data visualization for two main purposes, data analysis and similarity search. In the following, we will propose several applications in both areas for which our introduced methods are very useful.

3.1 Data Analysis

The data analysis part of our prototype is called VICO. It allows a user to cluster data objects in varying representations and using varying similarity models. The main purpose of VICO is to compare different feature spaces that describe the same set of data. For this comparison, VICO relies on the interactive visual exploration of reachability plots. Therefore, VICO displays any available view on a set of data objects as adjacent reachability plots and allows comparisons between the local neighborhoods of each object. Figure 3 displays the main window of VICO. The left side of the window contains a so-called tree control that contains a subtree for each view of the data set. In each subtree, the keys are ordered w.r.t. the cluster order of the corresponding view. The tree control allows a user to directly search for individual data objects.

Fig. 3. VICO displaying OPTICS plots of multi-represented data.

In addition to the object keys displayed in the tree control, VICO displays the reachability plot of each view of the data set.

Since valleys in the reachability plot represent clusters in the underlying representation, the user gets an instant impression of the richness of the cluster structure in each representation. However, to explore the relationships between the representations, we need to find out whether objects that are clustered in one representation are also similar in the other representation. To achieve this type of comparison, VICO allows the user to select any data object in any reachability plot or the tree control. By selecting a set of objects in one view, the objects are highlighted in any other view as well. For example, if the user looks at the reachability plot in one representation and selects a cluster within this plot, the corresponding object keys are highlighted in the tree control and identify the objects that are contained in the cluster. Let us note that it is possible to visualize the selected objects as well, as long as there is a viewable object representation. In addition to the information about which objects are clustered together, the set of objects is highlighted in the reachability plots of the other representations as well. Thus, we can easily decide whether the objects in one representation are placed within a cluster in another representation as well or if they are spread among different clusters or are part of the noise. If there exist contradicting reachability plots for the same set of data objects, it is interesting to know which of these representations is closer to the desired notion of similarity. Thus, VICO allows the user to label data objects w.r.t. predefined class values. The different class values for the objects are displayed by different colors in the reachability plot. Thus, a reachability plot of a data space that matches the user’s notion of similarity should display clusters containing objects of the same color. Figure 3 displays a comparison of two feature spaces for an image data set. Each image is labelled w.r.t. the displayed motif.

Another feature of VICO is the ability to handle multi-instance objects. In a multi-instance representation, one data object is given by a set of separated feature objects. An example is a CAD part that can be decomposed into a set of spatial primitives, each of which can be represented by a single feature vector. This way, the complete CAD part is represented by a set of feature vectors, which can be compared by a variety of distance functions. To find out which instances are responsible for clusters of multi-instance objects, VICO allows us to cluster the instances without considering the multi-instance object they belong to. Comparing this instance plot to the plot derived on the complete multi-instance objects allows us to analyze which instance clusters are typical for the clusters on the complete multi-instance objects. Thus, for multi-instance settings, VICO highlights all instances belonging to some selected multi-instance object. To conclude, VICO allows even non-expert users to evaluate similarity models, directly compares different similarity models to each other, and helps explore the connection between multi-instance distance functions and the underlying distance metrics in the feature space of instances.

3.2 Navigational Similarity Search

For similarity search, the idea of our system is to provide navigational access to a database. This way, it is possible to browse a database of objects in an explorer-like application, instead of posing separated similarity queries. A main problem of these similarity queries is that a user always has to provide a query object to which the retrieved objects should be as similar as possible. However, in many application scenarios finding a query object is not easy. For example, an engineer querying a CAD database would have to specify the 3D shape of a CAD part before finding similar parts. Sketching more complicated parts might cause a considerable effort. Thus, similarity queries are often quite time-consuming. The alternative idea of navigational similarity search offers an easier way to retrieve the desired objects. The idea is to use an extracted hierarchy of clusters as a navigation tree. The root represents the complete data set. Each node in the tree represents a subcluster consisting of a subset of data objects for which the elements are more similar to each other than in the father cluster. The more specialized a cluster is the more similar its members are to each other. To describe the members of a cluster, one or more representative objects are displayed. Browsing starts at a general level. Afterwards, the user follows the path in the cluster hierarchy for which the displayed representatives resemble the image in the user’s mind in the best possible way. The browsing terminates when the user reaches a cluster for which the contained objects match the user’s expectation. Another advantage of this approach is that the cluster usually contains the complete set of similar objects. For similarity queries, the number of retrieved results can either be specified in the case of kNN queries or is implicitly controlled by a specified query range. However, both methods usually do not retrieve all objects that could be considered as similar.

4 Cluster Recognition for OPTICS

In this section, we address the task of automatically extracting clusters from a reachability plot. Enhancing the resulting cluster hierarchy with representative objects for each extracted cluster allows us to use the result of OPTICS for navigational similarity search. After a brief discussion of recent work in that area, we propose a new approach for hierarchical cluster recognition based on reachability plots called Gradient Clustering.

4.1 Recent Work

To the best of our knowledge, there are only two methods for automatic cluster extraction from hierarchical representations such as reachability plots or dendrograms – both are also based on reachability plots. Since clusters are represented as valleys (or dents) in the reachability plot, the task of automatic cluster extraction is to identify significant valleys.

The first approach proposed in [11] called ξ-clustering is based on the steepness of the valleys in the reachability plot. The steepness is defined by means of an input parameter ξ. The method suffers from the fact that this parameter is difficult to understand and hard to determine. Rather small variations of the value ξ often lead to drastic changes of the resulting clustering hierarchy. As a consequence, this method is unsuitable for the purpose of automatic cluster extraction.

The second approach was proposed by Sander et al. [16]. The authors describe an algorithm called Tree Clustering that automatically extracts a hierarchical clustering from a reachability plot and computes a cluster tree. It is based on the idea that significant local maxima in the reachability plot separate clusters. Two parameters are introduced to decide whether a local maximum is significant: The first parameter specifies the minimum cluster size, i.e. how many objects must be located between two significant local maxima. The second parameter specifies the ratio between the reachability of a significant local maximum m and the average reachabilities of the regions to the left and to the right of m. The authors in [16] propose to set the minimum cluster size to 0.5% of the data set size and the second parameter to 0.75. They empirically show that this default setting approximately represents the requirements of a typical user.

Although the second method is rather suitable for automatic cluster extraction from reachability plots, it has one major drawback. Many real-world data sets consist of narrowing clusters, i.e. clusters each consisting of exactly one smaller subcluster (cf. Figure 4).

Fig. 4. Sample narrowing clusters: data space (left); reachability plot (middle); cluster hierarchy (right).

Since the Tree Clustering algorithm runs through a list of all local maxima (sorted in descending order of reachability) and decides at each local maximum m whether m is significant to split the objects to the left of m and to the right of m into two clusters, the algorithm cannot detect such narrowing clusters. These clusters cannot be split by a significant maximum. Figure 4 illustrates this fact. The narrowing cluster A consists of one cluster B which is itself narrowing, consisting of one cluster C (the clusters are indicated by dashed lines). The Tree Clustering algorithm will only find cluster A since there are no local maxima to split clusters B and C. The ξ-clustering will detect only one of the clusters A, B or C depending on the parameter ξ but also fails to detect the cluster hierarchy.

A new cluster recognition algorithm should meet the following requirements:

– It should detect all kinds of subclusters, including narrowing subclusters.

– It should create a clustering structure which is close to the one which an experienced user would manually extract from a given reachability plot.

– It should allow an easy integration into the OPTICS algorithm. We do not want to apply an additional cluster recognition step after the OPTICS run is completed. In contrast, the hierarchical clustering structure should be created on-the-fly during the OPTICS run without causing any noteworthy additional cost.

– It should be integrable into the incremental version of OPTICS [14], as most of the discussed application ranges benefit from such an incremental version.

4.2 Gradient Clustering

In this section, we introduce our new Gradient Clustering algorithm which fulfills all of the above mentioned requirements. The idea behind the new cluster extraction algorithm is based on the concept of inflexion points. During the OPTICS run, we decide for each point added to the result set, i.e. the reachability plot, whether it is an inflexion point or not. If it is an inflexion point we might be at the start or at the end of a new subcluster. We store the possible starting points of the subclusters in a stack, called startPts. This stack consists of pairs (o.P, o.R).

Fig. 5. Gradient vector g(x, y) of two objects x and y adjacent in the cluster ordering.

The Gradient Clustering algorithm can easily be integrated into OPTICS and is described in full detail after we have formally introduced the new concept of inflexion points.

In the following, we assume that CO is a cluster ordering as defined in Definition 3. We call two objects o1, o2 ∈ CO adjacent in CO if o2.P = o1.P + 1. Let us recall that o.R is the reachability of o ∈ CO assigned by OPTICS while generating CO. For any two objects o1, o2 ∈ CO adjacent in the cluster ordering, we can determine the gradient of the reachability values o1.R and o2.R. The gradient can easily be modelled as a 2D vector where the y-axis measures the reachability values (o1.R and o2.R) in the ordering, and the x-axis represents the ordering of the objects. If we assume that each object in the ordering is separated by width w, the gradient of o1 and o2 is the vector

g(o1, o2) = (w, o2.R − o1.R).

An example for a gradient vector of two objects x and y adjacent in a cluster ordering is depicted in Figure 5.

Intuitively, an inflexion point should be an object in the cluster ordering where the gradient of the reachabilities changes significantly. This significant change indicates a starting or an end point of a cluster.

Let x, y, z ∈ CO be adjacent, i.e.

x.P + 1 = y.P = z.P − 1.

Fig. 6. Illustration of inflexion points measuring the angle between the gradient vectors of objects adjacent in the ordering.

We can now measure the difference between the gradient vectors g(x, y) and g(y, z) by computing the cosine of the angle between the vectors g(x, y) and g(z, y) (= −g(y, z)). The cosine of this angle is equal to −1 if the angle is 180◦, i.e. the gradients g(x, y) and g(y, z) have the same direction. On the other hand, if the gradient vectors differ a lot, the angle between them will be clearly smaller than 180◦ and thus the cosine will be significantly greater than −1. This observation motivates the concepts of inflexion index and inflexion points:

Definition 4 (inflexion index). Let CO be a cluster ordering and x, y, z ∈ CO be objects adjacent in CO. The inflexion index of y, denoted by II(y), is defined as the cosine of the angle between the gradient vector of x, y (g(x, y)) and the gradient vector of z, y (g(z, y)), formally:

II(y) = cos ϕ(g(x, y), g(z, y)) = (−w^2 + (y.R − x.R)(y.R − z.R)) / (‖g(x, y)‖ · ‖g(z, y)‖),

where ‖v‖ := √(v_1^2 + v_2^2) is the length of the vector v.

Definition 5 (inflexion point). Let CO be a cluster ordering and x, y, z ∈ CO be objects adjacent in CO and let t ∈ R. Object y is an inflexion point iff

II(y) > t.

The concept of inflexion points is suitable to detect objects in CO which are interesting for extracting clusters.

Definition 6 (gradient determinant). Let CO be a cluster ordering and x, y, z ∈ CO be objects adjacent in CO. The gradient determinant of the gradients g(x, y) and g(y, z) is defined as

gd(g(x, y), g(z, y)) := | w           −w         |
                        | x.R − y.R   z.R − y.R  |

algorithm gradient_clustering(ClusterOrdering CO, Integer MinPts, Real t)
  startPts := emptyStack;
  setOfClusters := emptySet;
  currCluster := emptySet;
  o := CO.getFirst();                          // first object is a starting point
  startPts.push(o);

  WHILE o.hasNext() DO                         // for all remaining objects
    o := o.next;
    IF o.hasNext() THEN
      IF II(o) > t THEN                        // inflexion point
        IF gd(o) > 0 THEN
          IF currCluster.size() >= MinPts THEN
            setOfClusters.add(currCluster);
          ENDIF
          currCluster := emptySet;
          IF startPts.top().R <= o.R THEN
            startPts.pop();
          ENDIF
          WHILE startPts.top().R < o.R DO
            setOfClusters.add(set of objects from startPts.top() to last end point);
            startPts.pop();
          ENDDO
          setOfClusters.add(set of objects from startPts.top() to last end point);
          IF o.next.R < o.R THEN               // o is a starting point
            startPts.push(o);
          ENDIF
        ELSE
          IF o.next.R > o.R THEN               // o is an end point
            currCluster := set of objects from startPts.top() to o;
          ENDIF
        ENDIF
      ENDIF
    ELSE                                       // add clusters at end of plot
      WHILE NOT startPts.isEmpty() DO
        currCluster := set of objects from startPts.top() to o;
        IF (startPts.top().R > o.R) AND (currCluster.size() >= MinPts) THEN
          setOfClusters.add(currCluster);
        ENDIF
        startPts.pop();
      ENDDO
    ENDIF
  ENDDO

  RETURN setOfClusters;
END. // gradient_clustering

Fig. 7. Pseudo code of the Gradient Clustering algorithm.

If x, y, z are clear from the context, we use the short form gd(y) for the gradient determinant gd(g(x, y), g(z, y)).

The sign of gd(y) indicates whether y ∈ CO is a starting point or an end point of a cluster. In fact, we can distinguish the following two cases which are visualized in Figure 6:

– II(y) > t and gd(y) > 0: Object y is either a starting point of a cluster (e.g. object a in Figure 6) or the first object outside of a cluster (e.g. object z in Figure 6).

– II(y) > t and gd(y) < 0: Object y is either an end point of a cluster (e.g. object n in Figure 6) or the second object inside a cluster (e.g. object b in Figure 6).

Let us note that a local maximum m ∈ CO which is the cluster separation point in [16] is a special form of the first case (i.e. II(m) > t and gd(m) > 0).

The threshold t is independent of the absolute reachability values of the objects in CO. The influence of t is also very comprehensible because if we know which values for the angles between gradients are interesting, we can easily compute t. For example, if we are interested in angles < 120◦ and > 240◦ we set t = cos 120◦ = −0.5.
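The quantities of Definitions 4–6 are cheap to evaluate from three consecutive reachability values. The following small sketch (our own helper names, with w fixed to 1) shows the computation of II(y), gd(y) and a threshold t:

import math

def inflexion_index(x_r, y_r, z_r, w=1.0):
    """II(y): cosine of the angle between g(x, y) and g(z, y) = -g(y, z)."""
    gxy = (w, y_r - x_r)
    gzy = (-w, y_r - z_r)
    dot = gxy[0] * gzy[0] + gxy[1] * gzy[1]     # -w^2 + (y.R - x.R)(y.R - z.R)
    return dot / (math.hypot(*gxy) * math.hypot(*gzy))

def gradient_determinant(x_r, y_r, z_r, w=1.0):
    """gd(y) = det[[w, -w], [x.R - y.R, z.R - y.R]]: positive at starting
    points, negative at end points (given II(y) > t)."""
    return w * (z_r - y_r) + w * (x_r - y_r)

t = math.cos(math.radians(120))                 # = -0.5, i.e. angles below 120 degrees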

Obviously, the Gradient Clustering algorithm is able to extract narrowing clusters. Experimental comparisons with the methods in [16] and [11] are presented in Section 4.3.

The pseudo code of the Gradient Clustering algorithm is depicted in Figure 7, which works like this: Initially, the first object of the cluster ordering CO is pushed to the stack of starting points startPts. Whenever a new starting point is found, it is pushed to the stack. If the current object is an end point, a new cluster is created containing all objects between the starting point on top of the stack and the current end point. Starting points are removed from the stack if their reachability is lower than the reachability of the current object. Clusters are created as described above for all removed starting points as well as for the starting point which remains in the stack. The input parameter MinPts determines the minimum cluster size and the parameter t was discussed above. Finally, the parameter w influences the gradient vectors and proportionally depends on the reachability values of the objects in CO.

After extracting a meaningful hierarchy of clusters from the reachability plot of a given data set, we still need to enhance the found clustering with suitable representations. For this purpose, we can display the medoid of each cluster, i.e. the object having the minimal average distance to all the other objects in the cluster.
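The medoid of an extracted cluster can be determined with a single pairwise-distance computation; a small NumPy sketch (our own helper, Euclidean distance assumed) is given below.

import numpy as np

def medoid(cluster_vectors):
    """Index of the member with minimal average distance to all other members."""
    X = np.asarray(cluster_vectors, dtype=float)
    pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return int(np.argmin(pairwise.mean(axis=1)))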

4.3 Evaluation

Automatic cluster recognition is very desirable when analyzing large sets of data. In the following, we will first evaluate the quality and then the efficiency of the three cluster recognition algorithms using two real-world test data sets. The first data set contains approximately 200 CAD objects from a German car manufacturer, and the second one is a sample of the Protein Data Bank [17] containing approximately 5000 protein structures. We tested on a workstation featuring a 1.7 GHz CPU and 2 GB RAM.

Effectivity. Both the car and the protein data set exhibit the commonly seen characteristic of clusters that are unpronounced but nevertheless clearly visible to the observer. The corresponding reachability plots of the two data sets are depicted in Figure 8.

Fig. 8. Sample cluster of car parts. a) Gradient Clustering, b) ξ-Clustering, c) Tree Clustering.

Figure 8c shows that the Tree Clustering algorithm does not find any clusters at all in the car data set, with the suggested default ratio parameter of 75% [16]. In order to detect clusters in the car data set, we had to adjust the ratio parameter to 95%. In this case Tree Clustering detected some clusters but missed some other important clusters and did not detect any cluster hierarchies at all. If we have rather high reachability values, e.g. values between 5 and 7 as in Figure 8 for the car data set, the ratio parameter for the Tree Clustering algorithm should be set higher than for smaller values. In the case of the protein data set we detected three clusters with the default parameter setting, but again missed out on some important clusters. Generally, in cases where a reachability graph consists of rather high reachability values or does not present spikes at all, but clusters are formed by smooth troughs in the waveform, this cluster recognition algorithm is unsuitable. Furthermore, it is inherently unable to detect narrowing clusters where a cluster has one subcluster of increased density (cf. Figure 4).

On the other hand, the ξ-clustering approach successfully recognizes some clusters while also missing out on significant subclusters (cf. Figure 8b). This algorithm has some trouble recognizing cluster structures with a significant differential of “steepness”. For instance, in Figure 4 it does not detect the narrowing cluster B inside of cluster A because it tries to create steep down-areas containing as many points as possible. Thus, it will merge the two steep edges if their steepness exceeds the threshold ξ. On the other hand, it is able to detect cluster C within cluster A.

Finally, we look at our new Gradient Clustering algorithm. Figure 8a shows that the recognized cluster structure is close to the intuitive one, which an experienced user would manually derive. Clusters which are clearly distinguishable and contain more than MinPts elements are detected by this algorithm. Not only does it detect a lot of clusters, but it also detects a lot of meaningful cluster hierarchies, consisting of narrowing subclusters.

Table 1. CPU time for cluster recognition.

                      Car data (200 parts)   Protein data (5,000 molecules)
ξ-clustering          0.221 s                5.057 s
Tree Clustering       0.060 s                1.932 s
Gradient Clustering   0.310 s                3.565 s

To sum up, in all our tests the Gradient Clustering algorithm detected many more clusters than the other two approaches, without producing any redundant and unnecessary cluster information.

Efficiency. In all tests, we first created the reachability plots and then applied the algorithms for cluster recognition and representation. Let us note that we could also have integrated the Gradient Clustering into the OPTICS run without causing any noteworthy overhead.

The overall runtimes for the three different cluster recognition algorithms are depicted in Table 1. Our new Gradient Clustering algorithm not only produces the most meaningful results, but also does so in sufficiently short time. This is due to its runtime complexity of O(n).

It turned out, both theoretically and empirically, that the Gradient Clustering algorithm is more practical than recent work for automatic cluster extraction from hierarchical cluster representations.

5 Extracting Cluster Hierarchies for Similarity Search

5.1 Motivation

So far, our prototype works fine by computing, extracting, and visualizing the hierarchical density-based cluster structure of a data set. The density-based cluster model has been chosen due to several criteria. One very important aspect among these criteria is its effectivity in finding clusters of different size and shape. However, this clustering notion need not be the best cluster model for navigational similarity search. For this application, the use of OPTICS may have two limitations. In this section, we will discuss these limitations and a possible solution for them.

The first limitation of using the density-based clustering notion is that OPTICS may place two objects that are rather similar, i.e. near in the feature space, into two separate clusters. As a consequence, these two objects may be displayed in completely different subtrees of the cluster hierarchy, i.e. the relationship between these two points in the cluster hierarchy is rather weak. This problem is visualized in Figure 9: object A is obviously much more similar to object B than to object C. However, since A and C belong to the same cluster, both objects will be considered as similar by OPTICS. A and C will be placed in a similar subtree of the cluster hierarchy, whereas B may end up in a completely different subtree. This is against the intuitive notion of similarity that would expect A and B to have a closer relationship in the cluster hierarchy than A and C.

Fig. 9. A new cluster model for navigational similarity search.

The second limitation is that a cluster of complex shape and huge size can usually not be represented by one representative object. However, the idea of navigational similarity search depends on the suitability of the object which is displayed in order to represent the objects in a cluster.

In order to solve these limitations, we propose a novel way of computing the cluster hierarchy and suitable representations. The general idea is that we are interested in small spherically shaped clusters rather than in clusters of arbitrary size and shape. Intuitively, we can select a given number of representative objects that represent their spherical neighborhood in the best possible way. These representatives can build the lowest level of a cluster hierarchy tree. The next level of the tree (above a given level) can then be built by choosing again the most representative objects from the representatives in the level below until we do not have enough representatives and, thus, have reached the root of the hierarchy. The most important aspect in this strategy is the definition of the representative power of an object. We will discuss this issue and outline a novel procedure to generate the cluster hierarchy in the following.

The idea of this new approach is sketched in Figure 9: a data set with two clusters of different size and shape is clustered. Using OPTICS, both clusters are well separated and we can observe both of the mentioned problems: objects that are members of the larger cluster but are quite near (i.e. similar) to the objects in the smaller cluster will have a significantly poor relationship in the cluster hierarchy to the objects in the smaller cluster. On the other hand, objects that are much less similar but are members of the same (larger) cluster will have a strong relationship to each other in the hierarchy. In addition, it is not clear how to represent the larger cluster with its complex shape by a meaningful representative. Our new approach generates representatives for small convex clusters step by step. As can be seen from Figure 9, this results in a hierarchy that is much more suitable for interactive similarity search. For example, the objects of the smaller cluster will be represented by the same object in the root node of the hierarchy as the objects that belong to the large cluster but are located at the border of that cluster near the smaller cluster. This reflects the intuitive notion of similarity more accurately.

5.2 Basic Definitions

The basic idea is to extract a sufficient number of dedicated objects that represent the other objects in the best possible way. Such objects are called representatives and should be associated with a set of non-representatives, called the border-shadow of a representative.

Definition 7 (border-shadow). Let REP be a set of representative objects and ε ∈ R. The border-shadow of a representative r ∈ REP is defined by:

r.bordershadow := {o ∈ REP | dist(o, r) ≤ ε}.

The border-shadow of r contains the set of objects in the (hyper-) sphere around r with radius ε. Obviously, a global value for the size of the representative area for each representative is not appropriate, since it does not reflect the local data distribution. Thus, we define an additional, adaptive set of representative objects, the so-called core-shadow of a representative.

Definition 8 (core-distance). Let REP be a set of representative objects, k ∈ N and r ∈ REP. The core-distance of r is defined by:

r.coredist := 0 if |Nε(r)| < k, and k-nn-dist(r) otherwise.

Definition 9 (core-shadow). Let REP be a set of representative objects and r ∈ REP. The core-shadow of r is defined by:

r.coreshadow := {o ∈ REP | dist(o, r) ≤ r.coredist}.

The core-shadow of r contains the set of objects in the (hyper-) sphere around r with radius k-nn-dist(r). Obviously, this radius adapts to the local density of objects around r.

Both the border-shadow and the core-shadow can be used to define the quality of a representative.

Definition 10 (quality of a representative). Let REP be a set of representative objects and r ∈ REP. The quality of r is defined by:

r.quality := 0 if |Nε(r)| < k, and |Nε(r)| / (1 + k-nn-dist(r)) otherwise.

algorithm cluster_representatives(SetOfObjects DB, Integer k)
  REP_0 := emptySet;
  REP_1 := DB;
  i := 1;

  WHILE REP_i != REP_i-1 DO
    compute new epsilon;
    send := emptySpatialIndex;
    wait := emptySpatialIndex;
    sort REP_i in descending order w.r.t. quality values;
    r := REP_i.removeFirst();

    WHILE r.quality > 0 AND REP_i.size > 0 DO
      IF r.coreshadow does not intersect the core-shadow of any object in send THEN
        send.add(r);
        FOR EACH s IN wait DO
          IF s is in r.bordershadow THEN
            wait.remove(s);
          ENDIF
        ENDDO
      ELSE
        IF r is not in the border-shadow of any object in send THEN
          wait.add(r);
        ENDIF
      ENDIF
      r := REP_i.removeFirst();
    ENDDO

    REP_i+1 := REP_i;
    REP_i+1.addAll(send);
    REP_i+1.addAll(wait);
    i++;
  ENDDO
END. // cluster_representatives

Fig. 10. Pseudo code of the Cluster Representatives algorithm.

The key idea of our hierarchical clustering approach is to find on each level of the hierarchy an optimal (w.r.t. quality) set of representatives such that the representatives have non-overlapping core-shadows (overlapping border-shadows are allowed).
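For feature vectors, Definitions 7–10 translate into a few lines. The sketch below is our own illustration under the assumption of Euclidean distance and a brute-force ε-neighborhood; candidates plays the role of the current representative set and contains r itself.

import numpy as np

def representative_info(r, candidates, eps, k):
    """Border-shadow, core-shadow, core-distance and quality of a candidate
    representative r (cf. Definitions 7-10)."""
    d = np.linalg.norm(candidates - r, axis=1)
    n_eps = np.where(d <= eps)[0]                 # N_eps(r)
    border_shadow = n_eps                         # objects within radius eps of r
    if len(n_eps) < k:
        core_dist, quality = 0.0, 0.0
    else:
        core_dist = np.sort(d)[k - 1]             # k-nn-dist(r), r counts as neighbor 0
        quality = len(n_eps) / (1.0 + core_dist)
    core_shadow = np.where(d <= core_dist)[0]
    return border_shadow, core_shadow, core_dist, quality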

5.3 Algorithm

The general idea of the algorithm is to start with the whole database as the initial set of representatives. In the i-th iteration, we take the set of current representatives REP_i and compute the new set of representatives REP_i+1. To ensure that we get the best representatives, we sort REP_i by descending quality values (cf. Definition 10). Theoretically, we can recursively select the representative r having the highest quality from the sorted REP_i list, add r with its border-shadow to REP_i+1, and remove all further objects from REP_i that are in the core-shadow of r.

However, we have to take care that core-shadows of representatives in REP_i+1 do not overlap. For that purpose, we test for each not yet selected representative r ∈ REP_i whether its core-shadow overlaps the core-shadow of any object in REP_i+1. To support this intersection query efficiently, we can organize the objects in REP_i+1 in a spatial index structure (data structure send). If there is no such overlap, we add r and its border-shadow to REP_i+1 (and send) and remove all objects in the core-shadow of r from REP_i. If the core-shadow of r intersects the core-shadow of any already selected representative in REP_i+1, we have to distinguish two cases. (1) If r is within the border-shadow of any representative in REP_i+1 we can remove r because it is represented by at least one representative. (2) If r is not within the border-shadow of any representative in REP_i+1 we cannot remove r since it is not yet represented. We will have to test at a later time whether r is in the border-shadow of a representative chosen later. For that purpose, we add those points to an additional data structure called wait. Thus, when adding a new representative r to REP_i+1, we have to test whether there are objects in wait which are in the border-shadow of r. If so, we delete those objects from wait. The algorithm terminates if we gain no more new representatives after an iteration, i.e. REP_i+1 = REP_i.

Fig. 11. Sorting of representatives according to intersections of their core-shadows and border-shadows.

In summary, in each iteration, we do the following (cf. the pseudo code in Figure 10):

1. Sort REP_i by descending quality values.
2. As long as there are representatives in REP_i having a quality greater than 0, take and remove the first object r from the sorted REP_i list:
   If r.coreshadow does not intersect the core-shadow of any object in REP_i+1 (cf. object c in Figure 11):
   – add r along with r.coreshadow and r.bordershadow to REP_i+1,
   – add r to send,
   – remove all objects in wait which are in the border-shadow of r (range query against wait).
   Else (i.e. r.coreshadow intersects the core-shadow of at least one object in REP_i+1):
   – If r is not in the border-shadow of any object in REP_i+1, add r to wait (cf. object a in Figure 11).
   – If r is in the border-shadow of any object in REP_i+1, do nothing (r is already removed from REP_i; cf. object b in Figure 11).
3. Add all remaining objects in REP_i and wait to REP_i+1.

5.4 Choice of ε in the i-th Iteration

The quality of a representative r depends not only on the parameter k (specifying the number of neighbors of r which are taken into account for quality computation) but also on the radius ε of the local area around r. A global value for ε is not appropriate since the data space is getting sparser in each iteration. A dynamically adapting value for ε is deemed more appropriate.

If we assume that DB is a set of n feature vectors of dimensionality d (i.e. DB ⊆ R^d) and each attribute value of all o ∈ DB is normalized (i.e. falls into the range [0, MAX] for a specified, fixed MAX), the volume of the data space can be computed by:

Vol(DB) = MAX^d.

If the n objects of DB are uniformly distributed in Vol(DB), we expect one object in the volume Vol(DB)/n and k objects in the volume k · Vol(DB)/n. Thus, the expected k-nearest neighbor distance of any object o ∈ DB is equal to the radius r of a hypersphere having this volume k · Vol(DB)/n. Since the volume of a hypersphere with radius r can be computed by:

V_Sphere(r) = (√(π^d) / Γ(d/2 + 1)) · r^d,

where Γ denotes the well-known Gamma function, we can compute the expected k-nn-distance r of the objects in DB by solving the following equation:

(√(π^d) / Γ(d/2 + 1)) · r^d = k · MAX^d / n.

Simple algebraic transformations yield:

r = MAX · (k · Γ(d/2 + 1) / (n · √(π^d)))^(1/d).

For simplicity, we can also compute:

r = MAX · (k/n)^(1/d).

Fig. 12. Screenshot of CLUSS.

Let us note that this is the correct value of the expected k-nn-distance if we use the L∞-norm instead of the L2-norm.

In the i-th iteration, an appropriate choice for ε is the expected k-nn-distance of the objects in REP_i:

ε = MAX · (k / |REP_i|)^(1/d).

If we further assume MAX = 1 (i.e. all attributes have normalized values in [0, 1]), we have:

ε = (k / |REP_i|)^(1/d).
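With MAX = 1 this adaptive radius is a one-liner; a small sketch (our own function name) is:

def epsilon_for_iteration(num_reps, k, d):
    """Expected k-nn distance (under the L-infinity norm) of |REP_i| uniformly
    distributed, normalized d-dimensional objects: (k / |REP_i|) ** (1/d)."""
    return (k / num_reps) ** (1.0 / d)

# e.g. 10,000 remaining representatives, k = 5, 16-dimensional feature vectors
eps_i = epsilon_for_iteration(10_000, 5, 16)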

5.5 The Extended Prototype CLUSS

We have implemented the proposed ideas in Java and integrated a GUI to visualize the cluster hierarchy and cluster representatives. The resulting prototype is called CLUSS, which uses the newly proposed method for generating the cluster hierarchy and suitable cluster representatives. A sample screenshot of CLUSS is depicted in Figure 12. The hierarchy is now visualized by means of a tree (upper right frame in Figure 12). For clarity, the subtrees of each node of the tree are not visualized by default but can be expanded for browsing by clicking on the corresponding node. The hierarchy tree is also visualized in the frame on the left-hand side of the GUI. The frame on the lower right side in Figure 12 displays the representatives at the nodes that are currently selected.

We performed some sample visual similarity search queries using CLUSS and compared it to the cluster hierarchy created by OPTICS and Gradient Clustering. In fact, it turned out that using CLUSS allows a more accurate interactive similarity search. The hierarchy generated by CLUSS differs from that generated by its comparison partner and is more meaningful. While CLUSS provided better results for navigational similarity search, for applications of visual data mining employing OPTICS is more appropriate, especially if the application calls for clustering solutions that detect clusters of different sizes and shapes.

6 Conclusions

In this chapter, we combined hierarchical clustering and data visualization techniques to allow data analysis, comparison of data models and navigational similarity search. The idea of comparing data spaces using hierarchical density-based clustering is to display connected reachability plots and compare these to reference class models using color coding. Navigational similarity search describes a novel approach to retrieve similar data objects. Instead of posing similarity queries, our new approach extracts a cluster hierarchy from a given set of data objects and uses the resulting cluster tree to navigate between sets of similar objects. To extract a cluster hierarchy from a reachability plot generated by the density-based hierarchical clustering algorithm OPTICS, we introduced a new method for cluster extraction called Gradient Clustering. OPTICS produces a meaningful picture of the density distribution of a given data set and is thus well suited for data analysis. However, in many applications the cluster hierarchy derived from the reachability plot might not provide intuitive access for navigational similarity search. Therefore, we described an alternative hierarchical clustering approach called CLUSS which facilitates interactive similarity search in large collections of data.

References

1. Jagadish, H.V.: “A Retrieval Technique for Similar Shapes”. In: Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’91), Denver, CO. (1991) 208–217
2. Agrawal, R., Faloutsos, C., Swami, A.: “Efficient Similarity Search in Sequence Databases”. In: Proc. 4th Int. Conf. on Foundations of Data Organization and Algorithms (FODO’93), Evanston, IL. Volume 730 of Lecture Notes in Computer Science (LNCS), Springer (1993) 69–84
3. Faloutsos, C., Barber, R., Flickner, M., Hafner, J., et al.: “Efficient and Effective Querying by Image Content”. Journal of Intelligent Information Systems 3 (1994) 231–262
4. Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: “Fast Subsequence Matching in Time-Series Databases”. In: Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’94), Minneapolis, MN. (1994) 419–429
5. Agrawal, R., Lin, K.I., Sawhney, H.S., Shim, K.: “Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases”. In: Proc. 21st Int. Conf. on Very Large Databases (VLDB’95). (1995) 490–501
6. Berchtold, S., Keim, D.A., Kriegel, H.P.: “Using Extended Feature Objects for Partial Similarity Retrieval”. VLDB Journal 6 (1997) 333–348
7. Berchtold, S., Kriegel, H.P.: “S3: Similarity Search in CAD Database Systems”. In: Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’97), Tucson, AZ. (1997) 564–567
8. Keim, D.A.: “Efficient Geometry-based Similarity Search of 3D Spatial Databases”. In: Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’99), Philadelphia, PA. (1999) 419–430
9. Kriegel, H.P., Kröger, P., Mashael, Z., Pfeifle, M., Pötke, M., Seidl, T.: “Effective Similarity Search on Voxelized CAD Objects”. In: Proc. 8th Int. Conf. on Database Systems for Advanced Applications (DASFAA’03), Kyoto, Japan. (2003) 27–36
10. Kriegel, H.P., Brecheisen, S., Kröger, P., Pfeifle, M., Schubert, M.: “Using Sets of Feature Vectors for Similarity Search on Voxelized CAD Objects”. In: Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’03), San Diego, CA. (2003) 587–598
11. Ankerst, M., Breunig, M.M., Kriegel, H.P., Sander, J.: “OPTICS: Ordering Points to Identify the Clustering Structure”. In: Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’99), Philadelphia, PA. (1999) 49–60
12. McQueen, J.: “Some Methods for Classification and Analysis of Multivariate Observations”. In: 5th Berkeley Symp. Math. Statist. Prob. Volume 1. (1967) 281–297
13. Breunig, M.M., Kriegel, H.P., Kröger, P., Sander, J.: “Data Bubbles: Quality Preserving Performance Boosting for Hierarchical Clustering”. In: Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’01), Santa Barbara, CA. (2001) 79–90
14. Achtert, E., Böhm, C., Kriegel, H.P., Kröger, P.: “Online Hierarchical Clustering in a Data Warehouse Environment”. In: Proc. 5th IEEE Int. Conf. on Data Mining (ICDM’05), Houston, TX. (2005) 10–17
15. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”. In: Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD’96), Portland, OR, AAAI Press (1996) 291–316
16. Sander, J., Qin, X., Lu, Z., Niu, N., Kovarsky, A.: “Automatic Extraction of Clusters from Hierarchical Clustering Representations”. In: Proc. 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2003), Seoul, Korea. (2003) 75–87
17. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T., Weissig, H., Shindyalov, I.N., Bourne, P.E.: “The Protein Data Bank”. Nucleic Acids Research 28 (2000) 235–242