
Interactive Clustering for Data Exploration

Joel R. Brandt∗ Jiayi Chong† Sean Rosenbaum‡

Stanford University

Figure 1: A complete view of our system. In the top left, the Solution Explorer is shown. Below this is the Member Table, and to the right are the visualizations of two solutions.

ABSTRACT

Clustering algorithms are widely used in data analysis systems. However, these systems are largely static in nature. There may be interaction with the resulting visualization of the clustering, but there is rarely interaction with the process. Here, we describe a system for visual data exploration using clustering. The system makes the exploration and understanding of moderately-large (500-1000 instances) multidimensional (10-20 dimensions) data sets easier.

CR Categories: H.5.0 [Information Systems]: Information Interfaces and Presentation—General; I.5.3 [Computing Methodologies]: Pattern Recognition—Clustering

Keywords: clustering, data exploration, interaction

∗e-mail: [email protected]  †e-mail: [email protected]  ‡e-mail: [email protected]

1 INTRODUCTION

Clustering is a natural part of the human cognitive process. When shown a set of objects, an individual naturally groups and organizes these items within his or her mind. In the domain of machine learning, unsupervised clustering algorithms have been developed to mimic this intrinsic process. Yet these algorithms are usually employed within rigid frameworks, removing the fluid, exploratory nature of human clustering. In this paper, we present a visual system that uses clustering algorithms to aid data exploration in a natural, fluid way.

1.1 Motivation

Clustering techniques are widely used to analyze large, multi-dimensional data sets [1, 2]. However, this use is typically static in nature: the user loads a data set, selects a few parameters, runs a clustering algorithm, and then views the results. The process then stops here; clustering is used simply to analyze the data, not to explore it.

We believe that with the right visualization environment, clustering can be used to provide a very natural way for users to explore complex data sets. For example, when a user is given a small, low-dimensional data set to explore (such as a collection of objects on a table), a typical individual intuitively groups, or clusters, similar items mentally. The individual may then compare clusters, break up individual groups by different attributes, completely re-cluster the set based on different attributes, and so on. Without aid, however, both the size of the data set and the types of attributes that an individual can operate on are quite limited. Our system allows the user to perform these intrinsic operations within a much larger space.

1.2 Major Contributions

Our system makes contributions in two main areas: data exploration techniques and visualization. More specifically, our system

• provides an intuitive mechanism for visually exploring moderately-large multi-dimensional data sets,

• supports a fluid, iterative process of clustering, refinement, and re-clustering to enable this exploration,

• and proposes a novel, faithful visualization of high-level characteristics of the clustering results.

1.3 Organization

The rest of this paper proceeds as follows. In Section 2 we begin with an analysis of prior work in this area. We then detail our data exploration and visualization contributions in Sections 3 and 4, respectively. In Section 5, we give a complete example of the system in use. Finally, we conclude in Section 6 with a plan for future work.

2 PRIOR WORK

A great deal of prior work has been done in the areas of visualizing clustering results and interacting with these visualizations. A relatively smaller amount of work has been done in the field of interacting with the clustering process. We examine each of these areas in turn, and then consider some related work that leverages techniques other than clustering.

2.1 Visualization of Clustering Results

Given the complex output of clustering algorithms, good visualizations are necessary for users to interpret the results accurately. Visualizations generally display either micro-level or macro-level characteristics of the clustering.

Micro-level visualizations show explicitly the pairwise relations between instances. Conversely, macro-level visualizations attempt to express the quality of the clustering result, such as the size and compactness of each cluster and the separation of clusters relative to each other. We believe that understanding macro-level characteristics is most useful for data exploration, whereas interpreting micro-level characteristics lies in the domain of data analysis. This distinction is described in detail in Section 3.1. Here, we will examine existing visualizations for each class of characteristics separately.

2.1.1 Micro-level Visualizations

Many micro-level visualizations begin by projecting the clusters into 2-dimensional or 3-dimensional space [5, 8]. The projection is chosen, as much as is possible, such that nearby items lie in the same cluster, and nearby clusters are similar. However, 3-dimensional visualizations are often unintuitive due to occlusion and depth-cuing problems. Likewise, 2-dimensional projections are often problematic because of the issues associated with accurately projecting high-dimensional data into a low-dimensional space. Lighthouse [5] takes an interesting approach by allowing the user to switch between 2- and 3-dimensional views. The usability of this feature, however, is not well studied.

A colored matrix representation is another widely used method for visualizing clustering results [1, 2, 9, 10]. In this representation, instances lie on one axis, and features lie on the other. Each cell (corresponding to an instance/feature pair) is colored according to the value of that feature for that instance. The features are ordered by a hierarchical clustering method so that like rows appear next to each other. gCLUTO [9] takes this a step further by allowing the user to also cluster the transpose of the data set, and sort the features by similarity.

Alternatively, this same colored matrix representation is often used to express pairwise distances between all instances. In this visualization, each instance lies on both axes. Each cell is colored according to the relative distance between the two instances represented. Hierarchical clustering methods are used to produce an ordering on the axes, so that the majority of cells corresponding to nearby points lie near the diagonal.
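
As a concrete illustration (not part of the original systems), the following sketch shows how such a pairwise-distance matrix might be ordered with an off-the-shelf hierarchical clustering; the data and parameter choices here are placeholders.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, leaves_list

# Toy stand-in for a real data set: 50 instances, 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))

# Pairwise Euclidean distances between all instances (instances on both axes).
D = squareform(pdist(X, metric="euclidean"))

# Order instances with average-linkage hierarchical clustering so that similar
# instances end up adjacent, pulling small distances toward the diagonal.
order = leaves_list(linkage(pdist(X), method="average"))
D_ordered = D[np.ix_(order, order)]

# D_ordered can now be rendered as a colored matrix (e.g., with matplotlib's imshow).
```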

2.1.2 Macro-level Visualizations

A relatively smaller amount of effort has been devoted to producing compelling macro-level visualizations. gCLUTO presents a "Mountain" visualization technique. The centroid of each cluster is projected into the plane as mentioned above. Then, at each centroid, a mountain is formed. Attributes of the mountain are mapped to attributes of the clusters: the height is mapped to the internal similarity of the cluster, the volume is mapped to the number of objects in the cluster, and the color of the peak is mapped to the internal standard deviation of the cluster's objects. While these are all important attributes to consider, the method of displaying them is arguably a bit unintuitive.

2.2 Interaction with Clustering

Work on interaction with clustering algorithms is best divided into two categories: interaction with the result set and interaction with the clustering process.

2.2.1 Interaction with the Result Set

Most commonly, systems provide a means of interacting with the result set. Because these result sets are often too large to be represented in their entirety on a typical display, these interactions usually center around hiding data.

When hierarchical clustering is performed, visualization tools often provide a means for collapsing portions of the hierarchy. The Hierarchical Clustering Explorer [10] supports this through dynamic queries, and gCLUTO [9] supports this through a typical expandable tree structure. While these methods are effective for reducing screen clutter, little semantic meaning is tied to the directional branching of the tree, so it can be difficult to select only the regions of interest.

Domain-specific methods of interacting with the result set are also common [5, 8]. For example, Lighthouse produces a visualization of web search results using clustering, and then allows the user to select a point in the representation to visit the corresponding site. Such domain-specific interactions are of little interest in this work.

Finally, many clustering visualizations use detail-on-demand techniques [5, 9, 10]. Positioning a cursor over a particular data point, for example, often brings up a small window with metadata about the corresponding instance. Such techniques are necessary because of the large amount of data being displayed.

Page 3: Interactive Clustering for Data ExplorationClustering is a natural part of the human cognitive process. When shown a set of objects, an individual naturally groups and orga-nizes these

Figure 2: The data exploration pipeline for our system (stages: Visualize Solutions, Subset Data, Generate Sub-problem, Cluster).

2.2.2 Interaction with the Clustering Process

Systems that allow interaction with the clustering process are somewhat more rare. Many systems let users define initial cluster centroids in a visual way, rather than choosing them randomly. The value of such a system, however, is unclear: if it is easy for the user to select centroids, it is probably unnecessary to cluster the data!

Some systems go a bit further and allow the user to interact with the clustering process as it is occurring. For example, Looney's work [6] allows the user to eliminate or merge clusters at various steps in the algorithm. While this work takes strides to solve some of the major problems with clustering, it requires that the user understand the data set in order to produce a result. We seek exactly the opposite paradigm: the user iteratively produces results in order to understand the data set.

2.2.3 Alternatives to Clustering

Self-organizing maps are an unsupervised learning technique built on neural networks. They have been widely explored as a tool to aid both visualization and data exploration [3, 4]. They are typically employed to aid in the production of low-dimensional visualizations of high-dimensional data.

The benefits of self-organizing maps are somewhat contrary to the goals of this work: they automatically reduce the dimensionality of the data, while providing little evidence of why a particular projection was chosen. Instead, we enable the user to explore and re-weight dimensions at his or her discretion, helping the user to understand links between these dimensions.

3 DATA EXPLORATION

The principal goal of our system is to enable intuitive data exploration of moderately-large, multi-dimensional data sets. In this paper, we center our data exploration process around k-means, a straightforward clustering algorithm [7]. However, we believe our data exploration pipeline is applicable when coupling user interaction with any automated technique.

We begin by explaining the difference between data exploration and data analysis. Then, we discuss the details of our data exploration pipeline.

3.1 Data Exploration versus Data Analysis

The distinction between data analysis and data exploration seems subtle at first. Most simply, in data analysis, the user knows what he or she is looking for; in data exploration, the user does not.

Data analysis tasks typically investigate specific data instances, and their relation to other instances. The analyst usually has a deep understanding of the structure of the data set he or she is working with. That is, the relations between attributes are typically well understood, or at least the characteristics of a particular attribute are well known. The examination of a gene array clustering, for example, is a typical data analysis task [1, 2]. The analyst knows what each gene is, and what each experiment is, and is attempting to determine which genes respond in similar ways to particular experiments. Such a task is completely static: a clustering is produced, visualized, and analyzed.

Data exploration tasks are those which attempt to uncover the general structure of the data. Here, the user may not know which attributes best separate or explain the data, may not know the relationships between attributes, and may not even know which attributes are useful. However, the user is likely to have domain knowledge about the data set being explored. For example, the user may have high-level knowledge about instances in the data set, and may be interested in determining which attributes are most useful in predicting or explaining that knowledge.

Data exploration is an iterative process of discovery. As such, tools for data exploration must support this iterative search. Specifically, we believe tools for data exploration must make it easy for the user to

• explore the data along multiple paths,

• create branches in those exploration paths,

• and compare various exploration paths.

3.2 The Data Exploration Pipeline

The use of our system centers around our Data Exploration Pipeline, shown in Figure 2.

The user explores the data by iterating through this pipeline. After loading the data set, the user is presented with a visualization of a solution to a trivial clustering problem: clustering all of the data into one cluster. From this visualization, the exploration begins:

1. The user explores a solution, visualizing it through the techniques described in Section 4.

2. The user selects a subset of the data to continue exploring. This subset may, of course, be the entire set.

3. The user generates a sub-problem using this subset of the data. This involves choosing the values of several parameters, such as the number of clusters to form, which attributes to use when clustering, and the relative weights of each of those attributes (a code sketch of one pass through this loop appears after this list).

4. The clustering is performed and the sub-solution is stored.

5. The process repeats using the new sub-solution.
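
As an illustration only, the sketch below runs one pass through steps 2-4 with scikit-learn's k-means; the subset mask, attribute indices, weights, and cluster count are hypothetical stand-ins for the choices a user would make interactively, not our actual implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_subproblem(X, subset_mask, attribute_idx, weights, k, seed=0):
    """One pass through the pipeline: subset the data, weight the chosen
    attributes, and cluster the result into k clusters.

    X             : (n_instances, n_features) data matrix
    subset_mask   : boolean mask selecting the instances to keep
    attribute_idx : indices of the attributes chosen for this sub-problem
    weights       : per-attribute weights (same length as attribute_idx)
    k             : number of clusters requested by the user
    """
    subset = X[subset_mask][:, attribute_idx] * np.asarray(weights)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(subset)
    return km.labels_, km.cluster_centers_

# Hypothetical usage: refine the trivial one-cluster solution over all data,
# clustering on two attributes with unequal weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(800, 15))                   # moderately-large, multi-dimensional
labels, centers = cluster_subproblem(
    X,
    subset_mask=np.ones(len(X), dtype=bool),     # "subset" = the entire set
    attribute_idx=[0, 3],                        # attributes chosen for clustering
    weights=[1.0, 0.5],                          # relative attribute weights
    k=5,
)
```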

Of course, the user has more control over the pipeline than what is given here. For example, the user can generate several different sub-problems from any solution, varying the parameters (and even the sets) in each sub-problem. As a result of this flexibility, the pipeline results in a hierarchy of clustering solutions, where each sub-solution is a refinement of a subset of its parent solution.

Figure 3: A view of an individual cluster. The centroid is shown in red. The currently selected point is shown in blue. Points that lie close to the selected point (in high-dimensional space) are shown in gray.

Furthermore, as will be discussed in Section 4, we allow the user to open visualizations of as many solutions as desired, and link the display of these solutions so that similarities and differences can easily be seen. In this way, it is easy for the user to explore the effects of clustering using different features and parameters.

When each solution is generated, we keep track of the parameters used in the clustering, as well as the parameters defining the subset of instances to be clustered. With this information, if new data is added to the system (for example, if we want to classify additional instances), we can place the new instances in the appropriate clusters in all solutions within the system.
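
A minimal sketch of this bookkeeping, under the assumption that each stored solution records a subset predicate, the attribute indices and weights it used, and its resulting centroids (the record format and names below are illustrative, not the system's actual data structures):

```python
import numpy as np

def place_new_instance(x, solutions):
    """Assign a new instance to the nearest centroid of every stored solution
    whose subset predicate it satisfies.

    x         : 1-D feature vector for the new instance
    solutions : list of dicts with keys 'predicate', 'attribute_idx',
                'weights', 'centroids' (hypothetical record format)
    """
    assignments = {}
    for sol_id, sol in enumerate(solutions):
        if not sol["predicate"](x):          # instance falls outside this sub-problem
            continue
        v = x[sol["attribute_idx"]] * np.asarray(sol["weights"])
        dists = np.linalg.norm(sol["centroids"] - v, axis=1)
        assignments[sol_id] = int(np.argmin(dists))
    return assignments
```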

4 VISUALIZATION TECHNIQUES

In this section, we examine the visualization techniques used to support the data exploration pipeline discussed in Section 3. We devote the majority of our attention to the techniques used to visualize a particular solution, and to compare several solutions, as this is the novel portion of our system. However, in Section 4.2.1, we discuss the interfaces for managing a hierarchy of solutions and for generating new solutions.

The visualization techniques presented here have been developed with the goals of effectively representing macro-level characteristics of clustering results and enabling intuitive comparison of multiple clustering solutions. These are the techniques required for data exploration.

While we provide some drill-down into the micro-level characteristics of a solution as a part of our brushing and highlighting techniques, these characteristics are not our primary concern. Investigation of micro-level characteristics lies mainly in the domain of data analysis rather than data exploration. We believe that much of the prior work discussed in Section 2 accomplishes the data analysis task successfully. So, in a complete system for both data exploration and analysis, we propose marrying our new techniques with extensions of existing analysis techniques. This is discussed further in our section on future work (Section 6.1).

4.1 Small Multiple Histograms for Cluster Visualization

In the simplest sense, we visualize a clustering solution as a collection of histograms. Each cluster is represented by a histogram, as shown in Figure 3. The centroid of the cluster is placed on the left of the histogram, and the instances are arranged according to their Euclidean distance from the centroid. All of the histogram axes are scaled the same within one clustering solution. Complete sets of small multiples for a solution can be seen in Figures 1 and 7.
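
One way such a layout could be computed, assuming cluster labels and centroids are already available (for example, from the k-means step above); the bin count and variable names are arbitrary:

```python
import numpy as np

def cluster_histograms(X, labels, centroids, n_bins=30):
    """Bin each cluster's instances by Euclidean distance to its centroid,
    using one shared set of bin edges so that every histogram in a
    clustering solution is drawn on the same scale."""
    dists = {c: np.linalg.norm(X[labels == c] - centroids[c], axis=1)
             for c in range(len(centroids))}
    max_dist = max(d.max() for d in dists.values() if d.size)
    edges = np.linspace(0.0, max_dist, n_bins + 1)   # shared axis scaling
    counts = {c: np.histogram(d, bins=edges)[0] for c, d in dists.items()}
    return counts, edges
```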

Figure 4: Four views of the same centroid histogram. In each histogram, the centroid indicated is used as the basis.

4.1.1 Centroid Histogram

Furthermore, we produce a histogram layout of the centroids that mimics the individual cluster histograms. The user is able to select the centroid to serve as the basis, and the other centroids are placed in the histogram according to their Euclidean distance from the basis centroid. (This rocking of the basis element is discussed further in Section 4.1.3.) Examples of this visualization can be seen in Figure 4.

Together, these histograms provide an intuitive summary of macro-level cluster characteristics. Cluster size and distribution can easily be seen within one cluster histogram. Comparing cluster histograms gives an understanding of the relative compactness of each cluster. Finally, the centroid histogram summarizes the inter-cluster separation.

4.1.2 One Dimension versus Two Dimensions

Our histograms can be thought of as a one-dimensional projection of the data. At first consideration, it may seem that projecting into one dimension (instead of two or three) gives up a great deal of flexibility. However, we believe that when combined with the decoupling of inter- and intra-cluster characteristics mentioned above, our histograms lead to a more faithful representation of the macro-level characteristics than would be possible in a typical two-dimensional projection.

Consider a projection of all instances into two dimensions using a technique such as multi-dimensional scaling. In such a technique, one attempts to find the projection that preserves the pairwise distances between instances to the greatest extent possible. In most cases, the actual distance between two points will be well reflected by their distance in projected space. However, for some pairs of instances, such results may be impossible to achieve. It may be that the best projection overall still places a significant number of points close to other points that are actually far apart. Such misrepresentations make understanding macro-level characteristics of the clustering result more difficult using this representation.

Furthermore, even without these misrepresentations, the decoupling of inter- and intra-cluster characteristics in our method is not easily achievable in a two-dimensional projection of all data. Instead, the user must segment the space mentally to perform such comparisons. Our decoupling allows the user to more easily examine the characteristics he or she is concerned about, without having to block out additional information.

As has been mentioned in Section 2, other two-dimensional techniques, such as colored matrix representations, exist for expressing cluster results. These techniques, however, are more suited toward micro-level examination, which is outside the scope of this work.

4.1.3 Rocking

Rocking is a technique often used to solve depth-cuing problems when visualizing three-dimensional point data in two dimensions. If the point set is rotated slightly, the necessary depth cue is provided: points in front move one direction while points in back move the opposite direction.

We borrow this idea of rocking to improve our histogram displays. In our histograms, it is often the case that two distant points will end up nearly the same distance from the centroid, and thus in the same place on the histogram. (Note that this is not a misrepresentation; we make no claim that distant points will end up far apart in the histogram.) We allow the user to select two points, one of which may be the centroid. After doing so, the first point is used as the basis for computing distances to build the histogram. A slider can be used to rock the basis point along the line between the first and second points. As before, points that move left are closer to the second point than the first, and conversely for points that move right. An example of this is shown in Figure 5.
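
The following sketch shows one plausible way the rocked distances could be computed, assuming the two selected points and a slider value t in [0, 1]; it is an interpretation of the description above, not code from the system:

```python
import numpy as np

def rocked_distances(X, p_first, p_second, t):
    """Slide the basis point along the line between the two selected points
    (t = 0 is the first point, t = 1 is the second) and recompute the
    distances that position each instance along the histogram axis."""
    basis = (1.0 - t) * p_first + t * p_second
    return np.linalg.norm(X - basis, axis=1)

# As t grows, instances closer to the second point than to the first drift
# left (smaller distances), and instances closer to the first drift right.
```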

As mentioned earlier, we allow a similar type of rocking within the centroid display. We allow the user to select a basis centroid by clicking. When the user changes basis centroids, new positions for each centroid in the histogram are computed, and the movement between old and new positions is animated. As before, the movements express the relative locations of other centroids as compared to the two bases.

Figure 4 shows all possible rockings of a single centroid histogram. From these views, it is clear that clusters 2 and 3 are located quite close to each other, whereas 1 and 4 are both far from each other and far from 2 and 3.

4.2 Data Subsetting

A crucial part of the data exploration pipeline is the subsetting and reprocessing of data. We support data subsetting through dynamic queries as shown in Figure 6. The user makes a range selection simply by dragging the range selection brackets so that they enclose only the points of interest.

Note that this range selection may be made with any point selected as the basis for distance calculation (and even when the basis point is being rocked). This allows the user to select points that are close to (or far from) any point, rather than making selections with respect to the centroid only.
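
A minimal sketch of such a range selection, assuming per-instance distances have already been computed from whichever basis point (possibly rocked) is active; the bracket positions below are placeholders:

```python
import numpy as np

def range_select(dists, lo, hi):
    """Boolean mask over the instances whose distance from the current basis
    falls between the two selection brackets."""
    return (dists >= lo) & (dists <= hi)

# Example: keep only the instances in the nearer half of the histogram.
# mask = range_select(dists, 0.0, 0.5 * dists.max())
```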

4.2.1 Generating Solution Hierarchies

Once the desired subset is selected within each cluster histogram, a new sub-solution may be generated. The Sub-Solution Generator window (not shown) presents a simple user interface to select the attributes for clustering, assign weights to those attributes, choose the number of clusters to be formed, and initiate the clustering.

When a new clustering solution is generated, it is placed in the Solution Explorer (shown in Figure 1) as a child of its parent solution. Any number of child and grandchild solutions may be created from any parent solution. In this way, the user is able to traverse multiple exploration paths, branching as desired. The Solution Explorer keeps each solution organized, providing a summary of the attributes used to generate the clustering.

Figure 5: Rocking of a cluster. The two points used to compute the rocking line are shown in yellow. The amount of rocking is controlled by moving the slider.

Figure 6: A dynamic query within a cluster histogram, used for both data subsetting and range selection in view linking.

4.3 Brushing and View Linking

With the generation of multiple solutions comes the need to explore them in concert. We support this exploration through brushing and view linking.

As shown in Figure 3, brushing is used to highlight instances near a selected instance (in high-dimensional space). When a user hovers over an instance, it is highlighted in blue. Additionally, all nearby instances are colored gray. Note that this highlighting is somewhat akin to rocking. Only the instances which are actually near the instance of interest are highlighted. In addition to highlighting close-by instances within the cluster, we also highlight instances corresponding to the selected instance in all other views.

Similarly, we highlight all instances contained within the active region of the currently selected cluster in all solution visualizations. (The currently selected cluster is defined by the chosen basis centroid in the currently focused window.) This highlighting is shown in Figure 7.
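
One way this brushing and cross-view highlighting could be implemented, assuming every solution keeps the original (global) indices of the instances it clusters; the function names and record layout are hypothetical:

```python
import numpy as np

def brush_neighbors(X, hovered_idx, radius):
    """Indices of instances within `radius` of the hovered instance,
    measured in the original high-dimensional space."""
    d = np.linalg.norm(X - X[hovered_idx], axis=1)
    return np.flatnonzero(d <= radius)

def link_selection(selected_global_ids, other_solution_global_ids):
    """Map a selection made in one solution onto another solution's view
    by intersecting the global instance ids the two solutions share."""
    return np.intersect1d(selected_global_ids, other_solution_global_ids)
```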

View linking allows the user to easily visualize the relative consistency of clustering between different solutions using different attributes and weights. In this way, relations between attributes can be easily found and understood, helping the user uncover the structure of the data, the ultimate goal of data exploration.

4.4 Micro-level Data Examination

A minimal amount of support is provided within the system for micro-level data examination. The member table, shown in Figure 1, lists all instances in the currently selected cluster. Brushing is supported between the member table and solution visualization: the selected instance in the visualization is highlighted in the member table, and likewise, a selected instance in the member table is highlighted in the solution visualization. Finally, if a range selection is made within the active cluster, only the selected instances are shown in the member table.

Support for micro-level data examination could be greatly enhanced by building upon much of the prior work mentioned in Section 2.1.1. We discuss our plans for this further in Section 6.1. Nonetheless, we believe that the linking of brushing between macro- and micro-level visualizations presented here would prove to be a very useful feature regardless of the micro-level visualization used.

5 EXAMPLE USE

In this section, we present a brief example of one possible use of our system. Consider a network administrator who is attempting to locate machines that are behaving atypically. The administrator believes that the usage patterns of most machines stay consistent from month to month, and that a change in behavior might be an indication of an intrusion or other exploit. However, she has a large number of traffic metrics available to her, and is not sure which of these metrics best express a machine's usage pattern.

The network administrator begins her data exploration by compiling a variety of traffic metrics for each month for each machine. Each of these values becomes an attribute. Each instance (a machine) has a group of attributes for each month of data, resulting in n · m attributes, where n is the number of traffic metrics and m is the number of months. All of this data is loaded into the system.

She first decides to cluster the entire dataset into 5 clusters using all of the first month's attributes. To do this, she opens the initial solution (a clustering of everything into one cluster) present in the Solution Explorer. She does not need to subset the data, so she simply uses the Sub-Problem Generator to define her clustering problem: she selects the attributes of interest and chooses 5 clusters.

After the clustering is completed, she visualizes the new solution. She observes that two clusters contain most of the instances. She quickly hovers over the few instances that lie in the other three clusters. She discovers that all of these machines are servers of one sort or another. Since she watches these machines quite closely using other tools, she decides to exclude them from her exploration. She makes an empty range selection in each of these clusters, leaves the entire range selected in the two dense clusters, and opens up the Sub-Problem Generator again.

She decides now to try to confirm her theory that the behavior of most machines does not change from month to month. Using the subset of machines selected, she produces two new sub-solutions: one using the first month's attributes, and one using the second month's attributes. She opens up both solutions, and utilizes the view linking features to explore their similarity. This exploration is shown in Figure 7. She selects each cluster in the first month's visualization in turn. As she does so, the corresponding instances in the second month's visualization are highlighted. As is shown in Figure 7, the clusters stay relatively consistent. She quickly examines those instances that change clusters. For some, she easily observes that the machines are outliers in both clustering solutions. This suggests that the attributes or weights in use may not be optimal. For others, she uses her domain knowledge to explain the differences. For example, perhaps one of the instances is a machine that was added in the middle of the first month. For a few others, she decides to do further investigation.

From this point, her data exploration process could go in any number of directions. She could produce several clustering solutions for the same month using different attributes to explore links between the attributes. She could produce clustering solutions of varying sizes for the same attributes, giving her a clearer picture of how many types of machines she really has. The system easily supports these and many other tasks. Furthermore, once she has determined the set of attributes and weights that best characterizes her data, she can carry this information over into the data analysis domain and use these same clustering techniques in her daily network monitoring.

6 CONCLUSION

We have presented a visualization system that harnesses clustering algorithms to make exploration of moderately-large, high-dimensional data sets more intuitive. An iterative process of visualization, query refinement, and re-processing was presented that we believe accurately represents the ideal data exploration process. Furthermore, we proposed a novel method for visualizing macro-level clustering characteristics. Finally, we showed an example use of our system.

6.1 Future Work

We believe that a complete solution would provide means for both data exploration and data analysis. In this work, we have only explored the domain of data exploration. We are interested in augmenting the visualizations presented here with adaptations of some of the techniques presented in Section 2 to produce such a complete system.

We also plan to characterize more clearly the trade-offs between existing two-dimensional visualizations and the one-dimensional approach presented here when visualizing macro-level characteristics of clustering results. Furthermore, we plan to explore ways to afford the user more control over the visualizations produced using our techniques. For example, the user could control the distance metric used to perform the layout separately from the weights used in clustering. This would allow exploration of the tightness of various dimensions. Similarly, we would like to investigate ways to make rocking more general and intuitive.

Figure 7: An example of view linking. In each group, the same two solutions are shown. The cluster indicated at the bottom of each group is selected in the left visualization. Instances in the right visualization are colored blue if they are members of the cluster selected on the left.

We also plan to explore the affinity of other clustering techniques (as well as other non-clustering-based machine learning techniques) to our data exploration framework.

Ultimately, an in-depth user study of an extended version of this system would be quite valuable. This is the only way to accurately judge the usefulness and applicability of these techniques.

7 ACKNOWLEDGEMENTS

We would like to thank Alexis Battle, Dan Ramage, and Ling Xiao for their helpful insights. We would also like to thank Pat Hanrahan and all of the Winter 2006 CS448b class for their useful comments.

REFERENCES

[1] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. In Proceedings of the National Academy of Sciences, volume 96, pages 6745–6750, 1999.

[2] Amir Ben-Dor, Ron Shamir, and Zohar Yakhini. Clustering gene expression patterns. Journal of Computational Biology, 6(3/4):281–297, 1999.

[3] Arthur Flexer. On the use of self-organizing maps for clustering and visualization. In Principles of Data Mining and Knowledge Discovery, pages 80–88, 1999.

[4] M. Y. Kiang, U. R. Kulkarni, and Y. T. Kar. Self-organizing map network as an interactive clustering tool — an application to group technology. Decision Support Systems, 15(4):351–374, December 1995.

[5] Anton Leuski and James Allan. Lighthouse: Showing the way to relevant information. In INFOVIS, pages 125–130, 2000.

[6] Carl G. Looney. Interactive clustering and merging with a new fuzzy expected value. Pattern Recognition, 35(11):2413–2423, November 2002.

[7] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297, 1967.

[8] Sougata Mukherjea, James D. Foley, and Scott E. Hudson. Interactive clustering for navigating in hypermedia systems. In ECHT '94: Proceedings of the 1994 ACM European Conference on Hypermedia Technology, pages 136–145, New York, NY, 1994. ACM Press.

[9] Matt Rasmussen and George Karypis. gCLUTO: An interactive clustering, visualization, and analysis system. Technical Report TR-04-021, University of Minnesota, 2004.

[10] Jinwook Seo and Ben Shneiderman. Interactively exploring hierarchical clustering results. IEEE Computer, 35(7):80–86, July 2002.