visual clutter reduction through hierarchy-based...

Visual Clutter Reduction through Hierarchy-based Projection ofHigh-dimensional Labeled Data

Dominik Herr1,2 Qi Han1 Steffen Lohmann1 Thomas Ertl1

1Institute for Visualization and Interactive Systems (VIS)∗2Graduate School of Excellence advanced Manufacturing Engineering (GSaME)

University of Stuttgart, Universitatsstraße 38, 70569 Stuttgart, Germany

Figure 1: Our approach reduces visual clutter of projected labels by showing the user the relatedness of the labels on increasing levels ofdetail. A co-occurrence based hierarchy is used to generate the levels of detail. Here, the cluster asp.net, highlighted with a green background,and its related clusters, marked in with a slightly lighter green, are shown in increasing levels of detail. To follow the cluster more easily, wetrace it with two lines in this depiction.

ABSTRACT

Visualizing high-dimensional labeled data on a two-dimensionalplane can quickly result in visual clutter and information overload.To address this problem, the data usually needs to be structured, sothat only parts of it are displayed at a time. We present a hierarchy-based approach that projects labeled data on different levels of de-tail on a two-dimensional plane, whilst keeping the user’s cognitiveload between the level changes as low as possible. The approachconsists of three steps: First, the data is hierarchically clustered;second, the user can determine levels of detail; third, the levelsof detail are visualized one at a time on a two-dimensional plane.Animations make transitions between the levels of detail traceable,while the exploration on each level is supported by several interac-tion techniques. We demonstrate the applicability and usefulness ofthe approach with use cases from the patent domain and a question-and-answer website.

Index Terms: H.5.2 [Information Interfaces and Presentation]:User Interfaces—GUI; H.3.3 [Information Storage and Retrieval]:Information Search and Retrieval—Clustering

∗{firstname.lastname}@vis.uni-stuttgart.de

1 INTRODUCTION

The abundance of data produced nowadays is often too much toshow all at once and its structure regarding content and relation isoften either unavailable or unknown. A common practice to labelthe content is using keywords or tags that originate from predefinedmetadata, a classification structure, or from extracted content. In ad-dition, the overview of the relation between data is often simplifiedby clustering the data and then showing the clusters’ relations. Thelabels’ relations span a high-dimensional space that is based on theco-occurrence of data objects’ labels. To visually analyze the labels’coherence, the labels may be shown on a two-dimensional plane, forexample through projection. Yet, the two-dimensional visualizationof many labels leads to the problem of visual clutter. This visualclutter combined with a possibly complex relationship measure cancause heavy cognitive load for an analyzing user and hinders theefficiency of the analysis. One common approach to ease this prob-lem is to introduce a hierarchy to the labels, wherein a child nodespecifies the description of its parent node.

However, the visualization of such hierarchies usually focuses onthe relationship between the parent and its children. Apart from theintuition that siblings with the same parent are closer related to eachother than other data objects on the same hierarchy level, the child-child relationship remains hidden in such visualization approaches.

We present an approach that aims to reduce the visual clut-ter of two-dimensional projections of a large number of labelsthrough a smooth transition between overview and detail [31]. Thisis achieved by hierarchizing the labels based on a similarity ma-trix derived from the co-occurrence of the labels. Then we show

109

Graphics Interface Conference 20161-3 June, Victoria, British Columbia, CanadaCopyright held by authors. Permission granted to CHCCS/SCDHM to publish in print and digital form, and ACM to publish electronically.

first highly aggregated clusters and then provide more details as theclusters are split apart. Also, our approach represents the child-childrelationship of the generated hierarchy at a given level of detailthrough the projection of the relevant labels onto a two-dimensionalplane. To do so, the approach is divided into three steps:

1. A hierarchy is generated from a given similarity matrix. Here,this matrix is based on the co-occurrence of the labels.

2. Based on the extracted hierarchy, varying levels of detail aredetermined for the information shown to the user; the user canchange those levels at will.

3. The information contained in the topmost level of detail isprojected onto a two-dimensional plane, and the user can ex-plore the shown data or increase the level of detail.

2 RELATED WORK

The following discussion of related work is divided into four parts.First, we examine works from the field of information visualiza-tion that use dimension reduction to visualize data spatially. Af-terwards, we review visualization methods for hierarchical datasetsand related interaction techniques. Then, we show visualization ap-proaches for hierarchy-based data. At last, we present recent workin the area of word cloud visualization that are related to our work.

2.1 Dimension Reduction and Data ProjectionOne common approach to visualize the content in datasets, forexample document collections, is to take extracted features, suchas keywords or concepts, and compare the data based on thosefeatures. The natural language processing community developedmany ways to extract keywords from documents [15]. The relat-edness of such features is usually pairwise computed and there-fore presents a high-dimensional reduction problem. Methods tosolve this problem can be projection-based, for instance, by us-ing Principal Component Analysis (PCA) [38], MultidimensionalScaling (MDS) [23], Least Squares Projection [27], or t-DistributedStochastic Neighbor Embedding (t-SNE) [35]. Van der Maaten etal. [36] survey a number of techniques to project high-dimensionaldata onto a low-dimensional space. In such approaches, data is usu-ally visualized as data points that take up almost no space. This rep-resentation suffers from visual clutter when the data is not clearlyseparable, which is even more problematic when the data is repre-sented with labels.

Due to the high complexity of projection techniques, other po-tentially faster and more intuitive approaches, such as force-basedlayouts [12, 13] are often used when interactive visualizations areneeded. However, Garcıa-Fernandez et al. [14] concluded in theirstudy that projection-based approaches are superior to force-basedlayouts, when a complete and large dataset needs to be visualized.

Some approaches are based on neural networks, such as hierar-chical self-organizing maps [22]. When this approach is used foreach created area, as proposed in [32] or [10], this approach cre-ates a visualization seemingly based on a hierarchy. However, whengenerating a visual hierarchy this way, only elements contained inan area are mapped relative to each other. In our approach, we usea hierarchy not only to visualize the relation between parents andtheir children, but also between the siblings across the clusters.

2.2 Hierarchical Aggregation and VisualizationIn case the data is already hierarchically structured, there are manyapproaches to visualize such data. If the user needs to inspect thedata’s distribution across the hierarchy tree, tree visualization tech-niques, such as dendrograms or icicle plots, are commonly used.However, dendrograms do not provide information about the rela-tion between the elements across multiple clusters (even when theyare at the same hierarchy level). Another prominent method to show

a hierarchy’s structure are treemaps [30]. Due to a treemap’s layout,its ability to provide information about the relation of clusters withdifferent parent clusters is limited and it is almost impossible to in-dicate possible shifts of the clusters’ similarities on different levelsof detail as tree maps assume a fixed hierarchy without consideringa possible relation between the clusters, which we want to provide.

The goal of most hierarchical visualization techniques is to showthe structure of the hierarchy. Elmqvist and Fekete [9] proposeguidelines how hierarchical data can be used to limit the amountof shown information. We adapt some of the proposed guidelinesin our work, such as taking the most important element of a clusteras cluster representative.

2.3 Visualization Approaches for Hierarchy-based Data

There are various approaches to provide information about moredetails about hierarchy-based data. For example, Dou et al. [8] gen-erate topic models where the users can interactively modify the cre-ated hierarchical structure. Afterwards, the user can inspect the de-velopment of individual or groups of topics over time. In contrast toour approach, this attempt uses the hierarchies to aggregate topicsfor the analysis of changes over time. Instead, we use hierarchies tofilter shown data and show the data’s relation spatially. Like mosthierarchical visualization approaches, they assume the availabilityof a hierarchical structure and use predefined hierarchy levels toshow information. Our approach goes beyond that by enabling theusers to set the shown hierarchy levels by themselves.

Liu et al. [24] developed an approach to build hierarchies basedon topic graphs. They visualize the relations of extracted topics byusing stacked trees in combination with force-based graph layouts.In contrast to our approach, they use hierarchies to distinguish be-tween topics and not to set their content in a relation to other topics.Our approach aims not only to provide relational information acrossclusters, but also of their content when more information is shown.

Fried and Kobourov [11] presented a system, wherein they mapthe titles of papers in the DBLP database onto a two-dimensionallandscape based on a hierarchy. The users can create a search profilewhose results are highlighted on the landscape using a heat mapvisualization. They focus on temporal aspects of the data.

Wise et al. [37] proposed to show large document collectionsthrough a galaxy metaphor, in which every document is representedby a star. By doing so, the user gets a more intuitive understand-ing of the relations between documents. Similarly, SPIRE [34] andINSPIRE [39] use the same metaphor, but they combine it witha visual analytics approach to enable users to further analyze thedata. The STREAMIT system [1] uses force-based layouts, cluster-ing discovery techniques and topic modeling to visualize documentstreams in real-time. The clustering is based on the graph layoutand does not create a hierarchy to examine the document streamson a semantical level. An early version of the recently publishedOverview system [4] projects documents onto a two-dimensionalscatterplot, whilst showing a hierarchical tree structure of the doc-uments in another view. The system uses brushing and linking toconnect those two visualizations. However, the selected elements ofthe hierarchy view are not represented in the amount of data shownin the scatterplot.

Only a few of these landscape-based approaches support hier-archical data, and most that do assume the hierarchy to be given.Thom et al. [33] use hierarchical topic clustering combined with atreemap-based visualization to show Twitter data on different levelsof detail. This level of detail can be interactively steered by the userduring an analysis run, which enables the user to see more detailsabout a specific topic. However, the visualization has similar draw-backs as described in subsection 2.2. As a result, the positions ofthe topics is based on the hierarchy, but the relation of the varioustopics is hard to comprehend.

In addition to approaches that try to depict the global relatedness

110

of data, there are approaches that show the local relatedness, de-pending on a user-chosen subset of the data. Often such approachesare realized through the use of focus+context techniques. One ex-ample for the interactive exploration of locally relevant informationis the Proxilens approach by Heulot et al. [20]. Here, a lens drawsrelated data towards the data in the lens’s center and pushes irrele-vant information out of the bounds of the lens.

2.4 Word Cloud VisualizationAlso related to our work are word cloud visualizations that showthe most frequent words of a text as a weighted list in some specificspatial arrangement, such as a sequential, circular, or clustered lay-out [25]. Several variations and advancements of word clouds havebeen proposed in recent years. For instance, Seifert et al. [29] de-veloped algorithms for space-filling word clouds based on a set ofheuristics, while related layout algorithms have also been presentedin a number of other works.

Some layout strategies consider word relationships and imple-ment spatial arrangements where strongly related words are placedin close proximity, similar to our attempt. The layout strategiesrange from simple line-by-line approaches [16] to treemap-likelayouts [21] and force-directed placements in combination withVenn diagrams [6]. Some works even apply projection techniques,such as the aforementioned MDS, to reflect the relatedness ofwords [28]. There are also attempts to explicitly depict the relation-ships in word clouds, either by adding links between related wordsor via interactive highlighting [18]. Prefix Tag Clouds [5] make useof prefix trees to group different word forms, whereas the WordCloud Explorer uses advanced NLP processing to link word formsand to support the visual analysis of text documents via interactiveword clouds [18].

However, we are not aware of any word cloud visualization thatprojects the words on multiple layers and allows for a seamless in-teraction and exploration on and between the layers.

3 APPROACH

The main goal of our approach is to reduce visual clutter caused bythe projection of labels onto a two-dimensional plane. We do thisby introducing a hierarchy in the projection, providing a smoothtransition between overview and detail. At the same time, we aimto preserve and indicate the relationships between labels on differ-ent levels of detail. Those levels of detail are based on the hierarchygenerated in a preprocessing step, while the distribution of the la-bels on each level of detail supports the users’ intuition that relatedlabels tend to be placed in close proximity.

As aforementioned, our approach consists of three steps, shownbelow. In the following, these steps are explained in more detail.

Create co-occurrence-

based hierarchy

Define levels ofdetail for clutter

reduction

Visualize levelsof detail for

further analysis

Figure 2: Workflow of our approach

3.1 Data PreprocessingThe creation of a hierarchy through clustering is expensive regard-ing computation time. As there is no need to recompute a dataset’shierarchy unless the dataset itself changes, we compute the hierar-chy in a preprocessing step and store it for later use. This ensures aquick and smooth entry point into the analysis, as information aboutthe hierarchy of the dataset can be shown right away.

Step 1: Hierarchization To create the hierarchy, we use Hi-erarchical Agglomerative Clustering (HAC) [26], which initiallytreats every data element as an individual cluster. Afterwards it iter-atively merges the two most similar clusters based on a linkage cri-terion. Often, HAC-based clustering approaches use single-linkage

as a linkage criterion. In single-linkage, the distance of two clustersis defined by the distance of the two closest elements of the clusters.However, it is unclear what merging clusters this way means on asemantic level [17, p. 525].

Therefore, we used medoid-linkage as the linkage criterion. Thesimilarity of two clusters is defined by the similarity of the clusters’medoids. A medoid is the element within a cluster with the leasttotal distance to all other elements within the cluster. As the calcu-lation of a medoid includes all elements within a cluster, it is a betterway to describe a cluster’s content compared to the elements usedby single-linkage. Further, medoid linkage implicitly compensatesfor a high variance within the cluster and is more robust to outlierswithin the cluster [26, p. 392;398].

One drawback of using a medoid linkage is the lack of mono-tonicity regarding the similarity of the merged clusters, as themedoid of a merged cluster may differ from the clusters that weremerged. Therefore, a cluster’s medoid usually needs to be recom-puted as soon as the cluster changes.

3.2 Recurring Steps of AnalysisThe following steps are executed every time an analysis is per-formed. As users may be interested in defining different levels ofdetail for each run of an analysis, the visualized information islikely to change and, hence, needs to be computed on demand.

F

A B C D

G

E

F

A B C D

G

E

Figure 3: The labels shown on a given level of detail is similar toa cut in the dendrogram of the hierarchy. The first nodes after thecut represent relevant clusters for that level of detail (they are high-lighted by a border in the figure). These clusters’ labels are thenused in the subsequent visualization step.

Step 2: Setting the Levels of Detail As our hierarchy isgenerated through a hierarchical agglomerative clustering, the hi-erarchy’s branching is binary. Thus, a given level of depth of thehierarchy describes the amount of shown clusters at the same time.An analysis of the hierarchy by stepping through it one level ofdepth at a time can be extremely tedious as only one cluster wouldsplit apart and therefore is not viable for our approach. To reach ourbefore mentioned goal, we enable the user to step through multiplelevels of depth at once. In our approach, the chosen levels of depthare called levels of detail. A level of detail corresponds to a cut atthe level of depth within the hierarchy. Only the clusters directlybelow the cut are being shown to the user. An example illustratingthe idea of this is shown in Figure 3.

Therein, the cut is at first after the cluster G and the clusters Eand F are the next clusters directly below the cut. In the second step,the cut is below F. Cluster E remains, but cluster F, which is nowabove the cut, is replaced by the clusters C and D.

Initially, we automatically set the levels of detail to represent anincrease of 20 levels of depth. This ensures that each level of detailshows a comprehensible amount of new information. Afterwards,the user may add, change or remove any number of levels of detail.To support the user in this task, we show two plots that containinformation about the hierarchy (see Figure 4 for an example).

Figure 4 a shows the merged clusters’ similarity. Every point ofthe plot represents the merging of two clusters. The similarity isexpressed in the y-value of the points, whereas the clustering stepis equal to the points’ x-value. A higher similarity value indicatesthat the clusters are much alike, whereas a low value indicates littleoverlap regarding the co-occurrence of clusters’ medoids.

111

Figure 4: We initially present two plots. The fusion similarity ( a )indicates the similarity of the clusters that have been merged overthe course of the hierarchization. The cluster error ( b ) showsthe clusters’ correctness using the Davies-Boulding index (a lowervalue indicates better clusters). Also, initial cuts for the levels ofdetail are proposed to the user. For demonstration purposes, we usea reduced number of levels of detail and added small depictionsof the visualizations that result on three different levels. The samevisualizations are shown in larger size in Figure 1.

Figure 4 b gives an indication of the quality of the clustering ata given step. As an indication measure, we use a variation of theDavies-Bouldin index [7]. The Davies-Bouldin index is designed tohave a high value when the distance between clusters is low and thedistance within the clusters is high. We assume that the similarityvalue s is normalized and calculate the distance d of two clusters Ciand C j as di, j = 1− si, j. Since we use medoid linkage, whereas theDavies-Bouldin index uses centroid distances, we slightly adaptedthe index. Our modified Davies-Boulding index DBImod is calcu-lated as

DBImod =1n·

n−1

∑i=0·max

i 6= j(

σi +σ j

dm(Ci,C j)),

with i and j being the cluster indexes, σx representing the averagedistance between the elements within cluster x and dm(Ci,C j) beingthe distance between the medoids of the clusters Ci and C j.

This way, the user gets a visual insight into the clustering andcan decide whether an adjustment of the levels of detail is necessaryand useful. For instance, a sudden change of the similarity or of theDavies-Boulding index could be a reason for an adjustment.

Step 3: Visualizing the Clusters In the third step, theuser can visually explore the hierarchy and the relationships ofthe shown clusters on increasing levels of detail. At first, the la-bels from the upmost level of detail are taken and projected ontoa two-dimensional plane, which we henceforth call landscape vi-sualization, by using the t-SNE projection technique [35]. We uset-SNE as it is possible to apply it with or without an initial spatialmapping of the data. This will be further elaborated in Section 4.1.

For every level of detail, we create a separate distance matrix thatis used by t-SNE. We calculate the cost to traverse the hierarchy treebetween all shown clusters and use the results as the distances be-tween the clusters. Once the positions of the clusters have been de-termined, representative labels can be shown in the landscape view.Details about the projection and positioning of the labels will begiven in Section 4.1.

We encoded several key aspects of the data and its structure intothe landscape visualization:

Representative of the cluster: Every cluster is represented by alabel. Following the recommendation of Elmqvist and Fekete [9],we use the cluster’s element with the highest overall occurrencefrequency as the cluster label.

Font size for importance: As the font size is perceptuallyprominent, we use it to indicate a cluster’s importance, similar toword clouds. Since clusters usually consist of several elements, wemap the accumulated occurrence frequency of the elements onto

the font size. We opted to represent the clusters’ importance on thecurrently shown level of detail by normalizing the accumulated fre-quencies of all shown clusters.

Numbers of elements within a cluster: To give the users an ideaof the general distribution of the labels, we added a colored radialbackground to every label. The size of the radius corresponds to theamount of elements that are contained within the cluster. The colorscale is a linear gradient that starts with a light blue in the centerand fades out towards the outside. In case the cluster is selected ormarked to be relevant, the light blue is replaced by a dark or lightgreen (see Section 4.2.3).

Cluster density: When two or more clusters overlap, their colorsadd up and a heat map-like visualization is created. This helps to geta better visual impression of the clusters’ distribution. The radius ofthe clusters remains constant when the users zoom in or out of thelandscape. Thus, smaller clusters are aggregated into bigger ones.

Position of the labels: As with all projection techniques, the po-sition of the labels follows the Gestalt principle that visual closenesscorrelates with similarity. More precisely, due to the way t-SNEworks, closeness correlates with the likelihood that two elementsare related to each other. We incorporated this aspect implicitly byusing the similarity as the projection measure in the second step ofour approach (see Section 3.2). In order to keep the cognitive loadlow, it is important to keep the positions of the already visualizedlabels as stable as possible when changing the level of detail. Weelaborate on the specific aspects of this in the following section.

4 INTERACTIVE EXPLORATION

c

d

a

e

b

Figure 5: Screenshot of our prototype showing data from thequestion-and-answer website StackOverflow. The landscape viewa shows the projected level of detail’s clusters. Some clusters areselected and highlighted with a dark green background. All selectedand related clusters are indicated by a green background color andhalos. Some clusters are not visible in the focused viewport, but thehalos indicate their position ( b ). A tooltip c shows the contentof the focused cluster jquery, including its name, description and aword cloud with the most frequent elements contained in that clus-ter. The selected labels are also shown and highlighted in the searchcomponent on the right d . On the bottom e , questions are listedthat contain at least one element from every selected cluster.

As outlined in Section 2, there are different ways to interact withtwo-dimensional representations of high-dimensional data. In thefollowing, we describe how to switch between levels of detail andways to interactively explore the landscape within a given level.

4.1 Switch Between Levels of DetailTo view the data on different levels of detail, the user needs to beable to switch between the chosen levels. However, projection al-gorithms are not designed to support the iterative projection of adataset with subsets of increasing levels of detail and do not make

112

use of previously positioned clusters. However, it is important tosupport the user’s mental map regarding the positions of the clus-ters in order to minimize the cognitive load during the change of alevel of detail. At the same time, it is necessary to visually informthe user about the relation of the newly available subclusters in thecontext of the already visualized clusters. We achieve this by min-imizing the movement of unchanged data as well as animating themovement of new data points from their respective parent cluster’sposition to their final destination.

At the first level of detail, we do not have any prior data aboutthe previous projection step. Therefore, we apply the standard im-plementation of t-SNE, which uses a t-student distribution of theprojected data as an initial mapping before optimizing the clusters’positions [35]. We use several means to stabilize the subsequentprojections, which we will explain in detail in the following:

1. Initial map: Originally designed to pause and resume a pro-jection run of t-SNE in order to show intermediate results, we usethe option to ‘resume’ a t-SNE run with a superset of the data usedin the previous projection step. We search for all clusters that splitinto smaller clusters in the next level of detail. Every unchangedcluster keeps its position and all newly generated clusters inheritthe position from the cluster they originate from. The resulting mapis used as the initial map of t-SNE, replacing the t-student distribu-tion of the clusters that is used by default and therefore stabilizingthe clusters’ positions.

2. Perplexity: We also adapt the perplexity parameter used byt-SNE. The perplexity can be interpreted as a factor that controlsthe size of the neighborhood considered in the high-dimensionalspace during the projection of a data point. Typically, this is a con-stant value which must be adapted, if the results are not satisfying.Van der Maaten and Hinton [35] recommend a value between fiveand 50 for this parameter. Usually, this value is robust and smalldifferences do not heavily impact the visualization. However, inthe case of a modified initial mapping, a too high or too low per-plexity value can lead to visual artifacts. In case the perplexity istoo high, the visualization will look like all clusters center aroundone point, because every cluster tries to optimize with regard to allother clusters. If the perplexity is too low, the clusters will not moveat all because they only consider themselves as relevant and there-fore all newly added clusters overlap at their parent’s position. Tokeep the user from following a trial and error approach to find aproper value, we decided to dynamically approximate the perplex-ity value depending on the number of shown clusters. Our perplex-ity function p(x) = 6.929 · x0.252710 is designed to initially increasefast in order to ensure that clusters in the first few levels of detailalready consider some of their neighboring clusters. The more clus-ters are shown the lower is the increase of the perplexity comparedto the previous level of detail. The values are based on the resultswe achieved with the datasets we describe in the use cases.

We compared our function to static perplexities on different lev-els of detail. We used the Kullback-Leibler divergence as a perfor-mance measure, since the authors of the t-SNE algorithm proposedit as a good statistical quality measure. The results based on thedataset of use case 2 are shown in Figure 6. It becomes clear that ourfunction-based perplexity performs slightly better than most staticperplexities. When we tested a static perplexity of p = 15, we no-ticed a lot more overdraw of the labels and concluded that lowerperplexities will amplify this problem even more.

3. Scaling of target projection space: Once the clusters are pro-jected, their positions are min-max normalized. To prevent a shiftof all clusters due to outliers, we first calculate the clusters’ geo-metric center. Afterwards, we normalize all clusters, but we discardthe 10% of the clusters that are the farthest away from the center.Then, we rescale the target projection space. The scaling factor is aroot function that depends on the amount of visible clusters.

0

2

4

6

8

10

0 40 60 80 100 120 140 200

Ku

lbac

k-Le

ible

r d

iver

gen

ce

amount of shown clusters

dynamic static (p = 15) static (p = 20) static (p = 30)

Figure 6: Comparison of our dynamic against static perplexities

4.2 Explore Cluster RelationshipsIn case the users want to explore a given level of detail, there aretwo scenarios: either they already have an initial idea what they arelooking for, or they want to get an overview of the chosen level ofdetail without any premises. In both cases, the next step is to decideon interesting clusters to inspect. In the following, we will describethe means we provide in order to accomplish this.

4.2.1 Search for Specific Clusters and ElementsWe provide the user with a search box that provides an auto-complete feature, suggesting every label contained within thedataset. Once an element has been found, it can be added to thelist below the search box (Figure 5 d ). The selections made in thesearch list are linked to the landscape view (Figure 5 a ). Sometimesthe searched element is not visible in the landscape view because itis hidden within a cluster. In this case, the cluster that contains theelement will be selected. It is also possible to focus on the elementor cluster that contains the element in the center of the view.

4.2.2 Freely Explore the Landscape ViewThe landscape view supports zooming and panning to let the userfreely explore the visualized level of detail. As described in Sec-tion 3.2, the size of the background color of the clusters is inde-pendent of the visual zoom. This way, the clusters are visually ag-gregated when the user zooms out, as the cluster backgrounds’ sizeincreases compared to the clusters’ labels. This helps the user todistinguish between areas with a higher cluster density and regionsthat are sparser.

When the users decide on one or more interesting clusters, theycan select them in the landscape view. The selected elements willthen be added to the aforementioned search list (Figure 5 d ).

4.2.3 Navigation Aids in the Landscape ViewOnce a set of clusters and/or elements is selected, the user may beinterested in other clusters related to the selected ones. However,this information may not be available directly, as some clusters thatare related in the high-dimensional space may not be close to eachother in the low-dimensional space or vice versa. We differenti-ate between two kinds of relatedness: global and local. The globalrelatedness is depicted by the spatial arrangement of the clusters,wherein closer clusters are more likely to be related than clustersthat are farther apart. The local relatedness is available for selectedclusters and shows which clusters are similar to those clusters. Tomeasure the local relatedness, we calculate the average similaritysavg between the set of selected clusters and the cluster they arecompared with by calculating

savg(Cother) =1n

n−1∑

i=0

(1m ·

m−1∑

j=0

1o ·

(o−1∑

k=0sim(Ci, j,Cother,k)

)),

wherein i is the index of the cluster within the cluster set, j is theindex of the compared element within the cluster and k is the indexof the element within the cluster that is compared with the set ofselected clusters.

As the locally related clusters may lie spatially apart from theselected clusters, we implemented several measures to compensatefor this information loss.

113

Halos: To support users in finding clusters that are related tothe selected clusters, we provide them with a navigation aid intro-duced by Baudisch and Rosenholtz [2]. Therewith, the users get avisual notion of the selected and other related clusters’ position bydrawing a circle, which is called halo in this concept, around theseclusters. In case the cluster is outside of the visible area, the halo’sradius gets expanded, so it barely stays within the view. The cur-vature of the Halo fragment indicates direction and distance of thecorresponding clusters. An example of the visual indication pro-vided by halos is shown in Figure 5 b .

Color coding: To make selected, related and other clusters dis-tinguishable, we color the backgrounds and halos of selected clus-ters with a noticeable shade of green. Analogously, we colored thebackgrounds and halos of related clusters with a shade of lightgreen. The other clusters’ background is drawn in a light blue. Wedecided to use color coding over approaches such as isolines orglyphs because we want to give the user an impression of the clus-ters’ different statuses, whilst indicating the uncertainty of the clus-ters’ positions due to the projection’s information loss.

Darts View: Furthermore, we added the Darts View Visualiza-tion introduced by Herr et al. [19]. By using a darts game metaphor,users can quickly spot related clusters, as the selected clusters areshown in the middle of the darts view and the related clusters arearranged around them.

The Figures 5 a and 8 depict how these aspects look like in ourimplementation of the concept.

4.3 Analyzing a Cluster’s Content

A B C D E

(A B) (C D E)V

V

V V

Figure 7: For each selected cluster (here indicated with a green cir-cle), the leaf nodes are retrieved. Then, data is requested from thedataset, which contains at least one of the leaf nodes from everyselected cluster.

Once the users found a relevant cluster, they may be interestedin further information about it. We provide several means to inspectthe content of a cluster. First, the user may request additional infor-mation about the cluster by looking at the cluster’s tooltip. In caseonly the name of the elements is available, the tooltip shows thename of the representative label as well as a word cloud. The labelsshown in the cloud belong to the elements contained in the cluster.They are ordered by their weight (= occurrence frequency), whichis also encoded into the words’ font size. In case the elements alsocontain a textual description, the description of the representativelabel is shown at the top. The word cloud contains the terms thatoccur most often in the descriptions of all elements. An exemplarydepiction of the tooltip is shown in Figure 5 c .

Second, the user can select one or more clusters. When doingso, our approach retrieves individual documents which contain atleast one term from each selected cluster. A schematic demonstra-tion of such a request is depicted in Figure 7. Here, the two high-lighted clusters were selected. As the elements A and B are con-tained within the first cluster, only one of them has to be containedwithin the resulting document. By selecting an individual docu-ment, further details can be retrieved.

5 USE CASES

In the following, we will illustrate the applicability and usefulnessof the approach with two scenarios using real world data.

5.1 Use Case 1: Analysis of StackOverflowIn the first use case, we assume the role of the head of a newlyfounded department of a software company that used to developserver-sided software solutions. Our new department is tasked tosupplement a client-sided solution to the software suite. The com-pany already supplied us with some of its software developers, butwe still need to recruit a developer that is well versed with web de-velopment. As we only possess basic knowledge of this field, weneed to take an explorative approach in order to gain knowledge,which aspects are important. To create a list of key skills for ourrecruit’s profile, we will use our approach to explore the question-and-answer website StackOverflow’s1 tags and their relations. SinceStackOverflow contains data that comprises more than ten millionquestions, we limit our inspection to tags that have been assignedto at least 5000 questions. The similarity between the tags is basedon Jaccard coefficient [26, p. 56].

We decide to deselect the first four proposed levels of detail, be-cause we want to see some detailed information at the top level al-ready. We know that our company’s server-sided software returnsJSON formatted answers and we also know that JavaScript is acommon solution for web clients. Therefore, we select the clusterscontaining JSON and JavaScript. To see if our interview candidateshave some background knowledge, we note some of the most oftenviewed questions in StackOverflow that contain elements from theJSON and JavaScript clusters. After selecting the clusters, the toolmarks the cluster jQuery to be relevant. Once reading the tag’s de-scription in the tooltip and looking at its contents, we decide to addit to our profile and select the cluster (shown in Figure 5 c ). Wewrite down questions such as ‘How to iterate over a JSON struc-ture?’ for the interview of candidates for our team. At last, we freelyexplore the space around our selected clusters and notice the clus-ter d3.js. After looking at the cluster’s description, we decide to addd3.js to our profile, as a developer with knowledge in web-based vi-sualization may be useful in later software development stages.

By now, we have a profile that we can use to search for a soft-ware developer that has knowledge in key technologies used in webdevelopment, which fits our companies existing technologies.

5.2 Use Case 2: Patent Analysis

Figure 8: Depiction of the concepts related to the clusters adhe-sive (contained in housing) and tissue. There are five concepts thatare strongly related to those clusters. One is catheter, which makessense, as a catheter may be used to apply the adhesive.

In the second use case, we take the point of view of an intel-lectual property analyst at a large company that produces and sellsmedical apparatuses. The company plans to develop a new productto apply an adhesive to open tissue during a surgery and wants toavoid conflicts with existing patents.

1 http://stackoverflow.com/

114

The analyst’s task is to limit the number of patents that have to beinspected individually to a reasonable amount. At first, the analystlimits the number of patents by searching for a code from the Inter-national Patent Classification (IPC), which is specific for medicalsurgery and diagnosis. Also, the patents need to contain the key-word glue or adhesive in their description. This results in a patentset of about 300 patents, which is too much to analyze manually.Thus, the patents are automatically processed and concepts are ex-tracted. The similarity between concepts is calculated by measuringthe concepts’ co-occurrence within sentences and comparing themusing a cosine similarity. These concepts will then be loaded intoour prototypical implementation to limit the patents even further.

The analyst may choose to change the initially proposed levels ofdetail, but in this case we assume that the expert will leave the levelsas proposed. Once the first level is projected, the analyst searchesfor the clusters that contain the concepts adhesive (which is hid-den inside the cluster housing) and tissue. By doing so, the analystnotices that the concept catheter is marked to be relevant to the se-lection. This becomes even clearer by switching to the darts view,which is shown in Figure 8. As it makes sense to apply an adhesivewith a catheter, the analyst adds it to the set of selected clusters.

When looking at the resulting patents, the analyst is not satisfiedwith the quality of the results. Therefore, he or she adds the clustercontaining the concept apparatus to the selected concept list andincreases the level of detail several times. By doing so, the selectedclusters get more specific as dissimilar elements split apart whichresults in a more specific patent request. Hence, the results of theleftover patents becomes better. By increasing the levels about fivemore times, the clusters become more specific, as less similar clus-ters split apart. Therefore, the analyst is left with 27 patents, whichis a reasonable amount for a manual analysis.

6 DISCUSSION

When analyzing a new dataset, the user is confronted with a chickenand egg problem: on the one hand, the zoom levels are supposed tohelp users in understanding the possible relations within the datasetby only showing information on a very coarse level. On the otherhand, without knowledge about what levels of detail may be inter-esting, it is very hard to decide how many levels of detail shouldbe shown and what information granularity they should have. Weaddress this dilemma by proposing the user a preset number of lev-els of detail that can be used as a starting point for the exploration.Also, we show the user the error index of the clustering to supportan easier manipulation of the levels. However, the shown informa-tion is of little help when the user is interested in which clustersare actually merged at a specific step. Even in case the user hadthat knowledge, it is still impossible to infer any further knowledgeabout the previous or the next cluster fusion step. Moreover, theuser only gets an abstract measure of the quality of the clustering,which is hard to comprehend. This problem may worsen when thebinary tree of the clustering algorithm is simplified, for example,by using Bayesian Rose Trees [3]. It is hard for a user to compre-hend how many clusters will be shown in the next level of depthof the tree, as every step may imply multiple aggregation steps. Asall of the mentioned aspects are important in choosing levels of de-tail that meet the user’s needs and expectations, we will look intopossible solutions in the future. For example, instead of proposinga constant step size within the hierarchy, it may be possible to fur-ther analyze the hierarchy and dataset to make a more appropriaterecommendation for relevant levels of detail.

Like all dimensionality reduction approaches, our approach suf-fers from the curse of dimensionality. The more elements are con-tained within the matrix and the more they are related, the less willbe their measurable distance. This causes problems in the projec-tion step, as it will not be able to distinguish which labels shouldbe clustered together anymore. This correlates with the problem to

pick a proper perplexity value in our approach. When the value istoo low or too high, the visualization will not be able to representthe data’s relation anymore.

We tried to design the approach in a way that an analysis run canbe done as smoothly as possible, even as more and more clustersare shown. As our approach is based on the original implementationof t-SNE, the time complexity is O(n2), where n is the number ofshown clusters. This may cause a performance issue when manylabels have to be projected. We measured how the increase of onelevel of detail in use case 2’s dataset takes during the actual analysis(after preprocessing) on a computer with an Intel Core i7-4700MQprocessing unit. It took 0.5 seconds for 40 clusters and 4.0 secondsfor 480 clusters from the point of the user interaction to the pointwhen the labels started to move to their new positions. Such issuescan be tackled, for instance, by calculating the next level of detailin a background thread while the user explores the actual level ofdetail, by using a better performing implementation of t-SNE, or bycalculating the projection information of the visible portions of thescreen first and compute other needed information on demand. It isimportant to note that the latter will implicitly impact the quality ofthe visualization. Similar clusters should be projected close to eachother, even if they originate from different parent clusters. Whenone of these parents is not visible, this information will be lost.

Moreover, the abundance of clusters on a high level of detailcauses an increased information loss during the projection onto atwo-dimensional space. Labels that are intuitively close to eachother may be far apart. Also, our approach to place new clustersat the position of their parent clusters may bias the projection’s re-sult. However, we think this is tolerable as the similarity matrixprovided to the projection contains all visible clusters and the orig-inal t-student distribution does not use any previously gained infor-mation. This is a general challenge and can only be slightly alle-viated by indicating the position of the labels that are close in thehigh-dimensional space. We plan to extend our approach’s varietyof navigation aids by adding more visualizations to support in-depthanalysis of the selected and related clusters.

As our visualization shows the mode of a cluster, it seems plausi-ble to use those modes for clustering. However, we decided againstsuch a linkage, because it is susceptible to outliers as it does notconsider other elements in a cluster, similar to single linkage. Also,it implies that an element often occurring in the complete dataset isalso important within a cluster, which is problematic.

Furthermore, the overlap of some clusters still poses an issuewhen many very similar clusters are projected on the landscape.Our approach tries to find a trade-off between the increasing whitespace produced when scaling the target space to prevent overlap-ping and a slight overlapping where the labels are still readable. Wedecided not to use an overlap removal after the projection step, aswe wanted to keep the spatial distance of the clusters as intact aspossible. However, we think that using more features to scale theprojections target space may further help to prevent overlapping.

7 CONCLUSION

In this paper, we presented a visualization approach for the hier-archical exploration of labeled data. It reduces the problem of vi-sual clutter and information overload that can quickly occur in thetwo-dimensional visualization of large amounts of labels. Overall,it consists of three steps: 1) the data is hierarchically clustered, 2)distinct levels of detail are defined, and 3) the levels of detail areprojected on a layered landscape visualization.

In contrast to related work, our approach enables the users to ex-plore the labeled clusters in two different ways: 1) by analyzing anindividual level of detail to understand the clusters’ child-child re-lation, and 2) by switching between different levels of detail to un-derstand the hierarchical structure and composition of the clusters.This allows users to choose the level of abstraction that is most use-

115

ful for their analysis goals, and to explore a large amount of labeleddata. We tailored the t-SNE projection algorithm to make use ofthe hierarchical structure of the data and to reuse already computedpositional information. Animations help the users to understand thetransitions between different levels of detail. Interaction techniquessupport the exploration on a given level of detail and ease the iden-tification of related clusters in the visualization.

We demonstrated the applicability and usefulness of the ap-proach with two use cases. In the first one, we showed the explo-ration aspect of the approach by analyzing the tag structure of thequestion-and-answer website StackOverflow to extend our under-standing about a domain. In the second use case, we applied theapproach to the field of patent analysis to support an analyst in fil-tering a set of patents.

Key challenges for future work include the appropriate labelingof merged clusters and the automatic detection of useful levels ofdetail. Furthermore, improved methods to even better reflect theglobal and local relatedness of clusters in the spatial layout needto be examined.

8 ACKNOWLEDGMENTS

This work has been supported by the EU funded project iPatDoc(grant no. 606163).

REFERENCES

[1] J. Alsakran, Y. Chen, D. Luo, Y. Zhao, J. Yang, W. Dou, and S. Liu.Real-time visualization of streaming text with a force-based dynamicsystem. IEEE Comput. Graph. Appl., 32(1):34–45, 2012.

[2] P. Baudisch and R. Rosenholtz. Halo: A technique for visualizing off-screen objects. In Eurograph. Ass. SIGCHI Conf. on Human Factorsin Computing Systems, pages 481–488. ACM, 2003.

[3] C. Blundell, Y. W. Teh, and K. A. Heller. Bayesian rose trees. In Proc.26th Conf. Uncertainty in Artificial Intelligence, pages 65–72. AUAIPress, 2010.

[4] M. Brehmer, S. Ingram, J. Stray, and T. Munzner. Overview: Thedesign, adoption, and analysis of a visual document mining toolfor investigative journalists. IEEE Trans. Vis. Comput. Graph.,20(12):2271–2280, 2014.

[5] M. Burch, S. Lohmann, D. Pompe, and D. Weiskopf. Prefix TagClouds. In Proc. Int. Conf. Information Visualisation, pages 45–50,2013.

[6] Y.-X. Chen, R. Santamarıa, A. Butz, and R. Theron. TagClusters:Semantic aggregation of collaborative tags beyond TagClouds. In 10thInt. Symp. Smart Graphics, pages 56–67. Springer, 2009.

[7] D. L. Davies and D. W. Bouldin. A cluster separation measure. IEEETrans. Pattern Anal. Mach. Intell., 1(2):224–227, 1979.

[8] W. Dou, L. Yu, X. Wang, Z. Ma, and W. Ribarsky. HierarchicalTopics:Visually exploring large text collections using topic hierarchies. IEEETrans. Vis. Comput. Graph., 19(12):2002–2011, 2013.

[9] N. Elmqvist and J.-D. Fekete. Hierarchical aggregation for informa-tion visualization: Overview, techniques, and design guidelines. IEEETrans. Vis. Comput. Graph., 16(3):439–454, 2010.

[10] M. Endo, M. Ueno, and T. Tanabe. A clustering method using hi-erarchical self-organizing maps. J. VLSI Signal Process. Syst., 32(1-2):105–118, 2002.

[11] D. Fried and S. G. Kobourov. Maps of computer science. In IEEEPacific Visualization Symposium, pages 113–120. IEEE, 2014.

[12] T. M. Fruchterman and E. M. Reingold. Graph drawing by force-directed placement. Softw. Pract. Exper., 21(11):1129–1164, 1991.

[13] E. R. Gansner and S. C. North. Improved force-directed layouts. InProc. 6th Int. Symp. Graph Drawing, pages 364–373. Springer, 1998.

[14] F. J. Garca-Fernndez, M. Verleysen, J. A. Lee, and I. Daz. Stabil-ity Comparison of Dimensionality Reduction Techniques Attendingto Data and Parameter Variations. In EuroVis Workshop Visual Ana-lytics using Multidimensional Projections. Eurograph. Ass., 2013.

[15] K. S. Hasan and V. Ng. Automatic keyphrase extraction: A survey ofthe state of the art. In Proc. 52nd Annual Meeting of the Associationfor Computational Linguistics, pages 1262–1273. ACL, 2014.

[16] Y. Hassan-Montero and V. Herrero-Solana. Improving tag-clouds asvisual information retrieval interfaces. In Int. Conf. MultidisciplinaryInformation Sciences and Technologies, pages 25–28, 2006.

[17] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin. The elementsof statistical learning: data mining, inference and prediction. Math.intell., 27(2):83–85, 2005.

[18] F. Heimerl, S. Lohmann, S. Lange, and T. Ertl. Word cloud explorer:Text analytics based on word clouds. In 47th Hawaii Int. Conf. SystemSciences, pages 1833–1842. IEEE, 2014.

[19] D. Herr, Q. Han, S. Lohmann, S. Brugmann, and T. Ertl. Visual explo-ration of patent collections with ipc clouds. In Proc. 1st Int. WorkshopPatent Mining and Its Applications, volume 1292. CEUR-WS, 2014.

[20] N. Heulot, M. Aupetit, and J.-D. Fekete. ProxiLens: Interactive Ex-ploration of High-Dimensional Data using Projections. In EuroVisWorkshop Visual Analytics using Multidimensional Projections. Euro-graph. Ass., 2013.

[21] O. Kaser and D. Lemire. Tag-cloud drawing: Algorithms for cloudvisualization. In Workshop Tagging and Metadata for Social Informa-tion Organization, 2007.

[22] T. Kohonen. The self-organizing map. Proc. IEEE, 78(9):1464–1480,1990.

[23] J. B. Kruskal and M. Wish. Multidimensional scaling. Sage, 1978.[24] S. Liu, X. Wang, J. Chen, J. Zhu, and B. Guo. TopicPanorama: A full

picture of relevant topics. In IEEE Conf. on Visual Analytics Scienceand Technology (VAST), pages 183–192, 2014.

[25] S. Lohmann, J. Ziegler, and L. Tetzlaff. Comparison of tag cloudlayouts: Task-related performance and visual exploration. In 12thIFIP TC 13 Int. Conf. Human-Computer Interaction, pages 392–404.Springer, 2009.

[26] C. D. Manning, P. Raghavan, H. Schutze, et al. Introduction to infor-mation retrieval, volume 1. Cambridge University Press, 2008.

[27] F. V. Paulovich, L. G. Nonato, R. Minghim, and H. Levkowitz. Leastsquare projection: A fast high-precision multidimensional projectiontechnique and its application to document mapping. IEEE Trans. Vis.Comput. Graph., 14(3):564–575, 2008.

[28] F. V. Paulovich, F. M. B. Toledo, G. P. Telles, R. Minghim, and L. G.Nonato. Semantic wordification of document collections. Comput.Graph. Forum, 31(3):1145–1153, 2012.

[29] C. Seifert, B. Kump, W. Kienreich, G. Granitzer, and M. Granitzer. Onthe beauty and usability of tag clouds. In 12th Int. Conf. InformationVisualisation, pages 17–25, 2008.

[30] B. Shneiderman. Tree visualization with tree-maps: 2-d space-fillingapproach. ACM Trans. Graph., 11(1):92–99, 1992.

[31] B. Shneiderman. The eyes have it: A task by data type taxonomy forinformation visualizations. In Proc. IEEE Symp. Visual Languages,pages 336–343. IEEE, 1996.

[32] P. Suganthan. Hierarchical overlapped som’s for pattern classification.IEEE Trans. Neural Netw., 10(1):193–196, 1999.

[33] D. Thom and T. Ertl. TreeQueST: A treemap-based query sandbox formicrodocument retrieval. In 48th Hawaii Int. Conf. System Sciences,pages 1714–1723. IEEE, 2015.

[34] J. J. Thomas, P. J. Cowley, O. Kuchar, L. T. Nowell, J. Thompson,and P. C. Wong. Discovering knowledge through visual analysis. J.Univers. Comput. Sci., 7(6):517–529, 2001.

[35] L. Van der Maaten and G. Hinton. Visualizing data using t-SNE. J.Mach. Learn. Res., 9(85):2579–2605, 2008.

[36] L. J. van der Maaten, E. O. Postma, and H. J. van den Herik. Di-mensionality reduction: A comparative review. J. Mach. Learn. Res.,10(1-41):66–71, 2009.

[37] J. A. Wise, J. J. Thomas, K. Pennock, D. Lantrip, M. Pottier, A. Schur,and V. Crow. Visualizing the non-visual: Spatial analysis and inter-action with information from text documents. In Proc. IEEE Symp.Information Visualization, pages 51–58. IEEE, 1995.

[38] S. Wold, K. Esbensen, and P. Geladi. Principal component analysis.Chemometrics Intell. Lab. Syst., 2(1):37–52, 1987.

[39] P. C. Wong, B. Hetzler, C. Posse, M. Whiting, S. Havre, N. Cramer,A. Shah, M. Singhal, A. Turner, and J. Thomas. IN-SPIRE infovis2004 contest entry. In Proc. IEEE Symp. Information Visualization.IEEE, 2004.

116

visual clutter reduction through hierarchy-based...

Documents