challenges in visual data analysis

Challenges in Visual Data Analysis∗

Daniel A. Keim, Florian Mansmann, Jorn Schneidewind, and Hartmut ZieglerUniversity of Konstanz, Germany

{keim, mansmann, schneide, ziegler}@inf.uni-konstanz.de

Abstract

In today’s applications data is produced at unprece-dented rates. While the capacity to collect and store newdata grows rapidly, the ability to analyze these data vol-umes increases at much lower pace. This gap leads to newchallenges in the analysis process, since analysts, decisionmakers, engineers, or emergency response teams depend oninformation “concealed” in the data. The emerging fieldof visual analytics focuses on handling massive, heteroge-nous, and dynamic volumes of information through integra-tion of human judgement by means of visual representationsand interaction techniques in the analysis process. Further-more, it is the combination of related research areas includ-ing visualization, data mining, and statistics that turns vi-sual analytics into a promising field of research. This paperaims at providing an overview of visual analytics, its scopeand concepts, and details the most important technical re-search challenges in the field.

1. Introduction

Information overload is a well-known phenomenon ofthe information age, since due to the progress in computerpower and storage capacity over the last decades, data isproduced at an incredible rate. Meanwhile, our ability tocollect and store data is growing faster than our ability toanalyse it. However, the analysis of these massive, typ-ically messy and inconsistent, volumes of data is crucialin many application domains. For decision makers, ana-lysts or emergency response teams it is an essential task torapidly extract relevant information from the flood of data.Today, a selected number of software tools is employed tohelp analysts to organize their data, generate overviews andexplore the information space in order to extract potentiallyuseful information. Most of such data analysis systems stillrely on interaction metaphors developed over a decade ago

∗Part of this work has been presented at Dagstuhl Seminar 05231 “Sci-entific Visualization: Challenges for the Future” [3]

and it is questionable whether they are able to meet the de-mands of the information age. In fact, huge investments interms of time and money are often wasted, because the pos-sibilities to properly interact with the databases are still toolimited. Visual analytics aims at bridging this gap by em-ploying more intelligent means in the analysis process. Thebasic idea of visual analytics is to visually represent infor-mation, allowing the human to directly interact with it, togain insight, to draw conclusions, and to ultimately makebetter decisions. Visual representation of the informationreduces complex cognitive work needed to perform certaintasks. People may use visual analytics tools and techniquesto synthesize information and derive insight from massive,dynamic, and often conflicting data by providing timely, de-fensible, and understandable assessments. [10]

The goal of visual analytics research is to turn the in-formation overload into an opportunity. Decision-makersshould be enabled to examine massive, multi-dimensional,multi-source, time-varying information stream to make ef-fective decisions in time-critical situations. For informeddecisions, it is indispensable to include humans into the dataanalysis process to combine flexibility, creativity, and back-ground knowledge with the enormous storage capacity andthe computational power of today’s computers. The spe-cific advantage of visual analytics is that decision makersmay focus their full cognitive and perceptual capabilitieson the analytical process, while allowing them to apply ad-vanced computational capabilities to augment the discoveryprocess. This paper gives an overview on visual analyticsand discusses the most important research challenges in thisfield. Technical research challenges are detailed to showhow visual analytics can help to turn information overloadas generated by today’s applications into a useful asset.

The rest of the paper is organized as follows: section 2defines visual analytics and discusses its scope. Technicalchallenges are described in Section 3. Finally, section 4presents the visual analytics mantra as a solution.

Proceedings of the Information Visualization (IV’06) 0-7695-2602-0/06 $20.00 © 2006 IEEE

Authorized licensed use limited to: UNIVERSIDAD DE SALAMANCA. Downloaded on July 17,2010 at 18:09:56 UTC from IEEE Xplore. Restrictions apply.

2. Scope of Visual Analytics

In general, visual analytics can be described as “the sci-ence of analytical reasoning facilitated by interactive visualinterfaces” [10]. To be more precise, visual analytics is aniterative process that involves collecting information, datapreprocessing, knowledge representation, interaction, anddecision making. The ultimate goal is to gain insight intothe problem at hand which is described by vast amountsof scientific, forensic or business data from heterogeneoussources. To achieve this goal, visual analytics combines theadvantages of machines with strenghts of humans. Whilemethods from knowledge discovery in databases (KDD),statistics and mathematics are the driving force on the auto-matic analysis side, capabilities to perceive, relate and con-clude turn visual analytics into a very promising field ofresearch.

Historically, visual analytics has evolved out of the fieldsof information and scientific visualization. According toColin Ware, the term visualization is meanwhile understoodas “a graphical representation of data or concepts” [14],while the term was formerly applied to form a mental im-age. Nowadays fast computers and sophisticated output de-vices create meaningful visualizations and allow us not onlyto mentally visualize data and concepts, but to actually seeand explore the representation of the data under consider-ation on a computer screen. However, transformation ofdata into meaningful visualizations is a non-trivial task thatcan not be automatically improved through steadily grow-ing computational resources. Very often, there are manydifferent ways to represent the data and it is unclear whichrepresentation is the best one. State-of-the-art concepts ofrepresentation, perception, interaction and decision-makingneed to be applied and extended to be suitable for visualdata analysis.

The fields of information and scientific visualization dealwith visual representations of data. Scientific visualizationexamines potentially huge amounts of scientific data ob-tained from sensors, simulations or laboratory tests withtypical applications being flow visualization, volume ren-dering, and slicing techniques for medical illustrations. Inmost cases, some aspects of the data can be directly mappedonto geographic coordinates or into virtual 3D environ-ments. We define information visualization more generallyas the communication of abstract data relevant in terms ofaction through the use of interactive visual interfaces. Thereare three major goals of visualization, namely a) presenta-tion, b) confirmatory analysis, and c) exploratory analysis.For presentation purposes, the facts to be presented are fixeda priori, and the choice of the appropriate presentation tech-nique depends largely on the user. The aim is to efficientlyand effectively communicate the results of an analysis. Forconfirmatory analysis, one or more hypotheses about the

Presentation, production, and dissemination

Statistical Analytics

Scientific Analytics

Knowledge DiscoveryKData Management &

Knowledge Representation

Cognitive andPerceptual Science

Interaction

Geospatial Analytics

Scope of Visual Analytics

G

Information Analytics

Figure 1. Visual analytics as a highly interdis-ciplinary field of research.

data serve as a starting point. The process can be describedas a goal-oriented examination of these hypotheses. As aresult, visualization either confirms these hypotheses or re-jects them. Exploratory data analysis as the process ofsearching and analyzing databases to find implicit but po-tentially useful information, is a difficult task. At the be-ginning, the analyst has no hypothesis about the data. Ac-cording to John Tuckey, tools as well as understanding areneeded [12] for the interactive and usually undirected searchfor structures and trends.

Visual analytics is more than just visualization and canrather be seen as an integrated approach combining visual-ization, human factors and data analysis. Figure 1 illustratesthe detailed scope of visual analytics. With respect to thefield of visualization, visual analytics integrates methodol-ogy from information analytics, geospatial analytics, andscientific analytics. Especially human factors (e.g., inter-action, cognition, perception, collaboration, presentation,and dissemination) play a key role in the communicationbetween human and computer, as well as in the decision-making process. In this context, production is defined as thecreation of materials that summarize the results of an analyt-ical effort, presentation as the packaging of those materialsin a way that helps the audience understand the analyticalresults using terms that are meaningful to them, and dissem-ination as the process of sharing that information with theintended audience [11]. In matters of data analysis, visualanalytics further benefits from the methodologies developedin the fields of data management & knowledge representa-tion, knowledge discovery, and statistical analytics. Notethat visual analytics, is not likely to become a separate fieldof study [15], but its influence will spread over the researchareas it comprises.

According to Jarke J. van Wijk, “visualization is not



’good’ by definition, developers of new methods have tomake clear why the information sought cannot be extractedautomatically” [13]. From this statement, one immediatelyrecognizes the need for the visual analytics approach usingautomatic methods from statistics, mathematics and knowl-edge discovery in databases (KDD) wherever they are ap-plicable. Visualization is used as a means to efficientlycommunicate and explore the information space when au-tomatic methods fail. In this context, human backgroundknowledge, intuition, and decision-making either cannot beautomated or serve as input for the future development ofautomated processes.

Overlooking a large information space is a typical visualanalytics problem. In many cases, the information at handis conflicting and needs to be integrated from heterogeneousdata sources. Moreover, the system lacks the knowledge hu-man experts posses. By applying analytical reasoning, hy-potheses about the data can be either affirmed or discarded,eventually leading to a better understanding of the data, thussupporting the analyst in his task to gain insight. Contrary tothat, a well-defined problem where the optimum or a goodestimation can be calculated by non-interactive analyticalmeans should not be classified as a visual analytics prob-lem. In such a scenario, non-interactive analysis should beclearly preferred due to efficiency reasons. Likewise, vi-sualization problems not involving methods for automaticdata analysis do not fall into the category of visual analyticsproblems.

The fields of visualization and visual analytics both relyon methods from scientific, geospatial, and information an-alytics. They both benefit from the knowledge out of thefield of interaction as well as of cognitive and perceptualscience. They do differ in that visual analytics also inte-grates the methodology from the fields of statistical analyt-ics, knowledge discovery, data management & knowledgerepresentation and presentation, production & dissemina-tion.

3. Technical Challenges

With information technology becoming a standard inmost areas in the past years, from medicine to science, fromeducation to business, from astronomy to networking (In-ternet), more and more digital information is generated andcollected. Today, we continuously face rapidly increasingamounts of data: New sensors, faster recording methodsand decreasing prices for storage capacities in the previousyears allow storing huge amounts of data that used to beunimaginable a decade ago. Applications like flow simu-lations, molecular dynamics, nuclear science, computer to-mography or astronomy generate amounts of data that caneasily reach terabytes; the Large Hadron Collider (LHC) atCERN, for example, generates a volume of 1 petabyte of

Figure 2. Visual Analysis of Financial Data:The Growth Matrix

data per year. Parallel to the growth of datasets, the compu-tational power of the computer systems for processing theseamounts of data also evolved: Faster processors, more mainmemory, faster networks, parallel and distributed comput-ing and larger storage capacities increase the data through-put every year. Having no possibility of adequately explor-ing the large amounts of data which have been collected dueto their potential usefulness, the data becomes useless andthe databases become data “dumps” [2].

Commonly, the amount of data (often terabytes) to bevisualized exceeds the limited amount of pixels of a dis-play by several orders of magnitude. Filtering, aggregation,compression, principle component analysis or other data re-duction techniques are then needed to reduce the amount ofdata as only a small portion of it can be displayed. Visualscalability is defined as the capability of visualization toolsto effectively display large data sets in terms of either thenumber or the dimension of individual data elements [1].We thus not only have to compare the absolute data growthand hardware performance in order to cope with a problem,but also the software and the algorithms to bring this dataonto the screen in an appropriate way. Scalability in gen-eral is a key challenge of visual analytics as it determinesthe ability to process large datasets by means of computa-tional overhead as well as appropriate rendering techniques.In the last decade, information visualization has developednumerous techniques to visualize datasets, but only some ofthem are scalable to the huge data sets used in visual analyt-ics. As the amount of data is continuously growing and theamount of pixels on the display remains rather constant, therate of compression to visualize the dataset on the displayis continuously increasing, and therefore, more and moredetails are lost. It is the task of visual analytics to create ahigher-level view of the dataset to gain insight, while maxi-mizing the amount of details at the same time.

An example of a highly scalable visual analytics tech-nique in the field of financial data analysis is the GrowthMatrix [4]. It visualizes all possible time intervals of a fund



over a time span of 14 years (about 11.000 intervals), andsimultaneously compares the performance of each intervalwith 14.000 funds that are in the database in just one image(see Fig. 2). To generate each image, 154 million valueshave to be calculated. Using a large display wall, it is possi-ble to compare several hundreds of these Growth Matrices.

Dynamic processes, arising in business, network ortelecommunications generate tremendous streams of timerelated or real time data. Examples are sensor logs, webstatistics, network traffic logs or atmospheric and meteoro-logical applications. Analysis of such data streams is animportant challenge, since it plays an essential role in manyareas of science and technology. As the sheer amount ofdata often does not allow to record all the data at full de-tail, effective compression and feature extraction methodsare needed to manage the data. Furthermore, it is vital toprovide analysis techniques and metaphors that are capa-ble of analyzing large real time data streams in time, and topresent the results in a meaningful and intuitive way. Thisenables quick identification of important information andtimely reaction on critical process states or alarming inci-dents.

Real-world applications often access information from anumber of heterogeneous information sources and thus re-quire synthesis of heterogeneous types of data in order toperform an effective analysis. These heterogeneous datasources may include collections of vector data, strings andtext documents, graphs or sets of objects. A typical ap-plication domain is computational biology, where the hu-man genome is accompanied by real-valued gene expres-sion data, functional annotation of genes, genotyping infor-mation, a graph of interacting proteins, equations describ-ing the dynamics of a system, localization of proteins ina cell, and natural language text in the form of papers de-scribing experiments, partial models, and numerous otherdata sources. The problem of integrating these data sourcestouches upon many fundamental problems in decision the-ory, information theory, statistics, and machine learning ev-idently posing a challenge for visual analytics, too. The fo-cus on scalable and robust methods for fusing complex andheterogeneous data sources is thus a key to a more effectiveanalysis process.

Interpretability or the ability to recognize and under-stand the data is one of the biggest challenges in visual an-alytics. Generating a visually correct output from raw dataand drawing the right conclusions largely depends on thequality of the used data and methods. Many possible qual-ity problems (e.g., data capture errors, noise, outliers, lowprecision, missing values, coverage errors, double counts)can already be contained in the raw data. Furthermore, pre-processing of data in order to use it for visual analysis bearsmany potential quality problems (i.e., data migration andparsing, data cleaning, data reduction, data enrichment, up-

/ down-sampling, rounding and weighting, aggregation andcombining). Data can be inherently incomplete or simplyout of date. The challenges are on the one hand to deter-mine and to minimize these errors on the pre-processingside, and to provide a flexible yet stable design of the vi-sual analytics application to cope with data quality prob-lems on the other hand. From the technical point of viewsuch application can be either designed to be insensitiveto data quality issues through employment of data clean-ing methods or to explicitly visualize errors and uncertaintyin the application to make the analyst aware of the prob-lem. Many domains ranging from terrorism informatics tonatural sciences and business intelligence would potentiallybenefit from the methods for enhancing data quality. Home-land Security applications in general have to deal with manymissing values and uncertainty. For instance, consider ascreening program in the context of Homeland Security inan airport. The system should identify potential terrorists,but also try to minimize false positives in order to avoidincorrectly targeting innocent travelers. A falsely inserteddata record should not influence the principle way in whichthe system observes and analyses other persons. Moreover,updated data of a potential terrorist might not be availablein the database for many weeks, but the visual monitoringand analysis of patterns should still work even though therecords in the database are widely incomplete.

The field of problem solving, decision science, and hu-man information discourse constitutes a further visual an-alytics challenge. Many psychological studies about theprocess of problem solving have been conducted. In ausual test setup the subjects have to solve a well-definedproblem where the optimal solution is known to the re-searchers. However, real-world problems are manifold. Inmany cases these problems are intransparent, consist of con-flicting goals, and are complex in terms of large numbersof items, interrelations, and decisions involved. In addi-tion to these aspects, information exhibits its own dynam-ics by changing over time and has thus a strong impact onthe optimal solution. To date, decision making is aidedby so-called Decision Support Systems which try to rep-resent expert knowledge through rules in order to reducethe risk of human errors when too much interrelated infor-mation is involved. Decision making in groups makes thehuman decision-making process even more complicated.As for human information discourse, it is estimated thathuman subjects can reliably distinguish between approxi-mately seven categories in the process of absolute judge-ment [6]. This number has therefore direct consequencesfor the design of effective user interfaces. To assess fur-ther limitations of human capabilities for more complextasks is challenging as subjects react differently to screendesigns depending on their personality and their cultural,educational, and professional background. Summing up,



the process of decision support for problem solving requiresunderstanding of technology on the one hand, but also com-prehension of typical human capabilities such as logic, rea-soning, and common sense on the other hand. Intuitivedisplays and interaction devices should be constructed tosteer analysis and to communicate analytical results throughmeaningful visualizations and clear representations.

Another challenge in the context of visual analytics isto provide semantics for future analysis tasks and decision-centered visualization. Semantic meta data extracted fromheterogeneous sources may capture associations and com-plex relationships. Therefore, providing techniques toanalyse and detect this information is crucial to visual an-alytics applications. Ontology-driven techniques and sys-tems have already started to enable new semantic applica-tions in a wide span of areas such as bioinformatics, finan-cial services, web services , business intelligence, and na-tional security. However, further research is necessary inorder to increase capabilities for creating and maintaininglarge domain ontologies and automatic extraction of seman-tic meta data, since the integration process between differ-ent ontologies to link various datasets is hardly automatedyet. In order to perform a more powerful analysis of het-erogeneous data sources, more advanced methods for theextraction of semantics from heterogenous data are a keyrequirement. Thereby, research challenges arise from thesize of ontologies, content diversity, heterogeneity as wellas from computation of complex queries and link analy-sis over ontology instances and meta data. New techniquesare necessary to resolve semantic heterogeneities in order todiscover complex relationships.

User acceptability is a further challenge; many novel vi-sualization techniques have been presented, yet their wide-spread deployment has not taken place, primarily due to theusers’ refusal to change their working routines. Therefore,the advantages of visual analytics tools need to be commu-nicated to the audience of future users to overcome usagebarriers, and to eventually tap the full potential of the visualanalytics approach. One example is the IBM Remail project[8] which tries to enhance human capabilities to cope withemail overload. Concepts such as “Thread Arcs”, “Cor-respondents Map”, and “Message Map” support the userin efficiently analysing his personal email communication.MIT’s project Oxygen [7] even goes one step further, byaddressing the challenges of new systems to be pervasive,embedded, nomadic, adaptable, powerful, intentional andeternal.

Another challenge for visual analytics is the integrationof the visualization techniques into other applications andsystems. Visual analytics tools and techniques should notstand alone, but should integrate seamlessly into the appli-cations of diverse domains, and allow interaction with otheralready existing systems. Although many visual analytics

tools are very specific (like in astronomy or nuclear science)and therefore rather unique, in many domains (like businessapplications or network security) integration into existingsystems would make sense. This requires interactive capa-bilities and input/output software interfaces for communi-cation with other applications. Real-time visual monitoringand analysis require a connection to databases and frequentupdates, and can also be used for automated analysis. In ad-dition to that, visual analytics applications can be integratedor connected to new innovative interaction and visualizationdevices for improved performance.

Evaluation as a systematic determination of merit,worth, and significance of a system is crucial to its suc-cess. When evaluating a system, different aspects can beconsidered such as functional testing, performance bench-marks, measurement of the effectiveness of the display, eco-nomic success, user studies, assessment of its impact ondecision-making to name just a few. Not all of these as-pects are orthogonal, they rather commonly represent com-parisons with previous systems to assess the novel system’sadequacy. In this context, objective rules of thumbs to facil-itate design decisions would be a great contribution to thecommunity.

4 Solutions

Visual analytics combines strengths from informationanalytics, geospatial analytics, scientific analytics, statis-tical analytics, knowledge discovery, data management &knowledge representation, presentation, production & dis-semination, cognition, perception, and interaction. It is agoal-oriented process to gain insight into heterogeneous,contradictory and incomplete data through the combina-tion of automatic analysis methods with human backgroundknowledge and intuition.

The example shown in Figure 3 helps to illustrate the ra-tio behind visual analytics. The figure shows analysis ofstock market data using the CircleView [5] approach, vi-sualizing prices of 240 stocks from the S&P 500 over 6months, starting from January 2004 (periphery of the cir-cle) to June 2004 (center of the circle). Instead of just visu-alizing the data, it is analysed by applying a relevance func-tion in the first step. Since from the analyst’s point of viewit may be more interesting to analyse actual stock prizesrather than historic ones, the basic idea is to show actualstock prizes in full detail and to present historic values asaggregated high level views. That is, the relevance value ofeach data point is determined by its time stamp and the datais presented at different levels of details (i.e., day, week,month). This helps to reduce the amount of data and to pro-vide overviews even on large data sets. In the next step thestocks are clustered to identify groups of stocks with sim-ilar performance over time. If the analyst then identifies



Figure 3. CircleView technique

stocks of interest or relevant groups of stocks, he can pro-ceed by selecting relevant subsets of data and performing adetailed analysis using drill down operations on items withlower resolution or select interesting parts of the data andinvestigate them in more detail using chart techniques.

Unlike described in the information seeking mantra(“overview first, zoom/filter, details on demand”) [9], thevisual analytics process comprises the application of auto-matic analysis methods before and after the interactive vi-sual representation is used. This is primarily due to the factthat current and especially future data sets are complex onthe one hand and too large to be visualized in a straightfor-ward manner on the other hand. Therefore, we present thevisual analytics mantra:

“Analyse First -Show the Important -

Zoom, Filter and Analyse Further -Details on Demand”

The visual analytics mantra could be exemplarily appliedin the context of data analysis for network security. Visu-alizing the raw data is unfeasible and rarely reveals any in-sight. Therefore, the data is first analysed (i.e., computechanges, intrusion detection analysis, etc.) and then dis-played. The analyst proceeds by choosing a small suspi-cious subset of the recorded intrusion incidents by applyingfilters and zoom operations. Finally, this subset is used fora more careful analysis. Insight is gained in the course ofthe whole visual analytics process.

Acknowledgements

This work has been funded by the German Research So-ciety (DFG) under the grant GK-1042, Explorative Analysisand Visualization of Large Information Spaces, Konstanz.

References

[1] S. G. Eick. Visual scalability. Journal of Computational &Graphical Statistics, March 2002.

[2] D. A. Keim. Information visualization and visual data min-ing. IEEE Transactions on Visualization and ComputerGraphics (TVCG), 8(1):1–8, January–March 2002.

[3] D. A. Keim. Visual Exploration of Large Geo-Spatial DataSets. In T. Ertl, E. Groller, K. Joy, and G. Nielson, edi-tors, Dagstuhl Seminar 05231: Scientific Visualization, June2005. http://www.dagstuhl.de/05231/.

[4] D. A. Keim, T. Nietzschmann, N. Schelwies, J. Schnei-dewind, T. Schreck, and H. Ziegler. A spectral visualizationsystem for analyzing financial time series data. In EuroVis2006: Eurographics/IEEE-VGTC Symposium on Visualiza-tion, Lisbon, Portugal, 8-10 May, 2006.

[5] D. A. Keim, J. Schneidewind, and M. Sips. Circleview - anew approach for visualizing time related multidimensionaldata sets. In ACM Advanced Visual Interfaces (AVI). Associ-ation for Computing Machinery (ACM), ACM Press, 2004.

[6] G. A. Miller. The magical number seven, plus or minus two:Some limits on our capacity for processing information. ThePsychological Review, 63:81–97, 1956.

[7] MIT Project Oxygen. http://oxygen.lcs.mit.edu/.[8] S. L. Rohall, D. Gruen, P. Moody, M. Wattenberg, M. Stern,

B. Kerr, B. Stachel, K. Dave, R. Armes, and E. Wilcox. Re-mail: a reinvented email prototype. In Extended abstracts ofthe 2004 Conference on Human Factors in Computing Sys-tems, CHI 2004, Vienna, Austria, April 24 - 29, 2004, pages791–792, 2004.

[9] B. Shneiderman. The eyes have it: A task by data type tax-onomy for information visualizations. In IEEE Symposiumon Visual Languages, pages 336–343, 1996.

[10] J. Thomas and K. Cook. Illuminating the Path: Researchand Development Agenda for Visual Analytics. IEEE-Press,2005.

[11] J. J. Thomas and K. A. Cook. A Visual Analytics Agenda.IEEE Transactions on Computer Graphics and Applica-tions, 26(1):12–19, January/February 2006.

[12] J. W. Tuckey. Exploratory Data Analysis. Addison-Wesley,Reading MA, 1977.

[13] J. J. van Wijk. The value of visualization. In IEEE Visual-ization, 2005.

[14] C. Ware. Information Visualization - Perception for Design.Morgan Kaufmann Publishers, 1st edition, 2000.

[15] P. C. Wong and J. Thomas. Visual analytics - guest editors’introduction. IEEE Transactions on Computer Graphics andApplications, September/October 2004.



challenges in visual data analysis

Documents