VisFlow: A Web-Based Dataflow Framework for Visual Data Exploration
DISSERTATION
Submitted in Partial Fulfillment of
the Requirements for
the Degree of
DOCTOR OF PHILOSOPHY (Computer Science)
at the
NEW YORK UNIVERSITY
TANDON SCHOOL OF ENGINEERING
by
Bowen Yu
January 2019
Approved:
Department Head Signature
Date
University ID#: N17821602
Net ID: by460
Approved by the Guidance Committee:
Major: Computer Science
Claudio T. Silva, Professor of Computer Science and Engineering
Juliana Freire, Professor of Computer Science and Engineering
Enrico Bertini, Professor of Computer Science and Engineering
Luis Gustavo Nonato, Professor, Institute of Mathematics and Computer Sciences, University of São Paulo
Microfilm or other copies of this dissertation are obtainable from
UMI Dissertation Publishing
ProQuest CSA
789 E. Eisenhower Parkway
P.O. Box 1346
Ann Arbor, MI 48106-1346
Vita
Bowen Yu was born in 1990 in Shanghai, China. He received his Bachelor
of Science from Peking University, Beijing, China, in 2012. He joined the Ph.D.
program at Tandon School of Engineering, New York University, in Fall 2012. His
research interests are data visualization frameworks and visual analytics. He was
a graduate adjunct in Spring 2018, and an adjunct faculty in Fall 2018 at the
Courant Institute of Mathematical Sciences, New York University. He specializes
in competitive programming and coached the New York University programming
team from 2014 to 2018.
Acknowledgements
I am very grateful to my advisor, Prof. Claudio T. Silva, for his tremendous
support of my research and academic life over the past six years. This dissertation
would not have been possible without his guidance and vision. He not only provided
insightful feedback on my research, but also gave me deep inspiration on how to
become a better researcher and a better person overall. I would also like to thank
all the committee members: Prof. Claudio T. Silva, Prof. Juliana Freire, Prof.
Enrico Bertini, and Prof. Luis Gustavo Nonato. By working with them on various
projects, I have learned a lot from their research experience and personalities.
I am very fortunate to have worked in the VIDA group, where I was able
to explore research opportunities hardly available elsewhere. The atmosphere
in the group is welcoming and encouraging. The discussions I had with my
colleagues here were always thought-provoking and fun.
I would like to thank Bo Zhou for being my best friend and fellow labmate. It
was a great pleasure to discuss challenging research topics and algorithmic problems
with him. His wisdom and humor were essential to my life in NYC.
I would like to thank my colleagues at the Courant Institute, who worked hard
to make it possible for me to create my own course.
I would like to thank H.L., Yi.W., M.(M.)C. for being with me, and S.C., Yu.W.,
H.S., C.G., C.W. for their friendship across states and continents.
I would like to thank those coffee beans for their contribution to my integrity.
Finally, I would like to thank my parents for their unconditional support. It is
their deep love that leads me to who I am today.
Bowen Yu
January 2019
To my parents, and all the adventurous souls
ABSTRACT
VisFlow: A Web-Based Dataflow Framework for Visual Data
Exploration
by
Bowen Yu
Advisor: Prof. Claudio T. Silva, Ph.D.
Submitted in Partial Fulfillment of the Requirements for
the Degree of Doctor of Philosophy (Computer Science)
January 2019
Visual data exploration requires tools that are flexible and adaptive. Although
domain-specific systems are effective at solving particular tasks, they are relatively
costly to build. It is therefore desirable to have a general-purpose tool that gives
the user control over how data are queried and presented. In this work we design
VisFlow, a web-based visualization framework that employs dataflow diagrams
to facilitate flexible visual data exploration with good usability and low learning
overhead. VisFlow applies a subset flow model that focuses on processing tabular
data subsets. The model allows the user to create visualizations that update
reactively upon dataflow diagram changes. It also enables data selection from
visualizations for interactive filtering, subset identification, and subset manipulation.
VisFlow may help the user generate a multi-view visualization environment with
brushing and linking support. Compared with other existing dataflow systems,
VisFlow addresses the lack of interactivity in dataflow and overcomes the drawback
of high learning overhead due to complicated dataflow diagrams. We demonstrate
the capability and effectiveness of VisFlow through several case studies on real-world
data analysis scenarios.
Although the subset flow provides good dataflow interactivity, it requires data
immutability within the dataflow. To overcome the limitation on data processing
capability resulting from data immutability, we further design the extended subset
flow model that expands the application of VisFlow to derived data. In the extended
subset flow model, nodes are allowed to mutate the data and create data mutation
boundaries. The subset flow applies to the groups of nodes within the same data
mutation boundary, and may thus preserve its interactivity benefit. We incorporate
several node type extensions in the extended subset flow, such as the data reservoir
that stores and sends back its input data to address the limitation of acyclic
dataflow, and the script editor that supports custom JavaScript scripting to edit
or generate data. We show by case studies that the extended subset flow may
significantly improve the analytical capability of VisFlow and make the framework
applicable to a larger variety of tasks.
We develop a natural language interface FlowSense that employs semantic
parsing to assist dataflow diagram construction in VisFlow. FlowSense maps the
user queries in English to diagram editing operations in VisFlow. We propose a
grammar design for FlowSense that is based on part-of-speech (POS) and special utterance tagging.
We employ special utterance placeholders to make the semantic parser aware of
the current dataflow context during execution, while the grammar of the parser
can be independent of the data and dataflow diagram in use. The integration of
FlowSense simplifies diagram construction in VisFlow and improves the usability of
the dataflow framework. The effectiveness of FlowSense is validated by both case
studies and a formal user study.
The implementation of the VisFlow dataflow framework is available as open
source software. We provide an online demo of the system and comprehensive
documentation to support the reproducibility of the system.
Contents

Vita
Acknowledgements
Abstract
Contents
List of Figures
List of Tables

1 Introduction

2 Background and Related Work
2.1 Visualization Libraries and Frameworks
2.2 Dataflow Diagrams
2.3 Computational Dataflow Systems
2.4 Dataflow Visualization Systems (DFVS)
2.5 Subset Dataflow

3 Subset Flow
3.1 Input Tabular Data
3.2 Diagram Elements
3.2.1 Nodes, Ports and Edges
3.2.2 Data Primitives: Subsets and Constants
3.2.3 Visual Properties
3.2.4 Node Categories
3.3 Interactions
3.4 Data Immutability
3.5 Data Schema
3.6 Heterogeneous Data
3.6.1 Link Between Heterogeneous Tables
3.6.2 Visualization for Heterogeneous Input
3.7 Diagram Example

4 VisFlow Framework Implementation
4.1 Overview
4.2 System Implementation
4.2.1 Application Stack
4.2.2 Computation
4.2.3 Component Inheritance
4.3 VisMode Dashboard
4.4 Reproducibility
4.5 Case Studies
4.5.1 Gene Regulatory Network Analysis
4.5.2 Baseball Pitch Analysis

5 Extended Subset Flow
5.1 Extended Model
5.2 Node Type Extensions
5.2.1 Script Editor
5.2.2 Data Reservoir
5.2.3 Series Transpose
5.3 Case Studies
5.3.1 Evacuation Dataset Visualization
5.3.2 k-Means Clustering Visualization
5.3.3 Model Training Visualization
5.4 Discussions

6 FlowSense: A Natural Language Interface
6.1 Related Work
6.1.1 NLIs for Data Visualizations
6.1.2 Semantic Parsing
6.2 Design Goal
6.3 Semantic Parsing
6.3.1 VisFlow Functions
6.3.2 Grammar
6.3.3 Query Pattern
6.4 Query Execution
6.4.1 POS and Special Utterance Tagging
6.4.2 Keyword Classification
6.4.3 Query Pattern Completion
6.4.4 Diagram Update
6.4.5 Ambiguity
6.4.6 Error Recovery
6.5 User Interface
6.6 Query Auto-Completion
6.7 Case Study
6.7.1 Speed Reduction Study
6.7.2 Diagram Reproduction Study
6.8 User Study
6.8.1 Experiment Design
6.8.2 Results
6.9 Discussions

7 Conclusions and Future Work

A VisFlow Resources
List of Figures

3.1 Illustration of the key concepts of the VisFlow subset flow model. Node types are labeled in the diagram. The subsets are denoted by letter IDs within brackets. Assigned visual properties are shown in red font color. Transmitted constants are shown in gray.

3.2 Data schema behind the subset flow model. Whenever a subset passes through a visual editor, virtually a new copy of the subset is generated with the visual properties possibly modified. Each visualization node renders its input subset according to the visual properties carried by the data items in the subset. Immutable original table columns are shown in light gray.

3.3 Linking heterogeneous tables using a linker.

3.4 Heterogeneous input example: a network visualization that takes two heterogeneous subsets as inputs, for nodes and edges respectively. There are visual property mappings of node weights to node sizes, and edge weights to edge colors. Sizes bound to the nodes are shown on top of the node IDs. Colors bound to the edges are denoted by the font colors of the edge IDs. The network correspondingly renders the nodes and edges, and has four outputs: selected nodes, forwarded nodes, selected edges, forwarded edges.

3.5 An example subset flow diagram implemented by the VisFlow framework. The user edits the dataflow diagram that corresponds to an interactive visualization web application shown in the VisMode dashboard. The model years of the user-selected outliers in the scatterplot (b) are used to find all car models designed in those years (1981, 1982), which form a subset S that is visualized in three metaphors: a table for displaying row details (h), a histogram for horsepower distribution (i), and a heatmap for multi-dimensional visualization (j). The selected outliers are highlighted in red in the downflow of (b). The user selection in the parallel coordinates is brushed in blue and unified with S to be shown in (h), (i), (j). A heterogeneous table that contains the MDS coordinates of the cars is loaded in (k) and visualized in the MDS plot (o), with S being visually linked in yellow among the other cars.

4.1 The VisFlow framework interface. Nodes are created in a drag-and-drop manner onto the infinitely large canvas. The node panels list the node types supported by the VisFlow framework. These node types do not include the extended subset flow extensions introduced in Chapter 5.

4.2 Regulatory network analysis workflow and its corresponding VisMode dashboard generated in the VisFlow framework: (i) the dataflow diagram; (ii) the interactive VisMode visualization dashboard produced by the dataflow diagram.

4.3 Applying VisFlow to baseball pitch data analysis. (i) The pitching movement; (ii) the Statcast coordinate system illustration (from [32]); (iii) the analysis environment for the baseball pitch analysis generated by VisFlow; (iv) plots of pitching movements of 12 players. A categorical color scale is applied to render each player's pitches in a uniform color.

5.1 Example of a data mutation boundary created in the extended subset flow model. The two nodes with black borders are data-mutating nodes. The one at the top performs mpg aggregation for each car origin. The one at the bottom joins the two input tables. The system uses node borders to help the user identify where the data get changed, and the node groups in which the original subset flow applies.

5.2 Limitation of an acyclic dataflow diagram. One layer of nodes has to be created for each iteration of network expansion.

5.3 The data reservoir holds all the changes to the edges. When the user releases the changes, those edges are merged into the upflow edges so that the network visualization may include the new edges.

5.4 A snapshot of the JavaScript written in a script editor to render the floor plan of the evacuation data. The script manipulates the DOM tree that is rooted at content.

5.5 Using a series player and a script editor to visualize the evacuation data from VAST Challenge 2008.

5.6 Using an MDS plot and a cluster label distribution plot to visualize the iterations of the k-means clustering algorithm: (i) the dataflow diagram; (ii) visualizations of the clustering algorithm iterations.

5.7 Applying a combination of extended model nodes to visualize a multi-layered perceptron training process. Using a stateful script editor we show metric value changes over time in a line chart. By subset flow diagram highlighting, we can highlight the incorrectly predicted test data in the MDS plot and the histogram.

6.1 An example FlowSense query and its execution. The derivation of the query is shown as a parse tree in the middle. The sub-diagram expanded by the query is illustrated at the bottom. The five major components of a query pattern are underscored. Each component and its relevant parts in the parse tree and the dataflow diagram are highlighted by a unique color. The result of executing this query is to create a parallel coordinates plot on the columns mpg, horsepower, and origin, with input from the selection port of the node MyChart.

6.2 FlowSense query execution phases. POS and special utterance tagging is performed first. Special utterances describing the data columns and diagram nodes are identified and can be matched against utterance placeholders. Keyword classification is applied to identify important utterance implications such as the intention to call a specific VisFlow function. FlowSense attempts to complete the query pattern if missing information can be filled using default values. Upon an execution failure the user is notified and asked to update the query.

6.3 The FlowSense user interface and query auto-completion. Tagged special utterances are shown in colored tags. (i) Manually updating special utterance tagging using a dropdown in the FlowSense input box; (ii) special utterance token completion; (iii) query auto-completion.

6.4 Using FlowSense to study the overall speed reduction trend of NYC streets with different speed limits. The queries are applied in the numbered order. The resulting visualization shows a time series for the average speed of road segments, aggregated by unique speed limits. The smaller histogram snapshot shows the speed histogram without color encoding before step 3.

6.5 Applying FlowSense for a comparative study on the street speed changes between the West Village slow zone (blue) and the Alphabet City slow zone (red). FlowSense processes the rich dataflow context and allows the user to reference dataflow diagram elements at different specificity levels, e.g. with node types, node labels, or implicit references. The NL queries are executed in the numbered order.

6.6 Correctness distributions of the participants' answers to each of the user study tasks. "ok" represents a correct answer. "wa" indicates a wrong answer. "unanswered" means the participant did not find a proper answer and skipped the task.

6.7 Completion time box plot for each step of the user study. Four outliers are not shown: Task 1 (2550), Task 3 (109, 119, 212).

6.8 Number of rejected queries for different rejection reasons. The colors of the bars indicate the relative difficulty of resolving a rejection.
List of Tables

5.1 Series transpose example that converts the column-major series in Table (a) into the row-major series in Table (b) based on the key column "Country" and the series columns of the years. The cell values in Table (a) are stored in the third column of Table (b). Table (b) has 9 rows that are not all shown.

6.1 Six major categories of VisFlow functions. These sub-diagrams are frequently used to compose more sophisticated diagrams that address analytical tasks.

6.2 VisFlow Survey Result

6.3 FlowSense Survey Result
Chapter 1
Introduction
Data analysis requires an integrative environment that supports both data
presentation and interactive queries. The visual information seeking mantra and
task taxonomy [61] summarize several tasks that are frequently performed in visual
data analysis, such as overview, zoom, filter, and details-on-demand. In practice,
these tasks are often performed iteratively and progressively. Therefore, it is
desirable that the analysis can be carried out in a tool that is flexible and able to
be adapted for custom queries. Though a dedicated visual analytics tool may be
very effective at solving a particular domain problem, building a domain-specific
application is often costly and time-consuming. Adaptive and customizable
visualization tools are relatively scarce, and those that exist are typically
complicated and hard to use, especially for novice users.
In a dataflow system, the user draws a dataflow diagram that specifies system
behavior, such as how data are selected, filtered, and visualized. Updates on the
dataflow diagram immediately change the system functionality, so that the analysis
can be flexible and customizable. Additionally, dataflow diagrams are intuitive
representations of the system behavior and are very effective at capturing the
workflow design decisions that an analyst needs to assess and reflect frequently.
Despite the flexibility and intuitiveness, existing dataflow systems typically have
some drawbacks for visual data exploration:
• Many dataflow systems are designed for data processing and computations [10,
20]. Those systems present dataflow diagrams that are visual abstractions
of programming, which require the user to have a programming background
and understand the correspondence between diagram inputs, outputs and the
underlying program arguments. Those systems have high learning overhead,
and produce dataflow diagrams that are often too complex to read.
• Dataflow analytics platforms [26, 29] are often not specifically designed for
visualizations. Visualizations in those systems are mostly statistical summaries
that do not support interactive data exploration. Besides, those systems
typically do not have reactive feedback upon dataflow changes, as explicit
re-execution is required to update the visualizations due to the complex
nature of the operations an analytics platform performs.
• Dataflow visualization systems (DFVS) [5, 23, 25, 48] mostly aim at generating
rendering pipelines, e.g. for volume rendering. Interactivity is often limited
for navigation within rendered views. The flexibility advantage of using
dataflow is exploited only in the application construction phase, rather than
in the visual data exploration phase.
We seek to design a dataflow framework that excels in both usability and
flexibility, and addresses the customization needs of visual data exploration. In this
work we design VisFlow, a web-based visualization framework and its subset flow
dataflow model. The subset flow model builds on the idea of subset manipulation in
dataflow [55], and extends it to interactive visual data exploration. Our flow model
requires the data transmitted within the flow to be subsets of table rows from
immutable input tabular data. The advantage of this model is that visual properties
of data items can be unambiguously defined when subsets are transmitted, so that
brushing and linking, which are essential to visual data exploration, can be easily
achieved. We perform real-time evaluations of the system modules in order to
have visualizations update reactively on user interactions. As a result, the user
may directly select the data visualized and perform interactive queries through
embedded visualizations, saving the cost of explicit re-execution. In VisFlow,
user selections are explicit outputs of the system modules, so that subsets can be
easily tracked, compared, and understood. With a focus on data subsets, our flow
model also reduces the diagram complexity and mitigates the learning overhead of
computational dataflow. In Chapter 3, we give the definitions of the subset flow
model, its data immutability constraints and design philosophy.
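The core idea can be sketched in a few lines of TypeScript (an illustrative toy, not VisFlow's actual API; the table contents, node functions, and property names are all hypothetical): nodes exchange subsets of row indices into an immutable input table, and visual properties travel with the subset rather than being written into the data.

```typescript
// Toy sketch of the subset flow model (hypothetical names, not VisFlow's API).
// The input table is immutable; nodes exchange subsets of row indices, and
// visual properties (here just a color) travel with the subset instead of
// being written into the table.

type Row = { model: string; mpg: number; year: number };
type Visuals = { color?: string };

interface Subset {
  rows: number[];                 // indices into the immutable input table
  visuals: Map<number, Visuals>;  // per-row visual properties
}

const table: readonly Row[] = [
  { model: "chevelle", mpg: 18, year: 1970 },
  { model: "corolla", mpg: 32, year: 1981 },
  { model: "civic", mpg: 36, year: 1982 },
];

// A filter node outputs the subset of its input rows matching a predicate.
function filterNode(input: Subset, pred: (r: Row) => boolean): Subset {
  const out: Subset = { rows: [], visuals: new Map() };
  for (const i of input.rows) {
    if (pred(table[i])) {
      out.rows.push(i);
      out.visuals.set(i, input.visuals.get(i) ?? {});
    }
  }
  return out;
}

// A visual editor node assigns a color to every row of the subset, leaving
// the underlying table untouched, so downstream views can render the brush.
function colorNode(input: Subset, color: string): Subset {
  const out: Subset = { rows: [...input.rows], visuals: new Map() };
  for (const i of input.rows) out.visuals.set(i, { ...input.visuals.get(i), color });
  return out;
}

const all: Subset = { rows: [0, 1, 2], visuals: new Map() };
const recent = filterNode(all, r => r.year >= 1981); // rows 1 and 2
const brushed = colorNode(recent, "red");            // highlighted selection
```

Because each node returns a fresh subset object while the table itself is never modified, a downstream visualization can always resolve a row's visual properties unambiguously, which is what makes brushing and linking straightforward in this model.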
We implement the VisFlow framework using a modern client-server stack (TypeScript,
Vue.js, Express, MongoDB). We design an intuitive drag-and-drop user interface
for dataflow diagram editing. The VisMode dashboard utility is also integrated to
allow easy sharing and presentation of the results of the dataflow. In Chapter 4,
we introduce the implementation details of VisFlow and its user interface design.
We showcase the application of the implemented framework with two case studies
in real-world data analysis scenarios.
Despite the advantages of better interactivity and lower learning overhead, the
subset flow has its limitations. First, all the data are immutable in a subset flow,
which may pose restrictions on the analysis, e.g. aggregations must be performed
outside VisFlow in order to analyze sums and averages. Yet we observe that data
processing and computation are often desired in an integrative analysis environment.
Second, VisFlow uses an acyclic dataflow diagram that does not allow loops, as loops
may introduce execution and dependency ambiguity. This makes some tasks difficult
to achieve in VisFlow, such as iteratively expanding a network visualization by
adding incident edges into the graph. We address these limitations by proposing
the extended subset flow model in Chapter 5. In the extended model, nodes are
allowed to mutate the data and create data mutation boundaries. The subset flow
applies to each group of nodes within the same data mutation boundary, so that the
interactivity benefit from the original subset flow can be preserved while we increase
the data processing capability of the system. We support a script editor node that
enables generalized data editing and generation within the dataflow. The script
editor may also be used to perform custom rendering and DOM manipulation. We
also introduce a data reservoir node that may save its input subset and send the
subset back to the earlier dataflow upon user-initiated backward data propagation.
The data reservoir allows the system to update the same visualization iteratively,
which was previously not possible in an acyclic dataflow. We describe the details
of the extended subset flow model in Chapter 5, and exemplify its advantages and
applications with several case studies.
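One way to picture the boundary computation is the following sketch (the names and the simplified single-upstream rule are assumptions for illustration, not VisFlow's implementation): a source node or a data-mutating node starts a new boundary, and any other node inherits the boundary of its upstream node.

```typescript
// Sketch of deriving data mutation boundaries (illustrative; the rule and
// names are assumptions, not VisFlow's implementation). A source node or a
// data-mutating node starts a new boundary; any other node inherits the
// boundary of its first upstream node. Subset-flow interactivity then
// applies within each boundary group, but not across a mutating node.

interface DiagramNode {
  id: string;
  mutates: boolean;   // e.g. an aggregation or a table join
  inputs: string[];   // upstream node ids (the diagram is acyclic)
}

function boundaries(nodes: DiagramNode[]): Map<string, string> {
  const byId = new Map<string, DiagramNode>();
  for (const n of nodes) byId.set(n.id, n);
  const bound = new Map<string, string>();
  const resolve = (id: string): string => {
    const cached = bound.get(id);
    if (cached !== undefined) return cached;
    const n = byId.get(id)!;
    const b = n.mutates || n.inputs.length === 0 ? id : resolve(n.inputs[0]);
    bound.set(id, b);
    return b;
  };
  for (const n of nodes) resolve(n.id);
  return bound;
}

// cars -> filter -> aggregate (mutates) -> chart
const diagram: DiagramNode[] = [
  { id: "cars", mutates: false, inputs: [] },
  { id: "filter", mutates: false, inputs: ["cars"] },
  { id: "aggregate", mutates: true, inputs: ["filter"] },
  { id: "chart", mutates: false, inputs: ["aggregate"] },
];
const groups = boundaries(diagram);
// "cars" and "filter" share one boundary; "aggregate" starts a new boundary
// that "chart" belongs to.
```

In this toy diagram, the original subset flow semantics apply within the {cars, filter} group and within the {aggregate, chart} group, but not across the aggregation node, mirroring how the extended model confines interactivity to each boundary.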
To further reduce the learning overhead and improve the usability of VisFlow,
we design FlowSense, a natural language interface for the VisFlow framework.
The natural language interface allows the user to edit dataflow diagrams using
plain English. We first identify a set of commonly performed diagram editing
operations within VisFlow, and then employ state-of-the-art semantic parsing
techniques to map natural language input to diagram editing operations. To make
the semantic parser aware of the diagram content, we propose a grammar design
with special utterance placeholders. The special utterances related to the diagram
context are explicitly identified in the user interface and accepted by the utterance
placeholders when parsed. A query auto-completion algorithm based on template
matching is added to further enhance the usability of FlowSense. FlowSense may
not only automate and speed up the diagram construction, but also help the user
learn common diagram construction patterns in VisFlow. We demonstrate the
effectiveness of FlowSense using one case study, one diagram reproduction study,
and a formal user study. The design of FlowSense and its studies are discussed in
Chapter 6.
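The role of the utterance placeholders can be illustrated with a toy tagger (placeholder token names such as r_column are hypothetical; FlowSense's actual tagging is part of its semantic parser): occurrences of column names and node labels from the current dataflow context are replaced by placeholders, so the grammar never needs to mention concrete data.

```typescript
// Toy illustration of special utterance tagging (placeholder names like
// "r_column" and "r_node" are hypothetical). Column names and node labels
// from the current dataflow context are replaced by placeholders, so the
// parser grammar stays independent of the particular dataset and diagram.

const columns = ["mpg", "horsepower", "origin"]; // from the loaded table
const nodeLabels = ["MyChart"];                  // from the current diagram

function tag(query: string): { tagged: string; slots: string[] } {
  const slots: string[] = [];
  const tagged = query
    .split(/\s+/)
    .map(token => {
      if (columns.includes(token)) { slots.push(token); return "r_column"; }
      if (nodeLabels.includes(token)) { slots.push(token); return "r_node"; }
      return token;
    })
    .join(" ");
  return { tagged, slots };
}

const q = tag("show mpg and horsepower of MyChart in a parallel coordinates plot");
// q.tagged: "show r_column and r_column of r_node in a parallel coordinates plot"
// q.slots:  ["mpg", "horsepower", "MyChart"]
```

The grammar then only has to recognize the placeholder tokens, while the recorded slots tell the executor which concrete columns and nodes the user referred to.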
Some of the results presented in this dissertation have been published in refereed
journals and conference proceedings. Chapter 3 and Chapter 4 include work
from [81]. Chapter 6 includes work to be published. The implementation source
code of the VisFlow framework is available as an open source project on GitHub.
We also provide an online demo of the system with comprehensive documentation.
See Appendix A for a detailed list of the VisFlow resources.
Chapter 2
Background and Related Work
This work is rooted in research advancement in visualization frameworks and
dataflow systems. In this chapter, we first cover related general-purpose visualization
frameworks. We then briefly survey existing dataflow systems by their
major application scenarios and functionality. We categorize the systems into
computational dataflow systems and dataflow visualization systems. Specifically,
we highlight the works that also transmit subsets in dataflow.
2.1 Visualization Libraries and Frameworks
Visualization libraries such as the InfoVis Toolkit [14] and D3 [7, 11] provide
script support to author visualizations. Those tools are powerful but require the
user to program. Declarative languages such as Vega-Lite [59] allow the user to
create pre-defined visualizations with custom data using simpler JSON specifications.
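For example, a minimal Vega-Lite-style specification declares a scatterplot as data plus encodings, with no rendering code (shown here as a TypeScript object literal; the dataset path and field names follow the common cars example and are illustrative):

```typescript
// A minimal Vega-Lite-style chart specification, written as a TypeScript
// object literal for illustration. The chart is fully declared by its data
// source, mark type, and field-to-channel encodings; the user writes no
// rendering code.
const spec = {
  data: { url: "data/cars.json" }, // illustrative dataset path
  mark: "point",
  encoding: {
    x: { field: "Horsepower", type: "quantitative" },
    y: { field: "Miles_per_Gallon", type: "quantitative" },
    color: { field: "Origin", type: "nominal" },
  },
};
```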
Many visualization frameworks aim at providing a general-purpose environment that
is flexible and customizable for visual data analysis, without asking the user to write
scripts or text specifications. Those frameworks are also referred to as visualization
construction tools. Unlike visualization libraries, visualization frameworks do not
require the user to program the visualization and build an analytical solution
from scratch. Tableau [38, 66] provides a drag-and-drop interface for the user to
define mappings from data values to visualization attributes (namely columns and
rows) required by visualization metaphors. Quadrigram [53] creates a visualization
dashboard by adding charts and controllers to the dashboard area. iVisDesigner [54]
allows the user to add data elements to the canvas, specify their visual mappings,
and define chart area and axes. It provides an expressive environment to author
customizable visualizations. Lyra [58] employs a similar concept but with implicitly
pre-defined chart axes. The user drags graphical elements onto the Lyra canvas and
associates data attributes with those elements. VisFlow, like the other visualization
frameworks, aims at providing a general-purpose environment that is flexible and
customizable. It uses a dataflow diagram to define how data are transmitted,
processed, and visualized within the system. Compared with other authoring tools,
VisFlow applies a dataflow diagram to present a clearer and more explicit view of
data transmission and manipulation. Consequently, the user has more control over
the system behavior, in that VisFlow not only creates visualizations, but also helps
define the analytical logic behind the visual data exploration.
On the other hand, modern data analysis often uses a notebook environment.
Popular examples include Jupyter [52] and the more recent Observable [46]. Dataflow systems provide a more general workflow than notebooks, which follow a single linear sequence of code snippets. Dataflow and notebooks thus offer different data analysis and exploration solutions: dataflow systems pass inputs and outputs explicitly, while a variable defined in a notebook is global to the page. Explicit data passing may help the user better interpret changes to the
data within the system in a step-by-step manner.
2.2 Dataflow Diagrams
Dataflow is a diagrammatic representation of the relations between system
components. It originated as a graphical method for analyzing system structures [18].
VisFlow uses a dataflow diagram to specify data transmission between modules, so that the transmission direction is unique and data operations can be
executed without ambiguity. It is worth distinguishing the dataflow diagrams in this work from visually similar diagrams that have different meanings and purposes. The dataflow in VisFlow differs from illustrative flow diagrams, where the name “data flow diagram” (DFD) may carry other meanings [18]. For example, in information system modeling a DFD only
describes how modules communicate with each other and there may be loops in the
diagram. Some visualization techniques use graph-based relational diagrams, which
are not dataflow. For example, Domino [21] presents a novel metaphor to link
visualizations for identifying table subset relations, which visualizes bidirectional
relational edges. North et al. [45] proposed a visualization schema for relating
multiple-view visualization with database schema, in which the edge is an analogy
of database table join. The Improvise system [74] enables the user to specify
coordinations between variables, controls and rendering, which are illustrated in a
coordinated graph. Liu et al. [36] proposed a network-based analysis for tabular data,
in which data entities (items) are nodes and edges show their weighted relations.
Despite the visually similar diagram appearances, none of the diagrams used by
the above systems are dataflow diagrams, because the edges in those diagrams
represent or declare relations, but do not define the direction in which data are
transmitted within the system. Additionally, some systems feature an implicit dataflow with a single analytical or executional path. Examples include Victor’s creative demo of drawing dynamic visualizations [70] and the Lyra [58] system. In this work, however, we focus on flow flexibility and multiple flow branches.
2.3 Computational Dataflow Systems
Dataflow models have been widely adopted to model computations [22, 43]. In
domains such as parametric modeling [20] and signal and media processing [10, 71], dataflow is used to model systems effectively. Dataflow systems may provide analysis capabilities including data transformation, data filtering, statistics computation, and visualization. For example, the user is able to employ a comprehensive set of computational nodes to perform machine learning and data mining in the IBM SPSS Modeler [26], KNIME [29], and Orange [47]. Kepler [37] manages scientific workflows where analytical steps are modularized. Yahoo Pipes [75] (shut down in 2015) used dataflow to process web content. Taverna [77] provides workflows for integrating web services and local tools for bioinformatics.
The aforementioned systems are designed for general-purpose computation and
analysis. Visualization capability and queries by user selections are lacking in
those systems. Most of these systems do not emphasize the interactive queries that
are possible in plots, as visualization nodes have no outputs and serve only as a
summary of the computation results or statistics. Computational dataflow systems
often yield a complicated flow diagram due to the complexity of computations. In
this work, rather than aiming at general computations, we focus on enhancing the
user’s interactive control over the visualized data in the dataflow. In particular, achieving interactive analysis with brushing and linking in the above systems is not straightforward, because computational modules may generate arbitrary derived data that introduce ambiguity when tracing data items. The dataflow model of
VisFlow attempts to overcome this limitation.
2.4 Dataflow Visualization Systems (DFVS)
Researchers have attempted to use dataflow to model data visualizations in
order to provide better flexibility in system functionality. Compared with bespoke
visualization systems, dataflow visualization systems are more general-purpose and
may adapt to a larger variety of analytical tasks.
There are a large number of visualization systems based on dataflow. Most
systems effectively use dataflow to construct flexible rendering pipelines. Early
systems emerged mostly for the interactive construction of scientific problem solving environments [78], e.g. steering geometric modeling and performing volume
rendering [16, 48, 55, 69]. The Application Visualization System (AVS) [69],
SCIRun [48, 49], ConMan [23], VISSION [64] and SmartLink [65] are the seminal
works for dataflow visualization. In these systems, program modules are provided
as modular blocks to be interconnected to achieve tunable custom rendering. More
recently, Ross et al. [57] proposed a dataflow workspace called HIVE for exploring
multi-dimensional scaling algorithms. VisTrails [5] generates multi-view visualizations from the specification of a pipeline and provides the interface to ease
the manipulation and management of different pipelines. Voreen [41] provides a
dataflow environment for ray-casting-based volume rendering.
In the above systems, the advantages of dataflow are mostly exploited in the
application construction phase to perform a specific type of rendering or algorithm.
There is a lack of interaction support on the rendered result; e.g., in the case of volume rendering only view navigation is provided. Due to the relatively heavy rendering workload, explicit re-execution is often required to update visualizations. Additionally, modifying the pipelines often requires expert knowledge of the system modules. All these limitations make the above DFVSs less suitable for
visual data exploration, where information visualization and visual analytics are
more emphasized. VisFlow seeks to provide a dataflow approach with simpler usage
for interactive data analysis, rather than constructing rendering pipelines. In a similar spirit, Waser et al. [73] present an effective dataflow design that supports interactive
flood simulation steering, in which the parametric connections fit the particular
domain. Instead of focusing on a particular domain, VisFlow has a subset flow
model that is general-purpose and can be applied to any tabular data.
2.5 Subset Dataflow
Connecting programming modules (e.g. VTK in [5]) in a dataflow often results
in complex flow diagrams that are hard to read. One way to simplify the diagrams is to constrain the data types transmitted in the dataflow. One solution is to transmit only data subsets. The Waltz system [55, 56] passes subsets between system modules for volume segmentation and rendering. It produces a simpler tree-form
dataflow structure and a simplified visualization workflow. Dataflow systems with
subsets are relatively easier to follow and understand. They also naturally represent
sequential steps of data filtering. ExPlates [28] presents an information visualization
workflow system that supports interactive subset extraction by expanding the flow
diagram upon user selection. ExPlates also shows embedded visualizations in the
diagram nodes that react to diagram changes.
In this work we extend the subset dataflow concept to allow for interactive
tabular data analysis. The Waltz system essentially targets volume rendering, in which subset filtering/slicing and view displays are only for volume data. Compared with other information visualization dataflow systems like ExPlates, the VisFlow model makes a key distinction in that it does not generate derived data within the flow, so that subset brushing and linking are unambiguously defined. In contrast, ExPlates performs table join operations and thus does not sufficiently support visual linking of data items, as there is no straightforward way to represent data items by visual properties. With our model constraint we develop a corresponding set of node categories, data primitives and transmission rules for VisFlow, which are specifically tailored to assist the user in editing, comparing, and understanding tabular data subsets.
Furthermore, the VisFlow model is able to produce a simpler dataflow diagram.
We believe the proposed model is well suited for visual data exploration, and to the best of our knowledge the advantages of user interaction that works exclusively with tabular subsets in a dataflow framework have not yet been studied in the literature.
Chapter 3 gives a detailed introduction to the subset flow model employed by
VisFlow.
Chapter 3
Subset Flow
This chapter introduces the VisFlow dataflow model: the subset flow model.
We discuss the input data of the model, and then define the components in its
dataflow diagram, the interaction methods, and the data immutability requirement
of the model.
3.1 Input Tabular Data
The subset flow model aims at visualizing and analyzing tabular data. The tabular data used in the subset flow matches the entity-relationship database table definition. We consider each table row to be a meaningful data item. A group of data items forms a data subset.
Each table in the subset flow is regarded as a single, independent dataset. The
type of a table is determined by its column names and column data types (number,
string, date). Each data item corresponds to exactly one row of one input table.
Multiple tables of different types can be loaded into the subset flow at the same time to describe different aspects of the same set of data entities. We discuss how
heterogeneous data are processed in Section 3.6.
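The notion of a table type can be sketched in code. The following Python snippet is an illustrative sketch only, not VisFlow's implementation (VisFlow is a web framework); the column names and helper functions are hypothetical.

```python
# Sketch: a table's type is determined by its column names and column
# data types (number, string, date), as described for subset flow input.
from datetime import date

def infer_column_type(values):
    """Classify a column as 'number', 'string', or 'date'."""
    if all(isinstance(v, (int, float)) for v in values):
        return "number"
    if all(isinstance(v, date) for v in values):
        return "date"
    return "string"

def table_type(columns):
    """The type of a table: the tuple of (column name, column data type)."""
    return tuple(
        (name, infer_column_type(values)) for name, values in columns.items()
    )

cars = {"name": ["chevrolet", "buick"], "mpg": [15, 18]}
sales = {"name": ["chevrolet", "buick"], "sale": [3, 4]}
print(table_type(cars))  # (('name', 'string'), ('mpg', 'number'))
assert table_type(cars) != table_type(sales)  # two distinct table types
```

Two tables with identical column names and types would share the same type; here the `sale` column makes the second table a different type.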
3.2 Diagram Elements
This section introduces the diagram element definitions in the VisFlow dataflow.
Figure 3.1 illustrates the model concepts.
3.2.1 Nodes, Ports and Edges
A dataflow diagram in VisFlow consists of nodes and edges. A node is a VisFlow
module that loads, processes, filters or visualizes the data. Nodes expose input
and output ports that accept and transmit incoming and outgoing data. Input
and output ports are shown on the left and right side of a node respectively in
Figure 3.1. An edge is a directed connection from an output port to an input port.
A single port (shown as a dot) accepts at most one edge. A multiple port (shown as a triangle) has no restriction on the number of edges. Topologically, a subset flow diagram is
a directed acyclic graph (DAG).
Downflow and upflow. For the convenience of discussion, we define downflow(x)
and upflow(x) to be the set of nodes that are reachable from node x following the
edges, in their original and reverse directions respectively.
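The downflow/upflow definitions amount to plain graph reachability on the DAG. The sketch below is illustrative only; the node names and helper functions are hypothetical, not part of VisFlow.

```python
# Sketch: downflow(x) and upflow(x) are the nodes reachable from x by
# following edges in their original and reverse directions, respectively.
from collections import defaultdict

def reachable(edges, start):
    """Nodes reachable from `start` following directed edges (start excluded)."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    seen, stack = set(), [start]
    while stack:
        for v in adj[stack.pop()]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def downflow(edges, x):
    return reachable(edges, x)

def upflow(edges, x):
    # Reverse every edge, then reuse the same reachability.
    return reachable([(v, u) for u, v in edges], x)

edges = [("source", "vis1"), ("vis1", "filter"), ("filter", "vis2")]
assert downflow(edges, "vis1") == {"filter", "vis2"}
assert upflow(edges, "vis2") == {"source", "vis1", "filter"}
```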
3.2.2 Data Primitives: Subsets and Constants
VisFlow only transmits two types of primitive elements in its dataflow: subsets
and constants. Ports are categorized exclusively into subset ports (white background) that transmit subsets, and constant ports (gray background) that transmit constants. Ports must be type-matched when connected by edges.
Figure 3.1: Illustration of the key concepts of the VisFlow subset flow model. Node types are labeled in the diagram. The subsets are denoted by letter IDs within brackets. Assigned visual properties are shown in red font color. Transmitted constants are shown in gray.
The definitions of subsets and constants are as follows:
• Subsets. A subset transmitted by the dataflow is a collection of table rows
from some input table as defined previously. A subset virtually preserves all
the column information of its source table and the attribute values of its data item members, both of which can be retrieved by a node receiving the subset.
A subset is denoted by a pair of brackets containing the IDs of its member
items in Figure 3.1.
• Constants. We define constants to be an ordered list of constant values.
The values are either specified by user input directly or extracted from
attribute values of data items. Constants can be numbers, strings or dates.
In Figure 3.1, two string constants “amc” and “buick” are extracted from the
names of the user selected data items (a and b) at the constants extractor
(see Section 3.2.4).
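The two primitives can be sketched as simple data structures. This is an illustrative sketch under stated assumptions, not VisFlow's implementation; the class and variable names are hypothetical, and the table contents follow the Figure 3.1 example.

```python
# Sketch of the two subset-flow primitives: a subset (row references into
# an immutable source table) and constants (an ordered list of values).

class Subset:
    """A collection of row IDs referencing a single source table."""
    def __init__(self, table, item_ids):
        self.table = table             # the immutable source table
        self.item_ids = set(item_ids)  # member data items

    def attribute(self, item_id, column):
        # Attribute values remain retrievable from the source table.
        return self.table[item_id][column]

table = {
    "a": {"name": "amc", "mpg": 15},
    "b": {"name": "buick", "mpg": 18},
}
selection = Subset(table, ["a", "b"])

# Constants extracted from the selection, as a constants extractor would do:
constants = [table[i]["name"] for i in sorted(selection.item_ids)]
assert constants == ["amc", "buick"]
assert selection.attribute("a", "mpg") == 15
```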
3.2.3 Visual Properties
Data items in subsets can be associated with visual properties in the subset flow.
Those properties are transmitted together with the data items. Each data item
has a visual property object that maps a set of visual parameters to their assigned
values, e.g. { color: red, size: 5 }. VisFlow currently supports five types of visual
properties: color, size, border color, border width, and opacity. Visual properties are set and modified by the nodes along the flow, so the same data item may have different visual properties at different nodes. Intuitively, this process is like dyeing.
Once a data item passes through a visual editor, its visual properties may get
mutated. The data item thus carries the new visual properties in the downflow.
The visual editor (see Section 3.2.4) in Figure 3.1 assigns a red color property to the
data items in the subset {a, b}, so that a and b have red color throughout their
downflow. Visual properties are the only properties of data items that can be
mutated along the dataflow.
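The dyeing behavior can be sketched as follows. This is a minimal illustration, not VisFlow's code; the helper names are hypothetical.

```python
# Sketch: a visual editor outputs a copy of its input subset with the
# given visual-property assignment merged in, leaving upflow data
# untouched -- only visual properties ever mutate along the flow.
import copy

def visual_editor(subset, assignment):
    """Return a new copy of `subset` with `assignment` applied to each item."""
    out = copy.deepcopy(subset)
    for props in out.values():
        props.update(assignment)
    return out

upstream = {"a": {}, "b": {}}                    # no visual properties yet
dyed = visual_editor(upstream, {"color": "red"})

assert dyed == {"a": {"color": "red"}, "b": {"color": "red"}}
assert upstream == {"a": {}, "b": {}}            # the upflow copy is unchanged
```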
3.2.4 Node Categories
The nodes employed by a subset flow may be categorized based on their functionality. In particular, the subset flow employs a small number of node types to
achieve a relatively low learning overhead. We may categorize the nodes in the
subset flow into the following types:
• Data sources. Data sources load tabular data from data files. They do not
have input ports. A data source always produces a single output subset read
from a user-selected input table.
• Visualizations. Visualization nodes render the input subsets in visualization
metaphors. To facilitate the interactivity of dataflow visualizations, plots are
embedded in the visualization nodes by default. Interactive selection can be
made directly in an embedded visualization and sent to other nodes through
a dedicated selection port (shown as a square icon). In Figure 3.1, the user selects
a and b in Visualization 1. The selection port of Visualization 1 outputs
the selected subset {a, b}. Additionally, a visualization node also has a data pass-through forwarding port (a multiple port) that simply outputs its input. This redundancy is included to reduce diagram clutter. In
Figure 3.1 Visualization 1 passes through the entire input set {a, b, c, d} to
the downflow through the forwarding port. Otherwise the downflow nodes
must be connected to the data source to receive the entire input set.
Visualizations in the subset flow must always render the data items according
to their visual properties. Visualization 2 in Figure 3.1 renders data item
a in red color, as assigned by the upflow visual editor. Note that different
visualization metaphors may present the visual properties differently. While
a scatterplot renders the dots directly in the assigned colors of the data items,
a heatmap sets the font colors of row labels to the colors of the data items, so
as not to interfere with the color mapping used by the heatmap cells based
on the attribute values of the data items. The user is able to tune the visual
parameters of the visualizations through the user interface.
• Constants generators. Constants generators produce constants that are
used as filtering parameters. Constants can be either entered by the user,
or extracted from data. The constants extractor in Figure 3.1 extracts the
names of data items a and b (“amc” and “buick”), which are constants used
by the downflow attribute filters.
• Attribute filters. Filters examine attribute values of data items and perform
attribute filtering. Filtering parameters are constants and can be user specified, or retrieved from constants generators. The attribute filters in Figure 3.1 use
substring matching to find the items with names being either “amc” or “buick”.
An attribute filter may also be used to find the extrema in the data, or randomly sample an input subset on a user-selected sampling dimension.
• Visual editors. Visual editors assign visual properties to their input data items. The visual editor in Figure 3.1 assigns red color to its input subset
items a and b. A visual editor may also encode the attribute values of the input
subset using a mapped visual channel (e.g. a color scale in Figure 3.4). The
user may assign distinguishable visual representations to important subsets so
that they can be identified and linked across multiple plots. Visual properties
can be overwritten by downflow visual editors.
• Set operators. Set operator nodes take two or more subsets from the same table to produce a new subset using a mathematical set operation. The visual properties of the same data item are merged at a set operation node. The union
node in Figure 3.1 merges {a, b} with {a, b, c, d} and preserves the colors
of a and b. The last connected input subset has higher priority in case of a
visual property merge conflict.
When implementing the VisFlow framework, we designed a few example node
types for each category. It is possible to add new node types to the categories or
extend the categories on demand as long as the added nodes meet the subset flow
model requirements. Chapter 4 introduces our implementation of the framework.
For more details on node types and node usage, please refer to the VisFlow online
documentation in Appendix A.
3.3 Interactions
In the subset flow, data presented in the visualizations can be directly selected and extracted as subsets for further queries or manipulations. An important task
for visual analytics is to be able to perform brushing and linking:
• Brushing. The subset flow supports brushing by binding visual properties
to the subsets. A subset will be shown consistently according to its visual
properties. For example, the user selects items a and b from Visualization 1
in Figure 3.1 and passes them through the visual editor, which brushes the
items in red.
• Linking. The same subsets are automatically linked across nodes with the
same visual properties based on the nature of the subset flow. A downflow
visualization in Figure 3.1 (Visualization 2) receives the data items with
associated visual properties, so that a is highlighted in red. Note that
Visualization 2 does not necessarily need to apply the same visualization
metaphor as Visualization 1. Combined with the attribute filter, the example
flow diagram in Figure 3.1 effectively brushes and highlights the selected
items that satisfy 15 ≤ mpg ≤ 20.
3.4 Data Immutability
The name subset flow naturally comes from the fact that subsets are the major
primitives transmitted by the dataflow. Within a subset flow a data item from a
subset must correspond to one of the input table rows so that it can be uniquely
identified and given visual properties. Although most dataflow systems implicitly
support generating subsets, this correspondence constraint defines the unique
system behavior of the subset flow model, and makes it different from a general
dataflow capable of subset generation. More particularly, the VisFlow subset flow
does not produce derived data within the dataflow, such as a joined table. Such a model has two key advantages:
• Brushing and linking definition. By constraining the transmitted data
to be subsets, brushing and linking operations in the subset flow are defined
by the visual properties bound to the data items. The subset flow model
allows visual properties to be uniquely associated with data items throughout
the flow, and prevents the ambiguity of inheriting visual properties when new
types of data items are derived and generated, e.g. by a table join. It is not
straightforward to define the behavior of brushing and linking with derived
and mutated data. This is because there exist multiple possible results. For
example, when a data item colored red is joined with a heterogeneous data
item colored blue, the system cannot tell which color should be inherited.
Therefore it has to ask the user for a decision, which would consequently
increase the usage complexity and introduce confusion.
• User perception. With the complexity and confusion resulting from inheriting visual properties when derived data are involved, it is hard for the user to mentally trace, compare, and understand data subsets. The subset flow model
works exclusively with data subsets so as to improve the user’s understanding
of the data being visualized. Interactive queries are more intuitive as the user
will be able to tell which data items he/she is selecting, from which answers
to analytical questions can be derived. However, selecting derived table rows,
e.g. from a joined table, is less intuitive as there might exist arbitrary new
types of subsets with varying columns.
3.5 Data Schema
Intuitively, the visual properties associated with the data items can be regarded
as additional columns in which values are mutable. Values in these columns are
used differently from the original table columns to define how the data items are
presented in the visualizations. One may understand the subset flow model as follows: each node outputs a new copy of its input subset, and potentially mutates the visual property columns of the new copy. Each visualization node uses the visual properties in the subset copy it receives to render the subset.
When two subsets are merged, the same data items (identified by their row
indices from the original input table) possibly carry different visual properties.
Therefore the union set operator combines and possibly overwrites some of the visual properties carried by its input. The overwriting priority is defined by the connection
order of the incoming edges to the union node. Based on data immutability, only
subsets originating from the same input table can be merged.
Figure 3.2 illustrates the data schema concept behind the subset flow model and
how visual properties are merged. The first three columns shown in light gray are
from the original table and cannot be changed by any node in the subset flow. The
remaining three columns store the visual properties assigned. The four data items
pass through the first visual editor, which assigns red color and size 5. Visualization
1 shows the data at this point, in which all data items are presented in red color
and with size 5. Two data items a and b are then selected and they go through
a second visual editor, which assigns blue color to them. The visual property on
size is unchanged for these two data items. When these two data items are merged
back to the full set at the union node, their new blue color is kept because it has
higher priority. Visualization 2 shows the resulting subset. Every data item has size 5; a and b are in blue, while c and d retain the red color received from Visual Editor 1.
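The merge-priority rule can be sketched in a few lines. This is an illustrative sketch of the behavior described above, not VisFlow's code; the data follows the Figure 3.2 walkthrough and the function name is hypothetical.

```python
# Sketch: a union node merges subsets from the same table; when the same
# data item arrives on several inputs, the later-connected input's visual
# properties win the conflict.

def union(*inputs):
    """Union subsets; inputs later in the connection order win conflicts."""
    merged = {}
    for subset in inputs:  # connection order: earliest first
        for item_id, props in subset.items():
            merged.setdefault(item_id, {}).update(props)
    return merged

full_set = {i: {"color": "red", "size": 5} for i in "abcd"}
selected = {"a": {"color": "blue", "size": 5},
            "b": {"color": "blue", "size": 5}}

result = union(full_set, selected)  # `selected` connected last
assert result["a"]["color"] == "blue" and result["a"]["size"] == 5
assert result["c"]["color"] == "red"
```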
Figure 3.2: Data schema behind the subset flow model. Whenever a subset passes through a visual editor, virtually a new copy of the subset is generated with the visual properties possibly modified. Each visualization node renders its input subset according to the visual properties carried by the data items in the subset. Immutable original table columns are shown in light gray.
3.6 Heterogeneous Data
The subset flow supports heterogeneous data through links between heterogeneous tables, or through visualizations specifically designed for heterogeneous data:
3.6.1 Link Between Heterogeneous Tables
Heterogeneous data items can be linked based on a key column. A constants extractor is first used to extract the key values as constants from one table, and those keys are sent to an attribute filter to find, in the second table, the data items with these keys. The extracted keys relate heterogeneous data items based on their analytical meaning. The related data items can be further brushed and presented in linked visualization styles. In Figure 3.1, a second
data source loads a different type of table that can be filtered with the extracted
car names from the subset selected from the first table. In particular, the two tables describe different aspects of the same set of cars. By linking the two tables, one may find the sale numbers for the selected cars from the first table.
Extracting the key column values and then using those values to filter a second
table is a common way to relate heterogeneous data in the subset flow. For
convenience, we design the Linker node that combines the two steps in the VisFlow
framework. Figure 3.3 illustrates the usage of a linker. A linker internally extracts the key column values from the first table, and filters the second table based on
those keys. Using a linker makes the dataflow diagram simpler than using one
constants extractor and one attribute filter as shown in Figure 3.1.
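The linker's two internal steps can be sketched as follows. This is a minimal illustration of the described behavior, not VisFlow's code; the table contents mirror Figure 3.3 and the function name is hypothetical.

```python
# Sketch of a Linker node: extract key-column values from the selected
# subset of one table (the constants-extractor step), then filter a second
# table by those keys (the attribute-filter step).

def linker(selected, key_column, other_table):
    """Return the rows of `other_table` whose key matches the selection."""
    keys = {row[key_column] for row in selected}                    # extract
    return [row for row in other_table if row[key_column] in keys]  # filter

cars = [
    {"id": "a", "name": "chevrolet", "mpg": 15},
    {"id": "b", "name": "buick", "mpg": 18},
]
sales = [
    {"id": "x", "name": "chevrolet", "sale": 3},
    {"id": "y", "name": "buick", "sale": 4},
    {"id": "z", "name": "amc", "sale": 2},
]

linked = linker(cars, "name", sales)
assert [row["id"] for row in linked] == ["x", "y"]
```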
Figure 3.3: Linking heterogeneous tables using a linker.
3.6.2 Visualization for Heterogeneous Input
A visualization may directly render heterogeneous data. One example is a network visualization, in which both node and edge data are required.
Figure 3.4 illustrates a network (graph) visualization. A network visualization
accepts two input subsets, for the nodes and edges respectively. Nodes and edges
can be assigned different visual properties. In Figure 3.4 the node weights are
encoded by sizes, while the edge weights are encoded by colors from a red-green
color scale. The network renders the nodes and edges respectively according to their
visual properties. User selections and forwarding of nodes and edges are output
separately. Therefore a network node has four output ports.
Figure 3.4: Heterogeneous input example: a network visualization that takes two heterogeneous subsets as inputs, for nodes and edges respectively. There are visual property mappings of node weights to node sizes, and edge weights to edge colors. Sizes bound to the nodes are shown on top of the node IDs. Colors bound to the edges are denoted by the font colors of the edge IDs. The network correspondingly renders the nodes and edges, and has four outputs: selected nodes, forwarded nodes, selected edges, forwarded edges.
Figure 3.5: An example subset flow diagram implemented by the VisFlow framework. The user edits the dataflow diagram that corresponds to an interactive visualization web application shown in the VisMode dashboard. The model years of the user selected outliers in the scatterplot (b) are used to find all car models designed in those years (1981, 1982), which form a subset S that is visualized in three metaphors: a table for displaying row details (h), a histogram for horsepower distribution (i) and a heatmap for multi-dimensional visualization (j). The selected outliers are highlighted in red in the downflow of (b). The user selection in the parallel coordinates is brushed in blue and unified with S to be shown in (h), (i), (j). A heterogeneous table that contains the MDS coordinates of the cars is loaded in (k) and visualized in the MDS plot (o), with S being visually linked in yellow color among the other cars.
3.7 Diagram Example
Figure 3.5 provides a comprehensive example of composing a multi-view visualization environment using the subset flow model. The dataflow diagram and its
visualizations are rendered in the implemented VisFlow web framework. More
details on the framework user interface are introduced in Chapter 4.
The diagram uses the Auto MPG dataset1 (loaded by data source (a)) that consists of 392 cars (excluding cars with missing attributes) and their information in 9 columns, including name, mpg, displacement, etc.
A scatterplot (b) first shows the relation between the columns “displacement”
and “mpg”. Two outliers, in terms of the overall negative correlation, are identified
and selected. An interesting task could be to find all those cars that are produced
in the same years as the two outliers. We define this subset of cars to be S. A linker
(e) extracts the “model.year”s of the selected outliers and performs the query for S
by filtering the whole car collection. On the other hand, a parallel coordinates plot
(d) helps provide an overview of value distribution, in which lines are color encoded
(c) depending on the mpg values. The user selection in the parallel coordinates (cars having 5 cylinders) is brushed in blue (p) and unified (g) with S, while the outliers chosen above are brushed in red (f). Three visualizations, a table (h), a
histogram (i) and a heatmap (j) are used to render S along with the selection from
the parallel coordinates.
This example also includes a heterogeneous table (loaded by data source (k))
that contains the projected MDS coordinates of the cars, i.e. columns “mds x”,
“mds y”, and (car) “name”. The MDS coordinates are generated by the metric
SMACOF algorithm using the Minkowski Model in Euclidean distance [12], on all
1 http://archive.ics.uci.edu/ml/datasets/Auto+MPG
the columns with a maximum of 1,000 iterations.
An important task of studying an MDS plot is to identify the distribution of
a subset. Here we highlight the distribution of S in the MDS plot (q) by linking
heterogeneous data, as introduced in Section 3.6. The car names of S are extracted
and used to retrieve the corresponding subset with the MDS coordinates from the
MDS table (by linker (l)). Highlighting S in the MDS plot is then achieved by
assigning a yellow color (m) to S and unifying it (n) with the other cars from the
MDS table.
Recall that visualizations in the subset flow render the data items according to
their visual properties, though potentially in different ways. The histogram (i)
visualizes the “horsepower” distribution of the cars, with the number of highlighted
items shown proportionally in their bins. The heatmap (j), on the other hand,
renders the row labels in the data items’ colors, while its cells use a separate color
scale to encode the attribute values.
In summary, this flow diagram finds the cars that were produced in the same
years as the outliers selected in the scatterplot (b), as a set S. It visualizes
the distribution of S in an MDS plot. Meanwhile, additional cars selected from
the parallel coordinates are highlighted together with S for comparison, in the
visualizations (h), (i) and (j). This example demonstrates how brushing and linking
are achieved in the subset flow. It also shows how the subset flow works with
heterogeneous data.
Chapter 4
VisFlow Framework
Implementation
We develop the VisFlow framework that implements the subset flow model and
demonstrates its applications. We provide an online demo of the implemented
framework, along with comprehensive documentation about its usage. The source
code for VisFlow is available as a GitHub open source project. Appendix A lists the
URLs to access the online demo, documentation, and the implementation source
code. In this chapter we discuss the important aspects of our implementation. We
also introduce a few utility features we integrate into VisFlow that are not part of
the subset flow model but help provide a smoother user experience, including the
VisMode dashboard and diagram sharing.
[Figure: interface screenshot showing the diagram editing canvas, the node panel, the quick node panel, the node option panel, drag-and-drop node creation, and the system menus and tools.]
Figure 4.1: The VisFlow framework interface. Nodes are created in a drag-and-drop manner onto the infinitely large canvas. The node panels list the node types supported by the VisFlow framework. These node types do not include the extended subset flow extensions introduced in Chapter 5.
4.1 Overview
We implement the VisFlow framework that supports visual data exploration
based on the subset flow model. We design the framework interface to assist
efficient understanding and manipulation of the dataflow diagrams in a web browser.
Figure 4.1 gives an overview of the system interface. The flow diagram is drawn and
manipulated on a virtually infinite canvas. Nodes can be created, resized, and
re-positioned in an intuitive drag-and-drop manner. A node panel on the left guides
the user towards node creation, while a pop-out quick node panel activated by
context menu or keyboard shortcut appears around the mouse cursor to closely
follow the editing focus. Dragging from a port to another port or a node creates an
edge. VisFlow automatically selects the first available port that satisfies connection
constraints when an edge being created is dropped on a node. The node option
panel on the right allows the user to set node-specific parameters and options, such
as the table columns to render in a visualization, the color scale of a heatmap, the
value range of an attribute filter, etc. Visualizations are embedded into the rectangular
node areas on the canvas, in which interactive selections can be performed. Nodes
can be either shown in detail (e.g. Figure 3.5(b), (c), (d), (e), etc., in Diagram
Editing) or collapsed into icons to save screen space (e.g. Figure 3.5(h), (i), (j),
etc., in Diagram Editing). The navigation bar at the top of the page lists system
menus and system tools such as the VisMode toggle (Section 4.3) and the dataset
management dialog.
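The automatic port selection on edge drop can be sketched as follows (the interfaces here are hypothetical; VisFlow's actual connection constraints are richer):

```typescript
// Hypothetical sketch: when a dragged edge is dropped on a node, select the
// first input port that satisfies the connection constraints.
interface Port {
  id: string;
  isInput: boolean;
  // Connection constraint check, e.g. subset ports only accept subset ports.
  accepts(source: Port): boolean;
}

function firstAvailablePort(ports: Port[], source: Port): Port | undefined {
  return ports.find(port => port.isInput && port.accepts(source));
}
```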
4.2 System Implementation
In this section we discuss the important technical details of our implementation.
We introduce the application stack, and highlight the component inheritance
technique we apply to meet the need of HTML template rendering and node type
inheritance at the same time.
4.2.1 Application Stack
The VisFlow framework employs a client-server architecture like a typical
modern web application. The client is a Vue.js1 based single-page web application.
The VisFlow client handles all the dataflow computation, data visualization, and
user interactions in the user’s browser. Rendering is performed by D3 [7] in SVG
1https://vuejs.org
for the convenience of listening to element interactions. Since the set of node types
may grow substantially with the analysis needs, we apply an object-oriented code
architecture using Vue components (Section 4.2.3) and ES6 classes, so that more
node types can be added easily into the system using class inheritance.
The VisFlow server runs Express2 with Node.js3. The server is mainly responsible
for managing user logins and providing storage for dataflow diagrams and
user-uploaded datasets. The server may also be extended for more complicated
computation jobs. For example, in Chapter 6 we add a natural language interface
to VisFlow. The natural language queries are sent to the server, which relays them
to the backend semantic parser.
The details of our implementation can be found in the GitHub repository. The
current codebase contains around 30K lines of TypeScript4, Vue, HTML, and SCSS5
code. As of the completion of this dissertation, the codebase has gone through
four major revisions. The earlier codebase used jQuery6 and JavaScript with the
Closure Compiler7. We migrated the codebase to the new application stack and
tool selection for better code maintainability and readability.
4.2.2 Computation
VisFlow avoids repetitive execution and data storage redundancy to ensure
performance. A connected input port only makes a reference to the transmitted
subset or constants coming from its upflow output port. That is, though semantically
2 http://expressjs.com
3 https://nodejs.org
4 https://www.typescriptlang.org
5 https://sass-lang.com
6 https://jquery.com
7 https://github.com/google/closure-compiler
an output subset copy is created at each node (as discussed in Section 3.5), nodes
do not actually acquire copies of subsets where unnecessary. Our implementation
keeps a single copy of each table dataset in memory, and stores the row
indices of the data items as their IDs in the transmitted subsets, so that attribute
values are not duplicated. Though data items may have one-to-many relations with
their subsets, we only make copies of visual properties at the visual editors that
modify them, and store object references elsewhere. User operations
such as flow diagram editing, data item selection in visualizations, and filtering
parameter updates lead to reactive changes in the downflow nodes. We propagate
the changes through the flow in topological order, starting from the node where a
change occurs. The propagation stops at a node as soon as no change is detected
there, e.g. the output of a set intersection may remain the same after an item is
added to its input subset.
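The propagation scheme described in this section can be sketched as follows. This is a minimal illustration, not VisFlow's actual code: nodes expose an update() that reports whether their output changed, and changes travel in topological order, stopping wherever no change is detected.

```typescript
// Minimal sketch (not VisFlow's actual code) of propagating changes in
// topological order, stopping wherever no output change is detected.
interface FlowNode {
  id: string;
  targets: FlowNode[];
  update(): boolean; // recompute the output; return true if it changed
}

function propagate(start: FlowNode): string[] {
  // Collect the subgraph reachable from the changed node.
  const reachable = new Map<string, FlowNode>();
  const stack: FlowNode[] = [start];
  while (stack.length > 0) {
    const node = stack.pop()!;
    if (reachable.has(node.id)) continue;
    reachable.set(node.id, node);
    stack.push(...node.targets);
  }
  // Compute in-degrees within the reachable subgraph (Kahn's algorithm).
  const indegree = new Map<string, number>();
  for (const node of reachable.values()) {
    for (const t of node.targets) {
      indegree.set(t.id, (indegree.get(t.id) || 0) + 1);
    }
  }
  // Process nodes in topological order; a node recomputes only when some
  // upstream output actually changed ("dirty"), and marks its targets
  // dirty only when its own output changes in turn.
  const dirty = new Set<string>([start.id]);
  const updated: string[] = [];
  const queue: FlowNode[] = [start];
  while (queue.length > 0) {
    const node = queue.shift()!;
    if (dirty.has(node.id) && node.update()) {
      updated.push(node.id);
      node.targets.forEach(t => dirty.add(t.id));
    }
    for (const t of node.targets) {
      const remaining = indegree.get(t.id)! - 1;
      indegree.set(t.id, remaining);
      if (remaining === 0) queue.push(t);
    }
  }
  return updated; // nodes whose outputs changed, in topological order
}
```

Note that a node like a set intersection may be visited but not recompute its targets: when update() returns false, its downstream nodes stay clean and are skipped.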
4.2.3 Component Inheritance
The implementation of a subset flow model needs to create various node types
that have a large number of shared methods. These methods represent the shared
functionality between nodes, such as reading source datasets of subsets, manipulat-
ing subset visual properties, and notifying the system of output changes. Therefore
we use ES6 classes for the diagram elements to allow class inheritance. For exam-
ple, a Scatterplot node class inherits the Visualization base class, which further
inherits the SubsetNode base class. The Visualization class, as a base class for all
visualization node types, defines methods for handling interactively selected subsets
and their propagation.
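The inheritance chain can be sketched as follows (class and member names are simplified illustrations, not VisFlow's actual API):

```typescript
// Illustrative sketch of the inheritance chain described above.
class SubsetNode {
  // Shared functionality of all subset-flow nodes, e.g. notifying the
  // system that the node's output changed.
  protected outputChanged = false;

  protected notifyOutputChange(): void {
    this.outputChanged = true;
  }

  hasPendingOutput(): boolean {
    return this.outputChanged;
  }
}

class Visualization extends SubsetNode {
  // Base behavior of all visualizations: record an interactive selection
  // and propagate it as an output change.
  private selection: number[] = [];

  select(itemIds: number[]): void {
    this.selection = itemIds;
    this.notifyOutputChange();
  }

  selectedItems(): number[] {
    return this.selection;
  }
}

class Scatterplot extends Visualization {
  // Node-specific options, e.g. which columns to plot.
  xColumn = 'displacement';
  yColumn = 'mpg';
}
```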
To present the diagram elements in a web interface, we use Vue components
along with ES6 classes. We use vue-class-component8 and vue-property-decorator9 to
implement class inheritance. Additionally, we use vuex-class10 to connect all Vue
components to the global store of the system to manage the application state. Each
Vue component has an HTML or Vue template that defines how the component
should be presented on a webpage. In VisFlow, every diagram element such as a
node, edge, or port is implemented as its own component.
4.2.3.1 Template Injection
As seen in Figure 4.1, each node has a rectangular display area on the diagram
canvas, and is also associated with an option panel for tuning node settings.
Additionally, each node may have a context menu that is activated on right click.
The code architecture needs to allow each node to define its own display templates
for its sub-components. One may consider Vue slots11 to be an option to implement
this requirement. Yet we find that Vue slots do not serve our purpose well, because
Vue slots do not reflect class inheritance relations between an inheriting component
and a base component. When accessing this reference in a component with slots,
the reference points to an instance of the base component. However, we want the
this reference to point to the instance of the inheriting class instead. Essentially,
the relation between a component and its child component does not reflect the
relation between an inheriting class and its base class.
To the best of our knowledge, no modern frontend framework natively supports
DOM templates together with class inheritance. This is because a typical web
application (e.g. an online shopping site) may not have node types as flexible as those
8 https://github.com/vuejs/vue-class-component
9 https://github.com/kaorun343/vue-property-decorator
10 https://github.com/ktsn/vuex-class
11 https://vuejs.org/v2/guide/components-slots.html
in VisFlow. To overcome the limitation, we utilize the template option that can be
passed to a Vue component on component registration. The template option allows
a string to be used as the Vue template, which is compiled at the client side12. We
thus create a general template for the base class with placeholders, and write our
own injection method to replace the placeholders with the templates of the inheriting
classes. This ensures that when a Vue component is registered, the template it
compiles has the full content of the inheriting class. Listing 4.1 shows a simplified
snippet from the template of the base Node component of VisFlow. The template
has three placeholders <!-- node-content -->, <!-- context-menu -->, and
<!-- option-panel -->. <!-- node-content --> represents the node display
on the diagram editing canvas. <!-- context-menu --> lists the right-click menu
options. <!-- option-panel --> defines the UI elements in the node option panel
that should be displayed when a node is clicked and activated. These placeholders
will be replaced by the content of the inheriting nodes at runtime. Because the Vue
component template is not finalized until we inject the placeholder content, the
compiler must be shipped within the final VisFlow bundle.
Listing 4.1: Base Node Component Template

<div>
  <div ref="content" v-show="isContentVisible"
       :class="['content', { disabled: !isContentVisible }]"
       :style="getContentStyles()">
    <!-- node-content -->
  </div>
  <context-menu ref="contextMenu">
    <!-- context-menu -->
  </context-menu>
  <option-panel ref="optionPanel" v-if="isActive">
    <!-- option-panel -->
  </option-panel>
</div>

12 https://vuejs.org/v2/guide/installation.html#Runtime-Compiler-vs-Runtime-only
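Our injection method then amounts to a string replacement of these placeholders before the template is handed to the Vue compiler. A simplified sketch (the helper names are illustrative, not VisFlow's actual code):

```typescript
// Hypothetical sketch of template injection: replace placeholders in the
// base Node template with the inheriting component's template fragments.
const PLACEHOLDERS = ['node-content', 'context-menu', 'option-panel'] as const;

function injectTemplate(
  baseTemplate: string,
  fragments: Partial<Record<typeof PLACEHOLDERS[number], string>>,
): string {
  let template = baseTemplate;
  for (const key of PLACEHOLDERS) {
    // Replace "<!-- node-content -->" etc. with the subclass's markup,
    // or with an empty string when the subclass provides none.
    template = template.replace(`<!-- ${key} -->`, fragments[key] || '');
  }
  return template;
}
```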
Many VisFlow Vue components are mounted dynamically rather than statically.
This is because the dataflow diagram may change constantly over time. We manually
mount all the dataflow diagram elements. For example, when a user uses drag-
and-drop interaction to create a node, we create the node instance and mount its
DOM element onto the diagram editing canvas. For static UI elements such as the
navigation bar, we use a static Vue template as in a typical Vue application. Some
UI elements, such as the buttons and input boxes in the node option panel, have a
static nature, but they have to be dynamically mounted as well because they belong
to nodes that are dynamically created and deleted. We use placeholder components
(Section 4.2.3.2) to ensure that all the dynamically mounted components have a
correct page layout.
4.2.3.2 Placeholder Component
Sometimes a child component may need to be presented globally. For example,
the pop-up option panel of a node on the right of the screen is displayed above
everything on the dataflow canvas. Besides, a node may occasionally need to display
a pop-up modal dialog that asks the user for confirmation, in which case the modal
needs to be above all other elements currently on the page. However, since the
child component nests inside its parent component, and the parent component may
have a fixed position within the overall page layout, it is difficult to use attributes
like the CSS z-index to tweak the arrangement and make the child component
appear on top. For example, a node on the diagram editing canvas is nested inside
the canvas container, and the canvas container is a sibling of the navigation bar at
the top of the system. Therefore, it is impossible to make a child component of a
node appear over the navigation bar, as the navigation bar must be above the
canvas container in the overall page layout. We create a placeholder component
to resolve the issue. A placeholder component, e.g. a global node option panel or
a global modal dialog, is empty when it is inactive, and awaits its content to be
mounted. The placeholder component may provide callbacks, so that the component
that initiates the interface change can update according to the user interaction.
The first method shown in Listing 4.2 is the activate() method from the base
Node class. This method is called when a node is clicked, at which time its option
panel should pop up. The method retrieves the option panel content (as an HTML
element) of the node using the Vue element reference (this.$refs.optionPanel). It
passes the option panel content to a store mutation method named mountOptionPanel(),
which appends the option panel content to the option panel placeholder component
named optionPanelMount. A child component can be unmounted when the
component that initiates the interface change finishes its task, e.g. when the user
deactivates the node the node option panel can be unmounted and hidden.
Listing 4.2: Mounting Option Panel

// in components/node/node.ts
public activate() {
  this.isActive = true;
  this.$nextTick(() => this.mountOptionPanel(this.$refs.optionPanel as Vue));
}

// in store/panels/index.ts
mountOptionPanel(state: PanelsState, panel: OptionPanel) {
  state.optionPanelMount.appendChild(panel.$el);
}
4.3 VisMode Dashboard
We add a utility feature called the Visualization Mode (VisMode) to our im-
plemented framework to enhance its usability by providing a dashboard view of
the dataflow outcome. The VisMode dashboard hides diagram edges and ports,
and presents only a user-selected set of nodes. The sizes and positions of the
nodes in the VisMode can be configured separately from the diagram editing mode.
Figure 3.5 illustrates the correspondence between a dataflow diagram in the diagram
editing mode and its VisMode dashboard view. Views labeled in blue in the VisMode
correspond to the nodes labeled with the same letters in the diagram editing mode.
The visualizations are re-arranged in a more compact layout to present a cleaner
interface for data analysis, much like an off-the-shelf application or visualization
dashboard. We provide smooth transitions of the diagram elements when the
VisMode is toggled, to help the user perceive the node correspondence between the
two modes. The VisMode dashboard keeps the user away from diagram details that
are irrelevant to data exploration, so that the user may focus on the visualized
results and data analyses. Having VisMode also makes it easier
to share and present analysis results produced by the dataflow diagram.
4.4 Reproducibility
Each diagram in VisFlow corresponds to a unique diagram ID. A share link can
be generated for each diagram. With a share link, a diagram created by one user
can be loaded and viewed by other users to share the results of an analysis. VisFlow
preserves the diagram interaction states (i.e. user selections and navigation) when
a saved diagram is loaded, so that the shared results include the brushed and
highlighted data which are essential to presentation and answering questions about
the data. In addition to result sharing, having reproducible diagrams helps result
validation. Based on a shared diagram, a different user may reproduce the entire
analysis, make changes and extensions, or correct errors if necessary.
A diagram can be shared using the VisMode dashboard. The audience of the
dashboard does not need to look into the internal dataflow details and can focus
on the interactive visualizations and data exploration.
We implement a full history stack for all the operations performed in the system.
The history stack can be serialized into diagram logs and saved on the server. An
authorized user, e.g. a collaborator or a system admin, may view the complete logs
of a diagram to see how an analysis arrived at its conclusions.
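The history stack described above can be sketched as a serializable list of operation events (a minimal sketch; the event shape and names are assumptions, not VisFlow's actual log format):

```typescript
// Hypothetical sketch of a serializable history stack of diagram operations.
interface HistoryEvent {
  type: string;                      // e.g. 'createNode', 'createEdge'
  payload: Record<string, unknown>;  // operation-specific details
  timestamp: number;
}

class HistoryStack {
  private events: HistoryEvent[] = [];

  push(type: string, payload: Record<string, unknown>): void {
    this.events.push({ type, payload, timestamp: Date.now() });
  }

  // Serialize into a diagram log that can be saved on the server.
  serialize(): string {
    return JSON.stringify(this.events);
  }

  static deserialize(log: string): HistoryStack {
    const stack = new HistoryStack();
    stack.events = JSON.parse(log) as HistoryEvent[];
    return stack;
  }

  get length(): number {
    return this.events.length;
  }
}
```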
4.5 Case Studies
We evaluate the VisFlow framework and its subset flow model by case studies
on real-world datasets. We work with domain experts and apply the VisFlow
framework to solve their practical problems. In this dissertation we discuss two
case studies.
4.5.1 Gene Regulatory Network Analysis
This case study follows our earlier research [80], which yielded a domain application
that assists computational biologists in analyzing gene regulatory networks along
with their ground-truth lab experiment results. Understanding the regulations
between genes, e.g. the presence of a gene repressing or activating another, is an
important goal in computational biology. A gene regulatory network models the
regulations between genes, and is derived from mathematical models for predicting
potential regulations. One of the example networks, Th17, supports immune cell
fate specification and is computed from a lineage differentiation model system [8].
In this case study, we show that VisFlow is able to generate a gene regulatory
network analysis environment in a small number of steps. We worked with a group
of computational biologists who perform regulatory network analysis as one of their
research tasks. The biologists involved have moderate experience with programming,
and usually write scripts for the same set of tasks to adapt their network data to
other existing visualization tools. Figure 4.2(i) gives one resulting dataflow
diagram, and its corresponding VisMode dashboard is shown in Figure 4.2(ii).
A first step of the analysis is to visualize the regulatory network, which is a
directed weighted graph in which nodes are genes and edges are sourced from master
regulators called Transcription Factors (TFs). The weight of an edge denotes the
confidence score of a regulation prediction. The network visualization node (c) takes
two input tables, one for nodes and the other for the edges, and shows the network
in a force-directed layout provided by D3. Edge weight encoding is achieved by a
property mapping node (a). As biologists typically explore the network bottom-up
by adding to the network known genes they are interested in, an attribute filter
with user input gene names is added (b) so that the network only shows the genes
[Figure: panel (i) labels the diagram nodes (a)–(g), the Nodes and Edges inputs, and the Expression Matrix data sources (with series transpose).]
Figure 4.2: Regulatory network analysis workflow and its corresponding VisMode dashboard generated in the VisFlow framework: (i) The dataflow diagram; (ii) The interactive VisMode visualization dashboard produced by the dataflow diagram.
specified by the biologists. The dotted section in Figure 4.2(i) shows the six diagram
nodes for a network visualization at the top-left of Figure 4.2(ii).
To allow the biologists to search for incident edges that are not currently shown
in the network for network expansion, an incident edge table is added to the
diagram: We add a linker (d) to retrieve the names of the user selected genes in
the network. We then add a filter to find those edges with a source or target gene
matching those selected gene names. This conveniently lists incident edges upon
gene selection in the network (bottom-left of Figure 4.2(ii)).
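The incident-edge query amounts to a simple filter over the edge table (a hypothetical sketch; the types and names are not the actual VisFlow node implementation):

```typescript
interface GeneEdge {
  source: string; // regulating transcription factor
  target: string; // regulated gene
  weight: number; // confidence score of the regulation prediction
}

// Keep the edges incident to any of the genes selected in the network.
function incidentEdges(edges: GeneEdge[], selected: Set<string>): GeneEdge[] {
  return edges.filter(e => selected.has(e.source) || selected.has(e.target));
}
```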
The analysis also includes a gene expression matrix as a supporting ground-
truth dataset. The matrix rows represent genes and columns represent experiment
conditions. Each matrix cell contains the gene’s responsiveness value under one
experiment condition. The visualization requirement is to render the expression
matrix as a heatmap, and additionally show selected gene profiles (rows of the
heatmap) in a line chart, ordered by the matrix columns. In this case, the experiment
conditions can be thought of as a series and we can apply the VisFlow line chart.
Lines are created by first grouping the series points by a grouping column “genes”,
and then rendering the lines along the series column “experiment conditions”. We
apply the series transpose (Section 5.2.3) to convert the input matrix to this format,
as the raw matrix table does not contain series points as data items. After series
transpose the new table has rows of type (gene, experiment condition, value), and
is applicable in the line chart. As rendering the heatmap requires the original
matrix, two data sources (e) have two outgoing branches, producing the original
and transposed matrices respectively. We filter the matrix rows using linkers (f)
and encode gene names using categorical color scales (g), which assign colors based
on the hash values of gene names to provide linked visualizations, i.e. the heatmap
row labels are linked with the rendered lines by their colors.
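The series transpose mentioned above (detailed in Section 5.2.3) turns one matrix row per gene into one row per (gene, condition) pair. A minimal sketch under assumed data shapes (the function name is illustrative):

```typescript
interface SeriesPoint {
  gene: string;
  condition: string;
  value: number;
}

// Convert an expression matrix (one row per gene, one column per
// experiment condition) into series points usable by the line chart.
function seriesTranspose(
  genes: string[],
  conditions: string[],
  matrix: number[][],
): SeriesPoint[] {
  const points: SeriesPoint[] = [];
  genes.forEach((gene, i) => {
    conditions.forEach((condition, j) => {
      points.push({ gene, condition, value: matrix[i][j] });
    });
  });
  return points;
}
```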
Figure 4.2(ii) shows the resulting regulatory network analysis application in
the VisMode, which supports the visualizations of the regulatory network and the
experiment matrix, as well as unidirectional linked queries from the network to the
matrix. These constitute a subset of the functionality of the Genotet system [80].
The biologists performing this experiment commented that VisFlow may save
biologists time and effort, as a query workflow can be composed directly that would
otherwise have to be created by writing custom scripts to parse and plot the data.
This case study shows that VisFlow, without requiring a programming background,
is able to assist domain data analysis with the composition of a relatively small
number of dataflow nodes.
4.5.2 Baseball Pitch Analysis
We present a second case study in the area of sports analytics for baseball
games, where we interacted with an expert in baseball analytics who is a statistician
working on applied problems in sports. Baseball is a highly data-driven sport. Since
pitching style is potentially the most interesting and important aspect of baseball
games, statistics for various metrics of pitches have been available for many decades.
The particular interest of the analyst is to understand the behavior and style of
pitchers by studying their movement data. We demonstrate how VisFlow is able to
adapt to this realistic scenario to quickly develop a tool that an expert can use for
the analysis, e.g. to identify how pitches differ in their deliveries.
One of the analysis tasks is to compare different pitchers across multiple games.
Our data comes from MLB.com Statcast13. The data is organized in many separate
13http://mlb.com/statcast
tables, including a list of team matchups, a player roster for each team, a list of
game plays (i.e. individual pitches) for each game, etc. For each game play, there
are measures of metrics for the pitch. The metrics include the spin rate and speed
of the ball, the time it took the pitcher to release the ball, pitcher extension, and
the result of the play, which can be a Ball, a Strike, or different types of Foul balls.
Despite the data being spread over multiple table files, in a small amount of
time the analyst can compose a VisFlow flow diagram that visualizes
the game plays (see Figure 4.3(iii)). The top-left table (a) lists the games described
by the data and allows the user to select one or more games to study, with each
game having around 300 pitches. The table in the middle (b) has each pitcher’s
name and player ID, from which the user may choose a list of players he/she is
interested in. Upon user selection of a set of players, the parallel coordinates at the
bottom (d) reactively displays the metrics statistics for the pitches of the selected
players. In Figure 4.3(iii), the user currently selects 1 game and 3 pitchers. Note
that selecting multiple games is also possible, in which case the presented plots will
show the data for multiple games, and the statistics are shown for all pitches in
those games.
In addition to the data described above, the Statcast system collects the pitch
movement data. Statcast optically tracks the players at 30 Hz, which gives us
unprecedented details on the players’ movements on the field (Figure 4.3(i)). The
optical tracking system is illustrated in Figure 4.3(ii). The tracking system uses a
coordinate system in which (0, 0) is at the home plate where the batter is; the y
axis points to the pitcher’s mound; the x axis is orthogonal to it in a right-handed
coordinate system. For each player, there is a sequence of 2D positions (xt, yt)
recorded at a series of timestamps t. The recorded coordinates for a player
approximate the pitcher’s center of mass during his movement. The data
contains 90 samples for each pitch, that is, (x1, y1), . . . , (x90, y90). The analysis of
pitcher movements typically focuses on the delivery styles, for which the first 3
seconds of the movements are of key importance. Since pitchers have negligible
movement in the x direction, and the pitcher’s mound is exactly y = 60.5 feet
from the home plate, viewing the movement data with y values in [55, 65]
allows the analyst to focus on the essential tracking records. With VisFlow
attribute filters, such visualization and filtering requirements can easily be
met by creating a few nodes in the diagram. Range filters are applied to
remove tracking records later than 3 seconds from the start of the delivery, as
well as records outside the y range of interest, [55, 65]. A line
chart visualization (Figure 4.3(iii)(c)) is used to render the series (t, yt). Players
can be encoded by categorical colors, so that the line chart renders the pitching
movements of the same player in a uniform color. Using VisFlow brushing and linking,
the user can easily relate the pitchers’ movements with the pitch statistics. For
any user selected pitch movements in the line chart (c), the parallel coordinates
plot (d) instantly highlights the metrics statistics corresponding to the selected
pitches. The analyst can therefore easily observe the speeds, velocities, and results
of selected pitchers’ movements. Such interactive queries greatly help the analyst
derive relations between the delivery styles and the players’ performance.
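The range filtering described above can be sketched as a plain predicate over the tracking samples (a hypothetical illustration, not the VisFlow node implementation):

```typescript
interface TrackSample {
  t: number; // seconds since the start of the delivery
  y: number; // distance from home plate, in feet
}

// Keep the first 3 seconds of a delivery, restricted to y in [55, 65] feet.
function filterDelivery(samples: TrackSample[]): TrackSample[] {
  return samples.filter(s => s.t <= 3 && s.y >= 55 && s.y <= 65);
}
```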
[Figure: panel (iii) is annotated with views (a)–(d) and a user selection; panel (iv) labels pitchers 1–12 and marks wind-up, stretch, and diverging delivery styles.]
Figure 4.3: Applying VisFlow to baseball pitch data analysis. (i) The pitching movement; (ii) The Statcast coordinate system illustration (from [32]); (iii) The analysis environment for the baseball pitch analysis generated by VisFlow; (iv) Plots of pitching movements of 12 players. A categorical color scale is applied to render each player’s pitches in a uniform color.
The constructed analysis environment may help derive baseball findings. For
instance, it is easy to recognize the different types of deliveries, i.e. wind-up (the y
distance increases and decreases alternately in the beginning) vs. stretch (the y
distance remains stable in the beginning). It is also possible to visually compare
the movements of different players. It can be observed that some players have fairly
uniform movements, while others’ movements vary widely. Figure 4.3(iv) shows
the pitches of 12 pitchers, numbered from 1 to 12. The pitchers in the first row
(numbers 1–6) have clearly two movement patterns, as the movements diverge in the
middle. The pitchers in the second row have a single movement pattern. Among
the second row, the pitchers on the left half (numbers 7–9) use wind-up styles, and
the others (numbers 10–12) use a stretch delivery. Other observations can also be
made, e.g. the variation of pitcher 10’s starting y distance is significantly larger
than that of the other players who use stretch deliveries exclusively.
The data involved in this case study has multiple heterogeneous tables. It also
has a large volume of recorded movement positions. In the diagram developed
for the case study, 20 games are loaded, which contain 100K+ data points for
the pitcher movements and other associated records, from 5 heterogeneous tables.
Around 30 pitches and 500 movement points are to be rendered for each player per
game. Data sampling (supported by the sampling mode of an attribute filter) may
help scale up to higher data volume, e.g. spanning a larger number of games. The
diagram contains fewer than 30 nodes. The VisMode of the application generated
from the diagram is shown in Figure 4.3(iii). This case study shows that, with the
flexibility provided by VisFlow, it is possible to analyze such complex domain
data directly and interactively.
Chapter 5
Extended Subset Flow
Though the subset flow has its interactivity advantages, the drawback of re-
quiring data immutability limits the analytical capability of the framework. We
therefore seek to add more data processing power to the dataflow while preserving
the benefits introduced by the subset flow. In this chapter, we discuss how to
extend the subset flow model so that data mutation can be performed in the system
without compromising much of the subset flow model. We explore the design
variations of the node types in the VisFlow framework, and demonstrate several
extension node types that may enhance the capability and usability of the system.
This chapter is organized as follows. First, we introduce the concept of the
extended subset flow model and how data mutation is handled with respect to data
mutation boundaries (Section 5.1). We then describe several new types of nodes
that can be added to the extended subset flow to enhance system capability.
Finally, we demonstrate the improvement of the extended model using a few case
studies that employ the new node types.
5.1 Extended Model
We loosen the data immutability requirement of the original subset flow
model, and define the resulting model as the extended subset flow model. In
the new model, nodes are allowed to mutate the input data, resulting in new tabular
data being generated within the dataflow. Because of the ambiguity introduced
by data mutation (discussed in Section 3.4), visual properties cannot be inherited
when nodes mutate the flow data. However, it is possible to identify where the
data get changed and employ the subset flow on each group of nodes among which
the data remain intact. Conceptually, a node that mutates the tabular data is
considered to create a data mutation boundary. The original subset flow model
applies to the nodes within the same data mutation boundary.
Figure 5.1 exemplifies this concept. The two nodes that mutate their input data
are shown with black borders. These two nodes create the data mutation boundaries.
The black node at the top performs data aggregation and generates the average mpg
value per origin as a new table (with red border). The black node near the bottom
is a table join node that joins the cars from “car.csv” with their MDS coordinates,
so that their MDS plot can be shown (with green border). The nodes in the orange
branch in the middle of the diagram all receive input that comes from “car.csv”.
Those data items remain intact and are within the same data mutation boundary.
The subset flow may unambiguously compute the visual properties between those
nodes. As seen in the figure, the system may use the border colors of the nodes to
hint to the user where the data are mutated, so that the user can be aware of where
brushing and linking can be used to track subsets.
Figure 5.1: Example of data mutation boundaries created in the extended subset flow model. The two nodes with black borders are data mutating nodes. The one at the top performs mpg aggregation for each car origin. The one at the bottom joins the two input tables. The system uses node borders to help the user identify where the data get changed, and the node groups in which the original subset flow applies.
5.2 Node Type Extensions
As data mutation is supported in the extended subset flow model, the dataflow
becomes general enough to perform any computation. In our implementation of the
extended subset flow model, we have added a list of new node types, including table
join, data aggregation, clustering, etc. Although theoretically we may add any type
of node, we are most interested in those node types that are most effective in boosting
the analytical capability of the subset flow DFVS. In this section, we introduce
three representative node types: a generalized script editor, a data reservoir that
addresses the limitation of acyclic dataflow, and a series transpose that converts
a column-major series table to a row-major series table.
5.2.1 Script Editor
Within the extended subset flow model, we design a script editor node that
allows the user to write JavaScript code to edit and produce data. Theoretically any
node type can be implemented from scratch using a script editor, which essentially
represents the most general possibility of data mutation. The script editor expects
the user to write a JavaScript method that reads the input table(s) of the script
editor and outputs a table. An input table is described by its rows and columns.
The user-written method is expected to return a table in the same format.
Listing 5.1 shows an example code snippet the user may write in the script editor
to process the data. In this example, the code drops the last column of the data
and finds all the rows whose first attribute value starts with “chevrolet”.
Listing 5.1: Script Editor Code Example

/**
 * @typedef {{
 *   columns: string[],
 *   rows: Array<Array<number | string>>
 * }} Table
 * @param {Table | Table[] | undefined} input
 * @param {HTMLElement | undefined} content
 * @param {object | undefined} state
 * @returns {Table}
 */
(input, content, state) => {
  const { columns, rows } = input;
  // optional: modify node display HTML
  // optional: modify node state
  return {
    columns: columns.slice(0, columns.length - 1),
    rows: rows
      .filter(row => row[0].match(/^chevrolet/i))
      .map(row => row.slice(0, row.length - 1)),
  };
};
The number of input ports of a script editor is configurable in the UI. If the
script editor should receive multiple input tables, the first argument passed to the
user-written method will be an array of Table (i.e. Table[] in JSDoc).
The script editor allows the user to generate derived data using JavaScript. It
comes in handy when the user wants to make small changes to the data during
data exploration. For example, if the user notices some outliers in the data with a
specific pattern, she may remove them using a simple JavaScript filter in the script
editor.
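As a sketch of this use case, the following user-written method (in the table format of Listing 5.1) drops the rows whose “mpg” value lies more than three standard deviations from the column mean. The column name and the three-sigma threshold are illustrative assumptions, not part of VisFlow itself.

```javascript
// Sketch of a user-written script editor method that removes outliers:
// it drops rows whose "mpg" value is more than three standard deviations
// from the mean. The table shape ({ columns, rows }) follows Listing 5.1;
// the column name "mpg" and the threshold are illustrative assumptions.
const removeOutliers = (input) => {
  const { columns, rows } = input;
  const idx = columns.indexOf("mpg");
  const values = rows.map((row) => row[idx]);
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance =
    values.reduce((a, b) => a + (b - mean) ** 2, 0) / values.length;
  const std = Math.sqrt(variance);
  return {
    columns,
    rows: rows.filter((row) => Math.abs(row[idx] - mean) <= 3 * std),
  };
};
```

The method returns a table in the same format as its input, so downstream nodes can consume the filtered subset directly.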
Additionally, the script editor allows the user to control its HTML display.
When the script editor rendering option is enabled, a second argument content is
passed to the method, which represents the root of the display DOM tree of the
script editor on the dataflow canvas. The user may manipulate the DOM elements
using JavaScript DOM manipulation. We also make jQuery and D3 available within
the method for the convenience of DOM manipulation and data visualization. That
is, it is possible to access the $ jQuery object and the d3 namespace within the
user-written method when script editor rendering is enabled.
The last optional argument state allows the script execution to be stateful.
This is useful when the script editor needs to perform incremental computation.
For example, if the script editor receives input data from a streaming source and
produces output for a line chart that visualizes the stream, it may keep a window
of the streamed data to be shown in the line chart. The state object can thus be
used to store the previous values in the window. We show an example of such a
stream visualization in a model training case study in Section 5.3.3, in which a
stateful script editor is used to produce a visualization of model metric changes
over training iterations.
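A minimal sketch of such a stateful method follows; the state.window field and the window size are illustrative assumptions, since the layout of the state object is up to the user.

```javascript
// Sketch of a stateful script editor method keeping a sliding window of
// the most recent streamed rows. `state` persists between executions, as
// described in the text; the window size and the state.window field are
// illustrative assumptions.
const WINDOW_SIZE = 50;

const windowedStream = (input, content, state) => {
  if (!state.window) {
    state.window = [];
  }
  // Append the newly arrived rows, then trim to the window size.
  state.window.push(...input.rows);
  if (state.window.length > WINDOW_SIZE) {
    state.window = state.window.slice(state.window.length - WINDOW_SIZE);
  }
  return { columns: input.columns, rows: state.window };
};
```

Each execution outputs the current window, so a downstream line chart always shows the latest portion of the stream.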
The code written in the script editor is executed in a JavaScript closure, so that
it does not have access to other variables and data in the wrapper environment.
This prevents the user code from accidentally modifying the system behavior.
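One way such scope isolation might be sketched is to evaluate the user code with the Function constructor, which closes over the global scope rather than the wrapper's local variables. This is an illustrative assumption, not VisFlow's actual implementation, and it is scope isolation rather than a full security sandbox.

```javascript
// Sketch of evaluating user code in its own function scope so that it
// cannot reach the local variables of the wrapper environment. This is
// an illustrative assumption about the implementation, not VisFlow's
// actual sandboxing code; functions created via the Function constructor
// still see globals, so this isolates scope, not security.
const runUserScript = (code, input, contentEl, state) => {
  // The Function constructor creates the user method in the global
  // scope: locals of runUserScript are not visible inside it.
  const factory = new Function(`return (${code});`);
  const userMethod = factory();
  return userMethod(input, contentEl, state);
};
```

A wrapper like this can also inject allowed utilities (e.g. the jQuery `$` object and the `d3` namespace mentioned in the text) as extra parameters of the constructed function.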
5.2.2 Data Reservoir
A key limitation of the subset flow, shared by other DFVSs that use acyclic
dataflow diagrams, is that there is no way to perform iterative changes on the same
node. However, such iterative changes may be required by certain types of data
analysis. For example, in a bottom-up network exploration such as the case study
discussed in Section 4.5.1, the user may want to repeatedly add incident edges
to the network being visualized, so that the network is iteratively expanded. To
implement such an iterative addition of edges in a subset flow, VisFlow has to
create multiple layers of linkers, filters, and set operators, under the limitation
that each layer can only represent one iteration of edge addition (Figure 5.2). This
limitation would render iterative network expansion infeasible in VisFlow if the
number of iterations is unknown or too large.
To address the limitation, we design a data reservoir node in the extended subset
flow. The data reservoir is able to keep its input data: it holds all the changes
Figure 5.2: Limitation of an acyclic dataflow diagram. One layer of nodes has to be created for each iteration of network expansion.
to its input and does not update its downflow nodes reactively. Instead, the user
must explicitly release the changes. Figure 5.3 illustrates the data reservoir in the
network expansion use case. The downflow of the network visualization generates
a new set of edges that is sent to the data reservoir. The data reservoir receives
this edge set but does not immediately update its output, which is connected in a
cycle to the input of the network visualization. When the user releases the changes
(by pressing a button), the new edges get merged into the previous input edges
of the network visualization. Consequently the network gets expanded with the
newly added edges from the downflow. Each release corresponds to one iteration of
network expansion.
The data reservoir presents one possible solution to overcome the limitation of
an acyclic dataflow diagram. It has its own unique characteristics. On one hand,
it meets the subset flow requirement by producing a copy of its input subset. On
the other hand, unlike most of the original subset flow nodes, the data reservoir
is a stateful node (like a stateful script editor introduced in Section 5.2.1), as it
remembers its last released input subset. Such stateful nodes do not appear in the
original subset flow design. The advantage of introducing the data reservoir is that
it enables iterative data exploration. But at the same time it may impact the user’s
Figure 5.3: The data reservoir holds all the changes to the edges. When the user releases the changes, those edges are merged into the upflow edges so that the network visualization may include the new edges.
understanding of the dataflow, and make subset tracing and identification more
complicated. The user needs to be aware that a data reservoir keeps outputting a
subset that it received and released earlier.
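The behavior described above can be sketched as follows. The setInput/release method names and the minimal node interface are illustrative assumptions, not VisFlow's actual API.

```javascript
// Sketch of a data reservoir node: it records new input without
// propagating it, and only copies the pending table to its output when
// the user explicitly releases the changes. Method names and the node
// interface are illustrative assumptions, not VisFlow's actual API.
class DataReservoir {
  constructor() {
    this.pending = null; // latest input, held back
    this.output = { columns: [], rows: [] }; // last released subset
  }

  // Receive new input; downflow nodes are NOT updated reactively.
  setInput(table) {
    this.pending = table;
  }

  // Explicit user action (e.g. a button press) that releases the held
  // changes to the output, triggering one iteration downstream.
  release() {
    if (this.pending !== null) {
      this.output = {
        columns: this.pending.columns.slice(),
        rows: this.pending.rows.map((row) => row.slice()),
      };
    }
    return this.output;
  }
}
```

In the network expansion use case, the released output would be unioned with the previous edges upstream of the network visualization, closing the cycle one iteration at a time.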
5.2.3 Series Transpose
A tabular dataset may sometimes contain series information over a set of
columns. Table 5.1(a) shows a few entries from the “SP.DYN.LE00.IN” indicator
report from the World Bank Data repository1. The data contain a sequence of
columns that represent an ordered series, i.e. the index values in each year. In this
case, each data point in the series is of analytical interest. A data point in the
series represents a new type of data item in row-major order that is different from
the original table rows, which hold column-major series. Within the extended subset
flow, we provide a series transpose node that helps analyze series data in the format
of one series point per row. The series transpose takes one key column and a list of
series columns, and transforms the input table into rows of series points. Column names
1 http://databank.worldbank.org
Table 5.1: Series transpose example that converts the column-major series in Table (a) into the row-major series in Table (b) based on the key column “Country” and the series columns of the years. The cell values in Table (a) are stored in the third column of Table (b). Table (b) has 9 rows, not all of which are shown.
are written as attribute values (middle column in Table 5.1(b)), and the original
table values are stored in a third column. Using “Country” as the key column
and the years as the series columns, series transpose produces Table 5.1(b) from
Table 5.1(a). Series transpose is a data mutating operation. It provides a utility for
the user’s convenience, so that a table like Table 5.1(a) with column-major series
can be directly visualized by a VisFlow line chart that expects row-major series
data points.
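The transformation can be sketched as a small JavaScript function over the table format of Listing 5.1. The output column names “series” and “value” are illustrative assumptions.

```javascript
// Sketch of the series transpose: given a key column and a list of
// series columns, emit one output row per (key, series point). The
// output column names "series" and "value" are illustrative assumptions.
const seriesTranspose = (input, keyColumn, seriesColumns) => {
  const keyIdx = input.columns.indexOf(keyColumn);
  const rows = [];
  for (const row of input.rows) {
    for (const col of seriesColumns) {
      // Column name becomes an attribute value; the cell becomes the
      // third column, as in Table 5.1(b).
      rows.push([row[keyIdx], col, row[input.columns.indexOf(col)]]);
    }
  }
  return { columns: [keyColumn, "series", "value"], rows };
};
```

A table with k rows and n series columns thus produces k × n series-point rows, matching the 3 × 3 = 9 rows of Table 5.1(b).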
5.3 Case Studies
We exemplify the application of the extended subset flow model using three
case studies. We demonstrate that the interactivity benefits of the subset flow may
apply in each group of nodes within the same data mutation boundary. The mixed
subset and non-subset flow can be useful for many data analysis tasks.
5.3.1 Evacuation Dataset Visualization
In this case study we employ the script editor, combined with another experimental
subset flow node type, the series player, to visualize movement data over time.
The series player is technically an attribute filter that treats its input data as a time
series. In a time series table, there is one time column, and each data entity (such
as a person) has one row for each distinct time value. The series player performs
attribute filtering and allows the rows with one time value to pass at a time. The
user may choose to advance the time value at different speeds to replay the time
series and review the activities of the data entities over time.
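The series player's filtering behavior can be sketched as follows; the class interface and method names are illustrative assumptions.

```javascript
// Sketch of the series player as an attribute filter: it passes only the
// rows whose time column equals the current timestamp, and advancing
// moves to the next distinct time value. The class interface and method
// names are illustrative assumptions.
class SeriesPlayer {
  constructor(table, timeColumn) {
    this.table = table;
    this.timeIdx = table.columns.indexOf(timeColumn);
    // Distinct time values in ascending order.
    this.times = [...new Set(table.rows.map((r) => r[this.timeIdx]))].sort(
      (a, b) => a - b
    );
    this.cursor = 0;
  }

  // Rows at the current timestamp (e.g. one row per tracked person).
  currentRows() {
    const t = this.times[this.cursor];
    return this.table.rows.filter((r) => r[this.timeIdx] === t);
  }

  // Advance to the next timestamp, if any.
  advance() {
    if (this.cursor < this.times.length - 1) this.cursor += 1;
  }
}
```

Calling advance() on a timer at different rates corresponds to replaying the series at different speeds.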
In this study we look at the building evacuation traces from Challenge 4,
VAST Challenge 20082. The dataset describes a department of health that was
involved in conflicts with local religious groups. A bombing incident happened at
the department of health building, which could be related to the religious group
supporters. The dataset includes a floor plan of the building, and the locations
of the employees and visitors of the building during the bombing incident. Each
person is assumed to carry a badge that tracks his/her location. The goal of the
analysis is to visualize the evacuation when the bombing happened, and identify
any casualties along with potential suspects and witnesses to the event.
We may create a script editor to visualize the floor plan using JavaScript canvas
drawing. Figure 5.4 shows the JavaScript snippet written in the script editor. The
script renders the building floor plan by reading the building data and using an
HTML canvas to draw the walls. Alongside the floor plan we use a scatterplot to
show the locations of the persons in the building. By overlaying the scatterplot
on the floor plan, we are able to visualize where each person is located.
2 http://www.cs.umd.edu/hcil/VASTchallenge08/
Figure 5.4: A snapshot of the JavaScript written in a script editor to render the floor plan of the evacuation data. The script manipulates the DOM tree that is rooted at content.
A series player is added to the input of the scatterplot to control the visualized
timestamp. The outcome of this diagram (in VisMode) is shown in Figure 5.5.
Using the series player, we may replay the entire evacuation traces and review
the movements of all the persons in the building. By the interactive selection of
VisFlow, we may also pause at any timestamp and select people we are interested
in. By linking the selected people with their descriptive information from a separate
table, we may identify who those selected people are.
The script editor and the series player extensions of VisFlow enable the visualization
and exploration of such movement data, which were previously not possible
with the original subset flow. In particular, the script editor gives the user
freedom over how data are displayed and makes it more convenient to perform custom
rendering, i.e. drawing the floor plan. The series player demonstrates a node type
extension within the original subset flow model. Meanwhile, the interactivity benefits,
Figure 5.5: Using a series player and a script editor to visualize the evacuation data from VAST Challenge 2008.
i.e. supporting details-on-demand pulling of building personnel information, are
also effectively utilized.
5.3.2 k-Means Clustering Visualization
In this case study, we apply the extended subset flow model to visualize the
iterations in a k-means clustering algorithm. Figure 5.6(i) shows the dataflow
diagram that completes the visualization. A k-means clustering node is applied on
the Auto MPG dataset. The clustering node adds a “ClusterLabel” column and
assigns a cluster label to each row, as seen in the table in Figure 5.6(i).
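For illustration, one iteration of such a labeling step might look like the following sketch, which appends a “ClusterLabel” column as the clustering node does. The distance metric and centroid update are the standard k-means steps; the table format follows Listing 5.1, and the feature selection is an illustrative assumption.

```javascript
// Sketch of one k-means iteration that appends a "ClusterLabel" column,
// mirroring the clustering node's per-iteration output. Rows are treated
// as numeric feature vectors; this is an illustrative simplification.
const assignLabels = (rows, centroids) => {
  return rows.map((row) => {
    let best = 0;
    let bestDist = Infinity;
    centroids.forEach((c, k) => {
      // Squared Euclidean distance to centroid k.
      const d = row.reduce((s, v, i) => s + (v - c[i]) ** 2, 0);
      if (d < bestDist) {
        bestDist = d;
        best = k;
      }
    });
    return best;
  });
};

const kMeansIteration = (table, centroids) => {
  const labels = assignLabels(table.rows, centroids);
  // Recompute each centroid as the mean of its assigned rows.
  const next = centroids.map((c, k) => {
    const members = table.rows.filter((_, i) => labels[i] === k);
    if (members.length === 0) return c;
    return c.map(
      (_, j) => members.reduce((s, row) => s + row[j], 0) / members.length
    );
  });
  // Emit the labeled table, as the clustering node does after each iteration.
  return {
    centroids: next,
    table: {
      columns: [...table.columns, "ClusterLabel"],
      rows: table.rows.map((row, i) => [...row, labels[i]]),
    },
  };
};
```

Emitting the labeled table after every call to kMeansIteration corresponds to the per-iteration output configuration described below.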
To visualize the cluster labels produced by the clustering algorithm, we may
apply a categorical color scale on the “ClusterLabel” column from the output of
the k-means node. To determine a proper placement of the data items on a 2D
plane, we use the table join node available in the extended subset flow model
to assign the MDS coordinates to each car. The result is an MDS scatterplot in
Figure 5.6(i), in which points are color encoded by their cluster labels.
Figure 5.6: Using an MDS plot and a cluster label distribution plot to visualize the iterations of the k-means clustering algorithm: (i) the dataflow diagram; (ii) visualizations of the clustering algorithm iterations.
The clustering node that executes the k-means algorithm may be configured to
output the cluster labels immediately after each iteration of the clustering algorithm.
With this configuration, we are able to witness the label changes on the fly as the
algorithm proceeds. Additionally, utilizing the subset flow color linking, we may
create a histogram to identify how many data items are given a particular cluster
label. This is shown in the histogram on top of the MDS plot in Figure 5.6(i).
Several iterations starting from a random cluster label initialization are shown
in Figure 5.6(ii). The numbers of data items with different labels are about even
after random initialization, and gradually converge to the final output of
the algorithm. The algorithm's progress can be easily visualized within the subset
flow sub-diagram highlighted with a red border in Figure 5.6(i). This case study
shows that even when data mutating nodes like the k-means algorithm and the
MDS coordinates table join are present in the dataflow, we can still make use of
subset flow visual properties to perform effective visual linking between multiple
visualizations.
5.3.3 Model Training Visualization
We present a third case study that employs the extended subset flow model to
visualize a machine learning model training process. The diagram in Figure 5.7
completes an example training process using a multi-layered perceptron. The
diagram uses the native subset flow sampling feature available in an attribute filter
with a set difference operator (a) to make a train/test split on the AutoMPG
dataset. 75% of the cars from each distinct origin are used as training examples,
while the remaining 25% are kept as the test data. The training data are sent to a
multi-layered perceptron (b), which predicts the “origin” for the cars in the test
data.
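The stratified split can be sketched as follows; the deterministic “first 75%” sampling here is an illustrative simplification of the attribute filter's sampling mode, and the function name is hypothetical.

```javascript
// Sketch of the stratified train/test split: take 75% of the rows for
// each distinct "origin" as training data; the remaining rows (the set
// difference) become the test data. The deterministic "first 75%" cut is
// an illustrative simplification of the attribute filter's sampling mode.
const stratifiedSplit = (table, column, trainFraction) => {
  const idx = table.columns.indexOf(column);
  const groups = new Map();
  table.rows.forEach((row) => {
    if (!groups.has(row[idx])) groups.set(row[idx], []);
    groups.get(row[idx]).push(row);
  });
  const train = [];
  const test = [];
  for (const rows of groups.values()) {
    const cut = Math.floor(rows.length * trainFraction);
    train.push(...rows.slice(0, cut));
    test.push(...rows.slice(cut)); // set difference: rows not sampled
  }
  return {
    train: { columns: table.columns, rows: train },
    test: { columns: table.columns, rows: test },
  };
};
```

In the diagram, the sampling attribute filter plays the role of the train selection and the set difference operator (a) produces the test rows.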
A script editor (c) is used to compute the model's macro F1 score as well as the
precision/recall metrics for the Japanese cars in the test data. The metric values
are passed to a line chart (d) to visualize the metric changes after each iteration
of the model training. This script editor is a stateful node. It remembers all its
previous output so that the line chart can visualize the entire metric history over
the training process. At each iteration, three new rows giving the new values of
the three metrics (macro F1, precision for Japanese cars, recall for Japanese cars)
are appended to the output of the script editor, so that the line chart may produce
the metrics visualization as if the metric values were “streamed” from the model.
Figure 5.7: Applying a combination of extended model nodes to visualize a multi-layered perceptron training process. Using a stateful script editor we show metric value changes over time in a line chart. By subset flow diagram highlighting, we can highlight the incorrectly predicted test data in the MDS plot and the histogram.
A separate script editor (e) is used to identify which of the test data rows have
an incorrectly predicted “origin”. If the prediction is wrong, an “error” flag is
added as a new column, and the downflow highlighting sub-diagram (g) reads this
flag and colors the incorrectly predicted test data in red. With an MDS plot (h)
similar to the one in Section 5.3.2, and a distribution histogram (i) of the “origin”
of the cars, we can visualize where the model fails to make a correct prediction.
This error highlighting sub-diagram applies the subset flow visual properties linking
to conveniently identify the incorrectly predicted cars from the test data in the
visualizations.
This case study demonstrates that with the extended subset flow model it is
possible to produce visualizations for a wider variety of tasks. The benefits of
having visual properties carried by the subsets may be preserved, even when many
nodes in the diagram, such as the perceptron (b), the metric computation (c), and
the error computation (f), mutate the data. It also demonstrates a more comprehensive
usage of the script editor and its state. Specifically, the stateful design of the script
editor enables the visualization of streamed data, which is useful for visualizing the
model metrics.
5.4 Discussions
In this chapter, we extend the subset flow model to remove the data immutability
requirement and consequently increase the data processing power of the dataflow
framework. We show that such a design is beneficial, in that it enables a larger
variety of tasks to be supported by the dataflow, as described in the case studies.
In particular, it can be seen that the interactivity advantage of the original subset
flow model can be preserved when data mutation is allowed within the dataflow.
Though technically we may add any data processing node to VisFlow under
the extended subset flow model, it is important that those new node types do
not over-complicate the dataflow diagram and make the subset flow harder to
understand and trace. The introduction of stateful nodes, such as a stateful script
editor and a data reservoir, may also make it harder to identify where data are
coming from at a particular point within the dataflow. The user's understanding
and perception of mixed subset and non-subset flow may need further study, so as
to find a better tradeoff between the subset flow advantages and the data processing
capability of the system.
Moreover, creating a dataflow framework that is general-purpose and applicable
to any data analysis task may require community effort. Currently, we are the only
maintainers and developers of the VisFlow repository and have to implement every
node type on our own. An extension standard is desirable so that the dataflow
framework may accept external and public contributions. We may also consider
setting up infrastructure to reuse packages and tools available in other languages,
so that the web-based dataflow framework may serve as an integrative environment
like scripting notebooks such as Jupyter [52] and Observable [46].
Chapter 6
FlowSense: A Natural Language
Interface
Natural language interfaces (NLI) for data visualization may help improve the
usability of a visualization system. Systems with NLIs allow the user to specify
queries directly via natural language (NL) without much prior knowledge of the
system usage. Recent research has made progress on visualization-oriented NLIs [19,
24, 60]. However, most of these interfaces present a single main visualization for
the user to interact with, possibly with a few auxiliary views and widgets. The
analytical capability and flexibility are thus restricted by the single-view design.
Visual data exploration often requires multi-view linked visualizations, for which
the design of an NLI becomes more challenging.
As a DFVS has the potential to produce flexible, customizable visualizations
but is often relatively difficult to learn, we seek to design an NLI that benefits
both from the usability of NL and the analytical flexibility of VisFlow. We name
the NLI FlowSense. FlowSense uses semantic parsing to support NL queries that
manipulate multi-view visualizations created from a dataflow diagram. The NL
capability effectively reduces the overhead of learning dataflow usage and simplifies
the interactions needed for dataflow diagram construction.
In this chapter, we discuss the design details of FlowSense and the results
achieved. We first survey the related work on NLIs in Section 6.1. The design goal
of our NLI is defined in Section 6.2. We then introduce the FlowSense semantic
parser and its query execution process, as well as the user interface we integrate
into VisFlow for performing FlowSense queries. Finally, we present two case studies
in Section 6.7 and a formal user study in Section 6.8 to evaluate FlowSense within
the VisFlow framework.
6.1 Related Work
We first discuss the related work on NLIs for data visualization. We also briefly
cover semantic parsing and relevant techniques for parsing NL input.
6.1.1 NLIs for Data Visualizations
Extensive research has been devoted to NLIs for decades. These interfaces
address NL queries that otherwise have to be translated into formal query languages,
e.g. SQL. A few examples are interfaces for querying XML [34] and entity-relational
databases [1, 3], and a speech translator to SQL [31]. NLIs for data visualization
answer queries by presenting visual data representations. Compared with
other interfaces that simply return a numerical answer or a set of database entries,
visualization NLIs present results that are more human-readable. Cox et al. [9]
designed the Sisl service within the InfoStill data analysis framework. The service
asks a series of NL questions and uses the obtained answers to complete an
unambiguous query. The Articulate system [62] uses a Graph Reasoner to select proper
visualizations to answer a query. DataTone [19] addresses the particular problem
of query ambiguity by showing ambiguity widgets along with the main visualization
so that the user is able to switch to informative alternatives. Eviza [60] and
Evizeon [24] further improve the user experience by allowing for conversation-like
follow-up questions. Kumar et al. [30] propose a dialogue system for visualization.
Several commercial tools integrate NLIs. IBM Watson Analytics [27] and Microsoft
Power BI [42] provide a list of relevant data and visualizations for an NL question,
from which the user may choose to continue the analysis. Wolfram Alpha [76]
supports knowledge-based Q&A and is able to plot the results. ThoughtSpot [67]
enables interactive search over relational databases, and provides multiple types of
visualizations for the database contents. The design of NLIs for data visualization
faces two challenges. First, modern natural language processing (NLP) techniques
cannot yet fully understand arbitrary NL input due to the complex nature of NL,
and user queries tend to be free-form and ambiguous. Second, choosing a proper
visualization to answer an analytical question is non-trivial, as there can be multiple
possible visual representations [38].
6.1.2 Semantic Parsing
FlowSense uses semantic parsing to process NL input and map user queries to
dataflow diagram editing operations. It depends on a pre-defined grammar that
captures NL input with certain patterns. A semantic parser recursively expands
the variables in the grammar to match the input query and can interpret the input
based on the rules applied and the order of their application [6]. At a high level,
the mapping performed by FlowSense can also be considered a classification task
and addressed by classification algorithms [2]. However, we prefer the semantic
parsing approach because most classification methods are supervised and require
a large corpus of labeled examples, which is not available for a DFVS. Besides,
compared with deep learning methods [13], semantic parsing does not require heavy
computational resources.
The FlowSense semantic parser is implemented within the Stanford SEMPRE
framework [50] and the CoreNLP toolkit [39]. The CoreNLP toolkit integrates a
comprehensive set of NLP tools including a Part-of-Speech (POS) tagger, a Named-
Entity Recognizer (NER), etc. The SEMPRE framework employs a modular design
in which different types of parsers and logical forms can be easily plugged in. The
framework can quickly be adapted for domain-specific parser design [72].
6.2 Design Goal
FlowSense is distinct from other NLIs as it is, to the best of our knowledge,
the first NLI to address a dataflow context. We set the scope of FlowSense to
focus on assisting dataflow diagram construction, rather than directly answering
free-form analytical questions or seeking the best visualization for a given query.
We believe such an approach is beneficial in several aspects:
• Capability: The analytical capability of FlowSense is rooted in the design of
the DFVS. The outcome of FlowSense is a complete, interactive, and iterative
visual data exploration process supported by VisFlow, rather than a single
visualization that only answers one particular query as in other interfaces.
Dataflow also naturally preserves analysis provenance [17]. The diagram
created by FlowSense explicitly keeps the user's preferences and intentions
from previous queries, which are otherwise maintained by a model behind
the scenes [19, 60].
• Usability: FlowSense significantly reduces the number of interactions re-
quired to construct a dataflow diagram. Its convenience is desirable for both
novice and experienced VisFlow users. Besides, the DFVS is able to recover
from errors more easily, as the user always has full control over the system.
In other interfaces, however, the user has to mostly rely on the behavior of
the NLI and can hardly make corrections in case of misinterpretation. We
justify this advantage with case studies and user feedback.
• Feasibility: The scope of FlowSense, assisting dataflow diagram construc-
tion, is well defined and practicable. Even state-of-the-art NLP techniques
have limited success in understanding an arbitrary NL query. By restricting
our scope, FlowSense can produce more predictable results and give a better
user experience, as each query is expected to update the dataflow diagram,
and the user decides what the system does and what visual representation
to use through dataflow editing. The mixed-initiative design mitigates the
ambiguity that potentially comes from misinterpreting the user's intention.
6.3 Semantic Parsing
In this section we introduce the details of the semantic parsing in FlowSense.
For concept illustration we use the Auto MPG dataset throughout the discussion.
Table 6.1: Six major categories of VisFlow functions. These sub-diagrams are frequently used to compose more sophisticated diagrams that address analytical tasks.

A. Visualization. Sample queries: “Show a scatterplot of mpg and horsepower”. Description: present the data in a visualization. Sample sub-diagram: a data source connected to a visualization.

B. Visual Encoding. Sample queries: “Encode mpg by red green color scale”; “Map car origin to categorical colors”. Description: map data attributes to visual channels. Sample sub-diagram: a visual editor placed before a visualization.

C. Data Filtering. Sample queries: “Find all cars with mpg between 15 and 20”; “Show five cars with maximum mpg”. Description: filter data items and locate extremums and outliers. Sample sub-diagram: attribute filters such as 15 ≤ mpg ≤ 20 and max{mpg}.

D. Subset Manipulation. Sample queries: “Merge the cars with those from the scatterplot”. Description: refine and identify interesting subsets. Sample sub-diagram: set operators such as union and intersection.

E. Highlighting. Sample queries: “Highlight the selected cars in a parallel coordinates plot”. Description: view the characteristics of one subset among its superset or another subset. Sample sub-diagram: a user selection routed through a visual editor and a union into a highlighted visualization.

F. Linking. Sample queries: “Find the cars with a same name from the sales table”; “Link the sales records by origin from the scatterplot”. Description: extract keys from one table and find their correspondence in another (heterogeneous) table using a linker. Sample sub-diagram: two data sources connected through a linker.
6.3.1 VisFlow Functions
To create an NLI for VisFlow, we first studied a sample diagram set that includes
60 dataflow diagrams created by 16 VisFlow users in their previous VisFlow
sessions. We identified a set of frequently appearing sub-diagrams and categorized
them into six major categories, as listed in Table 6.1. The construction of these
sub-diagrams is defined as the set of VisFlow functions. By implementing the VisFlow
functions, FlowSense essentially supports the building blocks of VisFlow's visual
data exploration, so that analyses with VisFlow's native interactions can then be
carried out with FlowSense. Table 6.1 explains the usage of each VisFlow function
and shows several sample queries.
In addition to the six major categories, FlowSense also supports many utility
functions such as adding/removing diagram edges, selecting data points in a
visualization, loading a given dataset into a data source node, automatically
adjusting the dataflow diagram layout, etc. Though these functions also enhance
the usability of the system, we omit their details here as they are auxiliary.
6.3.2 Grammar
FlowSense applies a semantic parser to map an NL input to one of the VisFlow
functions based on an elaborate grammar designed for these functions. The grammar
is context-free [40] and formally defined as a 4-tuple G = (V,Σ, R, S). V is a finite
set of variables. Σ is a finite set of terminals. A terminal represents an English
word or phrase. R is the rule set that defines how a single variable matches an
ordered list of terminals and variables (possibly itself in a recursive rule). Below is
an example rule:
〈Visualization〉 → 〈ShowVerb〉 〈Columns〉 in 〈VisualizationType〉
In this rule, 〈Visualization〉 is a high-level variable that matches a query that
requests a visualization. 〈ShowVerb〉 matches a verb that has a meaning similar to
“show”. 〈Columns〉 matches one or more columns from the data. 〈VisualizationType〉
stands for a phrase that describes a visualization metaphor such as scatterplot or
parallel coordinates. The token “in” is a terminal symbol that comes from the
NL input directly. The example rule above is simplified for the convenience of
explanation. In practice, a rule often matches against generic variables rather than
a specific word. S is the start variable that expands to other variables to match
every query.
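To make the 4-tuple concrete, the rule above can be sketched in code. This is a toy illustration only: the real rules are SEMPRE formulas, and the vocabulary and rule names below are hypothetical.

```python
# Toy context-free grammar in the style of the example rule above.
# Variables map to lists of rule bodies; anything not in RULES is a
# terminal that must match an input token literally. COLUMN is a special
# utterance placeholder filled from the dataset at runtime (Section 6.4.1).
RULES = {
    "Visualization": [["ShowVerb", "Columns", "in", "VisualizationType"]],
    "ShowVerb": [["show"], ["visualize"], ["draw"]],
    # Longer bodies come first: this naive matcher does not backtrack
    # across parent rules, so greedy short matches would block "and" lists.
    "Columns": [["COLUMN", "and", "Columns"], ["COLUMN"]],
    "VisualizationType": [["scatterplot"], ["parallel", "coordinates", "plot"]],
}
COLUMNS = {"mpg", "horsepower", "origin"}  # tagged at runtime, not in the grammar


def match(symbol, tokens, i):
    """Try to expand `symbol` at position i; return the new position or None."""
    if symbol == "COLUMN":  # special utterance placeholder
        return i + 1 if i < len(tokens) and tokens[i] in COLUMNS else None
    if symbol not in RULES:  # terminal: must match the token literally
        return i + 1 if i < len(tokens) and tokens[i] == symbol else None
    for body in RULES[symbol]:  # variable: try each rule body in order
        j = i
        for part in body:
            j = match(part, tokens, j)
            if j is None:
                break
        else:
            return j
    return None


def accepts(query):
    tokens = query.lower().split()
    return match("Visualization", tokens, 0) == len(tokens)
```

A query such as "show mpg and horsepower in scatterplot" is derived, while unrelated input is rejected; the real parser additionally scores competing derivations (Section 6.4.5).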
The FlowSense semantic parser attempts to derive an input
query by recursively searching for all possible matches of the grammar rules;
this procedure is called derivation [6]. FlowSense uses the semantic parsing
implementation from SEMPRE. It also uses the Stanford CoreNLP [39] toolkit that
is built into SEMPRE for POS tagging (Section 6.4.1). The variables and rules
of FlowSense are defined in SEMPRE grammar files. The FlowSense grammar is
independent of the data being analyzed. It is also independent of the dataflow
diagram being constructed. We use special utterance placeholders (Section 6.4.1)
to let the grammar understand dataflow context at runtime. FlowSense currently
includes about 200 variables and a rule set of around 500 rules (i.e. SEMPRE
formulas). The FlowSense grammar may be extended to support more analytical
functions.
[Figure 6.1 graphic omitted. It shows the example query “Visualize mpg, horsepower, and origin of the selected cars from MyChart in a parallel coordinates plot” with its POS tags, a parse tree over grammar variables such as 〈ShowVerb〉, 〈Columns〉, 〈Selection〉, 〈SourceNodeWithName〉, 〈VisualizationType〉, and 〈TargetNodeWithOptions〉, and the resulting sub-diagram, in which the selection port of the node MyChart feeds a new parallel coordinates plot.]
Figure 6.1: An example FlowSense query and its execution. The derivation of the query is shown as a parse tree in the middle. The sub-diagram expanded by the query is illustrated at the bottom. The five major components of a query pattern are underscored. Each component and its relevant parts in the parse tree and the dataflow diagram are highlighted by a unique color. The result of executing this query is to create a parallel coordinates plot on the columns mpg, horsepower, and origin, with input from the selection port of the node MyChart.
6.3.3 Query Pattern
The main goal of FlowSense is to support progressive construction of dataflow
diagrams. We studied the creation process of many VisFlow diagrams in our
sample diagram set and empirically identified a common pattern with five key
query components that all VisFlow functions depend on: function type, function
options, source node(s), target node(s), and port specification. This pattern is
illustrated in Figure 6.1 with a sample query “Visualize mpg, horsepower, and
origin of the selected cars from MyChart in a parallel coordinates plot”. In this
query, the verb “visualize” indicates the intention to apply a visualization function.
The three columns “mpg, horsepower, and origin” describe the options (i.e. what
to visualize) for the visualization function. The phrase “from MyChart” tells
the system where the data to be plotted are coming from and provides source
node information. The phrase “in a parallel coordinates plot” indicates a new
visualization node with the given visualization type is to be created as the target
node. The source and target node information is closely related to the dataflow
context and is automatically identified upon user input and can then be matched as
special utterances (see Section 6.4.1). As VisFlow explicitly exports interactive data
selection from visualization nodes, the phrase “selected cars” is a port specification
that further describes that the user wants to visualize the selection from MyChart
and the new visualization node should be connected to the selection output port of
MyChart.
The grammar of FlowSense includes hierarchical variables that match the five
key components of an NL query. Figure 6.1 illustrates the parse tree that derives
the example query. The variables involved in the derivation are shown in the parse
tree, where variable expansions are bottom-up. A variable may carry information
for multiple query components. We design a comprehensive set of variables and
rules that accept not only queries in a particular order, but also their different
arrangements. For instance, “Show mpg and horsepower in a scatterplot”
is equivalent to “Show a scatterplot of mpg and horsepower”. They both can be
accepted by FlowSense. FlowSense is also able to derive multiple functions from a
single query and execute their combination, e.g. “Show the cars with mpg greater
than 15 in a scatterplot” infers both visualization and data filtering functions.
A query need not contain all five components explicitly. For
example, the user may simply say “Show mpg and horsepower” without mentioning
any source node or target visualization type. FlowSense may automatically locate
source and target nodes in its query completion phase (Section 6.4.3). The user
query may also contain implicit information, e.g. “Find cars with large mpg” implies
data filtering to search for a few cars with large mpg values. FlowSense
stores utterance implications in its grammar, e.g. the word “find” implies the use of
a filter. FlowSense uses keyword classification (Section 6.4.2) to identify important
utterance implications from the query.
6.4 Query Execution
To interpret NL input based on the current dataflow context, FlowSense not
only runs the semantic parser, but also employs several auxiliary phases for its
query execution. Figure 6.2 illustrates the execution process.
[Figure 6.2 flowchart omitted: NL Input → POS and Special Utterance Tagging → Keyword Classification → Parser Execution → Query Pattern Completion → Dataflow Diagram Update, with failure paths for unexpected input and incorrect/missing information.]
Figure 6.2: FlowSense query execution phases. POS and special utterance tagging is performed first. Special utterances describing the data columns and diagram nodes are identified and can be matched against utterance placeholders. Keyword classification is applied to identify important utterance implications such as the intention to call a specific VisFlow function. FlowSense attempts to complete the query pattern if missing information can be filled using default values. Upon an execution failure the user is notified and asked to update the query.
6.4.1 POS and Special Utterance Tagging
FlowSense first performs POS tagging on the query with CoreNLP. Each token
receives a POS tag as shown in Figure 6.1. POS tags are used to generalize the
FlowSense grammar. For example, many prepositions can be used interchangeably,
e.g. “selection of the plot” is equivalent to “selection from the plot”. Instead of
having one rule for every preposition, the grammar uses a generic variable that
matches any preposition. POS tagging helps FlowSense analyze the basic semantic
structure of a query.
Some utterances in the NL query refer to special entities such as visualization
types, table columns, or diagram node names. These utterances play key
roles in executing a VisFlow function. FlowSense identifies these special utterances
and uses this information in the derivation. For the query shown in Figure 6.1,
FlowSense tags “mpg”, “horsepower”, “origin” as columns, “MyChart” as a node
label, and “parallel coordinates plot” as a visualization type (node type). FlowSense
uses generic variables like 〈column〉 in its grammar that do not depend on the dataset
being analyzed. Therefore, the grammar rules do not list the data-dependent special
utterances as terminals. For example, the grammar does not include table column
names or diagram node types, i.e. the string value “mpg” would not appear
in the grammar. These generic variables are special utterance placeholders that
support matching of special utterances. The information about special utterances
is collected in the tagging phase, and the special utterance placeholders are thus
able to match the tagged tokens.
For instance, when the user loads the Auto MPG dataset into the dataflow,
column names such as “mpg” are automatically extracted and whenever the user
types “mpg” it is identified as a data column on the fly so that the column utterance
placeholder in the grammar may accept “mpg”. To enable dataflow context
awareness, FlowSense also has special utterance placeholders for diagram node
labels. By accepting node labels, FlowSense can effectively support node references
so that the user can more precisely instruct where to extend the diagram.
For typo and naming tolerance, FlowSense employs approximate matching and
checks each k-gram in the query (where k may range from 1 to the query length)
against all special utterances using case-insensitive Levenshtein distance [33, 44].
We divide the distance over the string length and use the ratio to mitigate the fact
that longer strings are more prone to typos. We find a k value of 2 or 3 and a ratio
threshold of 0.2 work well in practice.
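The approximate matcher can be sketched as follows. This is a minimal re-implementation of the described scheme; the utterance list and example query are hypothetical.

```python
# Sketch of the approximate special-utterance matcher: every k-gram of the
# query is compared to the known special utterances (columns, node labels,
# ...) using a length-normalized, case-insensitive Levenshtein distance.
# Parameter values follow the text (k up to 3, ratio threshold 0.2).

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def tag_special_utterances(query, utterances, max_k=3, threshold=0.2):
    """Return (k-gram, matched utterance) pairs found in the query."""
    tokens = query.lower().split()
    matches = []
    for k in range(1, max_k + 1):
        for i in range(len(tokens) - k + 1):
            gram = " ".join(tokens[i:i + k])
            for u in utterances:
                d = levenshtein(gram, u.lower())
                # Normalize by the utterance length: longer strings are
                # more prone to typos, so allow proportionally more edits.
                if d / max(len(u), 1) <= threshold:
                    matches.append((gram, u))
    return matches
```

For a query like "show horsepwer of mychart", the typo "horsepwer" still matches the column "horsepower" (one edit over ten characters), and "mychart" matches the node label regardless of case.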
6.4.2 Keyword Classification
FlowSense uses keyword classification to identify the semantic meaning of words
in the NL query and uses this information to decide which VisFlow function
to execute. For instance, the verb “show” is a synonym of “visualize”, “draw”,
etc. These words all indicate the intention to create a visualization. Meanwhile,
“find” may implicitly specify a data filtering requirement and is similar to “filter”.
We compute the Wu-Palmer similarity scores [79] between words and use the
measured scores to classify words in the NL query that have close meaning to a
set of pre-determined VisFlow function indicators. The implementation of the
similarity scores is based on WordNet [15] and NLTK (http://www.nltk.org).
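To illustrate the metric itself, the sketch below computes Wu-Palmer similarity over a tiny hand-built is-a hierarchy. The hierarchy is entirely hypothetical; the actual implementation queries WordNet synsets through NLTK, where depth conventions differ slightly.

```python
# Illustrative Wu-Palmer similarity on a toy verb taxonomy. Depth counts
# edges from the root; similarity = 2 * depth(lcs) / (depth(a) + depth(b)),
# where lcs is the least common subsumer of the two words.

PARENT = {                     # hypothetical is-a hierarchy
    "communicate": None,
    "display": "communicate",
    "show": "display",
    "visualize": "display",
    "search": "communicate",
    "find": "search",
    "filter": "search",
}

def path_to_root(w):
    path = [w]
    while PARENT[w] is not None:
        w = PARENT[w]
        path.append(w)
    return path

def depth(w):
    return len(path_to_root(w)) - 1

def wu_palmer(a, b):
    ancestors_a = path_to_root(a)
    lcs = next(w for w in path_to_root(b) if w in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))
```

Under this toy hierarchy, "show" scores higher against "visualize" (shared subsumer "display") than against "find" (shared subsumer only at the root), which is the kind of separation used to classify function-indicator words.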
6.4.3 Query Pattern Completion
The FlowSense parser identifies the key components of a query. FlowSense
attempts to fill in the blanks where information is missing using the following two
mechanisms:
Finding default values. Query components may be completed using default
values. Function options may have defaults. For instance, FlowSense automatically
chooses two numerical columns to visualize in a scatterplot when the query is simply
“Show a scatterplot”. Note that within a DFVS such decisions can easily be
changed by the user, so FlowSense does not attempt to make the best possible guess. Similar
decisions include completing port specification. By default FlowSense filters all the
data a visualization node receives when creating a filter, rather than filtering the
data subset interactively selected in the visualization. Sometimes the default values
could even be empty. A query like “Filter by mpg” results in FlowSense creating a
range filter on the mpg column with no filtering range given (the filter allows all
its input data to pass). The user can then follow up and fill in the filtering range
via the DFVS interface.
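A minimal sketch of the default-column rule described above, assuming the source table exposes column names and types (names here are hypothetical):

```python
# Hypothetical default-value completion for "Show a scatterplot": when the
# query names no columns, pick the first two numerical columns of the
# source table. The user can change this choice in the DFVS afterwards.

def default_scatterplot_columns(columns, types):
    """columns: names; types: per-column types, e.g. "int", "float", "str"."""
    numeric = [c for c, t in zip(columns, types) if t in ("int", "float")]
    return numeric[:2] if len(numeric) >= 2 else None
```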
Finding diagram editing focus. Whenever the user expands the dataflow
diagram there always exists an editing focus, though sometimes the focus is implicit.
For example, when the query contains a phrase like “from MyChart”, the focus (i.e.
the source node of the query) is explicitly given. However, users tend to neglect
the source or target nodes in their queries, especially when there is a sequence of
commands that together complete a task. When a query does not have explicit
focus, FlowSense derives the user’s implicit focus. If a node is activated by the user
(e.g. clicked), then that node is taken as the focus. Otherwise, we compute a focus
score for every node X by:
score(X) = activeness(X, t) + α · (1 − 1/(1 + e^{−(distanceToMouse(X)/γ − β)})).
The activeness of X is updated upon every user click in the system:
activeness(X, t) = activeness(X, t − 1)/2 + click(X, t),
where click(X, t) = 1 if the t-th click is on X and 0 otherwise. This definition
measures how actively a user is focusing on a node by how many times she has
recently clicked on it, as well as how close it is to the mouse cursor. The activeness derived
from user clicks decreases exponentially over time, while the closeness to mouse
dominates under a small distance with a shifted sigmoid function. We find the
parameters α = 2, β = 5, γ = 500 achieve good results. FlowSense chooses the node
with the highest focus score to be the diagram editing focus when there are no
activated nodes. If multiple source nodes are required (e.g. in a merge query),
FlowSense looks at the nodes in the order of their decreasing focus scores.
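The two formulas transcribe directly into code. The Node class and click bookkeeping below are illustrative scaffolding; the parameter values are those reported above.

```python
import math

ALPHA, BETA, GAMMA = 2.0, 5.0, 500.0  # values reported in the text

class Node:
    def __init__(self, label):
        self.label = label
        self.activeness = 0.0

def register_click(nodes, clicked):
    """Update activeness of every node on each user click:
    activeness(X, t) = activeness(X, t-1)/2 + click(X, t)."""
    for node in nodes:
        node.activeness = node.activeness / 2 + (node is clicked)

def focus_score(node, distance_to_mouse):
    """score(X) = activeness + alpha * (1 - sigmoid(d/gamma - beta)).
    The shifted sigmoid makes closeness dominate only at small distances."""
    closeness = ALPHA * (1 - 1 / (1 + math.exp(-(distance_to_mouse / GAMMA - BETA))))
    return node.activeness + closeness
```

With these parameters a node under the cursor gains a closeness term close to 2, while a node a few thousand pixels away gains almost nothing, so recent clicks and cursor proximity jointly determine the editing focus.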
The focus may also be vaguely specified using node types instead of node labels.
For instance, the user may say “show the data from the scatterplot”, in which case
“scatterplot” is a node type reference that describes a scatterplot node existing in the
dataflow diagram. FlowSense searches for matched node types within the dataflow
diagram. In case of a tie on the node type, e.g. there are multiple scatterplots in
the diagram, the nodes with higher focus scores are chosen.
6.4.4 Diagram Update
Once a query is successfully completed, FlowSense performs the VisFlow func-
tion(s) with the given function options. This typically results in the creation of one
or more nodes, e.g. the visualization function creates one plot but the highlighting
function creates three nodes (Table 6.1). FlowSense may also update existing nodes
without creating any new nodes, e.g. when the user only changes rendering colors.
Additionally, a query may operate on multiple existing nodes at once, e.g. linking
and merging two tables create edges between two nodes. Operating on multiple
nodes together helps simplify user interaction, as these operations would otherwise
require multiple drag-and-drops using traditional mouse interaction.
After new nodes and edges are created, the diagram may become more cluttered.
FlowSense locally adjusts the diagram layout after each diagram update. We use a
modified force-directed layout from the D3 library that works on the vicinity of
the current diagram editing focus. We extend the force to take rectangular node
sizes into account so that larger nodes such as embedded visualizations have larger
repulsive force. User-adjusted node positions are remembered by the system and
the layout algorithm avoids moving nodes that have been positioned by the user.
Currently FlowSense does not look for an optimal dataflow layout. We leave more
advanced layout methods [4] for future work.
6.4.5 Ambiguity
It is possible to have ambiguity even when the scope of the NLI is to map
queries to diagram editing operations. One type of ambiguity comes from multiple
possible query derivations (i.e. different parse trees), which can be defined as
syntactic ambiguity [19]. For example, FlowSense uses wildcard variables to match
stop words. The token “cars” from “Show a plot of cars” describes the user’s
understanding of data entities but is of no use to executing a visualization function.
Meanwhile, the token “horsepower” from “Show a plot of horsepower” is a special
utterance and should be treated as a table column to visualize. Therefore a wildcard
rule that matches the stop word “cars” may also match “horsepower”, resulting
in the second query being mishandled. We could handle this case by creating a
wildcard variable that rejects a special utterance token. Nevertheless, such a design
could lead to a larger number of variables and rules in the grammar, which are
harder to maintain and develop. Therefore we choose to resolve syntactic ambiguity
in the parsing phase with supervised learning on a weight vector w ∈ R^d that
gives the probability of derivations based on input utterances. Stochastic gradient
descent (SGD) is employed to optimize the multiclass hinge loss objective [63], as
introduced by Liang et al. [35] in the SEMPRE framework. The objective is given
by
min_w ∑_{(x,y)∈S} [ max_{y′} (score_w(x, y′) + penalty(y, y′)) − score_w(x, y) ],
with score_w(x, y) = w · feature(x, y).
In the above, x is the input query, y is the preferred derivation, and y′ is the
derivation chosen by the parser. The feature of a derivation, feature(x, y), is determined by
the applied rules in the derivation. penalty(y, y′) returns 0 on a correct prediction
and 1 otherwise. The objective function has a penalty
for each incorrectly predicted example that is linear in terms of the score (i.e.
probability) difference. The parser fits the training examples by giving preferred
derivations higher probability so that they are more likely to be returned in case of
ambiguity. In particular, the rule that expands to a data column special utterance
will be preferred over a rule that expands to a wildcard. We have created a small
set of labeled training examples as the set S to inform the system of the preferred
choices in terms of syntactic ambiguity. Note that only a small labeled set is
needed because the training set only disambiguates the grammar of around 500
rules that are data- and diagram-independent, instead of covering the overwhelmingly
large variations of NL input in general.
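A sketch of the loss-augmented SGD step for this objective, with derivation features as bag-of-rules dictionaries. The example derivations are hypothetical; in FlowSense the training itself happens inside SEMPRE.

```python
# Minimal structured hinge-loss SGD in the spirit of the objective above.
# Each derivation is represented by a sparse feature dict over applied rules.

def score(w, feat):
    """score_w(x, y) = w . feature(x, y) over sparse dicts."""
    return sum(w.get(k, 0.0) * v for k, v in feat.items())

def sgd_step(w, derivations, gold, lr=0.1):
    """derivations: {name: feature dict}; gold: preferred derivation name."""
    # Loss-augmented prediction: penalty(y, y') adds 1 to every wrong derivation.
    pred = max(derivations,
               key=lambda y: score(w, derivations[y]) + (y != gold))
    if pred != gold:  # subgradient is zero when the gold derivation wins by margin
        for k, v in derivations[pred].items():
            w[k] = w.get(k, 0.0) - lr * v
        for k, v in derivations[gold].items():
            w[k] = w.get(k, 0.0) + lr * v
    return w
```

After a few steps on an ambiguous example, the weight on the preferred rule (e.g. the one expanding to a data column special utterance) exceeds the weight on the wildcard rule, so the preferred derivation is returned in case of ambiguity.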
Another type of ambiguity lies in the multiple ways of executing the same query.
One example is “Show the cars with mpg greater than 15” on a visualization node.
From the grammar perspective the returned result is unambiguously a visualization
function plus a filtering function. However, there are two ways to execute it: one is
to create an attribute filter and then visualize the filtered cars in a new visualization;
alternatively, we may apply the filter on the input of the current visualization so
that it shows only the filtered cars. Both can be desired under some circumstances.
FlowSense has a default behavior that prefers filtering the input when the source
node is a visualization, which we find empirically more intuitive.
6.4.6 Error Recovery
There are two types of potential errors in executing a query (Figure 6.2). One
error occurs when the parser does not accept an NL input. For example, the conversational
input “Hello there” makes no sense in a dataflow context. Such input is rejected by
the parser and no fix applies. In this case the system presents an error message and
the user may revisit the list of sample queries in the FlowSense documentation and
tutorial to learn more about the capability and scope of the NLI. The other type of
error concerns an incomplete pattern in the NL input. The query may be structurally
acceptable but has incorrect or missing key information to properly execute. For
instance, the system displays an error when it fails to find a scatterplot in the
diagram for the grammatically correct query “Highlight the selected cars from the
scatterplot”, which uses the vague node type reference “scatterplot” and can only
be successfully executed when a scatterplot is present in the diagram.
Since the user is simultaneously using the underlying VisFlow DFVS while using
FlowSense, she always has the option to undo the mistake of FlowSense or to make
partial adjustments and corrections when the NLI does not yield the desired outcome.
This naturally facilitates easier error recovery, compared with NLIs in which the
user has to rely on the NLI itself to apply a fix and has limited control over the
result.
6.5 User Interface
FlowSense is built as an extension to VisFlow. The user may optionally use
NL to edit the diagram wherever necessary. There are two modes to input a query:
typing or speech. In the typing mode, the user types in the pop-up FlowSense
dialog that shows up around the current focus of the diagram. The speech mode is
implemented with the HTML5 Web Speech API, in which the user may record a spoken
query into the FlowSense input box for further editing. The speech mode can be
enabled by a microphone toggle on the right of the FlowSense input box, as shown
in Figure 6.3.
The special utterances identified by FlowSense are shown in colored tags. Each
color represents a different type of special utterance, including data column
(green), node label (light green), node type (purple), and dataset name (light blue).
If a special utterance is misclassified, the user may correct it by clicking it and
removing/changing its utterance category in the dropdown (Figure 6.3(i)). The
FlowSense input box is also designed to support token completion for special
utterances. The user may use the tab and arrow keys to select token completion
candidates like in a programming IDE (Figure 6.3(ii)). This reduces the typing
workload and helps remind the user of what is available in the system and in the
current dataflow diagram.
Figure 6.3: The FlowSense user interface and query auto-completion. Tagged special utterances are shown in colored tags. (i) Manually update special utterance tagging using a dropdown in the FlowSense input box; (ii) special utterance token completion; (iii) query auto-completion.
6.6 Query Auto-Completion
The usability of an NLI is closely related to its discoverability. It is desirable
that when the query is partially completed, the system is able to provide hints or
suggestions to the user on valid queries that include the partial input. Such a
feature has been requested by an NLI evaluation subject [19]. We therefore
develop an auto-completion algorithm in FlowSense to enhance its usability and
discoverability. When the user types a partial query and pauses for a long time,
the system triggers the query auto-completion automatically. The completion may
also be invoked manually by the user with a button press. Figure 6.3(iii) shows the
auto-completion suggestions in the FlowSense input box.
Auto-completion has been implemented in other visualization NLIs, such as
Eviza [60]. Eviza applies template-based auto-completion, in which the system
attempts to align user input to an available set of templates. Here we take a similar
approach by creating a set of query templates with around 100 queries. Upon
an auto-completion request, the algorithm searches through all possible textual
matches between the user’s partial query and a prefix of the template. The matched
query is then sent to the FlowSense parser for evaluation. If the query is accepted, it
becomes an auto-completion candidate. The accepted results that contain obvious
grammatical errors are discarded, e.g. a sentence with consecutive prepositions like
“... in in ...”. Those grammatical errors are due to the loose design of the FlowSense
grammar, which does not emphasize the usage of determiners and prepositions,
and may neglect them as stop tokens.
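A simplified version of the template alignment can be sketched as follows. The templates here are hypothetical, and the final validation step (sending candidates through the parser) is omitted.

```python
# Sketch of template-based auto-completion: align the partial input with a
# prefix of each template token by token. Bracketed slots stand for special
# utterance placeholders and match any typed token.

TEMPLATES = [
    "show [column] in [visualization type]",
    "show [column] and [column] in [visualization type]",
    "highlight the selection from [node]",
    "filter [column] between [number] and [number]",
]

def complete(partial, templates=TEMPLATES):
    """Return templates whose prefix matches the partial query token-wise."""
    typed = partial.lower().split()
    candidates = []
    for t in templates:
        slots = t.split()
        if len(typed) > len(slots):
            continue
        ok = all(tok == slot or slot.startswith("[")  # placeholder matches any token
                 for tok, slot in zip(typed, slots))
        if ok:
            candidates.append(t)
    return candidates
```

In the actual system each surviving candidate is additionally evaluated by the FlowSense parser, and candidates with obvious grammatical errors are discarded.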
6.7 Case Study
In this section we present two case studies that demonstrate the effectiveness of
FlowSense. The first case applies FlowSense to study the traffic speed reduction in
NYC, and shows how FlowSense can be used to analyze a real-world problem. The
second case is a diagram reproduction study, in which we validate that FlowSense
is able to help experienced VisFlow users speed up diagram construction. The
discussions in this section are based on an earlier version of the system implementation,
and the NLI only applies to the subset flow model without its data mutation
extensions.
6.7.1 Speed Reduction Study
In this case study we collaborate with two analysts who are domain experts
researching the city regulation issued on November 7, 2014 that reduced the default
speed limit on all New York City streets from 30 MPH to 20 MPH. The data
contain the estimated average hourly speed [51] for each road segment in Manhattan
from January 2009 to June 2016. The speed estimation was performed over the
TLC yellow taxi records that only have pickup and dropoff information [68]. The
analysts are familiar with the data, and the visualizations to be created are similar
to the visualizations they previously generated for the project using Tableau [66].
However they have no prior experience with either VisFlow or FlowSense. We
met the analysts in person and first introduced VisFlow and FlowSense usage in
a 30-minute session. Then we guided the analysts through how FlowSense can be
used to create visualizations to study the speed reduction. We observe in this study
that almost all the analysts’ visualization requests (excluding those that exceed
the scope of the VisFlow subset flow) can be effectively supported by FlowSense.
Here we summarize the NL queries that are applied for the speed reduction study.
[Figure 6.4 queries, applied in numbered order: (1) open monthly speed by speed limit; (2) Show speed distribution; (3) Encode speed limit by color; (4) Draw speed over time grouped by speed limit.]
Figure 6.4: Using FlowSense to study the overall speed reduction trend of NYC streets with different speed limits. The queries are applied in the numbered order. The resulting visualization shows a time series for the average speed of road segments, aggregated by unique speed limits. The smaller histogram snapshot shows the speed histogram without color encoding before step 3.
Initially, the analysts would like to look at the speed reduction impact at a
larger scale. They first load a pre-computed speed table (Figure 6.4(1)) with the
FlowSense data loading utility function (the analysts know the dataset name). The
table contains the monthly average speed aggregated by the speed limits of the
streets. The analysts ask the system to present a histogram of speed by “Show
speed distribution” (Figure 6.4(2)). The first histogram has no color encoding but
the analysts are able to immediately add a color scale by “Encode speed limit by
color”. FlowSense inserts a color mapping node with a red-green scale at the input
of the histogram (Figure 6.4(3)). The histogram thus shows the street groups with
higher speed limit in green, and lower speed limit in red. To view the speed changes
over time, the analysts use the query “Draw speed over time grouped by speed limit”
(Figure 6.4(4)). The query result is a line chart showing average speed changes
for different speed limit groups. The analysts observe that overall there is a speed
reduction pattern for each speed limit group that started around mid-2013.
[Figure 6.5 queries, applied in numbered order: (1) Show the data in a map; (2) Show only segments with a sign of yes; (3) Load segment monthly speed; (4) Find roads with a same segment id from West Village/Alphabet City; (5) Set blue/red color; (6) Merge; (7) Show speed over time by segment id.]
Figure 6.5: Applying FlowSense for a comparative study on the street speed changes between the West Village slow zone (blue) and the Alphabet City slow zone (red). FlowSense processes the rich dataflow context and allows the user to reference dataflow diagram elements at different specificity levels, e.g. with node types, node labels, or implicit references. The NL queries are executed in the numbered order.
Seeing the overall trend, the analysts move on to a comparative analysis between
the individual streets from two slow zones. They load and visualize a speed sign
installation table in a map (Figure 6.5(1)) by “Show the data in a map”. This
dataset has information on the speed limit, the geographical location, and whether
the street has speed sign installed for every road segment in Manhattan (signs are
shown as dots in the map). As the slow zones mostly have speed signs installed,
the analysts narrow down the data in the map by placing a filter on the “sign”
column (Figure 6.5(2)). The filtered map reveals two slow zone neighborhoods with
densely located signs: Alphabet City and West Village. The analysts create one
map visualization for each zone to compare the two zones. They
name the two maps by the slow zone names and select a few streets from each
zone (marked in the maps of Figure 6.5). To study the speed changes of these
selected streets, another table (named “segment monthly speed”, also known to
the analysts) that includes monthly average speed for each road segment is added
to the diagram (Figure 6.5(3)). The analysts then use the link queries to create
a sequence of nodes that extract segment IDs from the selected streets and find
their monthly average speed from the segment monthly speed table (Figure 6.5(4)).
Blue and red colors are assigned to the streets in West Village and Alphabet City
respectively to visually differentiate them (Figure 6.5(5)). The two groups of streets
are then merged by a set manipulation function (Figure 6.5(6)). Note that the query
“Merge” consists of only a single word, but it still works because the query completion
of FlowSense automatically locates the recently focused color editors as the source
nodes for this query. Finally, the two groups are rendered together in a speed series
visualization (Figure 6.5(7)), which compares the speed changes between the two
groups of streets. As the visualizations produced by FlowSense are linked, the
analysts can easily change the street selection in the maps to compare different
groups of streets. The generated visualizations are helpful to guide the analysts
towards further data analysis.
This case study demonstrates that FlowSense can be applied to a practical,
comprehensive analytical task. The analysts participating in this study think that
FlowSense is intuitive and easy to use after they understand how to work with
the VisFlow dataflow diagram to create those visualizations. They also think FlowSense
exemplifies how to build diagrams in VisFlow and is helpful to their learning of the
DFVS.
6.7.2 Diagram Reproduction Study
We carried out a diagram reproduction experiment that shows how FlowSense
can help experienced DFVS users simplify dataflow diagram construction. We
use the dataflow diagram designed in the VisFlow case study (Section 4.5.2) that
visualizes the baseball pitchers’ movements with MLB.com Statcast data as the
target diagram. We invited two participants who have good familiarity with the
VisFlow system and the dataset but are new to FlowSense. We introduced FlowSense
to both participants in a 15-minute session and walked them through the target
diagram in another 15 minutes to make sure that they understand how the target
diagram works. Then the participants were asked to reproduce the functionality
of the target diagram with FlowSense without referencing the original diagram.
Both participants were able to reproduce the diagram within 25 minutes. This
study demonstrates that an experienced DFVS user can benefit from FlowSense
in that it speeds up and simplifies dataflow diagram construction. Through post-
study conversations, we learn that the speed improvement mainly results from the
FlowSense capability of expanding the diagram at the editing focus and of operating
on multiple nodes at once (e.g., in data linking or merging). Otherwise, these
operations would require multiple drag-and-drop interactions to create several
nodes and edges.
6.8 User Study
We conduct a formal user study to evaluate the effectiveness of FlowSense together
with the VisFlow framework. We use the user study to validate whether a user
is able to smoothly apply FlowSense for dataflow diagram construction, and how
well FlowSense’s responses meet the user’s expectations. We design an experiment
that introduces VisFlow and FlowSense to the participant and assigns analytical
tasks to be solved using the system. We collect quantitative feedback from the
participants, measure the task completion time, and carry out a post-study data
analysis on the participants’ NL queries.
6.8.1 Experiment Design
We design an online experiment environment to perform the user study. Participants
join the study in a web browser on their own machines, and may ask the experiment
assistant for help and clarification via web chat or phone call during the experiment
session.
We recruited 17 participants, all aged between 20 and 30. Among them, 11 are
male and 6 are female. All of the participants work or study in the field of
computer science, and 12 have a data visualization background. 9 of the participants
are graduate students, and the other 8 are professionals (software engineer,
researcher, faculty). Three participants have prior experience with VisFlow. None
of the participants have prior knowledge of FlowSense.
The procedure of the user study is as follows:
• The participant completes a tutorial of the VisFlow dataflow framework. The
participant is asked to complete the tutorial diagram following the instructions
to demonstrate familiarity with the subset flow. (10–20 minutes)
• The participant completes a tutorial of the FlowSense natural language
interface. The participant is asked to complete the VisFlow tutorial diagram
using solely FlowSense to demonstrate familiarity with the NLI. (10–20
minutes)
• The participant freely explores and practices with both VisFlow and FlowSense.
(10 minutes)
• The participant explores an SDE Test dataset and constructs dataflow di-
agrams using VisFlow and FlowSense to answer questions about the data.
The participant is encouraged to use FlowSense as much as possible. (30–60
minutes)
• The participant takes a survey to give quantitative feedback about VisFlow
and FlowSense.
The SDE Test dataset includes the test results of software engineer candidates,
which reflect how strong a background each candidate has in computer science.
The dataset includes two tables. The first table describes the test results for each
candidate. A test consists of answering several multiple-choice questions selected
by the system from a large question pool. Each question has a unique ID, a
pre-determined difficulty, its supported programming language(s), and possibly a
time limit. If the candidate answers a question, a result (“correct” or “wrong”) is
given. Getting a question wrong incurs a negative score penalty, so a candidate
may choose to “skip” a question and receive a zero score. If a candidate takes no
action within the time limit of a question, the result is “unanswered”. The “TimeTaken”
column stores how much time in seconds a candidate took to answer a question.
The second table includes background information about each candidate, such as
the candidate’s age, field of study, and graduation date.
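The scoring rules above can be sketched as a small function. The penalty magnitude and per-question point value below are illustrative assumptions, since the text only specifies the sign of each outcome:

```python
# Sketch of the per-question scoring rules described above.
# The penalty magnitude and point value are assumed for illustration.
WRONG_PENALTY = -1  # a wrong answer incurs a negative score penalty

def question_score(result: str, points: int = 1) -> int:
    """Score a single question given its result string."""
    if result == "correct":
        return points          # full credit for a correct answer
    if result == "wrong":
        return WRONG_PENALTY   # negative penalty for a wrong answer
    # "skip" and "unanswered" both yield zero score
    return 0

def candidate_score(results: list[str]) -> int:
    """Total score over all questions a candidate saw."""
    return sum(question_score(r) for r in results)
```

Under these assumed values, a candidate who answers one question correctly and one wrongly nets zero, which is why skipping can be the rational choice for an unsure candidate.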
We give three analytical tasks about this dataset:
• Overview Task. The participant is asked to visualize the overall distribution
of the question answering results, and to find the total number of questions
that were skipped and the percentage at which a question was answered
correctly.
• Outlier Task. The participant is first asked to find an outlier user with
invalid age value (“2018”). Then the participant is asked to investigate a
data recording discrepancy in the “TimeTaken” column: some of the
“TimeTaken” values are significantly larger than the others when a question
is unanswered. This is probably due to a bug in the data collection
code.
• Comprehensive Task. The participant is asked to identify one question
that Masters candidates answer significantly better than Bachelors candidates.
This task requires comprehensive usage of VisFlow features, such as attribute
filtering, brushing, and heterogeneous table linking.
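The comprehensive task hinges on linking the two heterogeneous tables by candidate. A minimal sketch of that linking step, using hypothetical records and assumed field names (only “HighestLevelOfEducation” is named in the text), might look like:

```python
# Hypothetical records mimicking the two SDE Test tables.
# Field names other than "HighestLevelOfEducation" are assumed.
results = [
    {"candidate_id": 1, "question_id": 10, "result": "correct"},
    {"candidate_id": 1, "question_id": 11, "result": "wrong"},
    {"candidate_id": 2, "question_id": 10, "result": "correct"},
    {"candidate_id": 2, "question_id": 11, "result": "correct"},
    {"candidate_id": 3, "question_id": 10, "result": "skip"},
]
candidates = {1: "Masters", 2: "Bachelors", 3: "Masters"}  # id -> education

def correct_rate(question_id: int, education: str) -> float:
    """Fraction of a question's results that are correct, restricted to
    candidates with the given education level (the table link)."""
    rows = [r for r in results
            if r["question_id"] == question_id
            and candidates[r["candidate_id"]] == education]
    return sum(r["result"] == "correct" for r in rows) / len(rows)
```

Comparing `correct_rate(q, "Masters")` against `correct_rate(q, "Bachelors")` across all questions is the gist of the comprehensive task; in VisFlow the same join is expressed visually with a linker node rather than in code.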
All three tasks have definitive answers, ensuring that participants explore the
data and draw conclusions in a reasoned manner. Each user study session is logged
with an anonymized full diagram editing history for post-study data analysis.
6.8.2 Results
We analyze the user study results based on the quantitative feedback, the task
completion times, and the task answers. The NL queries the participants entered
are collected to analyze where the NLI meets or fails user expectations. In
particular, we manually walk through the rejected FlowSense queries one by one
to identify and categorize their reasons for failure.
6.8.2.1 Task Completion Quality
Figure 6.6 shows the correctness distribution of the participants’ answers. It
can be seen that the majority of the participants were able to come up with the
correct answers to the tasks. This demonstrates that VisFlow and FlowSense are
effective at assisting visual data exploration, and may help the users successfully
seek answers to data analysis questions.
Figure 6.7 shows the completion time distribution for each step of the user study.
Overall, the amount of time consumed is as expected, although the time participants
spent on the VisFlow tutorial and Task 3 is longer than anticipated. This shows that
users of VisFlow may need a sufficient amount of time to go through the tutorial so
as to understand the dataflow system itself. When the task involves heterogeneous
tables and interactive data filtering to find solutions, the time required to complete
the task increases. By analyzing the user comments in the feedback, we believe
this may be due to the fact that many participants are first-time VisFlow users.
For them it takes some effort to digest the concept of the subset flow model.

[Figure 6.6: Correctness distributions of the participants’ answers to each of the
user study tasks (panels: task1.count, task1.percentage, task2.user_id,
task2.timetaken, task3.question_id). “ok” represents a correct answer, “wa”
indicates a wrong answer, and “unanswered” means the participant did not find a
proper answer and skipped the task.]

In particular, table linking can be challenging to understand at first. However, after
approximately 20 minutes of system usage, most users were able to figure out how
to relate tables in VisFlow or use FlowSense for the purpose. This is reflected by
one of the feedback comments: “The operator and linker functions are confusing
at first. But after experimenting with the tool for a while and getting to know how
they work, things become easier”.
[Figure 6.7: Completion time box plot for each step of the user study
(VisFlow.Tutorial, FlowSense.Tutorial, Task1, Task2, Task3; time in minutes).
Four outliers are not shown: Task1 (2550); Task3 (109, 119, 212).]
6.8.2.2 Quantitative Feedback
We ask for feedback on six aspects each for VisFlow and FlowSense in our survey.
Each aspect is presented as a statement with a 1–5 Likert scale on which the
participant expresses agreement (5) or disagreement (1). Table 6.2 shows
the feedback for the VisFlow dataflow system. Table 6.3 shows the feedback for
the FlowSense NLI.
The quantitative feedback for VisFlow in Table 6.2 shows that the users were
able to understand the subset flow model of VisFlow. The majority of the users
agree that VisFlow presents an effective approach to visual data exploration, and
that they can successfully utilize VisFlow features in their own data exploration.
[Table 6.2: VisFlow survey results. Each statement is rated on a 1–5 Likert scale
(1 = disagree, 5 = agree); the response distributions appear as bar charts in the
original. The statements are:
• I understand the majority of VisFlow features.
• I understand the subset flow in VisFlow.
• I can follow VisFlow dataflow diagrams and understand their functionality.
• VisFlow is relatively simple to learn and use.
• VisFlow is an effective system for visual data exploration.
• I would like to use VisFlow for my future data exploration tasks.]
The quantitative feedback for FlowSense in Table 6.3 shows that most users
were able to understand the scope of FlowSense, and apply it for dataflow diagram
construction. 12/17 of the users agree (with a feedback score greater than 3)
that FlowSense simplifies diagram construction, and 10/17 agree that FlowSense
speeds up data exploration. The feedback also reveals room for improving the NLI,
as it is unclear to an average user how to revise a rejected query so that FlowSense
accepts it. It may be helpful to design an algorithm that suggests corrections or
changes to a rejected query.
[Table 6.3: FlowSense survey results. Each statement is rated on a 1–5 Likert scale
(1 = disagree, 5 = agree); the response distributions appear as bar charts in the
original. The statements are:
• I understand what queries FlowSense may accept and execute.
• The responses of FlowSense meet my expectations.
• When my query got rejected, I can figure out how to update it to let it be accepted.
• FlowSense simplifies dataflow diagram construction.
• FlowSense speeds up my data exploration.
• FlowSense helps me learn VisFlow features that I was not aware of.]
6.8.2.3 Reasons for Query Rejection
To closely study where FlowSense does not accept a query, we conduct a manual
walk-through of the rejected queries and categorize each rejected query by its
reason for rejection. Figure 6.8 lists the identified categories and their relative
difficulty to resolve.
The meaning of each category is:
• Not Implemented: The FlowSense grammar may technically support parsing
this query, yet we have not implemented the corresponding grammar rules
and their web client handler. Example queries include “change x column to
mpg”: the current system implementation does not support node option
changes triggered by the NLI. Queries in this category can be accepted by
extending the grammar and adding more rules.
• Rephrase: The user phrases the query using grammatical structures not
expected by the grammar, or uses words that do not appear in the dataset
table to describe a table entity or value. For example, in Task 3, if the user
mentions “degree”, FlowSense does not know that “degree” is equivalent to
the “HighestLevelOfEducation” column in the data. An additional knowledge
base would need to be added to the system so that the NLI can make such
concept derivations.
• Not Supported: The functionality indicated by the query is not supported
by the VisFlow dataflow framework. A query like “how many questions were
skipped” directly asks an analytical question about the dataset and exceeds
the scope of VisFlow and FlowSense. It cannot be accepted because VisFlow
has no such functionality.
• Bug: The system should be able to handle the query, but the execution
went wrong due to an implementation bug.
• Tagging Miss: A special utterance that should have been tagged was not
tagged, or one that should not have been tagged was. For example, the query
“select iris with id between 3 and 5” contains the word “iris”, which is both
a word describing the data entity and a dataset name. When FlowSense
automatically tags “iris” as a dataset name special utterance, the parser may
fail to accept the query. In this case the user may manually override the
tagging to avoid the error resulting from the parsing ambiguity.
• Invalid: The query was an invalid sentence, and cannot be understood by a
human.
• Composite: The user inputs a query that attempts to execute more VisFlow
functions than the grammar or the web client handler expects. The
grammatical structure connecting these multiple functions poses parsing
difficulty. It is recommended that composite queries be refactored into
multiple smaller steps so as not to overload the NLI with a complicated
grammatical structure that exceeds its parsing capability.
• Mistyped: The query has mistyped words.
[Figure 6.8: Number of rejected queries for each rejection reason (Not Implemented,
Rephrase, Not Supported, Bug, Tagging Miss, Invalid, Composite, Mistyped). The
colors of the bars indicate the relative difficulty (low, medium, or high) of resolving
a rejection.]
Overall, we analyzed 649 queries, out of which 421 were accepted by FlowSense,
at an acceptance rate of 64.869%. Excluding the 34 invalid and mistyped queries,
the acceptance rate was 421/(649 − 34) = 68.455%.
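The acceptance rates above follow directly from the counts; as a quick check:

```python
# Recomputing the two acceptance rates reported above from the raw counts.
total, accepted = 649, 421
invalid_or_mistyped = 34

overall = accepted / total                            # ~ 64.869%
adjusted = accepted / (total - invalid_or_mistyped)   # 421/615 ~ 68.455%

print(f"overall: {overall:.3%}, adjusted: {adjusted:.3%}")
```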
6.9 Discussions
The studies we conducted help validate the effectiveness of the FlowSense NLI.
Based on the study results, however, we identify a set of possible tasks that may
help improve the NLI design. One outstanding user experience issue is the lack of
guidance on how to correct a rejected query. Though correction suggestions are
highly desirable, it is technically challenging to derive the intermediate parsing
state of a parse tree and find out which parts of the query mismatch. Even knowing
a derivation and its parse tree, it may be difficult to generate a human-readable
sentence, because the grammar tends to be loose and may omit unimportant tokens
such as stop words. We leave the design and research of query correction suggestion
to future work.
In terms of query performance analysis, so far we have only analyzed the reasons
for the rejected queries. However, the accepted queries may not necessarily achieve
what the user wants. We may more closely analyze the diagram editing logs
collected from the user study and check what actions a user performs following
the NL queries. For example, if the user immediately issues an undo after an NL
query is executed, it clearly indicates that the outcome of the query was not desirable.
On the other hand, despite the strength and intuitiveness of the NLI, the user’s
habit of using an NLI also plays a vital role. We observe that the traditional drag-
and-drop interaction is definitive and convenient in many cases, and is often more
trusted by the user. For complex analysis tasks, a user tends to use the NLI less
frequently and to rely more on direct interactions. Given enough experience, the
user may gradually develop a sense of when and where to use the NLI so as to
utilize its strengths and avoid its weaknesses.
In this work we prefer semantic parsing over deep learning mainly because the
latter requires a large volume of training examples. Though there are benchmark
datasets for general NLP, there is not yet a training set catered to
visualization-oriented NLIs or the DFVS. In the future, with more users
working with our NLI, we would like to collect more user queries that constitute a
rich training set for text classification. With more data, we may explore alternative
algorithms for query parsing and execution. It is also possible to use the data to
improve the results of query auto-completion.
Chapter 7
Conclusions and Future Work
This dissertation presents VisFlow, a web-based dataflow framework for visual
data exploration. VisFlow uses a subset flow model that explores the possibility
of interactive visual data analysis in a dataflow context. The subset flow model
focuses on tabular data subsets and requires data immutability. The advantage
of using the subset flow model is that the dataflow may unambiguously assign
visual properties to the data items, so that subsets can be naturally brushed,
linked, and highlighted across multiple visualizations. VisFlow is a novel dataflow
framework that addresses the interactivity limitation of dataflow systems. It focuses
on enhancing the interactivity instead of supporting general computation as in
many of the other computational dataflow systems. The goal of the design is to
have a dataflow system that excels in flexibility, usability, and interactivity.
Chapter 3 introduces the concept of the subset flow model. It gives the definitions
of diagram elements and describes the mechanism of using visual properties to
visually track data subsets. A list of node categories is presented. This set of node
categories covers the majority of visual data exploration tasks that are possible in
the subset flow context. The subset flow data immutability constraints are given,
and the design philosophy behind them is discussed. We illustrate the data schema
of the subset flow model, which is equivalent to having node outputs be copies
of their inputs, with immutable original table columns but mutable visual property
columns. We show an example diagram that demonstrates the diagram concepts
and visualizes the Auto MPG dataset in multiple views.
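A minimal sketch of this output contract, with an assumed set of visual property column names, might look like:

```python
# Sketch of a subset-flow node's output schema: the output is a copy of the
# input subset; original table columns are immutable, while visual property
# columns (the set below is assumed for illustration) may be overwritten.
VISUAL_PROPS = {"color", "border", "size", "opacity"}  # assumed property set

def apply_visuals(subset: list[dict], visuals: dict) -> list[dict]:
    """Return a copy of `subset` with visual properties set, refusing any
    attempt to mutate an original data column."""
    illegal = set(visuals) - VISUAL_PROPS
    if illegal:
        raise ValueError(f"immutable data columns: {illegal}")
    # Each output row is a fresh dict: input rows are never mutated.
    return [{**row, **visuals} for row in subset]

# Rows loosely in the spirit of the Auto MPG example mentioned above.
cars = [{"name": "chevy", "mpg": 18}, {"name": "datsun", "mpg": 31}]
highlighted = apply_visuals(cars, {"color": "red"})
```

Because the original rows survive unchanged, a downstream node can always trace a highlighted subset back to the same data items, which is what makes brushing and linking across views unambiguous.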
Chapter 4 discusses the implementation details of the VisFlow framework. The
choices for the application stack are listed, with an in-depth discussion on how
component inheritance can be implemented within the VueJS frontend framework.
The user interface of the implemented framework is presented. In particular, the
VisMode dashboard allows the diagram connection details to be hidden, so that
the user may focus on result presentation or data exploration. VisMode seamlessly
switches between diagram editing and dashboard display, so as to help the user
keep track of the correspondence between the two modes. With the implemented
VisFlow framework, we work with domain experts on real-world data analysis tasks
to showcase the application of VisFlow. We demonstrate a gene regulatory network
study and a baseball pitch study. The two studies exemplify how the VisFlow
framework can be used to analyze heterogeneous datasets with multiple tables, and
to achieve a multi-view linked visualization environment for data exploration. Such
data exploration would otherwise have to be supported by a bespoke application.
Chapter 5 introduces the experimental extensions built over the subset flow
model to enhance its analytical capability. We introduce the concept of the extended
subset flow model, in which nodes are allowed to mutate their input data, and
consequently generate data mutation boundaries. It is shown that the original
subset flow may apply to groups of nodes within a same data mutation boundary,
CHAPTER 7. CONCLUSIONS AND FUTURE WORK 105
so that the interactivity advantage of the subset flow model can be preserved while
we introduce data mutating node types into the system. In particular, a script
editor node type is added to allow custom JavaScript scripting inside the dataflow.
Using JavaScript, the user can edit and generate the data in situ. With a series
player node, we demonstrate an analysis of the evacuation dataset to show how
scripting support can be used to perform data visualization that requires custom
rendering and display, e.g., drawing a floor plan. We also show two cases on
k-means algorithm visualization and model training visualization to demonstrate
more comprehensive usage of the extended subset flow model. To overcome the
limitation on iterative data analysis, a stateful data reservoir node is introduced
to allow downflow data to be sent back to the upflow for analysis iterations. The
input data of the data reservoir are released and propagated backward upon user
interaction. Such a design overcomes the limitation of an acyclic dataflow diagram
and avoids the multiple layers of nodes that would otherwise have to be created for
iterative analyses. Other data mutation nodes, such as the data transpose that
converts series data ordered by columns to series points listed in row order, can be
useful in the extended subset flow.
There could be many design variations on how to apply dataflow to interactive
data analysis and visual data exploration. In this work we propose the subset flow
model. We only present an example set of node types in our VisFlow implementation.
It is always possible to add new types on demand to enhance the subset flow.
Furthermore, under the extended subset flow, virtually any type of node can be
added, provided that it does not over-complicate dataflow usage or compromise
subset flow usability. In the future we would like to experiment with more node
types and consolidate a set of node types that best fulfills general visual data
exploration requirements in a dataflow context. We also would like to study the
impact of the extended subset flow model in terms of the user’s perception of the
dataflow. It would be interesting to analyze the usage distribution between the
subset flow that focuses on interactivity and the extended subset flow that leans
towards computation capability.
Chapter 6 covers the natural language interface FlowSense, which is integrated
into VisFlow to enhance the usability of the dataflow framework. FlowSense uses a
semantic parser to analyze and execute natural language queries. The NLI aims at
supporting dataflow diagram editing operations. We identify a set of commonly
performed tasks in VisFlow as the VisFlow functions, and make them the query
parsing targets. The grammar behind the parser is independent of the loaded
dataset or the dataflow diagram context. Instead, POS and special utterance
tagging is performed and the grammar includes utterance placeholders to accept
context-related values such as table column names and node labels. The FlowSense
grammar attempts to fill in the missing key components in a diagram editing
command by locating the current diagram editing focus, or using the default values.
The FlowSense user interface includes a rich input box with token and query
auto-completion. The query auto-completion algorithm takes a template-based
approach and matches a partial query against the templates to suggest acceptable
queries. One case study on the analysis of speed reduction in NYC is presented to
show the application of FlowSense in a practical data analysis scenario. A diagram
reproduction study is performed to demonstrate the efficiency improvement brought
by FlowSense over the traditional drag-and-drop diagram editing interactions. In
addition to the two case studies, a formal user study is performed to validate the
effectiveness of FlowSense. In the study, the participants went through VisFlow
and FlowSense training, and completed three assigned tasks with definitive answers.
We analyze the participants’ task completion time, summarize their quantitative
feedback, and identify space for improving the NLI based on where the queries
were rejected.
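The template-based matching summarized above can be sketched as a token-wise prefix match. The templates and the placeholder convention below are simplified assumptions for illustration, not FlowSense's actual grammar:

```python
# Simplified sketch of template-based query auto-completion: a partial
# query is matched token-by-token against templates, where a placeholder
# token (here spelled "$...") accepts any utterance. These templates are
# illustrative, not the real FlowSense grammar.
TEMPLATES = [
    "show $column in a histogram",
    "filter $column between $value and $value",
    "link the two tables by $column",
]

def complete(partial: str) -> list[str]:
    """Suggest templates whose token prefix matches the partial query."""
    words = partial.lower().split()
    suggestions = []
    for template in TEMPLATES:
        slots = template.split()
        if len(words) > len(slots):
            continue  # partial query is already longer than the template
        # Each typed word must equal its slot, unless the slot is a placeholder.
        if all(s.startswith("$") or s == w for w, s in zip(words, slots)):
            suggestions.append(template)
    return suggestions
```

Under this sketch, a partial query such as "show mpg" would match only the histogram template, because the placeholder slot accepts the column utterance while the literal tokens of the other templates do not match.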
Providing query correction suggestions would be a future research direction to
improve the user experience of the NLI. We may also analyze the diagram editing
logs in more detail and replay the diagram construction to identify where the
FlowSense results do not meet user expectations even when the NL queries are
accepted by the grammar. The NL queries collected from the user study may help
compose a larger training corpus of dataflow NL queries, based on which we may
explore alternative algorithms for parsing the NL input.
Appendix A
VisFlow Resources
All VisFlow resources can be found at the website https://visflow.org. The
site hosts an online demo of the dataflow framework at https://visflow.org/demo,
where we provide several demo datasets for the users to try out the system features.
The users may also create their own account to upload custom datasets to explore.
The documentation of the system is available at https://visflow.org/docs. The
documentation has a getting-started tutorial which introduces the basic concepts
and usage of the system. It also comprehensively covers all the design details,
including the definitions of the subset flow model and its elements, the supported
node types and their usage, system shortcuts, and so on. The documentation also
introduces the FlowSense natural language interface and provides example natural
language queries. The implementation source code of the VisFlow framework is
available as an open source project on GitHub: https://github.com/yubowenok/
visflow. Past codebase revisions are also available at this repository.
Bibliography
[1] R. Alexander, P. Rukshan, and S. Mahesan. Natural language web interface
for database (NLWIDB). CoRR, abs/1308.3830, 2013.
[2] M. Allahyari, S. A. Pouriyeh, M. Assefi, S. Safaei, E. D. Trippe, J. B. Gutierrez,
and K. Kochut. A brief survey of text mining: Classification, clustering and
extraction techniques. CoRR, abs/1707.02919, 2017.
[3] I. Androutsopoulos, G. D. Ritchie, and P. Thanisch. Natural language interfaces
to databases - an introduction. CoRR, cmp-lg/9503016, 1995.
[4] C. Batini, E. Nardelli, and R. Tamassia. A layout algorithm for data flow
diagrams. IEEE Trans. Software Engineering, 12(4):538–546, Apr. 1986.
[5] L. Bavoil, S. P. Callahan, C. E. Scheidegger, H. T. Vo, P. Crossno, C. T. Silva,
and J. Freire. VisTrails: Enabling interactive multiple-view visualizations. In
IEEE Visualization Conference, pages 135–142, 2005.
[6] J. Berant, A. Chou, R. Frostig, and P. Liang. Semantic parsing on Free-
base from question-answer pairs. In Empirical Methods in Natural Language
Processing (EMNLP), 2013.
[7] M. Bostock, V. Ogievetsky, and J. Heer. D3: Data-driven documents.
BIBLIOGRAPHY 110
IEEE Transactions on Visualization and Computer Graphics (InfoVis’11),
17(12):2301–2309, 2011.
[8] M. Ciofani, A. Madar, C. Galan, M. Sellars, K. Mace, F. Pauli, A. Agar-
wal, W. Huang, C. N. Parkurst, M. Muratet, K. M. Newberry, S. Meadows,
A. Greenfield, Y. Yang, P. Jain, F. K. Kirigin, C. Birchmeier, E. F. Wagner,
K. M. Murphy, R. M. Myers, R. Bonneau, and D. R. Littman. A validated
regulatory network for Th17 cell specification. Cell, 151:289–303, 2012.
[9] K. Cox, R. E. Grinter, S. L. Hibino, Lalita, J. Jagadeesan, and D. Mantilla.
A multi-modal natural language interface to an information visualisation
environment. International Journal of Speech Technology, pages 297–314,
2001.
[10] Cycling74. https://cycling74.com/.
[11] D3: Data Driven Documents. http://d3js.org.
[12] J. de Leeuw. Modern multidimensional scaling: Theory and applications
(second edition). Journal of Statistical Software, Book Reviews, 14(4):1–2, 9
2005.
[13] L. Deng. A tutorial survey of architectures, algorithms, and applications for
deep learning. APSIPA Trans. Signal and Information Processing, 3, 2014.
[14] J.-D. Fekete. The infovis toolkit. In IEEE Symposium on Information Visual-
ization, pages 167–174, 2004.
[15] C. Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, May
1998.
BIBLIOGRAPHY 111
[16] D. Foulser. IRIS Explorer: A framework for investigation. ACM SIGGRAPH
Computer Graphics, 29(2):13–16, 1995.
[17] J. Freire, C. T. Silva, S. P. Callahan, E. Santos, C. E. Scheidegger, and H. T.
Vo. Managing Rapidly-Evolving Scientific Workflows, pages 10–18. Springer
Berlin Heidelberg, 2006.
[18] C. Gane and T. Sarson. Structured Systems Analysis: Tools and Techniques.
McDonnell Douglas Systems Integration Company, 1979.
[19] T. Gao, M. Dontcheva, E. Adar, Z. Liu, and K. G. Karahalios. Datatone:
Managing ambiguity in natural language interfaces for data visualization. In
Proc. 28th Annual ACM Symposium on User Interface Software and Technology,
UIST’15, pages 489–500, 2015.
[20] Grasshopper3D. http://www.grasshopper3d.com/.
[21] S. Gratzl, N. Gehlenborg, A. Lex, H. Pfister, and M. Streit. Domino: Extract-
ing, comparing, and manipulating subsets across multiple tabular datasets.
IEEE Transactions on Visualization and Computer Graphics (InfoVis’14),
2014.
[22] J. Gurd, W. Bohm, and Y. M. Teo. Performance issues in dataflow machines.
In Future Generations Computer Systems. Elsevier Scientific, pages 285–297,
1987.
[23] P. E. Haeberli. ConMan: A visual programming language for interactive
graphics. ACM SIGGRAPH Computer Graphics, 22(4):103–111, June 1988.
BIBLIOGRAPHY 112
[24] E. Hoque, V. Setlur, M. Tory, and I. Dykeman. Applying pragmatics principles
for interaction with visual analytics. IEEE Transactions on Visualization and
Computer Graphics, 24(1):309–318, Jan 2018.
[25] IBM OpenDX. http://opendx.org/.
[26] IBM SPSS Modeler. http://www.ibm.com/software/products/en/
spss-modeler.
[27] IBM Watson Analytics. https://www.ibm.com/analytics/
watson-analytics/.
[28] W. Javed and N. Elmqvist. ExPlates: Spatializing interactive analysis to scaf-
fold visual exploration. Computer Graphics Forum (Proc. EuroVis), 32(2):441–
450, 2013.
[29] KNIME data analysis platform. http://www.knime.org/.
[30] A. Kumar, J. Aurisano, B. D. Eugenio, A. Johnson, A. Gonzalez, and J. Leigh.
Towards a dialogue system that supports rich visualizations of data. In The
17th Annual Meeting of the Special Interest Group on Discourse and Dialogue,
2016.
[31] S. Kumar, A. Kumar, P. Mitra, and G. Sundaram. System and methods for
converting speech to SQL. CoRR, abs/1308.3106, 2013.
[32] M. Lage, J. H. Ono, D. Cervone, J. Chiang, C. Dietrich, and C. Silva. Statcast
dashboard: Exploration of spatiotemporal baseball data. IEEE Computer
Graphics & Applications, 2016, to appear.
BIBLIOGRAPHY 113
[33] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and
reversals. Soviet Physics Doklady, 10:707, 1966.
[34] Y. Li, H. Yang, and H. V. Jagadish. NaLIX: A generic natural language search
environment for XML data. ACM Trans. Database Systems, 32(4), Nov. 2007.
[35] P. Liang and C. Potts. Bringing machine learning and compositional semantics
together. Annual Review of Linguistics, 1:355–376, 2014.
[36] Z. Liu, S. Navathe, and J. Stasko. Network-based visual analysis of tabular
data. In IEEE Visual Analytics Science and Technology (VAST’11), pages
41–50, Oct 2011.
[37] B. Ludascher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, E. A.
Lee, J. Tao, and Y. Zhao. Scientific workflow management and the Kepler
system. Concurrency Computat.: Pract. Exper., 18(10):1039–1065, Aug. 2006.
[38] J. Mackinlay, P. Hanrahan, and C. Stolte. Show Me: Automatic presentation for
visual analysis. IEEE Trans. Visualization and Computer Graphics, 13(6):1137–
1144, 2007.
[39] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, P. Inc, S. J. Bethard, and
D. Mcclosky. The Stanford CoreNLP natural language processing toolkit. In In
Proc. 52nd Annual Meeting of the Association for Computational Linguistics:
System Demonstrations, pages 55–60, 2014.
[40] A. Meduna. Formal Languages and Computation: Models and Their Applica-
tions. Auerbach Publications, 2014.
BIBLIOGRAPHY 114
[41] J. Meyer-Spradow, T. Ropinski, J. Mensmann, and K. Hinrichs. Voreen: A
rapid-prototyping environment for ray-casting-based volume visualizations.
IEEE Computer Graphics and Applications, 29(6):6–13, Nov 2009.
[42] Microsoft Power BI. https://powerbi.microsoft.com/.
[43] W. A. Najjar, E. A. Lee, and G. R. Gao. Advances in the dataflow
computational model. Parallel Computing, 25(13–14):1907–1929, 1999.
[44] G. Navarro. A guided tour to approximate string matching. ACM Computing
Surveys, 33(1):31–88, Mar. 2001.
[45] C. North, N. Conklin, K. Indukuri, and V. Saini. Visualization schemas and a
web-based architecture for custom multiple-view visualization of multiple-table
databases. In Information Visualization, pages 211–228, 2002.
[46] Observable. https://beta.observablehq.com/.
[47] Orange Data Mining Software. https://orange.biolab.si/.
[48] S. G. Parker and C. R. Johnson. SCIRun: A scientific programming envi-
ronment for computational steering. In Proc. ACM/IEEE Conference on
Supercomputing. ACM, 1995.
[49] S. G. Parker, D. M. Weinstein, and C. R. Johnson. The SCIRun Computational
Steering Software System. Birkhäuser Boston, 1997.
[50] P. Pasupat and P. Liang. Compositional semantic parsing on semi-structured
tables. In Proc. Annual Meeting of the Association for Computational
Linguistics, 2015.
[51] J. Poco, H. Doraiswamy, H. T. Vo, J. L. D. Comba, J. Freire, and C. T.
Silva. Exploring traffic dynamics in urban environments using vector-valued
functions. Computer Graphics Forum (Proc. EuroVis), 34(3):161–170, 2015.
[52] Project Jupyter. http://jupyter.org/.
[53] Quadrigram. http://www.quadrigram.com/.
[54] D. Ren, T. Höllerer, and X. Yuan. iVisDesigner: Expressive interactive design of
information visualizations. IEEE Transactions on Visualization and Computer
Graphics (InfoVis’14), 20(12):2092–2101, Dec 2014.
[55] J. Roberts. Waltz - an exploratory visualization tool for volume data, using
multiform abstract displays. In Visual Data Exploration and Analysis V, Proc.
SPIE, volume 3298, pages 112–122, 1998.
[56] J. C. Roberts. On encouraging coupled views for visualization exploration.
In Visual Data Exploration and Analysis VI, Proc. SPIE, volume 3643, pages
14–24, 1999.
[57] G. Ross and M. Chalmers. A visual workspace for hybrid multidimensional scal-
ing algorithms. In IEEE Symposium on Information Visualization (InfoVis’03),
pages 91–96, Oct 2003.
[58] A. Satyanarayan and J. Heer. Lyra: An interactive visualization design
environment. Computer Graphics Forum (Proc. EuroVis), 2014.
[59] A. Satyanarayan, D. Moritz, K. Wongsuphasawat, and J. Heer. Vega-Lite: A
grammar of interactive graphics. IEEE Transactions on Visualization and
Computer Graphics (Proc. InfoVis), 2017.
[60] V. Setlur, S. E. Battersby, M. Tory, R. Gossweiler, and A. X. Chang. Eviza: A
natural language interface for visual analysis. In Proc. 29th Annual Symposium
on User Interface Software and Technology, UIST’16, pages 365–377, 2016.
[61] B. Shneiderman. The eyes have it: A task by data type taxonomy for
information visualizations. In IEEE Symposium on Visual Languages, pages
336–343, 1996.
[62] Y. Sun, J. Leigh, A. Johnson, and S. Lee. Articulate: A semi-automated
model for translating natural language queries into meaningful visualizations.
In Proc. 10th International Conference on Smart Graphics, pages 184–195.
Springer-Verlag, 2010.
[63] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In
Advances in Neural Information Processing Systems, 2003.
[64] A. Telea and J. J. van Wijk. Vission: An object oriented dataflow system for
simulation and visualization. In E. Gröller, H. Löffelmann, and W. Ribarsky,
editors, Data Visualization ’99: Proceedings of the Joint EUROGRAPHICS
and IEEE TCVG Symposium on Visualization, pages 225–234, Vienna, 1999.
Springer Vienna.
[65] A. Telea and J. J. van Wijk. Smartlink: An agent for supporting dataflow
application construction. In W. C. de Leeuw and R. van Liere, editors, Data
Visualization 2000: Proceedings of the Joint EuroGraphics and IEEE TVCG
Symposium on Visualization, pages 189–198, Vienna, 2000. Springer Vienna.
[66] Tableau Software. http://www.tableausoftware.com/.
[67] ThoughtSpot. http://www.thoughtspot.com/.
[68] TLC Trip Records. http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml.
[69] C. Upson, T. Faulhaber, Jr., D. Kamins, D. Laidlaw, D. Schlegel, J. Vroom,
R. Gurwitz, and A. van Dam. The application visualization system: a
computational environment for scientific visualization. IEEE Computer Graphics
and Applications, 9(4):30–42, July 1989.
[70] B. Victor. Drawing dynamic visualizations. https://vimeo.com/66085662,
February 2013.
[71] vvvv. http://vvvv.org/.
[72] Y. Wang, J. Berant, and P. Liang. Building a semantic parser overnight. In
Association for Computational Linguistics (ACL), 2015.
[73] J. Waser, H. Ribičić, R. Fuchs, C. Hirsch, B. Schindler, G. Blöschl, and
E. Gröller. Nodes on ropes: A comprehensive data and control flow for steering
ensemble simulations. IEEE Transactions on Visualization and Computer
Graphics, 17(12):1872–1881, Dec 2011.
[74] C. Weaver. Building highly-coordinated visualizations in Improvise. In IEEE
Symposium on Information Visualization (InfoVis’04), pages 159–166, 2004.
[75] Wikipedia article: Yahoo! Pipes. https://en.wikipedia.org/wiki/Yahoo!_Pipes.
[76] Wolfram Alpha. http://www.wolframalpha.com/.
[77] K. Wolstencroft, R. Haines, D. Fellows, A. R. Williams, D. Withers, S. Owen,
S. Soiland-Reyes, I. Dunlop, A. Nenadic, P. Fisher, J. Bhagat, K. Belhajjame,
F. Bacall, A. Hardisty, A. N. de la Hidalga, M. P. B. Vargas, S. Sufi, and C. A.
Goble. The Taverna workflow suite: designing and executing workflows of web
services on the desktop, web or in the cloud. Nucleic Acids Research, pages
557–561, 2013.
[78] H. Wright, K. Brodlie, and M. Brown. The dataflow visualization pipeline as
a problem solving environment. In Proceedings of the Eurographics Workshop
on Virtual Environments and Scientific Visualization, pages 267–276, 1996.
[79] Z. Wu and M. Palmer. Verbs semantics and lexical selection. In Proceedings of
the 32nd Annual Meeting on Association for Computational Linguistics, ACL
’94, pages 133–138, Stroudsburg, PA, USA, 1994. Association for Computational
Linguistics.
[80] B. Yu, H. Doraiswamy, X. Chen, E. Miraldi, M. Arrieta-Ortiz, C. Hafemeister,
A. Madar, R. Bonneau, and C. Silva. Genotet: An interactive web-based
visual exploration framework to support validation of gene regulatory networks.
IEEE Transactions on Visualization and Computer Graphics (Proc. VAST),
20(12):1903–1912, Dec 2014.
[81] B. Yu and C. T. Silva. VisFlow - web-based visualization framework for tabular
data with a subset flow model. IEEE Transactions on Visualization and
Computer Graphics (Proc. VAST), 23(1):251–260, 2017.