VisFlow: A Web-Based Dataflow Framework for Visual Data Exploration
DISSERTATION
Submitted in Partial Fulfillment of
the Requirements for
the Degree of
DOCTOR OF PHILOSOPHY (Computer Science)
at the
NEW YORK UNIVERSITY
TANDON SCHOOL OF ENGINEERING
by
Bowen Yu
January 2019
Approved:
Department Head Signature
Date
University ID#: N17821602
Net ID: by460
Approved by the Guidance Committee:
Major: Computer Science
Claudio T. Silva, Professor of Computer Science and Engineering
Juliana Freire, Professor of Computer Science and Engineering
Enrico Bertini, Professor of Computer Science and Engineering
Luis Gustavo Nonato, Professor, Institute of Mathematics and Computer Sciences, University of São Paulo
Microfilm or other copies of this dissertation are obtainable from
UMI Dissertation Publishing
ProQuest CSA
789 E. Eisenhower Parkway
P.O. Box 1346
Ann Arbor, MI 48106-1346
Vita
Bowen Yu was born in 1990 in Shanghai, China. He received his Bachelor
of Science from Peking University, Beijing, China, in 2012. He joined the Ph.D.
program at Tandon School of Engineering, New York University, in Fall 2012. His
research interests are data visualization frameworks and visual analytics. He was
a graduate adjunct in Spring 2018, and an adjunct faculty in Fall 2018 at the
Courant Institute of Mathematical Sciences, New York University. He specializes
in competitive programming and coached the New York University programming
team from 2014 to 2018.
Acknowledgements
I am very grateful to my advisor, Prof. Claudio T. Silva, for his tremendous
support of my research and academic life over the past six years. This dissertation
would not have been possible without his guidance and vision. He not only provided
insightful feedback on my research, but also gave me deep inspiration on how to
become a better researcher and a better person overall. I would also like to thank
all the committee members: Prof. Claudio T. Silva, Prof. Juliana Freire, Prof.
Enrico Bertini, and Prof. Luis Gustavo Nonato. By working with them on various
projects, I have learned a lot from their research experience and personalities.
I am very fortunate to have worked in the VIDA group, where I was able
to explore research opportunities hardly available elsewhere. The atmosphere
in the group is welcoming and encouraging. The discussions I had with my
colleagues here were always thought-provoking and fun.
I would like to thank Bo Zhou for being my best friend and fellow labmate. It
was a great pleasure to discuss challenging research topics and algorithmic problems
with him. His wisdom and humor were essential to my life in NYC.
I would like to thank my colleagues at the Courant Institute, who worked hard
to make it possible for me to create my own course.
I would like to thank H.L., Yi.W., M.(M.)C. for being with me, and S.C., Yu.W.,
H.S., C.G., C.W. for their friendship across states and continents.
I would like to thank those coffee beans for their contribution to my integrity.
Finally, I would like to thank my parents for their unconditional support. It is
their deep love that leads me to who I am today.
Bowen Yu
January 2019
To my parents, and all the adventurous souls
ABSTRACT
VisFlow: A Web-Based Dataflow Framework for Visual Data
Exploration
by
Bowen Yu
Advisor: Prof. Claudio T. Silva, Ph.D.
Submitted in Partial Fulfillment of the Requirements for
the Degree of Doctor of Philosophy (Computer Science)
January 2019
Visual data exploration requires tools that are flexible and adaptive. Although
domain-specific systems are effective at solving particular tasks, they are relatively
costly to build. It is therefore desirable to have a general-purpose tool that gives
the user control over how data are queried and presented. In this work we design
VisFlow, a web-based visualization framework that employs dataflow diagrams
to facilitate flexible visual data exploration with good usability and low learning
overhead. VisFlow applies a subset flow model that focuses on processing tabular
data subsets. The model allows the user to create visualizations that update
reactively upon dataflow diagram changes. It also enables data selection from
visualizations for interactive filtering, subset identification, and subset manipulation.
VisFlow may help the user generate a multi-view visualization environment with
brushing and linking support. Compared with other existing dataflow systems,
VisFlow addresses the lack of interactivity in dataflow and overcomes the drawback
of high learning overhead due to complicated dataflow diagrams. We demonstrate
the capability and effectiveness of VisFlow through several case studies on real-world
data analysis scenarios.
Although the subset flow provides good dataflow interactivity, it requires data
immutability within the dataflow. To overcome the limitation on data processing
capability resulting from data immutability, we further design the extended subset
flow model that expands the application of VisFlow to derived data. In the extended
subset flow model, nodes are allowed to mutate the data and create data mutation
boundaries. The subset flow applies to the groups of nodes within the same data
mutation boundary, and may thus preserve its interactivity benefit. We incorporate
several node type extensions in the extended subset flow, such as the data reservoir
that stores and sends back its input data to address the limitation of acyclic
dataflow, and the script editor that supports custom JavaScript scripting to edit
or generate data. We show by case studies that the extended subset flow may
significantly improve the analytical capability of VisFlow and make the framework
applicable to a larger variety of tasks.
We develop a natural language interface FlowSense that employs semantic
parsing to assist dataflow diagram construction in VisFlow. FlowSense maps the
user queries in English to diagram editing operations in VisFlow. We propose a
grammar design for FlowSense that is based on part-of-speech (POS) and special utterance tagging.
We employ special utterance placeholders to make the semantic parser aware of
the current dataflow context during execution, while the grammar of the parser
can be independent of the data and dataflow diagram in use. The integration of
FlowSense simplifies diagram construction in VisFlow and improves the usability of
the dataflow framework. The effectiveness of FlowSense is validated by both case
studies and a formal user study.
The implementation of the VisFlow dataflow framework is available as open
source software. We provide an online demo of the system and comprehensive
documentation to support the reproducibility of the system.
Contents

Vita
Acknowledgements
Abstract
Contents
List of Figures
List of Tables

1 Introduction

2 Background and Related Work
2.1 Visualization Libraries and Frameworks
2.2 Dataflow Diagrams
2.3 Computational Dataflow Systems
2.4 Dataflow Visualization Systems (DFVS)
2.5 Subset Dataflow

3 Subset Flow
3.1 Input Tabular Data
3.2 Diagram Elements
3.2.1 Nodes, Ports and Edges
3.2.2 Data Primitives: Subsets and Constants
3.2.3 Visual Properties
3.2.4 Node Categories
3.3 Interactions
3.4 Data Immutability
3.5 Data Schema
3.6 Heterogeneous Data
3.6.1 Link Between Heterogeneous Tables
3.6.2 Visualization for Heterogeneous Input
3.7 Diagram Example

4 VisFlow Framework Implementation
4.1 Overview
4.2 System Implementation
4.2.1 Application Stack
4.2.2 Computation
4.2.3 Component Inheritance
4.3 VisMode Dashboard
4.4 Reproducibility
4.5 Case Studies
4.5.1 Gene Regulatory Network Analysis
4.5.2 Baseball Pitch Analysis

5 Extended Subset Flow
5.1 Extended Model
5.2 Node Type Extensions
5.2.1 Script Editor
5.2.2 Data Reservoir
5.2.3 Series Transpose
5.3 Case Studies
5.3.1 Evacuation Dataset Visualization
5.3.2 k-Means Clustering Visualization
5.3.3 Model Training Visualization
5.4 Discussions

6 FlowSense: A Natural Language Interface
6.1 Related Work
6.1.1 NLIs for Data Visualizations
6.1.2 Semantic Parsing
6.2 Design Goal
6.3 Semantic Parsing
6.3.1 VisFlow Functions
6.3.2 Grammar
6.3.3 Query Pattern
6.4 Query Execution
6.4.1 POS and Special Utterance Tagging
6.4.2 Keyword Classification
6.4.3 Query Pattern Completion
6.4.4 Diagram Update
6.4.5 Ambiguity
6.4.6 Error Recovery
6.5 User Interface
6.6 Query Auto-Completion
6.7 Case Study
6.7.1 Speed Reduction Study
6.7.2 Diagram Reproduction Study
6.8 User Study
6.8.1 Experiment Design
6.8.2 Results
6.9 Discussions

7 Conclusions and Future Work

A VisFlow Resources
List of Figures

3.1 Illustration of the key concepts of the VisFlow subset flow model. Node types are labeled in the diagram. The subsets are denoted by letter IDs within brackets. Assigned visual properties are shown in red font color. Transmitted constants are shown in gray.

3.2 Data schema behind the subset flow model. Whenever a subset passes through a visual editor, virtually a new copy of the subset is generated with the visual properties possibly modified. Each visualization node renders its input subset according to the visual properties carried by the data items in the subset. Immutable original table columns are shown in light gray.

3.3 Linking heterogeneous tables using a linker.

3.4 Heterogeneous input example: a network visualization that takes two heterogeneous subsets as inputs, for nodes and edges respectively. There are visual property mappings of node weights to node sizes, and edge weights to edge colors. Sizes bound to the nodes are shown on top of the node IDs. Colors bound to the edges are denoted by the font colors of the edge IDs. The network correspondingly renders the nodes and edges, and has four outputs: selected nodes, forwarded nodes, selected edges, forwarded edges.

3.5 An example subset flow diagram implemented by the VisFlow framework. The user edits the dataflow diagram that corresponds to an interactive visualization web application shown in the VisMode dashboard. The model years of the user-selected outliers in the scatterplot (b) are used to find all car models designed in those years (1981, 1982), which form a subset S that is visualized in three metaphors: a table for displaying row details (h), a histogram for horsepower distribution (i), and a heatmap for multi-dimensional visualization (j). The selected outliers are highlighted in red in the downflow of (b). The user selection in the parallel coordinates is brushed in blue and unified with S to be shown in (h), (i), (j). A heterogeneous table that contains the MDS coordinates of the cars is loaded in (k) and visualized in the MDS plot (o), with S being visually linked in yellow among the other cars.

4.1 The VisFlow framework interface. Nodes are created in a drag-and-drop manner onto the infinitely large canvas. The node panels list the node types supported by the VisFlow framework. These node types do not include the extended subset flow extensions introduced in Chapter 5.

4.2 Regulatory network analysis workflow and its corresponding VisMode dashboard generated in the VisFlow framework: (i) the dataflow diagram; (ii) the interactive VisMode visualization dashboard produced by the dataflow diagram.

4.3 Applying VisFlow to baseball pitch data analysis. (i) The pitching movement; (ii) the Statcast coordinate system illustration (from [32]); (iii) the analysis environment for the baseball pitch analysis generated by VisFlow; (iv) plots of pitching movements of 12 players. A categorical color scale is applied to render each player's pitches in a uniform color.

5.1 Example of a data mutation boundary created in the extended subset flow model. The two nodes with black borders are data-mutating nodes. The one at the top performs mpg aggregation for each car origin. The one at the bottom joins the two input tables. The system uses node borders to help the user identify where the data get changed, and the node groups in which the original subset flow applies.

5.2 Limitation of an acyclic dataflow diagram. One layer of nodes has to be created for each iteration of network expansion.

5.3 The data reservoir holds all the changes to the edges. When the user releases the changes, those edges are merged into the upflow edges so that the network visualization may include the new edges.

5.4 A snapshot of the JavaScript written in a script editor to render the floor plan of the evacuation data. The script manipulates the DOM tree that is rooted at content.

5.5 Using a series player and a script editor to visualize the evacuation data from VAST Challenge 2008.

5.6 Using an MDS plot and a cluster label distribution plot to visualize the iterations of the k-means clustering algorithm: (i) the dataflow diagram; (ii) visualizations of the clustering algorithm iterations.

5.7 Applying a combination of extended model nodes to visualize a multi-layered perceptron training process. Using a stateful script editor we show metric value changes over time in a line chart. By subset flow diagram highlighting, we can highlight the incorrectly predicted test data in the MDS plot and the histogram.

6.1 An example FlowSense query and its execution. The derivation of the query is shown as a parse tree in the middle. The sub-diagram expanded by the query is illustrated at the bottom. The five major components of a query pattern are underscored. Each component and its relevant parts in the parse tree and the dataflow diagram are highlighted by a unique color. The result of executing this query is to create a parallel coordinates plot on the columns mpg, horsepower, and origin, with input from the selection port of the node MyChart.

6.2 FlowSense query execution phases. POS and special utterance tagging is performed first. Special utterances describing the data columns and diagram nodes are identified and can be matched against utterance placeholders. Keyword classification is applied to identify important utterance implications such as the intention to call a specific VisFlow function. FlowSense attempts to complete the query pattern if missing information can be filled using default values. Upon an execution failure the user is notified and asked to update the query.

6.3 The FlowSense user interface and query auto-completion. Tagged special utterances are shown in colored tags. (i) Manually updating special utterance tagging using a dropdown in the FlowSense input box; (ii) special utterance token completion; (iii) query auto-completion.

6.4 Using FlowSense to study the overall speed reduction trend of NYC streets with different speed limits. The queries are applied in the numbered order. The resulting visualization shows a time series for the average speed of road segments, aggregated by unique speed limits. The smaller histogram snapshot shows the speed histogram without color encoding before step 3.

6.5 Applying FlowSense for a comparative study on the street speed changes between the West Village slow zone (blue) and the Alphabet City slow zone (red). FlowSense processes the rich dataflow context and allows the user to reference dataflow diagram elements at different specificity levels, e.g. with node types, node labels, or implicit references. The NL queries are executed in the numbered order.

6.6 Correctness distributions of the participants' answers to each of the user study tasks. "ok" represents a correct answer. "wa" indicates a wrong answer. "unanswered" means the participant did not find a proper answer and skipped the task.

6.7 Completion time box plot for each step of the user study. Four outliers are not shown: Task 1 (2550), Task 3 (109, 119, 212).

6.8 Number of rejected queries for different rejection reasons. The colors of the bars indicate the relative difficulty of resolving a rejection.
List of Tables

5.1 Series transpose example that converts the column-major series in Table (a) into the row-major series in Table (b) based on the key column "Country" and the series columns of the years. The cell values in Table (a) are stored in the third column of Table (b). Table (b) has 9 rows that are not all shown.

6.1 Six major categories of VisFlow functions. These sub-diagrams are frequently used to compose more sophisticated diagrams that address analytical tasks.

6.2 VisFlow Survey Result

6.3 FlowSense Survey Result
Chapter 1
Introduction
Data analysis requires an integrative environment that supports both data
presentation and interactive queries. The visual information seeking mantra and
task taxonomy [61] summarize several tasks that are frequently performed in visual
data analysis, such as overview, zoom, filter, and details-on-demand. In practice,
these tasks are often performed iteratively and progressively. Therefore, it is
desirable that the analysis can be carried out in a tool that is flexible and able to
be adapted for custom queries. Though a dedicated visual analytics tool may be
very effective at solving a particular domain problem, building a domain-specific
application is often costly and time-consuming. Adaptive and customizable
visualization tools are relatively scarce, and those that exist are typically
complicated and hard to use, especially for novice users.
In a dataflow system, the user draws a dataflow diagram that specifies system
behavior, such as how data are selected, filtered, and visualized. Updates on the
dataflow diagram immediately change the system functionality, so that the analysis
can be flexible and customizable. Additionally, dataflow diagrams are intuitive
representations of the system behavior and are very effective at capturing the
workflow design decisions that an analyst needs to assess and reflect frequently.
Despite the flexibility and intuitiveness, existing dataflow systems typically have
some drawbacks for visual data exploration:
• Many dataflow systems are designed for data processing and computations [10,
20]. Those systems present dataflow diagrams that are visual abstractions
of programming, which require the user to have a programming background
and understand the correspondence between diagram inputs, outputs and the
underlying program arguments. Those systems have high learning overhead,
and produce dataflow diagrams that are often too complex to read.
• Dataflow analytics platforms [26, 29] are often not specifically designed for
visualizations. Visualizations in those systems are mostly statistical summaries
that do not support interactive data exploration. Besides, those systems
typically do not have reactive feedback upon dataflow changes, as explicit
re-execution is required to update the visualizations due to the complex
nature of the operations an analytics platform performs.
• Dataflow visualization systems (DFVS) [5, 23, 25, 48] mostly aim at generating
rendering pipelines, e.g. for volume rendering. Interactivity is often limited
for navigation within rendered views. The flexibility advantage of using
dataflow is exploited only in the application construction phase, rather than
in the visual data exploration phase.
We seek to design a dataflow framework that excels in both usability and
flexibility, and addresses the customization needs of visual data exploration. In this
work we design VisFlow, a web-based visualization framework and its subset flow
dataflow model. The subset flow model builds on the idea of subset manipulation in
dataflow [55], and extends it to interactive visual data exploration. Our flow model
requires the data transmitted within the flow to be subsets of table rows from
immutable input tabular data. The advantage of this model is that visual properties
of data items can be unambiguously defined when subsets are transmitted, so that
brushing and linking, which are essential to visual data exploration, can be easily
achieved. We perform real-time evaluations of the system modules in order to
have visualizations update reactively on user interactions. As a result, the user
may directly select the data visualized and perform interactive queries through
embedded visualizations, saving the cost of explicit re-execution. In VisFlow,
user selections are explicit outputs of the system modules, so that subsets can be
easily tracked, compared, and understood. With a focus on data subsets, our flow
model also reduces the diagram complexity and mitigates the learning overhead of
computational dataflow. In Chapter 3, we give the definitions of the subset flow
model, its data immutability constraints and design philosophy.
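The core idea can be sketched in a few lines of TypeScript (an illustrative toy, not VisFlow's actual API; the table contents, node functions, and property names are all hypothetical): nodes exchange subsets of row indices into an immutable input table, and visual properties travel with the subset rather than being written into the data.

```typescript
// Toy sketch of the subset flow model (hypothetical names, not VisFlow's API).
// The input table is immutable; nodes exchange subsets of row indices, and
// visual properties (here just a color) travel with the subset instead of
// being written into the table.

type Row = { model: string; mpg: number; year: number };
type Visuals = { color?: string };

interface Subset {
  rows: number[];                 // indices into the immutable input table
  visuals: Map<number, Visuals>;  // per-row visual properties
}

const table: readonly Row[] = [
  { model: "chevelle", mpg: 18, year: 1970 },
  { model: "corolla", mpg: 32, year: 1981 },
  { model: "civic", mpg: 36, year: 1982 },
];

// A filter node outputs the subset of its input rows matching a predicate.
function filterNode(input: Subset, pred: (r: Row) => boolean): Subset {
  const out: Subset = { rows: [], visuals: new Map() };
  for (const i of input.rows) {
    if (pred(table[i])) {
      out.rows.push(i);
      out.visuals.set(i, input.visuals.get(i) ?? {});
    }
  }
  return out;
}

// A visual editor node assigns a color to every row of the subset, leaving
// the underlying table untouched, so downstream views can render the brush.
function colorNode(input: Subset, color: string): Subset {
  const out: Subset = { rows: [...input.rows], visuals: new Map() };
  for (const i of input.rows) out.visuals.set(i, { ...input.visuals.get(i), color });
  return out;
}

const all: Subset = { rows: [0, 1, 2], visuals: new Map() };
const recent = filterNode(all, r => r.year >= 1981); // rows 1 and 2
const brushed = colorNode(recent, "red");            // highlighted selection
```

Because each node returns a fresh subset object while the table itself is never modified, a downstream visualization can always resolve a row's visual properties unambiguously, which is what makes brushing and linking straightforward in this model.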
We implement the VisFlow framework using a modern client-server stack (TypeScript,
Vue.js, Express, MongoDB). We design an intuitive drag-and-drop user interface
for dataflow diagram editing. The VisMode dashboard utility is also integrated to
allow easy sharing and presentation of the results of the dataflow. In Chapter 4,
we introduce the implementation details of VisFlow and its user interface design.
We showcase the application of the implemented framework with two case studies
in real-world data analysis scenarios.
Despite the advantages of better interactivity and lower learning overhead, the
subset flow has its limitations. First, all the data are immutable in a subset flow,
which may pose restrictions on the analysis, e.g. aggregations must be performed
outside VisFlow in order to analyze sums and averages. Yet we observe that data
processing and computation are often desired in an integrative analysis environment.
Second, VisFlow uses an acyclic dataflow diagram that does not allow loops, as loops
may introduce execution and dependency ambiguity. This makes some tasks difficult
to achieve in VisFlow, such as iteratively expanding a network visualization by
adding incident edges into the graph. We address these limitations by proposing
the extended subset flow model in Chapter 5. In the extended model, nodes are
allowed to mutate the data and create data mutation boundaries. The subset flow
applies to each group of nodes within the same data mutation boundary, so that the
interactivity benefit from the original subset flow can be preserved while we increase
the data processing capability of the system. We support a script editor node that
enables generalized data editing and generation within the dataflow. The script
editor may also be used to perform custom rendering and DOM manipulation. We
also introduce a data reservoir node that may save its input subset and send the
subset back to the earlier dataflow upon user-initiated backward data propagation.
The data reservoir allows the system to update the same visualization iteratively,
which was previously not possible in an acyclic dataflow. We describe the details
of the extended subset flow model in Chapter 5, and exemplify its advantages and
applications with several case studies.
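One way to picture the boundary computation is the following sketch (the names and the simplified single-upstream rule are assumptions for illustration, not VisFlow's implementation): a source node or a data-mutating node starts a new boundary, and any other node inherits the boundary of its upstream node.

```typescript
// Sketch of deriving data mutation boundaries (illustrative; the rule and
// names are assumptions, not VisFlow's implementation). A source node or a
// data-mutating node starts a new boundary; any other node inherits the
// boundary of its first upstream node. Subset-flow interactivity then
// applies within each boundary group, but not across a mutating node.

interface DiagramNode {
  id: string;
  mutates: boolean;   // e.g. an aggregation or a table join
  inputs: string[];   // upstream node ids (the diagram is acyclic)
}

function boundaries(nodes: DiagramNode[]): Map<string, string> {
  const byId = new Map<string, DiagramNode>();
  for (const n of nodes) byId.set(n.id, n);
  const bound = new Map<string, string>();
  const resolve = (id: string): string => {
    const cached = bound.get(id);
    if (cached !== undefined) return cached;
    const n = byId.get(id)!;
    const b = n.mutates || n.inputs.length === 0 ? id : resolve(n.inputs[0]);
    bound.set(id, b);
    return b;
  };
  for (const n of nodes) resolve(n.id);
  return bound;
}

// cars -> filter -> aggregate (mutates) -> chart
const diagram: DiagramNode[] = [
  { id: "cars", mutates: false, inputs: [] },
  { id: "filter", mutates: false, inputs: ["cars"] },
  { id: "aggregate", mutates: true, inputs: ["filter"] },
  { id: "chart", mutates: false, inputs: ["aggregate"] },
];
const groups = boundaries(diagram);
// "cars" and "filter" share one boundary; "aggregate" starts a new boundary
// that "chart" belongs to.
```

In this toy diagram, the original subset flow semantics apply within the {cars, filter} group and within the {aggregate, chart} group, but not across the aggregation node, mirroring how the extended model confines interactivity to each boundary.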
To further reduce the learning overhead and improve the usability of VisFlow,
we design FlowSense, a natural language interface for the VisFlow framework.
The natural language interface allows the user to edit dataflow diagrams using
plain English. We first identify a set of commonly performed diagram editing
operations within VisFlow, and then employ state-of-the-art semantic parsing
techniques to map natural language input to diagram editing operations. To make
the semantic parser aware of the diagram content, we propose a grammar design
with special utterance placeholders. The special utterances related to the diagram
context are explicitly identified in the user interface and accepted by the utterance
placeholders when parsed. A query auto-completion algorithm based on template
matching is added to further enhance the usability of FlowSense. FlowSense may
not only automate and speed up the diagram construction, but also help the user
learn common diagram construction patterns in VisFlow. We demonstrate the
effectiveness of FlowSense using one case study, one diagram reproduction study,
and a formal user study. The design of FlowSense and its studies are discussed in
Chapter 6.
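The role of the utterance placeholders can be illustrated with a toy tagger (placeholder token names such as r_column are hypothetical; FlowSense's actual tagging is part of its semantic parser): occurrences of column names and node labels from the current dataflow context are replaced by placeholders, so the grammar never needs to mention concrete data.

```typescript
// Toy illustration of special utterance tagging (placeholder names like
// "r_column" and "r_node" are hypothetical). Column names and node labels
// from the current dataflow context are replaced by placeholders, so the
// parser grammar stays independent of the particular dataset and diagram.

const columns = ["mpg", "horsepower", "origin"]; // from the loaded table
const nodeLabels = ["MyChart"];                  // from the current diagram

function tag(query: string): { tagged: string; slots: string[] } {
  const slots: string[] = [];
  const tagged = query
    .split(/\s+/)
    .map(token => {
      if (columns.includes(token)) { slots.push(token); return "r_column"; }
      if (nodeLabels.includes(token)) { slots.push(token); return "r_node"; }
      return token;
    })
    .join(" ");
  return { tagged, slots };
}

const q = tag("show mpg and horsepower of MyChart in a parallel coordinates plot");
// q.tagged: "show r_column and r_column of r_node in a parallel coordinates plot"
// q.slots:  ["mpg", "horsepower", "MyChart"]
```

The grammar then only has to recognize the placeholder tokens, while the recorded slots tell the executor which concrete columns and nodes the user referred to.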
Some of the results presented in this dissertation have been published in refereed
journals and conference proceedings. Chapter 3 and Chapter 4 include work
from [81]. Chapter 6 includes work to be published. The implementation source
code of the VisFlow framework is available as an open source project on GitHub.
We also provide an online demo of the system with comprehensive documentation.
See Appendix A for a detailed list of the VisFlow resources.
Chapter 2
Background and Related Work
This work is rooted in research advancement in visualization frameworks and
dataflow systems. In this chapter, we first cover related general-purpose visualization
frameworks. We then briefly survey existing dataflow systems by their
major application scenarios and functionality. We categorize the systems into
computational dataflow systems and dataflow visualization systems. Specifically,
we highlight the works that also transmit subsets in dataflow.
2.1 Visualization Libraries and Frameworks
Visualization libraries such as the InfoVis Toolkit [14] and D3 [7, 11] provide
script support to author visualizations. Those tools are powerful but require the
user to program. Declarative languages such as Vega-Lite [59] allow the user to
create pre-defined visualizations with custom data using simpler JSON specifications.
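For example, a minimal Vega-Lite-style specification declares a scatterplot as data plus encodings, with no rendering code (shown here as a TypeScript object literal; the dataset path and field names follow the common cars example and are illustrative):

```typescript
// A minimal Vega-Lite-style chart specification, written as a TypeScript
// object literal for illustration. The chart is fully declared by its data
// source, mark type, and field-to-channel encodings; the user writes no
// rendering code.
const spec = {
  data: { url: "data/cars.json" }, // illustrative dataset path
  mark: "point",
  encoding: {
    x: { field: "Horsepower", type: "quantitative" },
    y: { field: "Miles_per_Gallon", type: "quantitative" },
    color: { field: "Origin", type: "nominal" },
  },
};
```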
Many visualization frameworks aim at providing a general-purpose environment that
is flexible and customizable for visual data analysis, without asking the user to write
scripts or text specifications. Those frameworks are also referred to as visualization
construction tools. Unlike visualization libraries, visualization frameworks do not
require the user to program the visualization and build an analytical solution
from scratch. Tableau [38, 66] provides a drag-and-drop interface for the user to
define mappings from data values to visualization attributes (namely columns and
rows) required by visualization metaphors. Quadrigram [53] creates a visualization
dashboard by adding charts and controllers to the dashboard area. iVisDesigner [54]
allows the user to add data elements to the canvas, specify their visual mappings,
and define chart area and axes. It provides an expressive environment to author
customizable visualizations. Lyra [58] employs a similar concept but with implicitly
pre-defined chart axes. The user drags graphical elements onto the Lyra canvas and
associates data attributes with those elements. VisFlow, like the other visualization
frameworks, aims at providing a general-purpose environment that is flexible and
customizable. It uses a dataflow diagram to define how data are transmitted,
processed, and visualized within the system. Compared with other authoring tools,
VisFlow applies a dataflow diagram to present a clearer and more explicit view of
data transmission and manipulation. Consequently, the user has more control over
the system behavior, in that VisFlow not only creates visualizations, but also helps
define the analytical logic behind the visual data exploration.
On the other hand, modern data analysis often uses a notebook environment.
Popular examples include Jupyter [52] and the more recent Observable [46]. Dataflow systems provide a more general workflow than notebooks, which follow a single linear sequence of code snippets. Dataflow and notebooks thus offer different data analysis and exploration solutions: dataflow systems pass inputs and outputs explicitly, while a variable defined in a notebook is global to the page. Explicit data passing may help the user better interpret changes to the
data within the system in a step-by-step manner.
2.2 Dataflow Diagrams
Dataflow is a diagrammatic representation of the relations between system
components. It originated as a graphical method for analyzing system structures [18].
VisFlow uses a dataflow diagram to specify data transmission between modules, so that the transmission direction is unique and data operations can be
executed without ambiguity. It is worth distinguishing the dataflow diagrams in this work from visually similar diagrams that have different meanings and purposes. The dataflow in VisFlow differs from illustrative flow diagrams, where the name “data flow diagram” (DFD) may carry other meanings [18]. For example, in information system modeling a DFD only
describes how modules communicate with each other and there may be loops in the
diagram. Some visualization techniques use graph-based relational diagrams, which
are not dataflow. For example, Domino [21] presents a novel metaphor to link
visualizations for identifying table subset relations, which visualizes bidirectional
relational edges. North et al. [45] proposed a visualization schema for relating
multiple-view visualization with database schema, in which the edge is an analogy
of database table join. The Improvise system [74] enables the user to specify
coordinations between variables, controls and rendering, which are illustrated in a
coordinated graph. Liu et al. [36] proposed a network-based analysis for tabular data,
in which data entities (items) are nodes and edges show their weighted relations.
Despite the visually similar diagram appearances, none of the diagrams used by
the above systems are dataflow diagrams, because the edges in those diagrams
represent or declare relations, but do not define the direction in which data are
transmitted within the system. Additionally, some systems feature an implicit dataflow with a single analytical or executional path. Examples include Victor’s creative demo of drawing dynamic visualizations [70] and the Lyra [58] system. In this work, however, we focus on flow flexibility and multiple flow branches.
2.3 Computational Dataflow Systems
Dataflow models have been widely adopted to model computations [22, 43]. In
domains such as parametric modeling [20] and signal and media processing [10, 71], dataflow is used to model systems effectively. Dataflow systems may provide analysis capabilities including data transformation, data filtering, statistics computation, and visualization. For example, the user is able to employ a comprehensive set of computational nodes to perform machine learning and data mining in the IBM SPSS Modeler [26], KNIME [29], and Orange [47]. Kepler [37] manages scientific workflows where analytical steps are modularized. Yahoo Pipes [75] (shut down in 2015) used dataflow to process web content. Taverna [77] provides workflows for integrating web services and local tools for bioinformatics.
The aforementioned systems are designed for general-purpose computation and
analysis. Visualization capability and queries by user selections are lacking in
those systems. Most of these systems do not emphasize the interactive queries that
are possible in plots, as visualization nodes have no outputs and serve only as a
summary of the computation results or statistics. Computational dataflow systems
often yield a complicated flow diagram due to the complexity of computations. In
this work, rather than aiming at general computations, we focus on enhancing the
user’s interactive control over the visualized data in the dataflow. In particular, achieving interactive analysis with brushing and linking in the above systems is not straightforward, because computational modules may generate arbitrary derived data that introduce ambiguity when tracing data items. The dataflow model of
VisFlow attempts to overcome this limitation.
2.4 Dataflow Visualization Systems (DFVS)
Researchers have attempted to use dataflow to model data visualizations in
order to provide better flexibility in system functionality. Compared with bespoke
visualization systems, dataflow visualization systems are more general-purpose and
may adapt to a larger variety of analytical tasks.
There are a large number of visualization systems based on dataflow. Most
systems effectively use dataflow to construct flexible rendering pipelines. Early
systems emerged mostly for the interactive construction of scientific problem solving environments [78], e.g. steering geometric modeling and performing volume
rendering [16, 48, 55, 69]. The Application Visualization System (AVS) [69],
SCIRun [48, 49], ConMan [23], VISSION [64] and SmartLink [65] are the seminal
works for dataflow visualization. In these systems, program modules are provided
as modular blocks to be interconnected to achieve tunable custom rendering. More
recently, Ross et al. [57] proposed a dataflow workspace called HIVE for exploring
multi-dimensional scaling algorithms. VisTrails [5] generates multi-view visualizations from the specification of a pipeline and provides the interface to ease
the manipulation and management of different pipelines. Voreen [41] provides a
dataflow environment for ray-casting-based volume rendering.
In the above systems, the advantages of dataflow are mostly exploited in the
application construction phase to perform a specific type of rendering or algorithm.
There is a lack of interaction support on the rendered result; e.g., in the case of volume rendering only view navigation is provided. Due to the relatively heavy rendering workload, explicit re-execution is often required to update visualizations. Additionally, modifying the pipelines often requires expert knowledge of the system modules. All these limitations make the above DFVSs less suitable for
visual data exploration, where information visualization and visual analytics are
more emphasized. VisFlow seeks to provide a dataflow approach with simpler usage
for interactive data analysis, rather than constructing rendering pipelines. In a similar spirit, Waser et al. [73] present an effective dataflow design that supports interactive
flood simulation steering, in which the parametric connections fit the particular
domain. Instead of focusing on a particular domain, VisFlow has a subset flow
model that is general-purpose and can be applied to any tabular data.
2.5 Subset Dataflow
Connecting programming modules (e.g. VTK in [5]) in a dataflow often results
in complex flow diagrams that are hard to read. One way to simplify the diagrams is to constrain the data types transmitted in the dataflow. One solution is to transmit only data subsets. The Waltz system [55, 56] passes subsets between system modules for volume segmentation and rendering. It produces a simpler tree-form
dataflow structure and a simplified visualization workflow. Dataflow systems with
subsets are relatively easier to follow and understand. They also naturally represent
sequential steps of data filtering. ExPlates [28] presents an information visualization
workflow system that supports interactive subset extraction by expanding the flow
diagram upon user selection. ExPlates also shows embedded visualizations in the
diagram nodes that react to diagram changes.
In this work we extend the subset dataflow concept to allow for interactive
tabular data analysis. The Waltz system essentially targets volume rendering, in which subset filtering/slicing and view displays are only for volume data. Compared with other information visualization dataflow systems like ExPlates, the VisFlow model makes a key distinction in that it does not generate derived data within the flow, so that subset brushing and linking are unambiguously defined. In contrast, ExPlates performs table join operations and thus does not sufficiently support visual linking of data items, as there is no straightforward way to represent data items by visual properties. With our model constraint we develop a corresponding set of node categories, data primitives and transmission rules for VisFlow, which are specifically tailored to assist the user in editing, comparing, and understanding tabular data subsets.
Furthermore, the VisFlow model is able to produce a simpler dataflow diagram.
We believe the proposed model is well suited for visual data exploration, and to the best of our knowledge the advantages of user interaction that works exclusively with tabular subsets in a dataflow framework have not yet been studied in the literature.
Chapter 3 gives a detailed introduction to the subset flow model employed by
VisFlow.
Chapter 3
Subset Flow
This chapter introduces the VisFlow dataflow model: the subset flow model.
We discuss the input data of the model, and then define the components in its
dataflow diagram, the interaction methods, and the data immutability requirement
of the model.
3.1 Input Tabular Data
The subset flow model aims at visualizing and analyzing tabular data. The tabular data used in the subset flow matches the entity-relationship database table definition. We consider each table row to be a meaningful data item. A group of data items forms a data subset.
Each table in the subset flow is regarded as a single, independent dataset. The
type of a table is determined by its column names and column data types (number,
string, date). Each data item corresponds to exactly one row of one input table.
Multiple tables of different types can be loaded into the subset flow at the same time to describe different aspects of the same set of data entities. We discuss how
heterogeneous data are processed in Section 3.6.
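The notion of a table type can be sketched in code. The following Python snippet is an illustrative sketch only, not VisFlow's implementation (VisFlow is a web framework); the column names and helper functions are hypothetical.

```python
# Sketch: a table's type is determined by its column names and column
# data types (number, string, date), as described for subset flow input.
from datetime import date

def infer_column_type(values):
    """Classify a column as 'number', 'string', or 'date'."""
    if all(isinstance(v, (int, float)) for v in values):
        return "number"
    if all(isinstance(v, date) for v in values):
        return "date"
    return "string"

def table_type(columns):
    """The type of a table: the tuple of (column name, column data type)."""
    return tuple(
        (name, infer_column_type(values)) for name, values in columns.items()
    )

cars = {"name": ["chevrolet", "buick"], "mpg": [15, 18]}
sales = {"name": ["chevrolet", "buick"], "sale": [3, 4]}
print(table_type(cars))  # (('name', 'string'), ('mpg', 'number'))
assert table_type(cars) != table_type(sales)  # two distinct table types
```

Two tables with identical column names and types would share the same type; here the `sale` column makes the second table a different type.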
3.2 Diagram Elements
This section introduces the diagram element definitions in the VisFlow dataflow.
Figure 3.1 illustrates the model concepts.
3.2.1 Nodes, Ports and Edges
A dataflow diagram in VisFlow consists of nodes and edges. A node is a VisFlow
module that loads, processes, filters or visualizes the data. Nodes expose input
and output ports that accept and transmit incoming and outgoing data. Input
and output ports are shown on the left and right side of a node respectively in
Figure 3.1. An edge is a directed connection from an output port to an input port.
A single port (shown as a dot) accepts at most one edge. A multiple port (shown as a triangle) has no restriction on the number of edges. Topologically, a subset flow diagram is
a directed acyclic graph (DAG).
Downflow and upflow. For the convenience of discussion, we define downflow(x)
and upflow(x) to be the set of nodes that are reachable from node x following the
edges, in their original and reverse directions respectively.
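The downflow/upflow definitions amount to plain graph reachability on the DAG. The sketch below is illustrative only; the node names and helper functions are hypothetical, not part of VisFlow.

```python
# Sketch: downflow(x) and upflow(x) are the nodes reachable from x by
# following edges in their original and reverse directions, respectively.
from collections import defaultdict

def reachable(edges, start):
    """Nodes reachable from `start` following directed edges (start excluded)."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    seen, stack = set(), [start]
    while stack:
        for v in adj[stack.pop()]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def downflow(edges, x):
    return reachable(edges, x)

def upflow(edges, x):
    # Reverse every edge, then reuse the same reachability.
    return reachable([(v, u) for u, v in edges], x)

edges = [("source", "vis1"), ("vis1", "filter"), ("filter", "vis2")]
assert downflow(edges, "vis1") == {"filter", "vis2"}
assert upflow(edges, "vis2") == {"source", "vis1", "filter"}
```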
3.2.2 Data Primitives: Subsets and Constants
VisFlow only transmits two types of primitive elements in its dataflow: subsets
and constants. Ports are categorized exclusively into subset ports (white background) that transmit subsets, and constant ports (gray background) that transmit constants. Ports must be type-matched when connected by edges.
Figure 3.1: Illustration of the key concepts of the VisFlow subset flow model. Node types are labeled in the diagram. The subsets are denoted by letter IDs within brackets. Assigned visual properties are shown in red font color. Transmitted constants are shown in gray.
The definitions of subsets and constants are as follows:
• Subsets. A subset transmitted by the dataflow is a collection of table rows
from some input table as defined previously. A subset virtually preserves all
the column information of its source table and the attribute values of its data item members, both of which can be retrieved by a node receiving the subset.
A subset is denoted by a pair of brackets containing the IDs of its member
items in Figure 3.1.
• Constants. We define constants to be an ordered list of constant values.
The values are either specified by user input directly or extracted from
attribute values of data items. Constants can be numbers, strings or dates.
In Figure 3.1, two string constants “amc” and “buick” are extracted from the
names of the user selected data items (a and b) at the constants extractor
(see Section 3.2.4).
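The two primitives can be sketched as simple data structures. This is an illustrative sketch under stated assumptions, not VisFlow's implementation; the class and variable names are hypothetical, and the table contents follow the Figure 3.1 example.

```python
# Sketch of the two subset-flow primitives: a subset (row references into
# an immutable source table) and constants (an ordered list of values).

class Subset:
    """A collection of row IDs referencing a single source table."""
    def __init__(self, table, item_ids):
        self.table = table             # the immutable source table
        self.item_ids = set(item_ids)  # member data items

    def attribute(self, item_id, column):
        # Attribute values remain retrievable from the source table.
        return self.table[item_id][column]

table = {
    "a": {"name": "amc", "mpg": 15},
    "b": {"name": "buick", "mpg": 18},
}
selection = Subset(table, ["a", "b"])

# Constants extracted from the selection, as a constants extractor would do:
constants = [table[i]["name"] for i in sorted(selection.item_ids)]
assert constants == ["amc", "buick"]
assert selection.attribute("a", "mpg") == 15
```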
3.2.3 Visual Properties
Data items in subsets can be associated with visual properties in the subset flow.
Those properties are transmitted together with the data items. Each data item
has a visual property object that maps a set of visual parameters to their assigned
values, e.g. { color: red, size: 5 }. VisFlow currently supports five types of visual
properties: color, size, border color, border width, and opacity. Visual properties are set and modified by the nodes along the flow, so the same data item may have different visual properties at different nodes. Intuitively, this process is like dyeing.
Once a data item passes through a visual editor, its visual properties may get
mutated. The data item thus carries the new visual properties in the downflow.
The visual editor (see Section 3.2.4) in Figure 3.1 assigns a red color property to the
data items in the subset {a, b}, so that a and b have red color throughout their
downflow. Visual properties are the only properties of data items that can be
mutated along the dataflow.
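The dyeing behavior can be sketched as follows. This is a minimal illustration, not VisFlow's code; the helper names are hypothetical.

```python
# Sketch: a visual editor outputs a copy of its input subset with the
# given visual-property assignment merged in, leaving upflow data
# untouched -- only visual properties ever mutate along the flow.
import copy

def visual_editor(subset, assignment):
    """Return a new copy of `subset` with `assignment` applied to each item."""
    out = copy.deepcopy(subset)
    for props in out.values():
        props.update(assignment)
    return out

upstream = {"a": {}, "b": {}}                    # no visual properties yet
dyed = visual_editor(upstream, {"color": "red"})

assert dyed == {"a": {"color": "red"}, "b": {"color": "red"}}
assert upstream == {"a": {}, "b": {}}            # the upflow copy is unchanged
```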
3.2.4 Node Categories
The nodes employed by a subset flow may be categorized based on their functionality. In particular, the subset flow employs a small number of node types to
achieve a relatively low learning overhead. We may categorize the nodes in the
subset flow into the following types:
• Data sources. Data sources load tabular data from data files. They do not
have input ports. A data source always produces a single output subset read
from a user-selected input table.
• Visualizations. Visualization nodes render the input subsets in visualization
metaphors. To facilitate the interactivity of dataflow visualizations, plots are
embedded in the visualization nodes by default. Interactive selection can be
made directly in an embedded visualization and sent to other nodes through
a dedicated selection port (shown as a square icon). In Figure 3.1, the user selects
a and b in Visualization 1. The selection port of Visualization 1 outputs
the selected subset {a, b}. Additionally, a visualization node also has a data pass-through forwarding port (a multiple port) that simply outputs its input. This redundancy is included to reduce diagram clutter. In
Figure 3.1 Visualization 1 passes through the entire input set {a, b, c, d} to
the downflow through the forwarding port. Otherwise the downflow nodes
must be connected to the data source to receive the entire input set.
Visualizations in the subset flow must always render the data items according
to their visual properties. Visualization 2 in Figure 3.1 renders data item
a in red color, as assigned by the upflow visual editor. Note that different
visualization metaphors may present the visual properties differently. While
a scatterplot renders the dots directly in the assigned colors of the data items,
a heatmap sets the font colors of row labels to the colors of the data items, so
as not to interfere with the color mapping used by the heatmap cells based
on the attribute values of the data items. The user is able to tune the visual
parameters of the visualizations through the user interface.
• Constants generators. Constants generators produce constants that are
used as filtering parameters. Constants can be either entered by the user,
or extracted from data. The constants extractor in Figure 3.1 extracts the
names of data items a and b (“amc” and “buick”), which are constants used
by the downflow attribute filters.
• Attribute filters. Filters examine attribute values of data items and perform
attribute filtering. Filtering parameters are constants and can be user specified, or retrieved from constants generators. The attribute filters in Figure 3.1 use
substring matching to find the items with names being either “amc” or “buick”.
An attribute filter may also be used to find the extrema in the data, or randomly sample an input subset on a user-selected sampling dimension.
• Visual editors. Visual editors assign visual properties to their input data items. The visual editor in Figure 3.1 assigns red color to its input subset
items a and b. A visual editor may also encode the attribute values of the input
subset using a mapped visual channel (e.g. a color scale in Figure 3.4). The
user may assign distinguishable visual representations to important subsets so
that they can be identified and linked across multiple plots. Visual properties
can be overwritten by downflow visual editors.
• Set operators. Set operator nodes take two or more subsets from the same table to produce a new subset using a mathematical set operation. The visual properties of the same data item are merged at a set operation node. The union
node in Figure 3.1 merges {a, b} with {a, b, c, d} and preserves the colors
of a and b. The last connected input subset has higher priority in case of a
visual property merge conflict.
When implementing the VisFlow framework, we designed a few example node
types for each category. It is possible to add new node types to the categories or
extend the categories on demand as long as the added nodes meet the subset flow
model requirements. Chapter 4 introduces our implementation of the framework.
For more details on node types and node usage, please refer to the VisFlow online
documentation in Appendix A.
3.3 Interactions
In the subset flow, data presented in the visualizations can be directly selected and extracted as subsets for further queries or manipulations. An important task
for visual analytics is to be able to perform brushing and linking:
• Brushing. The subset flow supports brushing by binding visual properties
to the subsets. A subset will be shown consistently according to its visual
properties. For example, the user selects items a and b from Visualization 1
in Figure 3.1 and passes them through the visual editor, which brushes the
items in red.
• Linking. The same subsets are automatically linked across nodes with the
same visual properties based on the nature of the subset flow. A downflow
visualization in Figure 3.1 (Visualization 2) receives the data items with
associated visual properties, so that a is highlighted in red. Note that
Visualization 2 does not necessarily need to apply the same visualization
metaphor as Visualization 1. Combined with the attribute filter, the example
flow diagram in Figure 3.1 effectively brushes and highlights the selected
items that satisfy 15 ≤ mpg ≤ 20.
3.4 Data Immutability
The name subset flow naturally comes from the fact that subsets are the major
primitives transmitted by the dataflow. Within a subset flow a data item from a
subset must correspond to one of the input table rows so that it can be uniquely
identified and given visual properties. Although most dataflow systems implicitly
support generating subsets, this correspondence constraint defines the unique
system behavior of the subset flow model, and makes it different from a general
dataflow capable of subset generation. More particularly, the VisFlow subset flow
does not produce derived data within the dataflow, such as a joined table. Such a model has two key advantages:
• Brushing and linking definition. By constraining the transmitted data
to be subsets, brushing and linking operations in the subset flow are defined
by the visual properties bound to the data items. The subset flow model
allows visual properties to be uniquely associated with data items throughout
the flow, and prevents the ambiguity of inheriting visual properties when new
types of data items are derived and generated, e.g. by a table join. It is not
straightforward to define the behavior of brushing and linking with derived
and mutated data. This is because there exist multiple possible results. For
example, when a data item colored red is joined with a heterogeneous data
item colored blue, the system cannot tell which color should be inherited.
Therefore it has to ask the user for a decision, which would consequently
increase the usage complexity and introduce confusion.
• User perception. With the complexity and confusion resulting from inheriting visual properties when derived data are involved, it is hard for the user to mentally trace, compare, and understand data subsets. The subset flow model
works exclusively with data subsets so as to improve the user’s understanding
of the data being visualized. Interactive queries are more intuitive as the user
will be able to tell which data items he/she is selecting, from which answers
to analytical questions can be derived. However, selecting derived table rows,
e.g. from a joined table, is less intuitive as there might exist arbitrary new
types of subsets with varying columns.
3.5 Data Schema
Intuitively, the visual properties associated with the data items can be regarded
as additional columns in which values are mutable. Values in these columns are
used differently from the original table columns to define how the data items are
presented in the visualizations. One may understand the subset flow model as follows: each node outputs a new copy of its input subset, and potentially mutates the visual property columns of the new copy. Each visualization node uses the visual properties in the subset copy it receives to render the subset.
When two subsets are merged, the same data items (identified by their row
indices from the original input table) possibly carry different visual properties.
Therefore the union set operator combines and possibly overwrites some of the visual properties carried by its input. The overwriting priority is defined by the connection
order of the incoming edges to the union node. Based on data immutability, only
subsets originating from the same input table can be merged.
Figure 3.2 illustrates the data schema concept behind the subset flow model and
how visual properties are merged. The first three columns shown in light gray are
from the original table and cannot be changed by any node in the subset flow. The
remaining three columns store the visual properties assigned. The four data items
pass through the first visual editor, which assigns red color and size 5. Visualization
1 shows the data at this point, in which all data items are presented in red color
and with size 5. Two data items a and b are then selected and they go through
a second visual editor, which assigns blue color to them. The visual property on
size is unchanged for these two data items. When these two data items are merged
back to the full set at the union node, their new blue color is kept because it has
higher priority. Visualization 2 shows the resulting subset. Every data item has size 5; a and b are in blue, while c and d retain the red color received from Visual Editor 1.
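The merge-priority rule can be sketched in a few lines. This is an illustrative sketch of the behavior described above, not VisFlow's code; the data follows the Figure 3.2 walkthrough and the function name is hypothetical.

```python
# Sketch: a union node merges subsets from the same table; when the same
# data item arrives on several inputs, the later-connected input's visual
# properties win the conflict.

def union(*inputs):
    """Union subsets; inputs later in the connection order win conflicts."""
    merged = {}
    for subset in inputs:  # connection order: earliest first
        for item_id, props in subset.items():
            merged.setdefault(item_id, {}).update(props)
    return merged

full_set = {i: {"color": "red", "size": 5} for i in "abcd"}
selected = {"a": {"color": "blue", "size": 5},
            "b": {"color": "blue", "size": 5}}

result = union(full_set, selected)  # `selected` connected last
assert result["a"]["color"] == "blue" and result["a"]["size"] == 5
assert result["c"]["color"] == "red"
```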
Figure 3.2: Data schema behind the subset flow model. Whenever a subset passes through a visual editor, virtually a new copy of the subset is generated with the visual properties possibly modified. Each visualization node renders its input subset according to the visual properties carried by the data items in the subset. Immutable original table columns are shown in light gray.
3.6 Heterogeneous Data
The subset flow supports heterogeneous data through links between heterogeneous tables, or through visualizations specifically designed for heterogeneous data:
3.6.1 Link Between Heterogeneous Tables
Heterogeneous data items can be linked based on a key column. A constants extractor is first used to extract the key values as constants from one table, and those keys are sent to an attribute filter to find, in the second table, the data items with these keys. The extracted keys relate heterogeneous data items based on their analytical meaning. The related data items can be further brushed and presented in linked visualization styles. In Figure 3.1, a second
data source loads a different type of table that can be filtered with the extracted
car names from the subset selected from the first table. In particular, the two tables describe different aspects of the same set of cars. By linking the two tables, one may find the sale numbers for the selected cars from the first table.
Extracting the key column values and then using those values to filter a second
table is a common way to relate heterogeneous data in the subset flow. For
convenience, we design the Linker node that combines the two steps in the VisFlow
framework. Figure 3.3 illustrates the usage of a linker. A linker internally extracts the key column values from the first table, and filters the second table based on
those keys. Using a linker makes the dataflow diagram simpler than using one
constants extractor and one attribute filter as shown in Figure 3.1.
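The linker's two internal steps can be sketched as follows. This is a minimal illustration of the described behavior, not VisFlow's code; the table contents mirror Figure 3.3 and the function name is hypothetical.

```python
# Sketch of a Linker node: extract key-column values from the selected
# subset of one table (the constants-extractor step), then filter a second
# table by those keys (the attribute-filter step).

def linker(selected, key_column, other_table):
    """Return the rows of `other_table` whose key matches the selection."""
    keys = {row[key_column] for row in selected}                    # extract
    return [row for row in other_table if row[key_column] in keys]  # filter

cars = [
    {"id": "a", "name": "chevrolet", "mpg": 15},
    {"id": "b", "name": "buick", "mpg": 18},
]
sales = [
    {"id": "x", "name": "chevrolet", "sale": 3},
    {"id": "y", "name": "buick", "sale": 4},
    {"id": "z", "name": "amc", "sale": 2},
]

linked = linker(cars, "name", sales)
assert [row["id"] for row in linked] == ["x", "y"]
```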
Figure 3.3: Linking heterogeneous tables using a linker.
3.6.2 Visualization for Heterogeneous Input
A visualization may directly render heterogeneous data. One example is a network visualization, in which both node and edge data are required.
Figure 3.4 illustrates a network (graph) visualization. A network visualization
accepts two input subsets, for the nodes and edges respectively. Nodes and edges
can be assigned different visual properties. In Figure 3.4 the node weights are
encoded by sizes, while the edge weights are encoded by colors from a red-green
color scale. The network renders the nodes and edges respectively according to their
visual properties. User selections and forwarding of nodes and edges are output
separately. Therefore a network node has four output ports.
Figure 3.4: Heterogeneous input example: a network visualization that takes two heterogeneous subsets as inputs, for nodes and edges respectively. There are visual property mappings of node weights to node sizes, and edge weights to edge colors. Sizes bound to the nodes are shown on top of the node IDs. Colors bound to the edges are denoted by the font colors of the edge IDs. The network correspondingly renders the nodes and edges, and has four outputs: selected nodes, forwarded nodes, selected edges, forwarded edges.
Figure 3.5: An example subset flow diagram implemented by the VisFlow framework. The user edits the dataflow diagram that corresponds to an interactive visualization web application shown in the VisMode dashboard. The model years of the user selected outliers in the scatterplot (b) are used to find all car models designed in those years (1981, 1982), which form a subset S that is visualized in three metaphors: a table for displaying row details (h), a histogram for horsepower distribution (i) and a heatmap for multi-dimensional visualization (j). The selected outliers are highlighted in red in the downflow of (b). The user selection in the parallel coordinates is brushed in blue and unified with S to be shown in (h), (i), (j). A heterogeneous table that contains the MDS coordinates of the cars is loaded in (k) and visualized in the MDS plot (o), with S being visually linked in yellow color among the other cars.
3.7 Diagram Example
Figure 3.5 provides a comprehensive example of composing a multi-view visualization environment using the subset flow model. The dataflow diagram and its
visualizations are rendered in the implemented VisFlow web framework. More
details on the framework user interface are introduced in Chapter 4.
The diagram uses the Auto MPG dataset1 (loaded by data source (a)) that consists of 392 cars (excluding cars with missing attributes) and their information in 9 columns, including name, mpg, displacement, etc.
A scatterplot (b) first shows the relation between the columns “displacement”
and “mpg”. Two outliers, in terms of the overall negative correlation, are identified
and selected. An interesting task could be to find all those cars that are produced
in the same years as the two outliers. We define this subset of cars to be S. A linker
(e) extracts the “model.year”s of the selected outliers and performs the query for S
by filtering the whole car collection. On the other hand, a parallel coordinates plot
(d) helps provide an overview of value distribution, in which lines are color encoded
(c) depending on the mpg values. The user selection in the parallel coordinates (cars having 5 cylinders) is brushed in blue (p) and unified (g) with S, while the outliers chosen above are brushed in red (f). Three visualizations, a table (h), a
histogram (i) and a heatmap (j) are used to render S along with the selection from
the parallel coordinates.
This example also includes a heterogeneous table (loaded by data source (k))
that contains the projected MDS coordinates of the cars, i.e. columns “mds x”,
“mds y”, and (car) “name”. The MDS coordinates are generated by the metric
SMACOF algorithm using the Minkowski Model in Euclidean distance [12], on all
1 http://archive.ics.uci.edu/ml/datasets/Auto+MPG
the columns with a maximum of 1,000 iterations.
An important task of studying an MDS plot is to identify the distribution of
a subset. Here we highlight the distribution of S in the MDS plot (q) by linking
heterogeneous data, as introduced in Section 3.6. The car names of S are extracted
and used to retrieve the corresponding subset with the MDS coordinates from the
MDS table (by linker (l)). Highlighting S in the MDS plot is then achieved by
assigning a yellow color (m) to S and unifying it (n) with the other cars from the
MDS table.
Recall that visualizations in the subset flow render the data items according to
their visual properties, though potentially in different ways. The histogram (i)
visualizes the “horsepower” distribution of the cars, with the number of highlighted
items shown proportionally in their bins. The heatmap (j), on the other hand,
renders the row labels in the data items’ colors, while its cells use a separate color
scale to encode the attribute values.
In summary, this flow diagram finds the cars that were produced in the same
years as the outliers selected in the scatterplot (b), as a set S. It visualizes
the distribution of S in an MDS plot. Meanwhile, additional cars selected from
the parallel coordinates are highlighted together with S for comparison, in the
visualizations (h), (i) and (j). This example demonstrates how brushing and linking
are achieved in the subset flow. It also shows how the subset flow works with
heterogeneous data.
Chapter 4
VisFlow Framework
Implementation
We develop the VisFlow framework that implements the subset flow model and
demonstrates its applications. We provide an online demo of the implemented
framework, along with comprehensive documentation about its usage. The source
code for VisFlow is available as a GitHub open source project. Appendix A lists the
URLs to access the online demo, documentation, and the implementation source
code. In this chapter we discuss the important aspects of our implementation. We
also introduce a few utility features we integrate into VisFlow that are not part of
the subset flow model but help provide a smoother user experience, including the
VisMode dashboard and diagram sharing.
[Figure: interface screenshot showing the diagram editing canvas, the node panel, the quick node panel, the node option panel, drag-and-drop node creation, and the system menus and tools.]
Figure 4.1: The VisFlow framework interface. Nodes are created in a drag-and-drop manner onto the infinitely large canvas. The node panels list the node types supported by the VisFlow framework. These node types do not include the extended subset flow extensions introduced in Chapter 5.
4.1 Overview
We implement the VisFlow framework that supports visual data exploration
based on the subset flow model. We design the framework interface to assist
efficient understanding and manipulation of the dataflow diagrams in a web browser.
Figure 4.1 gives an overview of the system interface. The flow diagram is drawn and
manipulated on a virtually infinite canvas. Nodes can be created, resized, and
re-positioned in an intuitive drag-and-drop manner. A node panel on the left guides
the user towards node creation, while a pop-out quick node panel activated by
context menu or keyboard shortcut appears around the mouse cursor to closely
follow the editing focus. Dragging from a port to another port or a node creates an
edge. VisFlow automatically selects the first available port that satisfies connection
constraints when an edge being created is dropped on a node. The node option
panel on the right allows the user to set node-specific parameters and options, such
as the table columns to render in a visualization, the color scale of a heatmap, the
value range of an attribute filter, etc. Visualizations are embedded into the rectangular
node areas on the canvas, in which interactive selections can be performed. Nodes
can be either shown in detail (e.g. Figure 3.5(b), (c), (d), (e), etc., in Diagram
Editing) or collapsed into icons to save screen space (e.g. Figure 3.5(h), (i), (j),
etc., in Diagram Editing). The navigation bar at the top of the page lists system
menus and system tools such as the VisMode toggle (Section 4.3) and the dataset
management dialog.
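The automatic port selection on edge drop can be sketched as follows (the interfaces here are hypothetical; VisFlow's actual connection constraints are richer):

```typescript
// Hypothetical sketch: when a dragged edge is dropped on a node, select the
// first input port that satisfies the connection constraints.
interface Port {
  id: string;
  isInput: boolean;
  // Connection constraint check, e.g. subset ports only accept subset ports.
  accepts(source: Port): boolean;
}

function firstAvailablePort(ports: Port[], source: Port): Port | undefined {
  return ports.find(port => port.isInput && port.accepts(source));
}
```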
4.2 System Implementation
In this section we discuss the important technical details of our implementation.
We introduce the application stack, and highlight the component inheritance
technique we apply to meet the need of HTML template rendering and node type
inheritance at the same time.
4.2.1 Application Stack
The VisFlow framework employs a client-server architecture like a typical
modern web application. The client is a Vue.js1 based single-page web application.
The VisFlow client handles all the dataflow computation, data visualization, and
user interactions in the user’s browser. Rendering is performed by D3 [7] in SVG
1https://vuejs.org
for the convenience of listening to element interactions. Since the set of node types
may grow substantially with the analysis needs, we apply an object-oriented code
architecture using Vue components (Section 4.2.3) and ES6 classes, so that more
node types can be added easily into the system using class inheritance.
The VisFlow server runs Express2 with Node.js3. The server is mainly responsible
for managing user logins and providing storage for dataflow diagrams and
user-uploaded datasets. The server may also be extended for more complicated
computation jobs. For example, in Chapter 6 we add a natural language interface
to VisFlow. The natural language queries are sent to the server, which relays them
to the backend semantic parser.
The details of our implementation can be found in the GitHub repository. The
current codebase contains around 30K lines of TypeScript4, Vue, HTML, and SCSS5
code. As of the completion of this dissertation, the codebase has gone through
four major revisions. The earlier codebase used jQuery6 and JavaScript with the
Closure Compiler7. We migrated the codebase to the new application stack and
tool selection for better code maintainability and readability.
4.2.2 Computation
VisFlow avoids repetitive execution and data storage redundancy to ensure
performance. A connected input port only makes a reference to the transmitted
subset or constants coming from its upflow output port. That is, though semantically
2 http://expressjs.com
3 https://nodejs.org
4 https://www.typescriptlang.org
5 https://sass-lang.com
6 https://jquery.com
7 https://github.com/google/closure-compiler
an output subset copy is created at each node (as discussed in Section 3.5), nodes
do not actually acquire copies of subsets where unnecessary. Our implementation
keeps a single copy of each table dataset in memory, and stores the row
indices of the data items as their IDs in the transmitted subsets, so that attribute
values are not duplicated. Though data items may have one-to-many relations with
their subsets, we only make copies of visual properties at the visual editors that
modify them, and store object references elsewhere. User operations
such as flow diagram editing, data item selection in visualizations, and filtering
parameter updates lead to reactive changes in the downflow nodes. We propagate
the changes through the flow in topological order, starting from the node where a
change occurs. The propagation stops at a node as soon as no change is detected
there, e.g. the output of a set intersection may remain the same after an item is
added to its input subset.
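The propagation scheme described in this section can be sketched as follows. This is a minimal illustration, not VisFlow's actual code: nodes expose an update() that reports whether their output changed, and changes travel in topological order, stopping wherever no change is detected.

```typescript
// Minimal sketch (not VisFlow's actual code) of propagating changes in
// topological order, stopping wherever no output change is detected.
interface FlowNode {
  id: string;
  targets: FlowNode[];
  update(): boolean; // recompute the output; return true if it changed
}

function propagate(start: FlowNode): string[] {
  // Collect the subgraph reachable from the changed node.
  const reachable = new Map<string, FlowNode>();
  const stack: FlowNode[] = [start];
  while (stack.length > 0) {
    const node = stack.pop()!;
    if (reachable.has(node.id)) continue;
    reachable.set(node.id, node);
    stack.push(...node.targets);
  }
  // Compute in-degrees within the reachable subgraph (Kahn's algorithm).
  const indegree = new Map<string, number>();
  for (const node of reachable.values()) {
    for (const t of node.targets) {
      indegree.set(t.id, (indegree.get(t.id) || 0) + 1);
    }
  }
  // Process nodes in topological order; a node recomputes only when some
  // upstream output actually changed ("dirty"), and marks its targets
  // dirty only when its own output changes in turn.
  const dirty = new Set<string>([start.id]);
  const updated: string[] = [];
  const queue: FlowNode[] = [start];
  while (queue.length > 0) {
    const node = queue.shift()!;
    if (dirty.has(node.id) && node.update()) {
      updated.push(node.id);
      node.targets.forEach(t => dirty.add(t.id));
    }
    for (const t of node.targets) {
      const remaining = indegree.get(t.id)! - 1;
      indegree.set(t.id, remaining);
      if (remaining === 0) queue.push(t);
    }
  }
  return updated; // nodes whose outputs changed, in topological order
}
```

Note that a node like a set intersection may be visited but not recompute its targets: when update() returns false, its downstream nodes stay clean and are skipped.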
4.2.3 Component Inheritance
The implementation of a subset flow model needs to create various node types
that have a large number of shared methods. These methods represent the shared
functionality between nodes, such as reading source datasets of subsets, manipulat-
ing subset visual properties, and notifying the system of output changes. Therefore
we use ES6 classes for the diagram elements to allow class inheritance. For exam-
ple, a Scatterplot node class inherits the Visualization base class, which further
inherits the SubsetNode base class. The Visualization class, as a base class for all
visualization node types, defines methods for handling interactively selected subsets
and their propagation.
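The inheritance chain can be sketched as follows (class and member names are simplified illustrations, not VisFlow's actual API):

```typescript
// Illustrative sketch of the inheritance chain described above.
class SubsetNode {
  // Shared functionality of all subset-flow nodes, e.g. notifying the
  // system that the node's output changed.
  protected outputChanged = false;

  protected notifyOutputChange(): void {
    this.outputChanged = true;
  }

  hasPendingOutput(): boolean {
    return this.outputChanged;
  }
}

class Visualization extends SubsetNode {
  // Base behavior of all visualizations: record an interactive selection
  // and propagate it as an output change.
  private selection: number[] = [];

  select(itemIds: number[]): void {
    this.selection = itemIds;
    this.notifyOutputChange();
  }

  selectedItems(): number[] {
    return this.selection;
  }
}

class Scatterplot extends Visualization {
  // Node-specific options, e.g. which columns to plot.
  xColumn = 'displacement';
  yColumn = 'mpg';
}
```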
To present the diagram elements in a web interface, we use Vue components
along with ES6 classes. We use vue-class-component8 and vue-property-decorator9 to
implement class inheritance. Additionally, we use vuex-class10 to connect all Vue
components to the global store of the system to manage the application state. Each
Vue component has an HTML or Vue template that defines how the component
should be presented on a webpage. In VisFlow, every diagram element such as a
node, edge, or port is implemented as its own component.
4.2.3.1 Template Injection
As seen in Figure 4.1, each node has a rectangular display area on the diagram
canvas, and is also associated with an option panel for tuning node settings.
Additionally, each node may have a context menu that is activated on right click.
The code architecture needs to allow each node to define its own display templates
for its sub-components. One may consider Vue slots11 to be an option to implement
this requirement. Yet we find that Vue slots do not serve our purpose well, because
Vue slots do not reflect class inheritance relations between an inheriting component
and a base component. When accessing this reference in a component with slots,
the reference points to an instance of the base component. However, we want the
this reference to point to the instance of the inheriting class instead. Essentially,
the relation between a component and its child component does not reflect the
relation between an inheriting class and its base class.
To the best of our knowledge, no modern frontend framework natively supports
DOM templates together with class inheritance. This is because a typical web
application (e.g. an online shopping site) may not have node types as flexible as those
8 https://github.com/vuejs/vue-class-component
9 https://github.com/kaorun343/vue-property-decorator
10 https://github.com/ktsn/vuex-class
11 https://vuejs.org/v2/guide/components-slots.html
in VisFlow. To overcome the limitation, we utilize the template option that can be
passed to a Vue component on component registration. The template option allows
a string to be used as the Vue template, which is compiled at the client side12. We
thus create a general template for the base class with placeholders, and write our
own injection method to replace the placeholders with the templates of the inheriting
classes. This ensures that when a Vue component is registered, the template it
compiles has the full content of the inheriting class. Listing 4.1 shows a simplified
snippet from the template of the base Node component of VisFlow. The template
has three placeholders <!-- node-content -->, <!-- context-menu -->, and
<!-- option-panel -->. <!-- node-content --> represents the node display
on the diagram editing canvas. <!-- context-menu --> lists the right-click menu
options. <!-- option-panel --> defines the UI elements in the node option panel
that should be displayed when a node is clicked and activated. These placeholders
will be replaced by the content of the inheriting nodes at runtime. Because the Vue
component template is not finalized until we inject the placeholder content, the
compiler must be shipped within the final VisFlow bundle.
Listing 4.1: Base Node Component Template

<div>
  <div ref="content" v-show="isContentVisible"
       :class="['content', { disabled: !isContentVisible }]"
       :style="getContentStyles()">
    <!-- node-content -->
  </div>
  <context-menu ref="contextMenu">
    <!-- context-menu -->
  </context-menu>
  <option-panel ref="optionPanel" v-if="isActive">
    <!-- option-panel -->
  </option-panel>
</div>

12 https://vuejs.org/v2/guide/installation.html#Runtime-Compiler-vs-Runtime-only
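Our injection method then amounts to a string replacement of these placeholders before the template is handed to the Vue compiler. A simplified sketch (the helper names are illustrative, not VisFlow's actual code):

```typescript
// Hypothetical sketch of template injection: replace placeholders in the
// base Node template with the inheriting component's template fragments.
const PLACEHOLDERS = ['node-content', 'context-menu', 'option-panel'] as const;

function injectTemplate(
  baseTemplate: string,
  fragments: Partial<Record<typeof PLACEHOLDERS[number], string>>,
): string {
  let template = baseTemplate;
  for (const key of PLACEHOLDERS) {
    // Replace "<!-- node-content -->" etc. with the subclass's markup,
    // or with an empty string when the subclass provides none.
    template = template.replace(`<!-- ${key} -->`, fragments[key] || '');
  }
  return template;
}
```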
Many VisFlow Vue components are mounted dynamically rather than statically.
This is because the dataflow diagram may change constantly over time. We manually
mount all the dataflow diagram elements. For example, when a user uses drag-
and-drop interaction to create a node, we create the node instance and mount its
DOM element onto the diagram editing canvas. For static UI elements such as the
navigation bar, we use a static Vue template as in a typical Vue application. Some
UI elements, such as the buttons and input boxes in the node option panel, have a
static nature, but they have to be dynamically mounted as well because they belong
to nodes that are dynamically created and deleted. We use placeholder components
(Section 4.2.3.2) to ensure that all the dynamically mounted components have a
correct page layout.
4.2.3.2 Placeholder Component
Sometimes a child component may need to be presented globally. For example,
the pop-up option panel of a node on the right of the screen is displayed above
everything on the dataflow canvas. Besides, a node may occasionally need to display
a pop-up modal dialog that asks the user for confirmation, in which case the modal
needs to be above all other elements currently on the page. However, since the
child component nests inside its parent component, and the parent component may
have a fixed position within the overall page layout, it is difficult to use attributes
like the CSS z-index to tweak the arrangement and make the child component
appear on top. For example, a node on the diagram editing canvas is nested inside
the canvas container, and the canvas container is a sibling of the navigation bar at
the top of the system. Therefore, it is impossible to make a child component of a
node appear over the navigation bar, as the navigation bar must be above the
canvas container in the overall page layout. We create a placeholder component
to resolve the issue. A placeholder component, e.g. a global node option panel or
a global modal dialog, is empty when it is inactive, and awaits its content to be
mounted. The placeholder component may provide callbacks, so that the component
that initiates the interface change can update according to the user interaction.
The first method shown in Listing 4.2 is the activate() method from the base
Node class. This method is called when a node is clicked, at which time its option
panel should pop up. The method retrieves the option panel content (as an HTML
element) of the node using the Vue element reference (this.$refs.optionPanel). It
passes the option panel content to a store mutation method named mountOptionPanel(),
which appends the option panel content to the option panel placeholder component
named optionPanelMount. A child component can be unmounted when the
component that initiates the interface change finishes its task, e.g. when the user
deactivates the node the node option panel can be unmounted and hidden.
Listing 4.2: Mounting Option Panel

// in components/node/node.ts
public activate() {
  this.isActive = true;
  this.$nextTick(() => this.mountOptionPanel(this.$refs.optionPanel as Vue));
}

// in store/panels/index.ts
mountOptionPanel(state: PanelsState, panel: OptionPanel) {
  state.optionPanelMount.appendChild(panel.$el);
}
4.3 VisMode Dashboard
We add a utility feature called the Visualization Mode (VisMode) to our im-
plemented framework to enhance its usability by providing a dashboard view of
the dataflow outcome. The VisMode dashboard hides diagram edges and ports,
and presents only a user-selected set of nodes. The sizes and positions of the
nodes in the VisMode can be configured separately from the diagram editing mode.
Figure 3.5 illustrates the correspondence between a dataflow diagram in the diagram
editing mode and its VisMode dashboard view. Views labeled in blue in the VisMode
correspond to the nodes labeled with the same letters in the diagram editing mode.
The visualizations are re-arranged in a more compact layout to present a cleaner
interface for data analysis, much like an off-the-shelf application or visualization
dashboard. We provide smooth transitions of the diagram elements when the
VisMode is toggled, to help the user perceive the node correspondence between the
two modes. The VisMode dashboard keeps the user away from diagram details that
are irrelevant to data exploration, so that the user may focus on the visualized
results and data analyses. Having VisMode also makes it easier
to share and present analysis results produced by the dataflow diagram.
4.4 Reproducibility
Each diagram in VisFlow corresponds to a unique diagram ID. A share link can
be generated for each diagram. With a share link, a diagram created by one user
can be loaded and viewed by other users to share the results of an analysis. VisFlow
preserves the diagram interaction states (i.e. user selections and navigation) when
a saved diagram is loaded, so that the shared results include the brushed and
highlighted data which are essential to presentation and answering questions about
the data. In addition to result sharing, having reproducible diagrams helps result
validation. Based on a shared diagram, a different user may reproduce the entire
analysis, make changes and extensions, or correct errors if necessary.
A diagram can be shared using the VisMode dashboard. The audience of the
dashboard does not need to look into the internal dataflow details and can focus
on the interactive visualizations and data exploration.
We implement a full history stack for all the operations performed in the system.
The history stack can be serialized into diagram logs and saved on the server. An
authorized user, e.g. a collaborator or a system admin, may view the complete logs
of a diagram to see how an analysis arrived at its conclusions.
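The history stack described above can be sketched as a serializable list of operation events (a minimal sketch; the event shape and names are assumptions, not VisFlow's actual log format):

```typescript
// Hypothetical sketch of a serializable history stack of diagram operations.
interface HistoryEvent {
  type: string;                      // e.g. 'createNode', 'createEdge'
  payload: Record<string, unknown>;  // operation-specific details
  timestamp: number;
}

class HistoryStack {
  private events: HistoryEvent[] = [];

  push(type: string, payload: Record<string, unknown>): void {
    this.events.push({ type, payload, timestamp: Date.now() });
  }

  // Serialize into a diagram log that can be saved on the server.
  serialize(): string {
    return JSON.stringify(this.events);
  }

  static deserialize(log: string): HistoryStack {
    const stack = new HistoryStack();
    stack.events = JSON.parse(log) as HistoryEvent[];
    return stack;
  }

  get length(): number {
    return this.events.length;
  }
}
```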
4.5 Case Studies
We evaluate the VisFlow framework and its subset flow model by case studies
on real-world datasets. We work with domain experts and apply the VisFlow
framework to solve their practical problems. In this dissertation we discuss two
case studies.
4.5.1 Gene Regulatory Network Analysis
This case study follows our earlier research [80], which yielded a domain application
that assists computational biologists in analyzing gene regulatory networks along
with their ground-truth lab experiment results. Understanding the regulations
between genes, e.g. the presence of a gene repressing or activating another, is an
important goal in computational biology. A gene regulatory network models the
regulations between genes, and is derived from mathematical models for predicting
potential regulations. One of the example networks, Th17, supports immune cell
fate specification and is computed from a lineage differentiation model system [8].
In this case study, we show that VisFlow is able to generate a gene regulatory
network analysis environment in a small number of steps. We worked with a group
of computational biologists who perform regulatory network analysis as one of their
research tasks. The biologists involved have moderate experience with programming,
and usually write scripts for the same set of tasks to adapt their network data to
other existing visualization tools. Figure 4.2(i) gives one resulting dataflow
diagram, and its corresponding VisMode dashboard is shown in Figure 4.2(ii).
A first step of the analysis is to visualize the regulatory network, which is a
directed weighted graph in which nodes are genes and edges are sourced from master
regulators called Transcription Factors (TFs). The weight of an edge denotes the
confidence score of a regulation prediction. The network visualization node (c) takes
two input tables, one for nodes and the other for the edges, and shows the network
in a force-directed layout provided by D3. Edge weight encoding is achieved by a
property mapping node (a). As biologists typically explore the network bottom-up
by adding to the network known genes they are interested in, an attribute filter
with user input gene names is added (b) so that the network only shows the genes
[Figure: panel (i) labels the diagram nodes (a)–(g), the Nodes and Edges inputs, and the Expression Matrix data sources (with series transpose).]
Figure 4.2: Regulatory network analysis workflow and its corresponding VisMode dashboard generated in the VisFlow framework: (i) The dataflow diagram; (ii) The interactive VisMode visualization dashboard produced by the dataflow diagram.
specified by the biologists. The dotted section in Figure 4.2(i) shows the six diagram
nodes for a network visualization at the top-left of Figure 4.2(ii).
To allow the biologists to search for incident edges that are not currently shown
in the network for network expansion, an incident edge table is added to the
diagram: We add a linker (d) to retrieve the names of the user selected genes in
the network. We then add a filter to find those edges with a source or target gene
matching those selected gene names. This conveniently lists incident edges upon
gene selection in the network (bottom-left of Figure 4.2(ii)).
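The incident-edge query amounts to a simple filter over the edge table (a hypothetical sketch; the types and names are not the actual VisFlow node implementation):

```typescript
interface GeneEdge {
  source: string; // regulating transcription factor
  target: string; // regulated gene
  weight: number; // confidence score of the regulation prediction
}

// Keep the edges incident to any of the genes selected in the network.
function incidentEdges(edges: GeneEdge[], selected: Set<string>): GeneEdge[] {
  return edges.filter(e => selected.has(e.source) || selected.has(e.target));
}
```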
The analysis also includes a gene expression matrix as a supporting ground-
truth dataset. The matrix rows represent genes and columns represent experiment
conditions. Each matrix cell contains the gene’s responsiveness value under one
experiment condition. The visualization requirement is to render the expression
matrix as a heatmap, and additionally show selected gene profiles (rows of the
heatmap) in a line chart, ordered by the matrix columns. In this case, the experiment
conditions can be thought of as a series and we can apply the VisFlow line chart.
Lines are created by first grouping the series points by a grouping column “genes”,
and then rendering the lines along the series column “experiment conditions”. We
apply the series transpose (Section 5.2.3) to convert the input matrix to this format,
as the raw matrix table does not contain series points as data items. After series
transpose the new table has rows of type (gene, experiment condition, value), and
is applicable in the line chart. As rendering the heatmap requires the original
matrix, two data sources (e) have two outgoing branches, producing the original
and transposed matrices respectively. We filter the matrix rows using linkers (f)
and encode gene names using categorical color scales (g), which assign colors based
on the hash values of gene names to provide linked visualizations, i.e. the heatmap
row labels are linked with the rendered lines by their colors.
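The series transpose mentioned above (detailed in Section 5.2.3) turns one matrix row per gene into one row per (gene, condition) pair. A minimal sketch under assumed data shapes (the function name is illustrative):

```typescript
interface SeriesPoint {
  gene: string;
  condition: string;
  value: number;
}

// Convert an expression matrix (one row per gene, one column per
// experiment condition) into series points usable by the line chart.
function seriesTranspose(
  genes: string[],
  conditions: string[],
  matrix: number[][],
): SeriesPoint[] {
  const points: SeriesPoint[] = [];
  genes.forEach((gene, i) => {
    conditions.forEach((condition, j) => {
      points.push({ gene, condition, value: matrix[i][j] });
    });
  });
  return points;
}
```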
Figure 4.2(ii) shows the resulting regulatory network analysis application in
the VisMode, which supports the visualizations of the regulatory network and the
experiment matrix, as well as unidirectional linked queries from the network to the
matrix. These constitute a subset of the functionality of the Genotet system [80].
The biologists performing this experiment commented that VisFlow may save
biologists time and effort, as a query workflow can be composed directly that would
otherwise have to be created by writing custom scripts to parse and plot the data.
This case study shows that VisFlow, without requiring a programming background,
is able to assist domain data analysis with the composition of a relatively small
number of dataflow nodes.
4.5.2 Baseball Pitch Analysis
We present a second case study in the area of sports analytics for baseball
games, where we interacted with an expert in baseball analytics who is a statistician
working on applied problems in sports. Baseball is a highly data-driven sport. Since
pitching style is potentially the most interesting and important aspect of baseball
games, statistics for various metrics of pitches have been available for many decades.
The particular interest of the analyst is to understand the behavior and style of
pitchers by studying their movement data. We demonstrate how VisFlow is able to
adapt to this realistic scenario to quickly develop a tool that an expert can use for
the analysis, e.g. to identify how pitches differ in their deliveries.
One of the analysis tasks is to compare different pitchers across multiple games.
Our data comes from MLB.com Statcast13. The data is organized in many separate
13http://mlb.com/statcast
tables, including a list of team matchups, a player roster for each team, a list of
game plays (i.e. individual pitches) for each game, etc. For each game play, there
are measures of metrics for the pitch. The metrics include the spin rate and speed
of the ball, the time it took the pitcher to release the ball, pitcher extension, and
the result of the play, which can be a Ball, a Strike, or different types of Foul balls.
Despite the data being spread over multiple table files, in a small amount of
time the analyst can compose a VisFlow flow diagram that visualizes
the game plays (see Figure 4.3(iii)). The top-left table (a) lists the games described
by the data and allows the user to select one or more games to study, with each
game having around 300 pitches. The table in the middle (b) has each pitcher’s
name and player ID, from which the user may choose a list of players he/she is
interested in. Upon user selection of a set of players, the parallel coordinates at the
bottom (d) reactively displays the metrics statistics for the pitches of the selected
players. In Figure 4.3(iii), the user currently selects 1 game and 3 pitchers. Note
that selecting multiple games is also possible, in which case the presented plots will
show the data for multiple games, and the statistics are shown for all pitches in
those games.
In addition to the data described above, the Statcast system collects the pitch
movement data. Statcast optically tracks the players at 30 Hz, which gives us
unprecedented details on the players’ movements on the field (Figure 4.3(i)). The
optical tracking system is illustrated in Figure 4.3(ii). The tracking system uses a
coordinate system in which (0, 0) is at the home plate where the batter is; the y
axis points to the pitcher’s mound; the x axis is orthogonal to it in a right-handed
coordinate system. For each player, there is a sequence of 2D positions (xt, yt)
recorded at a series of timestamps t. The recorded coordinates for a player
approximate the pitcher’s center of mass during his movement. The data
contains 90 samples for each pitch, that is, (x1, y1), . . . , (x90, y90). The analysis of
pitcher movements typically focuses on the delivery styles, for which the first 3
seconds of the movements are of key importance. Since pitchers have negligible
movement in the x direction, and the pitcher’s mound is exactly y = 60.5 feet
from the home plate, viewing the movement data with y values in [55, 65]
allows the analyst to focus on the essential tracking records. With VisFlow
attribute filters, such visualization and filtering requirements can easily be
met by creating a few nodes in the diagram. Range filters are applied to
remove tracking records later than 3 seconds from the start of the delivery, as
well as records outside the y range of interest, [55, 65]. A line
chart visualization (Figure 4.3(iii)(c)) is used to render the series (t, yt). Players
can be encoded by categorical colors, so that the line chart renders the pitching
movements of the same player in a uniform color. Using VisFlow brushing and linking,
the user can easily relate the pitchers’ movements with the pitch statistics. For
any user selected pitch movements in the line chart (c), the parallel coordinates
plot (d) instantly highlights the metrics statistics corresponding to the selected
pitches. The analyst can therefore easily observe the speeds, velocities, and results
of selected pitchers’ movements. Such interactive queries greatly help the analyst
derive relations between the delivery styles and the players’ performance.
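The range filtering described above can be sketched as a plain predicate over the tracking samples (a hypothetical illustration, not the VisFlow node implementation):

```typescript
interface TrackSample {
  t: number; // seconds since the start of the delivery
  y: number; // distance from home plate, in feet
}

// Keep the first 3 seconds of a delivery, restricted to y in [55, 65] feet.
function filterDelivery(samples: TrackSample[]): TrackSample[] {
  return samples.filter(s => s.t <= 3 && s.y >= 55 && s.y <= 65);
}
```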
[Figure: panel (iii) is annotated with views (a)–(d) and a user selection; panel (iv) labels pitchers 1–12 and marks wind-up, stretch, and diverging delivery styles.]
Figure 4.3: Applying VisFlow to baseball pitch data analysis. (i) The pitching movement; (ii) The Statcast coordinate system illustration (from [32]); (iii) The analysis environment for the baseball pitch analysis generated by VisFlow; (iv) Plots of pitching movements of 12 players. A categorical color scale is applied to render each player’s pitches in a uniform color.
The constructed analysis environment may help derive baseball findings. For
instance, it is easy to recognize the different types of deliveries, i.e. wind-up (the y
distance increases and decreases alternately in the beginning) vs. stretch (the y
distance remains stable in the beginning). It is also possible to visually compare
the movements of different players. It can be observed that some players have fairly
uniform movements, while others’ movements vary widely. Figure 4.3(iv) shows
the pitches of 12 pitchers, numbered from 1 to 12. The pitchers in the first row
(numbers 1–6) have clearly two movement patterns, as the movements diverge in the
middle. The pitchers in the second row have a single movement pattern. Among
the second row, the pitchers on the left half (numbers 7–9) use wind-up styles, and
the others (numbers 10–12) use a stretch delivery. Other observations can also be
made, e.g. the variation of pitcher 10’s starting y distance is significantly larger
than that of the other players who use stretch deliveries exclusively.
The data involved in this case study has multiple heterogeneous tables. It also
has a large volume of recorded movement positions. In the diagram developed
for the case study, 20 games are loaded, which contain 100K+ data points for
the pitcher movements and other associated records, from 5 heterogeneous tables.
Around 30 pitches and 500 movement points are to be rendered for each player per
game. Data sampling (supported by the sampling mode of an attribute filter) may
help scale up to higher data volume, e.g. spanning a larger number of games. The
diagram contains fewer than 30 nodes. The VisMode of the application generated
from the diagram is shown in Figure 4.3(iii). This case study shows that, with the
flexibility provided by VisFlow, it is possible to analyze such complex domain
data directly and interactively.
Chapter 5
Extended Subset Flow
Though the subset flow has its interactivity advantages, the drawback of re-
quiring data immutability limits the analytical capability of the framework. We
therefore seek to add more data processing power to the dataflow while preserving
the benefits introduced by the subset flow. In this chapter, we discuss how to
extend the subset flow model so that data mutation can be performed in the system
without compromising much of the subset flow model. We explore the design
variations of the node types in the VisFlow framework, and demonstrate several
extension node types that may enhance the capability and usability of the system.
This chapter is organized as follows. First, we introduce the concept of the
extended subset flow model and how data mutation is handled with respect to data
mutation boundaries (Section 5.1). We then describe several new types of nodes
that can be added to the extended subset flow to enhance system capability.
Finally, we demonstrate the improvement of the extended model using a few case
studies that employ the new node types.
5.1 Extended Model
We loosen the data immutability requirement of the original subset flow
model, and define the resulting model as the extended subset flow model. In
the new model, nodes are allowed to mutate the input data, resulting in new tabular
data being generated within the dataflow. Because of the ambiguity introduced
by data mutation (discussed in Section 3.4), visual properties cannot be inherited
when nodes mutate the flow data. However, it is possible to identify where the
data get changed and employ the subset flow on each group of nodes among which
the data remain intact. Conceptually, a node that mutates the tabular data is
considered to create a data mutation boundary. The original subset flow model
applies to the nodes within the same data mutation boundary.
Figure 5.1 exemplifies this concept. The two nodes that mutate their input data
are shown with black borders. These two nodes create the data mutation boundaries.
The black node at the top performs data aggregation and generates the average mpg
value per origin as a new table (with red border). The black node near the bottom
is a table join node that joins the cars from “car.csv” with their MDS coordinates,
so that their MDS plot can be shown (with green border). The nodes in the orange
branch in the middle of the diagram all receive input that comes from “car.csv”.
Those data items remain intact and are within the same data mutation boundary.
The subset flow may unambiguously compute the visual properties between those
nodes. As seen in the figure, the system may use the border colors of the nodes to
hint to the user where the data are mutated, so that the user can be aware of where
brushing and linking can be used to track subsets.
Figure 5.1: Example of data mutation boundaries created in the extended subset flow model. The two nodes with black borders are data mutating nodes. The one at the top performs mpg aggregation for each car origin. The one at the bottom joins the two input tables. The system uses node borders to help the user identify where the data get changed, and the node groups in which the original subset flow applies.
5.2 Node Type Extensions
As data mutation is supported in the extended subset flow model, the dataflow
becomes general enough to perform any computation. In our implementation of the
extended subset flow model, we have added a list of new node types, including table
join, data aggregation, clustering, etc. Although theoretically we may add any type
of node, we are most interested in those node types that are most effective in boosting
the analytical capability of the subset flow DFVS. In this section, we introduce
three representative node types: a generalized script editor, a data reservoir that
addresses the limitation of acyclic dataflow, and a series transpose that converts
a column-major series table to a row-major series table.
5.2.1 Script Editor
Within the extended subset flow model, we design a script editor node that
allows the user to write JavaScript code to edit and produce data. Theoretically any
node type can be implemented from scratch using a script editor, which essentially
represents the most general possibility of data mutation. The script editor expects
the user to write a JavaScript method that reads the input table(s) of the script
editor and outputs a table. An input table is described by its rows and columns.
The user-written method is expected to return a table in the same format.
Listing 5.1 shows an example code snippet the user may write in the script editor
to process the data. In this example, the code drops the last column of the data
and finds all the rows whose first attribute value starts with “chevrolet”.
Listing 5.1: Script Editor Code Example

/**
 * @typedef {{
 *   columns: string[],
 *   rows: Array<Array<number | string>>
 * }} Table
 * @param {Table | Table[] | undefined} input
 * @param {HTMLElement | undefined} content
 * @param {object | undefined} state
 * @returns {Table}
 */
(input, content, state) => {
  const { columns, rows } = input;
  // optional: modify node display HTML
  // optional: modify node state
  return {
    columns: columns.slice(0, columns.length - 1),
    rows: rows
      .filter(row => row[0].match(/^chevrolet/i))
      .map(row => row.slice(0, row.length - 1)),
  };
};
The number of input ports of a script editor is configurable in the UI. If the
script editor should receive multiple input tables, the first argument passed to the
user-written method will be an array of Table (i.e. Table[] in JSDoc).
The script editor allows the user to generate derived data using JavaScript. It
comes in handy when the user wants to make small changes to the data during
data exploration. For example, if the user notices some outliers in the data with a
specific pattern, she may remove them using a simple JavaScript filter in the script
editor.
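As a sketch of this use case, the following user-written method (in the table format of Listing 5.1) drops the rows whose “mpg” value lies more than three standard deviations from the column mean. The column name and the three-sigma threshold are illustrative assumptions, not part of VisFlow itself.

```javascript
// Sketch of a user-written script editor method that removes outliers:
// it drops rows whose "mpg" value is more than three standard deviations
// from the mean. The table shape ({ columns, rows }) follows Listing 5.1;
// the column name "mpg" and the threshold are illustrative assumptions.
const removeOutliers = (input) => {
  const { columns, rows } = input;
  const idx = columns.indexOf("mpg");
  const values = rows.map((row) => row[idx]);
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance =
    values.reduce((a, b) => a + (b - mean) ** 2, 0) / values.length;
  const std = Math.sqrt(variance);
  return {
    columns,
    rows: rows.filter((row) => Math.abs(row[idx] - mean) <= 3 * std),
  };
};
```

The method returns a table in the same format as its input, so downstream nodes can consume the filtered subset directly.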
Additionally, the script editor allows the user to control its HTML display.
When the script editor rendering option is enabled, a second argument content is
passed to the method, which represents the root of the display DOM tree of the
script editor on the dataflow canvas. The user may manipulate the DOM elements
using JavaScript DOM manipulation. We also make jQuery and D3 available within
the method for the convenience of DOM manipulation and data visualization. That
is, it is possible to access the $ jQuery object and the d3 namespace within the
user-written method when script editor rendering is enabled.
The last optional argument state allows the script execution to be stateful.
This is useful when the script editor needs to perform incremental computation.
For example, if the script editor receives input data from a streaming source and
produces output for a line chart that visualizes the stream, it may keep a window
of the streamed data to be shown in the line chart. The state object can thus be
used to store the previous values in the window. We show an example of such a
stream visualization in a model training case study in Section 5.3.3, in which a
stateful script editor is used to produce a visualization of model metric changes
over training iterations.
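A minimal sketch of such a stateful method follows; the state.window field and the window size are illustrative assumptions, since the layout of the state object is up to the user.

```javascript
// Sketch of a stateful script editor method keeping a sliding window of
// the most recent streamed rows. `state` persists between executions, as
// described in the text; the window size and the state.window field are
// illustrative assumptions.
const WINDOW_SIZE = 50;

const windowedStream = (input, content, state) => {
  if (!state.window) {
    state.window = [];
  }
  // Append the newly arrived rows, then trim to the window size.
  state.window.push(...input.rows);
  if (state.window.length > WINDOW_SIZE) {
    state.window = state.window.slice(state.window.length - WINDOW_SIZE);
  }
  return { columns: input.columns, rows: state.window };
};
```

Each execution outputs the current window, so a downstream line chart always shows the latest portion of the stream.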
The code written in the script editor is executed in a JavaScript closure, so that
it does not have access to other variables and data in the wrapper environment.
This prevents the user code from accidentally modifying the system behavior.
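One way such scope isolation might be sketched is to evaluate the user code with the Function constructor, which closes over the global scope rather than the wrapper's local variables. This is an illustrative assumption, not VisFlow's actual implementation, and it is scope isolation rather than a full security sandbox.

```javascript
// Sketch of evaluating user code in its own function scope so that it
// cannot reach the local variables of the wrapper environment. This is
// an illustrative assumption about the implementation, not VisFlow's
// actual sandboxing code; functions created via the Function constructor
// still see globals, so this isolates scope, not security.
const runUserScript = (code, input, contentEl, state) => {
  // The Function constructor creates the user method in the global
  // scope: locals of runUserScript are not visible inside it.
  const factory = new Function(`return (${code});`);
  const userMethod = factory();
  return userMethod(input, contentEl, state);
};
```

A wrapper like this can also inject allowed utilities (e.g. the jQuery `$` object and the `d3` namespace mentioned in the text) as extra parameters of the constructed function.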
5.2.2 Data Reservoir
A key limitation of the subset flow, shared by other DFVSs that use acyclic
dataflow diagrams, is that there is no way to perform iterative changes on the same
node. However, such iterative changes may be required by certain types of data
analysis. For example, in a bottom-up network exploration such as the case study
discussed in Section 4.5.1, the user may want to repeatedly add incident edges
to the network being visualized, so that the network is iteratively expanded. To
implement such an iterative addition of edges in a subset flow, VisFlow has to
create multiple layers of linkers, filters, and set operators, under the limitation
that each layer can only represent one iteration of edge addition (Figure 5.2). This
limitation would render iterative network expansion infeasible in VisFlow if the
number of iterations is unknown or too large.
To address the limitation, we design a data reservoir node in the extended subset
flow. The data reservoir is able to keep its input data: it holds all the changes
Figure 5.2: Limitation of an acyclic dataflow diagram. One layer of nodes has to be created for each iteration of network expansion.
to its input and does not update its downflow nodes reactively. Instead, the user
must explicitly release the changes. Figure 5.3 illustrates the data reservoir in the
network expansion use case. The downflow of the network visualization generates
a new set of edges that is sent to the data reservoir. The data reservoir receives
this edge set but does not immediately update its output, which is connected in a
cycle to the input of the network visualization. When the user releases the changes
(by pressing a button), the new edges get merged into the previous input edges
of the network visualization. Consequently the network gets expanded with the
newly added edges from the downflow. Each release corresponds to one iteration of
network expansion.
The data reservoir presents one possible solution to overcome the limitation of
an acyclic dataflow diagram. It has its own unique characteristics. On one hand,
it meets the subset flow requirement by producing a copy of its input subset. On
the other hand, unlike most of the original subset flow nodes, the data reservoir
is a stateful node (like a stateful script editor introduced in Section 5.2.1), as it
remembers its last released input subset. Such stateful nodes do not appear in the
original subset flow design. The advantage of introducing the data reservoir is that
it enables iterative data exploration. But at the same time it may impact the user’s
Figure 5.3: The data reservoir holds all the changes to the edges. When the user releases the changes, those edges are merged into the upflow edges so that the network visualization may include the new edges.
understanding of the dataflow, and make subset tracing and identification more
complicated. The user needs to be aware that a data reservoir keeps outputting a
subset that it received and released earlier.
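The behavior described above can be sketched as follows. The setInput/release method names and the minimal node interface are illustrative assumptions, not VisFlow's actual API.

```javascript
// Sketch of a data reservoir node: it records new input without
// propagating it, and only copies the pending table to its output when
// the user explicitly releases the changes. Method names and the node
// interface are illustrative assumptions, not VisFlow's actual API.
class DataReservoir {
  constructor() {
    this.pending = null; // latest input, held back
    this.output = { columns: [], rows: [] }; // last released subset
  }

  // Receive new input; downflow nodes are NOT updated reactively.
  setInput(table) {
    this.pending = table;
  }

  // Explicit user action (e.g. a button press) that releases the held
  // changes to the output, triggering one iteration downstream.
  release() {
    if (this.pending !== null) {
      this.output = {
        columns: this.pending.columns.slice(),
        rows: this.pending.rows.map((row) => row.slice()),
      };
    }
    return this.output;
  }
}
```

In the network expansion use case, the released output would be unioned with the previous edges upstream of the network visualization, closing the cycle one iteration at a time.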
5.2.3 Series Transpose
A tabular dataset may sometimes contain series information over a set of
columns. Table 5.1(a) shows a few entries from the “SP.DYN.LE00.IN” indicator
report from the World Bank Data repository1. The data contain a sequence of
columns that represent an ordered series, i.e. the index values in each year. In this
case, each data point in the series is of analytical interest. A data point in the
series represents a new type of data item in row-major order that is different from
the original table rows, which hold column-major series. Within the extended subset
flow, we provide a series transpose node that helps analyze series data in the format
of one series point per row. The series transpose takes one key column and a list of
series columns, and transforms the input table into rows of series points. Column names
1 http://databank.worldbank.org
Table 5.1: Series transpose example that converts the column-major series in Table (a) into the row-major series in Table (b) based on the key column “Country” and the series columns of the years. The cell values in Table (a) are stored in the third column of Table (b). Table (b) has 9 rows, not all of which are shown.
are written as attribute values (middle column in Table 5.1(b)), and the original
table values are stored in a third column. Using “Country” as the key column
and the years as the series columns, series transpose produces Table 5.1(b) from
Table 5.1(a). Series transpose is a data mutating operation. It provides a utility for
the user’s convenience, so that a table like Table 5.1(a) with column-major series
can be directly visualized by a VisFlow line chart that expects row-major series
data points.
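The transformation can be sketched as a small JavaScript function over the table format of Listing 5.1. The output column names “series” and “value” are illustrative assumptions.

```javascript
// Sketch of the series transpose: given a key column and a list of
// series columns, emit one output row per (key, series point). The
// output column names "series" and "value" are illustrative assumptions.
const seriesTranspose = (input, keyColumn, seriesColumns) => {
  const keyIdx = input.columns.indexOf(keyColumn);
  const rows = [];
  for (const row of input.rows) {
    for (const col of seriesColumns) {
      // Column name becomes an attribute value; the cell becomes the
      // third column, as in Table 5.1(b).
      rows.push([row[keyIdx], col, row[input.columns.indexOf(col)]]);
    }
  }
  return { columns: [keyColumn, "series", "value"], rows };
};
```

A table with k rows and n series columns thus produces k × n series-point rows, matching the 3 × 3 = 9 rows of Table 5.1(b).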
5.3 Case Studies
We exemplify the application of the extended subset flow model using three
case studies. We demonstrate that the interactivity benefits of the subset flow may
apply in each group of nodes within the same data mutation boundary. The mixed
subset and non-subset flow can be useful for many data analysis tasks.
5.3.1 Evacuation Dataset Visualization
In this case study we employ the script editor, combined with another experimental
subset flow node type, the series player, to visualize movement data over time.
The series player is technically an attribute filter that treats its input data as a time
series. In a time series table, there is one time column, and each data entity (such
as a person) has one row for each distinct time value. The series player performs
attribute filtering and allows the rows with one time value to pass at a time. The
user may choose to advance the time value at different speeds to replay the time
series and review the activities of the data entities over time.
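The series player's filtering behavior can be sketched as follows; the class interface and method names are illustrative assumptions.

```javascript
// Sketch of the series player as an attribute filter: it passes only the
// rows whose time column equals the current timestamp, and advancing
// moves to the next distinct time value. The class interface and method
// names are illustrative assumptions.
class SeriesPlayer {
  constructor(table, timeColumn) {
    this.table = table;
    this.timeIdx = table.columns.indexOf(timeColumn);
    // Distinct time values in ascending order.
    this.times = [...new Set(table.rows.map((r) => r[this.timeIdx]))].sort(
      (a, b) => a - b
    );
    this.cursor = 0;
  }

  // Rows at the current timestamp (e.g. one row per tracked person).
  currentRows() {
    const t = this.times[this.cursor];
    return this.table.rows.filter((r) => r[this.timeIdx] === t);
  }

  // Advance to the next timestamp, if any.
  advance() {
    if (this.cursor < this.times.length - 1) this.cursor += 1;
  }
}
```

Calling advance() on a timer at different rates corresponds to replaying the series at different speeds.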
In this study we look at the building evacuation traces from Challenge 4,
VAST Challenge 20082. The dataset describes a department of health that was
involved in conflicts with local religious groups. A bombing incident happened at
the department of health building, which could be related to the religious group
supporters. The dataset includes a floor plan of the building, and the locations
of the employees and visitors of the building during the bombing incident. Each
person is assumed to carry a badge that tracks his/her location. The goal of the
analysis is to visualize the evacuation when the bombing happened, and identify
any casualties along with potential suspects and witnesses to the event.
We may create a script editor to visualize the floor plan using JavaScript canvas
drawing. Figure 5.4 shows the JavaScript snippet written in the script editor. The
script renders the building floor plan by reading the building data and using an
HTML canvas to draw the walls. Alongside the floor plan we use a scatterplot to
show the locations of the persons in the building. By overlaying the scatterplot
on the floor plan, we are able to visualize where each person is located.
2 http://www.cs.umd.edu/hcil/VASTchallenge08/
Figure 5.4: A snapshot of the JavaScript written in a script editor to render the floor plan of the evacuation data. The script manipulates the DOM tree that is rooted at content.
A series player is added to the input of the scatterplot to control the visualized
timestamp. The outcome of this diagram (in VisMode) is shown in Figure 5.5.
Using the series player, we may replay the entire evacuation traces and review
the movements of all the persons in the building. By the interactive selection of
VisFlow, we may also pause at any timestamp and select people we are interested
in. By linking the selected people with their descriptive information from a separate
table, we may identify who those selected people are.
The script editor and the series player extensions of VisFlow enable the visualization
and exploration of such movement data, which were previously not possible
with the original subset flow. In particular, the script editor gives the user
freedom over how data are displayed and makes it more convenient to perform custom
rendering, i.e. drawing the floor plan. The series player demonstrates a node type
extension within the original subset flow model. Meanwhile, the interactivity benefits,
Figure 5.5: Using a series player and a script editor to visualize the evacuation data from VAST Challenge 2008.
i.e. supporting details-on-demand pulling of building personnel information, are
also effectively utilized.
5.3.2 k-Means Clustering Visualization
In this case study, we apply the extended subset flow model to visualize the
iterations in a k-means clustering algorithm. Figure 5.6(i) shows the dataflow
diagram that completes the visualization. A k-means clustering node is applied on
the Auto MPG dataset. The clustering node adds a “ClusterLabel” column and
assigns a cluster label to each row, as seen in the table in Figure 5.6(i).
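For illustration, one iteration of such a labeling step might look like the following sketch, which appends a “ClusterLabel” column as the clustering node does. The distance metric and centroid update are the standard k-means steps; the table format follows Listing 5.1, and the feature selection is an illustrative assumption.

```javascript
// Sketch of one k-means iteration that appends a "ClusterLabel" column,
// mirroring the clustering node's per-iteration output. Rows are treated
// as numeric feature vectors; this is an illustrative simplification.
const assignLabels = (rows, centroids) => {
  return rows.map((row) => {
    let best = 0;
    let bestDist = Infinity;
    centroids.forEach((c, k) => {
      // Squared Euclidean distance to centroid k.
      const d = row.reduce((s, v, i) => s + (v - c[i]) ** 2, 0);
      if (d < bestDist) {
        bestDist = d;
        best = k;
      }
    });
    return best;
  });
};

const kMeansIteration = (table, centroids) => {
  const labels = assignLabels(table.rows, centroids);
  // Recompute each centroid as the mean of its assigned rows.
  const next = centroids.map((c, k) => {
    const members = table.rows.filter((_, i) => labels[i] === k);
    if (members.length === 0) return c;
    return c.map(
      (_, j) => members.reduce((s, row) => s + row[j], 0) / members.length
    );
  });
  // Emit the labeled table, as the clustering node does after each iteration.
  return {
    centroids: next,
    table: {
      columns: [...table.columns, "ClusterLabel"],
      rows: table.rows.map((row, i) => [...row, labels[i]]),
    },
  };
};
```

Emitting the labeled table after every call to kMeansIteration corresponds to the per-iteration output configuration described below.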
To visualize the cluster labels produced by the clustering algorithm, we may
apply a categorical color scale on the “ClusterLabel” column from the output of
the k-means node. To determine a proper placement of the data items on a 2D
plane, we use the table join node available in the extended subset flow model
to assign the MDS coordinates to each car. The result is an MDS scatterplot in
Figure 5.6(i), in which points are color encoded by their cluster labels.
Figure 5.6: Using an MDS plot and a cluster label distribution plot to visualize the iterations of the k-means clustering algorithm: (i) the dataflow diagram; (ii) visualizations of the clustering algorithm iterations.
The clustering node that executes the k-means algorithm may be configured to
output the cluster labels immediately after each iteration of the clustering algorithm.
With this configuration, we are able to witness the label changes on the fly as the
algorithm proceeds. Additionally, utilizing the subset flow color linking, we may
create a histogram to identify how many data items are given a particular cluster
label. This is shown in the histogram on top of the MDS plot in Figure 5.6(i).
Several iterations starting from a random cluster label initialization are shown
in Figure 5.6(ii). The numbers of data items with different labels are about even
after random initialization, and gradually converge to the final output of
the algorithm. The algorithm's progress can be easily visualized within the subset
flow sub-diagram highlighted with a red border in Figure 5.6(i). This case study
shows that even when data mutating nodes like the k-means algorithm and the
MDS coordinates table join are present in the dataflow, we can still make use of
subset flow visual properties to perform effective visual linking between multiple
visualizations.
5.3.3 Model Training Visualization
We present a third case study that employs the extended subset flow model to
visualize a machine learning model training process. The diagram in Figure 5.7
completes an example training process using a multi-layered perceptron. The
diagram uses the native subset flow sampling feature available in an attribute filter
with a set difference operator (a) to make a train/test split on the AutoMPG
dataset. 75% of the cars from each distinct origin are used as training examples,
while the remaining 25% are kept as the test data. The training data are sent to a
multi-layered perceptron (b), which predicts the “origin” for the cars in the test
data.
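The stratified split can be sketched as follows; the deterministic “first 75%” sampling here is an illustrative simplification of the attribute filter's sampling mode, and the function name is hypothetical.

```javascript
// Sketch of the stratified train/test split: take 75% of the rows for
// each distinct "origin" as training data; the remaining rows (the set
// difference) become the test data. The deterministic "first 75%" cut is
// an illustrative simplification of the attribute filter's sampling mode.
const stratifiedSplit = (table, column, trainFraction) => {
  const idx = table.columns.indexOf(column);
  const groups = new Map();
  table.rows.forEach((row) => {
    if (!groups.has(row[idx])) groups.set(row[idx], []);
    groups.get(row[idx]).push(row);
  });
  const train = [];
  const test = [];
  for (const rows of groups.values()) {
    const cut = Math.floor(rows.length * trainFraction);
    train.push(...rows.slice(0, cut));
    test.push(...rows.slice(cut)); // set difference: rows not sampled
  }
  return {
    train: { columns: table.columns, rows: train },
    test: { columns: table.columns, rows: test },
  };
};
```

In the diagram, the sampling attribute filter plays the role of the train selection and the set difference operator (a) produces the test rows.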
A script editor (c) is used to compute the model's macro F1 score as well as the
precision/recall metrics for the Japanese cars in the test data. The metric values
are passed to a line chart (d) to visualize the metric changes after each iteration
of the model training. This script editor is a stateful node. It remembers all its
previous output so that the line chart can visualize the entire metric history over
the training process. At each iteration, three new rows giving the new values of
the three metrics (macro F1, precision for Japanese cars, recall for Japanese cars)
are appended to the output of the script editor, so that the line chart may produce
the metrics visualization as if the metric values were “streamed” from the model.
Figure 5.7: Applying a combination of extended model nodes to visualize a multi-layered perceptron training process. Using a stateful script editor we show metric value changes over time in a line chart. By subset flow diagram highlighting, we can highlight the incorrectly predicted test data in the MDS plot and the histogram.
A separate script editor (e) is used to identify which of the test data rows have
an incorrectly predicted “origin”. If the prediction is wrong, an “error” flag is
added as a new column, and the downflow highlighting sub-diagram (g) reads this
flag and colors the incorrectly predicted test data in red. With an MDS plot (h)
similar to the one in Section 5.3.2, and a distribution histogram (i) of the “origin”
of the cars, we can visualize where the model fails to make a correct prediction.
This error highlighting sub-diagram applies the subset flow visual properties linking
to conveniently identify the incorrectly predicted cars from the test data in the
visualizations.
This case study demonstrates that with the extended subset flow model it is
possible to produce visualizations for a wider variety of tasks. The benefits of
having visual properties carried by the subsets may be preserved, even when many
nodes in the diagram, such as the perceptron (b), the metric computation (c), and
the error computation (f), mutate the data. It also demonstrates a more comprehensive
usage of the script editor and its state. Specifically, the stateful design of the script
editor enables the visualization of streamed data, which is useful for visualizing the
model metrics.
5.4 Discussions
In this chapter, we extend the subset flow model to remove the data immutability
requirement and consequently increase the data processing power of the dataflow
framework. We show that such a design is beneficial, in that it enables a larger
variety of tasks to be supported by the dataflow, as described in the case studies.
In particular, it can be seen that the interactivity advantage of the original subset
flow model can be preserved when data mutation is allowed within the dataflow.
Though technically we may add any data processing node to VisFlow under
the extended subset flow model, it is important that those new node types do
not over-complicate the dataflow diagram and make the subset flow harder to
understand and trace. The introduction of stateful nodes, such as a stateful script
editor and a data reservoir, may also make it harder to identify where data are
coming from at a particular point within the dataflow. The user's understanding
and perception of mixed subset and non-subset flow may need further study, so as
to find a better tradeoff between the subset flow advantages and the data processing
capability of the system.
Moreover, creating a dataflow framework that is general-purpose and applicable
to any data analysis task may require community effort. Currently, we are the only
maintainers and developers of the VisFlow repository and have to implement every
node type on our own. An extension standard is desirable so that the dataflow
framework may accept external and public contributions. We may also consider
setting up infrastructure to reuse packages and tools available in other languages,
so that the web-based dataflow framework may serve as an integrative environment
like scripting notebooks such as Jupyter [52] and Observable [46].
Chapter 6
FlowSense: A Natural Language
Interface
Natural language interfaces (NLI) for data visualization may help improve the
usability of a visualization system. Systems with NLIs allow the user to specify
queries directly via natural language (NL) without much prior knowledge of the
system usage. Recent research has made progress on visualization-oriented NLIs [19,
24, 60]. However, most of these interfaces present a single main visualization for
the user to interact with, possibly with a few auxiliary views and widgets. The
analytical capability and flexibility are thus restricted by the single-view design.
Visual data exploration often requires multi-view linked visualizations, for which
the design of an NLI becomes more challenging.
As a DFVS has the potential to produce flexible, customizable visualizations
but is often relatively difficult to learn, we seek to design an NLI that benefits
both from the usability of NL and the analytical flexibility of VisFlow. We name
the NLI FlowSense. FlowSense uses semantic parsing to support NL queries that
manipulate multi-view visualizations created from a dataflow diagram. The NL
capability effectively reduces the overhead of learning dataflow usage and simplifies
the interactions needed for dataflow diagram construction.
In this chapter, we discuss the design details of FlowSense and the results
achieved. We first survey the related work on NLIs in Section 6.1. The design goal
of our NLI is defined in Section 6.2. We then introduce the FlowSense semantic
parser and its query execution process, as well as the user interface we integrate
into VisFlow for performing FlowSense queries. Finally, we present two case studies
in Section 6.7 and a formal user study in Section 6.8 to evaluate FlowSense within
the VisFlow framework.
6.1 Related Work
We first discuss the related work on NLIs for data visualization. We also briefly
cover semantic parsing and relevant techniques for parsing NL input.
6.1.1 NLIs for Data Visualizations
Extensive research has been devoted to NLIs for decades. These interfaces
address NL queries that otherwise have to be translated into formal query languages,
e.g. SQL. A few examples are interfaces for querying XML [34] and entity-relational
databases [1, 3], and a speech translator to SQL [31]. NLIs for data visualization
answer queries by presenting visual data representations. Compared with
other interfaces that simply return a numerical answer or a set of database entries,
visualization NLIs present results that are more human-readable. Cox et al. [9]
designed the Sisl service within the InfoStill data analysis framework. The service
asks a series of NL questions and uses the obtained answers to complete an
unambiguous query. The Articulate system [62] uses a Graph Reasoner to select proper
visualizations to answer a query. DataTone [19] addresses the particular problem
of query ambiguity by showing ambiguity widgets along with the main visualization
so that the user is able to switch to informative alternatives. Eviza [60] and
Evizeon [24] further improve the user experience by allowing for conversation-like
follow-up questions. Kumar et al. [30] propose a dialogue system for visualization.
Several commercial tools integrate NLIs. IBM Watson Analytics [27] and Microsoft
Power BI [42] provide a list of relevant data and visualizations for an NL question,
from which the user may choose to continue the analysis. Wolfram Alpha [76]
supports knowledge-based Q&A and is able to plot the results. ThoughtSpot [67]
enables interactive search over relational databases, and provides multiple types of
visualizations for the database contents. The design of NLIs for data visualization
faces two challenges. First, modern natural language processing (NLP) techniques
cannot yet fully understand arbitrary NL input due to the complex nature of NL,
and user queries tend to be free-form and ambiguous. Second, choosing a proper
visualization to answer an analytical question is non-trivial, as there can be multiple
possible visual representations [38].
6.1.2 Semantic Parsing
FlowSense uses semantic parsing to process NL input and map user queries to
dataflow diagram editing operations. It depends on a pre-defined grammar that
captures NL input with certain patterns. A semantic parser recursively expands
the variables in the grammar to match the input query and can interpret the input
based on the rules applied and the order of their application [6]. At a high level,
the mapping performed by FlowSense can also be considered a classification task
and addressed by classification algorithms [2]. However, we prefer the semantic
parsing approach because most classification methods are supervised and require
a large corpus of labeled examples, which is not available for a DFVS. Besides,
compared with deep learning methods [13], semantic parsing does not require heavy
computational resources.
The FlowSense semantic parser is implemented within the Stanford SEMPRE
framework [50] and the CoreNLP toolkit [39]. The CoreNLP toolkit integrates a
comprehensive set of NLP tools including a Part-of-Speech (POS) tagger, a Named-
Entity Recognizer (NER), etc. The SEMPRE framework employs a modular design
in which different types of parsers and logical forms can be easily plugged in. The
framework can quickly be adapted for domain-specific parser design [72].
6.2 Design Goal
FlowSense is distinct from other NLIs as it is, to the best of our knowledge,
the first NLI to address a dataflow context. We set the scope of FlowSense to
focus on assisting dataflow diagram construction, rather than directly answering
free-form analytical questions or seeking the best visualization for a given query.
We believe such an approach is beneficial in several aspects:
• Capability: The analytical capability of FlowSense is rooted in the design of
the DFVS. The outcome of FlowSense is a complete, interactive, and iterative
visual data exploration process supported by VisFlow, rather than a single
visualization that only answers one particular query as in other interfaces.
Dataflow also naturally preserves analysis provenance [17]. The diagram
created by FlowSense explicitly keeps the user's preferences and intentions
from previous queries, which are otherwise maintained by a model behind
the scenes [19, 60].
• Usability: FlowSense significantly reduces the number of interactions re-
quired to construct a dataflow diagram. Its convenience is desirable for both
novice and experienced VisFlow users. Besides, the DFVS is able to recover
from errors more easily, as the user always has full control over the system.
In other interfaces, however, the user has to mostly rely on the behavior of
the NLI and can hardly make corrections in case of misinterpretation. We
justify this advantage with case studies and user feedback.
• Feasibility: The scope of FlowSense, assisting dataflow diagram construc-
tion, is well defined and practicable. Even state-of-the-art NLP techniques
have limited success in understanding an arbitrary NL query. By restricting
our scope, FlowSense can produce more predictable results and give a better
user experience, as each query is expected to update the dataflow diagram,
and the user decides what the system does and what visual representation
to use through dataflow editing. The mixed-initiative design mitigates the
ambiguity that potentially comes from misinterpreting the user's intention.
6.3 Semantic Parsing
In this section we introduce the details of the semantic parsing in FlowSense.
For concept illustration we use the Auto MPG dataset throughout the discussion.
Table 6.1: Six major categories of VisFlow functions. These sub-diagrams are frequently used to compose more sophisticated diagrams that address analytical tasks.

A. Visualization. Sample queries: “Show a scatterplot of mpg and horsepower”. Description: present the data in a visualization. Sample sub-diagram: a data source connected to a visualization.

B. Visual Encoding. Sample queries: “Encode mpg by red green color scale”; “Map car origin to categorical colors”. Description: map data attributes to visual channels. Sample sub-diagram: a visual editor placed before a visualization.

C. Data Filtering. Sample queries: “Find all cars with mpg between 15 and 20”; “Show five cars with maximum mpg”. Description: filter data items and locate extremums and outliers. Sample sub-diagram: attribute filters such as 15 ≤ mpg ≤ 20 and max{mpg}.

D. Subset Manipulation. Sample queries: “Merge the cars with those from the scatterplot”. Description: refine and identify interesting subsets. Sample sub-diagram: set operators such as union and intersection.

E. Highlighting. Sample queries: “Highlight the selected cars in a parallel coordinates plot”. Description: view the characteristics of one subset among its superset or another subset. Sample sub-diagram: a user selection routed through a visual editor and a union into a highlighted visualization.

F. Linking. Sample queries: “Find the cars with a same name from the sales table”; “Link the sales records by origin from the scatterplot”. Description: extract keys from one table and find their correspondence in another (heterogeneous) table using a linker. Sample sub-diagram: two data sources connected through a linker.
6.3.1 VisFlow Functions
To create an NLI for VisFlow, we first studied a sample diagram set that includes
60 dataflow diagrams created by 16 VisFlow users in their previous VisFlow
sessions. We identified a set of frequently appearing sub-diagrams and categorized
them into six major categories, as listed in Table 6.1. The construction of these
sub-diagrams is defined as the set of VisFlow functions. By implementing the VisFlow
functions, FlowSense essentially supports the building blocks of VisFlow's visual
data exploration, so that analyses with VisFlow's native interactions can then be
carried out with FlowSense. Table 6.1 explains the usage of each VisFlow function
and shows several sample queries.
In addition to the six major categories, FlowSense also supports many utility
functions such as adding/removing diagram edges, selecting data points in a
visualization, loading a given dataset into a data source node, automatically
adjusting the dataflow diagram layout, etc. Though these functions also enhance
the usability of the system, we omit their details here as they are auxiliary.
6.3.2 Grammar
FlowSense applies a semantic parser to map an NL input to one of the VisFlow
functions based on an elaborate grammar designed for these functions. The grammar
is context-free [40] and formally defined as a 4-tuple G = (V,Σ, R, S). V is a finite
set of variables. Σ is a finite set of terminals. A terminal represents an English
word or phrase. R is the rule set that defines how a single variable matches an
ordered list of terminals and variables (possibly itself in a recursive rule). Below is
an example rule:
〈Visualization〉 → 〈ShowVerb〉 〈Columns〉 in 〈VisualizationType〉
In this rule, 〈Visualization〉 is a high-level variable that matches a query that
requests a visualization. 〈ShowVerb〉 matches a verb that has a meaning similar to
“show”. 〈Columns〉 matches one or more columns from the data. 〈VisualizationType〉
stands for a phrase that describes a visualization metaphor such as scatterplot or
parallel coordinates. The token “in” is a terminal symbol that comes from the
NL input directly. The example rule above is simplified for the convenience of
explanation. In practice, a rule often matches against generic variables rather than
a specific word. S is the start variable that expands to other variables to match
every query.
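To make the 4-tuple concrete, the rule above can be sketched in code. This is a toy illustration only: the real rules are SEMPRE formulas, and the vocabulary and rule names below are hypothetical.

```python
# Toy context-free grammar in the style of the example rule above.
# Variables map to lists of rule bodies; anything not in RULES is a
# terminal that must match an input token literally. COLUMN is a special
# utterance placeholder filled from the dataset at runtime (Section 6.4.1).
RULES = {
    "Visualization": [["ShowVerb", "Columns", "in", "VisualizationType"]],
    "ShowVerb": [["show"], ["visualize"], ["draw"]],
    # Longer bodies come first: this naive matcher does not backtrack
    # across parent rules, so greedy short matches would block "and" lists.
    "Columns": [["COLUMN", "and", "Columns"], ["COLUMN"]],
    "VisualizationType": [["scatterplot"], ["parallel", "coordinates", "plot"]],
}
COLUMNS = {"mpg", "horsepower", "origin"}  # tagged at runtime, not in the grammar


def match(symbol, tokens, i):
    """Try to expand `symbol` at position i; return the new position or None."""
    if symbol == "COLUMN":  # special utterance placeholder
        return i + 1 if i < len(tokens) and tokens[i] in COLUMNS else None
    if symbol not in RULES:  # terminal: must match the token literally
        return i + 1 if i < len(tokens) and tokens[i] == symbol else None
    for body in RULES[symbol]:  # variable: try each rule body in order
        j = i
        for part in body:
            j = match(part, tokens, j)
            if j is None:
                break
        else:
            return j
    return None


def accepts(query):
    tokens = query.lower().split()
    return match("Visualization", tokens, 0) == len(tokens)
```

A query such as "show mpg and horsepower in scatterplot" is derived, while unrelated input is rejected; the real parser additionally scores competing derivations (Section 6.4.5).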
The FlowSense semantic parser attempts to derive an input
query by recursively searching for all possible matches of the grammar rules;
this procedure is called derivation [6]. FlowSense uses the semantic parsing
implementation from SEMPRE. It also uses the Stanford CoreNLP [39] toolkit that
is built into SEMPRE for POS tagging (Section 6.4.1). The variables and rules
of FlowSense are defined in SEMPRE grammar files. The FlowSense grammar is
independent of the data being analyzed. It is also independent of the dataflow
diagram being constructed. We use special utterance placeholders (Section 6.4.1)
to let the grammar understand dataflow context at runtime. FlowSense currently
includes about 200 variables and a rule set of around 500 rules (i.e. SEMPRE
formulas). The FlowSense grammar may be extended to support more analytical
functions.
[Figure 6.1 graphic omitted. It shows the example query “Visualize mpg, horsepower, and origin of the selected cars from MyChart in a parallel coordinates plot” with its POS tags, a parse tree over grammar variables such as 〈ShowVerb〉, 〈Columns〉, 〈Selection〉, 〈SourceNodeWithName〉, 〈VisualizationType〉, and 〈TargetNodeWithOptions〉, and the resulting sub-diagram, in which the selection port of the node MyChart feeds a new parallel coordinates plot.]
Figure 6.1: An example FlowSense query and its execution. The derivation of the query is shown as a parse tree in the middle. The sub-diagram expanded by the query is illustrated at the bottom. The five major components of a query pattern are underscored. Each component and its relevant parts in the parse tree and the dataflow diagram are highlighted by a unique color. The result of executing this query is to create a parallel coordinates plot on the columns mpg, horsepower, and origin, with input from the selection port of the node MyChart.
6.3.3 Query Pattern
The main goal of FlowSense is to support progressive construction of dataflow
diagrams. We studied the creation process of many VisFlow diagrams in our
sample diagram set and empirically identified a common pattern with five key
query components that all VisFlow functions depend on: function type, function
options, source node(s), target node(s), and port specification. This pattern is
illustrated in Figure 6.1 with a sample query “Visualize mpg, horsepower, and
origin of the selected cars from MyChart in a parallel coordinates plot”. In this
query, the verb “visualize” indicates the intention to apply a visualization function.
The three columns “mpg, horsepower, and origin” describe the options (i.e. what
to visualize) for the visualization function. The phrase “from MyChart” tells
the system where the data to be plotted are coming from and provides source
node information. The phrase “in a parallel coordinates plot” indicates a new
visualization node with the given visualization type is to be created as the target
node. The source and target node information is closely related to the dataflow
context and is automatically identified upon user input and can then be matched as
special utterances (see Section 6.4.1). As VisFlow explicitly exports interactive data
selection from visualization nodes, the phrase “selected cars” is a port specification
that further describes that the user wants to visualize the selection from MyChart
and the new visualization node should be connected to the selection output port of
MyChart.
The grammar of FlowSense includes hierarchical variables that match the five
key components of an NL query. Figure 6.1 illustrates the parse tree that derives
the example query. The variables involved in the derivation are shown in the parse
tree, where variable expansions are bottom-up. A variable may carry information
for multiple query components. We design a comprehensive set of variables and
rules that accept not only queries in a particular order, but also their different
arrangements. For instance, “Show mpg and horsepower in a scatterplot”
is equivalent to “Show a scatterplot of mpg and horsepower”. They both can be
accepted by FlowSense. FlowSense is also able to derive multiple functions from a
single query and execute their combination, e.g. “Show the cars with mpg greater
than 15 in a scatterplot” infers both visualization and data filtering functions.
A query need not contain all five components explicitly. For
example, the user may simply say “Show mpg and horsepower” without mentioning
any source node or target visualization type. FlowSense may automatically locate
source and target nodes in its query completion phase (Section 6.4.3). The user
query may also contain implicit information, e.g. “Find cars with large mpg” implies
data filtering to search for a few cars with large mpg values. FlowSense
stores utterance implications in its grammar, e.g. the word “find” implies the use of
a filter. FlowSense uses keyword classification (Section 6.4.2) to identify important
utterance implications from the query.
6.4 Query Execution
To interpret NL input based on the current dataflow context, FlowSense not
only runs the semantic parser, but also employs several auxiliary phases for its
query execution. Figure 6.2 illustrates the execution process.
[Figure 6.2 flowchart omitted: NL Input → POS and Special Utterance Tagging → Keyword Classification → Parser Execution → Query Pattern Completion → Dataflow Diagram Update, with failure paths for unexpected input and incorrect/missing information.]
Figure 6.2: FlowSense query execution phases. POS and special utterance tagging is performed first. Special utterances describing the data columns and diagram nodes are identified and can be matched against utterance placeholders. Keyword classification is applied to identify important utterance implications such as the intention to call a specific VisFlow function. FlowSense attempts to complete the query pattern if missing information can be filled using default values. Upon an execution failure the user is notified and asked to update the query.
6.4.1 POS and Special Utterance Tagging
FlowSense first performs POS tagging on the query with CoreNLP. Each token
receives a POS tag as shown in Figure 6.1. POS tags are used to generalize the
FlowSense grammar. For example, many prepositions can be used interchangeably,
e.g. “selection of the plot” is equivalent to “selection from the plot”. Instead of
having one rule for every preposition, the grammar uses a generic variable that
matches any preposition. POS tagging helps FlowSense analyze the basic semantic
structure of a query.
Some utterances in the NL query refer to special entities such as visualization
types, table columns, or diagram node names. These utterances play key
roles in executing a VisFlow function. FlowSense identifies these special utterances
and uses this information in the derivation. For the query shown in Figure 6.1,
FlowSense tags “mpg”, “horsepower”, “origin” as columns, “MyChart” as a node
label, and “parallel coordinates plot” as a visualization type (node type). FlowSense
uses generic variables like 〈column〉 in its grammar that do not depend on the dataset
being analyzed. Therefore, the grammar rules do not list the data-dependent special
utterances as terminals. For example, the grammar does not include table column
names or diagram node types, i.e. the string value “mpg” would not appear
in the grammar. These generic variables are special utterance placeholders that
support matching of special utterances. The information about special utterances
is collected in the tagging phase, and the special utterance placeholders are thus
able to match the tagged tokens.
For instance, when the user loads the Auto MPG dataset into the dataflow,
column names such as “mpg” are automatically extracted and whenever the user
types “mpg” it is identified as a data column on the fly so that the column utterance
placeholder in the grammar may accept “mpg”. To enable dataflow context
awareness, FlowSense also has special utterance placeholders for diagram node
labels. By accepting node labels, FlowSense can effectively support node references
so that the user can more precisely instruct where to extend the diagram.
For typo and naming tolerance, FlowSense employs approximate matching and
checks each k-gram in the query (where k may range from 1 to the query length)
against all special utterances using case-insensitive Levenshtein distance [33, 44].
We divide the distance over the string length and use the ratio to mitigate the fact
that longer strings are more prone to typos. We find a k value of 2 or 3 and a ratio
threshold of 0.2 work well in practice.
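The approximate matcher can be sketched as follows. This is a minimal re-implementation of the described scheme; the utterance list and example query are hypothetical.

```python
# Sketch of the approximate special-utterance matcher: every k-gram of the
# query is compared to the known special utterances (columns, node labels,
# ...) using a length-normalized, case-insensitive Levenshtein distance.
# Parameter values follow the text (k up to 3, ratio threshold 0.2).

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def tag_special_utterances(query, utterances, max_k=3, threshold=0.2):
    """Return (k-gram, matched utterance) pairs found in the query."""
    tokens = query.lower().split()
    matches = []
    for k in range(1, max_k + 1):
        for i in range(len(tokens) - k + 1):
            gram = " ".join(tokens[i:i + k])
            for u in utterances:
                d = levenshtein(gram, u.lower())
                # Normalize by the utterance length: longer strings are
                # more prone to typos, so allow proportionally more edits.
                if d / max(len(u), 1) <= threshold:
                    matches.append((gram, u))
    return matches
```

For a query like "show horsepwer of mychart", the typo "horsepwer" still matches the column "horsepower" (one edit over ten characters), and "mychart" matches the node label regardless of case.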
6.4.2 Keyword Classification
FlowSense uses keyword classification to identify the semantic meaning of words
in the NL query and uses this information to decide which VisFlow function
to execute. For instance, the verb “show” is a synonym of “visualize”, “draw”,
etc. These words all indicate the intention to create a visualization. Meanwhile,
“find” may implicitly specify a data filtering requirement and is similar to “filter”.
We compute the Wu-Palmer similarity scores [79] between words and use the
measured scores to classify words in the NL query that have close meaning to a
set of pre-determined VisFlow function indicators. The implementation of the
similarity scores is based on WordNet [15] and NLTK (http://www.nltk.org).
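To illustrate the metric itself, the sketch below computes Wu-Palmer similarity over a tiny hand-built is-a hierarchy. The hierarchy is entirely hypothetical; the actual implementation queries WordNet synsets through NLTK, where depth conventions differ slightly.

```python
# Illustrative Wu-Palmer similarity on a toy verb taxonomy. Depth counts
# edges from the root; similarity = 2 * depth(lcs) / (depth(a) + depth(b)),
# where lcs is the least common subsumer of the two words.

PARENT = {                     # hypothetical is-a hierarchy
    "communicate": None,
    "display": "communicate",
    "show": "display",
    "visualize": "display",
    "search": "communicate",
    "find": "search",
    "filter": "search",
}

def path_to_root(w):
    path = [w]
    while PARENT[w] is not None:
        w = PARENT[w]
        path.append(w)
    return path

def depth(w):
    return len(path_to_root(w)) - 1

def wu_palmer(a, b):
    ancestors_a = path_to_root(a)
    lcs = next(w for w in path_to_root(b) if w in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))
```

Under this toy hierarchy, "show" scores higher against "visualize" (shared subsumer "display") than against "find" (shared subsumer only at the root), which is the kind of separation used to classify function-indicator words.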
6.4.3 Query Pattern Completion
The FlowSense parser identifies the key components of a query. FlowSense
attempts to fill in the blanks where information is missing using the following two
mechanisms:
Finding default values. Query components may be completed using default
values. Function options may have defaults. For instance, FlowSense automatically
chooses two numerical columns to visualize in a scatterplot when the query is simply
“Show a scatterplot”. Note that within a DFVS such decisions can easily be
changed by the user, so FlowSense does not attempt to make the best possible guess. Similar
decisions include completing port specification. By default FlowSense filters all the
data a visualization node receives when creating a filter, rather than filtering the
data subset interactively selected in the visualization. Sometimes the default values
could even be empty. A query like “Filter by mpg” results in FlowSense creating a
range filter on the mpg column with no filtering range given (the filter allows all
its input data to pass). The user can then follow up and fill in the filtering range
via the DFVS interface.
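A minimal sketch of the default-column rule described above, assuming the source table exposes column names and types (names here are hypothetical):

```python
# Hypothetical default-value completion for "Show a scatterplot": when the
# query names no columns, pick the first two numerical columns of the
# source table. The user can change this choice in the DFVS afterwards.

def default_scatterplot_columns(columns, types):
    """columns: names; types: per-column types, e.g. "int", "float", "str"."""
    numeric = [c for c, t in zip(columns, types) if t in ("int", "float")]
    return numeric[:2] if len(numeric) >= 2 else None
```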
Finding diagram editing focus. Whenever the user expands the dataflow
diagram there always exists an editing focus, though sometimes the focus is implicit.
For example, when the query contains a phrase like “from MyChart”, the focus (i.e.
the source node of the query) is explicitly given. However, users tend to neglect
the source or target nodes in their queries, especially when there is a sequence of
commands that together complete a task. When a query does not have explicit
focus, FlowSense derives the user’s implicit focus. If a node is activated by the user
(e.g. clicked), then that node is taken as the focus. Otherwise, we compute a focus
score for every node X by:
score(X) = activeness(X, t) + α · (1 − 1/(1 + e^{−(distanceToMouse(X)/γ − β)})).
The activeness of X is updated upon every user click in the system:
activeness(X, t) = activeness(X, t − 1)/2 + click(X, t),
where click(X, t) = 1 if the t-th click is on X and 0 otherwise. This definition
measures how actively a user is focusing on a node by how many times she has
recently clicked on it, as well as how close it is to the mouse cursor. The activeness derived
from user clicks decreases exponentially over time, while the closeness to mouse
dominates under a small distance with a shifted sigmoid function. We find the
parameters α = 2, β = 5, γ = 500 achieve good results. FlowSense chooses the node
with the highest focus score to be the diagram editing focus when there are no
activated nodes. If multiple source nodes are required (e.g. in a merge query),
FlowSense looks at the nodes in the order of their decreasing focus scores.
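The two formulas transcribe directly into code. The Node class and click bookkeeping below are illustrative scaffolding; the parameter values are those reported above.

```python
import math

ALPHA, BETA, GAMMA = 2.0, 5.0, 500.0  # values reported in the text

class Node:
    def __init__(self, label):
        self.label = label
        self.activeness = 0.0

def register_click(nodes, clicked):
    """Update activeness of every node on each user click:
    activeness(X, t) = activeness(X, t-1)/2 + click(X, t)."""
    for node in nodes:
        node.activeness = node.activeness / 2 + (node is clicked)

def focus_score(node, distance_to_mouse):
    """score(X) = activeness + alpha * (1 - sigmoid(d/gamma - beta)).
    The shifted sigmoid makes closeness dominate only at small distances."""
    closeness = ALPHA * (1 - 1 / (1 + math.exp(-(distance_to_mouse / GAMMA - BETA))))
    return node.activeness + closeness
```

With these parameters a node under the cursor gains a closeness term close to 2, while a node a few thousand pixels away gains almost nothing, so recent clicks and cursor proximity jointly determine the editing focus.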
The focus may also be vaguely specified using node types instead of node labels.
For instance, the user may say “show the data from the scatterplot”, in which case
“scatterplot” is a node type reference that describes a scatterplot node existing in the
dataflow diagram. FlowSense searches for matched node types within the dataflow
diagram. In case of a tie on the node type, e.g. there are multiple scatterplots in
the diagram, the nodes with higher focus scores are chosen.
6.4.4 Diagram Update
Once a query is successfully completed, FlowSense performs the VisFlow func-
tion(s) with the given function options. This typically results in the creation of one
or more nodes, e.g. the visualization function creates one plot but the highlighting
function creates three nodes (Table 6.1). FlowSense may also update existing nodes
without creating any new nodes, e.g. when the user only changes rendering colors.
Additionally, a query may operate on multiple existing nodes at once, e.g. linking
and merging two tables create edges between two nodes. Operating on multiple
nodes together helps simplify user interaction, as these operations would otherwise
require multiple drag-and-drops using traditional mouse interaction.
After new nodes and edges are created, the diagram may become more cluttered.
FlowSense locally adjusts the diagram layout after each diagram update. We use a
modified force-directed layout from the D3 library that works on the vicinity of
the current diagram editing focus. We extend the force to take rectangular node
sizes into account so that larger nodes such as embedded visualizations have larger
repulsive force. User-adjusted node positions are remembered by the system and
the layout algorithm avoids moving nodes that have been positioned by the user.
Currently FlowSense does not look for an optimal dataflow layout. We leave more
advanced layout methods [4] for future work.
6.4.5 Ambiguity
It is possible to have ambiguity even when the scope of the NLI is to map
queries to diagram editing operations. One type of ambiguity comes from multiple
possible query derivations (i.e. different parse trees), which can be defined as
syntactic ambiguity [19]. For example, FlowSense uses wildcard variables to match
stop words. The token “cars” from “Show a plot of cars” describes the user’s
understanding of data entities but is of no use to executing a visualization function.
Meanwhile, the token “horsepower” from “Show a plot of horsepower” is a special
utterance and should be treated as a table column to visualize. Therefore a wildcard
rule that matches the stop word “cars” may also match “horsepower”, resulting
in the second query being mishandled. We could handle this case by creating a
wildcard variable that rejects a special utterance token. Nevertheless, such a design
could lead to a larger number of variables and rules in the grammar, which are
harder to maintain and develop. Therefore we choose to resolve syntactic ambiguity
in the parsing phase with supervised learning on a weight vector w ∈ R^d that
gives the probability of derivations based on input utterances. Stochastic gradient
descent (SGD) is employed to optimize the multiclass hinge loss objective [63], as
introduced by Liang et al. [35] in the SEMPRE framework. The objective is given
by
min_w ∑_{(x,y)∈S} [ max_{y′} (score_w(x, y′) + penalty(y, y′)) − score_w(x, y) ],
with score_w(x, y) = w · feature(x, y).
In the above, x is the input query, y is the preferred derivation, and y′ is the
derivation chosen by the parser. The feature of a derivation, feature(x, y), is determined by
the applied rules in the derivation. penalty(y, y′) returns 0 on a correct prediction
and 1 otherwise. The objective function has a penalty
for each incorrectly predicted example that is linear in terms of the score (i.e.
probability) difference. The parser fits the training examples by giving preferred
derivations higher probability so that they are more likely to be returned in case of
ambiguity. In particular, the rule that expands to a data column special utterance
will be preferred over a rule that expands to a wildcard. We have created a small
set of labeled training examples as the set S to inform the system of the preferred
choices in terms of syntactic ambiguity. Note that only a small labeled set is
needed because the training set only disambiguates the grammar of around 500
rules that are data- and diagram-independent, instead of covering the overwhelmingly
large variations of NL input in general.
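A sketch of the loss-augmented SGD step for this objective, with derivation features as bag-of-rules dictionaries. The example derivations are hypothetical; in FlowSense the training itself happens inside SEMPRE.

```python
# Minimal structured hinge-loss SGD in the spirit of the objective above.
# Each derivation is represented by a sparse feature dict over applied rules.

def score(w, feat):
    """score_w(x, y) = w . feature(x, y) over sparse dicts."""
    return sum(w.get(k, 0.0) * v for k, v in feat.items())

def sgd_step(w, derivations, gold, lr=0.1):
    """derivations: {name: feature dict}; gold: preferred derivation name."""
    # Loss-augmented prediction: penalty(y, y') adds 1 to every wrong derivation.
    pred = max(derivations,
               key=lambda y: score(w, derivations[y]) + (y != gold))
    if pred != gold:  # subgradient is zero when the gold derivation wins by margin
        for k, v in derivations[pred].items():
            w[k] = w.get(k, 0.0) - lr * v
        for k, v in derivations[gold].items():
            w[k] = w.get(k, 0.0) + lr * v
    return w
```

After a few steps on an ambiguous example, the weight on the preferred rule (e.g. the one expanding to a data column special utterance) exceeds the weight on the wildcard rule, so the preferred derivation is returned in case of ambiguity.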
Another type of ambiguity lies in the multiple ways of executing the same query.
One example is “Show the cars with mpg greater than 15” on a visualization node.
From the grammar perspective the returned result is unambiguously a visualization
function plus a filtering function. However, there are two ways to execute it: one is
to create an attribute filter and then visualize the filtered cars in a new visualization;
alternatively, we may apply the filter on the input of the current visualization so
that it shows only the filtered cars. Both can be desired under some circumstances.
FlowSense has a default behavior that prefers filtering the input when the source
node is a visualization, which we find empirically more intuitive.
6.4.6 Error Recovery
There are two types of potential errors in executing a query (Figure 6.2). One
error occurs when the parser does not accept an NL input. For example, the conversational
input “Hello there” makes no sense in a dataflow context. Such input is rejected by
the parser and no fix applies. In this case the system presents an error message and
the user may revisit the list of sample queries in the FlowSense documentation and
tutorial to learn more about the capability and scope of the NLI. The other type of
error concerns an incomplete pattern in the NL input. The query may be structurally
acceptable but has incorrect or missing key information to properly execute. For
instance, the system displays an error when it fails to find a scatterplot in the
diagram for the grammatically correct query “Highlight the selected cars from the
scatterplot”, which uses the vague node type reference “scatterplot” and can only
be successfully executed when a scatterplot is present in the diagram.
Since the user is simultaneously using the underlying VisFlow DFVS while using
FlowSense, she always has the option to undo the mistake of FlowSense or to make
partial adjustments and corrections when the NLI does not yield the desired outcome.
This naturally facilitates easier error recovery, compared with NLIs in which the
user has to rely on the NLI itself to apply a fix and has limited control over the
result.
6.5 User Interface
FlowSense is built as an extension to VisFlow. The user may optionally use
NL to edit the diagram wherever necessary. There are two modes to input a query:
typing or speech. In the typing mode, the user types in the pop-up FlowSense
dialog that shows up around the current focus of the diagram. The speech mode is
implemented with the HTML5 Web Speech API, in which the user may record a spoken
query into the FlowSense input box for further editing. The speech mode can be
enabled by a microphone toggle on the right of the FlowSense input box, as shown
in Figure 6.3.
The special utterances identified by FlowSense are shown in colored tags. Each
color represents a different type of special utterance, including data column
(green), node label (light green), node type (purple), and dataset name (light blue).
If a special utterance is misclassified, the user may correct it by clicking it and
removing/changing its utterance category in the dropdown (Figure 6.3(i)). The
FlowSense input box is also designed to support token completion for special
utterances. The user may use the tab and arrow keys to select token completion
candidates like in a programming IDE (Figure 6.3(ii)). This reduces the typing
workload and helps remind the user of what is available in the system and in the
current dataflow diagram.
Figure 6.3: The FlowSense user interface and query auto-completion. Tagged special utterances are shown in colored tags. (i) Manually update special utterance tagging using a dropdown in the FlowSense input box; (ii) special utterance token completion; (iii) query auto-completion.
6.6 Query Auto-Completion
The usability of an NLI is closely related to its discoverability. It is desirable
that when the query is partially completed, the system is able to provide hints or
suggestions to the user on valid queries that include the partial input. Such a
feature has been requested by an NLI evaluation subject [19]. We therefore
develop an auto-completion algorithm in FlowSense to enhance its usability and
discoverability. When the user types a partial query and pauses for a long time,
the system triggers the query auto-completion automatically. The completion may
also be invoked manually by the user with a button press. Figure 6.3(iii) shows the
auto-completion suggestions in the FlowSense input box.
Auto-completion has been implemented in other visualization NLIs, such as
Eviza [60]. Eviza applies template-based auto-completion, in which the system
attempts to align user input to an available set of templates. Here we take a similar
approach by creating a set of query templates with around 100 queries. Upon
an auto-completion request, the algorithm searches through all possible textual
matches between the user’s partial query and a prefix of the template. The matched
query is then sent to the FlowSense parser for evaluation. If the query is accepted, it
becomes an auto-completion candidate. The accepted results that contain obvious
grammatical errors are discarded, e.g. a sentence with consecutive prepositions like
“... in in ...”. Those grammatical errors are due to the loose design of the FlowSense
grammar, which does not emphasize the usage of determiners and prepositions,
and may neglect them as stop tokens.
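A simplified version of the template alignment can be sketched as follows. The templates here are hypothetical, and the final validation step (sending candidates through the parser) is omitted.

```python
# Sketch of template-based auto-completion: align the partial input with a
# prefix of each template token by token. Bracketed slots stand for special
# utterance placeholders and match any typed token.

TEMPLATES = [
    "show [column] in [visualization type]",
    "show [column] and [column] in [visualization type]",
    "highlight the selection from [node]",
    "filter [column] between [number] and [number]",
]

def complete(partial, templates=TEMPLATES):
    """Return templates whose prefix matches the partial query token-wise."""
    typed = partial.lower().split()
    candidates = []
    for t in templates:
        slots = t.split()
        if len(typed) > len(slots):
            continue
        ok = all(tok == slot or slot.startswith("[")  # placeholder matches any token
                 for tok, slot in zip(typed, slots))
        if ok:
            candidates.append(t)
    return candidates
```

In the actual system each surviving candidate is additionally evaluated by the FlowSense parser, and candidates with obvious grammatical errors are discarded.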
6.7 Case Study
In this section we present two case studies that demonstrate the effectiveness of
FlowSense. The first case applies FlowSense to study the traffic speed reduction in
NYC, and shows how FlowSense can be used to analyze a real-world problem. The
second case is a diagram reproduction study, in which we validate that FlowSense
is able to help experienced VisFlow users speed up diagram construction. The
discussions in this section are based on an earlier version of the system implementation,
and the NLI only applies to the subset flow model without its data mutation
extensions.
6.7.1 Speed Reduction Study
In this case study we collaborate with two analysts who are domain experts
researching the city regulation issued on November 7, 2014 that reduced the default
speed limit on all New York City streets from 30 MPH to 20 MPH. The data
contain the estimated average hourly speed [51] for each road segment in Manhattan
from January 2009 to June 2016. The speed estimation was performed over the
TLC yellow taxi records that only have pickup and dropoff information [68]. The
analysts are familiar with the data, and the visualizations to be created are similar
to the visualizations they previously generated for the project using Tableau [66].
However they have no prior experience with either VisFlow or FlowSense. We
met the analysts in person and first introduced VisFlow and FlowSense usage in
a 30-minute session. Then we guided the analysts through how FlowSense can be
used to create visualizations to study the speed reduction. We observe in this study
that almost all the analysts’ visualization requests (excluding those that exceed
the scope of the VisFlow subset flow) can be effectively supported by FlowSense.
Here we summarize the NL queries that are applied for the speed reduction study.
[Figure 6.4 queries, applied in numbered order: (1) open monthly speed by speed limit; (2) Show speed distribution; (3) Encode speed limit by color; (4) Draw speed over time grouped by speed limit.]
Figure 6.4: Using FlowSense to study the overall speed reduction trend of NYC streets with different speed limits. The queries are applied in the numbered order. The resulting visualization shows a time series for the average speed of road segments, aggregated by unique speed limits. The smaller histogram snapshot shows the speed histogram without color encoding before step 3.
Initially, the analysts would like to look at the speed reduction impact at a
larger scale. They first load a pre-computed speed table (Figure 6.4(1)) with the
FlowSense data loading utility function (the analysts know the dataset name). The
table contains the monthly average speed aggregated by the speed limits of the
streets. The analysts ask the system to present a histogram of speed by “Show
speed distribution” (Figure 6.4(2)). The first histogram has no color encoding but
the analysts are able to immediately add a color scale by “Encode speed limit by
color”. FlowSense inserts a color mapping node with a red-green scale at the input
of the histogram (Figure 6.4(3)). The histogram thus shows the street groups with
higher speed limit in green, and lower speed limit in red. To view the speed changes
over time, the analysts use the query “Draw speed over time grouped by speed limit”
(Figure 6.4(4)). The query result is a line chart showing average speed changes
for different speed limit groups. The analysts observe that overall there is a speed
reduction pattern for each speed limit group that started around mid-2013.
[Figure 6.5 queries, applied in numbered order: (1) Show the data in a map; (2) Show only segments with a sign of yes; (3) Load segment monthly speed; (4) Find roads with a same segment id from West Village/Alphabet City; (5) Set blue/red color; (6) Merge; (7) Show speed over time by segment id.]
Figure 6.5: Applying FlowSense for a comparative study on the street speed changes between the West Village slow zone (blue) and the Alphabet City slow zone (red). FlowSense processes the rich dataflow context and allows the user to reference dataflow diagram elements at different specificity levels, e.g. with node types, node labels, or implicit references. The NL queries are executed in the numbered order.
Seeing the overall trend, the analysts move on to a comparative analysis between
the individual streets from two slow zones. They load and visualize a speed sign
installation table in a map (Figure 6.5(1)) by “Show the data in a map”. This
dataset has information on the speed limit, the geographical location, and whether
the street has speed sign installed for every road segment in Manhattan (signs are
shown as dots in the map). As the slow zones mostly have speed signs installed,
the analysts narrow down the data in the map by placing a filter on the “sign”
column (Figure 6.5(2)). The filtered map reveals two slow zone neighborhoods with
densely located signs: Alphabet City and West Village. The analysts create one
map visualization for each zone to compare the two zones. They
name the two maps by the slow zone names and select a few streets from each
zone (marked in the maps of Figure 6.5). To study the speed changes of these
selected streets, another table (named “segment monthly speed”, also known to
the analysts) that includes monthly average speed for each road segment is added
to the diagram (Figure 6.5(3)). The analysts then use the link queries to create
a sequence of nodes that extract segment IDs from the selected streets and find
their monthly average speed from the segment monthly speed table (Figure 6.5(4)).
Blue and red colors are assigned to the streets in West Village and Alphabet City
respectively to visually differentiate them (Figure 6.5(5)). The two groups of streets
are then merged by a set manipulation function (Figure 6.5(6)). Note that the query
“Merge” consists of only a single word, but it still works because the query completion
of FlowSense automatically locates the recently focused color editors as the source
nodes for this query. Finally, the two groups are rendered together in a speed series
visualization (Figure 6.5(7)), which compares the speed changes between the two
groups of streets. As the visualizations produced by FlowSense are linked, the
analysts can easily change the street selection in the maps to compare different
groups of streets. The generated visualizations are helpful to guide the analysts
towards further data analysis.
This case study demonstrates that FlowSense can be applied to a practical,
comprehensive analytical task. The analysts participating in this study think that
FlowSense is intuitive and easy to use after they understand how to work with
the VisFlow dataflow diagram to create those visualizations. They also think FlowSense
exemplifies how to build diagrams in VisFlow and is helpful to their learning of the
DFVS.
6.7.2 Diagram Reproduction Study
We carried out a diagram reproduction experiment that shows how FlowSense
can help experienced DFVS users simplify dataflow diagram construction. We
use the dataflow diagram designed in the VisFlow case study (Section 4.5.2) that
visualizes the baseball pitchers’ movements with MLB.com Statcast data as the
target diagram. We invited two participants who have good familiarity with the
VisFlow system and the dataset but are new to FlowSense. We introduced FlowSense
to both participants in a 15-minute session and walked them through the target
diagram in another 15 minutes to make sure that they understand how the target
diagram works. Then the participants were asked to reproduce the functionality
of the target diagram with FlowSense without referencing the original diagram.
Both participants were able to reproduce the diagram within 25 minutes. This
study demonstrates that an experienced DFVS user can benefit from FlowSense
in that it speeds up and simplifies dataflow diagram construction. Through post-
study conversations, we learn that the speed improvement mainly results from the
FlowSense capability of expanding the diagram at the editing focus and of operating
on multiple nodes at once (e.g., in data linking or merging). Otherwise, these
operations would require multiple drag-and-drop interactions to create several
nodes and edges.
6.8 User Study
We conduct a formal user study to evaluate the effectiveness of FlowSense together
with the VisFlow framework. We use the user study to validate whether a user
is able to smoothly apply FlowSense for dataflow diagram construction, and how
well FlowSense’s responses meet the user’s expectations. We design an experiment
that introduces VisFlow and FlowSense to the participant and assigns analytical
tasks to be solved using the system. We collect quantitative feedback from the
participants, measure the task completion time, and carry out a post-study data
analysis on the participants’ NL queries.
6.8.1 Experiment Design
We design an online experiment environment to perform the user study. Participants
join the study in a web browser on their own machines, and may ask the experiment
assistant for help and clarification via web chat or phone call during the experiment
session.
We recruited 17 participants, all aged between 20 and 30. Among them, 11 are
male and 6 are female. All of the participants work or study in the field of
computer science, and 12 have a data visualization background. 9 of the participants
are graduate students, and the other 8 are professionals (software engineer,
researcher, faculty). Three participants have prior experience with VisFlow. None
of the participants have prior knowledge of FlowSense.
The procedure of the user study is as follows:
• The participant completes a tutorial of the VisFlow dataflow framework. The
participant is asked to complete the tutorial diagram following the instructions
to demonstrate familiarity with the subset flow. (10–20 minutes)
• The participant completes a tutorial of the FlowSense natural language
interface. The participant is asked to complete the VisFlow tutorial diagram
using solely FlowSense to demonstrate familiarity with the NLI. (10–20
minutes)
• The participant freely explores and practices with both VisFlow and FlowSense.
(10 minutes)
• The participant explores an SDE Test dataset and constructs dataflow di-
agrams using VisFlow and FlowSense to answer questions about the data.
The participant is encouraged to use FlowSense as much as possible. (30–60
minutes)
• The participant takes a survey to give quantitative feedback about VisFlow
and FlowSense.
The SDE Test dataset includes the test results of software engineer candidates,
which reflect how strong a background each candidate has in computer science.
The dataset includes two tables. The first table describes the test results for each
candidate. A test consists of answering several multiple-choice questions selected
by the system from a large question pool. Each question has a unique ID, a
pre-determined difficulty, its supported programming language(s), and possibly a
time limit. If the candidate answers a question, a result (“correct” or “wrong”) is
given. Getting a question wrong incurs a negative score penalty, so a candidate
may choose to “skip” a question and receive a zero score. If a candidate takes no
action within the time limit of a question, the result is “unanswered”. The “TimeTaken”
column stores how much time in seconds a candidate took to answer a question.
The second table includes background information about each candidate, such as
the candidate’s age, field of study, and graduation date.
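The scoring rules above can be sketched as a small function. The penalty magnitude and per-question point value below are illustrative assumptions, since the text only specifies the sign of each outcome:

```python
# Sketch of the per-question scoring rules described above.
# The penalty magnitude and point value are assumed for illustration.
WRONG_PENALTY = -1  # a wrong answer incurs a negative score penalty

def question_score(result: str, points: int = 1) -> int:
    """Score a single question given its result string."""
    if result == "correct":
        return points          # full credit for a correct answer
    if result == "wrong":
        return WRONG_PENALTY   # negative penalty for a wrong answer
    # "skip" and "unanswered" both yield zero score
    return 0

def candidate_score(results: list[str]) -> int:
    """Total score over all questions a candidate saw."""
    return sum(question_score(r) for r in results)
```

Under these assumed values, a candidate who answers one question correctly and one wrongly nets zero, which is why skipping can be the rational choice for an unsure candidate.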
We give three analytical tasks about this dataset:
• Overview Task. The participant is asked to visualize the overall distribution
of the question answering results, and to find the total number of questions
that were skipped and the percentage at which a question was answered
correctly.
• Outlier Task. The participant is first asked to find an outlier user with
invalid age value (“2018”). Then the participant is asked to investigate a
data recording discrepancy in the “TimeTaken” column: some of the
“TimeTaken” values are significantly larger than the others when a question
is unanswered. This is probably due to a bug in the data collection
code.
• Comprehensive Task. The participant is asked to identify one question
that Masters candidates answer significantly better than Bachelors candidates.
This task requires comprehensive usage of VisFlow features, such as attribute
filtering, brushing, and heterogeneous table linking.
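The comprehensive task hinges on linking the two heterogeneous tables by candidate. A minimal sketch of that linking step, using hypothetical records and assumed field names (only “HighestLevelOfEducation” is named in the text), might look like:

```python
# Hypothetical records mimicking the two SDE Test tables.
# Field names other than "HighestLevelOfEducation" are assumed.
results = [
    {"candidate_id": 1, "question_id": 10, "result": "correct"},
    {"candidate_id": 1, "question_id": 11, "result": "wrong"},
    {"candidate_id": 2, "question_id": 10, "result": "correct"},
    {"candidate_id": 2, "question_id": 11, "result": "correct"},
    {"candidate_id": 3, "question_id": 10, "result": "skip"},
]
candidates = {1: "Masters", 2: "Bachelors", 3: "Masters"}  # id -> education

def correct_rate(question_id: int, education: str) -> float:
    """Fraction of a question's results that are correct, restricted to
    candidates with the given education level (the table link)."""
    rows = [r for r in results
            if r["question_id"] == question_id
            and candidates[r["candidate_id"]] == education]
    return sum(r["result"] == "correct" for r in rows) / len(rows)
```

Comparing `correct_rate(q, "Masters")` against `correct_rate(q, "Bachelors")` across all questions is the gist of the comprehensive task; in VisFlow the same join is expressed visually with a linker node rather than in code.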
All three tasks have definitive answers, ensuring that participants explore the
data and draw conclusions in a reasoned manner. Each user study session is logged
with an anonymized full diagram editing history for post-study data analysis.
6.8.2 Results
We analyze the user study results based on the quantitative feedback, the task
completion times, and the task answers. The NL queries the participants entered
are collected to analyze where the NLI meets or fails user expectations. In
particular, we manually walk through the rejected FlowSense queries one by one
to identify and categorize their reasons for failure.
6.8.2.1 Task Completion Quality
Figure 6.6 shows the correctness distribution of the participants’ answers. It
can be seen that the majority of the participants were able to come up with the
correct answers to the tasks. This demonstrates that VisFlow and FlowSense are
effective at assisting visual data exploration, and may help the users successfully
seek answers to data analysis questions.
Figure 6.7 shows the completion time distribution for each step of the user study.
Overall, the amount of time consumed is as expected, although the time participants
spent on the VisFlow tutorial and Task 3 is longer than anticipated. This shows that
users of VisFlow may need a sufficient amount of time to go through the tutorial so
as to understand the dataflow system itself. When the task involves heterogeneous
tables and interactive data filtering to find solutions, the time required to complete
the task increases. By analyzing the user comments in the feedback, we believe
this may be due to the fact that many participants are first-time VisFlow users.
For them it takes some effort to digest the concept of the subset flow model.

[Figure 6.6: Correctness distributions of the participants’ answers to each of the
user study tasks (panels: task1.count, task1.percentage, task2.user_id,
task2.timetaken, task3.question_id). “ok” represents a correct answer, “wa”
indicates a wrong answer, and “unanswered” means the participant did not find a
proper answer and skipped the task.]

In particular, table linking can be challenging to understand at first. However, after
approximately 20 minutes of system usage, most users were able to figure out how
to relate tables in VisFlow or use FlowSense for the purpose. This is reflected by
one of the feedback comments: “The operator and linker functions are confusing
at first. But after experimenting with the tool for a while and getting to know how
they work, things become easier”.
[Figure 6.7: Completion time box plot for each step of the user study
(VisFlow.Tutorial, FlowSense.Tutorial, Task1, Task2, Task3; time in minutes).
Four outliers are not shown: Task1 (2550); Task3 (109, 119, 212).]
6.8.2.2 Quantitative Feedback
We ask for feedback on six aspects each for VisFlow and FlowSense in our survey.
Each aspect is presented as a statement with a 1–5 Likert scale on which the
participant expresses agreement (5) or disagreement (1). Table 6.2 shows
the feedback for the VisFlow dataflow system. Table 6.3 shows the feedback for
the FlowSense NLI.
The quantitative feedback for VisFlow in Table 6.2 shows that the users were
able to understand the subset flow model of VisFlow. The majority of the users
agree that VisFlow presents an effective approach to visual data exploration, and
that they can successfully utilize VisFlow features in their own data exploration.
[Table 6.2: VisFlow survey results. Each statement is rated on a 1–5 Likert scale
(1 = disagree, 5 = agree); the response distributions appear as bar charts in the
original. The statements are:
• I understand the majority of VisFlow features.
• I understand the subset flow in VisFlow.
• I can follow VisFlow dataflow diagrams and understand their functionality.
• VisFlow is relatively simple to learn and use.
• VisFlow is an effective system for visual data exploration.
• I would like to use VisFlow for my future data exploration tasks.]
The quantitative feedback for FlowSense in Table 6.3 shows that most users
were able to understand the scope of FlowSense, and apply it for dataflow diagram
construction. 12/17 of the users agree (with a feedback score greater than 3)
that FlowSense simplifies diagram construction, and 10/17 agree that FlowSense
speeds up data exploration. The feedback also reveals room for improving the NLI,
as it is unclear to an average user how to revise a rejected query so that FlowSense
accepts it. It may be helpful to design an algorithm that suggests corrections or
changes to a rejected query.
[Table 6.3: FlowSense survey results. Each statement is rated on a 1–5 Likert scale
(1 = disagree, 5 = agree); the response distributions appear as bar charts in the
original. The statements are:
• I understand what queries FlowSense may accept and execute.
• The responses of FlowSense meet my expectations.
• When my query got rejected, I can figure out how to update it to let it be accepted.
• FlowSense simplifies dataflow diagram construction.
• FlowSense speeds up my data exploration.
• FlowSense helps me learn VisFlow features that I was not aware of.]
6.8.2.3 Reasons for Query Rejection
To closely study where FlowSense does not accept a query, we conduct a manual
walk-through of the rejected queries and categorize each rejected query by its
reason for rejection. Figure 6.8 lists the identified categories and their relative
difficulty to resolve.
The meaning of each category is:
• Not Implemented: The FlowSense grammar may technically support parsing
this query, yet we have not implemented the corresponding grammar rules
and their web client handler. Example queries include “change x column to
mpg”: the current system implementation does not support node option
changes triggered by the NLI. Queries in this category can be accepted by
extending the grammar and adding more rules.
• Rephrase: The user phrases the query using grammatical structures not
expected by the grammar, or uses words that do not appear in the dataset
table to describe a table entity or value. For example, in Task 3, if the user
mentions “degree”, FlowSense does not know that “degree” is equivalent to
the “HighestLevelOfEducation” column in the data. An additional knowledge
base would need to be added to the system so that the NLI can make such
concept derivations.
• Not Supported: The functionality indicated by the query is not supported
by the VisFlow dataflow framework. A query like “how many questions were
skipped” directly asks an analytical question about the dataset and exceeds
the scope of VisFlow and FlowSense. It cannot be accepted because VisFlow
has no such functionality.
• Bug: The system should be able to handle the query, but the execution
went wrong due to an implementation bug.
• Tagging Miss: A special utterance that should have been tagged was not
tagged, or one that should not have been tagged was. For example, the query
“select iris with id between 3 and 5” contains the word “iris”, which is both
a word describing the data entity and a dataset name. When FlowSense
automatically tags “iris” as a dataset name special utterance, the parser may
fail to accept the query. In this case the user may manually override the
tagging to avoid the error resulting from the parsing ambiguity.
• Invalid: The query was an invalid sentence, and cannot be understood by a
human.
• Composite: The user inputs a query that attempts to execute more VisFlow
functions than the grammar or the web client handler expects. The
grammatical structure connecting these multiple functions poses parsing
difficulty. It is recommended that composite queries be refactored into
multiple smaller steps so as not to overload the NLI with a complicated
grammatical structure that exceeds its parsing capability.
• Mistyped: The query has mistyped words.
[Figure 6.8: Number of rejected queries for each rejection reason (Not Implemented,
Rephrase, Not Supported, Bug, Tagging Miss, Invalid, Composite, Mistyped). The
colors of the bars indicate the relative difficulty (low, medium, or high) of resolving
a rejection.]
Overall, we analyzed 649 queries, out of which 421 were accepted by FlowSense,
at an acceptance rate of 64.869%. Excluding the 34 invalid and mistyped queries,
the acceptance rate was 421/(649 − 34) = 68.455%.
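The acceptance rates above follow directly from the counts; as a quick check:

```python
# Recomputing the two acceptance rates reported above from the raw counts.
total, accepted = 649, 421
invalid_or_mistyped = 34

overall = accepted / total                            # ~ 64.869%
adjusted = accepted / (total - invalid_or_mistyped)   # 421/615 ~ 68.455%

print(f"overall: {overall:.3%}, adjusted: {adjusted:.3%}")
```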
6.9 Discussions
The studies we conducted help validate the effectiveness of the FlowSense NLI.
Based on the study results, however, we identify a set of possible tasks that may
help improve the NLI design. One outstanding user experience issue is the lack of
guidance on how to correct a rejected query. Though correction suggestions are
highly desirable, it is technically challenging to derive the intermediate parsing
state of a parse tree and find out which parts of the query mismatch. Even knowing
a derivation and its parse tree, it may be difficult to generate a human-readable
sentence, because the grammar tends to be loose and may omit unimportant tokens
such as stop words. We leave the design and research of query correction suggestion
to future work.
In terms of query performance analysis, so far we have only analyzed the reasons
for the rejected queries. However, the accepted queries may not necessarily achieve
what the user wants. We may more closely analyze the diagram editing logs
collected from the user study and check what actions a user performs following
the NL queries. For example, if the user immediately issues an undo after an NL
query is executed, it clearly indicates that the outcome of the query was not desirable.
On the other hand, despite the strength and intuitiveness of the NLI, the user’s
habit of using an NLI also plays a vital role. We observe that the traditional drag-
and-drop interaction is definitive and convenient in many cases, and is often more
trusted by the user. For complex analysis tasks, a user tends to use the NLI less
frequently and to rely more on direct interactions. Given enough experience, the
user may gradually develop a sense of when and where to use the NLI so as to
utilize its strengths and avoid its weaknesses.
In this work we prefer semantic parsing over deep learning mainly because the
latter requires a large volume of training examples. Though there are benchmark
datasets for general NLP, there is not yet a training set catered to
visualization-oriented NLIs or the DFVS. In the future, with more users
working with our NLI, we would like to collect more user queries that constitute a
rich training set for text classification. With more data, we may explore alternative
algorithms for query parsing and execution. It is also possible to use the data to
improve the results of query auto-completion.
Chapter 7
Conclusions and Future Work
This dissertation presents VisFlow, a web-based dataflow framework for visual
data exploration. VisFlow uses a subset flow model that explores the possibility
of interactive visual data analysis in a dataflow context. The subset flow model
focuses on tabular data subsets and requires data immutability. The advantage
of using the subset flow model is that the dataflow may unambiguously assign
visual properties to the data items, so that subsets can be naturally brushed,
linked, and highlighted across multiple visualizations. VisFlow is a novel dataflow
framework that addresses the interactivity limitation of dataflow systems. It focuses
on enhancing the interactivity instead of supporting general computation as in
many of the other computational dataflow systems. The goal of the design is to
have a dataflow system that excels in flexibility, usability, and interactivity.
Chapter 3 introduces the concept of the subset flow model. It gives the definitions
of diagram elements and describes the mechanism of using visual properties to
visually track data subsets. A list of node categories is presented. This set of node
categories covers the majority of visual data exploration tasks that are possible in
the subset flow context. The subset flow data immutability constraints are given,
and the design philosophy behind them is discussed. We illustrate the data schema
of the subset flow model, which is equivalent to having node outputs be copies
of their inputs, with immutable original table columns but mutable visual property
columns. We show an example diagram that demonstrates the diagram concepts
and visualizes the Auto MPG dataset in multiple views.
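A minimal sketch of this output contract, with an assumed set of visual property column names, might look like:

```python
# Sketch of a subset-flow node's output schema: the output is a copy of the
# input subset; original table columns are immutable, while visual property
# columns (the set below is assumed for illustration) may be overwritten.
VISUAL_PROPS = {"color", "border", "size", "opacity"}  # assumed property set

def apply_visuals(subset: list[dict], visuals: dict) -> list[dict]:
    """Return a copy of `subset` with visual properties set, refusing any
    attempt to mutate an original data column."""
    illegal = set(visuals) - VISUAL_PROPS
    if illegal:
        raise ValueError(f"immutable data columns: {illegal}")
    # Each output row is a fresh dict: input rows are never mutated.
    return [{**row, **visuals} for row in subset]

# Rows loosely in the spirit of the Auto MPG example mentioned above.
cars = [{"name": "chevy", "mpg": 18}, {"name": "datsun", "mpg": 31}]
highlighted = apply_visuals(cars, {"color": "red"})
```

Because the original rows survive unchanged, a downstream node can always trace a highlighted subset back to the same data items, which is what makes brushing and linking across views unambiguous.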
Chapter 4 discusses the implementation details of the VisFlow framework. The
choices for the application stack are listed, with an in-depth discussion on how
component inheritance can be implemented within the VueJS frontend framework.
The user interface of the implemented framework is presented. In particular, the
VisMode dashboard allows the diagram connection details to be hidden, so that
the user may focus on result presentation or data exploration. VisMode seamlessly
switches between diagram editing and dashboard display, so as to help the user
keep track of the correspondence between the two modes. With the implemented
VisFlow framework, we work with domain experts on real-world data analysis tasks
to showcase the application of VisFlow. We demonstrate a gene regulatory network
study and a baseball pitch study. The two studies exemplify how the VisFlow
framework can be used to analyze heterogeneous datasets with multiple tables, and
to achieve a multi-view linked visualization environment for data exploration. Such
data exploration would otherwise have to be supported by a bespoke application.
Chapter 5 introduces the experimental extensions built over the subset flow
model to enhance its analytical capability. We introduce the concept of the extended
subset flow model, in which nodes are allowed to mutate their input data, and
consequently generate data mutation boundaries. It is shown that the original
subset flow may apply to groups of nodes within a same data mutation boundary,
CHAPTER 7. CONCLUSIONS AND FUTURE WORK 105
so that the interactivity advantage of the subset flow model can be preserved while
we introduce data mutating node types into the system. In particular, a script
editor node type is added to allow custom JavaScript scripting inside the dataflow.
Using JavaScript, the user can edit and generate the data in situ. With a series
player node, we demonstrate an analysis of the evacuation dataset to show how
scripting support can be used to perform data visualization that requires custom
rendering and display, e.g., drawing a floor plan. We also show two cases on
k-means algorithm visualization and model training visualization to demonstrate
more comprehensive usage of the extended subset flow model. To overcome the
limitation on iterative data analysis, a stateful data reservoir node is introduced
to allow downflow data to be sent back to the upflow for analysis iterations. The
input data of the data reservoir are released and propagated backward upon user
interaction. Such a design overcomes the limitation of an acyclic dataflow diagram
and avoids the multiple layers of nodes that would otherwise have to be created for
iterative analyses. Other data mutation nodes, such as the data transpose that
converts series data ordered by columns to series points listed in row order, can be
useful in the extended subset flow.
There could be many design variations on how to apply dataflow to interactive
data analysis and visual data exploration. In this work we propose the subset flow
model. We only present an example set of node types in our VisFlow implementation.
It is always possible to add new types on demand to enhance the subset flow.
Furthermore, under the extended subset flow, virtually any type of node can be
added, provided that it does not over-complicate dataflow usage or compromise
subset flow usability. In the future we would like to experiment with more node
types and consolidate a set of node types that best fulfills general visual data
exploration requirements in a dataflow context. We also would like to study the
impact of the extended subset flow model in terms of the user’s perception of the
dataflow. It would be interesting to analyze the usage distribution between the
subset flow that focuses on interactivity and the extended subset flow that leans
towards computation capability.
Chapter 6 covers the natural language interface FlowSense, which is integrated
into VisFlow to enhance the usability of the dataflow framework. FlowSense uses a
semantic parser to analyze and execute natural language queries. The NLI aims at
supporting dataflow diagram editing operations. We identify a set of commonly
performed tasks in VisFlow as the VisFlow functions, and make them the query
parsing targets. The grammar behind the parser is independent of the loaded
dataset or the dataflow diagram context. Instead, POS and special utterance
tagging is performed and the grammar includes utterance placeholders to accept
context-related values such as table column names and node labels. The FlowSense
grammar attempts to fill in the missing key components in a diagram editing
command by locating the current diagram editing focus, or using the default values.
The FlowSense user interface includes a rich input box with token and query
auto-completion. The query auto-completion algorithm takes a template-based
approach and matches a partial query against the templates to suggest acceptable
queries. One case study on the analysis of speed reduction in NYC is presented to
show the application of FlowSense in a practical data analysis scenario. A diagram
reproduction study is performed to demonstrate the efficiency improvement brought
by FlowSense over the traditional drag-and-drop diagram editing interactions. In
addition to the two case studies, a formal user study is performed to validate the
effectiveness of FlowSense. In the study, the participants went through VisFlow
and FlowSense training, and completed three assigned tasks with definitive answers.
We analyze the participants’ task completion time, summarize their quantitative
feedback, and identify space for improving the NLI based on where the queries
were rejected.
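The template-based matching summarized above can be sketched as a token-wise prefix match. The templates and the placeholder convention below are simplified assumptions for illustration, not FlowSense's actual grammar:

```python
# Simplified sketch of template-based query auto-completion: a partial
# query is matched token-by-token against templates, where a placeholder
# token (here spelled "$...") accepts any utterance. These templates are
# illustrative, not the real FlowSense grammar.
TEMPLATES = [
    "show $column in a histogram",
    "filter $column between $value and $value",
    "link the two tables by $column",
]

def complete(partial: str) -> list[str]:
    """Suggest templates whose token prefix matches the partial query."""
    words = partial.lower().split()
    suggestions = []
    for template in TEMPLATES:
        slots = template.split()
        if len(words) > len(slots):
            continue  # partial query is already longer than the template
        # Each typed word must equal its slot, unless the slot is a placeholder.
        if all(s.startswith("$") or s == w for w, s in zip(words, slots)):
            suggestions.append(template)
    return suggestions
```

Under this sketch, a partial query such as "show mpg" would match only the histogram template, because the placeholder slot accepts the column utterance while the literal tokens of the other templates do not match.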
Providing query correction suggestions would be a future research direction to
improve the user experience of the NLI. We may also analyze the diagram editing
logs in more detail and replay the diagram construction to identify where the
FlowSense results do not meet user expectations even when the NL queries are
accepted by the grammar. The NL queries collected from the user study may help
compose a larger training corpus of dataflow NL queries, based on which we may
explore alternative algorithms for parsing the NL input.
Appendix A
VisFlow Resources
All VisFlow resources can be found at the website https://visflow.org. The
site hosts an online demo of the dataflow framework at https://visflow.org/demo,
where we provide several demo datasets for the users to try out the system features.
The users may also create their own account to upload custom datasets to explore.
The documentation of the system is available at https://visflow.org/docs. The
documentation has a getting-started tutorial which introduces the basic concepts
and usage of the system. It also comprehensively covers all the design details,
including the definitions of the subset flow model and its elements, the supported
node types and their usage, system shortcuts, and so on. The documentation also
introduces the FlowSense natural language interface and provides example natural
language queries. The implementation source code of the VisFlow framework is
available as an open source project on GitHub: https://github.com/yubowenok/
visflow. Past codebase revisions are also available at this repository.
Bibliography
[1] R. Alexander, P. Rukshan, and S. Mahesan. Natural language web interface
for database (NLWIDB). CoRR, abs/1308.3830, 2013.
[2] M. Allahyari, S. A. Pouriyeh, M. Assefi, S. Safaei, E. D. Trippe, J. B. Gutierrez,
and K. Kochut. A brief survey of text mining: Classification, clustering and
extraction techniques. CoRR, abs/1707.02919, 2017.
[3] I. Androutsopoulos, G. D. Ritchie, and P. Thanisch. Natural language interfaces
to databases - an introduction. CoRR, cmp-lg/9503016, 1995.
[4] C. Batini, E. Nardelli, and R. Tamassia. A layout algorithm for data flow
diagrams. IEEE Trans. Software Engineering, 12(4):538–546, Apr. 1986.
[5] L. Bavoil, S. P. Callahan, C. E. Scheidegger, H. T. Vo, P. Crossno, C. T. Silva,
and J. Freire. VisTrails: Enabling interactive multiple-view visualizations. In
IEEE Visualization Conference, pages 135–142, 2005.
[6] J. Berant, A. Chou, R. Frostig, and P. Liang. Semantic parsing on Free-
base from question-answer pairs. In Empirical Methods in Natural Language
Processing (EMNLP), 2013.
[7] M. Bostock, V. Ogievetsky, and J. Heer. D3: Data-driven documents.
BIBLIOGRAPHY 110
IEEE Transactions on Visualization and Computer Graphics (InfoVis’11),
17(12):2301–2309, 2011.
[8] M. Ciofani, A. Madar, C. Galan, M. Sellars, K. Mace, F. Pauli, A. Agar-
wal, W. Huang, C. N. Parkurst, M. Muratet, K. M. Newberry, S. Meadows,
A. Greenfield, Y. Yang, P. Jain, F. K. Kirigin, C. Birchmeier, E. F. Wagner,
K. M. Murphy, R. M. Myers, R. Bonneau, and D. R. Littman. A validated
regulatory network for Th17 cell specification. Cell, 151:289–303, 2012.
[9] K. Cox, R. E. Grinter, S. L. Hibino, Lalita, J. Jagadeesan, and D. Mantilla.
A multi-modal natural language interface to an information visualisation
environment. International Journal of Speech Technology, pages 297–314,
2001.
[10] Cycling74. https://cycling74.com/.
[11] D3: Data Driven Documents. http://d3js.org.
[12] J. de Leeuw. Modern multidimensional scaling: Theory and applications
(second edition). Journal of Statistical Software, Book Reviews, 14(4):1–2, 9
2005.
[13] L. Deng. A tutorial survey of architectures, algorithms, and applications for
deep learning. APSIPA Trans. Signal and Information Processing, 3, 2014.
[14] J.-D. Fekete. The infovis toolkit. In IEEE Symposium on Information Visual-
ization, pages 167–174, 2004.
[15] C. Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, May
1998.
BIBLIOGRAPHY 111
[16] D. Foulser. IRIS Explorer: A framework for investigation. ACM SIGGRAPH
Computer Graphics, 29(2):13–16, 1995.
[17] J. Freire, C. T. Silva, S. P. Callahan, E. Santos, C. E. Scheidegger, and H. T.
Vo. Managing Rapidly-Evolving Scientific Workflows, pages 10–18. Springer
Berlin Heidelberg, 2006.
[18] C. Gane and T. Sarson. Structured Systems Analysis: Tools and Techniques.
McDonnell Douglas Systems Integration Company, 1979.
[19] T. Gao, M. Dontcheva, E. Adar, Z. Liu, and K. G. Karahalios. Datatone:
Managing ambiguity in natural language interfaces for data visualization. In
Proc. 28th Annual ACM Symposium on User Interface Software and Technology,
UIST’15, pages 489–500, 2015.
[20] Grasshopper3D. http://www.grasshopper3d.com/.
[21] S. Gratzl, N. Gehlenborg, A. Lex, H. Pfister, and M. Streit. Domino: Extract-
ing, comparing, and manipulating subsets across multiple tabular datasets.
IEEE Transactions on Visualization and Computer Graphics (InfoVis’14),
2014.
[22] J. Gurd, W. Bohm, and Y. M. Teo. Performance issues in dataflow machines.
In Future Generations Computer Systems. Elsevier Scientific, pages 285–297,
1987.
[23] P. E. Haeberli. ConMan: A visual programming language for interactive
graphics. ACM SIGGRAPH Computer Graphics, 22(4):103–111, June 1988.
BIBLIOGRAPHY 112
[24] E. Hoque, V. Setlur, M. Tory, and I. Dykeman. Applying pragmatics principles
for interaction with visual analytics. IEEE Transactions on Visualization and
Computer Graphics, 24(1):309–318, Jan 2018.
[25] IBM OpenDX. http://opendx.org/.
[26] IBM SPSS Modeler. http://www.ibm.com/software/products/en/
spss-modeler.
[27] IBM Watson Analytics. https://www.ibm.com/analytics/
watson-analytics/.
[28] W. Javed and N. Elmqvist. ExPlates: Spatializing interactive analysis to scaf-
fold visual exploration. Computer Graphics Forum (Proc. EuroVis), 32(2):441–
450, 2013.
[29] KNIME data analysis platform. http://www.knime.org/.
[30] A. Kumar, J. Aurisano, B. D. Eugenio, A. Johnson, A. Gonzalez, and J. Leigh.
Towards a dialogue system that supports rich visualizations of data. In The
17th Annual Meeting of the Special Interest Group on Discourse and Dialogue,
2016.
[31] S. Kumar, A. Kumar, P. Mitra, and G. Sundaram. System and methods for
converting speech to SQL. CoRR, abs/1308.3106, 2013.
[32] M. Lage, J. H. Ono, D. Cervone, J. Chiang, C. Dietrich, and C. Silva. Statcast
dashboard: Exploration of spatiotemporal baseball data. IEEE Computer
Graphics & Applications, 2016, to appear.
BIBLIOGRAPHY 113
[33] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and
reversals. Soviet Physics Doklady, 10:707, 1966.
[34] Y. Li, H. Yang, and H. V. Jagadish. NaLIX: A generic natural language search
environment for XML data. ACM Trans. Database Systems, 32(4), Nov. 2007.
[35] P. Liang and C. Potts. Bringing machine learning and compositional semantics
together. Annual Review of Linguistics, 1:355–376, 2014.
[36] Z. Liu, S. Navathe, and J. Stasko. Network-based visual analysis of tabular
data. In IEEE Visual Analytics Science and Technology (VAST’11), pages
41–50, Oct 2011.
[37] B. Ludascher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, E. A.
Lee, J. Tao, and Y. Zhao. Scientific workflow management and the Kepler
system. Concurrency Computat.: Pract. Exper., 18(10):1039–1065, Aug. 2006.
[38] J. Mackinlay, P. Hanrahan, and C. Stolte. Show Me: Automatic presentation for
visual analysis. IEEE Trans. Visualization and Computer Graphics, 13(6):1137–
1144, 2007.
[39] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, P. Inc, S. J. Bethard, and
D. Mcclosky. The Stanford CoreNLP natural language processing toolkit. In In
Proc. 52nd Annual Meeting of the Association for Computational Linguistics:
System Demonstrations, pages 55–60, 2014.
[40] A. Meduna. Formal Languages and Computation: Models and Their Applica-
tions. Auerbach Publications, 2014.
BIBLIOGRAPHY 114
[41] J. Meyer-Spradow, T. Ropinski, J. Mensmann, and K. Hinrichs. Voreen: A
rapid-prototyping environment for ray-casting-based volume visualizations.
IEEE Computer Graphics and Applications, 29(6):6–13, Nov 2009.
[42] Microsoft Power BI. https://powerbi.microsoft.com/.
[43] W. A. Najjar, E. A. Lee, and G. R. Gao. Advances in the dataflow
computational model. Parallel Computing, 25(13–14):1907–1929, 1999.
[44] G. Navarro. A guided tour to approximate string matching. ACM Computing
Surveys, 33(1):31–88, Mar. 2001.
[45] C. North, N. Conklin, K. Indukuri, and V. Saini. Visualization schemas and a
web-based architecture for custom multiple-view visualization of multiple-table
databases. In Information Visualization, pages 211–228, 2002.
[46] Observable. https://beta.observablehq.com/.
[47] Orange Data Mining Software. https://orange.biolab.si/.
[48] S. G. Parker and C. R. Johnson. SCIRun: A scientific programming envi-
ronment for computational steering. In Proc. ACM/IEEE Conference on
Supercomputing. ACM, 1995.
[49] S. G. Parker, D. M. Weinstein, and C. R. Johnson. The SCIRun Computational
Steering Software System. Birkhäuser Boston, 1997.
[50] P. Pasupat and P. Liang. Compositional semantic parsing on semi-structured
tables. In Proc. Annual Meeting of the Association for Computational
Linguistics, 2015.
[51] J. Poco, H. Doraiswamy, H. T. Vo, J. L. D. Comba, J. Freire, and C. T.
Silva. Exploring traffic dynamics in urban environments using vector-valued
functions. Computer Graphics Forum (Proc. EuroVis), 34(3):161–170, 2015.
[52] Project Jupyter. http://jupyter.org/.
[53] Quadrigram. http://www.quadrigram.com/.
[54] D. Ren, T. Höllerer, and X. Yuan. iVisDesigner: Expressive interactive design of
information visualizations. IEEE Transactions on Visualization and Computer
Graphics (InfoVis’14), 20(12):2092–2101, Dec 2014.
[55] J. Roberts. Waltz - an exploratory visualization tool for volume data, using
multiform abstract displays. In Visual Data Exploration and Analysis V, Proc.
SPIE, volume 3298, pages 112–122, 1998.
[56] J. C. Roberts. On encouraging coupled views for visualization exploration.
In Visual Data Exploration and Analysis VI, Proc. SPIE, volume 3643, pages
14–24, 1999.
[57] G. Ross and M. Chalmers. A visual workspace for hybrid multidimensional scal-
ing algorithms. In IEEE Symposium on Information Visualization (InfoVis’03),
pages 91–96, Oct 2003.
[58] A. Satyanarayan and J. Heer. Lyra: An interactive visualization design
environment. Computer Graphics Forum (Proc. EuroVis), 2014.
[59] A. Satyanarayan, D. Moritz, K. Wongsuphasawat, and J. Heer. Vega-Lite: A
grammar of interactive graphics. IEEE Transactions on Visualization and
Computer Graphics (Proc. InfoVis), 2017.
[60] V. Setlur, S. E. Battersby, M. Tory, R. Gossweiler, and A. X. Chang. Eviza: A
natural language interface for visual analysis. In Proc. 29th Annual Symposium
on User Interface Software and Technology, UIST’16, pages 365–377, 2016.
[61] B. Shneiderman. The eyes have it: A task by data type taxonomy for
information visualizations. In IEEE Symposium on Visual Languages, pages
336–343, 1996.
[62] Y. Sun, J. Leigh, A. Johnson, and S. Lee. Articulate: A semi-automated
model for translating natural language queries into meaningful visualizations.
In Proc. 10th International Conference on Smart Graphics, pages 184–195.
Springer-Verlag, 2010.
[63] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In
Advances in Neural Information Processing Systems, 2003.
[64] A. Telea and J. J. van Wijk. Vission: An object oriented dataflow system for
simulation and visualization. In E. Gröller, H. Löffelmann, and W. Ribarsky,
editors, Data Visualization ’99: Proceedings of the Joint EUROGRAPHICS
and IEEE TCVG Symposium on Visualization, pages 225–234, Vienna, 1999.
Springer Vienna.
[65] A. Telea and J. J. van Wijk. Smartlink: An agent for supporting dataflow
application construction. In W. C. de Leeuw and R. van Liere, editors, Data
Visualization 2000: Proceedings of the Joint EuroGraphics and IEEE TVCG
Symposium on Visualization, pages 189–198, Vienna, 2000. Springer Vienna.
[66] Tableau Software. http://www.tableausoftware.com/.
[67] ThoughtSpot. http://www.thoughtspot.com/.
[68] TLC Trip Records. http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml.
[69] C. Upson, T. Faulhaber, Jr., D. Kamins, D. Laidlaw, D. Schlegel, J. Vroom,
R. Gurwitz, and A. van Dam. The application visualization system: a
computational environment for scientific visualization. IEEE Computer Graphics
and Applications, 9(4):30–42, July 1989.
[70] B. Victor. Drawing dynamic visualizations. https://vimeo.com/66085662,
February 2013.
[71] vvvv. http://vvvv.org/.
[72] Y. Wang, J. Berant, and P. Liang. Building a semantic parser overnight. In
Association for Computational Linguistics (ACL), 2015.
[73] J. Waser, H. Ribičić, R. Fuchs, C. Hirsch, B. Schindler, G. Blöschl, and
E. Gröller. Nodes on ropes: A comprehensive data and control flow for steering
ensemble simulations. IEEE Transactions on Visualization and Computer
Graphics, 17(12):1872–1881, Dec 2011.
[74] C. Weaver. Building highly-coordinated visualizations in Improvise. In IEEE
Symposium on Information Visualization (InfoVis’04), pages 159–166, 2004.
[75] Wikipedia article: Yahoo! Pipes. https://en.wikipedia.org/wiki/Yahoo!_Pipes.
[76] Wolfram Alpha. http://www.wolframalpha.com/.
[77] K. Wolstencroft, R. Haines, D. Fellows, A. R. Williams, D. Withers, S. Owen,
S. Soiland-Reyes, I. Dunlop, A. Nenadic, P. Fisher, J. Bhagat, K. Belhajjame,
F. Bacall, A. Hardisty, A. N. de la Hidalga, M. P. B. Vargas, S. Sufi, and C. A.
Goble. The Taverna workflow suite: designing and executing workflows of web
services on the desktop, web or in the cloud. Nucleic Acids Research, pages
557–561, 2013.
[78] H. Wright, K. Brodlie, and M. Brown. The dataflow visualization pipeline as
a problem solving environment. In Proceedings of the Eurographics Workshop
on Virtual Environments and Scientific Visualization, pages 267–276, 1996.
[79] Z. Wu and M. Palmer. Verbs semantics and lexical selection. In Proceedings of
the 32nd Annual Meeting on Association for Computational Linguistics, ACL
’94, pages 133–138, Stroudsburg, PA, USA, 1994. Association for Computational
Linguistics.
[80] B. Yu, H. Doraiswamy, X. Chen, E. Miraldi, M. Arrieta-Ortiz, C. Hafemeister,
A. Madar, R. Bonneau, and C. Silva. Genotet: An interactive web-based
visual exploration framework to support validation of gene regulatory networks.
IEEE Transactions on Visualization and Computer Graphics (Proc. VAST),
20(12):1903–1912, Dec 2014.
[81] B. Yu and C. T. Silva. VisFlow - web-based visualization framework for tabular
data with a subset flow model. IEEE Transactions on Visualization and
Computer Graphics (Proc. VAST), 23(1):251–260, 2017.