THE DATA VIEWER: A PROGRAM FOR GRAPHICAL DATA ANALYSIS
by
Catherine Hurley
TECHNICAL REPORT No. 115
December 1987
Department of Statistics, GN-22
University of Washington
Seattle, Washington 98195 USA
University of Washington
Abstract
THE DATA VIEWER: A PROGRAM FOR GRAPHICAL DATA ANALYSIS
by Catherine Hurley
Chairperson of Supervisory Committee: Associate Professor Andreas Buja
Department of Statistics
This thesis describes some new graphical methods for analyzing high-dimensional data based
on motion through sequences of projections.
3-D point cloud rotations provide the canonical example of such methods. The grand tour
(Asimov 1985) is a recent attempt at devising a higher-dimensional analog. It is insufficient
for data analysis purposes because the motion paths are randomly constructed, and so do not
allow for the possibility of user control.
I propose some methods for guided tours, offering the data analyst control over the sequence
of projections. An idea from the grand tour contributes the basic building block for the new
methods: geodesic interpolation between pairs of target planes. Then, user-selected target
planes yield a guided tour.
Guided tours are implemented in the data viewer program, on a Symbolics 36xx Lisp
machine. By nature they demand a highly interactive program; I found a graphical interface
to be appropriate for controlling the moving scatterplot. Object-oriented programming
facilitates the system design, and provides a convenient mechanism for linking plots in very
general ways.
TABLE OF CONTENTS
Chapter 1: Introduction
1.1 Moving Scatterplots
1.2 Some Background
1.3 Computing Environment
1.4 Outline
Chapter 2: Data Viewer Examples
2.1 Viewing the St. Helens Data Set
2.2 The Data Viewer Window
2.3 Bivariate Scatterplots
2.4 3-D Point Cloud Rotations
2.5 Density Estimates
2.6 Linked Data Viewer Windows
Chapter 3: The Data Viewer Pipeline
3.1 Pipeline Operations
3.2 Univariate Transformations
3.3 Projections
3.4 Density Estimation
3.5 Viewporting
3.6 Multiple Pipelines
Chapter 4: Paths of Planes
4.1 Smooth Paths of Planes
4.2 3-D Rotations
4.3 The Grand Tour
4.4 Guided Tours
4.5 Interpolation between Pairs of Planes
Chapter 5: Methods for Guided Tours
5.1 Constraints on Planes
5.2 Sequences of Targets
5.3 Data Dependent Paths
Chapter 6: The User Interface
6.1 Functionality
6.2 Choosing Targets
6.3 Rotating 3-D and 4-D Point Clouds
6.4 Derived Variables
Chapter 7: Applications
7.1 Places Data
7.2 Satellite Data
7.3 Principal Components of the Satellite Data
7.3 Sphered Satellite Data
7.4 Discriminant Analysis of Places Data
Chapter 8: Designing the Data Viewer
8.1 Object-Oriented Programming
8.2 A Simple User-Model
8.3 A Detailed User-Model
8.4 Implementation
Chapter 9: Conclusion
Bibliography
LIST OF FIGURES
1.1 The Lisp Machine Screen
2.1 A data viewer window
2.2 A variable box
2.3 Bivariate scatterplot
2.4 A 3-variable subspace
2.5 3-D rotations
2.6 Density estimate
2.7 Linked data viewer windows
3.1 The pipeline
3.2 Pipelines for linked windows
6.1 A data viewer window
6.2 Changing shape of the mouse cursor
6.3 The control panel
7.1 Bivariate scatterplots of places data
7.2 Connecting bivariate scatterplots
7.3 Connecting bivariate scatterplots
7.4 Density estimate
7.5 A correlation tour
7.6 Viewing the satellite data
7.7 4-D rotations
7.8 Principal component plot of satellite data
7.9 Original variable plot
7.10 Variability of principal components
7.11 Comparing principal components
7.12 Sphering the satellite data
7.13 Sphered variable plot
7.14 Discriminant analysis of places data
8.1 Simple data viewer model
8.2 Sharing data sets
8.3 Data viewer model
8.4 Sharing data sets and projection engines
ACKNOWLEDGEMENTS
As the last surviving member of the "Irish house", let me send off a big thank
you to all my far-flown housemates, who provided me with friendship and
encouragement over the last few years.
To everyone in the statistics department, thank you. I enjoyed my stay at the U,
and I'll miss those regular trips to the "Bois"!
I would like to acknowledge the help of the Bellcore Statistics group, where I
spent a worthwhile few months. I even had a Lisp machine all of my own!
A very sincere thank you to John McDonald, whose energetic red pen drove this
thesis out the door.
Lastly, I gratefully acknowledge the help and encouragement of my long-
distance thesis advisor, Andreas Buja, who inspired much of my work on the
"data viewer". Thanks so much, Andreas!
Research under contract DE-FG06-85ER25006
CHAPTER 1
INTRODUCTION
We analyze data to obtain information. Graphical methods are a major data analysis tool;
they are invaluable for obtaining and presenting information. We use histograms and
scatterplots routinely to display univariate and bivariate data. Effective display of
multivariate data presents us with more of a challenge, because we are limited to
two-dimensional plotting surfaces.
In this thesis, I discuss some new methods for displaying multivariate data. These methods
are implemented in an interactive program I call the data viewer.
1.1. Moving Scatterplots
To display multivariate data, we usually resort to projections onto low-dimensional subspaces.
For example, we plot principal components, or the residuals from a regression analysis.
Motion graphics gives us one way to increase the information content of a display.
A moving scatterplot is a scatterplot whose points change position over time. The illusion of
a continuously moving scatterplot is created for the viewer when new scatterplots appear in
quick succession on the screen. There are two good reasons for using moving scatterplots in
exploratory data analysis:
• The information in a moving scatterplot is far more than that displayed in the sequence of
static plots. A reasonably smooth motion permits us to connect plots, that is, follow
points as their position changes through the sequence of plots.
• With motion we can perceive a third dimension. A 3-d point cloud rotation is a moving
scatterplot obtained by projecting observations onto a sequence of 2-d subspaces in $\mathbb{R}^3$.
Our visual system interprets the result as a rotating 3-d object. Even though a single
scatterplot shows only two dimensions, with sequences of scatterplots we perceive the
full 3-d structure of the point cloud.
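As a rough illustration of the second point, the sketch below builds the frames of a 3-d rotation by rotating each observation about the vertical axis and then dropping the depth coordinate. It is written in Python rather than the Lisp of the data viewer, and the names rotate_y, project_to_screen and rotation_frames are invented for this example; it is not the data viewer's own code.

```python
import math

def rotate_y(point, angle):
    """Rotate a 3-d point about the vertical (y) axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)

def project_to_screen(point):
    """Orthogonal projection onto the screen plane: drop the depth coordinate."""
    x, y, _ = point
    return (x, y)

def rotation_frames(points, n_frames, step):
    """Return one 2-d scatterplot (a list of (x, y) pairs) per frame."""
    return [[project_to_screen(rotate_y(p, k * step)) for p in points]
            for k in range(n_frames)]

# Three observations, one on each coordinate axis.
cloud = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
frames = rotation_frames(cloud, n_frames=4, step=math.pi / 2)
```

Showing the frames in quick succession, with a much smaller step, gives the moving scatterplot. The point on the rotation axis never moves, while the point that starts on the depth axis swings into view.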
Unfortunately, we cannot hope to perceive p-space for p > 3. Nevertheless, projections onto
sequences of 2-d subspaces of $\mathbb{R}^p$ have applications in data analysis. If the observations lie in
a low-dimensional manifold, for example, form clusters, curves, or surfaces, this is readily
recognized from the moving scatterplot. Motion can help to detect p-dimensional outliers. In
a static plot, each point has two coordinates. In a moving scatterplot, each point also has
velocities in the x and y directions. Therefore, motion adds another two dimensions of
information to a display.
Data exploration with the data viewer is based on moving scatterplots. More specifically, the
program aims to help the data analyst construct moving scatterplots from projections of
observations into low-dimensional subspaces.
1.2. Some Background
The use of motion graphics for data analysis has quite a long history. Both the work of
Fowlkes (1971), and the PRIM-9 system (Fisherkeller, Friedman and Tukey 1974) for 3-d
rotations, were early applications of real-time, graphical methods to data analysis. Later
systems such as PRIM-H (Donoho, Huber, Ramos and Thoma 1982), and ORION (Friedman,
McDonald and Stuetzle 1982; McDonald 1982) continued to emphasize the power of
interaction and motion for data display and interpretation. Nowadays, improvements in
hardware mean that these methods are no longer confined to expensive, special purpose
systems; interactive graphics packages for statistics are generally available, for example,
MacSpin (Donoho, Donoho and Gasko 1985). Better hardware has also encouraged the development of
new display and interaction methods for data analysis, such as scatterplot brushing (Becker
and Cleveland 1987), and the grand tour (Asimov 1985; Buja and Asimov 1985).
The data viewer program represents some of the efforts of Andreas Buja and myself in this
direction. A previous paper (Buja et al. 1987) described some of its capabilities; this thesis
focuses on additional methods for guiding projection planes through high-dimensional data
space.
1.3. Computing Environment
The data viewer program is implemented on a Symbolics 36xx Lisp machine (Symbolics
1984), a single-user graphics workstation with a high-resolution bitmap display and a mouse
for graphical input. It has processing power comparable to a VAX 780, and is fast enough for
computing and drawing scatterplots of 500 points at the rate of 15 a second.
Figure 1.1 shows a picture of the screen. The screen is organized into rectangular areas called
windows. In figure 1.1 there are two windows fully visible on the screen: an editor window
on the left, and a data viewer window labeled "St.Helens". The editor window is provided by
the Lisp machine environment, whereas the data viewer window is produced by the data
viewer program.
There are two input devices, a keyboard and a mouse. Usually, input changes the appearance
of the screen. For example, typing in text to the editor changes the contents of the editor
window. Similarly, input with the mouse changes the contents of the data viewer window.
The mouse controls the position of the (typically arrow shaped) mouse cursor on the screen.
In figure 1.1, the mouse cursor points at the scatterplot in the data viewer window. The user
clicks one of the three mouse buttons to send input to a window. Areas of the screen that
respond to mouse clicks are called mouse sensitive. When the mouse is present in a sensitive
region, a mouse documentation line at the bottom of the screen gives a brief description of the
effect of a click. Notice the mouse documentation line in figure 1.1: it says "L: move to here,
R2: system menu". With a single click on the left button, the contents of the data viewer
window change as the scatterplot begins to move.
Figure 1.1: The Lisp machine screen
The shaded areas on the screen are occupied by partially visible windows, including a file
system window on the lower right hand side, and another data viewer window. Mouse clicks
uncover these windows, in the process removing the shading. The second data viewer
window shows a second scatterplot. Since each data viewer window can only show a single
scatterplot, multiple scatterplots are displayed in individual windows.
The Symbolics Lisp machine is designed primarily for programming in Lisp. The data viewer
program is written entirely in Lisp and Flavors (Cannon 1982), a Lisp-based language for
object-oriented programming (see chapter 8).
1.4. Outline
Chapter 2 contains some examples of the plots produced by the data viewer program, and
gives an introduction to its capabilities.
Chapter 3 describes the sequence of operations (the data viewer pipeline), which are applied
to the observations to produce plots on the screen.
Chapter 4 begins with a description of existing methods for constructing sequences of
projections, namely, 3-d rotations, and the grand tour. I introduce the data viewer's paradigm
for constructing paths of planes, which potentially offers the user greater control and
flexibility than is possible with existing methods.
Chapter 5 presents some tools which aid the user in constructing paths of planes within the
framework provided by the paradigm.
In chapter 6 I describe the user interface: how the above-mentioned tools are made available
to the data analyst.
Chapter 7 gives some examples of data exploration with the program, even though it is
impossible to do it justice with static plots on paper.
Chapter 8 discusses the data viewer design, and describes a user-model for the program.
Finally, chapter 9 states my conclusions.
CHAPTER 2
DATA VIEWER EXAMPLES
In this chapter I give an introduction to the data viewer. I will show how to obtain bivariate
scatterplots, density estimates, and 3-d point cloud rotations using the program. To begin
with, here is a brief description of the St. Helens data set, which I use throughout these
examples.
2.1. Viewing the St. Helens Data Set
The St. Helens data set contains observations on earthquakes occurring in the vicinity of
Mount St. Helens, during May, 1980. Date, latitude, longitude, depth and magnitude were
recorded for each of 680 earthquakes.
I regard a data set as a collection of cases, where each case has values for a number of
variables. For the St. Helens data set, a case represents an earthquake, and the variables
are date, latitude, longitude, depth and magnitude.
    (make-DV-window :data-set st.helens ...)
make-DV-window is a function taking two named arguments, the first of which is :data-set.
Executing this code produces a data viewer window in the top right hand
corner of the screen, containing a view of the St. Helens data set. Figure 2.1 shows the
data viewer window.
Figure 2.1: A data viewer window
2.2. The Data Viewer Window
There is a fixed set of items that appear in a data viewer window. As in figure 2.1, it is
mostly occupied by a scatterplot. The scatterplot, along with the other items drawn in the
window, constitutes the data viewer's display list. The display list contains five items, which
are
(i) a scatterplot: Each data set case is represented by a plotting symbol called a glyph. In
figure 2.1, all glyphs are square-shaped, but in general they may have contrasting shapes
and colors. I refer to the part of the data viewer window containing the scatterplot as the
plot region.
(ii) a title: There is a rectangular area across the top of the window showing the data set
name, in this case "St. Helens".
(iii) variable boxes: A box for each of the variables is drawn on the window's left hand side,
with the variable's name printed across the bottom. In figure 2.1, the boxes for date
and latitude have horizontal and vertical lines drawn from their centers, telling us
that the displayed plot is a bivariate scatterplot of date and latitude. The reason
for using this, instead of the usual scheme of printing a variable's name along each plot
axis, will become apparent later.
Variable boxes also have labels, appearing on the top left hand corner. In figure 2.1, the
boxes for date and latitude have the labels X and Y. In this case, it might seem
that labels and lines give the same information. As we will see in chapter 6, labels have a
special purpose.
(iv) a control panel: This is drawn in the window's lower left corner. By clicking on various
parts of the control panel, the user controls scatterplot motion. Again, details are given in
chapter 6.
(v) a plot interaction menu: This lies next to the control panel. Different modes of plot
interaction are possible in the data viewer program. Throughout the discussion, we are
concerned with projection mode only, so in all examples the plot interaction menu shows
"PROJECTION".
2.3. Bivariate Scatterplots
In conventional programs for statistical graphics such as S (Becker and Chambers 1984), or
Minitab (Ryan, Joiner and Ryan 1976), the data analyst types in a command like
plot(x, y) to obtain a scatterplot. In contrast, the data viewer program has primarily a
graphical interface; once the window is created, most communication between the user and
program consists of mouse clicks which change the display. As an example, I describe how to
obtain a bivariate scatterplot.
Figure 2.2: A variable box
Figure 2.3: Bivariate scatterplot
When the mouse cursor moves into the circle within a variable box, its shape changes. Figure
2.2 shows a close-up of the longitude box, where the character X gives the mouse
cursor's position. With a double middle click (two clicks on the middle button in fast
succession), the display changes to show a scatterplot with longitude on the x-axis, and
latitude remaining on the y-axis. The X label and horizontal line move from the date
to the longitude box. Figure 2.3 shows the changed data viewer window. Similarly,
mouse clicks can also change the scatterplot's y-variable. Therefore, with a few clicks, the
user may produce any bivariate scatterplot almost instantaneously.
2.4. 3-D Point Cloud Rotations
Figure 2.4: A 3-variable subspace
In figure 2.4, we have picked out the 3-variable subspace consisting of latitude,
longitude and depth, by marking their respective boxes with A labels. I describe how
to use 3-d rotations to examine this subspace.
Figure 2.5: 3-D rotations
Figure 2.4 shows a bivariate scatterplot of latitude and depth. This pair of variables
is in the plane of the screen, while the third, longitude, is perpendicular to the screen.
Notice that the mouse cursor is positioned on the right hand side of the scatterplot. With a left
mouse click, the point cloud rotates towards the mouse cursor. More precisely,
the point cloud spins in the direction given by the center of the plot region and the cursor
position. A mouse click in the plot region as the points are moving stops the motion. The
next click restarts the rotation, in the direction specified by the current position of the mouse
cursor. With these controls, the user can spin a 3-d point cloud in any direction.
Figure 2.5 shows a picture of the data viewer window after some point cloud rotations. Notice
now that lines are drawn in all three latitude, longitude and depth boxes. The
lines are in fact the projections of the three coordinate axes.
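One way to read "the direction given by the center of the plot region and the cursor position" is sketched below in Python. The function name spin_towards, the sign convention (positive depth faces the viewer) and the use of plot coordinates with y pointing up are all assumptions made for this illustration; the text does not give the data viewer's actual formula.

```python
import math

def spin_towards(point, center, cursor, angle):
    """Rotate `point` (x, y, depth) by `angle` radians so that the side of
    the cloud facing the viewer (positive depth, an assumed convention)
    moves from the plot center towards the cursor."""
    dx, dy = cursor[0] - center[0], cursor[1] - center[1]
    length = math.hypot(dx, dy)
    if length == 0.0:
        return point  # cursor at the center: no direction, so no motion
    dx, dy = dx / length, dy / length
    x, y, z = point
    u = x * dx + y * dy      # in-screen component along the spin direction
    w = -x * dy + y * dx     # in-screen component along the rotation axis
    c, s = math.cos(angle), math.sin(angle)
    u_new = c * u + s * z    # rotate within the plane of (direction, depth)
    z_new = -s * u + c * z
    return (u_new * dx - w * dy, u_new * dy + w * dx, z_new)

# Cursor to the right of the center, as in figure 2.4.
front = spin_towards((0.0, 0.0, 1.0), (0.0, 0.0), (5.0, 0.0), 0.1)
on_axis = spin_towards((0.0, 1.0, 0.0), (0.0, 0.0), (5.0, 0.0), 0.1)
```

Applying the function repeatedly with a small angle gives the spinning cloud. A point on the rotation axis, which lies in the screen and is perpendicular to the cursor direction, stays where it is.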
2.5. Density Estimates
The data viewer program is not limited to showing projections onto 2-d subspaces. Displays
based on 1-d projections only, such as plots of density estimates, often highlight features of
distributions that are not evident in 2-d projections.
Figure 2.6: Density estimate
For example, figure 2.6 shows a density estimate of latitude. As in figure 2.1, the box for
this variable has a horizontal line and an X label. Since the plot shows a 1-d projection, there
is no box with a vertical line. We select other density estimates for display just as we selected
bivariate scatterplots. Pointing the mouse at the longitude box and clicking gives it the
X label, and the plot becomes a density estimate of this variable.
2.6. Linked Data Viewer Windows
All of the plots shown so far demonstrate that earthquake locations are highly concentrated, so
that it is hard to see the structure of the dense cluster. For this, separate plots of the high
density region are necessary.
Suppose the data set St. Helens-dense contains the subset of cases in the high-density
region. To view this subset separately, I construct a second data viewer window. Figure 2.7
shows two data viewer windows, for the St. Helens and St. Helens-dense data sets
respectively, as indicated by the titles. In both windows, the cases belonging to the dense
subset are drawn with square glyphs, while the remaining points have hollow circular glyphs.
By comparing the lines drawn in the variable boxes, we see that the two windows show the
same projection. This implies that the scatterplot in the lower window is a "close-up" of the
upper scatterplot.
As before, pointing the mouse cursor at the plot region in the upper window and clicking
causes the point cloud to rotate. However, this time the point cloud in the lower window also
rotates, and in the same direction. This is because the second window was constructed in a
special way, in order to link it to the existing window. Linking data viewer windows is a
useful capability of the program, and will be further discussed in succeeding chapters. In this
case, simultaneous motion of the two scatterplots permits a dynamic data set comparison,
because the second window displays throughout a close-up of the first.
Figure 2.7: Linked data viewer windows
CHAPTER 3
THE DATA VIEWER PIPELINE
A data viewer window shows a view of a data set, where each case is represented by a
scatterplot glyph. A glyph has a shape, color and a location on the bitmap, given by the x and
y screen coordinates. The data set contains cases as they were observed: in measurement
coordinates. To produce a view, a sequence of operations is performed on each case,
transforming it from measurement to screen coordinates. We call these operations the data
viewer pipeline (Buja et al. 1987).
3.1. Pipeline Operations
Figure 3.1 shows the sequence of operations in the pipeline. They are:
(1) Univariate transformation: Typically, this step transforms the variables to comparable
units, so I refer to the result as standardized coordinates.
(2) Projection: Cases in standardized coordinates are projected onto a one or two
dimensional subspace; these are the plot coordinates.
(2b) Density estimation: When projections are onto a 1-d subspace, an additional operation
estimates the density of the projection. Then, the plot y-coordinates are the estimated
density evaluated at each projected case.
(3) Viewporting: This operation determines where the scatterplot is located on the screen,
converting the plot to screen coordinates.
Figure 3.1: The pipeline (Data Set, Univariate Transformation, Projection, optional Density Estimation, Viewporting, Scatterplot)
Each pipeline operation has associated pipeline parameters. Their values determine how the
data viewer's display appears. All pipeline parameters are under user control. When the
parameters are changed, the display changes, showing another view of the data.
When a parameter changes in real time, the result is a moving scatterplot. I described an
example of this in the previous chapter; with a single mouse click the display moved through
a sequence of data projections. The current version of the data viewer program allows
real-time parameter modification for the projection and viewporting operations. Other work
(Fowlkes 1971; Becker, Cleveland and Wilks 1986) has demonstrated some applications of
real-time univariate transformations.
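The pipeline can be pictured as a composition of stages, each closed over its own parameters, so that changing one parameter and re-running the pipeline yields a new view. The Python sketch below is only a schematic under that reading: the names make_transform, make_projection, make_viewport and run_pipeline are invented here, the density estimation step is left out, and the viewport is reduced to an origin and a single scale.

```python
def make_transform(centers, scales):
    """Stage 1: affine standardization of each variable."""
    def transform(case):
        return [(v - c) / s for v, c, s in zip(case, centers, scales)]
    return transform

def make_projection(ax, ay):
    """Stage 2: projection onto the plane spanned by the vectors ax and ay."""
    def project(z):
        return (sum(a * v for a, v in zip(ax, z)),
                sum(a * v for a, v in zip(ay, z)))
    return project

def make_viewport(origin, pixels_per_unit):
    """Stage 3: plot coordinates to integer screen coordinates.
    Screen y is assumed to grow downward."""
    def viewport(p):
        return (round(origin[0] + pixels_per_unit * p[0]),
                round(origin[1] - pixels_per_unit * p[1]))
    return viewport

def run_pipeline(cases, transform, project, viewport):
    """Carry every case from measurement to screen coordinates."""
    return [viewport(project(transform(case))) for case in cases]

cases = [[10.0, 200.0, 3.0], [20.0, 100.0, 5.0]]
transform = make_transform(centers=[15.0, 150.0, 4.0], scales=[5.0, 50.0, 1.0])
viewport = make_viewport(origin=(100, 100), pixels_per_unit=50)

# Two views of the same cases: only the projection parameters differ.
view_1 = run_pipeline(cases, transform, make_projection([1, 0, 0], [0, 1, 0]), viewport)
view_2 = run_pipeline(cases, transform, make_projection([1, 0, 0], [0, 0, 1]), viewport)
```

Re-running the pipeline many times a second while a parameter changes by small steps is what produces a moving scatterplot.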
3.2. Univariate Transformations
The first pipeline operation allows for arbitrary transformations of each variable. The
operation has two steps, a non-linear transformation followed by an affine transformation.
Typically, the affine transformation is data dependent, and the non-linear transformation is
not. This is why I consider these steps separately.
Non-linear Transformations
The non-linear transformation is a useful capability because some variables are better plotted
on a log scale, for example. Each variable has a user-supplied function as an associated
pipeline parameter. This function is applied to all values of that variable. The default is the
identity function.
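A minimal Python sketch of this step follows. The name nonlinear_transform and the use of a dictionary keyed by variable index are assumptions of the example, not the data viewer's interface.

```python
import math

def identity(value):
    """Default transformation: leave the value unchanged."""
    return value

def nonlinear_transform(cases, functions=None):
    """Apply a per-variable function to every case.

    `functions` maps a variable index to a one-argument function;
    variables without an entry use the identity."""
    functions = functions or {}
    return [[functions.get(j, identity)(v) for j, v in enumerate(case)]
            for case in cases]

cases = [[1.0, 100.0], [2.0, 1000.0]]
# Put the second variable on a log scale; the first is left alone.
logged = nonlinear_transform(cases, {1: math.log10})
```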
Affine Transformations
The affine transformation is for variable standardization, and is a very important part of the
data viewer pipeline. A data set's variables typically measure unrelated quantities: one
variable could measure temperature in degrees Celsius, and another the amount of rainfall in
mls. Therefore, interpreting linear combinations of variables is difficult. Converting each
variable to standard units is the usual "solution". This operation attempts to eliminate the
problem of arbitrary measurement scales.
There is no single standardization scheme that is best for all applications. Here are some
candidate schemes:
• Standardize variables to mean zero and unit standard deviation. This is the typical
transformation employed in principal components analysis.
• Standardize variables to the same range. Equivalently, center and scale each variable by
its midrange and half-range respectively. Conventional scatterplot programs use
transformations of this form; the rectangular area allocated to a plot implicitly determines
scalings for the x and y variables. A square plot region implies that variables are scaled
to have identical max - min values.
• Use the same scaling factor for a group of variables, thus preserving their relative scales.
This is appropriate when variables are measured in comparable units.
• Perform identical standardization operations on a group of variables. An important
application is repeated measures situations. One possibility is to standardize each
variable using the joint mean and standard deviation of the group of variables.
In the data viewer, all of the above standardization schemes are supported. Each variable has
two associated pipeline parameters; they are a center and scale factor. The user can require
that a variable's center factor be the mean, median, midrange, or some arbitrary value.
Similarly, the choices for scale factor are the standard deviation, median absolute deviation,
half-range or some user-supplied value. Data based quantities may be computed over the [...]
A variable is left unstandardized if its center and scale factors are 0 and 1 respectively. The
defaults are the midrange and half-range.
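The center and scale choices listed above can be sketched in Python as follows. The function name standardize and the string keys are invented for the example, and two details are assumptions because the text does not settle them: the standard deviation is taken as the sample standard deviation, and the median absolute deviation is left unscaled.

```python
import statistics

def midrange(xs):
    return (max(xs) + min(xs)) / 2

def half_range(xs):
    return (max(xs) - min(xs)) / 2

def mad(xs):
    """Median absolute deviation about the median (unscaled; an assumption)."""
    med = statistics.median(xs)
    return statistics.median([abs(x - med) for x in xs])

CENTERS = {"mean": statistics.mean, "median": statistics.median, "midrange": midrange}
SCALES = {"sd": statistics.stdev, "mad": mad, "half-range": half_range}

def standardize(values, center="midrange", scale="half-range"):
    """Return (v - center) / scale for each value.

    `center` and `scale` are either a key naming a data based quantity
    or a number supplied by the user."""
    c = CENTERS[center](values) if isinstance(center, str) else center
    s = SCALES[scale](values) if isinstance(scale, str) else scale
    return [(v - c) / s for v in values]

depth = [2.0, 4.0, 10.0]
default = standardize(depth)                 # midrange and half-range
untouched = standardize(depth, 0, 1)         # center 0, scale 1
robust = standardize([1.0, 2.0, 3.0, 4.0, 100.0], "median", "mad")
```

With the defaults every variable ends up in the interval from -1 to 1, which matches the scaling implied by a square plot region. Standardizing a group of variables jointly would amount to computing the center and scale over the pooled values of the group and passing them in as numbers.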
3.3. Projections
The pipeline parameters for the projection operation are the one or two vectors which
determine the projection. Because the operation is determined by p parameters for each
vector, there is no obvious way to construct general, real-time controls for
manipulating projections. The bulk of my work is concerned with devising and implementing
useful methods for controlling projections. This is the subject of later chapters. For now, here
is a discussion of the kinds of projections available in the data viewer program.
Orthogonal Projections
The cases have numerical values for each of p variables, so we may regard them as vectors in
some p-dimensional vector space. There are many reasonable coordinate systems for the
cases; I have named two, the measurement and the standardized coordinates. Note these
coordinates are in possibly different vector spaces. As I discussed in the previous section,
standardized coordinates are preferable because they aim to eliminate the problem of arbitrary
measurement units. In standardized coordinate space, there is a natural identity between the
standardized variables and the canonical basis vectors $e_j$, $j = 1, \ldots, p$.
On the screen, we see projections from a vector space to a l-d or 2-d subspace. To interpret
the plots that appear on the screen, projections should be orthogonal with regard to some inner
product. Usually, the pipeline uses the canonical inner product in the standardized coordinate
space. The main motivation for this choice is that conventional bivariate scatterplots result
from orthogonal projections.
Nevertheless, this inner product is arbitrary; for example, it depends on the univariate
transformations. Other univariate transformations give different inner products, producing
quite different plots on the screen. In section 5.3, I describe applications (such as data
sphering), where other inner products are appropriate.
In what follows, all projections are orthogonal with respect to the canonical inner product in
the standardized coordinate space, unless explicitly stated otherwise.
2D-Projections
The most general kind of projection is a 2d-projection. This is an orthogonal projection of the
cases onto a 2-plane $P$. Let $z_i \in \mathbb{R}^p$, $i = 1, \ldots, n$, be the case vectors in standardized
coordinates. The plot coordinates for case $i$ are $a_x^T z_i$ and $a_y^T z_i$, where $(a_x, a_y)$ is an
orthonormal basis for $P$. In variable box $j$, the line drawn from the box center is proportional
to the projection of the $j$th standardized variable on $P$, i.e., $(a_x^T e_j, a_y^T e_j)$.
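These formulas translate almost directly into code. In the Python sketch below the function names are invented for the illustration; the one fact it relies on is that the projection of the canonical basis vector $e_j$ is simply the $j$th components of the two basis vectors.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def is_orthonormal(ax, ay, tol=1e-9):
    """Check that (ax, ay) is an orthonormal basis of a 2-plane."""
    return (abs(dot(ax, ax) - 1.0) < tol and
            abs(dot(ay, ay) - 1.0) < tol and
            abs(dot(ax, ay)) < tol)

def plot_coordinates(z, ax, ay):
    """Plot coordinates of one case z given in standardized coordinates."""
    return (dot(ax, z), dot(ay, z))

def variable_box_lines(ax, ay):
    """Line for variable box j: the projection of e_j, i.e. (ax[j], ay[j])."""
    return list(zip(ax, ay))

r = 1.0 / math.sqrt(2.0)
ax = [r, r, 0.0]       # x direction mixes the first two variables equally
ay = [0.0, 0.0, 1.0]   # y direction is the third variable
point = plot_coordinates([1.0, 1.0, 2.0], ax, ay)
lines = variable_box_lines(ax, ay)
```

Here the first two variable boxes would show equal horizontal lines and the third a vertical line, which is how the variable boxes let the viewer read off the current projection.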
XY-Projections
An xy-projection is an orthogonal projection of the case vectors onto a 2-plane, with the
restriction that the x- and y-vectors, $a_x$ and $a_y$, lie in disjoint x- and y-subspaces.
The subspaces are usually spanned by two subsets of the variables.
XY-projections are common in data analysis, and so deserve separate consideration. The
simplest example is a bivariate scatterplot. If $a_x$ and $a_y$ are $e_1$ and $e_2$ respectively, the result is
a scatterplot of $z_{i1}$ against $z_{i2}$, $i = 1, \ldots, n$. In general, xy-projections are appropriate when we
wish to explore associations between two groups of variables, as with canonical correlations
and regression data.
1D-Projections
As we have seen, the data viewer also allows plots where there is no projection vector for the
y-direction. These plots are based on a 1d-projection: an orthogonal projection of the
observation vectors onto a line.
A naive 1d-projection would result in a "dot-plot" consisting of points along a line. In
preference, the data viewer enhances 1d-projections by computing a density estimate for the
projected observations. Evaluating the density estimate at each case gives the y plot
coordinate.
3.4. Density Estimation
An additional pipeline operation computes a density estimate when projections are onto a 1-d
subspace. Whenever the projection changes, the density is recomputed, implying that
real-time modifications to projections produce moving 1d-projections on the screen.
Note that the density plot consists of a series of glyphs representing the estimate at each
projected case, rather than the usual curve. (This is why I use the term scatterplot in a general
sense to describe the result of a 2d- or 1d-projection.) Unless the projected cases are sparse,
such a plot can give a reasonably good picture of the density's shape. Unfortunately, the
available hardware is not sufficient to plot curves moving in real time.
Currently, the form of the density estimate used in the pipeline is fixed; it is a frequency
polygon average shifted histogram (Scott 1985). This uses a histogram smoothed with
(weighted) running averages. There are two associated pipeline parameters: the number of
histogram bins, and a smoothing parameter, the number of bins used for the running average.
A smoothing parameter of 1 gives the usual histogram. Linear interpolation between bin
centers provides a density estimate at intermediate points.
My reasons for using this particular density estimate are as follows.
(i) It retains the computational efficiency of the histogram, and so depends linearly on the
number of data points. Other popular density estimators are slower because they require
that points be sorted. Efficiency is relevant because displaying moving 1d-projections
demands that the density be re-estimated for every new plot.
(ii) In contrast to histograms, this density estimate does not have the usual lack of
smoothness caused by binning. Particularly for motion, it is important that the density
estimate be smooth. A slight modification in the 1d-projection causes discrete jumps in
the histogram bin counts, but only small changes in level for smooth density estimates.
Jumps distract us from observing how small changes to the projection affect the marginal
distribution. Smooth changes enable us to see, for example, how the distribution
becomes more skewed as the projection changes.
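The estimate described in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the data viewer's code: the triangular weights $1 - |i|/m$ are the usual choice for an average shifted histogram but the text only says "weighted", the padding of the grid with empty bins is a convenience of the example, and the names ash_density and density_at are invented.

```python
def ash_density(values, n_bins, m, lo, hi):
    """Average shifted histogram on n_bins bins over [lo, hi].

    m is the smoothing parameter, the number of bins in the running
    average; m = 1 reproduces the ordinary histogram. Every value is
    assumed to lie in [lo, hi]. The grid is padded with m empty bins on
    each side so the estimate falls to zero at both ends.
    Returns (bin centers, density at each center)."""
    n = len(values)
    delta = (hi - lo) / n_bins
    counts = [0] * (n_bins + 2 * m)
    for v in values:                      # one pass over the data, no sorting
        k = min(int((v - lo) / delta), n_bins - 1)  # v == hi joins the last bin
        counts[k + m] += 1
    density = []
    for k in range(len(counts)):
        total = 0.0
        for i in range(-(m - 1), m):
            if 0 <= k + i < len(counts):
                total += (1 - abs(i) / m) * counts[k + i]  # triangular weights
        density.append(total / (n * m * delta))
    centers = [lo + (k - m + 0.5) * delta for k in range(len(counts))]
    return centers, density

def density_at(x, centers, density):
    """Frequency polygon: linear interpolation between bin centers."""
    if x <= centers[0] or x >= centers[-1]:
        return 0.0
    delta = centers[1] - centers[0]
    t = (x - centers[0]) / delta
    k = min(int(t), len(centers) - 2)
    return density[k] + (t - k) * (density[k + 1] - density[k])

projected = [0.5, 1.5, 1.5, 2.5]
centers_1, hist = ash_density(projected, n_bins=3, m=1, lo=0.0, hi=3.0)
centers_2, smooth = ash_density(projected, n_bins=3, m=2, lo=0.0, hi=3.0)
# y plot coordinate of each case: the estimate evaluated at that case
y_coords = [density_at(x, centers_2, smooth) for x in projected]
```

The binning pass is linear in the number of cases and the smoothing pass depends only on the number of bins, which is the efficiency argument made in (i).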
3.5. Viewporting
The viewport operation determines where the scatterplot is located on the screen. We require
that the transformation from plot to screen coordinates does not depend on the projection;
otherwise, size comparisons across projections are not possible.
The viewport transformation maps a rectangular region in plot coordinate space into the
rectangular region of the data viewer window designated for the scatterplot. In most
scatterplot drawing programs, the mapping is determined by the restriction that all points fit
into the rectangle on the screen. This restriction is not feasible for plotting projections of
high-dimensional data; most plots would occupy just a small portion of the available plot
region. The alternative allows cases to have screen coordinates falling outside the plot region.
The glyphs for these cases are clipped from view, that is, not drawn. By modifying the
viewport transformation, glyphs may be returned to view.
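As a minimal sketch of such a fixed, projection-independent mapping (the function name and the rectangle layout `(xmin, ymin, xmax, ymax)` are assumptions made for illustration, not the data viewer's interface):

```python
def make_viewport(plot_rect, screen_rect):
    """Return a function mapping plot coordinates to screen coordinates.

    Both rectangles are (xmin, ymin, xmax, ymax). The mapping depends only
    on the two rectangles, never on the data or the current projection.
    """
    px0, py0, px1, py1 = plot_rect
    sx0, sy0, sx1, sy1 = screen_rect
    kx = (sx1 - sx0) / (px1 - px0)
    ky = (sy1 - sy0) / (py1 - py0)

    def to_screen(x, y):
        sx = sx0 + kx * (x - px0)
        sy = sy0 + ky * (y - py0)
        # A case falling outside the plot region is clipped, i.e. not drawn.
        visible = sx0 <= sx <= sx1 and sy0 <= sy <= sy1
        return sx, sy, visible

    return to_screen
```

Enlarging the plot rectangle, say from (-1, -1, 1, 1) to (-2, -2, 2, 2), brings a clipped case at (2, 0) back into view.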
The data viewer program supports real-time modification of the viewporting parameters.
Briefly, the user may shift or scale the scatterplot that appears on the screen. Viewport
operations are more than just a convenience; they also have some interesting applications to
data analysis. For a full discussion with illustrations see Buja et al (1987).
3.6. Multiple Pipelines
A data viewer window can only show one scatterplot at a time. To display multiple
scatterplots simultaneously, we use multiple data viewer windows. Each window has its own
pipeline for transforming the observations to screen coordinates. There are two main reasons
for looking at multiple scatterplots: (i) for multiple views of a single data set and (ii) for views
of different data sets. Both reasons require that the data viewer windows be linked, that is, use
the same cases or pipeline operations. Windows linked by projection, for example, permit
dynamic comparisons; we could watch two data sets as they undergo the same sequence of
projections. Section 2.6 gave an example.
Linked Data Viewer Windows
Each window may show the result of applying different pipeline transformations to the same
cases, giving multiple views of a data set. We could compare the effects of different
univariate transformations, projections, density estimates, or different plot scalings.
Comparisons are far easier when the pipelines are closely related, say, differ by one operation
only.
For example, suppose there are two data viewer windows, each with views of the
St. Helens data set. One shows a bivariate scatterplot of latitude and longitude, the other
[...] projection operation, the two sequences
of pipeline operations should be the same.
When displaying different data sets in multiple windows, similar pipeline operations help us
compare the data sets. If each data viewer window applies the same sequence of pipeline
operations, the difference between their scatterplots can only be due to the difference between
the data sets. Graphical cross-validation is a potential application. By splitting a data set in
two, findings from the exploration of one half can be checked for consistency with the other.
In this way, one can allay suspicions that the structure discovered is an artifact of the
exploration procedure.
Let me clarify what "the same" means for each of the four pipeline operations.
(1) Univariate transformations are the same when the pipeline parameters for this pipeline
element are identical for all the variables. When the data viewer windows show the
same data set, or different data sets with common variables, it may be convenient to
use the same univariate transformations in the pipelines. This implies that cases are
represented in just one standardized coordinate space.
(2) Suppose $(a_x, a_y)$ is an orthonormal basis for a 2-plane $P$. The projection operations
are the same when the plot coordinates for the pipelines are $a_x^T z$, $a_y^T z$, where $z$ is any
case vector. Note that the notion of "sameness" for projections makes precise sense
only when the same inner products are used. In the data viewer, it falls on the user to
supply univariate transformations that guarantee equivalent inner products.
(3) So far, the data viewer pipeline has just one possible density estimation procedure.
The density estimation operations are the same when both the number of bins and
smoothing parameter are identical.
(4) Viewport operations are the same when they map the same rectangle in plot coordinate
space to rectangles of the same size in screen space. (For simplicity, we assume that
the data viewer windows, and therefore plot regions, are the same size.)
Pipelines for Linked Data Viewer Windows
Consider the example of section 2.6, where two data viewer windows were linked by
projection. Figure 3.2 shows a representation of their pipelines. Notice how the pipelines use
the same data set and projection operation. Observations from the St. Helens data set are
plotted in both windows; all cases appear in window-1, and the subset belonging to
St. Helens-dense in window-2. Also, both windows show the same projection; this is
how we obtain a dynamic comparison of the data sets.
[Figure 3.2 diagram; box labels: Univariate Transformation, Projection, Density Estimation (dashed, twice), Viewporting (twice), Scatterplot 1, Scatterplot 2]
Figure 3.2: Pipelines for linked windows
The purpose of the second window is to show a close-up of the high-density cluster. As a first
attempt at obtaining a close-up, one might think of modifying the viewporting operation. This
suffices in the special case where the cluster is located at the origin in standardized coordinate
space. In general, viewporting operations are good for showing a cluster blow-up in a single
projection only, not in the moving sequence.
In the example of section 2.6, the close-up plots are due to the different univariate
transformations of the second pipeline. The pipelines of windows 1 and 2 use the default
univariate transformation procedure, so their variables are standardized to a fixed range across
the entire data set and subset respectively. However, this implies that projections in the two
windows use different notions of orthogonality. Figure 2.7 shows some evidence of this; the
upper plot shows two dense rods forming an "L" shape, whereas the angle between the rods
appears more acute in the lower plot. Therefore, it is not quite accurate to say that the
scatterplot in window-2 is a close-up of that in window-1. This would require a sophisticated
linking of the univariate transformation operations, which scaled the cases in the dense subset
in proportion to their scales for the entire data.
CHAPTER 4
PATHS OF PLANES
Data exploration with the data viewer is based on displaying sequences of projections.
Smoothness is an essential requirement for the sequences. I describe some existing methods
for forming paths of projection planes, namely, 3-d rotations, and the grand tour. These are
available in the data viewer program, but as we will see, they are not enough to meet the goal
of user-controlled moving scatterplots. I introduce the data viewer's paradigm for
constructing paths of planes, which potentially offers the user greater control and flexibility
than is possible with existing methods.
4.1. Smooth Paths of Planes
A crucial requirement is that the scatterplot appears to move smoothly, that is, the position of
each point does not change much from plot to plot. With smooth motion, we can watch how
the position of each point changes through the sequence of plots. It is due to this ability that
we "see" 3-d point clouds on the screen. A smoothly moving scatterplot is far more
informative than a sequence of disconnected plots. Besides losing information, lack of
smoothness is unpleasant and disorienting for the user. At worst, successive plots are totally
unrelated, causing flashing clouds of points to appear on the screen.
The smoothness requirement places restrictions on the sequences of planes.
Let $\{P_t\}$ denote the path of planes, where $t$ represents time. The display shows projections
onto $P_{k\delta}$, $k = 1, 2, \ldots$, for some increment $\delta$, yielding a smoothly moving scatterplot when
successive planes do not differ very much. For this reason, we require that the planes $\{P_t\}$
change smoothly, in some appropriate metric. As long as the increment $\delta$ is small, the user
perceives a smoothly moving scatterplot.
4.2. 3-D Rotations
The canonical example of moving scatterplots is 3-d point cloud rotations. We can
characterize 3-d rotations by paths of planes, as in the following example.
Let $(a_x(t), a_y(t))$ be an orthonormal basis for $P_t$, where the vectors $a_x(t)$, $a_y(t)$ correspond
to the x and y directions respectively.
The path of planes $P_t$ given by $a_x(t) = \cos t \, e_1 + \sin t \, e_3$, $a_y(t) = e_2$ is a 3-d rotation in the
$e_1$-$e_3$ plane.
This gives a point cloud rotation around the y-axis. At time $t = 0$, the scatterplot has $e_1$ in the
x-direction, and $e_2$ in the y-direction. When $t$ reaches 90 degrees, $e_3$ replaces $e_1$ in the
x-direction, with $e_2$ remaining in the y-direction.
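The rotation path above is easy to check numerically. The following Python sketch (the function names are hypothetical) builds the basis for a given t in radians and projects a 3-variable case onto it.

```python
import math


def rotation_basis(t):
    """Basis (a_x, a_y) at time t: a_x = cos t e1 + sin t e3, a_y = e2."""
    a_x = (math.cos(t), 0.0, math.sin(t))
    a_y = (0.0, 1.0, 0.0)
    return a_x, a_y


def project(case, t):
    """Plot coordinates (a_x . z, a_y . z) of a 3-variable case z at time t."""
    a_x, a_y = rotation_basis(t)
    x = sum(a * z for a, z in zip(a_x, case))
    y = sum(a * z for a, z in zip(a_y, case))
    return x, y
```

For the case (1, 2, 3), the plot coordinates are (1, 2) at t = 0 and, up to rounding, (3, 2) at t = pi/2: the third variable has replaced the first in the x-direction while the y-coordinate is unchanged.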
3-d rotations of variable subspaces are easy to use and interpret. However, as I explain below,
they alone are too restrictive for the data viewer.
• More general subspaces: 3-d rotations need not be limited to subspaces spanned by
three variables. Mechanisms for selecting more general 3-d subspaces would be useful,
for example, the space spanned by the first three principal components.
• Higher dimensional paths: Even so, paths of planes in arbitrary 3-d subspaces do not
suffice. Later, I describe examples where paths in a 4-d subspace are appropriate. 3-d
rotations are degenerate in that coordinates for one direction (the y-direction in the above
example) are held fixed. More information is displayed when points move
simultaneously around both axes.
4.3. The Grand Tour
There are a number of so-called grand tour methods (Asimov 1985) for constructing paths of
planes. Such methods have played an important part in the development of the data viewer.
Earlier versions (Buja et al 1987) of the program relied entirely on grand tour methods for
producing paths of planes. In addition, the new methods I will describe are based on ideas
from the grand tour.
A grand tour is defined as follows (Asimov 1985):
A grand tour is a sequence of 2-planes $\{P_i, i = 1, 2, \ldots\}$,
which is dense in the space of all 2-planes in $R^p$.
The space of 2-planes in $R^p$ is termed a Grassmann manifold, denoted by $G_{2,p}$. One choice
of metric on $G_{2,p}$ is the squint angle, that is, the larger of the two principal angles $\theta_x$, $\theta_y$
between two 2-planes. Other choices are the $L_r$ metrics, $0 < r < \infty$, given by $(\theta_x^r + \theta_y^r)^{1/r}$.
All of the above mentioned metrics induce the same topology on $G_{2,p}$ (Buja and Asimov
1985).
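For concreteness, the two kinds of distance can be written as one-line Python functions of the principal angles. This is only a sketch: the function names are hypothetical, and the principal angles are taken as given rather than computed from the planes.

```python
def squint_angle(theta_x, theta_y):
    """The larger of the two principal angles (in radians)."""
    return max(theta_x, theta_y)


def lr_metric(theta_x, theta_y, r):
    """The L_r distance (theta_x^r + theta_y^r)^(1/r) for r > 0."""
    return (theta_x ** r + theta_y ** r) ** (1.0 / r)
```

For principal angles 0.3 and 0.4, the L_2 distance is 0.5 and the squint angle is 0.4. As r grows, the L_r value approaches the squint angle.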
Basically, the grand tour is designed to come arbitrarily close (eventually) to all possible
2d-projections of the data. Note that smoothness is not part of the definition. This is because the
grand tour was invented with applications other than motion graphics in mind. However, we
do require smoothness for the particular grand tour paths used by the data viewer.
A Grand Tour Algorithm
There is a range of different schemes which produce grand tours (Asimov 1985; Buja and
Asimov 1985). I describe the particular grand tour used by the data viewer.
A grand tour may be obtained by interpolating between consecutive
elements of a randomly sampled sequence of planes.
This grand tour scheme is of particular interest because it introduces the important idea of
interpolating between pairs of planes, which contributes a useful building block for new
methods.
(The interpolation paths used are geodesics on $G_{2,p}$. I will say more on this subject later.)
Data Exploration with the Grand Tour
By definition, any grand tour produces a path of planes which is dense in the set of all possible
planes. Denseness implies that after enough time, a grand tour path comes close to any given
2-plane. Unfortunately, enough time can be far too long. In dimensions bigger than four, we
cannot rely on the grand tour to provide us with informative projections. For example, in six
dimensions, it takes a minimum of 6,000,000 planes to get within 10 degrees of all possible
planes (Asimov 1985). At the rate of 10 planes a second, this is 160 hours of viewing time!
In the data viewer implementation, the grand tour produces a smoothly moving scatterplot.
The above figures neglect to account for additional information yielded by smooth motion.
Therefore, they do not accurately reflect the amount of time necessary to uncover a particular
feature of the data.
Just the same, I conclude that the grand tour does not provide a stand-alone method for
exploring high-dimensional data, the goal of displaying all 2d-projections being rather too
ambitious for practical purposes. Realistically, the grand tour is useful for scanning
2d-projections: watching the moving scatterplot for a few minutes in the hope of seeing
something interesting.
4.4. Guided Tours
The problem with grand tour paths is that they are entirely independent of the data itself, and
the requirements of the user. Effective use of motion demands guided tours, that is,
user-controlled paths of planes. With guided tours, the user can customize the moving scatterplot
so that it is potentially more informative.
In the data viewer, I use one general paradigm to produce guided tours. A single paradigm is
convenient because there are fewer ideas for the user to grasp. Besides, it simplifies the user
interface, program design and implementation. Here is the paradigm I have chosen:
The data viewer produces smooth paths of planes by interpolating between
consecutive elements of a user-determined sequence of planes.
This scheme is similar to the grand tour algorithm, utilizing interpolation between pairs of
planes. The crucial difference is that a user-determined sequence of planes has been
substituted for a sequence of random planes, so that the resulting path of planes is controlled
by the user.
I use this paradigm for guided tours because
• it is straightforward: ideally, the user can specify any sequence of planes, and the smooth
path is supplied automatically.
• it is sufficiently general: I will demonstrate how the paradigm covers many useful ways
of producing moving scatterplots, including 3-d rotations on general subspaces.
With this paradigm, the kind of moving scatterplots produced by the data viewer depends on
the interpolation paths, and on the user selected planes. Methods for choosing planes are the
subject of the following chapters. Currently, only one interpolation scheme is available.
4.5. Interpolation between Pairs of Planes
Following the proposal of Buja and Asimov (1985) for the grand tour, I use interpolation
paths which are geodesics on $G_{2,p}$ (for any of the $L_r$ metrics, $1 < r < \infty$). Some motivation for
this choice is given in the succeeding section. For a formal description of geodesics on
Grassmann manifolds, see Wong (1967).
Geodesic Paths on the Unit Sphere
I also use geodesic paths for interpolating between 1d-projections. In this case, geodesics
follow a great circle route on a unit sphere. Suppose two 1d-projections are described by the
points $u_1$ and $u_2$ on the unit sphere in $R^p$. Let $\theta$ be the angle between $u_1$ and $u_2$, and $u_2^*$ be
the unit vector in the direction of $u_2$ orthogonalized with regard to $u_1$ (obtained by a
Gram-Schmidt step). Then, one such path is $u(t) = \cos t \, u_1 + \sin t \, u_2^*$, for $0 \le t \le \theta$.
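The great circle path can be sketched directly from this description. The Python below is an illustration under stated assumptions, not the data viewer's implementation: the name `sphere_geodesic` is hypothetical, and the inputs are assumed to be unit vectors that are not parallel.

```python
import math


def sphere_geodesic(u1, u2):
    """Great circle path from unit vector u1 to unit vector u2.

    Returns (theta, u) where theta is the angle between u1 and u2 and
    u(t) = cos t u1 + sin t u2*, so that u(0) = u1 and u(theta) = u2.
    """
    c = max(-1.0, min(1.0, sum(a * b for a, b in zip(u1, u2))))
    theta = math.acos(c)
    # Gram-Schmidt step: remove from u2 its component along u1.
    resid = [b - c * a for a, b in zip(u1, u2)]
    n = math.sqrt(sum(x * x for x in resid))
    if n < 1e-12:
        raise ValueError("u1 and u2 are parallel; no unique great circle")
    u2_star = [x / n for x in resid]

    def u(t):
        return [math.cos(t) * a + math.sin(t) * b for a, b in zip(u1, u2_star)]

    return theta, u
```

Stepping t from 0 to theta in small increments gives the sequence of 1d-projection directions to display.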
Geodesic Paths on $G_{2,p}$
Let $Q_x$, $Q_y$ be two orthogonal 2-planes in $G_{2,p}$. Then, for any pair of unit vectors $u(t) \in Q_x$,
$v(t) \in Q_y$, $t \ge 0$, each rotating at a fixed speed, a geodesic path on $G_{2,p}$ is given by $P(t)$, $t \ge 0$,
where

    $P(t) = \mathrm{span}(u(t), v(t))$.    (4.1)

For obvious reasons, we call $Q_x$ and $Q_y$ the rotation planes.
We need to construct a geodesic path between two planes $P_1$ and $P_2$.
Take the simple case where $P_1$ and $P_2$ are orthogonal 2-planes. For example, $P_1$ and $P_2$ are
spanned by the pairs of vectors $(e_1, e_2)$ and $(e_3, e_4)$ respectively. For $0 \le t \le \pi/2$, define a plane
$P(t)$ to be the span of $(\cos t \, e_1 + \sin t \, e_3, \; \cos t \, e_2 + \sin t \, e_4)$. The planes $P(t)$ form a geodesic
path from $P_1$ to $P_2$, with $P(0) = P_1$ and $P(\pi/2) = P_2$.
In general, the geodesic paths between two 2-planes correspond to simultaneous interpolations
of the principal angles:
Let $(u_1, v_1)$, $(u_2, v_2)$ be the principal vectors and $\theta_x \ge \theta_y$ ($\ge 0$) the corresponding principal
angles, for the pair of planes $P_1$, $P_2$ (see Golub & Van Loan 1983). $(u_1, v_1)$, $(u_2, v_2)$ are
orthonormal bases for $P_1$ and $P_2$ respectively, and the following relationships hold:

    $u_1^T u_2 = \cos\theta_x$,   $v_1^T v_2 = \cos\theta_y$,
    $u_1^T v_2 = v_1^T u_2 = 0$.

Let $u_2^*$ be the unit vector along the part of $u_2$ which is orthogonal to $u_1$. Similarly, $v_2^*$ is the
unit vector along the part of $v_2$ which is orthogonal to $v_1$. Define planes $P(t)$ as the span of
$(u(t), v(t))$, for $0 \le t \le 1$, where

    $u(t) = \cos(\theta_x t) \, u_1 + \sin(\theta_x t) \, u_2^*$,
    $v(t) = \cos(\theta_y t) \, v_1 + \sin(\theta_y t) \, v_2^*$.    (4.2)

The planes $P(t)$ form a geodesic path from $P_1$ to $P_2$, with $P(0) = P_1$ and $P(1) = P_2$.
The planes spanned by $(u_1, u_2)$, $(v_1, v_2)$ are the rotation planes, and the angles $\theta_x$ and $\theta_y$ are
the speeds of rotation. Notice that this path of planes gives a 3-d rotation when one of the
speeds is zero.
Properties of Geodesic Paths
Geodesic interpolation paths have the following properties.
• Smooth motion
Geodesic paths are smooth in an abstract sense, which implies visual smoothness. Paths
of planes constructed using this interpolation scheme lack smoothness only at the
endpoints of geodesic segments.
• Generalizes 3-d rotations
The sequence of data projections is locally within a 4-d subspace, because the path
segment connecting two planes is entirely determined by those planes. When the two
planes span a 3-d subspace, the sequence of data projections is a 3-d point cloud rotation.
In this sense, geodesic plane interpolation provides a consistent generalization of 3-d
rotations to four dimensions. Projections onto the path segments have a convenient
interpretation as point clouds rotating around a moving axis.
• Computational efficiency
The 4-d nature of the interpolation paths gives a large computational saving; it is
unnecessary to project every case from $R^p$ to $R^2$ for every new scatterplot. Instead, for
each geodesic segment, cases are projected onto a 4-d subspace. Following this, drawing
the scatterplot requires a projection from $R^4$ to $R^2$. Except for the occasional projection
from $R^p$ to $R^4$, the computational effort depends only on the number of cases, not the
number of variables.
It is possible that paths confined to 4-d subspaces place unnecessary restrictions on the
information content of the sequence of projections. However, the benefits of the
restriction are interpretability and computational efficiency.
• No within-screen spin
Within-screen spin occurs when the scatterplot rotates in the plane of the screen. For
purposes of displaying 2d-projections, within-screen spin is wasteful and confusing, the
reason being that plots differing by orientation contain the same information. (This is
only approximately true, since human vision is not symmetric with respect to
orientation.) Moving scatterplots obtained from geodesic interpolation do not rotate in
the plane, because the position and velocity planes are orthogonal.
CHAPTER 5
METHODS FOR GUIDED TOURS
Once again, here is the data viewer's paradigm for guided tours:
The data viewer produces smooth paths of planes by interpolating between
consecutive elements of a user-determined sequence of planes.
This paradigm provides a basis for successful data exploration only if the user can specify a
sequence of planes quickly and easily. The simple solution of typing in lists of 2p numbers
for each plane is slow and tedious, and would detract from the impact of real-time motion. In
this chapter, I describe some ideas which make fast plane selection possible in the data viewer
program.
I use the notation $\{T_k, k = 1, 2, \ldots\}$ for the sequence of user-selected planes, and reserve
$\{P_i, i = 1, 2, \ldots\}$ for the denser sequence of interpolating planes. Since motion always proceeds
towards a plane $T_k$, I call these the target planes. Target selection in the data viewer is based
on four ideas.
• Rather than precisely picking a target, the user imposes constraints. Then, the target is
required to satisfy these constraints.
• The program provides a variety of schemes for supplying sequences of targets.
Typically, the targets are subject to the user-imposed constraints.
• In chapter 2, we saw how the user could control a rotating 3-d point cloud. Similarly, the
user can control 4-d point clouds.
• Controls for the user alone are not enough; data dependent paths are necessary to pick
out interesting features of the data.
As we will see, much of the data viewer's power derives from the fact that these are
cooperating rather than competing tools for constructing guided tours.
5.1. Constraints on Planes
The most basic kind of constraint requires that a plane be orthogonal to some of the vectors
$e_j$, $j = 1, \ldots, p$, representing the variables. Such constraints enable a user to temporarily set
aside some subset of variables and examine a smaller subspace. With the data viewer
paradigm for paths of planes, constraints need only be imposed on the targets. When each
member of the target sequence satisfies the same orthogonality constraints, the geodesic
interpolation paths give intermediate planes also satisfying the constraints.
Constraints are set up in the following manner.
Each variable is classified as active or inactive. Active variables are further classified as A, X
or Y. The constraints depend on the classifications. Let $a_x$ and $a_y$ denote the x and y vectors
for the target plane. Then, if variable $j$ is:
(i) an A-variable, the target is unconstrained with regard to $e_j$
(ii) an X-variable, then $a_y$ is orthogonal to $e_j$
(iii) a Y-variable, then $a_x$ is orthogonal to $e_j$
(iv) inactive, the target (i.e. $a_x$ and $a_y$) is orthogonal to $e_j$
With this set of constraints, it is easy to request either a 2d-projection, xy-projection, or
1d-projection.
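Since orthogonality to $e_j$ just means a zero j-th component, the four rules above amount to a simple check. The Python sketch below is illustrative only; the function name and the string labels 'A', 'X', 'Y', 'inactive' are assumptions, not the data viewer's representation.

```python
def satisfies_constraints(a_x, a_y, classes, tol=1e-9):
    """Check a target (a_x, a_y) against the variable classifications.

    classes[j] is one of 'A', 'X', 'Y' or 'inactive' for variable j.
    """
    for j, c in enumerate(classes):
        if c == 'X' and abs(a_y[j]) > tol:        # a_y orthogonal to e_j
            return False
        if c == 'Y' and abs(a_x[j]) > tol:        # a_x orthogonal to e_j
            return False
        if c == 'inactive' and (abs(a_x[j]) > tol or abs(a_y[j]) > tol):
            return False
    return True
```

For example, with classes X, Y, inactive, the bivariate scatterplot target $a_x = e_1$, $a_y = e_2$ passes, while swapping the two vectors fails.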
2D-Projections
When all active variables are A-variables, projection onto the target plane gives a
2d-projection from the subspace spanned by the active variables. With only three A-variables,
we obtain 3-d rotations.
One may use prior knowledge about the data at hand to reduce the dimensionality of the space
for exploration. By choosing successive targets randomly subject to the constraints, the result
is a grand tour of the smaller subspace. In light of the huge numbers of planes required for a
"complete" grand tour, this variation is a necessity for its practical application to data sets
with more than four variables.
XY-Projections
Recall that an xy-projection is a specialization of the 2d-projection, where the x- and
y-vectors are restricted to disjoint subspaces. Suppose that the active variables consist of $q$
X-variables and $s$ Y-variables. Then a plane satisfying the constraints yields an xy-projection.
In the special case where $q = s = 1$, the result is a bivariate scatterplot of the X- and
Y-variables.
If successive targets are randomly obtained subject to the constraints, we obtain a so-called
correlation tour. Note that this is equivalent to sampling a sequence of unit vectors from the
uniform distribution on the unit sphere in $R^q$, and a second sequence from the unit sphere in
$R^s$. Like the grand tour algorithm, this scheme for correlation tours is due to Buja and
Asimov (1985). A correlation tour scans 1d-projections of the X-variables simultaneously
with 1d-projections of the Y-variables. It can expose relationships between two groups of
variables, providing an exploratory alternative to regression and canonical correlation
analysis: hence the name correlation tour.
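One way to draw such a random target is sketched below in Python. It relies on the standard fact that a vector of independent standard normal values, scaled to unit length, is uniformly distributed on the sphere. The function names and the 'X'/'Y' labels are hypothetical, and the sketch assumes at least one X-variable and one Y-variable.

```python
import math
import random


def random_unit_vector(q, rng):
    """Uniform draw from the unit sphere in R^q via normalized Gaussians."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(q)]
        n = math.sqrt(sum(x * x for x in g))
        if n > 0.0:
            return [x / n for x in g]


def correlation_tour_target(classes, rng):
    """Random target (a_x, a_y): a_x lives on the X-variables only,
    a_y on the Y-variables only; all other components are zero."""
    x_idx = [j for j, c in enumerate(classes) if c == 'X']
    y_idx = [j for j, c in enumerate(classes) if c == 'Y']
    a_x = [0.0] * len(classes)
    a_y = [0.0] * len(classes)
    for j, val in zip(x_idx, random_unit_vector(len(x_idx), rng)):
        a_x[j] = val
    for j, val in zip(y_idx, random_unit_vector(len(y_idx), rng)):
        a_y[j] = val
    return a_x, a_y
```

Because the two vectors have disjoint support they are automatically orthogonal, so each draw is a valid xy-projection target.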
1D-Projections
For 1d-projections, only the x-coordinates are linear combinations of variables. The
constraints give a 1d-projection when all active variables are X-variables. With a single
X-variable, the result is a density estimate for the X-variable.
A 1d-tour is obtained when the sequence of targets is randomly chosen subject to these
constraints. Data viewer plots of 1d-projections show a density estimate, so that watching a
1d-tour lets us scan the marginal distributions.
At any one time, the program requires that the active variables are either all A-variables, or all
X- or Y-variables. That is, combinations of A-variables with either X-variables or
Y-variables are not allowed. Projections that satisfy the resulting constraints are not easily
interpretable, and I can think of no appropriate applications.
Additional Constraints
Following some changes to the active set of variables, additional constraints are imposed on
the next target. When the active set
(i) loses members,
that is, has members that become inactive, the new target is the plane lying closest to the
current plane, and satisfying the constraints imposed by the modified set. Notice the
target is completely specified. For example, suppose $a_x$ and $a_y$ are the current x- and
y-vectors, and the first variable is removed from the X-variables. The new target is
given by $a_x^-$, $a_y^-$, where (up to normalization) $a_x^- = a_x - (a_x^T e_1) e_1$ and
$a_y^- = a_y$. Smooth motion towards the new target establishes how the scatterplot
changes as the projection plane becomes orthogonal to some variables. I will give an
example in section 7.1.
(ii) gains members,
the new target is obtained by modifying the current plane along the coordinate directions
for the added variables. Suppose $a_x$ and $a_y$ are the current x- and y-vectors, and the first
variable is added to the X-variables. Then, $a_y$ remains unchanged, while
$a_x^{new} = a_x + \alpha e_1$ (up to normalization), where $\alpha$ is any arbitrary amount. Smooth
motion towards this target enables us to judge how moving the projection plane in the
direction of a variable changes the appearance of the scatterplot.
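Both target updates act on a single coordinate of the x-vector followed by a renormalization. A small Python sketch (hypothetical function names, 0-based variable index) makes this concrete.

```python
import math


def _normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / n for x in v]


def drop_variable(a, j):
    """Target vector after variable j leaves the active set:
    a - (a . e_j) e_j, rescaled to unit length."""
    b = list(a)
    b[j] = 0.0
    return _normalize(b)


def add_variable(a, j, alpha):
    """Target vector after variable j joins the active set:
    a + alpha e_j, rescaled to unit length (alpha is the user's choice)."""
    b = list(a)
    b[j] += alpha
    return _normalize(b)
```

For instance, dropping the first variable from $a_x = (0.6, 0.8, 0)$ gives $(0, 1, 0)$, and adding it back to $(0, 1, 0)$ with $\alpha = 1$ gives $(1, 1, 0)/\sqrt{2}$.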
5.2. Sequences of Targets
Even for guided tours, we will see that it is unnecessary to rely on the user to pick each target
in turn. With many applications it is enough that the user picks a particular area of data space,
and leaves target selection to some automatic procedure. For example, I already described
how a user can use constraints to select subspaces, and how this, in conjunction with random
target selection, produces different varieties of grand tours.
I call any scheme for providing sequences of targets a target generator method. Currently,
five such schemes are available in the data viewer program. For now, I describe these
schemes; applications are presented in chapter 7.
In what follows, $T_0, T_{-1}, T_{-2}, \ldots$ denotes the sequence of targets to date, i.e., $T_0$ is the
current plane.
(1) Scan
Scan provides a sequence of targets, $T_1, T_2, T_3, \ldots$, where each target is randomly selected.
By requiring that the targets satisfy the user-imposed constraints, we may obtain grand tours,
correlation tours and 1d-tours on subspaces spanned by variables.
(2) Local scan
This scheme is a variation on scan, designed for exploring the "neighborhood" of a plane, $T_0$,
say. By viewing a local scan, the data analyst may establish the sensitivity of the $T_0$
projection to small changes in the plane.
Local scan produces a sequence of targets $T_1, T_0, T_2, T_0, \ldots$, where the alternate planes $T_k$, $k > 0$,
are randomly selected from a neighborhood of $T_0$. A step-size parameter controls the size of
the neighborhood. By requiring the targets $T_k$, $k > 0$, to satisfy constraints, the user can restrict
exploration to a particular neighborhood of $T_0$.
(3) Cycle
Cycle constructs a sequence of targets made up solely of the two planes: $T_0, T_1, T_0, T_1, \ldots$.
As the projection changes, the user mentally connects the sequence of scatterplots. He/she
discovers where cases in one scatterplot are located in the other. This is a very important
benefit of moving scatterplots. Particularly with large numbers of cases, it is often not enough
to see the sequence of scatterplots once. For this reason, the cycle scheme is for moving to
and fro repeatedly along the same path segment.
(4) Backtrack
The backtrack scheme constructs a sequence made up of old targets: $T_{-1}, T_{-2}, T_{-3}, \ldots$.
It is frequently useful to retrace data exploration steps. Particularly when viewing a sequence
of projections, interesting features can pass by quickly. Therefore, a scheme for moving
backwards through the sequence of projections is essential. Since the path is entirely
determined by the target planes, we can reconstruct past projections by re-using old targets.
(5) Rotate
With the rotate scheme, a user can control the rotation of 3-d and 4-d point clouds.
This capability does not quite fit in with the data viewer's paradigm for guided tours.
Typically, the user controls only the sequence of targets, and geodesic interpolation
determines the intermediate path. Instead, the rotate scheme uses two targets $T_0$ and $T_1$, say,
to specify a subspace, and the user may pick a particular rotation within this subspace.
By default, the rotation is in the direction given by the interpolation path from $T_0$ to $T_1$.
(Notice that this relies on the fact that the geodesic paths give rotations.) Rotate is similar to
cycle in that all paths are in the $T_0$, $T_1$ subspace. With rotate, the motion continues in the
given direction, whereas with cycle, the direction of motion is reversed on reaching the target.
Except in some special cases, the path produced by the rotate scheme does not necessarily
return to $T_0$.
For 1d-projections, $T_0$ and $T_1$ are two lines, and determine the plane of rotation for the
x-vector. That is, there is only one possible rotation path so that user controls are unnecessary.
For 2d- or xy-projections, there are an infinity of rotations. Recall that geodesic paths are
characterized by simultaneous rotations in a pair of orthogonal 2-planes (see equation 4.1 in
section 4.5). Choosing rotation planes and the speeds of rotation determines a 4-d rotation.
For xy-projections, the x- and y-vectors rotate in the subspaces spanned by the X and Y variables respectively. To
specify a path, the user needs to choose the relative speeds of rotation for the x- and
y-directions. An example in a later chapter will demonstrate why this is useful. (Note that
when the speeds form an irrational ratio, the resulting path is dense in the set of constrained
xy-projections.)
With general 2d-projections, it is more difficult to give the user full positioning power for the
projection plane. Within any 4-d subspace, there are arbitrarily many orthogonal 2-planes
which could contain the rotating unit vectors $u(t)$, $v(t)$. The rotation speeds of these vectors
are also arbitrary, so there are too many variable factors determining the sequence of
projections. I describe a scheme which allows the user some measure of control in this
situation.
To begin with, consider the situation where the planes span a 3-d subspace. For this, the user
needs only pick a plane (or equivalently an axis) of rotation. Section 2.4 gave some examples.
A possible compromise in the case of 4-d subspaces is to let the user choose a rotation
contained in the dominant 3-d subspace. This dominant subspace is obtained as follows.
Take the default 4-d rotation produced by geodesic interpolation from $T_0$ to $T_1$ (see equation
4.2). We could approximate this path with a rotation in a 3-d subspace by stopping the slower
of the two rotating vectors, $v(t)$, that is, set $\theta_y$ to zero in equation 4.2. This leaves us with a
subspace spanned by the vectors $(u_1, u_2, v_1)$. In the 3-d subspace, it remains to pick an axis of
rotation.
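The sequence-based schemes described above (local scan, cycle and backtrack) map naturally onto generators. The following Python sketch is only an illustration of the target orderings given in the text. The function names are hypothetical, planes are treated as opaque values, and `sample_near` stands for an assumed routine that draws a random plane from a neighborhood of its argument.

```python
import itertools


def cycle_targets(t0, t1):
    """Cycle: T0, T1, T0, T1, ... (to and fro along one path segment)."""
    while True:
        yield t0
        yield t1


def local_scan_targets(t0, sample_near):
    """Local scan: T1, T0, T2, T0, ... where each Tk is drawn near T0."""
    while True:
        yield sample_near(t0)
        yield t0


def backtrack_targets(history):
    """Backtrack: T(-1), T(-2), ... given the targets to date,
    listed oldest first with the current plane T0 last."""
    for target in reversed(history[:-1]):
        yield target
```

For example, `list(itertools.islice(cycle_targets("T0", "T1"), 4))` gives `["T0", "T1", "T0", "T1"]`. Each generator only supplies targets, and the geodesic interpolation between consecutive targets is left to the display loop.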
5.3. Data Dependent Paths
The methods I described so far give the user a significant amount of control over the moving
sequence of projections. The methods have one factor in common; each supplies sequences
which are independent of the data. Even with guidance from the user, there is no guarantee
that data independent paths can yield interesting projections. Therefore, we need data
dependent paths. which hold greater promise of showing structure.
I have considered two possible ways of producing data dependent paths. (i) using data
dependent target generating schemes and (ii) using data dependent constraints.
Data Dependent Target Generators
There is a wide range of potentially useful data dependent target generating schemes. For
example, take any projection index, which measures some feature of each data projection.
Possible choices of indices are given in Huber (1985), Friedman and Tukey (1974). Using
standard optimization techniques, one could design a target generating scheme which
produces projections having increasing values of this index.
As before, the targets may also be restricted by user-chosen constraints and step-size.
However, allowing the constraints and step-size choices to be modified at any time demands
that the optimization be done in real-time. For some further discussion, see Buja et al. (1986).
The success of these methods depends on constructing indices measuring interesting features.
For real-time optimization, speed is an important consideration, so the index cannot be too
computationally demanding. In view of the available computing power, this is a real
restriction, especially for indices of 2d-projections. Data independent paths have the
significant advantage of speed, because no data based calculations are necessary.
Data dependent target generators are not included in the current version of the data viewer
program, so I do not discuss them further.
Data Dependent Constraints
Alternatively, one could give up on real-time optimization and pre-compute some directions
with high values for a suitable projection index. Then, data dependencies can be enforced on
the sequence of targets by imposing constraints with regard to these directions. With this
approach, the computational efficiency of the projection index is not a significant issue, since
the data dependent constraints are pre-computed.
The current version of the data viewer program supports data dependent constraints. I have
considered projection indices from classical methods in statistics, namely,
principal components, canonical correlation analysis, discriminant analysis and data sphering.
These techniques form a powerful tool kit for exploring multivariate data. They supply
projections of the data which are often highly structured. Because the projections optimize
some fairly simple indices, they can be easily interpreted. Also, the derived variables have an
order of importance, and so provide a possible basis for dimensionality reduction. Motion
aside, including these techniques in the data viewer program broadens the range of its
applications considerably. Chapter 7 gives some examples.
Suppose we have q directions, constructed so that projections onto these directions are in
some sense interesting. I call the directions derived variables, and represent them by the
vectors c_1, c_2, ..., c_q in standardized variable space.
In section 5.1, I described how the original variables could be classified as X, Y, A, or
inactive, in order to impose constraints on targets. Classifying the derived variables in the
same way gives an analogous set of constraints. For simplicity, assume there are p derived
variables c_1, c_2, ..., c_p, spanning standardized variable space. (As long as c_1, c_2, ..., c_q are
linearly independent, one can always construct additional directions c_{q+1}, ..., c_p so that this
holds.)
Introducing new variables raises some problems.
• Over-constrained targets:
With 2p variables, there are a total of 2p possible orthogonality constraints, making
over-constrained targets hard to avoid. I side-step this problem by requiring that any one
target may only be constrained with regard to either the original or derived variables. In
practice, this is not a real restriction.
• Choice of inner product:
We would like to plot projections onto pairs of c_j vectors, but they are not necessarily
orthogonal. This implies that bivariate scatterplots of the derived variables need not be
obtained from orthogonal projections in the canonical inner product, which was chosen
so that the original (standardized) variables were mutually orthonormal (see section 3.3).
One way around this problem is to use a different inner product. Since by assumption the
derived variables are linearly independent, we can define an alternative inner product for
which they form an orthonormal set.
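The alternative inner product can be made concrete with a short Python sketch. It assumes the p derived variables are stored as the columns of a square matrix C (a list of rows), and takes the inner product of x and y to be the ordinary dot product of their coordinates in the basis c_1, ..., c_p. The function names and the small elimination helper are illustrative assumptions; they are not part of the data viewer program.

```python
def coordinates(C, x):
    """Solve C a = x for a by Gauss-Jordan elimination with partial pivoting.

    C is a list of rows; column j of C holds the derived variable c_j.
    """
    n = len(C)
    # Augmented matrix [C | x], copied so the inputs are not modified.
    M = [list(map(float, C[i])) + [float(x[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("derived variables are not linearly independent")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def alternative_inner(C, x, y):
    """Inner product under which the columns of C form an orthonormal set."""
    ax, ay = coordinates(C, x), coordinates(C, y)
    return sum(a * b for a, b in zip(ax, ay))


# Two oblique derived variables in a 2-d standardized variable space.
C = [[1.0, 1.0],
     [0.0, 1.0]]          # columns: c_1 = (1, 0), c_2 = (1, 1)
c1, c2 = [1.0, 0.0], [1.0, 1.0]
canonical = sum(a * b for a, b in zip(c1, c2))   # canonical inner product of c_1, c_2
alternative = alternative_inner(C, c1, c2)       # same pair, alternative inner product
```

In this example c_1 and c_2 are not orthogonal in the canonical inner product, but they are orthonormal in the alternative one, so a plot of one against the other becomes an orthogonal projection.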
By appropriate use of constraints we can
(i) request 2d-projections, xy-projections or 1d-projections on subspaces spanned by derived
variables. With the alternative inner product, the plots available from orthogonal
projections are different from before.
(ii) pick a particular xy-projection consisting of two derived variables. This is achieved by
having just one X-variable and one Y-variable. With one X-variable and no Y-variable,
the result is a marginal density estimate of a derived variable.
(iii) use methods for generating sequences of targets in subspaces spanned by derived
variables.
Principal component analysis, canonical correlation analysis, and discriminant analysis
provide derived variables which are "ordered", so they can be regarded as techniques for
dimensionality reduction. For example, one may concentrate exploration among the first
few principal components, the idea being that these dimensions contain most of the
structure present in the data. This can considerably reduce the amount of work necessary
to examine the data set. Also, a common criticism of exploration methods based on
projection planes moving through high-dimensional spaces is that interpretation of the
moving scatterplot is difficult. Including dimensionality reduction techniques in the data
viewer's repertoire alleviates this problem by supporting exploration of smaller and so
more interpretable subspaces.
CHAPTER 6
THE USER INTERFACE
In this chapter I describe the user interface to the task of constructing sequences of
projections.
It is important that the interface be consistent, so that each user action always has a similar
effect, regardless of the circumstances (Foley and Van Dam 1982). This makes the system
easier to learn and use, because there are no special cases to memorize.
I begin with a brief review of the available functionality. Then I describe the user
interactions, with some discussion of how the interface meets the goal of consistency.
6.1. Functionality
The previous chapter presented methods for producing paths of planes which use interpolation
between pairs of user-selected targets. I described four tools for selecting targets, namely,
constraints, schemes for providing sequences of targets, controls for rotating 4-d point clouds,
and finally, data dependent constraints. The user can construct paths of
planes using these tools.
Following the paradigm for paths of planes, the data viewer program supplies a sequence of
projections as follows:
Suppose the current plane is P_1, that is, the display shows the projection of observations and
variables onto P_1. Let T_1 denote the target plane. Then, intermediate planes P_1, P_2, P_3, ...
are generated along an interpolating path from the start plane, P_1, to the target T_1. When
the current plane reaches the target (P_i = T_1 for some i), another target T_2 is obtained.
The user controls the selection of successive targets by selecting a target generator scheme
and setting up constraints. This can also involve supplying data dependent constraints.
Variables may be classified as inactive, X-, Y- or A-variables. The classifications define
constraints, which are imposed on targets as described in the previous chapter. Changing the
constraints imposes additional constraints on the very next target. Currently, the data viewer
program provides a choice of five target generator schemes: scan, local scan, cycle, rotate and
backtrack. The current target generator scheme provides the next target. Usually the target is
random, subject to the user-imposed constraints.
The sequence of targets is never deterministic because the user is always free to intervene and
re-direct the moving scatterplot by changing the constraints or the target generator. By
appropriate choice of target generator, the user can obtain a local scan in the neighborhood of
the current plane, or cycle between the current plane and previous target. Generally, changes
to the constraints result in succeeding targets satisfying the modified restrictions.
Some of the target generator algorithms produce sequences that re-use previous planes, for
example, local scan and backtrack. For this reason, I distinguish between old and new
targets. Only new targets can be guaranteed to satisfy the current constraints. The target
is chosen as described in the previous chapter, except for one special
case, occurring when the user changes the constraints. Then the next target is new,
irrespective of the current choice of target generator. This exception is necessary, because
when the user makes a change, he expects that this affects the motion.
6.2. Choosing Targets
In chapter 2, we saw that the data viewer program has a graphical interface. The user points
the mouse cursor at some mouse sensitive part of the data viewer window, and clicks on a
mouse button. In response, the display changes somehow. Some of the mouse clicks affect
the path of planes.
[Figure: a data viewer window titled ST.HELENS]
Figure 6.1: A data viewer window
For controlling paths of planes, there are two important mouse sensitive areas on the screen,
namely, the variable boxes and the control panel. By default, each data set variable receives a
box, placed on the l.h.s. of the screen. Section 6.4 will discuss how to add boxes for the
derived variables to the display list. Labels in the left hand corner of each variable box give
the current classification of that variable: the labels A, X, Y and a blank denote an A-, X-, Y-
or inactive variable respectively. A string drawn in the control panel gives the current choice
of target generator.
For example, in figure 6.1 the boxes for the variables latitude, longitude and
depth have A labels. Also, "rotate" appears in the control panel. This means that the
window can currently show 3-d rotations in the space spanned by these three variables.
Variable Boxes
Changing the constraints is achieved by changing the variable box labels. The circular area
marked in each variable box is mouse sensitive; clicks in this region change the labels.
Within the circle, the mouse cursor has an X, Y, A or O shape, depending on its location. I
use the different shapes to indicate the effect of a click. Figure 6.2 shows the mouse cursor's
shape in each part of the circle.
Figure 6.2: Changing shape of the mouse cursor
When the cursor shape is X, Y, A or O, a single click on the left or middle mouse button
changes the label to an X, Y, A or blank respectively. For X and Y cursor shapes, middle
clicks have an additional effect. If the box with the mouse cursor receives an X label, then an
X in any other variable box is replaced by a blank label. This reduces the number of clicks
necessary to pick bivariate scatterplots.
There are other situations where changing the label in one box changes labels in others as a
side effect. Recall from section 5.1 that mixtures of X- or Y-variables with A-variables are
not allowed. The program enforces this rule as follows. When an X or Y label changes to an
A, all other X or Y labels change to A's automatically. Conversely, when an A label
changes to X (or Y), all other A labels become Y 's (or X 's). There is no good reason for
this particular assignment of labels; many other schemes would do just as well.
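The label rules above can be summarized in a short Python sketch. The text states the side effects for an X or Y label changing to A and for an A label changing to X or Y; the sketch applies the same conversions whatever the previous label of the clicked box was, which is an assumption made here to keep X/Y and A labels from mixing. The function name, the `single` flag (standing in for a middle click) and the use of an empty string for a blank label are all hypothetical.

```python
def change_label(labels, name, new, single=False):
    """Return a new label table after the box `name` receives label `new`.

    labels maps variable names to "X", "Y", "A" or "" (inactive).
    single=True models a middle click, which keeps at most one X (or Y).
    """
    opposite = {"X": "Y", "Y": "X"}
    result = dict(labels)
    result[name] = new
    for other, label in labels.items():
        if other == name:
            continue
        if new == "A" and label in ("X", "Y"):
            result[other] = "A"             # no X or Y label alongside an A
        elif new in opposite and label == "A":
            result[other] = opposite[new]   # remaining A's take the other axis
        elif single and new in opposite and label == new:
            result[other] = ""              # middle click: a single X (or Y)
    return result


boxes = {"latitude": "A", "longitude": "A", "depth": "A"}
boxes = change_label(boxes, "latitude", "X")
# latitude is now X; longitude and depth have become Y
boxes = change_label(boxes, "longitude", "X", single=True)
# longitude is the only X; latitude is blank; depth is still Y
```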
Changes to the variable box labels as I have described have no instantaneous effect on the
displayed projection; they only affect selection of new targets. Of course, smooth motion
towards the target is not always of interest, for example, when examining bivariate
scatterplots. To accommodate these situations, I assign double clicks on mouse buttons in the
variable boxes to mean change the label and the displayed scatterplot. With double clicks, the
scatterplot changes immediately to show the projection on to the new target. At all times,
mouse clicks in two boxes are sufficient to produce a bivariate scatterplot. Section 2.3 gave
an example.
As the mouse cursor moves within the circle, the mouse documentation line at the bottom of
the screen briefly describes the effect of a click. Suppose the mouse shape is X, then the
documentation is "L: X variable, M: Single X, double click for plot change", where "plot
change" refers to the fact that the projection changes along with the label. As long as the
variable boxes are not too small (which happens when the window is too small or the number
of variables too large), I find this a reasonable scheme for selecting constraints.
Control Panel
Figure 6.3 shows a close-up of the control panel. (The strings in italics do not appear on the
screen, they have been added for identification purposes.)
[Figure: the control panel, with regions annotated SMOOTH, negative speed, step-size, SCAN and speed]
Figure 6.3: The control panel
The control panel consists of 6 mouse-sensitive regions. When the mouse cursor moves into
one of them, the region's border is temporarily highlighted. The regions represent five ways
of controlling motion. A mouse click in one region affects the choice of target generator.
Clicks in other areas control factors such as whether the motion is on or off, and the speed at
which the scatterplot moves.
(i) Motion switch:
This is the small rectangular region currently containing the string "OFF". The switch
displays either of two values, "ON" or "OFF". When the value shown is "ON", the data
viewer's scatterplot is moving through a sequence of projections. When the value is
"OFF", the projection plane does not change. A left mouse click in this area toggles the
string's value from "OFF" to "ON", causing the scatterplot to move.
(ii) Speed:
One of the two rectangular areas at either side of the motion switch contains a vertical
dashed line, which I call the speed bar. The position of the bar gives the speed of the
moving scatterplot; the farther the line is from the motion switch, the faster the
scatterplot moves. (Fast motion is also jerkier, because successive projection planes
P_1, P_2, P_3, ... are farther apart.) A bar on the left hand side of the switch also tells us that
the scatterplot is moving backwards through old projections, i.e., backtrack is the current
target generator. A left mouse click anywhere within these two rectangles repositions the
speed bar, causing the scatterplot to speed up or slow down.
(iii) Interpolation switch:
This is the rectangular area currently showing the string "SMOOTH". The switch
displays either of two values, "SMOOTH" or "JUMPY", indicating the presence or lack
of interpolation between successive target planes. The usual value is "SMOOTH",
causing the data viewer's scatterplot to move through a sequence of close projection
planes, P_1, P_2, P_3, .... As a contrast to smooth motion, the program also allows for jumpy
motion. In this case there is no interpolation so that P_i = T_i, i = 1, 2, ..., and
successive projection planes may be arbitrarily far apart. A left mouse click in this area
toggles the string's value from "SMOOTH" to "JUMPY".
(iv) Step-size:
This is the rectangular area containing the string "step-size". The vertical dashed line is
the step-size bar, which may be repositioned by mouse clicks, just like the speed bar.
The position of the bar determines how far away a new target can be; recall that this
comes in useful with the local scan target generator.
(v) Target Generator:
The rectangle in the top right-hand corner of the control panel shows a string
representing one of the target generators scan, rotate, local-scan or cycle. Mouse clicks
in this area change the current choice of target generator. The fifth target generator,
backtrack, is chosen in a different manner, by giving speed a negative value.
Path Parameters
The foregoing discussion described how the user selects the path of planes P_1, P_2, P_3, ..., by
choosing values for (i) the motion switch, (ii) speed, (iii) the interpolation switch, (iv)
step-size, (v) the target generator and (vi) the variable box labels, which I term collectively the
path parameters. The interface will be consistent if these parameters are orthogonal. We say
two parameters are orthogonal when each of the values of either has the same effect for all
values of the other. Orthogonality simplifies the user interface, because the user only has to
remember the function of each parameter alone, rather than in conjunction with the other five.
With my choice of path parameters, orthogonality is not always achievable. At least, each
parameter value should have a predictable effect for all values of other parameters.
The path parameters have a natural separation into two orthogonal groups--
(i) target parameters, for controlling the sequence of targets T_1, T_2, T_3, .... These are the
step-size, target generator and box labels.
(ii) motion parameters, for controlling the intermediate planes. These are the motion switch,
the interpolation switch and speed control.
These groups are orthogonal because target generation is not affected by motion parameters.
Also, for any two targets, motion parameters determine the intermediate planes in the same
way, no matter how those targets were generated. For example, changing the motion switch
to "OFF" always stops the motion.
Within each group, there are some dependencies. The effect of either box labels or step-size
depends on the current choice of target generator. Target generating schemes that re-use old
targets, such as backtrack, simply ignore the current settings of other path parameters.
Schemes like scan that provide new targets do not. Some exceptions to this rule are
necessary. Otherwise, clicks in variable boxes could produce a bivariate scatterplot
sometimes, but not always. The resulting interface would be dominated by the mode of target
generation, which is quite intolerable. To avoid this, I require a label change to be followed
by a new target (see section 6.1), so that the immediate effect of changing labels does not
depend on the target generator. Then, double middle clicks in two variable boxes can always
produce a bivariate scatterplot.
The speed parameter has different interpretations for the two kinds of motion, but in both
cases, its absolute value controls how quickly the plot moves through the sequence of targets.
In the case of jumpy motion, speed controls the flash frequency by regulating the time interval
between targets. For smooth motion, intermediate planes are generated along the path
between start and target planes at increments proportional to speed.
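To illustrate how speed sets the spacing of intermediate projections, here is a minimal Python sketch restricted to 1d-projections, where the geodesic between two unit x-vectors is simply a great-circle arc. It is not the plane interpolation of equation 4.2, which is not reproduced here. The function name and the `base_step` constant (the angle in radians moved per frame at unit speed) are hypothetical.

```python
import math


def geodesic_path(u0, u1, speed, base_step=0.05):
    """Intermediate unit vectors from u0 towards u1, ending exactly at u1.

    The angular increment is at most base_step * |speed| radians, so a
    larger |speed| gives fewer, more widely spaced intermediate projections.
    """
    if speed == 0:
        raise ValueError("speed must be non-zero for the projection to move")
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u0, u1))))
    theta = math.acos(dot)                      # angle between start and target
    w = [b - dot * a for a, b in zip(u0, u1)]   # part of u1 orthogonal to u0
    norm = math.sqrt(sum(c * c for c in w))
    if norm < 1e-12:
        return [list(u1)]                       # identical or opposite: jump
    w = [c / norm for c in w]
    n = max(1, math.ceil(theta / (base_step * abs(speed))))
    return [[math.cos(theta * k / n) * a + math.sin(theta * k / n) * b
             for a, b in zip(u0, w)]
            for k in range(1, n + 1)]


slow = geodesic_path([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], speed=1.0)
fast = geodesic_path([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], speed=2.0)
# Both paths end at the target; the fast path has fewer intermediate vectors.
```

Jumpy motion corresponds to skipping the intermediate vectors altogether and showing only the targets, with speed then governing the pause between them.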
There are some cases where regardless of the value shown by the interpolation switch, smooth
motion is not provided. Instead, the display jumps to the next target. The first case is rather
obvious: when switching between 2d- and 1d-projections. The next case is when the start and
target planes give 2d- and xy-projections respectively. Geodesic plane interpolation for two
planes need not arrive at the xy-projection, but rather, a rotated xy-projection. To avoid this
inconsistency, some other form of interpolation would be necessary. The case where start and
target planes are both xy-projections, but the start's x- (y-) subspace overlaps with the target's
y- (x-) subspace, is similar.
6.3. Rotating 3-D and 4-D Point Clouds
When the plot interaction mode shows "PROJECTION", clicks on the plot region help control
the moving projections. We already had some examples in section 2.4 of how the user could
rotate 3-d point clouds in particular directions. The following sections give a more complete
description.
Clicks on the left mouse button in the plot region toggle the state of the motion switch. This
is convenient for two reasons. It makes it far easier for the user to stop the motion
immediately. Some care is necessary to position the mouse cursor in the rather small region
allocated to the motion switch, causing a potential delay in stopping the motion. Also,
moving the scatterplot by clicks in the plot region rather than on the motion switch gives a
more direct style of interface.
In section 5.2, I suggested how one might specify a path of planes through a 4-d subspace. I
considered the two cases of xy-projections and 2d-projections individually. Now, I describe
how the data viewer user can actually pick such a path, by clicking the mouse in the plot
region. This capability is available only when rotate is the current choice of target generator.
XY-Projections
For xy-projections, the user needs to choose the relative speeds of rotation for the x- and
y-subspaces. After a mouse click in the plot region, the scatterplot moves. The program uses
the slope of the line from the mouse cursor to the plot region's center as the relative speeds.
When the mouse click is near the horizontal line through the plot region's center, the rotation
speed for the y-direction is comparatively small. The resulting path is "almost" a 3-d rotation.
This special case illustrates why I chose this particular mechanism for providing relative
speeds. Following a click, the points appear to move towards the mouse cursor.
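The mapping from a click to relative speeds might look as follows in Python. This is only a sketch: the text says that the slope of the line from the click to the plot center gives the ratio of the two speeds, while the normalization to unit length, the retention of signs (so that the motion heads towards the cursor) and a coordinate system with y increasing upwards are assumptions. The function name is hypothetical.

```python
import math


def relative_speeds(click, center):
    """Relative rotation speeds (x-subspace, y-subspace) for a click position.

    The ratio y-speed / x-speed equals the slope of the line from the plot
    center to the click; the pair is scaled to unit length.
    """
    dx = click[0] - center[0]
    dy = click[1] - center[1]
    length = math.hypot(dx, dy)
    if length == 0:
        raise ValueError("click coincides with the plot center")
    return dx / length, dy / length


# A click on the horizontal line through the center: no rotation in y.
horizontal = relative_speeds((10.0, 0.0), (0.0, 0.0))
# A click up and to the right: both subspaces rotate, y faster than x.
diagonal = relative_speeds((3.0, 4.0), (0.0, 0.0))
```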
2D-Projections
Suppose the two planes, T_0 and T_1, are the start and target respectively. Following the
suggestion of the previous chapter, the program approximates the interpolation from start to
target by a path in the dominant 3-d subspace. In this 3-d subspace, it remains to pick an axis
of rotation. For simplicity, suppose the two planes, T_0 and T_1, span a 3-d subspace. As usual,
after a mouse click in the plot region, the scatterplot moves. This click also picks a vector a in
T_0. If b is the vector in the subspace which is orthogonal to T_0, the click causes a rotation in the
a-b plane.
As a user, I found it easier to pick a direction of rotation, rather than the axis itself. This way,
the points move towards the position of the mouse cursor at the time of the click. Also, it
means that clicks in the plot region have consistent interpretations for 2d- and xy-projections.
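A rotation in the a-b plane can be sketched in a few lines of Python. The sketch assumes that a, a second vector a_perp completing an orthonormal basis of T_0, and b are mutually orthonormal, and it reports plot coordinates relative to the rotated a direction and a_perp rather than the screen axes. These names and conventions are illustrative assumptions.

```python
import math


def rotated_plane(a, a_perp, b, angle):
    """Basis of the plane reached by rotating span(a, a_perp) in the a-b plane.

    a_perp is the axis of rotation and stays fixed; a turns towards b.
    """
    u = [math.cos(angle) * ai + math.sin(angle) * bi for ai, bi in zip(a, b)]
    return u, list(a_perp)


def project(point, basis):
    """Coordinates of a point in the plane spanned by the two basis vectors."""
    u, v = basis
    return (sum(p * q for p, q in zip(point, u)),
            sum(p * q for p, q in zip(point, v)))


a, a_perp, b = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
point = (1.0, 2.0, 3.0)
before = project(point, rotated_plane(a, a_perp, b, 0.0))
after = project(point, rotated_plane(a, a_perp, b, math.pi / 2))
# Only the coordinate along the clicked direction changes during the rotation.
```

Because the coordinate along a_perp is unchanged, every point moves parallel to the clicked direction, which matches the impression of points moving towards (or away from) the mouse cursor.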
6.4. Derived Variables
Boxes for Derived Variables
Each of the data set variables is represented in the display by a variable box. Suppose I used
these p boxes to represent the derived variables rather than the originals, and used the
alternative rather than the canonical inner product. Then, the user could interact with these
boxes just as before to produce plots of the derived variables.
Instead, I add new boxes to the display for the derived variables. There are two advantages to
this approach.
(i) More user control: With 2p variable boxes, the user has more choices. Mouse clicks in
appropriate boxes can now select bivariate scatterplots of the original, or the derived
variables.
(ii) More information displayed: Each of the boxes has a line drawn from the box center
representing the projection of its variable onto the current plane. We can see any given
projection as a linear combination of the original and the derived variables
simultaneously.
The new boxes are drawn on the right hand side of the window to separate them from the
existing boxes. Whenever necessary, I will distinguish the two sets of boxes by using the
terms l.h.s. boxes and r.h.s. boxes. With new variable boxes for derived variables, the user has
additional controls for guiding the moving scatterplot. These are:
Setting up constraints:
To avoid over constrained targets, each target may be constrained with regard to either the
original or derived variables. The data viewer enforces this by allowing the boxes for one set
of variables only to be mouse sensitive at any given time. Then, the user selects constraints
by clicking on the mouse sensitive boxes. Clicks on the other set of boxes have no effect.
The user can recognize the mouse sensitive boxes on sight, because they alone have X, Y or
A labels. It is easy to switch control from one set of boxes to the other. Suppose the r.h.s.
boxes are mouse sensitive. A mouse click close to the l.h.s. boxes makes them mouse
sensitive instead, and labels are drawn in the l.h.s. rather than the r.h.s. boxes.
A choice of inner products:
When the user picks one or other set of boxes to be mouse sensitive, he/she also chooses
between the canonical and alternative inner product; the current inner product is such that the
variables represented by the mouse sensitive boxes are orthonormal. When the mouse
sensitive set of boxes switches from l.h.s. to r.h.s. (or vice versa), the inner product changes,
causing the result of the projection to change also.
Constructing Derived Variables
Suppose there is a list named derived-variables containing p derived variables. The
user tells a data viewer window named a-DV-window to display new boxes for the derived
variables by executing the Lisp code
(send a-DV-window :add-variables derived-variables)
Then, the window is redrawn, with additional boxes for the derived variables placed in the
window's r.h.s.
Currently, I give special treatment to four sets of derived variables, obtained from principal
component analysis, canonical correlation analysis, discriminant analysis, and also data
sphering. To produce principal components, for example, the user only has to run the code
(send a-DV-window :add-variables :prcomp)
Then the data viewer program itself calculates the principal components. This is far more
convenient for the user; he/she needs no knowledge of the data viewer's internal data
representation. Also, each of the named methods involves calculations based on some
selection of the cases and variables. In the data viewer, these selections could be made
graphically by marking the relevant variables and cases on the screen. At the moment, only
graphical selection of variables is provided for.
CHAPTER 7
APPLICATIONS
This chapter contains some examples of exploring data with the data viewer program, using
the tools described.
7.1. Places Data
The places data consists of scores for 329 US cities on 9 criteria, chosen to measure
"livability" of the cities (Rand McNally 1986). The nine criteria are climate, housing, health
care, crime, transportation, education, the arts, recreation and economics. For housing and
crime, the lower the score the better. For all other variables, the higher the score, the better.
I have added an additional three variables to the data set. They are population, latitude and
longitude for each of the 329 cities. To eliminate skewness, I transformed population to a log
scale.
Figure 7.1(a) shows a data viewer window for the places data set, with a scatterplot of two
variables, latitude against longitude. Notice the latitude box has a Y label, and
the longitude box has an X label. The two "extra" points on the left hand side of the map
represent Anchorage, Alaska, and Honolulu, Hawaii. Their latitude and longitude coordinates
have been adjusted so that all cities fit nicely into the plot region.
[Figure: two data viewer windows titled PLACES, panels (a) and (b)]
Figure 7.1: Bivariate scatterplots of places data
In the univariate transformation part of the pipeline operations, I took care to scale these two
variables appropriately (see section 3.2). By default, variables are standardized individually,
which in this case would give an elongated U.S.
Mouse clicks in the climate box change the display to show climate and latitude.
Figure 7.1(b) shows the result. From this plot we see that northern cities tend to have either
very bad, or quite good climate.
Connecting Two Bivariate Scatterplots
Instead of changing the display immediately from one bivariate scatterplot to another, we can
gain a lot of information by watching a smooth progression from one scatterplot to another,
that is, connecting the scatterplots. In this way, we discover which U.S. cities have good or
bad climate. Grasping the features of this motion usually takes a few repetitions. This is the
purpose of the cycle target generator.
We construct the sequence of projections which connect the scatterplots shown in figure 7.1
(a) and (b) as follows: Suppose the window currently displays climate and latitude,
and we pick longitude and latitude as the target. A click on the motion switch
toggles the value from "OFF" to "ON", and the projection begins to change. When the
projection reaches the target, motion pauses momentarily, and then resumes back towards the
climate, latitude plot. The displayed projection continues to cycle between these two
scatterplots until a click on the motion switch changes its value to "OFF", or another target
generator is selected.
Figure 7.2 shows one of the intermediate projections. From the variable boxes we see that
both climate and longitude have non-zero projections in the horizontal direction.
[Figure: a data viewer window titled PLACES]
Figure 7.2: Connecting bivariate scatterplots
By watching the smooth progression repeatedly between the pair of scatterplots shown in
figure 7.1 (a) and (b), we gain the following information:
• The cluster of points with the best climate are all Californian cities.
• The northern cluster with good climate are north west cities.
• The mid-west has the worst climate: Minnesota, Wisconsin, and the Dakotas.
• The Atlantic coast of Florida has far better climate than does the Gulf coast.
In the same way, we can let the projection cycle between the longitude, latitude and
the climate, housing plots. Figure 7.3(a) shows a bivariate scatterplot of climate
and housing, figure 7.3(b) shows one of the intermediate projections. By cycling between
the pair of scatterplots we see that
• Highest housing costs are in the vicinity of New York. (The two points with very high
scores on housing are actually Connecticut cities).
• California has high housing costs.
[Figure: two data viewer windows titled PLACES, panels (a) and (b)]
Figure 7.3: Connecting bivariate scatterplots
Density Estimates
Some of the ratings, in particular the-arts and health-care, give extremely high
scores to the biggest cities-- New York, Chicago and L.A. This results in scatterplots where
most of the observations are clustered together, so that associations between variables are hard
to pick out. For this reason, I transformed the ratings to normal scores.
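A transformation to normal scores can be sketched in Python as follows. The text does not say which plotting positions were used; the sketch assumes (rank - 0.5) / n and breaks ties by order of appearance, and the function name and the sample ratings are made up for illustration.

```python
from statistics import NormalDist


def normal_scores(values):
    """Replace each value by the standard normal quantile of its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])   # indices, smallest first
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # Assumed plotting position: (rank - 0.5) / n, strictly inside (0, 1).
        scores[i] = NormalDist().inv_cdf((rank - 0.5) / n)
    return scores


# An extreme rating (such as a very large city) is pulled in towards the rest.
ratings = [120.0, 95.0, 56000.0, 410.0, 230.0]
scores = normal_scores(ratings)
```

Only the ordering of the ratings survives the transformation, so a single very large score no longer compresses the remaining observations into one corner of the plot.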
Figure 7.4 shows two data viewer windows, the upper one with the rating variables as before,
and the lower one with the normal scores. There are no longer boxes for latitude and
longitude. Both windows display a density estimate for a linear combination of the rating
variables. The linear combination is the same since the data viewer windows are linked by
common projections. Notice the dot on the extreme right in the upper plot; this is New York.
In the lower plot, New York lies far closer to the other cities. As the x-vector changes, we see
how the transformation to normal scores affects the 1d-projections. The density estimate in
the lower window is generally symmetric, and quite often looks "bell-shaped". For the
untransformed ratings, the 1d-projections have highly skewed distributions. With a moving
x-vector, the density's peak shifts to and fro across the screen.
Most of the nine rating variables tend to assign high values to big cities. To judge the overall
nature of the association between population and the ratings, we can examine plots of
population against linear combinations of the rating variables (on a normal scores scale).
Suppose we pick population as the single Y-variable, and make each of the-arts,
health-care, economics, education and recreation X-variables. (From the
bivariate scatterplots, these five have the strongest association with population.) Then, the
target generator scan yields a correlation tour of population against linear combinations
of the five X-variables.
[Figure: two data viewer windows, the upper titled PLACES and the lower titled places nscores]
Figure 7.4: Density estimates
[Figure: two data viewer windows titled places nscores, panels (a) and (b)]
Figure 7.5: A correlation tour
By watching the moving scatterplot, we discover a projection with high x-y association, as
shown in figure 7.5(a). We can see that population is linearly related to a weighted
average of the five selected rating variables. Also, health-care and the-arts have
the largest coefficients, whereas the coefficients for economics and recreation are
comparatively small. (The variables have been transformed to normal scores, so that it is
reasonable to compare their projection coefficients.)
Deactivating Variables
Do the variables economics and recreation have a negligible contribution to the x-y
association in the above projection? We may answer this question as follows. Suppose we
make the two variables inactive. Then the x-vector for the next target is the current x-vector
orthogonalized with regard to the two deactivated variables (see section 5.1). With a rotation
towards this target, we receive a visual impression of how the quality of the x-y association
deteriorates.
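Section 5.1 is not reproduced here, so the following is only a sketch of what the orthogonalization could look like, written in Python rather than Flavors. It assumes the canonical inner product, under which orthogonalizing against the coordinate axes of the deactivated variables amounts to zeroing their coefficients and rescaling; the function name and the coefficients are hypothetical.

```python
import math

def deactivate(x_vector, names, inactive):
    """Zero the coefficients of the inactive variables and rescale to unit length."""
    # Assumption: canonical inner product, so removing the components along
    # the inactive coordinate axes is the same as setting them to zero.
    target = [0.0 if name in inactive else coef
              for name, coef in zip(names, x_vector)]
    length = math.sqrt(sum(c * c for c in target))
    if length == 0.0:
        raise ValueError("no active variable has a non-zero coefficient")
    return [c / length for c in target]

names = ["the-arts", "health-care", "economics", "education", "recreation"]
x_vector = [0.6, 0.6, 0.1, 0.5, 0.1]   # hypothetical projection coefficients
target = deactivate(x_vector, names, {"economics", "recreation"})
```

Because the two deactivated coefficients are small in this example, the target stays close to the original x-vector, which matches the observation that the plot changes little.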
The second plot in figure 7.5 shows the projection onto the new target. Overall, it looks very
similar to the previous plot, with most changes occurring among cities with lower population.
association observed in the upper plot.
7.2. Satellite Data
The satellite-oct data set consists of microwave brightness temperatures recorded by
the NIMBUS-7 satellite one October Saturday (Madrid 1978). The data was observed on a
grid of 1000 points, located in a long, narrow strip stretching from the Bering Sea to the North
Pole, for the purpose of investigating the ice cover. Each observation includes location, given
by latitude and longitude, and six microwave variables corresponding to three
frequencies (in GHz) with values for horizontal and vertical polarizations. I name the variables
18-H, 18-V, 22-H, 22-V, 37-H and 37-V. In each name f-p, f and p represent the
frequency and polarization respectively.
Figure 7.6: Viewing the satellite data
Bivariate Scatterplots
Figure 7.6 contains two bivariate scatterplots of the data: 18-V against 22-V and 37-V
against 22-H. The first scatterplot shows two correlated variables, with a dense lump of
points at either end. From plotting either variable against latitude, it is clear that the
clusters on the right and left are located at high and low latitudes, and the remaining points at
middle latitudes. We conclude that the upper cluster represents icy areas, the lower one water,
and the middle points icy water.
The same split into two dominating clusters is evident from the 1d-projections of all the
microwave variables, except for 37-V. In figure 7.6(b), the upper cluster has a distinct rod.
In fact, the rod of points has a strong negative correlation with latitude, implying that cases to
the left on the rod are at higher latitudes and so have deeper ice cover.
4-D Rotations
By doing 4-d rotations with 18-V, 37-V as the two X-variables, and 22-V, 22-H as the
two Y-variables, the cluster representing non-icy locations also shows further structure. The
upper plot in figure 7.7 gives one of the xy-projections obtained.
The x-vector is a contrast of 37-V and 18-V. This separates a small rod of points from the
lower cluster, lying slightly to the right and above the majority. These particular observations
are from a location where there is known to be rough, open water. From left to right in the
horizontal direction we have thick and thin ice, and calm and stormy water. Since thinner ice
has a rougher surface, it seems that the contrast between high and low frequencies
distinguishes between locations with rough and smooth surfaces.
Figure 7.7: 4-D rotations
The second plot in figure 7.7 shows a second of the xy-projections obtained. Here the
y-vector is a linear combination of the two Y-variables, and the horizontal projection is as
before. The projection has changed very little, except that the rod emanating from the lower
cluster is more distinct.
7.3. Principal Components of the Satellite Data
Principal components analysis aims to summarize the data in a small number of dimensions. I
use a principal components analysis of the satellite-oct data to discover its major
sources of variability.
Constructing Principal Components
Let $Z$ be the $n \times p$ data matrix containing the observations in standardized coordinates.
(Assume the standardization step has centered the variables to zero mean.) The principal
components $c_1, c_2, \ldots, c_p$ are defined as follows (Mardia, Kent and Bibby 1982):
$c_1$ maximizes $c^t Z^t Z c = \mathrm{var}(Zc)$, subject to $c^t c = 1$.
Similarly,
$c_j$ maximizes $c^t Z^t Z c = \mathrm{var}(Zc)$, subject to $c^t c = 1$, and $c^t c_k = 0$, for $k = 1, 2, \ldots, j-1$.
The $c_j$, $j = 1, \ldots, p$, are the eigenvectors with decreasing eigenvalues $\mathrm{var}(Z c_j)$ of the matrix
$Z^t Z$. They are calculated using a singular value decomposition of $Z$ (Golub and Van Loan
1983).
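The step from the singular value decomposition to the eigenvectors of $Z^t Z$ can be written out in a few lines. This is the standard argument, restated in the notation above; the symbols $U$, $D$, $V$ and $d_j$ are introduced here for illustration and are not taken from the text, and $n \ge p$ is assumed.

```latex
\begin{align*}
Z &= U D V^t, \qquad U^t U = V^t V = I_p, \qquad
     D = \mathrm{diag}(d_1, \ldots, d_p), \quad d_1 \ge \cdots \ge d_p \ge 0, \\
Z^t Z &= V D \, U^t U \, D V^t = V D^2 V^t, \\
Z^t Z \, c_j &= d_j^2 \, c_j \quad \text{with } c_j \text{ the } j\text{th column of } V, \\
\mathrm{var}(Z c_j) &= c_j^t Z^t Z \, c_j = d_j^2 .
\end{align*}
```

So the columns of $V$, taken in the order of decreasing singular values, serve as the principal components, and the squared singular values give their variances in the convention $\mathrm{var}(Zc) = c^t Z^t Z c$ used above. Working from the decomposition of $Z$ avoids forming $Z^t Z$ explicitly.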
Satellite Data
By executing some code (see section 6.4), we obtain the principal components of the first 6
variables, that is, the microwave readings.
The principal components computed depend on the scales of the variables, and it falls on the
user to choose suitable scales by modifying, if necessary, the pipeline's univariate
transformations (section 3.2). In this case, all microwave readings are measured in the same
units, so I used the same scaling factor for all six variables, preserving their relative scales.
Therefore, the principal components are the eigenvectors of the data's covariance matrix (in
measurement and in standardized coordinates).
The derived variables consist of the principal components $c_1, c_2, \ldots, c_6$, plus $e_1$, $e_2$. The
additional variables $e_1$ and $e_2$ are included so that the derived variables span standardized
coordinate space. (Notice this requires that standardization has centered the variables to a
mean of zero.) The derived variables are orthonormal with regard to the canonical inner
product. For principal components analysis, there is no need for an alternative inner product.
When the calculations are complete, the data viewer redraws its window. There are more
boxes on the r.h.s. for the derived variables. The additional boxes are ordered column-wise,
and labeled pc-1 through pc-6, followed by latitude and longitude.
Principal Component Plots
Suppose the r.h.s. boxes are mouse sensitive. Mouse clicks in the first two r.h.s. boxes select
the bivariate scatterplot of pc-1 and pc-2, as shown in figure 7.8. It contains a sharper
version of the structure displayed in a previous figure. The observations form three rods: a sparse
rod in the vertical direction, with two denser rods at angles at either end. Once again, by
examining the latitude and longitude coordinates, we discover that the rods correspond to
no
Figure 7.8: Principal component plot of satellite data
From the l.h.s. boxes, we see the relative contributions of the original variables to the first two
principal components. By examining the magnitudes of the lines in the horizontal direction,
we notice that the first principal component is a weighted average of the original variables.
The microwave readings are strongly related to the surface temperature. With the range of
latitudes, temperature is a source of variability among the locations. It seems
that the first principal component is a surrogate for surface temperature.
The second principal component has positive coefficients for the 37 GHz measurements, and
negative coefficients for the other frequencies. As was previously remarked, the contrast
between high and low frequencies distinguishes between rough and smooth surfaces.
Figure 7.9: Original variable plots
Original Variable Plots
Suppose the l.h.s. boxes are mouse sensitive and we select a 1d-projection of 22-V, given in
figure 7.9. The r.h.s. boxes display the projections of the principal components onto the
direction of this variable, demonstrating that 22-V lies close to the direction of the third
principal component. In general, such plots may be informative; for example, a variable with
large coefficients for the (p-1)th and pth principal components only has a less important
contribution to the overall variability of the data set.
Variability of Principal Components
Figure 7.10: Variability of principal components
Unlike the data set variables, no standardization is applied to the derived variables prior to
plotting. This means we can graphically compare the relative variabilities of the principal
components by flipping through a series of plots.
In the plot of pc-5 versus pc-6 (see
figure 7.10), all points are concentrated in a solid lump at the center. In fact, pc-6 is a
contrast of the horizontal and vertical measurements at each frequency.
The first principal component accounts for 96% of the variability in the data set, and the first
and second components together for 99%. This is not surprising, because the variables tend to
be correlated. Also, when scanning the space spanned by the 6 microwave readings, a "z"-like
shape dominates throughout. The same shape is clearly evident in the plot of the first two
principal components.
Dynamic Comparisons of Principal Components
The satellite-dec data set is similar to satellite-oct, containing observations
on the same variables at the same locations, but recorded in December. Figure 7.11 shows the
satellite-oct and satellite-dec data, in the upper and lower windows
respectively. The r.h.s. boxes represent the principal components (computed individually) for
both sets of data. The windows are linked by common projection operations, allowing
dynamic comparisons of the principal components. Clicking on the boxes for pc-1 and
pc-2 in the upper window yields a bivariate scatterplot of pc-1 and pc-2 in both
windows.
For both data sets, the first principal component is a weighted average of the microwave
variables and the second component is a contrast of high and low frequencies. It seems that
the sources of variability for microwave measurements do not change much from October to
December. However, the "z" shape is no longer evident in the December plot, since almost all
locations are covered by ice.
7.4. Sphered Satellite Data
"Sphering" the data means transforming it to identity covariance. One motivation for viewing
sphered data is that it has all linear structure removed. Usually, this is the structure that is
easiest to find and understand, but it may be distracting, obscuring non-linear relationships. In
a sphered coordinate system it becomes easier to pick out clusters and non-linear associations.
This is why data is often sphered prior to applying projection pursuit methods (Friedman
1987).
Figure 7.11: Comparing principal components
Constructing Sphered Variables
Take the principal components $c_j$, $j = 1, \ldots, p$, defined in the previous section. Let
$v_j = \mathrm{var}(Z c_j)$, $j = 1, \ldots, p$. Then the directions $c_j / \sqrt{v_j}$ represent sphered variables. I use these
particular sphered variables because they lie along the principal components. The sphered
variables are orthogonal but not orthonormal for the canonical inner product, so the data
viewer makes use of the alternative inner product (see section 6.4).
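A short check shows why these directions give identity covariance and why they fail to be orthonormal. It uses only the eigenvector property $Z^t Z c_k = v_k c_k$ and $c_j^t c_k = \delta_{jk}$ from the previous section, and assumes every $v_j > 0$; the symbol $s_j$ for the sphered directions is introduced here for illustration.

```latex
\begin{align*}
s_j &= c_j / \sqrt{v_j}, \\
s_j^t Z^t Z \, s_k &= \frac{c_j^t Z^t Z \, c_k}{\sqrt{v_j v_k}}
                    = \frac{v_k \, c_j^t c_k}{\sqrt{v_j v_k}} = \delta_{jk}, \\
s_j^t s_k &= \frac{c_j^t c_k}{\sqrt{v_j v_k}} = \frac{\delta_{jk}}{v_j} .
\end{align*}
```

The second line says the derived variables $Z s_j$ are uncorrelated with unit variance. The third line says the $s_j$ are mutually orthogonal but have squared length $1/v_j$ rather than 1, which is the reason an alternative inner product is needed.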
Sphered Variable Plots
Figure 7.12 shows two plots from the satellite-oct data, demonstrating dramatically
the effect of sphering. Each of the first six boxes on the r.h.s. represents a sphered variable,
which are named s-1, s-2, ..., s-6.
The upper plot shows 18-V versus 18-H, and the two variables are highly correlated. From
the r.h.s. boxes we see that these variables point in the same direction in sphered coordinate
space. To sphere the scatterplot, one could imagine stretching it in a direction perpendicular
to the axis which runs through its middle.
When we make the r.h.s. boxes mouse sensitive, the display changes due to the change in
inner product. The second plot of figure 7.12 shows the result. What were the upper and
lower dense portions of a long narrow point cloud are now two parallel clusters. (The A
labels are automatically assigned to the r.h.s. boxes whose variables are not orthogonal to the
projection plane.) Also, the lines drawn in both sets of variable boxes are quite different. All
these differences are due to the changed definition of orthogonality.
Figure 7.13 shows s-2 and s-4, or equivalently, a plot of the second and fourth
standardized principal components. In this particular data set the projection onto the first
principal component has by far the largest variance. Plots of the 3rd to the 6th components
concentrate the points at the window's center (see figure 7.10). One possible remedy uses
viewporting operations to blow up the picture to a reasonable size (section 3.5), but this causes
an unacceptably high amount of clipping for other projections. Sphered coordinates avoid this.
From figure 7.13 we see there is nevertheless considerable structure in the fourth
principal component; the long cluster of points on the left corresponds to the locations with
change of structure would not be evident.
Figure 7.12: Sphering the satellite data
Figure 7.13: Sphered variable plot
The sphered plots of the data viewer give the same information as a biplot of the data matrix
(K.R. Gabriel 1981).
7.5. Discriminant Analysis of Places Data
A discriminant analysis aims to distinguish between groups. It achieves this by producing
linear combinations which separate the group means as much as possible.
I perform a discriminant analysis to discover how the ratings vary across locations. First, I
group cities by location. The groups contain (i) west coast states plus Alaska and Hawaii, (ii)
Rocky mountain states, (iii) mid-west states, (iv) south-west states, (v) south-east states and
(vi) north-east states. Cities in each group are plotted with different glyphs. They are (i) a
square, (ii) a horizontal dash, (iii) a plus sign, (iv) an "x", (v) a triangle and (vi) an open circle.
The discriminant analysis produces five linear combinations of the (untransformed) rating
variables which best separate the groups.
Constructing Linear Discriminants
Let $Z$ be the $n \times p$ matrix containing the cases (in standardized coordinates), and let
$Z_1, \ldots, Z_g$ be the $n_k \times p$ matrices containing the cases belonging to each of the $g$ groups. $z_{ki}$
denotes a $p$-vector, representing case $i$ in group $k$. We use the vectors $m_k$, $k = 1, \ldots, g$, and $m$
for the group and overall means respectively.
The linear discriminants $c_1, \ldots, c_q$, where $q = \min(g-1, p)$, are as follows (Mardia, Kent and
Bibby 1982):
$c_1$ maximizes
$$\frac{\text{between group variance}}{\text{within group variance}} = \frac{\sum_{k=1}^{g} n_k \, [c^t (m_k - m)]^2}{\sum_{k=1}^{g} \sum_{i=1}^{n_k} [c^t (z_{ki} - m_k)]^2}$$
$c_j$ maximizes the same ratio, such that $Zc$ is uncorrelated with each of $Z c_l$, $l = 1, \ldots, j-1$.
For computational details, see Chambers (1977, section 5k).
Discriminant Variable Plots
The upper plot in figure 7.14 shows the different groups. The lower plot displays a projection
obtained by performing 3-d rotations in the space spanned by the first three discriminants.
This projection gives good separation of the west coast and north-east states. For a clearer
presentation of the two groups, I marked them with large squares and open circles
respectively, and used invisible glyphs for cities in all other regions. The west coast and
north-east cities form two clusters separated in the horizontal direction. The l.h.s. boxes show
which rating variables contribute to the separation. For example, we see that west coast cities
have better climate, but poorer health care and education.
It is actually not so straightforward to visually pick out projections which distinguish between
subsets. In a single scatterplot where the groups overlap, different glyphs (or colors) do not
give us an immediate perception of the group locations. Alternagraphic methods (Tukey
1973), where subsets are displayed in turn, attempt to solve this problem. In conjunction with
motion, such methods help in finding projections which distinguish groups. For the places
example, I found that the west coast and north-east cities were well-separated in the space of
the first three discriminants, by displaying just two subsets at a time as the points moved.
Figure 7.14
CHAPTER 8
DESIGNING THE DATA VIEWER
In this chapter, I discuss the design of the data viewer, that is, the procedures and data
structures which constitute the program. The discussion is of interest to future designers of
similar programs. It is of relevance to users also, providing a user-model for how the program
works (Foley and Van Dam 1982). Equipped with a clear model of the program, a user can
operate it more efficiently and creatively.
8.1. Object-Oriented Programming
The data viewer program is implemented in Flavors (Cannon 1982), an extension to Lisp.
This language supports a style of programming called object-oriented. Flavors had a strong
influence on how I approached the design problem.
The data viewer program is organized around a collection of objects. The user-model consists
of a description of these objects and their interrelationships. The user-model itself does not
depend on the choice of language; the same user-model could equally well apply to a data
viewer implemented in Fortran. However, an object-oriented language simplifies
implementation, because it contains tools for building abstractions which behave like the
objects in the model. I will describe the data viewer design using the terminology of
object-oriented programming.
Object-oriented programming is based on the concept of an object. At the simplest level, an
object is a data structure, similar to a record in Pascal or a structure in S. Corresponding to
the fields of a record, an object has instance variables. The instance variables determine the
state of an object. The main difference between objects and Pascal records is that
communication with their contents is achieved by a mechanism called message passing.
Messages can change the state of an object by modifying its instance variables.
Object-oriented languages have special properties that distinguish them from more traditional
languages such as Fortran or Pascal. (i) They provide a means for building abstractions and
(ii) messages act like generic functions, in that different types of object can respond to the
same message. Both of these properties encourage a highly modular style of programming. I
will demonstrate this using some examples.
Data Sets
There are many potential representations for data sets. 2-D arrays of cases by variables are a
common choice, but a list of lists, say, could be used just as well.
As I explained in section 2.1, we regard data sets as collections of cases. In the data viewer
program, I use a representation for data sets, data-set objects, which is close to this
description. A data-set has an instance variable containing a list of case objects. For
example, places and st.helens are both data-set objects. This representation for
data sets is due to McDonald (1986).
The advantage of using objects to represent data sets is that the internal implementation may
be hidden from the user. The set of messages understood by an object defines its interface
with the outside world. If they agree on a protocol, the programmer may change the internal
representation of an object without the user's knowledge. Therefore, objects and message passing
provide a mechanism for abstraction.
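The point about hidden representations can be illustrated with a minimal sketch, written in Python rather than Flavors, with method calls standing in for messages. The class names, the `values` message and the numbers are hypothetical; they are not the data viewer's actual protocol.

```python
class CaseListDataSet:
    """Holds a list of case records, close to the data-set objects described above."""
    def __init__(self, cases):
        self._cases = list(cases)            # each case maps variable name to value

    def values(self, variable):
        return [case[variable] for case in self._cases]


class ArrayDataSet:
    """Holds the same data as a 2-D array of cases by variables."""
    def __init__(self, variables, rows):
        self._variables = list(variables)
        self._rows = [list(row) for row in rows]

    def values(self, variable):
        column = self._variables.index(variable)
        return [row[column] for row in self._rows]


def mean(data_set, variable):
    """Client code: relies only on the agreed `values` message."""
    column = data_set.values(variable)
    return sum(column) / len(column)


as_cases = CaseListDataSet([{"climate": 521, "housing": 6200},
                            {"climate": 575, "housing": 8138}])
as_array = ArrayDataSet(["climate", "housing"], [[521, 6200], [575, 8138]])
```

`mean` gives the same answer for both objects, so the representation could be swapped without touching any client code.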
Data Viewer Windows
A data viewer window, like the editor or system window, is represented by an object. The
instance variables of a window object determine its position and appearance on the screen.
Window objects respond to messages for refreshing their display, and for moving or reshaping
them on the screen.
A DV-window object represents each data viewer window. A DV-window object has an
instance variable named display-list, containing variable-boxes, control-panel,
scatterplot, title and plot-interaction-menu. Each item in the
display list is an object, responding to a common set of messages, but in different ways. For
example, executing the following code draws the contents of a DV-window display by
sending each object in the display list a :draw message.
(loop for item in display-list do
  (send item :draw))
The :draw message acts like a generic function, dispatching on the type of display-list
item. This feature of object-oriented programming simplifies implementation because
only minimal changes to existing code are necessary to add new items to the
display-list.
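The modularity claim can be illustrated outside Flavors. In this minimal Python sketch, method calls stand in for messages; the class names Scatterplot, ControlPanel and Legend and the strings they return are hypothetical stand-ins for real drawing code.

```python
class Scatterplot:
    def draw(self):
        return "scatterplot"


class ControlPanel:
    def draw(self):
        return "control panel"


class Legend:
    """A new kind of display item; no existing code changes to support it."""
    def draw(self):
        return "legend"


def draw_window(display_list):
    # Counterpart of the loop above: every item receives the same message
    # and responds in its own way.
    return [item.draw() for item in display_list]


display_list = [Scatterplot(), ControlPanel()]
before = draw_window(display_list)
display_list.append(Legend())
after = draw_window(display_list)
```

Adding the new item required one new class and one new list entry; `draw_window` itself was not edited.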
8.2. A Simple User-Model
Dependent objects are the main ingredient of the data viewer's user model. Figure 8.1 is a
dependency diagram, showing a simplified version of the model.
In figure 8.1, there are boxes representing DV-window and data-set objects. The
directed line connecting the boxes indicates that the DV-window depends on the
data-set, in that its display always shows the data-set's current state. I term this dependency
a display constraint. Since it is a one-way dependency, the data-set is completely
independent of the DV-window. In practice, DV-window has an instance variable
referring to the data-set, so I call data-set a component object of DV-window.
DV-window ---> data-set
Figure 8.1: Simple data viewer model
DV-window 1 ---> data-set
DV-window 2 ---> data-set
Figure 8.2: Sharing data sets
In section 3.6, I discussed linked data viewer windows. We saw how two data viewer
windows are linked when they show overlapping data (sub)sets. In the dependency diagram
of figure 8.2, two DV-windows depend on a single data-set.
The display constraint implies that when any property of the data-set changes, the display
changes so as to show the new state. For example, executing the code
(send New-York :set-shape :big-triangle)
changes the shape property of the case object New-York in the places data. Then, all
glyphs representing New-York are redrawn with a big triangular shape. Notice that this
involves redraws in all windows showing the case New-York. In fact, a display
constraint is a convenient model for interactive brushing (Stuetzle 1986), but this is not
currently available in the program under discussion.
8.3. A Detailed User-Model
A DV-window object produces a view by applying a sequence of transformations to the
cases. The particular view displayed is determined by the pipeline parameters (chapter 3).
The simple version of the user model does not describe the dependency of the view on these
parameters.
Transformation Objects
I use distinct transformation objects to hold the parameters for each pipeline operation. Like
the data-set, each transformation object is a component object of DV-window. The
objects are independent of each other, so they may be developed and implemented separately,
giving the program a modular design. The transformation objects are
• a univariate-transformation object
This contains a function, such as log or inverse, and a center and scale factor for each variable.
The object's depends on the data-set, and may be
A user can modify the contents of a univariate-transformation object in either of
two ways: by selecting items on a menu, or by executing some code. Modification by menu
is easier for the user; code execution offers more flexibility. Suppose u-t is a
univariate-transformation object. Then, the following sends a message to
u-t, to ensure that the variables 18-H, 18-V, 22-H, 22-V, 37-H and 37-V are
scaled by a common factor of 250.
(send u-t :scale-by 250
      '(:18-H :18-V :22-H :22-V :37-H :37-V))
The : group-variables message ensures that the same standardization procedure is
applied to more than one variable.
(send u-t :group-variables
      '(:18-H :18-V :22-H :22-V :37-H :37-V))
Then, the data-dependent center and scale factors are computed over all six variables.
• a projection-engine object
The current projection plane is held in an instance variable of the projection-engine.
Additional instance variables represent the six path parameters of section 6.3, namely (i) the
motion switch, (ii) speed, (iii) the interpolation switch, (iv) step-size, (v) the target generator
and (vi) the variable box labels. Sending a :next-plane message to a
projection-engine results in a projection plane update, as determined by the path parameters.
• a density-estimator object
This object has instance variables which are the number of bins and a smoothing parameter
(see section 3.4).
• a viewporter object
This has instance variables specifying the affine transformation from plot to screen
coordinates. In a fashion similar to that described in Buja et al (1981), the parameters are
changed by depressing mouse buttons.
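Returning to the :group-variables message of the univariate-transformation object, the effect of grouping can be sketched as follows. This is Python, not the Flavors implementation, and it assumes that the data-dependent center is the mean and the scale is the standard deviation, which the text does not state; the function name and the data are hypothetical.

```python
from statistics import fmean, pstdev

def standardize(columns, groups=()):
    """columns maps a variable name to its values; each group shares one center and scale."""
    group_of = {name: tuple(group) for group in groups for name in group}
    result = {}
    for name, values in columns.items():
        members = group_of.get(name, (name,))
        # Pool the values of every variable in the group before computing
        # the center and scale, so that relative scales are preserved.
        pooled = [v for member in members for v in columns[member]]
        center, scale = fmean(pooled), pstdev(pooled)
        result[name] = [(v - center) / scale for v in values]
    return result

columns = {"18-H": [100.0, 140.0], "18-V": [180.0, 220.0]}   # hypothetical readings
separate = standardize(columns)
grouped = standardize(columns, groups=[("18-H", "18-V")])
```

Standardized separately, both variables collapse to the same values and the offset between the two polarizations is lost. Standardized as a group, the offset survives in the common units.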
Figure 8.3 shows the complete dependency diagram for the data viewer. Each of DV-window's
five component objects, the data-set and four transformation objects, is
represented by a box. A directed line is interpreted as before; DV-window's display is
constrained to show the current state of its five component objects. Also, the
univariate-transformation object depends on a data-set, which is not
necessarily the data-set displayed in the window. This is useful for
the univariate transformations.
DV-window ---> data-set
DV-window ---> univariate-transformation ---> data-set *
DV-window ---> projection-engine
DV-window ---> histogram-estimator
DV-window ---> viewporter
Figure 8.3: Full data viewer model
The main ideas of the data viewer program (motion, linking and interaction) have convenient
explanations in terms of the dependency diagram. Thus, it provides a user's model.
Motion
Motion is a direct consequence of the display constraint. The constraint requires that the
displayed view changes when the parameters held in any of the transformation objects change.
When the projection-engine, for example, changes continuously producing a
sequence of planes, the window shows a continuously moving scatterplot, with moving lines
appearing in the variable boxes.
Linking
Two (or more) DV-windows are linked when any of the five component objects are
common to both windows. Recall the example of linked windows in section 2.6. There were two
windows, both showing observations from the St. Helens data-set, and they are
linked by projection.
DV-window 1 ---> St.Helens, projection-engine
DV-window 2 ---> St.Helens, projection-engine
Figure 8.4: Sharing data sets and projection engines
Figure 8.4 gives the dependency diagrams for these windows. (For clarity, only the common
component objects
in both windows.
The user can construct windows linked by projection as follows. Suppose
st.helens-viewer is a DV-window for the St. Helens data. Executing the following code creates
a second DV-window for the st.helens-dense subset which is linked to the first.
(make-DV-window :data-set st.helens-dense
                :projection-engine
                (send st.helens-viewer :projection-engine))
Making multiple DV-windows depend on the same objects is a convenient way to achieve
linking. Since each window satisfies its own display constraint, the linked windows
themselves are totally unaware of each other. By following this design, windows
can be linked with arbitrary combinations of component objects at no extra cost to the user or
programmer.
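Linking by shared components can be sketched in a few lines. The following is a Python illustration, not the Flavors code; the integer plane counter and the `view` method are hypothetical stand-ins for the real projection plane and display.

```python
class ProjectionEngine:
    def __init__(self):
        self.plane = 0                      # stand-in for the current projection plane

    def next_plane(self):
        self.plane += 1


class DVWindow:
    def __init__(self, data_set, projection_engine=None):
        self.data_set = data_set
        # A window builds its own engine unless it is handed an existing one.
        if projection_engine is None:
            projection_engine = ProjectionEngine()
        self.projection_engine = projection_engine

    def view(self):
        # What the window would display: its data under the engine's current plane.
        return (self.data_set, self.projection_engine.plane)


first = DVWindow("st.helens")
second = DVWindow("st.helens-dense", projection_engine=first.projection_engine)
first.projection_engine.next_plane()
```

Neither window refers to the other. Both simply read the same engine, so one plane update changes what each of them would display.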
Interaction
Data-sets and transformation objects may have displayed representations. When any of
the component objects are shared among data viewer windows, they may have multiple
displayed representations. Clicks on a mouse sensitive displayed representation of an object
send a message which changes its state. For example, the control-panel and the
variable-boxes display the current state of the path parameters contained in the
projection-engine. Section 6.2 explained how mouse clicks in these areas change the
parameters. Then, the display constraint demands that all displayed representations of the
modified object be updated.
8.4. Implementation
This section gives a brief discussion of some implementation issues.
Processes
The Symbolics Lisp environment supports multi-tasking, where a number of processes appear
to run simultaneously. This is achieved by the system scheduler, which allocates time
slices to processes in turn.
Windows provided as part of the environment, such as file system windows, typically have
their own individual processes. Similarly, each DV-window has a process. A window's
process is responsible for displaying its items and handling its user inputs. This is a natural
way of handling multiple DV-windows, allowing changes to occur in all windows
simultaneously.
Since the transformation objects are components of DV-window, changes to these objects
usually take place within the window's process. For example, when the
projection-engine's motion switch is on, it produces a sequence of planes, due to the DV-window
repeatedly sending the :next-plane message. However, the same
projection-engine could be a component of arbitrarily many DV-windows. This leads to problems
caused by multiple processes simultaneously accessing the same object. I side-step this
difficulty by not permitting the scheduler to interrupt the :next-plane operation. A more
general solution would use locks attached to the shared data structure.
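The lock-based alternative mentioned here was not implemented in the data viewer. A minimal sketch of the idea in Python, using threads in place of Lisp machine processes, might look like this; the step counter stands in for the real plane update and all names are hypothetical.

```python
import threading

class ProjectionEngine:
    def __init__(self):
        self._lock = threading.Lock()       # lock attached to the shared object
        self.step = 0

    def next_plane(self):
        # Only one window process at a time may run the read-modify-write update.
        with self._lock:
            current = self.step
            self.step = current + 1


engine = ProjectionEngine()

def window_process(updates):
    for _ in range(updates):
        engine.next_plane()

threads = [threading.Thread(target=window_process, args=(10000,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

With the lock in place, no update is lost however the scheduler interleaves the four processes, and other objects remain free to be used concurrently.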
The Display Constraint
This implementation is a generalization of the scheme suggested by Stuetzle (1986), for
scatterplot brushing.
The display constraint requires that the DV-window always shows the current state of the
transformation objects and the data-set. That is, DV-window must always be
up-to-date. The DV-window's process executes an eternal loop, continually checking whether or
not the display remains up-to-date. If not, the display is updated. With this arrangement, the
display is redrawn only when necessary. This is important for the application of moving
scatterplots; with multiple data viewer windows on the screen, unnecessary scatterplot draws
would seriously impede performance.
Once again, we see the use of generic messages; the DV-window updates itself by sending
each member of its display-list an :update message. A DV-window is up-to-date
if each item in its display-list responds t (for true) to an :up-to-date? message.
The display-list item may itself have a display-list. For example,
variable-boxes has a display-list containing each individual variable-box.
In this case it handles the :up-to-date? message by passing the query on to its own
display-list items.
Otherwise, a display-list item is out-of-date if it is a screen representation for a
changed object. The scatterplot is out-of-date when any of the component objects have
changed. A variable box label is out-of-date when the labels in the projection-engine
are modified. Changes are determined by comparing time stamps. The component objects
record the time of their most recent change.
Other schemes for implementing one-way dependencies are discussed in McDonald (1986).
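The time stamp comparison described in this section can be sketched as follows. This is a Python illustration under stated assumptions, not the Flavors implementation: a logical counter replaces the real clock, and the class and method names are hypothetical.

```python
import itertools

_clock = itertools.count(1)                 # logical clock: later events get larger stamps

class Component:
    """A data-set or transformation object that remembers when it last changed."""
    def __init__(self, state):
        self.state = state
        self.changed_at = next(_clock)

    def set_state(self, state):
        self.state = state
        self.changed_at = next(_clock)


class DisplayItem:
    """A screen representation of one or more components."""
    def __init__(self, components):
        self.components = components
        self.drawn_at = 0                   # never drawn yet
        self.draw_count = 0

    def up_to_date(self):
        return all(self.drawn_at > c.changed_at for c in self.components)

    def update(self):
        self.draw_count += 1                # stand-in for the actual redraw
        self.drawn_at = next(_clock)


def refresh(display_list):
    """One pass of the window's loop: redraw only the stale items."""
    for item in display_list:
        if not item.up_to_date():
            item.update()


data_set = Component("places")
engine = Component("plane 0")
scatterplot = DisplayItem([data_set, engine])
title = DisplayItem([data_set])
refresh([scatterplot, title])               # first pass draws both items
refresh([scatterplot, title])               # nothing changed, nothing is redrawn
engine.set_state("plane 1")
refresh([scatterplot, title])               # only the scatterplot is stale
```

After the three passes the scatterplot has been drawn twice and the title once, which is the saving the scheme is meant to deliver.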
The Pipeline
Speed is an important issue in the production of scatterplots moving in real-time; the rate at
which new plots appear on the screen determines how clearly we perceive the motion. With
the computational power of the Symbolics 36xx Lisp machine at my disposal, the naive
approach of performing each pipeline transformation (see figure 3.1) anew on the data to
produce every plot is far too slow. The data viewer obtains significant speed-ups by:
• Using integer arithmetic
The univariate transformations are applied to the data and the results cached, until the
data set itself or the univariate transformations are modified. The transformed data is
converted to 16 bit integers, so that all further computations use integer arithmetic. The
loss in accuracy is negligible given the resolution of the screen, but the speed gain is
considerable, particularly since our Lisp machines do not have floating point hardware.
• Using geodesic interpolation
With geodesic interpolation between planes (section 4.5), the computation time is almost
independent of the number of variables. For each geodesic segment, cases are projected
onto R^4, so that most plots require only a projection from R^4 to R^2.
• Drawing onto the bitmap
The Lisp machine environment provides the usual set of graphics primitives, but these
are too slow for real-time scatterplot motion. Instead, the data viewer scatterplots are
produced by writing directly into the screen array holding the bitmap. Timing
experiments (performed by J. McDonald) have shown that this reduces the time for a
scatterplot draw by a factor of 5.
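The three speed-ups above can be combined in a small sketch: data are quantized once to the 16-bit range, each case in R^4 is projected to R^2 with integer arithmetic only, and points are written straight into an array standing in for the screen bitmap. This is Python rather than Lisp, and `SCALE`, `quantize`, `project` and `draw` are hypothetical names; the fixed-point scaling shown is one plausible choice, not necessarily the one the data viewer uses.

```python
SCALE = 32767  # largest positive 16-bit signed integer


def quantize(values, lo, hi):
    """Map floats in [lo, hi] to integers in [-SCALE, SCALE]; done once and cached."""
    span = hi - lo
    return [round((2 * (v - lo) / span - 1) * SCALE) for v in values]


def project(case, x_axis, y_axis):
    """Project one integer case onto two integer axis vectors.

    Axis entries are fixed-point fractions scaled by SCALE, so a single
    integer division rescales each dot product. The intermediate products
    need more than 16 bits; Python integers grow as required.
    """
    x = sum(c * a for c, a in zip(case, x_axis)) // SCALE
    y = sum(c * a for c, a in zip(case, y_axis)) // SCALE
    return x, y


def draw(bitmap, points, size):
    """Set one pixel per point directly in a size-by-size array of 0/1 rows."""
    for x, y in points:
        col = (x + SCALE) * (size - 1) // (2 * SCALE)
        row = (SCALE - y) * (size - 1) // (2 * SCALE)
        # Clip points that fall outside the plotting square.
        if 0 <= row < size and 0 <= col < size:
            bitmap[row][col] = 1


# Usage: project two 4-d cases onto the plane of the first two variables.
x_axis = [SCALE, 0, 0, 0]
y_axis = [0, SCALE, 0, 0]
cases = [[SCALE, -SCALE, 5, 7], [-SCALE, SCALE, 0, 0]]
screen = [[0] * 8 for _ in range(8)]
draw(screen, [project(c, x_axis, y_axis) for c in cases], 8)
```

Only `quantize` touches floating point numbers, and it runs once per change to the data or the univariate transformations; every new plot then costs integer multiplications, additions and array writes.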
The data viewer program displays projections of 500 cases at a rate of roughly 12 per
second. For this calculation, the plots were drawn on the monochrome screen and each glyph
consisted of 4 pixels. (Single pixel points are too small for comfortable viewing and are only
slightly faster.) Even with direct access to the bitmap, the erase/draw step is responsible for
most (about 2/3) of the time required to display a new projection.
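The quoted figures imply a rough time budget per projection. The helper below is only arithmetic on the numbers above (12 projections per second, 500 cases, about 2/3 of the time in erase/draw); `frame_budget` is a hypothetical name introduced for the illustration.

```python
def frame_budget(frames_per_second, cases, draw_fraction):
    """Return (ms per frame, ms in erase/draw, ms for the rest, glyphs per second)."""
    frame_ms = 1000.0 / frames_per_second
    draw_ms = frame_ms * draw_fraction
    return frame_ms, draw_ms, frame_ms - draw_ms, frames_per_second * cases


frame_ms, draw_ms, other_ms, glyphs = frame_budget(12, 500, 2 / 3)
```

With these inputs each projection takes about 83 ms, of which roughly 56 ms goes to erasing and drawing and about 28 ms to everything else, for a throughput of 6000 glyphs per second.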
CHAPTER 9
CONCLUSION
The goal of this research was to devise and implement ways of exploring multivariate data
based on motion through sequences of projections. This demands a highly interactive
program, so that the user may control the exploration process. The key issues involved in the
development of the resulting data viewer program were:
• guided tours
I found that extending a grand tour algorithm based on geodesic interpolation between
2-planes (Buja and Asimov 1985) gave a general paradigm for constructing guided tours
of data: interpolation between consecutive elements of a user-determined sequence of
planes. I developed four aids for plane selection: (i) constraints, (ii) methods for
producing sequences of planes, (iii) methods for controlling rotating 4-d point clouds,
and (iv) derived variables. These aids put many schemes for exploration of multivariate
data at the user's disposal, from existing methods such as 3-d rotations and the grand
tour, to new methods such as local scan.
• the user-interface
A graphical interface is suitable for the data viewer. Parameters controlling the motion
have displayed representations, so that the user is always aware of their current state.
Using the mouse, the user can make choices quickly, which is particularly important for
interactive applications.
• the program design
The data viewer design consists of objects, related by a display constraint. I chose this
design because it provided a basis for implementing linked data viewer windows, useful
for the important applications of comparing and relating plots.
As is, the data viewer program contains a fairly comprehensive set of tools for constructing
sequences of projections. However, even within the scope of its paradigm for paths of planes,
there are many areas deserving more work. At present, all interpolation proceeds along a
geodesic path connecting pairs of planes, but alternatives should be considered. Secondly,
data-dependent target generators, as described briefly in section 5.3, merit further attention.
BIBLIOGRAPHY
Asimov, D. (1985)
"The Grand Tour: A Tool for Viewing Multidimensional Data", SIAM Journal on
Scientific and Statistical Computing, vol. 6(1), p. 128-143.
Becker, R. A., Chambers, J. M. (1984)
S: An Interactive Environment for Data Analysis and Graphics, Belmont, CA:
Wadsworth.
Becker, R. A., Cleveland, W. S. (1987)
"Brushing Scatterplots", Technometrics, vol. 29, no. 2.
Becker, R. A., Cleveland, W. S., Wilks, A. R. (1986)
"High-Interaction Graphics for Data Analysis", Technical memorandum, AT&T Bell
Laboratories.
Buja, A., Asimov, D. (1985)
"Grand Tour Methods: An Outline", Computer Science and
Statistics: Proceedings of the 17th Symposium on the Interface, Amsterdam: Elsevier.
Buja, A., Asimov, D., Hurley, C., McDonald, J. A. (1987)
"Elements of a Viewing Pipeline for Data Analysis", in Dynamic Graphics for Statistics,
eds. W. S. Cleveland and M. E. McGill, Monterey, CA: Wadsworth.
Buja, A., Hurley, C., McDonald, J. A. (1986)
"A Data Viewer for Multivariate Data", Computer Science and Statistics: Proceedings of
the 18th Symposium on the Interface.
Cannon, H. I. (1982)
"Flavors: A Non-Hierarchical Approach to Object-Oriented Programming", manuscript
from Symbolics Inc., 5 Cambridge Center, Cambridge, Mass.
Chambers, J. M. (1977)
Computational Methods for Data Analysis, Wiley, New York.
Donoho, A. W., Donoho, D. L., Gasko, M. (1985)
MacSpin: Graphical Data Analysis Software, D2 Software, Austin, Texas.
Donoho, D. L., Huber, P. J., Ramos, E., Thoma, M. (1982)
"Kinematic Display of Multivariate Data", Proceedings of the Third Annual Conference
and Exposition of the National Computer Graphics Association.
Fisherkeller, M. A., Friedman, J. H., Tukey, J. W. (1974)
"PRIM-9, An Interactive Multidimensional Data Display and Analysis System",
Proceedings of the Pacific ACM Regional Conference.
Foley, J. D., Van Dam, A. (1982)
Fundamentals of Interactive Computer Graphics, Addison-Wesley Publishing
Company, Reading, Massachusetts.
Fowlkes, E. B. (1971)
"Users Manual for an On-Line Interactive System for Probability Plotting on the DDP
24 Computer", Technical Memorandum, Bell Laboratories.
Friedman, J. H. (1987)
"Exploratory Projection Pursuit", Journal of the American Statistical Association 82.
Friedman, J. H., McDonald, J. A., Stuetzle, W. (1982)
"An Introduction to Real Time Graphics for Analyzing Multivariate Data", Proceedings
of the Third Annual Conference and Exposition of the National Computer Graphics
Association.
Friedman, J. H., Tukey, J. W. (1974)
"A Projection Pursuit Algorithm for Exploratory Data Analysis", IEEE Trans. Comp.,
C-23, 881-890.
Gabriel, K. R. (1981)
"The Biplot-Graphic Display of Multivariate Matrices for Inspection of Data and
Diagnosis", in Interpreting Multivariate Data, ed. V. Barnett, Wiley, London.
Golub, G. H., Van Loan, C. F. (1983)
Matrix Computations, Johns Hopkins University Press, Baltimore, Maryland.
Huber, P. J. (1985)
"Projection Pursuit", Annals of Statistics, vol. 13, no. 2.
Mardia, K. V., Kent, J. T., Bibby, J. M. (1982)
Multivariate Analysis, Academic Press.
Madrid, C. R. (Ed.) (1978)
The Nimbus User's Guide, Greenbelt, MD: NASA Goddard Space Flight Center.
McDonald, J. A. (1982)
"Interactive Graphics for Data Analysis", PhD thesis, Stanford University.
McDonald, J. A. (1986)
"Antelope: Data Analysis with Object-Oriented Programming and Constraints",
Statistics Department Technical Report, no. 89, University of Washington, Seattle.
McNally, R. (1986)
Places Rated Almanac.
Ryan, T. A., Joiner, B. L., Ryan, B. F. (1976)
Minitab Student Handbook, Duxbury Press.
Scott, D. W. (1985)
"Average Shifted Histograms: Effective Non-Parametric Density Estimation in Several
Dimensions", Annals of Statistics 13, p. 1024-1040.
Stuetzle, W. (1986)
"Design and Implementation of Plot Windows", Proceedings of the Statistical
Computing Section, American Statistical Association, pp. 32-40.
Symbolics (1983)
3600 Technical Summary, Symbolics Inc., 5 Cambridge Center, Cambridge, Mass.
Tukey, J. W. (1973)
"Some Thoughts on Alternagraphic Displays", Department of Statistics
Technical Report No. 45, series 2, Princeton University.
Wong, Y.-C. (1967)
"Differential Geometry of Grassmann Manifolds", Proceedings of the National Academy
of Sciences, vol. 57, p. 589.
VITA
Catherine Brid Hurley was born on January 7, 1962 in Cork, Ireland. She
attended Colaiste Muire, Cobh, Co. Cork and later University College Cork,
graduating in 1982 with a B.Sc. in Computer Science and Statistics. In 1984, she
obtained an M.S. in Statistics at the University of Washington.