THE DATA VIEWER: A PROGRAM FOR GRAPHICAL DATA ANALYSIS

by

Catherine Hurley

TECHNICAL REPORT No. 115

December 1987

Department of Statistics, GN-22

University of Washington

Seattle, Washington 98195 USA

University of Washington

Abstract

THE DATA VIEWER: A PROGRAM FOR GRAPHICAL DATA ANALYSIS

by Catherine Hurley

Chairperson of Supervisory Committee: Associate Professor Andreas Buja

Department of Statistics

This thesis describes some new graphical methods for analyzing high-dimensional data based

on motion through sequences of projections.

3-D point cloud rotations provide the canonical example of such methods. The grand tour

(Asimov 1985) is a recent attempt at devising a higher-dimensional analog. It is insufficient

for data analysis purposes because the motion paths are randomly constructed, and so do not

allow for the possibility of user-control.

I propose some methods for guided tours, offering the data analyst control over the sequence

of projections. An idea from the grand tour contributes the basic building block for the new

methods: geodesic interpolation between pairs of target planes. Then, user-selected target

planes yield a guided tour.

Guided tours are implemented in the data viewer program, on a Symbolics 36xx Lisp

machine. By nature they demand a highly interactive program; I found a graphical interface

to be appropriate for controlling the moving scatterplot. Object-oriented programming

facilitates the system design, and provides a convenient mechanism for linking plots in very

general ways.

TABLE OF CONTENTS

Chapter 1: Introduction
1.1 Moving Scatterplots
1.2 Some Background
1.3 Computing Environment
1.4 Outline

Chapter 2: Data Viewer Examples
2.1 Viewing the St. Helens Data Set
2.2 The Data Viewer Window
2.3 Bivariate Scatterplots
2.4 3-D Point Cloud Rotations
2.5 Density Estimates
2.6 Linked Data Viewer Windows

Chapter 3: The Data Viewer Pipeline
3.1 Pipeline Operations
3.2 Univariate Transformations
3.3 Projections
3.4 Density Estimation
3.5 Viewporting
3.6 Multiple Pipelines

Chapter 4: Paths of Planes
4.1 Smooth Paths of Planes
4.2 3-D Rotations
4.3 The Grand Tour
4.4 Guided Tours
4.5 Interpolation between Pairs of Planes

Chapter 5: Methods for Guided Tours
5.1 Constraints on Planes
5.2 Sequences of Targets
5.3 Data Dependent Paths

Chapter 6: The User Interface
6.1 Functionality
6.2 Choosing Targets
6.3 Rotating 3-D and 4-D Point Clouds
6.4 Derived Variables

Chapter 7: Applications
7.1 Places Data
7.2 Satellite Data
7.3 Principal Components of the Satellite Data
7.3 Sphered Satellite Data
7.4 Discriminant Analysis of Places Data

Chapter 8: Designing the Data Viewer
8.1 Object-Oriented Programming
8.2 A Simple User-Model
8.3 A Detailed User-Model
8.4 Implementation

Chapter 9: Conclusion

Bibliography

LIST OF FIGURES

1.1 The Lisp machine screen
2.1 A data viewer window
2.2 A variable box
2.3 Bivariate scatterplot
2.4 A 3-variable subspace
2.5 3-D rotations
2.6 Density estimate
2.7 Linked data viewer windows
3.1 The pipeline
3.2 Pipelines for linked windows
6.1 A data viewer window
6.2 Changing shape of the mouse cursor
6.3 The control panel
7.1 Bivariate scatterplots of places data
7.2 Connecting bivariate scatterplots
7.3 Connecting bivariate scatterplots
7.4 Density estimate
7.5 A correlation tour
7.6 Viewing the satellite data
7.7 4-D rotations
7.8 Principal component plot of satellite data
7.9 Original variable plot
7.10 Variability of principal components
7.11 Comparing principal components
7.12 Sphering the satellite data
7.13 Sphered variable plot
7.14 Discriminant analysis of places data
8.1 Simple data viewer model
8.2 Sharing data sets
8.3 Data viewer model
8.4 Sharing data sets and projection engines

ACKNOWLEDGEMENTS

As the last surviving member of the "Irish house", let me send off a big thank

you to all my far-flown housemates, who provided me with friendship and

encouragement over the last few years.

To everyone in the statistics department, thank you. I enjoyed my stay at the U,

and I'll miss those regular trips to the "Bois"!

I would like to acknowledge the help of the Bellcore Statistics group, where I

spent a worthwhile few months. I even had a Lisp machine all of my own!

A very sincere thank you to John McDonald, whose energetic red pen drove this

thesis out the door.

Lastly, I gratefully acknowledge the help and encouragement of my long-

distance thesis advisor, Andreas Buja, who inspired much of my work on the

"data viewer". Thanks so much, Andreas!

Researchcontract DE,-F(:;Uc>-8,5ER2'SUlJ6

under

CHAPTER 1

INTRODUCTION

We analyze data to obtain information. Graphical methods are a major data analysis tool;

they are invaluable for obtaining and presenting information. We use histograms and

scatterplots routinely to display univariate and bivariate data. Effective display of

multivariate data presents us with more of a challenge, because we are limited to two

dimensional plotting surfaces.

In this thesis, I discuss some new methods for displaying multivariate data. These methods

are implemented in an interactive program I call the data viewer.

1.1. Moving Scatterplots

To display multivariate data, we usually resort to projections onto low-dimensional subspaces.

For example, we plot principal components, or the residuals from a regression analysis.

Motion graphics gives us one way to increase the information content of a display.

A moving scatterplot is a scatterplot whose points change position over time. The illusion of

a continuously moving scatterplot is created for the viewer when new scatterplots appear in

quick succession on the screen. There are two good reasons for using moving scatterplots in

exploratory data analysis:

• The information in a moving scatterplot is far more than that displayed in the sequence of

static plots. A reasonably smooth motion permits us to connect plots, that is, follow

points as their position changes through the sequence of plots.

• With motion we can perceive a third dimension. A 3-d point cloud rotation is a moving

scatterplot obtained by projecting observations onto a sequence of 2-d subspaces in $\mathbb{R}^3$.

Our visual system interprets the result as a rotating 3-d object. Even though a single

scatterplot shows only two dimensions, with sequences of scatterplots we perceive the

full 3-d structure of the point cloud.
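The idea of a 3-d rotation as a sequence of 2-d projections can be sketched in a few lines. This is only an illustration in Python, not the data viewer's Lisp code; the function names `rotated_basis`, `project` and `rotation_frames`, and the choice of a rotation about the vertical axis, are assumptions made for the example.

```python
import math

def rotated_basis(angle):
    """Orthonormal basis (a_x, a_y) of the viewing plane after turning the
    point cloud by `angle` radians about the vertical axis (an assumed axis)."""
    a_x = (math.cos(angle), 0.0, math.sin(angle))
    a_y = (0.0, 1.0, 0.0)
    return a_x, a_y

def project(point, basis):
    """Project one 3-d point onto the plane spanned by the basis vectors."""
    a_x, a_y = basis
    x = sum(a * c for a, c in zip(a_x, point))
    y = sum(a * c for a, c in zip(a_y, point))
    return (x, y)

def rotation_frames(points, n_frames, step):
    """One static scatterplot per frame; shown in quick succession,
    the frames form a moving scatterplot."""
    return [[project(p, rotated_basis(k * step)) for p in points]
            for k in range(n_frames)]
```

With a small `step`, consecutive frames differ only slightly, which is what lets the viewer follow individual points from plot to plot.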

Unfortunately, we cannot hope to perceive $p$-space for $p > 3$. Nevertheless, projections onto

sequences of 2-d subspaces of $\mathbb{R}^p$ have applications in data analysis. If the observations lie in

a low-dimensional manifold (for example, forming clusters, curves, or surfaces), this is readily

recognized from the moving scatterplot. Motion can help to detect $p$-dimensional outliers. In

a static plot, each point has two coordinates. In a moving scatterplot, each point also has

velocities in the x and y directions. Therefore, motion adds another two dimensions of

information to a display.

Data exploration with the data viewer is based on moving scatterplots. More specifically, the

program aims to help the data analyst construct moving scatterplots from projections of

observations into low-dimensional subspaces.

1.2. Some Background

The use of motion graphics for data analysis has quite a long history. Both the work of

Fowlkes (1971), and the PRIM-9 system (Fisherkeller, Friedman and Tukey 1974) for 3-d

rotations, were early applications of real-time, graphical methods to data analysis. Later

systems such as PRIM-H (Donoho, Huber, Ramos and Thoma 1982), and ORION (Friedman,

McDonald and Stuetzle 1982; McDonald 1982) continued to emphasize the power of interaction and motion for data display and interpretation. Nowadays, improvements in

hardware mean that these methods are no longer confined to expensive, special purpose

systems; interactive graphics packages for statistics are generally available, for example,

MacSpin (Donoho, Donoho and Gasko 1985). Also, it has encouraged the development of

new display and interaction methods for data analysis, such as scatterplot brushing (Becker

and Cleveland 1987), and the grand tour (Asimov 1985; Buja and Asimov 1985).

The data viewer program represents some of the efforts of Andreas Buja and me in this

direction. A previous paper (Buja et al. 1987) described some of its capabilities; this thesis

focuses on additional methods for guiding projection planes through high-dimensional data

space.

1.3. Computing Environment

The data viewer program is implemented on a Symbolics 36xx Lisp machine (Symbolics

1984), a single-user graphics workstation with a high-resolution bitmap display and a mouse

for graphical input. It has processing power comparable to a VAX 780, and is fast enough for

computing and drawing scatterplots of 500 points at the rate of 15 a second.

Figure 1.1 shows a picture of the screen. The screen is organized into rectangular areas called

windows. In figure 1.1 there are two windows fully visible on the screen: an editor window

on the left, and a data viewer window labeled "St.Helens". The editor window is provided by

the Lisp machine environment, whereas the data viewer window is produced by the data

viewer program.

There are two input devices, a keyboard and a mouse. Usually, input changes the appearance

of the screen. For example, typing in text to the editor changes the contents of the editor

window. Similarly, input with the mouse changes the contents of the data viewer window.

The mouse controls the position of the (typically arrow shaped) mouse cursor on the screen.

In figure 1.1, the mouse cursor points at the scatterplot in the data viewer window. The user

clicks one of the three mouse buttons to send input to a window. Areas of the screen that

respond to mouse clicks are called mouse sensitive. When the mouse is present in a sensitive


Figure 1.1: The Lisp machine screen

region, a mouse documentation line at the bottom of the screen gives a brief description of the

effect of a click. Notice the mouse documentation line in figure 1.1: it says "L: move to here,

R2: system menu". With a single click on the left button, the contents of the data viewer

window change as the scatterplot begins to move.

The shaded areas on the screen are occupied by partially visible windows, including a file

system window on the lower right hand side, and another data viewer window. Mouse clicks

uncover these windows, in the process removing the shading. The second data viewer

window shows a second scatterplot. Since each data viewer window can only show a single

scatterplot, multiple scatterplots are displayed in individual windows.

The Symbolics Lisp machine is designed primarily for programming in Lisp. The data viewer

program is written entirely in Lisp, and a Lisp-based language, Flavors (Cannon 1982), for

object-oriented programming (see chapter 8).

1.4. Outline

Chapter 2 contains some examples of the plots produced by the data viewer program, and

gives an introduction to its capabilities.

Chapter 3 describes the sequence of operations (the data viewer pipeline), which are applied

to the observations to produce plots on the screen.

Chapter 4 begins with a description of existing methods for constructing sequences of

projections, namely, 3-d rotations, and the grand tour. I introduce the data viewer's paradigm

for constructing paths of planes, which potentially offers the user greater control and

flexibility than is possible with existing methods.

Chapter 5 presents some tools which aid the user in constructing paths of planes within the

framework provided by the paradigm.

In chapter 6 I describe the user interface: how the above-mentioned tools are made available

to the data analyst.

Chapter 7 gives some examples of data exploration with the program, even though it is

impossible to do it justice with static plots on paper.

Chapter 8 discusses the data viewer design, and describes a user-model for the program.

Finally, chapter 9 states my conclusions.

CHAPTER 2

DATA VIEWER EXAMPLES

In this chapter I give an introduction to the data viewer. I will show how to obtain bivariate

scatterplots, density estimates, and 3-d point cloud rotations using the program. To begin

with, here is a brief description of the St. Helens data set, which I use throughout these

examples.

2.1. Viewing the St.Helens Data Set

The St. Helens data set contains observations on earthquakes occurring in the vicinity of

Mount St. Helens, during May, 1980. Date, latitude, longitude, depth and magnitude were

recorded for each of 680 earthquakes.

I regard a data set as a collection of cases, where each case has values for a number of

variables. For the St. Helens data set, a case represents an earthquake, and the variables

are date, latitude, longitude, depth and magnitude.

(make-DV-window :data-set st.helens ...)

make-DV-window is a function taking two named arguments, :data-set and [...].

Executing this code produces a data viewer window in the top right hand

corner of the screen, containing a view of the St. Helens data set. Figure 2.1 shows the

data viewer window.

Figure 2.1: A data viewer window

2.2. The Data Viewer Window

There is a fixed set of items that appear in a data viewer window. As in figure 2.1, it is

mostly occupied by a scatterplot. The scatterplot, along with the other items drawn in the

window, constitute the data viewer's display list. The display list contains five items, which

are

(i) a scatterplot: Each data set case is represented by a plotting symbol called a glyph. In

figure 2.1, all glyphs are square-shaped, but in general they may have contrasting shapes

and colors. I refer to the part of the data viewer window containing the scatterplot as the

plot region.

(ii) a title: There is a rectangular area across the top of the window showing the data set

name, in this case "St.Helens".

(iii) variable boxes: A box for each of the variables is drawn on the window's left hand side,

with the variable's name printed across the bottom. In figure 2.1, the boxes for date

and latitude have horizontal and vertical lines drawn from their centers, telling us

that the displayed plot is a bivariate scatterplot of date and latitude. The reason

for using this, instead of the usual scheme of printing a variable's name along each plot

axis, will become apparent later.

Variable boxes also have labels, appearing on the top left hand corner. In figure 2.1, the

boxes for date and latitude have the labels X and Y. In this case, it might seem

that labels and lines give the same information. As we will see in chapter 6, labels have a

special purpose.

(iv) a control panel: This is drawn in the window's lower left corner. By clicking on various

parts of the control panel, the user controls scatterplot motion. Again, details are given in

chapter 6.

(v) a plot interaction menu: This lies next to the control panel. Different modes of plot

interaction are possible in the data viewer program. Throughout the discussion, we are

concerned with projection mode only, so in all examples the plot interaction menu shows

"PROJECTION".

2.3. Bivariate Scatterplots

In conventional programs for statistical graphics such as S (Becker and Chambers 1984), or

Minitab (Ryan, Joiner and Ryan 1976), the data analyst types in a command like

plot (x, y) to obtain a scatterplot. In contrast, the data viewer program has primarily a

graphical interface; once the window is created, most communication between the user and

Figure 2.2: A variable box

program consists of mouse clicks which change the display. As an example, I describe how to

obtain a bivariate scatterplot.

Figure 2.3: Bivariate scatterplot

When the mouse cursor moves into the circle within a variable box, its shape changes. Figure

2.2 shows a close-up of the longitude box, where the character X gives the mouse

cursor's position. With a double middle click, (two clicks on the middle button in fast

succession), the display changes to show a scatterplot with longitude on the x-axis, and

latitude remaining on the y-axis. The X label and horizontal line move from the date

to the longitude box. Figure 2.3 shows the changed data viewer window. Similarly,

mouse clicks can also change the scatterplot's y-variable. Therefore, with a few clicks, the

user may produce any bivariate scatterplot almost instantaneously.

2.4. 3-D Point Cloud Rotations

Figure 2.4: A 3-variable subspace

In figure 2.4, we have picked out the 3-variable subspace consisting of latitude,

longitude and depth, by marking their respective boxes with A labels. I describe how

Figure 2.5: 3-D rotations

to use 3-d rotations to examine this subspace.

Figure 2.4 shows a bivariate scatterplot of latitude and depth. This pair of variables

is in the plane of the screen, while the third, longitude, is perpendicular to the screen.

Notice that the mouse cursor is positioned on the right hand side of the scatterplot. With a left

mouse click, the point cloud rotates towards the mouse cursor. More precisely,

the point cloud spins in the direction given by the center of the plot region and the cursor

position. A mouse click in the plot region as the points are moving stops the motion. The

next click restarts the rotation, in the direction specified by the current position of the mouse

cursor. With these controls, the user can spin a 3-d point cloud in any direction.

Figure 2.5 shows a picture of the data viewer window after some point cloud rotations. Notice

now that lines are drawn in all three of the latitude, longitude and depth boxes. The

lines are in fact the projections of the three coordinate axes.

2.5. Density Estimates

The data viewer program is not limited to showing projections onto 2-d subspaces. Displays

based on 1-d projections only, such as plots of density estimates, often highlight features of

distributions that are not evident in 2-d projections.

Figure 2.6: Density estimate

For example, figure 2.6 shows a density estimate of latitude. As in figure 2.1, the box for

this variable has a horizontal line and an X label. Since the plot shows a 1-d projection, there

is no box with a vertical line. We select other density estimates for display just as we selected

bivariate scatterplots. Pointing the mouse at the longitude box and clicking gives it the

X label, and the plot becomes a density estimate of this variable.

2.6. Linked Data Viewer Windows

All of the plots shown so far demonstrate that earthquake locations are highly concentrated, so

that it is hard to see the structure of the dense cluster. For this, separate plots of the high

density region are necessary.

Suppose the data set St. Helens-dense contains the subset of cases in the high-density

region. To view this subset separately, I construct a second data viewer window. Figure 2.7

shows two data viewer windows, for the St. Helens and St. Helens-dense data sets

respectively, as indicated by the titles. In both windows, the cases belonging to the dense

subset are drawn with square glyphs, while the remaining points have hollow circular glyphs.

By comparing the lines drawn in the variable boxes, we see that the two windows show the

same projection. This implies that the scatterplot in the lower window is a "close-up" of the

upper scatterplot.

As before, pointing the mouse cursor at the plot region in the upper window and clicking

causes the point cloud to rotate. However, this time the point cloud in the lower window also

rotates, and in the same direction. This is because the second window was constructed in a

special way, in order to link it to the existing window. Linking data viewer windows is a

useful capability of the program, and will be further discussed in succeeding chapters. In this

case, simultaneous motion of the two scatterplots permits a dynamic data set comparison,

because the second window displays throughout a close-up of the first.

Figure 2.7: Linked data viewer windows

CHAPTER 3

THE DATA VIEWER PIPELINE

A data viewer window shows a view of a data set, where each case is represented by a

scatterplot glyph. A glyph has a shape, color and a location on the bitmap, given by the x and

y screen coordinates. The data set contains cases as they were observed: in measurement

coordinates. To produce a view, a sequence of operations is performed on each case,

transforming it from measurement to screen coordinates. We call these operations the data

viewer pipeline (Buja et al. 1987).

3.1. Pipeline Operations

Figure 3.1 shows the sequence of operations in the pipeline. They are:

(1) Univariate transformation: Typically, this step transforms the variables to comparable

units, so I refer to the result as standardized coordinates.

(2) Projection: Cases in standardized coordinates are projected onto a one or two

dimensional subspace; these are the plot coordinates.

(2b) Density estimation: When projections are onto a 1-d subspace, an additional operation

estimates the density of the projection. Then, the plot y-coordinates are the estimated

density evaluated at each projected case.

(3) Viewporting: This operation determines where the scatterplot is located on the screen,

converting the plot to screen coordinates.
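The sequence of operations above can be sketched as a composition of functions. This is a minimal illustration in Python, not the data viewer's implementation; the name `run_pipeline` and the convention that each stage is passed in as a callable are assumptions made for the example.

```python
def run_pipeline(cases, standardize, project, viewport, density=None):
    """Carry cases from measurement coordinates to screen coordinates.

    standardize: case -> standardized case                   (step 1)
    project:     standardized case -> (x,) or (x, y)         (step 2)
    density:     list of x values -> function x -> height    (step 2b, 1-d only)
    viewport:    (x, y) plot coordinates -> screen coordinates (step 3)
    """
    standardized = [standardize(case) for case in cases]
    plot = [project(z) for z in standardized]
    if density is not None:
        # 1-d projection: the y plot coordinate is the estimated density
        # evaluated at each projected case.
        xs = [p[0] for p in plot]
        estimate = density(xs)
        plot = [(x, estimate(x)) for x in xs]
    return [viewport(p) for p in plot]
```

Changing any one of the stage functions and re-running the pipeline yields another view of the same data.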

Figure 3.1: The pipeline (data set -> univariate transformation -> projection -> density estimation, for 1-d projections only -> viewporting -> scatterplot)

Each pipeline operation has associated pipeline parameters. Their values determine how the

data viewer's display appears. All pipeline parameters are under user control. When the

parameters are changed, the display changes, showing another view of the data.

When a parameter changes in real-time, the result is a moving scatterplot. I described an

example of this in the previous chapter; with a single mouse click the display moved through

a sequence of data projections. The current version of the data viewer program allows real-time

parameter modification for the projection and viewporting operations. Other work

(Fowlkes 1971; Becker, Cleveland and Wilks 1986) has demonstrated some applications of

real-time univariate transformations.

3.2. Univariate Transformations

The first pipeline operation allows for arbitrary transformations of each variable. The

operation has two steps, a non-linear transformation followed by an affine transformation.

Typically, the affine transformation is data dependent, and the non-linear transformation is

not. This is why I consider these steps separately.

Non-linear Transformations

The non-linear transformation is a useful capability because some variables are better plotted

on a log scale, for example. Each variable has a user-supplied function as an associated

pipeline parameter. This function is applied to all values of that variable. The default is the

identity function.

Affine Transformations

The affine transformation is for variable standardization, and is a very important part of the

data viewer pipeline. A data set's variables typically measure unrelated quantities: one

variable could measure temperature in degrees Celsius, and another the amount of rainfall in

mls. Therefore, interpreting linear combinations of variables is difficult. Converting each

variable to standard units is the usual "solution". This operation attempts to eliminate the

problem of arbitrary measurement scales.

There is no single standardization scheme that is best for all applications. Here are some

candidate schemes:

• Standardize variables to mean zero and unit standard deviation. This is the typical

transformation employed in principal components analysis.

• Standardize variables to the same range. Equivalently, center and scale each variable by

its midrange and half-range respectively. Conventional scatterplot programs use

transformations of this form; the rectangular area allocated to a plot implicitly determines

scalings for the x and y variables. A square plot region implies that variables are scaled

to have identical max - min values.

• Use the same scaling factor for a group of variables, thus preserving their relative scales.

This is appropriate when variables are measured in comparable units.

• Perform identical standardization operations on a group of variables. An important

application is repeated measures situations. One possibility is to standardize each

variable using the joint mean and standard deviation of the group of variables.

In the data viewer, all of the above standardization schemes are supported. Each variable has

two associated pipeline parameters; they are a center and scale factor. The user can require

that a variable's center factor be the mean, median, midrange, or some arbitrary value.

Similarly, the choices for scale factor are the standard deviation, median absolute deviation,

half-range or some user-supplied value. Data based quantities may be computed over the [...]

A variable is left unstandardized if its center and scale factors are 0 and 1 respectively. The

defaults are the midrange and half-range.
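The center and scale choices listed above can be sketched as follows. This is an illustrative Python version, not the data viewer's code; the names `standardize`, `CENTERS` and `SCALES`, the use of the sample standard deviation, and the optional `group` argument (pooled values of a group of variables, as in the last scheme in the list) are all assumptions.

```python
import statistics

def _mad(values):
    """Median absolute deviation about the median."""
    m = statistics.median(values)
    return statistics.median([abs(v - m) for v in values])

CENTERS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "midrange": lambda v: (max(v) + min(v)) / 2.0,
}
SCALES = {
    "sd": statistics.stdev,  # sample standard deviation (an assumption)
    "mad": _mad,
    "half-range": lambda v: (max(v) - min(v)) / 2.0,
}

def _resolve(choice, table, values):
    # A string names a data-based quantity; anything else is a user-supplied value.
    return table[choice](values) if isinstance(choice, str) else float(choice)

def standardize(values, center="midrange", scale="half-range", group=None):
    """Affine transformation (x - center) / scale of one variable.

    If `group` is given, the data-based center and scale are computed from
    those pooled values instead of from `values` alone.
    """
    reference = values if group is None else group
    c = _resolve(center, CENTERS, reference)
    s = _resolve(scale, SCALES, reference)
    if s == 0:
        raise ValueError("scale factor is zero")
    return [(x - c) / s for x in values]
```

For instance, `standardize(values, center=0, scale=1)` leaves a variable unstandardized, while the defaults map its range onto the interval from -1 to 1.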

3.3. Projections

The pipeline parameters for the projection operation are the one or two vectors which

determine the projection. Because the operation is determined by p parameters for each

vector, there is no obvious way to construct general, real-time controls for manipulating

changing projections. The bulk of my work is concerned with devising and implementing

useful methods for controlling projections. This is the subject of later chapters. For now, here

is a discussion of the kinds of projections available in the data viewer program.

Orthogonal Projections

The cases have numerical values for each of $p$ variables, so we may regard them as vectors in

some $p$-dimensional vector space. There are many reasonable coordinate systems for the

cases; I have named two, the measurement and the standardized coordinates. Note these

coordinates are in possibly different vector spaces. As I discussed in the previous section,

standardized coordinates are preferable because they aim to eliminate the problem of arbitrary

measurement units. In standardized coordinate space, there is a natural identity between the

standardized variables and the canonical basis vectors $e_j$, $j = 1, \ldots, p$.

On the screen, we see projections from a vector space to a 1-d or 2-d subspace. To interpret

the plots that appear on the screen, projections should be orthogonal with regard to some inner

product. Usually, the pipeline uses the canonical inner product in the standardized coordinate

space. The main motivation for this choice is that conventional bivariate scatterplots result

from orthogonal projections.

Nevertheless, this inner product is arbitrary; for example, it depends on the univariate

transformations. Other univariate transformations give different inner products, producing

quite different plots on the screen. In section 5.3, I describe applications (such as data

sphering), where other inner products are appropriate.

In what follows, all projections are orthogonal with respect to the canonical inner product in

the standardized coordinate space, unless explicitly stated otherwise.

2D-Projections

The most general kind of projection is a 2d-projection. This is an orthogonal projection of the

cases onto a 2-plane $P$. Let $z_i \in \mathbb{R}^p$, $i = 1, \ldots, n$ be the case vectors in standardized

coordinates. The plot coordinates for case $i$ are $a_x^T z_i$ and $a_y^T z_i$, where $(a_x, a_y)$ is an

orthonormal basis for $P$. In variable box $j$, the line drawn from the box center is proportional

to the projection of the $j$th standardized variable on $P$, i.e., $(a_x^T e_j, a_y^T e_j)$.
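The two formulas above translate directly into code. The following Python sketch is illustrative only; the function names are hypothetical and variables are indexed from 0, so `j = 0` corresponds to $e_1$.

```python
def dot(u, v):
    """Canonical inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def is_orthonormal(a_x, a_y, tol=1e-9):
    """Check that (a_x, a_y) is an orthonormal basis of a 2-plane."""
    return (abs(dot(a_x, a_x) - 1.0) < tol
            and abs(dot(a_y, a_y) - 1.0) < tol
            and abs(dot(a_x, a_y)) < tol)

def plot_coordinates(z, a_x, a_y):
    """Plot coordinates (a_x^T z, a_y^T z) of one standardized case z."""
    return (dot(a_x, z), dot(a_y, z))

def variable_box_line(j, a_x, a_y):
    """Direction of the line in variable box j: (a_x^T e_j, a_y^T e_j).
    Since e_j selects component j, this is just (a_x[j], a_y[j])."""
    return (a_x[j], a_y[j])
```

When `a_x` and `a_y` are canonical basis vectors, the first variable's box gets a horizontal line, the second a vertical line, and every other box gets no line, which matches the bivariate scatterplot display described in chapter 2.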

XY-Projections

An xy-projection is an orthogonal projection of the case vectors onto a 2-plane, with the

restriction that the x- and y-vectors, $a_x$ and $a_y$, lie in disjoint x- and y-subspaces.

The subspaces are usually spanned by two subsets of the variables.

XY-projections are common in data analysis, and so deserve separate consideration. The

simplest example is a bivariate scatterplot. If $a_x$ and $a_y$ are $e_1$ and $e_2$ respectively, the result is

a scatterplot of $z_{i1}$ against $z_{i2}$, $i = 1, \ldots, n$. In general, xy-projections are appropriate when we

wish to explore associations between two groups of variables, as with canonical correlations

and regression data.

1D-Projections

As we have seen, the data viewer also allows plots where there is no projection vector for the

y-direction. These plots are based on a 1d-projection: an orthogonal projection of the

observation vectors onto a line.

A naive 1d-projection would result in a "dot-plot" consisting of points along a line. In

preference, the data viewer enhances 1d-projections by computing a density estimate for the

projected observations. Evaluating the density estimate at each case gives the y plot

coordinates.

3.4. Density Estimation

An additional pipeline operation computes a density estimate when projections are onto a 1-d

subspace. Whenever the projection changes, the density is recomputed, implying that real-time

modifications to projections produce moving 1d-projections on the screen.

Note that the density plot consists of a series of glyphs representing the estimate at each

projected case, rather than the usual curve. (This is why I use the term scatterplot in a general

sense to describe the result of a 2d- or 1d-projection.) Unless the projected cases are sparse,

such a plot can give a reasonably good picture of the density's shape. Unfortunately, the

available hardware is not sufficient to plot curves moving in real-time.

Currently, the form of the density estimate used in the pipeline is fixed; it is a frequency

polygon average shifted histogram (Scott 1985). This uses a histogram smoothed with

(weighted) running averages. There are two associated pipeline parameters: the number of

histogram bins, and a smoothing parameter, the number of bins used for the running average.

A smoothing parameter of 1 gives the usual histogram. Linear interpolation between bin

centers provides a density estimate at intermediate points.
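A density estimate of this kind might be computed as in the Python sketch below. It is an illustration under stated assumptions, not the data viewer's code: the name `ash_estimate` is hypothetical, the triangular weights $1 - |i|/m$ are the standard choice for average shifted histograms and are assumed here, the bins span the data range with no empty padding bins, and the estimate is held constant beyond the outermost bin centers.

```python
def ash_estimate(xs, n_bins, m):
    """Frequency polygon of an average shifted histogram.

    xs:     projected cases (1-d plot coordinates)
    n_bins: number of histogram bins over the data range
    m:      smoothing parameter; m = 1 gives the ordinary histogram
    Returns a function evaluating the estimate at any x.
    """
    lo, hi = min(xs), max(xs)
    if hi <= lo:
        raise ValueError("need at least two distinct values")
    delta = (hi - lo) / n_bins
    # Binning takes one pass over the data, so the cost is linear in len(xs).
    counts = [0] * n_bins
    for x in xs:
        counts[min(int((x - lo) / delta), n_bins - 1)] += 1
    # Weighted running average of neighbouring bin counts.
    n, h = len(xs), m * delta
    heights = []
    for k in range(n_bins):
        total = 0.0
        for i in range(1 - m, m):
            if 0 <= k + i < n_bins:
                total += (1.0 - abs(i) / m) * counts[k + i]
        heights.append(total / (n * h))
    centers = [lo + (k + 0.5) * delta for k in range(n_bins)]

    def density(x):
        # Linear interpolation between bin centers.
        if x <= centers[0]:
            return heights[0]
        if x >= centers[-1]:
            return heights[-1]
        k = min(int((x - centers[0]) / delta), n_bins - 2)
        t = (x - centers[k]) / delta
        return (1.0 - t) * heights[k] + t * heights[k + 1]

    return density
```

Evaluating the returned function at each projected case gives the y plot coordinates for a 1d-projection.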

My reasons for using this particular density estimate are as follows.

(i) It retains the computational efficiency of the histogram, and so depends linearly on the

number of data points. Other popular density estimators are slower because they require

that points be sorted. Efficiency is relevant because displaying moving 1d-projections

demands that the density be re-estimated for every new plot.

(ii) In contrast to histograms, this density estimate does not have the usual lack of

smoothness caused by binning. Particularly for motion, it is important that the density

estimate be smooth. A slight modification in the 1d-projection causes discrete jumps in

the histogram bin counts, but only small changes in level for smooth density estimates.

Jumps distract us from observing how small changes to the projection affect the marginal

distribution. Smooth changes enable us to see, for example, how the distribution

becomes more skewed as the projection changes.

3.5. Viewporting

The viewport operation determines where the scatterplot is located on the screen. We require

that the transformation from plot to screen coordinates does not depend on the projection;

otherwise, size comparisons across projections are not possible.

The viewport transformation maps a rectangular region in plot coordinate space into the

rectangular region of the data viewer window designated for the scatterplot. In most

scatterplot drawing programs, the mapping is determined by the restriction that all points fit

into the rectangle on the screen. This restriction is not feasible for plotting projections of

high-dimensional data; most plots would occupy just a small portion of the available plot

region. The alternative allows cases to have screen coordinates falling outside the plot region.

The glyphs for these cases are clipped from view, that is, not drawn. By modifying the

viewport transformation, glyphs may be returned to view.
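
A minimal sketch of such a viewport transformation with clipping is given below. The function name and the rectangle convention (x0, y0, x1, y1) are assumptions made for the example. The mapping depends only on the two rectangles, never on the projected points, so sizes stay comparable across projections.

```python
def viewport(points, plot_rect, screen_rect):
    """Map plot coordinates to screen coordinates; clipped glyphs become None."""
    px0, py0, px1, py1 = plot_rect
    sx0, sy0, sx1, sy1 = screen_rect
    kx = (sx1 - sx0) / (px1 - px0)  # horizontal scale factor
    ky = (sy1 - sy0) / (py1 - py0)  # vertical scale factor
    drawn = []
    for x, y in points:
        sx = sx0 + (x - px0) * kx
        sy = sy0 + (y - py0) * ky
        inside = sx0 <= sx <= sx1 and sy0 <= sy <= sy1
        drawn.append((sx, sy) if inside else None)  # None: not drawn
    return drawn


points = [(0.0, 0.0), (2.0, 0.0)]
small = viewport(points, (-1, -1, 1, 1), (0, 0, 100, 100))  # second case clipped
wide = viewport(points, (-4, -4, 4, 4), (0, 0, 100, 100))   # rescaled: both drawn
```

Enlarging the plot rectangle, as in the second call, is one way a clipped glyph is returned to view.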

The data viewer program supports real-time modification of the viewporting parameters.

Briefly, the user may shift or scale the scatterplot that appears on the screen. Viewport

operations are more than just a convenience; they also have some interesting applications to

data analysis. For a full discussion with illustrations see Buja et al. (1987).

3.6. Multiple Pipelines

A data viewer window can only show one scatterplot at a time. To display multiple

scatterplots simultaneously, we use multiple data viewer windows. Each window has its own

pipeline for transforming the observations to screen coordinates. There are two main reasons

for looking at multiple scatterplots: (i) for multiple views of a single data set and (ii) for views

of different data sets. Both reasons require that the data viewer windows be linked, that is, use

the same cases or pipeline operations. Windows linked by projection, for example, permit

dynamic comparisons; we could watch two data sets as they undergo the same sequence of

projections. Section 2.6 gave an example.

Linked Data Viewer Windows

Each window may show the result of applying different pipeline transformations to the same

cases, giving multiple views of a data set. We could compare the effects of different

univariate transformations, projections, density estimates, or different plot scalings.

Comparisons are far easier when the pipelines are closely related, say, differ by one operation

only.

For example, suppose there are two data viewer windows, each with views of the

St. Helens data set. One shows a bivariate scatterplot of latitude and longitude, the other

[...] projection operation, the two sequences

of pipeline operations should be the same.

When displaying different data sets in multiple windows, similar pipeline operations help us

compare the data sets. If each data viewer window applies the same sequence of pipeline

operations, the difference between their scatterplots can only be due to the difference between

the data sets. Graphical cross-validation is a potential application. By splitting a data set in

two, findings from the exploration of one half can be checked for consistency with the other.

One can allay suspicions that the structure discovered is an artifact of the

exploration procedure.

Let me clarify what "the same" means for each of the four pipeline operations.

(1) Univariate transformations are the same when the pipeline parameters for this pipeline

element are identical for all the variables. When the data viewer windows show the

same data set, or different data sets with common variables, it may be convenient to

use the same univariate transformations in the pipelines. This implies that cases are

represented in just one standardized coordinate space.

(2) Suppose (a_x, a_y) is an orthonormal basis for a 2-plane P. The projection operations

are the same when the plot coordinates for the pipelines are a_x^T z, a_y^T z, where z is any

case vector. Note that the notion of "sameness" for projections makes precise sense

only when the same inner products are used. In the data viewer, it falls on the user to

supply univariate transformations that guarantee equivalent inner products.

(3) So far, the data viewer pipeline has just one possible density estimation procedure.

The density estimation operations are the same when both the number of bins and

smoothing parameter are identical.

(4) Viewport operations are the same when they map the same rectangle in plot coordinate

space to rectangles of the same size in screen space. (For simplicity, we assume that

the data viewer windows, and therefore plot regions, are the same size.)

Pipelines for Linked Data Viewer Windows

Consider the example of section 2.6, where two data viewer windows were linked by

projection. Figure 3.2 shows a representation of their pipelines. Notice how the pipelines use

the same data set and projection operation. Observations from the St. Helens data set are

plotted in both windows; all cases appear in window-1, and the subset belonging to

St. Helens-dense in window-2. Also, both windows show the same projection; this is

how we obtain a dynamic comparison of the data sets.

[Figure 3.2 diagram; stage labels: Univariate Transformation, Projection, Density Estimation,

Viewporting; outputs: Scatterplot 1, Scatterplot 2.]

Figure 3.2: Pipelines for linked windows

The purpose of the second window is to show a close-up of the high-density cluster. As a first

attempt at obtaining a close-up, one might think of modifying the viewporting operation. This

suffices in the special case where the cluster is located at the origin in standardized coordinate

space. In general, viewporting operations are good for showing a cluster blow-up in a single

projection only, not in the moving sequence.

In the example of section 2.6, the close-up plots are due to the different univariate

transformations of the second pipeline. The pipelines of windows 1 and 2 use the default

univariate transformation procedure, so their variables are standardized to a fixed range across

the entire data set and subset respectively. However, this implies that projections in the two

windows use different notions of orthogonality. Figure 2.7 shows some evidence of this; the

upper plot shows two dense rods forming an "L" shape, whereas the angle between the rods

appears more acute in the lower plot. Therefore, it is not quite accurate to say that the

scatterplot in window-2 is a close-up of that in window-1. This would require a sophisticated

linking of the univariate transformation operations, which scaled the cases in the dense subset

in proportion to their scales for the entire data.

CHAPTER 4

PATHS OF PLANES

Data exploration with the data viewer is based on displaying sequences of projections.

Smoothness is an essential requirement for the sequences. I describe some existing methods

for forming paths of projection planes, namely, 3-d rotations, and the grand tour. These are

available in the data viewer program, but as we will see, they are not enough to meet the goal

of user-controlled moving scatterplots. I introduce the data viewer's paradigm for

constructing paths of planes, which potentially offers the user greater control and flexibility

than is possible with existing methods.

4.1. Smooth Paths of Planes

A crucial requirement is that the scatterplot appears to move smoothly, that is, the position of

each point does not change much from plot to plot. With smooth motion, we can watch how

the position of each point changes through the sequence of plots. It is due to this ability that

we "see" 3-d point clouds on the screen. A smoothly moving scatterplot is far more

informative than a sequence of disconnected plots. Besides losing information, lack of

smoothness is unpleasant and disorienting for the user. At worst, successive plots are totally

unrelated, causing flashing clouds of points to appear on the screen.

The smoothness requirement places restrictions on the sequences of planes.

Let {P_t} denote the path of planes, where t represents time. The display shows projections

onto P_{kδ}, k = 1, 2, ..., for some increment δ, yielding a smoothly moving scatterplot when

successive planes do not differ very much. For this reason, we require that the planes {P_t}

change smoothly, in some appropriate metric. As long as the increment δ is small, the user

perceives a smoothly moving scatterplot.

4.2. 3-D Rotations

The canonical example of moving scatterplots is 3-d point cloud rotations. We can

characterize 3-d rotations by paths of planes, as in the following example.

Let (a_x(t), a_y(t)) be an orthonormal basis for P_t, where the vectors a_x(t), a_y(t) correspond

to the x and y directions respectively.

The path of planes P_t given by a_x(t) = cos t e_1 + sin t e_3, a_y(t) = e_2 is a 3-d rotation in the

e_1-e_3 plane.

This gives a point cloud rotation around the y-axis. At time t = 0, the scatterplot has e_1 in the

x-direction, and e_2 in the y-direction. When t reaches 90 degrees, e_3 replaces e_1 in the

x-direction, with e_2 remaining in the y-direction.
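
The rotation in this example can be checked numerically with a short sketch. The helper names are hypothetical, and the case vector is made up for illustration. The x-vector moves from the first coordinate axis to the third while the y-vector stays on the second.

```python
import math


def rotation_basis(t):
    """Basis of the plane at time t (radians) for the rotation about the y-axis."""
    a_x = (math.cos(t), 0.0, math.sin(t))  # cos t e1 + sin t e3
    a_y = (0.0, 1.0, 0.0)                  # e2, held fixed
    return a_x, a_y


def project(z, a_x, a_y):
    """Plot coordinates of case z: inner products with the basis vectors."""
    return (sum(a * c for a, c in zip(a_x, z)),
            sum(a * c for a, c in zip(a_y, z)))


case = (1.0, 2.0, 3.0)  # illustrative case vector
start = project(case, *rotation_basis(0.0))            # x shows variable 1
quarter = project(case, *rotation_basis(math.pi / 2))  # x shows variable 3
```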

3-d rotations of variable subspaces are easy to use and interpret. However, as I explain below,

they alone are too restrictive for the data viewer.

• More general subspaces: 3-d rotations need not be limited to subspaces spanned by

three variables. Mechanisms for selecting more general 3-d subspaces would be useful,

for example, the space spanned by the first three principal components.

• Higher dimensional paths: Even so, paths of planes in arbitrary 3-d subspaces do not

suffice. Later, I describe examples where paths in a 4-d subspace are appropriate. 3-d

rotations are degenerate in that coordinates for one direction (the y-direction in the above

example), are held fixed. More information is displayed when points move

simultaneously around both axes.

4.3. The Grand Tour

There are a number of so-called grand tour methods (Asimov 1985) for constructing paths of

planes. Such methods have played an important part in the development of the data viewer.

Earlier versions (Buja et al 1987) of the program relied entirely on grand tour methods for

producing paths of planes. In addition, the new methods I will describe are based on ideas

from the grand tour.

A grand tour is defined as follows (Asimov 1985):

A grand tour is a sequence of 2-planes {P_i, i = 1, 2, ...},

which is dense in the space of all 2-planes in R^p.

The space of 2-planes in R^p is termed a Grassmann manifold, denoted by G_{2,p}. One choice

of metric on G_{2,p} is the squint angle, that is, the larger of the two principal angles θ_x, θ_y

between two 2-planes. Other choices are the L^r metrics, 0 < r < ∞, given by (θ_x^r + θ_y^r)^{1/r}.

All of the above mentioned metrics induce the same topology on G_{2,p} (Buja and Asimov

1985).
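
Given the two principal angles between a pair of planes, these distances are simple to evaluate. The sketch below uses hypothetical function names and takes the angles as inputs rather than computing them. For large r the second distance approaches the squint angle.

```python
def squint_angle(theta_x, theta_y):
    """The larger of the two principal angles."""
    return max(theta_x, theta_y)


def lr_metric(theta_x, theta_y, r):
    """(theta_x^r + theta_y^r)^(1/r) for a finite exponent r > 0."""
    return (theta_x ** r + theta_y ** r) ** (1.0 / r)


angles = (0.3, 0.4)                    # principal angles in radians (illustrative)
d_squint = squint_angle(*angles)
d_two = lr_metric(*angles, r=2)
d_large = lr_metric(*angles, r=200)    # close to the squint angle
```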

Basically, the grand tour is designed to come arbitrarily close (eventually) to all possible

2d-projections of the data. Note that smoothness is not part of the definition. This is because the

grand tour was invented with applications other than motion graphics in mind. However, we

do require smoothness for the particular grand tour paths used by the data viewer.

A Grand Tour Algorithm

There are a range of different schemes which produce grand tours (Asimov 1985; Buja and

Asimov 1985). I describe the particular grand tour used by the data viewer.

A grand tour may be obtained by interpolating between consecutive

elements of a randomly sampled sequence of planes.

This grand tour scheme is of particular interest because it introduces the important idea of

interpolating between pairs of planes, which contributes a useful building block for new

methods.

(The interpolation paths used are geodesics on G_{2,p}. I will say more on this subject later.)

Data Exploration with the Grand Tour

By definition, any grand tour produces a path of planes which is dense in the set of all possible

planes. Denseness implies that after enough time, a grand tour path comes close to any given

2-plane. Unfortunately, enough time can be far too long. In dimensions bigger than four, we

cannot rely on the grand tour to provide us with informative projections. For example, in six

dimensions, it takes a minimum of 6,000,000 planes to get within 10 degrees of all possible

planes (Asimov 1985). At the rate of 10 planes a second, this is 160 hours of viewing time!
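
The viewing time follows from simple arithmetic on the two figures quoted. The sketch below works it out with hypothetical variable names; the result is about 167 hours, the same order as the figure given in the text.

```python
planes_needed = 6_000_000   # planes quoted for six dimensions
planes_per_second = 10      # display rate quoted in the text
seconds = planes_needed / planes_per_second
hours = seconds / 3600      # roughly 167 hours
```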

In the data viewer implementation, the grand tour produces a smoothly moving scatterplot.

The above figures neglect to account for additional information yielded by smooth motion.

Therefore, they do not accurately reflect the amount of time necessary to uncover a particular

feature of the data.

Just the same, I conclude that the grand tour does not provide a stand-alone method for

exploring high-dimensional data, the goal of displaying all 2d-projections being rather too

ambitious for practical purposes. Realistically, the grand tour is useful for scanning

2d-projections, that is, watching the moving scatterplot for a few minutes in the hope of seeing

something interesting.

4.4. Guided Tours

The problem with grand tour paths is that they are entirely independent of the data itself, and

of the requirements of the user. Effective use of motion demands guided tours, that is,

user-controlled paths of planes. With guided tours, the user can customize the moving scatterplot

so that it is potentially more informative.

In the data viewer, I use one general paradigm to produce guided tours. A single paradigm is

convenient because there are fewer ideas for the user to grasp. Besides, it simplifies the user

interface, program design and implementation. Here is the paradigm I have chosen:

The data viewer produces smooth paths of planes by interpolating between

consecutive elements of a user-determined sequence of planes.

This scheme is similar to the grand tour algorithm, utilizing interpolation between pairs of

planes. The crucial difference is that a user-determined sequence of planes has been

substituted for a sequence of random planes, so that the resulting path of planes is controlled

by the user.

I use this paradigm for guided tours because

• it is straightforward: ideally, the user can specify any sequence of planes, and the smooth

path is supplied automatically.

• it is sufficiently general: I will demonstrate how the paradigm covers many useful ways

of producing moving scatterplots, including 3-d rotations on general subspaces.

With this paradigm, the kind of moving scatterplots produced by the data viewer depends on

the interpolation paths, and on the user selected planes. Methods for choosing planes are the

subject of the following chapters. Currently, only one interpolation scheme is available.

4.5. Interpolation between Pairs of Planes

Following the proposal of Buja and Asimov (1985) for the grand tour, I use interpolation

paths which are geodesics on G_{2,p} (for any of the L^r metrics, 1 < r < ∞). Some motivation for

this choice is given in the succeeding section. For a formal description of geodesics on

Grassmann manifolds, see Wong (1967).

Geodesic Paths on the Unit Sphere

I also use geodesic paths for interpolating between 1d-projections. In this case, geodesics

follow a great circle route on a unit sphere. Suppose two 1d-projections are described by the

points u_1 and u_2 on the unit sphere in R^p. Let θ be the angle between u_1 and u_2, and u_2* be

the unit vector in the direction of u_2 orthogonalized with regard to u_1 (obtained by a

Gram-Schmidt step). Then, one such path is u(t) = cos t u_1 + sin t u_2*, for 0 ≤ t ≤ θ.
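
The great circle path can be sketched directly from this description. The function names are hypothetical, and the sketch assumes the two inputs are unit vectors that are not parallel, since otherwise the Gram-Schmidt step has nothing to normalize.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def great_circle(u1, u2):
    """Return (theta, path), where path(t) moves from u1 (t=0) to u2 (t=theta)."""
    c = max(-1.0, min(1.0, dot(u1, u2)))         # clamp against rounding error
    theta = math.acos(c)                          # angle between u1 and u2
    resid = [b - c * a for a, b in zip(u1, u2)]   # Gram-Schmidt step
    norm = math.sqrt(dot(resid, resid))
    u2_star = [r / norm for r in resid]

    def path(t):
        return [math.cos(t) * a + math.sin(t) * b for a, b in zip(u1, u2_star)]

    return theta, path


u1 = [1.0, 0.0, 0.0]
u2 = [0.5, math.sqrt(3) / 2, 0.0]   # 60 degrees away from u1
theta, path = great_circle(u1, u2)
begin, end = path(0.0), path(theta)  # begin equals u1, end matches u2
```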

Geodesic Paths on G_{2,p}

Let Q_x, Q_y be two orthogonal 2-planes in G_{2,p}. Then, for any pair of unit vectors u(t) ∈ Q_x,

v(t) ∈ Q_y, t ≥ 0, each rotating at a fixed speed, a geodesic path on G_{2,p} is given by P(t), t ≥ 0,

where

P(t) = span(u(t), v(t)).    (4.1)

For obvious reasons, we call Q_x and Q_y the rotation planes.

We need to construct a geodesic path between two planes P_1 and P_2.

Take the simple case where P_1 and P_2 are orthogonal 2-planes. For example, P_1 and P_2 are

spanned by the pairs of vectors (e_1, e_2) and (e_3, e_4) respectively. For 0 ≤ t ≤ π/2, define a plane

P(t) to be the span of (cos t e_1 + sin t e_3, cos t e_2 + sin t e_4). The planes P(t) form a geodesic

path from P_1 to P_2, with P(0) = P_1 and P(π/2) = P_2.

In general, the geodesic paths between two 2-planes correspond to simultaneous interpolations

of the principal angles:

Let (u_1, v_1), (u_2, v_2) be the principal vectors and θ_x ≥ θ_y (≥ 0) the corresponding principal

angles, for the pair of planes P_1, P_2 (see Golub & Van Loan 1983). (u_1, v_1), (u_2, v_2) are

orthonormal bases for P_1 and P_2 respectively, and the following relationships hold:

u_1^T u_2 = cos θ_x,  v_1^T v_2 = cos θ_y,

u_1^T v_2 = v_1^T u_2 = 0.

Let u_2* be the unit vector along the part of u_2 which is orthogonal to u_1. Similarly, v_2* is the

unit vector along the part of v_2 which is orthogonal to v_1. Define planes P(t) as the span of

(u(t), v(t)), for 0 ≤ t ≤ 1, where

u(t) = cos(θ_x t) u_1 + sin(θ_x t) u_2*,  v(t) = cos(θ_y t) v_1 + sin(θ_y t) v_2*.    (4.2)

The planes P(t) form a geodesic path from P_1 to P_2, with P(0) = P_1 and P(1) = P_2.

The planes spanned by (u_1, u_2), (v_1, v_2) are the rotation planes, and the angles θ_x and θ_y are

the speeds of rotation. Notice that this path of planes gives a 3-d rotation when one of the

speeds is zero.
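
A minimal sketch of this interpolation follows. It assumes the principal vectors are already available (computing them, for example by a singular value decomposition, is outside the sketch) and that each basis vector rotates at a constant speed equal to its principal angle, reaching the second plane at t = 1. The function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rotor(a, b):
    """Rotate unit vector a (t=0) into unit vector b (t=1) at constant speed."""
    c = max(-1.0, min(1.0, dot(a, b)))
    theta = math.acos(c)                        # principal angle = speed
    resid = [y - c * x for x, y in zip(a, b)]   # part of b orthogonal to a
    norm = math.sqrt(dot(resid, resid))
    if norm < 1e-12:                            # zero speed: vector stays put
        return lambda t: list(a)
    b_star = [r / norm for r in resid]
    return lambda t: [math.cos(theta * t) * x + math.sin(theta * t) * y
                      for x, y in zip(a, b_star)]


def geodesic(u1, v1, u2, v2):
    """Path t -> (u(t), v(t)) from span(u1, v1) to span(u2, v2), 0 <= t <= 1."""
    ru, rv = rotor(u1, u2), rotor(v1, v2)
    return lambda t: (ru(t), rv(t))


# The simple case of two orthogonal planes in four dimensions.
e1, e2, e3, e4 = ([1.0 if i == j else 0.0 for i in range(4)] for j in range(4))
plane = geodesic(e1, e2, e3, e4)
u_mid, v_mid = plane(0.5)
u_end, v_end = plane(1.0)
```

When the second pair of principal vectors coincide, the second rotor returns a fixed vector and the path reduces to a 3-d rotation.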

Properties of Geodesic Paths

Geodesic interpolation paths have the following properties.

• Smooth motion

Geodesic paths are smooth in an abstract sense, which implies visual smoothness. Paths

of planes constructed using this interpolation scheme have lack of smoothness only at the

endpoints of geodesic segments.

• Generalizes 3-d rotations

The sequence of data projections is locally within a 4-d subspace, because the path

segment connecting two planes is entirely determined by those planes. When the two

planes span a 3-d subspace, the sequence of data projections is a 3-d point cloud rotation.

In this sense, geodesic plane interpolation provides a consistent generalization of 3-d

rotations to four dimensions. Projections onto the path segments have a convenient

interpretation as point clouds rotating around a moving axis.

• Computational efficiency

The 4-d nature of the interpolation paths gives a large computational saving; it is

unnecessary to project every case from R^p to R^2 for every new scatterplot. Instead, for

each geodesic segment, cases are projected onto a 4-d subspace. Following this, drawing

the scatterplot requires a projection from R^4 to R^2. Except for the occasional projection

from R^p to R^4, the computational effort depends only on the number of cases, not the

number of variables.

It is possible that paths confined to 4-d subspaces place unnecessary restrictions on the

information content of the sequence of projections. However, the benefits of the

restriction are interpretability and computational efficiency.

• No within-screen spin

Within-screen spin occurs when the scatterplot rotates in the plane of the screen. For

purposes of displaying 2d-projections, within-screen spin is wasteful and confusing, the

reason being that plots differing by orientation contain the same information. (This is

only approximately true, since human vision is not symmetric with respect to

orientation.) Moving scatterplots obtained from geodesic interpolation do not rotate in

the plane, because the position and velocity planes are orthogonal.
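
The saving described under computational efficiency above can be sketched in two stages. The sketch assumes that the 4-d subspace of a segment is represented by four orthonormal vectors, namely the two starting principal vectors and the two orthogonalized end vectors. That choice of basis and the function names are assumptions for illustration.

```python
import math


def to_subspace(cases, basis4):
    """Project each case from R^p onto four basis vectors (once per segment)."""
    return [[sum(b * z for b, z in zip(vec, case)) for vec in basis4]
            for case in cases]


def frame(coords4, theta_x, theta_y, t):
    """Plot coordinates at time t, computed from the 4-d coordinates only."""
    cx, sx = math.cos(theta_x * t), math.sin(theta_x * t)
    cy, sy = math.cos(theta_y * t), math.sin(theta_y * t)
    return [(cx * c[0] + sx * c[1], cy * c[2] + sy * c[3]) for c in coords4]


def unit_axis(j, p):
    return [1.0 if i == j else 0.0 for i in range(p)]


p = 6
# Assumed basis order: u1, u2*, v1, v2* (here simply coordinate axes).
basis4 = [unit_axis(0, p), unit_axis(2, p), unit_axis(1, p), unit_axis(3, p)]
cases = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]]
coords4 = to_subspace(cases, basis4)                  # cost grows with p
first = frame(coords4, math.pi / 2, math.pi / 2, 0.0)  # cost independent of p
last = frame(coords4, math.pi / 2, math.pi / 2, 1.0)
```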

CHAPTER 5

METHODS FOR GUIDED TOURS

Once again, here is the data viewer's paradigm for guided tours:

The data viewer produces smooth paths of planes by interpolating between

consecutive elements of a user-determined sequence of planes.

This paradigm provides a basis for successful data exploration only if the user can specify a

sequence of planes quickly and easily. The simple solution of typing in lists of 2p numbers

for each plane is slow and tedious, and would detract from the impact of real-time motion. In

this chapter, I describe some ideas which make fast plane selection possible in the data viewer

program.

I use the notation {T_k, k = 1, 2, ...} for the sequence of user-selected planes, and reserve

{P_i, i = 1, 2, ...} for the denser sequence of interpolating planes. Since motion always proceeds

towards a plane T_k, I call these the target planes. Target selection in the data viewer is based

on four ideas.

• Rather than precisely picking a target, the user imposes constraints. Then, the target is

required to satisfy these constraints.

• The program provides a variety of schemes for supplying sequences of targets.

Typically, the targets are subject to the user-imposed constraints.

• In chapter 2, we saw how the user could control a rotating 3-d point cloud. Similarly, the

user can control 4-d point clouds.

• Controls for the user alone are not enough; data dependent paths are necessary to pick

out interesting features of the data.

As we will see, much of the data viewer's power derives from the fact that these are

cooperating rather than competing tools for constructing guided tours.

5.1. Constraints on Planes

The most basic kind of constraint requires that a plane be orthogonal to some of the vectors

e_j, j = 1, ..., p, representing the variables. Such constraints enable a user to temporarily set

aside some subset of variables and examine a smaller subspace. With the data viewer

paradigm for paths of planes, constraints need only be imposed on the targets. When each

member of the target sequence satisfies the same orthogonality constraints, the geodesic

interpolation paths give intermediate planes also satisfying the constraints.

Constraints are set up in the following manner.

Each variable is classified as active or inactive. Active variables are further classified as A, X

or Y. The constraints depend on the classifications. Let a_x and a_y denote the x and y vectors

for the target plane. Then, if variable j is:

(i) an A-variable, the target is unconstrained with regard to e_j

(ii) an X-variable, then a_y is orthogonal to e_j

(iii) a Y-variable, then a_x is orthogonal to e_j

(iv) inactive, the target (i.e. a_x and a_y) is orthogonal to e_j

With this set of constraints, it is easy to request either a 2d-projection, xy-projection, or

1d-projection.
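
One way to impose these constraints on a candidate pair of x- and y-vectors is to zero the forbidden components and then re-orthonormalize. This is a sketch under assumptions: the role labels, the function names and the use of a Gram-Schmidt step are illustrative, and the 1d case (where no y-vector survives) is not handled.

```python
import math


def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def constrain(a_x, a_y, roles):
    """roles[j] is 'A', 'X', 'Y' or 'inactive' for variable j."""
    # X-variables may enter only the x-vector, Y-variables only the y-vector.
    a_x = [a if r in ("A", "X") else 0.0 for a, r in zip(a_x, roles)]
    a_y = [a if r in ("A", "Y") else 0.0 for a, r in zip(a_y, roles)]
    a_x = unit(a_x)
    overlap = sum(x * y for x, y in zip(a_x, a_y))
    a_y = unit([y - overlap * x for x, y in zip(a_x, a_y)])  # Gram-Schmidt
    return a_x, a_y


roles = ["X", "X", "Y", "inactive"]
a_x, a_y = constrain([3.0, 4.0, 1.0, 2.0], [1.0, 1.0, 5.0, 2.0], roles)
```

With these roles the x-vector keeps only the first two variables and the y-vector only the third, which is an xy-projection.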

2D-Projections

When all active variables are A-variables, projection onto the target plane gives a

2d-projection from the subspace spanned by the active variables. With only three A-variables,

we obtain 3-d rotations.

One may use prior knowledge about the data at hand to reduce the dimensionality of the space

for exploration. By choosing successive targets randomly subject to the constraints, the result

is a grand tour of the smaller subspace. In light of the huge numbers of planes required for a

"complete" grand tour, this variation is a necessity for its practical application to data sets

with more than four variables.

XY-Projections

Recall that an xy-projection is a specialization of the 2d-projection, where the x- and

y-vectors are restricted to disjoint subspaces. Suppose that the active variables consist of q

X-variables and s Y-variables. Then a plane satisfying the constraints yields an xy-projection.

In the special case where q = s = 1, the result is a bivariate scatterplot of the X- and

Y-variables.

If successive targets are randomly obtained subject to the constraints, we obtain a so-called

correlation tour. Note that this is equivalent to sampling a sequence of unit vectors from the

uniform distribution on the unit sphere in R^q, and a second sequence from the unit sphere in

R^s. Like the grand tour algorithm, this scheme for correlation tours is due to Buja and

Asimov (1985). A correlation tour scans 1d-projections of the X-variables simultaneously

with 1d-projections of the Y-variables. It can expose relationships between two groups of

variables, providing an exploratory alternative to regression and canonical correlation

analysis: hence the name correlation tour.
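
Random targets for a correlation tour can be drawn as sketched below. Normalizing a vector of independent standard Gaussian draws is a standard way to sample uniformly from a unit sphere; the function names and the index-list interface are assumptions made for the example.

```python
import math
import random


def random_unit(dim, rng):
    """Uniform direction on the unit sphere: normalize a Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]


def correlation_target(x_vars, y_vars, p, rng):
    """x-vector supported on the X-variables, y-vector on the Y-variables."""
    a_x, a_y = [0.0] * p, [0.0] * p
    for j, value in zip(x_vars, random_unit(len(x_vars), rng)):
        a_x[j] = value
    for j, value in zip(y_vars, random_unit(len(y_vars), rng)):
        a_y[j] = value
    return a_x, a_y


rng = random.Random(0)  # seeded only to make the example repeatable
a_x, a_y = correlation_target(x_vars=[0, 1, 2], y_vars=[3, 4], p=6, rng=rng)
```

Because the two index sets are disjoint, the sampled vectors are automatically orthogonal.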

1D-Projections

For 1d-projections, only the x-coordinates are linear combinations of variables. The

constraints give a 1d-projection when all active variables are X-variables. With a single

X-variable, the result is a density estimate for the X-variable.

A 1d-tour is obtained when the sequence of targets is randomly chosen subject to these

constraints. Data viewer plots of 1d-projections show a density estimate, so that watching a

1d-tour lets us scan the marginal distributions.

At any one time, the program requires that all active variables are either all A-variables, or all

X- or Y-variables. That is, combinations of A-variables with either X-variables or

Y-variables are not allowed. Projections that satisfy the constraints resulting from such combinations are not easily

interpretable, and I can think of no appropriate applications.

Additional Constraints

Following some changes to the active set of variables, additional constraints are imposed on

the next target. When the active set

(i) loses members,

that is, has members that become inactive, the new target is the plane lying closest to the

current plane, and satisfying the constraints imposed by the modified set. Notice the

target is completely specified. For example, suppose a_x and a_y are the current x- and

y-vectors, and the first variable is removed from the X-variables. The new target is

given by a_x^-, a_y^-, where (up to normalization) a_x^- = a_x - (a_x^T e_1) e_1 and

a_y^- = a_y. Smooth motion towards the new target establishes how the scatterplot

changes as the projection plane becomes orthogonal to some variables. I will give an

example in section 7.1.

(ii) gains members,

the new target is obtained by modifying the current plane along the coordinate directions

for the added variables. Suppose a_x and a_y are the current x- and y-vectors, and the first

variable is added to the X-variables. Then, a_y remains unchanged, while

a_x^new = a_x + α e_1 (up to normalization), where α is an arbitrary amount. Smooth

motion towards this target enables us to judge how moving the projection plane in the

direction of a variable changes the appearance of the scatterplot.
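
Both target updates act on a single coordinate of the x-vector, as the following sketch shows. The function names are hypothetical, and the sketch assumes the vector does not become zero when a variable is dropped.

```python
import math


def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def drop_variable(a, j):
    """Remove the component of a along variable j, then renormalize."""
    return unit([0.0 if i == j else x for i, x in enumerate(a)])


def add_variable(a, j, alpha):
    """Move a by an amount alpha along variable j, then renormalize."""
    return unit([x + alpha if i == j else x for i, x in enumerate(a)])


lost = drop_variable([0.6, 0.8, 0.0], 0)                 # first variable leaves
gained = add_variable([0.0, 1.0, 0.0], 0, alpha=1.0)     # first variable joins
```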

5.2. Sequences of Targets

Even for guided tours, we will see that it is unnecessary to rely on the user to pick each target

in turn. With many applications it is enough that the user picks a particular area of data space,

and leaves target selection to some automatic procedure. For example, I already described

how a user can use constraints to select subspaces, and how this, in conjunction with random

target selection, produces different varieties of grand tours.

I call any scheme for providing sequences of targets a target generator method. Currently,

five such schemes are available in the data viewer program. For now, I describe these

schemes; applications are presented in chapter 7.

In what follows, ..., T_{-2}, T_{-1}, T_0 denotes the sequence of targets to date, i.e., T_0 is the

current plane.

(1) Scan

Scan provides a sequence of targets, T_1, T_2, T_3, ..., where each target is randomly selected.

By requiring that the targets satisfy the user-imposed constraints, we may obtain grand tours,

correlation tours and 1d-tours on subspaces spanned by variables.

(2) Local scan

This scheme is a variation on scan, designed for exploring the "neighborhood" of a plane, T_0,

say. By viewing a local scan, the data analyst may establish the sensitivity of the T_0

projection to small changes in the plane.

Local scan produces a sequence of targets T_1, T_0, T_2, T_0, ..., where alternate planes T_k, k > 0,

are randomly selected from a neighborhood of T_0. A step-size parameter controls the size of

the neighborhood. By requiring the targets T_k, k > 0, to satisfy constraints, the user can restrict

exploration to a particular neighborhood of T_0.

(3) Cycle

Cycle constructs a sequence of targets made up solely of the two planes: T_0, T_1, T_0, T_1, ....

As the projection changes, the user mentally connects the sequence of scatterplots. He/she

discovers where cases in one scatterplot are located in the other. This is a very important

benefit of moving scatterplots. Particularly with large numbers of cases, it is often not enough

to see the sequence of scatterplots once. For this reason, the cycle scheme is for moving to

and fro repeatedly along the same path segment.

(4) Backtrack

The backtrack scheme constructs a sequence made up of old targets: T_{-1}, T_{-2}, T_{-3}, ....

It is frequently useful to retrace data exploration steps. Particularly when viewing a sequence

of projections, interesting features can pass by quickly. Therefore, a scheme for moving

backwards through the sequence of projections is essential. Since the path is entirely

determined by the target planes, we can reconstruct past projections by re-using old targets.

(5) Rotate

With the rotate scheme, a user can control the rotation of 3-d and 4-d point clouds.

This capability does not quite fit in with the data viewer's paradigm for guided tours.

Typically, the user controls only the sequence of targets, and geodesic interpolation

determines the intermediate path. Instead, the rotate scheme uses two targets T_0 and T_1, say,

to specify a subspace, and the user may pick a particular rotation within this subspace.

By default, the rotation is in the direction given by the interpolation path from T_0 to T_1.

(Notice that this relies on the fact that the geodesic paths give rotations.) Rotate is similar to

cycle in that all paths are in the T_0, T_1 subspace. With rotate, the motion continues in the

given direction, whereas with cycle, the direction of motion is reversed on reaching the target.

Except in some special cases, the path produced by the rotate scheme does not necessarily

return to T_0.

For 1d-projections, T_0 and T_1 are two lines, and determine the plane of rotation for the

x-vector. That is, there is only one possible rotation path so that user controls are unnecessary.

For 2d- or xy-projections, there are an infinity of rotations. Recall that geodesic paths are

characterized by simultaneous rotations in a pair of orthogonal 2-planes (see equation 4.1 in

section 4.5). Choosing rotation planes and the speeds of rotation determines a 4-d rotation.

With xy-projections, the x- and y-vectors rotate in the subspaces spanned by the X and Y variables respectively. To

specify a path, the user needs to choose the relative speeds of rotation for the x- and

y-directions. An example in a later chapter will demonstrate why this is useful. (Note that

when the speeds form an irrational ratio, the resulting path is dense in the set of constrained

xy-projections.)

With general 2d-projections, it is more difficult to give the user full positioning power for the

projection plane. Within any 4-d subspace, there are arbitrarily many orthogonal 2-planes

which could contain the rotating unit vectors u(t) , v(t). The rotation speeds of these vectors

are also arbitrary, so there are too many variable factors determining the sequence of

projections. I describe a scheme which allows the user some measure of control in this

situation.

To begin with, consider the situation where the planes span a 3-d subspace. For this, the user

needs only pick a plane (or equivalently an axis) of rotation. Section 2.4 gave some examples.

A possible compromise in the case of 4-d subspaces is to let the user choose a rotation

contained in the dominant 3-d subspace. This dominant subspace is obtained as follows.

Take the default 4-d rotation produced by geodesic interpolation from T_0 to T_1 (see equation

4.2). We could approximate this path with a rotation in a 3-d subspace by stopping the slower

of the two rotating vectors v(t), that is, set θ_y to zero in equation 4.2. This leaves us with a

subspace spanned by the vectors (u_1, u_2, v_1). In the 3-d subspace, it remains to pick an axis of

rotation.

5.3. Data Dependent Paths

The methods I have described so far give the user a significant amount of control over the
moving sequence of projections. The methods have one factor in common: each supplies
sequences which are independent of the data. Even with guidance from the user, there is no
guarantee that data independent paths can yield interesting projections. Therefore, we need
data dependent paths, which hold greater promise of showing structure.

I have considered two possible ways of producing data dependent paths: (i) using data
dependent target generating schemes and (ii) using data dependent constraints.


Data Dependent Target Generators

There is a wide range of potentially useful data dependent target generating schemes. For
example, take any projection index, which measures some feature of each data projection.
Possible choices of indices are given in Huber (1985) and Friedman and Tukey (1974). Using
standard optimization techniques, one could design a target generating scheme which
produces projections having increasing values of this index.
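A minimal sketch of such a scheme for 1d-projections is given below in Python. It is only an illustration under stated assumptions: the index used here is the variance of the projected data, chosen as a simple stand-in rather than one of the indices from the cited papers, and the optimizer is a crude random local search. All function names and parameter values are hypothetical.

```python
import math
import random


def normalize(v):
    """Scale v to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def variance_index(data, direction):
    """Stand-in projection index: variance of the 1d-projection."""
    proj = [sum(a * b for a, b in zip(row, direction)) for row in data]
    mean = sum(proj) / len(proj)
    return sum((p - mean) ** 2 for p in proj) / len(proj)


def next_target(data, current, index, step_size, rng, tries=50):
    """Return a nearby direction whose index is at least that of `current`.

    Candidates are random perturbations of `current`; step_size bounds how
    far a new target tends to be from the current direction.
    """
    best, best_val = current, index(data, current)
    for _ in range(tries):
        cand = normalize([c + step_size * rng.gauss(0, 1) for c in current])
        val = index(data, cand)
        if val > best_val:
            best, best_val = cand, val
    return best


rng = random.Random(0)
# Two variables; the first has the larger spread.
data = [[rng.gauss(0, 3), rng.gauss(0, 1)] for _ in range(100)]
targets = [normalize([1.0, 1.0])]
for _ in range(20):
    targets.append(next_target(data, targets[-1], variance_index, 0.2, rng))
values = [variance_index(data, t) for t in targets]
```

By construction the index values along the sequence of targets never decrease. A real-time version would have to evaluate the index on every candidate quickly enough to keep the display moving.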

As before, the targets may also be restricted by user-chosen constraints and step-size.
However, allowing the constraints and step-size choices to be modified at any time demands
that the optimization be done in real-time. For some further discussion, see Buja et al. (1986).

The success of these methods depends on constructing indices measuring interesting features.

For real-time optimization, speed is an important consideration, so the index cannot be too

computationally demanding. In view of the available computing power, this is a real

restriction, especially for indices of 2d-projections. Data independent paths have the

significant advantage of speed, because no data based calculations are necessary.

Data dependent target generators are not included in the current version of the data viewer
program, so I do not discuss them further.

Data Dependent Constraints

Alternatively, one could give up on real-time optimization and pre-compute some directions
with high values for a suitable projection index. Then, data dependencies can be enforced on
the sequence of targets by imposing constraints with regard to these directions. With this
approach, the computational efficiency of the projection index is not a significant issue, since
the data dependent constraints are pre-computed.

The current version of the data viewer program supports data dependent constraints. I have
considered projection indices from classical methods in statistics, namely,
principal components, canonical correlation analysis, discriminant analysis and data sphering.

These techniques form a powerful tool kit for exploring multivariate data. They supply

projections of the data which are often highly structured. Because the projections optimize

some fairly simple indices, they can be easily interpreted. Also, the derived variables have an

order of importance, and so provide a possible basis for dimensionality reduction. Motion

aside, including these techniques in the data viewer program broadens the range of its

applications considerably. Chapter 7 gives some examples.

Suppose we have q directions, constructed so that projections onto these directions are in
some sense interesting. I call the directions derived variables, and represent them by the
vectors c1, c2, ..., cq in standardized variable space.

In section 5.1, I described how the original variables could be classified as X, Y, A, or
inactive, in order to impose constraints on targets. Classifying the derived variables in the
same way gives an analogous set of constraints. For simplicity, assume there are p derived
variables c1, c2, ..., cp spanning standardized variable space. (As long as c1, c2, ..., cq are
linearly independent, one can always construct additional directions cq+1, ..., cp so that this
holds.)

Introducing new variables raises some problems.

• Over-constrained targets:
With 2p variables, there are a total of 2p possible orthogonality constraints, making
over-constrained targets hard to avoid. I side-step this problem by requiring that any one
target may only be constrained with regard to either the original or derived variables. In
practice, this is not a real restriction.

• Choice of inner product:
We would like to plot projections onto pairs of cj vectors, but they are not necessarily
orthogonal. This implies that bivariate scatterplots of the derived variables need not be
obtained from orthogonal projections in the canonical inner product, which was chosen
so that the original (standardized) variables were mutually orthonormal (see section 3.3).
One way around this problem is to use a different inner product. Since by assumption the
derived variables are linearly independent, we can define an alternative inner product for
which they form an orthonormal set.
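One standard way to build such an inner product is to express each vector in coordinates with respect to the derived variables and take the ordinary dot product of the coordinates. The Python sketch below illustrates this for two derived variables that are not orthogonal in the canonical inner product. The function names and the example vectors are assumptions for illustration, not the data viewer's internal representation.

```python
def solve(columns, b):
    """Solve sum_j coef[j] * columns[j] = b by Gaussian elimination."""
    n = len(b)
    # Augmented matrix: row i holds coordinate i of every column, then b[i].
    m = [[columns[j][i] for j in range(n)] + [b[i]] for i in range(n)]
    for k in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(k, n), key=lambda r: abs(m[r][k]))
        m[k], m[pivot] = m[pivot], m[k]
        for r in range(k + 1, n):
            f = m[r][k] / m[k][k]
            for c in range(k, n + 1):
                m[r][c] -= f * m[k][c]
    coef = [0.0] * n
    for k in reversed(range(n)):
        s = sum(m[k][c] * coef[c] for c in range(k + 1, n))
        coef[k] = (m[k][n] - s) / m[k][k]
    return coef


def canonical_inner(a, b):
    """Inner product in which the original variables are orthonormal."""
    return sum(p * q for p, q in zip(a, b))


def alternative_inner(derived, a, b):
    """Inner product in which the derived variables are orthonormal."""
    return canonical_inner(solve(derived, a), solve(derived, b))


# Two linearly independent, non-orthogonal derived variables (hypothetical).
c1, c2 = [1.0, 0.0], [1.0, 1.0]
derived = [c1, c2]
```

Here `canonical_inner(c1, c2)` is 1, so c1 and c2 are not orthogonal in the canonical sense, while under `alternative_inner` they have unit length and zero inner product.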

By appropriate use of constraints we can

(i) request 2d-projections, xy-projections or 1d-projections on subspaces spanned by derived
variables. With the alternative inner product, the plots available from orthogonal
projections are different from before.

(ii) pick a particular xy-projection consisting of two derived variables. This is achieved by
having just one X-variable and one Y-variable. With one X-variable and no Y-variable,
the result is a marginal density estimate of a derived variable.

(iii) use methods for generating sequences of targets in subspaces spanned by derived
variables.

Principal component analysis, canonical correlation analysis, and discriminant analysis
provide derived variables which are "ordered", so they can be regarded as techniques for
dimensionality reduction. For example, one may concentrate exploration among the first
few principal components, the idea being that these dimensions contain most of the
structure present in the data. This can considerably reduce the amount of work necessary
to examine the data set. Also, a common criticism of exploration methods based on
projection planes moving through high-dimensional spaces is that interpretation of the
moving scatterplot is difficult. Including dimensionality reduction techniques in the data
viewer's repertoire alleviates this problem by supporting exploration of smaller and so
more interpretable subspaces.

CHAPTER 6

THE USER INTERFACE

In this chapter I describe the user interface to the task of constructing sequences of

projections.

It is important that the interface be consistent, so that each user action always has a similar

effect, regardless of the circumstances (Foley and Van Dam 1982). This makes the system

easier to learn and use, because there are no special cases to memorize.

I begin with a brief review of the available functionality. Then I describe the user

interactions, with some discussion of how the interface meets the goal of consistency.

6.1. Functionality

The previous chapter presented methods for producing paths of planes which use interpolation
between pairs of user-selected targets. I described four tools for selecting targets, namely,
constraints, schemes for providing sequences of targets, controls for rotating 4-d point clouds,
and finally, data dependent constraints. The user can construct paths of planes using these
tools.

Following the paradigm for paths of planes, the data viewer program supplies a sequence of

projections as follows:

Suppose the current plane is P1, that is, the display shows the projection of observations and
variables onto P1. Let T1 denote the target plane. Then, intermediate planes P1, P2, P3, ...
are generated along an interpolating path from the start plane, P1, to the target T1. When
the current plane reaches the target (Pi = T1 for some i), another target T2 is obtained.
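The control flow of this paradigm can be sketched in a few lines of Python. To keep the example short it uses 1d-projections (unit direction vectors) rather than planes, and interpolates along the great circle between start and target. The names `interpolate`, `tour` and `target_generator`, and the fixed angular step, are assumptions for this sketch and not the program's own interface.

```python
import math


def interpolate(start, target, step):
    """Yield unit vectors from start toward target in angular increments
    of `step` radians, ending exactly at target."""
    cos_angle = max(-1.0, min(1.0, sum(a * b for a, b in zip(start, target))))
    angle = math.acos(cos_angle)
    # Component of target orthogonal to start.
    ortho = [b - cos_angle * a for a, b in zip(start, target)]
    norm = math.sqrt(sum(c * c for c in ortho))
    if norm < 1e-12:
        # Start and target coincide (or are opposite): jump to the target.
        yield list(target)
        return
    ortho = [c / norm for c in ortho]
    t = step
    while t < angle:
        yield [math.cos(t) * a + math.sin(t) * o for a, o in zip(start, ortho)]
        t += step
    yield list(target)


def tour(start, target_generator, step, n_targets):
    """Yield the displayed projections: interpolate to a target, then ask
    the target generator for the next one."""
    current = start
    for _ in range(n_targets):
        target = target_generator(current)
        for projection in interpolate(current, target, step):
            current = projection
            yield current


# A fixed list of targets stands in for a target generator scheme.
pending = iter([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
frames = list(tour([1.0, 0.0, 0.0], lambda current: next(pending), 0.5, 2))
```

With a step of 0.5 radians, each quarter-turn leg produces three intermediate projections followed by the target itself.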

The user controls the selection of successive targets by selecting a target generator scheme

and setting up constraints. This can also involve supplying data dependent constraints.

Variables may be classified as inactive, X-, Y- or A-variables. The classifications define
constraints, which are imposed on targets as described in the previous chapter. Changing the
constraints imposes additional constraints on the very next target. Currently, the data viewer
program provides a choice of five target generator schemes: scan, local scan, cycle, rotate and
backtrack. The current target generator scheme provides the next target. Usually the target is
random, subject to the user-imposed constraints.

The sequence of targets is never deterministic because the user is always free to intervene and
re-direct the moving scatterplot by changing the constraints or the target generator. By
appropriate choice of target generator, the user can obtain a local scan in the neighborhood of
the current plane, or cycle between the current plane and previous target. Generally, changes
to the constraints result in succeeding targets satisfying the modified restrictions.

Some of the target generator algorithms produce sequences that re-use previous planes, for
example, local scan and backtrack. For this reason, I distinguish between old and new
targets. Only new targets can be guaranteed to satisfy the current constraints. The target
generators operate as described in the previous chapter, except for one special
case, occurring when the user changes the constraints. Then the next target is new,
irrespective of the current choice of target generator. This exception is necessary, because
when the user makes a change, he or she expects that this affects the motion.


6.2. Choosing Targets

In chapter 2, we saw that the data viewer program has a graphical interface. The user points

the mouse cursor at some mouse sensitive part of the data viewer window, and clicks on a

mouse button. In response, the display changes somehow. Some of the mouse clicks affect

the path of planes.

Figure 6.1: A data viewer window (window title: ST.HELENS)

For controlling paths of planes, there are two important mouse sensitive areas on the screen,
namely, the variable boxes and the control panel. By default, each data set variable receives a
box, placed on the l.h.s. of the screen. Section 6.4 will discuss how to add boxes for the
derived variables to the display list. Labels in the left hand corner of each variable box give
the current classification of that variable: the labels A, X, Y and a blank denote an A-, X-, Y-
or inactive variable respectively. A string drawn in the control panel gives the current choice
of target generator.

For example, in figure 6.1 the boxes for the variables latitude, longitude and

depth have A labels. Also, "rotate" appears in the control panel. This means that the

window can currently show 3-d rotations in the space spanned by these three variables.

Variable Boxes

Changing the constraints is achieved by changing the variable box labels. The circular area
marked in each variable box is mouse sensitive; clicks in this region change the labels.
Within the circle, the mouse cursor has an X, Y, A or O shape, depending on its location. I
use the different shapes to indicate the effect of a click. Figure 6.2 shows the mouse cursor's
shape in each part of the circle.

Figure 6.2: Changing shape of the mouse cursor

When the cursor shape is X, Y, A or O, a single click on the left or middle mouse button
changes the label to an X, Y, A or blank respectively. For X and Y cursor shapes, middle
clicks have an additional effect. If the box with the mouse cursor receives an X label, then an
X in any other variable box is replaced by a blank label. This reduces the number of clicks
necessary to pick bivariate scatterplots.

There are other situations where changing the label in one box changes labels in others as a
side effect. Recall from section 5.1 that mixtures of X- or Y-variables with A-variables are
not allowed. The program enforces this rule as follows. When an X or Y label changes to an
A, all other X or Y labels change to A's automatically. Conversely, when an A label
changes to X (or Y), all other A labels become Y's (or X's). There is no good reason for
this particular assignment of labels; many other schemes would do just as well.
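The two side-effect rules stated above can be written down compactly. The following Python sketch models only those two rules; the dictionary representation of the labels, the use of an empty string for an inactive variable, and the function name are assumptions for the example.

```python
def set_label(labels, name, new):
    """Return the labels after variable `name` receives the label `new`.

    Labels are 'X', 'Y', 'A' or '' (inactive). Only the two side effects
    described in the text are modelled:
      * X or Y -> A : every other X or Y label becomes A
      * A -> X (Y)  : every other A label becomes Y (X)
    """
    old = labels[name]
    result = dict(labels)
    result[name] = new
    for other in result:
        if other == name:
            continue
        if old in ("X", "Y") and new == "A" and result[other] in ("X", "Y"):
            result[other] = "A"
        elif old == "A" and new in ("X", "Y") and result[other] == "A":
            result[other] = "Y" if new == "X" else "X"
    return result


before = {"latitude": "X", "longitude": "Y", "depth": ""}
after = set_label(before, "latitude", "A")
# after == {"latitude": "A", "longitude": "A", "depth": ""}
```

The input dictionary is left unchanged, which makes it easy to compare the labels before and after a click.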

Changes to the variable box labels as I have described have no instantaneous effect on the
displayed projection; they only affect selection of new targets. Of course, smooth motion
towards the target is not always of interest, for example, when examining bivariate
scatterplots. To accommodate these situations, I assign double clicks on mouse buttons in the
variable boxes to mean change the label and the displayed scatterplot. With double clicks, the
scatterplot changes immediately to show the projection on to the new target. At all times,
mouse clicks in two boxes are sufficient to produce a bivariate scatterplot. Section 2.3 gave
an example.

As the mouse cursor moves within the circle, the mouse documentation line at the bottom of
the screen briefly describes the effect of a click. Suppose the mouse shape is X, then the
documentation is "L: X variable, M: Single X, double click for plot change", where "plot
change" refers to the fact that the projection changes along with the label. As long as the
variable boxes are not too small (which happens when the window is too small or the number
of variables too large), I find this a reasonable scheme for selecting constraints.


Control Panel

Figure 6.3 shows a close-up of the control panel. (The strings in italics do not appear on the

screen, they have been added for identification purposes.)

Figure 6.3: The control panel (strings shown: SMOOTH, negative speed, step-size, SCAN, speed)

The control panel consists of 6 mouse-sensitive regions. When the mouse cursor moves into
one of them, the region's border is temporarily highlighted. The regions represent five ways
of controlling motion. A mouse click in one region affects the choice of target generator.
Clicks in other areas control factors such as whether the motion is on or off, and the speed at
which the scatterplot moves.

(i) Motion switch:
This is the small rectangular region currently containing the string "OFF". The switch
displays either of two values, "ON" or "OFF". When the value shown is "ON", the data
viewer's scatterplot is moving through a sequence of projections. When the value is
"OFF", the projection plane does not change. A left mouse click in this area toggles the
string's value from "OFF" to "ON", causing the scatterplot to move.

(ii) Speed:
One of the two rectangular areas at either side of the motion switch contains a vertical
dashed line, which I call the speed bar. The position of the bar gives the speed of the
moving scatterplot; the farther the line is from the motion switch, the faster the
scatterplot moves. (Fast motion is also jerkier, because successive projection planes,
P1, P2, P3, ..., are farther apart.) A bar on the left hand side of the switch also tells us that
the scatterplot is moving backwards through old projections, i.e., backtrack is the current
target generator. A left mouse click anywhere within these two rectangles repositions the
speed bar, causing the scatterplot to speed up or slow down.

(iii) Interpolation switch:
This is the rectangular area currently showing the string "SMOOTH". The switch
displays either of two values, "SMOOTH" or "JUMPY", indicating the presence or lack
of interpolation between successive target planes. The usual value is "SMOOTH",
causing the data viewer's scatterplot to move through a sequence of close projection
planes, P1, P2, P3, .... As a contrast to smooth motion, the program also allows for jumpy
motion. In this case there is no interpolation, so that Pi = Ti, i = 1, 2, ..., and
successive projection planes may be arbitrarily far apart. A left mouse click in this area
toggles the string's value between "SMOOTH" and "JUMPY".

(iv) Step-size:
This is the rectangular area containing the string "step-size". The vertical dashed line is
the step-size bar, which may be repositioned by mouse clicks, just like the speed bar.
The position of the bar determines how far away a new target can be; recall that this
comes in useful with the local scan target generator.

(v) Target Generator:
The rectangle in the top right-hand corner of the control panel shows a string
representing one of the target generators scan, rotate, local-scan or cycle. Mouse clicks
in this area change the current choice of target generator. The fifth target generator,
backtrack, is chosen in a different manner, by giving speed a negative value.


Path Parameters

The foregoing discussion described how the user selects the path of planes P1, P2, P3, ..., by
choosing values for (i) the motion switch, (ii) speed, (iii) the interpolation switch, (iv)
step-size, (v) the target generator and (vi) the variable box labels, which I term collectively the
path parameters. The interface will be consistent if these parameters are orthogonal. We say
two parameters are orthogonal when each of the values of either has the same effect for all
values of the other. Orthogonality simplifies the user interface, because the user only has to
remember the function of each parameter alone, rather than in conjunction with the other five.
With my choice of path parameters, orthogonality is not always achievable. At least, each
parameter value should have a predictable effect for all values of other parameters.

The path parameters have a natural separation into two orthogonal groups--

(i) target parameters, for controlling the sequence of targets T1, T2, T3, .... These are the
step-size, target generator and box labels.

(ii) motion parameters, for controlling the intermediate planes. These are the motion switch,
the interpolation switch and speed control.

These groups are orthogonal because target generation is not affected by motion parameters.
Also, for any two targets, motion parameters determine the intermediate planes in the same
way, no matter how those targets were generated. For example, changing the motion switch
to "OFF" always stops the motion.

Within each group, there are some dependencies. The effect of either box labels or step-size
depends on the current choice of target generator. Target generating schemes that re-use old
targets, such as backtrack, simply ignore the current settings of other path parameters.
Schemes like scan that provide new targets do not. Some exceptions to this rule are
necessary. Otherwise, clicks in variable boxes could produce a bivariate scatterplot
sometimes, but not always. The resulting interface would be dominated by the mode of target
generation, which is quite intolerable. To avoid this, I require a label change to be followed
by a new target (see section 6.1), so that the immediate effect of changing labels does not
depend on the target generator. Then, double middle clicks in two variable boxes can always
produce a bivariate scatterplot.

The speed parameter has different interpretations for the two kinds of motion, but in both
cases, its absolute value controls how quickly the plot moves through the sequence of targets.
In the case of jumpy motion, speed controls the flash frequency by regulating the time interval
between targets. For smooth motion, intermediate planes are generated along the path
between start and target planes at increments proportional to speed.
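The two interpretations of speed can be summarized in a small Python sketch. The constants `base_interval` and `base_increment`, the units, and the function name are invented for illustration; the text only states that the time interval shrinks and the increment grows with the absolute value of speed.

```python
def frame_schedule(speed, interpolation, base_interval=1.0, base_increment=0.05):
    """Translate the speed setting into timing for the two kinds of motion.

    base_interval (seconds between targets at unit speed) and
    base_increment (path increment per frame at unit speed) are
    hypothetical constants chosen for this sketch.
    """
    magnitude = abs(speed)  # the sign only selects backtracking
    if magnitude == 0:
        raise ValueError("speed must be non-zero for the plot to move")
    if interpolation == "JUMPY":
        # Faster speed means a shorter pause between successive targets.
        return {"seconds_between_targets": base_interval / magnitude}
    if interpolation == "SMOOTH":
        # Faster speed means intermediate planes that are farther apart.
        return {"increment_per_frame": base_increment * magnitude}
    raise ValueError("interpolation must be 'SMOOTH' or 'JUMPY'")


jumpy = frame_schedule(2.0, "JUMPY")
smooth = frame_schedule(-2.0, "SMOOTH")
```

Keeping the sign out of both formulas reflects the statement that only the absolute value of speed controls how quickly the plot moves.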

There are some cases where, regardless of the value shown by the interpolation switch, smooth
motion is not provided. Instead, the display jumps to the next target. The first case is rather
obvious: when switching between 2d- and 1d-projections. The next case is when the start and
target planes give 2d- and xy-projections respectively. Geodesic plane interpolation for two
planes need not arrive at the xy-projection, but rather at a rotated xy-projection. To avoid this
inconsistency, some other form of interpolation would be necessary. The case where start and
target planes are both xy-projections, but the start's x- (y-) subspace overlaps with the target's
y- (x-) subspace, is similar.

6.3. Rotating 3-D and 4-D Point Clouds

When the plot interaction mode shows "PROJECTION", clicks on the plot region help control
the moving projections. We already had some examples in section 2.4 of how the user could
rotate 3-d point clouds in particular directions. The following sections give a more complete
description.


Clicks on the left mouse button in the plot region toggle the state of the motion switch. This

is convenient for two reasons. It makes it far easier for the user to stop the motion

immediately. Some care is necessary to position the mouse cursor in the rather small region

allocated to the motion switch, causing a potential delay in stopping the motion. Also,

moving the scatterplot by clicks in the plot region rather than on the motion switch gives a

more direct style of interface.

In section 5.2, I suggested how one might specify a path of planes through a 4-d subspace. I

considered the two cases of xy-projections and 2d-projections individually. Now, I describe

how the data viewer user can actually pick such a path, by clicking the mouse in the plot

region. This capability is available only when rotate is the current choice of target generator.

XY-Projections

For xy-projections, the user needs to choose the relative speeds of rotation for the x- and
y-subspaces. After a mouse click in the plot region, the scatterplot moves. The program uses
the slope of the line from the mouse cursor to the plot region's center as the relative speeds.
When the mouse click is near the horizontal line through the plot region's center, the rotation
speed for the y-direction is comparatively small. The resulting path is "almost" a 3-d rotation.
This special case illustrates why I chose this particular mechanism for providing relative
speeds. Following a click, the points appear to move towards the mouse cursor.
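A small Python sketch of this mapping follows. Normalizing the offset to unit length is an assumption made here so that only the ratio of the two speeds (the slope) matters; the function name and the coordinate convention are also hypothetical.

```python
import math


def relative_speeds(click, center):
    """Map a mouse click to relative rotation speeds (x-speed, y-speed).

    The offset of the click from the plot centre is scaled to unit length,
    so the ratio y-speed / x-speed equals the slope of the line from the
    centre to the click.
    """
    dx = click[0] - center[0]
    dy = click[1] - center[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return (0.0, 0.0)  # a click exactly on the centre gives no direction
    return (dx / length, dy / length)


# A click on the horizontal line through the centre: no y-rotation at all.
horizontal = relative_speeds((10.0, 0.0), (0.0, 0.0))
# A click slightly above that line: the y-speed is comparatively small.
nearly_horizontal = relative_speeds((10.0, 1.0), (0.0, 0.0))
```

A click close to the horizontal line gives a small y-speed, which corresponds to the "almost" 3-d rotation described above.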

2D-Projections

Suppose the two planes, T0 and T1, are the start and target respectively. Following the
suggestion of the previous chapter, the program approximates the interpolation from start to
target by a path in the dominant 3-d subspace. In this 3-d subspace, it remains to pick an axis
of rotation. For simplicity, suppose the two planes, T0 and T1, span a 3-d subspace. As usual,
after a mouse click in the plot region, the scatterplot moves. This click also picks a vector a in
T0. If b is the vector in the subspace which is orthogonal to T0, the click causes a rotation in
the a-b plane.

As a user, I found it easier to pick a direction of rotation, rather than the axis itself. This way,
the points move towards the position of the mouse cursor at the time of the click. Also, it
means that clicks in the plot region have consistent interpretations for 2d- and xy-projections.
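The geometry of a rotation in the a-b plane can be sketched as follows in Python. The sketch assumes an orthonormal basis e1, e2 for the current plane and a unit vector b orthogonal to it; the function name, the argument layout and the sign convention of the angle are assumptions for illustration.

```python
import math


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def rotate_in_ab_plane(e1, e2, b, click, angle):
    """Rotate the plane spanned by orthonormal e1, e2 within the 3-d
    subspace spanned by e1, e2 and the unit vector b orthogonal to both.

    `click` is the (horizontal, vertical) offset of the mouse click from the
    plot centre and selects the in-plane direction a. The rotation happens
    in the a-b plane, so the in-plane vector orthogonal to a is the fixed
    axis. Returns an orthonormal basis (rotated a, axis) of the new plane.
    """
    length = math.hypot(click[0], click[1])
    if length == 0:
        raise ValueError("a click at the plot centre picks no direction")
    ca, sa = click[0] / length, click[1] / length
    a = [ca * p + sa * q for p, q in zip(e1, e2)]       # direction of motion
    axis = [-sa * p + ca * q for p, q in zip(e1, e2)]   # unchanged by rotation
    a_rot = [math.cos(angle) * p + math.sin(angle) * q for p, q in zip(a, b)]
    return a_rot, axis


e1, e2, b = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
a_rot, axis = rotate_in_ab_plane(e1, e2, b, (1.0, 1.0), 0.3)
```

Picking the direction a and picking the axis are equivalent, since the axis is simply the in-plane vector orthogonal to a.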

6.4. Derived Variables

Boxes for Derived Variables

Each of the data set variables is represented in the display by a variable box. Suppose I used
these p boxes to represent the derived variables rather than the originals, and used the
alternative rather than the canonical inner product. Then, the user could interact with these
boxes just as before to produce plots of the derived variables.

Instead, I add new boxes to the display for the derived variables. There are two advantages to

this approach.

(i) More user control: With 2p variable boxes, the user has more choices. Mouse clicks in
appropriate boxes can now select bivariate scatterplots of the original, or the derived
variables.

(ii) More information displayed: Each of the boxes has a line drawn from the box center
representing the projection of its variable onto the current plane. We can see any given
projection as a linear combination of the original and the derived variables
simultaneously.


The new boxes are drawn on the right hand side of the window to separate them from the
existing boxes. Whenever necessary, I will distinguish the two sets of boxes by using the
terms l.h.s. boxes and r.h.s. boxes. With new variable boxes for derived variables, the user has
additional controls for guiding the moving scatterplot. These are:

Setting up constraints:

To avoid over-constrained targets, each target may be constrained with regard to either the
original or derived variables. The data viewer enforces this by allowing the boxes for one set
of variables only to be mouse sensitive at any given time. Then, the user selects constraints
by clicking on the mouse sensitive boxes. Clicks on the other set of boxes have no effect.
The user can recognize the mouse sensitive boxes on sight, because they alone have X, Y or
A labels. It is easy to switch control from one set of boxes to the other. Suppose the r.h.s.
boxes are mouse sensitive. A mouse click close to the l.h.s. boxes makes them mouse
sensitive instead, and labels are drawn in the l.h.s. rather than the r.h.s. boxes.

A choice of inner products:

When the user picks one or other set of boxes to be mouse sensitive, he/she also chooses
between the canonical and alternative inner product; the current inner product is such that the
variables represented by the mouse sensitive boxes are orthonormal. When the mouse
sensitive set of boxes switches from l.h.s. to r.h.s. (or vice versa), the inner product changes,
causing the result of the projection to change also.

Constructing Derived Variables

Suppose there is a list named derived-variables containing p derived variables. The
user tells a data viewer window named a-DV-window to display new boxes for the derived
variables by executing the Lisp code

(send a-DV-window :add-variables derived-variables)

Then, the window is redrawn, with additional boxes for the derived variables placed in the
window's r.h.s.

Currently, I give special treatment to four sets of derived variables, obtained from principal
component analysis, canonical correlation analysis, discriminant analysis, and also data
sphering. To produce principal components, for example, the user only has to run the code

(send a-DV-window :add-variables :prcomp)

Then the data viewer program itself calculates the principal components. This is far more
convenient for the user; he/she needs no knowledge of the data viewer's internal data
representation. Also, each of the named methods involves calculations based on some
selection of the cases and variables. In the data viewer, these selections could be made
graphically by marking the relevant variables and cases on the screen. At the moment, only
graphical selection of variables is provided for.

CHAPTER 7

APPLICATIONS

This chapter contains some examples of exploring data with the data viewer program, using

the tools described.

7.1. Places Data

The places data consists of scores for 329 US cities on 9 criteria, chosen to measure

"livability" of the cities (Rand McNally 1986). The nine criteria are climate, housing, health

care, crime, transportation, education, the arts, recreation and economics. For housing and

crime, the lower the score the better. For all other variables, the higher the score, the better.

I have added an additional three variables to the data set. They are population, latitude and
longitude for each of the 329 cities. To eliminate skewness, I transformed population to a log
scale.

Figure 7.1(a) shows a data viewer window for the places data set, with a scatterplot of two
variables, latitude against longitude. Notice the latitude box has a Y label, and
the longitude box has an X label. The two "extra" points on the left hand side of the map
represent Anchorage, Alaska, and Honolulu, Hawaii. Their latitude and longitude coordinates
have been adjusted so that all cities fit nicely into the plot region.

Figure 7.1: Bivariate scatterplots of places data (two panels, (a) and (b), each titled PLACES)

In the univariate transformation part of the pipeline operations, I took care to scale these two
variables appropriately (see section 3.2). By default, variables are standardized individually,
which in this case would give an elongated U.S.

Mouse clicks in the climate box change the display to show climate and latitude.

Figure 7.1(b) shows the result. From this plot we see that northern cities tend to have either

very bad, or quite good climate.

Connecting Two Bivariate Scatterplots

Instead of changing the display immediately from one bivariate scatterplot to another, we can

gain a lot of information by watching a smooth progression from one scatterplot to another,

that is, connecting the scatterplots. In this way, we discover which U.S. cities have good or

bad climate. Grasping the features of this motion usually takes a few repetitions. This is the

purpose of the cycle target generator.

We construct the sequence of projections which connect the scatterplots shown in figure 7.1
(a) and (b) as follows: Suppose the window currently displays climate and latitude,
and we pick longitude and latitude as the target. A click on the motion switch
toggles the value from "OFF" to "ON", and the projection begins to change. When the
projection reaches the target, motion pauses momentarily, and then resumes back towards the
climate, latitude plot. The displayed projection continues to cycle between these two
scatterplots until a click on the motion switch changes its value to "OFF", or another target
generator is selected.

Figure 7.2 shows one of the intermediate projections. From the variable boxes we see that

both climate and longitude have non-zero projections in the horizontal direction.

By watching the smooth progression repeatedly between the pair of scatterplots shown in

figure 7.1 (a) and (b), we gain the following information:

Figure 7.2: Connecting bivariate scatterplots (window title: PLACES)

• The cluster of points with the best climate are all Californian cities.

• The northern cluster with good climate are north west cities.

• The mid-west has the worst climate: Minnesota, Wisconsin, and the Dakotas.

• The Atlantic coast of Florida has far better climate than does the Gulf coast.

In the same way, we can let the projection cycle between the longitude, latitude and
the climate, housing plots. Figure 7.3(a) shows a bivariate scatterplot of climate
and housing; figure 7.3(b) shows one of the intermediate projections. By cycling between
the pair of scatterplots we see that

• Highest housing costs are in the vicinity of New York. (The two points with very high

scores on housing are actually Connecticut cities).

Figure 7.3: Connecting bivariate scatterplots (two panels, (a) and (b), each titled PLACES)

• California has high housing costs.

Density Estimates

Some of the ratings, in particular the-arts and health-care, give extremely high

scores to the biggest cities: New York, Chicago and L.A. This results in scatterplots where

most of the observations are clustered together, so that associations between variables are hard

to pick out. For this reason, I transformed the ratings to normal scores.
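The transformation to normal scores can be sketched as follows. This is a minimal illustration, not the data viewer's own code: the plotting position (rank - 0.5) / n, the tie-breaking rule and the example ratings are all assumptions introduced here.

```python
from statistics import NormalDist


def normal_scores(values):
    """Replace each value by the standard normal quantile of its rank.

    Assumption (not from the source): plotting position (rank - 0.5) / n,
    with ties broken by original order.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = NormalDist().inv_cdf((rank - 0.5) / n)
    return scores


# Hypothetical ratings with one extreme value, standing in for a very big city.
ratings = [3.0, 5.0, 4.0, 6.0, 500.0]
scores = normal_scores(ratings)
```

The extreme rating keeps its rank but no longer dominates the scale, which is why the observations stop clustering in one corner of the plot.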

Figure 7.4 shows two data viewer windows, the upper one with the rating variables as before,

and the lower one with the normal scores. There are no longer boxes for latitude and

longitude. Both windows display a density estimate for a linear combination of the rating

variables. The linear combination is the same since the data viewer windows are linked by

common projections. Notice the dot on the extreme right in the upper plot; this is New York.

In the lower plot, New York lies far closer to the other cities. As the x-vector changes, we see

how the transformation to normal scores affects the 1-d projections. The density estimate in

the lower window is generally symmetric, and quite often looks "bell-shaped". For the

untransformed ratings, the 1-d projections have highly skewed distributions. With a moving

x-vector, the density's peak shifts to and fro across the screen.

Most of the nine rating variables tend to assign high values to big cities. To judge the overall

nature of the association between population and the ratings, we can examine plots of

population against linear combinations of the rating variables (on a normal scores scale).

Suppose we pick population as the single Y-variable, and make each of the-arts,

health-care, economics, education and recreation X-variables. (From the

bivariate scatterplots, these five have the strongest association with population.) Then, the

target generator scan yields a correlation tour of population against linear combinations

of the five X-variables.

Figure 7.4: Density estimates

Figure 7.5: A correlation tour

By watching the moving scatterplot, we discover a projection with high x-y association, as

shown in figure 7.5(a). We can see that population is linearly related to a weighted

average of the five selected rating variables. Also, health-care and the-arts have

the largest coefficients, whereas the coefficients for economics and recreation are

comparatively small. (The variables have been transformed to normal scores, so that it is

reasonable to compare their projection coefficients.)

Deactivating Variables

Do the variables economics and recreation have a negligible contribution to the x-y

association in the above projection? We may answer this question as follows. Suppose we

make the two variables inactive. Then the x-vector for the next target is the current x-vector

orthogonalized with regard to the two deactivated variables (see section 5.1). With a rotation

towards this target, we receive a visual impression of how the quality of the x-y association

deteriorates.
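The orthogonalization step can be pictured with a small sketch. It assumes the canonical inner product and treats each variable as a coordinate axis, so that orthogonalizing against a deactivated variable amounts to zeroing its coefficient and rescaling; the details of section 5.1 may differ, and the coefficients below are hypothetical.

```python
import math


def deactivate(x_vector, names, inactive):
    """Project x_vector orthogonally to the axes of the inactive variables,
    then rescale the result to unit length (canonical inner product assumed)."""
    reduced = [0.0 if name in inactive else coef
               for name, coef in zip(names, x_vector)]
    norm = math.sqrt(sum(c * c for c in reduced))
    if norm == 0.0:
        raise ValueError("x-vector lies entirely in the deactivated variables")
    return [c / norm for c in reduced]


names = ["the-arts", "health-care", "economics", "education", "recreation"]
x = [0.6, 0.6, 0.2, 0.4, 0.2]  # hypothetical projection coefficients
target = deactivate(x, names, {"economics", "recreation"})
```

The remaining coefficients keep their relative sizes, so the rotation towards the new target isolates the effect of dropping the two variables.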

The second plot in figure 7.5 shows the projection onto the new target. Overall, it looks very

similar to the previous plot, with most changes occurring among cities with lower population.

association observed in the upper plot.

7.2. Satellite Data

The Satellite-oct data set consists of microwave brightness temperatures recorded by

the NIMBUS-7 satellite one October Saturday (Madrid 1978). The data was observed on a

grid of 1000 points, located in a long, narrow strip stretching from the Bering Sea to the North

Pole, for the purpose of investigating the ice cover. Each observation includes location, given

by latitude and longitude, and six microwave variables corresponding to three

frequencies (in GHz) with values for horizontal and vertical polarizations. I name the variables

18-H, 18-V, 22-H, 22-V, 37-H and 37-V. In each name f-p, f and p represent the

frequency and polarization respectively.

Figure 7.6: Viewing the satellite data

Bivariate Scatterplots

Figure 7.6 contains two bivariate scatterplots of the data: 18-V against 22-V and 37-V

against 22-H. The first scatterplot shows two correlated variables, with a dense lump of

points at either end. From plotting either variable against latitude, it is clear that the

clusters on the right and left are located at high and low latitudes, and the remaining points at

middle latitudes. We conclude that the upper cluster represents icy areas, the lower one water,

and the middle points icy water.

The same split into two dominating clusters is evident from the 1-d projections of all the

microwave variables, except for 37-V. In figure 7.6(b), the upper cluster has a distinct rod.

In fact, the rod of points has a strong negative correlation with latitude, implying that cases to

the left on the rod are at higher latitudes and so have deeper ice cover.

4-D Rotations

By doing 4-d rotations with 18-V, 37-V as the two X-variables, and 22-V, 22-H as the

two Y-variables, the cluster representing non-icy locations also shows further structure. The

upper plot in figure 7.7 gives one of the xy-projections obtained.

The x-vector is a contrast of 37-V and 18-V. This separates a small rod of points from the

lower cluster, lying slightly to the right and above the majority. These particular observations

are from a location where there is known to be rough, open water. From left to right in the

horizontal direction we have thick and thin ice, and calm and stormy water. Since thinner ice

has a rougher surface, it seems that the contrast between high and low frequencies

distinguishes between locations with rough and smooth surfaces.

Figure 7.7: 4-D rotations

The second plot in figure 7.7 shows a second of the xy-projections obtained. Here the

y-vector is a linear combination of the two Y-variables, and the horizontal projection is as

before. The projection has changed very little, except that the rod emanating from the lower

cluster is more distinct.

7.3. Principal Components of the Satellite Data

Principal components analysis aims to summarize the data in a small number of dimensions. I

use a principal components analysis of the Satellite-oct data to discover its major

sources of variability.

Constructing Principal Components

Let $Z$ be the $n \times p$ data matrix, containing the observations in standardized coordinates.

(Assume the standardization step has centered the variables to zero mean.) The principal

components $c_1, c_2, \ldots, c_p$ are defined as follows (Mardia, Kent and Bibby 1982):

$c_1$ maximizes $c^t Z^t Z c = \mathrm{var}(Zc)$, subject to $c^t c = 1$.

Similarly,

$c_j$ maximizes $c^t Z^t Z c = \mathrm{var}(Zc)$, subject to $c^t c = 1$, and $c^t c_k = 0$, for $k = 1, 2, \ldots, j-1$.

The $c_j$, $j = 1, \ldots, p$ are the eigenvectors with decreasing eigenvalues $\mathrm{var}(Z c_j)$ of the matrix

$Z^t Z$. They are calculated using a singular value decomposition of $Z$ (Golub and Van Loan

1983).
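To make the definition of the first principal component concrete, here is a small sketch that finds the unit vector maximizing the quadratic form in Z'Z. Note that it uses power iteration rather than the singular value decomposition cited above, purely because power iteration fits in a few lines of plain Python; the data, function names and iteration count are assumptions. The method can fail if the starting vector happens to be orthogonal to the leading eigenvector.

```python
import math


def center(rows):
    """Subtract each column mean so that the variables have zero mean."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in rows]


def first_principal_component(rows, iterations=200):
    """Approximate the unit vector c maximizing c'Z'Zc by power iteration."""
    z = center(rows)
    p = len(z[0])
    # Cross-product matrix Z'Z of the centered data.
    s = [[sum(row[i] * row[j] for row in z) for j in range(p)] for i in range(p)]
    c = [1.0 / math.sqrt(p)] * p  # arbitrary unit-length starting vector
    for _ in range(iterations):
        w = [sum(s[i][j] * c[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        c = [x / norm for x in w]
    return c


# Hypothetical two-variable data lying close to the line y = 2x.
rows = [[-2.0, -4.1], [-1.0, -1.9], [0.0, 0.1], [1.0, 1.9], [2.0, 4.1]]
c1 = first_principal_component(rows)
```

Subsequent components could be obtained the same way after removing the directions already found, which mirrors the orthogonality constraint in the definition.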


Satellite Data

By executing some code (see section 6.4), we obtain the principal components of the first 6

variables, that is, the microwave readings.

The principal components computed depend on the scales of the variables, and it falls on the

user to choose suitable scales by modifying, if necessary, the pipeline's univariate

transformations (section 3.2). In this case, all microwave readings are measured in the same

units, so I used the same scaling factor for all six variables, preserving their relative scales.

Therefore, the principal components are the eigenvectors of the data's covariance matrix (in

measurement and in standardized coordinates).

The derived variables consist of the principal components $c_1, c_2, \ldots, c_6$, plus $e_1$, $e_2$. The

additional variables $e_1$ and $e_2$ are included so that the derived variables span standardized

coordinate space. (Notice this requires that standardization has centered the variables to a

mean of zero.) The derived variables are orthonormal with regard to the canonical inner

product. For principal components analysis, there is no need for an alternative inner product.

When the calculations are complete, the data viewer redraws its window. There are more

boxes on the r.h.s. for the derived variables. The additional boxes are ordered column-wise,

and labeled pc-1 through pc-6, followed by latitude and longitude.

Principal Component Plots

Suppose the r.h.s. boxes are mouse sensitive. Mouse clicks in the first two r.h.s. boxes select

the bivariate scatterplot of pc-1 and pc-2, as shown in figure 7.8. It contains a sharper

version of the structure displayed in an earlier figure. The observations form three rods: a sparse

rod in the vertical direction, with two denser rods at angles at either end. Once again, by

examining the latitude and longitude coordinates, we discover that the rods correspond to

no

Figure 7.8: Principal component plot of satellite data

From the l.h.s. boxes, we see the relative contributions of the original variables to the first two

principal components. By examining the magnitudes of the lines in the horizontal direction,

we notice that the first principal component is a weighted average of the original variables.

The microwave readings are strongly related to the surface temperature. With the range of

latitudes, temperature is a source of variability among the locations. It is likely

that the first principal component is a surrogate for surface temperature.

The second principal component has positive coefficients for the 37 GHz measurements, and

negative coefficients for the other frequencies. As was previously remarked, the contrast

between high and low frequencies distinguishes between rough and smooth surfaces.

Figure 7.9: Original variable plots

Original Variable Plots

Suppose the l.h.s. boxes are mouse sensitive and we select a 1-d projection of 22-V, given in

figure 7.9. The r.h.s. boxes display the projections of the principal components onto the

direction of this variable, demonstrating that 22-V lies close to the direction of the third

principal component. In general, such plots may be informative; for example, a variable with

large coefficients for the (p-1)th and pth principal components only has a less important

contribution to the overall variability of the data set.

Variability of Principal Components

Figure 7.10: Variability of principal components

Unlike the data set variables, no standardization is applied to the derived variables prior to

plotting. This means we can graphically compare the relative variabilities of the principal

components by flipping through a series of plots.

In the plot of pc-5 versus pc-6 (see

figure 7.10), all points are concentrated in a solid lump at the center. In fact, pc-6 is a

contrast of the horizontal and vertical measurements at each frequency.

The first principal component accounts for 96% of the variability in the data set, and the first

and second components together for 99%. This is not surprising, because the variables tend to

be correlated. Also, when scanning the space spanned by the 6 microwave readings, a "z"-like

shape dominates throughout. The same shape is clearly evident in the plot of the first two

principal components.

Dynamic Comparisons of Principal Components

The satellite-dec data set is similar to satellite-oct, containing observations

on the same variables at the same locations, but recorded in December. Figure 7.11 shows the

satellite-oct and satellite-dec data, in the upper and lower windows

respectively. The r.h.s. boxes represent the principal components (computed individually) for

both sets of data. The windows are linked by common projection operations, allowing

dynamic comparisons of the principal components. Clicking on the boxes for pc-1 and

pc-2 in the upper window yields a bivariate scatterplot of pc-1 and pc-2 in both

windows.

For both data sets, the first principal component is a weighted average of the microwave

variables and the second component is a contrast of high and low frequencies. It seems that

the sources of variability for microwave measurements do not change much from October to

December. However, the "z" shape is no longer evident in the December plot, since almost all

locations are covered by ice.

7.4. Sphered Satellite Data

"Sphering" the data means transforming it to identity covariance. One motivation for viewing

sphered data is that it has all linear structure removed. Usually, this is the structure that is

easiest to find and understand, but it may be distracting, obscuring non-linear relationships. In

a sphered coordinate system it becomes easier to pick out clusters and non-linear associations.

This is why data is often sphered prior to applying projection pursuit methods (Friedman

1987).

Figure 7.11: Comparing principal components

Constructing Sphered Variables

Take the principal components $c_j$, $j = 1, \ldots, p$, defined in the previous section. Let

$v_j = \mathrm{var}(Z c_j)$, $j = 1, \ldots, p$. Then the directions $c_j / \sqrt{v_j}$ represent sphered variables. I use these

particular sphered variables because they lie along the principal components. The sphered

variables are orthogonal but not orthonormal for the canonical inner product, so the data

viewer makes use of the alternative inner product (see section 6.4).
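A brief sketch of the rescaling involved: projecting onto c_j / sqrt(v_j) is the same as dividing the principal component scores Z c_j by sqrt(v_j), which leaves every sphered variable with unit variance. The scores below are hypothetical, and the variance uses divisor n, which is an assumption.

```python
import math


def variance(column):
    """Variance with divisor n (an assumption; the source does not specify)."""
    n = len(column)
    mean = sum(column) / n
    return sum((x - mean) ** 2 for x in column) / n


def sphere(pc_scores):
    """pc_scores[j] holds the projections Z c_j; rescale each to unit variance."""
    return [[x / math.sqrt(variance(col)) for x in col] for col in pc_scores]


# Hypothetical, uncorrelated scores on two components with very unequal spread.
pc_scores = [[-30.0, -10.0, 10.0, 30.0], [0.2, -0.2, -0.2, 0.2]]
sphered = sphere(pc_scores)
```

After sphering, a component with tiny variance fills the plot as fully as the dominant one, which is what makes low-variance structure visible.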

Sphered Variable Plots

Figure 7.12 shows two plots from the satellite-oct data, demonstrating dramatically

the effect of sphering. Each of the first six boxes on the r.h.s. represents a sphered variable,

which are named s-1, s-2, ..., s-6.

The upper plot shows 18-V versus 18-H, and the two variables are highly correlated. From

the r.h.s. boxes we see that these variables point in the same direction in sphered coordinate

space. To sphere the scatterplot, one could imagine stretching it in a direction perpendicular

to the axis which runs through its middle.

When we make the r.h.s. boxes mouse sensitive, the display changes due to the change in

inner product. The second plot of figure 7.12 shows the result. What were the upper and

lower dense portions of a long narrow point cloud are now two parallel clusters. (The A

labels are automatically assigned to the r.h.s. boxes whose variables are not orthogonal to the

projection plane.) Also, the lines drawn in both sets of variable boxes are quite different. All

these differences are due to the changed definition of orthogonality.

Figure 7.13 shows s-2 and s-4, or equivalently, a plot of the second and fourth

standardized principal components. In this particular data set the projection onto the first

principal component has by far the largest variance. Plots of the 3rd to the 6th components

show all points concentrated at the window's center (see figure 7.10). One possible remedy uses

viewporting operations to blow up the picture to a reasonable size (section 3.5), but this causes

an unacceptably high amount of clipping for other projections. Sphered coordinates avoid this

problem.

Figure 7.12: Sphering the satellite data

Figure 7.13: Sphered variable plot

From figure 7.13 we see there is nevertheless considerable structure in the fourth

principal component; the long cluster of points on the left corresponds to the locations with

change of structure would not be evident.

The sphered plots of the data viewer give the same information as a biplot of the data matrix

(K.R. Gabriel 1981).

7.5. Discriminant Analysis of Places Data

A discriminant analysis aims to distinguish between groups. It achieves this by producing

linear combinations which separate the group means as much as possible.

I perform a discriminant analysis to discover how the ratings vary across locations. First, I

group cities by location. The groups contain (i) west coast states plus Alaska and Hawaii, (ii)

Rocky mountain states, (iii) mid-west states, (iv) south-west states, (v) south-east states and

(vi) north-east states. Cities in each group are plotted with different glyphs. They are (i) a

square, (ii) a horizontal dash, (iii) a plus sign, (iv) an "x", (v) a triangle and (vi) an open circle.

The discriminant analysis produces five linear combinations of the (untransformed) rating

variables which best separate the groups.

Constructing Linear Discriminants

Let $Z$ be the $n \times p$ matrix containing the cases (in standardized coordinates), and let

$Z_1, \ldots, Z_g$ be the $n_k \times p$ matrices containing the cases belonging to each of the $g$ groups. $z_{ki}$

denotes a $p$-vector, representing case $i$ in group $k$. We use the vectors $m_k$, $k = 1, \ldots, g$ and $m$

for the group and overall means respectively.

The linear discriminants $c_1, \ldots, c_q$, where $q = \min(g-1, p)$, are as follows (Mardia, Kent and

Bibby 1982):

$c_1$ maximizes the ratio of between group variance to within group variance of $Zc$,

where the between group term is $\sum_{k=1}^{g} n_k [c^t (m_k - m)]^2$.

$c_j$ maximizes the ratio of between group variance to within group variance of $Zc$,

such that $Zc$ is uncorrelated with each of $Z c_l$, $l = 1, \ldots, j-1$.

For computational details, see Chambers (1977, section 5k).

Discriminant Variable Plots

The upper plot in figure 7.14 shows the different groups. The lower plot displays a projection

obtained by performing 3-d rotations in the space spanned by the first three discriminants.

This projection gives good separation of the west coast and north-east states. For a clearer

presentation of the two groups, I marked them with large squares and open circles

respectively, and used invisible glyphs for cities in all other regions. The west coast and

north-east cities form two clusters separated in the horizontal direction. The l.h.s. boxes show

which rating variables contribute to the separation. For example, we see that west coast cities

have better climate, but poorer health care and education.

It is actually not so straightforward to visually pick out projections which distinguish between

subsets. In a single scatterplot where the groups overlap, different glyphs (or colors) do not

give us an immediate perception of the group locations. Alternagraphic methods (Tukey

1973), where subsets are displayed in turn, attempt to solve this problem. In conjunction with

motion, such methods help in finding projections which distinguish groups. For the places

example, I found that the west coast and north-east cities were well-separated in the space of

the first three discriminants by displaying just two subsets at a time as the points moved.

Figure 7.14

CHAPTER 8

DESIGNING THE DATA VIEWER

In this chapter, I discuss the design of the data viewer, that is, the procedures and data

structures which constitute the program. The discussion is of interest to future designers of

similar programs. It is of relevance to users also, providing a user-model for how the program

works (Foley and Van Dam 1982). Equipped with a clear model of the program, a user can

operate it more efficiently and creatively.

8.1. Object-Oriented Programming

The data viewer program is implemented in Flavors (Cannon 1982), an extension to Lisp.

This language supports a style of programming called object-oriented. Flavors had a strong

influence on how I approached the design problem.

The data viewer program is organized around a collection of objects. The user-model consists

of a description of these objects and their interrelationships. The user-model itself does not

depend on the choice of language; the same user-model could equally well apply to a data

viewer implemented in Fortran. However, an object-oriented language simplifies

implementation, because it contains tools for building abstractions which behave like the

objects in the model. I will describe the data viewer design using the terminology of

object-oriented programming.

Object-oriented programming is based on the concept of an object. At the simplest level, an

object is a data structure, similar to a record in Pascal or a structure in S. Corresponding to

the fields of a record, an object has instance variables. The instance variables determine the

state of an object. The main difference between objects and Pascal records is that

communication with their contents is achieved by a mechanism called message passing.

Messages can change the state of an object by modifying its instance variables.

Object-oriented languages have special properties that distinguish them from more traditional

languages such as Fortran or Pascal. (i) They provide a means for building abstractions and

(ii) messages act like generic functions, in that different types of object can respond to the

same message. Both of these properties encourage a highly modular style of programming. I

will demonstrate this using some examples.

Data Sets

There are many potential representations for data sets. 2-D arrays of cases by variables are a

common choice, but a list of lists, say, could be used just as well.

As I explained in section 2.1, we regard data sets as collections of cases. In the data viewer

program, I use a representation for data sets, data-set objects, which is close to this

description. A data-set has an instance variable containing a list of case objects. For

example, places and st.helens are both data-set objects. This representation for

data sets is due to McDonald (1986).

The advantage of using objects to represent data sets is that the internal implementation may

be hidden from the user. The set of messages understood by an object defines its interface

with the outside world. If they agree on a protocol, the programmer may change the internal

representation of an object without the user's knowledge. Therefore, objects and message passing

provide a mechanism for abstraction.
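The abstraction argument can be illustrated with a short sketch, written in Python rather than Flavors. Both classes below answer the same two messages, so a caller cannot tell whether the cases are stored as a list of case objects or as a 2-D array. All class names, method names and values are hypothetical and are not taken from the data viewer.

```python
class Case:
    """One observation: a name plus a value for each variable."""

    def __init__(self, name, values):
        self.name = name
        self.values = dict(values)


class DataSet:
    """Stores a list of case objects, as described in the text."""

    def __init__(self, cases):
        self._cases = list(cases)

    def case_names(self):
        return [case.name for case in self._cases]

    def values_of(self, variable):
        return [case.values[variable] for case in self._cases]


class ArrayDataSet:
    """Alternative representation: a 2-D array of cases by variables."""

    def __init__(self, names, variables, rows):
        self._names = list(names)
        self._variables = list(variables)
        self._rows = [list(row) for row in rows]

    def case_names(self):
        return list(self._names)

    def values_of(self, variable):
        j = self._variables.index(variable)
        return [row[j] for row in self._rows]


# Hypothetical values; only the shared interface matters here.
as_cases = DataSet([Case("New-York", {"climate": 638}),
                    Case("Seattle", {"climate": 768})])
as_array = ArrayDataSet(["New-York", "Seattle"], ["climate"], [[638], [768]])
```

Code that only calls case_names and values_of keeps working if one representation is swapped for the other.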


Data Viewer Windows

A data viewer window, like the editor or system window, is represented by an object. The

instance variables of a window object determine its position and appearance on the screen.

Window objects respond to messages for refreshing their display, and for moving or reshaping

them on the screen.

A DV-window object represents each data viewer window. A DV-window object has an

instance variable named display-list, containing variable-boxes,

control-panel, scatterplot, title and plot-interaction-menu. Each item in the

display list is an object, responding to a common set of messages, but in different ways. For

example, executing the following code draws the contents of a DV-window display by

sending each object in the display list a :draw message.

;; Send the :draw message to every item in the display list.
(loop for item in display-list do
  (send item :draw))

The :draw message acts like a generic function, dispatching on the type of display-list

item. This feature of object-oriented programming simplifies implementation because

only minimal changes to existing code are necessary to add new items to the display

list.
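For readers unfamiliar with Flavors, the same idea can be sketched in Python, where a method call plays the role of a message. The class names and return strings are hypothetical; the point is only that the drawing loop never inspects the type of an item.

```python
class Scatterplot:
    def draw(self):
        return "scatterplot drawn"


class ControlPanel:
    def draw(self):
        return "control panel drawn"


class Title:
    def __init__(self, text):
        self.text = text

    def draw(self):
        return f"title drawn: {self.text}"


def draw_window(display_list):
    """Send the same 'draw' message to every item; each responds in its own way."""
    return [item.draw() for item in display_list]


display_list = [Title("PLACES"), Scatterplot(), ControlPanel()]
results = draw_window(display_list)
```

Adding a new kind of display item only requires a new class with a draw method; draw_window stays unchanged.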

8.2. A Simple User-Model

Dependent objects are the main ingredient of the data viewer's user model. Figure 8.1 is a

dependency diagram, showing a simplified version of the model.

In figure 8.1, there are boxes representing DV-window and data-set objects. The

directed line connecting the boxes indicates that the DV-window depends on the

data-set, in that its display always shows the data-set's current state. I term this dependency

a display constraint. Since it is a one-way dependency, the data-set is completely

independent of the DV-window. In practice, DV-window has an instance variable

referring to the data-set, so I call data-set a component object of DV-window.

DV-window -----> data-set

Figure 8.1: Simple data viewer model

DV-window 1, DV-window 2 -----> data-set

Figure 8.2: Sharing data sets

In section 3.6, I discussed linked data viewer windows. We saw how two data viewer

windows are linked when they show overlapping data (sub)sets. The dependency diagram

in figure 8.2 shows two DV-windows that depend on a single data-set.

The display constraint implies that when any property of the data-set changes, the display

changes so as to show the new state. For example, executing the code

(send New-York :set-shape :big-triangle)

changes the shape property of the case object New-York in the places data. Then, all

glyphs representing New-York are redrawn with a big triangular shape. Notice that this

involves redraws in all windows showing the case New-York. In fact, a display

constraint is a convenient model for interactive brushing (Stuetzle 1986), but this is not

currently available in the program under discussion.

8.3. A Detailed User-Model

A DV-window object produces a view by applying a sequence of transformations to the

cases. The particular view displayed is determined by the pipeline parameters (chapter 3).

The simple version of the user model does not describe the dependency of the view on these

parameters.

Transformation Objects

I use distinct transformation objects to hold the parameters for each pipeline operation. Like

the data-set, each transformation object is a component object of DV-window. The

objects are independent of each other, so they may be developed and implemented separately,

giving the program a modular design. The transformation objects are

• a univariate-transformation object

This contains a function, such as log or inverse, and a center and scale factor for each variable.

The object's depends on the data-set, and may be

A user can modify the contents of a univariate-transformation object in either of

two ways: by selecting items on a menu, or by executing some code. Modification by menu

is easier for the user; code execution offers more flexibility. Suppose u-t is a

univariate-transformation object. Then, the following sends a message to

u-t, to ensure that the variables 18-H, 18-V, 22-H, 22-V, 37-H and 37-V are

scaled by a common factor of 250.

(send u-t :scale-by 250
      '(:18-H :18-V :22-H :22-V :37-H :37-V))

The :group-variables message ensures that the same standardization procedure is

applied to more than one variable.

(send u-t :group-variables
      '(:18-H :18-V :22-H :22-V :37-H :37-V))

Then, the data-dependent center and scale factors are computed over all six variables.

• a projection-engine object

The current projection plane is held in an instance variable of the projection-engine.

Additional instance variables represent the six path parameters of section 6.3, namely (i) the

motion switch, (ii) speed, (iii) the interpolation switch, (iv) step-size, (v) the target generator

and (vi) the variable box labels. Sending a :next-plane message to a

projection-engine results in a projection plane update, as determined by the path parameters.

• a density-estimator object

This object has instance variables which are the number of bins and a smoothing parameter

(see section 3.4).

• a viewporter object

This has instance variables specifying the affine transformation from plot to screen

coordinates. In a fashion similar to that described in Buja et al (1981), the parameters are

changed by depressing mouse buttons.
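The effect of the :group-variables message described above for the univariate-transformation object can be sketched as follows. The source does not say how the data-dependent center and scale are defined, so this sketch assumes the mean and the standard deviation, pooled over every value of the grouped variables; the function names and readings are hypothetical.

```python
import math


def grouped_center_and_scale(columns, group):
    """Pool every value of the grouped variables; return one center and one scale."""
    pooled = [x for name in group for x in columns[name]]
    center = sum(pooled) / len(pooled)
    scale = math.sqrt(sum((x - center) ** 2 for x in pooled) / len(pooled))
    return center, scale


def standardize_group(columns, group):
    """Apply the shared center and scale to each variable in the group."""
    center, scale = grouped_center_and_scale(columns, group)
    return {name: [(x - center) / scale for x in columns[name]] for name in group}


# Hypothetical brightness temperatures for two of the microwave variables.
columns = {"18-H": [100.0, 120.0], "18-V": [200.0, 220.0]}
center, scale = grouped_center_and_scale(columns, ["18-H", "18-V"])
z = standardize_group(columns, ["18-H", "18-V"])
```

Because both variables share one center and one scale, their relative sizes survive standardization, which is what the principal components analysis of section 7.3 relies on.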

Figure 8.3 shows the complete dependency diagram for the data viewer. Each of

DV-window's five component objects (the data-set and four transformation objects) is

represented by a box. A directed line is interpreted as before; DV-window's display is

constrained to show the current state of its five component objects. Also, the

univariate-transformation object depends on a data-set, which is not

necessarily the data-set displayed in the window. This is useful for

the univariate transformations.

DV-window -----> data-set, univariate-transformation, projection-engine, histogram-estimator, viewporter

univariate-transformation -----> data-set *

Figure 8.3: Full data viewer model

The main ideas of the data viewer program (motion, linking and interaction) have convenient

explanations in terms of the dependency diagram. Thus, it provides a user's model.

Motion

Motion is a direct consequence of the display constraint. The constraint requires that the

displayed view changes when the parameters held in any of the transformation objects change.

When the projection-engine, for example, changes continuously producing a

sequence of planes, the window shows a continuously moving scatterplot, with moving lines

appearing in the variable boxes.

Linking

Two (or more) DV-windows are linked when one or more of the five component objects are

common to both windows. Recall the example in section 2.6. There were two

windows, both showing observations from the St. Helens data-set, and they were

linked by projection.

DV-window 1, DV-window 2 -----> St.Helens, projection-engine

Figure 8.4: ... and projection engines

Figure 8.4 gives the dependency diagrams for these windows. (For clarity, only the common

component objects

in both windows.

The user can construct windows linked by projection as follows: Suppose

st.helens-viewer is a DV-window for the St. Helens data. Executing the following code creates

a second DV-window for the St. Helens-dense subset which is linked to the first.

;; Reuse the first window's projection engine so both windows share projections.
(make-DV-window :data-set st.helens-dense
                :projection-engine
                (send st.helens-viewer :projection-engine))

Making multiple DV-windows depend on the same objects is a convenient way to achieve

linking. Since each window satisfies its own display constraint, the linked windows

themselves are totally unaware of each other. Basically, by following this design, windows

can be linked with arbitrary combinations of component objects at no extra cost to the user or

programmer.
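Linking by shared component objects can be sketched in a few lines of Python. The classes are simplified stand-ins rather than the data viewer's own: the projection plane is reduced to a step counter, and the names are hypothetical.

```python
class ProjectionEngine:
    """Holds the current projection plane (here reduced to a step counter)."""

    def __init__(self, plane=0):
        self.plane = plane

    def next_plane(self):
        # Placeholder update; a real engine would interpolate between planes.
        self.plane += 1


class DVWindow:
    """A window that displays whatever its component objects currently hold."""

    def __init__(self, data_set, projection_engine):
        self.data_set = data_set
        self.projection_engine = projection_engine

    def view(self):
        return (self.data_set, self.projection_engine.plane)


engine = ProjectionEngine()
first = DVWindow("st.helens", engine)
# Passing the first window's engine to the second is all that linking requires.
second = DVWindow("st.helens-dense", first.projection_engine)
engine.next_plane()
```

Neither window refers to the other; both simply show the state of the one engine they share.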

Interaction

Data-sets and transformation objects may have displayed representations. When any of

the component objects are shared among data viewer windows, they may have multiple

displayed representations. A click on a mouse sensitive displayed representation of an object

sends a message which changes its state. For example, the control-panel and the

variable-boxes display the current state of the path parameters contained in the

projection-engine. Section 6.2 explained how mouse clicks in these areas change the

parameters. Then, the display constraint demands that all displayed representations of the

modified object be updated.


8.4. Implementation

This section gives a brief discussion of some implementation issues.

Processes

The Symbolics Lisp environment supports multi-tasking, where a number of processes appear

to run simultaneously. This is achieved by the system scheduler, which allocates time

slices to processes in turn.

Windows provided as part of the environment, such as file system windows, typically have

their own individual processes. Similarly, each DV-window has a process. A window's

process is responsible for displaying its items and handling its user inputs. This is a natural

way of handling multiple DV-windows, allowing changes to occur in all windows

simultaneously.

Since the transformation objects are components of DV-window, changes to these objects

usually take place within the window's process. For example, when the

projection-engine's motion switch is on, it produces a sequence of planes, due to the DV-window

repeatedly sending the :next-plane message. However, the same

projection-engine could be a component of arbitrarily many DV-windows. This leads to problems

caused by multiple processes simultaneously accessing the same object. I side-step this

difficulty by not permitting the scheduler to interrupt the :next-plane operation. A more

general solution would use locks attached to the shared data structure.
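
The lock-based alternative can be illustrated with a minimal sketch. It is written in Python rather than in the Lisp of the data viewer, and the class and function names are hypothetical stand-ins, not part of the program. Each thread plays the role of a DV-window process, and the lock is attached to the shared engine itself.

```python
import threading


class ProjectionEngine:
    """Hypothetical stand-in for a projection engine shared by several windows."""

    def __init__(self):
        self._lock = threading.Lock()  # lock attached to the shared structure
        self.step = 0                  # position of the current plane on the path

    def next_plane(self):
        # Only one window process at a time may advance the path.
        with self._lock:
            self.step += 1
            return self.step


def window_process(engine, n_frames):
    # Each window repeatedly asks the shared engine for its next plane.
    for _ in range(n_frames):
        engine.next_plane()


engine = ProjectionEngine()
windows = [threading.Thread(target=window_process, args=(engine, 1000))
           for _ in range(4)]
for w in windows:
    w.start()
for w in windows:
    w.join()
print(engine.step)  # every one of the 4 * 1000 requests was counted
```

Compared with disabling interruption, a lock of this kind delays only those processes that actually want the same object.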

The Display Constraint

This implementation is a generalization of the scheme suggested by Stuetzle (1986), for

scatterplot brushing.

The display constraint requires that the DV-window always shows the current state of the

transformation objects and the data-set. That is, the DV-window must always be

up-to-date. The DV-window's process executes an eternal loop, continually checking whether or

not the display remains up-to-date. If not, the display is updated. With this arrangement, the

display is redrawn only when necessary. This is important for the application of moving

scatterplots; with multiple data viewer windows on the screen, unnecessary scatterplot draws

would seriously impede performance.

Once again, we see the use of generic messages; the DV-window updates itself by sending

each member of its display-list an :update message. A DV-window is up-to-date

if each item in its display-list responds t (for true) to an :up-to-date? message.

The display-list item may itself have a display-list. For example,

variable-boxes has a display-list containing each individual variable-box.

In this case it handles the :up-to-date? message by passing the query on to its own

display-list items.

Otherwise, a display-list item is out-of-date if it is a screen representation for a

changed object. The scatterplot is out-of-date when any of the component objects have

changed. A variable box label is out-of-date when the labels in the projection-engine

are modified. Changes are determined by comparing time stamps. The component objects

record the time of their most recent change.

Other schemes for implementing one-way dependencies are discussed in McDonald (1986).
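
The time stamp scheme can be sketched as follows. This is an illustration in Python under stated assumptions, not the Flavors code of the data viewer: the class names, the logical clock and the draw counters are all hypothetical, and a single call to refresh stands for one pass of the window's loop.

```python
import itertools

_clock = itertools.count(1)  # logical clock standing in for real time stamps


class Component:
    """A data-set or transformation object that records its last change."""

    def __init__(self, value):
        self.value = value
        self.changed_at = next(_clock)

    def set(self, value):
        self.value = value
        self.changed_at = next(_clock)


class Item:
    """A displayed representation of one or more component objects."""

    def __init__(self, *components):
        self.components = components
        self.drawn_at = 0    # never drawn yet
        self.draw_count = 0  # kept only to show how often a redraw happens

    def up_to_date(self):
        return all(c.changed_at <= self.drawn_at for c in self.components)

    def update(self):
        if not self.up_to_date():
            self.draw_count += 1
            self.drawn_at = next(_clock)


class Group:
    """A display-list item that has a display-list of its own."""

    def __init__(self, *items):
        self.display_list = list(items)

    def up_to_date(self):
        # Pass the query on to the group's own items.
        return all(i.up_to_date() for i in self.display_list)

    def update(self):
        for i in self.display_list:
            i.update()


def refresh(display_list):
    """One pass of the window's loop: update only if something is stale."""
    if not all(item.up_to_date() for item in display_list):
        for item in display_list:
            item.update()


data = Component([1.0, 2.0])
labels = Component(["x", "y"])
plot = Item(data)
boxes = Group(Item(labels), Item(labels))
window = [plot, boxes]

refresh(window)          # first pass draws everything
refresh(window)          # nothing changed, so nothing is redrawn
labels.set(["u", "v"])
refresh(window)          # only the two label items are redrawn
```

In this sketch the scatterplot is drawn once and each label item twice, which is the behaviour the display constraint is meant to guarantee.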

The Pipeline

Speed is an important issue in the production of scatterplots moving in real-time; the rate at

which new plots appear on the screen determines how clearly we perceive the motion. With

the computational power of the Symbolics Lisp machine 36xx at my disposal, the naive

approach of performing each pipeline transformation (see figure 3.1) anew on the data to

produce every plot is far too slow. The data viewer obtains significant speed-ups by

• Using integer arithmetic

The univariate transformations are applied to the data and the results cached, until the

data set itself or the univariate transformations are modified. The transformed data is

converted to 16 bit integers, so that all further computations use integer arithmetic. The

loss in accuracy is negligible given the resolution of the screen, but the speed gain is

considerable, particularly since our Lisp machines do not have floating point hardware.

• Using geodesic interpolation

With geodesic interpolation between planes (section 4.5), the computation time is almost

independent of the number of variables. For each geodesic segment, cases are projected

onto R^4, so that most plots require only a projection from R^4 to R^2.

• Drawing onto the bitmap

The Lisp machine environment provides the usual set of graphics primitives, but these

are too slow for real-time scatterplot motion. Instead, the data viewer scatterplots are

produced by writing directly into the screen array holding the bitmap. Timing

experiments (performed by J. McDonald) have shown that this reduces the time for a

scatterplot draw by a factor of 5.
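
The saving from geodesic interpolation in the list above can be made concrete with a small sketch in Python. It assumes six variables, takes the first four coordinate axes as an orthonormal basis of the 4-d subspace containing both target planes, and uses a simple illustrative path between the planes; the function names are hypothetical. The cases are reduced to R^4 once, and each plot then costs only a projection from R^4 to R^2.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project(rows, x):
    """Apply a matrix, given as a list of rows, to the vector x."""
    return [dot(r, x) for r in rows]


P = 6  # number of variables (an assumption for this example)


def unit(i):
    return [1.0 if j == i else 0.0 for j in range(P)]


# Orthonormal basis of the 4-d subspace containing both target planes,
# assumed here to be the first four coordinate axes.
basis = [unit(0), unit(1), unit(2), unit(3)]


def frame(t):
    """Plane at angle t on a path from span(e0, e1) to span(e2, e3)."""
    c, s = math.cos(t), math.sin(t)
    f1 = [c * a + s * b for a, b in zip(unit(0), unit(2))]
    f2 = [c * a + s * b for a, b in zip(unit(1), unit(3))]
    return [f1, f2]


cases = [[0.5, -1.0, 2.0, 0.25, 3.0, -2.0],
         [1.5, 0.0, -0.5, 1.0, -1.0, 4.0]]

# Once per segment: reduce every case from R^P to R^4.
cases4 = [project(basis, x) for x in cases]


def plot_coords(t):
    # Per plot: express the frame in the 4-d basis (a 2 x 4 matrix) ...
    f4 = [project(basis, f) for f in frame(t)]
    # ... and project the cached 4-d cases to R^2.
    return [project(f4, y) for y in cases4]


def plot_coords_naive(t):
    # Reference: project the full P-dimensional cases for every plot.
    return [project(frame(t), x) for x in cases]
```

Both functions give the same plot coordinates, but the work per case in plot_coords does not grow with the number of variables.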

The data viewer program displays projections of 500 cases roughly at the rate of 12 per

second. For this calculation, the plots were drawn on the monochrome screen and each glyph

consisted of 4 pixels. (Single pixel points are too small for comfortable viewing and are only

slightly faster.) Even with direct access to the bitmap, the erase/draw step is responsible for

most (about 2/3) of the time required to display a new projection.
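
The integer arithmetic used in the pipeline can be sketched in the same spirit. The following Python fragment is an illustration only: the scaling rule, the 14-bit precision of the coefficients and all names are assumptions rather than details of the data viewer. Each variable is scaled to the signed 16-bit range once, and the inner product that produces a plot coordinate then uses integers throughout.

```python
INT16_MAX = 2 ** 15 - 1   # largest signed 16-bit value
COEF_BITS = 14            # fixed-point precision of the coefficients (assumed)


def quantize_column(values):
    """Scale one variable so that its largest magnitude maps to INT16_MAX."""
    biggest = max(abs(v) for v in values) or 1.0
    return [round(v / biggest * INT16_MAX) for v in values]


def quantize_coefs(coefs):
    """Turn coefficients in [-1, 1] into fixed-point integers."""
    return [round(c * (1 << COEF_BITS)) for c in coefs]


def project_int(icoefs, icase):
    """Integer-only inner product, shifted back to the scaled data units."""
    return sum(c * x for c, x in zip(icoefs, icase)) >> COEF_BITS


columns = [[0.2, -1.5, 3.0],       # two variables, three cases
           [10.0, 40.0, -25.0]]
icolumns = [quantize_column(c) for c in columns]   # cached until the data change
icases = list(zip(*icolumns))

coefs = [0.6, 0.8]                 # one row of a projection frame
icoefs = quantize_coefs(coefs)
xs = [project_int(icoefs, case) for case in icases]
```

The integer results differ from the floating point ones by a few units in a range of about 32767, which is far below one pixel at the resolution of the screen.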

CHAPTER 9

CONCLUSION

The goal of this research was to devise and implement ways of exploring multivariate data

based on motion through sequences of projections. This demands a highly interactive

program, so that the user may control the exploration process. The key issues involved in the

development of the resulting data viewer program were:

• guided tours

I found that extending a grand tour algorithm based on geodesic interpolation between

2-planes (Buja and Asimov 1985) gave a general paradigm for constructing guided tours

of data: interpolation between consecutive elements of a user-determined sequence of

planes. I developed four aids for plane selection: (i) constraints, (ii) methods for

producing sequences of planes, (iii) methods for controlling rotating 4-d point clouds

and (iv) derived variables. These aids put many schemes for exploration of multivariate

data at the user's disposal, from existing methods such as 3-d rotations and the grand

tour, to new methods such as local scan.

• the user-interface

A graphical interface is suitable for the data viewer. Parameters controlling the motion

have displayed representations, so that the user is always aware of their current state.

Using the mouse, the user can make choices quickly, which is particularly important for

interactive applications.

• the program design

The data viewer design consists of objects, related by a display constraint. I chose this

design because it provided a basis for implementing linked data viewer windows, useful

for the important applications of comparing and relating plots.

As is, the data viewer program contains a fairly comprehensive set of tools for constructing

sequences of projections. However, even within the scope of its paradigm for paths of planes,

there are many areas deserving more work. At present, all interpolation proceeds along a

geodesic path connecting pairs of planes, but alternatives should be considered. Secondly,

data dependent target generators, as described briefly in section 5.3, merit further attention.

BIBLIOGRAPHY

Asimov, D. (1985)

"The Grand Tour: A Tool for Viewing Multidimensional Data", SIAM Journal on

Scientific and Statistical Computing, vol. 6(1), p. 128-143.

Becker, R. A., Chambers, J. M. (1984)

S: An Interactive Environment for Data Analysis and Graphics, Belmont, CA:

Wadsworth.

Becker, R. A., Cleveland, W. S. (1987)

"Brushing Scatterplots", Technometrics, vol. 29, no. 2.

Becker, R. A., Cleveland, W. S., Wilks, A. R. (1986)

"High-Interaction Graphics for Data Analysis", Technical memorandum, AT&T Bell

Laboratories.

Buja, A., Asimov, D. (1985)

"Grand Tour Methods: An Outline", Computer Science and Statistics: Proceedings of the

17th Symposium on the Interface, Amsterdam: Elsevier.

Buja, A., Asimov, D., Hurley, C., McDonald, J. A. (1987)

"Elements of a Viewing Pipeline for Data Analysis", in Dynamic Graphics for Statistics,

eds. W. S. Cleveland and M. E. McGill, Monterey, CA: Wadsworth.

Buja, A., Hurley, C., McDonald, J. A. (1986)

"A Data Viewer for Multivariate Data", Computer Science and Statistics: Proceedings of

the 18th Symposium on the Interface.

Cannon, H. I. (1982)

"Flavors: A Non-Hierarchical Approach to Object-Oriented Programming", manuscript

from Symbolics Inc., 5 Cambridge Center, Cambridge, Mass.

Chambers, J. M. (1977)

Computational Methods For Data Analysis, Wiley, New York.

Donoho, A. W., Donoho, D. L., Gasko, M. (1985)

MacSpin: Graphical Data Analysis Software, D2 Software, Austin, Texas.

Donoho, D. L., Huber, P. J., Ramos, E., Thoma, M. (1982)

"Kinematic Display of Multivariate Data", Proceedings of the Third Annual Conference

and Exposition of the National Computer Graphics Association.

Fisherkeller, M. A., Friedman, J. H., Tukey, J. W. (1974)

"PRIM-9, An Interactive Multidimensional Data Display and Analysis System",

Proceedings of the Pacific ACM Regional Conference.

Foley, J. D., Van Dam, A. (1982)

Fundamentals of Interactive Computer Graphics, Addison-Wesley Publishing

Company, Reading, Massachusetts.

Fowlkes, E. B. (1971)

"Users Manual for an On-Line Interactive System for Probability Plotting on the DDP

24 Computer", Technical Memorandum, Bell Laboratories.

Friedman, J. H. (1987)

"Exploratory Projection Pursuit", Journal of the American Statistical Association 82.

Friedman, J. H., McDonald, J. A., Stuetzle, W. (1982)

"An Introduction to Real Time Graphics for Analyzing Multivariate Data", Proceedings

of the Third Annual Conference and Exposition of the National Computer Graphics

Association.

Friedman, J. H., Tukey, J. W. (1974)

"A Projection Pursuit Algorithm for Exploratory Data Analysis", IEEE Trans. Comp.,

C-23, 881-890.

Gabriel, K. R. (1981)

"The Biplot-Graphic Display of Multivariate Matrices for Inspection of Data and

Diagnosis", in Interpreting Multivariate Data, ed. V. Barnett, Wiley, London.

Golub, G. H., Van Loan, C. F. (1983)

Matrix Computations, Johns Hopkins University Press, Baltimore, Maryland.

Huber, P. J. (1985)

"Projection Pursuit", Annals of Statistics, vol. 13, no. 2.

Mardia, K. V., Kent, J. T., Bibby, J. M. (1982)

Multivariate Analysis, Academic Press.

Madrid, C. R. (Ed.) (1978)

The Nimbus User's Guide, Greenbelt, MD: NASA Goddard Space Flight Center.

McDonald, J. A. (1982)

"Interactive Graphics for Data Analysis", PhD thesis, Stanford University.

McDonald, J. A. (1986)

"Antelope: Data Analysis with Object-Oriented Programming and Constraints",

Statistics Department Technical Report, no. 89, University of Washington, Seattle.

McNally, R. (1986)

Places Rated Almanac.

Ryan, T. A., Joiner, B. L., Ryan, B. F. (1976)

Minitab Student Handbook, Duxbury Press.

Scott, D. W. (1985)

"Average Shifted Histograms: Effective Non-Parametric Density Estimation in Several

Dimensions", Annals of Statistics 13, p. 1024-1040.

Stuetzle, W. (1986)

"Design and Implementation of Plot Windows", Proceedings of the Statistical

Computing Section, American Statistical Association, pp. 32-40.

Symbolics (1983)

3600 Technical Summary, Symbolics Inc., 5 Cambridge Center, Cambridge, Mass.

Tukey, J. W. (1973)

"Some Thoughts on Alternagraphic Displays", Department of Statistics

Technical Report No. 45, series 2, Princeton University.

Wong, Y.-C. (1967)

"Differential Geometry of Grassmann Manifolds", Proceedings of the National Academy

of Sciences, vol. 57, p. 589.

VITA

Catherine Brid Hurley was born on January 7, 1962 in Cork, Ireland. She

attended Colaiste Muire, Cobh, Co. Cork and later University College Cork,

graduating in 1982 with a B.Sc. in Computer Science and Statistics. In 1984, she

obtained an M.S. in Statistics at the University of Washington.
