IEEE TRANSACTIONS ON COMPUTERS, VOL. C-23, NO. 9, SEPTEMBER 1974

A Projection Pursuit Algorithm for Exploratory Data Analysis

JEROME H. FRIEDMAN AND JOHN W. TUKEY

Manuscript received November 28, 1973; revised March 11, 1974. The work of J. H. Friedman was supported by the U. S. Atomic Energy Commission under Contract AT(04-3)515. The work of J. W. Tukey was prepared in part in connection with research at Princeton University, Princeton, N. J., supported by the U. S. Atomic Energy Commission. J. H. Friedman is with the Stanford Linear Accelerator Center, Stanford, Calif. 94305. J. W. Tukey is with the Department of Statistics, Princeton University, Princeton, N. J. 08540, and Bell Laboratories, Murray Hill, N. J. 07974.

Abstract-An algorithm for the analysis of multivariate data is presented and is discussed in terms of specific examples. The algorithm seeks to find one- and two-dimensional linear projections of multivariate data that are relatively highly revealing.

Index Terms-Clustering, dimensionality reduction, mappings, multidimensional scaling, multivariate data analysis, nonparametric pattern recognition, statistics.

INTRODUCTION

Mapping of multivariate data onto low-dimensional manifolds for visual inspection is a commonly used technique in data analysis. The discovery of mappings that reveal the salient features of the multidimensional point swarm is often far from trivial. Even when every adequate description of the data requires more variables than can be conveniently perceived (at one time) by humans, it is quite often still useful to map the data into a lower, humanly perceivable dimensionality where the human gift for pattern recognition can be applied.

While the particular dimension-reducing mapping used may sometimes be influenced by the nature of the problem at hand, it usually seems to be dictated by the intuition of the researcher. Potentially useful techniques can be divided into three classes.

1) Linear dimension reducers, which can usually be usefully thought of as projections.

2) Nonlinear dimension reducers that are defined over the whole high-dimensional space. (No examples seem as yet to have been seriously proposed for use in any generality.)

3) Nonlinear mappings that are only defined for the given points; most of these begin with the mutual interpoint distances as the basic ingredient. (Minimal spanning trees [2] and iterative algorithms for nonlinear mappings [3], [4] are examples. The literature of clustering techniques is extensive.¹)

¹ See [1] and [2] and their references for a reasonably extensive bibliography on clustering techniques.

While the nonlinear algorithms have the ability to provide a more faithful representation of the multidimensional point swarm than the linear methods, they can suffer from some serious shortcomings; namely, the resulting mapping is often difficult to interpret; it cannot be summarized by a few parameters; it only exists for the data set used in the analysis, so that additional data cannot be identically mapped; and, for moderate to large data bases, its use is extremely costly in computational resources (both CPU cycles and memory).

Our attention here is devoted to linear methods, more specifically to those expressible as projections (although the technique seems extendable to more general linear methods). Classical linear methods include principal components and linear factor analysis. Linear methods have the advantages of straightforward interpretability and computational economy. Linear mappings provide parameters which are independently useful in the understanding of the data, as well as being defined throughout the space, thus allowing the same mapping to be performed on additional data that were not part of the original analysis. The disadvantage of many classical linear methods is that the only property of the point swarm that is used to determine the mapping is a global one, usually the swarm's variance along various directions in the multidimensional space. Techniques that, like projection pursuit, combine global and local properties of multivariate point swarms to obtain useful linear mappings have been proposed by Kruskal [5], [6]. Since projection pursuit uses trimmed global measures, it has the additional advantage of robustness against outliers.

PROJECTION PURSUIT

This paper describes a linear mapping algorithm that uses interpoint distances as well as the variance of the point swarm to pursue optimum projections. This projection pursuit algorithm associates with each direction in the multidimensional space a continuous index that measures its "usefulness" as a projection axis, and then varies the projection direction so as to maximize this index. This projection index is sufficiently continuous to allow the use of sophisticated hill-climbing algorithms for the maximization, thus increasing computational efficiency. (In particular, both Rosenbrock [7] and Powell principal axis [8] methods have proved very successful.) For complex data structures, several solutions may exist, and for each of these, the projections can be visually inspected by the researcher for interpretation and judgment as to their usefulness. This multiplicity is often important.

Computationally, the projection pursuit (PP) algorithm is considerably more economical than the nonlinear mapping algorithms.


Its memory requirements are simply proportional to the number of data points N, while the number of CPU cycles required grows as N log N for increasing N. This allows the PP algorithm to be applied to much larger data bases than is possible with nonlinear mapping algorithms, where both the memory and CPU requirements tend to grow as N^2. Since the mappings are linear, they have the advantages of straightforward interpretability and convenient summarization. Also, once the parameters that define a solution projection are obtained, additional data that did not participate in the search can be mapped onto it. For example, one may apply projection pursuit to a data subsample of size N_s. This requires computation proportional to N_s log N_s. Once a subsample solution projection is found, the entire data set can be projected onto it (for inspection by the researcher) with CPU requirements simply proportional to N.

When combined with isolation [9], projection pursuit has been found to be an effective tool for cluster detection and separation. As projections are found that separate the data into two or more apparent clusters, the data points in each cluster can be isolated. The PP algorithm can then be applied to each cluster separately, finding new projections that may reveal further clustering within each isolated data set. These subclusters can each be isolated and the process repeated.

Because of its computational economy, projection pursuit can be repeated many times on the entire data base, or its isolated subsets, making it a feasible tool for exploratory data analysis. The algorithm has so far been implemented for projection onto one and two dimensions; however, there is no fundamental limitation on the dimensionality of the projection space.
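As an editorial illustration (not part of the original paper), the isolation-plus-pursuit loop described above might be organized as in the following Python sketch. The helpers projection_pursuit_1d (returning a unit solution direction, for example via one of the searches sketched in later sections) and split_clusters (isolating the apparent clusters seen along the projection) are hypothetical names, and X is assumed to be an N x n data matrix.

    import numpy as np

    def recursive_pursuit(X, depth=0, max_depth=5):
        """Apply 1-D projection pursuit, isolate apparent clusters, and recurse."""
        k_star = projection_pursuit_1d(X)        # hypothetical: returns a unit solution direction
        z = X @ k_star                           # project the data onto k*
        clusters = split_clusters(z)             # hypothetical: index sets of apparent clusters
        if len(clusters) <= 1 or depth >= max_depth:
            return [X]                           # no further apparent structure; stop here
        isolates = []
        for idx in clusters:                     # idx: integer indices of one isolated cluster
            isolates.extend(recursive_pursuit(X[idx], depth + 1, max_depth))
        return isolates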

THE PROJECTION INDEX

The choice of the intent of the projection index, and to a somewhat lesser degree the choice of its details, are crucial for the success of the algorithm. Our choice of intent was motivated by studying the interaction between human operators and the computer on the PRIM-9 interactive data display system [10]. This system provides the operator with the ability to rotate the data to any desired orientation while continuously viewing a two-dimensional projection of the multidimensional data. These rotations are performed in real time and in a continuous manner under operator control. This gives the operator the ability to perform manual projection pursuit. That is, by controlling the rotations and viewing the changing projections, the operator can try to discover those data orientations (or equivalently, projection directions) that reveal to him interesting structure. It was found that the strategy most frequently employed by researchers operating the system was to seek out projections that tended to produce many very small interpoint distances while, at the same time, maintaining the overall spread of the data. Such strategies will, for instance, tend to concentrate the points into clusters while, at the same time, separating the clusters.

The P-indexes we use to express quantitatively such properties of a projection axis k can be written as a product of two functions

    I(k) = s(k) d(k)    (1)

where s(k) measures the spread of the data, and d(k) describes the "local density" of the points after projection onto k. For s(k), we take the trimmed standard deviation of the data from the mean as projected onto k:

    s(k) = [ Σ_{i=pN}^{(1-p)N} (X_i·k - X̄_k)² / (1 - 2p)N ]^{1/2}    (2)

where

    X̄_k = Σ_{i=pN}^{(1-p)N} X_i·k / (1 - 2p)N.

Here N is the total number of data points, and X_i (i = 1, N) are the multivariate vectors representing each of the data points, ordered according to their projections X_i·k. A small fraction p of the points that lie at each of the extremes of the projection are omitted from both sums. Thus, extreme values of X_i·k do not contribute to s(k), which is thus robust against extreme outliers.

For d(k), we use an average nearness function of the form

    d(k) = Σ_{i=1}^{N} Σ_{j=1}^{N} f(r_ij) 1(R - r_ij)    (3)

where

    r_ij = |X_i·k - X_j·k|

and 1(η) is unity for positive-valued arguments and zero for negative values. (Thus, the double sum is confined to pairs with 0 < r_ij < R.) The function f(r) should be monotonically decreasing for increasing r in the range r < R, reducing to zero at r = R. This continuity assures maximum smoothness of the objective function I(k).

For moderate to large sample size N, the cutoff radius R is usually chosen so that the average number of points contained within the window, defined by the step function 1(R - r_ij), is not only a small fraction of N, but increases much more slowly than N, say as log N. After sorting the projected point values X_i·k, the number of operations required to evaluate d(k) [as well as s(k)] is thus about a fixed multiple of N log N (probably no more than this for large N). Since sorting requires a number of operations proportional to N log N, the same is true of the entire evaluation of d(k) and s(k) combined.

Projections onto two dimensions are characterized by two directions k and l (conveniently taken to be orthogonal with respect to the initially given coordinates and their scales). For this case, (2) generalizes to

    s(k,l) = s(k) s(l)    (2a)


and r_ij becomes

    r_ij = [ (X_i·k - X_j·k)² + (X_i·l - X_j·l)² ]^{1/2}    (3a)

in (3).

Repeated application of the algorithm has shown that it is insensitive to the explicit functional form of f(r) and shows major dependence only on its characteristic width

    r̄ = ∫_0^R r f(r) dr / ∫_0^R f(r) dr        (one dimension)    (4)

    r̄ = ∫_0^R r f(r) r dr / ∫_0^R f(r) r dr    (two dimensions).

It is this characteristic width that seems to define "local" in the search for maximum local density. Its value establishes the distance in the projected subspace over which the local density is averaged, and thus establishes the scale of density variation to which the algorithm is sensitive. Experimentation has also shown that when a preferred direction is available, the algorithm is remarkably stable against small to moderate changes in r̄, but it does respond to large changes in its value (say a few factors of two).

The projection index I(k) (or I(k,l) for two-dimensional projections) measures the degree to which the data points in the projection are both concentrated locally (d(k) large) while, at the same time, expanded globally (s(k) large). Experience has shown that projections that have this property to a large degree tend to be those that are most interesting to researchers. Thus, it seems natural to pursue those projections that maximize this index.
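As an illustration added by the editor, the index of (1)-(3) can be computed directly as in the sketch below. This is a simple O(N²) version for clarity rather than the authors' N log N implementation; the function name and the choice f(r) = R - r (used later in the examples) are ours, and X is an N x n data matrix with k a unit direction.

    import numpy as np

    def p_index_1d(X, k, R, p=0.01):
        """I(k) = s(k) * d(k): trimmed spread (2) times local density (3)."""
        z = np.sort(X @ k)                          # projected values X_i . k, ordered
        N = len(z)
        lo, hi = int(p * N), int((1 - p) * N)       # omit a fraction p at each extreme
        zt = z[lo:hi]
        s = np.sqrt(np.sum((zt - zt.mean()) ** 2) / ((1 - 2 * p) * N))   # eq. (2)
        r = np.abs(z[:, None] - z[None, :])         # all pairwise projected distances r_ij
        w = np.where(r < R, R - r, 0.0)             # f(r) = R - r inside the window r < R
        np.fill_diagonal(w, 0.0)                    # keep only pairs with 0 < r_ij < R
        d = w.sum()                                 # eq. (3)
        return s * d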

ONE-DIMENSIONAL PROJECTION PURSUIT

The projection index for projection onto a one-dimensional line imbedded in an n-dimensional space is a function of n - 1 independent variables that define the direction of the line, conveniently, its direction cosines. These cosines are the n components of a vector parallel to the line, subject to the constraint that the squares of these components sum to unity. Thus, we seek the maxima of the P-index I(k) on the (n - 1)-dimensional surface of a sphere of unit radius S^n(1) in an n-dimensional Euclidean space.

One technique for accomplishing this is to apply a solid angle transform (SAT) [11], which reversibly maps such a sphere to an (n - 1)-dimensional infinite Euclidean space E^{n-1}(-∞, ∞) (see the Appendix). This reduces the problem from finding the maxima of I(k) on the unit sphere in n dimensions to finding the equivalent maxima of I[SAT(k)] in E^{n-1}(-∞, ∞). This replacement of the constrained optimization problem with a totally unconstrained one greatly increases the stability of the algorithm and simplifies its implementation. The variables of the search are the n - 1 SAT parameters, and for any such set of parameters, there exists a unique point on the n-dimensional sphere defined by the n components of k.

The computational resources required by projection pursuit are greatly affected by the algorithm used in the search for the maxima of the projection index. Since the number of CPU cycles required to evaluate the P-index for N data points grows as N log N, it is important for moderate to large data sets to employ a search algorithm that requires as few evaluations of the objective function as possible. It is usually the case that the more sophisticated the search algorithm, the smoother the objective function is required to be for stability. The P-index I(k), as defined above, is remarkably smooth, and both Rosenbrock [12] and Powell principal axis [13] search algorithms have been successfully applied without encountering any instability problems. For these algorithms, the number of objective function evaluations per varied parameter required to find a solution projection has been found to vary considerably from instance to instance and to be strongly influenced by the convergence criterion established by the user. Applying a rather demanding convergence criterion, approximately 15-25 evaluations per varied parameter were required to achieve a solution. (Stopping when the P-index changes by only a few percent seems reasonable. A convergence criterion of one percent was used in all of our applications.)

In order to be useful as a tool for exploratory data analysis on data sets with complex structure, it is important that the algorithm find several solutions that represent potentially informative projections for inspection by the researcher. This can be accomplished by applying the algorithm many times with different starting directions for the search. Useful starting directions include the larger principal axes of the data set, the original coordinate axes, and even directions chosen at random. From each starting direction k_s, the algorithm finds the solution projection axis k_s* corresponding to the first maximum of the P-index uphill from the starting point. From these searches, several quite distinct solutions often result. Each of these projections can then be examined to determine their usefulness in data interpretation.
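A sketch (again added editorially, not the authors' code) of this multiple-start search follows. Instead of the solid angle transform and the Rosenbrock or Powell principal axis routines used in the paper, it simply normalizes an unconstrained vector inside the objective and uses SciPy's derivative-free Powell method as a stand-in; p_index_1d is the function sketched above.

    import numpy as np
    from scipy.optimize import minimize

    def pursue_from_starts(X, R, starts, p=0.01):
        """Maximize I(k) from several starting directions; return (I, k*) pairs, best first."""
        def neg_index(v):
            k = v / np.linalg.norm(v)               # normalize instead of using the SAT
            return -p_index_1d(X, k, R, p)
        solutions = []
        for k0 in starts:
            res = minimize(neg_index, k0, method="Powell",
                           options={"xtol": 1e-2, "ftol": 1e-2})   # roughly a one-percent criterion
            k_star = res.x / np.linalg.norm(res.x)
            solutions.append((-res.fun, k_star))
        return sorted(solutions, key=lambda t: -t[0])

    # Possible starting directions: the principal axes (rows of Vt) and the coordinate axes.
    # _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    # solutions = pursue_from_starts(X, R, np.vstack([Vt, np.eye(X.shape[1])]))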

In order to encourage the algorithm to find additional distinct solutions, it is useful to be able to reduce the dimensionality of the sphere to be searched. This can be done by choosing an arbitrary set of directions {φ_i}, i = 1, ..., m, with m < n, which need not be mutually orthogonal, and applying the constraints

    k*·φ_i = 0,    i = 1, ..., m    (5)

on the solution direction k*. Possible choices for constraint directions might be solution directions found on previous searches, or directions that are known in advance to contain considerable, but well understood, structures. Also, when the choice of scales for the several coordinates is guided by considerations outside the data, one might wish to remove directions with small variance about the mean, since these directions often provide little information about the structure of the data. The introduction of each such constraint direction reduces by one the number of search variables, and thus increases the computational efficiency of the algorithm.


The algorithm can allow for the introduction of an arbitrary number m < n of nonparallel constraint directions. This is accomplished by using Householder reductions [14] to form an orthogonal basis for the (n - m)-dimensional orthogonal subspace of the m-dimensional subspace spanned by the m constraint vectors {φ_i}, i = 1, ..., m. The n - m - 1 search variables are then the solid angle transform parameters of the unit sphere in this (n - m)-dimensional space. The transformation from the original n-dimensional data space proceeds in two steps: first, a linear dimension-reducing transformation to the (n - m)-dimensional complement subspace, and then the nonlinear SAT that maps the sphere S^{n-m}(1) to E^{n-m-1}(-∞, ∞).
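The constraint mechanism can be mimicked as follows (editor's sketch; names are ours). A QR factorization, which LAPACK implements with Householder reflections, supplies an orthonormal basis for the subspace orthogonal to the constraint directions, and the search is carried out in that complement using the pursue_from_starts sketch above.

    import numpy as np

    def complement_basis(phis, n):
        """Orthonormal basis (n x (n-m)) of the subspace orthogonal to the constraint directions."""
        m = len(phis)
        Q, _ = np.linalg.qr(np.asarray(phis, dtype=float).T, mode="complete")  # full n x n orthogonal Q
        return Q[:, m:]                            # last n-m columns span the orthogonal complement

    def constrained_pursuit(X, R, phis, p=0.01):
        """Find k* with k* . phi_i = 0 by searching only in the complement subspace."""
        B = complement_basis(phis, X.shape[1])     # (n, n-m)
        Xc = X @ B                                 # data expressed in complement coordinates
        best_I, u_star = pursue_from_starts(Xc, R, np.eye(B.shape[1]), p)[0]
        return best_I, B @ u_star                  # map the solution direction back to the full space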

TWO-DIMENSIONAL PROJECTION PURSUIT

The projection index I(k,l) for a two-dimensional plane imbedded in an n-dimensional space is defined by (2a), (3), and (3a). This index is a function of the parameters that define such a plane. Proceeding in analogy with one-dimensional projection pursuit, one could seek the maximum of I(k,l) with respect to these parameters. The data projected onto the plane represented by the solution vectors k* and l* can then be inspected by the researcher.

Another useful strategy is to hold one of the directions (for example, k) constant along some interesting direction, and then seek the maximum of I(k,l) with respect to l in E^{n-1}(k), the (n - 1)-dimensional subspace orthogonal to k. This reduces the number of search parameters to n - 2. The choice of the constant direction k could be motivated by the problem at hand (like one of the original or principal axes), or it could be a solution direction found in a one-dimensional projection pursuit.

A third, intermediate strategy would be to first fix k and seek the maximum of I(k,l) in E^{n-1}(k), as described above. Then, holding l fixed at the solution value l*, vary k in E^{n-1}(l*), seeking a further maximum of I(k,l*). This process of alternately fixing one direction and varying the other in the orthogonal subspace of the first can be repeated until the solution becomes stable. The final directions k* and l* are then regarded as defining the solution plane. This third strategy, while not as completely general as the first, is computationally much more efficient. This is due to the economies that can be achieved in computing I(k,l), knowing that one of the directions is constant and that k·l = 0. (Using similar criteria for choosing the cutoff radius as that used for one-dimensional projection pursuit, and sorting the projected values along the constant direction, allows I(k,l) to be evaluated with a number of operations proportional to N log N.)

As for the one-dimensional case, the two-dimensional P-index I(k,l) is sufficiently smooth to allow the use of sophisticated optimization algorithms. Also, constraint directions can be introduced in the same manner as described above for one-dimensional projection pursuit.
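The third, alternating strategy might be organized as below (editor's sketch). p_index_2d is a hypothetical evaluator of I(k,l) built from (2a), (3), and (3a) in analogy with the one-dimensional sketch, and complement_basis is the helper from the previous sketch.

    import numpy as np
    from scipy.optimize import minimize

    def alternating_pursuit_2d(X, R, k0, l0, sweeps=5, p=0.01):
        """Alternately fix one direction and maximize I(k,l) over the other in its orthogonal complement."""
        k = k0 / np.linalg.norm(k0)
        l = l0 / np.linalg.norm(l0)
        for _ in range(sweeps):
            for fixed in ("k", "l"):
                a = k if fixed == "k" else l                   # direction held constant
                B = complement_basis([a], X.shape[1])          # search in E^{n-1}(a)
                def neg_index(u):
                    b = B @ (u / np.linalg.norm(u))            # varied direction, orthogonal to a
                    return -(p_index_2d(X, a, b, R, p) if fixed == "k"
                             else p_index_2d(X, b, a, R, p))
                u0 = B.T @ (l if fixed == "k" else k)          # start from the current varied direction
                if np.linalg.norm(u0) < 1e-8:
                    u0 = np.ones(B.shape[1])
                res = minimize(neg_index, u0, method="Powell")
                b = B @ (res.x / np.linalg.norm(res.x))
                if fixed == "k":
                    l = b
                else:
                    k = b
        return k, l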

TABLE I
SPHERICALLY RANDOM DATA
14 Dimensions, 975 Data Points

Search No.   P-Index Starting   P-Index Solution     Search No.   P-Index Starting   P-Index Solution
     1            151.8              156.3               15            150.6              155.8
     2            150.7              156.7               16            150.6              156.9
     3            149.4              156.0               17            151.0              157.0
     4            152.4              159.7               18            152.2              155.6
     5            152.3              159.2               19            152.1              160.2
     6            151.0              154.4               20            152.9              159.0
     7            152.6              155.6               21            150.0              160.8
     8            149.6              155.4               22            150.0              157.9
     9            150.0              156.1               23            150.7              155.4
    10            151.8              153.0               24            151.5              158.4
    11            154.0              154.6               25            151.0              155.3
    12            151.4              156.7               26            152.8              159.6
    13            149.7              154.1               27            151.8              157.0
    14            152.1              154.2               28            152.1              157.9

SOME EXPERIMENTAL RESULTS

To illustrate the application of the algorithm, we describe its effect upon several data sets. The first two are artificially generated so that the results can be compared with the known data structure. The third is the well-known Iris data used by Fisher [15], and the fourth is a data set taken from a particle physics experiment. For these examples, f(r) = R - r for one-dimensional projection pursuit (3), while for two-dimensional projection pursuit (3a), f(r) = R - r². In both cases R was set to ten percent of the square root of the data variance along the largest principal axis, and the trimming (2) was p = 0.01.
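For reference, the settings just described could be computed as in this short sketch (editorial; names are ours). It covers the one-dimensional case, with R set to ten percent of the square root of the variance along the largest principal axis and a trimming fraction p = 0.01.

    import numpy as np

    def example_settings(X, p=0.01):
        """Cutoff radius R, trimming p, and f(r) = R - r as used in the one-dimensional examples."""
        Xc = X - X.mean(axis=0)
        largest_var = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[-1]  # variance along largest principal axis
        R = 0.10 * np.sqrt(largest_var)
        f = lambda r: np.maximum(R - r, 0.0)                            # window function for d(k)
        return R, p, f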

A. Uniformly Distributed Random Data

To test the effect of projection pursuit on artificial data having no preferred projection axes, we generated 975 data points randomly from a uniform distribution inside a 14-dimensional sphere, and repeatedly applied one- and two-dimensional projection pursuit to the sample with different starting directions. Table I shows the results of 28 such trials with one-dimensional projection pursuit where the starting directions were the 14 original axes and the 14 principal axes of the data set. The results of the two-dimensional projection pursuit trials were very similar.

The results shown in Table I strongly reflect the uniform nature of the 14-dimensional data set. The standard deviation of the index values for the starting directions is less than one percent, while the increase achieved at the solutions averages four percent. Also, only two searches (runs 13 and 14) appeared to converge to the same solution. The angle between the two directions corresponding to the largest P-indices found (runs 19 and 21) was 67 degrees. The small increase in the P-index from the starting to the solution directions indicates that the algorithm considers these solution directions at most only slightly better projection axes than the starting directions. Visual inspection of the data projections verifies this assessment.

B. Gaussian Data Distributed at the Vertices of a Simplex

The previous example shows that projection pursuit finds no seriously preferred projection axes when applied to spherically uniform random data. Another interesting experiment is to test its effect on an artificial data set with considerable multidimensional structure. Following Sammon [4], we applied one-dimensional projection pursuit to a data set consisting of 15 spherical Gaussian clusters of 65 points, each centered at the vertices of a 14-dimensional simplex. The variance of each cluster is one, while the distance between centers is ten. Thus, the clusters are well separated in the 14-dimensional space.

Fig. 1(a) shows these data projected onto the direction of their largest principal axis. (For this sample, the largest standard deviation was about 1.15 times the smallest.) As can be seen, this projection shows no hint of the multidimensional structure of the data. Inspection of the one- and two-dimensional projections onto the other principal axes shows the same result.

Using the largest principal axis [Fig. 1(a)] as the starting direction, the one-dimensional PP algorithm yielded the solution shown in Fig. 1(b). The threefold increase in the P-index at the solution indicates that the algorithm considers it a much better projection axis than the starting direction. This is verified by visual inspection of the data as projected onto the solution axis, where the data set is seen to break up into two well-separated clusters of 65 and 910 points.

In order to investigate possible additional structure, we isolated each of the clusters and applied projection pursuit to each one individually. The results are shown in Fig. 1(c) and (d). The solution projection for the 65-point isolate showed no evidence for additional clustering, while the 910-point sample clearly separated into two subclusters of 130 and 780 points. We further isolated these two subclusters and applied projection pursuit to each one individually. The results are illustrated in Fig. 1(e) and (f). The solution for the 130-point subcluster shows it divided into two clusters of 65 points each, while the 780-point cluster separates into a 65-point cluster and a 715-point cluster. Continuing with these repeated applications of isolation and projection pursuit, one finds, after using a sequence of linear projections, that the data set is composed of 15 clusters of 65 points each.

Two-dimensional projection pursuit could equally well be applied at each stage in the above analysis. This has the advantage that the solution at a given stage sometimes separates the data set into three apparent clusters. The disadvantage is the increased computational requirements of the two-dimensional projection pursuit algorithm.

C. Iris Data

This is a classical data set first used by Fisher [15] and subsequently by many other researchers for testing statistical procedures.

[Fig. 1. One-dimensional projections of the simplex data (14 dimensions). (a) Full 975-point sample projected onto its largest principal axis, P-Index = 1.00 x 10^5. (b) Full 975-point sample projected onto the solution axis, P-Index = 3.37 x 10^5. (c) Solution projection for the 65-point isolate. (d) Solution projection for the 910-point isolate. (e) Solution projection for the 130-point subcluster. (f) Solution projection for the 780-point subcluster.]


[Fig. 2. Iris data. (a) Two-dimensional solution projection of all 150 points; x = (-.16, .23, -.54, -.79), y = (.00, .04, .83, -.56). (b) Projection of the 100-point isolate onto its two largest principal axes; x = (-.76, -.02, .32, .57), y = (-.56, -.19, -.74, -.32); P-Index = 1.1 x 10^4. (c) Solution projection of the 100-point isolate; x = (-.41, -.42, -.75, -.31), y = (-.16, -.21, -.18, .95); P-Index = 2.2 x 10^4. (d) Same projection as (c), with the species identified and level lines of the Fisher discriminant shown. (e) Linear discriminant projection of the 100-point isolate; x = (-.80, -.05, -.09, -.60), y = (.60, .06, -.09, -.80).]

The data consist of measurements made on 50 observations from each of three species (one quite different from the other two) of iris flowers. Four measurements were made on each flower, and there were 150 flowers in the entire data set. Taking the largest principal axes as starting directions, we applied projection pursuit to the entire four-dimensional data set. The result for two-dimensional projection pursuit is shown in Fig. 2(a). As can be seen, the data as projected on the solution plane show clear separation into two well-defined clusters of 50 (one species) and 100 (two unresolved species) points. The one-dimensional algorithm also clearly separates the data into these two clusters.


However, this two-cluster separation is easy to achieve and is readily apparent from simple inspection of the original data.

Applying the procedure discussed above, we isolate the 100-point cluster (largest standard deviation was about six times the smallest) and reapply projection pursuit, starting with the largest principal axes of the isolate. Fig. 2(b) shows the data projected onto the plane defined by the two largest principal axes. Here the data seem to show no apparent clustering.

One-dimensional projection pursuit, starting with the largest principal axis, was unable to separate this isolate into discernible clusters. Fig. 2(c) shows the results of two-dimensional projection pursuit starting with the plane of Fig. 2(b). This solution plane seems to divide the projected data into two discernible clusters. One in the lower right-hand quadrant with higher than average density seems slightly separated from another, which is somewhat sparser and occupies most of the rest of the projection. In order to see to what extent this apparent clustering corresponds to the different iris species known to be contained in this isolate, Fig. 2(d) tags these species. As can be seen, the two clusters very closely correspond to the two iris species. Also shown in Fig. 2(d) are some level lines of (the projection onto the same plane of) Fisher's linear discriminant function [15] f for this isolate, calculated by using the known identities of the two species. In this example, the angle between the direction upon which this linear discriminant function is a projection and this plane is a little more than 45°.

The projection pursuit solution can be compared to a two-dimensional projection of this isolate that is chosen to provide maximum separation of the two species, given the a priori information as to which species each data point represents. Fig. 2(e) shows the isolate projected onto such a plane whose horizontal coordinate is the value of Fisher's linear discriminant for the isolate in the full four-dimensional space f, while the vertical axis is the value of a similar Fisher linear discriminant in E^3(f), the three-dimensional space orthogonal to f [16]. (If the "within variance" were spherical in the initially given coordinate system, this vertical coordinate would not be well defined, since the centers of the two species groups would coincide in E^3(f). While we may feel that the vertical coordinate adds little to the horizontal one, linear discrimination seems to offer no better choice of a second coordinate, especially since we would like this view also to be a projection of the original data, a projection in terms of the original coordinates and scales, as all two-dimensional projection pursuit views are required to be.) A comparison of Fig. 2(d) and (e) shows that the unsupervised projection pursuit solution achieves separation of the two species equivalent to this discriminant plane. Since these two species are known to touch in the full four-dimensional space [2], [4], it is probably not possible to find a projection that completely separates them.

D. Particle Physics Data

For the final example, we apply projection pursuit to a data set taken from a high-energy particle-physics scattering experiment [17]. In this experiment, a beam of positively charged pi-mesons, with an energy of 16 billion eV, was used to bombard a stationary target of protons contained in hydrogen nuclei. Five hundred examples were recorded of those nuclear reactions in which the final products were a proton, two positively charged pi-mesons, and a negatively charged pi-meson. Such a nuclear reaction with four reaction products can be completely described by seven independent measurables.² These data can thus be regarded as 500 points in a seven-dimensional space.

The data projected onto its largest principal axis are shown in Fig. 3(a), while the projection onto the plane defined by the largest two principal axes is shown in Fig. 3(c). (The largest standard deviation was about eight times the smallest.) One-dimensional projection pursuit was applied, starting with the largest principal axis. Fig. 3(b) shows the data projected onto the solution direction. The result of a two-dimensional projection pursuit starting with the plane of Fig. 3(c) is shown in Fig. 3(d).

Although the principal axis projections indicate possible structure within the data set, the projection pursuit solutions are clearly more revealing. This is indicated by the substantial increase in the P-index, and is verified by visual inspection. In particular, the two-dimensional solution projection shows that there are at least three clusters, possibly connected, one of which reasonably separates from the other two. Proceeding as above, one could isolate this cluster from the others and apply projection pursuit to the two samples separately, continuing the analysis.

DISCUSSION

The experimental results of the previous section indicate that the PP algorithm behaves reasonably. That is, it tends to find structure when it exists in the multidimensional data, and it does not find structure when it is known not to exist. When combined with isolation, projection pursuit seems to be an effective tool for the detection of certain types of clustering.

Because projection pursuit is a linear mapping algorithm, it suffers from some of the well-known limitations of linear mapping. The algorithm will have difficulty in detecting clustering about highly curved surfaces in the full dimensionality. In particular, it cannot detect nested spherical clustering. It can, however, detect nested cylindrical clustering where the cylinders have parallel generators.

² For this reaction, π_b⁺ p_t → p π₁⁺ π₂⁺ π⁻, the following measurables were used: X1 = M²(π₁⁺, π₂⁺), X2 = M²(π⁻, π₁⁺), X3 = M²(p, π⁻), X4 = M²(π⁻, π₂⁺), X5 = M²(p, π₁⁺), X6 = M²(p, π⁺, -p_s), and X7 = M²(p, π₂⁺, -p_t). Here, M²(A, B, ±C) = (E_A + E_B ± E_C)² - (P_A + P_B ± P_C)² and M²(A, ±B) = (E_A ± E_B)² - (P_A ± P_B)², where E and P represent the particle's energy and momentum, respectively, as measured in billions of electron volts. The notation (P)² represents the inner product P·P. The ordinal assignment of the two π⁺'s was done randomly.
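To make the footnote concrete, the squared invariant masses it defines can be computed from laboratory 4-momenta as in the following editorial sketch; the minus-sign variants M²(A, ±B) simply subtract the corresponding 4-vector instead of adding it. The function name and variable layout are ours.

    import numpy as np

    def m2(*four_vectors):
        """Squared invariant mass (sum E)^2 - |sum P|^2 of particles given as (E, px, py, pz) in GeV."""
        total = np.sum(np.asarray(four_vectors, dtype=float), axis=0)
        return total[0] ** 2 - np.dot(total[1:], total[1:])

    # Example: X3 = M^2(p, pi-) from the proton and pi-minus 4-momenta.
    # proton = np.array([E_p, px_p, py_p, pz_p]); pim = np.array([E_m, px_m, py_m, pz_m])
    # X3 = m2(proton, pim)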


[Fig. 3. Particle physics data, π⁺p → π⁺pπ⁺π⁻ at 16 GeV/c, 500 points in seven dimensions. (a) Projection onto the largest principal axis. (b) One-dimensional solution projection. (c) Projection onto the plane of the two largest principal axes. (d) Two-dimensional solution projection.]


Projection pursuit leaves to the researcher's discretion the choice of measurement variables and metric. The algorithm is, of course, sensitive to changes of relative scale of the input measurement variables, as well as to highly nonlinear transformations of them. If there is no a priori motivation for a choice of scale for the measurement variables, then they can be independently scaled (standardized) so as to all have the same variance. In the spirit of exploratory data analysis, the researcher might employ projection pursuit to several carefully selected nonlinear transformations of his measurement variables. For example, transformations to various spherical polar coordinate representations [11] would enable projection pursuit to detect nested spherical clustering.

Frequently with multidimensional data, only a few of the measurement variables contribute to the structure or clustering. The clusters may overlap in many of the dimensions and separate in only a few. As pointed out by both Sammon [4] and Kruskal [6], those variables that are irrelevant to the structure or clustering can dilute the effect of those that display it, especially for those mapping algorithms that depend solely on the multidimensional interpoint distances. It is easy to see that the projection pursuit algorithm does not suffer seriously from this effect. Projection pursuit will automatically tend to avoid projections involving those measurement variables that do not contribute to data structure, since the inclusion of these variables will tend to reduce d(k) while not modifying s(k) greatly.

In order to apply the PP algorithm, the researcher is not required to possess a great deal of a priori knowledge concerning the structure of his data, either for setting up the control parameters for the algorithm or for interpreting its results. The only control parameter requiring care is the characteristic radius r̄ defined in (4). Its value establishes the minimum scale of density variation detectable by the algorithm. A choice for its value can be influenced by the global scale of the data, as well as any information that may be known about the nature of the variations in the multivariate density of the points. The sample size is also an important consideration since the radius should be large enough to include, on the average, enough points (in each projection) to obtain a reasonable estimate of the local density. These considerations usually result in a compromise, making r̄ as small as possible, consistent with the sample size requirement. Because of the computational efficiency of the algorithm, however, it is possible to apply it several times with different values for r̄. Interpretation of the results of projection pursuit is especially straightforward, owing to the linear nature of the mapping.

The researcher also has the choice of the dimensionality of the projection subspace, that is, whether to employ one-, two-, or higher dimensional projection pursuit.


The two-dimensional projection pursuit algorithm is slower and slightly less stable than the one-dimensional algorithm; however, the resulting two-dimensional map contains much more information about the data. Experience has shown that a useful strategy is to first find several one-dimensional PP solutions, then use each of these directions as one of the starting axes for two-dimensional projection pursuits.
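That strategy could be scripted as follows (editor's sketch, reusing the hypothetical pursue_from_starts and alternating_pursuit_2d helpers from the earlier sketches).

    import numpy as np

    def seeded_2d_pursuit(X, R, starts, p=0.01, top=3):
        """Use the best one-dimensional solutions as starting axes for two-dimensional pursuit."""
        rng = np.random.default_rng(0)
        planes = []
        for _, k_star in pursue_from_starts(X, R, starts, p)[:top]:
            l0 = rng.standard_normal(X.shape[1])          # arbitrary second starting axis
            l0 -= (l0 @ k_star) * k_star                   # make it orthogonal to k*
            planes.append(alternating_pursuit_2d(X, R, k_star, l0, p=p))
        return planes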

APPENDIX

This section presents the solid angle transform (SAT) that reversibly maps the surface of a unit sphere in an n-dimensional space S^n(1) to an (n - 1)-dimensional infinite Euclidean space E^{n-1}(-∞, ∞). This transformation is derived in [11] and only the results are presented here.

Let (X_1, X_2, ..., X_n) be the coordinates of a point lying on the surface of an n-dimensional unit sphere and (η_1, η_2, ..., η_{n-1}) be the corresponding point in an (n - 1)-dimensional unit hypercube E^{n-1}(0,1). Then for n even, the transformation is

    X_{2i} = [ Π_{j=1}^{i-1} η_{2j}^{1/(n-2j)} ] cos[ sin^{-1} η_{2i}^{1/(n-2i)} ] sin(2π η_{2i-1}),    1 ≤ i ≤ n/2 - 1

    X_n = [ Π_{j=1}^{n/2-1} η_{2j}^{1/(n-2j)} ] sin(2π η_{n-1})

    X_{2i-1} = X_{2i} cot(2π η_{2i-1}),    1 ≤ i ≤ n/2.

For n odd, we take

    X_n = [ Π_{j=1}^{(n-3)/2} η_{2j}^{1/(n-2j)} ] (2η_{n-1} - 1)

    X_{n-1} = 2 [ Π_{j=1}^{(n-3)/2} η_{2j}^{1/(n-2j)} ] (η_{n-1} - η_{n-1}²)^{1/2} sin(2π η_{n-2})

    X_{2i} = [ Π_{j=1}^{i-1} η_{2j}^{1/(n-2j)} ] cos[ sin^{-1} η_{2i}^{1/(n-2i)} ] sin(2π η_{2i-1}),    1 ≤ i ≤ (n - 3)/2

    X_{2i-1} = X_{2i} cot(2π η_{2i-1}),    1 ≤ i ≤ (n - 1)/2.

The Jacobian of this transformation,

    J_n = 2π^{n/2} / Γ(n/2),

is a constant, namely, the well-known expression for the surface area of an n-dimensional sphere of unit radius. Adjusted by a factor of the (n - 1)st root of J_n, the transformation is volume preserving, one to one, and onto. The inverse transformation can easily be obtained by solving the above equations for the η's in terms of the X coordinates. The unit hypercube E^{n-1}(0,1) can be expanded to the infinite Euclidean space E^{n-1}(-∞, ∞) by using standard techniques [18], specifically, multiple reflection.

ACKNOWLEDGMENT

Helpful discussions with W. H. Rogers and G. H. Golub are gratefully acknowledged.

REFERENCES

[1] L. N. Bolshev, Bull. Int. Statist. Inst., vol. 43, pp. 411-425, 1969.
[2] C. T. Zahn, "Graph-theoretical methods for detecting and describing gestalt clusters," IEEE Trans. Comput., vol. C-20, pp. 68-86, Jan. 1971.
[3] R. N. Shepard and J. D. Carroll, "Parametric representation of nonlinear data structures," in Multivariate Analysis, P. Krishnaiah, Ed. New York: Academic, 1966.
[4] J. W. Sammon, Jr., "A nonlinear mapping for data structure analysis," IEEE Trans. Comput., vol. C-18, pp. 401-409, May 1969.
[5] J. B. Kruskal, "Toward a practical method which helps uncover the structure of a set of multivariate observations by finding the linear transformation which optimizes a new 'index of condensation'," in Statistical Computation, R. C. Milton and J. A. Nelder, Eds. New York: Academic, 1969.
[6] J. B. Kruskal, "Linear transformation of multivariate data to reveal clustering," in Multidimensional Scaling: Theory and Application in the Behavioral Sciences, vol. 1, Theory. New York and London: Seminar Press, 1972.
[7] H. H. Rosenbrock, "An automatic method for finding the greatest or least value of a function," Comput. J., vol. 3, pp. 175-184, 1960.
[8] M. J. D. Powell, "An efficient method for finding the minimum of a function of several variables without calculating derivatives," Comput. J., vol. 7, pp. 155-162, 1964.
[9] R. L. Mattson and J. E. Dammann, "A technique for determining and coding subclasses in pattern recognition problems," IBM J., vol. 9, pp. 294-302, July 1965.
[10] "PRIM-9," film produced by Stanford Linear Accelerator Cen., Stanford, Calif., Bin 88 Productions, Apr. 1973.
[11] J. H. Friedman and S. Steppel, "Non-linear constraint elimination in high dimensionality through reversible transformations," Stanford Linear Accelerator Cen., Stanford, Calif., Rep. SLAC PUB-1292, Aug. 1973.
[12] S. Derenzo, "MINF-68, a general minimizing routine," Lawrence Berkeley Lab., Berkeley, Calif., Group A Tech. Rep. P-190, 1969.
[13] R. P. Brent, Algorithms for Minimization Without Derivatives. Englewood Cliffs, N. J.: Prentice-Hall, 1973.
[14] G. H. Golub, "Numerical methods for solving linear least squares problems," Numer. Math., vol. 7, pp. 206-216, 1965.
[15] R. A. Fisher, "Multiple measurements in taxonomic problems," in Contributions to Mathematical Statistics. New York: Wiley.
[16] J. W. Sammon, Jr., "An optimal discriminant plane," IEEE Trans. Comput., vol. C-19, pp. 826-829, Sept. 1970.
[17] J. Ballam, G. B. Chadwick, Z. C. G. Guiragossian, W. B. Johnson, D. W. G. S. Leith, and K. Moriyasu, "Van Hove analysis of the reactions π⁻p → π⁻π⁻π⁺p and π⁺p → π⁺π⁺π⁻p at 16 GeV/c," Phys. Rev., vol. 4, pp. 1946-1947, Oct. 1971.
[18] M. J. Box, Comput. J., vol. 9, pp. 67-77, 1966.


Jerome H. Friedman was born in Yreka, Calif., on December 29, 1939. He received the A.B. and Ph.D. degrees in physics from the University of California, Berkeley, in 1962 and 1968, respectively.

From 1968 through 1971 he was a Research Associate in High Energy Particle Physics at the Lawrence Berkeley Laboratory, Berkeley, Calif. He is presently Head of the Mathematics and Computation Group at the Stanford Linear Accelerator Center, Stanford, Calif. His primary interests are in computational techniques in multivariate data analysis and pattern recognition.


John W. Tukey was born in Bedford, Mass., on June 16, 1915. He received the Sc.B. and Sc.M. degrees in chemistry from Brown University, Providence, R. I., in 1936 and 1937, respectively, and the M.S. and Ph.D. degrees from Princeton University, Princeton, N. J., in 1938 and 1939, respectively.

He joined the Technical Staff of Bell Laboratories in 1945. He is the Associate Executive Director of the Research-Communications Principles Division of Bell Labs., Murray Hill, N. J. His work has covered the development of a wide variety of new statistical techniques, broad analysis and synthesis problems related to complex weapons systems, and other problems with mathematical or statistical aspects. He has been associated with the Department of Mathematics, Princeton University, Princeton, N. J., since 1939, and served as Professor of Mathematics from 1959 to 1966, when he became Professor of Statistics. He was the first Chairman of the Department of Statistics from 1966 to 1970. He is the author of more than 125 technical articles on mathematics, statistics, and other scientific subjects, and is the author of Exploratory Data Analysis (now in a preliminary edition) and other books.

Dr. Tukey was awarded honorary Doctor of Science degrees by Case Institute of Technology, Cleveland, Ohio, in 1962 and Brown University in 1965. He was a Jacobus Fellow at Princeton in 1938-1939, a Fellow of the John Simon Guggenheim Memorial Foundation in 1949-1950, and a Fellow at the Center for Advanced Study in the Behavioral Sciences in 1957-1958. He served as a member of the President's Science Advisory Committee from 1960 to 1963, and has served on a number of other committees advisory to the government, including the Science Information Council from 1962 to 1964. He has served as President of the Institute of Mathematical Statistics, and as Vice President of the American Statistical Association. He was a member of the United States delegation to Technical Working Group 2 of the Conference on the Discontinuance of Nuclear Weapon Tests. He is a member of the American Philosophical Society, the American Academy of Arts and Sciences, the American Society for Quality Control, the Biometric Society, the American Mathematical Society, the Mathematical Association of America, the New York Academy of Sciences, the Econometric Society, the Association for Computing Machinery, the Royal Statistical Society, the Operations Research Society of America, the Society for Industrial Mathematics, the American Association for the Advancement of Science, and the International Statistical Institute.
