
Towards Meaningful High-Dimensional Nearest Neighbor Search by Human-Computer Interaction

Charu C. Aggarwal
IBM T. J. Watson Research Center

Yorktown Heights, NY 10598
[email protected]

Abstract

Nearest neighbor search is an important and widely used problem in a number of important application domains. In many of these domains, the dimensionality of the data representation is often very high. Recent theoretical results have shown that the concept of proximity or nearest neighbors may not be very meaningful for the high dimensional case. Therefore, it is often a complex problem to find good quality nearest neighbors in such data sets. Furthermore, it is also difficult to judge the value and relevance of the returned results. In fact, it is hard for any fully automated system to satisfy a user about the quality of the nearest neighbors found unless he is directly involved in the process. This is especially the case for high dimensional data, in which the meaningfulness of the nearest neighbors found is questionable. In this paper, we address the complex problem of high dimensional nearest neighbor search from the user perspective by designing a system which uses effective cooperation between the human and the computer. The system provides the user with visual representations of carefully chosen subspaces of the data in order to repeatedly elicit his preferences about the data patterns which are most closely related to the query point. These preferences are used in order to determine and quantify the meaningfulness of the nearest neighbors. Our system is not only able to find and quantify the meaningfulness of the nearest neighbors, but is also able to diagnose situations in which the nearest neighbors found are truly not meaningful.

1. Introduction

The nearest neighbor search problem is defined as follows: for a given query point Q, find the data points which are closest to it based on a pre-defined distance function. Examples of application domains in which this problem arises are similarity search in geometric databases, multimedia databases, and data mining applications such as fraud detection and information retrieval. Typical domains such as data mining contain applications in which the dimensionality of the representation is very high. Consequently, a wide variety of access methods and data structures have been proposed for high dimensional nearest neighbor search [9, 11, 18, 21, 27].

It has been questioned in recent theoretical work [10] as to whether the nearest neighbor problem is meaningful for the high dimensional case. These results have characterized the data distributions and distance functions for which all pairs of points are almost equidistant from one another in high dimensional space, and have illustrated the validity of the results on a number of real workloads. We note that these results do not necessarily claim that nearest neighbor search is not meaningful in every high dimensional case, but that one must be careful in interpreting the significance of the results. For example, a lack of contrast in the distribution of distances implies that a slight relative perturbation of the query point away from the nearest neighbor could change it into the farthest neighbor and vice versa. In such cases, a nearest neighbor query is said to be unstable. Furthermore, the use of different distance metrics can result in widely varying orderings of the distances of points from the target for a given query. This leads to questions on whether a user should consider such results meaningful.

Recent work [15] has shown that by finding discriminatory projections in the neighborhood of a query point, it is possible to improve the quality of the nearest neighbors. This approach uses the fact that even though high dimensional data is sparse in full dimensionality, certain projections of the space may contain meaningful patterns. These meaningful patterns are more closely related to the query point than the ones determined by using the full dimensionality. Related techniques [3, 6] design distance functions in a data driven manner in order to find the most meaningful nearest neighbors. In these techniques, the statistical properties of high dimensional feature vectors are used in order to obtain meaningful measures of the distances between the points. This is often a difficult task, since the most effective method of measuring distances may vary considerably with the data set and application domain.

Since the issue of meaningfulness is connected to the instability in the measurement of distances, a natural guiding principle in these methods is to find data projections and distance functions in which the distances of the nearest neighbors from the query point have high contrast from the rest of the data. It has been shown in [15] that such a strategy leads to an improvement in search quality. It has also been independently confirmed for the multimedia domain that distinctiveness sensitive nearest neighbor search [19] leads to higher quality of retrieval. At the same time, it is quite difficult for a fully automated system to always find nearest neighbors which would be considered valuable and meaningful by the user. Furthermore, even if the neighbors found are valuable, a user would have little idea about the quality of the neighbors found without being actively involved in the process. The fully automated systems discussed in [6, 15] are incomplete in their characterization of the data in terms of a single best projection or distance function. Different projections can provide different views of the data, all of which may be informative to a human in understanding the relationship between the query point and the rest of the data.

In recent years, the importance of incorporating human interaction into several data mining problems has been well understood and appreciated [5, 8, 14, 16, 20, 24, 28]. For particular domains of data such as multimedia, image databases and information retrieval, application-specific methods have been devised [13, 22, 23, 25, 28] in order to incorporate user feedback into the retrieval process. The importance of human interaction in the nearest neighbor search process arises from the ability of a user to make intuitive judgements that are outside the capabilities of fully automated systems. Simply speaking, a computer cannot match the visual insight, understanding and intuition of a human in distinguishing useful patterns in the data. On the other hand, a human needs computational support in order to determine effective summaries of the data which can be used to derive this intuition and understanding. Therefore, a natural strategy would be to devise a system which is centered around a human-computer interactive process. In such a system, the work of finding nearest neighbors can be divided between the human and the computer in such a way that each entity performs the task that it is most well suited to. The active participation of the user has the additional advantage that he has a better understanding of the quality of the nearest neighbors found.

In this paper, we will describe a human-computer interactive system for high-dimensional nearest neighbor search. In this method, the distributions of the data in carefully chosen projections are presented visually to the user in order to repeatedly elicit his preferences about the relationships between the data patterns and the query point. Specifically, these projections are chosen in such a way that the natural data patterns containing the query point can be visually distinguished easily. Recent work [4, 7, 15] has shown that even though it is difficult to define clusters for sparse high dimensional data, it is often possible to find clusters in certain lower dimensional projections. Many of these clusters may contain the query point. We refer to such projections as query centered projections and the corresponding clusters as query clusters. These projections may exist either in the original sets of dimensions or in an arbitrarily oriented subspace of the data. For each such projection determined, the user is provided with the ability to visually separate out the data patterns which are most closely related to the query point. In a given view, a user may choose to pick some or no points depending upon the nature and distribution of the data. At the end of the process, these reactions are utilized in order to determine and quantify the meaningfulness of the nearest neighbors found from the user perspective.

This paper is organized as follows. The remainder of this section discusses issues of nearest neighbor search in high dimensional data, and formalizes the contributions of the paper. In Section 2, we discuss the interactive system for nearest neighbor search by data exploration. In Section 3, we discuss the quantification of nearest neighbor meaningfulness. The empirical results are discussed in Section 4, whereas Section 5 contains the conclusions and summary.

1.1. On the Nature of High Dimensional Data

Recent theoretical results [10] have shown that in high dimensional space, the nearest and furthest neighbors are of the same relative distance to the query point for a wide variety of data distributions and distance functions. However, it has also been shown in recent work [4, 7, 15] that even though meaningful contrasts cannot be determined in the full dimensional space, it is often possible to find lower dimensional projections of the data in which tight clusters of points exist. In this spirit, the projected nearest neighbor technique [15] finds a single optimal projection of the data in which the closest neighbors are determined in an automated way using the Euclidean metric. In reality, no single projection provides a complete idea of the distribution of the data in the locality of the query point. The full picture may be obtained by using multiple projections, each of which provides the user with different insights about the distribution of the points. For example, Figure 1(a) illustrates a projection in which there is a cluster near the query point which is well separated from the other data points. Such a projection is very useful, since it provides a distinct set of points which can be properly distinguished from the rest of the data. Figures 1(b) and 1(c) are examples of projections in which the closest records to the query point are not well distinguished from the rest of the data.


Figure 1. (a) Good Query Centered Projection; (b) Poor Query Centered Projection (Query Point in Sparse Region); (c) Noisy Projection

For example, in the case of Figure 1(b), the query point is located in a region which is sparsely populated; this is not a good query centered projection, since one cannot identify a distinct subset of points in proximity to the query. In the case of Figure 1(c), the projection is a poor one irrespective of the nature of the query point, since the points are uniformly distributed and do not separate out well into clusters. It is often a subjective issue to determine whether or not a given projection distinguishes the query cluster well. There may also be many cases in which query points may be located at the fringes of a cluster, and it may be difficult to determine the query cluster in an automated way. In all such cases, it becomes important to use the visual insight and feedback of a user in diagnosing the query clusters.

We note that it is possible that a good query-centered projection may be difficult to find in a combination of the original set of attributes. In such cases, it may be necessary to determine arbitrary projections created by vectors which are not parallel to the original axis directions. On the other hand, the use of axis-parallel projections has the advantage of greater interpretability to the user. In this paper, we propose methods for determining projections of both kinds, i.e., axis-parallel and arbitrary projections.

1.2. Contributions of this paper

This paper discusses an interactive system between the user and the computer for high dimensional nearest neighbor search. The advantage of such a system is its ability to combine the best skills of a human and a computer in finding the most meaningful nearest neighbors. Effective techniques are proposed for choosing projections of the data in which the natural data patterns containing the query point are well distinguished from the entire data set. An interesting method discussed in this paper is a graded subspace determination process, so that most of the data discrimination is concentrated in a small subset of orthogonal projections. The visual profiles of these projections are used to provide feedback. The repeated feedback of the user over multiple iterations is used to determine a set of neighbors which are statistically significant and meaningful. In the event that the data does not have any consistently interesting patterns in different lower dimensional projections, this is detected by the meaningfulness quantification method and reported. Thus, an additional advantage of this approach is that it is also able to detect and report cases when the data is truly not amenable to meaningful nearest neighbor search.

1.3. Notations and Terminology

We assume that the total number of points is N and the dimensionality of the data space is d. The query point is denoted by Q. The data set is denoted by D and the universal d-dimensional space by U. Let E be the l-dimensional subspace defined by a set of l ≤ d orthogonal vectors {e1 ... el}. In the general case, the vectors e1 ... el may be arbitrary directions in the space which are not parallel to the axis directions representing the attributes. The projection of a point y onto the subspace E is the l-dimensional point (y · e1, ..., y · el) and is denoted by Proj(y, E). The projected distance between two points x1 and x2 in subspace E is denoted by Pdist(x1, x2, E) and is equal to the distance between Proj(x1, E) and Proj(x2, E) in subspace E.
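As an illustration of this notation, the following minimal sketch (our own example, not part of the paper; the helper names proj and pdist are assumptions) computes Proj(y, E) and Pdist(x1, x2, E) for an orthonormal basis stored as the rows of a NumPy array, using the Euclidean metric in the projected space.

```python
import numpy as np

def proj(y, E):
    """Project point y (d-dimensional vector) onto the subspace spanned by
    the rows of E (an l x d matrix of orthonormal vectors e1 ... el).
    Returns the l-dimensional coordinate vector (y.e1, ..., y.el)."""
    return E @ y

def pdist(x1, x2, E):
    """Projected distance Pdist(x1, x2, E): the Euclidean distance between
    Proj(x1, E) and Proj(x2, E) in the subspace E."""
    return np.linalg.norm(proj(x1, E) - proj(x2, E))

# Usage: a 2-dimensional subspace of a 5-dimensional space.
E = np.eye(5)[:2]                       # e1, e2 = first two axis directions
x1, x2 = np.random.rand(5), np.random.rand(5)
print(pdist(x1, x2, E))
```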

2. The Interactive Nearest Neighbor System

Since the system works by determining the distribution of the nearest records to the query point in a given projection, we introduce a parameter called the support. This is the number of database points that are candidates for the nearest neighbor in a given projection, and whose distribution relative to the rest of the data set is analyzed.


Algorithm InteractiveNNSearch(Data Set: D, Query Point: Q, Support: s);
begin
  for each data point i do set P(i) = 0;
  { P = {P(1) ... P(N)} denotes the probability of each of the N points
    in the database being a meaningful nearest neighbor }
  while not (termination criterion) do
  begin
    Set count v(i) of each data point i to zero;
    Dc = D; Ec = U;  { U is the universal subspace }
    { User counts are V = {v(1) ... v(N)} }
    for i = 1 to d/2 do
    begin
      (Eproj, Enew, Dnew) = FindQueryCenteredProjection(Dc, Ec, Q, s);
      DisplayVisualProfile(Dc, Eproj, ∞);
      τ = AdjustDensitySeparator(Dc, Eproj, τ);
      V = UpdateCounts(Dc, Eproj, Q, τ, V);
      Ec = Enew; Dc = Dnew;
    end;
    P = QuantifyMeaningfulness(V, P);
    Remove any point i from D for which v(i) = 0;
  end;
  return s points with highest value of P(i);
end

Figure 2. The Nearest Neighbor Algorithm

Algorithm FindQueryCenteredProjection(Data Set: Dc, Current Subspace: Ec,
  Query Point: Q, Support: s)
begin
  lp = Dimensionality of Ec; Ep = Ec;
  while lp > 2 do
  begin
    Compute the value of Pdist(Q, x, Ep) for each data point x in Dc;
    Find the nearest s points to Q with smallest value of Pdist(Q, x, Ep)
      and denote them by Np;
    Ep = QueryClusterSubspace(Np, lp); lp = max{2, ⌊lp/2⌋};
  end;
  Enew = Ec − Ep;
  Compute Dnew as the projection of the set of points in Dc onto Enew;
  return (Ep, Enew, Dnew);
end

Figure 3. Finding Query Centered Projections

Algorithm QueryClusterSubspace(Query Cluster: C, Dimensionality of Subspace: lp);
begin
  Let σij be the covariance between dimensions i and j using the points in C;
  { We denote the corresponding covariance matrix by Σ = [σij] }
  Determine the eigenvectors of matrix Σ with eigenvalues λi;
  Let γi be the variance of the entire data set along eigenvector i;
  E = Set of lp eigenvectors with least value of λi/γi;
  return (E);
end;

Figure 4. Determining Query Cluster Subspaces

Algorithm DisplayVisualProfile(Data Set: Dc, Projection Subspace: Eproj,
  Noise Threshold Density: τ)
begin
  for each data point xi in Dc compute yi = Proj(xi, Eproj);
  Y = {y1, ..., yN};
  Divide the 2-dimensional hyperplane for Eproj into a p x p grid;
  { We denote the coordinate points on the grid by {z1 ... z(p^2)} }
  Use Y to compute the kernel density f(zi) at the p^2 grid points;
  Display the density profile f(·) for the subspace Eproj;
  Superpose the density profile with a hyperplane at density value τ;
end

Figure 5. Computing Visual Profiles

Algorithm AdjustDensitySeparator(Data Set: Dc, Query Cluster Subspace: Eproj);
begin
  interaction-flag = true;
  while (interaction-flag == true) do
  begin
    User inputs new height of density separator τ;
    DisplayVisualProfile(Dc, Eproj, τ);
    User specifies interaction-flag;
  end;
  return (τ);
end

Figure 6. Visually Separating Query Clusters

Algorithm UpdateCounts(Data Set: Dc, Query Cluster Subspace: Eproj,
  Query Point: Q, Separator Density: τ, Counts: V);
begin
  for each data point xi in Dc do
  begin
    Compute the density at xi in subspace Eproj;
    if xi is density-connected to Q and has density at least τ
      then add one to the count v(i) of record i;
  end;
  return (V);
end

Figure 7. Updating User Preference Counts

Algorithm QuantifyMeaningfulness(Counts: V, Meaningfulness Vector: P);
begin
  { ni is the number of points picked by the user in projection i, i in {1, ..., d/2} }
  for j = 1 to N do
  begin
    E[Yj] = Σ_{i=1}^{d/2} ni/N;
    var(Yj) = Σ_{i=1}^{d/2} (ni/N) · (1 − ni/N);
    p = (v(j) − E[Yj]) / √var(Yj);
    P(j) = P(j) + p;
  end;
  return (P);
end;

Figure 8. Quantification of Meaningfulness


The value of this support parameter can either be chosen by the user or by the system. In most real applications, users are not looking for a single nearest neighbor, but for a group of nearest neighbors, all of which can be considered to be close matches for a target query. Therefore, the number of database points to be retrieved by the user is the support s used for the analysis. We note that in order to perform a proper analysis of the directions in the data of greatest discrimination, this support should be at least equal to the dimensionality d. Therefore, in cases when the user-specified support is less than d, we set it equal to d. We also note that in many cases, there may be a certain number of points which are inherently more closely related to the query as compared to the rest of the database. This number may be different from s, and cannot be known a priori. In such cases, we will see that our approach is able to provide some guidance in returning the natural sets of points related to the query.

The overall framework of the algorithm is illustrated in Figure 2. The system works in an iterative loop in which a set of d/2 mutually orthogonal projections are presented to the user in a given iteration. Each of these projections is carefully chosen such that it brings out the contrast between the points closest to the query and the remaining points. Once such a projection has been found, the user separates out the points which belong to the query cluster. The selection statistics of each data point are maintained in the set of counts v(1) ... v(N), which are initialized to zero and incremented whenever a set of points is picked. After each iteration of d/2 projections, the set of choices v(·) made by the user is utilized to determine his level of perception of the closeness of each data point to the query point Q. This number lies in the range (0, 1), and is referred to as the meaningfulness probability. This number defines the user-reaction probability that the data point can be distinguished as significantly more closely related to the query point as compared to the average record in the data. The meaningfulness probability is calculated independently for each iteration of d/2 projections, and the values over multiple iterations are aggregated in order to determine a final value. At the end of each iteration, those points which were not picked even once in any projection are removed from the data set. Thus, the user behavior in an iteration influences the later profiles which are presented to him. The process continues until it is determined that the current ordering of meaningfulness probabilities reliably matches the user's intent based on his reaction to all views which have been presented to him so far. The details of the meaningfulness quantification and termination criterion are described in a later section.

Each iteration (henceforth referred to as a major iteration) is divided into a set of d/2 minor iterations, in each of which a projection is determined and presented visually to the user for his feedback. The set of d/2 projections which are determined in each major iteration are mutually orthogonal. This is because we would like to present the user with several independent perspectives of the data, which together span the full dimensional space. In order to achieve this, we maintain a current data set Dc and a current subspace Ec. Let us say that a total of r < d/2 projections have already been presented to the user in the current major iteration. Let the subspaces corresponding to these projections be denoted by Eproj(1) ... Eproj(r). Then, the current subspace Ec is the (d − 2·r)-dimensional subspace which is orthogonal to all of these projections, and is given by U − ∪_{i=1}^{r} Eproj(i). Here U is the full dimensional space. The data set Dc is the projection of the data set D onto the subspace Ec. Thus, each data point xi in D is represented by the data point Proj(xi, Ec) in Dc. The next projection Eproj(r+1) is determined by using the data set Dc.
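As a concrete illustration of this bookkeeping (a sketch under our own assumptions, not the paper's implementation), the snippet below maintains an orthonormal basis for the current subspace Ec as the rows of a NumPy array, removes a chosen 2-dimensional projection subspace Ep from it, and projects the data onto the complement.

```python
import numpy as np

def complement_basis(Ec, Ep):
    """Given orthonormal bases Ec ((d-2r) x d) and Ep (2 x d, spanned
    within Ec), return an orthonormal basis for Ec minus Ep."""
    # Remove the component of each basis vector of Ec that lies in Ep,
    # then take an orthonormal basis of what remains via the SVD.
    residual = Ec - (Ec @ Ep.T) @ Ep
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    return vt[s > 1e-10]

def project_data(D, E):
    """Represent every point of D (N x d) by its coordinates in E."""
    return D @ E.T

# Usage: start with the universal space U of dimensionality d = 6.
d = 6
Ec = np.eye(d)
Ep = Ec[:2]                        # a 2-dimensional projection chosen from Ec
Enew = complement_basis(Ec, Ep)    # 4-dimensional complementary subspace
Dnew = project_data(np.random.rand(100, d), Enew)
```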

In each minor iteration, we perform the following steps: (1) finding the most discriminatory projection which is centered around the query point; this discriminatory projection is picked out of Ec; (2) interactive determination of the query cluster by the user, based on the visual separation of the query point from the remaining data; (3) updating the counts for the points in the query cluster. In the remaining part of this section, we will discuss each of these steps in detail.

2.1. Finding Good Query Centered Projections

The overall process for determining a discriminatory projection is illustrated in Figure 3. A discriminatory projection is defined as one which distinguishes the natural lower dimensional clusters containing the query point from the rest of the data. In order to find the most highly discriminating projection, it may often be desirable to use projections which are created by arbitrary sets of vectors {e1 ... el} which are not parallel to the original axis system. In other cases, it may be desirable to pick projections from the original set of attributes for reasons of better interpretability. Our system can support both versions. In the following discussion, we will first discuss the general case, and then discuss the minor changes required for the particular case of axis-parallel projections.

In order to find the most discriminatory projections, we compare the distribution of the nearest s points with that of the rest of the data set. Our goal is to pick a projection in which this small fraction of points shows a well distinguished cluster around the query point. Since the current data set Dc is represented in the space Ec, the projection subspace needs to be a subspace of Ec. The process of finding the query cluster and query subspace is an iterative one. We start with the subspace Ep = Ec from which the 2-dimensional projection Eproj needs to be found. In each iteration, we reduce the dimensionality of the subspace Ep in which there is a distinct cluster Np surrounding the query point which is also well separated from the data. In the first iteration, we start off with Np being the set of points which are closest to the query point Q in the subspace Ec. Then, we find a subspace Ep in which this "query cluster" Np is a tightly-knit cluster as compared to the variance of the remaining data set. In order to do so, we find the principal component directions [17] of the set of points in Np. The principal component directions are helpful in finding those projections of the data in which Np is tightly clustered. In order to find the principal components, we determine the covariance matrix Σ of the set of points in Np. Since the data points in Np have dimensionality |Ec|, the covariance matrix is an |Ec| x |Ec| matrix in which the entry (i, j) denotes the covariance between dimensions i and j. This covariance matrix is positive semi-definite and can be expressed as Σ = P · D · P^T, where D is a diagonal matrix containing the eigenvalues, and the columns of P contain the eigenvectors, which form an orthonormal axis-system. These eigenvectors are the principal components and represent the directions in the data along which the second order covariances of the points in Np are zero. The eigenvalue λi along direction i denotes the variance of the set of points in Np along direction i. Therefore, if γi is the variance of the entire data set Dc along eigenvector i, then the ratio λi/γi denotes the ratio of the variances between the query cluster and the remaining data when projected onto eigenvector i. Therefore, by picking the lp directions with the smallest variance ratio, we are able to determine the directions in which the query cluster is well distinguished from the rest of the data. The procedure for determining the query cluster subspace is illustrated in Figure 4.
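A minimal sketch of this variance-ratio selection (our own illustration; the function name query_cluster_subspace and the use of NumPy are assumptions, not the paper's code) is shown below: it computes the covariance matrix of the current query cluster, compares the cluster variance along each eigenvector with the variance of the full data set along the same direction, and keeps the lp eigenvectors with the smallest ratio.

```python
import numpy as np

def query_cluster_subspace(cluster, data, lp):
    """Return lp orthonormal directions (rows) along which the query
    cluster is tight relative to the full data set.

    cluster : (s, d) array holding the current query cluster Np
    data    : (N, d) array holding the current data set Dc
    """
    # Eigenvectors of the cluster covariance matrix (principal components).
    cov = np.cov(cluster, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # columns are eigenvectors

    # Variance of the whole data set along each eigenvector direction.
    centered = data - data.mean(axis=0)
    data_var = np.var(centered @ eigvecs, axis=0) + 1e-12

    # Keep the lp directions with the smallest cluster-to-data variance ratio.
    ratio = eigvals / data_var
    order = np.argsort(ratio)[:lp]
    return eigvecs[:, order].T
```

For the axis-parallel variant discussed below, the eigenvector matrix would simply be replaced by the identity, so that the same ratio is computed for the original attributes.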

Once the query subspace has been determined, it is used in the next iteration in order to determine a new query cluster Np. Specifically, the set of points Np in the next iteration is determined by finding those points which are closest to the query point when projected into the newly found subspace Ep. The dimensionality of the subspace Ep is denoted by lp and is reduced by a factor of 2 in each iteration. The process continues till the value of lp is 2. The reason for the iterative methodology used by the algorithm is that both Np and Ep are dependent on one another, and the gradual reduction in the dimensionality ensures an effective refining process in which we find a query cluster Np and a corresponding subspace Eproj in which this cluster is well distinguished from the rest of the data. When it is desirable to use clusters only from axis parallel projections, a minor modification is needed to the query subspace determination subroutine of Figure 4. Here, instead of using the principal components of the set of data points in Np, we use the original set of axis directions.
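The alternating refinement of Np and Ep can be sketched as follows (again an illustrative reconstruction under our own assumptions, reusing the query_cluster_subspace helper above; it is not the authors' code).

```python
import numpy as np

def find_query_centered_projection(Dc, Q, s, query_cluster_subspace):
    """Alternate between picking the s nearest points to Q in the current
    subspace Ep and re-fitting Ep to that cluster, halving the subspace
    dimensionality, until a 2-dimensional projection Eproj remains.

    Dc is assumed to be expressed in the coordinates of the current
    subspace Ec, so the initial Ep is simply the identity basis."""
    d = Dc.shape[1]
    Ep = np.eye(d)
    while True:
        # Projected distances Pdist(Q, x, Ep) for every point of Dc.
        dists = np.linalg.norm((Dc - Q) @ Ep.T, axis=1)
        Np = Dc[np.argsort(dists)[:s]]             # current query cluster
        if Ep.shape[0] <= 2:
            break
        lp = max(2, Ep.shape[0] // 2)
        Ep = query_cluster_subspace(Np, Dc, lp)    # tighten the subspace
    return Ep, Np
```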

Once the projection subspace Ep is determined, we compute the new subspace Enew and data set Dnew which will be used in order to determine the projection in the next iteration. We wish to ensure that later iterations find only subspaces which are orthogonal to those found so far. Therefore, Enew is chosen as the complementary subspace to Ep, assuming that Ec is the entire subspace. The data set Dc is also projected onto this new subspace in order to create Dnew.

2.2. Interactive Separation of Query Cluster

Once a discriminatory projection has been determined, human interaction is used in order to separate the query cluster from the remaining data points. In order to maximize the use of human intuition, we use a visual profile of the probabilistic data distribution. To this effect, we use kernel density estimation techniques [26]. In this technique, the probability density at a given point is estimated as the sum of the smoothed values of the kernel functions Kh(·) associated with each point in the data set. Each kernel function is associated with a kernel width h which determines the level of smoothing created by the function. The kernel estimation f(x) based on N data points x1 ... xN and kernel function Kh(·) is defined as follows:

f(x) = (1/N) · Σ_{i=1}^{N} Kh(x − xi)    (1)

Thus, each discrete point xi in the data set is replaced by a continuous function Kh(·) which peaks at xi and has a variance which is determined by the smoothing parameter h. An example of such a distribution is the Gaussian kernel with width h:

Kh(x − xi) = (1 / (√(2π) · h)) · e^{−(x − xi)² / (2h²)}    (2)

The error in density estimation is determined by the bandwidth h. One well known approximation formula [26] for determining the bandwidth is h = 1.06 · σ · N^{−1/5} for a data set with N points and standard deviation σ.

In order to actually construct the density profiles, we estimate the probability density of the data at a set of p x p grid-points, which are used to create surface plots. Examples of two such density profiles are illustrated in Figures 9(a) and 9(b). Note that in the case of Figure 9(a) there is a sharp and well separated peak containing the query point. This corresponds to the highly dense cluster near the query point. This behavior is typical of a well chosen projection which discriminates the data patterns near the query point well. A second way of providing the user with a visual understanding of the data is to provide a lateral density plot, in which we have a scatter plot of fictitious points which are generated in proportion to their density. We note that all of Figures 1(a), 1(b), and 1(c) are lateral scatter plots of 500 points generated from synthetic data sets.
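A small sketch of this grid-based density profile (an illustration under our own assumptions, using the Gaussian kernel and bandwidth rule quoted above; the helper name density_profile is ours) is given below.

```python
import numpy as np

def density_profile(Y, p=50):
    """Evaluate a Gaussian kernel density estimate of the 2-d projected
    points Y (N x 2) on a p x p grid, for use as a surface plot.

    Returns the grid axes and the p x p array of density values."""
    N = len(Y)
    # Bandwidth h = 1.06 * sigma * N^(-1/5); sigma is averaged over the two
    # coordinates here (an assumption, since the paper states the 1-d rule).
    h = 1.06 * Y.std(axis=0).mean() * N ** (-1.0 / 5.0)

    gx = np.linspace(Y[:, 0].min(), Y[:, 0].max(), p)
    gy = np.linspace(Y[:, 1].min(), Y[:, 1].max(), p)
    grid = np.stack(np.meshgrid(gx, gy), axis=-1)          # p x p x 2

    # Sum of Gaussian kernels centered at each projected data point.
    diff = grid[:, :, None, :] - Y[None, None, :, :]        # p x p x N x 2
    sq = (diff ** 2).sum(axis=-1)
    dens = np.exp(-sq / (2 * h * h)).sum(axis=-1) / (N * 2 * np.pi * h * h)
    return gx, gy, dens

# Usage: density of 500 projected points, e.g. for a surface or contour plot.
gx, gy, dens = density_profile(np.random.randn(500, 2))
```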


Figure 9. Illustration of the Quality of Projections: (a) A Good Query Centered Projection; (b) A Poor Query Centered Projection. (The plots show the relative density over the projected coordinates X and Y, with the query point marked.)

Once the user is provided with this visual profile, it is possible for him to separate the query cluster from the remaining points by using either of the two visual profiles. A convenient way of separating the query cluster visually is by using density separators of a certain height. In this technique, the user specifies the density τ which is the noise threshold. This threshold is used in order to determine the set of points which are the user-defined nearest neighbors in that projection by using the concept of density connectivity. The concept of density connectivity is discussed in [12, 16], and is defined as follows:

Definition 2.1 A data point x is density connected to the query point Q at noise threshold τ, if there exists a path P from x to Q such that each point on P has density larger than the noise threshold τ.

Thus, for a given noise threshold τ and query point Q, it is possible to uniquely determine the set of points in the database which are density connected to the query point. For example, in Figure 9(a), we have shown the density profile of a data set along with a density separator plane which separates the data out into different clusters. We note that the contour of intersection of the density separator plane with the density profile of the data is a set of closed regions. Each such closed region corresponds to the contour of a cluster in the projection. However, only one of these contours is relevant: the one that contains the query point Q. All data points contained within this contour are relevant answers to the query point for this particular projection. We shall henceforth refer to the contour containing the query point Q at noise threshold τ as the (τ, Q)-contour. Such a contour is not restricted to be of any particular shape, and is dependent only upon the distribution of the points in the data. For example, in Figure 9(a), there are two density connected regions above the noise threshold τ. All the data points which lie in the same region as the query cluster are the set of preferences for that particular projection. At a noise threshold of τ = 20, a distinct cluster of points containing the query point is created; by reducing τ further, more and more points from the fringes of the cluster are included. Here, the intuition of a user is very useful, since an accurate delineation of the related data pattern is often not possible by fully automated methods. We also note that if the query point had belonged to one of the other peaks, then for different values of the noise threshold, a different number of peaks would have been included in the query cluster. By using τ = 0, all points are included in the query cluster. We refer to the views created by this process as density separated views, since they clearly show the various clusters in the data based on the density profile and the noise threshold supplied by the user. We note that it is not necessary for the user to supply the noise threshold after just one view of the data. Rather, the user can look at density separated views for many different values of the noise threshold τ in order to interactively converge at the most intuitively appropriate value.

As discussed in an earlier section, not all views are equally informative in understanding the relationship of the data to the query point. For example, the projection of Figure 9(a) is significantly better than the similar profile of Figure 9(b), since in the former case the query point is located on a peak of the density profile, whereas in the latter case the query point is in a sparse region of the data. In such cases, it is difficult to find a coherent cluster of points in the projection which are related to the query point. The user can choose to ignore such a projection by specifying an arbitrarily high value of the noise threshold τ.

An alternative way of separating the query cluster is by using the lateral density plot, in which the user visually specifies separating hyperplanes (lines) in order to divide the space into a set of polygonal regions. The set of points in the same polygonal region as the query point is the user response to the query for that particular projection. However, using a density separator tends to be a more attractive option, since it can separate out clusters of arbitrary shapes with the specification of a single noise threshold. The algorithms for displaying visual profiles and performing user interaction are illustrated in Figures 5 and 6 respectively.

2.3. Updating the Preference Counts

After each minor iteration, we need to update the preference counts for the points which have been determined to lie in the query cluster. An important problem is to discover all the points which are density connected to the query point, without having to calculate the density value at each individual data point. It is possible to use the density values calculated across the p x p grid structure in order to approximate the points which are density connected to the query point. The first step is to find all the elementary rectangles in the grid structure which approximately lie within the (τ, Q)-contour. We shall denote this set of elementary rectangles by R(τ, Q). We define R(τ, Q) as follows:

Definition 2.2 An elementary rectangle L is defined to be a member of R(τ, Q), if and only if there exists some sequence of adjacent rectangles L = L0, L1, ..., Lk such that (i) Lk contains Q, and (ii) at least three corners of each rectangle Li have density above the noise threshold τ.

Two rectangles are said to be adjacent when they share a common side. In effect, the set R(τ, Q) is the set of high density rectangles which are connected to the rectangle containing Q by some adjacent sequence of high-density rectangles. In order to find R(τ, Q), we use a simple graph search algorithm in which we start at the rectangle containing Q and keep searching adjacent rectangles until we have determined all rectangles which lie in R(τ, Q).
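This graph search can be sketched as a breadth-first traversal over grid cells (a minimal illustration under our own assumptions: density values are given per grid corner, and the adjacency and three-corner rule follow Definition 2.2; the function name find_connected_cells is ours).

```python
from collections import deque

def find_connected_cells(corner_density, q_cell, tau):
    """Approximate R(tau, Q): the set of grid cells connected to the cell
    containing Q through side-adjacent cells, where each cell has at least
    three corners with density above the noise threshold tau.

    corner_density : (p+1) x (p+1) array of densities at the grid corners
    q_cell         : (row, col) index of the cell containing Q
    """
    p = len(corner_density) - 1

    def qualifies(r, c):
        corners = [corner_density[r][c], corner_density[r][c + 1],
                   corner_density[r + 1][c], corner_density[r + 1][c + 1]]
        return sum(d > tau for d in corners) >= 3

    visited, queue = set(), deque()
    if qualifies(*q_cell):
        visited.add(q_cell)
        queue.append(q_cell)
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < p and 0 <= nc < p and (nr, nc) not in visited \
                    and qualifies(nr, nc):
                visited.add((nr, nc))
                queue.append((nr, nc))
    return visited
```

The preference count v(i) of every data point falling inside a returned cell would then be incremented, as in Figure 7.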

Once the points inside all these rectangles have been determined, we increment their counts by one unit. It is also possible to weight different query clusters by importance. Specifically, the weight for each point may be increased by wi corresponding to the projection i. This procedure is illustrated in Figure 7. In this paper, we always assume that wi = 1; therefore, each query cluster is considered equally important. It is important to understand that since the user has the ability to pick only those projections in which there are meaningful data patterns surrounding the query point, the preference counts v(·) will not be affected by those noisy combinations of dimensions which are not useful for the nearest neighbor search process. At the end of each major iteration, the count v(i) for each point i is quantified into a meaningfulness value, and is used in order to update a corresponding probability value for that data point. This process is outlined in Figure 8, and is described in detail in the next section.

2.4. Iterative User Preference Quantification

The iterative process discussed above continues until a sufficient amount of feedback has been collected to find and quantify the meaningfulness of the nearest neighbors. In order to do so, we convert the preference counts v(·) of the user into meaningfulness probabilities in each major iteration. These probabilities quantify the level of coherence in the behavior of the user in classifying a certain point into the query cluster across different projections. The variation in these probabilities from iteration to iteration is used in order to decide whether the process should terminate. The process for the conversion of the user preference counts in each major iteration into a probability vector P(·) and the subsequent termination is discussed in the next section.

3. User Quantification of Meaningfulness

If the user coherently picks similar points across the different orthogonal projections in a given major iteration, then such behavior can be used to quantify the meaningfulness of the technique. After each major iteration, the set of preferences v(·) provided by the user is used to update a meaningfulness probability vector P(·). This procedure is illustrated in Figure 8. First, we will analyze the coherence of a set of reactions by the user in a sequence of d/2 projections. Let Xij be a random variable which denotes whether (Xij = 1) or not (Xij = 0) the user picks the point j in projection i. We note that Xij is a Bernoulli random variable, and if ni is the number of points that a user picks in projection i, then the probability that Xij = 1 for the data point j is given by ni/N. Let wi be the weight for each preference count in projection i. Then, the random variable Yj indicating the total user preference for point j is given by:

Yj = Σ_{i=1}^{d/2} wi · Xij    (3)

The corresponding expected value E[Yj] is given by:

E[Yj] = Σ_{i=1}^{d/2} wi · ni/N    (4)

The value of E[Yj] is the same for every data point j, and is simply the sum of the (weighted) fractions of points picked by the user in the different projections. Since this system tries to detect the relationships across preference patterns in different projections (because of correlations among the different attributes), it is instructive to look at the case when the data is completely uncorrelated.


If such were the case, then the preference values of the user in the different projections in an iteration would be uncorrelated with one another. This would mean that the random variables X1j ... X(d/2)j are also independent. Consequently, we can compute the variance of Yj as the sum of the variances of the individual components wi · Xij. Since Xij is a Bernoulli random variable with probability ni/N, we have:

var(Yj) = Σ_{i=1}^{d/2} wi² · (ni/N) · (1 − ni/N)    (5)

Again, var(Yj) is independent of j. Let v(j) be the true number of preference counts that a user has given to a data point xj. When the data is distributed in a noisy way, the preference pattern across different projections will not show any meaningful consistency. Therefore, the value of v(j) will not vary significantly from E[Yj]. In order to quantify this notion, we define the meaningfulness coefficient M(j) for the point xj as follows:

M(j) = (v(j) − E[Yj]) / √var(Yj)    (6)

The meaningfulness coefficient is a numerical estimate of the level of confidence with which the nearest neighbor found is closer to the target than the average. Note that when the value of d is high, the distribution of M(j) is approximately normal. Let Φ(·) be the cumulative distribution function of the normal distribution with zero mean and unit variance. In such a case, we can compute the meaningfulness probability P(j) as follows:

P(j) = max{2 · Φ(M(j)) − 1, 0}    (7)

This value quantifies the probability that the preference count for data point xj exceeds the expected preference count E[Yj] by more than chance alone would explain. Note that when the value of v(j) is smaller than E[Yj], this value is equal to zero. When the user preference pattern shows considerable consistency across different projections, we expect P(j) to be almost one for some of the points, whereas for the other points, the value of P(j) is significantly less than one. This is a very desirable situation, since some of the data points can be clearly distinguished as the nearest neighbors.
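These quantities are straightforward to compute; the following sketch (our own illustration, with unit weights wi = 1 as assumed in the paper, and a function name of our choosing) turns the per-iteration counts v(j) and the per-projection pick counts ni into meaningfulness probabilities.

```python
import numpy as np
from math import erf, sqrt

def meaningfulness_probabilities(v, n_picked, N):
    """Convert preference counts into meaningfulness probabilities.

    v        : (N,) counts v(j) accumulated over the d/2 projections
    n_picked : list of n_i, the number of points the user picked in each
               of the d/2 projections of this major iteration
    N        : total number of data points
    """
    fracs = np.asarray(n_picked, dtype=float) / N
    expected = fracs.sum()                      # E[Yj], the same for all j
    variance = (fracs * (1.0 - fracs)).sum()    # var(Yj) under independence

    M = (np.asarray(v, dtype=float) - expected) / sqrt(variance)
    # P(j) = max{2 * Phi(M(j)) - 1, 0}, with Phi the standard normal CDF.
    Phi = 0.5 * (1.0 + np.vectorize(erf)(M / sqrt(2.0)))
    return np.maximum(2.0 * Phi - 1.0, 0.0)
```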

In a single iteration, we obtain the user preference patterns from a set of d/2 mutually orthogonal projections. The above analysis for the calculation of P(j) is based on such an iteration of mutually orthogonal projections. However, as illustrated in Figure 2, the process is repeated for multiple iterations in order to obtain feedback for many sets of d/2 projections. In such a case, the overall meaningfulness is calculated as the arithmetic average over multiple iterations. Let us say that the values of P(j) for data point xj in each of the φ iterations are given by p_j^1, ..., p_j^φ.

Data Set       Precision   Recall
Synthetic 1    87%         98%
Synthetic 2    91%         96%

Table 1. Accuracy on Synthetic Data Sets

over multiple iterations is given by the following arithmeticaverage:

P(j) =�Xi=1

(pij)=� (8)

We note that the values of the meaningfulness probability vector P(·) in two successive iterations will be highly correlated with one another, and the level of this correlation will increase with the value of φ. Therefore, we compare the sets of s points with the highest values of P(·) in iterations φ − 1 and φ. When the percentage of common points between these two iterations is larger than a certain threshold t, then we terminate. At the end of the procedure, we return the s data points which have the highest value of P(j). Note that in Figure 8, for each data point j we maintain the value of Σ_{i=1}^{φ} p_j^i as opposed to the average. Therefore, the true value of the meaningfulness probability may be obtained by dividing this value by φ.
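The termination criterion can be sketched as follows (a minimal illustration under our own assumptions; running_sum holds the accumulated values Σ p_j^i from Figure 8, and the overlap threshold t is a fraction in [0, 1]).

```python
import numpy as np

def should_terminate(prev_running_sum, running_sum, s, t):
    """Stop when the top-s points by meaningfulness probability overlap by
    at least a fraction t between two successive major iterations.
    Dividing the running sums by the iteration count does not change the
    ranking, so the accumulated sums can be compared directly."""
    if prev_running_sum is None:
        return False
    prev_top = set(np.argsort(prev_running_sum)[-s:])
    curr_top = set(np.argsort(running_sum)[-s:])
    return len(prev_top & curr_top) / float(s) >= t
```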

4. Empirical Results

In this section, we will discuss the results obtained by using our user-adaptable system for a variety of real and synthetic data sets. For the case of synthetic data sets, we show that the nearest neighbors indeed lie within the natural projected clusters which are created for testing purposes. We also show some interesting examples of high dimensional data in which the data is truly distributed in a noisy and meaningless way. In these cases, we show that the technique is effectively able to predict the meaninglessness of applying a nearest neighbor search process.

4.1. Synthetic Data Sets

We generated a set of sparse synthetic data sets in high dimensionality, such that projected clusters were embedded in lower dimensional subspaces. We generated two data sets with N = 5000 points using this technique. We shall refer to these data sets as Case 1 and Case 2 respectively, with the same parameters used in [4], except for the number of points. These data sets contain 6-dimensional projected clusters embedded in 20-dimensional data.

For the purpose of testing, we adopted the policy of isolating a cluster with the query point containing about 0.5-5% of the data.


Figure 10. Density Profile (Syn. 1: Early Minor Iteration)

Figure 11. Density Profile (Syn. 1: Late Minor Iteration)

Figure 12. Density Profile (Uniform)

Figure 13. Density Profile (Ionosphere)

Correspondingly, the value of the support used in order to determine a discriminatory projection was set at 0.5%. However, in many cases, when the visual profile was constructed, the actual cluster was often either much smaller or larger than this threshold. In such cases, the interactive query cluster separation process was able to correct for any discrepancies. Because of the careful method in which the subspace is determined, we found that in most cases a clear cluster could be found near the query point. In each major iteration, the visual profiles obtained during the first few minor iterations were the most discriminatory. For example, a visual profile obtained during an early (first) minor iteration for the first case is illustrated in Figure 10, whereas the visual profile in the last minor iteration is illustrated in Figure 11. It is clear that the former profile is one in which the query cluster can be more clearly distinguished from the remaining data points. This is because in the first few minor iterations, the subspace determination subroutine has considerable flexibility in choosing a subspace which results in the best query centered projection. This does not continue to be the case in the last iteration, in which the algorithm is forced to pick from the subspace which is complementary to the union of all the subspaces already chosen. This gradation in the quality of the projections has an important influence on the nearest neighbor search process. Since the user can choose to discard the projections determined in the last few minor iterations, only the nicely coherent behavior of the data is reflected in the user preference counts v(·). Thus, the graded quality of the projections ensures that most of the noise in the data is pushed into the last few projections, and the user is able to use his intuition in order to easily pick out the clearly coherent projections (and data patterns) in the first few minor iterations. In other words, at the end of each major iteration, the user preference counts implicitly define a relevance value for each data point in which the noise and sparsity effects of high dimensionality have been filtered out with the use of human intuition.


However, this intuition could not have been harnessed without the use of a carefully graded subspace determination process. This interdependence between the user and the computer reflects the nature of the interaction between the two entities. The visual profiles for the first and last minor iterations in the second data set were similar to those obtained in the first. In each case, we determined the meaningfulness probability of each data point at the termination of the process. We sorted the data in order of meaningfulness probability and found that a few of the data points had meaningfulness probability in the range of 0.9 to 1, after which there was a steep drop. This steep drop corresponds to the distinct projected cluster to which the query point belongs. By using the threshold which occurs just before this steep drop, it is possible to isolate the natural set of points related to the query. Note that the corresponding cardinality may be quite different from the user-specified support. In this case, the corresponding threshold value was 0.9. This corresponds to about 520 neighbors in Case 1, which had a meaningfulness probability higher than this threshold. This also compares well with the cardinality (562) of the projected cluster containing the query point. Of the 520 neighbors recovered in Case 1, 508 belonged to the same cluster as the query point. In order to illustrate the effectiveness of the technique, we have presented some summary results for the Case 1 and Case 2 data sets in Table 1, over a set of 10 examples on which we ran the method. We have illustrated the precision and recall of the two techniques in the table, using the natural number of nearest neighbors found by the thresholding technique. Typically, the natural number of nearest neighbors is often a slight overestimate (about 5 to 15% over the correct value), and hence the recall values are higher than the precision. The recall value indicates that very few of the true nearest neighbors are actually missed. It is clear that in each case, since both the precision and recall are so high, the nearest neighbor search technique was not only accurate, but was able to determine the "natural" number of nearest neighbors effectively. This is very useful for nearest neighbor applications in which finding a natural number of nearest neighbors is as important as the quality of the records returned.

4.2. Results on Poorly Behaved Data

The meaningfulness issue of the nearest neighbor problem in high dimensionality critically depends upon the fact that the local implicit dimensionality of the data is often significantly lower than the full dimensionality [1, 2, 4, 11]. The technique discussed in this paper exploits this fact in conjunction with user interaction. However, for some data sets even the local implicit dimensionality is high. An example of such a case is uniformly distributed data. In such a situation, the problem of nearest neighbor search is indeed not very meaningful in high dimensionality.

Data Set (Dimensionality)   Accuracy (L2)   Accuracy (Interactive)
Ionosphere (34)             71%             86%
Segmentation (19)           61%             83%

Table 2. Accuracy on Real Data Sets

Therefore, it is interesting to test what happens in such a system for uniformly distributed data.

We tested a case with N = 5000 uniformly distributed points in d = 20 dimensions. In this case, we found that it was difficult to find views in which the points were well discriminated from one another. A typical example of a view obtained from such a case is illustrated in Figure 12. We note that the discrimination of the data surrounding the query cluster is very poor in such a case. This is valuable information for a user, since it tends to indicate the poor selectivity of the data even in carefully chosen projections. Therefore, a user can infer that the data is not very amenable to meaningful nearest neighbor search in high dimensional space. Even further evidence may be obtained from the distribution of the meaningfulness probability values. Even though it was difficult to pick out the query cluster because of the poor discrimination behavior, we were able to find a few query clusters in some of the projections because of local variations in density. When the process was completed, we found that there was very little coherence in the preference counts across the different views, and they were evenly distributed among the different data points. In this case, the meaningfulness values do not show the kind of steep drop which is visible in the synthetic data sets. Consequently, it is difficult to isolate a well defined query cluster based on a similar threshold. The conclusion from these numerous observations is that in those high dimensional cases in which meaningful nearest neighbors do not exist, the technique is able to discover this valuable information.

4.3. Results on Real Data

We also tested the results on a number of real data sets obtained from the UCI machine learning repository.1 An example of a visual profile from a query-centered projection obtained from the ionosphere data set is illustrated in Figure 13. The meaningfulness probability values also showed a steep drop similar to the case of the clustered synthetic data. Thus, both the visual profiles and the meaningfulness behavior for this real data set are similar to the clustered synthetic data set, as opposed to the uniformly distributed case. We tested the results on the ionosphere and segmentation data sets from the UCI machine learning repository.

1. http://www.cs.uci.edu/~mlearn


We have illustrated the nearest neighbor classification accuracy using 10 query points (and as many nearest neighbors as determined by the natural query cluster size) for a number of data sets. The results are illustrated in Table 2. As evident from Table 2, the interactive nearest neighbor process was more effective than the full dimensional method. In the particular case of the segmentation data set, the improvement was 32% over the full dimensional method, which is quite significant.
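The evaluation implied above (a majority vote among the neighbors returned for each query point, with a per-query neighbor count given by the natural query cluster size) can be written down in a few lines. The sketch below is our own reconstruction of such an evaluation under those assumptions, with labelled data and externally supplied neighbor sets; it is not the code used to produce Table 2.

```python
import numpy as np
from collections import Counter

def vote_accuracy(labels, query_ids, neighbor_sets):
    """Majority-vote classification accuracy over the given query points, where
    neighbor_sets[i] holds the neighbor indices returned for query_ids[i]
    (e.g. the full dimensional L2 top-k, or the natural query cluster)."""
    correct = 0
    for q, nbrs in zip(query_ids, neighbor_sets):
        votes = Counter(labels[j] for j in nbrs if j != q)
        correct += int(votes.most_common(1)[0][0] == labels[q])
    return correct / len(query_ids)

def l2_top_k(data, q, k=10):
    """Full dimensional L2 baseline: the k closest points to data[q]."""
    dists = np.linalg.norm(data - data[q], axis=1)
    return np.argsort(dists)[1:k + 1]          # skip the query point itself
```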

5. Conclusions and Summary

In this paper, we discussed a user-adaptive method for meaningful high dimensional similarity search. High dimensional data has always been a challenge to similarity search methods from the meaningfulness perspective because of its tendency to be distributed sparsely throughout the data space. Therefore, we propose an interactive method which presents a user with meaningful query-centered projections that can distinguish the nearest points to the query from the remaining points. The user is provided with a visual perspective which he can use in order to separate out these points. The choices made by the user over multiple iterations are used to quantify the meaningfulness of the discovered nearest neighbors. In cases when the nearest neighbors are truly not meaningful, our system is able to detect and report the poor behavior of the data. Thus, our system provides a unique level of understanding by either solving the high dimensional nearest neighbor problem or diagnosing its poor behavior from the user perspective.

References

[1] C. C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, J.-S. Park. Fast Algorithms for Projected Clustering. SIGMOD Conference, 1999.

[2] C. C. Aggarwal. Re-designing Distance Functions and Distance Based Applications for High Dimensional Data. SIGMOD Record, March 2001.

[3] C. C. Aggarwal, A. Hinneburg, D. A. Keim. On the Surprising Behavior of Distance Metrics in High Dimensional Space. ICDT Conference, 2001.

[4] C. C. Aggarwal, P. S. Yu. Finding Generalized Projected Clusters in High Dimensional Spaces. SIGMOD Conference, 2000.

[5] C. C. Aggarwal. A Human-Computer Cooperative System for Effective High Dimensional Clustering. KDD Conference, 2001.

[6] C. C. Aggarwal, P. S. Yu. The IGrid Index: Reversing the Dimensionality Curse for Similarity Indexing in High Dimensional Space. KDD Conference, 2000.

[7] R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. SIGMOD Conference, 1998.

[8] M. Ankerst, M. Ester, H.-P. Kriegel. Towards an Effective Cooperation of the User and the Computer for Classification. KDD Conference, 2000.

[9] S. Berchtold, D. A. Keim, H.-P. Kriegel. The X-Tree: An Index Structure for High-Dimensional Data. VLDB Conference, 1996.

[10] K. Beyer, R. Ramakrishnan, U. Shaft, J. Goldstein. When is Nearest Neighbor Meaningful? ICDT Conference, 1999.

[11] K. Chakrabarti, S. Mehrotra. Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces. VLDB Conference, 2000.

[12] M. Ester, H.-P. Kriegel, J. Sander, M. Wimmer, X. Xu. Density-Connected Sets and their Application for Trend Detection in Spatial Databases. KDD Conference, 1997.

[13] C. Faloutsos et al. Efficient and Effective Querying by Image Content. Journal of Intelligent Information Systems, Vol. 3, pp. 231-262, 1994.

[14] J. Han, L. Lakshmanan, R. Ng. Constraint-Based Multidimensional Data Mining. IEEE Computer, Vol. 32, No. 8, pp. 46-50, 1999.

[15] A. Hinneburg, C. C. Aggarwal, D. A. Keim. What is the Nearest Neighbor in High Dimensional Spaces? VLDB Conference, 2000.

[16] A. Hinneburg, D. A. Keim, M. Wawryniuk. HD-Eye: Visual Mining of High-Dimensional Data. IEEE Computer Graphics and Applications, 19(5), pp. 22-31, 1999.

[17] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.

[18] N. Katayama, S. Satoh. The SR-Tree: An Index Structure for High-Dimensional Nearest Neighbor Queries. SIGMOD Conference, 1997.

[19] N. Katayama, S. Satoh. Distinctiveness Sensitive Nearest Neighbor Search for Efficient Similarity Retrieval of Multimedia Information. ICDE Conference, 2001.

[20] D. A. Keim, H.-P. Kriegel, T. Seidl. Supporting Data Mining of Large Databases by Visual Feedback Queries. ICDE Conference, 1994.

[21] K.-I. Lin, H. V. Jagadish, C. Faloutsos. The TV-Tree: An Index Structure for High-Dimensional Data. VLDB Journal, Vol. 3, No. 4, pp. 517–542, 1992.

[22] Y. Rui, T. S. Huang, S. Mehrotra. Content-Based Image Retrieval with Relevance Feedback in MARS. IEEE Conference on Image Processing, 1997.

[23] G. Salton. The SMART Retrieval System - Experiments in Automatic Document Processing. Prentice Hall, 1971.

[24] S. Sarawagi. User-Adaptive Exploration of Multidimensional Data. VLDB Conference, 2000.

[25] T. Seidl, H.-P. Kriegel. Efficient User-Adaptable Similarity Search in Large Multimedia Databases. VLDB Conference, 1997.

[26] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986.

[27] R. Weber, H.-J. Schek, S. Blott. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. VLDB Conference, 1998.

[28] L. Wu, C. Faloutsos, K. Sycara, T. Payne. FALCON: Feedback Adaptive Loop for Content-Based Retrieval. VLDB Conference, 2000.