1967 - Nearest Neighbor Pattern Classification (IEEE Transactions on Information Theory, vol. IT-13, no. 1, January 1967)



…is less than twice the Bayes probability of error, and hence is less than twice the probability of error of any other decision rule, nonparametric or otherwise, based on the infinite sample set.

The first formulation of a rule of the nearest neighbor type, and the primary previous contribution to the analysis of its properties, appears to have been made by Fix and Hodges [1], [2]. They investigated a rule which might be called the k_n-nearest neighbor rule. It assigns to an unclassified point the class most heavily represented among its k_n nearest neighbors. Fix and Hodges established the consistency of this rule for sequences k_n → ∞ such that k_n/n → 0. In [2], they investigate numerically the small-sample performance of the k_n-NN rule under the assumption of normal statistics.

The NN rule has been used by Johns [3] as an example of an empirical Bayes rule. Kanal [4], Sebestyen [5] (who calls it the proximity algorithm), and Nilsson [6] have mentioned the intuitive appeal of the NN rule and suggested its use in the pattern recognition problem. Loftsgaarden and Quesenberry [7] have shown that a simple modification of the k_n-NN rule gives a consistent estimate of a probability density function. In the above-mentioned papers, no analytical results in the nonparametric case were obtained, either for the finite-sample-size problem or for the finite-number-of-nearest-neighbors problem.

In this paper we shall show that, for any number n of samples, the single-NN rule has strictly lower probability of error than any other k_n-NN rule against certain classes of distributions, and hence is admissible among the k_n-NN rules. We will then establish the extent to which samples which are close together have categories which are close together, and use this to compare, in Section VI, the probability of error of the NN rule with the minimum possible probability of error.

II. THE NEAREST NEIGHBOR RULE

A set of n pairs (x_1, θ_1), ..., (x_n, θ_n) is given, where the x_i take values in a metric space X upon which is defined a metric d, and the θ_i take values in the set {1, 2, ..., M}. Each θ_i is considered to be the index of the category to which the ith individual belongs, and each x_i is the outcome of the set of measurements made upon that individual. For brevity, we shall frequently say "x_i belongs to θ_i" when we mean precisely that the ith individual, upon which the measurements x_i have been observed, belongs to category θ_i.

A new pair (x, θ) is given, where only the measurement x is observable by the statistician, and it is desired to estimate θ by utilizing the information contained in the set of correctly classified points. We shall call

    x'_n ∈ {x_1, x_2, ..., x_n}

a nearest neighbor to x if

    min_i d(x_i, x) = d(x'_n, x),    i = 1, 2, ..., n.    (1)

The nearest neighbor rule decides that x belongs to the category θ'_n of its nearest neighbor x'_n. A mistake is made if θ'_n ≠ θ. Notice that the NN rule utilizes only the classification of the nearest neighbor. The n − 1 remaining classifications θ_i are ignored.
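As a minimal sketch in code (added for illustration, not part of the paper; Euclidean distance stands in for the general metric d, and ties go to the first minimizer):

    import numpy as np

    def nn_classify(x, samples, categories):
        """The 1-NN rule: decide the category theta'_n of the nearest
        neighbor x'_n among the n correctly classified samples."""
        distances = np.linalg.norm(samples - x, axis=1)  # d(x_i, x), i = 1, ..., n
        return categories[np.argmin(distances)]

    # Example: x = (0.2, 0.1) is nearest to (0, 0), so it is assigned category 1.
    samples = np.array([[0.0, 0.0], [3.0, 3.0]])
    categories = np.array([1, 2])
    assert nn_classify(np.array([0.2, 0.1]), samples, categories) == 1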

III. ADMISSIBILITY OF NEAREST NEIGHBOR RULE

If the number of samples is large, it makes good sense to use, instead of the single nearest neighbor, the majority vote of the nearest k neighbors. We wish k to be large in order to minimize the probability of a non-Bayes decision for the unclassified point x, but we wish k to be small (in proportion to the number of samples) in order that the points be close enough to x to give an accurate estimate of the posterior probabilities of the true category of x.

The purpose of this section is to show that, among the class of k-NN rules, the single nearest neighbor rule (1-NN) is admissible. That is, for the n-sample problem, there exists no k-NN rule, k ≠ 1, which has lower probability of error against all distributions. We shall show that the single NN rule is undominated by exhibiting a simple distribution for which it has strictly lower probability of error P_e. The example to be given comes from the family of distributions for which simple decision boundaries provide complete separation of the samples into their respective categories. Fortunately, one example will serve for all n.

Consider the two-category problem in which the prior probabilities η_1 = η_2 = 1/2, the conditional density f_1 is uniform on the unit disk D_1 centered at (−3, 0), and the conditional density f_2 is uniform on the unit disk D_2 centered at (3, 0), as shown in Fig. 1. In the n-sample problem, the probability that i individuals come from category 1, and hence have measurements lying in D_1, is (1/2)^n (n choose i). Without loss of generality, assume that the unclassified point x lies in category 1. Then the NN rule will make a classification error only if the nearest neighbor x'_n belongs to category 2 and thus, necessarily, lies in D_2. But, from inspection of the distance relationships, if the nearest neighbor to x is in D_2, then each of the x_i must lie in D_2. Thus the probability P_e(1; n) of error of the NN rule in this case is precisely (1/2)^n, the probability that x_1, x_2, ..., x_n all lie in D_2. Let k = 2k_0 + 1. Then the k-NN rule makes an error if k_0 or fewer points lie in D_1. This occurs with probability

    P_e(k; n) = (1/2)^n Σ_{i=0}^{k_0} (n choose i).

Thus, in this example, the 1-NN rule has strictly lower P_e than does any k-NN rule, k ≠ 1, and hence is admissible in that class.¹

¹ In case of ties for the nearest neighbor, the rule may be modified to decide the most popular category among the ties. However, in those cases in which ties occur with nonzero probability, our results are trivially true.
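The two error probabilities above are easy to tabulate (a quick numerical check, added for illustration and assuming the expressions just derived):

    from math import comb

    def p_err_knn(n, k0):
        """P_e(k; n) for k = 2*k0 + 1 in the two-disk example: the k-NN
        rule errs exactly when k0 or fewer of the n points fall in D_1."""
        return sum(comb(n, i) for i in range(k0 + 1)) / 2**n

    n = 10
    print(p_err_knn(n, 0))  # 1-NN: (1/2)^10 = 0.0009765625
    print(p_err_knn(n, 1))  # 3-NN: 11/1024  = 0.0107421875, strictly larger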


It remains to argue that the random variable x has this property with probability one. We shall do so by proving that the set N of points failing to have this property has probability measure zero. Accordingly, let N be the set of all x for which there exists some r_x sufficiently small that P(S_x(r_x)) = 0.

By the definition of the separability of X, there exists a countable dense subset A of X. For each x ∈ N there exists, by the denseness of A, an a_x in A for which a_x ∈ S_x(r_x/3). Thus, there exists a small sphere S_{a_x}(r_x/2) which is strictly contained in the original sphere S_x(r_x) and which contains x. Thus P(S_{a_x}(r_x/2)) = 0. Then the possibly uncountable set N is contained in the countable union (by the countability of A) of spheres ∪_{x∈N} S_{a_x}(r_x/2). Since N is contained in a countable union of sets of measure zero, P(N) = 0, as was to be shown.

VI. NEAREST NEIGHBOR PROBABILITY OF ERROR

Let x'_n ∈ {x_1, x_2, ..., x_n} be the nearest neighbor to x, and let θ'_n be the category to which the individual having measurement x'_n belongs. If θ is indeed the category of x, the NN rule incurs loss L(θ, θ'_n). If (x, θ), (x_1, θ_1), ..., (x_n, θ_n) are random variables, we define the n-sample NN risk R(n) by the expectation

    R(n) = E[L(θ, θ'_n)]    (10)

and the (large-sample) NN risk R by

    R = lim_{n→∞} R(n).    (11)

Throughout this discussion we shall assume that the pairs (x, θ), (x_1, θ_1), ..., (x_n, θ_n) are independent identically distributed random variables in X × Θ. Of course, except in trivial cases, there will be some dependence between the elements x_i, θ_i of each pair.

We shall first consider the M = 2 category problem with the probability-of-error criterion given by the 0-1 loss matrix

    L = ( 0  1 )
        ( 1  0 )    (12)

where L counts an error whenever a mistake in classification is made. The following theorem is the principal result of this discussion.

Theorem

Let X be a separable metric space. Let f_1 and f_2 be such that, with probability one, x is either 1) a continuity point of f_1 and f_2, or 2) a point of nonzero probability measure. Then the NN risk R (probability of error) has the bounds

    R* ≤ R ≤ 2R*(1 − R*).    (13)

These bounds are as tight as possible.

Remarks: In particular, the hypotheses of the theorem are satisfied for probability densities which consist of any

mixture of δ-functions and piecewise-continuous density functions on Euclidean d-space. Observe that 0 ≤ R* ≤ R ≤ 2R*(1 − R*) ≤ 1/2; so R* = 0 if and only if R = 0, and R* = 1/2 if and only if R = 1/2. Thus, in the extreme cases of complete certainty and complete uncertainty, the NN probability of error equals the Bayes probability of error. Conditions for equality of R and R* for other values of R will be developed in the proof.
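As a concrete reading of the bounds (a worked instance added for illustration, not in the original text): a problem with Bayes risk R* = 0.1 satisfies

    0.1 ≤ R ≤ 2(0.1)(1 − 0.1) = 0.18,

so the large-sample NN risk exceeds the optimum by at most 80 percent in this case.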

Proof: Let us condition on the random variables x and x'_n in the n-sample NN problem. The conditional NN risk r(x, x'_n) is then given, upon using the conditional independence of θ and θ'_n, by

    r(x, x'_n) = E[L(θ, θ'_n) | x, x'_n] = Pr{θ ≠ θ'_n | x, x'_n}
               = Pr{θ = 1 | x} Pr{θ'_n = 2 | x'_n} + Pr{θ = 2 | x} Pr{θ'_n = 1 | x'_n},

where the expectation is taken over θ and θ'_n. By the development of (4), the above may be written as

    r(x, x'_n) = η_1(x)η_2(x'_n) + η_2(x)η_1(x'_n).    (15)

We wish first to show that r(x, x'_n) converges to the random variable 2η_1(x)η_2(x) with probability one.

We have not required that f_1, f_2 be continuous at the points x of nonzero probability measure ν(x), because these points may be trivially taken into account as follows. Let ν(x_0) > 0; then

    Pr{x_0 ≠ x'_n} = (1 − ν(x_0))^n → 0.

Since x'_n, once equaling x_0, equals x_0 thereafter,

    r(x, x'_n) → 2η_1(x_0)η_2(x_0)

with probability one.

For the remaining points, the hypothesized continuity of f_1 and f_2 is needed. Here x is a continuity point of f_1 and f_2 with conditional probability one (conditioned on x such that ν(x) = 0). Then, since η_i is continuous in f_1 and f_2, x is a continuity point of η_i with probability one. By the lemma, x'_n converges to the random variable x with probability one. Hence, with probability one,

    η_i(x'_n) → η_i(x)

and, from (15), with probability one,

    r(x, x'_n) → r(x) = 2η_1(x)η_2(x),

where r(x) is the limit of the n-sample conditional NN risk. As shown in (6), the conditional Bayes risk is

    r*(x) = min {η_1(x), η_2(x)}
          = min {η_1(x), 1 − η_1(x)}.

Now, by the symmetry of r* in η_1, we may write

    r(x) = 2η_1(x)η_2(x) = 2η_1(x)(1 − η_1(x))
         = 2r*(x)(1 − r*(x)).
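Taking expectations over x then yields the upper bound of (13); a sketch of this step (with the interchange of limit and expectation assumed, as justified in the full proof):

    R = E[r(x)] = E[2r*(x)(1 − r*(x))]
      = 2R* − 2E[(r*(x))²]
      ≤ 2R* − 2(E[r*(x)])² = 2R*(1 − R*),

since E[(r*(x))²] ≥ (E[r*(x)])² = (R*)², with equality if and only if the conditional Bayes risk r*(x) has zero variance.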


The upper bound is attained for the no-information experiment f_1 = f_2 = ... = f_M, with η_1 = 1 − R* and η_i = R*/(M − 1), i = 2, ..., M. The lower bound R = R* is attained, for example, when η_i = 1/M, i = 1, 2, ..., M. Exhibiting corresponding terms, we have the conditional k-NN risk

    r_k = ρ_k(r*) = Σ_i (k choose i) (r*)^i (1 − r*)^{k−i}.

Now let β_k(r*) be defined to be the least concave function greater than ρ_k(r*). Then

    r_k = ρ_k(r*) ≤ β_k(r*),

and, by Jensen's inequality,

    R_k = E r_k = E ρ_k(r*) ≤ E β_k(r*) ≤ β_k(E r*) = β_k(R*).

So β_k(R*) is an upper bound on the large-sample k-NN risk R_k. It may further be shown, for any R*, that β_k(R*) is the least upper bound on R_k, by demonstrating simple statistics which achieve it. Hence we have the bounds


    R* ≤ R_k ≤ β_k(R*) ≤ β_{k−1}(R*) ≤ ··· ≤ β_1(R*) = 2R*(1 − R*),    (49)

where the upper and lower bounds on R_k are as tight as possible.
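These bounds are easy to probe empirically (a minimal simulation sketch added for illustration, not from the paper; the unit-variance Gaussian classes, sample size, and trial count are arbitrary choices):

    import numpy as np
    from math import erf, sqrt

    rng = np.random.default_rng(0)
    n, trials = 500, 4000         # training-set size and number of test draws
    mu1, mu2 = -1.0, 1.0          # class-conditional means, equal priors

    errors = 0
    for _ in range(trials):
        labels = rng.integers(0, 2, size=n)               # theta_i
        x = rng.normal(np.where(labels == 0, mu1, mu2))   # x_i drawn from f_1 or f_2
        theta = rng.integers(0, 2)                        # the unclassified point
        xq = rng.normal(mu1 if theta == 0 else mu2)
        errors += labels[np.argmin(np.abs(x - xq))] != theta  # 1-NN decision

    R_hat = errors / trials
    R_star = 0.5 * (1 + erf((-(mu2 - mu1) / 2) / sqrt(2)))   # Bayes risk: Phi(-1)
    print(R_star, R_hat, 2 * R_star * (1 - R_star))          # expect R* <= R_hat <= 2R*(1 - R*)

For these means the Bayes risk is about 0.159, so the estimated NN risk should land between 0.159 and the upper bound 0.267 once n is large.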

IX. CONCLUSIONS

The single NN rule has been shown to be admissible among the class of k_n-NN rules for the n-sample case for any n. It has been shown that the NN probability of error R, in the M-category classification problem, is bounded below by the Bayes probability of error R* and above by R*(2 − MR*/(M − 1)). Thus any other decision rule based on the infinite data set can cut the probability of error by at most one half. In this sense, half of the available information in an infinite collection of classified samples is contained in the nearest neighbor.

REFERENCES

[1] E. Fix and J. L. Hodges, Jr., "Discriminatory analysis, nonparametric discrimination," USAF School of Aviation Medicine, Randolph Field, Tex., Project 21-49-004, Rept. 4, Contract AF41(128)-31, February 1951.
[2] E. Fix and J. L. Hodges, Jr., "Discriminatory analysis: small sample performance," USAF School of Aviation Medicine, Randolph Field, Tex., Project 21-49-004, Rept. 11, August 1952.
[3] M. V. Johns, "An empirical Bayes approach to non-parametric two-way classification," in Studies in Item Analysis and Prediction, H. Solomon, Ed. Stanford, Calif.: Stanford University Press, 1961.
[4] L. N. Kanal, "Statistical methods for pattern classification," Philco Rept., 1963; originally appeared in T. Harley et al., "Semi-automatic imagery screening research study and experimental investigation," Philco Reports VO43-2 and VO43-3, vol. I, sec. 6, and Appendix H, prepared for U.S. Army Electronics Research and Development Lab. under Contract DA-36-039-SC-90742, March 29, 1963.
[5] G. Sebestyen, Decision-Making Processes in Pattern Recognition. New York: Macmillan, 1962, pp. 90-92.
[6] N. Nilsson, Learning Machines. New York: McGraw-Hill, 1965, pp. 120-121.
[7] D. O. Loftsgaarden and C. P. Quesenberry, "A nonparametric estimate of a multivariate density function," Annals of Mathematical Statistics, vol. 36, pp. 1049-1051, June 1965.

A Generalized Form of Price's Theorem and Its Converse

JOHN L. BROWN, JR., SENIOR MEMBER, IEEE

Abstract: The case of n unity-variance random variables x_1, x_2, ..., x_n governed by the joint probability density w(x_1, x_2, ..., x_n) is considered, where the density depends on the (normalized) cross-covariances ρ_ij = E[(x_i − x̄_i)(x_j − x̄_j)]. It is shown that the condition (*)

holds for an arbitrary function f(x_1, x_2, ..., x_n) of n variables if and only if the underlying density w(x_1, x_2, ..., x_n) is the usual n-dimensional Gaussian density for correlated random variables. This result establishes a generalized form of Price's theorem in which: 1) the relevant condition (*) subsumes Price's original condition; 2) the proof is accomplished without appeal to Laplace integral expansions; and 3) conditions referring to derivatives with respect to diagonal terms ρ_ii are avoided, so that the unity-variance assumption can be retained.
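For orientation, the Price-type condition referred to is conventionally written as follows (a reconstruction from the classical statement of Price's theorem, offered as an assumption rather than copied from this text):

    ∂E{f(x_1, ..., x_n)}/∂ρ_ij = E{∂²f(x_1, ..., x_n)/∂x_i ∂x_j},    i ≠ j.    (*)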

Manuscript received February 10, 1966; revised May 2, 1966. The author is with the Ordnance Research Laboratory, Pennsylvania State University, State College, Pa.

PRICE'S THEOREM and its various extensions [1]-[4] have had great utility in the determination of output correlations between zero-memory nonlinearities subjected to jointly Gaussian inputs. In its original form, the theorem considered n jointly normal random variables x_1, x_2, ..., x_n, with respective means x̄_1, x̄_2, ..., x̄_n and nth-order joint probability density

    p(x_1, x_2, ..., x_n) = (2π)^{−n/2} |M_n|^{−1/2} exp{ −(1/(2|M_n|)) Σ_r Σ_s M_{rs} (x_r − x̄_r)(x_s − x̄_s) },    (1)

where |M_n| is the determinant of M_n = [ρ_rs],

    ρ_rs = E[(x_r − x̄_r)(x_s − x̄_s)] = E[x_r x_s] − x̄_r x̄_s

is the correlation coefficient of x_r and x_s, and M_rs is the cofactor of ρ_rs in M_n.

From [1], the theorem statement is as follows: Let there be n zero-memory nonlinear devices specified by the input-output relationship f