New Algorithms for Efficient High-Dimensional Nonparametric Classification
Ting Liu, Andrew W. Moore, and Alexander Gray
Overview
- Introduction
  - k nearest neighbors (k-NN)
  - KNS1: conventional k-NN search
- New algorithms for k-NN classification
  - KNS2: for skewed-class data
  - KNS3: "are at least t of the k-NN positive?"
- Results
- Comments
Introduction: k-NN
- A nonparametric classification method: given a data set $V \subset R^D$ of n points, find the k points closest to a query $q \in R^D$ and predict the label of the majority among them.
- Computational cost is too high in many applications, especially in the high-dimensional case.
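As a point of reference, here is a minimal brute-force version of this classifier (a sketch of our own, not the paper's code; assumes NumPy). It spends O(nD) per query, which is exactly the cost the ball-tree algorithms below attack.

    import numpy as np

    def knn_classify(data, labels, q, k=9):
        """Majority label among the k points of `data` closest to q."""
        dists = np.linalg.norm(data - q, axis=1)   # distance to every point
        nearest = np.argsort(dists)[:k]            # indices of the k closest
        return np.bincount(labels[nearest]).argmax()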
Introduction: KNS1
- KNS1: conventional k-NN search with a ball-tree.
- Ball-tree (binary):
  - The root node represents the full set of points.
  - Each leaf node contains a small set of points; each non-leaf node has two children.
  - Pivot of a node: one of the points in the node, or the centroid of its points.
  - Radius of a node: the maximum distance from the pivot to any point in the node, $Radius = \max_{x \in Node} |Pivot - x|$.
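A compact sketch of such a ball-tree in Python (our illustration, with an assumed leaf size and a simple widest-dimension split, which the slides do not prescribe):

    import numpy as np

    class BallNode:
        def __init__(self, points, leaf_size=20):
            self.points = points
            self.pivot = points.mean(axis=0)   # centroid as the pivot
            self.radius = np.linalg.norm(points - self.pivot, axis=1).max()
            self.children = []
            if len(points) > leaf_size:
                # split on the dimension of greatest spread (one common heuristic)
                dim = np.ptp(points, axis=0).argmax()
                order = points[:, dim].argsort()
                half = len(points) // 2
                self.children = [BallNode(points[order[:half]], leaf_size),
                                 BallNode(points[order[half:]], leaf_size)]

Deeper trees cost more to build but give tighter radii, which is exactly the trade-off noted on the next slide.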
Introduction: KNS1
- Bound the distance from a query point q to any point x in a node via the triangle inequality: $|x - q| \ge \max(|Pivot - q| - Radius,\ 0)$ and $|x - q| \le |Pivot - q| + Radius$.
- Trade off the cost of construction against the tightness of the radii of the balls.
Introduction: KNS1
- Recursive procedure: $PS^{out} = BallKNN(PS^{in}, Node)$
  - $PS^{in}$ consists of the k-NN of q in V (the set of points searched so far).
  - $PS^{out}$ consists of the k-NN of q in $V \cup Points(Node)$.
- Definitions used for pruning:
  - $D^{sofar} = \max_{x \in PS^{in}} |x - q|$ if $|PS^{in}| = k$ (else $\infty$): the minimum distance at which a new point is still interesting.
  - $D_{minp}^{Node} = \max(|q - Node.Pivot| - Node.Radius,\ D_{minp}^{Node.parent})$: the minimum possible distance from any point in Node to q. If $D_{minp}^{Node} \ge D^{sofar}$, the node can be pruned.
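A runnable sketch of this recursion (our reconstruction, using the BallNode class sketched above; for simplicity it recomputes $D_{minp}$ from the current ball alone, dropping the parent refinement):

    import heapq
    import numpy as np

    def ball_knn(ps, node, q, k):
        """ps: max-heap of (-dist, point) holding the k-NN of q found so far."""
        d_sofar = -ps[0][0] if len(ps) == k else np.inf
        d_minp = max(np.linalg.norm(q - node.pivot) - node.radius, 0.0)
        if d_minp >= d_sofar:
            return ps                          # prune: no point in this ball can help
        if not node.children:                  # leaf: test its points directly
            for x in node.points:
                d = np.linalg.norm(q - x)
                if d < d_sofar:
                    heapq.heappush(ps, (-d, tuple(x)))
                    if len(ps) > k:
                        heapq.heappop(ps)      # evict the current farthest
                    d_sofar = -ps[0][0] if len(ps) == k else np.inf
            return ps
        # search the child closer to q first, so d_sofar shrinks early
        near, far = sorted(node.children, key=lambda c: np.linalg.norm(q - c.pivot))
        return ball_knn(ball_knn(ps, near, q, k), far, q, k)

A full search starts from the root with an empty set: ball_knn([], root, q, k).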
KNS2
- KNS2: for skewed-class data, where one class is much more frequent than the other.
- Goal: find the number of the k-NN that are in the positive class, without explicitly finding the k-NN set.
- Basic idea: build two ball-trees, Postree (small) and Negtree; a driver sketch follows this list.
  - Step 1 "Find positive": search Postree with KNS1 to find $Posset_k$, the set of the k nearest positive neighbors of q.
  - Step 2 "Insert negative": search Negtree, using $Posset_k$ as bounds to prune faraway nodes and to estimate the number of negative points to be inserted into the true nearest neighbor set.
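A high-level sketch of the two steps (our names, assuming ball_knn from the KNS1 sketch above and the neg_count function sketched after the "insert negative" slide below):

    def kns2_positives_in_knn(pos_root, neg_root, q, k):
        # Step 1 "find positive": k nearest positive neighbors via KNS1
        posset = ball_knn([], pos_root, q, k)
        dists = sorted(-negd for negd, _ in posset)    # Dist_1 <= ... <= Dist_k
        # Step 2 "insert negative": count negatives that displace positives
        C = neg_count([0] * len(dists), neg_root, q, dists, k)
        # the i-th positive neighbor stays in the k-NN set iff C_i + i <= k
        return sum(1 for i, c in enumerate(C) if c + i + 1 <= k)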
KNS2
- Definitions:
  - $Dists = \{Dist_1, \ldots, Dist_k\}$: the distances to the k nearest positive neighbors of q, sorted in increasing order.
  - V: the set of points in the negative balls visited so far.
  - (n, C): n is the number of positive points among the k-NN of q; $C = \{C_1, \ldots, C_n\}$, where $C_i$ is the number of negative points in V closer to q than the i-th positive neighbor.
  - $D_{minp}^{Node}$ and $D_{maxp}^{Node}$: the minimum and maximum possible distances from q to any point in Node.
KNS2
- Step 2 "insert negative" is implemented by the recursive function $(n^{out}, C^{out}) = NegCount(n^{in}, C^{in}, Node, j^{parent}, Dists)$:
  - $(n^{in}, C^{in})$ summarizes the interesting negative points in V.
  - $(n^{out}, C^{out})$ summarizes the interesting negative points in $V \cup Points(Node)$.
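A simplified sketch of this recursion (our reconstruction: it tracks only C, derives n from it, and omits the paper's $j^{parent}$ bookkeeping and the $D_{maxp}$ shortcut that counts a whole ball at once):

    import numpy as np

    def neg_count(C, node, q, dists, k):
        """C[i]: negatives found so far that are closer to q than the (i+1)-th
        positive neighbor; dists: increasing distances to positive neighbors."""
        n = sum(1 for i, c in enumerate(C) if c + i + 1 <= k)
        if n == 0:
            return C                           # all positives already displaced
        d_minp = max(np.linalg.norm(q - node.pivot) - node.radius, 0.0)
        if d_minp >= dists[n - 1]:
            return C                           # too far to affect the count
        if not node.children:
            for x in node.points:
                d = np.linalg.norm(q - x)
                for i in range(len(C)):
                    if d < dists[i]:
                        C[i] += 1              # closer than the (i+1)-th positive
            return C
        near, far = sorted(node.children, key=lambda c: np.linalg.norm(q - c.pivot))
        return neg_count(neg_count(C, near, q, dists, k), far, q, dists, k)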
KNS3
- KNS3 answers: "are at least t of the k nearest neighbors positive?"
- No constraint of skewness in the classes.
- Proposition: let $D_t^{pos}$ be the distance from q to its t-th nearest positive neighbor and $D_m^{neg}$ the distance to its m-th nearest negative neighbor, with $m + t = k + 1$. Then $D_t^{pos} \le D_m^{neg}$ if and only if at least t of the k nearest neighbors of q are from the positive class.
- Instead of computing these two values exactly, we compute lower and upper bounds on them, which is often enough to decide the comparison.
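A brute-force illustration of the proposition (a toy check of our own, not the paper's code; assumes at least t positive and m negative points, and no exact distance ties):

    import numpy as np

    def check_proposition(pos, neg, q, k, t):
        m = k + 1 - t
        d_pos = np.sort(np.linalg.norm(pos - q, axis=1))
        d_neg = np.sort(np.linalg.norm(neg - q, axis=1))
        lhs = d_pos[t - 1] <= d_neg[m - 1]                 # D_t^pos <= D_m^neg
        kth = np.sort(np.concatenate([d_pos, d_neg]))[k - 1]
        rhs = (d_pos <= kth).sum() >= t                    # >= t positives in k-NN
        return lhs == rhs                                  # True for generic data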
KNS3
- P is a set of balls from Postree; N consists of balls from Negtree.
- To compute $Lo(D_t^{pos})$, first sort the balls $u_1, u_2, \ldots, u_{|P|} \in P$ such that $i \le j$ implies $D_{minp}^{u_i} \le D_{minp}^{u_j}$.
- Then $Lo(D_t^{pos}) = D_{minp}^{u_j}$, where j is the smallest index such that $\sum_{i=1}^{j} |Points(u_i)| \ge t$ (equivalently, $\sum_{i=1}^{j-1} |Points(u_i)| < t$), and Points(u) denotes the points contained in ball u.
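In code, this lower bound is a short prefix-sum walk (our sketch; each ball u is represented by its $D_{minp}^{u}$ and its point count):

    def lower_bound_dt_pos(balls, t):
        """balls: list of (d_minp, num_points) pairs for the balls in P."""
        total = 0
        for d_minp, num_points in sorted(balls):   # increasing d_minp
            total += num_points
            if total >= t:                         # smallest j with sum >= t
                return d_minp                      # Lo(D_t^pos) = D_minp of u_j
        return float('inf')                        # fewer than t positive points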
Experimental results
- Real data; k = 9, t = ⌈k/2⌉.
- Randomly pick 1% of the negative records and 50% of the positive records as the test set (986 points).
- Train on the remaining 87,372 data points.
Comments
- Why k-NN? It serves as a baseline nonparametric method.
- No free lunch: for uniformly distributed high-dimensional data, the new algorithms give no benefit. The observed speedups therefore suggest that the intrinsic dimensionality of real data is much lower.