
OWA Aggregation Based CxK-Nearest Neighbor Classification Algorithm*

Gozde Ulutagay Department of Industrial Engineering

Izmir University Izmir, Turkey

[email protected]

Efendi Nasibov Department of Computer Sciences

Dokuz Eylul University Izmir, Turkey

[email protected]

Abstract— A new OWA (Ordered Weighted Averaging) distance based CxK-nearest neighbor (CxK-NN) algorithm is proposed. In this approach, the K nearest neighbors from each of the classes are taken into account, in contrast to the well-known K-nearest neighbor (K-NN) algorithm in which only the total number K of neighbors is considered. The distance between the point being classified and each K-nearest-neighbor set is determined by means of the OWA operator. After experiments with well-known classification datasets, we conclude that the average accuracy results of the OWA distance-based CxK-NN algorithm are better than those of the K-NN and weighted K-NN algorithms.

Keywords- OWA distance; classification; K nearest neighbor; CxK nearest neighbor

I. INTRODUCTION

One of the most widely used classification algorithms is the K-nearest neighbor (K-NN) algorithm, and in order to increase its effectiveness, a number of weighted versions of the K-NN algorithm have been proposed. For instance, in order to improve on the results of a single K-nearest neighbor prediction, Paik and Yang (2004) proposed combining various K-nearest neighbor classifiers obtained with different values of K and different subsets of covariates. This method, known as adaptive classification by mixing (ACM), is also suitable for working with weights. The difference is that, rather than assigning weights to the samples, a weighting scheme over all of the classifiers is calculated according to their classification probabilities.

For the pattern classification problem, Zuo et al. (2008) proposed a method called kernel difference-weighted K-NN (DFWKNN), which defines the weighted K-NN rule as a constrained optimization problem; they then provided a solution for obtaining the weights of the various nearest neighbors. Distance-weighted K-NN assigns different weights to the nearest neighbors according to their distance from the unclassified data. DFWKNN, on the other hand, assigns weights to the nearest neighbors by using not only the distances but also the correlation of the differences between the unclassified sample and its nearest neighbors.

Friedman (1994) proposed flexible metric nearest neighbor classification, in which flexible local weights, estimated by recursive partitioning techniques, are assigned to the covariates in order to account for their local relevance.

In order to overcome the distance-choice dependency problem of the fuzzy K-NN algorithm, Pham (2005) proposed a computational scheme for obtaining optimal weighting coefficients in terms of a statistical measure and combining these weights with the membership degrees used for classification by the fuzzy K-NN algorithm; the resulting method is called the optimally weighted fuzzy K-nearest neighbor algorithm.

In this study, the K-NN algorithm is considered from a different perspective: in the generation of the neighborhood set, the K nearest neighbors from each class are considered instead of only the overall nearest neighbors. Hence, in this approach a total of CxK neighbors are marked, whereas in the traditional K-NN algorithm a total of K neighbors are of concern. Then, by taking into account the OWA (Ordered Weighted Averaging) distance to the K nearest neighbors of each class, the distance of the new instance from each class is calculated. Eventually, the new instance is assigned to the nearest class. It is shown that such an approach is more generic than the traditional one and that, by changing the values of the OWA parameters, appropriate classification for different strategies is possible.

The rest of the paper is organized as follows. In Section 2, various linkage distance metrics used to measure the distance between clusters are discussed. In Section 3, some information about the OWA aggregation operator is given and the inter-cluster distance concept based on this operator is described. In Section 4, the CxK-NN algorithm, which is the main subject of this study, is presented. In Section 5, the OWA-based CxK-NN algorithm is compared with well-known methods on reputable datasets and the results are discussed. The conclusion is stated in the final section.

II. THE LINKAGE DISTANCES

A variety of inter-cluster distance approaches are used in hierarchical clustering algorithms, which operate either by a process of successive mergers or by a process of successive divisions. Agglomerative hierarchical methods start from the individual objects. At each step, the most similar objects are placed into the same clusters, and these initial groups are merged according to their similarities. Divisive hierarchical methods, on the other hand, work conversely: at each step a group of objects is divided into two subgroups such that the objects in one subgroup are far from the objects in the other. These subgroups are then further divided into dissimilar subgroups, and this process continues until there are as many subgroups as objects, i.e. until each object forms a group.

*This work is supported by TUBITAK Grant No. 111T273.

In this study, we are interested in the inter-cluster linkage approaches used in hierarchical clustering as a means of measuring the distance between the point being classified and the set of its nearest neighbors from each class.

A. Single-Linkage Distance

The inputs for the single-linkage algorithm could be either distances or similarities between pairs of objects. Groups are formed from the individual entities by merging nearest neighbors, where the term nearest neighbor connotes the smallest distance or the largest similarity. Let $Dist(A,B)$ be the distance between clusters $A$ and $B$, and let $y_i$ and $z_j$ be elements of clusters $A$ and $B$, respectively. Then the single-linkage method defines the inter-cluster distance as the distance between the nearest elements of the two clusters:

$$Dist(A,B) = \min_{y_i \in A,\; z_j \in B} d(y_i, z_j). \qquad (1)$$

The clusters formed by the single-linkage method will be unchanged by any assignment of distance or similarity that gives the same relative orderings as the initial distances.

B. Complete-Linkage Distance

The complete-linkage approach works similarly to single-linkage, except that the distance or similarity between clusters is determined by the two elements from different clusters that are most distant. Analogously to single-linkage, the complete-linkage method defines the inter-cluster distance as the distance between the elements of the two clusters that are farthest apart:

$$Dist(A,B) = \max_{y_i \in A,\; z_j \in B} d(y_i, z_j). \qquad (2)$$

Thus, complete-linkage ensures that all items in a cluster are within some maximum distance or minimum similarity of each other.

C. Average-Linkage Distance

Average-linkage treats the distance between two clusters as the average distance between all pairs of items where one member of a pair belongs to each cluster:

$$Dist(A,B) = \frac{1}{n_a n_b} \sum_{y_i \in A} \sum_{z_j \in B} d(y_i, z_j), \qquad (3)$$

where $n_a$ and $n_b$ are the numbers of elements in clusters $A$ and $B$, respectively.

For average-linkage clustering, changes in the assignment of distances or similarities can affect the arrangement of the final configuration of clusters, even though the changes preserve relative orderings. An illustration of these three distance measures is shown in Figure 1.

Figure 1. Illustration of the working principle of (a) single-linkage, (b) complete-linkage, (c) average-linkage methods.
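To make the three linkage definitions concrete, the following short Python sketch computes the single-, complete-, and average-linkage distances between two small point sets using the Euclidean point distance; the function and variable names are illustrative, not taken from the paper.

import numpy as np

def pairwise_distances(A, B):
    # Euclidean distances between every y_i in A and every z_j in B
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def single_linkage(A, B):
    return pairwise_distances(A, B).min()    # Eq. (1)

def complete_linkage(A, B):
    return pairwise_distances(A, B).max()    # Eq. (2)

def average_linkage(A, B):
    return pairwise_distances(A, B).mean()   # Eq. (3)

if __name__ == "__main__":
    A = [[0.0, 0.0], [1.0, 0.0]]
    B = [[2.0, 1.0], [3.0, 3.0]]
    print(single_linkage(A, B), complete_linkage(A, B), average_linkage(A, B))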

III. ORDERED WEIGHTED AVERAGING (OWA) DISTANCE

The ordered weighted averaging (OWA) method, as a way of providing aggregations lying between the max and min operators, is one of the best-known formulas for determining an average representative of a finite set of numbers (Yager, 1988). In essence, OWA is a weighted aggregation of the set of elements; however, the weights are assigned to ranking positions rather than to the elements themselves. This makes it possible to tune the aggregation strategy according to the pessimism or optimism degree of the decision-maker. Since its proposal, the OWA aggregation operator has been studied intensively and applied in numerous fields.

Let a set of real numbers $A = \{a_1, a_2, \ldots, a_n\}$ be given.

Definition 1. An OWA operator of dimension $n$ is a mapping $f: \mathbb{R}^n \to \mathbb{R}$ with weighting vector $w = (w_1, w_2, \ldots, w_n)$, determined as follows:

$$OWA_w(a_1, a_2, \ldots, a_n) = \sum_{i=1}^{n} w_i\, a_{(i)}, \qquad (4)$$

where $a_{(i)}$ is the $i$-th largest value among the elements $a_1, a_2, \ldots, a_n$, and the weighting vector $w = (w_1, w_2, \ldots, w_n)$ satisfies the following conditions:

i) $w_i \in [0,1]$, $i = 1, \ldots, n$;

ii) $\sum_{i=1}^{n} w_i = 1$.
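As a minimal illustration of Definition 1, the following Python sketch (the names are ours, not from the paper) sorts the elements in descending order and forms the weighted sum of Eq. (4), checking the two conditions on the weighting vector.

import numpy as np

def owa(values, weights):
    # OWA aggregation of Eq. (4): weights are applied to ranking positions, not elements
    w = np.asarray(weights, dtype=float)
    a = np.sort(np.asarray(values, dtype=float))[::-1]   # a_(1) >= a_(2) >= ... >= a_(n)
    if np.any(w < 0) or np.any(w > 1) or not np.isclose(w.sum(), 1.0):
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    return float(np.dot(w, a))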

Note that the weighting vector reflects the averaging strategy independently of the particular values of the averaged elements. The choice of the weighting vector specifies different aggregation strategies. By using various weighting vectors, the OWA operator transforms into widely used operators. These operators are given in Table 1.

Table 1. Various transformations of the OWA operator.

Weighting vector   Weighting components                                  Transforms into operator
$W^*$              $w_1 = 1$, $w_i = 0$ for $i \neq 1$                   Maximal element: $OWA_{W^*}(a_1,\ldots,a_n) = \max_i a_i$
$W_*$              $w_n = 1$, $w_i = 0$ for $i \neq n$                   Minimal element: $OWA_{W_*}(a_1,\ldots,a_n) = \min_i a_i$
$W_{Ave}$          $w_i = 1/n$, $i = 1,\ldots,n$                         Arithmetic mean: $OWA_{W_{Ave}}(a_1,\ldots,a_n) = \frac{1}{n}\sum_{j=1}^{n} a_j$
$W_{[k]}$          $w_k = 1$, $w_i = 0$ for $i \neq k$                   $k$-th maximal element: $OWA_{W_{[k]}}(a_1,\ldots,a_n) = a_{(k)}$
$W_H$              $w_1 = \lambda$, $w_n = 1-\lambda$, $\lambda \in [0,1]$, $w_i = 0$ otherwise   H-operator: $H_\lambda(a_1,\ldots,a_n) = \lambda \max_i a_i + (1-\lambda)\min_i a_i$
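A brief numerical check of the first three special cases in Table 1 (the sample values below are arbitrary and purely illustrative):

import numpy as np

a = np.array([3.0, 7.0, 1.0, 5.0])        # arbitrary sample values
sorted_desc = np.sort(a)[::-1]            # a_(1) >= ... >= a_(n)

w_max  = np.array([1.0, 0.0, 0.0, 0.0])   # W*    -> maximal element
w_min  = np.array([0.0, 0.0, 0.0, 1.0])   # W_*   -> minimal element
w_mean = np.full(4, 0.25)                 # W_Ave -> arithmetic mean

print(np.dot(w_max,  sorted_desc))   # 7.0 == a.max()
print(np.dot(w_min,  sorted_desc))   # 1.0 == a.min()
print(np.dot(w_mean, sorted_desc))   # 4.0 == a.mean()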

There are many approaches to determining the weights. In (Xu, 2005), a normal probability density function is used to generate the weights:

$$w_i^N = \frac{1}{\sqrt{2\pi}\,\sigma_n}\, e^{-\frac{(i-\mu_n)^2}{2\sigma_n^2}}, \quad i = 1, \ldots, n, \qquad (5)$$

where $\mu_n$ and $\sigma_n$ are the mean and the standard deviation, calculated as follows:

$$\mu_n = (1-\beta)\cdot 1 + \beta\cdot n, \qquad (6)$$

$$\sigma_n = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (i - \mu_n)^2}, \qquad (7)$$

where $\beta$ is a parameter which acts as a quantile, representing the location of the maximum weight. Note that if $\beta$ is taken as the 50th percentile ($\beta = 0.5$), the maximum weight corresponds to the median position and the weight distribution is symmetric, whereas the distribution becomes positively or negatively skewed for other values of $\beta$. In order to obtain the OWA weight distribution, the probability density values are discretized by using the following normalization:

$$w_i = \frac{w_i^N}{\sum_{j=1}^{n} w_j^N}, \quad i = 1, \ldots, n. \qquad (8)$$
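A small Python sketch of this weight-generation scheme under our reading of Eqs. (5)-(8); the exact parameterization of the mean in Eq. (6) is our reconstruction, so treat it as an assumption. The normal density is evaluated at the positions 1..n and normalized so the weights sum to one.

import numpy as np

def normal_owa_weights(n, beta=0.5):
    # Discretized normal-density OWA weights (sketch of Eqs. (5)-(8))
    i = np.arange(1, n + 1, dtype=float)
    mu = (1.0 - beta) * 1.0 + beta * n        # assumed form of Eq. (6)
    sigma = np.sqrt(np.mean((i - mu) ** 2))   # Eq. (7)
    w = np.exp(-((i - mu) ** 2) / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)  # Eq. (5)
    return w / w.sum()                        # normalization, Eq. (8)

# Example: 5 weights peaked at the median position (beta = 0.5)
print(normal_owa_weights(5, beta=0.5))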

Yager (1988) proposed two measures characterizing the weighting vector. The first one is the "orness", which measures the degree of agreement with the logical "or" and is given below:

$$orness(w) = \alpha(w) = \frac{1}{n-1}\sum_{i=1}^{n} (n-i)\, w_i, \qquad (9)$$

where $\alpha(w) \in [0,1]$ is the situation parameter. The closer $\alpha(w)$ is to zero, the closer the values generated by the OWA operator are to "min", which corresponds to closeness to the logical "and".

The second measure is the dispersion of the aggregation, or entropy, which reflects how completely the information in the aggregated values is used:

$$disp(w) = -\sum_{i=1}^{n} w_i \ln w_i. \qquad (10)$$

As seen above, one can calculate the entropy if the OWA weights are known. Sometimes it is of interest to solve the inverse of this problem, i.e. to calculate the weights when either the orness or the entropy value is given. For instance, Fuller and Majlender (2003) determined the associated weights that provide maximum entropy for a given orness value by transforming Yager's OWA equation into a polynomial equation with the help of Lagrange multipliers.
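The two measures in Eqs. (9) and (10) are straightforward to compute; the sketch below (helper names are ours) evaluates them for a given weighting vector.

import numpy as np

def orness(w):
    # Eq. (9): degree of agreement with the logical "or"
    w = np.asarray(w, dtype=float)
    n = w.size
    return float(np.sum((n - np.arange(1, n + 1)) * w) / (n - 1))

def dispersion(w):
    # Eq. (10): entropy of the weighting vector (0 * ln 0 taken as 0)
    w = np.asarray(w, dtype=float)
    nz = w[w > 0]
    return float(-np.sum(nz * np.log(nz)))

print(orness([1, 0, 0, 0]))      # 1.0 -> max operator
print(orness([0, 0, 0, 1]))      # 0.0 -> min operator
print(dispersion([0.25] * 4))    # ln(4), maximal entropy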

In the studies of Yager (2000) and Nasibov and Kandemir-Cavas (2011), the inter-cluster OWA distance is defined as follows.

Definition 2. The OWA distance between the sets $A$ and $B$ is

$$d_{OWA}(A,B) = OWA\{\, d(x,y) \mid x \in A,\; y \in B \,\} = \sum_{i=1}^{z} w_i\, d_{(i)}, \qquad (11)$$

where the $w_i$ are the weights of the OWA operator, given directly or calculated according to any distribution function, $z = |A| \cdot |B|$, and $d_{(i)}$ is the $i$-th maximal distance in the Cartesian product $A \times B$.
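Combining weight generation with OWA aggregation gives the set-to-set OWA distance of Eq. (11). The following sketch (illustrative names, Euclidean point distance assumed) sorts all pairwise distances in descending order and applies the weights.

import numpy as np

def owa_distance(A, B, weights=None):
    # OWA distance of Eq. (11) between two finite point sets A and B
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1).ravel()
    d_sorted = np.sort(d)[::-1]               # d_(1) >= d_(2) >= ...
    if weights is None:                       # default: uniform weights (average linkage)
        weights = np.full(d_sorted.size, 1.0 / d_sorted.size)
    return float(np.dot(np.asarray(weights, dtype=float), d_sorted))

Note that with all weight placed on the smallest distance this reduces to the single-linkage distance of Eq. (1), and with all weight on the largest distance to the complete-linkage distance of Eq. (2).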

IV. OWA DISTANCE-BASED CXK-NEAREST NEIGHBOR ALGORITHM

The well-known nonparametric K-nearest neighbor algorithm assigns a new, unknown data point to the class that has the largest number of elements in the nearest-neighborhood set.

One of the challenges of the K-NN algorithm is that it gives equal importance to each of the objects when assigning a class label to the input vector; such an assignment can reduce the accuracy of the algorithm if there is strong overlap among the data vectors. Hence, the K-nearest neighbor algorithm is a sub-optimal procedure. Nevertheless, it has been proven that the error rate of the 1-NN rule is no more than twice the optimal Bayes error rate, and that it asymptotically approaches the optimal rate as K increases with an infinite amount of data (Duda & Hart, 1973; Fukunaga & Hostetler, 1975; Cover & Hart, 1967).

As mentioned above, with the K-NN algorithm a point is assigned to the class in which it has the greatest number of neighbors. The difference of the CxK-nearest neighbor algorithm is that a point is assigned to the class whose K-nearest-points set is closest to the point being classified. The distance between the point being classified and each K-nearest-neighbors set is calculated as an OWA distance.

Assume the following representations:

- $X = \{x_1, x_2, \ldots, x_n\}$ is the set of $n$ labeled samples;

- $\{C_1, C_2, \ldots, C_p\}$ is the set of class labels of the samples;

- $x_{new}$ is the new sample to be classified;

- $C_j^K$ is the set of the $K$ nearest neighbors of the point $x_{new}$ within class $C_j$.

BEGIN
  Start with a learning set partitioned into classes $\{C_1, C_2, \ldots, C_p\}$;
  Clear the K-nearest-neighbor sets $C_j^K$, $j = 1, \ldots, p$;
  Find an unclassified input sample $x_{new}$;
  Set $K$, $1 \le K \le n$;
  FOR EACH CLASS $C_j \in \{C_1, C_2, \ldots, C_p\}$
    Set $i = 0$;
    FOR EACH $x \in C_j$
      IF ($i < K$) THEN
        Assign $x$ to the K-nearest-neighbors set $C_j^K$;
        $i = i + 1$;
      ELSE
        Calculate the distance between $x$ and $x_{new}$;
        IF ($x$ is closer to $x_{new}$ than the farthest sample in $C_j^K$) THEN
          Delete the farthest sample from the set $C_j^K$;
          Assign $x$ to the set $C_j^K$;
        END IF
      END IF
    END FOR
    Calculate the OWA distance $d_j$ between $x_{new}$ and the set $C_j^K$;
  END FOR
  Mark the class with minimum distance $d_j$ as $j^*$;
  Assign $x_{new}$ to $C_{j^*}$;
END

Figure 2. Pseudo-code of the OWA distance-based CxK-NN algorithm

Then the OWA distance between the point to be classified, $x_{new}$, and the $K$-nearest-neighbors set $C_j^K$ is calculated as follows:

$$d(x_{new}, C_j^K) = OWA\{\, d(x_{new}, x) \mid x \in C_j^K \,\}, \qquad (12)$$

where $d(x_{new}, x)$ is the distance between the points $x_{new}$ and $x$. In accordance with the above notation, the pseudo-code of the OWA distance-based CxK-NN classification algorithm is given in Figure 2.
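Putting the pieces together, the sketch below is one possible Python reading of Figure 2 and Eq. (12); the function and parameter names are ours, and it assumes each class has at least K training samples. For each class it keeps the K nearest training points to the query, aggregates their distances with the OWA weights, and predicts the class with the smallest aggregated distance.

import numpy as np

def owa_cxk_nn_predict(X_train, y_train, x_new, K, weights):
    # Predict the class of x_new with the OWA distance-based CxK-NN rule (sketch).
    # weights: OWA weighting vector of length K (e.g. from normal_owa_weights(K, beta)).
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    x_new = np.asarray(x_new, dtype=float)
    w = np.asarray(weights, dtype=float)

    best_class, best_dist = None, np.inf
    for c in np.unique(y_train):
        # Euclidean distances from x_new to every sample of class c
        d = np.linalg.norm(X_train[y_train == c] - x_new, axis=1)
        knn = np.sort(d)[:K]                          # the K nearest neighbors within class c
        d_c = float(np.dot(w, knn[::-1]))             # OWA aggregation, Eq. (12)
        if d_c < best_dist:
            best_class, best_dist = c, d_c
    return best_class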

V. EXPERIMENTAL RESULTS

We suggest that the proposed OWA distance-based CxK-nearest neighbor approach is more general than its alternatives: by changing the orness value, it is possible to reproduce the results of methods such as single-linkage, complete-linkage, and average-linkage based classification.

In this study, for each orness value the OWA weights with maximal entropy associated with that orness value are determined; by changing the orness value, different results are obtained. The results, given in Table 2, are compared with the single-linkage, complete-linkage, and average-linkage based classification results for the well-known Glass, Iris, and Wine datasets obtained from the UCI Machine Learning Repository.

The IRIS data set contains 150 objects and 4 attributes in 3 classes, where each class refers to a type of iris plant, i.e. Iris Setosa, Iris Versicolour, and Iris Virginica. The WINE data set contains 178 objects and 13 attributes, each of which represents a chemical property of wine; there are 3 classes in total. Finally, the GLASS data set contains 210 objects and 10 attributes in 7 classes, each of which represents a glass type.

Note that approximately two-thirds of the data in each data set are used as the learning set and the rest are used as the test set. First, a model is constructed on the learning set; then this model is applied to the test set and the classification accuracy of the results is measured. We use the following formula for the classification accuracy rate:

$$CA = \frac{\text{number of correctly classified data}}{\text{total number of data}}. \qquad (13)$$

In Figures 3-5, the classification accuracy results of the single-linkage, complete-linkage, and average-linkage approaches are shown for K = 1, ..., 20. It can be seen that nearly all results lie between those of the single-linkage and average-linkage approaches.

Note that the orness value corresponding to the single-linkage approach is 0, whereas the value corresponding to the complete-linkage approach is 1. Similarly, the orness value corresponding to the average-linkage approach is 0.5. On the other hand, if we only knew the result of some method in Figures 3-5, what would the corresponding orness value be? In order to answer such a question, the orness value which produces the closest or most similar results to the given curve should be found. The closeness or similarity can be measured by various similarity measures. In many studies the following sum-of-squared-error similarity measure is used:

$$f(q) = \sum_{i=1}^{K} \left( a_i(q) - b_i \right)^2, \qquad (14)$$

where $K$ is the maximal number of points used in the construction of the nearest-neighbor sets, $a_i(q)$ is the accuracy result corresponding to the orness value $q$ with $i$ neighbors, and $b_i$ is the accuracy result of the approach of interest. In order to find the most appropriate orness value $q$, Eq. (14) must be minimized with respect to $q$.
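A simple way to carry out this matching is a grid search over candidate orness values; the sketch below (hypothetical helper names and toy accuracy curves, purely illustrative) picks the orness value whose accuracy curve minimizes Eq. (14) against a given reference curve.

import numpy as np

def best_matching_orness(accuracy_by_orness, reference_curve):
    # Return the orness value whose accuracy curve is closest to the reference, Eq. (14).
    # accuracy_by_orness: dict mapping an orness value q to an array a(q) of accuracies for K = 1..Kmax
    # reference_curve:    array b of accuracies of the approach of interest, same length
    b = np.asarray(reference_curve, dtype=float)
    errors = {q: float(np.sum((np.asarray(a, dtype=float) - b) ** 2))
              for q, a in accuracy_by_orness.items()}
    return min(errors, key=errors.get)

# Toy example with made-up accuracy curves over K = 1..5
curves = {0.0: [0.90, 0.91, 0.92, 0.92, 0.93],
          0.5: [0.94, 0.95, 0.95, 0.96, 0.96],
          1.0: [0.88, 0.89, 0.90, 0.90, 0.91]}
print(best_matching_orness(curves, [0.93, 0.94, 0.95, 0.95, 0.96]))   # -> 0.5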

Figure 3. Average accuracy versus K for Iris data set

Table 2. Classification results of K-NN, weighted K-NN (WK-NN), and OWA distance-based CxK-NN with orness = 0 (CxK(0)) and orness = 0.5 (CxK(0.5)).

 K   GLASS                                 IRIS                                  WINE
     K-NN    WK-NN   CxK(0)  CxK(0.5)      K-NN    WK-NN   CxK(0)  CxK(0.5)      K-NN    WK-NN   CxK(0)  CxK(0.5)
 1   0.6485  0.6390  0.6719  0.6448        0.9577  0.9573  0.9357  0.9451        0.9547  0.9550  0.9472  0.9544
 2   0.6064  0.6307  0.6399  0.6514        0.9462  0.9381  0.9612  0.9496        0.9380  0.9249  0.9417  0.9569
 3   0.6284  0.6424  0.6706  0.6714        0.9628  0.9529  0.9617  0.9569        0.9432  0.9416  0.9640  0.9530
 4   0.6444  0.6337  0.6229  0.6364        0.9568  0.9404  0.9389  0.9486        0.9354  0.9225  0.9572  0.9543
 5   0.6487  0.6100  0.6205  0.6275        0.9671  0.9665  0.9474  0.9616        0.9304  0.9340  0.9466  0.9573
 6   0.5951  0.6235  0.6312  0.6218        0.9609  0.9512  0.9432  0.9536        0.9213  0.9254  0.9515  0.9534
 7   0.6025  0.6242  0.6344  0.6174        0.9529  0.9614  0.9540  0.9602        0.9412  0.9454  0.9477  0.9622
 8   0.6233  0.5946  0.6582  0.6338        0.9538  0.9666  0.9433  0.9552        0.9191  0.9206  0.9550  0.9511
 9   0.6192  0.5891  0.6500  0.6185        0.9411  0.9637  0.9484  0.9537        0.9462  0.9478  0.9436  0.9540
10   0.6044  0.5963  0.6237  0.6310        0.9609  0.9562  0.9506  0.9446        0.9306  0.9287  0.9557  0.9629
11   0.5804  0.5938  0.6370  0.6219        0.9517  0.9619  0.9525  0.9593        0.9412  0.9429  0.9574  0.9562
12   0.5992  0.5778  0.6460  0.6438        0.9607  0.9540  0.9522  0.9567        0.9431  0.9216  0.9556  0.9539
13   0.5998  0.5776  0.6457  0.6066        0.9540  0.9612  0.9472  0.9574        0.9262  0.9478  0.9545  0.9570
14   0.6007  0.5724  0.6569  0.5999        0.9618  0.9470  0.9486  0.9542        0.9403  0.9255  0.9456  0.9531
15   0.5733  0.5930  0.6477  0.5985        0.9385  0.9618  0.9568  0.9577        0.9393  0.9547  0.9569  0.9591
16   0.6117  0.6003  0.6497  0.5526        0.9328  0.9385  0.9400  0.9653        0.9395  0.9309  0.9644  0.9562
17   0.5923  0.5871  0.6531  0.5559        0.9321  0.9493  0.9539  0.9297        0.9408  0.9444  0.9706  0.9562
18   0.5894  0.5777  0.6168  0.5344        0.9361  0.9429  0.9445  0.9536        0.9469  0.9379  0.9614  0.9548
19   0.5655  0.5902  0.6396  0.5446        0.9399  0.9188  0.9459  0.9298        0.9382  0.9449  0.9501  0.9601
20   0.5842  0.5679  0.6441  0.5245        0.9538  0.9633  0.9529  0.8755        0.9265  0.9314  0.9500  0.9434

Figure 4. Average accuracy versus K for Wine data set

Figure 5. Average accuracy versus K for Glass data set

Page 6: [IEEE 2012 6th IEEE International Conference Intelligent Systems (IS) - Sofia, Bulgaria (2012.09.6-2012.09.8)] 2012 6th IEEE INTERNATIONAL CONFERENCE INTELLIGENT SYSTEMS - OWA aggregation

In our experiments, we observed that the OWA-based results encompass the other K-NN type results. The well-known GLASS, WINE, and IRIS data sets were classified using the K-NN, weighted K-NN, OWA distance-based CxK-NN with orness = 0, and OWA distance-based CxK-NN with orness = 0.5 algorithms. For every value of K between 1 and 20, each data set was randomly partitioned into learning and test sets at a ratio of three-sevenths, and classification was repeated 10 times. The average classification accuracy results are given in Table 2. As is seen from Table 2, the performance of the OWA distance-based CxK-NN algorithm is the best on the GLASS and WINE data sets, while approximately similar results are obtained for the IRIS data set. Furthermore, the results of the OWA distance-based CxK-NN algorithm with orness equal to zero are among the best.

VI. CONCLUSION

In this study, an OWA distance-based CxK-NN algorithm is proposed. It is shown that the proposed algorithm is more flexible than the traditional K-NN algorithm, and that by changing the orness value it gives better results than the K-NN and weighted K-NN algorithms.

We think that, due to its flexibility and adjustability, better results could also be obtained if the CxK-NN approach were applied to K-NN based algorithms used for other purposes.

ACKNOWLEDGMENT

This work has been supported by the Scientific and Technological Research Council of Turkey (TUBITAK) under Grant No. 111T273.

REFERENCES

[1] T.M. Cover and P.E. Hart, "Nearest neighbor pattern classification", IEEE Transactions on Information Theory, Vol. 13, pp. 21-27, 1967.

[2] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, New York: Wiley, 1973.

[3] J. Friedman, "Flexible metric nearest neighbor classification", Technical Report 113, Stanford University, Statistics Department, 1994.

[4] K. Fukunaga and L.D. Hostetler, "K-nearest-neighbor Bayes risk estimation", IEEE Transactions on Information Theory, Vol. 21, No. 3, pp. 285-293, 1975.

[5] R.A. Nasibova and E.N. Nasibov, "Linear aggregation with weighted ranking", Automatic Control and Computer Sciences, Vol. 44, No. 2, pp. 51-61, 2010.

[6] E.N. Nasibov and G. Ulutagay, "Comparative clustering analysis of bispectral index series of brain activity", Expert Systems with Applications, Vol. 37, No. 3, pp. 2495-2504, 2010.

[7] E.N. Nasibov and C. Kandemir-Cavas, "OWA-based linkage method in hierarchical clustering: Application on phylogenetic trees", Expert Systems with Applications, Vol. 38, No. 4, pp. 12684-12690, 2011.

[8] M. Paik and Y. Yang, "Combining nearest neighbor classifiers versus cross-validation selection", Statistical Applications in Genetics and Molecular Biology, Vol. 3, pp. 1-19, 2004.

[9] V. Vapnik, Statistical Learning Theory, New York: Wiley, 1998.

[10] Z. Xu, "An overview of methods for determining OWA weights", International Journal of Intelligent Systems, Vol. 20, pp. 843-865, 2005.

[11] R. Yager, "On ordered weighted averaging aggregation operators in multicriteria decision making", IEEE Transactions on Systems, Man and Cybernetics, Vol. 18, pp. 183-190, 1988.

[12] R. Yager, "Intelligent control of the hierarchical agglomerative clustering process", IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Vol. 30, No. 6, pp. 835-845, December 2000.

[13] W. Zuo, D. Zhang, and K. Wang, "On kernel difference-weighted k-nearest neighbor classification", Pattern Analysis and Applications, Vol. 11, No. 3, 2008.

[14] UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

[15] G. Ulutagay and E. Nasibov, "A new CxK-nearest neighbor linkage approach to the classification problem", 10th International FLINS Conference on Uncertainty Modeling in Knowledge Engineering and Decision Making, Istanbul, Turkey, August 2012.