
Supplemental Materials:

[Section #1]: An example of the calculation of $\Delta n_i^j$.

[Section #2]: The classification accuracy of ENN rule and KNN rule for each class under four different data models.

[Section #3]: The comparison of classification performance of ENN rule and KNN rule over 20 datasets from UCI Machine Learning Repository.


[Section #1]:

In this section, we give a detailed example of the calculation of $\Delta n_i^j$ for a two-class classification problem. We assume that there are four training samples in class 1, denoted as $\{X_1, X_2, X_3, X_4\}$, and four training samples in class 2, denoted as $\{Y_1, Y_2, Y_3, Y_4\}$. These eight samples are shown in Fig. 2 in our paper. We choose the parameter for the number of nearest neighbors as $k = 3$. Fig. S1 shows the detailed calculation procedure.

Fig. S1. A detailed example of calculating $\Delta n_i^j$. We iteratively assume the test sample $Z$ belongs to each possible class $j$, count the number of class-$i$ samples for which the number of class-$i$ members among their k-nearest neighbors changes (where $i, j = 1, 2$ in this example), and then calculate $\Delta n_i^j$.
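To make this counting concrete, below is a minimal Python sketch of the procedure illustrated in Fig. S1. The function and variable names are our own (not from the paper), Euclidean distance is assumed, and ties in distance are broken by index order.

```python
import numpy as np

def delta_n(X_train, y_train, z, k=3):
    # Sketch of the counting illustrated in Fig. S1 (illustrative only).
    # For each hypothesized class j of the test sample z, count how many
    # class-i training samples see a change in the number of class-i members
    # among their k nearest neighbors once z is added to the training set.
    classes = np.unique(y_train)
    X_aug = np.vstack([X_train, z])
    result = {}
    for j in classes:                              # assume z belongs to class j
        y_aug = np.append(y_train, j)
        for i in classes:                          # examine the class-i samples
            changed = 0
            for idx in np.where(y_train == i)[0]:
                x = X_train[idx]
                # k nearest neighbors of x before adding z (x itself excluded)
                d_old = np.linalg.norm(X_train - x, axis=1)
                d_old[idx] = np.inf
                n_old = np.sum(y_train[np.argsort(d_old)[:k]] == i)
                # k nearest neighbors of x after adding z
                d_new = np.linalg.norm(X_aug - x, axis=1)
                d_new[idx] = np.inf
                n_new = np.sum(y_aug[np.argsort(d_new)[:k]] == i)
                changed += int(n_new != n_old)
            result[(i, j)] = changed               # this count is Delta n_i^j
    return result
```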


[Section #2]: In this section, we examine four Gaussian data models for three classes whose parameters are shown below:

Class 1: $C_1$ with mean $\mu_1 = [8, 5, 5]$ and covariance $\sigma_1^2 I$; Class 2: $C_2$ with mean $\mu_2 = [5, 8, 5]$ and covariance $\sigma_2^2 I$; Class 3: $C_3$ with mean $\mu_3 = [5, 5, 8]$ and covariance $\sigma_3^2 I$.

Model 1: $\sigma_1^2 = 5$, $\sigma_2^2 = 5$, and $\sigma_3^2 = 5$.

Model 2: $\sigma_1^2 = 5$, $\sigma_2^2 = 20$, and $\sigma_3^2 = 5$.

Model 3: $\sigma_1^2 = 5$, $\sigma_2^2 = 5$, and $\sigma_3^2 = 20$.

Model 4: $\sigma_1^2 = 5$, $\sigma_2^2 = 20$, and $\sigma_3^2 = 20$.
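As an illustration of these models, the following sketch (not from the paper) draws synthetic samples from Model 2 and measures the per-class KNN error with scikit-learn; the sample size of 500 per class and the random seed are our own assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Sketch: generate data from Model 2 (mu_1 = [8,5,5], sigma_1^2 = 5;
# mu_2 = [5,8,5], sigma_2^2 = 20; mu_3 = [5,5,8], sigma_3^2 = 5) and
# measure the per-class KNN error. The sample size per class (500) is an
# assumption, not a value taken from the paper.
rng = np.random.default_rng(0)
mus = np.array([[8.0, 5.0, 5.0], [5.0, 8.0, 5.0], [5.0, 5.0, 8.0]])
variances = [5.0, 20.0, 5.0]
n_per_class = 500

def draw(n):
    X = np.vstack([rng.normal(mu, np.sqrt(v), size=(n, 3))
                   for mu, v in zip(mus, variances)])
    y = np.repeat([1, 2, 3], n)
    return X, y

X_train, y_train = draw(n_per_class)
X_test, y_test = draw(n_per_class)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
pred = knn.predict(X_test)
for c in (1, 2, 3):
    mask = y_test == c
    print(f"class {c} error: {100 * np.mean(pred[mask] != c):.1f}%")
```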

The classification performances of the ENN rule and the KNN rule for each class are presented in Table S1. The results demonstrate that the ENN rule performs better than the KNN rule when the class distributions have equal variances, as in Model 1, and that it alleviates the scale-sensitivity problem of the KNN rule, in which samples may be misclassified when their nearest neighbors are dominated by samples from other classes, as in Models 2, 3, and 4. In Table S1, both the ENN and KNN classifiers are also compared with the Maximum a Posteriori (MAP) rule, in which the parameters of the Gaussian distribution for each class are estimated from the given training data.

Table S1: The detailed classification error rate (in percent) of the ENN method for each class, compared with the classic KNN and MAP methods. The highlighted value denotes the better performance between the KNN and ENN methods.

Model 1: $\sigma_1^2 = 5$, $\sigma_2^2 = 5$, $\sigma_3^2 = 5$

| | Class 1 (KNN / ENN) | Class 2 (KNN / ENN) | Class 3 (KNN / ENN) |
|---|---|---|---|
| k = 1 | 33.1 / 34.1 | 38.3 / 35.5 | 33.7 / 34.2 |
| k = 3 | 34.6 / 33.1 | 36.6 / 30.2 | 30.3 / 29.5 |
| k = 5 | 34.9 / 31.6 | 35.2 / 28.6 | 28.7 / 26.5 |
| k = 7 | 32.8 / 30.8 | 32.7 / 28.1 | 26.5 / 25.4 |
| MAP | 28.6 | 25.8 | 24.5 |

Model 2: $\sigma_1^2 = 5$, $\sigma_2^2 = 20$, $\sigma_3^2 = 5$

| | Class 1 (KNN / ENN) | Class 2 (KNN / ENN) | Class 3 (KNN / ENN) |
|---|---|---|---|
| k = 1 | 36.4 / 36.8 | 36 / 34.1 | 34.2 / 35.1 |
| k = 3 | 32 / 31.9 | 39.3 / 34.4 | 31.4 / 30.5 |
| k = 5 | 31.2 / 29.7 | 40.5 / 33.7 | 28.6 / 26.7 |
| k = 7 | 28.5 / 28.3 | 40.8 / 33.6 | 25 / 24.3 |
| MAP | 22.7 | 36 | 19.3 |

Model 3: $\sigma_1^2 = 5$, $\sigma_2^2 = 5$, $\sigma_3^2 = 20$

| | Class 1 (KNN / ENN) | Class 2 (KNN / ENN) | Class 3 (KNN / ENN) |
|---|---|---|---|
| k = 1 | 36.5 / 36.2 | 33.1 / 34.6 | 37.5 / 35.9 |
| k = 3 | 33.2 / 31 | 27 / 26.8 | 38.8 / 33.7 |
| k = 5 | 30.3 / 27.3 | 24 / 23.2 | 40.2 / 33.5 |
| k = 7 | 26.7 / 25.1 | 20.8 / 20.8 | 40.6 / 33 |
| MAP | 19.1 | 16 | 34.6 |

Model 4: $\sigma_1^2 = 5$, $\sigma_2^2 = 20$, $\sigma_3^2 = 20$

| | Class 1 (KNN / ENN) | Class 2 (KNN / ENN) | Class 3 (KNN / ENN) |
|---|---|---|---|
| k = 1 | 32 / 34.5 | 51.1 / 49.5 | 49.8 / 49.2 |
| k = 3 | 26.5 / 26.4 | 50.1 / 45.9 | 51 / 47.1 |
| k = 5 | 24 / 23.3 | 50.3 / 35.2 | 51.7 / 44.6 |
| k = 7 | 19.9 / 21.2 | 48.2 / 42.7 | 48.6 / 43.9 |
| MAP | 12.4 | 39.6 | 44.8 |


[Section #3]: In this section, we further compare the classification performance of our proposed ENN rule with the KNN rule over 20 datasets from the UCI Machine Learning Repository. The average testing error rates and standard deviations (in percent) are given in Table S2. All results are averaged over 100 random runs: for each run, we randomly select half of the data as the training data and the remaining half as the test data. For each dataset, the best result is highlighted in bold with underline.
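As a sketch of this evaluation protocol (our own code, not from the paper), a single dataset-classifier configuration could be scored as follows, with scikit-learn's KNeighborsClassifier standing in for the classifier under test.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def average_error(X, y, k=3, runs=100, seed=0):
    # Evaluation protocol from the text: in each run, randomly split the data
    # into equal halves for training and testing, then report the mean and
    # standard deviation of the test error over all runs (in percent).
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(runs):
        perm = rng.permutation(len(y))
        half = len(y) // 2
        tr, te = perm[:half], perm[half:]
        clf = KNeighborsClassifier(n_neighbors=k).fit(X[tr], y[tr])
        errors.append(100 * np.mean(clf.predict(X[te]) != y[te]))
    return np.mean(errors), np.std(errors)
```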

In order to analyze whether the performance improvements are statistically significant, we further employ a t-test to compare our ENN results with those of KNN for each dataset. The test statistic is defined as

$$Z = \frac{\mu_1 - \mu_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

where $\mu_1$ and $\mu_2$ are the mean errors of ENN and KNN, respectively, $\sigma_1$ and $\sigma_2$ are the standard deviations of ENN and KNN, respectively, and $n_1 = n_2 = 100$ is the number of experiments. When $Z < -2.345$ ($p = 0.01$ for a one-tailed test), we conclude that the ENN performance improvement is significant at the specified p value. Otherwise, we cannot claim a significant improvement even though ENN shows a better average classification performance. We list the t-test statistic values in Table S3, with bold and underline highlighting the cases in which ENN significantly improves the classification performance compared with the KNN method (i.e., $Z < -2.345$).
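For concreteness, the following minimal sketch (the function name is our own) computes the statistic above and reproduces the Ionosphere, $k = 3$ entry of Table S3 from the corresponding Table S2 values.

```python
import numpy as np

def z_statistic(mu_enn, sd_enn, mu_knn, sd_knn, n1=100, n2=100):
    # Two-sample Z statistic from the formula above: means and standard
    # deviations of the error rates over n1 = n2 = 100 random runs.
    return (mu_enn - mu_knn) / np.sqrt(sd_enn**2 / n1 + sd_knn**2 / n2)

# Example with the Ionosphere entries of Table S2 at k = 3:
# ENN 17.35 +/- 2.69, KNN 18.55 +/- 2.94.
z = z_statistic(17.35, 2.69, 18.55, 2.94)
print(round(z, 2))                                  # -3.01, matching Table S3
print("significant" if z < -2.345 else "not significant")
```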


Table S2: Average testing error and standard deviation in percent, with k = 3, 5, 7 for both ENN and KNN. Each result is averaged over 100 random runs: in each run, we randomly select half of the data as the training data and the remaining half as the test data. For each dataset, we highlight the best result in bold with underline.

| Dataset | KNN, k = 3 | KNN, k = 5 | KNN, k = 7 | ENN, k = 3 | ENN, k = 5 | ENN, k = 7 |
|---|---|---|---|---|---|---|
| Ionosphere | 18.55±2.94 | 19.53±2.79 | 20.07±2.90 | 17.35±2.69 | 16.66±2.95 | 17.43±2.67 |
| Vowel | 11.73±1.80 | 21.57±2.31 | 27.08±2.37 | 8.50±1.92 | 14.21±2.01 | 19.32±2.42 |
| Sonar | 24.49±4.06 | 29.41±4.80 | 33.35±4.80 | 22.67±3.97 | 26.00±4.00 | 29.70±4.01 |
| Wine | 7.08±2.20 | 6.43±2.14 | 6.25±2.20 | 4.49±2.16 | 4.64±1.97 | 4.19±1.93 |
| Breast-cancer | 4.44±1.07 | 4.24±1.10 | 4.33±1.18 | 4.04±0.87 | 3.80±0.85 | 3.81±0.91 |
| Haberman | 32.13±5.79 | 31.29±5.61 | 31.19±7.12 | 31.32±6.53 | 29.80±7.32 | 29.06±7.53 |
| Breast tissue | 42.40±6.19 | 45.32±7.51 | 48.15±7.21 | 36.71±6.37 | 39.61±7.88 | 39.66±5.55 |
| Movement libras | 32.16±2.97 | 42.21±2.89 | 46.38±2.68 | 26.33±2.88 | 30.71±3.17 | 35.57±3.59 |
| Mammographic masses | 22.27±1.55 | 21.46±1.58 | 21.66±1.38 | 21.16±1.43 | 21.32±1.56 | 21.36±1.51 |
| Segmentation | 27.85±3.04 | 28.51±3.26 | 28.93±2.38 | 24.71±3.07 | 23.59±3.04 | 25.71±2.71 |
| ILPD | 40.91±3.68 | 40.36±3.74 | 39.78±3.67 | 40.04±3.58 | 39.08±3.53 | 38.65±3.36 |
| Pima Indians diabetes | 33.08±2.37 | 31.36±2.35 | 30.83±2.14 | 31.22±2.15 | 31.21±2.09 | 30.57±2.09 |
| Knowledge | 27.11±4.45 | 30.65±5.08 | 31.31±5.70 | 23.93±4.69 | 25.17±4.59 | 25.84±5.15 |
| Vertebral | 37.64±5.06 | 39.11±5.40 | 40.14±5.65 | 35.13±4.83 | 35.90±5.61 | 36.39±6.05 |
| Bank note | 0.12±0.23 | 0.26±0.39 | 0.43±0.44 | 0.09±0.18 | 0.13±0.28 | 0.21±0.35 |
| Magic | 20.42±0.36 | 19.78±0.33 | 19.49±0.31 | 20.10±0.33 | 19.14±0.36 | 18.72±0.34 |
| Pen digits | 0.94±0.17 | 1.16±0.17 | 1.39±0.20 | 0.74±0.15 | 0.78±0.13 | 0.88±0.14 |
| Faults | 1.65±0.86 | 2.58±0.95 | 3.33±0.89 | 0.91±0.52 | 1.51±0.85 | 2.28±0.91 |
| Letter | 7.44±0.25 | 8.19±0.28 | 8.52±0.29 | 5.60±0.25 | 5.77±0.23 | 6.03±0.23 |
| Spam | 11.52±0.63 | 10.68±0.72 | 11.12±0.68 | 10.08±0.59 | 9.91±0.72 | 9.56±0.56 |


Table S3: The detailed Z value of the significance testing. When $Z < -2.345$ ($p = 0.01$ for a one-tailed test), one can conclude that the ENN performance improvement compared to KNN is significant at the specified p value; such cases are highlighted in bold with underline in the table. From this table one can see that ENN shows a significant performance improvement in 50 out of 60 cases.

| Dataset | k = 3 | k = 5 | k = 7 |
|---|---|---|---|
| Ionosphere | -3.01 | -7.07 | -6.70 |
| Vowel | -12.27 | -24.04 | -22.91 |
| Sonar | -3.21 | -5.46 | -5.84 |
| Wine | -8.40 | -6.15 | -7.04 |
| Breast-cancer | -2.90 | -3.17 | -3.49 |
| Haberman | -0.93 | -1.62 | -2.06 |
| Breast tissue | -6.41 | -5.25 | -9.33 |
| Movement libras | -14.09 | -26.81 | -24.13 |
| Mammographic masses | -5.26 | -0.63 | -1.47 |
| Segmentation | -7.27 | -11.04 | -8.93 |
| ILPD | -1.69 | -2.49 | -2.27 |
| Pima Indians diabetes | -5.81 | -0.48 | -0.87 |
| Knowledge | -4.92 | -8.00 | -7.12 |
| Vertebral | -3.59 | -4.12 | -4.53 |
| Bank note | -1.03 | -2.71 | -3.91 |
| Magic | -6.55 | -13.10 | -16.74 |
| Pen digits | -8.82 | -17.76 | -20.89 |
| Faults | -7.36 | -8.39 | -8.25 |
| Letter | -52.04 | -66.79 | -67.27 |
| Spam | -16.68 | -7.56 | -17.71 |