Research Article
Clustering by Detecting Density Peaks and Assigning Points by Similarity-First Search Based on Weighted K-Nearest Neighbors Graph

Qi Diao,1 Yaping Dai,1 Qichao An,1 Weixing Li,1 Xiaoxue Feng,1 and Feng Pan1,2

1Beijing Institute of Technology, School of Automation, Beijing 100081, China
2Kunming-BIT Industry Technology Research Institute INC, Kunming 650106, China

Correspondence should be addressed to Feng Pan; panfeng@bit.edu.cn

Received 29 April 2020; Revised 14 June 2020; Accepted 18 June 2020; Published 12 August 2020

Guest Editor: Kailong Liu

Copyright © 2020 Qi Diao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper presents an improved clustering algorithm for categorizing data with arbitrary shapes. Most of the conventional clustering approaches work only with round-shaped clusters. This task can be accomplished by the clustering method of fast search and find of density peaks (DPC), but in some cases it is limited by its density peak selection and allocation strategy. To overcome these limitations, two improvements are proposed in this paper. To describe the clustering center more comprehensively, the definitions of local density and relative distance are fused with multiple distances, including K-nearest neighbors (KNN) and shared-nearest neighbors (SNN). A similarity-first search algorithm is designed to search the most matching cluster centers for noncenter points in a weighted KNN graph. Extensive comparison with several existing methods, e.g., the traditional DPC algorithm, density-based spatial clustering of applications with noise (DBSCAN), affinity propagation (AP), FKNN-DPC, and K-means, has been carried out. Experiments based on synthetic data and real data show that the proposed clustering algorithm can outperform DPC, DBSCAN, AP, and K-means in terms of the clustering accuracy (ACC), the adjusted mutual information (AMI), and the adjusted Rand index (ARI).

1. Introduction

The natural ecosystem has the characteristics of diversity, complexity, and intelligence, which provide infinite space for data-driven technology. As a new research focus, data-driven prediction methods have been widely used in energy, transportation, finance, and automobiles [1–7]. Clustering is an important branch of data-driven technology, which provides important information for further data analysis by mining the internal associations of the data [8, 9].

Due to the different definitions of clustering, different clustering strategies have been reported. Among them, the K-means algorithm is a simple and effective clustering algorithm. It preselects K initial clustering centers and then iteratively assigns each data point to the nearest clustering center [10]. Since the initial clustering centers have a certain impact on the clustering results of K-means, the works [11, 12] provided several methods for selecting the initial clustering centers and improving the accuracy of clustering. Since K-means and its variants are based on the idea that data points are assigned to the nearest clustering center, these methods cannot facilitate nonspherical clustering tasks well. Unlike the K-means algorithm, affinity propagation (AP) [8] has been developed based on the similarity between data points, and it completes clustering by exchanging information between them. Hence, the AP algorithm does not need to determine the number of clusters in advance, and it has a time advantage in completing the clustering task of large-scale datasets [13]. However, for complex datasets, the AP method may also suffer performance degradation, as does the K-means method [14].

To address the aforementioned problems, density-based clustering methods have been proposed, which can find clusters of various shapes and sizes in noisy data, where the high-density regions are considered as clusters and are separated by low-density regions [15–19]. In this line, density-based spatial clustering of applications with noise (DBSCAN) [15, 16] was proposed as an effective density-based clustering method. It needs to determine two parameters about the density of points (ε and MinPts) to achieve clustering of arbitrary shapes, where ε is the neighborhood radius and MinPts is the number of points contained in the neighborhood radius ε [15]. However, choosing a suitable threshold is a challenging task for these methods [15, 17]. Subsequently, Rodriguez and Laio [20] proposed a novel density-based clustering algorithm through fast search and find of density peaks (named DPC). The DPC algorithm uses the local density and the relative distance of each point to establish a decision graph, finds the cluster centers according to the decision graph, and then assigns each noncenter point to the cluster of its nearest higher-density neighbor. Although the DPC algorithm is simple and effective for detecting clusters of arbitrary shape, several issues limit its practical application. Firstly, DPC is sensitive to the cutoff distance dc, implying that the parameter dc must be set suitably to retain satisfactory performance, which is not a trivial task. Secondly, the clustering centers have to be selected manually, which may not be feasible and convenient for some datasets. Moreover, the allocation error of a high-density point directly affects the allocation of the low-density points around it, and the error may continue to propagate in the subsequent allocation process.

To overcome these issues, several improved DPC algorithms have recently been studied. To avoid the influence of the cutoff distance dc, the concept of K-nearest neighbors (KNN) has been introduced into the DPC algorithm, which led to two different density measures, e.g., DPC-KNN [19] and FKNN-DPC [9]. Although both algorithms are based on K-nearest neighbor information, they have been developed separately. Moreover, to solve the problem of manual selection of clustering centers, Li et al. [21] proposed a density peak clustering method to automatically determine the clustering centers. In this algorithm, the potential clustering centers are determined by the γ ranking graph, and then the true clustering centers are filtered out using the cutoff distance dc. To remedy the transmission of allocation errors, FKNN-DPC [9] and SNN [22] both adopted a two-step allocation strategy to allocate noncentral points. In the first step, they use breadth-first search to assign nonoutlier points. In the second step, FKNN-DPC uses the fuzzy weighted K-nearest neighbor technique to allocate the remaining points, while SNN determines the cluster of the remaining points based on whether the number of shared neighbors reaches a threshold.

This paper proposes an improved clustering algorithm based on density peaks (named DPC-SFSKNN). It has the following new features: (1) the local density and the relative distance are redefined, and the distance attributes of the two neighbor relationships (KNN and SNN) are fused; this method can detect low-density clustering centers. (2) A new allocation strategy is proposed: a similarity-first search algorithm based on a weighted KNN graph is designed to allocate noncenter points, which ensures that the allocation strategy is fault tolerant.

In general, this paper is organized as follows. Section 2 briefly introduces the DPC algorithm and its developments and analyzes the DPC algorithm in detail. Section 3 introduces the DPC-SFSKNN algorithm in detail and gives a detailed analysis. Section 4 tests the proposed algorithm on several synthetic and real-world datasets and compares its performance with DPC, DBSCAN, AP, FKNN-DPC, and K-means in terms of several very popular criteria for testing a clustering algorithm, namely, clustering accuracy (ACC), adjusted mutual information (AMI), and adjusted Rand index (ARI). Section 5 draws some conclusions.

2. Related Work

The density peak clustering algorithm (DPC) was proposed by Rodriguez and Laio in 2014 [20]. The core idea of the DPC algorithm lies in the characterization of the cluster center, which has the following two characteristics: the cluster center point has a higher local density and is surrounded by neighbor points with lower local density, and the cluster center point is relatively far from other denser data points. These characteristics of the cluster center are related to two quantities: the local density ρi of each point i and its relative distance δi, which represents the closest distance from the point to points of larger density.

2.1. DPC Algorithm and Improvements. Suppose X is a dataset for clustering and dij represents the Euclidean distance between data points i and j. The calculation of the local density and the relative distance depends on the distance dij. The DPC algorithm introduces two methods for calculating the local density: the "cutoff" kernel method and the Gaussian kernel method. For a data point i, its local density ρi is defined in (1) with the "cutoff" kernel method and in (2) with the Gaussian kernel method:

\[
\rho_i = \sum_{j} \chi\left(d_{ij} - d_c\right), \qquad
\chi(\varsigma) =
\begin{cases}
1, & \varsigma < 0,\\
0, & \varsigma \ge 0,
\end{cases}
\tag{1}
\]

\[
\rho_i = \sum_{j} \exp\left(-\frac{d_{ij}^{2}}{d_c^{2}}\right),
\tag{2}
\]

where dc is defined as a cutoff distance, which represents the neighborhood radius of the data point. The most significant difference between the two methods is that ρi calculated by the "cutoff" kernel is a discrete value, while ρi calculated by the Gaussian kernel is a continuous value. Therefore, the probability of a conflict (different data points corresponding to the same local density) is relatively smaller in the latter.

Moreover, dc is an adjustable parameter in (1) and (2), which is defined as

\[
d_c = d_{N \times 2\%},
\tag{3}
\]

where dc is chosen so that the average number of neighbors of each point is between 1% and 2% of all points [20]. N is the serial number of the last data point after all the distances dij are arranged in ascending order, and it is also the total number of points. The 2% in formula (3) is the empirical parameter provided in reference [20], which can be adjusted according to different datasets.

The relative distance δi represents the minimum distance between point i and any other point of higher density and is mathematically expressed as

\[
\delta_i =
\begin{cases}
\min_{j:\,\rho_j > \rho_i}\left(d_{ij}\right), & \rho_i < \max_k\left(\rho_k\right),\\[4pt]
\max_{j}\left(d_{ij}\right), & \rho_i = \max_k\left(\rho_k\right),
\end{cases}
\tag{4}
\]

where dij is the distance between points i and j. When the local density ρi is not the maximum density, the relative distance δi is defined as the minimum distance between point i and any other point of higher density; when ρi is the maximum density, δi takes the maximum distance to all other points.

After calculating the local density and relative distance of all data points, the DPC algorithm establishes a decision graph from the pairs (ρi, δi). The points with high values of both ρi and δi are called peaks, and the cluster centers are selected from the peaks. The DPC algorithm then directly assigns each remaining point to the same cluster as its nearest neighbor of higher density.

For the DPC algorithm, the selection of dc has a great influence on the correctness of the clustering results. Both the DPC-KNN and FKNN-DPC schemes introduce the concept of K-nearest neighbors to eliminate the influence of dc. Hence, two different local density calculations are provided.

The local density proposed by DPC-KNN [19] and FKNN-DPC [9] is given in (5) and (6), respectively:

\[
\rho_i = \exp\left(-\frac{1}{K} \sum_{j \in \mathrm{KNN}(i)} d_{ij}^{2}\right),
\tag{5}
\]

\[
\rho_i = \sum_{j \in \mathrm{KNN}(i)} \exp\left(-d_{ij}\right),
\tag{6}
\]

where K is the total number of nearest neighbors and KNN(i) represents the set of K-nearest neighbors of point i. These two methods provide a unified density metric for datasets of any size through the idea of K-nearest neighbors and solve the problem of the nonuniformity of DPC's density metric across different datasets.
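As a rough sketch of these two density measures (our own illustration; d is assumed to be a precomputed n×n distance matrix), equations (5) and (6) translate directly into NumPy:

```python
import numpy as np

def knn_densities(d, K):
    """Return (rho_dpc_knn, rho_fknn_dpc) for a pairwise distance matrix d."""
    knn_idx = np.argsort(d, axis=1)[:, 1:K + 1]              # KNN(i), self excluded
    knn_dist = np.take_along_axis(d, knn_idx, axis=1)        # shape (n, K)
    rho_dpc_knn = np.exp(-(knn_dist ** 2).mean(axis=1))      # equation (5)
    rho_fknn_dpc = np.exp(-knn_dist).sum(axis=1)             # equation (6)
    return rho_dpc_knn, rho_fknn_dpc
```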

Based on K-nearest neighbors, SNN-DPC proposes the concept of shared-nearest neighbors (SNN) [22], which is used to represent the local density ρi and the relative distance δi. The idea of SNN is that if there are more of the same neighbors among the K-nearest neighbors of two points, the similarity of the two points is higher, and the expression is given by

\[
\mathrm{SNN}(i, j) = \mathrm{KNN}(i) \cap \mathrm{KNN}(j).
\tag{7}
\]

Based on the SNN concept, the expression of the SNN similarity is as follows:

\[
\mathrm{Sim}_{ij} =
\begin{cases}
\dfrac{\lvert \mathrm{SNN}(i, j) \rvert^{2}}{\sum_{p \in \mathrm{SNN}(i, j)} \left(d_{ip} + d_{jp}\right)}, & \text{if } i, j \in \mathrm{SNN}(i, j),\\[8pt]
0, & \text{otherwise},
\end{cases}
\tag{8}
\]

where dip is the distance between points i and p, and djp is the distance between points j and p. The condition for calculating the SNN similarity is that points i and j appear in each other's K-nearest neighbor set; otherwise, the SNN similarity between the two points is 0.
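A small sketch of (7) and (8) (again our own illustration with an assumed distance matrix d; a production implementation would vectorize the double loop):

```python
import numpy as np

def snn_similarity(d, K):
    """SNN sets (eq. (7)) and SNN similarity (eq. (8)) from a distance matrix d."""
    n = d.shape[0]
    knn = [set(np.argsort(d[i])[1:K + 1]) for i in range(n)]   # KNN(i), self excluded
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # defined only if i and j appear in each other's K-nearest neighbor set
            if j in knn[i] and i in knn[j]:
                shared = knn[i] & knn[j]                        # SNN(i, j)
                if shared:
                    denom = sum(d[i, p] + d[j, p] for p in shared)
                    sim[i, j] = sim[j, i] = len(shared) ** 2 / denom
    return sim
```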

Next, the local density ρi of point i is expressed by the SNN similarity. Suppose point i is any point in the dataset X; then S(i) = {x1, x2, ..., xk} represents the set of k points with the highest similarity to point i. The expression of the local density is

\[
\rho_i = \sum_{j \in S(i)} \mathrm{Sim}(i, j).
\tag{9}
\]

At the same time, the equation for the relative distance δi of point i is as follows:

\[
\delta_i =
\begin{cases}
\min_{j:\,\rho_j > \rho_i}\left[d_{ij}\left(\sum_{p \in \mathrm{KNN}(i)} d_{ip} + \sum_{q \in \mathrm{KNN}(j)} d_{jq}\right)\right], & \rho_i < \max_k\left(\rho_k\right),\\[6pt]
\max_{j \in (X - \{i\})}\left(\delta_j\right), & \rho_i = \max_k\left(\rho_k\right).
\end{cases}
\tag{10}
\]

The SNN-DPC algorithm not only redefines the local density and the relative distance but also changes the data point allocation strategy. The allocation strategy divides the data points into two categories, "unavoidable subordinate points" and "probable subordinate points," and each type of data point has its own allocation algorithm. Compared with the DPC algorithm, this allocation strategy is better suited to clusters of different shapes.
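Under the same assumptions, the SNN-DPC quantities (9) and (10) can be sketched as follows, reusing the sim matrix from the previous snippet (our own illustration, with the size of S(i) taken equal to K):

```python
import numpy as np

def snn_dpc_rho_delta(d, sim, K):
    """Local density (eq. (9)) and relative distance (eq. (10)) of SNN-DPC."""
    n = d.shape[0]
    knn_idx = np.argsort(d, axis=1)[:, 1:K + 1]
    knn_sum = np.take_along_axis(d, knn_idx, axis=1).sum(axis=1)   # sum of KNN distances
    rho = np.sort(sim, axis=1)[:, -K:].sum(axis=1)                 # S(i): K most similar points
    delta = np.full(n, np.inf)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]                         # points denser than i
        if higher.size > 0:
            delta[i] = np.min(d[i, higher] * (knn_sum[i] + knn_sum[higher]))
    finite = np.isfinite(delta)
    delta[~finite] = delta[finite].max() if finite.any() else 0.0  # densest point(s)
    return rho, delta
```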

2.2. DPC Algorithm Analysis. DPC is a very simple and elegant clustering algorithm. However, due to its simplicity, DPC has the following two potential problems to be further addressed in practice.

2.2.1. DPC Ignores Low-Density Points. When the density difference between clusters is large, the performance of the DPC algorithm can be significantly degraded. To show this issue, we take the dataset Jain [23] as an example; the clustering results calculated using the "cutoff" kernel distance of DPC are shown in Figure 1. It can be seen that the cluster in the upper left is relatively sparsely distributed, while the cluster in the lower right is relatively tightly distributed. The red stars in the figure represent the cluster centers. Under this disparity in density, the clustering centers selected by DPC all fall on the tightly distributed cluster below, and none falls on the sparse cluster in the upper left corner. Due to the incorrect selection of the clustering centers, the subsequent allocations are also incorrect.

Analyzing the local density and the relative distance separately in Figures 2(a) and 2(b), it can be seen that the ρ value and the δ value of the false cluster center, point A, are much higher than those of the true cluster center C. The results of the Gaussian kernel distance calculation are the same, and the correct clustering center cannot be selected on the dataset Jain. Therefore, how to increase the ρ value and the δ value of a low-density center and make it stand out in the decision graph is a problem that needs to be considered.

2.2.2. The Allocation Strategy of DPC Has Low Fault Tolerance. The fault tolerance of the allocation strategy of the DPC algorithm is not satisfactory, mainly because the allocation of a point depends more on the allocation of higher-density points than on its own properties. Hence, if a high-density point is allocated incorrectly, the error directly affects the subsequent allocation of the lower-density points around it. Taking the Pathbased dataset [24] as an example, Figure 3 shows the clustering result calculated by the DPC algorithm using the "cutoff" kernel distance. It can be seen from the figure that the DPC algorithm can find suitable clustering centers, but the allocation results of most points are incorrect. The same is true of the results using the Gaussian kernel distance; the point assignments on the Pathbased dataset are similar to those of the "cutoff" kernel clustering. Therefore, the fault tolerance of the point allocation strategy should be further improved. Moreover, the points are greatly affected by other points during the allocation, which is also an issue to be further addressed.

3. Proposed Method

In this section, the DPC-SFSKNN algorithm is introduced in detail. The five main definitions of the algorithm are presented, the entire algorithm process is described, and the complexity of the DPC-SFSKNN algorithm is analyzed.

Figure 1: Results of the traditional DPC algorithm on the Jain dataset. (a) Clustering of Jain by DPC. (b) Ground truth.

Figure 2: ρ and δ values of the result of the traditional DPC algorithm on the Jain dataset. (a) Local density ρ. (b) Relative distance δ.

3.1. The Main Idea of DPC-SFSKNN. The DPC algorithm relies on the distances between points to calculate the local density and the relative distance, and it is also very sensitive to the choice of the cutoff distance dc. Hence, the DPC algorithm may not be able to correctly process some complex datasets. The probability that a point and its neighbors belong to the same cluster is high, and adding neighbor-related attributes to the clustering process can help to make a correct judgment. Therefore, we introduce the concept of shared-nearest neighbors (SNN) proposed in [22] when defining the local density and the relative distance. Its basic idea is that two points are considered to be more similar if they have more common neighbors, as stated above (see equation (7)).

Based on the above ideas, we define the average distance dsnn(i, j) of the shared-nearest neighbors between point i and point j and the similarity between the two points.

Definition 1 (average distance of SNN). For any points i and j in the dataset X, the shared-nearest neighbor set of the two points is SNN(i, j), and the average distance of SNN, dsnn(i, j), is expressed as

\[
d_{\mathrm{snn}}(i, j) = \frac{\sum_{p \in \mathrm{SNN}(i, j)} \left(d_{ip} + d_{jp}\right)}{2S},
\tag{11}
\]

where point p is any point of SNN(i, j) and S is the number of members in the set SNN(i, j). dsnn(i, j) shows the spatial relationship between point i and point j more comprehensively by taking into account the distances to the shared-nearest neighbor points.

Definition 2 (similarity). For any points i and j in the dataset X, the similarity Sim(i, j) between point i and j can be expressed as

\[
\mathrm{Sim}(i, j) = \frac{S}{K} \times 100\%,
\tag{12}
\]

where K is the number of nearest neighbors. K is selected from 4 to 40 until the optimal parameter appears; the lower bound is 4 because a smaller K may cause the algorithm to become endless, and for the upper bound it is found by experiments that a large K does not significantly affect the results of the algorithm. The similarity is defined according to the aforementioned basic idea, "if they have more common neighbors, the two points are considered to be more similar," and it is described by the ratio of the number of shared-nearest neighbors to the number of nearest neighbors.
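A minimal sketch of Definitions 1 and 2 (our own illustration; d is an assumed precomputed distance matrix and K the neighbor count):

```python
import numpy as np

def snn_avg_distance_and_similarity(d, K):
    """d_snn(i, j) of Definition 1 (eq. (11)) and Sim(i, j) of Definition 2 (eq. (12))."""
    n = d.shape[0]
    knn = [set(np.argsort(d[i])[1:K + 1]) for i in range(n)]
    dsnn = np.zeros((n, n))
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            shared = knn[i] & knn[j]                    # SNN(i, j)
            S = len(shared)
            if S > 0:
                dsnn[i, j] = dsnn[j, i] = sum(d[i, p] + d[j, p] for p in shared) / (2 * S)
            sim[i, j] = sim[j, i] = S / K * 100         # percentage of shared neighbors
    return dsnn, sim
```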

Definition 3 (K-nearest neighbor average distance). For any point i in the dataset X, its K-nearest neighbor set is KNN(i), and the K-nearest neighbor average distance dknn(i) is expressed as

\[
d_{\mathrm{knn}}(i) = \frac{\sum_{p \in \mathrm{KNN}(i)} d_{ip}}{K},
\tag{13}
\]

where point p is any point in KNN(i) and the number of nearest neighbors of any point is K. The K-nearest neighbor average distance describes the surrounding environment of a point to some extent. Next, we use it to describe the local density.

Definition 4 (local density). For any point i in the dataset X, the local density is expressed as

\[
\rho_i = \sum_{j \in \mathrm{KNN}(i)} \frac{S}{d_{\mathrm{knn}}(i) + d_{\mathrm{knn}}(j)},
\tag{14}
\]

where point j is a point in the set KNN(i), and dknn(i) and dknn(j) are the K-nearest neighbor average distances of point i and point j, respectively. In formula (14), the numerator (the number of shared-nearest neighbors S) represents the similarity between the two points, and the denominator (the sum of the average distances) describes the environment around them. When S is a constant, if the sum of the average distances (dknn(i) + dknn(j)) is small, the local density ρi of point i is large. Point j is one of the K-nearest neighbors of point i. When the values of dknn(i) and dknn(j) are small, it means that i and j are closely surrounded by their neighbors. If dknn(i) has a larger value (the neighbors of point i are far away from point i) or dknn(j) has a larger value (the neighbors of point j are far away from point j), the local density of point i becomes smaller. Therefore, only when the average distances of the two points are both small can the local density of point i be large. Moreover, when the sum of the average distances of the two points is constant, the local density is large if the number of shared-nearest neighbors of the two points is large. A large number of shared neighbors indicates that the two points have a high similarity and a high probability of belonging to the same cluster. The more high-similarity points there are around a point, the greater its local density and the greater the probability of it becoming a cluster center. This is beneficial to low-density clustering centers: a large number of shared neighbors can compensate for the loss caused by their large distances from other points, so that their local density is not only affected by distance. Next, we define the relative distance of the points.

Figure 3: Results of the traditional DPC algorithm on the Pathbased dataset.

Definition 5 (relative distance). For any point i in the dataset X, the relative distance can be expressed as

\[
\delta_i =
\begin{cases}
\min_{j:\,\rho_j > \rho_i}\left[d_{ij} + d_{\mathrm{knn}}(i) + d_{\mathrm{knn}}(j)\right], & \rho_i < \max_k\left(\rho_k\right),\\[4pt]
\max_{j \in (X - \{i\})}\left(\delta_j\right), & \rho_i = \max_k\left(\rho_k\right),
\end{cases}
\tag{15}
\]

where point j is one of the K-nearest neighbors of point i, dij is the distance between points i and j, and dknn(i) and dknn(j) are the K-nearest neighbor average distances of points i and j. We use the sum of the three distances to represent the relative distance. Compared with the DPC algorithm, which only uses dij to represent the relative distance, the new definition also incorporates the K-nearest neighbor average distances of the two points. The new definition can not only express the relative distance but is also friendlier to low-density cluster centers: under the condition of a constant dij, the average distance to the nearest neighbors of a low-density point is relatively large, so its relative distance also increases, which can increase the probability of low-density points being selected.
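The sketch below (our own illustration under the same assumptions as before, not the authors' code) puts Definitions 3–5 together: the K-nearest-neighbor average distance of (13), the local density of (14), and the relative distance of (15).

```python
import numpy as np

def dpc_sfsknn_rho_delta(d, K):
    """dknn (eq. (13)), local density (eq. (14)), and relative distance (eq. (15))."""
    n = d.shape[0]
    knn_idx = np.argsort(d, axis=1)[:, 1:K + 1]                   # KNN(i), self excluded
    dknn = np.take_along_axis(d, knn_idx, axis=1).mean(axis=1)    # equation (13)
    knn_sets = [set(row) for row in knn_idx]
    rho = np.zeros(n)
    for i in range(n):
        for j in knn_idx[i]:
            S = len(knn_sets[i] & knn_sets[j])                    # shared neighbors of i and j
            rho[i] += S / (dknn[i] + dknn[j])                     # equation (14)
    delta = np.full(n, np.inf)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        if higher.size > 0:
            delta[i] = np.min(d[i, higher] + dknn[i] + dknn[higher])  # equation (15)
    finite = np.isfinite(delta)
    delta[~finite] = delta[finite].max() if finite.any() else 0.0    # densest point(s)
    return dknn, rho, delta
```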

The DPC-SFSKNN clustering centers are selected in the same way as in the traditional DPC algorithm. The local density ρ and the relative distance δ are used to form a decision graph, and the n points with the largest local density and relative distance are selected as the clustering centers.

For DPC-SFSKNN, the sum of the distances from the points of a low-density cluster to their K-nearest neighbors may be large; thus, they receive a greater compensation for their δ value. Figures 4(a) and 4(b) show the results of DPC-SFSKNN on the Jain dataset [23]. Compared with Figure 2(b), the δ values of points in the upper branch are generally larger than those of the lower branch. This is because the density of the upper branch is significantly smaller and the distances from the points to their respective K-nearest neighbors are larger; thus, they receive a greater compensation. Even if the density is at a disadvantage, the higher δ value still makes the center of the upper branch distinguishable in the decision graph. This shows that the DPC-SFSKNN algorithm can correctly select low-density clustering centers.

3.2. Processes. The entire process of the algorithm is divided into two parts: the selection of clustering centers and the allocation of noncenter points. The main steps of our DPC-SFSKNN and a detailed introduction of the proposed allocation strategy are given in Algorithm 1.

Line 9 of the DPC-SFSKNN algorithm establishes a weighted K-nearest neighbor graph, and Line 11 is the K-nearest neighbor similarity-first search allocation strategy. To assign the noncenter points in the dataset, we designed a similarity-first search algorithm based on the weighted K-nearest neighbor graph. The algorithm uses the breadth-first search idea to find the cluster center with the highest similarity for a noncenter point. The similarities between a noncenter point and its K-nearest neighbors are sorted in ascending order, the neighbor point with the highest similarity is selected as the next visited node, and it is pushed into the path queue. If the highest-similarity point is not unique, the point with the smallest SNN average distance is selected as the next visited node. The visited node likewise sorts the similarities of its K-nearest neighbors and selects the next visited node. The search stops when the visited node is a cluster center point. Algorithm 2 describes the entire search process. Finally, each data point except the cluster centers is traversed to complete the assignment.

The similarity-first search algorithm is an optimization of breadth-first search according to the allocation requirements of noncenter points. Similarity is an important concept for clustering algorithms: points in the same cluster are similar to each other, and two points with a higher similarity have more of the same neighbors. Based on the above ideas, the definition of similarity is proposed in (12). In the search process, if only the similarity is used as the search criterion, it can easily happen that the highest-similarity point is not unique. Therefore, the algorithm chooses the average distance of the SNN as the second criterion, since a smaller dsnn means that the two points are closer in space.

The clustering results of the DPC-SFSKNN algorithm on the Pathbased dataset are shown in Figure 5. Figure 3 clearly shows that although the traditional DPC algorithm can find a cluster center on each of the three clusters, there is a serious bias in the allocation of the noncenter points. From Figure 5, we can see the effectiveness of the noncenter-point allocation algorithm of DPC-SFSKNN. The allocation strategy uses similarity-first search to ensure that the similarity along the search path is the highest, and the gradual search toward the cluster center avoids using points with low similarity as references. Besides, the similarity-first search allocation strategy based on the weighted K-nearest neighbor graph considers neighbor information: when the point with the highest similarity is not unique, the point with the smallest average distance of the shared neighbors is selected as the next visited point.

Figure 4: Result and δ value of the DPC-SFSKNN algorithm on the Jain dataset. (a) Clustering result. (b) δ values of the points.

Algorithm 1: DPC-SFSKNN.
Require: dataset X, parameter K
Ensure: clustering result C
(1) Data preprocessing: normalize the data.
(2) Calculate the Euclidean distance between the points.
(3) Calculate the K-nearest neighbors of each point i ∈ X.
(4) Calculate the average distance of the K-nearest neighbors of each point, dknn(i), according to (13).
(5) Calculate the local density ρi of each point i ∈ X according to (14).
(6) Calculate the relative distance δi of each point i ∈ X according to (15).
(7) Find the cluster centers by analyzing the decision graph composed of ρ and δ, and use the cluster centers as the set CC.
(8) Calculate the similarity between point i and its K-nearest neighbors according to (12).
(9) Connect each point in the dataset X with its K-nearest neighbors, and use the similarity as the connection weight to construct a weighted K-nearest neighbor graph.
(10) Calculate the average distance of SNN, dsnn(i, j), between point i and its shared-nearest neighbors according to (11).
(11) Apply Algorithm 2 to allocate the remaining points.

Algorithm 2: Similarity-first search allocation strategy.
Require: point w ∈ X, set of cluster centers CC, number of neighbors K, similarity matrix S(n×n) = [sim(i, j)](n×n), and SNN average distance matrix DSNN(n×n) = [dsnn(i, j)](n×n)
Ensure: point w ∈ CC
(1) Initialize the descending queue Q and the path queue P. The K-nearest neighbors of point w are sorted in the ascending order of similarity and pushed into Q. Push w into P.
(2) while the tail point of P ∉ CC do
(3)   if the highest-similarity point is unique then
(4)     Pop the point this at Q's tail.
(5)   else
(6)     Select the point this with the smallest DSNN.
(7)   end if
(8)   Empty the descending queue Q.
(9)   The K-nearest neighbors of this are sorted in the ascending order of similarity and pushed into Q.
(10)  Push this into P.
(11) end while
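The following Python sketch mirrors our reading of Algorithm 2 (it is not the authors' implementation; sim and dsnn are the matrices from the earlier sketches, centers maps a center index to its cluster label, and the visited-set guard against cycles is our own addition):

```python
import numpy as np

def sfs_assign(d, sim, dsnn, centers, K):
    """Similarity-first search: walk from each noncenter point toward a cluster
    center along the weighted KNN graph, always moving to the most similar neighbor."""
    n = d.shape[0]
    knn_idx = np.argsort(d, axis=1)[:, 1:K + 1]
    out = np.full(n, -1)
    for c, lab in centers.items():                     # centers: {point index: cluster id}
        out[c] = lab
    for w in range(n):
        if out[w] != -1:
            continue
        current, visited = w, {w}
        while current not in centers:
            candidates = [j for j in knn_idx[current] if j not in visited]
            if not candidates:                         # dead end: leave w unassigned
                break
            best = max(sim[current, j] for j in candidates)
            ties = [j for j in candidates if sim[current, j] == best]
            # similarity first; ties broken by the smallest SNN average distance
            nxt = min(ties, key=lambda j: dsnn[current, j])
            visited.add(nxt)
            current = nxt
        if current in centers:
            out[w] = centers[current]
    return out
```

In this sketch a walk either reaches a center (the point inherits its label) or hits a dead end and is left unassigned; the paper's pseudocode assumes the search always terminates at a center.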

3.3. Complexity Analysis. In this section, the complexities of the DPC-SFSKNN algorithm are analyzed, including the time complexity and the space complexity. Suppose the size of the dataset is n, the number of cluster centers is m, and the number of neighbors is k.

3.3.1. Time Complexity. The time complexity analysis of DPC-SFSKNN is as follows.

Normalization requires a processing complexity of approximately O(n); the complexities of calculating the Euclidean distance and the similarity between points are O(n²); the complexity of computing the K-nearest neighbor average distance dknn is O(n²); similarly, the complexity of computing the average distance dsnn between a point and its shared-nearest neighbors does not exceed O(n²); the calculation of the local density ρi and the distance δi of each point needs the KNN information of each point, with complexity O(kn), so the complexities of the local density ρ and the distance δ are O(kn²); the point allocation part is the search time of one point, and in the worst case a single search requires O(n), so for the n points in the dataset the total time does not exceed O(n²). In summary, the total approximate time complexity of DPC-SFSKNN is O(kn²).

The time complexity of the DPC algorithm depends on the following three aspects: (a) the time to calculate the distances between points, (b) the time to calculate the local density ρi for each point i, and (c) the time to calculate the distance δi for each point i. The time complexity of each part is O(n²), so the total approximate time complexity of DPC is O(n²).

The time complexity of the DPC-SFSKNN algorithm is k times higher than that of the traditional DPC algorithm. However, k is relatively small compared to n; therefore, it does not significantly affect the running time. In Section 4, it is demonstrated that the actual running time of DPC-SFSKNN does not exceed k times the running time of the traditional DPC algorithm.

3.3.2. Space Complexity. DPC-SFSKNN needs to calculate the distance and similarity between points, and this complexity is O(n²). Other data structures (such as the ρ and δ arrays and the various average distance arrays) are O(n). For the allocation strategy, in the worst case, the complexity is O(n²). The space complexity of DPC is O(n²), which is mainly due to the stored distance matrix. The space complexity of our DPC-SFSKNN is the same as that of traditional DPC, which is O(n²).

4. Experiments and Results

In this section, experiments are performed based on several public datasets commonly used to test the performance of clustering algorithms, including synthetic datasets [23–27] and real datasets [28–34]. In order to visually observe the clustering ability of DPC-SFSKNN, the DPC [20], DBSCAN [15], AP [8], FKNN-DPC [9], and K-means [10] methods are all tested for comparison. Three popular benchmarks are used to evaluate the performance of the above clustering algorithms, including the clustering accuracy (ACC), adjusted mutual information (AMI), and adjusted Rand index (ARI) [35]. The upper bounds of the three benchmarks are all 1, and the larger the benchmark value, the better the clustering effect. The codes for DPC, DBSCAN, and AP were provided based on the corresponding references.

Table 1 lists the synthetic datasets used in the experiments. These datasets were published in [23–27]. Table 2 lists the real datasets used in the experiments. These datasets include the real-world datasets from [29–34] and the Olivetti face dataset in [28].

To eliminate the influence of missing values and differences in dimension ranges, the datasets need to be preprocessed before proceeding to the experiments. We replace the missing values by the mean of all valid values of the same dimension and normalize the data using the min-max normalization method shown in the following equation:

\[
x_{ij}' = \frac{x_{ij} - \min\left(x_j\right)}{\max\left(x_j\right) - \min\left(x_j\right)},
\tag{16}
\]

where xij represents the original data located in the ith row and jth column, x′ij represents the rescaled value of xij, and xj represents the original data located in the jth column.

The min-max normalization method processes each dimension of the data and preserves the relationships between the original data values [36], therefore decreasing the influence of the difference in dimensions and increasing the efficiency of the calculation.
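For instance, the rescaling of (16) is a one-liner per column in NumPy (a sketch; the guard against constant columns is our own addition):

```python
import numpy as np

def min_max_normalize(X):
    """Rescale every column of X to [0, 1] as in equation (16)."""
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)  # avoid division by zero
    return (X - col_min) / span
```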

To fairly reflect the clustering results of the compared algorithms, the parameters in the algorithms are adjusted to ensure that their satisfactory clustering performance can be retained. For the DPC-SFSKNN algorithm, the parameter K needs to be specified in advance, and the initial clustering centers are manually selected based on a decision graph composed of the local density ρ and the relative distance δ. It can be seen from the experimental results in Tables 3 and 4 that the value of parameter K is around 6, and the value of parameter K for datasets with a dense sample distribution is more than 6. In addition to manually selecting the initial clustering centers, the traditional DPC algorithm also needs to determine dc. Based on the provided selection range, dc is selected so that the number of neighbors is between 1% and 2% of the total number of data points [20]. The two parameters that DBSCAN needs to determine are ε and MinPts, as in [15]; the optimal parameters are determined using a circular search method. The AP algorithm only needs to determine a preference, and the larger the preference, the more center points are allowed to be selected [8]. A general method for selecting this parameter is not effective, and only multiple experiments can be performed to select the optimal value. The only parameter of K-means is the number of clusters; the true number of clusters in the dataset is used in this case. Similarly, FKNN-DPC needs to determine the number of nearest neighbors K.

Figure 5: Results of the DPC-SFSKNN algorithm on the Pathbased dataset.

Table 1: Synthetic datasets.

Dataset | Records | Attributes | Clusters | Source
Pathbased | 300 | 2 | 3 | [24]
Jain | 373 | 2 | 2 | [23]
Flame | 240 | 2 | 2 | [25]
Aggregation | 788 | 2 | 7 | [26]
DIM512 | 1024 | 512 | 16 | [27]
DIM1024 | 1024 | 1024 | 16 | [27]

Table 2: Real-world datasets.

Dataset | Records | Attributes | Clusters | Source
Iris | 150 | 4 | 3 | [29]
Libras movement | 360 | 90 | 15 | [31]
Wine | 178 | 13 | 3 | [29]
Parkinsons | 197 | 23 | 2 | [29]
WDBC | 569 | 30 | 2 | [34]
Pima-Indians-diabetes | 768 | 8 | 2 | [29]
Segmentation | 2310 | 19 | 7 | [29]
Dermatology | 366 | 33 | 6 | [29]
Seeds | 210 | 7 | 3 | [30]
Ionosphere | 351 | 34 | 2 | [33]
Waveform | 5000 | 21 | 3 | [32]
Waveform (noise) | 5000 | 40 | 3 | [32]
Olivetti face | 400 | 92×112 | 40 | [28]

Table 3: The comparison of AMI, ARI, ACC, and ECAC benchmarks for 6 clustering algorithms on synthetic datasets.

Pathbased | Jain
Algorithm | AMI | ARI | ACC | ECAC | Par | AMI | ARI | ACC | ECAC | Par
DPC-SFSKNN | 0.926 | 0.910 | 0.925 | 3/3 | 6 | 1.000 | 1.000 | 1.000 | 2/2 | 7
DPC | 0.521 | 0.463 | 0.742 | 3/3 | 2 | 0.609 | 0.713 | 0.853 | 2/2 | 3
DBSCAN | 0.781 | 0.522 | 0.667 | — | 0056 | 0.883 | 0.985 | 0.918 | — | 00810
AP | 0.679 | 0.475 | 0.783 | 3/3 | 10 | 0.681 | 0.812 | 0.882 | 2/2 | 40
FKNN-DPC | 0.941 | 0.960 | 0.987 | 3/3 | 5 | 0.056 | 0.132 | 0.793 | — | 10
K-means | 0.568 | 0.461 | 0.772 | — | 3 | 0.492 | 0.577 | 0.712 | — | 2

Aggregation | Flame
Algorithm | AMI | ARI | ACC | ECAC | Par | AMI | ARI | ACC | ECAC | Par
DPC-SFSKNN | 0.942 | 0.951 | 0.963 | 7/7 | 6 | 0.873 | 0.934 | 0.956 | 2/2 | 6
DPC | 1.000 | 1.000 | 1.000 | 7/7 | 4 | 1.000 | 1.000 | 1.000 | 2/2 | 5
DBSCAN | 0.969 | 0.982 | 0.988 | — | 0058 | 0.867 | 0.936 | 0.981 | — | 0098
AP | 0.795 | 0.753 | 0.841 | 7/7 | 77 | 0.452 | 0.534 | 0.876 | 3/3 | 35
FKNN-DPC | 0.995 | 0.997 | 0.999 | 3/3 | 8 | 1.000 | 1.000 | 1.000 | 2/2 | 5
K-means | 0.784 | 0.717 | 0.786 | — | 7 | 0.418 | 0.465 | 0.828 | — | 2

DIM512 | DIM1024
Algorithm | AMI | ARI | ACC | ECAC | Par | AMI | ARI | ACC | ECAC | Par
DPC-SFSKNN | 1.000 | 1.000 | 1.000 | 16/16 | 8 | 1.000 | 1.000 | 1.000 | 16/16 | 9
DPC | 1.000 | 1.000 | 1.000 | 16/16 | 2 | 1.000 | 1.000 | 1.000 | 16/16 | 0.01
DBSCAN | 1.000 | 1.000 | 1.000 | — | 037 | 1.000 | 1.000 | 1.000 | — | 108
AP | 1.000 | 1.000 | 1.000 | 16/16 | 20 | 1.000 | 1.000 | 1.000 | 16/16 | 30
FKNN-DPC | 1.000 | 1.000 | 1.000 | 16/16 | 8 | 1.000 | 1.000 | 1.000 | 16/16 | 10
K-means | 0.895 | 0.811 | 0.850 | — | 1 | 0.868 | 0.752 | 0.796 | — | 16

4.1. Analysis of the Experimental Results on Synthetic Datasets. In this section, the performance of DPC-SFSKNN, DPC [20], DBSCAN [15], AP [8], FKNN-DPC [9], and K-means [10] is tested with the six synthetic datasets given in Table 1. These synthetic datasets differ in distribution and quantity, so different data situations can be simulated to compare the performance of the six algorithms. Table 3 shows the AMI, ARI, ACC, and ECAC of the clustering algorithms on the six synthetic datasets, where the best results are shown in bold and "—" means no value. Figures 6–9 show the clustering results of DPC-SFSKNN, DPC, DBSCAN, AP, FKNN-DPC, and K-means on the Pathbased, Flame, Aggregation, and Jain datasets, respectively. The algorithms all achieve the optimal clustering on the DIM512 and DIM1024 datasets, so the clustering of these two datasets is not shown. Since the cluster centers of DBSCAN are relatively random, only the positions of the clustering centers of the other three algorithms are marked.

Figure 6 shows the results on the Pathbased dataset. DPC-SFSKNN and FKNN-DPC can complete the clustering of the Pathbased dataset correctly. From Figures 6(b), 6(d), and 6(f), it can be seen that the clustering results of DPC, AP, and K-means are similar. The clustering centers selected by DPC, AP, DPC-SFSKNN, and FKNN-DPC are highly similar, but the clustering results of DPC and AP are not satisfactory. For the DPC algorithm, the low fault tolerance of its allocation strategy is the cause of this result.

Table 4: Comparison of AMI, ARI, ACC, and ECAC benchmarks for 6 clustering algorithms on real-world datasets.

Iris | Libras movement
Algorithm | AMI | ARI | ACC | ECAC | Par | AMI | ARI | ACC | ECAC | Par
DPC-SFSKNN | 0.896 | 0.901 | 0.962 | 3/3 | 6 | 0.547 | 0.368 | 0.510 | 10/15 | 8
DPC | 0.812 | 0.827 | 0.926 | 3/3 | 2 | 0.535 | 0.304 | 0.438 | 9/15 | 0.5
DBSCAN | 0.792 | 0.754 | 0.893 | — | 0149 | 0.412 | 0.183 | 0.385 | — | 0965
AP | 0.764 | 0.775 | 0.911 | 3/3 | 6 | 0.364 | 0.267 | 0.453 | 10/15 | 25
FKNN-DPC | 0.912 | 0.922 | 0.973 | 3/3 | 7 | 0.508 | 0.308 | 0.436 | 10/15 | 9
K-means | 0.683 | 0.662 | 0.823 | — | 3 | 0.522 | 0.306 | 0.449 | — | 15

Wine | Parkinsons
Algorithm | AMI | ARI | ACC | ECAC | Par | AMI | ARI | ACC | ECAC | Par
DPC-SFSKNN | 0.843 | 0.851 | 0.951 | 3/3 | 6 | 0.193 | 0.380 | 0.827 | 2/2 | 6
DPC | 0.706 | 0.672 | 0.882 | 3/3 | 2 | 0.210 | 0.114 | 0.612 | 2/2 | 5
DBSCAN | 0.612 | 0.643 | 0.856 | — | 04210 | 0.205 | 0.213 | 0.674 | — | 046
AP | 0.592 | 0.544 | 0.781 | 3/3 | 6 | 0.142 | 0.127 | 0.669 | 2/2 | 15
FKNN-DPC | 0.831 | 0.852 | 0.949 | 3/3 | 7 | 0.273 | 0.391 | 0.851 | 2/2 | 5
K-means | 0.817 | 0.838 | 0.936 | — | 3 | 0.201 | 0.049 | 0.625 | — | 2

WDBC | Ionosphere
Algorithm | AMI | ARI | ACC | ECAC | Par | AMI | ARI | ACC | ECAC | Par
DPC-SFSKNN | 0.432 | 0.516 | 0.857 | 2/2 | 6 | 0.361 | 0.428 | 0.786 | 3/2 | 7
DPC | 0.002 | −0.004 | 0.602 | 2/2 | 9 | 0.238 | 0.276 | 0.681 | 3/2 | 0.65
DBSCAN | 0.397 | 0.538 | 0.862 | — | 0277 | 0.544 | 0.683 | 0.853 | — | 027
AP | 0.598 | 0.461 | 0.854 | 2/2 | 40 | 0.132 | 0.168 | 0.706 | 2/2 | 15
FKNN-DPC | 0.679 | 0.786 | 0.944 | 2/2 | 7 | 0.284 | 0.355 | 0.752 | 2/2 | 8
K-means | 0.611 | 0.730 | 0.928 | — | 2 | 0.129 | 0.178 | 0.712 | — | 2

Segmentation | Pima-Indians-diabetes
Algorithm | AMI | ARI | ACC | ECAC | Par | AMI | ARI | ACC | ECAC | Par
DPC-SFSKNN | 0.665 | 0.562 | 0.746 | 6/7 | 6 | 0.037 | 0.083 | 0.652 | 2/2 | 6
DPC | 0.650 | 0.550 | 0.684 | 6/7 | 3 | 0.033 | 0.075 | 0.647 | 2/2 | 4
DBSCAN | 0.446 | 0.451 | 0.550 | — | 02510 | 0.028 | 0.041 | 0.577 | — | 0156
AP | 0.405 | 0.436 | 0.554 | 7/7 | 25 | 0.045 | 0.089 | 0.629 | 3/2 | 35
FKNN-DPC | 0.655 | 0.555 | 0.716 | 7/7 | 7 | 0.001 | 0.011 | 0.612 | 2/2 | 6
K-means | 0.583 | 0.495 | 0.612 | — | 6 | 0.050 | 0.102 | 0.668 | — | 2

Seeds | Dermatology
Algorithm | AMI | ARI | ACC | ECAC | Par | AMI | ARI | ACC | ECAC | Par
DPC-SFSKNN | 0.753 | 0.786 | 0.919 | 3/3 | 7 | 0.862 | 0.753 | 0.808 | 7/6 | 6
DPC | 0.727 | 0.760 | 0.918 | 3/3 | 2 | 0.611 | 0.514 | 0.703 | 4/6 | 2
DBSCAN | 0.640 | 0.713 | 0.874 | — | 0178 | 0.689 | 0.690 | 0.815 | — | 073
AP | 0.598 | 0.682 | 0.896 | 3/3 | 10 | 0.766 | 0.701 | 0.762 | 7/6 | 5
FKNN-DPC | 0.759 | 0.790 | 0.924 | 3/3 | 8 | 0.847 | 0.718 | 0.768 | 7/6 | 7
K-means | 0.671 | 0.705 | 0.890 | — | 3 | 0.796 | 0.680 | 0.702 | — | 6

Waveform | Waveform (noise)
Algorithm | AMI | ARI | ACC | ECAC | Par | AMI | ARI | ACC | ECAC | Par
DPC-SFSKNN | 0.355 | 0.382 | 0.725 | 3/3 | 5 | 0.267 | 0.288 | 0.651 | 3/3 | 6
DPC | 0.320 | 0.269 | 0.586 | 3/3 | 0.5 | 0.104 | 0.095 | 0.502 | 3/3 | 0.3
DBSCAN | — | — | — | — | — | — | — | — | — | —
AP | — | — | — | — | — | — | — | — | — | —
FKNN-DPC | 0.324 | 0.350 | 0.703 | 3/3 | 5 | 0.247 | 0.253 | 0.648 | 3/3 | 5
K-means | 0.363 | 0.254 | 0.501 | — | 3 | 0.364 | 0.252 | 0.512 | — | 3

Figure 6: The clustering of Pathbased by 6 clustering algorithms. (a) DPC-SFSKNN. (b) DPC. (c) DBSCAN. (d) AP. (e) FKNN-DPC. (f) K-means.

Figure 7: The clustering of Flame by 6 clustering algorithms. (a) DPC-SFSKNN. (b) DPC. (c) DBSCAN. (d) AP. (e) FKNN-DPC. (f) K-means.

A high-density point allocation error is transferred to the low-density points, and the error propagation seriously affects the clustering results. The AP and K-means algorithms are not good at dealing with irregular clusters: the two clusters in the middle are too attractive to the points on both sides of the semicircular cluster, which leads to clustering errors. DBSCAN can completely detect the semicircular cluster, but the semicircular cluster and the cluster on the left of the middle are incorrectly classified into one category, and the cluster on the right of the middle is divided into two clusters; the similarities between points and the manually prespecified parameters may severely affect the clustering. The DPC-SFSKNN and FKNN-DPC algorithms perform well on the Pathbased dataset. These improved algorithms that consider neighbor relationships have a great advantage in handling such complexly distributed datasets.

Figure 7 shows the results of the algorithms on the Flame dataset. As shown in the figure, DPC-SFSKNN, DPC, FKNN-DPC, and DBSCAN can correctly detect the two clusters, while AP and K-means cannot cluster completely correctly. Although AP can correctly identify the higher cluster and select an appropriate cluster center, the lower cluster is divided into two clusters; both clusters of K-means are wrong. The clustering results in Figure 8 show that the DPC-SFSKNN, DPC, FKNN-DPC, and DBSCAN algorithms can detect the 7 clusters in the Aggregation dataset, but AP and K-means still cannot cluster correctly. DPC-SFSKNN, DPC, and FKNN-DPC can identify the clusters and centers. Although the cluster centers are not marked for DBSCAN, the number of clusters and the overall shape of each cluster are correct. The AP algorithm successfully finds the correct number of clusters, but it chooses two centers for one cluster, which divides that cluster into two. The clustering result of K-means is similar to that of AP.

The Jain dataset shown in Figure 9 consists of two semicircular clusters of different densities. As shown in the figure, the DPC-SFSKNN algorithm can correctly cluster the two clusters with different densities. However, DPC, AP, FKNN-DPC, and K-means incorrectly assign the left end of the lower cluster to the higher cluster, and the cluster centers of DPC are all on the lower cluster. Compared with that, the distribution of the cluster centers of AP is more reasonable. The DBSCAN algorithm can accurately identify the lower cluster, but the left end of the higher cluster is incorrectly divided into a new cluster, so that the higher cluster is split into two clusters.

According to the benchmark data shown in Table 3, it is clear that the performance of DPC-SFSKNN is very effective among the six clustering algorithms, especially on the Jain dataset. Although DPC and FKNN-DPC perform better than DPC-SFSKNN on the Aggregation and Flame datasets, DPC-SFSKNN can still find the correct clustering centers of Aggregation and complete the clustering task correctly.

4.2. Analysis of Experimental Results on Real-World Datasets. In this section, the performance of the algorithms is again benchmarked according to AMI, ARI, ACC, and ECAC, and the clustering results are summarized in Table 4. Twelve real-world datasets are selected to test DPC-SFSKNN's ability to identify clusters on different datasets. DBSCAN and the AP algorithm cannot obtain effective clustering results on Waveform and Waveform (noise); the symbol "—" represents no result.

As shown in Table 4, in terms of the benchmarks AMI, ARI, and ACC, DPC-SFSKNN outperforms all five other algorithms on the Wine, Segmentation, and Libras movement datasets. At the same time, FKNN-DPC performs better than the other five algorithms on the Iris, Seeds, Parkinsons, and WDBC datasets. It can be seen that the overall performance of DPC-SFSKNN is slightly better than DPC on the 11 datasets other than Parkinsons; on Parkinsons, DPC-SFSKNN is slightly worse than DPC in AMI but better than DPC in ARI and ACC. Similarly, DPC-SFSKNN performs slightly better than FKNN-DPC on eight of the datasets; on the remaining four (Iris, Parkinsons, WDBC, and Seeds), DPC-SFSKNN is slightly worse than FKNN-DPC in AMI, ARI, and ACC. DBSCAN obtains the best results on Ionosphere, K-means is the best on Pima-Indians-diabetes, and K-means is the best in AMI on the Waveform and Waveform (noise) datasets. In general, the clustering results of DPC-SFSKNN on real-world datasets are satisfactory.

Figure 8: The clustering of Aggregation by 6 clustering algorithms. (a) DPC-SFSKNN. (b) DPC. (c) DBSCAN. (d) AP. (e) FKNN-DPC. (f) K-means.

4.3. Experimental Analysis on the Olivetti Face Dataset. The Olivetti face dataset [28] is an image dataset widely used by machine learning algorithms. Its purpose is to test the clustering behavior of an algorithm without supervision, including determining the number of clusters in the database and the members of each cluster. The dataset contains 40 clusters, each of which has 10 different images. Because the actual number of clusters (40 different clusters) is large while each cluster contains only 10 images, the reliability of the local density becomes smaller, which is a great challenge for density-based clustering algorithms. To further test the clustering performance of DPC-SFSKNN, experiments were performed on the Olivetti face database, and the results were compared with those of DPC, AP, DBSCAN, FKNN-DPC, and K-means.

The clustering results achieved by DPC-SFSKNN and DPC for the Olivetti face database are shown in Figure 10, where white squares represent the cluster centers. The 32 clusters found by DPC-SFSKNN in Figure 10(a) and the 20 clusters found by DPC in Figure 10(b) are displayed in different colors. Gray images indicate that the image is not assigned to any cluster. It can be seen from Figure 10(a) that the 32 cluster centers found by DPC-SFSKNN cover 29 clusters, and, as shown in Figure 10(b), the 20 cluster centers found by DPC are scattered over 19 clusters. Similar to DPC-SFSKNN, DPC may divide one cluster into two clusters. Because DPC-SFSKNN can find many more density peaks than DPC, it is more likely to identify one cluster as two different clusters. The same situation occurs with the FKNN-DPC algorithm.

Figure 9: The clustering of Jain by 6 clustering algorithms. (a) DPC-SFSKNN. (b) DPC. (c) DBSCAN. (d) AP. (e) FKNN-DPC. (f) K-means.

However, the performance of FKNN-DPC is better than that of DPC-SFSKNN in AMI, ARI, ACC, and ECAC. In Table 5, the clustering results of these algorithms are compared based on AMI, ARI, ACC, and ECAC. The performance of DPC-SFSKNN is slightly superior to that of the other four algorithms except FKNN-DPC.

4.4. Running Time. This section compares the time performance of DPC-SFSKNN with DPC, DBSCAN, AP, FKNN-DPC, and K-means on the real-world datasets. The time complexities of DPC-SFSKNN and DPC were analyzed in Section 3.3.1: the time complexity of DPC is O(n²), and the time complexity of DPC-SFSKNN is O(kn²), where n is the size of the dataset. However, the time consumed by DPC mainly comes from calculating the local density and the relative distance of each point, while the time consumed by DPC-SFSKNN comes mainly from the calculation of the K-nearest neighbors and the allocation strategy for the noncenter points. Table 6 lists the running time (in seconds) of the six algorithms on the real datasets. It can be seen that, although the time complexity of DPC-SFSKNN is approximately k times that of DPC, their execution times on actual datasets do not differ by a factor of k.

In Table 6, it can be found that, on relatively small datasets, the running time of DPC-SFSKNN is about twice or more that of DPC, and the difference mainly comes from DPC-SFSKNN's allocation strategy. Although the computational load of the local densities grows very quickly with the size of a dataset, the time consumed by the allocation strategy in DPC-SFSKNN varies with the distribution of the dataset. This leads to an irregular gap between the running times of DPC and DPC-SFSKNN.

FKNN-DPC has the same time and space complexity as DPC, but its running time is almost the same as that of DPC-SFSKNN: it takes a lot of running time to calculate the K-nearest neighbor relationships. The time complexities of DBSCAN and AP are approximately O(n²), and the parameters of both cannot be determined by simple methods. When the dataset is relatively large, it is difficult to find their optimal parameters.

Figure 10: The clustering of Olivetti by two clustering algorithms. (a) DPC-SFSKNN. (b) DPC.

Table 5: Performance comparison of algorithms by clustering criteria for the Olivetti face database.

Metric | DPC-SFSKNN | DPC | DBSCAN | AP | FKNN-DPC | K-means
ACC | 0.786 | 0.665 | 0.648 | 0.763 | 0.818 | 0.681
AMI | 0.792 | 0.728 | 0.691 | 0.737 | 0.832 | 0.742
ARI | 0.669 | 0.560 | 0.526 | 0.619 | 0.714 | 0.585
ECAC | 32/40 | 20/40 | — | 28/40 | 36/40 | —
Par | 6 | 0.5 | 064 | 21 | 4 | 40


This may be the reason that the two algorithms have no running results on the Waveform datasets. The approximate time complexity of K-means is O(n), and Table 6 confirms its efficiency. K-means has almost no loss of accuracy under the premise of fast speed, which makes it a very popular clustering algorithm, but K-means cannot handle irregularly shaped data well.

Table 6: Running time of 6 clustering algorithms in seconds on UCI datasets.

Dataset | DPC-SFSKNN | DPC | DBSCAN | AP | FKNN-DPC | K-means
Iris | 0.241 | 0.049 | 0.059 | 0.565 | 0.148 | 0.014
Wine | 0.238 | 0.048 | 0.098 | 0.832 | 0.168 | 0.013
WDBC | 0.484 | 0.092 | 0.884 | 6.115 | 0.464 | 0.018
Seeds | 0.244 | 0.050 | 0.122 | 0.973 | 0.164 | 0.014
Libras movement | 0.602 | 0.068 | 0.309 | 3.016 | 2.602 | 0.075
Ionosphere | 0.325 | 0.064 | 0.349 | 2.018 | 0.309 | 0.021
Segmentation | 1.569 | 0.806 | 8.727 | 6.679 | 0.313 | 0.062
Dermatology | 0.309 | 0.063 | 0.513 | 2.185 | 0.409 | 0.007
Pima-Indians-diabetes | 0.792 | 0.126 | 2.018 | 9.709 | 0.892 | 0.009
Parkinsons | 0.255 | 0.048 | 0.114 | 0.866 | 0.263 | 0.003
Waveform | 16.071 | 3.511 | — | — | 7.775 | 0.067
Waveform (noise) | 17.571 | 3.784 | — | — | 7.525 | 0.109

5. Conclusions and Future Work

A new clustering algorithm based on the traditional DPC algorithm is proposed in this paper. This algorithm introduces a density peak search method that takes the surrounding neighbor information into account and develops a new allocation strategy to detect the true distribution of the dataset. The proposed clustering algorithm performs a fast search, finds the density peaks, that is, the cluster centers, of a dataset of any size, and recognizes clusters with arbitrary shape and dimensionality. The algorithm is called DPC-SFSKNN, which means that it calculates the local density and the relative distance by using distance information between points and their neighbors to find the cluster centers, and then the remaining points are assigned using a similarity-first search algorithm based on the weighted KNN graph to find the owner (clustering center) of each point. DPC-SFSKNN successfully addresses several issues arising from the clustering algorithm of Rodriguez and Laio [20], including its density metric and the potential issue hidden in its assignment strategy. The performance of DPC-SFSKNN was tested on several synthetic datasets, real-world datasets from the UCI machine learning repository, and the well-known Olivetti face database. The experimental results on these datasets demonstrate that our DPC-SFSKNN is powerful in finding cluster centers and in recognizing clusters regardless of their shape, of the dimensionality of the space in which they are embedded, and of the size of the datasets, and it is robust to outliers. It performs much better than the original DPC algorithm. However, the proposed algorithm has some limitations: the parameter K needs to be manually adjusted according to different datasets; the clustering centers still need to be manually selected by analyzing the decision graph (like the DPC algorithm); and the allocation strategy improves the clustering accuracy but at additional time cost. How to improve the degree of automation and the allocation efficiency of the algorithm needs further research.

Data Availability

The synthetic datasets are cited at relevant places within the text as references [23–27]. The real-world datasets are cited at relevant places within the text as references [29–34]. The Olivetti face dataset is cited at a relevant place within the text as reference [28].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (6160303040 and 61433003), in part by the Yunnan Applied Basic Research Project of China (201701CF00037), and in part by the Yunnan Provincial Science and Technology Department Key Research Program (Engineering) (2018BA070).

Supplementary Materials

It includes the datasets used in the experiments in this paper. (Supplementary Materials)

References

[1] K. L. Liu, Y. L. Shang, Q. Ouyang, and W. D. Widanage, "A data-driven approach with uncertainty quantification for predicting future capacities and remaining useful life of lithium-ion battery," IEEE Transactions on Industrial Electronics, p. 1, 2020.

[2] X. P. Tang, K. L. Liu, X. Wang et al., "Model migration neural network for predicting battery aging trajectories," IEEE Transactions on Transportation Electrification, vol. 6, no. 2, pp. 363–374, 2020.

[3] X. Tang, K. Liu, X. Wang, B. Liu, F. Gao, and W. D. Widanage, "Real-time aging trajectory prediction using a base model-oriented gradient-correction particle filter for lithium-ion batteries," Journal of Power Sources, vol. 440, Article ID 227118, 2019.

[4] K. Liu, Y. Li, X. Hu, M. Lucu, and W. D. Widanage, "Gaussian process regression with automatic relevance determination kernel for calendar aging prediction of lithium-ion batteries," IEEE Transactions on Industrial Informatics, vol. 16, no. 6, pp. 3767–3777, 2020.

[5] K. Liu, X. Hu, Z. Wei, Y. Li, and Y. Jiang, "Modified Gaussian process regression models for cyclic capacity prediction of lithium-ion batteries," IEEE Transactions on Transportation Electrification, vol. 5, no. 4, pp. 1225–1236, 2019.

[6] L. Cai, J. Meng, D.-I. Stroe, G. Luo, and R. Teodorescu, "An evolutionary framework for lithium-ion battery state of health estimation," Journal of Power Sources, vol. 412, pp. 615–622, 2019.

[7] L. Cai, J. H. Meng, D. I. Stroe et al., "Multi-objective optimization of data-driven model for lithium-ion battery SOH estimation with short-term feature," IEEE Transactions on Power Electronics, p. 1, 2020.

[8] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972–976, 2007.

[9] J. Xie, H. Gao, W. Xie, X. Liu, and P. W. Grant, "Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors," Information Sciences, vol. 354, pp. 19–40, 2016.

[10] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, Berkeley, CA, USA, 1967.

[11] S. Kant, T. L. Rao, and P. N. Sundaram, "An automatic and stable clustering algorithm," Pattern Recognition Letters, vol. 15, no. 6, pp. 543–549, 1994.

[12] D. Arthur and S. Vassilvitskii, "K-means++: the advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 7–9, New Orleans, LA, USA, 2007.

[13] Y. Zhao, W. Halang, and X. Wang, "Rough ontology mapping in E-business integration," E-Service Intelligence, vol. 8, pp. 75–93, 2007.

[14] Y. Xiao and J. Yu, "Semi-supervised clustering based on affinity propagation algorithm," ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, 2007.

[15] M. Ester, H. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 226–231, Portland, OR, USA, 1996.

[16] R. J. G. B. Campello, D. Moulavi, and J. Sander, "Density-based clustering based on hierarchical density estimates," Advances in Knowledge Discovery and Data Mining, vol. 7819, pp. 160–172, 2013.

[17] Z. Liang and P. Chen, "Delta-density based clustering with a divide-and-conquer strategy: 3DC clustering," Pattern Recognition Letters, vol. 73, pp. 52–59, 2016.

[18] M. Ankerst, M. M. Breuning, H. P. Kriegel, and J. Sander, "OPTICS: ordering points to identify the clustering structure," in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pp. 49–60, Philadelphia, PA, USA, 1999.

[19] M. Du, S. Ding, and H. Jia, "Study on density peaks clustering based on k-nearest neighbors and principal component analysis," Knowledge-Based Systems, vol. 99, pp. 135–145, 2016.

[20] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, 2014.

[21] T. Li, H. W. Ge, and S. Z. Su, "Density peaks clustering by automatic determination of cluster centers," Journal of Computer Science and Technology, vol. 10, no. 11, pp. 1614–1622, 2016.

[22] R. Liu, H. Wang, and X. Yu, "Shared-nearest-neighbor-based clustering by fast search and find of density peaks," Information Sciences, vol. 450, pp. 200–226, 2018.

[23] R. A. Jarvis and E. A. Patrick, "Clustering using a similarity measure based on shared near neighbors," IEEE Transactions on Computers, vol. C-22, no. 11, pp. 1025–1034, 1973.

[24] H. Chang and D.-Y. Yeung, "Robust path-based spectral clustering," Pattern Recognition, vol. 41, no. 1, pp. 191–203, 2008.

[25] L. Fu and E. Medico, "FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data," BMC Bioinformatics, vol. 8, no. 1, 2007.

[26] A. Gionis, H. Mannila, and P. Tsaparas, "Clustering aggregation," ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, p. 4, 2007.

[27] P. Franti, O. Virmajoki, and V. Hautamaki, "Fast agglomerative clustering using a k-nearest neighbor graph," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1875–1881, 2006.

[28] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 1994 IEEE Workshop on Applications of Computer Vision, pp. 138–142, Sarasota, FL, USA, 1994.

[29] K. Bache and M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, 2013.

[30] M. Charytanowicz, J. Niewczas, P. Kulczycki, P. A. Kowalski, S. Lukasik, and S. Zak, "Complete gradient clustering algorithm for features analysis of X-ray images," Information Technologies in Biomedicine, Advances in Intelligent and Soft Computing, vol. 69, Springer, Berlin, Germany.

[31] D. B. Dias, R. C. B. Madeo, T. Rocha, H. H. Biscaro, and S. M. Peres, "Hand movement recognition for Brazilian sign language: a study using distance-based neural networks," in Proceedings of the 2009 International Joint Conference on Neural Networks, pp. 697–704, Atlanta, GA, USA, 2009.

[32] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, Routledge, New York, NY, USA, 1st edition, 1984.

[33] V. G. Sigillito, S. P. Wing, L. V. Hutton, and K. B. Baker, "Classification of radar returns from the ionosphere using neural networks," Johns Hopkins APL, vol. 10, no. 3, pp. 262–266, 1989.

[34] W. N. Street, W. H. Wolberg, and O. L. Mangasarian, "Nuclear feature extraction for breast tumor diagnosis," in Proceedings of the SPIE 1905, Biomedical Image Processing and Biomedical Visualization, San Jose, CA, USA, 1993.

[35] X. V. Nguyen, J. Epps, and J. Bailey, "Information theoretic measures for clusterings comparison: is a correction for chance necessary?" in Proceedings of ICML 2009, the 26th Annual International Conference on Machine Learning, Montreal, Canada, 2009.

[36] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, Burlington, MA, USA, 3rd edition, 2011.




[11] S Kant T L Rao and P N Sundaram ldquoAn automatic andstable clustering algorithmrdquo Pattern Recognition Lettersvol 15 no 6 pp 543ndash549 1994

[12] D Arthur and S Vassilvitskii ldquoK-Means++ the advantages ofcareful seedingrdquo in Proceedings of the Eighteenth AnnualACM-SIAM Symposium on Discrete Algorithms pp 7ndash9 NewOrleans LA USA 2007

[13] Y Zhao W Halang and X Wang ldquoRough ontology mappingin E-business integrationrdquo E-Service Intelligence BMC Bioinfvol 8 pp 75ndash93 2007

[14] Y Xiao and J Yu ldquoSemi-supervised clustering based on af-finity propagaiton algorithmrdquo ACM Transactions onKnowledge Discovery from Data vol 1 no 1 2007

[15] M Ester H Kriegel J Sander and X Xu ldquoA density-basedalgorithm for discovering clusters in large spatial databaseswith noiserdquo in Proceedings of the Second International Con-ference On Knowledge Discovery and Data Mining pp 226ndash231 Portland OR USA 1996

[16] R J G B Campello D Moulavi and J Sander ldquoDensity-based clustering based on hierarchical density estimatesrdquoAdvances in Knowledge Discovery and Data Mining vol 7819pp 160ndash172 2013

[17] Z Liang and P Chen ldquoDelta-density based clustering with adivide-and-conquer strategy 3DC clusteringrdquo Pattern Rec-ognition Letters vol 73 pp 52ndash59 2016

[18] M Ankerst M M Breuning H P Kriegel and J SanderldquoOPTICS ordering points to identify the clustering struc-turerdquo in Proceedings of the 1999 ACM SIGMOD-InternationalConference on Management of Data pp 49ndash60 PhiladelphiaPA USA 1999

[19] M Du S Ding and H Jia ldquoStudy on density peaks clusteringbased on k-nearest neighbors and principal componentanalysisrdquo Knowledge-Based Systems vol 99 pp 135ndash1452016

[20] A Rodriguez and A Laio ldquoClustering by fast search and findof density peaksrdquo Science vol 344 no 6191 pp 1492ndash14962014

[21] T Li H W Ge and S Z Su ldquoDensity peaks clustering byautomatic determination of cluster centersrdquo Journal ofComputer Science and Technology vol 10 no 11 pp 1614ndash1622 2016

[22] R Liu H Wang and X Yu ldquoShared-nearest-neighbor-basedclustering by fast search and find of density peaksrdquo Infor-mation Sciences vol 450 pp 200ndash226 2018

[23] R A Jarvis and E A Patrick ldquoClustering using a similaritymeasure based on shared near neighborsrdquo IEEE Transactionson Computers vol C-22 no 11 pp 1025ndash1034 1973

[24] H Chang and D-Y Yeung ldquoRobust path-based spectralclusteringrdquo Pattern Recognition vol 41 no 1 pp 191ndash2032008

[25] L Fu and E Medico ldquoFlame a novel fuzzy clustering methodfor the analysis of DNA microarray datardquo BMC Bio-informatics vol 8 no 1 2007

[26] A Gionis H Mannila and P Tsaparas ldquoClustering aggre-gationrdquo ACM Transactions on Knowledge Discovery fromData vol 1 no 1 p 4 2007

[27] P Franti O Virmajoki and V Hautamaki ldquoFast agglom-erative clustering using a k-nearest neighbor graphrdquo IEEETransactions on Pattern Analysis and Machine Intelligencevol 28 no 11 pp 1875ndash1881 2006

[28] F S Samaria and A C Harter ldquoParameterisation of a sto-chastic model for human face identificationrdquo in Proceedings ofthe 1994 IEEEWorkshop On Applications Of Computer Visionpp 138ndash142 Sarasota FL USA 1994

[29] K Bache and M Lichman UCI Machine Learning Repositoryhttparchiveicsucieduml 2013

[30] M Charytanowicz J Niewczas P Kulczycki P A KowalskiS Lukasik and S Zak ldquoComplete gradient clustering algo-rithm for features analysis of X-ray imagesrdquo InformationTechnologies in biomedicine Advances in Intelligent and SoftComputing vol 69 Berlin Germany Springer

[31] D B Dias R C B Madeo T Rocha H H Biscaro andS M Peres ldquoHand movement recognition for brazilian signlanguage a study using distance-based neural networksrdquo inProceedings of the 2009 International Joint Conference onNeural Networks pp 697ndash704 Atlanta GA USA 2009

[32] L Breiman J Friedman C J Stone and R A OlshenClassification and Regression Trees Routledge New York NYUSA 1st edition 1984

[33] V G Sigillito S P Wing L V Hutton and K B BakerldquoClassification of radar returns from the ionosphere usingneural networksrdquo Johns Hopkins APL vol 10 no 3pp 262ndash266 1989

[34] W N Street W H Wolberg and O L MangasarianldquoNuclear feature extraction for breast tumor diagnosisrdquo inProceedings of the SPIE 1905 Biomedical Image Processing andBiomedical Visualization San Jose CA USA 1993

[35] X V Nguyen J Epps and J Bailey ldquoInformation theoreticmeasures for clusterings comparison is a correction forchance necessaryrdquo in Proceedings of the ICML 2009 the 26thAnnual International Conference On Machine Learning SanMontreal Canada 2009

[36] J Han M Kamber and J Pei Data Mining Concepts andTechniques the Morgan Kaufmann Series in Data Manage-ment Systems Morgan Kaufmann Burlington MA USA 3rdedition 2011

Complexity 17

Page 3: ClusteringbyDetectingDensityPeaksandAssigningPointsby ... · FKNN-DPC [9] is given in (5) and (6), respectively: ρ i exp − 1 K X j∈knn(i) d2 ij ⎛⎝ ⎞⎠, (5) ρ i X j∈knn(i)

formula (3) is the empirical parameter provided in reference [20], which can be adjusted according to different datasets.

The relative distance δi represents the minimum distance between point i and any other point with higher density, and it is mathematically expressed as

\[
\delta_i =
\begin{cases}
\min_{j:\,\rho_j>\rho_i}\left(d_{ij}\right), & \rho_i < \max_k\left(\rho_k\right),\\[4pt]
\max_{j}\left(d_{ij}\right), & \rho_i = \max_k\left(\rho_k\right),
\end{cases}
\tag{4}
\]

where dij is the distance between points i and j. When the local density ρi is not the maximum density, the relative distance δi is defined as the minimum distance between point i and any other point with higher density; when ρi is the maximum density, δi takes the maximum distance to all other points.

After calculating the local density and relative distance of all data points, the DPC algorithm establishes a decision graph through the set of points (ρi, δi). A point with high values of both ρi and δi is called a peak, and the cluster centers are selected from the peaks. Then the DPC algorithm directly assigns each remaining point to the same cluster as its nearest neighbor of higher density.
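For concreteness, a minimal NumPy sketch of how equation (4) could be evaluated from a precomputed local density vector is given below; the γ = ρδ ranking mentioned in the closing comment is a common heuristic for reading the decision graph, not part of the DPC definition quoted here.

```python
import numpy as np

def relative_distance(dist, rho):
    """Compute the DPC relative distance delta_i of equation (4).

    dist : (n, n) pairwise distance matrix d_ij
    rho  : (n,) precomputed local densities rho_i
    """
    n = len(rho)
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]      # points with larger density than i
        if higher.size == 0:                    # i is the global density peak
            delta[i] = dist[i].max()            # maximum distance to any other point
        else:
            delta[i] = dist[i, higher].min()    # nearest higher-density point
    return delta

# Peaks are the points with simultaneously large rho and delta in the decision graph,
# e.g. ranked by the product gamma_i = rho_i * delta_i (a common heuristic).
```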

For the DPC algorithm, the selection of dc has a great influence on the correctness of the clustering results. Both the DPC-KNN and FKNN-DPC schemes introduce the concept of K-nearest neighbors to eliminate the influence of dc; hence, two different local density calculations are provided.

The local densities proposed by DPC-KNN [19] and FKNN-DPC [9] are given in (5) and (6), respectively:

\[
\rho_i = \exp\!\left(-\frac{1}{K}\sum_{j\in \mathrm{KNN}(i)} d_{ij}^{2}\right),
\tag{5}
\]
\[
\rho_i = \sum_{j\in \mathrm{KNN}(i)} \exp\!\left(-d_{ij}\right),
\tag{6}
\]

where K is the total number of nearest neighbors and KNN(i) represents the set of K-nearest neighbors of point i. Through the idea of K-nearest neighbors, these two methods provide a unified density metric for datasets of any size and solve the problem of the nonuniformity of DPC's density metric across different datasets.
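A compact sketch of these two K-nearest-neighbor densities, assuming a precomputed Euclidean distance matrix and illustrative variable names, might look as follows:

```python
import numpy as np

def knn_densities(dist, K):
    """Local densities of DPC-KNN (eq. (5)) and FKNN-DPC (eq. (6)) from a distance matrix."""
    # indices of the K nearest neighbours of every point (column 0 is the point itself)
    knn_idx = np.argsort(dist, axis=1)[:, 1:K + 1]
    d_knn = np.take_along_axis(dist, knn_idx, axis=1)       # (n, K) neighbour distances
    rho_dpc_knn = np.exp(-(d_knn ** 2).sum(axis=1) / K)     # equation (5)
    rho_fknn_dpc = np.exp(-d_knn).sum(axis=1)               # equation (6)
    return rho_dpc_knn, rho_fknn_dpc
```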

Based on K-nearest neighbors, SNN-DPC proposes the concept of shared-nearest neighbors (SNN) [22], which is used to represent the local density ρi and the relative distance δi. The idea of SNN is that if two points have more common neighbors among their K-nearest neighbors, the similarity of the two points is higher, and the expression is given by

\[
\mathrm{SNN}(i,j) = \mathrm{KNN}(i)\cap \mathrm{KNN}(j).
\tag{7}
\]

Based on the SNN concept, the expression of the SNN similarity is as follows:

\[
\mathrm{Sim}_{ij} =
\begin{cases}
\dfrac{\lvert \mathrm{SNN}(i,j)\rvert^{2}}{\sum_{p\in \mathrm{SNN}(i,j)}\left(d_{ip}+d_{jp}\right)}, & \text{if } i,j \in \mathrm{SNN}(i,j),\\[8pt]
0, & \text{otherwise},
\end{cases}
\tag{8}
\]

where dip is the distance between points i and p, and djp is the distance between points j and p. The condition for calculating the SNN similarity is that points i and j appear in each other's K-nearest neighbor sets; otherwise, the SNN similarity between the two points is 0.

Next, the local density ρi of point i is expressed by the SNN similarity. Suppose point i is any point in the dataset X; then S(i) = {x1, x2, ..., xk} represents the set of k points with the highest similarity to point i. The expression of the local density is

\[
\rho_i = \sum_{j\in S(i)} \mathrm{Sim}(i,j).
\tag{9}
\]

At the same time, the equation for the relative distance δi of point i is as follows:

\[
\delta_i =
\begin{cases}
\min_{j:\,\rho_j>\rho_i}\left[d_{ij}\left(\sum_{p\in \mathrm{KNN}(i)} d_{ip} + \sum_{q\in \mathrm{KNN}(j)} d_{jq}\right)\right], & \rho_i < \max_k\left(\rho_k\right),\\[8pt]
\max_{j\in (X\setminus\{i\})}\left(\delta_j\right), & \rho_i = \max_k\left(\rho_k\right).
\end{cases}
\tag{10}
\]

The SNN-DPC algorithm not only redefines the local density and relative distance but also changes the data point allocation strategy. The allocation strategy divides the data points into two categories, "unavoidable subordinate points" and "probable subordinate points," and each type has its own allocation algorithm. Compared with the DPC algorithm, this allocation strategy is better at clustering clusters with different shapes.

2.2. DPC Algorithm Analysis. DPC is a very simple and elegant clustering algorithm. However, due to its simplicity, DPC has the following two potential problems to be further addressed in practice.

2.2.1. DPC Ignores Low-Density Points. When the density difference between clusters is large, the performance of the DPC algorithm can be significantly degraded. To show this issue, we take the dataset Jain [23] as an example; the clustering results calculated using the truncated ("cutoff") kernel distance of DPC are shown in Figure 1. It can be seen that the cluster in the upper left is relatively sparsely distributed, while the cluster in the lower right is relatively dense. The red star in the figure represents the cluster center in the upper left corner. Owing to the disparity in density, the clustering centers selected by DPC all fall on the densely distributed cluster below, and because the clustering centers are selected incorrectly, the subsequent allocations are also incorrect.

Analyzing the local density and the relative distance separately from Figures 2(a) and 2(b), it can be seen that the ρ value and the δ value of point A, the false cluster center, are much higher than those of the true cluster center C. The results obtained with the Gaussian kernel distance are the same, and the correct clustering center cannot be selected on the Jain dataset. Therefore, how to increase the ρ and δ values of low-density centers and make them stand out in the decision graph is a problem that needs to be considered.

2.2.2. The Allocation Strategy of DPC Has Low Fault Tolerance. The fault tolerance of the allocation strategy of the DPC algorithm is not satisfactory, mainly because the allocation of a point is affected more strongly by the allocation of higher-density points than by its own density. Hence, if a high-density point is allocated incorrectly, the error directly affects the subsequent allocation of lower-density points. Taking the Pathbased dataset [24] as an example, Figure 3 shows the clustering result calculated by the DPC algorithm using the "cutoff" kernel distance. It can be seen from the figure that the DPC algorithm can find suitable clustering centers, but the allocation results of most points are incorrect. The same is true of the results using the Gaussian kernel distance; the point assignments on the Pathbased dataset are similar to those of the "cutoff" kernel clustering. Therefore, the fault tolerance of the point allocation strategy should be further improved. Moreover, points are greatly affected by other points during the allocation, which is also an issue to be further addressed.

3. Proposed Method

In this section, the DPC-SFSKNN algorithm is introduced in detail: its five main definitions are given, the entire algorithm process is described, and the complexity of DPC-SFSKNN is analyzed.

3.1. The Main Idea of DPC-SFSKNN. The DPC algorithm relies on the distance between points to calculate the local density and the relative distance and is also very sensitive to

Figure 1: Results of the traditional DPC algorithm on the Jain dataset. (a) Clustering of Jain by DPC. (b) Ground truth.

Figure 2: ρ and δ values of the result of the traditional DPC algorithm on the Jain dataset.


the choice of the cutoff distance dc. Hence, the DPC algorithm may not be able to correctly process some complex datasets. The probability that a point and its neighbors belong to the same cluster is high, so adding neighbor-related attributes to the clustering process can help to make correct judgments. Therefore, we introduce the concept of shared-nearest neighbors (SNN) proposed in [22] when defining the local density and the relative distance. Its basic idea is that if two points have more common neighbors, they are considered to be more similar, as said above (see equation (7)).

Based on the above ideas, we define the average distance dsnn(i, j) of the shared-nearest neighbors between point i and point j and the similarity between the two points.

Definition 1 (average distance of SNN). For any points i and j in the dataset X, the shared-nearest neighbor set of the two points is SNN(i, j), and the average distance of SNN, dsnn(i, j), is expressed as

\[
d_{\mathrm{snn}}(i,j) = \frac{\sum_{p\in \mathrm{SNN}(i,j)}\left(d_{ip}+d_{jp}\right)}{2S},
\tag{11}
\]

where point p is any point of SNN(i, j) and S is the number of members in the set SNN(i, j). dsnn(i, j) describes the spatial relationship between point i and point j more comprehensively by taking into account the distances from the two points to their shared-nearest neighbors.

Definition 2 (similarity). For any points i and j in the dataset X, the similarity Sim(i, j) between points i and j can be expressed as

\[
\mathrm{Sim}(i,j) = \frac{S}{K}\times 100,
\tag{12}
\]

where K is the number of nearest neighbors. K is selected from 4 to 40 until the optimal parameter appears. The lower bound is 4 because a smaller K may prevent the algorithm from terminating; for the upper bound, experiments show that a large K does not significantly affect the results of the algorithm. The similarity is defined according to the aforementioned basic idea, "if they have more common neighbors, the two points are considered to be more similar," and it is described by the ratio of the number of shared-nearest neighbors to the number of nearest neighbors.
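Under the same assumptions as before (a precomputed distance matrix and a fixed K), the shared-nearest-neighbor quantities of equations (11) and (12) could be computed as in the following sketch; the double loop mirrors the definitions rather than an optimized implementation, and the names are illustrative only:

```python
import numpy as np

def snn_quantities(dist, K):
    """Shared-nearest-neighbour set sizes, average SNN distances (eq. (11)),
    and similarities (eq. (12)) for every pair of points."""
    n = dist.shape[0]
    knn = [set(np.argsort(dist[i])[1:K + 1]) for i in range(n)]   # K-nearest neighbours
    S = np.zeros((n, n), dtype=int)        # |SNN(i, j)|
    d_snn = np.full((n, n), np.inf)        # average SNN distance, eq. (11)
    sim = np.zeros((n, n))                 # similarity, eq. (12)
    for i in range(n):
        for j in range(i + 1, n):
            shared = knn[i] & knn[j]
            if not shared:
                continue
            S[i, j] = S[j, i] = len(shared)
            idx = list(shared)
            d_snn[i, j] = d_snn[j, i] = (dist[i, idx] + dist[j, idx]).sum() / (2 * len(shared))
            sim[i, j] = sim[j, i] = len(shared) / K * 100
    return S, d_snn, sim
```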

Definition 3 (K-nearest neighbor average distance). For any point i in the dataset X, let KNN(i) be its K-nearest neighbor set; then the K-nearest neighbor average distance dknn(i) is expressed as

\[
d_{\mathrm{knn}}(i) = \frac{\sum_{p\in \mathrm{KNN}(i)} d_{ip}}{K},
\tag{13}
\]

where point p is any point in KNN(i) and K is the number of nearest neighbors of any point. The K-nearest neighbor average distance describes the surrounding environment of a point to some extent; next, we use it to describe the local density.

Definition 4 (local density). For any point i in the dataset X, the local density is expressed as

\[
\rho_i = \sum_{j\in \mathrm{KNN}(i)} \frac{S}{d_{\mathrm{knn}}(i)+d_{\mathrm{knn}}(j)},
\tag{14}
\]

where point j is a point in the set KNN(i), S is the number of shared-nearest neighbors of points i and j, and dknn(i) and dknn(j) are the K-nearest neighbor average distances of points i and j, respectively. In formula (14), the numerator (the number of shared-nearest neighbors S) represents the similarity between the two points, and the denominator (the sum of the average distances) describes the environment around them. When S is a constant and if

Figure 3: Results of the traditional DPC algorithm on the Pathbased dataset.


the value of the sum of the average distances (dknn(i) + dknn(j)) is small, the local density ρi of point i is large. Point j is one of the K-nearest neighbors of point i. When the values of dknn(i) and dknn(j) are small, it means that i and j are closely surrounded by their neighbors. If dknn(i) has a larger value (point j is far away from point i) or dknn(j) has a larger value (the neighbors of point j are far away from it), the local density of point i becomes smaller. Therefore, only when the average distances of both points are small can the local density of point i be large. Moreover, when the sum of the average distances of the two points is constant, a larger number of shared-nearest neighbors yields a larger local density. A large number of shared neighbors indicates that the two points have a high similarity and a high probability of belonging to the same cluster. The more high-similarity points there are around a point, the greater its local density and the greater the probability of its becoming a cluster center. This is beneficial to low-density cluster centers: a large number of shared neighbors can compensate for the loss caused by their large distance from other points, so that their local density is not determined by distance alone. Next, we define the relative distance of the points.
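A minimal sketch of equations (13) and (14), again assuming a precomputed distance matrix, is given below; it reuses the shared-neighbor count S described in the text, and the function name is illustrative:

```python
import numpy as np

def local_density(dist, K):
    """Local density of equation (14), built from the K-nearest neighbour
    average distance of equation (13) and the shared-neighbour counts."""
    n = dist.shape[0]
    order = np.argsort(dist, axis=1)[:, 1:K + 1]                   # KNN(i) for every i
    knn_sets = [set(order[i]) for i in range(n)]
    d_knn = np.take_along_axis(dist, order, axis=1).mean(axis=1)   # equation (13)
    rho = np.zeros(n)
    for i in range(n):
        for j in order[i]:                                         # j runs over KNN(i)
            S = len(knn_sets[i] & knn_sets[int(j)])                # shared-neighbour count
            rho[i] += S / (d_knn[i] + d_knn[int(j)])               # equation (14)
    return rho, d_knn
```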

Definition 5 (relative distance). For any point i in the dataset X, the relative distance can be expressed as

\[
\delta_i =
\begin{cases}
\min_{j:\,\rho_j>\rho_i}\left[d_{ij}+d_{\mathrm{knn}}(i)+d_{\mathrm{knn}}(j)\right], & \rho_i < \max_k\left(\rho_k\right),\\[4pt]
\max_{j\in (X\setminus\{i\})}\left(\delta_j\right), & \rho_i = \max_k\left(\rho_k\right),
\end{cases}
\tag{15}
\]

where point j is one of the K-nearest neighbors of point i, dij is the distance between points i and j, and dknn(i) and dknn(j) are the K-nearest neighbor average distances of points i and j. We use the sum of these three distances to represent the relative distance. Compared with the DPC algorithm, which uses only dij as the relative distance, the new definition adds the K-nearest neighbor average distances of the two points. It can not only express the relative distance but is also more friendly to low-density cluster centers: for a fixed dij, the nearest neighbor average distances of low-density points are relatively large, so their relative distance also increases, which raises the probability of low-density points being selected.
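The following sketch follows equation (15) literally, taking the minimum over all higher-density points and assuming a unique density maximum; it reuses the d_knn vector from the previous sketch, and the function name is an assumption of this illustration:

```python
import numpy as np

def relative_distance_sfsknn(dist, rho, d_knn):
    """Relative distance of equation (15): the plain DPC distance d_ij is
    augmented by the K-nearest neighbour average distances of both points."""
    n = len(rho)
    delta = np.empty(n)
    peak = int(np.argmax(rho))                 # assumes the density maximum is unique
    for i in range(n):
        if i == peak:
            continue
        higher = np.where(rho > rho[i])[0]
        # candidate value d_ij + d_knn(i) + d_knn(j) for every higher-density point j
        delta[i] = np.min(dist[i, higher] + d_knn[i] + d_knn[higher])
    delta[peak] = np.max(np.delete(delta, peak))   # highest-density point gets the largest delta
    return delta
```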

The DPC-SFSKNN clustering centers are selected in the same way as in the traditional DPC algorithm: the local density ρ and the relative distance δ are used to form a decision graph, and the n points with the largest local density and relative distance are selected as the clustering centers.

For DPC-SFSKNN, the sum of the distances from the points of a low-density cluster to their K-nearest neighbors tends to be large, so these points receive a greater compensation in their δ values. Figures 4(a) and 4(b) show the results of DPC-SFSKNN on the Jain dataset [23]. Compared with Figure 2(b), the δ values of points in the upper branch are generally larger than those of the lower branch. This is because the density of the upper branch is significantly smaller and the distances from its points to their respective K-nearest neighbors are larger, so they receive a greater compensation. Even though the density is at a disadvantage, the higher δ value still makes the center of the upper branch stand out in the decision graph. This shows that the DPC-SFSKNN algorithm can correctly select low-density clustering centers.

3.2. Processes. The entire process of the algorithm is divided into two parts: the selection of clustering centers and the allocation of noncenter points. The main steps of DPC-SFSKNN and a detailed introduction of the proposed allocation strategy are given in Algorithm 1.

Line 9 of the DPC-SFSKNN algorithm establishes a weighted K-nearest neighbor graph, and Line 11 is the K-nearest neighbor similarity-first search allocation strategy. To assign the noncenter points in the dataset, we designed a similarity-first search algorithm based on the weighted K-nearest neighbor graph. The algorithm uses the breadth-first search idea to find the cluster center with the highest similarity for each noncenter point. The similarities between a noncenter point and its K-nearest neighbors are sorted in ascending order, the neighbor with the highest similarity is selected as the next visited node, and it is pushed into the path queue. If the highest-similarity point is not unique, the point with the smallest SNN average distance is selected as the next visited node. The visited node in turn sorts the similarities of its own K-nearest neighbors and selects the next node to visit. The search stops when the visited node is a cluster center. Algorithm 2 describes the entire search process. Finally, every data point except the cluster centers is traversed to complete the assignment.

The similarity-first search algorithm is an optimization of breadth-first search according to the allocation requirements of noncenter points. Similarity is an important concept for clustering algorithms: points in the same cluster are similar to each other, and two points with a higher similarity have more common neighbors. Based on this idea, the definition of similarity was given in (12). If only the similarity were used as the search criterion, the point with the highest similarity would frequently not be unique. Therefore, the algorithm chooses the average SNN distance as the second criterion: a smaller dsnn means that the two points are closer in space.

The clustering results of the DPC-SFSKNN algorithm on the Pathbased dataset are shown in Figure 5. Figure 3 clearly shows that although the traditional DPC algorithm can find cluster centers in each of the three clusters, there is a serious bias in the allocation of noncenter points. From Figure 5 we can see the effectiveness of the noncenter point allocation algorithm of DPC-SFSKNN. The allocation strategy uses similarity-first search to ensure that the similarity along the search path is the highest and searches gradually toward the cluster center, avoiding the use of points with low similarity as references. Besides, the similarity-first search allocation strategy based on the weighted K-nearest neighbor graph also considers neighbor information: when the point with the highest similarity is not unique, the point with the shortest average distance of the shared neighbors is selected as the next visited point.

3.3. Complexity Analysis. In this section, the complexity of the DPC-SFSKNN algorithm is analyzed, including time complexity and space complexity. Suppose the size of the

Figure 4: Result and δ value of the DPC-SFSKNN algorithm on the Jain dataset.

Require: dataset X, parameter K.
Ensure: clustering result C.
(1) Data preprocessing: normalize the data.
(2) Calculate the Euclidean distance between the points.
(3) Calculate the K-nearest neighbors of each point i ∈ X.
(4) Calculate the average distance of the K-nearest neighbors of each point, dknn(i), according to (13).
(5) Calculate the local density ρi of each point i ∈ X according to (14).
(6) Calculate the relative distance δi of each point i ∈ X according to (15).
(7) Find the cluster centers by analyzing the decision graph composed of ρ and δ, and take the cluster centers as the set CC.
(8) Calculate the similarity between point i and its K-nearest neighbors according to (12).
(9) Connect each point in the dataset X with its K-nearest neighbors and use the similarity as the connection weight to construct a weighted K-nearest neighbor graph.
(10) Calculate the average SNN distance dsnn(i, j) between point i and its shared-nearest neighbors according to (11).
(11) Apply Algorithm 2 to allocate the remaining points.

Algorithm 1: DPC-SFSKNN.

Require: point w ∈ X, set of cluster centers CC, number of neighbors K, similarity matrix S(n×n) = sim(i, j)(n×n), and SNN average distance matrix DSNN(n×n) = dsnn(i, j)(n×n).
Ensure: the cluster center to which point w is assigned (the tail of P, a member of CC).
(1) Initialize the descending queue Q and the path queue P. The K-nearest neighbors of point w are sorted in ascending order of similarity and pushed into Q. Push w into P.
(2) while the tail point of P is not in CC do
(3)   if the highest-similarity point is unique then
(4)     Pop the point this at Q's tail.
(5)   else
(6)     Select the point this with the smallest DSNN.
(7)   end if
(8)   Empty the descending queue Q.
(9)   The K-nearest neighbors of this are sorted in ascending order of similarity and pushed into Q.
(10)  Push this into P.
(11) end while

Algorithm 2: Similarity-first search allocation strategy.
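A Python sketch of Algorithm 2 is given below. The visited set and the dead-end return are safeguards added for this sketch and are not stated in the pseudocode; knn_idx, sim, and d_snn are assumed to come from the earlier sketches.

```python
def similarity_first_search(w, centers, knn_idx, sim, d_snn):
    """Sketch of Algorithm 2: walk from noncenter point w along the weighted
    KNN graph, always moving to the most similar neighbour (ties broken by the
    smallest average SNN distance), until a cluster center is reached."""
    path = [w]                          # the path queue P
    current = w
    visited = {w}                       # cycle guard (not in the pseudocode)
    while current not in centers:
        neighbours = [j for j in knn_idx[current] if j not in visited]
        if not neighbours:              # dead end: simplification of this sketch
            return None
        best_sim = max(sim[current, j] for j in neighbours)
        candidates = [j for j in neighbours if sim[current, j] == best_sim]
        # tie-break on the smallest shared-nearest-neighbour average distance
        current = min(candidates, key=lambda j: d_snn[current, j])
        visited.add(current)
        path.append(current)
    return current                      # the cluster center that w is assigned to
```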


dataset is n, the number of cluster centers is m, and the number of neighbors is k.

3.3.1. Time Complexity. The time complexity analysis of DPC-SFSKNN is as follows.

Normalization requires a processing complexity of approximately O(n); the complexities of calculating the Euclidean distances and similarities between points are O(n²); the complexity of computing the K-nearest neighbor average distance dknn is O(n²); similarly, the complexity of computing the average distance dsnn between a point and its shared-nearest neighbors does not exceed O(n²); calculating the local density ρi and the distance δi of each point requires the KNN information of each point, with complexity O(kn), so the complexities of the local density ρ and the distance δ are O(kn²); the point allocation part is the search time of one point, and in the worst case searching for all points requires O(n), so with n points in the dataset the total allocation time does not exceed O(n²). In summary, the total approximate time complexity of DPC-SFSKNN is O(kn²).

The time complexity of the DPC algorithm depends on the following three aspects: (a) the time to calculate the distances between points, (b) the time to calculate the local density ρi for each point i, and (c) the time to calculate the distance δi for each point i. The time complexity of each part is O(n²), so the total approximate time complexity of DPC is O(n²).

The time complexity of the DPC-SFSKNN algorithm is thus k times higher than that of the traditional DPC algorithm. However, k is small compared with n, so it does not significantly affect the running time. In Section 4, it is demonstrated that the actual running time of DPC-SFSKNN does not exceed k times the running time of the traditional DPC algorithm.

3.3.2. Space Complexity. DPC-SFSKNN needs to calculate the distances and similarities between points, with complexity O(n²). The other data structures (such as the ρ and δ arrays and the various average distance arrays) are O(n). For the allocation strategy, in the worst case, the complexity is O(n²). The space complexity of DPC is O(n²), which is mainly due to the stored distance matrix.

The space complexity of our DPC-SFSKNN is therefore the same as that of traditional DPC, which is O(n²).
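To summarize Section 3, the following illustrative sketch wires the earlier snippets together along the steps of Algorithm 1. Note that the paper selects cluster centers manually from the decision graph; the top-n ranking by ρ·δ used here is only a stand-in for that manual step, and all helper names are the hypothetical ones defined in the previous sketches.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def dpc_sfsknn(X, K, n_centers):
    """Illustrative end-to-end pipeline following the steps of Algorithm 1
    (assumes X is already min-max normalized)."""
    dist = squareform(pdist(X))                           # Euclidean distance matrix
    rho, d_knn = local_density(dist, K)                   # d_knn (13) and rho (14)
    delta = relative_distance_sfsknn(dist, rho, d_knn)    # delta (15)
    centers = set(np.argsort(rho * delta)[-n_centers:])   # decision-graph peaks (heuristic stand-in)
    knn_idx = np.argsort(dist, axis=1)[:, 1:K + 1]        # neighbours of the weighted KNN graph
    _, d_snn, sim = snn_quantities(dist, K)               # similarity (12) and d_snn (11)
    labels = {}
    for w in range(len(X)):                               # allocate every noncenter point
        labels[w] = w if w in centers else similarity_first_search(w, centers, knn_idx, sim, d_snn)
    return labels, centers
```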

4. Experiments and Results

In this section, experiments are performed on several public datasets commonly used to test the performance of clustering algorithms, including synthetic datasets [23–27] and real datasets [28–34]. In order to visually observe the clustering ability of DPC-SFSKNN, the DPC [20], DBSCAN [15], AP [8], FKNN-DPC [9], and K-means [10] methods are all tested for comparison. Three popular benchmarks are used to evaluate the performance of the above clustering algorithms: the clustering accuracy (ACC), the adjusted mutual information (AMI), and the adjusted Rand index (ARI) [35]. The upper bound of all three benchmarks is 1, and the larger the benchmark value, the better the clustering effect. The codes for DPC, DBSCAN, and AP were provided based on the corresponding references.
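The paper does not state how the benchmarks are implemented; one common way to compute them, assuming scikit-learn and SciPy are available, is sketched below (ACC is obtained by finding the best one-to-one label mapping with the Hungarian method):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """ACC: fraction of points matched under the best one-to-one label mapping."""
    true_ids, pred_ids = np.unique(y_true), np.unique(y_pred)
    overlap = np.zeros((pred_ids.size, true_ids.size), dtype=int)
    for i, p in enumerate(pred_ids):
        for j, t in enumerate(true_ids):
            overlap[i, j] = np.sum((y_pred == p) & (y_true == t))
    row, col = linear_sum_assignment(-overlap)      # maximize the matched points
    return overlap[row, col].sum() / y_true.size

# AMI and ARI as used in the experiments:
# ami = adjusted_mutual_info_score(y_true, y_pred)
# ari = adjusted_rand_score(y_true, y_pred)
```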

Table 1 lists the synthetic datasets used in the experiments, which were published in [23–27]. Table 2 lists the real datasets used in the experiments, which include the real-world datasets from [29–34] and the Olivetti face dataset in [28].

To eliminate the influence of missing values and of differences in the ranges of the dimensions, the datasets need to be preprocessed before the experiments. We replace missing values by the mean of all valid values of the same dimension and normalize the data using the min-max normalization method shown in the following equation:

\[
x_{ij}' = \frac{x_{ij} - \min\left(x_j\right)}{\max\left(x_j\right) - \min\left(x_j\right)},
\tag{16}
\]

where xij represents the original data located in the ith row and jth column, x′ij represents the rescaled value of xij, and xj represents the original data of the jth column.

The min-max normalization method processes each dimension of the data and preserves the relationships among the original data values [36], therefore decreasing the influence of the difference in dimensions and increasing the efficiency of the calculation.
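A one-function sketch of equation (16) applied column by column is shown below; the guard against constant columns is an addition not mentioned in the paper:

```python
import numpy as np

def min_max_normalize(X):
    """Min-max normalization of equation (16), applied column by column."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0     # guard for constant columns (not stated in the paper)
    return (X - col_min) / col_range
```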

To fairly reflect the clustering results of the six algorithms, the parameters of each algorithm are tuned to retain its best clustering performance. For the DPC-SFSKNN algorithm, the parameter K needs to be specified in advance, and the initial clustering centers are manually selected based on a decision graph composed of the local density ρ and the relative distance δ. It can be seen from the experimental results in Tables 3 and 4 that the value of parameter K is around 6, and the value of K for datasets with a dense sample distribution is larger than 6. In addition to manually selecting the initial clustering centers, the traditional DPC algorithm also needs

Figure 5: Results of the DPC-SFSKNN algorithm on the Pathbased dataset.


to determine dc. Based on the selection range provided in [20], dc is selected so that the average number of neighbors is between 1% and 2% of the total number of data points. The two parameters that DBSCAN needs to determine are ε and minpts, as in [15]; the optimal parameters are determined using a circular search method. The AP algorithm only needs to determine a preference, and the larger the preference, the more center points are allowed to be selected [8]; there is no effective general method for selecting this parameter, so multiple experiments are performed to find the optimal value. The only parameter of K-means is the number of clusters, and the true number of clusters in the dataset is used here. Similarly, FKNN-DPC needs to determine the number of nearest neighbors K.

Table 2: Real-world datasets.

| Dataset | Records | Attributes | Clusters | Source |
| Iris | 150 | 4 | 3 | [29] |
| Libras movement | 360 | 90 | 15 | [31] |
| Wine | 178 | 13 | 3 | [29] |
| Parkinsons | 197 | 23 | 2 | [29] |
| WDBC | 569 | 30 | 2 | [34] |
| Pima-Indians-diabetes | 768 | 8 | 2 | [29] |
| Segmentation | 2310 | 19 | 7 | [29] |
| Dermatology | 366 | 33 | 6 | [29] |
| Seeds | 210 | 7 | 3 | [30] |
| Ionosphere | 351 | 34 | 2 | [33] |
| Waveform | 5000 | 21 | 3 | [32] |
| Waveform (noise) | 5000 | 40 | 3 | [32] |
| Olivetti face | 400 | 92×112 | 40 | [28] |

Table 1: Synthetic datasets.

| Dataset | Records | Attributes | Clusters | Source |
| Pathbased | 300 | 2 | 3 | [24] |
| Jain | 373 | 2 | 2 | [23] |
| Flame | 240 | 2 | 2 | [25] |
| Aggregation | 788 | 2 | 7 | [26] |
| DIM512 | 1024 | 512 | 16 | [27] |
| DIM1024 | 1024 | 1024 | 16 | [27] |

Table 3: The comparison of ACC, AMI, and ARI benchmarks for 6 clustering algorithms on synthetic datasets.

Pathbased / Jain
| Algorithm | AMI | ARI | ACC | ECAC | Par | AMI | ARI | ACC | ECAC | Par |
| DPC-SFSKNN | 0.926 | 0.910 | 0.925 | 3/3 | 6 | 1.000 | 1.000 | 1.000 | 2/2 | 7 |
| DPC | 0.521 | 0.463 | 0.742 | 3/3 | 2 | 0.609 | 0.713 | 0.853 | 2/2 | 3 |
| DBSCAN | 0.781 | 0.522 | 0.667 | — | 0056 | 0.883 | 0.985 | 0.918 | — | 00810 |
| AP | 0.679 | 0.475 | 0.783 | 3/3 | 10 | 0.681 | 0.812 | 0.882 | 2/2 | 40 |
| FKNN-DPC | 0.941 | 0.960 | 0.987 | 3/3 | 5 | 0.056 | 0.132 | 0.793 | — | 10 |
| K-means | 0.568 | 0.461 | 0.772 | — | 3 | 0.492 | 0.577 | 0.712 | — | 2 |

Aggregation / Flame
| Algorithm | AMI | ARI | ACC | ECAC | Par | AMI | ARI | ACC | ECAC | Par |
| DPC-SFSKNN | 0.942 | 0.951 | 0.963 | 7/7 | 6 | 0.873 | 0.934 | 0.956 | 2/2 | 6 |
| DPC | 1.000 | 1.000 | 1.000 | 7/7 | 4 | 1.000 | 1.000 | 1.000 | 2/2 | 5 |
| DBSCAN | 0.969 | 0.982 | 0.988 | — | 0058 | 0.867 | 0.936 | 0.981 | — | 0098 |
| AP | 0.795 | 0.753 | 0.841 | 7/7 | 77 | 0.452 | 0.534 | 0.876 | 3/3 | 35 |
| FKNN-DPC | 0.995 | 0.997 | 0.999 | 3/3 | 8 | 1.000 | 1.000 | 1.000 | 2/2 | 5 |
| K-means | 0.784 | 0.717 | 0.786 | — | 7 | 0.418 | 0.465 | 0.828 | — | 2 |

DIM512 / DIM1024
| Algorithm | AMI | ARI | ACC | ECAC | Par | AMI | ARI | ACC | ECAC | Par |
| DPC-SFSKNN | 1.000 | 1.000 | 1.000 | 16/16 | 8 | 1.000 | 1.000 | 1.000 | 16/16 | 9 |
| DPC | 1.000 | 1.000 | 1.000 | 16/16 | 2 | 1.000 | 1.000 | 1.000 | 16/16 | 001 |
| DBSCAN | 1.000 | 1.000 | 1.000 | — | 037 | 1.000 | 1.000 | 1.000 | — | 108 |
| AP | 1.000 | 1.000 | 1.000 | 16/16 | 20 | 1.000 | 1.000 | 1.000 | 16/16 | 30 |
| FKNN-DPC | 1.000 | 1.000 | 1.000 | 16/16 | 8 | 1.000 | 1.000 | 1.000 | 16/16 | 10 |
| K-means | 0.895 | 0.811 | 0.850 | — | 1 | 0.868 | 0.752 | 0.796 | — | 16 |


4.1. Analysis of the Experimental Results on Synthetic Datasets. In this section, the performance of DPC-SFSKNN, DPC [20], DBSCAN [15], AP [8], FKNN-DPC [9], and K-means [10] is tested with the six synthetic datasets given in Table 1. These synthetic datasets differ in distribution and size, so different data situations can be simulated to compare the performance of the six algorithms. Table 3 shows the AMI, ARI, ACC, and ECAC of the six clustering algorithms on the six synthetic datasets, where the best results are shown in bold and "—" means no value. Figures 6–9 show the clustering results of DPC-SFSKNN, DPC, DBSCAN, AP, FKNN-DPC, and K-means on the Pathbased, Flame, Aggregation, and Jain datasets, respectively. Five of the algorithms achieve the optimal clustering on the DIM512 and DIM1024 datasets, so the clustering of these two datasets is not shown. Since the cluster centers of DBSCAN are relatively random, only the positions of the clustering centers of the other algorithms are marked.

Figure 6 shows the results on the Pathbased dataset. DPC-SFSKNN and FKNN-DPC can complete the clustering of the Pathbased dataset correctly. From Figures 6(b), 6(d), and 6(f), it can be seen that the clustering results of DPC, AP, and K-means are similar. The clustering centers selected by DPC, AP, DPC-SFSKNN, and FKNN-DPC are highly similar, but the clustering results of DPC and AP are not satisfactory. For the DPC algorithm, the low fault tolerance of its allocation strategy is the cause of this result. A

Table 4: Comparison of ACC, AMI, and ARI benchmarks for 6 clustering algorithms on real-world datasets.

Iris / Libras movement
| Algorithm | AMI | ARI | ACC | ECAC | Par | AMI | ARI | ACC | ECAC | Par |
| DPC-SFSKNN | 0.896 | 0.901 | 0.962 | 3/3 | 6 | 0.547 | 0.368 | 0.510 | 10/15 | 8 |
| DPC | 0.812 | 0.827 | 0.926 | 3/3 | 2 | 0.535 | 0.304 | 0.438 | 9/15 | 05 |
| DBSCAN | 0.792 | 0.754 | 0.893 | — | 0149 | 0.412 | 0.183 | 0.385 | — | 0965 |
| AP | 0.764 | 0.775 | 0.911 | 3/3 | 6 | 0.364 | 0.267 | 0.453 | 10/15 | 25 |
| FKNN-DPC | 0.912 | 0.922 | 0.973 | 3/3 | 7 | 0.508 | 0.308 | 0.436 | 10/15 | 9 |
| K-means | 0.683 | 0.662 | 0.823 | — | 3 | 0.522 | 0.306 | 0.449 | — | 15 |

Wine / Parkinsons
| Algorithm | AMI | ARI | ACC | ECAC | Par | AMI | ARI | ACC | ECAC | Par |
| DPC-SFSKNN | 0.843 | 0.851 | 0.951 | 3/3 | 6 | 0.193 | 0.380 | 0.827 | 2/2 | 6 |
| DPC | 0.706 | 0.672 | 0.882 | 3/3 | 2 | 0.210 | 0.114 | 0.612 | 2/2 | 5 |
| DBSCAN | 0.612 | 0.643 | 0.856 | — | 04210 | 0.205 | 0.213 | 0.674 | — | 046 |
| AP | 0.592 | 0.544 | 0.781 | 3/3 | 6 | 0.142 | 0.127 | 0.669 | 2/2 | 15 |
| FKNN-DPC | 0.831 | 0.852 | 0.949 | 3/3 | 7 | 0.273 | 0.391 | 0.851 | 2/2 | 5 |
| K-means | 0.817 | 0.838 | 0.936 | — | 3 | 0.201 | 0.049 | 0.625 | — | 2 |

WDBC / Ionosphere
| Algorithm | AMI | ARI | ACC | ECAC | Par | AMI | ARI | ACC | ECAC | Par |
| DPC-SFSKNN | 0.432 | 0.516 | 0.857 | 2/2 | 6 | 0.361 | 0.428 | 0.786 | 3/2 | 7 |
| DPC | 0.002 | -0.004 | 0.602 | 2/2 | 9 | 0.238 | 0.276 | 0.681 | 3/2 | 065 |
| DBSCAN | 0.397 | 0.538 | 0.862 | — | 0277 | 0.544 | 0.683 | 0.853 | — | 027 |
| AP | 0.598 | 0.461 | 0.854 | 2/2 | 40 | 0.132 | 0.168 | 0.706 | 2/2 | 15 |
| FKNN-DPC | 0.679 | 0.786 | 0.944 | 2/2 | 7 | 0.284 | 0.355 | 0.752 | 2/2 | 8 |
| K-means | 0.611 | 0.730 | 0.928 | — | 2 | 0.129 | 0.178 | 0.712 | — | 2 |

Segmentation / Pima-Indians-diabetes
| Algorithm | AMI | ARI | ACC | ECAC | Par | AMI | ARI | ACC | ECAC | Par |
| DPC-SFSKNN | 0.665 | 0.562 | 0.746 | 6/7 | 6 | 0.037 | 0.083 | 0.652 | 2/2 | 6 |
| DPC | 0.650 | 0.550 | 0.684 | 6/7 | 3 | 0.033 | 0.075 | 0.647 | 2/2 | 4 |
| DBSCAN | 0.446 | 0.451 | 0.550 | — | 02510 | 0.028 | 0.041 | 0.577 | — | 0156 |
| AP | 0.405 | 0.436 | 0.554 | 7/7 | 25 | 0.045 | 0.089 | 0.629 | 3/2 | 35 |
| FKNN-DPC | 0.655 | 0.555 | 0.716 | 7/7 | 7 | 0.001 | 0.011 | 0.612 | 2/2 | 6 |
| K-means | 0.583 | 0.495 | 0.612 | — | 6 | 0.050 | 0.102 | 0.668 | — | 2 |

Seeds / Dermatology
| Algorithm | AMI | ARI | ACC | ECAC | Par | AMI | ARI | ACC | ECAC | Par |
| DPC-SFSKNN | 0.753 | 0.786 | 0.919 | 3/3 | 7 | 0.862 | 0.753 | 0.808 | 7/6 | 6 |
| DPC | 0.727 | 0.760 | 0.918 | 3/3 | 2 | 0.611 | 0.514 | 0.703 | 4/6 | 2 |
| DBSCAN | 0.640 | 0.713 | 0.874 | — | 0178 | 0.689 | 0.690 | 0.815 | — | 073 |
| AP | 0.598 | 0.682 | 0.896 | 3/3 | 10 | 0.766 | 0.701 | 0.762 | 7/6 | 5 |
| FKNN-DPC | 0.759 | 0.790 | 0.924 | 3/3 | 8 | 0.847 | 0.718 | 0.768 | 7/6 | 7 |
| K-means | 0.671 | 0.705 | 0.890 | — | 3 | 0.796 | 0.680 | 0.702 | — | 6 |

Waveform / Waveform (noise)
| Algorithm | AMI | ARI | ACC | ECAC | Par | AMI | ARI | ACC | ECAC | Par |
| DPC-SFSKNN | 0.355 | 0.382 | 0.725 | 3/3 | 5 | 0.267 | 0.288 | 0.651 | 3/3 | 6 |
| DPC | 0.320 | 0.269 | 0.586 | 3/3 | 05 | 0.104 | 0.095 | 0.502 | 3/3 | 03 |
| DBSCAN | — | — | — | — | — | — | — | — | — | — |
| AP | — | — | — | — | — | — | — | — | — | — |
| FKNN-DPC | 0.324 | 0.350 | 0.703 | 3/3 | 5 | 0.247 | 0.253 | 0.648 | 3/3 | 5 |
| K-means | 0.363 | 0.254 | 0.501 | — | 3 | 0.364 | 0.252 | 0.512 | — | 3 |


Figure 6: The clustering of Pathbased by 6 clustering algorithms. (a) DPC-SFSKNN. (b) DPC. (c) DBSCAN. (d) AP. (e) FKNN-DPC. (f) K-means.


Figure 7: The clustering of Flame by 6 clustering algorithms. (a) DPC-SFSKNN. (b) DPC. (c) DBSCAN. (d) AP. (e) FKNN-DPC. (f) K-means.


high-density point allocation error will be transferred to lower-density points, and this error propagation seriously affects the clustering results. The AP and K-means algorithms are not good at dealing with irregular clusters: the two clusters in the middle are too attractive to the points on both sides of the semicircular cluster, which leads to clustering errors. DBSCAN can completely detect the semicircular cluster, but the semicircular cluster and the cluster on its left in the middle are incorrectly classified into one category, and the cluster on the right in the middle is divided into two clusters. The similarities between points and the manually prespecified parameters may severely affect the clustering. The DPC-SFSKNN and FKNN-DPC algorithms perform well on the Pathbased dataset; such improved algorithms that consider neighbor relationships have a great advantage in handling complexly distributed datasets.

Figure 7 shows the results of the six algorithms on the Flame dataset. As shown in the figure, DPC-SFSKNN, DPC, FKNN-DPC, and DBSCAN can correctly detect the two clusters, while AP and K-means cannot cluster completely correctly. Although AP can correctly identify the upper cluster and select an appropriate cluster center, the lower cluster is divided into two clusters; both clusters of K-means are wrong. The clustering results in Figure 8 show that the DPC-SFSKNN, DPC, FKNN-DPC, and DBSCAN algorithms can detect the 7 clusters in the Aggregation dataset, but AP and K-means still cannot cluster correctly. DPC-SFSKNN, DPC, and FKNN-DPC can identify both the clusters and their centers. Although the cluster centers are not marked for DBSCAN, the number of clusters and the overall shape of each cluster are correct. The AP algorithm finds the correct number of clusters, but it chooses two centers for one cluster, which divides that cluster into two; the clustering result of K-means is similar to that of AP.

The Jain dataset shown in Figure 9 consists of two semicircular clusters of different densities. As shown in the figure, the DPC-SFSKNN algorithm can completely cluster the two clusters with different densities. However, DPC, AP, FKNN-DPC, and K-means incorrectly assign the left end of the lower cluster to the upper cluster, and the cluster centers of DPC both lie on the lower cluster; compared with that, the distribution of the cluster centers of AP is more reasonable. The DBSCAN algorithm can accurately identify the lower cluster, but the left end of the upper cluster is incorrectly identified as a new cluster, so that the upper cluster is divided into two clusters.

According to the benchmark data shown in Table 3, the performance of DPC-SFSKNN is very competitive among the six clustering algorithms, especially on the Jain dataset. Although DPC and FKNN-DPC perform better than DPC-SFSKNN on the Aggregation and Flame datasets, DPC-SFSKNN can still find the correct clustering centers of Aggregation and complete the clustering task correctly.

4.2. Analysis of Experimental Results on Real-World Datasets. In this section, the performance of the six algorithms is again benchmarked by AMI, ARI, ACC, and ECAC, and the clustering results are summarized in Table 4. Twelve real-world datasets are selected to test DPC-SFSKNN's ability to identify clusters on different datasets. DBSCAN and AP cannot obtain effective clustering results on Waveform and Waveform (noise); the symbol "—" represents no result.

As shown in Table 4, in terms of the benchmarks AMI, ARI, and ACC, DPC-SFSKNN outperforms the other five algorithms on the Wine, Segmentation, and Libras movement datasets. At the same time, FKNN-DPC performs better than the other five algorithms on the Iris, Seeds, Parkinsons, and WDBC datasets. It can be seen that the overall performance of DPC-SFSKNN is slightly better than DPC on 11 datasets, the exception being Parkinsons, where DPC-SFSKNN is slightly worse than DPC in AMI but better in ARI and ACC. Similarly, DPC-SFSKNN performs slightly better than FKNN-DPC on eight of the datasets, while on Iris, Parkinsons, WDBC, and Seeds it is slightly worse than FKNN-DPC in AMI, ARI, and ACC. DBSCAN obtains the best results on Ionosphere, and K-means is the best on Pima-Indians-diabetes and the best in AMI on the Waveform and Waveform (noise) datasets. In general, the clustering results of DPC-SFSKNN on real-world datasets are satisfactory.

Figure 8: The clustering of Aggregation by 6 clustering algorithms. (a) DPC-SFSKNN. (b) DPC. (c) DBSCAN. (d) AP. (e) FKNN-DPC. (f) K-means.


4.3. Experimental Analysis on the Olivetti Face Dataset. The Olivetti face dataset [28] is an image dataset widely used to test machine learning algorithms. Its purpose here is to test the clustering ability of the algorithm without supervision, including determining the number of clusters in the database and the members of each cluster. The dataset contains 40 clusters, each of which has 10 different images. Because the actual number of clusters (40 different clusters) is large while each cluster contains only a few elements (10 different images per cluster), the reliability of the local density becomes smaller, which is a great challenge for density-based clustering algorithms. To further test the clustering performance of DPC-SFSKNN, experiments were performed on the Olivetti face database and compared with DPC, AP, DBSCAN, FKNN-DPC, and K-means.

The clustering results achieved by DPC-SFSKNN and DPC for the Olivetti face database are shown in Figure 10, where white squares represent the cluster centers. The 32 clusters found by DPC-SFSKNN in Figure 10(a) and the 20 clusters found by DPC in Figure 10(b) are displayed in different colors, and gray images indicate that the image is not assigned to any cluster. It can be seen from Figure 10(a) that the 32 cluster centers found by DPC-SFSKNN cover 29 clusters, while, as shown in Figure 10(b), the 20 cluster centers found by DPC are scattered over 19 clusters. Similar to DPC-SFSKNN, DPC may divide one cluster into two clusters. Because DPC-SFSKNN can find many more density peaks than DPC, it is more likely to identify one cluster as two different clusters; the same situation occurs with the FKNN-DPC algorithm. However,

Figure 9: The clustering of Jain by 6 clustering algorithms. (a) DPC-SFSKNN. (b) DPC. (c) DBSCAN. (d) AP. (e) FKNN-DPC. (f) K-means.


the performance of FKNN-DPC is better than that of DPC-SFSKNN in AMI, ARI, ACC, and ECAC. The clustering results of these algorithms are compared in Table 5 based on AMI, ARI, ACC, and ECAC; the performance of DPC-SFSKNN is slightly superior to that of the other four algorithms except FKNN-DPC.

4.4. Running Time. This section compares the time performance of DPC-SFSKNN with DPC, DBSCAN, AP, FKNN-DPC, and K-means on the real-world datasets. The time complexities of DPC-SFSKNN and DPC were analyzed in Section 3.3.1: the time complexity of DPC is O(n²) and that of DPC-SFSKNN is O(kn²), where n is the size of the dataset. However, the time consumed by DPC mainly comes from calculating the local density and the relative distance of each point, while the time consumed by DPC-SFSKNN comes mainly from the calculation of the K-nearest neighbors and the allocation strategy for noncenter points. Table 6 lists the running time (in seconds) of the six algorithms on the real datasets. It can be seen that although the time complexity of DPC-SFSKNN is approximately k times that of DPC, their execution times on actual datasets do not differ by a factor of k.

In Table 6, it can be found that on relatively small datasets the running time of DPC-SFSKNN is about twice or more that of DPC, and the difference mainly comes from DPC-SFSKNN's allocation strategy. Although the computational load of the local densities grows very quickly with the size of a dataset, the time consumed by the allocation strategy in DPC-SFSKNN varies with the distribution of a dataset. This leads to an irregular gap between the running times of DPC and DPC-SFSKNN.

FKNN-DPC has the same time and space complexity as DPC, but its running time is almost the same as that of DPC-SFSKNN: it takes a lot of running time to calculate the relationships between K-nearest neighbors. The time complexity of DBSCAN and AP is approximately O(n²), and the parameters of both cannot be determined by simple methods. When the dataset is relatively large, it is difficult to find their optimal parameters, which may be the reason that

Figure 10: The clustering of Olivetti by two clustering algorithms. (a) DPC-SFSKNN. (b) DPC.

Table 5: Performance comparison of algorithms by clustering criteria for the Olivetti face database.

| Metric | DPC-SFSKNN | DPC | DBSCAN | AP | FKNN-DPC | K-means |
| ACC | 0.786 | 0.665 | 0.648 | 0.763 | 0.818 | 0.681 |
| AMI | 0.792 | 0.728 | 0.691 | 0.737 | 0.832 | 0.742 |
| ARI | 0.669 | 0.560 | 0.526 | 0.619 | 0.714 | 0.585 |
| ECAC | 32/40 | 20/40 | — | 28/40 | 36/40 | — |
| Par | 6 | 05 | 064 | 21 | 4 | 40 |


the two algorithms have no running results on the Waveform dataset. The approximate time complexity of K-means is O(n), and Table 6 confirms its efficiency. K-means has almost no loss of accuracy under the premise of fast speed, which makes it a very popular clustering algorithm, but K-means is not sensitive to irregularly shaped data.

5. Conclusions and Future Work

A new clustering algorithm based on the traditional DPC algorithm is proposed in this paper. It introduces a density peak search that takes the surrounding neighbor information into account and develops a new allocation strategy to detect the true distribution of the dataset. The proposed clustering algorithm performs a fast search, finds the density peaks, that is, the cluster centers, of a dataset of any size, and recognizes clusters with any arbitrary shape or dimensionality. The algorithm is called DPC-SFSKNN, which means that it calculates the local density and the relative distance by using distance information between points and their neighbors to find the cluster centers, and then the remaining points are assigned using a similarity-first search algorithm based on the weighted KNN graph to find the owner (cluster center) of each point. DPC-SFSKNN successfully addresses several issues arising from the clustering algorithm of Alex Rodriguez and Alessandro Laio [20], including its density metric and the potential issue hidden in its assignment strategy. The performance of DPC-SFSKNN was tested on several synthetic datasets, on real-world datasets from the UCI machine learning repository, and on the well-known Olivetti face database. The experimental results on these datasets demonstrate that DPC-SFSKNN is powerful in finding cluster centers and in recognizing clusters regardless of their shape, of the dimensionality of the space in which they are embedded, and of the size of the datasets, and that it is robust to outliers. It performs much better than the original DPC algorithm. However, the proposed algorithm has some limitations: the parameter K needs to be manually adjusted according to different datasets; the clustering centers still need to be manually selected by analyzing the decision graph (as in the DPC algorithm); and the allocation strategy improves the clustering accuracy but at additional time cost. How to improve the degree of automation and the allocation efficiency of the algorithm needs further research.

Data Availability

The synthetic datasets are cited at relevant places within the text as references [23–27]. The real-world datasets are cited at relevant places within the text as references [29–34]. The Olivetti face dataset is cited at relevant places within the text as reference [28].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (6160303040 and 61433003), in part by the Yunnan Applied Basic Research Project of China (201701CF00037), and in part by the Yunnan Provincial Science and Technology Department Key Research Program (Engineering) (2018BA070).

Supplementary Materials

The supplementary materials include the datasets used in the experiments in this paper. (Supplementary Materials)

References

[1] K. L. Liu, Y. L. Shang, Q. Ouyang, and W. D. Widanage, "A data-driven approach with uncertainty quantification for predicting future capacities and remaining useful life of lithium-ion battery," IEEE Transactions on Industrial Electronics, p. 1, 2020.

[2] X. P. Tang, K. L. Liu, X. Wang et al., "Model migration neural network for predicting battery aging trajectories," IEEE Transactions on Transportation Electrification, vol. 6, no. 2, pp. 363–374, 2020.

Table 6: Running time of 6 clustering algorithms in seconds on UCI datasets.

| Dataset | DPC-SFSKNN | DPC | DBSCAN | AP | FKNN-DPC | K-means |
| Iris | 0.241 | 0.049 | 0.059 | 0.565 | 0.148 | 0.014 |
| Wine | 0.238 | 0.048 | 0.098 | 0.832 | 0.168 | 0.013 |
| WDBC | 0.484 | 0.092 | 0.884 | 6.115 | 0.464 | 0.018 |
| Seeds | 0.244 | 0.050 | 0.122 | 0.973 | 0.164 | 0.014 |
| Libras movement | 0.602 | 0.068 | 0.309 | 3.016 | 2.602 | 0.075 |
| Ionosphere | 0.325 | 0.064 | 0.349 | 2.018 | 0.309 | 0.021 |
| Segmentation | 1.569 | 0.806 | 8.727 | 6.679 | 0.313 | 0.062 |
| Dermatology | 0.309 | 0.063 | 0.513 | 2.185 | 0.409 | 0.007 |
| Pima-Indians-diabetes | 0.792 | 0.126 | 2.018 | 9.709 | 0.892 | 0.009 |
| Parkinsons | 0.255 | 0.048 | 0.114 | 0.866 | 0.263 | 0.003 |
| Waveform | 16.071 | 3.511 | — | — | 7.775 | 0.067 |
| Waveform (noise) | 17.571 | 3.784 | — | — | 7.525 | 0.109 |

[3] X. Tang, K. Liu, X. Wang, B. Liu, F. Gao, and W. D. Widanage, "Real-time aging trajectory prediction using a base model-oriented gradient-correction particle filter for lithium-ion batteries," Journal of Power Sources, vol. 440, Article ID 227118, 2019.

[4] K. Liu, Y. Li, X. Hu, M. Lucu, and W. D. Widanage, "Gaussian process regression with automatic relevance determination kernel for calendar aging prediction of lithium-ion batteries," IEEE Transactions on Industrial Informatics, vol. 16, no. 6, pp. 3767–3777, 2020.

[5] K. Liu, X. Hu, Z. Wei, Y. Li, and Y. Jiang, "Modified Gaussian process regression models for cyclic capacity prediction of lithium-ion batteries," IEEE Transactions on Transportation Electrification, vol. 5, no. 4, pp. 1225–1236, 2019.

[6] L. Cai, J. Meng, D.-I. Stroe, G. Luo, and R. Teodorescu, "An evolutionary framework for lithium-ion battery state of health estimation," Journal of Power Sources, vol. 412, pp. 615–622, 2019.

[7] L. Cai, J. H. Meng, D. I. Stroe et al., "Multi-objective optimization of data-driven model for lithium-ion battery SOH estimation with short-term feature," IEEE Transactions on Power Electronics, p. 1, 2020.

[8] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972–976, 2007.

[9] J. Xie, H. Gao, W. Xie, X. Liu, and P. W. Grant, "Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors," Information Sciences, vol. 354, pp. 19–40, 2016.

[10] F. S. Samaria and A. C. Harter, "Some methods for classification and analysis of multivariate observations," in Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, Berkeley, CA, USA, 1967.

[11] S. Kant, T. L. Rao, and P. N. Sundaram, "An automatic and stable clustering algorithm," Pattern Recognition Letters, vol. 15, no. 6, pp. 543–549, 1994.

[12] D. Arthur and S. Vassilvitskii, "K-Means++: the advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 7–9, New Orleans, LA, USA, 2007.

[13] Y. Zhao, W. Halang, and X. Wang, "Rough ontology mapping in E-business integration," E-Service Intelligence, BMC Bioinf, vol. 8, pp. 75–93, 2007.

[14] Y. Xiao and J. Yu, "Semi-supervised clustering based on affinity propagation algorithm," ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, 2007.

[15] M. Ester, H. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 226–231, Portland, OR, USA, 1996.

[16] R. J. G. B. Campello, D. Moulavi, and J. Sander, "Density-based clustering based on hierarchical density estimates," Advances in Knowledge Discovery and Data Mining, vol. 7819, pp. 160–172, 2013.

[17] Z. Liang and P. Chen, "Delta-density based clustering with a divide-and-conquer strategy: 3DC clustering," Pattern Recognition Letters, vol. 73, pp. 52–59, 2016.

[18] M. Ankerst, M. M. Breuning, H. P. Kriegel, and J. Sander, "OPTICS: ordering points to identify the clustering structure," in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pp. 49–60, Philadelphia, PA, USA, 1999.

[19] M. Du, S. Ding, and H. Jia, "Study on density peaks clustering based on k-nearest neighbors and principal component analysis," Knowledge-Based Systems, vol. 99, pp. 135–145, 2016.

[20] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, 2014.

[21] T. Li, H. W. Ge, and S. Z. Su, "Density peaks clustering by automatic determination of cluster centers," Journal of Computer Science and Technology, vol. 10, no. 11, pp. 1614–1622, 2016.

[22] R. Liu, H. Wang, and X. Yu, "Shared-nearest-neighbor-based clustering by fast search and find of density peaks," Information Sciences, vol. 450, pp. 200–226, 2018.

[23] R. A. Jarvis and E. A. Patrick, "Clustering using a similarity measure based on shared near neighbors," IEEE Transactions on Computers, vol. C-22, no. 11, pp. 1025–1034, 1973.

[24] H. Chang and D.-Y. Yeung, "Robust path-based spectral clustering," Pattern Recognition, vol. 41, no. 1, pp. 191–203, 2008.

[25] L. Fu and E. Medico, "FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data," BMC Bioinformatics, vol. 8, no. 1, 2007.

[26] A. Gionis, H. Mannila, and P. Tsaparas, "Clustering aggregation," ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, p. 4, 2007.

[27] P. Franti, O. Virmajoki, and V. Hautamaki, "Fast agglomerative clustering using a k-nearest neighbor graph," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1875–1881, 2006.

[28] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 1994 IEEE Workshop on Applications of Computer Vision, pp. 138–142, Sarasota, FL, USA, 1994.

[29] K. Bache and M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, 2013.

[30] M. Charytanowicz, J. Niewczas, P. Kulczycki, P. A. Kowalski, S. Lukasik, and S. Zak, "Complete gradient clustering algorithm for features analysis of X-ray images," Information Technologies in Biomedicine, Advances in Intelligent and Soft Computing, vol. 69, Springer, Berlin, Germany.

[31] D. B. Dias, R. C. B. Madeo, T. Rocha, H. H. Biscaro, and S. M. Peres, "Hand movement recognition for Brazilian sign language: a study using distance-based neural networks," in Proceedings of the 2009 International Joint Conference on Neural Networks, pp. 697–704, Atlanta, GA, USA, 2009.

[32] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, Routledge, New York, NY, USA, 1st edition, 1984.

[33] V. G. Sigillito, S. P. Wing, L. V. Hutton, and K. B. Baker, "Classification of radar returns from the ionosphere using neural networks," Johns Hopkins APL, vol. 10, no. 3, pp. 262–266, 1989.

[34] W. N. Street, W. H. Wolberg, and O. L. Mangasarian, "Nuclear feature extraction for breast tumor diagnosis," in Proceedings of SPIE 1905, Biomedical Image Processing and Biomedical Visualization, San Jose, CA, USA, 1993.

[35] X. V. Nguyen, J. Epps, and J. Bailey, "Information theoretic measures for clusterings comparison: is a correction for chance necessary?" in Proceedings of ICML 2009, the 26th Annual International Conference on Machine Learning, Montreal, Canada, 2009.

[36] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, Burlington, MA, USA, 3rd edition, 2011.

Complexity 17

Page 4: ClusteringbyDetectingDensityPeaksandAssigningPointsby ... · FKNN-DPC [9] is given in (5) and (6), respectively: ρ i exp − 1 K X j∈knn(i) d2 ij ⎛⎝ ⎞⎠, (5) ρ i X j∈knn(i)

How to increase the δ value of a low-density cluster center so that it stands out in the decision graph is therefore a problem that needs to be considered.

2.2.2. Allocation Strategy with Low Fault Tolerance. The fault tolerance of the allocation strategy of the DPC algorithm is not satisfactory, mainly because the allocation of a point is affected more by the allocation of higher-density points than by its own density. Hence, if a high-density point is allocated incorrectly, the error directly propagates to the lower-density points allocated afterwards. Taking the Pathbased dataset [24] as an example, Figure 3 shows the clustering result obtained by the DPC algorithm using the "cutoff" kernel distance. It can be seen from the figure that the DPC algorithm can find suitable cluster centers, but the allocation of most points is incorrect. The same is true of the results obtained with the Gaussian kernel distance: the point assignments on the Pathbased dataset are similar to those of the "cutoff" kernel clustering. Therefore, the fault tolerance of the point allocation strategy should be further improved. Moreover, points are strongly affected by other points during the allocation, which is also an issue to be further addressed.

3. Proposed Method

In this section, the proposed DPC-SFSKNN algorithm is introduced in detail: its five main definitions are presented, the entire algorithmic process is described, and the complexity of the algorithm is analyzed.

3.1. The Main Idea of DPC-SFSKNN. The DPC algorithm relies on the distance between points to calculate the local density and the relative distance, and it is also very sensitive to the choice of the cutoff distance dc.

Figure 1: Results of the traditional DPC algorithm on the Jain dataset. (a) Clustering of Jain by DPC. (b) Ground truth.

Figure 2: ρ and δ values of the result of the traditional DPC algorithm on the Jain dataset. (a) Clustering of Jain by DPC. (b) Ground truth.


Hence, the DPC algorithm may not be able to process some complex datasets correctly. The probability that a point and its neighbors belong to the same cluster is high, so adding neighbor-related attributes to the clustering process can help to make a correct judgment. Therefore, we introduce the concept of the shared-nearest neighbor (SNN) proposed in [22] when defining the local density and the relative distance. Its basic idea is that two points are considered to be more similar if they have more common neighbors, as stated above (see equation (7)).

Based on the above ideas, we define the average distance dsnn(i, j) of the shared-nearest neighbors between point i and point j, and the similarity between the two points.

Definition 1 (average distance of SNN). For any points i and j in the dataset X, let SNN(i, j) be the shared-nearest neighbor set of the two points. The average distance of the SNN, dsnn(i, j), is expressed as

d_{snn}(i,j) = \frac{\sum_{p \in SNN(i,j)} \left( d_{ip} + d_{jp} \right)}{2S},    (11)

where point p is any point of SNN(i, j) and S is the number of members in the set SNN(i, j). By taking into account the distances from both points to their shared-nearest neighbors, dsnn(i, j) describes the spatial relationship between point i and point j more comprehensively.

Definition 2 (similarity). For any points i and j in the dataset X, the similarity Sim(i, j) between point i and point j can be expressed as

\mathrm{Sim}(i,j) = \frac{S}{K} \times 100,    (12)

where K is the number of nearest neighbors. K is selected from 4 to 40 until the optimal parameter appears. The lower bound is 4 because a smaller K may prevent the algorithm from terminating; for the upper bound, experiments show that a larger K does not significantly affect the results. The similarity is defined according to the aforementioned basic idea, "if they have more common neighbors, the two points are considered to be more similar," and it is described by the ratio of the number of shared-nearest neighbors to the number of nearest neighbors.
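To make equations (11) and (12) concrete, the following Python sketch (an illustration under our own naming, not the authors' released code) computes the K-nearest neighbors, the shared-nearest-neighbor sets, the average SNN distance dsnn(i, j), and the similarity Sim(i, j) for a small random dataset. The function names and the use of Euclidean distance are assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_sets(X, K):
    """Return the pairwise distance matrix and the K-nearest-neighbor set of each point."""
    D = cdist(X, X)                        # Euclidean distances
    order = np.argsort(D, axis=1)          # column 0 is the point itself
    return D, [set(order[i, 1:K + 1]) for i in range(len(X))]

def snn_distance_and_similarity(X, K):
    """Compute dsnn(i, j) from equation (11) and Sim(i, j) from equation (12)."""
    n = len(X)
    D, knn = knn_sets(X, K)
    dsnn = np.full((n, n), np.inf)         # inf where two points share no neighbors
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            shared = knn[i] & knn[j]       # SNN(i, j)
            S = len(shared)
            sim[i, j] = sim[j, i] = S / K * 100              # equation (12)
            if S > 0:                                        # equation (11)
                avg = sum(D[i, p] + D[j, p] for p in shared) / (2 * S)
                dsnn[i, j] = dsnn[j, i] = avg
    return dsnn, sim

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.random((30, 2))
    dsnn, sim = snn_distance_and_similarity(X, K=6)
    print(sim[0, 1], dsnn[0, 1])
```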

Definition 3 (K-nearest neighbor average distance). For any point i in the dataset X, let KNN(i) be its K-nearest neighbor set. The K-nearest neighbor average distance dknn(i) is expressed as

d_{knn}(i) = \frac{\sum_{p \in KNN(i)} d_{ip}}{K},    (13)

where point p is any point in KNN(i) and K is the number of nearest neighbors of any point. The K-nearest neighbor average distance describes the surrounding environment of a point to some extent; next, we use it to define the local density.

Definition 4 (local density). For any point i in the dataset X, the local density is expressed as

\rho_i = \sum_{j \in KNN(i)} \frac{S}{d_{knn}(i) + d_{knn}(j)},    (14)

where point j is a point in the set KNN(i), and dknn(i) and dknn(j) are the K-nearest neighbor average distances of point i and point j, respectively. In formula (14), the numerator (the number of shared-nearest neighbors, S) represents the similarity between the two points, and the denominator (the sum of the average distances) describes the environment around them.

Figure 3: Results of the traditional DPC algorithm on the Pathbased dataset.


When S is a constant, if the sum of the average distances (dknn(i) + dknn(j)) is small, the local density ρi of point i is large. Point j is one of the K-nearest neighbors of point i; when the values of dknn(i) and dknn(j) are both small, it means that i and j are closely surrounded by their neighbors. If dknn(i) has a larger value (the neighbors of point i are far away from it) or dknn(j) has a larger value (the neighbors of point j are far away from it), the local density of point i becomes smaller. Therefore, the local density of point i is large only when the average distances of both points are small. Moreover, when the sum of the average distances of the two points is fixed, the local density is large if the number of shared-nearest neighbors of the two points is large. A large number of shared neighbors indicates that the two points have a high similarity and a high probability of belonging to the same cluster. The more high-similarity points surround a point, the greater its local density and the greater its probability of becoming a cluster center. This is beneficial to low-density cluster centers: a large number of shared neighbors can compensate for the loss caused by their large distance from other points, so that their local density is not determined by distance alone. Next, we define the relative distance of the points.

Definition 5 (relative distance). For any point i in the dataset X, the relative distance can be expressed as

\delta_i = \begin{cases} \min_{j:\, \rho_j > \rho_i} \left[ d_{ij} + d_{knn}(i) + d_{knn}(j) \right], & \rho_i < \max_k (\rho_k), \\ \max_{j \in X \setminus \{i\}} (\delta_j), & \rho_i = \max_k (\rho_k), \end{cases}    (15)

where point j is one of the K-nearest neighbors of point i, dij is the distance between points i and j, and dknn(i) and dknn(j) are the K-nearest neighbor average distances of points i and j. We use the sum of the three distances to represent the relative distance. Compared with the DPC algorithm, which only uses dij as the relative distance, our definition adds the K-nearest neighbor average distances of the two points. The new definition can not only express the relative distance but is also friendlier to low-density cluster centers: for a fixed dij, the average distance to the nearest neighbors of a low-density point is relatively large, so its relative distance also increases, which raises the probability of low-density points being selected.

The DPC-SFSKNN cluster centers are selected in the same way as in the traditional DPC algorithm: the local density ρ and the relative distance δ are used to form a decision graph, and the n points with the largest local density and relative distance are selected as the cluster centers.
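The following sketch illustrates equations (13)-(15) and the decision-graph selection described above. It is a simplified illustration assuming Euclidean distances, not the authors' implementation; the function name and the toy data are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist

def density_and_delta(X, K):
    """Sketch of equations (13)-(15): dknn, local density rho, and relative distance delta."""
    n = len(X)
    D = cdist(X, X)
    order = np.argsort(D, axis=1)
    knn_idx = order[:, 1:K + 1]                                  # K nearest neighbors of each point
    knn_set = [set(row) for row in knn_idx]
    dknn = D[np.arange(n)[:, None], knn_idx].mean(axis=1)        # equation (13)

    rho = np.zeros(n)                                            # equation (14)
    for i in range(n):
        for j in knn_idx[i]:
            S = len(knn_set[i] & knn_set[j])                     # shared-nearest neighbors of i and j
            rho[i] += S / (dknn[i] + dknn[j])

    delta = np.zeros(n)                                          # equation (15)
    no_higher = []
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        if len(higher) == 0:
            no_higher.append(i)                                  # density peak: handled after the loop
        else:
            delta[i] = np.min(D[i, higher] + dknn[i] + dknn[higher])
    delta[no_higher] = delta.max()                               # the global peak gets the largest delta
    return rho, delta

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(1, 0.1, (50, 2))])
    rho, delta = density_and_delta(X, K=6)
    centers = np.argsort(rho * delta)[-2:]   # the 2 points that stand out in the decision graph
    print("chosen centers:", centers)
```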

For DPC-SFSKNN, the sum of the distances from the points of a low-density cluster to their K-nearest neighbors may be large; thus, these points receive a greater compensation for their δ value. Figures 4(a) and 4(b) show the results of DPC-SFSKNN on the Jain dataset [23]. Compared with Figure 2(b), the δ values of points in the upper branch are generally larger than those of the lower branch. This is because the density of the upper branch is significantly smaller and the distances from its points to their respective K-nearest neighbors are larger, so they receive a greater compensation. Even if the density is at a disadvantage, the higher δ value still makes the center of the upper branch stand out in the decision graph. This shows that the DPC-SFSKNN algorithm can correctly select low-density cluster centers.

3.2. Processes. The entire process of the algorithm is divided into two parts: the selection of cluster centers and the allocation of noncenter points. The main steps of DPC-SFSKNN and a detailed introduction of the proposed allocation strategy are given in Algorithm 1.

Line 9 of the DPC-SFSKNN algorithm establishes a weighted K-nearest neighbor graph, and Line 11 applies the K-nearest neighbor similarity-first search allocation strategy. To assign the noncenter points in the dataset, we designed a similarity-first search algorithm based on the weighted K-nearest neighbor graph. The algorithm uses the breadth-first search idea to find the cluster center with the highest similarity for a noncenter point. The similarities between a noncenter point and its K-nearest neighbors are sorted in ascending order; the neighbor with the highest similarity is selected as the next visited node and is pushed into the path queue. If the highest-similarity point is not unique, the point with the smallest SNN average distance is selected as the next visited node. The visited node in turn sorts the similarities of its own K-nearest neighbors and selects the next node to visit. The search stops when the visited node is a cluster center. Algorithm 2 describes the entire search process. Finally, each data point except the cluster centers is traversed to complete the assignment.

The similarity-first search algorithm is an optimization of breadth-first search tailored to the allocation requirements of noncenter points. Similarity is an important concept for clustering algorithms: points in the same cluster are similar to each other, and two points with a higher similarity have more common neighbors. Based on these ideas, the definition of similarity is proposed in (12). During the search, if similarity alone is used as the search criterion, the highest-similarity point is often not unique. Therefore, the algorithm uses the average distance of the SNN as a second criterion; a smaller dsnn means that the two points are closer in space.

The clustering results of the DPC-SFSKNN algorithm on the Pathbased dataset are shown in Figure 5. Figure 3 clearly shows that, although the traditional DPC algorithm can find a cluster center in each of the three clusters, there is a serious bias in the allocation of noncenter points. From Figure 5, we can see the effectiveness of the noncenter point allocation algorithm of DPC-SFSKNN. The allocation strategy uses similarity-first search to ensure that the similarity along the search path is the highest, and the gradual search toward the cluster center avoids taking points with low similarity as references. Besides, the similarity-first search allocation strategy based on the weighted K-nearest neighbor graph considers neighbor information: when the highest-similarity point is not unique, the point with the shortest average distance of the shared neighbors is selected as the next visited point.

3.3. Complexity Analysis. In this section, the complexity of the DPC-SFSKNN algorithm is analyzed, including time complexity and space complexity. Suppose the size of the dataset is n, the number of cluster centers is m, and the number of neighbors is k.

Figure 4: Result and δ values of the DPC-SFSKNN algorithm on the Jain dataset.

Require: dataset X, parameter K
Ensure: clustering result C
(1) Data preprocessing: normalize the data.
(2) Calculate the Euclidean distance between the points.
(3) Calculate the K-nearest neighbors of each point i ∈ X.
(4) Calculate the average distance of the K-nearest neighbors of each point, dknn(i), according to (13).
(5) Calculate the local density ρi of each point i ∈ X according to (14).
(6) Calculate the relative distance δi of each point i ∈ X according to (15).
(7) Find the cluster centers by analyzing the decision graph composed of ρ and δ, and collect the cluster centers in the set CC.
(8) Calculate the similarity between each point i and its K-nearest neighbors according to (12).
(9) Connect each point in the dataset X with its K-nearest neighbors, using the similarity as the connection weight, to construct a weighted K-nearest neighbor graph.
(10) Calculate the average distance of the SNN, dsnn(i, j), between each point i and its shared-nearest neighbors according to (11).
(11) Apply Algorithm 2 to allocate the remaining points.

ALGORITHM 1: DPC-SFSKNN.

Require: point w ∈ X, set of cluster centers CC, number of neighbors K, similarity matrix S_{n×n} = [sim(i, j)]_{n×n}, and SNN average distance matrix DSNN_{n×n} = [dsnn(i, j)]_{n×n}
Ensure: the cluster center in CC to which point w is assigned
(1) Initialize the descending queue Q and the path queue P. The K-nearest neighbors of point w are sorted in ascending order of similarity and pushed into Q. Push w into P.
(2) while the tail point of P ∉ CC do
(3)   if the highest-similarity point is unique then
(4)     Pop the point this at Q's tail.
(5)   else
(6)     Select the point this with the smallest DSNN.
(7)   end if
(8)   Empty the descending queue Q.
(9)   The K-nearest neighbors of this are sorted in ascending order of similarity and pushed into Q.
(10)  Push this into P.
(11) end while

ALGORITHM 2: Similarity-first search allocation strategy.
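A minimal Python sketch of the idea behind Algorithm 2 is given below. It is not the authors' code: the function name is hypothetical, and a visited set is added as a cycle guard that the pseudocode above does not include.

```python
def similarity_first_search(w, centers, knn_idx, sim, dsnn):
    """Walk from point w to a cluster center, always moving to the most similar
    unvisited K-nearest neighbor and breaking ties by the smallest SNN average distance.
    knn_idx[i] lists the K nearest neighbors of i; sim and dsnn are n-by-n matrices."""
    path = [w]
    current = w
    visited = {w}                                  # cycle guard, not part of the paper's pseudocode
    while current not in centers:
        neighbors = [j for j in knn_idx[current] if j not in visited]
        if not neighbors:                          # dead end: leave this point unassigned here
            return None, path
        best_sim = max(sim[current][j] for j in neighbors)
        candidates = [j for j in neighbors if sim[current][j] == best_sim]
        # similarity first; if the highest-similarity neighbor is not unique,
        # choose the candidate with the smallest SNN average distance
        current = min(candidates, key=lambda j: dsnn[current][j])
        visited.add(current)
        path.append(current)
    return current, path
```

In a full implementation, every noncenter point would be passed through this routine and labeled with the cluster of the center it reaches.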


3.3.1. Time Complexity. The time complexity analysis of DPC-SFSKNN is as follows.

Normalization requires a processing complexity of approximately O(n); the complexities of calculating the Euclidean distances and the similarities between points are O(n²); the complexity of computing the K-nearest neighbor average distance dknn is O(n²); similarly, the complexity of calculating the average distance dsnn between a point and its shared-nearest neighbors does not exceed O(n²); calculating the local density ρi and the distance δi of each point requires the KNN information of that point, with complexity O(kn), so the complexities of the local density ρ and the distance δ are O(kn²); the point allocation part is the search time of one point, and in the worst case searching all points requires O(n), so for the n points in the dataset the total time does not exceed O(n²). In summary, the total approximate time complexity of DPC-SFSKNN is O(kn²).

The time complexity of the DPC algorithm depends on the following three aspects: (a) the time to calculate the distances between points, (b) the time to calculate the local density ρi for each point i, and (c) the time to calculate the distance δi for each point i. The time complexity of each part is O(n²), so the total approximate time complexity of DPC is O(n²).

The time complexity of the DPC-SFSKNN algorithm is thus k times that of the traditional DPC algorithm. However, k is small compared with n, so it does not significantly affect the running time. In Section 4, it is demonstrated that the actual running time of DPC-SFSKNN does not exceed k times the running time of the traditional DPC algorithm.

3.3.2. Space Complexity. DPC-SFSKNN needs to calculate the distances and similarities between points, with complexity O(n²). Other data structures (such as the ρ and δ arrays and the various average distance arrays) are O(n). For the allocation strategy, the worst-case complexity is O(n²). The space complexity of DPC is O(n²), which mainly comes from the stored distance matrix. The space complexity of DPC-SFSKNN is therefore the same as that of traditional DPC, namely O(n²).

4. Experiments and Results

In this section, experiments are performed on several public datasets commonly used to test the performance of clustering algorithms, including synthetic datasets [23–27] and real datasets [28–34]. In order to observe the clustering ability of DPC-SFSKNN visually, DPC [20], DBSCAN [15], AP [8], FKNN-DPC [9], and K-means [10] are also tested for comparison. Three popular benchmarks are used to evaluate the performance of the clustering algorithms: the clustering accuracy (ACC), the adjusted mutual information (AMI), and the adjusted Rand index (ARI) [35]. The upper bound of all three benchmarks is 1, and the larger the benchmark value, the better the clustering effect. The codes for DPC, DBSCAN, and AP were provided based on the corresponding references.
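The three benchmarks can be reproduced in Python. AMI and ARI are available in scikit-learn; ACC is commonly computed via the best one-to-one matching between predicted and true labels (Hungarian algorithm), which is assumed here to correspond to the paper's ACC. The helper name is our own.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """ACC via the best one-to-one mapping between predicted and true labels."""
    true_ids = np.unique(y_true)
    pred_ids = np.unique(y_pred)
    overlap = np.zeros((len(pred_ids), len(true_ids)), dtype=int)
    for a, p in enumerate(pred_ids):
        for b, t in enumerate(true_ids):
            overlap[a, b] = np.sum((y_pred == p) & (y_true == t))
    row, col = linear_sum_assignment(-overlap)       # maximize the number of matched points
    return overlap[row, col].sum() / len(y_true)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])
print("ACC:", clustering_accuracy(y_true, y_pred))
print("AMI:", adjusted_mutual_info_score(y_true, y_pred))
print("ARI:", adjusted_rand_score(y_true, y_pred))
```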

Table 1 lists the synthetic datasets used in the experiments; they were published in [23–27]. Table 2 lists the real datasets used in the experiments, including the real-world datasets from [29–34] and the Olivetti face dataset [28].

To eliminate the influence of missing values and of differences between dimension ranges, the datasets need to be preprocessed before the experiments. We replace missing values by the mean of all valid values of the same dimension and normalize the data using the min-max normalization method shown in the following equation:

x'_{ij} = \frac{x_{ij} - \min(x_j)}{\max(x_j) - \min(x_j)},    (16)

where xij represents the original data located in the ith row and jth column, x'ij represents the rescaled value of xij, and xj represents the original data located in the jth column.

Min-max normalization processes each dimension of the data and preserves the relationships between the original data values [36], thereby decreasing the influence of the differences between dimensions and increasing the efficiency of the calculation.
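A short sketch of this preprocessing step (mean imputation followed by the min-max scaling of equation (16)) is shown below; the function name is our own.

```python
import numpy as np

def preprocess(X):
    """Impute missing values (NaN) with the column mean, then apply
    min-max normalization per column as in equation (16)."""
    X = np.array(X, dtype=float)
    col_mean = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_mean[cols]                    # replace missing values by the column mean
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)  # avoid division by zero
    return (X - col_min) / span

X = [[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]]
print(preprocess(X))
```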

To reflect the clustering results of the algorithms fairly, their parameters are adjusted so that their best clustering performance is retained. For the DPC-SFSKNN algorithm, the parameter K needs to be specified in advance, and the initial cluster centers are manually selected based on a decision graph composed of the local density ρ and the relative distance δ. It can be seen from the experimental results in Tables 3 and 4 that the value of parameter K is around 6, and that for datasets with a dense sample distribution the value of K is larger than 6. In addition to manually selecting the initial cluster centers, the traditional DPC algorithm also needs to determine the cutoff distance dc.

Figure 5: Results of the DPC-SFSKNN algorithm on the Pathbased dataset.


Based on the selection range provided in [20], dc is chosen so that the average number of neighbors is between 1% and 2% of the total number of data points. The two parameters that DBSCAN needs to determine are ε and minpts, as in [15]; their optimal values are determined using a circular search method. The AP algorithm only needs to determine a preference, and the larger the preference, the more center points are allowed to be selected [8]. No general method for selecting this parameter is effective, and only multiple experiments can be performed to select the optimal value. The only parameter of K-means is the number of clusters, and the true number of clusters in each dataset is used in this case. Similarly, FKNN-DPC needs to determine the number of nearest neighbors K.
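The circular search over DBSCAN's two parameters is not spelled out in the text; one simple way to reproduce such tuning is a grid search over ε and minpts scored by ARI on a labeled benchmark, as sketched below with a toy dataset standing in for those of Tables 1 and 2.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# A toy labeled dataset stands in for the benchmark datasets.
X, y_true = make_moons(n_samples=300, noise=0.06, random_state=0)

best_params, best_score = None, -1.0
for eps in np.arange(0.05, 0.5, 0.05):           # candidate epsilon values
    for min_samples in range(3, 11):             # candidate minpts values
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        score = adjusted_rand_score(y_true, labels)
        if score > best_score:
            best_params, best_score = (round(float(eps), 2), min_samples), score

print("best (eps, minpts):", best_params, "ARI:", round(best_score, 3))
```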

Table 2: Real-world datasets.

Dataset                  Records  Attributes  Clusters  Source
Iris                     150      4           3         [29]
Libras movement          360      90          15        [31]
Wine                     178      13          3         [29]
Parkinsons               197      23          2         [29]
WDBC                     569      30          2         [34]
Pima-Indians-diabetes    768      8           2         [29]
Segmentation             2310     19          7         [29]
Dermatology              366      33          6         [29]
Seeds                    210      7           3         [30]
Ionosphere               351      34          2         [33]
Waveform                 5000     21          3         [32]
Waveform (noise)         5000     40          3         [32]
Olivetti face            400      92×112      40        [28]

Table 1: Synthetic datasets.

Dataset      Records  Attributes  Clusters  Source
Pathbased    300      2           3         [24]
Jain         373      2           2         [23]
Flame        240      2           2         [25]
Aggregation  788      2           7         [26]
DIM512       1024     512         16        [27]
DIM1024      1024     1024        16        [27]

Table 3: Comparison of the ACC, AMI, and ARI benchmarks for 6 clustering algorithms on synthetic datasets ("—" means no value; parameter values are reproduced as printed in the source).

              Pathbased                             Jain
Algorithm     AMI     ARI     ACC     ECAC   Par    AMI     ARI     ACC     ECAC   Par
DPC-SFSKNN    0.926   0.910   0.925   3/3    6      1.000   1.000   1.000   2/2    7
DPC           0.521   0.463   0.742   3/3    2      0.609   0.713   0.853   2/2    3
DBSCAN        0.781   0.522   0.667   —      0056   0.883   0.985   0.918   —      00810
AP            0.679   0.475   0.783   3/3    10     0.681   0.812   0.882   2/2    40
FKNN-DPC      0.941   0.960   0.987   3/3    5      0.056   0.132   0.793   —      10
K-means       0.568   0.461   0.772   —      3      0.492   0.577   0.712   —      2

              Aggregation                           Flame
Algorithm     AMI     ARI     ACC     ECAC   Par    AMI     ARI     ACC     ECAC   Par
DPC-SFSKNN    0.942   0.951   0.963   7/7    6      0.873   0.934   0.956   2/2    6
DPC           1.000   1.000   1.000   7/7    4      1.000   1.000   1.000   2/2    5
DBSCAN        0.969   0.982   0.988   —      0058   0.867   0.936   0.981   —      0098
AP            0.795   0.753   0.841   7/7    77     0.452   0.534   0.876   3/3    35
FKNN-DPC      0.995   0.997   0.999   3/3    8      1.000   1.000   1.000   2/2    5
K-means       0.784   0.717   0.786   —      7      0.418   0.465   0.828   —      2

              DIM512                                DIM1024
Algorithm     AMI     ARI     ACC     ECAC   Par    AMI     ARI     ACC     ECAC   Par
DPC-SFSKNN    1.000   1.000   1.000   16/16  8      1.000   1.000   1.000   16/16  9
DPC           1.000   1.000   1.000   16/16  2      1.000   1.000   1.000   16/16  001
DBSCAN        1.000   1.000   1.000   —      037    1.000   1.000   1.000   —      108
AP            1.000   1.000   1.000   16/16  20     1.000   1.000   1.000   16/16  30
FKNN-DPC      1.000   1.000   1.000   16/16  8      1.000   1.000   1.000   16/16  10
K-means       0.895   0.811   0.850   —      1      0.868   0.752   0.796   —      16


4.1. Analysis of the Experimental Results on Synthetic Datasets. In this section, the performance of DPC-SFSKNN, DPC [20], DBSCAN [15], AP [8], FKNN-DPC [9], and K-means [10] is tested on the six synthetic datasets given in Table 1. These synthetic datasets differ in distribution and size, so different data situations can be simulated to compare the performance of the six algorithms. Table 3 shows the AMI, ARI, ACC, and ECAC of the clustering algorithms on the six synthetic datasets, where "—" means no value. Figures 6–9 show the clustering results of DPC-SFSKNN, DPC, DBSCAN, AP, FKNN-DPC, and K-means on the Pathbased, Flame, Aggregation, and Jain datasets, respectively. All algorithms achieve the optimal clustering on the DIM512 and DIM1024 datasets, so the clustering of these two datasets is not shown. Since the cluster centers of DBSCAN are relatively random, only the positions of the cluster centers of the other algorithms are marked.

Figure 6 shows the results on the Pathbased dataset. DPC-SFSKNN and FKNN-DPC can complete the clustering of the Pathbased dataset correctly. From Figures 6(b), 6(d), and 6(f), it can be seen that the clustering results of DPC, AP, and K-means are similar. The cluster centers selected by DPC, AP, DPC-SFSKNN, and FKNN-DPC are highly similar, but the clustering results of DPC and AP are not satisfactory. For the DPC algorithm, the low fault tolerance of its allocation strategy is the cause of this result.

Table 4: Comparison of the ACC, AMI, and ARI benchmarks for 6 clustering algorithms on real-world datasets ("—" means no value; parameter values are reproduced as printed in the source).

              Iris                                  Libras movement
Algorithm     AMI     ARI     ACC     ECAC   Par    AMI     ARI     ACC     ECAC   Par
DPC-SFSKNN    0.896   0.901   0.962   3/3    6      0.547   0.368   0.510   10/15  8
DPC           0.812   0.827   0.926   3/3    2      0.535   0.304   0.438   9/15   05
DBSCAN        0.792   0.754   0.893   —      0149   0.412   0.183   0.385   —      0965
AP            0.764   0.775   0.911   3/3    6      0.364   0.267   0.453   10/15  25
FKNN-DPC      0.912   0.922   0.973   3/3    7      0.508   0.308   0.436   10/15  9
K-means       0.683   0.662   0.823   —      3      0.522   0.306   0.449   —      15

              Wine                                  Parkinsons
Algorithm     AMI     ARI     ACC     ECAC   Par    AMI     ARI     ACC     ECAC   Par
DPC-SFSKNN    0.843   0.851   0.951   3/3    6      0.193   0.380   0.827   2/2    6
DPC           0.706   0.672   0.882   3/3    2      0.210   0.114   0.612   2/2    5
DBSCAN        0.612   0.643   0.856   —      04210  0.205   0.213   0.674   —      046
AP            0.592   0.544   0.781   3/3    6      0.142   0.127   0.669   2/2    15
FKNN-DPC      0.831   0.852   0.949   3/3    7      0.273   0.391   0.851   2/2    5
K-means       0.817   0.838   0.936   —      3      0.201   0.049   0.625   —      2

              WDBC                                  Ionosphere
Algorithm     AMI     ARI     ACC     ECAC   Par    AMI     ARI     ACC     ECAC   Par
DPC-SFSKNN    0.432   0.516   0.857   2/2    6      0.361   0.428   0.786   3/2    7
DPC           0.002   −0.004  0.602   2/2    9      0.238   0.276   0.681   3/2    065
DBSCAN        0.397   0.538   0.862   —      0277   0.544   0.683   0.853   —      027
AP            0.598   0.461   0.854   2/2    40     0.132   0.168   0.706   2/2    15
FKNN-DPC      0.679   0.786   0.944   2/2    7      0.284   0.355   0.752   2/2    8
K-means       0.611   0.730   0.928   —      2      0.129   0.178   0.712   —      2

              Segmentation                          Pima-Indians-diabetes
Algorithm     AMI     ARI     ACC     ECAC   Par    AMI     ARI     ACC     ECAC   Par
DPC-SFSKNN    0.665   0.562   0.746   6/7    6      0.037   0.083   0.652   2/2    6
DPC           0.650   0.550   0.684   6/7    3      0.033   0.075   0.647   2/2    4
DBSCAN        0.446   0.451   0.550   —      02510  0.028   0.041   0.577   —      0156
AP            0.405   0.436   0.554   7/7    25     0.045   0.089   0.629   3/2    35
FKNN-DPC      0.655   0.555   0.716   7/7    7      0.001   0.011   0.612   2/2    6
K-means       0.583   0.495   0.612   —      6      0.050   0.102   0.668   —      2

              Seeds                                 Dermatology
Algorithm     AMI     ARI     ACC     ECAC   Par    AMI     ARI     ACC     ECAC   Par
DPC-SFSKNN    0.753   0.786   0.919   3/3    7      0.862   0.753   0.808   7/6    6
DPC           0.727   0.760   0.918   3/3    2      0.611   0.514   0.703   4/6    2
DBSCAN        0.640   0.713   0.874   —      0178   0.689   0.690   0.815   —      073
AP            0.598   0.682   0.896   3/3    10     0.766   0.701   0.762   7/6    5
FKNN-DPC      0.759   0.790   0.924   3/3    8      0.847   0.718   0.768   7/6    7
K-means       0.671   0.705   0.890   —      3      0.796   0.680   0.702   —      6

              Waveform                              Waveform (noise)
Algorithm     AMI     ARI     ACC     ECAC   Par    AMI     ARI     ACC     ECAC   Par
DPC-SFSKNN    0.355   0.382   0.725   3/3    5      0.267   0.288   0.651   3/3    6
DPC           0.320   0.269   0.586   3/3    05     0.104   0.095   0.502   3/3    03
DBSCAN        —       —       —       —      —      —       —       —       —      —
AP            —       —       —       —      —      —       —       —       —      —
FKNN-DPC      0.324   0.350   0.703   3/3    5      0.247   0.253   0.648   3/3    5
K-means       0.363   0.254   0.501   —      3      0.364   0.252   0.512   —      3



Figure 6: The clustering of Pathbased by 6 clustering algorithms. (a) DPC-SFSKNN. (b) DPC. (c) DBSCAN. (d) AP. (e) FKNN-DPC. (f) K-means.


Figure 7: The clustering of Flame by 6 clustering algorithms. (a) DPC-SFSKNN. (b) DPC. (c) DBSCAN. (d) AP. (e) FKNN-DPC. (f) K-means.



A high-density point allocation error is transferred to lower-density points, and this error propagation seriously affects the clustering results. The AP and K-means algorithms are not good at dealing with irregular clusters: the two clusters in the middle attract the points on both sides of the semicircular cluster too strongly, which leads to clustering errors. DBSCAN can detect the semicircular cluster completely, but the semicircular cluster and the cluster on its left are incorrectly merged into one category, and the cluster on its right is divided into two clusters. The similarities between points and the manually prespecified parameters may severely affect the clustering. The DPC-SFSKNN and FKNN-DPC algorithms perform well on the Pathbased dataset; these improved algorithms, which take neighbor relationships into account, have a great advantage in handling such complex distributions.

Figure 7 shows the results on the Flame dataset. As shown in the figure, DPC-SFSKNN, DPC, FKNN-DPC, and DBSCAN can correctly detect the two clusters, while AP and K-means cannot cluster them completely correctly. Although AP can correctly identify the upper cluster and select an appropriate cluster center, the lower cluster is divided into two; both clusters found by K-means are wrong. The clustering results in Figure 8 show that the DPC-SFSKNN, DPC, FKNN-DPC, and DBSCAN algorithms can detect the 7 clusters in the Aggregation dataset, but AP and K-means still cannot cluster correctly. DPC-SFSKNN, DPC, and FKNN-DPC can identify the clusters and their centers. Although the cluster centers are not marked for DBSCAN, the number of clusters and the overall shape of each cluster are correct. The AP algorithm finds the correct number of clusters, but it chooses two centers for one cluster, which divides that cluster into two; the clustering result of K-means is similar to that of AP.

The Jain dataset, shown in Figure 9, consists of two semicircular clusters of different densities. As shown in the figure, the DPC-SFSKNN algorithm can completely cluster the two clusters of different densities. However, DPC, AP, FKNN-DPC, and K-means incorrectly assign the left end of the lower cluster to the upper cluster, and the cluster centers of DPC both lie on the lower cluster; compared with that, the distribution of the cluster centers of AP is more reasonable. The DBSCAN algorithm can accurately identify the lower cluster, but the left end of the upper cluster is incorrectly split off as a new cluster, so the upper cluster is divided into two.

According to the benchmark data shown in Table 3, the performance of DPC-SFSKNN is very competitive among the six clustering algorithms, especially on the Jain dataset. Although DPC and FKNN-DPC perform better than DPC-SFSKNN on the Aggregation and Flame datasets, DPC-SFSKNN can still find the correct cluster centers of Aggregation and complete the clustering task correctly.

4.2. Analysis of Experimental Results on Real-World Datasets. In this section, the performance of the algorithms is again benchmarked using AMI, ARI, ACC, and ECAC, and the clustering results are summarized in Table 4. Twelve real-world datasets are selected to test DPC-SFSKNN's ability to identify clusters on different kinds of data. DBSCAN and AP cannot obtain effective clustering results on Waveform and Waveform (noise); the symbol "—" represents no result.

As shown in Table 4, in terms of the AMI, ARI, and ACC benchmarks, DPC-SFSKNN outperforms the other five algorithms on the Wine, Segmentation, and Libras movement datasets, while FKNN-DPC performs best on the Iris, Seeds, Parkinsons, and WDBC datasets. The overall performance of DPC-SFSKNN is slightly better than that of DPC on 11 of the 12 datasets, the exception being Parkinsons: on Parkinsons, DPC-SFSKNN is slightly worse than DPC in AMI but better in ARI and ACC. Similarly, DPC-SFSKNN performs slightly better than FKNN-DPC on eight datasets, and slightly worse in AMI, ARI, and ACC on Iris, Parkinsons, WDBC, and Seeds. DBSCAN obtains the best results on Ionosphere, K-means is the best on Pima-Indians-diabetes, and K-means achieves the best AMI on the Waveform and Waveform (noise) datasets. In general, the clustering results of DPC-SFSKNN on real-world datasets are satisfactory.

Figure 8: The clustering of Aggregation by 6 clustering algorithms. (a) DPC-SFSKNN. (b) DPC. (c) DBSCAN. (d) AP. (e) FKNN-DPC. (f) K-means.


4.3. Experimental Analysis on the Olivetti Face Dataset. The Olivetti face dataset [28] is an image dataset widely used to test machine learning algorithms. Its purpose here is to test the clustering ability of an algorithm without supervision, including determining the number of clusters in the database and the members of each cluster. The dataset contains 40 clusters, each of which has 10 different images. Because the number of clusters (40) is large while each cluster contains only a few elements (10 images), the reliability of the local density becomes smaller, which is a great challenge for density-based clustering algorithms. To further test the clustering performance of DPC-SFSKNN, experiments were performed on the Olivetti face dataset and compared with DPC, AP, DBSCAN, FKNN-DPC, and K-means.

The clustering results achieved by DPC-SFSKNN and DPC on the Olivetti face dataset are shown in Figure 10, where white squares represent the cluster centers. The 32 clusters found by DPC-SFSKNN in Figure 10(a) and the 20 clusters found by DPC in Figure 10(b) are displayed in different colors, and gray images indicate that the image is not assigned to any cluster. It can be seen from Figure 10(a) that the 32 cluster centers found by DPC-SFSKNN cover 29 clusters, while, as shown in Figure 10(b), the 20 cluster centers found by DPC are scattered over 19 clusters. Similar to DPC-SFSKNN, DPC may divide one cluster into two. Because DPC-SFSKNN finds many more density peaks than DPC, it is more likely to identify one cluster as two different clusters; the same situation occurs with the FKNN-DPC algorithm.
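For readers who want to repeat this kind of experiment, the Olivetti faces can be loaded directly from scikit-learn (which ships a 64×64 version of the images rather than the original 92×112 resolution, and downloads the data on first use). Since the DPC-SFSKNN implementation is not published, a standard clusterer is used below purely as a stand-in to show the evaluation pipeline.

```python
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

faces = fetch_olivetti_faces()                    # 400 images of 40 subjects
X, y_true = faces.data, faces.target

X_reduced = PCA(n_components=50, random_state=0).fit_transform(X)   # reduce the 4096-d pixel vectors
labels = AgglomerativeClustering(n_clusters=40).fit_predict(X_reduced)  # stand-in clusterer, not DPC-SFSKNN

print("AMI:", round(adjusted_mutual_info_score(y_true, labels), 3))
print("ARI:", round(adjusted_rand_score(y_true, labels), 3))
```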

Figure 9: The clustering of Jain by 6 clustering algorithms. (a) DPC-SFSKNN. (b) DPC. (c) DBSCAN. (d) AP. (e) FKNN-DPC. (f) K-means.


However, the performance of FKNN-DPC is better than that of DPC-SFSKNN in AMI, ARI, ACC, and ECAC. Table 5 compares the clustering results of the algorithms on the basis of these criteria; the performance of DPC-SFSKNN is slightly superior to that of the other four algorithms, the exception being FKNN-DPC.

4.4. Running Time. This section compares the time performance of DPC-SFSKNN with DPC, DBSCAN, AP, FKNN-DPC, and K-means on the real-world datasets. The time complexities of DPC-SFSKNN and DPC were analyzed in Section 3.3.1: the time complexity of DPC is O(n²) and that of DPC-SFSKNN is O(kn²), where n is the size of the dataset. However, the time consumed by DPC mainly comes from calculating the local density and the relative distance of each point, while the time consumed by DPC-SFSKNN comes mainly from the calculation of the K-nearest neighbors and from the allocation strategy for noncenter points. Table 6 lists the running time (in seconds) of the six algorithms on the real datasets. It can be seen that, although the time complexity of DPC-SFSKNN is approximately k times that of DPC, their execution times on actual datasets do not differ by a factor of k.

From Table 6, it can be found that, on relatively small datasets, the running time of DPC-SFSKNN is about twice or more that of DPC, and the difference mainly comes from DPC-SFSKNN's allocation strategy. Although the computational cost of the local densities grows very quickly with the size of a dataset, the time consumed by the allocation strategy in DPC-SFSKNN varies with the distribution of the dataset. This leads to an irregular gap between the running times of DPC and DPC-SFSKNN.

FKNN-DPC has the same time and space complexity as DPC, but its running time is almost the same as that of DPC-SFSKNN, because calculating the K-nearest neighbor relationships takes a large share of the running time. The time complexities of DBSCAN and AP are approximately O(n²), and their parameters cannot be selected by simple methods. When the dataset is relatively large, it is difficult to find their optimal parameters, which may be the reason that these two algorithms produce no results on the Waveform datasets.

Figure 10: The clustering of Olivetti by two clustering algorithms. (a) DPC-SFSKNN. (b) DPC.

Table 5: Performance comparison of the algorithms by clustering criteria for the Olivetti face dataset (parameter values are reproduced as printed in the source).

Metric  DPC-SFSKNN  DPC    DBSCAN  AP     FKNN-DPC  K-means
ACC     0.786       0.665  0.648   0.763  0.818     0.681
AMI     0.792       0.728  0.691   0.737  0.832     0.742
ARI     0.669       0.560  0.526   0.619  0.714     0.585
ECAC    32/40       20/40  —       28/40  36/40     —
Par     6           05     064     21     4         40

Table 6: Running time (in seconds) of the 6 clustering algorithms on the UCI datasets.

Dataset                DPC-SFSKNN  DPC    DBSCAN  AP     FKNN-DPC  K-means
Iris                   0.241       0.049  0.059   0.565  0.148     0.014
Wine                   0.238       0.048  0.098   0.832  0.168     0.013
WDBC                   0.484       0.092  0.884   6.115  0.464     0.018
Seeds                  0.244       0.050  0.122   0.973  0.164     0.014
Libras movement        0.602       0.068  0.309   3.016  2.602     0.075
Ionosphere             0.325       0.064  0.349   2.018  0.309     0.021
Segmentation           1.569       0.806  8.727   6.679  0.313     0.062
Dermatology            0.309       0.063  0.513   2.185  0.409     0.007
Pima-Indians-diabetes  0.792       0.126  2.018   9.709  0.892     0.009
Parkinsons             0.255       0.048  0.114   0.866  0.263     0.003
Waveform               16.071      3.511  —       —      7.775     0.067
Waveform (noise)       17.571      3.784  —       —      7.525     0.109


The approximate time complexity of K-means is O(n), and Table 6 confirms its efficiency. K-means loses almost no speed while retaining reasonable accuracy, which makes it a very popular clustering algorithm, but it does not handle irregularly shaped data well.

5. Conclusions and Future Work

A new clustering algorithm based on the traditional DPC algorithm is proposed in this paper. It uses a density-peak search that takes the information of the surrounding neighbors into account and develops a new allocation strategy to detect the true distribution of the dataset. The proposed algorithm quickly finds the density peaks, that is, the cluster centers, of a dataset of any size and recognizes clusters of arbitrary shape and dimensionality. The algorithm is called DPC-SFSKNN, which means that it calculates the local density and the relative distance using distance information between points and their neighbors to find the cluster centers, and then assigns the remaining points with a similarity-first search over the weighted KNN graph that traces each point to its cluster center. DPC-SFSKNN successfully addresses several issues of the algorithm of Rodriguez and Laio [20], including its density metric and the potential problem hidden in its assignment strategy. The performance of DPC-SFSKNN was tested on several synthetic datasets, on real-world datasets from the UCI machine learning repository, and on the well-known Olivetti face dataset. The experimental results demonstrate that DPC-SFSKNN is powerful in finding cluster centers and in recognizing clusters regardless of their shape, of the dimensionality of the space in which they are embedded, and of the size of the dataset, and that it is robust to outliers; it performs much better than the original DPC algorithm. However, the proposed algorithm has some limitations: the parameter K needs to be adjusted manually for different datasets; the cluster centers still need to be selected manually by analyzing the decision graph (as in the DPC algorithm); and the allocation strategy improves the clustering accuracy but at an additional time cost. How to improve the degree of automation and the allocation efficiency of the algorithm needs further research.

Data Availability

The synthetic datasets are cited at the relevant places within the text as references [23–27]. The real-world datasets are cited at the relevant places within the text as references [29–34]. The Olivetti face dataset is cited at the relevant places within the text as reference [28].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (6160303040 and 61433003), in part by the Yunnan Applied Basic Research Project of China (201701CF00037), and in part by the Yunnan Provincial Science and Technology Department Key Research Program (Engineering) (2018BA070).

Supplementary Materials

The supplementary materials include the datasets used in the experiments in this paper. (Supplementary Materials)

References

[1] K. L. Liu, Y. L. Shang, Q. Ouyang, and W. D. Widanage, "A data-driven approach with uncertainty quantification for predicting future capacities and remaining useful life of lithium-ion battery," IEEE Transactions on Industrial Electronics, p. 1, 2020.
[2] X. P. Tang, K. L. Liu, X. Wang et al., "Model migration neural network for predicting battery aging trajectories," IEEE Transactions on Transportation Electrification, vol. 6, no. 2, pp. 363–374, 2020.
[3] X. Tang, K. Liu, X. Wang, B. Liu, F. Gao, and W. D. Widanage, "Real-time aging trajectory prediction using a base model-oriented gradient-correction particle filter for lithium-ion batteries," Journal of Power Sources, vol. 440, Article ID 227118, 2019.

[4] K. Liu, Y. Li, X. Hu, M. Lucu, and W. D. Widanage, "Gaussian process regression with automatic relevance determination kernel for calendar aging prediction of lithium-ion batteries," IEEE Transactions on Industrial Informatics, vol. 16, no. 6, pp. 3767–3777, 2020.
[5] K. Liu, X. Hu, Z. Wei, Y. Li, and Y. Jiang, "Modified Gaussian process regression models for cyclic capacity prediction of lithium-ion batteries," IEEE Transactions on Transportation Electrification, vol. 5, no. 4, pp. 1225–1236, 2019.
[6] L. Cai, J. Meng, D.-I. Stroe, G. Luo, and R. Teodorescu, "An evolutionary framework for lithium-ion battery state of health estimation," Journal of Power Sources, vol. 412, pp. 615–622, 2019.
[7] L. Cai, J. H. Meng, D. I. Stroe et al., "Multi-objective optimization of data-driven model for lithium-ion battery SOH estimation with short-term feature," IEEE Transactions on Power Electronics, p. 1, 2020.
[8] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972–976, 2007.
[9] J. Xie, H. Gao, W. Xie, X. Liu, and P. W. Grant, "Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors," Information Sciences, vol. 354, pp. 19–40, 2016.
[10] F. S. Samaria and A. C. Harter, "Some methods for classification and analysis of multivariate observations," in Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, Berkeley, CA, USA, 1967.
[11] S. Kant, T. L. Rao, and P. N. Sundaram, "An automatic and stable clustering algorithm," Pattern Recognition Letters, vol. 15, no. 6, pp. 543–549, 1994.
[12] D. Arthur and S. Vassilvitskii, "K-Means++: the advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 7–9, New Orleans, LA, USA, 2007.
[13] Y. Zhao, W. Halang, and X. Wang, "Rough ontology mapping in E-business integration," E-Service Intelligence, vol. 8, pp. 75–93, 2007.
[14] Y. Xiao and J. Yu, "Semi-supervised clustering based on affinity propagation algorithm," ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, 2007.
[15] M. Ester, H. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 226–231, Portland, OR, USA, 1996.
[16] R. J. G. B. Campello, D. Moulavi, and J. Sander, "Density-based clustering based on hierarchical density estimates," Advances in Knowledge Discovery and Data Mining, vol. 7819, pp. 160–172, 2013.
[17] Z. Liang and P. Chen, "Delta-density based clustering with a divide-and-conquer strategy: 3DC clustering," Pattern Recognition Letters, vol. 73, pp. 52–59, 2016.
[18] M. Ankerst, M. M. Breuning, H. P. Kriegel, and J. Sander, "OPTICS: ordering points to identify the clustering structure," in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pp. 49–60, Philadelphia, PA, USA, 1999.
[19] M. Du, S. Ding, and H. Jia, "Study on density peaks clustering based on k-nearest neighbors and principal component analysis," Knowledge-Based Systems, vol. 99, pp. 135–145, 2016.
[20] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, 2014.
[21] T. Li, H. W. Ge, and S. Z. Su, "Density peaks clustering by automatic determination of cluster centers," Journal of Computer Science and Technology, vol. 10, no. 11, pp. 1614–1622, 2016.
[22] R. Liu, H. Wang, and X. Yu, "Shared-nearest-neighbor-based clustering by fast search and find of density peaks," Information Sciences, vol. 450, pp. 200–226, 2018.
[23] R. A. Jarvis and E. A. Patrick, "Clustering using a similarity measure based on shared near neighbors," IEEE Transactions on Computers, vol. C-22, no. 11, pp. 1025–1034, 1973.
[24] H. Chang and D.-Y. Yeung, "Robust path-based spectral clustering," Pattern Recognition, vol. 41, no. 1, pp. 191–203, 2008.
[25] L. Fu and E. Medico, "FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data," BMC Bioinformatics, vol. 8, no. 1, 2007.
[26] A. Gionis, H. Mannila, and P. Tsaparas, "Clustering aggregation," ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, p. 4, 2007.
[27] P. Franti, O. Virmajoki, and V. Hautamaki, "Fast agglomerative clustering using a k-nearest neighbor graph," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1875–1881, 2006.
[28] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 1994 IEEE Workshop on Applications of Computer Vision, pp. 138–142, Sarasota, FL, USA, 1994.
[29] K. Bache and M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, 2013.
[30] M. Charytanowicz, J. Niewczas, P. Kulczycki, P. A. Kowalski, S. Lukasik, and S. Zak, "Complete gradient clustering algorithm for features analysis of X-ray images," Information Technologies in Biomedicine, Advances in Intelligent and Soft Computing, vol. 69, Springer, Berlin, Germany.
[31] D. B. Dias, R. C. B. Madeo, T. Rocha, H. H. Biscaro, and S. M. Peres, "Hand movement recognition for Brazilian sign language: a study using distance-based neural networks," in Proceedings of the 2009 International Joint Conference on Neural Networks, pp. 697–704, Atlanta, GA, USA, 2009.
[32] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, Routledge, New York, NY, USA, 1st edition, 1984.
[33] V. G. Sigillito, S. P. Wing, L. V. Hutton, and K. B. Baker, "Classification of radar returns from the ionosphere using neural networks," Johns Hopkins APL Technical Digest, vol. 10, no. 3, pp. 262–266, 1989.
[34] W. N. Street, W. H. Wolberg, and O. L. Mangasarian, "Nuclear feature extraction for breast tumor diagnosis," in Proceedings of the SPIE 1905, Biomedical Image Processing and Biomedical Visualization, San Jose, CA, USA, 1993.
[35] X. V. Nguyen, J. Epps, and J. Bailey, "Information theoretic measures for clusterings comparison: is a correction for chance necessary?" in Proceedings of the ICML 2009, the 26th Annual International Conference on Machine Learning, Montreal, Canada, 2009.
[36] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, the Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, Burlington, MA, USA, 3rd edition, 2011.


Page 5: ClusteringbyDetectingDensityPeaksandAssigningPointsby ... · FKNN-DPC [9] is given in (5) and (6), respectively: ρ i exp − 1 K X j∈knn(i) d2 ij ⎛⎝ ⎞⎠, (5) ρ i X j∈knn(i)

the choice of the cutoff distance dc Hence the DPC algo-rithmmay not be able to correctly process for some complexdatasets +e probability that a point and its neighborsbelong to the same cluster is high Adding attributes relatedto neighbors in the clustering process can help to make acorrect judgment +erefore we introduce the concept ofshared-nearest neighbor (SNN) proposed in [22] whendefining the local density and the relative distance Its basicidea is that if they have more common neighbors the twopoints are considered to be more similar as said above (seeequation (7))

Based on the above ideas we define the average distancedsnn(i j) of the shared-nearest neighbor between point i andpoint j and the similarity between the two points

Definition 1 (average distance of SNN) For any points i andj in the dataset X the shared-nearest neighbor set of twopoints is SNN(i j) and the average distance of SNNdsnn(i j) is expressed as

dsnn(i j) 1113936pisinSNN(ij) dip + djp1113872 1113873

2S (11)

where point p is any point of SNN(i j) and S is the numberof members in the set SNN(i j) dsnn(i j) shows the spatialrelationship between point i and point j more compre-hensively by calculating the distances between two pointsand shared-nearest neighbor points

Definition 2 (similarity) For any points i and j in the datasetX the similarity Sim(i j) between point i and j can beexpressed as

Sim(i j) S

Klowast100 (12)

where K is the number of nearest neighbors K is selectedfrom 4 to 40 until the optimal parameter appears +e lower

bound is 4 because a smaller K may cause the algorithm tobecome endless For the upper bound it is found by ex-periments that a large K will not significantly affect theresults of the algorithm +e similarity is defined accordingto the aforementioned basic idea ldquoif they have more com-mon neighbors the two points are considered to be moresimilarrdquo and the similarity is described using the ratio of thenumber of shared-nearest neighbors to the number ofnearest neighbors

Definition 3 (K-nearest neighbor average distance) For anypoint i in the dataset X its K-nearest neighbor set is KNN(i)and then the expression of K-nearest neighbor averagedistance dknn(i) is as follows

dknn(i) 1113936pisinknn(i)dip

K (13)

where point p is any point in KNN(i) and the number ofnearest neighbors of any point is K K-nearest neighboraverage distance can describe the surrounding environmentof a point to some extent Next we use it to describe localdensity

Definition 4 (local density) For any point i in the dataset Xthe local density expression is

ρi 1113944jisinknn(i)

S

dknn(i) + dknn(j) (14)

where point j is a point in the set KNN(i) and dknn(i) anddknn(j) are the K-nearest neighbor average distances ofpoint i and point j respectively In formula (14) the nu-merator (the number of shared-nearest neighbor S) repre-sents the similarity between the two points and thedenominator (the sum of the average distances) describesthe environment around them When S is a constant and if

02 04 06 08 100

01

02

03

04

05

06

07

08

09

1

(a)

02 04 06 08 100

01

02

03

04

05

06

07

08

09

1

(b)

Figure 3 Results of the traditional DPC algorithm on the Pathbased dataset

Complexity 5

the value of the sum of the average distances(dknn(i) + dknn(j)) is small the local density ρi of point i islarge Point j is one of the K-nearest neighbors of point iWhen the values of dknn(i) and dknn(j) are small it means i

and j are closely surrounded by their neighbors If dknn(i)

has a larger value (point j is far away from point i) or dknn(j)

has a larger value (when the neighboring points of thedistance are far away from the point j) the local density ofthe point i becomes smaller +erefore only the averagedistances of the two points are small and it can be expressedthat the local density of point i is large Moreover when thesum of the average distances of the two points is constantand if the number of shared-nearest neighbors of the twopoints is large the local density is large A large number ofshared neighbors indicate that the two points have a highsimilarity and a high probability of belonging to the samecluster +e higher the similarity points around a point thegreater its local density and the greater the probability ofbecoming a cluster center +is is beneficial to those low-density clustering centers A large number of sharedneighbors can compensate for the loss caused by their largedistance from other points so that their local density is notonly affected by distance Next we define the relative dis-tance of the points

Definition 5 (relative distance) For any point i in the datasetX the relative distance can be expressed as

δi

minj ρj gt ρi

dij + dknn(i) + dknn(j)1113960 1113961 ρi lt maxk

ρk( 1113857

maxjisin(Xminusi)

δi( 1113857 ρi maxk

ρk( 1113857

⎧⎪⎪⎨

⎪⎪⎩

(15)

where point j is one of the K-nearest neighbors of point idij is the distance between points i and j and dknn(i) anddknn(j) are the average distance from the nearest neighbor ofpoints i and j We can use the sum of the three distances torepresent the relative distance Compared to the DPC al-gorithm which only uses dij to represent the relative dis-tance we define the concept of relative distance andK-nearest neighbor average distances of two points +e newdefinition can not only express the relative distance but alsobe more friendly to low-density cluster centers Under thecondition of constant dij the average distance of the nearestneighbors of the low-density points is relatively large and itsrelative distance will also increase which can increase theprobability of low-density points being selected

+e DPC-SFSKNN clustering center is selected in thesame way as the traditional DPC algorithm+e local densityρ and relative distance δ are used to form a decision graph+e n points with the largest local density and relativedistance are selected as the clustering centers

For DPC-SFSKNN the sum of the distances from pointsof a low-density cluster to their K-neighbors may be largethus they receive a greater compensation for their δ valueFigures 4(a) and 4(b) show the results of DPC-SFSKNN onthe Jain dataset [23] Compared to Figure 2(b) the δ valuesof points in the upper branch are generally larger than thoseof the lower branch +is is because the density of the upper

branch is significantly smaller and the distances from thepoints to their respective K-nearest neighbors are largerthus they receive a greater compensation Even if the densityis at a disadvantage the higher δ value still makes the centerof the upper branch distinguished in the decision graph+isshows that the DPC-SFSKNN algorithm can correctly selectlow-density clustering centers

32 Processes +e entire process of the algorithm is dividedinto two parts the selection of clustering centers and theallocation of noncenter points +e main step of our DPC-SFSKNN and a detailed introduction of the proposed al-location strategy are given in Algorithm 1

Line 9 of the DPC-SFSKNN algorithm establishes aweighted K-nearest neighbor graph and Line 11 is aK-nearest neighbor similarity search allocation strategy Toassign noncenter points in the dataset we designed asimilarity-first search algorithm based on the weightedK-nearest neighbor graph +e algorithm uses the breadth-first search idea to find the cluster center with the highestsimilarity for the noncenter point +e similarity of non-center points and their K-nearest neighbors is sorted in anascending order the neighbor point with the highest sim-ilarity is selected as the next visited node and it is pushedinto the path queue If the highest similarity point is notunique the point with the smallest SNN average distance isselected as the next visited node+e visiting node also needsto sort the similarity of its K-nearest neighbors and select thenext visiting node +e search stops until the visited node isthe cluster center point Algorithm 2 describes the entiresearch process Finally each data point except the clustercenters is traversed to complete the assignment

The similarity-first search algorithm is an optimization of breadth-first search tailored to the allocation requirements of noncenter points. Similarity is an important concept for clustering algorithms: points in the same cluster are similar to each other, and two points with a higher similarity share more of the same neighbors. Based on these ideas, the definition of similarity is given in (12). If only similarity were used as the search criterion, the highest-similarity point would often not be unique. Therefore, the algorithm uses the average SNN distance as a second criterion; a smaller dsnn means that the two points are closer in space.
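A hedged sketch of the similarity-first search for a single noncenter point is shown below, mirroring the criteria just described (similarity first, SNN average distance as the tie-breaker); a visited set is added to guard against revisiting points, which the pseudocode leaves implicit, and all names (knn, sim, d_snn, centers, labels) are our own.

```python
def assign_point(w, knn, sim, d_snn, centers, labels):
    """Walk from noncenter point w along highest-similarity edges of the weighted
    KNN graph until a cluster center is reached; return that center's label."""
    current, visited = w, {w}
    while current not in centers:
        candidates = [j for j in knn[current] if j not in visited]
        if not candidates:                              # dead end on this path
            return None
        best_sim = max(sim[current][j] for j in candidates)
        best = [j for j in candidates if sim[current][j] == best_sim]
        if len(best) > 1:                               # tie: prefer the smallest SNN distance
            best.sort(key=lambda j: d_snn[current][j])
        current = best[0]
        visited.add(current)
    return labels[current]
```

Every point outside the center set would be passed through such a routine to complete the assignment.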

The clustering results of the DPC-SFSKNN algorithm on the Pathbased dataset are shown in Figure 5. Figure 3 clearly shows that, although the traditional DPC algorithm can find a cluster center in each of the three clusters, there is a serious bias in the allocation of noncenter points. From Figure 5, we can see the effectiveness of the noncenter point allocation algorithm of DPC-SFSKNN. The allocation strategy uses similarity-first search to ensure that the similarity along the search path is the highest, and the search proceeds gradually toward the cluster center, avoiding the use of low-similarity points as references. Besides, the similarity-first search allocation strategy based on the weighted K-nearest neighbor graph also considers neighbor


information: when the point with the highest similarity is not unique, the point with the shortest average shared-neighbor distance is selected as the next visited point.

3.3. Complexity Analysis. In this section, the complexities of the DPC-SFSKNN algorithm are analyzed, including time complexity and space complexity. Suppose the size of the

Figure 4: Result (a) and δ value (b) of the DPC-SFSKNN algorithm on the Jain dataset.

Require: dataset X, parameter K
Ensure: clustering result C
(1) Data preprocessing: normalize the data.
(2) Calculate the Euclidean distance between the points.
(3) Calculate the K-nearest neighbors of each point i ∈ X.
(4) Calculate the average distance of the K-nearest neighbors of each point, dknn(i), according to (13).
(5) Calculate the local density ρi of each point i ∈ X according to (14).
(6) Calculate the relative distance δi of each point i ∈ X according to (15).
(7) Find the cluster centers by analyzing the decision graph composed of ρ and δ, and use the cluster centers as the set CC.
(8) Calculate the similarity between each point i and its K-nearest neighbors according to (12).
(9) Connect each point in the dataset X with its K-nearest neighbors, using the similarity as the connection weight, to construct a weighted K-nearest neighbor graph.
(10) Calculate the average SNN distance dsnn(i, j) between each point i and its shared-nearest neighbors according to (11).
(11) Apply Algorithm 2 to allocate the remaining points.

ALGORITHM 1 DPC-SFSKNN

Require: point w ∈ X, set of cluster centers CC, number of neighbors K, similarity matrix S_{n×n} = [sim(i, j)]_{n×n}, and SNN average distance matrix DSNN_{n×n} = [dsnn(i, j)]_{n×n}
Ensure: point w ∈ CC
(1) Initialize the descending queue Q and the path queue P. The K-nearest neighbors of point w are sorted in ascending order of similarity and pushed into Q. Push w into P.
(2) while the tail point of P ∉ CC do
(3)   if the highest-similarity point is unique then
(4)     Pop the point this at Q's tail
(5)   else
(6)     Select the point this with the smallest DSNN
(7)   end if
(8)   Empty the descending queue Q
(9)   The K-nearest neighbors of this are sorted in ascending order of similarity and pushed into Q
(10)  Push this into P
(11) end while

ALGORITHM 2 Similarity-first search allocation strategy


dataset is n, the number of cluster centers is m, and the number of neighbors is k.

3.3.1. Time Complexity. The time complexity analysis of DPC-SFSKNN is as follows.

Normalization requires a processing complexity of approximately O(n). The complexities of calculating the Euclidean distance and the similarity between points are O(n2). The complexity of computing the K-nearest-neighbor average distance dknn is O(n2); similarly, the complexity of computing the average distance dsnn between a point and its shared-nearest neighbors does not exceed O(n2). Calculating the local density ρi and the distance δi of each point requires acquiring the KNN information, which costs O(kn) per point, so the complexities of the local density ρ and the distance δ are O(kn2). For the point allocation part, the search for one point requires O(n) in the worst case; since there are n points in the dataset, the total time does not exceed O(n2). In summary, the total approximate time complexity of DPC-SFSKNN is O(kn2).

The time complexity of the DPC algorithm depends on the following three aspects: (a) the time to calculate the distances between points, (b) the time to calculate the local density ρi for each point i, and (c) the time to calculate the distance δi for each point i. The time complexity of each part is O(n2), so the total approximate time complexity of DPC is O(n2).

The time complexity of the DPC-SFSKNN algorithm is thus k times that of the traditional DPC algorithm. However, k is small compared with n and therefore does not significantly affect the running time. In Section 4, it is demonstrated that the actual running time of DPC-SFSKNN does not exceed k times the running time of the traditional DPC algorithm.

3.3.2. Space Complexity. DPC-SFSKNN needs to calculate the distance and similarity between points, with complexity O(n2). Other data structures (such as the ρ and δ arrays and the various average distance arrays) are O(n). For the allocation strategy, the complexity in the worst case is O(n2). The space complexity of DPC is O(n2), which is mainly due to the stored distance matrix. The space complexity of our DPC-SFSKNN is therefore the same as that of traditional DPC, namely O(n2).

4. Experiments and Results

In this section, experiments are performed on several public datasets commonly used to test the performance of clustering algorithms, including synthetic datasets [23–27] and real datasets [28–34]. In order to visually observe the clustering ability of DPC-SFSKNN, the DPC [20], DBSCAN [15], AP [8], FKNN-DPC [9], and K-means [10] methods are all tested for comparison. Three popular benchmarks are used to evaluate the performance of the above clustering algorithms: the clustering accuracy (ACC), the adjusted mutual information (AMI), and the adjusted Rand index (ARI) [35]. The upper bound of each of the three benchmarks is 1, and the larger the benchmark value, the better the clustering effect. The codes for DPC, DBSCAN, and AP were provided based on the corresponding references.
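For reference, AMI and ARI are available in scikit-learn, and ACC can be computed by matching predicted labels to the ground truth with the Hungarian algorithm; the sketch below (our own, using scikit-learn and SciPy, which the paper does not mention) assumes integer labels starting at 0.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

def clustering_acc(y_true, y_pred):
    """Clustering accuracy: fraction of points correctly labeled under the best
    one-to-one mapping between predicted clusters and true classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    rows, cols = linear_sum_assignment(-cost)   # maximize the number of matched points
    return cost[rows, cols].sum() / y_true.size

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]
print(clustering_acc(y_true, y_pred))           # 1.0 (labels agree up to a permutation)
print(adjusted_mutual_info_score(y_true, y_pred))
print(adjusted_rand_score(y_true, y_pred))
```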

Table 1 lists the synthetic datasets used in the experiments. These datasets were published in [23–27]. Table 2 lists the real datasets used in the experiments. These datasets include the real-world datasets from [29–34] and the Olivetti face dataset from [28].

To eliminate the influence of missing values and of differences in the ranges of the different dimensions, the datasets need to be preprocessed before proceeding to the experiments. We replace each missing value by the mean of all valid values of the same dimension and normalize the data using the min-max normalization method shown in the following equation:

$$
\bar{x}_{ij} = \frac{x_{ij} - \min (x_j)}{\max (x_j) - \min (x_j)},
\tag{16}
$$

where $x_{ij}$ represents the original value located in the ith row and jth column, $\bar{x}_{ij}$ represents the rescaled value of $x_{ij}$, and $x_j$ represents the original data in the jth column.

The min-max normalization method processes each dimension of the data and preserves the relationships between the original data values [36], thereby decreasing the influence of differences in dimensions and increasing the efficiency of the calculation.
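The preprocessing described above could be sketched as follows; the small epsilon that guards against constant columns is our addition and is not discussed in the paper.

```python
import numpy as np

def preprocess(X):
    """Mean-impute missing values per column, then rescale each column to [0, 1] as in (16)."""
    X = np.array(X, dtype=float)
    col_mean = np.nanmean(X, axis=0)                 # mean of the valid values per column
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_mean[cols]                   # replace missing values by the column mean
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    return (X - col_min) / np.maximum(col_max - col_min, 1e-12)

X = [[1.0, 200.0], [2.0, float("nan")], [3.0, 400.0]]
print(preprocess(X))                                 # every column lies in [0, 1]
```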

To fairly reflect the clustering results of the six algorithms, their parameters are adjusted to ensure that their satisfactory clustering performance can be retained. For the DPC-SFSKNN algorithm, the parameter K needs to be specified in advance, and the initial clustering centers are manually selected based on a decision graph composed of the local density ρ and the relative distance δ. It can be seen from the experimental results in Tables 3 and 4 that the value of parameter K is around 6, and that the value of K for datasets with a dense sample distribution is larger than 6. In addition to manually selecting the initial clustering centers, the traditional DPC algorithm also needs

Figure 5: Results of the DPC-SFSKNN algorithm on the Pathbased dataset.


to determine dc. Based on the provided selection range, dc is selected so that the average number of neighbors is between 1% and 2% of the total number of data points [20]. The two parameters that DBSCAN needs to determine are ε and minpts, as in [15]; the optimal parameters are determined using a circular search method. The AP algorithm only needs to determine a preference, and the larger the preference, the more center points are allowed to be selected [8]. A general method for selecting this parameter is not available, so multiple experiments have to be performed to select the optimal value. The only parameter of K-means is the number of clusters, and the true number of clusters in the dataset is used in this case. Similarly, FKNN-DPC needs to determine the number of nearest neighbors K.

Table 2: Real-world datasets.

Dataset                  Records  Attributes  Clusters  Source
Iris                     150      4           3         [29]
Libras movement          360      90          15        [31]
Wine                     178      13          3         [29]
Parkinsons               197      23          2         [29]
WDBC                     569      30          2         [34]
Pima-Indians-diabetes    768      8           2         [29]
Segmentation             2310     19          7         [29]
Dermatology              366      33          6         [29]
Seeds                    210      7           3         [30]
Ionosphere               351      34          2         [33]
Waveform                 5000     21          3         [32]
Waveform (noise)         5000     40          3         [32]
Olivetti face            400      92×112      40        [28]

Table 1: Synthetic datasets.

Dataset      Records  Attributes  Clusters  Source
Pathbased    300      2           3         [24]
Jain         373      2           2         [23]
Flame        240      2           2         [25]
Aggregation  788      2           7         [26]
DIM512       1024     512         16        [27]
DIM1024      1024     1024        16        [27]

Table 3: The comparison of ACC, AMI, and ARI benchmarks for 6 clustering algorithms on synthetic datasets ("—" means no value; ECAC is shown as detected centers/actual clusters; Par values are reproduced as reported).

Pathbased:
Algorithm     AMI     ARI     ACC     ECAC    Par
DPC-SFSKNN    0.926   0.910   0.925   3/3     6
DPC           0.521   0.463   0.742   3/3     2
DBSCAN        0.781   0.522   0.667   —       0056
AP            0.679   0.475   0.783   3/3     10
FKNN-DPC      0.941   0.960   0.987   3/3     5
K-means       0.568   0.461   0.772   —       3

Jain:
Algorithm     AMI     ARI     ACC     ECAC    Par
DPC-SFSKNN    1.000   1.000   1.000   2/2     7
DPC           0.609   0.713   0.853   2/2     3
DBSCAN        0.883   0.985   0.918   —       00810
AP            0.681   0.812   0.882   2/2     40
FKNN-DPC      0.056   0.132   0.793   —       10
K-means       0.492   0.577   0.712   —       2

Aggregation:
Algorithm     AMI     ARI     ACC     ECAC    Par
DPC-SFSKNN    0.942   0.951   0.963   7/7     6
DPC           1.000   1.000   1.000   7/7     4
DBSCAN        0.969   0.982   0.988   —       0058
AP            0.795   0.753   0.841   7/7     77
FKNN-DPC      0.995   0.997   0.999   3/3     8
K-means       0.784   0.717   0.786   —       7

Flame:
Algorithm     AMI     ARI     ACC     ECAC    Par
DPC-SFSKNN    0.873   0.934   0.956   2/2     6
DPC           1.000   1.000   1.000   2/2     5
DBSCAN        0.867   0.936   0.981   —       0098
AP            0.452   0.534   0.876   3/3     35
FKNN-DPC      1.000   1.000   1.000   2/2     5
K-means       0.418   0.465   0.828   —       2

DIM512:
Algorithm     AMI     ARI     ACC     ECAC    Par
DPC-SFSKNN    1.000   1.000   1.000   16/16   8
DPC           1.000   1.000   1.000   16/16   2
DBSCAN        1.000   1.000   1.000   —       037
AP            1.000   1.000   1.000   16/16   20
FKNN-DPC      1.000   1.000   1.000   16/16   8
K-means       0.895   0.811   0.850   —       1

DIM1024:
Algorithm     AMI     ARI     ACC     ECAC    Par
DPC-SFSKNN    1.000   1.000   1.000   16/16   9
DPC           1.000   1.000   1.000   16/16   001
DBSCAN        1.000   1.000   1.000   —       108
AP            1.000   1.000   1.000   16/16   30
FKNN-DPC      1.000   1.000   1.000   16/16   10
K-means       0.868   0.752   0.796   —       16


4.1. Analysis of the Experimental Results on Synthetic Datasets. In this section, the performance of DPC-SFSKNN, DPC [20], DBSCAN [15], AP [8], FKNN-DPC [9], and K-means [10] is tested with the six synthetic datasets given in Table 1. These synthetic datasets differ in distribution and size, so different data situations can be simulated to compare the performance of the six algorithms. Table 3 shows the AMI, ARI, ACC, and ECAC of the six clustering algorithms on the six synthetic datasets, where the best results are shown in bold and "—" means no value. Figures 6–9 show the clustering results of DPC-SFSKNN, DPC, DBSCAN, AP, FKNN-DPC, and K-means on the Pathbased, Flame, Aggregation, and Jain datasets, respectively. Five of the algorithms achieve the optimal clustering on the DIM512 and DIM1024 datasets, so the clustering of these two datasets is not shown. Since the cluster centers of DBSCAN are relatively random, only the positions of the clustering centers of the other algorithms are marked.

Figure 6 shows the results on the Pathbased dataset. DPC-SFSKNN and FKNN-DPC can complete the clustering of the Pathbased dataset correctly. From Figures 6(b), 6(d), and 6(f), it can be seen that the clustering results of DPC, AP, and K-means are similar. The clustering centers selected by DPC, AP, DPC-SFSKNN, and FKNN-DPC are highly similar, but the clustering results of DPC and AP are not satisfactory. For the DPC algorithm, the low fault tolerance of its allocation strategy is the cause of this result: a

Table 4: Comparison of ACC, AMI, and ARI benchmarks for 6 clustering algorithms on real-world datasets ("—" means no value; ECAC is shown as detected centers/actual clusters; Par values are reproduced as reported).

Iris:
Algorithm     AMI     ARI     ACC     ECAC    Par
DPC-SFSKNN    0.896   0.901   0.962   3/3     6
DPC           0.812   0.827   0.926   3/3     2
DBSCAN        0.792   0.754   0.893   —       0149
AP            0.764   0.775   0.911   3/3     6
FKNN-DPC      0.912   0.922   0.973   3/3     7
K-means       0.683   0.662   0.823   —       3

Libras movement:
Algorithm     AMI     ARI     ACC     ECAC    Par
DPC-SFSKNN    0.547   0.368   0.510   10/15   8
DPC           0.535   0.304   0.438   9/15    05
DBSCAN        0.412   0.183   0.385   —       0965
AP            0.364   0.267   0.453   10/15   25
FKNN-DPC      0.508   0.308   0.436   10/15   9
K-means       0.522   0.306   0.449   —       15

Wine:
Algorithm     AMI     ARI     ACC     ECAC    Par
DPC-SFSKNN    0.843   0.851   0.951   3/3     6
DPC           0.706   0.672   0.882   3/3     2
DBSCAN        0.612   0.643   0.856   —       04210
AP            0.592   0.544   0.781   3/3     6
FKNN-DPC      0.831   0.852   0.949   3/3     7
K-means       0.817   0.838   0.936   —       3

Parkinsons:
Algorithm     AMI     ARI     ACC     ECAC    Par
DPC-SFSKNN    0.193   0.380   0.827   2/2     6
DPC           0.210   0.114   0.612   2/2     5
DBSCAN        0.205   0.213   0.674   —       046
AP            0.142   0.127   0.669   2/2     15
FKNN-DPC      0.273   0.391   0.851   2/2     5
K-means       0.201   0.049   0.625   —       2

WDBC:
Algorithm     AMI     ARI     ACC     ECAC    Par
DPC-SFSKNN    0.432   0.516   0.857   2/2     6
DPC           0.002   -0.004  0.602   2/2     9
DBSCAN        0.397   0.538   0.862   —       0277
AP            0.598   0.461   0.854   2/2     40
FKNN-DPC      0.679   0.786   0.944   2/2     7
K-means       0.611   0.730   0.928   —       2

Ionosphere:
Algorithm     AMI     ARI     ACC     ECAC    Par
DPC-SFSKNN    0.361   0.428   0.786   3/2     7
DPC           0.238   0.276   0.681   3/2     065
DBSCAN        0.544   0.683   0.853   —       027
AP            0.132   0.168   0.706   2/2     15
FKNN-DPC      0.284   0.355   0.752   2/2     8
K-means       0.129   0.178   0.712   —       2

Segmentation:
Algorithm     AMI     ARI     ACC     ECAC    Par
DPC-SFSKNN    0.665   0.562   0.746   6/7     6
DPC           0.650   0.550   0.684   6/7     3
DBSCAN        0.446   0.451   0.550   —       02510
AP            0.405   0.436   0.554   7/7     25
FKNN-DPC      0.655   0.555   0.716   7/7     7
K-means       0.583   0.495   0.612   —       6

Pima-Indians-diabetes:
Algorithm     AMI     ARI     ACC     ECAC    Par
DPC-SFSKNN    0.037   0.083   0.652   2/2     6
DPC           0.033   0.075   0.647   2/2     4
DBSCAN        0.028   0.041   0.577   —       0156
AP            0.045   0.089   0.629   3/2     35
FKNN-DPC      0.001   0.011   0.612   2/2     6
K-means       0.050   0.102   0.668   —       2

Seeds:
Algorithm     AMI     ARI     ACC     ECAC    Par
DPC-SFSKNN    0.753   0.786   0.919   3/3     7
DPC           0.727   0.760   0.918   3/3     2
DBSCAN        0.640   0.713   0.874   —       0178
AP            0.598   0.682   0.896   3/3     10
FKNN-DPC      0.759   0.790   0.924   3/3     8
K-means       0.671   0.705   0.890   —       3

Dermatology:
Algorithm     AMI     ARI     ACC     ECAC    Par
DPC-SFSKNN    0.862   0.753   0.808   7/6     6
DPC           0.611   0.514   0.703   4/6     2
DBSCAN        0.689   0.690   0.815   —       073
AP            0.766   0.701   0.762   7/6     5
FKNN-DPC      0.847   0.718   0.768   7/6     7
K-means       0.796   0.680   0.702   —       6

Waveform:
Algorithm     AMI     ARI     ACC     ECAC    Par
DPC-SFSKNN    0.355   0.382   0.725   3/3     5
DPC           0.320   0.269   0.586   3/3     05
DBSCAN        —       —       —       —       —
AP            —       —       —       —       —
FKNN-DPC      0.324   0.350   0.703   3/3     5
K-means       0.363   0.254   0.501   —       3

Waveform (noise):
Algorithm     AMI     ARI     ACC     ECAC    Par
DPC-SFSKNN    0.267   0.288   0.651   3/3     6
DPC           0.104   0.095   0.502   3/3     03
DBSCAN        —       —       —       —       —
AP            —       —       —       —       —
FKNN-DPC      0.247   0.253   0.648   3/3     5
K-means       0.364   0.252   0.512   —       3



Figure 6: The clustering of Pathbased by 6 clustering algorithms: (a) DPC-SFSKNN, (b) DPC, (c) DBSCAN, (d) AP, (e) FKNN-DPC, and (f) K-means.


Figure 7: The clustering of Flame by 6 clustering algorithms: (a) DPC-SFSKNN, (b) DPC, (c) DBSCAN, (d) AP, (e) FKNN-DPC, and (f) K-means.



high-density point allocation error will be transferred to low-density points, and this error propagation seriously affects the clustering results. The AP and K-means algorithms are not good at dealing with irregular clusters: the two clusters in the middle are too attractive to the points on both sides of the semicircular cluster, which leads to clustering errors. DBSCAN can detect the semicircular cluster completely, but the semicircular cluster and the cluster on the left of the middle are incorrectly classified into one category, and the cluster on the right of the middle is divided into two clusters; the similarities between points and the manually prespecified parameters may severely affect the clustering. DPC-SFSKNN and FKNN-DPC perform well on the Pathbased dataset. These improved algorithms, which consider neighbor relationships, have a great advantage in handling such complexly distributed datasets.

Figure 7 shows the results of the six algorithms on the Flame dataset. As shown in the figure, DPC-SFSKNN, DPC, FKNN-DPC, and DBSCAN can correctly detect the two clusters, while AP and K-means cannot cluster completely correctly. Although AP can correctly identify the upper cluster and select an appropriate cluster center, the lower cluster is divided into two clusters; both clusters found by K-means are wrong. The clustering results in Figure 8 show that the DPC-SFSKNN, DPC, FKNN-DPC, and DBSCAN algorithms can detect the 7 clusters in the Aggregation dataset, but AP and K-means still cannot cluster correctly. DPC-SFSKNN, DPC, and FKNN-DPC can identify the clusters and the centers. Although the cluster centers are not marked for DBSCAN, the number of clusters and the overall shape of each cluster are correct. The AP algorithm successfully finds the correct number of clusters, but it chooses two centers for one cluster, which divides that cluster into two; the clustering result of K-means is similar to that of AP.

The Jain dataset, shown in Figure 9, consists of two semicircular clusters of different densities. As shown in the figure, the DPC-SFSKNN algorithm can correctly cluster the two clusters of different densities. However, DPC, AP, FKNN-DPC, and K-means incorrectly assign the left end of the lower cluster to the upper cluster, and the cluster centers of DPC both lie on the lower cluster. Compared with that, the distribution of the cluster centers of AP is more reasonable. The DBSCAN algorithm can accurately identify the lower cluster, but the left end of the upper cluster is incorrectly split off as a new cluster, so that the upper cluster is divided into two clusters.

According to the benchmark data shown in Table 3, it is clear that the performance of DPC-SFSKNN is very effective among the six clustering algorithms, especially on the Jain dataset. Although DPC and FKNN-DPC perform better than DPC-SFSKNN on the Aggregation and Flame datasets, DPC-SFSKNN can still find the correct clustering centers of the Aggregation dataset and complete the clustering task correctly.

4.2. Analysis of Experimental Results on Real-World Datasets. In this section, the performance of the six algorithms is again benchmarked according to AMI, ARI, ACC, and ECAC, and the clustering results are summarized in Table 4. Twelve real-world datasets are selected to test DPC-SFSKNN's ability to identify clusters on different datasets. The DBSCAN and AP algorithms cannot obtain effective clustering results on Waveform and Waveform (noise); the symbol "—" represents no result.

As shown in Table 4, in terms of the benchmarks AMI, ARI, and ACC, DPC-SFSKNN outperforms the five other algorithms on the Wine, Segmentation, and Libras movement datasets. At the same time, FKNN-DPC performs better than the other five algorithms on the Iris, Seeds, Parkinsons, and WDBC datasets. It can be seen that the overall performance of DPC-SFSKNN is slightly better than that of DPC on 11 datasets, the exception being Parkinsons; on Parkinsons, DPC-SFSKNN is slightly worse than DPC in AMI but better than DPC in ARI and ACC. Similarly, DPC-SFSKNN performs slightly better than FKNN-DPC on eight of the datasets, the exceptions being Iris, Parkinsons, WDBC, and Seeds, on which DPC-SFSKNN is slightly worse than FKNN-DPC in AMI, ARI, and ACC. DBSCAN obtains the best results on Ionosphere, K-means is the best on Pima-Indians-diabetes, and K-means is also the best in AMI on the Waveform and Waveform (noise) datasets. In general, the clustering results of DPC-SFSKNN on real-world datasets are satisfactory.

Figure 8: The clustering of Aggregation by 6 clustering algorithms: (a) DPC-SFSKNN, (b) DPC, (c) DBSCAN, (d) AP, (e) FKNN-DPC, and (f) K-means.


4.3. Experimental Analysis on the Olivetti Face Dataset. The Olivetti face dataset [28] is an image dataset widely used to test machine learning algorithms. Its purpose here is to examine the clustering behaviour of the algorithm without supervision, including determining the number of clusters in the database and the members of each cluster. The dataset contains 40 clusters, each of which has 10 different images. Because the number of clusters (40) is large while each cluster contains only 10 images, the reliability of the local density becomes smaller, which is a great challenge for density-based clustering algorithms. To further test the clustering performance of DPC-SFSKNN, experiments were performed on the Olivetti face database and compared with DPC, AP, DBSCAN, FKNN-DPC, and K-means.

The clustering results achieved by DPC-SFSKNN and DPC on the Olivetti face database are shown in Figure 10, where white squares represent the cluster centers. The 32 clusters found by DPC-SFSKNN in Figure 10(a) and the 20 clusters found by DPC in Figure 10(b) are displayed in different colors; gray images indicate that the image is not assigned to any cluster. It can be seen from Figure 10(a) that the 32 cluster centers found by DPC-SFSKNN cover 29 clusters, while, as shown in Figure 10(b), the 20 cluster centers found by DPC are scattered over 19 clusters. Similar to DPC-SFSKNN, DPC may divide one cluster into two clusters. Because DPC-SFSKNN can find many more density peaks than DPC, it is more likely to identify one cluster as two different clusters; the same situation occurs with the FKNN-DPC algorithm. However,

Figure 9: The clustering of Jain by 6 clustering algorithms: (a) DPC-SFSKNN, (b) DPC, (c) DBSCAN, (d) AP, (e) FKNN-DPC, and (f) K-means.


the performance of FKNN-DPC is better than that of DPC-SFSKNN in AMI, ARI, ACC, and ECAC. Table 5 compares the clustering results of all the algorithms on the basis of AMI, ARI, ACC, and ECAC. The performance of DPC-SFSKNN is slightly superior to that of the other four algorithms, FKNN-DPC excepted.

4.4. Running Time. This section compares the time performance of DPC-SFSKNN with DPC, DBSCAN, AP, FKNN-DPC, and K-means on the real-world datasets. The time complexities of DPC-SFSKNN and DPC were analyzed in Section 3.3.1: the time complexity of DPC is O(n2) and that of DPC-SFSKNN is O(kn2), where n is the size of the dataset. However, the time consumed by DPC mainly comes from calculating the local density and the relative distance of each point, while the time consumed by DPC-SFSKNN comes mainly from the calculation of the K-nearest neighbors and from the allocation strategy for noncenter points. Table 6 lists the running times (in seconds) of the six algorithms on the real datasets. It can be seen that, although the time complexity of DPC-SFSKNN is approximately k times that of DPC, their execution times on actual datasets do not differ by a factor of k.

In Table 6, it can be found that, on relatively small datasets, the running time of DPC-SFSKNN is about twice or more that of DPC, and the difference mainly comes from DPC-SFSKNN's allocation strategy. While the computational load of the local densities grows very quickly with the size of a dataset, the time consumed by the allocation strategy in DPC-SFSKNN varies with the distribution of the dataset. This leads to an irregular gap between the running times of DPC and DPC-SFSKNN.

FKNN-DPC has the same time and space complexity as DPC, but its running time is almost the same as that of DPC-SFSKNN, since computing the K-nearest-neighbor relationships takes a large share of the running time. The time complexities of DBSCAN and AP are approximately O(n2), and the parameters of neither can be determined by simple methods. When the dataset is relatively large, it is difficult to find their optimal parameters, which may be the reason that

Figure 10: The clustering of Olivetti by two clustering algorithms: (a) DPC-SFSKNN and (b) DPC.

Table 5: Performance comparison of algorithms by clustering criteria for the Olivetti face database (ECAC is shown as detected centers/actual clusters; Par values are reproduced as reported).

Metric  DPC-SFSKNN  DPC    DBSCAN  AP     FKNN-DPC  K-means
ACC     0.786       0.665  0.648   0.763  0.818     0.681
AMI     0.792       0.728  0.691   0.737  0.832     0.742
ARI     0.669       0.560  0.526   0.619  0.714     0.585
ECAC    32/40       20/40  —       28/40  36/40     —
Par     6           05     064     21     4         40


the two algorithms have no running results on the Waveform datasets. The approximate time complexity of K-means is O(n), and Table 6 confirms its efficiency. K-means loses almost no accuracy while being fast, which makes it a very popular clustering algorithm, but it does not handle irregularly shaped data well.

5. Conclusions and Future Work

A new clustering algorithm based on the traditional DPC algorithm is proposed in this paper. The algorithm uses a density-peak search that takes the surrounding neighbor information into account and develops a new allocation strategy to detect the true distribution of the dataset. The proposed clustering algorithm performs a fast search, finds the density peaks, that is, the cluster centers, of a dataset of any size, and recognizes clusters of arbitrary shape and dimensionality. The algorithm is called DPC-SFSKNN: it calculates the local density and the relative distance by using distance information between points and their neighbors to find the cluster centers, and it then assigns the remaining points with a similarity-first search over the weighted KNN graph that finds the owner (cluster center) of each point. DPC-SFSKNN successfully addresses several issues arising from the clustering algorithm of Alex Rodriguez and Alessandro Laio [20], including its density metric and the potential issue hidden in its assignment strategy. The performance of DPC-SFSKNN was tested on several synthetic datasets, on real-world datasets from the UCI machine learning repository, and on the well-known Olivetti face database. The experimental results on these datasets demonstrate that our DPC-SFSKNN is powerful in finding cluster centers and in recognizing clusters regardless of their shape, of the dimensionality of the space in which they are embedded, and of the size of the datasets, and that it is robust to outliers. It performs much better than the original DPC algorithm. However, the proposed algorithm has some limitations: the parameter K needs to be adjusted manually for different datasets; the clustering centers still need to be selected manually by analyzing the decision graph (as in the DPC algorithm); and the allocation strategy improves the clustering accuracy but takes additional time. How to improve the degree of automation and the allocation efficiency of the algorithm needs further research.

Data Availability

The synthetic datasets are cited at relevant places within the text as references [23–27]. The real-world datasets are cited at relevant places within the text as references [29–34]. The Olivetti face dataset is cited at relevant places within the text as reference [28].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (6160303040 and 61433003), in part by the Yunnan Applied Basic Research Project of China (201701CF00037), and in part by the Yunnan Provincial Science and Technology Department Key Research Program (Engineering) (2018BA070).

Supplementary Materials

It includes the datasets used in the experiments in this paper. (Supplementary Materials)

References

[1] K. L. Liu, Y. L. Shang, Q. Ouyang, and W. D. Widanage, "A data-driven approach with uncertainty quantification for predicting future capacities and remaining useful life of lithium-ion battery," IEEE Transactions on Industrial Electronics, p. 1, 2020.

[2] X. P. Tang, K. L. Liu, X. Wang et al., "Model migration neural network for predicting battery aging trajectories," IEEE Transactions on Transportation Electrification, vol. 6, no. 2, pp. 363–374, 2020.

[3] X. Tang, K. Liu, X. Wang, B. Liu, F. Gao, and W. D. Widanage, "Real-time aging trajectory prediction using a base model-oriented gradient-correction particle filter for lithium-ion batteries," Journal of Power Sources, vol. 440, Article ID 227118, 2019.

Table 6: Running time (in seconds) of 6 clustering algorithms on UCI datasets.

Dataset                  DPC-SFSKNN  DPC    DBSCAN  AP     FKNN-DPC  K-means
Iris                     0.241       0.049  0.059   0.565  0.148     0.014
Wine                     0.238       0.048  0.098   0.832  0.168     0.013
WDBC                     0.484       0.092  0.884   6.115  0.464     0.018
Seeds                    0.244       0.050  0.122   0.973  0.164     0.014
Libras movement          0.602       0.068  0.309   3.016  2.602     0.075
Ionosphere               0.325       0.064  0.349   2.018  0.309     0.021
Segmentation             1.569       0.806  8.727   6.679  0.313     0.062
Dermatology              0.309       0.063  0.513   2.185  0.409     0.007
Pima-Indians-diabetes    0.792       0.126  2.018   9.709  0.892     0.009
Parkinsons               0.255       0.048  0.114   0.866  0.263     0.003
Waveform                 16.071      3.511  —       —      7.775     0.067
Waveform (noise)         17.571      3.784  —       —      7.525     0.109


[4] K. Liu, Y. Li, X. Hu, M. Lucu, and W. D. Widanage, "Gaussian process regression with automatic relevance determination kernel for calendar aging prediction of lithium-ion batteries," IEEE Transactions on Industrial Informatics, vol. 16, no. 6, pp. 3767–3777, 2020.

[5] K. Liu, X. Hu, Z. Wei, Y. Li, and Y. Jiang, "Modified Gaussian process regression models for cyclic capacity prediction of lithium-ion batteries," IEEE Transactions on Transportation Electrification, vol. 5, no. 4, pp. 1225–1236, 2019.

[6] L. Cai, J. Meng, D.-I. Stroe, G. Luo, and R. Teodorescu, "An evolutionary framework for lithium-ion battery state of health estimation," Journal of Power Sources, vol. 412, pp. 615–622, 2019.

[7] L. Cai, J. H. Meng, D. I. Stroe et al., "Multi-objective optimization of data-driven model for lithium-ion battery SOH estimation with short-term feature," IEEE Transactions on Power Electronics, p. 1, 2020.

[8] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972–976, 2007.

[9] J. Xie, H. Gao, W. Xie, X. Liu, and P. W. Grant, "Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors," Information Sciences, vol. 354, pp. 19–40, 2016.

[10] F. S. Samaria and A. C. Harter, "Some methods for classification and analysis of multivariate observations," in Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, Berkeley, CA, USA, 1967.

[11] S. Kant, T. L. Rao, and P. N. Sundaram, "An automatic and stable clustering algorithm," Pattern Recognition Letters, vol. 15, no. 6, pp. 543–549, 1994.

[12] D. Arthur and S. Vassilvitskii, "K-Means++: the advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 7–9, New Orleans, LA, USA, 2007.

[13] Y. Zhao, W. Halang, and X. Wang, "Rough ontology mapping in E-business integration," E-Service Intelligence, BMC Bioinf, vol. 8, pp. 75–93, 2007.

[14] Y. Xiao and J. Yu, "Semi-supervised clustering based on affinity propagation algorithm," ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, 2007.

[15] M. Ester, H. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 226–231, Portland, OR, USA, 1996.

[16] R. J. G. B. Campello, D. Moulavi, and J. Sander, "Density-based clustering based on hierarchical density estimates," Advances in Knowledge Discovery and Data Mining, vol. 7819, pp. 160–172, 2013.

[17] Z. Liang and P. Chen, "Delta-density based clustering with a divide-and-conquer strategy: 3DC clustering," Pattern Recognition Letters, vol. 73, pp. 52–59, 2016.

[18] M. Ankerst, M. M. Breuning, H. P. Kriegel, and J. Sander, "OPTICS: ordering points to identify the clustering structure," in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pp. 49–60, Philadelphia, PA, USA, 1999.

[19] M. Du, S. Ding, and H. Jia, "Study on density peaks clustering based on k-nearest neighbors and principal component analysis," Knowledge-Based Systems, vol. 99, pp. 135–145, 2016.

[20] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, 2014.

[21] T. Li, H. W. Ge, and S. Z. Su, "Density peaks clustering by automatic determination of cluster centers," Journal of Computer Science and Technology, vol. 10, no. 11, pp. 1614–1622, 2016.

[22] R. Liu, H. Wang, and X. Yu, "Shared-nearest-neighbor-based clustering by fast search and find of density peaks," Information Sciences, vol. 450, pp. 200–226, 2018.

[23] R. A. Jarvis and E. A. Patrick, "Clustering using a similarity measure based on shared near neighbors," IEEE Transactions on Computers, vol. C-22, no. 11, pp. 1025–1034, 1973.

[24] H. Chang and D.-Y. Yeung, "Robust path-based spectral clustering," Pattern Recognition, vol. 41, no. 1, pp. 191–203, 2008.

[25] L. Fu and E. Medico, "FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data," BMC Bioinformatics, vol. 8, no. 1, 2007.

[26] A. Gionis, H. Mannila, and P. Tsaparas, "Clustering aggregation," ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, p. 4, 2007.

[27] P. Franti, O. Virmajoki, and V. Hautamaki, "Fast agglomerative clustering using a k-nearest neighbor graph," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1875–1881, 2006.

[28] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 1994 IEEE Workshop on Applications of Computer Vision, pp. 138–142, Sarasota, FL, USA, 1994.

[29] K. Bache and M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, 2013.

[30] M. Charytanowicz, J. Niewczas, P. Kulczycki, P. A. Kowalski, S. Lukasik, and S. Zak, "Complete gradient clustering algorithm for features analysis of X-ray images," Information Technologies in Biomedicine, Advances in Intelligent and Soft Computing, vol. 69, Springer, Berlin, Germany.

[31] D. B. Dias, R. C. B. Madeo, T. Rocha, H. H. Biscaro, and S. M. Peres, "Hand movement recognition for Brazilian sign language: a study using distance-based neural networks," in Proceedings of the 2009 International Joint Conference on Neural Networks, pp. 697–704, Atlanta, GA, USA, 2009.

[32] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, Routledge, New York, NY, USA, 1st edition, 1984.

[33] V. G. Sigillito, S. P. Wing, L. V. Hutton, and K. B. Baker, "Classification of radar returns from the ionosphere using neural networks," Johns Hopkins APL, vol. 10, no. 3, pp. 262–266, 1989.

[34] W. N. Street, W. H. Wolberg, and O. L. Mangasarian, "Nuclear feature extraction for breast tumor diagnosis," in Proceedings of the SPIE 1905, Biomedical Image Processing and Biomedical Visualization, San Jose, CA, USA, 1993.

[35] X. V. Nguyen, J. Epps, and J. Bailey, "Information theoretic measures for clusterings comparison: is a correction for chance necessary?" in Proceedings of the ICML 2009, the 26th Annual International Conference on Machine Learning, Montreal, Canada, 2009.

[36] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, the Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, Burlington, MA, USA, 3rd edition, 2011.

Complexity 17

Page 6: ClusteringbyDetectingDensityPeaksandAssigningPointsby ... · FKNN-DPC [9] is given in (5) and (6), respectively: ρ i exp − 1 K X j∈knn(i) d2 ij ⎛⎝ ⎞⎠, (5) ρ i X j∈knn(i)

the value of the sum of the average distances(dknn(i) + dknn(j)) is small the local density ρi of point i islarge Point j is one of the K-nearest neighbors of point iWhen the values of dknn(i) and dknn(j) are small it means i

and j are closely surrounded by their neighbors If dknn(i)

has a larger value (point j is far away from point i) or dknn(j)

has a larger value (when the neighboring points of thedistance are far away from the point j) the local density ofthe point i becomes smaller +erefore only the averagedistances of the two points are small and it can be expressedthat the local density of point i is large Moreover when thesum of the average distances of the two points is constantand if the number of shared-nearest neighbors of the twopoints is large the local density is large A large number ofshared neighbors indicate that the two points have a highsimilarity and a high probability of belonging to the samecluster +e higher the similarity points around a point thegreater its local density and the greater the probability ofbecoming a cluster center +is is beneficial to those low-density clustering centers A large number of sharedneighbors can compensate for the loss caused by their largedistance from other points so that their local density is notonly affected by distance Next we define the relative dis-tance of the points

Definition 5 (relative distance) For any point i in the datasetX the relative distance can be expressed as

δi

minj ρj gt ρi

dij + dknn(i) + dknn(j)1113960 1113961 ρi lt maxk

ρk( 1113857

maxjisin(Xminusi)

δi( 1113857 ρi maxk

ρk( 1113857

⎧⎪⎪⎨

⎪⎪⎩

(15)

where point j is one of the K-nearest neighbors of point idij is the distance between points i and j and dknn(i) anddknn(j) are the average distance from the nearest neighbor ofpoints i and j We can use the sum of the three distances torepresent the relative distance Compared to the DPC al-gorithm which only uses dij to represent the relative dis-tance we define the concept of relative distance andK-nearest neighbor average distances of two points +e newdefinition can not only express the relative distance but alsobe more friendly to low-density cluster centers Under thecondition of constant dij the average distance of the nearestneighbors of the low-density points is relatively large and itsrelative distance will also increase which can increase theprobability of low-density points being selected

+e DPC-SFSKNN clustering center is selected in thesame way as the traditional DPC algorithm+e local densityρ and relative distance δ are used to form a decision graph+e n points with the largest local density and relativedistance are selected as the clustering centers

For DPC-SFSKNN the sum of the distances from pointsof a low-density cluster to their K-neighbors may be largethus they receive a greater compensation for their δ valueFigures 4(a) and 4(b) show the results of DPC-SFSKNN onthe Jain dataset [23] Compared to Figure 2(b) the δ valuesof points in the upper branch are generally larger than thoseof the lower branch +is is because the density of the upper

branch is significantly smaller and the distances from thepoints to their respective K-nearest neighbors are largerthus they receive a greater compensation Even if the densityis at a disadvantage the higher δ value still makes the centerof the upper branch distinguished in the decision graph+isshows that the DPC-SFSKNN algorithm can correctly selectlow-density clustering centers

32 Processes +e entire process of the algorithm is dividedinto two parts the selection of clustering centers and theallocation of noncenter points +e main step of our DPC-SFSKNN and a detailed introduction of the proposed al-location strategy are given in Algorithm 1

Line 9 of the DPC-SFSKNN algorithm establishes aweighted K-nearest neighbor graph and Line 11 is aK-nearest neighbor similarity search allocation strategy Toassign noncenter points in the dataset we designed asimilarity-first search algorithm based on the weightedK-nearest neighbor graph +e algorithm uses the breadth-first search idea to find the cluster center with the highestsimilarity for the noncenter point +e similarity of non-center points and their K-nearest neighbors is sorted in anascending order the neighbor point with the highest sim-ilarity is selected as the next visited node and it is pushedinto the path queue If the highest similarity point is notunique the point with the smallest SNN average distance isselected as the next visited node+e visiting node also needsto sort the similarity of its K-nearest neighbors and select thenext visiting node +e search stops until the visited node isthe cluster center point Algorithm 2 describes the entiresearch process Finally each data point except the clustercenters is traversed to complete the assignment

Similarity-first search algorithm is an optimization al-gorithm based on breadth-first search according to the al-location requirements of noncenter points Similarity is animportant concept for clustering algorithms Points in thesame cluster are similar to each other Two points with ahigher similarity have more of the same neighbors Based onthe above ideas the definition of similarity is proposed in(12) In the process of searching if only similarity is used asthe search criteria it is easy to appear that the highestsimilarity point is not unique +erefore the algorithmchooses the average distance of the SNN as the secondcriterion and a smaller dsnn point means that the two pointsare closer in space

+e clustering results of the DPC-SFSKNN algorithmbased on the Pathbased dataset are shown in Figure 5Figure 3 clearly shows that although the traditional DPCalgorithm can find cluster centers on each of the threeclusters there is a serious bias in the allocation of noncenterpoints From Figure 5 we can see the effectiveness of thenoncentral point allocation algorithm of the DPC-SFSKNNalgorithm+e allocation strategy uses similarity-first searchto ensure that the similarity from the search path is thehighest and a gradual search to the cluster center to avoidthe points with low similarity is used as a reference Besidesthe similarity-first search allocation strategy based on theweighted K-nearest neighbor graph considers neighbor

6 Complexity

information When the point of the highest similarity is notunique the point with the shortest average distance of theshared neighbors is selected as the next visited point

33 Complexity Analysis In this section the complexities ofthe DPC-SFSKNN algorithm are analyzed including timecomplexity and space complexity Suppose the size of the

02 04 06 08 100

0102030405060708091

Y

X

(a)

02 04 06 08 10X

δ

025

020

015

010

005

0

(b)

Figure 4 Result and ρ value of the DPC-SFSKNN algorithm on the Jain dataset

Require dataset X parameter K

Ensure clustering result C(1) Data preprocessing normalize the data(2) Calculate the Euclidean distance between the points(3) Calculate the K-nearest neighbors of each point i isin X

(4) Calculate the average distance of K-nearest neighbors of each point dknn(i) according to (13)(5) Calculate the local density ρi of each point i isin X according to (14)(6) Calculate the relative distance δi of each point i isin X according to (15)(7) Find the cluster center by analyzing the decision graph composed of ρ and δ and use the cluster center as the set CC(8) Calculate the similarity between point i and its K-nearest neighbors according to (12)(9) Connect each point in the dataset X with its K-nearest neighbors and use the similarity as the connection weight to construct a

weighted K-nearest neighbor graph(10) Calculate the average distance of SNN dsnn(i j) between point i and its shared-nearest neighbors according to (11)(11) Apply Algorithm 2 to allocate the remaining points

ALGORITHM 1 DPC-SFSKNN

Require w isin X set of cluster centers CC number of neighbors K similarity matrix Snlowast n sim(i j)nlowast n and SNN averagedistance matrix DSNNnlowast n dsnn(i j)nlowast n

Ensure point w isin CC(1) Initialize the descending queue Q and the path queue P+e K-nearest neighbors of point w are sorted in the ascending order of

similarity and pushed into Q Push M into P(2) while tail point of P P isin CC do(3) if the highest similarity point is unique then(4) Pop a point this at Qrsquos tail(5) else(6) Select a point this with the smallest DSNN(7) end if(8) Empty descending queue Q(9) +e K-nearest neighbors of this are sorted in the ascending order of similarity and pushed into Q(10) Push this into P(11) end while

ALGORITHM 2 Similarity-first search allocation strategy

Complexity 7

dataset is n the number of cluster centers is m and thenumber of neighbors is k

331 Time Complexity +e time complexity analysis ofDPC-SFSKNN is as follows

Normalization requires a processing complexity of ap-proximately O(n) the complexities of calculating the Eu-clidean distance and similarity between points are O(n2) thecomplexity of computing the K-nearest neighbor averagedistance dknn is O(n2) similarly the complexity of the averagedistance dsnn between the calculation point and its shared-nearest neighbors does not exceed O(n2) at most the calcu-lation process of calculating the local density ρi and distance δi

of each point needs to acquire the KNN information com-plexity of each point as O(kn) so the complexities of localdensity ρ and distance δ areO(kn2) the point allocation part isthe search time of one point and in the worst case searching allpoints requires O(n) +ere are n points in the dataset and thetotal time does not exceed O(n2) In summary the total ap-proximate time complexity of DPC-SFSKNN is O(kn2)

+e time complexity of the DPC algorithm depends onthe following three aspects (a) the time to calculate thedistance between points (b) the time to calculate the localdensity ρi for point i and (c) the time to calculate the distanceδi for each point i +e time complexity of each part is O(n2)so the total approximate time complexity of DPC is O(n2)

+e time complexity of the DPC-SFSKNN algorithm is k

times higher than that of the traditional DPC algorithmHowever k is relatively small compared to n +erefore theydo not significantly affect the run time In Section 4 it isdemonstrated that the actual running time of DPC-SFSKNNdoes not exceed k times of the running time of the traditionalDPC algorithm

332 Space Complexity DPC-SFSKNN needs to calculatethe distance and similarity between points and its com-plexity is O(n2) Other data structures (such as ρ and δ arrays

and various average distance arrays) are O(n) For the al-location strategy in the worst case its complexity is O(n2)+e space complexity of DPC is O(n2) which is mainly dueto the distance matrix stored

+e space complexity of our DPC-SFSKNN is the sameas that of traditional DPC which is O(n2)

4 Experiments and Results

In this section experiments are performed based on severalpublic datasets commonly used to test the performance ofclustering algorithms including synthetic datasets [23ndash27]and real datasets [28ndash34] In order to visually observe theclustering ability of DPC-SFSKNN the DPC [20] DBSCAN[15] AP [8] FKNN-DPC [9] and K-means [10] methods areall tested for comparison +ree popular benchmarks areused to evaluate the performance of the above clusteringalgorithms including the clustering accuracy (ACC) ad-justed mutual information (AMI) and adjusted Rand index(ARI) [35] +e upper bounds of the three benchmarks wereall 1 +e larger the benchmark value the better the clus-tering effect +e codes for DPC DBSCAN and AP wereprovided based on the corresponding references

Table 1 lists the synthetic datasets used in the experi-ments+ese datasets were published in [23ndash27] Table 2 liststhe real datasets used in the experiments +ese datasetsinclude the real-world dataset from [29ndash34] and the Olivettiface dataset in [28]

To eliminate the influence of missing values and dif-ferences in different dimension ranges the datasets need tobe preprocessed before proceeding to the experiments Wereplace the missing values by the mean of all valid values ofthe same dimension and normalize the data using the min-max normalization method shown in the followingequation

xij xij minus min xj1113872 1113873

max xj1113872 1113873 minus min xj1113872 1113873 (16)

where xij represents the original data located in the ith rowand jth column xij represents the rescaled data of xij andxj represents the original data located in the jth column

Min-max normalization method processes each di-mension of the data and preserves the relationships betweenthe original data values [36] therefore decreasing the in-fluence of the difference in dimensions and increasing theefficiency of the calculation

To fairly reflect the clustering results of the five algo-rithms the parameters in the algorithms are adjusted toensure that their satisfactory clustering performance can beretained For the DPC-SFSKNN algorithm the parameter K

needs to be specified in advance and an initial clusteringcenter is manually selected based on a decision graphcomposed of the local density ρ and the relative distance δ Itcan be seen from the experimental results in Tables 3 and 4that the value of parameter K is around 6 and the value ofparameterK for the dataset with dense sample distribution ismore than 6 In addition to manually select the initialclustering center the traditional DPC algorithm also needs

02 04X

06 08 100

01

02

03

04

05Y

06

07

08

09

1

Figure 5 Results of the traditional DPC-SFSKNN algorithm on thePathbased dataset

8 Complexity

to determine dc Based on the provided selection range dc isselected so that the number of neighbors is between 1 and 2of the total number of data points [20] +e two parametersthat DBSCAN needs to determine are ε and minpts as in[15]+e optimal parameters are determined using a circularsearch method +e AP algorithm only needs to determine apreference and the larger the preference the more the center

points are allowed to be selected [8] +e general method forselecting parameters is not effective and only multiple ex-periments can be performed to select the optimal parame-ters +e only parameter of K-means is the number ofclusters +e true number of clusters in the dataset is used inthis case Similarly FKNN-DPC needs to determine thenearest neighbors K

Table 2 Real-world datasets

Dataset Records Attributes Clusters SourceIris 150 4 3 [29]Libras movement 360 90 15 [31]Wine 178 13 3 [29]Parkinsons 197 23 2 [29]WDBC 569 30 2 [34]Pima-Indians-diabetes 768 8 2 [29]Segmentation 2310 19 7 [29]Dermatology 366 33 6 [29]Seeds 210 7 3 [30]Ionosphere 351 34 2 [33]Waveform 5000 21 3 [32]Waveform (noise) 5000 40 3 [32]Olivetti face 400 92lowast112 40 [28]

Table 1 Synthetic datasets

Dataset Records Attributes Clusters SourcePathbased 300 2 3 [24]Jain 373 2 2 [23]Flame 240 2 2 [25]Aggregation 788 2 7 [26]DIM512 1024 512 16 [27]DIM1024 1024 1024 16 [27]

Table 3 +e comparison of ACC AMI and ARI benchmarks for 6 clustering algorithms on synthetic datasets

Algorithm AMI ARI ACC ECAC Par AMI ARI ACC ECAC ParPathbased Jain

DPC-SFSKNN 0926 0910 0925 33 6 1000 1000 1000 22 7DPC 0521 0463 0742 33 2 0609 0713 0853 22 3DBSCAN 0781 0522 0667 mdash 0056 0883 0985 0918 mdash 00810AP 0679 0475 0783 33 10 0681 0812 0882 22 40FKNN-DPC 0941 0960 0987 33 5 0056 0132 0793 mdash 10K-means 0568 0461 0772 mdash 3 0492 0577 0712 mdash 2

Aggregation FlameDPC-SFSKNN 0942 0951 0963 77 6 0873 0934 0956 22 6DPC 1000 1000 1000 77 4 1000 1000 1000 22 5DBSCAN 0969 0982 0988 mdash 0058 0867 0936 0981 - 0098AP 0795 0753 0841 77 77 0452 0534 0876 33 35FKNN-DPC 0995 0997 0999 33 8 1000 1000 1000 22 5K-means 0784 0717 0786 mdash 7 0418 0465 0828 mdash 2

DIM512 DIM1024DPC-SFSKNN 1000 1000 1000 1616 8 1000 1000 1000 1616 9DPC 1000 1000 1000 1616 2 1000 1000 1000 1616 001DBSCAN 1000 1000 1000 mdash 037 1000 1000 1000 mdash 108AP 1000 1000 1000 1616 20 1000 1000 1000 1616 30FKNN-DPC 1000 1000 1000 1616 8 1000 1000 1000 1616 10K-means 0895 0811 0850 mdash 1 0868 0752 0796 mdash 16

Complexity 9

41Analysis of theExperimentalResults on SyntheticDatasetsIn this section the performance of DPC-SFSKNN DPC[20] DBSCAN [15] AP [8] FKNN-DPC [9] and K-means[10] is tested with six synthetic datasets given in Table 1+ese synthetic datasets are different in distribution andquantity Different data situations can be simulated tocompare the performance of six algorithms in differentsituations Table 3 shows AMI ARI ACC and ECAC of thefive clustering algorithms on the six comprehensive datasetswhere the best results are shown in bold and ldquomdashrdquo means novalue Figures 6ndash9 show the clustering results of DPC-SFSKNN DPC DBSCAN AP FKNN-DPC and K-meansbased on the Pathbased Flame Aggregation and Jaindatasets respectively +e five algorithms achieve the

optimal clustering on DIM512 and DIM1024 datasets sothat the clustering of the two datasets is not shown Since thecluster centers of DBSCAN are relatively random only thepositions of clustering centers of the other three algorithmsare marked

Figure 6 shows the results of the Pathbased datasetDPC-SFSKNN and FKNN-DPC can complete the clusteringof the Pathbased dataset correctly From Figures 6(b) 6(d)and 6(f) it can be seen that the clustering results of DPC APand K-means are similar +e clustering centers selected byDPC AP DPC-SFSKNN and FKNN-DPC are highlysimilar but the clustering results of DPC and AP are notsatisfactory For the DPC algorithm the low fault tolerancerate of its allocation strategy is the cause of this result A

Table 4 Comparison of ACC AMI and ARI benchmarks for 6 clustering algorithms on real-world datasets

Algorithm AMI ARI ACC ECAC Par AMI ARI ACC ECAC ParIris Libras movement

DPC-SFSKNN 0896 0901 0962 33 6 0547 0368 0510 1015 8DPC 0812 0827 0926 33 2 0535 0304 0438 915 05DBSCAN 0792 0754 0893 mdash 0149 0412 0183 0385 mdash 0965AP 0764 0775 0911 33 6 0364 0267 0453 1015 25FKNN-DPC 0912 0922 0973 33 7 0508 0308 0436 1015 9K-means 0683 0662 0823 mdash 3 0522 0306 0449 mdash 15

Wine ParkinsonsDPC-SFSKNN 0843 0851 0951 33 6 0193 0380 0827 22 6DPC 0706 0672 0882 33 2 0210 0114 0612 22 5DBSCAN 0612 0643 0856 mdash 04210 0205 0213 0674 mdash 046AP 0592 0544 0781 33 6 0142 0127 0669 22 15FKNN-DPC 0831 0852 0949 33 7 0273 0391 0851 22 5K-means 0817 0838 0936 mdash 3 0201 0049 0625 mdash 2

WDBC IonosphereDPC-SFSKNN 0432 0516 0857 22 6 0361 0428 0786 32 7DPC 0002 minus0004 0602 22 9 0238 0276 0681 32 065DBSCAN 0397 0538 0862 mdash 0277 0544 0683 0853 mdash 027AP 0598 0461 0854 22 40 0132 0168 0706 22 15FKNN-DPC 0679 0786 0944 22 7 0284 0355 0752 22 8K-means 0611 0730 0928 mdash 2 0129 0178 0712 mdash 2

Segmentation Pima-Indians-diabetesDPC-SFSKNN 0665 0562 0746 67 6 0037 0083 0652 22 6DPC 0650 0550 0684 67 3 0033 0075 0647 22 4DBSCAN 0446 0451 0550 mdash 02510 0028 0041 0577 mdash 0156AP 0405 0436 0554 77 25 0045 0089 0629 32 35FKNN-DPC 0655 0555 0716 77 7 0001 0011 0612 22 6K-means 0583 0495 0612 mdash 6 0050 0102 0668 mdash 2

Seeds DermatologyDPC-SFSKNN 0753 0786 0919 33 7 0862 0753 0808 76 6DPC 0727 0760 0918 33 2 0611 0514 0703 46 2DBSCAN 0640 0713 0874 mdash 0178 0689 0690 0815 mdash 073AP 0598 0682 0896 33 10 0766 0701 0762 76 5FKNN-DPC 0759 0790 0924 33 8 0847 0718 0768 76 7K-means 0671 0705 0890 mdash 3 0796 0680 0702 mdash 6

Waveform Waveform (noise)DPC-SFSKNN 0355 0382 0725 33 5 0267 0288 0651 33 6DPC 0320 0269 0586 33 05 0104 0095 0502 33 03DBSCAN mdash mdash mdash mdash mdash mdash mdash mdash mdash mdashAP mdash mdash mdash mdash mdash mdash mdash mdash mdash mdashFKNN-DPC 0324 0350 0703 33 5 0247 0253 0648 33 5K-means 0363 0254 0501 - 3 0364 0252 0512 mdash 3

10 Complexity

0

02

04

06

08

1

05 10

(a)

0

02

04

06

08

1

05 10

(b)

Figure 7 Continued

05 100

02

04

06

08

1

(a)

05 100

02

04

06

08

1

(b)

05 100

02

04

06

08

1

(c)

05 100

02

04

06

08

1

(d)

05 100

02

04

06

08

1

(e)

05 100

02

04

06

08

1

(f )

Figure 6+e clustering of Pathbased by 6 clustering algorithms (a) DPC-SFSKNN (b) DPC (c) DBSCAN (d) AP (e) FKNN-DPC and (f)K-means

Complexity 11

05 100

02

04

06

08

1

(c)

05 100

02

04

06

08

1

(d)

05 100

02

04

06

08

1

(e)

05 100

02

04

06

08

1

(f )

Figure 7 +e clustering of Flame by 6 clustering algorithms (a) DPC-SFSKNN (b) DPC (c) DBSCAN (d) AP (e) FKNN-DPC and(f) K-means

05 100

02

04

06

08

1

(a)

05 100

02

04

06

08

1

(b)

0

02

04

06

08

1

05 10

(c)

05 100

02

04

06

08

1

(d)

Figure 8 Continued

12 Complexity

high-density point allocation error will be transferred tolow-density points and the error propagation will seriouslyaffect the clustering results AP and K-means algorithms arenot good at dealing with irregular clusters +e two clustersin the middle are too attractive to the points on both sides ofthe semicircular cluster which leads to clustering errorsDBSCAN can completely detect the semicircular cluster butthe semicircular cluster and the cluster on the left of themiddle are incorrectly classified into one category and thecluster on the right of the middle is divided into two clusters+e similarities between points and manually prespecifiedparameters may severely affect the clustering DPC-SFSKNNand FKNN-DPC algorithms perform well on the Pathbaseddataset +ese improved algorithms that consider neighborrelationships have a great advantage in handling suchcomplex distributed datasets

Figure 7 shows the results of the six algorithms on the Flame dataset. As shown in the figure, DPC-SFSKNN, DPC, FKNN-DPC, and DBSCAN can correctly detect the two clusters, while AP and K-means cannot cluster them completely correctly. Although AP can correctly identify the upper cluster and select an appropriate cluster center, the lower cluster is divided into two clusters; both clusters found by K-means are wrong. The clustering results in Figure 8 show that the DPC-SFSKNN, DPC, FKNN-DPC, and DBSCAN algorithms can detect the 7 clusters in the Aggregation dataset, but AP and K-means still cannot cluster correctly. DPC-SFSKNN, DPC, and FKNN-DPC identify both the clusters and their centers. Although the cluster centers are not marked for DBSCAN, the number of clusters and the overall shape of each cluster are correct. The AP algorithm finds the correct number of clusters, but it chooses two centers for one cluster, which divides that cluster into two; the clustering result of K-means is similar to that of AP.

The Jain dataset shown in Figure 9 consists of two semicircular clusters of different densities. As shown in the figure, the DPC-SFSKNN algorithm can completely cluster the two clusters of different densities. However, DPC, AP, FKNN-DPC, and K-means incorrectly assign the left end of the lower cluster to the upper cluster, and the cluster centers chosen by DPC both lie on the lower cluster. Compared with that, the distribution of the cluster centers of AP is more reasonable. The DBSCAN algorithm can accurately identify the lower cluster, but the left end of the upper cluster is incorrectly split off as a new cluster, so that the upper cluster is divided into two clusters.

According to the benchmark data shown in Table 3, it is clear that the performance of DPC-SFSKNN is very effective among the six clustering algorithms, especially on the Jain dataset. Although DPC and FKNN-DPC perform better than DPC-SFSKNN on the Aggregation and Flame datasets, DPC-SFSKNN can still find the correct cluster centers of Aggregation and complete the clustering task correctly.
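For readers who want to reproduce panel-style comparisons like those in Figures 6-9, a minimal sketch is given below. It uses scikit-learn's make_moons purely as a stand-in for the two-dimensional synthetic datasets and DBSCAN as a stand-in for any of the six compared algorithms; the parameter values are illustrative assumptions, not the settings used in the experiments.

# Sketch: reproducing a panel like those in Figures 6-9 for a 2D clustering result.
# make_moons stands in for the paper's 2D synthetic datasets, and DBSCAN stands in
# for any of the six compared algorithms; parameter values are illustrative only.
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import MinMaxScaler

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = MinMaxScaler().fit_transform(X)              # min-max normalization to [0, 1]

labels = DBSCAN(eps=0.06, min_samples=6).fit_predict(X)

plt.figure(figsize=(4, 4))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=10, cmap="tab10")
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.title("Illustrative clustering of a two-moon dataset")
plt.show()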

4.2. Analysis of Experimental Results on Real-World Datasets. In this section, the performance of the six algorithms is again benchmarked according to AMI, ARI, ACC, and ECAC, and the clustering results are summarized in Table 4. Twelve real-world datasets are selected to test DPC-SFSKNN's ability to identify clusters on different kinds of data. DBSCAN and the AP algorithm cannot obtain effective clustering results on Waveform and Waveform (noise); the symbol "—" represents no result.
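For reference, the three external benchmarks can be computed as in the sketch below: AMI and ARI come directly from scikit-learn, while ACC is obtained by matching predicted clusters to ground-truth classes with the Hungarian algorithm before measuring accuracy. The helper name clustering_acc and the toy labels are illustrative, not the authors' code.

# Sketch: computing the AMI, ARI, and ACC benchmarks for a predicted labeling.
# AMI/ARI come from scikit-learn; ACC maps predicted clusters to true classes
# with the Hungarian algorithm before measuring accuracy.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

def clustering_acc(y_true, y_pred):
    """Best-match accuracy between cluster labels and ground-truth classes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    # Contingency table: rows are predicted clusters, columns are true classes.
    w = np.zeros((clusters.size, classes.size), dtype=np.int64)
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            w[i, j] = np.sum((y_pred == c) & (y_true == k))
    row, col = linear_sum_assignment(-w)          # maximize the matched counts
    return w[row, col].sum() / y_true.size

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]                       # illustrative labels only
print(adjusted_mutual_info_score(y_true, y_pred),
      adjusted_rand_score(y_true, y_pred),
      clustering_acc(y_true, y_pred))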

As shown in Table 4, in terms of the AMI, ARI, and ACC benchmarks, DPC-SFSKNN outperforms the five other algorithms on the Wine, Segmentation, and Libras movement datasets. At the same time, FKNN-DPC performs better than the other five algorithms on the Iris, Seeds, Parkinsons, and WDBC datasets. It can be seen that the overall performance of DPC-SFSKNN is slightly better than that of DPC on 11 of the datasets, the exception being Parkinsons, on which DPC-SFSKNN is slightly worse than DPC in AMI but better in ARI and ACC. Similarly, DPC-SFSKNN performs slightly better than FKNN-DPC on eight of the datasets; on the remaining four (Iris, Parkinsons, WDBC, and Seeds), DPC-SFSKNN is slightly worse than FKNN-DPC in AMI, ARI, and ACC. DBSCAN obtains the best results on Ionosphere, K-means is the best on Pima-Indians-diabetes, and K-means is the best in AMI on the Waveform and Waveform (noise) datasets. In general, the clustering results of DPC-SFSKNN on real-world datasets are satisfactory.

Figure 8: The clustering of Aggregation by 6 clustering algorithms: (a) DPC-SFSKNN, (b) DPC, (c) DBSCAN, (d) AP, (e) FKNN-DPC, and (f) K-means.


4.3. Experimental Analysis of the Olivetti Face Dataset. The Olivetti face dataset [28] is an image dataset widely used to evaluate machine learning algorithms. Its purpose here is to test the clustering behavior of an algorithm without supervision, including determining the number of clusters in the database and the members of each cluster. The dataset contains 40 clusters, each of which has 10 different images. Because the actual number of clusters (40) is large while each cluster contains only a few elements (10 images per cluster), the local density estimates become less reliable, which is a great challenge for density-based clustering algorithms. To further test the clustering performance of DPC-SFSKNN, experiments were performed on the Olivetti face database, and the results were compared with those of DPC, AP, DBSCAN, FKNN-DPC, and K-means.
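The Olivetti faces are also available through scikit-learn (resized to 64 x 64 pixels rather than the original 92 x 112 listed in Table 2), which makes this kind of experiment easy to reproduce. The sketch below only illustrates the evaluation setup; agglomerative clustering is a stand-in, not the DPC-SFSKNN implementation.

# Sketch: loading the Olivetti face images and scoring a clustering against
# the 40 ground-truth identities. Agglomerative clustering is only a stand-in;
# the paper compares DPC-SFSKNN, DPC, DBSCAN, AP, FKNN-DPC, and K-means.
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

faces = fetch_olivetti_faces()                 # 400 images, 64x64 pixels each
X = faces.data                                 # flattened to 4096-dim vectors
y = faces.target                               # identity labels 0..39

X_red = PCA(n_components=50, whiten=True, random_state=0).fit_transform(X)
labels = AgglomerativeClustering(n_clusters=40).fit_predict(X_red)

print("AMI:", adjusted_mutual_info_score(y, labels))
print("ARI:", adjusted_rand_score(y, labels))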

The clustering results achieved by DPC-SFSKNN and DPC on the Olivetti face database are shown in Figure 10, where white squares represent the cluster centers. The 32 clusters found by DPC-SFSKNN in Figure 10(a) and the 20 clusters found by DPC in Figure 10(b) are displayed in different colors; gray images indicate that an image is not assigned to any cluster. It can be seen from Figure 10(a) that the 32 cluster centers found by DPC-SFSKNN cover 29 clusters, and as shown in Figure 10(b), the 20 cluster centers found by DPC are scattered over 19 clusters. Like DPC-SFSKNN, DPC may divide one cluster into two. Because DPC-SFSKNN can find many more density peaks than DPC, it is more likely to identify one cluster as two different clusters; the same situation occurs with the FKNN-DPC algorithm.

Figure 9: The clustering of Jain by 6 clustering algorithms: (a) DPC-SFSKNN, (b) DPC, (c) DBSCAN, (d) AP, (e) FKNN-DPC, and (f) K-means.


However, the performance of FKNN-DPC is better than that of DPC-SFSKNN in AMI, ARI, ACC, and ECAC. Table 5 compares the clustering results of these algorithms on this database in terms of AMI, ARI, ACC, and ECAC. The performance of DPC-SFSKNN is slightly superior to that of the other four algorithms, the exception being FKNN-DPC.

4.4. Running Time. This section compares the time performance of DPC-SFSKNN with that of DPC, DBSCAN, AP, FKNN-DPC, and K-means on the real-world datasets. The time complexities of DPC-SFSKNN and DPC were analyzed in Section 3.3.1: the time complexity of DPC is O(n²) and that of DPC-SFSKNN is O(kn²), where n is the size of the dataset. However, the time consumed by DPC mainly comes from calculating the local density and the relative distance of each point, while the time consumed by DPC-SFSKNN comes mainly from the calculation of the K-nearest neighbors and the allocation strategy for noncenter points. Table 6 lists the running time (in seconds) of the six algorithms on the real-world datasets. It can be seen that, although the time complexity of DPC-SFSKNN is approximately k times that of DPC, their execution times on actual datasets do not differ by a factor of k.

Table 6: Running time (in seconds) of the 6 clustering algorithms on the UCI datasets.

Dataset                  DPC-SFSKNN   DPC     DBSCAN   AP      FKNN-DPC   K-means
Iris                     0.241        0.049   0.059    0.565   0.148      0.014
Wine                     0.238        0.048   0.098    0.832   0.168      0.013
WDBC                     0.484        0.092   0.884    6.115   0.464      0.018
Seeds                    0.244        0.050   0.122    0.973   0.164      0.014
Libras movement          0.602        0.068   0.309    3.016   2.602      0.075
Ionosphere               0.325        0.064   0.349    2.018   0.309      0.021
Segmentation             1.569        0.806   8.727    6.679   0.313      0.062
Dermatology              0.309        0.063   0.513    2.185   0.409      0.007
Pima-Indians-diabetes    0.792        0.126   2.018    9.709   0.892      0.009
Parkinsons               0.255        0.048   0.114    0.866   0.263      0.003
Waveform                 16.071       3.511   —        —       7.775      0.067
Waveform (noise)         17.571       3.784   —        —       7.525      0.109

From Table 6 it can be seen that, on relatively small datasets, the running time of DPC-SFSKNN is about twice or more that of DPC, and the difference mainly comes from DPC-SFSKNN's allocation strategy. Although the computational load of the local densities grows very quickly with the size of a dataset, the time consumed by the allocation strategy in DPC-SFSKNN increases irregularly with the distribution of a dataset. This leads to an irregular gap between the running times of DPC and DPC-SFSKNN.
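A running-time comparison in the spirit of Table 6 can be obtained with a simple wall-clock harness such as the sketch below; the clustering callables passed to it are placeholders, since the implementations compared in the paper are not reproduced here.

# Sketch: a wall-clock timing harness in the spirit of Table 6. The clustering
# callables passed in are placeholders for the six compared implementations.
import time

def time_clustering(fit_predict, X, repeats=5):
    """Return the average wall-clock time of fit_predict(X) over several runs."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        fit_predict(X)
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)

# Example usage with scikit-learn stand-ins:
# from sklearn.cluster import KMeans, DBSCAN
# print(time_clustering(lambda X: KMeans(n_clusters=3, n_init=10).fit_predict(X), X))
# print(time_clustering(lambda X: DBSCAN(eps=0.05, min_samples=6).fit_predict(X), X))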

FKNN-DPC has the same time and space complexity as DPC, but its running time is almost the same as that of DPC-SFSKNN: computing the K-nearest-neighbor relationships takes a large share of the running time. The time complexities of DBSCAN and AP are approximately O(n²), and the parameters of both cannot be determined by simple methods. When the dataset is relatively large, it is difficult to find their optimal parameters, which may be the reason why the two algorithms produce no results on the Waveform datasets.
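One common way to carry out such repeated trials is a brute-force search over a parameter grid, as sketched below for DBSCAN and scored with AMI against known labels; the grid ranges are illustrative assumptions that would have to be adapted to each dataset.

# Sketch: brute-force search over DBSCAN's eps and min_samples, scored by AMI.
# Grid ranges are illustrative; in practice they depend on the dataset's scale.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_mutual_info_score

def search_dbscan(X, y_true, eps_grid, minpts_grid):
    best = (None, -1.0)
    for eps in eps_grid:
        for minpts in minpts_grid:
            labels = DBSCAN(eps=eps, min_samples=minpts).fit_predict(X)
            score = adjusted_mutual_info_score(y_true, labels)
            if score > best[1]:
                best = ((eps, minpts), score)
    return best

# Example: best_params, best_ami = search_dbscan(X, y, np.arange(0.02, 0.3, 0.02), range(3, 15))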

Figure 10: The clustering of Olivetti by two clustering algorithms: (a) DPC-SFSKNN and (b) DPC.

Table 5: Performance comparison of the algorithms by clustering criteria for the Olivetti face database.

Metric   DPC-SFSKNN   DPC     DBSCAN   AP      FKNN-DPC   K-means
ACC      0.786        0.665   0.648    0.763   0.818      0.681
AMI      0.792        0.728   0.691    0.737   0.832      0.742
ARI      0.669        0.560   0.526    0.619   0.714      0.585
ECAC     32/40        20/40   —        28/40   36/40      —
Par      6            05      064      21      4          40


The approximate time complexity of K-means is O(n), and Table 6 confirms its efficiency. K-means loses almost no accuracy while remaining fast, which makes it a very popular clustering algorithm, but it cannot handle irregularly shaped data well.

5. Conclusions and Future Work

A new clustering algorithm based on the traditional DPC algorithm is proposed in this paper. The algorithm introduces a density peak search that takes the surrounding neighbor information into account and develops a new allocation strategy to detect the true distribution of the dataset. The proposed clustering algorithm performs a fast search, finds the density peaks, that is, the cluster centers, of a dataset of any size, and recognizes clusters of arbitrary shape and dimensionality. The algorithm is called DPC-SFSKNN, which means that it calculates the local density and the relative distance using distance information between points and their neighbors to find the cluster centers, and then assigns the remaining points by similarity-first search; the search is based on the weighted KNN graph and finds the owner (cluster center) of each point. DPC-SFSKNN successfully addresses several issues arising from the clustering algorithm of Alex Rodriguez and Alessandro Laio [20], including its density metric and the potential issue hidden in its assignment strategy. The performance of DPC-SFSKNN was tested on several synthetic datasets, on real-world datasets from the UCI machine learning repository, and on the well-known Olivetti face database. The experimental results demonstrate that DPC-SFSKNN is powerful in finding cluster centers and in recognizing clusters regardless of their shape, of the dimensionality of the space in which they are embedded, and of the size of the datasets, and that it is robust to outliers. It performs much better than the original DPC algorithm. However, the proposed algorithm has some limitations: the parameter K needs to be adjusted manually for different datasets; the cluster centers still need to be selected manually by analyzing the decision graph (as in the DPC algorithm); and the allocation strategy improves the clustering accuracy but at additional time cost. How to improve the degree of automation and the allocation efficiency of the algorithm needs further research.

Data Availability

The synthetic datasets are cited at relevant places within the text as references [23–27]. The real-world datasets are cited at relevant places within the text as references [29–34]. The Olivetti face dataset is cited at relevant places within the text as reference [28].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (6160303040 and 61433003), in part by the Yunnan Applied Basic Research Project of China (201701CF00037), and in part by the Yunnan Provincial Science and Technology Department Key Research Program (Engineering) (2018BA070).

Supplementary Materials

It includes the datasets used in the experiments in this paper. (Supplementary Materials)

References

[1] K. L. Liu, Y. L. Shang, Q. Ouyang, and W. D. Widanage, "A data-driven approach with uncertainty quantification for predicting future capacities and remaining useful life of lithium-ion battery," IEEE Transactions on Industrial Electronics, p. 1, 2020.
[2] X. P. Tang, K. L. Liu, X. Wang et al., "Model migration neural network for predicting battery aging trajectories," IEEE Transactions on Transportation Electrification, vol. 6, no. 2, pp. 363–374, 2020.

[3] X. Tang, K. Liu, X. Wang, B. Liu, F. Gao, and W. D. Widanage, "Real-time aging trajectory prediction using a base model-oriented gradient-correction particle filter for lithium-ion batteries," Journal of Power Sources, vol. 440, Article ID 227118, 2019.

[4] K. Liu, Y. Li, X. Hu, M. Lucu, and W. D. Widanage, "Gaussian process regression with automatic relevance determination kernel for calendar aging prediction of lithium-ion batteries," IEEE Transactions on Industrial Informatics, vol. 16, no. 6, pp. 3767–3777, 2020.
[5] K. Liu, X. Hu, Z. Wei, Y. Li, and Y. Jiang, "Modified Gaussian process regression models for cyclic capacity prediction of lithium-ion batteries," IEEE Transactions on Transportation Electrification, vol. 5, no. 4, pp. 1225–1236, 2019.
[6] L. Cai, J. Meng, D.-I. Stroe, G. Luo, and R. Teodorescu, "An evolutionary framework for lithium-ion battery state of health estimation," Journal of Power Sources, vol. 412, pp. 615–622, 2019.
[7] L. Cai, J. H. Meng, D. I. Stroe et al., "Multi-objective optimization of data-driven model for lithium-ion battery SOH estimation with short-term feature," IEEE Transactions on Power Electronics, p. 1, 2020.
[8] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972–976, 2007.
[9] J. Xie, H. Gao, W. Xie, X. Liu, and P. W. Grant, "Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors," Information Sciences, vol. 354, pp. 19–40, 2016.

[10] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, Berkeley, CA, USA, 1967.

[11] S. Kant, T. L. Rao, and P. N. Sundaram, "An automatic and stable clustering algorithm," Pattern Recognition Letters, vol. 15, no. 6, pp. 543–549, 1994.
[12] D. Arthur and S. Vassilvitskii, "K-Means++: the advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 7–9, New Orleans, LA, USA, 2007.
[13] Y. Zhao, W. Halang, and X. Wang, "Rough ontology mapping in E-business integration," E-Service Intelligence, vol. 8, pp. 75–93, 2007.
[14] Y. Xiao and J. Yu, "Semi-supervised clustering based on affinity propagation algorithm," ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, 2007.
[15] M. Ester, H. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 226–231, Portland, OR, USA, 1996.
[16] R. J. G. B. Campello, D. Moulavi, and J. Sander, "Density-based clustering based on hierarchical density estimates," Advances in Knowledge Discovery and Data Mining, vol. 7819, pp. 160–172, 2013.
[17] Z. Liang and P. Chen, "Delta-density based clustering with a divide-and-conquer strategy: 3DC clustering," Pattern Recognition Letters, vol. 73, pp. 52–59, 2016.
[18] M. Ankerst, M. M. Breunig, H. P. Kriegel, and J. Sander, "OPTICS: ordering points to identify the clustering structure," in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pp. 49–60, Philadelphia, PA, USA, 1999.
[19] M. Du, S. Ding, and H. Jia, "Study on density peaks clustering based on k-nearest neighbors and principal component analysis," Knowledge-Based Systems, vol. 99, pp. 135–145, 2016.
[20] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, 2014.
[21] T. Li, H. W. Ge, and S. Z. Su, "Density peaks clustering by automatic determination of cluster centers," Journal of Computer Science and Technology, vol. 10, no. 11, pp. 1614–1622, 2016.
[22] R. Liu, H. Wang, and X. Yu, "Shared-nearest-neighbor-based clustering by fast search and find of density peaks," Information Sciences, vol. 450, pp. 200–226, 2018.
[23] R. A. Jarvis and E. A. Patrick, "Clustering using a similarity measure based on shared near neighbors," IEEE Transactions on Computers, vol. C-22, no. 11, pp. 1025–1034, 1973.
[24] H. Chang and D.-Y. Yeung, "Robust path-based spectral clustering," Pattern Recognition, vol. 41, no. 1, pp. 191–203, 2008.
[25] L. Fu and E. Medico, "FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data," BMC Bioinformatics, vol. 8, no. 1, 2007.
[26] A. Gionis, H. Mannila, and P. Tsaparas, "Clustering aggregation," ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, p. 4, 2007.
[27] P. Franti, O. Virmajoki, and V. Hautamaki, "Fast agglomerative clustering using a k-nearest neighbor graph," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1875–1881, 2006.
[28] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 1994 IEEE Workshop on Applications of Computer Vision, pp. 138–142, Sarasota, FL, USA, 1994.
[29] K. Bache and M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, 2013.
[30] M. Charytanowicz, J. Niewczas, P. Kulczycki, P. A. Kowalski, S. Lukasik, and S. Zak, "Complete gradient clustering algorithm for features analysis of X-ray images," in Information Technologies in Biomedicine, Advances in Intelligent and Soft Computing, vol. 69, Springer, Berlin, Germany.
[31] D. B. Dias, R. C. B. Madeo, T. Rocha, H. H. Biscaro, and S. M. Peres, "Hand movement recognition for Brazilian sign language: a study using distance-based neural networks," in Proceedings of the 2009 International Joint Conference on Neural Networks, pp. 697–704, Atlanta, GA, USA, 2009.
[32] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, Routledge, New York, NY, USA, 1st edition, 1984.
[33] V. G. Sigillito, S. P. Wing, L. V. Hutton, and K. B. Baker, "Classification of radar returns from the ionosphere using neural networks," Johns Hopkins APL Technical Digest, vol. 10, no. 3, pp. 262–266, 1989.
[34] W. N. Street, W. H. Wolberg, and O. L. Mangasarian, "Nuclear feature extraction for breast tumor diagnosis," in Proceedings of SPIE 1905, Biomedical Image Processing and Biomedical Visualization, San Jose, CA, USA, 1993.
[35] X. V. Nguyen, J. Epps, and J. Bailey, "Information theoretic measures for clusterings comparison: is a correction for chance necessary?" in Proceedings of ICML 2009, the 26th Annual International Conference on Machine Learning, Montreal, Canada, 2009.
[36] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, Burlington, MA, USA, 3rd edition, 2011.




Page 8: ClusteringbyDetectingDensityPeaksandAssigningPointsby ... · FKNN-DPC [9] is given in (5) and (6), respectively: ρ i exp − 1 K X j∈knn(i) d2 ij ⎛⎝ ⎞⎠, (5) ρ i X j∈knn(i)

dataset is n the number of cluster centers is m and thenumber of neighbors is k

331 Time Complexity +e time complexity analysis ofDPC-SFSKNN is as follows

Normalization requires a processing complexity of ap-proximately O(n) the complexities of calculating the Eu-clidean distance and similarity between points are O(n2) thecomplexity of computing the K-nearest neighbor averagedistance dknn is O(n2) similarly the complexity of the averagedistance dsnn between the calculation point and its shared-nearest neighbors does not exceed O(n2) at most the calcu-lation process of calculating the local density ρi and distance δi

of each point needs to acquire the KNN information com-plexity of each point as O(kn) so the complexities of localdensity ρ and distance δ areO(kn2) the point allocation part isthe search time of one point and in the worst case searching allpoints requires O(n) +ere are n points in the dataset and thetotal time does not exceed O(n2) In summary the total ap-proximate time complexity of DPC-SFSKNN is O(kn2)

+e time complexity of the DPC algorithm depends onthe following three aspects (a) the time to calculate thedistance between points (b) the time to calculate the localdensity ρi for point i and (c) the time to calculate the distanceδi for each point i +e time complexity of each part is O(n2)so the total approximate time complexity of DPC is O(n2)

+e time complexity of the DPC-SFSKNN algorithm is k

times higher than that of the traditional DPC algorithmHowever k is relatively small compared to n +erefore theydo not significantly affect the run time In Section 4 it isdemonstrated that the actual running time of DPC-SFSKNNdoes not exceed k times of the running time of the traditionalDPC algorithm

332 Space Complexity DPC-SFSKNN needs to calculatethe distance and similarity between points and its com-plexity is O(n2) Other data structures (such as ρ and δ arrays

and various average distance arrays) are O(n) For the al-location strategy in the worst case its complexity is O(n2)+e space complexity of DPC is O(n2) which is mainly dueto the distance matrix stored

+e space complexity of our DPC-SFSKNN is the sameas that of traditional DPC which is O(n2)

4 Experiments and Results

In this section experiments are performed based on severalpublic datasets commonly used to test the performance ofclustering algorithms including synthetic datasets [23ndash27]and real datasets [28ndash34] In order to visually observe theclustering ability of DPC-SFSKNN the DPC [20] DBSCAN[15] AP [8] FKNN-DPC [9] and K-means [10] methods areall tested for comparison +ree popular benchmarks areused to evaluate the performance of the above clusteringalgorithms including the clustering accuracy (ACC) ad-justed mutual information (AMI) and adjusted Rand index(ARI) [35] +e upper bounds of the three benchmarks wereall 1 +e larger the benchmark value the better the clus-tering effect +e codes for DPC DBSCAN and AP wereprovided based on the corresponding references

Table 1 lists the synthetic datasets used in the experi-ments+ese datasets were published in [23ndash27] Table 2 liststhe real datasets used in the experiments +ese datasetsinclude the real-world dataset from [29ndash34] and the Olivettiface dataset in [28]

To eliminate the influence of missing values and dif-ferences in different dimension ranges the datasets need tobe preprocessed before proceeding to the experiments Wereplace the missing values by the mean of all valid values ofthe same dimension and normalize the data using the min-max normalization method shown in the followingequation

x′_ij = (x_ij − min(x_j)) / (max(x_j) − min(x_j)),   (16)

where x_ij represents the original data located in the ith row and jth column, x′_ij represents the rescaled value of x_ij, and x_j represents the original data located in the jth column.

The min-max normalization method processes each dimension of the data and preserves the relationships between the original data values [36], therefore decreasing the influence of the difference in dimensions and increasing the efficiency of the calculation.
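A minimal sketch of this preprocessing step (column-mean imputation followed by the min-max rescaling of equation (16)) is given below; the guard against constant columns is an added assumption rather than something stated in the paper.

```python
import numpy as np

def preprocess(X):
    """Mean-impute missing values per dimension, then rescale each dimension to [0, 1]."""
    X = np.array(X, dtype=float)
    col_mean = np.nanmean(X, axis=0)                   # mean of valid values in each column
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_mean[cols]                     # replace missing entries by the column mean
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    rng = np.where(x_max > x_min, x_max - x_min, 1.0)  # avoid division by zero for constant columns
    return (X - x_min) / rng                           # equation (16)

print(preprocess([[1.0, 200.0], [2.0, np.nan], [3.0, 400.0]]))
```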

To fairly reflect the clustering results of the six algorithms, the parameters of each algorithm are adjusted to ensure that its satisfactory clustering performance can be retained. For the DPC-SFSKNN algorithm, the parameter K needs to be specified in advance, and the initial clustering centers are manually selected based on a decision graph composed of the local density ρ and the relative distance δ. It can be seen from the experimental results in Tables 3 and 4 that the value of the parameter K is around 6, and the value of K for datasets with a dense sample distribution is larger than 6. In addition to manually selecting the initial clustering centers, the traditional DPC algorithm also needs to determine the cutoff distance dc.

Figure 5: Results of the traditional DPC-SFSKNN algorithm on the Pathbased dataset (X-Y scatter plot).


Based on the provided selection range, dc is selected so that the average number of neighbors is between 1% and 2% of the total number of data points [20]. The two parameters that DBSCAN needs to determine are ε and minpts, as in [15]; the optimal parameters are determined using a circular search method. The AP algorithm only needs to determine a preference, and the larger the preference, the more center points are allowed to be selected [8]. A general method for selecting this parameter is not available, so only multiple experiments can be performed to select the optimal value. The only parameter of K-means is the number of clusters; the true number of clusters in each dataset is used in this case. Similarly, FKNN-DPC needs to determine the number of nearest neighbors K.
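For the baselines that exist in scikit-learn, the parameter roles described above map onto the constructors roughly as in the sketch below. The numerical values are illustrative placeholders only, not the tuned settings reported in Tables 3 and 4, and the percentile rule for dc is one common reading of the 1-2% heuristic rather than the authors' exact procedure.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import DBSCAN, AffinityPropagation, KMeans

X = np.random.rand(300, 2)                      # stand-in for a preprocessed dataset

# DPC cutoff distance dc: take a low percentile of all pairwise distances (assumed reading).
dc = np.percentile(pdist(X), 2.0)

dbscan = DBSCAN(eps=0.05, min_samples=6)                    # eps and minpts
ap = AffinityPropagation(preference=-50, random_state=0)    # larger preference -> more centers
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)    # true number of clusters

labels = {
    "DBSCAN": dbscan.fit_predict(X),
    "AP": ap.fit_predict(X),
    "K-means": kmeans.fit_predict(X),
}
```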

Table 2: Real-world datasets.

Dataset               | Records | Attributes | Clusters | Source
Iris                  | 150     | 4          | 3        | [29]
Libras movement       | 360     | 90         | 15       | [31]
Wine                  | 178     | 13         | 3        | [29]
Parkinsons            | 197     | 23         | 2        | [29]
WDBC                  | 569     | 30         | 2        | [34]
Pima-Indians-diabetes | 768     | 8          | 2        | [29]
Segmentation          | 2310    | 19         | 7        | [29]
Dermatology           | 366     | 33         | 6        | [29]
Seeds                 | 210     | 7          | 3        | [30]
Ionosphere            | 351     | 34         | 2        | [33]
Waveform              | 5000    | 21         | 3        | [32]
Waveform (noise)      | 5000    | 40         | 3        | [32]
Olivetti face         | 400     | 92×112     | 40       | [28]

Table 1: Synthetic datasets.

Dataset     | Records | Attributes | Clusters | Source
Pathbased   | 300     | 2          | 3        | [24]
Jain        | 373     | 2          | 2        | [23]
Flame       | 240     | 2          | 2        | [25]
Aggregation | 788     | 2          | 7        | [26]
DIM512      | 1024    | 512        | 16       | [27]
DIM1024     | 1024    | 1024       | 16       | [27]

Table 3: The comparison of ACC, AMI, and ARI benchmarks for 6 clustering algorithms on synthetic datasets.

             |           Pathbased               |              Jain
Algorithm    |  AMI    ARI    ACC   ECAC  Par    |  AMI    ARI    ACC   ECAC  Par
DPC-SFSKNN   | 0.926  0.910  0.925  3/3   6      | 1.000  1.000  1.000  2/2   7
DPC          | 0.521  0.463  0.742  3/3   2      | 0.609  0.713  0.853  2/2   3
DBSCAN       | 0.781  0.522  0.667  —     0.05/6 | 0.883  0.985  0.918  —     0.08/10
AP           | 0.679  0.475  0.783  3/3   10     | 0.681  0.812  0.882  2/2   40
FKNN-DPC     | 0.941  0.960  0.987  3/3   5      | 0.056  0.132  0.793  —     10
K-means      | 0.568  0.461  0.772  —     3      | 0.492  0.577  0.712  —     2

             |          Aggregation              |              Flame
Algorithm    |  AMI    ARI    ACC   ECAC  Par    |  AMI    ARI    ACC   ECAC  Par
DPC-SFSKNN   | 0.942  0.951  0.963  7/7   6      | 0.873  0.934  0.956  2/2   6
DPC          | 1.000  1.000  1.000  7/7   4      | 1.000  1.000  1.000  2/2   5
DBSCAN       | 0.969  0.982  0.988  —     0.05/8 | 0.867  0.936  0.981  —     0.09/8
AP           | 0.795  0.753  0.841  7/7   77     | 0.452  0.534  0.876  3/3   35
FKNN-DPC     | 0.995  0.997  0.999  3/3   8      | 1.000  1.000  1.000  2/2   5
K-means      | 0.784  0.717  0.786  —     7      | 0.418  0.465  0.828  —     2

             |            DIM512                 |            DIM1024
Algorithm    |  AMI    ARI    ACC   ECAC  Par    |  AMI    ARI    ACC   ECAC   Par
DPC-SFSKNN   | 1.000  1.000  1.000  16/16 8      | 1.000  1.000  1.000  16/16  9
DPC          | 1.000  1.000  1.000  16/16 2      | 1.000  1.000  1.000  16/16  0.01
DBSCAN       | 1.000  1.000  1.000  —     0.3/7  | 1.000  1.000  1.000  —      1.0/8
AP           | 1.000  1.000  1.000  16/16 20     | 1.000  1.000  1.000  16/16  30
FKNN-DPC     | 1.000  1.000  1.000  16/16 8      | 1.000  1.000  1.000  16/16  10
K-means      | 0.895  0.811  0.850  —     1      | 0.868  0.752  0.796  —      16


4.1. Analysis of the Experimental Results on Synthetic Datasets. In this section, the performance of DPC-SFSKNN, DPC [20], DBSCAN [15], AP [8], FKNN-DPC [9], and K-means [10] is tested with the six synthetic datasets given in Table 1. These synthetic datasets differ in distribution and quantity, so different data situations can be simulated to compare the performance of the six algorithms. Table 3 shows the AMI, ARI, ACC, and ECAC of the six clustering algorithms on the six synthetic datasets, where "—" means no value. Figures 6-9 show the clustering results of DPC-SFSKNN, DPC, DBSCAN, AP, FKNN-DPC, and K-means on the Pathbased, Flame, Aggregation, and Jain datasets, respectively. Five of the algorithms achieve the optimal clustering on the DIM512 and DIM1024 datasets, so the clustering of these two datasets is not shown. Since the cluster centers of DBSCAN are relatively random, only the positions of the clustering centers of the other algorithms are marked.

Figure 6 shows the results on the Pathbased dataset. DPC-SFSKNN and FKNN-DPC can complete the clustering of the Pathbased dataset correctly. From Figures 6(b), 6(d), and 6(f), it can be seen that the clustering results of DPC, AP, and K-means are similar. The clustering centers selected by DPC, AP, DPC-SFSKNN, and FKNN-DPC are highly similar, but the clustering results of DPC and AP are not satisfactory. For the DPC algorithm, the low fault tolerance of its allocation strategy is the cause of this result.

Table 4: Comparison of ACC, AMI, and ARI benchmarks for 6 clustering algorithms on real-world datasets.

             |             Iris                  |        Libras movement
Algorithm    |  AMI    ARI    ACC   ECAC  Par    |  AMI    ARI    ACC   ECAC   Par
DPC-SFSKNN   | 0.896  0.901  0.962  3/3   6      | 0.547  0.368  0.510  10/15  8
DPC          | 0.812  0.827  0.926  3/3   2      | 0.535  0.304  0.438  9/15   0.5
DBSCAN       | 0.792  0.754  0.893  —     0.14/9 | 0.412  0.183  0.385  —      0.96/5
AP           | 0.764  0.775  0.911  3/3   6      | 0.364  0.267  0.453  10/15  25
FKNN-DPC     | 0.912  0.922  0.973  3/3   7      | 0.508  0.308  0.436  10/15  9
K-means      | 0.683  0.662  0.823  —     3      | 0.522  0.306  0.449  —      15

             |             Wine                    |           Parkinsons
Algorithm    |  AMI    ARI    ACC   ECAC  Par      |  AMI    ARI    ACC   ECAC  Par
DPC-SFSKNN   | 0.843  0.851  0.951  3/3   6        | 0.193  0.380  0.827  2/2   6
DPC          | 0.706  0.672  0.882  3/3   2        | 0.210  0.114  0.612  2/2   5
DBSCAN       | 0.612  0.643  0.856  —     0.42/10  | 0.205  0.213  0.674  —     0.4/6
AP           | 0.592  0.544  0.781  3/3   6        | 0.142  0.127  0.669  2/2   15
FKNN-DPC     | 0.831  0.852  0.949  3/3   7        | 0.273  0.391  0.851  2/2   5
K-means      | 0.817  0.838  0.936  —     3        | 0.201  0.049  0.625  —     2

             |             WDBC                  |           Ionosphere
Algorithm    |  AMI    ARI    ACC   ECAC  Par    |  AMI    ARI    ACC   ECAC  Par
DPC-SFSKNN   | 0.432  0.516  0.857  2/2   6      | 0.361  0.428  0.786  3/2   7
DPC          | 0.002  −0.004 0.602  2/2   9      | 0.238  0.276  0.681  3/2   0.65
DBSCAN       | 0.397  0.538  0.862  —     0.27/7 | 0.544  0.683  0.853  —     0.2/7
AP           | 0.598  0.461  0.854  2/2   40     | 0.132  0.168  0.706  2/2   15
FKNN-DPC     | 0.679  0.786  0.944  2/2   7      | 0.284  0.355  0.752  2/2   8
K-means      | 0.611  0.730  0.928  —     2      | 0.129  0.178  0.712  —     2

             |          Segmentation               |     Pima-Indians-diabetes
Algorithm    |  AMI    ARI    ACC   ECAC  Par      |  AMI    ARI    ACC   ECAC  Par
DPC-SFSKNN   | 0.665  0.562  0.746  6/7   6        | 0.037  0.083  0.652  2/2   6
DPC          | 0.650  0.550  0.684  6/7   3        | 0.033  0.075  0.647  2/2   4
DBSCAN       | 0.446  0.451  0.550  —     0.25/10  | 0.028  0.041  0.577  —     0.15/6
AP           | 0.405  0.436  0.554  7/7   25       | 0.045  0.089  0.629  3/2   35
FKNN-DPC     | 0.655  0.555  0.716  7/7   7        | 0.001  0.011  0.612  2/2   6
K-means      | 0.583  0.495  0.612  —     6        | 0.050  0.102  0.668  —     2

             |             Seeds                 |           Dermatology
Algorithm    |  AMI    ARI    ACC   ECAC  Par    |  AMI    ARI    ACC   ECAC  Par
DPC-SFSKNN   | 0.753  0.786  0.919  3/3   7      | 0.862  0.753  0.808  7/6   6
DPC          | 0.727  0.760  0.918  3/3   2      | 0.611  0.514  0.703  4/6   2
DBSCAN       | 0.640  0.713  0.874  —     0.17/8 | 0.689  0.690  0.815  —     0.7/3
AP           | 0.598  0.682  0.896  3/3   10     | 0.766  0.701  0.762  7/6   5
FKNN-DPC     | 0.759  0.790  0.924  3/3   8      | 0.847  0.718  0.768  7/6   7
K-means      | 0.671  0.705  0.890  —     3      | 0.796  0.680  0.702  —     6

             |           Waveform                |        Waveform (noise)
Algorithm    |  AMI    ARI    ACC   ECAC  Par    |  AMI    ARI    ACC   ECAC  Par
DPC-SFSKNN   | 0.355  0.382  0.725  3/3   5      | 0.267  0.288  0.651  3/3   6
DPC          | 0.320  0.269  0.586  3/3   0.5    | 0.104  0.095  0.502  3/3   0.3
DBSCAN       | —      —      —      —     —      | —      —      —      —     —
AP           | —      —      —      —     —      | —      —      —      —     —
FKNN-DPC     | 0.324  0.350  0.703  3/3   5      | 0.247  0.253  0.648  3/3   5
K-means      | 0.363  0.254  0.501  —     3      | 0.364  0.252  0.512  —     3



Figure 6: The clustering of Pathbased by 6 clustering algorithms: (a) DPC-SFSKNN, (b) DPC, (c) DBSCAN, (d) AP, (e) FKNN-DPC, and (f) K-means.


Figure 7: The clustering of Flame by 6 clustering algorithms: (a) DPC-SFSKNN, (b) DPC, (c) DBSCAN, (d) AP, (e) FKNN-DPC, and (f) K-means.



A high-density point allocation error will be transferred to low-density points, and this error propagation seriously affects the clustering results. The AP and K-means algorithms are not good at dealing with irregular clusters: the two clusters in the middle are too attractive to the points on both sides of the semicircular cluster, which leads to clustering errors. DBSCAN can completely detect the semicircular cluster, but the semicircular cluster and the cluster on the left of the middle are incorrectly classified into one category, and the cluster on the right of the middle is divided into two clusters; the similarities between points and the manually prespecified parameters may severely affect the clustering. The DPC-SFSKNN and FKNN-DPC algorithms perform well on the Pathbased dataset; these improved algorithms, which consider neighbor relationships, have a great advantage in handling such complex distributed datasets.

Figure 7 shows the results of the six algorithms on the Flame dataset. As shown in the figure, DPC-SFSKNN, DPC, FKNN-DPC, and DBSCAN can correctly detect the two clusters, while AP and K-means cannot cluster them completely correctly. Although AP can correctly identify the upper cluster and select an appropriate cluster center, the lower cluster is divided into two clusters, and both clusters found by K-means are wrong. The clustering results in Figure 8 show that the DPC-SFSKNN, DPC, FKNN-DPC, and DBSCAN algorithms can detect the 7 clusters in the Aggregation dataset, but AP and K-means still cannot cluster correctly. DPC-SFSKNN, DPC, and FKNN-DPC can identify both the clusters and their centers. Although the cluster centers are not marked for DBSCAN, the number of clusters and the overall shape of each cluster are correct. The AP algorithm successfully finds the correct number of clusters, but it chooses two centers for one cluster, which divides that cluster into two; the clustering result of K-means is similar to that of AP.

The Jain dataset, shown in Figure 9, consists of two semicircular clusters of different densities. As shown in the figure, the DPC-SFSKNN algorithm can completely cluster the two clusters with different densities. However, DPC, AP, FKNN-DPC, and K-means incorrectly assign the left end of the lower cluster to the upper cluster, and the cluster centers of DPC are all on the lower cluster. Compared with that, the distribution of the cluster centers of AP is more reasonable. The DBSCAN algorithm can accurately identify the lower cluster, but the left end of the upper cluster is incorrectly divided into a new cluster, so that the upper cluster is split into two.

According to the benchmark data shown in Table 3, it is clear that the performance of DPC-SFSKNN is very effective among the six clustering algorithms, especially on the Jain dataset. Although DPC and FKNN-DPC perform better than DPC-SFSKNN on the Aggregation and Flame datasets, DPC-SFSKNN can find the correct clustering centers of Aggregation and can complete the clustering task correctly.

4.2. Analysis of Experimental Results on Real-World Datasets. In this section, the performance of the six algorithms is again benchmarked according to AMI, ARI, ACC, and ECAC, and the clustering results are summarized in Table 4. Twelve real-world datasets are selected to test DPC-SFSKNN's ability to identify clusters. The DBSCAN and AP algorithms cannot obtain effective clustering results on the Waveform and Waveform (noise) datasets; the symbol "—" represents no result.

As shown in Table 4, in terms of the benchmarks AMI, ARI, and ACC, DPC-SFSKNN outperforms the other five algorithms on the Wine, Segmentation, and Libras movement datasets, while FKNN-DPC performs better than the other five algorithms on the Iris, Seeds, Parkinsons, and WDBC datasets. It can be seen that the overall performance of DPC-SFSKNN is slightly better than that of DPC on 11 of the datasets, the exception being Parkinsons; on Parkinsons, DPC-SFSKNN is slightly worse than DPC in AMI but better than DPC in ARI and ACC. Similarly, DPC-SFSKNN performs slightly better than FKNN-DPC on eight of the datasets, and on the remaining four (Iris, Parkinsons, WDBC, and Seeds) it is slightly worse than FKNN-DPC in AMI, ARI, and ACC. DBSCAN obtains the best results on Ionosphere, K-means is the best on Pima-Indians-diabetes, and K-means is also the best in AMI on the Waveform and Waveform (noise) datasets. In general, the clustering results of DPC-SFSKNN on the real-world datasets are satisfactory.

Figure 8: The clustering of Aggregation by 6 clustering algorithms: (a) DPC-SFSKNN, (b) DPC, (c) DBSCAN, (d) AP, (e) FKNN-DPC, and (f) K-means.


4.3. Experimental Analysis of the Olivetti Face Dataset. The Olivetti face dataset [28] is an image dataset widely used by machine learning algorithms. Its purpose here is to test the clustering ability of an algorithm without supervision, including determining the number of clusters in the database and the members of each cluster. The dataset contains 40 clusters, each of which has 10 different images. Because the number of clusters (40) is large while each cluster contains only a few elements (10 images per cluster), the reliability of the local density becomes smaller, which is a great challenge for density-based clustering algorithms. To further test the clustering performance of DPC-SFSKNN, experiments were performed on the Olivetti face database and compared with DPC, AP, DBSCAN, FKNN-DPC, and K-means.

The clustering results achieved by DPC-SFSKNN and DPC for the Olivetti face database are shown in Figure 10, where white squares represent the cluster centers. The 32 clusters found by DPC-SFSKNN in Figure 10(a) and the 20 clusters found by DPC in Figure 10(b) are displayed in different colors; gray images indicate that the image is not assigned to any cluster. It can be seen from Figure 10(a) that the 32 cluster centers found by DPC-SFSKNN cover 29 clusters, and, as shown in Figure 10(b), the 20 cluster centers found by DPC are scattered over 19 clusters. Similar to DPC-SFSKNN, DPC may divide one cluster into two. Because DPC-SFSKNN can find many more density peaks than DPC, it is more likely to identify one cluster as two different clusters; the same situation occurs with the FKNN-DPC algorithm.

Figure 9: The clustering of Jain by 6 clustering algorithms: (a) DPC-SFSKNN, (b) DPC, (c) DBSCAN, (d) AP, (e) FKNN-DPC, and (f) K-means.


However, the performance of FKNN-DPC is better than that of DPC-SFSKNN in AMI, ARI, ACC, and ECAC. The clustering results of these algorithms are compared in Table 5 based on AMI, ARI, ACC, and ECAC; the performance of DPC-SFSKNN is slightly superior to that of the other four algorithms, FKNN-DPC excepted.

4.4. Running Time. This section compares the time performance of DPC-SFSKNN with that of DPC, DBSCAN, AP, FKNN-DPC, and K-means on the real-world datasets. The time complexities of DPC-SFSKNN and DPC were analyzed in Section 3.3.1: the time complexity of DPC is O(n²) and that of DPC-SFSKNN is O(kn²), where n is the size of the dataset. However, the time consumed by DPC mainly comes from calculating the local density and the relative distance of each point, while the time consumed by DPC-SFSKNN comes mainly from the calculation of the K-nearest neighbors and from the division strategy for noncenter points. Table 6 lists the running times (in seconds) of the six algorithms on the real datasets. It can be seen that, although the time complexity of DPC-SFSKNN is approximately k times that of DPC, their execution times on actual datasets do not differ by a factor of k.

From Table 6 it can be found that, on a relatively small dataset, the running time of DPC-SFSKNN is about twice or more that of DPC, and the difference mainly comes from DPC-SFSKNN's allocation strategy. Although the computational load of the local densities grows very quickly with the size of a dataset, the time consumed by the allocation strategy in DPC-SFSKNN varies with the distribution of the dataset. This leads to an irregular gap between the running times of DPC and DPC-SFSKNN.
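The wall-clock comparisons in Table 6 can be reproduced in spirit by timing each algorithm's fit on the preprocessed data; a minimal sketch, using scikit-learn's K-means as a stand-in since the DPC-SFSKNN implementation is not listed here, is shown below.

```python
import time
import numpy as np
from sklearn.cluster import KMeans

def timed_fit(model, X):
    """Return cluster labels and wall-clock fitting time in seconds."""
    start = time.perf_counter()
    labels = model.fit_predict(X)
    return labels, time.perf_counter() - start

X = np.random.rand(1000, 10)   # stand-in data
_, seconds = timed_fit(KMeans(n_clusters=3, n_init=10, random_state=0), X)
print(f"K-means: {seconds:.3f} s")
```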

FKNN-DPC has the same time and space complexity as DPC, but its running time is almost the same as that of DPC-SFSKNN, because it also takes a lot of time to compute the relationships between K-nearest neighbors. The time complexity of DBSCAN and of AP is approximately O(n²), and the parameters of both cannot be determined by simple methods. When the dataset is relatively large, it is difficult to find their optimal parameters, which may be the reason why the two algorithms produce no results on the Waveform datasets.

Figure 10: The clustering of Olivetti by two clustering algorithms: (a) DPC-SFSKNN and (b) DPC.

Table 5: Performance comparison of algorithms by clustering criteria for the Olivetti face database.

Metric | DPC-SFSKNN | DPC   | DBSCAN | AP    | FKNN-DPC | K-means
ACC    | 0.786      | 0.665 | 0.648  | 0.763 | 0.818    | 0.681
AMI    | 0.792      | 0.728 | 0.691  | 0.737 | 0.832    | 0.742
ARI    | 0.669      | 0.560 | 0.526  | 0.619 | 0.714    | 0.585
ECAC   | 32/40      | 20/40 | —      | 28/40 | 36/40    | —
Par    | 6          | 0.5   | 0.6/4  | 21    | 4        | 40


The approximate time complexity of K-means is O(n), and Table 6 confirms its efficiency. K-means achieves fast speed with almost no loss of accuracy, which makes it a very popular clustering algorithm, but it does not handle irregularly shaped data well.

5. Conclusions and Future Work

A new clustering algorithm based on the traditional DPC algorithm is proposed in this paper. The algorithm introduces a density peak search that takes the surrounding neighbor information into account and develops a new allocation strategy to detect the true distribution of the dataset. The proposed clustering algorithm performs a fast search, finds the density peaks (that is, the cluster centers) of a dataset of any size, and recognizes clusters with arbitrary shape and dimensionality. The algorithm is called DPC-SFSKNN, which means that it calculates the local density and the relative distance using distance information between points and their neighbors to find the cluster centers, and the remaining points are then assigned using a similarity-first search algorithm based on the weighted KNN graph to find the owner (clustering center) of each point. DPC-SFSKNN successfully addresses several issues arising from the clustering algorithm of Alex Rodriguez and Alessandro Laio [20], including its density metric and the potential issue hidden in its assignment strategy. The performance of DPC-SFSKNN was tested on several synthetic datasets, on real-world datasets from the UCI machine learning repository, and on the well-known Olivetti face database. The experimental results on these datasets demonstrate that DPC-SFSKNN is powerful in finding cluster centers and in recognizing clusters regardless of their shape, of the dimensionality of the space in which they are embedded, and of the size of the datasets, and that it is robust to outliers; it performs much better than the original DPC algorithm. However, the proposed algorithm has some limitations: the parameter K needs to be manually adjusted according to the dataset; the clustering centers still need to be manually selected by analyzing the decision graph (as in the DPC algorithm); and the allocation strategy improves the clustering accuracy but at additional time cost. How to improve the degree of automation and the allocation efficiency of the algorithm needs further research.

Data Availability

The synthetic datasets are cited at relevant places within the text as references [23-27]. The real-world datasets are cited at relevant places within the text as references [29-34]. The Olivetti face dataset is cited at relevant places within the text as reference [28].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (6160303040 and 61433003), in part by the Yunnan Applied Basic Research Project of China (201701CF00037), and in part by the Yunnan Provincial Science and Technology Department Key Research Program (Engineering) (2018BA070).

Supplementary Materials

It includes the datasets used in the experiments in this paper. (Supplementary Materials)

References

[1] K. L. Liu, Y. L. Shang, Q. Ouyang, and W. D. Widanage, "A data-driven approach with uncertainty quantification for predicting future capacities and remaining useful life of lithium-ion battery," IEEE Transactions on Industrial Electronics, p. 1, 2020.
[2] X. P. Tang, K. L. Liu, X. Wang et al., "Model migration neural network for predicting battery aging trajectories," IEEE Transactions on Transportation Electrification, vol. 6, no. 2, pp. 363-374, 2020.
[3] X. Tang, K. Liu, X. Wang, B. Liu, F. Gao, and W. D. Widanage, "Real-time aging trajectory prediction using a base model-oriented gradient-correction particle filter for lithium-ion batteries," Journal of Power Sources, vol. 440, Article ID 227118, 2019.

Table 6: Running time of 6 clustering algorithms in seconds on UCI datasets.

Dataset               | DPC-SFSKNN | DPC   | DBSCAN | AP    | FKNN-DPC | K-means
Iris                  | 0.241      | 0.049 | 0.059  | 0.565 | 0.148    | 0.014
Wine                  | 0.238      | 0.048 | 0.098  | 0.832 | 0.168    | 0.013
WDBC                  | 0.484      | 0.092 | 0.884  | 6.115 | 0.464    | 0.018
Seeds                 | 0.244      | 0.050 | 0.122  | 0.973 | 0.164    | 0.014
Libras movement       | 0.602      | 0.068 | 0.309  | 3.016 | 2.602    | 0.075
Ionosphere            | 0.325      | 0.064 | 0.349  | 2.018 | 0.309    | 0.021
Segmentation          | 1.569      | 0.806 | 8.727  | 6.679 | 0.313    | 0.062
Dermatology           | 0.309      | 0.063 | 0.513  | 2.185 | 0.409    | 0.007
Pima-Indians-diabetes | 0.792      | 0.126 | 2.018  | 9.709 | 0.892    | 0.009
Parkinsons            | 0.255      | 0.048 | 0.114  | 0.866 | 0.263    | 0.003
Waveform              | 16.071     | 3.511 | —      | —     | 7.775    | 0.067
Waveform (noise)      | 17.571     | 3.784 | —      | —     | 7.525    | 0.109

[4] K. Liu, Y. Li, X. Hu, M. Lucu, and W. D. Widanage, "Gaussian process regression with automatic relevance determination kernel for calendar aging prediction of lithium-ion batteries," IEEE Transactions on Industrial Informatics, vol. 16, no. 6, pp. 3767-3777, 2020.
[5] K. Liu, X. Hu, Z. Wei, Y. Li, and Y. Jiang, "Modified Gaussian process regression models for cyclic capacity prediction of lithium-ion batteries," IEEE Transactions on Transportation Electrification, vol. 5, no. 4, pp. 1225-1236, 2019.
[6] L. Cai, J. Meng, D.-I. Stroe, G. Luo, and R. Teodorescu, "An evolutionary framework for lithium-ion battery state of health estimation," Journal of Power Sources, vol. 412, pp. 615-622, 2019.
[7] L. Cai, J. H. Meng, D. I. Stroe et al., "Multi-objective optimization of data-driven model for lithium-ion battery SOH estimation with short-term feature," IEEE Transactions on Power Electronics, p. 1, 2020.
[8] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972-976, 2007.
[9] J. Xie, H. Gao, W. Xie, X. Liu, and P. W. Grant, "Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors," Information Sciences, vol. 354, pp. 19-40, 2016.
[10] F. S. Samaria and A. C. Harter, "Some methods for classification and analysis of multivariate observations," in Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297, Berkeley, CA, USA, 1967.
[11] S. Kant, T. L. Rao, and P. N. Sundaram, "An automatic and stable clustering algorithm," Pattern Recognition Letters, vol. 15, no. 6, pp. 543-549, 1994.
[12] D. Arthur and S. Vassilvitskii, "K-means++: the advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 7-9, New Orleans, LA, USA, 2007.
[13] Y. Zhao, W. Halang, and X. Wang, "Rough ontology mapping in E-business integration," E-Service Intelligence, BMC Bioinf, vol. 8, pp. 75-93, 2007.
[14] Y. Xiao and J. Yu, "Semi-supervised clustering based on affinity propagation algorithm," ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, 2007.
[15] M. Ester, H. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 226-231, Portland, OR, USA, 1996.
[16] R. J. G. B. Campello, D. Moulavi, and J. Sander, "Density-based clustering based on hierarchical density estimates," Advances in Knowledge Discovery and Data Mining, vol. 7819, pp. 160-172, 2013.
[17] Z. Liang and P. Chen, "Delta-density based clustering with a divide-and-conquer strategy: 3DC clustering," Pattern Recognition Letters, vol. 73, pp. 52-59, 2016.
[18] M. Ankerst, M. M. Breuning, H. P. Kriegel, and J. Sander, "OPTICS: ordering points to identify the clustering structure," in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pp. 49-60, Philadelphia, PA, USA, 1999.
[19] M. Du, S. Ding, and H. Jia, "Study on density peaks clustering based on k-nearest neighbors and principal component analysis," Knowledge-Based Systems, vol. 99, pp. 135-145, 2016.
[20] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492-1496, 2014.
[21] T. Li, H. W. Ge, and S. Z. Su, "Density peaks clustering by automatic determination of cluster centers," Journal of Computer Science and Technology, vol. 10, no. 11, pp. 1614-1622, 2016.
[22] R. Liu, H. Wang, and X. Yu, "Shared-nearest-neighbor-based clustering by fast search and find of density peaks," Information Sciences, vol. 450, pp. 200-226, 2018.
[23] R. A. Jarvis and E. A. Patrick, "Clustering using a similarity measure based on shared near neighbors," IEEE Transactions on Computers, vol. C-22, no. 11, pp. 1025-1034, 1973.
[24] H. Chang and D.-Y. Yeung, "Robust path-based spectral clustering," Pattern Recognition, vol. 41, no. 1, pp. 191-203, 2008.
[25] L. Fu and E. Medico, "FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data," BMC Bioinformatics, vol. 8, no. 1, 2007.
[26] A. Gionis, H. Mannila, and P. Tsaparas, "Clustering aggregation," ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, p. 4, 2007.
[27] P. Franti, O. Virmajoki, and V. Hautamaki, "Fast agglomerative clustering using a k-nearest neighbor graph," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1875-1881, 2006.
[28] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 1994 IEEE Workshop on Applications of Computer Vision, pp. 138-142, Sarasota, FL, USA, 1994.
[29] K. Bache and M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, 2013.
[30] M. Charytanowicz, J. Niewczas, P. Kulczycki, P. A. Kowalski, S. Lukasik, and S. Zak, "Complete gradient clustering algorithm for features analysis of X-ray images," Information Technologies in Biomedicine, Advances in Intelligent and Soft Computing, vol. 69, Springer, Berlin, Germany.
[31] D. B. Dias, R. C. B. Madeo, T. Rocha, H. H. Biscaro, and S. M. Peres, "Hand movement recognition for Brazilian sign language: a study using distance-based neural networks," in Proceedings of the 2009 International Joint Conference on Neural Networks, pp. 697-704, Atlanta, GA, USA, 2009.
[32] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, Routledge, New York, NY, USA, 1st edition, 1984.
[33] V. G. Sigillito, S. P. Wing, L. V. Hutton, and K. B. Baker, "Classification of radar returns from the ionosphere using neural networks," Johns Hopkins APL, vol. 10, no. 3, pp. 262-266, 1989.
[34] W. N. Street, W. H. Wolberg, and O. L. Mangasarian, "Nuclear feature extraction for breast tumor diagnosis," in Proceedings of SPIE 1905, Biomedical Image Processing and Biomedical Visualization, San Jose, CA, USA, 1993.
[35] X. V. Nguyen, J. Epps, and J. Bailey, "Information theoretic measures for clusterings comparison: is a correction for chance necessary?" in Proceedings of ICML 2009, the 26th Annual International Conference on Machine Learning, Montreal, Canada, 2009.
[36] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, Burlington, MA, USA, 3rd edition, 2011.

Complexity 17

Page 9: ClusteringbyDetectingDensityPeaksandAssigningPointsby ... · FKNN-DPC [9] is given in (5) and (6), respectively: ρ i exp − 1 K X j∈knn(i) d2 ij ⎛⎝ ⎞⎠, (5) ρ i X j∈knn(i)

to determine dc Based on the provided selection range dc isselected so that the number of neighbors is between 1 and 2of the total number of data points [20] +e two parametersthat DBSCAN needs to determine are ε and minpts as in[15]+e optimal parameters are determined using a circularsearch method +e AP algorithm only needs to determine apreference and the larger the preference the more the center

points are allowed to be selected [8] +e general method forselecting parameters is not effective and only multiple ex-periments can be performed to select the optimal parame-ters +e only parameter of K-means is the number ofclusters +e true number of clusters in the dataset is used inthis case Similarly FKNN-DPC needs to determine thenearest neighbors K

Table 2 Real-world datasets

Dataset Records Attributes Clusters SourceIris 150 4 3 [29]Libras movement 360 90 15 [31]Wine 178 13 3 [29]Parkinsons 197 23 2 [29]WDBC 569 30 2 [34]Pima-Indians-diabetes 768 8 2 [29]Segmentation 2310 19 7 [29]Dermatology 366 33 6 [29]Seeds 210 7 3 [30]Ionosphere 351 34 2 [33]Waveform 5000 21 3 [32]Waveform (noise) 5000 40 3 [32]Olivetti face 400 92lowast112 40 [28]

Table 1 Synthetic datasets

Dataset Records Attributes Clusters SourcePathbased 300 2 3 [24]Jain 373 2 2 [23]Flame 240 2 2 [25]Aggregation 788 2 7 [26]DIM512 1024 512 16 [27]DIM1024 1024 1024 16 [27]

Table 3 +e comparison of ACC AMI and ARI benchmarks for 6 clustering algorithms on synthetic datasets

Algorithm AMI ARI ACC ECAC Par AMI ARI ACC ECAC ParPathbased Jain

DPC-SFSKNN 0926 0910 0925 33 6 1000 1000 1000 22 7DPC 0521 0463 0742 33 2 0609 0713 0853 22 3DBSCAN 0781 0522 0667 mdash 0056 0883 0985 0918 mdash 00810AP 0679 0475 0783 33 10 0681 0812 0882 22 40FKNN-DPC 0941 0960 0987 33 5 0056 0132 0793 mdash 10K-means 0568 0461 0772 mdash 3 0492 0577 0712 mdash 2

Aggregation FlameDPC-SFSKNN 0942 0951 0963 77 6 0873 0934 0956 22 6DPC 1000 1000 1000 77 4 1000 1000 1000 22 5DBSCAN 0969 0982 0988 mdash 0058 0867 0936 0981 - 0098AP 0795 0753 0841 77 77 0452 0534 0876 33 35FKNN-DPC 0995 0997 0999 33 8 1000 1000 1000 22 5K-means 0784 0717 0786 mdash 7 0418 0465 0828 mdash 2

DIM512 DIM1024DPC-SFSKNN 1000 1000 1000 1616 8 1000 1000 1000 1616 9DPC 1000 1000 1000 1616 2 1000 1000 1000 1616 001DBSCAN 1000 1000 1000 mdash 037 1000 1000 1000 mdash 108AP 1000 1000 1000 1616 20 1000 1000 1000 1616 30FKNN-DPC 1000 1000 1000 1616 8 1000 1000 1000 1616 10K-means 0895 0811 0850 mdash 1 0868 0752 0796 mdash 16

Complexity 9

41Analysis of theExperimentalResults on SyntheticDatasetsIn this section the performance of DPC-SFSKNN DPC[20] DBSCAN [15] AP [8] FKNN-DPC [9] and K-means[10] is tested with six synthetic datasets given in Table 1+ese synthetic datasets are different in distribution andquantity Different data situations can be simulated tocompare the performance of six algorithms in differentsituations Table 3 shows AMI ARI ACC and ECAC of thefive clustering algorithms on the six comprehensive datasetswhere the best results are shown in bold and ldquomdashrdquo means novalue Figures 6ndash9 show the clustering results of DPC-SFSKNN DPC DBSCAN AP FKNN-DPC and K-meansbased on the Pathbased Flame Aggregation and Jaindatasets respectively +e five algorithms achieve the

optimal clustering on DIM512 and DIM1024 datasets sothat the clustering of the two datasets is not shown Since thecluster centers of DBSCAN are relatively random only thepositions of clustering centers of the other three algorithmsare marked

Figure 6 shows the results of the Pathbased datasetDPC-SFSKNN and FKNN-DPC can complete the clusteringof the Pathbased dataset correctly From Figures 6(b) 6(d)and 6(f) it can be seen that the clustering results of DPC APand K-means are similar +e clustering centers selected byDPC AP DPC-SFSKNN and FKNN-DPC are highlysimilar but the clustering results of DPC and AP are notsatisfactory For the DPC algorithm the low fault tolerancerate of its allocation strategy is the cause of this result A

Table 4 Comparison of ACC AMI and ARI benchmarks for 6 clustering algorithms on real-world datasets

Algorithm AMI ARI ACC ECAC Par AMI ARI ACC ECAC ParIris Libras movement

DPC-SFSKNN 0896 0901 0962 33 6 0547 0368 0510 1015 8DPC 0812 0827 0926 33 2 0535 0304 0438 915 05DBSCAN 0792 0754 0893 mdash 0149 0412 0183 0385 mdash 0965AP 0764 0775 0911 33 6 0364 0267 0453 1015 25FKNN-DPC 0912 0922 0973 33 7 0508 0308 0436 1015 9K-means 0683 0662 0823 mdash 3 0522 0306 0449 mdash 15

Wine ParkinsonsDPC-SFSKNN 0843 0851 0951 33 6 0193 0380 0827 22 6DPC 0706 0672 0882 33 2 0210 0114 0612 22 5DBSCAN 0612 0643 0856 mdash 04210 0205 0213 0674 mdash 046AP 0592 0544 0781 33 6 0142 0127 0669 22 15FKNN-DPC 0831 0852 0949 33 7 0273 0391 0851 22 5K-means 0817 0838 0936 mdash 3 0201 0049 0625 mdash 2

WDBC IonosphereDPC-SFSKNN 0432 0516 0857 22 6 0361 0428 0786 32 7DPC 0002 minus0004 0602 22 9 0238 0276 0681 32 065DBSCAN 0397 0538 0862 mdash 0277 0544 0683 0853 mdash 027AP 0598 0461 0854 22 40 0132 0168 0706 22 15FKNN-DPC 0679 0786 0944 22 7 0284 0355 0752 22 8K-means 0611 0730 0928 mdash 2 0129 0178 0712 mdash 2

Segmentation Pima-Indians-diabetesDPC-SFSKNN 0665 0562 0746 67 6 0037 0083 0652 22 6DPC 0650 0550 0684 67 3 0033 0075 0647 22 4DBSCAN 0446 0451 0550 mdash 02510 0028 0041 0577 mdash 0156AP 0405 0436 0554 77 25 0045 0089 0629 32 35FKNN-DPC 0655 0555 0716 77 7 0001 0011 0612 22 6K-means 0583 0495 0612 mdash 6 0050 0102 0668 mdash 2

Seeds DermatologyDPC-SFSKNN 0753 0786 0919 33 7 0862 0753 0808 76 6DPC 0727 0760 0918 33 2 0611 0514 0703 46 2DBSCAN 0640 0713 0874 mdash 0178 0689 0690 0815 mdash 073AP 0598 0682 0896 33 10 0766 0701 0762 76 5FKNN-DPC 0759 0790 0924 33 8 0847 0718 0768 76 7K-means 0671 0705 0890 mdash 3 0796 0680 0702 mdash 6

Waveform Waveform (noise)DPC-SFSKNN 0355 0382 0725 33 5 0267 0288 0651 33 6DPC 0320 0269 0586 33 05 0104 0095 0502 33 03DBSCAN mdash mdash mdash mdash mdash mdash mdash mdash mdash mdashAP mdash mdash mdash mdash mdash mdash mdash mdash mdash mdashFKNN-DPC 0324 0350 0703 33 5 0247 0253 0648 33 5K-means 0363 0254 0501 - 3 0364 0252 0512 mdash 3

10 Complexity

0

02

04

06

08

1

05 10

(a)

0

02

04

06

08

1

05 10

(b)

Figure 7 Continued

05 100

02

04

06

08

1

(a)

05 100

02

04

06

08

1

(b)

05 100

02

04

06

08

1

(c)

05 100

02

04

06

08

1

(d)

05 100

02

04

06

08

1

(e)

05 100

02

04

06

08

1

(f )

Figure 6+e clustering of Pathbased by 6 clustering algorithms (a) DPC-SFSKNN (b) DPC (c) DBSCAN (d) AP (e) FKNN-DPC and (f)K-means

Complexity 11

05 100

02

04

06

08

1

(c)

05 100

02

04

06

08

1

(d)

05 100

02

04

06

08

1

(e)

05 100

02

04

06

08

1

(f )

Figure 7 +e clustering of Flame by 6 clustering algorithms (a) DPC-SFSKNN (b) DPC (c) DBSCAN (d) AP (e) FKNN-DPC and(f) K-means

05 100

02

04

06

08

1

(a)

05 100

02

04

06

08

1

(b)

0

02

04

06

08

1

05 10

(c)

05 100

02

04

06

08

1

(d)

Figure 8 Continued

12 Complexity

high-density point allocation error will be transferred tolow-density points and the error propagation will seriouslyaffect the clustering results AP and K-means algorithms arenot good at dealing with irregular clusters +e two clustersin the middle are too attractive to the points on both sides ofthe semicircular cluster which leads to clustering errorsDBSCAN can completely detect the semicircular cluster butthe semicircular cluster and the cluster on the left of themiddle are incorrectly classified into one category and thecluster on the right of the middle is divided into two clusters+e similarities between points and manually prespecifiedparameters may severely affect the clustering DPC-SFSKNNand FKNN-DPC algorithms perform well on the Pathbaseddataset +ese improved algorithms that consider neighborrelationships have a great advantage in handling suchcomplex distributed datasets

Figure 7 shows the results of four algorithms on theFlame dataset As shown in the figure DPC-SFSKNN DPCFKNN-DPC and DBSCAN can correctly detect two clusterswhile AP and K-means cannot completely correct clusteringAlthough AP can correctly identify higher clusters and selectthe appropriate cluster center the lower cluster is dividedinto two clusters Both clusters of K-means are wrong +eclustering results in Figure 8 show that the DPC-SFSKNNDPC FKNN-DPC and DBSCAN algorithms can detect 7clusters in the Aggregation dataset but AP and K-means stillcannot cluster correctly DPC-SFSKNN DPC and FKNN-DPC can identify clusters and centers Although the clustercenters are not marked for DBSCAN the number of clustersand the overall shape of each cluster are correct +e APalgorithm successfully found the correct number of clustersbut it chose two centers for one cluster which divided thecluster into two clusters +e clustering result of K-means issimilar to that of AP

+e Jain dataset shown in Figure 9 is a dataset consistingof two semicircular clusters of different densities As shownin the figure the DPC-SFSKNN algorithm can completelycluster two clusters with different densities However DPCAP FKNN-DPC and K-means incorrectly assign the leftend of the lower cluster to the higher cluster and the clustercenters of the DPC are all on the lower cluster Compared

with that the distribution of the cluster centers of the AP ismore reasonable For the DBSCAN algorithm it can ac-curately identify lower clusters but the left end of the highercluster is incorrectly divided into a new cluster so that thehigher cluster is divided into two clusters

According to the benchmark data shown in Table 3 it isclear that the performance of DPC-SFSKNN is very effectiveamong the six clustering algorithms especially in the Jaindataset Although DPC and FKNN-DPC perform betterthan DPC-SFSKNN on Aggregation and Flame datasetsDPC-SFSKNN can find the correct clustering center of theaggregation and can complete the clustering task correctly

42 Analysis of Experimental Results on Real-World DatasetsIn this section the performance of the five algorithms is stillbenchmarked according to AMI ARI ACC and ECACand the clustering results are summarized in Table 4 12 real-world datasets are selected to test DPC-SFSKNNrsquos ability toidentify clusters on different datasets DBSCAN and APalgorithm cannot get effective clustering results on wave-form and waveform (noise) +e symbol ldquomdashrdquo represents noresult

As shown in Table 4 in terms of benchmarks AMI ARIand ACC DPC-SFSKNN outperforms all five other algo-rithms on the Wine Segmentation and Libras movementdatasets At the same time FKNN-DPC performs better thanthe other five algorithms on the Iris Seeds Parkinsons andWDBC datasets It can be seen that the overall performanceof DPC-SFSKNN is slightly better than DPC on 11 datasetsexcept for Parkinsons On the Parkinsons DPC-SFSKNN isslightly worse than DPC in AMI but better than DPC in ARIand ACC Similarly DPC-SFSKNN had a slightly betterperformance in addition to Iris Parkinsons WDBC andSeeds of eight sets of data in FKNN-DPC and DPC-SFSKNN is slightly worse than DPC in AMI ARI and ACCDBSCAN gets the best results on the Ionosphere K-means isthe best on Pima-Indians-diabetes and K-means is the bestin AMI on waveform and waveform (noise) datasets Ingeneral the clustering results of DPC-SFSKNN in real-worlddatasets are satisfactory

0

02

04

06

08

1

05 10

(e)

05 100

02

04

06

08

1

(f )

Figure 8+e clustering of Aggregation by 6 clustering algorithms (a) DPC-SFSKNN (b) DPC (c) DBSCAN (d) AP (e) FKNN-DPC and(f) K-means

Complexity 13

43 Experimental Analysis of Olivetti Face DatasetOlivetti face dataset [28] is an image dataset widely used bymachine learning algorithms Its purpose is to test theclustering situation of the algorithm without supervisionincluding determining the number of clusters in the data-base and the members of each cluster +e dataset contains40 clusters each of which has 10 different images Becausethe actual number of clusters (40 different clusters) is equalto the number of elements in the dataset (10 different imageseach cluster) the reliability of local density becomes smallerwhich is a great challenge for density-based clustering al-gorithms To further test the clustering performance ofDPC-SFSKNN DPC-SFSKNN performed experiments onthe Olivetti face database and compared it with DPC APDBSCAN FKNN-DPC and K-means

+e clustering results achieved by DPC-SFSKNN andDPC for the Olivetti face database are shown in Figure 10and white squares represent the cluster centers +e 32clusters corresponding to DPC-SFSKNN found inFigure 10(a) and the 20 clusters found by DPC inFigure 10(b) are displayed in different colors Gray imagesindicate that the image is not assigned to any cluster It canbe seen from Figure 10(a) that DPC-SFSKNN found that the32 cluster centers were covered 29 clusters and as shown inFigure 10(b) the 20 cluster centers found by DPC werescattered in 19 clusters Similar to DPC-SFSKNN DPC maydivide one cluster into two clusters Because DPC-SFSKNNcan find much more density peaks than DPC it is morelikely to identify a cluster as two different clusters +e samesituation occurs with the FKNN-DPC algorithm However

0

02

04

06

08

1

05 10

(a)

05 100

02

04

06

08

1

(b)

0

02

04

06

08

1

05 10

(c)

0

02

04

06

08

1

05 10

(d)

0

02

04

06

08

1

05 10

(e)

0

02

04

06

08

1

05 10

(f )

Figure 9 +e clustering of Jain by 6 clustering algorithms (a) DPC-SFSKNN (b) DPC (c) DBSCAN (d) AP (e) FKNN-DPC and(f) K-means

14 Complexity

the performance of FKNN-DPC is better than that of DPC-SFSKNN in AMI ARI ACC and ECAC From the data inTable 5 based on AMI ARI ACC and ECAC the clus-tering results of these algorithms are compared +e per-formance of DPC-SFSKNNC is slightly superior to theperformance of the other four algorithms except FKNN-DPC

44RunningTime +is section shows the comparison of thetime performance of DPC-SFSKNN with DPC DBSCANAP FKNN-DPC and K-means on real-world datasets +etime complexity of DPC-SFSKNN and DPC has been an-alyzed in Section 331 and the time complexity of DPC isO(n2) and the time complexity of DPC-SFSKNN is O(kn2)where n is the size of the dataset However the time con-sumed by DPC mainly comes from calculating the localdensity and the relative distance of each point while the timeconsumed by DPC-SFSKNN comes mainly from the cal-culation of K-nearest neighbors and the division strategy ofnoncenter points Table 6 lists the running time (in seconds)

of the six algorithms on the real datasets It can be seen thatalthough the time complexity of DPC-SFSKNN is approx-imately k times that of DPC their execution time on actualdatasets is not k times

In Table 6 it can be found that on a relatively smalldataset the running time of DPC-SFSKNN is about twice ormore times that of DPC and the difference mainly comesfrom DPC-SFSKNNrsquos allocation strategy Although thecomputational load of the local densities for points growsvery quickly with the size of a dataset the time consumed bythe allocation strategy in DPC-SFSKNN increases randomlywith the distribution of a dataset +is leads to an irregulargap between the running time of DPC and DPC-SFSKNN

FKNN-DPC has the same time and space complexity asDPC but the running time is almost the same as DPC-SFSKNN It takes a lot of running time to calculate therelationship between K-nearest neighbors +e time com-plexity of DBSCAN and AP is approximate to O(n2) and theparameter selection of both cannot be determined by simplemethods When the dataset is relatively large it is difficult tofind their optimal parameters which may be the reason that

(a) (b)

Figure 10 +e clustering of Olivetti by two clustering algorithms (a) DPC-SFSKNN and (b) DPC

Table 5 Performance comparison of algorithms by clustering criteria for the Olivetti face database

Metric DPC-SFSKNN DPC DBSCAN AP FKNN-DPC K-meansACC 0786 0665 0648 0763 0818 0681AMI 0792 0728 0691 0737 0832 0742ARI 0669 0560 0526 0619 0714 0585ECAC 3240 2040 mdash 2840 3640 mdashPar 6 05 064 21 4 40

Complexity 15

the two algorithms have no running results on the waveformdataset +e approximate time complexity of K-means isO(n) and Table 6 proves its efficiency K-means has almostno loss of accuracy under the premise of fast speed whichmakes it a very popular clustering algorithm but K-means isnot sensitive to irregularly shaped data

5 Conclusions and Future Work

A new clustering algorithm is proposed based on the tra-ditional DPC algorithm in this paper +is algorithm pro-poses a density peak search algorithm that takes into accountthe surrounding neighbor information and develops a newallocation strategy to detect the true distribution of thedataset +e proposed clustering algorithm performs fastsearch finds density peaks say cluster centers of a dataset ofany size and recognizes clusters with any arbitrary shape ordimensionality +e algorithm is called DPC-SFSKNNwhich means that it calculates the local density and therelative distance by using some distance information be-tween points and neighbors to find the cluster center andthen the remaining points are assigned using similarity-first+e search algorithm is based on the weighted KNN graph tofind the owner (clustering center) of the point +e DPC-SFSKNN successfully addressed several issues arising fromthe clustering algorithm of Alex Rodriguez and AlessandroLaio [20] including its density metric and the potential issuehidden in its assignment strategy +e performance of DPC-SFSKNN was tested on several synthetic datasets and thereal-word datasets from the UCI machine learning reposi-tory and the well-known Olivetti face database +e ex-perimental results on these datasets demonstrate that ourDPC-SFSKNN is powerful in finding cluster centers and inrecognizing clusters regardless of their shape and of thedimensionality of the space in which they are embedded andof the size of the datasets and is robust to outliers It per-forms much better than the original algorithm DPCHowever the proposed algorithm has some limitations theparameter K needs to be manually adjusted according todifferent datasets the clustering centers still need to bemanually selected by analyzing the decision graph (like theDPC algorithm) the allocation strategy improves theclustering accuracy but takes time and cost How to improve

the degree of automation and allocation efficiency of thealgorithm needs further research

Data Availability

+e synthetic datasets are cited at relevant places within thetext as references [23ndash27] +e real-world datasets are citedat relevant places within the text as references [29ndash34] +eOlivetti face dataset is cited at relevant places within the textas references [28]

Conflicts of Interest

+e authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

+is work was supported in part by the National NaturalScience Foundation of China (6160303040 and 61433003) inpart by the Yunnan Applied Basic Research Project of China(201701CF00037) and in part by the Yunnan ProvincialScience and Technology Department Key Research Program(Engineering) (2018BA070)

Supplementary Materials

It includes the datasets used in the experiments in this paper(Supplementary Materials)

References

[1] K L Liu Y L Shang Q Ouyang and W D Widanage ldquoAdata-driven approach with uncertainty quantification forpredicting future capacities and remaining useful life oflithium-ion batteryrdquo IEEE Transactions on Industrial Elec-tronics p 1 2020

[2] X P Tang K L Liu X Wang et al ldquoModel migration neuralnetwork for predicting battery aging trajectoriesrdquo IEEETransactions on Transportation Electrification vol 6 no 2pp 363ndash374 2020

[3] X Tang K Liu XWang B Liu F Gao andW DWidanageldquoReal-time aging trajectory prediction using a base model-oriented gradient-correction particle filter for Lithium-ion

Table 6 Running time of 6 clustering algorithms in seconds on UCI datasets

Dataset DPC-SFSKNN DPC DBSCAN AP FKNN-DPC K-meansIris 0241 0049 0059 0565 0148 0014Wine 0238 0048 0098 0832 0168 0013WDBC 0484 0092 0884 6115 0464 0018Seeds 0244 0050 0122 0973 0164 0014Libras movement 0602 0068 0309 3016 2602 0075Ionosphere 0325 0064 0349 2018 0309 0021Segmentation 1569 0806 8727 6679 0313 0062Dermatology 0309 0063 0513 2185 0409 0007Pima-Indians-diabetes 0792 0126 2018 9709 0892 0009Parkinsons 0255 0048 0114 0866 0263 0003Waveform 16071 3511 mdash mdash 7775 0067Waveform (noise) 17571 3784 mdash mdash 7525 0109

16 Complexity

batteriesrdquo Journal of Power Sources vol 440 Article ID227118 2019

[4] K Liu Y Li X Hu M Lucu andW DWidanage ldquoGaussianprocess regression with automatic relevance determinationkernel for calendar aging prediction of lithium-ion batteriesrdquoIEEE Transactions on Industrial Informatics vol 16 no 6pp 3767ndash3777 2020

[5] K Liu X Hu Z Wei Y Li and Y Jiang ldquoModified Gaussianprocess regression models for cyclic capacity prediction oflithium-ion batteriesrdquo IEEE Transactions on TransportationElectrification vol 5 no 4 pp 1225ndash1236 2019

[6] L Cai J Meng D-I Stroe G Luo and R Teodorescu ldquoAnevolutionary framework for lithium-ion battery state of healthestimationrdquo Journal of Power Sources vol 412 pp 615ndash6222019

[7] L Cai J H Meng D I Stroe et al ldquoMulti-objective opti-mization of data-driven model for lithium-ion battery SOHestimation with short-term featurerdquo IEEE Transactions onPower Electronics p 1 2020

[8] B J Frey and D Dueck ldquoClustering by passing messagesbetween data pointsrdquo Science vol 315 no 5814 pp 972ndash9762007

[9] J Xie H Gao W Xie X Liu and P W Grant ldquoRobustclustering by detecting density peaks and assigning pointsbased on fuzzy weighted K-nearest neighborsrdquo InformationSciences vol 354 pp 19ndash40 2016

[10] F S Samaria and A C Harter ldquoSome methods for classifi-cation and analysis of multivariate observationsrdquo in Pro-ceedings of the Berkeley SymposiumOnMathematical Statisticsand Probability pp 281ndash297 Berkeley CA USA 1967

[11] S Kant T L Rao and P N Sundaram ldquoAn automatic andstable clustering algorithmrdquo Pattern Recognition Lettersvol 15 no 6 pp 543ndash549 1994

[12] D Arthur and S Vassilvitskii ldquoK-Means++ the advantages ofcareful seedingrdquo in Proceedings of the Eighteenth AnnualACM-SIAM Symposium on Discrete Algorithms pp 7ndash9 NewOrleans LA USA 2007

[13] Y Zhao W Halang and X Wang ldquoRough ontology mappingin E-business integrationrdquo E-Service Intelligence BMC Bioinfvol 8 pp 75ndash93 2007

[14] Y Xiao and J Yu ldquoSemi-supervised clustering based on af-finity propagaiton algorithmrdquo ACM Transactions onKnowledge Discovery from Data vol 1 no 1 2007

[15] M Ester H Kriegel J Sander and X Xu ldquoA density-basedalgorithm for discovering clusters in large spatial databaseswith noiserdquo in Proceedings of the Second International Con-ference On Knowledge Discovery and Data Mining pp 226ndash231 Portland OR USA 1996

[16] R J G B Campello D Moulavi and J Sander ldquoDensity-based clustering based on hierarchical density estimatesrdquoAdvances in Knowledge Discovery and Data Mining vol 7819pp 160ndash172 2013

[17] Z Liang and P Chen ldquoDelta-density based clustering with adivide-and-conquer strategy 3DC clusteringrdquo Pattern Rec-ognition Letters vol 73 pp 52ndash59 2016

[18] M Ankerst M M Breuning H P Kriegel and J SanderldquoOPTICS ordering points to identify the clustering struc-turerdquo in Proceedings of the 1999 ACM SIGMOD-InternationalConference on Management of Data pp 49ndash60 PhiladelphiaPA USA 1999

[19] M Du S Ding and H Jia ldquoStudy on density peaks clusteringbased on k-nearest neighbors and principal componentanalysisrdquo Knowledge-Based Systems vol 99 pp 135ndash1452016

[20] A Rodriguez and A Laio ldquoClustering by fast search and findof density peaksrdquo Science vol 344 no 6191 pp 1492ndash14962014

[21] T Li H W Ge and S Z Su ldquoDensity peaks clustering byautomatic determination of cluster centersrdquo Journal ofComputer Science and Technology vol 10 no 11 pp 1614ndash1622 2016

[22] R Liu H Wang and X Yu ldquoShared-nearest-neighbor-basedclustering by fast search and find of density peaksrdquo Infor-mation Sciences vol 450 pp 200ndash226 2018

[23] R A Jarvis and E A Patrick ldquoClustering using a similaritymeasure based on shared near neighborsrdquo IEEE Transactionson Computers vol C-22 no 11 pp 1025ndash1034 1973

[24] H Chang and D-Y Yeung ldquoRobust path-based spectralclusteringrdquo Pattern Recognition vol 41 no 1 pp 191ndash2032008

[25] L Fu and E Medico ldquoFlame a novel fuzzy clustering methodfor the analysis of DNA microarray datardquo BMC Bio-informatics vol 8 no 1 2007

[26] A Gionis H Mannila and P Tsaparas ldquoClustering aggre-gationrdquo ACM Transactions on Knowledge Discovery fromData vol 1 no 1 p 4 2007

[27] P Franti O Virmajoki and V Hautamaki ldquoFast agglom-erative clustering using a k-nearest neighbor graphrdquo IEEETransactions on Pattern Analysis and Machine Intelligencevol 28 no 11 pp 1875ndash1881 2006

[28] F S Samaria and A C Harter ldquoParameterisation of a sto-chastic model for human face identificationrdquo in Proceedings ofthe 1994 IEEEWorkshop On Applications Of Computer Visionpp 138ndash142 Sarasota FL USA 1994

[29] K Bache and M Lichman UCI Machine Learning Repositoryhttparchiveicsucieduml 2013

[30] M Charytanowicz J Niewczas P Kulczycki P A KowalskiS Lukasik and S Zak ldquoComplete gradient clustering algo-rithm for features analysis of X-ray imagesrdquo InformationTechnologies in biomedicine Advances in Intelligent and SoftComputing vol 69 Berlin Germany Springer

[31] D B Dias R C B Madeo T Rocha H H Biscaro andS M Peres ldquoHand movement recognition for brazilian signlanguage a study using distance-based neural networksrdquo inProceedings of the 2009 International Joint Conference onNeural Networks pp 697ndash704 Atlanta GA USA 2009

[32] L Breiman J Friedman C J Stone and R A OlshenClassification and Regression Trees Routledge New York NYUSA 1st edition 1984

[33] V G Sigillito S P Wing L V Hutton and K B BakerldquoClassification of radar returns from the ionosphere usingneural networksrdquo Johns Hopkins APL vol 10 no 3pp 262ndash266 1989

[34] W N Street W H Wolberg and O L MangasarianldquoNuclear feature extraction for breast tumor diagnosisrdquo inProceedings of the SPIE 1905 Biomedical Image Processing andBiomedical Visualization San Jose CA USA 1993

[35] X V Nguyen J Epps and J Bailey ldquoInformation theoreticmeasures for clusterings comparison is a correction forchance necessaryrdquo in Proceedings of the ICML 2009 the 26thAnnual International Conference On Machine Learning SanMontreal Canada 2009

[36] J Han M Kamber and J Pei Data Mining Concepts andTechniques the Morgan Kaufmann Series in Data Manage-ment Systems Morgan Kaufmann Burlington MA USA 3rdedition 2011

Complexity 17

Page 10: ClusteringbyDetectingDensityPeaksandAssigningPointsby ... · FKNN-DPC [9] is given in (5) and (6), respectively: ρ i exp − 1 K X j∈knn(i) d2 ij ⎛⎝ ⎞⎠, (5) ρ i X j∈knn(i)

41Analysis of theExperimentalResults on SyntheticDatasetsIn this section the performance of DPC-SFSKNN DPC[20] DBSCAN [15] AP [8] FKNN-DPC [9] and K-means[10] is tested with six synthetic datasets given in Table 1+ese synthetic datasets are different in distribution andquantity Different data situations can be simulated tocompare the performance of six algorithms in differentsituations Table 3 shows AMI ARI ACC and ECAC of thefive clustering algorithms on the six comprehensive datasetswhere the best results are shown in bold and ldquomdashrdquo means novalue Figures 6ndash9 show the clustering results of DPC-SFSKNN DPC DBSCAN AP FKNN-DPC and K-meansbased on the Pathbased Flame Aggregation and Jaindatasets respectively +e five algorithms achieve the

optimal clustering on DIM512 and DIM1024 datasets sothat the clustering of the two datasets is not shown Since thecluster centers of DBSCAN are relatively random only thepositions of clustering centers of the other three algorithmsare marked

Figure 6 shows the results of the Pathbased datasetDPC-SFSKNN and FKNN-DPC can complete the clusteringof the Pathbased dataset correctly From Figures 6(b) 6(d)and 6(f) it can be seen that the clustering results of DPC APand K-means are similar +e clustering centers selected byDPC AP DPC-SFSKNN and FKNN-DPC are highlysimilar but the clustering results of DPC and AP are notsatisfactory For the DPC algorithm the low fault tolerancerate of its allocation strategy is the cause of this result A

Table 4: Comparison of ACC, AMI, and ARI benchmarks for 6 clustering algorithms on real-world datasets.

                 Iris                                 Libras movement
Algorithm        AMI     ARI     ACC     ECAC   Par    AMI     ARI     ACC     ECAC    Par
DPC-SFSKNN       0.896   0.901   0.962   3/3    6      0.547   0.368   0.510   10/15   8
DPC              0.812   0.827   0.926   3/3    2      0.535   0.304   0.438   9/15    0.5
DBSCAN           0.792   0.754   0.893   —      0149   0.412   0.183   0.385   —       0965
AP               0.764   0.775   0.911   3/3    6      0.364   0.267   0.453   10/15   25
FKNN-DPC         0.912   0.922   0.973   3/3    7      0.508   0.308   0.436   10/15   9
K-means          0.683   0.662   0.823   —      3      0.522   0.306   0.449   —       15

                 Wine                                 Parkinsons
DPC-SFSKNN       0.843   0.851   0.951   3/3    6      0.193   0.380   0.827   2/2     6
DPC              0.706   0.672   0.882   3/3    2      0.210   0.114   0.612   2/2     5
DBSCAN           0.612   0.643   0.856   —      04210  0.205   0.213   0.674   —       046
AP               0.592   0.544   0.781   3/3    6      0.142   0.127   0.669   2/2     15
FKNN-DPC         0.831   0.852   0.949   3/3    7      0.273   0.391   0.851   2/2     5
K-means          0.817   0.838   0.936   —      3      0.201   0.049   0.625   —       2

                 WDBC                                 Ionosphere
DPC-SFSKNN       0.432   0.516   0.857   2/2    6      0.361   0.428   0.786   3/2     7
DPC              0.002   −0.004  0.602   2/2    9      0.238   0.276   0.681   3/2     0.65
DBSCAN           0.397   0.538   0.862   —      0277   0.544   0.683   0.853   —       027
AP               0.598   0.461   0.854   2/2    40     0.132   0.168   0.706   2/2     15
FKNN-DPC         0.679   0.786   0.944   2/2    7      0.284   0.355   0.752   2/2     8
K-means          0.611   0.730   0.928   —      2      0.129   0.178   0.712   —       2

                 Segmentation                         Pima-Indians-diabetes
DPC-SFSKNN       0.665   0.562   0.746   6/7    6      0.037   0.083   0.652   2/2     6
DPC              0.650   0.550   0.684   6/7    3      0.033   0.075   0.647   2/2     4
DBSCAN           0.446   0.451   0.550   —      02510  0.028   0.041   0.577   —       0156
AP               0.405   0.436   0.554   7/7    25     0.045   0.089   0.629   3/2     35
FKNN-DPC         0.655   0.555   0.716   7/7    7      0.001   0.011   0.612   2/2     6
K-means          0.583   0.495   0.612   —      6      0.050   0.102   0.668   —       2

                 Seeds                                Dermatology
DPC-SFSKNN       0.753   0.786   0.919   3/3    7      0.862   0.753   0.808   7/6     6
DPC              0.727   0.760   0.918   3/3    2      0.611   0.514   0.703   4/6     2
DBSCAN           0.640   0.713   0.874   —      0178   0.689   0.690   0.815   —       073
AP               0.598   0.682   0.896   3/3    10     0.766   0.701   0.762   7/6     5
FKNN-DPC         0.759   0.790   0.924   3/3    8      0.847   0.718   0.768   7/6     7
K-means          0.671   0.705   0.890   —      3      0.796   0.680   0.702   —       6

                 Waveform                             Waveform (noise)
DPC-SFSKNN       0.355   0.382   0.725   3/3    5      0.267   0.288   0.651   3/3     6
DPC              0.320   0.269   0.586   3/3    0.5    0.104   0.095   0.502   3/3     0.3
DBSCAN           —       —       —       —      —      —       —       —       —       —
AP               —       —       —       —      —      —       —       —       —       —
FKNN-DPC         0.324   0.350   0.703   3/3    5      0.247   0.253   0.648   3/3     5
K-means          0.363   0.254   0.501   —      3      0.364   0.252   0.512   —       3


Figure 6: The clustering of Pathbased by 6 clustering algorithms: (a) DPC-SFSKNN, (b) DPC, (c) DBSCAN, (d) AP, (e) FKNN-DPC, and (f) K-means.

Figure 7: The clustering of Flame by 6 clustering algorithms: (a) DPC-SFSKNN, (b) DPC, (c) DBSCAN, (d) AP, (e) FKNN-DPC, and (f) K-means.

A high-density point allocation error will be transferred to low-density points, and the error propagation will seriously affect the clustering results. The AP and K-means algorithms are not good at dealing with irregular clusters: the two clusters in the middle are too attractive to the points on both sides of the semicircular cluster, which leads to clustering errors. DBSCAN can completely detect the semicircular cluster, but it incorrectly merges the semicircular cluster with the cluster on the left of the middle into one category and divides the cluster on the right of the middle into two clusters. The similarities between points and the manually prespecified parameters may severely affect the clustering. The DPC-SFSKNN and FKNN-DPC algorithms perform well on the Pathbased dataset. These improved algorithms, which take neighbor relationships into account, have a great advantage in handling such complex distributed datasets.

Figure 7 shows the results of the six algorithms on the Flame dataset. As shown in the figure, DPC-SFSKNN, DPC, FKNN-DPC, and DBSCAN can correctly detect the two clusters, while AP and K-means cannot cluster them completely correctly. Although AP can correctly identify the upper cluster and select an appropriate cluster center, the lower cluster is divided into two clusters. Both clusters produced by K-means are wrong. The clustering results in Figure 8 show that the DPC-SFSKNN, DPC, FKNN-DPC, and DBSCAN algorithms can detect the 7 clusters in the Aggregation dataset, but AP and K-means still cannot cluster correctly. DPC-SFSKNN, DPC, and FKNN-DPC can identify both the clusters and their centers. Although the cluster centers are not marked for DBSCAN, the number of clusters and the overall shape of each cluster are correct. The AP algorithm successfully finds the correct number of clusters, but it chooses two centers for one cluster, which divides that cluster into two. The clustering result of K-means is similar to that of AP.

The Jain dataset shown in Figure 9 consists of two semicircular clusters of different densities. As shown in the figure, the DPC-SFSKNN algorithm can completely cluster the two clusters with different densities. However, DPC, AP, FKNN-DPC, and K-means incorrectly assign the left end of the lower cluster to the upper cluster, and the cluster centers of DPC both lie on the lower cluster. Compared with that, the distribution of the cluster centers of AP is more reasonable. The DBSCAN algorithm can accurately identify the lower cluster, but it incorrectly splits the left end of the upper cluster into a new cluster, so that the upper cluster is divided into two clusters.

According to the benchmark data shown in Table 3, it is clear that the performance of DPC-SFSKNN is very effective among the six clustering algorithms, especially on the Jain dataset. Although DPC and FKNN-DPC perform better than DPC-SFSKNN on the Aggregation and Flame datasets, DPC-SFSKNN can find the correct cluster centers of the Aggregation dataset and complete the clustering task correctly.

4.2. Analysis of Experimental Results on Real-World Datasets. In this section, the performance of the six algorithms is again benchmarked according to AMI, ARI, ACC, and ECAC, and the clustering results are summarized in Table 4. Twelve real-world datasets are selected to test DPC-SFSKNN's ability to identify clusters on different datasets. The DBSCAN and AP algorithms cannot obtain effective clustering results on waveform and waveform (noise). The symbol "—" represents no result.

As shown in Table 4, in terms of the benchmarks AMI, ARI, and ACC, DPC-SFSKNN outperforms the five other algorithms on the Wine, Segmentation, and Libras movement datasets. At the same time, FKNN-DPC performs better than the other five algorithms on the Iris, Seeds, Parkinsons, and WDBC datasets. It can be seen that the overall performance of DPC-SFSKNN is slightly better than that of DPC on 11 datasets, the exception being Parkinsons; on Parkinsons, DPC-SFSKNN is slightly worse than DPC in AMI but better than DPC in ARI and ACC. Similarly, DPC-SFSKNN performs slightly better than FKNN-DPC on the eight datasets other than Iris, Parkinsons, WDBC, and Seeds, on which DPC-SFSKNN is slightly worse than FKNN-DPC in AMI, ARI, and ACC. DBSCAN obtains the best results on Ionosphere, K-means is the best on Pima-Indians-diabetes, and K-means is also the best in AMI on the waveform and waveform (noise) datasets. In general, the clustering results of DPC-SFSKNN on real-world datasets are satisfactory.

Figure 8: The clustering of Aggregation by 6 clustering algorithms: (a) DPC-SFSKNN, (b) DPC, (c) DBSCAN, (d) AP, (e) FKNN-DPC, and (f) K-means.

4.3. Experimental Analysis on the Olivetti Face Dataset. The Olivetti face dataset [28] is an image dataset widely used by machine learning algorithms. Its purpose is to test the clustering behavior of an algorithm without supervision, including determining the number of clusters in the database and the members of each cluster. The dataset contains 40 clusters, each of which has 10 different images. Because the number of images in each cluster (10) is small relative to the number of clusters (40), the reliability of the local density becomes low, which is a great challenge for density-based clustering algorithms. To further test the clustering performance of DPC-SFSKNN, experiments were performed on the Olivetti face database, and DPC-SFSKNN was compared with DPC, AP, DBSCAN, FKNN-DPC, and K-means.

The clustering results achieved by DPC-SFSKNN and DPC for the Olivetti face database are shown in Figure 10, where white squares represent the cluster centers. The 32 clusters found by DPC-SFSKNN in Figure 10(a) and the 20 clusters found by DPC in Figure 10(b) are displayed in different colors. Gray images indicate that the image is not assigned to any cluster. It can be seen from Figure 10(a) that the 32 cluster centers found by DPC-SFSKNN cover 29 clusters, and, as shown in Figure 10(b), the 20 cluster centers found by DPC are scattered over 19 clusters. Similar to DPC-SFSKNN, DPC may divide one cluster into two clusters. Because DPC-SFSKNN can find many more density peaks than DPC, it is more likely to identify one cluster as two different clusters. The same situation occurs with the FKNN-DPC algorithm.
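As an illustration of the experimental setup only, the following sketch loads the 400 Olivetti images (40 subjects with 10 images each), flattens them to vectors, and scores a clustering against the subject labels; fetch_olivetti_faces and KMeans are stand-ins assumed here to keep the example self-contained, since the DPC-SFSKNN implementation itself is not listed in this paper.

from sklearn.datasets import fetch_olivetti_faces
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

faces = fetch_olivetti_faces()            # 400 grayscale images of size 64 x 64
X, y_true = faces.data, faces.target      # X: (400, 4096), y_true: subject ids 0..39

# Any clustering method can be plugged in here; K-means with 40 clusters is a stand-in.
y_pred = KMeans(n_clusters=40, n_init=10, random_state=0).fit_predict(X)
print("AMI:", adjusted_mutual_info_score(y_true, y_pred))
print("ARI:", adjusted_rand_score(y_true, y_pred))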

Figure 9: The clustering of Jain by 6 clustering algorithms: (a) DPC-SFSKNN, (b) DPC, (c) DBSCAN, (d) AP, (e) FKNN-DPC, and (f) K-means.

However, the performance of FKNN-DPC is better than that of DPC-SFSKNN in AMI, ARI, ACC, and ECAC. From the data in Table 5, the clustering results of these algorithms are compared based on AMI, ARI, ACC, and ECAC. The performance of DPC-SFSKNN is slightly superior to that of the other four algorithms except FKNN-DPC.

4.4. Running Time. This section compares the time performance of DPC-SFSKNN with that of DPC, DBSCAN, AP, FKNN-DPC, and K-means on the real-world datasets. The time complexity of DPC-SFSKNN and DPC has been analyzed in Section 3.3.1: the time complexity of DPC is O(n²) and that of DPC-SFSKNN is O(kn²), where n is the size of the dataset. However, the time consumed by DPC mainly comes from calculating the local density and the relative distance of each point, while the time consumed by DPC-SFSKNN comes mainly from the calculation of the K-nearest neighbors and the division strategy for noncenter points. Table 6 lists the running time (in seconds) of the six algorithms on the real datasets. It can be seen that, although the time complexity of DPC-SFSKNN is approximately k times that of DPC, their execution times on actual datasets do not differ by a factor of k.
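To make the two dominant costs concrete, the following small sketch (an illustration under simple assumptions: Euclidean distance, brute-force computation, and a synthetic dataset of roughly the waveform size) times the O(n²) distance-and-density step used by DPC against the K-nearest-neighbor computation that dominates DPC-SFSKNN.

import time
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((5000, 21))                 # synthetic stand-in, roughly the waveform size

t0 = time.perf_counter()
D = cdist(X, X)                            # O(n^2) pairwise distances, as used by DPC
dc = np.percentile(D, 2)                   # an approximate 2% cutoff distance
rho = (D < dc).sum(axis=1) - 1             # cutoff-kernel local density (self excluded)
t1 = time.perf_counter()

knn = NearestNeighbors(n_neighbors=6).fit(X)
dist, idx = knn.kneighbors(X)              # neighbor lists for a weighted KNN graph
t2 = time.perf_counter()

print(f"distance/density step: {t1 - t0:.2f}s, KNN step: {t2 - t1:.2f}s")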

From Table 6, it can be found that, on a relatively small dataset, the running time of DPC-SFSKNN is about twice or more that of DPC, and the difference mainly comes from DPC-SFSKNN's allocation strategy. Although the computational load of the local densities grows very quickly with the size of a dataset, the time consumed by the allocation strategy in DPC-SFSKNN increases irregularly, depending on the distribution of the dataset. This leads to an irregular gap between the running times of DPC and DPC-SFSKNN.

FKNN-DPC has the same time and space complexity as DPC, but its running time is almost the same as that of DPC-SFSKNN, since it takes a lot of running time to compute the relationships among the K-nearest neighbors. The time complexity of DBSCAN and AP is approximately O(n²), and the parameters of both cannot be determined by simple methods.

Figure 10: The clustering of Olivetti by two clustering algorithms: (a) DPC-SFSKNN and (b) DPC.

Table 5: Performance comparison of algorithms by clustering criteria for the Olivetti face database.

Metric   DPC-SFSKNN   DPC     DBSCAN   AP      FKNN-DPC   K-means
ACC      0.786        0.665   0.648    0.763   0.818      0.681
AMI      0.792        0.728   0.691    0.737   0.832      0.742
ARI      0.669        0.560   0.526    0.619   0.714      0.585
ECAC     32/40        20/40   —        28/40   36/40      —
Par      6            0.5     064      21      4          40


When the dataset is relatively large, it is difficult to find their optimal parameters, which may be the reason that the two algorithms have no running results on the waveform datasets. The approximate time complexity of K-means is O(n), and Table 6 confirms its efficiency. K-means loses almost no accuracy while running very fast, which makes it a very popular clustering algorithm, but K-means cannot handle irregularly shaped data well.

Table 6: Running time (in seconds) of the 6 clustering algorithms on the UCI datasets.

Dataset                  DPC-SFSKNN   DPC     DBSCAN   AP      FKNN-DPC   K-means
Iris                     0.241        0.049   0.059    0.565   0.148      0.014
Wine                     0.238        0.048   0.098    0.832   0.168      0.013
WDBC                     0.484        0.092   0.884    6.115   0.464      0.018
Seeds                    0.244        0.050   0.122    0.973   0.164      0.014
Libras movement          0.602        0.068   0.309    3.016   2.602      0.075
Ionosphere               0.325        0.064   0.349    2.018   0.309      0.021
Segmentation             1.569        0.806   8.727    6.679   0.313      0.062
Dermatology              0.309        0.063   0.513    2.185   0.409      0.007
Pima-Indians-diabetes    0.792        0.126   2.018    9.709   0.892      0.009
Parkinsons               0.255        0.048   0.114    0.866   0.263      0.003
Waveform                 16.071       3.511   —        —       7.775      0.067
Waveform (noise)         17.571       3.784   —        —       7.525      0.109

5. Conclusions and Future Work

A new clustering algorithm based on the traditional DPC algorithm is proposed in this paper. The algorithm introduces a density peak search method that takes the surrounding neighbor information into account and develops a new allocation strategy to detect the true distribution of the dataset. The proposed clustering algorithm quickly searches for and finds the density peaks, that is, the cluster centers, of a dataset of any size and recognizes clusters of arbitrary shape and dimensionality. The algorithm is called DPC-SFSKNN, which means that it calculates the local density and the relative distance by using distance information between points and their neighbors to find the cluster centers, and the remaining points are then assigned using a similarity-first search. The search algorithm is based on the weighted KNN graph and finds the owner (cluster center) of each point. DPC-SFSKNN successfully addresses several issues arising from the clustering algorithm of Alex Rodriguez and Alessandro Laio [20], including its density metric and the potential issue hidden in its assignment strategy. The performance of DPC-SFSKNN was tested on several synthetic datasets, on real-world datasets from the UCI machine learning repository, and on the well-known Olivetti face database. The experimental results on these datasets demonstrate that DPC-SFSKNN is powerful in finding cluster centers and in recognizing clusters regardless of their shape, of the dimensionality of the space in which they are embedded, and of the size of the datasets, and that it is robust to outliers. It performs much better than the original DPC algorithm. However, the proposed algorithm has some limitations: the parameter K needs to be adjusted manually according to the dataset; the cluster centers still need to be selected manually by analyzing the decision graph (as in the DPC algorithm); and the allocation strategy improves the clustering accuracy but at additional time cost. How to improve the degree of automation and the allocation efficiency of the algorithm needs further research.
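For a concrete picture of the assignment step, the following simplified sketch is one possible reading of the similarity-first idea summarized above, not the authors' exact procedure: each noncenter point repeatedly moves to its most similar unvisited K-nearest neighbor until a cluster center is reached. The similarity weight 1/(1 + d) and the tie handling are assumptions made only for illustration.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def assign_by_similarity_first(X, center_idx, k=6):
    # Build a weighted KNN graph: edge weights are similarities derived from distances.
    n = X.shape[0]
    dist, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    sim, idx = 1.0 / (1.0 + dist[:, 1:]), idx[:, 1:]   # drop each point's self-neighbor

    labels = np.full(n, -1, dtype=int)
    for label, c in enumerate(center_idx):              # cluster centers are given
        labels[c] = label

    for p in range(n):
        if labels[p] >= 0:
            continue
        visited, cur = {p}, p
        while labels[cur] < 0:                           # walk until a center is reached
            order = np.argsort(-sim[cur])                # neighbors by decreasing similarity
            nxt = next((int(idx[cur][j]) for j in order
                        if int(idx[cur][j]) not in visited), None)
            if nxt is None:                              # dead end: leave the point unassigned
                break
            visited.add(nxt)
            cur = nxt
        if labels[cur] >= 0:
            labels[p] = labels[cur]
    return labels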

Data Availability

The synthetic datasets are cited at the relevant places within the text as references [23–27]. The real-world datasets are cited at the relevant places within the text as references [29–34]. The Olivetti face dataset is cited at the relevant places within the text as reference [28].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (6160303040 and 61433003), in part by the Yunnan Applied Basic Research Project of China (201701CF00037), and in part by the Yunnan Provincial Science and Technology Department Key Research Program (Engineering) (2018BA070).

Supplementary Materials

It includes the datasets used in the experiments in this paper. (Supplementary Materials)

References

[1] K. L. Liu, Y. L. Shang, Q. Ouyang, and W. D. Widanage, "A data-driven approach with uncertainty quantification for predicting future capacities and remaining useful life of lithium-ion battery," IEEE Transactions on Industrial Electronics, p. 1, 2020.

[2] X. P. Tang, K. L. Liu, X. Wang et al., "Model migration neural network for predicting battery aging trajectories," IEEE Transactions on Transportation Electrification, vol. 6, no. 2, pp. 363–374, 2020.

[3] X. Tang, K. Liu, X. Wang, B. Liu, F. Gao, and W. D. Widanage, "Real-time aging trajectory prediction using a base model-oriented gradient-correction particle filter for lithium-ion batteries," Journal of Power Sources, vol. 440, Article ID 227118, 2019.

[4] K. Liu, Y. Li, X. Hu, M. Lucu, and W. D. Widanage, "Gaussian process regression with automatic relevance determination kernel for calendar aging prediction of lithium-ion batteries," IEEE Transactions on Industrial Informatics, vol. 16, no. 6, pp. 3767–3777, 2020.

[5] K. Liu, X. Hu, Z. Wei, Y. Li, and Y. Jiang, "Modified Gaussian process regression models for cyclic capacity prediction of lithium-ion batteries," IEEE Transactions on Transportation Electrification, vol. 5, no. 4, pp. 1225–1236, 2019.

[6] L. Cai, J. Meng, D.-I. Stroe, G. Luo, and R. Teodorescu, "An evolutionary framework for lithium-ion battery state of health estimation," Journal of Power Sources, vol. 412, pp. 615–622, 2019.

[7] L. Cai, J. H. Meng, D. I. Stroe et al., "Multi-objective optimization of data-driven model for lithium-ion battery SOH estimation with short-term feature," IEEE Transactions on Power Electronics, p. 1, 2020.

[8] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972–976, 2007.

[9] J. Xie, H. Gao, W. Xie, X. Liu, and P. W. Grant, "Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors," Information Sciences, vol. 354, pp. 19–40, 2016.

[10] F. S. Samaria and A. C. Harter, "Some methods for classification and analysis of multivariate observations," in Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, Berkeley, CA, USA, 1967.

[11] S. Kant, T. L. Rao, and P. N. Sundaram, "An automatic and stable clustering algorithm," Pattern Recognition Letters, vol. 15, no. 6, pp. 543–549, 1994.

[12] D. Arthur and S. Vassilvitskii, "K-means++: the advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 7–9, New Orleans, LA, USA, 2007.

[13] Y. Zhao, W. Halang, and X. Wang, "Rough ontology mapping in E-business integration," E-Service Intelligence, vol. 8, pp. 75–93, 2007.

[14] Y. Xiao and J. Yu, "Semi-supervised clustering based on affinity propagation algorithm," ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, 2007.

[15] M. Ester, H. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 226–231, Portland, OR, USA, 1996.

[16] R. J. G. B. Campello, D. Moulavi, and J. Sander, "Density-based clustering based on hierarchical density estimates," Advances in Knowledge Discovery and Data Mining, vol. 7819, pp. 160–172, 2013.

[17] Z. Liang and P. Chen, "Delta-density based clustering with a divide-and-conquer strategy: 3DC clustering," Pattern Recognition Letters, vol. 73, pp. 52–59, 2016.

[18] M. Ankerst, M. M. Breuning, H. P. Kriegel, and J. Sander, "OPTICS: ordering points to identify the clustering structure," in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pp. 49–60, Philadelphia, PA, USA, 1999.

[19] M. Du, S. Ding, and H. Jia, "Study on density peaks clustering based on k-nearest neighbors and principal component analysis," Knowledge-Based Systems, vol. 99, pp. 135–145, 2016.

[20] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, 2014.

[21] T. Li, H. W. Ge, and S. Z. Su, "Density peaks clustering by automatic determination of cluster centers," Journal of Computer Science and Technology, vol. 10, no. 11, pp. 1614–1622, 2016.

[22] R. Liu, H. Wang, and X. Yu, "Shared-nearest-neighbor-based clustering by fast search and find of density peaks," Information Sciences, vol. 450, pp. 200–226, 2018.

[23] R. A. Jarvis and E. A. Patrick, "Clustering using a similarity measure based on shared near neighbors," IEEE Transactions on Computers, vol. C-22, no. 11, pp. 1025–1034, 1973.

[24] H. Chang and D.-Y. Yeung, "Robust path-based spectral clustering," Pattern Recognition, vol. 41, no. 1, pp. 191–203, 2008.

[25] L. Fu and E. Medico, "FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data," BMC Bioinformatics, vol. 8, no. 1, 2007.

[26] A. Gionis, H. Mannila, and P. Tsaparas, "Clustering aggregation," ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, p. 4, 2007.

[27] P. Franti, O. Virmajoki, and V. Hautamaki, "Fast agglomerative clustering using a k-nearest neighbor graph," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1875–1881, 2006.

[28] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 1994 IEEE Workshop on Applications of Computer Vision, pp. 138–142, Sarasota, FL, USA, 1994.

[29] K. Bache and M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, 2013.

[30] M. Charytanowicz, J. Niewczas, P. Kulczycki, P. A. Kowalski, S. Lukasik, and S. Zak, "Complete gradient clustering algorithm for features analysis of X-ray images," Information Technologies in Biomedicine, Advances in Intelligent and Soft Computing, vol. 69, Springer, Berlin, Germany.

[31] D. B. Dias, R. C. B. Madeo, T. Rocha, H. H. Biscaro, and S. M. Peres, "Hand movement recognition for Brazilian sign language: a study using distance-based neural networks," in Proceedings of the 2009 International Joint Conference on Neural Networks, pp. 697–704, Atlanta, GA, USA, 2009.

[32] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, Routledge, New York, NY, USA, 1st edition, 1984.

[33] V. G. Sigillito, S. P. Wing, L. V. Hutton, and K. B. Baker, "Classification of radar returns from the ionosphere using neural networks," Johns Hopkins APL Technical Digest, vol. 10, no. 3, pp. 262–266, 1989.

[34] W. N. Street, W. H. Wolberg, and O. L. Mangasarian, "Nuclear feature extraction for breast tumor diagnosis," in Proceedings of the SPIE 1905, Biomedical Image Processing and Biomedical Visualization, San Jose, CA, USA, 1993.

[35] X. V. Nguyen, J. Epps, and J. Bailey, "Information theoretic measures for clusterings comparison: is a correction for chance necessary?" in Proceedings of the ICML 2009, the 26th Annual International Conference on Machine Learning, Montreal, Canada, 2009.

[36] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, Burlington, MA, USA, 3rd edition, 2011.

Complexity 17

Page 11: ClusteringbyDetectingDensityPeaksandAssigningPointsby ... · FKNN-DPC [9] is given in (5) and (6), respectively: ρ i exp − 1 K X j∈knn(i) d2 ij ⎛⎝ ⎞⎠, (5) ρ i X j∈knn(i)

0

02

04

06

08

1

05 10

(a)

0

02

04

06

08

1

05 10

(b)

Figure 7 Continued

05 100

02

04

06

08

1

(a)

05 100

02

04

06

08

1

(b)

05 100

02

04

06

08

1

(c)

05 100

02

04

06

08

1

(d)

05 100

02

04

06

08

1

(e)

05 100

02

04

06

08

1

(f )

Figure 6+e clustering of Pathbased by 6 clustering algorithms (a) DPC-SFSKNN (b) DPC (c) DBSCAN (d) AP (e) FKNN-DPC and (f)K-means

Complexity 11

05 100

02

04

06

08

1

(c)

05 100

02

04

06

08

1

(d)

05 100

02

04

06

08

1

(e)

05 100

02

04

06

08

1

(f )

Figure 7 +e clustering of Flame by 6 clustering algorithms (a) DPC-SFSKNN (b) DPC (c) DBSCAN (d) AP (e) FKNN-DPC and(f) K-means

05 100

02

04

06

08

1

(a)

05 100

02

04

06

08

1

(b)

0

02

04

06

08

1

05 10

(c)

05 100

02

04

06

08

1

(d)

Figure 8 Continued

12 Complexity

high-density point allocation error will be transferred tolow-density points and the error propagation will seriouslyaffect the clustering results AP and K-means algorithms arenot good at dealing with irregular clusters +e two clustersin the middle are too attractive to the points on both sides ofthe semicircular cluster which leads to clustering errorsDBSCAN can completely detect the semicircular cluster butthe semicircular cluster and the cluster on the left of themiddle are incorrectly classified into one category and thecluster on the right of the middle is divided into two clusters+e similarities between points and manually prespecifiedparameters may severely affect the clustering DPC-SFSKNNand FKNN-DPC algorithms perform well on the Pathbaseddataset +ese improved algorithms that consider neighborrelationships have a great advantage in handling suchcomplex distributed datasets

Figure 7 shows the results of four algorithms on theFlame dataset As shown in the figure DPC-SFSKNN DPCFKNN-DPC and DBSCAN can correctly detect two clusterswhile AP and K-means cannot completely correct clusteringAlthough AP can correctly identify higher clusters and selectthe appropriate cluster center the lower cluster is dividedinto two clusters Both clusters of K-means are wrong +eclustering results in Figure 8 show that the DPC-SFSKNNDPC FKNN-DPC and DBSCAN algorithms can detect 7clusters in the Aggregation dataset but AP and K-means stillcannot cluster correctly DPC-SFSKNN DPC and FKNN-DPC can identify clusters and centers Although the clustercenters are not marked for DBSCAN the number of clustersand the overall shape of each cluster are correct +e APalgorithm successfully found the correct number of clustersbut it chose two centers for one cluster which divided thecluster into two clusters +e clustering result of K-means issimilar to that of AP

+e Jain dataset shown in Figure 9 is a dataset consistingof two semicircular clusters of different densities As shownin the figure the DPC-SFSKNN algorithm can completelycluster two clusters with different densities However DPCAP FKNN-DPC and K-means incorrectly assign the leftend of the lower cluster to the higher cluster and the clustercenters of the DPC are all on the lower cluster Compared

with that the distribution of the cluster centers of the AP ismore reasonable For the DBSCAN algorithm it can ac-curately identify lower clusters but the left end of the highercluster is incorrectly divided into a new cluster so that thehigher cluster is divided into two clusters

According to the benchmark data shown in Table 3 it isclear that the performance of DPC-SFSKNN is very effectiveamong the six clustering algorithms especially in the Jaindataset Although DPC and FKNN-DPC perform betterthan DPC-SFSKNN on Aggregation and Flame datasetsDPC-SFSKNN can find the correct clustering center of theaggregation and can complete the clustering task correctly

42 Analysis of Experimental Results on Real-World DatasetsIn this section the performance of the five algorithms is stillbenchmarked according to AMI ARI ACC and ECACand the clustering results are summarized in Table 4 12 real-world datasets are selected to test DPC-SFSKNNrsquos ability toidentify clusters on different datasets DBSCAN and APalgorithm cannot get effective clustering results on wave-form and waveform (noise) +e symbol ldquomdashrdquo represents noresult

As shown in Table 4 in terms of benchmarks AMI ARIand ACC DPC-SFSKNN outperforms all five other algo-rithms on the Wine Segmentation and Libras movementdatasets At the same time FKNN-DPC performs better thanthe other five algorithms on the Iris Seeds Parkinsons andWDBC datasets It can be seen that the overall performanceof DPC-SFSKNN is slightly better than DPC on 11 datasetsexcept for Parkinsons On the Parkinsons DPC-SFSKNN isslightly worse than DPC in AMI but better than DPC in ARIand ACC Similarly DPC-SFSKNN had a slightly betterperformance in addition to Iris Parkinsons WDBC andSeeds of eight sets of data in FKNN-DPC and DPC-SFSKNN is slightly worse than DPC in AMI ARI and ACCDBSCAN gets the best results on the Ionosphere K-means isthe best on Pima-Indians-diabetes and K-means is the bestin AMI on waveform and waveform (noise) datasets Ingeneral the clustering results of DPC-SFSKNN in real-worlddatasets are satisfactory

0

02

04

06

08

1

05 10

(e)

05 100

02

04

06

08

1

(f )

Figure 8+e clustering of Aggregation by 6 clustering algorithms (a) DPC-SFSKNN (b) DPC (c) DBSCAN (d) AP (e) FKNN-DPC and(f) K-means

Complexity 13

43 Experimental Analysis of Olivetti Face DatasetOlivetti face dataset [28] is an image dataset widely used bymachine learning algorithms Its purpose is to test theclustering situation of the algorithm without supervisionincluding determining the number of clusters in the data-base and the members of each cluster +e dataset contains40 clusters each of which has 10 different images Becausethe actual number of clusters (40 different clusters) is equalto the number of elements in the dataset (10 different imageseach cluster) the reliability of local density becomes smallerwhich is a great challenge for density-based clustering al-gorithms To further test the clustering performance ofDPC-SFSKNN DPC-SFSKNN performed experiments onthe Olivetti face database and compared it with DPC APDBSCAN FKNN-DPC and K-means

+e clustering results achieved by DPC-SFSKNN andDPC for the Olivetti face database are shown in Figure 10and white squares represent the cluster centers +e 32clusters corresponding to DPC-SFSKNN found inFigure 10(a) and the 20 clusters found by DPC inFigure 10(b) are displayed in different colors Gray imagesindicate that the image is not assigned to any cluster It canbe seen from Figure 10(a) that DPC-SFSKNN found that the32 cluster centers were covered 29 clusters and as shown inFigure 10(b) the 20 cluster centers found by DPC werescattered in 19 clusters Similar to DPC-SFSKNN DPC maydivide one cluster into two clusters Because DPC-SFSKNNcan find much more density peaks than DPC it is morelikely to identify a cluster as two different clusters +e samesituation occurs with the FKNN-DPC algorithm However

0

02

04

06

08

1

05 10

(a)

05 100

02

04

06

08

1

(b)

0

02

04

06

08

1

05 10

(c)

0

02

04

06

08

1

05 10

(d)

0

02

04

06

08

1

05 10

(e)

0

02

04

06

08

1

05 10

(f )

Figure 9 +e clustering of Jain by 6 clustering algorithms (a) DPC-SFSKNN (b) DPC (c) DBSCAN (d) AP (e) FKNN-DPC and(f) K-means

14 Complexity

the performance of FKNN-DPC is better than that of DPC-SFSKNN in AMI ARI ACC and ECAC From the data inTable 5 based on AMI ARI ACC and ECAC the clus-tering results of these algorithms are compared +e per-formance of DPC-SFSKNNC is slightly superior to theperformance of the other four algorithms except FKNN-DPC

44RunningTime +is section shows the comparison of thetime performance of DPC-SFSKNN with DPC DBSCANAP FKNN-DPC and K-means on real-world datasets +etime complexity of DPC-SFSKNN and DPC has been an-alyzed in Section 331 and the time complexity of DPC isO(n2) and the time complexity of DPC-SFSKNN is O(kn2)where n is the size of the dataset However the time con-sumed by DPC mainly comes from calculating the localdensity and the relative distance of each point while the timeconsumed by DPC-SFSKNN comes mainly from the cal-culation of K-nearest neighbors and the division strategy ofnoncenter points Table 6 lists the running time (in seconds)

of the six algorithms on the real datasets It can be seen thatalthough the time complexity of DPC-SFSKNN is approx-imately k times that of DPC their execution time on actualdatasets is not k times

In Table 6 it can be found that on a relatively smalldataset the running time of DPC-SFSKNN is about twice ormore times that of DPC and the difference mainly comesfrom DPC-SFSKNNrsquos allocation strategy Although thecomputational load of the local densities for points growsvery quickly with the size of a dataset the time consumed bythe allocation strategy in DPC-SFSKNN increases randomlywith the distribution of a dataset +is leads to an irregulargap between the running time of DPC and DPC-SFSKNN

FKNN-DPC has the same time and space complexity asDPC but the running time is almost the same as DPC-SFSKNN It takes a lot of running time to calculate therelationship between K-nearest neighbors +e time com-plexity of DBSCAN and AP is approximate to O(n2) and theparameter selection of both cannot be determined by simplemethods When the dataset is relatively large it is difficult tofind their optimal parameters which may be the reason that

(a) (b)

Figure 10 +e clustering of Olivetti by two clustering algorithms (a) DPC-SFSKNN and (b) DPC

Table 5 Performance comparison of algorithms by clustering criteria for the Olivetti face database

Metric DPC-SFSKNN DPC DBSCAN AP FKNN-DPC K-meansACC 0786 0665 0648 0763 0818 0681AMI 0792 0728 0691 0737 0832 0742ARI 0669 0560 0526 0619 0714 0585ECAC 3240 2040 mdash 2840 3640 mdashPar 6 05 064 21 4 40

Complexity 15

the two algorithms have no running results on the waveformdataset +e approximate time complexity of K-means isO(n) and Table 6 proves its efficiency K-means has almostno loss of accuracy under the premise of fast speed whichmakes it a very popular clustering algorithm but K-means isnot sensitive to irregularly shaped data

5 Conclusions and Future Work

A new clustering algorithm is proposed based on the tra-ditional DPC algorithm in this paper +is algorithm pro-poses a density peak search algorithm that takes into accountthe surrounding neighbor information and develops a newallocation strategy to detect the true distribution of thedataset +e proposed clustering algorithm performs fastsearch finds density peaks say cluster centers of a dataset ofany size and recognizes clusters with any arbitrary shape ordimensionality +e algorithm is called DPC-SFSKNNwhich means that it calculates the local density and therelative distance by using some distance information be-tween points and neighbors to find the cluster center andthen the remaining points are assigned using similarity-first+e search algorithm is based on the weighted KNN graph tofind the owner (clustering center) of the point +e DPC-SFSKNN successfully addressed several issues arising fromthe clustering algorithm of Alex Rodriguez and AlessandroLaio [20] including its density metric and the potential issuehidden in its assignment strategy +e performance of DPC-SFSKNN was tested on several synthetic datasets and thereal-word datasets from the UCI machine learning reposi-tory and the well-known Olivetti face database +e ex-perimental results on these datasets demonstrate that ourDPC-SFSKNN is powerful in finding cluster centers and inrecognizing clusters regardless of their shape and of thedimensionality of the space in which they are embedded andof the size of the datasets and is robust to outliers It per-forms much better than the original algorithm DPCHowever the proposed algorithm has some limitations theparameter K needs to be manually adjusted according todifferent datasets the clustering centers still need to bemanually selected by analyzing the decision graph (like theDPC algorithm) the allocation strategy improves theclustering accuracy but takes time and cost How to improve

the degree of automation and allocation efficiency of thealgorithm needs further research

Data Availability

+e synthetic datasets are cited at relevant places within thetext as references [23ndash27] +e real-world datasets are citedat relevant places within the text as references [29ndash34] +eOlivetti face dataset is cited at relevant places within the textas references [28]

Conflicts of Interest

+e authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

+is work was supported in part by the National NaturalScience Foundation of China (6160303040 and 61433003) inpart by the Yunnan Applied Basic Research Project of China(201701CF00037) and in part by the Yunnan ProvincialScience and Technology Department Key Research Program(Engineering) (2018BA070)

Supplementary Materials

It includes the datasets used in the experiments in this paper(Supplementary Materials)

References

[1] K L Liu Y L Shang Q Ouyang and W D Widanage ldquoAdata-driven approach with uncertainty quantification forpredicting future capacities and remaining useful life oflithium-ion batteryrdquo IEEE Transactions on Industrial Elec-tronics p 1 2020

[2] X P Tang K L Liu X Wang et al ldquoModel migration neuralnetwork for predicting battery aging trajectoriesrdquo IEEETransactions on Transportation Electrification vol 6 no 2pp 363ndash374 2020

[3] X Tang K Liu XWang B Liu F Gao andW DWidanageldquoReal-time aging trajectory prediction using a base model-oriented gradient-correction particle filter for Lithium-ion

Table 6 Running time of 6 clustering algorithms in seconds on UCI datasets

Dataset DPC-SFSKNN DPC DBSCAN AP FKNN-DPC K-meansIris 0241 0049 0059 0565 0148 0014Wine 0238 0048 0098 0832 0168 0013WDBC 0484 0092 0884 6115 0464 0018Seeds 0244 0050 0122 0973 0164 0014Libras movement 0602 0068 0309 3016 2602 0075Ionosphere 0325 0064 0349 2018 0309 0021Segmentation 1569 0806 8727 6679 0313 0062Dermatology 0309 0063 0513 2185 0409 0007Pima-Indians-diabetes 0792 0126 2018 9709 0892 0009Parkinsons 0255 0048 0114 0866 0263 0003Waveform 16071 3511 mdash mdash 7775 0067Waveform (noise) 17571 3784 mdash mdash 7525 0109

16 Complexity

batteriesrdquo Journal of Power Sources vol 440 Article ID227118 2019

[4] K Liu Y Li X Hu M Lucu andW DWidanage ldquoGaussianprocess regression with automatic relevance determinationkernel for calendar aging prediction of lithium-ion batteriesrdquoIEEE Transactions on Industrial Informatics vol 16 no 6pp 3767ndash3777 2020

[5] K Liu X Hu Z Wei Y Li and Y Jiang ldquoModified Gaussianprocess regression models for cyclic capacity prediction oflithium-ion batteriesrdquo IEEE Transactions on TransportationElectrification vol 5 no 4 pp 1225ndash1236 2019

[6] L Cai J Meng D-I Stroe G Luo and R Teodorescu ldquoAnevolutionary framework for lithium-ion battery state of healthestimationrdquo Journal of Power Sources vol 412 pp 615ndash6222019

[7] L Cai J H Meng D I Stroe et al ldquoMulti-objective opti-mization of data-driven model for lithium-ion battery SOHestimation with short-term featurerdquo IEEE Transactions onPower Electronics p 1 2020

[8] B J Frey and D Dueck ldquoClustering by passing messagesbetween data pointsrdquo Science vol 315 no 5814 pp 972ndash9762007

[9] J Xie H Gao W Xie X Liu and P W Grant ldquoRobustclustering by detecting density peaks and assigning pointsbased on fuzzy weighted K-nearest neighborsrdquo InformationSciences vol 354 pp 19ndash40 2016

[10] F S Samaria and A C Harter ldquoSome methods for classifi-cation and analysis of multivariate observationsrdquo in Pro-ceedings of the Berkeley SymposiumOnMathematical Statisticsand Probability pp 281ndash297 Berkeley CA USA 1967

[11] S Kant T L Rao and P N Sundaram ldquoAn automatic andstable clustering algorithmrdquo Pattern Recognition Lettersvol 15 no 6 pp 543ndash549 1994

[12] D Arthur and S Vassilvitskii ldquoK-Means++ the advantages ofcareful seedingrdquo in Proceedings of the Eighteenth AnnualACM-SIAM Symposium on Discrete Algorithms pp 7ndash9 NewOrleans LA USA 2007

[13] Y Zhao W Halang and X Wang ldquoRough ontology mappingin E-business integrationrdquo E-Service Intelligence BMC Bioinfvol 8 pp 75ndash93 2007

[14] Y Xiao and J Yu ldquoSemi-supervised clustering based on af-finity propagaiton algorithmrdquo ACM Transactions onKnowledge Discovery from Data vol 1 no 1 2007

[15] M Ester H Kriegel J Sander and X Xu ldquoA density-basedalgorithm for discovering clusters in large spatial databaseswith noiserdquo in Proceedings of the Second International Con-ference On Knowledge Discovery and Data Mining pp 226ndash231 Portland OR USA 1996

[16] R J G B Campello D Moulavi and J Sander ldquoDensity-based clustering based on hierarchical density estimatesrdquoAdvances in Knowledge Discovery and Data Mining vol 7819pp 160ndash172 2013

[17] Z Liang and P Chen ldquoDelta-density based clustering with adivide-and-conquer strategy 3DC clusteringrdquo Pattern Rec-ognition Letters vol 73 pp 52ndash59 2016

[18] M Ankerst M M Breuning H P Kriegel and J SanderldquoOPTICS ordering points to identify the clustering struc-turerdquo in Proceedings of the 1999 ACM SIGMOD-InternationalConference on Management of Data pp 49ndash60 PhiladelphiaPA USA 1999

[19] M Du S Ding and H Jia ldquoStudy on density peaks clusteringbased on k-nearest neighbors and principal componentanalysisrdquo Knowledge-Based Systems vol 99 pp 135ndash1452016

[20] A Rodriguez and A Laio ldquoClustering by fast search and findof density peaksrdquo Science vol 344 no 6191 pp 1492ndash14962014

[21] T Li H W Ge and S Z Su ldquoDensity peaks clustering byautomatic determination of cluster centersrdquo Journal ofComputer Science and Technology vol 10 no 11 pp 1614ndash1622 2016

[22] R Liu H Wang and X Yu ldquoShared-nearest-neighbor-basedclustering by fast search and find of density peaksrdquo Infor-mation Sciences vol 450 pp 200ndash226 2018

[23] R A Jarvis and E A Patrick ldquoClustering using a similaritymeasure based on shared near neighborsrdquo IEEE Transactionson Computers vol C-22 no 11 pp 1025ndash1034 1973

[24] H Chang and D-Y Yeung ldquoRobust path-based spectralclusteringrdquo Pattern Recognition vol 41 no 1 pp 191ndash2032008

[25] L Fu and E Medico ldquoFlame a novel fuzzy clustering methodfor the analysis of DNA microarray datardquo BMC Bio-informatics vol 8 no 1 2007

[26] A Gionis H Mannila and P Tsaparas ldquoClustering aggre-gationrdquo ACM Transactions on Knowledge Discovery fromData vol 1 no 1 p 4 2007

[27] P Franti O Virmajoki and V Hautamaki ldquoFast agglom-erative clustering using a k-nearest neighbor graphrdquo IEEETransactions on Pattern Analysis and Machine Intelligencevol 28 no 11 pp 1875ndash1881 2006

[28] F S Samaria and A C Harter ldquoParameterisation of a sto-chastic model for human face identificationrdquo in Proceedings ofthe 1994 IEEEWorkshop On Applications Of Computer Visionpp 138ndash142 Sarasota FL USA 1994

[29] K Bache and M Lichman UCI Machine Learning Repositoryhttparchiveicsucieduml 2013

[30] M Charytanowicz J Niewczas P Kulczycki P A KowalskiS Lukasik and S Zak ldquoComplete gradient clustering algo-rithm for features analysis of X-ray imagesrdquo InformationTechnologies in biomedicine Advances in Intelligent and SoftComputing vol 69 Berlin Germany Springer

[31] D B Dias R C B Madeo T Rocha H H Biscaro andS M Peres ldquoHand movement recognition for brazilian signlanguage a study using distance-based neural networksrdquo inProceedings of the 2009 International Joint Conference onNeural Networks pp 697ndash704 Atlanta GA USA 2009

[32] L Breiman J Friedman C J Stone and R A OlshenClassification and Regression Trees Routledge New York NYUSA 1st edition 1984

[33] V G Sigillito S P Wing L V Hutton and K B BakerldquoClassification of radar returns from the ionosphere usingneural networksrdquo Johns Hopkins APL vol 10 no 3pp 262ndash266 1989

[34] W N Street W H Wolberg and O L MangasarianldquoNuclear feature extraction for breast tumor diagnosisrdquo inProceedings of the SPIE 1905 Biomedical Image Processing andBiomedical Visualization San Jose CA USA 1993

[35] X V Nguyen J Epps and J Bailey ldquoInformation theoreticmeasures for clusterings comparison is a correction forchance necessaryrdquo in Proceedings of the ICML 2009 the 26thAnnual International Conference On Machine Learning SanMontreal Canada 2009

[36] J Han M Kamber and J Pei Data Mining Concepts andTechniques the Morgan Kaufmann Series in Data Manage-ment Systems Morgan Kaufmann Burlington MA USA 3rdedition 2011

Complexity 17

Page 12: ClusteringbyDetectingDensityPeaksandAssigningPointsby ... · FKNN-DPC [9] is given in (5) and (6), respectively: ρ i exp − 1 K X j∈knn(i) d2 ij ⎛⎝ ⎞⎠, (5) ρ i X j∈knn(i)

05 100

02

04

06

08

1

(c)

05 100

02

04

06

08

1

(d)

05 100

02

04

06

08

1

(e)

05 100

02

04

06

08

1

(f )

Figure 7 +e clustering of Flame by 6 clustering algorithms (a) DPC-SFSKNN (b) DPC (c) DBSCAN (d) AP (e) FKNN-DPC and(f) K-means

05 100

02

04

06

08

1

(a)

05 100

02

04

06

08

1

(b)

0

02

04

06

08

1

05 10

(c)

05 100

02

04

06

08

1

(d)

Figure 8 Continued

12 Complexity

high-density point allocation error will be transferred tolow-density points and the error propagation will seriouslyaffect the clustering results AP and K-means algorithms arenot good at dealing with irregular clusters +e two clustersin the middle are too attractive to the points on both sides ofthe semicircular cluster which leads to clustering errorsDBSCAN can completely detect the semicircular cluster butthe semicircular cluster and the cluster on the left of themiddle are incorrectly classified into one category and thecluster on the right of the middle is divided into two clusters+e similarities between points and manually prespecifiedparameters may severely affect the clustering DPC-SFSKNNand FKNN-DPC algorithms perform well on the Pathbaseddataset +ese improved algorithms that consider neighborrelationships have a great advantage in handling suchcomplex distributed datasets

Figure 7 shows the results of four algorithms on theFlame dataset As shown in the figure DPC-SFSKNN DPCFKNN-DPC and DBSCAN can correctly detect two clusterswhile AP and K-means cannot completely correct clusteringAlthough AP can correctly identify higher clusters and selectthe appropriate cluster center the lower cluster is dividedinto two clusters Both clusters of K-means are wrong +eclustering results in Figure 8 show that the DPC-SFSKNNDPC FKNN-DPC and DBSCAN algorithms can detect 7clusters in the Aggregation dataset but AP and K-means stillcannot cluster correctly DPC-SFSKNN DPC and FKNN-DPC can identify clusters and centers Although the clustercenters are not marked for DBSCAN the number of clustersand the overall shape of each cluster are correct +e APalgorithm successfully found the correct number of clustersbut it chose two centers for one cluster which divided thecluster into two clusters +e clustering result of K-means issimilar to that of AP

+e Jain dataset shown in Figure 9 is a dataset consistingof two semicircular clusters of different densities As shownin the figure the DPC-SFSKNN algorithm can completelycluster two clusters with different densities However DPCAP FKNN-DPC and K-means incorrectly assign the leftend of the lower cluster to the higher cluster and the clustercenters of the DPC are all on the lower cluster Compared

with that the distribution of the cluster centers of the AP ismore reasonable For the DBSCAN algorithm it can ac-curately identify lower clusters but the left end of the highercluster is incorrectly divided into a new cluster so that thehigher cluster is divided into two clusters

According to the benchmark data shown in Table 3 it isclear that the performance of DPC-SFSKNN is very effectiveamong the six clustering algorithms especially in the Jaindataset Although DPC and FKNN-DPC perform betterthan DPC-SFSKNN on Aggregation and Flame datasetsDPC-SFSKNN can find the correct clustering center of theaggregation and can complete the clustering task correctly

42 Analysis of Experimental Results on Real-World DatasetsIn this section the performance of the five algorithms is stillbenchmarked according to AMI ARI ACC and ECACand the clustering results are summarized in Table 4 12 real-world datasets are selected to test DPC-SFSKNNrsquos ability toidentify clusters on different datasets DBSCAN and APalgorithm cannot get effective clustering results on wave-form and waveform (noise) +e symbol ldquomdashrdquo represents noresult

As shown in Table 4 in terms of benchmarks AMI ARIand ACC DPC-SFSKNN outperforms all five other algo-rithms on the Wine Segmentation and Libras movementdatasets At the same time FKNN-DPC performs better thanthe other five algorithms on the Iris Seeds Parkinsons andWDBC datasets It can be seen that the overall performanceof DPC-SFSKNN is slightly better than DPC on 11 datasetsexcept for Parkinsons On the Parkinsons DPC-SFSKNN isslightly worse than DPC in AMI but better than DPC in ARIand ACC Similarly DPC-SFSKNN had a slightly betterperformance in addition to Iris Parkinsons WDBC andSeeds of eight sets of data in FKNN-DPC and DPC-SFSKNN is slightly worse than DPC in AMI ARI and ACCDBSCAN gets the best results on the Ionosphere K-means isthe best on Pima-Indians-diabetes and K-means is the bestin AMI on waveform and waveform (noise) datasets Ingeneral the clustering results of DPC-SFSKNN in real-worlddatasets are satisfactory

0

02

04

06

08

1

05 10

(e)

05 100

02

04

06

08

1

(f )

Figure 8+e clustering of Aggregation by 6 clustering algorithms (a) DPC-SFSKNN (b) DPC (c) DBSCAN (d) AP (e) FKNN-DPC and(f) K-means

Complexity 13

43 Experimental Analysis of Olivetti Face DatasetOlivetti face dataset [28] is an image dataset widely used bymachine learning algorithms Its purpose is to test theclustering situation of the algorithm without supervisionincluding determining the number of clusters in the data-base and the members of each cluster +e dataset contains40 clusters each of which has 10 different images Becausethe actual number of clusters (40 different clusters) is equalto the number of elements in the dataset (10 different imageseach cluster) the reliability of local density becomes smallerwhich is a great challenge for density-based clustering al-gorithms To further test the clustering performance ofDPC-SFSKNN DPC-SFSKNN performed experiments onthe Olivetti face database and compared it with DPC APDBSCAN FKNN-DPC and K-means

+e clustering results achieved by DPC-SFSKNN andDPC for the Olivetti face database are shown in Figure 10and white squares represent the cluster centers +e 32clusters corresponding to DPC-SFSKNN found inFigure 10(a) and the 20 clusters found by DPC inFigure 10(b) are displayed in different colors Gray imagesindicate that the image is not assigned to any cluster It canbe seen from Figure 10(a) that DPC-SFSKNN found that the32 cluster centers were covered 29 clusters and as shown inFigure 10(b) the 20 cluster centers found by DPC werescattered in 19 clusters Similar to DPC-SFSKNN DPC maydivide one cluster into two clusters Because DPC-SFSKNNcan find much more density peaks than DPC it is morelikely to identify a cluster as two different clusters +e samesituation occurs with the FKNN-DPC algorithm However

0

02

04

06

08

1

05 10

(a)

05 100

02

04

06

08

1

(b)

0

02

04

06

08

1

05 10

(c)

0

02

04

06

08

1

05 10

(d)

0

02

04

06

08

1

05 10

(e)

0

02

04

06

08

1

05 10

(f )

Figure 9 +e clustering of Jain by 6 clustering algorithms (a) DPC-SFSKNN (b) DPC (c) DBSCAN (d) AP (e) FKNN-DPC and(f) K-means

14 Complexity

the performance of FKNN-DPC is better than that of DPC-SFSKNN in AMI ARI ACC and ECAC From the data inTable 5 based on AMI ARI ACC and ECAC the clus-tering results of these algorithms are compared +e per-formance of DPC-SFSKNNC is slightly superior to theperformance of the other four algorithms except FKNN-DPC

44RunningTime +is section shows the comparison of thetime performance of DPC-SFSKNN with DPC DBSCANAP FKNN-DPC and K-means on real-world datasets +etime complexity of DPC-SFSKNN and DPC has been an-alyzed in Section 331 and the time complexity of DPC isO(n2) and the time complexity of DPC-SFSKNN is O(kn2)where n is the size of the dataset However the time con-sumed by DPC mainly comes from calculating the localdensity and the relative distance of each point while the timeconsumed by DPC-SFSKNN comes mainly from the cal-culation of K-nearest neighbors and the division strategy ofnoncenter points Table 6 lists the running time (in seconds)

of the six algorithms on the real datasets It can be seen thatalthough the time complexity of DPC-SFSKNN is approx-imately k times that of DPC their execution time on actualdatasets is not k times

In Table 6 it can be found that on a relatively smalldataset the running time of DPC-SFSKNN is about twice ormore times that of DPC and the difference mainly comesfrom DPC-SFSKNNrsquos allocation strategy Although thecomputational load of the local densities for points growsvery quickly with the size of a dataset the time consumed bythe allocation strategy in DPC-SFSKNN increases randomlywith the distribution of a dataset +is leads to an irregulargap between the running time of DPC and DPC-SFSKNN

FKNN-DPC has the same time and space complexity asDPC but the running time is almost the same as DPC-SFSKNN It takes a lot of running time to calculate therelationship between K-nearest neighbors +e time com-plexity of DBSCAN and AP is approximate to O(n2) and theparameter selection of both cannot be determined by simplemethods When the dataset is relatively large it is difficult tofind their optimal parameters which may be the reason that

(a) (b)

Figure 10 +e clustering of Olivetti by two clustering algorithms (a) DPC-SFSKNN and (b) DPC

Table 5 Performance comparison of algorithms by clustering criteria for the Olivetti face database

Metric DPC-SFSKNN DPC DBSCAN AP FKNN-DPC K-meansACC 0786 0665 0648 0763 0818 0681AMI 0792 0728 0691 0737 0832 0742ARI 0669 0560 0526 0619 0714 0585ECAC 3240 2040 mdash 2840 3640 mdashPar 6 05 064 21 4 40

Complexity 15

the two algorithms have no running results on the waveformdataset +e approximate time complexity of K-means isO(n) and Table 6 proves its efficiency K-means has almostno loss of accuracy under the premise of fast speed whichmakes it a very popular clustering algorithm but K-means isnot sensitive to irregularly shaped data

5 Conclusions and Future Work

A new clustering algorithm is proposed based on the tra-ditional DPC algorithm in this paper +is algorithm pro-poses a density peak search algorithm that takes into accountthe surrounding neighbor information and develops a newallocation strategy to detect the true distribution of thedataset +e proposed clustering algorithm performs fastsearch finds density peaks say cluster centers of a dataset ofany size and recognizes clusters with any arbitrary shape ordimensionality +e algorithm is called DPC-SFSKNNwhich means that it calculates the local density and therelative distance by using some distance information be-tween points and neighbors to find the cluster center andthen the remaining points are assigned using similarity-first+e search algorithm is based on the weighted KNN graph tofind the owner (clustering center) of the point +e DPC-SFSKNN successfully addressed several issues arising fromthe clustering algorithm of Alex Rodriguez and AlessandroLaio [20] including its density metric and the potential issuehidden in its assignment strategy +e performance of DPC-SFSKNN was tested on several synthetic datasets and thereal-word datasets from the UCI machine learning reposi-tory and the well-known Olivetti face database +e ex-perimental results on these datasets demonstrate that ourDPC-SFSKNN is powerful in finding cluster centers and inrecognizing clusters regardless of their shape and of thedimensionality of the space in which they are embedded andof the size of the datasets and is robust to outliers It per-forms much better than the original algorithm DPCHowever the proposed algorithm has some limitations theparameter K needs to be manually adjusted according todifferent datasets the clustering centers still need to bemanually selected by analyzing the decision graph (like theDPC algorithm) the allocation strategy improves theclustering accuracy but takes time and cost How to improve

the degree of automation and allocation efficiency of thealgorithm needs further research

Data Availability

+e synthetic datasets are cited at relevant places within thetext as references [23ndash27] +e real-world datasets are citedat relevant places within the text as references [29ndash34] +eOlivetti face dataset is cited at relevant places within the textas references [28]

Conflicts of Interest

+e authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

+is work was supported in part by the National NaturalScience Foundation of China (6160303040 and 61433003) inpart by the Yunnan Applied Basic Research Project of China(201701CF00037) and in part by the Yunnan ProvincialScience and Technology Department Key Research Program(Engineering) (2018BA070)

Supplementary Materials

It includes the datasets used in the experiments in this paper(Supplementary Materials)

References

[1] K. L. Liu, Y. L. Shang, Q. Ouyang, and W. D. Widanage, "A data-driven approach with uncertainty quantification for predicting future capacities and remaining useful life of lithium-ion battery," IEEE Transactions on Industrial Electronics, p. 1, 2020.

[2] X. P. Tang, K. L. Liu, X. Wang et al., "Model migration neural network for predicting battery aging trajectories," IEEE Transactions on Transportation Electrification, vol. 6, no. 2, pp. 363–374, 2020.

[3] X. Tang, K. Liu, X. Wang, B. Liu, F. Gao, and W. D. Widanage, "Real-time aging trajectory prediction using a base model-oriented gradient-correction particle filter for lithium-ion batteries," Journal of Power Sources, vol. 440, Article ID 227118, 2019.

[4] K. Liu, Y. Li, X. Hu, M. Lucu, and W. D. Widanage, "Gaussian process regression with automatic relevance determination kernel for calendar aging prediction of lithium-ion batteries," IEEE Transactions on Industrial Informatics, vol. 16, no. 6, pp. 3767–3777, 2020.

[5] K. Liu, X. Hu, Z. Wei, Y. Li, and Y. Jiang, "Modified Gaussian process regression models for cyclic capacity prediction of lithium-ion batteries," IEEE Transactions on Transportation Electrification, vol. 5, no. 4, pp. 1225–1236, 2019.

[6] L. Cai, J. Meng, D.-I. Stroe, G. Luo, and R. Teodorescu, "An evolutionary framework for lithium-ion battery state of health estimation," Journal of Power Sources, vol. 412, pp. 615–622, 2019.

[7] L. Cai, J. H. Meng, D. I. Stroe et al., "Multi-objective optimization of data-driven model for lithium-ion battery SOH estimation with short-term feature," IEEE Transactions on Power Electronics, p. 1, 2020.

[8] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972–976, 2007.

[9] J. Xie, H. Gao, W. Xie, X. Liu, and P. W. Grant, "Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors," Information Sciences, vol. 354, pp. 19–40, 2016.

[10] F. S. Samaria and A. C. Harter, "Some methods for classification and analysis of multivariate observations," in Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, Berkeley, CA, USA, 1967.

[11] S. Kant, T. L. Rao, and P. N. Sundaram, "An automatic and stable clustering algorithm," Pattern Recognition Letters, vol. 15, no. 6, pp. 543–549, 1994.

[12] D. Arthur and S. Vassilvitskii, "K-means++: the advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 7–9, New Orleans, LA, USA, 2007.

[13] Y. Zhao, W. Halang, and X. Wang, "Rough ontology mapping in E-business integration," E-Service Intelligence, vol. 8, pp. 75–93, 2007.

[14] Y. Xiao and J. Yu, "Semi-supervised clustering based on affinity propagation algorithm," ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, 2007.

[15] M. Ester, H. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 226–231, Portland, OR, USA, 1996.

[16] R. J. G. B. Campello, D. Moulavi, and J. Sander, "Density-based clustering based on hierarchical density estimates," Advances in Knowledge Discovery and Data Mining, vol. 7819, pp. 160–172, 2013.

[17] Z. Liang and P. Chen, "Delta-density based clustering with a divide-and-conquer strategy: 3DC clustering," Pattern Recognition Letters, vol. 73, pp. 52–59, 2016.

[18] M. Ankerst, M. M. Breuning, H. P. Kriegel, and J. Sander, "OPTICS: ordering points to identify the clustering structure," in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pp. 49–60, Philadelphia, PA, USA, 1999.

[19] M. Du, S. Ding, and H. Jia, "Study on density peaks clustering based on k-nearest neighbors and principal component analysis," Knowledge-Based Systems, vol. 99, pp. 135–145, 2016.

[20] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, 2014.

[21] T. Li, H. W. Ge, and S. Z. Su, "Density peaks clustering by automatic determination of cluster centers," Journal of Computer Science and Technology, vol. 10, no. 11, pp. 1614–1622, 2016.

[22] R. Liu, H. Wang, and X. Yu, "Shared-nearest-neighbor-based clustering by fast search and find of density peaks," Information Sciences, vol. 450, pp. 200–226, 2018.

[23] R. A. Jarvis and E. A. Patrick, "Clustering using a similarity measure based on shared near neighbors," IEEE Transactions on Computers, vol. C-22, no. 11, pp. 1025–1034, 1973.

[24] H. Chang and D.-Y. Yeung, "Robust path-based spectral clustering," Pattern Recognition, vol. 41, no. 1, pp. 191–203, 2008.

[25] L. Fu and E. Medico, "FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data," BMC Bioinformatics, vol. 8, no. 1, 2007.

[26] A. Gionis, H. Mannila, and P. Tsaparas, "Clustering aggregation," ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, p. 4, 2007.

[27] P. Franti, O. Virmajoki, and V. Hautamaki, "Fast agglomerative clustering using a k-nearest neighbor graph," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1875–1881, 2006.

[28] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 1994 IEEE Workshop on Applications of Computer Vision, pp. 138–142, Sarasota, FL, USA, 1994.

[29] K. Bache and M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, 2013.

[30] M. Charytanowicz, J. Niewczas, P. Kulczycki, P. A. Kowalski, S. Lukasik, and S. Zak, "Complete gradient clustering algorithm for features analysis of X-ray images," Information Technologies in Biomedicine, Advances in Intelligent and Soft Computing, vol. 69, Springer, Berlin, Germany.

[31] D. B. Dias, R. C. B. Madeo, T. Rocha, H. H. Biscaro, and S. M. Peres, "Hand movement recognition for Brazilian sign language: a study using distance-based neural networks," in Proceedings of the 2009 International Joint Conference on Neural Networks, pp. 697–704, Atlanta, GA, USA, 2009.

[32] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, Routledge, New York, NY, USA, 1st edition, 1984.

[33] V. G. Sigillito, S. P. Wing, L. V. Hutton, and K. B. Baker, "Classification of radar returns from the ionosphere using neural networks," Johns Hopkins APL Technical Digest, vol. 10, no. 3, pp. 262–266, 1989.

[34] W. N. Street, W. H. Wolberg, and O. L. Mangasarian, "Nuclear feature extraction for breast tumor diagnosis," in Proceedings of SPIE 1905, Biomedical Image Processing and Biomedical Visualization, San Jose, CA, USA, 1993.

[35] X. V. Nguyen, J. Epps, and J. Bailey, "Information theoretic measures for clusterings comparison: is a correction for chance necessary?" in Proceedings of ICML 2009, the 26th Annual International Conference on Machine Learning, Montreal, Canada, 2009.

[36] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, Burlington, MA, USA, 3rd edition, 2011.



Page 14: ClusteringbyDetectingDensityPeaksandAssigningPointsby ... · FKNN-DPC [9] is given in (5) and (6), respectively: ρ i exp − 1 K X j∈knn(i) d2 ij ⎛⎝ ⎞⎠, (5) ρ i X j∈knn(i)

43 Experimental Analysis of Olivetti Face DatasetOlivetti face dataset [28] is an image dataset widely used bymachine learning algorithms Its purpose is to test theclustering situation of the algorithm without supervisionincluding determining the number of clusters in the data-base and the members of each cluster +e dataset contains40 clusters each of which has 10 different images Becausethe actual number of clusters (40 different clusters) is equalto the number of elements in the dataset (10 different imageseach cluster) the reliability of local density becomes smallerwhich is a great challenge for density-based clustering al-gorithms To further test the clustering performance ofDPC-SFSKNN DPC-SFSKNN performed experiments onthe Olivetti face database and compared it with DPC APDBSCAN FKNN-DPC and K-means

+e clustering results achieved by DPC-SFSKNN andDPC for the Olivetti face database are shown in Figure 10and white squares represent the cluster centers +e 32clusters corresponding to DPC-SFSKNN found inFigure 10(a) and the 20 clusters found by DPC inFigure 10(b) are displayed in different colors Gray imagesindicate that the image is not assigned to any cluster It canbe seen from Figure 10(a) that DPC-SFSKNN found that the32 cluster centers were covered 29 clusters and as shown inFigure 10(b) the 20 cluster centers found by DPC werescattered in 19 clusters Similar to DPC-SFSKNN DPC maydivide one cluster into two clusters Because DPC-SFSKNNcan find much more density peaks than DPC it is morelikely to identify a cluster as two different clusters +e samesituation occurs with the FKNN-DPC algorithm However

0

02

04

06

08

1

05 10

(a)

05 100

02

04

06

08

1

(b)

0

02

04

06

08

1

05 10

(c)

0

02

04

06

08

1

05 10

(d)

0

02

04

06

08

1

05 10

(e)

0

02

04

06

08

1

05 10

(f )

Figure 9 +e clustering of Jain by 6 clustering algorithms (a) DPC-SFSKNN (b) DPC (c) DBSCAN (d) AP (e) FKNN-DPC and(f) K-means

14 Complexity

the performance of FKNN-DPC is better than that of DPC-SFSKNN in AMI ARI ACC and ECAC From the data inTable 5 based on AMI ARI ACC and ECAC the clus-tering results of these algorithms are compared +e per-formance of DPC-SFSKNNC is slightly superior to theperformance of the other four algorithms except FKNN-DPC

44RunningTime +is section shows the comparison of thetime performance of DPC-SFSKNN with DPC DBSCANAP FKNN-DPC and K-means on real-world datasets +etime complexity of DPC-SFSKNN and DPC has been an-alyzed in Section 331 and the time complexity of DPC isO(n2) and the time complexity of DPC-SFSKNN is O(kn2)where n is the size of the dataset However the time con-sumed by DPC mainly comes from calculating the localdensity and the relative distance of each point while the timeconsumed by DPC-SFSKNN comes mainly from the cal-culation of K-nearest neighbors and the division strategy ofnoncenter points Table 6 lists the running time (in seconds)

of the six algorithms on the real datasets It can be seen thatalthough the time complexity of DPC-SFSKNN is approx-imately k times that of DPC their execution time on actualdatasets is not k times

In Table 6 it can be found that on a relatively smalldataset the running time of DPC-SFSKNN is about twice ormore times that of DPC and the difference mainly comesfrom DPC-SFSKNNrsquos allocation strategy Although thecomputational load of the local densities for points growsvery quickly with the size of a dataset the time consumed bythe allocation strategy in DPC-SFSKNN increases randomlywith the distribution of a dataset +is leads to an irregulargap between the running time of DPC and DPC-SFSKNN

FKNN-DPC has the same time and space complexity asDPC but the running time is almost the same as DPC-SFSKNN It takes a lot of running time to calculate therelationship between K-nearest neighbors +e time com-plexity of DBSCAN and AP is approximate to O(n2) and theparameter selection of both cannot be determined by simplemethods When the dataset is relatively large it is difficult tofind their optimal parameters which may be the reason that

(a) (b)

Figure 10 +e clustering of Olivetti by two clustering algorithms (a) DPC-SFSKNN and (b) DPC

Table 5 Performance comparison of algorithms by clustering criteria for the Olivetti face database

Metric DPC-SFSKNN DPC DBSCAN AP FKNN-DPC K-meansACC 0786 0665 0648 0763 0818 0681AMI 0792 0728 0691 0737 0832 0742ARI 0669 0560 0526 0619 0714 0585ECAC 3240 2040 mdash 2840 3640 mdashPar 6 05 064 21 4 40

Complexity 15

the two algorithms have no running results on the waveformdataset +e approximate time complexity of K-means isO(n) and Table 6 proves its efficiency K-means has almostno loss of accuracy under the premise of fast speed whichmakes it a very popular clustering algorithm but K-means isnot sensitive to irregularly shaped data

5 Conclusions and Future Work

A new clustering algorithm is proposed based on the tra-ditional DPC algorithm in this paper +is algorithm pro-poses a density peak search algorithm that takes into accountthe surrounding neighbor information and develops a newallocation strategy to detect the true distribution of thedataset +e proposed clustering algorithm performs fastsearch finds density peaks say cluster centers of a dataset ofany size and recognizes clusters with any arbitrary shape ordimensionality +e algorithm is called DPC-SFSKNNwhich means that it calculates the local density and therelative distance by using some distance information be-tween points and neighbors to find the cluster center andthen the remaining points are assigned using similarity-first+e search algorithm is based on the weighted KNN graph tofind the owner (clustering center) of the point +e DPC-SFSKNN successfully addressed several issues arising fromthe clustering algorithm of Alex Rodriguez and AlessandroLaio [20] including its density metric and the potential issuehidden in its assignment strategy +e performance of DPC-SFSKNN was tested on several synthetic datasets and thereal-word datasets from the UCI machine learning reposi-tory and the well-known Olivetti face database +e ex-perimental results on these datasets demonstrate that ourDPC-SFSKNN is powerful in finding cluster centers and inrecognizing clusters regardless of their shape and of thedimensionality of the space in which they are embedded andof the size of the datasets and is robust to outliers It per-forms much better than the original algorithm DPCHowever the proposed algorithm has some limitations theparameter K needs to be manually adjusted according todifferent datasets the clustering centers still need to bemanually selected by analyzing the decision graph (like theDPC algorithm) the allocation strategy improves theclustering accuracy but takes time and cost How to improve

the degree of automation and allocation efficiency of thealgorithm needs further research

Data Availability

+e synthetic datasets are cited at relevant places within thetext as references [23ndash27] +e real-world datasets are citedat relevant places within the text as references [29ndash34] +eOlivetti face dataset is cited at relevant places within the textas references [28]

Conflicts of Interest

+e authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

+is work was supported in part by the National NaturalScience Foundation of China (6160303040 and 61433003) inpart by the Yunnan Applied Basic Research Project of China(201701CF00037) and in part by the Yunnan ProvincialScience and Technology Department Key Research Program(Engineering) (2018BA070)

Supplementary Materials

It includes the datasets used in the experiments in this paper(Supplementary Materials)

References

[1] K L Liu Y L Shang Q Ouyang and W D Widanage ldquoAdata-driven approach with uncertainty quantification forpredicting future capacities and remaining useful life oflithium-ion batteryrdquo IEEE Transactions on Industrial Elec-tronics p 1 2020

[2] X P Tang K L Liu X Wang et al ldquoModel migration neuralnetwork for predicting battery aging trajectoriesrdquo IEEETransactions on Transportation Electrification vol 6 no 2pp 363ndash374 2020

[3] X Tang K Liu XWang B Liu F Gao andW DWidanageldquoReal-time aging trajectory prediction using a base model-oriented gradient-correction particle filter for Lithium-ion

Table 6 Running time of 6 clustering algorithms in seconds on UCI datasets

Dataset DPC-SFSKNN DPC DBSCAN AP FKNN-DPC K-meansIris 0241 0049 0059 0565 0148 0014Wine 0238 0048 0098 0832 0168 0013WDBC 0484 0092 0884 6115 0464 0018Seeds 0244 0050 0122 0973 0164 0014Libras movement 0602 0068 0309 3016 2602 0075Ionosphere 0325 0064 0349 2018 0309 0021Segmentation 1569 0806 8727 6679 0313 0062Dermatology 0309 0063 0513 2185 0409 0007Pima-Indians-diabetes 0792 0126 2018 9709 0892 0009Parkinsons 0255 0048 0114 0866 0263 0003Waveform 16071 3511 mdash mdash 7775 0067Waveform (noise) 17571 3784 mdash mdash 7525 0109

16 Complexity

batteriesrdquo Journal of Power Sources vol 440 Article ID227118 2019

[4] K Liu Y Li X Hu M Lucu andW DWidanage ldquoGaussianprocess regression with automatic relevance determinationkernel for calendar aging prediction of lithium-ion batteriesrdquoIEEE Transactions on Industrial Informatics vol 16 no 6pp 3767ndash3777 2020

[5] K Liu X Hu Z Wei Y Li and Y Jiang ldquoModified Gaussianprocess regression models for cyclic capacity prediction oflithium-ion batteriesrdquo IEEE Transactions on TransportationElectrification vol 5 no 4 pp 1225ndash1236 2019

[6] L Cai J Meng D-I Stroe G Luo and R Teodorescu ldquoAnevolutionary framework for lithium-ion battery state of healthestimationrdquo Journal of Power Sources vol 412 pp 615ndash6222019

[7] L Cai J H Meng D I Stroe et al ldquoMulti-objective opti-mization of data-driven model for lithium-ion battery SOHestimation with short-term featurerdquo IEEE Transactions onPower Electronics p 1 2020

[8] B J Frey and D Dueck ldquoClustering by passing messagesbetween data pointsrdquo Science vol 315 no 5814 pp 972ndash9762007

[9] J Xie H Gao W Xie X Liu and P W Grant ldquoRobustclustering by detecting density peaks and assigning pointsbased on fuzzy weighted K-nearest neighborsrdquo InformationSciences vol 354 pp 19ndash40 2016

[10] F S Samaria and A C Harter ldquoSome methods for classifi-cation and analysis of multivariate observationsrdquo in Pro-ceedings of the Berkeley SymposiumOnMathematical Statisticsand Probability pp 281ndash297 Berkeley CA USA 1967

[11] S Kant T L Rao and P N Sundaram ldquoAn automatic andstable clustering algorithmrdquo Pattern Recognition Lettersvol 15 no 6 pp 543ndash549 1994

[12] D Arthur and S Vassilvitskii ldquoK-Means++ the advantages ofcareful seedingrdquo in Proceedings of the Eighteenth AnnualACM-SIAM Symposium on Discrete Algorithms pp 7ndash9 NewOrleans LA USA 2007

[13] Y Zhao W Halang and X Wang ldquoRough ontology mappingin E-business integrationrdquo E-Service Intelligence BMC Bioinfvol 8 pp 75ndash93 2007

[14] Y Xiao and J Yu ldquoSemi-supervised clustering based on af-finity propagaiton algorithmrdquo ACM Transactions onKnowledge Discovery from Data vol 1 no 1 2007

[15] M Ester H Kriegel J Sander and X Xu ldquoA density-basedalgorithm for discovering clusters in large spatial databaseswith noiserdquo in Proceedings of the Second International Con-ference On Knowledge Discovery and Data Mining pp 226ndash231 Portland OR USA 1996

[16] R J G B Campello D Moulavi and J Sander ldquoDensity-based clustering based on hierarchical density estimatesrdquoAdvances in Knowledge Discovery and Data Mining vol 7819pp 160ndash172 2013

[17] Z Liang and P Chen ldquoDelta-density based clustering with adivide-and-conquer strategy 3DC clusteringrdquo Pattern Rec-ognition Letters vol 73 pp 52ndash59 2016

[18] M Ankerst M M Breuning H P Kriegel and J SanderldquoOPTICS ordering points to identify the clustering struc-turerdquo in Proceedings of the 1999 ACM SIGMOD-InternationalConference on Management of Data pp 49ndash60 PhiladelphiaPA USA 1999

[19] M Du S Ding and H Jia ldquoStudy on density peaks clusteringbased on k-nearest neighbors and principal componentanalysisrdquo Knowledge-Based Systems vol 99 pp 135ndash1452016

[20] A Rodriguez and A Laio ldquoClustering by fast search and findof density peaksrdquo Science vol 344 no 6191 pp 1492ndash14962014

[21] T Li H W Ge and S Z Su ldquoDensity peaks clustering byautomatic determination of cluster centersrdquo Journal ofComputer Science and Technology vol 10 no 11 pp 1614ndash1622 2016

[22] R Liu H Wang and X Yu ldquoShared-nearest-neighbor-basedclustering by fast search and find of density peaksrdquo Infor-mation Sciences vol 450 pp 200ndash226 2018

[23] R A Jarvis and E A Patrick ldquoClustering using a similaritymeasure based on shared near neighborsrdquo IEEE Transactionson Computers vol C-22 no 11 pp 1025ndash1034 1973

[24] H Chang and D-Y Yeung ldquoRobust path-based spectralclusteringrdquo Pattern Recognition vol 41 no 1 pp 191ndash2032008

[25] L Fu and E Medico ldquoFlame a novel fuzzy clustering methodfor the analysis of DNA microarray datardquo BMC Bio-informatics vol 8 no 1 2007

[26] A Gionis H Mannila and P Tsaparas ldquoClustering aggre-gationrdquo ACM Transactions on Knowledge Discovery fromData vol 1 no 1 p 4 2007

[27] P Franti O Virmajoki and V Hautamaki ldquoFast agglom-erative clustering using a k-nearest neighbor graphrdquo IEEETransactions on Pattern Analysis and Machine Intelligencevol 28 no 11 pp 1875ndash1881 2006

[28] F S Samaria and A C Harter ldquoParameterisation of a sto-chastic model for human face identificationrdquo in Proceedings ofthe 1994 IEEEWorkshop On Applications Of Computer Visionpp 138ndash142 Sarasota FL USA 1994

[29] K Bache and M Lichman UCI Machine Learning Repositoryhttparchiveicsucieduml 2013

[30] M Charytanowicz J Niewczas P Kulczycki P A KowalskiS Lukasik and S Zak ldquoComplete gradient clustering algo-rithm for features analysis of X-ray imagesrdquo InformationTechnologies in biomedicine Advances in Intelligent and SoftComputing vol 69 Berlin Germany Springer

[31] D B Dias R C B Madeo T Rocha H H Biscaro andS M Peres ldquoHand movement recognition for brazilian signlanguage a study using distance-based neural networksrdquo inProceedings of the 2009 International Joint Conference onNeural Networks pp 697ndash704 Atlanta GA USA 2009

[32] L Breiman J Friedman C J Stone and R A OlshenClassification and Regression Trees Routledge New York NYUSA 1st edition 1984

[33] V G Sigillito S P Wing L V Hutton and K B BakerldquoClassification of radar returns from the ionosphere usingneural networksrdquo Johns Hopkins APL vol 10 no 3pp 262ndash266 1989

[34] W N Street W H Wolberg and O L MangasarianldquoNuclear feature extraction for breast tumor diagnosisrdquo inProceedings of the SPIE 1905 Biomedical Image Processing andBiomedical Visualization San Jose CA USA 1993

[35] X V Nguyen J Epps and J Bailey ldquoInformation theoreticmeasures for clusterings comparison is a correction forchance necessaryrdquo in Proceedings of the ICML 2009 the 26thAnnual International Conference On Machine Learning SanMontreal Canada 2009

[36] J Han M Kamber and J Pei Data Mining Concepts andTechniques the Morgan Kaufmann Series in Data Manage-ment Systems Morgan Kaufmann Burlington MA USA 3rdedition 2011

Complexity 17

Page 15: ClusteringbyDetectingDensityPeaksandAssigningPointsby ... · FKNN-DPC [9] is given in (5) and (6), respectively: ρ i exp − 1 K X j∈knn(i) d2 ij ⎛⎝ ⎞⎠, (5) ρ i X j∈knn(i)

the performance of FKNN-DPC is better than that of DPC-SFSKNN in AMI ARI ACC and ECAC From the data inTable 5 based on AMI ARI ACC and ECAC the clus-tering results of these algorithms are compared +e per-formance of DPC-SFSKNNC is slightly superior to theperformance of the other four algorithms except FKNN-DPC

44RunningTime +is section shows the comparison of thetime performance of DPC-SFSKNN with DPC DBSCANAP FKNN-DPC and K-means on real-world datasets +etime complexity of DPC-SFSKNN and DPC has been an-alyzed in Section 331 and the time complexity of DPC isO(n2) and the time complexity of DPC-SFSKNN is O(kn2)where n is the size of the dataset However the time con-sumed by DPC mainly comes from calculating the localdensity and the relative distance of each point while the timeconsumed by DPC-SFSKNN comes mainly from the cal-culation of K-nearest neighbors and the division strategy ofnoncenter points Table 6 lists the running time (in seconds)

of the six algorithms on the real datasets It can be seen thatalthough the time complexity of DPC-SFSKNN is approx-imately k times that of DPC their execution time on actualdatasets is not k times

In Table 6 it can be found that on a relatively smalldataset the running time of DPC-SFSKNN is about twice ormore times that of DPC and the difference mainly comesfrom DPC-SFSKNNrsquos allocation strategy Although thecomputational load of the local densities for points growsvery quickly with the size of a dataset the time consumed bythe allocation strategy in DPC-SFSKNN increases randomlywith the distribution of a dataset +is leads to an irregulargap between the running time of DPC and DPC-SFSKNN

FKNN-DPC has the same time and space complexity asDPC but the running time is almost the same as DPC-SFSKNN It takes a lot of running time to calculate therelationship between K-nearest neighbors +e time com-plexity of DBSCAN and AP is approximate to O(n2) and theparameter selection of both cannot be determined by simplemethods When the dataset is relatively large it is difficult tofind their optimal parameters which may be the reason that

(a) (b)

Figure 10 +e clustering of Olivetti by two clustering algorithms (a) DPC-SFSKNN and (b) DPC

Table 5 Performance comparison of algorithms by clustering criteria for the Olivetti face database

Metric DPC-SFSKNN DPC DBSCAN AP FKNN-DPC K-meansACC 0786 0665 0648 0763 0818 0681AMI 0792 0728 0691 0737 0832 0742ARI 0669 0560 0526 0619 0714 0585ECAC 3240 2040 mdash 2840 3640 mdashPar 6 05 064 21 4 40

Complexity 15

the two algorithms have no running results on the waveformdataset +e approximate time complexity of K-means isO(n) and Table 6 proves its efficiency K-means has almostno loss of accuracy under the premise of fast speed whichmakes it a very popular clustering algorithm but K-means isnot sensitive to irregularly shaped data

5 Conclusions and Future Work

A new clustering algorithm is proposed based on the tra-ditional DPC algorithm in this paper +is algorithm pro-poses a density peak search algorithm that takes into accountthe surrounding neighbor information and develops a newallocation strategy to detect the true distribution of thedataset +e proposed clustering algorithm performs fastsearch finds density peaks say cluster centers of a dataset ofany size and recognizes clusters with any arbitrary shape ordimensionality +e algorithm is called DPC-SFSKNNwhich means that it calculates the local density and therelative distance by using some distance information be-tween points and neighbors to find the cluster center andthen the remaining points are assigned using similarity-first+e search algorithm is based on the weighted KNN graph tofind the owner (clustering center) of the point +e DPC-SFSKNN successfully addressed several issues arising fromthe clustering algorithm of Alex Rodriguez and AlessandroLaio [20] including its density metric and the potential issuehidden in its assignment strategy +e performance of DPC-SFSKNN was tested on several synthetic datasets and thereal-word datasets from the UCI machine learning reposi-tory and the well-known Olivetti face database +e ex-perimental results on these datasets demonstrate that ourDPC-SFSKNN is powerful in finding cluster centers and inrecognizing clusters regardless of their shape and of thedimensionality of the space in which they are embedded andof the size of the datasets and is robust to outliers It per-forms much better than the original algorithm DPCHowever the proposed algorithm has some limitations theparameter K needs to be manually adjusted according todifferent datasets the clustering centers still need to bemanually selected by analyzing the decision graph (like theDPC algorithm) the allocation strategy improves theclustering accuracy but takes time and cost How to improve

the degree of automation and allocation efficiency of thealgorithm needs further research

Data Availability

+e synthetic datasets are cited at relevant places within thetext as references [23ndash27] +e real-world datasets are citedat relevant places within the text as references [29ndash34] +eOlivetti face dataset is cited at relevant places within the textas references [28]

Conflicts of Interest

+e authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

+is work was supported in part by the National NaturalScience Foundation of China (6160303040 and 61433003) inpart by the Yunnan Applied Basic Research Project of China(201701CF00037) and in part by the Yunnan ProvincialScience and Technology Department Key Research Program(Engineering) (2018BA070)

Supplementary Materials

It includes the datasets used in the experiments in this paper(Supplementary Materials)

References

[1] K L Liu Y L Shang Q Ouyang and W D Widanage ldquoAdata-driven approach with uncertainty quantification forpredicting future capacities and remaining useful life oflithium-ion batteryrdquo IEEE Transactions on Industrial Elec-tronics p 1 2020

[2] X P Tang K L Liu X Wang et al ldquoModel migration neuralnetwork for predicting battery aging trajectoriesrdquo IEEETransactions on Transportation Electrification vol 6 no 2pp 363ndash374 2020

[3] X Tang K Liu XWang B Liu F Gao andW DWidanageldquoReal-time aging trajectory prediction using a base model-oriented gradient-correction particle filter for Lithium-ion

Table 6 Running time of 6 clustering algorithms in seconds on UCI datasets

Dataset DPC-SFSKNN DPC DBSCAN AP FKNN-DPC K-meansIris 0241 0049 0059 0565 0148 0014Wine 0238 0048 0098 0832 0168 0013WDBC 0484 0092 0884 6115 0464 0018Seeds 0244 0050 0122 0973 0164 0014Libras movement 0602 0068 0309 3016 2602 0075Ionosphere 0325 0064 0349 2018 0309 0021Segmentation 1569 0806 8727 6679 0313 0062Dermatology 0309 0063 0513 2185 0409 0007Pima-Indians-diabetes 0792 0126 2018 9709 0892 0009Parkinsons 0255 0048 0114 0866 0263 0003Waveform 16071 3511 mdash mdash 7775 0067Waveform (noise) 17571 3784 mdash mdash 7525 0109

16 Complexity

batteriesrdquo Journal of Power Sources vol 440 Article ID227118 2019

[4] K Liu Y Li X Hu M Lucu andW DWidanage ldquoGaussianprocess regression with automatic relevance determinationkernel for calendar aging prediction of lithium-ion batteriesrdquoIEEE Transactions on Industrial Informatics vol 16 no 6pp 3767ndash3777 2020

[5] K Liu X Hu Z Wei Y Li and Y Jiang ldquoModified Gaussianprocess regression models for cyclic capacity prediction oflithium-ion batteriesrdquo IEEE Transactions on TransportationElectrification vol 5 no 4 pp 1225ndash1236 2019

[6] L Cai J Meng D-I Stroe G Luo and R Teodorescu ldquoAnevolutionary framework for lithium-ion battery state of healthestimationrdquo Journal of Power Sources vol 412 pp 615ndash6222019

[7] L Cai J H Meng D I Stroe et al ldquoMulti-objective opti-mization of data-driven model for lithium-ion battery SOHestimation with short-term featurerdquo IEEE Transactions onPower Electronics p 1 2020

[8] B J Frey and D Dueck ldquoClustering by passing messagesbetween data pointsrdquo Science vol 315 no 5814 pp 972ndash9762007

[9] J Xie H Gao W Xie X Liu and P W Grant ldquoRobustclustering by detecting density peaks and assigning pointsbased on fuzzy weighted K-nearest neighborsrdquo InformationSciences vol 354 pp 19ndash40 2016

[10] F S Samaria and A C Harter ldquoSome methods for classifi-cation and analysis of multivariate observationsrdquo in Pro-ceedings of the Berkeley SymposiumOnMathematical Statisticsand Probability pp 281ndash297 Berkeley CA USA 1967

[11] S Kant T L Rao and P N Sundaram ldquoAn automatic andstable clustering algorithmrdquo Pattern Recognition Lettersvol 15 no 6 pp 543ndash549 1994

[12] D Arthur and S Vassilvitskii ldquoK-Means++ the advantages ofcareful seedingrdquo in Proceedings of the Eighteenth AnnualACM-SIAM Symposium on Discrete Algorithms pp 7ndash9 NewOrleans LA USA 2007

[13] Y Zhao W Halang and X Wang ldquoRough ontology mappingin E-business integrationrdquo E-Service Intelligence BMC Bioinfvol 8 pp 75ndash93 2007

[14] Y Xiao and J Yu ldquoSemi-supervised clustering based on af-finity propagaiton algorithmrdquo ACM Transactions onKnowledge Discovery from Data vol 1 no 1 2007

[15] M Ester H Kriegel J Sander and X Xu ldquoA density-basedalgorithm for discovering clusters in large spatial databaseswith noiserdquo in Proceedings of the Second International Con-ference On Knowledge Discovery and Data Mining pp 226ndash231 Portland OR USA 1996

[16] R J G B Campello D Moulavi and J Sander ldquoDensity-based clustering based on hierarchical density estimatesrdquoAdvances in Knowledge Discovery and Data Mining vol 7819pp 160ndash172 2013

[17] Z Liang and P Chen ldquoDelta-density based clustering with adivide-and-conquer strategy 3DC clusteringrdquo Pattern Rec-ognition Letters vol 73 pp 52ndash59 2016

[18] M Ankerst M M Breuning H P Kriegel and J SanderldquoOPTICS ordering points to identify the clustering struc-turerdquo in Proceedings of the 1999 ACM SIGMOD-InternationalConference on Management of Data pp 49ndash60 PhiladelphiaPA USA 1999

[19] M Du S Ding and H Jia ldquoStudy on density peaks clusteringbased on k-nearest neighbors and principal componentanalysisrdquo Knowledge-Based Systems vol 99 pp 135ndash1452016

[20] A Rodriguez and A Laio ldquoClustering by fast search and findof density peaksrdquo Science vol 344 no 6191 pp 1492ndash14962014

[21] T Li H W Ge and S Z Su ldquoDensity peaks clustering byautomatic determination of cluster centersrdquo Journal ofComputer Science and Technology vol 10 no 11 pp 1614ndash1622 2016

[22] R Liu H Wang and X Yu ldquoShared-nearest-neighbor-basedclustering by fast search and find of density peaksrdquo Infor-mation Sciences vol 450 pp 200ndash226 2018

[23] R A Jarvis and E A Patrick ldquoClustering using a similaritymeasure based on shared near neighborsrdquo IEEE Transactionson Computers vol C-22 no 11 pp 1025ndash1034 1973

[24] H Chang and D-Y Yeung ldquoRobust path-based spectralclusteringrdquo Pattern Recognition vol 41 no 1 pp 191ndash2032008

[25] L Fu and E Medico ldquoFlame a novel fuzzy clustering methodfor the analysis of DNA microarray datardquo BMC Bio-informatics vol 8 no 1 2007

[26] A Gionis H Mannila and P Tsaparas ldquoClustering aggre-gationrdquo ACM Transactions on Knowledge Discovery fromData vol 1 no 1 p 4 2007

[27] P Franti O Virmajoki and V Hautamaki ldquoFast agglom-erative clustering using a k-nearest neighbor graphrdquo IEEETransactions on Pattern Analysis and Machine Intelligencevol 28 no 11 pp 1875ndash1881 2006

[28] F S Samaria and A C Harter ldquoParameterisation of a sto-chastic model for human face identificationrdquo in Proceedings ofthe 1994 IEEEWorkshop On Applications Of Computer Visionpp 138ndash142 Sarasota FL USA 1994

[29] K Bache and M Lichman UCI Machine Learning Repositoryhttparchiveicsucieduml 2013

[30] M Charytanowicz J Niewczas P Kulczycki P A KowalskiS Lukasik and S Zak ldquoComplete gradient clustering algo-rithm for features analysis of X-ray imagesrdquo InformationTechnologies in biomedicine Advances in Intelligent and SoftComputing vol 69 Berlin Germany Springer

[31] D B Dias R C B Madeo T Rocha H H Biscaro andS M Peres ldquoHand movement recognition for brazilian signlanguage a study using distance-based neural networksrdquo inProceedings of the 2009 International Joint Conference onNeural Networks pp 697ndash704 Atlanta GA USA 2009

[32] L Breiman J Friedman C J Stone and R A OlshenClassification and Regression Trees Routledge New York NYUSA 1st edition 1984

[33] V G Sigillito S P Wing L V Hutton and K B BakerldquoClassification of radar returns from the ionosphere usingneural networksrdquo Johns Hopkins APL vol 10 no 3pp 262ndash266 1989

[34] W N Street W H Wolberg and O L MangasarianldquoNuclear feature extraction for breast tumor diagnosisrdquo inProceedings of the SPIE 1905 Biomedical Image Processing andBiomedical Visualization San Jose CA USA 1993

[35] X V Nguyen J Epps and J Bailey ldquoInformation theoreticmeasures for clusterings comparison is a correction forchance necessaryrdquo in Proceedings of the ICML 2009 the 26thAnnual International Conference On Machine Learning SanMontreal Canada 2009

[36] J Han M Kamber and J Pei Data Mining Concepts andTechniques the Morgan Kaufmann Series in Data Manage-ment Systems Morgan Kaufmann Burlington MA USA 3rdedition 2011

Complexity 17

Page 16: ClusteringbyDetectingDensityPeaksandAssigningPointsby ... · FKNN-DPC [9] is given in (5) and (6), respectively: ρ i exp − 1 K X j∈knn(i) d2 ij ⎛⎝ ⎞⎠, (5) ρ i X j∈knn(i)

the two algorithms have no running results on the waveformdataset +e approximate time complexity of K-means isO(n) and Table 6 proves its efficiency K-means has almostno loss of accuracy under the premise of fast speed whichmakes it a very popular clustering algorithm but K-means isnot sensitive to irregularly shaped data

5. Conclusions and Future Work

A new clustering algorithm based on the traditional DPC algorithm is proposed in this paper. The algorithm introduces a density-peak search that takes the surrounding neighbor information into account and a new allocation strategy that detects the true distribution of the dataset. The proposed clustering algorithm quickly searches for and finds the density peaks (i.e., the cluster centers) of a dataset of any size and recognizes clusters of arbitrary shape and dimensionality. The algorithm is called DPC-SFSKNN: it calculates the local density and the relative distance using the distance information between points and their neighbors to find the cluster centers, and the remaining points are then assigned by similarity-first search. The search algorithm operates on the weighted KNN graph to find the owner (cluster center) of each point. DPC-SFSKNN successfully addresses several issues arising from the clustering algorithm of Alex Rodriguez and Alessandro Laio [20], including its density metric and the potential issue hidden in its assignment strategy. The performance of DPC-SFSKNN was tested on several synthetic datasets, on real-world datasets from the UCI machine learning repository, and on the well-known Olivetti face database. The experimental results on these datasets demonstrate that DPC-SFSKNN is powerful in finding cluster centers and in recognizing clusters regardless of their shape, of the dimensionality of the space in which they are embedded, and of the size of the datasets, and that it is robust to outliers. It performs much better than the original DPC algorithm. However, the proposed algorithm has some limitations: the parameter K needs to be adjusted manually for different datasets; the cluster centers still need to be selected manually by analyzing the decision graph (as in the DPC algorithm); and the allocation strategy improves the clustering accuracy at the cost of extra running time. How to improve the degree of automation and the allocation efficiency of the algorithm needs further research.
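To make the pipeline summarized above concrete, the sketch below reproduces its overall structure in Python under simplifying assumptions; it is not the authors' exact DPC-SFSKNN implementation. In particular, it uses a plain KNN-kernel local density instead of the fused KNN/SNN definitions of the paper, it picks the cluster centers automatically as the largest products of local density and relative distance rather than manually from the decision graph, and it approximates the similarity-first search by a best-first expansion of the weighted KNN graph from the selected centers. The function name dpc_sfsknn_sketch, the default K, and the Euclidean edge weights are all illustrative assumptions.

# Simplified sketch of a "density peaks + similarity-first assignment" pipeline.
import heapq
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dpc_sfsknn_sketch(X, n_clusters, K=6):
    n = X.shape[0]
    nbrs = NearestNeighbors(n_neighbors=K + 1).fit(X)
    dist, idx = nbrs.kneighbors(X)          # column 0 is each point itself
    knn_d, knn_i = dist[:, 1:], idx[:, 1:]

    # Step 1: local density (plain KNN kernel; the paper fuses KNN and SNN here).
    rho = np.exp(-np.mean(knn_d ** 2, axis=1))

    # Step 2: relative distance = distance to the nearest point of higher density.
    order = np.argsort(-rho)
    delta = np.full(n, np.inf)
    for pos in range(1, n):
        i = order[pos]
        d_to_denser = np.linalg.norm(X[order[:pos]] - X[i], axis=1)
        delta[i] = d_to_denser.min()
    delta[order[0]] = delta[order[1:]].max()

    # Step 3: take the n_clusters largest rho*delta values as centers
    # (the paper selects centers manually from the decision graph).
    centers = np.argsort(-(rho * delta))[:n_clusters]

    # Step 4: similarity-first (best-first) search on the weighted KNN graph:
    # repeatedly label the unassigned point reachable over the cheapest path.
    labels = np.full(n, -1)
    heap = []
    for label, c in enumerate(centers):
        labels[c] = label
        for d, j in zip(knn_d[c], knn_i[c]):
            heapq.heappush(heap, (d, int(j), label))
    while heap:
        d, j, label = heapq.heappop(heap)
        if labels[j] != -1:
            continue
        labels[j] = label
        for d2, k in zip(knn_d[j], knn_i[j]):
            if labels[int(k)] == -1:
                heapq.heappush(heap, (d + d2, int(k), label))
    # Points not reachable from any center in the KNN graph keep the label -1.
    return labels, centers

On a small two-dimensional dataset, labels, centers = dpc_sfsknn_sketch(X, n_clusters=3, K=6) returns one label per point together with the indices of the chosen centers; K plays the same role as the neighborhood parameter whose manual tuning is listed among the limitations above.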

Data Availability

The synthetic datasets are cited at relevant places within the text as references [23–27]. The real-world datasets are cited at relevant places within the text as references [29–34]. The Olivetti face dataset is cited at the relevant place within the text as reference [28].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (6160303040 and 61433003), in part by the Yunnan Applied Basic Research Project of China (201701CF00037), and in part by the Yunnan Provincial Science and Technology Department Key Research Program (Engineering) (2018BA070).

Supplementary Materials

The supplementary materials include the datasets used in the experiments in this paper. (Supplementary Materials)

References

[1] K. L. Liu, Y. L. Shang, Q. Ouyang, and W. D. Widanage, "A data-driven approach with uncertainty quantification for predicting future capacities and remaining useful life of lithium-ion battery," IEEE Transactions on Industrial Electronics, p. 1, 2020.

[2] X. P. Tang, K. L. Liu, X. Wang et al., "Model migration neural network for predicting battery aging trajectories," IEEE Transactions on Transportation Electrification, vol. 6, no. 2, pp. 363–374, 2020.

[3] X. Tang, K. Liu, X. Wang, B. Liu, F. Gao, and W. D. Widanage, "Real-time aging trajectory prediction using a base model-oriented gradient-correction particle filter for lithium-ion batteries," Journal of Power Sources, vol. 440, Article ID 227118, 2019.

[4] K. Liu, Y. Li, X. Hu, M. Lucu, and W. D. Widanage, "Gaussian process regression with automatic relevance determination kernel for calendar aging prediction of lithium-ion batteries," IEEE Transactions on Industrial Informatics, vol. 16, no. 6, pp. 3767–3777, 2020.

[5] K. Liu, X. Hu, Z. Wei, Y. Li, and Y. Jiang, "Modified Gaussian process regression models for cyclic capacity prediction of lithium-ion batteries," IEEE Transactions on Transportation Electrification, vol. 5, no. 4, pp. 1225–1236, 2019.

[6] L. Cai, J. Meng, D.-I. Stroe, G. Luo, and R. Teodorescu, "An evolutionary framework for lithium-ion battery state of health estimation," Journal of Power Sources, vol. 412, pp. 615–622, 2019.

[7] L. Cai, J. H. Meng, D. I. Stroe et al., "Multi-objective optimization of data-driven model for lithium-ion battery SOH estimation with short-term feature," IEEE Transactions on Power Electronics, p. 1, 2020.

[8] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972–976, 2007.

[9] J. Xie, H. Gao, W. Xie, X. Liu, and P. W. Grant, "Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors," Information Sciences, vol. 354, pp. 19–40, 2016.

[10] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, Berkeley, CA, USA, 1967.

[11] S. Kant, T. L. Rao, and P. N. Sundaram, "An automatic and stable clustering algorithm," Pattern Recognition Letters, vol. 15, no. 6, pp. 543–549, 1994.

[12] D. Arthur and S. Vassilvitskii, "K-means++: the advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 7–9, New Orleans, LA, USA, 2007.

[13] Y. Zhao, W. Halang, and X. Wang, "Rough ontology mapping in E-business integration," E-Service Intelligence, pp. 75–93, 2007.

[14] Y. Xiao and J. Yu, "Semi-supervised clustering based on affinity propagation algorithm," ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, 2007.

[15] M. Ester, H. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 226–231, Portland, OR, USA, 1996.

[16] R. J. G. B. Campello, D. Moulavi, and J. Sander, "Density-based clustering based on hierarchical density estimates," Advances in Knowledge Discovery and Data Mining, vol. 7819, pp. 160–172, 2013.

[17] Z. Liang and P. Chen, "Delta-density based clustering with a divide-and-conquer strategy: 3DC clustering," Pattern Recognition Letters, vol. 73, pp. 52–59, 2016.

[18] M. Ankerst, M. M. Breunig, H. P. Kriegel, and J. Sander, "OPTICS: ordering points to identify the clustering structure," in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pp. 49–60, Philadelphia, PA, USA, 1999.

[19] M. Du, S. Ding, and H. Jia, "Study on density peaks clustering based on k-nearest neighbors and principal component analysis," Knowledge-Based Systems, vol. 99, pp. 135–145, 2016.

[20] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, 2014.

[21] T. Li, H. W. Ge, and S. Z. Su, "Density peaks clustering by automatic determination of cluster centers," Journal of Computer Science and Technology, vol. 10, no. 11, pp. 1614–1622, 2016.

[22] R. Liu, H. Wang, and X. Yu, "Shared-nearest-neighbor-based clustering by fast search and find of density peaks," Information Sciences, vol. 450, pp. 200–226, 2018.

[23] R. A. Jarvis and E. A. Patrick, "Clustering using a similarity measure based on shared near neighbors," IEEE Transactions on Computers, vol. C-22, no. 11, pp. 1025–1034, 1973.

[24] H. Chang and D.-Y. Yeung, "Robust path-based spectral clustering," Pattern Recognition, vol. 41, no. 1, pp. 191–203, 2008.

[25] L. Fu and E. Medico, "FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data," BMC Bioinformatics, vol. 8, no. 1, 2007.

[26] A. Gionis, H. Mannila, and P. Tsaparas, "Clustering aggregation," ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, p. 4, 2007.

[27] P. Franti, O. Virmajoki, and V. Hautamaki, "Fast agglomerative clustering using a k-nearest neighbor graph," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1875–1881, 2006.

[28] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 1994 IEEE Workshop on Applications of Computer Vision, pp. 138–142, Sarasota, FL, USA, 1994.

[29] K. Bache and M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, 2013.

[30] M. Charytanowicz, J. Niewczas, P. Kulczycki, P. A. Kowalski, S. Lukasik, and S. Zak, "Complete gradient clustering algorithm for features analysis of X-ray images," Information Technologies in Biomedicine, Advances in Intelligent and Soft Computing, vol. 69, Springer, Berlin, Germany.

[31] D. B. Dias, R. C. B. Madeo, T. Rocha, H. H. Biscaro, and S. M. Peres, "Hand movement recognition for Brazilian sign language: a study using distance-based neural networks," in Proceedings of the 2009 International Joint Conference on Neural Networks, pp. 697–704, Atlanta, GA, USA, 2009.

[32] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, Routledge, New York, NY, USA, 1st edition, 1984.

[33] V. G. Sigillito, S. P. Wing, L. V. Hutton, and K. B. Baker, "Classification of radar returns from the ionosphere using neural networks," Johns Hopkins APL, vol. 10, no. 3, pp. 262–266, 1989.

[34] W. N. Street, W. H. Wolberg, and O. L. Mangasarian, "Nuclear feature extraction for breast tumor diagnosis," in Proceedings of the SPIE 1905, Biomedical Image Processing and Biomedical Visualization, San Jose, CA, USA, 1993.

[35] X. V. Nguyen, J. Epps, and J. Bailey, "Information theoretic measures for clusterings comparison: is a correction for chance necessary?," in Proceedings of the ICML 2009, the 26th Annual International Conference on Machine Learning, Montreal, Canada, 2009.

[36] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, the Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, Burlington, MA, USA, 3rd edition, 2011.
