8/11/2019 ComparisonOfClusteringAlgorithmsReport[1]
COMPARISON OF CLUSTERING ALGORITHMS:
PARTITIONAL AND HIERARCHICAL

Principal Investigator: Dr. Sanjay Ranka
Professor
Department of Computer Science, University of Florida

Teaching Assistant: Manas Somaiya

Authors: Joyes Misra, Gnana Sundar Rajendiran, Vasant Prabhu Sundararaj
Department of Computer Science, University of Florida
Gainesville, www.cise.ufl.edu
Final Report, December 2006
TABLE OF CONTENTS

I. ABSTRACT
II. DETAILED REPORT
  1. K-Means Partitional Clustering
    1.1 Characteristics of K-Means
    1.2 Algorithm
    1.3 Observations
  2. Agglomerative Hierarchical Clustering
    2.1 Definition
    2.2 Algorithms Implemented in this Project
    2.3 Datasets and Experiments
  3. DBSCAN (Using KD-Trees)
    3.1 DBSCAN Algorithm
    3.2 DBSCAN Performance Enhancements Using KD-Trees
    3.3 Observations Regarding DBSCAN Issues
  4. CURE - Hierarchical Clustering (Using KD-Trees)
    4.1 CURE Hierarchical Clustering Algorithm
    4.2 CURE Overview
    4.3 CURE Data Structures Used
    4.4 Benefits of CURE against Other Algorithms
    4.5 Observations towards Sensitivity to Parameters
III. CONCLUSION
IV. REFERENCES
Figure 7: SPAETH dataset

Output Cluster Plot - Globular Clusters

After 2,000 iterations, 3 clusters remain.

Figure 8: Agglomerative clusters after 2,000 iterations
Non-Globular Clusters (run on Checkerboard data):

Single Link

Figure 9: Agglomerative non-globular clusters

CURE

Figure 10: CURE non-globular clusters
Complete Link

It was executed on a part of the Census data obtained from the UCI Repository.

Figure 11: Complete Link

Output Cluster Plot (compared with the CURE algorithm)

After 2,000 iterations, 13 clusters remain.

Figure 12: Complete Link clusters after 2,000 iterations
Final cluster after 2,012 iterations:

Figure 13: Complete Link clusters after 2,012 iterations

CURE

Figure 14: CURE clusters
3. DBSCAN (Using KD-Trees)
The main reason why natural clusters are recognizable is that within each cluster we have a typical density of points which is considerably higher than outside the cluster. Furthermore, the density within the areas of noise is lower than the density in any of the clusters. With this understanding, we can define core, border and noise points in a given data set.
Core points: A point is a core point if the number of points within a given neighborhood around the point (as determined by the distance function and a user-specified distance parameter, Eps) exceeds a certain threshold, MinPts, which is also a user-specified parameter.

Border points: A border point is not a core point, but falls within the neighborhood of a core point.

Noise points: A noise point is any point that is neither a core point nor a border point.
3.1 DBSCAN Algorithm
1. Label all points as core, border or noise points.
2. Eliminate noise points.
3. Put an edge between all pairs of core points that are within Eps of each other.
4. Make each group of connected core points into a separate cluster.
5. Assign each border point to one of the clusters of its associated core points.
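The five steps above can be sketched directly in Java (the project's implementation language). This is a minimal brute-force illustration with hypothetical class and method names, not the project's actual code; in particular it omits the KD-Tree acceleration described in the next section, so every neighborhood query scans all points.

```java
import java.util.*;

// Minimal DBSCAN sketch following the five steps above (illustrative names).
public class Dbscan {
    // Returns a cluster id per point, or -1 for noise.
    public static int[] cluster(double[][] pts, double eps, int minPts) {
        int n = pts.length;
        // Step 1: compute neighborhoods and label core points
        // (neighborhood count >= minPts, the point itself included).
        boolean[] core = new boolean[n];
        List<List<Integer>> nbrs = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            List<Integer> nb = new ArrayList<>();
            for (int j = 0; j < n; j++)
                if (dist(pts[i], pts[j]) <= eps) nb.add(j);
            nbrs.add(nb);
            core[i] = nb.size() >= minPts;
        }
        // Steps 2-4: connect core points within Eps of each other via BFS;
        // each connected component of core points becomes one cluster.
        int[] label = new int[n];
        Arrays.fill(label, -1);          // -1 = noise until proven otherwise
        int next = 0;
        for (int i = 0; i < n; i++) {
            if (!core[i] || label[i] != -1) continue;
            Deque<Integer> queue = new ArrayDeque<>();
            queue.add(i); label[i] = next;
            while (!queue.isEmpty()) {
                int p = queue.poll();
                for (int q : nbrs.get(p))
                    if (core[q] && label[q] == -1) { label[q] = next; queue.add(q); }
            }
            next++;
        }
        // Step 5: assign each border point to a cluster of an associated core point.
        for (int i = 0; i < n; i++)
            if (label[i] == -1)
                for (int q : nbrs.get(i))
                    if (core[q]) { label[i] = label[q]; break; }
        return label;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int k = 0; k < a.length; k++) s += (a[k] - b[k]) * (a[k] - b[k]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[][] pts = {{0,0},{0,1},{1,0},{10,10},{10,11},{11,10},{50,50}};
        // Two dense groups plus one isolated noise point.
        System.out.println(Arrays.toString(cluster(pts, 2.0, 3)));
    }
}
```

The isolated point (50,50) has no core point in its neighborhood and therefore remains labeled -1 (noise), while each dense group forms its own cluster.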
3.2 DBSCAN Performance Enhancements Using KD-Trees
We used KD-Trees to improve the efficiency of DBSCAN clustering. The worst-case time
complexity of the DBSCAN algorithm is O(m^2). However, it can be shown that for low-dimensional data, this time complexity can be reduced to O(m log m) using KD-Trees.
The initialization of the KD-Tree is a one-time cost which the algorithm incurs while reading the data points from file. Once the KD-Tree has been initialized, it can be used throughout the algorithm to classify core, border and noise points based on the number of nearest neighbors found, as well as to find the nearest core point for a border point. The KD-Tree helps to decrease the search time for the nearest neighbor of a point from O(n) to O(log n), where n is the size of the dataset.
We saw performance improvements by using KD-Trees. The algorithm was run on an Intel Pentium IV 1.8 GHz (Duo Core) system with 1 GB RAM. The program was compiled using the Java 1.6 compiler.
No. of Points   Clustering Time (sec)
1572            3.5
3568            10.9
7502            39.5
10256           7
For data sets of more than 5000 points, agglomerative clustering was highly inefficient, though quality clustering could be achieved by using one of the above options. Our experiments on the CURE clustering algorithm suggest that CURE depends on a few parameters, and once they are tuned for a given data set pertaining to a domain, the algorithm can scale well by adding more resources and partitioning the data.
4.1 CURE Hierarchical Clustering Algorithm
The CURE clustering algorithm is a hierarchical algorithm which merges the two closest clusters at every step; the clustering process is carried out in two passes.
To enhance performance, scalability, as well as quality of clustering, CURE adds a few more pre-clustering and post-clustering steps.
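As a rough illustration of the merge loop at the heart of such a hierarchical pass, the Java sketch below repeatedly merges the two closest clusters until k clusters remain. All names are hypothetical: it replaces CURE's min-heap and KD-Tree machinery with a brute-force closest-pair scan, and uses all points of a cluster in place of its shrunken representative points.

```java
import java.util.*;

// Much-simplified sketch of a hierarchical merge pass (illustrative names).
// The real CURE pass finds the closest pair via a min-heap and a KD-Tree,
// and measures distances between shrunken representative points.
public class CureSketch {
    // Merge the two closest clusters repeatedly until k clusters remain.
    static List<List<double[]>> cluster(double[][] pts, int k) {
        List<List<double[]>> clusters = new ArrayList<>();
        for (double[] p : pts) clusters.add(new ArrayList<>(List.of(p)));
        while (clusters.size() > k) {
            int bi = 0, bj = 1; double best = Double.MAX_VALUE;
            // find the globally closest pair (heap-backed in the real algorithm)
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = dist(clusters.get(i), clusters.get(j));
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            clusters.get(bi).addAll(clusters.remove(bj)); // merge the pair
        }
        return clusters;
    }

    // Cluster distance = minimum distance over point pairs (here all points
    // stand in for the shrunken representatives CURE would use).
    static double dist(List<double[]> a, List<double[]> b) {
        double best = Double.MAX_VALUE;
        for (double[] p : a)
            for (double[] q : b) {
                double s = 0;
                for (int t = 0; t < p.length; t++) s += (p[t]-q[t]) * (p[t]-q[t]);
                best = Math.min(best, Math.sqrt(s));
            }
        return best;
    }

    public static void main(String[] args) {
        double[][] pts = {{0,0},{0,1},{1,0},{10,10},{11,10},{10,11}};
        for (List<double[]> c : cluster(pts, 2))
            System.out.println("cluster of size " + c.size());
    }
}
```

Each iteration removes exactly one cluster, which is why the report can speak of the number of clusters remaining after a given number of iterations.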
4.2 CURE Overview
While drawing the random sample, due importance was given to ensuring that all clusters were represented and none of them were missed, by estimating a minimum probability.
4.3 CURE Data Structures Used
We used two data structures, namely the KD-Tree and the Min Heap. Following is a brief description of both.
4.3.1 KD-Tree
A KD-Tree (short for k-dimensional tree) is a space-partitioning data structure for organizing points in a k-dimensional space. KD-Trees are a useful data structure for several applications, such as searches involving a multidimensional search key (e.g. range searches and nearest neighbour searches). KD-Trees are a special case of BSP trees.

A KD-Tree uses only splitting planes that are perpendicular to one of the coordinate system axes. This differs from BSP trees, in which arbitrary splitting planes can be used. In addition, in the typical definition every node of a KD-Tree, from the root to the leaves, stores a point. This differs from BSP trees, in which leaves are typically the only nodes that contain points (or other geometric primitives). As a consequence, each splitting plane must go through one of the points in the KD-Tree. KD-Tries are a variant that store data only in leaf nodes. It is worth noting that in an alternative definition of the KD-Tree the points are stored in its leaf nodes only, although each splitting plane still goes through one of the points.
In CURE, the KD-Tree is initialized during the initial phase of clustering to hold all the points. Later in the algorithm, we use this tree for nearest neighbor search and for finding the closest clusters based on the representative points of a cluster. When a new cluster is formed, its new representative points are added to the KD-Tree, and the representative points of the older clusters are deleted from the tree.

The KD-Tree improves the search for points in k-dimensional space from O(n) to O(log n), as it uses binary partitioning across the coordinate axes.
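A minimal Java sketch of the build and nearest-neighbour search just described (hypothetical names; the tree used in the project also supports the neighborhood counting and deletions mentioned above, which are omitted here):

```java
import java.util.*;

// Compact KD-Tree nearest-neighbour sketch (illustrative names).
public class KdTree {
    double[] point; KdTree left, right; int axis;

    KdTree(double[] p, int axis) { this.point = p; this.axis = axis; }

    // Build by recursively splitting on the median along alternating axes,
    // so every node stores a point and each splitting plane passes through one.
    static KdTree build(List<double[]> pts, int axis) {
        if (pts.isEmpty()) return null;
        pts.sort(Comparator.comparingDouble(p -> p[axis]));
        int mid = pts.size() / 2;
        KdTree node = new KdTree(pts.get(mid), axis);
        int next = (axis + 1) % pts.get(mid).length;
        node.left = build(new ArrayList<>(pts.subList(0, mid)), next);
        node.right = build(new ArrayList<>(pts.subList(mid + 1, pts.size())), next);
        return node;
    }

    // Nearest neighbour: descend toward the query, then backtrack into the
    // far subtree only when the splitting plane is closer than the best
    // distance found so far - this pruning gives the O(log n) behaviour.
    static double[] nearest(KdTree node, double[] q, double[] best) {
        if (node == null) return best;
        if (best == null || dist(q, node.point) < dist(q, best)) best = node.point;
        double diff = q[node.axis] - node.point[node.axis];
        KdTree near = diff < 0 ? node.left : node.right;
        KdTree far  = diff < 0 ? node.right : node.left;
        best = nearest(near, q, best);
        if (Math.abs(diff) < dist(q, best)) best = nearest(far, q, best);
        return best;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int k = 0; k < a.length; k++) s += (a[k]-b[k]) * (a[k]-b[k]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        List<double[]> pts = new ArrayList<>(Arrays.asList(
            new double[]{2,3}, new double[]{5,4}, new double[]{9,6},
            new double[]{4,7}, new double[]{8,1}, new double[]{7,2}));
        KdTree root = build(pts, 0);
        System.out.println(Arrays.toString(nearest(root, new double[]{9,2}, null)));
    }
}
```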
4.3.2 Min Heap
A Min Heap is a simple heap data structure created using a binary tree. It can be seen as a binary tree with two additional constraints:

1. The shape property: all levels of the tree, except possibly the last one (deepest), are fully filled, and, if the last level of the tree is not complete, the nodes of that level are filled from left to right.
2. The heap property: each node is less than or equal to each of its children.
The Min Heap stores the minimum element at the root of the heap. In CURE, we always merge two clusters at every step, and the cluster to be merged is necessarily the one with the closest distance to another nearby cluster, as the heap is ordered by inter-cluster distance comparisons. Hence we can always get this cluster in O(1) time.
We used java.util.PriorityQueue, which supports all the Min Heap operations.
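A small usage sketch of java.util.PriorityQueue in this role (the ClusterPair type is illustrative, not the project's actual entry class):

```java
import java.util.*;

// Min-heap of candidate merges, ordered by inter-cluster distance.
public class HeapDemo {
    record ClusterPair(int a, int b, double dist) {}

    public static void main(String[] args) {
        PriorityQueue<ClusterPair> heap =
            new PriorityQueue<>(Comparator.comparingDouble(ClusterPair::dist));
        heap.add(new ClusterPair(0, 1, 4.2));
        heap.add(new ClusterPair(1, 2, 0.9));
        heap.add(new ClusterPair(0, 2, 2.5));
        // peek() returns the globally closest pair in O(1); poll() removes it
        // in O(log n) and re-heapifies, as needed after every merge step.
        ClusterPair next = heap.peek();
        System.out.println(next.a() + "-" + next.b() + " at " + next.dist());
    }
}
```

Here peek() immediately yields the pair (1, 2) at distance 0.9, which is the merge CURE would perform next.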
4.4 Benefits of CURE against Other Algorithms
K-Means (or centroid-based algorithms): Unsuitable for non-spherical and size-differing clusters.

CLARANS: Needs multiple data scans (R* Trees were proposed later on). CURE uses KD-Trees inherently to store the dataset and uses them across passes.

BIRCH: Suffers from identifying only convex or spherical clusters of uniform size.

DBSCAN: No parallelism, high sensitivity; sampling of data may affect density measures.
4.5 Observations towards Sensitivity to Parameters
We observed that the random sample size was an important criterion while pre-clustering the data set. Hence we used the Chernoff bounds, as given in [1], to calculate the minimum size of the sample to be selected. Random sampling often missed out some of the smaller clusters. The next important parameter was the shrink factor of the representative points (a). If we increased a to 1, the algorithm would degenerate to MST-based algorithms. If the parameter a is reduced to 0.1, CURE starts behaving as a centroid-based algorithm. Thus, for a range of 0.3 to 0.7, CURE identified the right clusters.
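The shrink-factor behaviour described above can be illustrated in a few lines of Java, following this report's convention that a = 1 leaves a representative point unshrunk (the MST-like extreme) while small a pulls it onto the centroid (the centroid-based extreme). Names are illustrative:

```java
import java.util.Arrays;

// Sketch of the shrinking step: a representative point retains a fraction
// `alpha` of its distance from the cluster centroid (illustrative names;
// convention follows the report's description of the parameter a).
public class Shrink {
    static double[] shrink(double[] rep, double[] centroid, double alpha) {
        double[] out = new double[rep.length];
        for (int k = 0; k < rep.length; k++)
            out[k] = centroid[k] + alpha * (rep[k] - centroid[k]);
        return out;
    }

    public static void main(String[] args) {
        double[] rep = {4.0, 0.0}, centroid = {0.0, 0.0};
        // alpha = 1: point unchanged on the cluster boundary (MST-like)
        System.out.println(Arrays.toString(shrink(rep, centroid, 1.0))); // [4.0, 0.0]
        // alpha = 0.5: point pulled halfway toward the centroid
        System.out.println(Arrays.toString(shrink(rep, centroid, 0.5))); // [2.0, 0.0]
    }
}
```

Moderate values of a thus keep representatives near the cluster boundary (capturing non-spherical shapes) while damping the influence of outliers, which matches the 0.3 to 0.7 range found to work above.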
The number of representative points present in a cluster is an important parameter. If a cluster is sparse, it may need more representative points than a compact, smaller cluster. We observed that if the number of representative points was increased to 8 or 10, sparse clusters with variable size and density were identified properly. But with an increase in representative points, the computation time for clustering increased, as for every new cluster formed, new representative points have to be calculated and shrunk.
One of the most important observations of our experiments was with respect to partitioning of the data sets, as CURE supports concurrent execution of the first pass of the algorithm. As the number of partitions was increased from 2 to 6 or 10, the clustering time dropped significantly. Though the number of clusters to be merged increased in the second step, the advantage of concurrent execution was far greater. But we noticed that if we increased the number of partitions to higher numbers, such as 50, the clustering would not give proper results, as some of the partitions would not have any data to cluster. Hence, though the time consumed would be lower, the quality of the clusters gets affected and CURE could not identify all the clusters correctly; some of them got merged to form bigger clusters. A partitioning of 10 - 20 would therefore result in an efficient speed-up of the algorithm while maintaining the quality of the clusters.
Partitioning Results

Time (sec)
No. of Points     1572   3568   7502   10256
Partitions = 2     6.4    7.8   29.4    75.7
Partitions = 4     6.5    7.6   21.6    43.6
Partitions = 5     6.1    7.3   12.2    21.2

Figure 18: Partitioning results
If a chart is plotted for the same, we can see that as the number of partitions is increased, the time taken to cluster grows only very slowly, even though the data set size has increased by four times.
III. CONCLUSION
From the clusters obtained through the various algorithms and the time taken by each algorithm on the datasets, we can say that K-Means is not the best of the clustering methods, with its high space complexity. For high-dimensional data, K-Means takes a lot of time and memory. Also, it does not always converge.
Our experiments suggest that DBSCAN fared well for low-dimensional data. Also, if the density of the clusters did not vary too much, DBSCAN fairly identified all the clusters. But if the size of the data increases and the shapes and densities of the clusters vary too much, DBSCAN ends up combining or splitting those clusters.
CURE could identify all the clusters properly. But CURE depends on some user parameters which have to be data-specific. The range of such parameters does not vary too much, many of them lying between 0 and 1. CURE could identify several clusters with high purity which K-Means and DBSCAN failed to identify.
With respect to agglomerative clustering, clusters with high purity could be obtained, but the computation time for clustering was high. Application of Kruskal's algorithm and union-by-rank helped to improve the efficiency, but the computation time still increased significantly as the size of the data set increased.
IV. REFERENCES
1. An Efficient k-Means Clustering Algorithm: Analysis and Implementation. Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, Angela Y. Wu. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 2002.

2. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu. KDD '96.

3. CURE: An Efficient Clustering Algorithm for Large Databases. S. Guha, R. Rastogi and K. Shim. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, Seattle, 1998.

4. An Efficient K-Means Clustering Algorithm. K. Alsabti, S. Ranka, V. Singh. 1998.