8/11/2019 ComparisonOfClusteringAlgorithmsReport[1]
COMPARISON OF CLUSTERING ALGORITHMS:
PARTITIONAL AND HIERARCHICAL

Principal Investigator: Dr. Sanjay Ranka
Professor
Department of Computer Science, University of Florida

Teaching Assistant: Manas Somaiya

Authors: Joyes Misra, Gnana Sundar Rajendiran, Vasant Prabhu Sundararaj
Department of Computer Science, University of Florida
Gainesville, www.cise.ufl.edu
Final Report, December 2006
TABLE OF CONTENTS

I. ABSTRACT
II. DETAILED REPORT
  1. K-Means Partitional Clustering
    1.1 Characteristics of K-Means
    1.2 Algorithm
    1.3 Observations
  2. Agglomerative Hierarchical Clustering
    2.1 Definition
    2.2 Algorithms Implemented in this Project
    2.3 Datasets and Experiments
  3. DBSCAN (Using KD-Trees)
    3.1 DBSCAN Algorithm
    3.2 DBSCAN Performance Enhancements Using KD-Trees
    3.3 Observations Regarding DBSCAN Issues
  4. CURE - Hierarchical Clustering (Using KD-Trees)
    4.1 CURE Hierarchical Clustering Algorithm
    4.2 CURE Overview
    4.3 CURE Data Structures Used
    4.4 Benefits of CURE against Other Algorithms
    4.5 Observations towards Sensitivity to Parameters
III. CONCLUSION
IV. REFERENCES
Figure 7: SPAETH dataset

Output Cluster Plot - Globular Clusters

After 2,000 iterations, 3 clusters remain.

Figure 8: Agglomerative clusters after 2,000 iterations
Non-Globular Clusters (run on Checkerboard data):

Single Link

Figure 9: Agglomerative non-globular clusters

CURE

Figure 10: CURE non-globular clusters
Complete Link

It was executed on a part of the Census data obtained from the UCI Repository.

Figure 11: Complete Link

Output Cluster Plot (compared with the CURE algorithm)

After 2,000 iterations, 13 clusters remain.

Figure 12: Complete Link clusters after 2,000 iterations
Final cluster after 2,012 iterations:

Figure 13: Complete Link clusters after 2,012 iterations

CURE

Figure 14: CURE clusters
3. DBSCAN (Using KD-Trees)
The main reason why natural clusters are recognizable is that within each cluster we have a typical density of points which is considerably higher than outside the cluster. Furthermore, the density within the areas of noise is lower than the density in any of the clusters. With this understanding, we can define core, border and noise points in a given data set.
Core points: A point is a core point if the number of points within a given neighborhood around the point (as determined by the distance function and a user-specified distance parameter, Eps) exceeds a certain threshold, MinPts, which is also a user-specified parameter.

Border points: A border point is not a core point, but falls within the neighborhood of a core point.

Noise points: A noise point is any point that is neither a core point nor a border point.
3.1 DBSCAN Algorithm
1. Label all points as core, border or noise points.
2. Eliminate noise points.
3. Put an edge between all pairs of core points that are within Eps of each other.
4. Make each group of connected core points into a separate cluster.
5. Assign each border point to one of the clusters of its associated core points.
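The five steps above can be sketched directly in Java (the project's implementation language). This is a minimal brute-force illustration with hypothetical class and method names, not the project's actual code; in particular it omits the KD-Tree acceleration described in the next section, so every neighborhood query scans all points.

```java
import java.util.*;

// Minimal DBSCAN sketch following the five steps above (illustrative names).
public class Dbscan {
    // Returns a cluster id per point, or -1 for noise.
    public static int[] cluster(double[][] pts, double eps, int minPts) {
        int n = pts.length;
        // Step 1: compute neighborhoods and label core points
        // (neighborhood count >= minPts, the point itself included).
        boolean[] core = new boolean[n];
        List<List<Integer>> nbrs = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            List<Integer> nb = new ArrayList<>();
            for (int j = 0; j < n; j++)
                if (dist(pts[i], pts[j]) <= eps) nb.add(j);
            nbrs.add(nb);
            core[i] = nb.size() >= minPts;
        }
        // Steps 2-4: connect core points within Eps of each other via BFS;
        // each connected component of core points becomes one cluster.
        int[] label = new int[n];
        Arrays.fill(label, -1);          // -1 = noise until proven otherwise
        int next = 0;
        for (int i = 0; i < n; i++) {
            if (!core[i] || label[i] != -1) continue;
            Deque<Integer> queue = new ArrayDeque<>();
            queue.add(i); label[i] = next;
            while (!queue.isEmpty()) {
                int p = queue.poll();
                for (int q : nbrs.get(p))
                    if (core[q] && label[q] == -1) { label[q] = next; queue.add(q); }
            }
            next++;
        }
        // Step 5: assign each border point to a cluster of an associated core point.
        for (int i = 0; i < n; i++)
            if (label[i] == -1)
                for (int q : nbrs.get(i))
                    if (core[q]) { label[i] = label[q]; break; }
        return label;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int k = 0; k < a.length; k++) s += (a[k] - b[k]) * (a[k] - b[k]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[][] pts = {{0,0},{0,1},{1,0},{10,10},{10,11},{11,10},{50,50}};
        // Two dense groups plus one isolated noise point.
        System.out.println(Arrays.toString(cluster(pts, 2.0, 3)));
    }
}
```

The isolated point (50,50) has no core point in its neighborhood and therefore remains labeled -1 (noise), while each dense group forms its own cluster.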
3.2 DBSCAN Performance Enhancements Using KD-Trees
We used KD-Trees to improve the efficiency of DBSCAN clustering. The worst-case time
complexity of the DBSCAN algorithm is O(m^2). However, it can be shown that for low-dimensional data, this time complexity can be reduced to O(m log m) using KD-Trees.
The initialization of the KD-Tree is a one-time cost which the algorithm incurs while reading the data points from file. Once the KD-Tree has been initialized, it can be used throughout the algorithm to classify core, border and noise points based on the number of nearest neighbors found, as well as to find the nearest core point for a border point. The KD-Tree helps to decrease the search time for the nearest neighbor of a point from O(n) to O(log n), where n is the size of the dataset.
We saw performance improvements by using KD-Trees. The algorithm was run on an Intel Pentium IV 1.8 GHz (Duo Core) system with 1 GB RAM. The program was compiled using the Java 1.6 compiler.
No. of Points   Clustering Time (sec)
1572            3.5
3568            10.9
7502            39.5
10256           7
For data sets of more than 5000 points, agglomerative clustering was highly inefficient, though quality clustering could be achieved by using one of the above options. Our experiments on the CURE clustering algorithm suggest that CURE depends on a few parameters, and once they are tuned for a given data set pertaining to a domain, the algorithm can scale well by adding more resources and partitioning the data.
4.1 CURE Hierarchical Clustering Algorithm
The CURE clustering algorithm is a hierarchical algorithm which merges the two closest clusters at every step; the clustering process is carried out in two passes.
To enhance performance, scalability, as well as quality of clustering, CURE adds a few more pre-clustering and post-clustering steps.
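As a rough illustration of the merge loop at the heart of such a hierarchical pass, the Java sketch below repeatedly merges the two closest clusters until k clusters remain. All names are hypothetical: it replaces CURE's min-heap and KD-Tree machinery with a brute-force closest-pair scan, and uses all points of a cluster in place of its shrunken representative points.

```java
import java.util.*;

// Much-simplified sketch of a hierarchical merge pass (illustrative names).
// The real CURE pass finds the closest pair via a min-heap and a KD-Tree,
// and measures distances between shrunken representative points.
public class CureSketch {
    // Merge the two closest clusters repeatedly until k clusters remain.
    static List<List<double[]>> cluster(double[][] pts, int k) {
        List<List<double[]>> clusters = new ArrayList<>();
        for (double[] p : pts) clusters.add(new ArrayList<>(List.of(p)));
        while (clusters.size() > k) {
            int bi = 0, bj = 1; double best = Double.MAX_VALUE;
            // find the globally closest pair (heap-backed in the real algorithm)
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = dist(clusters.get(i), clusters.get(j));
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            clusters.get(bi).addAll(clusters.remove(bj)); // merge the pair
        }
        return clusters;
    }

    // Cluster distance = minimum distance over point pairs (here all points
    // stand in for the shrunken representatives CURE would use).
    static double dist(List<double[]> a, List<double[]> b) {
        double best = Double.MAX_VALUE;
        for (double[] p : a)
            for (double[] q : b) {
                double s = 0;
                for (int t = 0; t < p.length; t++) s += (p[t]-q[t]) * (p[t]-q[t]);
                best = Math.min(best, Math.sqrt(s));
            }
        return best;
    }

    public static void main(String[] args) {
        double[][] pts = {{0,0},{0,1},{1,0},{10,10},{11,10},{10,11}};
        for (List<double[]> c : cluster(pts, 2))
            System.out.println("cluster of size " + c.size());
    }
}
```

Each iteration removes exactly one cluster, which is why the report can speak of the number of clusters remaining after a given number of iterations.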
4.2 CURE Overview
While drawing the random sample, due importance was given to ensuring that all clusters were represented and none of them were missed, by estimating a minimum probability.
4.3 CURE Data Structures Used
We used two data structures, namely the KD-Tree and the Min Heap. Following is a brief description of both.
4.3.1 KD-Tree
A KD-Tree (short for k-dimensional tree) is a space-partitioning data structure for organizing points in a k-dimensional space. KD-Trees are a useful data structure for several applications, such as searches involving a multidimensional search key (e.g. range searches and nearest neighbour searches). KD-Trees are a special case of BSP trees.

A KD-Tree uses only splitting planes that are perpendicular to one of the coordinate system axes. This differs from BSP trees, in which arbitrary splitting planes can be used. In addition, in the typical definition every node of a KD-Tree, from the root to the leaves, stores a point. This differs from BSP trees, in which leaves are typically the only nodes that contain points (or other geometric primitives). As a consequence, each splitting plane must go through one of the points in the KD-Tree. KD-Tries are a variant that store data only in leaf nodes. It is worth noting that in an alternative definition of the KD-Tree the points are stored in its leaf nodes only, although each splitting plane still goes through one of the points.
In CURE, the KD-Tree is initialized during the initial phase of clustering to hold all the points. Later in the algorithm, we use this tree for nearest neighbor search and for finding the closest clusters based on the representative points of a cluster. When a new cluster is formed, its new representative points are added to the KD-Tree, and the representative points of the older clusters are deleted from the tree.

The KD-Tree improves the search for points in k-dimensional space from O(n) to O(log n), as it uses binary partitioning across the coordinate axes.
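A minimal Java sketch of the build and nearest-neighbour search just described (hypothetical names; the tree used in the project also supports the neighborhood counting and deletions mentioned above, which are omitted here):

```java
import java.util.*;

// Compact KD-Tree nearest-neighbour sketch (illustrative names).
public class KdTree {
    double[] point; KdTree left, right; int axis;

    KdTree(double[] p, int axis) { this.point = p; this.axis = axis; }

    // Build by recursively splitting on the median along alternating axes,
    // so every node stores a point and each splitting plane passes through one.
    static KdTree build(List<double[]> pts, int axis) {
        if (pts.isEmpty()) return null;
        pts.sort(Comparator.comparingDouble(p -> p[axis]));
        int mid = pts.size() / 2;
        KdTree node = new KdTree(pts.get(mid), axis);
        int next = (axis + 1) % pts.get(mid).length;
        node.left = build(new ArrayList<>(pts.subList(0, mid)), next);
        node.right = build(new ArrayList<>(pts.subList(mid + 1, pts.size())), next);
        return node;
    }

    // Nearest neighbour: descend toward the query, then backtrack into the
    // far subtree only when the splitting plane is closer than the best
    // distance found so far - this pruning gives the O(log n) behaviour.
    static double[] nearest(KdTree node, double[] q, double[] best) {
        if (node == null) return best;
        if (best == null || dist(q, node.point) < dist(q, best)) best = node.point;
        double diff = q[node.axis] - node.point[node.axis];
        KdTree near = diff < 0 ? node.left : node.right;
        KdTree far  = diff < 0 ? node.right : node.left;
        best = nearest(near, q, best);
        if (Math.abs(diff) < dist(q, best)) best = nearest(far, q, best);
        return best;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int k = 0; k < a.length; k++) s += (a[k]-b[k]) * (a[k]-b[k]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        List<double[]> pts = new ArrayList<>(Arrays.asList(
            new double[]{2,3}, new double[]{5,4}, new double[]{9,6},
            new double[]{4,7}, new double[]{8,1}, new double[]{7,2}));
        KdTree root = build(pts, 0);
        System.out.println(Arrays.toString(nearest(root, new double[]{9,2}, null)));
    }
}
```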
4.3.2 Min Heap
A Min Heap is a simple heap data structure created using a binary tree. It can be seen as a binary tree with two additional constraints:

1. The shape property: all levels of the tree, except possibly the last one (deepest), are fully filled, and, if the last level of the tree is not complete, the nodes of that level are filled from left to right.
2. The heap property: each node is less than or equal to each of its children.
The Min Heap stores the minimum element at the root of the heap. In CURE, we always merge two clusters at every step, and the cluster to be merged is necessarily the one with the closest distance to another nearby cluster, as the heap is ordered by inter-cluster distance comparisons. Hence we can always get this cluster in O(1) time.
We used java.util.PriorityQueue, which supports all the Min Heap operations.
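A small usage sketch of java.util.PriorityQueue in this role (the ClusterPair type is illustrative, not the project's actual entry class):

```java
import java.util.*;

// Min-heap of candidate merges, ordered by inter-cluster distance.
public class HeapDemo {
    record ClusterPair(int a, int b, double dist) {}

    public static void main(String[] args) {
        PriorityQueue<ClusterPair> heap =
            new PriorityQueue<>(Comparator.comparingDouble(ClusterPair::dist));
        heap.add(new ClusterPair(0, 1, 4.2));
        heap.add(new ClusterPair(1, 2, 0.9));
        heap.add(new ClusterPair(0, 2, 2.5));
        // peek() returns the globally closest pair in O(1); poll() removes it
        // in O(log n) and re-heapifies, as needed after every merge step.
        ClusterPair next = heap.peek();
        System.out.println(next.a() + "-" + next.b() + " at " + next.dist());
    }
}
```

Here peek() immediately yields the pair (1, 2) at distance 0.9, which is the merge CURE would perform next.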
4.4 Benefits of CURE against Other Algorithms
K-Means (or centroid-based algorithms): Unsuitable for non-spherical and size-differing clusters.

CLARANS: Needs multiple data scans (R* Trees were proposed later on). CURE uses KD-Trees inherently to store the dataset and uses them across passes.

BIRCH: Suffers from identifying only convex or spherical clusters of uniform size.

DBSCAN: No parallelism, high sensitivity; sampling of data may affect density measures.
4.5 Observations towards Sensitivity to Parameters
We observed that the random sample size was an important criterion while pre-clustering the data set. Hence we used the Chernoff bounds, as given in [1], to calculate the minimum size of the sample to be selected. Random sampling often missed out some of the smaller clusters. The next important parameter was the shrink factor of the representative points (a). If we increased a to 1, the algorithm would degenerate to MST-based algorithms. If the parameter a is reduced to 0.1, CURE starts behaving as a centroid-based algorithm. Thus, for a range of 0.3 to 0.7, CURE identified the right clusters.
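The shrink-factor behaviour described above can be illustrated in a few lines of Java, following this report's convention that a = 1 leaves a representative point unshrunk (the MST-like extreme) while small a pulls it onto the centroid (the centroid-based extreme). Names are illustrative:

```java
import java.util.Arrays;

// Sketch of the shrinking step: a representative point retains a fraction
// `alpha` of its distance from the cluster centroid (illustrative names;
// convention follows the report's description of the parameter a).
public class Shrink {
    static double[] shrink(double[] rep, double[] centroid, double alpha) {
        double[] out = new double[rep.length];
        for (int k = 0; k < rep.length; k++)
            out[k] = centroid[k] + alpha * (rep[k] - centroid[k]);
        return out;
    }

    public static void main(String[] args) {
        double[] rep = {4.0, 0.0}, centroid = {0.0, 0.0};
        // alpha = 1: point unchanged on the cluster boundary (MST-like)
        System.out.println(Arrays.toString(shrink(rep, centroid, 1.0))); // [4.0, 0.0]
        // alpha = 0.5: point pulled halfway toward the centroid
        System.out.println(Arrays.toString(shrink(rep, centroid, 0.5))); // [2.0, 0.0]
    }
}
```

Moderate values of a thus keep representatives near the cluster boundary (capturing non-spherical shapes) while damping the influence of outliers, which matches the 0.3 to 0.7 range found to work above.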
The number of representative points present in a cluster is an important parameter. If a cluster is sparse, it may need more representative points than a compact, smaller cluster. We observed that if the number of representative points was increased to 8 or 10, sparse clusters with variable size and density were identified properly. But with an increase in representative points, the computation time for clustering increased, as for every new cluster formed, new representative points have to be calculated and shrunk.
One of the most important observations of our experiments was with respect to partitioning of the data sets, as CURE supports concurrent execution of the first pass of the algorithm. As the number of partitions was increased from 2 to 6 or 10, the clustering time dropped significantly. Though the number of clusters to be merged increased in the second step, the advantage of concurrent execution was far greater. But we noticed that if we increased the number of partitions to higher numbers, such as 50, the clustering would not give proper results, as some of the partitions would not have any data to cluster. Hence, though the time consumed would be lower, the quality of the clusters gets affected and CURE could not identify all the clusters correctly; some of them got merged to form bigger clusters. A partitioning of 10 - 20 would therefore result in an efficient speed-up of the algorithm while maintaining the quality of the clusters.
Partitioning Results

Time (sec)
No. of Points     1572   3568   7502   10256
Partitions = 2     6.4    7.8   29.4    75.7
Partitions = 4     6.5    7.6   21.6    43.6
Partitions = 5     6.1    7.3   12.2    21.2

Figure 18: Partitioning results
If a chart is plotted for the same, we can see that as the number of partitions is increased, the time taken to cluster grows only very slowly, even though the data set size has increased by four times.
III. CONCLUSION
From the clusters obtained through the various algorithms and the time taken by each algorithm on the datasets, we can say that K-Means is not the best of the clustering methods, with its high space complexity. For high-dimensional data, K-Means takes a lot of time and memory. Also, it does not always converge.
Our experiments suggest that DBSCAN fared well for low-dimensional data. Also, if the density of the clusters did not vary too much, DBSCAN fairly identified all the clusters. But if the size of the data increases and the shapes and densities of the clusters vary too much, DBSCAN ends up combining or splitting those clusters.
CURE could identify all the clusters properly. But CURE depends on some user parameters which have to be data-specific. The range of such parameters does not vary too much, many of them lying between 0 and 1. CURE could identify several clusters with high purity which K-Means and DBSCAN failed to identify.
With respect to agglomerative clustering, clusters with high purity could be obtained, but the computation time for clustering was high. Application of Kruskal's algorithm and union-by-rank helped to improve the efficiency, but the computation time still increased significantly as the size of the data set increased.
IV. REFERENCES
1. An Efficient k-Means Clustering Algorithm: Analysis and Implementation. Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, Angela Y. Wu. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 2002.

2. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu. KDD '96.

3. CURE: An Efficient Clustering Algorithm for Large Databases. S. Guha, R. Rastogi and K. Shim. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, Seattle, 1998.

4. An Efficient K-Means Clustering Algorithm. K. Alsabti, S. Ranka, V. Singh. 1998.