
  • 8/11/2019 ComparisonOfClusteringAlgorithmsReport[1]

    1/22

    COMPARISON OF CLUSTERING ALGORITHMS:

    PARTITIONAL AND HIERARCHICAL

    Principal Investigator: Dr. Sanjay Ranka

    Professor

    Department of Computer Science, University of Florida

    Teaching Assistant: Manas Somaiya

    Authors: Joyesh Misra, Gnana Sundar Rajendiran, Vasant Prabhu Sundararaj

    Department of Computer Science, University of Florida

    Gainesville; www.cise.ufl.edu

    Final Report, December 2007


    TABLE OF CONTENTS

    I. ABSTRACT ......................................................... 1

    II. DETAILED REPORT ................................................. 1
    1. K-Means Partitional Clustering ................................... 1
    1.1 Characteristics of K-Means ...................................... 1
    1.2 Algorithm ....................................................... 1
    1.3 Observations .................................................... 2
    2. Agglomerative Hierarchical Clustering ............................ 4
    2.1 Definition ...................................................... 4
    2.2 Algorithms Implemented in this Project .......................... 4
    2.3 Datasets and Experiments ........................................ 6
    3. DBSCAN (Using KD-Trees) .......................................... 12
    3.1 DBSCAN Algorithm ................................................ 12
    3.2 DBSCAN Performance Enhancements Using KD-Trees .................. 12
    3.3 Observations Regarding DBSCAN Issues ............................ 13
    4. CURE - Hierarchical Clustering (Using KD-Trees) .................. 13
    4.1 CURE Hierarchical Clustering Algorithm .......................... 14
    4.2 CURE Overview ................................................... 15
    4.3 CURE Data Structures Used ....................................... 15
    4.4 Benefits of CURE against Other Algorithms ....................... 16
    4.5 Observations towards Sensitivity to Parameters .................. 17

    III. CONCLUSION ..................................................... 19


    Figure 7: SPAETH dataset

    Output Cluster Plot

    Globular Clusters

    After 2,000 iterations, 3 clusters remain:

    Figure 8: Agglomerative clusters after 2,000 iterations


    Non-Globular Clusters: Run on checkerboard data:

    Single Link

    Figure 9: Agglomerative non-globular clusters

    CURE

    Figure 10: CURE non-globular clusters


    Complete Link

    It was executed on a part of the Census data obtained from the UCI Repository.

    Figure 11: Complete Link

    Output Cluster Plot (compared with the CURE algorithm)

    After 2,000 iterations, 13 clusters remain:

    Figure 12: Complete Link clusters after 2,000 iterations


    Final cluster after 2,012 iterations:

    Figure 13: Complete Link clusters after 2,012 iterations

    CURE

    Figure 14: CURE clusters


    3. DBSCAN (Using KD-Trees)

    The main reason why natural clusters are recognizable is that within each cluster we have a typical density of points which is considerably higher than outside the cluster. Furthermore, the density within the areas of noise is lower than the density in any of the clusters. With this understanding, we can next describe core, border and noise points in a given data set.

    Core points: A point is a core point if the number of points within a given neighborhood around the point, as determined by the distance function and a user-specified distance parameter Eps, exceeds a certain threshold, MinPts, which is also a user-specified parameter.

    Border points: A border point is not a core point, but falls within the neighborhood of a core point.

    Noise points: A noise point is any point that is neither a core point nor a border point.

    3.1 DBSCAN Algorithm

    1. Label all points as core, border or noise points.
    2. Eliminate noise points.
    3. Put an edge between all core points that are within Eps of each other.
    4. Make each group of connected core points into a separate cluster.
    5. Assign each border point to one of the clusters of its associated core points.
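    As a minimal illustration, the five steps above can be sketched in Java (the report's implementation language). This is our own sketch, not the report's code: it uses a brute-force O(n^2) neighborhood scan rather than the KD-Tree enhancement described in the next section, and all class and method names are made up for the example.

```java
import java.util.*;

// Minimal DBSCAN sketch following the five steps in the text:
// label core/border/noise, drop noise, connect nearby core points,
// make each connected group of cores a cluster, attach border points.
public class DbscanSketch {
    // Returns a cluster id per point; -1 marks noise. Uses an O(n^2)
    // distance scan, i.e. without the KD-Tree speed-up in the report.
    public static int[] cluster(double[][] pts, double eps, int minPts) {
        int n = pts.length;
        List<List<Integer>> nbrs = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            List<Integer> ni = new ArrayList<>();
            for (int j = 0; j < n; j++)
                if (dist(pts[i], pts[j]) <= eps) ni.add(j); // includes i itself
            nbrs.add(ni);
        }
        boolean[] core = new boolean[n];
        for (int i = 0; i < n; i++) core[i] = nbrs.get(i).size() >= minPts;

        int[] label = new int[n];
        Arrays.fill(label, -1);          // -1 = noise until proven otherwise
        int next = 0;
        for (int i = 0; i < n; i++) {
            if (!core[i] || label[i] != -1) continue;
            // BFS over core points reachable through Eps-edges (steps 3-4)
            Deque<Integer> queue = new ArrayDeque<>();
            label[i] = next;
            queue.add(i);
            while (!queue.isEmpty()) {
                int p = queue.poll();
                for (int q : nbrs.get(p)) {
                    if (label[q] != -1) continue;
                    label[q] = next;     // border points adopt the cluster (step 5)
                    if (core[q]) queue.add(q);
                }
            }
            next++;
        }
        return label;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int k = 0; k < a.length; k++) s += (a[k] - b[k]) * (a[k] - b[k]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[][] pts = {{0,0},{0,1},{1,0},{10,10},{10,11},{11,10},{5,5}};
        int[] label = cluster(pts, 1.5, 3);
        System.out.println(Arrays.toString(label)); // two clusters; (5,5) is noise
    }
}
```

    Points that are never absorbed into any core point's neighborhood keep the label -1, which is exactly the "noise point" category defined above.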

    3.2 DBSCAN Performance Enhancements Using KD-Trees

    We used KD-Trees to improve the efficiency of DBSCAN clustering. The worst-case time complexity of the DBSCAN algorithm is O(m^2). However, it can be shown that for low-dimensional data this time complexity can be reduced to O(m log m) using KD-Trees.

    The initialization of the KD-Tree is a one-time cost which the algorithm incurs while reading the data points from file. Once the KD-Tree has been initialized, it can be used across the algorithm to classify core points, border points and noise points based on the number of nearest neighbors found, as well as to find the nearest core point for a border point. The KD-Tree helps to decrease the search time for the nearest neighbor of a point from O(n) to O(log n), where n is the size of the dataset.

    We saw performance improvements by using KD-Trees. The algorithm was run on an Intel Pentium IV 1.8 GHz (Duo Core) system with 1 GB of RAM. The program was compiled using the Java 1.6 compiler.


    No. of Points    Clustering Time (sec)
    1572             3.5
    3568             10.9
    7502             39.5
    10256            7


    For data sets ranging beyond 5,000 points, agglomerative clustering was highly inefficient, though quality of clustering could be achieved by using one of the above options. Our experiments on the CURE clustering algorithm suggest that CURE depends on a few parameters, and once they are tuned for a given data set pertaining to a domain, the algorithm can scale well by adding more resources and partitioning the data.

    4.1 CURE Hierarchical Clustering Algorithm

    The CURE clustering algorithm is a hierarchical algorithm which merges two clusters at every step, and the clustering process is carried out in two passes. The overall hierarchical algorithm is as follows:
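    The merge-two-closest-clusters-per-step idea can be sketched as follows. This is our own illustration, not the report's listing: it compares plain cluster means with a naive closest-pair scan, whereas CURE itself uses shrunken representative points, a KD-Tree and a min-heap (section 4.3).

```java
import java.util.*;

// Hedged sketch of a hierarchical merge loop: start with every point as its
// own cluster and repeatedly merge the two closest clusters until k remain.
// Distances here are between cluster means; the real CURE measures distance
// between shrunken representative points instead.
public class MergeLoopSketch {
    public static List<List<double[]>> agglomerate(double[][] pts, int k) {
        List<List<double[]>> clusters = new ArrayList<>();
        for (double[] p : pts) clusters.add(new ArrayList<>(List.of(p)));
        while (clusters.size() > k) {
            int bi = 0, bj = 1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++)       // naive closest-pair scan
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = dist(mean(clusters.get(i)), mean(clusters.get(j)));
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            clusters.get(bi).addAll(clusters.remove(bj));   // merge the closest pair
        }
        return clusters;
    }

    static double[] mean(List<double[]> c) {
        double[] m = new double[c.get(0).length];
        for (double[] p : c) for (int d = 0; d < m.length; d++) m[d] += p[d] / c.size();
        return m;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[][] pts = {{0,0},{1,0},{0,1},{9,9},{10,9},{9,10}};
        System.out.println(agglomerate(pts, 2).size()); // 2
    }
}
```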

    To enhance performance, scalability, as well as quality of clustering, CURE adds a few more pre-clustering and post-clustering steps.

    4.2 CURE Overview


    While drawing the random sample, due importance was given to the fact that all clusters were represented and none of them were missed, by estimating a minimum probability.

    4.3 CURE Data Structures Used

    We used two data structures, namely the KD-Tree and the Min Heap. Following is a brief description of both of them.

    4.3.1 KD-Tree

    A KD-Tree (short for k-dimensional tree) is a space-partitioning data structure for organizing points in a k-dimensional space. KD-Trees are a useful data structure for several applications, such as searches involving a multidimensional search key (e.g. range searches and nearest-neighbour searches). KD-Trees are a special case of BSP trees.

    A KD-Tree uses only splitting planes that are perpendicular to one of the coordinate system axes. This differs from BSP trees, in which arbitrary splitting planes can be used. In addition, in the typical definition, every node of a KD-Tree, from the root to the leaves, stores a point. This differs from BSP trees, in which leaves are typically the only nodes that contain points (or other geometric primitives). As a consequence, each splitting plane must go through one of the points in the KD-Tree. KD-Tries are a variant that store data only in leaf nodes. It is worth noting that in an alternative definition of KD-Tree the points are stored in its leaf nodes only, although each splitting plane still goes through one of the points.

    In CURE, the KD-Tree is initialized during the initial phase of clustering to hold all the points. Later on in the algorithm, we use this tree for nearest-neighbor search and for finding the closest clusters based on the representative points of a cluster. When a new cluster is formed, new representative points are added to the KD-Tree. The representative points of older clusters are deleted from the tree.

    The KD-Tree improves the search for points in k-dimensional space from O(n) to O(log n), as it uses binary partitioning across coordinate axes.
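    A toy KD-Tree can illustrate the axis-aligned splitting and the pruned nearest-neighbour search described above. The class and method names are ours, not the report's; axes simply cycle with depth, and a subtree is skipped when the splitting plane is farther away than the best point found so far.

```java
import java.util.*;

// Illustrative KD-Tree: each node stores one point and a splitting plane
// perpendicular to a coordinate axis (the axis cycles with depth).
public class KdTreeSketch {
    final double[] point;
    final int axis;
    KdTreeSketch left, right;

    KdTreeSketch(double[] p, int axis) { this.point = p; this.axis = axis; }

    static KdTreeSketch build(List<double[]> pts, int depth) {
        if (pts.isEmpty()) return null;
        int axis = depth % pts.get(0).length;
        pts.sort(Comparator.comparingDouble(p -> p[axis]));
        int mid = pts.size() / 2;                       // median point becomes the node
        KdTreeSketch node = new KdTreeSketch(pts.get(mid), axis);
        node.left = build(new ArrayList<>(pts.subList(0, mid)), depth + 1);
        node.right = build(new ArrayList<>(pts.subList(mid + 1, pts.size())), depth + 1);
        return node;
    }

    static double[] nearest(KdTreeSketch node, double[] q, double[] best) {
        if (node == null) return best;
        if (best == null || dist(q, node.point) < dist(q, best)) best = node.point;
        double diff = q[node.axis] - node.point[node.axis];
        KdTreeSketch near = diff < 0 ? node.left : node.right;
        KdTreeSketch far = diff < 0 ? node.right : node.left;
        best = nearest(near, q, best);
        if (Math.abs(diff) < dist(q, best))             // far side may still hold a closer point
            best = nearest(far, q, best);
        return best;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        List<double[]> pts = new ArrayList<>(List.of(
            new double[]{2,3}, new double[]{5,4}, new double[]{9,6},
            new double[]{4,7}, new double[]{8,1}, new double[]{7,2}));
        KdTreeSketch root = build(pts, 0);
        System.out.println(Arrays.toString(nearest(root, new double[]{9,2}, null))); // [8.0, 1.0]
    }
}
```

    The pruning test is what delivers the O(log n) average search the text mentions: whole subtrees are discarded without computing any point distances inside them.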

    4.3.2 Min Heap

    A Min Heap is a simple heap data structure created using a binary tree. It can be seen as a binary tree with two additional constraints:

    1. The shape property: all levels of the tree, except possibly the last one (the deepest), are fully filled, and, if the last level of the tree is not complete, the nodes of that level are filled from left to right.


    2. The heap property: each node is less than or equal to each of its children.

    The Min Heap stores the minimum element at the root of the heap. In CURE, we always merge two clusters at every step. The cluster to be merged is therefore necessarily the one at the closest distance from another nearby cluster, as the heap is built using inter-cluster distance comparisons. Hence we can always get this cluster in O(1) time.

    We used java.util.PriorityQueue, which supports all the Min Heap operations.
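    A small sketch of that usage, with made-up cluster ids and distances, might look like the following; peek() returns the cheapest pending merge in O(1), while add() and poll() cost O(log n).

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// java.util.PriorityQueue as the min-heap described in the text: each entry
// is a candidate merge carrying an inter-cluster distance, and the root of
// the heap is always the pair with the smallest distance.
public class MergeHeapSketch {
    record Merge(int clusterA, int clusterB, double distance) {}

    public static void main(String[] args) {
        PriorityQueue<Merge> heap =
            new PriorityQueue<>(Comparator.comparingDouble(Merge::distance));
        heap.add(new Merge(0, 3, 2.5));
        heap.add(new Merge(1, 2, 0.8));   // closest pair
        heap.add(new Merge(4, 5, 1.7));

        Merge next = heap.peek();          // O(1): root of the min-heap
        System.out.println(next.clusterA() + "-" + next.clusterB()); // prints "1-2"
    }
}
```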

    4.4 Benefits of CURE against Other Algorithms

    K-Means (and centroid-based algorithms): Unsuitable for non-spherical and size-differing clusters.

    CLARANS: Needs multiple data scans (R*-Trees were proposed later on). CURE uses KD-Trees inherently to store the dataset and uses them across passes.

    BIRCH: Suffers from identifying only convex or spherical clusters of uniform size.

    DBSCAN: No parallelism; high sensitivity; sampling of data may affect density measures.

    4.5 Observations towards Sensitivity to Parameters

    We observed that the random sample size was an important criterion while pre-clustering the data set. Hence we used the Chernoff bounds, as given in [3], to calculate the minimum size of the sample to be selected. Random sampling often missed some of the smaller clusters. The next important parameter was the shrink factor of the representative points (a). If we increased a to 1, the algorithm would degenerate to MST-based algorithms. If the parameter a is reduced to 0.1, CURE starts behaving like a centroid-based algorithm. Thus, for a range of 0.3 to 0.7, CURE identified the right clusters.
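    A one-line shrink step consistent with this description (our naming and formula, as the report does not spell it out here) is shrunk = mean + a * (p - mean): with a = 1 a representative point stays where it is (the MST-like extreme), and as a approaches 0 it collapses onto the centroid (the centroid-based extreme).

```java
import java.util.Arrays;

// Sketch of shrinking a representative point toward the cluster mean,
// keeping a fraction `alpha` of its offset from the mean.
public class ShrinkSketch {
    public static double[] shrink(double[] p, double[] mean, double alpha) {
        double[] s = new double[p.length];
        for (int d = 0; d < p.length; d++)
            s[d] = mean[d] + alpha * (p[d] - mean[d]);  // alpha=1: unchanged; alpha->0: centroid
        return s;
    }

    public static void main(String[] args) {
        double[] mean = {0, 0}, p = {10, 4};
        System.out.println(Arrays.toString(shrink(p, mean, 0.5))); // [5.0, 2.0]
    }
}
```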


    The number of representative points present in a cluster is an important parameter. If the cluster is too sparse, it may need more representative points than a compact smaller cluster. We observed that if the number of representative points is increased to 8 or 10, sparse clusters with variable size and density were identified properly. But with an increase in representative points, the computation time for clustering increased, as for every new cluster formed, new representative points have to be calculated and shrunk.

    One of the most important observations of our experiments was with respect to partitioning of the data sets, as CURE supports concurrent execution of the first pass of the algorithm. As the number of partitions was increased from 2 to 6 or 10, the clustering time dropped significantly. Though the number of clusters to be merged increased in the second step, the advantage of concurrent execution was far greater. But we noticed that if we increased the number of partitions to higher numbers such as 50, the clustering would not give proper results, as some of the partitions would not have any data to cluster. Hence, though the time consumed would be lower, the quality of the clusters gets affected, and CURE could not identify all the clusters correctly; some of them got merged to form bigger clusters. A partitioning of 10 to 20 would therefore result in an efficient speed-up of the algorithm while maintaining the quality of the clusters.
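    The concurrent first pass described here can be sketched with a fixed thread pool. This is purely our structural illustration: the per-partition clustering below is a stand-in placeholder, so only the partition-and-join shape is meaningful.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of a partitioned first pass: split the data set into p partitions,
// pre-cluster each partition concurrently, then hand the partial results
// to a sequential second (merge) pass.
public class PartitionSketch {
    public static List<List<double[]>> firstPass(double[][] pts, int p) {
        ExecutorService pool = Executors.newFixedThreadPool(p);
        try {
            List<Future<List<double[]>>> futures = new ArrayList<>();
            int chunk = (pts.length + p - 1) / p;    // ceil(n / p) points per partition
            for (int i = 0; i < pts.length; i += chunk) {
                double[][] part = Arrays.copyOfRange(pts, i, Math.min(i + chunk, pts.length));
                futures.add(pool.submit(() -> preCluster(part)));  // partitions run concurrently
            }
            List<List<double[]>> partials = new ArrayList<>();
            for (Future<List<double[]>> f : futures) partials.add(f.get());
            return partials;                         // input to the sequential merge pass
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    // Placeholder for the real per-partition pre-clustering.
    static List<double[]> preCluster(double[][] part) {
        return new ArrayList<>(Arrays.asList(part));
    }

    public static void main(String[] args) {
        System.out.println(firstPass(new double[100][2], 4).size()); // 4 partial results
    }
}
```

    With too many partitions, some chunks end up empty or too small to contain a whole cluster, which mirrors the quality degradation observed above at 50 partitions.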

    Partitioning Results

    No. of Points      1572    3568    7502    10256
    Time (sec):
    Partition P = 2    6.4     7.8     29.4    75.7
    Partition P = 3    6.5     7.6     21.6    43.6
    Partition P = 5    6.1     7.3     12.2    21.2

    Figure 18: Partitioning results

    If a chart is plotted for the same, we can see that as the partitioning is increased, the time taken to cluster increases very slowly, even though the data set size has increased by four times.

    III. CONCLUSION

    From the clusters obtained through the various algorithms and the time taken by each algorithm on the datasets, we can say that K-Means is not the best of the clustering methods, with its high space complexity. For high-dimensional data, K-Means takes a lot of time and memory. Also, it cannot always converge.

    Our experiments suggest that DBSCAN fared well for low-dimensional data. Also, if the density of the clusters did not vary too much, DBSCAN fairly identified all the clusters. But when the size of the data increased and the shapes and density of the clusters varied too much, DBSCAN ended up combining or splitting those clusters.

    CURE could identify all the clusters properly. But CURE depends on some user parameters which have to be data-specific. The range of such parameters does not vary too much, many of them lying between 0 and 1. CURE could identify several clusters with high purity which K-Means and DBSCAN failed to identify.

    With respect to agglomerative clustering, clusters with high purity could be obtained, but the computation time for clustering was high. Application of Kruskal's algorithm and union-by-rank helped to improve the efficiency, but the computation time still increased significantly as the size of the data set increased.

    IV. REFERENCES

    1. An Efficient k-Means Clustering Algorithm: Analysis and Implementation. Tapas Kanungo, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, Angela Y. Wu.

    2. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu. KDD '96.

    3. CURE: An Efficient Clustering Algorithm for Large Databases. S. Guha, R. Rastogi and K. Shim. Proc. 1998 ACM SIGMOD International Conference on Management of Data, pp. 73-84, June 1998, Seattle, Washington, USA.

    4. An Efficient K-Means Clustering Algorithm. K. Alsabti, S. Ranka, V. Singh. 1998.