Survey of Clustering Algorithms
DESCRIPTION: A comparison of clustering algorithms (transcript).
PREPARED BY-
DEBABRAT DAS- R010113014
ANIKET ROY- R010113007
Survey of Clustering Algorithms
GUIDED BY-
JYOTISMITA TALUKDAR
ASST. PROFESSOR
CIT, UTM
CONTENTS
PROBLEM STATEMENT
WHAT IS CLUSTERING?
ALGORITHMS FOR CLUSTERING
HIERARCHICAL CLUSTERING
PARTITIONING-BASED CLUSTERING: k-MEANS ALGORITHM
BRIEF INTRODUCTION TO WEKA
CONCLUSION
PROBLEM STATEMENT AND AIM
• GIVEN A SET OF RECORDS (INSTANCES, EXAMPLES, OBJECTS, OBSERVATIONS, …), ORGANIZE THEM INTO CLUSTERS (GROUPS, CLASSES)
• CLUSTERING: THE PROCESS OF GROUPING PHYSICAL OR ABSTRACT OBJECTS INTO CLASSES OF SIMILAR OBJECTS
• AIM:
• HERE WE USE THE WEKA TOOL TO COMPARE DIFFERENT CLUSTERING ALGORITHMS ON THE IRIS DATA SET AND SHOW THE DIFFERENCES BETWEEN THEM.
WHAT IS CLUSTERING?
• CLUSTERING IS THE MOST COMMON FORM OF UNSUPERVISED LEARNING.
• CLUSTERING IS THE PROCESS OF GROUPING A SET OF PHYSICAL OR ABSTRACT OBJECTS INTO CLASSES OF SIMILAR OBJECTS.
APPLICATION
• Clustering helps marketers discover distinct groups in their customer base, and they can characterize those groups by their purchasing patterns.
• Thematic maps in GIS by clustering feature spaces
• WWW:
  • Document classification
  • Clustering weblog data to discover groups of similar access patterns
• Clustering is also used in outlier detection applications such as detection of credit card fraud.
• As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data and to observe the characteristics of each cluster.
ALGORITHMS TO BE ANALYSED
• HIERARCHICAL CLUSTERING ALGORITHM
• PARTITIONING BASED CLUSTERING ALGORITHM
Hierarchical Clustering
• Initially, each point in the data set is assigned to its own cluster.
• Then we repeatedly combine the two nearest clusters into a single cluster.
• The distance function is calculated on the basis of the characteristics on which we want to cluster.
• Hierarchical:
• Agglomerative (bottom up):
• Initially, each point is a cluster
• Repeatedly combine the two “nearest” clusters into one
• Divisive (top down):
• start with one cluster and recursively split it
Three important questions:
1. How do you represent a cluster of more than one point?
2. How do you determine the “nearness” of clusters?
3. When do you stop combining clusters?

Answers:
1. By its centroid: the centroid (mean) in the i-th dimension = SUMi / N, where SUMi is the i-th component of the vector sum of the cluster’s points and N is the number of points.
2. By the nearest inter-cluster distance (Euclidean distance).
3. When min(inter-cluster distance d[i]) for a point i exceeds a threshold.
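The first two answers can be sketched in code (a minimal illustration; `centroid` and `euclidean` are names chosen here, not from the slides):

```python
import math

def centroid(points):
    """Centroid (mean) in each dimension: SUM_i / N."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))

def euclidean(p, q):
    """Euclidean distance, used as the inter-cluster 'nearness' measure."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

For example, `centroid([(0, 0), (2, 1), (1, 2)])` gives `(1.0, 1.0)` and `euclidean((0, 0), (3, 4))` gives `5.0`.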
Algorithm

1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
2. Find the least dissimilar pair of clusters (r), (s) in the current clustering, according to
   d[(r),(s)] = min d[(i),(j)],
   where the minimum is over all pairs of clusters (i), (j) in the current clustering.
3. Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form clustering m. Set the level of this clustering to
   L(m) = d[(r),(s)].
4. Update the proximity matrix, D, by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column corresponding to the newly formed cluster.
5. The proximity between the new cluster, denoted (r,s), and an old cluster (k) is defined as
   d[(k),(r,s)] = min( d[(k),(r)], d[(k),(s)] ).
6. If all objects are in one cluster, stop. Otherwise, go to step 2.
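The six steps above can be sketched in Python. This is a minimal illustration, not a reference implementation; note that the min rule in step 5 makes this the single-linkage variant, and `single_linkage` is a name chosen here:

```python
import math

def single_linkage(points, dist):
    """Agglomerative clustering following steps 1-6 above.

    Keeps a proximity table D over frozensets of point indices and
    merges the least dissimilar pair until one cluster remains.
    Returns the merge levels L(m) = d[(r),(s)] in order.
    """
    clusters = [frozenset([i]) for i in range(len(points))]
    D = {}
    for i, c1 in enumerate(clusters):
        for c2 in clusters[i + 1:]:
            D[(c1, c2)] = dist(points[next(iter(c1))], points[next(iter(c2))])

    def d(a, b):  # symmetric lookup into the proximity table
        return D.get((a, b), D.get((b, a)))

    levels = []
    while len(clusters) > 1:
        # Step 2: find the least dissimilar pair (r), (s)
        r, s = min(
            ((a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]),
            key=lambda pair: d(*pair),
        )
        levels.append(d(r, s))          # Step 3: L(m) = d[(r),(s)]
        merged = r | s
        clusters = [c for c in clusters if c not in (r, s)]
        # Steps 4-5: proximity to the new cluster is the minimum rule
        for k in clusters:
            D[(k, merged)] = min(d(k, r), d(k, s))
        clusters.append(merged)
    return levels
```

For example, `single_linkage([(0, 0), (0, 1), (5, 0)], math.dist)` merges the two nearby points at level 1.0, then the remaining pair at level 5.0.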
EXAMPLE: HIERARCHICAL CLUSTERING
[Figure: six data points o (0,0), (1,2), (2,1), (4,1), (5,0), (5,3) merged step by step, with intermediate centroids x (1,1), (1.5,1.5), (4.5,0.5), (4.7,1.3); the merge sequence is shown as a dendrogram.]
Implementation

// Assign points to an array
loop: i = 0 to n
    a[i] = Point(i)

// Build the distance matrix
loop: i = 1 to n
    loop: j = i to n
        d[i,j] = distance(i, j)
        d[j,i] = d[i,j]
    n[i] = min( row[i] )    // nearest-neighbour distance of point i

// Cluster the points
loop: i = 1 to n
    loop: j = 1 to n
        if (i = j) continue
        if (n[i] < n[j] - threshold)
            cluster i and j
            calculate centroid
            update d[i,j]
        else
            outlier[i] = a[i]
Complexities

Setting the points into the array: O(n)
Calculating d[i,j]: O(n log n)
Clustering: O(n² log n)
PARTITIONING BASED
• Suppose we are given a database of n objects and the partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it classifies the data into k groups, which satisfy the following requirements −
• Each group contains at least one object.
• Each object must belong to exactly one group.
• FOR A GIVEN NUMBER OF PARTITIONS (SAY K), THE PARTITIONING METHOD WILL CREATE AN INITIAL PARTITIONING.
• THEN IT USES THE ITERATIVE RELOCATION TECHNIQUE TO IMPROVE THE PARTITIONING BY MOVING OBJECTS FROM ONE GROUP TO ANOTHER.
• GLOBAL OPTIMAL: EXHAUSTIVELY ENUMERATE ALL PARTITIONS.
• HEURISTIC METHODS: K-MEANS.
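To see why exhaustive enumeration is impractical: the number of ways to partition n objects into k non-empty groups is the Stirling number of the second kind, S(n, k), which grows explosively with n. A quick sketch using the standard recurrence (`stirling2` is a name chosen here):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to partition n objects into k non-empty groups."""
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    # The n-th object either joins one of the k existing groups,
    # or starts a new group on its own.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)
```

Already `stirling2(10, 3)` is 9330; for a data set the size of Iris (n = 150) with k = 3, S(n, k) has more than 70 digits, which is why heuristics such as k-means are used instead.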
K-MEANS CLUSTERING
K–MEANS ALGORITHM(S)
• ASSUMES EUCLIDEAN SPACE/DISTANCE
• START BY PICKING K, THE NUMBER OF CLUSTERS
• INITIALIZE CLUSTERS BY PICKING ONE POINT PER CLUSTER
• FOR THE MOMENT, ASSUME WE PICK THE K POINTS AT RANDOM
POPULATING CLUSTERS
• 1) FOR EACH POINT, PLACE IT IN THE CLUSTER WHOSE CURRENT CENTROID IT IS NEAREST
• 2) AFTER ALL POINTS ARE ASSIGNED, UPDATE THE LOCATIONS OF CENTROIDS OF THE K CLUSTERS
• 3) REASSIGN ALL POINTS TO THEIR CLOSEST CENTROID
• SOMETIMES MOVES POINTS BETWEEN CLUSTERS
• REPEAT 2 AND 3 UNTIL CONVERGENCE
• CONVERGENCE: POINTS DON’T MOVE BETWEEN CLUSTERS AND CENTROIDS STABILIZE
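The steps above can be sketched as a short Python function. This is a minimal illustration with random initialization (`kmeans` is a name chosen here, not WEKA's implementation):

```python
import math
import random

def kmeans(points, k, seed=0):
    """Lloyd-style k-means: assign each point to its nearest centroid,
    update centroids, repeat until assignments stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)       # pick k points at random
    assignment = None
    while True:
        # Steps 1/3: place each point in the cluster of its nearest centroid
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:    # convergence: no point moved
            return centroids, assignment
        assignment = new_assignment
        # Step 2: update each centroid to the mean of its assigned points
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(
                    sum(c) / len(members) for c in zip(*members)
                )
```

For instance, `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2)` converges to centroids (0, 0.5) and (10, 10.5) in some order.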
EXAMPLE: ASSIGNING CLUSTERS
[Figure: data points (x) assigned to their nearest centroids — clusters after round 1.]
EXAMPLE: ASSIGNING CLUSTERS
[Figure: points reassigned after the centroids moved — clusters after round 2.]
EXAMPLE: ASSIGNING CLUSTERS
[Figure: stable assignment, centroids no longer move — clusters at the end.]
GETTING THE K RIGHT

HOW TO SELECT K?

[Figure: average distance to centroid plotted against k; the curve drops steeply and then flattens, and the best value of k lies at the bend.]
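The quantity on the y-axis can be computed directly: run k-means for a range of k, record the average distance of each point to its assigned centroid, and pick k at the bend. A sketch (the inner loop is the k-means procedure from the previous slides; `kmeans_avg_dist` and the synthetic `pts` are illustrative, not from the slides):

```python
import math
import random

def kmeans_avg_dist(points, k, seed=0):
    """Run k-means and return the average distance from each point
    to its assigned centroid (the elbow-plot quantity)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = None
    while True:
        new = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
               for p in points]
        if new == assignment:
            break
        assignment = new
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return sum(math.dist(p, centroids[a])
               for p, a in zip(points, assignment)) / len(points)

# Three well-separated blobs of 4 points each: the curve should flatten
# sharply around k = 3.
pts = [(x + dx, y + dy)
       for x, y in [(0, 0), (10, 0), (5, 8)]
       for dx, dy in [(0, 0), (1, 0), (0, 1), (1, 1)]]
curve = {k: kmeans_avg_dist(pts, k) for k in range(1, 6)}
```

Printing `curve` shows a steep drop from k = 1 and little improvement past k = 3, which is the bend the slide's plot depicts.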
EXAMPLE: PICKING K
[Figure: one centroid for several natural groups — too few; many long distances to centroid.]
[Figure: one centroid per natural group — just right; distances rather short.]
[Figure: several centroids per natural group — too many; little improvement in average distance.]
ADVANTAGES OF K-MEANS
• WITH A LARGE NUMBER OF VARIABLES, K-MEANS MAY BE COMPUTATIONALLY FASTER THAN HIERARCHICAL CLUSTERING (IF K IS SMALL).
• K-MEANS MAY PRODUCE TIGHTER CLUSTERS THAN HIERARCHICAL CLUSTERING, ESPECIALLY IF THE CLUSTERS ARE GLOBULAR.
• COMPLEXITY: EACH ROUND IS O(KN) FOR N POINTS AND K CLUSTERS.
DISADVANTAGES
• DIFFICULT TO COMPARE THE QUALITY OF THE CLUSTERS PRODUCED (DIFFERENT INITIAL PARTITIONS OR VALUES OF K AFFECT THE OUTCOME).
• THE FIXED NUMBER OF CLUSTERS MAKES IT DIFFICULT TO PREDICT WHAT K SHOULD BE.
WEKA
• WAIKATO ENVIRONMENT FOR KNOWLEDGE ANALYSIS (WEKA) IS A POPULAR SUITE OF MACHINE LEARNING SOFTWARE WRITTEN IN JAVA, DEVELOPED AT THE UNIVERSITY OF WAIKATO, NEW ZEALAND
• IRIS DATA
HIERARCHICAL ALGORITHM
K-MEANS
FURTHER WORK
• WE WOULD LIKE TO ADDRESS THE COMPLEXITY PROBLEM OF THE HIERARCHICAL ALGORITHM
• K-MEANS SHOULD BE IMPROVED SO THAT THE SELECTION OF THE NUMBER OF CLUSTERS IS FURTHER AUTOMATED
CONCLUSION
• EVERY ALGORITHM HAS ITS OWN STRENGTHS, AND WE CHOOSE AMONG THEM BASED ON THE BEHAVIOUR OF THE DATA
• ON THE BASIS OF THIS STUDY WE FOUND THAT THE K-MEANS CLUSTERING ALGORITHM IS THE SIMPLEST COMPARED TO THE OTHER ALGORITHMS.
• DEEP KNOWLEDGE OF THE ALGORITHMS IS NOT REQUIRED TO WORK IN WEKA, WHICH MAKES WEKA A SUITABLE TOOL FOR DATA MINING APPLICATIONS.
REFERENCES

• Äyrämö, S. and Kärkkäinen, T., “Introduction to Partitioning-Based Clustering Methods with a Robust Example”, Reports of the Dept. of Math. Inf. Tech. (Series C: Software and Computational Engineering), 1/2006, University of Jyväskylä, 2006.
• Berkhin, P., “Survey of Clustering Data Mining Techniques”, 1998. Retrieved November 6th, 2015.
• Fowlkes, E. B. and Mallows, C. L., “A Method for Comparing Two Hierarchical Clusterings”, Journal of the American Statistical Association, 78:553–584, 1983.
• Bezdek, J. and Hathaway, R., “Numerical Convergence and Interpretation of the Fuzzy c-Shells Clustering Algorithms”, IEEE Trans. Neural Netw., vol. 3, no. 5, pp. 787–793, Sep. 1992.
• Meilă, M. and Heckerman, D., “An Experimental Comparison of Several Clustering and Initialization Methods”, Technical Report MSR-TR-98-06, Microsoft Research, Redmond, WA, February 1998.
• Mythili, S. et al., International Journal of Computer Science and Mobile Computing, vol. 3, issue 1, January 2014, pp. 334–340.
• Sharma, N., Bajpai, A. and Litoriya, R., “Comparison the Various Clustering Algorithms of Weka Tools”, International Journal of Emerging Technology and Advanced Engineering, vol. 2, issue 5, May 2012.
• Davé, R., “Adaptive Fuzzy c-Shells Clustering and Detection of Ellipses”, IEEE Trans. Neural Netw., vol. 3, no. 5, pp. 643–662, Sep. 1992.
• Velmurugan, T. and Santhanam, T., “A Survey of Partition Based Clustering Algorithms in Data Mining: An Experimental Approach”, Information Technology Journal, 10:478–484, 2011.
• WEKA at http://www.cs.waikato.ac.nz/~ml/weka.