prob form
7/24/2019 prob form
PROBLEM FORMULATION
K-Means Clustering
Among clustering algorithms, k-means is among the most widely and extensively used. The k-means algorithm takes two input parameters: the dataset of n
objects, and k, the number of clusters to be created. The algorithm partitions the dataset of n
objects into k clusters. Cluster similarity is measured by the Euclidean distance between
objects. In this way k-means finds spherical, or ball-shaped, clusters. The mean value of the objects
in a cluster can be viewed as the cluster's centre of gravity.
The algorithm works in two phases. In the first phase, k initial centroids are selected randomly,
one for each cluster. In the second phase, each object of the given input dataset is associated with
the cluster that has the nearest centroid. Euclidean distance is the usual measure, but other measures,
such as Manhattan distance, can also be used. When all the objects from the input dataset have been
assigned to clusters, the first iteration is complete and an early grouping is done. At this point the
algorithm starts a new iteration and recalculates the centroids, as the inclusion of new data may lead
to a change in the cluster centroids. The k centroids may change their position step by step.
Eventually, a situation is reached where the centroids do not move anymore or the data
objects do not change their cluster. This is the convergence criterion for clustering.
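The assignment phase described above can be sketched in a few lines of Python. This is only an illustration, not the author's implementation: the function names `euclidean`, `manhattan`, and `assign_to_nearest`, as well as the sample points and centroids, are assumptions introduced here.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def assign_to_nearest(points, centroids, distance=euclidean):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        dists = [distance(p, c) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels

# Hypothetical sample data: two obvious groups and two centroids.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(1.0, 1.5), (8.5, 9.0)]
labels = assign_to_nearest(points, centroids)
```

Passing `distance=manhattan` swaps the measure without touching the assignment logic, which is the sense in which other measures can also be used.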
Four steps can be identified in the existing k-means algorithm:
Step 1: Initialization of data objects. This step defines a centroid for each group, that is, how
many groups the data points are required to be partitioned into.
Step 2: Data object classification. Each data object is placed in the group whose centroid is
closest, determined by calculating the distance between the data object and each of the
centroids.
Step 3: Calculate the centroid as representative of the cluster. Each group is formed by placing the
data objects in clusters according to the previous step. After all objects have been assigned, the
cluster centroids are recalculated to improve the clustering.
Step 4: Improved clustering through a convergence condition. Several convergence conditions can be
used to stop the iterative process: there is no exchange of centroids or selected data
objects among groups; a given number of iterations is reached, which is the condition most often used to
fulfill user requirements; or the difference in the squared error function between two
consecutive iterations reaches a given threshold. Repeat steps two, three
and four of the algorithm for as long as the convergence condition is not
satisfied.
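The three stopping conditions of Step 4 can be combined into one check, sketched below in Python. The names `sse` and `should_stop` and the default values `max_iter=100` and `tol=1e-4` are assumptions chosen for illustration; the text does not prescribe particular values.

```python
def sse(points, labels, centroids):
    """Squared error: sum of squared distances from each point to its centroid."""
    total = 0.0
    for p, label in zip(points, labels):
        c = centroids[label]
        total += sum((a - b) ** 2 for a, b in zip(p, c))
    return total

def should_stop(old_labels, new_labels, old_sse, new_sse,
                iteration, max_iter=100, tol=1e-4):
    """Return True when any of the three stopping conditions holds."""
    if old_labels == new_labels:        # no object changed cluster
        return True
    if iteration >= max_iter:           # iteration budget exhausted
        return True
    if abs(old_sse - new_sse) < tol:    # squared error barely changed
        return True
    return False
```

A caller would evaluate `should_stop` once per iteration, after the centroids have been recalculated.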
Problem Formulation:
The original k-means algorithm selects centroids randomly. The refined method determines initial centroids
systematically so as to produce clusters with better accuracy, and then it may make use of a variant of
the clustering method for verification. It starts by forming the initial clusters based on the
calculated threshold value and then formulates clusters by calculating the relative distance of each
data point from the initial centroids. These clusters are subsequently modified according to the
data points, thereby improving efficiency as well as accuracy.
Algorithm:
Pseudocode for the k-means clustering algorithm is listed as:
Input: A dataset D of n objects
D = {d1, d2, ..., dn}
k = The number of desired clusters
Output:
A set of k clusters containing data from dataset D.
Method:
Steps:
1. Randomly select k objects from the dataset D as initial centroids;
2. Repeat
a. Assign each object di from dataset D to the cluster to which the object is the most similar,
i.e., has the closest centroid;
b. Calculate the new mean for each cluster;
c. Until the convergence criterion is met (there is no change in the cluster centres).
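The pseudocode above maps fairly directly onto Python. The sketch below is one minimal rendering under stated assumptions: the function name `kmeans`, the `max_iter` safety cap, the `seed` parameter, and the choice to keep the old centroid when a cluster becomes empty are not part of the original listing. Squared Euclidean distance is used for the comparison because it selects the same nearest centroid as the plain distance.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance (same nearest centroid as the plain distance)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(data, k, max_iter=100, seed=None):
    """Basic k-means: random initial centroids, then assign/update until stable."""
    rng = random.Random(seed)
    # Step 1: randomly select k objects from the dataset as initial centroids.
    centroids = [tuple(p) for p in rng.sample(data, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2a: assign each object to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in data
        ]
        # Step 2c: stop when no object changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 2b: calculate the new mean for each cluster.
        for j in range(k):
            members = [p for p, l in zip(data, labels) if l == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Hypothetical sample data: two well-separated groups of three points.
data = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
labels, centroids = kmeans(data, 2, seed=0)
```

Which cluster receives index 0 depends on the random draw, so only the grouping, not the numbering, should be relied upon.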
Idea of Getting Initial Centroids:
Input: No. of clusters k, no. of objects n
Output: k clusters that specify the least error.
Process:
There are n objects in population set A, and A is to be partitioned into k clusters.
The Euclidean distance formula is used to calculate the distance between data objects,
e.g. the distance between one vector X = (x1, y1) and the other vector Y = (x2, y2) is described as
follows:
d(X, Y) = sqrt((x2 - x1)^2 + (y2 - y1)^2)
Set cNo = 1, where cNo is a variable that holds the current cluster number.
Compute the distance between each data object and every other data object in A until a distance
within d Mean x is reached.
When the distance between two points is within d Mean x, delete them from set A and add them to set B;
(1 ≤ cNo ≤ k).
Find the point in A that is closest to the selected data-point set BcNo, add it to BcNo, and
delete it from A; (BcNo represents the cluster set with cluster number cNo). Repeat this step n/k times.
If cNo < k, then set cNo = cNo + 1, find another pair of data points in A between which the distance is the
shortest, form another data-point set BcNo from them, delete them from A, and then go to step
(1);
If cNo = k, then take the mean of each cluster and move each centroid to the mean of its assigned
data points.
Reassign the data points to clusters according to their closest new centroids.
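One possible reading of the process above is sketched below in Python. It is an interpretation, not the author's code, and it rests on several assumptions: the threshold test on d Mean x is left out and each set is simply seeded with the closest remaining pair; the distance from a point to a set is taken as the distance to its nearest member; "n/k" is read as the target size of each set; and the name `initial_centroids` is hypothetical.

```python
import math

def initial_centroids(points, k):
    """Grow k point sets from closest pairs; return each set's mean as a centroid."""
    n = len(points)
    if k < 1 or n < 2 * k:
        raise ValueError("need at least 2 * k points")
    remaining = list(points)   # set A
    target = n // k            # assumed size of each set B_cNo
    groups = []
    for _ in range(k):         # cNo = 1 .. k
        # Seed the set with the closest pair still in A.
        pairs = ((i, j) for i in range(len(remaining))
                 for j in range(i + 1, len(remaining)))
        i, j = min(pairs,
                   key=lambda ij: math.dist(remaining[ij[0]], remaining[ij[1]]))
        group = [remaining[i], remaining[j]]
        del remaining[j]       # j > i, so delete j first to keep i valid
        del remaining[i]
        # Move the point of A closest to the set into the set until it is full.
        while len(group) < target and remaining:
            nearest = min(remaining,
                          key=lambda p: min(math.dist(p, g) for g in group))
            group.append(nearest)
            remaining.remove(nearest)
        groups.append(group)
    # The mean of each set becomes an initial centroid.
    return [tuple(sum(axis) / len(g) for axis in zip(*g)) for g in groups]

# Hypothetical sample data: two well-separated groups of three points.
sample = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
seeds = initial_centroids(sample, 2)
```

The returned centroids would then be handed to the usual assign-and-update loop in place of randomly chosen objects. `math.dist` requires Python 3.8 or later.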
The basic aim of the previous algorithm is to divide the data objects into several partitions such that
the distances between objects within the same class are much smaller than the distances between
objects in different classes. The k-means algorithm easily finds a local minimum but not necessarily the global
minimum. Selecting different initial centroids leads to different clusterings, so the new
proposed solution finds consistent initial centroids, based on certain criteria and the distribution
of the data, that will hopefully provide better clustering in an effective manner.
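The sensitivity to initial centroids can be seen on a very small example. The Python sketch below runs the same assign-and-update loop from two different starting pairs on four hypothetical points placed at the corners of a wide rectangle; the function name `lloyd` and the data are assumptions made for illustration.

```python
def lloyd(points, centroids, max_iter=100):
    """Run k-means from the given starting centroids; return (labels, centroids, sse)."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assign each point to the centroid with the smallest squared distance.
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:   # no point changed cluster
            break
        labels = new_labels
        # Move each centroid to the mean of its assigned points.
        for j in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(sum((a - b) ** 2 for a, b in zip(p, centroids[l]))
              for p, l in zip(points, labels))
    return labels, centroids, sse

# Hypothetical data: corners of a rectangle that is 4 wide and 1 high.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
_, _, sse_bad = lloyd(points, [(0, 0), (0, 1)])    # both starts on the left edge
_, _, sse_good = lloyd(points, [(0, 0), (4, 0)])   # one start on each side
print(sse_bad, sse_good)   # prints 16.0 1.0
```

Starting from the two left-hand corners, the loop settles on a top/bottom split with squared error 16.0 and never leaves it, while starting from one corner on each side gives the left/right split with squared error 1.0. Both runs have converged, but only the second reaches the global minimum.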