prob form

7/24/2019 prob form


PROBLEM FORMULATION

K-Means Clustering

Among clustering algorithms, k-means is the most widely used. The k-means algorithm takes two input parameters: a dataset of n objects and k, the number of clusters to be created. The algorithm partitions the dataset of n objects into k clusters. Cluster similarity is measured by the Euclidean distance between objects, so k-means tends to find spherical (ball-shaped) clusters. The mean value of the objects in a cluster can be viewed as the cluster's centre of gravity.

The algorithm works in two phases. In the first phase, k initial centroids are selected randomly, one for each cluster. In the second phase, each object of the input dataset is associated with the cluster that has the nearest centroid. Nearness is usually measured by Euclidean distance; however, other measures such as Manhattan distance can also be used. When all the objects of the input dataset have been assigned to clusters, the first iteration is complete and an early grouping is done. At this point the algorithm starts a new iteration and recalculates the centroids, since the inclusion of new data objects may change the cluster centroids. The k centroids may therefore change position step by step.

Eventually, a situation is reached where the centroids no longer move or the data objects no longer change cluster. This is the convergence criterion for clustering.

Four steps can be identified in the existing k-means algorithm:

Step 1: Initialization of data objects. This step defines a centroid for each group, that is, it fixes how many sets the data points are required to be partitioned into.

Step 2: Data objects classification. Each data object is placed in the relevant group by calculating the distance between the object and each of the centroids and choosing the closest centroid.

Step 3: Calculate centroid as representative of cluster. Each group is formed by placing the data objects in clusters according to the previous step. After all objects have been assigned, the cluster centroids are recalculated to improve the clustering.

Step 4: Improved clustering through convergence condition. Several convergence conditions can be used to stop the iterative process: there is no further exchange of centroids or data objects among groups; a given number of iterations is reached, which is the condition most often used to stop clustering in line with user requirements; or the difference in the squared error function between two consecutive iterations falls below a given threshold. Repeat steps two, three and four of the algorithm as long as the convergence condition is not satisfied.
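The convergence conditions of Step 4 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names squared_error and has_converged, the tolerance tol and the iteration cap max_iter are hypothetical names and default values chosen for the example, not part of the original formulation.

```python
def squared_error(points, centroids, labels):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        total += sum((a - b) ** 2 for a, b in zip(point, centroids[label]))
    return total


def has_converged(old_labels, new_labels, old_sse, new_sse, iteration,
                  tol=1e-4, max_iter=100):
    """Return True if any of the three stopping conditions holds.

    tol and max_iter are assumed defaults, chosen only for illustration.
    """
    if new_labels == old_labels:      # no object changed its cluster
        return True
    if iteration >= max_iter:         # the given number of iterations was reached
        return True
    # squared-error difference of two consecutive iterations is below the threshold
    return abs(old_sse - new_sse) < tol
```

Checking the cheapest condition first (unchanged assignments) avoids relying on the squared-error comparison when the clustering is already stable.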

Problem Formulation:

The original k-means algorithm selects its initial centroids randomly. The refined method determines the initial centroids systematically so as to produce clusters with better accuracy, and it may then use a variant of the clustering method for verification. It starts by forming the initial clusters based on a calculated threshold value and then formulates clusters by calculating the relative distance of each data point from the initial centroids. These clusters are subsequently modified according to the data points, thereby improving efficiency as well as accuracy.

Algorithm:

Pseudocode for the k-means clustering algorithm is listed as follows:

Input: A dataset D of n objects
D = {d1, d2, ..., dn}
k: the number of desired clusters

Output:
A set of k clusters containing data from dataset D.

Method.

Steps:
1. Randomly select k objects from the dataset D as initial centroids;
2. Repeat
a. Assign each object di from dataset D to the cluster to which the object is the most similar, i.e., the cluster that has the closest centroid;
b. Calculate the new mean for each cluster;
c. Until the convergence criterion is met (there is no change in the cluster centres).
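A minimal Python sketch of the pseudocode above, under stated assumptions: objects are tuples of numbers, similarity is Euclidean distance (math.dist, available from Python 3.8), and the max_iter cap and seed parameter are additions made for the example rather than part of the listing.

```python
import math
import random


def kmeans(dataset, k, max_iter=100, seed=None):
    """Partition dataset (a list of numeric tuples) into k clusters.

    Returns (centroids, labels). max_iter and seed are assumed extras.
    """
    rng = random.Random(seed)
    # Step 1: randomly select k objects from the dataset as initial centroids
    centroids = [tuple(p) for p in rng.sample(dataset, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2a: assign each object to the cluster with the closest centroid
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in dataset
        ]
        # Step 2c: unchanged assignments imply unchanged cluster centres
        if new_labels == labels:
            break
        labels = new_labels
        # Step 2b: calculate the new mean for each cluster
        for j in range(k):
            members = [p for p, lab in zip(dataset, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return centroids, labels
```

For example, kmeans([(0, 0), (2, 4)], k=1) returns the single centroid (1.0, 2.0) with both objects labelled 0.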

Idea of Getting Initial Centroids:

Input: No. of clusters k, no. of objects n

Output: k clusters that specify the least error.

Process:

There are n objects in the population set A, which is to be partitioned into k clusters.

The Euclidean distance formula is used to calculate the distance between data objects. For example, the distance between one vector X (x1, y1) and another vector Y (x2, y2) is given by:

d(X, Y) = sqrt((x1 - x2)^2 + (y1 - y2)^2)
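The Euclidean distance between two vectors generalises from two coordinates to any number of dimensions. A minimal Python sketch follows; the function name euclidean_distance is chosen for illustration only.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two equal-length numeric vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Worked example: X = (0, 0), Y = (3, 4) gives sqrt(9 + 16) = 5
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```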

Set cNo = 1, where cNo is a variable that holds the number of the current cluster.

Compute the distance between each data object and every other data object in A until the threshold d Mean x is reached.

When the distance between two points satisfies the threshold d Mean x, delete them from set A and add them to set B (1 ≤ cNo ≤ k).

Find the point in A that is closest to the selected data-point set BcNo, add it to BcNo and delete it from A (BcNo represents the cluster set with cluster number cNo). Repeat this step n/k times.

If cNo < k, then set cNo = cNo + 1, find another pair of data points in A between which the distance is the shortest, form another data-point set BcNo from them, delete them from A, and then go to step (1).

If cNo = k, then take the mean of each cluster and move each centroid to the mean of its assigned data points.

Reassign the data points to clusters according to their closest new centroid.
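The seeding process above leaves several details open, so the following Python sketch is one possible reading rather than the exact procedure. Its assumptions: each set is started from the closest remaining pair of points, it is grown until it holds about n/k points, the distance from a point to a set is taken as the distance to the set's nearest member, and the d Mean x threshold is not modelled. The function name initial_centroids is hypothetical.

```python
import math


def initial_centroids(points, k):
    """Seed up to k centroids by growing groups outward from closest pairs."""
    remaining = list(points)                   # population set A
    group_size = max(2, len(remaining) // k)   # assumed target size of each set
    groups = []                                # the sets B1 .. Bk
    for _ in range(k):
        if len(remaining) < 2:
            break  # too few points left to form another pair
        # find the pair of points in A with the shortest distance
        pairs = [(i, j) for i in range(len(remaining))
                 for j in range(i + 1, len(remaining))]
        i, j = min(pairs,
                   key=lambda ij: math.dist(remaining[ij[0]], remaining[ij[1]]))
        group = [remaining[i], remaining[j]]
        del remaining[j]  # j > i, so delete the higher index first
        del remaining[i]
        # move the point of A closest to the set into it, until the set is full
        while len(group) < group_size and remaining:
            nearest = min(remaining,
                          key=lambda p: min(math.dist(p, g) for g in group))
            group.append(nearest)
            remaining.remove(nearest)
        groups.append(group)
    # the mean of each set becomes an initial centroid
    return [tuple(sum(axis) / len(g) for axis in zip(*g)) for g in groups]
```

The centroids returned can then be used in place of randomly chosen ones, with the usual reassignment and mean-update iterations following. Points still left in A are picked up by the first reassignment.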

The basic idea of the previous algorithm is to divide the data objects into several partitions such that the distances between objects within the same class are much smaller than the distances between objects in different classes. The k-means algorithm easily finds a local minimum, but not necessarily the global minimum. Selecting different initial centroids leads to different clustering results, so the proposed solution determines consistent initial centroids based on certain criteria and on the distribution of the data, which will hopefully provide better clustering in an effective manner.
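The sensitivity to initial centroids can be illustrated with a small Python example. It is a sketch under assumptions: the four points and both sets of starting centroids are made up for the illustration, and kmeans_from is a hypothetical helper that runs the standard assignment and mean-update loop from given starting centroids.

```python
import math


def kmeans_from(points, centroids, max_iter=100):
    """Run k-means from the given initial centroids.

    Returns (centroids, labels, sum of squared errors).
    """
    centroids = [tuple(c) for c in centroids]
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # assign every point to its closest centroid
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        # move each centroid to the mean of its assigned points
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    sse = sum(math.dist(p, centroids[lab]) ** 2
              for p, lab in zip(points, labels))
    return centroids, labels, sse


# Four corners of a 4-by-1 rectangle, to be split into k = 2 clusters
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Starting on the two short sides groups left vs. right: squared error 1.0
good = kmeans_from(points, [(0, 0.5), (4, 0.5)])

# Starting on the two long sides groups bottom vs. top: squared error 16.0
poor = kmeans_from(points, [(2, 0), (2, 1)])
```

Both runs satisfy the convergence criterion, yet the second one stops at a clustering whose squared error is sixteen times larger, which is the local-minimum behaviour described above.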
