prob form

Upload: jaya-shukla

Post on 23-Feb-2018

  • 7/24/2019 prob form

    PROBLEM FORMULATION

    K-Means Clustering

    Among clustering algorithms, k-means is the most widely and extensively used. It takes two
    input parameters: a dataset of n objects and k, the number of clusters to be created, and it
    partitions the n objects into k clusters. Cluster similarity is measured by the Euclidean
    distance between objects, so k-means tends to find spherical (ball-shaped) clusters. The mean
    value of the objects in a cluster can be viewed as the cluster's centre of gravity.
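    As a minimal sketch of the two quantities described above, the Python below computes the
    Euclidean distance between two objects and the mean (centre of gravity) of a cluster. The
    names euclidean and centroid, and the representation of objects as tuples of numbers, are
    assumptions made for illustration and do not come from the source.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two objects given as equal-length tuples."""
    return sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def centroid(cluster):
    """Coordinate-wise mean of the objects in a non-empty cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


if __name__ == "__main__":
    print(euclidean((0, 0), (3, 4)))
    print(centroid([(0, 0), (2, 4)]))
```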

    The algorithm works in two phases. In the first phase, k initial centroids are selected
    randomly, one for each cluster. In the second phase, each object of the input dataset is
    associated with the cluster that has the nearest centroid. Nearness is usually measured by
    Euclidean distance, but other measures such as Manhattan distance can also be used. When all
    the objects of the input dataset have been assigned to some cluster, the first iteration is
    complete and an early grouping is done. At this point the algorithm starts a new iteration
    and recalculates the centroids, since the inclusion of new data may change them. The k
    centroids may thus change their position step by step. Eventually a situation is reached
    where the centroids do not move any more, or the data objects no longer change cluster. This
    is the convergence criterion for clustering.
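    The assignment phase, and the effect of swapping the distance measure, can be illustrated
    with a short Python sketch. The function names assign and manhattan and the sample
    coordinates are hypothetical; the only point is that the same object can land in a different
    cluster when Manhattan distance replaces Euclidean distance.

```python
from math import dist  # Euclidean distance, Python 3.8+


def manhattan(a, b):
    """Manhattan (city-block) distance between two equal-length tuples."""
    return sum(abs(p - q) for p, q in zip(a, b))


def assign(points, centroids, distance=dist):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: distance(p, centroids[j]))
        for p in points
    ]


if __name__ == "__main__":
    points = [(0, 0)]
    centroids = [(3, 3), (0, 5)]
    print(assign(points, centroids))             # Euclidean distance
    print(assign(points, centroids, manhattan))  # Manhattan distance
```

    In this example the object (0, 0) is nearer to (3, 3) under Euclidean distance (about 4.24
    against 5) but nearer to (0, 5) under Manhattan distance (5 against 6).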

    Four steps can be identified in the existing k-means algorithm:

    Step 1: Initialization of data objects. This step defines a centroid for each group, that
    is, it fixes how many sets the data points are to be partitioned into.

    Step 2: Data object classification. Each data object is added to its relative group by
    calculating the distance between the object and each of the centroids and choosing the
    closest centroid.

    Step 3: Calculate the centroid as the representative of the cluster. Each group is formed by
    placing data objects in clusters according to the previous step. After all objects have been
    assigned, the cluster centroids are recalculated to improve the clustering.

    Step 4: Improved clustering through a convergence condition. Several convergence conditions
    can be used to stop the iterative process: there is no further exchange of data objects
    among groups and the centroids no longer change; a given number of iterations is reached,
    which is the condition most often used to stop clustering in line with user requirements; or
    the difference in the squared error function between two consecutive iterations reaches a
    given threshold. If the convergence condition is not satisfied at any point, steps two,
    three and four of the algorithm are repeated.
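    The three stopping conditions named in Step 4 can be combined in a single check. The sketch
    below is one possible arrangement, not the source's implementation: the function names sse
    and should_stop, and the default values max_iter=100 and tol=1e-4, are assumptions.

```python
from math import dist


def sse(points, labels, centroids):
    """Squared error function: sum of squared distances to the assigned centroid."""
    return sum(dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))


def should_stop(old_labels, new_labels, old_sse, new_sse,
                iteration, max_iter=100, tol=1e-4):
    """Return True if any of the three convergence conditions holds."""
    if new_labels == old_labels:          # no object changed cluster
        return True
    if iteration >= max_iter:             # iteration budget used up
        return True
    if old_sse is not None and abs(old_sse - new_sse) < tol:
        return True                       # squared error barely changed
    return False
```

    A caller would evaluate should_stop at the end of each pass through steps two and three,
    passing None as old_sse on the first pass.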

    Problem Formulation:

    The original k-means algorithm selects its initial centroids randomly. The refined method
    determines the initial centroids systematically, so as to produce clusters with better
    accuracy, and may then use a variant of the clustering method for verification. It starts by
    forming the initial clusters based on a calculated threshold value, and then formulates
    clusters by calculating the relative distance of each data point from the initial centroids.
    These clusters are subsequently modified according to the data points, thereby improving
    both efficiency and accuracy.

    Algorithm:

    Pseudocode for the k-means clustering algorithm is listed below.

    Input:
        D = {d1, d2, ..., dn}: a dataset of n objects
        k: the number of desired clusters

    Output:
        A set of k clusters containing the data from dataset D.

    Method:

    Steps:

    1. Randomly select k objects from the dataset D as initial centroids;

    2. Repeat

        a. Assign each object di from dataset D to the cluster to which the object is most
           similar, i.e., the cluster with the closest centroid;

        b. Calculate the new mean for each cluster;

        c. Until the convergence criterion is met (there is no change in the cluster centres).
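    The pseudocode above can be sketched in Python roughly as follows. The function name
    k_means, the seed and max_iter parameters, and the choice to leave a centroid in place if
    its cluster becomes empty are assumptions added for illustration; the source does not
    specify them.

```python
import random
from math import dist


def k_means(data, k, seed=None, max_iter=100):
    """Basic k-means over a list of tuples; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)        # step 1: k random objects from D
    labels = []
    for _ in range(max_iter):              # step 2: repeat
        # 2a: assign each object to the cluster with the closest centroid
        labels = [min(range(k), key=lambda j: dist(d, centroids[j])) for d in data]
        # 2b: calculate the new mean of each cluster
        updated = []
        for j in range(k):
            members = [d for d, l in zip(data, labels) if l == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[j])  # assumption: keep an empty cluster's centre
        # 2c: stop when the cluster centres no longer change
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


if __name__ == "__main__":
    D = [(0, 0), (0, 1), (1, 0), (100, 100), (100, 101), (101, 100)]
    labels, centres = k_means(D, k=2, seed=0)
    print(labels, centres)
```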

    Idea of Getting Initial Centroids:

    Input: No. of clusters k, no. of objects n

    Output: k clusters that give the least error.

    Process:

    There are n objects in the population set A, which is to be partitioned into k clusters.

    The Euclidean distance formula is used to calculate the distance between data objects. For
    example, the distance between one vector X(x1, y1) and another vector Y(x2, y2) is:

        d(X, Y) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}

    Set cNo = 1, where cNo is a variable that holds the current cluster number.

    Compute the distance between each data object and every other data object in A until it
    reaches 9 d Mean x.

    When the distance between two points is 9 d Mean x, delete them from set A and add them to
    set BcNo (1 ≤ cNo ≤ k).

    Find the point in A that is closest to the selected data-point set BcNo, add it to BcNo and
    delete it from A (BcNo denotes the cluster set with cluster number cNo). Repeat this step
    n/k times.

    If cNo < k, then cNo = cNo + 1; find another pair of data points in A between which the
    distance is the shortest, form another data-point set BcNo, delete them from A, and then go
    to step (1).

    If cNo = k, then take the mean of each cluster and move each centroid to the mean of its
    assigned data points.

    Reassign the data points to their closest centroid, forming clusters according to the new
    centroids.
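    One possible reading of this process is sketched below in Python. It covers only the parts
    of the procedure that are explicit: seed each set with the closest remaining pair, grow it
    with the nearest remaining points, and take the mean of each set as an initial centroid.
    Several details are assumptions rather than the source's method: the threshold test is left
    out, each set is grown to n // k points, "closest to the set" is taken to mean closest to
    any member of the set, and the name initial_centroids is hypothetical.

```python
from itertools import combinations
from math import dist


def initial_centroids(points, k):
    """Derive k initial centroids by growing k sets from closest pairs (illustrative)."""
    if len(points) < 2 * k:
        raise ValueError("need at least two points per cluster")
    remaining = list(points)               # the set A
    size = max(2, len(points) // k)        # assumed size of each set B_cNo
    centroids = []
    for _ in range(k):                     # cNo = 1 .. k
        # The closest pair still in A seeds the set B_cNo.
        p, q = min(combinations(remaining, 2), key=lambda pair: dist(*pair))
        group = [p, q]
        remaining.remove(p)
        remaining.remove(q)
        # Grow the set with the point of A nearest to any of its members.
        while len(group) < size and remaining:
            nearest = min(remaining, key=lambda a: min(dist(a, b) for b in group))
            group.append(nearest)
            remaining.remove(nearest)
        # The mean of the set becomes one initial centroid.
        centroids.append(tuple(sum(c) / len(group) for c in zip(*group)))
    return centroids


if __name__ == "__main__":
    A = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(initial_centroids(A, k=2))
```

    The returned centroids would then be handed to the usual assignment and update loop in
    place of randomly selected objects.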

    The basic idea of the preceding algorithm is to divide the data objects into several
    partitions such that the distances between objects within the same class are much smaller
    than the distances between objects in different classes. The k-means algorithm easily finds
    a local minimum, but not necessarily the global minimum. Different selections of initial
    centroids lead to different clusterings, so the proposed solution determines consistent
    initial centroids from stated criteria and the distribution of the data, which will
    hopefully provide better clustering in an effective manner.
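    The sensitivity to initial centroids can be seen on a toy dataset. The sketch below runs the
    same assign-and-update loop from two fixed starting positions on the four corners of a 4 by 1
    rectangle. The dataset, the function name lloyd and both starting positions are hypothetical
    choices made for illustration.

```python
from math import dist


def lloyd(points, centroids, max_iter=100):
    """Run k-means from the given initial centroids; return (labels, squared error)."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        labels = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(old)
        if updated == centroids:           # centres stopped moving
            break
        centroids = updated
    error = sum(dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))
    return labels, error


corners = [(0, 0), (0, 1), (4, 0), (4, 1)]
labels_a, error_a = lloyd(corners, [(0, 0), (4, 1)])   # one start on each short side
labels_b, error_b = lloyd(corners, [(0, 0), (0, 1)])   # both starts on the same short side

if __name__ == "__main__":
    print(labels_a, error_a)
    print(labels_b, error_b)
```

    The first start groups the corners into left and right pairs with a squared error of 1. The
    second converges to top and bottom pairs with a squared error of 16, a stable but clearly
    worse local minimum.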