


Clustering Deviation Analysis on Breast Cancer using Linear Vector Quantization Technique

Soumya Sahoo* Sushruta Mishra** Sunil Kumar Mohapatra*** Brojo Kishore Mishra****

Abstract : Data mining aims to discover unknown knowledge in large datasets. One of the most popular and widely used functionalities of knowledge mining is clustering, which partitions a set of data objects into a set of meaningful sub-classes, called clusters. In this paper we consider Breast Cancer data samples. The entire dataset is partitioned into a training set and a testing set. Different distributions of data are taken, and Scatter Search is applied to eliminate the irrelevant attributes from the dataset. The Linear Vector Quantization (LVQ) clustering technique is then used as a classifier to examine the deviation in the clustered data samples. The evaluation suggests that with almost all combinations of training and testing set the performance remains the same. It is further demonstrated that the performance of our system model with the LVQ technique is worst when the breast cancer dataset is partitioned with 60% of the samples as training data and the remaining 40% as testing data. The latency was found to be least with the 80-20 ratio and highest with the 60-40 ratio of training to testing set.

Keywords : Linear Vector Quantization (LVQ), Scatter Search, Breast Cancer, Clustering Variation.

IJCTA, 9(23), 2016, pp. 313-324. International Science Press

* Department of C.V. Raman College of Engineering, Bhubaneswar, INDIA Email- [email protected]

** Department of C.V. Raman College of Engineering, Bhubaneswar, INDIA Email- [email protected]

*** Department of C.V. Raman College of Engineering, Bhubaneswar, INDIA Email- [email protected]

**** Department of C.V. Raman College of Engineering, Bhubaneswar, INDIA Email- [email protected]

1. INTRODUCTION

DATA MINING aims to discover unknown knowledge in large datasets. Clustering is the process of partitioning a set of data objects into a set of meaningful sub-classes, called clusters. A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point outside the cluster. Clustering can be roughly divided into hard clustering, where each object either belongs to a cluster or not, and soft clustering, where each object belongs to each cluster to a certain degree. Among the various types of clustering algorithms, we have emphasized the vector quantization family. Vector quantization is an example of competitive learning: the goal is to have the network "discover" structure in the data by finding how the data is clustered, and the results can be used for data encoding and compression. Linear Vector Quantization (LVQ), also known as Learning Vector Quantization, is a supervised version of vector quantization. LVQ is a widely popular clustering technique in which a cluster is described by its center along with some shape and size metrics [1]. These parameters are adapted so that the clusters fit a given dataset. Related techniques include clustering by fuzzy logic [2][3] and the K-Means method [4][5]. In LVQ, classes are predefined and we have a set of labelled data from which we must determine a set of prototypes that best represent each class. The basic idea of our approach is to compute a desired radius from the data points assigned to a cluster and then to adapt the current radius of the cluster in the direction of this desired radius.


In this paper we use the LVQ technique as a clustering method and analyze the effect of the distribution of dataset samples on the clustering deviation. We tried different combinations of the dataset, including the complete training set and partitions of the dataset into training and testing instances with ratios of 80-20, 70-30 and 60-40. To each partition the Scatter Search feature selection method was applied to obtain an optimized result set. The LVQ technique was then applied to this reduced feature set, and the results are depicted as clustered instances spread between the two clusters. The paper is organized as follows. The next section presents related work. The subsequent part describes the LVQ technique and the Scatter Search method in detail. Our proposed evaluation model is then presented, and the results are evaluated with the WEKA 3.7.12 software. Finally, the conclusion is drawn along with future work.

2. RELATED WORK

LVQ networks that can be trained via cost functions were proposed in [6][7][8]. Further mathematical analysis of these algorithms can be carried out based on the respective cost function [9]. In [10] it was shown that LVQ aims at margin optimization, i.e. good generalization ability can be expected. A theoretical analysis of different LVQ algorithms in simplified model situations can be found in [11][12]. Fujiki Morii and Kazuko Kurahashi [13] observed that, when classifying linearly separable data with learning vector quantization or the K-Means algorithm, it was difficult to obtain good classification results if the initial cluster centre selection was bad or the distribution of class data varied. They therefore proposed a clustering method based on multiple criteria for learning vector quantization and the K-Means algorithm that helps produce strong classification results. K-Means with a split-and-merge procedure was used to obtain good cluster centres and to reduce the squared-error distortion. These cluster centres were then used as initial centres by LVQ clustering, which produced k clusters. They introduced a criterion to determine whether an individual cluster exhibits uni-modality; clusters lacking uni-modality were split into sub-clusters with K-Means, which were merged into applicable neighbouring clusters leaving out one sub-cluster, after which the classification result was validated. Karayiannis NB and Randolph-Gips [14] presented the development of soft clustering and learning vector quantization (LVQ) algorithms that rely on multiple weighted norms to measure the distance between the feature vectors and their prototypes. Clustering and LVQ are formulated as the minimization of a reformulation function that employs distinct weighted norms to measure the distance between each of the prototypes and the feature vectors, under a set of equality constraints imposed on the weight matrices. Fuzzy LVQ and clustering algorithms are obtained as special cases of the proposed formulation. The resulting clustering algorithm was evaluated and benchmarked on three data sets that differ in data structure and in the dimensionality of the feature vectors. This experimental evaluation indicates that the proposed multinorm algorithm outperforms algorithms employing the Euclidean norm as well as existing clustering algorithms employing weighted norms. N.R. Pal, J.C. Bezdek and E.C.K. Tsao [15] discussed the relationship between sequential hard c-means (SHCM) and learning vector quantization (LVQ) clustering. They also considered the relationship of these methods to Kohonen's self-organizing feature map, which is not a clustering method but often inspires ideas about clustering algorithms. They suggested a generalization of LVQ that updates all nodes for a given input vector. The network tries to find a minimum of an objective function. The learning rules depend on the degree of match to the winner node: the lesser the degree of match with the winner, the greater the impact on the non-winner nodes. The results indicate that the terminal prototypes produced by this modification of LVQ are usually insensitive to initialization and independent of the choice of learning coefficient. They used the IRIS data obtained by E. Anderson to illustrate the proposed method and compared the results with the standard LVQ approach.

3. LVQ CLUSTERING

A clustering technique called the Kohonen SOM, used to determine the nature of data samples, can be turned into a supervised version without any topological architecture, called the LVQ neural network. Here every output neuron denotes a predefined category (e.g. cricket, tennis, hockey). The input vector is represented as x = (x1, x2, x3, ..., xn). The weight vector for the jth output neuron is represented as wj = (w1j, w2j, w3j, ..., wnj). Cj is the category represented by the jth neuron; this is pre-assigned. T is the correct category for the input. The squared Euclidean distance between the input vector and the weight vector of the jth neuron is defined as:

D(j) = Σi (xi – wij)²

3.1. Algorithm steps of LVQ Clustering

Step 1: Initialize the weight vectors to the first m training vectors, where m denotes the number of different classes, and set the learning rate α.

Step 2: While the termination criterion is not satisfied, repeat steps 3 to 6.

Step 3: For each training input vector x, do steps 4 and 5.

Step 4: Determine J such that D(J) is a minimum.

Step 5: Update the weights of neuron J as follows:

IF T = CJ THEN wJ(new) = wJ(old) + α (x – wJ(old))
(i.e. the weight vector w is moved towards the input vector x)

IF T ≠ CJ THEN wJ(new) = wJ(old) – α (x – wJ(old))
(i.e. move w away from x)

Step 6: Reduce the learning rate.

Step 7: Test the termination criterion: it may be a predefined number of iterations or the learning rate reaching a sufficiently small value.

Here x – wj = ((x1 – w1j), (x2 – w2j), (x3 – w3j), ..., (xn – wnj)).
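The update steps above can be sketched in code. The following is an illustrative LVQ1 sketch in Python, not the WEKA implementation used in the paper's experiments; the one-prototype-per-class initialization and the parameter values are assumptions for demonstration.

```python
import numpy as np

def lvq1_train(X, y, n_classes, alpha=0.1, decay=0.95, epochs=20):
    """Train LVQ1 prototypes: one prototype per class, initialized
    from the first training vector of each class (Step 1)."""
    prototypes = np.array([X[y == c][0] for c in range(n_classes)], dtype=float)
    proto_labels = np.arange(n_classes)
    for _ in range(epochs):
        for x, t in zip(X, y):
            # Step 4: winner J = prototype with minimum squared Euclidean distance
            j = np.argmin(np.sum((prototypes - x) ** 2, axis=1))
            if proto_labels[j] == t:
                # Step 5a: correct class -> move the winner towards x
                prototypes[j] += alpha * (x - prototypes[j])
            else:
                # Step 5b: wrong class -> move the winner away from x
                prototypes[j] -= alpha * (x - prototypes[j])
        alpha *= decay  # Step 6: reduce the learning rate
    return prototypes, proto_labels

def lvq1_predict(prototypes, proto_labels, X):
    """Assign each sample the label of its nearest prototype."""
    d = np.sum((X[:, None, :] - prototypes[None, :, :]) ** 2, axis=2)
    return proto_labels[np.argmin(d, axis=1)]
```

On well-separated data the prototypes settle near the class centers, so nearest-prototype prediction recovers the class labels.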

4. SCATTER SEARCH

Scatter Search is a metaheuristic and a global optimization algorithm. Related to the field of evolutionary computation, it operates on a population of solutions and recombines them. Scatter Search is a sibling of Tabu Search and is based on similar origins. Its main aim is to maintain a set of diverse, high-quality candidate solutions. The principle of the approach is that useful information about the global optimum is stored in a diverse and elite solution set, and that recombining samples from this set can exploit this information. The selection of members for the ReferenceSet at the end of each iteration favours solutions of higher quality and may also promote diversity. The ReferenceSet may be updated at the end of an iteration, or dynamically as candidates are created (a so-called steady-state population in some of the evolutionary computation literature). A lack of changes to the ReferenceSet may be used as a signal to stop the current search, and potentially to restart the search with a newly initialized ReferenceSet.

4.1. Pseudocode of the Scatter Search Algorithm

1. Start with P = Ø. Use the diversification generation method to construct a solution and apply the improvement method. Let x be the resulting solution. If x ∉ P then add x to P, otherwise discard x. Repeat this step until |P| = PSize.

2. Use the reference set update method to build RefSet with the "best" b solutions in P. Find the best and the worst solution in RefSet according to their objective function value. Set NewSolutions = TRUE.

While (NewSolutions)
{
    3. Generate NewSubsets with the subset generation method. Set NewSolutions = FALSE.
    While (NewSubsets ≠ Ø)
    {
        4. Select the next subset s in NewSubsets.
        5. Apply the solution combination method to s to obtain one or more new trial solutions x. Apply the improvement method to the trial solutions.
        6. Apply the reference set update method. If (RefSet has changed) then set NewSolutions = TRUE.
        7. Delete s from NewSubsets.
    }
}
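The loop above can be sketched as follows. This is a minimal Python illustration of the scatter-search skeleton on a toy bit-string problem; the diversification, improvement (best single bit flip) and combination (uniform crossover) operators here are simple placeholders of our own choosing, not the operators used by the WEKA feature-selection implementation.

```python
import random

def scatter_search(objective, n_bits, pop_size=20, ref_size=5, iters=30, seed=1):
    """Skeleton of the scatter-search loop: build a diverse population P,
    keep a ReferenceSet of the best b solutions, recombine pairs of
    reference solutions, improve the trials, and update the set."""
    rng = random.Random(seed)

    def diversify():                        # diversification generation method
        return tuple(rng.randint(0, 1) for _ in range(n_bits))

    def improve(s):                         # improvement method: best single bit flip
        flips = [s[:i] + (1 - s[i],) + s[i + 1:] for i in range(n_bits)]
        return max(flips + [s], key=objective)

    def combine(a, b):                      # combination method: uniform crossover
        return tuple(rng.choice(pair) for pair in zip(a, b))

    P = set()
    while len(P) < pop_size:                # step 1: unique improved solutions
        P.add(improve(diversify()))
    ref = sorted(P, key=objective, reverse=True)[:ref_size]   # step 2
    for _ in range(iters):                  # steps 3-7, bounded for safety
        new_solutions = False
        for i, a in enumerate(ref):
            for b in ref[i + 1:]:           # subset generation: all pairs
                trial = improve(combine(a, b))
                worst = min(ref, key=objective)
                if trial not in ref and objective(trial) > objective(worst):
                    ref[ref.index(worst)] = trial   # reference set update
                    new_solutions = True
        if not new_solutions:               # RefSet unchanged: stop searching
            break
    return max(ref, key=objective)
```

For feature selection, each bit would mark one attribute as kept or dropped and the objective would score the resulting attribute subset.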

5. DATASET DESCRIPTION

The Breast Cancer dataset comprises 201 instances of one class and 85 instances of the other class. The instances are described by 9 attributes, some of which are linear and some nominal. In total there are 286 instances and 9 attributes. The Breast Cancer dataset and its attributes are presented in Table 1.

Table 1. Breast Cancer dataset details

Attribute Detail

Class no-recurrence-events, recurrence-events

age 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.

menopause lt40, ge40, premeno

tumor-size 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59.

inv-nodes 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39

node-caps yes, no

deg-malig 1, 2, 3

breast left, right

breast-quad left-up, left-low, right-up, right-low, central

irradiat yes, no

Table 2. Attribute details after feature reduction.

Attribute name

Tumor-size

Inv-nodes

Node-caps

Deg-malig

Irradiat

It can clearly be seen that before applying the Scatter Search method the total number of attributes is 9, apart from the class label. After applying Scatter Search the irrelevant features are removed, leaving an attribute set of 5, as shown in Table 2. This optimized attribute set is simulated using the Linear Vector Quantization technique to calculate the data deviation, taking different combinations of training and testing samples. The details of the proposed system model are presented in the next section.


6. PROPOSED WORK

Fig. 1. Proposed Framework to compute data deviation in breast Cancer dataset.

The proposed system model is demonstrated in the given figure. As shown in the diagram, the original Breast Cancer dataset is gathered from the UCI repository. In our work we apply the LVQ clustering method to the Breast Cancer data with Scatter Search as the feature selection technique. The objective of our study is to analyze the clustering distribution over the various attributes of the dataset. A dataset may be divided into a training set and a testing set, and different combinations of such splits may lead to different findings. In our research we subdivided the data samples into training and testing sets with different distributions. For example, a 70-30 ratio denotes that the training set constitutes 70% of the dataset while the remaining 30% is reserved for testing the data with a suitable classifier. There are two possible clusters to which the instances can belong. The standard parameters associated with the attributes are min, max, mean and standard deviation. The distribution of samples in the respective clusters is shown in the figure: the left section denotes the training phase while the right part is the testing stage. During the training phase, once the dataset is collected it is subjected to Scatter Search for feature selection. The result is a reduced, optimized sample set with all irrelevant features removed. Since there are two classes in this dataset, no-recurrence-events and recurrence-events, each data sample belongs to one of them in the training stage. The testing stage is carried out similarly, based on the chosen ratio between training and testing samples; the initial steps remain the same. After feature selection with Scatter Search, the samples are passed to Linear Vector Quantization (LVQ) for clustering, and finally the clustered instances are assigned to the two clusters and collected for evaluation. As can be seen, the variation in the values of the parameters associated with the attributes, such as mean and min, is not very significant. Tables 3 to 6 show the parameter values of the dataset attributes before feature selection with Scatter Search; Tables 7 to 10 show the parameter distributions when Scatter Search is used.
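The train/test distributions described above can be reproduced with a simple holdout split. Below is a hypothetical Python sketch (the paper's experiments were run in WEKA 3.7.12, not with this code); the `records` list is a placeholder standing in for the 286 loaded UCI breast-cancer instances.

```python
import random

def holdout_split(samples, train_ratio, seed=42):
    """Partition a dataset into training and testing subsets,
    e.g. train_ratio=0.7 gives the paper's 70-30 distribution."""
    shuffled = samples[:]                   # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # seeded shuffle for reproducibility
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# The three distributions evaluated in the paper: 80-20, 70-30 and 60-40.
records = list(range(286))                  # placeholder for the 286 instances
splits = {r: holdout_split(records, r) for r in (0.8, 0.7, 0.6)}
```

Each split would then be passed through feature selection and clustering exactly as in the full-training-set case.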


Table 3. Attribute-wise data distribution of Breast Cancer with Full Training set.

Attribute Cluster-0 Cluster-1

name Min Max mean Std.dev Min max mean Std.dev

Age 1 6 3.665 1.0274 2 6 3.6627 0.9788

menopause 0 2 1.4877 0.5483 0 2 1.5301 0.5486

Tumor-size 0 10 4.7241 2.0973 2 10 5.2651 2.09

Inv-nodes 0 8 0.4433 1.1082 0 5 0.7229 1.2328

Node-caps 0 1 0.8232 0.3824 0 1 0.7375 0.4428

Deg-malig 0 2 1.0148 0.7344 0 2 1.1325 0.7452

breast 0 1 0.2857 0.4529 0 1 0.9157 0.2796

breastquad 0 4 1.5297 1.1809 0 3 0.2771 0.6498

irradiat 0 1 0.8621 0.3457 0 1 0.5181 0.5027

class 0 1 0.2956 0.4574 0 1 0.3012 0.4616

Table 4. Attribute-wise data distribution of Breast Cancer with 60-40 ratio between Training and testing set.

Attribute Cluster-0 Cluster-1

name Min Max mean Std.dev Min max mean Std.dev

age 1 6 3.9286 1.0811 2 6 3.4851 0.9655

menopause 0 2 1.3143 0.4976 0 2 1.6337 0.5239

tumor-size 0 10 4.7571 2.7157 1 9 4.9604 1.7488

inv-nodes 0 5 0.2571 0.7359 0 5 0.505 1.0161

node-caps 0 1 0.8841 0.3225 0 1 0.7879 0.4109

deg-malig 0 2 0.7714 0.569 0 2 1.2772 0.7631

breast 0 1 0.6857 0.4676 0 1 0.3267 0.4714

breastquad 0 4 1.5 1.2247 0 4 1.07 1.233

irradiat 0 1 0.8286 0.3796 0 1 0.6832 0.4676

class 0 1 0.1857 0.3917 0 1 0.3267 0.4714

Table 5. Attribute-wise data distribution of Breast Cancer with 70-30 ratio between Training and Testing set.

Attribute Cluster-0 Cluster-1

name Min Max mean Std.dev Min max mean Std.dev

age 2 6 3.8776 0.9923 1 6 3.6093 1.0391

menopause 1 2 1.3061 0.4657 0 2 1.5563 0.5494

tumor-size 2 9 5.449 1.5687 0 10 4.5497 2.3229

inv-nodes 0 5 1.2449 1.2505 0 5 0.1391 0.6223

node-caps 0 1 0.4348 0.5012 0 1 0.9533 0.2116

deg-malig 0 2 1.6122 0.5329 0 2 0.8874 0.7075

breast 0 1 0.3878 0.4923 0 1 0.5166 0.5014

breastquad 0 4 0.9167 1.2688 0 4 1.3576 1.2401

irradiat 0 1 0.3878 0.4923 0 1 0.8675 0.3401

class 0 1 0.5714 0.5 0 1 0.1656 0.3729


Table 6. Attribute-wise data distribution of Breast Cancer with 80-20 ratio betweenTraining and Testing set.

Attribute Cluster-0 Cluster-1

name Min Max mean Std.dev Min max mean Std.dev

age 2 6 3.8375 0.9898 1 6 3.2794 0.975

menopause 0 2 1.4125 0.5187 0 2 1.7206 0.5139

tumor-size 0 10 4.875 1.9992 0 10 4.8088 2.5815

inv-nodes 0 8 0.575 1.2765 0 3 0.3235 0.7419

node-caps 0 1 0.7806 0.4152 0 1 0.8529 0.3568

deg-malig 0 2 1.2 0.7754 0 2 0.8529 0.5535

breast 0 1 0.3 0.4597 0 1 0.9412 0.237

breastquad 0 4 1.3208 1.171 0 4 0.9412 1.3258

irradiat 0 1 0.6813 0.4675 0 1 0.8971 0.3061

class 0 1 0.3063 0.4624 0 1 0.2206 0.4177

Table 7. Attribute-wise data distribution of Breast Cancer with Full Training set.

Attribute Cluster-0 Cluster-1

name Min Max mean Std.dev Min max mean Std.dev

Tumor-size 0 10 4.6881 2.1129 1 10 5.5 1.9737

Inv-nodes 0 5 0.3165 0.9137 0 8 1.1912 1.5284

Node-caps 0 1 0.8704 0.3367 0 1 0.5484 0.5017

Deg-malig 0 2 0.9633 0.7427 0 2 1.3235 0.6566

irradiat 1 1 1 0 0 0 1 0

Table 8. Attribute-wise data distribution of Breast Cancer with 60-40 ratio betweenTraining and Testing set.

Attribute Cluster-0 Cluster-1

name Min Max mean Std.dev Min max mean Std.dev

Tumor-size 1 10 4.9939 2.0197 0 10 1.6667 4.0825

Inv-nodes 0 5 0.4182 0.931 0 0 0 0

Node-caps 0 1 0.821 0.3846 1 1 1 0

Deg-malig 0 2 1.0909 0.731 0 1 0.5 0.5477

irradiat 0 1 0.7333 0.4436 1 1 1 0


Table 9. Attribute-wise data distribution of Breast Cancer with 70-30 ratio betweenTraining and Testing set.

Attribute Cluster-0 Cluster-1

name Min Max mean Std.dev Min max mean Std.dev

Tumor-size 0 9 5.4878 1.5966 0 10 4.2712 2.4096

Inv-nodes 0 5 0.9634 1.2809 0 1 0.0254 0.1581

Node-caps 0 1 0.6375 0.4838 0 1 0.9655 0.1833

Deg-malig 1 2 1.7439 0.4392 0 1 0.5932 0.4933

irradiat 0 1 0.6098 0.4908 0 1 0.8475 0.3611

Table 10. Attribute-wise data distribution of Breast Cancer with 80-20 ratio betweenTraining and Testing set.

Attribute Cluster-0 Cluster-1

name Min Max mean Std.dev Min max mean Std.dev

Tumor-size 0 10 4.6881 2.1129 1 10 5.5 1.9737

Inv-nodes 0 5 0.3165 0.9137 0 8 1.1912 1.5284

Node-caps 0 1 0.8704 0.3367 0 1 0.5484 0.5017

Deg-malig 0 2 0.9633 0.7427 0 2 1.3235 0.6566

Irradiat 1 1 1 0 0 0 0 0

7. RESULTS AND ANALYSIS

Initially the entire dataset is used for training by the LVQ classifier. It is observed that 71% of the instances fall in cluster 1 while the remaining 29% fall in cluster 2. It can be seen that in the 80-20 and 70-30 distributions there

Fig. 2. Cluster distribution with a 80-20 ratio between Training and Testing set of Breast Cancer.

is a marginal deviation between the training-set and testing-set samples with respect to the cluster distribution. In the 80-20 ratio, while 71% of training instances are in cluster 1 and 29% in cluster 2, 66% of the testing samples lie in the first cluster and 34% in the second. Similarly, in the 70-30 distribution, 28% of the training instances are in cluster 1 while 58% are in the other cluster; the testing data likewise show 33% of the records in the first cluster and 57% in the second. In the 60-40 data distribution


between training and testing sets, the deviation is most pronounced, especially in cluster 2, with just 23% of the training samples but 61% of the testing samples falling in cluster 2. The detailed results are shown in figures 2 to 5. The model development time with the LVQ technique varies with the data distribution of the clusters, as seen in figure 6. The least latency incurred is 0.45 seconds with the 80-20 distribution, while it takes 1.37 seconds when the full dataset is used in the training stage.

Fig. 3. Cluster distribution with a 70-30 ratio between Training and Testing set of Breast Cancer.

Fig. 4. Cluster distribution with a 60-40 ratio between Training and Testing set of Breast Cancer.

Fig. 5. Cluster distribution with complete training set of Breast Cancer.


Fig. 6. Time taken to develop the model with LVQ algorithm without using any parameter optimization method.

In the second scenario Scatter Search was used as a filter method to remove the irrelevant features from the breast cancer data samples. The remaining relevant attributes are Tumor-size, Inv-nodes, Node-caps, Deg-malig and irradiat. The outcome is illustrated in figures 7 to 10. A similar procedure is followed: the LVQ technique is used for clustering and the data samples are analyzed for clustering deviation. When the full dataset is used for the training phase there is a distribution of 76:24 between cluster 1 and cluster 2. With the 80-20 ratio between training and testing data, around 56% of the training data falls in the first cluster and 44% in cluster 2; among the remaining 20% of testing data, 69% belongs to cluster 1 and 31% to cluster 2. In the 70-30 distribution, 41% of the data falls in the first cluster and 59% in the second, for both training and testing samples. There is a drastic change in the deviation of data instances in the 60-40 distribution: around 96% of the training data falls in cluster 1 while only 4% falls in cluster 2, and when the remaining 40% of the data is used for testing, a massive 98% of the samples go to cluster 1 and only 2% to cluster 2. As far as model development time is concerned, the minimum is 0.38 seconds with the 80-20 distribution and the maximum is 10.53 seconds with the 60-40 distribution, as seen in figure 11.

Fig. 7. Cluster distribution with complete training set of Breast Cancer.


Fig. 8. Cluster distribution with a 80-20 ratio between Training and Testing set of Breast Cancer.

Fig. 9. Cluster distribution with a 70-30 ratio between Training and Testing set of Breast Cancer.

Fig. 10. Cluster distribution with a 60-40 ratio between Training and Testing set of Breast Cancer.

Fig. 11. Time taken to develop the model with the LVQ algorithm using Scatter Search as a parameter optimization method.


8. CONCLUSION

Vector quantization is a popular example of competitive learning. Linear Vector Quantization (LVQ) may be regarded as the supervised modification of vector quantization: the classes are defined a priori with a set of labelled data samples at hand. The main objective of the technique is to determine a set of prototypes that represent each class as well as possible. In this study the Breast Cancer dataset is considered and parameters such as min, mean and max are evaluated for each combination of data samples. Two experimental setups are carried out. First, the entire dataset is simulated without any feature selection method. Second, we use the Scatter Search method as a feature selection tool and then classify using the LVQ technique to determine the amount of data deviation in the collected Breast Cancer samples. Different combinations of training and testing data are considered, and to each of them the Scatter Search method is applied for feature selection. The selected feature set is then processed with the LVQ technique, a supervised learning scheme. The clustering deviation of the data samples, along with the latency, is analyzed in detail. It is observed that with a 60-40 ratio of training to testing data the deviation in the data distribution across the clustered classes is highest, while the most optimal result occurs in the 80-20 data distribution scenario. As future work, the priority is to use hybrid soft computing methods such as the neuro-fuzzy approach to determine the impact of data deviation in sophisticated clustering schemes.

9. REFERENCES

1. T. Kohonen. Self-Organizing Maps. Springer-Verlag, Heidelberg, Germany, 1995 (3rd extended edition 2001).

2. J.C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York, USA 1981.

3. F. Höppner, F. Klawonn, R. Kruse, and T. Runkler. Fuzzy Cluster Analysis. J. Wiley & Sons, Chichester, UK, 1999.

4. B.S. Everitt. Cluster Analysis. Heinemann, London, UK 1981.

5. J.A. Hartigan and M.A. Wong. A k-means Clustering Algorithm. Applied Statistics 28:100–108. Blackwell, Oxford, UK, 1979.

6. Sato, A. and Yamada, K.: 1996, Generalized learning vector quantization, in D. S. Touretzky, M. C. Mozer and M. E. Hasselmo (eds), Advances in Neural Information Processing Systems, Vol. 8, MIT Press, Cambridge, MA, USA, pp. 423–429.

7. Seo, S. and Obermayer, K.: 2003, Soft learning vector quantization, Neural Computation 15(7), 1589–1604.

8. Seo, S., Bode, M. and Obermayer, K.: 2003, Soft nearest prototype classification, IEEE Transactions on Neural Networks 14(2), 390–398.

9. Sato, A. and Yamada, K.: 1998, An analysis of convergence in generalized LVQ, in L. Niklasson, M. Bodén and T. Ziemke (eds), Proceedings of the International Conference on Artificial Neural Networks, Springer, pp. 170–176.

10. Crammer, K., Gilad-Bachrach, R., Navot, A. and Tishby, N.: 2003, Margin analysis of the LVQ algorithm, Advances in Neural Information Processing Systems, Vol. 15, MIT Press, Cambridge, MA, USA, pp. 462–469.

11. Ghosh, A., Biehl, M. and Hammer, B.: 2006, Performance analysis of LVQ algorithms: a statistical physics approach, Neural Networks 19(6), 817–829.

12. Biehl, M., Ghosh, A. and Hammer, B.: 2007, Dynamics and generalization ability of LVQ algorithms, Journal of Machine Learning Research 8, 323–360.

13. Morii, F. and Kurahashi, K.: Clustering Based on Multiple Criteria for LVQ and K-Means Algorithm, Journal of Advanced Computational Intelligence and Intelligent Informatics (JACIII), Vol. 13, No. 4, pp. 360–365, 2009. doi: 10.20965/jaciii.2009.p0360.

14. Karayiannis, N.B. and Randolph-Gips, M.M.: Soft learning vector quantization and clustering algorithms based on non-Euclidean norms: multinorm algorithms, IEEE Transactions on Neural Networks, 14(1):89–102, 2003. doi: 10.1109/TNN.2002.806951.

15. Pal, N.R., Bezdek, J.C. and Tsao, E.C.K.: Generalized clustering networks and Kohonen's self-organizing scheme, IEEE Transactions on Neural Networks, Vol. 4, No. 4, July 1993, pp. 549–557.