CHAPTER – IV
HYBRID MACHINE LEARNING FOR IMPROVING
CLASSIFICATION ACCURACY
4.1.1 Hybrid Machine Learning
Machine learning algorithms are used in data mining applications to extract hidden information that can support good decision-making. Machine learning encompasses a variety of techniques, such as rule-based learning, case-based reasoning, artificial neural networks and decision trees. Each technique has its own advantages and disadvantages. In the past, many hybrid machine learning systems were developed to combine the best of two different machine learning methods. For example, Rohit and Kumkum created a hybrid machine learning system based on genetic algorithms and support vector machines for stock market prediction [8]. Nerijis, Ignas and Vida developed a hybrid machine learning approach for text categorization using decision trees and artificial neural networks [8]. Sankar developed an integrated data mining approach for maintenance scheduling using case-based reasoning and artificial neural networks [9], and Mammone developed a hybrid machine learning system combining a neural network and a decision tree. In our research we combine an artificial neural network with the case-based reasoning technique. Artificial neural networks give good classification accuracy compared to other machine learning techniques such as rule-based systems and decision trees. Through this hybrid machine learning approach we aim to improve the classification accuracy of the neural network system so that it can be used for medical diagnosis.
4.1.2 Improving Classification Accuracy
For a machine learning system to be useful in solving medical diagnostic tasks, it has to satisfy certain desirable features: good performance, transparency of diagnostic knowledge, the ability to explain decisions, and the ability to reduce the number of tests needed to obtain a reliable diagnosis. To give good performance, the classification system has to achieve diagnostic accuracy on new cases that is as high as possible. Nowadays, to get the best performance, several learning algorithms are tested on the available dataset and the best one or two are selected for diagnosis. Artificial neural networks consistently give good performance in medical diagnosis, yet it is still desirable to improve their accuracy. To achieve this, we combine the neural network with the case-based reasoning approach.
4.2.1 Importance of Case Based Reasoning
Case-based reasoning is closely related to human reasoning. In many situations, the problems humans encounter are solved with a human equivalent of case-based reasoning. When a person encounters a previously unencountered situation or problem, he refers to a past experience of a similar problem. This similar, previous experience may be one he himself or someone else has experienced. If it is someone else's experience, the case will have been added to the reasoner's memory via an oral or written account of that experience [10].
In medical diagnosis, when a patient comes to the doctor, the doctor examines the patient and immediately recollects similar past cases. Because similar cases have similar answers, the doctor looks for similar past cases to treat the present one. Sometimes the retrieved case may be fully or only partially like the new case. If it is fully similar, the corresponding solution may be reused; otherwise the solution has to be modified to fit the new case. In our research, when an input case arrives, the case-based reasoning system retrieves the nearby similar cases and gives them as training samples to the neural network. A neural network's classification performance depends heavily on the quality of the samples used during training. The training samples should be valid and relevant; even if the number of training samples is large, unrelated samples will hurt the classification performance of the neural network. One of the main challenges of data mining applications is data growth: a medical dataset may contain a very large number of patient records. Instead of feeding a large volume of irrelevant data, it is always better to supply meaningful, related data for training. This helps the neural network train properly and classify user input more accurately.
4.2.2 Case Based Reasoning Approach
CBR System
Case-based reasoning is a methodology for solving problems by utilizing previous experiences. It involves retaining a memory of previous problems and their solutions and using it to solve new problems. When presented with a problem, a case-based reasoner searches its memory of past cases and attempts to find a case that has the same problem specification as the current case. If the reasoner cannot find an identical case in its case base, it will attempt to find the case or cases that most closely match the current query case.
In the situation where a previous identical case is retrieved, and presuming its solution was successful, it can be returned as the current problem's solution. In the more likely case that the retrieved case is not identical to the current case, an adaptation phase occurs. In adaptation, the differences between the current case and the retrieved case must first be identified, and then the solution associated with the retrieved case must be modified to take these differences into account. The solution returned in response to the current problem specification may then be tried in the appropriate domain setting.
A case-based reasoning (CBR) system incorporates the reasoning mechanism and external facets such as the input specification, the output (suggested solution), and the memory of past cases referenced by the reasoning mechanism. It is represented in Figure 4.1.
CBR System Internal Structure
A CBR system has an internal structure divided into two major parts, called the case retriever and the case reasoner, as shown in Figure 4.2. The case retriever's job is to find the appropriate cases in the case base for the given input case.
[Figure: a problem case passes through the case-based reasoning mechanism, which consults the case base, to produce a derived solution.]
Figure 4.1 CBR System [10]
[Figure: the problem case passes through the case retriever and then the case reasoner, both consulting the case base, to produce a derived solution.]
Figure 4.2 Two Major Components of a CBR System
The case reasoner uses the retrieved cases to find a solution to the given input case. This reasoning generally involves both determining the differences between the retrieved cases and the current input case, and modifying the retrieved solution appropriately to reflect those differences. The case reasoner may or may not consult the case base for related cases while constructing the solution.
A case can take the form of a record containing all the relevant information about a previous experience or problem. The information recorded about this past experience depends on the domain of the reasoner and the purpose to which the case will be put. In a problem-solving CBR system, the details will usually include the specification of the problem and the relevant attributes of the environment, that is, the circumstances of the problem. The other important part of the case is the solution that was applied in the previous situation. Depending on how the CBR system reasons with cases, this solution may include only the facts of the solution or, additionally, the steps or processes involved in obtaining it. It is also important to include the achieved measure of success in the case description if the cases in the case base have achieved different degrees of success or failure.
If the problem domain lacks a fundamental model, has exceptions and novel cases, or if similar cases are likely to occur frequently, then the case-based reasoning technique has a greater chance of being applicable. Reducing the knowledge acquisition task, avoiding the repetition of past mistakes, graceful degradation of performance, the ability to reason in a domain with only a small body of knowledge, and the ability to learn over time are some of the important reasons for using case-based reasoning.
Case Representation
Cases in a case base can represent many different types of knowledge and store it in many different representational formats. The objective of a system greatly influences what is stored. A case-based reasoning system may be aimed at the creation of a new design or plan, the diagnosis of a new problem, or the argument of a point of view with precedents. In each type of system a case may represent something different: cases could be people, objects, situations, diagnoses, designs, plans or rulings, among others.
In many practical CBR applications, cases are usually represented as two unstructured sets of attribute-value pairs, i.e. the problem and solution features. However, deciding what to represent can be one of the most difficult decisions to make. For example, in a medical CBR system that diagnoses a patient, a case could represent an individual's entire case history or be limited to a single visit to a doctor. In the latter situation the case may be a set of symptoms along with the diagnosis; it may also include the treatment. If a case is a person, then a more complete model is being used, as this can incorporate the change of symptoms from one visit to the next. It is, however, harder to use cases in this format to search for a particular set of symptoms in a current problem and obtain a diagnosis or treatment. Alternatively, if a case is a single visit to the doctor, involving the symptoms at the time of that visit and their diagnosis, then the changes in symptoms
that might be a useful key in solving a problem may be missed. Cases may need to be broken down into sub-cases. For example, a case could be a person's medical history, with all visits made to the doctor as sub-cases. A sample case structure is represented in Figure 4.3.
[Figure: a patient case record with attributes Age, Height and Weight, and sub-cases Visit 1 (Symptom 1, Symptom 2, Diagnosis, Treatment), Visit 2 and Visit 3.]
Figure 4.3 A Patient Case Record
No matter what the case actually represents as a whole, the features of it have to be represented in some format. One of the advantages of case-based reasoning is the flexibility it has in this regard. Depending on what types of features have to be represented, an appropriate implementation platform can be chosen. Ranging from simple Boolean, numeric and textual data to binary files, time dependent data, and relationships between data, CBR can be made to reason with all of them.
No matter what is stored, or the format it is represented in, a case must store the information that is relevant to the purpose of the system and that will ensure the most appropriate case is retrieved in each new situation. Thus the cases have to include those features that will ensure the case is retrieved in the most appropriate contexts.
In many CBR systems, not all existing cases need to be stored. In such systems, criteria are needed to decide which cases will be stored and which will be discarded. Where two or more cases are very similar, only one of them may need to be stored. Alternatively, it may be possible to create an artificial case that is a generalization of two or more actual incidents or problems. By creating generalized cases, the most important aspects of a case need only be stored once.
When choosing a representation format for a case, there are many choices and many factors to consider. Examples of representation formats that may be used include database formats, frames, objects, and semantic networks.
Whatever format the cases are represented in, the collection of cases itself has to be structured in some way to facilitate the retrieval of the appropriate case when queried. Numerous approaches have been used for this. A flat case base is a common structure; in this method, indices are chosen to represent the important aspects of each case, and retrieval involves comparing the current case's features to every case in the case base. In our work the diabetes dataset is stored in the form of a flat file containing nine fields for the patient's input and output parameters. Another common structure is a hierarchical case base, which groups cases to reduce the number that has to be searched.
Case Indexing
Case indexing refers to assigning indices to cases for future retrieval and comparison. The choice of indices is important to being able to retrieve the right case at the right time, because the indices of a case determine the contexts in which it will be retrieved in the future. There are some guidelines for choosing indices. Indices must be predictive, and predictive in a useful manner: they should reflect the important aspects of the case, the attributes that influenced its outcome, and the circumstances in which it is expected to be retrieved in the future. Indices should be abstract enough to allow the case's retrieval in all the circumstances in which it will be useful, but not too abstract. When a case's indices are too abstract, the case may be retrieved in too many situations, or too much processing may be required to match cases.
Case Retrieval
Case retrieval is the process of finding within the case base those cases that are the closest to the current case. To carry out case retrieval there must be criteria that determine how a case is judged to be appropriate for retrieval and a mechanism to control how the case base is searched. The selection criteria are necessary to decide which case is the best one to retrieve, that is, to determine how close the current and stored cases are.
This criterion depends in part on what the case retriever is searching for. Most often the case retriever searches for an entire case, whose features will be compared to the current query case. There are, however, times when only a portion of a case is required. This may be because no full case exists and a solution is being built by selecting portions of multiple cases, or because a retrieved case is being modified by adapting a portion of another case in the case base. The actual processes involved in retrieving a case from the case base depend very much on the memory model and indexing procedures used.
Nearest Neighbor Retrieval based on Euclidean distance
Euclidean distance is used to retrieve all nearby cases similar to the current user case.
If u = (x1, y1) and v = (x2, y2), then the Euclidean distance between u and v is

d(u, v) = √((x1 − x2)² + (y1 − y2)²)
We considered all the parameters to have equal weight. When a new input case arrives, we retrieve all the nearby past cases based on the distance value calculated using the Euclidean distance. In our research we use a fixed distance threshold (e.g. 1.5), and all the cases whose distance values are less than or equal to this threshold are retrieved for training the artificial neural network.
Instead of retrieving all the past cases, only the cases that lie inside the fixed distance boundary are retrieved and sent as training samples to the feed-forward backpropagation neural network.
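The threshold-based retrieval described above can be sketched as follows; the function name, case values and two-parameter cases are illustrative only, not the actual nine-field diabetes records.

```python
import numpy as np

def retrieve_similar_cases(query, case_base, threshold=1.5):
    """Return all stored cases whose Euclidean distance to the query case
    is <= threshold. All parameters carry equal weight, as in the text."""
    query = np.asarray(query, dtype=float)
    cases = np.asarray(case_base, dtype=float)
    dists = np.sqrt(((cases - query) ** 2).sum(axis=1))
    return cases[dists <= threshold]

# Toy case base with two parameters per case (hypothetical values).
case_base = [[1.0, 2.0], [2.0, 3.5], [8.0, 9.0]]
similar = retrieve_similar_cases([1.5, 2.5], case_base, threshold=1.5)
print(len(similar))  # -> 2: only the two nearby cases fall inside the boundary
```

The retrieved subset, rather than the full case base, is then handed to the neural network as its training set.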
4.3 Introduction to the Pima Indian Diabetes Dataset
This dataset was originally donated by Vincent Sigillito, Applied Physics Laboratory, Johns Hopkins University, Laurel, MD 20707. It was selected from a larger database held by the National Institute of Diabetes and Digestive and Kidney Diseases, and it is publicly available in the UCI machine learning repository. All patients represented in this dataset are females at least 21 years old, of Pima Indian heritage, living near Phoenix, Arizona, USA. The dataset contains 8 input variables and a single output variable called class. A class value of 1 means the patient tested positive for diabetes, and 0 means the patient tested negative.
4.4 Literature Background
The Pima Indian Diabetes dataset is very difficult to classify, and a great deal of research has been done on it to improve classification accuracy. Michie, Spiegelhalter and Taylor used different machine learning methods to classify the Pima Indian Diabetes dataset [15]. Table 4.1 shows the names of the applied machine learning algorithms and the classification accuracies calculated on the Pima Indian Diabetes dataset.
Table 4.1 Michie, Spiegelhalter and Taylor Classification Result on Pima
Indian Diabetes Dataset
Sr. No   Algorithm   Correct Classification (%)   Misclassification (Error Rate %)
1 Discrim 77.5 22.5
2 Quadisc 73.8 26.2
3 Logdisc 77.7 22.3
4 SMART 76.8 23.2
5 ALLOC80 69.9 30.1
6 K-NN 67.6 32.4
7 CASTLE 74.2 25.8
8 CART 74.5 25.5
9 IndCART 72.9 27.1
10 NewID 71.1 28.9
11 AC2 72.4 27.6
12 Baytree 72.9 27.1
13 NaiveBay 73.8 26.2
14 CN2 71.1 28.9
15 C4.5 73 27
16 Itrule 75.5 24.5
17 Cal5 75 25
18 Kohonen 72.7 27.3
19 DIPOL92 77.6 22.4
20 Backprop 75.2 24.8
21 RBF 75.7 24.3
22 LVQ 72.8 27.2
Average 73.80 26.20
Among the 22 machine learning algorithms used, Logdisc was the most impressive: it gave 77.7% correct classification and 22.3% misclassification. K-NN gave the poorest performance, with 67.6% correct classification and 32.4% misclassification. The two most important artificial neural network algorithms, Backprop and RBF, gave 75.2% and 75.7% correct classification, with 24.8% and 24.3% misclassification respectively. Much research has been done with artificial neural networks to improve classification accuracy on the Pima Indian Diabetes dataset. Jeatrakul and Wong carried out a comparative study of the performance of different neural networks on the Pima Indian Diabetes dataset [17]. They used 5 different types of neural network architectures, namely the Backpropagation Neural Network (BPNN), Radial Basis Function Neural Network (RBFNN), General Regression Neural Network (GRNN), Probabilistic Neural Network (PNN) and Complementary Neural Network (CMTNN). Table 4.2 shows the classification performance of the BPNN, RBFNN, GRNN, PNN and CMTNN.
Table 4.2 Jeatrakul and Wong Classification Result on the Pima Indian Diabetes Dataset
Test No. BPNN GRNN RBFNN PNN CMTNN
1 77.27 74.68 79.22 74.68 77.92
2 76.62 79.87 79.22 79.87 76.62
3 70.13 70.13 74.03 70.13 72.08
4 85.71 81.82 79.22 81.82 83.77
5 75.97 75.97 77.27 75.97 75.32
6 70.78 70.13 72.08 70.13 72.08
7 75.32 72.73 76.62 72.73 75.97
8 79.22 78.57 77.27 78.57 79.22
9 74.68 74.68 76.62 74.68 75.32
10 75.97 74.03 74.03 74.03 76.62
Average 76.17 75.26 76.56 75.26 76.49
Estebanez, Alter and Valls used genetic-programming-based data projections for classification tasks [20]. They used the Pima Indian Diabetes dataset in their research and reduced the input dimension from 8 to 3. They applied the Support Vector Machine (SVM), Simple Logistics and Multilayer Perceptron algorithms to the Pima Indian Diabetes data for classification. Their results are shown in Table 4.3.
Table 4.3 Estebanez, Alter and Valls Classification Result on Pima Indian Diabetes Dataset
Sr. No.   Algorithm   Classification Performance (%)
1 SVM 77.21
2 Simple Logistics 77.86
3 Multilayer Perceptron 76.69
The Multilayer Perceptron from the artificial neural network family gave 76.69% classification performance, while Simple Logistics gave the maximum performance of 77.86%. Lena Kallin Westin, in her paper on missing data and the preprocessing perceptron, discussed different preprocessing methods for handling the missing data in the Pima Indian Diabetes dataset [18]. She developed a preprocessing perceptron to train a decision support system on the diabetes dataset; the trained decision support system gave an average classification performance of 79%. Bylander used naïve Bayes, decision trees and two types of belief networks on the Pima Indian Diabetes dataset [19]. Table 4.4 shows the various classification methods and the classification performance obtained by Bylander.
Table 4.4 Bylander Classification Performance on Pima Indian Diabetes Dataset
Sr. No. Method Accuracy
1 Belief Network(Laplace) 72.50%
2 Belief Network 72.30%
3 Decision Tree 72.00%
4 Naïve Bayes 71.50%
Misra and Dehuri, in their research paper "Functional Link Artificial Neural Network for Classification Task in Data Mining", created a Functional Link Artificial Neural Network (FLANN) and compared its classification performance with other machine learning algorithms [16]. Their FLANN gave 78.13% classification performance, while the MLP gave 75.2%. Table 4.5 gives the classification performance of different machine learning algorithms on the Pima Indian Diabetes dataset. K-NN from case-based reasoning can be used to retrieve similar past cases and remove outliers; through this, neural network classification performance can be improved [48].
Table 4.5 Misra and Dehuri Classification Performance on Pima Indian Diabetes Dataset
Sr. No.   Classification System   Accuracy (%)
1 NN 65.1
2 KNN 69.7
3 FSS 73.6
4 BSS 67.7
5 MFS1 68.5
6 MFS2 72.5
7 CART 74.5
8 C4.5 74.7
9 FID3.1 75.9
10 MLP 75.2
11 FLANN 78.13
4.5 Proposed Model and its Functioning
To increase the classification accuracy, we have proposed a hybrid machine learning algorithm using the Multilayer Perceptron from artificial neural networks and K-NN from case-based reasoning.
The Algorithm
1. Perform the preprocessing step for handling the missing values in the Pima Indian Diabetes dataset:
   i) Replace each missing value with its column's mean value, computed separately for each output class.
2. Divide the preprocessed dataset 80%/20% into a Training Dataset T1 and a Testing Dataset T2.
3. Train the artificial neural network system on Training Dataset T1 using the Backpropagation algorithm.
4. Train the case-based reasoning system on Training Dataset T1 using the K-Nearest Neighbor algorithm.
5. The ensemble system calculates the combined mean of the outputs of the ANN and CBR systems. Based on the calculated mean value it outputs either "Positive for Diabetes" or "Negative for Diabetes".
6. Use Testing Dataset T2 to calculate the classification performance of the proposed system.
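The steps above can be sketched end to end as follows, using scikit-learn stand-ins for the ANN (a backpropagation-trained MLP) and the CBR component (K-NN). The data here is synthetic and all sizes, labels and hyperparameters are illustrative; this is a sketch of the pipeline shape, not the thesis's actual implementation.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(12345)
X = rng.random((200, 8))                    # 8 input variables, as in the dataset
y = (X[:, 1] + X[:, 5] > 1.0).astype(int)   # synthetic class labels
X[rng.random(X.shape) < 0.05] = np.nan      # simulate missing values

# Step 1: replace missing values with the column mean, per output class.
for cls in (0, 1):
    rows = y == cls
    block = X[rows]
    col_means = np.nanmean(block, axis=0)
    idx = np.where(np.isnan(block))
    block[idx] = col_means[idx[1]]
    X[rows] = block

# Step 2: 80%/20% split into T1 and T2.
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.2, random_state=12345)

# Steps 3-4: train the ANN and the K-NN (CBR stand-in) on T1.
ann = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000,
                    random_state=0).fit(X1, y1)
knn = KNeighborsClassifier(n_neighbors=13).fit(X1, y1)

# Steps 5-6: average the two positive-class probabilities, cut off at 0.5.
p = (ann.predict_proba(X2)[:, 1] + knn.predict_proba(X2)[:, 1]) / 2
pred = (p >= 0.5).astype(int)
print("hybrid accuracy:", (pred == y2).mean())
```

The per-class mean imputation in step 1 mirrors the preprocessing rule of the algorithm; the 0.5 cut-off in the final step mirrors the ensemble rule of step 5.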
The block diagram of the Proposed Model is displayed in Figure-4.4.
Figure 4.4 Proposed Hybrid Machine Learning System for Medical Diagnosis
[Figure: the input test data is passed to both the artificial neural network system (trained using the Backpropagation algorithm) and the case-based reasoning system (trained using the K-NN algorithm); their calculated output probabilities are combined by the ensemble method (mean method) to produce the calculated output for the test data.]
The block diagram has two important trained machine learning systems. The first is the artificial neural network system, trained with the Backpropagation algorithm. The second is the case-based reasoning system, which uses the K-Nearest Neighbor algorithm. The total dataset contains 768 patient records, of which 614 (80%) are used for training and the remaining 154 (20%) for testing. At testing time, each new test record is passed through both the trained ANN and CBR systems, and both systems give their calculated output values, each between 0 and 1, to the ensemble module.
The ensemble module uses the mean method to combine the values from the ANN and CBR systems. Based on the calculated output value, the result will be either positive or negative for diabetes. We used a cut-off value of 0.5: if the ensemble module's calculated value is greater than or equal to 0.5, the output is "Positive for Diabetes"; otherwise it is "Negative for Diabetes".
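The mean-method cut-off described above amounts to a few lines; the function name here is hypothetical.

```python
def ensemble_diagnosis(ann_prob, cbr_prob, cutoff=0.5):
    """Mean-method ensemble: average the two system outputs (each in
    [0, 1]) and apply the cut-off to choose the diagnosis label."""
    mean_value = (ann_prob + cbr_prob) / 2.0
    if mean_value >= cutoff:
        return "Positive for Diabetes"
    return "Negative for Diabetes"

print(ensemble_diagnosis(0.7, 0.4))  # mean 0.55 -> Positive for Diabetes
print(ensemble_diagnosis(0.3, 0.4))  # mean 0.35 -> Negative for Diabetes
```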
Created Artificial Neural Network Structure
The artificial neural network uses a multilayer feed-forward architecture trained with the Backpropagation algorithm. The network has an input layer, a single hidden layer and an output layer. The input layer has 8 input nodes, the hidden layer has 5 neurons, and the output layer has a single neuron. The sigmoid function is used in both the hidden and output layers, and squared error is used as the cost function for adjusting the network weights.
[Figure: an 8-5-1 network with an input layer, one hidden layer and an output layer.]
Figure 4.5 Multilayer Feed-Forward Neural Network
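As an illustration, a forward pass through this 8-5-1 architecture can be sketched with randomly initialized weights; the names and values are hypothetical, not the trained weights reported later in Table 4.8.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_8_5_1(x, W1, b1, W2, b2):
    """Forward pass: 8 inputs -> 5 sigmoid hidden neurons -> 1 sigmoid
    output neuron. Shapes: W1 is 5x8, b1 is 5, W2 is 1x5, b2 is 1."""
    h = sigmoid(W1 @ x + b1)          # hidden layer activations
    return sigmoid(W2 @ h + b2)[0]    # scalar output in (0, 1)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 8)), np.zeros(5)
W2, b2 = rng.normal(size=(1, 5)), np.zeros(1)
out = forward_8_5_1(rng.random(8), W1, b1, W2, b2)
print(0.0 < out < 1.0)  # -> True: a sigmoid output always lies in (0, 1)
```

Backpropagation then adjusts W1, b1, W2 and b2 to reduce the squared error of this output against the class label.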
4.6 Experimental Results and their Explanation
We have conducted 10 different test cases from the Pima Indian Diabetes dataset based on different random samplings. Below, the ANN and K-NN are constructed for the first test case. The following paragraphs explain the structure, training and testing of the ANN, the K-NN and the hybrid method.
ANN Training Information
The total Pima Indian Diabetes dataset contains 768 patient records. We used an 8-5-1 ANN architecture for training and testing on the Pima Indian Diabetes dataset. We used random sampling with the random seed 12345 to select the samples for training and testing; 80% of the data is assigned to the training dataset and the remaining 20% to the testing dataset. The ANN architecture is represented in Figure 4.5, and Table 4.6 shows the dataset partition used for training and testing.
Table 4.6 ANN Training and Testing Dataset
Data source: Sheet1!$A$2:$I$769
Selected variables: preg Pg dbp skin insulin bmi pedig age class
Partitioning method: Randomly chosen
Random seed: 12345
# Training rows: 614
# Validation rows: 154
Training data (used for building the model): ['preprocessed_mean_separate_class.xls']'Data_Partition1'!$C$19:$J$632 (614 records)
Validation data: ['preprocessed_mean_separate_class.xls']'Data_Partition1'!$C$633:$J$786 (154 records)
Input variables normalized: Yes
ANN Network Parameters
The dataset has 8 input fields and 1 output field. The hidden layer has 5 nodes and the output layer has one node. We used squared error as the cost function and the standard sigmoid function as the activation function in the hidden and output layers. We trained the network for 200 epochs. Table 4.7 shows the ANN training functions and parameters.
# Input variables: 8
Input variables: preg Pg dbp skin insulin bmi pedig age
Output variable: class
# Hidden layers: 1
# Nodes in hidden layer 1: 5
Cost function: Squared error
Hidden layer sigmoid: Standard
Output layer sigmoid: Standard
# Epochs: 200
Step size for gradient descent: 0.1
Weight change momentum: 0.6
Error tolerance: 0.01
Weight decay: 0
Inter-Layer Node Connection Weights
The ANN has node connections between the input, hidden and output layers, and each connection has a weight. Table 4.8 shows the inter-layer connection weights between the input and hidden layers and between the hidden and output layers.
Table 4.8 ANN Inter Layer Connection Weights
Input Layer to Hidden Layer # 1
Hidden node   preg   Pg   dbp   skin   insulin   bmi   pedig   age   Bias
Node # 1 -4.36 -3.23 -0.18 -4.12 -5.17 -2.31 -2.59 3.57 -0.70
Node # 2 -1.85 -5.71 1.94 -0.36 -6.08 -1.25 -4.95 2.22 -2.51
Node # 3 -0.87 -0.35 0.27 0.08 2.68 0.00 -0.21 -4.84 -4.40
Node # 4 1.92 -1.25 0.25 5.21 -14.06 0.73 0.06 1.35 -1.72
Node # 5 -1.98 -0.29 1.14 -10.26 -1.52 -3.28 3.00 -3.61 -3.19
Hidden Layer # 1 to Output Layer
Output node   Node # 1   Node # 2   Node # 3   Node # 4   Node # 5   Bias
1   -2.96828   -2.69314   -6.6264   -5.84048   -3.81711   6.81824
0   2.96833   2.69316   6.62652   5.84058   3.81716   -6.81835
ANN Training Curve
We trained the ANN for 200 epochs. The initial training error rate was 36.6, and at the final epoch the error rate was 10.4. Figure 4.6 charts the error rate against the epoch number.
Figure 4.6 ANN Training Error Curve
Table 4.9 shows the classification confusion matrix and the error report for the training dataset, which contains 614 cases. Class 1 denotes "positive for diabetes" and class 0 denotes "negative for diabetes". The ANN system classifies 552 of the 614 cases correctly and misclassifies 62. The ANN's correct classification rate is 89.90% and its misclassification error rate is 10.10%.
ANN Training Data – Performance Report
Table 4.9 ANN Training Data Performance and Error Report
Classification Confusion Matrix
Actual Class   Predicted 1   Predicted 0
1   199   26
0   36   353
ANN Training Data – Error Report
Class   # Cases   # Errors   % Error
1   225   26   11.56
0   389   36   9.25
Overall   614   62   10.10
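The percentages in the report follow directly from the confusion matrix; as a quick check on the training-set figures:

```python
# Accuracy/error computation behind Table 4.9, from the confusion matrix
# reported in the text: (actual, predicted) -> count.
confusion = {("1", "1"): 199, ("1", "0"): 26,
             ("0", "1"): 36,  ("0", "0"): 353}

correct = confusion[("1", "1")] + confusion[("0", "0")]  # diagonal entries
total = sum(confusion.values())
accuracy = 100.0 * correct / total
error_rate = 100.0 - accuracy
print(f"{correct}/{total} correct, accuracy {accuracy:.2f}%, error {error_rate:.2f}%")
# -> 552/614 correct, accuracy 89.90%, error 10.10%
```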
Table 4.10 shows the classification confusion matrix and error report for the validation (testing) dataset, which contains 154 cases. Class 1 denotes "positive for diabetes" and class 0 denotes "negative for diabetes". The ANN system classifies 133 of the 154 cases correctly and misclassifies 21. The ANN's correct classification rate is 86.36% and its misclassification error rate is 13.64%.
ANN Validation Data – Classification Performance Report
Table 4.10 ANN Validation Data- Performance and Error Report
Classification Confusion Matrix
Actual Class   Predicted 1   Predicted 0
1   36   7
0   14   97
ANN Validation Data – Classification Error Report
Class   # Cases   # Errors   % Error
1   43   7   16.28
0   111   14   12.61
Overall   154   21   13.64
ANN Correct Classification and Misclassification Counts for 10 Different Test Cases
Table 4.11 reports the correct and misclassification counts for the 10 different test cases. Among them, test case 1 gave the minimum misclassification count, 21 out of 154 test records, while test cases 4, 7 and 9 gave the maximum, 28 out of 154. The average correct classification count was 128.9 out of 154 and the average misclassification count was 25.1.
Table 4.11 ANN Classification Performance for 10 different test cases
Test No.   Correct Classifications   Misclassifications
1 133 21
2 132 22
3 131 23
4 126 28
5 131 23
6 129 25
7 126 28
8 128 26
9 126 28
10 127 27
Average 128.9 25.1
ANN Correct Classification Accuracy and Misclassification Error Rate for 10 Different Test Cases
We converted the correct classification and misclassification counts into percentages. Table 4.12 shows the correct classification accuracy and misclassification error rate as percentages. Test case 1 gives the minimum misclassification error rate, 13.64%, while test cases 4, 7 and 9 give the maximum, 18.18%. Overall, the average correct classification accuracy was 83.70% and the average misclassification error rate was 16.30%.
Table 4.12 ANN Classification Performance Accuracy for 10 different test cases
Test No.   Correct Classification (%)   Misclassification (%)
1 86.36 13.64
2 85.71 14.29
3 85.06 14.94
4 81.82 18.18
5 85.06 14.94
6 83.77 16.23
7 81.82 18.18
8 83.12 16.88
9 81.82 18.18
10 82.47 17.53
Average 83.70 16.30
Performance Comparison of Earlier ANN Systems and Our ANN System
In the literature survey we found that four different research groups constructed artificial neural networks using the Backpropagation algorithm on the Pima Indian Diabetes dataset, and their classification performance was below 77%. The dataset contains two classes, Class 1 ("positive for diabetes") and Class 0 ("negative for diabetes"). We separated the dataset by class and replaced the missing values with the corresponding field's mean value. The Pima Indian Diabetes dataset has many missing values, and handling them plays a major role in classification performance. Training the neural network is also an important step, because a neural network has many local minima and finding the best solution is a trial-and-error process. Because of our effective handling of missing values and effective training, we reached a classification performance of 83.70%. Our neural network structure is 8-5-1.
87
5 hidden nodes we used in the hidden layer. The table 4.13 and the Figure 4.7 show
the effective classification performance of our neural network.
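The class-wise mean imputation described above can be sketched as follows. This is a minimal illustration with pandas; the toy values and the use of 0 as the missing-value marker are assumptions for the example, not the actual Pima records:

```python
import pandas as pd

# Toy stand-in for the Pima data: 0 marks a missing value and "class"
# is the diagnosis label (1 = +ive for Diabetes, 0 = -ive for Diabetes).
df = pd.DataFrame({
    "glucose": [148.0, 0.0, 183.0, 89.0, 0.0, 137.0],
    "class":   [1, 1, 1, 0, 0, 0],
})

# Replace each missing entry with the mean of the non-missing values of
# that field, computed separately for each class.
for col in ["glucose"]:
    for cls in df["class"].unique():
        in_class = df["class"] == cls
        class_mean = df.loc[in_class & (df[col] != 0), col].mean()
        df.loc[in_class & (df[col] == 0), col] = class_mean

print(df["glucose"].tolist())  # [148.0, 165.5, 183.0, 89.0, 113.0, 137.0]
```

Imputing per class (rather than over the whole column) keeps each class's attribute distribution intact, which is the point of the pre-processing step described above.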
Table 4.13 Performance Comparison between Different ANN Systems
S.No   ANN System Reference   Correct Classification Accuracy (%)   Misclassification Error Rate (%)
1 Michie, Spiegelhalter and Taylor 75.20 24.80
2 Jeatrakul and Wong 76.17 23.83
3 Estebanez, Alter and Valls 76.69 23.31
4 Misra and Dehuri 75.20 24.80
5 Our Method 83.70 16.30
Comparison Chart
Figure 4.7 Various ANN Classifiers Performance Comparisons
KNN Training Information
We carried out 10 different test cases using different random samples of the Pima Indian Diabetes Dataset. The tables below show the training of the KNN system and the best k value for test case 1.
Test Case 1 Training and Testing Information:
Best K Value Selection
During training, the system substitutes various values of k and calculates the corresponding training and validation (testing) error rates. We gave the input value k = 20 to the system, and it generated the training and testing errors for 20 different k values. The system automatically finds the best k value based on the testing error. For k = 13 and k = 20 the error rate is 10.39%, the minimum among the 20 k values, and the system chose the smaller value, k = 13, as the best k value. Table 4.14 below shows the training and testing error values for the different k values and the selection of the best k = 13.
Table 4.14 KNN Training and Testing Errors for Different K Values
Validation error log for different k (best k value = 13)
Value of k   % Error (Training)   % Error (Validation/Testing)
1 0.00 15.58
2 11.40 21.43
3 11.07 12.34
4 13.68 16.88
5 13.68 13.64
6 14.50 14.94
7 14.17 15.58
8 14.82 13.64
9 15.80 14.29
10 14.01 11.04
11 15.64 11.04
12 14.82 12.34
13 15.80 10.39 <--- Best k
14 15.31 11.69
15 16.61 11.69
16 16.12 12.34
17 16.45 11.69
18 15.96 11.69
19 16.45 11.04
20 16.12 10.39
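The best-k selection can be reproduced directly from the validation column of Table 4.14; a minimal sketch of the rule (smallest k attaining the minimum validation error):

```python
# Validation (testing) error rates in % for k = 1..20, from Table 4.14.
val_errors = [15.58, 21.43, 12.34, 16.88, 13.64, 14.94, 15.58, 13.64,
              14.29, 11.04, 11.04, 12.34, 10.39, 11.69, 11.69, 12.34,
              11.69, 11.69, 11.04, 10.39]

# The minimum error (10.39%) is attained at both k = 13 and k = 20;
# the system picks the smallest such k as the best k.
min_error = min(val_errors)
best_k = val_errors.index(min_error) + 1  # list index 0 corresponds to k = 1
print(best_k, min_error)  # 13 10.39
```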
Figure 4.8 below shows the training error curve for the different values of k. The training error is lowest, 0.00, at k = 1 and generally increases with k, reaching its maximum, 16.61, at k = 15.
Figure 4.8 K-NN Training Error Curve
Figure 4.9 KNN Validation Error Curve
Figure 4.9 above shows the validation error for the different values of k. For k = 13 and k = 20 it gives the minimum testing error, 10.39%. The value 13 is selected as the best k value and used during validation.
Classification Performance on Training Data
Table 4.15 below shows the classification confusion matrix for the training dataset, which contains 614 cases. Class 1 denotes "+ive for Diabetes" and class 0 denotes "-ive for Diabetes". The KNN system classifies 517 of the 614 cases correctly and misclassifies 97. The KNN correct classification rate is 84.20% and the misclassification error rate is 15.80%.
Training Data scoring - Summary Report (for k=13)
Table 4.15 KNN Training Data- performance and Error Report
Classification Confusion Matrix
                  Predicted Class
Actual Class        1      0
      1           170     55
      0            42    347
Error Report
Class # Cases # Errors % Error
1 225 55 24.44
0 389 42 10.80
Overall 614 97 15.80
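The per-class and overall error figures in Table 4.15 follow directly from the confusion matrix; a minimal sketch using the counts above:

```python
# Confusion matrix counts from Table 4.15 (actual class -> predicted 1/0).
confusion = {
    1: {1: 170, 0: 55},   # actual class 1: 170 correct, 55 errors
    0: {1: 42, 0: 347},   # actual class 0: 347 correct, 42 errors
}

total_cases = total_errors = 0
for actual, preds in confusion.items():
    cases = preds[1] + preds[0]
    errors = cases - preds[actual]      # off-diagonal count for this row
    total_cases += cases
    total_errors += errors
    print(f"Class {actual}: {cases} cases, {errors} errors, "
          f"{100.0 * errors / cases:.2f}% error")

overall_error = 100.0 * total_errors / total_cases
print(f"Overall: {total_cases} cases, {total_errors} errors, "
      f"{overall_error:.2f}% error")  # 614 cases, 97 errors, 15.80% error
```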
Classification Performance on Validation (Testing) Data
Table 4.16 below shows the classification confusion matrix for the validation (testing) dataset, which contains 154 cases. Class 1 denotes "+ive for Diabetes" and class 0 denotes "-ive for Diabetes". The KNN system classifies 138 of the 154 cases correctly and misclassifies 16. The KNN correct classification rate is 89.61% and the misclassification error rate is 10.39%.
Table 4.16 KNN Validation Data - Performance and Error Report
Classification Confusion Matrix
                  Predicted Class
Actual Class        1      0
      1            36      7
      0             9    102
Error Report
Class # Cases # Errors % Error
1 43 7 16.28
0 111 9 8.11
Overall 154 16 10.39
KNN Classifier's Correct and Misclassification Performance for the 10 Different Test Cases
Table 4.17 below reports the correct and misclassification counts for the 10 different test cases. Among the 10 cases, test case 1 gives the minimum misclassification number, 16 out of 154 test cases, and test case 9 gives the maximum, 32 out of 154. The average correct classification number is 129.8 out of 154 and the average misclassification number is 24.2.
Table 4.17 KNN Classification Performance for 10 Test Cases
KNN (CBR) Classifier's Total Correct Classification and Misclassification Numbers for 154 Test Cases
Test No.   Correct Classification Number   Misclassification Number
1 138 16
2 129 25
3 127 27
4 133 21
5 128 26
6 136 18
7 125 29
8 129 25
9 122 32
10 131 23
Average 129.8 24.2
We converted the correct classification and misclassification error counts into percentages. Table 4.17 below gives the correct classification accuracy and the misclassification error rate in percent (%). Test case 1 gives the minimum misclassification error rate, 10.39%, and test case 9 gives the maximum, 20.78%. Overall, the average correct classification accuracy was 84.29% and the average misclassification error rate was 15.71%.
Table 4.17 KNN Classification Performance Accuracy for 10 Test Cases
KNN (CBR) Classifier's Correct Classification Accuracy and Misclassification Error Rate for 154 Test Cases
Test No.   Correct Classification Accuracy (%)   Misclassification Error Rate (%)
1 89.61 10.39
2 83.77 16.23
3 82.47 17.53
4 86.36 13.64
5 83.12 16.88
6 88.31 11.69
7 81.17 18.83
8 83.77 16.23
9 79.22 20.78
10 85.06 14.94
Average 84.29 15.71
Performance Comparison of Earlier KNN Systems and Our KNN System
In the literature survey we found that 2 different researchers constructed Case Based Reasoning systems using the K-Nearest Neighbor algorithm on the Pima Indian Diabetes Dataset. Their classification performance was less than 70%. We separated the dataset by class and replaced the missing values with the corresponding field's mean value. The Pima Indian Diabetes Dataset has a lot of missing values, and handling them plays a major role in the classification performance. In the K-NN method, finding the best value of k also plays a major role in the classification performance; it is a trial and error process, but the XLMiner software has an option to find the best k value based on the training and testing datasets, which we used. Because of our effective handling of missing values and the best k value found through XLMiner, we reach a classification performance of 84.29%. Table 4.18 and Figure 4.10 below compare the classification performance of our case based reasoning system with the earlier case based reasoning systems on the Pima Indian Diabetes Dataset.
Table 4.18 Different KNN System Performances on the Pima Indian Diabetes Dataset
S.No   KNN System Reference   Correct Classification Accuracy (%)   Misclassification Error Rate (%)
1 Michie_Spiegelhalter 67.6 32.4
2 Jeatrakul_Wong 69.7 30.3
3 Our_Method 84.29 15.71
Figure 4.10 Different KNN Classifiers Performance on the Pima Indian Diabetes Dataset
Hybrid System Performance and Results
Following our proposed algorithm, we first constructed an Artificial Neural Network and a Case Based Reasoning system separately for the Pima Indian Diabetes Dataset. We divided the total dataset of 768 cases into a training dataset of 614 cases and a testing dataset of 154 cases. The same 614 training cases were used to train both the ANN and KNN systems, and the same testing dataset was used to test the classification performance of both. We took the average of the class probability values produced by the ANN and KNN systems and kept the cutoff probability value at 0.5 for class 1: if the proposed method's average probability for a test case is greater than or equal to 0.5, it is assigned to class 1; otherwise to class 0. We conducted 10 different tests of the proposed model; the results are presented in Table 4.19.
Table 4.19 Proposed Hybrid Machine Learning Model Classification Results
Test No.   PM-MCN   Correct Classification (%)   Misclassification Error (%)
1 18 88.31 11.69
2 21 86.36 13.64
3 25 83.77 16.23
4 28 81.82 18.18
5 24 84.42 15.58
6 21 86.36 13.64
7 26 83.12 16.88
8 26 83.12 16.88
9 31 79.87 20.13
10 24 84.42 15.58
Average 24 84.16 15.84
Note: PM-MCN – Proposed Model Misclassification Number.
Below we give the detailed result calculations for the first 5 tests of Table 4.19, in Tables 4.20 to 4.24. The remaining test results are attached in Appendix C.
Threshold / Cutoff Value for the Hybrid Model
We used the cutoff (threshold) value 0.5 for the combined ensemble method. We add the ANN probability of success for class 1 (+ive for Diabetes) and the KNN (CBR) probability of success for class 1 and take the average probability value for class 1. If the average probability value is greater than or equal to 0.5 (the cutoff), the result is "+ive for Diabetes"; otherwise the result is "-ive for Diabetes".
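The combination rule described above can be sketched as a small function. This is a minimal illustration; the function name is ours, and the example probabilities are taken from rows of Table 4.20:

```python
# Average the ANN and CBR (KNN) probabilities for class 1 and apply the
# 0.5 cutoff used by the proposed hybrid model.
def hybrid_predict(ann_prob, cbr_prob, cutoff=0.5):
    avg = (ann_prob + cbr_prob) / 2.0
    label = 1 if avg >= cutoff else 0  # 1 = "+ive for Diabetes"
    return avg, label

# Rows 2 and 7 of Table 4.20 (Test No. 1):
avg, label = hybrid_predict(0.99, 0.85)
print(round(avg, 2), label)  # 0.92 1  -> classified +ive for Diabetes
avg, label = hybrid_predict(0.01, 0.31)
print(round(avg, 2), label)  # 0.16 0  -> classified -ive for Diabetes
```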
Result Calculation for Test. No 1
Table 4.20 Output Calculation for Test. No.1
Row Id.   Actual Class   ANN Prob. for 1 (Success)   CBR Prob. for 1 (Success)   Ensemble Average   PM Result
2 1 0.99 0.85 0.92 Yes
7 0 0.01 0.31 0.16 Yes
8 0 0.00 0.23 0.12 Yes
13 1 0.18 0.23 0.20 No
22 0 0.00 0.15 0.08 Yes
24 0 0.09 0.77 0.43 Yes
25 0 0.15 0.85 0.50 No
28 1 1.00 0.85 0.92 Yes
33 0 0.01 0.15 0.08 Yes
40 0 0.00 0.08 0.04 Yes
46 0 0.01 0.38 0.20 Yes
59 1 0.36 0.62 0.49 No
61 0 0.32 0.15 0.24 Yes
70 1 1.00 0.62 0.81 Yes
79 0 0.12 0.46 0.29 Yes
80 0 0.85 0.77 0.81 No
83 1 0.55 0.46 0.50 Yes
86 0 0.00 0.00 0.00 Yes
93 0 0.01 0.31 0.16 Yes
103 1 0.00 0.00 0.00 No
109 0 0.00 0.15 0.08 Yes
117 0 0.01 0.31 0.16 Yes
118 0 0.72 0.69 0.71 No
120 0 0.01 0.08 0.04 Yes
Note: PM Result – Proposed Model Result. Yes – the proposed model classified the case correctly; No – the proposed model misclassified the case.
Result Calculation for Test. No 2
Table 4.21 Output Calculation for Test. No.2
Row Id.   Actual Class   ANN Prob. for 1 (Success)   CBR Prob. for 1 (Success)   Ensemble Average   PM Result
3 1 0.97 0.93 0.95 Yes
7 0 0.08 0.36 0.22 Yes
9 0 0.31 0.57 0.44 Yes
11 1 0.96 0.71 0.84 Yes
33 0 0.36 0.07 0.21 Yes
34 0 0.33 0.07 0.20 Yes
38 1 0.81 0.50 0.66 Yes
45 1 0.88 0.86 0.87 Yes
50 1 1.00 0.71 0.85 Yes
53 0 0.54 0.57 0.56 No
54 1 1.00 1.00 1.00 Yes
62 0 0.27 0.07 0.17 Yes
63 0 0.00 0.00 0.00 Yes
72 1 0.99 0.71 0.85 Yes
74 0 0.52 0.29 0.40 Yes
83 1 0.54 0.57 0.56 Yes
91 0 0.01 0.07 0.04 Yes
100 1 0.79 0.86 0.83 Yes
101 1 0.99 0.64 0.82 Yes
102 1 0.99 0.64 0.82 Yes
105 1 0.65 0.71 0.68 Yes
112 0 0.42 0.43 0.43 Yes
115 0 0.21 0.71 0.46 Yes
122 0 0.25 0.43 0.34 Yes
Result Calculation for Test. No 3
Table 4.22 Output Calculation for Test. No.3
Row Id.   Actual Class   ANN Prob. for 1 (Success)   CBR Prob. for 1 (Success)   Ensemble Average   PM Result
4 0 0.99 0.78 0.88 No
17 0 0.66 0.56 0.61 No
19 1 0.13 0.56 0.34 No
21 0 0.01 0.00 0.00 Yes
33 0 0.09 0.00 0.04 Yes
37 1 1.00 0.78 0.89 Yes
38 1 0.02 0.44 0.23 No
43 1 0.97 0.67 0.82 Yes
59 1 0.69 0.56 0.62 Yes
85 0 0.13 0.33 0.23 Yes
91 0 0.00 0.00 0.00 Yes
93 0 0.04 0.22 0.13 Yes
99 1 0.84 0.78 0.81 Yes
100 1 0.66 1.00 0.83 Yes
101 1 1.00 0.67 0.83 Yes
112 0 0.01 0.22 0.12 Yes
115 0 0.17 0.56 0.36 Yes
118 0 0.07 0.67 0.37 Yes
119 0 0.01 0.22 0.12 Yes
132 1 0.96 0.56 0.76 Yes
134 1 0.99 0.44 0.72 Yes
139 0 0.00 0.11 0.06 Yes
140 1 0.85 0.11 0.48 No
141 0 0.02 0.11 0.07 Yes
Result Calculation for Test. No 4
Table 4.23 Output Calculation for Test. No.4
Row Id.   Actual Class   ANN Prob. for 1 (Success)   CBR Prob. for 1 (Success)   Ensemble Average   PM Result
2 1 0.99 0.80 0.90 Yes
11 1 1.00 0.80 0.90 Yes
15 1 1.00 0.40 0.70 Yes
23 1 1.00 0.60 0.80 Yes
25 0 1.00 0.70 0.85 No
45 1 0.97 0.90 0.94 Yes
46 0 0.21 0.20 0.20 Yes
67 0 0.00 0.10 0.05 Yes
70 1 0.98 0.60 0.79 Yes
81 1 1.00 0.90 0.95 Yes
82 1 1.00 0.80 0.90 Yes
84 0 0.07 0.00 0.03 Yes
86 0 0.00 0.00 0.00 Yes
87 0 0.00 0.00 0.00 Yes
88 0 0.01 0.00 0.00 Yes
91 0 0.00 0.00 0.00 Yes
102 1 1.00 0.70 0.85 Yes
104 1 1.00 0.80 0.90 Yes
108 1 0.05 0.00 0.03 No
110 1 1.00 0.70 0.85 Yes
113 1 1.00 1.00 1.00 Yes
117 0 0.08 0.20 0.14 Yes
127 1 1.00 1.00 1.00 Yes
134 1 0.77 0.60 0.69 Yes
Result Calculation for Test. No 5
Table 4.24 Output Calculation for Test. No.5
Row Id.   Actual Class   ANN Prob. for 1 (Success)   CBR Prob. for 1 (Success)   Ensemble Average   PM Result
8 0 0.00 0.08 0.04 Yes
18 1 0.53 0.67 0.60 Yes
20 0 0.00 0.00 0.00 Yes
30 0 0.68 0.42 0.55 No
40 0 0.01 0.00 0.00 Yes
46 0 0.04 0.33 0.19 Yes
50 1 1.00 0.67 0.83 Yes
51 1 0.88 0.92 0.90 Yes
58 0 0.43 0.42 0.42 Yes
62 0 0.94 0.08 0.51 No
65 0 0.06 0.42 0.24 Yes
73 1 0.85 0.75 0.80 Yes
77 0 0.86 0.92 0.89 No
82 1 0.90 0.83 0.87 Yes
86 0 0.00 0.00 0.00 Yes
100 1 0.19 0.67 0.43 No
101 1 0.94 0.75 0.84 Yes
103 1 0.00 0.00 0.00 No
106 0 0.09 0.00 0.04 Yes
107 1 0.94 0.67 0.80 Yes
109 0 0.02 0.08 0.05 Yes
112 0 0.62 0.50 0.56 No
125 1 0.87 0.42 0.64 Yes
129 1 0.41 0.33 0.37 No
4.7 Comparing Proposed Model Results with Earlier Results
Table 4.25 Comparing Proposed Model Results with Table-4.1 Result
Sr. No.   Algorithm   Correct Classification (%)   Misclassification (Error Rate, %)
1 Discrim 77.5 22.5
2 Quadisc 73.8 26.2
3 Logdisc 77.7 22.3
4 SMART 76.8 23.2
5 ALLOC80 69.9 30.1
6 K-NN 67.6 32.4
7 CASTLE 74.2 25.8
8 CART 74.5 25.5
9 IndCART 72.9 27.1
10 NewID 71.1 28.9
11 AC2 72.4 27.6
12 Baytree 72.9 27.1
13 NaiveBay 73.8 26.2
14 CN2 71.1 28.9
15 C4.5 73 27
16 Itrule 75.5 24.5
17 Cal5 75 25
18 Kohonen 72.7 27.3
19 DIPOL92 77.6 22.4
20 Backprop 75.2 24.8
21 RBF 75.7 24.3
22 LVQ 72.8 27.2
23 PM (Ensemble Method) 84.16 15.84
Comparison Chart
Figure 4.11 Comparing PM Result with Table-4.1 Results
The above table 4.25 and figure 4.11 compare the classification rates of the individual machine learning methods with the proposed ensemble model. The proposed model gives a correct classification performance of 84.16%, while among the other machine learning algorithms LogDisc gives at most 77.7%.
Comparing Proposed Model Results with Table-4.2 Result
Table-4.2 shows the classification performance of 5 different artificial neural network architectures on the Pima Indian Diabetes dataset. The Complementary Neural Network (CMTNN) gives the maximum classification performance, 76.49%, while our proposed system gives 84.16% on the same dataset. This is represented in Table-4.26 and Figure 4.12.
Table 4.26 Comparing proposed Model Result with Table-4.2 Result
Sr. No.   Model Type   Classification Performance (%)
1 BPNN 76.17
2 GRNN 75.26
3 RBFNN 76.56
4 PNN 75.26
5 CMTNN 76.49
6 Proposed Model 84.16
Comparison Chart
Figure 4.12 Chart shows comparison between PM Result and Table-4.2 Result
Comparing Proposed Model Results with Table-4.3 Result
Table-4.3 shows 3 machine learning methods, SVM, Simple Logistics and Multilayer Perceptron, and their classification performance on the Pima Indian Diabetes dataset. Among the 3 algorithms, Simple Logistics gives the maximum classification performance, 77.86%, while our proposed ensemble model gives 84.16%. This is represented in Table-4.27 and Figure 4.13.
Table 4.27 Comparing Proposed Model Result with Table-4.3 Result
Algorithm   Classification Performance (%)
SVM 77.21
Simple Logistics 77.86
Multilayer Perceptron 76.69
PM (Ensemble Method) 84.16
Comparisons Chart
Figure 4.13 Chart shows the Comparison between PM result and Table-4.3
result
Comparing our Proposed Model Result with Bylander Classification Table-4.4 Result
Table 4.28 Comparing Proposed Model Result with Table-4.4 Result
Method Accuracy
Belief Network(Laplace) 72.50%
Belief Network 72.30%
Decision Tree 72%
Naïve Bayes 71.50%
Proposed Model 84.16%
Bylander applied 4 machine learning methods to the Pima Indian Diabetes dataset and obtained a maximum classification performance of 72.50%, using the Belief Network (Laplace). Our proposed method gives 84.16% classification performance on the diabetes dataset, nearly 12% more than the Bylander result. This is represented in Table 4.28 and Figure 4.14.
Comparisons Chart
Figure 4.14 Comparing PM Classification Performances with Bylander result
Comparing Proposed Model result with Misra and Dehuri Table-4.5 result
Table 4.29 Comparing Proposed Model Result with Table-4.5 Result
Classification Systems Name Classification Accuracy
NN 65.1
KNN 69.7
FSS 73.6
BSS 67.7
MFS1 68.5
MFS2 72.5
CART 74.5
C4.5 74.7
FID3.1 75.9
MLP 75.2
FLANN 78.13
Proposed Model 84.16
Misra and Dehuri constructed a Functional Link Artificial Neural Network (FLANN), which gave 78.13% classification accuracy on the Pima Indian Diabetes dataset. Our proposed model gives 84.16% classification accuracy, nearly 6% more than the FLANN classification performance. This is represented in Table 4.29 and Figure 4.15.
Comparisons Chart
Figure 4.15 Performance Comparisons between FLANN and Proposed Model
4.8 Chapter Summary
Diabetes is the fourth biggest cause of death worldwide, particularly in the industrialized and developing countries. A lot of research has been done to classify the diabetes dataset and to improve the classification performance using individual machine learning methods. In this chapter we performed data pre-processing to handle the missing data: the Pima Indian Diabetes dataset has a lot of missing values, and we replaced them with the corresponding field's mean value, computed separately for each class.
Ours is the first hybrid method for classifying the Pima Indian Diabetes Dataset. Following our proposed algorithm, we first developed an Artificial Neural Network, which gave 83.70% classification performance, nearly 7% more than the previous ANN systems. Next we developed a Case Based Reasoning system based on the K-Nearest Neighbor method, which gave 84.29% classification performance, nearly 14% more than the previous KNN classifiers. The hybrid system combines the outputs of both the ANN and KNN systems by averaging their class probability values. Over the 10 test cases conducted on the Pima Indian Diabetes dataset, our proposed ensemble method gave 84.16% classification accuracy, nearly 7% more than the earlier methods. Through hybrid machine learning we improved both the classification performance and the reliability of the system, the two prime objectives of our research.