CHAPTER – IV
HYBRID MACHINE LEARNING FOR IMPROVING
CLASSIFICATION ACCURACY
4.1.1 Hybrid Machine Learning
Machine learning algorithms are used in data mining applications to extract hidden information that can support good decision-making. Machine learning encompasses a variety of techniques, such as rule-based learning, case-based reasoning, artificial neural networks and decision trees. Each technique has its own advantages and disadvantages. In the past, many hybrid machine learning systems were developed to combine the best of two different machine learning methods. For example, Rohit and Kumkum created a hybrid machine learning system based on genetic algorithms and support vector machines for stock market prediction [8]. Nerijis, Ignas and Vida developed a hybrid machine learning approach for text categorization using decision trees and artificial neural networks [8]. Sankar developed an integrated data mining approach for maintenance scheduling using case-based reasoning and artificial neural networks [9], and Mammone developed a hybrid machine learning system combining a neural network and a decision tree. In our research we combine an artificial neural network with the case-based reasoning technique. Artificial neural networks give good classification accuracy compared to other machine learning techniques such as rule-based systems and decision trees. Through this hybrid machine learning approach we aim to improve the classification accuracy of the neural network system so that it can be used for medical diagnosis.
4.1.2 Improving Classification Accuracy
For a machine learning system to be useful in solving medical diagnostic tasks, it has to satisfy certain desirable features: good performance, transparency of diagnostic knowledge, the ability to explain decisions, and the ability to reduce the number of tests needed to obtain a reliable diagnosis. To give good performance, the classification system has to achieve diagnostic accuracy on new cases that is as high as possible. Nowadays, to get the best performance, several learning algorithms are tested on the available dataset and the best one or two are selected for diagnosis. Artificial neural networks consistently give good performance in medical diagnosis, yet it is still desirable to improve their accuracy. To achieve this, we combine the neural network with the case-based reasoning approach.
4.2.1 Importance of Case Based Reasoning
Case-based reasoning is closely related to human reasoning. In many situations, the problems humans encounter are solved with a human equivalent of case-based reasoning. When a person encounters a previously unencountered situation or problem, he refers to a past experience of a similar problem. This similar, previous experience may be one he himself or someone else has experienced. If it is someone else's experience, the case will have been added to the reasoner's memory via an oral or written account of that experience [10].
In medical diagnosis, when a patient comes to the doctor, the doctor examines the patient and immediately recollects similar past cases. Because similar cases have similar answers, the doctor looks for similar past cases to treat the present one. Sometimes the retrieved case may be fully or only partially like the new case. If it is fully similar, the corresponding solution may be reused; otherwise the solution has to be modified to fit the new case. In our research, when an input case arrives, the case-based reasoning system retrieves the nearby similar cases and gives them as training samples to the neural network. A neural network's classification performance depends heavily on the quality of the samples used during training. The training samples should be valid and relevant; even if the number of training samples is large, unrelated samples will hurt the classification performance of the neural network. One of the main challenges of data mining applications is data growth: a medical dataset may contain a very large number of patient records. Instead of feeding a large volume of irrelevant data, it is always better to supply meaningful, related data for training. This helps the neural network train properly and classify user input more accurately.
4.2.2 Case Based Reasoning Approach
CBR System
Case-based reasoning is a methodology for solving problems by utilizing previous experiences. It involves retaining a memory of previous problems and their solutions and using it to solve new problems. When presented with a problem, a case-based reasoner searches its memory of past cases and attempts to find a case that has the same problem specification as the current case. If the reasoner cannot find an identical case in its case base, it will attempt to find the case or cases that most closely match the current query case.
In the situation where a previous identical case is retrieved, and presuming its solution was successful, it can be returned as the current problem's solution. In the more likely case that the retrieved case is not identical to the current case, an adaptation phase occurs. In adaptation, the differences between the current case and the retrieved case must first be identified, and then the solution associated with the retrieved case must be modified to take these differences into account. The solution returned in response to the current problem specification may then be tried in the appropriate domain setting.
A case-based reasoning (CBR) system incorporates the reasoning mechanism and external facets such as the input specification, the output (suggested solution), and the memory of past cases referenced by the reasoning mechanism. It is represented in Figure 4.1.
CBR System Internal Structure
A CBR system has an internal structure divided into two major parts, called the case retriever and the case reasoner, as shown in Figure 4.2. The case retriever's job is to find the appropriate cases in the case base for the given input case.
[Figure: a problem case passes through the case-based reasoning mechanism, which consults the case base, to produce a derived solution.]
Figure 4.1 CBR System [10]
[Figure: the problem case passes through the case retriever and then the case reasoner, both consulting the case base, to produce a derived solution.]
Figure 4.2 Two Major Components of a CBR System
The case reasoner uses the retrieved cases to find a solution to the given input case. This reasoning generally involves both determining the differences between the retrieved cases and the current input case, and modifying the retrieved solution appropriately to reflect those differences. The case reasoner may or may not consult the case base for related cases while constructing the solution.
A case can take the form of a record containing all the relevant information about a previous experience or problem. The information recorded about this past experience depends on the domain of the reasoner and the purpose to which the case will be put. In a problem-solving CBR system, the details will usually include the specification of the problem and the relevant attributes of the environment, that is, the circumstances of the problem. The other important part of the case is the solution that was applied in the previous situation. Depending on how the CBR system reasons with cases, this solution may include only the facts of the solution or, additionally, the steps or processes involved in obtaining it. It is also important to include the achieved measure of success in the case description if the cases in the case base have achieved different degrees of success or failure.
If the problem domain lacks a fundamental model, has exceptions and novel cases, or if similar cases are likely to occur frequently, then the case-based reasoning technique has a greater chance of being applicable. Reducing the knowledge acquisition task, avoiding the repetition of past mistakes, graceful degradation of performance, the ability to reason in a domain with only a small body of knowledge, and the ability to learn over time are some of the important reasons for using case-based reasoning.
Case Representation
Cases in a case base can represent many different types of knowledge and store it in many different representational formats. The objective of a system greatly influences what is stored. A case-based reasoning system may be aimed at the creation of a new design or plan, the diagnosis of a new problem, or the argument of a point of view with precedents. In each type of system a case may represent something different: cases could be people, objects, situations, diagnoses, designs, plans or rulings, among others.
In many practical CBR applications, cases are usually represented as two unstructured sets of attribute-value pairs, i.e. the problem and solution features. However, deciding what to represent can be one of the most difficult decisions to make. For example, in a medical CBR system that diagnoses a patient, a case could represent an individual's entire case history or be limited to a single visit to a doctor. In the latter situation the case may be a set of symptoms along with the diagnosis; it may also include the treatment. If a case is a person, then a more complete model is being used, as this can incorporate the change of symptoms from one visit to the next. It is, however, harder to use cases in this format to search for a particular set of symptoms in a current problem and obtain a diagnosis or treatment. Alternatively, if a case is a single visit to the doctor, involving the symptoms at the time of that visit and their diagnosis, then the changes in symptoms
that might be a useful key in solving a problem may be missed. Cases may need to be broken down into sub-cases. For example, a case could be a person's medical history, with all visits made to the doctor as sub-cases. A sample case structure is represented in Figure 4.3.
[Figure: a patient case record with attributes Age, Height and Weight, and sub-cases Visit 1 (Symptom 1, Symptom 2, Diagnosis, Treatment), Visit 2 and Visit 3.]
Figure 4.3 A Patient Case Record
No matter what the case actually represents as a whole, the features of it have to be represented in some format. One of the advantages of case-based reasoning is the flexibility it has in this regard. Depending on what types of features have to be represented, an appropriate implementation platform can be chosen. Ranging from simple Boolean, numeric and textual data to binary files, time dependent data, and relationships between data, CBR can be made to reason with all of them.
No matter what is stored, or the format it is represented in, a case must store the information that is relevant to the purpose of the system and that will ensure the most appropriate case is retrieved in each new situation. Thus the cases have to include those features that will ensure the case is retrieved in the most appropriate contexts.
In many CBR systems, not all existing cases need to be stored. In such systems, criteria are needed to decide which cases will be stored and which will be discarded. Where two or more cases are very similar, only one of them may need to be stored. Alternatively, it may be possible to create an artificial case that is a generalization of two or more actual incidents or problems. By creating generalized cases, the most important aspects of a case need only be stored once.
When choosing a representation format for a case, there are many choices and many factors to consider. Examples of representation formats that may be used include database formats, frames, objects, and semantic networks.
Whatever format the cases are represented in, the collection of cases itself has to be structured in some way to facilitate the retrieval of the appropriate case when queried. Numerous approaches have been used for this. A flat case base is a common structure; in this method, indices are chosen to represent the important aspects of each case, and retrieval involves comparing the current case's features to every case in the case base. In our work the diabetes dataset is stored in the form of a flat file containing nine fields for the patient's input and output parameters. Another common structure is a hierarchical case base, which groups cases to reduce the number that has to be searched.
Case Indexing
Case indexing refers to assigning indices to cases for future retrieval and comparison. The choice of indices is important to being able to retrieve the right case at the right time, because the indices of a case determine the contexts in which it will be retrieved in the future. There are some guidelines for choosing indices. Indices must be predictive, and predictive in a useful manner: they should reflect the important aspects of the case, the attributes that influenced its outcome, and the circumstances in which it is expected to be retrieved in the future. Indices should be abstract enough to allow the case's retrieval in all the circumstances in which it will be useful, but not too abstract. When a case's indices are too abstract, the case may be retrieved in too many situations, or too much processing may be required to match cases.
Case Retrieval
Case retrieval is the process of finding within the case base those cases that are the closest to the current case. To carry out case retrieval there must be criteria that determine how a case is judged to be appropriate for retrieval and a mechanism to control how the case base is searched. The selection criteria are necessary to decide which case is the best one to retrieve, that is, to determine how close the current and stored cases are.
This criterion depends in part on what the case retriever is searching for. Most often the case retriever searches for an entire case, whose features will be compared to the current query case. There are, however, times when only a portion of a case is required. This may be because no full case exists and a solution is being built by selecting portions of multiple cases, or because a retrieved case is being modified by adapting a portion of another case in the case base. The actual processes involved in retrieving a case from the case base depend very much on the memory model and indexing procedures used.
Nearest Neighbor Retrieval based on Euclidean distance
Euclidean distance is used to retrieve all nearby cases similar to the current user case.
If u = (x1, y1) and v = (x2, y2), then the Euclidean distance between u and v is

d(u, v) = √((x1 − x2)² + (y1 − y2)²)
We considered all the parameters to have equal weight. When a new input case arrives, we retrieve all the nearby past cases based on the distance value calculated using the Euclidean distance. In our research we use a fixed distance threshold (e.g. 1.5), and all the cases whose distance values are less than or equal to this threshold are retrieved for training the artificial neural network.
Instead of retrieving all the past cases, only the cases that lie inside the fixed distance boundary are retrieved and sent as training samples to the feed-forward backpropagation neural network.
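The threshold-based retrieval described above can be sketched as follows; the function name, case values and two-parameter cases are illustrative only, not the actual nine-field diabetes records.

```python
import numpy as np

def retrieve_similar_cases(query, case_base, threshold=1.5):
    """Return all stored cases whose Euclidean distance to the query case
    is <= threshold. All parameters carry equal weight, as in the text."""
    query = np.asarray(query, dtype=float)
    cases = np.asarray(case_base, dtype=float)
    dists = np.sqrt(((cases - query) ** 2).sum(axis=1))
    return cases[dists <= threshold]

# Toy case base with two parameters per case (hypothetical values).
case_base = [[1.0, 2.0], [2.0, 3.5], [8.0, 9.0]]
similar = retrieve_similar_cases([1.5, 2.5], case_base, threshold=1.5)
print(len(similar))  # -> 2: only the two nearby cases fall inside the boundary
```

The retrieved subset, rather than the full case base, is then handed to the neural network as its training set.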
4.3 Introduction to the Pima Indian Diabetes Dataset
This dataset was originally donated by Vincent Sigillito, Applied Physics Laboratory, Johns Hopkins University, Laurel, MD 20707. It was selected from a larger database held by the National Institute of Diabetes and Digestive and Kidney Diseases, and it is publicly available in the UCI machine learning repository. All patients represented in this dataset are females at least 21 years old, of Pima Indian heritage, living near Phoenix, Arizona, USA. The dataset contains 8 input variables and a single output variable called class. A class value of 1 means the patient tested positive for diabetes, and 0 means the patient tested negative.
4.4 Literature Background
The Pima Indian Diabetes dataset is very difficult to classify, and a great deal of research has been done on it to improve classification accuracy. Michie, Spiegelhalter and Taylor used different machine learning methods to classify the Pima Indian Diabetes dataset [15]. Table 4.1 shows the names of the applied machine learning algorithms and the classification accuracies calculated on the Pima Indian Diabetes dataset.
Table 4.1 Michie, Spiegelhalter and Taylor Classification Result on Pima
Indian Diabetes Dataset
Sr. No   Algorithm   Correct Classification (%)   Misclassification (Error Rate %)
1 Discrim 77.5 22.5
2 Quadisc 73.8 26.2
3 Logdisc 77.7 22.3
4 SMART 76.8 23.2
5 ALLOC80 69.9 30.1
6 K-NN 67.6 32.4
7 CASTLE 74.2 25.8
8 CART 74.5 25.5
9 IndCART 72.9 27.1
10 NewID 71.1 28.9
11 AC2 72.4 27.6
12 Baytree 72.9 27.1
13 NaiveBay 73.8 26.2
14 CN2 71.1 28.9
15 C4.5 73 27
16 Itrule 75.5 24.5
17 Cal5 75 25
18 Kohonen 72.7 27.3
19 DIPOL92 77.6 22.4
20 Backprop 75.2 24.8
21 RBF 75.7 24.3
22 LVQ 72.8 27.2
Average 73.80 26.20
Among the 22 machine learning algorithms used, Logdisc was the most impressive: it gave 77.7% correct classification and 22.3% misclassification. K-NN gave the poorest performance, with 67.6% correct classification and 32.4% misclassification. The two most important artificial neural network algorithms, Backprop and RBF, gave 75.2% and 75.7% correct classification, with 24.8% and 24.3% misclassification respectively. Much research has been done with artificial neural networks to improve classification accuracy on the Pima Indian Diabetes dataset. Jeatrakul and Wong carried out a comparative study of the performance of different neural networks on the Pima Indian Diabetes dataset [17]. They used 5 different types of neural network architectures, namely the Backpropagation Neural Network (BPNN), Radial Basis Function Neural Network (RBFNN), General Regression Neural Network (GRNN), Probabilistic Neural Network (PNN) and Complementary Neural Network (CMTNN). Table 4.2 shows the classification performance of the BPNN, RBFNN, GRNN, PNN and CMTNN.
Table 4.2 Jeatrakul and Wong Classification Result on the Pima Indian Diabetes Dataset
Test No. BPNN GRNN RBFNN PNN CMTNN
1 77.27 74.68 79.22 74.68 77.92
2 76.62 79.87 79.22 79.87 76.62
3 70.13 70.13 74.03 70.13 72.08
4 85.71 81.82 79.22 81.82 83.77
5 75.97 75.97 77.27 75.97 75.32
6 70.78 70.13 72.08 70.13 72.08
7 75.32 72.73 76.62 72.73 75.97
8 79.22 78.57 77.27 78.57 79.22
9 74.68 74.68 76.62 74.68 75.32
10 75.97 74.03 74.03 74.03 76.62
Average 76.17 75.26 76.56 75.26 76.49
Estebanez, Alter and Valls used genetic-programming-based data projections for classification tasks [20]. They used the Pima Indian Diabetes dataset in their research and reduced the input dimension from 8 to 3. They applied the Support Vector Machine (SVM), Simple Logistics and Multilayer Perceptron algorithms to the Pima Indian Diabetes data for classification. Their results are shown in Table 4.3.
Table 4.3 Estebanez, Alter and Valls Classification Result on Pima Indian Diabetes Dataset
Sr. No.   Algorithm   Classification Performance (%)
1 SVM 77.21
2 Simple Logistics 77.86
3 Multilayer Perceptron 76.69
The Multilayer Perceptron from the artificial neural network family gave 76.69% classification performance, while Simple Logistics gave the maximum performance of 77.86%. Lena Kallin Westin, in her paper on missing data and the preprocessing perceptron, discussed different preprocessing methods for handling the missing data in the Pima Indian Diabetes dataset [18]. She developed a preprocessing perceptron to train a decision support system on the diabetes dataset; the trained decision support system gave an average classification performance of 79%. Bylander used naïve Bayes, decision trees and two types of belief networks on the Pima Indian Diabetes dataset [19]. Table 4.4 shows the various classification methods and the classification performance obtained by Bylander.
Table 4.4 Bylander Classification Performance on Pima Indian Diabetes Dataset
Sr. No. Method Accuracy
1 Belief Network(Laplace) 72.50%
2 Belief Network 72.30%
3 Decision Tree 72.00%
4 Naïve Bayes 71.50%
Misra and Dehuri, in their research paper "Functional Link Artificial Neural Network for Classification Task in Data Mining", created a Functional Link Artificial Neural Network (FLANN) and compared its classification performance with other machine learning algorithms [16]. Their FLANN gave 78.13% classification performance, while the MLP gave 75.2%. Table 4.5 gives the classification performance of different machine learning algorithms on the Pima Indian Diabetes dataset. K-NN from case-based reasoning can be used to retrieve similar past cases and remove outliers; through this, neural network classification performance can be improved [48].
Table 4.5 Misra and Dehuri Classification Performance on Pima Indian Diabetes Dataset
Sr. No.   Classification System   Accuracy (%)
1 NN 65.1
2 KNN 69.7
3 FSS 73.6
4 BSS 67.7
5 MFS1 68.5
6 MFS2 72.5
7 CART 74.5
8 C4.5 74.7
9 FID3.1 75.9
10 MLP 75.2
11 FLANN 78.13
4.5 Proposed Model and its Functioning
To increase the classification accuracy, we have proposed a hybrid machine learning algorithm using the Multilayer Perceptron from artificial neural networks and K-NN from case-based reasoning.
The Algorithm
1. Perform the preprocessing step for handling the missing values in the Pima Indian Diabetes dataset:
   i) Replace each missing value with its column's mean value, computed separately for each output class.
2. Divide the preprocessed dataset 80%/20% into a Training Dataset T1 and a Testing Dataset T2.
3. Train the artificial neural network system on Training Dataset T1 using the Backpropagation algorithm.
4. Train the case-based reasoning system on Training Dataset T1 using the K-Nearest Neighbor algorithm.
5. The ensemble system calculates the combined mean of the outputs of the ANN and CBR systems. Based on the calculated mean value it outputs either "Positive for Diabetes" or "Negative for Diabetes".
6. Use Testing Dataset T2 to calculate the classification performance of the proposed system.
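The steps above can be sketched end to end as follows, using scikit-learn stand-ins for the ANN (a backpropagation-trained MLP) and the CBR component (K-NN). The data here is synthetic and all sizes, labels and hyperparameters are illustrative; this is a sketch of the pipeline shape, not the thesis's actual implementation.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(12345)
X = rng.random((200, 8))                    # 8 input variables, as in the dataset
y = (X[:, 1] + X[:, 5] > 1.0).astype(int)   # synthetic class labels
X[rng.random(X.shape) < 0.05] = np.nan      # simulate missing values

# Step 1: replace missing values with the column mean, per output class.
for cls in (0, 1):
    rows = y == cls
    block = X[rows]
    col_means = np.nanmean(block, axis=0)
    idx = np.where(np.isnan(block))
    block[idx] = col_means[idx[1]]
    X[rows] = block

# Step 2: 80%/20% split into T1 and T2.
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.2, random_state=12345)

# Steps 3-4: train the ANN and the K-NN (CBR stand-in) on T1.
ann = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000,
                    random_state=0).fit(X1, y1)
knn = KNeighborsClassifier(n_neighbors=13).fit(X1, y1)

# Steps 5-6: average the two positive-class probabilities, cut off at 0.5.
p = (ann.predict_proba(X2)[:, 1] + knn.predict_proba(X2)[:, 1]) / 2
pred = (p >= 0.5).astype(int)
print("hybrid accuracy:", (pred == y2).mean())
```

The per-class mean imputation in step 1 mirrors the preprocessing rule of the algorithm; the 0.5 cut-off in the final step mirrors the ensemble rule of step 5.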
The block diagram of the Proposed Model is displayed in Figure-4.4.
Figure 4.4 Proposed Hybrid Machine Learning System for Medical Diagnosis
[Figure: the input test data is passed to both the artificial neural network system (trained using the Backpropagation algorithm) and the case-based reasoning system (trained using the K-NN algorithm); their calculated output probabilities are combined by the ensemble method (mean method) to produce the calculated output for the test data.]
The block diagram has two important trained machine learning systems. The first is the artificial neural network system, trained with the Backpropagation algorithm. The second is the case-based reasoning system, which uses the K-Nearest Neighbor algorithm. The total dataset contains 768 patient records, of which 614 (80%) are used for training and the remaining 154 (20%) for testing. At testing time, each new test record is passed through both the trained ANN and CBR systems, and both systems give their calculated output values, each between 0 and 1, to the ensemble module.
The ensemble module uses the mean method to combine the values from the ANN and CBR systems. Based on the calculated output value, the result will be either positive or negative for diabetes. We used a cut-off value of 0.5: if the ensemble module's calculated value is greater than or equal to 0.5, the output is "Positive for Diabetes"; otherwise it is "Negative for Diabetes".
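The mean-method cut-off described above amounts to a few lines; the function name here is hypothetical.

```python
def ensemble_diagnosis(ann_prob, cbr_prob, cutoff=0.5):
    """Mean-method ensemble: average the two system outputs (each in
    [0, 1]) and apply the cut-off to choose the diagnosis label."""
    mean_value = (ann_prob + cbr_prob) / 2.0
    if mean_value >= cutoff:
        return "Positive for Diabetes"
    return "Negative for Diabetes"

print(ensemble_diagnosis(0.7, 0.4))  # mean 0.55 -> Positive for Diabetes
print(ensemble_diagnosis(0.3, 0.4))  # mean 0.35 -> Negative for Diabetes
```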
Created Artificial Neural Network Structure
The artificial neural network uses a multilayer feed-forward architecture trained with the Backpropagation algorithm. The network has an input layer, a single hidden layer and an output layer. The input layer has 8 input nodes, the hidden layer has 5 neurons, and the output layer has a single neuron. The sigmoid function is used in both the hidden and output layers, and squared error is used as the cost function for adjusting the network weights.
[Figure: an 8-5-1 network with an input layer, one hidden layer and an output layer.]
Figure 4.5 Multilayer Feed-Forward Neural Network
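As an illustration, a forward pass through this 8-5-1 architecture can be sketched with randomly initialized weights; the names and values are hypothetical, not the trained weights reported later in Table 4.8.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_8_5_1(x, W1, b1, W2, b2):
    """Forward pass: 8 inputs -> 5 sigmoid hidden neurons -> 1 sigmoid
    output neuron. Shapes: W1 is 5x8, b1 is 5, W2 is 1x5, b2 is 1."""
    h = sigmoid(W1 @ x + b1)          # hidden layer activations
    return sigmoid(W2 @ h + b2)[0]    # scalar output in (0, 1)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 8)), np.zeros(5)
W2, b2 = rng.normal(size=(1, 5)), np.zeros(1)
out = forward_8_5_1(rng.random(8), W1, b1, W2, b2)
print(0.0 < out < 1.0)  # -> True: a sigmoid output always lies in (0, 1)
```

Backpropagation then adjusts W1, b1, W2 and b2 to reduce the squared error of this output against the class label.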
4.6 Experimental Results and their Explanation
We have conducted 10 different test cases from the Pima Indian Diabetes dataset based on different random samplings. Below, the ANN and K-NN are constructed for the first test case. The following paragraphs explain the structure, training and testing of the ANN, the K-NN and the hybrid method.
ANN Training Information
The total Pima Indian Diabetes dataset contains 768 patient records. We used an 8-5-1 ANN architecture for training and testing on the Pima Indian Diabetes dataset. We used random sampling with the random seed 12345 to select the samples for training and testing; 80% of the data is assigned to the training dataset and the remaining 20% to the testing dataset. The ANN architecture is represented in Figure 4.5, and Table 4.6 shows the dataset partition used for training and testing.
Table 4.6 ANN Training and Testing Dataset
Data source: Sheet1!$A$2:$I$769
Selected variables: preg Pg dbp skin insulin bmi pedig age class
Partitioning method: Randomly chosen
Random seed: 12345
# Training rows: 614
# Validation rows: 154
Training data (used for building the model): ['preprocessed_mean_separate_class.xls']'Data_Partition1'!$C$19:$J$632 (614 records)
Validation data: ['preprocessed_mean_separate_class.xls']'Data_Partition1'!$C$633:$J$786 (154 records)
Input variables normalized: Yes
ANN Network Parameters
The dataset has 8 input fields and 1 output field. The hidden layer has 5 nodes and the output layer has one node. We used squared error as the cost function and the standard sigmoid function as the activation function in the hidden and output layers. We trained the network for 200 epochs. Table 4.7 shows the ANN training functions and parameters.
# Input variables: 8
Input variables: preg Pg dbp skin insulin bmi pedig age
Output variable: class
# Hidden layers: 1
# Nodes in hidden layer 1: 5
Cost function: Squared error
Hidden layer sigmoid: Standard
Output layer sigmoid: Standard
# Epochs: 200
Step size for gradient descent: 0.1
Weight change momentum: 0.6
Error tolerance: 0.01
Weight decay: 0
Inter-Layer Node Connection Weights
The ANN has node connections between the input, hidden and output layers, and each connection has a weight. Table 4.8 shows the inter-layer connection weights between the input and hidden layers and between the hidden and output layers.
Table 4.8 ANN Inter Layer Connection Weights
Input Layer to Hidden Layer # 1
Hidden node   preg   Pg   dbp   skin   insulin   bmi   pedig   age   Bias
Node # 1 -4.36 -3.23 -0.18 -4.12 -5.17 -2.31 -2.59 3.57 -0.70
Node # 2 -1.85 -5.71 1.94 -0.36 -6.08 -1.25 -4.95 2.22 -2.51
Node # 3 -0.87 -0.35 0.27 0.08 2.68 0.00 -0.21 -4.84 -4.40
Node # 4 1.92 -1.25 0.25 5.21 -14.06 0.73 0.06 1.35 -1.72
Node # 5 -1.98 -0.29 1.14 -10.26 -1.52 -3.28 3.00 -3.61 -3.19
Hidden Layer # 1 to Output Layer
Output node   Node # 1   Node # 2   Node # 3   Node # 4   Node # 5   Bias
1   -2.96828   -2.69314   -6.6264   -5.84048   -3.81711   6.81824
0   2.96833   2.69316   6.62652   5.84058   3.81716   -6.81835
ANN Training Curve
We trained the ANN for 200 epochs. The initial training error rate was 36.6, and at the final epoch the error rate was 10.4. Figure 4.6 charts the error rate against the epoch number.
Figure 4.6 ANN Training Error Curve
Table 4.9 shows the classification confusion matrix and the error report for the training dataset, which contains 614 cases. Class 1 denotes "positive for diabetes" and class 0 denotes "negative for diabetes". The ANN system classifies 552 of the 614 cases correctly and misclassifies 62. The ANN's correct classification rate is 89.90% and its misclassification error rate is 10.10%.
ANN Training Data – Performance Report
Table 4.9 ANN Training Data Performance and Error Report
Classification Confusion Matrix
Actual Class   Predicted 1   Predicted 0
1   199   26
0   36   353
ANN Training Data – Error Report
Class   # Cases   # Errors   % Error
1   225   26   11.56
0   389   36   9.25
Overall   614   62   10.10
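The percentages in the report follow directly from the confusion matrix; as a quick check on the training-set figures:

```python
# Accuracy/error computation behind Table 4.9, from the confusion matrix
# reported in the text: (actual, predicted) -> count.
confusion = {("1", "1"): 199, ("1", "0"): 26,
             ("0", "1"): 36,  ("0", "0"): 353}

correct = confusion[("1", "1")] + confusion[("0", "0")]  # diagonal entries
total = sum(confusion.values())
accuracy = 100.0 * correct / total
error_rate = 100.0 - accuracy
print(f"{correct}/{total} correct, accuracy {accuracy:.2f}%, error {error_rate:.2f}%")
# -> 552/614 correct, accuracy 89.90%, error 10.10%
```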
Table 4.10 shows the classification confusion matrix and error report for the validation (testing) dataset, which contains 154 cases. Class 1 denotes "positive for diabetes" and class 0 denotes "negative for diabetes". The ANN system classifies 133 of the 154 cases correctly and misclassifies 21. The ANN's correct classification rate is 86.36% and its misclassification error rate is 13.64%.
ANN Validation Data – Classification Performance Report
Table 4.10 ANN Validation Data- Performance and Error Report
Classification Confusion Matrix
Actual Class   Predicted 1   Predicted 0
1   36   7
0   14   97
ANN Validation Data – Classification Error Report
Class   # Cases   # Errors   % Error
1   43   7   16.28
0   111   14   12.61
Overall   154   21   13.64
ANN Correct Classification and Misclassification Counts for 10 Different Test Cases
Table 4.11 reports the correct and misclassification counts for the 10 different test cases. Among them, test case 1 gave the minimum misclassification count, 21 out of 154 test records, while test cases 4, 7 and 9 gave the maximum, 28 out of 154. The average correct classification count was 128.9 out of 154 and the average misclassification count was 25.1.
Table 4.11 ANN Classification Performance for 10 different test cases
Test No.   Correct Classifications   Misclassifications
1 133 21
2 132 22
3 131 23
4 126 28
5 131 23
6 129 25
7 126 28
8 128 26
9 126 28
10 127 27
Average 128.9 25.1
ANN Correct Classification Accuracy and Misclassification Error Rate for 10 Different Test Cases
We converted the correct classification and misclassification counts into percentages. Table 4.12 shows the correct classification accuracy and misclassification error rate as percentages. Test case 1 gives the minimum misclassification error rate, 13.64%, while test cases 4, 7 and 9 give the maximum, 18.18%. Overall, the average correct classification accuracy was 83.70% and the average misclassification error rate was 16.30%.
Table 4.12 ANN Classification Performance Accuracy for 10 different test cases
Test No.   Correct Classification (%)   Misclassification (%)
1 86.36 13.64
2 85.71 14.29
3 85.06 14.94
4 81.82 18.18
5 85.06 14.94
6 83.77 16.23
7 81.82 18.18
8 83.12 16.88
9 81.82 18.18
10 82.47 17.53
Average 83.70 16.30
Performance Comparison of Earlier ANN Systems and Our ANN System
In the literature survey we found that four different research groups constructed artificial neural networks using the Backpropagation algorithm on the Pima Indian Diabetes dataset, and their classification performance was below 77%. The dataset contains two classes, Class 1 ("positive for diabetes") and Class 0 ("negative for diabetes"). We separated the dataset by class and replaced the missing values with the corresponding field's mean value. The Pima Indian Diabetes dataset has many missing values, and handling them plays a major role in classification performance. Training the neural network is also an important step, because a neural network has many local minima and finding the best solution is a trial-and-error process. Because of our effective handling of missing values and effective training, we reached a classification performance of 83.70%. Our neural network structure is 8-5-1.
87
5 hidden nodes we used in the hidden layer. The table 4.13 and the Figure 4.7 show
the effective classification performance of our neural network.
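The class-wise mean imputation described above can be sketched as follows. This is a minimal illustration with pandas; the toy values and the use of 0 as the missing-value marker are assumptions for the example, not the actual Pima records:

```python
import pandas as pd

# Toy stand-in for the Pima data: 0 marks a missing value and "class"
# is the diagnosis label (1 = +ive for Diabetes, 0 = -ive for Diabetes).
df = pd.DataFrame({
    "glucose": [148.0, 0.0, 183.0, 89.0, 0.0, 137.0],
    "class":   [1, 1, 1, 0, 0, 0],
})

# Replace each missing entry with the mean of the non-missing values of
# that field, computed separately for each class.
for col in ["glucose"]:
    for cls in df["class"].unique():
        in_class = df["class"] == cls
        class_mean = df.loc[in_class & (df[col] != 0), col].mean()
        df.loc[in_class & (df[col] == 0), col] = class_mean

print(df["glucose"].tolist())  # [148.0, 165.5, 183.0, 89.0, 113.0, 137.0]
```

Imputing per class (rather than over the whole column) keeps each class's attribute distribution intact, which is the point of the pre-processing step described above.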
Table 4.13 Performance Comparison between Different ANN Systems
S.No   ANN System Reference   Correct Classification Accuracy (%)   Misclassification Error Rate (%)
1 Michie, Spiegelhalter and Taylor 75.20 24.80
2 Jeatrakul and Wong 76.17 23.83
3 Estebanez, Alter and Valls 76.69 23.31
4 Misra and Dehuri 75.20 24.80
5 Our Method 83.70 16.30
Comparison Chart
Figure 4.7 Various ANN Classifiers Performance Comparisons
KNN Training Information
We carried out 10 different test cases using different random samples of the Pima Indian Diabetes Dataset. The tables below show the training of the KNN system and the best k value for test case 1.
Test Case 1 Training and Testing Information:
Best K Value Selection
During training, the system substitutes various values of k and calculates the corresponding training and validation (testing) error rates. We gave the input value k = 20 to the system, and it generated the training and testing errors for 20 different k values. The system automatically finds the best k value based on the testing error. For k = 13 and k = 20 the error rate is 10.39%, the minimum among the 20 k values, and the system chose the smaller value, k = 13, as the best k value. Table 4.14 below shows the training and testing error values for the different k values and the selection of the best k = 13.
Table 4.14 KNN Training and Testing Errors for Different K Values
Validation error log for different k (best k value = 13)
Value of k   % Error (Training)   % Error (Validation/Testing)
1 0.00 15.58
2 11.40 21.43
3 11.07 12.34
4 13.68 16.88
5 13.68 13.64
6 14.50 14.94
7 14.17 15.58
8 14.82 13.64
9 15.80 14.29
10 14.01 11.04
11 15.64 11.04
12 14.82 12.34
13 15.80 10.39 <--- Best k
14 15.31 11.69
15 16.61 11.69
16 16.12 12.34
17 16.45 11.69
18 15.96 11.69
19 16.45 11.04
20 16.12 10.39
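The best-k selection can be reproduced directly from the validation column of Table 4.14; a minimal sketch of the rule (smallest k attaining the minimum validation error):

```python
# Validation (testing) error rates in % for k = 1..20, from Table 4.14.
val_errors = [15.58, 21.43, 12.34, 16.88, 13.64, 14.94, 15.58, 13.64,
              14.29, 11.04, 11.04, 12.34, 10.39, 11.69, 11.69, 12.34,
              11.69, 11.69, 11.04, 10.39]

# The minimum error (10.39%) is attained at both k = 13 and k = 20;
# the system picks the smallest such k as the best k.
min_error = min(val_errors)
best_k = val_errors.index(min_error) + 1  # list index 0 corresponds to k = 1
print(best_k, min_error)  # 13 10.39
```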
Figure 4.8 below shows the training error curve for the different values of k. The training error is lowest, 0.00, at k = 1 and generally increases with k, reaching its maximum, 16.61, at k = 15.
Figure 4.8 K-NN Training Error Curve
Figure 4.9 KNN Validation Error Curve
Figure 4.9 above shows the validation error for the different values of k. For k = 13 and k = 20 it gives the minimum testing error, 10.39%. The value 13 is selected as the best k value and used during validation.
Classification Performance on Training Data
Table 4.15 below shows the classification confusion matrix for the training dataset, which contains 614 cases. Class 1 denotes "+ive for Diabetes" and class 0 denotes "-ive for Diabetes". The KNN system classifies 517 of the 614 cases correctly and misclassifies 97. The KNN correct classification rate is 84.20% and the misclassification error rate is 15.80%.
Training Data scoring - Summary Report (for k=13)
Table 4.15 KNN Training Data- performance and Error Report
Classification Confusion Matrix
                  Predicted Class
Actual Class        1      0
      1           170     55
      0            42    347
Error Report
Class # Cases # Errors % Error
1 225 55 24.44
0 389 42 10.80
Overall 614 97 15.80
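The per-class and overall error figures in Table 4.15 follow directly from the confusion matrix; a minimal sketch using the counts above:

```python
# Confusion matrix counts from Table 4.15 (actual class -> predicted 1/0).
confusion = {
    1: {1: 170, 0: 55},   # actual class 1: 170 correct, 55 errors
    0: {1: 42, 0: 347},   # actual class 0: 347 correct, 42 errors
}

total_cases = total_errors = 0
for actual, preds in confusion.items():
    cases = preds[1] + preds[0]
    errors = cases - preds[actual]      # off-diagonal count for this row
    total_cases += cases
    total_errors += errors
    print(f"Class {actual}: {cases} cases, {errors} errors, "
          f"{100.0 * errors / cases:.2f}% error")

overall_error = 100.0 * total_errors / total_cases
print(f"Overall: {total_cases} cases, {total_errors} errors, "
      f"{overall_error:.2f}% error")  # 614 cases, 97 errors, 15.80% error
```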
Classification Performance on Validation (Testing) Data
Table 4.16 below shows the classification confusion matrix for the validation (testing) dataset, which contains 154 cases. Class 1 denotes "+ive for Diabetes" and class 0 denotes "-ive for Diabetes". The KNN system classifies 138 of the 154 cases correctly and misclassifies 16. The KNN correct classification rate is 89.61% and the misclassification error rate is 10.39%.
Table 4.16 KNN Validation Data - Performance and Error Report
Classification Confusion Matrix
                  Predicted Class
Actual Class        1      0
      1            36      7
      0             9    102
Error Report
Class # Cases # Errors % Error
1 43 7 16.28
0 111 9 8.11
Overall 154 16 10.39
KNN Classifier's Correct and Misclassification Performance for the 10 Different Test Cases
Table 4.17 below reports the correct and misclassification counts for the 10 different test cases. Among the 10 cases, test case 1 gives the minimum misclassification number, 16 out of 154 test cases, and test case 9 gives the maximum, 32 out of 154. The average correct classification number is 129.8 out of 154 and the average misclassification number is 24.2.
Table 4.17 KNN Classification Performance for 10 Test Cases
KNN (CBR) Classifier's Total Correct Classification and Misclassification Numbers for 154 Test Cases
Test No.   Correct Classification Number   Misclassification Number
1 138 16
2 129 25
3 127 27
4 133 21
5 128 26
6 136 18
7 125 29
8 129 25
9 122 32
10 131 23
Average 129.8 24.2
We converted the correct classification and misclassification error counts into percentages. Table 4.17 below gives the correct classification accuracy and the misclassification error rate in percent (%). Test case 1 gives the minimum misclassification error rate, 10.39%, and test case 9 gives the maximum, 20.78%. Overall, the average correct classification accuracy was 84.29% and the average misclassification error rate was 15.71%.
Table 4.17 KNN Classification Performance Accuracy for 10 Test Cases
KNN (CBR) Classifier's Correct Classification Accuracy and Misclassification Error Rate for 154 Test Cases
Test No.   Correct Classification Accuracy (%)   Misclassification Error Rate (%)
1 89.61 10.39
2 83.77 16.23
3 82.47 17.53
4 86.36 13.64
5 83.12 16.88
6 88.31 11.69
7 81.17 18.83
8 83.77 16.23
9 79.22 20.78
10 85.06 14.94
Average 84.29 15.71
Performance Comparison of Earlier KNN Systems and Our KNN System
In the literature survey we found that 2 different researchers constructed Case Based Reasoning systems using the K-Nearest Neighbor algorithm on the Pima Indian Diabetes Dataset. Their classification performance was less than 70%. We separated the dataset by class and replaced the missing values with the corresponding field's mean value. The Pima Indian Diabetes Dataset has a lot of missing values, and handling them plays a major role in the classification performance. In the K-NN method, finding the best value of k also plays a major role in the classification performance; it is a trial and error process, but the XLMiner software has an option to find the best k value based on the training and testing datasets, which we used. Because of our effective handling of missing values and the best k value found through XLMiner, we reach a classification performance of 84.29%. Table 4.18 and Figure 4.10 below compare the classification performance of our case based reasoning system with the earlier case based reasoning systems on the Pima Indian Diabetes Dataset.
Table 4.18 Different KNN System Performances on the Pima Indian Diabetes Dataset
S.No   KNN System Reference   Correct Classification Accuracy (%)   Misclassification Error Rate (%)
1 Michie_Spiegelhalter 67.6 32.4
2 Jeatrakul_Wong 69.7 30.3
3 Our_Method 84.29 15.71
Figure 4.10 Different KNN Classifiers Performance on the Pima Indian Diabetes Dataset
Hybrid System Performance and Results
Following our proposed algorithm, we first constructed an Artificial Neural Network and a Case Based Reasoning system separately for the Pima Indian Diabetes Dataset. We divided the total dataset of 768 cases into a training dataset of 614 cases and a testing dataset of 154 cases. The same 614 training cases were used to train both the ANN and KNN systems, and the same testing dataset was used to test the classification performance of both. We took the average of the class probability values produced by the ANN and KNN systems and kept the cutoff probability value at 0.5 for class 1: if the proposed method's average probability for a test case is greater than or equal to 0.5, it is assigned to class 1; otherwise to class 0. We conducted 10 different tests of the proposed model; the results are presented in Table 4.19.
Table 4.19 Proposed Hybrid Machine Learning Model Classification Results
Test No.   PM-MCN   Correct Classification (%)   Misclassification Error (%)
1 18 88.31 11.69
2 21 86.36 13.64
3 25 83.77 16.23
4 28 81.82 18.18
5 24 84.42 15.58
6 21 86.36 13.64
7 26 83.12 16.88
8 26 83.12 16.88
9 31 79.87 20.13
10 24 84.42 15.58
Average 24 84.16 15.84
Note: PM-MCN – Proposed Model Misclassification Number.
Below we give the detailed result calculations for the first 5 tests of Table 4.19, in Tables 4.20 to 4.24. The remaining test results are attached in Appendix C.
Threshold / Cutoff Value for the Hybrid Model
We used the cutoff (threshold) value 0.5 for the combined ensemble method. We add the ANN probability of success for class 1 (+ive for Diabetes) and the KNN (CBR) probability of success for class 1 and take the average probability value for class 1. If the average probability value is greater than or equal to 0.5 (the cutoff), the result is "+ive for Diabetes"; otherwise the result is "-ive for Diabetes".
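The combination rule described above can be sketched as a small function. This is a minimal illustration; the function name is ours, and the example probabilities are taken from rows of Table 4.20:

```python
# Average the ANN and CBR (KNN) probabilities for class 1 and apply the
# 0.5 cutoff used by the proposed hybrid model.
def hybrid_predict(ann_prob, cbr_prob, cutoff=0.5):
    avg = (ann_prob + cbr_prob) / 2.0
    label = 1 if avg >= cutoff else 0  # 1 = "+ive for Diabetes"
    return avg, label

# Rows 2 and 7 of Table 4.20 (Test No. 1):
avg, label = hybrid_predict(0.99, 0.85)
print(round(avg, 2), label)  # 0.92 1  -> classified +ive for Diabetes
avg, label = hybrid_predict(0.01, 0.31)
print(round(avg, 2), label)  # 0.16 0  -> classified -ive for Diabetes
```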
Result Calculation for Test. No 1
Table 4.20 Output Calculation for Test. No.1
Row Id.   Actual Class   ANN Prob. for 1 (Success)   CBR Prob. for 1 (Success)   Ensemble Average   PM Result
2 1 0.99 0.85 0.92 Yes
7 0 0.01 0.31 0.16 Yes
8 0 0.00 0.23 0.12 Yes
13 1 0.18 0.23 0.20 No
22 0 0.00 0.15 0.08 Yes
24 0 0.09 0.77 0.43 Yes
25 0 0.15 0.85 0.50 No
28 1 1.00 0.85 0.92 Yes
33 0 0.01 0.15 0.08 Yes
40 0 0.00 0.08 0.04 Yes
46 0 0.01 0.38 0.20 Yes
59 1 0.36 0.62 0.49 No
61 0 0.32 0.15 0.24 Yes
70 1 1.00 0.62 0.81 Yes
79 0 0.12 0.46 0.29 Yes
80 0 0.85 0.77 0.81 No
83 1 0.55 0.46 0.50 Yes
86 0 0.00 0.00 0.00 Yes
93 0 0.01 0.31 0.16 Yes
103 1 0.00 0.00 0.00 No
109 0 0.00 0.15 0.08 Yes
117 0 0.01 0.31 0.16 Yes
118 0 0.72 0.69 0.71 No
120 0 0.01 0.08 0.04 Yes
Note: PM Result – Proposed Model Result. Yes – the proposed model classified the case correctly; No – the proposed model misclassified the case.
Result Calculation for Test. No 2
Table 4.21 Output Calculation for Test. No.2
Row Id.   Actual Class   ANN Prob. for 1 (Success)   CBR Prob. for 1 (Success)   Ensemble Average   PM Result
3 1 0.97 0.93 0.95 Yes
7 0 0.08 0.36 0.22 Yes
9 0 0.31 0.57 0.44 Yes
11 1 0.96 0.71 0.84 Yes
33 0 0.36 0.07 0.21 Yes
34 0 0.33 0.07 0.20 Yes
38 1 0.81 0.50 0.66 Yes
45 1 0.88 0.86 0.87 Yes
50 1 1.00 0.71 0.85 Yes
53 0 0.54 0.57 0.56 No
54 1 1.00 1.00 1.00 Yes
62 0 0.27 0.07 0.17 Yes
63 0 0.00 0.00 0.00 Yes
72 1 0.99 0.71 0.85 Yes
74 0 0.52 0.29 0.40 Yes
83 1 0.54 0.57 0.56 Yes
91 0 0.01 0.07 0.04 Yes
100 1 0.79 0.86 0.83 Yes
101 1 0.99 0.64 0.82 Yes
102 1 0.99 0.64 0.82 Yes
105 1 0.65 0.71 0.68 Yes
112 0 0.42 0.43 0.43 Yes
115 0 0.21 0.71 0.46 Yes
122 0 0.25 0.43 0.34 Yes
Result Calculation for Test. No 3
Table 4.22 Output Calculation for Test. No.3
Row Id.   Actual Class   ANN Prob. for 1 (Success)   CBR Prob. for 1 (Success)   Ensemble Average   PM Result
4 0 0.99 0.78 0.88 No
17 0 0.66 0.56 0.61 No
19 1 0.13 0.56 0.34 No
21 0 0.01 0.00 0.00 Yes
33 0 0.09 0.00 0.04 Yes
37 1 1.00 0.78 0.89 Yes
38 1 0.02 0.44 0.23 No
43 1 0.97 0.67 0.82 Yes
59 1 0.69 0.56 0.62 Yes
85 0 0.13 0.33 0.23 Yes
91 0 0.00 0.00 0.00 Yes
93 0 0.04 0.22 0.13 Yes
99 1 0.84 0.78 0.81 Yes
100 1 0.66 1.00 0.83 Yes
101 1 1.00 0.67 0.83 Yes
112 0 0.01 0.22 0.12 Yes
115 0 0.17 0.56 0.36 Yes
118 0 0.07 0.67 0.37 Yes
119 0 0.01 0.22 0.12 Yes
132 1 0.96 0.56 0.76 Yes
134 1 0.99 0.44 0.72 Yes
139 0 0.00 0.11 0.06 Yes
140 1 0.85 0.11 0.48 No
141 0 0.02 0.11 0.07 Yes
Result Calculation for Test. No 4
Table 4.23 Output Calculation for Test. No.4
Row Id.   Actual Class   ANN Prob. for 1 (Success)   CBR Prob. for 1 (Success)   Ensemble Average   PM Result
2 1 0.99 0.80 0.90 Yes
11 1 1.00 0.80 0.90 Yes
15 1 1.00 0.40 0.70 Yes
23 1 1.00 0.60 0.80 Yes
25 0 1.00 0.70 0.85 No
45 1 0.97 0.90 0.94 Yes
46 0 0.21 0.20 0.20 Yes
67 0 0.00 0.10 0.05 Yes
70 1 0.98 0.60 0.79 Yes
81 1 1.00 0.90 0.95 Yes
82 1 1.00 0.80 0.90 Yes
84 0 0.07 0.00 0.03 Yes
86 0 0.00 0.00 0.00 Yes
87 0 0.00 0.00 0.00 Yes
88 0 0.01 0.00 0.00 Yes
91 0 0.00 0.00 0.00 Yes
102 1 1.00 0.70 0.85 Yes
104 1 1.00 0.80 0.90 Yes
108 1 0.05 0.00 0.03 No
110 1 1.00 0.70 0.85 Yes
113 1 1.00 1.00 1.00 Yes
117 0 0.08 0.20 0.14 Yes
127 1 1.00 1.00 1.00 Yes
134 1 0.77 0.60 0.69 Yes
Result Calculation for Test. No 5
Table 4.24 Output Calculation for Test. No.5
Row Id.   Actual Class   ANN Prob. for 1 (Success)   CBR Prob. for 1 (Success)   Ensemble Average   PM Result
8 0 0.00 0.08 0.04 Yes
18 1 0.53 0.67 0.60 Yes
20 0 0.00 0.00 0.00 Yes
30 0 0.68 0.42 0.55 No
40 0 0.01 0.00 0.00 Yes
46 0 0.04 0.33 0.19 Yes
50 1 1.00 0.67 0.83 Yes
51 1 0.88 0.92 0.90 Yes
58 0 0.43 0.42 0.42 Yes
62 0 0.94 0.08 0.51 No
65 0 0.06 0.42 0.24 Yes
73 1 0.85 0.75 0.80 Yes
77 0 0.86 0.92 0.89 No
82 1 0.90 0.83 0.87 Yes
86 0 0.00 0.00 0.00 Yes
100 1 0.19 0.67 0.43 No
101 1 0.94 0.75 0.84 Yes
103 1 0.00 0.00 0.00 No
106 0 0.09 0.00 0.04 Yes
107 1 0.94 0.67 0.80 Yes
109 0 0.02 0.08 0.05 Yes
112 0 0.62 0.50 0.56 No
125 1 0.87 0.42 0.64 Yes
129 1 0.41 0.33 0.37 No
4.7 Comparing Proposed Model Results with Earlier Results
Table 4.25 Comparing Proposed Model Results with Table-4.1 Result
Sr. No.   Algorithm   Correct Classification (%)   Misclassification (Error Rate, %)
1 Discrim 77.5 22.5
2 Quadisc 73.8 26.2
3 Logdisc 77.7 22.3
4 SMART 76.8 23.2
5 ALLOC80 69.9 30.1
6 K-NN 67.6 32.4
7 CASTLE 74.2 25.8
8 CART 74.5 25.5
9 IndCART 72.9 27.1
10 NewID 71.1 28.9
11 AC2 72.4 27.6
12 Baytree 72.9 27.1
13 NaiveBay 73.8 26.2
14 CN2 71.1 28.9
15 C4.5 73 27
16 Itrule 75.5 24.5
17 Cal5 75 25
18 Kohonen 72.7 27.3
19 DIPOL92 77.6 22.4
20 Backprop 75.2 24.8
21 RBF 75.7 24.3
22 LVQ 72.8 27.2
23 PM (Ensemble Method) 84.16 15.84
Comparison Chart
Figure 4.11 Comparing PM Result with Table-4.1 Results
The above table 4.25 and figure 4.11 compare the classification rates of the individual machine learning methods with the proposed ensemble model. The proposed model gives a correct classification performance of 84.16%, while among the other machine learning algorithms LogDisc gives at most 77.7%.
Comparing Proposed Model Results with Table-4.2 Result
Table-4.2 shows the classification performance of 5 different artificial neural network architectures on the Pima Indian Diabetes dataset. The Complementary Neural Network (CMTNN) gives the maximum classification performance, 76.49%, while our proposed system gives 84.16% on the same dataset. This is represented in Table-4.26 and Figure 4.12.
Table 4.26 Comparing proposed Model Result with Table-4.2 Result
Sr. No.   Model Type   Classification Performance (%)
1 BPNN 76.17
2 GRNN 75.26
3 RBFNN 76.56
4 PNN 75.26
5 CMTNN 76.49
6 Proposed Model 84.16
Comparison Chart
Figure 4.12 Chart shows comparison between PM Result and Table-4.2 Result
Comparing Proposed Model Results with Table-4.3 Result
Table-4.3 shows 3 machine learning methods, SVM, Simple Logistics and Multilayer Perceptron, and their classification performance on the Pima Indian Diabetes dataset. Among the 3 algorithms, Simple Logistics gives the maximum classification performance, 77.86%, while our proposed ensemble model gives 84.16%. This is represented in Table-4.27 and Figure 4.13.
Table 4.27 Comparing Proposed Model Result with Table-4.3 Result
Algorithm   Classification Performance (%)
SVM 77.21
Simple Logistics 77.86
Multilayer Perceptron 76.69
PM (Ensemble Method) 84.16
Comparisons Chart
Figure 4.13 Chart shows the Comparison between PM result and Table-4.3
result
Comparing our Proposed Model Result with Bylander Classification Table-4.4 Result
Table 4.28 Comparing Proposed Model Result with Table-4.4 Result
Method Accuracy
Belief Network(Laplace) 72.50%
Belief Network 72.30%
Decision Tree 72%
Naïve Bayes 71.50%
Proposed Model 84.16%
Bylander applied 4 machine learning methods to the Pima Indian Diabetes dataset and obtained a maximum classification performance of 72.50%, using the Belief Network (Laplace). Our proposed method gives 84.16% classification performance on the diabetes dataset, nearly 12% more than the Bylander result. This is represented in Table 4.28 and Figure 4.14.
Comparisons Chart
Figure 4.14 Comparing PM Classification Performances with Bylander result
Comparing Proposed Model result with Misra and Dehuri Table-4.5 result
Table 4.29 Comparing Proposed Model Result with Table-4.5 Result
Classification Systems Name Classification Accuracy
NN 65.1
KNN 69.7
FSS 73.6
BSS 67.7
MFS1 68.5
MFS2 72.5
CART 74.5
C4.5 74.7
FID3.1 75.9
MLP 75.2
FLANN 78.13
Proposed Model 84.16
Misra and Dehuri constructed a Functional Link Artificial Neural Network (FLANN), which gave 78.13% classification accuracy on the Pima Indian Diabetes dataset. Our proposed model gives 84.16% classification accuracy, nearly 6% more than the FLANN classification performance. This is represented in Table 4.29 and Figure 4.15.
Comparisons Chart
Figure 4.15 Performance Comparisons between FLANN and Proposed Model
4.8 Chapter Summary
Diabetes is the fourth biggest cause of death worldwide, particularly in the industrialized and developing countries. A lot of research has been done to classify the diabetes dataset and to improve the classification performance using individual machine learning methods. In this chapter we performed data pre-processing to handle the missing data: the Pima Indian Diabetes dataset has a lot of missing values, and we replaced them with the corresponding field's mean value, computed separately for each class.
Ours is the first hybrid method for classifying the Pima Indian Diabetes Dataset. Following our proposed algorithm, we first developed an Artificial Neural Network, which gave 83.70% classification performance, nearly 7% more than the previous ANN systems. Next we developed a Case Based Reasoning system based on the K-Nearest Neighbor method, which gave 84.29% classification performance, nearly 14% more than the previous KNN classifiers. The hybrid system combines the outputs of both the ANN and KNN systems by averaging their class probability values. Over the 10 test cases conducted on the Pima Indian Diabetes dataset, our proposed ensemble method gave 84.16% classification accuracy, nearly 7% more than the earlier methods. Through hybrid machine learning we improved both the classification performance and the reliability of the system, the two prime objectives of our research.