final presentation dwi riyono

COMPARISON OF MACHINE LEARNING TECHNIQUES USING WEKA ENVIRONMENT

2011 20th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises

Dwi Riyono D10207801

Dani Pranata M10307804

Nurul Retno Nurwulan D10301807

INTRODUCTION

• Correct diagnosis for further treatment or a potential therapy change of a specific patient could be assisted with the use of machine learning.

• The medical data given was employed in order to evaluate the performance of a number of classification techniques.

• This study analyzed and evaluated the decision making task of therapy change which a doctor suggests, when a number of blood test parameters – mainly Prostate Specific Antigen (PSA) – are measured every 3 months.

• Prostate cancer is the most common non-cutaneous cancer and the second-leading cause of death in men in the USA. It is prevalent in many countries and exhibits a wide spectrum of aggressiveness.

Material and Methods

• Medical Problem Description and Data

• The WEKA environment

• Techniques

MEDICAL PROBLEM DESCRIPTION AND DATA

• Advising an effective treatment is a challenge for a physician who treats patients with prostate cancer.

• Selection of appropriate treatment requires assessment of the tumor’s potential aggressiveness and the general health, life expectancy, and quality of life preferences of the patient.

• Parameters chosen: Hematocrit (HCT), White Blood Cells (WBC), free Prostate Specific Antigen (PSA free), total Prostate Specific Antigen (PSA total), ratio PSA (PSAfree/PSAtotal), Prostatic Acidic Phosphatase (PAP), and potential therapy change decision (yes/no).

• Real data from 40 patients were obtained. There were 1960 data values in total, arranged in 280 rows and 7 columns.

THE WEKA ENVIRONMENT

• WEKA implements various machine learning classification techniques and algorithms for regression and clustering, along with a number of visualization tools, and it has been accepted as a powerful and adequate environment for data mining.

• All data analyzed and mined with the aid of WEKA are saved in the ARFF file format, which uses special tags to distinguish between the attributes, values, and names of the data given.

• All of the parameters chosen (blood test parameters) were numerical values, and the doctor's therapy change decision was in the simple format of yes/no.
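A minimal sketch of what such an ARFF file could look like for the seven columns listed earlier. The attribute names, the relation name, and the two data rows are hypothetical and made up for illustration; they are not taken from the study's data.

```python
def build_arff(relation, numeric_attributes, class_name, class_values, rows):
    """Return the text of a simple ARFF file (header tags plus data rows)."""
    lines = [f"@relation {relation}", ""]
    for name in numeric_attributes:
        # Each blood test parameter is declared as a numeric attribute.
        lines.append(f"@attribute {name} numeric")
    # The target is a nominal attribute with an explicit list of values.
    lines.append(f"@attribute {class_name} {{{','.join(class_values)}}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical attribute names for the six blood test parameters.
ATTRIBUTES = ["HCT", "WBC", "PSA_free", "PSA_total", "PSA_ratio", "PAP"]

# Made-up example rows, for illustration only (not patient data).
EXAMPLE_ROWS = [
    [35.0, 6200, 0.40, 2.50, 0.16, 2.1, "no"],
    [27.5, 3900, 2.90, 9.80, 0.30, 4.2, "yes"],
]

arff_text = build_arff(
    "prostate_q1", ATTRIBUTES, "therapy_change", ["yes", "no"], EXAMPLE_ROWS
)
print(arff_text)
```

The `@relation`, `@attribute`, and `@data` tags are the parts WEKA reads to tell attribute declarations apart from the data rows.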

TECHNIQUES

• Decision Trees – J48

• Neural Network (Multilayer Perceptron – MLP)

• Naïve Bayes

• Radial Basis Function (RBF)

• K-Nearest Neighbor (IBk)

TECHNIQUES (2)

• Decision Trees – J48

It represents a mapping of the attributes given and consists of nodes which link to two or more sub-trees. A node calculates a specific outcome which is based on the value of the instance and each possible outcome is linked with one of the sub-trees. The J48 algorithm is an efficient method for estimation and classification of fuzzy data.
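J48 is WEKA's implementation of the C4.5 decision tree algorithm, which chooses the test at each node by comparing how much each candidate split reduces class entropy. The sketch below computes the gain ratio of a single numeric threshold split. It is a simplified illustration rather than WEKA's code, and the values and labels in the usage lines are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def gain_ratio(values, labels, threshold):
    """Gain ratio of splitting a numeric attribute at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    total = len(labels)
    weights = [len(left) / total, len(right) / total]
    # Information gain: entropy before the split minus weighted entropy after.
    remainder = weights[0] * entropy(left) + weights[1] * entropy(right)
    info_gain = entropy(labels) - remainder
    # Split information penalises splits into very uneven branches.
    split_info = -sum(w * math.log2(w) for w in weights)
    return info_gain / split_info


# Made-up attribute values and decisions, for illustration only.
example_values = [0.5, 1.0, 2.5, 3.0]
example_labels = ["no", "no", "yes", "yes"]
print(gain_ratio(example_values, example_labels, threshold=2.0))
```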

TECHNIQUES (3)

• Neural Network (Multilayer Perceptron – MLP)

An adaptive system that changes its structure based on external or internal information which flows through the network during an initial learning phase. In more practical terms, an NN is a non-linear statistical data modeling tool. It can be used to model complex relationships between inputs and outputs or to find patterns in data.

The MLP with the back propagation algorithm was applied in order to categorize a practitioner’s decision (therapy change), using two input nodes (no = 0, yes = 1).

TECHNIQUES (4)

• Naïve Bayes

A representation of the Bayesian classifier that produces probabilistic rules and has received noteworthy attention when used for classification purposes. Classification is performed by applying the well-known Bayes rule to each attribute of the model and computing the probability over an independent class variable. Although the model is straightforward, it provides quite promising results on many real-world datasets.
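As a rough illustration of the idea, the sketch below fits a Naïve Bayes classifier that models each numeric attribute with a per-class normal distribution and then picks the class with the highest posterior score. The Gaussian assumption, the function names, and the tiny training set are assumptions made for this example, not details from the study.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a class prior and per-attribute mean/variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        prior = len(members) / len(rows)
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, max(var, 1e-9)))  # floor avoids division by zero
        model[label] = (prior, stats)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one instance."""
    def log_posterior(label):
        prior, stats = model[label]
        score = math.log(prior)
        # Attributes are treated as independent given the class,
        # so their log-likelihoods simply add up.
        for value, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var)
            score -= (value - mean) ** 2 / (2 * var)
        return score

    return max(model, key=log_posterior)


# Made-up two-attribute training data, for illustration only.
train_rows = [[0.1, 0.30], [0.2, 0.25], [2.5, 0.10], [3.0, 0.12]]
train_labels = ["no", "no", "yes", "yes"]
nb_model = fit_gaussian_nb(train_rows, train_labels)
print(predict_gaussian_nb(nb_model, [2.8, 0.11]))
```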

TECHNIQUES (5)

• Radial Basis Function (RBF)

Initially introduced in order to address a variety of problems (old pattern recognition techniques, clustering, functional approximation, etc.). It is now acknowledged to be one of the most important NN models for classification. The network is based on a two-layer feed-forward model with a hidden layer between the sets of inputs and outputs. A Gaussian function is preferred for classification, and a key factor for a successful implementation is finding suitable centers.
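To make the role of the centers concrete, the sketch below evaluates a Gaussian basis function and a linear output layer on top of the hidden layer. The centers, widths, and weights are hypothetical values supplied by hand; in practice they would be learned from the data.

```python
import math


def gaussian_rbf(x, center, width):
    """Gaussian basis function: 1.0 at the center, decaying with distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-sq_dist / (2 * width ** 2))


def rbf_network_output(x, centers, widths, weights, bias=0.0):
    """Hidden layer of Gaussian units followed by a weighted sum."""
    hidden = [gaussian_rbf(x, c, w) for c, w in zip(centers, widths)]
    return bias + sum(weight * h for weight, h in zip(weights, hidden))


# Hypothetical centers, widths and output weights, for illustration only.
centers = [[0.0, 0.0], [10.0, 10.0]]
widths = [1.0, 1.0]
weights = [1.0, -1.0]
print(rbf_network_output([0.0, 0.0], centers, widths, weights))
```

An input close to the first center activates mainly the first hidden unit, so the output is dominated by that unit's weight. This is why poorly placed centers hurt classification.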

TECHNIQUES (6)

• K-Nearest Neighbor (IBk)

One of the simplest classification algorithms. Such algorithms are described as statistical learning algorithms and are generated by simply storing the given data. A distance metric is chosen and any new data item is compared against the already “memorized” data items for classification purposes. The new item is assigned to the class which is most common amongst its k nearest neighbors. IBk is an implementation of the k-nearest neighbor algorithm in which the number of nearest neighbors (k) can be set manually or determined automatically using cross-validation.
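The procedure described above can be sketched in a few lines. This example assumes Euclidean distance and a simple majority vote, and the stored items and query points are made up for illustration.

```python
import math
from collections import Counter


def knn_classify(train_rows, train_labels, query, k=3):
    """Assign `query` to the most common class among its k nearest neighbors."""
    # "Training" is just storing the data; all work happens at query time.
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Made-up stored items, for illustration only.
stored_rows = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]]
stored_labels = ["no", "no", "no", "yes", "yes"]
print(knn_classify(stored_rows, stored_labels, [0.2, 0.2], k=3))
```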

EXPERIMENT AND RESULT

DATA SOURCE

Blood tests are used to collect the data from patients.

Six parameters are measured:

1. Hct (Hematocrit), volume percentage (%) of red blood cells in blood

2. WBC (White blood cell)

3. PAP (Prostatic Acidic Phosphatase)

4. PSA Free (Prostate-Specific Antigen)

5. PSA Total

6. PSAf/PSAt

The percentage of PSA in the free or complex isoforms was used to predict the patient’s state over a period of 2 years.

BLOOD TEST PARAMETERS

Blood tests are used to collect health information from each patient diagnosed with prostate cancer.


TABLE 1. BLOOD TEST PARAMETERS AND THEIR CRITICAL VALUES

Blood Test Parameter          Critical Value
HCT                           >28%
WBC                           >4000/mL
PSA Free                      0.03 ng/dl
PSA Total                     0.05 ng/dl
PSAf/PSAt                     >0.2
Prostatic Acid Phosphatase    <3.5 ng/ml

The difference between each parameter and its critical value, for every quarter, is used to decide on a potential therapy plan change, along with the patient’s history and previous blood test results.
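A minimal sketch of how these difference values might be computed from the critical values in Table 1. The sign convention (measured value minus critical value), the dictionary keys, and the measured values in the example are assumptions made for illustration; units follow Table 1.

```python
# Critical values as listed in Table 1 (keys are hypothetical names).
CRITICAL_VALUES = {
    "HCT": 28.0,        # %
    "WBC": 4000.0,      # per mL
    "PSA_free": 0.03,   # ng/dl
    "PSA_total": 0.05,  # ng/dl
    "PSA_ratio": 0.2,   # PSAf/PSAt
    "PAP": 3.5,         # ng/ml
}


def differences_from_critical(measured):
    """Return measured minus critical value for every parameter."""
    return {
        name: measured[name] - critical
        for name, critical in CRITICAL_VALUES.items()
    }


# Made-up measurements for one patient in one quarter (not study data).
example_measurement = {
    "HCT": 30.0,
    "WBC": 4500.0,
    "PSA_free": 2.10,
    "PSA_total": 8.05,
    "PSA_ratio": 0.26,
    "PAP": 3.0,
}
print(differences_from_critical(example_measurement))
```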

CLASSIFICATION PROCESS OF THE TARGET PARAMETER (THERAPY CHANGE)

These difference values are provided to the WEKA toolbox for classification of the target parameter (therapy change).

Five machine learning algorithms are used to obtain the results.

1. J48

2. MLP

3. Naïve Bayes

4. RBF

5. IBk

CLASSIFICATION RESULTS FOR EACH EXAMINED ALGORITHM FOR QUARTER 1

Simulation Results for Quarter 1

WEKA Technique   Correctly Classified   Incorrectly Classified   Time taken (sec)   Kappa statistic
J48              85% (34)               15% (6)                  0.03               0.4146
MLP              85% (34)               15% (6)                  0.13               0.4146
Naïve Bayes      90% (36)               10% (4)                  0.01               0.6098
RBF              90% (36)               10% (4)                  0.11               0.6098
IBk              82.5% (33)             17.5% (7)                0.01               0.2708

The table above summarizes the accuracy of each machine learning algorithm for all 40 patients, along with the time taken and the Kappa statistic for each algorithm.
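The Kappa statistic corrects the observed accuracy for the agreement that would be expected by chance. The slides do not give the confusion matrices, so the sketch below uses a hypothetical 2×2 matrix for 40 patients, chosen only because it reproduces the 85% accuracy and Kappa value reported for J48. It should not be read as the study's actual matrix.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows actual, cols predicted)."""
    total = sum(sum(row) for row in matrix)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical matrix: rows = actual (no, yes), columns = predicted (no, yes).
HYPOTHETICAL_MATRIX = [[31, 4], [2, 3]]

accuracy = (31 + 3) / 40
print(accuracy, round(cohen_kappa(HYPOTHETICAL_MATRIX), 4))
```

Because therapy changes are rare, a classifier can reach a high accuracy mostly by predicting "no". Kappa is noticeably lower than accuracy for that reason.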

TRAINING AND SIMULATION ERROR FOR QUARTER 1

Simulation Results for Quarter 1

WEKA Technique   Mean Absolute Error   Root Mean Squared Error   Relative Absolute Error (%)   Root Relative Squared Error (%)
J48              0.1737                0.3638                    57.409                        94.407
MLP              0.1899                0.3651                    62.735                        95.039
Naïve Bayes      0.1014                0.3163                    33.494                        81.334
RBF              0.1423                0.3127                    47.007                        81.406
IBk              0.1921                0.408                     63.478                        106.212

The table above is an overall synopsis based on different error rates.
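The four error measures can be sketched as follows, assuming that each prediction is expressed as a probability of the "yes" class and that the relative measures compare the classifier against a baseline that always predicts the mean of the actual values. WEKA's exact baseline may differ, and the numbers in the example are made up.

```python
import math


def error_metrics(actual, predicted):
    """Return MAE, RMSE and the two relative error measures (in percent)."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_err = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    # Baseline errors: always predicting the mean of the actual values.
    base_abs = sum(abs(a - mean_actual) for a in actual)
    base_sq = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "mae": sum(abs_err) / n,
        "rmse": math.sqrt(sum(sq_err) / n),
        "rae_percent": 100 * sum(abs_err) / base_abs,
        "rrse_percent": 100 * math.sqrt(sum(sq_err) / base_sq),
    }


# Made-up targets (0 = no, 1 = yes) and predicted probabilities of "yes".
example_actual = [0, 0, 1, 1]
example_predicted = [0.0, 0.5, 1.0, 0.5]
print(error_metrics(example_actual, example_predicted))
```

Under this definition, a relative error above 100%, as reported for IBk's root relative squared error, means the classifier did worse than the baseline on that measure.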

DISCUSSION & CONCLUSIONS AND FUTURE WORK

DISCUSSION

Based on the results obtained for the 1st quarter of the therapy plan for all patients examined, a number of useful conclusions can be drawn concerning the performance and error rates of the algorithms chosen.

1. The Naïve Bayes and RBF Network algorithms achieve a relatively high accuracy rate (90%) with a Kappa score of 0.6098.

2. Between the two, Naïve Bayes is much faster, taking only 0.01 seconds compared to the 0.11 seconds that RBF takes.

3. The IBk algorithm has the worst accuracy (82.5%), with a small Kappa statistic of 0.2708, although it takes only 0.01 s.

4. This study observes that Naïve Bayes has the lowest mean absolute, relative absolute and root relative squared error rates; therefore, it has more powerful classification capabilities.

5. This study indicates that Naïve Bayes and RBF are the best algorithms.

6. IBk is the algorithm with the highest error rates.

DISCUSSION

The performance of J48 is worth mentioning, with an accuracy rate of 85% and the decision tree visualization derived from the execution of the algorithm for Q1. This decision tree is given in Figure 1. For any given patient:

1. If the difference of PSA free is greater than 2.07, then there is definitely a necessity for therapy change.

2. If not, then ratioPSA (i.e. PSAfree/PSAtotal) is examined; if it is greater than 0.15 (0.17 for a physician), then there may be no need to change therapy.

3. If ratioPSA is lower than 0.15, then the last parameter that the algorithm takes into consideration is Prostatic Acidic Phosphatase, and the decision is made according to a difference of 2.3 ng/ml.
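The three rules above can be written as a small function. The slides do not say which side of the 2.3 ng/ml PAP difference leads to a therapy change, so the direction is exposed as a hypothetical parameter, `pap_above_means_change`, and its default value is an assumption. The function and argument names are also illustrative.

```python
def j48_q1_rule(psa_free_diff, ratio_psa, pap_diff, pap_above_means_change=True):
    """Apply the Q1 decision tree rules; returns "yes" or "no" for therapy change."""
    # Rule 1: a large PSA free difference always indicates a therapy change.
    if psa_free_diff > 2.07:
        return "yes"
    # Rule 2: otherwise a ratioPSA above 0.15 suggests no change is needed.
    if ratio_psa > 0.15:
        return "no"
    # Rule 3: fall back to the PAP difference with a 2.3 ng/ml threshold.
    # ASSUMPTION: the direction of this test is not stated in the slides.
    exceeds = pap_diff > 2.3
    return "yes" if exceeds == pap_above_means_change else "no"


# Made-up difference values, for illustration only.
print(j48_q1_rule(psa_free_diff=2.5, ratio_psa=0.10, pap_diff=0.0))
print(j48_q1_rule(psa_free_diff=1.0, ratio_psa=0.20, pap_diff=5.0))
```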

DISCUSSION (2)

• Based on Table III, concerning classification error, Naïve Bayes has the lowest mean absolute, relative absolute and root relative squared error rates, which indicates powerful classification capabilities.

• As can easily be seen, the algorithm with the highest error rates is IBk.


DISCUSSION

• From Table IV, a few observations can be made. The most important is that a physician rarely changes therapy for many patients during a single quarter; therefore, the therapy plan was changed for only one or two patients in some quarters, as shown in Table IV (Q2, Q4, Q6, and Q7).

• After taking a closer look at the values of the parameters measured, and discussing this finding with the physicians (doctors in Urology), it turned out that these cases were considered to be ‘problematic’ in terms of measured values. Specifically, these patients were not responding to the treatment given, and the blood parameters measured were constantly extremely high or very low.

• Mean classification accuracy for each algorithm, for all quarters examined, was: 92% for J48, 89% for MLP, 86% for Naïve Bayes, 95% for RBF network and 92% for IBk.

CONCLUSIONS AND FUTURE WORK

• In this study, a comparison of five machine learning algorithms on real medical data was presented.

• Useful results were obtained concerning the performance and error rates of the algorithms.

• The experiments performed showed that the best algorithm, based on the prostate cancer data given, is the RBF Network technique.

• The RBF algorithm performed quite well in terms of classification accuracy and Kappa score, and gave relatively low error rates for the Q1 results presented.

• One way of improving the results is to propose a new hybrid algorithm (one that combines both the difference between the values measured and the critical values and the difference in the values measured between two subsequent quarters).

• As future work, more clinical cases have to be evaluated to justify these results, as more will become available from the Dept. of Urology.

