
Performance Comparison of Classification Algorithms Using WEKA

Sreenath M. #1, D. Sumathi *2
1 PG Scholar, 2 Assistant Professor
Department of Computer Science and Engineering, PPG Institute of Technology, Coimbatore, India
1 [email protected]
2 [email protected]

Abstract— Data mining is the knowledge discovery process of analysing huge volumes of data from various perspectives and summarizing them into useful information. Data mining is used to find hidden patterns in large data sets. Classification is one of the most important applications of data mining. Classification techniques are used to assign data items to predefined class labels, and classification employs supervised learning. During data mining, a classifier builds classification models from an input data set, which are then used to predict future data trends. Many algorithms are available for classification, such as decision trees, k-nearest neighbour, Naive Bayes, neural networks, back propagation, multi-class classification, multi-layer perceptrons, support vector machines, etc. In this paper we select four of these algorithms, apply them to the NSL-KDD data set, and compare them on several evaluation criteria using the Waikato Environment for Knowledge Analysis (WEKA).

Keywords— classifier algorithms, data mining, NSL-KDD, OneR, Hoeffding tree, decision stump, alternating decision tree, WEKA.

I. INTRODUCTION

Data mining is the technique of automated data analysis that reveals previously undetected relationships among data. Three of the major data mining techniques are classification, regression and clustering. In this paper we work with classification because it is the most important process when dealing with a very large database; the Weka tool is used for the classification. Classification [1] is one of the most important techniques in data mining for building classification models from an input data set. These built models are used to predict future data trends [2, 3]. Our knowledge about the data becomes greater and easier to use once classification is complete: we can deduce logic from the classified data, and, above all, classification makes data retrieval faster, with better results, and allows new data to be sorted easily.

There are many data mining tools available [4, 5]. In this paper we use the Weka data mining tool, an open-source tool developed in Java [6]. It contains tools for data preprocessing, clustering, classification, visualization, association rules and regression. It supports not only data mining algorithms but also meta-learners such as bagging and boosting, as well as data preparation. The Weka toolkit has achieved the highest applicability when compared with Orange, Tanagra and KNIME [4]. When using Weka for classification, performance is tested by applying the cross-validation test mode instead of the percentage-split test mode [7].

II. RELATED WORKS

Data mining is the method of extracting information from vast data sets. Its techniques apply advanced computational methods to uncover unknown relations and summarize the results of analysing the given data set, so that these relations become clear and comprehensible. Hu et al. [8] conducted an experimental comparison of C4.5, LibSVM, AdaBoosted C4.5, Bagged C4.5 and Random Forest on seven microarray cancer data sets. They concluded that C4.5 performed best among all the algorithms, and additionally found that data preprocessing and cleansing improve the effectiveness of classification algorithms. Shin et al. [9] compared C4.5 and Naïve Bayes and concluded that C4.5 outperforms Naïve Bayes. Sharma [10] conducted an experiment in the Weka environment comparing four algorithms, namely ID3, J48, Simple CART and the Alternating Decision Tree (ADTree), on a spam email data set in terms of classification accuracy. According to his simulation results, the J48 classifier performs better than ID3, CART and ADTree in terms of classification accuracy. Abdul Hamid M. Ragab et al. [11] compared classification algorithms for students' college enrolment approval using data mining. They found that C4.5 gives the best performance, the highest accuracy and the lowest absolute errors, followed by PART, Random Forest, Multilayer Perceptron and Naïve Bayes, respectively. Table I outlines some recent works on classification algorithm performance and the application areas of the experimental data sets used. It illustrates that many data mining algorithms can be applied to completely different application areas.


TABLE I
A SUMMARY OF RELATED DATA MINING ALGORITHMS AND THE APPLICATION DATA SETS USED

Year | Authors | Data mining algorithms | Data set
2010 | Nesreen K. Ahmed, Amir F. Atiya, Neamat El Gayar, Hisham El-Shishiny [12] | MLP, BNN, RBF, K-Nearest Neighbour regression | M3 competition data
2011 | S. Aruna, S.P. Rajagopalan and L.V. Nandakishore [13] | RBF networks, Naïve Bayes, J48, CART, SVM-RBF kernel | WBC, WDBC, Pima Indians Diabetes database
2011 | R. Kishore Kumar, G. Poonkuzhali, P. Sudhakar [14] | ID3, J48, Simple CART and Alternating Decision Tree (ADTree) | Spam email data
2012 | Abdullah H. Wahbeh, Mohammed Al-Kabi [15] | C4.5, SVM, Naïve Bayes | Arabic text
2012 | Rohit Arora, Suman [16] | C4.5, MLP | Diabetes and Glass
2013 | S. Vijayarani, M. Muthulakshmi [17] | Attribute Selected Classifier, Filtered Classifier, LogitBoost | Classifying computer files
2013 | Murat Koklu and Yavuz Unal [18] | MLP, J48, and Naïve Bayes | Classifying computer files
2014 | Devendra Kumar Tiwary [19] | Decision Tree (DT), Naïve Bayes (NB), Artificial Neural Networks (ANN), Support Vector Machine (SVM) | Credit card

III. METHODOLOGY

We used an Intel Core i3 processor platform with 4 GB of memory, the Windows 7 Ultimate operating system and 500 GB of secondary storage. In all the experiments, we used Weka 3.7.11 to find the performance characteristics on the input data set.

A. Weka interface

Weka (Waikato Environment for Knowledge Analysis) is a widely used machine learning suite written in Java, originally developed at the University of Waikato, New Zealand. The Weka suite contains a group of algorithms and visualization tools for data analysis, with graphical user interfaces for easy access to this functionality. Weka is employed in many different application areas, particularly for academic purposes and research. There are numerous benefits to Weka:

- It is freely available under the GNU General Public License.
- It is portable, since it is fully implemented in the Java programming language and therefore runs on almost any architecture.
- It contains a large collection of data preprocessing and modelling techniques.
- It is simple to use because of its graphical interface.

Weka supports multiple data mining tasks, specifically data preprocessing, clustering, classification, regression, feature selection and visualization. All of Weka's techniques are predicated on the assumption that the data is available as a single file or relation, where each data point is described by a fixed number of attributes.
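As a minimal illustration of this single-relation data model, the following sketch loads an ARFF file through Weka's Java API and inspects its fixed attribute schema; the file name is hypothetical.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadArff {
    public static void main(String[] args) throws Exception {
        // Load a single relation from an ARFF file (path is hypothetical).
        Instances data = DataSource.read("dataset.arff");
        // By convention the class label is the last attribute.
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Relation:   " + data.relationName());
        System.out.println("Attributes: " + data.numAttributes());
        System.out.println("Instances:  " + data.numInstances());
    }
}
```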

B. Data set

The NSL-KDD data set is used for evaluation. NSL-KDD was proposed to resolve some of the inherent problems of the KDD CUP'99 data set, which is the most widely used data set for anomaly detection. Tavallaee et al. conducted a statistical investigation of that data set and found two critical issues that greatly influence the performance of evaluated systems and result in a very poor assessment of anomaly detection approaches. To resolve these problems, they proposed a new data set, NSL-KDD, consisting of selected records of the complete KDD data set [20]. The benefits of NSL-KDD over the original KDD data set are the following. First, it does not include redundant records in the training set, so the classifiers will not be biased towards more frequent records. Second, the number of records selected from each difficulty-level group is inversely proportional to the percentage of records in the original KDD data set; as a result, the classification rates of distinct machine learning methods vary over a wider range, which makes an accurate evaluation of different learning techniques more feasible. Third, the numbers of records in the training and test sets are reasonable, which makes it practical to run experiments on the complete set without having to randomly select a small portion.

Consequently, analysis results of different research works will be consistent and comparable.

C. Classification algorithms

The following classifier algorithms are taken for the performance comparison on the NSL-KDD data set.

(a) OneR

OneR [21], short for "One Rule", is a simple yet accurate classification algorithm that generates one rule for each predictor in the data and then selects the rule with the smallest total error as its "one rule". To make a rule for a predictor, we construct a frequency table of that predictor against the target. It has been shown that OneR produces rules only slightly less accurate than state-of-the-art classification algorithms while producing rules that are easy for humans to interpret.
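As a minimal sketch of this frequency-table idea (not Weka's own implementation, which ships as weka.classifiers.rules.OneR), the error of the rule built from one nominal predictor can be computed as follows; attribute values and class labels are assumed to be given as parallel string arrays.

```java
import java.util.HashMap;
import java.util.Map;

public class OneRuleSketch {
    // Error of the rule for one predictor: map every attribute value to its
    // majority class; every instance not covered by that majority is an error.
    static int ruleErrors(String[] attrValues, String[] labels) {
        Map<String, Map<String, Integer>> freq = new HashMap<>();
        for (int i = 0; i < attrValues.length; i++) {
            freq.computeIfAbsent(attrValues[i], k -> new HashMap<>())
                .merge(labels[i], 1, Integer::sum);
        }
        int errors = 0;
        for (Map<String, Integer> classCounts : freq.values()) {
            int total = 0, majority = 0;
            for (int n : classCounts.values()) {
                total += n;
                majority = Math.max(majority, n);
            }
            errors += total - majority;
        }
        return errors;
    }
}
```

OneR would evaluate ruleErrors for every predictor in the data and keep the rule of the predictor with the fewest errors.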

(b) Hoeffding Tree

A Hoeffding tree [22] is an incremental, anytime decision tree induction algorithm capable of learning from data streams, under the assumption that the distribution generating the examples does not change over time. Hoeffding trees exploit the fact that a small sample is often sufficient to choose the optimal splitting attribute. This is supported mathematically by the Hoeffding bound, which quantifies the number of observations required to estimate a statistic to within a prescribed precision. One feature of Hoeffding trees not shared by other incremental decision tree learners is their sound guarantees of performance.
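For reference, the Hoeffding bound invoked here is a standard result: after n independent observations of a real-valued random variable with range R, the observed mean is, with probability 1 - \delta, within

\[
\epsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2n}}
\]

of the true mean. The tree therefore commits to the currently best splitting attribute once its evaluation criterion exceeds that of the second-best attribute by more than \epsilon.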

(c) Decision Stump

A decision stump [23] is a machine learning model consisting of a one-level decision tree; that is, a decision tree with a single internal node that is immediately connected to its leaves. The predictions made by a decision stump are based on just one input feature, so decision stumps are also known as 1-rules. Decision stumps are usually used as base learners in machine learning ensemble techniques such as boosting and bagging. For example, the Viola-Jones face detection algorithm employs AdaBoost with decision stumps as weak learners.
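A one-level tree over a numeric feature reduces to a single threshold test, as in this hedged sketch; the feature index and threshold are assumed to have been chosen beforehand, e.g. by exhaustive search over candidate splits (Weka's version is weka.classifiers.trees.DecisionStump).

```java
public class StumpSketch {
    final int featureIndex;   // the single feature the stump inspects
    final double threshold;   // split point chosen during training
    final int leftClass;      // predicted class when value <= threshold
    final int rightClass;     // predicted class otherwise

    StumpSketch(int featureIndex, double threshold, int leftClass, int rightClass) {
        this.featureIndex = featureIndex;
        this.threshold = threshold;
        this.leftClass = leftClass;
        this.rightClass = rightClass;
    }

    // One internal node directly connected to two leaves.
    int predict(double[] instance) {
        return instance[featureIndex] <= threshold ? leftClass : rightClass;
    }
}
```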

(d) Alternating Decision Tree

An alternating decision tree (ADTree) [24] combines the simplicity of a single decision tree with the effectiveness of boosting. Its representation combines decision stumps, a standard model deployed in boosting, into a decision-tree-like structure in which the various branches are no longer mutually exclusive. The root node is a prediction node and carries just a numeric score. The next layer of nodes are decision nodes, which are essentially a collection of decision tree stumps. The subsequent layer consists of prediction nodes again, and so on, alternating between prediction nodes and decision nodes. The model is applied by identifying the (possibly multiple) paths from the root node to the leaves of the alternating decision tree that match the attribute values of the observation to be classified. The observation's classification score is the sum of the prediction values along the corresponding paths.
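The path-summing evaluation just described can be sketched with two mutually recursive node types. This is an illustrative structure only, not Weka's ADTree implementation (which, in Weka 3.7, is distributed separately via the alternatingDecisionTrees package).

```java
import java.util.List;
import java.util.function.Predicate;

public class AdTreeSketch {
    // A prediction node carries a numeric score and any number of
    // decision-node children (branches are not mutually exclusive).
    static class PredictionNode {
        final double score;
        final List<DecisionNode> children;
        PredictionNode(double score, List<DecisionNode> children) {
            this.score = score;
            this.children = children;
        }
    }

    // A decision node is a stump-like test routing to two prediction nodes.
    static class DecisionNode {
        final Predicate<double[]> test;
        final PredictionNode ifTrue, ifFalse;
        DecisionNode(Predicate<double[]> test, PredictionNode ifTrue, PredictionNode ifFalse) {
            this.test = test;
            this.ifTrue = ifTrue;
            this.ifFalse = ifFalse;
        }
    }

    // Sum the prediction values along every path the instance satisfies;
    // the sign of the total gives the two-class prediction.
    static double score(PredictionNode node, double[] instance) {
        double total = node.score;
        for (DecisionNode d : node.children) {
            total += score(d.test.test(instance) ? d.ifTrue : d.ifFalse, instance);
        }
        return total;
    }
}
```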

IV. CLASSIFIER PERFORMANCE MEASURES

A confusion matrix contains information about the actual and predicted classifications produced by a classification system. The performance of such a system is commonly evaluated using the data in this matrix. Fig. 1 shows the confusion matrix.

Fig. 1 Confusion matrix

The entries in the confusion matrix have the following meaning in the context of our study:

- a is the number of correct predictions that an instance is negative,
- b is the number of incorrect predictions that an instance is positive,
- c is the number of incorrect predictions that an instance is negative, and
- d is the number of correct predictions that an instance is positive.

The following metrics are used for the evaluation on the data set:

- Accuracy: the proportion of the total number of predictions that were correct, determined using the equation
  Accuracy = (a + d) / (a + b + c + d)
- Detection Rate: the proportion of the predicted positive cases that were correct, calculated using the equation
  Detection Rate = d / (b + d)
- False Alarm Rate: the proportion of negative cases that were incorrectly classified as positive, calculated using the equation
  False Alarm Rate = b / (a + b)
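These three ratios follow directly from the confusion-matrix counts, as in this small sketch; the counts in main are hypothetical and only illustrate the calls.

```java
public class ConfusionMetrics {
    // a = true negatives, b = false positives,
    // c = false negatives, d = true positives.
    static double accuracy(long a, long b, long c, long d) {
        return (double) (a + d) / (a + b + c + d);
    }

    static double detectionRate(long b, long d) {
        return (double) d / (b + d);
    }

    static double falseAlarmRate(long a, long b) {
        return (double) b / (a + b);
    }

    public static void main(String[] args) {
        // Hypothetical counts for illustration only.
        long a = 9000, b = 500, c = 300, d = 9500;
        System.out.printf("Accuracy         = %.5f%n", accuracy(a, b, c, d));
        System.out.printf("Detection rate   = %.5f%n", detectionRate(b, d));
        System.out.printf("False alarm rate = %.5f%n", falseAlarmRate(a, b));
    }
}
```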

V. EXPERIMENTAL RESULTS AND COMPARATIVE ANALYSIS

We investigated the performance of the selected classification algorithms. The classifications were done using 10-fold cross-validation. In WEKA, all data are treated as instances, and the features in the data are referred to as attributes. The simulation results are divided into separate bar charts for easier analysis and evaluation. Table II shows the performance of the classifier algorithms on the NSL-KDD data set.
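A minimal sketch of this experimental setup in Weka's Java API follows. It assumes the NSL-KDD training data has been converted to an ARFF file (the file name is hypothetical) with the class label as the last attribute, and it omits ADTree, which in Weka 3.7 is installed separately as the alternatingDecisionTrees package.

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.trees.DecisionStump;
import weka.classifiers.trees.HoeffdingTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NslKddComparison {
    public static void main(String[] args) throws Exception {
        // NSL-KDD training data in ARFF form (file name is hypothetical).
        Instances data = DataSource.read("KDDTrain.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] classifiers = {
            new OneR(), new DecisionStump(), new HoeffdingTree()
        };
        for (Classifier c : classifiers) {
            Evaluation eval = new Evaluation(data);
            // 10-fold cross-validation, as in the experiments above.
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%-14s accuracy = %.5f%n",
                    c.getClass().getSimpleName(), eval.pctCorrect() / 100.0);
        }
    }
}
```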

TABLE II
PERFORMANCE OF CLASSIFIER ALGORITHMS

Classifier     | Accuracy | Detection rate | False alarm rate
OneR           | 0.94615  | 0.94954        | 0.06714
Decision Stump | 0.81733  | 0.94964        | 0.05025
Hoeffding Tree | 0.95120  | 0.95515        | 0.05952
ADTree         | 0.95094  | 0.94592        | 0.07321

Fig. 2 shows the accuracy of the classifiers on the NSL-KDD data set. From the results it can be observed that Hoeffding Tree is the best classifier, followed by ADTree, OneR and Decision Stump.

Fig. 2 Accuracy of classifiers

Fig. 3 shows the detection rate of the classifiers on the NSL-KDD data set. The experimental results show that Hoeffding Tree is the best classifier, followed by Decision Stump, OneR and ADTree.

Fig. 3 Detection Rate of classifiers

Fig. 4 shows the false alarm rate of the classifiers on the NSL-KDD data set. The results of the experiment show that Decision Stump is the best classifier, followed by Hoeffding Tree, OneR and ADTree.

Fig. 4 False Alarm Rate of classifiers

VI. CONCLUSION

Four classification algorithms were investigated in this paper on the NSL-KDD data set: Hoeffding Tree, ADTree, OneR and Decision Stump. A comparative study and analysis of the classification measures Accuracy, Detection Rate and False Alarm Rate was carried out by simulation using the Weka toolkit. The experimental results show that Hoeffding Tree gives the best performance in terms of Accuracy and Detection Rate, but when False Alarm Rate is considered, Decision Stump is the best performer.



REFERENCES

[1] O. Serhat and A.C. Yilamaz, "Classification and Prediction in a Data Mining Application," Journal of Marmara for Pure and Applied Sciences, Vol. 18, pp. 159-174.
[2] H. Kaushik and B.G. Raviya, "Performance Evaluation of Different Data Mining Classification Algorithm Using WEKA," Indian Journal of Research (PARIPEX), Vol. 2, 2013.
[3] Wai-Ho Au, Keith C.C. Chan and Xin Yao, "A Novel Evolutionary Data Mining Algorithm with Applications to Churn Prediction," IEEE Transactions on Evolutionary Computation, Vol. 7, pp. 532-545, 2003.
[4] G. Karina, S.M. Miquel and S. Beatriz, "Tools for Environmental Data Mining and Intelligent Decision Support," International Congress on Environmental Modelling and Software, 2012.
[5] C. Giraud-Carrier and O. Povel, "Characterising Data Mining Software," Intelligent Data Analysis, pp. 181-192, 2003.
[6] H. Mark, F. Eibe, H. Geoffrey, P. Bernhard, R. Peter and H.W. Ian, "The WEKA Data Mining Software: An Update," SIGKDD Explorations, Vol. 11, 2009.
[7] H.W. Abdullah, A.A. Qasem, N.A. Mohammed and M.A. Emad, "A Comparison Study between Data Mining Tools over some Classification Methods," International Journal of Advanced Computer Science and Applications, pp. 18-26, 2012.
[8] H. Hong, L. Jiuyong, P. Ashley, W. Hua and D. Grant, "A Comparative Study of Classification Methods for Microarray Data Analysis," in Australasian Data Mining Conference, 2006.
[9] C.T. My, S. Dongil and S. Dongkyoo, "A Comparative Study of Medical Data Classification Methods Based on Decision Tree and Bagging Algorithms," in Eighth IEEE International Conference on Dependable, Autonomic and Secure Computing, 2009.
[10] K.S. Aman, "A Comparative Study of Classification Algorithms for Spam Email Data Analysis," International Journal on Computer Science and Engineering, Vol. 3, pp. 1890-1895, 2011.
[11] Abdul Hamid M. Ragab et al., "A Comparative Analysis of Classification Algorithms for Students College Enrollment Approval Using Data Mining," Proceedings of the 2014 Workshop on Interaction Design in Educational Environments, 2014.
[12] Nesreen K. Ahmed et al., "An Empirical Comparison of Machine Learning Models for Time Series Forecasting," Econometric Reviews, Vol. 29, pp. 594-621, 2010.
[13] S. Aruna, S.P. Rajagopalan and L.V. Nandakishore, "An Empirical Comparison of Supervised Learning Algorithms in Disease Detection," International Journal of Information Technology Convergence and Services, Vol. 1, pp. 81-92, 2011.
[14] R. Kishore Kumar, G. Poonkuzhali and P. Sudhakar, "Comparative Study on Email Spam Classifier using Data Mining Techniques," Proceedings of the International MultiConference of Engineers and Computer Scientists, Vol. 1, 2012.
[15] H.W. Abdullah and A.K. Mohamedi, "Comparative Assessment of the Performance of Three," Basic Sci. & Eng., Vol. 21, pp. 15-28, 2012.
[16] A. Rohit and Suman, "Comparative Analysis of Classification Algorithms on Different Datasets using WEKA," International Journal of Computer Applications, Vol. 54, pp. 21-25, 2012.
[17] S. Vijayarani and M. Muthulakshmi, "Comparative Study on Classification Meta Algorithms," International Journal of Innovative Research in Computer and Communication Engineering, Vol. 1, pp. 1768-1774, 2013.
[18] K. Murat and U. Yavuz, "Analysis of a Population of Diabetic Patients Databases with Classifiers," International Journal of Medical, Pharmaceutical Science and Engineering, Vol. 7, pp. 176-178, 2013.
[19] K.T. Devendra, "A Comparative Study of Classification Algorithms for Credit Card Approval using WEKA," GALAXY International Interdisciplinary Research Journal, Vol. 2, pp. 165-174, 2014.
[20] M. Tavallaee, E. Bagheri, W. Lu and A. Ghorbani, "A Detailed Analysis of the KDD CUP'99 Data Set," IEEE International Conference on Computational Intelligence in Security and Defence Applications, pp. 53-58, 2009.
[21] Robert C. Holte, "Very Simple Classification Rules Perform Well on Most Commonly Used Datasets," Machine Learning, Vol. 11, pp. 63-90, 1993.
[22] G. Hulten, S. Laurie and D. Pedro, "Mining Time-Changing Data Streams," Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.
[23] W. Iba and P. Langley, "Induction of One-Level Decision Trees," in Proceedings of the Ninth International Conference on Machine Learning, pp. 233-240, 1992.
[24] Y. Freund and L. Mason, "The Alternating Decision Tree Learning Algorithm," in Proceedings of the Sixteenth International Conference on Machine Learning, pp. 124-133, 1999.
