Performance Comparison of Classification Algorithms Using WEKA
Sreenath.M #1, D.Sumathi *2
1 PG Scholar, 2 Assistant Professor
Department of Computer Science and Engineering, PPG Institute of Technology, Coimbatore, India
1 sreenath.m.91@gmail.com
2 sumicse.sai@gmail.com
Abstract: Data mining is the process of discovering knowledge by
analysing large volumes of data from various perspectives and
summarizing it into useful information. It is used to find hidden
patterns in large data sets. Classification is one of the most
important applications of data mining. Classification techniques
assign data items to predefined class labels and employ supervised
learning. During data mining, the classifier builds classification
models from an input data set, and these models are used to predict
future data trends. Many algorithms are available for classification,
such as decision trees, k-nearest neighbour, Naive Bayes, neural
networks, back propagation, artificial neural networks, multi-class
classification, multi-layer perceptron and support vector machines.
In this paper we study four classification algorithms, apply them to
the NSL-KDD data set, and compare them on a set of evaluation
criteria using the Waikato Environment for Knowledge Analysis (WEKA).
Keywords: classifier algorithms, data mining, NSL-KDD, OneR, Hoeffding tree, decision stump, alternating decision tree, WEKA.
I. INTRODUCTION
Data mining is the automated analysis of data to reveal previously
undetected dependencies among data items. Three of the major data
mining techniques are classification, regression and clustering. In
this paper we work with classification because it is the most
important process when dealing with a very large database. The Weka
tool is used for classification. Classification [1] is one of the most
important techniques in data mining; it builds classification models
from an input data set.
These models are used to predict future data trends [2, 3]. Once
classification is complete, our knowledge of the data is greater and
easier to use, and we can deduce logic from the classified data. Above
all, classification makes data retrieval faster, gives better results,
and allows new data to be sorted easily.
There are many data mining tools available [4, 5]. In this paper we
use the Weka data mining tool, an open source tool developed in
Java [6]. It contains tools for data preprocessing, clustering,
classification, visualization, association rules and regression. It
supports not only data mining algorithms but also data preparation and
meta learners such as bagging and boosting. The Weka toolkit has been
reported to have the highest applicability, ahead of Orange, Tanagra,
and KNIME, respectively [4]. When Weka is used for classification,
performance is tested with the cross-validation test mode rather than
the percentage split test mode [7].
II. RELATED WORKS
Data mining is the method of extracting information from vast data
sets. Its techniques apply advanced computational methods to discover
unknown relations and to summarize the results of analysing the
observed data set so that these relations become clear and
understandable. Hu et al. [8] conducted an experimental comparison of
C4.5, LibSVMs, AdaBoosting C4.5, Bagging C4.5, and Random Forest on
seven microarray cancer data sets.
They concluded that C4.5 ranked highest among all the algorithms and
additionally found that data preprocessing and cleansing improve the
efficiency of classification algorithms. Shin et al. [9] compared C4.5
and Naive Bayes and concluded that C4.5 outperforms Naive Bayes.
Sharma [10] conducted an experiment in the Weka environment comparing
four algorithms, namely ID3, J48, Simple CART and Alternating Decision
Tree (ADTree).
He compared these four algorithms on a spam email data set in terms of
classification accuracy. According to his simulation results, the J48
classifier performs better than ID3, CART and ADTree in terms of
classification accuracy. Abdul Hamid M. Ragab et al. [11] compared
classification algorithms for students' college enrolment approval
using data mining. They found that C4.5 gives the best performance and
accuracy and the lowest absolute errors, followed by PART, Random
Forest, Multilayer Perceptron, and Naive Bayes, respectively. Table I
outlines a few recent works on the performance of classification
algorithms and the application areas of the experimental data sets
used.
It illustrates that many data mining algorithms can be applied in
quite different application areas.
International Journal of Advanced and Innovative Research (2278-7844) / # 193 / Volume 4 Issue 4
2015 IJAIR. All Rights Reserved 193
TABLE I
A SUMMARY OF RELATED DATA MINING ALGORITHMS
AND THE APPLICATION DATA SETS USED.
III. METHODOLOGY
We used an Intel Core i3 processor platform with 4 GB of memory, the
Windows 7 Ultimate operating system and 500 GB of secondary storage.
In all the experiments we used Weka 3.7.11 to measure the performance
characteristics on the input data set.
A. Weka interface
Weka (Waikato Environment for Knowledge Analysis) is a widely used
machine learning software suite written in Java, originally developed
at the University of Waikato, New Zealand. The Weka suite contains a
collection of algorithms and visualization tools for data analysis,
with graphical user interfaces for easy access to this functionality.
Weka is used in many different application areas, particularly for
education and research. Weka has several benefits:
It is freely available under the GNU General Public License.
It is portable, since it is fully implemented in the Java programming language and therefore runs on almost any architecture.
It offers a large collection of data preprocessing and modelling techniques.
It is simple to use because of its graphical interface.
Weka supports multiple data mining tasks, specifically data
preprocessing, clustering, classification, regression, feature
selection and visualization. All of Weka's techniques are based on the
assumption that the data is available as a single file or relation, in
which each data point is described by a fixed number of attributes.
B. Data set
The NSL-KDD data set is used for evaluation. NSL-KDD was proposed to
resolve some of the inherent problems of the KDD CUP'99 data set.
KDD CUP'99 is the most widely used data set for anomaly detection.
However, Tavallaee et al. conducted a statistical analysis of this
data set and found two important issues that greatly affect the
performance of evaluated systems and result in a very poor evaluation
of anomaly detection approaches. To resolve these problems, they
proposed a new data set, NSL-KDD, which consists of selected records
of the complete KDD data set [20].
NSL-KDD has the following benefits over the original KDD data set.
First, it does not include redundant records in the train set, so the
classifiers will not be biased towards more frequent records. Second,
the number of selected records from each difficulty level group is
inversely proportional to the percentage of records in the original
KDD data set. As a result, the classification rates of distinct
machine learning methods vary over a wider range, which makes an
accurate evaluation of different learning techniques more efficient.
Third, the numbers of records in the train and test sets are
reasonable, which makes it affordable to run the experiments on the
complete set without the need to
randomly select a small portion. Consequently, the evaluation results
of different research works will be consistent and comparable.
TABLE I
Year | Authors | Data mining algorithms | Data set
2010 | Nesreen K. Ahmed, Amir F. Atiya, Neamat El Gayar, Hisham El-Shishiny [12] | MLP, BNN, RBF, K Nearest Neighbor Regression | M3 competition data
2011 | S. Aruna, Dr S.P. Rajagopalan and L.V. Nandakishore [13] | RBF networks, Naive Bayes, J48, CART, SVM-RBF kernel | WBC, WDBC, Pima Indians Diabetes database
2011 | R. Kishore Kumar, G. Poonkuzhali, P. Sudhakar [14] | ID3, J48, Simple CART and Alternating Decision Tree (ADTree) | Spam Email Data
2012 | Abdullah H. Wahbeh, Mohammed Al-Kabi [15] | C4.5, SVM, Naive Bayes | Arabic Text
2012 | Rohit Arora, Suman [16] | C4.5, MLP | Diabetes and Glass
2013 | S. Vijayarani, M. Muthulakshmi [17] | Attribute Selected Classifier, Filtered Classifier, LogitBoost | Classifying computer files
2013 | Murat Koklu and Yavuz Unal [18] | MLP, J48, and Naive Bayes | Classifying computer files
2014 | Devendra Kumar Tiwary [19] | Decision Tree (DT), Naive Bayes (NB), Artificial Neural Networks (ANN), Support Vector Machine (SVM) | Credit Card
C. Classification algorithms
The following classifier algorithms are used for the performance
comparison on the NSL-KDD data set.
(a) OneR
OneR [21], short for "One Rule", is a simple yet accurate
classification algorithm that generates one rule for each predictor in
the data and then selects the rule with the smallest total error as
its "one rule". To create a rule for a predictor, we construct a
frequency table of that predictor against the target. It has been
shown that OneR produces rules only slightly less accurate than
state-of-the-art classification algorithms while producing rules that
are easy for humans to interpret.
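The OneR procedure described above can be sketched in a few lines of Python. This is a minimal illustration for nominal attributes only, not Weka's implementation; the function name `one_r`, the attribute names and the toy records are hypothetical and are not taken from NSL-KDD.

```python
from collections import Counter, defaultdict

def one_r(rows, target):
    """Return (best_attribute, rule, errors) for a list of dict rows."""
    best = None
    attributes = [name for name in rows[0] if name != target]
    for attribute in attributes:
        # Frequency table: attribute value -> counts of each target class.
        table = defaultdict(Counter)
        for row in rows:
            table[row[attribute]][row[target]] += 1
        # Rule: each attribute value predicts its most frequent class.
        rule = {value: counts.most_common(1)[0][0]
                for value, counts in table.items()}
        # Total error: rows whose class differs from the rule's prediction.
        errors = sum(sum(counts.values()) - max(counts.values())
                     for counts in table.values())
        if best is None or errors < best[2]:
            best = (attribute, rule, errors)
    return best

# Hypothetical toy records (illustrative only).
toy = [
    {"protocol": "tcp", "flag": "SF", "label": "normal"},
    {"protocol": "tcp", "flag": "S0", "label": "attack"},
    {"protocol": "udp", "flag": "SF", "label": "normal"},
    {"protocol": "tcp", "flag": "S0", "label": "attack"},
    {"protocol": "udp", "flag": "SF", "label": "normal"},
    {"protocol": "tcp", "flag": "SF", "label": "normal"},
]
attribute, rule, errors = one_r(toy, "label")
```

On this toy data the rule built on `flag` makes no errors, while the rule built on `protocol` misclassifies two records, so `flag` is selected as the "one rule".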
(b) Hoeffding Tree
A Hoeffding tree [22] is an incremental, anytime decision tree
induction algorithm that is capable of learning from data streams,
assuming that the distribution generating the examples does not change
over time. Hoeffding trees exploit the fact that a small sample is
often enough to choose the optimal splitting attribute. This idea is
supported mathematically by the Hoeffding bound, which quantifies the
number of observations needed to estimate some statistic within a
prescribed precision. One feature of Hoeffding trees not shared by
other incremental decision tree learners is that they have sound
guarantees of performance.
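To make the role of the Hoeffding bound concrete, the sketch below evaluates its standard form, epsilon = sqrt(R^2 ln(1/delta) / (2n)). The symbols are not defined in this paper and are assumptions of the sketch: R is the range of the statistic being estimated, delta is the allowed probability that the estimate is off by more than epsilon, and n is the number of observations seen so far. The function names are hypothetical.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the true mean lies
    within epsilon of the mean observed over n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def observations_needed(value_range, delta, epsilon):
    """Smallest n for which the bound is no larger than epsilon."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta)
                     / (2.0 * epsilon ** 2))

# Assumed example: a statistic with range 1.0 and delta = 1e-7.
eps_after_1000 = hoeffding_bound(1.0, 1e-7, 1000)
n_for_precision = observations_needed(1.0, 1e-7, 0.05)
```

In a Hoeffding tree, a leaf is typically split once the observed difference between the two best candidate attributes exceeds this epsilon, which is why a modest number of examples can be enough to decide on a split.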
(c) Decision Stump
A decision stump [23] is a machine learning model consisting of a
one-level decision tree. That is, it is a decision tree with one
internal node (the root) that is directly connected to its leaves. A
decision stump makes a prediction based on the value of just one
input feature. Decision stumps are also known as 1-rules. They are
often used as base learners in machine learning ensemble techniques
such as boosting and bagging. For example, the Viola-Jones face
detection algorithm employs AdaBoost with decision stumps as weak learners.
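A decision stump of the kind described above can be sketched as a single threshold test on one numeric feature. The code below is an illustrative, brute-force version that picks the feature and threshold with the fewest training errors; it is not Weka's DecisionStump, which may use a different split criterion. The names `train_stump` and `predict_stump` and the toy data are hypothetical.

```python
from collections import Counter

def majority(labels, default):
    """Most frequent label, or `default` when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default

def train_stump(samples, labels):
    """Return (feature, threshold, left_label, right_label)."""
    default = majority(labels, None)
    best = None
    for feature in range(len(samples[0])):
        for threshold in sorted({row[feature] for row in samples}):
            left = [y for row, y in zip(samples, labels)
                    if row[feature] <= threshold]
            right = [y for row, y in zip(samples, labels)
                     if row[feature] > threshold]
            # Each leaf predicts the majority class of the rows it receives.
            left_label = majority(left, default)
            right_label = majority(right, default)
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, left_label, right_label)
    return best[1:]

def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label

# Hypothetical two-feature toy data (illustrative only).
samples = [(0.1, 5.0), (0.2, 1.0), (0.8, 4.0), (0.9, 2.0)]
labels = ["normal", "normal", "attack", "attack"]
stump = train_stump(samples, labels)
```

Here the stump splits on the first feature at 0.2, which separates the two classes without error; the second feature is never consulted at prediction time.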
(d) Alternating Decision Tree
An alternating decision tree (ADTree) [24] combines the simplicity of
a single decision tree with the effectiveness of boosting. The
knowledge representation combines tree stumps, a common model used in
boosting, into a decision-tree-like structure. The different branches
are no longer mutually exclusive. The root node is a prediction node
and has just a numeric score. The next layer of nodes consists of
decision nodes, which are essentially a collection of decision tree
stumps. The following layer then consists of prediction nodes, and so
on, alternating between prediction nodes and decision nodes.
A model is applied by identifying the possibly multiple paths from the
root node to the leaves through the alternating decision tree that
correspond to the values of the variables of the observation to be
classified. The observation's classification score is the sum of the
prediction values along the corresponding paths.
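The scoring step described above can be sketched as a recursive sum over every path that matches the observation. The classes `PredictionNode` and `DecisionNode`, the feature names and all numeric scores below are hypothetical, chosen only to illustrate the alternating structure; the sketch covers scoring only, not the boosting procedure that learns the tree.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class PredictionNode:
    value: float
    splitters: List["DecisionNode"] = field(default_factory=list)

@dataclass
class DecisionNode:
    test: Callable[[Dict[str, float]], bool]
    if_true: PredictionNode
    if_false: PredictionNode

def score(node, instance):
    """Sum the prediction values on all paths consistent with the instance."""
    total = node.value
    # Every decision node under a prediction node is followed, so several
    # paths can contribute to the same score.
    for splitter in node.splitters:
        child = splitter.if_true if splitter.test(instance) else splitter.if_false
        total += score(child, instance)
    return total

# Hypothetical tree: the root prediction node has two decision nodes.
inner = DecisionNode(lambda x: x["src_bytes"] > 1000,
                     PredictionNode(1.0), PredictionNode(-0.5))
root = PredictionNode(0.5, [
    DecisionNode(lambda x: x["duration"] < 10,
                 PredictionNode(-0.75), PredictionNode(0.25, [inner])),
    DecisionNode(lambda x: x["count"] > 50,
                 PredictionNode(0.75), PredictionNode(-0.25)),
])

score_a = score(root, {"duration": 20, "src_bytes": 2000, "count": 10})
score_b = score(root, {"duration": 5, "src_bytes": 2000, "count": 10})
```

For the first observation the contributing values are 0.5, 0.25, 1.0 and -0.25, giving 1.5; for the second they are 0.5, -0.75 and -0.25, giving -0.5. In the usual two-class setting the sign of the score is taken as the predicted class.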
IV. CLASSIFIER PERFORMANCE MEASURES
A confusion matrix contains information about the actual and
predicted classifications made by a classification system. The
performance of such systems is commonly evaluated using the data in
the matrix. Fig. 1 shows the confusion matrix.
                   Predicted negative   Predicted positive
Actual negative            a                    b
Actual positive            c                    d
Fig. 1 Confusion matrix
The entries in the confusion matrix have the following meaning in the
context of our study:
a is the number of correct predictions that an instance is negative,
b is the number of incorrect predictions that an instance is positive,
c is the number of incorrect predictions that an instance is negative, and
d is the number of correct predictions that an instance is positive.
The following metrics are used for the evaluation on the data set:
Accuracy: Accuracy is the proportion of the total number of predictions that were correct. It is
determined using the equation:
Accuracy = (a + d) / (a + b + c + d)
Detection Rate: Detection Rate is the proportion of the predicted positive cases that were correct, as
calculated using the equation:
Detection Rate = d / (b + d)
False Alarm Rate: False Alarm Rate is the proportion of negative cases that were incorrectly
classified as positive, as calculated using the equation:
False Alarm Rate = b / (a + b)
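The three measures can be computed directly from the four confusion matrix entries. The sketch below follows the definitions given in the text: accuracy as correct predictions over all predictions, detection rate as correct positive predictions over all positive predictions, and false alarm rate as b / (a + b). The function name and the example counts are hypothetical and are not results from this study.

```python
def confusion_metrics(a, b, c, d):
    """a: correct negative, b: incorrect positive,
    c: incorrect negative, d: correct positive predictions."""
    total = a + b + c + d
    return {
        "accuracy": (a + d) / total,
        # Share of positive predictions that were correct.
        "detection_rate": d / (b + d),
        # Share of actual negatives that were flagged as positive.
        "false_alarm_rate": b / (a + b),
    }

# Hypothetical counts (illustrative only).
metrics = confusion_metrics(a=90, b=10, c=5, d=95)
```

With these counts the accuracy is 185/200, the detection rate is 95/105 and the false alarm rate is 10/100.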
V. EXPERIMENTAL RESULTS AND COMPARATIVE ANALYSIS
We investigated the performance of the selected classification
algorithms. The classifications were done using 10-fold
cross-validation. In WEKA, all data records are treated as instances,
and the features of the data are referred to as attributes. The
simulation results are presented in separate bar charts for easier
analysis and evaluation. Table II shows the performance of the
classifier algorithms on the NSL-KDD data set.
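For readers unfamiliar with the test mode, k-fold cross-validation can be sketched as follows: the instances are shuffled and divided into k folds, and each fold serves once as the test set while the remaining folds form the training set. This is a plain, non-stratified illustration with a hypothetical function name and seed; Weka's own procedure may differ, for example by stratifying the folds by class.

```python
import random

def k_fold_indices(n_instances, k=10, seed=1):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Example with 25 hypothetical instances and the default 10 folds.
fold_sizes = [len(test) for _, test in k_fold_indices(25)]
```

Each instance appears in exactly one test fold, so every record is used for testing once and for training k - 1 times; the reported measures are then aggregated over the folds.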
TABLE II
PERFORMANCE OF CLASSIFIER ALGORITHMS
Fig. 2 shows the Accuracy of the classifiers on the NSL-KDD data set.
The result shows that Hoeffding Tree is the best classifier, followed
by ADTree, OneR and Decision Stump.
Fig. 2 Accuracy of classifiers
Fig. 3 shows the Detection Rate of the classifiers on the NSL-KDD data
set. The experimental result shows that Hoeffding Tree is the best
classifier, followed by Decision Stump, OneR and ADTree.
Fig. 3 Detection Rate of classifiers
Fig. 4 shows the False Alarm Rate of the classifiers on the NSL-KDD
data set. The result of the experiment shows that Decision Stump is
the best classifier, with the lowest False Alarm Rate, followed by
Hoeffding Tree, OneR and ADTree.
Fig. 4 False Alarm Rate of classifiers
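The orderings stated above can be checked against the values reported in Table II. The short script below copies those values and sorts the classifiers by each measure, treating higher as better for Accuracy and Detection Rate and lower as better for False Alarm Rate. The dictionary layout and the `rank` helper are illustrative choices, not part of the original experiments.

```python
# Values copied from Table II of this paper.
results = {
    "OneR": {"accuracy": 0.94615, "detection_rate": 0.94954,
             "false_alarm_rate": 0.06714},
    "Decision Stump": {"accuracy": 0.81733, "detection_rate": 0.94964,
                       "false_alarm_rate": 0.05025},
    "Hoeffding Tree": {"accuracy": 0.95120, "detection_rate": 0.95515,
                       "false_alarm_rate": 0.05952},
    "ADTree": {"accuracy": 0.95094, "detection_rate": 0.94592,
               "false_alarm_rate": 0.07321},
}

def rank(metric, higher_is_better=True):
    """Classifier names ordered from best to worst on one measure."""
    return sorted(results, key=lambda name: results[name][metric],
                  reverse=higher_is_better)

by_accuracy = rank("accuracy")
by_detection_rate = rank("detection_rate")
by_false_alarm_rate = rank("false_alarm_rate", higher_is_better=False)
```

The three lists reproduce the rankings given for Fig. 2, Fig. 3 and Fig. 4.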
VI. CONCLUSION
Four classification algorithms were investigated in this paper with
NSL-KDD as the data set: Hoeffding Tree, ADTree, OneR and Decision
Stump. A comparative study of the classification measures Accuracy,
Detection Rate and False Alarm Rate was carried out by simulation
using the Weka toolkit. The experimental results show that Hoeffding
Tree gives the best performance in terms of Accuracy and Detection
Rate, but Decision Stump is the best performer in terms of False
Alarm Rate.
TABLE II
Classifier | Accuracy | Detection rate | False alarm rate
OneR | 0.94615 | 0.94954 | 0.06714
Decision Stump | 0.81733 | 0.94964 | 0.05025
Hoeffding Tree | 0.95120 | 0.95515 | 0.05952
ADTree | 0.95094 | 0.94592 | 0.07321
REFERENCES
[1] O. Serhat and A.C Yilamaz, Classification And Prediction In A Data Mining Application, Journal of Marmara for Pure and Applied Sciences, Vol. 18, pp. 159-174.
[2] H. Kaushik and B.G Raviya, Performance Evaluation of Different Data Mining Classification Algorithm Using WEKA, Indian Journal of Research (PARIPEX), Vol. 2, 2013.
[3] Wai-Ho Au, Keith C. C. Chan and Xin Yao, A Novel Evolutionary Data Mining Algorithm with Applications to Churn Prediction, IEEE Transactions on Evolutionary Computation, Vol. 7, pp. 532-545, 2003.
[4] G. Karina, S.M Miquel, and S. Beatriz, Tools for Environmental Data Mining and Intelligent Decision Support, International Congress on Environmental Modelling and Software, 2012.
[5] C. Giraud-Carrier and O. Povel, Characterising Data Mining software, Intelligent Data Analysis, pp. 181-192, 2003.
[6] H. Mark, F. Eibe, H. Geoffrey, P. Bernhard, R. Peter, H.W Ian, The WEKA Data Mining Software: An Update, SIGKDD Explorations, Vol. 11, 2009.
[7] H.W Abdullah, A.A Qasem, N.A Mohammed, and M.A Emad, A Comparison Study between Data Mining Tools over some Classification Methods, International Journal of Advanced Computer Science and Applications, pp. 18-26, 2012.
[8] H. Hong, L. Jiuyong, P. Ashley, W. Hua, and D. Grant, A Comparative Study of Classification Methods for Microarray Data Analysis, in Australasian Data Mining Conference, 2006.
[9] C.T My, S. Dongil, and S. Dongkyoo, A Comparative Study of Medical Data Classification Methods Based on Decision Tree and Bagging Algorithms, in Eighth IEEE International Conference on Dependable, Autonomic and Secure Computing, 2009.
[10] K.S Aman, A Comparative Study of Classification Algorithms for Spam Email Data Analysis, International Journal on Computer Science and Engineering, Vol. 3, pp. 1890-1895, 2011.
[11] M. Ragab, Abdul Hamid, et al., A Comparative Analysis of Classification Algorithms for Students College Enrollment Approval Using Data Mining, Proceedings of the 2014 Workshop on Interaction Design in Educational Environments, 2014.
[12] K. Ahmed, Nesreen, et al., An empirical comparison of machine learning models for time series forecasting, Econometric Reviews, Vol. 29, pp. 594-621, 2010.
[13] S. Aruna, S.P. Rajagopalan and L.V. Nandakishore, An Empirical Comparison of Supervised Learning Algorithms in Disease Detection, International Journal of Information Technology Convergence and Services, Vol. 1, pp. 81-92, 2011.
[14] R. Kishore Kumar, G. Poonkuzhali, and P. Sudhakar, Comparative Study on Email Spam Classifier using Data Mining Techniques, Proceedings of Int. MultConf. of Engineers and Computer Scientists, Vol. 1, 2012.
[15] H.W Abdullah and A.K Mohamedi, Comparative Assessment of the Performance of Three, Basic Sci. & Eng, Vol. 21, pp. 15-28, 2012.
[16] A. Rohit and Suman, Comparative Analysis of Classification Algorithms on Different Datasets using WEKA, International Journal of Computer Applications, Vol. 54, pp. 21-25, 2012.
[17] S. Vijayarani and M. Muthulakshmi, Comparative Study on Classification Meta Algorithms, International Journal of Innovative Research in Computer and Communication Engineering, Vol. 1, pp. 1768-1774, 2013.
[18] K. Murat and U. Yavuz, Analysis of a Population of Diabetic Patients Databases with Classifiers, International Journal of Medical, Pharmaceutical Science and Engineering, Vol. 7, pp. 176-178, 2013.
[19] K.T Devendra, A Comparative Study of Classification Algorithms for Credit Card Approval using WEKA, GALAXY International Interdisciplinary Research Journal, Vol. 2, pp. 165-174, 2014.
[20] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, A detailed analysis of the KDD CUP 99 data set, IEEE International Conference on Computational Intelligence in Security and Defence Applications, pp. 53-58, 2009.
[21] R. C. Holte, Very simple classification rules perform well on most commonly used datasets, Machine Learning, Vol. 11, pp. 63-90, 1993.
[22] G. Hulten, S. Laurie, and D. Pedro, Mining time-changing data streams, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.
[23] W. Iba and P. Langley, Induction of One-Level Decision Trees, in Proceedings of the Ninth International Conference on Machine Learning, pp. 233-240, 1992.
[24] Y. Freund, L. Mason, The alternating decision tree learning algorithm, in Proceedings of the Sixteenth International Conference on Machine Learning, pp. 124-133, 1999.