real time motif classification from database using intelligent algorithms
TRANSCRIPT
-
7/30/2019 REAL TIME MOTIF CLASSIFICATION FROM DATABASE USING INTELLIGENT ALGORITHMS
1/16
International Journal of Artificial Intelligence & Applications (IJAIA), Vol.3, No.6, November 2012
DOI : 10.5121/ijaia.2012.3610 95
REAL TIMEMOTIF CLASSIFICATION FROM
DATABASE USING INTELLIGENT ALGORITHMS
Hiteshkumar Nimbark1 , Pratik Chavda2, Chetan Undhad2 and Paresh Kotak3
1Department of Computer Science and Engineering,
B H Gardi Engg, Rajkot, Gujarat, [email protected]
2Noble Enggering College, Junagadh, Gujarat, [email protected],[email protected]
2MParesh Kotak, A V Parekh polytechnique, Rajkot, Gujarat, India
ABSTRACT
The amount of raw data being accumulated in the databases is increasing at an inconceivable rate.
However, these data-rich databases are poor in providing substantial information. This is where data
mining comes into picture. Specifically, data mining is "the process of extracting or mining information
from large amount of data". Motif classification has been an active area of research in data mining. It
consists of assigning a data instance to one of the predefined classes/groups based upon the knowledge
gained from previously seen (classified) data.
Real world problems demand classifiers that are accurate as well as easy to interpret. Traditional
comprehensible classifiers like decision trees are very accurate at classifying new instances, but with
increase in the size of datasets and/or the number of classes, and/or with increase in number of attributes,
the trees induced are very large in size and hence difficult to interpret. On the other hand, evolutionary
algorithms like Genetic Programming when applied to classification problems give trees that have smaller
size but are not very accurate.
Thus the work proposes an algorithm MotifClass MotifClass, that employs Genetic Proframming as an
optimization technique to yield efficient and effective classification. The goal is to evolve a classifier that
performs a trade-off between accuracy and comprehensibility in order to produce an optimal decision tree
classifier (for n-class where n>=2) using GP.
When evaluated on some benchmark datasets, the proposed algorithm obtained by merging Genetic
Programming and decision tree, outperforms the traditional classification techniques in terms of a
combination of accuracy and comprehensibility.
KEYWORDS
Accuracy, Motif Classification Trees, Comprehensibility, Genetic Programming, Multi-class classification.
1. INTRODUCTION
The digital revolution has brought an explosion of stored data to our world, and it's increasing at
an ever faster rate. This collection of data is quickly outpacing the ability for humans tounderstand or make use of it. Scientists across the world have joined in the hunt to discover
valuable knowledge from this digital information[1].
mailto:[email protected]:[email protected]:[email protected]:[email protected]:[email protected]:[email protected]:[email protected]:[email protected] -
7/30/2019 REAL TIME MOTIF CLASSIFICATION FROM DATABASE USING INTELLIGENT ALGORITHMS
2/16
International Journal of Artificial Intelligence & Applications (IJAIA), Vol.3, No.6, November 2012
96
Classification algorithms generally yield results in either symbolic or non-symbolic form.Algorithms that provide non-symbolic results only assign class labels to the data instances and do
not convey the reason as to why the algorithm led to that classification. Examples includeBayesian classifiers, k Nearest Neighbor (kNN) classifier, Neural Networks, etc. However in realworld applications, it is also required to know the reasons behind classifying any data into a
particular class. Algorithms that provide such reasons fall in the category of symbolic classifiers.Decision trees are symbolic classifiers, i.e. they have high comprehensibility and are accurate too.
But, with increase in number of classes or when training data becomes very large, size of the treeincreases considerably. This leads to loss of "comprehensibility", which was considered as one of
the best properties of decision tree.
Evolutionary algorithms like Genetic Algorithms (GA) and Genetic Programming (GP) have been
found successful in solving numerous classification problems. Also, tree structures are anencoding scheme for GP individuals. The classification trees generated using GP have smaller
number of nodes and hence are more comprehensible.
Hence the need to develop an optimal multiclass classifier that performs a trade-off between
accuracy and comprehensibility has been the motivation of the work.
2. LITERATURE SURVEY
Motif Classification has been an active area of research in data mining. It is a form of dataanalysis that can be used to extract models describing important data classes and can help provide
us with a better understanding of the data at large. Classification predicts categorical (discrete,unordered) classes, given any unseen data.
Intelligent Evolutionary algorithms (EAs) are stochastic optimization techniques that imitatesome aspects of natural evolution. They try to obtain good solutions by creating a population of
data structures that represent them, and evolve that population by changing those solutions,combining them to create new ones, and, on average, making the better-suited survive and the
discarding worst-suited individuals. EAs are a set of modern met heuristics used successfully in
many applications with very high complexity.
2.1. Motif Classification
Motif Classification is a task of assigning a data instance to one of the predefined classes/groupsbased upon the knowledge gained from previously seen (classified) data.
Basically, classification is a two step process. First, a training set consisting of records whoseclass labels are known must be provided. The training set is used to induce a classification model,
which is subsequently applied to the test set consisting of records with unknown class labels[1].Classification plays a major role in numerous practical situations such as classifying e-mails as
spam or not spam, for a bank in classifying loan applications as either safe or risky, diagnosing
medical conditions, categorizing news stories as finance, weather, business, entertainment, sports,etc. There are numerous methods used to build classifiers.
Decision Tree (CART) [1]
Naive Bayes Classifier [7]
k Nearest Neighbors Classifier [2,11,12] etc.
-
7/30/2019 REAL TIME MOTIF CLASSIFICATION FROM DATABASE USING INTELLIGENT ALGORITHMS
3/16
International Journal of Artificial Intelligence & Applications (IJAIA), Vol.3, No.6, November 2012
97
2.2. Intelligent Algorithms
Real world applications are most likely to encounter problems that are hard to solve. Some
examples for these kind of problems are the travelling salesman, knapsack problem, etc. To
approach such hard problems, a couple of concepts were introduced in the past decades, which
were inspired by nature. Some of the more popular and successful examples are GeneticAlgorithms [3], Neural Networks (NN), Fuzzy Methods (FM) and Evolutionary Algorithms(EA or also known as Evolutionary Computation), Ant Colony Optimization (AGO), etc.
EA[23] are slow but the comprehensibility of EA based classifiers is better compared to the non-evolutionary approach based classification algorithms. Hence, EA is the best choice when
comprehensibility is a selection criterion and decision tree is the first choice when training time isa selection criterion.
Different ways of establishing the classification technique using GP has been stated and in eachmethod the issues which need to be analyzed for GP are mentioned [22].
The methods and different research work which can be used to create a decision tree and
evaluation using GP are as follows:
The paper [24] presents an approach for designing classifiers for a multiclass problem using
Genetic Programming).
An MOGP system [25] designed to optimize two goals is described, but these goals are not fixed.The work of A multi-objective genetic programming approach [25] decision trees are generatedthrough evolution of training data. Usually, DTs are designed from fixed data.
The authors of Adaptive genetic programming for dynamic classification problems [26]
investigate the feasibility of using Genetic Programming in dynamically changing environments
to evolve decision trees for classification problems and proposes an new version of GeneticProgramming called Adaptive Genetic Programming. To cope up with the changing environment,
a new control parameter known as culling is introduced.
The author of Novel feature selection method for genetic programming [27] where GP isemployed to achieve a selection of features. Each individual is a tree encoding a classifierrepresented as a discriminate function, and classification accuracy is used as the fitness function.
The authors capitalize on the implicit feature selection ability of GP classifiers by applying GPfor classification in a two-stage scheme. First, a certain number of GP runs are carried out, each
one resulting in a best-of-run classifier. Then, in the second stage, the GP is run again using onlythe features most frequently used in the best classifiers obtained in the previous stage. However,
the goal of the first stage is just to discover the most interesting features, and the classifiers
obtained are discarded. The final classifier is induced in the second stage using the featuresselected in the first stage.
Further, Fitness landscapes and inductive genetic programming [28] presents an analyticalstudy on the influence of the fitness function and mutation operator on the quality of the decisiontrees induced, taking into account both classification accuracy and tree size
Also, Searching the forest: Using decision trees as building blocks for evolutionary search inclassification databases [29], a new mutation technique called Mutation on RHS has been
proposed. Given a condition, the method consists of changing the right hand side of the condition
-
7/30/2019 REAL TIME MOTIF CLASSIFICATION FROM DATABASE USING INTELLIGENT ALGORITHMS
4/16
International Journal of Artificial Intelligence & Applications (IJAIA), Vol.3, No.6, November 2012
98
into a randomly chosen (allowed) new one. This can either be a value of the attribute or anotherattribute. Also, the paper uses pruning as a Genetic Operator.
The potential of GP to include any kind of operation into individuals is used in Application ofgenetic programming to induction of linear classification trees [30] in order to evolve oblique
decision trees. The paper shows that in GP, individuals tend to become larger over time. Thisphenomenon is known as bloat.
The above paper provides an up-to-date overview that is fully focused on evolutionary algorithms
and decision trees and does not concentrate on any specific evolutionary approach.
Also, it provides a taxonomy, which addresses works that evolve decision trees and works that
design decision-tree components by the use of evolutionary algorithms. Further, the paperaddresses some more important issues and open questions that can be the subject of future
research in concern with combining decision tree and evolutionary algorithms.
Our the proposed algorithm, the strategy of using decision trees for producing initial population
instead of initializing the population randomly is adopted. Further, emphasis is laid on the nature
of mutation is destructive and a solution to this is suggested in the proposed algorithm [24].
2. PROPOSED ALGORITHM
A new algorithm named MotifClass that performs a trade-off between accuracy and
comprehensibility. As the name suggests, the proposed algorithm is a hybrid version of GP andClassification Tree (Decision Tree).
The algorithm is especially designed for large datasets and for problems involving multi-classclassification because the increase in dataset size or number of classes directly affects the size of
the trees, resulting into trees that are not comprehensible enough. As a solution to this issue, theproposed approach gives equal importance to size of the tree and its accuracy by designing the
fitness function accordingly. Undoubtedly, the approach works perfectly well for binary
classification problems and small datasets.
The proposed algorithm uses some lessons learnt from the literature survey and combines them toproduce a new classification technique. The basic approach is to form a population of decision
trees and evolve them using GP with accuracy and number of nodes as fitness parameters.
Algorithm 1 MotifClass.
Input: Data Partition, Dt, which is a set of training instances and their associated class labels
Output: BEST_SO_FAR, the optimized classification tree
1. for i = 1 to POP_SIZE do // Initial population
2. Create a decision tree from a random subset of Dt
3. end for4. Evaluatefitness of each individual in the initial population using fitness measure
of Equation: 1 and Dt5. while GENERATION < MAX_GENERATIONS or
Termination criterion not satisfied do
6. repeat7. Perform genetic operations on parent trees to generate offspring trees for the
new population
-
7/30/2019 REAL TIME MOTIF CLASSIFICATION FROM DATABASE USING INTELLIGENT ALGORITHMS
5/16
International Journal of Artificial Intelligence & Applications (IJAIA), Vol.3, No.6, November 2012
99
8. if(Size of Individual < c1 _ Size of BEST_SO_FAR ) then9. Evaluate its fitness and increment NEW_POP_SIZE
10. else if(Accuracy of Individual >c2 * Accuracy of BEST SO FAR ) then
11. Evaluate its fitness and increment NEW POP SIZE
12.else
13. Discard the individual
14. end if15. until POP_SIZE individuals are produced
16. GENERATION GENERATION + 117. POP_SIZE NEW_POP_SIZE
18. end while19. Save the population of last generation for incremental learning20. Designate best individual found so far as BEST_SO_FAR
The input to the algorithm is the set of training instances along with their associated class labels.As an output, the algorithm gives us a classification tree BEST_SO_FAR, which is an optimal
classifier in terms of accuracy and comprehen-sibility. Initially, POP-SIZE is set equal to the
desired size of the initial population and as the name suggests, the parameter MAX-GENERATIONS is set equal to the number of maximum generations the algorithm is desired to
run. Because of scarcity of space we are not including the detailed explanation of the
algorithm.
Equation: 1
where control parameter representing a tradeoffs between accuracy and size. Its value can range
from 0.05 to 0.1.
3. IMPLEMENTATION
This details of the tool used for implementation followed by a block diagram picturizing themodular flow of the project and other relevant details.
3.1 Tools Used and Modular flow
CART and the proposed algorithm MotifClass have been implemented in MATLAB 7.8.0
(R2009a) whereas GP for classification is implemented in HeuristicLab 3.5.33, an open sourceunder GNU General Public License. All the experiments are performed on Windows Platform.
The reason behind choosing MATLAB with plug-in GPlab[38] for Implementation of theproposed algorithm are Interactive interface, Debugging facilities, High quality graphics and
visualization facility.
The block diagram of figure 1 shows the modular flow of the project. In the PreRequisiteCheckstage, the dataset is imported and checked for noisy data. The algorithm also facilitates selecting
few attributes out of many.
In the second stage, initial population is generated and evaluated for their efficiency using the
training data itself. The algorithm practices to keep the best individuals at the top of array so that
-
7/30/2019 REAL TIME MOTIF CLASSIFICATION FROM DATABASE USING INTELLIGENT ALGORITHMS
6/16
International Journal of Artificial Intelligence & Applications (IJAIA), Vol.3, No.6, November 2012
100
its comparison with newly developed individuals can be done easily. Hence a sort function is
used that sorts individuals based on fitness.
The third stage performs the GP runs until the termination criterion is reached and presents the
best solutions at the top of the array. The module GPruns include performing Reproduction with
Rank Selection method implemented in it. Further, it includes Crossover and Mutation functionswith the stated sub-functions like swap, etc.
The results of the last generation are saved for incremental learning, which occurs after the
algorithm has produced its best classifier and some new training data arrives at hand. In order toperform incremental learning, one more GP run is conducted on the saved population. This is whyGPruns module is called again from Incremental-Learning module.
Figure 1. Block diagram of modules implemented
3.2 Methodology
3.2.1 Pre Requisite Checking
Figure 2 shows a GUI that enables selection of the attributes the user wants to allow forparticipating in classification. This feature is provided as the data might be noisy or someattributes may be irrelevant for classification. Such attributes can be eliminated here.
-
7/30/2019 REAL TIME MOTIF CLASSIFICATION FROM DATABASE USING INTELLIGENT ALGORITHMS
7/16
International Journal of Artificial Intelligence & Applications (IJAIA), Vol.3, No.6, November 2012
101
Figure 2. GUI enabling selection of attributes for classification
3.2.2 Initial Population
Figure 3 shows a snapshot of some of the trees of the initial generation. Few bestindividuals formed as the initial population on the Robot Navigation dataset described earlier are
listed as in table 1
Figure 3: Sample of initial population generated (Decision Trees)
-
7/30/2019 REAL TIME MOTIF CLASSIFICATION FROM DATABASE USING INTELLIGENT ALGORITHMS
8/16
International Journal of Artificial Intelligence & Applications (IJAIA), Vol.3, No.6, November 2012
102
Table 1: A list of few Best Individuals
Individual Number Fitness Accuracy No. of Nodes
16 65.710079 74.701472 13
3 63.069364 75.118023 3371 61.863498 73.451819 31
42 59.618168 70.785893 31
65 58.222897 71.008053 53
84 57.769857 70.952513 61
The results show that by using Decision Trees for creating initial population, gives good trees
from the very beginning which matches the theoretical assumption [32].
3.2.3 Generic Implementation
Once the initial trees have been developed, the next step is to perform crossover, mutation andreproduction upon them based on the results. Figures 4 to 6 show how crossover and mutation isimplemented.
Figure 6 shows an offspring obtain by performing genetic operation on parents. In case of
crossover, two parents undergo the above stated process whereas for mutation, one parent ismutated with a randomly created new tree.
Whenever a leaf node is to be swapped with a leaf node, the task becomes very simple and simplyone node swap is required. However, if a node that has branches becomes the point of crossover
or mutation, it is to be pruned first and then swapped with the crossover or mutation node of the
other tree.
Figure 4 (a) (a)Step - 1: Randomly selecting
crossover/mutation point
Figure 4(b) : Step - 2: Pruning the node
undergoing crossover/mutation
-
7/30/2019 REAL TIME MOTIF CLASSIFICATION FROM DATABASE USING INTELLIGENT ALGORITHMS
9/16
International Journal of Artificial Intelligence & Applications (IJAIA), Vol.3, No.6, November 2012
103
Figure 5: (a) Step - 3: Substituting node 2 of Tree 2 at
node 5 of Tree 1
Figure 5 (b): Step - 4: Addition of new nodes
and changing properties of all following Nodes
Figure 6: Step - 5: Addition of new nodes and changing properties of all following Nodes
If the node that is added has branches, crossover or mutation takes place by finding the next node
with branch and giving the number of its children to children of the swapped node. All theproperties of the nodes that follow are to be changed.
Building crossover and mutation functions have been the heart of the implementation as it not
only involves simple swapping of nodes, but all the properties of the nodes such as its parent, itschildren, etc. have to be swapped. Also, crossover and mutation are responsible for consumingthe major time the algorithm runs. Further, the proposed algorithm works well with both numeric
and text data.
3.2.4 Output Classification Tree
An example of the final classification tree developed on the Robot Navigation dataset is shown infigure 7. The view function of MATLAB is used to display the tree. The final tree is
comprehensible enough to analyze on the given dataset.
-
7/30/2019 REAL TIME MOTIF CLASSIFICATION FROM DATABASE USING INTELLIGENT ALGORITHMS
10/16
-
7/30/2019 REAL TIME MOTIF CLASSIFICATION FROM DATABASE USING INTELLIGENT ALGORITHMS
11/16
International Journal of Artificial Intelligence & Applications (IJAIA), Vol.3, No.6, November 2012
105
In the experimentations, 66% of each of these datasets have been used for training the classifierand the rest for testing it. The next section describes in detail the results of experiments on five
datasets, each representing a multi-class problem. The algorithm has also been tested on manyother datasets representing multi-class as well as binary classification and the test results areshown in Table 4. Although the algorithm has been designed especially for multi-class
classification problems, it works equally well on problems with binary classes. Hence,experimental results have also been performed on binary classification problems in order to show
the generalized behavior of the proposed algorithm on classification problems with any number ofclasses.
4.2 Experimental Results and Findings
Some initial runs were performed to compare settings for tournament size, trade off factor andvalidation factors. The results showing fitness, accuracy and number of nodes of individuals for
different values of trade-off factor is shown in Table 3. The results show that 0.05 is the optimal
value of trade-off factor 7 on the above datasets. Hence the algorithm uses value of 7 as 0.05
throughout the experiments.
Table 3: Optimal value of Trade-off factor 7
Dataset Trade-off
factor
7
Fitness
Accuracy No. of
Nodes
Cardiotocography 0.0f 92.31 96.56 123
0.05 75.24 94.07 870.1 61.35 88.5 39
Robot Navigation 0.01 95.34 99.45 68
0.05 82.86 99.25 370.1 56.69 76.86 21
Nursery 0.01 78.86 83.24 223
0.05 62.89 79.65 1130.1 50.48 72.44 37
The heart of the algorithm are the crossover and mutation operations which lead to increase in
accuracy from a smaller value to significantly high values. The average results of how thecrossover and mutation operations perform on all the stated datasets is shown in figure 8 (a) and(b) respectively.
The results show that on an average, mutation raises the accuracy of weak individuals to a higher
value whereas crossover increases their accuracy to the highest values.
Figure 9 shows how accuracy of the individuals increase with generations. The results show thatafter a finitely small number of generations, the termination criterion is reached and the algorithmterminates. In such cases, the graph becomes a straight line showing that the accuracy is no more
increasing.
-
7/30/2019 REAL TIME MOTIF CLASSIFICATION FROM DATABASE USING INTELLIGENT ALGORITHMS
12/16
International Journal of Artificial Intelligence & Applications (IJAIA), Vol.3, No.6, November 2012
106
Figure 8(a): Effect of Mutation Figure 8(b): Effect of Crossover
For Nursery dataset, the fitness becomes stable after 17 generations whereas on RobotNavigation, it takes 13 generations for fitness to stabilize. On the other hand, MotifClass runs for19 generations on Cardiotocography (CTG) dataset to terminate.
Figure 9: Increase in Fitness of Individuals with
Generations
Figure 10: Accuracies of GP, CART and
MotifClass on Training instances
Figures 10 and 11 respectively show the results of accuracy (in %) on training and test instanceswhereas Figure 12 shows the results comprehensibility (number of nodes) of the classifier on the
above stated datasets for the three algorithms CART, GP for classification and MotifClass. with
increase in number of classes and length of dataset, CART (i.e. decision tree) performs good interms of accuracy but comparison of classifier is less, whereas GP produces classification trees
with less number of nodes and hence is highly comprehensible. This can also be concluded fromthe results above. Hence the goal of our proposed approach was to perform a trade-off betweenaccuracy and number of nodes and the results shown in figure 11 and 12 verify that we have been
successful in performing the trade-off. Not only a trade-off, in most of the cases, the accuracy hasalso increased in comparison to CART.
On training instances, the classifier performs as good as CART in terms of accuracy whereas the
number of nodes have significantly decreased. On the test instances, MotifClass outperforms GP
and CART in most of the cases.
-
7/30/2019 REAL TIME MOTIF CLASSIFICATION FROM DATABASE USING INTELLIGENT ALGORITHMS
13/16
International Journal of Artificial Intelligence & Applications (IJAIA), Vol.3, No.6, November 2012
107
For example, on CTG and Nursery datasets, the accuracy obtained by applying MotifClass isremarkably more than CART. Moreover, as MotifClass is a result of merging GP to CART, the
number of nodes have also been reduced as compared to CART. On the other hand, the accuracyof MotifClass on Robot Navigation dataset is almost equivalent to CART, and the number ofnodes in the classification tree are convincingly less than CART resulting into a triumphant trade-
off.
The only database on which MotifClass has a notably less accuracy is on the Forest cover-typedatabase. But, the number of nodes have decreased by approximately three times and hence the
reduction in accuracy is quite acceptable.
Further, the largest dataset used, that is the Poker-hand dataset, is classified the best with
MotifClass as compared to CART and GP. Not only is the accuracy of MotifClass fairly high, butthe number of nodes have also been very significantly decreased (almost 6 times). This shows
that MotifClass performs at its best when mining interpretable results from very large datasets.
Figure 11: Accuracies of GP, CART and MotifClass on Test instances
Figure 12: Number of nodes in trees generated by GP, CART and MotifClass
4.3 Comments on Classification Time
As far as the time complexity is concerned, it is quite obvious that Genetic Programming takesmore time than CART as it continues for a number of generations. MotifClass being a
combination of CART and GP, clearly takes more time in construction of the classifier. However,
-
7/30/2019 REAL TIME MOTIF CLASSIFICATION FROM DATABASE USING INTELLIGENT ALGORITHMS
14/16
International Journal of Artificial Intelligence & Applications (IJAIA), Vol.3, No.6, November 2012
108
the construction of classifier is a one-time cost and once generated, what is of importance is theamount of time taken to classify new instances using that classifier.
As the results show, the tree size of CART is considerably large and hence it takes considerablyhigh time in classifying the data. MotifClass on the other hand, may take time in building the
classifier, but once trained, the new instances can be classified quickly as the size of the tree issmall (at least smaller than tree developed by CART or any other decision tree algorithm).
Undoubtedly, pruning [39] the trees developed by decision tree algorithm reduces the size, but it
happens only by compromising in accuracy to a greater extent. MotifClass, while performingtrade-off, sees to the fact that accuracy remains convincingly high.
MotifClass has been verified on several other datasets representing binary as well as multiclassproblems. The results on some of them are shown in Table 4. As the results of CART and
MotifClass are more significant, only they are compared. Further, the accuracy tabulated is for thetest instances as the actual performance of the classifier can be measured as how efficiently itclassifies the unknown instances.
Table 4: Experimental results on different datasets
DatasetNo. of
Attributes
No. of
Instances
No. of
Classes
Optimization
ParameterCART MotifClass
Iris 5 150 3% Accuracy
No. Of Nodes
96.07
9
98
7
Ecoli 8 336 8% Accuracy
No. Of Nodes
74.34
31
81.42
23
Diabetes 9 786 2% Accuracy
No. Of Nodes
74.23
103
81.54
27
Vehicle 18 946 4% Accuracy
No. Of Nodes
66.43
212
69.58
99
CMC 9 1473 3% Accuracy
No. Of Nodes
48.8
249
49.6
186
Segment 20 1500 7 % Accuracy
No. Of Nodes
96.17
77
96.66
53
CarEvaluatioin 7 1728 4% Accuracy
No. Of Nodes
95.22
65
94.2
39
Spambase 58 4601 2% Accuracy
No. Of Nodes
80.36
239
91.59
133
The previous results and the results from this table show that in almost all the cases, MotifClass
outperforms CART in terms of accuracy and comprehensibility.
Further, the results presented are an average of experiments conducted three times using the same
algorithm parameters. The reason for conducting the same experiment thrice is because GP andhence the algorithm MotifClass has its roots in randomization. The parts of training data selected
for generation of initial population, the selection of trees for crossover, mutation, etc. is allrandomly done. Hence, to assure the consistency of output, the average of three experimentalresults is presented.
-
7/30/2019 REAL TIME MOTIF CLASSIFICATION FROM DATABASE USING INTELLIGENT ALGORITHMS
15/16
International Journal of Artificial Intelligence & Applications (IJAIA), Vol.3, No.6, November 2012
109
5. CONCLUSIONS
The work proposes a new algorithm MotifClass developed with an aim to perform a trade-off
between accuracy and comprehensibility of the classifier. The work focuses on large datasets andmulti-class classification. The implementation results show that the proposed algorithm
outperforms the traditional decision tree classifier CART, as well as GP. Not only has a trade-offbeen performed, but the accuracy of the MotifClass classifier is obtained more than decision tree
on most of the datasets. The number of nodes have also been decreased noticeably giving us a
comprehensible classifier.
Using CART to produce the initial population has performed a major role in increasing the
accuracy, as we have been getting good trees from the very beginning. Secondly, the fitnessmeasure proposed succeeds in imparting more emphasis on trees with high accuracy, as well as
less nodes, which in turn increases the efficiency of the algorithm.
Further, the algorithm uses variable size population by performing an extra step that simple GP,in which it validates some of the individuals and reject the few. This extra step saves timerequired in performing operations on the individuals that can be rejected at an early stage. Also, it
saves a lot of computation involved in calculating fitness of few so-called "useless" trees
Clearly, with the decrease in population size, the overall running time of algorithm also decreases.Lastly, using Reverse Rank Selection to select trees that undergo mutation results into asignificant increase in accuracy of the weak individuals.
Summarizing it, the proposed algorithm has been successful in attempting to produce an optimal
classifier, that is, a classifier with high accuracy and smaller number of nodes.
REFERENCES
[1] P. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison-Wesley, 2005.
[2] J. Han and M. Kamber, Data Mining: Concepts and Techniques. San Mateo: The Morgan Kaufmann Series
in Data Management Systems, 2 ed., 2006.[3] W. R. Smart, Genetic Programming for Multiclass Object Classification. PhD thesis, School of
Mathematical, Statistical and Computing Sciences, Victoria University of Wellington, 2005.
[4] T. Mitchell, Machine Learning. New York: McGraw-Hill, 2 ed., 1997.
[5] M. Vijaya, K. Jamuna, and S. Karpagavalli, "Password strength prediction using supervised machine
learning techniques," in International Conference on Advances in Computing, Control, and
Telecommunication Technologies, pp. 401-405, 2009.
[6] L. Rokach and O. Maimon, "Top-down induction of decision trees classifiers-a survey," IEEE transactions
on Systems, Man, and Cybernetics, Part C, November 2005.
[7] R. Kotecha, V. Ukani, and S. Garg, "An empirical analysis of multiclass classification techniques in data
mining," in Nirma University International Conference on Engineering (NUiCONE), (Ahmedabad),
December 2011.
[8] M. Aly, "Survey on multiclass classification methods," Citeseer, Nov 2005.
[9] I. Rish, "An empirical study of the naive bayes classifier," in IJCAI Workshop on Empirical Methods in
Artificial Intelligence, (Italy), pp. 41-46, 2011.
[10] A. alias Balamurugan, R. Rajaram, S. Pramala, S. Rajalakshmi, C. Jeyen-dran, and J. D. S. Prakash, "Nb+:An improved naive bayesian algorithm," Knowledge-Based Systems, vol. 24, pp. 563-569, October 2011.
[11] T. Phyu, "Survey of classification techniques in data mining," Proceedings of the International
MultiConference of Engineers and Computer Scientists, 2009.
[12] N. Suguna and K. Thanushkodi, "An improved k-nearest neighbor classification using genetic algorithm,"
International Journal of Computer Science Issues, vol. 7, July 2010.
[13] P. G. Espejo, S. Ventura, and F. Herrera, "A survey on the application of genetic programming to
classification," IEEE transactions on systems, man, and cybernetics-part C: applications and Reviews, vol.
40, pp. 121-144, March 2010.
-
7/30/2019 REAL TIME MOTIF CLASSIFICATION FROM DATABASE USING INTELLIGENT ALGORITHMS
16/16
International Journal of Artificial Intelligence & Applications (IJAIA), Vol.3, No.6, November 2012
110
[14] J. R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT
Press, 1992.
[15] T. Loveard, Genetic Programming for Classification Learning Problems. PhD thesis, Royal Melbourne
Institute of Technology, 2003.
[16] R. Poli, W. B. Langdon, and N. F. McPhee, A Field Guide to Genetic Programming. PhD thesis.
[17] R. Feldt and P. Nordin, "Using factorial experiments to evaluate the effect of genetic programming
parameters," in Proceedings of European Conference on Genetic Programming, vol. 1802 of LNCS, pp.271-282, 2000.
[18] H. Jabeen and A. R. Baig, "Review of classification using genetic programming," in International Journal
of Engineering Science and Technology, vol. 2, pp. 94-103, 2010.
[19] W. Banzhaf, P. Nordin, R. E. Keller, and F. D. Francone, Genetic Programming: An Introduction; On the
Automatic Evolution of Computer Programs and its Applications. San Mateo, CA/Heidelberg, (Germany):
Morgan Kaufmann, 1998.
[20] S. Oka and Q. Zhao, "Design of decision trees through integration of c4.5 and gp," in Proceedings of 4th
Japan-Australia Joint Workshop Intelligent and Evolutionary Systems, pp. 128-135, 2000.
[21] T. Enodu and Q. Zhao, "Generation of comprehensible decision trees through evolution of training data,"
Proceedings of Congress on Evolutionary Computation, vol. 2, pp. 1221-1225, 2002.
[22] M. H. Saraee and R. S. Sadjady, "Optimizing classification techniques using genetic programming
approach," in Proceedings of the 12th IEEE International Multitopic Conference, (Advanced Databases,
Data Mining and Bioinformatics Research Laboratory), pp. 345-348, December 2008.
[23] P. Kumar and Saroj, "Classification models: Non evolutionary vs evolutionary approach," in International
Conference on Advances in Computing, Control, and Telecommunication Technologies, pp. 341-343,2009.
[24] N. S. Chaudhari, A. Purohit, and A. Tiwari, "A multiclass classifier using genetic programming," in 10th
International Conference on Control, Automation, Robotics and Vision, (Hanoi, Vietnam), pp. 1884-1887,
December 2008.
[25] H. Zhao, "A multi-objective genetic programming approach to developing pareto optimal decision trees,"
Decision Support Systems, vol. 43, pp. 809-826, April 2007.
[26] M. Riekert, K. M. Malan, and A. Engelbrect, "Adaptive genetic programming for dynamic classification
problems," Congress on Evolutionary Computation, pp. 674-681, 2009.
[27] R. A. Davis, A. J. Charlton, S. Oehlschlager, and J. C. Wilson, "Novel feature selection method for
genetic programming using metabolomic h nmr data," Chemometrics Intelligent Laboratory Systems, vol.
81, pp. 50-59, March 2006.
[28] V. Slavov and N. I. Nokolaev, "Fitness landscapes and inductive genetic programming," in Proceedings
of 3rd International Conference Artificial Neural Networks and Genetic Algorithms, (Berlin, Germany), pp.
414-418, Springer-Verlag, 1997.
[29] S. Rouwhorst and A. Engelbrecht, "Searching the forest: Using decision trees as building blocks forevolutionary search in classification databases," In Proceedings of the Congress on Evolutionary
Computation, pp. 633-638, 2000.
[30] M. C. J. Bot and W. B. Langdon, "Application of genetic programming to induction of linear
classification trees," in Proceedings of the Third European Conference on Genetic Programming, vol.
1802, (Berlin, Germany), pp. 247-258, Springer-Verlag, April 2000. Lecture Notes in Computer Science
Series.
[32] C. S. Kuo, T. P. Hong, and C. L. Chen, "Applying genetic programming technique in classification trees,"
Soft Computing - A Fusion of Foundations, Methodologies and Applications, vol. 11, pp. 1165-1172,
October 2007.
Authors
Prof Hiteshkumar Nimbark working as Associate Professor and Head, at B H Gardi
college of engineering and technology, Rajkot, Gujarat, India. He is having more than
12 years of experience in academic field after completion of his M.Tech degree from
Nirma University, India. His area of interest is Artificial Intelligence, ComputerNetwork and Operating system. He is pursuing PhD in the field of Artificial
Intelligence. He has published 9 International papers from which 2 are in IEEE
explore and 3 are available in Springer.