A DYNAMIC AND SCALABLE EVOLUTIONARY DATA MINING & KDD
FOR DISTRIBUTED ENVIRONMENT
A Thesis submitted to Gujarat Technological University for the Award of
Doctor of Philosophy
in
Computer/IT Engineering
by
Dineshkumar Bhagwandas Vaghela
Enrollment No. 119997107013
Under supervision of
Dr. Priyanka Sharma
Head, I.T. Department, Raksha Shakti University,
Meghaninagar
GUJARAT TECHNOLOGICAL UNIVERSITY - AHMEDABAD
January-2017
© Dineshkumar Bhagwandas Vaghela
Declaration

I declare that the thesis entitled "A Dynamic And Scalable Evolutionary
Data Mining & KDD In Distributed Environment" submitted by me for
the degree of Doctor of Philosophy is the record of research work carried out by
me during the period from July 2011 to March 2016 under the supervision of
Dr. Priyanka Sharma (Prof. and Head of IT, Raksha Shakti University),
and that this has not formed the basis for the award of any degree, diploma,
associateship, fellowship, or title in this or any other university or other
institution of higher learning.
I further declare that the material obtained from other sources has been duly
acknowledged in the thesis. I shall be solely responsible for any plagiarism or
other irregularities, if noticed in the thesis.
Signature of Research Scholar:
Name of Research Scholar: Dineshkumar Bhagwandas Vaghela
Place: Ahmedabad
Date:
Certificate

I certify that the work incorporated in the thesis "A Dynamic And Scalable
Evolutionary Data Mining & KDD In Distributed Environment" submitted
by Mr. Dineshkumar Bhagwandas Vaghela was carried out by the
candidate under my supervision/guidance. To the best of my knowledge: (i) the
candidate has not submitted the same research work to any other institution
for any degree/diploma, Associateship, Fellowship or other similar titles; (ii)
the thesis submitted is a record of original research work done by the Research
Scholar during the period of study under my supervision, and (iii) the thesis
represents independent research work on the part of the Research Scholar.
Signature of Supervisor:
Name of Supervisor: Dr. Priyanka Sharma
Place: Ahmedabad
Date:
Originality Report Certificate

It is certified that the PhD Thesis titled "A Dynamic And Scalable Evolutionary
Data Mining & KDD In Distributed Environment" submitted by
Mr. Dineshkumar Bhagwandas Vaghela has been examined by me. I undertake
the following:
1. The thesis contains significant new work/knowledge as compared to work
already published or under consideration for publication elsewhere. No sentence,
equation, diagram, table, paragraph or section has been copied verbatim
from previous work unless it is placed under quotation marks and duly
referenced.
2. The work presented is the original work of the author (i.e. there is
no plagiarism). No ideas, processes, results or words of others have been
presented as the author's own work.
3. There is no fabrication of data or results which have been compiled /
analyzed.
4. There is no falsification by manipulating research materials, equipment or
processes, or changing or omitting data or results such that the research
is not accurately represented in the research record.
5. The thesis has been checked using https://turnitin.com (copy of orig-
inality report attached) and found within limits as per GTU Plagiarism
Policy and instructions issued from time to time (i.e. permitted similarity
index <= 25%).
Signature of Research Scholar: Date:
Name of Research Scholar: Dineshkumar Bhagwandas Vaghela
Place: Ahmedabad
Signature of Supervisor: Date:
Name of Supervisor: Dr. Priyanka Sharma
Place: Ahmedabad
PhD THESIS Non-Exclusive License to GUJARAT
TECHNOLOGICAL UNIVERSITY
In consideration of being a PhD Research Scholar at GTU and in the interests of
the facilitation of research at GTU and elsewhere, I, Dineshkumar Bhagwandas
Vaghela, having Enrollment No. 119997107013, hereby grant a non-exclusive,
royalty free and perpetual license to GTU on the following terms:
1. GTU is permitted to archive, reproduce and distribute my thesis, in whole
or in part, and/or my abstract, in whole or in part (referred to collectively
as the Work) anywhere in the world, for non-commercial purposes, in all
forms of media;
2. GTU is permitted to authorize, sub-lease, sub-contract or procure any of
the acts mentioned in paragraph (1);
3. GTU is authorized to submit the Work at any National / International
Library, under the authority of their Thesis Non-Exclusive License;
4. The Universal Copyright Notice © shall appear on all copies made under
the authority of this license;
5. I undertake to submit my thesis, through my University, to any Library
and Archives. Any abstract submitted with the thesis will be considered
to form part of the thesis.
6. I represent that my thesis is my original work, does not infringe any rights
of others, including privacy rights, and that I have the right to make the
grant conferred by this non-exclusive license.
7. If third party copyrighted material was included in my thesis for which,
under the terms of the Copyright Act, written permission from the copy-
right owners is required, I have obtained such permission from the copy-
right owners to do the acts mentioned in paragraph (1) above for the full
term of copyright protection.
8. I retain copyright ownership and moral rights in my thesis, and may deal
with the copyright in my thesis, in any way consistent with rights granted
by me to my University in this non-exclusive license.
9. I further promise to inform any person to whom I may hereafter assign
or license my copyright in my thesis of the rights granted by me to my
University in this non-exclusive license.
10. I am aware of and agree to accept the conditions and regulations of PhD
including all policy matters related to authorship and plagiarism.
Signature of Research Scholar: Date:
Name of Research Scholar: Dineshkumar Bhagwandas Vaghela
Place: Ahmedabad
Signature of Supervisor: Date:
Name of Supervisor: Dr. Priyanka Sharma
Place: Ahmedabad
Thesis Approval Form

The viva-voce of the PhD Thesis submitted by Mr. Dineshkumar Bhagwandas
Vaghela (Enrollment No. 119997107013) entitled "A Dynamic And Scalable
Evolutionary Data Mining & KDD In Distributed Environment" was
conducted on Date: , at Gujarat Technological University.
(Please tick any one of the following option)
The performance of the candidate was satisfactory. We recommend that
he be awarded the PhD degree.
Any further modifications in the research work recommended by the panel
after 3 months from the date of the first viva-voce, upon request of the
Supervisor or request of the Independent Research Scholar, after which the
viva-voce can be re-conducted by the same panel again.
The performance of the candidate was unsatisfactory. We recommend
that he should not be awarded the PhD degree.
Name & Sign. of Supervisor with Seal External Examiner-1 Name & Sign.
External Examiner-2 Name & Sign. External Examiner-3 Name & Sign.
Abbreviations

ML Machine Learning
PUWP Parul University Web Portal
DT Decision Tree
SVM Support Vector Machine
DM Data Mining
DDM Distributed Data Mining
KDD Knowledge Discovery and Data Mining
ARM Association Rule Mining
DB DataBase
XML eXtensible Markup Language
OLAP Online Analytical Processing
DWDM Data Warehousing and Data Mining
RM Research Methodology
CHAID CHi-squared Automatic Interaction Detector
AID Automatic Interaction Detector
THAID Theta Automatic Interaction Detector
CART Classification and regression tree
HODA Hierarchical Optimal Discriminant Analysis
ID3 Iterative Dichotomiser 3
UCI University of California, Irvine
TP True Positive
TN True Negative
FP False Positive
FN False Negative
IBL Instance Based Learning
k-NN k- Nearest Neighbor
CD Concept Description
SPDT Streaming Parallel Decision Tree
SLIQ Supervised Learning In Quest
SPRINT Scalable Parallelizable Induction of decision tree
MDL Minimum Description Length
Abstract

Many fields, such as biology, education, environmental research, sensor
networks, the stock market and weather forecasting, produce very large
volumes of data. These data are produced with a high degree of velocity and
variety due to the immense and dynamic growth of these fields. The vast
use of the internet in distributed environments has generated an urgent need
for new techniques and tools that can intelligently and automatically
transform the processed data into useful information and knowledge. This is
why data mining has become a research area of increasing importance for
analyzing such large volumes of data efficiently and effectively. As more and
more data are continuously collected at this velocity and scale, formalizing
the process of big data analysis becomes paramount.
Large volumes of data that are geographically spread across the globe tend
to generate a very large number of models. The heterogeneous nature of the
data, the resultant models and the techniques raises the problem of how to
generalize knowledge in order to obtain a global view of the phenomena
across the entire organization.
Many data mining techniques have been introduced for different analytical
processes such as clustering, frequent pattern mining, classification, rare
item set finding and many more. Among these, classification and prediction
are two forms of data analysis that can be used to extract models describing
important data classes or to predict future data trends. Data analysis in
these ways can provide a better understanding of the data at large. Usually,
classification predicts categorical (discrete, unordered) labels, while
prediction models predict continuous-valued functions. Many classification
and prediction methods have been proposed by researchers in machine
learning, pattern recognition and statistics. As per the state of the art and
the literature survey, most of these algorithms are memory resident,
typically assume a small data size, are static in nature, and are neither
scalable nor domain-free. Recent data mining research has built on such
work, developing scalable classification and prediction techniques capable
of handling large disk-resident data. The application of the classical
knowledge discovery process in distributed environments requires the
collection of distributed data in a data warehouse for central processing.
However, this is usually either ineffective or infeasible for several
reasons such as (1) storage cost, (2) communication cost, (3) computational
cost and (4) private and sensitive data. From the literature review, the most
pressing issues for prediction in a distributed environment are RAM size due
to the very large volume of the data set, scalability with the size of the data
set, and the dynamic nature of learning.
In decision tree learning, the training data set is used for learning, i.e.
generating the decision tree model. In a distributed environment, the data at
different sites are not correlated with other sites' data, and hence a local-site
decision tree model is not sufficient to produce a global view for prediction.
There are two approaches so far: in the first, all data sets are collected at one
location where the data mining operation is performed (i.e. central-site
processing); in the second, intermediate messages are passed among the
sites involved in training the model. In the latter approach, the participating
sites have to communicate with each other by passing their intermediate
trained models to generate the global model. The main limitation of this
approach is the overhead of message passing. Since the two approaches
stated above are neither effective nor efficient, they are the motivation for
this research. The objectives of the research are: 1) to minimize the training
time and communication overhead, and 2) to preserve the prediction quality.
In this research, an effective and efficient approach has been proposed that
generates a global decision tree in a distributed environment to extract the
global knowledge.
This research has been carried out on a real data set of Parul University for
predicting student admission in different fields/branches of different
colleges. In the first phase, these data were collected from the Parul
University Web Portal; more than 1,00,000 records in total were used for
training. As the data were collected from the university portal, they had to
be pre-processed to remove noise, outliers and missing values. The data are
stored in .csv file format.
In the second phase of this research, the J48 algorithm (complexity O(mn²))
generates the decision tree at each local site. In the third phase, the decision
tree at each site is converted into decision rules using the proposed parser.
These decision rules are then converted into decision tables. In the fourth
phase, to reduce the transmission cost, the decision tables are converted into
XML files and sent to the coordinator site. In the fifth phase, the global tree
model is generated at the coordinator site by consolidating the decision
tables formed from the XML files.
In the sixth phase, the data set has been equally partitioned into subsets
equal in number to the number of sites. The experiments have been
performed on 10k, 20k, 50k and 100k records (here k means thousand) at 2, 5
and 10 sites. The local training models have been generated and merged
using the proposed approach. The accuracy of these global models has been
checked on test data sets; the accuracy in classifying the test data set exceeds
98.03%. The results of the baseline comparison clearly show that accuracy,
training time, communication overhead and other parameters have been
optimized. The student admission data sets for the years 2013-14 and
2014-15 have been used to train the model, and this model has been applied
to the student admission data set for the year 2015-16, giving more than
98.03% accuracy for prediction. These experimental results have also been
verified using 10-fold cross validation.
Dedicated
To
My Parents (Surajben and Bhagwanbhai),
My Wife (Leela)
And
My Children (Late Mittal, Devendra and Mamta)
Acknowledgement

First of all, my deepest gratitude goes to my supervisor, Dr. Priyanka
Sharma, Professor and Head of the IT Department, Raksha Shakti
University, for her consistent support, supervision, guidance and inspiration
during my doctoral programme. Her invaluable suggestions and constructive
criticisms from time to time enabled me to complete my work successfully.
The completion of this work would not have been possible without the
Doctorate Progress Committee (DPC) members: Dr. Sanjaykumar Vij, Dean
(CSE, MCA and MBA), ITM Universe, and Dr. Bankim Patel, Director of
SRIMCA, Uka Tarsadia University. I am really thankful for their rigorous
examinations and precious suggestions during my research.
My gratitude also goes to Dr. Akshai Aggarwal, Ex-Vice Chancellor, Dr.
Rajul Gajjar, Dean, PhD Programme, Dr. N. M. Bhatt, Dean, PhD
Programme, Mr. J. C. Lilani, I/C Registrar, Ms. Mona Chaurasiya, Research
Coordinator, Mr. Dhaval Gohil, Data Entry Operator, and the other staff
members of the PhD Section, GTU, for their assistance and support.
Most importantly, none of this would have been possible without the love
and patience of my wife and family members. My wife, to whom this
dissertation is dedicated, has been a constant source of love, concern,
support and strength all these years. My family members have aided and
encouraged me throughout this endeavor. I would like to express my
heartfelt gratitude to all of them. Finally, I must give a special mention to the
direct and indirect support given by my colleagues.
I would like to address special thanks to the reviewers of my thesis for
accepting to read and review this thesis and approving it. I would like to
thank all the researchers whose works I have used, initially in understanding
my field of research and later for updates. I would also like to thank the
many people who have taught me, starting with my school teachers, my
undergraduate teachers and my postgraduate teachers.
Dineshkumar B. Vaghela
List of Figures
2.1 Induction: Model Construction . . . 18
2.2 Deduction of test data using the model/classifier . . . 19
2.3 Posterior Probability of Naive Bayes . . . 20
2.4 Classification based on linear SVM . . . 23
2.5 Classification based on Hard SVM . . . 23
2.6 Nonlinear Classification . . . 25
2.7 Distance functions equations . . . 27
2.8 Hamming Distance . . . 27
2.9 Decision tree based classification for car subscription . . . 32
3.1 Decision tree . . . 47
4.1 Proposed Framework . . . 66
4.2 Local site Processing . . . 68
4.3 Coordinator site Processing . . . 70
4.4 Proposed system architecture for dynamic and scalable decision tree generation . . . 71
4.5 Decision table merging process to generate the global decision tree . . . 72
5.1 Data set at site S1 . . . 80
5.2 Decision Tree generated at local site S1 . . . 81
5.3 Detailed accuracy and the confusion matrix . . . 82
5.4 Decision table generated at local site S1 . . . 84
5.5 The XML file at local site S1 . . . 84
5.6 Dynamic and Scalable decision tree generation . . . 85
5.7 The Decision table merging process to generate the global decision tree . . . 86
6.1 Forms of Data Preprocessing . . . 93
6.2 Site Selection . . . 97
6.3 Run J48 algorithm to each site . . . 97
6.4 Load/Save the training model . . . 98
6.5 Decision Tree and Decision Table at each site . . . 98
6.6 Combined Decision Tree and Decision Table . . . 99
6.7 Branch wise decision rules . . . 99
7.1 Apache Hadoop Architecture . . . 102
7.2 Architecture and job execution flow in Hadoop Map Reduce version 1.x (MRv1) . . . 104
7.3 Overview of Map-Reduce Model . . . 106
7.4 Apache Hadoop Installation . . . 109
7.5 Hadoop MapReduce Administration . . . 109
7.6 Two Node cluster . . . 111
7.7 Cluster Summary . . . 111
7.8 Contents of Directory . . . 112
7.9 Directory Log . . . 112
8.1 Performance measures . . . 116
8.2 Recall and Training Time (Sec) . . . 117
8.3 Performance comparison . . . 118
8.4 Comparison for total time on 2 sites . . . 119
8.5 Comparison for total time on 5 sites . . . 120
8.6 Comparison for total time on 10 sites . . . 120
8.7 Communications Overhead in Centralized Approach . . . 121
8.8 Communications Overhead in Intermediate Message Passing Approach . . . 121
8.9 Communications Overhead in Proposed Approach . . . 122
List of Tables
2.1 Advantages of different classification algorithms . . . 40
2.2 Feature comparisons . . . 41
2.3 Comparison of Classification Algorithms . . . 42
3.1 Performance based comparisons of different Decision tree algorithms . . . 51
3.2 Comparisons between different Decision Tree Algorithms . . . 51
3.3 Comparison of Merits and Demerits of Decision Tree Algorithms . . . 52
3.4 Merge Models with combination of rules: Examples . . . 55
5.1 Overall distributions of Site S2 instances with respect to attributes . . . 78
5.2 Detail distribution of Site S2 instances with respect to attribute values . . . 78
6.1 ZOO DataSet . . . 91
6.2 Student admission data set collected from Parul University Web Portal . . . 91
6.3 Student performance data set collected from Departments of PIT College . . . 92
8.1 Confusion Matrix for Site1 . . . 115
8.2 Confusion Matrix for Site2 . . . 115
8.3 Confusion Matrix for combined data set at coordinator site . . . 115
8.4 Comparative performance at site1, site2 and coordinator site . . . 116
8.5 Performance statistics for three different data sets . . . 117
8.6 Statistics for total time on 2 sites . . . 118
8.7 Statistics for total time on 5 sites . . . 119
8.8 Statistics for total time on 10 sites . . . 119
8.9 Total time for 2 sites . . . 122
8.10 Total time for 5 sites . . . 122
8.11 Total time for 10 sites . . . 123
Contents
1 Introduction . . . 7
1.1 Introduction . . . 7
1.2 Background History . . . 9
1.3 Motivation . . . 11
1.4 Contribution of the research . . . 12
1.5 Organization of thesis . . . 13
1.6 References . . . 14
2 Classification Techniques . . . 17
2.1 Introduction . . . 17
2.2 Classification techniques . . . 18
2.2.1 Naïve Bayesian . . . 19
2.2.2 Support Vector Machines (SVM) . . . 21
2.2.3 K-Nearest Neighbor Classifier . . . 25
2.2.4 Instance Based Learning (IBL) . . . 28
2.2.5 Rule Based Classification . . . 30
2.2.6 Neural Networks . . . 30
2.2.7 Decision Tree . . . 31
2.3 Attribute Splitting Measures . . . 34
2.4 Decision Tree Classification . . . 36
2.4.1 Tree Building Phase . . . 37
2.4.2 Tree Pruning Phase . . . 39
2.5 Comparison of classification techniques . . . 40
2.6 Summary . . . 43
2.7 References . . . 43
3 Literature Survey On Decision Tree . . . 46
3.1 Introduction . . . 46
3.2 First Phase Study . . . 47
3.3 Second Phase Study . . . 50
3.4 Third Phase Study . . . 55
3.5 Challenges with DT merging . . . 57
3.6 Summary . . . 58
3.7 References . . . 58
4 Proposed Approach . . . 62
4.1 Introduction . . . 62
4.2 Problem Statement . . . 62
4.3 Objective and Scope of Research . . . 63
4.4 Original Contribution by thesis . . . 64
4.5 Proposed Architecture . . . 65
4.6 Proposed Algorithm . . . 66
4.6.1 Algorithm steps at local site . . . 66
4.6.2 Algorithm steps at coordinator site . . . 69
4.7 System architecture at local site . . . 71
4.8 System architecture at coordinator site . . . 71
4.9 Summary . . . 74
4.10 References . . . 74
5 Working of the Proposed Model . . . 77
5.1 Introduction . . . 77
5.2 Local site algorithm computation . . . 77
5.2.1 Building the decision tree . . . 80
5.2.2 Rule Generation . . . 81
5.2.3 Decision Table . . . 83
5.2.4 XML File Generation . . . 83
5.3 Coordinator site algorithm computation . . . 85
5.4 Summary . . . 87
5.5 References . . . 87
6 Data Collection, Preprocessing and Implementation . . . 89
6.1 Introduction . . . 89
6.2 Data Collection . . . 90
6.2.1 Zoo Data set . . . 90
6.2.2 Student Admission Data Set . . . 90
6.2.3 Student Performance Data Set . . . 92
6.3 Data Pre-Processing . . . 92
6.4 Test Data Set . . . 96
6.5 Implementation . . . 96
6.6 Summary . . . 100
7 Implementation with Apache Hadoop . . . 101
7.1 Introduction to Apache Hadoop . . . 101
7.2 Hadoop Map-Reduce . . . 101
7.2.1 Map . . . 102
7.2.2 Reduce . . . 103
7.3 HDFS . . . 103
7.4 Decision Tree Map-Reduce . . . 105
7.4.1 Data Preparation . . . 105
7.4.2 Selection . . . 106
7.4.3 Update . . . 107
7.4.4 Tree Growing . . . 107
7.5 Apache Hadoop Cluster . . . 108
7.6 Conclusion . . . 113
7.7 References . . . 113
8 Results, Conclusions and Future Enhancements . . . 114
8.1 Introduction . . . 114
8.2 Results . . . 114
8.3 Conclusions . . . 123
8.4 Future Work . . . 125
8.5 Summary . . . 125
Chapter 1
Introduction
1.1 Introduction
Different fields like education, sensor networks, the Internet of Things (IoT),
biology, the stock market, weather forecasting and many more generate data
at a rapid speed, of different variety and of large volume. The rapid growth
in the volume of data is due to the enormous use of the internet in
distributed environments [1], and because of this there is a pressing need for
new approaches and techniques which can easily and automatically convert
the already processed data into valuable, decision-supporting information or
knowledge. This can only be achieved by properly processing and analyzing
the data with appropriate techniques. All these techniques are part of data
mining, which is the main reason why data mining has taken on significant
importance for data analytics. As more and more data are continuously
collected at this velocity and scale, formalizing the process of big data
analysis becomes paramount. Large volumes of data that are geographically
spread across the globe tend to generate a very large number of models. The
heterogeneous nature of the data, the resultant models and the techniques
raises issues in generalizing the knowledge for a global view of the
phenomena across the entire organization.
Many data mining techniques have been introduced for different analytical
processes such as clustering, frequent pattern mining, classification, rare
item set finding and many more. Among these, two techniques, namely
classification and prediction, are used to extract models that describe the
class labels of data, and these models can also be used to predict
the future data trends. Data analysis in these ways can provide a better
understanding of large data sets. Usually, classification predicts categorical
(discrete, unordered) labels or classes, while prediction models predict
continuous-valued functions. In machine learning, pattern recognition and
statistics, researchers have proposed many methods for both classification
and prediction. As per the state of the art and the literature survey, most of
these algorithms are in-memory, generally work on data of small size, are
not dynamic in nature, not scalable and not domain-free. From the state of
the art, it can be concluded that at present much research is being carried out
to support scalable classification and prediction capable of handling large
disk-resident data. The application of the classical knowledge discovery
process in distributed environments requires the collection of distributed
data in a data warehouse for central processing. However, this is usually
either ineffective or infeasible for several reasons such as (1) storage cost, (2)
communication cost, (3) computational cost and (4) private and sensitive
data. From the literature review, the most pressing issues for prediction in a
distributed environment are RAM size due to the very large volume of the
data set, scalability with the size of the data set and the dynamic nature of
learning.
Classification is mainly used to analyse a given training data set: it takes
each instance and assigns it to a particular class such that the classification
error is minimised. It extracts models that accurately describe the important
data classes within the given data set. Classification is a two-step process.
In the first step, a model is created by applying a classification algorithm to
the training data set; in the second step, the extracted model is tested against
a predefined test data set to measure the trained model's performance and
accuracy. Classification is thus the process of assigning class labels to data
whose class labels are unknown.
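The two-step process above can be sketched with a deliberately tiny, hypothetical model (a single-threshold "decision stump" on one numeric feature, not any algorithm from this thesis): step 1 builds the model on the training set, step 2 measures its accuracy on a held-out test set.

```python
# Step 1 (model construction) and step 2 (model evaluation) with a toy
# one-feature threshold classifier; data and labels are illustrative only.

def train_stump(rows):
    """Step 1: learn the threshold on the single numeric feature that
    classifies the most training instances correctly."""
    best = None
    for t in sorted({x for x, _ in rows}):
        # predict 'yes' when feature >= t, 'no' otherwise
        correct = sum((('yes' if x >= t else 'no') == y) for x, y in rows)
        if best is None or correct > best[1]:
            best = (t, correct)
    return best[0]

def predict(threshold, x):
    return 'yes' if x >= threshold else 'no'

# Step 1: build the model from the training set
train = [(1, 'no'), (2, 'no'), (3, 'no'), (7, 'yes'), (8, 'yes'), (9, 'yes')]
t = train_stump(train)

# Step 2: test the trained model against a predefined test set
test = [(2, 'no'), (8, 'yes'), (6, 'yes')]
accuracy = sum(predict(t, x) == y for x, y in test) / len(test)
```

The same protocol applies unchanged when the stump is replaced by a full decision tree learner.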
The training data set is used for learning, to generate the decision tree
model. In a distributed environment, the local decision tree model generated at
each site is not sufficient to provide a global view for prediction, because
the local training data sets at geographically spread sites are not correlated
with one another. To generate a global decision tree, one approach is to
collect all the data sets at one location and then perform the data mining
operation there. The other approach is intermediate message passing among the
sites involved in training the model. These participating sites have to
communicate with each other by passing their intermediate trained models to
generate the global model. This leads to multiple message exchanges, which
cause communication overhead. Both approaches are therefore neither effective
nor efficient, and this is the motivation for this research. The objectives of
the research are: 1) to minimise the training time and communication overhead,
and 2) to preserve the prediction quality. This research proposes an effective
and efficient approach to global decision tree generation in a distributed
environment for extracting the global knowledge.
In this chapter, section 1.2 presents the background history of data mining
(decision tree based classification) in distributed environments, section 1.3
discusses the shortcomings that motivate this research, section 1.4 briefly
describes the contribution of the research, and section 1.5 introduces the
organisation of the thesis.
1.2 Background History
Analytical tools are used in data mining to discover unknown and useful
patterns and the relationships among them in large volumes of data, and to
predict future patterns/classes by training a model on the available data set.
For this, data mining tools apply machine learning techniques, a relevant
statistical model, or a mathematically grounded algorithm. Beyond this core
functionality, data mining also encompasses collecting data from various
sources, pre-processing it, and managing it for proper processing. Data mining
involves clustering, classification, regression, frequent pattern generation
and many other analysis and processing facilities. A wider range of data can be
processed by classification than by either regression or correlation, which is
the main reason the popularity of classification keeps growing.
Data mining is an important, significant and accurate machine learning [2]
application. It allows very large volumes of day-to-day data to be processed
effectively into useful analyses, which can further help prediction for
decision making. Mistakes easily occur during the analysis of large volumes of
data, especially when looking for correlations among the different features of
the data sets, and such mistakes can make it difficult to find solutions and
take decisions. These problems can be resolved by machine learning, which
improves the efficiency of the systems.
Classification techniques can process and analyse a wide range of data for
decision making. Numerous techniques are available, such as Neural Networks,
Naïve Bayesian, Support Vector Machines (SVM), the K-Nearest Neighbor
Classifier (kNN), Instance Based Learning (IBL), Rule Based Classification and
Decision Trees. Among all these techniques, the decision tree is the most
effective and easy to use, for the following two reasons:
1. Decision trees are powerful and popular tools for classification and pre-
diction [2].
2. Decision trees represent rules, which can be understood by humans and
used in knowledge systems such as databases [2].
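The second point can be made concrete by flattening a small hand-written (hypothetical) decision tree into IF-THEN rules; every root-to-leaf path becomes one human-readable rule:

```python
# A toy decision tree as a nested structure: internal nodes name the
# attribute to test, leaves carry a class label. Attribute names and
# values are illustrative, not from the thesis's data sets.
tree = ('outlook',
        {'sunny': ('humidity', {'high': 'no', 'normal': 'yes'}),
         'overcast': 'yes',
         'rainy': ('wind', {'strong': 'no', 'weak': 'yes'})})

def rules(node, path=()):
    """Walk the tree; each root-to-leaf path is one classification rule."""
    if isinstance(node, str):                       # leaf: emit the rule
        cond = ' AND '.join(f'{a}={v}' for a, v in path)
        return [f'IF {cond} THEN class={node}']
    attr, branches = node
    out = []
    for value, child in branches.items():
        out += rules(child, path + ((attr, value),))
    return out

for r in rules(tree):
    print(r)
```

This readability is exactly what makes decision trees attractive for rule-based merging in later chapters.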
In classification with decision trees [2][3], the learner uses the training
data set for learning and generates the decision tree model. In a distributed
environment, i.e. when the data are geographically spread across different
sites, special attention is needed to generate a decision tree at each site and
to combine these trees into a global view, i.e. a global model. The local
decision tree model generated at each site is not sufficient to provide the
global view for prediction, because the local training data sets at
geographically spread sites are not correlated with one another. To generate
the global decision tree, one approach is to collect all the data sets at one
location and then perform the data mining operation there. The other approach
is intermediate message passing [1] among the sites involved in training the
model. These participating sites have to communicate with each other by passing
their intermediate trained models to generate the global model. This leads to
multiple message exchanges, which cause communication overhead. Both approaches
are therefore neither effective nor efficient, and this is the motivation for
this research. The objectives of the research are: 1) to minimise the training
time and communication overhead, and 2) to preserve the prediction quality.
This research proposes an effective and efficient approach to global decision
tree generation in a distributed environment for extracting the global
knowledge.
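Why a purely local model can mislead may be illustrated with a toy sketch; the "sites", labels and stand-in majority-class models below are invented for illustration and are not the thesis's decision trees:

```python
# Each site sees a biased shard of the overall data, so its locally
# learned majority class can disagree with what pooling all the data
# would teach: the local models alone give no reliable global view.
from collections import Counter

site1 = ['spam'] * 8 + ['ham'] * 2      # hypothetical shard at site 1
site2 = ['ham'] * 9 + ['spam'] * 1      # hypothetical shard at site 2
site3 = ['ham'] * 7 + ['spam'] * 3      # hypothetical shard at site 3

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

local = [majority(s) for s in (site1, site2, site3)]   # site 1 learns 'spam'
global_model = majority(site1 + site2 + site3)         # pooled data favour 'ham'
```

Pooling the shards here is what the centralized approach does, at the cost of shipping all the data; the thesis instead targets a merge of the local models that recovers the global view.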
1.3 Motivation
In distributed environments, a series of challenges has emerged in the field of
data mining, triggered by different real-life applications. This thesis is
concerned with dynamic and scalable classification and prediction tasks for
distributed environments. The general context is therefore the classification
of (potentially large volumes of) data distributed across different
geographical sites. The motivation behind the main objectives of the thesis is
presented below.
The first issue tackled is that gathering all of the data in a centralized
location is neither feasible nor desirable, because it may require high
internet bandwidth and large storage space. For such application domains, it is
advisable to develop systems that acquire knowledge and perform the analysis at
the local sites where the data and other computing resources reside, and then
transmit the results/models to the sites that need them. However, sharing the
data of autonomous organizations also raises data privacy and security
concerns. In such situations, knowledge acquisition techniques should be
developed that can learn from statistical summaries, which can be supplied
whenever required.
The second issue in machine learning and data mining is the development of
dynamic, adaptive and inductive learning techniques that scale up to large and
possibly physically distributed data sets. Many organizations seeking added
value from their data are already dealing with overwhelming amounts of
information. The number and size of their databases and data warehouses grow at
rapid rates, faster than the corresponding improvements in machine resources
[2] and inductive learning techniques. Most current-generation learning
algorithms are computationally complex and require all data to be resident in
main memory, which is clearly untenable for many realistic problems and
databases.
The third issue is to reduce the communication overhead and the processing time
of merging decision trees without losing the predictive quality of the model.
The decision tree generated at each site has to be sent to the coordinator
site, and the size of these models causes communication overhead. At the
coordinator site there is no efficient merging algorithm that generates the
global model without losing prediction quality.
Overall, researchers have made significant contributions by proposing
algorithms for classification (here, decision trees) and prediction, and they
have also proposed different approaches for merging local decision trees. From
the state of the art it has been observed that many of these algorithms limit
their own performance: they do not cope well with small RAM (i.e. they are
memory resident), mainly work on small data sizes, are not domain-free [4], are
static in nature [4], and are inefficient in terms of processing and
communication overhead. No prior research has focused on a scalable and dynamic
classification and prediction process for data mining in distributed
environments, although considerable work is in progress: researchers are
currently developing scalable and dynamic classification and prediction
techniques able to handle large data sets in distributed environments.
The main objectives of the thesis are listed below:
1. To reduce the model (i.e. decision tree) training time and communication
time in a distributed environment with large volumes of data.
2. To introduce an efficient, scalable and dynamic approach for handling newly
generated data sets together with an already trained model.
3. To define rule merging policies for generating the global model.
4. To generate a globally interpretable model while preserving the prediction
quality.
1.4 Contribution of the research
This thesis provides major contributions in the field of dynamic and scalable
data mining in distributed environments with decision tree based
classification, as discussed in the objectives above. The outcomes of the
proposed model show that the objectives of the research work have been
achieved. The proposed model can handle large volumes of data. The decision
trees are merged with minimal network overhead compared to other approaches,
and the global model preserves the prediction quality. The experimental results
obtained with different approaches on distributed data sets, in terms of
accuracy, error rate, specificity, sensitivity, precision, recall and training
time, are better than those of the existing systems/approaches. The total time
for local model generation plus communication in the proposed approach is 3.14
and 2.34 times faster than the centralized and intermediate message passing
approaches, respectively. The student admission data sets for the years 2013-14
and 2014-15, collected from the Parul University Web Portal (PUWP), were used
to train the model, which was then applied to the student admission data set
for the year 2015-16 and achieved more than 98.03% prediction accuracy. These
experimental results were also verified using 10-fold cross-validation. Other
data sets were also used to check the performance of the proposed model. The
experimental results show that the proposed approach is better than the
existing one.
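The 10-fold cross-validation used for verification follows the standard pattern sketched below; the toy data set and the stand-in majority-class "model" are illustrative only, not the thesis's classifier or admission data:

```python
# 10-fold cross-validation: split the data into 10 folds, hold out each
# fold once for testing, train on the remaining 9, and average accuracy.
from collections import Counter

data = [(i, 'admit' if i % 3 else 'reject') for i in range(100)]  # toy set

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

k = 10
fold_size = len(data) // k
scores = []
for f in range(k):
    test = data[f * fold_size:(f + 1) * fold_size]       # held-out fold
    train = data[:f * fold_size] + data[(f + 1) * fold_size:]
    model = majority([y for _, y in train])              # "training" step
    acc = sum(model == y for _, y in test) / len(test)   # evaluation step
    scores.append(acc)

mean_accuracy = sum(scores) / k
```

The mean of the ten fold accuracies is the cross-validated estimate; replacing the stand-in with a real tree learner does not change the loop.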
1.5 Organization of thesis
In chapter 2, different classification techniques such as Naïve Bayesian,
Decision Tree, Support Vector Machine, and linear and non-linear classification
are discussed. The chapter further discusses decision tree generation in detail
(the tree building phase and the pruning phase), along with the reasons why the
decision tree is preferred over other classification techniques.
In chapter 3, the literature survey in the area of classification techniques
for data mining is described. The survey is organised into different
classification techniques, decision tree based learning algorithms, decision
tree learning in distributed environments, merging of decision trees, and the
challenges involved. The survey was carried out in three phases.
In chapter 4, the overview of the proposed dynamic and scalable approach is
described, with the problem statement, objectives and scope, the original
contribution of the thesis, and the proposed system architecture at the local
and coordinator sites. The proposed algorithms at the coordinator site and at
each local site are presented in detail together with the architecture, and the
decision rule merging policies are introduced.
In chapter 5, the working of the proposed model is discussed in detail with the
proposed algorithm and flow chart. The algorithm computation at each local site
is explained in detail with an example. The chapter also covers decision tree
generation, decision table generation from the decision rules, XML file
generation, and the algorithm computation at the coordinator site.
In chapter 6, the training and testing data sets are discussed along with data
collection and pre-processing. The chapter lists the features, their types, and
the number of instances of each training data set, and discusses the data
pre-processing steps in detail.
In chapter 7, the proposed approach is implemented in the Apache Hadoop
framework. This chapter describes Hadoop MapReduce, the Hadoop Distributed File
System (HDFS), and the Decision Tree MapReduce with its four steps: data
preparation, selection, update and tree growing.
Chapter 8 presents the experimental results with parameterized comparisons. It
also concludes the research, with the objectives achieved and their
justification, the conclusions of the work, and the scope for future
enhancements of this research.
1.6 References
1. J. da Silva, C. Giannella, R. Bhargava, H. Kargupta, and M. Klusch,
Distributed Data Mining and Agents, Engineering Applications of Artificial
Intelligence, vol. 18, no. 7, pp. 791-80, 2005.
2. J. R. Quinlan, Induction of Decision Trees, Machine Learning, vol. 1,
no. 1, pp. 81-106, 1986.
3. L. Hall, N. Chawla, and K. Bowyer, Decision Tree Learning on Very Large
Data Sets, IEEE International Conference on Systems, Man, and Cybernetics,
vol. 3, pp. 2579-2584, 1998.
4. P. Strecht, J. Mendes-Moreira, and C. Soares, Merging Decision Trees: A
Case Study in Predicting Student Performance, in Proceedings of the 10th
International Conference on Advanced Data Mining and Applications,
pp. 535-548, 2014.
5. Dinesh Vaghela, Priyanka Sharma, A decision support application for
student admission process based on prediction in distributed data min-
ing, International Conference on Information, Knowledge & Research In
Engineering, Management and Sciences(IC-IKR-EMS), Gujarat. IJETAETS-
ISSN 0974-3588, 7th Dec-2014.
6. Dinesh Vaghela, Priyanka Sharma, A Proposed DDM Algorithm and Frame-
work For EDM of Gujarat Technological University, Organized by Saf-
frony Institute of Technology International Conference on Advances in
Engineering, 22nd-23rd January 2015.
7. S. Baik and J. Bala, A Decision Tree Algorithm for Distributed Data Mining,
2004.
8. Dinesh Vaghela, Priyanka Sharma, Prediction and analysis of student
performance using distributed data mining, International Conference on
Information, Knowledge & Research In Engineering, Management and
Sciences(IC-IKR-EMS), Gujarat. IJETAETS-ISSN 0974-3588, 7th Dec-2014
9. Yael Ben-Haim and Elad Tom-Tov, A Streaming Parallel Decision Tree
Algorithm, Journal of Machine Learning Research, pp. 849-872, 2010.
10. Raj Kumar and Rajesh Verma, Classification Algorithms for Data Mining: A
Survey, International Journal of Innovations in Engineering and Technology,
ISSN 2319-1058.
11. Bendi Venkata Ramana, M.Surendra Prasad Babu, N. B. Venkateswarlu,
A Critical Study of Selected Classification Algorithms for Liver Disease
Diagnosis, International Journal of Database Management Systems ( IJDMS
), Vol.3, No.2, May 2011
12. Thair Nu Phyu, Survey of Classification Techniques in Data Mining,
Proceedings of IMECS, vol. I, March 18-20, 2009, Hong Kong.
13. Rahul Gupta, Anuja Priyam, Anju Rathee, Abhijeet, and Saurabh Srivastava,
Comparative Analysis of Decision Tree Classification Algorithms, International
Journal of Current Engineering and Technology, ISSN 2277-4106.
Chapter 2
Classification Techniques
2.1 Introduction
Classification is a data mining function that assigns items/instances from a
data set to target categories or classes. The goal of classification is to
accurately predict the target class for each case in the data. Classification
has many applications in data mining; for example, a classification model could
be used to identify loan applicants as low, medium, or high credit risks. The
data classification process includes two steps:
1. Building the Classifier or Model: this is the learning step or learning
phase, in which the classification algorithm builds the classifier [7]. The
classifier is built from a training set made up of database instances/tuples
and their associated class labels. Each instance/tuple constituting the
training set is assumed to belong to a predefined category or class; these
tuples may also be referred to as samples, objects or data points.
2. Using the Classifier for Classification: the trained model/classifier
generated from the training data set classifies the objects/tuples of the test
data set.
A major issue is preparing the data for classification and prediction.
Preparing the data involves the following activities:
• Data Cleaning: data cleaning involves removing noise and treating
missing values. Noise is removed by applying smoothing techniques, and
the problem of missing values is solved by replacing each missing value
with the most commonly occurring value for that attribute.

Figure 2.1: Induction: Model Construction

• Relevance Analysis: a database may also contain irrelevant attributes.
Correlation analysis [11] is used to determine whether any two given
attributes are related.
• Data Transformation and Reduction: the data can be transformed by any
of the following methods.
– Normalization: the data are scaled so that all values of a given
attribute fall within a small specified range. Normalization is used
when the learning step employs neural networks or methods involving
distance measurements.
– Generalization: the data can also be transformed by generalizing them
to a higher-level concept, using concept hierarchies for this purpose.
In section 2.2 of this chapter, the different classification techniques are
discussed.
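The min-max normalization mentioned above can be sketched as follows; the attribute values are illustrative, and [0, 1] is the commonly used target range:

```python
# Min-max normalization of one attribute: each value v is rescaled to
# lo + (v - min) * (hi - lo) / (max - min), so all values land in [lo, hi].

def min_max(values, lo=0.0, hi=1.0):
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    return [lo + (v - v_min) * (hi - lo) / span for v in values]

ages = [18, 22, 30, 60]          # raw attribute values
scaled = min_max(ages)           # 18 maps to 0.0, 60 maps to 1.0
```

Such rescaling prevents attributes with large numeric ranges from dominating distance-based learners.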
2.2 Classification techniques
Classification techniques can be grouped into five categories based on
different mathematical concepts: statistical-based [17], distance-based,
decision tree-based, neural network-based, and rule-based. Each category
consists of several algorithms, but the most popular and most extensively used
from each category are C4.5, Naïve Bayes, K-Nearest Neighbors, and the
Back-propagation Neural Network [18, 19, 20]. This section discusses different
classification techniques, such as Naïve Bayesian, Support Vector Machine, and
Decision Tree, in detail.

Figure 2.2: Deduction of test data using the model/classifier
2.2.1 Naïve Bayesian
Naïve Bayesian classifiers are simple probabilistic classifiers based on
Bayes' theorem. They make strong (naïve) independence assumptions among the
attributes/features of the data sets. Naïve Bayes classifiers require a number
of parameters linear in the number of variables of the learning task. They are
highly scalable, i.e. they remain applicable as the data set size grows. They
train the model by maximum likelihood using a closed-form expression
[1][2][8], which takes linear (O(n)) time, rather than the expensive iterative
approximation used by many other types of classifiers.
Figure 2.3: Posterior Probability of Naive Bayes

Naïve Bayes is a simple technique for constructing classifiers: models that
assign class labels, drawn from some finite set, to test objects/instances
represented as vectors of attribute values. Naïve Bayes is not a single
algorithm but a family of algorithms based on a common principle: all naïve
Bayes classifiers assume that the feature values are independent of one
another given the class variable. As an example of this principle, a bird may
be considered to be a dove if it is grey in colour, small in size, and about
100 g in weight. A naïve Bayes classifier considers each of these features to
contribute independently to the probability that the bird is a dove,
regardless of any possible correlations among the colour, size and weight
features. With the naïve Bayes approach it is easy to build models for very
large data sets; in general, naïve Bayes is known for combining simplicity
with highly effective classification.
As shown in figure 2.3, the posterior probability P(c|x) can be calculated by
naïve Bayes from P(c), P(x) and P(x|c), where:
• Posterior probability of class c: P(c|x), where c is the target and x the
attributes.
• Prior probability of class c: P(c).
• Likelihood: P(x|c).
• Predictor's prior probability: P(x).
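A small numeric illustration of these quantities, using made-up probabilities for the dove example:

```python
# Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x), with the evidence P(x)
# expanded over both classes by the law of total probability.
# All figures below are invented for illustration.

p_dove = 0.3                  # prior P(c): fraction of birds that are doves
p_grey_given_dove = 0.8       # likelihood P(x|c)
p_grey_given_other = 0.2      # likelihood under the complementary class

# evidence P(x): total probability of observing a grey bird
p_grey = p_grey_given_dove * p_dove + p_grey_given_other * (1 - p_dove)

# posterior P(c|x): probability the bird is a dove given that it is grey
posterior = p_grey_given_dove * p_dove / p_grey
```

The naïve independence assumption lets the likelihood of a full attribute vector be computed as the product of such per-attribute likelihoods.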
Naïve Bayes has numerous advantages for which it is widely used:
• It provides fast and easy prediction on test data samples, and it performs
multi-class prediction very well.
• With minimal training data, and given its strong assumption of independence
among attributes, a naïve Bayes classifier can perform better than other
classifier models such as logistic regression.
• It performs more effectively for categorical input variables than for
numerical ones; for numerical variables, a normal distribution is assumed.
It has the following limitations:
• Zero-frequency problem: the model cannot make a prediction if a categorical
variable has a category in the test data set that was not observed in the
training data set. To resolve this problem, a smoothing technique such as
Laplace smoothing is used.
• The probability outputs of naïve Bayes should not be taken too literally;
it is known to be a bad estimator.
• The assumption of independent predictors is a further weakness: in practice
it is almost impossible to have a completely independent set of predictors.
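The Laplace (add-one) smoothing mentioned for the zero-frequency problem can be sketched as follows; the categories and counts are invented for illustration:

```python
# Without smoothing, a category never seen in training gets probability 0,
# which zeroes out the whole product of likelihoods in naive Bayes.
# Laplace smoothing adds a pseudo-count alpha to every category.
from collections import Counter

def smoothed_prob(value, observed, all_values, alpha=1):
    counts = Counter(observed)
    return (counts[value] + alpha) / (len(observed) + alpha * len(all_values))

colors_in_class = ['grey', 'grey', 'white']   # training observations
domain = ['grey', 'white', 'brown']           # all known categories

unsmoothed = Counter(colors_in_class)['brown'] / len(colors_in_class)  # 0.0
smoothed = smoothed_prob('brown', colors_in_class, domain)             # > 0
```

With alpha = 1 the unseen category 'brown' receives probability 1/6 instead of 0, so a single unseen value no longer vetoes the prediction.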
2.2.2 Support Vector Machines (SVM)
A Support Vector Machine (SVM) is a discriminative classifier formally defined
by a separating hyperplane. In other words, given labelled training data
(supervised learning), the algorithm outputs an optimal hyperplane which
categorizes new examples.
SVMs [1][2] are supervised learning models in machine learning, used for
analysis by classification and regression. Given a finite set of training
samples, each marked with its category, the SVM builds a training model that
later assigns new test samples to the relevant category; it is thus a
non-probabilistic binary linear classifier. In the SVM model, the training and
test samples are represented as points in space, mapped in such a way that a
clear gap appears between the categories and separates the samples. New test
samples are later mapped into the same space, and their category is predicted
from the side of the gap on which they fall.
Using the kernel trick, SVMs can also perform non-linear classification in
addition to linear classification, by implicitly mapping the inputs into
high-dimensional feature spaces. In general, an SVM constructs a hyperplane, or
a set of hyperplanes, in a high-dimensional space, which can be used for tasks
such as regression, prediction or classification. A good separation is achieved
by the hyperplane that has the largest distance to the nearest training data
points; this distance is known as the functional margin. As a rule, the larger
the margin, the lower the generalization error of the classifier.
Linear SVM. Suppose a training data set of n points is given, of the form
$(\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n)$, where each $y_i$ is either 1 or
-1, indicating the class to which the point $\vec{x}_i$ belongs, and each
$\vec{x}_i$ is a real vector of dimension p. The goal is to find the
"maximum-margin hyperplane" that separates the group of points with $y_i = 1$
from the group of points with $y_i = -1$, defined so that the distance between
the hyperplane and the nearest point $\vec{x}_i$ of either group is maximized.
In the usual illustration, H1 does not separate the classes, H2 separates
them but only with a small margin, and H3 separates them with the maximum
margin. The hyperplane is defined as the set of points $\vec{x}$ satisfying
$\vec{w} \cdot \vec{x} - b = 0$, where $\vec{w}$ is the (not necessarily
normalized) normal vector of the hyperplane. The samples on the margin are the
support vectors. The offset of the hyperplane from the origin along the normal
vector $\vec{w}$ is determined by the parameter $b / \|\vec{w}\|$.
Hard margin. If the training data are linearly separable, two parallel
hyperplanes that separate the two classes of data can be selected so that the
distance between them is as large as possible. The region bounded by these two
hyperplanes is called the "margin", and the maximum-margin hyperplane lies
halfway between them. These hyperplanes can be described by the equations
$\vec{w} \cdot \vec{x} - b = 1$ and $\vec{w} \cdot \vec{x} - b = -1$.
The distance between the two hyperplanes is $2 / \|\vec{w}\|$, so maximizing
this distance means minimizing $\|\vec{w}\|$. By adding the constraint that,
for each $i$, either $\vec{w} \cdot \vec{x}_i - b \ge 1$ if $y_i = 1$, or
$\vec{w} \cdot \vec{x}_i - b \le -1$ if $y_i = -1$, the data points can be prevented
Figure 2.4: Classification based on linear SVM
Figure 2.5: Classification based on Hard SVM
from falling into the margin. Under these constraints, each data point must lie
on the correct side of the margin. The two conditions can be rewritten together
as:

$y_i(\vec{w} \cdot \vec{x}_i - b) \ge 1$, for all $1 \le i \le n$   (2.1)

Putting this together gives the optimization problem: minimize $\|\vec{w}\|$
subject to $y_i(\vec{w} \cdot \vec{x}_i - b) \ge 1$ for $i = 1, 2, \ldots, n$.
The maximum-margin hyperplane is completely determined by the points
$\vec{x}_i$ that lie nearest to it; these $\vec{x}_i$ are called support
vectors.
Soft margin. The hinge loss function
$\max(0,\, 1 - y_i(\vec{w} \cdot \vec{x}_i - b))$ is introduced to extend the
Support Vector Machine to data that are not linearly separable. If the
constraint in (2.1) is satisfied, this function is zero, meaning that
$\vec{x}_i$ lies on the correct side of the margin. For data on the wrong side
of the margin, the function's value is proportional to the distance from the
margin. The function to be minimized is then:

$\frac{1}{n} \sum_{i=1}^{n} \max(0,\, 1 - y_i(\vec{w} \cdot \vec{x}_i - b)) + \lambda \|\vec{w}\|^2$   (2.2)

Here the parameter $\lambda$ determines the trade-off between increasing the
margin size and ensuring that each $\vec{x}_i$ lies on the correct side of the
margin. Therefore, for sufficiently small values of $\lambda$, the soft-margin
SVM behaves the same as the hard-margin SVM in the case when the training data
are linearly separable.
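The objective (2.2) can be evaluated directly for a tiny one-dimensional data set and a candidate (w, b); all numbers below are illustrative:

```python
# Evaluate the soft-margin objective (2.2) in 1-D: average hinge loss
# over the data plus lambda * ||w||^2. Values are made up for illustration.

def hinge(w, b, x, y):
    """max(0, 1 - y * (w*x - b)): zero when the point clears the margin."""
    return max(0.0, 1.0 - y * (w * x - b))

def soft_margin_objective(w, b, data, lam):
    avg_loss = sum(hinge(w, b, x, y) for x, y in data) / len(data)
    return avg_loss + lam * w * w        # lambda * ||w||^2 in one dimension

# four well-separated points plus one margin violator (0.5, -1)
data = [(3.0, 1), (4.0, 1), (-2.0, -1), (-3.0, -1), (0.5, -1)]
obj = soft_margin_objective(w=0.5, b=0.0, data=data, lam=0.01)
```

Only the violating point contributes hinge loss; training an SVM means searching for the (w, b) that minimizes this quantity.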
Nonlinear Classification. The original maximum-margin hyperplane algorithm,
proposed by Vapnik in 1963, constructed a linear classifier. In 1992, Bernhard
E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested a way to create
nonlinear classifiers by applying the kernel trick, initially proposed by
Aizerman et al. [5], to maximum-margin hyperplanes [6]. In this approach every
dot product is replaced by a nonlinear kernel function, which allows the
maximum-margin hyperplane to be fitted in a transformed feature space. The
transformation may be nonlinear and the transformed space high dimensional; the
classifier is a hyperplane in the transformed feature space, though it may be
nonlinear in the original input space. Working in a higher-dimensional feature
space can increase the generalization error of the support vector machine, but
the algorithm still performs well if enough samples/instances are provided.

Figure 2.6: Nonlinear Classification

SVMs can be used to solve various real-world problems:
• SVMs are well suited to text and hypertext categorization, since their
application reduces the need for labelled training instances in both the
standard inductive and the transductive settings.
• SVMs achieve higher search accuracy than classical query refinement schemes,
which makes them well suited to the classification of images.
• In medical science, SVMs have been used to classify proteins with more than
90% accuracy.
• SVMs can also recognize hand-written characters with good accuracy.
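The effect of a (hypothetical) feature map can be shown on a one-dimensional toy set: no single threshold on x separates the classes, but after mapping x to x² one threshold suffices, which is the kind of separability the kernel trick buys implicitly:

```python
# Points labelled +1 lie between -1 and 1, points labelled -1 lie outside;
# no threshold on x alone splits them, but phi(x) = x**2 makes them
# threshold-separable. Data are invented for illustration.

data = [(-3, -1), (-2, -1), (-0.5, 1), (0.4, 1), (2, -1), (3, -1)]

def separable_by_threshold(points):
    """Can any threshold t (with either orientation) split the labels?"""
    for t in sorted({x for x, _ in points}):
        for sign in (1, -1):
            if all((1 if sign * (x - t) >= 0 else -1) == y
                   for x, y in points):
                return True
    return False

raw = separable_by_threshold(data)                              # not separable
mapped = separable_by_threshold([(x * x, y) for x, y in data])  # separable
```

A real kernel SVM never computes the mapped coordinates explicitly; the kernel function supplies the dot products in the transformed space.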
2.2.3 K-Nearest Neighbor Classifier
K-nearest neighbors is a simple algorithm that stores all available cases and
classifies new cases based on a similarity measure (e.g., distance functions).
KNN has been used in statistical [17] estimation and pattern recognition since
the beginning of the 1970s as a non-parametric technique.
The k-nearest neighbors algorithm can be used for both regression and
classification [10] in pattern recognition. In both cases the input consists of
the k closest training samples in the given feature/attribute space, while the
output depends on whether k-NN is used for regression or classification:
• For classification, an object is assigned to a class by a vote of its
neighbors, where k is the number of nearest neighbors considered. For example,
if k = 1, the object is simply assigned to the class of its single closest
neighbor.
• For regression, k-NN outputs the property value of the object, computed as
the average of the values of its k nearest neighbors.
k-NN is among the simplest of all machine learning algorithms. The function is
only approximated locally, and all computation is deferred until
classification; k-NN is therefore a lazy learner, also called an instance-based
learner.
In k-NN, for both classification and regression, it is useful to weight the
contributions of the neighbors so that nearer neighbors contribute more than
distant ones: a common scheme gives each neighbor a weight of 1/d, where d is
its distance, and these weights then decide the classification of the object.
k-NN requires no explicit training step, and the method is sensitive to the
local distribution of the data.
Each training example has a class label and is represented as a vector in a
multidimensional feature space. In the training phase, only the feature vectors
and the class labels of the training objects are stored. In the k-NN algorithm,
k is a user-defined constant, and in the classification phase a test point is
assigned the label that is most frequent among its k nearest training samples.
The Euclidean distance is most commonly used for continuous variables, while
the Hamming distance is used for discrete variables, as in text classification.
For gene expression microarray data, correlation coefficients such as Pearson
and Spearman [12] have been used as the metric. The performance of k-NN can
also be improved significantly by learning the distance metric, for example
with neighborhood components analysis. The different distance equations are
given in figure 2.7:
Figure 2.7: Distance functions equations
Figure 2.8: Hamming Distance
It should be noted that the three distance measures above are only valid for
continuous variables. For categorical variables the Hamming distance, shown in
figure 2.8, must be used instead. This also raises the issue of standardizing
the numerical variables to the range 0–1 when the dataset contains a mixture
of numerical and categorical variables.
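The distance measures of figures 2.7 and 2.8 can be sketched in Python as follows. This is a minimal illustration assuming figure 2.7 lists the usual Euclidean, Manhattan and Minkowski formulas; the function names are illustrative, not from the thesis.

```python
import math

def euclidean(x, y):
    # sqrt(sum((x_i - y_i)^2)), for continuous variables
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # sum(|x_i - y_i|)
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, q):
    # (sum(|x_i - y_i|^q))^(1/q); q=2 gives Euclidean, q=1 Manhattan
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def hamming(x, y):
    # number of positions where categorical values differ (figure 2.8)
    return sum(1 for a, b in zip(x, y) if a != b)
```
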
It has been observed that when the class distribution is skewed, k-NN suffers
from a "majority voting" drawback: predictions for a new sample are dominated
by the more frequent classes, simply because their examples are more numerous
among the neighbors [13]. This problem can be overcome by weighting the clas-
sification according to the distances between the test point and each of its k
nearest neighbors. In regression, the value of each of the k nearest points is
likewise multiplied by the inverse of its distance to the test point. Abstrac-
tion in the data representation is another way to overcome the skew problem;
for example, k-NN can be applied to a Self-Organizing Map (SOM), in which each
node represents the center of a cluster regardless of the density of the
underlying data.
The best choice of k depends on the data: in general, larger values of k
reduce the effect of noise on the classification [14], but make the boundaries
between classes less distinct. A good k can also be selected by heuristic
approaches. The special case where the class label is predicted from the single
closest training sample (k = 1) is called the nearest neighbor algorithm.
The accuracy of the k-NN algorithm can be severely degraded by noise, by
irrelevant features, or by feature scales that are inconsistent with their
importance. Much research effort has therefore gone into selecting or scaling
features to improve classification accuracy; a well-known approach is to use
evolutionary algorithms to optimize the feature scaling [15]. The mutual
information between the training data and the training classes can also be
used for feature scaling. In binary classification, choosing k as an odd
number avoids tied votes. The bootstrap method can also be used to obtain a
practically optimal value of k [16].
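The voting and distance-weighting schemes described above can be combined into a short k-NN sketch. This is illustrative only; `knn_classify` and its parameters are hypothetical names, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3, weighted=False):
    """Classify `query` given `train` = [(feature_vector, label), ...].

    With weighted=True each neighbour votes with weight 1/d, the
    distance-weighted variant described above for skewed class
    distributions. A sketch, not the thesis implementation.
    """
    dist = lambda x, y: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    # keep the k training samples closest to the query point
    neighbours = sorted(train, key=lambda s: dist(s[0], query))[:k]
    votes = Counter()
    for vec, label in neighbours:
        d = dist(vec, query)
        votes[label] += 1.0 / (d + 1e-9) if weighted else 1.0
    return votes.most_common(1)[0][0]
```
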
2.2.4 Instance Based Learning (IBL)
As discussed earlier, k-NN has no training phase: it simply looks for the k
nearest neighbors of a new sample in order to select the class to which it
belongs. k-NN also supports incremental classification, and indexing can be
used to find the neighbors efficiently. The search is exhaustive, however, if
finding the nearest neighbors of a new instance requires storing and comparing
all instances in memory, which can lead to very high memory usage.
A solution to the above problem is Instance Based Learning (IBL), an enhance-
ment of the k-NN classification algorithm. The k-NN algorithm requires a large
amount of storage, whereas IBL does not need to maintain model abstractions.
Aha et al. (1991) focus on reducing the storage requirements with minimal loss
in learning rate and accuracy. k-NN is not well suited to noisy data, while
IBL tolerates noise and can therefore be applied to many real-life datasets.
IBL algorithms have the following strengths:
• They are supervised learning algorithms that classify objects directly
from stored instances.
• Updating the stored instances is inexpensive.
• Learning/training the model is fast.
• The algorithm can be extended to produce concept descriptions.
IBL algorithms have the following weaknesses:
• Since all training instances are saved, they are computationally
expensive.
• Noisy attribute values are not handled well.
• Irrelevant attributes are also not handled well.
• Their performance depends entirely on the similarity function.
• They do not handle nominal or missing-valued attributes well.
• They give no insight into how the data are structured.
IBL methodology and framework The methodology and the framework of IBL are as
follows:
• The primary output of an IBL algorithm is a Concept Description (CD),
which maps instances to categories.
• The concept description includes the collection of stored instances and,
possibly, information about their past classification performance. The
set of instances stored in a CD can change after each instance is
processed.
• The IBL framework has three parts: a similarity function, a classifica-
tion function and a concept description updater.
– The similarity function computes a numeric similarity between a new
instance and the stored training instances.
– The classification function classifies the new instance based on the
values produced by the similarity function.
– The concept description updater maintains the records of classifica-
tion performance.
– The inputs to an IBL algorithm are the new test instances, the clas-
sification results, the current CD and the similarity values; the
output is the modified CD.
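The three framework parts can be sketched as a minimal IB1-style learner. This is a rough reading of the framework of Aha et al. (1991), not their implementation; the class and method names are illustrative.

```python
import math

class IB1:
    """Minimal IB1-style instance-based learner (sketch).

    The concept description (CD) is simply the list of stored instances;
    the three framework parts map to similarity(), classify() and update().
    """
    def __init__(self):
        self.cd = []                   # concept description: stored instances
        self.correct = self.seen = 0   # classification performance record

    def similarity(self, x, y):
        # numeric similarity: negated Euclidean distance
        return -math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def classify(self, x):
        # label of the most similar stored instance (None if CD is empty)
        if not self.cd:
            return None
        vec, label = max(self.cd, key=lambda s: self.similarity(s[0], x))
        return label

    def update(self, x, label):
        # record performance on the new instance, then store it in the CD
        self.seen += 1
        if self.classify(x) == label:
            self.correct += 1
        self.cd.append((x, label))
```
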
2.2.5 Rule Based Classification
Rule based classification rests on the systematic selection of a small number
of features used for decision making, which increases the comprehensibility of
the knowledge patterns. Useful if-then rules are extracted from the dataset on
the basis of statistical significance.
IF-THEN rules can be extracted from the training data with a Sequential
Covering Algorithm (SCA) such as AQ, CN2 or RIPPER. With these algorithms
there is no need to generate a Decision Tree (DT) first. Each rule covers many
of the tuples of a given class. In this category of classification the rules
are learned one at a time: each time a rule is learned, the tuples covered by
that rule are removed, and the process continues for the remaining tuples. (In
a decision tree, by contrast, each path from the root to a leaf represents a
rule.) The rules also need to be pruned, for the following reasons:
• The quality assessment depends on the original collection of training
samples, so the rules may perform well on the training data but poorly
on subsequent data.
• A rule R is pruned only if the pruned version R' has better quality.
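The learn-one-rule-then-remove-covered-tuples loop can be sketched as follows. This is a toy in the spirit of sequential covering, far simpler than AQ, CN2 or RIPPER: it learns only single attribute = value rules, and all names are hypothetical.

```python
def learn_rules(data, target_class):
    """Toy sequential-covering sketch. Each example is (attrs_dict, label);
    each learned rule is a single {attribute: value} condition."""
    rules, remaining = [], list(data)
    while any(label == target_class for _, label in remaining):
        best, best_score = None, -1.0
        # try every attribute=value test and keep the most accurate one
        for attrs, _ in remaining:
            for a, v in attrs.items():
                covered = [(x, l) for x, l in remaining if x.get(a) == v]
                pos = sum(1 for _, l in covered if l == target_class)
                score = pos / len(covered)
                if score > best_score and pos > 0:
                    best, best_score = {a: v}, score
        if best is None or best_score < 1.0:
            break  # no perfect rule left; a real learner would relax this
        rules.append(best)
        # remove the tuples covered by the new rule and continue
        remaining = [(x, l) for x, l in remaining
                     if any(x.get(a) != v for a, v in best.items())]
    return rules
```
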
2.2.6 Neural Networks
A neural network resembles the biological human brain in that it consists of a
collection of connected neurons; it is often considered the borderline between
approximation algorithms and artificial intelligence. Because it learns
through training and resembles the structure of biological neuron networks, it
is known as a nonlinear predictive model. Neural networks are used in appli-
cations that involve detecting patterns, making predictions and learning from
the past, much as biological systems do. Artificial neural networks are
computer programs that enable a computer to learn in a manner loosely analo-
gous to a human being; they cannot mimic the human brain completely and have
their own limitations. Nevertheless, they are highly accurate predictive
models that can be applied to a large range of problems.
The Strengths of Neural Networks:
• High tolerance to noisy data
• Well-suited for continuous-valued inputs and outputs
• Successful on a wide array of real-world data
• Techniques exist for extraction of rules from neural networks
The Weaknesses of Neural Networks:
• Long training time
• Many non-trivial parameters, e.g., network topology
• Poor interpretability
2.2.7 Decision Tree
Owing to its computational efficiency in handling large volumes of data, De-
cision Tree (DT) induction is one of the best known Machine Learning (ML)
frameworks. It identifies the features/attributes that contribute most to the
given problem and provides interpretable results.
A decision tree is a tree-shaped structure that represents sets of decisions,
and these decisions generate rules for the classification of a dataset. Each
record starts from the root and moves toward a child node according to a
splitting criterion until it reaches a leaf node; the splitting criterion
evaluates a branching condition at the current node on the input records.
Decision tree construction has two stages: the first builds the tree and the
second prunes it. In most algorithms the tree grows top-down in a greedy
fashion, starting at the root node; at each intermediate node the database
records
Figure 2.9: Decision tree based classification for car subscription
are evaluated against a splitting criterion, and the database is recursively
partitioned in this manner. In the second stage, tree pruning reduces the size
of the tree in a principled way so as to reduce the prediction error.
Why Decision Tree?
Decision trees have several advantages over other decision support tools:
• Decision trees are easy to interpret and understand.
• Important insights can easily be generated with them.
• New scenarios can be added as they are introduced.
• Best, average and worst values can be determined for different scenarios.
• They act as a white-box model: a given result can be explained by a
condition in Boolean logic.
• They can be combined with other decision techniques.
Decision trees also have advantages over other data mining methods:
• Unlike other techniques, a decision tree requires little data pre-
processing, such as normalization or removal of blank values.
• It can work on both categorical and numerical data.
• The reliability of the model can be improved by validating it with
standard statistical tests.
• It works well even if its assumptions are somewhat violated by the true
model underlying the data, so it is robust.
• Very large volumes of data can be analyzed effectively and efficiently
with the available resources.
Disadvantages of Decision tree
The disadvantages and open issues of decision trees are as follows:
• Information gain is biased in favor of categorical attributes with more
levels [4].
• For uncertain values or linked outcomes the calculations become more
complex.
• Determining how deeply to grow the decision tree.
• Handling continuous attributes.
• Choosing an appropriate attribute selection measure.
• Handling training data with missing attribute values.
• Handling attributes with differing costs.
• Improving computational efficiency.
Limitations:
• The problem of learning an optimal decision tree is known to be NP-
complete under several aspects of optimality, even for simple concepts.
Consequently, practical decision-tree learning algorithms are based on
heuristics, such as the greedy algorithm in which locally optimal deci-
sions are made at each node. Such algorithms cannot guarantee to return
the globally optimal decision tree.
• Decision-tree learners can create over-complex trees that do not gener-
alize well from the training data. This is known as overfitting; mecha-
nisms such as pruning are necessary to avoid this problem.
2.3 Attribute Splitting Measures
The central choice in the basic algorithm (ID3) is which attribute to test at
each node of the tree: the attribute that is most useful for classifying the
instances should be selected. A good quantitative measure of the worth of an
attribute is given by statistical properties such as Entropy, Information
Gain, Split Info, Gain Ratio and the Gini Index, which measure how well a
given attribute separates the training instances according to their target
classification.
Entropy
Entropy H(S) is a measure of the amount of uncertainty in the (data) set S
(i.e. entropy characterizes the (data) set S):

H(S) = −∑_{x∈X} p(x) log₂ p(x)    (2.3)
Where,
• S - The current (data) set for which entropy is being calculated (changes
every iteration of the ID3 algorithm)
• X - Set of classes in S
• p(x) - The proportion of the number of elements in class x to the number
of elements in set S
When H(S) = 0, the set S is perfectly classified (i.e. all elements in S
belong to the same class). In ID3, entropy is calculated for each remaining
attribute, and the attribute with the smallest entropy is used to split the
set S on that iteration. The higher the entropy, the higher the potential to
improve the classification. Entropy can be calculated from a frequency table.
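Equation 2.3 can be computed directly from the class frequencies. A small sketch; the function name is illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_x p(x) log2 p(x) over the class labels in S (eq. 2.3)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
```
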
Information Gain
Information gain is based on the decrease in entropy after a dataset is split
on an attribute. The information gain IG(A, S) measures the difference in
entropy from before to after the set S is split on an attribute A; in other
words, how much the uncertainty in S was reduced by splitting S on A.

IG(A, S) = H(S) − ∑_{t∈T} p(t) H(t)    (2.4)
Where,
• H(S) - Entropy of set S
• T - The subsets created from splitting set S by attribute A, such that
S = ∪_{t∈T} t
• p(t) - The proportion of the number of elements in t to the number of
elements in set S
• H(t) - Entropy of subset t
In ID3, information gain can be calculated (instead of entropy) for each re-
maining attribute. The attribute with the largest information gain is used to
split the set S on this iteration.
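Equation 2.4 can be sketched as follows, reusing the entropy of eq. 2.3 (repeated here so the snippet is self-contained). Illustrative only; rows are assumed to be dictionaries mapping attribute names to values.

```python
import math
from collections import Counter

def entropy(labels):
    # H(S) from eq. 2.3
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    """IG(A,S) = H(S) - sum_t p(t) H(t), splitting `rows` on `attr` (eq. 2.4)."""
    labels = [r[target] for r in rows]
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder
```
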
Split Information
The split information value represents the potential information generated by
splitting the training data set D into v partitions, corresponding to the v
outcomes of a test on attribute A. A high SplitInfo means the partitions have
more or less the same size (uniform); a low SplitInfo means a few partitions
hold most of the tuples (peaks).

SplitInfo_A(D) = −∑_{j=1}^{v} (|D_j|/|D|) log₂(|D_j|/|D|)    (2.5)
Gain Ratio
C4.5, a successor of ID3, uses an extension of information gain known as the
gain ratio. It overcomes the bias of information gain by applying a kind of
normalization using the split information value. The gain ratio is defined as:

GainRatio(A) = Gain(A) / SplitInfo_A(D)    (2.6)

The attribute with the maximum gain ratio is selected as the splitting
attribute.
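Equations 2.5 and 2.6 can be sketched as follows. Illustrative only; SplitInfo is computed here from the partition sizes alone.

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum_j (|D_j|/|D|) log2(|D_j|/|D|) (eq. 2.5)."""
    total = sum(partition_sizes)
    return -sum((s / total) * math.log2(s / total)
                for s in partition_sizes if s)

def gain_ratio(gain, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D) (eq. 2.6)."""
    return gain / split_info(partition_sizes)
```
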
Gini Index
The Gini Index (used in CART) measures the impurity of a data partition D:

Gini(D) = 1 − ∑_{i=1}^{m} p_i²    (2.7)

where m is the number of classes and p_i is the probability that a tuple in D
belongs to class C_i. The Gini Index considers a binary split for each
attribute A, say into D1 and D2. The Gini index of D given that partitioning
is:

Gini_A(D) = (|D1|/|D|) Gini(D1) + (|D2|/|D|) Gini(D2)    (2.8)

The reduction in impurity is given by:

∆Gini(A) = Gini(D) − Gini_A(D)    (2.9)

The attribute that maximizes the reduction in impurity is chosen as the split-
ting attribute.
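Equations 2.7-2.9 can be sketched as follows; the helper names are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_i p_i^2 (eq. 2.7)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Gini_A(D) for a binary split D -> D1, D2 (eq. 2.8)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def gini_reduction(labels, left, right):
    """Delta Gini(A) = Gini(D) - Gini_A(D) (eq. 2.9)."""
    return gini(labels) - gini_split(left, right)
```
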
2.4 Decision Tree Classification
Decision tree learning [9] uses a decision tree as a predictive model that
maps observations about an item to conclusions about the item's target value.
It is one of the predictive modelling approaches used in statistics, data
mining and machine learning. Tree models in which the target variable takes a
finite set of values are called classification trees; in these tree struc-
tures, leaves represent class labels and branches represent conjunctions of
features that lead to those class labels. Decision trees in which the target
variable takes continuous values (typically real numbers) are called regres-
sion trees.
In decision analysis, a decision tree can be used to visually and explicitly
represent decisions and decision making. In data mining, a decision tree de-
scribes data but not decisions; rather the resulting classification tree can be an
input for decision making.
There are two main types of decision trees used in data mining:
• Classification tree: the predicted outcome is the class to which the
data belongs.
• Regression tree: the predicted outcome is a real number.
The umbrella term Classification And Regression Tree (CART) for these analysis
procedures was first introduced by Breiman et al. [3]. Trees used for regres-
sion and trees used for classification have some similarities but also notable
differences, such as the procedure used to determine where to split [3].
Some well-known decision tree ensemble methods are:
• Bagged decision trees: the training data are repeatedly re-sampled with
replacement to build multiple decision trees, whose predictions are
combined by voting [4].
• Random Forest: improves the classification rate by building multiple
decision trees as an ensemble of diverse classifiers.
• Boosted Trees: used for both regression and classification problems
[5][6].
• Rotation Forest: every decision tree is trained by applying Principal
Component Analysis (PCA) to a random subset of the input features.
2.4.1 Tree Building Phase
Decision tree learning is one of the most widely used and practical methods
for inductive inference. It approximates the value of a target function, with
the learned function represented by a tree. Learned trees can also be re-
represented as sets of if-then rules to improve human readability. These
learning methods are among the most popular inductive inference algorithms
and have been successfully applied to a broad range of tasks.
Decision tree learning is a method commonly used in data mining [20]. The
goal is to create a model that predicts the value of a target variable based
on several input variables. Each interior node corresponds to one of the
input variables, with an edge to a child for each of the possible values of
that input variable. Each leaf represents a value of the target variable
given the values of the input variables along the path from the root to that
leaf.
The decision tree is one of the simplest representations for classification
problems. Assume that all features have finite discrete domains and that
there is a single target feature for classification, called the class. Every
internal (i.e. non-leaf) node of the tree is labeled with an input feature,
the arcs leaving a node are labeled with the possible values of that feature,
and every leaf node is labeled with a class.
To learn a tree, the source set is divided into subsets based on an
attribute value. This process is repeated on each derived subset by recursive
partitioning, and it stops when all records in the subset at a node have the
same target value, or when further splitting no longer contributes to the
prediction. This greedy procedure is known as Top Down Induction of Decision
Trees (TDIDT) [21].
Most algorithms that have been developed for learning decision trees are
variations on a core algorithm that employs a top-down, greedy search through
the space of possible trees. This approach is exemplified by ID3 (Quinlan
1986) and its successors C4.5 (Quinlan 1993), C5.0 and many more.
The basic ID3 algorithm learns decision trees by constructing them top-
down, beginning with the question "Which attribute should be tested at the
root of the decision tree?". To answer it, each attribute is evaluated using
a statistical test (the information gain) to determine how well it alone
classifies the training examples. The best attribute is selected and used as
the test at the root node of the tree. A descendant of the root node is then
created for each possible value of this attribute, and the training examples
are sorted to the appropriate descendant node. The entire process is repeated
using the training examples associated with each descendant node to select
the best attribute to test at that point in the tree. This forms a greedy
search for an acceptable decision tree, in which the algorithm never back-
tracks to reconsider earlier choices.
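The top-down greedy procedure just described can be condensed into a short recursive sketch. This is in the spirit of ID3, not Quinlan's code; rows are assumed to be dictionaries and all names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def id3(rows, attrs, target):
    """Grow a tree top-down. Returns a class label (leaf) or a
    (best_attr, {value: subtree}) pair (internal node)."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:            # pure node -> leaf
        return labels[0]
    if not attrs:                        # no attribute left -> majority leaf
        return Counter(labels).most_common(1)[0][0]

    def gain(a):                         # statistical test: information gain
        rem = 0.0
        for v in set(r[a] for r in rows):
            sub = [r[target] for r in rows if r[a] == v]
            rem += len(sub) / len(rows) * entropy(sub)
        return entropy(labels) - rem

    best = max(attrs, key=gain)          # best attribute tested at this node
    children = {}
    for v in set(r[best] for r in rows):  # one descendant per attribute value
        subset = [r for r in rows if r[best] == v]
        children[v] = id3(subset, [a for a in attrs if a != best], target)
    return (best, children)
```

Note that, as in the description above, the search never backtracks: once an attribute is chosen for a node, that choice is final.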
2.4.2 Tree Pruning Phase
The smaller the complexity of a concept, the less the danger that it overfits
the data (a polynomial of degree n can always fit n + 1 points); thus,
learning algorithms try to keep the learned concepts simple. For very large
data sets, overfitting is a challenge when generating decision trees or other
predictive models. Overfitting happens when the learning algorithm continues
to develop hypotheses that reduce the training set error at the cost of an
increased test set error. In building decision trees there are several
approaches to reducing overfitting:
• Pre-pruning: tree growing is stopped early, before the tree perfectly
classifies the training set.
• Post-pruning: the tree is first grown to completion, classifying the
entire training data set, and pruning is performed afterwards.
It is often difficult to decide when to stop growing the tree, so in
practice the post-pruning approach is more successful, and it also covers the
entire training data set during tree construction. Several methods can be
used to define a criterion for finding the correct final tree:
1. A data set separate from the training set is used as a validation set.
2. The tree is first built with the available training data, and then each
node is checked to see whether expanding or pruning it brings an
improvement, using:
• Error estimation
• Significance testing (e.g., the chi-square test)
• The Minimum Description Length principle: growth of the tree stops
when the encoding size is minimized.
Pre Pruning
Pre-pruning stops growing a branch when the information becomes unreliable:
based on a statistical significance test, the tree stops growing when there
is no statistically significant association between any attribute and the
class at a particular node. Typical pre-pruning tests are as follows:
Decision Tree: rules are easily observed, developed and generated.
Naive Bayes: fast, highly scalable model building (parallelized) and scoring.
K-Nearest Neighbor: robust to noisy training data and effective if the
training data is large.
SVM: more accurate than decision tree classification.
Neural Networks: high tolerance of noisy data and ability to classify
patterns for untrained data.
Table 2.1: Advantages of different classification algorithms
• The most popular test is the chi-squared test.
• ID3 used the chi-squared test in addition to information gain.
Only statistically significant attributes were allowed to be selected by the
information gain procedure.
Post Pruning
Post-pruning first grows a decision tree that correctly classifies all the
training data, even where the data are unreliable, and then simplifies it by
replacing some subtrees with leaves. Post-pruning is usually preferred in
practice because pre-pruning can stop too early.
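Reduced-error post-pruning against a separate validation set, along the lines described above, can be sketched on a small tuple-based tree. This is an illustrative reading, not a standard implementation: nodes are hypothetical `(attribute, {value: subtree})` tuples and leaves are class labels.

```python
from collections import Counter

def predict(tree, row, default='?'):
    # walk the (attr, {value: subtree}) structure down to a leaf label
    while isinstance(tree, tuple):
        attr, children = tree
        tree = children.get(row.get(attr), default)
    return tree

def errors(tree, rows, target):
    return sum(1 for r in rows if predict(tree, r) != r[target])

def prune(tree, train_rows, valid_rows, target):
    """Replace a subtree by the majority leaf of its training rows whenever
    that does not increase error on the validation rows reaching it."""
    if not isinstance(tree, tuple) or not train_rows:
        return tree
    attr, children = tree
    pruned_children = {
        v: prune(sub,
                 [r for r in train_rows if r.get(attr) == v],
                 [r for r in valid_rows if r.get(attr) == v],
                 target)
        for v, sub in children.items()}
    pruned = (attr, pruned_children)
    leaf = Counter(r[target] for r in train_rows).most_common(1)[0][0]
    if errors(leaf, valid_rows, target) <= errors(pruned, valid_rows, target):
        return leaf          # replacing the node with a leaf is no worse
    return pruned
```
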
2.5 Comparison of classification techniques
Table 2.1 gives an overall comparison [18][19] of different classification
techniques: decision tree, Naive Bayes, K-Nearest Neighbor, Support Vector
Machine (SVM) and Neural Networks. Among these techniques the decision tree
is easy to understand and its rules are easy to develop, while SVM is faster.
Table 2.2 shows a feature-wise comparison [18][19], summarizing features
such as learning type, speed, accuracy, scalability and transparency for the
classification techniques.
Feature          | Decision Tree        | Naive Bayes          | K-Nearest Neighbor | SVM                       | Neural Networks
Learning type    | Eager learner        | Eager learner        | Lazy learner       | Eager learner             | Eager learner
Speed            | Fast                 | Very fast            | Slow               | Fast with active learning | Slow
Accuracy         | Good in many domains | Good in many domains | High, robust       | Significantly high        | Good in many domains
Scalability      | Efficient for small  | Efficient for large  | -                  | -                         | Slow
                 | data sets            | data sets            |                    |                           |
Interpretability | Good                 | -                    | -                  | -                         | Bad
Transparency     | Rules                | No rules             | Rules              | No rules                  | No rules
Missing values   | Missing value        | Missing value        | Missing value      | Sparse data               | -
Table 2.2: Feature comparisons

Table 2.3 discusses the merits and demerits of the classification techniques
in more detail. The decision tree works well with redundant attributes, while
irrelevant attributes disturb the construction of the decision tree. Naive
Bayes assumes the independence of the features and hence offers less accuracy
in classification. Neural networks have a high tolerance for noisy data but
take much time to train.
Decision Tree
Merits:
• Handles continuous, discrete and numeric data.
• Provides fast results when classifying unknown records.
• Supports redundant attributes.
• Very good results are obtained for small trees; results are not affected
by outliers.
• Normalization is not required.
Demerits:
• It cannot predict the value of a continuous class attribute.
• It gives error-prone results when too many classes are used.
• Construction of the tree is disturbed by irrelevant attributes.
• The tree is affected by even small changes in the data.

Naive Bayesian
Merits:
• Provides high accuracy and speed on large databases.
• Minimum error rate compared to other classifiers.
• Easy to understand.
• Supports streaming data as well as real- and discrete-valued data.
Demerits:
• It assumes independence of the features, so it provides less accuracy.

Neural Networks
Merits:
• High tolerance to noisy data.
• Good for continuous values.
• Untrained patterns can also be classified.
Demerits:
• Complex to interpret.
• Takes much time to train the model.

Table 2.3: Comparison of Classification Algorithms
2.6 Summary
In this chapter, different classification techniques have been discussed,
followed by the importance of decision tree based classification in decision
making. In addition, the two phases of decision tree construction, tree
building and tree pruning, have been described. The attribute splitting
measures, which play a crucial role in building the decision tree, have been
compared. The last section gave a detailed comparison of the different
classification techniques with respect to various parameters.
2.7 References
1. Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning
20(3): 273-297. doi:10.1007/BF00994018. 1995.
2. Press, William H., Teukolsky, Saul A., Vetterling, William T. and
Flannery, B. P. Section 16.5, Support Vector Machines. Numerical
Recipes: The Art of Scientific Computing (3rd ed.). New York: Cambridge
University Press. ISBN 978-0-521-88068-8. 2007.
3. Boser, B. E., Guyon, I. M. and Vapnik, V. N. A training algorithm for
optimal margin classifiers. Proceedings of the Fifth Annual Workshop on
Computational Learning Theory (COLT '92), p. 144.
doi:10.1145/130385.130401. ISBN 089791497X. 1992.
4. Aizerman, M. A., Braverman, E. M. and Rozonoer, L. I. Theoretical
foundations of the potential function method in pattern recognition
learning. Automation and Remote Control 25: 821-837. 1964.
5. Boser, B. E., Guyon, I. M. and Vapnik, V. N. A training algorithm for
optimal margin classifiers. Proceedings of the Fifth Annual Workshop on
Computational Learning Theory (COLT '92), p. 144. 1992.
6. Platt, J., Cristianini, N. and Shawe-Taylor, J. Large margin DAGs for
multiclass classification. In Solla, S. A., Leen, T. K. and Müller,
K.-R. (eds.), Advances in Neural Information Processing Systems.
7. Dietterich, T. G. and Bakiri, G. Solving multiclass learning problems
via error-correcting output codes. Journal of Artificial Intelligence
Research 2: 263-286. 1995.
8. Lee, Y., Lin, Y. and Wahba, G. Multicategory Support Vector Machines.
Computing Science and Statistics 33. 2001.
9. https://en.wikipedia.org/wiki/Decision_tree_learning
10. Altman, N. S. An introduction to kernel and nearest-neighbor nonpara-
metric regression. The American Statistician 46(3): 175-185. 1992.
11. Jaskowiak, P. A. and Campello, R. J. G. B. Comparing correlation
coefficients as dissimilarity measures for cancer classification in
gene expression data. Brazilian Symposium on Bioinformatics, pp. 1-8.
2011. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.208.993
12. Coomans, D. and Massart, D. L. Alternative k-nearest neighbour rules in
supervised pattern recognition: Part 1. k-Nearest neighbour classifica-
tion by using alternative voting rules. Analytica Chimica Acta 136:
15-27. 1982.
13. Everitt, B. S., Landau, S., Leese, M. and Stahl, D. Miscellaneous
clustering methods. In Cluster Analysis, 5th Edition. John Wiley &
Sons, Ltd, Chichester, UK. 2011.
14. Nigsch, F., Bender, A., van Buuren, B., Tissen, J., Nigsch, E. and
Mitchell, J. B. Melting point prediction employing k-nearest neighbor
algorithms and genetic parameter optimization. Journal of Chemical
Information and Modeling 46: 2412-2422. 2006.
15. Hall, P., Park, B. U. and Samworth, R. J. Choice of neighbor order in
nearest-neighbor classification. Annals of Statistics 36: 2135-2152.
2008.
16. Terrell, G. R. and Scott, D. W. Variable kernel density estimation.
Annals of Statistics 20: 1236-1265. 1992.
17. Mills, P. Efficient statistical classification of satellite measure-
ments. International Journal of Remote Sensing. 2011.
18. Dimitoglou, G., Adams, J. A. and Jim, C. M. Comparison of the C4.5 and
a Naive Bayes classifier for the prediction of lung cancer survivabi-
lity. Journal of Computing 4(2): 1-9. 2012.
19. Huang, J., Lu, J. and Ling, C. X. Comparing Naive Bayes, decision
trees, and SVM with AUC and accuracy. Proceedings of the Third IEEE
International Conference on Data Mining, 19-22 November 2003, pp.
553-556. doi:10.1109/ICDM.2003.1250975.
20. Rokach, L. and Maimon, O. Data Mining with Decision Trees: Theory and
Applications. World Scientific Pub Co Inc. ISBN 978-9812771711. 2008.
21. Quinlan, J. R. Induction of decision trees. Machine Learning 1: 81-106.
Kluwer Academic Publishers. 1986.
Chapter 3
Literature Survey On
Decision Tree
3.1 Introduction
Many data mining techniques exist, such as frequent pattern mining, classifi-
cation, regression, clustering and association rule mining, but among these,
classification is used most frequently. In classification [1], a model is
trained that describes and differentiates the data classes, so as to predict
the classes of objects whose labels are unknown. Classification can be
performed with different algorithms, such as neural networks, decision trees
and regression. Owing to the importance of decision trees for large data
sets, the decision tree approach has been used in this research. In general,
classification is the following sequence of operations:
1. Prepare the training data set by pre-processing the raw data.
2. Identify the class attribute and the classes.
3. Identify the attributes useful for classification (relevance analysis).
4. Learn a model using the training examples in the training set.
5. Use the model to classify the unknown data samples.
Figure 3.1: Decision tree
3.2 First Phase Study
As discussed in chapter 2, decision trees [2] represent a sequence of rules
that determine the class; a decision tree is a flowchart-like tree structure.
A decision tree consists of three kinds of components: the root node, the
internal nodes and the leaf nodes. The topmost component is the root node,
the leaf nodes are the terminal components of the structure, and the nodes in
between are the internal nodes. Each internal node denotes a test on an
attribute, each branch represents an outcome of the test, and each leaf node
holds a class label. Various decision tree algorithms are used in classifica-
tion, such as ID3, J48, CART, C5.0, SLIQ, SPRINT, random forest and random
tree. In this work the following tree algorithms were taken for comparison.
The ID3 (Iterative Dichotomiser 3) decision tree algorithm was developed by
Quinlan [6]. The basic idea of ID3 is to construct the decision tree through a
top-down, greedy search of the given sets, testing each attribute at every tree
node. The information gain approach is generally used to determine a suitable
splitting attribute for each node of the generated decision tree: the attribute
with the highest information gain (i.e. the maximum reduction in entropy) is
selected as the test attribute of the current node. In this way, the information
needed to classify the training sample subsets obtained by further partitioning
is minimized. That is to say, using this attribute to partition the sample set
of the current node reduces the mixture of classes in the generated sample
subsets to a minimum, so this information-theoretic approach effectively
reduces the number of splits required to classify the objects. ID3 uses only
categorical attributes to build a tree model, and it does not produce accurate
results when noise is present in the data set; to obtain more accurate results,
effective pre-processing is carried out before the model is built with ID3.
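The entropy and information-gain computation at the heart of ID3 can be sketched as follows (the toy records and attribute names are illustrative, not from the thesis data set):

```python
from collections import Counter
from math import log2

def entropy(labels):
    # H(S) = -sum over classes of p * log2(p)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, class_attr):
    labels = [r[class_attr] for r in rows]
    # Expected entropy that remains after partitioning on `attr`.
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[class_attr] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

rows = [
    {"outlook": "sunny",    "windy": "false", "play": "no"},
    {"outlook": "sunny",    "windy": "true",  "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "windy": "true",  "play": "no"},
]
# ID3 picks the attribute with the highest information gain at this node.
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, a, "play"))
print(best)  # outlook
```

Here splitting on outlook yields a gain of about 0.571 against about 0.420 for windy, so ID3 would test outlook first.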
J48, the Weka implementation of C4.5, is an advanced version of ID3; it
decides the target value of a new test sample with respect to the attribute
values of the training data [3]. The internal nodes of the decision tree denote
the different attributes, the branches give the possible values of these
attributes, and the terminal nodes give the final values of the dependent
variable.
CART, short for Classification And Regression Tree, was initially proposed by
Breiman et al. [7]; it builds a binary tree and is also known as Hierarchical
Optimal Discriminate Analysis (HODA). It is a non-parametric decision tree
method that produces a regression or classification tree depending on whether
the dependent variable is numeric or categorical, respectively. Here binary
means that each node in the decision tree has two outward branches, i.e. two
groups. The Gini index is used in CART as the feature selection measure: the
split yielding the largest reduction in the Gini index is used to partition the
records. CART handles categorical and numerical values as well as missing
attribute values. It uses cost-complexity pruning and can also generate
regression trees. It can be implemented serially following Hunt's algorithm,
and it performs regression analysis using regression trees (S. Anupama et al.,
2011): over a given period of time and a set of predictor variables, the
regression analysis feature forecasts a dependent variable. It gives high
classification and prediction accuracy.
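The Gini impurity measure used by CART, and the reduction in impurity that a candidate binary split achieves, can be sketched as follows (toy labels, our own illustration):

```python
from collections import Counter

def gini(labels):
    # Gini diversity index: 1 - sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_reduction(parent, left, right):
    # CART prefers the binary split with the largest drop in impurity.
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

parent = ["yes", "yes", "no", "no"]
print(gini(parent))                                           # 0.5
print(gini_reduction(parent, ["yes", "yes"], ["no", "no"]))   # perfect split: 0.5
print(gini_reduction(parent, ["yes", "yes", "no"], ["no"]))   # weaker split
```

The split that separates the classes perfectly removes all the impurity, so CART would choose it over the weaker candidate.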
C5.0 is an extension of C4.5, which was itself derived from ID3. It is applied
to big data sets and is much faster and more memory efficient than C4.5. It
splits the samples based on the maximum information gain; each sample subset
obtained from a split is then split further, continuing until the subsets can
no longer be split. Attributes/features that contribute little are rejected.
One major advantage of C5.0 is that it handles multi-valued attributes and
missing attribute values in the data set [8].
SLIQ, introduced by Mehta et al. (1996) [9], stands for Supervised Learning
In Quest. It can be implemented on serial and parallel systems for fast,
scalable decision tree induction, and it is not based on Hunt's algorithm.
It uses a breadth-first greedy strategy to partition the training data set
recursively during the tree-building phase. SLIQ handles both numeric and
categorical attributes, but its memory-resident class-list data structure is
its disadvantage. It uses the Minimum Description Length (MDL) principle for
tree pruning.
SPRINT is a decision tree induction algorithm standing for Scalable
Parallelizable Induction of decision Trees; it was introduced by Shafer et
al. (1996). It is a fast and scalable classifier that partitions the training
data recursively using a breadth-first greedy approach until no further split
is possible, and it can be implemented both serially and in parallel. It uses
attribute-list and histogram data structures that are not memory resident,
which makes SPRINT suitable for large data sets and removes all memory
restrictions on the data. It handles both continuous and categorical
attributes (Sunita et al., 2011).
RANDOM FOREST [4] is an ensemble learning method for classification,
regression and other tasks that constructs decision trees at training time
and predicts the class at output time. It overcomes the over-fitting problem
of decision trees by averaging multiple deep decision trees trained on
different parts of the data, with the goal of reducing the variance. This
comes at the expense of a small increase in bias and some loss of
interpretability, but generally greatly boosts the performance of the final
model.
The Random Forest algorithm was developed by Leo Breiman. It is a meta-
learner made of many individual trees, designed to operate quickly over large
data sets and, more importantly, to be diverse, using random samples to build
each tree in the forest.
Construction of a tree:
1. About 2/3 of the data is used to train the model, sampled with bootstrap
replacement.
2. Select the attribute with the most information gain from a random subset
of the attributes.
3. Continue to construct the tree until no more nodes can be created.
4. Compute the error and measure the correctness of the tree.
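Steps 1 and 2, the two sources of randomness, can be sketched as follows (the records and attribute names are placeholders of our own; growing the actual tree is omitted):

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
records = list(range(12))                  # stand-in for 12 training records
attributes = ["a1", "a2", "a3", "a4"]      # hypothetical attribute names

def bootstrap(rows):
    # Step 1: sample with replacement; roughly 2/3 of the distinct rows
    # appear, and the left-out ("out-of-bag") rows can estimate the error.
    sample = [random.choice(rows) for _ in rows]
    out_of_bag = [r for r in rows if r not in sample]
    return sample, out_of_bag

def candidate_attributes(attrs, m):
    # Step 2 draws the splitting attribute from a random subset of size m;
    # m is the knob that trades tree correlation against tree strength.
    return random.sample(attrs, m)

sample, oob = bootstrap(records)
print(len(sample), sorted(set(sample)), oob)
print(candidate_attributes(attributes, m=2))
```

Repeating this pair of steps once per tree yields a forest of diverse, decorrelated learners.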
At each node of the tree, diversity is acquired through the random selection
of attributes, from which the attribute with the highest information gain is
selected. The performance of the random forest algorithm is linked to the
level of correlation between any two trees in the forest: the overall
performance of the forest decreases as the correlation increases. The way to
vary the level of correlation between trees is to adjust the number of random
attributes to be selected when creating a split in each tree. Increasing this
variable (m) increases both the correlation between trees and the strength of
each tree; at some point the tree correlation and tree strength complement
each other, providing the highest performance. In addition, increasing the
number of trees provides a more intelligent learner, just as a large, diverse
group makes more intelligent decisions [10] [11].
RANDOM TREE [5] operates on a collection of tree predictors; such a collection
is called a forest. It handles both classification and regression problems.
In classification, the random tree classifier takes the input feature vector
and classifies it with every tree in the forest; the class label receiving the
majority of votes is taken as the output. In regression, the average of the
responses over all the trees in the forest is taken as the actual response.
All the trees are trained on different training sets but with the same
parameters.
3.3 Second Phase Study
Huge data sets are generated day by day, at an exponential rate, by the
widespread use of application software developed for numerous services, for
example stock markets, banking, supermarkets, education and mobile devices.
For analysis and visualization these data need to be processed with a
Distributed Data Mining (DDM) approach. Distributed data mining can be
implemented with any of the four approaches below [22].
• Central approach: bring all the site data sets to a single site, then apply
data mining to the entire combined data set. This causes two problems:
first, a huge amount of communication overhead and hence an
Algorithm | Measure                         | Procedure                                | Pruning
ID3       | Entropy, info gain              | Top-down decision tree construction      | Pre-pruning
C4.5      | Entropy, split info, gain ratio | Top-down decision tree construction      | Pre-pruning
C5.0      | Entropy, split info, gain ratio | Top-down decision tree construction      | Post-pruning
CART      | Gini diversity index            | Constructs binary decision tree          | Post-pruning based on cost-complexity measure
SLIQ      | Gini index                      | Breadth-first decision tree construction | Post-pruning based on MDL principle
SPRINT    | Gini index                      | Breadth-first decision tree construction | Post-pruning based on MDL principle

Table 3.1: Performance-based comparisons of different decision tree algorithms
Feature           | ID3                | C4.5                    | C5.0                                             | CART
Types of data     | Categorical        | Continuous, categorical | Continuous, dates, times, timestamps, categorical | Nominal, continuous
Processing speed  | Slow               | Better than ID3         | Fastest                                          | Average
Tree pruning      | No pruning         | Early pruning           | Late pruning                                     | Early pruning
Boosting          | Not allowed        | Not allowed             | Allowed                                          | Allowed
Missing values    | Not supported      | Not supported           | Supported                                        | Supported
Splitting measure | Entropy, info gain | Gain ratio, split info  | Gain ratio, split info                           | Gini diversity index

Table 3.2: Comparisons between different Decision Tree Algorithms
Algorithm: ID3
Merits:
• Builds the fastest and shortest trees.
• Prediction rules are created from the training data.
• Reduces the number of tests by pruning.
Demerits:
• Cannot handle numeric attributes or missing values.
• Over-fitting or over-classification is possible on small test samples.
• Only a single attribute at a time is tested to make a decision.

Algorithm: C4.5
Merits:
• Supports continuous data.
• Avoids over-fitting of data.
• Improved computational efficiency.
• Supports missing data values during training.
Demerits:
• Requires the target attribute to have only discrete values.

Algorithm: J48
Merits:
• Handles numeric and nominal values.
• Able to handle missing values.

Algorithm: CART
Merits:
• Non-parametric.
• No advance selection of variables.
• Can handle outliers.
Demerits:
• An unstable decision tree may be produced.
• One-variable splitting.

Table 3.3: Comparison of Merits and Demerits of Decision Tree Algorithms
increase in the communication cost of bringing the entire data to a single
site, and second, the problem of data privacy preservation.
• Merge approach: generate a local data model at each site. All these models
are sent to a single site and merged into a single global model. This
mechanism is carried out in the works of [23] [24] [25]. As the number of
sites increases, however, this approach does not scale well.
• Sample approach: at each site a small candidate data set is sampled, and
the samples are combined to form one global candidate data set on which the
data mining is then performed.
• Intermediate Message Passing approach: in the above three approaches a
single site conducts the data mining of the distributed data, while in this
approach a P2P network is involved in which the different sites communicate
among themselves without a central/single server [26] [27].
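Under the merge approach, only models travel to the coordinating site, never the raw records. A minimal sketch (the rule representation and names here are our own simplification, not the thesis algorithm):

```python
def learn_local_model(site_data):
    # Stand-in for local decision tree induction: summarize the site's data
    # as a set of (attribute value -> class) rules.
    return {(row["outlook"], row["play"]) for row in site_data}

site_a = [{"outlook": "sunny", "play": "no"},
          {"outlook": "rainy", "play": "yes"}]
site_b = [{"outlook": "overcast", "play": "yes"},
          {"outlook": "sunny", "play": "no"}]

# Each site sends only its (small) model; the single merging site unions
# them into a global model, so the communication cost scales with the model
# size rather than with the size of the raw distributed data sets.
global_model = learn_local_model(site_a) | learn_local_model(site_b)
print(sorted(global_model))
```

The raw rows of `site_a` and `site_b` never leave their sites, which is also what makes the approach compatible with data privacy preservation.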
Over time, many decision tree algorithms have been developed by researchers,
with increasing performance and improved handling of different types of data.
A few of these algorithms are discussed below.
Yael Ben-Haim and Elad Tom-Tov [12] proposed the Streaming Parallel Decision
Tree (SPDT) algorithm, which executes in a distributed environment and is
especially designed for classifying large data sets and streaming data. The
algorithm has empirically proved to be as accurate as a standard decision
tree classifier while scaling to process streaming data on multiple
processors. The essence of the algorithm is to quickly construct histograms
at the processors, which compress the data into a fixed amount of memory; a
master processor uses this information to find near-optimal split points for
the terminal tree nodes. The analysis shows that guarantees on the local
accuracy of split points imply guarantees on the overall tree accuracy. In
this algorithm, both training and testing are executed in a distributed
environment using only one pass over the data.
Bagging [28] and boosting [29] are meta-classification algorithms built on
either partitions or samples of the training data; they first produce weak
classifiers, which are later combined using a next-level algorithm.
There are many algorithms suited to distributed environments. Stolfo et al.
[30] learn a weak classifier on each partition of the sample data set and
later bring these classifiers to a single site, which is somewhat less
expensive than sending the entire data set to a remote site; a meta-
classifier then works on these weak classifiers centrally to form a single
global classifier. Bar-Or et al. [31] suggested executing ID3 on a
hierarchical network, exchanging only the statistics of the contributing
attributes at every node of the tree and at each level, which guarantees that
the selected attribute has the highest gain.
In a distributed environment the data are fragmented horizontally, vertically
or in a hybrid way, so decision tree induction algorithms are needed for such
scenarios. Caragea et al. [32] introduced such an algorithm for distributed
data, focusing mainly on evaluating the splitting criteria in a distributed
fashion. This reduces the communication cost by cutting down the overhead;
moreover, the trees induced in the distributed and centralized scenarios are
identical. The system is also available as a component of the INDUS system.
A different approach was taken by Giannella et al. [33] and Olsen [34] for
inducing decision trees over vertically partitioned data. They used the Gini
information gain as the impurity measure and showed that the Gini index
between two attributes can be formulated as a dot product between two binary
vectors. To reduce the communication cost, the authors evaluated the dot
product after projecting the vectors into a random smaller subspace: instead
of sending either the raw data or the large binary vectors, the distributed
sites communicate only these projected low-dimensional vectors. The paper
shows that, using only 20% of the communication cost necessary to centralize
the data, they can build trees that are at least 80% as accurate as the trees
produced by centralization.
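The random-projection idea can be sketched as follows: both sites derive the same random ±1 projection from a shared seed and exchange only the k-dimensional projected vectors, from which the dot product (and hence the Gini computation that depends on it) is estimated. The dimensions and the projection scheme below are our own illustrative assumptions, not the exact construction of [33]:

```python
import random

random.seed(1)
d, k = 1000, 400  # original vs. projected dimensionality

# Each site holds one long binary vector (one entry per record).
x = [random.randint(0, 1) for _ in range(d)]
y = [random.randint(0, 1) for _ in range(d)]

# Shared +/-1 projection rows, derived from a common seed so that the
# projection matrix itself never has to be transmitted between the sites.
R = [[random.choice((-1, 1)) for _ in range(d)] for _ in range(k)]

def project(v):
    return [sum(r_i * v_i for r_i, v_i in zip(row, v)) for row in R]

exact = sum(a * b for a, b in zip(x, y))                         # d values needed
approx = sum(p * q for p, q in zip(project(x), project(y))) / k  # k values needed
print(exact, round(approx))
```

Because each ±1 row r satisfies E[(r·x)(r·y)] = x·y, averaging over the k rows gives an unbiased estimate of the dot product while exchanging only k numbers instead of d.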
Innovation                       | Motivation/Problem                                      | Training data
Bowyer, Chawla, Hall [18]        | Training of a model with a large data set               | Pima Indians Diabetes, Iris
Long, Bursteinas [20]            | Distributed data mining on distant machines             | UCI repository
Zabala, Langner, Andrzejak [16]  | Training on distributed data exceeding RAM size         | UCI repository
Soares, Moreira, Strecht [17]    | University-level knowledge generation from course models | Academic data from the University of Porto

Table 3.4: Merge Models with combination of rules: Examples
3.4 Third Phase Study
A more common approach is the combination of rules derived from decision
trees. Rule merging combines the rules of two tree models into the rules of
a single decision tree; in this way the number of rules is reduced while the
final merged tree model grows. Williams [13] presented the fundamentals in
his doctoral thesis, and many researchers have since contributed to the
intermediate steps of the process.
Provost and Hennessy [14, 15] present an approach to learning and combining
rules on disjoint subsets of the full training data. A rule-based learning
algorithm is used to generate rules on each subset of the training data. The
merged model is constructed from satisfactory rules, i.e., rules that are
generic enough to be evaluated in the other models. All rules considered
satisfactory on the full data set are retained, as they constitute a superset
of the rules generated when learning is done on the full training set. This
approach has not been replicated by other researchers. Table 3.4 lists these
research examples along with the problems addressed and the data sets used.
Hall, Chawla and Bowyer [18, 19] present as their rationale that it is not
possible to train decision trees on very large data sets because doing so
could overwhelm the computer system's memory and make the learning process
very slow. Although a tangible problem in 1998, this argument still makes
sense nowadays, as the notion of very large data sets has turned into the
big data paradigm. The approach involves breaking down a large data set into
n disjoint partitions and then, in parallel, training a decision tree on
each. Each model, in this perspective, is considered an independent learner;
globally, the models can be viewed as agents, each learning a little about a
domain, with the knowledge of all agents to be combined into one knowledge
base. Simple experiments to test the feasibility of this approach were done
on two data sets, Iris and Pima Indians Diabetes: in both cases the data sets
were split across two processors and the resulting models merged.
Bursteinas and Long [20] aim to develop a technique for mining data
distributed on remote machines connected with limited bandwidth, arguing that
there is a lack of algorithms and systems which can perform data mining under
such conditions. The merging procedure is divided into two scenarios: one for
disjoint partitions and one for overlapping partitions. To evaluate the
quality of the method, several experiments were performed; the results showed
the equivalence of the combined classifiers with the classifier induced on a
monolithic data set. The main advantage of the proposed method is its ability
to induce globally applicable classifiers from distributed data without
costly data transportation; it can also be applied to parallelize the mining
of large-scale monolithic data sets. The experiments merged two models on
data sets taken from the UCI Machine Learning Repository [21].
Andrzejak, Langner and Zabala [16] propose a method for learning in parallel
or from distributed data. They focus on the large data sets generated in the
mobile environments of distributed scenarios, where the data set size exceeds
the RAM size. They also evaluated the interpretable models obtained from the
various models generated at numerous sites, identifying the impact and
importance of the individual variables. In a distributed environment, even if
the individual model of one site is interpretable, the overall/global model
may not be. To overcome this problem, the authors proposed a new approach
that merges the decision trees into a single, globally interpretable tree.
The proposed approach also overcomes the problems of connection bandwidth and
RAM size, and it gives good accuracy, as evaluated in experiments on UCI
repository data sets [21].
The research of Soares, Moreira and Strecht [17] on educational data mining
starts from the premise that predicting the failure of students in university
courses can provide useful information for course and programme managers as
well as explain the drop-out phenomenon. The rationale is that while it is
important to have models at course level, their number makes it hard to
extract knowledge that is useful at university level; therefore, to support
decision making at this level, it is important to generalize the knowledge
contained in those models. An approach is presented to group and merge
interpretable models in order to replace them with more general ones without
compromising the quality of the predictive performance. Data from the
University of Porto, Portugal, is used as the case study for evaluation. The
aggregation method consists mainly of intersecting the decision rules of
pairs of models of a group recursively, i.e., by adding models along the
merging process to previously merged ones. The results obtained are
promising, although they suggest alternative approaches to the problem. The
decision trees were trained using the C5.0 algorithm, and F1 was used as the
evaluation function of the individual and merged models.
3.5 Challenges with DT merging
Decision tree learning on massive data sets is a common data mining task in
distributed environments, yet many of the state-of-the-art tree learning
algorithms discussed above require the training data to reside in memory on
a single machine. While more scalable implementations of tree learning have
been proposed, they typically require specialized parallel computing
architectures. Moreover, all these approaches are static in nature, not
domain-free, not scalable, and do not preserve accuracy.
Merging decision trees is the real challenge for researchers. Different
researchers have proposed different merging policies, but preserving the
accuracy remains the big issue, because the global decision rules change for
several reasons: 1) What if two otherwise identical rules, differing in a
single feature, have the same class? 2) What if one rule partially overlaps
another rule? 3) What if one rule fully overlaps another rule? 4) What if two
rules constrain the same continuous-valued feature with different
constraints?
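Questions 2-4 all reduce to comparing the regions that two rules cover. For rules over a continuous feature this can be sketched with interval arithmetic (the representation and thresholds below are our own illustration, not the thesis' merging policy):

```python
def interval_intersection(a, b):
    # Intervals are (low, high) constraints on one continuous feature.
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None

def relation(interval_a, interval_b):
    inter = interval_intersection(interval_a, interval_b)
    if inter is None:
        return "disjoint"               # the rules never fire together
    if inter == interval_a or inter == interval_b:
        return "full overlap"           # one rule subsumes the other
    return "partial overlap"            # rules must be split before merging

# Two hypothetical rules on a continuous feature, e.g. 40 <= marks < 80.
r1, r2, r3 = (40.0, 80.0), (60.0, 90.0), (50.0, 70.0)
print(relation(r1, r2))  # partial overlap
print(relation(r1, r3))  # full overlap: r3 lies entirely inside r1
print(relation(r3, (80.0, 95.0)))  # disjoint
```

A merging policy then decides, case by case, whether to keep the subsuming rule, split the partially overlapping ones, or leave disjoint rules untouched.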
Our literature review and experiments on merging decision trees show that
training time, communication overhead and accuracy are the major challenges.
To reduce the training time, the proposed algorithm processes only the new
data set against the already trained model, which makes it scalable and
dynamic; to reduce the communication overhead, the local models are converted
into XML files; and to preserve the accuracy, the proposed algorithm
incorporates rule merging policies. The proposed model is discussed in detail
in Chapter 4.
3.6 Summary
In the first phase of this literature survey, different decision-tree-based
classification techniques such as CART, J48, ID3, C5.0, SPRINT, Random Forest
and SLIQ were studied and compared precisely with respect to the types of
data they support, their speed, their pruning methods, and whether they
support missing values. In the second phase, different decision tree
algorithms for distributed environments were studied and their pros and cons
discussed. In the third phase, the approaches proposed by numerous
researchers for merging decision trees in a distributed environment to form
a global decision tree were discussed. The last section discussed the
challenges observed in merging different decision trees into a global
decision tree. These are the motivations for our research: to generate the
global decision tree in a distributed environment without losing the
prediction quality of the model.
3.7 References
1. Elder J. F. and King M. A., Evaluation of Fourteen Desktop Data Mining
Tools, in Proceedings of the IEEE International Conference on Systems,
Man and Cybernetics, 1998.
2. Juhua Chen, Wei Peng and Haiping Zhou, An Implementation of ID3: Decision
Tree Learning Algorithm, Project of Comp 9417: Machine Learning, University
of New South Wales, School of Computer Science & Engineering, Sydney, NSW
2032.
3. C4.5 algorithm, Wikipedia, The Free Encyclopedia. Wikimedia Foundation,
28-Jan-2015.
4. Breiman L., Random Forests, Mach. Learn., vol. 45, no. 1, pp. 5-32, 2001.
5. Random tree, Wikipedia, The Free Encyclopedia. Wikimedia Foundation,
13-Jul-2014.
6. Leo Breiman, Jerome H. Friedman, Richard A. Olshen and Charles J. Stone,
Classification and Regression Trees. Wadsworth International Group, Belmont,
California, 1984.
7. Gordon V. Kass, An Exploratory Technique for Investigating Large
Quantities of Categorical Data, Applied Statistics, vol. 29, no. 2,
pp. 119-127, 1980.
8. Wu Shangzhuo, Wang Jian, Yan Hongcan and Zhu Xiaoliang, Research and
Application of the Improved Algorithm C4.5 on Decision Tree, 2009.
9. Manish Mehta, Rakesh Agrawal and Jorma Rissanen, SLIQ: A Fast Scalable
Classifier for Data Mining, IBM Almaden Research Center, CA 95120.
10. Suban Ravichandran, Vijay Bhanu Srinivasan and Chandrasekaran Ramasamy,
Comparative Study on Decision Tree Techniques for Mobile Call Detail Record,
Journal of Communication and Computer 9, pp. 1331-1335, 2012.
11. N. Peter, Enhancing Random Forest Implementation in Weka, Machine
Learning Conference, 2005.
12. Yael Ben-Haim, Elad Tom-Tov, A Streaming Parallel Decision Tree
Algorithm, Journal of Machine Learning Research, vol. 11, pp. 849-872, 2010.
13. G. J. Williams, Inducing and Combining Multiple Decision Trees. PhD
thesis, Australian National University, 1990.
14. F. J. Provost and D. N. Hennessy, Distributed machine learning: scaling
up with coarse-grained parallelism, in Proceedings of the 2nd International
Conference on Intelligent Systems for Molecular Biology, vol. 2, pp. 340-7,
Jan. 1994.
15. F. Provost and D. Hennessy, Scaling up: Distributed machine learning
with cooperation, in Proceedings of the 13th National Conference on
Artificial Intelligence, pp. 74-79, 1996.
16. A. Andrzejak, F. Langner, and S. Zabala, Interpretable models from
distributed data via merging of decision trees, 2013 IEEE Symposium on
Computational Intelligence and Data Mining (CIDM), Apr. 2013.
17. P. Strecht, J. Mendes-Moreira, and C. Soares, Merging Decision Trees:
a case study in predicting student performance, in Proceedings of the 10th
International Conference on Advanced Data Mining and Applications,
pp. 535-548, 2014.
18. L. Hall, N. Chawla, and K. Bowyer, Combining decision trees learned in
parallel, Working Notes of the KDD-97 Workshop on Distributed Data Mining,
pp. 10-15, 1998.
19. L. Hall, N. Chawla, and K. Bowyer, Decision tree learning on very large
data sets, IEEE International Conference on Systems, Man, and Cybernetics,
vol. 3, pp. 2579-2584, 1998.
20. B. Bursteinas and J. Long, Merging distributed classifiers, in 5th World
Multiconference on Systemics, Cybernetics and Informatics, 2001.
21. S. Datta, C. Giannella, and H. Kargupta, K-Means Clustering over Peer-
to-Peer Networks, 8th Int. Workshop on High Performance and Distributed
Mining (HPDM), 2005.
22. Baik, S. and Bala, J., A Decision Tree Algorithm for Distributed Data
Mining, 2004.
23. http://www.cs.waikato.ac.nz/ml/weka/
24. Khaled M. Hammouda and Mohamed S. Kamel, Hierarchically Distributed
Peer-to-Peer Document Clustering and Cluster Summarization, IEEE
Transactions on Knowledge and Data Engineering, vol. 21(5), pp. 681-698,
2009.
25. N. F. Samatova, G. Ostrouchov, A. Geist, and A. V. Melechko, RACHET: An
Efficient Cover-Based Merging of Clustering Hierarchies from Distributed
Datasets, Distributed and Parallel Databases, vol. 11(2), pp. 157-180, 2002.
26. S. Merugu and J. Ghosh, Privacy-Preserving Distributed Clustering Using
Generative Models, 3rd IEEE Intl Conf. Data Mining (ICDM 03), pp. 211-218,
2003.
27. J. da Silva, C. Giannella, R. Bhargava, H. Kargupta, and M. Klusch,
Distributed Data Mining and Agents, Eng. Applications of Artificial
Intelligence, vol. 18(7), pp. 791-807, 2005.
28. L. Breiman, Bagging Predictors, Machine Learning, vol. 24, pp. 123-140,
1996.
29. J. Friedman, T. Hastie, and R. Tibshirani, Additive Logistic Regression:
A Statistical View of Boosting, Dept. of Statistics, Stanford University,
Tech. Rep., 1998.
30. S. J. Stolfo, A. L. Prodromidis, S. Tselepis, W. Lee, D. W. Fan, and
P. K. Chan, JAM: Java Agents for Meta-Learning over Distributed Databases,
in Proceedings of SIGKDD 97, pp. 74-81, 1997.
31. A. Bar-Or, D. Keren, A. Schuster, and R. Wolff, Hierarchical Decision
Tree Induction in Distributed Genomic Databases, IEEE Transactions on
Knowledge and Data Engineering, Special Issue on Mining Biological Data,
vol. 17, no. 8, pp. 1138-1151, August 2005.
32. D. Caragea, A. Silvescu, and V. Honavar, A Framework for Learning from
Distributed Data Using Sufficient Statistics and Its Application to Learning
Decision Trees, International Journal of Hybrid Intelligent Systems, vol. 1,
no. 1-2, pp. 80-89, 2004.
33. C. Giannella, K. Liu, T. Olsen, and H. Kargupta, Communication Efficient
Construction of Decision Trees Over Heterogeneously Distributed Data, in
Proceedings of ICDM 04, Brighton, UK, pp. 67-74, 2004.
34. T. Olsen, Distributed Decision Tree Learning From Multiple Heterogeneous
Data Sources, Master's thesis, University of Maryland, Baltimore County,
Baltimore, Maryland, October 2006.
Chapter 4
Proposed Approach
4.1 Introduction
Many researchers have contributed classification and prediction methods for
distributed environments; these methods support machine learning, statistics
and pattern recognition in a distributed setting. Researchers have also
proposed different approaches for merging the local decision trees. A deep
literature review identified the facts that most algorithms are memory
resident, typically assume a small data size, are not domain-free, are static
in nature, and are less efficient in terms of processing and communication
overhead. Given the large volume of data and the accompanying privacy
concerns, an efficient technique is needed that supports scalable and dynamic
classification and prediction, handles very large data sets spread across
different sites, and generates the global decision tree without losing the
prediction quality.
4.2 Problem Statement
The research problem is to design an efficient technique for merging decision
trees that supports scalable and dynamic classification in a distributed
environment with large volumes of data and generates the global decision tree
without losing the prediction quality.
4.3 Objective and Scope of Research
OBJECTIVE
A series of challenges has recently emerged in the data mining field for
distributed environments, triggered by the rapid shift in status from
academic to applied science and the resulting needs of real-life
applications. The proposed work is concerned with a dynamic and scalable
approach for merging decision tree models for large volumes of data in a
distributed environment. The main objectives of the thesis are listed below.
1. To reduce the model (i.e. decision tree) training time and communication
time in a distributed environment for large volumes of data.
2. To introduce an efficient, scalable and dynamic approach for newly
generated data sets and already trained models.
3. To prepare the rule merging policies used to generate the global model.
4. To generate a globally interpretable model while preserving the
prediction quality.
SCOPE
In this research the following points have been considered as the scope.
1. To work with homogeneous and horizontally fragmented data sets.
2. To collect and pre-process a real data set: the student admission data set from the official web portal of the Parul group of institutes.
3. The work has been carried out on educational data and mainly focuses on student admission prediction.
4. A parser has been proposed to convert the decision tree into decision rules.
5. The final outcome of this research is an optimized global decision tree without loss of prediction quality.
6. The simulation of the work has been carried out on 2, 5 and 10 sites in the network.
4.4 Original Contribution by thesis
In this research work, qualitative and exploratory approaches have been used, following the standard research methodology steps. First, during the literature review we referred to various research papers, patents and other articles on dynamic and scalable data mining for distributed environments, classification techniques and algorithms, merging of decision trees, and educational data mining. In addition, we installed the Weka tool [1], an open-source workbench from the University of Waikato, New Zealand, and studied various supervised data mining algorithms for classification. During this initial phase of the literature review, we found that researchers had worked on classification algorithms, but very few of them had worked on decision trees in a distributed environment. The major reasons for this gap are that most of the algorithms are memory resident, static in nature, and neither scalable nor domain-free, and that very little work has been done for dynamic and scalable distributed environments.
During the literature review, we also found research works on merging decision trees generated at different geographical locations to form a global decision tree without losing predictive quality. Therefore, our second phase of the literature review mainly focused on global model generation from different local decision trees. By studying and comparing various approaches, we found that merging decision rules is a challenging task. We also found that no researcher had worked on a scalable and dynamic approach in a distributed environment. As a result of both phases of the literature review, we proposed a model (with framework, system architectures and algorithms) with the objectives 1) to reduce network overhead, 2) to support a scalable and dynamic distributed environment so that the whole dataset need not be processed every time, and 3) to ensure the global model does not lose predictive quality. The proposed framework, the system architecture at the local site, and the decision tree merging architecture at the coordinator site are shown in figures 4.1, 4.2 and 4.3 respectively. The details of each are given in the following sub-sections.
To fulfil the objectives, the proposed model has been implemented in two phases. In the first phase, 1) a decision tree is generated at each local site, 2) a decision table is formed at each local site, and 3) each local decision table is converted into an XML file for transmission over the internet. In the second phase, 1) the XML files are converted back into decision
tables, 2) all decision tables are merged, and 3) the resultant decision table is converted into an XML file and sent to all local sites for prediction.
The proposed model has been implemented on an educational data set. We have used the real data set of the student admission process in different disciplines, collected from the Parul university web portal (PUWP). In the experiments, the data set is processed on 2, 5 and 10 different sites to generate the local decision tree models, which are later merged into a single decision tree without losing the prediction quality.
4.5 Proposed Architecture
As shown in figure 4.1, the data set D as a whole is considered to be partitioned across different sites Si, where i = 1, 2, 3, ..., d. Each site Si processes its locally available dataset Di to generate a decision tree using the J48 algorithm in the Weka tool.
J48, designed and implemented by Ross Quinlan as an extension of the ID3 algorithm, generates decision trees. It is known as a statistical classifier because it generates decision trees for classification. Like ID3, it uses information entropy to build decision trees from the training data. The training data contains a set of already classified samples S = S1, S2, ..., Sn. Each sample Si is a vector (x1, x2, ..., xm), where x1, x2, ..., xm represent the features or attributes of the sample. C = C1, C2, ... represents the classes to which the samples belong, and the training data is augmented with this class vector. At each node of the J48 tree, the attribute that most effectively splits the set of samples is selected. The attribute is chosen using the normalized information gain (i.e. the difference in entropy): the attribute with the highest normalized information gain makes the splitting decision. This process continues recursively and forms the decision tree.
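The split selection described above can be illustrated with a minimal Python sketch. This is not Weka's J48 implementation; the data, attribute names and helper functions below are hypothetical toy values chosen only to show how the normalized information gain (gain ratio) picks a splitting attribute.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, attr, labels):
    """Normalized information gain (gain ratio) of splitting `rows` on `attr`."""
    n = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    info_attr = sum(len(g) / n * entropy(g) for g in groups.values())  # entropy after the split
    gain = entropy(labels) - info_attr                                 # information gain
    split_info = entropy([row[attr] for row in rows])                  # penalty for many-valued splits
    return gain / split_info if split_info > 0 else 0.0

# Toy admission-style data: AdmType separates the classes perfectly,
# while Category carries no information, so the split is made on AdmType.
rows = [{"AdmType": "STATE", "Category": "OPEN"},
        {"AdmType": "STATE", "Category": "SEBC"},
        {"AdmType": "MANAGEMENT", "Category": "OPEN"},
        {"AdmType": "MANAGEMENT", "Category": "SEBC"}]
labels = ["CSE", "CSE", "CIVIL", "CIVIL"]

print(gain_ratio(rows, "AdmType", labels))   # 1.0
print(gain_ratio(rows, "Category", labels))  # 0.0
```

The attribute with the highest gain ratio (here AdmType) would be chosen for the node, after which the process recurses on each partition.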
As the decision trees generated at each site occupy more memory, each is converted into a decision table and then into an XML file for transmission over the network, so that very little network overhead is incurred. Each XML file is
Figure 4.1: Proposed Framework
later made available at the coordinator site, where the actual decision tree merging process takes place.
Figures 4.2 and 4.3 show flow charts of the complete experimental algorithm at the local site and the coordinator site, respectively.
4.6 Proposed Algorithm
Input
• Datasets Di and Dtsi, the sets of training tuples and their associated class labels (Dtsi is the new data set instance at time stamp ts for site Si, where i = 1, 2, 3, ..., N); flag = 0 indicates the data set has not been processed before.
Output
• Global Decision Tree (GT)
4.6.1 Algorithm steps at local site
Step-1: If flag == 0 then perform steps 2 to 5; otherwise perform step 6.
Step-2: Apply the J48 algorithm on the data set Di of each site Si to generate the local decision tree DTi.
Step-3: Convert the decision tree DTi into the decision table Dtablei at each site Si; set flag = 1.
Step-4: Sort the decision rules in Dtablei in descending order of class label majority.
Step-5: Perform steps 12 to 13.
Step-6: For each Dtsi, perform steps 7 to 13.
Step-7: Classify the tuples of Dtsi against Dtablei and update Dtablei accordingly; set Di = Dtsi U Di.
Step-8: If a tuple tij does not follow any decision table rule, then create a new rule for it.
Step-9: If a tuple tij follows most of (all except one condition of) a decision table rule, then increment the count of that rule (i.e. correctly classified instances).
Step-10: If a tuple tij conflicts with any of the rules, then do not consider that tij.
Step-11: If a tuple tij is partially or fully overlapped by another decision table rule, then update the decision table accordingly (i.e. rule modification).
Step-12: Create the XML file Xi of Dtablei for each site Si.
Step-13: Send the Xi file of each Si to the coordinator site for further processing.
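The incremental core of these steps (7 to 10) can be sketched as follows. This is a simplified illustration, not the thesis implementation: rules are represented as hypothetical dictionaries of exact attribute-value tests with a support count, whereas the real decision table also holds range conditions.

```python
def update_table(dtable, tup, label):
    """Sketch of Steps 7-10: update a local decision table with one new
    labelled tuple, without rebuilding the decision tree."""
    for rule in dtable:
        if all(tup.get(a) == v for a, v in rule["conds"].items()):
            if rule["label"] == label:
                rule["count"] += 1       # Step-9: tuple follows the rule, reinforce it
            # Step-10: a tuple conflicting with the rule's label is not considered
            return dtable
    # Step-8: no rule covers the tuple, so create a new rule from it
    dtable.append({"conds": dict(tup), "label": label, "count": 1})
    return dtable

def sort_rules(dtable):
    """Step-4: keep rules in descending order of support (class label majority)."""
    return sorted(dtable, key=lambda r: r["count"], reverse=True)

dtable = [{"conds": {"AdmType": "STATE"}, "label": "CSE", "count": 3}]
update_table(dtable, {"AdmType": "STATE"}, "CSE")    # reinforces the existing rule
update_table(dtable, {"AdmType": "TFWS"}, "EE")      # adds a new rule
print(sort_rules(dtable))
```

Processing only the new tuples against the existing table, rather than retraining on Di as a whole, is what makes the local step incremental.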
Algorithm Complexity Calculation
The implementation of J48 requires a scan through the entire training set for each node of the tree that splits on an attribute. Depending on the data source, the number of nodes in the tree can be O(n), where n is the number of training instances, making the time complexity of this part O(n²). The time complexity of generating a set of rules from a decision tree is O(km²), where k is the number of nodes on each branch and m is the number of branches in the tree. If the number of rules in the decision table is r, then sorting the rules in descending order takes O(r log r) time using quicksort. The classification of q newly added instances takes O(q) time in total. The total time required to generate the XML file from a decision table with r rules of c conditions each is O(cr). At the coordinator site, the overall time complexity of merging and intersecting two decision tables with m and n instances is O(m log m + n log n).
Figure 4.2: Local site Processing
For each site, the datasets are Di and Dtsi, the sets of training tuples and their associated class labels; Dtsi is the new data set instance at time stamp ts for site Si, where i = 1, 2, 3, ..., N. The algorithm is dynamic and scalable in nature; hence, in this research the incremental approach has been used for new data sets rather than processing the entire data set again, which reduces the computation cost. The new data set Dtsi generated at time stamp ts at each site i is processed tuple-wise, and on this basis the decision table is directly updated without regenerating the decision tree. All local decision tables are merged at the coordinator site, and the merged table is later converted into a decision tree. This decision tree is called the global decision tree, and it is globally interpretable.
As given in the algorithm and shown in the flowchart, at each site Si the data set Di is processed through the J48 algorithm, which generates the decision tree DTi. Each decision tree is converted into decision rules using the parser and stored in the decision table Dtablei. To reduce the network overhead and transmission cost, each decision table is converted into an XML file Xi at each site i. Once all the XML files have been received by the coordinator site, they are converted into the corresponding decision tables.
The flag variable is used to check whether the data set is new, so that it can be processed alone using the incremental approach with the already trained model. If flag = 0 then steps 2 to 5 and steps 12 and 13 are executed; otherwise steps 7 to 13 are executed. Once the new data set Dtsi has been processed, it is appended to the data set Di.
4.6.2 Algorithm steps at coordinator site
Step-1: Convert each Xi into Dtablei for its respective site Si.
Step-2: Merge the Dtablei into a single table T.
Step-3: Convert T into the global decision tree GT.
Step-3.1: Perform the intersection phase: find the common rules, i.e. regions.
Step-3.2: Perform the filter phase: remove the disjoint regions from the intersected merged model.
Step-3.3: Perform the reduction phase: join regions of the same class that differ in only one attribute.
Step-4: Send GT to each Si for local prediction.
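Steps 1 and 2 above can be sketched in Python as follows. This is a simplified illustration under the same hypothetical rule representation used earlier (a dictionary of attribute tests, a class label and a support count); identical rules arriving from different sites are combined by summing their counts.

```python
def merge_tables(tables):
    """Sketch of coordinator Steps 1-2: combine the per-site decision tables
    into a single table T, summing the support counts of identical rules
    (same conditions and same class label)."""
    merged = {}
    for table in tables:
        for rule in table:
            # frozenset makes the condition set usable as a dictionary key
            key = (frozenset(rule["conds"].items()), rule["label"])
            merged[key] = merged.get(key, 0) + rule["count"]
    return [{"conds": dict(conds), "label": label, "count": count}
            for (conds, label), count in merged.items()]

site1 = [{"conds": {"AdmType": "STATE"}, "label": "CSE", "count": 3}]
site2 = [{"conds": {"AdmType": "STATE"}, "label": "CSE", "count": 2},
         {"conds": {"AdmType": "TFWS"}, "label": "EE", "count": 1}]
T = merge_tables([site1, site2])
print(len(T))  # 2
```

The merged table T is then refined by the intersection, filter and reduction phases of Step 3 before being converted into the global tree GT.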
Figure 4.3: Coordinator site processing
As shown in the algorithm and the flow chart for the coordinator site, all XML files are converted into decision tables, which are merged to form the global decision table T. Several phases are then performed on T. First, the intersection phase finds the common rules. The second phase is the filter phase, in which the disjoint regions of the intersected merged model are removed; because some rules are discarded here, this may cause a reduction in model accuracy. Finally, the reduction phase joins regions of the same class that differ in only one attribute, which avoids ambiguity in deciding the class label.
Figure 4.4: Proposed system architecture for dynamic and scalable decision tree generation
4.7 System architecture at local site
Figure 4.4 shows the detailed proposed architecture for the dynamic and scalable decision tree generation process. The first step is the creation of the model Mi from the data set Di available at each site. The parser then converts the decision tree into the decision rule set Ri for each site Si. In the third phase, the decision rule set Ri is converted into the decision table Dtablei, which is later converted into an XML file.
Each local site Si sends its locally generated XML file Xi to the coordinator site for the decision tree merging process. In one of the intermediate steps, a newly added data set is appended to the previous decision table Dtablei of site Si directly, without generating a decision tree for the new data set. In this way the approach becomes scalable, i.e. the algorithm supports new data sets as well.
4.8 System architecture at coordinator site
The process of merging k decision trees F1, F2, F3, ..., Fk into a single one starts by creating, for each tree Fi, its decision table set Dtable(Fi). The decision tables Dtable(F1), Dtable(F2), ... are reduced into a final decision table Dtable_Final by the merging operation on the decision table set. Finally, Dtable_Final is turned
Figure 4.5: Decision table merging process to generate the global decision tree
into a decision tree. The merging operation is the core of the approach: it merges several decision table sets Dtable(F1), Dtable(F2), ... into a single one. It consists of several phases, namely intersection, filtering and reduction, as shown in figure 4.5 below.
As shown in figure 4.5, the decision tables Dtable(Fi) of all the sites Si with data sets Di, where i = 1, 2, 3, ..., d, are merged. First, the intersection phase is carried out, in which the common regions, i.e. rules, are found. In the second phase, the less useful disjoint regions are removed from the list; this process is known as filtering. In the third phase, reduction, the disjoint regions that can be combined with minor changes are merged to reduce the number of disjoint regions.
Intersection Phase: This task combines the regions of two decision models, using a specific method to extract the common components of both as presented in the decision table. The sets of (numerical) values of each region of each model are compared to discover common sets of values across each variable. The class to assign to the merged region is straightforward if the pair of regions has the same class; otherwise a class-conflict problem arises. Andrzejak, Langner and Zabala [2] propose three strategies to address this problem: a) assign the class with the greatest confidence, b) assign the class with the greater probability, or c) retrain the model with examples from the conflicting class regions. If no conflict arises, that class is assigned; otherwise the region is removed from the
merged model.
Filter Phase: This is the task of removing the disjoint regions from the intersected model; it is essentially a pruning operation in which the regions with the highest relative volume and number of training examples are retained. Strecht, Moreira and Soares [3] address the issue by removing the disjoint regions, and highlight the case where the models are not mergeable because all regions are disjoint.
Reduction Phase: This phase applies when a set of regions has the same class and all variables have equal values except one. To obtain a simpler merged model, the task is to find which regions can be joined into one. For nominal variables, the values of the differing variable are unioned across the regions; for numeric variables, the regions are joined if their intervals are contiguous.
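The intersection and reduction phases on numeric conditions can be sketched as interval operations. The following is a minimal Python illustration under the assumption that each region maps a variable name to a (low, high] interval; it is not the thesis's full implementation, which also handles nominal variables and class conflicts.

```python
INF = float("inf")

def intersect_regions(r1, r2):
    """Intersection phase (sketch): per-variable overlap of two regions.
    Returns None when the regions are disjoint on some variable."""
    out = {}
    for var in set(r1) | set(r2):
        lo1, hi1 = r1.get(var, (-INF, INF))
        lo2, hi2 = r2.get(var, (-INF, INF))
        lo, hi = max(lo1, lo2), min(hi1, hi2)
        if lo >= hi:
            return None          # empty overlap: the region pair is disjoint
        out[var] = (lo, hi)
    return out

def try_join(r1, r2):
    """Reduction phase (sketch): join two same-class regions that are equal
    on every variable except one whose intervals are contiguous."""
    diff = [v for v in set(r1) | set(r2) if r1.get(v) != r2.get(v)]
    if len(diff) != 1 or diff[0] not in r1 or diff[0] not in r2:
        return None
    v = diff[0]
    (lo1, hi1), (lo2, hi2) = r1[v], r2[v]
    if hi1 == lo2 or hi2 == lo1:                 # contiguous intervals
        return {**r1, v: (min(lo1, lo2), max(hi1, hi2))}
    return None

print(intersect_regions({"ACPCRank": (17215, INF)}, {"ACPCRank": (-INF, 46309)}))
print(try_join({"ACPCRank": (440, 10996)}, {"ACPCRank": (10996, 17215)}))
```

Disjoint pairs (where intersect_regions returns None) are the candidates removed by the filter phase.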
Decision rule merging policies
Rule-1: A continuous value should not differ by more than a threshold (which is adjustable). The threshold is decided in advance, and during the rule merging process the values of a continuous variable may not differ by more than this threshold.
Rule-2:
Rule-2A: If the attribute test uses >, then the smaller of the two rule values is used. For example, if ACPCRank>2399 and ACPCRank>1050 are present in the same rule, then the smaller, ACPCRank>1050, is preserved. Thus the rule AdmType=STATE AND Institute=PIET1 AND ACPCRank>35015 AND ACPCRank≤46309 AND ACPCRank>17215 is modified to AdmType=STATE AND Institute=PIET1 AND ACPCRank≤46309 AND ACPCRank>17215.
Rule-2B: If the attribute test uses ≤, then the larger of the two rule values is used. For example, if ACPCRank≤102 and ACPCRank≤1345 are present in the same rule, then the larger, ACPCRank≤1345, is preserved.
Rule-3: Partial Overlap
If the conditions of two rules partially overlap, adjust the boundaries of the rules. The two overlapping rules below are modified into a new rule.
EE class: Category=OPEN AND AdmType=MANAGEMENT AND Institute=PIET1 AND ACPCRank>17215
and
Civil class: ACPCRank≤100904 AND Category=OPEN AND AdmType=MANAGEMENT AND Institute=PIET1 AND ACPCRank>17215
partially overlap; hence the EE rule is modified to Category=OPEN AND AdmType=MANAGEMENT AND Institute=PIET1 AND ACPCRank>100904.
Rule-4: If one rule completely overlaps another rule, modify the overlapped rule according to the overlapping rule.
Rule-5: Conflict in Label
If two identical rules have different labels, then select the label as follows:
1. Use the label with the highest confidence.
2. Average the probability distributions and use the label with the highest probability.
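Rules 2A and 2B can be captured in a few lines of Python. This is an illustrative sketch only: conditions are represented as hypothetical (attribute, operator, value) triples, and the example values are taken from the Rule-2A discussion above.

```python
def merge_numeric_tests(conds):
    """Sketch of merging policies Rule-2A/2B: when a merged rule repeats a
    test on the same attribute, '>' keeps the smaller threshold (Rule-2A)
    and '<=' keeps the larger one (Rule-2B)."""
    out = {}
    for attr, op, value in conds:
        key = (attr, op)
        if key not in out:
            out[key] = value
        elif op == ">":
            out[key] = min(out[key], value)   # Rule-2A
        else:
            out[key] = max(out[key], value)   # Rule-2B
    return [(a, o, v) for (a, o), v in out.items()]

# The Rule-2A example: ACPCRank>35015 AND ACPCRank<=46309 AND ACPCRank>17215
# is reduced so that only ACPCRank>17215 AND ACPCRank<=46309 remain.
rule = [("ACPCRank", ">", 35015), ("ACPCRank", "<=", 46309), ("ACPCRank", ">", 17215)]
print(merge_numeric_tests(rule))  # [('ACPCRank', '>', 17215), ('ACPCRank', '<=', 46309)]
```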
4.9 Summary
In this chapter the problem statement, along with the objective and scope, has been introduced in sections 4.2 and 4.3 respectively. The original thesis contribution has been discussed in section 4.4, and the proposed architecture in section 4.5. In section 4.6 the proposed algorithms at both the local site and the coordinator site have been discussed. The system architectures at the local and coordinator sites have been discussed in the subsequent sections 4.7 and 4.8 respectively.
4.10 References
1. Weka: Data Mining Software, University of Waikato, http://www.cs.waikato.ac.nz/ml/weka/
2. A. Andrzejak, F. Langner and S. Zabala, Interpretable models from distributed data via merging of decision trees, in IEEE Symposium on Computational Intelligence and Data Mining (CIDM), Apr. 2013.
3. P. Strecht, J. Mendes-Moreira and C. Soares, Merging decision trees: a case study in predicting student performance, in Proceedings of the 10th International Conference on Advanced Data Mining and Applications, pp. 535-548, 2014.
4. G. J. Williams, Inducing and Combining Multiple Decision Trees, PhD thesis, Australian National University, 1990.
5. N. Chawla, L. Hall and K. Bowyer, Combining decision trees learned in parallel, in Working Notes of the KDD-97 Workshop on Distributed Data Mining, pp. 10-15, 1998.
6. L. Hall, N. Chawla and K. Bowyer, Decision tree learning on very large data sets, in IEEE International Conference on Systems, Man, and Cybernetics, vol. 3, pp. 2579-2584, 1998.
7. S. Weston, M. Kuhn, R. Quinlan and N. Coulter, C5.0 Decision Trees and Rule-Based Models, R package version 0.1.0-16, 2014.
8. P. Chan and S. Stolfo, Learning arbiter and combiner trees from partitioned data for scaling machine learning, in Proc. Int. Conf. on Knowledge Discovery and Data Mining, 1995.
9. D. Hennessy and F. Provost, Distributed machine learning: scaling up with coarse-grained parallelism, in Proceedings of the 2nd International Conference on Intelligent Systems for Molecular Biology, vol. 2, pp. 3407, Jan. 1994.
10. F. Provost and D. Hennessy, Scaling up: distributed machine learning with cooperation, in Proceedings of the 13th National Conference on Artificial Intelligence, pp. 74-79, 1996.
11. J. Long and B. Bursteinas, Merging distributed classifiers, in 5th World Multiconference on Systemics, Cybernetics and Informatics, 2001.
12. Students' Admission Prediction using GRBST with Distributed Data Mining, Communications on Applied Electronics (CAE), ISSN 2394-4714, Foundation of Computer Science (FCS), New York, USA, vol. 2, no. 1, June 2015.
13. A Proposed DDM Algorithm and Framework for EDM of Gujarat Technological University, International Conference on Advances in Engineering, organized by Saffrony Institute of Technology, 22nd-23rd January 2015.
14. A Dynamic and Scalable Evolutionary Data Mining for Distributed Environments, NCEVT-2013, PIET, Limda.
15. Faculty Performance Evaluation Based on Prediction in Distributed Data Mining, IEEE ICETECH 2015, Coimbatore.
16. Prediction and analysis of student performance using distributed data mining, International Conference on Information, Knowledge & Research in Engineering, Management and Sciences (IC-IKR-EMS), 7th Dec. 2014, KIT, Gujarat. IJETAETS, ISSN 0974-3588.
Chapter 5
Working of the Proposed Model
5.1 Introduction
In chapter 4 the proposed architecture was introduced. The architecture contains N different sites [3], one of which is considered the coordinator site, where the rule merging policies are applied and the global decision tree is generated from the newly formed global rules. In this chapter the proposed algorithms at both the local site and the coordinator site are discussed in detail, with examples. At the end, the system architectures at the local and coordinator sites are discussed in more detail, together with the merging policies.
5.2 Local site algorithm computation
The data set D as a whole is considered to be partitioned across different sites Si, where i = 1, 2, 3, ..., d; each site Si processes its locally available dataset Di to generate a decision tree using the J48 algorithm in the Weka tool. Here the student admission data set has been considered and processed on two different sites, S1 and S2. For simplicity of calculation, site S1 has 179 instances and site S2 has 142 instances to process. Tables 5.1 and 5.2 show the detailed statistics, which are useful for computing the different attribute
College   Record   AdmType      Record   Category   Record   Branch   Record
PIT1      80       State        92       Open       60       it       13
PIET1     62       Management   33       SEBC       49       cse      31
PIET2     0        TFWS         17       SC         19       mech     27
PIT2      0                              ST         14       ec       13
                                                             ee       14
                                                             auto     03
                                                             ic       01
                                                             civil    38
                                                             chem     02
Total     142                   142                 142               142

Table 5.1: Overall distribution of site S2 instances with respect to attributes
Branch   PIT   PIET   State   TFWS   Management   Open   SEBC   SC   ST
it       6     7      7       3      3            2      5      4    2
cse      14    17     19      2      10           15     11     2    3
mech     17    10     18      2      7            13     8      2    4
ec       7     6      9       3      1            5      4      3    1
ee       7     7      11      1      2            7      4      2    1
auto     2     1      2       1      0            0      2      1    0
ic       0     1      1       0      0            1      0      0    0
civil    25    13     27      3      8            3      14     5    3
chem     2     0      1       0      1            1      1      0    0
Total    80    62     95      15     32           60     49     19   14

Table 5.2: Detailed distribution of site S2 instances with respect to attribute values
splitting measures.
The information gain is based on the decrease in entropy after a dataset is split on an attribute. The information gain IG(A) measures the difference in entropy from before to after the set S is split on attribute A; in other words, how much the uncertainty in S was reduced by splitting S on A. In the computations below, the expected information needed to classify tuples at site S2 is derived, overall and attribute-wise; this helps select the most suitable attribute to split the instances when building the decision tree. Throughout, log refers to the logarithm with base 2.
• Expected information needed to classify tuples in the site S2 dataset: The expected information of the entire dataset, computed from all the instances of the data set at site S2, is 3.9, as below.
Info(D2) = -13/142 log(13/142) - 31/142 log(31/142) - 27/142 log(27/142) - 13/142 log(13/142) - 14/142 log(14/142) - 3/142 log(3/142) - 1/142 log(1/142) - 38/142 log(38/142) - 2/142 log(2/142) = 3.9
• Expected information for each attribute: The expected information for each attribute of the data set is computed in order to find the information gain of each attribute.
1. Infoinstitute(D2) = 80/142 × (-6/80 log 6/80 - 14/80 log 14/80 - 17/80 log 17/80 - 7/80 log 7/80 - 7/80 log 7/80 - 2/80 log 2/80 - 25/80 log 25/80 - 2/80 log 2/80) + 62/142 × (-7/62 log 7/62 - 17/62 log 17/62 - 10/62 log 10/62 - 6/62 log 6/62 - 7/62 log 7/62 - 1/62 log 1/62 - 13/62 log 13/62) = 2.66
Gaininstitute(D2) = Info(D2) - Infoinstitute(D2) = 3.9 - 2.66 = 1.24
Similarly, the information gains for AdmType and Category are derived:
2. InfoAdmType(D2) = 2.594, GainAdmType(D2) = 1.306
3. InfoCategory(D2) = 2.45, GainCategory(D2) = 1.45
After finding the information gain of each attribute, the split information for each attribute is computed:
1. Splitinfoinstitute(D2) = -80/142 log 80/142 - 62/142 log 62/142 = 0.994
2. SplitinfoAdmType(D2) = -95/142 log 95/142 - 15/142 log 15/142 - 32/142 log 32/142 = 1.223
3. SplitinfoCategory(D2) = -60/142 log 60/142 - 49/142 log 49/142 - 19/142 log 19/142 - 14/142 log 14/142 = 1.78
The gain ratio for each attribute is then calculated:
1. GainRatio(Institute) = Gaininstitute(D2) / Splitinfoinstitute(D2) = 1.24/0.994 = 1.2475
Similarly,
Figure 5.1: Data set at site S1
2. GainRatio(AdmType)= 1.068 and
3. GainRatio(Category)= 0.815
As in the above example for site S2, site S1 has 179 instances, which have been processed with the J48 algorithm. Figure 5.1 shows the data set at site S1: it contains 179 instances and 7 attributes in total. As shown in the figure, the institute attribute has 80, 16, 81 and 2 instances for the PIET1, PIET2, PIT1 and PIT2 institutes respectively. Likewise, the other attributes have their distinct values in the data set.
5.2.1 Building the decision tree
The above data set at site S1 has been processed by the supervised classification technique using the J48 algorithm, which generates the decision tree shown in figure 5.2. Here 10-fold cross-validation has been used, with a 66% splitting percentage.
On processing the data set at site S1, the result is acquired as shown in figure 5.3. In this figure the detailed accuracy by class (here, branch-wise) is
Figure 5.2: Decision Tree generated at local site S1
shown. For each class, the True Positive (TP) Rate, False Positive (FP) Rate, Precision, Recall, F-Measure and ROC Area can be viewed.
The confusion matrix shown in figure 5.3 can be clearly understood. In total, 141 instances are correctly classified and 38 instances are incorrectly classified. The class CIVIL has the maximum number of incorrectly classified instances (12 in total), and the class IT has the minimum (only 1).
5.2.2 Rule Generation
It is difficult and complex to merge the different local decision trees to form the global one. For an efficient merging process, the decision tree paths have been converted into simple decision rules. Using the J48 parser, the decision rules have been derived from the decision tree; some of the decision rules formed are given below for the different classes. On the left side of each rule, multiple predicates are ANDed to form a minterm predicate, and the right side is the class label. From observation of the decision rules, many of them are complex, overlapping, or differ in only a single attribute.
Figure 5.3: Detailed accuracy and the confusion matrix
1. AdmType=TFWS AND Institute=PIET1 AND ACPCRank>17215→ CSE
2. ACPCRank≤48576 AND AdmType = ST AND Category = OPEN AND
Institute = PIT1 AND ACPCRank > 17215→ CSE
3. Category = ST AND ACPCRank > 10996 AND ACPCRank > 440 AND
ACPCRank ≤ 17215→ IT
4. HSC ≤ 57.54 AND SSC ≤ 74.33 AND AdmType = MANAGEMENT AND
Institute = PIT1 AND Category = OPEN AND ACPCRank > 17215→ IT
5. SSC > 75.24 AND Institute = PIT1 AND AdmType = MANAGEMENT AND ACPCRank ≤ 10996 AND ACPCRank ≤ 17215 AND ACPCRank > 440 → EC
6. SSC > 74.33 AND AdmType = MANAGEMENT AND Institute = PIT1
AND Category = OPEN AND ACPCRank > 17215→ EC
7. ACPCRank ≤ 17215 AND ACPCRank > 440 AND ACPCRank ≤ 10996
AND AdmType =State AND Category = SC→MECH
8. ACPCRank ≤ 17215 AND ACPCRank > 440 AND ACPCRank≤10996
AND AdmType =MANAGEMENT AND Institute =PIET1→MECH
9. ACPCRank > 100904 AND Category = OPEN AND AdmType =MAN-
AGEMENT AND Institute =PIET1 AND ACPCRANK > 17215→ EE
10. Category = SEBC AND AdmType =MANAGEMENT AND Institute =PIET1
AND ACPCRANK > 17215→ EE
11. ACPCRank > 11083 AND Category = OPEN AND ACPCRank > 10996
AND ACPCRank ≤ 17215 AND ACPCRank > 440→ CIVIL
12. ACPCRank > 17215 AND Institute =PIET1 AND AdmType =STATE AND
ACPCRank ≤ 35015→ CIVIL
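The parser step that produces such rules can be sketched as a traversal that turns every root-to-leaf path into one ANDed rule. The nested-dict tree form below is a hypothetical illustration, not Weka's actual J48 output format.

```python
def tree_to_rules(node, path=()):
    """Sketch of the parser: each root-to-leaf path of the decision tree
    becomes one decision rule (ANDed conditions -> class label)."""
    if "label" in node:                          # leaf node: emit the accumulated path
        return [(" AND ".join(path), node["label"])]
    rules = []
    for test, child in node["children"].items():
        rules += tree_to_rules(child, path + (node["attr"] + test,))
    return rules

# A tiny tree in the assumed nested-dict form
tree = {"attr": "AdmType", "children": {
    "=TFWS": {"label": "CSE"},
    "=STATE": {"attr": "ACPCRank", "children": {
        ">17215": {"label": "CIVIL"},
        "<=17215": {"label": "IT"}}}}}

for conds, cls in tree_to_rules(tree):
    print(conds, "->", cls)
```

For a tree with m branches of up to k nodes each, this traversal emits m rules of at most k conditions, which are then stored row-wise in the decision table.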
5.2.3 Decision Table
In a distributed environment, the decision rules derived from the local decision trees need to be transferred to the coordinator site for further processing, such as merging. Scanning all the decision rules of the different local sites is a complex and time-consuming process. To reduce the network overhead and complexity, the decision rules are converted into decision table form, which can later be converted into an XML file. The decision table looks as shown in figure 5.4.
5.2.4 XML File Generation
As discussed above, the decision table is converted into an XML file, as shown in figure 5.5, to reduce network traffic and to ease processing. The very first line of the XML file is the XML version. The root tag is the dataset. The other lines include the record number, which further extends to the attributes and their values. The XML file is a simple text file that requires only a few kilobytes to store decision tree rules that would otherwise require hundreds of megabytes when a large volume of data is processed.
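The decision table to XML conversion can be sketched with Python's standard library. The tag and attribute names below are illustrative, not the thesis's exact schema, and the rule representation is the simplified one used in the earlier sketches.

```python
import xml.etree.ElementTree as ET

def table_to_xml(dtable):
    """Sketch of the XML generation step: one <record> per decision rule,
    one child element per attribute test, plus the class label."""
    root = ET.Element("dataset")
    for number, rule in enumerate(dtable, start=1):
        record = ET.SubElement(root, "record", number=str(number))
        for attr, value in rule["conds"].items():
            ET.SubElement(record, "attribute", name=attr).text = str(value)
        ET.SubElement(record, "class").text = rule["label"]
    return ET.tostring(root, encoding="unicode")

xml_text = table_to_xml([{"conds": {"AdmType": "TFWS", "Institute": "PIET1"},
                          "label": "CSE"}])
print(xml_text)
```

Because each rule becomes one compact record, the file size grows with the number of rules rather than with the number of training instances, which is what keeps the transmission overhead small.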
As shown in figure 5.6, to support a large volume of data [3], only the data generated at a given time instance needs to be processed, together with the existing decision table model. This saves the processing time for model generation. On each such event, every site generates an XML file, which is later transferred to the coordinator site. This approach supports the dynamic and
Figure 5.4: Decision table generated at local site S1
Figure 5.5: The XML file at local site S1
Figure 5.6: Dynamic and Scalable decision tree generation
scalable [4] evolutionary data mining by generating the decision tables.
5.3 Coordinator site algorithm computation
As discussed in chapter 4, the coordinator site is chosen at random. This site is
only responsible for merging the decision rules. In order to merge the decision
rule, the site very first collects all XML files from all the local sites followed
by converting each into the decision tables. All these decision tables are then
combined very first.
The merging process of k decision trees F1, F2, F3,... Fk into a single one
starts with creating for each tree Fi its Decision Table set Dtable(Fi). De-
cision Tables Dtable(F1), Dtable(F2),. . . is reduced into a final Decision
Table Dtable Final by the merging operation on Decision tables set. Finally,
Dtable Final is turned into a decision tree The merging operation is the main
part of the approach: it merges several decision table sets Dtable(F1), Dtable(F2),
Figure 5.7: The decision table merging process to generate the global decision tree
... etc. into a single one. It consists of several phases, namely intersection, filtering and reduction, as shown in figure 5.7.
As shown in figure 5.7, the decision tables Dtable(Fi) of all the sites Si with data sets Di, where i = 1, 2, 3, ..., d, are merged. First, the intersection phase is carried out, in which the common regions, i.e. rules, are found. In the second phase, the less useful disjoint regions are removed from the list; this process is known as filtering. In the third phase, reduction, the disjoint regions that can be combined with minor changes are merged to reduce the number of disjoint regions.
Intersection Phase: This phase combines the regions of two decision models,
each presented as a decision table, by extracting the components common to
both. The sets of (numerical) values of each region in each model are compared
to discover common sets of values across each variable. If the pair of regions
has the same class, the class assigned to the merged region is simply that
class; otherwise a class conflict arises. Andrzejak, Langner and Zabala [10]
propose three strategies to address this problem: a) assign the class with the
greatest confidence, b) assign the class with the greater probability, or
c) retrain the model with examples from the conflicting class regions. If no
strategy resolves the conflict, the region is removed from the merged model.
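The intersection step and conflict-resolution strategy (a) can be sketched as follows. The region representation, a dict of per-attribute numeric intervals plus a class label and a confidence score, is an illustrative assumption, not the exact structure used in the implementation:

```python
# Sketch of the intersection phase: overlap two regions attribute by
# attribute; on a class conflict, keep the class with greater confidence.
# The region layout is assumed for illustration only.

def intersect_regions(r1, r2):
    """Return the region common to r1 and r2, or None if they are disjoint."""
    common = {}
    for attr, (lo1, hi1) in r1["bounds"].items():
        lo2, hi2 = r2["bounds"][attr]
        lo, hi = max(lo1, lo2), min(hi1, hi2)
        if lo > hi:                       # no overlap on this attribute
            return None
        common[attr] = (lo, hi)
    # Strategy (a): when classes agree this keeps the shared class; when
    # they conflict it keeps the class with the greatest confidence.
    winner = max(r1, r2, key=lambda r: r["conf"])
    return {"bounds": common, "cls": winner["cls"], "conf": winner["conf"]}
```

For example, a rule "SSC in [60, 100] → YES (confidence 0.9)" intersected with "SSC in [80, 100] → NO (confidence 0.6)" would yield "SSC in [80, 100] → YES".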
Filter Phase: This phase removes the disjoint regions from the intersected
model and is essentially a pruning operation: the regions with the highest
relative volume and number of training examples are retained. Strecht,
Mendes-Moreira and Soares [11] address the issue by removing the disjoint
regions, and highlight the case where the models are not mergeable because
all their regions are disjoint.
Reduction Phase: This phase applies when a set of regions has the same class
and equal values on all variables except one; its task is to find which
regions can be joined into one, to obtain a simpler merged model. For nominal
variables, the differing variable takes the union of its values from all
regions; for numeric variables, the regions are joined only if their intervals
are contiguous.
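The reduction rule just described can be sketched as below; the region layout (per-attribute values, either a numeric interval or a frozenset of nominal values, plus a class label) is an illustrative assumption:

```python
# Sketch of the reduction phase: two same-class regions that agree on every
# attribute except one are joined, by value union for a nominal attribute or
# by interval union when the numeric intervals are contiguous.

def try_join(r1, r2):
    """Join r1 and r2 into one region, or return None if they cannot join."""
    if r1["cls"] != r2["cls"]:
        return None
    differing = [a for a in r1["bounds"] if r1["bounds"][a] != r2["bounds"][a]]
    if len(differing) != 1:               # must differ on exactly one variable
        return None
    attr = differing[0]
    v1, v2 = r1["bounds"][attr], r2["bounds"][attr]
    if isinstance(v1, frozenset):         # nominal: union of the value sets
        joined = v1 | v2
    else:                                 # numeric: contiguous intervals only
        (lo1, hi1), (lo2, hi2) = sorted([v1, v2])
        if lo2 > hi1:
            return None
        joined = (lo1, max(hi1, hi2))
    bounds = dict(r1["bounds"])
    bounds[attr] = joined
    return {"bounds": bounds, "cls": r1["cls"]}
```

Under this sketch, "HSC in [0, 50]" and "HSC in [50, 100]" with the same class and matching other attributes would join into "HSC in [0, 100]".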
5.4 Summary
In this chapter the detailed working of the proposed model has been discussed
with a suitable example. First, the sample training data sets at sites S1 and
S2 were introduced. The working of each local site and of the coordinator site
was then described in detail. At the end of the chapter the decision rule
merging policies were introduced with an example, along with the different
phases of merging.
5.5 References
1. B. Bursteinas and J. Long, "Merging distributed classifiers," in 5th World
Multiconference on Systemics, Cybernetics and Informatics, 2001.

2. "Students' Admission Prediction using GRBST with Distributed Data Mining,"
Communications on Applied Electronics (CAE), ISSN 2394-4714, Foundation of
Computer Science (FCS), New York, USA, vol. 2, no. 1, June 2015.

3. "A Proposed DDM Algorithm and Framework for EDM of Gujarat Technological
University," International Conference on Advances in Engineering, organized by
Saffrony Institute of Technology, 22nd-23rd January 2015.

4. "A Dynamic and Scalable Evolutionary Data Mining for Distributed
Environments," NCEVT-2013, PIET, Limda.

5. "Faculty Performance Evaluation Based on Prediction in Distributed Data
Mining," 2015 IEEE ICETECH, Coimbatore.

6. "Prediction and analysis of student performance using distributed data
mining," International Conference on Information, Knowledge & Research in
Engineering, Management and Sciences (IC-IKR-EMS), 7th Dec. 2014, KIT,
Gujarat, IJETAETS, ISSN 0974-3588.

7. M. Kuhn, S. Weston, N. Coulter, and R. Quinlan, C5.0 Decision Trees and
Rule-Based Models, R package version 0.1.0-16, 2014.

8. P. Chan and S. Stolfo, "Learning arbiter and combiner trees from
partitioned data for scaling machine learning," in Proc. Intl. Conf. on
Knowledge Discovery and Data Mining, 1995.

9. F. J. Provost and D. N. Hennessy, "Distributed machine learning: scaling
up with coarse-grained parallelism," in Proceedings of the 2nd International
Conference on Intelligent Systems for Molecular Biology, vol. 2, pp. 3407,
Jan. 1994.

10. A. Andrzejak, F. Langner, and S. Zabala, "Interpretable models from
distributed data via merging of decision trees," in 2013 IEEE Symposium on
Computational Intelligence and Data Mining (CIDM), Apr. 2013.

11. P. Strecht, J. Mendes-Moreira, and C. Soares, "Merging Decision Trees: a
case study in predicting student performance," in Proceedings of the 10th
International Conference on Advanced Data Mining and Applications,
pp. 535-548, 2014.
Chapter 6
Data Collection, Preprocessing and Implementation
6.1 Introduction
Data collection is often a loosely controlled process, and the gathered data
frequently contain out-of-range values, impossible data combinations, missing
values, noise and other defects. Data that have not been properly screened can
produce misleading results, so the raw data need to be preprocessed before
they can support information generation and decision making.
The data mining step that transforms raw data into an appropriate and
understandable form for further processing is called data preprocessing. In
the real world, data are most often incomplete, uncertain, inconsistent and
error-prone; the phrase "Garbage In, Garbage Out" is particularly applicable
to machine learning and data mining. Data preprocessing is therefore required
to produce quality data on which decisions can be based.
In this chapter different data collections used for implementation, their
preprocessing and implementation have been discussed in detail.
6.2 Data Collection
Data collection is the process of gathering and measuring information on
variables of interest, in an established systematic fashion that enables one
to answer stated research questions, test hypotheses, and evaluate outcomes.
There are numerous data collection methods available, but in this research
work the real data set of student admissions at Parul University (gathered
from the PU Web portal) and the Zoo data set have been used.
6.2.1 Zoo Data Set
This data set has been downloaded from the UCI repository. It is a simple
database containing 17 Boolean-valued attributes and one numeric class
attribute (type). There are 101 instances in total, with no missing values.
The attribute information is given in table 6.1.
6.2.2 Student Admission Data Set
This research has been carried out on a real data set of Parul University for
predicting students' admissions into different fields/branches of different
colleges. The data have been collected from the Parul University Web Portal.
In total, more than 100,000 records have been used for training. The data set
has more than 10 attributes, but an attribute selection method has been
applied to keep only the relevant attributes, and preprocessing has been
applied to turn them into quality data for the further data mining process.
The student admission data set shown in table 6.2 has been processed on two
different sites S1 and S2. For simplicity of calculation, site S1 has 179
instances and site S2 has 142 instances to process.
Sr. No.   Attribute Name   Data Type   Value (Range)     Remarks
1         Animal Name      Boolean                       Unique for each instance
2         Hair             Boolean
3         Feathers         Boolean
4         Eggs             Boolean
5         Milk             Boolean
6         Airborne         Boolean
7         Aquatic          Boolean
8         Predator         Boolean
9         Toothed          Boolean
10        Backbone         Boolean
11        Breathes         Boolean
12        Venomous         Boolean
13        Fins             Boolean
14        Legs             Numeric     {0,2,4,5,6,8}
15        Tail             Boolean
16        Domestic         Boolean
17        Catsize          Boolean
18        Type             Numeric     [1,7]

Table 6.1: Zoo data set
Sr. No.   Attribute Name   Data Type   Value (Range)                 Remarks
1         Institute        Nominal     {PIET1, PIET2, PIT1, PIT2}
2         Admtype          Nominal     {State, Management, TFWS}
3         Category         Nominal     {SC, ST, SEBC, OPEN}
4         ACPCRank         Nominal
5         SSC              Nominal     [0,100]                       Percentage
6         HSC              Nominal     [0,100]                       Percentage
7         Degree           Nominal
8         City             Nominal
9         Name             Nominal

Table 6.2: Student admission data set collected from Parul University Web Portal
Sr. No.   Attribute Name   Data Type   Value (Range)   Remarks
1         Attendance       Numeric
2         Midsem result    Boolean     {YES, NO}
3         Pre Bklg         Boolean     {YES, NO}       Previous Backlog
4         Assignment       Nominal
5         Pre result       Boolean     {YES, NO}
6         Branch           Nominal     [0,100]
7         Pass             Boolean     {YES, NO}

Table 6.3: Student performance data set collected from departments of PIT College
6.2.3 Student Performance Data Set
In this research the student performance data set has been collected from
different departments of Parul Institute of Technology (PIT), a college of
Parul University. The data set contains many attributes, but using the
attribute selection method only the 7 attributes shown in table 6.3 have been
identified for further processing in data mining. This data set contains more
than 50,000 instances.
6.3 Data Pre-Processing
For the data mining process the data first need to be preprocessed into
quality data, so that the analysis and the resulting information support
quality decisions. Before starting, the database user should therefore be
clear about some of the most relevant questions: 1) What data is available for
the task? 2) Is this data relevant? 3) Is additional relevant data available?
4) How much historical data is available? 5) Who is the data expert?
For the data mining process the quantity of data plays as important a role as
its relevance. Useful rules of thumb are: 1) Number of instances (records,
objects): 5,000 or more is desired; with fewer, results are less reliable and
special methods (boosting, ...) should be used. 2) Number of attributes
(fields): 10 or more instances per attribute; with more fields, use feature
reduction and selection. 3) Number of targets: more than 100 instances per
class; if the classes are very unbalanced, use stratified sampling.
Figure 6.1: Forms of Data Preprocessing
Preprocessing is required before the data mining task because real-world data
are generally incomplete (lacking attribute values or certain attributes of
interest, or containing only aggregate data), noisy (containing errors or
outliers) and inconsistent (containing discrepancies in codes or names). The
data preprocessing tasks are explained below and shown in figure 6.1.
• Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
• Data integration: using multiple databases, data cubes, or files.
• Data transformation: normalization and aggregation.
• Data reduction: reducing the volume but producing the same or similar
analytical results.
• Data discretization: part of data reduction, replacing numerical attributes
with nominal ones.
Data cleaning: This is the first preprocessing operation. It comprises various
ways to clean the data.
1. Fill in missing values (attribute or class value):
• Ignore the tuple: usually done when class label is missing.
• Use the attribute mean (or majority nominal value) to fill in the
missing value.
• Use the attribute mean (or majority nominal value) for all samples
belonging to the same class.
• Predict the missing value by using a learning algorithm: consider the attribute with the missing value as a dependent (class) variable and run a learning algorithm (usually Bayes or decision tree) to predict the missing value.
2. Identify outliers and smooth out noisy data:
• Binning: sort the attribute values and partition them into bins (see "Unsupervised discretization" below), then smooth by bin means, bin medians, or bin boundaries.
• Clustering: group values in clusters and then detect and remove outliers (automatically or manually).
• Regression: smooth by fitting the data to regression functions.
3. Correct inconsistent data: use domain knowledge or expert decision.
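Two of the missing-value options above can be sketched as follows; the function names and record layout are illustrative assumptions, not the implementation's own:

```python
# Sketch of two cleaning steps: filling a missing numeric value with the
# attribute mean, and filling a missing nominal value with the majority
# value among samples of the same class. Missing values are None here.
from collections import Counter

def fill_with_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def fill_with_class_majority(rows, attr, class_attr):
    """Replace None in `attr` with the majority value of the row's class."""
    majority = {}
    for cls in {row[class_attr] for row in rows}:
        vals = [r[attr] for r in rows
                if r[class_attr] == cls and r[attr] is not None]
        majority[cls] = Counter(vals).most_common(1)[0][0]
    return [dict(row, **{attr: majority[row[class_attr]]})
            if row[attr] is None else row for row in rows]
```

For example, `fill_with_mean([2, None, 4])` replaces the gap with the mean of the known values, 3.0.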
Data transformation: Data transformation is the process of converting data
or information from one format to another, usually from the format of a source
system into the required format of a new destination system. Some of the data
transformation techniques are discussed below:
1. Normalization:
• Scaling attribute values to fall within a specified range. Example: to
transform V in [min, max] to V’ in [0,1], apply V’=(V-Min)/(Max-
Min)
• Scaling by using mean and standard deviation (useful when min
and max are unknown or when there are outliers): V’=(V-Mean)/StdDev.
2. Aggregation: moving up in the concept hierarchy on numeric attributes.
3. Generalization: moving up in the concept hierarchy on nominal attributes.
4. Attribute construction: replacing or adding new attributes inferred by
existing attributes.
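The two scaling rules stated above can be written directly as code; this is a minimal sketch, with attribute names and ranges chosen for illustration:

```python
# Min-max scaling and z-score scaling, exactly as the formulas above:
# V' = (V - Min) / (Max - Min)  and  V' = (V - Mean) / StdDev.
import statistics

def min_max(v, lo, hi):
    """Scale v from [lo, hi] onto [0, 1]."""
    return (v - lo) / (hi - lo)

def z_score(values):
    """Scale by mean and standard deviation, useful when min/max are
    unknown or when there are outliers."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]
```

For a percentage attribute such as SSC, `min_max(75, 0, 100)` gives 0.75.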
Data reduction: Data reduction is the transformation of numerical or al-
phabetical digital information derived empirically or experimentally into a
corrected, ordered, and simplified form.
1. Reducing the number of attributes
• Data cube aggregation: applying roll-up, slice or dice operations.
• Removing irrelevant attributes: attribute selection (filtering and wrapper methods), searching the attribute space.
• Principal component analysis (numeric attributes only): searching for a lower-dimensional space that can best represent the data.
2. Reducing the number of attribute values
• Binning (histograms): reducing the number of attribute values by grouping them into intervals (bins).
• Clustering: grouping values in clusters.
• Aggregation or generalization
3. Reducing the number of tuples
• Sampling
6.4 Test Data Set
The dataset has been partitioned equally into as many subsets as there are
sites. The experiments have been performed on 10k, 20k, 50k and 100k records
(here k means thousand) at 2, 5 and 10 sites. The local training models have
been generated and merged using the proposed approach, and the accuracy of the
resulting global models has been checked on test datasets: it exceeds 98% in
classifying the test dataset. The basic comparisons clearly show that
accuracy, training time, communication overhead and other parameters have been
optimized. The student admission data sets for the years 2013-14 and 2014-15
have been used to train the model, which then predicts on the 2015-16
admission data set with more than 98.03% accuracy. These experimental results
have also been verified using 10-fold cross validation.
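The fold construction behind the 10-fold cross validation can be sketched as follows; the function name and index-based layout are illustrative, not taken from the implementation:

```python
# Sketch of a k-fold split: the records are divided into k folds, and each
# fold serves once as the test set while the remaining records train the
# model, so every record is tested exactly once.
def k_fold_splits(n_records, k=10):
    """Yield (train_indices, test_indices) pairs for k folds."""
    for i in range(k):
        test = set(range(i * n_records // k, (i + 1) * n_records // k))
        train = [j for j in range(n_records) if j not in test]
        yield train, sorted(test)
```

With 100 records and k = 10, each fold tests 10 records and trains on the other 90.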
6.5 Implementation
The research work has been carried out on different numbers of sites with the
following hardware and software configurations:
Software
1. Database: Microsoft SQL Server 2008 R2
2. Tool: Visual Studio 2010 for .NET
3. Language: C#
4. Apache Hadoop Framework
Hardware
1. Processor: AMD E1-2500, 1.4 GHz
2. RAM: 4 GB
3. System: 64-bit OS / Ubuntu Linux OS (For Apache Hadoop Framework)
4. Hard disk: 400 GB
The captured screenshots are shown below:
Figure 6.2: Site Selection
Figure 6.3: Run J48 algorithm to each site
Figure 6.4: Load/Save the training model
Figure 6.5: Decision Tree and Decision Table at each site
Figure 6.6: Combined Decision Tree and Decision Table
Figure 6.7: Branch wise decision rules
6.6 Summary
In this chapter the importance of data collection and preprocessing has been
discussed, along with the different ways in which a data set may not be of
sufficient quality to process; such data sets need to be preprocessed. In this
research work two different data sets have been used. The local training
models have been generated and merged using the proposed approach, and the
accuracy of the resulting global models has been checked on test datasets: it
exceeds 98% in classifying the test dataset.
Chapter 7
Implementation with Apache Hadoop
7.1 Introduction to Apache Hadoop
Apache Hadoop [3] is an open source software framework [4] with two main
components: 1) Map-Reduce, the distributed processing framework, and 2) the
Hadoop Distributed File System (HDFS) [2], a distributed file system. One of
the most important reasons to use Hadoop in this research is to process and
analyze data sets far too large for a single machine. In Apache Hadoop the
storage is provided by HDFS and the analysis is done by Map-Reduce. The Apache
Hadoop architecture is shown in figure 7.1.
7.2 Hadoop Map-Reduce
In Map-Reduce the processing is broken into two phases, Map and Reduce. Each
phase takes key-value pairs as input and produces key-value pairs as output.
The input to the Map phase is the raw data. Generally a text input format is
chosen, in which each line of the data set is the value and the key is the
byte offset of that line from the beginning of the file. The output of the map
function is sorted and grouped by key before it is passed to the reduce
function [1].
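The key-value flow just described can be mimicked in a few lines of ordinary code. This is a toy, in-process illustration only, not Hadoop code; the function names are assumptions:

```python
# Toy Map-Reduce flow: lines keyed by their byte offset (as a text input
# format does), a word-count map function, a sort/group step standing in
# for the framework's shuffle, and a reduce that sums grouped values.
from itertools import groupby

def text_input(data):
    """Yield (byte_offset_of_line, line_text) pairs."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip("\n")
        offset += len(raw)

def map_fn(_offset, line):
    for word in line.split():
        yield word, 1

def reduce_fn(key, values):
    return key, sum(values)

def run_job(data):
    intermediate = [kv for off, line in text_input(data)
                    for kv in map_fn(off, line)]
    intermediate.sort()                     # the framework's sort/group step
    return dict(reduce_fn(k, [v for _, v in grp])
                for k, grp in groupby(intermediate, key=lambda kv: kv[0]))
```

Running `run_job("a b\nb b\n")` produces the grouped counts `{"a": 1, "b": 3}`.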
Figure 7.1: Apache Hadoop Architecture
The Hadoop runtime is responsible for dividing the job into small Map and
Reduce tasks. A Mapper class is required to implement the map function, and a
Reducer class to implement the reduce function.
7.2.1 Map
The input data is divided into input splits, and one map task is created for
each split. The list of <key1, value1> entries in the split is then processed
one after another, producing an intermediate list of <key2, value2> entries.
Several classes need to be defined in the map phase: InputFormat (defines the
source of input files), InputSplits (splits the input and defines a single
unit of work for one map task), RecordReader (converts the binary data of an
InputSplit into <key, value> pairs), Mapper, Combiner and Partitioner. The
size of the split plays an important role in load balancing. The split should
not be too small; otherwise creating the map task costs more overhead than its
execution time. In most cases the split size equals one HDFS block, because if
a split contained more than one HDFS block it could defeat the locality
optimization: all the HDFS blocks need not be present on the same node. From
the InputSplit, the RecordReader creates <key, value> pairs repeatedly until
the entire split has been
processed.
There is an optional phase, the Combiner, which reduces the amount of data
shuffled between the map and reduce tasks by running the reduce function on
the map output locally on the node. There is no guarantee of how many times
Hadoop will call the combiner function. The number of partitions equals the
number of reducers: the partitioner splits the intermediate key space across
the reducers.
7.2.2 Reduce
The reducer reduces a set of intermediate values that share a common key to a
smaller set of values. Unlike the number of map tasks, the number of reduce
tasks is controllable. If the number of reduce tasks is very high, the shuffle
takes much time and bandwidth; on the other hand, a small number of long
reduce tasks hurts the overall execution time through a poor degree of
parallelism. There are three phases in the reducer: shuffling, which moves the
relevant parts of the map output; sorting, which sorts the intermediate keys
on a single node; and reduce, which merges the values that share the same key
and writes the output.
Many data mining applications need to pass arguments to the Map-Reduce tasks.
Large read-only files are distributed via the DistributedCache, and small
arguments are passed using the setter and getter methods of the JobConf class.
There are two versions of Hadoop which differ in their architecture and
Map-Reduce job execution flow. Figure 7.2 shows the architecture and the job
execution flow of Hadoop Map-Reduce version 1.x (MRv1).
7.3 HDFS
In Hadoop, the Hadoop Distributed File System (HDFS) is the main storage
component; it provides large scale storage (terabytes or petabytes) in a
distributed architecture and can easily be extended by scaling out. A file in
Figure 7.2: Architecture and job execution flow in Hadoop Map-Reduce version 1.x (MRv1)
HDFS is divided into blocks of predefined or customized size (mostly 128 MB
per block). Such a large block size keeps the seek time small compared to the
time spent reading the data from disk. The original architecture is available
from Yahoo! Inc. [6] and is based on the design of GFS [7]. By replicating the
blocks on various machines, Hadoop supports fault tolerance and the handling
of node failures, and hence the throughput is also very high.
A main advantage of Hadoop is its capability to handle unstructured data
collected from different data sources or in different formats. Moreover,
heterogeneous data collected by unrelated systems can be deposited in the
Hadoop cluster without predetermining how the data will be analyzed. HDFS is
not suitable for applications that require immediate seek access, but is well
suited to Write Once, Read Many (WORM) applications. HDFS provides very high
data locality to Map-Reduce applications by placing blocks so that very few
block movements are needed between machines of the same rack or of different
racks.
7.4 Decision Tree Map-Reduce
Figure 7.3 shows an overview of the map-reduce model. The input is divided
into chunks of blocks using the splitting method. The block size may depend on
the application, but most of the time it is 128 MB, to improve overall
performance. One mapper works on each input split and generates intermediate
<key, value> pairs; the reducer merges all the intermediate results to form
the final output.
The decision tree is generated with map-reduce using the following new data
structures:
1. Attr_table: the attribute table. It includes the basic information of each
attribute attr, the row identifier of an instance row_id, the attribute values
values(attr) and the class label c of each instance.
2. Cnt_table: the count table. It stores the count of instances per class
label if split by attribute attr; its two attributes are the count cnt and the
class label c.
3. Hash_table: the hash table for indexing and linking. It stores the link
between the row_id and the node_id (for tree nodes), as well as the link
between a branch node subnode_id and its parent node node_id.
The decision tree generation process is made of four phases: data preparation,
selection, update and tree growing.
7.4.1 Data Preparation
1. In this phase the traditional data is converted into a
MapReduce-supportable format.
2. The MAP_ATTR procedure transforms each instance record into the attribute
table, using attribute aj (j = 1, 2, ..., M) as the key and the row_id and
class label c as values.
3. The REDUCE_ATTR procedure computes the number of instances with each class
label (if split by attribute aj), forming the count table.
Figure 7.3: Overview of Map-Reduce Model
Below are the algorithm steps:
procedure MAP_ATTR(row_id, (a1, a2, ..., aM, c))
    emit(aj, (row_id, c))
end procedure

procedure REDUCE_ATTR(aj, (row_id, c))
    emit(aj, (c, cnt))
end procedure
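The MAP_ATTR / REDUCE_ATTR pair can be simulated sequentially as below. The record layout (a dict of attribute values plus a class label) stands in for Attr_table and Cnt_table and is an illustrative assumption:

```python
# Sequential simulation of the data preparation phase: map_attr builds the
# attribute-table entries, reduce_attr aggregates them into per-class counts
# for each attribute-value pair (the Cnt_table).
from collections import defaultdict

def map_attr(row_id, attributes, class_label):
    """Emit ((attribute, value), (row_id, class)) for each attribute."""
    for attr, value in attributes.items():
        yield (attr, value), (row_id, class_label)

def reduce_attr(pairs):
    """Count instances per class for each attribute-value pair."""
    counts = defaultdict(int)
    for (attr, value), (_row_id, cls) in pairs:
        counts[(attr, value, cls)] += 1
    return dict(counts)
```

For two Zoo-style records with Milk = 1 and class "mammal" and one with Milk = 0 and class "bird", the count table contains ("Milk", 1, "mammal") → 2 and ("Milk", 0, "bird") → 1.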
7.4.2 Selection
1. Select the best attribute abest.
2. REDUCE_POPULATION aggregates the total number of records for attribute aj,
taking the instances of each attribute-value pair.
3. MAP_COMPUTATION computes the information and split information of aj.
4. REDUCE_COMPUTATION computes the information gain ratio and selects the
attribute with the maximum GainRatio(aj) as the splitting attribute abest.
procedure REDUCE_POPULATION(aj, (c, cnt))
    emit(aj, all)
end procedure

procedure MAP_COMPUTATION(aj, (c, cnt, all))
    compute Entropy(aj)
    compute Info(aj) = (cnt/all) * Entropy(aj)
    compute SplitInfo(aj) = -(cnt/all) * log(cnt/all)
    emit(aj, (Info(aj), SplitInfo(aj)))
end procedure

procedure REDUCE_COMPUTATION(aj, (Info(aj), SplitInfo(aj)))
    emit(aj, GainRatio(aj))
end procedure
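The selection-phase quantities can be computed directly from per-value class counts; this hedged sketch works on plain lists rather than actual Map-Reduce streams:

```python
# Entropy, Info(aj), SplitInfo(aj) and the gain ratio, computed from class
# counts. `class_counts` are the class totals before the split; `partitions`
# holds one class-count list per value of the candidate attribute aj.
import math

def entropy(class_counts):
    total = sum(class_counts)
    return -sum(c / total * math.log2(c / total) for c in class_counts if c)

def gain_ratio(class_counts, partitions):
    total = sum(class_counts)
    info_a = sum(sum(p) / total * entropy(p) for p in partitions)
    split_info = -sum(sum(p) / total * math.log2(sum(p) / total)
                      for p in partitions if sum(p))
    gain = entropy(class_counts) - info_a
    return gain / split_info if split_info else 0.0
```

An attribute that splits ten instances of two balanced classes into two pure halves has gain 1 and split information 1, hence gain ratio 1.0, the maximum possible, and would be chosen as abest.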
7.4.3 Update
MAP_UPDATE_COUNT reads a record from the attribute table whose key value
equals abest and emits the count of class labels. MAP_HASH assigns a node_id
based on a hash value of abest, making sure that records with the same value
are sent to the same partition.
procedure MAP_UPDATE_COUNT(abest, (row_id, c))
    emit(abest, (c, cnt))
end procedure

procedure MAP_HASH(abest, row_id)
    compute node_id = hash(abest)
    emit(row_id, node_id)
end procedure
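The MAP_HASH guarantee, that rows sharing the best attribute's value always land in the same partition, can be sketched as follows; the use of `hashlib` instead of Python's built-in `hash()` is an illustrative choice to make the assignment run-independent:

```python
# Sketch of MAP_HASH: a stable hash of the best attribute's value decides
# the node_id, so rows with equal values always reach the same partition.
import hashlib

def map_hash(rows, best_attr, partitions):
    """Return {row_id: node_id} keyed by a hash of rows[row_id][best_attr]."""
    def node_id(value):
        digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
        return int(digest, 16) % partitions
    return {row_id: node_id(rec[best_attr]) for row_id, rec in rows.items()}
```

Two rows with the same Category value, for example, are always assigned the same node_id, so the tree-growing step sees them in one partition.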
7.4.4 Tree Growing
The previous step generates the tree nodes; these are now extended into the
decision tree by creating links between nodes. A node_id is created for each
generated node and compared with the existing values. If it already exists,
the node is grouped with the existing nodes; otherwise a new sub-node is
created.
procedure MAP(abest, row_id)
    compute node_id = hash(abest)
    if node_id is the same as an existing value then
        emit(row_id, node_id)
    else
        add a new subnode
        emit(row_id, node_id, subnode_id)
    end if
end procedure
Of the four steps above, the data preparation sequence of MapReduce operations
is a one-time task, while the remaining steps are iterative. The terminal
condition is that all node_ids have become leaf nodes; the decision tree is
then built.
7.5 Apache Hadoop Cluster
The above methods have been implemented on Hadoop clusters of 2, 5 and 10
nodes. In a Hadoop cluster there is one Name Node (the primary node, acting as
master) while the others are Data Nodes (acting as slaves). In the
experimental set-up, one node is first installed as the primary name node,
which holds the information about all the other data nodes; the name node thus
maintains the meta directory of the cluster, including all its services.
Figure 7.4 shows the installation of Apache Hadoop with the node name coed82.
The name node is the core part of the HDFS file system. It does not store the
files itself; it holds the tree-like directory of all files in the file system
and keeps track of where the files are stored across the cluster. A client
application talks to it to locate a file or to perform operations such as
adding, moving, copying or deleting file contents.
HDFS contains more than one data node, and the data are physically stored on
the data nodes. The same data may be replicated across several data nodes to
support fault tolerance and availability. A data node must first connect to
the name node to register itself in the file system. A client application
approaches the name node first; the name node then
Figure 7.4: Apache Hadoop Installation
Figure 7.5: Hadoop MapReduce Administration
connects it with the data node, and finally the client application fetches the
data directly from the data node.
Figure 7.5 shows the map reduce administration with the cluster summary for
two data nodes. It also includes the map and reduce task capacity, the heap
size (73.5 MB here), the scheduling information, and the details of running
and retired jobs.
As shown in figure 7.6 and figure 7.7, the name node and cluster information
is available, containing the following:
1. Live Nodes: the number of data nodes available for processing.
2. Configured Capacity: the memory space configured, in GB.
3. Dead Nodes: the data nodes not currently working or participating due to
failure.
4. DFS Used/Unused: the space used for the Distributed File System.
5. Number of Under-Replicated Blocks: the total number of under-replicated
blocks.
6. Storage Directory: the path, type and state of the directory.
Figure 7.8 and figure 7.9 give detailed information on the contents of the
directory and the directory log.
Figure 7.6: Two Node cluster
Figure 7.7: Cluster Summary
Figure 7.8: Contents of Directory
Figure 7.9: Directory Log
7.6 Conclusion
The map-reduce process of Apache Hadoop writes a backup copy of the
intermediate results after the completion of each task; it thus performs read
and write operations on large data sets, which can reduce the throughput and
the overall performance. The operations are not "in-memory" in nature, which
is the main reason Hadoop is not used for real-time analytics. For the
proposed research work, however, Hadoop is well suited, since it supports the
dynamic and scalable nature of the problem as well as fault tolerance in a
distributed environment.
7.7 References
1. Tom White, Hadoop: The Definitive Guide, 3rd Edition, O'Reilly Media,
Inc., Sebastopol, CA 95472, 2012.
2. Dirk deRoos, Paul C. Zikopoulos, Bruce Brown, Rafael Coss, and Roman B.
Melnyk, Hadoop For Dummies, John Wiley & Sons, Inc., Hoboken, New Jersey,
2014.
3. Apache Hadoop, http://hadoop.apache.org/releases.html
4. http://en.wikipedia.org/wiki/Apache_Hadoop
5. HDFS: http://hortonworks.com/hadoop/hdfs
6. "The Hadoop Distributed File System," Yahoo! Inc., in Proceedings of MSST
2010, IEEE, 2010.
7. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File
System," Google Inc., in SOSP'03, October 19-22, 2003, Bolton Landing, NY,
USA, 2003.
Chapter 8
Results, Conclusions and Future Enhancements
8.1 Introduction
In the first section of this chapter the experimental results are presented in
detail. The comparative analysis over different numbers of sites (2, 5 and 10)
and different numbers of records for the different data sets proves
experimentally that the proposed dynamic and scalable approach is much faster
and more accurate. The chapter closes by noting that this research can be
further extended to streaming data, for applications such as intra-day stock
markets, weather forecasting and other data generated by real-time
applications.
8.2 Results
The proposed algorithm has been applied to different data sets: student
admission, student performance and the Zoo data set. The confusion matrices
for Site1, Site2 and the coordinator site have been derived experimentally for
the student data set as below.
Table 8.4 and figures 8.1 and 8.2 below show the comparative performance of
Site1, Site2 and the coordinator site. The accuracy and the time
          YES        NO
YES       TP=0.788   FN=0
NO        FP=0.043   TN=0.168

        YES   NO   TOTAL   RECOGNITION (%)
YES     141   0    141     78.77
NO      8     30   38      21.052
TOTAL   149   30   179     95.53

Table 8.1: Confusion matrix for Site1
          YES        NO
YES       TP=0.648   FN=0
NO        FP=0.091   TN=0.26

        YES   NO   TOTAL   RECOGNITION (%)
YES     92    0    92      64.788
NO      13    37   50      35.211
TOTAL   105   37   142     90.845

Table 8.2: Confusion matrix for Site2
          YES        NO
YES       TP=0.701   FN=0
NO        FP=0.066   TN=0.28

        YES   NO   TOTAL   RECOGNITION (%)
YES     225   0    225     70.09
NO      6     90   96      67.61
TOTAL   231   90   321     98.13

Table 8.3: Confusion matrix for the combined data set at the coordinator site
  Site          Accuracy    Error Rate    Training Time (Sec)
  Site1         90.845      9.115         0.11
  Site2         95.53       4.47          0.03
  Coordinator   98.13       1.87          0.05

  Site          Sensitivity    Specificity    Precision    Recall
  Site1         100            74             94.63        0.789
  Site2         100            78.94          87.62        0.648
  Coordinator   100            93.75          97.40        0.701

Table 8.4: Comparative performance at Site1, Site2 and the coordinator site
Figure 8.1: Performance measures
required to build the decision tree are better at the coordinator site, and
the error rate of the combined approach is lower than that of the individual
sites.
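The measures in Table 8.4 follow directly from the confusion-matrix counts. As an illustrative sketch (the function name and structure are ours, not the thesis implementation), the Site1 values can be recomputed from the counts in Table 8.1:

```python
def metrics(tp, fn, fp, tn):
    """Derive the performance measures of Table 8.4 from raw confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy":    100 * (tp + tn) / total,   # correctly classified records
        "error_rate":  100 * (fp + fn) / total,
        "sensitivity": 100 * tp / (tp + fn),      # true-positive rate (recall on YES)
        "specificity": 100 * tn / (tn + fp),      # true-negative rate
        "precision":   100 * tp / (tp + fp),
    }

# Counts from Table 8.1 (Site1): TP=141, FN=0, FP=8, TN=30
site1 = metrics(tp=141, fn=0, fp=8, tn=30)
print({k: round(v, 2) for k, v in site1.items()})
# accuracy 95.53, error_rate 4.47, sensitivity 100.0, specificity 78.95, precision 94.63
```

The 95.53% accuracy matches the overall recognition rate in the last row of Table 8.1.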
Table 8.5 and Figure 8.3 compare the proposed approach on three different
data sets. From these statistics, the algorithm gives excellent performance
on the student admission data set, while its performance on the Zoo data set
is slightly poorer, since that set has more attributes than the other two.
This shows that the number of attributes also plays an important role in the
processing.
Figure 8.2: Recall and Training Time (Sec)
  DataSet               Number of Instances   Accuracy (%)   Error Rate (%)   Training Time (Sec)
  Student Admission     321                   98.13          1.87             0.05
  Student Performance   100                   93             7                0.02
  ZOO                   101                   96.05          3.95             0.04

  DataSet               Sensitivity   Specificity   Precision   Recall
  Student Admission     100           93.75         97.40       98.13
  Student Performance   100           92.63         93.20       93
  ZOO                   100           93.34         96.61       96.05

Table 8.5: Performance statistics for three different data sets
Figure 8.3: Performance comparison
The experimental results for the student admission data set presented below
show the processing time, communication overhead time and total time (in
milliseconds) for the centralized approach, the intermediate message passing
approach and the proposed approach, for different numbers of sites and
different numbers of instances. The detailed statistics and the graphical
comparisons are shown one by one below.
  Records   Centralized Approach (ms)   Intermediate Message Passing (ms)   Proposed Method (ms)
  10000     17.7                        11.021                              6.4
  20000     40.8                        25.83                               14.3
  40000     102.1                       59.79                               29.32
  60000     130.6                       97.59                               36
  80000     205.7                       128.98                              58.6
  100000    229.3                       160.33                              70.73

Table 8.6: Statistics for total time on 2 sites
Figure 8.4: Comparison for total time on 2 sites
  Records   Centralized Approach (ms)   Intermediate Message Passing (ms)   Proposed Method (ms)
  10000     14.4                        10.78                               3.33
  20000     34.79                       29.77                               7.61
  40000     73.36                       59.83                               16.91
  60000     92.06                       80.72                               21.65
  80000     150.97                      116.03                              32.4
  100000    164.24                      149.48                              38.05

Table 8.7: Statistics for total time on 5 sites
  Records   Centralized Approach (ms)   Intermediate Message Passing (ms)   Proposed Method (ms)
  10000     8.33                        5.78                                3.02
  20000     17.01                       15.08                               6.28
  40000     35.54                       3.49                                14.903
  60000     46.46                       45.09                               18.07
  80000     77.61                       63.71                               26.48
  100000    98.331                      73.29                               31.31

Table 8.8: Statistics for total time on 10 sites
Figure 8.5: Comparison for total time on 5 sites
Figure 8.6: Comparison for total time on 10 sites
Figure 8.7: Communications Overhead in Centralized Approach
Figure 8.8: Communications Overhead in Intermediate Message Passing Approach
Figure 8.9: Communications Overhead in Proposed Approach
  Records   Proposed Method (ms)   Proposed Method with Apache Hadoop (ms)
  10000     6.4                    8.32
  20000     14.3                   17.21
  40000     29.32                  25.15
  60000     36                     32.56
  80000     58.6                   51.48
  100000    70.73                  62.23

Table 8.9: Total time for 2 sites
  Records   Proposed Method (ms)   Proposed Method with Apache Hadoop (ms)
  10000     3.33                   4.47
  20000     7.61                   8.55
  40000     16.91                  13.27
  60000     21.65                  19.52
  80000     32.4                   24.73
  100000    38.05                  31.04

Table 8.10: Total time for 5 sites
  Records   Proposed Method (ms)   Proposed Method with Apache Hadoop (ms)
  10000     3.02                   3.89
  20000     6.28                   7.53
  40000     14.903                 10.28
  60000     18.07                  14.31
  80000     26.48                  19.34
  100000    31.31                  23.01

Table 8.11: Total time for 10 sites
The communication overhead (in milliseconds) of the centralized, intermediate
message passing and proposed approaches is discussed here. The experimental
results show that the communication overhead of the proposed approach is much
lower than that of the other two approaches. In the centralized approach the
communication overhead remains relatively constant as the number of sites
increases, but the processing time grows and the load concentrates on a
single machine. In addition, the experimental results show that the
implementation with Apache Hadoop Map-Reduce is much faster than the first
implementation.
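As a quick sanity check on these tables, the per-row speedup of the proposed method can be computed directly from the Table 8.6 figures; this is a small sketch of ours, with variable names that are not from the thesis:

```python
# Total times (ms) from Table 8.6 (2 sites)
records      = [10000, 20000, 40000, 60000, 80000, 100000]
centralized  = [17.7, 40.8, 102.1, 130.6, 205.7, 229.3]
intermediate = [11.021, 25.83, 59.79, 97.59, 128.98, 160.33]
proposed     = [6.4, 14.3, 29.32, 36, 58.6, 70.73]

# Speedup of the proposed method over each baseline, row by row
for n, c, i, p in zip(records, centralized, intermediate, proposed):
    print(f"{n:>6} records: {c/p:.2f}x vs centralized, {i/p:.2f}x vs intermediate")
# At 100000 records the proposed method is roughly 3.2x and 2.3x faster.
```

Averaging such ratios across all site configurations is presumably how the overall 3.14x and 2.34x figures quoted in the conclusions were obtained.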
8.3 Conclusions
The outcome of the proposed model shows that the objectives of the research
work have been achieved. The proposed model can handle large volumes of data;
the decision trees are merged with minimal network overhead; and the global
model preserves the quality of prediction. The accuracy, error rate,
specificity, sensitivity, precision, recall and training time reported in the
tables above are better than those of the existing systems. The total time
for local model generation and communication in the proposed approach is 3.14
and 2.34 times faster than the centralized and intermediate message passing
approaches, respectively. The student admission data sets for the years
2013-14 and 2014-15 were used to train the model, which was then applied to
the 2015-16 student admission data set and gave 98.03% accuracy for the
prediction. These experimental results have also been verified using 10-fold
cross validation.
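The 10-fold cross validation mentioned above splits the data into ten parts, trains on nine and tests on the held-out tenth, rotating the test fold. A minimal sketch of the fold bookkeeping (the classifier itself is abstracted away; the function name is illustrative, not from the thesis):

```python
def k_fold_indices(n_records, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_records))
    fold_size = n_records // k
    for f in range(k):
        # the last fold absorbs any remainder so every record is used
        start = f * fold_size
        stop = (f + 1) * fold_size if f < k - 1 else n_records
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test

# Every record is tested exactly once across the 10 folds
folds = list(k_fold_indices(321))        # e.g. the 321-record student admission set
tested = sorted(i for _, test in folds for i in test)
assert tested == list(range(321))
```

In each round the model would be trained on the `train` indices and its accuracy measured on `test`; the ten accuracies are then averaged.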
Decision tree learning on massive datasets is a common data mining task in
distributed environments, yet many state-of-the-art tree learning algorithms
require the training data to reside in memory on a single machine. While more
scalable implementations of tree learning have been proposed, they typically
require specialized parallel computing architectures. Moreover, most existing
approaches are static in nature, not domain-free, not scalable, and do not
preserve accuracy.
Our literature review and experiments on merging decision trees showed that
training time, communication overhead and accuracy are the major challenges.
To reduce the training time, the proposed algorithm processes only the new
dataset against the already trained model, which makes it scalable and
dynamic; to reduce the communication overhead, the local models are converted
into XML files; and to preserve accuracy, the proposed algorithm incorporates
a set of rule merging policies.
The proposed approach is divided into two major phases. In the first phase
the objectives were (a) to minimize the training time and (b) to reduce the
communication overhead, through four sub-phases: 1) generating the local
decision tree model; 2) parsing the decision tree into a decision table;
3) converting the decision table into an XML file to reduce the communication
overhead; and 4) applying the scalable approach to the new data set. The
total time for local model generation and communication in the proposed
approach is 3.14 and 2.34 times faster than the centralized and intermediate
message passing approaches, respectively.
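Sub-phases 2 and 3 can be sketched as follows. This is an illustrative reconstruction, not the thesis implementation: the rule representation and the XML tag names (`rules`, `rule`, `cond`) are our own assumptions about one reasonable encoding.

```python
import xml.etree.ElementTree as ET

def rules_to_xml(rules):
    """Serialize decision-table rows (condition dict -> class label) as compact XML."""
    root = ET.Element("rules")
    for conditions, label in rules:
        rule = ET.SubElement(root, "rule", {"class": label})
        for attr, value in conditions.items():
            ET.SubElement(rule, "cond", {"attr": attr, "value": value})
    return ET.tostring(root, encoding="unicode")

# A tiny decision table extracted from a local tree (hypothetical attributes)
local_rules = [
    ({"rank": "high", "category": "open"}, "YES"),
    ({"rank": "low"}, "NO"),
]
xml_model = rules_to_xml(local_rules)
print(xml_model)   # this compact string is what a site would ship to the coordinator
```

Shipping only such a serialized rule set, rather than the raw training records, is what keeps the per-site communication cost small.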
During the second phase of the proposed approach, we introduced several rule
merging policies to preserve quality, performing model intersection,
filtering and reduction. We compared the resulting merged and trained model
with the model generated by the J48 algorithm on the combined data; the
merged model preserves the accuracy.
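The intersection/filtering/reduction idea can be illustrated with a simple sketch. This is our own schematic, with a deliberately simplified rule representation and hypothetical names; the actual merging policies in the thesis are richer:

```python
def merge_rule_sets(site_a, site_b):
    """Combine two sites' rules: keep agreeing rules once, drop contradictions.

    A rule is (frozenset of (attribute, value) conditions, class label).
    """
    merged = {}
    for conditions, label in site_a + site_b:
        if conditions in merged and merged[conditions] != label:
            merged[conditions] = None          # contradictory rules are filtered out
        else:
            merged.setdefault(conditions, label)
    # reduction: discard the filtered (None) entries and de-duplicate
    return [(c, l) for c, l in merged.items() if l is not None]

a = [(frozenset({("rank", "high")}), "YES"), (frozenset({("rank", "low")}), "NO")]
b = [(frozenset({("rank", "high")}), "YES"), (frozenset({("rank", "mid")}), "YES")]
print(merge_rule_sets(a, b))   # three surviving rules; the duplicate is kept once
```

Here the duplicated rule survives once (intersection), conflicting rules would be dropped (filtering), and the output list is de-duplicated (reduction).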
The proposed approach has been implemented in two ways: in C# and as
Map-Reduce on Apache Hadoop. The experimental results show that the latter is
much faster as the volume of data and the number of processing sites (data
nodes) increase. In addition, Apache Hadoop provides fault tolerance and
horizontal scalability for processing very large volumes of data.
We have proposed a scalable and dynamic distributed approach for learning
tree models over large datasets, which frames tree learning as a series of
distributed computations. We have shown how this approach supports dynamic
and scalable construction of decision tree models, as well as ensembles of
such models. The proposed approach is more efficient than the existing ones.
8.4 Future Work
This research has been carried out on non-streaming data sets, where the data
are stored in data warehouses at different sites. The work can be further
extended to streaming data, which arrive continuously and must be processed
in real time; examples include intra-day stock market data generated every
moment, weather data, scientific data and other data generated in real time
at different geographical sites. In future, this research can be extended to
streaming data in distributed environments.

Since Hadoop does not perform in-memory processing, it is not well suited to
real-time analytics on large data sets: it writes results to disk after each
operation, which can reduce throughput. In future, a framework such as Spark,
which performs in-memory operations and also supports streaming data, may be
used for real-time analytics to improve throughput.
8.5 Summary
In this chapter the experimental results have been presented in detail. The
comparative analysis for different numbers of sites (2, 5 and 10), different
numbers of records and different data sets has experimentally shown that the
proposed dynamic and scalable approach is faster and more accurate. The
proposed approach is 3.14 and 2.34 times faster than the centralized and
intermediate message passing approaches, respectively. Finally, it has been
noted that this research can be further extended to streaming data for
applications such as intra-day stock market analysis, weather forecasting and
data generated by real-time applications; Spark can be used to perform
in-memory operations for real-time analytics.
List of Paper Publications
1. Students' Admission Prediction using GRBST with Distributed Data Mining, Communications on Applied Electronics (CAE), ISSN: 2394-4714, Foundation of Computer Science (FCS), New York, USA, Volume 2, No. 1, June 2015
2. A Proposed DDM Algorithm and Framework for EDM of Gujarat Technological University, International Conference on Advances in Engineering, organized by Saffrony Institute of Technology, 22nd-23rd January 2015
3. An Approach on Early Prediction of Students' Performance in University Examination of Engineering Students Using Data Mining, International Journal of Scientific Research and Management Studies (IJSRMS), ISSN: 2349-3371, Volume 1, Issue 5, pp. 156-161
4. Faculty Performance Evaluation Based on Prediction in Distributed Data Mining, 2015 IEEE ICETECH, Coimbatore
5. Prediction and Analysis of Student Performance using Distributed Data Mining, International Conference on Information, Knowledge & Research in Engineering, Management and Sciences (IC-IKR-EMS), 7th Dec 2014, KIT, Gujarat. IJETAETS, ISSN 0974-3588
6. Prediction and Analysis of Faculty Performance using Distributed Data Mining, International Conference on Information, Knowledge & Research in Engineering, Management and Sciences (IC-IKR-EMS), 7th Dec 2014, KIT, Gujarat. IJETAETS, ISSN 0974-3588
7. A Decision Support Application for Student Admission Process Based on Prediction in Distributed Data Mining, International Conference on Information, Knowledge & Research in Engineering, Management and Sciences (IC-IKR-EMS), 7th Dec 2014, KIT, Gujarat. IJETAETS, ISSN 0974-3588
8. A Dynamic and Scalable Evolutionary Data Mining for Distributed Environments, NCEVT-2013, PIET, Limda
9. An Approach of E-Governance with Distributed Data Mining for Student Performance Prediction, Springer International Conference ICICT, October 2015, Udaipur
10. Dynamic and Scalable Data Mining with an Incremental Decision Trees Merging Approach for Distributed Environment, Doctoral Conference 2016 (DocCon 2016), Udaipur, March 2016. (Under Publication)
List of Book/Book Chapter Publications
1. Book: "Data Mining Techniques for Distributed Database Environment", Dineshkumar B. Vaghela and Dr. Priyanka Sharma, ISBN 978-3-659-94945-6, Lambert Academic Publishing, 2016
2. Book Chapter: "Web Usage Mining Techniques and Applications across Industries", ed. Dr. A.V. Senthil, ISBN 978-1-522-50613-3, IGI Global international publication, 2016
Patents/Copyright (if any)
Title A Dynamic And Scalable Evolutionary Data Mining& KDD For Distributed Environment
Filed At Copyright Office, New DelhiDairy Num-ber
3460/2016-CO/SW
Applicants &Inventors
Dineshkumar Bhagwandas Vaghela and Dr.Priyanka Sharma
ApplicationStatus
Waiting
Objection Re-ceived
Not Yet
Web Link http://copyright.gov.in/frmStatusGenUser.aspx