A DYNAMIC AND SCALABLE EVOLUTIONARY DATA MINING & KDD
FOR DISTRIBUTED ENVIRONMENT
A Thesis submitted to Gujarat Technological University for the Award of
Doctor of Philosophy
in
Computer/IT Engineering
by
Dineshkumar Bhagwandas Vaghela
Enrollment No. 119997107013
Under supervision of
Dr. Priyanka Sharma
Head, I.T. Department, Raksha Shakti University,
Meghaninagar
GUJARAT TECHNOLOGICAL UNIVERSITY - AHMEDABAD
January-2017
© Dineshkumar Bhagwandas Vaghela
Declaration

I declare that the thesis entitled "A Dynamic And Scalable Evolutionary
Data Mining & KDD In Distributed Environment" submitted by me for
the degree of Doctor of Philosophy is the record of research work carried out by
me during the period from July 2011 to March 2016 under the supervision of
Dr. Priyanka Sharma (Prof. and Head of IT, Raksha Shakti University),
and that this has not formed the basis for the award of any degree, diploma,
associateship, fellowship, or title in this or any other university or other
institution of higher learning.
I further declare that the material obtained from other sources has been duly
acknowledged in the thesis. I shall be solely responsible for any plagiarism or
other irregularities, if noticed in the thesis.
Signature of Research Scholar:
Name of Research Scholar: Dineshkumar Bhagwandas Vaghela
Place: Ahmedabad
Date:
Certificate

I certify that the work incorporated in the thesis "A Dynamic And Scalable
Evolutionary Data Mining & KDD In Distributed Environment" submitted
by Mr. Dineshkumar Bhagwandas Vaghela was carried out by the
candidate under my supervision/guidance. To the best of my knowledge: (i) the
candidate has not submitted the same research work to any other institution
for any degree/diploma, Associateship, Fellowship or other similar titles; (ii)
the thesis submitted is a record of original research work done by the Research
Scholar during the period of study under my supervision, and (iii) the thesis
represents independent research work on the part of the Research Scholar.
Signature of Supervisor:
Name of Supervisor: Dr. Priyanka Sharma
Place: Ahmedabad
Date:
Originality Report Certificate

It is certified that the PhD Thesis titled "A Dynamic And Scalable Evolutionary
Data Mining & KDD In Distributed Environment" submitted by
Mr. Dineshkumar Bhagwandas Vaghela has been examined by me. I undertake
the following:
1. The thesis contains significant new work/knowledge as compared to work
already published or under consideration for publication elsewhere. No sentence,
equation, diagram, table, paragraph or section has been copied verbatim
from previous work unless it is placed under quotation marks and duly
referenced.
2. The work presented is the original work of the author (i.e. there is
no plagiarism). No ideas, processes, results or words of others have been
presented as the author's own work.
3. There is no fabrication of data or results which have been compiled /
analyzed.
4. There is no falsification by manipulating research materials, equipment or
processes, or changing or omitting data or results such that the research
is not accurately represented in the research record.
5. The thesis has been checked using https://turnitin.com (copy of orig-
inality report attached) and found within limits as per GTU Plagiarism
Policy and instructions issued from time to time (i.e. permitted similarity
index <= 25%).
Signature of Research Scholar: Date:
Name of Research Scholar: Dineshkumar Bhagwandas Vaghela
Place: Ahmedabad
Signature of Supervisor: Date:
Name of Supervisor: Dr. Priyanka Sharma
Place: Ahmedabad
PhD THESIS Non-Exclusive License to GUJARAT
TECHNOLOGICAL UNIVERSITY
In consideration of being a PhD Research Scholar at GTU and in the interests of
the facilitation of research at GTU and elsewhere, I, Dineshkumar Bhagwandas
Vaghela, having Enrollment No. 119997107013, hereby grant a non-exclusive,
royalty free and perpetual license to GTU on the following terms:
1. GTU is permitted to archive, reproduce and distribute my thesis, in whole
or in part, and/or my abstract, in whole or in part (referred to collectively
as the Work) anywhere in the world, for non-commercial purposes, in all
forms of media;
2. GTU is permitted to authorize, sub-lease, sub-contract or procure any of
the acts mentioned in paragraph (1);
3. GTU is authorized to submit the Work at any National / International
Library, under the authority of their Thesis Non-Exclusive License;
4. The Universal Copyright Notice © shall appear on all copies made under
the authority of this license;
5. I undertake to submit my thesis, through my University, to any Library
and Archives. Any abstract submitted with the thesis will be considered
to form part of the thesis.
6. I represent that my thesis is my original work, does not infringe any rights
of others, including privacy rights, and that I have the right to make the
grant conferred by this non-exclusive license.
7. If third party copyrighted material was included in my thesis for which,
under the terms of the Copyright Act, written permission from the copy-
right owners is required, I have obtained such permission from the copy-
right owners to do the acts mentioned in paragraph (1) above for the full
term of copyright protection.
8. I retain copyright ownership and moral rights in my thesis, and may deal
with the copyright in my thesis, in any way consistent with rights granted
by me to my University in this non-exclusive license.
9. I further promise to inform any person to whom I may hereafter assign
or license my copyright in my thesis of the rights granted by me to my
University in this non-exclusive license.
10. I am aware of and agree to accept the conditions and regulations of PhD
including all policy matters related to authorship and plagiarism.
Signature of Research Scholar: Date:
Name of Research Scholar: Dineshkumar Bhagwandas Vaghela
Place: Ahmedabad
Signature of Supervisor: Date:
Name of Supervisor: Dr. Priyanka Sharma
Place: Ahmedabad
Thesis Approval Form

The viva-voce of the PhD Thesis submitted by Mr. Dineshkumar Bhagwandas
Vaghela (Enrollment No. 119997107013) entitled "A Dynamic And Scalable
Evolutionary Data Mining & KDD In Distributed Environment" was
conducted on Date: , at Gujarat Technological University.
(Please tick any one of the following option)
The performance of the candidate was satisfactory. We recommend that
he be awarded the PhD degree.
Any further modifications in the research work recommended by the panel
after 3 months from the date of the first viva-voce, upon request of the
Supervisor or request of the Independent Research Scholar, after which the
viva-voce can be re-conducted by the same panel again.
The performance of the candidate was unsatisfactory. We recommend
that he should not be awarded the PhD degree.
Name & Sign. of Supervisor with Seal External Examiner-1 Name & Sign.
External Examiner-2 Name & Sign. External Examiner-3 Name & Sign.
Abbreviations

ML Machine Learning
PUWP Parul University Web Portal
DT Decision Tree
SVM Support Vector Machine
DM Data Mining
DDM Distributed Data Mining
KDD Knowledge Discovery and Data Mining
ARM Association Rule Mining
DB DataBase
XML eXtensible Markup Language
OLAP Online Analytical Processing
DWDM Data Warehousing and Data Mining
RM Research Methodology
CHAID CHi-squared Automatic Interaction Detector
AID Automatic Interaction Detector
THAID Theta Automatic Interaction Detector
CART Classification and regression tree
HODA Hierarchical Optimal Discriminant Analysis
ID3 Iterative Dichotomiser 3
UCI University of California, Irvine
TP True Positive
TN True Negative
FP False Positive
FN False Negative
IBL Instance Based Learning
k-NN k- Nearest Neighbor
CD Concept Description
SPDT Streaming Parallel Decision Tree
SLIQ Supervised Learning In Quest
SPRINT Scalable Parallelizable Induction of decision tree
MDL Minimum Description Length
Abstract

Many fields, such as biology, education, environmental research, sensor
networks, the stock market and weather forecasting, produce very large
volumes of data. These data are produced with a high degree of velocity and
variety due to the immense and dynamic growth of these fields. The vast
use of the internet in distributed environments has generated an urgent need
for new techniques and tools that can intelligently and automatically
transform the processed data into useful information and knowledge. This is
why data mining has become a research area of increasing importance for
analyzing such large volumes of data efficiently and effectively. As more and
more data are continuously collected at this velocity and scale, formalizing
the process of big data analysis becomes paramount.
Large volumes of data that are geographically spread across the globe tend
to generate a very large number of models. The heterogeneous nature of the
data, the resultant models and the techniques raises the problem of how to
generalize knowledge in order to obtain a global view of the phenomena
across the entire organization.
Many data mining techniques have been introduced for different analytical
processes such as clustering, frequent pattern mining, classification, rare
item set finding and many more. Among these, classification and prediction
are two forms of data analysis that can be used to extract models describing
important data classes or to predict future data trends. Data analysis in
these ways can provide a better understanding of the data at large. Usually,
classification predicts categorical (discrete, unordered) labels, while
prediction models predict continuous-valued functions. Many classification
and prediction methods have been proposed by researchers in machine
learning, pattern recognition and statistics. As per the state of the art and
the literature survey, most of these algorithms are memory resident,
typically assume a small data size, are static in nature, and are neither
scalable nor domain-free. Recent data mining research has built on such
work, developing scalable classification and prediction techniques capable
of handling large disk-resident data. The application of the classical
knowledge discovery process in distributed environments requires the
collection of distributed data in a data warehouse for central processing.
However, this is usually either ineffective or infeasible for several
reasons such as (1) storage cost, (2) communication cost, (3) computational
cost and (4) private and sensitive data. From the literature review, the most
pressing issues for prediction in a distributed environment are RAM size due
to the very large volume of the data set, scalability with the size of the data
set, and the dynamic nature of learning.
In decision tree learning, the training data set is used for learning, i.e.
generating the decision tree model. In a distributed environment, the data at
different sites are not correlated with other sites' data, and hence a local-site
decision tree model is not sufficient to produce a global view for prediction.
There are two approaches so far: in the first, all data sets are collected at one
location where the data mining operation is performed (i.e. central-site
processing); in the second, intermediate messages are passed among the
sites involved in training the model. In the latter approach, the participating
sites have to communicate with each other by passing their intermediate
trained models to generate the global model. The main limitation of this
approach is the overhead of message passing. Since the two approaches
stated above are neither effective nor efficient, they are the motivation for
this research. The objectives of the research are: 1) to minimize the training
time and communication overhead, and 2) to preserve the prediction quality.
In this research, an effective and efficient approach has been proposed that
generates a global decision tree in a distributed environment to extract the
global knowledge.
This research has been carried out on a real data set of Parul University for
predicting student admission in different fields/branches of different
colleges. In the first phase, these data were collected from the Parul
University Web Portal; more than 1,00,000 records in total were used for
training. As the data were collected from the university portal, they had to
be pre-processed to remove noise, outliers and missing values. The data are
stored in .csv file format.
In the second phase of this research, the J48 algorithm (complexity O(mn²))
generates the decision tree at each local site. In the third phase, the decision
tree at each site is converted into decision rules using the proposed parser.
These decision rules are then converted into decision tables. In the fourth
phase, to reduce the transmission cost, the decision tables are converted into
XML files and sent to the coordinator site. In the fifth phase, the global tree
model is generated at the coordinator site by consolidating the decision
tables formed from the XML files.
In the sixth phase, the data set has been equally partitioned into subsets
equal in number to the number of sites. The experiments have been
performed on 10k, 20k, 50k and 100k records (here k means thousand) at 2, 5
and 10 sites. The local training models have been generated and merged
using the proposed approach. The accuracy of these global models has been
checked on test data sets; the accuracy in classifying the test data set exceeds
98.03%. The results of the baseline comparison clearly show that accuracy,
training time, communication overhead and other parameters have been
optimized. The student admission data sets for the years 2013-14 and
2014-15 have been used to train the model, and this model has been applied
to the student admission data set for the year 2015-16, giving more than
98.03% accuracy for prediction. These experimental results have also been
verified using 10-fold cross validation.
Dedicated
To
My Parents (Surajben and Bhagwanbhai),
My Wife (Leela)
And
My Children (Late Mittal, Devendra and Mamta)
Acknowledgement

First of all, my deepest gratitude goes to my supervisor, Dr. Priyanka
Sharma, Professor and Head of the IT Department, Raksha Shakti
University, for her consistent support, supervision, guidance and inspiration
during my doctoral programme. Her invaluable suggestions and constructive
criticisms from time to time enabled me to complete my work successfully.
The completion of this work would not have been possible without the
Doctorate Progress Committee (DPC) members: Dr. Sanjaykumar Vij, Dean
(CSE, MCA and MBA), ITM Universe, and Dr. Bankim Patel, Director of
SRIMCA, Uka Tarsadia University. I am really thankful for their rigorous
examinations and precious suggestions during my research.
My gratitude also goes to Dr. Akshai Aggarwal, Ex-Vice Chancellor, Dr.
Rajul Gajjar, Dean, PhD Programme, Dr. N. M. Bhatt, Dean, PhD
Programme, Mr. J. C. Lilani, I/C Registrar, Ms. Mona Chaurasiya, Research
Coordinator, Mr. Dhaval Gohil, Data Entry Operator, and the other staff
members of the PhD Section, GTU, for their assistance and support.
Most importantly, none of this would have been possible without the love
and patience of my wife and family members. My wife, to whom this
dissertation is dedicated, has been a constant source of love, concern,
support and strength all these years. My family members have aided and
encouraged me throughout this endeavor. I would like to express my
heartfelt gratitude to all of them. Finally, I must give a special mention to the
direct and indirect support given by my colleagues.
I would like to address special thanks to the reviewers of my thesis for
accepting to read and review this thesis and approving it. I would like to
thank all the researchers whose works I have used, initially in understanding
my field of research and later for updates. I would also like to thank the
many people who have taught me, starting with my school teachers, my
undergraduate teachers and my postgraduate teachers.
Dineshkumar B. Vaghela
List of Figures
2.1 Induction: Model Construction . . . 18
2.2 Deduction of test data using the model/classifier . . . 19
2.3 Posterior Probability of Naive Bayes . . . 20
2.4 Classification based on linear SVM . . . 23
2.5 Classification based on Hard SVM . . . 23
2.6 Nonlinear Classification . . . 25
2.7 Distance functions equations . . . 27
2.8 Hamming Distance . . . 27
2.9 Decision tree based classification for car subscription . . . 32
3.1 Decision tree . . . 47
4.1 Proposed Framework . . . 66
4.2 Local site Processing . . . 68
4.3 Coordinator site Processing . . . 70
4.4 Proposed system architecture for dynamic and scalable decision tree generation . . . 71
4.5 Decision table merging process to generate the global decision tree . . . 72
5.1 Data set at site S1 . . . 80
5.2 Decision Tree generated at local site S1 . . . 81
5.3 Detailed accuracy and the confusion matrix . . . 82
5.4 Decision table generated at local site S1 . . . 84
5.5 The XML file at local site S1 . . . 84
5.6 Dynamic and Scalable decision tree generation . . . 85
5.7 The Decision table merging process to generate the global decision tree . . . 86
6.1 Forms of Data Preprocessing . . . 93
6.2 Site Selection . . . 97
6.3 Run J48 algorithm to each site . . . 97
6.4 Load/Save the training model . . . 98
6.5 Decision Tree and Decision Table at each site . . . 98
6.6 Combined Decision Tree and Decision Table . . . 99
6.7 Branch wise decision rules . . . 99
7.1 Apache Hadoop Architecture . . . 102
7.2 Architecture and job execution flow in Hadoop Map Reduce version 1.x (MRv1) . . . 104
7.3 Overview of Map-Reduce Model . . . 106
7.4 Apache Hadoop Installation . . . 109
7.5 Hadoop MapReduce Administration . . . 109
7.6 Two Node cluster . . . 111
7.7 Cluster Summary . . . 111
7.8 Contents of Directory . . . 112
7.9 Directory Log . . . 112
8.1 Performance measures . . . 116
8.2 Recall and Training Time (Sec) . . . 117
8.3 Performance comparison . . . 118
8.4 Comparison for total time on 2 sites . . . 119
8.5 Comparison for total time on 5 sites . . . 120
8.6 Comparison for total time on 10 sites . . . 120
8.7 Communications Overhead in Centralized Approach . . . 121
8.8 Communications Overhead in Intermediate Message Passing Approach . . . 121
8.9 Communications Overhead in Proposed Approach . . . 122
List of Tables
2.1 Advantages of different classification algorithms . . . 40
2.2 Feature comparisons . . . 41
2.3 Comparison of Classification Algorithms . . . 42
3.1 Performance based comparisons of different Decision tree algorithms . . . 51
3.2 Comparisons between different Decision Tree Algorithms . . . 51
3.3 Comparison of Merits and Demerits of Decision Tree Algorithms . . . 52
3.4 Merge Models with combination of rules: Examples . . . 55
5.1 Overall distributions of Site S2 instances with respect to attributes . . . 78
5.2 Detail distribution of Site S2 instances with respect to attribute values . . . 78
6.1 ZOO DataSet . . . 91
6.2 Student admission data set collected from Parul University Web Portal . . . 91
6.3 Student performance data set collected from Departments of PIT College . . . 92
8.1 Confusion Matrix for Site1 . . . 115
8.2 Confusion Matrix for Site2 . . . 115
8.3 Confusion Matrix for combined data set at coordinator site . . . 115
8.4 Comparative performance at site1, site2 and coordinator site . . . 116
8.5 Performance statistics for three different data sets . . . 117
8.6 Statistics for total time on 2 sites . . . 118
8.7 Statistics for total time on 5 sites . . . 119
8.8 Statistics for total time on 10 sites . . . 119
8.9 Total time for 2 sites . . . 122
8.10 Total time for 5 sites . . . 122
8.11 Total time for 10 sites . . . 123
Contents
1 Introduction . . . 7
1.1 Introduction . . . 7
1.2 Background History . . . 9
1.3 Motivation . . . 11
1.4 Contribution of the research . . . 12
1.5 Organization of thesis . . . 13
1.6 References . . . 14
2 Classification Techniques . . . 17
2.1 Introduction . . . 17
2.2 Classification techniques . . . 18
2.2.1 Naïve Bayesian . . . 19
2.2.2 Support Vector Machines (SVM) . . . 21
2.2.3 K-Nearest Neighbor Classifier . . . 25
2.2.4 Instance Based Learning (IBL) . . . 28
2.2.5 Rule Based Classification . . . 30
2.2.6 Neural Networks . . . 30
2.2.7 Decision Tree . . . 31
2.3 Attribute Splitting Measures . . . 34
2.4 Decision Tree Classification . . . 36
2.4.1 Tree Building Phase . . . 37
2.4.2 Tree Pruning Phase . . . 39
2.5 Comparison of classification techniques . . . 40
2.6 Summary . . . 43
2.7 References . . . 43
3 Literature Survey On Decision Tree . . . 46
3.1 Introduction . . . 46
3.2 First Phase Study . . . 47
3.3 Second Phase Study . . . 50
3.4 Third Phase Study . . . 55
3.5 Challenges with DT merging . . . 57
3.6 Summary . . . 58
3.7 References . . . 58
4 Proposed Approach . . . 62
4.1 Introduction . . . 62
4.2 Problem Statement . . . 62
4.3 Objective and Scope of Research . . . 63
4.4 Original Contribution by thesis . . . 64
4.5 Proposed Architecture . . . 65
4.6 Proposed Algorithm . . . 66
4.6.1 Algorithm steps at local site . . . 66
4.6.2 Algorithm steps at coordinator site . . . 69
4.7 System architecture at local site . . . 71
4.8 System architecture at coordinator site . . . 71
4.9 Summary . . . 74
4.10 References . . . 74
5 Working of the Proposed Model . . . 77
5.1 Introduction . . . 77
5.2 Local site algorithm computation . . . 77
5.2.1 Building the decision tree . . . 80
5.2.2 Rule Generation . . . 81
5.2.3 Decision Table . . . 83
5.2.4 XML File Generation . . . 83
5.3 Coordinator site algorithm computation . . . 85
5.4 Summary . . . 87
5.5 References . . . 87
6 Data Collection, Preprocessing and Implementation . . . 89
6.1 Introduction . . . 89
6.2 Data Collection . . . 90
6.2.1 Zoo Data set . . . 90
6.2.2 Student Admission Data Set . . . 90
6.2.3 Student Performance Data Set . . . 92
6.3 Data Pre-Processing . . . 92
6.4 Test Data Set . . . 96
6.5 Implementation . . . 96
6.6 Summary . . . 100
7 Implementation with Apache Hadoop . . . 101
7.1 Introduction to Apache Hadoop . . . 101
7.2 Hadoop Map-Reduce . . . 101
7.2.1 Map . . . 102
7.2.2 Reduce . . . 103
7.3 HDFS . . . 103
7.4 Decision Tree Map-Reduce . . . 105
7.4.1 Data Preparation . . . 105
7.4.2 Selection . . . 106
7.4.3 Update . . . 107
7.4.4 Tree Growing . . . 107
7.5 Apache Hadoop Cluster . . . 108
7.6 Conclusion . . . 113
7.7 References . . . 113
8 Results, Conclusions and Future Enhancements . . . 114
8.1 Introduction . . . 114
8.2 Results . . . 114
8.3 Conclusions . . . 123
8.4 Future Work . . . 125
8.5 Summary . . . 125
Chapter 1
Introduction
1.1 Introduction
Different fields like education, sensor networks, the Internet of Things (IoT),
biology, the stock market, weather forecasting and many more generate data
at a rapid speed, of different variety and of large volume. The rapid growth
in the volume of data is due to the enormous use of the internet in
distributed environments [1], and because of this there is a pressing need for
new approaches and techniques which can easily and automatically convert
the already processed data into valuable, decision-supporting information or
knowledge. This can only be achieved by properly processing and analyzing
the data with appropriate techniques. All these techniques are part of data
mining, which is the main reason why data mining has taken on significant
importance for data analytics. As more and more data are continuously
collected at this velocity and scale, formalizing the process of big data
analysis becomes paramount. Large volumes of data that are geographically
spread across the globe tend to generate a very large number of models. The
heterogeneous nature of the data, the resultant models and the techniques
raises issues in generalizing the knowledge for a global view of the
phenomena across the entire organization.
Many data mining techniques have been introduced for different analytical
processes such as clustering, frequent pattern mining, classification, rare
item set finding and many more. Among these, two techniques, namely
classification and prediction, are used to extract models that describe the
class labels of data, and these models can also be used to predict
the future data trends. Data analysis in these ways can provide a better
understanding of large data sets. Usually, classification predicts categorical
(discrete, unordered) labels or classes, while prediction models predict
continuous-valued functions. In machine learning, pattern recognition and
statistics, researchers have proposed many methods for both classification
and prediction. As per the state of the art and the literature survey, most of
these algorithms are in-memory, generally work on data of small size, are
not dynamic in nature, not scalable and not domain-free. From the state of
the art, it can be concluded that at present much research is being carried out
to support scalable classification and prediction capable of handling large
disk-resident data. The application of the classical knowledge discovery
process in distributed environments requires the collection of distributed
data in a data warehouse for central processing. However, this is usually
either ineffective or infeasible for several reasons such as (1) storage cost, (2)
communication cost, (3) computational cost and (4) private and sensitive
data. From the literature review, the most pressing issues for prediction in a
distributed environment are RAM size due to the very large volume of the
data set, scalability with the size of the data set and the dynamic nature of
learning.
Classification is mainly used to analyse a given training data set: it takes
each instance and assigns it to a particular class such that the classification
error is minimised. It extracts models that accurately describe the important
data classes within the given data set. Classification is a two-step process.
In the first step, a model is created by applying a classification algorithm to
the training data set; in the second step, the extracted model is tested against
a predefined test data set to measure the trained model's performance and
accuracy. Classification is thus the process of assigning class labels to data
whose class labels are unknown.
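The two-step process above can be sketched with a deliberately tiny, hypothetical model (a single-threshold "decision stump" on one numeric feature, not any algorithm from this thesis): step 1 builds the model on the training set, step 2 measures its accuracy on a held-out test set.

```python
# Step 1 (model construction) and step 2 (model evaluation) with a toy
# one-feature threshold classifier; data and labels are illustrative only.

def train_stump(rows):
    """Step 1: learn the threshold on the single numeric feature that
    classifies the most training instances correctly."""
    best = None
    for t in sorted({x for x, _ in rows}):
        # predict 'yes' when feature >= t, 'no' otherwise
        correct = sum((('yes' if x >= t else 'no') == y) for x, y in rows)
        if best is None or correct > best[1]:
            best = (t, correct)
    return best[0]

def predict(threshold, x):
    return 'yes' if x >= threshold else 'no'

# Step 1: build the model from the training set
train = [(1, 'no'), (2, 'no'), (3, 'no'), (7, 'yes'), (8, 'yes'), (9, 'yes')]
t = train_stump(train)

# Step 2: test the trained model against a predefined test set
test = [(2, 'no'), (8, 'yes'), (6, 'yes')]
accuracy = sum(predict(t, x) == y for x, y in test) / len(test)
```

The same protocol applies unchanged when the stump is replaced by a full decision tree learner.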
The training data set is used for learning, to generate the decision tree
model. In a distributed environment, the local decision tree model generated at
each site is not sufficient to provide a global view for prediction, because
the local training data sets at geographically spread sites are not correlated
with one another. To generate a global decision tree, one approach is to
collect all the data sets at one location and then perform the data mining
operation there. The other approach is intermediate message passing among the
sites involved in training the model. These participating sites have to
communicate with each other by passing their intermediate trained models to
generate the global model. This leads to multiple message exchanges, which
cause communication overhead. Both approaches are therefore neither effective
nor efficient, and this is the motivation for this research. The objectives of
the research are: 1) to minimise the training time and communication overhead,
and 2) to preserve the prediction quality. This research proposes an effective
and efficient approach to global decision tree generation in a distributed
environment for extracting the global knowledge.
In this chapter, section 1.2 presents the background history of data mining
(decision tree based classification) in distributed environments, section 1.3
discusses the shortcomings that motivate this research, section 1.4 briefly
describes the contribution of the research, and section 1.5 introduces the
organisation of the thesis.
1.2 Background History
Analytical tools are used in data mining to discover unknown and useful
patterns and the relationships among them in large volumes of data, and to
predict future patterns/classes by training a model on the available data set.
For this, data mining tools apply machine learning techniques, a relevant
statistical model, or a mathematically grounded algorithm. Beyond this core
functionality, data mining also encompasses collecting data from various
sources, pre-processing it, and managing it for proper processing. Data mining
involves clustering, classification, regression, frequent pattern generation
and many other analysis and processing facilities. A wider range of data can be
processed by classification than by either regression or correlation, which is
the main reason the popularity of classification keeps growing.
Data mining is an important, significant and accurate machine learning [2]
application. It allows very large volumes of day-to-day data to be processed
effectively into useful analyses, which can further help prediction for
decision making. Mistakes easily occur during the analysis of large volumes of
data, especially when looking for correlations among the different features of
the data sets, and such mistakes can make it difficult to find solutions and
take decisions. These problems can be resolved by machine learning, which
improves the efficiency of the systems.
Classification techniques can process and analyse a wide range of data for
decision making. Numerous techniques are available, such as Neural Networks,
Naïve Bayesian, Support Vector Machines (SVM), the K-Nearest Neighbor
Classifier (kNN), Instance Based Learning (IBL), Rule Based Classification and
Decision Trees. Among all these techniques, the decision tree is the most
effective and easy to use, for the following two reasons:
1. Decision trees are powerful and popular tools for classification and pre-
diction [2].
2. Decision trees represent rules, which can be understood by humans and
used in knowledge systems such as databases [2].
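The second point can be made concrete by flattening a small hand-written (hypothetical) decision tree into IF-THEN rules; every root-to-leaf path becomes one human-readable rule:

```python
# A toy decision tree as a nested structure: internal nodes name the
# attribute to test, leaves carry a class label. Attribute names and
# values are illustrative, not from the thesis's data sets.
tree = ('outlook',
        {'sunny': ('humidity', {'high': 'no', 'normal': 'yes'}),
         'overcast': 'yes',
         'rainy': ('wind', {'strong': 'no', 'weak': 'yes'})})

def rules(node, path=()):
    """Walk the tree; each root-to-leaf path is one classification rule."""
    if isinstance(node, str):                       # leaf: emit the rule
        cond = ' AND '.join(f'{a}={v}' for a, v in path)
        return [f'IF {cond} THEN class={node}']
    attr, branches = node
    out = []
    for value, child in branches.items():
        out += rules(child, path + ((attr, value),))
    return out

for r in rules(tree):
    print(r)
```

This readability is exactly what makes decision trees attractive for rule-based merging in later chapters.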
In classification with decision trees [2][3], the learner uses the training
data set for learning and generates the decision tree model. In a distributed
environment, i.e. when the data are geographically spread across different
sites, special attention is needed to generate a decision tree at each site and
to combine these trees into a global view, i.e. a global model. The local
decision tree model generated at each site is not sufficient to provide the
global view for prediction, because the local training data sets at
geographically spread sites are not correlated with one another. To generate
the global decision tree, one approach is to collect all the data sets at one
location and then perform the data mining operation there. The other approach
is intermediate message passing [1] among the sites involved in training the
model. These participating sites have to communicate with each other by passing
their intermediate trained models to generate the global model. This leads to
multiple message exchanges, which cause communication overhead. Both approaches
are therefore neither effective nor efficient, and this is the motivation for
this research. The objectives of the research are: 1) to minimise the training
time and communication overhead, and 2) to preserve the prediction quality.
This research proposes an effective and efficient approach to global decision
tree generation in a distributed environment for extracting the global
knowledge.
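Why a purely local model can mislead may be illustrated with a toy sketch; the "sites", labels and stand-in majority-class models below are invented for illustration and are not the thesis's decision trees:

```python
# Each site sees a biased shard of the overall data, so its locally
# learned majority class can disagree with what pooling all the data
# would teach: the local models alone give no reliable global view.
from collections import Counter

site1 = ['spam'] * 8 + ['ham'] * 2      # hypothetical shard at site 1
site2 = ['ham'] * 9 + ['spam'] * 1      # hypothetical shard at site 2
site3 = ['ham'] * 7 + ['spam'] * 3      # hypothetical shard at site 3

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

local = [majority(s) for s in (site1, site2, site3)]   # site 1 learns 'spam'
global_model = majority(site1 + site2 + site3)         # pooled data favour 'ham'
```

Pooling the shards here is what the centralized approach does, at the cost of shipping all the data; the thesis instead targets a merge of the local models that recovers the global view.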
1.3 Motivation
In distributed environments, a series of challenges has emerged in the field of
data mining, triggered by different real-life applications. This thesis is
concerned with dynamic and scalable classification and prediction tasks for
distributed environments. The general context is therefore the classification
of (potentially large volumes of) data distributed across different
geographical sites. The motivation behind the main objectives of the thesis is
presented below.
The first issue tackled is that gathering all of the data in a centralized
location is neither feasible nor desirable, because it may require high
internet bandwidth and large storage space. For such application domains, it is
advisable to develop systems that acquire knowledge and perform the analysis at
the local sites where the data and other computing resources reside, and then
transmit the results/models to the sites that need them. However, sharing the
data of autonomous organizations also raises data privacy and security
concerns. In such situations, knowledge acquisition techniques should be
developed that can learn from statistical summaries, which can be supplied
whenever required.
The second issue in machine learning and data mining is the development of
dynamic, adaptive and inductive learning techniques that scale up to large and
possibly physically distributed data sets. Many organizations seeking added
value from their data are already dealing with overwhelming amounts of
information. The number and size of their databases and data warehouses grow at
rapid rates, faster than the corresponding improvements in machine resources
[2] and inductive learning techniques. Most current-generation learning
algorithms are computationally complex and require all data to be resident in
main memory, which is clearly untenable for many realistic problems and
databases.
The third issue is to reduce the communication overhead and the processing time
of merging decision trees without losing the predictive quality of the model.
The decision tree generated at each site has to be sent to the coordinator
site, and the size of these models causes communication overhead. At the
coordinator site there is no efficient merging algorithm that generates the
global model without losing prediction quality.
Overall, researchers have made significant contributions by proposing
algorithms for classification (here, decision trees) and prediction, and they
have also proposed different approaches for merging local decision trees. From
the state of the art it has been observed that many of these algorithms limit
their own performance: they do not cope well with small RAM (i.e. they are
memory resident), mainly work on small data sizes, are not domain-free [4], are
static in nature [4], and are inefficient in terms of processing and
communication overhead. No prior research has focused on a scalable and dynamic
classification and prediction process for data mining in distributed
environments, although considerable work is in progress: researchers are
currently developing scalable and dynamic classification and prediction
techniques able to handle large data sets in distributed environments.
The main objectives of the thesis are listed below:
1. To reduce the model (i.e. decision tree) training time and communication
time in a distributed environment with large volumes of data.
2. To introduce an efficient, scalable and dynamic approach for handling newly
generated data sets together with an already trained model.
3. To define rule merging policies for generating the global model.
4. To generate a globally interpretable model while preserving the prediction
quality.
1.4 Contribution of the research
This thesis provides major contributions in the field of dynamic and scalable
data mining in distributed environments with decision tree based
classification, as discussed in the objectives above. The outcomes of the
proposed model show that the objectives of the research work have been
achieved. The proposed model can handle large volumes of data. The decision
trees are merged with minimal network overhead compared to other approaches,
and the global model preserves the prediction quality. The experimental results
obtained with different approaches on distributed data sets, in terms of
accuracy, error rate, specificity, sensitivity, precision, recall and training
time, are better than those of the existing systems/approaches. The total time
for local model generation plus communication in the proposed approach is 3.14
and 2.34 times faster than the centralized and intermediate message passing
approaches, respectively. The student admission data sets for the years 2013-14
and 2014-15, collected from the Parul University Web Portal (PUWP), were used
to train the model, which was then applied to the student admission data set
for the year 2015-16 and achieved more than 98.03% prediction accuracy. These
experimental results were also verified using 10-fold cross-validation. Other
data sets were also used to check the performance of the proposed model. The
experimental results show that the proposed approach is better than the
existing one.
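The 10-fold cross-validation used for verification follows the standard pattern sketched below; the toy data set and the stand-in majority-class "model" are illustrative only, not the thesis's classifier or admission data:

```python
# 10-fold cross-validation: split the data into 10 folds, hold out each
# fold once for testing, train on the remaining 9, and average accuracy.
from collections import Counter

data = [(i, 'admit' if i % 3 else 'reject') for i in range(100)]  # toy set

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

k = 10
fold_size = len(data) // k
scores = []
for f in range(k):
    test = data[f * fold_size:(f + 1) * fold_size]       # held-out fold
    train = data[:f * fold_size] + data[(f + 1) * fold_size:]
    model = majority([y for _, y in train])              # "training" step
    acc = sum(model == y for _, y in test) / len(test)   # evaluation step
    scores.append(acc)

mean_accuracy = sum(scores) / k
```

The mean of the ten fold accuracies is the cross-validated estimate; replacing the stand-in with a real tree learner does not change the loop.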
1.5 Organization of thesis
In chapter 2, different classification techniques such as Naïve Bayesian,
Decision Tree, Support Vector Machine, and linear and non-linear classification
are discussed. The chapter further discusses decision tree generation in detail
(the tree building phase and the pruning phase), along with the reasons why the
decision tree is preferred over other classification techniques.
In chapter 3, the literature survey in the area of classification techniques
for data mining is described. The survey is organised into different
classification techniques, decision tree based learning algorithms, decision
tree learning in distributed environments, merging of decision trees, and the
challenges involved. The survey was carried out in three phases.
In chapter 4, the overview of the proposed dynamic and scalable approach is
described, with the problem statement, objectives and scope, the original
contribution of the thesis, and the proposed system architecture at the local
and coordinator sites. The proposed algorithms at the coordinator site and at
each local site are presented in detail together with the architecture, and the
decision rule merging policies are introduced.
In chapter 5, the working of the proposed model is discussed in detail with the
proposed algorithm and flow chart. The algorithm computation at each local site
is explained in detail with an example. The chapter also covers decision tree
generation, decision table generation from the decision rules, XML file
generation, and the algorithm computation at the coordinator site.
In chapter 6, the training and testing data sets are discussed along with data
collection and pre-processing. The chapter lists the features, their types, and
the number of instances of each training data set, and discusses the data
pre-processing steps in detail.
In chapter 7, the proposed approach is implemented in the Apache Hadoop
framework. This chapter describes Hadoop MapReduce, the Hadoop Distributed File
System (HDFS), and the Decision Tree MapReduce with its four steps: data
preparation, selection, update and tree growing.
Chapter 8 presents the experimental results with parameterized comparisons. It
also concludes the research, with the objectives achieved and their
justification, the conclusions of the work, and the scope for future
enhancements of this research.
1.6 References
1. J. da Silva, C. Giannella, R. Bhargava, H. Kargupta, and M. Klusch,
Distributed Data Mining and Agents, Engineering Applications of Artificial
Intelligence, vol. 18, no. 7, pp. 791-80, 2005.
2. J. R. Quinlan, Induction of Decision Trees, Machine Learning, vol. 1,
no. 1, pp. 81-106, 1986.
3. L. Hall, N. Chawla, and K. Bowyer, Decision Tree Learning on Very Large
Data Sets, IEEE International Conference on Systems, Man, and Cybernetics,
vol. 3, pp. 2579-2584, 1998.
4. P. Strecht, J. Mendes-Moreira, and C. Soares, Merging Decision Trees: A
Case Study in Predicting Student Performance, in Proceedings of the 10th
International Conference on Advanced Data Mining and Applications,
pp. 535-548, 2014.
5. Dinesh Vaghela, Priyanka Sharma, A decision support application for
student admission process based on prediction in distributed data min-
ing, International Conference on Information, Knowledge & Research In
Engineering, Management and Sciences(IC-IKR-EMS), Gujarat. IJETAETS-
ISSN 0974-3588, 7th Dec-2014.
6. Dinesh Vaghela, Priyanka Sharma, A Proposed DDM Algorithm and Frame-
work For EDM of Gujarat Technological University, Organized by Saf-
frony Institute of Technology International Conference on Advances in
Engineering, 22nd-23rd January 2015.
7. S. Baik and J. Bala, A Decision Tree Algorithm for Distributed Data Mining,
2004.
8. Dinesh Vaghela, Priyanka Sharma, Prediction and analysis of student
performance using distributed data mining, International Conference on
Information, Knowledge & Research In Engineering, Management and
Sciences(IC-IKR-EMS), Gujarat. IJETAETS-ISSN 0974-3588, 7th Dec-2014
9. Yael Ben-Haim and Elad Tom-Tov, A Streaming Parallel Decision Tree
Algorithm, Journal of Machine Learning Research, pp. 849-872, 2010.
10. Raj Kumar and Rajesh Verma, Classification Algorithms for Data Mining: A
Survey, International Journal of Innovations in Engineering and Technology,
ISSN 2319-1058.
11. Bendi Venkata Ramana, M.Surendra Prasad Babu, N. B. Venkateswarlu,
A Critical Study of Selected Classification Algorithms for Liver Disease
Diagnosis, International Journal of Database Management Systems ( IJDMS
), Vol.3, No.2, May 2011
12. Thair Nu Phyu, Survey of Classification Techniques in Data Mining,
Proceedings of IMECS, vol. I, March 18-20, 2009, Hong Kong.
13. Rahul Gupta, Anuja Priyam, Anju Rathee, Abhijeet, and Saurabh Srivastava,
Comparative Analysis of Decision Tree Classification Algorithms, International
Journal of Current Engineering and Technology, ISSN 2277-4106.
Chapter 2
Classification Techniques
2.1 Introduction
Classification is a data mining function that assigns items/instances from a
data set to target categories or classes. The goal of classification is to
accurately predict the target class for each case in the data. Classification
has many applications in data mining; for example, a classification model could
be used to identify loan applicants as low, medium, or high credit risks. The
data classification process includes two steps:
1. Building the Classifier or Model: this is the learning step or learning
phase, in which the classification algorithm builds the classifier [7]. The
classifier is built from a training set made up of database instances/tuples
and their associated class labels. Each instance/tuple constituting the
training set is assumed to belong to a predefined category or class; these
tuples may also be referred to as samples, objects or data points.
2. Using the Classifier for Classification: the trained model/classifier
generated from the training data set classifies the objects/tuples of the test
data set.
A major issue is preparing the data for classification and prediction.
Preparing the data involves the following activities:
• Data Cleaning: data cleaning involves removing noise and treating
missing values. Noise is removed by applying smoothing techniques, and
the problem of missing values is solved by replacing each missing value
with the most commonly occurring value for that attribute.

Figure 2.1: Induction: Model Construction

• Relevance Analysis: a database may also contain irrelevant attributes.
Correlation analysis [11] is used to determine whether any two given
attributes are related.
• Data Transformation and Reduction: the data can be transformed by any
of the following methods.
– Normalization: the data are scaled so that all values of a given
attribute fall within a small specified range. Normalization is used
when the learning step employs neural networks or methods involving
distance measurements.
– Generalization: the data can also be transformed by generalizing them
to a higher-level concept, using concept hierarchies for this purpose.
In section 2.2 of this chapter, the different classification techniques are
discussed.
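The min-max normalization mentioned above can be sketched as follows; the attribute values are illustrative, and [0, 1] is the commonly used target range:

```python
# Min-max normalization of one attribute: each value v is rescaled to
# lo + (v - min) * (hi - lo) / (max - min), so all values land in [lo, hi].

def min_max(values, lo=0.0, hi=1.0):
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    return [lo + (v - v_min) * (hi - lo) / span for v in values]

ages = [18, 22, 30, 60]          # raw attribute values
scaled = min_max(ages)           # 18 maps to 0.0, 60 maps to 1.0
```

Such rescaling prevents attributes with large numeric ranges from dominating distance-based learners.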
2.2 Classification techniques
Classification techniques can be grouped into five categories based on
different mathematical concepts: statistical-based [17], distance-based,
decision tree-based, neural network-based, and rule-based. Each category
consists of several algorithms, but the most popular and most extensively used
from each category are C4.5, Naïve Bayes, K-Nearest Neighbors, and the
Back-propagation Neural Network [18, 19, 20]. This section discusses different
classification techniques, such as Naïve Bayesian, Support Vector Machine, and
Decision Tree, in detail.

Figure 2.2: Deduction of test data using the model/classifier
2.2.1 Naïve Bayesian
Naïve Bayesian classifiers are simple probabilistic classifiers based on
Bayes' theorem. They make strong (naïve) independence assumptions among the
attributes/features of the data sets. Naïve Bayes classifiers require a number
of parameters linear in the number of variables of the learning task. They are
highly scalable, i.e. they remain applicable as the data set size grows. They
train the model by maximum likelihood using a closed-form expression
[1][2][8], which takes linear (O(n)) time, rather than the expensive iterative
approximation used by many other types of classifiers.
Figure 2.3: Posterior Probability of Naive Bayes

Naïve Bayes is a simple technique for constructing classifiers: models that
assign class labels, drawn from some finite set, to test objects/instances
represented as vectors of attribute values. Naïve Bayes is not a single
algorithm but a family of algorithms based on a common principle: all naïve
Bayes classifiers assume that the feature values are independent of one
another given the class variable. As an example of this principle, a bird may
be considered to be a dove if it is grey in colour, small in size, and about
100 g in weight. A naïve Bayes classifier considers each of these features to
contribute independently to the probability that the bird is a dove,
regardless of any possible correlations among the colour, size and weight
features. With the naïve Bayes approach it is easy to build models for very
large data sets; in general, naïve Bayes is known for combining simplicity
with highly effective classification.
As shown in figure 2.3, the posterior probability P(c|x) can be calculated by
naïve Bayes from P(c), P(x) and P(x|c), where:
• Posterior probability of class c: P(c|x), where c is the target and x the
attributes.
• Prior probability of class c: P(c).
• Likelihood: P(x|c).
• Predictor's prior probability: P(x).
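A small numeric illustration of these quantities, using made-up probabilities for the dove example:

```python
# Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x), with the evidence P(x)
# expanded over both classes by the law of total probability.
# All figures below are invented for illustration.

p_dove = 0.3                  # prior P(c): fraction of birds that are doves
p_grey_given_dove = 0.8       # likelihood P(x|c)
p_grey_given_other = 0.2      # likelihood under the complementary class

# evidence P(x): total probability of observing a grey bird
p_grey = p_grey_given_dove * p_dove + p_grey_given_other * (1 - p_dove)

# posterior P(c|x): probability the bird is a dove given that it is grey
posterior = p_grey_given_dove * p_dove / p_grey
```

The naïve independence assumption lets the likelihood of a full attribute vector be computed as the product of such per-attribute likelihoods.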
Naïve Bayes has numerous advantages for which it is widely used:
• It provides fast and easy prediction on test data samples, and it performs
multi-class prediction very well.
• With minimal training data, and given its strong assumption of independence
among attributes, a naïve Bayes classifier can perform better than other
classifier models such as logistic regression.
• It performs more effectively for categorical input variables than for
numerical ones; for numerical variables, a normal distribution is assumed.
It has the following limitations:
• Zero-frequency problem: the model cannot make a prediction if a categorical
variable has a category in the test data set that was not observed in the
training data set. To resolve this problem, a smoothing technique such as
Laplace smoothing is used.
• The probability outputs of naïve Bayes should not be taken too literally;
it is known to be a bad estimator.
• The assumption of independent predictors is a further weakness: in practice
it is almost impossible to have a completely independent set of predictors.
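The Laplace (add-one) smoothing mentioned for the zero-frequency problem can be sketched as follows; the categories and counts are invented for illustration:

```python
# Without smoothing, a category never seen in training gets probability 0,
# which zeroes out the whole product of likelihoods in naive Bayes.
# Laplace smoothing adds a pseudo-count alpha to every category.
from collections import Counter

def smoothed_prob(value, observed, all_values, alpha=1):
    counts = Counter(observed)
    return (counts[value] + alpha) / (len(observed) + alpha * len(all_values))

colors_in_class = ['grey', 'grey', 'white']   # training observations
domain = ['grey', 'white', 'brown']           # all known categories

unsmoothed = Counter(colors_in_class)['brown'] / len(colors_in_class)  # 0.0
smoothed = smoothed_prob('brown', colors_in_class, domain)             # > 0
```

With alpha = 1 the unseen category 'brown' receives probability 1/6 instead of 0, so a single unseen value no longer vetoes the prediction.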
2.2.2 Support Vector Machines (SVM)
A Support Vector Machine (SVM) is a discriminative classifier formally defined
by a separating hyperplane. In other words, given labelled training data
(supervised learning), the algorithm outputs an optimal hyperplane which
categorizes new examples.
SVMs [1][2] are supervised learning models in machine learning, used for
analysis by classification and regression. Given a finite set of training
samples, each marked with its category, the SVM builds a training model that
later assigns new test samples to the relevant category; it is thus a
non-probabilistic binary linear classifier. In the SVM model, the training and
test samples are represented as points in space, mapped in such a way that a
clear gap appears between the categories and separates the samples. New test
samples are later mapped into the same space, and their category is predicted
from the side of the gap on which they fall.
Using the kernel trick, SVMs can also perform non-linear classification in
addition to linear classification, by implicitly mapping the inputs into
high-dimensional feature spaces. In general, an SVM constructs a hyperplane, or
a set of hyperplanes, in a high-dimensional space, which can be used for tasks
such as regression, prediction or classification. A good separation is achieved
by the hyperplane that has the largest distance to the nearest training data
points; this distance is known as the functional margin. As a rule, the larger
the margin, the lower the generalization error of the classifier.
Linear SVM. Suppose a training data set of n points is given, of the form
$(\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n)$, where each $y_i$ is either 1 or
-1, indicating the class to which the point $\vec{x}_i$ belongs, and each
$\vec{x}_i$ is a real vector of dimension p. The goal is to find the
"maximum-margin hyperplane" that separates the group of points with $y_i = 1$
from the group of points with $y_i = -1$, defined so that the distance between
the hyperplane and the nearest point $\vec{x}_i$ of either group is maximized.
In the usual illustration, H1 does not separate the classes, H2 separates
them but only with a small margin, and H3 separates them with the maximum
margin. The hyperplane is defined as the set of points $\vec{x}$ satisfying
$\vec{w} \cdot \vec{x} - b = 0$, where $\vec{w}$ is the (not necessarily
normalized) normal vector of the hyperplane. The samples on the margin are the
support vectors. The offset of the hyperplane from the origin along the normal
vector $\vec{w}$ is determined by the parameter $b / \|\vec{w}\|$.
Hard margin. If the training data are linearly separable, two parallel
hyperplanes that separate the two classes of data can be selected so that the
distance between them is as large as possible. The region bounded by these two
hyperplanes is called the "margin", and the maximum-margin hyperplane lies
halfway between them. These hyperplanes can be described by the equations
$\vec{w} \cdot \vec{x} - b = 1$ and $\vec{w} \cdot \vec{x} - b = -1$.
The distance between the two hyperplanes is $2 / \|\vec{w}\|$, so maximizing
this distance means minimizing $\|\vec{w}\|$. By adding the constraint that,
for each $i$, either $\vec{w} \cdot \vec{x}_i - b \ge 1$ if $y_i = 1$, or
$\vec{w} \cdot \vec{x}_i - b \le -1$ if $y_i = -1$, the data points can be prevented
Figure 2.4: Classification based on linear SVM
Figure 2.5: Classification based on Hard SVM
from falling into the margin. Under these constraints, each data point must lie
on the correct side of the margin. The two conditions can be rewritten together
as:

$y_i(\vec{w} \cdot \vec{x}_i - b) \ge 1$, for all $1 \le i \le n$   (2.1)

Putting this together gives the optimization problem: minimize $\|\vec{w}\|$
subject to $y_i(\vec{w} \cdot \vec{x}_i - b) \ge 1$ for $i = 1, 2, \ldots, n$.
The maximum-margin hyperplane is completely determined by the points
$\vec{x}_i$ that lie nearest to it; these $\vec{x}_i$ are called support
vectors.
Soft margin. The hinge loss function
$\max(0,\, 1 - y_i(\vec{w} \cdot \vec{x}_i - b))$ is introduced to extend the
Support Vector Machine to data that are not linearly separable. If the
constraint in (2.1) is satisfied, this function is zero, meaning that
$\vec{x}_i$ lies on the correct side of the margin. For data on the wrong side
of the margin, the function's value is proportional to the distance from the
margin. The function to be minimized is then:

$\frac{1}{n} \sum_{i=1}^{n} \max(0,\, 1 - y_i(\vec{w} \cdot \vec{x}_i - b)) + \lambda \|\vec{w}\|^2$   (2.2)

Here the parameter $\lambda$ determines the trade-off between increasing the
margin size and ensuring that each $\vec{x}_i$ lies on the correct side of the
margin. Therefore, for sufficiently small values of $\lambda$, the soft-margin
SVM behaves the same as the hard-margin SVM in the case when the training data
are linearly separable.
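The objective (2.2) can be evaluated directly for a tiny one-dimensional data set and a candidate (w, b); all numbers below are illustrative:

```python
# Evaluate the soft-margin objective (2.2) in 1-D: average hinge loss
# over the data plus lambda * ||w||^2. Values are made up for illustration.

def hinge(w, b, x, y):
    """max(0, 1 - y * (w*x - b)): zero when the point clears the margin."""
    return max(0.0, 1.0 - y * (w * x - b))

def soft_margin_objective(w, b, data, lam):
    avg_loss = sum(hinge(w, b, x, y) for x, y in data) / len(data)
    return avg_loss + lam * w * w        # lambda * ||w||^2 in one dimension

# four well-separated points plus one margin violator (0.5, -1)
data = [(3.0, 1), (4.0, 1), (-2.0, -1), (-3.0, -1), (0.5, -1)]
obj = soft_margin_objective(w=0.5, b=0.0, data=data, lam=0.01)
```

Only the violating point contributes hinge loss; training an SVM means searching for the (w, b) that minimizes this quantity.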
Nonlinear Classification. The original maximum-margin hyperplane algorithm,
proposed by Vapnik in 1963, constructed a linear classifier. In 1992, Bernhard
E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested a way to create
nonlinear classifiers by applying the kernel trick, initially proposed by
Aizerman et al. [5], to maximum-margin hyperplanes [6]. In this approach every
dot product is replaced by a nonlinear kernel function, which allows the
maximum-margin hyperplane to be fitted in a transformed feature space. The
transformation may be nonlinear and the transformed space high dimensional; the
classifier is a hyperplane in the transformed feature space, though it may be
nonlinear in the original input space. Working in a higher-dimensional feature
space can increase the generalization error of the support vector machine, but
the algorithm still performs well if enough samples/instances are provided.

Figure 2.6: Nonlinear Classification

SVMs can be used to solve various real-world problems:
• SVMs are well suited to text and hypertext categorization, since their
application reduces the need for labelled training instances in both the
standard inductive and the transductive settings.
• SVMs achieve higher search accuracy than classical query refinement schemes,
which makes them well suited to the classification of images.
• In medical science, SVMs have been used to classify proteins with more than
90% accuracy.
• SVMs can also recognize hand-written characters with good accuracy.
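The effect of a (hypothetical) feature map can be shown on a one-dimensional toy set: no single threshold on x separates the classes, but after mapping x to x² one threshold suffices, which is the kind of separability the kernel trick buys implicitly:

```python
# Points labelled +1 lie between -1 and 1, points labelled -1 lie outside;
# no threshold on x alone splits them, but phi(x) = x**2 makes them
# threshold-separable. Data are invented for illustration.

data = [(-3, -1), (-2, -1), (-0.5, 1), (0.4, 1), (2, -1), (3, -1)]

def separable_by_threshold(points):
    """Can any threshold t (with either orientation) split the labels?"""
    for t in sorted({x for x, _ in points}):
        for sign in (1, -1):
            if all((1 if sign * (x - t) >= 0 else -1) == y
                   for x, y in points):
                return True
    return False

raw = separable_by_threshold(data)                              # not separable
mapped = separable_by_threshold([(x * x, y) for x, y in data])  # separable
```

A real kernel SVM never computes the mapped coordinates explicitly; the kernel function supplies the dot products in the transformed space.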
2.2.3 K-Nearest Neighbor Classifier
K-nearest neighbors is a simple algorithm that stores all available cases and
classifies new cases based on a similarity measure (e.g., distance functions).
KNN has been used in statistical [17] estimation and pattern recognition since
the beginning of the 1970s as a non-parametric technique.
The k-nearest neighbors algorithm can be used for both regression and
classification [10] in pattern recognition. In both cases the input consists of
the k closest training samples in the given feature/attribute space, while the
output depends on whether k-NN is used for regression or classification:
• For classification, an object is assigned to a class by a vote of its
neighbors, where k is the number of nearest neighbors considered. For example,
if k = 1, the object is simply assigned to the class of its single closest
neighbor.
• For regression, k-NN outputs the property value of the object, computed as
the average of the values of its k nearest neighbors.
k-NN is among the simplest of all machine learning algorithms. The function is
only approximated locally, and all computation is deferred until
classification; k-NN is therefore a lazy learner, also called an instance-based
learner.
In k-NN, for both classification and regression, it is useful to weight the
contributions of the neighbors so that nearer neighbors contribute more than
distant ones: a common scheme gives each neighbor a weight of 1/d, where d is
its distance, and these weights then decide the classification of the object.
k-NN requires no explicit training step, and the method is sensitive to the
local distribution of the data.
Each training example has a class label and is represented as a vector in a
multidimensional feature space. In the training phase, only the feature vectors
and the class labels of the training objects are stored. In the k-NN algorithm,
k is a user-defined constant, and in the classification phase a test point is
assigned the label that is most frequent among its k nearest training samples.
The Euclidean distance is most commonly used for continuous variables, while
the Hamming distance is used for discrete variables, as in text classification.
For gene expression microarray data, correlation coefficients such as Pearson
and Spearman [12] have been used as the metric. The performance of k-NN can
also be improved significantly by learning the distance metric, for example
with neighborhood components analysis. The different distance equations are
given in figure 2.7:
Figure 2.7: Distance functions equations
Figure 2.8: Hamming Distance
It should be noted that the three distance measures above are only valid for
continuous variables. For categorical variables the Hamming distance, shown in
figure 2.8, must be used instead. This also raises the issue of standardizing
the numerical variables to the range 0–1 when the dataset contains a mixture
of numerical and categorical variables.
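The distance measures of figures 2.7 and 2.8 can be sketched in Python as follows. This is a minimal illustration assuming figure 2.7 lists the usual Euclidean, Manhattan and Minkowski formulas; the function names are illustrative, not from the thesis.

```python
import math

def euclidean(x, y):
    # sqrt(sum((x_i - y_i)^2)), for continuous variables
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # sum(|x_i - y_i|)
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, q):
    # (sum(|x_i - y_i|^q))^(1/q); q=2 gives Euclidean, q=1 Manhattan
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def hamming(x, y):
    # number of positions where categorical values differ (figure 2.8)
    return sum(1 for a, b in zip(x, y) if a != b)
```
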
It has been observed that when the class distribution is skewed, k-NN suffers
from a "majority voting" drawback: predictions for a new sample are dominated
by the more frequent classes, simply because their examples are more numerous
among the neighbors [13]. This problem can be overcome by weighting the clas-
sification according to the distances between the test point and each of its k
nearest neighbors. In regression, the value of each of the k nearest points is
likewise multiplied by the inverse of its distance to the test point. Abstrac-
tion in the data representation is another way to overcome the skew problem;
for example, k-NN can be applied to a Self-Organizing Map (SOM), in which each
node represents the center of a cluster regardless of the density of the
underlying data.
The best choice of k depends on the data: in general, larger values of k
reduce the effect of noise on the classification [14], but make the boundaries
between classes less distinct. A good k can also be selected by heuristic
approaches. The special case where the class label is predicted from the single
closest training sample (k = 1) is called the nearest neighbor algorithm.
The accuracy of the k-NN algorithm can be severely degraded by noise, by
irrelevant features, or by feature scales that are inconsistent with their
importance. Much research effort has therefore gone into selecting or scaling
features to improve classification accuracy; a well-known approach is to use
evolutionary algorithms to optimize the feature scaling [15]. The mutual
information between the training data and the training classes can also be
used for feature scaling. In binary classification, choosing k as an odd
number avoids tied votes. The bootstrap method can also be used to obtain a
practically optimal value of k [16].
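The voting and distance-weighting schemes described above can be combined into a short k-NN sketch. This is illustrative only; `knn_classify` and its parameters are hypothetical names, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3, weighted=False):
    """Classify `query` given `train` = [(feature_vector, label), ...].

    With weighted=True each neighbour votes with weight 1/d, the
    distance-weighted variant described above for skewed class
    distributions. A sketch, not the thesis implementation.
    """
    dist = lambda x, y: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    # keep the k training samples closest to the query point
    neighbours = sorted(train, key=lambda s: dist(s[0], query))[:k]
    votes = Counter()
    for vec, label in neighbours:
        d = dist(vec, query)
        votes[label] += 1.0 / (d + 1e-9) if weighted else 1.0
    return votes.most_common(1)[0][0]
```
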
2.2.4 Instance Based Learning (IBL)
As discussed earlier, k-NN has no training phase: it simply looks for the k
nearest neighbors of a new sample in order to select the class to which it
belongs. k-NN also supports incremental classification, and indexing can be
used to find the neighbors efficiently. The search is exhaustive, however, if
finding the nearest neighbors of a new instance requires storing and comparing
all instances in memory, which can lead to very high memory usage.
A solution to the above problem is Instance Based Learning (IBL), an enhance-
ment of the k-NN classification algorithm. The k-NN algorithm requires a large
amount of storage, whereas IBL does not need to maintain model abstractions.
Aha et al. (1991) focus on reducing the storage requirements with minimal loss
in learning rate and accuracy. k-NN is not well suited to noisy data, while
IBL tolerates noise and can therefore be applied to many real-life datasets.
IBL algorithms have the following strengths:
• They are supervised learning algorithms that classify objects directly
from stored instances.
• Updating the stored instances is inexpensive.
• Learning/training the model is fast.
• The algorithm can be extended to produce concept descriptions.
IBL algorithms have the following weaknesses:
• Since all training instances are saved, they are computationally
expensive.
• Noisy attribute values are not handled well.
• Irrelevant attributes are also not handled well.
• Their performance depends entirely on the similarity function.
• They do not handle nominal or missing-valued attributes well.
• They give no insight into how the data are structured.
IBL methodology and framework The methodology and the framework of IBL are as
follows:
• The primary output of an IBL algorithm is a Concept Description (CD),
which maps instances to categories.
• The concept description includes the collection of stored instances and,
possibly, information about their past classification performance. The
set of instances stored in a CD can change after each instance is
processed.
• The IBL framework has three parts: a similarity function, a classifica-
tion function and a concept description updater.
– The similarity function computes a numeric similarity between a new
instance and the stored training instances.
– The classification function classifies the new instance based on the
values produced by the similarity function.
– The concept description updater maintains the records of classifica-
tion performance.
– The inputs to an IBL algorithm are the new test instances, the clas-
sification results, the current CD and the similarity values; the
output is the modified CD.
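The three framework parts can be sketched as a minimal IB1-style learner. This is a rough reading of the framework of Aha et al. (1991), not their implementation; the class and method names are illustrative.

```python
import math

class IB1:
    """Minimal IB1-style instance-based learner (sketch).

    The concept description (CD) is simply the list of stored instances;
    the three framework parts map to similarity(), classify() and update().
    """
    def __init__(self):
        self.cd = []                   # concept description: stored instances
        self.correct = self.seen = 0   # classification performance record

    def similarity(self, x, y):
        # numeric similarity: negated Euclidean distance
        return -math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def classify(self, x):
        # label of the most similar stored instance (None if CD is empty)
        if not self.cd:
            return None
        vec, label = max(self.cd, key=lambda s: self.similarity(s[0], x))
        return label

    def update(self, x, label):
        # record performance on the new instance, then store it in the CD
        self.seen += 1
        if self.classify(x) == label:
            self.correct += 1
        self.cd.append((x, label))
```
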
2.2.5 Rule Based Classification
Rule based classification rests on the systematic selection of a small number
of features used for decision making, which increases the comprehensibility of
the knowledge patterns. Useful if-then rules are extracted from the dataset on
the basis of statistical significance.
IF-THEN rules can be extracted from the training data with a Sequential
Covering Algorithm (SCA) such as AQ, CN2 or RIPPER. With these algorithms
there is no need to generate a Decision Tree (DT) first. Each rule covers many
of the tuples of a given class. In this category of classification the rules
are learned one at a time: each time a rule is learned, the tuples covered by
that rule are removed, and the process continues for the remaining tuples. (In
a decision tree, by contrast, each path from the root to a leaf represents a
rule.) The rules also need to be pruned, for the following reasons:
• The quality assessment depends on the original collection of training
samples, so the rules may perform well on the training data but poorly
on subsequent data.
• A rule R is pruned only if the pruned version R' has better quality.
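The learn-one-rule-then-remove-covered-tuples loop can be sketched as follows. This is a toy in the spirit of sequential covering, far simpler than AQ, CN2 or RIPPER: it learns only single attribute = value rules, and all names are hypothetical.

```python
def learn_rules(data, target_class):
    """Toy sequential-covering sketch. Each example is (attrs_dict, label);
    each learned rule is a single {attribute: value} condition."""
    rules, remaining = [], list(data)
    while any(label == target_class for _, label in remaining):
        best, best_score = None, -1.0
        # try every attribute=value test and keep the most accurate one
        for attrs, _ in remaining:
            for a, v in attrs.items():
                covered = [(x, l) for x, l in remaining if x.get(a) == v]
                pos = sum(1 for _, l in covered if l == target_class)
                score = pos / len(covered)
                if score > best_score and pos > 0:
                    best, best_score = {a: v}, score
        if best is None or best_score < 1.0:
            break  # no perfect rule left; a real learner would relax this
        rules.append(best)
        # remove the tuples covered by the new rule and continue
        remaining = [(x, l) for x, l in remaining
                     if any(x.get(a) != v for a, v in best.items())]
    return rules
```
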
2.2.6 Neural Networks
A neural network resembles the biological human brain in that it consists of a
collection of connected neurons; it is often considered the borderline between
approximation algorithms and artificial intelligence. Because it learns
through training and resembles the structure of biological neuron networks, it
is known as a nonlinear predictive model. Neural networks are used in appli-
cations that involve detecting patterns, making predictions and learning from
the past, much as biological systems do. Artificial neural networks are
computer programs that enable a computer to learn in a manner loosely analo-
gous to a human being; they cannot mimic the human brain completely and have
their own limitations. Nevertheless, they are highly accurate predictive
models that can be applied to a large range of problems.
The Strengths of Neural Networks:
• High tolerance to noisy data
• Well-suited for continuous-valued inputs and outputs
• Successful on a wide array of real-world data
• Techniques exist for extraction of rules from neural networks
The Weaknesses of Neural Networks:
• Long training time
• Many non-trivial parameters, e.g., network topology
• Poor interpretability
2.2.7 Decision Tree
Owing to its computational efficiency in handling large volumes of data, De-
cision Tree (DT) induction is one of the best known Machine Learning (ML)
frameworks. It identifies the features/attributes that contribute most to the
given problem and provides interpretable results.
A decision tree is a tree-shaped structure that represents sets of decisions,
and these decisions generate rules for the classification of a dataset. Each
record starts from the root and moves toward a child node according to a
splitting criterion until it reaches a leaf node; the splitting criterion
evaluates a branching condition at the current node on the input records.
Decision tree construction has two stages: the first builds the tree and the
second prunes it. In most algorithms the tree grows top-down in a greedy
fashion, starting at the root node; at each intermediate node the database
records
Figure 2.9: Decision tree based classification for car subscription
are evaluated against a splitting criterion, and the database is recursively
partitioned in this manner. In the second stage, tree pruning reduces the size
of the tree in a principled way so as to reduce the prediction error.
Why Decision Tree?
Decision trees have several advantages over other decision support tools:
• Decision trees are easy to interpret and understand.
• Important insights can easily be generated with them.
• New scenarios can be added as they are introduced.
• Best, average and worst values can be determined for different scenarios.
• They act as a white-box model: a given result can be explained by a
condition in Boolean logic.
• They can be combined with other decision techniques.
Decision trees also have advantages over other data mining methods:
• Unlike other techniques, a decision tree requires little data pre-
processing, such as normalization or removal of blank values.
• It can work on both categorical and numerical data.
• The reliability of the model can be improved by validating it with
standard statistical tests.
• It works well even if its assumptions are somewhat violated by the true
model underlying the data, so it is robust.
• Very large volumes of data can be analyzed effectively and efficiently
with the available resources.
Disadvantages of Decision tree
The disadvantages and open issues of decision trees are as follows:
• Information gain is biased in favor of categorical attributes with more
levels [4].
• For uncertain values or linked outcomes the calculations become more
complex.
• Determining how deeply to grow the decision tree.
• Handling continuous attributes.
• Choosing an appropriate attribute selection measure.
• Handling training data with missing attribute values.
• Handling attributes with differing costs.
• Improving computational efficiency.
Limitations:
• The problem of learning an optimal decision tree is known to be NP-
complete under several aspects of optimality, even for simple concepts.
Consequently, practical decision-tree learning algorithms are based on
heuristics, such as the greedy algorithm in which locally optimal deci-
sions are made at each node. Such algorithms cannot guarantee to return
the globally optimal decision tree.
• Decision-tree learners can create over-complex trees that do not gener-
alize well from the training data. This is known as overfitting; mecha-
nisms such as pruning are necessary to avoid this problem.
2.3 Attribute Splitting Measures
The central choice in the basic algorithm (ID3) is which attribute to test at
each node of the tree: the attribute that is most useful for classifying the
instances should be selected. A good quantitative measure of the worth of an
attribute is given by statistical properties such as Entropy, Information
Gain, Split Info, Gain Ratio and the Gini Index, which measure how well a
given attribute separates the training instances according to their target
classification.
Entropy
Entropy H(S) is a measure of the amount of uncertainty in the (data) set S
(i.e. entropy characterizes the (data) set S):

H(S) = −∑_{x∈X} p(x) log₂ p(x)    (2.3)
Where,
• S - The current (data) set for which entropy is being calculated (changes
every iteration of the ID3 algorithm)
• X - Set of classes in S
• p(x) - The proportion of the number of elements in class x to the number
of elements in set S
When H(S) = 0, the set S is perfectly classified (i.e. all elements in S
belong to the same class). In ID3, entropy is calculated for each remaining
attribute, and the attribute with the smallest entropy is used to split the
set S on that iteration. The higher the entropy, the higher the potential to
improve the classification. Entropy can be calculated from a frequency table.
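Equation 2.3 can be computed directly from the class frequencies. A small sketch; the function name is illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_x p(x) log2 p(x) over the class labels in S (eq. 2.3)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
```
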
Information Gain
Information gain is based on the decrease in entropy after a dataset is split
on an attribute. The information gain IG(A, S) measures the difference in
entropy from before to after the set S is split on an attribute A; in other
words, how much the uncertainty in S was reduced by splitting S on A.

IG(A, S) = H(S) − ∑_{t∈T} p(t) H(t)    (2.4)
Where,
• H(S) - Entropy of set S
• T - The subsets created from splitting set S by attribute A, such that
S = ∪_{t∈T} t
• p(t) - The proportion of the number of elements in t to the number of
elements in set S
• H(t) - Entropy of subset t
In ID3, information gain can be calculated (instead of entropy) for each re-
maining attribute. The attribute with the largest information gain is used to
split the set S on this iteration.
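Equation 2.4 can be sketched as follows, reusing the entropy of eq. 2.3 (repeated here so the snippet is self-contained). Illustrative only; rows are assumed to be dictionaries mapping attribute names to values.

```python
import math
from collections import Counter

def entropy(labels):
    # H(S) from eq. 2.3
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    """IG(A,S) = H(S) - sum_t p(t) H(t), splitting `rows` on `attr` (eq. 2.4)."""
    labels = [r[target] for r in rows]
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder
```
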
Split Information
The split information value represents the potential information generated by
splitting the training data set D into v partitions, corresponding to the v
outcomes of a test on attribute A. A high SplitInfo means the partitions have
more or less the same size (uniform); a low SplitInfo means a few partitions
hold most of the tuples (peaks).

SplitInfo_A(D) = −∑_{j=1}^{v} (|D_j|/|D|) log₂(|D_j|/|D|)    (2.5)
Gain Ratio
C4.5, a successor of ID3, uses an extension of information gain known as the
gain ratio. It overcomes the bias of information gain by applying a kind of
normalization using the split information value. The gain ratio is defined as:

GainRatio(A) = Gain(A) / SplitInfo_A(D)    (2.6)

The attribute with the maximum gain ratio is selected as the splitting
attribute.
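Equations 2.5 and 2.6 can be sketched as follows. Illustrative only; SplitInfo is computed here from the partition sizes alone.

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum_j (|D_j|/|D|) log2(|D_j|/|D|) (eq. 2.5)."""
    total = sum(partition_sizes)
    return -sum((s / total) * math.log2(s / total)
                for s in partition_sizes if s)

def gain_ratio(gain, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D) (eq. 2.6)."""
    return gain / split_info(partition_sizes)
```
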
Gini Index
The Gini Index (used in CART) measures the impurity of a data partition D:

Gini(D) = 1 − ∑_{i=1}^{m} p_i²    (2.7)

where m is the number of classes and p_i is the probability that a tuple in D
belongs to class C_i. The Gini Index considers a binary split for each
attribute A, say into D1 and D2. The Gini index of D given that partitioning
is:

Gini_A(D) = (|D1|/|D|) Gini(D1) + (|D2|/|D|) Gini(D2)    (2.8)

The reduction in impurity is given by:

∆Gini(A) = Gini(D) − Gini_A(D)    (2.9)

The attribute that maximizes the reduction in impurity is chosen as the split-
ting attribute.
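Equations 2.7-2.9 can be sketched as follows; the helper names are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_i p_i^2 (eq. 2.7)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Gini_A(D) for a binary split D -> D1, D2 (eq. 2.8)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def gini_reduction(labels, left, right):
    """Delta Gini(A) = Gini(D) - Gini_A(D) (eq. 2.9)."""
    return gini(labels) - gini_split(left, right)
```
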
2.4 Decision Tree Classification
Decision tree learning [9] uses a decision tree as a predictive model that
maps observations about an item to conclusions about the item's target value.
It is one of the predictive modelling approaches used in statistics, data
mining and machine learning. Tree models in which the target variable takes a
finite set of values are called classification trees; in these tree struc-
tures, leaves represent class labels and branches represent conjunctions of
features that lead to those class labels. Decision trees in which the target
variable takes continuous values (typically real numbers) are called regres-
sion trees.
In decision analysis, a decision tree can be used to visually and explicitly
represent decisions and decision making. In data mining, a decision tree de-
scribes data but not decisions; rather the resulting classification tree can be an
input for decision making.
There are two main types of decision trees used in data mining:
• Classification tree: the predicted outcome is the class to which the
data belongs.
• Regression tree: the predicted outcome is a real number.
The umbrella term Classification And Regression Tree (CART) for these analysis
procedures was first introduced by Breiman et al. [3]. Trees used for regres-
sion and trees used for classification have some similarities but also notable
differences, such as the procedure used to determine where to split [3].
Some well-known decision tree ensemble methods are:
• Bagged decision trees: the training data are repeatedly re-sampled with
replacement to build multiple decision trees, whose predictions are
combined by voting [4].
• Random Forest: improves the classification rate by building multiple
decision trees as an ensemble of diverse classifiers.
• Boosted Trees: used for both regression and classification problems
[5][6].
• Rotation Forest: every decision tree is trained by applying Principal
Component Analysis (PCA) to a random subset of the input features.
2.4.1 Tree Building Phase
Decision tree learning is one of the most widely used and practical methods
for inductive inference. It approximates the value of a target function, with
the learned function represented by a tree. Learned trees can also be re-
represented as sets of if-then rules to improve human readability. These
learning methods are among the most popular inductive inference algorithms
and have been successfully applied to a broad range of tasks.
Decision tree learning is a method commonly used in data mining [20]. The
goal is to create a model that predicts the value of a target variable based
on several input variables. Each interior node corresponds to one of the
input variables, with an edge to a child for each of the possible values of
that input variable. Each leaf represents a value of the target variable
given the values of the input variables along the path from the root to that
leaf.
The decision tree is one of the simplest representations for classification
problems. Assume that all features have finite discrete domains and that
there is a single target feature for classification, called the class. Every
internal (i.e. non-leaf) node of the tree is labeled with an input feature,
the arcs leaving a node are labeled with the possible values of that feature,
and every leaf node is labeled with a class.
To learn a tree, the source set is divided into subsets based on an
attribute value. This process is repeated on each derived subset by recursive
partitioning, and it stops when all records in the subset at a node have the
same target value, or when further splitting no longer contributes to the
prediction. This greedy procedure is known as Top Down Induction of Decision
Trees (TDIDT) [21].
Most algorithms that have been developed for learning decision trees are
variations on a core algorithm that employs a top-down, greedy search through
the space of possible trees. This approach is exemplified by ID3 (Quinlan
1986) and its successors C4.5 (Quinlan 1993), C5.0 and many more.
The basic ID3 algorithm learns decision trees by constructing them top-
down, beginning with the question "Which attribute should be tested at the
root of the decision tree?". To answer it, each attribute is evaluated using
a statistical test (the information gain) to determine how well it alone
classifies the training examples. The best attribute is selected and used as
the test at the root node of the tree. A descendant of the root node is then
created for each possible value of this attribute, and the training examples
are sorted to the appropriate descendant node. The entire process is repeated
using the training examples associated with each descendant node to select
the best attribute to test at that point in the tree. This forms a greedy
search for an acceptable decision tree, in which the algorithm never back-
tracks to reconsider earlier choices.
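The top-down greedy procedure just described can be condensed into a short recursive sketch. This is in the spirit of ID3, not Quinlan's code; rows are assumed to be dictionaries and all names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def id3(rows, attrs, target):
    """Grow a tree top-down. Returns a class label (leaf) or a
    (best_attr, {value: subtree}) pair (internal node)."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:            # pure node -> leaf
        return labels[0]
    if not attrs:                        # no attribute left -> majority leaf
        return Counter(labels).most_common(1)[0][0]

    def gain(a):                         # statistical test: information gain
        rem = 0.0
        for v in set(r[a] for r in rows):
            sub = [r[target] for r in rows if r[a] == v]
            rem += len(sub) / len(rows) * entropy(sub)
        return entropy(labels) - rem

    best = max(attrs, key=gain)          # best attribute tested at this node
    children = {}
    for v in set(r[best] for r in rows):  # one descendant per attribute value
        subset = [r for r in rows if r[best] == v]
        children[v] = id3(subset, [a for a in attrs if a != best], target)
    return (best, children)
```

Note that, as in the description above, the search never backtracks: once an attribute is chosen for a node, that choice is final.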
2.4.2 Tree Pruning Phase
The smaller the complexity of a concept, the less the danger that it overfits
the data (a polynomial of degree n can always fit n + 1 points); thus,
learning algorithms try to keep the learned concepts simple. For very large
data sets, overfitting is a challenge when generating decision trees or other
predictive models. Overfitting happens when the learning algorithm continues
to develop hypotheses that reduce the training set error at the cost of an
increased test set error. In building decision trees there are several
approaches to reducing overfitting:
• Pre-pruning: tree growing is stopped early, before the tree perfectly
classifies the training set.
• Post-pruning: the tree is first grown to completion, classifying the
entire training data set, and pruning is performed afterwards.
It is often difficult to decide when to stop growing the tree, so in
practice the post-pruning approach is more successful, and it also covers the
entire training data set during tree construction. Several methods can be
used to define a criterion for finding the correct final tree:
1. A data set separate from the training set is used as a validation set.
2. The tree is first built with the available training data, and then each
node is checked to see whether expanding or pruning it brings an
improvement, using:
• Error estimation
• Significance testing (e.g., the chi-square test)
• The Minimum Description Length principle: growth of the tree stops
when the encoding size is minimized.
Pre Pruning
Pre-pruning stops growing a branch when the information becomes unreliable:
based on a statistical significance test, the tree stops growing when there
is no statistically significant association between any attribute and the
class at a particular node. Typical pre-pruning tests are as follows:
Decision Tree: rules are easily observed, developed and generated.
Naive Bayes: fast, highly scalable model building (parallelized) and scoring.
K-Nearest Neighbor: robust to noisy training data and effective if the
training data is large.
SVM: more accurate than decision tree classification.
Neural Networks: high tolerance of noisy data and ability to classify
patterns for untrained data.
Table 2.1: Advantages of different classification algorithms
• The most popular test is the chi-squared test.
• ID3 used the chi-squared test in addition to information gain.
Only statistically significant attributes were allowed to be selected by the
information gain procedure.
Post Pruning
Post-pruning first grows a decision tree that correctly classifies all the
training data, even where the data are unreliable, and then simplifies it by
replacing some subtrees with leaves. Post-pruning is usually preferred in
practice because pre-pruning can stop too early.
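Reduced-error post-pruning against a separate validation set, along the lines described above, can be sketched on a small tuple-based tree. This is an illustrative reading, not a standard implementation: nodes are hypothetical `(attribute, {value: subtree})` tuples and leaves are class labels.

```python
from collections import Counter

def predict(tree, row, default='?'):
    # walk the (attr, {value: subtree}) structure down to a leaf label
    while isinstance(tree, tuple):
        attr, children = tree
        tree = children.get(row.get(attr), default)
    return tree

def errors(tree, rows, target):
    return sum(1 for r in rows if predict(tree, r) != r[target])

def prune(tree, train_rows, valid_rows, target):
    """Replace a subtree by the majority leaf of its training rows whenever
    that does not increase error on the validation rows reaching it."""
    if not isinstance(tree, tuple) or not train_rows:
        return tree
    attr, children = tree
    pruned_children = {
        v: prune(sub,
                 [r for r in train_rows if r.get(attr) == v],
                 [r for r in valid_rows if r.get(attr) == v],
                 target)
        for v, sub in children.items()}
    pruned = (attr, pruned_children)
    leaf = Counter(r[target] for r in train_rows).most_common(1)[0][0]
    if errors(leaf, valid_rows, target) <= errors(pruned, valid_rows, target):
        return leaf          # replacing the node with a leaf is no worse
    return pruned
```
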
2.5 Comparison of classification techniques
Table 2.1 gives an overall comparison [18][19] of different classification
techniques: decision tree, Naive Bayes, K-Nearest Neighbor, Support Vector
Machine (SVM) and Neural Networks. Among these techniques the decision tree
is easy to understand and its rules are easy to develop, while SVM is faster.
Table 2.2 shows a feature-wise comparison [18][19], summarizing features
such as learning type, speed, accuracy, scalability and transparency for the
classification techniques.
Feature          | Decision Tree        | Naive Bayes          | K-Nearest Neighbor | SVM                       | Neural Networks
Learning type    | Eager learner        | Eager learner        | Lazy learner       | Eager learner             | Eager learner
Speed            | Fast                 | Very fast            | Slow               | Fast with active learning | Slow
Accuracy         | Good in many domains | Good in many domains | High, robust       | Significantly high        | Good in many domains
Scalability      | Efficient for small  | Efficient for large  | -                  | -                         | Slow
                 | data sets            | data sets            |                    |                           |
Interpretability | Good                 | -                    | -                  | -                         | Bad
Transparency     | Rules                | No rules             | Rules              | No rules                  | No rules
Missing values   | Missing value        | Missing value        | Missing value      | Sparse data               | -
Table 2.2: Feature comparisons

Table 2.3 discusses the merits and demerits of the classification techniques
in more detail. The decision tree works well with redundant attributes, while
irrelevant attributes disturb the construction of the decision tree. Naive
Bayes assumes the independence of the features and hence offers less accuracy
in classification. Neural networks have a high tolerance for noisy data but
take much time to train.
Decision Tree
Merits:
• Handles continuous, discrete and numeric data.
• Provides fast results when classifying unknown records.
• Supports redundant attributes.
• Very good results are obtained for small trees; results are not affected
by outliers.
• Normalization is not required.
Demerits:
• It cannot predict the value of a continuous class attribute.
• It gives error-prone results when too many classes are used.
• Construction of the tree is disturbed by irrelevant attributes.
• The tree is affected by even small changes in the data.

Naive Bayesian
Merits:
• Provides high accuracy and speed on large databases.
• Minimum error rate compared to other classifiers.
• Easy to understand.
• Supports streaming data as well as real- and discrete-valued data.
Demerits:
• It assumes independence of the features, so it provides less accuracy.

Neural Networks
Merits:
• High tolerance to noisy data.
• Good for continuous values.
• Untrained patterns can also be classified.
Demerits:
• Complex to interpret.
• Takes much time to train the model.

Table 2.3: Comparison of Classification Algorithms
2.6 Summary
In this chapter, different classification techniques have been discussed,
followed by the importance of decision tree based classification in decision
making. In addition, the two phases of decision tree construction, tree
building and tree pruning, have been described. The attribute splitting
measures, which play a crucial role in building the decision tree, have been
compared. The last section gave a detailed comparison of the different
classification techniques with respect to various parameters.
2.7 References
1. Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning
20(3): 273-297. doi:10.1007/BF00994018. 1995.
2. Press, William H., Teukolsky, Saul A., Vetterling, William T. and
Flannery, B. P. Section 16.5, Support Vector Machines. Numerical
Recipes: The Art of Scientific Computing (3rd ed.). New York: Cambridge
University Press. ISBN 978-0-521-88068-8. 2007.
3. Boser, B. E., Guyon, I. M. and Vapnik, V. N. A training algorithm for
optimal margin classifiers. Proceedings of the Fifth Annual Workshop on
Computational Learning Theory (COLT '92), p. 144.
doi:10.1145/130385.130401. ISBN 089791497X. 1992.
4. Aizerman, M. A., Braverman, E. M. and Rozonoer, L. I. Theoretical
foundations of the potential function method in pattern recognition
learning. Automation and Remote Control 25: 821-837. 1964.
5. Boser, B. E., Guyon, I. M. and Vapnik, V. N. A training algorithm for
optimal margin classifiers. Proceedings of the Fifth Annual Workshop on
Computational Learning Theory (COLT '92), p. 144. 1992.
6. Platt, J., Cristianini, N. and Shawe-Taylor, J. Large margin DAGs for
multiclass classification. In Solla, S. A., Leen, T. K. and Müller,
K.-R. (eds.), Advances in Neural Information Processing Systems.
7. Dietterich, T. G. and Bakiri, G. Solving multiclass learning problems
via error-correcting output codes. Journal of Artificial Intelligence
Research 2: 263-286. 1995.
8. Lee, Y., Lin, Y. and Wahba, G. Multicategory Support Vector Machines.
Computing Science and Statistics 33. 2001.
9. https://en.wikipedia.org/wiki/Decision_tree_learning
10. Altman, N. S. An introduction to kernel and nearest-neighbor nonpara-
metric regression. The American Statistician 46(3): 175-185. 1992.
11. Jaskowiak, P. A. and Campello, R. J. G. B. Comparing correlation
coefficients as dissimilarity measures for cancer classification in
gene expression data. Brazilian Symposium on Bioinformatics, pp. 1-8.
2011. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.208.993
12. Coomans, D. and Massart, D. L. Alternative k-nearest neighbour rules in
supervised pattern recognition: Part 1. k-Nearest neighbour classifica-
tion by using alternative voting rules. Analytica Chimica Acta 136:
15-27. 1982.
13. Everitt, B. S., Landau, S., Leese, M. and Stahl, D. Miscellaneous
clustering methods. In Cluster Analysis, 5th Edition. John Wiley &
Sons, Ltd, Chichester, UK. 2011.
14. Nigsch, F., Bender, A., van Buuren, B., Tissen, J., Nigsch, E. and
Mitchell, J. B. Melting point prediction employing k-nearest neighbor
algorithms and genetic parameter optimization. Journal of Chemical
Information and Modeling 46: 2412-2422. 2006.
15. Hall, P., Park, B. U. and Samworth, R. J. Choice of neighbor order in
nearest-neighbor classification. Annals of Statistics 36: 2135-2152.
2008.
16. Terrell, G. R. and Scott, D. W. Variable kernel density estimation.
Annals of Statistics 20: 1236-1265. 1992.
17. Mills, P. Efficient statistical classification of satellite measure-
ments. International Journal of Remote Sensing. 2011.
18. Dimitoglou, G., Adams, J. A. and Jim, C. M. Comparison of the C4.5 and
a Naive Bayes classifier for the prediction of lung cancer survivabi-
lity. Journal of Computing 4(2): 1-9. 2012.
19. Huang, J., Lu, J. and Ling, C. X. Comparing Naive Bayes, decision
trees, and SVM with AUC and accuracy. Proceedings of the Third IEEE
International Conference on Data Mining, 19-22 November 2003, pp.
553-556. doi:10.1109/ICDM.2003.1250975.
20. Rokach, L. and Maimon, O. Data Mining with Decision Trees: Theory and
Applications. World Scientific Pub Co Inc. ISBN 978-9812771711. 2008.
21. Quinlan, J. R. Induction of decision trees. Machine Learning 1: 81-106.
Kluwer Academic Publishers. 1986.
Chapter 3
Literature Survey On
Decision Tree
3.1 Introduction
Many data mining techniques exist, such as frequent pattern mining, classifi-
cation, regression, clustering and association rule mining, but among these,
classification is used most frequently. In classification [1], a model is
trained that describes and differentiates the data classes, so as to predict
the classes of objects whose labels are unknown. Classification can be
performed with different algorithms, such as neural networks, decision trees
and regression. Owing to the importance of decision trees for large data
sets, the decision tree approach has been used in this research. In general,
classification is the following sequence of operations:
1. Prepare the training data set by pre-processing the raw data.
2. Identify the class attribute and the classes.
3. Identify the attributes useful for classification (relevance analysis).
4. Learn a model using the training examples in the training set.
5. Use the model to classify the unknown data samples.
Figure 3.1: Decision tree
3.2 First Phase Study
As discussed in chapter 2, decision trees [2] represent a sequence of rules
that determine the class; a decision tree is a flowchart-like tree structure.
A decision tree consists of three kinds of components: the root node, the
internal nodes and the leaf nodes. The topmost component is the root node,
the leaf nodes are the terminal components of the structure, and the nodes in
between are the internal nodes. Each internal node denotes a test on an
attribute, each branch represents an outcome of the test, and each leaf node
holds a class label. Various decision tree algorithms are used in classifica-
tion, such as ID3, J48, CART, C5.0, SLIQ, SPRINT, random forest and random
tree. In this work the following tree algorithms were taken for comparison.
The ID3 (Iterative Dichotomiser 3) decision tree algorithm was developed by
Quinlan [6]. The basic idea of ID3 is to construct the decision tree through a
top-down, greedy search of the given sets, testing each attribute at every tree
node. The information gain approach is generally used to determine a suitable
splitting attribute for each node of the generated decision tree: the attribute
with the highest information gain (i.e. the maximum reduction in entropy) is
selected as the test attribute of the current node. In this way, the information
needed to classify the training sample subsets obtained by further partitioning
is minimized. That is to say, using this attribute to partition the sample set
of the current node reduces the mixture of classes in the generated sample
subsets to a minimum, so this information-theoretic approach effectively
reduces the number of splits required to classify the objects. ID3 uses only
categorical attributes to build a tree model, and it does not produce accurate
results when noise is present in the data set; to obtain more accurate results,
effective pre-processing is carried out before the model is built with ID3.
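The entropy and information-gain computation at the heart of ID3 can be sketched as follows (the toy records and attribute names are illustrative, not from the thesis data set):

```python
from collections import Counter
from math import log2

def entropy(labels):
    # H(S) = -sum over classes of p * log2(p)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, class_attr):
    labels = [r[class_attr] for r in rows]
    # Expected entropy that remains after partitioning on `attr`.
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[class_attr] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

rows = [
    {"outlook": "sunny",    "windy": "false", "play": "no"},
    {"outlook": "sunny",    "windy": "true",  "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "windy": "true",  "play": "no"},
]
# ID3 picks the attribute with the highest information gain at this node.
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, a, "play"))
print(best)  # outlook
```

Here splitting on outlook yields a gain of about 0.571 against about 0.420 for windy, so ID3 would test outlook first.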
J48, the Weka implementation of C4.5, is an advanced version of ID3; it
decides the target value of a new test sample with respect to the attribute
values of the training data [3]. The internal nodes of the decision tree denote
the different attributes, the branches give the possible values of these
attributes, and the terminal nodes give the final values of the dependent
variable.
CART, short for Classification And Regression Tree, was initially proposed by
Breiman et al. [7]; it builds a binary tree and is also known as Hierarchical
Optimal Discriminate Analysis (HODA). It is a non-parametric decision tree
method that produces a regression or classification tree depending on whether
the dependent variable is numeric or categorical, respectively. Here binary
means that each node in the decision tree has two outward branches, i.e. two
groups. The Gini index is used in CART as the feature selection measure: the
split yielding the largest reduction in the Gini index is used to partition the
records. CART handles categorical and numerical values as well as missing
attribute values. It uses cost-complexity pruning and can also generate
regression trees. It can be implemented serially following Hunt's algorithm,
and it performs regression analysis using regression trees (S. Anupama et al.,
2011): over a given period of time and a set of predictor variables, the
regression analysis feature forecasts a dependent variable. It gives high
classification and prediction accuracy.
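The Gini impurity measure used by CART, and the reduction in impurity that a candidate binary split achieves, can be sketched as follows (toy labels, our own illustration):

```python
from collections import Counter

def gini(labels):
    # Gini diversity index: 1 - sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_reduction(parent, left, right):
    # CART prefers the binary split with the largest drop in impurity.
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

parent = ["yes", "yes", "no", "no"]
print(gini(parent))                                           # 0.5
print(gini_reduction(parent, ["yes", "yes"], ["no", "no"]))   # perfect split: 0.5
print(gini_reduction(parent, ["yes", "yes", "no"], ["no"]))   # weaker split
```

The split that separates the classes perfectly removes all the impurity, so CART would choose it over the weaker candidate.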
C5.0 is an extension of C4.5, which was itself derived from ID3. It is applied
to big data sets and is much faster and more memory efficient than C4.5. It
splits the samples based on the maximum information gain; each sample subset
obtained from a split is then split further, continuing until the subsets can
no longer be split. Attributes/features that contribute little are rejected.
One major advantage of C5.0 is that it handles multi-valued attributes and
missing attribute values in the data set [8].
SLIQ, introduced by Mehta et al. (1996) [9], stands for Supervised Learning
In Quest. It can be implemented on serial and parallel systems for fast,
scalable decision tree induction, and it is not based on Hunt's algorithm.
It uses a breadth-first greedy strategy to partition the training data set
recursively during the tree-building phase. SLIQ handles both numeric and
categorical attributes, but its memory-resident class-list data structure is
its disadvantage. It uses the Minimum Description Length (MDL) principle for
tree pruning.
SPRINT is a decision tree induction algorithm standing for Scalable
Parallelizable Induction of decision Trees; it was introduced by Shafer et
al. (1996). It is a fast and scalable classifier that partitions the training
data recursively using a breadth-first greedy approach until no further split
is possible, and it can be implemented both serially and in parallel. It uses
attribute-list and histogram data structures that are not memory resident,
which makes SPRINT suitable for large data sets and removes all memory
restrictions on the data. It handles both continuous and categorical
attributes (Sunita et al., 2011).
RANDOM FOREST [4] is an ensemble learning method for classification,
regression and other tasks that constructs decision trees at training time
and predicts the class at output time. It overcomes the over-fitting problem
of decision trees by averaging multiple deep decision trees trained on
different parts of the data, with the goal of reducing the variance. This
comes at the expense of a small increase in bias and some loss of
interpretability, but generally greatly boosts the performance of the final
model.
The Random Forest algorithm was developed by Leo Breiman. It is a meta-
learner made of many individual trees, designed to operate quickly over large
data sets and, more importantly, to be diverse, using random samples to build
each tree in the forest.
Construction of a tree:
1. About 2/3 of the data is used to train the model, sampled with bootstrap
replacement.
2. Select the attribute with the most information gain from a random subset
of the attributes.
3. Continue to construct the tree until no more nodes can be created.
4. Compute the error and measure the correctness of the tree.
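Steps 1 and 2, the two sources of randomness, can be sketched as follows (the records and attribute names are placeholders of our own; growing the actual tree is omitted):

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
records = list(range(12))                  # stand-in for 12 training records
attributes = ["a1", "a2", "a3", "a4"]      # hypothetical attribute names

def bootstrap(rows):
    # Step 1: sample with replacement; roughly 2/3 of the distinct rows
    # appear, and the left-out ("out-of-bag") rows can estimate the error.
    sample = [random.choice(rows) for _ in rows]
    out_of_bag = [r for r in rows if r not in sample]
    return sample, out_of_bag

def candidate_attributes(attrs, m):
    # Step 2 draws the splitting attribute from a random subset of size m;
    # m is the knob that trades tree correlation against tree strength.
    return random.sample(attrs, m)

sample, oob = bootstrap(records)
print(len(sample), sorted(set(sample)), oob)
print(candidate_attributes(attributes, m=2))
```

Repeating this pair of steps once per tree yields a forest of diverse, decorrelated learners.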
At each node of the tree, diversity is acquired through the random selection
of attributes, from which the attribute with the highest information gain is
selected. The performance of the random forest algorithm is linked to the
level of correlation between any two trees in the forest: the overall
performance of the forest decreases as the correlation increases. The way to
vary the level of correlation between trees is to adjust the number of random
attributes to be selected when creating a split in each tree. Increasing this
variable (m) increases both the correlation between trees and the strength of
each tree; at some point the tree correlation and tree strength complement
each other, providing the highest performance. In addition, increasing the
number of trees provides a more intelligent learner, just as a large, diverse
group makes more intelligent decisions [10] [11].
RANDOM TREE [5] operates on a collection of tree predictors; such a collection
is called a forest. It handles both classification and regression problems.
In classification, the random tree classifier takes the input feature vector
and classifies it with every tree in the forest; the class label receiving the
majority of votes is taken as the output. In regression, the average of the
responses over all the trees in the forest is taken as the actual response.
All the trees are trained on different training sets but with the same
parameters.
3.3 Second Phase Study
Huge data sets are generated day by day, at an exponential rate, by the
widespread use of application software developed for numerous services, for
example stock markets, banking, supermarkets, education and mobile devices.
For analysis and visualization these data need to be processed with a
Distributed Data Mining (DDM) approach. Distributed data mining can be
implemented with any of the four approaches below [22].
• Central approach: bring all the site data sets to a single site, then apply
data mining to the entire combined data set. This causes two problems:
first, a huge amount of communication overhead and hence an
Algorithm | Measure                         | Procedure                                | Pruning
ID3       | Entropy, info gain              | Top-down decision tree construction      | Pre-pruning
C4.5      | Entropy, split info, gain ratio | Top-down decision tree construction      | Pre-pruning
C5.0      | Entropy, split info, gain ratio | Top-down decision tree construction      | Post-pruning
CART      | Gini diversity index            | Constructs binary decision tree          | Post-pruning based on cost-complexity measure
SLIQ      | Gini index                      | Breadth-first decision tree construction | Post-pruning based on MDL principle
SPRINT    | Gini index                      | Breadth-first decision tree construction | Post-pruning based on MDL principle

Table 3.1: Performance-based comparisons of different decision tree algorithms
Feature           | ID3                | C4.5                    | C5.0                                             | CART
Types of data     | Categorical        | Continuous, categorical | Continuous, dates, times, timestamps, categorical | Nominal, continuous
Processing speed  | Slow               | Better than ID3         | Fastest                                          | Average
Tree pruning      | No pruning         | Early pruning           | Late pruning                                     | Early pruning
Boosting          | Not allowed        | Not allowed             | Allowed                                          | Allowed
Missing values    | Not supported      | Not supported           | Supported                                        | Supported
Splitting measure | Entropy, info gain | Gain ratio, split info  | Gain ratio, split info                           | Gini diversity index

Table 3.2: Comparisons between different Decision Tree Algorithms
Algorithm: ID3
Merits:
• Builds the fastest and shortest trees.
• Prediction rules are created from the training data.
• Reduces the number of tests by pruning.
Demerits:
• Cannot handle numeric attributes or missing values.
• Over-fitting or over-classification is possible on small test samples.
• Only a single attribute at a time is tested to make a decision.

Algorithm: C4.5
Merits:
• Supports continuous data.
• Avoids over-fitting of data.
• Improved computational efficiency.
• Supports missing data values during training.
Demerits:
• Requires the target attribute to have only discrete values.

Algorithm: J48
Merits:
• Handles numeric and nominal values.
• Able to handle missing values.

Algorithm: CART
Merits:
• Non-parametric.
• No advance selection of variables.
• Can handle outliers.
Demerits:
• An unstable decision tree may be produced.
• One-variable splitting.

Table 3.3: Comparison of Merits and Demerits of Decision Tree Algorithms
increase in the communication cost of bringing the entire data to a single
site, and second, the problem of data privacy preservation.
• Merge approach: generate a local data model at each site. All these models
are sent to a single site and merged into a single global model. This
mechanism is carried out in the works of [23] [24] [25]. As the number of
sites increases, however, this approach does not scale well.
• Sample approach: at each site a small candidate data set is sampled, and
the samples are combined to form one global candidate data set on which the
data mining is then performed.
• Intermediate Message Passing approach: in the above three approaches a
single site conducts the data mining of the distributed data, while in this
approach a P2P network is involved in which the different sites communicate
among themselves without a central/single server [26] [27].
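Under the merge approach, only models travel to the coordinating site, never the raw records. A minimal sketch (the rule representation and names here are our own simplification, not the thesis algorithm):

```python
def learn_local_model(site_data):
    # Stand-in for local decision tree induction: summarize the site's data
    # as a set of (attribute value -> class) rules.
    return {(row["outlook"], row["play"]) for row in site_data}

site_a = [{"outlook": "sunny", "play": "no"},
          {"outlook": "rainy", "play": "yes"}]
site_b = [{"outlook": "overcast", "play": "yes"},
          {"outlook": "sunny", "play": "no"}]

# Each site sends only its (small) model; the single merging site unions
# them into a global model, so the communication cost scales with the model
# size rather than with the size of the raw distributed data sets.
global_model = learn_local_model(site_a) | learn_local_model(site_b)
print(sorted(global_model))
```

The raw rows of `site_a` and `site_b` never leave their sites, which is also what makes the approach compatible with data privacy preservation.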
Over time, many decision tree algorithms have been developed by researchers,
with increasing performance and improved handling of different types of data.
A few of these algorithms are discussed below.
Yael Ben-Haim and Elad Tom-Tov [12] proposed the Streaming Parallel Decision
Tree (SPDT) algorithm, which executes in a distributed environment and is
especially designed for classifying large data sets and streaming data. The
algorithm has empirically proved to be as accurate as a standard decision
tree classifier while scaling to process streaming data on multiple
processors. The essence of the algorithm is to quickly construct histograms
at the processors, which compress the data into a fixed amount of memory; a
master processor uses this information to find near-optimal split points for
the terminal tree nodes. The analysis shows that guarantees on the local
accuracy of split points imply guarantees on the overall tree accuracy. In
this algorithm, both training and testing are executed in a distributed
environment using only one pass over the data.
Bagging [28] and boosting [29] are meta-classification algorithms built on
either partitions or samples of the training data; they first produce weak
classifiers, which are later combined using a next-level algorithm.
There are many algorithms suited to distributed environments. Stolfo et al.
[30] learn a weak classifier on each partition of the sample data set and
later bring these classifiers to a single site, which is somewhat less
expensive than sending the entire data set to a remote site; a meta-
classifier then works on these weak classifiers centrally to form a single
global classifier. Bar-Or et al. [31] suggested executing ID3 on a
hierarchical network, exchanging only the statistics of the contributing
attributes at every node of the tree and at each level, which guarantees that
the selected attribute has the highest gain.
In a distributed environment the data are fragmented horizontally, vertically
or in a hybrid way, so decision tree induction algorithms are needed for such
scenarios. Caragea et al. [32] introduced such an algorithm for distributed
data, focusing mainly on evaluating the splitting criteria in a distributed
fashion. This reduces the communication cost by cutting down the overhead;
moreover, the trees induced in the distributed and centralized scenarios are
identical. The system is also available as a component of the INDUS system.
A different approach was taken by Giannella et al. [33] and Olsen [34] for
inducing decision trees over vertically partitioned data. They used the Gini
information gain as the impurity measure and showed that the Gini index
between two attributes can be formulated as a dot product between two binary
vectors. To reduce the communication cost, the authors evaluated the dot
product after projecting the vectors into a random smaller subspace: instead
of sending either the raw data or the large binary vectors, the distributed
sites communicate only these projected low-dimensional vectors. The paper
shows that, using only 20% of the communication cost necessary to centralize
the data, they can build trees that are at least 80% as accurate as the trees
produced by centralization.
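The random-projection idea can be sketched as follows: both sites derive the same random ±1 projection from a shared seed and exchange only the k-dimensional projected vectors, from which the dot product (and hence the Gini computation that depends on it) is estimated. The dimensions and the projection scheme below are our own illustrative assumptions, not the exact construction of [33]:

```python
import random

random.seed(1)
d, k = 1000, 400  # original vs. projected dimensionality

# Each site holds one long binary vector (one entry per record).
x = [random.randint(0, 1) for _ in range(d)]
y = [random.randint(0, 1) for _ in range(d)]

# Shared +/-1 projection rows, derived from a common seed so that the
# projection matrix itself never has to be transmitted between the sites.
R = [[random.choice((-1, 1)) for _ in range(d)] for _ in range(k)]

def project(v):
    return [sum(r_i * v_i for r_i, v_i in zip(row, v)) for row in R]

exact = sum(a * b for a, b in zip(x, y))                         # d values needed
approx = sum(p * q for p, q in zip(project(x), project(y))) / k  # k values needed
print(exact, round(approx))
```

Because each ±1 row r satisfies E[(r·x)(r·y)] = x·y, averaging over the k rows gives an unbiased estimate of the dot product while exchanging only k numbers instead of d.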
Innovation                       | Motivation/Problem                                      | Training data
Bowyer, Chawla, Hall [18]        | Training of a model with a large data set               | Pima Indians Diabetes, Iris
Long, Bursteinas [20]            | Distributed data mining on distant machines             | UCI repository
Zabala, Langner, Andrzejak [16]  | Training on distributed data exceeding RAM size         | UCI repository
Soares, Moreira, Strecht [17]    | University-level knowledge generation from course models | Academic data from the University of Porto

Table 3.4: Merge Models with combination of rules: Examples
3.4 Third Phase Study
A more common approach is the combination of rules derived from decision
trees. Rule merging combines the rules of two tree models into the rules of
a single decision tree; in this way the number of rules is reduced while the
final merged tree model grows. Williams [13] presented the fundamentals in
his doctoral thesis, and many researchers have since contributed to the
intermediate steps of the process.
Provost and Hennessy [14, 15] present an approach to learning and combining
rules on disjoint subsets of the full training data. A rule-based learning
algorithm is used to generate rules on each subset of the training data. The
merged model is constructed from satisfactory rules, i.e., rules that are
generic enough to be evaluated in the other models. All rules considered
satisfactory on the full data set are retained, as they constitute a superset
of the rules generated when learning is done on the full training set. This
approach has not been replicated by other researchers. Table 3.4 lists these
research examples along with the problems addressed and the data sets used.
Hall, Chawla and Bowyer [18, 19] present as their rationale that it is not
possible to train decision trees on very large data sets because doing so
could overwhelm the computer system's memory and make the learning process
very slow. Although a tangible problem in 1998, this argument still makes
sense nowadays, as the notion of very large data sets has turned into the
big data paradigm. The approach involves breaking down a large data set into
n disjoint partitions and then, in parallel, training a decision tree on
each. Each model, in this perspective, is considered an independent learner;
globally, the models can be viewed as agents, each learning a little about a
domain, with the knowledge of all agents to be combined into one knowledge
base. Simple experiments to test the feasibility of this approach were done
on two data sets, Iris and Pima Indians Diabetes: in both cases the data sets
were split across two processors and the resulting models merged.
Bursteinas and Long [20] aim to develop a technique for mining data
distributed on remote machines connected with limited bandwidth, arguing that
there is a lack of algorithms and systems which can perform data mining under
such conditions. The merging procedure is divided into two scenarios: one for
disjoint partitions and one for overlapping partitions. To evaluate the
quality of the method, several experiments were performed; the results showed
the equivalence of the combined classifiers with the classifier induced on a
monolithic data set. The main advantage of the proposed method is its ability
to induce globally applicable classifiers from distributed data without
costly data transportation; it can also be applied to parallelize the mining
of large-scale monolithic data sets. The experiments merged two models on
data sets taken from the UCI Machine Learning Repository [21].
Andrzejak, Langner and Zabala [16] propose a method for learning in parallel
or from distributed data. They focus on the large data sets generated in the
mobile environments of distributed scenarios, where the data set size exceeds
the RAM size. They also evaluated the interpretable models obtained from the
various models generated at numerous sites, identifying the impact and
importance of the individual variables. In a distributed environment, even if
the individual model of one site is interpretable, the overall/global model
may not be. To overcome this problem, the authors proposed a new approach
that merges the decision trees into a single, globally interpretable tree.
The proposed approach also overcomes the problems of connection bandwidth and
RAM size, and it gives good accuracy, as evaluated in experiments on UCI
repository data sets [21].
The research of Soares, Moreira and Strecht [17] on educational data mining
starts from the premise that predicting the failure of students in university
courses can provide useful information for course and programme managers as
well as explain the drop-out phenomenon. The rationale is that while it is
important to have models at course level, their number makes it hard to
extract knowledge that is useful at university level; therefore, to support
decision making at this level, it is important to generalize the knowledge
contained in those models. An approach is presented to group and merge
interpretable models in order to replace them with more general ones without
compromising the quality of the predictive performance. Data from the
University of Porto, Portugal, is used as the case study for evaluation. The
aggregation method consists mainly of intersecting the decision rules of
pairs of models of a group recursively, i.e., by adding models along the
merging process to previously merged ones. The results obtained are
promising, although they suggest alternative approaches to the problem. The
decision trees were trained using the C5.0 algorithm, and F1 was used as the
evaluation function of the individual and merged models.
3.5 Challenges with DT merging
Decision tree learning on massive data sets is a common data mining task in
distributed environments, yet many of the state-of-the-art tree learning
algorithms discussed above require the training data to reside in memory on
a single machine. While more scalable implementations of tree learning have
been proposed, they typically require specialized parallel computing
architectures. Moreover, all these approaches are static in nature, not
domain-free, not scalable, and do not preserve accuracy.
Merging decision trees is the real challenge for researchers. Different
researchers have proposed different merging policies, but preserving the
accuracy remains the big issue, because the global decision rules change for
several reasons: 1) What if two otherwise identical rules, differing in a
single feature, have the same class? 2) What if one rule partially overlaps
another rule? 3) What if one rule fully overlaps another rule? 4) What if two
rules constrain the same continuous-valued feature with different
constraints?
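Questions 2-4 all reduce to comparing the regions that two rules cover. For rules over a continuous feature this can be sketched with interval arithmetic (the representation and thresholds below are our own illustration, not the thesis' merging policy):

```python
def interval_intersection(a, b):
    # Intervals are (low, high) constraints on one continuous feature.
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None

def relation(interval_a, interval_b):
    inter = interval_intersection(interval_a, interval_b)
    if inter is None:
        return "disjoint"               # the rules never fire together
    if inter == interval_a or inter == interval_b:
        return "full overlap"           # one rule subsumes the other
    return "partial overlap"            # rules must be split before merging

# Two hypothetical rules on a continuous feature, e.g. 40 <= marks < 80.
r1, r2, r3 = (40.0, 80.0), (60.0, 90.0), (50.0, 70.0)
print(relation(r1, r2))  # partial overlap
print(relation(r1, r3))  # full overlap: r3 lies entirely inside r1
print(relation(r3, (80.0, 95.0)))  # disjoint
```

A merging policy then decides, case by case, whether to keep the subsuming rule, split the partially overlapping ones, or leave disjoint rules untouched.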
Our literature review and experiments on merging decision trees show that
training time, communication overhead and accuracy are the major challenges.
To reduce the training time, the proposed algorithm processes only the new
data set against the already trained model, which makes it scalable and
dynamic; to reduce the communication overhead, the local models are converted
into XML files; and to preserve the accuracy, the proposed algorithm
incorporates rule merging policies. The proposed model is discussed in detail
in Chapter 4.
3.6 Summary
In the first phase of this literature survey, different decision-tree-based
classification techniques such as CART, J48, ID3, C5.0, SPRINT, Random Forest
and SLIQ were studied and compared precisely with respect to the types of
data they support, their speed, their pruning methods, and whether they
support missing values. In the second phase, different decision tree
algorithms for distributed environments were studied and their pros and cons
discussed. In the third phase, the approaches proposed by numerous
researchers for merging decision trees in a distributed environment to form
a global decision tree were discussed. The last section discussed the
challenges observed in merging different decision trees into a global
decision tree. These are the motivations for our research: to generate the
global decision tree in a distributed environment without losing the
prediction quality of the model.
3.7 References
1. Elder J. F. and King M. A., Evaluation of Fourteen Desktop Data Mining
Tools, in Proceedings of the IEEE International Conference on Systems,
Man and Cybernetics, 1998.
2. Juhua Chen, Wei Peng and Haiping Zhou, An Implementation of ID3: Decision
Tree Learning Algorithm, Project of Comp 9417: Machine Learning, University
of New South Wales, School of Computer Science & Engineering, Sydney, NSW
2032.
3. C4.5 algorithm, Wikipedia, The Free Encyclopedia. Wikimedia Foundation,
28-Jan-2015.
4. Breiman L., Random Forests, Mach. Learn., vol. 45, no. 1, pp. 5-32, 2001.
5. Random tree, Wikipedia, The Free Encyclopedia. Wikimedia Foundation,
13-Jul-2014.
6. Leo Breiman, Jerome H. Friedman, Richard A. Olshen and Charles J. Stone,
Classification and Regression Trees. Wadsworth International Group, Belmont,
California, 1984.
7. Gordon V. Kass, An Exploratory Technique for Investigating Large
Quantities of Categorical Data, Applied Statistics, vol. 29, no. 2,
pp. 119-127, 1980.
8. Wu Shangzhuo, Wang Jian, Yan Hongcan and Zhu Xiaoliang, Research and
Application of the Improved Algorithm C4.5 on Decision Tree, 2009.
9. Manish Mehta, Rakesh Agrawal and Jorma Rissanen, SLIQ: A Fast Scalable
Classifier for Data Mining, IBM Almaden Research Center, CA 95120.
10. Suban Ravichandran, Vijay Bhanu Srinivasan and Chandrasekaran Ramasamy,
Comparative Study on Decision Tree Techniques for Mobile Call Detail Record,
Journal of Communication and Computer 9, pp. 1331-1335, 2012.
11. N. Peter, Enhancing Random Forest Implementation in Weka, Machine
Learning Conference, 2005.
12. Yael Ben-Haim, Elad Tom-Tov, A Streaming Parallel Decision Tree
Algorithm, Journal of Machine Learning Research, vol. 11, pp. 849-872, 2010.
13. G. J. Williams, Inducing and Combining Multiple Decision Trees. PhD
thesis, Australian National University, 1990.
14. F. J. Provost and D. N. Hennessy, Distributed machine learning: scaling
up with coarse-grained parallelism, in Proceedings of the 2nd International
Conference on Intelligent Systems for Molecular Biology, vol. 2, pp. 340-7,
Jan. 1994.
15. F. Provost and D. Hennessy, Scaling up: Distributed machine learning
with cooperation, in Proceedings of the 13th National Conference on
Artificial Intelligence, pp. 74-79, 1996.
16. A. Andrzejak, F. Langner, and S. Zabala, Interpretable models from
distributed data via merging of decision trees, 2013 IEEE Symposium on
Computational Intelligence and Data Mining (CIDM), Apr. 2013.
17. P. Strecht, J. Mendes-Moreira, and C. Soares, Merging Decision Trees:
a case study in predicting student performance, in Proceedings of the 10th
International Conference on Advanced Data Mining and Applications,
pp. 535-548, 2014.
18. L. Hall, N. Chawla, and K. Bowyer, Combining decision trees learned in
parallel, Working Notes of the KDD-97 Workshop on Distributed Data Mining,
pp. 10-15, 1998.
19. L. Hall, N. Chawla, and K. Bowyer, Decision tree learning on very large
data sets, IEEE International Conference on Systems, Man, and Cybernetics,
vol. 3, pp. 2579-2584, 1998.
20. B. Bursteinas and J. Long, Merging distributed classifiers, in 5th World
Multiconference on Systemics, Cybernetics and Informatics, 2001.
21. S. Datta, C. Giannella, and H. Kargupta, K-Means Clustering over Peer-
to-Peer Networks, 8th Int. Workshop on High Performance and Distributed
Mining (HPDM), 2005.
22. Baik, S. and Bala, J., A Decision Tree Algorithm for Distributed Data
Mining, 2004.
23. http://www.cs.waikato.ac.nz/ml/weka/
24. Khaled M. Hammouda and Mohamed S. Kamel, Hierarchically Distributed
Peer-to-Peer Document Clustering and Cluster Summarization, IEEE
Transactions on Knowledge and Data Engineering, vol. 21(5), pp. 681-698,
2009.
25. N. F. Samatova, G. Ostrouchov, A. Geist, and A. V. Melechko, RACHET: An
Efficient Cover-Based Merging of Clustering Hierarchies from Distributed
Datasets, Distributed and Parallel Databases, vol. 11(2), pp. 157-180, 2002.
26. S. Merugu and J. Ghosh, Privacy-Preserving Distributed Clustering Using
Generative Models, 3rd IEEE Intl Conf. Data Mining (ICDM 03), pp. 211-218,
2003.
27. J. da Silva, C. Giannella, R. Bhargava, H. Kargupta, and M. Klusch,
Distributed Data Mining and Agents, Eng. Applications of Artificial
Intelligence, vol. 18(7), pp. 791-807, 2005.
28. L. Breiman, Bagging Predictors, Machine Learning, vol. 24, pp. 123-140,
1996.
29. J. Friedman, T. Hastie, and R. Tibshirani, Additive Logistic Regression:
A Statistical View of Boosting, Dept. of Statistics, Stanford University,
Tech. Rep., 1998.
30. S. J. Stolfo, A. L. Prodromidis, S. Tselepis, W. Lee, D. W. Fan, and
P. K. Chan, JAM: Java Agents for Meta-Learning over Distributed Databases,
in Proceedings of SIGKDD 97, pp. 74-81, 1997.
31. A. Bar-Or, D. Keren, A. Schuster, and R. Wolff, Hierarchical Decision
Tree Induction in Distributed Genomic Databases, IEEE Transactions on
Knowledge and Data Engineering, Special Issue on Mining Biological Data,
vol. 17, no. 8, pp. 1138-1151, August 2005.
32. D. Caragea, A. Silvescu, and V. Honavar, A Framework for Learning from
Distributed Data Using Sufficient Statistics and Its Application to Learning
Decision Trees, International Journal of Hybrid Intelligent Systems, vol. 1,
no. 1-2, pp. 80-89, 2004.
33. C. Giannella, K. Liu, T. Olsen, and H. Kargupta, Communication Efficient
Construction of Decision Trees Over Heterogeneously Distributed Data, in
Proceedings of ICDM 04, Brighton, UK, pp. 67-74, 2004.
34. T. Olsen, Distributed Decision Tree Learning From Multiple Heterogeneous
Data Sources, Master's thesis, University of Maryland, Baltimore County,
Baltimore, Maryland, October 2006.
Chapter 4
Proposed Approach
4.1 Introduction
Many researchers have contributed classification and prediction methods for
distributed environments; these methods support machine learning, statistics
and pattern recognition in a distributed setting. Researchers have also
proposed different approaches for merging the local decision trees. A deep
literature review identified the facts that most algorithms are memory
resident, typically assume a small data size, are not domain-free, are static
in nature, and are less efficient in terms of processing and communication
overhead. Given the large volume of data and the accompanying privacy
concerns, an efficient technique is needed that supports scalable and dynamic
classification and prediction, handles very large data sets spread across
different sites, and generates the global decision tree without losing the
prediction quality.
4.2 Problem Statement
The research problem is to design an efficient technique for merging decision
trees that supports scalable and dynamic classification in a distributed
environment with large volumes of data and generates the global decision tree
without losing the prediction quality.
4.3 Objective and Scope of Research
OBJECTIVE
A series of challenges has recently emerged in the data mining field for
distributed environments, triggered by the rapid shift in status from
academic to applied science and the resulting needs of real-life
applications. The proposed work is concerned with a dynamic and scalable
approach for merging decision tree models for large volumes of data in a
distributed environment. The main objectives of the thesis are listed below.
1. To reduce the model (i.e. decision tree) training time and communication
time in a distributed environment for large volumes of data.
2. To introduce an efficient, scalable and dynamic approach for newly
generated data sets and already trained models.
3. To prepare the rule merging policies used to generate the global model.
4. To generate a globally interpretable model while preserving the
prediction quality.
SCOPE
In this research the following points have been considered as the scope.
1. To work with homogeneous and horizontally fragmented data sets.
2. To collect and pre-process a real data set: the student admission data set from the official web portal of the Parul group of institutes.
3. The work has been carried out on educational data and mainly focuses on student admission prediction.
4. A parser has been proposed to convert the decision tree into decision rules.
5. The final outcome of this research is an optimized global decision tree without loss of prediction quality.
6. The simulation of the work has been carried out on 2, 5 and 10 sites in the network.
4.4 Original Contribution by thesis
In this research work, qualitative and exploratory approaches have been used, following the standard research methodology steps. First, during the literature review we referred to various research papers, patents and other articles on dynamic and scalable data mining for distributed environments, classification techniques and algorithms, merging of decision trees, and educational data mining. In addition, we installed the Weka tool [1], an open-source workbench from the University of Waikato, New Zealand, and studied various supervised data mining algorithms for classification. During this initial phase of the literature review, we found that researchers had worked on classification algorithms, but very few of them had worked on decision trees in a distributed environment. The major reasons for this gap are that most of the algorithms are memory resident, static in nature, and neither scalable nor domain-free, and that very little work has been done for dynamic and scalable distributed environments.
During the literature review, we also found research works on merging decision trees generated at different geographical locations to form a global decision tree without losing predictive quality. Therefore, our second phase of the literature review mainly focused on global model generation from different local decision trees. By studying and comparing various approaches, we found that merging decision rules is a challenging task. We also found that no researcher had worked on a scalable and dynamic approach in a distributed environment. As a result of both phases of the literature review, we proposed a model (with framework, system architectures and algorithms) with the objectives 1) to reduce network overhead, 2) to support a scalable and dynamic distributed environment so that the whole dataset need not be processed every time, and 3) to ensure the global model does not lose predictive quality. The proposed framework, the system architecture at the local site, and the decision tree merging architecture at the coordinator site are shown in figures 4.1, 4.2 and 4.3 respectively. The details of each are given in the following sub-sections.
To fulfil the objectives, the proposed model has been implemented in two phases. In the first phase, 1) a decision tree is generated at each local site, 2) a decision table is formed at each local site, and 3) each local decision table is converted into an XML file for transmission over the internet. In the second phase, 1) the XML files are converted back into decision
tables, 2) all decision tables are merged, and 3) the resultant decision table is converted into an XML file and sent to all local sites for prediction.
The proposed model has been implemented on an educational data set. We have used the real data set of the student admission process in different disciplines, collected from the Parul university web portal (PUWP). In the experiments, the data set is processed on 2, 5 and 10 different sites to generate the local decision tree models, which are later merged into a single decision tree without losing the prediction quality.
4.5 Proposed Architecture
As shown in figure 4.1, the data set D as a whole is considered to be partitioned across different sites Si, where i = 1, 2, 3, ..., d. Each site Si processes its locally available dataset Di to generate a decision tree using the J48 algorithm in the Weka tool.
J48, designed and implemented by Ross Quinlan as an extension of the ID3 algorithm, generates decision trees. It is known as a statistical classifier because it generates decision trees for classification. Like ID3, it uses information entropy to build decision trees from the training data. The training data contains a set of already classified samples S = S1, S2, ..., Sn. Each sample Si is a vector (x1, x2, ..., xm), where x1, x2, ..., xm represent the features or attributes of the sample. C = C1, C2, ... represents the classes to which the samples belong, and the training data is augmented with this class vector. At each node of the J48 tree, the attribute that most effectively splits the set of samples is selected. The attribute is chosen using the normalized information gain (i.e. the difference in entropy): the attribute with the highest normalized information gain makes the splitting decision. This process continues recursively and forms the decision tree.
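The split selection described above can be illustrated with a minimal Python sketch. This is not Weka's J48 implementation; the data, attribute names and helper functions below are hypothetical toy values chosen only to show how the normalized information gain (gain ratio) picks a splitting attribute.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, attr, labels):
    """Normalized information gain (gain ratio) of splitting `rows` on `attr`."""
    n = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    info_attr = sum(len(g) / n * entropy(g) for g in groups.values())  # entropy after the split
    gain = entropy(labels) - info_attr                                 # information gain
    split_info = entropy([row[attr] for row in rows])                  # penalty for many-valued splits
    return gain / split_info if split_info > 0 else 0.0

# Toy admission-style data: AdmType separates the classes perfectly,
# while Category carries no information, so the split is made on AdmType.
rows = [{"AdmType": "STATE", "Category": "OPEN"},
        {"AdmType": "STATE", "Category": "SEBC"},
        {"AdmType": "MANAGEMENT", "Category": "OPEN"},
        {"AdmType": "MANAGEMENT", "Category": "SEBC"}]
labels = ["CSE", "CSE", "CIVIL", "CIVIL"]

print(gain_ratio(rows, "AdmType", labels))   # 1.0
print(gain_ratio(rows, "Category", labels))  # 0.0
```

The attribute with the highest gain ratio (here AdmType) would be chosen for the node, after which the process recurses on each partition.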
As the decision trees generated at each site occupy more memory, each is converted into a decision table and then into an XML file for transmission over the network, so that very little network overhead is incurred. Each XML file is
Figure 4.1: Proposed Framework
later made available at the coordinator site, where the actual decision tree merging process takes place.
Figures 4.2 and 4.3 show flow charts of the complete experimental algorithm at the local site and the coordinator site, respectively.
4.6 Proposed Algorithm
Input
• Datasets Di and Dtsi, the sets of training tuples and their associated class labels (Dtsi is the new data set instance at time stamp ts for site Si, where i = 1, 2, 3, ..., N); flag = 0 indicates the data set has not been processed before.
Output
• Global Decision Tree (GT)
4.6.1 Algorithm steps at local site
Step-1: If flag == 0 then perform steps 2 to 5; otherwise perform step 6.
Step-2: Apply the J48 algorithm on the data set Di of each site Si to generate the local decision tree DTi.
Step-3: Convert the decision tree DTi into the decision table Dtablei at each site Si; set flag = 1.
Step-4: Sort the decision rules in Dtablei in descending order of class label majority.
Step-5: Perform steps 12 to 13.
Step-6: For each Dtsi, perform steps 7 to 13.
Step-7: Classify the tuples of Dtsi against Dtablei and update Dtablei accordingly; set Di = Dtsi U Di.
Step-8: If a tuple tij does not follow any decision table rule, then create a new rule for it.
Step-9: If a tuple tij follows most of (all except one condition of) a decision table rule, then increment the count of that rule (i.e. correctly classified instances).
Step-10: If a tuple tij conflicts with any of the rules, then do not consider that tij.
Step-11: If a tuple tij is partially or fully overlapped by another decision table rule, then update the decision table accordingly (i.e. rule modification).
Step-12: Create the XML file Xi of Dtablei for each site Si.
Step-13: Send the Xi file of each Si to the coordinator site for further processing.
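The incremental core of these steps (7 to 10) can be sketched as follows. This is a simplified illustration, not the thesis implementation: rules are represented as hypothetical dictionaries of exact attribute-value tests with a support count, whereas the real decision table also holds range conditions.

```python
def update_table(dtable, tup, label):
    """Sketch of Steps 7-10: update a local decision table with one new
    labelled tuple, without rebuilding the decision tree."""
    for rule in dtable:
        if all(tup.get(a) == v for a, v in rule["conds"].items()):
            if rule["label"] == label:
                rule["count"] += 1       # Step-9: tuple follows the rule, reinforce it
            # Step-10: a tuple conflicting with the rule's label is not considered
            return dtable
    # Step-8: no rule covers the tuple, so create a new rule from it
    dtable.append({"conds": dict(tup), "label": label, "count": 1})
    return dtable

def sort_rules(dtable):
    """Step-4: keep rules in descending order of support (class label majority)."""
    return sorted(dtable, key=lambda r: r["count"], reverse=True)

dtable = [{"conds": {"AdmType": "STATE"}, "label": "CSE", "count": 3}]
update_table(dtable, {"AdmType": "STATE"}, "CSE")    # reinforces the existing rule
update_table(dtable, {"AdmType": "TFWS"}, "EE")      # adds a new rule
print(sort_rules(dtable))
```

Processing only the new tuples against the existing table, rather than retraining on Di as a whole, is what makes the local step incremental.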
Algorithm Complexity Calculation
The implementation of J48 requires a scan through the entire training set for each node of the tree that splits on an attribute. Depending on the data source, the number of nodes in the tree can be O(n), where n is the number of training instances, making the time complexity of this part O(n²). The time complexity of generating a set of rules from a decision tree is O(km²), where k is the number of nodes on each branch and m is the number of branches in the tree. If the number of rules in the decision table is r, then sorting the rules in descending order takes O(r log r) time using quicksort. The classification of q newly added instances takes O(q) time in total. The total time required to generate the XML file from a decision table with r rules of c conditions each is O(cr). At the coordinator site, the overall time complexity of merging and intersecting two decision tables with m and n instances is O(m log m + n log n).
Figure 4.2: Local site Processing
For each site, the datasets are Di and Dtsi, the sets of training tuples and their associated class labels; Dtsi is the new data set instance at time stamp ts for site Si, where i = 1, 2, 3, ..., N. The algorithm is dynamic and scalable in nature; hence, in this research the incremental approach has been used for new data sets rather than processing the entire data set again, which reduces the computation cost. The new data set Dtsi generated at time stamp ts at each site i is processed tuple-wise, and on this basis the decision table is directly updated without regenerating the decision tree. All local decision tables are merged at the coordinator site, and the merged table is later converted into a decision tree. This decision tree is called the global decision tree, and it is globally interpretable.
As given in the algorithm and shown in the flowchart, at each site Si the data set Di is processed through the J48 algorithm, which generates the decision tree DTi. Each decision tree is converted into decision rules using the parser and stored in the decision table Dtablei. To reduce the network overhead and transmission cost, each decision table is converted into an XML file Xi at each site i. Once all the XML files have been received by the coordinator site, they are converted into the corresponding decision tables.
The flag variable is used to check whether the data set is new, so that it can be processed alone using the incremental approach with the already trained model. If flag = 0 then steps 2 to 5 and steps 12 and 13 are executed; otherwise steps 7 to 13 are executed. Once the new data set Dtsi has been processed, it is appended to the data set Di.
4.6.2 Algorithm steps at coordinator site
Step-1: Convert each Xi into Dtablei for its respective site Si.
Step-2: Merge the Dtablei into a single table T.
Step-3: Convert T into the global decision tree GT.
Step-3.1: Perform the intersection phase: find the common rules, i.e. regions.
Step-3.2: Perform the filter phase: remove the disjoint regions from the intersected merged model.
Step-3.3: Perform the reduction phase: join regions of the same class that differ in only one attribute.
Step-4: Send GT to each Si for local prediction.
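Steps 1 and 2 above can be sketched in Python as follows. This is a simplified illustration under the same hypothetical rule representation used earlier (a dictionary of attribute tests, a class label and a support count); identical rules arriving from different sites are combined by summing their counts.

```python
def merge_tables(tables):
    """Sketch of coordinator Steps 1-2: combine the per-site decision tables
    into a single table T, summing the support counts of identical rules
    (same conditions and same class label)."""
    merged = {}
    for table in tables:
        for rule in table:
            # frozenset makes the condition set usable as a dictionary key
            key = (frozenset(rule["conds"].items()), rule["label"])
            merged[key] = merged.get(key, 0) + rule["count"]
    return [{"conds": dict(conds), "label": label, "count": count}
            for (conds, label), count in merged.items()]

site1 = [{"conds": {"AdmType": "STATE"}, "label": "CSE", "count": 3}]
site2 = [{"conds": {"AdmType": "STATE"}, "label": "CSE", "count": 2},
         {"conds": {"AdmType": "TFWS"}, "label": "EE", "count": 1}]
T = merge_tables([site1, site2])
print(len(T))  # 2
```

The merged table T is then refined by the intersection, filter and reduction phases of Step 3 before being converted into the global tree GT.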
Figure 4.3: Coordinator site processing
As shown in the algorithm and the flow chart for the coordinator site, all XML files are converted into decision tables, which are merged to form the global decision table T. Several phases are then performed on T. First, the intersection phase finds the common rules. The second phase is the filter phase, in which the disjoint regions of the intersected merged model are removed; because some rules are discarded here, this may cause a reduction in model accuracy. Finally, the reduction phase joins regions of the same class that differ in only one attribute, which avoids ambiguity in deciding the class label.
Figure 4.4: Proposed system architecture for dynamic and scalable decision tree generation
4.7 System architecture at local site
Figure 4.4 shows the detailed proposed architecture for the dynamic and scalable decision tree generation process. The first step is the creation of the model Mi from the data set Di available at each site. The parser then converts the decision tree into the decision rule set Ri for each site Si. In the third phase, the decision rule set Ri is converted into the decision table Dtablei, which is later converted into an XML file.
Each local site Si sends its locally generated XML file Xi to the coordinator site for the decision tree merging process. In one of the intermediate steps, a newly added data set is appended to the previous decision table Dtablei of site Si directly, without generating a decision tree for the new data set. In this way the approach becomes scalable, i.e. the algorithm supports new data sets as well.
4.8 System architecture at coordinator site
The process of merging k decision trees F1, F2, F3, ..., Fk into a single one starts by creating, for each tree Fi, its decision table set Dtable(Fi). The decision tables Dtable(F1), Dtable(F2), ... are reduced into a final decision table Dtable_Final by the merging operation on the decision table set. Finally, Dtable_Final is turned
Figure 4.5: Decision table merging process to generate the global decision tree
into a decision tree. The merging operation is the core of the approach: it merges several decision table sets Dtable(F1), Dtable(F2), ... into a single one. It consists of several phases, namely intersection, filtering and reduction, as shown in figure 4.5 below.
As shown in figure 4.5, the decision tables Dtable(Fi) of all the sites Si with data sets Di, where i = 1, 2, 3, ..., d, are merged. First, the intersection phase is carried out, in which the common regions, i.e. rules, are found. In the second phase, the less useful disjoint regions are removed from the list; this process is known as filtering. In the third phase, reduction, the disjoint regions that can be combined with minor changes are merged to reduce the number of disjoint regions.
Intersection Phase: This task combines the regions of two decision models, using a specific method to extract the common components of both as presented in the decision table. The sets of (numerical) values of each region of each model are compared to discover common sets of values across each variable. The class to assign to the merged region is straightforward if the pair of regions has the same class; otherwise a class-conflict problem arises. Andrzejak, Langner and Zabala [2] propose three strategies to address this problem: a) assign the class with the greatest confidence, b) assign the class with the greater probability, or c) retrain the model with examples from the conflicting class regions. If no conflict arises, that class is assigned; otherwise the region is removed from the
merged model.
Filter Phase: This is the task of removing the disjoint regions from the intersected model; it is essentially a pruning operation in which the regions with the highest relative volume and number of training examples are retained. Strecht, Moreira and Soares [3] address the issue by removing the disjoint regions, and highlight the case where the models are not mergeable because all regions are disjoint.
Reduction Phase: This phase applies when a set of regions has the same class and all variables have equal values except one. To obtain a simpler merged model, the task is to find which regions can be joined into one. For nominal variables, the values of the differing variable are unioned across the regions; for numeric variables, the regions are joined if their intervals are contiguous.
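The intersection and reduction phases on numeric conditions can be sketched as interval operations. The following is a minimal Python illustration under the assumption that each region maps a variable name to a (low, high] interval; it is not the thesis's full implementation, which also handles nominal variables and class conflicts.

```python
INF = float("inf")

def intersect_regions(r1, r2):
    """Intersection phase (sketch): per-variable overlap of two regions.
    Returns None when the regions are disjoint on some variable."""
    out = {}
    for var in set(r1) | set(r2):
        lo1, hi1 = r1.get(var, (-INF, INF))
        lo2, hi2 = r2.get(var, (-INF, INF))
        lo, hi = max(lo1, lo2), min(hi1, hi2)
        if lo >= hi:
            return None          # empty overlap: the region pair is disjoint
        out[var] = (lo, hi)
    return out

def try_join(r1, r2):
    """Reduction phase (sketch): join two same-class regions that are equal
    on every variable except one whose intervals are contiguous."""
    diff = [v for v in set(r1) | set(r2) if r1.get(v) != r2.get(v)]
    if len(diff) != 1 or diff[0] not in r1 or diff[0] not in r2:
        return None
    v = diff[0]
    (lo1, hi1), (lo2, hi2) = r1[v], r2[v]
    if hi1 == lo2 or hi2 == lo1:                 # contiguous intervals
        return {**r1, v: (min(lo1, lo2), max(hi1, hi2))}
    return None

print(intersect_regions({"ACPCRank": (17215, INF)}, {"ACPCRank": (-INF, 46309)}))
print(try_join({"ACPCRank": (440, 10996)}, {"ACPCRank": (10996, 17215)}))
```

Disjoint pairs (where intersect_regions returns None) are the candidates removed by the filter phase.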
Decision rule merging policies
Rule-1: A continuous value should not differ by more than a threshold (which is adjustable). The threshold is decided in advance, and during the rule merging process the values of a continuous variable may not differ by more than this threshold.
Rule-2:
Rule-2A: If the attribute test uses >, then the smaller of the two rule values is used. For example, if ACPCRank>2399 and ACPCRank>1050 are present in the same rule, then the smaller, ACPCRank>1050, is preserved. Thus the rule AdmType=STATE AND Institute=PIET1 AND ACPCRank>35015 AND ACPCRank≤46309 AND ACPCRank>17215 is modified to AdmType=STATE AND Institute=PIET1 AND ACPCRank≤46309 AND ACPCRank>17215.
Rule-2B: If the attribute test uses ≤, then the larger of the two rule values is used. For example, if ACPCRank≤102 and ACPCRank≤1345 are present in the same rule, then the larger, ACPCRank≤1345, is preserved.
Rule-3: Partial Overlap
If the conditions of two rules partially overlap, adjust the boundaries of the rules. The two overlapping rules below are modified into a new rule.
EE class: Category=OPEN AND AdmType=MANAGEMENT AND Institute=PIET1 AND ACPCRank>17215
and
Civil class: ACPCRank≤100904 AND Category=OPEN AND AdmType=MANAGEMENT AND Institute=PIET1 AND ACPCRank>17215
partially overlap; hence the EE rule is modified to Category=OPEN AND AdmType=MANAGEMENT AND Institute=PIET1 AND ACPCRank>100904.
Rule-4: If one rule completely overlaps another rule, modify the overlapped rule according to the overlapping rule.
Rule-5: Conflict in Label
If two identical rules have different labels, then select the label as follows:
1. Use the label with the highest confidence.
2. Average the probability distributions and use the label with the highest probability.
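Rules 2A and 2B can be captured in a few lines of Python. This is an illustrative sketch only: conditions are represented as hypothetical (attribute, operator, value) triples, and the example values are taken from the Rule-2A discussion above.

```python
def merge_numeric_tests(conds):
    """Sketch of merging policies Rule-2A/2B: when a merged rule repeats a
    test on the same attribute, '>' keeps the smaller threshold (Rule-2A)
    and '<=' keeps the larger one (Rule-2B)."""
    out = {}
    for attr, op, value in conds:
        key = (attr, op)
        if key not in out:
            out[key] = value
        elif op == ">":
            out[key] = min(out[key], value)   # Rule-2A
        else:
            out[key] = max(out[key], value)   # Rule-2B
    return [(a, o, v) for (a, o), v in out.items()]

# The Rule-2A example: ACPCRank>35015 AND ACPCRank<=46309 AND ACPCRank>17215
# is reduced so that only ACPCRank>17215 AND ACPCRank<=46309 remain.
rule = [("ACPCRank", ">", 35015), ("ACPCRank", "<=", 46309), ("ACPCRank", ">", 17215)]
print(merge_numeric_tests(rule))  # [('ACPCRank', '>', 17215), ('ACPCRank', '<=', 46309)]
```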
4.9 Summary
In this chapter the problem statement, along with the objective and scope, has been introduced in sections 4.2 and 4.3 respectively. The original thesis contribution has been discussed in section 4.4, and the proposed architecture in section 4.5. In section 4.6 the proposed algorithms at both the local site and the coordinator site have been discussed. The system architectures at the local and coordinator sites have been discussed in the subsequent sections 4.7 and 4.8 respectively.
4.10 References
1. Weka: Data Mining Software, University of Waikato, http://www.cs.waikato.ac.nz/ml/weka/
2. A. Andrzejak, F. Langner and S. Zabala, Interpretable models from distributed data via merging of decision trees, in IEEE Symposium on Computational Intelligence and Data Mining (CIDM), Apr. 2013.
3. P. Strecht, J. Mendes-Moreira and C. Soares, Merging decision trees: a case study in predicting student performance, in Proceedings of the 10th International Conference on Advanced Data Mining and Applications, pp. 535-548, 2014.
4. G. J. Williams, Inducing and Combining Multiple Decision Trees, PhD thesis, Australian National University, 1990.
5. N. Chawla, L. Hall and K. Bowyer, Combining decision trees learned in parallel, in Working Notes of the KDD-97 Workshop on Distributed Data Mining, pp. 10-15, 1998.
6. L. Hall, N. Chawla and K. Bowyer, Decision tree learning on very large data sets, in IEEE International Conference on Systems, Man, and Cybernetics, vol. 3, pp. 2579-2584, 1998.
7. S. Weston, M. Kuhn, R. Quinlan and N. Coulter, C5.0 Decision Trees and Rule-Based Models, R package version 0.1.0-16, 2014.
8. P. Chan and S. Stolfo, Learning arbiter and combiner trees from partitioned data for scaling machine learning, in Proc. Int. Conf. on Knowledge Discovery and Data Mining, 1995.
9. D. Hennessy and F. Provost, Distributed machine learning: scaling up with coarse-grained parallelism, in Proceedings of the 2nd International Conference on Intelligent Systems for Molecular Biology, vol. 2, pp. 3407, Jan. 1994.
10. F. Provost and D. Hennessy, Scaling up: distributed machine learning with cooperation, in Proceedings of the 13th National Conference on Artificial Intelligence, pp. 74-79, 1996.
11. J. Long and B. Bursteinas, Merging distributed classifiers, in 5th World Multiconference on Systemics, Cybernetics and Informatics, 2001.
12. Students' Admission Prediction using GRBST with Distributed Data Mining, Communications on Applied Electronics (CAE), ISSN 2394-4714, Foundation of Computer Science (FCS), New York, USA, vol. 2, no. 1, June 2015.
13. A Proposed DDM Algorithm and Framework for EDM of Gujarat Technological University, International Conference on Advances in Engineering, organized by Saffrony Institute of Technology, 22nd-23rd January 2015.
14. A Dynamic and Scalable Evolutionary Data Mining for Distributed Environments, NCEVT-2013, PIET, Limda.
15. Faculty Performance Evaluation Based on Prediction in Distributed Data Mining, IEEE ICETECH 2015, Coimbatore.
16. Prediction and analysis of student performance using distributed data mining, International Conference on Information, Knowledge & Research in Engineering, Management and Sciences (IC-IKR-EMS), 7th Dec. 2014, KIT, Gujarat. IJETAETS, ISSN 0974-3588.
Chapter 5
Working of the Proposed Model
5.1 Introduction
In chapter 4 the proposed architecture was introduced. The architecture contains N different sites [3], one of which is considered the coordinator site, where the rule merging policies are applied and the global decision tree is generated from the newly formed global rules. In this chapter the proposed algorithms at both the local site and the coordinator site are discussed in detail, with examples. At the end, the system architectures at the local and coordinator sites are discussed in more detail, together with the merging policies.
5.2 Local site algorithm computation
The data set D as a whole is considered to be partitioned across different sites Si, where i = 1, 2, 3, ..., d; each site Si processes its locally available dataset Di to generate a decision tree using the J48 algorithm in the Weka tool. Here the student admission data set has been considered and processed on two different sites, S1 and S2. For simplicity of calculation, site S1 has 179 instances and site S2 has 142 instances to process. Tables 5.1 and 5.2 show the detailed statistics, which are useful for computing the different attribute
College   Record   AdmType      Record   Category   Record   Branch   Record
PIT1      80       State        92       Open       60       it       13
PIET1     62       Management   33       SEBC       49       cse      31
PIET2     0        TFWS         17       SC         19       mech     27
PIT2      0                              ST         14       ec       13
                                                             ee       14
                                                             auto     03
                                                             ic       01
                                                             civil    38
                                                             chem     02
Total     142                   142                 142               142

Table 5.1: Overall distribution of site S2 instances with respect to attributes
Branch   PIT   PIET   State   TFWS   Management   Open   SEBC   SC   ST
it       6     7      7       3      3            2      5      4    2
cse      14    17     19      2      10           15     11     2    3
mech     17    10     18      2      7            13     8      2    4
ec       7     6      9       3      1            5      4      3    1
ee       7     7      11      1      2            7      4      2    1
auto     2     1      2       1      0            0      2      1    0
ic       0     1      1       0      0            1      0      0    0
civil    25    13     27      3      8            3      14     5    3
chem     2     0      1       0      1            1      1      0    0
Total    80    62     95      15     32           60     49     19   14

Table 5.2: Detailed distribution of site S2 instances with respect to attribute values
splitting measures.
The information gain is based on the decrease in entropy after a dataset is split on an attribute. The information gain IG(A) measures the difference in entropy from before to after the set S is split on attribute A; in other words, how much the uncertainty in S was reduced by splitting S on A. In the computations below, the expected information needed to classify tuples at site S2 is derived, overall and attribute-wise; this helps select the most suitable attribute to split the instances when building the decision tree. Throughout, log refers to the logarithm with base 2.
• Expected information needed to classify tuples in the site S2 dataset: The expected information of the entire dataset, computed from all the instances of the data set at site S2, is 3.9, as below.
Info(D2) = -13/142 log(13/142) - 31/142 log(31/142) - 27/142 log(27/142) - 13/142 log(13/142) - 14/142 log(14/142) - 3/142 log(3/142) - 1/142 log(1/142) - 38/142 log(38/142) - 2/142 log(2/142) = 3.9
• Expected information for each attribute: The expected information for each attribute of the data set is computed in order to find the information gain of each attribute.
1. Infoinstitute(D2) = 80/142 × (-6/80 log 6/80 - 14/80 log 14/80 - 17/80 log 17/80 - 7/80 log 7/80 - 7/80 log 7/80 - 2/80 log 2/80 - 25/80 log 25/80 - 2/80 log 2/80) + 62/142 × (-7/62 log 7/62 - 17/62 log 17/62 - 10/62 log 10/62 - 6/62 log 6/62 - 7/62 log 7/62 - 1/62 log 1/62 - 13/62 log 13/62) = 2.66
Gaininstitute(D2) = Info(D2) - Infoinstitute(D2) = 3.9 - 2.66 = 1.24
Similarly, the information gains for AdmType and Category are derived:
2. InfoAdmType(D2) = 2.594, GainAdmType(D2) = 1.306
3. InfoCategory(D2) = 2.45, GainCategory(D2) = 1.45
After finding the information gain of each attribute, the split information for each attribute is computed:
1. Splitinfoinstitute(D2) = -80/142 log 80/142 - 62/142 log 62/142 = 0.994
2. SplitinfoAdmType(D2) = -95/142 log 95/142 - 15/142 log 15/142 - 32/142 log 32/142 = 1.223
3. SplitinfoCategory(D2) = -60/142 log 60/142 - 49/142 log 49/142 - 19/142 log 19/142 - 14/142 log 14/142 = 1.78
The gain ratio for each attribute is then calculated:
1. GainRatio(Institute) = Gaininstitute(D2) / Splitinfoinstitute(D2) = 1.24/0.994 = 1.2475
Similarly,
Figure 5.1: Data set at site S1
2. GainRatio(AdmType)= 1.068 and
3. GainRatio(Category)= 0.815
As in the above example for site S2, site S1 has 179 instances, which have been processed with the J48 algorithm. Figure 5.1 shows the data set at site S1: it contains 179 instances and 7 attributes in total. As shown in the figure, the institute attribute has 80, 16, 81 and 2 instances for the PIET1, PIET2, PIT1 and PIT2 institutes respectively. Likewise, the other attributes have their distinct values in the data set.
5.2.1 Building the decision tree
The above data set at site S1 has been processed by the supervised classification technique using the J48 algorithm, which generates the decision tree shown in figure 5.2. Here 10-fold cross-validation has been used, with a 66% splitting percentage.
On processing the data set at site S1, the result is acquired as shown in figure 5.3. In this figure the detailed accuracy by class (here, branch-wise) is
Figure 5.2: Decision Tree generated at local site S1
shown. For each class, the True Positive (TP) Rate, False Positive (FP) Rate, Precision, Recall, F-Measure and ROC Area can be viewed.
The confusion matrix shown in figure 5.3 can be clearly understood. In total, 141 instances are correctly classified and 38 instances are incorrectly classified. The class CIVIL has the maximum number of incorrectly classified instances (12 in total), and the class IT has the minimum (only 1).
5.2.2 Rule Generation
It is difficult and complex to merge the different local decision trees to form the global one. For an efficient merging process, the decision tree paths have been converted into simple decision rules. Using the J48 parser, the decision rules have been derived from the decision tree; some of the decision rules formed are given below for the different classes. On the left side of each rule, multiple predicates are ANDed to form a minterm predicate, and the right side is the class label. From observation of the decision rules, many of them are complex, overlapping, or differ in only a single attribute.
Figure 5.3: Detailed accuracy and the confusion matrix
1. AdmType=TFWS AND Institute=PIET1 AND ACPCRank>17215→ CSE
2. ACPCRank≤48576 AND AdmType = ST AND Category = OPEN AND
Institute = PIT1 AND ACPCRank > 17215→ CSE
3. Category = ST AND ACPCRank > 10996 AND ACPCRank > 440 AND
ACPCRank ≤ 17215→ IT
4. HSC ≤ 57.54 AND SSC ≤ 74.33 AND AdmType = MANAGEMENT AND
Institute = PIT1 AND Category = OPEN AND ACPCRank > 17215→ IT
5. SSC > 75.24 AND Institute = PIT1 AND AdmType = MANAGEMENT AND ACPCRank ≤ 10996 AND ACPCRank ≤ 17215 AND ACPCRank > 440 → EC
6. SSC > 74.33 AND AdmType = MANAGEMENT AND Institute = PIT1
AND Category = OPEN AND ACPCRank > 17215→ EC
7. ACPCRank ≤ 17215 AND ACPCRank > 440 AND ACPCRank ≤ 10996
AND AdmType =State AND Category = SC→MECH
8. ACPCRank ≤ 17215 AND ACPCRank > 440 AND ACPCRank≤10996
AND AdmType =MANAGEMENT AND Institute =PIET1→MECH
9. ACPCRank > 100904 AND Category = OPEN AND AdmType =MAN-
AGEMENT AND Institute =PIET1 AND ACPCRANK > 17215→ EE
10. Category = SEBC AND AdmType =MANAGEMENT AND Institute =PIET1
AND ACPCRANK > 17215→ EE
11. ACPCRank > 11083 AND Category = OPEN AND ACPCRank > 10996
AND ACPCRank ≤ 17215 AND ACPCRank > 440→ CIVIL
12. ACPCRank > 17215 AND Institute =PIET1 AND AdmType =STATE AND
ACPCRank ≤ 35015→ CIVIL
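The parser step that produces such rules can be sketched as a traversal that turns every root-to-leaf path into one ANDed rule. The nested-dict tree form below is a hypothetical illustration, not Weka's actual J48 output format.

```python
def tree_to_rules(node, path=()):
    """Sketch of the parser: each root-to-leaf path of the decision tree
    becomes one decision rule (ANDed conditions -> class label)."""
    if "label" in node:                          # leaf node: emit the accumulated path
        return [(" AND ".join(path), node["label"])]
    rules = []
    for test, child in node["children"].items():
        rules += tree_to_rules(child, path + (node["attr"] + test,))
    return rules

# A tiny tree in the assumed nested-dict form
tree = {"attr": "AdmType", "children": {
    "=TFWS": {"label": "CSE"},
    "=STATE": {"attr": "ACPCRank", "children": {
        ">17215": {"label": "CIVIL"},
        "<=17215": {"label": "IT"}}}}}

for conds, cls in tree_to_rules(tree):
    print(conds, "->", cls)
```

For a tree with m branches of up to k nodes each, this traversal emits m rules of at most k conditions, which are then stored row-wise in the decision table.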
5.2.3 Decision Table
In a distributed environment, the decision rules derived from the local decision trees need to be transferred to the coordinator site for further processing, such as merging. Scanning all the decision rules of the different local sites is a complex and time-consuming process. To reduce the network overhead and complexity, the decision rules are converted into decision table form, which can later be converted into an XML file. The decision table looks as shown in figure 5.4.
5.2.4 XML File Generation
As discussed above, the decision table is converted into an XML file, as shown in figure 5.5, to reduce network traffic and to ease processing. The very first line of the XML file is the XML version. The root tag is the dataset. The other lines include the record number, which further extends to the attributes and their values. The XML file is a simple text file that requires only a few kilobytes to store decision tree rules that would otherwise require hundreds of megabytes when a large volume of data is processed.
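The decision table to XML conversion can be sketched with Python's standard library. The tag and attribute names below are illustrative, not the thesis's exact schema, and the rule representation is the simplified one used in the earlier sketches.

```python
import xml.etree.ElementTree as ET

def table_to_xml(dtable):
    """Sketch of the XML generation step: one <record> per decision rule,
    one child element per attribute test, plus the class label."""
    root = ET.Element("dataset")
    for number, rule in enumerate(dtable, start=1):
        record = ET.SubElement(root, "record", number=str(number))
        for attr, value in rule["conds"].items():
            ET.SubElement(record, "attribute", name=attr).text = str(value)
        ET.SubElement(record, "class").text = rule["label"]
    return ET.tostring(root, encoding="unicode")

xml_text = table_to_xml([{"conds": {"AdmType": "TFWS", "Institute": "PIET1"},
                          "label": "CSE"}])
print(xml_text)
```

Because each rule becomes one compact record, the file size grows with the number of rules rather than with the number of training instances, which is what keeps the transmission overhead small.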
As shown in figure 5.6, to support a large volume of data [3], only the data generated at a given time instance needs to be processed, together with the existing decision table model. This saves the processing time for model generation. On each such event, every site generates an XML file, which is later transferred to the coordinator site. This approach supports the dynamic and
Figure 5.4: Decision table generated at local site S1
Figure 5.5: The XML file at local site S1
Figure 5.6: Dynamic and Scalable decision tree generation
scalable [4] evolutionary data mining by generating the decision tables.
5.3 Coordinator site algorithm computation
As discussed in chapter 4, the coordinator site is chosen at random. This site is
only responsible for merging the decision rules. In order to merge the decision
rule, the site very first collects all XML files from all the local sites followed
by converting each into the decision tables. All these decision tables are then
combined very first.
The merging process of k decision trees F1, F2, F3,... Fk into a single one
starts with creating for each tree Fi its Decision Table set Dtable(Fi). De-
cision Tables Dtable(F1), Dtable(F2),. . . is reduced into a final Decision
Table Dtable Final by the merging operation on Decision tables set. Finally,
Dtable Final is turned into a decision tree The merging operation is the main
part of the approach: it merges several decision table sets Dtable(F1), Dtable(F2),
Figure 5.7: The decision table merging process to generate the global decision tree
... etc. into a single one. It consists of several phases, namely intersection, filtering and reduction, as shown in figure 5.7.
As shown in figure 5.7, the decision tables Dtable(Fi) of all the sites Si with data sets Di, where i = 1, 2, 3, ..., d, are merged. First, the intersection phase is carried out, in which the common regions, i.e. rules, are found. In the second phase, the less useful disjoint regions are removed from the list; this process is known as filtering. In the third phase, reduction, the disjoint regions that can be combined with minor changes are merged to reduce the number of disjoint regions.
Intersection Phase: This phase combines the regions of two decision models,
each presented as a decision table, by extracting the components common to
both. The sets of (numerical) values of each region in each model are compared
to discover common sets of values across each variable. If the pair of regions
has the same class, the class assigned to the merged region is simply that
class; otherwise a class conflict arises. Andrzejak, Langner and Zabala [10]
propose three strategies to address this problem: a) assign the class with the
greatest confidence, b) assign the class with the greater probability, or
c) retrain the model with examples from the conflicting class regions. If no
strategy resolves the conflict, the region is removed from the merged model.
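The intersection step and conflict-resolution strategy (a) can be sketched as follows. The region representation, a dict of per-attribute numeric intervals plus a class label and a confidence score, is an illustrative assumption, not the exact structure used in the implementation:

```python
# Sketch of the intersection phase: overlap two regions attribute by
# attribute; on a class conflict, keep the class with greater confidence.
# The region layout is assumed for illustration only.

def intersect_regions(r1, r2):
    """Return the region common to r1 and r2, or None if they are disjoint."""
    common = {}
    for attr, (lo1, hi1) in r1["bounds"].items():
        lo2, hi2 = r2["bounds"][attr]
        lo, hi = max(lo1, lo2), min(hi1, hi2)
        if lo > hi:                       # no overlap on this attribute
            return None
        common[attr] = (lo, hi)
    # Strategy (a): when classes agree this keeps the shared class; when
    # they conflict it keeps the class with the greatest confidence.
    winner = max(r1, r2, key=lambda r: r["conf"])
    return {"bounds": common, "cls": winner["cls"], "conf": winner["conf"]}
```

For example, a rule "SSC in [60, 100] → YES (confidence 0.9)" intersected with "SSC in [80, 100] → NO (confidence 0.6)" would yield "SSC in [80, 100] → YES".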
Filter Phase: This phase removes the disjoint regions from the intersected
model and is essentially a pruning operation: the regions with the highest
relative volume and number of training examples are retained. Strecht,
Mendes-Moreira and Soares [11] address the issue by removing the disjoint
regions, and highlight the case where the models are not mergeable because
all their regions are disjoint.
Reduction Phase: This phase applies when a set of regions has the same class
and equal values on all variables except one; its task is to find which
regions can be joined into one, to obtain a simpler merged model. For nominal
variables, the differing variable takes the union of its values from all
regions; for numeric variables, the regions are joined only if their intervals
are contiguous.
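The reduction rule just described can be sketched as below; the region layout (per-attribute values, either a numeric interval or a frozenset of nominal values, plus a class label) is an illustrative assumption:

```python
# Sketch of the reduction phase: two same-class regions that agree on every
# attribute except one are joined, by value union for a nominal attribute or
# by interval union when the numeric intervals are contiguous.

def try_join(r1, r2):
    """Join r1 and r2 into one region, or return None if they cannot join."""
    if r1["cls"] != r2["cls"]:
        return None
    differing = [a for a in r1["bounds"] if r1["bounds"][a] != r2["bounds"][a]]
    if len(differing) != 1:               # must differ on exactly one variable
        return None
    attr = differing[0]
    v1, v2 = r1["bounds"][attr], r2["bounds"][attr]
    if isinstance(v1, frozenset):         # nominal: union of the value sets
        joined = v1 | v2
    else:                                 # numeric: contiguous intervals only
        (lo1, hi1), (lo2, hi2) = sorted([v1, v2])
        if lo2 > hi1:
            return None
        joined = (lo1, max(hi1, hi2))
    bounds = dict(r1["bounds"])
    bounds[attr] = joined
    return {"bounds": bounds, "cls": r1["cls"]}
```

Under this sketch, "HSC in [0, 50]" and "HSC in [50, 100]" with the same class and matching other attributes would join into "HSC in [0, 100]".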
5.4 Summary
In this chapter the detailed working of the proposed model has been discussed
with a suitable example. First, the sample training data sets at sites S1 and
S2 were introduced. The working of each local site and of the coordinator site
was then described in detail. At the end of the chapter the decision rule
merging policies were introduced with an example, along with the different
phases of merging.
5.5 References
1. B. Bursteinas and J. Long, "Merging distributed classifiers," in 5th World
Multiconference on Systemics, Cybernetics and Informatics, 2001.

2. "Students' Admission Prediction using GRBST with Distributed Data Mining,"
Communications on Applied Electronics (CAE), ISSN 2394-4714, Foundation of
Computer Science (FCS), New York, USA, vol. 2, no. 1, June 2015.

3. "A Proposed DDM Algorithm and Framework for EDM of Gujarat Technological
University," International Conference on Advances in Engineering, organized by
Saffrony Institute of Technology, 22nd-23rd January 2015.

4. "A Dynamic and Scalable Evolutionary Data Mining for Distributed
Environments," NCEVT-2013, PIET, Limda.

5. "Faculty Performance Evaluation Based on Prediction in Distributed Data
Mining," 2015 IEEE ICETECH, Coimbatore.

6. "Prediction and analysis of student performance using distributed data
mining," International Conference on Information, Knowledge & Research in
Engineering, Management and Sciences (IC-IKR-EMS), 7th Dec. 2014, KIT,
Gujarat, IJETAETS, ISSN 0974-3588.

7. M. Kuhn, S. Weston, N. Coulter, and R. Quinlan, C5.0 Decision Trees and
Rule-Based Models, R package version 0.1.0-16, 2014.

8. P. Chan and S. Stolfo, "Learning arbiter and combiner trees from
partitioned data for scaling machine learning," in Proc. Intl. Conf. on
Knowledge Discovery and Data Mining, 1995.

9. F. J. Provost and D. N. Hennessy, "Distributed machine learning: scaling
up with coarse-grained parallelism," in Proceedings of the 2nd International
Conference on Intelligent Systems for Molecular Biology, vol. 2, pp. 3407,
Jan. 1994.

10. A. Andrzejak, F. Langner, and S. Zabala, "Interpretable models from
distributed data via merging of decision trees," in 2013 IEEE Symposium on
Computational Intelligence and Data Mining (CIDM), Apr. 2013.

11. P. Strecht, J. Mendes-Moreira, and C. Soares, "Merging Decision Trees: a
case study in predicting student performance," in Proceedings of the 10th
International Conference on Advanced Data Mining and Applications,
pp. 535-548, 2014.
Chapter 6
Data Collection, Preprocessing and Implementation
6.1 Introduction
Data collection is often a loosely controlled process, and the gathered data
frequently contain out-of-range values, impossible data combinations, missing
values, noise and other defects. Data that have not been properly screened can
produce misleading results, so the raw data need to be preprocessed before
they can support information generation and decision making.
The data mining step that transforms raw data into an appropriate and
understandable form for further processing is called data preprocessing. In
the real world, data are most often incomplete, uncertain, inconsistent and
error-prone; the phrase "Garbage In, Garbage Out" is particularly applicable
to machine learning and data mining. Data preprocessing is therefore required
to produce quality data on which decisions can be based.
In this chapter different data collections used for implementation, their
preprocessing and implementation have been discussed in detail.
6.2 Data Collection
Data collection is the process of gathering and measuring information on
variables of interest, in an established systematic fashion that enables one
to answer stated research questions, test hypotheses, and evaluate outcomes.
There are numerous data collection methods available, but in this research
work the real data set of student admissions at Parul University (gathered
from the PU Web portal) and the Zoo data set have been used.
6.2.1 Zoo Data Set
This data set has been downloaded from the UCI repository. It is a simple
database containing 17 Boolean-valued attributes and one numeric class
attribute (type). There are 101 instances in total, with no missing values.
The attribute information is given in table 6.1.
6.2.2 Student Admission Data Set
This research has been carried out on a real data set of Parul University for
predicting students' admissions into different fields/branches of different
colleges. The data have been collected from the Parul University Web Portal.
In total, more than 100,000 records have been used for training. The data set
has more than 10 attributes, but an attribute selection method has been
applied to keep only the relevant attributes, and preprocessing has been
applied to turn them into quality data for the further data mining process.
The student admission data set shown in table 6.2 has been processed on two
different sites S1 and S2. For simplicity of calculation, site S1 has 179
instances and site S2 has 142 instances to process.
Sr. No.   Attribute Name   Data Type   Value (Range)     Remarks
1         Animal Name      Boolean                       Unique for each instance
2         Hair             Boolean
3         Feathers         Boolean
4         Eggs             Boolean
5         Milk             Boolean
6         Airborne         Boolean
7         Aquatic          Boolean
8         Predator         Boolean
9         Toothed          Boolean
10        Backbone         Boolean
11        Breathes         Boolean
12        Venomous         Boolean
13        Fins             Boolean
14        Legs             Numeric     {0,2,4,5,6,8}
15        Tail             Boolean
16        Domestic         Boolean
17        Catsize          Boolean
18        Type             Numeric     [1,7]

Table 6.1: Zoo data set
Sr. No.   Attribute Name   Data Type   Value (Range)                 Remarks
1         Institute        Nominal     {PIET1, PIET2, PIT1, PIT2}
2         Admtype          Nominal     {State, Management, TFWS}
3         Category         Nominal     {SC, ST, SEBC, OPEN}
4         ACPCRank         Nominal
5         SSC              Nominal     [0,100]                       Percentage
6         HSC              Nominal     [0,100]                       Percentage
7         Degree           Nominal
8         City             Nominal
9         Name             Nominal

Table 6.2: Student admission data set collected from Parul University Web Portal
Sr. No.   Attribute Name   Data Type   Value (Range)   Remarks
1         Attendance       Numeric
2         Midsem result    Boolean     {YES, NO}
3         Pre Bklg         Boolean     {YES, NO}       Previous Backlog
4         Assignment       Nominal
5         Pre result       Boolean     {YES, NO}
6         Branch           Nominal     [0,100]
7         Pass             Boolean     {YES, NO}

Table 6.3: Student performance data set collected from departments of PIT College
6.2.3 Student Performance Data Set
In this research the student performance data set has been collected from
different departments of Parul Institute of Technology (PIT), a college of
Parul University. The data set contains many attributes, but using the
attribute selection method only the 7 attributes shown in table 6.3 have been
identified for further processing in data mining. This data set contains more
than 50,000 instances.
6.3 Data Pre-Processing
For the data mining process the data first need to be preprocessed into
quality data, so that the analysis and the resulting information support
quality decisions. Before starting, the database user should therefore be
clear about some of the most relevant questions: 1) What data is available for
the task? 2) Is this data relevant? 3) Is additional relevant data available?
4) How much historical data is available? 5) Who is the data expert?
For the data mining process the quantity of data plays as important a role as
its relevance. Useful rules of thumb are: 1) Number of instances (records,
objects): 5,000 or more is desired; with fewer, results are less reliable and
special methods (boosting, ...) should be used. 2) Number of attributes
(fields): 10 or more instances per attribute; with more fields, use feature
reduction and selection. 3) Number of targets: more than 100 instances per
class; if the classes are very unbalanced, use stratified sampling.
Figure 6.1: Forms of Data Preprocessing
Preprocessing is required before the data mining task because real-world data
are generally incomplete (lacking attribute values or certain attributes of
interest, or containing only aggregate data), noisy (containing errors or
outliers) and inconsistent (containing discrepancies in codes or names). The
data preprocessing tasks are explained below and shown in figure 6.1.
• Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
• Data integration: using multiple databases, data cubes, or files.
• Data transformation: normalization and aggregation.
• Data reduction: reducing the volume but producing the same or similar
analytical results.
• Data discretization: part of data reduction, replacing numerical attributes
with nominal ones.
Data cleaning: This is the first preprocessing operation. It comprises various
ways to clean the data.
1. Fill in missing values (attribute or class value):
• Ignore the tuple: usually done when class label is missing.
• Use the attribute mean (or majority nominal value) to fill in the
missing value.
• Use the attribute mean (or majority nominal value) for all samples
belonging to the same class.
• Predict the missing value by using a learning algorithm: consider the attribute with the missing value as a dependent (class) variable and run a learning algorithm (usually Bayes or decision tree) to predict the missing value.
2. Identify outliers and smooth out noisy data:
• Binning: sort the attribute values and partition them into bins (see "Unsupervised discretization" below), then smooth by bin means, bin medians, or bin boundaries.
• Clustering: group values in clusters and then detect and remove outliers (automatically or manually).
• Regression: smooth by fitting the data to regression functions.
3. Correct inconsistent data: use domain knowledge or expert decision.
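Two of the missing-value options above can be sketched as follows; the function names and record layout are illustrative assumptions, not the implementation's own:

```python
# Sketch of two cleaning steps: filling a missing numeric value with the
# attribute mean, and filling a missing nominal value with the majority
# value among samples of the same class. Missing values are None here.
from collections import Counter

def fill_with_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def fill_with_class_majority(rows, attr, class_attr):
    """Replace None in `attr` with the majority value of the row's class."""
    majority = {}
    for cls in {row[class_attr] for row in rows}:
        vals = [r[attr] for r in rows
                if r[class_attr] == cls and r[attr] is not None]
        majority[cls] = Counter(vals).most_common(1)[0][0]
    return [dict(row, **{attr: majority[row[class_attr]]})
            if row[attr] is None else row for row in rows]
```

For example, `fill_with_mean([2, None, 4])` replaces the gap with the mean of the known values, 3.0.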
Data transformation: Data transformation is the process of converting data
or information from one format to another, usually from the format of a source
system into the required format of a new destination system. Some of the data
transformation techniques are discussed below:
1. Normalization:
• Scaling attribute values to fall within a specified range. Example: to
transform V in [min, max] to V’ in [0,1], apply V’=(V-Min)/(Max-
Min)
• Scaling by using mean and standard deviation (useful when min
and max are unknown or when there are outliers): V’=(V-Mean)/StdDev.
2. Aggregation: moving up in the concept hierarchy on numeric attributes.
3. Generalization: moving up in the concept hierarchy on nominal attributes.
4. Attribute construction: replacing or adding new attributes inferred by
existing attributes.
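The two scaling rules stated above can be written directly as code; this is a minimal sketch, with attribute names and ranges chosen for illustration:

```python
# Min-max scaling and z-score scaling, exactly as the formulas above:
# V' = (V - Min) / (Max - Min)  and  V' = (V - Mean) / StdDev.
import statistics

def min_max(v, lo, hi):
    """Scale v from [lo, hi] onto [0, 1]."""
    return (v - lo) / (hi - lo)

def z_score(values):
    """Scale by mean and standard deviation, useful when min/max are
    unknown or when there are outliers."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]
```

For a percentage attribute such as SSC, `min_max(75, 0, 100)` gives 0.75.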
Data reduction: Data reduction is the transformation of numerical or al-
phabetical digital information derived empirically or experimentally into a
corrected, ordered, and simplified form.
1. Reducing the number of attributes
• Data cube aggregation: applying roll-up, slice or dice operations.
• Removing irrelevant attributes: attribute selection (filtering and wrapper methods), searching the attribute space.
• Principal component analysis (numeric attributes only): searching for a lower-dimensional space that can best represent the data.
2. Reducing the number of attribute values
• Binning (histograms): reducing the number of attribute values by grouping them into intervals (bins).
• Clustering: grouping values in clusters.
• Aggregation or generalization
3. Reducing the number of tuples
• Sampling
6.4 Test Data Set
The dataset has been partitioned equally into as many subsets as there are
sites. The experiments have been performed on 10k, 20k, 50k and 100k records
(here k means thousand) at 2, 5 and 10 sites. The local training models have
been generated and merged using the proposed approach, and the accuracy of the
resulting global models has been checked on test datasets: it exceeds 98% in
classifying the test dataset. The basic comparisons clearly show that
accuracy, training time, communication overhead and other parameters have been
optimized. The student admission data sets for the years 2013-14 and 2014-15
have been used to train the model, which then predicts on the 2015-16
admission data set with more than 98.03% accuracy. These experimental results
have also been verified using 10-fold cross validation.
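The fold construction behind the 10-fold cross validation can be sketched as follows; the function name and index-based layout are illustrative, not taken from the implementation:

```python
# Sketch of a k-fold split: the records are divided into k folds, and each
# fold serves once as the test set while the remaining records train the
# model, so every record is tested exactly once.
def k_fold_splits(n_records, k=10):
    """Yield (train_indices, test_indices) pairs for k folds."""
    for i in range(k):
        test = set(range(i * n_records // k, (i + 1) * n_records // k))
        train = [j for j in range(n_records) if j not in test]
        yield train, sorted(test)
```

With 100 records and k = 10, each fold tests 10 records and trains on the other 90.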
6.5 Implementation
The research work has been carried out on different numbers of sites with the
following hardware and software configurations:
Software
1. Database: Microsoft SQL Server 2008 R2
2. Tool: Visual Studio 2010 for .NET
3. Language: C#
4. Apache Hadoop Framework
Hardware
1. Processor: AMD E1-2500, 1.4 GHz
2. RAM: 4 GB
3. System: 64-bit OS / Ubuntu Linux OS (For Apache Hadoop Framework)
4. Hard disk: 400 GB
The captured screenshots are shown below:
Figure 6.2: Site Selection
Figure 6.3: Run J48 algorithm to each site
Figure 6.4: Load/Save the training model
Figure 6.5: Decision Tree and Decision Table at each site
Figure 6.6: Combined Decision Tree and Decision Table
Figure 6.7: Branch wise decision rules
6.6 Summary
In this chapter the importance of data collection and preprocessing has been
discussed, along with the different ways in which a data set may not be of
sufficient quality to process; such data sets need to be preprocessed. In this
research work two different data sets have been used. The local training
models have been generated and merged using the proposed approach, and the
accuracy of the resulting global models has been checked on test datasets: it
exceeds 98% in classifying the test dataset.
Chapter 7
Implementation with Apache Hadoop
7.1 Introduction to Apache Hadoop
Apache Hadoop [3] is an open source software framework [4] with two main
components: 1) Map-Reduce, the distributed processing framework, and 2) the
Hadoop Distributed File System (HDFS) [2], a distributed file system. One of
the most important reasons to use Hadoop in this research is to process and
analyze data sets far too large for a single machine. In Apache Hadoop the
storage is provided by HDFS and the analysis is done by Map-Reduce. The Apache
Hadoop architecture is shown in figure 7.1.
7.2 Hadoop Map-Reduce
In Map-Reduce the processing is broken into two phases, Map and Reduce. Each
phase takes key-value pairs as input and produces key-value pairs as output.
The input to the Map phase is the raw data. Generally a text input format is
chosen, in which each line of the data set is the value and the key is the
byte offset of that line from the beginning of the file. The output of the map
function is sorted and grouped by key before it is passed to the reduce
function [1].
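The key-value flow just described can be mimicked in a few lines of ordinary code. This is a toy, in-process illustration only, not Hadoop code; the function names are assumptions:

```python
# Toy Map-Reduce flow: lines keyed by their byte offset (as a text input
# format does), a word-count map function, a sort/group step standing in
# for the framework's shuffle, and a reduce that sums grouped values.
from itertools import groupby

def text_input(data):
    """Yield (byte_offset_of_line, line_text) pairs."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip("\n")
        offset += len(raw)

def map_fn(_offset, line):
    for word in line.split():
        yield word, 1

def reduce_fn(key, values):
    return key, sum(values)

def run_job(data):
    intermediate = [kv for off, line in text_input(data)
                    for kv in map_fn(off, line)]
    intermediate.sort()                     # the framework's sort/group step
    return dict(reduce_fn(k, [v for _, v in grp])
                for k, grp in groupby(intermediate, key=lambda kv: kv[0]))
```

Running `run_job("a b\nb b\n")` produces the grouped counts `{"a": 1, "b": 3}`.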
Figure 7.1: Apache Hadoop Architecture
The Hadoop runtime is responsible for dividing the job into small Map and
Reduce tasks. A Mapper class is required to implement the map function, and a
Reducer class to implement the reduce function.
7.2.1 Map
The input data is divided into input splits, and one map task is created for
each split. The list of <key1, value1> entries in the split is then processed
one after another, producing an intermediate list of <key2, value2> entries.
Several classes need to be defined in the map phase: InputFormat (defines the
source of input files), InputSplits (splits the input and defines a single
unit of work for one map task), RecordReader (converts the binary data of an
InputSplit into <key, value> pairs), Mapper, Combiner and Partitioner. The
size of the split plays an important role in load balancing. The split should
not be too small; otherwise creating the map task costs more overhead than its
execution time. In most cases the split size equals one HDFS block, because if
a split contained more than one HDFS block it could defeat the locality
optimization: all the HDFS blocks need not be present on the same node. From
the InputSplit, the RecordReader creates <key, value> pairs repeatedly until
the entire split has been
processed.
There is an optional phase, the Combiner, which reduces the amount of data
shuffled between the map and reduce tasks by running the reduce function on
the map output locally on the node. There is no guarantee of how many times
Hadoop will call the combiner function. The number of partitions equals the
number of reducers: the partitioner splits the intermediate key space across
the reducers.
7.2.2 Reduce
The reducer reduces a set of intermediate values that share a common key to a
smaller set of values. Unlike the number of map tasks, the number of reduce
tasks is controllable. If the number of reduce tasks is very high, the shuffle
takes much time and bandwidth; on the other hand, a small number of long
reduce tasks hurts the overall execution time through a poor degree of
parallelism. There are three phases in the reducer: shuffling, which moves the
relevant parts of the map output; sorting, which sorts the intermediate keys
on a single node; and reduce, which merges the values that share the same key
and writes the output.
Many data mining applications need to pass arguments to the Map-Reduce tasks.
Large read-only files are distributed via the DistributedCache, and small
arguments are passed using the setter and getter methods of the JobConf class.
There are two versions of Hadoop which differ in their architecture and
Map-Reduce job execution flow. Figure 7.2 shows the architecture and the job
execution flow of Hadoop Map-Reduce version 1.x (MRv1).
7.3 HDFS
In Hadoop, the Hadoop Distributed File System (HDFS) is the main storage
component; it provides large scale storage (terabytes or petabytes) in a
distributed architecture and can easily be extended by scaling out. A file in
Figure 7.2: Architecture and job execution flow in Hadoop Map-Reduce version 1.x (MRv1)
HDFS is divided into blocks of predefined or customized size (mostly 128 MB
per block). Such a large block size keeps the seek time small compared to the
time spent reading the data from disk. The original architecture is available
from Yahoo! Inc. [6] and is based on the design of GFS [7]. By replicating the
blocks on various machines, Hadoop supports fault tolerance and the handling
of node failures, and hence the throughput is also very high.
A main advantage of Hadoop is its capability to handle unstructured data
collected from different data sources or in different formats. Moreover,
heterogeneous data collected by unrelated systems can be deposited in the
Hadoop cluster without predetermining how the data will be analyzed. HDFS is
not suitable for applications that require immediate seek access, but is well
suited to Write Once, Read Many (WORM) applications. HDFS provides very high
data locality to Map-Reduce applications by placing blocks so that very few
block movements are needed between machines of the same rack or of different
racks.
7.4 Decision Tree Map-Reduce
Figure 7.3 shows an overview of the map-reduce model. The input is divided
into chunks of blocks using the splitting method. The block size may depend on
the application, but most of the time it is 128 MB, to improve overall
performance. One mapper works on each input split and generates intermediate
<key, value> pairs; the reducer merges all the intermediate results to form
the final output.
The decision tree is generated with map-reduce using the following new data
structures:
1. Attr_table: the attribute table. It includes the basic information of each
attribute attr, the row identifier of an instance row_id, the attribute values
values(attr) and the class label c of each instance.
2. Cnt_table: the count table. It stores the count of instances per class
label if split by attribute attr; its two attributes are the count cnt and the
class label c.
3. Hash_table: the hash table for indexing and linking. It stores the link
between the row_id and the node_id (for tree nodes), as well as the link
between a branch node subnode_id and its parent node node_id.
The decision tree generation process is made of four phases: data preparation,
selection, update and tree growing.
7.4.1 Data Preparation
1. In this phase the traditional data is converted into a
MapReduce-supportable format.
2. The MAP_ATTR procedure transforms each instance record into the attribute
table, using attribute aj (j = 1, 2, ..., M) as the key and the row_id and
class label c as values.
3. The REDUCE_ATTR procedure computes the number of instances with each class
label (if split by attribute aj), forming the count table.
Figure 7.3: Overview of Map-Reduce Model
Below are the algorithm steps:
procedure MAP_ATTR(row_id, (a1, a2, ..., aM, c))
    emit(aj, (row_id, c))
end procedure

procedure REDUCE_ATTR(aj, (row_id, c))
    emit(aj, (c, cnt))
end procedure
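The MAP_ATTR / REDUCE_ATTR pair can be simulated sequentially as below. The record layout (a dict of attribute values plus a class label) stands in for Attr_table and Cnt_table and is an illustrative assumption:

```python
# Sequential simulation of the data preparation phase: map_attr builds the
# attribute-table entries, reduce_attr aggregates them into per-class counts
# for each attribute-value pair (the Cnt_table).
from collections import defaultdict

def map_attr(row_id, attributes, class_label):
    """Emit ((attribute, value), (row_id, class)) for each attribute."""
    for attr, value in attributes.items():
        yield (attr, value), (row_id, class_label)

def reduce_attr(pairs):
    """Count instances per class for each attribute-value pair."""
    counts = defaultdict(int)
    for (attr, value), (_row_id, cls) in pairs:
        counts[(attr, value, cls)] += 1
    return dict(counts)
```

For two Zoo-style records with Milk = 1 and class "mammal" and one with Milk = 0 and class "bird", the count table contains ("Milk", 1, "mammal") → 2 and ("Milk", 0, "bird") → 1.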
7.4.2 Selection
1. Select the best attribute abest.
2. REDUCE_POPULATION aggregates the total number of records for attribute aj,
taking the instances of each attribute-value pair.
3. MAP_COMPUTATION computes the information and split information of aj.
4. REDUCE_COMPUTATION computes the information gain ratio and selects the
attribute with the maximum GainRatio(aj) as the splitting attribute abest.
procedure REDUCE_POPULATION(aj, (c, cnt))
    emit(aj, all)
end procedure

procedure MAP_COMPUTATION(aj, (c, cnt, all))
    compute Entropy(aj)
    compute Info(aj) = (cnt/all) * Entropy(aj)
    compute SplitInfo(aj) = -(cnt/all) * log(cnt/all)
    emit(aj, (Info(aj), SplitInfo(aj)))
end procedure

procedure REDUCE_COMPUTATION(aj, (Info(aj), SplitInfo(aj)))
    emit(aj, GainRatio(aj))
end procedure
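The selection-phase quantities can be computed directly from per-value class counts; this hedged sketch works on plain lists rather than actual Map-Reduce streams:

```python
# Entropy, Info(aj), SplitInfo(aj) and the gain ratio, computed from class
# counts. `class_counts` are the class totals before the split; `partitions`
# holds one class-count list per value of the candidate attribute aj.
import math

def entropy(class_counts):
    total = sum(class_counts)
    return -sum(c / total * math.log2(c / total) for c in class_counts if c)

def gain_ratio(class_counts, partitions):
    total = sum(class_counts)
    info_a = sum(sum(p) / total * entropy(p) for p in partitions)
    split_info = -sum(sum(p) / total * math.log2(sum(p) / total)
                      for p in partitions if sum(p))
    gain = entropy(class_counts) - info_a
    return gain / split_info if split_info else 0.0
```

An attribute that splits ten instances of two balanced classes into two pure halves has gain 1 and split information 1, hence gain ratio 1.0, the maximum possible, and would be chosen as abest.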
7.4.3 Update
MAP_UPDATE_COUNT reads a record from the attribute table whose key value
equals abest and emits the count of class labels. MAP_HASH assigns a node_id
based on a hash value of abest, making sure that records with the same value
are sent to the same partition.
procedure MAP_UPDATE_COUNT(abest, (row_id, c))
    emit(abest, (c, cnt))
end procedure

procedure MAP_HASH(abest, row_id)
    compute node_id = hash(abest)
    emit(row_id, node_id)
end procedure
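The MAP_HASH guarantee, that rows sharing the best attribute's value always land in the same partition, can be sketched as follows; the use of `hashlib` instead of Python's built-in `hash()` is an illustrative choice to make the assignment run-independent:

```python
# Sketch of MAP_HASH: a stable hash of the best attribute's value decides
# the node_id, so rows with equal values always reach the same partition.
import hashlib

def map_hash(rows, best_attr, partitions):
    """Return {row_id: node_id} keyed by a hash of rows[row_id][best_attr]."""
    def node_id(value):
        digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
        return int(digest, 16) % partitions
    return {row_id: node_id(rec[best_attr]) for row_id, rec in rows.items()}
```

Two rows with the same Category value, for example, are always assigned the same node_id, so the tree-growing step sees them in one partition.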
7.4.4 Tree Growing
The previous step generates the tree nodes; these are now extended into the
decision tree by creating links between nodes. A node_id is created for each
generated node and compared with the existing values. If it already exists,
the node is grouped with the existing nodes; otherwise a new sub-node is
created.
procedure MAP(abest, row_id)
    compute node_id = hash(abest)
    if node_id is the same as an existing value then
        emit(row_id, node_id)
    else
        add a new subnode
        emit(row_id, node_id, subnode_id)
    end if
end procedure
Of the four steps above, the data preparation sequence of MapReduce operations
is a one-time task, while the remaining steps are iterative. The terminal
condition is that all node_ids have become leaf nodes; the decision tree is
then built.
7.5 Apache Hadoop Cluster
The above methods have been implemented on Hadoop clusters of 2, 5 and 10
nodes. In a Hadoop cluster there is one Name Node (the primary node, acting as
master) while the others are Data Nodes (acting as slaves). In the
experimental set-up, one node is first installed as the primary name node,
which holds the information about all the other data nodes; the name node thus
maintains the meta directory of the cluster, including all its services.
Figure 7.4 shows the installation of Apache Hadoop with the node name coed82.
The name node is the core part of the HDFS file system. It does not store the
files itself; it holds the tree-like directory of all files in the file system
and keeps track of where the files are stored across the cluster. A client
application talks to it to locate a file or to perform operations such as
adding, moving, copying or deleting file contents.
HDFS contains more than one data node, and the data are physically stored on
the data nodes. The same data may be replicated across several data nodes to
support fault tolerance and availability. A data node must first connect to
the name node to register itself in the file system. A client application
approaches the name node first; the name node then
Figure 7.4: Apache Hadoop Installation
Figure 7.5: Hadoop MapReduce Administration
connects it with the data node, and finally the client application fetches the
data directly from the data node.
Figure 7.5 shows the map reduce administration with the cluster summary for
two data nodes. It also includes the map and reduce task capacity, the heap
size (73.5 MB here), the scheduling information, and the details of running
and retired jobs.
As shown in figure 7.6 and figure 7.7, the name node and cluster information
is available, containing the following:
1. Live Nodes: the number of data nodes available for processing.
2. Configured Capacity: the memory space configured, in GB.
3. Dead Nodes: the data nodes not currently working or participating due to
failure.
4. DFS Used/Unused: the space used for the Distributed File System.
5. Number of Under-Replicated Blocks: the total number of under-replicated
blocks.
6. Storage Directory: the path, type and state of the directory.
Figure 7.8 and figure 7.9 give detailed information on the contents of the
directory and the directory log.
Figure 7.6: Two Node cluster
Figure 7.7: Cluster Summary
Figure 7.8: Contents of Directory
Figure 7.9: Directory Log
7.6 Conclusion
The map-reduce process of Apache Hadoop writes a backup copy of the
intermediate results after the completion of each task; it thus performs read
and write operations on large data sets, which can reduce the throughput and
the overall performance. The operations are not "in-memory" in nature, which
is the main reason Hadoop is not used for real-time analytics. For the
proposed research work, however, Hadoop is well suited, since it supports the
dynamic and scalable nature of the problem as well as fault tolerance in a
distributed environment.
7.7 References
1. Tom White, Hadoop: The Definitive Guide, 3rd Edition, O'Reilly Media,
Inc., Sebastopol, CA 95472, 2012.
2. Dirk deRoos, Paul C. Zikopoulos, Bruce Brown, Rafael Coss, and Roman B.
Melnyk, Hadoop For Dummies, John Wiley & Sons, Inc., Hoboken, New Jersey,
2014.
3. Apache Hadoop, http://hadoop.apache.org/releases.html
4. http://en.wikipedia.org/wiki/Apache_Hadoop
5. HDFS: http://hortonworks.com/hadoop/hdfs
6. "The Hadoop Distributed File System," Yahoo! Inc., in Proceedings of MSST
2010, IEEE, 2010.
7. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File
System," Google Inc., in SOSP'03, October 19-22, 2003, Bolton Landing, NY,
USA, 2003.
Chapter 8
Results, Conclusions and Future Enhancements
8.1 Introduction
In the first section of this chapter the experimental results are presented in
detail. The comparative analysis over different numbers of sites (2, 5 and 10)
and different numbers of records for the different data sets proves
experimentally that the proposed dynamic and scalable approach is much faster
and more accurate. The chapter closes by noting that this research can be
further extended to streaming data, for applications such as intra-day stock
markets, weather forecasting and other data generated by real-time
applications.
8.2 Results
The proposed algorithm has been applied to different data sets: student
admission, student performance and the Zoo data set. The confusion matrices
for Site1, Site2 and the coordinator site have been derived experimentally for
the student data set as below.
Table 8.4 and figures 8.1 and 8.2 below show the comparative performance of
Site1, Site2 and the coordinator site. The accuracy and the time
          YES        NO
YES       TP=0.788   FN=0
NO        FP=0.043   TN=0.168

        YES   NO   TOTAL   RECOGNITION (%)
YES     141   0    141     78.77
NO      8     30   38      21.052
TOTAL   149   30   179     95.53

Table 8.1: Confusion matrix for Site1
          YES        NO
YES       TP=0.648   FN=0
NO        FP=0.091   TN=0.26

        YES   NO   TOTAL   RECOGNITION (%)
YES     92    0    92      64.788
NO      13    37   50      35.211
TOTAL   105   37   142     90.845

Table 8.2: Confusion matrix for Site2
          YES        NO
YES       TP=0.701   FN=0
NO        FP=0.066   TN=0.28

        YES   NO   TOTAL   RECOGNITION (%)
YES     225   0    225     70.09
NO      6     90   96      67.61
TOTAL   231   90   321     98.13

Table 8.3: Confusion matrix for the combined data set at the coordinator site
  Site          Accuracy    Error Rate    Training Time (Sec)
  Site1         90.845      9.115         0.11
  Site2         95.53       4.47          0.03
  Coordinator   98.13       1.87          0.05

  Site          Sensitivity    Specificity    Precision    Recall
  Site1         100            74             94.63        0.789
  Site2         100            78.94          87.62        0.648
  Coordinator   100            93.75          97.40        0.701

Table 8.4: Comparative performance at Site1, Site2 and the coordinator site
Figure 8.1: Performance measures
required to build the decision tree are better at the coordinator site, and
the error rate of the combined approach is lower than that of the individual
sites.
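The measures in Table 8.4 follow directly from the confusion-matrix counts. As an illustrative sketch (the function name and structure are ours, not the thesis implementation), the Site1 values can be recomputed from the counts in Table 8.1:

```python
def metrics(tp, fn, fp, tn):
    """Derive the performance measures of Table 8.4 from raw confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy":    100 * (tp + tn) / total,   # correctly classified records
        "error_rate":  100 * (fp + fn) / total,
        "sensitivity": 100 * tp / (tp + fn),      # true-positive rate (recall on YES)
        "specificity": 100 * tn / (tn + fp),      # true-negative rate
        "precision":   100 * tp / (tp + fp),
    }

# Counts from Table 8.1 (Site1): TP=141, FN=0, FP=8, TN=30
site1 = metrics(tp=141, fn=0, fp=8, tn=30)
print({k: round(v, 2) for k, v in site1.items()})
# accuracy 95.53, error_rate 4.47, sensitivity 100.0, specificity 78.95, precision 94.63
```

The 95.53% accuracy matches the overall recognition rate in the last row of Table 8.1.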
Table 8.5 and Figure 8.3 compare the proposed approach on three different
data sets. From these statistics, the algorithm gives excellent performance
on the student admission data set, while its performance on the Zoo data set
is slightly poorer, since that set has more attributes than the other two.
This shows that the number of attributes also plays an important role in the
processing.
Figure 8.2: Recall and Training Time (Sec)
  DataSet               Number of Instances   Accuracy (%)   Error Rate (%)   Training Time (Sec)
  Student Admission     321                   98.13          1.87             0.05
  Student Performance   100                   93             7                0.02
  ZOO                   101                   96.05          3.95             0.04

  DataSet               Sensitivity   Specificity   Precision   Recall
  Student Admission     100           93.75         97.40       98.13
  Student Performance   100           92.63         93.20       93
  ZOO                   100           93.34         96.61       96.05

Table 8.5: Performance statistics for three different data sets
Figure 8.3: Performance comparison
The experimental results for the student admission data set presented below
show the processing time, communication overhead time and total time (in
milliseconds) for the centralized approach, the intermediate message passing
approach and the proposed approach, for different numbers of sites and
different numbers of instances. The detailed statistics and the graphical
comparisons are shown one by one below.
  Records   Centralized Approach (ms)   Intermediate Message Passing (ms)   Proposed Method (ms)
  10000     17.7                        11.021                              6.4
  20000     40.8                        25.83                               14.3
  40000     102.1                       59.79                               29.32
  60000     130.6                       97.59                               36
  80000     205.7                       128.98                              58.6
  100000    229.3                       160.33                              70.73

Table 8.6: Statistics for total time on 2 sites
Figure 8.4: Comparison for total time on 2 sites
  Records   Centralized Approach (ms)   Intermediate Message Passing (ms)   Proposed Method (ms)
  10000     14.4                        10.78                               3.33
  20000     34.79                       29.77                               7.61
  40000     73.36                       59.83                               16.91
  60000     92.06                       80.72                               21.65
  80000     150.97                      116.03                              32.4
  100000    164.24                      149.48                              38.05

Table 8.7: Statistics for total time on 5 sites
  Records   Centralized Approach (ms)   Intermediate Message Passing (ms)   Proposed Method (ms)
  10000     8.33                        5.78                                3.02
  20000     17.01                       15.08                               6.28
  40000     35.54                       3.49                                14.903
  60000     46.46                       45.09                               18.07
  80000     77.61                       63.71                               26.48
  100000    98.331                      73.29                               31.31

Table 8.8: Statistics for total time on 10 sites
Figure 8.5: Comparison for total time on 5 sites
Figure 8.6: Comparison for total time on 10 sites
Figure 8.7: Communications Overhead in Centralized Approach
Figure 8.8: Communications Overhead in Intermediate Message Passing Approach
Figure 8.9: Communications Overhead in Proposed Approach
  Records   Proposed Method (ms)   Proposed Method with Apache Hadoop (ms)
  10000     6.4                    8.32
  20000     14.3                   17.21
  40000     29.32                  25.15
  60000     36                     32.56
  80000     58.6                   51.48
  100000    70.73                  62.23

Table 8.9: Total time for 2 sites
  Records   Proposed Method (ms)   Proposed Method with Apache Hadoop (ms)
  10000     3.33                   4.47
  20000     7.61                   8.55
  40000     16.91                  13.27
  60000     21.65                  19.52
  80000     32.4                   24.73
  100000    38.05                  31.04

Table 8.10: Total time for 5 sites
  Records   Proposed Method (ms)   Proposed Method with Apache Hadoop (ms)
  10000     3.02                   3.89
  20000     6.28                   7.53
  40000     14.903                 10.28
  60000     18.07                  14.31
  80000     26.48                  19.34
  100000    31.31                  23.01

Table 8.11: Total time for 10 sites
The communication overhead (in milliseconds) of the centralized, intermediate
message passing and proposed approaches is discussed here. The experimental
results show that the communication overhead of the proposed approach is much
lower than that of the other two approaches. In the centralized approach the
communication overhead remains relatively constant as the number of sites
increases, but the processing time grows and the load concentrates on a
single machine. In addition, the experimental results show that the
implementation with Apache Hadoop Map-Reduce is much faster than the first
implementation.
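As a quick sanity check on these tables, the per-row speedup of the proposed method can be computed directly from the Table 8.6 figures; this is a small sketch of ours, with variable names that are not from the thesis:

```python
# Total times (ms) from Table 8.6 (2 sites)
records      = [10000, 20000, 40000, 60000, 80000, 100000]
centralized  = [17.7, 40.8, 102.1, 130.6, 205.7, 229.3]
intermediate = [11.021, 25.83, 59.79, 97.59, 128.98, 160.33]
proposed     = [6.4, 14.3, 29.32, 36, 58.6, 70.73]

# Speedup of the proposed method over each baseline, row by row
for n, c, i, p in zip(records, centralized, intermediate, proposed):
    print(f"{n:>6} records: {c/p:.2f}x vs centralized, {i/p:.2f}x vs intermediate")
# At 100000 records the proposed method is roughly 3.2x and 2.3x faster.
```

Averaging such ratios across all site configurations is presumably how the overall 3.14x and 2.34x figures quoted in the conclusions were obtained.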
8.3 Conclusions
The outcome of the proposed model shows that the objectives of the research
work have been achieved. The proposed model can handle large volumes of data;
the decision trees are merged with minimal network overhead; and the global
model preserves the quality of prediction. The accuracy, error rate,
specificity, sensitivity, precision, recall and training time reported in the
tables above are better than those of the existing systems. The total time
for local model generation and communication in the proposed approach is 3.14
and 2.34 times faster than the centralized and intermediate message passing
approaches, respectively. The student admission data sets for the years
2013-14 and 2014-15 were used to train the model, which was then applied to
the 2015-16 student admission data set and gave 98.03% accuracy for the
prediction. These experimental results have also been verified using 10-fold
cross validation.
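The 10-fold cross validation mentioned above splits the data into ten parts, trains on nine and tests on the held-out tenth, rotating the test fold. A minimal sketch of the fold bookkeeping (the classifier itself is abstracted away; the function name is illustrative, not from the thesis):

```python
def k_fold_indices(n_records, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_records))
    fold_size = n_records // k
    for f in range(k):
        # the last fold absorbs any remainder so every record is used
        start = f * fold_size
        stop = (f + 1) * fold_size if f < k - 1 else n_records
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test

# Every record is tested exactly once across the 10 folds
folds = list(k_fold_indices(321))        # e.g. the 321-record student admission set
tested = sorted(i for _, test in folds for i in test)
assert tested == list(range(321))
```

In each round the model would be trained on the `train` indices and its accuracy measured on `test`; the ten accuracies are then averaged.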
Decision tree learning on massive datasets is a common data mining task in
distributed environments, yet many state-of-the-art tree learning algorithms
require the training data to reside in memory on a single machine. While more
scalable implementations of tree learning have been proposed, they typically
require specialized parallel computing architectures. Moreover, most existing
approaches are static in nature, not domain-free, not scalable, and do not
preserve accuracy.
Our literature review and experiments on merging decision trees showed that
training time, communication overhead and accuracy are the major challenges.
To reduce the training time, the proposed algorithm processes only the new
dataset against the already trained model, which makes it scalable and
dynamic; to reduce the communication overhead, the local models are converted
into XML files; and to preserve accuracy, the proposed algorithm incorporates
a set of rule merging policies.
The proposed approach is divided into two major phases. In the first phase
the objectives were (a) to minimize the training time and (b) to reduce the
communication overhead, through four sub-phases: 1) generating the local
decision tree model; 2) parsing the decision tree into a decision table;
3) converting the decision table into an XML file to reduce the communication
overhead; and 4) applying the scalable approach to the new data set. The
total time for local model generation and communication in the proposed
approach is 3.14 and 2.34 times faster than the centralized and intermediate
message passing approaches, respectively.
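Sub-phases 2 and 3 can be sketched as follows. This is an illustrative reconstruction, not the thesis implementation: the rule representation and the XML tag names (`rules`, `rule`, `cond`) are our own assumptions about one reasonable encoding.

```python
import xml.etree.ElementTree as ET

def rules_to_xml(rules):
    """Serialize decision-table rows (condition dict -> class label) as compact XML."""
    root = ET.Element("rules")
    for conditions, label in rules:
        rule = ET.SubElement(root, "rule", {"class": label})
        for attr, value in conditions.items():
            ET.SubElement(rule, "cond", {"attr": attr, "value": value})
    return ET.tostring(root, encoding="unicode")

# A tiny decision table extracted from a local tree (hypothetical attributes)
local_rules = [
    ({"rank": "high", "category": "open"}, "YES"),
    ({"rank": "low"}, "NO"),
]
xml_model = rules_to_xml(local_rules)
print(xml_model)   # this compact string is what a site would ship to the coordinator
```

Shipping only such a serialized rule set, rather than the raw training records, is what keeps the per-site communication cost small.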
During the second phase of the proposed approach, we introduced several rule
merging policies to preserve quality, performing model intersection,
filtering and reduction. We compared the resulting merged and trained model
with the model generated by the J48 algorithm on the combined data; the
merged model preserves the accuracy.
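The intersection/filtering/reduction idea can be illustrated with a simple sketch. This is our own schematic, with a deliberately simplified rule representation and hypothetical names; the actual merging policies in the thesis are richer:

```python
def merge_rule_sets(site_a, site_b):
    """Combine two sites' rules: keep agreeing rules once, drop contradictions.

    A rule is (frozenset of (attribute, value) conditions, class label).
    """
    merged = {}
    for conditions, label in site_a + site_b:
        if conditions in merged and merged[conditions] != label:
            merged[conditions] = None          # contradictory rules are filtered out
        else:
            merged.setdefault(conditions, label)
    # reduction: discard the filtered (None) entries and de-duplicate
    return [(c, l) for c, l in merged.items() if l is not None]

a = [(frozenset({("rank", "high")}), "YES"), (frozenset({("rank", "low")}), "NO")]
b = [(frozenset({("rank", "high")}), "YES"), (frozenset({("rank", "mid")}), "YES")]
print(merge_rule_sets(a, b))   # three surviving rules; the duplicate is kept once
```

Here the duplicated rule survives once (intersection), conflicting rules would be dropped (filtering), and the output list is de-duplicated (reduction).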
The proposed approach has been implemented in two ways: in C# and as
Map-Reduce on Apache Hadoop. The experimental results show that the latter is
much faster as the volume of data and the number of processing sites (data
nodes) increase. In addition, Apache Hadoop provides fault tolerance and
horizontal scalability for processing very large volumes of data.
We have proposed a scalable and dynamic distributed approach for learning
tree models over large datasets, which frames tree learning as a series of
distributed computations. We have shown how this approach supports dynamic
and scalable construction of decision tree models, as well as ensembles of
such models. The proposed approach is more efficient than the existing ones.
8.4 Future Work
This research has been carried out on non-streaming data sets, where the data
are stored in data warehouses at different sites. The work can be further
extended to streaming data, which arrive continuously and must be processed
in real time; examples include intra-day stock market data generated every
moment, weather data, scientific data and other data generated in real time
at different geographical sites. In future, this research can be extended to
streaming data in distributed environments.

Since Hadoop does not perform in-memory processing, it is not well suited to
real-time analytics on large data sets: it writes results to disk after each
operation, which can reduce throughput. In future, a framework such as Spark,
which performs in-memory operations and also supports streaming data, may be
used for real-time analytics to improve throughput.
8.5 Summary
In this chapter the experimental results have been presented in detail. The
comparative analysis for different numbers of sites (2, 5 and 10), different
numbers of records and different data sets has experimentally shown that the
proposed dynamic and scalable approach is faster and more accurate. The
proposed approach is 3.14 and 2.34 times faster than the centralized and
intermediate message passing approaches, respectively. Finally, it has been
noted that this research can be further extended to streaming data for
applications such as intra-day stock market analysis, weather forecasting and
data generated by real-time applications; Spark can be used to perform
in-memory operations for real-time analytics.
List of Paper Publications
1. Students' Admission Prediction using GRBST with Distributed Data Mining, Communications on Applied Electronics (CAE), ISSN: 2394-4714, Foundation of Computer Science (FCS), New York, USA, Volume 2, No. 1, June 2015
2. A Proposed DDM Algorithm and Framework for EDM of Gujarat Technological University, International Conference on Advances in Engineering, organized by Saffrony Institute of Technology, 22nd-23rd January 2015
3. An Approach on Early Prediction of Students' Performance in University Examination of Engineering Students Using Data Mining, International Journal of Scientific Research and Management Studies (IJSRMS), ISSN: 2349-3371, Volume 1, Issue 5, pp. 156-161
4. Faculty Performance Evaluation Based on Prediction in Distributed Data Mining, 2015 IEEE ICETECH, Coimbatore
5. Prediction and Analysis of Student Performance using Distributed Data Mining, International Conference on Information, Knowledge & Research in Engineering, Management and Sciences (IC-IKR-EMS), 7th Dec 2014, KIT, Gujarat. IJETAETS, ISSN 0974-3588
6. Prediction and Analysis of Faculty Performance using Distributed Data Mining, International Conference on Information, Knowledge & Research in Engineering, Management and Sciences (IC-IKR-EMS), 7th Dec 2014, KIT, Gujarat. IJETAETS, ISSN 0974-3588
7. A Decision Support Application for Student Admission Process Based on Prediction in Distributed Data Mining, International Conference on Information, Knowledge & Research in Engineering, Management and Sciences (IC-IKR-EMS), 7th Dec 2014, KIT, Gujarat. IJETAETS, ISSN 0974-3588
8. A Dynamic and Scalable Evolutionary Data Mining for Distributed Environments, NCEVT-2013, PIET, Limda
9. An Approach of E-Governance with Distributed Data Mining for Student Performance Prediction, Springer International Conference ICICT, October 2015, Udaipur
10. Dynamic and Scalable Data Mining with an Incremental Decision Trees Merging Approach for Distributed Environment, Doctoral Conference 2016 (DocCon 2016), Udaipur, March 2016. (Under Publication)
List of Book/Book Chapter Publications
1. Book: "Data Mining Techniques for Distributed Database Environment", Dineshkumar B. Vaghela and Dr. Priyanka Sharma, ISBN 978-3-659-94945-6, Lambert Academic Publishing, 2016
2. Book Chapter: "Web Usage Mining Techniques and Applications across Industries", ed. Dr. A.V. Senthil, ISBN 978-1-522-50613-3, IGI Global international publication, 2016
Patents/Copyright (if any)
Title A Dynamic And Scalable Evolutionary Data Mining& KDD For Distributed Environment
Filed At Copyright Office, New DelhiDairy Num-ber
3460/2016-CO/SW
Applicants &Inventors
Dineshkumar Bhagwandas Vaghela and Dr.Priyanka Sharma
ApplicationStatus
Waiting
Objection Re-ceived
Not Yet
Web Link http://copyright.gov.in/frmStatusGenUser.aspx