International Journal of All Research Education and Scientific Methods (IJARESM), ISSN: 2455-6211

Volume 8, Issue 6, June-2020, Impact Factor: 4.597, Available online at: www.ijaresm.com

IJARESM Publication, India >>>> www.ijaresm.com Page 23

Analysis of Classification Algorithm on Air Quality Dataset using Weka Tool

Krishnendu KS1, Remya Anand2

1,2 MCA Department, Mahatma Gandhi University, India

-----------------------------------------------------*****************-----------------------------------------------------

ABSTRACT

Pollution is today one of the biggest threats to the whole ecosystem, especially to living beings. Air pollution is the conversion of pure air into impure air through the combination of effluents, poisonous gases, particulate matter (PM2.5, PM10), etc. This has resulted in adverse effects on the environment, respiratory disorders, heart diseases, effects on the reproductive system, depletion of the ozone layer, global warming, an increase in mortality rates, etc. Hence the need of the hour is to add smartness to the detection and analysis of pollution in our surroundings, so that an individual can take precautionary measures to avoid the adverse effects of pollution on health. The main aim of this paper is to analyze the proportion of pollutants in the air based on the city, location and time. This is estimated using the WEKA tool to analyze pollution datasets collected from the pollution control board.

Keywords: Air Quality, WEKA, JRip, J48, LMT, Datasets, Classification.

I. INTRODUCTION

To monitor and protect our environment from the adverse effects of pollution, effective prediction and analysis of pollution are required. Several factors contribute to the increase in pollution, such as vehicular emissions, construction and demolition sites, factories and mining activities, to name a few.

II. LITERATURE SURVEY

Mohamed Shakir et al. [1]: The main aim of this paper is to analyze the proportion of pollutants in the air based on the time of day and the day of the week. It also estimates the effect of environmental parameters such as temperature, wind speed, and humidity on pollutants such as NO, NO2, CO, PM10, and SO2.

Jeenaelsa George et al. [2]: The pollutants considered in this paper are aerosol and asbestos sheets. The sources of asbestos are building roofs, which are mainly found in populated areas, while the source of aerosol is the combustion of coal. Because each object has a different temperature, the urban objects are classified using a land surface temperature map obtained from the TIR bands of Landsat 8 data. The presence of asbestos sheets is detected by the change in image intensity with reference to Band 7 and Band 9. Aerosol is made up of components that cause pollution.

Swati Vitkar et al. [3]: In this paper a framework is proposed that uses data mining to study the existing pattern of pollution data and to predict its future pattern. The aim of the paper is to study the performance and accuracy of various prediction methods, such as Bagging, linear regression, REPTree, Random Forest, Additive Regression and the non-linear regression algorithm SMOreg, on pollution data of the Airoli, Navi Mumbai area.

Ahmed Boubrima et al. [4]: This paper focuses on an alternative or complementary approach, with a network of low-cost and autonomic wireless sensors, aiming at a finer spatiotemporal granularity of sensing. Generic deployment models in the literature are not adapted to the stochastic nature of pollution sensing. The main contribution is to design integer linear programming models that compute sensor deployments capturing both the coverage of pollution under time-varying weather and the connectivity of the infrastructure.




III. METHODOLOGY

Dataset: A data set is a collection of data. Most commonly a data set corresponds to the contents of a single database table, or a single statistical data matrix, where every column of the table represents a particular variable and each row corresponds to a given member of the data set in question. There are various datasets, such as banking and biological datasets, on which clustering can be performed. In this work a pollution dataset is taken for analysis. It has 8 attributes: State, City, Location, Type, Monitoring days, RSPM annual average, Percentage exceeded, and Class.

Classification: Classification is often carried out with classification trees, also called decision trees. The decision tree that is created is a tree where each node represents a point where a decision must be made based on the input data. Such a point is also referred to as a node split point. One can move from the root node to the next node and the next until a leaf node is reached, which gives the predicted output. Classification is a data mining technique that determines the output of a new data instance. In this paper an experimental study is conducted on various classification algorithms and the best algorithm is identified for pollution analysis. Let us examine the test options. There are four testing options, as listed below:

Training set

Supplied test set

Cross-validation

Percentage split

Unless you have your own training set or a client-supplied test set, you would use the cross-validation or percentage split options. Under cross-validation, you can set the number of folds into which the entire data is split and used during each iteration of training. In the percentage split, you split the data between training and testing using the set split percentage.
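The two most commonly used test options can be sketched in a few lines of Python. This is only an illustration of the idea, not Weka's own implementation (which may differ in details such as stratification); the function names, the seed and the dataset size of 100 instances are assumptions made for the example.

```python
import random
from typing import List, Tuple


def percentage_split(n_instances: int, train_percent: float,
                     seed: int = 1) -> Tuple[List[int], List[int]]:
    """Shuffle the instance indices and cut them into training and test parts."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    cut = round(n_instances * train_percent / 100)
    return indices[:cut], indices[cut:]


def k_fold_splits(n_instances: int, folds: int,
                  seed: int = 1) -> List[Tuple[List[int], List[int]]]:
    """Return one (train, test) pair per fold; every instance is tested exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    splits = []
    for k in range(folds):
        test = indices[k::folds]          # every folds-th instance goes to fold k
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        splits.append((train, test))
    return splits


# Hypothetical dataset size, chosen only for illustration.
train, test = percentage_split(n_instances=100, train_percent=66)
print("percentage split:", len(train), "train /", len(test), "test")

for fold, (train, test) in enumerate(k_fold_splits(n_instances=100, folds=4), start=1):
    print("fold", fold, ":", len(train), "train /", len(test), "test")
```

With cross-validation the classifier is trained and evaluated once per fold and the per-fold results are combined, so every instance contributes to the test result exactly once.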

Confusion matrix: A confusion matrix is a technique for summarizing the performance of a classification algorithm. Classification accuracy alone can be misleading if you have an unequal number of observations in each class or if you have more than two classes in your dataset. Calculating a confusion matrix can give you a better idea of what your classification model is getting right and what types of errors it is making.

Equation: The confusion matrix is used to avoid being misled by a single accuracy figure. Every result has a confusion matrix of its own. For a two-class problem it is a 2×2 matrix in which each entry has its own meaning.

The predictions of the different algorithms used are presented in tabular format. The performance measures are Precision, Recall, Error rate, F-measure and Accuracy.

               Predicted yes          Predicted no
Observed yes   True Positive (TP)     False Negative (FN)
Observed no    False Positive (FP)    True Negative (TN)

TP: Both the observed and the predicted case are yes.

FP: The predicted case is yes but the observed case is no.

FN: The predicted case is no but the observed case is yes.

TN: Both the predicted and the observed case are no.

The following measures in the table are calculated from the confusion matrix. Every measure has its own formula.

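Precision: When the prediction is yes, how often is it correct?

Precision = TP/(TP+FP)

Recall: When the observed case is yes, how often is yes predicted?

Recall = TP/(TP+FN)

Error rate: The fraction of all cases that are misclassified.

Error rate = (FP+FN)/(TP+FP+TN+FN)

F-measure: Weighted (harmonic) average of precision and recall.

F-measure = 2 × Precision × Recall/(Precision + Recall)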
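As a worked example, the measures can be computed directly from the four cells of a two-class confusion matrix. The sketch below is illustrative only: the function name and the counts are hypothetical and are not results from this study. Accuracy is taken as (TP+TN)/(TP+FP+TN+FN), the complement of the error rate, and the F-measure as the harmonic mean of precision and recall.

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the measures from a 2x2 confusion matrix.

    Assumes tp + fp > 0 and tp + fn > 0, so that precision and recall are defined.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "error_rate": (fp + fn) / total,
        "accuracy": (tp + tn) / total,
        "f_measure": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts, used only to illustrate the formulas.
for name, value in confusion_metrics(tp=40, fp=10, fn=5, tn=45).items():
    print(f"{name}: {value:.4f}")
```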




IV. RESEARCH FINDINGS

A. WEKA TOOL

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes. Weka is a very popular free data mining tool that has advanced text mining features. It contains a collection of visualization tools and algorithms for data analysis and predictive modelling, together with graphical user interfaces for easy access to these functions. The original non-Java version of Weka was a Tcl/Tk front-end to modelling algorithms implemented in other programming languages, plus data pre-processing utilities in C, and a Makefile-based system for running machine learning experiments. This original version was primarily designed as a tool for analyzing data from agricultural domains, but the more recent fully Java-based version, for which development started in 1997, is now used in many different application areas, in particular for educational purposes and research.

Figure 1: The WEKA Tool GUI

B. ALGORITHMS USED:

a) JRip Algorithm

JRip (RIPPER) is one of the basic and most popular algorithms. Classes are examined in increasing size and an initial set of rules for each class is generated using incremental reduced error pruning. The JRip class implements a propositional rule learner, Repeated Incremental Pruning to Produce Error Reduction (RIPPER), which was proposed by William W. Cohen as an optimized version of IREP. It is based on association rules with reduced error pruning (REP), a very common and effective technique found in decision tree algorithms. In REP for rule algorithms, the training data is split into a growing set and a pruning set. First, an initial rule set is formed that overfits the growing set, using some heuristic method. This overlarge rule set is then repeatedly simplified by applying one of a set of pruning operators; typical pruning operators would be to delete any single condition or any single rule. At each stage of simplification, the pruning operator chosen is the one that yields the greatest reduction of error on the pruning set. Simplification ends when applying any pruning operator would increase the error on the pruning set.
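The pruning loop described above can be sketched as follows. This is a minimal illustration of reduced error pruning only, not Weka's JRip implementation and not the full RIPPER algorithm. The rule representation (a rule is a list of hypothetical `(attribute, threshold)` conditions that fire when the attribute value exceeds the threshold), the helper names, and the tiny pruning set are all assumptions made for the example.

```python
from typing import Dict, Iterator, List, Tuple

Condition = Tuple[str, float]            # fires when row[attribute] > threshold
Rule = List[Condition]                   # all conditions must fire (AND)
RuleSet = List[Rule]                     # any rule may fire (OR) -> predict positive
Example = Tuple[Dict[str, float], bool]  # (attribute values, observed class)


def predict(rules: RuleSet, row: Dict[str, float]) -> bool:
    return any(all(row[attr] > thr for attr, thr in rule) for rule in rules)


def error(rules: RuleSet, data: List[Example]) -> float:
    return sum(predict(rules, row) != label for row, label in data) / len(data)


def pruning_candidates(rules: RuleSet) -> Iterator[RuleSet]:
    """Yield every rule set reachable by one pruning operator."""
    for i, rule in enumerate(rules):
        yield rules[:i] + rules[i + 1:]                  # delete a whole rule
        if len(rule) > 1:
            for j in range(len(rule)):                   # delete a single condition
                yield rules[:i] + [rule[:j] + rule[j + 1:]] + rules[i + 1:]


def reduced_error_prune(rules: RuleSet, prune_set: List[Example]) -> RuleSet:
    current, current_err = rules, error(rules, prune_set)
    while current:
        best = min(pruning_candidates(current), key=lambda c: error(c, prune_set))
        best_err = error(best, prune_set)
        if best_err > current_err:       # every operator would increase the error
            break
        current, current_err = best, best_err
    return current


# Hypothetical overfitted rule set and pruning set (not data from this study).
overfitted = [[("rspm", 60.0), ("days", 100.0)], [("days", 300.0)]]
prune_set = [
    ({"rspm": 80.0, "days": 50.0}, True),
    ({"rspm": 90.0, "days": 120.0}, True),
    ({"rspm": 40.0, "days": 320.0}, False),
    ({"rspm": 30.0, "days": 90.0}, False),
]
pruned = reduced_error_prune(overfitted, prune_set)
print(pruned, error(pruned, prune_set))
```

In this toy case the loop first drops the unnecessary condition on `days`, then drops the spurious second rule, and stops because deleting the remaining rule would increase the error on the pruning set.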




Figure 2: JRip value with cross validation 4.







b) J48 Algorithm

J48 is an open source Java implementation of the C4.5 algorithm in the Weka data mining tool. The C4.5 algorithm for building decision trees is implemented in Weka as a classifier called J48. Classifiers, like filters, are organized in a hierarchy: J48 has the full name weka.classifiers.trees.J48. The classifier is shown in the text box next to the Choose button, which reads J48 -C 0.25 -M 2. This text gives the default parameter settings for this classifier. C4.5 has several parameters, but the default view (when you invoke the classifier) shows only -C, i.e. the confidence value (default 25%), where lower values incur heavier pruning, and -M, i.e. the minimum number of instances in the two most popular branches (default 2).

Figure 6: J48 value with cross validation 4.




c) LMT Algorithm

LMT is a classification model with an associated supervised training algorithm that combines logistic regression and decision tree learning. Logistic model trees are based on the earlier idea of a model tree: a decision tree that has linear regression models at its leaves to provide a piecewise linear regression model (where ordinary decision trees with constants at their leaves would produce a piecewise constant model).
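The difference between the three kinds of leaves can be illustrated with a toy tree that has a single split. The sketch below is not a trained model and not Weka's LMT. The split point and all coefficients are hypothetical values chosen only to show that constant leaves give a piecewise constant output, linear leaves a piecewise linear output, and logistic leaves a class probability.

```python
import math

SPLIT = 60.0  # hypothetical split point on a single numeric attribute x


def regression_tree(x: float) -> float:
    """Ordinary tree: a constant at each leaf, so the output is piecewise constant."""
    return 10.0 if x < SPLIT else 40.0


def model_tree(x: float) -> float:
    """Model tree: a linear model at each leaf, so the output is piecewise linear."""
    return 0.2 * x if x < SPLIT else 0.8 * x - 20.0


def logistic_model_tree(x: float) -> float:
    """Logistic model tree: a logistic regression model at each leaf,
    returning the probability of the positive class."""
    z = 0.05 * x - 4.0 if x < SPLIT else 0.10 * x - 5.0
    return 1.0 / (1.0 + math.exp(-z))


for x in (10.0, 50.0, 70.0, 100.0):
    print(x, regression_tree(x), model_tree(x), round(logistic_model_tree(x), 3))
```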

Figure 10: LMT value with cross validation 4.







Figure 14: Visualizing Dataset Class.




C. RESULT:

The predictions of the different algorithms used are presented in tabular format. The accuracy of each algorithm is reported for each of the cross-validation fold values used in the process.

Algorithm   4 folds    5 folds    8 folds    10 folds   Result (average accuracy, %)
JRip        95.3353    95.3353    95.3353    95.9184    95.481075
J48         80.758     94.1691    88.0466    95.6268    89.650125
LMT         85.1312    86.5889    85.7143    87.172     86.1516
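The Result column is the arithmetic mean of the four per-fold accuracies, which can be checked with a short script. The accuracy figures below are copied from the table; the variable names are chosen for illustration.

```python
# Accuracy (%) per cross-validation fold setting, copied from the table above.
accuracy_by_folds = {
    "JRip": {4: 95.3353, 5: 95.3353, 8: 95.3353, 10: 95.9184},
    "J48": {4: 80.758, 5: 94.1691, 8: 88.0466, 10: 95.6268},
    "LMT": {4: 85.1312, 5: 86.5889, 8: 85.7143, 10: 87.172},
}

# Average each algorithm's accuracy over the four fold settings.
averages = {
    name: sum(acc.values()) / len(acc)
    for name, acc in accuracy_by_folds.items()
}

for name, avg in averages.items():
    print(f"{name}: {avg:.6f}")

best = max(averages, key=averages.get)
print("highest average accuracy:", best)
```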

CONCLUSION

In this paper we have presented a comparative analysis of three different types of classification algorithms with the help of the WEKA data mining tool. In the experimental study we used the air quality dataset, compared the results of the individual algorithms, and chose the best algorithm on the basis of accuracy and the time required for model evaluation. From the results it is observed that the average accuracy of JRip is 95.481%, of J48 89.650% and of LMT 86.152%. By analyzing the algorithms we came to the conclusion that, on the air quality dataset in Weka, the algorithm that provides the highest accuracy is JRip.

ACKNOWLEDGMENT

We thank our relatives, friends and our mentors from the MCA Department of Sree Narayana Guru Institute of Science and Technology, affiliated to Mahatma Gandhi University, who were a constant source of inspiration. Their guidance and expertise greatly assisted the research.

REFERENCES

[1] Mohamed Shakir, N. Rakesh, "Investigation on Air Pollutant Data Sets using Data Mining Tool," Proceedings of the Second International Conference on I-SMAC (I-SMAC 2018), IEEE.

[2] Varun Noorani Subramanian, "Data analysis for predicting air pollutant concentration in Smart city Uppsala," March 2016.

[3] Swati Vitkar, "Comparative Analysis of Various Data Mining Prediction Algorithms, Demonstrated using Air Pollution Data of Navi Mumbai," Research Journal of Chemical and Environmental Sciences, February 2017.

[4] Jitendra Agrawal, "Analysis of Clustering Algorithm of Weka Tool on Air Pollution Dataset," International Journal of Computer Applications, June 2017.
