Business Analytics Group Project
Posted on 06-Jul-2018
8/17/2019 Bussiness Analytics Group Project
1/19
INTRODUCTION:
The healthcare industry is one of the world's largest and fastest-growing industries, and it produces a huge amount of healthcare data. This data includes relevant information about patients, their treatments, and resource management. The information is rich and massive. Hidden relationships and trends in healthcare data can be discovered by applying data mining techniques, which have proven effective in healthcare research. In this project we aimed to analyze several data mining classification techniques using the WEKA machine learning tools on healthcare datasets. In this study, we test different data mining classification techniques on a diagnostic dataset for diabetes.
SIGNIFICANCE OF DATA MINING IN HEALTHCARE:
Generally, healthcare organizations across the world store healthcare data in electronic format. Healthcare data mainly contains information about patients as well as the parties involved in the healthcare industry. The volume of such data is growing very rapidly. Due to the continuously increasing size of electronic healthcare data, a degree of complexity exists in it; in other words, healthcare data has become very complex. Using traditional methods, it is very difficult to extract meaningful information from it. But due to advances in statistics, mathematics, and other disciplines, it is now possible to extract meaningful patterns from it. Data mining is beneficial in such a situation, where large collections of healthcare data are available.

Data mining mainly extracts meaningful patterns that were previously unknown. These patterns can then be integrated into knowledge, and with the help of this knowledge essential decisions become possible. Data mining provides a number of benefits. Some of them are as follows: it plays a very important role in the detection of fraud and abuse, provides better medical treatment at reasonable prices, enables detection of diseases at early stages, and supports intelligent healthcare decision support systems. Data mining techniques are very useful in the healthcare domain. They provide better medical services to patients and help healthcare organizations in various medical management decisions. Some of the services provided by data mining techniques in healthcare are: predicting the number of days of stay in a hospital, ranking hospitals, identifying more effective treatments, detecting fraudulent insurance claims by patients as well as by providers, predicting readmission of patients, identifying better treatment methods for a particular group of patients,
construction of effective drug recommendation systems, etc. For all these reasons, researchers are greatly influenced by the capabilities of data mining, and in the healthcare field researchers widely use data mining techniques. There are various techniques of data mining; some of them are classification, clustering, regression, etc. Every piece of medical information related to patients as well as to healthcare organizations is useful, and a powerful tool such as data mining plays a very important role in the healthcare industry. Recently, researchers have used data mining tools in distributed medical environments in order to provide better medical services to a large proportion of the population at very low cost, better customer relationship management, better management of healthcare resources, etc. Data mining provides meaningful information in the field of healthcare, which may then be useful for management to take decisions such as estimation of medical staff, decisions regarding health insurance policy, selection of treatments, disease prediction, etc. Researchers continue to deal with the issues and challenges of data mining in healthcare: effective data mining analysis is used to predict various diseases, and new data mining methodologies and frameworks have been proposed in order to improve results and the healthcare system.
DATA MINING CLASSIFICATION TECHNIQUES:
The healthcare industry is information rich yet knowledge poor. Therefore, data-driven statistical research has become a complement for healthcare research. With the use of computers equipped with automated tools, large volumes of healthcare data are being collected and made available to medical research groups. As a result, Knowledge Discovery in Databases (KDD), which includes data mining techniques, has become a popular research tool for healthcare researchers to identify and exploit patterns and relationships among large numbers of variables, and to predict the outcome of a disease using the historical cases stored within datasets. In this project, we applied various data mining classification techniques to healthcare data. Classification is one of the most popular methods of data mining in the healthcare sector. It divides data samples into target classes; the classification technique predicts the target class for each data point. With the help of the classification approach, a risk factor can be associated with patients by analyzing their patterns of disease. It is a supervised learning approach with known class categories. Binary and multiclass are the two methods of classification. In binary classification, only two possible classes, such as a "high" or "low" risk
patient, may be considered, while the multiclass approach has more than two targets, for example "high", "medium", and "low" risk patients. The data set is partitioned into a training and a testing dataset. Classification consists of predicting a certain outcome based on a given input. The training set consists of a set of attributes together with the outcome to be predicted; from it, the algorithm attempts to discover the relationships between the attributes and the outcome, and the goal is prediction of that outcome. There is another set, known as the prediction set. It consists of the same set of attributes as the training set, but in the prediction set the prediction attribute is not yet known; the algorithm analyzes the input in order to produce the prediction. The term that defines how "good" the algorithm is, is its accuracy.
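The training/prediction split and the accuracy measure described above can be illustrated with a short sketch in plain Python (not WEKA; the toy records and the threshold "model" are invented for illustration):

```python
import random

# Toy records: each is (attributes, known class label).
data = [([1, 85], "low"), ([8, 183], "high"), ([1, 89], "low"),
        ([0, 137], "high"), ([5, 116], "low"), ([3, 78], "low"),
        ([10, 115], "high"), ([2, 197], "high")]

random.seed(0)
random.shuffle(data)

# Partition the data set into a training and a testing (prediction) set.
split = int(0.75 * len(data))
train, test = data[:split], data[split:]

# A trivial "model" learned from the training set: predict "high" when the
# second attribute exceeds the training-set mean, else "low".
mean = sum(x[1] for x, _ in train) / len(train)
predict = lambda x: "high" if x[1] > mean else "low"

# Accuracy: the fraction of test instances predicted correctly.
correct = sum(predict(x) == y for x, y in test)
accuracy = correct / len(test)
print(accuracy)
```

The real project replaces the toy threshold rule with the WEKA classifiers discussed below, but the split/evaluate shape stays the same.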
DATABASE AND TOOLS USED IN THE PROJECT:
We used the PIMA Indian Diabetes dataset, taken from the UCI Machine Learning Repository, in WEKA. The WEKA machine learning tools are used to handle the classification problems. This study will help researchers obtain better results from the data available within the dataset.
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. (Found only on the islands of New Zealand, the weka is a flightless bird with an inquisitive nature.) Weka is open source software issued under the GNU General Public License.
WEKA:
WEKA is a data mining system developed by the University of Waikato in New Zealand that implements data mining algorithms in the Java language. WEKA is a state-of-the-art facility for developing machine learning (ML) techniques and applying them to real-world data mining problems. It is a collection of machine learning algorithms for data mining tasks. The algorithms are applied directly to a dataset. WEKA implements algorithms for data preprocessing,
classification, regression, clustering, and association rules; it also includes visualization tools. New machine learning schemes or algorithms can also be developed with this package. WEKA is open source software issued under the General Public License. The data file normally used by Weka is in the ARFF file format, which consists of special tags to indicate different things in the data file (foremost: attribute names, attribute types, attribute values, and the data). The main interface in Weka is the Explorer. It has a set of panels, each of which can be used to perform a certain task.
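As an illustration of the tag structure just described, a minimal ARFF file for a diabetes-style task might look like the following sketch; the attribute names and the two data rows here are invented for illustration, not taken from the actual dataset file:

```
@relation diabetes

@attribute preg  numeric
@attribute plas  numeric
@attribute age   numeric
@attribute class {tested_negative, tested_positive}

@data
6, 148, 50, tested_positive
1, 85,  31, tested_negative
```

The `@attribute` lines declare the name and type of each column, and each line after `@data` is one instance.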
The Cluster panel gives access to the clustering techniques in Weka, e.g., the simple k-means algorithm. There is also an implementation of the expectation maximization algorithm for learning a mixture of normal distributions.

The Select attributes panel provides algorithms for identifying the most predictive attributes in a dataset.

The Visualize panel shows a scatter plot matrix, where individual scatter plots can be selected and enlarged, and analyzed further using various selection operators.
CLASSIFICATION ALGORITHMS USED IN THE PROJECT:
1) NAÏVE BAYES
Bayesian classification is used as a probabilistic learning method, and it is easily obtained with a classification algorithm; Bayes' theorem of statistics plays a very important role in it. In the medical domain, attributes such as patient symptoms and health state are correlated with each other, but the Naïve Bayes classifier assumes that all attributes are independent of each other. This is the major disadvantage of the Naïve Bayes classifier. When attributes are independent of each other, the Naïve Bayes classifier has shown great performance in terms of accuracy. In the healthcare field these classifiers play very important roles; hence, researchers across the world have used them, and there are various advantages of BBNs. Naïve Bayes is a simple probabilistic classifier. It is based on the assumption of mutual independence of attributes: the algorithm works on the assumption that the variables provided to the classifier are independent. The probabilities applied in the Naïve Bayes algorithm are calculated using Bayes' rule [11]: the probability of hypothesis H can be calculated on the basis of the hypothesis H and the evidence E about the hypothesis, according to the following formula:

P(H | E) = P(E | H) P(H) / P(E)
The Naïve Bayes method works effectively in various real-world situations.
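As a concrete illustration of Bayes' rule combined with the independence assumption, here is a minimal categorical Naïve Bayes sketch in plain Python (not the WEKA implementation; the tiny symptom/diagnosis dataset is invented):

```python
from collections import Counter, defaultdict

# Invented toy data: (symptom attributes, diagnosis class).
rows = [(("fever", "cough"), "flu"), (("fever", "no_cough"), "flu"),
        (("no_fever", "cough"), "cold"), (("no_fever", "no_cough"), "cold"),
        (("fever", "cough"), "flu")]

classes = Counter(label for _, label in rows)
# counts[class][attribute index][value] = number of occurrences
counts = defaultdict(lambda: defaultdict(Counter))
for attrs, label in rows:
    for i, v in enumerate(attrs):
        counts[label][i][v] += 1

def classify(attrs):
    """Score each class by P(H) * prod_i P(E_i | H), with Laplace smoothing
    (+1 in the numerator, +2 in the denominator for two values per attribute),
    and return the most probable class."""
    scores = {}
    for c, n in classes.items():
        p = n / len(rows)                           # prior P(H)
        for i, v in enumerate(attrs):
            p *= (counts[c][i][v] + 1) / (n + 2)    # likelihood P(E_i | H)
        scores[c] = p
    return max(scores, key=scores.get)

print(classify(("fever", "cough")))   # → flu
```

Multiplying the per-attribute likelihoods is exactly the "all attributes are independent" assumption the text criticizes.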
2) ZERO R CLASSIFIER:
ZeroR is the simplest classification method; it relies on the target and ignores all predictors. The ZeroR classifier simply predicts the majority category (class). Although there is no predictive power in ZeroR, it is useful for determining a baseline performance as a benchmark for other classification methods.

In WEKA, ZeroR is a simple, trivial classifier, but it gives a lower bound on the performance on a given dataset which should be significantly improved by more complex classifiers. As such, it is a reasonable test of how well the class can be predicted without considering the other attributes, and it can be used as a lower bound on performance. Any learning algorithm in WEKA is derived from the abstract WEKA classifiers. Given below is the flow chart of the ZeroR algorithm.
3) ONE R CLASSIFIER:
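The OneR material in this section appears to have been figures. For completeness, here is a minimal plain-Python sketch of the standard OneR idea, choosing the single attribute whose one-attribute rule makes the fewest training errors (illustrative only, not the WEKA source; the toy rows are invented):

```python
from collections import Counter, defaultdict

# Toy rows: (attribute values, class). Attribute values are invented.
rows = [(("young", "no"), "neg"), (("young", "yes"), "pos"),
        (("old", "yes"), "pos"), (("old", "no"), "pos"),
        (("young", "no"), "neg")]

def one_r(rows, n_attrs):
    best = None
    for i in range(n_attrs):
        # For attribute i, map each observed value to its majority class.
        by_value = defaultdict(Counter)
        for attrs, label in rows:
            by_value[attrs[i]][label] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        # Count training errors made by this one-attribute rule.
        errors = sum(rule[attrs[i]] != label for attrs, label in rows)
        if best is None or errors < best[0]:
            best = (errors, i, rule)
    errors, i, rule = best
    return lambda attrs: rule[attrs[i]]

model = one_r(rows, 2)
print(model(("old", "no")))   # → pos
```

The resulting model is a single "if attribute value is v, predict class c" table, which is why OneR is often only slightly better than ZeroR yet far easier to interpret.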
4) J48 DECISION TREE (C4.5):
A decision tree partitions the input space of a data set into mutually exclusive regions, each of which is assigned a label, a value, or an action to characterize its data points. The decision tree mechanism is transparent, and we can easily follow the tree structure to see how the decision is made. A decision tree is a tree structure consisting of internal and external nodes connected by branches. An internal node is a decision-making unit that evaluates a decision function to determine which child node to visit next. The external node, on the other hand, has no child nodes and is associated with a label or value that characterizes the given data that leads to its being visited. However, many decision tree construction algorithms involve a two-step process. First, a very large decision tree is grown. Then, to reduce its size and avoid overfitting the data, the tree is pruned in the second step. The pruned decision tree that is used for classification purposes is called the classification tree. To build a decision tree, we need to calculate entropy and information gain.

Entropy: E(S) = - Σ p_i log2(p_i)

Information Gain: the information gain depends on the decrease in entropy after a dataset is split on a selected attribute. Constructing a decision tree means finding the attribute that possesses the highest information gain value:

Gain(T, X) = Entropy(T) - Entropy(T, X)

Algorithm: Generate a decision tree from the training tuples of data partition D.

Input: Data partition D, which is a set of training tuples and their associated class labels; attribute list, the set of candidate attributes; attribute selection method, a procedure to determine the splitting criterion that "best" partitions the data tuples into individual classes. This criterion consists of a splitting attribute and, possibly, either a split point or a splitting subset.
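The entropy and information-gain formulas above can be checked numerically with a short plain-Python sketch (the tiny attribute/label columns are invented for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """E(S) = -sum_i p_i * log2(p_i) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(attr_column, labels):
    """Gain(T, X) = Entropy(T) - weighted entropy after splitting on X."""
    n = len(labels)
    split_entropy = 0.0
    for v in set(attr_column):
        subset = [y for x, y in zip(attr_column, labels) if x == v]
        split_entropy += len(subset) / n * entropy(subset)
    return entropy(labels) - split_entropy

labels = ["pos", "pos", "neg", "neg"]
attr = ["a", "a", "b", "b"]          # perfectly separates the classes
print(entropy(labels))               # → 1.0
print(info_gain(attr, labels))       # → 1.0
```

A 50/50 class split has entropy 1.0, and an attribute that separates the classes perfectly recovers all of it as gain, so a tree builder such as C4.5/J48 would split on it first.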
Figure: Decision Tree Algorithm
LOGISTIC REGRESSION
Logistic regression is a probabilistic, statistical classifier used to predict the outcome of a categorical dependent variable based on one or more predictor variables. The algorithm measures the relationship between a dependent variable and one or more independent variables.
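That relationship is modeled through the logistic (sigmoid) function; a minimal plain-Python sketch with made-up coefficients, rather than coefficients actually fitted to the diabetes data:

```python
from math import exp

def sigmoid(z):
    """Map a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def predict_proba(x, weights, bias):
    """P(class = 1 | x) = sigmoid(w . x + b)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

# Hypothetical coefficients for two scaled predictors (e.g. glucose, BMI).
weights, bias = [0.9, 0.4], -1.2
p = predict_proba([1.5, 0.8], weights, bias)
print(p)                      # probability of the positive class
print(1 if p >= 0.5 else 0)   # thresholded class prediction
```

In a real run, the weights and bias would be fitted by maximum likelihood; the sketch only shows how a fitted model turns predictor values into a class probability.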
ADABOOSTING (IN WEKA)
Boosting is an ensemble method that starts out with a base classifier that is prepared on the training data. A second classifier is then created behind it to focus on the instances in the training data that the first classifier got wrong. The process continues to add classifiers until a limit is reached in the number of models or accuracy.

Boosting is provided in Weka in the AdaBoostM1 (adaptive boosting) algorithm.

1. Click "Add new…" in the "Algorithms" section.
2. Click the "Choose" button.
3. Click "AdaBoostM1" under the "meta" selection.
4. Click the "Choose" button for the "classifier" and select "J48" under the "tree" section, then click the "choose" button.
5. Click the "OK" button on the "AdaBoostM1" configuration.
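The boosting loop described above (reweight the instances the previous classifier got wrong, then train the next one) can be sketched in plain Python using one-level decision stumps as the base classifier. This is a generic AdaBoost illustration on invented 1-D toy data, not WEKA's AdaBoostM1 itself:

```python
from math import exp, log

# Toy 1-D data: (x, label in {-1, +1}).
data = [(1, 1), (2, 1), (3, -1), (4, -1), (5, 1)]

def best_stump(weights):
    """Pick the threshold/direction stump with the lowest weighted error."""
    best = None
    for thr in [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]:
        for sign in (1, -1):
            pred = lambda x, t=thr, s=sign: s if x < t else -s
            err = sum(w for (x, y), w in zip(data, weights) if pred(x) != y)
            if best is None or err < best[0]:
                best = (err, pred)
    return best

weights = [1 / len(data)] * len(data)   # start with uniform weights
ensemble = []                           # list of (alpha, stump)
for _ in range(3):
    err, stump = best_stump(weights)
    err = max(err, 1e-10)
    alpha = 0.5 * log((1 - err) / err)  # vote strength of this stump
    ensemble.append((alpha, stump))
    # Reweight: increase weight on instances this stump got wrong.
    weights = [w * exp(-alpha * y * stump(x)) for (x, y), w in zip(data, weights)]
    total = sum(weights)
    weights = [w / total for w in weights]

def predict(x):
    """Weighted vote of all stumps."""
    return 1 if sum(a * s(x) for a, s in ensemble) > 0 else -1

print([predict(x) for x, _ in data])    # → [1, 1, -1, -1, 1]
```

No single stump can classify this toy set correctly, but the weighted vote of three stumps can, which is the point of the reweighting step.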
PROBLEM: PREDICT THE ONSET OF DIABETES IN PIMA
Data mining and machine learning are helping medical professionals make diagnosis easier by bridging the gap between huge data sets and human knowledge. We can begin to apply machine learning techniques for classification in a dataset that describes a population that is at high risk of the onset of diabetes.
Diabetes mellitus affects 62 million people in the world, and the number of people with type-2 diabetes is increasing in every country. Untreated, diabetes can cause many complications.
The population for this study was the Pima Indian population near Phoenix, Arizona. The population has been under continuous study since 1965 by the National Institute of Diabetes and Digestive and Kidney Diseases because of its high incidence rate of diabetes.

For the purposes of this dataset, diabetes was diagnosed according to World Health Organization criteria.
We can start analyzing data and experimenting with algorithms that will help us study the onset of diabetes in Pima Indians.
We took the data from the UCI repository, which contains records of female patients aged 21 years or older of PIMA Indian heritage.
1. Title: Pima Indians Diabetes Database

2. Sources:
   (a) Original owners: National Institute of Diabetes and Digestive and
       Kidney Diseases
   (b) Donor of database: Vincent Sigillito (vgs@aplcen.apl.jhu.edu)
       Research Center, RMI Group Leader
       Applied Physics Laboratory
       The Johns Hopkins University
       Johns Hopkins Road
       Laurel, MD 20707
       (301) 953-6231
   (c) Date received: 9 May 1990

3. Past Usage:
   1. Smith, J.W., ...
7. For Each Attribute: (all numeric-valued)
   1. Number of times pregnant
   2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
   3. Diastolic blood pressure (mm Hg)
   4. Triceps skin fold thickness (mm)
   5. 2-Hour serum insulin (mu U/ml)
   6. Body mass index (weight in kg/(height in m)^2)
   7. Diabetes pedigree function
   8. Age (years)
   9. Class variable (0 or 1)

8. Missing Attribute Values: None

9. Class Distribution: (class value 1 is interpreted as "tested positive for diabetes")

   Class Value    Number of instances
   0              500
   1              268

10. Brief statistical analysis:

    Attribute number:    Mean:    Standard Deviation:
    1.                   3.8      3.4
    2.                   120.9    32.0
    3.                   69.1     19.4
    4.                   20.5     16.0
    5.                   79.8     115.2
    6.                   32.0     7.9
    7.                   0.5      0.3
    8.                   33.2     11.8

Relabeled values in attribute 'class'
   From: 0 To: tested_negative
   From: 1 To: tested_positive
A particularly interesting attribute used in the study was the Diabetes Pedigree Function,
pedi. It provided some data on diabetes mellitus history in relatives and the genetic
relationship of those relatives to the patient. This measure of genetic influence gave us an
idea of the hereditary risk one might have with the onset of diabetes mellitus. Based on
observations in the preceding section, it is unclear how well this function predicts the onset
of diabetes.
Initially, we preprocessed the data and made some observations.
From the above figure of histograms we understood that some of the attributes are normally distributed (plasma, skin, mass, blood pressure) and some are exponentially distributed (pregnancy, insulin, pedigree, age). As we know age normally follows a normal distribution, so the skewed distribution suggests there is some problem with the dataset.

From the scatter chart we observed that:
• Interestingly, the variable PEDIGREE does not have any relationship with diabetes.
• Interestingly, larger values of plasma (PGC) together with larger values of age, pedigree, BMI, insulin, blood pressure, and pregnancy were found in positive tests.
SCATTER CHART
Evaluation
After performing cross-validation on the dataset, I will focus on analyzing the algorithms
through the lens of three metrics: accuracy, ROC area, and F1 measure.
Based on testing, accuracy will determine the percentage of instances that were correctly
classified by the algorithm. This is an important start of our analysis since it will give us a
baseline of how each algorithm performs.
The ROC curve is created by plotting the fraction of true positives vs. the fraction of false
positives. An optimal classifier will have an ROC area value approaching 1.0, with 0.5 being
comparable to random guessing. I believe it will be very interesting to see how our
algorithms predict on this scale.
Finally, the F1 measure will be an important statistical analysis of classification since it will
measure test accuracy. The F1 measure uses precision (the number of true positives divided by the number of true positives and false positives) and recall (the true positives divided by the
number of true positives and the number of false negatives) to output a value between 0
and 1, where higher values imply better performance.
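These metrics can be computed directly from the confusion-matrix counts; a plain-Python sketch with a made-up label/prediction pair (ROC area is omitted, since it needs ranked scores rather than hard labels):

```python
def metrics(actual, predicted, positive=1):
    """Accuracy, precision, recall, and F1 from two label vectors."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
    precision = tp / (tp + fp)   # TP / (TP + FP)
    recall = tp / (tp + fn)      # TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Invented example vectors, not results from the diabetes experiments.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
print(metrics(actual, predicted))   # → (0.75, 0.75, 0.75, 0.75)
```

F1 is the harmonic mean of precision and recall, so it only rewards a classifier that keeps both false positives and false negatives low.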
-
8/17/2019 Bussiness Analytics Group Project
19/19
I strongly believe that all algorithms will perform rather similarly because we are dealing with
a small dataset for classification. However, the 4 algorithms should all perform better than
the class baseline prediction that gave an accuracy of about 65.2%.