Business Analytics Group Project



INTRODUCTION:

The health care industry is one of the world's largest and fastest growing industries, and it generates a huge amount of healthcare data. This data includes relevant information about patients, their treatments and resource management. The information is rich and massive. Hidden relationships and trends in healthcare data can be discovered by applying data mining techniques, which have proven effective in healthcare research. In this project we aim to analyse several data mining classification techniques using the WEKA machine learning tools on healthcare datasets. In this study, different data mining classification techniques are tested on a diagnostic dataset for diabetes.

SIGNIFICANCE OF DATA MINING IN HEALTHCARE:

Generally, healthcare organizations across the world store healthcare data in electronic format. Healthcare data mainly contains information regarding patients as well as the parties involved in the healthcare industry, and the volume of such data is increasing very rapidly. Because the size of electronic healthcare data grows continuously, the data becomes very complex, and with traditional methods it is very difficult to extract meaningful information from it. Advances in statistics, mathematics and other disciplines, however, now make it possible to extract meaningful patterns. Data mining is beneficial in exactly this situation, where large collections of healthcare data are available.

Data mining extracts meaningful patterns that were previously unknown. These patterns can then be integrated into existing knowledge, and with the help of this knowledge essential decisions become possible. Data mining provides a number of benefits: it plays a very important role in the detection of fraud and abuse, provides better medical treatments at a reasonable price, detects diseases at early stages, and supports intelligent healthcare decision support systems. Data mining techniques are very useful in the healthcare domain. They provide better medical services to patients and help healthcare organizations with various medical management decisions. Some of the services provided by data mining techniques in healthcare are: predicting the number of days of stay in a hospital, ranking of hospitals, more effective treatments, detection of fraudulent insurance claims by patients as well as by providers, predicting readmission of patients, identifying better treatment methods for a particular group of patients,


and construction of effective drug recommendation systems. For all these reasons researchers are greatly influenced by the capabilities of data mining, and in the healthcare field data mining techniques are widely used. There are various data mining techniques, among them classification, clustering and regression. Every piece of medical information related to patients as well as to healthcare organizations is useful, and such a powerful tool as data mining therefore plays a very important role in the healthcare industry. Recently, researchers have used data mining tools in distributed medical environments in order to provide better medical services to a large proportion of the population at very low cost, better customer relationship management, better management of healthcare resources, and so on. Data mining provides meaningful information in the field of healthcare which can then be used by management to take decisions such as estimating medical staff requirements, deciding on health insurance policy, selecting treatments and predicting disease. Earlier work has dealt with the issues and challenges of data mining in healthcare, used data mining analysis to predict various diseases, and proposed new data mining methodologies and frameworks to improve results and the healthcare system.

DATA MINING CLASSIFICATION TECHNIQUES:

The healthcare industry is information rich yet knowledge poor; data-driven statistical research has therefore become a complement for healthcare research. With computers and automated tools, large volumes of healthcare data are being collected and made available to medical research groups. As a result, Knowledge Discovery in Databases (KDD), which includes data mining techniques, has become a popular research tool that lets healthcare researchers identify and exploit patterns and relationships among a large number of variables, and predict the outcome of a disease using the historical cases stored within datasets. In this project, we applied several data mining classification techniques to healthcare data. Classification is one of the most popular data mining methods in the healthcare sector. It divides data samples into target classes and predicts the target class for each data point. With the help of the classification approach, a risk factor can be associated with patients by analysing their patterns of disease. Classification is a supervised learning approach with known class categories. Binary and multiclass are the two kinds of classification. In binary classification, only two possible classes, such as a "high" or "low" risk


patient, may be considered, while the multiclass approach has more than two targets, for example "high", "medium" and "low" risk patients. The data set is partitioned into a training and a testing dataset. Classification consists of predicting a certain outcome based on a given input. The training set consists of a set of attributes together with the known outcome; from it, the algorithm attempts to discover the relationships between the attributes and the outcome (the goal or prediction). There is also a prediction (test) set, which consists of the same set of attributes as the training set but in which the prediction attribute is not yet known; the algorithm analyses this input in order to produce its predictions. The term that defines how "good" the algorithm is, is its accuracy.
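As a minimal sketch of this train/test idea using Weka's Java API (the file name diabetes.arff, the 70/30 split and the use of J48 are illustrative assumptions; any Weka classifier could stand in):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainTestSplitSketch {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file and mark the last attribute as the class to predict.
        Instances data = DataSource.read("diabetes.arff");   // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        // Shuffle, then hold out 30% of the instances as the prediction (test) set.
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.7);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

        // Learn from the training set only.
        J48 classifier = new J48();
        classifier.buildClassifier(train);

        // Accuracy = percentage of test instances whose class was predicted correctly.
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(classifier, test);
        System.out.println("Accuracy on the held-out set: " + eval.pctCorrect() + " %");
    }
}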

DATABASE AND TOOLS USED IN THE PROJECT:

We have used the PIMA Indian Diabetes dataset taken from the UCI Machine Learning Repository in WEKA. The WEKA machine learning tools are used to handle the classification problems. This study will help researchers obtain better results from the data available within the datasets.

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes. Found only on the islands of New Zealand, the weka is a flightless bird with an inquisitive nature. Weka is open source software issued under the GNU General Public License.

WEKA:

WEKA is a data mining system developed by the University of Waikato in New Zealand that implements data mining algorithms in the JAVA language. WEKA is a state-of-the-art facility for developing machine learning (ML) techniques and applying them to real-world data mining problems. It is a collection of machine learning algorithms for data mining tasks in which the algorithms are applied directly to a dataset. WEKA implements algorithms for data preprocessing,


classification, regression, clustering and association rules; it also includes visualization tools. New machine learning schemes or algorithms can also be developed with this package. WEKA is open source software issued under the General Public License. The data file normally used by Weka is in the ARFF file format, which consists of special tags to indicate different things in the data file (foremost: attribute names, attribute types, attribute values and the data). The main interface in Weka is the Explorer. It has a set of panels, each of which can be used to perform a certain task.
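For illustration, a heavily shortened ARFF file in the style of the Pima data (attribute names abbreviated and only three of the predictors shown; the real file has nine attributes) could look like this:

@relation pima_diabetes_sample
@attribute preg numeric
@attribute plas numeric
@attribute pres numeric
@attribute class {tested_negative, tested_positive}
@data
6,148,72,tested_positive
1,85,66,tested_negative

The @relation tag names the dataset, each @attribute tag declares a column and its type (numeric here, or a list of nominal values for the class), and the lines after @data hold the instances.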


The Cluster panel gives access to the clustering techniques in Weka, e.g., the simple k-means algorithm. There is also an implementation of the expectation maximization algorithm for learning a mixture of normal distributions.

The Select attributes panel provides algorithms for identifying the most predictive attributes in a dataset.

The Visualize panel shows a scatter plot matrix, where individual scatter plots can be selected and enlarged, and analyzed further using various selection operators.
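The same kind of clustering can also be run from code rather than from the panel; a rough sketch (the file name and the choice of two clusters are illustrative assumptions) is:

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ClusterPanelSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff");   // assumed file name

        // Clustering is unsupervised, so drop the class attribute first.
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        Instances unlabeled = Filter.useFilter(data, remove);

        // Simple k-means with an illustrative k = 2.
        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(2);
        kmeans.buildClusterer(unlabeled);
        System.out.println(kmeans);   // prints the cluster centroids and sizes
    }
}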

CLASSIFICATION ALGORITHMS USED IN THE PROJECT:

1) NAÏVE BAYES

Bayesian classification is a probabilistic learning method that can easily be obtained with a classification algorithm; Bayes' theorem of statistics plays a very important role in it. In the medical domain, attributes such as patient symptoms and health state are correlated with each other, but the Naïve Bayes classifier assumes that all attributes are independent of each other. This is the major disadvantage of the Naïve Bayes classifier. When the attributes are in fact independent of each other, the Naïve Bayesian classifier has shown great performance in terms of accuracy. These classifiers play very important roles in the healthcare field, researchers across the world have used them, and Bayesian belief networks (BBN) offer various further advantages. Naïve Bayes is a simple probabilistic classifier based on the assumption of mutual independence of the attributes: the algorithm works on the assumption that the variables provided to the classifier are independent. The probabilities applied in the Naïve Bayes algorithm are calculated using Bayes' rule [11]: the probability of hypothesis H can be calculated on the basis of the hypothesis H and the evidence about the hypothesis, E, according to the following formula.
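In its standard form, Bayes' rule gives the probability of the hypothesis H given the evidence E as

P(H | E) = P(E | H) * P(H) / P(E)

and, under the Naïve Bayes independence assumption, the joint evidence E1, ..., En factorizes as P(E1, ..., En | H) = P(E1 | H) * ... * P(En | H), so the classifier simply multiplies the per-attribute likelihoods.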


The Naïve Bayes method works effectively in various real-world situations.
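As a minimal sketch of running Naïve Bayes from Weka's Java API and inspecting the class posteriors it produces (the file name is an illustrative assumption):

import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff");   // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);

        // Posterior P(class | attributes) for the first instance, one entry per class value.
        double[] posterior = nb.distributionForInstance(data.instance(0));
        for (int i = 0; i < posterior.length; i++) {
            System.out.println(data.classAttribute().value(i) + ": " + posterior[i]);
        }
    }
}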

2) ZERO R CLASSIFIER:

ZeroR is the simplest classification method: it relies only on the target and ignores all predictors. The ZeroR classifier simply predicts the majority category (class). Although ZeroR has no predictive power, it is useful for determining a baseline performance as a benchmark for the other classification methods. In WEKA, ZeroR is a simple, trivial classifier, but it gives a lower bound on the performance on a given dataset, which more complex classifiers should improve on significantly. As such it is a reasonable test of how well the class can be predicted without considering the other attributes, and it can be used as a lower bound on performance. Any learning algorithm in WEKA is derived from the abstract WEKA classifiers. Given below is the flow chart of the ZeroR algorithm.
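As a minimal sketch of using ZeroR as a baseline from Weka's Java API (the file name is an illustrative assumption); because ZeroR always predicts the majority class, the reported accuracy is simply the proportion of the most frequent class:

import weka.classifiers.Evaluation;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ZeroRBaselineSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff");   // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        ZeroR baseline = new ZeroR();          // ignores every predictor
        baseline.buildClassifier(data);

        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(baseline, data);    // on the same data this yields majority-class accuracy
        System.out.println("Baseline accuracy: " + eval.pctCorrect() + " %");
    }
}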


    3) ONE R CLASSIFIER:


4) J48 DECISION TREE (C4.5):

A decision tree partitions the input space of a data set into mutually exclusive regions, each of which is assigned a label, a value or an action to characterize its data points. The decision tree mechanism is transparent and we can follow the tree structure easily to see how the decision is made. A decision tree is a tree structure consisting of internal and external nodes connected by branches. An internal node is a decision-making unit that evaluates a decision function to determine which child node to visit next. The external node, on the other hand, has no child nodes and is associated with a label or value that characterizes the given data that leads to its being visited. Many decision tree construction algorithms involve a two-step process. First, a very large decision tree is grown. Then, to reduce its size and avoid overfitting the data, the tree is pruned in the second step. The pruned decision tree that is used for classification purposes is called the classification tree. To build a decision tree, we need to calculate entropy and information gain:

E(S) = sum over i of -p_i * log2(p_i)

Information gain: the information gain depends on the decrease in entropy after a dataset is split on a selected attribute. Constructing a decision tree means finding the attribute that possesses the highest information gain value:

Gain(T, X) = Entropy(T) - Entropy(T, X)

Algorithm: Generate decision tree. Generate a decision tree from the training tuples of data partition D.

Input: Data partition D, which is a set of training tuples and their associated class labels; attribute list, the set of candidate attributes; attribute selection method, a procedure to determine the splitting criterion that "best" partitions the data tuples into individual classes. This criterion consists of a splitting attribute and, possibly, either a split point or a splitting subset.
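As a small worked example (the numbers are illustrative, not taken from the Pima data): suppose a partition S holds 14 training tuples, 9 of class "yes" and 5 of class "no". Then

E(S) = -(9/14) * log2(9/14) - (5/14) * log2(5/14) ≈ 0.940

If splitting S on some attribute X produces subsets whose weighted average entropy is, say, 0.694, then Gain(S, X) = 0.940 - 0.694 = 0.246, and the tree-growing step prefers X over any attribute with a smaller gain.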


Figure: Decision tree algorithm.

LOGISTIC REGRESSION:

Logistic regression is a probabilistic, statistical classifier used to predict the outcome of a categorical dependent variable based on one or more predictor variables. The algorithm measures the relationship between a dependent variable and one or more independent variables.
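In its usual binary form, logistic regression models the probability of the positive class as a logistic (sigmoid) function of a linear combination of the predictors,

P(y = 1 | x) = 1 / (1 + e^-(b0 + b1*x1 + ... + bk*xk)),

with the coefficients b0, ..., bk fitted to the training data by maximum likelihood. In Weka this model is available as the Logistic classifier.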

ADABOOSTING (IN WEKA):

Boosting is an ensemble method that starts out with a base classifier that is prepared on the training data. A second classifier is then created behind it to focus on the instances in the training data that the first classifier got wrong. The process continues to add classifiers until a limit is reached in the number of models or in accuracy.

Boosting is provided in Weka in the AdaBoostM1 (adaptive boosting) algorithm. It can be configured through the interface as follows (a programmatic sketch is given after these steps):

1. Click "Add new…" in the "Algorithms" section.
2. Click the "Choose" button.
3. Click "AdaBoostM1" under the "meta" selection.



4. Click the "Choose" button for the "classifier", select "J48" under the "tree" section, and click the "Choose" button.
5. Click the "OK" button on the "AdaBoostM1" configuration.
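The same configuration can be set up programmatically; a rough sketch (the file name, the number of boosting iterations and the use of 10-fold cross-validation are illustrative assumptions) is:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BoostingSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff");   // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        // AdaBoostM1 with J48 as the base classifier, mirroring the steps above.
        AdaBoostM1 boost = new AdaBoostM1();
        boost.setClassifier(new J48());
        boost.setNumIterations(10);            // illustrative limit on the number of models

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(boost, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}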

    "ROBLE7: "REDICT TE ONSET OF DIABETES IN "I7A

    Data mining and machine learning is helping medical professionals ma)e diagnosis easier by

     bridging the gap between huge data sets and human )nowledge. e can begin to apply machine

    learning techniques for classification in a dataset that describes a population that is under a high

    ris) of the onset of diabetes.

    Diabetes (ellitus affects GC million people in the world, and the number of people with type6C

    diabetes is increasing in every country. 3ntreated, diabetes can cause many complications.


The population for this study was the Pima Indian population near Phoenix, Arizona. The population has been under continuous study since 1965 by the National Institute of Diabetes and Digestive and Kidney Diseases because of its high incidence rate of diabetes.

For the purposes of this dataset, diabetes was diagnosed according to World Health Organization criteria.


We can start analyzing the data and experimenting with algorithms that will help us study the onset of diabetes in Pima Indians.

We took the data from the UCI repository; it contains female patients of PIMA Indian heritage aged at least 21 years.

1. Title: Pima Indians Diabetes Database

2. Sources:
   (a) Original owners: National Institute of Diabetes and Digestive and Kidney Diseases
   (b) Donor of database: Vincent Sigillito (vgs@aplcen.apl.jhu.edu)
       Research Center, RMI Group Leader
       Applied Physics Laboratory
       The Johns Hopkins University
       Johns Hopkins Road
       Laurel, MD 20707
       (301) 953-6231
   (c) Date received: 9 May 1990

3. Past Usage:
   1. Smith, J. W.,


7. For Each Attribute: (all numeric-valued)
   1. Number of times pregnant
   2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
   3. Diastolic blood pressure (mm Hg)
   4. Triceps skin fold thickness (mm)
   5. 2-Hour serum insulin (mu U/ml)
   6. Body mass index (weight in kg / (height in m)^2)
   7. Diabetes pedigree function
   8. Age (years)
   9. Class variable (0 or 1)

8. Missing Attribute Values: None

9. Class Distribution: (class value 1 is interpreted as "tested positive for diabetes")

   Class Value    Number of instances
   0              500
   1              268

10. Brief statistical analysis:

   Attribute number:    Mean:    Standard Deviation:
   1.                     3.8      3.4
   2.                   120.9     32.0
   3.                    69.1     19.4
   4.                    20.5     16.0
   5.                    79.8    115.2
   6.                    32.0      7.9
   7.                     0.5      0.3
   8.                    33.2     11.8

Relabeled values in attribute 'class':
   From: 0  To: tested_negative
   From: 1  To: tested_positive


    A particularly interesting attribute used in the study was the Diabetes Pedigree Function,

    pedi. It provided some data on diabetes mellitus history in relatives and the genetic

    relationship of those relatives to the patient. This measure of genetic influence gave us an

    idea of the hereditary risk one might have with the onset of diabetes mellitus. Based on

    observations in the following section, it is unclear how well this function predicts the onset

    of diabetes.

Initially we did preprocessing and observed a few things.


From the histograms above we understood that some of the attributes are roughly normally distributed (plasma, skin, mass, blood pressure) and some exponentially distributed (pregnancy, insulin, pedigree, age). Since age usually follows a normal distribution, its skewed distribution here suggests there may be some problem with the dataset.

From the scatter chart we observed that:

• The interesting variable PEDIGREE does not have any clear relationship with diabetes.
• Interestingly, larger values of plasma glucose (PGC) together with larger values of age, pedigree, BMI, insulin, blood pressure and pregnancy tended to test positive.


SCATTER CHART


EVALUATION:

After performing cross-validation on the dataset, I will focus on analyzing the algorithms through the lens of three metrics: accuracy, ROC area, and F1 measure.

    Based on testing, accuracy will determine the percentage of instances that were correctly

    classified by the algorithm. This is an important start of our analysis since it will give us a

    baseline of how each algorithm performs.

The ROC curve is created by plotting the fraction of true positives vs. the fraction of false positives. An optimal classifier will have an ROC area value approaching 1.0, with 0.5 being comparable to random guessing. I believe it will be very interesting to see how our algorithms predict on this scale.

Finally, the F1 measure will be an important statistical analysis of classification since it measures test accuracy. The F1 measure uses precision (the number of true positives divided by the number of true positives and false positives) and recall (the number of true positives divided by the number of true positives and false negatives) to output a value between 0 and 1, where higher values imply better performance.
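In symbols: precision = TP / (TP + FP), recall = TP / (TP + FN) and F1 = 2 * precision * recall / (precision + recall). A minimal sketch of reading all three metrics from Weka after 10-fold cross-validation (the Logistic classifier, the file name and the index of the positive class are illustrative assumptions):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MetricsSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff");   // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new Logistic(), data, 10, new Random(1));

        int positive = 1;   // assumed index of the "tested_positive" class value
        System.out.println("Accuracy: " + eval.pctCorrect() + " %");
        System.out.println("ROC area: " + eval.areaUnderROC(positive));
        System.out.println("F1      : " + eval.fMeasure(positive));
    }
}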



    I strongly believe that all algorithms will perform rather similarly because we are dealing with

    a small dataset for classification. However, the 4 algorithms should all perform better than

    the class baseline prediction that gave an accuracy of about 65.2%.
