Business Analytics Group Project



INTRODUCTION:

The health care industry is one of the world's largest and fastest growing industries, and it generates a huge amount of healthcare data. This data includes relevant information about patients, their treatments and resource management. The information is rich and massive. Hidden relationships and trends in healthcare data can be discovered by applying data mining techniques, which have proven effective in healthcare research. In this project we aim to analyse several data mining classification techniques using the WEKA machine learning tools on healthcare datasets. In this study, different data mining classification techniques are tested on a diagnostic dataset for diabetes.

SIGNIFICANCE OF DATA MINING IN HEALTHCARE:

Generally, healthcare organizations across the world store healthcare data in electronic format. Healthcare data mainly contains information regarding patients as well as the parties involved in the healthcare industry, and the volume of such data is increasing very rapidly. Because the size of electronic healthcare data grows continuously, the data becomes very complex, and with traditional methods it is very difficult to extract meaningful information from it. Advances in statistics, mathematics and other disciplines, however, now make it possible to extract meaningful patterns. Data mining is beneficial in exactly this situation, where large collections of healthcare data are available.

Data mining extracts meaningful patterns that were previously unknown. These patterns can then be integrated into existing knowledge, and with the help of this knowledge essential decisions become possible. Data mining provides a number of benefits: it plays a very important role in the detection of fraud and abuse, provides better medical treatments at a reasonable price, detects diseases at early stages, and supports intelligent healthcare decision support systems. Data mining techniques are very useful in the healthcare domain. They provide better medical services to patients and help healthcare organizations with various medical management decisions. Some of the services provided by data mining techniques in healthcare are: predicting the number of days of stay in a hospital, ranking of hospitals, more effective treatments, detection of fraudulent insurance claims by patients as well as by providers, predicting readmission of patients, identifying better treatment methods for a particular group of patients,


and construction of effective drug recommendation systems. For all these reasons researchers are greatly influenced by the capabilities of data mining, and in the healthcare field data mining techniques are widely used. There are various data mining techniques, among them classification, clustering and regression. Every piece of medical information related to patients as well as to healthcare organizations is useful, and such a powerful tool as data mining therefore plays a very important role in the healthcare industry. Recently, researchers have used data mining tools in distributed medical environments in order to provide better medical services to a large proportion of the population at very low cost, better customer relationship management, better management of healthcare resources, and so on. Data mining provides meaningful information in the field of healthcare which can then be used by management to take decisions such as estimating medical staff requirements, deciding on health insurance policy, selecting treatments and predicting disease. Earlier work has dealt with the issues and challenges of data mining in healthcare, used data mining analysis to predict various diseases, and proposed new data mining methodologies and frameworks to improve results and the healthcare system.

DATA MINING CLASSIFICATION TECHNIQUES:

The healthcare industry is information rich yet knowledge poor; data-driven statistical research has therefore become a complement for healthcare research. With computers and automated tools, large volumes of healthcare data are being collected and made available to medical research groups. As a result, Knowledge Discovery in Databases (KDD), which includes data mining techniques, has become a popular research tool that lets healthcare researchers identify and exploit patterns and relationships among a large number of variables, and predict the outcome of a disease using the historical cases stored within datasets. In this project, we applied several data mining classification techniques to healthcare data. Classification is one of the most popular data mining methods in the healthcare sector. It divides data samples into target classes and predicts the target class for each data point. With the help of the classification approach, a risk factor can be associated with patients by analysing their patterns of disease. Classification is a supervised learning approach with known class categories. Binary and multiclass are the two kinds of classification. In binary classification, only two possible classes, such as a "high" or "low" risk


patient, may be considered, while the multiclass approach has more than two targets, for example "high", "medium" and "low" risk patients. The data set is partitioned into a training and a testing dataset. Classification consists of predicting a certain outcome based on a given input. The training set consists of a set of attributes together with the known outcome; from it, the algorithm attempts to discover the relationships between the attributes and the outcome (the goal or prediction). There is also a prediction (test) set, which consists of the same set of attributes as the training set but in which the prediction attribute is not yet known; the algorithm analyses this input in order to produce its predictions. The term that defines how "good" the algorithm is, is its accuracy.
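As a minimal sketch of this train/test idea using Weka's Java API (the file name diabetes.arff, the 70/30 split and the use of J48 are illustrative assumptions; any Weka classifier could stand in):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainTestSplitSketch {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file and mark the last attribute as the class to predict.
        Instances data = DataSource.read("diabetes.arff");   // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        // Shuffle, then hold out 30% of the instances as the prediction (test) set.
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.7);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

        // Learn from the training set only.
        J48 classifier = new J48();
        classifier.buildClassifier(train);

        // Accuracy = percentage of test instances whose class was predicted correctly.
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(classifier, test);
        System.out.println("Accuracy on the held-out set: " + eval.pctCorrect() + " %");
    }
}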

DATABASE AND TOOLS USED IN THE PROJECT:

We have used the PIMA Indian Diabetes dataset taken from the UCI Machine Learning Repository in WEKA. The WEKA machine learning tools are used to handle the classification problems. This study will help researchers obtain better results from the data available within the datasets.

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes. Found only on the islands of New Zealand, the weka is a flightless bird with an inquisitive nature. Weka is open source software issued under the GNU General Public License.

WEKA:

WEKA is a data mining system developed by the University of Waikato in New Zealand that implements data mining algorithms in the JAVA language. WEKA is a state-of-the-art facility for developing machine learning (ML) techniques and applying them to real-world data mining problems. It is a collection of machine learning algorithms for data mining tasks in which the algorithms are applied directly to a dataset. WEKA implements algorithms for data preprocessing,


classification, regression, clustering and association rules; it also includes visualization tools. New machine learning schemes or algorithms can also be developed with this package. WEKA is open source software issued under the General Public License. The data file normally used by Weka is in the ARFF file format, which consists of special tags to indicate different things in the data file (foremost: attribute names, attribute types, attribute values and the data). The main interface in Weka is the Explorer. It has a set of panels, each of which can be used to perform a certain task.
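For illustration, a heavily shortened ARFF file in the style of the Pima data (attribute names abbreviated and only three of the predictors shown; the real file has nine attributes) could look like this:

@relation pima_diabetes_sample
@attribute preg numeric
@attribute plas numeric
@attribute pres numeric
@attribute class {tested_negative, tested_positive}
@data
6,148,72,tested_positive
1,85,66,tested_negative

The @relation tag names the dataset, each @attribute tag declares a column and its type (numeric here, or a list of nominal values for the class), and the lines after @data hold the instances.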


The Cluster panel gives access to the clustering techniques in Weka, e.g., the simple k-means algorithm. There is also an implementation of the expectation maximization algorithm for learning a mixture of normal distributions.

The Select attributes panel provides algorithms for identifying the most predictive attributes in a dataset.

The Visualize panel shows a scatter plot matrix, where individual scatter plots can be selected and enlarged, and analyzed further using various selection operators.
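The same kind of clustering can also be run from code rather than from the panel; a rough sketch (the file name and the choice of two clusters are illustrative assumptions) is:

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ClusterPanelSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff");   // assumed file name

        // Clustering is unsupervised, so drop the class attribute first.
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        Instances unlabeled = Filter.useFilter(data, remove);

        // Simple k-means with an illustrative k = 2.
        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(2);
        kmeans.buildClusterer(unlabeled);
        System.out.println(kmeans);   // prints the cluster centroids and sizes
    }
}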

CLASSIFICATION ALGORITHMS USED IN THE PROJECT:

1) NAÏVE BAYES

Bayesian classification is a probabilistic learning method that can easily be obtained with a classification algorithm; Bayes' theorem of statistics plays a very important role in it. In the medical domain, attributes such as patient symptoms and health state are correlated with each other, but the Naïve Bayes classifier assumes that all attributes are independent of each other. This is the major disadvantage of the Naïve Bayes classifier. When the attributes are in fact independent of each other, the Naïve Bayesian classifier has shown great performance in terms of accuracy. These classifiers play very important roles in the healthcare field, researchers across the world have used them, and Bayesian belief networks (BBN) offer various further advantages. Naïve Bayes is a simple probabilistic classifier based on the assumption of mutual independence of the attributes: the algorithm works on the assumption that the variables provided to the classifier are independent. The probabilities applied in the Naïve Bayes algorithm are calculated using Bayes' rule [11]: the probability of hypothesis H can be calculated on the basis of the hypothesis H and the evidence about the hypothesis, E, according to the following formula.
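In its standard form, Bayes' rule gives the probability of the hypothesis H given the evidence E as

P(H | E) = P(E | H) * P(H) / P(E)

and, under the Naïve Bayes independence assumption, the joint evidence E1, ..., En factorizes as P(E1, ..., En | H) = P(E1 | H) * ... * P(En | H), so the classifier simply multiplies the per-attribute likelihoods.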


The Naïve Bayes method works effectively in various real-world situations.
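As a minimal sketch of running Naïve Bayes from Weka's Java API and inspecting the class posteriors it produces (the file name is an illustrative assumption):

import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff");   // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);

        // Posterior P(class | attributes) for the first instance, one entry per class value.
        double[] posterior = nb.distributionForInstance(data.instance(0));
        for (int i = 0; i < posterior.length; i++) {
            System.out.println(data.classAttribute().value(i) + ": " + posterior[i]);
        }
    }
}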

2) ZERO R CLASSIFIER:

ZeroR is the simplest classification method: it relies only on the target and ignores all predictors. The ZeroR classifier simply predicts the majority category (class). Although ZeroR has no predictive power, it is useful for determining a baseline performance as a benchmark for the other classification methods. In WEKA, ZeroR is a simple, trivial classifier, but it gives a lower bound on the performance on a given dataset, which more complex classifiers should improve on significantly. As such it is a reasonable test of how well the class can be predicted without considering the other attributes, and it can be used as a lower bound on performance. Any learning algorithm in WEKA is derived from the abstract WEKA classifiers. Given below is the flow chart of the ZeroR algorithm.
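As a minimal sketch of using ZeroR as a baseline from Weka's Java API (the file name is an illustrative assumption); because ZeroR always predicts the majority class, the reported accuracy is simply the proportion of the most frequent class:

import weka.classifiers.Evaluation;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ZeroRBaselineSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff");   // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        ZeroR baseline = new ZeroR();          // ignores every predictor
        baseline.buildClassifier(data);

        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(baseline, data);    // on the same data this yields majority-class accuracy
        System.out.println("Baseline accuracy: " + eval.pctCorrect() + " %");
    }
}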


    3) ONE R CLASSIFIER:


4) J48 DECISION TREE (C4.5):

A decision tree partitions the input space of a data set into mutually exclusive regions, each of which is assigned a label, a value or an action to characterize its data points. The decision tree mechanism is transparent and we can follow the tree structure easily to see how the decision is made. A decision tree is a tree structure consisting of internal and external nodes connected by branches. An internal node is a decision-making unit that evaluates a decision function to determine which child node to visit next. The external node, on the other hand, has no child nodes and is associated with a label or value that characterizes the given data that leads to its being visited. Many decision tree construction algorithms involve a two-step process. First, a very large decision tree is grown. Then, to reduce its size and avoid overfitting the data, the tree is pruned in the second step. The pruned decision tree that is used for classification purposes is called the classification tree. To build a decision tree, we need to calculate entropy and information gain:

E(S) = sum over i of -p_i * log2(p_i)

Information gain: the information gain depends on the decrease in entropy after a dataset is split on a selected attribute. Constructing a decision tree means finding the attribute that possesses the highest information gain value:

Gain(T, X) = Entropy(T) - Entropy(T, X)

Algorithm: Generate decision tree. Generate a decision tree from the training tuples of data partition D.

Input: Data partition D, which is a set of training tuples and their associated class labels; attribute list, the set of candidate attributes; attribute selection method, a procedure to determine the splitting criterion that "best" partitions the data tuples into individual classes. This criterion consists of a splitting attribute and, possibly, either a split point or a splitting subset.
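As a small worked example (the numbers are illustrative, not taken from the Pima data): suppose a partition S holds 14 training tuples, 9 of class "yes" and 5 of class "no". Then

E(S) = -(9/14) * log2(9/14) - (5/14) * log2(5/14) ≈ 0.940

If splitting S on some attribute X produces subsets whose weighted average entropy is, say, 0.694, then Gain(S, X) = 0.940 - 0.694 = 0.246, and the tree-growing step prefers X over any attribute with a smaller gain.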


Figure: Decision tree algorithm.

LOGISTIC REGRESSION:

Logistic regression is a probabilistic, statistical classifier used to predict the outcome of a categorical dependent variable based on one or more predictor variables. The algorithm measures the relationship between a dependent variable and one or more independent variables.
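In its usual binary form, logistic regression models the probability of the positive class as a logistic (sigmoid) function of a linear combination of the predictors,

P(y = 1 | x) = 1 / (1 + e^-(b0 + b1*x1 + ... + bk*xk)),

with the coefficients b0, ..., bk fitted to the training data by maximum likelihood. In Weka this model is available as the Logistic classifier.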

ADABOOSTING (IN WEKA):

Boosting is an ensemble method that starts out with a base classifier that is prepared on the training data. A second classifier is then created behind it to focus on the instances in the training data that the first classifier got wrong. The process continues to add classifiers until a limit is reached in the number of models or in accuracy.

Boosting is provided in Weka in the AdaBoostM1 (adaptive boosting) algorithm. It can be configured through the interface as follows (a programmatic sketch is given after these steps):

1. Click "Add new…" in the "Algorithms" section.
2. Click the "Choose" button.
3. Click "AdaBoostM1" under the "meta" selection.



4. Click the "Choose" button for the "classifier", select "J48" under the "tree" section, and click the "Choose" button.
5. Click the "OK" button on the "AdaBoostM1" configuration.
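The same configuration can be set up programmatically; a rough sketch (the file name, the number of boosting iterations and the use of 10-fold cross-validation are illustrative assumptions) is:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BoostingSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff");   // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        // AdaBoostM1 with J48 as the base classifier, mirroring the steps above.
        AdaBoostM1 boost = new AdaBoostM1();
        boost.setClassifier(new J48());
        boost.setNumIterations(10);            // illustrative limit on the number of models

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(boost, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}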

    "ROBLE7: "REDICT TE ONSET OF DIABETES IN "I7A

    Data mining and machine learning is helping medical professionals ma)e diagnosis easier by

     bridging the gap between huge data sets and human )nowledge. e can begin to apply machine

    learning techniques for classification in a dataset that describes a population that is under a high

    ris) of the onset of diabetes.

    Diabetes (ellitus affects GC million people in the world, and the number of people with type6C

    diabetes is increasing in every country. 3ntreated, diabetes can cause many complications.


The population for this study was the Pima Indian population near Phoenix, Arizona. The population has been under continuous study since 1965 by the National Institute of Diabetes and Digestive and Kidney Diseases because of its high incidence rate of diabetes.

For the purposes of this dataset, diabetes was diagnosed according to World Health Organization criteria.


We can start analyzing the data and experimenting with algorithms that will help us study the onset of diabetes in Pima Indians.

We took the data from the UCI repository; it contains female patients of PIMA Indian heritage aged at least 21 years.

1. Title: Pima Indians Diabetes Database

2. Sources:
   (a) Original owners: National Institute of Diabetes and Digestive and Kidney Diseases
   (b) Donor of database: Vincent Sigillito (vgs@aplcen.apl.jhu.edu)
       Research Center, RMI Group Leader
       Applied Physics Laboratory
       The Johns Hopkins University
       Johns Hopkins Road
       Laurel, MD 20707
       (301) 953-6231
   (c) Date received: 9 May 1990

3. Past Usage:
   1. Smith, J. W.,


7. For Each Attribute: (all numeric-valued)
   1. Number of times pregnant
   2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
   3. Diastolic blood pressure (mm Hg)
   4. Triceps skin fold thickness (mm)
   5. 2-Hour serum insulin (mu U/ml)
   6. Body mass index (weight in kg / (height in m)^2)
   7. Diabetes pedigree function
   8. Age (years)
   9. Class variable (0 or 1)

8. Missing Attribute Values: None

9. Class Distribution: (class value 1 is interpreted as "tested positive for diabetes")

   Class Value    Number of instances
   0              500
   1              268

10. Brief statistical analysis:

   Attribute number:    Mean:    Standard Deviation:
   1.                     3.8      3.4
   2.                   120.9     32.0
   3.                    69.1     19.4
   4.                    20.5     16.0
   5.                    79.8    115.2
   6.                    32.0      7.9
   7.                     0.5      0.3
   8.                    33.2     11.8

Relabeled values in attribute 'class':
   From: 0  To: tested_negative
   From: 1  To: tested_positive


    A particularly interesting attribute used in the study was the Diabetes Pedigree Function,

    pedi. It provided some data on diabetes mellitus history in relatives and the genetic

    relationship of those relatives to the patient. This measure of genetic influence gave us an

    idea of the hereditary risk one might have with the onset of diabetes mellitus. Based on

    observations in the following section, it is unclear how well this function predicts the onset

    of diabetes.

Initially we did preprocessing and observed a few things.


From the histograms above we understood that some of the attributes are roughly normally distributed (plasma, skin, mass, blood pressure) and some exponentially distributed (pregnancy, insulin, pedigree, age). Since age usually follows a normal distribution, its skewed distribution here suggests there may be some problem with the dataset.

From the scatter chart we observed that:

• The interesting variable PEDIGREE does not have any clear relationship with diabetes.
• Interestingly, larger values of plasma glucose (PGC) together with larger values of age, pedigree, BMI, insulin, blood pressure and pregnancy tended to test positive.


SCATTER CHART


EVALUATION:

After performing cross-validation on the dataset, I will focus on analyzing the algorithms through the lens of three metrics: accuracy, ROC area, and F1 measure.

    Based on testing, accuracy will determine the percentage of instances that were correctly

    classified by the algorithm. This is an important start of our analysis since it will give us a

    baseline of how each algorithm performs.

The ROC curve is created by plotting the fraction of true positives vs. the fraction of false positives. An optimal classifier will have an ROC area value approaching 1.0, with 0.5 being comparable to random guessing. I believe it will be very interesting to see how our algorithms predict on this scale.

Finally, the F1 measure will be an important statistical analysis of classification since it measures test accuracy. The F1 measure uses precision (the number of true positives divided by the number of true positives and false positives) and recall (the number of true positives divided by the number of true positives and false negatives) to output a value between 0 and 1, where higher values imply better performance.
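In symbols: precision = TP / (TP + FP), recall = TP / (TP + FN) and F1 = 2 * precision * recall / (precision + recall). A minimal sketch of reading all three metrics from Weka after 10-fold cross-validation (the Logistic classifier, the file name and the index of the positive class are illustrative assumptions):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MetricsSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff");   // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new Logistic(), data, 10, new Random(1));

        int positive = 1;   // assumed index of the "tested_positive" class value
        System.out.println("Accuracy: " + eval.pctCorrect() + " %");
        System.out.println("ROC area: " + eval.areaUnderROC(positive));
        System.out.println("F1      : " + eval.fMeasure(positive));
    }
}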



    I strongly believe that all algorithms will perform rather similarly because we are dealing with

    a small dataset for classification. However, the 4 algorithms should all perform better than

    the class baseline prediction that gave an accuracy of about 65.2%.
