"Ego autem et domus mea serviemus Domino" ("But as for me and my house, we will serve the Lord")
92 Applied Predictive Modeling Techniques in R
Over ninety of the most important models used by successful data scientists, with step by step instructions on how to build them fast.
Dr. N.D. Lewis
Copyright © 2015 by N.D. Lewis
All rights reserved. No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the author, except in the case of brief quotations embodied in critical reviews and certain other noncommercial uses permitted by copyright law. For permission requests, contact the author at www.AusCov.com.
Disclaimer: Although the author and publisher have made every effort to ensure that the information in this book was correct at press time, the author and publisher do not assume, and hereby disclaim, any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause.
Ordering Information: Quantity sales. Special discounts are available on quantity purchases by corporations, associations, and others. For details, email info@NigelDLewis.com.
Image photography by Deanna Lewis
ISBN-13: 978-1517516796
ISBN-10: 151751679X
Dedicated to Angela, wife, friend, and mother extraordinaire.
Acknowledgments
A special thank you to:
My wife, Angela, for her patience and constant encouragement.
My daughter, Deanna, for taking hundreds of photographs for this book and my website.
And the readers of my earlier books, who contacted me with questions and suggestions.
About This Book
This jam-packed book takes you under the hood with step by step instructions using the popular and free R predictive analytics package. It provides numerous examples, illustrations, and exclusive use of real data to help you leverage the power of predictive analytics. A book for every data analyst, student, and applied researcher. Here is what it can do for you:
• BOOST PRODUCTIVITY: Bestselling author and data scientist Dr. N.D. Lewis will show you how to build predictive analytic models in less time than you ever imagined possible. Even if you're a busy professional or a student with little time. By spending as little as 10 minutes a day working through the dozens of real world examples, illustrations, practitioner tips, and notes, you'll be able to make giant leaps forward in your knowledge, strengthen your business performance, broaden your skill-set, and improve your understanding.

• SIMPLIFY ANALYSIS: You will discover over 90 easy to follow applied predictive analytic techniques that can instantly expand your modeling capability. Plus, you'll discover simple routines that serve as a checklist you repeat next time you need a specific model. Even better, you'll discover practitioner tips, work with real data, and receive suggestions that will speed up your progress. So even if you're completely stressed out by data, you'll still find in this book tips, suggestions, and helpful advice that will ease your journey through the data science maze.

• SAVE TIME: Imagine having at your fingertips easy access to the very best of predictive analytics. In this book, you'll learn fast, effective ways to build powerful models using R. It contains over 90 of the most successful models used for learning from data, with step by step instructions on how to build them easily and quickly.

• LEARN FASTER: 92 Applied Predictive Modeling Techniques in R offers a practical, results orientated approach that will boost your productivity, expand your knowledge, and create new and exciting opportunities for you to get the very best from your data. The book works because you eliminate the anxiety of trying to master every single mathematical detail. Instead, your goal at each step is to simply focus on a single routine, using real data, that only takes about 5 to 15 minutes to complete. Within this routine is a series of actions by which the predictive analytic model is constructed. All you have to do is follow the steps. They are your checklist for use and reuse.

• IMPROVE RESULTS: Want to improve your predictive analytic results, but don't have enough time? Right now there are a dozen ways to instantly improve your predictive models' performance. Odds are these techniques will only take a few minutes apiece to complete. The problem? You might feel like there's not enough time to learn how to do them all. The solution is in your hands. It uses R, which is free, open-source, and extremely powerful software.
In this rich, fascinating, surprisingly accessible guide, data scientist Dr. N.D. Lewis reveals how predictive analytics works, and how to deploy its power, using the free and widely available R predictive analytics package. The book serves practitioners and experts alike by covering real life case studies and the latest state-of-the-art techniques. Everything you need to get started is contained within this book. Here is some of what is included:
• Support Vector Machines
• Relevance Vector Machines
• Neural networks
• Random forests
• Random ferns
• Classical Boosting
• Model based boosting
• Decision trees
• Cluster Analysis
For people interested in statistics, machine learning, data analysis, data mining, and future hands-on practitioners seeking a career in the field, it sets a strong foundation, delivers the prerequisite knowledge, and whets your appetite for more. Buy the book today. Your next big breakthrough using predictive analytics is only a page away!
OTHER BOOKS YOU WILL ALSO ENJOY
Over 100 Statistical Tests at Your Fingertips!
100 Statistical Tests in R is designed to give you rapid access to one hundred of the most popular statistical tests.
It shows you, step by step, how to carry out these tests in the free and popular R statistical package.
The book was created for the applied researcher whose primary focus is on their subject matter, rather than mathematical lemmas or statistical theory.
Step by step examples of each test are clearly described, and can be typed directly into R as printed on the page.
To accelerate your research ideas, over three hundred applications of statistical tests across engineering, science, and the social sciences are discussed.
100 Statistical Tests in R - ORDER YOUR COPY TODAY!
They laughed as they gave me the data to analyze. But then they saw my charts.
Wish you had fresh ways to present data, explore relationships, visualize your data, and break free from mundane charts and diagrams?
Visualizing complex relationships with ease using R begins here.
In this book, you will find innovative ideas to unlock the relationships in your own data, and create killer visuals to help you transform your next presentation from good to great.
Visualizing Complex Data Using R - ORDER YOUR COPY TODAY!
Preface
In writing this text, my intention was to collect together in a single place practical predictive modeling techniques, ideas, and strategies that have been proven to work, but which are rarely taught in business schools, data science courses, or contained in any other single text.
On numerous occasions, researchers in a wide variety of subject areas have asked, "how can I quickly understand and build a particular predictive model?" The answer used to involve reading complex mathematical texts, and then programming complicated formulas in languages such as C, C++, and Java. With the rise of R, predictive analytics is now easier than ever. 92 Applied Predictive Modeling Techniques in R is designed to give you rapid access to over ninety of the most popular predictive analytic techniques. It shows you, step by step, how to build each model in the free and popular R statistical package.
The material you are about to read is based on my personal experience, articles I've written, hundreds of scholarly articles I've read over the years, experimentation (some successful, some failed), conversations I've had with data scientists in various fields, and feedback I've received from numerous presentations to people just like you.
This book came out of the desire to put predictive analytic tools in the hands of the practitioner. The material is therefore designed to be used by the applied data scientist whose primary focus is on delivering results, rather than mathematical lemmas or statistical theory. Examples of each technique are clearly described, and can be typed directly into R as printed on the page.
This book, in your hands, is an enlarged, revised, and updated collection of my previous works on the subject. I've condensed into this volume the best practical ideas available.
Data science is all about extracting meaningful structure from data. It is always a good idea for the data scientist to study how other users and researchers have used a technique in actual practice. This is primarily because practice often differs substantially from the classroom or theoretical textbooks. To this end, and to accelerate your progress, actual real world applications of the techniques are given at the start of each section.
These illustrative applications cover a vast range of disciplines, incorporating numerous diverse topics such as intelligent shoes, forecasting the stock market, signature authentication, oil sand pump prognostics, detecting deception in speech, electric fish localization, tropical forest carbon mapping, vehicle logo recognition, understanding rat talk, and many more. I have also provided detailed references to these applications, for further study, at the end of each section.
In keeping with the zeitgeist of R, copies of the vast majority of applied articles referenced in this text are available for free.
New users to R can use this book easily and without any prior knowledge. This is best achieved by typing in the examples as they are given, and reading the comments which follow. Copies of R, and free tutorial guides for beginners, can be downloaded at https://www.r-project.org/.
I have found, over and over, that a data scientist who has exposure to a broad range of modeling tools and applications will run circles around the narrowly focused genius who has only been exposed to the tools of their particular discipline.
The Greek philosopher Epicurus once said, "I write this not for the many, but for you; each of us is enough of an audience for the other." Although the ideas in this book reach out to thousands of individuals, I've tried to keep Epicurus's principle in mind: to have each page you read give meaning to just one person, YOU.
I invite you to put what you read in these pages into action. To help you do that, I've created "12 Resources to Supercharge Your Productivity in R"; it is yours for free. Simply go to http://www.auscov.com/tools.html and download it now. It's my gift to you. It shares with you 12 of the very best resources you can use to boost your productivity in R.
I've spoken to thousands of people over the past few years. I'd love to hear your experiences using the ideas in this book. Contact me with your stories, questions, and suggestions at Info@NigelDLewis.com.
Now it's your turn!
P.S. Don't forget to sign-up for your free copy of 12 Resources to Supercharge Your Productivity in R at http://www.auscov.com/tools.html.
How to Get the Most from this Book
There are at least three ways to use this book. First, you can dip into it as an efficient reference tool. Flip to the technique you need, and quickly see how to calculate it in R. For best results, type in the example given in the text, examine the results, and then adjust the example to your own data. Second, browse through the real world examples, illustrations, practitioner tips, and notes to stimulate your own research ideas. Third, by working through the numerous examples, you will strengthen your knowledge and understanding of both applied predictive modeling and R.
Each section begins with a brief description of the underlying modeling methodology, followed by a diverse array of real world applications. This is followed by a step by step guide, using real data, for each predictive analytic technique.
PRACTITIONER TIP
If you are using Windows, you can easily upgrade to the latest version of R using the installr package. Enter the following:
> install.packages("installr")
> installr::updateR()
If a package mentioned in the text is not installed on your machine, you can download it by typing install.packages("package_name"). For example, to download the ada package you would type in the R console:
> install.packages("ada")
Once a package is installed, you must call it. You do this by typing in the R console:
> require(ada)
The ada package is now ready for use. You only need to type this once, at the start of your R session.
PRACTITIONER TIP
You should only download packages from CRAN using encrypted HTTPS connections. This provides much higher assurance that the code you are downloading is from a legitimate CRAN mirror, rather than from another server posing as one. Whilst downloading a package from a HTTPS connection you may run into an error message, something like:
unable to access index for repository
https://cran.rstudio.com
This is particularly common on Windows. The internet2.dll has to be activated on versions before R-3.2.2. If you are using an older version of R, before downloading a new package enter the following:
> setInternet2(TRUE)
Functions in R often have multiple parameters. In the examples in this text, I focus primarily on the key parameters required for rapid model development. For information on additional parameters available in a function, type in the R console ?function_name. For example, to find out about additional parameters in the ada function, you would type:
?ada
Details of the function and additional parameters will appear in your default web browser. After fitting your model of interest, you are strongly encouraged to experiment with additional parameters. I have also included the set.seed method in the R code samples throughout this text, to assist you in reproducing the results exactly as they appear on the page. R is available for all the major operating systems. Due to the popularity of Windows, examples in this book use the Windows version of R.
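As a quick illustration of why set.seed matters, the following sketch (using an arbitrary seed value) shows that resetting the seed makes random draws, and hence fitted models, exactly reproducible:

```r
# Setting the seed before generating random numbers makes the
# sequence reproducible; the seed value 123 here is arbitrary.
set.seed(123)
a <- rnorm(3)    # three draws from a standard normal

set.seed(123)    # reset to the same seed
b <- rnorm(3)    # the identical three draws

identical(a, b)  # TRUE
```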
PRACTITIONER TIP
Can't remember what you typed two hours ago? Don't worry, neither can I! Provided you are logged into the same R session, you simply need to type:
> history(Inf)
It will return your entire history of entered commands for your current session.
You don't have to wait until you have read the entire book to incorporate the ideas into your own analysis. You can experience their marvelous potency for yourself almost immediately. You can go straight to the technique of interest, and immediately test, create, and exploit it in your own research and analysis.
PRACTITIONER TIP
On 32-bit Windows machines, R can only use up to 3 GB of RAM, regardless of how much you have installed. Use the following to check memory availability:
> memory.limit()
To remove all objects from memory:
> rm(list = ls())
Applying the ideas in this book will transform your data science practice. If you utilize even one tip or idea from each chapter, you will be far better prepared, not just to survive, but to excel when faced by the challenges and opportunities of the ever expanding deluge of exploitable data.
Now let's get started!
Part I
Decision Trees
The Basic Idea
We begin with decision trees because they are one of the most popular techniques in data mining1. They can be applied to both regression and classification problems. Part of the reason for their popularity lies in the ability to present results in a simple, easy to understand tree format. True to its name, the decision tree selects an outcome by descending a tree of possible decisions.
NOTE
Decision trees are, in general, a non-parametric inductive learning technique, able to produce classifiers for a given problem which can assess new, unseen situations and/or reveal the mechanisms driving a problem.
An Illustrative Example
It will be helpful to build intuition by first looking at a simple example of what this technique does with data. Imagine you are required to build an automatic rules based system to buy cars. The goal is to make decisions as new vehicles are presented to you. Let us say you have access to data on the attributes (also known as features): Road tested miles driven, Price of vehicle, Likability of the current owner (measured on a continuous scale from 0 to 100, where 100 = "love them"), Odometer miles, and Age of the vehicle in years.
A total of 100 measurements are obtained on each variable, and also on the decision (yes or no to purchase the vehicle). You run this data through a decision tree algorithm, and it produces the tree shown in Figure 1. Several things are worth pointing out about this decision tree. First, the number of observations falling in "yes" and "no" is reported. For example, in "Road tested miles < 100" we see there are 71 observations. Second, the tree is immediately able to model new data using the rules developed. Third, it did not use all of the variables to develop a decision rule (Likability of the current
owner and Price were excluded).
Let us suppose that this tree classified 80% of the observations correctly. It is still worth investigating whether a more parsimonious, and more accurate, tree can be obtained. One way to achieve this is to transform the variables. Let us suppose we create the additional variable Odometer/Age, and then rebuild the tree. The result is shown in Figure 2.
It turns out that this decision tree, which chooses only the transformed data, ignores all the other attributes. Let us assume this tree has a prediction accuracy of 90%. Two important points become evident. First, a more parsimonious and accurate tree was possible; and second, to obtain this tree it was necessary to include the variable transformations in the second run of the decision tree algorithm.
The example illustrates that the decision tree must have the variables supplied in the appropriate form to obtain the most parsimonious tree. In practice, domain experts will often advise on the appropriate transformation of attributes.
Figure 1: Car buying decision tree.
Figure 2: Decision tree obtained by transforming variables.
PRACTITIONER TIP
I once developed what I thought was a great statistical model for an area I knew little about. Guess what happened? It was a total flop. Why? Because I did not include domain experts in my design and analysis phase. If you are building a decision tree for knowledge discovery, it is important to include domain experts alongside you and throughout the entire analysis process. Their input will be required to assess the final decision tree and opine on "reasonability".
Another advantage of using domain experts is that the complexity of the final decision tree (in terms of the number of nodes, or the number of rules that can be extracted from a tree) may be reduced. Inclusion of domain experts almost always helps the data scientist create a more efficient set of rules.
Decision Trees in Practice
The basic idea behind a decision tree is to construct a tree whose leaves are labeled with a particular value for the class attribute, and whose inner nodes
represent descriptive attributes. At each internal node in the tree, a single attribute value is compared with a single threshold value. In other words, each node corresponds to a "measurement" of a particular attribute, that is, a question, often of the "yes" or "no" variety, which can be asked about that attribute's value (e.g. "is age less than 48 years?"). One of the two child nodes is then selected based on the result of the comparison, leading either to another measurement, or to a leaf.
When a leaf node is reached, the single class associated with that node is the final prediction. In other words, the terminal nodes carry the information required to classify the data.
A real world example of decision trees is shown in Figure 3; they were developed by Koch et al2 for understanding biological cellular signaling networks. Notice that four trees of increasing complexity are developed by the researchers.
Figure 3: Four decision trees developed by Koch et al. Note that decision trees (A), (B), (C), and (D) have a misclassification error of 30.85%, 24.68%, 21.17%, and 18.13% respectively. However, tree (A) is the easiest to interpret. Source: Koch et al.
The Six Advantages of Decision Trees
1. Decision trees can be easy-to-understand, with intuitively clear rules understandable to domain experts.
2. Decision trees offer the ability to track and evaluate every step in the decision-making process. This is because each path through a tree consists of a combination of attributes, which work together to distinguish
between classes. This simplicity gives useful insights into the inner workings of the method.
3. Decision trees can handle both nominal and numeric input attributes, and are capable of handling data sets that contain misclassified values.
4. Decision trees can easily be programmed for use in real time systems. A great illustration of this is the research of Hailemariam et al3, who use a decision tree to determine real time building occupancy.
5. They are relatively inexpensive computationally, and work well on both large and small data sets. Figure 4 illustrates an example of a very large decision tree used in Bioinformatics4; the smaller tree represents the tuned version, for greater readability.
6. Decision trees are considered to be a non-parametric method. This means that decision trees make no assumptions about the space distribution or the classifier structure.
Figure 4: An example of a large Bioinformatics decision tree, and the visually tuned version, from the research of Stiglic et al.
NOTE
One weakness of decision trees is the risk of over fitting. This occurs when statistically insignificant patterns end up influencing classification results. An over fitted tree will perform poorly on new data. Bohanec and Bratko5 studied the role of pruning a decision tree for better decision making. They found that pruning can reduce the risk of over fitting, because it results in smaller decision trees that exploit fewer attributes.
How Decision Trees Work
Exact implementation details differ somewhat depending on the algorithm used, but the general principles are very similar across methods, and contain
the following steps:
1. Provide a set (S) of examples with known classified states. This is called the learning set.
2. Select a set of test attributes a1, a2, ..., aN. These can be viewed as the parameters of the system, and are selected because they contain essential information about the problem of concern. In the car example above, the attributes were: a1 = Road tested miles driven, a2 = Price of vehicle, a3 = Likability of the current owner, a4 = Odometer miles, a5 = Age of the vehicle in years.
3. Starting at the top node of the tree (often called the root node), with the entire set of examples S, split S using a test on one or more attributes. The goal is to split S into subsets of increasing classification purity.
4. Check the results of the split. If every partition is pure, in the sense that all examples in the partition belong to the same class, then stop. Label each leaf node with the name of the class.
5. Recursively split any partitions that are not "pure".
6. The procedure is stopped when all the newly created nodes are "terminal" ones, containing "pure enough" learning subsets.
Decision tree algorithms vary primarily in how they choose to "split" the data, when to stop splitting, and how they prune the trees they produce.
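These choices can be seen directly in R. The sketch below uses the rpart package (and its bundled kyphosis data set) to grow a tree with custom stopping parameters, and then prune it back using the cross-validated complexity table; the particular parameter values are illustrative only, not recommendations:

```r
# Growing and pruning a classification tree with rpart
# (parameter values here are illustrative, not recommendations).
library(rpart)

set.seed(100)
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis, method = "class",
             control = rpart.control(minsplit = 10,  # stop: min cases to split a node
                                     cp = 0.001))    # stop: min complexity improvement

printcp(fit)  # cross-validated error for each candidate sub-tree size

# Prune back to the sub-tree with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
print(pruned)
```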
NOTE
Many of the decision tree algorithms you will encounter are based on a greedy, top-down, recursive partitioning strategy for tree growth. They use different variants of impurity measures, such as information gain6, gain ratio7, the Gini index8, and distance-based measures9.
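To make the idea of an impurity measure concrete, here is a small base-R sketch computing the Gini index and entropy (information) of a node from its class proportions; a pure node scores 0 under both measures:

```r
# Gini index: 1 - sum(p_k^2); 0 for a pure node, maximal when classes are even.
gini <- function(p) 1 - sum(p^2)

# Entropy (in bits): -sum(p_k * log2(p_k)), treating 0 * log(0) as 0.
entropy <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))

gini(c(1, 0))        # pure node: 0
gini(c(0.5, 0.5))    # evenly mixed two-class node: 0.5

entropy(c(1, 0))     # pure node: 0
entropy(c(0.5, 0.5)) # evenly mixed: 1 bit
```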
Practical Applications
Intelligent Shoes for Stroke Victims
Zhang et al10 develop a wearable shoe (SmartShoe) to monitor physical activity in stroke patients. The data set consisted of 12 patients who had experienced a stroke.
Supervised by a physical therapist, SmartShoe collected data from the patients for eight posture and activity groups: sitting, standing, walking, ascending stairs, descending stairs, cycling on a stationary bike, being pushed in a wheelchair, and propelling a wheelchair.
Patients performed each activity for between 1 and 3 minutes. Data was collected from the SmartShoe every 2 seconds, from which feature vectors were computed.
Half the feature vectors were selected at random for the training set. The remainder were used for validation. The C5.0 algorithm was used to build the decision tree.
The researchers constructed both subject specific and group decision trees. The group models were developed using leave-one-out cross-validation.
The performance results for five of the patients are shown in Table 1. As might be expected, the individual models fit better than the group models. For example, for patient 5 the accuracy of the patient specific tree was 98.4%; however, using the group tree, the accuracy declined to 75.5%. This difference in performance might be in part due to over fitting of the individual specific tree. The group models were trained using data from multiple subjects, and therefore can be expected to have lower overall performance scores.
Patient      1     2     3     4     5     Average
Individual   96.5  97.4  99.8  97.2  98.4  97.9
Group        87.5  91.1  64.7  82.2  75.5  80.2

Table 1: Zhang et al's decision tree performance metrics.
PRACTITIONER TIP
Decision trees are often validated by calculating sensitivity and specificity. Sensitivity is the ability of the classifier to identify positive results, while specificity is the ability to distinguish negative results:

Sensitivity = NTP / (NTP + NFN) × 100    (1)

Specificity = NTN / (NTN + NFP) × 100    (2)

NTP is the number of true positives, NTN the number of true negatives, NFN the number of false negatives, and NFP the number of false positives.
Micro Ribonucleic Acid
MicroRNAs (miRNAs) are non-protein coding ribonucleic acids (RNAs) that attenuate protein production in P bodies11. Williams et al12 develop a MicroRNA decision tree. For the training set, the researchers used known miRNAs from various plant species as positive controls, and non-miRNA sequences as negative controls.
The typical size of their training set consisted of 5294 cases using 29 attributes. The model was validated by calculating sensitivity and specificity based on leave-one-out cross-validation.
After training, the researchers focus on attribute usage information. Table 2 shows the top ten attribute usage for a typical training run. The researchers report that other training runs show similar usage. The values represent the percentage of sequences that required that attribute for classification. Several attributes, such as DuplexEnergy, minMatchPercent, and C content, are required for all sequences to be classified. Note that G and C are directly related to the stability of the duplex. Sensitivity and specificity were as high as 84.08% and 98.53% respectively.
An interesting question is, if all miRNAs in each taxonomic category studied by the researchers are systematically excluded from training, while including all others, how well does the predictor do when tested on the excluded category? Table 3 provides the answer. The ability to correctly identify known miRNAs ranged from 78% for the Salicaceae, to 100% for seven of the groups shown in Table 3. The researchers conclude by stating, "We have
Usage   Attribute
100%    G
100%    C
100%    T
100%    DuplexEnergy
100%    minMatchPercent
100%    DeltaGnorm
100%    G + T
100%    G + C
98%     duplexEnergyNorm
86%     NormEnergyRatio

Table 2: Top ten attribute usage for one training run of the classifier reported by Williams et al.
shown that a highly accurate universal plant miRNA predictor can be produced by machine learning using C5.0".
Taxonomic       Correctly    % of full set
Group           classified   excluded
Embryophyta     94%          9.16
Lycopodiophyta  100%         2.65
Brassicaceae    100%         20.22
Caricaceae      100%         0.05
Euphorbiaceae   100%         0.34
Fabaceae        100%         27.00
Salicaceae      78%          3.52
Solanaceae      93%          0.68
Vitaceae        94%          4.29
Rutaceae        100%         0.43
Panicoideae     95%          8.00
Poaceae         100%         19.48
Pooideae        80%          3.18

Table 3: Results from exclusion of each of the 13 taxonomic groups by Williams et al.
NOTE
A decision tree produced by an algorithm is usually not optimal in the sense of statistical performance measures such as the log-likelihood, squared errors, and so on. It turns out that finding the "optimal tree", if one exists, is computationally intractable (or NP-hard, technically speaking).
Acute Liver Failure
Nakayama et al13 use decision trees for the prognosis of acute liver failure (ALF) patients. The data set consisted of 1022 ALF patients seen between 1998 and 2007 (698 patients seen between 1998 and 2003, and 324 patients seen between 2004 and 2007).
Measurements on 73 medical attributes, at the onset of hepatic encephalopathy14 and 5 days later, were collected from 371 of the 698 patients seen between 1998 and 2003.
Two decision trees were built. The first was used to predict (using 5 attributes) the outcome of the patient at the onset of hepatic encephalopathy. The second decision tree was used to predict (using 7 attributes) the outcome at 5 days after the onset of grade II or more severe hepatic encephalopathy. The decision trees were validated using data from 160 of the 324 patients seen between 2004 and 2007. Decision tree performance is shown in Table 4.
                        Decision Tree I   Decision Tree II
                        Outcome at        Outcome at
                        the onset         5 days
Accuracy                79.0%             83.6%
(patients 1998-2003)
Accuracy                77.6%             82.6%
(patients 2004-2007)

Table 4: Nakayama et al's decision tree performance metrics.
NOTE
The performance of a decision tree is often measured in terms of three characteristics:

• Accuracy: the percentage of cases correctly classified.

• Sensitivity: the percentage of cases correctly classified as belonging to class A, among all observations known to belong to class A.

• Specificity: the percentage of cases correctly classified as belonging to class B, among all observations known to belong to class B.
Traffic Accidents
de Oña et al15 investigate the use of decision trees for analyzing road accidents on rural highways in the province of Granada, Spain. Regression-type generalized linear models, Logit models, and Probit models have been the techniques most commonly used to conduct such analyses16.
Three decision tree models are developed (CART, ID3, and C4.5), using data collected from 2003 to 2009. Nineteen independent variables, reported in Table 5, are used to build the decision trees.
Accident type       Age                  Atmospheric factors
Safety barriers     Cause                Day of week
Lane width          Lighting             Month
Number of injuries  Number of occupants  Paved shoulder
Pavement width      Pavement markings    Gender
Shoulder type       Sight distance       Time
Vehicle type

Table 5: Variables used from police accident reports by de Oña et al.
The accuracy results of their analysis are shown in Table 6. Overall, the decision trees showed modest improvement over chance.
CART    C4.5    ID3
55.87   54.16   52.72

Table 6: Accuracy results (percentage) reported by de Oña et al.
PRACTITIONER TIP
Even though de Oña et al tested three different decision tree algorithms (CART, ID3, and C4.5), their results led to a very modest improvement over chance. This will also happen to you on very many occasions. Rather than clinging on to a technique (because it is the latest technique, or the one you happen to be most familiar with), the professional data scientist seeks out and tests alternative methods. In this text you have over ninety of the best applied modeling techniques at your fingertips. If decision trees don't "cut it", try something else!
Electrical Power Losses
A non-technical loss (NTL) is defined by electrical power companies as any consumed electricity which is not billed. This could be because of measurement equipment failure or fraud. Traditionally, power utilities have monitored NTL by making in situ inspections of equipment, especially for those customers that have very high, or close to zero, levels of energy consumption. Monedero et al17 develop a CART decision tree to enhance the detection rate of NTL.
A sample of 38,575 customer accounts was collected over two years in Catalonia, Spain. For each customer, various indicators were measured.18
The best decision tree had a depth of 5, with the terminal node identifying customers with the highest likelihood of NTL. A total of 176 customers were identified by the model. This number was greater than the expected total of 85, and too many for the utility company to inspect in situ. The researchers therefore merged their results with a Bayesian network model, and thereby reduced the estimated number to 64.
This example illustrates the important point that the predictive model is just one aspect that goes into real world decisions. Oftentimes a model will be developed but deemed impracticable by the end user.
NOTE
Cross-validation refers to a technique used to allow for the training and testing of inductive models. Williams et al. used a leave-one-out cross-validation. Leave-one-out cross-validation involves taking out one observation from your sample and training the model with the rest. The predictor just trained is applied to the excluded observation. One of two possibilities will occur: the predictor is correct on the previously unseen control, or not. The removed observation is then returned, the next observation is removed, and training and testing are done again. This process is repeated for all observations. When this is completed, the results are used to calculate the accuracy of the model, often measured in terms of sensitivity and specificity.
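The procedure described above can be sketched in a few lines of R. This is an illustrative sketch only: the function name, the use of the rpart package and the iris data are our own choices here, not those of Williams et al.

```r
# Illustrative leave-one-out cross-validation for a classification tree.
# The helper name, the rpart package and the iris data are our own
# choices, not those used by Williams et al.
library(rpart)

loocv_accuracy <- function(df, formula) {
  n <- nrow(df)
  correct <- logical(n)
  for (i in 1:n) {
    # train on every observation except the i-th
    fit <- rpart(formula, data = df[-i, ], method = "class")
    # apply the predictor just trained to the excluded observation
    pred <- predict(fit, newdata = df[i, , drop = FALSE], type = "class")
    correct[i] <- pred == df[[as.character(formula[[2]])]][i]
  }
  mean(correct)  # proportion of held-out observations classified correctly
}

data(iris)
loocv_accuracy(iris, Species ~ .)
```

Each observation is held out exactly once, so the returned accuracy is an almost unbiased estimate of out-of-sample performance.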
Tracking Tetrahymena Pyriformis Cells

Tracking and matching individual biological cells in real time is a challenging task. Since decision trees provide an excellent tool for real time and rapid decision making, they have potential in this area. Wang et al.19 consider the issue of real time tracking and classification of Tetrahymena pyriformis cells. The issue is whether a cell in the current video frame is the same cell as in the previous video frame.
A 23-dimensional feature vector is developed, and whether two regions in different frames represent the same cell is manually determined. The training set consisted of 1000 frames from 2 videos. The videos were captured at 8 frames per second, with each frame an 8-bit gray level image. The researchers develop two decision trees: the first (T1) trained using the feature vector and manual classification, and the second (T2) trained using a truncated set of features. The error rates for each tree by tree depth are reported in Table 7. Notice that in this case T1 substantially outperforms T2, indicating the importance of using the full feature set.
Tree Depth   T1     T2
5            1.48   14.02
8            1.37   12.49
10           1.55   13.61

Table 7: Error rates (percentage) by tree depth for T1 & T2 reported by Wang et al.
PRACTITIONER TIP

Cross-validation is often used to prevent over-fitting a model to the data. In n-fold cross-validation, we first divide the training set into n subsets of equal size. Sequentially, one subset is tested using the classifier trained on the remaining (n-1) subsets. Thus each instance of the whole training set is predicted once. The advantage of this method over a random selection of training samples is that all observations are used for both training (n-1 times) and evaluation (once). Cross-validation accuracy is measured as the percentage of data that are correctly classified.
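The n-fold procedure can be sketched as follows. Again this is an illustrative sketch: the function name, the rpart package and the iris data are our own choices, not a standard library routine.

```r
# Illustrative n-fold cross-validation; the helper name, rpart and the
# iris data are our own choices for this sketch.
library(rpart)

nfold_accuracy <- function(df, formula, n_folds = 10) {
  # randomly assign each observation to one of n_folds subsets of
  # (roughly) equal size
  folds <- sample(rep(1:n_folds, length.out = nrow(df)))
  pred <- character(nrow(df))
  for (k in 1:n_folds) {
    test_idx <- which(folds == k)
    # train on the remaining (n-1) subsets, test on the k-th subset
    fit <- rpart(formula, data = df[-test_idx, ], method = "class")
    pred[test_idx] <- as.character(
      predict(fit, newdata = df[test_idx, ], type = "class"))
  }
  # every observation has been predicted exactly once
  mean(pred == as.character(df[[as.character(formula[[2]])]]))
}

data(iris)
set.seed(1)
nfold_accuracy(iris, Species ~ ., n_folds = 10)
```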
Classification Trees
Technique 1
Classification Tree
A classification tree can be built using the package tree with the tree function:

tree(z ~ ., data, split)

Key parameters include split, which controls whether deviance or gini is used as the splitting criterion; z, the data frame of classes; and data, the data set of attributes with which you wish to build the tree.
PRACTITIONER TIP

To obtain information on which version of R you are running, the loaded packages and other information, use:

> sessionInfo()
Step 1 Load Required Packages

We build our classification tree using the Vehicle data frame contained in the mlbench package:

> library(tree)
> library(mlbench)
> data(Vehicle)
NOTE
The Vehicle data set20 was collected to classify a Corgi model vehicle silhouette as one of four types (double decker bus, Chevrolet van, Saab 9000 and Opel Manta 400). The data frame contains 846 observations on 18 numerical features extracted from the silhouettes, and one nominal variable defining the class of the objects (see Table 8).
You can access the features directly, using Table 8 as a reference. For example, a summary of the Comp and Circ features is obtained by typing:

> summary(Vehicle[1])
      Comp
 Min.   : 73.00
 1st Qu.: 87.00
 Median : 93.00
 Mean   : 93.68
 3rd Qu.:100.00
 Max.   :119.00

> summary(Vehicle[2])
      Circ
 Min.   :33.00
 1st Qu.:40.00
 Median :44.00
 Mean   :44.86
 3rd Qu.:49.00
 Max.   :59.00
Step 2 Prepare Data & Tweak Parameters

We use 500 of the 846 observations to create a randomly selected training sample. We use the observations associated with train to build the classification tree:

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)
Index  R name        Description
1      Comp          Compactness
2      Circ          Circularity
3      D.Circ        Distance circularity
4      Rad.Ra        Radius ratio
5      Pr.Axis.Ra    Pr. axis aspect ratio
6      Max.L.Ra      Max. length aspect ratio
7      Scat.Ra       Scatter ratio
8      Elong         Elongatedness
9      Pr.Axis.Rect  Pr. axis rectangularity
10     Max.L.Rect    Max. length rectangularity
11     Sc.Var.Maxis  Scaled variance along major axis
12     Sc.Var.maxis  Scaled variance along minor axis
13     Ra.Gyr        Scaled radius of gyration
14     Skew.Maxis    Skewness about major axis
15     Skew.maxis    Skewness about minor axis
16     Kurt.maxis    Kurtosis about minor axis
17     Kurt.Maxis    Kurtosis about major axis
18     Holl.Ra       Hollows ratio
19     Class         Type: bus, opel, saab, van

Table 8: Attributes and class labels for the vehicle silhouettes data set
Step 3 Estimate the Decision Tree

Now we are ready to build the decision tree using the training sample. This is achieved by entering:

> fit <- tree(Class ~ ., data = Vehicle[train, ], split = "deviance")

We use deviance as the splitting criterion; a common alternative is to use split = "gini". You may be surprised by just how quickly R builds the tree.
PRACTITIONER TIP

It is important to remember that the response variable is a factor. This actually trips up quite a few users, who have numeric categories and forget to convert their response variable using factor(). To see the levels of the response variable, type:

> attributes(Vehicle$Class)
$levels
[1] "bus"  "opel" "saab" "van"

$class
[1] "factor"

For Vehicle, each level is associated with a different vehicle type (bus, opel, saab, van).
To see details of the fitted tree, type:

> fit
node), split, n, deviance, yval, (yprob)
      * denotes terminal node

 1) root 500 1386.000 saab ( 0.248000 0.254000 0.258000 0.240000 )
   2) Elong < 41.5 229 489.100 opel ( 0.222707 0.410480 0.366812 0.000000 )
...
  14) Skew.Maxis < 64.5 7 9.561 van ( 0.000000 0.000000 0.428571 0.571429 ) *
  15) Skew.Maxis > 64.5 64 10.300 van ( 0.015625 0.000000 0.000000 0.984375 ) *
At each branch of the tree (after root) we see, in order:

1. the branch number (e.g. in this case 1, 2, 14 and 15);

2. the split (e.g. Elong < 41.5);

3. the number of samples going along that split (e.g. 229);

4. the deviance associated with that split (e.g. 489.1);

5. the predicted class (e.g. opel);

6. the associated probabilities (e.g. ( 0.222707 0.410480 0.366812 0.000000 ));

7. and, for a terminal node (or leaf), the symbol *.
PRACTITIONER TIP

If the minimum deviance occurs with a tree with 1 node, then your model is at best no better than random. It is even possible that it may be worse.
A summary of the tree can also be obtained:

> summary(fit)
Classification tree:
tree(formula = Class ~ ., data = Vehicle[train, ],
    split = "deviance")
Variables actually used in tree construction:
 [1] "Elong"        "Max.L.Ra"     "Comp"         "Pr.Axis.Ra"   "Sc.Var.maxis"
 [6] "Max.L.Rect"   "D.Circ"       "Skew.maxis"   "Circ"         "Kurt.Maxis"
[11] "Skew.Maxis"
Number of terminal nodes:  15
Residual mean deviance:  0.9381 = 455 / 485
Misclassification error rate: 0.232 = 116 / 500
Notice that summary(fit) shows:

1. the type of tree, in this case a Classification tree;

2. the formula used to fit the tree;

3. the variables used to fit the tree;

4. the number of terminal nodes, in this case 15;

5. the residual mean deviance of 0.9381;

6. the misclassification error rate 0.232, or 23.2%.
We plot the tree, see Figure 1.1:

> plot(fit); text(fit)
Figure 1.1: Fitted Decision Tree
PRACTITIONER TIP

The height of the vertical lines in Figure 1.1 is proportional to the reduction in deviance; the longer the line, the larger the reduction. This allows you to identify the important sections immediately. If you wish to plot the tree using uniform lengths, use plot(fit, type = "uniform").
Step 4 Assess Model

Unfortunately, classification trees have a tendency to over-fit the data. One approach to reduce this risk is to use cross-validation. For each hold-out sample we fit the model and note at what level the tree gives the best results (using deviance or the misclassification rate). Then we hold out a different sample and repeat. This can be carried out using the cv.tree() function. We use a leave-one-out cross-validation, using the misclassification rate and deviance (FUN = prune.misclass, followed by FUN = prune.tree).
NOTE
Textbooks and academics used to spend an inordinate amount of time on the subject of when to stop splitting a tree, and also on pruning techniques. This is indeed an important aspect to consider when building a single tree, because if the tree is too large it will tend to over-fit the data; if the tree is too small, it might miss important characteristics of the relationship between the covariates and the outcome. In actual practice, I do not spend a great deal of time on deciding when to stop splitting a tree, or even on pruning. This is partly because:

1. A single tree is generally only of interest to gain insight about the data if it can be easily interpreted. The default settings in R decision tree functions are often sufficient to create such trees.

2. For use in "pure" prediction activities, random forests (see page 272) have largely replaced individual decision trees, because they often produce more accurate predictive models.
The results are plotted side by side in Figure 1.2. The jagged lines show where the minimum deviance / misclassification occurred with the cross-validated tree. Since the cross-validated misclassification and deviance both reach their minimum close to the number of branches in the original fitted tree, there is little to be gained from pruning this tree.

> fitMcv <- cv.tree(fit, K = 346, FUN = prune.misclass)

> fitPcv <- cv.tree(fit, K = 346, FUN = prune.tree)
> par(mfrow = c(1, 2))
> plot(fitMcv)
> plot(fitPcv)
Figure 1.2: Cross-validation results on Vehicle using misclassification and deviance
Step 5 Make Predictions

We use the validation data set and the fitted decision tree to predict vehicle classes; then we display the confusion matrix and calculate the error rate of the fitted tree. Overall, the model has an error rate of 32.4%.
> pred <- predict(fit, newdata = Vehicle[-train, ])

> predclass <- colnames(pred)[max.col(pred, ties.method = c("random"))]

> table(Vehicle$Class[-train], predclass, dnn = c("Observed Class", "Predicted Class"))
              Predicted Class
Observed Class bus opel saab van
          bus   86    1    3   4
          opel   1   55   20   9
          saab   4   55   23   6
          van    2    2    5  70
> error_rate = (1 - sum(predclass == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.324
Technique 2
C5.0 Classification Tree
The C5.0 algorithm21 is based on the concepts of entropy, the measure of disorder in a sample, and the information gain of each attribute. Information gain is a measure of the effectiveness of an attribute in reducing the amount of entropy in the sample.

The algorithm begins by calculating the entropy of the data sample. The next step is to calculate the information gain for each attribute; this is the expected reduction in entropy achieved by partitioning the data set on the given attribute. From the set of information gain values, the best attributes for partitioning the data set are chosen, and the decision tree is built.
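These two quantities are easy to compute directly. The short sketch below is ours, on an invented toy example; the helper functions entropy and info_gain are not part of the C50 package.

```r
# Entropy of a class vector, and the information gain from partitioning
# on an attribute. These helpers are our own illustration; they are not
# part of the C50 package.
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]            # avoid 0 * log2(0) = NaN
  -sum(p * log2(p))
}

info_gain <- function(y, x) {
  # expected reduction in entropy from partitioning y on attribute x
  w <- table(x) / length(x)
  entropy(y) - sum(w * sapply(split(y, x), entropy))
}

# invented toy data: does "outlook" reduce uncertainty about "play"?
play    <- c("no", "no", "yes", "yes", "yes", "no")
outlook <- c("sun", "sun", "rain", "rain", "sun", "rain")

entropy(play)                       # 1 (classes evenly split)
round(info_gain(play, outlook), 3)  # 0.082
```

A higher information gain means the attribute does a better job of separating the classes, and so is a better candidate for the next split.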
A C5.0 classification tree can be built using the package C50 with the C5.0 function:

C5.0(z ~ ., data)

Key parameters include z, the data frame of classes, and data, the data set of attributes with which you wish to build the tree.
Step 1 Load Required Packages

We build our classification tree using the Vehicle data frame contained in the mlbench package:

> library(C50)
> library(mlbench)
> data(Vehicle)
Step 2 Prepare Data & Tweak Parameters

We use 500 of the 846 observations to create a randomly selected training sample. We will use the observations associated with train to build the classification tree:

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)
Step 3 Estimate and Assess the Decision Tree

Now we are ready to build the decision tree using the training sample. This is achieved by entering:

> fit <- C5.0(Class ~ ., data = Vehicle[train, ])
Next, we assess variable importance using the C5imp function:
> C5imp(fit)
             Overall
Max.L.Ra       100.0
Elong          100.0
Comp            50.2
Circ            46.2
Skew.maxis      45.2
Scat.Ra         40.0
Max.L.Rect      29.2
Ra.Gyr          23.4
D.Circ          23.2
Skew.Maxis      20.2
Pr.Axis.Rect    17.0
Kurt.maxis      12.2
Pr.Axis.Ra       9.0
Rad.Ra           3.0
Holl.Ra          3.0
Sc.Var.maxis     0.0
Kurt.Maxis       0.0
We observe that Max.L.Ra and Elong are the two most influential attributes. The attributes Sc.Var.maxis and Kurt.Maxis are the least influential variables, with an influence score of zero.
PRACTITIONER TIP

To assess the importance of attributes by split, use:

> C5imp(fit, metric = "splits")
Step 4 Make Predictions

We use the validation data set and the fitted decision tree to predict vehicle classes; then we display the confusion matrix and calculate the error rate of the fitted tree. Overall, the model has an error rate of 27.5%.

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "class")

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))
              Predicted Class
Observed Class bus opel saab van
          bus   84    1    5   4
          opel   0   48   31   6
          saab   3   38   47   0
          van    2    1    4  72
> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.275
PRACTITIONER TIP

To view the estimated probabilities of each class, use:

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "prob")

> head(round(pred, 3))
     bus  opel  saab   van
5  0.018 0.004 0.004 0.974
7  0.050 0.051 0.852 0.048
10 0.011 0.228 0.750 0.010
12 0.031 0.032 0.782 0.155
15 0.916 0.028 0.029 0.027
19 0.014 0.070 0.903 0.013
Technique 3
Conditional Inference Classification Tree
A conditional inference classification tree is a non-parametric regression tree embedding tree-structured regression models. It is essentially a decision tree, but with extra information about the distribution of classes in the terminal nodes.22 It can be built using the package party with the ctree function:

ctree(z ~ ., data)

Key parameters include z, the data frame of classes, and data, the data set of attributes with which you wish to build the tree.
Step 1 Load Required Packages

We build our classification tree using the Vehicle data frame contained in the mlbench package:

> library(party)
> library(mlbench)
> data(Vehicle)

Step 2 is outlined on page 23.
Step 3 Estimate and Assess the Decision Tree

Now we are ready to build the decision tree using the training sample. This is achieved by entering:

> fit <- ctree(Class ~ ., data = Vehicle[train, ], controls = ctree_control(maxdepth = 2))
Notice we use controls with the maxdepth parameter to limit the depth of the tree to at most 2. Note that maxdepth = 0, the default, places no restriction on the depth of the tree.
Next we plot the tree:

> plot(fit)
The resultant tree is shown in Figure 3.1. At each internal node a p-value is reported for the split; in this case they are all highly significant (less than 1%). The primary split takes place at Elong <= 41 versus Elong > 41. The four leaf nodes are labeled Node 3, Node 4, Node 6 and Node 7, with 62, 167, 88 and 183 observations respectively. Each of these leaf nodes also has a bar chart illustrating the proportion of the four vehicle types that fall into each class at that node.
Figure 3.1: Conditional Inference Classification Tree for Vehicle with maxdepth = 2
Step 4 Make Predictions

We use the validation data set and the fitted decision tree to predict vehicle classes; we show the confusion matrix and calculate the error rate of fit. The misclassification rate is approximately 53% for this tree.

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "response")
> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))
              Predicted Class
Observed Class bus opel saab van
          bus   36    0   11  47
          opel   3   50   24   8
          saab   6   58   19   5
          van    0    0   21  58
> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.529
PRACTITIONER TIP

Set the parameter type = "node" to see which nodes the observations end up in, and type = "prob" to view the probabilities. For example, to see the distribution for the validation sample, type:

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "node")
> table(pred)
pred
  3   4   6   7
 45 108  75 118

We see that 45 observations ended up in node 3, and 118 in node 7.
To assess the predictive power of fit, we compare it against two other conditional inference classification trees: fit3, which limits the maximum tree depth to 3, and fitu, which estimates an unrestricted tree.

> fit3 <- ctree(Class ~ ., data = Vehicle[train, ], controls = ctree_control(maxdepth = 3))

> fitu <- ctree(Class ~ ., data = Vehicle[train, ])
We use the validation data set and the fitted decision trees to predict vehicle classes:
> pred3 <- predict(fit3, newdata = Vehicle[-train, ], type = "response")

> predu <- predict(fitu, newdata = Vehicle[-train, ], type = "response")
Next we calculate the error rate of each fitted tree:

> error_rate3 = (1 - sum(pred3 == Vehicle$Class[-train]) / 346)

> error_rateu = (1 - sum(predu == Vehicle$Class[-train]) / 346)
> tree_1 <- round(error_rate, 3)
> tree_2 <- round(error_rate3, 3)
> tree_3 <- round(error_rateu, 3)
Finally, we collect the misclassification error rate (in percent) of each fitted tree:

> err <- cbind(tree_1, tree_2, tree_3) * 100
> rownames(err) <- "error (%)"

> err
          tree_1 tree_2 tree_3
error (%)   52.9   41.6     37
The unrestricted tree has the lowest misclassification error rate, at 37%.
Technique 4
Evolutionary Classification Tree
NOTE
The recursive partitioning methods discussed in previous sections build the decision tree using a forward stepwise search, where splits are chosen to maximize homogeneity at the next step only. Although this approach is known to be an efficient heuristic, the results are only locally optimal. Evolutionary algorithm based trees search over the parameter space of trees using a global optimization method.
An evolutionary classification tree model can be estimated using the package evtree with the evtree function:

evtree(z ~ ., data)

Key parameters include the response variable z, which contains the classes, and data, the data set of attributes with which you wish to build the tree.
Step 1 Load Required Packages

We build our classification tree using the Vehicle data frame contained in the mlbench package:

> library(evtree)
> library(mlbench)
> data(Vehicle)
Step 2 Prepare Data & Tweak Parameters

We use 500 of the 846 observations to create a randomly selected training sample. We will use the observations associated with train to build the classification tree:

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)
Step 3 Estimate and Assess the Decision Tree

Now we are ready to build the decision tree using the training sample. We use evtree to build the tree, and plot to display it. We restrict the tree depth using maxdepth = 2:

> fit <- evtree(Class ~ ., data = Vehicle[train, ], control = evtree.control(maxdepth = 2))
> plot(fit)
Figure 4.1: Fitted Evolutionary Classification Tree using Vehicle
The tree shown in Figure 4.1 visualizes the decision rules. It also shows, for the terminal nodes, the number of observations and their distribution amongst the classes. Let us take node 3 as an illustration. This node is reached by the rule Max.L.Ra < 8 and Sc.Var.maxis < 290. The node contains 64 observations. Similar details can be obtained by typing fit:
> fit
Model formula:
Class ~ Comp + Circ + D.Circ + Rad.Ra + Pr.Axis.Ra + Max.L.Ra +
    Scat.Ra + Elong + Pr.Axis.Rect + Max.L.Rect + Sc.Var.Maxis +
    Sc.Var.maxis + Ra.Gyr + Skew.Maxis + Skew.maxis + Kurt.maxis +
    Kurt.Maxis + Holl.Ra
Fitted party:
[1] root
|   [2] Max.L.Ra < 8
|   |   [3] Sc.Var.maxis < 290: van (n = 64, err = 45.3%)
|   |   [4] Sc.Var.maxis >= 290: bus (n = 146, err = 27.4%)
|   [5] Max.L.Ra >= 8
|   |   [6] Sc.Var.maxis < 389: van (n = 123, err = 30.9%)
|   |   [7] Sc.Var.maxis >= 389: opel (n = 167, err = 47.3%)

Number of inner nodes:    3
Number of terminal nodes: 4
Step 4 Make Predictions

We use the validation data set and the fitted decision tree to predict vehicle classes; then we display the confusion matrix and calculate the error rate of the fitted tree. Overall, the model has an error rate of 39%.

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "response")

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))
              Predicted Class
Observed Class bus opel saab van
          bus   85    0    0   9
          opel  14   48    0  23
          saab  18   57    0  13
          van    0    1    0  78
> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.39
PRACTITIONER TIP

When using predict you can specify any of the following via type =: "response", "prob", "quantile", "density" or "node". For example, to view the estimated probabilities of each class, use:

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "prob")

To see the distribution across nodes, you would enter:

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "node")
Finally, we estimate an unrestricted model, use the validation data set to predict vehicle classes, display the confusion matrix and calculate the error rate of the unrestricted fitted tree. In this case the misclassification error rate is lower, at 34.7%.

> fit <- evtree(Class ~ ., data = Vehicle[train, ])

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "response")

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))
              Predicted Class
Observed Class bus opel saab van
          bus   72    3   15   4
          opel   4   30   41  10
          saab   6   23   55   4
          van    3    1    6  69
> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.347
Technique 5
Oblique Classification Tree
NOTE
A common question is: what is different about oblique trees? Here is the answer in a nutshell. For a set of attributes X1, ..., Xk, the standard classification tree produces binary partitioned trees by considering axis-parallel splits over continuous attributes (i.e. grown using tests of the form Xi < C versus Xi >= C). This is the most widely used approach to tree growth. Oblique trees are grown using oblique splits, i.e. tests of the form a1X1 + ... + akXk < C versus a1X1 + ... + akXk >= C. So for axis-parallel splits a single attribute is used; for oblique splits a weighted combination of attributes is used.
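The difference is easy to see numerically. In this invented sketch (the attribute values, weights and cutoffs are ours, not from the oblique.tree package), the axis-parallel test consults one attribute, while the oblique test consults a weighted sum:

```r
# Invented illustration of axis-parallel versus oblique split tests;
# the attribute values, weights and cutoffs are made up.
x1 <- c(1, 2, 3, 4)
x2 <- c(4, 1, 2, 5)

# axis-parallel split: a test on a single attribute
x1 < 2.5
# [1]  TRUE  TRUE FALSE FALSE

# oblique split: a test on a weighted combination of attributes
0.5 * x1 - 0.5 * x2 < 0
# [1]  TRUE FALSE FALSE  TRUE
```

Note that the oblique test separates the observations along a slanted boundary in the (x1, x2) plane, a partition no single axis-parallel test can reproduce.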
An oblique classification tree model can be estimated using the package oblique.tree with the oblique.tree function:

oblique.tree(z ~ ., data, oblique.splits = "only", control = tree.control(), split.impurity)

Key parameters include the response variable z, which contains the classes; data, the data set of attributes with which you wish to build the tree; control, which takes arguments from tree.control in the tree package; and split.impurity, which controls the splitting criterion used and takes the values "deviance" or "gini".
Step 1 Load Required Packages

We build our classification tree using the Vehicle data frame contained in the mlbench package:

> library(oblique.tree)
> library(mlbench)
> data(Vehicle)
Step 2 Prepare Data & Tweak Parameters

We use 500 of the 846 observations to create a randomly selected training sample. Three attributes are used to build the tree (Max.L.Ra, Sc.Var.maxis and Elong):

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)
> f <- Class ~ Max.L.Ra + Sc.Var.maxis + Elong
Step 3 Estimate and Assess the Decision Tree

Before estimating the tree we tweak the following:

1. Only allow oblique splits (oblique.splits = "only").

2. Use tree.control to indicate the number of observations (nobs = 500) and set the minimum number of observations in a node to 60 (mincut = 60).

The tree is estimated as follows:

> fit <- oblique.tree(f, data = Vehicle[train, ], oblique.splits = "only", control = tree.control(nobs = 500, mincut = 60), split.impurity = "deviance")
PRACTITIONER TIP

The type of split is indicated by the oblique.splits argument. For our analysis we use oblique.splits = "only" to grow a tree that uses only oblique splits. Use oblique.splits = "on" to grow trees that use both oblique and axis-parallel splits, and oblique.splits = "off" to grow traditional classification trees (which use only axis-parallel splits).
Details of the tree can be visualized using a combination of plot and text, see Figure 5.1:

> plot(fit); text(fit)
Figure 5.1: Fitted Oblique tree using a subset of Vehicle attributes
The tree visualizes the decision rules. The first split occurs where

6.4 - 1.42 Max.L.Ra + 0.03 Sc.Var.maxis - 0.12 Elong < 0

If the above holds, it leads directly to the leaf node indicating class type = van. Full details of the tree can be obtained by typing fit, whilst summary gives an overview of the fitted tree:
> summary(fit)

Classification tree:
oblique.tree(formula = f, data = Vehicle[train, ],
    control = tree.control(nobs = 500, mincut = 60),
    split.impurity = "deviance", oblique.splits = "only")
Variables actually used in tree construction:
[1] ""
Number of terminal nodes:  5
Residual mean deviance:  1.766 = 874.2 / 495
Misclassification error rate: 0 = 0 / 500
Step 4 Make Predictions

We use the validation data set and the fitted decision tree to predict vehicle classes; then we display the confusion matrix and calculate the error rate of the fitted tree. Overall, the model has an error rate of 39.9%.

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "class")

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))
              Predicted Class
Observed Class bus opel saab van
          bus   77    0   13   4
          opel  17   37   20  11
          saab  23   34   27   4
          van    8    0    4  67
> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.399
Technique 6
Logistic Model Based Recursive Partitioning
A model based recursive partitioning logistic regression tree can be estimated using the package party with the mob function:

mob(z ~ x + y + w | a + b + c, data, model = glinearModel, family = binomial())

Key parameters include the binary response variable z; the conditioning covariates x, y and w; and the tree partitioning covariates a, b and c.
Step 1 Load Required Packages

We build the decision tree using the data frame PimaIndiansDiabetes2 contained in the mlbench package:

> library(party)
> data(PimaIndiansDiabetes2, package = "mlbench")
NOTE
The PimaIndiansDiabetes2 data set was collected by the National Institute of Diabetes and Digestive and Kidney Diseases.23 It contains 768 observations on 9 variables, measured on females at least 21 years old of Pima Indian heritage. Table 9 contains a description of each of the variables.
Name      Description
pregnant  Number of times pregnant
glucose   Plasma glucose concentration (glucose tolerance test)
pressure  Diastolic blood pressure (mm Hg)
triceps   Triceps skin fold thickness (mm)
insulin   2-Hour serum insulin (mu U/ml)
mass      Body mass index
pedigree  Diabetes pedigree function
age       Age (years)
diabetes  Test for diabetes, the class variable (neg, pos)

Table 9: Response and independent variables in the PimaIndiansDiabetes2 data frame
Step 2 Prepare Data & Tweak Parameters

For our analysis we use 600 of the 768 observations to train the model. The response variable is diabetes, and we use mass and pedigree as logistic regression conditioning variables, with the remaining six variables (glucose, pregnant, pressure, triceps, insulin and age) being used as the partitioning variables. The model is stored in f:

> set.seed(898)
> n = nrow(PimaIndiansDiabetes2)
> train <- sample(1:n, 600, FALSE)
> f <- diabetes ~ mass + pedigree | glucose + pregnant + pressure + triceps + insulin + age
NOTE
The PimaIndiansDiabetes2 data set has a large number of missing values (recorded as NA), particularly for the attributes insulin and triceps. In traditional statistical analysis these values would have to be removed, or their values interpolated. However, ignoring missing values, or treating them as another category, is often inefficient. A more efficient use of the available information, adopted by many decision tree algorithms, is to ignore the missing data point in the evaluation of a split, but distribute such points to child nodes using a given rule. For example, the rule might:

1. Distribute missing values to the node which has the largest number of instances.

2. Distribute to all child nodes, with diminished weights proportional to the number of instances in each child node.

3. Randomly distribute to one single child node.

4. Create surrogate attributes which closely resemble the test attributes, and use them to send missing values to child nodes.
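Rules 2 and 3 can be sketched numerically as follows; the node sizes and names below are invented for illustration and are not tied to any particular package.

```r
# Invented sketch of rules 2 and 3 for an observation whose split
# attribute is NA; node sizes and names are made up for illustration.
n_left  <- 300   # instances reaching the left child
n_right <- 100   # instances reaching the right child
p <- c(left = n_left, right = n_right) / (n_left + n_right)

# rule 2: send the observation to *both* children, carrying a
# diminished weight proportional to each child's size
p
#  left right
#  0.75  0.25

# rule 3: send the observation to one child chosen at random,
# with probability proportional to the child sizes
set.seed(42)
sample(names(p), size = 1, prob = p)
```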
Step 3 Estimate & Interpret the Decision Tree

We estimate the model using the function mob, and then use the plot function to visualize the tree, as shown in Figure 6.1. Since the response variable diabetes is binary, and mass and pedigree are numeric, a spinogram is used for visualization. The plots in the leaves give spinograms for diabetes versus mass (upper panel) and pedigree (lower panel).

> fit <- mob(f, data = PimaIndiansDiabetes2[train, ], model = glinearModel, family = binomial())
> plot(fit)
PRACTITIONER TIP

As an alternative to using spinograms, you can also plot the cumulative density function using the argument tp_args = list(cdplot = TRUE) in the plot function. In the example in this section you would type:

> plot(fit, tp_args = list(cdplot = TRUE))

You can also specify the smoothing bandwidth using the bw argument. For example:

> plot(fit, tp_args = list(cdplot = TRUE, bw = 15))
Figure 6.1: Logistic-regression-based tree for the Pima Indians diabetes data
The fitted lines are the mean predicted probabilities in each group. The decision tree distinguishes four different groups of women:
• Node 3: Women with low glucose who are 26 years old or younger have, on average, a low risk of diabetes; the risk increases with mass but decreases slightly with pedigree.
• Node 4: Women with low glucose who are older than 26 years have, on average, a moderate risk of diabetes, which increases with mass and pedigree.
• Node 5: Women with glucose in the range 127 to 165 have, on average, a moderate to high risk of diabetes, which increases with mass and pedigree.

• Node 7: Women with glucose greater than 165 have, on average, a high risk of diabetes, which increases with mass and decreases with pedigree.
The same interpretation can also be drawn from the coefficient estimates obtained using the coef function:

> round(coef(fit), 3)
  (Intercept)  mass pedigree
3      -7.263 0.143   -0.680
4      -4.887 0.076    2.495
6      -5.711 0.149    1.365
7      -3.216 0.225   -2.193
When comparing models it can be useful to have the value of the log-likelihood function or the Akaike information criterion. These can be obtained by:

> logLik(fit)
'log Lik.' -115.4345 (df=15)

> AIC(fit)
[1] 260.8691
Step 5 Make Predictions

We use the function predict to calculate the fitted values using the validation sample, and show the confusion table. Then we calculate the misclassification error, which returns a value of 26.2%.

> pred <- predict(fit, newdata = PimaIndiansDiabetes2[-train, ])
> thresh <- 0.5
> predFac <- cut(pred, breaks = c(-Inf, thresh, Inf), labels = c("neg", "pos"))
> tb <- table(PimaIndiansDiabetes2$diabetes[-train], predFac, dnn = c("actual", "predicted"))

> tb
        predicted
actual   neg pos
   neg    92  17
   pos    26  29
> error <- 1 - (sum(diag(tb))/sum(tb))
> round(error, 3)*100
[1] 26.2
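The error figure can be checked by hand from the printed confusion table; the counts below are copied from the output above:

```r
# Confusion table copied from the output above
tb <- matrix(c(92, 26, 17, 29), nrow = 2,
             dimnames = list(actual = c("neg", "pos"),
                             predicted = c("neg", "pos")))
error <- 1 - sum(diag(tb)) / sum(tb)   # off-diagonal share
round(error, 3) * 100                  # 26.2
```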
PRACTITIONER TIP

Re-run the analysis in this section omitting the attributes insulin and triceps, and using the na.omit method to remove any remaining missing values. Here is some sample code to get you started:

temp <- PimaIndiansDiabetes2
temp$insulin <- NULL
temp$triceps <- NULL
temp <- na.omit(temp)

What do you notice about the resultant decision tree?
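The data preparation in the tip can be illustrated on a toy data frame (illustrative values only; the real exercise uses PimaIndiansDiabetes2):

```r
# Toy stand-in for PimaIndiansDiabetes2 (values are made up)
temp <- data.frame(glucose  = c(148, NA, 183),
                   insulin  = c(NA, NA, 543),
                   triceps  = c(35, 29, NA),
                   diabetes = c("pos", "neg", "pos"))
temp$insulin <- NULL   # drop the insulin attribute
temp$triceps <- NULL   # drop the triceps attribute
temp <- na.omit(temp)  # remove rows still containing missing values
nrow(temp)             # 2 complete rows remain
```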
Technique 7
Probit Model Based Recursive Partitioning
A model based recursive partitioning probit regression tree can be estimated using the package party with the mob function:

mob(z ~ x + y + z | a + b + c, data = , model = glinearModel,
    family = binomial(link = probit))

Key parameters include the binary response variable z, the conditioning covariates x, y and z, and the tree partitioning covariates a, b and c.
Steps 1 and 2 are discussed beginning on page 52.
Step 3: Estimate & Interpret the Decision Tree
We estimate the model using the function mob and display the coefficients at the leaf nodes using coef.
> fit <- mob(f, data = PimaIndiansDiabetes2[train, ],
             model = glinearModel,
             family = binomial(link = probit))
> round(coef(fit), 3)
  (Intercept)  mass pedigree
3      -4.070 0.078   -0.222
4      -2.932 0.046    1.474
6      -3.416 0.089    0.814
7      -1.174 0.100   -1.003
The estimated decision tree is similar to that shown in Figure 6.1, and the interpretation of the coefficients is given on page 56.
Step 5: Make Predictions
For comparison with the logistic regression discussed on page 52, we report the value of the log likelihood function and the Akaike information criterion. Since these values are very close to those obtained by the logistic regression, we should expect similar predictive performance.

> logLik(fit)
'log Lik.' -115.4277 (df=15)
> AIC(fit)
[1] 260.8555
We use the function predict with the validation sample and show the confusion table. Then we calculate the misclassification error, which returns a value of 26.2%. This is exactly the same error rate observed for the logistic regression model.

> pred <- predict(fit, newdata = PimaIndiansDiabetes2[-train, ])
> thresh <- 0.5
> predFac <- cut(pred, breaks = c(-Inf, thresh, Inf),
                 labels = c("neg", "pos"))
> tb <- table(PimaIndiansDiabetes2$diabetes[-train], predFac,
              dnn = c("actual", "predicted"))
> tb
        predicted
actual   neg pos
   neg    92  17
   pos    26  29
> error <- 1 - (sum(diag(tb))/sum(tb))
> round(error, 3)*100
[1] 26.2
Regression Trees for Continuous Response Variables
Technique 8
Regression Tree
A regression tree can be built using the package tree with the tree function:

tree(z ~ ., data = , split = )

Key parameters include split, which controls whether deviance or gini is used as the splitting criterion; z, the continuous response variable; and data, the data set of attributes with which you wish to build the tree.
Step 1: Load Required Packages
We build our regression tree using the bodyfat data frame contained in the TH.data package.

> library(tree)
> data(bodyfat, package = "TH.data")
NOTE

The bodyfat data set was collected by Garcia et al.24 to develop improved predictive regression equations for body fat content derived from common anthropometric measurements. The original study collected data from 117 healthy German subjects, 46 men and 71 women. The bodyfat data frame contains the data collected on 10 variables for the 71 women; see Table 10.
Name          Description
DEXfat        body fat measured by DXA (response variable)
age           age in years
waistcirc     waist circumference
hipcirc       hip circumference
elbowbreadth  breadth of the elbow
kneebreadth   breadth of the knee
anthro3a      sum of logarithm of three anthropometric measurements
anthro3b      sum of logarithm of three anthropometric measurements
anthro3c      sum of logarithm of three anthropometric measurements
anthro4       sum of logarithm of three anthropometric measurements

Table 10: Response and independent variables in the bodyfat data frame
Step 2: Prepare Data & Tweak Parameters
Following the approach taken by Garcia et al., we use 45 of the 71 observations to build the regression tree. The remainder will be used for prediction. The 45 training observations were selected at random without replacement.

> set.seed(465)
> train <- sample(1:71, 45, FALSE)
Step 3: Estimate the Decision Tree
Now we are ready to fit the decision tree using the training sample. We take the log of DEXfat as the response variable.

> fit <- tree(log(DEXfat) ~ ., data = bodyfat[train, ],
              split = "deviance")
To see details of the fitted tree enter:

> fit
node), split, n, deviance, yval
      * denotes terminal node

 1) root 45 6.72400 3.364
   2) anthro4 < 5.33 19 1.33200 3.004
     4) anthro4 < 4.545 5 0.25490 2.667 *
     ...
    14) waistcirc < 104.25 10 0.07201 3.707 *
    15) waistcirc > 104.25 5 0.08854 3.874 *
The yval reported at each node is the estimated mean of log(DEXfat) for the observations in that node. For example, the root node indicates the overall mean of log(DEXfat) is 3.364. This is approximately the same value we would get from:

> mean(log(bodyfat$DEXfat))
[1] 3.359635

Following the splits, the first terminal node is node 4, where the estimated mean of log(DEXfat) for women with anthro4 < 5.33 and anthro4 < 4.545 is 2.667.
A summary of the tree can also be obtained:

> summary(fit)

Regression tree:
tree(formula = log(DEXfat) ~ ., data = bodyfat[train, ],
    split = "deviance")
Variables actually used in tree construction:
[1] "anthro4"   "hipcirc"   "waistcirc"
Number of terminal nodes:  7
Residual mean deviance:  0.01687 = 0.6411 / 38
Distribution of residuals:
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
 -0.268300  -0.080280   0.009712   0.000000   0.073400   0.332900
Notice that summary(fit) shows:

1. the type of tree, in this case a Regression tree;

2. the formula used to fit the tree;

3. the variables used to fit the tree;

4. the number of terminal nodes, in this case 7;

5. the residual mean deviance, in this case 0.01687;

6. the distribution of the residuals, in this case with a mean of 0.
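The residual mean deviance in item 5 is simply the total residual deviance divided by the residual degrees of freedom, i.e. the 45 training observations minus the 7 terminal nodes. A quick check using the figures printed by summary(fit):

```r
total_deviance <- 0.6411        # from summary(fit)
df <- 45 - 7                    # observations minus terminal nodes
round(total_deviance / df, 5)   # 0.01687
```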
We plot the tree, see Figure 8.1.

> plot(fit); text(fit)
Figure 8.1: Fitted Regression Tree for bodyfat
Step 4: Assess Model
We use leave-one-out cross-validation, with the results plotted in Figure 8.2. Since the jagged line reaches a minimum close to the number of branches in the original fitted tree, there is little to be gained from pruning this tree.

> fitcv <- cv.tree(fit, K = 45)
> plot(fitcv)
Figure 8.2: Regression tree cross-validation results using bodyfat
Step 5: Make Predictions
We use the test observations and the fitted decision tree to predict log(DEXfat). The scatter plot between predicted and observed values is shown in Figure 8.3. The squared correlation coefficient between predicted and observed values is 0.795.

> pred <- predict(fit, newdata = bodyfat[-train, ])
> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
       ylab = "Predicted Values", main = "Training Sample Model Fit")
> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.795
Figure 8.3: Scatterplot of predicted versus observed values for the regression tree of bodyfat
Technique 9
Conditional Inference Regression Tree
A conditional inference regression tree is a non-parametric regression tree embedding tree-structured regression models. It is similar to the regression tree of page 62, but with extra information about the distribution of subjects in the leaf nodes. It can be estimated using the package party with the ctree function:

ctree(z ~ ., data = )

Key parameters include the continuous response variable z, and data, the data set of attributes with which you wish to build the tree.
Step 1: Load Required Packages
We build the decision tree using the bodyfat data frame (see page 62) contained in the TH.data package.

> library(party)
> data(bodyfat, package = "TH.data")
Step 2 is outlined on page 63.
Step 3: Estimate and Assess the Decision Tree
We estimate the model using the training data, followed by a plot of the fitted tree, shown in Figure 9.1.
> fit <- ctree(log(DEXfat) ~ ., data = bodyfat[train, ])
> plot(fit)
Figure 9.1: Fitted Conditional Inference Regression Tree using bodyfat
Further details of the fitted tree can be obtained using the print function:

> print(fit)

         Conditional inference tree with 4 terminal nodes
Response:  log(DEXfat)
Inputs:  age, waistcirc, hipcirc, elbowbreadth, kneebreadth,
         anthro3a, anthro3b, anthro3c, anthro4
Number of observations:  45

1) anthro3c <= 3.85; criterion = 1, statistic = 35.215
  2) anthro3c <= 3.39; criterion = 0.998, statistic = 14.061
    3)* weights = 9
  2) anthro3c > 3.39
    4)* weights = 12
1) anthro3c > 3.85
  5) hipcirc <= 108.5; criterion = 0.999, statistic = 15.862
    6)* weights = 10
  5) hipcirc > 108.5
    7)* weights = 14
At each branch of the tree (after root) we see, in order, the branch number and the split rule (e.g. anthro3c <= 3.85); criterion reflects the reported p-value and is derived from statistic. Terminal nodes (or leaves) are indicated with * and weights gives the number of subjects (observations) at that node.
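The weights are simply counts of observations per terminal node; in party, table(where(fit)) reproduces them from the fitted tree. A toy base-R illustration using hypothetical node assignments that match the weights printed above:

```r
# Hypothetical terminal node assignment for each training observation
node_id <- c(rep(3, 9), rep(4, 12), rep(6, 10), rep(7, 14))
w <- table(node_id)   # counts per node: 9, 12, 10 and 14
sum(w)                # 45 observations, matching the training sample
```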
Step 5: Make Predictions
We use the validation observations and the fitted decision tree to predict log(DEXfat). The scatter plot between predicted and observed values is shown in Figure 9.2. The squared correlation coefficient between predicted and observed values is 0.68.

> pred <- predict(fit, newdata = bodyfat[-train, ])
> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
       ylab = "Predicted Values", main = "Training Sample Model Fit")
> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
log(DEXfat) 0.68
Figure 9.2: Conditional Inference Regression Tree scatter plot for bodyfat
Technique 10
Linear Model Based Recursive Partitioning
A linear model based recursive partitioning regression tree can be estimated using the package party with the mob function:

mob(z ~ x + y + z | a + b + c, data = , model = linearModel)

Key parameters include the continuous response variable z, the linear regression covariates x, y and z, and the covariates a, b and c with which you wish to partition the tree.
Step 1: Load Required Packages
We build the decision tree using the bodyfat data frame (see page 62) contained in the TH.data package.

> library(party)
> data(bodyfat, package = "TH.data")
Step 2: Prepare Data & Tweak Parameters
For our analysis we will use the entire bodyfat sample. We begin by taking the log of the response variable (DEXfat) and the two conditioning variables (waistcirc, hipcirc); the remaining covariates form the partitioning set.

> bodyfat$DEXfat <- log(bodyfat$DEXfat)
> bodyfat$waistcirc <- log(bodyfat$waistcirc)
> bodyfat$hipcirc <- log(bodyfat$hipcirc)
> f <- DEXfat ~ waistcirc + hipcirc | age + elbowbreadth +
      kneebreadth + anthro3a + anthro3b + anthro3c + anthro4
Step 3: Estimate & Evaluate Decision Tree
We estimate the model using the function mob. Since looking at the printed output can be rather tedious, a visualization is shown in Figure 10.1. By default this produces partial scatter plots of the response variable against each of the regressors (waistcirc, hipcirc) in the terminal nodes. Each scatter plot also shows the fitted values. From this visualization it can be seen that in nodes 3, 4 and 5 body fat increases with waist and hip circumference. The increase appears steepest in node 3 and flattens out somewhat in node 5.

> fit <- mob(f, data = bodyfat, model = linearModel,
             control = mob_control(objfun = logLik))
> plot(fit)
PRACTITIONER TIP

Model based recursive partitioning searches for the locally optimal split by minimizing the objective function of the model. Typically this will be something like the deviance or the negative log likelihood function. It can be specified using the mob_control control function. For example, to use the deviance you would set control = mob_control(objfun = deviance). In our analysis we use the log likelihood with control = mob_control(objfun = logLik).
Figure 10.1: Linear model based recursive partitioning tree using bodyfat
Further details of the fitted tree can be obtained by typing the fitted model's name:

> fit
1) anthro3b <= 4.64; criterion = 0.998, statistic = 24.549
  2) anthro3b <= 4.29; criterion = 0.966, statistic = 16.962
    3)* weights = 31
Terminal node model
Linear model with coefficients:
(Intercept)    waistcirc      hipcirc
    -14.309        1.196        2.660

  2) anthro3b > 4.29
    4)* weights = 20
Terminal node model
Linear model with coefficients:
(Intercept)    waistcirc      hipcirc
    -5.3867       0.9887       0.9466

1) anthro3b > 4.64
  5)* weights = 20
Terminal node model
Linear model with coefficients:
(Intercept)    waistcirc      hipcirc
    -3.5815       0.6027       0.9546
The output informs us that the tree consists of five nodes. At each branch of the tree (after root) we see, in order, the branch number and the split rule (e.g. anthro3b <= 4.64). Note criterion reflects the reported p-value25 and is derived from statistic. Terminal nodes are indicated with * and weights gives the number of subjects (observations) at that node. The output also presents the estimated regression coefficients at the terminal nodes. We can also use the coef function to obtain a summary of the estimated coefficients and their associated nodes:

> round(coef(fit), 3)
  (Intercept) waistcirc hipcirc
3     -14.309     1.196   2.660
4      -5.387     0.989   0.947
5      -3.582     0.603   0.955
The summary function also provides detailed statistical information on the fitted coefficients by node. For example, summary(fit) produces the following (we only show details of node 3):

> summary(fit)
$`3`

Call:
NULL

Weighted Residuals:
     Min       1Q   Median       3Q      Max
 -0.3272   0.0000   0.0000   0.0000   0.4376

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -14.3093     2.4745  -5.783 3.29e-06 ***
waistcirc     1.1958     0.4033   2.965 0.006119 **
hipcirc       2.6597     0.6969   3.817 0.000685 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1694 on 28 degrees of freedom
Multiple R-squared: 0.6941, Adjusted R-squared: 0.6723
F-statistic: 31.77 on 2 and 28 DF, p-value: 6.278e-08
When comparing models it can be useful to have the value of the log likelihood function or the Akaike information criterion. These can be obtained by:

> logLik(fit)
'log Lik.' 54.42474 (df=14)
> AIC(fit)
[1] -80.84949
PRACTITIONER TIP

The test statistics and p-values computed in each node can be extracted using the function sctest(). For example, to see the statistics for node 2 you would type sctest(fit, node = 2).
Step 5: Make Predictions
We use the function predict and then display the scatter plot between predicted and observed values in Figure 10.2. The squared correlation coefficient between predicted and observed values is 0.89.

> pred <- predict(fit, newdata = bodyfat)
> plot(bodyfat$DEXfat, pred, xlab = "DEXfat",
       ylab = "Predicted Values", main = "Full Sample Model Fit")
> round(cor(pred, bodyfat$DEXfat)^2, 3)
[1] 0.89
Figure 10.2: Linear model based recursive partitioning tree predicted and observed values using bodyfat
Technique 11
Evolutionary Regression Tree
An evolutionary regression tree model can be estimated using the package evtree with the evtree function:

evtree(z ~ ., data = )

Key parameters include the continuous response variable z and the covariates contained in data.
Step 1: Load Required Packages
We build the decision tree using the bodyfat data frame (see page 62) contained in the TH.data package.

> library(evtree)
> data(bodyfat, package = "TH.data")
Step 2: Prepare Data & Tweak Parameters
For our analysis we will use 45 observations for the training sample. We take the log of the response variable (DEXfat) and two of the covariates (waistcirc, hipcirc). The remaining covariates are used in their original form.

> set.seed(465)
> train <- sample(1:71, 45, FALSE)
> bodyfat$DEXfat <- log(bodyfat$DEXfat)
> bodyfat$waistcirc <- log(bodyfat$waistcirc)
> bodyfat$hipcirc <- log(bodyfat$hipcirc)
> f <- DEXfat ~ waistcirc + hipcirc + age + elbowbreadth +
      kneebreadth + anthro3a + anthro3b + anthro3c + anthro4
Step 3: Estimate & Evaluate Decision Tree
We estimate the model using the function evtree. A visualization is obtained using plot and shown in Figure 11.1. This produces box and whisker plots of the response variable in each leaf. From this visualization it can be seen that body fat increases as we move from node 3 to node 5.

> fit <- evtree(f, data = bodyfat[train, ])
> plot(fit)
PRACTITIONER TIP

Notice that evtree.control is used to control important aspects of a tree. You can change the number of evolutionary iterations using niterations; this is useful if your tree does not converge by the default number of iterations. You can also specify the number of trees in the population using ntrees, and the tree depth with maxdepth. For example, to limit the maximum tree depth to two and the number of iterations to ten thousand you would enter something along the lines of:

fit <- evtree(f, data = bodyfat[train, ],
              control = evtree.control(maxdepth = 2,
                                       niterations = 10000))
Figure 11.1: Fitted Evolutionary Regression Tree for bodyfat
Further details of the fitted tree can be obtained by typing the fitted model's name:

> fit

Model formula:
DEXfat ~ waistcirc + hipcirc + age + elbowbreadth + kneebreadth +
    anthro3a + anthro3b + anthro3c + anthro4

Fitted party:
[1] root
|   [2] hipcirc < 109
|   |   [3] anthro3c < 3.77: 20.271 (n = 18, err = 385.9)
|   |   [4] anthro3c >= 3.77: 31.496 (n = 12, err = 186.8)
|   [5] hipcirc >= 109: 43.432 (n = 15, err = 554.8)

Number of inner nodes:    2
Number of terminal nodes: 3
PRACTITIONER TIP

Decision tree models can outperform other techniques when relationships are irregular (e.g. non-monotonic), but they are also known to be inefficient when relationships can be well approximated by simpler models.
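A small simulated sketch of this point (not from the text; it uses the rpart package, which is bundled with R): a straight-line model cannot track a hump-shaped signal, while a regression tree can.

```r
library(rpart)   # regression trees; a recommended package shipped with R

set.seed(1)
x <- runif(300, 0, pi)
y <- sin(x) + rnorm(300, sd = 0.1)   # irregular (non-monotonic) signal
d <- data.frame(x = x, y = y)

linear_fit <- lm(y ~ x, data = d)     # simple straight-line model
tree_fit   <- rpart(y ~ x, data = d)  # regression tree

r2_lm   <- cor(fitted(linear_fit), d$y)^2   # close to zero
r2_tree <- cor(predict(tree_fit), d$y)^2    # substantially higher
```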
Step 5: Make Predictions
We use the function predict and then display the scatter plot between predicted and observed values in Figure 11.2. The squared correlation coefficient between predicted and observed values is 0.73.

> pred <- predict(fit, newdata = bodyfat[-train, ])
> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
       ylab = "Predicted Values", main = "Model Fit")
> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.727
Figure 11.2: Scatter plot of fitted and observed values for the Evolutionary Regression Tree using bodyfat
Decision Trees for Count & Ordinal Response Data
PRACTITIONER TIP

An oft believed maxim is "the more data the better". Whilst this may sometimes be true, it is a good idea to try to rationally reduce the number of attributes you include in your decision tree to the minimum set of highest value attributes. Oates and Jensen26 studied the influence of database size on decision tree complexity. They found tree size and complexity strongly depend on the size of the training set.
It is always worth thinking about and removing uninformative attributes prior to decision tree construction. For practical ideas and additional tips on how to do this see the excellent papers of John27, Brodley and Friedl28, and Cano and Herrera29.
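One simple, concrete notion of an uninformative attribute is a numeric column that is (nearly) constant, since it can never support a split. The helper below is a hypothetical illustration, not taken from the papers cited above:

```r
# Drop numeric columns whose variance is (near) zero.
# drop_constant is a hypothetical helper name.
drop_constant <- function(df, tol = 1e-8) {
  keep <- vapply(df, function(col) {
    !is.numeric(col) || stats::var(col, na.rm = TRUE) > tol
  }, logical(1))
  df[, keep, drop = FALSE]
}

d <- data.frame(x = c(1, 2, 3), flat = c(5, 5, 5), y = c(2, 4, 6))
names(drop_constant(d))   # "x" "y" -- the constant column is gone
```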
Technique 12
Poisson Decision Tree
A decision tree for a count response variable (yi), following a Poisson distribution with a mean that depends on the covariates x1,...,xk, can be built using the package rpart with the rpart function:

rpart(z ~ ., data = , method = "poisson")

Key parameters include method = "poisson", which is used to indicate the type of tree to be built; z, the Poisson distributed response variable; and data, the data set of attributes with which you wish to build the tree.
Step 1: Load Required Packages
We build a Poisson decision tree using the DebTrivedi data frame contained in the MixAll package.

> library(rpart)
> library(MixAll)
> data(DebTrivedi)
NOTE

Deb and Trivedi30 model counts of medical care utilization by the elderly in the United States using data from the National Medical Expenditure Survey. They analyze data on 4406 individuals, aged 66 and over, who are covered by Medicare, a public insurance program. The objective is to model the demand for medical care using as the response variable the number of physician/non-physician office and hospital outpatient visits. The data is contained in the DebTrivedi data frame available in the MixAll package.
Step 2: Prepare Data & Tweak Parameters
The number of physician office visits (ofp) is the response variable. The covariates are hosp (number of hospital stays), health (self-perceived health status) and numchron (number of chronic conditions), as well as the socioeconomic variables gender, school (number of years of education) and privins (private insurance indicator).

> f <- ofp ~ hosp + health + numchron + gender + school + privins
Step 3: Estimate the Decision Tree & Assess Fit
Now we are ready to fit the decision tree:

> fit <- rpart(f, data = DebTrivedi, method = "poisson")

To see a plot of the tree use the plot and text methods:

> plot(fit); text(fit, use.n = TRUE, cex = 0.8)

Figure 12.1 shows a visualization of the fitted tree. Each of the five terminal nodes reports the event rate, the total number of events and the number of observations for that node. For example, for the rule chain numchron < 1.5 -> hosp < 0.5 -> numchron < 0.5, the event rate is 3.121 with 923 events at that node.
PRACTITIONER TIP

To see the number of events and observations at every node in a decision tree plot, add all = TRUE to the text function:

plot(fit); text(fit, use.n = TRUE, all = TRUE,
                cex = 0.6)
Figure 12.1: Poisson Decision Tree using the DebTrivedi data frame
To help validate the decision tree we use the printcp function. The 'cp' part of the function name stands for the "complexity parameter" of the tree. The function indicates the optimal tree size based on the cp value.

> printcp(fit, digits = 3)
Rates regression tree:
rpart(formula = f, data = DebTrivedi, method = "poisson")

Variables actually used in tree construction:
[1] hosp     numchron

Root node error: 26943/4406 = 6.12

n= 4406

      CP nsplit rel error xerror   xstd
1 0.0667      0     1.000  1.000 0.0332
2 0.0221      1     0.933  0.942 0.0325
3 0.0220      2     0.911  0.918 0.0326
4 0.0122      3     0.889  0.896 0.0315
5 0.0100      4     0.877  0.887 0.0315
The printcp function returns the formula used to fit the tree; the variables used to build the tree (in this case hosp and numchron); the root node error (6.12); the number of observations at the root node (4406); and the relative error, the cross-validation error (xerror), xstd and CP at each node split. Each row represents a different height of the tree. In general, more levels in the tree often imply a lower classification error; however, you run the risk of over fitting.
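The "minimum xerror" rule of thumb discussed below can be applied directly to the printed table; the values here are copied from the printcp output above:

```r
# cptable values copied from the printcp output above
cptable <- data.frame(
  CP     = c(0.0667, 0.0221, 0.0220, 0.0122, 0.0100),
  nsplit = c(0, 1, 2, 3, 4),
  xerror = c(1.000, 0.942, 0.918, 0.896, 0.887))

best <- cptable[which.min(cptable$xerror), ]
best$CP   # 0.01 -- the cp value you would pass to prune()
```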
Figure 12.2 plots the relative error against the cp parameter.

> plotcp(fit)
Figure 12.2: Complexity parameter for the Poisson decision tree using the DebTrivedi data frame
PRACTITIONER TIP

A simple rule of thumb is to choose the lowest level where rel error + xstd < xerror. Another rule of thumb is to prune the tree so that it has the minimum xerror. You can do this automatically using the following:

pfit <- prune(fit, cp = fit$cptable[which.min(
        fit$cptable[, "xerror"]), "CP"])

pfit contains the pruned tree.
Since the tree is relatively parsimonious we retain the original tree. A summary of the tree can also be obtained by typing:

> summary(fit)
The first part of the output displays similar data to that obtained by using the printcp function. This is followed by variable importance details:

Variable importance
numchron     hosp   health
      55       37        8
We see that numchron is the most important variable, followed by hosp and then health.
The second part of the summary function gives details of the tree, with the last few lines giving details of the terminal nodes. For example, for node 5 we observe:

Node number 5: 317 observations
  events=2329, estimated rate=7.346145, mean deviance=5.810627
Technique 13
Poisson Model Based Recursive Partitioning
A Poisson model based recursive partitioning regression tree can be estimated using the package party with the mob function:

mob(z ~ x + y + z | a + b + c, data = , model = linearModel,
    family = poisson(link = log))

Key parameters include the Poisson distributed response variable of counts z, the regression covariates x, y and z, and the covariates a, b and c with which you wish to partition the tree.
Step 1: Load Required Packages
We build a Poisson decision tree using the DebTrivedi data frame contained in the MixAll package. Details of this data frame are given on page 86.

> library(party)
> library(MixAll)
> data(DebTrivedi)
Step 2: Prepare Data & Tweak Parameters
For our analysis we use all the observations in DebTrivedi to estimate a model with ofp (number of physician office visits) as the response variable and numchron (number of chronic conditions) as the Poisson regression conditioning variable. The remaining variables (hosp, health, gender, school, privins) are used as the partitioning set. The model is stored in f.
> f <- ofp ~ numchron | hosp + health + gender + school + privins
Step 3: Estimate the Decision Tree & Assess Fit
Now we are ready to fit and plot the decision tree, see Figure 13.1.

> fit <- mob(f, data = DebTrivedi, model = linearModel,
             family = poisson(link = log))
> plot(fit)
Figure 13.1: Poisson Model Based Recursive Partitioning Tree for DebTrivedi
Coefficient estimates at the leaf nodes are given by:

> round(coef(fit), 3)
   (Intercept) numchron
3        7.042    0.231
6        2.555    0.892
7        2.672    1.400
9        4.119    0.942
10       2.788    1.043
12       3.769    1.212
14       6.982    1.123
15      14.579    0.032
When comparing models it can be useful to have the value of the log likelihood function or the Akaike information criterion. These can be obtained by:

> logLik(fit)
'log Lik.' -14109.87 (df=31)
> AIC(fit)
[1] 28281.75
The results of the parameter stability tests for any given node can be retrieved using sctest. For example, to retrieve the statistics for node 2, enter:

> round(sctest(fit, node = 2), 2)
           hosp health gender school privins
statistic 13.94  36.25   7.93  24.22   16.87
p.value    0.11   0.00   0.09   0.00    0.00
Technique 14
Conditional Inference Ordinal Response Tree
The Conditional Inference Ordinal Response Tree is used when the response variable is measured on an ordinal scale. In marketing, for instance, we often see consumer satisfaction measured on an ordinal scale - "very satisfied", "satisfied", "dissatisfied" and "very dissatisfied". In medical research, constructs such as self-perceived health are often measured on an ordinal scale - "very unhealthy", "unhealthy", "healthy", "very healthy". A conditional inference ordinal response tree can be built using the package party with the ctree function:

ctree(z ~ ., data = )

Key parameters include the response variable z, an ordered factor, and data, the data set of attributes with which you wish to build the tree.
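Note that ctree only treats the response as ordinal if it is stored as an ordered factor. A minimal sketch of the required encoding (illustrative ratings, not the wine data used below):

```r
# Encode a 1-5 bitterness scale as an ordered factor
rating <- factor(c(2, 5, 3, 2, 4), levels = 1:5, ordered = TRUE)

is.ordered(rating)      # TRUE -- ctree will respect the ordering
rating[2] > rating[1]   # TRUE -- level comparisons are meaningful
```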
Step 1: Load Required Packages
We build our tree using the wine data frame contained in the ordinal package.

> library(party)
> library(ordinal)
> data(wine)
NOTE

The wine data frame was analyzed by Randall31 in an experiment on factors determining the bitterness of wine. The bitterness (rating) was measured as 1 = "least bitter" and 5 = "most bitter". Temperature and contact between juice and skins can be controlled during wine production. Two treatment factors were collected - temperature (temp) and contact (contact) - with each having two levels. Nine judges assessed wine from two bottles from each of the four treatment conditions, resulting in a total of 72 observations in all.
Step 2: Estimate and Assess the Decision Tree
We estimate the model using all of the data, with rating as the response variable. This is followed by a plot, shown in Figure 14.1, of the fitted tree.

> fit <- ctree(rating ~ temp + contact, data = wine)
> plot(fit)
Figure 14.1: Fitted Conditional Inference Ordinal Response Tree using wine
The decision tree has three leaves - Node 3 with 18 observations, Node 4 with 18 observations and Node 5 with 36 observations. The covariate temp is highly significant (p < 0.001) and contact is significant at the 5% level. Notice the terminal nodes contain the distribution of bitterness scores.

We compare the fitted values with the observed values and print out the confusion matrix and error rate. The overall misclassification rate is 56.9% for the fitted tree.

> pred <- predict(fit)
> tb <- table(wine$rating, pred, dnn = c("actual", "predicted"))
> tb
        predicted
actual   1  2  3  4  5
     1   0  5  0  0  0
     2   0 16  5  1  0
     3   0 13  8  5  0
     4   0  2  3  7  0
     5   0  0  2  5  0
> error <- 1 - (sum(diag(tb))/sum(tb))
> round(error, 3)
[1] 0.569
Decision Trees for Survival Analysis
Studies involving time to event data are numerous and arise in all areas of research. For example, survival times, or other time-to-failure related measurements such as relapse time, are major concerns when modeling medical data. The Cox proportional hazard regression model and its extensions have been the traditional tools of the data scientist for modeling survival variables with censoring. These parametric (and semi-parametric) models remain useful staples as they allow simple interpretations of the covariate effects and can readily be used for statistical inference. However, such models force a specific link between the covariates and the response. Even though interactions between covariates can be incorporated, they must be specified by the analyst.

Survival decision trees allow the data scientist to carry out their analysis without imposing a specific link function or knowing a priori the nature of variable interactions. Survival trees offer great flexibility because they can automatically detect certain types of interactions without the need to specify them beforehand. Prognostic groupings are a natural output from survival trees; this is because the basic idea of a decision tree is to partition the covariate space recursively to form groups (nodes in the tree) of subjects which are similar according to the outcome of interest.
Technique 15
Exponential Algorithm
A decision tree for survival data, where the time values are assumed to fit an exponential model32, can be built using the package rpart with the rpart function:

rpart(z ~ ., data = , method = "exp")

Key parameters include z, the survival response variable (note we set z = Surv(time, status), where Surv is a survival object constructed using the survival package); data, the data set of explanatory variables; and method = "exp", used to indicate a survival decision tree.
Step 1: Load Required Packages
We build a survival decision tree using the rhDNase data frame contained in the simexaft package. The required packages and data are loaded as follows:

> library(rpart)
> library(simexaft)
> library(survival)
> data(rhDNase)
NOTE

Respiratory disease in patients with cystic fibrosis is characterized by airway obstruction caused by the accumulation of thick purulent secretions. The viscoelasticity of these secretions can be reduced in vitro by recombinant human deoxyribonuclease I (rhDNase), a bioengineered copy of the human enzyme. The rhDNase data set contained in the simexaft package contains a subset of the original data collected by Fuchs et al.33, who performed a randomized, double-blind, placebo-controlled study on 968 adults and children with cystic fibrosis to determine the effects of once-daily and twice-daily administration of rhDNase. The patients were treated for 24 weeks as outpatients. The rhDNase data frame contains data on the occurrence and resolution of all exacerbations for 641 patients.
Step 2: Prepare Data & Tweak Parameters
The forced expiratory volume (FEV) was considered a risk factor and was measured twice at randomization (rhDNase$fev and rhDNase$fev2). We take the average of the two measurements as an explanatory variable. The response is the time from randomization to the first pulmonary exacerbation, held in the survival object Surv(time2, status).

> rhDNase$fev.ave <- (rhDNase$fev + rhDNase$fev2)/2
> z <- Surv(rhDNase$time2, rhDNase$status)
Step 3: Estimate the Decision Tree & Assess Fit

Now we are ready to fit the decision tree:
> fit <- rpart(z ~ trt + fevave, data = rhDNase, method = "exp")
To see a plot of the tree, simply type:
> plot(fit); text(fit, use.n = TRUE, cex = 0.8, all = TRUE)
TECHNIQUE 15 EXPONENTIAL ALGORITHM
Figure 15.1 shows a plot of the fitted tree. Notice that the terminal nodes report the estimated event rate as well as the number of events and observations available. For example, for the rule fevave < 80.03, the estimated event rate is 0.35, with 27 events out of a total of 177.
PRACTITIONER TIP
To see the rules that lead to a specific node, use the function path.rpart(fitted_Model, node = x). For example, to see the rules associated with node 7, enter:

> path.rpart(fit, node = 7)

node number: 7
  root
  fevave< 80.03
  fevave< 46.42
Figure 15.1: Survival Decision Tree using the rhDNase data frame
To help assess the decision tree we use the plotcp function; the result is shown in Figure 15.2. Since the tree is relatively parsimonious, there is little need to prune:

> plotcp(fit)
Figure 15.2: Complexity parameter and error for the Survival Decision Tree using the rhDNase data frame
Survival times can vary greatly between subjects. Decision tree analysis is a useful tool to homogenize the data by separating it into different subgroups based on treatments and other relevant characteristics. In other words, a single tree will group subjects according to their survival behavior based on their covariates. For a final summary of the model, it can be helpful to plot the probability of survival based on the final nodes in which the individual patients landed, as shown in Figure 15.3.
We see that node 2 appears to have the most favorable survival characteristics:

> km <- survfit(z ~ fit$where, data = rhDNase)
> plot(km, lty = 1:3, mark.time = FALSE, xlab = "Time", ylab = "Status")
> legend(150, 0.2, paste("node", c(2, 4, 5)), lty = 1:3)
Figure 15.3: Survival plot by terminal node for rhDNase
Technique 16

Conditional Inference Survival Tree
A conditional inference survival tree is a non-parametric regression tree embedding tree-structured regression models. It is essentially a decision tree, but with extra information about survival in the terminal nodes.34 It can be built using the package party with the ctree function:
ctree(z ~ ., data)
Key parameters include z, the survival response variable (note we set z = Surv(time, status), where Surv is a survival object constructed using the survival package), and data, the sample of explanatory variables.
Step 1: Load Required Packages

We build a conditional inference survival decision tree using the rhDNase data frame contained in the simexaft package. The required packages and data are loaded as follows:

> library(party)
> library(simexaft)
> library(survival)
> data(rhDNase)
Details of step 2 are given on page 100
Step 3: Estimate the Decision Tree & Assess Fit

Next we fit and plot the decision tree. Figure 16.1 shows the resultant tree:
> fit <- ctree(z ~ trt + fevave, data = rhDNase)
> plot(fit)
Figure 16.1: Conditional Inference Survival Tree for rhDNase
Notice that the internal nodes report the p-value for the split, whilst the leaf nodes give the number of subjects and a plot of the estimated survival curve. We can obtain a summary of the tree by typing:

> print(fit)
Conditional inference tree with 4 terminal nodes
Response:  z
Inputs:  trt, fevave
Number of observations:  641

1) fevave <= 79.95; criterion = 1, statistic = 58.152
  2) trt <= 0; criterion = 0.996, statistic = 9.34
    3)*  weights = 232
  2) trt > 0
    4) fevave <= 45.3; criterion = 0.957, statistic = 5.254
      5)*  weights = 105
    4) fevave > 45.3
      6)*  weights = 127
1) fevave > 79.95
  7)*  weights = 177
Node 7 is a terminal or leaf node (the symbol "*" signifies this), with the decision rule fevave > 79.95. At this node there are 177 observations.
We grab the fitted responses using the function treeresponse and store them in stree. Notice that every stree component is a survival object of class survfit:

> stree <- treeresponse(fit)
> class(stree[[2]])
[1] "survfit"
> class(stree[[7]])
[1] "survfit"
For this particular tree we have four terminal nodes, so there are only four unique survival objects. You can use the where method to see which nodes the observations are in:

> subjects <- where(fit)
> table(subjects)
subjects
  3   5   6   7
232 105 127 177
So we have 232 subjects in node 3 and 127 subjects in node 6, which agree with the numbers reported in Figure 16.1.
We end our initial analysis by plotting, in Figure 16.2, the survival curve for node 3 with a 95% confidence interval; we also mark on the plot the times of individual subject events:

> plot(stree[[3]], conf.int = TRUE, mark.time = TRUE,
    ylab = "Cumulative Survival (%)", xlab = "Days Elapsed")
Figure 16.2: Survival curve for node 3
Notes

1. See, for example, the top ten list of Wu, Xindong, et al. "Top 10 algorithms in data mining." Knowledge and Information Systems 14.1 (2008): 1-37.
2. Koch Y, Wolf T, Sorger PK, Eils R, Brors B (2013). "Decision-Tree Based Model Analysis for Efficient Identification of Parameter Relations Leading to Different Signaling States." PLoS ONE 8(12): e82593. doi:10.1371/journal.pone.0082593
3. Ebenezer Hailemariam, Rhys Goldstein, Ramtin Attar & Azam Khan (2011). "Real-Time Occupancy Detection using Decision Trees with Multiple Sensor Types." SimAUD 2011 Conference Proceedings, Symposium on Simulation for Architecture and Urban Design.
4. Taken from Stiglic G, Kocbek S, Pernek I, Kokol P (2012). "Comprehensive Decision Tree Models in Bioinformatics." PLoS ONE 7(3): e33812. doi:10.1371/journal.pone.0033812
5. Bohanec M, Bratko I (1994). "Trading accuracy for simplicity in decision trees." Machine Learning 15: 223-250.
6. See, for example, J. R. Quinlan, "Induction of decision trees," Mach. Learn., vol. 1, no. 1, pp. 81-106, 1986.
7. See J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1993.
8. See Breiman, Leo, et al. Classification and Regression Trees. CRC Press, 1984.
9. See R. L. De Mántaras, "A distance-based attribute selection measure for decision tree induction," Mach. Learn., vol. 6, no. 1, pp. 81-92, 1991.
10. Zhang, Ting, et al. "Using decision trees to measure activities in people with stroke." Engineering in Medicine and Biology Society (EMBC), 2013 35th Annual International Conference of the IEEE. IEEE, 2013.
11. See, for example, J. Liu, M. A. Valencia-Sanchez, G. J. Hannon, and R. Parker, "MicroRNA-dependent localization of targeted mRNAs to mammalian P-bodies," Nature Cell Biology, vol. 7, no. 7, pp. 719-723, 2005.
12. Williams, Philip H., Rod Eyles, and George Weiller. "Plant microRNA prediction by supervised machine learning using C5.0 decision trees." Journal of Nucleic Acids 2012 (2012).
13. Nakayama, Nobuaki, et al. "Algorithm to determine the outcome of patients with acute liver failure: a data-mining analysis using decision trees." Journal of Gastroenterology 47.6 (2012): 664-677.
14. Hepatic encephalopathy, also known as portosystemic encephalopathy, is the loss of brain function (evident in confusion, altered level of consciousness and coma) as a result of liver failure.
15. de Oña, Juan, Griselda López, and Joaquín Abellán. "Extracting decision rules from police accident reports through decision trees." Accident Analysis & Prevention 50 (2013): 1151-1160.
16. See, for example, Kashani A, Mohaymany A (2011). "Analysis of the traffic injury severity on two-lane, two-way rural roads based on classification tree models." Safety Science 49: 1314-1320.
17. Monedero, Iñigo, et al. "Detection of frauds and other non-technical losses in a power utility using Pearson coefficient, Bayesian networks and decision trees." International Journal of Electrical Power & Energy Systems 34.1 (2012): 90-98.
18. Maximum and minimum value, monthly consumption, number of meter readings, number of hours of maximum power consumption, and three variables to measure abnormal consumption.
19. Wang, Quan, et al. "Tracking tetrahymena pyriformis cells using decision trees." Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE, 2012.
20. This data set comes from the Turing Institute, Glasgow, Scotland.
21. For further details see http://www.rulequest.com/see5-comparison.html
22. For further details see:
   1. Hothorn, Torsten, Kurt Hornik, and Achim Zeileis. "ctree: Conditional Inference Trees."
   2. Hothorn, Torsten, Kurt Hornik, and Achim Zeileis. "Unbiased recursive partitioning: A conditional inference framework." Journal of Computational and Graphical Statistics 15.3 (2006): 651-674.
   3. Hothorn, Torsten, et al. "party: A laboratory for recursive partytioning." (2010).
23. http://www.niddk.nih.gov
24. Garcia, Ada L., et al. "Improved prediction of body fat by measuring skinfold thickness, circumferences, and bone breadths." Obesity Research 13.3 (2005): 626-634.
25. The reported p-value at a node is equal to 1 - criterion.
26. Oates T, Jensen D (1997). "The effects of training set size on decision tree complexity." In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 254-262.
27. John GH (1995). "Robust decision trees: removing outliers from databases." In Proceedings of the First Conference on Knowledge Discovery and Data Mining, pp. 174-179.
28. Brodley CE, Friedl MA (1999). "Identifying mislabeled training data." J Artif Intell Res 11: 131-167.
29. Cano JR, Herrera F, Lozano M (2007). "Evolutionary stratified training set selection for extracting classification rules with trade off precision-interpretability." Data & Knowledge Engineering 60(1): 90-108.
30. Deb, Partha, and Pravin K. Trivedi. "Demand for medical care by the elderly: a finite mixture approach." Journal of Applied Econometrics 12.3 (1997): 313-336.
31. See Randall J (1989). "The analysis of sensory data by generalised linear model." Biometrical Journal 7, pp. 781-793.
32. See Atkinson, Elizabeth J., and Terry M. Therneau. "An introduction to recursive partitioning using the RPART routines." Rochester: Mayo Foundation (2000).
33. See Henry J. Fuchs, Drucy S. Borowitz, David H. Christiansen, Edward M. Morris, Martha L. Nash, Bonnie W. Ramsey, Beryl J. Rosenstein, Arnold L. Smith, and Mary Ellen Wohl, for the Pulmozyme Study Group. N Engl J Med 1994; 331: 637-642, September 8, 1994.
34. For further details see:
   1. Hothorn, Torsten, Kurt Hornik, and Achim Zeileis. "ctree: Conditional Inference Trees."
   2. Hothorn, Torsten, Kurt Hornik, and Achim Zeileis. "Unbiased recursive partitioning: A conditional inference framework." Journal of Computational and Graphical Statistics 15.3 (2006): 651-674.
   3. Hothorn, Torsten, et al. "party: A laboratory for recursive partytioning." (2010).
Part II
Support Vector Machines
The Basic Idea
The support vector machine (SVM) is a supervised machine learning algorithm35 that can be used for both regression and classification. Let's take a quick look at how SVM performs classification. The core idea of SVMs is that they construct hyperplanes in a multidimensional space that separate objects which belong to different classes. A decision plane is then used to define the boundaries between different classes. Figure 16.3 visualizes this idea.
The decision plane separates a set of observations into their respective classes using a straight line. In this example the observations belong either to class "solid circle" or class "edged circle". The separating line defines a boundary, on the right side of which all objects are "solid circle" and to the left of which all objects are "edged circle".
Figure 16.3: Schematic illustration of a decision plane determined by a linear classifier
In practice, as illustrated in Figure 16.4, the majority of classification problems require nonlinear boundaries in order to determine the optimal separation.36 Figure 16.5 illustrates how SVMs solve this problem. The left side of the figure represents the original sample (known as the input space), which is mapped, using a set of mathematical functions known as kernels, to the feature space. The process of rearranging the objects for optimal separation is known as transformation. Notice that in mapping from the input space to the feature space the mapped objects become linearly separable. Thus, although the SVM uses linear learning methods, due to its nonlinear kernel function it is in effect a nonlinear classifier.
NOTE

Since intuition is better built from examples that are easy to imagine, lines and points are drawn in the Cartesian plane instead of hyperplanes and vectors in a high dimensional space. Remember that the same concepts apply where the examples to be classified lie in a space whose dimension is higher than two.
Figure 16.4: Nonlinear boundary required for correct classification
Figure 16.5: Mapping from input to feature space in an SVM
Overview of SVM Implementation

The SVM finds the decision hyperplane leaving the largest possible fraction of points of the same class on the same side, while maximizing the distance of either class from the hyperplane. This minimizes the risk of misclassifying not only the examples in the training data set, but also the yet-to-be-seen examples of the test set.
To construct an optimal hyperplane, SVM employs an iterative training algorithm which is used to minimize an error function.
Let's take a look at how this is achieved. Given a set of feature vectors $x_i$ $(i = 1, 2, \ldots, N)$, a target $y_i \in \{-1, +1\}$ with corresponding binary labels is associated with each feature vector $x_i$. The decision function for classification of unseen examples is given as

\[
y = f(x, \alpha) = \operatorname{sign}\left( \sum_{i=1}^{N_s} \alpha_i y_i K(s_i, x) + b \right) \tag{16.1}
\]

where $s_i$ are the $N_s$ support vectors and $K(s_i, x)$ is the kernel function. The parameters are determined by maximizing the margin of the hyperplane (see Figure 16.6):

\[
\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i \cdot x_j) \tag{16.2}
\]

subject to the constraints

\[
\sum_{i=1}^{N} \alpha_i y_i = 0 \quad \text{and} \quad 0 \le \alpha_i \le C. \tag{16.3}
\]
To build an SVM classifier, the user needs to tune the cost parameter C, choose a kernel function, and optimize its parameters.
PRACTITIONER TIP

I once worked for an economist who was trained (essentially) in one statistical technique: linear regression and its variants. Whenever there was an empirical issue, this individual always tried to frame it in terms of his understanding of linear models and economics. Needless to say, this archaic approach to modeling led to all sorts of difficulties. The data scientist is pragmatic in their modeling approach: linear, non-linear, Bayesian, boosting; they are guided by statistical theory and machine learning insights, and unshackled from the vagueness of economic theory.
Note on the Slack Parameter

The variable C, known as the slack parameter, serves as the cost parameter that controls the trade-off between the margin and classification error. If no slack is allowed (often known as a hard margin) and the data are linearly separable, the support vectors are the points which lie along the supporting hyperplanes, as shown in Figure 16.6. In this case all of the support vectors lie exactly on the margin.
Figure 16.6: Support vectors for linearly separable data and a hard margin
In many situations this will not yield useful results, and a soft margin will be required. In this circumstance some proportion of data points are allowed to remain inside the margin. The slack parameter C is used to control this proportion.37 A soft margin results in a wider margin and greater error on the training data set; however, it improves the generalization of the data and reduces the likelihood of over fitting.
NOTE

The total number of support vectors depends on the amount of allowed slack and the distribution of the data. If a large amount of slack is permitted, there will be a larger number of support vectors than in the case where very little slack is permitted. Fewer support vectors means faster classification of test points. This is because the computational complexity of the SVM is linear in the number of support vectors.
Practical Applications
NOTE
The kappa coefficient38 is a measure of agreement between categorical variables. It is similar to the correlation coefficient in that higher values indicate greater agreement. It is calculated as

\[
\kappa = \frac{P_o - P_e}{1 - P_e} \tag{16.4}
\]

where $P_o$ is the observed proportion correctly classified and $P_e$ is the proportion correctly classified by chance.
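The calculation can be made concrete with a few lines of base R. This is a sketch only; the function name and the 2x2 confusion matrix below are invented for illustration.

```r
# Cohen's kappa from a confusion matrix (rows = predicted, cols = observed)
kappa_coef <- function(tab) {
  n  <- sum(tab)
  po <- sum(diag(tab)) / n                      # observed agreement
  pe <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
  (po - pe) / (1 - pe)
}

# Illustrative confusion matrix: 85 of 100 cases correctly classified
tab <- matrix(c(40, 10, 5, 45), nrow = 2)
kappa_coef(tab)   # 0.7
```

Here the observed agreement is 0.85 and the chance agreement 0.5, so kappa is (0.85 - 0.5)/(1 - 0.5) = 0.7.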
Classification of Dengue Fever Patients

The Dengue virus is a mosquito-borne pathogen that infects millions of people every year. Gomes et al.39 use the support vector machine algorithm to classify 28 dengue patients from the Recife metropolitan area, Brazil (13 with dengue fever (DF) and 15 with dengue haemorrhagic fever (DHF)), based on mRNA expression data of 11 genes involved in the innate immune response pathway (MYD88, MDA5, TLR3, TLR7, TLR9, IRF3, IRF7, IFN-alpha, IFN-beta, IFN-gamma and RIGI).
A radial basis function is used and the model built using leave-one-out cross-validation, repeated fifteen times under different conditions, to analyze the individual and collective contributions of each gene's expression data to DF/DHF classification.

A different gene was removed during each of the first twelve cross-validations. During the last three cross-validations multiple genes were removed. Figure 16.7 shows the overall accuracy of the support vector machine for differing values of its parameter C.
Figure 16.7: SVM optimization. Optimization of the parameters C and γ of the SVM kernel RBF. Source: Gomes et al., doi:10.1371/journal.pone.0011267.g003
PRACTITIONER TIP

To transform the gene expression data to a suitable format for support vector machine training and testing, Gomes et al. designate each gene as either "10" (for observation of up-regulation) or "01" (for observation of down-regulation). Therefore, the collective gene expressions observed in each patient were represented by a 24-dimension vector (12 genes × 2 gene states, up- or down-regulated). Each of the 24-dimension vectors was labeled as either "1" for DF patients or "-1" for DHF patients. Notice this is a different classification structure than that used in traditional statistical modeling of binary variables. Typically, binary observations are measured by statisticians using 0 and 1. Be sure you have the correct classification structure when moving between traditional statistical models and those developed out of machine learning.
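This style of encoding is easy to reproduce in base R. The sketch below is illustrative only: the function name and the three-gene example are hypothetical, not from Gomes et al.'s data.

```r
# Encode each gene as "10" (up-regulated) or "01" (down-regulated),
# producing a vector of length 2 * number-of-genes per patient
encode_genes <- function(up) {   # up: logical vector, TRUE = up-regulated
  as.numeric(rbind(up, !up))     # interleave the (up, down) bit pairs
}

# Three illustrative genes: up, down, up
encode_genes(c(TRUE, FALSE, TRUE))   # 1 0 0 1 1 0
```

Each gene contributes exactly one "1" and one "0", so the encoding makes the up and down states symmetric inputs, rather than collapsing them to a single 0/1 indicator.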
Forecasting Stock Market Direction

Huang et al.40 use a support vector machine to predict the direction of weekly changes in the NIKKEI 225 Japanese stock market index. The index is composed of 225 stocks of the largest Japanese publicly traded companies.
Two independent variables are selected as inputs to the model: weekly changes in the S&P 500 index and weekly changes in the US dollar to Japanese yen exchange rate.
Data was collected from January 1990 to December 2002, yielding a total sample size of 676 observations. The researchers use 640 observations to train their support vector machine and perform an out-of-sample evaluation on the remaining 36 observations.
As a benchmark, the researchers compare the performance of their model to four other models: a random walk, linear discriminant analysis, quadratic discriminant analysis, and a neural network.
The random walk correctly predicts the direction of the stock market 50% of the time; linear discriminant analysis, 55%; quadratic discriminant analysis and the neural network, 69%; and the support vector machine, 73%.

The researchers also observe that an information-weighted combination of the models correctly predicts the direction of the stock market 75% of the time.
Bankruptcy Prediction

Min et al.41 develop a support vector machine to predict bankruptcy and compare its performance to a neural network, logistic regression, and multiple discriminant analysis.

Data on 1,888 firms is collected from Korea's largest credit guarantee organization. The data set contains 944 bankrupt and 944 surviving firms. The attribute set consisted of 38 popular financial ratios and was reduced by principal component analysis to two "fundamental" factors.

The training set consists of 80% of the observations, with the remaining 20% of observations used for the hold-out test sample. A radial basis function is used for the kernel. Its parameters are optimized using a grid search procedure and 5-fold cross-validation.

In rank order (for the hold-out data), the support vector machine had a prediction accuracy of 83%, the neural network 82.5%, multiple discriminant analysis 79.1%, and the logistic regression 78.3%.
Early Onset Breast Cancer

Breast cancer is often classified according to the number of estrogen receptors present on the tumor. Tumors with a large number of receptors are termed estrogen receptor positive (ER+), and those with few or no receptors estrogen receptor negative (ER-). ER status is important because ER+ cancers grow under the influence of estrogen and may respond well to hormone suppression treatments. This is not the case for ER- cancers, as they do not respond to hormone suppression treatments.

Upstill-Goddard et al.42 investigate whether patients who develop ER+ and ER- tumors show distinct constitutional genetic profiles, using genetic single nucleotide polymorphism data. At the core of their analysis were support vector machines with linear, normalized quadratic polynomial, quadratic polynomial, cubic and radial basis kernels. The researchers opt for 10-fold cross-validation.

All five kernel models had an accuracy rate in excess of 93%; see Table 11.
Kernel Type                        Correctly Classified (%)
Linear                             93.28 ± 3.07
Normalized quadratic polynomial    93.69 ± 2.69
Quadratic polynomial               93.89 ± 3.06
Cubic polynomial                   94.64 ± 2.94
Radial basis function              95.95 ± 2.61

Table 11: Upstill-Goddard et al.'s kernels and classification results
Flood Susceptibility

Tehrany et al.43 evaluate support vector machines with different kernel functions for spatial prediction of flood occurrence in the Kuala Terengganu basin, Malaysia. Model attributes were constructed using ten geographic factors: altitude, slope, curvature, stream power index (SPI), topographic wetness index (TWI), distance from the river, geology, land use/cover (LULC), soil and surface runoff.

Four kernels, linear (LN), polynomial (PL), radial basis function (RBF) and sigmoid (SIG), were used to assess factor importance. This was achieved by eliminating each factor and then measuring the Cohen's kappa index of the model. The overall rank of each factor44 is shown in Table 12. Overall, slope was the most important factor, followed by distance from river and then altitude.
Factor      Average   Rank
Altitude    0.265     3
Slope       0.288     1
Curvature   0.225     6
SPI         0.215     8
TWI         0.235     4
Distance    0.268     2
Geology     0.223     7
LULC        0.215     8
Soil        0.228     5
Runoff      0.140     10

Table 12: Variable importance calculated from data reported in Tehrany et al.
PRACTITIONER TIP

Notice how both Upstill-Goddard et al. and Tehrany et al. use multiple kernels in developing their models. This is always a good strategy, because it is not always obvious which kernel is optimal at the onset of a research project.
Signature Authentication

Radhika et al.45 consider the problem of automatic signature authentication using a variety of algorithms, including a support vector machine (SVM). The other algorithms considered included a Bayes classifier (BC), fast Fourier transform (FT), linear discriminant analysis (LD) and principal component analysis (PCA).

Their experiment used a signature database containing 75 subjects, with 15 genuine samples and 15 forged samples for each subject. Features were extracted from images drawn from the database and used as inputs to train and test the various methods.

The researchers report a false rejection rate of 8% for SVM, 13% for FT, 10% for BC, 11% for PCA and 12% for LD.
Prediction of Vitamin D Status

The accepted biomarker of vitamin D status is serum 25-hydroxyvitamin D (25(OH)D) concentration. Unfortunately, in large epidemiological studies direct measurement is often not feasible. However, useful proxies for 25(OH)D are available by using questionnaire data.

Guo et al.46 develop a support vector machine to predict serum 25(OH)D concentration in large epidemiological studies using questionnaire data. A total of 494 participants were recruited onto the study and asked to complete a questionnaire which included sun exposure and sun protection behaviors, physical activity, smoking history, diet and the use of supplements. Skin types were defined by spectrophotometric measurements of skin reflectance, used to calculate melanin density for exposed skin sites (dorsum of hand, shoulder) and non-exposed skin sites (upper inner arm, buttock).
A multiple regression model (MLR) estimated using 12 explanatory variables47 was used to benchmark the support vector machine. The researchers selected a radial basis function for the kernel, with identical explanatory factors used in the MLR. The data were randomly assigned to a training sample (n = 294) and a validation sample (n = 174).

The researchers report a correlation of 0.74 between predicted scores and measured 25(OH)D concentration for the support vector machine. They also note that it performed better than MLR in correctly identifying individuals with vitamin D deficiency. Overall they conclude: "RBF SVR [radial basis function support vector machine] method has considerable promise for the prediction of vitamin D status for use in chronic disease epidemiology and potentially other situations."
PRACTITIONER TIP

The performance of the SVM is very closely tied to the choice of the kernel function. There exist many popular kernel functions that have been widely used for classification, e.g. linear, Gaussian radial basis function, polynomial, and so on. Data scientists can spend a considerable amount of time tweaking the parameters of a specified kernel function via trial-and-error. Here are four general approaches that can speed up the process:

1. Cross-validation.48

2. Multiple kernel learning,49 which attempts to construct a generalized kernel function so as to solve all classification problems by combining different types of standard kernel functions.

3. Evolution & particle swarm optimization: Thadani et al.50 use gene expression programming algorithms to evolve the kernel function of an SVM. An analogous approach has been proposed using particle swarm optimization.51

4. Automatic kernel selection using the C5.0 algorithm, which attempts to select the optimal kernel function based on the statistical data characteristics and distribution information.
Support Vector Classification
Technique 17

Binary Response Classification with C-SVM
A C-SVM for binary response classification can be estimated using the package svmpath with the svmpath function:
svmpath(x, y, kernel.function)
Key parameters include the response variable y, coded as (-1, +1); the covariates x; and the kernel, specified via kernel.function.
PRACTITIONER TIP

Although there are an ever growing number of kernels, four workhorses of applied research are:

• Linear: $K(x_i, x_j) = x_i^T x_j$

• Polynomial: $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d$, $\gamma > 0$

• Radial basis function: $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, $\gamma > 0$

• Sigmoid: $K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)$

Here $\gamma$, $r$ and $d$ are kernel parameters.
Step 1: Load Required Packages

We build the C-SVM for binary response classification using the data frame PimaIndiansDiabetes2 contained in the mlbench package. For additional details on this data set see page 52.

> data(PimaIndiansDiabetes2, package = "mlbench")
> require(svmpath)
Step 2: Prepare Data & Tweak Parameters

The PimaIndiansDiabetes2 data frame has a large number of missing values (recorded as NA), particularly for the attributes insulin and triceps. We remove these two attributes from the sample and use the na.omit method to remove any remaining observations with missing values. The cleaned data is stored in temp:

> temp <- PimaIndiansDiabetes2
> temp$insulin <- NULL
> temp$triceps <- NULL
> temp <- na.omit(temp)
The response diabetes is a factor containing the labels "pos" and "neg". However, the svmpath method requires the response to be numeric, taking the values -1 or +1. The following converts diabetes into a format usable by svmpath:

> y <- temp$diabetes
> levels(y) <- c(-1, 1)
> y <- as.numeric(as.character(y))
> y <- as.matrix(y)
Support vector machine kernels generally depend on the inner product of attribute vectors. Therefore, very large values might cause numerical problems. We standardize the attributes using the scale method; the results are stored in the matrix x:

> x <- temp
> x$diabetes <- NULL    # drop the response variable
> x <- scale(x)
We use nrow to count the remaining observations (as a check: it should equal 724). We then set the training sample to select, at random and without replacement, 600 observations. The remaining 124 observations form the test sample:
> set.seed(103)
> n <- nrow(x)
> train <- sample(1:n, 600, FALSE)
Step 3: Estimate & Evaluate Model

The svmpath function can use two popular kernels: the polynomial and the radial basis function. We will assess both, beginning with the polynomial kernel. It is selected by setting kernel.function = poly.kernel. We also set trace = FALSE to prevent svmpath from printing results to the screen at each iteration:

> fit <- svmpath(x[train,], y[train,],
    kernel.function = poly.kernel, trace = FALSE)
A nice feature of svmpath is that it computes the entire regularization path for the SVM cost parameter, along with the associated classification error. We use the with method to identify the minimum error:

> with(fit, Error[Error == min(Error)])
 [1] 140 140 140 140 140 140 140 140 140 140 140 140
[13] 140 140
Two things are worth noting here. First, the number of misclassified observations is 140 out of the 600 observations, which is approximately 23%. Second, each observation of 140 is associated with a unique regularization value. Since the regularization value is the penalty parameter of the error term (and svmpath reports the inverse of this parameter), we will select for our model the minimum value and store it in lambda. This is achieved via the following steps:

1. Store the minimum error values in error.

2. Grab the row numbers of these minimum errors using the which method.

3. Obtain the regularization parameter values associated with the minimum errors and store them in temp_lamdba.

4. Identify which value in temp_lamdba has the minimum value, store it in lambda, and print it to the screen.
> error <- with(fit, Error[Error == min(Error)])
> min_err_row <- which(fit$Error == min(fit$Error))
> temp_lamdba <- fit$lambda[min_err_row]
> loc <- which(fit$lambda[min_err_row] ==
    min(fit$lambda[min_err_row]))
> lambda <- temp_lamdba[loc]
> lambda
[1] 7.383352
The method svmpath actually reports the inverse of the kernel regularization parameter (often, and somewhat confusingly, called gamma in the literature). We obtain a value of 7.383352, which corresponds to a gamma of 1/7.383352 = 0.135.
Next we follow the same procedure for the radial basis function kernel. In this case the estimated regularization parameter is stored in lambdaR:

> fitR <- svmpath(x[train,], y[train,],
    kernel.function = radial.kernel, trace = FALSE)
> error <- with(fitR, Error[Error == min(Error)])
> min_err_row <- which(fitR$Error == min(fitR$Error))
> temp_lamdba <- fitR$lambda[min_err_row]
> loc <- which(fitR$lambda[min_err_row] ==
    min(fitR$lambda[min_err_row]))
> lambdaR <- temp_lamdba[loc]
> lambdaR
[1] 0.09738556
> error[1]/600
[1] 0.015
Two things are noteworthy about this result. First, the regularization parameter is estimated as 1/0.09738556 = 10.268. Second, the error is estimated at only 1.5%. This is very likely an indication that the model has been over fit.
PRACTITIONER TIP

It often helps intuition to visualize data. Let's use a few lines of code to estimate a simple model and visualize it. We will fit a sub-set of the model we are already working on, using the first twelve patient observations on the attributes glucose and age, storing the result in xx. We standardize xx using the scale method. Then we grab the first twelve observations of the response variable diabetes:

> xx <- cbind(temp$glucose[1:12], temp$age[1:12])
> xx <- scale(xx)
> yy <- y[1:12]

Next we use the method svmpath to estimate the model and use plot to show the resultant data points. We use step = 1 to show the results at the first value of the regularization parameter and step = 8 to show the results at the last step. The dotted line in Figure 17.1 represents the margin; notice it gets narrower from step 1 to step 8 as the cost parameter increases. The support vectors are represented by the open dots. Notice that the number of support vectors increases as we move from step 1 to step 8. Also notice that at step 8 only one point is misclassified (point 7):

> example <- svmpath(xx, yy, trace = TRUE, plot = FALSE)
> par(mfrow = c(1, 2))
> plot(example, xlab = "glucose", ylab = "age", step = 1)
> plot(example, xlab = "glucose", ylab = "age", step = 8)
Figure 17.1: Plot of support vectors using svmpath with the response variable diabetes and attributes glucose & age
Step 4 Make Predictions
NOTE

Automated choice of kernel regularization parameters is challenging. This is because it is extremely easy to overfit an SVM model on the validation sample if you only consider the misclassification rate. The consequence is that you end up with a model that is not generalizable to the test data, and/or a model that performs considerably worse than the discarded models with higher test sample error rates.52
Although we suspect the radial basis kernel results in an overfit, we will compare its predictions to those of the optimal polynomial kernel. First we use the test data and the radial basis kernel via the predict method. The confusion matrix is printed using the table method. The error rate is then calculated; it is 35%, considerably higher than the 1.5% indicated during validation:

> pred <- predict(fitR, newx = x[-train, ], lambda = lambdaR,
    type = "class")
> table(pred, y[-train], dnn = c("Predicted", "Observed"))
         Observed
Predicted -1  1
       -1 65 27
        1 16 16
> error_rate = (1 - sum(pred == y[-train]) / 124)
> round(error_rate, 2)
[1] 0.35
PRACTITIONER TIP

Degradation in performance due to over-fitting can be surprisingly large. The key is to remember that the primary goal of the validation sample is to provide a reliable indication of the expected error on the test sample and on future, as yet unseen, samples. Throughout this book we have used the set.seed() method to help ensure replicability of the results. However, given the stochastic nature of the validation-test sample split, we should always expect variation in performance across different realisations. This suggests that evaluation should always involve multiple partitions of the data to form the training, validation and test sets, as the sampling of data for a single partition might arbitrarily favour one classifier over another.
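The advice above can be sketched in a few lines of R. This is an illustrative outline only: the objects dat (a data frame with a response column y) and fit_fun (any model-fitting function) are hypothetical stand-ins, not objects from the worked example.

```r
# Sketch: average the test error over several random partitions.
# dat and fit_fun are hypothetical stand-ins for your data and model.
n_splits <- 10
errors <- numeric(n_splits)
for (k in 1:n_splits) {
  set.seed(k)                                  # a different split each run
  idx <- sample(1:nrow(dat), round(0.7 * nrow(dat)))
  model <- fit_fun(dat[idx, ])                 # fit on the training portion
  pred <- predict(model, dat[-idx, ])
  errors[k] <- mean(pred != dat$y[-idx])       # misclassification rate
}
mean(errors)   # expected error, averaged over partitions
sd(errors)     # variation across realisations
```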
Let's now look at the polynomial kernel model:

> pred <- predict(fit, newx = x[-train, ], lambda = lambda,
    type = "class")
> table(pred, y[-train], dnn = c("Predicted", "Observed"))
         Observed
Predicted -1  1
       -1 76 13
        1  5 30
> error_rate = (1 - sum(pred == y[-train]) / 124)
> round(error_rate, 2)
[1] 0.15

It seems the error rate for this choice of kernel is around 15%. This is less than half the error of the radial basis kernel SVM.
Technique 18
Multicategory Classification with C-SVM
A C-SVM for multicategory response classification can be estimated using the package e1071 with the svm function:

svm(y ~ ., data, kernel, cost, ...)

Key parameters include kernel - the kernel function, cost - the cost parameter, the multicategory response variable y, and the covariates data.

Step 1 Load Required Packages
We build the C-SVM for multicategory response classification using the data frame Vehicle contained in the mlbench package. For additional details on this sample see page 23:

> require(e1071)
> library(mlbench)
> data(Vehicle)

Step 2 Prepare Data & Tweak Parameters
We use 500 out of the 846 observations in the training sample, with the remainder saved for testing. The variable train selects a random sample of 500 observations without replacement:

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)
Step 3 Estimate & Evaluate Model
We begin by estimating the support vector machine using the default settings. This will use a radial basis function as the kernel, with the cost parameter set equal to 1:

> fit <- svm(Class ~ ., data = Vehicle[train, ])

The summary function provides details of the estimated model:

> summary(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ])

Parameters:
   SVM-Type: C-classification
 SVM-Kernel: radial
       cost: 1
      gamma: 0.05555556

Number of Support Vectors: 376

 ( 122 64 118 72 )

Number of Classes: 4

Levels:
 bus opel saab van

The function provides details of the type of support vector machine (C-classification), the kernel, the cost parameter and the gamma model parameter. Notice the model estimates 376 support vectors, with 122 in the first class (bus) and 72 in the fourth class (van).

A nice feature of the e1071 package is that it contains tune.svm, a support vector machine tuning function. We use the method to identify the best model for gamma ranging between 0.25 and 4 and cost between 4 and 16. We store the results in obj:

> obj <- tune.svm(Class ~ ., data = Vehicle[train, ],
    gamma = 2^(-2:2), cost = 2^(2:4))
Once again we can use the summary function to see the result:

> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost
  0.25   16

- best performance: 0.254

The method automatically performs a 10-fold cross validation. It appears the best model has a gamma of 0.25 and a cost parameter equal to 16, with a 25.4% misclassification rate.
PRACTITIONER TIP

Greater control over model tuning can be achieved using the tunecontrol argument of tune.svm. For example, to perform a 20-fold cross validation you would use:

tunecontrol = tune.control(sampling = "cross", cross = 20)

Your call using tune.svm would look something like this:

obj <- tune.svm(Class ~ ., data = Vehicle[train, ],
    gamma = 2^(-2:2), cost = 2^(2:4),
    tunecontrol = tune.control(sampling = "cross", cross = 20))

Visualization of the output of tune.svm using plot is often useful for fine tuning:

> plot(obj)

Figure 18.1 illustrates the resultant plot. To interpret the image, note that the darker the shading, the better the fit of the model. Two things are noteworthy about this image. First, a larger cost parameter seems to indicate a better fit. Second, a gamma of 0.5 or less also indicates a better fitting model.
Using this information we re-tune the model with gamma ranging from 0.01 to 0.5 and cost ranging from 16 to 256. Now the best performance occurs with gamma set to 0.03 and cost equal to 32:

> obj <- tune.svm(Class ~ ., data = Vehicle[train, ],
    gamma = seq(0.01, 0.5, by = 0.01), cost = 2^(4:8))
> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost
  0.03   32

- best performance: 0.17
Figure 18.1: Tuning Multicategory Classification with C-SVM using the Vehicle data set
We store the results for the optimal model in the objects bestC and bestGamma and refit the model:

> bestC <- obj$best.parameters[[2]]
> bestGamma <- obj$best.parameters[[1]]
> fit <- svm(Class ~ ., data = Vehicle[train, ],
    cost = bestC, gamma = bestGamma, cross = 10)

Details of the fit can be viewed using the print method:

> print(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ],
    cost = bestC, gamma = bestGamma, cross = 10)

Parameters:
   SVM-Type: C-classification
 SVM-Kernel: radial
       cost: 32
      gamma: 0.03

Number of Support Vectors: 262
The fitted model now has 262 support vectors, down from 376 for the original model.

The summary method provides additional details. Reported are the support vectors by classification and the results from the 10-fold cross validation. Overall the model has a total accuracy of 81%:

> summary(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ],
    cost = bestC, gamma = bestGamma, cross = 10)

Parameters:
   SVM-Type: C-classification
 SVM-Kernel: radial
       cost: 32
      gamma: 0.03

Number of Support Vectors: 262

 ( 100 32 94 36 )

Number of Classes: 4

Levels:
 bus opel saab van

10-fold cross-validation on training data:

Total Accuracy: 81
Single Accuracies:
 84 86 88 76 82 78 76 84 82 74

It can be fun to visualize a two dimensional projection of the fitted data. To do this we use the plot function with the variables Elong and Max.L.Ra for the x and y axes, whilst holding all the other variables in Vehicle at their median values. Figure 18.2 shows the resultant plot:

> plot(fit, Vehicle[train, ], Elong ~ Max.L.Ra,
    svSymbol = "v",
    slice = list(Comp = median(Vehicle$Comp),
        Circ = median(Vehicle$Circ),
        D.Circ = median(Vehicle$D.Circ),
        Rad.Ra = median(Vehicle$Rad.Ra),
        Pr.Axis.Ra = median(Vehicle$Pr.Axis.Ra),
        Scat.Ra = median(Vehicle$Scat.Ra),
        Pr.Axis.Rect = median(Vehicle$Pr.Axis.Rect),
        Max.L.Rect = median(Vehicle$Max.L.Rect),
        Sc.Var.Maxis = median(Vehicle$Sc.Var.Maxis),
        Sc.Var.maxis = median(Vehicle$Sc.Var.maxis),
        Ra.Gyr = median(Vehicle$Ra.Gyr),
        Skew.Maxis = median(Vehicle$Skew.Maxis),
        Skew.maxis = median(Vehicle$Skew.maxis),
        Kurt.maxis = median(Vehicle$Kurt.maxis),
        Kurt.Maxis = median(Vehicle$Kurt.Maxis),
        Holl.Ra = median(Vehicle$Holl.Ra)))
Figure 18.2: Multicategory Classification with C-SVM: two dimensional projection of Vehicle
Step 4 Make Predictions
The predict method with the test data and the fitted model fit are used as follows:

> pred <- predict(fit, Vehicle[-train, ])

Classification results alongside the actual observed values can be obtained using the table method:

> table(Vehicle$Class[-train], pred,
    dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   87    2    0   5
          opel   0   55   28   2
          saab   0   20   68   0
          van    1    1    1  76

The misclassification error rate can then be calculated. Overall the model achieves an error rate of 17.3% on the test sample:

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.173
Technique 19
Multicategory Classification with nu-SVM
A nu-SVM for multicategory response classification can be estimated using the package e1071 with the svm function:

svm(y ~ ., data, kernel, type = "nu-classification")

Key parameters include kernel - the kernel function, type set to "nu-classification", the multicategory response variable y, and the covariates data.

Step 3 Estimate & Evaluate Model
Steps 1 and 2 are outlined beginning on page 134.

We estimate the support vector machine using the default settings. This will use a radial basis function as the kernel, with the nu parameter set equal to 0.5:

> fit <- svm(Class ~ ., data = Vehicle[train, ],
    type = "nu-classification")

The summary function provides details of the estimated model:

> summary(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ],
    type = "nu-classification")

Parameters:
   SVM-Type: nu-classification
 SVM-Kernel: radial
      gamma: 0.05555556
         nu: 0.5

Number of Support Vectors: 403

 ( 110 88 107 98 )

Number of Classes: 4

Levels:
 bus opel saab van

The function provides details of the type of support vector machine (nu-classification), the kernel, the nu parameter and the gamma parameter. Notice the model estimates 403 support vectors, with 110 in the first class (bus) and 98 in the fourth class (van). The total number of support vectors is slightly higher than estimated for the C-SVM discussed on page 134.

We use the tune.svm method to identify the best model for gamma and nu ranging between 0.05 and 0.45. We store the results in obj:

> obj <- tune.svm(Class ~ ., data = Vehicle[train, ],
    type = "nu-classification",
    gamma = seq(0.05, 0.45, by = 0.1),
    nu = seq(0.05, 0.45, by = 0.1))
We can use the summary function to see the result:

> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma   nu
  0.05 0.05

- best performance: 0.178

The method automatically performs a 10-fold cross validation. We see that the best model has gamma and nu equal to 0.05, with a 17.8% misclassification rate. Visualization of the output of tune.svm can be achieved using the plot function:

> plot(obj)

Figure 19.1 illustrates the resultant plot. To interpret the image, note that the darker the shading, the better the fit of the model. It seems that a smaller nu and gamma lead to a better fit of the training data.

Using this information we re-tune the model with gamma and nu ranging from 0.01 to 0.05. The best performance occurs with both parameters set to 0.02, with an overall misclassification error rate of 16.4%:

> obj <- tune.svm(Class ~ ., data = Vehicle[train, ],
    type = "nu-classification",
    gamma = seq(0.01, 0.05, by = 0.01),
    nu = seq(0.01, 0.05, by = 0.01))
> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma   nu
  0.02 0.02

- best performance: 0.164
Figure 19.1: Tuning Multicategory Classification with nu-SVM using the Vehicle data set
We store the results for the optimal model in the objects bestNU and bestGamma and then refit the model:

> bestNU <- obj$best.parameters[[2]]
> bestGamma <- obj$best.parameters[[1]]
> fit <- svm(Class ~ ., data = Vehicle[train, ],
    type = "nu-classification",
    nu = bestNU, gamma = bestGamma, cross = 10)

Details of the fit can be viewed using the print method:
> print(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ],
    type = "nu-classification", nu = bestNU,
    gamma = bestGamma, cross = 10)

Parameters:
   SVM-Type: nu-classification
 SVM-Kernel: radial
      gamma: 0.02
         nu: 0.02

Number of Support Vectors: 204

The fitted model now has 204 support vectors, roughly half the number required for the model we initially built.
The summary method provides additional details. Reported are the support vectors by classification and the results from the 10-fold cross validation. Overall the model has a total accuracy of 75.6%:

> summary(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ],
    type = "nu-classification", nu = bestNU,
    gamma = bestGamma, cross = 10)

Parameters:
   SVM-Type: nu-classification
 SVM-Kernel: radial
      gamma: 0.02
         nu: 0.02

Number of Support Vectors: 204

 ( 71 29 72 32 )

Number of Classes: 4

Levels:
 bus opel saab van

10-fold cross-validation on training data:

Total Accuracy: 75.6
Single Accuracies:
 82 80 84 80 72 64 74 76 76 68

Next we visualize a two dimensional projection of the fitted data. To do so we use the plot function with the variables Elong and Max.L.Ra for the x and y axes, whilst holding all the other variables in Vehicle at their median values. Figure 19.2 shows the resultant plot:

> plot(fit, Vehicle[train, ], Elong ~ Max.L.Ra,
    svSymbol = "v",
    slice = list(Comp = median(Vehicle$Comp),
        Circ = median(Vehicle$Circ),
        D.Circ = median(Vehicle$D.Circ),
        Rad.Ra = median(Vehicle$Rad.Ra),
        Pr.Axis.Ra = median(Vehicle$Pr.Axis.Ra),
        Scat.Ra = median(Vehicle$Scat.Ra),
        Pr.Axis.Rect = median(Vehicle$Pr.Axis.Rect),
        Max.L.Rect = median(Vehicle$Max.L.Rect),
        Sc.Var.Maxis = median(Vehicle$Sc.Var.Maxis),
        Sc.Var.maxis = median(Vehicle$Sc.Var.maxis),
        Ra.Gyr = median(Vehicle$Ra.Gyr),
        Skew.Maxis = median(Vehicle$Skew.Maxis),
        Skew.maxis = median(Vehicle$Skew.maxis),
        Kurt.maxis = median(Vehicle$Kurt.maxis),
        Kurt.Maxis = median(Vehicle$Kurt.Maxis),
        Holl.Ra = median(Vehicle$Holl.Ra)))
Figure 19.2: Multicategory Classification with nu-SVM: two dimensional projection of Vehicle
Step 4 Make Predictions
The predict method with the test data and the fitted model fit are used as follows:

> pred <- predict(fit, Vehicle[-train, ])

Classification results alongside the actual observed values can be obtained using the table method:

> table(Vehicle$Class[-train], pred,
    dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   87    1    1   5
          opel   0   41   42   2
          saab   0   30   58   0
          van    1    1    2  75

The misclassification error rate can then be calculated. Overall the model achieves an error rate of 24.6% on the test sample:

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.246
Technique 20
Bound-constraint C-SVM Classification
A bound-constraint C-SVM for multicategory response classification can be estimated using the package kernlab with the ksvm function:

ksvm(y ~ ., data, kernel, type = "C-bsvc", kpar, ...)

Key parameters include kernel - the kernel function, type set to "C-bsvc", kpar which contains the kernel parameter values, the multicategory response variable y, and the covariates data.

Step 1 Load Required Packages
We build the bound-constraint C-SVM for multicategory response classification using the data frame Vehicle contained in the mlbench package. For additional details on this sample see page 23. The scatterplot3d package will be used to create a three dimensional scatter plot:

> library(kernlab)
> library(mlbench)
> data(Vehicle)
> library(scatterplot3d)

Step 2 Prepare Data & Tweak Parameters
We use 500 out of the 846 observations in the training sample, with the remainder saved for testing. The variable train selects a random sample of 500 observations without replacement:
> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)
Step 3 Estimate & Evaluate Model
We estimate the bound-constraint support vector machine using a radial basis kernel (kernel = "rbfdot") with the parameter sigma equal to 0.05. We also set the cross validation parameter cross = 10:

> fit <- ksvm(Class ~ ., data = Vehicle[train, ],
    type = "C-bsvc", kernel = "rbfdot",
    kpar = list(sigma = 0.05), cross = 10)

The print function provides details of the estimated model:

> print(fit)
Support Vector Machine object of class "ksvm"

SV type: C-bsvc (classification)
 parameter: cost C = 1

Gaussian Radial Basis kernel function.
 Hyperparameter: sigma = 0.05

Number of Support Vectors: 375

Objective Function Value: -49.1905 -53.3552 -48.0924 -193.2145 -48.9507 -59.7368
Training error: 0.164
Cross validation error: 0.248

The function provides details of the type of support vector machine (C-bsvc), the model cost parameter and the kernel parameter (C = 1, sigma = 0.05). Notice the model estimates 375 support vectors, with a training error of 16.4% and a cross validation error of 24.8%. In this case we observe a relatively large difference between the training error and the cross validation error. In practice the cross validation error is often a better indicator of the expected performance on the test sample.
Since the output of ksvm (in our example, the values stored in fit) is an S4 object, both errors can be accessed directly using "@", although the "preferred" approach is to use an accessor function such as cross(fit) and error(fit). If you don't know the accessor function names, you can always use attributes(fit) and access the required parameter using "@" (for S4 objects) or "$" (for S3 objects):

> fit@error
[1] 0.164
> fit@cross
[1] 0.248
PRACTITIONER TIP

The package kernlab supports a wide range of kernels. Here are nine popular choices:

• rbfdot - Radial Basis kernel function

• polydot - Polynomial kernel function

• vanilladot - Linear kernel function

• tanhdot - Hyperbolic tangent kernel function

• laplacedot - Laplacian kernel function

• besseldot - Bessel kernel function

• anovadot - ANOVA RBF kernel function

• splinedot - Spline kernel

• stringdot - String kernel
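Any of these can be passed to ksvm via the kernel argument. As a quick sketch, here is the chapter's model refitted with a linear kernel; the choice of vanilladot is purely illustrative, not part of the worked example:

```r
# Illustrative only: refit the chapter's model with a linear kernel.
# vanilladot has no hyperparameters, so kpar is not needed.
fit_lin <- ksvm(Class ~ ., data = Vehicle[train, ],
                type = "C-bsvc", kernel = "vanilladot",
                cross = 10)
cross(fit_lin)   # compare the cross validation error across kernels
```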
We need to tune the model to obtain the optimum parameters. Let's write a few lines of R code to do this for us. First we set up the ranges for the cost and sigma parameters:

> cost <- 2^(2:8)
> sigma <- seq(0.1, 0.5, by = 0.1)
> n_cost <- length(cost)
> n_sigma <- length(sigma)

The total number of models to be estimated is stored in runs. Check to see if it contains the value 35 (it should):

> runs <- n_sigma * n_cost
> countcost <- 0
> countsigma <- 0
> runs
[1] 35

The results, in terms of cross validation error, cost and sigma, are stored in results:

> results <- 1:(3 * runs)
> dim(results) <- c(runs, 3)
> colnames(results) <- c("cost", "sigma", "error")

The objects i, j and count are loop variables:

> i = 1
> j = 1
> count = 1

The main loop for tuning is as follows:

> for (val in cost) {
    for (val in sigma) {
      cat("iteration = ", count, " out of ", runs, "\n")
      fit <- ksvm(Class ~ ., data = Vehicle[train, ],
          type = "C-bsvc", C = cost[j],
          kernel = "rbfdot", kpar = list(sigma = sigma[i]),
          cross = 45)
      results[count, 1] = fit@param$C
      results[count, 2] = sigma[i]
      results[count, 3] = fit@cross
      countsigma = countsigma + 1
      count = count + 1
      i = i + 1
    } # end sigma loop
    i = 1
    j = j + 1
  } # end cost loop
Notice we set cross = 45; this performs a 45-fold cross validation. When you execute the above code, as it is running you should see output along the lines of:

iteration = 1 out of 35
iteration = 2 out of 35
iteration = 3 out of 35
iteration = 4 out of 35
iteration = 5 out of 35

We turn results into a data frame using the as.data.frame method:

> results <- as.data.frame(results)

Take a peek and you should see something like this:

> results
  cost sigma     error
1    4   0.1 0.2358586
2    4   0.2 0.2641414
3    4   0.3 0.2744108
4    4   0.4 0.2973064

Now let's find the best cross validation performance and its associated row number:

> with(results, error[error == min(error)])
[1] 0.2074074
> which(results$error == min(results$error))
[1] 11

The optimal values may need to be stored for later use. We save them in best_result using a few lines of code:

> best_per_row <- which(results$error == min(results$error))
> fit_cost <- results[best_per_row, 1]
> fit_sigma <- results[best_per_row, 2]
> fit_xerror <- results[best_per_row, 3]
> best_result <- cbind(fit_cost, fit_sigma, fit_xerror)
> colnames(best_result) <- c("cost", "sigma", "error")
> best_result
     cost sigma     error
[1,]   16   0.1 0.2074074

So we see the optimal results occur for cost = 16 and sigma = 0.1, with a cross validation error of 20.7%.

Figure 20.1 presents a three dimensional visualization of the tuning numbers. It was created using scatterplot3d as follows:

> scatterplot3d(results$cost, results$sigma,
    results$error, xlab = "cost", ylab = "sigma",
    zlab = "Error")
Figure 20.1: Bound-constraint C-SVM tuning: 3d scatterplot using Vehicle
After all that effort we may as well estimate the optimal model using the training data and show the cross validation error. It is around 24.6%:

> fit <- ksvm(Class ~ ., data = Vehicle[train, ],
    type = "C-bsvc", C = fit_cost,
    kernel = "rbfdot", kpar = list(sigma = fit_sigma),
    cross = 45)
> fit@cross
[1] 0.2457912
Step 4 Make Predictions
The predict method with the test data and the fitted model fit are used as follows:

> pred <- predict(fit, Vehicle[-train, ])

Classification results alongside the actual observed values can be obtained using the table method:

> table(Vehicle$Class[-train], pred,
    dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   88    1    2   3
          opel   1   31   50   3
          saab   1   24   61   2
          van    1    0    3  75

The misclassification error rate can then be calculated. Overall the model achieves an error rate of 26.3% on the test sample:

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.263
Technique 21
Weston-Watkins Multi-Class SVM
A bound-constraint Weston-Watkins SVM for multicategory response classification can be estimated using the package kernlab with the ksvm function:

ksvm(y ~ ., data, kernel, type = "kbb-svc", kpar, ...)

Key parameters include kernel - the kernel function, type set to "kbb-svc", kpar which contains the kernel parameter values, the multicategory response variable y, and the covariates data.

Step 3 Estimate & Evaluate Model
Steps 1 and 2 are outlined beginning on page 151.

We estimate the Weston-Watkins support vector machine using a radial basis kernel (kernel = "rbfdot") with the parameter sigma equal to 0.05. We also set the cross validation parameter cross = 10:

> fit <- ksvm(Class ~ ., data = Vehicle[train, ],
    type = "kbb-svc", kernel = "rbfdot",
    kpar = list(sigma = 0.05), cross = 10)

The print function provides details of the estimated model:

> print(fit)
Support Vector Machine object of class "ksvm"

SV type: kbb-svc (classification)
 parameter: cost C = 1

Gaussian Radial Basis kernel function.
 Hyperparameter: sigma = 0.05

Number of Support Vectors: 356

Objective Function Value: 0
Training error: 0.148
Cross validation error: 0.278

The function provides details of the type of support vector machine (kbb-svc), the model cost parameter, the kernel type and parameter (sigma = 0.05), the number of support vectors (356), a training error of 14.8% and a cross validation error of 27.8%. Both types of error can be individually accessed as follows:

> fit@error
[1] 0.148
> fit@cross
[1] 0.278
Let's see if we can tune the model using our own grid search. First we set up the ranges for the cost and sigma parameters:

> cost <- 2^(2:8)
> sigma <- seq(0.1, 0.5, by = 0.1)
> n_cost <- length(cost)
> n_sigma <- length(sigma)

The total number of models to be estimated is stored in runs:

> runs <- n_sigma * n_cost
> countcost <- 0
> countsigma <- 0

The results, in terms of cross validation error, cost and sigma, are stored in results:

> results <- 1:(3 * runs)
> dim(results) <- c(runs, 3)
> colnames(results) <- c("cost", "sigma", "error")

The objects i, j and count are loop variables:

> i = 1
> j = 1
> count = 1

The main loop for tuning is as follows:

> for (val in cost) {
    for (val in sigma) {
      cat("iteration = ", count, " out of ", runs, "\n")
      fit <- ksvm(Class ~ ., data = Vehicle[train, ],
          type = "kbb-svc", C = cost[j],
          kernel = "rbfdot", kpar = list(sigma = sigma[i]),
          cross = 45)
      results[count, 1] = fit@param$C
      results[count, 2] = sigma[i]
      results[count, 3] = fit@cross
      countsigma = countsigma + 1
      count = count + 1
      i = i + 1
    } # end sigma loop
    i = 1
    j = j + 1
  } # end cost loop
When you execute the above code, as it is running you should see output along the lines of:

iteration = 1 out of 35
iteration = 2 out of 35
iteration = 3 out of 35
iteration = 4 out of 35
iteration = 5 out of 35

We turn results into a data frame using the as.data.frame method:

> results <- as.data.frame(results)

Take a peek; you should see something like this:

> results
  cost sigma     error
1    4   0.1 0.2356902
2    4   0.2 0.2639731
3    4   0.3 0.3228956
4    4   0.4 0.2973064
5    4   0.5 0.2885522

Now let's find the best cross validation performance and its associated row number:

> with(results, error[error == min(error)])
[1] 0.2122896
> which(results$error == min(results$error))
[1] 21

So the optimal cross validation error is 21.2% and is located in row 21 of results.

The optimal values may need to be stored for later use. We save them in best_result using a few lines of code:

> best_per_row <- which(results$error == min(results$error))
> fit_cost <- results[best_per_row, 1]
> fit_sigma <- results[best_per_row, 2]
> fit_xerror <- results[best_per_row, 3]
> best_result <- cbind(fit_cost, fit_sigma, fit_xerror)
> colnames(best_result) <- c("cost", "sigma", "error")
> best_result
     cost sigma     error
[1,]   64   0.1 0.2122896
Figure 21.1 presents a three dimensional visualization of the tuning numbers. It was created using scatterplot3d as follows:

> scatterplot3d(results$cost, results$sigma,
    results$error, xlab = "cost", ylab = "sigma",
    zlab = "Error")
Figure 21.1: Weston-Watkins Multi-Class SVM tuning: 3d scatterplot using Vehicle
After all that effort we may as well estimate the optimal model using the training data and show the cross validation error. It is around 27.6%:

> fit <- ksvm(Class ~ ., data = Vehicle[train, ],
    type = "kbb-svc", C = fit_cost,
    kernel = "rbfdot", kpar = list(sigma = fit_sigma),
    cross = 45)
> fit@cross
[1] 0.2762626
Step 4 Make Predictions
The predict method with the test data and the fitted model fit are used as follows:

> pred <- predict(fit, Vehicle[-train, ])

Classification results alongside the actual observed values can be obtained using the table method:

> table(Vehicle$Class[-train], pred,
    dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   88    1    0   5
          opel   1   37   43   4
          saab   1   33   53   1
          van    1    1    2  75

The misclassification error rate can be calculated:

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.269

Overall the model achieves an error rate of 26.9% on the test sample.
Technique 22
Crammer-Singer Multi-Class SVM
A Crammer-Singer SVM for multicategory response classification can be estimated using the package kernlab with the ksvm function:

ksvm(y ~ ., data, kernel, type = "spoc-svc", kpar, ...)

Key parameters include kernel - the kernel function, type set to "spoc-svc", kpar which contains the kernel parameter values, the multicategory response variable y, and the covariates data.

Step 3 Estimate & Evaluate Model
Steps 1 and 2 are outlined beginning on page 151.

We estimate the Crammer-Singer multi-class support vector machine using a radial basis kernel (kernel = "rbfdot") with the parameter sigma equal to 0.05. We also set the cross validation parameter cross = 10:

> fit <- ksvm(Class ~ ., data = Vehicle[train, ],
    type = "spoc-svc", kernel = "rbfdot",
    kpar = list(sigma = 0.05), cross = 10)

The print function provides details of the estimated model:

> print(fit)
Support Vector Machine object of class "ksvm"

SV type: spoc-svc (classification)
 parameter: cost C = 1

Gaussian Radial Basis kernel function.
 Hyperparameter: sigma = 0.05

Number of Support Vectors: 340

Objective Function Value: 0
Training error: 0.152
Cross validation error: 0.244

The function provides details of the type of support vector machine (spoc-svc), the model cost parameter and the kernel parameter (sigma = 0.05), the number of support vectors (340), a training error of 15.2% and a cross validation error of 24.4%. Both types of error can be individually accessed as follows:

> fit@error
[1] 0.152
> fit@cross
[1] 0.244
We need to tune the model. First we set up the ranges for the cost and sigma parameters:

> cost <- 2^(2:8)
> sigma <- seq(0.1, 0.5, by = 0.1)
> n_cost <- length(cost)
> n_sigma <- length(sigma)

The total number of models to be estimated is stored in runs:

> runs <- n_sigma * n_cost
> countcost <- 0
> countsigma <- 0

The results, in terms of cross validation error, cost and sigma, are stored in results:

> results <- 1:(3 * runs)
> dim(results) <- c(runs, 3)
> colnames(results) <- c("cost", "sigma", "error")

The objects i, j and count are loop variables:

> i = 1
> j = 1
> count = 1

The main loop for tuning is as follows (note the model type is "spoc-svc"):

> for (val in cost) {
    for (val in sigma) {
      cat("iteration = ", count, " out of ", runs, "\n")
      fit <- ksvm(Class ~ ., data = Vehicle[train, ],
          type = "spoc-svc", C = cost[j],
          kernel = "rbfdot", kpar = list(sigma = sigma[i]),
          cross = 45)
      results[count, 1] = fit@param$C
      results[count, 2] = sigma[i]
      results[count, 3] = fit@cross
      countsigma = countsigma + 1
      count = count + 1
      i = i + 1
    } # end sigma loop
    i = 1
    j = j + 1
  } # end cost loop
When you execute the above code, as it is running you should see output along the lines of:

iteration = 1 out of 35
iteration = 2 out of 35
iteration = 3 out of 35
iteration = 4 out of 35
iteration = 5 out of 35

We turn results into a data frame using the as.data.frame method:

> results <- as.data.frame(results)

Take a peek at the results and you should see something like this:

> results
  cost sigma     error
1    4   0.1 0.2299663
2    4   0.2 0.2378788
3    4   0.3 0.2582492
4    4   0.4 0.2730640

Now let's find the best cross validation performance and its associated row number. The optimal cross validation error is 20.8% and is located in row 11 of results:

> with(results, error[error == min(error)])
[1] 0.2075758
> which(results$error == min(results$error))
[1] 11

We save the optimal values in best_result using a few lines of code:

> best_per_row <- which(results$error == min(results$error))
> fit_cost <- results[best_per_row, 1]
> fit_sigma <- results[best_per_row, 2]
> fit_xerror <- results[best_per_row, 3]
> best_result <- cbind(fit_cost, fit_sigma, fit_xerror)
> colnames(best_result) <- c("cost", "sigma", "error")
> best_result
     cost sigma     error
[1,]   16   0.1 0.2075758
Figure 22.1 presents a three dimensional visualization of the tuning numbers. It was created using scatterplot3d as follows:

> scatterplot3d(results$cost, results$sigma,
    results$error, xlab = "cost", ylab = "sigma",
    zlab = "Error")
Figure 22.1: Crammer-Singer Multi-Class SVM tuning: 3d scatterplot using Vehicle
After all that effort, we may as well estimate the optimal model using the training data and show the cross validation error. It is around 0.25:
> fit <- ksvm(Class ~ ., data = Vehicle[train, ],
    type = "spoc-svc", C = fit_cost,
    kernel = "rbfdot", kpar = list(sigma = fit_sigma),
    cross = 45)

> fit@cross
[1] 0.2478114
Step 4: Make Predictions
The predict method with the test data and the fitted model fit are used as follows:
> pred <- predict(fit, Vehicle[-train, ])
Classification results alongside the actual observed values can be obtained using the table method:
> table(Vehicle$Class[-train], pred,
    dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   88    0    0   6
          opel   6   43   30   6
          saab   3   29   52   4
          van    1    0    1  77
The misclassification error rate can then be calculated. Overall the model achieves an error rate of 24.9% on the test sample:
> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.249
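The arithmetic behind the error rate is worth spelling out: it is one minus the share of diagonal (correctly classified) entries in the confusion table. A quick check in Python, using the table printed above:

```python
# Confusion matrix from the text: rows = observed, columns = predicted,
# class order: bus, opel, saab, van.
conf = [
    [88, 0, 0, 6],
    [6, 43, 30, 6],
    [3, 29, 52, 4],
    [1, 0, 1, 77],
]

total = sum(sum(row) for row in conf)        # 346 test observations
correct = sum(conf[i][i] for i in range(4))  # diagonal entries = 260
error_rate = 1 - correct / total
print(round(error_rate, 3))  # 0.249, matching the R output
```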
Support Vector Regression
Technique 23
SVM eps-Regression
An eps-regression can be estimated using the package e1071 with the svm function:

svm(y ~ ., data, kernel, type = "eps-regression",
    epsilon, gamma, cost, ...)

Key parameters include kernel (the kernel function); the parameters cost, gamma and epsilon; type set to "eps-regression"; the continuous response variable y; and the covariates in data.
Step 1: Load Required Packages
We build the support vector machine eps-regression using the data frame bodyfat contained in the TH.data package. For additional details on this sample see page 62.
> library(e1071)
> data(bodyfat, package = "TH.data")
Step 2: Prepare Data & Tweak Parameters
We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations:
> set.seed(465)
> train <- sample(1:71, 45, FALSE)
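The split itself is nothing more than drawing 45 of the 71 row indices without replacement. For readers outside R, a Python sketch of the same idea (Python's random number stream differs from R's, so the actual indices drawn will not match the book's):

```python
import random

random.seed(465)  # note: not the same stream as R's set.seed(465)
train = random.sample(range(1, 72), 45)            # without replacement
test = [i for i in range(1, 72) if i not in train]

print(len(train), len(test))  # 45 26
```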
Step 3: Estimate & Evaluate Model
We begin by estimating the support vector machine using the default settings. This will use a radial basis function as the kernel with a cost parameter value set equal to 1:
> fit <- svm(DEXfat ~ ., data = bodyfat[train, ],
    type = "eps-regression")
The summary function provides details of the estimated model:
> summary(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ],
    type = "eps-regression")

Parameters:
   SVM-Type:  eps-regression
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.1111111
    epsilon:  0.1

Number of Support Vectors:  33
The function reports the type of support vector machine (eps-regression), the cost parameter, the kernel type with its associated gamma and epsilon parameters, and the number of support vectors, in this case 33.
A nice feature of the e1071 package is that it contains tune.svm, a support vector machine tuning function. We use the method to identify the best model for gamma ranging between 0.25 and 4, cost between 4 and 16, and epsilon between 0.01 and 0.09. The results are stored in obj:
> obj <- tune.svm(DEXfat ~ ., data = bodyfat[train, ],
    type = "eps-regression", gamma = 2^(-2:2),
    cost = 2^(2:4),
    epsilon = seq(0.01, 0.09, by = 0.01))
Once again we can use the summary function to see the result
> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost epsilon
  0.25    8    0.01

- best performance: 26.58699
The method automatically performs a 10-fold cross validation. We see that the best model has a gamma of 0.25, a cost parameter equal to 8 and an epsilon of 0.01, with a cross-validated mean squared error of around 26.6. These metrics can also be accessed by calling $best.performance and $best.parameters:
> obj$best.performance
[1] 26.58699

> obj$best.parameters
  gamma cost epsilon
6  0.25    8    0.01
We use these optimum parameters to refit the model using leave one out cross validation (cross = 45, one fold per training observation) as follows:
> bestEpi <- obj$best.parameters[[3]]
> bestGamma <- obj$best.parameters[[1]]
> bestCost <- obj$best.parameters[[2]]

> fit <- svm(DEXfat ~ ., data = bodyfat[train, ],
    type = "eps-regression", epsilon = bestEpi,
    gamma = bestGamma, cost = bestCost, cross = 45)
The fitted model now has 44 support vectors:
> print(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ],
    type = "eps-regression", epsilon = bestEpi,
    gamma = bestGamma, cost = bestCost, cross = 45)

Parameters:
   SVM-Type:  eps-regression
 SVM-Kernel:  radial
       cost:  8
      gamma:  0.25
    epsilon:  0.01

Number of Support Vectors:  44
The summary method provides additional details. It also reports the mean square error for each cross validation fold (not shown here):
> summary(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ],
    type = "eps-regression", epsilon = bestEpi,
    gamma = bestGamma, cost = bestCost, cross = 45)

Parameters:
   SVM-Type:  eps-regression
 SVM-Kernel:  radial
       cost:  8
      gamma:  0.25
    epsilon:  0.01

Number of Support Vectors:  44

45-fold cross-validation on training data:

Total Mean Squared Error: 28.0095
Squared Correlation Coefficient: 0.7877644
Step 4: Make Predictions
The predict method with the test data and the fitted model fit are used as follows:
> pred <- predict(fit, bodyfat[-train, ])

A plot of the predicted and observed values, shown in Figure 23.1, is obtained using the plot function:
> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
    ylab = "Predicted Values",
    main = "Training Sample Model Fit")

We calculate the squared correlation coefficient using the cor function. It reports a value of 0.712:
> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.712
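cor(pred, obs)^2 is simply the square of Pearson's correlation coefficient. For readers who want the formula spelled out, a small Python version, applied to invented numbers rather than the bodyfat output:

```python
import math

def r_squared(pred, obs):
    """Square of the Pearson correlation between predictions and observations."""
    n = len(pred)
    mp, mo = sum(pred) / n, sum(obs) / n
    cov = sum((p - mp) * (o - mo) for p, o in zip(pred, obs))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    return (cov / (sp * so)) ** 2

# Hypothetical predicted/observed values, for illustration only
obs = [20.0, 25.0, 30.0, 35.0, 40.0]
pred = [22.0, 24.0, 31.0, 34.0, 38.0]
print(round(r_squared(pred, obs), 3))
```

A value near 1 indicates the predictions track the observations closely; a perfectly linear relationship gives exactly 1.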
Figure 23.1: Predicted and observed values using SVM eps-regression for bodyfat
Technique 24
SVM nu-Regression
A nu-regression can be estimated using the package e1071 with the svm function:

svm(y ~ ., data, kernel, type = "nu-regression",
    nu, gamma, cost, ...)

Key parameters include kernel (the kernel function); the parameters cost, gamma and nu; type set to "nu-regression"; the continuous response variable y; and the covariates in data.
Step 3: Estimate & Evaluate Model
Steps 1 and 2 are outlined beginning on page 172.
We begin by estimating the support vector machine using the default settings. This will use a radial basis function as the kernel with a cost parameter value set equal to 1:
> fit <- svm(DEXfat ~ ., data = bodyfat[train, ],
    type = "nu-regression")
The summary function provides details of the estimated model:
> summary(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ],
    type = "nu-regression")

Parameters:
   SVM-Type:  nu-regression
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.1111111
         nu:  0.5

Number of Support Vectors:  33
The function reports the type of support vector machine (nu-regression), the cost parameter, the kernel type with its associated gamma and nu parameters, and the number of support vectors, in this case 33, which happens to be the same as estimated using the SVM eps-regression (see page 172). We use the tune.svm method to identify the best model for gamma ranging between 0.25 and 4, cost between 4 and 16, and nu between 0.1 and 0.9. The results are stored in obj:
> obj <- tune.svm(DEXfat ~ ., data = bodyfat[train, ],
    type = "nu-regression", gamma = 2^(-2:2),
    cost = 2^(2:4), nu = seq(0.1, 0.9, by = 0.1))
We use the summary function to see the result:
> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost  nu
  0.25    8 0.9

- best performance: 26.42275
The method automatically performs a 10-fold cross validation. We see that the best model has a gamma of 0.25, a cost parameter equal to 8 and nu equal to 0.9, with a cross-validated mean squared error of around 26.4. These metrics can also be accessed by calling $best.performance and $best.parameters:
> obj$best.performance
[1] 26.42275

> obj$best.parameters
    gamma cost  nu
126  0.25    8 0.9
We use these optimum parameters to refit the model using leave one out cross validation as follows:
> bestNU <- obj$best.parameters[[3]]
> bestGamma <- obj$best.parameters[[1]]
> bestCost <- obj$best.parameters[[2]]

> fit <- svm(DEXfat ~ ., data = bodyfat[train, ],
    type = "nu-regression", nu = bestNU,
    gamma = bestGamma, cost = bestCost, cross = 45)
The fitted model now has 45 support vectors:
> print(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ],
    type = "nu-regression", nu = bestNU,
    gamma = bestGamma, cost = bestCost, cross = 45)

Parameters:
   SVM-Type:  nu-regression
 SVM-Kernel:  radial
       cost:  8
      gamma:  0.25
         nu:  0.9

Number of Support Vectors:  45
The summary method provides additional details. It also reports the mean square error for each cross validation fold (not shown here):
> summary(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ],
    type = "nu-regression", nu = bestNU,
    gamma = bestGamma, cost = bestCost, cross = 45)

Parameters:
   SVM-Type:  nu-regression
 SVM-Kernel:  radial
       cost:  8
      gamma:  0.25
         nu:  0.9

Number of Support Vectors:  45

45-fold cross-validation on training data:

Total Mean Squared Error: 27.98642
Squared Correlation Coefficient: 0.7871421
Step 4: Make Predictions
The predict method with the test data and the fitted model fit are used as follows:
> pred <- predict(fit, bodyfat[-train, ])

A plot of the predicted and observed values, shown in Figure 24.1, is obtained using the plot function:
> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
    ylab = "Predicted Values",
    main = "Training Sample Model Fit")

We calculate the squared correlation coefficient using the cor function. It reports a value of 0.712:
> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.712
Figure 24.1: Predicted and observed values using SVM nu-regression for bodyfat
Technique 25
Bound-constraint SVM eps-Regression
A bound-constraint SVM eps-regression can be estimated using the package kernlab with the ksvm function:

ksvm(y ~ ., data, kernel, type = "eps-bsvr", kpar, ...)

Key parameters include kernel (the kernel function); type set to "eps-bsvr"; kpar, which contains the kernel parameter values; the continuous response variable y; and the covariates in data.
Step 1: Load Required Packages
We build the model using the data frame bodyfat contained in the TH.data package. For additional details on this sample see page 62.
library(kernlab)
data(bodyfat, package = "TH.data")

Step 2: Prepare Data & Tweak Parameters
We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations:
set.seed(465)
train <- sample(1:71, 45, FALSE)
Step 3: Estimate & Evaluate Model
We estimate the bound-constraint support vector machine using a radial basis kernel (kernel = "rbfdot") with the kernel parameter sigma equal to 0.05. We also set the cross validation parameter cross = 10:
fit <- ksvm(DEXfat ~ ., data = bodyfat[train, ],
    type = "eps-bsvr", kernel = "rbfdot",
    kpar = list(sigma = 0.05), cross = 10)
The print function provides details of the estimated model:
> print(fit)
Support Vector Machine object of class "ksvm"

SV type: eps-bsvr  (regression)
Gaussian Radial Basis kernel function.
Hyperparameter : sigma = 0.05

Number of Support Vectors : 35

Objective Function Value : -77.105
Training error : 0.096254
Cross validation error : 18.77613
The output reports the type of support vector machine (eps-bsvr) and the kernel parameter (sigma = 0.05). Notice the model estimates 35 support vectors, with a training error of 9.6% and a cross validation error of 18.78. Since the output of ksvm (in this case stored in fit) is an S4 object, its slots can be accessed directly using "@":
> fit@param$C
[1] 1

> fit@param$epsilon
[1] 0.1

> fit@error
[1] 0.09625418

> fit@cross
[1] 18.77613
We need to tune the model to obtain the optimum parameters. Let's create a small bit of code to do this for us. First we set up the ranges for the epsilon and sigma parameters:
> epsilon <- seq(0.01, 0.1, by = 0.01)
> sigma <- seq(0.1, 1, by = 0.01)

> n_epsilon <- length(epsilon)
> n_sigma <- length(sigma)

The total number of models to be estimated is stored in runs. Since it contains 910 models it may take a little while to run:
> runs <- n_sigma * n_epsilon
> runs
[1] 910

The results, in terms of cost, sigma, epsilon and cross validation error, are stored in results:
> results <- 1:(4 * runs)
> dim(results) <- c(runs, 4)
> colnames(results) <- c("cost", "sigma", "epsilon", "error")
The loop variables are as follows:
> count.epsilon <- 0
> count.sigma <- 0

> i=1
> j=1
> k=1
> count=1
The main loop for tuning is as follows:
> for (val in epsilon) {
    for (val in sigma) {
      fit <- ksvm(DEXfat ~ ., data = bodyfat[train, ],
                  type = "eps-bsvr",
                  epsilon = epsilon[k],
                  kernel = "rbfdot",
                  kpar = list(sigma = sigma[i]),
                  cross = 45)
      results[count, 1] = fit@param$C
      results[count, 2] = sigma[i]
      results[count, 3] = fit@param$epsilon
      results[count, 4] = fit@cross
      count.sigma = count.sigma + 1
      count = count + 1
      i = i + 1
    } # end sigma loop
    i = 1
    k = k + 1
    cat("iteration (%) = ", round((count / runs) * 100, 0), "\n")
  } # end epsilon loop

Notice we set cross = 45 to perform leave one out validation. When you execute the above code, as it runs you should see output along the lines of:
iteration (%) = 10
iteration (%) = 20
iteration (%) = 30
We turn results into a data frame using the as.data.frame method:
> results <- as.data.frame(results)

Take a peek at the results and you should see something like this:
> results
  cost sigma epsilon    error
1    1  0.10    0.01 22.15108
2    1  0.11    0.01 22.76689
3    1  0.12    0.01 23.24192
4    1  0.13    0.01 23.78274
5    1  0.14    0.01 24.35676
Now let's find the best cross validation performance (21.86) and its associated row number (274):
> with(results, error[error == min(error)])
[1] 21.86181

> which(results$error == min(results$error))
[1] 274
The optimal values may need to be stored for later use. We save them in best_result using a few lines of code:
> best_per_row <- which(results$error == min(results$error))

> fit_cost <- results[best_per_row, 1]
> fit_sigma <- results[best_per_row, 2]
> fit_epsilon <- results[best_per_row, 3]
> fit_xerror <- results[best_per_row, 4]

> best_result <- cbind(fit_cost, fit_epsilon,
    fit_sigma, fit_xerror)
> colnames(best_result) <- c("cost", "epsilon",
    "sigma", "error")

> best_result
     cost epsilon sigma    error
[1,]    1    0.04   0.1 21.86181
After all that effort, we may as well estimate the optimal model using the training data and show the cross validation error. It is around 21.9:
> fit <- ksvm(DEXfat ~ ., data = bodyfat[train, ],
    type = "eps-bsvr", C = fit_cost,
    epsilon = fit_epsilon, kernel = "rbfdot",
    kpar = list(sigma = fit_sigma), cross = 45)

> fit@cross
[1] 21.86834
Step 4: Make Predictions
The predict method with the test data and the fitted model fit are used as follows:
pred <- predict(fit, bodyfat[-train, ])
We fit a linear regression using pred as the response variable and the observed values as the covariate. The regression line, alongside the predicted and observed values shown in Figure 25.1, is visualized using the plot method combined with the abline method:
> linReg <- lm(pred ~ bodyfat$DEXfat[-train])

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
    ylab = "Predicted Values",
    main = "Training Sample Model Fit")

> abline(linReg, col = "darkred")
The squared correlation between the test sample predicted and observed values is 0.834:
> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.834
Figure 25.1: Bound-constraint SVM eps-regression observed and fitted values using bodyfat
Support Vector Novelty Detection
Technique 26
One-Classification SVM
A one-classification SVM can be estimated using the package e1071 with the svm function:

svm(x, kernel, type = "one-classification", ...)

Key parameters: x, the attributes used to train the model; type set to "one-classification"; and kernel, the kernel function.
Step 1: Load Required Packages
We build the one-classification SVM using the data frame Vehicle contained in the mlbench package. For additional details on this sample see page 23.
> library(e1071)
> library(caret)
> library(mlbench)
> library(scatterplot3d)
> data(Vehicle)
Step 2: Prepare Data & Tweak Parameters
To begin, we choose the bus category as our one-classification class and create the variable is.bus to store TRUE/FALSE values of vehicle type:
> set.seed(103)
> Vehicle$is.bus[Vehicle$Class == "bus"] <- TRUE
> Vehicle$is.bus[Vehicle$Class != "bus"] <- FALSE
As a reminder of the distribution of Vehicle we use the table method:
> table(Vehicle$Class)

 bus opel saab  van
 218  212  217  199

As a check, note that you should observe 218 observations corresponding to bus (is.bus = TRUE):
> table(Vehicle$is.bus)

FALSE  TRUE
  628   218
Our next step is to create the variables is.busTrue and is.busFalse to hold the positive (bus) and negative (non-bus) observations respectively:
> is.busTrue <- subset(Vehicle, Vehicle$is.bus == TRUE)
> is.busFalse <- subset(Vehicle, Vehicle$is.bus == FALSE)

Next, from the 218 positive observations we sample 100 at random for training. The remaining 118 positive observations will form part of the test sample:
> train <- sample(1:218, 100, FALSE)

> trainAttributes <- is.busTrue[train, 1:18]
> trainLabels <- is.busTrue[train, 19]

Now, as a quick check, type trainLabels and confirm that the values are all bus.
Step 3: Estimate & Evaluate Model
We need to tune the model. We begin by setting up the tuning parameters. The variable runs contains the total number of models we will estimate during the tuning process:
gamma <- seq(0.01, 0.05, by = 0.01)
nu <- seq(0.01, 0.05, by = 0.01)

n_gam <- length(gamma)
n_nu <- length(nu)

runs <- n_gam * n_nu

count.gamma <- 0
count.nu <- 0
Next set up the variable to store the results and the loop count variables i, k and count:
> results <- 1:(3 * runs)
> dim(results) <- c(runs, 3)
> colnames(results) <- c("gamma", "nu", "Performance")

> i=1
> k=1
> count=1
The main tuning loop is created as follows (notice we estimate the model using 10-fold cross validation (cross = 10) and store the results in fit):

> for (val in nu) {
    for (val in gamma) {
      print(gamma[i])
      print(nu[k])
      fit <- svm(trainAttributes, y = NULL,
                 type = "one-classification",
                 nu = nu[k], gamma = gamma[i],
                 cross = 10)
      results[count, 1] = fit$gamma
      results[count, 2] = fit$nu
      results[count, 3] = fit$tot.accuracy
      count.gamma = count.gamma + 1
      count = count + 1
      i = i + 1
    } # end gamma loop
    i = 1
    k = k + 1
  } # end nu loop
We turn results into a data frame using the as.data.frame method:
> results <- as.data.frame(results)

Take a peek and you should see something like this (gamma varies fastest because it is the inner loop):
> results
  gamma   nu Performance
1  0.01 0.01          94
2  0.02 0.01          88
3  0.03 0.01          87
4  0.04 0.01          81
Now let's find the best cross validation performance:
> with(results, Performance[Performance == max(Performance)])
[1] 94 94
Notice the best performance contains two values, both at 94. Since this is a very high value, it is possibly the result of over fitting - a consequence of over tuning a model. The optimal values may need to be stored for later use. We save them in best_result using a few lines of code:
> best_per_row <- which(results$Performance == max(results$Performance))

> fit_gamma <- results[best_per_row, 1]
> fit_nu <- results[best_per_row, 2]
> fit_per <- results[best_per_row, 3]

> best_result <- cbind(fit_gamma, fit_nu, fit_per)
> colnames(best_result) <- c("gamma", "nu", "Performance")

> best_result
     gamma   nu Performance
[1,]  0.01 0.01          94
[2,]  0.02 0.01          94
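When several grid rows tie at the maximum, as they do here, "the best row" is ambiguous and which() silently returns all of them. One language-neutral remedy is to collect the tied rows and apply an explicit tie-break rule. A Python sketch, mirroring the two 94s in the text (the other rows are invented):

```python
# (gamma, nu, performance) rows; the two 94s mirror the tie in the text,
# the remaining rows are hypothetical.
results = [
    (0.01, 0.01, 94),
    (0.02, 0.01, 94),
    (0.03, 0.01, 87),
    (0.04, 0.01, 81),
]

best_perf = max(r[2] for r in results)
ties = [r for r in results if r[2] == best_perf]  # every argmax row

# Explicit tie-break: prefer the smallest gamma (a smoother decision boundary)
gamma, nu, perf = min(ties, key=lambda r: r[0])
print(len(ties), gamma, nu)  # 2 0.01 0.01
```

Making the tie-break explicit avoids silently fitting a model with a vector of parameter values later on.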
The relationship between the parameters and performance is visualized using the scatterplot3d function and shown in Figure 26.1:
> scatterplot3d(results$nu, results$gamma,
    results$Performance, xlab = "nu",
    ylab = "gamma", zlab = "Performance")
Figure 26.1: Relationship between tuning parameters
Let's fit the optimum model using leave one out cross validation (since best_per_row holds two tied rows, we take the first set of values):
> fit <- svm(trainAttributes, y = NULL,
    type = "one-classification",
    nu = fit_nu[1], gamma = fit_gamma[1],
    cross = 100)
Step 4: Make Predictions
To illustrate the performance of the model on the test sample we will use the confusionMatrix method from the caret package. The first step is to gather together the required information for the training and test samples, using the predict method to make the forecasts:
> trainpredictors <- is.busTrue[train, 1:18]
> trainLabels <- is.busTrue[train, 20]
> testPositive <- is.busTrue[-train, ]
> testPosNeg <- rbind(testPositive, is.busFalse)

> testpredictors <- testPosNeg[, 1:18]
> testLabels <- testPosNeg[, 20]

> svm.predtrain <- predict(fit, trainpredictors)
> svm.predtest <- predict(fit, testpredictors)

Next we create two tables: confTrain, containing the training predictions, and confTest, for the test sample predictions:
> confTrain <- table(Predicted = svm.predtrain,
    Reference = trainLabels)
> confTest <- table(Predicted = svm.predtest,
    Reference = testLabels)
Now we call the confusionMatrix method for the test sample:
> confusionMatrix(confTest, positive = "TRUE")
Confusion Matrix and Statistics

         Reference
Predicted FALSE TRUE
    FALSE   175    5
    TRUE    453  113

               Accuracy : 0.3861
                 95% CI : (0.351, 0.4221)
    No Information Rate : 0.8418
    P-Value [Acc > NIR] : 1

                  Kappa : 0.093
 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.9576
            Specificity : 0.2787
         Pos Pred Value : 0.1996
         Neg Pred Value : 0.9722
             Prevalence : 0.1582
         Detection Rate : 0.1515
   Detection Prevalence : 0.7587
      Balanced Accuracy : 0.6181

       'Positive' Class : TRUE
The method produces a range of test statistics, including an accuracy rate of only 38.6% and a kappa of 0.093. Maybe we over-tuned the model a little. In this example we can see clearly that the consequence of over fitting in training is poor generalization in the test sample.
For illustrative purposes we re-estimate the model, this time with nu = 0.05 and gamma = 0.05:
> fit <- svm(trainAttributes, y = NULL,
    type = "one-classification",
    nu = 0.05, gamma = 0.05, cross = 100)
A little bit of housekeeping, as discussed previously:
> trainpredictors <- is.busTrue[train, 1:18]
> trainLabels <- is.busTrue[train, 20]

> testPositive <- is.busTrue[-train, ]
> testPosNeg <- rbind(testPositive, is.busFalse)

> testpredictors <- testPosNeg[, 1:18]
> testLabels <- testPosNeg[, 20]

> svm.predtrain <- predict(fit, trainpredictors)
> svm.predtest <- predict(fit, testpredictors)

> confTrain <- table(Predicted = svm.predtrain,
    Reference = trainLabels)
> confTest <- table(Predicted = svm.predtest,
    Reference = testLabels)

Now we are ready to pass the necessary information to the confusionMatrix method:
> confusionMatrix(confTest, positive = "TRUE")
Confusion Matrix and Statistics

         Reference
Predicted FALSE TRUE
    FALSE   568   29
    TRUE     60   89

               Accuracy : 0.8807
                 95% CI : (0.8552, 0.9031)
    No Information Rate : 0.8418
    P-Value [Acc > NIR] : 0.001575

                  Kappa : 0.5952
 Mcnemar's Test P-Value : 0.001473

            Sensitivity : 0.7542
            Specificity : 0.9045
         Pos Pred Value : 0.5973
         Neg Pred Value : 0.9514
             Prevalence : 0.1582
         Detection Rate : 0.1193
   Detection Prevalence : 0.1997
      Balanced Accuracy : 0.8293

       'Positive' Class : TRUE
Now the accuracy rate is 0.88 with a kappa of around 0.6. An important takeaway is that support vector machines, as with many predictive analytic methods, can be very sensitive to over fitting.
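Both headline numbers can be verified by hand from the confusion matrix above: accuracy is the diagonal share, and Cohen's kappa compares that observed agreement with the agreement expected by chance from the marginal totals:

```python
# Test-sample confusion matrix from the text:
#              Reference
#  Predicted  FALSE  TRUE
#    FALSE      568    29
#    TRUE        60    89
tn, fn = 568, 29   # predicted FALSE row
fp, tp = 60, 89    # predicted TRUE row

n = tn + fn + fp + tp              # 746 test observations
accuracy = (tn + tp) / n           # observed agreement
# chance agreement from the row and column totals
p_false = ((tn + fn) / n) * ((tn + fp) / n)
p_true = ((fp + tp) / n) * ((fn + tp) / n)
expected = p_false + p_true
kappa = (accuracy - expected) / (1 - expected)

print(round(accuracy, 4), round(kappa, 4))  # 0.8807 0.5952
```

Because the no-information rate is high (84% of the test cases are negatives), kappa is a more honest summary than raw accuracy here.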
NOTES
35 See the paper by Cortes, C., Vapnik, V. (1995). Support-Vector Networks. Machine Learning 20, 273-297.
36 Note that classification tasks based on drawing separating lines to distinguish between objects of different class memberships are known as hyperplane classifiers. The SVM is well suited for such tasks.
37 The nu parameter in nu-SVM.
38 For further details see Hoehler, F.K. (2000). Bias and prevalence effects on kappa viewed in terms of sensitivity and specificity. J Clin Epidemiol 53, 499-503.
39 Gomes, A.L., et al. Classification of dengue fever patients based on gene expression data using support vector machines. PloS One 5.6 (2010): e11267.
40 Huang, Wei, Yoshiteru Nakamori, and Shou-Yang Wang. Forecasting stock market movement direction with support vector machine. Computers & Operations Research 32.10 (2005): 2513-2522.
41 Min, Jae H., and Young-Chan Lee. Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters. Expert Systems with Applications 28.4 (2005): 603-614.
42 Upstill-Goddard, Rosanna, et al. Support vector machine classifier for estrogen receptor positive and negative early-onset breast cancer. (2013): e68606.
43 Tehrany, Mahyat Shafapour, et al. Flood susceptibility assessment using GIS-based support vector machine model with different kernel types. Catena 125 (2015): 91-101.
44 Calculated by the author using the average of all four support vector machines.
45 Radhika, K.R., M.K. Venkatesha, and G.N. Sekhar. Off-line signature authentication based on moment invariants using support vector machine. Journal of Computer Science 6.3 (2010): 305.
46 Guo, Shuyu, Robyn M. Lucas, and Anne-Louise Ponsonby. A novel approach for prediction of vitamin D status using support vector regression. PloS One 8.11 (2013).
47 Latitude, ambient ultraviolet radiation levels, ambient temperature, hours in the sun 6 weeks before the blood draw (log transformed to improve the linear fit), frequency of wearing shorts in the last summer, physical activity (three levels: mild, moderate, vigorous), sex, hip circumference, height, left back shoulder melanin density, buttock melanin density and inner upper arm melanin density.
48 For further details see any of the following:
1. Cawley, G.C. Leave-one-out cross-validation based model selection criteria for weighted LS-SVMs. In International Joint Conference on Neural Networks, IEEE, 2006, p. 1661-1668.
2. Vapnik, V., Chapelle, O. Bounds on error expectation for support vector machines. Neural Computation 2000;12(9):2013-2036.
3. Muller, K.R., Mika, S., Ratsch, G., Tsuda, K., Scholkopf, B. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks.
49 See:
1. Bach, F.R., Lanckriet, G.R., Jordan, M.I. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the twenty-first international conference on Machine learning, ACM, 2004, p. 6-13.
2. Zien, A., Ong, C.S. Multiclass multiple kernel learning. In Proceedings of the 24th international conference on Machine learning, ACM, 2007, p. 1191-1198.
50 Thadani, K., Jayaraman, V., Sundararajan, V. Evolutionary selection of kernels in support vector machines. In International Conference on Advanced Computing and Communications, IEEE, 2006, p. 19-24.
51 See for example:
1. Lin, Shih-Wei, et al. Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert Systems with Applications 35.4 (2008): 1817-1824.
2. Melgani, Farid, and Yakoub Bazi. Classification of electrocardiogram signals with support vector machines and particle swarm optimization. IEEE Transactions on Information Technology in Biomedicine 12.5 (2008): 667-677.
52 For further discussion on the issues surrounding over fitting see G.C. Cawley and N.L.C. Talbot. Preventing over-fitting in model selection via Bayesian regularization of the hyper-parameters. Journal of Machine Learning Research, volume 8, pages 841-861, April 2007.
Part III
Relevance Vector Machine
The Basic Idea
The relevance vector machine (RVM) shares its functional form with the support vector machine (SVM) discussed in Part II. RVMs exploit a probabilistic Bayesian learning framework.
We begin with a data set of N training pairs {xi, yi}, where xi is the input feature vector and yi is the target output. The RVM makes predictions using

    yi = wT K + ε,    (26.1)

where w = [w1, ..., wN] is the vector of weights, K = [k(xi, x1), ..., k(xi, xN)]T is the vector of kernel functions, and ε is the error, which for algorithmic simplicity is assumed to be zero-mean, independently identically distributed Gaussian with variance σ2. Therefore the prediction yi consists of the target output polluted by Gaussian noise.

The Gaussian likelihood of the data is given by

    p(y | w, σ2) = (2π)^(-N/2) σ^(-N) exp( -‖y - Φw‖2 / (2σ2) ),    (26.2)

where y = [y1, ..., yN], w = [w0, w1, ..., wN], and Φ is an N × (N + 1) matrix with Φij = k(xi, xj-1) and Φi1 = 1.
A standard approach to estimating the parameters in equation 26.2 is to use a zero-mean Gaussian prior, introducing an individual hyperparameter αi on each of the weights wi, so that

    wi ~ N(0, 1/αi),    (26.3)

where α = [α1, ..., αN]. The prior and posterior probability distributions are then easily derived.53
The RVM uses far fewer kernel functions than the SVM. It also yields probabilistic predictions, automatic estimation of parameters, and the ability to use arbitrary kernel functions. The majority of parameters are automatically set to zero during the learning process, giving a procedure that is extremely effective at discerning basis functions which are relevant for making good predictions.
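To make equation 26.1 concrete, here is a toy Python sketch of the prediction step only: given a kernel and a learned weight vector (most entries zero after RVM training), the prediction at a new point is a bias plus a weighted sum of kernel evaluations against the training inputs. The weights and data are invented for illustration, and no actual RVM training is performed:

```python
import math

def gauss_kernel(a, b, width=1.0):
    # k(a, b) = exp(-(a - b)^2 / (2 * width^2))
    return math.exp(-((a - b) ** 2) / (2 * width ** 2))

# Hypothetical 1-D training inputs and "learned" weights (w0 is the bias).
# Sparsity: only two non-zero weights survive -- the relevance vectors.
x_train = [0.0, 1.0, 2.0, 3.0, 4.0]
w0 = 0.1
w = [0.0, 0.8, 0.0, -0.5, 0.0]

def rvm_predict(x_new):
    # Mean prediction: y = w0 + sum_i w_i * k(x_new, x_i)
    return w0 + sum(wi * gauss_kernel(x_new, xi)
                    for wi, xi in zip(w, x_train))

print(round(rvm_predict(1.0), 4))  # 0.8323
```

Because only the non-zero weights contribute, prediction cost scales with the number of relevance vectors rather than with the full training set.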
NOTE
The RVM requires the inversion of N × N matrices. Since it takes O(N3) operations to invert an N × N matrix, it can quickly become computationally expensive, and therefore slow, as the sample size increases.
Practical Applications
PRACTITIONER TIP
The relevance vector machine is often assessed using the root mean squared error (RMSE) or the Nash-Sutcliffe efficiency (NS). The larger the value of NS, the smaller the value of RMSE:

    RMSE = sqrt( Σ (xi - x̂i)2 / N ),    (26.4)

    NS = 1 - Σ (xi - x̂i)2 / Σ (xi - x̄)2,    (26.5)

where the sums run over i = 1, ..., N, x̂i is the predicted value, x̄ is the average of the sample, and N is the number of observations.
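Equations 26.4 and 26.5 take only a few lines to implement; a Python sketch, using invented observed/predicted values:

```python
import math

def rmse(obs, pred):
    # Root mean squared error, equation 26.4
    n = len(obs)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / n)

def nash_sutcliffe(obs, pred):
    # Nash-Sutcliffe efficiency, equation 26.5
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - ss_res / ss_tot

# Hypothetical values for illustration only
obs = [10.0, 12.0, 14.0, 16.0, 18.0]
pred = [11.0, 12.0, 13.0, 17.0, 18.0]
print(round(rmse(obs, pred), 3), round(nash_sutcliffe(obs, pred), 3))
# 0.775 0.925
```

A perfect model has RMSE 0 and NS 1; an NS at or below 0 means the model predicts no better than the sample mean.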
Oil Sand Pump Prognostics
In wet mineral processing operations, slurry pumps deliver a mixture of bitumen, sand and small pieces of rock from one site to another. Due to the harsh environment in which they operate, these pumps can fail suddenly. The consequent downtime can lead to a large economic cost for the mining company due to the interruption of the mineral processing operations. Hu and Tse54 combine a RVM with an exponential model to predict the remaining useful life (RUL) of slurry pumps.

Data were collected from the inlet and outlet of slurry pumps operating in an oil sand mine, using four accelerometers placed at key locations on the pump. In total the pump was subjected to 904 measurement hours.
Two data-sets were collected from different positions (Site 1 and Site 2) in the same pump, and an alternative model of the pump degradation was developed using the sum of two exponential functions. The overall performance results are shown in Table 13.

  Site 1   Site 1              Site 2   Site 2
  RVM      Exponential Model   RVM      Exponential Model
  70.51    25.31               28.85    7.05

Table 13: Hu and Tse's weighted average accuracy of prediction
Precision Agriculture
Precision agriculture involves using aircraft or spacecraft to gather high-resolution information on crops to better manage the growing and harvesting cycle. Chlorophyll concentration, measured in mass per unit leaf area (μg cm−2), is an important biophysical measurement retrievable from air or space reflectance data. Elarab et al.55 use a RVM to estimate spatially distributed chlorophyll concentrations.

Data was gathered from three aircraft flights during the growing season (early growth, mid growth and early flowering). The data set consisted of the dependent variable (chlorophyll concentration) and 8 attribute or explanatory variables. All attributes were used to train and test the model.

The researchers used six different kernels (Gauss, Laplace, spline, Cauchy, thin plate spline (tps) and bubble) alongside a 5-fold cross validation. A variety of kernel widths ranging from 10−5 to 105 were also used. The root mean square error and Nash-Sutcliffe efficiency were used to assess model fit. The best fitting trained model used a tps kernel and had a root mean square error of 5.31 μg cm−2 and a Nash-Sutcliffe efficiency of 0.76.

The test data consisted of a sample gathered from a flight unseen by the RVM. The model using this data had a root mean square error of 8.52 μg cm−2 and a Nash-Sutcliffe efficiency of 0.71. The researchers conclude "This result showed that the [RVM] model successfully performed when given unseen data".
Deception Detecting in SpeechZhou56 develop a technique to detect deception in speech by extracting speechdynamics (prosodic and non-linear features) and applying a RVM
206
The data set consisted of recorded interviews of 16 male and 16 femaleparticipants A total of 640 deceptive samples and 640 non-deceptive sampleswere used in the analysis
Speech dynamics such as pitch frequency short-term vocal energy andmei-frequency cepstrum coefficients57 were used as input attribute features
Classification accuracy of the RVM was assessed relative to a supportvector machine and a neural network for various training sample sizes Forexample with a training sample of 400 the RVM correctly classifies 7037of male voices whilst the support vector machine and neural network correctlyclassify 6814 and 4213 respectively
The researchers observe that a combination of prosodic and non-linear features, modeled using a RVM, is effective for detecting deceptive speech.
Diesel Engine Performance

The mathematical form of diesel engines is highly nonlinear. Because of this they are often modeled using an artificial neural network (ANN). Wong et al.58 perform an experiment to assess the ability of an ANN and a RVM to predict diesel engine performance.
Three inputs are used in the models (engine speed, load and cooling water temperature), and the output variables are brake specific fuel consumption and exhaust emissions such as nitrogen oxide and carbon dioxide.
A water-cooled 4-cylinder direct-injection diesel engine was used to generate data for the experiment. Data was recorded at five engine speeds (1200, 1400, 1600, 1800 and 2000 rpm) with engine torques of 28, 70, 140, 210 and 252 Nm. Each test was carried out three times and the average values were used in the analysis. In all, 22 sets of data were collected, with 18 used as the training data and 4 used to assess model performance.
The ANN had a single hidden layer with twenty hidden neurons and a hyperbolic tangent sigmoid transfer function.
For the training data the researchers report an average root mean square error (RMSE) of 3.27 for the RVM and 41.59 for the ANN. The average RMSE for the test set was 17.73 and 38.56 for the RVM and ANN respectively. The researchers also note that the average R2 for the test set was 0.929 for the RVM and 0.707 for the ANN.
Gas Layer Classification

Zhao et al.59 consider the issue of optimal parameter selection for a RVM and the identification of gas at various drill depths. To optimize the parameters of a RVM they use the particle swarm algorithm.60

The sample consists of 201 wells drilled at various depths, with 63 gas producing wells and 138 non-gas producing wells. A total of 12 logging attributes are used as features in the RVM model. A prediction accuracy of 93.53% is obtained during training using all the attributes; the prediction accuracy was somewhat lower at 91.75% for the test set.
Credit Risk Assessment

Tong and Li61 assess the use of a RVM and a support vector machine (SVM) in the assessment of company credit risk. The data consist of financial characteristics of 464 Chinese firms listed on China's securities markets. A total of 116 of the 464 firms had experienced a serious credit event and were coded as "0" by the researchers. The remaining firms were coded "1".
The attribute vector consisted of 25 financial ratios drawn from seven basic categories (cash flow, return on equity, earning capacity, operating capacity, growth capacity, short term solvency and long term solvency). Since many of these financial ratios have a high correlation with each other, the researchers used principal components analysis (PCA) and isomaps to reduce the dimensionality. The models they then compared were a PCA-RVM, an Isomap-SVM and an Isomap-RVM. A summary of the results is reported in Table 14.
Model          Accuracy
Isomap-RVM     90.28%
Isomap-SVM     86.11%
PCA-RVM        89.59%

Table 14: Summary of Tong and Li's overall prediction accuracy
PRACTITIONER TIP

You may have noticed that several researchers compared their RVM to a support vector machine or other model. A data scientist I worked with faced an issue where one model (logistic regression) performed marginally better than an alternative model (decision tree). However, the decision was made to go with the decision tree because, due to a quirk in the software, it was better labeled.

Zhou et al. demonstrated the superiority of a RVM over a support vector machine for their data set. The evidence was less compelling in the case of Tong and Li, where Table 14 appears to indicate a fairly close race. Where alternative models perform in a similar range, the rationale for which to choose may rest on considerations unrelated to which model performed best in absolute terms on the test set. This was certainly the case in the choice between logistic regression and decision trees faced by my data scientist co-worker.
Technique 27
RVM Regression
A RVM regression can be estimated using the package kernlab with the rvm function:

rvm(y ~ ., data, kernel, kpar, ...)

Key parameters include kernel, the kernel function; kpar, which contains the kernel parameter values; the response variable y; and the covariates data.
Step 1: Load Required Packages

We build the RVM regression using the data frame bodyfat contained in the TH.data package. For additional details on this sample see page 62.

library(kernlab)
data(bodyfat, package = "TH.data")
Step 2: Prepare Data & Tweak Parameters

We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations.

set.seed(465)
train <- sample(1:71, 45, FALSE)
Step 3: Estimate & Evaluate Model

We estimate the RVM using a radial basis kernel (kernel = "rbfdot") with the parameter sigma equal to 0.5. We also set the cross validation parameter cross = 10.

> fit <- rvm(DEXfat ~ ., data = bodyfat[train, ],
    kernel = "rbfdot", kpar = list(sigma = 0.5),
    cross = 10)
The print function provides details of the estimated model, including the kernel type (radial basis), kernel parameter (sigma = 0.5), number of relevance vectors (43) and the cross validation error.

> print(fit)
Relevance Vector Machine object of class "rvm"
Problem type: regression

Gaussian Radial Basis kernel function.
 Hyperparameter : sigma = 0.5

Number of Relevance Vectors : 43
Variance : 12.11944
Training error : 2.4802080247
Cross validation error : 10.79221
Since the output of rvm is an S4 object, its slots can be accessed directly using "@":

> fit@error
[1] 2.480208

> fit@cross
[1] 10.79221
OK, let's fit two other models and choose the one with the lowest cross validation error. The first model, fit1, is estimated using the default settings of rvm. The second model, fit2, uses a Laplacian kernel. Ten-fold cross validation is used for both models.

> fit1 <- rvm(DEXfat ~ ., data = bodyfat[train, ],
    cross = 10)

> fit2 <- rvm(DEXfat ~ ., data = bodyfat[train, ],
    kernel = "laplacedot", kpar = list(sigma = 0.001),
    cross = 10)
Now we are ready to assess the fit of each of the three models: fit, fit1 and fit2.

> fit@cross
[1] 10.82592

> fit1@cross
[1] 6.550046

> fit2@cross
[1] 2.19755
The model fit2, using a Laplacian kernel, has by far the best overall cross validation error.
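This comparison can be made programmatically. A base-R sketch, using the three cross validation errors reported above:

```r
# Cross validation errors of the three candidate models (from the output above)
cv_errors <- c(fit = 10.82592, fit1 = 6.550046, fit2 = 2.19755)

# Select the model with the smallest cross validation error
best_model <- names(which.min(cv_errors))
best_model   # "fit2", the Laplacian kernel model
```

With many candidate kernels the same pattern scales: store each model's @cross slot in a named vector and pick the minimum.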
Step 4: Make Predictions

The predict method with the test data and the fitted model fit2 is used as follows:

pred <- predict(fit2, bodyfat[-train, ])
We fit a linear regression using pred as the response variable and the observed values as the covariate. The regression line, alongside the predicted and observed values shown in Figure 27.1, is visualized using the plot method combined with the abline method to show the linear regression line.

> linReg <- lm(pred ~ bodyfat$DEXfat[-train])

> plot(bodyfat$DEXfat[-train], pred,
    xlab = "DEXfat",
    ylab = "Predicted Values",
    main = "Training Sample Model Fit")

> abline(linReg, col = "darkred")
The squared correlation between the test sample predicted and observed values is 0.813.

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
      [,1]
[1,] 0.813
Figure 27.1: RVM regression of observed and fitted values using bodyfat
Notes

53For further details see:

1. M. E. Tipping, "Bayesian inference: an introduction to principles and practice in machine learning," in Advanced Lectures on Machine Learning, O. Bousquet, U. von Luxburg, and G. Rätsch, Eds., pp. 41-62, Springer, Berlin, Germany, 2004.

2. M. E. Tipping, "SparseBayes: An Efficient Matlab Implementation of the Sparse Bayesian Modelling Algorithm (Version 2.0)," March 2009, http://www.relevancevector.com.

54Hu, Jinfei, and Peter W. Tse. "A relevance vector machine-based approach with application to oil sand pump prognostics." Sensors 13.9 (2013): 12663-12686.

55Elarab, Manal, et al. "Estimating chlorophyll with thermal and broadband multispectral high resolution imagery from an unmanned aerial system using relevance vector machines for precision agriculture." International Journal of Applied Earth Observation and Geoinformation (2015).

56Zhou, Yan, et al. "Deception detecting from speech signal using relevance vector machine and non-linear dynamics features." Neurocomputing 151 (2015): 1042-1052.

57Mel-frequency cepstrum is a representation of the short-term power spectrum of a sound, commonly used as features in speech recognition systems. See, for example, Logan, Beth. "Mel Frequency Cepstral Coefficients for Music Modeling." ISMIR, 2000; and Hasan, Md. Rashidul, Mustafa Jamil, Md. Golam Rabbani, and Md. Saifur Rahman. "Speaker identification using Mel frequency cepstral coefficients." Variations 1 (2004): 4.

58Wong, Ka In, Pak Kin Wong, and Chun Shun Cheung. "Modelling and prediction of diesel engine performance using relevance vector machine." International Journal of Green Energy 12.3 (2015): 265-271.

59Zhao, Qianqian, et al. "Relevance Vector Machine and Its Application in Gas Layer Classification." Journal of Computational Information Systems 9.20 (2013): 8343-8350.

60See Haiyan Lu, Pichet Sriyanyong, Yong Hua Song, and Tharam Dillon. "Experimental study of a new hybrid PSO with mutation for economic dispatch with non-smooth cost function." International Journal of Electrical Power and Energy Systems 32(9), 2012: 921-935.

61Tong, Guangrong, and Siwei Li. "Construction and Application Research of Isomap-RVM Credit Assessment Model." Mathematical Problems in Engineering (2015).
Part IV
Neural Networks
The Basic Idea
An artificial neural network (ANN) is constructed from a number of interconnected nodes known as neurons. These are arranged into an input layer, a hidden layer and an output layer. The number of input nodes corresponds to the number of features you wish to feed into the ANN, and the number of output nodes corresponds to the number of items you wish to predict. Figure 27.2 presents an overview of an artificial neural network topology. It has 2 input nodes, 1 hidden layer with 3 nodes and 1 output node.
Figure 27.2: A basic neural network
The Neuron

At the heart of a neural network is the neuron. Figure 27.3 outlines the workings of an individual neuron. A weight is associated with each arc into a neuron, and the neuron then sums all inputs according to

    S_j = Σ_{i=1}^{N} w_ij a_i + b_j    (27.1)

The parameter b_j represents the bias associated with the neuron. It allows the network to shift the activation function "upwards" or "downwards". This type of flexibility is important for successful machine learning.
Figure 27.3: An artificial neuron
Activation and Learning

A neural network is generally initialized with random weights. Once the network has been initialized it is then trained. Training consists of two elements: activation and learning.
• Step 1: An activation function f(S_j) is applied and the output is passed to the next neuron(s) in the network. The sigmoid function is a popular activation function:

    f(S_j) = 1 / (1 + exp(−S_j))    (27.2)

• Step 2: A learning "law" describes how the adjustments to the weights are made during training. The most popular learning law is backpropagation.
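Equations 27.1 and 27.2 can be combined in a few lines of R. A minimal sketch of a single neuron's forward pass (the weights, inputs and bias below are illustrative values, not taken from any fitted network):

```r
# Sigmoid activation (equation 27.2)
sigmoid <- function(s) 1 / (1 + exp(-s))

# A single neuron: weighted sum of inputs plus bias (equation 27.1),
# passed through the activation function
neuron <- function(a, w, b) sigmoid(sum(w * a) + b)

neuron(a = c(0.5, -0.2), w = c(0.8, 0.3), b = 0.1)
```

A whole layer is just this calculation repeated for each neuron, which is why network evaluation is usually written as a matrix product.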
The Backpropagation Algorithm
It consists of the following steps:

1. The network is presented with the input attributes and the target outcome.

2. The output of the network is compared to the actual known target outcome.

3. The weights and biases of each neuron are adjusted by a factor based on the derivative of the activation function, the differences between the network output and the actual target outcome, and the neuron outputs. Through this process the network "learns".

Two parameters are often used to speed up learning and prevent the system from being trapped in a local minimum. They are known as the learning rate and the momentum.
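The role of these two parameters is easiest to see in the generic delta rule with momentum. A sketch of one weight update (generic textbook form, not tied to any particular package; eta and alpha are the learning rate and momentum):

```r
# New step = gradient descent step plus a fraction of the previous step;
# eta (learning rate) scales the step, alpha (momentum) smooths it across
# iterations, helping the search roll through shallow local minima
update_weight <- function(w, grad, prev_step, eta = 0.1, alpha = 0.5) {
  step <- -eta * grad + alpha * prev_step
  list(w = w + step, step = step)
}

upd <- update_weight(w = 1, grad = 0.2, prev_step = 0)
upd$w   # 0.98: the weight moved against the gradient
```

A larger eta speeds learning but risks overshooting; momentum keeps some of the previous direction even when the current gradient is small.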
PRACTITIONER TIP

ANNs are initialized by setting random values to the weights and biases. One rule of thumb is to set the random values to lie in the range (−2/k to 2/k), where k is the number of inputs.
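The rule of thumb in the tip above is a one-liner in R (k = 6 here matches the six covariates used in the techniques that follow; an illustrative choice):

```r
k <- 6                                     # number of inputs
w0 <- runif(k, min = -2 / k, max = 2 / k)  # initial weights in (-2/k, 2/k)
```

Packages such as neuralnet handle initialization internally, so this is only needed when rolling your own network.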
How Many Nodes and Hidden Layers?

One of the very first questions asked about neural networks is how many nodes and layers should be included in the model. There are no fixed rules as to how many nodes to include in the hidden layer. However, as the number of hidden nodes increases, so does the time taken for the network to learn from the input data.
Practical Applications
Sheet Sediment Transport

Tayfur62 models sediment transport using artificial neural networks (ANNs). The sample consisted of the original experimental hydrological data of Kilinic & Richardson.63
A three-layer feed-forward artificial neural network with two neurons in the input layer, eight neurons in the hidden layer and one neuron in the output layer was built. The sigmoid function was used as the activation function in the training of the network, and the learning of the ANN was accomplished by the back-propagation algorithm. Values of 0.2 and −1.0 were assigned to the network weights and biases before starting the training process.
The ANN's performance was assessed relative to popular physical hydrological models (flow velocity, shear stress, stream power and unit stream power) across a combination of slope types (mild, steep and very steep) and rain intensities (low, high and very high).
The ANN outperformed the popular hydrological models for very high intensity rainfall on both steep and very steep slopes; see Table 15.
                     Mild slope   Steep slope   Very steep slope
Low intensity        physical     physical      ANN
High intensity       physical     physical      physical
Very high intensity  physical     ANN           ANN

Table 15: Which model is best? Model transition framework derived from Tayfur's analysis
PRACTITIONER TIP

Tayfur's results highlight an important issue facing the data scientist. Not only is it often a challenge to find the best model among competing candidates, it is even more difficult to identify a single model that works in all situations. A great solution offered by Tayfur is to have a model transition matrix; that is, determine which model(s) perform well under specific conditions, and then use the appropriate model for a given condition.
Stock Market Volatility

Mantri et al.64 investigate the performance of a multilayer perceptron relative to standard econometric models of stock market volatility (GARCH, Exponential GARCH, Integrated GARCH and the Glosten, Jagannathan and Runkle GARCH model).
Since the multilayer perceptron does not make assumptions about the distribution of stock market innovations, it is of interest to financial analysts and statisticians. The sample consisted of daily data collected on two Indian stock indices (the BSE SENSEX and the NSE NIFTY) over the period January 1995 to December 2008.
The researchers found no statistical difference between the volatility of the stock indices estimated by the multilayer perceptron and the standard econometric models of stock market volatility.
Trauma Survival

The performance of trauma departments in the United Kingdom is widely audited by applying predictive models that assess the probability of survival and examining the rate of unexpected survivals and deaths. The standard approach is the TRISS methodology, which consists of two logistic regressions: one applied if the patient has a penetrating injury, and the other applied for blunt injuries.65
Hunter, Henry and Ferguson66 assess the performance of the TRISS methodology against alternative logistic regression models and a multilayer perceptron.
The sample consisted of 15,055 cases gathered from Scottish trauma departments over the period 1992-1996. The data was divided into two subsets: the training set, containing 7,224 cases from 1992-1994, and the test set, containing 7,831 cases gathered from 1995-1996. The researchers' logistic regression models and the neural network were optimized using the training set.
The neural network was optimized ten times, with the best resulting model selected. The researchers conclude that neural networks can yield better results than logistic regression.
PRACTITIONER TIP

The sigmoid function is a popular choice as an activation function. It is good practice to standardize (i.e. convert to the range (0,1)) all external input values before passing them into a neural network. This is because, without standardization, large input values require very small weighting factors. This can cause two basic problems:

1. Inaccuracies introduced by very small floating point calculations on your computer.

2. Changes made by the back-propagation algorithm will be extremely small, causing training to be slow (the gradient of the sigmoid function at extreme values is approximately zero).
Brown Trout Redds

Lek et al.67 compare the ability of multiple regression and neural networks to predict the density of brown trout redds in southwest France. Twenty-nine observation stations, distributed on six rivers and divided into 205 morphodynamic units, collected information on 10 ecological metrics.
The models were fitted using all the ecological variables and also with a subset of four variables. Testing consisted of random selection of the training set (75% of observations) and the test set (25% of observations). The process was repeated a total of five times.
The average correlation between the observed and estimated values over the five samples is reported in Table 16. The researchers conclude that both multiple regression and neural networks can be used to predict the density of brown trout redds; however, the neural network model had better prediction accuracy.
Neural Network        Multiple Regression
Train     Test        Train     Test
0.900     0.886       0.684     0.609

Table 16: Lek et al's reported correlation coefficients between estimated and observed values in training and test samples
Electric Fish Localization

Weakly electric fish emit an electric discharge used to navigate the surrounding water and to communicate with other members of their shoal. Tracking of individual fish is often carried out using infrared cameras. However, this approach becomes unreliable when there is a visual obstruction.68
Kiar et al.69 develop a non-invasive means of tracking weakly electric fish in real time using a cascade forward neural network. The data set contained 299 data points, which were interpolated to 29,900 data points. The neural network was accurate to within 1 cm of the actual fish location 94.3% of the time, with a mean square error of 0.02 mm and an image frame rate of 213 Hz.
Chlorophyll Dynamics

Wu et al.70 developed two modeling approaches, artificial neural networks (ANN) and multiple linear regression (MLR), to simulate the daily Chlorophyll a dynamics in a northern German lowland river. Chlorophyll absorbs sunlight to synthesize carbohydrates from CO2 and water. It is often used as a proxy for the amount of phytoplankton present in water bodies.
Daily Chlorophyll a samples were taken over an 18 month period. In total 426 daily samples were obtained. Every 10th daily sample was assigned to the validation set, resulting in 42 daily observations. The calibration set contained 384 daily observations.
For ANN modelling a three layer back propagation neural network was used. The input layer consisted of 12 neurons corresponding to the independent variables shown in Table 17. The same independent variables were also used in the multiple regression model. The dependent variable in both models was the daily concentration of Chlorophyll a.
Air temperature                 Ammonium nitrogen
Average daily discharge         Chloride
Chlorophyll a concentration     Daily precipitation
Dissolved inorganic nitrogen    Nitrate nitrogen
Nitrite nitrogen                Orthophosphate phosphorus
Sulfate                         Total phosphorus

Table 17: Wu et al's input variables
The results of the ANN and MLR illustrate a good agreement between the observed and predicted daily concentration of Chlorophyll a; see Table 18.
Model   R-Square   NS     RMSE
MLR     0.53       0.53   2.75
NN      0.63       0.62   1.94

Table 18: Wu et al's performance metrics (NS = Nash-Sutcliffe efficiency; RMSE = root mean square error)
PRACTITIONER TIP

Whilst there are no fixed rules about how to standardize inputs, here are four popular choices for your original input variable x_i:

    z_i = (x_i − x_min) / (x_max − x_min)    (27.3)

    z_i = (x_i − x̄) / σ_x    (27.4)

    z_i = x_i / √SS_i    (27.5)

    z_i = x_i / (x_max + 1)    (27.6)

SS_i is the sum of squares of x_i, and x̄ and σ_x are the mean and standard deviation of x_i.
Examples of Neural Network Classification
Technique 28
Resilient Backpropagation with Backtracking
A neural network with resilient backpropagation and backtracking can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "rprop+")

Key parameters include the response variable y; the covariates data; hidden, the number of hidden neurons; and algorithm = "rprop+" to specify resilient backpropagation with backtracking.
Step 1: Load Required Packages

The neural network is built using the data frame PimaIndiansDiabetes2 contained in the mlbench package. For additional details on this data set see page 52.

> library(neuralnet)
> data(PimaIndiansDiabetes2, package = "mlbench")
Step 2: Prepare Data & Tweak Parameters

The PimaIndiansDiabetes2 data frame has a large number of missing values (recorded as NA), particularly for the attributes insulin and triceps. We remove these two attributes from the sample and use the na.omit method to remove any remaining observations with missing values. The cleaned data is stored in temp.

> temp <- PimaIndiansDiabetes2
> temp$insulin <- NULL
> temp$triceps <- NULL
> temp <- na.omit(temp)
Next we convert the response variable to numeric, recombine it with the attributes and use the scale method to standardize the matrix temp.

> y <- temp$diabetes
> levels(y) <- c(0, 1)

> y <- as.numeric(as.character(y))
> y <- as.data.frame(y)

> names(y) <- c("diabetes")

> temp$diabetes <- NULL
> temp <- cbind(temp, y)
> temp <- scale(temp)

Now we can select the training sample. We choose to use 600 out of the 724 observations. The variable f is used to store the formula of the model, where diabetes is the dependent or response variable. Be sure to check that n is equal to 724, the number of observations in the full sample.

> set.seed(103)
> n = nrow(temp)

> train <- sample(1:n, 600, FALSE)

> f <- diabetes ~ pregnant + glucose + pressure + mass + pedigree + age
Step 3: Estimate & Evaluate Model

The model is fitted using neuralnet with four hidden neurons.

> fit <- neuralnet(f, data = temp[train, ],
    hidden = 4,
    algorithm = "rprop+")

The print method gives a nice overview of the model.

> print(fit)
Call: neuralnet(formula = f, data = temp[train, ],
    hidden = 4, algorithm = "rprop+")

1 repetition was calculated.

        Error  Reached Threshold  Steps
1  181.242328     0.009057962229   8448
A nice feature of the neuralnet package is the ability to visualize a fitted network using the plot method; see Figure 28.1.

> plot(fit, intercept = FALSE, show.weights = FALSE)

PRACTITIONER TIP

It often helps intuition to visualize data. To see the fitted intercepts set intercept = TRUE, and to see the estimated neuron weights set show.weights = TRUE. For example, try entering:

plot(fit, intercept = TRUE, show.weights = TRUE)
Figure 28.1: Resilient backpropagation and backtracking neural network using PimaIndiansDiabetes2 (inputs: pregnant, glucose, pressure, mass, pedigree, age; output: diabetes; Error: 181.242328, Steps: 8448)
Step 4: Make Predictions

We transfer the data into a variable called z, drop the response column and use this with the compute method and the test sample.

> z <- temp
> z <- z[, -7]
> pred <- compute(fit, z[-train, ])
The actual predictions should look something like this:

> sign(pred$net.result)
     [,1]
4      -1
5      -1
7      -1
12      1
17      1
20      1
...
Let's create a confusion matrix so we can see how well the neural network performed on the test sample.

> table(sign(pred$net.result),
    sign(temp[-train, 7]),
    dnn = c("Predicted", "Observed"))

         Observed
Predicted -1  1
       -1 61  6
        1 20 37

We also need to calculate the error rate.

> error_rate = (1 - sum(sign(pred$net.result) ==
    sign(temp[-train, 7])) / 124)
> round(error_rate, 2)
[1] 0.21

The misclassification error rate is around 21%.
Technique 29
Resilient Backpropagation
A neural network with resilient backpropagation without backtracking can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "rprop-")

Key parameters include the response variable y; the covariates data; hidden, the number of hidden neurons; and algorithm = "rprop-" to specify resilient backpropagation without backtracking.
Step 3: Estimate & Evaluate Model

Steps 1 and 2 are outlined beginning on page 226.

The model is fitted using neuralnet with four hidden neurons.

> fit <- neuralnet(f, data = temp[train, ],
    hidden = 4,
    algorithm = "rprop-")

The print method gives a nice overview of the model. The error is close to that observed for the neural network model estimated with backtracking, outlined on page 226.

> print(fit)

Call: neuralnet(formula = f, data = temp[train, ],
    hidden = 4, algorithm = "rprop-")

1 repetition was calculated.
        Error  Reached Threshold  Steps
1  184.0243459    0.009715675095   6814
Step 4: Make Predictions

We transfer the data into a variable called z, drop the response column and use this with the compute method and the test sample.

> z <- temp
> z <- z[, -7]
> pred <- compute(fit, z[-train, ])

The actual predictions should look something like this:

> sign(pred$net.result)
     [,1]
4      -1
5       1
7      -1
12      1
17      1
...
Let's create a confusion matrix so we can see how well the neural network performed on the test sample.

> table(sign(pred$net.result),
    sign(temp[-train, 7]),
    dnn = c("Predicted", "Observed"))

         Observed
Predicted -1  1
       -1 62  4
        1 19 39

We also need to calculate the error rate.

> error_rate = (1 - sum(sign(pred$net.result) ==
    sign(temp[-train, 7])) / 124)
> round(error_rate, 2)
[1] 0.19

The misclassification error rate is around 19%.
Technique 30
Smallest Learning Rate
A neural network using the smallest learning rate can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "slr")

Key parameters include the response variable y; the covariates data; hidden, the number of hidden neurons; and algorithm = "slr" to use a globally convergent algorithm based on resilient backpropagation without weight backtracking that additionally uses the smallest learning rate.
Step 3: Estimate & Evaluate Model

Steps 1 and 2 are outlined beginning on page 226.

The model is fitted using neuralnet with four hidden neurons.

> fit <- neuralnet(f, data = temp[train, ],
    hidden = 4,
    algorithm = "slr")

The print method gives a nice overview of the model. The error is close to that observed for the neural network model estimated with backtracking, outlined on page 226; however, due to the small learning rate, the algorithm takes very many more steps to converge.

> print(fit)

Call: neuralnet(formula = f, data = temp[train, ],
    hidden = 4, algorithm = "slr")
233
92 Applied Predictive Modeling Techniques in R
1 repetition was calculated.

        Error  Reached Threshold  Steps
1  179.7865898    0.009813138137  96960
Step 4: Make Predictions

We transfer the data into a variable called z, drop the response column and use this with the compute method and the test sample.

> z <- temp
> z <- z[, -7]
> pred <- compute(fit, z[-train, ])

The actual predictions should look something like this:

> sign(pred$net.result)
     [,1]
4      -1
5      -1
7      -1
12      1
17      1
20     -1
...
Let's create a confusion matrix so we can see how well the neural network performed on the test sample.

> table(sign(pred$net.result),
    sign(temp[-train, 7]),
    dnn = c("Predicted", "Observed"))

         Observed
Predicted -1  1
       -1 58 10
        1 23 33

We also need to calculate the error rate.

> error_rate = (1 - sum(sign(pred$net.result) ==
    sign(temp[-train, 7])) / 124)
> round(error_rate, 2)
[1] 0.27

The misclassification error rate is around 27%.
Technique 31
Probabilistic Neural Network
A probabilistic neural network can be estimated using the package pnn with the learn function:

learn(set, ...)

Key parameters include set, a data frame whose first column contains the response variable (the category) and whose remaining columns contain the covariates.
Step 1: Load Required Packages

The neural network is built using the data frame PimaIndiansDiabetes2 contained in the mlbench package. For additional details on this data set see page 52.

> library(pnn)
> data(PimaIndiansDiabetes2, package = "mlbench")

Step 2: Prepare Data & Tweak Parameters

The PimaIndiansDiabetes2 data frame has a large number of missing values (recorded as NA), particularly for the attributes insulin and triceps. We remove these two attributes from the sample and use the na.omit method to remove any remaining observations with missing values. The cleaned data is stored in temp.

> temp <- PimaIndiansDiabetes2
> temp$insulin <- NULL
> temp$triceps <- NULL
> temp <- na.omit(temp)
235
92 Applied Predictive Modeling Techniques in R
Next we separate the response variable from the attributes and use the scale method to standardize the attribute matrix temp.

> y <- temp$diabetes

> temp$diabetes <- NULL
> temp <- scale(temp)
> temp <- cbind(as.factor(y), temp)
Now we can select the training sample. We choose to use 600 out of the 724 observations.

> set.seed(103)
> n = nrow(temp)

> n_train <- 600
> n_test <- n - n_train

> train <- sample(1:n, n_train, FALSE)
Step 3: Estimate & Evaluate Model

The model is fitted using the learn method, with the fitted model stored in fit_basic.

> fit_basic <- learn(data.frame(y[train], temp[train, ]))

You can use the attributes method to identify the slots (characteristics) of fit_basic.

> attributes(fit_basic)
$names
[1] "model"           "set"             "category.column" "categories"
[5] "k"               "n"
PRACTITIONER TIP

Remember you can access the contents of a fitted probabilistic neural network by using the $ notation. For example, to see what is in the "model" slot you would type:

> fit_basic$model
[1] "Probabilistic neural network"
The summary method provides details on the model.

> summary(fit_basic)
                Length Class      Mode
model           1      -none-     character
set             8      data.frame list
category.column 1      -none-     numeric
categories      2      -none-     character
k               1      -none-     numeric
n               1      -none-     numeric
Next we use the smooth method to set the smoothing parameter sigma. We use a value of 0.5.

> fit <- smooth(fit_basic, sigma = 0.5)
PRACTITIONER TIP

Much of the time you will not have a pre-specified value in mind for the smoothing parameter sigma. However, you can let the smooth function find the best value using its inbuilt genetic algorithm. To do that you would type something along the lines of:

> smooth(fit_basic)
The performance statistics of the fitted model are assessed using the perf method. For example, enter the following to see various aspects of the fitted model:

> perf(fit)
> fit$observed
> fit$guessed
> fit$success
> fit$fails
> fit$success_rate
> fit$bic
Step 4: Make Predictions

Let's take a look at the testing sample. To see the first row of covariates in the testing set enter:

> round(temp[-train, ][1, ], 2)
    pregnant glucose pressure
100    -0.85   -1.07    -0.52
        mass pedigree      age
       -0.63    -0.93    -1.05
You can see the first observation of the response variable in the test sample in a similar way.

> y[-train][1]
[1] neg
Levels: neg pos
Now let's predict the first response value in the test set using the covariates.

> guess(fit, as.matrix(temp[-train, ][1, ]))$category
[1] "neg"
Take a look at the associated probabilities. In this case there is a 99% probability associated with the neg class.

> guess(fit, as.matrix(temp[-train, ][1, ]))$probabilities
        neg         pos
0.996915706 0.003084294
Here is how to see both the prediction and the associated probabilities:

> guess(fit, as.matrix(temp[-train, ][1, ]))
$category
[1] "neg"

$probabilities
        neg         pos
0.996915706 0.003084294
OK, now we are ready to predict all the response values in the test sample. We can do this with a few lines of R code.

> pred <- 1:n_test
> for (i in 1:n_test) {
    pred[i] <- guess(fit,
      as.matrix(temp[-train, ][i, ]))$category
  }
Letrsquos create a confusion matrix so we can see how well the neural networkperformed on the test samplegt table( pred y[-train] dnn =c(Predicted Observed))
ObservedPredicted neg pos
neg 79 2pos 2 41
We also need to calculate the error rate:

> error_rate <- 1 - sum(pred == y[-train]) / n_test
> round(error_rate, 3)
[1] 0.032

The misclassification error rate is around 3%.
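As a cross-check, the same error rate can be recovered directly from the confusion matrix: the misclassifications are the off-diagonal counts. A base R sketch using the counts printed above:

```r
# Rebuild the confusion matrix from the printed counts
cm <- matrix(c(79, 2,
                2, 41),
             nrow = 2, byrow = TRUE,
             dimnames = list(Predicted = c("neg", "pos"),
                             Observed  = c("neg", "pos")))

# Error rate = off-diagonal total / grand total
error_rate <- (sum(cm) - sum(diag(cm))) / sum(cm)
round(error_rate, 3)  # 0.032
```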
Technique 32
Multilayer Feedforward Neural Network
A multilayer feedforward neural network can be estimated using the AMORE package with the train function:

train(net, P, T, error.criterium, ...)

Key parameters include net, the neural network you wish to train; P, the training set attributes; T, the training set response variable (output values); and the error criterion (Least Mean Squares, Least Mean Logarithm Squared, or TAO Error) contained in error.criterium.
Step 1: Load Required Packages

The neural network is built using the data frame PimaIndiansDiabetes2 contained in the mlbench package. For additional details on this data set see page 52.

> library(AMORE)
> data(PimaIndiansDiabetes2, package = "mlbench")
Step 2: Prepare Data & Tweak Parameters

The PimaIndiansDiabetes2 data frame has a large number of missing values (recorded as NA), particularly for the attributes insulin and triceps. We remove these two attributes from the sample and use the na.omit method to remove any remaining observations with missing values. The cleaned data is stored in temp.
> temp <- PimaIndiansDiabetes2
> temp$insulin <- NULL
> temp$triceps <- NULL
> temp <- na.omit(temp)
Next we need to convert the response variable and attributes into a matrix and then use the scale method to standardize the matrix temp:

> y <- temp$diabetes
> levels(y) <- c(0, 1)
> y <- as.numeric(as.character(y))
> names(y) <- c("diabetes")
> temp$diabetes <- NULL
> temp <- cbind(temp, y)
> temp <- scale(temp)
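A note on the as.numeric(as.character(y)) idiom used above: calling as.numeric directly on a factor returns the internal level codes (1, 2, ...), not the relabeled 0/1 values. A quick base R illustration:

```r
y <- factor(c("neg", "pos", "neg"))
levels(y) <- c(0, 1)           # relabel the levels as 0 and 1

as.numeric(y)                  # 1 2 1 -- internal level codes, not what we want
as.numeric(as.character(y))    # 0 1 0 -- the relabeled values
```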
Now we can select the training sample. We choose to use 600 out of the 724 observations:

> set.seed(103)
> n <- nrow(temp)
> train <- sample(1:n, 600, FALSE)
Now we need to create the neural network we wish to train:

> net <- newff(n.neurons = c(1, 3, 2, 1),
    learning.rate.global = 0.01,
    momentum.global = 0.5,
    error.criterium = "LMLS",
    Stao = NA,
    hidden.layer = "sigmoid",
    output.layer = "purelin",
    method = "ADAPTgdwm")
I'll explain the code above line by line. In the first line we're creating an object called net that will contain the structure of our new neural network. The first argument to newff is n.neurons, which allows us to specify the number of inputs, the number of nodes in each hidden layer, and the number of outputs. So in the example above we have 1 input, 2 hidden layers (the first containing 3 nodes and the second containing 2 nodes), and 1 output.
The learning.rate.global argument constrains how much the algorithm is allowed to change the weights from iteration to iteration as the network is trained. In this case learning.rate.global = 0.01, which means that the algorithm can't increase or decrease any one weight in the network by more than 0.01 from trial to trial.
The error.criterium argument specifies the error measure used at each iteration. There are three options: "LMS" (least mean squares), "LMLS" (least mean logarithm squared), and "TAO" (the TAO error method). In general I tend to choose "LMLS" as my starting point; however, I will often train my networks using all three methods.
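To see why LMLS is an attractive starting point, compare the per-residual penalties of the two criteria. The sketch below assumes the commonly used LMLS form log(1 + e^2/2); this is a hedged illustration, so check the AMORE documentation for the package's exact definition:

```r
lms  <- function(e) e^2               # least mean squares penalty
lmls <- function(e) log(1 + e^2 / 2)  # least mean logarithm squared penalty

lms(0.1)   # 0.01   -- for small residuals both penalties are tiny
lmls(0.1)  # ~0.005 -- approximately e^2 / 2
lms(10)    # 100    -- a large residual dominates the squared-error criterion
lmls(10)   # ~3.93  -- LMLS damps the influence of outliers
```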
The hidden.layer and output.layer arguments choose the type of activation function used to transform the weighted sum of inputs at each layer of your network. I have set output.layer = "purelin", which results in a linear output. Other options include "tansig", "sigmoid", "hardlim", and "custom".
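For intuition, here are base R sketches of two of these activation functions (illustrative definitions, assuming sigmoid is the logistic function and purelin the identity, as is conventional):

```r
sigmoid <- function(x) 1 / (1 + exp(-x))  # squashes any input into (0, 1)
purelin <- function(x) x                  # identity: unbounded linear output

sigmoid(0)    # 0.5
sigmoid(5)    # ~0.993, saturating towards 1
purelin(-3)   # -3, passed through unchanged
```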
Finally, method specifies the solution strategy for converging on the weights within the network. For building prototype models I tend to use either "ADAPTgd" (adaptive gradient descent) or "ADAPTgdwm" (adaptive gradient descent with momentum).
Step 3: Estimate & Evaluate Model

The model is fitted using the train method. In this example I have set report = TRUE to provide output during the algorithm run, with show.step = 100 and n.shows = 5 so that the error is reported five times, once every 100 training epochs.
> fit <- train(net,
    P = temp[train, ],
    T = temp[train, 7],
    error.criterium = "LMLS",
    report = TRUE,
    show.step = 100,
    n.shows = 5)

index.show: 1 LMLS 0.337115972269707
index.show: 2 LMLS 0.335651051328758
index.show: 3 LMLS 0.335113569553075
index.show: 4 LMLS 0.334753676125557
index.show: 5 LMLS 0.334462044665089
Step 4: Make Predictions

The sign function is used to convert the network's continuous predictions into negative and positive classes. Here I use the fitted network held in fit to predict using the test sample.