TRANSCRIPT
Fitting Classification and
Regression Trees Using
Statgraphics and R
Presented by
Dr. Neil W. Polhemus
Classification and Regression Trees
• Machine learning methods used to construct
predictive models from data.
• Recursively partitions the data space using simple
binary decisions.
• Commonly portrayed as a tree with a split at each
decision node.
Example – Fisher’s Iris Data
[Tree diagram: classification tree for Species. The root splits on Petal.length<2.45, giving a pure setosa leaf (p=1.0); the remaining data are split on Petal.width<1.75, Petal.length<4.95 and Sepal.length<5.15, giving leaves versicolor (p=0.8), versicolor (p=1.0), virginica (p=0.666667), virginica (p=0.833333) and virginica (p=1.0).]
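A tree of this form can be reproduced in R with the "tree" package; a minimal sketch (assuming the package is installed) is:

  library(tree)
  fit <- tree(Species ~ ., data = iris)   # classification tree: Species is a factor
  plot(fit)                               # draw the tree
  text(fit, pretty = 0)                   # label the splits and leaves
  fit                                     # print splits, leaf classes and probabilities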
Basic Model Structure
• Y: variable to be predicted
– If categorical, we construct a classification tree.
– If continuous, we construct a regression tree.
• X1, X2, … Xp: predictor variables
– May be either categorical or continuous.
Partitioning Algorithm
• Start at the root node.
• Amongst all variables Xj, find the split that minimizes
the resulting average within-node deviance.
– For a continuous variable, the split is of the form
Xj < c.
– For a discrete variable, the split divides the possible
values into 2 distinct groups.
• If one or more stopping criteria are met after the split,
stop. Otherwise consider splitting the child nodes.
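The split search can be illustrated with a short, self-contained sketch for a single continuous predictor and a continuous response, where the within-node deviance is the sum of squared deviations from the node mean. This is only meant to mirror the idea, not the exact code used by Statgraphics or the "tree" package:

  # For each candidate cutpoint, compute the combined within-node deviance
  # of the two child nodes and keep the cutpoint that minimizes it.
  best_split <- function(x, y) {
    cuts <- sort(unique(x))
    cuts <- (head(cuts, -1) + tail(cuts, -1)) / 2          # midpoints between values
    dev  <- sapply(cuts, function(c) {
      sum((y[x <  c] - mean(y[x <  c]))^2) +
      sum((y[x >= c] - mean(y[x >= c]))^2)
    })
    list(cut = cuts[which.min(dev)], deviance = min(dev))
  }
  best_split(iris$Petal.Length, iris$Sepal.Length)          # best first split on one X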
RMS Titanic
Sample Data File
Source: Frank Harrell and Thomas Cason, University of Virginia. http://biostat.mc.vanderbilt.edu/twiki/pub/Main/DataSets/titanic.html
n=1,309 observations (passengers only)
Data Input
Analysis Options
Partitioning Options
Node Impurity
• Measure the impurity in a tree using the residual mean deviance (RMD), where
n = number of observations in the training set
k = number of leaves
$p_{i,j}$ = proportion of data of the same class as observation i at its assigned leaf j
$\hat{Y}_i$ = predicted value of observation i
– Classification trees: $\mathrm{RMD} = \dfrac{-2\sum_{i=1}^{n}\log p_{i,j}}{n-k}$
– Regression trees: $\mathrm{RMD} = \dfrac{\sum_{i=1}^{n}(Y_i-\hat{Y}_i)^2}{n-k}$
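In R, the same quantity can be computed directly from a fitted tree; a sketch using the iris tree, whose result should match the "Residual mean deviance" line printed by summary():

  library(tree)
  fit <- tree(Species ~ ., data = iris)
  n   <- nrow(iris)                        # observations in the training set
  k   <- sum(fit$frame$var == "<leaf>")    # number of leaves
  deviance(fit) / (n - k)                  # residual mean deviance
  summary(fit)                             # reports the same value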
Analysis Window
Decision Tree
[Tree diagram: classification tree for survived, with splits on sex=female, pclass=3, fare<23.0875, age<9.5, sibsp=3,4,5 and pclass=2,3, and Yes/No leaves for predicted survival.]
Decision Tree Options
Tree Structure
* Note: based on complete cases only.
Node Probabilities
* Note: based on complete cases only.
Classification Table
* Note: based on both complete and partial cases.
Training and Validation Sets
• May separate the data into 2 sets:
– Training set used to build the tree.
– Test set used to estimate the tree
misclassification percentages.
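A sketch of this split in R, assuming the Titanic data have been read into a data frame called titanic with the column names used above and with survived stored as a factor:

  library(tree)
  set.seed(1)
  train <- sample(nrow(titanic), size = round(0.7 * nrow(titanic)))   # 70% training set
  fit   <- tree(survived ~ sex + pclass + age + sibsp + parch + fare + embarked,
                data = titanic, subset = train)
  pred  <- predict(fit, newdata = titanic[-train, ], type = "class")
  table(observed = titanic$survived[-train], predicted = pred)        # test-set table
  mean(pred != titanic$survived[-train], na.rm = TRUE)                # misclassification rate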
Compare Results
Pruning Options
• Reduces the complexity of the tree by removing
branches.
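With the R "tree" package, pruning to a requested number of leaves is done with prune.misclass() for classification trees (or prune.tree() for regression trees); a small self-contained sketch using iris:

  library(tree)
  fit    <- tree(Species ~ ., data = iris)
  pruned <- prune.misclass(fit, best = 3)   # best subtree with 3 terminal nodes
  plot(pruned); text(pruned)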
Pruning by Cross-validation
• Runs a 10-fold cross-validation experiment. Builds
10 trees, leaving out 10% of the data each time,
and averages the results.
• Uses all of the observations for both training and
validation.
• Can be used to determine the optimal size for the tree by increasing the number of leaves until warnings indicate that the requested tree is too complex.
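In R this corresponds to cv.tree(); a sketch (again using iris so that it runs on its own):

  library(tree)
  fit <- tree(Species ~ ., data = iris)
  set.seed(1)                                 # folds are chosen at random
  cv  <- cv.tree(fit, FUN = prune.misclass, K = 10)
  cbind(size = cv$size, misclass = cv$dev)    # CV misclassifications per tree size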
Pruning Example
• First reduce within-node deviance to fit a complex
tree.
Pruning Example
[Tree diagram: the resulting unpruned tree for survived, with dozens of splits on sex, pclass, age, sibsp, parch, fare and embarked. It is far too complex to read, which motivates pruning.]
Pruning Example
• Now reduce the number of leaves to 10 and select
cross-validation.
[Tree diagram: the pruned tree with 10 leaves, splitting on sex=female, pclass=3, fare<23.0875, fare<32.0896, age<9.5, sibsp=3,4,5, pclass=2,3, age<32.25 and age<54.5, with Yes/No leaves.]
10 Leaves is Too Complex
Finding Optimal Number of Leaves
• Step 1: Copy script from Statgraphics to Word and
modify last line.
Finding Optimal Number of Leaves
• Step 2: Copy modified script to R and run it. Look
for size with minimum deviance.
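The modified script might look roughly like the following; the data-frame name and column names are assumptions based on the Titanic example, and the key line is the which.min() call that reports the size with minimum cross-validated deviance:

  library(tree)
  fit <- tree(survived ~ sex + pclass + age + sibsp + parch + fare + embarked,
              data = titanic)                 # titanic as loaded by the StatFolio
  cv  <- cv.tree(fit, FUN = prune.misclass, K = 10)
  plot(cv$size, cv$dev, type = "b",
       xlab = "number of leaves", ylab = "CV deviance")
  cv$size[which.min(cv$dev)]                  # size with minimum deviance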
Prune Tree
Final Tree
[Tree diagram: final pruned tree for survived. The root splits on sex=female; females are then split on pclass=3, and males on age<9.5 followed by sibsp=3,4,5, giving five Yes/No leaves.]
Predict Additional Cases
• To make predictions for additional cases, add
them to the bottom of the original data, leaving the
cell for Y blank.
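The R equivalent of leaving Y blank is simply calling predict() on the new rows; a self-contained sketch using iris (the same calls work for the Titanic tree):

  library(tree)
  fit    <- tree(Species ~ ., data = iris)
  newobs <- iris[c(1, 51, 101), 1:4]               # pretend these are additional cases
  predict(fit, newdata = newobs)                   # class probabilities
  predict(fit, newdata = newobs, type = "class")   # predicted class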
Predictions and Residuals
Example 2: World Bank Demographics
Decision Tree
[Tree diagram: regression tree for Life Expectancy, with splits on Fertility.Rate, GDP.per.Capita, Pop..Density, Female.Percentage and Age.Dependency.Ratio. Each leaf shows the predicted mean life expectancy, ranging from roughly 50 to 80 years.]
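A hedged sketch of how this regression tree could be fit in R; the data-frame name worldbank and the column names are assumptions taken from the plot labels:

  library(tree)
  fit2 <- tree(Life.Expectancy ~ Fertility.Rate + GDP.per.Capita + Pop..Density +
               Female.Percentage + Age.Dependency.Ratio, data = worldbank)
  plot(fit2); text(fit2)    # each leaf shows the predicted mean life expectancy
  summary(fit2)             # residual mean deviance for the regression tree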
References
• StatFolios and data files are at:
www.statgraphics.com/webinars
• R Package “tree” (2015) https://cran.r-project.org/web/packages/tree/tree.pdf
• Classic text: Breiman, L., Friedman, J., Stone, C.J. and Olshen, R.A. (1998) Classification and Regression Trees. Wadsworth.