Week 15 Lecture 30
CART
Class Prep
bugs_waterchem.csv
library(rpart)
library(mvpart)
HabUse.csv
data(iris)
HW: CART readings on webpage
The mvpart package !!??
Reading from Guthery
Class project
What’s in a name?
Structural modeling
CART
Classification and Regression Trees
Recursive partitioning
Decision Trees
Constrained cluster analysis
Introduction to CART
Structural modeling technique
Data mining (exploration / description)
Partitions response variable(s) based on the best predictor (with surrogates)
Very flexible, few assumptions
Great for complex data and relationships
Focuses on maximizing predictive ability rather than minimizing error
Introduction to CART
Recursively partitions the data into more and more homogeneous subsets based on certain levels of the predictor variables
Divisive, constrained cluster analysis
“Supervised” divisive cluster analysis
Outcomes
Decision tree
Data description, patterns, relationships
Characteristics of the response clusters
Predictive model
Introduction to CART
Used in medical, industrial, and business fields
Breiman et al. 1984. Classification and regression trees. Chapman and Hall, New York.
Fairly new to ecology
Important Introductory Papers in Ecology
De'ath G, Fabricius KE. 2000. Classification and regression trees: A powerful yet simple technique for ecological data analysis. Ecology 81: 3178-3192.
De'ath G. 2002. Multivariate regression trees: a new technique for modeling species-environment relationships. Ecology 83: 1105-1117.
Data Structure for CART
A response variable
– Continuous
– Categorical
Explanatory variables
– Continuous and/or categorical
Matrix of response variables --> MRT (multivariate regression tree)
Panacea or Pandora's box (cf. James and McCulloch 1990)?
How it Works
Purifies the response by ranked splits on the explanatory variable(s)
– i.e., values < or > X
How it works
Split the response variable into the two most homogeneous groups based on the best level of the best explanatory variable
– Choose the level of the explanatory variable that maximizes the homogeneity of the two groups with respect to the values of the response variable
Do again on each separated, exclusive group
Do again on those groups
Tree grows until you have 1 observation per group, or you stop growing it
– Overgrow the tree and then prune it back? (sketched in code after this list)
– Nodes = splitting levels
– Terminal node = tree leaf
Allied with divisive hierarchical cluster analysis, but is constrained explicitly on your explanatory variables
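A minimal sketch of this grow-then-prune cycle with rpart, assuming the iris data from class prep; the cp and minsplit values are illustrative, not prescribed by the lecture:

  library(rpart)
  data(iris)
  # Overgrow: cp = 0 and minsplit = 2 keep splitting until nodes are
  # pure or hold a single observation
  overgrown <- rpart(Species ~ ., data = iris, method = "class",
                     control = rpart.control(cp = 0, minsplit = 2, xval = 10))
  # Prune back at a larger complexity parameter
  pruned <- prune(overgrown, cp = 0.05)
  print(pruned)  # internal nodes = splitting levels; * marks terminal leaves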
Over-grown Tree
[Figure: over-grown classification tree; splits include MN < 0.02, SO4 < 46, and AL < 0.02; terminal nodes labeled with % misclassification rate (MCR); overall MCR = 35/375]
Rel. Error = training error; 1 − Rel. Error = % variance explained
Xerror = cross-validation (CV) error
Xstd = standard error of Xerror
CV MCR = cross-validated misclassification rate
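These quantities all live in rpart's complexity-parameter (cp) table; a quick look, assuming a fitted classification tree (iris again as a stand-in):

  library(rpart)
  fit <- rpart(Species ~ ., data = iris, method = "class")
  printcp(fit)  # columns: CP, nsplit, rel error, xerror, xstd
  fit$cptable   # the same table as a matrix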
Pruned Tree
[Figure: pruned classification tree; terminal nodes labeled with % misclassification rate (MCR)]
Model Description
Categorical response
– Terminal leaves characterized by a distribution over the categories
– Proportions of observations in each group
– MCR
Continuous response
– Terminal leaves characterized by the mean of the response variable and summary stats
– Report % of SS explained
All terminal leaves are characterized by group size, some measure of variation, the response value, and the values of the explanatory variables
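A short sketch of pulling those leaf summaries out of an rpart object; the frame columns are rpart's own, and the fitted tree is just the iris example:

  library(rpart)
  fit <- rpart(Species ~ ., data = iris, method = "class")
  leaves <- fit$frame[fit$frame$var == "<leaf>", ]
  leaves[, c("n", "dev", "yval")]  # group size, deviance, fitted class index
  summary(fit)                     # splits, surrogates, and node details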
How to do it in R
library(mvpart)
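A minimal classification- and regression-tree sketch with rpart (iris stands in for the course .csv files; the variable choices are illustrative):

  library(rpart)
  data(iris)
  # Classification tree: categorical response
  ct <- rpart(Species ~ ., data = iris, method = "class")
  plot(ct); text(ct, use.n = TRUE)  # draw the tree with leaf group sizes
  # Regression tree: continuous response
  rt <- rpart(Petal.Length ~ Sepal.Length + Sepal.Width, data = iris,
              method = "anova")
  printcp(rt)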
Modeling WVSCI Categories from Landscape Data: Classification Tree
WV SCI
• Excellent
• Good
• Moderate
• Poor
Modeling EPT Scores from Landscape Data: Regression Tree
It would be nice to visualize this variation.
RPART Uses
Exploratory
Modeling
– Description
– Prediction: what is the value of the response variable given new observations of the explanatory variables? (see the sketch below)
IF…THEN statements
Map generation
Distinguishing groups in terms of species composition
– Change points
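For the prediction use, a hedged sketch of scoring new observations with predict(); the new_obs data frame is invented for illustration:

  library(rpart)
  fit <- rpart(Species ~ ., data = iris, method = "class")
  new_obs <- data.frame(Sepal.Length = 6.1, Sepal.Width = 2.9,
                        Petal.Length = 4.7, Petal.Width = 1.4)
  predict(fit, newdata = new_obs, type = "class")  # predicted group
  predict(fit, newdata = new_obs, type = "prob")   # leaf proportions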
Landscape Models to Predict Water Quality Type
Application of WQ Models
Prediction of WQ Type in un-sampled reaches
% by Rshed area:
Type     Cheat    Tygart
Sev A     5.3 %    1.4 %
Mod A     2.2 %    4.2 %
Hard      2.8 %   14.2 %
Soft     27.1 %    6.4 %
Trans    23.4 %   23.8 %
Ref      39.2 %   50.0 %
Model Validation
How well does it really work to predict new data?
How to Pick the “BEST” Tree Size by Pruning
Test-set Validation: External Model Testing
– Model-building subset (¾) and model-testing subset (¼), if you have enough data???
– Drop the external data through different tree sizes to see which tree size predicts best
– Choose the tree size with the smallest prediction error (sketched below)
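A hedged sketch of that external test: the ¾ / ¼ split follows the slide; everything else (iris, the candidate cp values) is illustrative:

  library(rpart)
  set.seed(1)
  idx   <- sample(nrow(iris), size = 0.75 * nrow(iris))  # 3/4 to build
  train <- iris[idx, ]
  test  <- iris[-idx, ]
  fit   <- rpart(Species ~ ., data = train, method = "class",
                 control = rpart.control(cp = 0, minsplit = 2))
  # Drop the test data through different tree sizes (cp values)
  for (cp in c(0.30, 0.10, 0.05, 0.01)) {
    p <- predict(prune(fit, cp = cp), newdata = test, type = "class")
    cat("cp =", cp, " test MCR =", mean(p != test$Species), "\n")
  }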
V-fold Cross-Validation
– Divide data into 10 equal groups (V = 10)
– Build tree with V2–V10 and predict V1
– Build tree with V1, V3–V10 and predict V2
– Etc.
– Calculate estimated error over ALL subsets for EACH tree size (WOW!!)
– Do 50 times at least (yikes!!), because under multiple CVs the best tree size varies
– Select the modal tree size that has the lowest error rate
– Or select the smallest tree size that is within 1 SE of the minimum (freedom)
– On average, this tree size should give the best prediction success for new data (see the sketch below)
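A sketch of the 1-SE rule applied to rpart's CV table (rpart runs the V-fold CV internally via xval; the 50 repeats would simply loop this with different seeds):

  library(rpart)
  fit    <- rpart(Species ~ ., data = iris, method = "class",
                  control = rpart.control(xval = 10))
  tab    <- fit$cptable
  best   <- which.min(tab[, "xerror"])
  thresh <- tab[best, "xerror"] + tab[best, "xstd"]  # minimum + 1 SE
  pick   <- which(tab[, "xerror"] <= thresh)[1]      # smallest tree within 1 SE
  pruned <- prune(fit, cp = tab[pick, "CP"])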
Picking “Best” Tree Size
Cross-validation relative error: decreases, then increases to a plateau
Relative error: decreases with tree size
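Those two curves are what rpart's plotcp() draws, assuming a fitted tree:

  library(rpart)
  fit <- rpart(Species ~ ., data = iris, method = "class")
  plotcp(fit)  # CV relative error vs. tree size, with 1-SE error bars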
Easily done in R:
– Package rpart
– Package mvpart
A word about publishing graphics
Multivariate Regression Trees
Extension of univariate regression trees
Multiple continuous response variables
Multiple continuous and/or categorical
predictor variables
Species – Environmental Relationships
Indicator Species
Disadvantages: impossible to visualize for large assemblages
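A hedged mvpart sketch; mvpart has since been archived from CRAN, so this assumes an archived install, and it uses the spider data that ship with the package:

  library(mvpart)
  data(spider)
  # Multivariate regression tree: a species abundance matrix as the response
  mrt <- mvpart(data.matrix(spider[, 1:12]) ~ water + twigs + reft + herbs +
                moss + sand, data = spider, xv = "1se")  # prune by 1-SE rule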
Conclusions
Linear models (OLS)
GLM and GAM (non-normal errors and non-linear relationships)
CART, ANN, BRT
rpart Summary Advantages
Non-parametric
Missing data ok
Surrogate splitters
Simple even for complex relationships
Scale invariant
Insensitive to outliers
Flexible
rpart Weaknesses
Over-fitting
– Finds the best splits for the data at hand, which is good for description but not for prediction
One single tree model
Can be unstable
– Very sensitive to the input data !
– Can get very different trees
– Difficult with smooth responses
– GLM and GAM out-perform it
Poor predictive models
Disadvantages of CART
CART does not use combinations of variables
Deceptive: if a variable is not included in the tree, it may have been "masked" by another variable acting as a surrogate
The tree is optimal at each split, but it may not be globally optimal (a sample / inference challenge)
Solutions
Stochastic Boosting
– randomForest
– gbm (Boosted Regression Trees)
– gbmplus (aggregated boosted trees)
– ada
Currently no multi-class classification response possible
– But there's a workaround (tedious as hell): the one-versus-all approach
– Build a design matrix with dummy (0/1) class columns (see the sketch below)
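A hedged one-versus-all sketch with gbm; the bernoulli distribution and the tuning values are illustrative, and iris stands in for real data:

  library(gbm)
  fits <- lapply(levels(iris$Species), function(cls) {
    d <- iris
    d$y <- as.numeric(d$Species == cls)  # dummy 0/1 response for this class
    d$Species <- NULL
    gbm(y ~ ., data = d, distribution = "bernoulli",
        n.trees = 1000, interaction.depth = 2, shrinkage = 0.01)
  })
  # Score new data with each fit, then assign the class with the
  # largest predicted probability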
Boosting
Improves prediction (quite a lot)
– Fit trees to many random samples (1000's) of the data (bagging)
– A random subset of predictors is used for each tree
– Successive trees are fit to the residuals of earlier trees
– Focus on hard-to-predict cases
– Average predictions over all trees
– Machine learning (see the sketch below)
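For the bagging / random-predictor-subset side of this list, a minimal randomForest sketch (the ntree and mtry values are illustrative); boosting proper would go through gbm as above:

  library(randomForest)
  set.seed(1)
  rf <- randomForest(Species ~ ., data = iris,
                     ntree = 1000,  # many trees on bootstrap samples
                     mtry  = 2)     # random subset of predictors per split
  rf$confusion  # out-of-bag confusion matrix and class error rates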