Week 15 Lecture 30
CART
Class Prep
bugs_waterchem.csv
library(rpart)
library(mvpart)
HabUse.csv
data(iris)
HW: CART readings on webpage
The mvpart package !!??
Reading from Guthery
Class project
What’s in a name?
Structural modeling
CART
Classification and Regression Trees
Recursive partitioning
Decision Trees
Constrained cluster analysis
Introduction to CART
Structural modeling technique
Data mining (exploration / description)
Partitions response variable(s) based on the best predictor (with surrogates)
Very flexible, few assumptions
Great for complex data and relationships
Focuses on maximizing predictive ability rather than minimizing error
Introduction to CART
Recursively partitions the data into more and more homogeneous subsets based on certain levels of the predictor variables
Divisive, constrained cluster analysis
“Supervised” divisive cluster analysis
Outcomes
Decision tree
Data description, patterns, relationships
Characteristics of the response clusters
Predictive model
Introduction to CART
Used in medical, industrial, and business fields
Breiman et al. 1984. Classification and regression trees. Chapman and Hall, New York.
Fairly new to ecology
Important Introductory Papers in Ecology
De'ath G, Fabricius KE. 2000. Classification and regression trees: A powerful yet simple technique for ecological data analysis. Ecology 81: 3178-3192.
De'ath G. 2002. Multivariate regression trees: a new technique for modeling species-environment relationships. Ecology 83: 1105-1117.
Data Structure for CART
A response variable
– Continuous
– Categorical
Explanatory variables
– Continuous and/or categorical
Matrix of response variables --> MRT (multivariate regression tree)
Panacea or Pandora's box (cf. James and McCulloch 1990)?
How it Works
Purifies the response by ranked splits on the explanatory variable(s)
– i.e., values < or > X
How it works
Split the response variable into the two most homogeneous groups based on the best level of the best explanatory variable
– Choose the level of the explanatory variable that maximizes the homogeneity of the two groups with respect to the values of the response variable
Do again on each separated, exclusive group
Do again on those groups
Tree grows until you have 1 observation per group, or you stop growing it
– Overgrow the tree and then prune it back? (sketched in code after this list)
– Nodes = splitting levels
– Terminal node = tree leaf
Allied with divisive hierarchical cluster analysis, but is constrained explicitly on your explanatory variables
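A minimal sketch of this grow-then-prune cycle with rpart, assuming the iris data from class prep; the cp and minsplit values are illustrative, not prescribed by the lecture:

  library(rpart)
  data(iris)
  # Overgrow: cp = 0 and minsplit = 2 keep splitting until nodes are
  # pure or hold a single observation
  overgrown <- rpart(Species ~ ., data = iris, method = "class",
                     control = rpart.control(cp = 0, minsplit = 2, xval = 10))
  # Prune back at a larger complexity parameter
  pruned <- prune(overgrown, cp = 0.05)
  print(pruned)  # internal nodes = splitting levels; * marks terminal leaves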
Over-grown Tree
[Figure: over-grown classification tree; splits include MN < 0.02, SO4 < 46, and AL < 0.02; terminal nodes labeled with % misclassification rate (MCR); overall MCR = 35/375]
Rel. Error = training error; 1 − Rel. Error = % variance explained
Xerror = cross-validation (CV) error
Xstd = standard error of Xerror
CV MCR = cross-validated misclassification rate
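These quantities all live in rpart's complexity-parameter (cp) table; a quick look, assuming a fitted classification tree (iris again as a stand-in):

  library(rpart)
  fit <- rpart(Species ~ ., data = iris, method = "class")
  printcp(fit)  # columns: CP, nsplit, rel error, xerror, xstd
  fit$cptable   # the same table as a matrix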
Pruned Tree
[Figure: pruned classification tree; terminal nodes labeled with % misclassification rate (MCR)]
Model Description
Categorical response
– Terminal leaves characterized by a distribution over the categories
– Proportions of observations in each group
– MCR
Continuous response
– Terminal leaves characterized by the mean of the response variable and summary stats
– Report % of SS explained
All terminal leaves are characterized by group size, some measure of variation, the response value, and the values of the explanatory variables
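A short sketch of pulling those leaf summaries out of an rpart object; the frame columns are rpart's own, and the fitted tree is just the iris example:

  library(rpart)
  fit <- rpart(Species ~ ., data = iris, method = "class")
  leaves <- fit$frame[fit$frame$var == "<leaf>", ]
  leaves[, c("n", "dev", "yval")]  # group size, deviance, fitted class index
  summary(fit)                     # splits, surrogates, and node details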
How to do it in R
library(mvpart)
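A minimal classification- and regression-tree sketch with rpart (iris stands in for the course .csv files; the variable choices are illustrative):

  library(rpart)
  data(iris)
  # Classification tree: categorical response
  ct <- rpart(Species ~ ., data = iris, method = "class")
  plot(ct); text(ct, use.n = TRUE)  # draw the tree with leaf group sizes
  # Regression tree: continuous response
  rt <- rpart(Petal.Length ~ Sepal.Length + Sepal.Width, data = iris,
              method = "anova")
  printcp(rt)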
Modeling WVSCI Categories from Landscape Data: Classification Tree
WV SCI
• Excellent
• Good
• Moderate
• Poor
Modeling EPT Scores from Landscape Data: Regression Tree
It would be nice to visualize this variation.
RPART Uses
Exploratory
Modeling
– Description
– Prediction: what is the value of the response variable given new observations of the explanatory variables? (see the sketch below)
IF…THEN statements
Map generation
Distinguishing groups in terms of species composition
– Change points
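For the prediction use, a hedged sketch of scoring new observations with predict(); the new_obs data frame is invented for illustration:

  library(rpart)
  fit <- rpart(Species ~ ., data = iris, method = "class")
  new_obs <- data.frame(Sepal.Length = 6.1, Sepal.Width = 2.9,
                        Petal.Length = 4.7, Petal.Width = 1.4)
  predict(fit, newdata = new_obs, type = "class")  # predicted group
  predict(fit, newdata = new_obs, type = "prob")   # leaf proportions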
Landscape Models to Predict Water Quality Type
Application of WQ Models
Prediction of WQ Type in un-sampled reaches
% by Rshed area:
Type     Cheat    Tygart
Sev A     5.3 %    1.4 %
Mod A     2.2 %    4.2 %
Hard      2.8 %   14.2 %
Soft     27.1 %    6.4 %
Trans    23.4 %   23.8 %
Ref      39.2 %   50.0 %
Model Validation
How well does it really work to predict new data?
How to Pick the “BEST” Tree Size by Pruning
Test-set Validation: External Model Testing
– Model-building subset (¾) and model-testing subset (¼), if you have enough data???
– Drop the external data through different tree sizes to see which tree size predicts best
– Choose the tree size with the smallest prediction error (sketched below)
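A hedged sketch of that external test: the ¾ / ¼ split follows the slide; everything else (iris, the candidate cp values) is illustrative:

  library(rpart)
  set.seed(1)
  idx   <- sample(nrow(iris), size = 0.75 * nrow(iris))  # 3/4 to build
  train <- iris[idx, ]
  test  <- iris[-idx, ]
  fit   <- rpart(Species ~ ., data = train, method = "class",
                 control = rpart.control(cp = 0, minsplit = 2))
  # Drop the test data through different tree sizes (cp values)
  for (cp in c(0.30, 0.10, 0.05, 0.01)) {
    p <- predict(prune(fit, cp = cp), newdata = test, type = "class")
    cat("cp =", cp, " test MCR =", mean(p != test$Species), "\n")
  }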
V-fold Cross-Validation
– Divide data into 10 equal groups (V = 10)
– Build tree with V2–V10 and predict V1
– Build tree with V1, V3–V10 and predict V2
– Etc.
– Calculate estimated error over ALL subsets for EACH tree size (WOW!!)
– Do 50 times at least (yikes!!), because under multiple CVs the best tree size varies
– Select the modal tree size that has the lowest error rate
– Or select the smallest tree size that is within 1 SE of the minimum (freedom)
– On average, this tree size should give the best prediction success for new data (see the sketch below)
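A sketch of the 1-SE rule applied to rpart's CV table (rpart runs the V-fold CV internally via xval; the 50 repeats would simply loop this with different seeds):

  library(rpart)
  fit    <- rpart(Species ~ ., data = iris, method = "class",
                  control = rpart.control(xval = 10))
  tab    <- fit$cptable
  best   <- which.min(tab[, "xerror"])
  thresh <- tab[best, "xerror"] + tab[best, "xstd"]  # minimum + 1 SE
  pick   <- which(tab[, "xerror"] <= thresh)[1]      # smallest tree within 1 SE
  pruned <- prune(fit, cp = tab[pick, "CP"])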
Picking “Best” Tree Size
Cross-validation relative error: decreases, then increases to a plateau
Relative error: decreases with tree size
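Those two curves are what rpart's plotcp() draws, assuming a fitted tree:

  library(rpart)
  fit <- rpart(Species ~ ., data = iris, method = "class")
  plotcp(fit)  # CV relative error vs. tree size, with 1-SE error bars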
Easily done in R:
– Package rpart
– Package mvpart
A word about publishing graphics
Multivariate Regression Trees
Extension of univariate regression trees
Multiple continuous response variables
Multiple continuous and/or categorical
predictor variables
Species – Environmental Relationships
Indicator Species
Disadvantages: impossible to visualize for large assemblages
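A hedged mvpart sketch; mvpart has since been archived from CRAN, so this assumes an archived install, and it uses the spider data that ship with the package:

  library(mvpart)
  data(spider)
  # Multivariate regression tree: a species abundance matrix as the response
  mrt <- mvpart(data.matrix(spider[, 1:12]) ~ water + twigs + reft + herbs +
                moss + sand, data = spider, xv = "1se")  # prune by 1-SE rule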
Conclusions
Linear models (OLS)
GLM and GAM (non-normal errors and non-linear relationships)
CART, ANN, BRT
rpart Summary Advantages
Non-parametric
Missing data ok
Surrogate splitters
Simple even for complex relationships
Scale invariant
Insensitive to outliers
Flexible
rpart Weaknesses
Over-fitting
– Finds the best splits for the data at hand, which is good for description but not for prediction
One single tree model
Can be unstable
– Very sensitive to the input data !
– Can get very different trees
– Difficult with smooth responses
– GLM and GAM out-perform it
Poor predictive models
Disadvantages of CART
CART does not use combinations of variables
Deceptive: if a variable is not included in the tree, it may have been "masked" by another variable acting as a surrogate
The tree is optimal at each split, but it may not be globally optimal (a sample / inference challenge)
Solutions
Stochastic Boosting
– randomForest
– gbm (Boosted Regression Trees)
– gbmplus (aggregated boosted trees)
– ada
Currently no multi-class classification response possible
– But there's a workaround (tedious as hell): the one-versus-all approach
– Build a design matrix with dummy (0/1) class columns (see the sketch below)
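A hedged one-versus-all sketch with gbm; the bernoulli distribution and the tuning values are illustrative, and iris stands in for real data:

  library(gbm)
  fits <- lapply(levels(iris$Species), function(cls) {
    d <- iris
    d$y <- as.numeric(d$Species == cls)  # dummy 0/1 response for this class
    d$Species <- NULL
    gbm(y ~ ., data = d, distribution = "bernoulli",
        n.trees = 1000, interaction.depth = 2, shrinkage = 0.01)
  })
  # Score new data with each fit, then assign the class with the
  # largest predicted probability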
Boosting
Improves prediction (quite a lot)
– Fit trees to many random samples (1000's) of the data (bagging)
– A random subset of predictors is used for each tree
– Successive trees are fit to the residuals of earlier trees
– Focus on hard-to-predict cases
– Average predictions over all trees
– Machine learning (see the sketch below)
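For the bagging / random-predictor-subset side of this list, a minimal randomForest sketch (the ntree and mtry values are illustrative); boosting proper would go through gbm as above:

  library(randomForest)
  set.seed(1)
  rf <- randomForest(Species ~ ., data = iris,
                     ntree = 1000,  # many trees on bootstrap samples
                     mtry  = 2)     # random subset of predictors per split
  rf$confusion  # out-of-bag confusion matrix and class error rates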