
Page 1

Week 15 Lecture 30

CART

Class Prep

bugs_waterchem.csv

library(rpart)

library(mvpart)

HabUse.csv

data(iris)

HW

CART: readings on webpage

The mvpart package !!??

Reading from Guthery

Class project
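
A minimal prep sketch in R, assuming the CSV files listed above sit in the working directory (the object names bugs and hab are just placeholders):

library(rpart)                            # classification and regression trees
# library(mvpart)                         # multivariate regression trees; archived on CRAN, install from the archive
bugs <- read.csv("bugs_waterchem.csv")    # macroinvertebrate / water-chemistry data
hab  <- read.csv("HabUse.csv")            # habitat-use data
data(iris)                                # built-in example data set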

Page 2

What’s in a name?

Structural modeling

CART

Classification and Regression Trees

Recursive partitioning

Decision Trees

Constrained cluster analysis

Page 3

Introduction to CART

Structural modeling technique

Data mining (exploration / description)

Partitions the response variable(s) based on the best predictor (surrogates)

Very flexible, few assumptions

Great for complex data and relationships

Focuses on maximizing prediction ability rather than minimizing error

Page 4

Introduction to CART

Recursively partitions the data into more and more homogeneous subsets based on certain levels of the predictor variables

Divisive, constrained cluster analysis

“Supervised” divisive cluster analysis

Page 5

Outcomes

Decision tree

Data description, patterns, relationships

Characteristics of the response clusters

Predictive model

Page 6

Page 7

Introduction to CART

Used in medical, industrial, and business fields

Breiman et al. 1984. Classification and Regression Trees. Chapman and Hall, New York.

Fairly new to ecology

Page 8

Important Introductory Papers in Ecology

De'ath G, Fabricius KE. 2000. Classification and regression trees: A powerful yet simple technique for ecological data analysis. Ecology 81: 3178-3192.

De'ath G. 2002. Multivariate regression trees: a new technique for modeling species-environment relationships. Ecology 83: 1105-1117.

Page 9

Data Structure for CART

A response variable: continuous or categorical

Explanatory variables: continuous and/or categorical

A matrix of response variables --> MRT (multivariate regression tree)

Panacea or Pandora's box (cf. James and McCulloch 1990)?

Page 10

How it Works

Purifies the response based on levels of the explanatory variable(s)

– i.e., values < or > some threshold X

Page 11

How it works

Split the response variable into the two most homogeneous groups based on the best level of the best explanatory variable

– Choose the level of the explanatory variable that maximizes homogeneity of the two groups with respect to the values of the response variable

Do again on each separated, exclusive group

Do again on those groups

Tree grows until you have one observation per group or you stop growing it

– Overgrow the tree and then prune it back?

– Nodes = splitting levels

– Terminal node = tree leaf

Allied with divisive hierarchical cluster analysis, but constrained explicitly by your explanatory variables (a minimal rpart sketch follows below)
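
A minimal sketch of the grow-then-prune idea, using rpart on the built-in iris data rather than the course data; the cp value used for pruning is only an illustrative choice:

library(rpart)
# Overgrow a classification tree: cp = 0 allows every split, minsplit = 2 keeps splitting
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0, minsplit = 2))
printcp(fit)                        # tree sizes with training and cross-validated error
pruned <- prune(fit, cp = 0.05)     # prune back at an illustrative complexity value
plot(pruned); text(pruned)          # draw the pruned tree with split labels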

Page 12

[Tree figure: first split at MN < 0.02]

Page 13

[Tree figure: splits at MN < 0.02, SO4 < 46, and AL < 0.02]

Page 14

Page 15

[Figure: over-grown classification tree with misclassification rates (MCR) labeled at each node, ranging from 0% to 64.3%]

Over-grown Tree = 35/375

Rel. error (training error); 1 - rel. error = % variance explained

Xerror = cross-validated (CV) error

Xstd = standard error of the CV error

CV MCR = cross-validated misclassification rate
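
The rel. error, xerror, and xstd values are the columns of the rpart cp table; a quick sketch of where to see them, using the iris fit from the earlier example (not the course output):

printcp(fit)     # columns: CP, nsplit, rel error, xerror, xstd
plotcp(fit)      # cross-validated relative error plotted against tree size / cp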

Page 16

Pruned Tree

[Figure: pruned classification tree with misclassification rates (MCR) labeled at each node]

Page 17

Model Description

Categorical response

– Terminal leaves characterized by a distribution on the categorical variable

– Proportions of observations in each group

– MCR

Continuous response

– Terminal leaves characterized by the mean of the response variable and summary stats

– Report the % of sums of squares (SS) explained

All terminal leaves are characterized by group size, some measure of variation, the response value, and values of explanatory variables

Page 18

How to do it in R

library(mvpart)

Page 19

Modeling WVSCI Categories from Landscape Data: Classification Tree

WV SCI

• Excellent

• Good

• Moderate

• Poor
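
A hedged sketch of a classification tree of this kind; the data frame land and the predictor names (pct.forest, pct.mining, pct.agric) are hypothetical stand-ins for the landscape variables, not the actual course data:

# Classification tree: categorical WVSCI rating ~ landscape predictors (all names hypothetical)
ct <- rpart(wvsci.class ~ pct.forest + pct.mining + pct.agric,
            data = land, method = "class")
plot(ct); text(ct, use.n = TRUE)    # show the class counts at each leaf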

Page 20

Modeling EPT Scores from Landscape Data: Regression Tree

It would be nice to visualize this variation
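
The regression-tree analogue, again a sketch with hypothetical names (ept.score stands in for the EPT metric):

# Regression tree: continuous EPT score ~ landscape predictors (method = "anova")
rt <- rpart(ept.score ~ pct.forest + pct.mining + pct.agric,
            data = land, method = "anova")
plot(rt); text(rt)                  # leaves are labeled with the mean EPT score in each group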

Page 21

Page 22

RPART Uses

Exploratory

Modeling

– Description

– Prediction: what is the value of the response variable given new observations of the explanatory variables? (see the predict() sketch below)

IF…THEN statements

Map generation

Distinguishing groups in terms of species composition

– Change points
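
A sketch of the prediction use, continuing the hypothetical classification tree above; new.sites stands for explanatory-variable values at unsampled locations:

# Predict the response class (and class probabilities) for new observations
new.sites <- data.frame(pct.forest = c(80, 20), pct.mining = c(0, 35),
                        pct.agric = c(5, 30))               # hypothetical values
predict(ct, newdata = new.sites, type = "class")            # predicted category
predict(ct, newdata = new.sites, type = "prob")             # class membership probabilities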

Page 23

Landscape Models to Predict Water Quality Type

Page 24

Application of WQ Models

Prediction of WQ type in un-sampled reaches

% by Rshed area:

Type     Cheat     Tygart
Sev A    5.3 %     1.4 %
Mod A    2.2 %     4.2 %
Hard     2.8 %     14.2 %
Soft     27.1 %    6.4 %
Trans    23.4 %    23.8 %
Ref      39.2 %    50.0 %

Page 25

Model Validation

How well does it really work for predicting new data?

Page 26

How to Pick the “BEST” Tree Size by Pruning

Test-set validation: external model testing

– Model-building subset (¾) and model-testing subset (¼), if you have enough data?

– Drop the external data through different tree sizes to see which tree size predicts best

– Choose the tree size with the smallest predicted error

V-fold cross-validation

– Divide the data into 10 equal groups (V = 10)

– Build a tree with V2–V10 and predict V1

– Build a tree with V1, V3–V10 and predict V2

– Etc.

– Calculate the estimated error over ALL subsets for EACH tree size (WOW!!)

– Do this at least 50 times (yikes!!), because under repeated cross-validations the best tree size varies

– Select the modal tree size that has the lowest error rate

– Or select the smallest tree size within 1 SE of the minimum; a sketch of this 1-SE rule follows below

– On average, this tree size should give the best prediction success for new data
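
A sketch of reading the cross-validation table and applying the 1-SE rule with rpart, using the iris example again (the course data would work the same way):

fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0, xval = 10))        # 10-fold CV built in
plotcp(fit)                                                     # CV relative error vs. tree size
cp.tab  <- fit$cptable
i.min   <- which.min(cp.tab[, "xerror"])                        # size with minimum CV error
one.se  <- cp.tab[i.min, "xerror"] + cp.tab[i.min, "xstd"]      # minimum + 1 SE threshold
best.cp <- cp.tab[which(cp.tab[, "xerror"] <= one.se)[1], "CP"] # smallest tree within 1 SE
pruned  <- prune(fit, cp = best.cp)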

Page 27

Picking “Best” Tree Size

Cross-validation relative error: decreases, then increases to a plateau

Relative error: decreases with tree size

Page 28

Easily Done in R

Package rpart

Package mvpart

A word about publishing graphics

Page 29

Multivariate Regression Trees

Extension of univariate regression trees

Multiple continuous response variables

Multiple continuous and/or categorical predictor variables

Species – Environmental Relationships

Indicator Species

Disadvantage: impossible to visualize for large assemblages (see the mvpart sketch below)
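
A hedged sketch of a multivariate regression tree with mvpart (the package is archived on CRAN, so it must be installed from the archive); spe (a site-by-species matrix) and env (matching environmental predictors) are hypothetical object names:

library(mvpart)
# The whole species matrix is the response, partitioned by the environmental variables
mrt <- mvpart(data.matrix(spe) ~ ., data = env)
summary(mrt)     # group sizes, within-group variance, and splitting variables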

Page 30

Page 31

Conclusions

Linear models (OLS)

GLM and GAM (non-normal errors and non-linear relationships)

CART, ANN (artificial neural networks), BRT (boosted regression trees)

Page 32

rpart Summary: Advantages

Non-parametric

Missing data ok

Surrogate splitters

Simple even for complex relationships

Scale invariant

Robust to outliers

Flexible

Page 33

rpart Weaknesses

Over-fitting

– Finds the best splits for the data at hand, which is good for description but not necessarily for prediction

Produces a single tree model

Can be unstable

– Very sensitive to the input data !

– Can get very different trees

– Difficult with smooth responses

– GLM and GAM out-perform it

Poor predictive models

Page 34

Disadvantages of CART

CART does not use combinations of variables

Deceptive: if a variable is not included in the tree, it may have been "masked" by another variable acting as a surrogate

The tree is optimal at each split, but it may not be globally optimal (a sample / inference challenge)

Page 35

Solutions

Ensemble tree methods (bagging and stochastic boosting)

– randomForest

– gbm (Boosted Regression Trees)

– gbmplus (aggregated boosted trees)

– ada

Currently no multi-class response is possible

But there is a workaround (tedious as hell): the one-versus-all approach

Design matrix with dummy variables (a sketch follows below)
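
A hedged sketch of the one-versus-all workaround with gbm; dat and wq.type are hypothetical names for a data frame holding the multi-class response, and each class gets its own binary (0/1) model:

library(gbm)
dummies <- model.matrix(~ wq.type - 1, data = dat)   # one 0/1 column per class
preds   <- dat[, setdiff(names(dat), "wq.type")]     # predictors only
fits <- lapply(colnames(dummies), function(cl) {
  d <- cbind(preds, y = dummies[, cl])
  gbm(y ~ ., data = d, distribution = "bernoulli",
      n.trees = 1000, interaction.depth = 3, shrinkage = 0.01)
})
# A new site is then assigned to the class whose binary model gives the highest predicted probability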

Page 36

Boosting

Improves prediction (quite a lot)

– Fit trees to many random samples (1000s) of the data (bagging size)

– A random subset of predictors is used for each tree

– Successive trees are fit to the residuals of earlier trees

– Focus on hard-to-predict cases

– Average predictions over all trees

– Machine learning
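
A hedged sketch of the ensemble idea on the iris data, using randomForest (gbm would be the boosted-tree analogue):

library(randomForest)
# Many trees, each grown on a bootstrap sample with a random subset of predictors tried at each split
rf <- randomForest(Species ~ ., data = iris, ntree = 1000, mtry = 2)
rf$confusion     # out-of-bag confusion matrix and class error rates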