
Classification Trees and MARS

STA450S/4000S: Topics in Statistics. Statistical Aspects of Data Mining

Ana-Maria Staicu

Classification Trees and MARS – p. 1/19

Recap Regression Trees

CART (Classification and Regression Trees) is a method developed by Breiman, Friedman, Olshen and Stone to classify data on the basis of some of the variables. It is also known as recursive partitioning.

Basic idea: construct a tree that separates the data in the "best" way by finding binary splits on the variables; at each stage, find the best splitting variable and the best splitting point. The routine is recursive. Usually the process stops when some minimum node size (say 5 observations per node) is reached.

Once the tree has been grown, a cost-complexity criterion is used to prune it. The tuning parameter α governs the tradeoff between tree size and goodness of fit to the data.

Classification Trees and MARS – p. 2/19
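As a quick preview of the R code shown later in these slides, here is a minimal sketch of growing and then cost-complexity pruning a regression tree with rpart; the built-in trees data set and the cp values are illustrative only (roughly, rpart's cp plays the role of α, rescaled by the root node error).

library(rpart)
# grow a deliberately large tree by using a very small complexity parameter
fit <- rpart(Volume ~ Girth + Height, data = trees, method = "anova",
             control = rpart.control(cp = 0.001, minsplit = 5))
printcp(fit)                      # cp table with cross-validated error (xerror)
# prune back at a chosen cp; in practice pick the cp minimising xerror
fit.pruned <- prune(fit, cp = 0.02)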


Classification Trees

For trees, R provides either the package tree or rpart.

The target variable Y takes values 1, 2, . . . , K.

One basic difference between classification and regression trees is the action that takes place at the splits:

Regression tree: we try to minimize the sum of squared errors between the true values and the "predicted" values (the "predicted" value is the mean of all responses on either side of the split).

Classification tree: we try to minimize a measure of impurity (loss function):

Classification Trees and MARS – p. 3/19
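To make the regression-tree criterion concrete, here is a small sketch in R (not the internals of tree or rpart) that scores every candidate split point of a single numeric predictor by the resulting sum of squared errors; best_split_sse and the use of the built-in trees data are purely illustrative.

best_split_sse <- function(x, y) {
  cuts <- sort(unique(x))
  cuts <- (head(cuts, -1) + tail(cuts, -1)) / 2        # midpoints between observed values
  sse  <- sapply(cuts, function(cc) {
    left <- y[x <= cc]; right <- y[x > cc]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(cut = cuts[which.min(sse)], sse = min(sse))     # best split point and its SSE
}
best_split_sse(trees$Girth, trees$Volume)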


Node impurity measure:

Misclassification error: $\frac{1}{N_m}\sum_{i \in R_m} I(y_i \ne k(m))$, where $k(m) = \arg\max_k \hat{p}_{mk}$.
Note $\hat{p}_{mk} = \frac{1}{N_m}\sum_{i \in R_m} I(y_i = k)$ is the proportion of class $k$ observations in node $m$.

Gini index: $\sum_{k \ne k'} \hat{p}_{mk}\hat{p}_{mk'} = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$.

Cross-entropy or deviance: $-\sum_{k=1}^{K} \hat{p}_{mk}\log \hat{p}_{mk}$.

When growing the tree: choose either the Gini index or cross-entropy. One reason is their differentiability. The Gini index is the default in R.

When pruning the tree: any of the three can be used; misclassification error is typically used.

Classification Trees and MARS – p. 4/19
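A short sketch of the three impurity measures as R functions of the vector of estimated class proportions in a node; the proportions below are made up for illustration.

misclass <- function(p) 1 - max(p)                          # misclassification error
gini     <- function(p) sum(p * (1 - p))                    # Gini index
entropy  <- function(p) -sum(ifelse(p > 0, p * log(p), 0))  # cross-entropy / deviance
p.hat <- c(0.7, 0.2, 0.1)      # e.g. a node with 70%, 20%, 10% of classes 1, 2, 3
c(misclass(p.hat), gini(p.hat), entropy(p.hat))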

§9.2.4 Other issues

Handling Unordered Inputs

If an input $X_j$ has $q$ ordered possible values, there are $q - 1$ possible partitions into 2 groups.

If an input $X_j$ is categorical, with $q$ unordered possible values, there are $2^{q-1} - 1$ possible partitions into 2 groups.

Solution (for a 0-1 or quantitative outcome): order the predictor classes according to the proportion falling in outcome class 1, then split the predictor $X_j$ as if the values were ordered. This gives the optimal split in terms of squared error or Gini index. See Breiman et al.'s Classification and Regression Trees.

Classification Trees and MARS – p. 5/19
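The ordering trick can be sketched in a few lines of R for a 0-1 outcome; the simulated x and y below are purely illustrative.

set.seed(1)
x <- factor(sample(letters[1:5], 200, replace = TRUE))        # unordered categories
p <- c(a = 0.1, b = 0.7, c = 0.4, d = 0.9, e = 0.2)           # true P(Y = 1) per category
y <- rbinom(200, 1, prob = p[as.character(x)])
ord <- names(sort(tapply(y, x, mean)))        # categories ordered by observed P(Y = 1)
x.ord <- factor(x, levels = ord, ordered = TRUE)
# the q - 1 = 4 candidate splits are now the usual "first k levels vs the rest" splits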


§9.2.4 The Loss Matrix

The consequences of misclassifying an observation may vary with the classes involved.

Define a $K \times K$ loss matrix $L$, with $L_{kk'}$ the loss for misclassifying a class $k$ observation as class $k'$. Evidently $L_{kk} = 0$.

How to incorporate the losses into the modeling process?

Case $K = 2$: weight observations in class 1 by $L_{12}$ and observations in class 2 by $L_{21}$.

Case $K > 2$: if $L_{kk'}$ is a function only of $k$, not of $k'$, weight observations in class $k$ by $L_{kk'}$. In a terminal node $m$, class $k(m) = \arg\min_k \sum_l L_{lk}\hat{p}_{ml}$ is assigned.

To incorporate the loss into the growing process, modify the Gini index to $\sum_{k \ne k'} L_{kk'}\hat{p}_{mk}\hat{p}_{mk'}$.

Classification Trees and MARS – p. 6/19
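In rpart a loss matrix can be supplied through the parms argument; a minimal sketch using the kyphosis data that ships with rpart, with an arbitrary 5:1 cost ratio chosen only for illustration.

library(rpart)
# rows are taken here as the true class (absent, present), columns as the predicted class
L <- matrix(c(0, 1,
              5, 0), nrow = 2, byrow = TRUE)  # misclassifying "present" as "absent" costs 5
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             parms = list(loss = L))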


§9.2.4 Missing Predictor Values

In general, two approaches:

1) discard the observations with missing values;

2) impute the missing values, e.g. by the mean of the predictor over the non-missing observations.

Tree-based methods:

1) make a new category "NA" for the missing values of a categorical predictor;

2) use surrogate variables.

At any split, alternative splitting variables and corresponding splitting points are determined when building the model. The first surrogate split best mimics the split of the training data achieved by the primary split, the second mimics it second best, and so on. When the primary splitting predictor is missing, the surrogate splits are used in that order.

Classification Trees and MARS – p. 7/19
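Surrogate splits are controlled in rpart through rpart.control; a brief sketch, again on the kyphosis data, for illustration only.

library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             control = rpart.control(maxsurrogate = 5,   # keep up to 5 surrogates per split
                                     usesurrogate = 2))  # if all surrogates are missing too,
                                                         # send the observation with the majority
summary(fit)   # lists the primary split and its surrogate splits at each node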


Binary splits

Multi-way splits would fragment the data too quickly, leaving insufficient data at the next level down. Moreover, a multi-way split can be expressed as a series of binary splits.

Linear combination splits

Choose a split of the form $\sum_j a_j X_j \le c$ instead of the form $X_j \le c$. Consequences:

1) it improves the predictive power of the tree;

2) it reduces its interpretability.

Alternative: HME (hierarchical mixtures of experts).

Classification Trees and MARS – p. 8/19


Advantages

Trees are easy to interpret.

Trees can handle multicollinearity.

Tree methods are nonparametric (essentially assumption-free).

Disadvantages

High variance, caused by the hierarchical nature of the process: an error in the top split is propagated down to all of the splits below it. Even a more stable split criterion does not remove this instability.

Lack of smoothness of the prediction surface (which MARS alleviates).

Difficulty in modeling additive structure (which MARS captures).

Classification Trees and MARS – p. 9/19

Some code for trees

library(MASS)
library(rpart)

cpus.rp <- rpart(log10(perf) ~ ., cpus[, 2:8], cp = 1e-3)
summary(cpus.rp)
printcp(cpus.rp)

# Regression tree:
# rpart(formula = log10(perf) ~ ., data = cpus[, 2:8], cp = 0.001)
# Variables actually used in tree construction:
# [1] cach  chmax chmin mmax  syct
# Root node error: 43.116/209 = 0.20629
#          CP nsplit rel error  xerror     xstd
# 1 0.5492697      0   1.00000 1.02128 0.098176
# 2 0.0893390      1   0.45073 0.47818 0.048282

cpus.rp.pr <- prune(cpus.rp, cp = 0.006)
post(cpus.rp.pr, title = "Plot of rpart object cpus.rp.pr",
     filename = "C:\\AM\\CpusTree.eps", horizontal = F, pointsize = 8)

Classification Trees and MARS – p. 10/19
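A possible follow-up to the code above (not on the original slide): plot the cross-validated error against cp and prune at the cp value that minimises it.

plotcp(cpus.rp)                                   # xerror versus cp
cp.opt <- cpus.rp$cptable[which.min(cpus.rp$cptable[, "xerror"]), "CP"]
cpus.rp.pr2 <- prune(cpus.rp, cp = cp.opt)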

[Figure: "Plot of rpart object cpus.rp.pr", the pruned regression tree. The splits involve cach, mmax, syct and chmin (root split: cach < 27), and each node is labelled with its mean of log10(perf) and the number of observations n (root: 1.753, n = 209).]

Classification Trees and MARS – p. 11/19

Some code for trees

library(tree)

fgl.tr <- tree(type ~ ., fgl)
summary(fgl.tr)
# Classification tree: tree(formula = type ~ ., data = fgl)
# Number of terminal nodes: 20
# Residual mean deviance: 0.6853 = 133 / 194
# Misclassification error rate: 0.1542 = 33 / 214

fgl.tr1 <- snip.tree(fgl.tr, nodes = 9)
# The nodes could be snipped off interactively, by clicking with
# the mouse on the terminal node:  fgl.tr1 <- snip.tree(fgl.tr)

fgl.cv <- cv.tree(fgl.tr,, FUN = prune.tree, K = 10)
# The algorithm below randomly divides the training set.
for(i in 2:5)
  {fgl.cv$dev <- fgl.cv$dev + cv.tree(fgl.tr,, prune.tree)$dev}
fgl.cv$dev <- fgl.cv$dev/5
plot(fgl.cv)
title("Cross-validation plot for pruning")

Classification Trees and MARS – p. 12/19
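A possible follow-up to the code above (not on the original slide): once the cross-validation plot suggests a tree size, prune to that many leaves; the size 9 here is illustrative. prune.misclass could be used instead of prune.tree to prune on misclassification error.

fgl.tr.pruned <- prune.tree(fgl.tr, best = 9)   # keep the best 9 terminal nodes
plot(fgl.tr.pruned); text(fgl.tr.pruned)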

[Figure: two pruning plots. Left, "Pruning: choosing parameter cp": cross-validated relative error (X-val Relative Error) plotted against cp and tree size. Right, "Cross-validation plot for pruning": deviance plotted against tree size.]

Classification Trees and MARS – p. 13/19


§9.4 MARS

For the regression tree process, the data were partitioned in a way that produced the "best" split with reference to the deviances from the mean on either side of the split.

For MARS a similar process is used to find the best split, but with reference to the deviances from a spline function on either side of the split.

There are commercial versions with an interface to R on Jerome Friedman's home page.
Friedman, J. (1991): Multivariate Adaptive Regression Splines. Annals of Statistics, 19(1), 1-141.
A free version comes with the package mda.

Classification Trees and MARS – p. 14/19


The spline functions used by MARS are:

$(X - t)_+ = \begin{cases} x - t, & x > t \\ 0, & \text{otherwise} \end{cases}$  and  $(t - X)_+ = \begin{cases} t - x, & x < t \\ 0, & \text{otherwise} \end{cases}$

Each function is piecewise linear. By multiplying these splines together it is possible to produce quadratic or cubic curves. The pair of functions $(X - t)_+$, $(t - X)_+$ is called a reflected pair, while $t$ is called a knot.

MARS uses the collection of basis functions $\mathcal{C} = \{(X_j - t)_+, (t - X_j)_+\}$, with $t \in \{x_{1j}, \ldots, x_{Nj}\}$, $j = 1, 2, \ldots, p$.

Recall: regression trees use as basis functions $I(X_j > c)$ and $I(X_j \le c)$.

Classification Trees and MARS – p. 15/19
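The reflected pair is easy to write down in R with pmax; a tiny sketch, with the knot t = 0.5 chosen arbitrarily.

hinge.pos <- function(x, t) pmax(x - t, 0)     # (x - t)_+
hinge.neg <- function(x, t) pmax(t - x, 0)     # (t - x)_+
x <- seq(0, 1, by = 0.01)
plot(x, hinge.pos(x, 0.5), type = "l")         # kinked ("hinge") at the knot t = 0.5
lines(x, hinge.neg(x, 0.5), lty = 2)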

The model is of the form:

$f(X) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(X) \qquad (1)$

where each $h_m(X)$ is a function in $\mathcal{C}$ or a product of functions in $\mathcal{C}$. $\mathcal{M} = \{h_0(X), \ldots, h_M(X)\}$ is the set of all functions included in the model.

How would we build the model if the model functions were known? If the functions $h_m(X)$ were known, we would determine the coefficients $\beta_0, \ldots, \beta_M$ by minimizing the residual sum of squares. The model-building strategy is similar to stepwise linear regression, except that functions of the form $h_m(X)$ are used instead of the original inputs.

Classification Trees and MARS – p. 16/19
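A small sketch of the "known basis functions" case: with two hand-picked hinge terms of Girth (knot at 14, chosen arbitrarily), the β's in (1) are just ordinary least-squares coefficients.

d <- transform(trees,
               h1 = pmax(Girth - 14, 0),   # (Girth - 14)_+
               h2 = pmax(14 - Girth, 0))   # (14 - Girth)_+
fit <- lm(Volume ~ h1 + h2, data = d)
coef(fit)                                  # beta_0, beta_1, beta_2 of a model of form (1)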


MARS Model functions

Step 1
Start with $h_0(X) = 1$; $\hat f^{(1)} = \hat\beta_0^{(1)}$. $\mathcal{M}^{(1)} = \{h_0(X)\}$.

Step 2
Add to the model a function of the form $b_1 (X_j - t)_+ + b_2 (t - X_j)_+$, with $t \in \{x_{1j}, \ldots, x_{Nj}\}$, that produces the largest decrease in training error. Say this is achieved by $j = J$ and $t = x_{kJ}$.
Model: $\hat f^{(2)} = \hat\beta_0^{(2)} + \hat\beta_1^{(2)} (X_J - x_{kJ})_+ + \hat\beta_2^{(2)} (x_{kJ} - X_J)_+$.
$\mathcal{M}^{(2)} = \{h_0(X), h_1(X), h_2(X)\}$, with $h_1(X) = (X_J - x_{kJ})_+$, etc.

Step m + 1
Add to the model a function of the form $b_{2m-1} h_l(X)(X_j - t)_+ + b_{2m} h_l(X)(t - X_j)_+$, with $h_l(X) \in \mathcal{M}^{(m)}$, that produces the largest decrease in training error. Say this is achieved by $j = J'$, $t = x_{k'J'}$ and $l = L$. Then $\mathcal{M}^{(m+1)} = \mathcal{M}^{(m)} \cup \{h_{2m-1}(X), h_{2m}(X)\}$, where $h_{2m-1}(X) = h_L(X)(X_{J'} - x_{k'J'})_+$ and $h_{2m}(X) = h_L(X)(x_{k'J'} - X_{J'})_+$.

The algorithm stops when the model set contains some preset maximum number of terms.

Classification Trees and MARS – p. 17/19
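The whole forward/backward procedure can be run with the free implementation in the mda package mentioned earlier; a sketch on the cpus data, where degree = 2 allows products of two hinge functions and nk bounds the number of forward terms (both values illustrative).

library(mda)
library(MASS)                          # for the cpus data
X <- as.matrix(cpus[, 2:7])            # syct, mmin, mmax, cach, chmin, chmax
y <- log10(cpus$perf)
fit.mars <- mars(X, y, degree = 2, nk = 21)
fit.mars$selected.terms                # terms retained after backward deletion
fit.mars$gcv                           # GCV of the selected model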


At the end of this process we have a large model of the form (1), which most probably overfits the data, so a backward deletion procedure is applied.

At each stage, the term whose removal causes the smallest increase in the residual sum of squares is deleted from the model.

The tuning parameter $\lambda$ governs the tradeoff between the size of the model and its goodness of fit to the data. The optimal value of $\lambda$ is estimated by the generalized cross-validation criterion:

$GCV(\lambda) = \dfrac{\sum_{i=1}^{N} (y_i - \hat f_\lambda(x_i))^2}{(1 - M(\lambda)/N)^2}$

$M(\lambda)$ is the effective number of parameters used in the model; namely, the number of terms in the model plus the number of parameters used to select the optimal positions of the knots (3 parameters per knot).

Classification Trees and MARS – p. 18/19
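The GCV criterion itself is a one-liner; a sketch matching the formula above, with M(λ) counted as the number of terms plus three parameters per knot (the numbers in the example are made up).

gcv <- function(resid, M.lambda, N = length(resid))
  sum(resid^2) / (1 - M.lambda / N)^2
# e.g. a model with 7 terms that used 3 knots: M(lambda) = 7 + 3 * 3 = 16
set.seed(1); r <- rnorm(100, sd = 0.3)    # pretend residuals, N = 100
gcv(r, M.lambda = 16)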


Advantages:

Using piecewise linear basis functions, the regression surface is built up parsimoniously.

MARS is not computationally intensive: for the piecewise linear functions, the reflected pair with the rightmost knot is fitted first, and the knot is then moved successively, one position at a time, to the left.

Limitations:

Hierarchical (forward) modeling strategy: the philosophy is that a higher-order interaction will likely exist only if some of its lower-order "footprints" exist as well.

Restriction on the formation of model terms: each input can appear at most once in a product.

Classification Trees and MARS – p. 19/19
