
Page 1: Decision Tree in R - files.meetup.com/1676436/DecisionTrees.pdf

Introduction Method R Implementation Data Preparation Conclusion

Decision Tree in R

December 7, 2011

Ming Shan R/Predictive Analytics Meetup - December 7, 2011

Page 2

Content I

1 Introduction
  Example 1: Titanic Data
  Example 2: New York City Air Quality Data
  Example 3: An Artificial Data

2 Method
  General Comment
  Regression Tree
  Classification Tree
  Tree Stop and Pruning
  Missing Data

3 R Implementation
  rpart
  party package

Page 3

Content II

  RWeka package
  evtree package
  mvpart package
  partykit package

4 Data Preparation
  Overview
  Example

5 Conclusion

Page 4

Agenda

Introduction through examples

Algorithm - a conceptual view

R implementation

Data preparation

Conclusion

Page 5

Example 1: Titanic Data

Titanic Passenger Survival Data

Data (n = 1046 complete records)

survived yes, no

sex female, male

pclass passenger class on the ship

age continuous

Question

Who survived and who perished?

Page 6

Figure: Survival of Titanic Passengers (n = 1046) - conditional inference tree. Root split on sex (p < 0.001). Females split on pclass (p < 0.001): class 3 → Node 3 (n = 152); classes {1, 2} → Node 4 (n = 236). Males split on pclass (p < 0.001): classes {2, 3} split on age (p < 0.001): ≤ 9 → Node 7 (n = 40), > 9 → Node 8 (n = 467); class 1 splits on age (p = 0.008): ≤ 54 → Node 10 (n = 123), > 54 → Node 11 (n = 28). Bars in each terminal node show the proportion surviving (yes vs No).

Page 7

Example 2: New York City Air Quality Data

Air quality data

Data (n = 111 days with complete records, May - Sep 1973)

Ozone Mean ozone in parts per billion

Solar.R Solar radiation (Langleys)

Wind Average wind speed (mph)

Temp Maximum daily temperature (°F)

Month Month (1-12)

Day Day of month (1-31)

Question

What explains the variation of Ozone level in New York City?

Page 8

Example 2: New York City Air Quality Data

> data(airquality) # load data

> air <- airquality

> air <- na.omit(air) # exclude missing data

> head(air) # display a few records

Ozone Solar.R Wind Temp Month Day

1 41 190 7.4 67 5 1

2 36 118 8.0 72 5 2

3 12 149 12.6 74 5 3

4 18 313 11.5 62 5 4

7 23 299 8.6 65 5 7

8 19 99 13.8 59 5 8
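The tree on the following slide splits first on Temp and then on Wind; the resulting partition can be checked by hand in base R. A minimal sketch, with the cut points (Temp 82 and 75, Wind 9.2) taken from the fitted tree shown in the deck:

```r
## Recreate the tree's four leaf groups by hand and compare Ozone means.
data(airquality)
air <- na.omit(airquality)                      # 111 complete records
grp <- with(air, ifelse(Temp > 82, "Temp>82",
              ifelse(Wind <= 9.2, "Temp<=82, Wind<=9.2",
                ifelse(Temp <= 75, "Temp<=75, Wind>9.2",
                                   "75<Temp<=82, Wind>9.2"))))
table(grp)                    # group sizes match the node sizes in the plot
tapply(air$Ozone, grp, mean)  # mean Ozone per leaf
```

The group sizes from `table(grp)` should agree with the n's printed at the terminal nodes of the plot.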

Page 9

Example 2: New York City Air Quality Data

NYC Air Quality, May-Sept 1973 (n = 111)

Figure: regression tree for Ozone. Root split on Temp (p < 0.001): ≤ 82 vs > 82. For Temp ≤ 82, split on Wind (p < 0.001): ≤ 9.2 → Node 3 (n = 24); > 9.2 → split on Temp (p = 0.003): ≤ 75 → Node 5 (n = 32), > 75 → Node 6 (n = 21). Temp > 82 → Node 7 (n = 34). Terminal nodes show the Ozone distribution on a 0-150 scale.

Page 10

Example 3: An Artificial Data

> ## Create an artificial data (Hastie etc.)

> z <- matrix(0, 40, 40) # Create a 40x40 matrix

> z[1:16, 1:12] <- 2

> z[17:24, ] <- 5

> z[25:40, 1:28] <- 8

> z[25:40, 29:40] <- 10

> ## Data set: 40x40 = 1600 rows

> ds <- data.frame(y =as.vector(z),expand.grid(1:40,1:40))

> colnames(ds) <- c("y", "x1", "x2")

Page 11

Example 3: An Artificial Data

Figure: perspective plot of y over the (x1, x2) grid. Artificial Data (Breaks: x1 = 16.5, 24.5; x2 = 12.5, 28.5). Example in Hastie et al.

Page 12

Example 3: An Artificial Data

Artificial Data Completely Recovered by Tree Model (n = 40 x 40 = 1600)

Figure: rpart tree. Splits: x1 < 16.5 (then x2 >= 12.5) on the left; x1 < 24.5 (then x2 < 28.5) on the right. Leaves: y = 0 (n = 448), y = 2 (n = 192), y = 5 (n = 320), y = 8 (n = 448), y = 10 (n = 192).
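The construction from the previous slide can be checked end-to-end: fitting a regression tree with rpart recovers the four breaks. A minimal sketch, assuming the rpart package (shipped with R as a recommended package):

```r
library(rpart)

## Rebuild the artificial surface (same as the earlier slide)
z <- matrix(0, 40, 40)
z[1:16, 1:12]   <- 2
z[17:24, ]      <- 5
z[25:40, 1:28]  <- 8
z[25:40, 29:40] <- 10
ds <- data.frame(y = as.vector(z), expand.grid(1:40, 1:40))
colnames(ds) <- c("y", "x1", "x2")

## Fit and inspect: the splits land at x1 = 16.5, 24.5 and x2 = 12.5, 28.5
fit <- rpart(y ~ x1 + x2, data = ds, method = "anova")
print(fit)
mean((predict(fit) - ds$y)^2)  # near 0 when the surface is fully recovered
```

Because `as.vector` unrolls the matrix column by column and `expand.grid` cycles its first argument fastest, x1 is the row index and x2 the column index, matching the breaks in the figure.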

Page 13

General Comment

The Basic Idea of Decision Tree

A dependent variable (y): continuous or categorical

Multiple (can be many) independent variables (x's): continuous or categorical

The tree looks for the split on a node that leads to the most differentiation on y

The tree stops when further splits become ineffective

A decision tree can:

Serve as a model (e.g. create rules)
Make predictions
Segment the data
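The split search behind this idea can be illustrated in a few lines of base R: for a single numeric x, try every midpoint between adjacent values and keep the one that minimizes the within-node sum of squares. A toy sketch with made-up numbers (not the deck's code):

```r
## Greedy search for the single best split on x (regression case)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(5, 5, 6, 20, 21, 19)
sse  <- function(v) sum((v - mean(v))^2)
cuts <- (head(x, -1) + tail(x, -1)) / 2          # candidate midpoints
cost <- sapply(cuts, function(ct) sse(y[x <= ct]) + sse(y[x > ct]))
cuts[which.min(cost)]   # 3.5 - the boundary between the two y clusters
```

A real implementation repeats this search over every predictor at every node, which is why the procedure is called greedy.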

Page 14

General Comment

History

Social scientists: Morgan and Sonquist (1963); Morgan andMessenger (1973)

Statistics: Breiman et al. (1984) - CART

Machine learning: Quinlan (1979 and after)

Ripley (1996)

Page 15

General Comment

Type of Algorithms

By dependent variable type

Classification tree Dependent variable is discrete
Ex: purchase (yes/no), types of disease treatment, ...

Regression tree Dependent variable is continuous
Ex: spending, likelihood to buy, ...

Popular Implementations

CHAID CHi-squared Automatic Interaction Detector

CART Classification And Regression Tree

C4.5 and C5.0 and some newer ones

Page 16

General Comment

Branch Split

CHAID allows multi-way splits - a wider tree

CART uses binary splits

All major tree implementations in R enforce binary splits

A multi-way split can be achieved by several binary splits

Binary splits avoid potential issues with multi-way splits:

The need to normalize for size when comparing splits
Quick fragmentation of the sample

Note

The tree grows by optimizing only the split from the current node rather than optimizing the entire tree

Page 17

Regression Tree

Regression Tree

The dependent variable is continuous

Fit a simple constant model to minimize the sum of squares around the constant

Like fitting an ANOVA model
Equivalent to fitting a Gaussian GLM
A greedy algorithm makes the search for the best split computationally easy
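The "constant model" point can be verified directly: within a node, the constant that minimizes the sum of squares is the node mean of y. A base-R sketch over a grid of candidate constants:

```r
## The node mean is the least-squares constant
y    <- c(2, 4, 6, 8)
cand <- seq(0, 10, by = 0.1)                  # candidate constants
sse  <- sapply(cand, function(k) sum((y - k)^2))
cand[which.min(sse)]   # 5, the same as mean(y)
```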

Page 18

Regression Tree

Page 19

Regression Tree

Air Quality Data - Ozone (Numeric) Split on Temperature Alone

Figure: regression tree using only Temp. Root split on Temp (p < 0.001): ≤ 82 vs > 82. Temp ≤ 82 splits again on Temp (p < 0.001): ≤ 77 → Node 3 (n = 50), > 77 → Node 4 (n = 27). Temp > 82 splits on Temp (p = 0.017): ≤ 87 → Node 6 (n = 17), > 87 → Node 7 (n = 17). Terminal nodes show Ozone on a 0-80 scale.

Page 20

Regression Tree

Matching Tree Split to Raw Data (n = 111)

Figure: scatter plot of Ozone (0-150) against Temp (about 60-90), with a loess fit on the raw data and vertical lines at the breaks identified by the tree.

Page 21

Classification Tree

Classification Tree

The dependent variable is categorical

Common node impurity measures used:

Misclassification error
Gini index
Cross-entropy or deviance
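For a two-class node with class proportions p, all three measures are a one-liner each; a base-R sketch with illustrative numbers:

```r
## Node impurity for class proportions p = (0.25, 0.75)
p <- c(0.25, 0.75)
misclass <- 1 - max(p)           # misclassification error: 0.25
gini     <- sum(p * (1 - p))     # Gini index: 0.375
entropy  <- -sum(p * log2(p))    # cross-entropy: about 0.811 bits
c(misclass = misclass, gini = gini, entropy = entropy)
```

Gini and cross-entropy are differentiable and more sensitive to changes in the node probabilities, which is why they are usually preferred for growing the tree.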

Page 22

Classification Tree

> air0 <- air <- na.omit(airquality)

> vars<-c("Ozone","Solar.R","Wind","Temp")

> for (v in vars) {

+ air[,paste(v,".Cat",sep="")] <- cut(air[, v],

+ breaks=c(-Inf, median(air[, v]), Inf),

+ label = c("Low", "High"))

+ }

> air$Month.Cat <- as.factor(air$Month)

> air <- subset(air, select = -c(Ozone, Month, Day))

> head(air)

Solar.R Wind Temp Ozone.Cat Solar.R.Cat Wind.Cat Temp.Cat Month.Cat

1 190 7.4 67 High Low Low Low 5

2 118 8.0 72 High Low Low Low 5

3 149 12.6 74 Low Low High Low 5

4 313 11.5 62 Low High High Low 5

7 299 8.6 65 Low High Low Low 5

8 99 13.8 59 Low Low High Low 5

Page 23

Classification Tree

Air Quality - ALL Categorical Variables

Figure: classification tree on the categorized data. Root split on Temp.Cat (p < 0.001): High → split on Solar.R.Cat (p = 0.074): Low → Node 3 (n = 25), High → Node 4 (n = 29); Low → Node 5 (n = 57). Terminal nodes show the proportion of High vs Low Ozone.Cat.

Page 24

Classification Tree

Figure: four single-split trees for Ozone.Cat, one per predictor; each terminal node shows the proportion of High vs Low Ozone.Cat.

Split on Solar.R (p = 0.03): High → Node 2 (n = 55), Low → Node 3 (n = 56)
Split on Temperature (p < 0.001): High → Node 2 (n = 54), Low → Node 3 (n = 57)
Split on Wind (p < 0.001): Low → Node 2 (n = 58), High → Node 3 (n = 53)
Split on Month (p < 0.001): {7, 8} → Node 2 (n = 49), {5, 6, 9} → Node 3 (n = 62)

Page 25

Tree Stop and Pruning

Tree Stop and Pruning

General strategy:

Grow tree first and then prune

Implement cost-complexity pruning

Use a tuning parameter α

Estimate α through 10-fold cross-validation
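In rpart this grow-then-prune workflow is exposed through the CP table. A hedged sketch on the kyphosis data that ships with rpart; `printcp`, `plotcp`, and `prune` are the CP-related functions:

```r
library(rpart)

## Grow first, then prune by cross-validated complexity
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
printcp(fit)                     # CP table with 10-fold CV error (xerror)
# plotcp(fit)                    # plot CV error against cp

## Prune at the cp with the smallest cross-validated error
best   <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best)
print(pruned)
```

Because xerror comes from random cross-validation folds, the selected cp (and hence the pruned tree) can vary slightly between runs.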

Page 26

Missing Data

Missing Predictor Values

Several strategies rather than casewise deletion:

Missing value coded as a separate category

Construct surrogate variables - use highly correlated variables without missing values

Split cases with missing values when passing them down a branch

Missing value imputation
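In rpart the surrogate-split strategy is the default behavior. A minimal sketch on airquality with its missing values left in place:

```r
library(rpart)

## rpart keeps cases with missing predictors by using surrogate splits;
## rows with a missing response are dropped automatically
fit <- rpart(Ozone ~ Solar.R + Wind + Temp, data = airquality)
summary(fit)   # the per-node report lists the surrogate variables used

## The behavior is tunable via rpart.control, e.g.:
## rpart.control(usesurrogate = 2, maxsurrogate = 5)
```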

Page 27

Everything in R is an object. - John Chambers

Page 28

Key Packages

rpart Classic but updated workhorse for decision trees in R

party Conditional Inference Tree

RWeka R/Weka interface

tree Another regression and classification tree

evtree Global optimization

mvpart Multivariate dependent variable tree

partykit A general tree infrastructure

Page 29

rpart

> library(rpart)

> op <- par(mfrow = c(1,2)) # print two plots on one screen

> ## run rpart model

> rp1 <- rpart(survived ~ sex + age + pclass, data = Titanic)

> ## simple plot. branch proportional to error in the fit.

> plot(rp1, main = "Simple Display") # simple on the left

> text(rp1) # add text label

> ## Fancier plot. equal branch spacing.

> plot(rp1, branch = 0.5, uniform = TRUE, main = "Pretty Display")

> text(rp1, pretty = 0, fancy = TRUE, use.n=TRUE, all = TRUE)

> par(op) # reset par

Page 30

rpart

Figure: the same rpart tree drawn two ways. Left, "Simple Display": abbreviated split labels (sex=b; age>=9.5; pclass=c; age>=1.5) with leaf classes only. Right, "Pretty Display": split labels spelled out (sex=male vs female; age>=9.5 vs < 9.5; pclass=3 vs 1,2; age>=1.5 vs < 1.5) with No/yes counts at every node - root No 619/427; male No 523/135 (age>=9.5: No 505/110; age<9.5: yes 18/25, then pclass=3 No 18/11, pclass=1,2 yes 0/14); female yes 96/292 (pclass=3 No 80/72, then age>=1.5 No 79/66, age<1.5 yes 1/6; pclass=1,2 yes 16/220).

Page 31

rpart

> print(rp1) # print rpart object

n= 1046

node), split, n, loss, yval, (yprob)

* denotes terminal node

1) root 1046 427 No (0.59177820 0.40822180)

2) sex=male 658 135 No (0.79483283 0.20516717)

4) age>=9.5 615 110 No (0.82113821 0.17886179) *

5) age< 9.5 43 18 yes (0.41860465 0.58139535)

10) pclass=3 29 11 No (0.62068966 0.37931034) *

11) pclass=1,2 14 0 yes (0.00000000 1.00000000) *

3) sex=female 388 96 yes (0.24742268 0.75257732)

6) pclass=3 152 72 No (0.52631579 0.47368421)

12) age>=1.5 145 66 No (0.54482759 0.45517241) *

13) age< 1.5 7 1 yes (0.14285714 0.85714286) *

7) pclass=1,2 236 16 yes (0.06779661 0.93220339) *

Page 32

rpart

> path.rpart(rp1, node=c(4, 7)) ## show the split paths to nodes 4 and 7

node number: 4

root

sex=male

age>=9.5

node number: 7

root

sex=female

pclass=1,2

Page 33

rpart

> head(predict(rp1)) ## predicted probability of survival

No yes

1 0.06779661 0.9322034

2 0.00000000 1.0000000

3 0.06779661 0.9322034

4 0.82113821 0.1788618

5 0.06779661 0.9322034

6 0.82113821 0.1788618

> ## actual vs. predicted probability in one data frame

> tmp <- cbind(actual=as.numeric(Titanic$survived)-1, pred=predict(rp1)[, 2])

> cor(tmp) ## correlation

actual pred

actual 1.000000 0.640939

pred 0.640939 1.000000

> aggregate(tmp, by=list(Titanic$survived), mean) # compare by survival

Group.1 actual pred

1 No 0 0.2405231

2 yes 1 0.6513260

> aggregate(tmp, by=list(Titanic$sex), mean) # compare by gender

Page 34

rpart

Group.1 actual pred

1 female 0.7525773 0.7525773

2 male 0.2051672 0.2051672

> aggregate(tmp, by=list(Titanic$pclass), mean) #compare by pclass

Group.1 actual pred

1 1 0.6373239 0.5403331

2 2 0.4406130 0.5107649

3 3 0.2614770 0.2799117

Page 35

rpart

> # use all predictors with control change

> rp1 <- rpart(survived ~ . , control = rpart.control(minsplit=30,

+ minbucket=15, cp=0.012), data = Titanic)

> plot(rp1, branch=0.3, uniform=TRUE, margin = 0.1,

+ main = "minsplit=30, minbucket=15, cp=0.012")

> text(rp1, pretty=0, fancy=TRUE, use.n=TRUE, all = TRUE)

Page 36

rpart

minsplit=30, minbucket=15, cp=0.012

Figure: rpart tree under the modified controls. Root No 619/427 splits on sex. Male (No 523/135): age>=9.5 → No 505/110, age<9.5 → yes 18/25. Female (yes 96/292): pclass=3 (No 80/72) splits on age: >=27.5 → No 30/16, <27.5 → yes 50/56; pclass=1,2 → yes 16/220.

Page 37

rpart

> ## Pruning tree by a high CP

> rp1 <- prune(rp1, cp=0.018)

> plot(rp1, branch=0.3, uniform=TRUE, margin = 0.2,

+ main = "After Pruning by cp = 0.018")

> text(rp1, pretty=0, fancy=TRUE, use.n=TRUE, all = TRUE)

Page 38

rpart

After Pruning by cp = 0.018

Figure: pruned rpart tree. Root (No 619/427) splits on sex: male → No 523/135 (leaf); female (yes 96/292) splits on pclass: 3 → No 80/72, {1, 2} → yes 16/220.

Page 39

rpart

Comments on rpart

A tree model is an object, so everything is accessible

Finer tree control can be exercised through function parameters

rpart function: method, model, parms, control

rpart.control function controls:

Minimum node size for attempting a further split
Minimum number of records in a node
Complexity parameter (CP)
Depth of the tree
...

More CP related functions for tree size determination

Page 40

party package

party package Overview

Use conditional inference

A framework for general tree models

Powerful and flexible tree graphics

Many types of dependent variables:

nominal
ordinal
numeric
censored
multivariate

Covariates on arbitrary measurement scales

Page 41

party package

Key tree modeling functions in party

ctree Conditional Inference Tree

mob Model-based Recursive Partitioning

cforest Random forest

Page 42

party package

Edgar Anderson’s Iris Data (in R)

Data

Species Factor of 3 classes: setosa, versicolor, virginica

Sepal.Length continuous

Sepal.Width continuous

Petal.Length continuous

Petal.Width continuous

Question

Use a tree model to predict Species from the 4 measurements

Page 43

party package

> str(iris) ## show object structure

'data.frame': 150 obs. of 5 variables:

$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

> levels(iris$Species) <- c("setos", "versi", "virgi")

> ### classification

> iris.ct <- ctree(Species ~ .,data = iris)

> iris.ct

Conditional inference tree with 4 terminal nodes

Response: Species

Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width

Number of observations: 150

1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264

2)* weights = 50

1) Petal.Length > 1.9

3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894

Page 44

party package

4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865

5)* weights = 46

4) Petal.Length > 4.8

6)* weights = 8

3) Petal.Width > 1.7

7)* weights = 46

> table(predict(iris.ct), iris$Species)

setos versi virgi

setos 50 0 0

versi 0 49 5

virgi 0 1 45

> plot(iris.ct, main = "Predict iris Species by ctree")

Page 45

party package

Predict iris Species by ctree

Figure: ctree for iris. Root split on Petal.Length: ≤ 1.9 → Node 2 (n = 50, all setosa); > 1.9 → split on Petal.Width: ≤ 1.7 → split on Petal.Length: ≤ 4.8 → Node 5 (n = 46), > 4.8 → Node 6 (n = 8); > 1.7 → Node 7 (n = 46). Terminal nodes show the class proportions (setos/versi/virgi).

> ### survival analysis

Page 46

party package

> data("GBSG2", package = "ipred")

> ct1 <- ctree(Surv(time, cens) ~ .,data = GBSG2)

> plot(ct1)

Page 47

party package

Figure: survival tree for the GBSG2 breast cancer data. Root split on pnodes (p < 0.001): ≤ 3 → split on horTh (p = 0.035): no → Node 3 (n = 248), yes → Node 4 (n = 128); > 3 → split on progrec (p < 0.001): ≤ 20 → Node 6 (n = 144), > 20 → Node 7 (n = 166). Terminal nodes show estimated survival curves over 0-2500 days.

Page 48

party package

> ex1<-ctree(Ozone ~ ., data=air0, controls=ctree_control(

+ maxdepth=3, mincriterion=0.95, minbucket=20))

> plot(ex1, inner_panel = node_inner(ex1, fill = "pink2"),

+ terminal_panel = node_hist(ex1, ymax = 0.07,

+ xscale = c(0, 200), fill = "cyan"),

+ main="NYC Air Quality - Different Tree Display")

Page 49

party package

NYC Air Quality - Different Tree Display

Figure: the same Ozone tree as before (Temp ≤ 82 → Wind ≤ 9.2 → Node 3 (n = 24); Wind > 9.2 → Temp ≤ 75 → Node 5 (n = 32) vs Node 6 (n = 21); Temp > 82 → Node 7 (n = 34)), drawn with pink inner nodes and cyan terminal histograms of Ozone on a 0-200 scale.

Page 50

party package

mob - Model-based Recursive Partitioning

Typical tree algorithms partition data on differences in the dependent variable

mob partitions data on differences in the fitted model

It relies on tests of parameter instability

The outcome is still a tree, whose nodes display different model patterns


> ## recursive partitioning of a logistic regression model

> ## load data

> data("PimaIndiansDiabetes", package = "mlbench")

> ## partition logistic regression diabetes ~ glucose

> ## with respect to all remaining variables

> fmPID <- mob(diabetes ~ glucose | pregnant + pressure + triceps +

+ insulin + mass + pedigree + age,

+ data = PimaIndiansDiabetes, model = glinearModel,

+ family = binomial())

> ## fitted model

> coef(fmPID)

(Intercept) glucose

2 -9.951510 0.05870786

4 -6.705586 0.04683748

5 -2.770954 0.02353582

> plot(fmPID, main = "Pima Indians Diabetic Data (n = 768)")


Pima Indians Diabetic Data (n = 768)

[mob plot: node 1 splits on mass (p < 0.001) at ≤ 26.3 vs > 26.3; node 3 splits on age (p < 0.001) at ≤ 30 vs > 30; terminal panels for node 2 (n = 167), node 4 (n = 304), and node 5 (n = 297) each plot the fitted probability of diabetes (neg/pos) against glucose]


RWeka package

RWeka package Overview

Weka (http://www.cs.waikato.ac.nz/ml/weka/) offers a collection of machine learning algorithms for data mining

Weka is written in Java

Tree learners offered by Weka: C4.5, Naive Bayes trees, M5, logistic model tree

The RWeka package creates an R interface to Weka


> library(RWeka)

> w1 <- J48(survived ~ ., data=Titanic,

+ control = Weka_control(R = TRUE, B= TRUE))

> plot(w1, main="Tree by Weka J48 Model - Titanic Data")

> w1 ## print J48 model

J48 pruned tree

------------------

sex = female: yes (257.0/64.0)

sex != female

| age <= 9.0

| | pclass = 3: No (21.0/10.0)

| | pclass != 3: yes (9.0)

| age > 9.0: No (411.0/73.0)

Number of Leaves : 4

Size of the tree : 7

> summary(w1) ## Summary of J48 model

=== Summary ===


Correctly Classified Instances 829 79.2543 %

Incorrectly Classified Instances 217 20.7457 %

Kappa statistic 0.5667

Mean absolute error 0.3244

Root mean squared error 0.4028

Relative absolute error 67.1335 %

Root relative squared error 81.9435 %

Coverage of cases (0.95 level) 100 %

Mean rel. region size (0.95 level) 99.3308 %

Total Number of Instances 1046

=== Confusion Matrix ===

a b <-- classified as

523 96 | a = No

121 306 | b = yes


Tree by Weka J48 Model − Titanic Data

[J48 tree plot: sex = female: yes (257.0/64.0); sex = male and age ≤ 9: pclass = 3: No (21.0/10.0), pclass ∈ {1, 2}: yes (9.0); sex = male and age > 9: No (411.0/73.0)]


evtree package

evtree package Overview

Global optimization

Evolutionary algorithm

Classification and regression trees

Uses partykit for the tree structure

Computationally demanding


> library(evtree)

> iris.ev <- evtree(Species ~ ., data=iris) ## evtree

> iris.ev

Model formula:

Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width

Fitted party:

[1] root

| [2] Petal.Width < 1: setosa (n = 50, err = 0.0%)

| [3] Petal.Width >= 1

| | [4] Petal.Length < 5

| | | [5] Petal.Width < 1.7: versicolor (n = 47, err = 0.0%)

| | | [6] Petal.Width >= 1.7: virginica (n = 7, err = 14.3%)

| | [7] Petal.Length >= 5: virginica (n = 46, err = 4.3%)

Number of inner nodes: 3

Number of terminal nodes: 4

> plot(iris.ev, main = "Iris data using evtree")


Iris data using evtree

[evtree plot: node 1 splits on Petal.Width at < 1 vs >= 1; node 3 splits on Petal.Length at < 5 vs >= 5; node 4 splits on Petal.Width at < 1.7 vs >= 1.7; terminal panels show Species bar charts for node 2 (n = 50), node 5 (n = 47), node 6 (n = 7), and node 7 (n = 46)]


mvpart package

> ##~ Use the mvpart function to fit a multiple response model

> ##~ Automobile Data from 'Consumer Reports' 1990 (n = 49 cars)

>

> library(mvpart)

> ## Data set up

> data(car.test.frame) ## Consumer Reports car data in mvpart package

> car <- na.omit(car.test.frame) # use a short name

> head(car) # display a few records

Price Country Reliability Mileage Type Weight Disp. HP

Eagle Summit 4 8895 USA 4 33 Small 2560 97 113

Ford Escort 4 7402 USA 2 33 Small 2345 114 90

Ford Festiva 4 6319 Korea 4 37 Small 1845 81 63

Honda Civic 4 6635 Japan/USA 5 32 Small 2260 91 92

Mazda Protege 4 6599 Japan 5 32 Small 2440 113 103

Mercury Tracer 4 8672 Mexico 4 26 Small 2285 97 82

> car <- cbind(as.data.frame(scale(car[, c(1,3:4,6:7)])),

+ car[, c(2,5)]) # rescale the 5 response variables

> # fit and display a tree using "mvpart"

> car.mv <- mvpart(data.matrix(car[, 1:5]) ~ Country + Type,

+ data = car, uniform = TRUE, prn = TRUE, all.leaves = TRUE)

rpart(formula = form, data = data)


Variables actually used in tree construction:

[1] Country Type

Root node error: 240/49 = 4.898

n= 49

CP nsplit rel error xerror xstd

1 0.356769 0 1.00000 1.04325 0.119931

2 0.142423 1 0.64323 0.73462 0.099250

3 0.083626 2 0.50081 0.59171 0.072715

4 0.068475 3 0.41718 0.57669 0.073553

5 0.024982 4 0.34871 0.49260 0.065112

> # PCA biplot of 5 group means (leaves)

> rpart.pca(car.mv, wgt.ave = FALSE)


[mvpart tree plot: root (deviance 251, n = 49) first splits on Type = Small, with further splits on Country (e.g. USA vs Japan/Japan-USA/Sweden) and Type; each node is annotated with its deviance and size; responses are Price, Reliability, Mileage, Weight, Disp.; Error: 0.349, CV Error: 0.521, SE: 0.0694]


[PCA biplot of the leaf (group) means: Dim 1 explains 77.99 % and Dim 2 19.88 % of the variance; arrows show the five responses Price, Reliability, Mileage, Weight, and Disp.]


partykit package

partykit - a toolkit for tree infrastructure in R

Represent tree models (objects)

Summarize results

Visualize tree structures

Read/coerce tree models from other sources (rpart, RWeka, PMML)

Offer standard methods for tree manipulation (print, plot, predict, ...)
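As a sketch of the coercion workflow (assuming the rpart and partykit packages are installed), an rpart fit can be converted to partykit's unified representation and then handled with its standard methods:

```r
library(rpart)     # CART implementation
library(partykit)  # unified tree infrastructure

fit  <- rpart(Species ~ ., data = iris)  # grow an rpart tree
pfit <- as.party(fit)                    # coerce to a 'party' object
print(pfit)                              # partykit's standard print method
# plot(pfit)                             # and its standard plot method
```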


Overview

Data import

foreign R package that reads data stored by Minitab, S, SAS, SPSS, Stata, Systat, dBase, ...

ASCII file read.table statement

manual R official manual R Data Import/Export (http://cran.wustl.edu/doc/manuals/R-data.pdf)

database See the Relational databases section of the manual above

Excel See the Reading Excel spreadsheets section of the manual above
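A minimal read.table sketch; the CSV file here is a temporary demo file created only for illustration:

```r
tmp <- tempfile(fileext = ".csv")            # demo file, for illustration only
writeLines(c("x,y", "1,a", "2,b"), tmp)
ds <- read.table(tmp, header = TRUE, sep = ",",
                 na.strings = c("", "NA"))   # read the delimited ASCII file
ds                                           # a 2-row, 2-column data frame
```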


Data inspection

Common R statements

summary Summary statistics

table Frequency or crosstab

hist Histogram

str Display object structure

head Display a few records

dsn[i:j, m:n] Display rows i-j and columns m-n

describe A function in package Hmisc

. . .
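A quick sketch of these inspection statements, applied to the built-in airquality data (the NYC air quality data used earlier):

```r
data(airquality)              # NYC air quality data shipped with R
summary(airquality$Ozone)     # summary statistics (reports NA count)
table(airquality$Month)       # frequency table of Month
str(airquality)               # structure: types and first values
head(airquality, 3)           # display the first few records
airquality[1:5, 2:4]          # rows 1-5, columns 2-4
```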


Data manipulation

Statements for recoding

ifelse Conditional statement

as.factor Coerce to a (nominal) factor data type

as.ordered Coerce to an ordinal factor data type

cut Cut numerical data into categorical variables (factor)

as.data.frame Coerce into a data frame

apply By rows or columns: apply, lapply, sapply, by, aggregate, ...

. . .
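A minimal sketch of these recoding statements; the age vector is hypothetical illustration data:

```r
age <- c(5, 12, 34, 67, 80)                     # hypothetical values
grp <- as.factor(ifelse(age <= 12,              # conditional recode,
                        "child", "adult"))      # then coerce to a factor
band <- cut(age, breaks = c(0, 18, 65, Inf),    # cut numeric into a factor
            labels = c("young", "middle", "senior"))
m <- matrix(1:6, nrow = 2)
apply(m, 2, sum)                                # column sums: 3 7 11
```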


Data manipulation

Statements for data subsetting

dsn[1:n, ] Row indexing

dsn[, 10:6 ] Column indexing

subset Select records (rows) and columns

. . .
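A minimal sketch of these subsetting statements on the built-in airquality data; the object names are arbitrary:

```r
data(airquality)
first10 <- airquality[1:10, ]                # row indexing
temps   <- airquality[, c("Temp", "Month")]  # column indexing by name
hot     <- subset(airquality, Temp > 90,     # select rows by condition,
                  select = c(Ozone, Temp))   # columns via 'select'
```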


Example

> ## ----- Example: Titanic Passenger Data

> Titanic <- read.excel("C:\\Project\\MyTree\\titanic3.xlsx", "titanic3")

> str(Titanic)

'data.frame': 1310 obs. of 14 variables:

$ pclass : num 1 1 1 1 1 1 1 1 1 1 ...

$ survived : num 1 1 0 0 0 1 1 0 1 0 ...

$ name : chr "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mr. Hudson Joshua Creighton" ...

$ sex : chr "female" "male" "female" "male" ...

$ age : num 29 0.917 2 30 25 ...

$ sibsp : num 0 1 1 1 1 0 1 0 2 0 ...

$ parch : num 0 2 2 2 2 0 0 0 0 0 ...

$ ticket : chr "24160" "113781" "113781" "113781" ...

$ fare : num 211 152 152 152 152 ...

$ cabin : chr "B5" "C22 C26" "C22 C26" "C22 C26" ...

$ embarked : chr "S" "S" "S" "S" ...

$ boat : chr "2" "11" NA NA ...

$ body : chr NA NA NA "135" ...

$ home.dest: chr "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...

> ds <- subset(Titanic, select = c(survived, sex, age, pclass)) ## select variables

> sort(colSums(is.na(ds))) ## check missing

survived sex pclass age

1 1 1 264


> ds <- na.omit(ds) # use only complete records

> str(ds)

'data.frame': 1046 obs. of 4 variables:

$ survived: num 1 1 0 0 0 1 1 0 1 0 ...

$ sex : chr "female" "male" "female" "male" ...

$ age : num 29 0.917 2 30 25 ...

$ pclass : num 1 1 1 1 1 1 1 1 1 1 ...

- attr(*, "na.action")=Class 'omit' Named int [1:264] 16 38 41 47 60 70 71 75 81 107 ...

.. ..- attr(*, "names")= chr [1:264] "16" "38" "41" "47" ...

> # change to factor

> ds$survived <- factor(ds$survived, labels = c("No", "yes"))

> ds$sex <- as.factor(ds$sex)

> ds$pclass <- as.factor(ds$pclass)

> # run ctree

> # cf <- ctree(survived ~ ., data = ds, controls =

> # ctree_control(maxdepth = 3, mincriterion = 0.95, minbucket = 20))

>

> #plot(cf, main = "Titanic Passengers (n = 1046)")

>

> Titanic <- ds

> save(Titanic, file = "C:\\Project\\MyTree\\Titanic.RData") ## save R data

> #load("C:\\Project\\MyTree\\Titanic.RData") ## load data next time


Advantages

Requires little statistics

Easy data requirements (even allows missing data!)

Captures nonlinear relationships

Accommodates interactions

Runs very fast

Good interpretability and visualization

A convenient method for data segmentation


Limitations

Less stable (or less reproducible)

Uses only a limited number of variables

Lacks parametric information (such as variable importance)

Typically requires a relatively large sample size

Not a good way to identify variable importance


Thank you for attending!
