
Page 1: Decision Tree in R - files.meetup.com/1676436/DecisionTrees.pdf

Introduction Method R Implementation Data Preparation Conclusion

Decision Tree in R

December 7, 2011

Ming Shan R/Predictive Analytics Meetup - December 7, 2011

Page 2

Content I

1 Introduction
  Example 1: Titanic Data
  Example 2: New York City Air Quality Data
  Example 3: An Artificial Data

2 Method
  General Comment
  Regression Tree
  Classification Tree
  Tree Stop and Pruning
  Missing Data

3 R Implementation
  rpart
  party package

Page 3

Content II

  RWeka package
  evtree package
  mvpart package
  partykit package

4 Data Preparation
  Overview
  Example

5 Conclusion

Page 4

Agenda

Introduction through examples

Algorithm - a conceptual view

R implementation

Data preparation

Conclusion

Page 5

Example 1: Titanic Data

Titanic Passenger Survival Data

Data (n = 1046 complete records)

survived yes, no

sex female, male

pclass passenger class on the ship

age continuous

Question

Who survived and who perished?

Page 6

Figure: Survival of Titanic Passengers (n = 1046) - conditional inference tree. Root split on sex (p < 0.001). Females split on pclass (p < 0.001): class 3 → Node 3 (n = 152); classes {1, 2} → Node 4 (n = 236). Males split on pclass (p < 0.001): classes {2, 3} split on age (p < 0.001): ≤ 9 → Node 7 (n = 40), > 9 → Node 8 (n = 467); class 1 splits on age (p = 0.008): ≤ 54 → Node 10 (n = 123), > 54 → Node 11 (n = 28). Bars in each terminal node show the proportion surviving (yes vs No).

Page 7

Example 2: New York City Air Quality Data

Air quality data

Data (n = 111 days with complete records, May - Sep 1973)

Ozone Mean ozone in parts per billion

Solar.R Solar radiation (Langleys)

Wind Average wind speed (mph)

Temp Maximum daily temperature (°F)

Month Month (1-12)

Day Day of month (1-31)

Question

What explains the variation of Ozone level in New York City?

Page 8

Example 2: New York City Air Quality Data

> data(airquality) # load data

> air <- airquality

> air <- na.omit(air) # exclude missing data

> head(air) # display a few records

Ozone Solar.R Wind Temp Month Day

1 41 190 7.4 67 5 1

2 36 118 8.0 72 5 2

3 12 149 12.6 74 5 3

4 18 313 11.5 62 5 4

7 23 299 8.6 65 5 7

8 19 99 13.8 59 5 8
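The tree on the following slide splits first on Temp and then on Wind; the resulting partition can be checked by hand in base R. A minimal sketch, with the cut points (Temp 82 and 75, Wind 9.2) taken from the fitted tree shown in the deck:

```r
## Recreate the tree's four leaf groups by hand and compare Ozone means.
data(airquality)
air <- na.omit(airquality)                      # 111 complete records
grp <- with(air, ifelse(Temp > 82, "Temp>82",
              ifelse(Wind <= 9.2, "Temp<=82, Wind<=9.2",
                ifelse(Temp <= 75, "Temp<=75, Wind>9.2",
                                   "75<Temp<=82, Wind>9.2"))))
table(grp)                    # group sizes match the node sizes in the plot
tapply(air$Ozone, grp, mean)  # mean Ozone per leaf
```

The group sizes from `table(grp)` should agree with the n's printed at the terminal nodes of the plot.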

Page 9

Example 2: New York City Air Quality Data

NYC Air Quality, May-Sept 1973 (n = 111)

Figure: regression tree for Ozone. Root split on Temp (p < 0.001): ≤ 82 vs > 82. For Temp ≤ 82, split on Wind (p < 0.001): ≤ 9.2 → Node 3 (n = 24); > 9.2 → split on Temp (p = 0.003): ≤ 75 → Node 5 (n = 32), > 75 → Node 6 (n = 21). Temp > 82 → Node 7 (n = 34). Terminal nodes show the Ozone distribution on a 0-150 scale.

Page 10

Example 3: An Artificial Data

> ## Create an artificial data (Hastie etc.)

> z <- matrix(0, 40, 40) # Create a 40x40 matrix

> z[1:16, 1:12] <- 2

> z[17:24, ] <- 5

> z[25:40, 1:28] <- 8

> z[25:40, 29:40] <- 10

> ## Data set: 40x40 = 1600 rows

> ds <- data.frame(y =as.vector(z),expand.grid(1:40,1:40))

> colnames(ds) <- c("y", "x1", "x2")

Page 11

Example 3: An Artificial Data

Figure: perspective plot of y over the (x1, x2) grid. Artificial Data (Breaks: x1 = 16.5, 24.5; x2 = 12.5, 28.5). Example in Hastie et al.

Page 12

Example 3: An Artificial Data

Artificial Data Completely Recovered by Tree Model (n = 40 x 40 = 1600)

Figure: rpart tree. Splits: x1 < 16.5 (then x2 >= 12.5) on the left; x1 < 24.5 (then x2 < 28.5) on the right. Leaves: y = 0 (n = 448), y = 2 (n = 192), y = 5 (n = 320), y = 8 (n = 448), y = 10 (n = 192).
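The construction from the previous slide can be checked end-to-end: fitting a regression tree with rpart recovers the four breaks. A minimal sketch, assuming the rpart package (shipped with R as a recommended package):

```r
library(rpart)

## Rebuild the artificial surface (same as the earlier slide)
z <- matrix(0, 40, 40)
z[1:16, 1:12]   <- 2
z[17:24, ]      <- 5
z[25:40, 1:28]  <- 8
z[25:40, 29:40] <- 10
ds <- data.frame(y = as.vector(z), expand.grid(1:40, 1:40))
colnames(ds) <- c("y", "x1", "x2")

## Fit and inspect: the splits land at x1 = 16.5, 24.5 and x2 = 12.5, 28.5
fit <- rpart(y ~ x1 + x2, data = ds, method = "anova")
print(fit)
mean((predict(fit) - ds$y)^2)  # near 0 when the surface is fully recovered
```

Because `as.vector` unrolls the matrix column by column and `expand.grid` cycles its first argument fastest, x1 is the row index and x2 the column index, matching the breaks in the figure.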

Page 13

General Comment

The Basic Idea of Decision Tree

A dependent variable (y): continuous or categorical

Multiple (can be many) independent variables (x's): continuous or categorical

The tree looks for the split on a node that leads to the most differentiation on y

The tree stops when further splits become ineffective

A decision tree can:

Serve as a model (e.g. create rules)
Make predictions
Segment the data
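The split search behind this idea can be illustrated in a few lines of base R: for a single numeric x, try every midpoint between adjacent values and keep the one that minimizes the within-node sum of squares. A toy sketch with made-up numbers (not the deck's code):

```r
## Greedy search for the single best split on x (regression case)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(5, 5, 6, 20, 21, 19)
sse  <- function(v) sum((v - mean(v))^2)
cuts <- (head(x, -1) + tail(x, -1)) / 2          # candidate midpoints
cost <- sapply(cuts, function(ct) sse(y[x <= ct]) + sse(y[x > ct]))
cuts[which.min(cost)]   # 3.5 - the boundary between the two y clusters
```

A real implementation repeats this search over every predictor at every node, which is why the procedure is called greedy.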

Page 14

General Comment

History

Social scientists: Morgan and Sonquist (1963); Morgan andMessenger (1973)

Statistics: Breiman et al. (1984) - CART

Machine learning: Quinlan (1979 and after)

Ripley (1996)

Page 15

General Comment

Type of Algorithms

By dependent variable type

Classification tree Dependent variable is discrete
Ex: purchase (yes/no), types of disease treatment, ...

Regression tree Dependent variable is continuous
Ex: spending, likelihood to buy, ...

Popular Implementations

CHAID CHi-squared Automatic Interaction Detector

CART Classification And Regression Tree

C4.5 and C5.0 and some newer ones

Page 16

General Comment

Branch Split

CHAID allows multi-way splits - a wider tree

CART uses binary splits

All major tree implementations in R enforce binary splits

A multi-way split can be achieved by several binary splits

Binary splits avoid potential issues with multi-way splits:

The need to normalize for size when comparing splits
Quick fragmentation of the sample

Note

The tree grows by optimizing only the split from the current node rather than optimizing the entire tree

Page 17

Regression Tree

Regression Tree

The dependent variable is continuous

Fit a simple constant model to minimize the sum of squares around the constant

Like fitting an ANOVA model
Equivalent to fitting a Gaussian GLM
A greedy algorithm makes the search for the best split computationally easy
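The "constant model" point can be verified directly: within a node, the constant that minimizes the sum of squares is the node mean of y. A base-R sketch over a grid of candidate constants:

```r
## The node mean is the least-squares constant
y    <- c(2, 4, 6, 8)
cand <- seq(0, 10, by = 0.1)                  # candidate constants
sse  <- sapply(cand, function(k) sum((y - k)^2))
cand[which.min(sse)]   # 5, the same as mean(y)
```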

Page 18

Regression Tree

Page 19

Regression Tree

Air Quality Data - Ozone (Numeric) Split on Temperature Alone

Figure: regression tree using only Temp. Root split on Temp (p < 0.001): ≤ 82 vs > 82. Temp ≤ 82 splits again on Temp (p < 0.001): ≤ 77 → Node 3 (n = 50), > 77 → Node 4 (n = 27). Temp > 82 splits on Temp (p = 0.017): ≤ 87 → Node 6 (n = 17), > 87 → Node 7 (n = 17). Terminal nodes show Ozone on a 0-80 scale.

Page 20

Regression Tree

Matching Tree Split to Raw Data (n = 111)

Figure: scatter plot of Ozone (0-150) against Temp (about 60-90), with a loess fit on the raw data and vertical lines at the breaks identified by the tree.

Page 21

Classification Tree

Classification Tree

The dependent variable is categorical

Common node impurity measures used:

Misclassification error
Gini index
Cross-entropy or deviance
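For a two-class node with class proportions p, all three measures are a one-liner each; a base-R sketch with illustrative numbers:

```r
## Node impurity for class proportions p = (0.25, 0.75)
p <- c(0.25, 0.75)
misclass <- 1 - max(p)           # misclassification error: 0.25
gini     <- sum(p * (1 - p))     # Gini index: 0.375
entropy  <- -sum(p * log2(p))    # cross-entropy: about 0.811 bits
c(misclass = misclass, gini = gini, entropy = entropy)
```

Gini and cross-entropy are differentiable and more sensitive to changes in the node probabilities, which is why they are usually preferred for growing the tree.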

Page 22

Classification Tree

> air0 <- air <- na.omit(airquality)

> vars<-c("Ozone","Solar.R","Wind","Temp")

> for (v in vars) {

+ air[,paste(v,".Cat",sep="")] <- cut(air[, v],

+ breaks=c(-Inf, median(air[, v]), Inf),

+ label = c("Low", "High"))

+ }

> air$Month.Cat <- as.factor(air$Month)

> air <- subset(air, select = -c(Ozone, Month, Day))

> head(air)

Solar.R Wind Temp Ozone.Cat Solar.R.Cat Wind.Cat Temp.Cat Month.Cat

1 190 7.4 67 High Low Low Low 5

2 118 8.0 72 High Low Low Low 5

3 149 12.6 74 Low Low High Low 5

4 313 11.5 62 Low High High Low 5

7 299 8.6 65 Low High Low Low 5

8 99 13.8 59 Low Low High Low 5

Page 23

Classification Tree

Air Quality - ALL Categorical Variables

Figure: classification tree on the categorized data. Root split on Temp.Cat (p < 0.001): High → split on Solar.R.Cat (p = 0.074): Low → Node 3 (n = 25), High → Node 4 (n = 29); Low → Node 5 (n = 57). Terminal nodes show the proportion of High vs Low Ozone.Cat.

Page 24

Classification Tree

Figure: four single-split trees for Ozone.Cat, one per predictor; each terminal node shows the proportion of High vs Low Ozone.Cat.

Split on Solar.R (p = 0.03): High → Node 2 (n = 55), Low → Node 3 (n = 56)
Split on Temperature (p < 0.001): High → Node 2 (n = 54), Low → Node 3 (n = 57)
Split on Wind (p < 0.001): Low → Node 2 (n = 58), High → Node 3 (n = 53)
Split on Month (p < 0.001): {7, 8} → Node 2 (n = 49), {5, 6, 9} → Node 3 (n = 62)

Page 25

Tree Stop and Pruning

Tree Stop and Pruning

General strategy:

Grow tree first and then prune

Implement cost-complexity pruning

Use a tuning parameter α

Estimate α through 10-fold cross-validation
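In rpart this grow-then-prune workflow is exposed through the CP table. A hedged sketch on the kyphosis data that ships with rpart; `printcp`, `plotcp`, and `prune` are the CP-related functions:

```r
library(rpart)

## Grow first, then prune by cross-validated complexity
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
printcp(fit)                     # CP table with 10-fold CV error (xerror)
# plotcp(fit)                    # plot CV error against cp

## Prune at the cp with the smallest cross-validated error
best   <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best)
print(pruned)
```

Because xerror comes from random cross-validation folds, the selected cp (and hence the pruned tree) can vary slightly between runs.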

Page 26

Missing Data

Missing Predictor Values

Several strategies rather than casewise deletion:

Missing value coded as a separate category

Construct surrogate variables - use highly correlated variables without missing values

Split cases with missing values when passing them down a branch

Missing value imputation
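In rpart the surrogate-split strategy is the default behavior. A minimal sketch on airquality with its missing values left in place:

```r
library(rpart)

## rpart keeps cases with missing predictors by using surrogate splits;
## rows with a missing response are dropped automatically
fit <- rpart(Ozone ~ Solar.R + Wind + Temp, data = airquality)
summary(fit)   # the per-node report lists the surrogate variables used

## The behavior is tunable via rpart.control, e.g.:
## rpart.control(usesurrogate = 2, maxsurrogate = 5)
```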

Page 27

Everything in R is an object. - John Chambers

Page 28

Key Packages

rpart Classic but updated workhorse for decision trees in R

party Conditional Inference Tree

RWeka R/Weka interface

tree Another regression and classification tree

evtree Global optimization

mvpart Multivariate dependent variable tree

partykit A general tree infrastructure

Page 29

rpart

> library(rpart)

> op <- par(mfrow = c(1,2)) # print two plots on one screen

> ## run rpart model

> rp1 <- rpart(survived ~ sex + age + pclass, data = Titanic)

> ## simple plot. branch proportional to error in the fit.

> plot(rp1, main = "Simple Display") # simple on the left

> text(rp1) # add text label

> ## Fancier plot. equal branch spacing.

> plot(rp1, branch = 0.5, uniform = TRUE, main = "Pretty Display")

> text(rp1, pretty = 0, fancy = TRUE, use.n=TRUE, all = TRUE)

> par(op) # reset par

Page 30

rpart

Figure: the same rpart tree drawn two ways. Left, "Simple Display": abbreviated split labels (sex=b; age>=9.5; pclass=c; age>=1.5) with leaf classes only. Right, "Pretty Display": split labels spelled out (sex=male vs female; age>=9.5 vs < 9.5; pclass=3 vs 1,2; age>=1.5 vs < 1.5) with No/yes counts at every node - root No 619/427; male No 523/135 (age>=9.5: No 505/110; age<9.5: yes 18/25, then pclass=3 No 18/11, pclass=1,2 yes 0/14); female yes 96/292 (pclass=3 No 80/72, then age>=1.5 No 79/66, age<1.5 yes 1/6; pclass=1,2 yes 16/220).

Page 31

rpart

> print(rp1) # print rpart object

n= 1046

node), split, n, loss, yval, (yprob)

* denotes terminal node

1) root 1046 427 No (0.59177820 0.40822180)

2) sex=male 658 135 No (0.79483283 0.20516717)

4) age>=9.5 615 110 No (0.82113821 0.17886179) *

5) age< 9.5 43 18 yes (0.41860465 0.58139535)

10) pclass=3 29 11 No (0.62068966 0.37931034) *

11) pclass=1,2 14 0 yes (0.00000000 1.00000000) *

3) sex=female 388 96 yes (0.24742268 0.75257732)

6) pclass=3 152 72 No (0.52631579 0.47368421)

12) age>=1.5 145 66 No (0.54482759 0.45517241) *

13) age< 1.5 7 1 yes (0.14285714 0.85714286) *

7) pclass=1,2 236 16 yes (0.06779661 0.93220339) *

Page 32

rpart

> path.rpart(rp1, node=c(4, 7)) ## show the split paths to nodes 4 and 7

node number: 4

root

sex=male

age>=9.5

node number: 7

root

sex=female

pclass=1,2

Page 33

rpart

> head(predict(rp1)) ## predicted probability of survival

No yes

1 0.06779661 0.9322034

2 0.00000000 1.0000000

3 0.06779661 0.9322034

4 0.82113821 0.1788618

5 0.06779661 0.9322034

6 0.82113821 0.1788618

> ## actual vs. predicted probability in one data frame

> tmp <- cbind(actual=as.numeric(Titanic$survived)-1, pred=predict(rp1)[, 2])

> cor(tmp) ## correlation

actual pred

actual 1.000000 0.640939

pred 0.640939 1.000000

> aggregate(tmp, by=list(Titanic$survived), mean) # compare by survival

Group.1 actual pred

1 No 0 0.2405231

2 yes 1 0.6513260

> aggregate(tmp, by=list(Titanic$sex), mean) # compare by gender

Page 34

rpart

Group.1 actual pred

1 female 0.7525773 0.7525773

2 male 0.2051672 0.2051672

> aggregate(tmp, by=list(Titanic$pclass), mean) #compare by pclass

Group.1 actual pred

1 1 0.6373239 0.5403331

2 2 0.4406130 0.5107649

3 3 0.2614770 0.2799117

Page 35

rpart

> # use all predictors with control change

> rp1 <- rpart(survived ~ . , control = rpart.control(minsplit=30,

+ minbucket=15, cp=0.012), data = Titanic)

> plot(rp1, branch=0.3, uniform=TRUE, margin = 0.1,

+ main = "minsplit=30, minbucket=15, cp=0.012")

> text(rp1, pretty=0, fancy=TRUE, use.n=TRUE, all = TRUE)

Page 36

rpart

minsplit=30, minbucket=15, cp=0.012

Figure: rpart tree under the modified controls. Root No 619/427 splits on sex. Male (No 523/135): age>=9.5 → No 505/110, age<9.5 → yes 18/25. Female (yes 96/292): pclass=3 (No 80/72) splits on age: >=27.5 → No 30/16, <27.5 → yes 50/56; pclass=1,2 → yes 16/220.

Page 37

rpart

> ## Pruning tree by a high CP

> rp1 <- prune(rp1, cp=0.018)

> plot(rp1, branch=0.3, uniform=TRUE, margin = 0.2,

+ main = "After Pruning by cp = 0.018")

> text(rp1, pretty=0, fancy=TRUE, use.n=TRUE, all = TRUE)

Page 38

rpart

After Pruning by cp = 0.018

Figure: pruned rpart tree. Root (No 619/427) splits on sex: male → No 523/135 (leaf); female (yes 96/292) splits on pclass: 3 → No 80/72, {1, 2} → yes 16/220.

Page 39

rpart

Comments on rpart

A tree model is an object, so everything is accessible

Finer tree control can be exercised through function parameters

rpart function: method, model, parms, control

rpart.control function controls:

Minimum node size for attempting a further split
Minimum number of records in a node
Complexity parameter (CP)
Depth of the tree
...

More CP related functions for tree size determination

Page 40

party package

party package Overview

Use conditional inference

A framework for general tree models

Powerful and flexible tree graphics

Many types of dependent variables:

nominal
ordinal
numeric
censored
multivariate

Covariates on arbitrary measurement scales

Page 41

party package

Key tree modeling functions in party

ctree Conditional Inference Tree

mob Model-based Recursive Partitioning

cforest Random forest

Page 42

party package

Edgar Anderson’s Iris Data (in R)

Data

Species Factor of 3 classes: setosa, versicolor, virginica

Sepal.Length continuous

Sepal.Width continuous

Petal.Length continuous

Petal.Width continuous

Question

Use a tree model to predict Species from the 4 measurements

Page 43

party package

> str(iris) ## show object structure

'data.frame': 150 obs. of 5 variables:

$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

> levels(iris$Species) <- c("setos", "versi", "virgi")

> ### classification

> iris.ct <- ctree(Species ~ .,data = iris)

> iris.ct

Conditional inference tree with 4 terminal nodes

Response: Species

Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width

Number of observations: 150

1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264

2)* weights = 50

1) Petal.Length > 1.9

3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894

Page 44

party package

4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865

5)* weights = 46

4) Petal.Length > 4.8

6)* weights = 8

3) Petal.Width > 1.7

7)* weights = 46

> table(predict(iris.ct), iris$Species)

setos versi virgi

setos 50 0 0

versi 0 49 5

virgi 0 1 45

> plot(iris.ct, main = "Predict iris Species by ctree")

Page 45

party package

Predict iris Species by ctree

Figure: ctree for iris. Root split on Petal.Length: ≤ 1.9 → Node 2 (n = 50, all setosa); > 1.9 → split on Petal.Width: ≤ 1.7 → split on Petal.Length: ≤ 4.8 → Node 5 (n = 46), > 4.8 → Node 6 (n = 8); > 1.7 → Node 7 (n = 46). Terminal nodes show the class proportions (setos/versi/virgi).

> ### survival analysis

Page 46

party package

> data("GBSG2", package = "ipred")

> ct1 <- ctree(Surv(time, cens) ~ .,data = GBSG2)

> plot(ct1)

Page 47

party package

Figure: survival tree for the GBSG2 breast cancer data. Root split on pnodes (p < 0.001): ≤ 3 → split on horTh (p = 0.035): no → Node 3 (n = 248), yes → Node 4 (n = 128); > 3 → split on progrec (p < 0.001): ≤ 20 → Node 6 (n = 144), > 20 → Node 7 (n = 166). Terminal nodes show estimated survival curves over 0-2500 days.

Page 48

party package

> ex1<-ctree(Ozone ~ ., data=air0, controls=ctree_control(

+ maxdepth=3, mincriterion=0.95, minbucket=20))

> plot(ex1, inner_panel = node_inner(ex1, fill = "pink2"),

+ terminal_panel = node_hist(ex1, ymax = 0.07,

+ xscale = c(0, 200), fill = "cyan"),

+ main="NYC Air Quality - Different Tree Display")

Page 49

party package

NYC Air Quality - Different Tree Display

Figure: the same Ozone tree as before (Temp ≤ 82 → Wind ≤ 9.2 → Node 3 (n = 24); Wind > 9.2 → Temp ≤ 75 → Node 5 (n = 32) vs Node 6 (n = 21); Temp > 82 → Node 7 (n = 34)), drawn with pink inner nodes and cyan terminal histograms of Ozone on a 0-200 scale.

Page 50

party package

mob - Model-based Recursive Partitioning

Typical tree algorithms partition data on differences in the dependent variable

mob partitions data on differences in the fitted model

It relies on tests of parameter instability

The outcome is still a tree, whose nodes display different model patterns


> ## recursive partitioning of a logistic regression model

> ## load data

> data("PimaIndiansDiabetes", package = "mlbench")

> ## partition logistic regression diabetes ~ glucose

> ## with respect to all remaining variables

> fmPID <- mob(diabetes ~ glucose | pregnant + pressure + triceps +

+ insulin + mass + pedigree + age,

+ data = PimaIndiansDiabetes, model = glinearModel,

+ family = binomial())

> ## fitted model

> coef(fmPID)

(Intercept) glucose

2 -9.951510 0.05870786

4 -6.705586 0.04683748

5 -2.770954 0.02353582

> plot(fmPID, main = "Pima Indians Diabetic Data (n = 768)")


Pima Indians Diabetic Data (n = 768)

[mob plot: node 1 splits on mass (p < 0.001) at ≤ 26.3 vs > 26.3; node 3 splits on age (p < 0.001) at ≤ 30 vs > 30; terminal panels for node 2 (n = 167), node 4 (n = 304), and node 5 (n = 297) each plot the fitted probability of diabetes (neg/pos) against glucose]


RWeka package

RWeka package Overview

Weka (http://www.cs.waikato.ac.nz/ml/weka/) offers a collection of machine learning algorithms for data mining

Weka is written in Java

Tree learners offered by Weka: C4.5, Naive Bayes trees, M5, logistic model tree

The RWeka package creates an R interface to Weka


> library(RWeka)

> w1 <- J48(survived ~ ., data=Titanic,

+ control = Weka_control(R = TRUE, B= TRUE))

> plot(w1, main="Tree by Weka J48 Model - Titanic Data")

> w1 ## print J48 model

J48 pruned tree

------------------

sex = female: yes (257.0/64.0)

sex != female

| age <= 9.0

| | pclass = 3: No (21.0/10.0)

| | pclass != 3: yes (9.0)

| age > 9.0: No (411.0/73.0)

Number of Leaves : 4

Size of the tree : 7

> summary(w1) ## Summary of J48 model

=== Summary ===


Correctly Classified Instances 829 79.2543 %

Incorrectly Classified Instances 217 20.7457 %

Kappa statistic 0.5667

Mean absolute error 0.3244

Root mean squared error 0.4028

Relative absolute error 67.1335 %

Root relative squared error 81.9435 %

Coverage of cases (0.95 level) 100 %

Mean rel. region size (0.95 level) 99.3308 %

Total Number of Instances 1046

=== Confusion Matrix ===

a b <-- classified as

523 96 | a = No

121 306 | b = yes


Tree by Weka J48 Model − Titanic Data

[J48 tree plot: sex = female: yes (257.0/64.0); sex = male and age ≤ 9: pclass = 3: No (21.0/10.0), pclass ∈ {1, 2}: yes (9.0); sex = male and age > 9: No (411.0/73.0)]


evtree package

evtree package Overview

Global optimization

Evolutionary algorithm

Classification and regression trees

Uses partykit for the tree structure

Computationally demanding


> library(evtree)

> iris.ev <- evtree(Species ~ ., data=iris) ## evtree

> iris.ev

Model formula:

Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width

Fitted party:

[1] root

| [2] Petal.Width < 1: setosa (n = 50, err = 0.0%)

| [3] Petal.Width >= 1

| | [4] Petal.Length < 5

| | | [5] Petal.Width < 1.7: versicolor (n = 47, err = 0.0%)

| | | [6] Petal.Width >= 1.7: virginica (n = 7, err = 14.3%)

| | [7] Petal.Length >= 5: virginica (n = 46, err = 4.3%)

Number of inner nodes: 3

Number of terminal nodes: 4

> plot(iris.ev, main = "Iris data using evtree")


Iris data using evtree

[evtree plot: node 1 splits on Petal.Width at < 1 vs >= 1; node 3 splits on Petal.Length at < 5 vs >= 5; node 4 splits on Petal.Width at < 1.7 vs >= 1.7; terminal panels show Species bar charts for node 2 (n = 50), node 5 (n = 47), node 6 (n = 7), and node 7 (n = 46)]


mvpart package

> ##~ Use the mvpart function to fit a multiple response model

> ##~ Automobile Data from 'Consumer Reports' 1990 (n = 49 cars)

>

> library(mvpart)

> ## Data set up

> data(car.test.frame) ## Consumer Reports car data in mvpart package

> car <- na.omit(car.test.frame) # use a short name

> head(car) # display a few records

Price Country Reliability Mileage Type Weight Disp. HP

Eagle Summit 4 8895 USA 4 33 Small 2560 97 113

Ford Escort 4 7402 USA 2 33 Small 2345 114 90

Ford Festiva 4 6319 Korea 4 37 Small 1845 81 63

Honda Civic 4 6635 Japan/USA 5 32 Small 2260 91 92

Mazda Protege 4 6599 Japan 5 32 Small 2440 113 103

Mercury Tracer 4 8672 Mexico 4 26 Small 2285 97 82

> car <- cbind(as.data.frame(scale(car[, c(1,3:4,6:7)])),

+ car[, c(2,5)]) # rescale the 5 response variables

> # fit and display a tree using "mvpart"

> car.mv <- mvpart(data.matrix(car[, 1:5]) ~ Country + Type,

+ data = car, uniform = TRUE, prn = TRUE, all.leaves = TRUE)

rpart(formula = form, data = data)


Variables actually used in tree construction:

[1] Country Type

Root node error: 240/49 = 4.898

n= 49

CP nsplit rel error xerror xstd

1 0.356769 0 1.00000 1.04325 0.119931

2 0.142423 1 0.64323 0.73462 0.099250

3 0.083626 2 0.50081 0.59171 0.072715

4 0.068475 3 0.41718 0.57669 0.073553

5 0.024982 4 0.34871 0.49260 0.065112

> # PCA biplot of 5 group means (leaves)

> rpart.pca(car.mv, wgt.ave = FALSE)


[mvpart tree plot: root (deviance 251, n = 49) first splits on Type = Small, with further splits on Country (e.g. USA vs Japan/Japan-USA/Sweden) and Type; each node is annotated with its deviance and size; responses are Price, Reliability, Mileage, Weight, Disp.; Error: 0.349, CV Error: 0.521, SE: 0.0694]


[PCA biplot of the leaf (group) means: Dim 1 explains 77.99 % and Dim 2 19.88 % of the variance; arrows show the five responses Price, Reliability, Mileage, Weight, and Disp.]


partykit package

partykit - a toolkit for tree infrastructure in R

Represent tree models (objects)

Summarize results

Visualize tree structures

Read/coerce tree models from other sources (rpart, RWeka, PMML)

Offer standard methods for tree manipulation (print, plot, predict, ...)
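As a sketch of the coercion workflow (assuming the rpart and partykit packages are installed), an rpart fit can be converted to partykit's unified representation and then handled with its standard methods:

```r
library(rpart)     # CART implementation
library(partykit)  # unified tree infrastructure

fit  <- rpart(Species ~ ., data = iris)  # grow an rpart tree
pfit <- as.party(fit)                    # coerce to a 'party' object
print(pfit)                              # partykit's standard print method
# plot(pfit)                             # and its standard plot method
```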


Overview

Data import

foreign R package that reads data stored by Minitab, S, SAS, SPSS, Stata, Systat, dBase, ...

ASCII file read.table statement

manual R official manual R Data Import/Export (http://cran.wustl.edu/doc/manuals/R-data.pdf)

database See the Relational databases section of the manual above

Excel See the Reading Excel spreadsheets section of the manual above
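A minimal read.table sketch; the CSV file here is a temporary demo file created only for illustration:

```r
tmp <- tempfile(fileext = ".csv")            # demo file, for illustration only
writeLines(c("x,y", "1,a", "2,b"), tmp)
ds <- read.table(tmp, header = TRUE, sep = ",",
                 na.strings = c("", "NA"))   # read the delimited ASCII file
ds                                           # a 2-row, 2-column data frame
```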


Data inspection

Common R statements

summary Summary statistics

table Frequency or crosstab

hist Histogram

str Display object structure

head Display a few records

dsn[i:j, m:n] Display rows i-j and columns m-n

describe A function in package Hmisc

. . .
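A quick sketch of these inspection statements, applied to the built-in airquality data (the NYC air quality data used earlier):

```r
data(airquality)              # NYC air quality data shipped with R
summary(airquality$Ozone)     # summary statistics (reports NA count)
table(airquality$Month)       # frequency table of Month
str(airquality)               # structure: types and first values
head(airquality, 3)           # display the first few records
airquality[1:5, 2:4]          # rows 1-5, columns 2-4
```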


Data manipulation

Statements for recoding

ifelse Conditional statement

as.factor Coerce to a (nominal) factor data type

as.ordered Coerce to an ordinal factor data type

cut Cut numerical data into categorical variables (factor)

as.data.frame Coerce into a data frame

apply By rows or columns: apply, lapply, sapply, by, aggregate, ...

. . .
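A minimal sketch of these recoding statements; the age vector is hypothetical illustration data:

```r
age <- c(5, 12, 34, 67, 80)                     # hypothetical values
grp <- as.factor(ifelse(age <= 12,              # conditional recode,
                        "child", "adult"))      # then coerce to a factor
band <- cut(age, breaks = c(0, 18, 65, Inf),    # cut numeric into a factor
            labels = c("young", "middle", "senior"))
m <- matrix(1:6, nrow = 2)
apply(m, 2, sum)                                # column sums: 3 7 11
```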


Data manipulation

Statements for data subsetting

dsn[1:n, ] Row indexing

dsn[, 10:6 ] Column indexing

subset Select records (rows) and columns

. . .
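A minimal sketch of these subsetting statements on the built-in airquality data; the object names are arbitrary:

```r
data(airquality)
first10 <- airquality[1:10, ]                # row indexing
temps   <- airquality[, c("Temp", "Month")]  # column indexing by name
hot     <- subset(airquality, Temp > 90,     # select rows by condition,
                  select = c(Ozone, Temp))   # columns via 'select'
```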


Example

> ## ----- Example: Titanic Passenger Data

> Titanic <- read.excel("C:\\Project\\MyTree\\titanic3.xlsx", "titanic3")

> str(Titanic)

'data.frame': 1310 obs. of 14 variables:

$ pclass : num 1 1 1 1 1 1 1 1 1 1 ...

$ survived : num 1 1 0 0 0 1 1 0 1 0 ...

$ name : chr "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mr. Hudson Joshua Creighton" ...

$ sex : chr "female" "male" "female" "male" ...

$ age : num 29 0.917 2 30 25 ...

$ sibsp : num 0 1 1 1 1 0 1 0 2 0 ...

$ parch : num 0 2 2 2 2 0 0 0 0 0 ...

$ ticket : chr "24160" "113781" "113781" "113781" ...

$ fare : num 211 152 152 152 152 ...

$ cabin : chr "B5" "C22 C26" "C22 C26" "C22 C26" ...

$ embarked : chr "S" "S" "S" "S" ...

$ boat : chr "2" "11" NA NA ...

$ body : chr NA NA NA "135" ...

$ home.dest: chr "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...

> ds <- subset(Titanic, select = c(survived, sex, age, pclass)) ## select variables

> sort(colSums(is.na(ds))) ## check missing

survived sex pclass age

1 1 1 264


> ds <- na.omit(ds) # use only complete records

> str(ds)

'data.frame': 1046 obs. of 4 variables:

$ survived: num 1 1 0 0 0 1 1 0 1 0 ...

$ sex : chr "female" "male" "female" "male" ...

$ age : num 29 0.917 2 30 25 ...

$ pclass : num 1 1 1 1 1 1 1 1 1 1 ...

- attr(*, "na.action")=Class 'omit' Named int [1:264] 16 38 41 47 60 70 71 75 81 107 ...

.. ..- attr(*, "names")= chr [1:264] "16" "38" "41" "47" ...

> # change to factor

> ds$survived <- factor(ds$survived, labels = c("No", "yes"))

> ds$sex <- as.factor(ds$sex)

> ds$pclass <- as.factor(ds$pclass)

> # run ctree

> # cf <- ctree(survived ~ ., data = ds, controls =

> # ctree_control(maxdepth = 3, mincriterion = 0.95, minbucket = 20))

>

> #plot(cf, main = "Titanic Passengers (n = 1046)")

>

> Titanic <- ds

> save(Titanic, file = "C:\\Project\\MyTree\\Titanic.RData") ## save R data

> #load("C:\\Project\\MyTree\\Titanic.RData") ## load data next time


Advantages

Requires little statistics

Easy data requirements (even allows missing data!)

Captures nonlinear relationships

Accommodates interactions

Runs very fast

Good interpretability and visualization

A convenient method for data segmentation


Limitations

Less stable (or less reproducible)

Uses only a limited number of variables

Lacks parametric information (such as variable importance)

Typically requires a relatively large sample size

Not a good way to identify variable importance


Thank you for attending!
