
  • Classification and Regression Trees (CART)

    Theory and Applications

    A Master Thesis Presented

    by

    Roman Timofeev

    (188778)

    to

    Prof. Dr. Wolfgang Härdle

    CASE - Center of Applied Statistics and Economics

    Humboldt University, Berlin

    in partial fulfillment of the requirements

    for the degree of

    Master of Arts

    Berlin, December 20, 2004

  • Declaration of Authorship

    I hereby confirm that I have authored this master thesis independently and without the use of resources other than those indicated. All passages which are taken literally or in essence from publications or other resources are marked as such.

    Roman Timofeev

    Berlin, January 27, 2005

  • Abstract

    This master thesis is devoted to Classification and Regression Trees (CART). CART is a classification method which uses historical data to construct decision trees. Depending on the available information about the dataset, a classification tree or a regression tree can be constructed. The constructed tree can then be used to classify new observations.

    The first part of the thesis describes the fundamental principles of tree construction, different splitting algorithms and pruning procedures. The second part answers the questions of why we should or should not use the CART method. Advantages and weaknesses of the method are discussed and tested in detail.

    In the last part, CART is applied to real data using the statistical software XploRe. Here different statistical macros (quantlets), graphical and plotting tools are presented.

    Keywords: CART, Classification method, Classification tree, Regression tree, Statistical Software

  • Contents

    1 Introduction
    2 Construction of Maximum Tree
      2.1 Classification tree
        2.1.1 Gini splitting rule
        2.1.2 Twoing splitting rule
      2.2 Regression tree
    3 Choice of the Right Size Tree
      3.1 Optimization by minimum number of points
      3.2 Cross-validation
    4 Classification of New Data
    5 Advantages and Disadvantages of CART
      5.1 CART as a classification method
      5.2 CART in financial sector
      5.3 Disadvantages of CART
    6 Examples
      6.1 Simulated example
      6.2 Boston housing example
      6.3 Bankruptcy data example

  • List of Figures

    1.1 Classification tree of San Diego Medical Center patients
    2.1 Splitting algorithm of CART
    2.2 Maximum classification tree for bankruptcy dataset, constructed using Gini splitting rule
    2.3 Maximum classification tree for bankruptcy dataset, constructed using Twoing splitting rule
    3.1 Classification tree for bankruptcy dataset, parameter Nmin equal to 15; tree impurity 0.20238; 9 terminal nodes
    3.2 Classification tree for bankruptcy dataset, parameter Nmin equal to 30; tree impurity 0.21429; 6 terminal nodes
    5.1 Classification tree for bankruptcy dataset after including a third random variable
    5.2 Classification tree for bankruptcy dataset after monotone transformation of the first variable log(x1 + 100)
    5.3 Classification tree for bankruptcy dataset after adding two outliers
    5.4 Classification tree constructed on 90% of the bankruptcy dataset
    5.5 Dataset of three classes with linear structure
    5.6 Dataset of three classes with non-linear structure
    6.1 Simulated uniform data with 3 classes
    6.2 Classification tree for generated data of 500 observations
    6.3 Simulated uniform data with three classes and overlapping
    6.4 Maximum classification tree for simulated two-dimensional data with overlapping
    6.5 Maximum tree for simulated two-dimensional data with overlapping and NumberOfPoints option
    6.6 PDF tree generated via cartdrawpdfclass command
    6.7 Regression tree for 100 observations of Boston housing dataset, minimum number of observations Nmin set to 10
    6.8 Regression tree for 100 observations of Boston housing dataset, Nmin parameter set to 30
    6.9 Classification tree of bankruptcy dataset, minimum number of observations Nmin set to 30
    6.10 Classification tree of bankruptcy dataset, minimum number of observations Nmin set to 10
    6.11 Distribution of classification ratio by Nmin parameter
    6.12 Classification tree of bankruptcy dataset, minimum number of observations Nmin set to 40

  • Notation

    X         [N × M] matrix of variables in learning sample
    Y         [N × 1] vector of classes/response values in learning sample
    N         number of observations in learning sample
    M         number of variables in learning sample
    t         index of node
    t_p       parent node
    t_l       left child node
    t_r       right child node
    P_l       probability of left node
    P_r       probability of right node
    i(t)      impurity function
    i(t_p)    impurity value for parent node
    i(t_c)    impurity value for child nodes
    i(t_l)    impurity value for left child node
    i(t_r)    impurity value for right child node
    Δi(t)     change of impurity
    K         number of classes
    k         index of class
    p(k|t)    conditional probability of class k provided we are in node t
    T         decision tree
    R(T)      misclassification error of tree T
    T̃         number of terminal nodes in tree T
    α(T̃)      complexity measure which depends on the number of terminal nodes T̃
    x_j^R     best splitting value of variable x_j
    Nmin      minimum number of observations parameter, which is used for pruning

  • 1 Introduction

    Classification and Regression Trees (CART) is a classification method which uses historical data to construct so-called decision trees. Decision trees are then used to classify new data. In order to use CART, we need to know the number of classes a priori.

    The CART methodology was developed in the 1980s by Breiman, Friedman, Olshen and Stone in their monograph Classification and Regression Trees (1984). For building decision trees, CART uses a so-called learning sample - a set of historical data with pre-assigned classes for all observations. For example, a learning sample for a credit scoring system would be fundamental information about previous borrowers (variables) matched with their actual payoff results (classes).
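
    To make the structure of such a learning sample concrete, here is a small sketch in plain Python (illustrative only, with hypothetical variable names; the thesis itself works with XploRe). It builds a toy credit-scoring sample in the notation introduced above: a variable matrix X of size N × M and a class vector Y of length N.

        import numpy as np

        # Toy learning sample for a credit-scoring tree (hypothetical variables):
        # each row of X describes one past borrower, Y holds the observed outcome.
        X = np.array([
            [35, 42000, 2],      # age, income, number of open loans
            [58, 31000, 5],
            [23, 18000, 1],
            [47, 67000, 0],
        ])
        Y = np.array(["paid", "default", "default", "paid"])   # pre-assigned classes

        N, M = X.shape                # N observations, M variables
        K = len(np.unique(Y))         # number of classes, here K = 2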

    Decision trees are represented by a set of questions which split the learning sample into smaller and smaller parts. CART asks only yes/no questions. A possible question could be: "Is age greater than 50?" or "Is sex male?". The CART algorithm searches through all possible variables and all possible values in order to find the best split - the question that splits the data into two parts with maximum homogeneity. The process is then repeated for each of the resulting data fragments. Here is an example of a simple classification tree used by the San Diego Medical Center to classify their patients into different levels of risk:

    [Decision tree diagram: the root asks "Is the systolic blood pressure > 91?"; subsequent splits ask "Is age > 62.5?" and "Is sinus tachycardia present?"; terminal nodes are labelled High Risk or Low Risk.]

    Figure 1.1: Classification tree of San Diego Medical Center patients.

    In practice there can be much more complicated decision trees, which may include dozens of levels and hundreds of variables. As can be seen from Figure 1.1, CART can easily handle both numerical and categorical variables. Among other advantages of the CART method is its robustness to outliers: usually the splitting algorithm will isolate outliers in an individual node or nodes. An important practical property of CART is that the structure of its classification or regression trees is invariant with respect to monotone transformations of the independent variables. One can replace any variable with its logarithm or square root; the structure of the tree will not change.
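
    The invariance to monotone transformations can be checked numerically. The sketch below uses scikit-learn's CART-style decision tree merely as a stand-in (the thesis itself uses XploRe): it fits one tree on a variable x1 and another on log(x1 + 100) and verifies that both trees assign every observation to the same leaf, i.e. the induced partition is identical even though the split thresholds differ.

        import numpy as np
        from sklearn.tree import DecisionTreeClassifier

        rng = np.random.default_rng(0)
        x1 = rng.uniform(-50, 50, size=200)
        x2 = rng.uniform(0, 10, size=200)
        y = (x1 > 5).astype(int)                           # class depends on a threshold in x1

        X_raw = np.column_stack([x1, x2])
        X_log = np.column_stack([np.log(x1 + 100), x2])    # monotone transform of x1

        tree_raw = DecisionTreeClassifier(random_state=0).fit(X_raw, y)
        tree_log = DecisionTreeClassifier(random_state=0).fit(X_log, y)

        # The split thresholds differ (about 5 vs. log(105)), but the resulting
        # partition of the observations is the same:
        print(np.array_equal(tree_raw.apply(X_raw), tree_log.apply(X_log)))   # True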

    The CART methodology consists of three parts:

    1. Construction of maximum tree

    2. Choice of the right tree size

    3. Classification of new data using the constructed tree

  • 2 Construction of Maximum Tree

    This part is the most time-consuming. Building the maximum tree means splitting the learning sample down to the last observations, i.e. until the terminal nodes contain observations of only one class. The splitting algorithms are different for classification and regression trees. Let us first consider the construction of classification trees.

    2.1 Classification tree

    Classification trees are used when for each observation of the learning sample we know the class in advance. Classes in the learning sample may be provided by the user or calculated in accordance with some exogenous rule. For example, in a stock-trading project the class can be computed from the realized change of the asset price.
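
    As an illustration of such an exogenous rule (a hypothetical one, not taken from the thesis), the following Python snippet derives a class for each observation from the realized one-period price change of an asset:

        import numpy as np

        prices = np.array([100.0, 101.5, 101.2, 103.0, 102.1])
        returns = np.diff(prices) / prices[:-1]            # realized one-period returns

        # e.g. label an observation "up" if the asset gained more than 0.5%, else "down"
        classes = np.where(returns > 0.005, "up", "down")
        print(classes)                                     # ['up' 'down' 'up' 'down']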

    Let t_p be a parent node and t_l, t_r the left and right child nodes of the parent node t_p. Consider a learning sample with variable matrix X containing M variables x_j and N observations, and let the class vector Y consist of N observations with a total of K classes.

    A classification tree is built in accordance with a splitting rule - the rule that performs the splitting of the learning sample into smaller parts. We already know that at each step the data have to be divided into two parts with maximum homogeneity:

    [Diagram: a parent node t_p is split by the question x_j ≤ x_j^R into a left child node t_l, reached with probability P_l, and a right child node t_r, reached with probability P_r.]

    Figure 2.1: Splitting algorithm of CART

    where t_p, t_l, t_r are the parent, left and right nodes; x_j is variable j; and x_j^R is the best splitting value of variable x_j.

    Maximum homogeneity of the child nodes is defined by the so-called impurity function i(t). Since the impurity of the parent node t_p is constant for any of the possible splits x_j ≤ x_j^R, j = 1, ..., M, the maximum homogeneity of the left and right child nodes is equivalent to the maximization of the change of the impurity function Δi(t):

    Δi(t) = i(t_p) - E[i(t_c)]

    where t_c denotes the left and right child nodes of the parent node t_p. Denoting by P_l and P_r the probabilities of the left and right nodes, we get:

    Δi(t) = i(t_p) - P_l i(t_l) - P_r i(t_r)

    Therefore, at each node CART solves the following maximization problem:

    argmax_{x_j ≤ x_j^R, j = 1, ..., M}  [ i(t_p) - P_l i(t_l) - P_r i(t_r) ]        (2.1)

    Equation 2.1 implies that CART will search through all possible values of all variables in matrix X for the best split question x_j ≤ x_j^R which maximizes the change of the impurity measure Δi(t).
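
    A minimal Python sketch of this exhaustive search is given below (an illustration only, not the XploRe quantlets used later in the thesis). It uses the Gini index of Section 2.1.1 as the impurity function i(t) and returns the variable index, the splitting value x_j^R and the corresponding change of impurity Δi(t).

        import numpy as np

        def gini(y):
            # Impurity i(t) = 1 - sum_k p(k|t)^2 (the Gini index of Section 2.1.1).
            _, counts = np.unique(y, return_counts=True)
            p = counts / counts.sum()
            return 1.0 - np.sum(p ** 2)

        def best_split(X, y):
            # Exhaustive search over all variables j and all observed values x_j^R,
            # maximizing Delta i(t) = i(t_p) - P_l i(t_l) - P_r i(t_r)   (equation 2.1).
            n, m = X.shape
            parent_impurity = gini(y)
            best = (None, None, -np.inf)           # (variable index, split value, Delta i)
            for j in range(m):
                for value in np.unique(X[:, j]):
                    left = X[:, j] <= value
                    right = ~left
                    if left.all() or right.all():  # skip splits leaving one node empty
                        continue
                    p_l, p_r = left.mean(), right.mean()
                    delta = parent_impurity - p_l * gini(y[left]) - p_r * gini(y[right])
                    if delta > best[2]:
                        best = (j, value, delta)
            return best

    For a learning sample (X, Y), the call best_split(X, Y) thus answers the best yes/no question x_j ≤ x_j^R of equation 2.1 for a single node; growing the maximum tree repeats this search recursively in each resulting child node.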

    The next important question is how to define the impurity function i(t). In theory there are several impurity functions, but only two of them are widely used in practice: the Gini splitting rule and the Twoing splitting rule.

    2.1.1 Gini splitting rule

    The Gini splitting rule (or Gini index) is the most broadly used rule. It uses the following impurity function i(t):

    i(t) = Σ_{k ≠ l} p(k|t) p(l|t)        (2.2)

    where k, l ∈ {1, ..., K} are class indices and p(k|t) is the conditional probability of class k provided we are in node t. For example, a node containing two classes in equal proportions has i(t) = 0.5 · 0.5 + 0.5 · 0.5 = 0.5, whereas a pure node has i(t) = 0.

    Applying the Gini impurity function 2.2 to the maximization problem 2.1, we get the following change of impurity measure Δi(t):

    Δi(t) = - Σ_{k=1}^{K} p^2(k|t_p) + P_l Σ_{k=1}^{K} p^2(k|t_l) + P_r Σ_{k=1}^{K} p^2(k|t_r)
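
    This expression follows from equation 2.1 because the Gini index 2.2 can equivalently be written as i(t) = 1 - Σ_k p^2(k|t) and P_l + P_r = 1, so the constant terms cancel. A small Python check of this equivalence on a made-up node (an illustration, not part of the thesis):

        import numpy as np

        def sum_sq_probs(y):
            # sum_k p(k|t)^2 over the observations y falling into node t
            _, counts = np.unique(y, return_counts=True)
            p = counts / counts.sum()
            return np.sum(p ** 2)

        gini = lambda y: 1.0 - sum_sq_probs(y)      # i(t) = 1 - sum_k p(k|t)^2

        # A made-up parent node and an arbitrary split into left/right children.
        y_parent = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
        left = np.array([True, True, True, True, True,
                         False, False, False, False, False])
        p_l, p_r = left.mean(), (~left).mean()

        # Equation 2.1 with the Gini index ...
        delta_21 = gini(y_parent) - p_l * gini(y_parent[left]) - p_r * gini(y_parent[~left])
        # ... equals the expression above, because P_l + P_r = 1:
        delta_23 = (-sum_sq_probs(y_parent)
                    + p_l * sum_sq_probs(y_parent[left])
                    + p_r * sum_sq_probs(y_parent[~left]))

        print(np.isclose(delta_21, delta_23))       # True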

    Therefore, the Gini algorithm solves the following problem:

    argmax_{x_j ≤ x_j^R, j = 1, ..., M}  [ - Σ_{k=1}^{K} p^2(k|t_p) + P_l Σ_{k=1}^{K} p^2(k|t_l) + P_r Σ_{k=1}^{K} p^2(k|t_r) ]        (2.3)

    The Gini algorithm will search the learning sample for the largest class and isolate it from the rest of the data. Gini works well for noisy data.

    2.1.2 Twoing splitting rule

    Unlike the Gini rule, Twoing will search for two classes that together make up more than 50% of the data. The Twoing splitting rule maximizes the following change-of-impurity measure:

    Δi(t) = (P_l P_r / 4) [ Σ_{k=1}^{K} |p(k|t_l) - p(k|t_r)| ]^2

    which implies the following maximization problem:

    argmax_{x_j ≤ x_j^R, j = 1, ..., M}  (P_l P_r / 4) [ Σ_{k=1}^{K} |p(k|t_l) - p(k|t_r)| ]^2        (2.4)

    Although the Twoing splitting rule allows us to build more balanced trees, this algorithm works more slowly than the Gini rule. For example, if the total number of classes is equal to K, then we will have 2^(K-1) possible splits.
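
    A minimal Python sketch of the Twoing criterion in equation 2.4 for a single candidate split (illustrative only, assuming the standard absolute-value form of Breiman et al. (1984)):

        import numpy as np

        def class_probs(y, classes):
            # p(k|t) for every class k over the observations y in node t
            return np.array([np.mean(y == k) for k in classes])

        def twoing_delta(y_parent, left):
            # Delta i(t) = (P_l * P_r / 4) * ( sum_k |p(k|t_l) - p(k|t_r)| )^2
            classes = np.unique(y_parent)
            p_l, p_r = left.mean(), (~left).mean()
            diff = np.abs(class_probs(y_parent[left], classes)
                          - class_probs(y_parent[~left], classes)).sum()
            return p_l * p_r / 4.0 * diff ** 2

        # Example: Twoing score of the split that groups classes {0, 1} against class {2}.
        y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
        print(twoing_delta(y, np.isin(y, [0, 1])))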

    Besides the mentioned Gini and Twoing splitting rules, there are several other methods. Among the most used are the Entropy rule, the χ² rule and the maximum deviation rule. But it has been proved¹ that the final tree is insensitive to the choice of splitting rule; it is the pruning procedure which is much more important.

    We can compare two trees built on the same dataset but using different splitting rules. With the help of the cartdrawpdfclass command in XploRe, one can construct a classification tree in PDF, specifying which splitting rule should be used (setting the third parameter to 0 for the Gini rule or to 1 for Twoing):

    cartdrawpdfclass(x, y, 0, 1)   - Gini splitting rule
    cartdrawpdfclass(x, y, 1, 1)   - Twoing splitting rule

    ¹ Breiman, Leo; Friedman, Jerome; Olshen, Richard; Stone, Charles (1984). Classification and Regression Trees, Chapman & Hall, page 95
