cart classification and regression trees experienced user guide

CART Modeling Strategies For Experienced Data Analysts

• CART takes a significant step towards automated data analysis
  – One of CART’s predecessors was called Automatic Interaction Detector (AID)
• Nevertheless, high-quality CART results require careful planning and expert guidance
• There is no realistic prospect that CART analyses, or any other sophisticated modeling, can be automated in the near term


All data analysis, regardless of methods employed, has certain prerequisites

• Complete understanding of the data available
  – Correct variable definitions
  – Sample sources and relationship to study population
  – Review of conventional summary statistics, percentiles
  – Standard reports that would be generated in the process of data integrity checks
  – Calculations verified: check that totals can be generated from components
  – Consistency checks: related fields do not conflict


Careful data preparation

• CART is far better suited to dirty data analysis than conventional statistical modeling or neural network (NN) tools
  – capable of dealing with missing values, outliers
• Nevertheless, there are considerable benefits to proper data preparation
  – the better the data the better a model can perform
• Includes
  – correct identification of missing value codes (e.g. is 998 a valid value or a missing-value code, like .)
  – uniform data handling when records come from different entities (branches, regions, behavioral groups)
  – if responder data is processed separately from and differently than non-responder data, completely erroneous results will be produced


Some core preparatory steps

• Identify illegal variables to be excluded from all models
  – ID variables
  – post-event variables
  – variables unlikely to be available in future, or against which CART model is intended to compete (e.g. bankruptcy scores)
  – variables disallowed by regulators (banking, insurance)
  – variables derived in part from dependent variables, or generated from target variable behavior
  – variables too closely connected to target for any reason


Exploratory Data Analysis with CART: Pre-modeling

• Run a single split tree and report all competitors
  – ranks ability of all variables to separate target variable into homogeneous groups
  – command settings
    ▪ LIMIT DEPTH=1
    ▪ ERROR EXPLORE
    ▪ BOPTIONS COMPETITORS=large number
• Run limited depth trees for target using one predictor at a time (again exploratory, non-tested trees)
  – LIMIT DEPTH=2 (up to 4 terminal nodes) or LIMIT DEPTH=3 (up to 8 terminal nodes); actual number depends on redundant node pruning
  – provides optimal binning of variables
  – binned versions could be used in parametric models


The CART Non-linear Correlation Matrix

• Run CART models using every pair of legal variables
  – should be unlimited depth
  – could be tested or exploratory
  – will detect non-linear dependencies
• Results will be asymmetric
  – results can be used to fill out a correlation matrix
• Alternate procedure
  – run simple regressions using all pairs of variables
  – use CART to predict residuals
  – correlation determined by both linear and CART components


Example Pearson and CART correlation Matrices

• From Kerry


CART Affiliation Matrices

• Select a group of interesting variables
• Let each variable in turn be the target variable, all others in group are predictors
• Grow standard trees (not depth limited) with test procedure to prune
• Each column in matrix is a target variable
• Rows are filled with importance scores (scaled to the range 0 to 1)
• Provides a picture of variable interdependencies
• Can highlight surprise relationships between predictors
  – can help in detecting data errors
  – when affiliations are stronger or weaker than expected


Detection of multivariate outliers

• Grow CART tree for every variable as predicted by a trimmed down variable list

• Predict each variable in turn from all other variables

• Restrict trees to moderate to large terminal nodes
  – use ATOM or MINCHILD controls

• For regression: measure deviation of each data point from predicted

• For classification: check if class value of data point is rare in predicted terminal node

• Use results to investigate unusual observations

• Procedure still needs more research


Once data QC is complete serious CART modeling can begin

• Need to understand nature of problem:
  – what would be the appropriate statistical models to use for problem at hand
  – e.g. is problem a simple binary outcome (respond or not to a direct mail piece)
  – alternatively, does it have an inherent time dimension (how long will customer remain customer, as in telecommunications churn)
    ▪ latter problem involves censored data
  – is study of a fundamentally time series or panel data type
  – then need to allow for lagged variables, etc.


CART cannot protect you from using an improper analysis strategy

• CART will help you execute your analysis strategy more quickly and often more accurately
• If the modeling strategy you have selected will produce biased results CART may just exacerbate the problem
• A definitive modeling approach is not required, but a defensible approach is


Example: Targeting model for a catalog to maximize profit

• Sensible to model in stages
  – 1) yes/no response model: use classification tree
  – 2) Dollar volume of order for those who do respond
    ▪ modeled conditional on response=yes
    ▪ modeled just on subset of responders
    ▪ regression tree plausible
    ▪ or classification tree on binned order amounts
  – Final model could be an expected profit model
    ▪ prob(respond)*Expected(Revenue | Respond)
    ▪ model could be all CART, all logit, or a mixture
    ▪ such models discussed later
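
The expected profit calculation on this slide can be sketched as follows. This is a minimal illustration, not a CART feature: the function name, the `cost_per_piece` parameter, and all customer figures are hypothetical, and the two inputs are assumed to come from a stage 1 response model and a stage 2 revenue model.

```python
def expected_profit(p_respond, revenue_if_respond, cost_per_piece=0.0):
    """prob(respond) * Expected(Revenue | Respond), less an assumed mailing cost."""
    return p_respond * revenue_if_respond - cost_per_piece


# Hypothetical scores: p_respond from a stage 1 classification tree,
# revenue_if_respond from a stage 2 model fit on responders only.
customers = [
    {"id": "A", "p_respond": 0.10, "revenue_if_respond": 80.0},
    {"id": "B", "p_respond": 0.02, "revenue_if_respond": 300.0},
    {"id": "C", "p_respond": 0.30, "revenue_if_respond": 10.0},
]

for c in customers:
    c["expected_profit"] = expected_profit(c["p_respond"], c["revenue_if_respond"])

# Rank customers so the catalog goes to the most profitable prospects first.
ranked_ids = [
    c["id"]
    for c in sorted(customers, key=lambda c: c["expected_profit"], reverse=True)
]
```

In this made-up data, customer C has the highest response probability but the lowest expected profit, which is the reason for modeling the two stages separately.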


Modeling strategy will also dictate test strategy

• Suppose we are tracking purchase behavior over time
• Data organized as one record per purchase opportunity
• The unit of observation will be a complete case history
  – ideally will want to assign some complete case histories to training data
  – other entire case histories to test data
  – important not to allow random assignment between train and test on a record by record basis
  – might want to hold back some records from longer case histories as an additional source of test data
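
One way to keep whole case histories together is to assign the partition from the case identifier rather than from the individual record. The sketch below is an assumption about how this might be done outside CART; the function name, the hashing approach, and the sample records are all hypothetical.

```python
import hashlib


def assign_partition(case_id, test_fraction=0.2):
    """Map a whole case history to 'learn' or 'test' from its ID alone."""
    digest = hashlib.sha256(str(case_id).encode("utf-8")).hexdigest()
    u = int(digest[:8], 16) / 0x100000000  # pseudo-uniform value in [0, 1)
    return "test" if u < test_fraction else "learn"


# Hypothetical data: one record per purchase opportunity.
records = [
    {"customer": "c1", "period": 1},
    {"customer": "c1", "period": 2},
    {"customer": "c1", "period": 3},
    {"customer": "c2", "period": 1},
    {"customer": "c2", "period": 2},
    {"customer": "c3", "period": 1},
]

for r in records:
    # Every record of a customer gets the same partition, so no case
    # history is ever divided between learn and test.
    r["partition"] = assign_partition(r["customer"])
```

Because the assignment depends only on the customer ID, it is also reproducible across runs and across files that share the same ID.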


Initial CART analyses are strictly exploratory

• Intended to reveal summary and descriptive information about the data
• Omnibus Model: dependent variable(s) fit to virtually all legal variables
  – Certain obvious exclusions necessary: ID numbers, clones and transforms of the dependent variable as discussed above
  – Omnibus Model reveals something about the predictability of the dependent variable
  – recall that largest tree has error no more than twice Bayes rate


Determine Splitting Rule to Use

• Gini, Twoing, power modified Twoing for classification
  – possibly ordered twoing

• Least squares (LS) or Least Absolute Deviation (LAD) for regression

• Best splitting rule can be selected very early in project and typically does not have to be revisited


Assess agreement among different test methods

• If data set is small cross validation is required
• In this case rerun trees several times with different starting random number seeds
  – use to assess stability of size and error rate of best trees
• With large data sets reassign cases between learn and test several times
  – initial check is on error rates and sizes of best trees


Run all as batch of startup CART trees

• Using three or four splitting rules, and three or four test sets, will give some initial feel for predictability of target variable
• Useful to develop some text processing scripts to extract the most interesting components of the classic CART reports
  – tree sequence
  – misclassification results (which classes are wrong)
  – prediction success table
  – importance rankings
    ▪ latter can be aggregated as follows:
    ▪ add up all importance scores for each variable across all trees
    ▪ rescale so that highest score is 100
• LOPTION NOPRINT gives summary tables only
  – no tree detail; very helpful when trees tend to be large
  – Navigator file still produced so tree can be seen in GUI
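
The importance aggregation described on this slide (sum each variable's scores across trees, then rescale so the top score is 100) can be sketched as below. The function name, variable names, and scores are hypothetical, and the scores are assumed to have already been extracted from the CART reports into dictionaries.

```python
from collections import defaultdict


def aggregate_importance(runs):
    """Sum importance scores per variable across runs; rescale so the max is 100."""
    totals = defaultdict(float)
    for scores in runs:
        for variable, score in scores.items():
            totals[variable] += score
    top = max(totals.values(), default=0.0)
    if top == 0.0:
        return {variable: 0.0 for variable in totals}
    return {variable: 100.0 * total / top for variable, total in totals.items()}


# Hypothetical importance rankings from three startup trees.
# A variable missing from a run simply contributes nothing for that run.
runs = [
    {"INCOME": 100.0, "AGE": 60.0, "REGION": 10.0},
    {"INCOME": 80.0, "AGE": 100.0},
    {"INCOME": 100.0, "AGE": 40.0, "REGION": 30.0},
]

combined = aggregate_importance(runs)
```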


Derived variables almost certainly need to be created

• Almost impossible to develop high performance models without analyst creation of derived variables
• Many derived variables are “obvious” to domain specialists
  – to predict purchase amounts look at customer lifetime totals
  – possibly aggregate previous purchases into category subtotals
  – calculate trend; have orders been increasing or decreasing over time?
• Consider standard statistical summaries of groups of variables:
  – mean, standard deviation, min, max, trend
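
The summaries listed on this slide can be computed per customer before the data reach CART. The sketch below is one hedged way to do it: the function names and order amounts are hypothetical, and "trend" is taken here to mean the least-squares slope of order amount against order sequence number, which is an assumption rather than a definition from the slides.

```python
import statistics


def trend_slope(values):
    """Least-squares slope of values against time index 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        return 0.0
    t_mean = (n - 1) / 2
    v_mean = statistics.fmean(values)
    numerator = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    denominator = sum((t - t_mean) ** 2 for t in range(n))
    return numerator / denominator


def summarize_orders(amounts):
    """Derived variables for one customer's order history (oldest first)."""
    return {
        "total": sum(amounts),
        "mean": statistics.fmean(amounts),
        "sd": statistics.stdev(amounts) if len(amounts) > 1 else 0.0,
        "min": min(amounts),
        "max": max(amounts),
        "trend": trend_slope(amounts),  # positive means orders are growing
    }


# Hypothetical order history for a single customer.
features = summarize_orders([20.0, 30.0, 40.0, 50.0])
```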


Use linear combination splits to search for new derived variables

• Linear combinations found by CART can suggest new derived variables
• Recommend that the delete option be set high and that the required sample size also be substantial
• LINEAR N=1000 DELETE=.4
  – permits linear combination splits only in nodes with more than 1,000 cases
  – the higher the DELETE parameter the fewer terms in the combination
• E.g.


Results of first models are used to generate the first cut-back list of predictors

• List is determined through a combination of judgment and perusal of initial CART runs
• Purpose is error avoidance: exclusion of nuisance, pernicious, and not believable variables
• Variables that seem odd in the context, and thus probably should not have predictive value, are also excluded
  – Important not to exclude any variables that prior knowledge or conventional wisdom would include
  – Purpose of this stage is not radical pruning but elimination of valueless variables


Can be useful to explore trees for selected predictor variables or other variables of interest

• Can think of the CART tree as an extended non-parametric version of correlation analysis

• Results simply reveal what variables are in some way associated in the data

• Could construct a table of variables in the columns against variables that predict in the rows


Same procedure could be used to impute values for missing data points

• Actual procedure is complex and will be discussed in another context

• Our proposed missing value imputation procedure is iterative

• Also might start selecting complexity values that restrain growth of trees to reasonable sizes
  – A large data set might allow trees with many hundreds of terminal nodes
  – Yet optimal models might fall into the 20-100 terminal node size range


Next set of models should explore the impact of alternative splitting and testing rules

• Useful to look at GINI, TWOING, and TWOING POWER=1
• Useful to compare external test data with cross-validation in smaller data sets
• These runs may suggest which splitting rules are most promising for further work
• In most problems the default GINI is the best rule to use
  – Definitely better than ENTROPY, often slightly better than TWOING


Impact of alternative splitting and testing rules; continued

• In some problems, usually problems with poor predictability, TWOING POWER=1 works well
  – e.g. relative error in best GINI tree is .8 or higher
  – In these cases, the more balanced splitting strategy seems to yield better trees


Also want to compare results from different test procedures

• Compare runs with different subsets of test data randomly chosen from larger data sets
• e.g., Create two uniform random variables
  – %LET TEST20A=urn <0.20
  – %LET TEST20B=urn >0.20
  – Use TEST20A to pick out test sample in one run and use TEST20B in another run


We hope results will be very similar across test sets

• Approximate size of optimal tree
• Approximate relative error
• Importance ranking of variables: which variables appear near top of list
• Reasonable overlap of primary splitters in trees


Instability of results across test data sets is a warning sign

• May need to carefully review interdependencies of predictor variables

• Results may be due to a set of closely competing predictors with different information content

• If so, will want to consider whether one or more of these competitors should be dropped

• In this case, a judgment is made concerning variables to exclude from the model

• Results may be unstable due to inherent variance of the tree predictor

• In this case, will ultimately want to consider aggregation of experts discussed below


Experiments with Linear Combination Splits

• Linear combinations are occasionally instructive
• Not useful when many variables are involved
• We recommend restriction to 2-variable linear combinations
• Helpful if there are strictly positive variables transformed to logs
  – 2-variable linear combination might reveal a form like c1*log(X1) - c2*log(X2), which is the log of the ratio X1^c1 / X2^c2 and so points to a ratio of the predictors


Reading CART results

• Useful to prepare a series of summary reports after CART runs are done
• One report should just include the TREE SEQUENCE
  – Reveals the size of the optimal tree, relative error rate
  – Can be used to reject certain runs: too large, too small, too inaccurate
• Another report extracts just the split variables:
  – Contains a listing of the node split variables
  – Provides a brief outline of how the tree evolved


Reports are used to select trees that appear to be promising

• It is possible that no promising trees are found in the early rounds of analysis
• Attractive trees need to be printed to facilitate absorption of the implicit model


Currently we use allCLEAR to print

• Future CART will include its own pretty print but will still support allCLEAR
• We request the “splits” level of detail in the output
  – Includes split variable, split value, class assignment
  – Table of class distribution in the node might be too voluminous


Trees need to be read for the story they tell and assessed for plausibility

• Particularly at the higher levels of the tree (lower levels might disappear with pruning)
• Does the predictive model agree with intuition and prior expectations?


When troubling patterns emerge, need to look at the competitors of a node

• Reveals what other variable would be used to split the node if the main splitter were not available

• If the competitor is more acceptable than the primary in a node can consider dropping the primary

• Method will only work if analyst is willing to exclude the variable from anywhere in the tree

• On the basis of these reports and prints can determine candidate second round models


Now can move on to tools for model refinement

• Selection of right-sized trees based on judgment
• Altering costs of misclassification
• Creation of new variables


Judgmental Pruning of Trees: A necessary step in model development

• When the CART monograph was published in 1984 the authors suggested that the best tree was the “one-se-rule tree”

• This is the smallest tree within one standard error of the minimum cost tree

• The reasoning was: all trees within a one standard error band are statistically indistinguishable, and small trees are inherently more comprehensible and preferable
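
The one-se rule on this slide can be sketched as a selection over the pruned tree sequence. The code below is an illustration only: the function name and the tree sequence are hypothetical, and each entry is assumed to hold the number of terminal nodes, the test cost, and its standard error as read from a TREE SEQUENCE report.

```python
def select_trees(sequence):
    """Return (minimum cost tree, one-se-rule tree) from a pruned sequence.

    sequence: list of (terminal_nodes, cost, standard_error) tuples.
    """
    min_tree = min(sequence, key=lambda tree: tree[1])
    # Trees whose cost lies within one standard error of the minimum cost.
    threshold = min_tree[1] + min_tree[2]
    within_band = [tree for tree in sequence if tree[1] <= threshold]
    # The one-se-rule tree is the smallest tree inside that band.
    one_se_tree = min(within_band, key=lambda tree: tree[0])
    return min_tree, one_se_tree


# Hypothetical pruned sequence (not taken from any real CART run).
sequence = [
    (40, 0.355, 0.012),
    (25, 0.340, 0.012),
    (12, 0.349, 0.012),
    (5, 0.371, 0.013),
    (2, 0.520, 0.015),
]

min_tree, one_se_tree = select_trees(sequence)
```

In this made-up sequence the minimum cost tree has 25 terminal nodes, while the one-se rule picks the 12-node tree.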


Judgmental Pruning of Trees: continued

• The current view of the CART originators is that one should accept the literal minimum cost tree produced by CART

• This view is based on a further dozen years of experience which has revealed that the “one-se-rule” may be too conservative

• Nonetheless, compelling reasons exist to prefer smaller trees in data-mining investigations


In data-mining exercises trees can easily grow to unmanageable depths

• With the prodigious volumes of warehoused data, greedy analysis tools can develop complex models without restraint

• Paradoxically, the large quantities of data can serve to mislead

• The problem is similar to that noted by statisticians who first analyzed large national probability sample databases: in regression, t-tests, and chi-square tests, almost every estimated coefficient is “significantly” different from zero, and every null is rejected

• In the tree-growing context, elaborate trees of great depth appear to perform extremely well even on independent hold-out samples

• One therefore obtains seemingly informative complexity


A way to “discount” findings based on very large data sets is needed

• The solution in the conventional modeling context has been to adjust the significance level required before placing too much faith in a finding

• For example, a t-statistic of 2.2 for a regression coefficient based on 30 degrees of freedom should be considered more compelling than the same t-statistic based on 100,000 degrees of freedom

• In the CART context it would be useful to have optimal tree size selection criteria that adapted to the volume of data available


Three tools for adjusting an analysis to data richness are available in CART

• The ATOM or minimum node size available for splitting: as the data set size increases, ATOM size can also be increased (perhaps with the log of sample size)
  – The thinking is: as data sets increase in size, require the amount of data needed to support a split to increase also


Three tools for adjusting an analysis; continued

• The minimum child size can also be adjusted. MINCHILD prevents CART from splitting off nodes too small to support separate analysis
  – For example, we might not want to attempt inferring the probability of prepay in any node containing fewer than 100 observations

– MINCHILD and ATOM are closely related but are different concepts. MINCHILD guarantees that no terminal node will ever be smaller than its predetermined value. ATOM determines the minimum size of a node that is eligible to be split. ATOM must always be at least 2*MINCHILD so that if the smallest node eligible for splitting is split into two equal parts, each part will be at least as large as MINCHILD.

• Trees other than the “optimal” tree can be PICKED from the tree sequence
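
The relationship between ATOM and MINCHILD described on this slide can be made concrete with a small check. This is a sketch of the rule as stated in the text, not CART's internal logic; the function names are hypothetical.

```python
def limits_are_consistent(atom, minchild):
    """ATOM must be at least 2*MINCHILD, as stated on the slide."""
    return atom >= 2 * minchild


def split_is_allowed(node_size, left_size, atom, minchild):
    """Check one candidate split against both size controls.

    ATOM governs whether the parent may be split at all;
    MINCHILD governs the size of each resulting child.
    """
    right_size = node_size - left_size
    parent_ok = node_size >= atom
    children_ok = left_size >= minchild and right_size >= minchild
    return parent_ok and children_ok


# With ATOM=200 and MINCHILD=100, a 200-case node can only be split 100/100.
even_split = split_is_allowed(200, 100, atom=200, minchild=100)
uneven_split = split_is_allowed(200, 50, atom=200, minchild=100)
too_small_parent = split_is_allowed(150, 75, atom=200, minchild=50)
```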


The third tool is selection of a tree from the CART sequence

• Analyst intervention in tree selection is both desirable and unavoidable

• Allows the incorporation of prior knowledge and domain expertise

• This type of selection is really just pruning: the analyst decides to prune back further than the CART algorithms recommend

• Topic is mentioned briefly in the CART monograph where the authors discuss their decision to eliminate one or two nodes near the bottom of a medical diagnosis tree:
  – MDs running the study did not believe that these lower level splits captured the underlying biology

• This is similar to a statistician deciding to exclude a borderline significant interaction in a regression model:


In the data-mining context, tree selection can be guided by the relative error plot

• Each CART run produces a plot of relative error against number of nodes and the relative error is printed on the TREE SEQUENCE report

• In data mining these plots have a characteristic shape: steep declines in the relative error as tree initially evolves followed by lengthy flat portions in which further error reduction is extremely small with each additional node

• Further, the test data support the hypothesis that many of these error reductions are “statistically significant.” In the CART context the claim is that the more complex larger trees will predict well on fresh data and thus contain valuable information.


An analyst could defensibly decide to trade off a large block of nodes for a small “increase” in prediction error

• In one of our CART models the “optimal” tree had 100 terminal nodes and a relative error of 0.333968 +/- 0.00578
• Yet the sub-tree with 63 terminal nodes only has a relative error of 0.34339, a one-point apparent loss in accuracy.
• And 29 terminal nodes yield a relative error of .38564
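
The trade-off on this slide can be put into numbers. The relative errors and node counts below are the ones quoted above; treating the "+/- 0.00578" as a standard error and expressing the increase in those units is an assumption added for illustration, as are the function and field names.

```python
optimal = {"nodes": 100, "rel_error": 0.333968, "se": 0.00578}
smaller_trees = [
    {"nodes": 63, "rel_error": 0.34339},
    {"nodes": 29, "rel_error": 0.38564},
]


def tradeoff(optimal, candidate):
    """Nodes saved and error given up by choosing a smaller sub-tree."""
    increase = candidate["rel_error"] - optimal["rel_error"]
    return {
        "nodes_saved": optimal["nodes"] - candidate["nodes"],
        "error_increase": increase,
        # Assumption: the +/- figure is the standard error of the optimal tree.
        "increase_in_se_units": increase / optimal["se"],
    }


results = [tradeoff(optimal, tree) for tree in smaller_trees]
```

On these figures the 63-node tree gives up about 0.009 in relative error (roughly 1.6 of the quoted +/- units) in exchange for 37 fewer terminal nodes, while the 29-node tree gives up about 0.052.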


Final tree selection based on the relative error plot alone

• In many applications it will be difficult to make a final tree selection based on the relative error plot alone

• The plot reveals many opportunities for selection, but rarely serves to single out a best tree

• In some problems it is possible to find the tree that exhausts all substantial improvements and that separates a steeply sloping section from a flat plateau


The next step of tree assessment

• Careful review of a relatively large tree chosen by CART

• Examination of a large tree node-by-node will be very instructive

• We are assuming that the early splits of the tree have already been examined and found to be convincing and acceptable


Review of a relatively large tree chosen by CART

• Purpose of this stage of review is to consider the lower branches:
  – Do any of the splits appear fortuitous or not particularly believable?
  – Are the same variables being used repeatedly to minutely subdivide a predictor?
  – Is it worth pursuing additional refinement of the sub-sample reached at a particular juncture in the tree?
  – Is there any concern for whatever reason that the splits are not reasonable representations of reality?


Additional Considerations

• The tree that results when questionable or low value sections of the CART optimal tree are dropped should be considered
  – Unfortunately, there appears to be no substitute for the careful and detailed examination of the CART tree node-by-node
  – However, the only contribution of judgment here is to eliminate nodes that are thought to be the result of over-fitting


Goodness-Of-Fit Measures for Classification Trees in Classic CART

• CART classification trees automatically generate diagnostic reports
  – Relative Error Rate for all trees in pruned sequence
  – Misclassification Rate By Class for Learn and Test data
  – Misclassification Table: Actual vs. Predicted Class

• CART class probability trees display only the relative error sequence

• Although these reports are helpful in sorting out the most promising trees early on in CART analyses, they contain far less information than needed for proper model assessment


Characteristics of the CART GINI Measure

• Measure is zero whenever a node is pure
• Most CART trees are grown and pruned using the Gini measure of within node diversity
• Gini is largest when distribution of classes in a node is uniform
• CART trees usually grown with priors EQUAL
  – Essential to encourage promising tree evolution when class distribution is skewed
  – Practical impact is to make CART strive for roughly equal accuracy in all classes
  – Priors DATA and priors MIX rarely work well
• CART Gini measure will then be priors adjusted

$$ i(t) = 1 - \sum_i p_i^2 $$

where $p_i$ is the proportion of class $i$ in node $t$.


One new measure of tree performance: “Rho-squared”

• Although the growing process is improved with equal priors, the practical evaluation of the tree requires using data priors
  – Actual node distributions, not priors adjusted

• We therefore compute unadjusted Gini for entire tree and compare this with the Gini of the root

• Provides a measure of the improvement due to splitting


“Rho-squared”; continued

• Formal definition: Rho-squared = 1 - Gini(tree)/Gini(root)
  – If Gini(tree)=Gini(root) we have no improvement and rho-squared=0
  – If Gini(tree)=0, meaning all terminal nodes are perfectly pure, then rho-squared=1
  – Thus, rho-squared measures how much of the gap from Gini(root) to a Gini of 0 is closed by the model

• Can be used to compare competing tree models
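
The rho-squared definition can be sketched directly from terminal node class counts. One assumption is made here that the slides do not spell out: Gini(tree) is taken to be the case-weighted average of the unadjusted Gini values of the terminal nodes. The function names and example counts are hypothetical.

```python
def gini(counts):
    """Unadjusted Gini diversity, 1 - sum(p_i^2), from raw class counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)


def tree_gini(terminal_nodes):
    """Assumed definition: case-weighted average of terminal node Gini values."""
    total = sum(sum(node) for node in terminal_nodes)
    return sum(sum(node) / total * gini(node) for node in terminal_nodes)


def rho_squared(terminal_nodes):
    """1 - Gini(tree)/Gini(root); the root pools all terminal node counts."""
    n_classes = len(terminal_nodes[0])
    root = [sum(node[k] for node in terminal_nodes) for k in range(n_classes)]
    root_gini = gini(root)
    if root_gini == 0.0:
        raise ValueError("root node is already pure; rho-squared is undefined")
    return 1.0 - tree_gini(terminal_nodes) / root_gini


# Hypothetical two-class tree with two terminal nodes: [class 0, class 1] counts.
rho = rho_squared([[90, 10], [10, 90]])
```

Here the root Gini is 0.5 and each terminal node has a Gini of 0.18, so the tree closes 64% of the gap to perfect purity.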


Second new measure compares learn vs. test class distribution in terminal nodes

• Every classification tree generates a distribution of the dependent variable in each terminal node
• This learn data distribution can be compared with the distribution observed in other data:
  – The test data used to calibrate relative error rates and select the optimal tree
  – A test data set independent of both learn and test data used in the tree modeling
  – Data from other sources that are not necessarily expected to be similar to the tree under study
• Might also want to compare the test data with external data


Performance comparisons can be summarized in a chi-square statistic

– If there are K classes then each terminal node contributes a chi-square statistic with K-1 df
– With T terminal nodes the overall statistic for the tree has T*(K-1) degrees of freedom
– Can decompose the statistic by node or by class
– Useful when the statistic is large to determine source of large deviations
  ▪ Are we fitting badly in a specific subtree?
  ▪ Are the deviations concentrated in one class?
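
The chi-square comparison can be sketched as below, with its decomposition by node and by class. The slides do not give the exact formula, so this assumes a standard goodness-of-fit form: within each terminal node, the learn proportions give the expected test counts. The function name and all counts are hypothetical.

```python
def tree_chi_square(learn_nodes, test_nodes):
    """Compare learn and test class counts in each terminal node.

    learn_nodes, test_nodes: lists of per-node class count lists, same shape.
    Returns the total statistic, its degrees of freedom, and the
    contributions summed by node and by class.
    """
    n_classes = len(learn_nodes[0])
    by_node = []
    by_class = [0.0] * n_classes
    for learn, test in zip(learn_nodes, test_nodes):
        n_learn, n_test = sum(learn), sum(test)
        node_total = 0.0
        for k in range(n_classes):
            expected = n_test * learn[k] / n_learn
            if expected == 0:
                # Simplification: cells with no learn cases are skipped;
                # a real analysis would need to treat them explicitly.
                continue
            contribution = (test[k] - expected) ** 2 / expected
            node_total += contribution
            by_class[k] += contribution
        by_node.append(node_total)
    df = len(learn_nodes) * (n_classes - 1)  # T*(K-1)
    return {"chi_square": sum(by_node), "df": df,
            "by_node": by_node, "by_class": by_class}


# Hypothetical two-class tree with two terminal nodes.
learn_nodes = [[80, 20], [30, 70]]
test_nodes = [[35, 15], [20, 30]]
result = tree_chi_square(learn_nodes, test_nodes)
```

In this made-up example the first node contributes more than the second, and class 1 contributes more than class 0, which is the kind of pattern the decomposition is meant to expose.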


Class Probability Trees

• Technically, project Oracle uses class probability trees for forecasts and simulation
• Class probability trees use the same GINI method for growing
• Uses GINI for pruning trees as well
• Nevertheless, we used classification trees throughout and interpreted the results as class probability trees
• Several reasons for this approach
  – Classification trees produce misclassification reports
  – Can be guided by variable cost of misclassification
  – Class probability trees sometimes much smaller than classification trees


Class Probability Trees; continued

• Main problem with class probability trees
  – Pruning based on equal priors
  – Want pruning based on data priors, not yet possible in CART
• Hence, use of classification tree to allow judgmental pruning
• Nonetheless, looking at class probability tree sizes can be used to bound right sized tree
• Would be desirable to modify CART to allow different priors in growing and pruning
