Predictive Modeling Using Decision Trees
TRANSCRIPT
Introduction
• Decision trees are powerful and popular for classification and prediction.
• Decision trees represent rules.
  – Rules can be expressed in English, for example:
    IF Age <= 43 AND Sex = Male AND Credit Card Insurance = No
    THEN Life Insurance Promotion = No
• Decision trees are useful for exploring data to gain insight into the relationships of a large number of candidate input variables to a target (output) variable.
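Such a rule maps directly onto code. A minimal sketch, using the attribute names and values from the example rule above (the function name and fallback return value are illustrative):

def life_insurance_promotion(age, sex, credit_card_insurance):
    # The example rule above: one path through a decision tree.
    if age <= 43 and sex == "Male" and credit_card_insurance == "No":
        return "No"
    return None  # the remaining cases are handled by other branches of the tree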
Decision Tree – What is it?
A structure that can be used to divide up a large collection of records into successively smaller sets of records by applying a sequence of simple decision rules.

A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous groups with respect to a particular target variable.
Decision Trees: HMEQ Example
Banking marketing scenario (HMEQ):
Target:
• default on a home-equity line of credit (BAD)
Inputs:
• number of delinquent trade lines (DELINQ)
• number of credit inquiries (NINQ)
• debt-to-income ratio (DEBTINC)
• possibly many other inputs
Introduction to Decision Tree Modeling
Decision Trees: Interpretation of the Fitted Decision Tree
• The internal nodes contain rules.
• Start at the root node (top) and follow the rules until a terminal node (leaf) is reached.
• The leaves contain the estimate of the expected value of the target – in this case the posterior probability of BAD. The probability can then be used to allocate cases to classes.
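To make the scoring process concrete, here is a minimal sketch of following the rules from root to leaf for one case. The split variables echo the HMEQ inputs, but the thresholds and leaf probabilities are hypothetical, not the fitted HMEQ tree:

def p_bad(case):
    # Start at the root node and follow the rules to a leaf.
    if case["DEBTINC"] < 45:        # root node rule (hypothetical threshold)
        if case["DELINQ"] < 2:      # internal node rule (hypothetical)
            return 0.02             # leaf: posterior probability of BAD
        return 0.20
    return 0.21

print(p_bad({"DEBTINC": 30, "DELINQ": 0}))  # 0.02 -> allocate to class 0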
Decision Tree Template
• Drawn top-to-bottom or left-to-right
• Top (or left-most) node = Root Node
• Descendent node(s) = Child Node(s)
• Bottom (or right-most) node(s) = Leaf Node(s)
• Unique path from root to each leaf = Rule
[Figure: example tree diagram showing a root node, child nodes, and leaf nodes]
Divide and Conquer
[Figure: a root node of n = 5,000 with 10% BAD is split on Debt-to-Income Ratio < 45 into a "yes" child (n = 3,350, 5% BAD) and a "no" child (n = 1,650, 21% BAD)]
The tree is fitted to the data by recursive partitioning. Partitioning refers to segmenting the data into subgroups that are as homogeneous as possible with respect to the target. In this case, the binary split (Debt-to-Income Ratio < 45) was chosen. The 5,000 cases were split into two groups, one with a 5% BAD rate and the other with a 21% BAD rate.
The method is recursive because each subgroup results from splitting a subgroup from a previous split. Thus, the 3,350 cases in the left child node and the 1,650 cases in the right child node are split again in similar fashion.
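The same recursive partitioning can be reproduced with scikit-learn, one common implementation outside SAS Enterprise Miner. A sketch, assuming the HMEQ data sits in a file named hmeq.csv with columns BAD, DELINQ, NINQ, and DEBTINC (the file name, and restricting to complete cases, are assumptions):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("hmeq.csv").dropna(subset=["BAD", "DELINQ", "NINQ", "DEBTINC"])
X, y = df[["DELINQ", "NINQ", "DEBTINC"]], df["BAD"]

# A shallow tree, like the two-level example above.
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # the rules, in English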
The Cultivation of Trees
• Split Search – Which splits are to be considered?
• Splitting Criterion – Which split is best?
• Stopping Rule – When should the splitting stop?
• Pruning Rule – Should some branches be lopped off?
Splitting Criteria
How is the best split determined? In some situations, the worth of a split is obvious: if the expected target is the same in the child nodes as in the parent node, no improvement was made, and the split is worthless. In contrast, if a split results in pure child nodes, the split is undisputedly best.
For classification trees, the three most widely used splitting criteria are based on the Pearson chi-squared test, the Gini index, and entropy. All three measure the difference in class distributions across the child nodes, and they usually give similar results.
Example splits (the parent node contains 4,500 Not Bad and 500 Bad cases):

Perfect Split:
            Left    Right
Not Bad     4500    0
Bad         0       500

Debt-to-Income Ratio < 45:
            Left    Right
Not Bad     3196    1304
Bad         154     346

A Competing Three-Way Split:
            Left    Center  Right
Not Bad     2521    1188    791
Bad         115     162     223
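The worth of these candidate splits can be computed directly from the tables. A minimal sketch using entropy (information gain), one of the three criteria named above; the (not bad, bad) counts are taken from the tables:

from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def information_gain(parent, children):
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

parent = (4500, 500)
print(information_gain(parent, [(4500, 0), (0, 500)]))       # perfect split
print(information_gain(parent, [(3196, 154), (1304, 346)]))  # DTI < 45
print(information_gain(parent,
                       [(2521, 115), (1188, 162), (791, 223)]))  # three-way

The perfect split recovers all of the parent's entropy; the other two recover less, and the criterion prefers whichever split recovers more.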
Decision Tree Types
• Binary trees – only two choices in each split; can be non-uniform (uneven) in depth.
• N-way trees (ternary, 3-way, 4-way, etc.) – three or more choices in at least one of the splits.
Split Criteria
• The best split is defined as the one that does the best job of separating the data into groups where a single class predominates in each group.
• The measure used to evaluate a potential split is purity.
• The best split is the one that increases the purity of the subsets by the greatest amount.
• A good split also creates nodes of similar size, or at least does not create very small nodes.
Tests for Choosing Best Split
Purity (diversity) measures:
• Gini (population diversity)
• Entropy (information gain)
Gini (Population Diversity)
The Gini measure of a node is the sum of the squares of the proportions of the classes.
• Root node: 0.5^2 + 0.5^2 = 0.5 (even balance)
• Leaf nodes: 0.1^2 + 0.9^2 = 0.82 (close to pure)
• Gini score of the split = 0.5 × 0.82 + 0.5 × 0.82 = 0.82 (close to pure), weighting each leaf by its share of the records
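A minimal sketch of this calculation (the function name is illustrative):

def gini(proportions):
    # Sum of squares of the class proportions, as defined above.
    return sum(p ** 2 for p in proportions)

print(gini([0.5, 0.5]))   # root node: 0.5 (even balance)
print(gini([0.1, 0.9]))   # each leaf: 0.82 (close to pure)

# Gini score of the split: each leaf weighted by its share of records.
print(0.5 * gini([0.1, 0.9]) + 0.5 * gini([0.9, 0.1]))  # 0.82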
Decision Tree Advantages
1. Easy to understand
2. Map nicely to a set of business rules
3. Successfully applied to real-world problems
4. Make no prior assumptions about the data
5. Able to process both numerical and categorical data
Benefits of Trees
• Interpretability – tree-structured presentation
• Mixed measurement scales
• Regression trees
• Handling of outliers
• Handling of missing values
The Right-Sized Tree
• Stunting – stop the tree from growing too large in the first place, using stopping rules.
• Pruning – grow a large tree, then cut back branches that do not improve performance.
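A sketch of the pruning idea using scikit-learn's cost-complexity pruning (an analogous mechanism to Enterprise Miner's validation-based pruning, not the same one; the stand-in data is synthetic):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in data
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Larger ccp_alpha -> more aggressive pruning -> smaller tree.
for alpha in path.ccp_alphas[::10]:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)
    print(f"alpha={alpha:.4f}  nodes={pruned.tree_.node_count}")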
Building and Interpreting Decision Trees
• Explore the types of decision tree models available in Enterprise Miner.
• Build a decision tree model.
• Examine the model results and interpret these results.
• Choose a decision threshold theoretically and empirically.
The Scenario
• Determine who should be approved for a home equity loan.
• The target variable is a binary variable that indicates whether an applicant eventually defaulted on the loan.
• The input variables include the amount of the loan, the amount due on the existing mortgage, the value of the property, and the number of recent credit inquiries.
The HMEQ data set contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates if an applicant eventually defaulted or was seriously delinquent. This adverse outcome occurred in 1,189 cases (20%). For each applicant, 12 input variables were recorded.
Presume that every two dollars loaned eventually returns three dollars if the loan is paid off in full.
Accuracy Measures (Classification)
• Misclassification error: classifying a record as belonging to one class when it belongs to another class.
• Error rate: the percentage of misclassified records out of the total records in the validation data.
Confusion Matrix
• 201 1's correctly classified as "1"
• 85 1's incorrectly classified as "0"
• 25 0's incorrectly classified as "1"
• 2,689 0's correctly classified as "0"

Classification confusion matrix:
                 Predicted Class
Actual Class     1       0
1                201     85
0                25      2689
Error Rate
• Overall error rate = (25 + 85)/3000 = 3.67%
• Accuracy = 1 – error rate = (201 + 2689)/3000 = 96.33%
• With multiple classes, error rate = (sum of misclassified records)/(total records)
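A minimal sketch reproducing these figures from the four confusion-matrix counts:

tp, fn, fp, tn = 201, 85, 25, 2689  # counts from the matrix above
total = tp + fn + fp + tn           # 3,000 validation records
error_rate = (fp + fn) / total
accuracy = (tp + tn) / total        # equals 1 - error_rate
print(f"error rate = {error_rate:.2%}, accuracy = {accuracy:.2%}")
# error rate = 3.67%, accuracy = 96.33%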
Cutoff for Classification
Most DM algorithms classify via a two-step process. For each record:
1. Compute the probability of belonging to class "1".
2. Compare it to the cutoff value, and classify accordingly.
• The default cutoff value is 0.50: if the probability is >= 0.50, classify as "1"; if < 0.50, classify as "0".
• Different cutoff values can be used.
• Typically, the error rate is lowest for a cutoff of 0.50.
Cutoff Table

Actual Class   Prob. of "1"      Actual Class   Prob. of "1"
1              0.996             1              0.506
1              0.988             0              0.471
1              0.984             0              0.337
1              0.980             1              0.218
1              0.948             0              0.199
1              0.889             0              0.149
1              0.848             0              0.048
0              0.762             0              0.038
1              0.707             0              0.025
1              0.681             0              0.022
1              0.656             0              0.016
0              0.622             0              0.004

• If the cutoff is 0.50: 13 records are classified as "1".
• If the cutoff is 0.80: 7 records are classified as "1".
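A sketch verifying those counts; the 24 (actual class, probability) pairs are transcribed from the cutoff table:

records = [(1, .996), (1, .988), (1, .984), (1, .980), (1, .948), (1, .889),
           (1, .848), (0, .762), (1, .707), (1, .681), (1, .656), (0, .622),
           (1, .506), (0, .471), (0, .337), (1, .218), (0, .199), (0, .149),
           (0, .048), (0, .038), (0, .025), (0, .022), (0, .016), (0, .004)]

for cutoff in (0.50, 0.80):
    n = sum(p >= cutoff for _, p in records)
    print(f"cutoff {cutoff}: {n} records classified as '1'")
# cutoff 0.5: 13 records; cutoff 0.8: 7 records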
Confusion Matrix for Different Cutoffs

Cutoff = 0.25:
                 Predicted Class
Actual Class     owner   non-owner
owner            11      1
non-owner        4       8

Cutoff = 0.75:
                 Predicted Class
Actual Class     owner   non-owner
owner            7       5
non-owner        1       11
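The counts in these matrices match the 24 records from the cutoff table (reading 1 as owner and 0 as non-owner). A sketch reproducing them:

records = [(1, .996), (1, .988), (1, .984), (1, .980), (1, .948), (1, .889),
           (1, .848), (0, .762), (1, .707), (1, .681), (1, .656), (0, .622),
           (1, .506), (0, .471), (0, .337), (1, .218), (0, .199), (0, .149),
           (0, .048), (0, .038), (0, .025), (0, .022), (0, .016), (0, .004)]

for cutoff in (0.25, 0.75):
    tp = sum(a == 1 and p >= cutoff for a, p in records)
    fn = sum(a == 1 and p < cutoff for a, p in records)
    fp = sum(a == 0 and p >= cutoff for a, p in records)
    tn = sum(a == 0 and p < cutoff for a, p in records)
    print(f"cutoff {cutoff}: TP={tp} FN={fn} FP={fp} TN={tn}")
# cutoff 0.25: TP=11 FN=1 FP=4 TN=8
# cutoff 0.75: TP=7  FN=5 FP=1 TN=11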
[Table: a sample of the HMEQ data scored by the fitted tree. For each case it lists the assigned node (_NODE_) and leaf (_LEAF_), the posterior probabilities P_DEFAULT1 and P_DEFAULT0 (e.g., 1.0000/0.0000 and 0.8282/0.1718), the classification, residual, and warning columns, and input variables such as LOAN, MORTGAGE, VALUE, and REASON (DebtCon/HomeImp).]
Consequences of a Decision

             Decision 1        Decision 0
Actual 1     True Positive     False Negative
Actual 0     False Positive    True Negative
Consequences of a Decision: Profit Matrix (SAS EM)

             Decision 1                     Decision 0
Actual 1     True Positive (Profit = $2)    False Negative
Actual 0     False Positive (Loss = $1)     True Negative
Bayes Rule: Optimal Threshold

θ = 1 / (1 + cost of false negative / cost of false positive)

Using the cost structure defined for the home equity example, the optimal threshold is 1/(1 + 2/1) = 1/3. That is:
• reject all applications whose predicted probability of default exceeds 0.3333.
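A minimal sketch of this calculation. The cost assignments follow the profit matrix: a false negative (lending to a defaulter) loses the $2 lent, and a false positive (rejecting a good applicant) forgoes $1 of profit:

def optimal_threshold(cost_false_negative, cost_false_positive):
    # Bayes rule threshold, as defined above.
    return 1 / (1 + cost_false_negative / cost_false_positive)

theta = optimal_threshold(cost_false_negative=2, cost_false_positive=1)
print(theta)  # 0.333...: reject applicants whose P(default) exceeds 1/3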