Inferring Decision Trees Using the Minimum Description Length
Principle
J. R. Quinlan and R. L. Rivest
Information and Computation 80, 227-248, 1989
Introduction
Minimum Description Length Principle The best theory is the one which minimizes the sum of
1. the length of the theory, and
2. the length of the data when encoded using the theory as a predictor for the data.
Goal: application of the MDLP to the construction of decision trees from data.
Example (1): Table I

No.  Outlook   Temperature  Humidity  Windy  Class
 1   Sunny     Hot          High      False  N
 2   Sunny     Hot          High      True   N
 3   Overcast  Hot          High      False  P
 4   Rain      Mild         High      False  P
 5   Rain      Cool         Normal    False  P
 6   Rain      Cool         Normal    True   N
 7   Overcast  Cool         Normal    True   P
 8   Sunny     Mild         High      False  N
 9   Sunny     Cool         Normal    False  P
10   Rain      Mild         Normal    False  P
11   Sunny     Mild         Normal    True   P
12   Overcast  Mild         High      True   P
13   Overcast  Hot          Normal    False  P
14   Rain      Mild         High      True   N
Example (2)
Best Tree
The best tree is the one with the smallest possible error rate when classifying previously unseen objects.
An imperfect, smaller DT often achieves greater accuracy in classifying new objects than one which perfectly classifies all the known objects; a perfect tree may be overly sensitive to statistical irregularities and idiosyncrasies of the given data set.
It is generally not possible for a DT inference procedure to explicitly minimize the error rate on new examples, so a number of different approximate measures are used; the measure proposed here is the MDLP.
MDLP (1)
The DT which minimizes this measure is proposed as the "best" DT to infer from the given data.
Motivating problem: a communication problem based on the given data, where the goal is to transmit the fewest total bits.
Our communication problem: you and I each have a copy of the data set, but in your copy the last column (the class) is missing. I must send you an exact description of the missing column using as few bits as possible.
Simplest technique: send each class directly, 14 bits.
MDLP (2)
The more predictable the class of an object is from its attributes, the fewer bits I need to send.
In general:
1. Partition the set of objects into a number of subsets based on the attributes of the objects.
2. Send you a description of this partition.
3. Send you a description of the most frequent (default) class associated with each subset.
4. For each subset, send you a description of the exceptions to its default class.
Provided there are few exceptions in each subset, this costs fewer bits than sending the classes directly.
MDLP (3)
Decision tree: a natural and efficient way of describing such a partition and of associating a default class with each category.
Best DT: the combined length of the description of the DT plus the description of the exceptions must be as small as possible.
Bayesian Interpretation of the MDLP
The MDLP can be naturally viewed as a Bayesian MAP estimator.
T : decision tree; t : the length (in bits) of the encoding of T
D : the data (exceptions) to be transmitted; d : the length (in bits) of the encoding of D
r : a fixed parameter, r > 1, which controls how quickly the prior probability decreases as the length t of a string increases.
Bayesian Interpretation (2)
The prior probability assigned to each binary string of length t is (1 - 1/r)(1/(2r))^t:
  Empty string: (1 - 1/r). Strings 0 and 1: each (1 - 1/r)(1/(2r)).
r_T and r_D : two fixed parameters, r_T > 1 and r_D > 1.
The prior probability of the theory represented by a DT T whose encoding has length t is
  P(T) = (1 - 1/r_T) (1/(2 r_T))^t
Bayesian Interpretation (3)
The probability of the observed data, given the theory:
  P(D | T) = (1 - 1/r_D) (1/(2 r_D))^d
The posterior probability of the theory:
  P(T | D) = P(D | T) P(T) / P(D)
Bayesian Interpretation (4)
  -lg(P(T | D)) = t (1 + lg r_T) + d (1 + lg r_D) - lg(1 - 1/r_T) - lg(1 - 1/r_D) + lg P(D)
                = c_T t + c_D d + g(r_T, r_D, D)
where c_T = 1 + lg r_T, c_D = 1 + lg r_D, and g collects the terms that do not depend on T.
The tree which minimizes c_T t + c_D d therefore has maximum posterior probability.
With r_T = r_D = 2, c_T = c_D = 2, so we simply minimize t + d.
If r_T is large, large trees T are penalized heavily, and a more compact tree will have maximum posterior probability.
If r_D is large, exceptions are penalized heavily, and a large tree which explains the given data most accurately is likely to result.
In what follows, assume r_T = r_D, so that c_T = c_D.
Coding Strings of 0’s and 1’s
Notation: n is the length of the string; k is the number of 1 symbols; (n - k) is the number of 0 symbols; b is a known a priori upper bound on k (k <= b), e.g. b = n or b = (n + 1)/2.
The procedure:
1. First I transmit the value of k: lg(b + 1) bits.
2. Given k, there are only C(n, k) strings possible, so identifying the actual string costs lg C(n, k) bits.
Coding Strings (2)
Total cost:
  L(n, k, b) = lg(b + 1) + lg C(n, k) bits
This is a standard measure of the complexity of a binary string of length n containing exactly k 1's, where k <= b. It does not depend on the positions of the 1's.
Example (Table I class column, N,N,P,P,P,N,P,N,P,P,P,P,P,N):
  L(14, 9, 14) = lg(15) + lg(2002) = 14.874 bits
Approximation using Stirling's formula, with H the binary entropy function:
  L(n, k, b) ≈ n H(k/n) + lg(b) + lg(n)/2 - lg(k)/2 - lg(n - k)/2 - lg(2π)/2 + O(1/n)
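As a quick sanity check of these formulas, here is a small Python sketch (not from the paper; the function names string_cost and entropy are mine) that computes L(n, k, b) exactly and via the Stirling-based approximation, reproducing the 14.874-bit figure for the class column of Table I.

```python
from math import comb, log2, pi

def string_cost(n, k, b):
    """L(n, k, b) = lg(b + 1) + lg C(n, k): send k, then the string itself."""
    return log2(b + 1) + log2(comb(n, k))

def entropy(p):
    """Binary entropy H(p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def string_cost_approx(n, k, b):
    """Stirling-based approximation to L(n, k, b), valid for 0 < k < n."""
    return (n * entropy(k / n) + log2(b)
            + 0.5 * (log2(n) - log2(k) - log2(n - k) - log2(2 * pi)))

print(round(string_cost(14, 9, 14), 3))         # 14.874 bits
print(round(string_cost_approx(14, 9, 14), 3))  # close to the exact value
```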
Coding Strings (3)
Quinlan's heuristic: the information content of a string of length n containing k P's is taken to be n H(k/n).
This is an under-approximation of L(n, k, b), and may result in large decision trees.
Generalization to t class symbols, with counts k_1 + k_2 + ... + k_t = n:
  L(n; k_1, k_2, ..., k_t) = lg C(n + t - 1, t - 1) + lg( n! / (k_1! k_2! ... k_t!) )
i.e. first transmit the class counts, then identify the string among all strings with those counts.
Coding Sets of Strings
Example (Table I, attribute "humidity"):
  High-humidity objects: N, N, P, P, N, P, N
  Normal-humidity objects: P, N, P, P, P, P, P
Coding the exceptions within each subset costs L(7, 3, 3) + L(7, 1, 3) = 11.937 bits < 14 bits, so there is some relationship between the attribute and the class.
We still need to include the cost of describing this decision tree.
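A short check of this example in the same sketch style (string_cost is the L(n, k, b) helper introduced above, not the paper's code):

```python
from math import comb, log2

def string_cost(n, k, b):          # L(n, k, b) = lg(b + 1) + lg C(n, k)
    return log2(b + 1) + log2(comb(n, k))

# High humidity:   N,N,P,P,N,P,N  -> default N, 3 exceptions out of 7
# Normal humidity: P,N,P,P,P,P,P  -> default P, 1 exception out of 7
print(round(string_cost(7, 3, 3) + string_cost(7, 1, 3), 3))   # 11.937 < 14 bits
```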
Coding Decision Trees (1)
Coding scheme: smaller DTs are represented by shorter codewords than larger DTs. The code is produced by a recursive, top-down, depth-first procedure:
A leaf is encoded as "0" followed by an encoding of the default class for that leaf.
A tree which is not a leaf is encoded as "1", followed by the code for the attribute at the root of the tree, followed by the encodings of the subtrees of the tree, in order.
Coding Decision Trees (2)
Example code: 1 Outlook 1 Humidity 0 N 0 P 0 P 1 Windy 0 N 0 P
"Outlook" requires 2 bits: selecting the first attribute out of four.
"Humidity" requires lg(3) bits: only three attributes remain.
Example tree: 18.170 bits in total.
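The 18.170-bit figure can be reproduced with the following sketch (the tree representation and the helper tree_bits are mine; the rule that each level below the root has one fewer eligible attribute follows the slide above):

```python
from math import log2

# The example tree: a leaf is a class string, a decision node is (attribute, children).
TREE = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "N", "Normal": "P"}),
    "Overcast": "P",
    "Rain":     ("Windy", {"True": "N", "False": "P"}),
})

def tree_bits(tree, attrs_left=4, num_classes=2):
    """Recursive cost: '0' + class for a leaf, '1' + attribute + subtrees otherwise."""
    if isinstance(tree, str):                        # leaf: structure bit + default class
        return 1 + log2(num_classes)
    _, children = tree                               # decision node
    cost = 1 + log2(attrs_left)                      # structure bit + attribute choice
    return cost + sum(tree_bits(sub, attrs_left - 1, num_classes)
                      for sub in children.values())

print(f"{tree_bits(TREE):.3f}")                      # 18.170 bits
```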
Coding Decision Trees (3)
For a uniform b-ary tree with n decision nodes and (b - 1)n + 1 leaves, the scheme above uses bn + 1 structure bits.
The number of b-ary trees with n internal nodes and (b - 1)n + 1 leaves is
  C(bn, n) / ((b - 1)n + 1)
so an optimal structure code would need only about
  lg C(bn, n) - lg((b - 1)n + 1) ≈ bn H(1/b) - O(lg(bn)) bits.
Since H(1/b) < 1 for b > 2, the proposed coding scheme is not efficient for high-arity trees.
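A small numeric illustration of this inefficiency (my own check, assuming the tree count above): for b = 4 and n = 5 there are only 969 distinct tree shapes, so an optimal code needs about 10 bits where the simple scheme spends 21.

```python
from math import comb, log2

def num_bary_trees(b, n):
    """Number of b-ary trees with n internal nodes and (b - 1)n + 1 leaves."""
    return comb(b * n, n) // ((b - 1) * n + 1)

b, n = 4, 5
optimal = log2(num_bary_trees(b, n))   # bits for an optimal structure code
naive = b * n + 1                      # bits used by the simple 0/1 scheme
print(num_bary_trees(b, n), round(optimal, 1), naive)   # 969 9.9 21
```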
Coding Decision Trees (4)
Cost of describing the structure of the tree: suppose the tree has n nodes in total, of which k are decision nodes and n - k are leaves. The structure string can be coded with L(n, k, (n + 1)/2) bits; k < n - k because all tests have arity at least two.
Total tree description cost: add the cost of specifying the attribute name for each decision node and the cost of specifying the default class for each leaf.
Coding Exceptions
Five subsets of the set of objects:
  Sunny outlook & high humidity: N, N, N
  Sunny outlook & normal humidity: P, P
  Overcast outlook: P, P, P, P
  Rainy outlook & windy: N, N
  Rainy outlook & not windy: P, P, P
The exceptions can be encoded with a cost of L(3, 0, 1) + L(2, 0, 1) + L(4, 0, 2) + L(2, 0, 1) + L(3, 0, 1) = 5.585 bits.
Total cost for the communication problem: 18.170 + 5.585 = 23.755 bits.
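Checking the arithmetic with the same string_cost sketch as before (names are mine):

```python
from math import comb, log2

def string_cost(n, k, b):          # L(n, k, b) = lg(b + 1) + lg C(n, k)
    return log2(b + 1) + log2(comb(n, k))

# (subset size, number of exceptions, bound b) for the five leaves above
leaves = [(3, 0, 1), (2, 0, 1), (4, 0, 2), (2, 0, 1), (3, 0, 1)]
exceptions = sum(string_cost(*leaf) for leaf in leaves)
print(f"{exceptions:.3f}")             # 5.585 bits for the exceptions
print(f"{18.170 + exceptions:.3f}")    # 23.755 bits in total
```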
Coding Real-Valued Attributes
Two approaches to finding a good "cut point":
1. Use the values of the known objects: suppose that for the chosen attribute the n given objects have m <= n distinct values. Sort the m real values; the cut point is then specified by its index i, which costs lg(m) bits.
2. Use compactly described rational numbers.
Computing Good Decision Trees
Cost of replacing a leaf with a decision node: suppose there are A attributes and we replace a leaf at depth d with a decision node. There are d' <= d attributes already tested on the path from the root to this leaf, so A - d' attributes are eligible; indicating which one is selected requires lg(A - d') bits. If the selected attribute has v values, the cost of describing the additional tree structure is 2v - 1 bits, and the leaf's subset of objects is split into v subsets.
We then measure the extent to which the exceptions can be coded more efficiently after the split. If the savings so obtained exceed the cost of extending the tree, the extension should be selected.
Two Phase Process
First phase: begin with a single leaf and continue to extend the tree. Iterate until the tree is perfect or cannot be grown any further:
1. Let x be a leaf whose corresponding category contains objects of varying classes, such that it is possible to replace x with a decision node.
2. For each possible attribute A, compute the total communication cost if this change is made.
3. Replace x with the decision node giving the least total communication cost.
Second phase: the tree is repeatedly pruned back by replacing decision nodes with leaves whenever this improves the total communication cost, until no further improvement in communication cost is possible. A rough sketch of the underlying cost comparison follows.
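Below is a rough, self-contained sketch of the cost comparison that drives both phases, applied to the Table I data. It uses a deliberately simplified cost model (one structure bit plus one class bit per node or leaf, lg of the number of remaining attributes per test, and L(n, k, b) for the exceptions at each leaf), considers only one-level extensions, and is my own reconstruction rather than the authors' code.

```python
from math import comb, log2

DATA = [  # Table I: (Outlook, Temperature, Humidity, Windy, Class)
    ("Sunny", "Hot", "High", "False", "N"), ("Sunny", "Hot", "High", "True", "N"),
    ("Overcast", "Hot", "High", "False", "P"), ("Rain", "Mild", "High", "False", "P"),
    ("Rain", "Cool", "Normal", "False", "P"), ("Rain", "Cool", "Normal", "True", "N"),
    ("Overcast", "Cool", "Normal", "True", "P"), ("Sunny", "Mild", "High", "False", "N"),
    ("Sunny", "Cool", "Normal", "False", "P"), ("Rain", "Mild", "Normal", "False", "P"),
    ("Sunny", "Mild", "Normal", "True", "P"), ("Overcast", "Mild", "High", "True", "P"),
    ("Overcast", "Hot", "Normal", "False", "P"), ("Rain", "Mild", "High", "True", "N"),
]
ATTRS = ["Outlook", "Temperature", "Humidity", "Windy"]

def string_cost(n, k, b):                      # L(n, k, b)
    return log2(b + 1) + log2(comb(n, k))

def leaf_cost(records):
    """Structure bit + default-class bit + exceptions to the majority class."""
    n = len(records)
    k = n - max(sum(r[-1] == c for r in records) for c in "NP")
    return 2 + string_cost(n, k, n // 2)

def split_cost(records, attr, attrs_left):
    """Replace a leaf by a test on `attr`, its children remaining leaves."""
    i = ATTRS.index(attr)
    values = sorted({r[i] for r in records})
    children = [[r for r in records if r[i] == v] for v in values]
    return 1 + log2(attrs_left) + sum(leaf_cost(c) for c in children)

print(f"single leaf   {leaf_cost(DATA):.3f} bits")
for a in ATTRS:                                # candidate one-level extensions
    print(f"{a:<14}{split_cost(DATA, a, len(ATTRS)):.3f} bits")
```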
Attribute with a large number of values
An attribute with a large number of values is penalized: it splits the set of objects into many subsets (e.g. an "object number" attribute in the example), and the cost of specifying the attribute will not be justified by the extra compression achieved in transmitting the class information. A numeric illustration follows.
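To put a number on this, suppose (hypothetically) a fifth attribute "ObjNo" gave every object a distinct value. Under the same simplified cost model as the sketch above, splitting on it removes all exceptions yet costs roughly twice as much as keeping a single leaf:

```python
from math import log2

n_objects, n_attrs = 14, 5                       # hypothetical: 4 attributes + "ObjNo"
single_leaf = 2 + log2(8) + log2(2002)           # 1 + 1 + L(14, 5, 7) = 15.967 bits
id_split = 1 + log2(n_attrs) + n_objects * 2     # test node + 14 pure one-object leaves
print(f"{single_leaf:.3f} vs {id_split:.3f}")    # 15.967 vs 31.322 bits
```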
Empirical Results
Comparison with C4. The MDLP provides a unified framework for both growing and pruning the decision tree.
Data set      MDLP                 C4
              Size   Error rate    Size   Error rate
Hypo           11     0.6%         11.0    0.55%
Discordant     15     1.9%         13.6    1.25%
LED            83    26.9%         56.0   28.1%
Credit         14    17.4%         32.5   16.1%
Endgame        15    17.9%         62.6   13.6%
Prob-Disj      17    20.5%         42.6   14.9%
Extensions (1)
In the presence of noisy data, the significance of the existing dependencies of the class on the attributes will be masked; as the noise level increases, the tree grown should decrease in size.
To handle "training sets" which are especially representative of the concept being learned, associate a "frequency count" larger than one with each input object. If it is required that the DT classify each object correctly, these counts can be set to a large value: the saving that can be realized by using a perfect DT increases as the counts increase.
Extensions (2)
Experiment: replicate the data set some number c of times. If the communication cost is separated into t "tree bits" and d "data bits", the total cost is then t + c d. The factor c reflects an a priori understanding of the representativeness or completeness of the given data set.
Extensions (3)
Data set      c = 1              c = 2              c = 8
              Size  Error rate   Size  Error rate   Size  Error rate
Hypo           11    0.6%         15    0.6%         19    0.5%
Discordant     15    1.9%         23    1.8%         37    1.1%
LED            83   26.9%         93   27.0%         95   27.0%
Credit         14   17.4%         11   13.5%         76   17.0%
Endgame        15   17.9%         35   11.5%         71   12.0%
Prob-Disj      17   20.5%         45   13.5%         69   13.0%