Inferring Decision Trees Using the Minimum Description Length
Principle
J. R. Quinlan and R. L. Rivest
Information and Computation 80, 227-248, 1989
Introduction
Minimum Description Length Principle The best theory is the one which minimizes the sum of
1. the length of the theory, and
2. the length of the data when encoded using the theory as a predictor for the data.
Goal: application of the MDLP to the construction of decision trees from data.
Example (1): Table I

No.  Outlook   Temperature  Humidity  Windy  Class
 1   Sunny     Hot          High      False  N
 2   Sunny     Hot          High      True   N
 3   Overcast  Hot          High      False  P
 4   Rain      Mild         High      False  P
 5   Rain      Cool         Normal    False  P
 6   Rain      Cool         Normal    True   N
 7   Overcast  Cool         Normal    True   P
 8   Sunny     Mild         High      False  N
 9   Sunny     Cool         Normal    False  P
10   Rain      Mild         Normal    False  P
11   Sunny     Mild         Normal    True   P
12   Overcast  Mild         High      True   P
13   Overcast  Hot          Normal    False  P
14   Rain      Mild         High      True   N
Example (2)
Best Tree
The best tree is the one with the smallest possible error rate when classifying previously unseen objects.
An imperfect, smaller DT often achieves greater accuracy in classifying new objects than one which perfectly classifies all the known objects; a perfect tree may be overly sensitive to statistical irregularities and idiosyncrasies of the given data set.
It is generally not possible for a DT inference procedure to explicitly minimize the error rate on new examples, so a number of different approximate measures are used; the measure proposed here is the MDLP.
MDLP (1)
The DT which minimizes this measure is proposed as the "best" DT to infer from the given data.
Motivating problem: a communication problem based on the given data, where the goal is to transmit the fewest total bits.
Our communication problem: you and I each have a copy of the data set, but in your copy the last column (the class) is missing. I must send you an exact description of the missing column using as few bits as possible.
Simplest technique: send each class directly, 14 bits.
MDLP (2)
The more predictable the class of an object is from its attributes, the fewer bits I need to send.
In general:
1. Partition the set of objects into a number of subsets based on the attributes of the objects.
2. Send you a description of this partition.
3. Send you a description of the most frequent (default) class associated with each subset.
4. For each subset, send you a description of the exceptions to its default class.
Provided there are few exceptions in each subset, this costs fewer bits than sending the classes directly.
MDLP (3)
Decision tree: a natural and efficient way of describing such a partition and of associating a default class with each category.
Best DT: the combined length of the description of the DT plus the description of the exceptions must be as small as possible.
Bayesian Interpretation of the MDLP
The MDLP can be naturally viewed as a Bayesian MAP estimator.
T : decision tree; t : the length (in bits) of the encoding of T
D : the data (exceptions) to be transmitted; d : the length (in bits) of the encoding of D
r : a fixed parameter, r > 1, which controls how quickly the prior probability decreases as the length t of a string increases.
Bayesian Interpretation (2)
The prior probability assigned to each binary string of length t is (1 - 1/r)(1/(2r))^t:
  Empty string: (1 - 1/r). Strings 0 and 1: each (1 - 1/r)(1/(2r)).
r_T and r_D : two fixed parameters, r_T > 1 and r_D > 1.
The prior probability of the theory represented by a DT T whose encoding has length t is
  P(T) = (1 - 1/r_T) (1/(2 r_T))^t
Bayesian Interpretation (3)
The probability of the observed data, given the theory:
  P(D | T) = (1 - 1/r_D) (1/(2 r_D))^d
The posterior probability of the theory:
  P(T | D) = P(D | T) P(T) / P(D)
Bayesian Interpretation (4)
  -lg(P(T | D)) = t (1 + lg r_T) + d (1 + lg r_D) - lg(1 - 1/r_T) - lg(1 - 1/r_D) + lg P(D)
                = c_T t + c_D d + g(r_T, r_D, D)
where c_T = 1 + lg r_T, c_D = 1 + lg r_D, and g collects the terms that do not depend on T.
The tree which minimizes c_T t + c_D d therefore has maximum posterior probability.
With r_T = r_D = 2, c_T = c_D = 2, so we simply minimize t + d.
If r_T is large, large trees T are penalized heavily, and a more compact tree will have maximum posterior probability.
If r_D is large, exceptions are penalized heavily, and a large tree which explains the given data most accurately is likely to result.
In what follows, assume r_T = r_D, so that c_T = c_D.
Coding Strings of 0’s and 1’s
Notation: n is the length of the string; k is the number of 1 symbols; (n - k) is the number of 0 symbols; b is a known a priori upper bound on k (k <= b), e.g. b = n or b = (n + 1)/2.
The procedure:
1. First I transmit the value of k: lg(b + 1) bits.
2. Given k, there are only C(n, k) strings possible, so identifying the actual string costs lg C(n, k) bits.
Coding Strings (2)
Total cost:
  L(n, k, b) = lg(b + 1) + lg C(n, k) bits
This is a standard measure of the complexity of a binary string of length n containing exactly k 1's, where k <= b. It does not depend on the positions of the 1's.
Example (Table I class column, N,N,P,P,P,N,P,N,P,P,P,P,P,N):
  L(14, 9, 14) = lg(15) + lg(2002) = 14.874 bits
Approximation using Stirling's formula, with H the binary entropy function:
  L(n, k, b) ≈ n H(k/n) + lg(b) + lg(n)/2 - lg(k)/2 - lg(n - k)/2 - lg(2π)/2 + O(1/n)
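As a quick sanity check of these formulas, here is a small Python sketch (not from the paper; the function names string_cost and entropy are mine) that computes L(n, k, b) exactly and via the Stirling-based approximation, reproducing the 14.874-bit figure for the class column of Table I.

```python
from math import comb, log2, pi

def string_cost(n, k, b):
    """L(n, k, b) = lg(b + 1) + lg C(n, k): send k, then the string itself."""
    return log2(b + 1) + log2(comb(n, k))

def entropy(p):
    """Binary entropy H(p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def string_cost_approx(n, k, b):
    """Stirling-based approximation to L(n, k, b), valid for 0 < k < n."""
    return (n * entropy(k / n) + log2(b)
            + 0.5 * (log2(n) - log2(k) - log2(n - k) - log2(2 * pi)))

print(round(string_cost(14, 9, 14), 3))         # 14.874 bits
print(round(string_cost_approx(14, 9, 14), 3))  # close to the exact value
```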
Coding Strings (3)
Quinlan's heuristic: the information content of a string of length n containing k P's is taken to be n H(k/n).
This is an under-approximation of L(n, k, b), and may result in large decision trees.
Generalization to t class symbols, with counts k_1 + k_2 + ... + k_t = n:
  L(n; k_1, k_2, ..., k_t) = lg C(n + t - 1, t - 1) + lg( n! / (k_1! k_2! ... k_t!) )
i.e. first transmit the class counts, then identify the string among all strings with those counts.
Coding Sets of Strings
Example (Table I, attribute "humidity"):
  High-humidity objects: N, N, P, P, N, P, N
  Normal-humidity objects: P, N, P, P, P, P, P
Coding the exceptions within each subset costs L(7, 3, 3) + L(7, 1, 3) = 11.937 bits < 14 bits, so there is some relationship between the attribute and the class.
We still need to include the cost of describing this decision tree.
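A short check of this example in the same sketch style (string_cost is the L(n, k, b) helper introduced above, not the paper's code):

```python
from math import comb, log2

def string_cost(n, k, b):          # L(n, k, b) = lg(b + 1) + lg C(n, k)
    return log2(b + 1) + log2(comb(n, k))

# High humidity:   N,N,P,P,N,P,N  -> default N, 3 exceptions out of 7
# Normal humidity: P,N,P,P,P,P,P  -> default P, 1 exception out of 7
print(round(string_cost(7, 3, 3) + string_cost(7, 1, 3), 3))   # 11.937 < 14 bits
```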
Coding Decision Trees (1)
Coding scheme: smaller DTs are represented by shorter codewords than larger DTs. The code is produced by a recursive, top-down, depth-first procedure:
A leaf is encoded as "0" followed by an encoding of the default class for that leaf.
A tree which is not a leaf is encoded as "1", followed by the code for the attribute at the root of the tree, followed by the encodings of the subtrees of the tree, in order.
Coding Decision Trees (2)
Example code: 1 Outlook 1 Humidity 0 N 0 P 0 P 1 Windy 0 N 0 P
"Outlook" requires 2 bits: selecting the first attribute out of four.
"Humidity" requires lg(3) bits: only three attributes remain.
Example tree: 18.170 bits in total.
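The 18.170-bit figure can be reproduced with the following sketch (the tree representation and the helper tree_bits are mine; the rule that each level below the root has one fewer eligible attribute follows the slide above):

```python
from math import log2

# The example tree: a leaf is a class string, a decision node is (attribute, children).
TREE = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "N", "Normal": "P"}),
    "Overcast": "P",
    "Rain":     ("Windy", {"True": "N", "False": "P"}),
})

def tree_bits(tree, attrs_left=4, num_classes=2):
    """Recursive cost: '0' + class for a leaf, '1' + attribute + subtrees otherwise."""
    if isinstance(tree, str):                        # leaf: structure bit + default class
        return 1 + log2(num_classes)
    _, children = tree                               # decision node
    cost = 1 + log2(attrs_left)                      # structure bit + attribute choice
    return cost + sum(tree_bits(sub, attrs_left - 1, num_classes)
                      for sub in children.values())

print(f"{tree_bits(TREE):.3f}")                      # 18.170 bits
```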
Coding Decision Trees (3)
For a uniform b-ary tree with n decision nodes and (b - 1)n + 1 leaves, the scheme above uses bn + 1 structure bits.
The number of b-ary trees with n internal nodes and (b - 1)n + 1 leaves is
  C(bn, n) / ((b - 1)n + 1)
so an optimal structure code would need only about
  lg C(bn, n) - lg((b - 1)n + 1) ≈ bn H(1/b) - O(lg(bn)) bits.
Since H(1/b) < 1 for b > 2, the proposed coding scheme is not efficient for high-arity trees.
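A small numeric illustration of this inefficiency (my own check, assuming the tree count above): for b = 4 and n = 5 there are only 969 distinct tree shapes, so an optimal code needs about 10 bits where the simple scheme spends 21.

```python
from math import comb, log2

def num_bary_trees(b, n):
    """Number of b-ary trees with n internal nodes and (b - 1)n + 1 leaves."""
    return comb(b * n, n) // ((b - 1) * n + 1)

b, n = 4, 5
optimal = log2(num_bary_trees(b, n))   # bits for an optimal structure code
naive = b * n + 1                      # bits used by the simple 0/1 scheme
print(num_bary_trees(b, n), round(optimal, 1), naive)   # 969 9.9 21
```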
Coding Decision Trees (4)
Cost of describing the structure of the tree: suppose the tree has n nodes in total, of which k are decision nodes and n - k are leaves. The structure string can be coded with L(n, k, (n + 1)/2) bits; k < n - k because all tests have arity at least two.
Total tree description cost: add the cost of specifying the attribute name for each decision node and the cost of specifying the default class for each leaf.
Coding Exceptions
Five subsets of the set of objects:
  Sunny outlook & high humidity: N, N, N
  Sunny outlook & normal humidity: P, P
  Overcast outlook: P, P, P, P
  Rainy outlook & windy: N, N
  Rainy outlook & not windy: P, P, P
The exceptions can be encoded with a cost of L(3, 0, 1) + L(2, 0, 1) + L(4, 0, 2) + L(2, 0, 1) + L(3, 0, 1) = 5.585 bits.
Total cost for the communication problem: 18.170 + 5.585 = 23.755 bits.
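Checking the arithmetic with the same string_cost sketch as before (names are mine):

```python
from math import comb, log2

def string_cost(n, k, b):          # L(n, k, b) = lg(b + 1) + lg C(n, k)
    return log2(b + 1) + log2(comb(n, k))

# (subset size, number of exceptions, bound b) for the five leaves above
leaves = [(3, 0, 1), (2, 0, 1), (4, 0, 2), (2, 0, 1), (3, 0, 1)]
exceptions = sum(string_cost(*leaf) for leaf in leaves)
print(f"{exceptions:.3f}")             # 5.585 bits for the exceptions
print(f"{18.170 + exceptions:.3f}")    # 23.755 bits in total
```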
Coding Real-Valued Attributes
Two approaches to finding a good "cut point":
1. Use the values of the known objects: suppose that for the chosen attribute the n given objects have m <= n distinct values. Sort the m real values; the cut point is then specified by its index i, which costs lg(m) bits.
2. Use compactly described rational numbers.
Computing Good Decision Trees
Cost of replacing a leaf with a decision node: suppose there are A attributes and we replace a leaf at depth d with a decision node. There are d' <= d attributes already tested on the path from the root to this leaf, so A - d' attributes are eligible; indicating which one is selected requires lg(A - d') bits. If the selected attribute has v values, the cost of describing the additional tree structure is 2v - 1 bits, and the leaf's subset of objects is split into v subsets.
We then measure the extent to which the exceptions can be coded more efficiently after the split. If the savings so obtained exceed the cost of extending the tree, the extension should be selected.
Two Phase Process
First phase: begin with a single leaf and continue to extend the tree. Iterate until the tree is perfect or cannot be grown any further:
1. Let x be a leaf whose corresponding category contains objects of varying classes, such that it is possible to replace x with a decision node.
2. For each possible attribute A, compute the total communication cost if this change is made.
3. Replace x with the decision node giving the least total communication cost.
Second phase: the tree is repeatedly pruned back by replacing decision nodes with leaves whenever this improves the total communication cost, until no further improvement in communication cost is possible. A rough sketch of the underlying cost comparison follows.
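Below is a rough, self-contained sketch of the cost comparison that drives both phases, applied to the Table I data. It uses a deliberately simplified cost model (one structure bit plus one class bit per node or leaf, lg of the number of remaining attributes per test, and L(n, k, b) for the exceptions at each leaf), considers only one-level extensions, and is my own reconstruction rather than the authors' code.

```python
from math import comb, log2

DATA = [  # Table I: (Outlook, Temperature, Humidity, Windy, Class)
    ("Sunny", "Hot", "High", "False", "N"), ("Sunny", "Hot", "High", "True", "N"),
    ("Overcast", "Hot", "High", "False", "P"), ("Rain", "Mild", "High", "False", "P"),
    ("Rain", "Cool", "Normal", "False", "P"), ("Rain", "Cool", "Normal", "True", "N"),
    ("Overcast", "Cool", "Normal", "True", "P"), ("Sunny", "Mild", "High", "False", "N"),
    ("Sunny", "Cool", "Normal", "False", "P"), ("Rain", "Mild", "Normal", "False", "P"),
    ("Sunny", "Mild", "Normal", "True", "P"), ("Overcast", "Mild", "High", "True", "P"),
    ("Overcast", "Hot", "Normal", "False", "P"), ("Rain", "Mild", "High", "True", "N"),
]
ATTRS = ["Outlook", "Temperature", "Humidity", "Windy"]

def string_cost(n, k, b):                      # L(n, k, b)
    return log2(b + 1) + log2(comb(n, k))

def leaf_cost(records):
    """Structure bit + default-class bit + exceptions to the majority class."""
    n = len(records)
    k = n - max(sum(r[-1] == c for r in records) for c in "NP")
    return 2 + string_cost(n, k, n // 2)

def split_cost(records, attr, attrs_left):
    """Replace a leaf by a test on `attr`, its children remaining leaves."""
    i = ATTRS.index(attr)
    values = sorted({r[i] for r in records})
    children = [[r for r in records if r[i] == v] for v in values]
    return 1 + log2(attrs_left) + sum(leaf_cost(c) for c in children)

print(f"single leaf   {leaf_cost(DATA):.3f} bits")
for a in ATTRS:                                # candidate one-level extensions
    print(f"{a:<14}{split_cost(DATA, a, len(ATTRS)):.3f} bits")
```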
Attribute with a large number of values
An attribute with a large number of values is penalized: it splits the set of objects into many subsets (e.g. an "object number" attribute in the example), and the cost of specifying the attribute will not be justified by the extra compression achieved in transmitting the class information. A numeric illustration follows.
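To put a number on this, suppose (hypothetically) a fifth attribute "ObjNo" gave every object a distinct value. Under the same simplified cost model as the sketch above, splitting on it removes all exceptions yet costs roughly twice as much as keeping a single leaf:

```python
from math import log2

n_objects, n_attrs = 14, 5                       # hypothetical: 4 attributes + "ObjNo"
single_leaf = 2 + log2(8) + log2(2002)           # 1 + 1 + L(14, 5, 7) = 15.967 bits
id_split = 1 + log2(n_attrs) + n_objects * 2     # test node + 14 pure one-object leaves
print(f"{single_leaf:.3f} vs {id_split:.3f}")    # 15.967 vs 31.322 bits
```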
Empirical Results
Comparison with C4. The MDLP provides a unified framework for both growing and pruning the decision tree.
Data set      MDLP                 C4
              Size   Error rate    Size   Error rate
Hypo           11     0.6%         11.0    0.55%
Discordant     15     1.9%         13.6    1.25%
LED            83    26.9%         56.0   28.1%
Credit         14    17.4%         32.5   16.1%
Endgame        15    17.9%         62.6   13.6%
Prob-Disj      17    20.5%         42.6   14.9%
Extensions (1)
In the presence of noisy data, the significance of the existing dependencies of the class on the attributes will be masked; as the noise level increases, the tree grown should decrease in size.
To handle "training sets" which are especially representative of the concept being learned, associate a "frequency count" larger than one with each input object. If it is required that the DT classify each object correctly, these counts can be set to a large value: the saving that can be realized by using a perfect DT increases as the counts increase.
Extensions (2)
Experiment: replicate the data set some number c of times. If the communication cost is separated into t "tree bits" and d "data bits", the total cost is then t + c d. The factor c reflects an a priori understanding of the representativeness or completeness of the given data set.
Extensions (3)
Data set      c = 1              c = 2              c = 8
              Size  Error rate   Size  Error rate   Size  Error rate
Hypo           11    0.6%         15    0.6%         19    0.5%
Discordant     15    1.9%         23    1.8%         37    1.1%
LED            83   26.9%         93   27.0%         95   27.0%
Credit         14   17.4%         11   13.5%         76   17.0%
Endgame        15   17.9%         35   11.5%         71   12.0%
Prob-Disj      17   20.5%         45   13.5%         69   13.0%