Decision Trees
Outline
What is a decision tree?
How to construct a decision tree?
What are the major steps in decision tree induction?
How to select the attribute to split the node?
What are the other issues?
Classification by Decision Tree Induction
A decision tree is a flow-chart-like tree structure:
An internal node denotes a test on an attribute
A branch represents an outcome of the test
Leaf nodes represent class labels or class distributions

Example tree:

Age?
  <=30  -> Student?
            no  -> NO
            yes -> YES
  31…40 -> YES
  >40   -> Credit?
            excellent -> NO
            fair      -> YES
Training Dataset

ID   Age    Income  Student  Credit     Buys_computer
P1   <=30   high    no       fair       no
P2   <=30   high    no       excellent  no
P3   31…40  high    no       fair       yes
P4   >40    medium  no       fair       yes
P5   >40    low     yes      fair       yes
P6   >40    low     yes      excellent  no
P7   31…40  low     yes      excellent  yes
P8   <=30   medium  no       fair       no
P9   <=30   low     yes      fair       yes
P10  >40    medium  yes      fair       yes
P11  <=30   medium  yes      excellent  yes
P12  31…40  medium  no       excellent  yes
P13  31…40  high    yes      fair       yes
P14  >40    medium  no       excellent  no
Output: A Decision Tree for “buys_computer”

Age?
  <=30  -> Student?
            no  -> NO
            yes -> YES
  31…40 -> YES
  >40   -> Credit?
            excellent -> NO
            fair      -> YES
How to construct a decision tree?
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm):
Attributes are categorical (continuous-valued attributes are discretized in advance)
The tree is constructed in a top-down, recursive, divide-and-conquer manner
At the start, all training examples are at the root
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Examples are partitioned recursively based on the selected attributes
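The basic algorithm above can be sketched in Python (a minimal illustration, not code from the slides; the row encoding and helper names are assumptions):

```python
from collections import Counter

def majority_class(rows, target):
    """Most common class label among the rows."""
    return Counter(r[target] for r in rows).most_common(1)[0][0]

def build_tree(rows, attributes, target, select_attribute):
    """Greedy top-down, divide-and-conquer induction over categorical attributes.

    select_attribute(rows, attributes, target) returns the attribute judged
    best by some heuristic or statistical measure, e.g. information gain.
    """
    classes = {r[target] for r in rows}
    if len(classes) == 1:               # pure node -> leaf
        return classes.pop()
    if not attributes:                  # no tests left -> majority vote
        return majority_class(rows, target)
    best = select_attribute(rows, attributes, target)
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in {r[best] for r in rows}:       # partition on each outcome
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = build_tree(subset, remaining, target, select_attribute)
    return tree
```

With an information-gain-based `select_attribute` (the measure discussed later in the deck), this procedure reproduces the buys_computer tree built over the following slides.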
Construction of a Decision Tree for “buys_computer”

?   [P1,…,P14]  Yes: 9, No: 5   (all examples at the root; the split attribute is not yet chosen)
Construction of a Decision Tree for “buys_computer”

Age?   [P1,…,P14]  Yes: 9, No: 5
  branches: <=30, 31…40, >40
Construction of a Decision Tree for “buys_computer”

Age?   [P1,…,P14]  Yes: 9, No: 5
  <=30  -> ?    [P1,P2,P8,P9,P11]   Yes: 2, No: 3
  31…40 -> YES  [P3,P7,P12,P13]     Yes: 4, No: 0
  >40   -> ?    [P4,P5,P6,P10,P14]  Yes: 3, No: 2
Construction of a Decision Tree for “buys_computer”

Age?   [P1,…,P14]  Yes: 9, No: 5
  <=30  -> Student?  [P1,P2,P8,P9,P11]  Yes: 2, No: 3
             no  -> NO   [P1,P2,P8]  Yes: 0, No: 3
             yes -> YES  [P9,P11]    Yes: 2, No: 0
  31…40 -> YES  [P3,P7,P12,P13]  Yes: 4, No: 0
  >40   -> ?    [P4,P5,P6,P10,P14]  Yes: 3, No: 2
Construction of a Decision Tree for “buys_computer”

Age?   [P1,…,P14]  Yes: 9, No: 5
  <=30  -> Student?  [P1,P2,P8,P9,P11]  Yes: 2, No: 3
             no  -> NO   [P1,P2,P8]  Yes: 0, No: 3
             yes -> YES  [P9,P11]    Yes: 2, No: 0
  31…40 -> YES  [P3,P7,P12,P13]  Yes: 4, No: 0
  >40   -> Credit?  [P4,P5,P6,P10,P14]  Yes: 3, No: 2
             excellent -> NO   [P6,P14]     Yes: 0, No: 2
             fair      -> YES  [P4,P5,P10]  Yes: 3, No: 0
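The choice of Age? at the root can be checked by computing each attribute's information gain over the 14 training tuples (a Python sketch; the gain measure itself is introduced later in the deck):

```python
from collections import Counter
from math import log2

# The 14 training tuples from the slides: (Age, Income, Student, Credit, Buys_computer)
DATA = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31…40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31…40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no",  "excellent", "yes"),
    ("31…40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]
ATTRS = {"Age": 0, "Income": 1, "Student": 2, "Credit": 3}

def info(labels):
    """I(...) over a list of class labels, with 0*log2(0) taken as 0."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values() if c)

def gain(rows, col):
    """Gain(A) = I(S) - E(A) for the attribute in column `col`."""
    labels = [r[-1] for r in rows]
    e = 0.0
    for v in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == v]
        e += len(subset) / len(rows) * info(subset)
    return info(labels) - e

gains = {a: gain(DATA, c) for a, c in ATTRS.items()}
```

Running this gives Gain(Age) ≈ 0.25, the largest of the four, which is why Age? is tested first.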
How to select the attribute to split the node?
Which Attribute Is the Best?
The attribute most useful for classifying the examples.
Information gain: an information-theoretic approach
Measures how well an attribute separates the training examples
Use the attribute with the highest information gain to split
Minimizes the expected number of tests needed to classify a new tuple

How useful? How well separated? How pure is the splitting result? -> Information gain
Choosing an attribute
Idea: a good attribute splits the examples into subsets that are (ideally) “all positive” or “all negative”.
In the restaurant example, Patrons? is a better choice.
Information theory
If there are n equally probable possible messages, then the probability p of each is 1/n
Information conveyed by a message is -log(p) = log(n)
E.g., if there are 16 messages, then log(16) = 4 and we need 4 bits to identify/send each message
In general, if we are given a probability distribution P = (p1, p2, …, pn), then the information conveyed by the distribution (a.k.a. the entropy of P) is:

I(P) = -(p1*log(p1) + p2*log(p2) + … + pn*log(pn))
Information theory II
Information conveyed by a distribution (a.k.a. the entropy of P):

I(P) = -(p1*log(p1) + p2*log(p2) + … + pn*log(pn))

Examples:
If P is (0.5, 0.5), then I(P) is 1
If P is (0.67, 0.33), then I(P) is 0.92
If P is (1, 0), then I(P) is 0
The more uniform the probability distribution, the greater its information: more information is conveyed by a message telling you which event actually occurred.
Entropy is the average number of bits per message needed to represent a stream of messages.
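These example values are easy to check in Python (a small sketch; logs are base 2, and 0*log2(0) is taken as 0):

```python
from math import log2

def entropy(dist):
    """I(P) = -(p1*log2(p1) + ... + pn*log2(pn)), skipping zero probabilities."""
    return -sum(p * log2(p) for p in dist if p > 0)

# The slide's examples:
# entropy([0.5, 0.5])   -> 1.0    (maximally uniform: one full bit)
# entropy([0.67, 0.33]) -> ~0.92
# entropy([1.0, 0.0])   -> 0.0    (no uncertainty, no information needed)
```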
Information for classification
If a set S of records is partitioned into disjoint exhaustive classes (C1, C2, …, Ck) on the basis of the value of the class attribute, then the information needed to identify the class of an element of S is Info(S) = I(P), where P is the probability distribution of the partition (C1, C2, …, Ck):

P = (|C1|/|S|, |C2|/|S|, …, |Ck|/|S|)

[Figure: two partitions of S into classes C1, C2, C3, contrasting an evenly mixed partition (high information) with one dominated by a single class (low information)]
Information, Entropy, and Information Gain
S contains si tuples of class Ci for i = {1, …, m}. Information measures “the amount of info” required to classify an arbitrary tuple:

I(s1, s2, …, sm) = -(p1*log2(p1) + p2*log2(p2) + … + pm*log2(pm))

where pi = si/|S| is the probability that an arbitrary tuple belongs to Ci.

Example: S contains 100 tuples, 25 belong to class C1 and 75 belong to class C2:

I(25, 75) = -(25/100)*log2(25/100) - (75/100)*log2(75/100) = 0.811
Information, Entropy, and Information Gain
Information reflects the “purity” of the data set:
A low information value indicates high purity
A high information value indicates high diversity
Example: S contains 100 tuples.
If 0 belong to class C1 and 100 belong to class C2 (taking 0*log2(0) = 0):

I(0, 100) = -(0/100)*log2(0/100) - (100/100)*log2(100/100) = 0

If 50 belong to class C1 and 50 belong to class C2:

I(50, 50) = -(50/100)*log2(50/100) - (50/100)*log2(50/100) = 1
Information for classification II
If we partition S w.r.t. attribute X into sets {T1, T2, …, Tn}, then the information needed to identify the class of an element of S becomes the weighted average of the information needed to identify the class of an element of Ti, i.e. the weighted average of Info(Ti):

Info(X, S) = Σi (|Ti|/|S|) * Info(Ti)

[Figure: two partitions of S into classes C1, C2, C3, contrasting a split that leaves the classes mixed (high information) with one that separates them (low information)]
Information gain
Consider the quantity Gain(X, S) defined as

Gain(X, S) = Info(S) - Info(X, S)

This represents the difference between
the information needed to identify an element of S, and
the information needed to identify an element of S after the value of attribute X has been obtained.
That is, it is the gain in information due to attribute X. We can use this to rank attributes and to build decision trees in which each node is assigned the attribute with the greatest gain among the attributes not yet considered on the path from the root.
The intent of this ordering is:
To create small decision trees, so that records can be identified after only a few questions
To match a hoped-for minimality of the process represented by the records being considered (Occam’s Razor)
Information, Entropy, and Information Gain
S contains si tuples of class Ci for i = {1, …, m}.
Attribute A has values {a1, a2, …, av}.
Let sij be the number of tuples which belong to class Ci and have value aj in attribute A.
The entropy of attribute A is

E(A) = Σj=1..v ((s1j + … + smj)/|S|) * I(s1j, …, smj)

The information gained by branching on attribute A is

Gain(A) = I(s1, s2, …, sm) - E(A)
Information, Entropy, and Information Gain
Let Tj be the set of tuples having value aj in attribute A; then s1j + … + smj = |Tj| and I(s1j, …, smj) = I(Tj).
The entropy of attribute A is

E(A) = Σj=1..v ((s1j + … + smj)/|S|) * I(s1j, …, smj)

where (s1j + … + smj)/|S| is the proportion of |Tj| over |S|, and I(s1j, …, smj) is the information of Tj.
Information, Entropy, and Information Gain
Example: S contains 100 tuples, 40 belong to class C1 (red) and 60 belong to class C2 (blue). Partitioning on attribute A:

A = a1: 20 tuples (10 C1, 10 C2), I(10, 10) = 1
A = a2: 30 tuples (10 C1, 20 C2), I(10, 20) = 0.918
A = a3: 50 tuples (20 C1, 30 C2), I(20, 30) = 0.971

I(40, 60) = 0.971

E(A) = (20/100)*1 + (30/100)*0.918 + (50/100)*0.971 = 0.961

Gain(A) = I(40, 60) - E(A) = 0.971 - 0.961 = 0.01
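The E(A) and Gain(A) arithmetic in this example can be reproduced directly (a short Python check; the partition layout mirrors the slide):

```python
from math import log2

def info(counts):
    """I(s1,...,sm) from raw class counts, with 0*log2(0) taken as 0."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

# Partition of the 100 tuples by attribute A, as (class-C1, class-C2) counts:
partition = {"a1": (10, 10), "a2": (10, 20), "a3": (20, 30)}

total = sum(sum(c) for c in partition.values())              # 100 tuples
e_a = sum(sum(c) / total * info(c) for c in partition.values())
gain_a = info((40, 60)) - e_a
# e_a is roughly 0.961 and gain_a roughly 0.01, matching the slide.
```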
Computing information gain
[Figure: the restaurant data set — 12 examples, 6 positive (Y) and 6 negative (N), with attributes Patrons (Empty/Some/Full) and Type (French/Italian/Thai/Burger)]

I(S) = -(.5*log(.5) + .5*log(.5)) = .5 + .5 = 1

Info(Pat, S) = 1/6*(0) + 1/3*(0) + 1/2*(-(2/3*log(2/3) + 1/3*log(1/3))) = 1/2*(2/3*.6 + 1/3*1.6) = .47

Info(Type, S) = 1/6*(1) + 1/6*(1) + 1/3*(1) + 1/3*(1) = 1

Gain(Pat, S) = 1 - .47 = .53
Gain(Type, S) = 1 - 1 = 0
Regarding the Definition of Entropy…
On textbook page 134 (Eq. 3.6), entropy is defined on a set of tuples:

Entropy(S) = -Σi=1..m pi*log2(pi)

On textbook page 287 (Eq. 7.2), entropy is defined on an attribute:

Entropy(A) = Σj=1..v ((s1j + … + smj)/|S|) * I(s1j, …, smj)

Polymorphism: when entropy is defined on tuples, use Eq. 3.6; when entropy is defined on an attribute, use Eq. 7.2.
How well does it work?
Many case studies have shown that decision trees are at least as accurate as human experts. In a study on diagnosing breast cancer, humans correctly classified the examples 65% of the time; the decision tree classified 72% correctly.
British Petroleum designed a decision tree for gas-oil separation on offshore oil platforms that replaced an earlier rule-based expert system.
Cessna designed an airplane flight controller using 90,000 examples and 20 attributes per example.
What are the other issues?
Extracting Classification Rules from Trees
Represent the knowledge in the form of IF-THEN rules:
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rules are easier for humans to understand
Examples of Classification Rules

Age?
  <=30  -> Student?
            no  -> NO
            yes -> YES
  31…40 -> YES
  >40   -> Credit?
            excellent -> NO
            fair      -> YES

Classification rules:
1. IF age = “<=30” AND student = “no” THEN buys_computer = “no”
2. IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
3. IF age = “31…40” THEN buys_computer = “yes”
4. IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
5. IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
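The path-to-rule translation can be sketched as follows (a minimal illustration; the nested-dict encoding of the tree is an assumption, not taken from the slides):

```python
def extract_rules(tree, conditions=(), target="buys_computer"):
    """Walk every root-to-leaf path; each path becomes one IF-THEN rule."""
    if not isinstance(tree, dict):                 # leaf: emit one rule
        cond = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        return [f'IF {cond} THEN {target} = "{tree}"']
    (attribute, branches), = tree.items()          # one test per internal node
    rules = []
    for value, subtree in branches.items():
        rules += extract_rules(subtree, conditions + ((attribute, value),), target)
    return rules

# The buys_computer tree, with each internal node as {attribute: {value: subtree}}:
tree = {"age": {
    "<=30": {"student": {"no": "no", "yes": "yes"}},
    "31…40": "yes",
    ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
}}
# extract_rules(tree) yields five rules, one per leaf, matching the list above.
```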
Avoid Over-fitting in Classification
The generated tree may over-fit the training data:
Too many branches, some of which may reflect anomalies due to noise or outliers
The result is poor accuracy on unseen samples
Two approaches to avoiding over-fitting:
Pre-pruning: halt tree construction early; do not split a node if this would cause the goodness measure to fall below a threshold (it is difficult to choose an appropriate threshold)
Post-pruning: remove branches from a “fully grown” tree to obtain a sequence of progressively pruned trees; use a set of data different from the training data to decide which is the “best pruned tree”
Enhancements to basic decision tree induction
Dynamic discretization for continuous-valued attributes:
Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
Handling missing attribute values:
Assign the most common value of the attribute
Assign a probability to each of the possible values
Attribute construction:
Create new attributes based on existing ones that are sparsely represented
Reduces fragmentation (the number of samples at a branch becomes too small to be statistically significant), repetition (an attribute is repeatedly tested along a branch), and replication (duplicate subtrees)
Classification in Large Databases
Classification is a classical problem extensively studied by statisticians and machine learning researchers.
Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed.
Why decision tree induction in data mining?
Relatively fast learning speed (compared with other classification methods)
Convertible to simple and easy-to-understand classification rules
Can use SQL queries for accessing databases
Classification accuracy comparable with other methods
Scalable Decision Tree Induction Methods
SLIQ (EDBT’96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory
SPRINT (VLDB’96, J. Shafer et al.): constructs an attribute-list data structure
PUBLIC (VLDB’98, Rastogi & Shim): integrates tree splitting and tree pruning; stops growing the tree earlier
RainForest (VLDB’98, Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label)
Summary
What is a decision tree?
  A flow-chart-like tree: internal nodes, branches, and leaf nodes
How to construct a decision tree? What are the major steps in decision tree induction?
  Test attribute selection
  Sample partitioning
How to select the attribute to split the node?
  Select the attribute with the highest information gain:
    Calculate the information of the node
    Calculate the entropy of the attribute
    Calculate the difference between the information and the entropy
What are the other issues?