
Page 1: Basic Learning Methods: 1R, Decision

Basic Learning Methods: 1R, Decision Trees 

1

Page 2: Basic Learning Methods: 1R, Decision

Supervised Learning

● Aim: Construct a model that is able to predict the class label of a data instance.
  ◆ Classification learning
● Training / Learning
  ◆ Automatically construct the model using training data
● Testing / Operational Usage
  ◆ Make use of the learned model to predict the class of an unseen data instance
  ◆ Measure the performance of the model

2

Page 3: Basic Learning Methods: 1R, Decision

Simplicity first

● Simple algorithms sometimes work well!
● There are many kinds of simple structure, e.g.
  ◆ One attribute does all the work
  ◆ All attributes contribute equally & independently
  ◆ A weighted linear combination might do
  ◆ Instance-based: use a few prototypes
  ◆ Use simple logical rules
● Sometimes, success of a method depends on the domain

3

Page 4: Basic Learning Methods: 1R, Decision

Inferring rudimentary rules

● 1R: learns a 1-level decision tree
  ◆ i.e., rules that all test one particular attribute
● Basic version
  ◆ one branch for each value
  ◆ each branch assigns most frequent class
● Error rate: proportion of instances that don’t belong to the majority class of their corresponding branch
● Choose attribute with lowest error rate (assumes nominal attributes)

4

Page 5: Basic Learning Methods: 1R, Decision

Input instances with attributes

Outlook    Temp  Humidity  Windy  Play
Sunny      Hot   High      False  No
Sunny      Hot   High      True   No
Overcast   Hot   High      False  Yes
Rainy      Mild  High      False  Yes
Rainy      Cool  Normal    False  Yes
Rainy      Cool  Normal    True   No
Overcast   Cool  Normal    True   Yes
Sunny      Mild  High      False  No
Sunny      Cool  Normal    False  Yes
Rainy      Mild  Normal    False  Yes
Sunny      Mild  Normal    True   Yes
Overcast   Mild  High      True   Yes
Overcast   Hot   Normal    False  Yes
Rainy      Mild  High      True   No

● Play attribute has a special role – class attribute
● Learn a model to predict the outcome of the class attribute (i.e., Play)

5

Page 6: Basic Learning Methods: 1R, Decision

Rule Template for 1R

Template of the knowledge (simple rule):

If <attribute> is:
  <value1>, then <class> is <outcome1>
  <value2>, then <class> is <outcome2>
  :
  :

6

Page 7: Basic Learning Methods: 1R, Decision

Pseudo‐code for 1R

Pseudo-code:

For each attribute,
  For each value of the attribute, make a rule as follows:
    count how often each class appears
    find the most frequent class
    make the rule assign that class to this attribute-value
  Calculate the error rate of the rules
Choose the rules with the smallest error rate

Note: “missing” is treated as a separate attribute value

Template of the knowledge (simple rule):

If <attribute> is:
  <value1>, then <class> is <outcome1>
  <value2>, then <class> is <outcome2>
  :
  :

7
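As an illustration of the pseudo-code above, here is a minimal Python sketch of 1R for nominal attributes; the function and variable names are mine, not from the slides, and a missing value is simply treated as one more attribute value, as the note says.

```python
from collections import Counter, defaultdict

def one_r(instances, class_attr):
    """1R sketch: for each attribute, build one rule per value (predict that
    value's most frequent class), then keep the attribute whose rules make
    the fewest errors on the training data."""
    best = None  # (errors, attribute, rules)
    attributes = [a for a in instances[0] if a != class_attr]
    for attr in attributes:
        counts = defaultdict(Counter)          # value -> class frequencies
        for inst in instances:
            counts[inst[attr]][inst[class_attr]] += 1
        rules = {value: cls.most_common(1)[0][0] for value, cls in counts.items()}
        errors = sum(sum(cls.values()) - cls.most_common(1)[0][1]
                     for cls in counts.values())
        if best is None or errors < best[0]:
            best = (errors, attr, rules)
    return best

# Tiny usage example with two instances from the weather data on page 5:
data = [
    {"Outlook": "Sunny",    "Temp": "Hot", "Humidity": "High", "Windy": "False", "Play": "No"},
    {"Outlook": "Overcast", "Temp": "Hot", "Humidity": "High", "Windy": "False", "Play": "Yes"},
]
print(one_r(data, "Play"))   # e.g. (0, 'Outlook', {'Sunny': 'No', 'Overcast': 'Yes'})
```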

Page 8: Basic Learning Methods: 1R, Decision

Processing the Attributes

Attribute     Rules                 Errors   Total errors
Outlook       Sunny → No            2/5      4/14
              Overcast → Yes        0/4
              Rainy → Yes           2/5
Temp          Hot → No              2/4      5/14
              Mild → Yes            2/6
              Cool → Yes            1/4
Humidity      High → No             3/7      4/14
              Normal → Yes          1/7
Windy         False → Yes           2/8      5/14
              True → No             3/6

(The weather data table from page 5 is shown alongside for reference.)

8

Page 9: Basic Learning Methods: 1R, Decision

Output Solution

● There are two solutions, as shown below; either one can be selected arbitrarily as the final solution.
● First solution –
  If Outlook is:
    ● Sunny, then play is no
    ● Overcast, then play is yes
    ● Rainy, then play is yes
● Second solution –
  If Humidity is:
    ● High, then play is no
    ● Normal, then play is yes

9

Page 10: Basic Learning Methods: 1R, Decision

Dealing with numeric attributes

● Discretize numeric attributes
● Divide each attribute’s range into intervals
● Sort instances according to attribute’s values
● Place breakpoints where class changes (majority class)
● This minimizes the total error
● Example: temperature from weather data

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
...       ...          ...       ...    ...

64   65   68   69   70   71   72   72   75   75   80   81   83   85
Yes | No | Yes  Yes  Yes | No   No   Yes | Yes  Yes | No | Yes  Yes | No

10

Page 11: Basic Learning Methods: 1R, Decision

The problem of overfitting

● This procedure is very sensitive to noise
● One instance with an incorrect class label will probably produce a separate interval
● Also: a time stamp attribute will have zero errors
● Simple solution:
  ◆ enforce a minimum number of instances in the majority class per interval
● Example (with min = 3):

64   65   68   69   70   71   72   72   75   75   80   81   83   85
Yes | No | Yes  Yes  Yes | No   No   Yes | Yes  Yes | No | Yes  Yes | No

64   65   68   69   70   71   72   72   75   75   80   81   83   85
Yes  No   Yes  Yes  Yes | No   No   Yes  Yes  Yes | No   Yes  Yes  No

11
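This partitioning step can be sketched in Python as follows. It is an illustrative reading of the procedure (grow an interval until its majority class has at least min_majority instances and the current class run ends), not code from the slides.

```python
from collections import Counter

def partition_numeric(values, classes, min_majority=3):
    """Sketch of 1R discretization with overfitting avoidance: sweep the
    sorted values left to right and close an interval once its majority
    class occurs at least min_majority times and the class run ends."""
    pairs = sorted(zip(values, classes), key=lambda p: p[0])
    intervals, current = [], []
    for i, (value, cls) in enumerate(pairs):
        current.append((value, cls))
        majority, count = Counter(c for _, c in current).most_common(1)[0]
        next_cls = pairs[i + 1][1] if i + 1 < len(pairs) else None
        if count >= min_majority and next_cls != majority:
            intervals.append((current, majority))
            current = []
    if current:  # any remaining instances form the last interval
        intervals.append((current, Counter(c for _, c in current).most_common(1)[0][0]))
    return intervals

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
plays = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes",
         "Yes", "Yes", "No", "Yes", "Yes", "No"]
for interval, majority in partition_numeric(temps, plays):
    print([v for v, _ in interval], "->", majority)
# [64, 65, 68, 69, 70] -> Yes
# [71, 72, 72, 75, 75] -> Yes
# [80, 81, 83, 85] -> No
```

Adjacent intervals that end up with the same majority class can then be merged, which is how the single breakpoint at 77.5 used on the next page arises.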

Page 12: Basic Learning Methods: 1R, Decision

With overfitting avoidance

Attribute     Rules                         Errors   Total errors
Outlook       Sunny → No                    2/5      4/14
              Overcast → Yes                0/4
              Rainy → Yes                   2/5
Temperature   <= 77.5 → Yes                 3/10     5/14
              > 77.5 → No*                  2/4
Humidity      <= 82.5 → Yes                 1/7      3/14
              > 82.5 and <= 95.5 → No       2/6
              > 95.5 → Yes                  0/1
Windy         False → Yes                   2/8      5/14
              True → No*                    3/6

● The final solution –
  If Humidity is:
    ● <= 82.5, then play is yes
    ● > 82.5 and <= 95.5, then play is no
    ● > 95.5, then play is yes

12

Page 13: Basic Learning Methods: 1R, Decision

Discussion of 1R

Robert Holte, “Very Simple Classification Rules Perform Well on Most Commonly Used Datasets”, Machine Learning, 11:63-91, 1993.

● 1R was described in a paper by Holte (1993)
● Contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data)
● Minimum number of instances was set to 6 after some experimentation
● 1R’s simple rules performed not much worse than much more complex decision trees
● Simplicity first pays off!

13

Page 14: Basic Learning Methods: 1R, Decision

Decision Trees

● Found in various applications such as product recommendation
● One example – Netflix

14

Page 15: Basic Learning Methods: 1R, Decision

Decision Trees

● Decision tree
  ◆ A flow-chart-like tree structure
  ◆ Internal node denotes a test on an attribute
  ◆ Branch represents an outcome of the test
  ◆ Leaf nodes represent class labels or class distribution
● Use of decision tree: classifying an unknown sample
  ◆ Test the attribute values of the sample against the decision tree

15

Page 16: Basic Learning Methods: 1R, Decision

Decision Trees

16

age?
  <=30   → student?
             no  → no
             yes → yes
  31..40 → yes
  >40    → credit rating?
             excellent → no
             fair      → yes
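As a small illustration of how such a tree classifies an unknown sample (page 15), the tree above can be encoded as nested structures and walked. This is only a sketch; the representation is mine, and the attribute names follow the figure.

```python
# Internal nodes are (attribute, {branch value: subtree}); leaves are class labels.
tree = ("age", {
    "<=30":   ("student", {"no": "no", "yes": "yes"}),
    "31..40": "yes",
    ">40":    ("credit rating", {"excellent": "no", "fair": "yes"}),
})

def classify(node, sample):
    """Follow the branch matching the sample's attribute value until a leaf is reached."""
    while isinstance(node, tuple):
        attribute, branches = node
        node = branches[sample[attribute]]
    return node

print(classify(tree, {"age": "<=30", "student": "yes"}))              # -> yes
print(classify(tree, {"age": ">40", "credit rating": "excellent"}))   # -> no
```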

Page 17: Basic Learning Methods: 1R, Decision

Learning Decision Trees From Data

● Strategy: top down
● Recursive divide-and-conquer fashion
  ◆ First: select attribute for root node
      Create branch for each possible attribute value
  ◆ Then: split instances into subsets
      One for each branch extending from the node
  ◆ Finally: repeat recursively for each branch, using only instances that reach the branch
● Stop if all instances have the same class

17
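A compact Python sketch of this recursive divide-and-conquer strategy; the attribute-selection heuristic is passed in as a function, since the criterion for choosing it is the topic of the following slides. Names and the data representation are illustrative, not from the slides.

```python
from collections import Counter

def build_tree(instances, attributes, class_attr, select_attribute):
    """Top-down induction: pick an attribute for this node, branch on each of
    its values, and recurse on the instances that reach each branch."""
    classes = [inst[class_attr] for inst in instances]
    if len(set(classes)) == 1 or not attributes:
        return Counter(classes).most_common(1)[0][0]      # leaf: (majority) class
    attr = select_attribute(instances, attributes, class_attr)
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for value in {inst[attr] for inst in instances}:
        subset = [inst for inst in instances if inst[attr] == value]
        branches[value] = build_tree(subset, remaining, class_attr, select_attribute)
    return (attr, branches)   # internal node: test on attr, one subtree per value
```

Any selection heuristic can be plugged in for select_attribute, e.g. one based on the information gain defined later.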

Page 18: Basic Learning Methods: 1R, Decision

Attribute Selection by Information Gain Computation

attribute1  attribute2  class label
high        high        yes
high        high        yes
high        high        yes
high        low         yes
high        low         yes
high        low         yes
high        low         no
low         low         no
low         low         no
low         high        no
low         high        no
low         high        no

18

Page 19: Basic Learning Methods: 1R, Decision

Attribute Selection by Information Gain Computation

(Same data table as on page 18.)

Consider the attribute1:

attribute1   yes   no
high         6     1
low          0     5

Consider the attribute2:

attribute2   yes   no
high         3     3
low          3     3

19

Page 20: Basic Learning Methods: 1R, Decision

Attribute Selection by Information Gain Computation

(Same data table and class counts as on page 19.)

attribute1 is better than attribute2 for classification purposes!

20
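The statement above can be checked numerically. The following sketch (not from the slides) computes the expected information of the class after splitting on each attribute, which the next pages formalize as entropy and information gain.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as counts, e.g. [6, 1]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info(subsets):
    """Weighted average entropy of the subsets produced by a split."""
    total = sum(sum(s) for s in subsets)
    return sum(sum(s) / total * entropy(s) for s in subsets)

# attribute1 splits the 12 instances into [6 yes, 1 no] and [0 yes, 5 no];
# attribute2 splits them into [3 yes, 3 no] and [3 yes, 3 no].
print(round(info([[6, 1], [0, 5]]), 3))   # ~0.345 bits: fairly pure subsets
print(round(info([[3, 3], [3, 3]]), 3))   # 1.0 bits: no purer than before the split
```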

Page 21: Basic Learning Methods: 1R, Decision

Which attribute to select?

21

Page 22: Basic Learning Methods: 1R, Decision

Which attribute to select?

22

Page 23: Basic Learning Methods: 1R, Decision

Criterion for attribute selection

● Which is the best attribute?
● Want to get the smallest tree
● Heuristic: choose the attribute that produces the “purest” nodes
● Popular impurity criterion: information gain
● Information gain increases with the average purity of the subsets
● Strategy: choose attribute that gives greatest information gain

23

Page 24: Basic Learning Methods: 1R, Decision

Computing Information

● Measure information in bits
● Given a probability distribution, the info required to predict an event is the distribution’s entropy
● Entropy gives the information required in bits (can involve fractions of bits!)
● Formula for computing the entropy:

  entropy(p1, p2, ..., pn) = −p1 log(p1) − p2 log(p2) − ... − pn log(pn)

24

Page 25: Basic Learning Methods: 1R, Decision

Example: attribute Outlook

Outlook = Sunny:
  info([2,3]) = entropy(2/5, 3/5) = −(2/5) log(2/5) − (3/5) log(3/5) = 0.971 bits

Outlook = Overcast:
  info([4,0]) = entropy(1, 0) = −1 log(1) − 0 log(0) = 0 bits
  (Note: log(0) is normally undefined; the term 0 log(0) is taken to be 0 here.)

Outlook = Rainy:
  info([3,2]) = entropy(3/5, 2/5) = −(3/5) log(3/5) − (2/5) log(2/5) = 0.971 bits

Expected information for attribute:
  info([2,3], [4,0], [3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits

25

Page 26: Basic Learning Methods: 1R, Decision

Computing Information Gain

Information gain: information before splitting – information after splitting

gain(Outlook) = info([9,5]) – info([2,3], [4,0], [3,2])
              = 0.940 – 0.693
              = 0.247 bits

Information gain for attributes from weather data:

gain(Outlook)     = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity)    = 0.152 bits
gain(Windy)       = 0.048 bits

26
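The figures above can be reproduced with a short Python sketch (the helper names are mine): info takes the class counts of each subset created by the split and returns the weighted average entropy.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info(subsets):
    """Expected information after a split: weighted average entropy of the subsets."""
    total = sum(sum(s) for s in subsets)
    return sum(sum(s) / total * entropy(s) for s in subsets)

before = entropy([9, 5])                  # info([9,5]) = 0.940 bits
after  = info([[2, 3], [4, 0], [3, 2]])   # Outlook's subsets = 0.693 bits
print(round(before, 3), round(after, 3), round(before - after, 3))   # 0.94 0.693 0.247
```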

Page 27: Basic Learning Methods: 1R, Decision

Continuing to split

gain(Temperature) = 0.571 bits
gain(Humidity)    = 0.971 bits
gain(Windy)       = 0.020 bits

27

Page 28: Basic Learning Methods: 1R, Decision

Final Decision Tree

Note: not all leaves need to be pure; sometimes identical instances have different classes.
Splitting stops when data can’t be split any further.

28

Page 29: Basic Learning Methods: 1R, Decision

Wish list for a purity measure

● Properties we require from a purity measure:
  ◆ When node is pure, measure should be zero
  ◆ When impurity is maximal (i.e. all classes equally likely), measure should be maximal
  ◆ Measure should obey multistage property (i.e. decisions can be made in several stages):

      measure([2,3,4]) = measure([2,7]) + (7/9) × measure([3,4])

● Entropy is the only function that satisfies all three properties!

29

Page 30: Basic Learning Methods: 1R, Decision

Properties of the entropy

The multistage property:

  entropy(p, q, r) = entropy(p, q+r) + (q+r) × entropy(q/(q+r), r/(q+r))

Simplification of computation:

  info([2,3,4]) = −(2/9) log(2/9) − (3/9) log(3/9) − (4/9) log(4/9)
                = [−2 log(2) − 3 log(3) − 4 log(4) + 9 log(9)] / 9

Note: instead of maximizing info gain we could just minimize information

30
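As a quick numerical check of the multistage property (a sketch, not from the slides):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# entropy([2,3,4]) should equal entropy([2,7]) + (7/9) * entropy([3,4])
lhs = entropy([2, 3, 4])
rhs = entropy([2, 7]) + (7 / 9) * entropy([3, 4])
print(round(lhs, 6), round(rhs, 6))   # both ~1.530494
```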

Page 31: Basic Learning Methods: 1R, Decision

Highly-branching attributes

● Problematic: attributes with a large number of values (extreme case: ID code)
● Subsets are more likely to be pure if there is a large number of values
● Information gain is biased towards choosing attributes with a large number of values
● This may result in overfitting (selection of an attribute that is non-optimal for prediction)
● Another problem: fragmentation

31

Page 32: Basic Learning Methods: 1R, Decision

Weather data with ID code

ID code  Outlook   Temp  Humidity  Windy  Play
A        Sunny     Hot   High      False  No
B        Sunny     Hot   High      True   No
C        Overcast  Hot   High      False  Yes
D        Rainy     Mild  High      False  Yes
E        Rainy     Cool  Normal    False  Yes
F        Rainy     Cool  Normal    True   No
G        Overcast  Cool  Normal    True   Yes
H        Sunny     Mild  High      False  No
I        Sunny     Cool  Normal    False  Yes
J        Rainy     Mild  Normal    False  Yes
K        Sunny     Mild  Normal    True   Yes
L        Overcast  Mild  High      True   Yes
M        Overcast  Hot   Normal    False  Yes
N        Rainy     Mild  High      True   No

32

Page 33: Basic Learning Methods: 1R, Decision

Tree stump for ID code attribute

Entropy of split:

  info([0,1]) + info([0,1]) + ⋯ + info([0,1]) = 0 bits

Implies that information gain is maximal for ID code (namely 0.940 bits)

33

Page 34: Basic Learning Methods: 1R, Decision

Gain ratio

● Gain ratio: a modification of the information gain that reduces its bias
● Gain ratio takes number and size of branches into account when choosing an attribute
  ◆ It corrects the information gain by taking the intrinsic information of a split into account
● Intrinsic information: entropy of distribution of instances into branches (i.e. how much info do we need to tell which branch an instance belongs to)

34

Page 35: Basic Learning Methods: 1R, Decision

Computing the gain ratio

Example: intrinsic information for ID code

  intrinsic_info([1,1,...,1]) = 14 × (−(1/14) × log(1/14)) = 3.807 bits

Value of attribute decreases as intrinsic information gets larger.

Definition of gain ratio:

  gain_ratio(attribute) = gain(attribute) / intrinsic_info(attribute)

Example:

  gain_ratio("ID code") = 0.940 bits / 3.807 bits = 0.246

35
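These numbers can be verified with a few lines of Python (a sketch; the entropy helper is the same kind used earlier):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

intrinsic_info = entropy([1] * 14)   # ID code splits the data into 14 one-instance branches
gain_id_code = entropy([9, 5])       # gain = info([9,5]) - 0 = 0.940 bits (see page 33)
print(round(intrinsic_info, 3))                  # 3.807
print(round(gain_id_code / intrinsic_info, 3))   # ~0.247 (quoted as 0.246 on the slide)
```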

Page 36: Basic Learning Methods: 1R, Decision

Gain ratios for weather data

                 Outlook                     Temperature
Info:            0.693                       0.911
Gain:            0.940-0.693 = 0.247         0.940-0.911 = 0.029
Split info:      info([5,4,5]) = 1.577       info([4,6,4]) = 1.557
Gain ratio:      0.247/1.577 = 0.157         0.029/1.557 = 0.019

                 Humidity                    Windy
Info:            0.788                       0.892
Gain:            0.940-0.788 = 0.152         0.940-0.892 = 0.048
Split info:      info([7,7]) = 1.000         info([8,6]) = 0.985
Gain ratio:      0.152/1 = 0.152             0.048/0.985 = 0.049

36

Page 37: Basic Learning Methods: 1R, Decision

More on the gain ratio

● “Outlook” still comes out top
● However: “ID code” has greater gain ratio
  ◆ Standard fix: ad hoc test to prevent splitting on that type of attribute
● Problem with gain ratio: it may overcompensate
  ◆ May choose an attribute just because its intrinsic information is very low
  ◆ Standard fix: only consider attributes with greater than average information gain

37

Page 38: Basic Learning Methods: 1R, Decision

Discussion

● Top-down induction of decision trees: ID3, an algorithm developed by Ross Quinlan
  ◆ Gain ratio is just one modification of this basic algorithm
  ◆ C4.5: deals with numeric attributes, missing values, noisy data
● There are many other attribute selection criteria! (But little difference in accuracy of result)

38