
Decision Tree
CE-717: Machine Learning
Sharif University of Technology

M. Soleymani

Fall 2012

Some slides have been adapted from: Prof. Tom Mitchell


Decision tree

Approximating functions over a (usually discrete-valued) domain

The learned function is represented by a decision tree

Application: Database mining

2


PlayTennis example: Training samples

3


A PlayTennis example

Feature 1: Humidity

Feature 2: Wind

Feature 3: Outlook

Feature 4: Temperature

Output: Play Tennis

Example:

x = [low, weak, rain, hot],  y = No

4


PlayTennis example: decision tree

5


Decision tree structure

Internal node: a test on an attribute

Branch: one of the possible outcomes of the test

Leaf: a prediction of the target value y

6
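A minimal sketch of this structure in Python (the class and field names are my own; the slides do not prescribe a representation):

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    attribute: str = None        # internal node: the attribute to test
    children: dict = field(default_factory=dict)  # branch: test outcome -> subtree
    prediction: str = None       # leaf: the predicted value of y

def classify(node, x):
    """Sort an instance x (a dict of attribute values) down to a leaf."""
    while node.prediction is None:
        node = node.children[x[node.attribute]]
    return node.prediction

# A hypothetical one-test tree: split on Wind, predict at the leaves.
root = TreeNode(attribute="Wind",
                children={"Weak": TreeNode(prediction="Yes"),
                          "Strong": TreeNode(prediction="No")})
print(classify(root, {"Wind": "Weak", "Outlook": "Rain"}))  # Yes
```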


Decision tree learning:

Function approximation problem

Problem Setting:

Set of possible instances X

Unknown target function f : X → Y (Y is discrete-valued)

Set of function hypotheses H = { h | h : X → Y }

Each h is a decision tree that sorts an instance x down to a leaf, which assigns a value of y

Input:

Training examples {(xᵢ, yᵢ)} of the unknown target function f

Output:

A hypothesis h ∈ H that best approximates the target function f

7


Decision tree hypothesis space

Suppose attributes are Boolean

Disjunction of conjunctions

Which trees represent the following functions?

y = x1 AND x3

y = x1 OR x3

y = (x1 AND x3) OR (x2 AND ¬x4)

8
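As a sanity check on the first and third functions, here is one possible pair of trees written as nested conditionals in Python (the test order and function names are my own choices, not from the slides):

```python
def tree_and(x1, x3):
    """y = x1 AND x3: test x1 at the root; test x3 only on the True branch."""
    if x1:
        return bool(x3)
    return False

def tree_mixed(x1, x2, x3, x4):
    """y = (x1 AND x3) OR (x2 AND NOT x4).

    Note that the x2/x4 sub-test must appear under both outcomes of x1,
    since a tree cannot share one subtree between two branches.
    """
    if x1:
        if x3:
            return True
        return x2 and not x4   # a subtree testing x2, then x4
    return x2 and not x4       # the same subtree, duplicated

print(tree_and(True, False), tree_mixed(True, False, False, True))  # False False
```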


Decision tree as a rule base

Decision tree = a set of rules

A disjunction of conjunctions of constraints on the attribute values of instances

Each path from the root to a leaf = a conjunction of attribute tests

All of the leaves with y = i are combined to form the rule for y = i

9


How to construct a basic decision tree?

Learning an optimal decision tree is NP-Complete

Instead, we use a greedy search based on a heuristic

Which attribute at the root?

The attribute A that is the best classifier

Measure: how well the attribute splits the set into homogeneous subsets (subsets whose members share the same target value)

We prefer decisions leading to a simple, compact tree with few nodes

How to form descendants?

A descendant node is created for each possible value of A

Training examples are sorted to descendant nodes

10


Basic decision tree learning algorithm (ID3)

ID3(Examples, Attributes)

1) Create a root node r

2) If all examples have the same label, or Attributes = {} (empty), return r with the proper label

3) A ← the attribute that best classifies Examples

4) For each value of A, add a new branch below r and sort the examples accordingly

5) Recursively grow the tree under each branch

Top-down, greedy, no backtracking

11


ID3

12

•ID3(Examples, Target_Attribute, Attributes)

•Create a root node Root for the tree

•If all examples are positive, return the single-node tree Root, with label = +

•If all examples are negative, return the single-node tree Root, with label = -

•If the set of predicting attributes is empty, then

• return Root, with label = most common value of the target attribute in the examples

•else

•A ← the attribute that best classifies the examples

•Set the decision attribute for Root to A

•for each possible value 𝑣𝑖 of A

•Add a new tree branch below Root, corresponding to the test A = 𝑣𝑖

•Let Examples(𝑣𝑖) be the subset of examples that have value 𝑣𝑖 for A

•if Examples(𝑣𝑖) is empty, then

• below this new branch add a leaf node with label = most common target value in the examples

•else below this new branch add the subtree ID3(Examples(𝒗𝒊), Target_Attribute, Attributes − {A})

•return Root
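A runnable Python sketch of the pseudocode above (the function names, the dict-of-dicts tree representation, and the use of information gain for "best classifies" are my choices; information gain itself is defined on the following slides):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) over the empirical label distribution."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, target):
    """Gain(S, A): entropy before the split minus the weighted entropy after it."""
    before = entropy([ex[target] for ex in examples])
    after = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        after += len(subset) / len(examples) * entropy(subset)
    return before - after

def id3(examples, target, attributes):
    labels = [ex[target] for ex in examples]
    # Leaf: all examples share one label, or no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Pick the attribute that best classifies the examples.
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        tree[best][value] = id3(subset, target, [a for a in attributes if a != best])
    return tree
```

Each example is a plain dict such as {"Outlook": "Rain", "Wind": "Weak", "PlayTennis": "No"}; on the standard 14-example PlayTennis data, id3(examples, "PlayTennis", ["Outlook", "Temperature", "Humidity", "Wind"]) puts Outlook at the root, since it has the highest information gain there.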


Which attribute is the best?

13


Which attribute is the best?

14

A variety of heuristics for picking a good test

Information gain: originated with ID3 (Quinlan, 1979).

Gini impurity


Entropy

$H(X) = -\sum_{x_i \in X} P(x_i) \log P(x_i)$

The amount of uncertainty associated with a specific probability distribution

15
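A quick numerical check of this definition (the helper name binary_entropy is mine):

```python
from math import log2

def binary_entropy(p):
    """H(X) for a Boolean X with P(X=1) = p, using the convention 0*log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

print(binary_entropy(1.0))     # 0.0    -> a pure (homogeneous) set
print(binary_entropy(0.5))     # 1.0    -> maximum uncertainty
print(binary_entropy(9 / 14))  # ~0.940 -> e.g. a set with 9 "Yes" and 5 "No" labels
```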


Entropy

Information theory:

H(X): the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code)

The most efficient code assigns $-\log_2 P(X = i)$ bits to encode X = i

⇒ the expected number of bits to encode one random value of X is H(X)

16


Entropy for a Boolean variable

[Figure: $H(X)$ plotted against $P(X = 1)$ for a Boolean variable]

If $P(X = 1) = 1$:  $H(X) = -1 \log_2 1 - 0 \log_2 0 = 0$

If $P(X = 1) = 0.5$:  $H(X) = -0.5 \log_2 \tfrac{1}{2} - 0.5 \log_2 \tfrac{1}{2} = 1$

Entropy as a measure of impurity

17


Conditional entropy

$H(Y|X) = -\sum_i \sum_j P(X = i, Y = j) \log P(Y = j \mid X = i)$

$H(Y|X) = \sum_i P(X = i) \left[ -\sum_j P(Y = j \mid X = i) \log P(Y = j \mid X = i) \right]$

$P(X = i)$: the probability of following the $i$-th branch for $X$

The bracketed term: the entropy of $Y$ for the samples sorted to the $i$-th branch

18


Mutual Information

$I(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$

The expected reduction in entropy caused by partitioning the examples according to this attribute

$H(Y)$: entropy of $Y$ before the split

$H(Y|X)$: entropy of $Y$ after the split on $X$

19


Information Gain

$Gain(S, A) = I_S(A, Y) = H_S(Y) - H_S(Y|A)$

$Y$: target variable

$S$: samples

$A$: the attribute used to sort the samples

$Gain(S, A) \equiv H_S(Y) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} H_{S_v}(Y)$

20


Information Gain: Example

21
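The example figure itself is not in the transcript; as a stand-in, here is the classic worked computation for the Wind attribute, using the class counts of the standard 14-example PlayTennis data (9 Yes / 5 No overall; the Weak branch holds 6 Yes / 2 No, the Strong branch 3 Yes / 3 No). Whether this matches the original slide is an assumption.

```python
from math import log2

def H(pos, neg):
    """Entropy of a two-class sample given its class counts."""
    total = pos + neg
    return sum(-c / total * log2(c / total) for c in (pos, neg) if c)

gain_wind = H(9, 5) - (8 / 14) * H(6, 2) - (6 / 14) * H(3, 3)
print(round(H(9, 5), 3), round(gain_wind, 3))  # 0.94 0.048
```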


How to find the best classifier attribute?

Information gain is our criterion for a good split:

choose the attribute that maximizes information gain at each node

Choose the $j$-th attribute:  $j = \arg\max_i IG(X_i) = \arg\max_i \left[ H(Y) - H(Y|X_i) \right]$

22
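A few-line sketch of this selection rule; the gain values are the approximate root-split gains usually quoted for PlayTennis (Mitchell) and would normally be computed, e.g. with the id3 sketch earlier, rather than typed in:

```python
gains = {"Outlook": 0.246, "Humidity": 0.151, "Wind": 0.048, "Temperature": 0.029}
best = max(gains, key=gains.get)   # argmax over information gains
print(best)  # Outlook
```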


Example

23


Decision tree algorithm

24

The algorithm

either reaches homogeneous nodes

or runs out of attributes

Guaranteed to find a tree consistent with any conflict-free training set

Conflict-free training set: identical feature vectors are always assigned the same class

But not necessarily find the simplest tree.


How is the instance space partitioned?

25

Decision tree

Partitions the instance space into axis-parallel regions, each labeled with a class value

Source: Duda & Hart's book


26

Bias in decision tree induction

Information gain gives a bias toward trees of minimal depth.

It implements a search (preference) bias rather than a language (restriction) bias.


Which Tree Should We Output?

ID3: heuristic search through space of DTs

Stops at smallest acceptable tree

Occam’s razor: prefer the simplest hypothesis that fits the data

27

Ockham (1285-1349), Principle of Parsimony:

“One should not increase, beyond what is necessary, the number of entities required to explain anything.”


Why prefer short hypotheses? (Occam’s Razor)

There are fewer short hypotheses than long ones:

a short hypothesis that fits the data is less likely to be a statistical coincidence

it is highly probable that a sufficiently complex hypothesis will fit the data merely by chance

28


Over-fitting problem

ID3 usually perfectly classifies training data

It tries to memorize every training example

It makes poor decisions when there is very little data, which may not reflect reliable trends

A node that “should” be pure but has a single exception?

Noise in the training data: the tree erroneously fits the noise

With many (irrelevant) attributes, the algorithm will continue to split nodes

This leads to over-fitting!

29


Over-fitting problem: an example

Consider adding a (noisy) training example:

How does the tree change?

Outlook  Temperature  Humidity  Wind    PlayTennis
Sunny    Hot          Normal    Strong  No

30


General definition of overfitting

Hypothesis $h \in H$

Training error: $error_{train}(h)$

Expected error: $error_{true}(h)$

$h$ overfits the training data if there is an $h' \in H$ such that

$error_{train}(h) < error_{train}(h')$  and  $error_{true}(h) > error_{true}(h')$

31


Over-fitting in decision tree learning

32


A question?

33

How can the tree be made smaller and simpler?

When should a node be declared a leaf?

If a leaf node is impure, how should the category label be assigned?

Pruning?


Avoiding overfitting

Stop growing the tree when the data split is not statistically significant.

Grow the full tree and then prune it.

In practice, more successful than stopping growth early.

How to select the “best” tree:

Measure performance over a separate validation set

MDL: minimize $size(tree) + size(misclassifications(tree))$

34


Reduced-error pruning

Split data into train and validation set

Build tree using training set

Do until further pruning is harmful:

Evaluate the impact on the validation set of pruning the sub-tree rooted at each node:

Temporarily remove the sub-tree rooted at the node

Replace it with a leaf labeled with the current majority class at that node

Measure and record the error on the validation set

Greedily remove the sub-tree whose removal most improves validation-set accuracy (if any).

35

Produces smallest version of the most accurate sub-tree.
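A sketch of this procedure in Python. It assumes a node representation in which every node stores the majority class of the training examples that reached it, so that "replace a sub-tree with a leaf" amounts to ignoring the node's children; the class and function names are mine, not from the slides.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PrunableNode:
    label: str                                    # majority class at this node
    attribute: Optional[str] = None               # None means the node is a leaf
    children: dict = field(default_factory=dict)  # attribute value -> PrunableNode

def predict(node, example):
    while node.attribute is not None:
        child = node.children.get(example.get(node.attribute))
        if child is None:          # unseen value: fall back to the majority class
            break
        node = child
    return node.label

def accuracy(root, examples, target):
    return sum(predict(root, ex) == ex[target] for ex in examples) / len(examples)

def internal_nodes(node):
    if node.attribute is not None:
        yield node
        for child in node.children.values():
            yield from internal_nodes(child)

def reduced_error_prune(root, validation, target):
    while True:
        best_node, best_acc = None, accuracy(root, validation, target)
        for node in internal_nodes(root):
            saved, node.attribute = node.attribute, None   # temporarily make it a leaf
            acc = accuracy(root, validation, target)
            node.attribute = saved                         # restore the sub-tree
            if acc >= best_acc:      # pruning that is no worse still shrinks the tree
                best_node, best_acc = node, acc
        if best_node is None:        # every pruning hurts validation accuracy: stop
            return root
        best_node.attribute = None   # prune permanently
        best_node.children = {}
```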


Rule post-pruning

Convert tree to equivalent set of rules

Prune each rule independently of others

Sort final rules into desired sequence for use

Perhaps most frequently used method (e.g., C4.5)

36


Continuous attributes

Tests on continuous variables as Boolean tests?

Either use a threshold to turn the variable into a binary test, or discretize it

It is possible to compute the information gain for all possible thresholds (there are a finite number of training samples)

Harder if we wish to assign more than two values (can be done recursively)

37
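A sketch of the threshold scan (helper names and the sample numbers are my own illustration). Candidate thresholds are midpoints between consecutive sorted values, so only finitely many splits need to be checked:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) maximizing the gain of the test value <= threshold."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        gain = base - len(left) / len(pairs) * entropy(left) \
                    - len(right) / len(pairs) * entropy(right)
        if gain > best[1]:
            best = (t, gain)
    return best

temperature = [40, 48, 60, 72, 80, 90]
play        = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temperature, play))  # picks the threshold 54.0 (gain ~0.459)
```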


38

More issues on decision trees

Better splitting criteria

Information gain prefers features with many values.

Missing feature values

Features with costs

Misclassification costs

Incremental learning (ID4, ID5)

Regression trees


Ranking classifiers

Rich Caruana & Alexandru Niculescu-Mizil, “An Empirical Comparison of Supervised Learning Algorithms”, ICML 2006

The top 8 are all based on various extensions of decision trees

39


Decision tree advantages

40

Simple to understand and interpret

Requires little data preparation

Can handle both numerical and categorical data

Performs well with large data in a short time

Robust: performs well even if its assumptions are somewhat violated