
Decision Tree
CE-717: Machine Learning
Sharif University of Technology

M. Soleymani

Fall 2012

Some slides have been adapted from: Prof. Tom Mitchell


Decision tree

Approximating functions over a (usually discrete-valued) domain

The learned function is represented by a decision tree

Application: Database mining

2


PlayTennis example: Training samples

3


A PlayTennis example

Feature 1: Humidity

Feature 2: Wind

Feature 3: Outlook

Feature 4: Temperature

Output: Play Tennis

Example:

x = [low, weak, rain, hot],  y = No

4


PlayTennis example: decision tree

5


Decision tree structure

Internal node: a test on an attribute

Branch: one of the possible outcomes of the test

Leaf: a prediction of the target value y

6
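A minimal sketch of this structure in Python (the class and field names are my own; the slides do not prescribe a representation):

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    attribute: str = None        # internal node: the attribute to test
    children: dict = field(default_factory=dict)  # branch: test outcome -> subtree
    prediction: str = None       # leaf: the predicted value of y

def classify(node, x):
    """Sort an instance x (a dict of attribute values) down to a leaf."""
    while node.prediction is None:
        node = node.children[x[node.attribute]]
    return node.prediction

# A hypothetical one-test tree: split on Wind, predict at the leaves.
root = TreeNode(attribute="Wind",
                children={"Weak": TreeNode(prediction="Yes"),
                          "Strong": TreeNode(prediction="No")})
print(classify(root, {"Wind": "Weak", "Outlook": "Rain"}))  # Yes
```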


Decision tree learning:

Function approximation problem

Problem Setting:

Set of possible instances X

Unknown target function f : X → Y (Y is discrete-valued)

Set of function hypotheses H = { h | h : X → Y }

Each h is a decision tree that sorts an instance x down to a leaf, which assigns a value of y

Input:

Training examples {(xᵢ, yᵢ)} of the unknown target function f

Output:

A hypothesis h ∈ H that best approximates the target function f

7


Decision tree hypothesis space

Suppose attributes are Boolean

Disjunction of conjunctions

Which trees represent the following functions?

y = x1 AND x3

y = x1 OR x3

y = (x1 AND x3) OR (x2 AND ¬x4)

8
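As a sanity check on the first and third functions, here is one possible pair of trees written as nested conditionals in Python (the test order and function names are my own choices, not from the slides):

```python
def tree_and(x1, x3):
    """y = x1 AND x3: test x1 at the root; test x3 only on the True branch."""
    if x1:
        return bool(x3)
    return False

def tree_mixed(x1, x2, x3, x4):
    """y = (x1 AND x3) OR (x2 AND NOT x4).

    Note that the x2/x4 sub-test must appear under both outcomes of x1,
    since a tree cannot share one subtree between two branches.
    """
    if x1:
        if x3:
            return True
        return x2 and not x4   # a subtree testing x2, then x4
    return x2 and not x4       # the same subtree, duplicated

print(tree_and(True, False), tree_mixed(True, False, False, True))  # False False
```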


Decision tree as a rule base

Decision tree = a set of rules

A disjunction of conjunctions of constraints on the attribute values of instances

Each path from the root to a leaf = a conjunction of attribute tests

All of the leaves with y = i are combined to form the rule for y = i

9


How to construct a basic decision tree?

Learning an optimal decision tree is NP-Complete

Instead, we use a greedy search based on a heuristic

Which attribute at the root?

The attribute A that is the best classifier

Measure: how well the attribute splits the set into homogeneous subsets (subsets whose members share the same target value)

We prefer decisions leading to a simple, compact tree with few nodes

How to form descendants?

A descendant node is created for each possible value of A

Training examples are sorted to descendant nodes

10


Basic decision tree learning algorithm (ID3)

ID3(Examples, Attributes)

1) Create a root node r

2) If all examples have the same label, or Attributes = {} (empty), return r with the proper label

3) A ← the attribute that best classifies Examples

4) For each value of A, add a new branch below r and sort the examples accordingly

5) Recursively grow the tree under each branch

Top-down, greedy, no backtracking

11


ID3

12

•ID3(Examples, Target_Attribute, Attributes)

•Create a root node Root for the tree

•If all examples are positive, return the single-node tree Root, with label = +

•If all examples are negative, return the single-node tree Root, with label = -

•If the set of predicting attributes is empty, then

• return Root, with label = most common value of the target attribute in the examples

•else

•A ← the attribute that best classifies the examples

•Set the decision attribute for Root to A

•for each possible value 𝑣𝑖 of A

•Add a new tree branch below Root, corresponding to the test A = 𝑣𝑖

•Let Examples(𝑣𝑖) be the subset of examples that have value 𝑣𝑖 for A

•if Examples(𝑣𝑖) is empty, then

• below this new branch add a leaf node with label = most common target value in the examples

•else below this new branch add the subtree ID3(Examples(𝒗𝒊), Target_Attribute, Attributes − {A})

•return Root
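A runnable Python sketch of the pseudocode above (the function names, the dict-of-dicts tree representation, and the use of information gain for "best classifies" are my choices; information gain itself is defined on the following slides):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) over the empirical label distribution."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, target):
    """Gain(S, A): entropy before the split minus the weighted entropy after it."""
    before = entropy([ex[target] for ex in examples])
    after = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        after += len(subset) / len(examples) * entropy(subset)
    return before - after

def id3(examples, target, attributes):
    labels = [ex[target] for ex in examples]
    # Leaf: all examples share one label, or no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Pick the attribute that best classifies the examples.
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        tree[best][value] = id3(subset, target, [a for a in attributes if a != best])
    return tree
```

Each example is a plain dict such as {"Outlook": "Rain", "Wind": "Weak", "PlayTennis": "No"}; on the standard 14-example PlayTennis data, id3(examples, "PlayTennis", ["Outlook", "Temperature", "Humidity", "Wind"]) puts Outlook at the root, since it has the highest information gain there.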


Which attribute is the best?

13


Which attribute is the best?

14

A variety of heuristics for picking a good test

Information gain: originated with ID3 (Quinlan, 1979).

Gini impurity


Entropy

$H(X) = -\sum_{x_i \in X} P(x_i) \log P(x_i)$

The amount of uncertainty associated with a specific probability distribution

15
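A quick numerical check of this definition (the helper name binary_entropy is mine):

```python
from math import log2

def binary_entropy(p):
    """H(X) for a Boolean X with P(X=1) = p, using the convention 0*log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

print(binary_entropy(1.0))     # 0.0    -> a pure (homogeneous) set
print(binary_entropy(0.5))     # 1.0    -> maximum uncertainty
print(binary_entropy(9 / 14))  # ~0.940 -> e.g. a set with 9 "Yes" and 5 "No" labels
```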


Entropy

Information theory:

H(X): the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code)

The most efficient code assigns $-\log_2 P(X = i)$ bits to encode X = i

⇒ the expected number of bits to encode one random value of X is H(X)

16


Entropy for a Boolean variable

[Figure: $H(X)$ plotted against $P(X = 1)$ for a Boolean variable]

If $P(X = 1) = 1$:  $H(X) = -1 \log_2 1 - 0 \log_2 0 = 0$

If $P(X = 1) = 0.5$:  $H(X) = -0.5 \log_2 \tfrac{1}{2} - 0.5 \log_2 \tfrac{1}{2} = 1$

Entropy as a measure of impurity

17


Conditional entropy

$H(Y|X) = -\sum_i \sum_j P(X = i, Y = j) \log P(Y = j \mid X = i)$

$H(Y|X) = \sum_i P(X = i) \left[ -\sum_j P(Y = j \mid X = i) \log P(Y = j \mid X = i) \right]$

$P(X = i)$: the probability of following the $i$-th branch for $X$

The bracketed term: the entropy of $Y$ for the samples sorted to the $i$-th branch

18


Mutual Information

$I(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$

The expected reduction in entropy caused by partitioning the examples according to this attribute

$H(Y)$: entropy of $Y$ before the split

$H(Y|X)$: entropy of $Y$ after the split on $X$

19


Information Gain

$Gain(S, A) = I_S(A, Y) = H_S(Y) - H_S(Y|A)$

$Y$: target variable

$S$: samples

$A$: the attribute used to sort the samples

$Gain(S, A) \equiv H_S(Y) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} H_{S_v}(Y)$

20


Information Gain: Example

21
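The example figure itself is not in the transcript; as a stand-in, here is the classic worked computation for the Wind attribute, using the class counts of the standard 14-example PlayTennis data (9 Yes / 5 No overall; the Weak branch holds 6 Yes / 2 No, the Strong branch 3 Yes / 3 No). Whether this matches the original slide is an assumption.

```python
from math import log2

def H(pos, neg):
    """Entropy of a two-class sample given its class counts."""
    total = pos + neg
    return sum(-c / total * log2(c / total) for c in (pos, neg) if c)

gain_wind = H(9, 5) - (8 / 14) * H(6, 2) - (6 / 14) * H(3, 3)
print(round(H(9, 5), 3), round(gain_wind, 3))  # 0.94 0.048
```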


How to find the best classifier attribute?

Information gain is our criterion for a good split:

choose the attribute that maximizes information gain at each node

Choose the $j$-th attribute:  $j = \arg\max_i IG(X_i) = \arg\max_i \left[ H(Y) - H(Y|X_i) \right]$

22
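A few-line sketch of this selection rule; the gain values are the approximate root-split gains usually quoted for PlayTennis (Mitchell) and would normally be computed, e.g. with the id3 sketch earlier, rather than typed in:

```python
gains = {"Outlook": 0.246, "Humidity": 0.151, "Wind": 0.048, "Temperature": 0.029}
best = max(gains, key=gains.get)   # argmax over information gains
print(best)  # Outlook
```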


Example

23


Decision tree algorithm

24

The algorithm

either reaches homogeneous nodes

or runs out of attributes

Guaranteed to find a tree consistent with any conflict-free training set

Conflict-free training set: identical feature vectors are always assigned the same class

But not necessarily find the simplest tree.


How is the instance space partitioned?

25

Decision tree

Partitions the instance space into axis-parallel regions, each labeled with a class value

Source: Duda & Hart's book


26

Bias in decision tree induction

Information gain gives a bias toward trees of minimal depth.

It implements a search (preference) bias rather than a language (restriction) bias.


Which Tree Should We Output?

ID3: heuristic search through space of DTs

Stops at smallest acceptable tree

Occam’s razor: prefer the simplest hypothesis that fits the data

27

Ockham (1285-1349), Principle of Parsimony:

“One should not increase, beyond what is necessary, the number of entities required to explain anything.”


Why prefer short hypotheses? (Occam’s Razor)

There are fewer short hypotheses than long ones:

a short hypothesis that fits the data is less likely to be a statistical coincidence

it is highly probable that a sufficiently complex hypothesis will fit the data merely by chance

28


Over-fitting problem

ID3 usually perfectly classifies training data

It tries to memorize every training example

It makes poor decisions when there is very little data, which may not reflect reliable trends

A node that “should” be pure but has a single exception?

Noise in the training data: the tree erroneously fits the noise

With many (irrelevant) attributes, the algorithm will continue to split nodes

This leads to over-fitting!

29


Over-fitting problem: an example

Consider adding a (noisy) training example:

How does the tree change?

Outlook  Temperature  Humidity  Wind    PlayTennis
Sunny    Hot          Normal    Strong  No

30


General definition of overfitting

Hypothesis $h \in H$

Training error: $error_{train}(h)$

Expected error: $error_{true}(h)$

$h$ overfits the training data if there is an $h' \in H$ such that

$error_{train}(h) < error_{train}(h')$  and  $error_{true}(h) > error_{true}(h')$

31


Over-fitting in decision tree learning

32


A question?

33

How can the tree be made smaller and simpler?

When should a node be declared a leaf?

If a leaf node is impure, how should the category label be assigned?

Pruning?


Avoiding overfitting

Stop growing the tree when the data split is not statistically significant.

Grow the full tree and then prune it.

In practice, more successful than stopping growth early.

How to select the “best” tree:

Measure performance over a separate validation set

MDL: minimize $size(tree) + size(misclassifications(tree))$

34


Reduced-error pruning

Split data into train and validation set

Build tree using training set

Do until further pruning is harmful:

Evaluate the impact on the validation set of pruning the sub-tree rooted at each node:

Temporarily remove the sub-tree rooted at the node

Replace it with a leaf labeled with the current majority class at that node

Measure and record the error on the validation set

Greedily remove the sub-tree whose removal most improves validation-set accuracy (if any).

35

Produces smallest version of the most accurate sub-tree.
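A sketch of this procedure in Python. It assumes a node representation in which every node stores the majority class of the training examples that reached it, so that "replace a sub-tree with a leaf" amounts to ignoring the node's children; the class and function names are mine, not from the slides.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PrunableNode:
    label: str                                    # majority class at this node
    attribute: Optional[str] = None               # None means the node is a leaf
    children: dict = field(default_factory=dict)  # attribute value -> PrunableNode

def predict(node, example):
    while node.attribute is not None:
        child = node.children.get(example.get(node.attribute))
        if child is None:          # unseen value: fall back to the majority class
            break
        node = child
    return node.label

def accuracy(root, examples, target):
    return sum(predict(root, ex) == ex[target] for ex in examples) / len(examples)

def internal_nodes(node):
    if node.attribute is not None:
        yield node
        for child in node.children.values():
            yield from internal_nodes(child)

def reduced_error_prune(root, validation, target):
    while True:
        best_node, best_acc = None, accuracy(root, validation, target)
        for node in internal_nodes(root):
            saved, node.attribute = node.attribute, None   # temporarily make it a leaf
            acc = accuracy(root, validation, target)
            node.attribute = saved                         # restore the sub-tree
            if acc >= best_acc:      # pruning that is no worse still shrinks the tree
                best_node, best_acc = node, acc
        if best_node is None:        # every pruning hurts validation accuracy: stop
            return root
        best_node.attribute = None   # prune permanently
        best_node.children = {}
```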


Rule post-pruning

Convert tree to equivalent set of rules

Prune each rule independently of others

Sort final rules into desired sequence for use

Perhaps most frequently used method (e.g., C4.5)

36


Continuous attributes

Tests on continuous variables as Boolean tests?

Either use a threshold to turn the variable into a binary test, or discretize it

It is possible to compute the information gain for all possible thresholds (there are a finite number of training samples)

Harder if we wish to assign more than two values (can be done recursively)

37
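A sketch of the threshold scan (helper names and the sample numbers are my own illustration). Candidate thresholds are midpoints between consecutive sorted values, so only finitely many splits need to be checked:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) maximizing the gain of the test value <= threshold."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        gain = base - len(left) / len(pairs) * entropy(left) \
                    - len(right) / len(pairs) * entropy(right)
        if gain > best[1]:
            best = (t, gain)
    return best

temperature = [40, 48, 60, 72, 80, 90]
play        = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temperature, play))  # picks the threshold 54.0 (gain ~0.459)
```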


38

More issues on decision trees

Better splitting criteria

Information gain prefers features with many values.

Missing feature values

Features with costs

Misclassification costs

Incremental learning (ID4, ID5)

Regression trees


Ranking classifiers

Rich Caruana & Alexandru Niculescu-Mizil, “An Empirical Comparison of Supervised Learning Algorithms”, ICML 2006

The top 8 are all based on various extensions of decision trees

39


Decision tree advantages

40

Simple to understand and interpret

Requires little data preparation

Can handle both numerical and categorical data

Performs well with large data in a short time

Robust: performs well even if its assumptions are somewhat violated