TRANSCRIPT

Introduction to Artificial Intelligence
COMP307 Machine Learning 3 – Decision Tree Learning Method
Yi Mei (yi.mei@ecs.vuw.ac.nz)
Outline
• Decision tree learning vs learned decision trees
• How to build a decision tree using a set of instances
• How to measure a node in a decision tree: (im)purity measures
• Design issues of DT learning
Decision Tree
• A tree-like model for making decisions
Decision Trees vs DT Learning
• A Decision Tree (DT) is a classifier
  – Symbolic representation, not probabilistic
  – Essentially a rule
  – “Easy” to interpret
• DT learning is a learning process
  – To find a DT: the output/solution is a DT
  – One of the oldest classification learning methods in AI
  – Also developed independently in Statistics/Operations Research
Example (Training) Dataset
• Approve/Reject a loan application?
Applicant Job Deposit Family Class
1 true low single Approve
2 true low couple Approve
3 true low single Approve
4 true high single Approve
5 false high couple Approve
6 false low couple Reject
7 true low children Reject
8 false low single Reject
9 false high children Reject
Example DT
• An example DT:

[Figure: an example decision tree over the loan dataset, with feature nodes Job, Deposit, and Family, edges labelled true/false, high/low, and single/couple/children, and Approve/Reject leaves.]
Building Decision Trees
• You can always build a decision tree trivially:
  – Choose some order on the attributes
  – Build the tree with one attribute per level
  – Label each leaf with the appropriate class

[Figure: a full tree over attributes A, B, C, D, one attribute per level, with class labels X, Y, Z at the leaves.]

• Problems:
  – Each leaf represents a possible instance
  – All we are doing is remembering every instance: no generalisation, no prediction, no learning

• Solution:
  – Find a small decision tree
  – Capture the common features of instances
  – Probably generalise to predict classes for unseen instances

[Figure: a small decision tree over the same attributes, with leaves X, Y, Z, that covers the same instances.]
Building A Good Decision Tree
• Input: instances described by attribute-value pairs
• Output: a “good” decision tree classifier
  – Critical issue: choosing which attribute to use next

• DT algorithm:
    Examine the set of instances in the root node
    If the set is "pure" enough, or there are no more attributes,
        then stop
    Else
        Construct subsets of instances in the subnodes
        Compute the average "purity" of the subnodes
        Choose the best attribute
        Recurse on each subnode
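The algorithm above can be sketched in Python. This is a minimal two-class sketch, not the lecture's implementation: the names (`build_tree`, `majority_class`, `classify`) and the dict-based instance encoding are illustrative assumptions.

```python
from collections import Counter

def impurity(labels):
    """P(A)*P(B) impurity for a two-class label list (0 for a pure node)."""
    total = len(labels)
    p = [c / total for c in Counter(labels).values()]
    return p[0] * p[1] if len(p) == 2 else 0.0

def majority_class(labels):
    """Most common class label in the set."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(instances, labels, features):
    # Stop: set is pure, or no attributes remain -> leaf with majority class
    if impurity(labels) == 0.0 or not features:
        return majority_class(labels)

    # Goodness of a feature: weighted average impurity of its subnodes
    def goodness(f):
        total = len(labels)
        score = 0.0
        for v in set(inst[f] for inst in instances):
            sub = [lab for inst, lab in zip(instances, labels) if inst[f] == v]
            score += (len(sub) / total) * impurity(sub)
        return score

    # Choose the best attribute and recurse on each subnode
    best = min(features, key=goodness)
    children = {}
    for v in set(inst[best] for inst in instances):
        sub_x = [inst for inst in instances if inst[best] == v]
        sub_y = [lab for inst, lab in zip(instances, labels) if inst[best] == v]
        children[v] = build_tree(sub_x, sub_y, [f for f in features if f != best])
    return (best, children)

def classify(tree, inst):
    """Follow feature tests until a leaf (a plain class label) is reached."""
    while isinstance(tree, tuple):
        feature, children = tree
        tree = children[inst[feature]]   # KeyError for values never seen in training
    return tree
```

On the loan dataset, this sketch reproduces every training label, since no two identical feature vectors carry different classes.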
Decision Tree Building
• Choose the best attribute. Which is best?

[Figure: three candidate root splits for the loan dataset: Job (true/false), Deposit (high/low), and Family (single/couple/children).]
Building a DT
• A simple way:
  – Start from a root; select a feature at the root to be branched
  – Grow the tree depth-first or breadth-first (or in any other order)
  – For each node, add one child node for each possible feature value
  – A child node can be a feature node or a class (leaf node)

• Node: a feature (internal node) or a class (leaf node)
• Edge: a value of the parent node’s feature

• Cannot support numeric features (infinitely many possible values)
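One possible encoding of this node/edge structure, using a hypothetical `Node` class (the field names are our own, not from the lecture): internal nodes store the feature being tested, leaves store a class label, and each edge is a dictionary key holding a value of the parent's feature.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    feature: Optional[str] = None   # feature tested at this node (None for a leaf)
    label: Optional[str] = None     # class label (only set for a leaf node)
    children: dict = field(default_factory=dict)  # feature value -> child Node

# A root testing Job, with each edge labelled by a value of the parent's feature:
root = Node(feature="Job",
            children={"true": Node(label="Approve"),
                      "false": Node(label="Reject")})
```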
Building a DT
• Input: instances/samples
• Output: a “good” decision tree classifier

• A decision tree progressively splits the training set into smaller subsets
• Pure node: all the samples at that node have the same class label
• No need to further split a pure node
• Recursive tree-growing process: given the data at a node, either declare the node a leaf node or find another feature test to split the node
Building a DT
• Start:
  – Build a tree with a single root node
  – The entire training set is at the root node

• Repeat until no split is needed:
  – Select a node from the tree to examine
  – Check the purity of the set at the examined node
  – If the set is pure enough: make the node a leaf node labelled with the majority class
  – Else: design a feature test to expand the node and split the set
Example (Training) Dataset
• Approve/Reject a loan application?
Applicant Job Deposit Family Class
1 true low single Approve
2 true low couple Approve
3 true low single Approve
4 true high single Approve
5 false high couple Approve
6 false low couple Reject
7 true low children Reject
8 false low single Reject
9 false high children Reject
Design Issues
• Should the features (answers to questions) be binary or multivalued? In other words, how many splits should be made at a node?
• Which feature or feature combination should be tested at a node?
• When should a node be declared a leaf node?
• If the tree becomes “too large”, can it be pruned to make it smaller and simpler?
• If a leaf node is impure, how should a category label be assigned to it?
Number of Splits
• Binary: every question has a True/False answer
Feature Test Design
• Which feature/attribute should be used in the feature test?
• Greedy design: the question should make the child nodes as pure as possible
• Node (im)purity can be defined in different ways:
  – Probability based
  – Information theory based
Entropy Based Discretisation
• Entropy measures the impurity or uncertainty in a group of examples
• S is the training set, C1, …, CN the classes
• E(S) measures the entropy of S, where Pc is the proportion of class Cc in S:

  E(S) = -Σc Pc log2(Pc)

• A very impure group (e.g. a 50/50 mixture) has high entropy; a less impure group has lower entropy; a pure group has null entropy
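The entropy formula above as a short Python function (the function name is ours):

```python
import math
from collections import Counter

def entropy(labels):
    """E(S) = -sum_c Pc * log2(Pc), where Pc is the proportion of class c in S."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

# A pure group has null entropy; a 50/50 two-class group has the maximum, 1 bit.
```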
Node (Im)purity Measure
• Assume there are two classes, A and B
• At a node: m instances of class A, n instances of class B
• Impurity: imp = P(A) × P(B) = (m / (m + n)) × (n / (m + n)) = mn / (m + n)²
  – If pure (m = 0 or n = 0): imp = 0
  – If m = n: imp is maximal
  – Smooth
• The smaller the better
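The measure above in code (function name ours), showing the properties listed:

```python
def imp(m, n):
    """P(A)*P(B) impurity for a node with m class-A and n class-B instances."""
    return (m * n) / (m + n) ** 2 if m + n > 0 else 0.0

# Pure nodes (m = 0 or n = 0) score 0; a balanced node (m = n) scores the
# maximum of 0.25; the measure varies smoothly in between.
```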
Measuring Purity
• Need a measure of how “pure” a node is:
  – All one class → pure → can predict the class
  – Mixture of classes → impure → have to ask more questions
• Several functions:
  – Probability based
  – Information theory based
  – …
• Choose the attribute whose children have the best purity
(Im)Purity Measure: P(A)P(B)

[Figure: three example nodes: one of all class A, one of all class B, and a mixed node with 7 A’s and 3 B’s.]

• Impurity: P(A)P(B) = (m / (m + n)) × (n / (m + n)) = mn / (m + n)², where m is the number of A’s and n the number of B’s
• Goodness of an attribute: average impurity of its subnodes
Weighting the Impurities
• How do we take the average?

[Figure: Att 1 splits the data into two subnodes with impurities 4/5 × 1/5 = 16% and 1/5 × 4/5 = 16% (average 16%); Att 2 splits it into a pure subnode (1/1 × 0/1 = 0%) and a subnode with impurity 4/9 × 5/9 = 24.6% (unweighted average 12.3%). Which average is right?]

• Need to weight the nodes by the probability of going to each node
Weighting the Impurities (Continued)

[Figure: a true/false split; the true subnode has P(node) = 7/10 and impurity 4/7 × 3/7 = 12/49, the false subnode has P(node) = 3/10 and impurity 2/3 × 1/3 = 2/9.]

• Goodness of attribute = weighted average impurity of subnodes
  = Σi [P(nodei) × impurity(nodei)]
  = (7/10 × 12/49) + (3/10 × 2/9) = 84/490 + 6/90 ≈ 0.238
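The weighted-average calculation above can be checked directly; the (A-count, B-count) pairs below encode the two subnodes from the figure:

```python
def imp(m, n):
    """P(A)*P(B) impurity for a node with m class-A and n class-B instances."""
    return (m * n) / (m + n) ** 2

subnodes = [(4, 3), (2, 1)]              # true node: 4 A's, 3 B's; false node: 2 and 1
total = sum(m + n for m, n in subnodes)  # 10 instances reach the parent
goodness = sum(((m + n) / total) * imp(m, n) for m, n in subnodes)
# (7/10)*(12/49) + (3/10)*(2/9) = 0.238 to three decimal places
```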
Feature Test Design
• Goodness of a feature test: average impurity of the child nodes

FT1:
  Impurity(node1) = 1/5 × 4/5 = 4/25
  Impurity(node2) = 4/5 × 1/5 = 4/25
  Average impurity = 4/25 = 16%
Feature Test Design
• Goodness of a feature test: average impurity of the child nodes

FT2:
  Impurity(node1) = 1/1 × 0/1 = 0
  Impurity(node2) = 4/9 × 5/9 = 20/81
  Average impurity = 10/81 ≈ 12.3%

• So FT2 is better than FT1?
Feature Test Design
• Weight the impurities by the probability of reaching each node:
  – Weighted average impurity = Σi P(nodei) × impurity(nodei)

FT1:
  P(node1) = 0.5, Impurity(node1) = 1/5 × 4/5 = 4/25
  P(node2) = 0.5, Impurity(node2) = 4/5 × 1/5 = 4/25
  Weighted average impurity = 4/25 = 16%
Feature Test Design
• Weight the impurities by the probability of reaching each node:
  – Weighted average impurity = Σi P(nodei) × impurity(nodei)

FT2:
  P(node1) = 0.1, Impurity(node1) = 1/1 × 0/1 = 0
  P(node2) = 0.9, Impurity(node2) = 4/9 × 5/9 = 20/81
  Weighted average impurity = 0.9 × 20/81 ≈ 22.2%

• FT1 is better than FT2 after weighting
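A quick numeric check of the comparison above:

```python
# Unweighted averages favour FT2 (12.3% < 16%), but once each subnode is
# weighted by the probability of reaching it, FT1 wins (16% < 22.2%).
ft1_weighted = 0.5 * (4 / 25) + 0.5 * (4 / 25)   # = 4/25  = 16%
ft2_weighted = 0.1 * 0.0 + 0.9 * (20 / 81)       # = 18/81 ≈ 22.2%
```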
Numeric Features
• Approve/Reject a loan application?
Applicant Job Deposit Family Class
1 true $10K single Approve
2 true $7K couple Approve
3 true $4K single Approve
4 true $16K single Approve
5 false $18K couple Approve
6 false $6K couple Reject
7 true $8K children Reject
8 false $3K single Reject
9 false $30K children Reject
What question to ask about Deposit?
Numeric Features
• Can split on a simple comparison, e.g. Deposit < $10K (True/False)
  – Which split point?
  – Consider class boundaries
Applicant Job Deposit Family Class
8 false $3K single Reject
3 true $4K single Approve
6 false $6K couple Reject
2 true $7K couple Approve
7 true $8K children Reject
1 true $10K single Approve
4 true $16K single Approve
5 false $18K couple Approve
9 false $30K children Reject
• Candidate split points at class boundaries: <$4K, <$6K, <$7K, <$8K, <$10K, <$30K
Numeric Features
• Consider class boundaries and choose the split with minimal weighted average impurity:
  – (Deposit < $4K):  1/9 × imp(0:1) + 8/9 × imp(5:3) = 0.208
  – (Deposit < $6K):  2/9 × imp(1:1) + 7/9 × imp(4:3) = 0.246
  – (Deposit < $7K):  3/9 × imp(1:2) + 6/9 × imp(4:2) = 0.222
  – (Deposit < $8K):  4/9 × imp(2:2) + 5/9 × imp(3:2) = 0.244
  – (Deposit < $10K): 5/9 × imp(2:3) + 4/9 × imp(3:1) = 0.217
  – (Deposit < $30K): 8/9 × imp(5:3) + 1/9 × imp(0:1) = 0.208
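A sketch reproducing the split-point evaluation above on the Deposit values (the variable and function names are ours):

```python
def imp(m, n):
    """P(A)P(B) impurity for a node with m Approves and n Rejects."""
    return (m * n) / (m + n) ** 2 if m + n > 0 else 0.0

deposits = [10, 7, 4, 16, 18, 6, 8, 3, 30]              # in $K, applicants 1-9
labels = ["A", "A", "A", "A", "A", "R", "R", "R", "R"]  # A = Approve, R = Reject

def weighted_impurity(threshold):
    """Weighted average impurity of the binary split Deposit < threshold."""
    left = [c for d, c in zip(deposits, labels) if d < threshold]
    right = [c for d, c in zip(deposits, labels) if d >= threshold]
    total = len(labels)
    node_imp = lambda g: imp(g.count("A"), g.count("R"))
    return (len(left) / total) * node_imp(left) + (len(right) / total) * node_imp(right)

# e.g. Deposit < $10K gives 5/9 * imp(2:3) + 4/9 * imp(3:1), about 0.217
```

Note that <$4K and <$30K tie at 0.208, the minimum among the candidate boundaries.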
When to Stop Splitting?
• If we stop splitting too early, the nodes are not pure enough
• If we stop too late, the tree becomes too large and complex, and can overfit

• Stop splitting a node when:
  – The node is pure (or reaches some impurity threshold)
  – The maximal tree depth/size is reached
  – The best candidate split reduces the impurity by less than a preset threshold (e.g. 5%)
  – …

• Principle: trade-off between tree complexity and accuracy/error
Pruning
• Shrink the tree (make it smaller/simpler) to reduce overfitting
• Pruning is the inverse of splitting
• Any pair of sibling leaf nodes whose elimination yields only a satisfactorily small increase in impurity is eliminated, and their common parent node becomes a leaf node
Summary
• Decision Tree vs DT learning
• How to build a decision tree from a set of training instances
• Design issues:
  – Node split
  – Question/test design
  – Numeric features to binary features
  – Stopping criteria
  – Pruning