Decision Trees
Outline
What is a decision tree?
How to construct a decision tree?
What are the major steps in decision tree induction?
How to select the attribute to split the node?
What are the other issues?
Classification by Decision Tree Induction
A decision tree is a flow-chart-like tree structure:
An internal node denotes a test on an attribute
A branch represents an outcome of the test
Leaf nodes represent class labels or class distributions

Example tree:

Age?
  <=30  -> Student?
            no  -> NO
            yes -> YES
  31…40 -> YES
  >40   -> Credit?
            excellent -> NO
            fair      -> YES
Training Dataset

ID   Age    Income  Student  Credit     Buys_computer
P1   <=30   high    no       fair       no
P2   <=30   high    no       excellent  no
P3   31…40  high    no       fair       yes
P4   >40    medium  no       fair       yes
P5   >40    low     yes      fair       yes
P6   >40    low     yes      excellent  no
P7   31…40  low     yes      excellent  yes
P8   <=30   medium  no       fair       no
P9   <=30   low     yes      fair       yes
P10  >40    medium  yes      fair       yes
P11  <=30   medium  yes      excellent  yes
P12  31…40  medium  no       excellent  yes
P13  31…40  high    yes      fair       yes
P14  >40    medium  no       excellent  no
Output: A Decision Tree for “buys_computer”

Age?
  <=30  -> Student?
            no  -> NO
            yes -> YES
  31…40 -> YES
  >40   -> Credit?
            excellent -> NO
            fair      -> YES
How to construct a decision tree?
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm):
Attributes are categorical (continuous-valued attributes are discretized in advance)
The tree is constructed in a top-down, recursive, divide-and-conquer manner
At the start, all training examples are at the root
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Examples are partitioned recursively based on the selected attributes
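The basic algorithm above can be sketched in Python (a minimal illustration, not code from the slides; the row encoding and helper names are assumptions):

```python
from collections import Counter

def majority_class(rows, target):
    """Most common class label among the rows."""
    return Counter(r[target] for r in rows).most_common(1)[0][0]

def build_tree(rows, attributes, target, select_attribute):
    """Greedy top-down, divide-and-conquer induction over categorical attributes.

    select_attribute(rows, attributes, target) returns the attribute judged
    best by some heuristic or statistical measure, e.g. information gain.
    """
    classes = {r[target] for r in rows}
    if len(classes) == 1:               # pure node -> leaf
        return classes.pop()
    if not attributes:                  # no tests left -> majority vote
        return majority_class(rows, target)
    best = select_attribute(rows, attributes, target)
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in {r[best] for r in rows}:       # partition on each outcome
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = build_tree(subset, remaining, target, select_attribute)
    return tree
```

With an information-gain-based `select_attribute` (the measure discussed later in the deck), this procedure reproduces the buys_computer tree built over the following slides.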
Construction of a Decision Tree for “buys_computer”

?   [P1,…,P14]  Yes: 9, No: 5   (all examples at the root; the split attribute is not yet chosen)
Construction of a Decision Tree for “buys_computer”

Age?   [P1,…,P14]  Yes: 9, No: 5
  branches: <=30, 31…40, >40
Construction of a Decision Tree for “buys_computer”

Age?   [P1,…,P14]  Yes: 9, No: 5
  <=30  -> ?    [P1,P2,P8,P9,P11]   Yes: 2, No: 3
  31…40 -> YES  [P3,P7,P12,P13]     Yes: 4, No: 0
  >40   -> ?    [P4,P5,P6,P10,P14]  Yes: 3, No: 2
Construction of a Decision Tree for “buys_computer”

Age?   [P1,…,P14]  Yes: 9, No: 5
  <=30  -> Student?  [P1,P2,P8,P9,P11]  Yes: 2, No: 3
             no  -> NO   [P1,P2,P8]  Yes: 0, No: 3
             yes -> YES  [P9,P11]    Yes: 2, No: 0
  31…40 -> YES  [P3,P7,P12,P13]  Yes: 4, No: 0
  >40   -> ?    [P4,P5,P6,P10,P14]  Yes: 3, No: 2
Construction of a Decision Tree for “buys_computer”

Age?   [P1,…,P14]  Yes: 9, No: 5
  <=30  -> Student?  [P1,P2,P8,P9,P11]  Yes: 2, No: 3
             no  -> NO   [P1,P2,P8]  Yes: 0, No: 3
             yes -> YES  [P9,P11]    Yes: 2, No: 0
  31…40 -> YES  [P3,P7,P12,P13]  Yes: 4, No: 0
  >40   -> Credit?  [P4,P5,P6,P10,P14]  Yes: 3, No: 2
             excellent -> NO   [P6,P14]     Yes: 0, No: 2
             fair      -> YES  [P4,P5,P10]  Yes: 3, No: 0
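The choice of Age? at the root can be checked by computing each attribute's information gain over the 14 training tuples (a Python sketch; the gain measure itself is introduced later in the deck):

```python
from collections import Counter
from math import log2

# The 14 training tuples from the slides: (Age, Income, Student, Credit, Buys_computer)
DATA = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31…40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31…40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no",  "excellent", "yes"),
    ("31…40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]
ATTRS = {"Age": 0, "Income": 1, "Student": 2, "Credit": 3}

def info(labels):
    """I(...) over a list of class labels, with 0*log2(0) taken as 0."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values() if c)

def gain(rows, col):
    """Gain(A) = I(S) - E(A) for the attribute in column `col`."""
    labels = [r[-1] for r in rows]
    e = 0.0
    for v in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == v]
        e += len(subset) / len(rows) * info(subset)
    return info(labels) - e

gains = {a: gain(DATA, c) for a, c in ATTRS.items()}
```

Running this gives Gain(Age) ≈ 0.25, the largest of the four, which is why Age? is tested first.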
How to select the attribute to split the node?
Which Attribute Is the Best?
The attribute most useful for classifying the examples.
Information gain: an information-theoretic approach
Measures how well an attribute separates the training examples
Use the attribute with the highest information gain to split
Minimizes the expected number of tests needed to classify a new tuple

How useful? How well separated? How pure is the splitting result? -> Information gain
Choosing an attribute
Idea: a good attribute splits the examples into subsets that are (ideally) “all positive” or “all negative”.
In the restaurant example, Patrons? is a better choice.
Information theory
If there are n equally probable possible messages, then the probability p of each is 1/n
Information conveyed by a message is -log(p) = log(n)
E.g., if there are 16 messages, then log(16) = 4 and we need 4 bits to identify/send each message
In general, if we are given a probability distribution P = (p1, p2, …, pn), then the information conveyed by the distribution (a.k.a. the entropy of P) is:

I(P) = -(p1*log(p1) + p2*log(p2) + … + pn*log(pn))
Information theory II
Information conveyed by a distribution (a.k.a. the entropy of P):

I(P) = -(p1*log(p1) + p2*log(p2) + … + pn*log(pn))

Examples:
If P is (0.5, 0.5), then I(P) is 1
If P is (0.67, 0.33), then I(P) is 0.92
If P is (1, 0), then I(P) is 0
The more uniform the probability distribution, the greater its information: more information is conveyed by a message telling you which event actually occurred.
Entropy is the average number of bits per message needed to represent a stream of messages.
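These example values are easy to check in Python (a small sketch; logs are base 2, and 0*log2(0) is taken as 0):

```python
from math import log2

def entropy(dist):
    """I(P) = -(p1*log2(p1) + ... + pn*log2(pn)), skipping zero probabilities."""
    return -sum(p * log2(p) for p in dist if p > 0)

# The slide's examples:
# entropy([0.5, 0.5])   -> 1.0    (maximally uniform: one full bit)
# entropy([0.67, 0.33]) -> ~0.92
# entropy([1.0, 0.0])   -> 0.0    (no uncertainty, no information needed)
```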
Information for classification
If a set S of records is partitioned into disjoint exhaustive classes (C1, C2, …, Ck) on the basis of the value of the class attribute, then the information needed to identify the class of an element of S is Info(S) = I(P), where P is the probability distribution of the partition (C1, C2, …, Ck):

P = (|C1|/|S|, |C2|/|S|, …, |Ck|/|S|)

[Figure: two partitions of S into classes C1, C2, C3, contrasting an evenly mixed partition (high information) with one dominated by a single class (low information)]
Information, Entropy, and Information Gain
S contains si tuples of class Ci for i = {1, …, m}. Information measures “the amount of info” required to classify an arbitrary tuple:

I(s1, s2, …, sm) = -(p1*log2(p1) + p2*log2(p2) + … + pm*log2(pm))

where pi = si/|S| is the probability that an arbitrary tuple belongs to Ci.

Example: S contains 100 tuples, 25 belong to class C1 and 75 belong to class C2:

I(25, 75) = -(25/100)*log2(25/100) - (75/100)*log2(75/100) = 0.811
Information, Entropy, and Information Gain
Information reflects the “purity” of the data set:
A low information value indicates high purity
A high information value indicates high diversity
Example: S contains 100 tuples.
If 0 belong to class C1 and 100 belong to class C2 (taking 0*log2(0) = 0):

I(0, 100) = -(0/100)*log2(0/100) - (100/100)*log2(100/100) = 0

If 50 belong to class C1 and 50 belong to class C2:

I(50, 50) = -(50/100)*log2(50/100) - (50/100)*log2(50/100) = 1
Information for classification II
If we partition S w.r.t. attribute X into sets {T1, T2, …, Tn}, then the information needed to identify the class of an element of S becomes the weighted average of the information needed to identify the class of an element of Ti, i.e. the weighted average of Info(Ti):

Info(X, S) = Σi (|Ti|/|S|) * Info(Ti)

[Figure: two partitions of S into classes C1, C2, C3, contrasting a split that leaves the classes mixed (high information) with one that separates them (low information)]
Information gain
Consider the quantity Gain(X, S) defined as

Gain(X, S) = Info(S) - Info(X, S)

This represents the difference between
the information needed to identify an element of S, and
the information needed to identify an element of S after the value of attribute X has been obtained.
That is, it is the gain in information due to attribute X. We can use this to rank attributes and to build decision trees in which each node is assigned the attribute with the greatest gain among the attributes not yet considered on the path from the root.
The intent of this ordering is:
To create small decision trees, so that records can be identified after only a few questions
To match a hoped-for minimality of the process represented by the records being considered (Occam’s Razor)
Information, Entropy, and Information Gain
S contains si tuples of class Ci for i = {1, …, m}.
Attribute A has values {a1, a2, …, av}.
Let sij be the number of tuples which belong to class Ci and have value aj in attribute A.
The entropy of attribute A is

E(A) = Σj=1..v ((s1j + … + smj)/|S|) * I(s1j, …, smj)

The information gained by branching on attribute A is

Gain(A) = I(s1, s2, …, sm) - E(A)
Information, Entropy, and Information Gain
Let Tj be the set of tuples having value aj in attribute A; then s1j + … + smj = |Tj| and I(s1j, …, smj) = I(Tj).
The entropy of attribute A is

E(A) = Σj=1..v ((s1j + … + smj)/|S|) * I(s1j, …, smj)

where (s1j + … + smj)/|S| is the proportion of |Tj| over |S|, and I(s1j, …, smj) is the information of Tj.
Information, Entropy, and Information Gain
Example: S contains 100 tuples, 40 belong to class C1 (red) and 60 belong to class C2 (blue). Partitioning on attribute A:

A = a1: 20 tuples (10 C1, 10 C2), I(10, 10) = 1
A = a2: 30 tuples (10 C1, 20 C2), I(10, 20) = 0.918
A = a3: 50 tuples (20 C1, 30 C2), I(20, 30) = 0.971

I(40, 60) = 0.971

E(A) = (20/100)*1 + (30/100)*0.918 + (50/100)*0.971 = 0.961

Gain(A) = I(40, 60) - E(A) = 0.971 - 0.961 = 0.01
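The E(A) and Gain(A) arithmetic in this example can be reproduced directly (a short Python check; the partition layout mirrors the slide):

```python
from math import log2

def info(counts):
    """I(s1,...,sm) from raw class counts, with 0*log2(0) taken as 0."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

# Partition of the 100 tuples by attribute A, as (class-C1, class-C2) counts:
partition = {"a1": (10, 10), "a2": (10, 20), "a3": (20, 30)}

total = sum(sum(c) for c in partition.values())              # 100 tuples
e_a = sum(sum(c) / total * info(c) for c in partition.values())
gain_a = info((40, 60)) - e_a
# e_a is roughly 0.961 and gain_a roughly 0.01, matching the slide.
```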
Computing information gain
[Figure: the restaurant data set — 12 examples, 6 positive (Y) and 6 negative (N), with attributes Patrons (Empty/Some/Full) and Type (French/Italian/Thai/Burger)]

I(S) = -(.5*log(.5) + .5*log(.5)) = .5 + .5 = 1

Info(Pat, S) = 1/6*(0) + 1/3*(0) + 1/2*(-(2/3*log(2/3) + 1/3*log(1/3))) = 1/2*(2/3*.6 + 1/3*1.6) = .47

Info(Type, S) = 1/6*(1) + 1/6*(1) + 1/3*(1) + 1/3*(1) = 1

Gain(Pat, S) = 1 - .47 = .53
Gain(Type, S) = 1 - 1 = 0
Regarding the Definition of Entropy…
On textbook page 134 (Eq. 3.6), entropy is defined on a set of tuples:

Entropy(S) = -Σi=1..m pi*log2(pi)

On textbook page 287 (Eq. 7.2), entropy is defined on an attribute:

Entropy(A) = Σj=1..v ((s1j + … + smj)/|S|) * I(s1j, …, smj)

Polymorphism: when entropy is defined on tuples, use Eq. 3.6; when entropy is defined on an attribute, use Eq. 7.2.
How well does it work?
Many case studies have shown that decision trees are at least as accurate as human experts. In a study on diagnosing breast cancer, humans correctly classified the examples 65% of the time; the decision tree classified 72% correctly.
British Petroleum designed a decision tree for gas-oil separation on offshore oil platforms that replaced an earlier rule-based expert system.
Cessna designed an airplane flight controller using 90,000 examples and 20 attributes per example.
What are the other issues?
Extracting Classification Rules from Trees
Represent the knowledge in the form of IF-THEN rules:
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rules are easier for humans to understand
Examples of Classification Rules

Age?
  <=30  -> Student?
            no  -> NO
            yes -> YES
  31…40 -> YES
  >40   -> Credit?
            excellent -> NO
            fair      -> YES

Classification rules:
1. IF age = “<=30” AND student = “no” THEN buys_computer = “no”
2. IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
3. IF age = “31…40” THEN buys_computer = “yes”
4. IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
5. IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
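The path-to-rule translation can be sketched as follows (a minimal illustration; the nested-dict encoding of the tree is an assumption, not taken from the slides):

```python
def extract_rules(tree, conditions=(), target="buys_computer"):
    """Walk every root-to-leaf path; each path becomes one IF-THEN rule."""
    if not isinstance(tree, dict):                 # leaf: emit one rule
        cond = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        return [f'IF {cond} THEN {target} = "{tree}"']
    (attribute, branches), = tree.items()          # one test per internal node
    rules = []
    for value, subtree in branches.items():
        rules += extract_rules(subtree, conditions + ((attribute, value),), target)
    return rules

# The buys_computer tree, with each internal node as {attribute: {value: subtree}}:
tree = {"age": {
    "<=30": {"student": {"no": "no", "yes": "yes"}},
    "31…40": "yes",
    ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
}}
# extract_rules(tree) yields five rules, one per leaf, matching the list above.
```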
Avoid Over-fitting in Classification
The generated tree may over-fit the training data:
Too many branches, some of which may reflect anomalies due to noise or outliers
The result is poor accuracy on unseen samples
Two approaches to avoiding over-fitting:
Pre-pruning: halt tree construction early; do not split a node if this would cause the goodness measure to fall below a threshold (it is difficult to choose an appropriate threshold)
Post-pruning: remove branches from a “fully grown” tree to obtain a sequence of progressively pruned trees; use a set of data different from the training data to decide which is the “best pruned tree”
Enhancements to basic decision tree induction
Dynamic discretization for continuous-valued attributes:
Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
Handling missing attribute values:
Assign the most common value of the attribute
Assign a probability to each of the possible values
Attribute construction:
Create new attributes based on existing ones that are sparsely represented
Reduces fragmentation (the number of samples at a branch becomes too small to be statistically significant), repetition (an attribute is repeatedly tested along a branch), and replication (duplicate subtrees)
Classification in Large Databases
Classification is a classical problem extensively studied by statisticians and machine learning researchers.
Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed.
Why decision tree induction in data mining?
Relatively fast learning speed (compared with other classification methods)
Convertible to simple and easy-to-understand classification rules
Can use SQL queries for accessing databases
Classification accuracy comparable with other methods
Scalable Decision Tree Induction Methods
SLIQ (EDBT’96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory
SPRINT (VLDB’96, J. Shafer et al.): constructs an attribute-list data structure
PUBLIC (VLDB’98, Rastogi & Shim): integrates tree splitting and tree pruning; stops growing the tree earlier
RainForest (VLDB’98, Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label)
Summary
What is a decision tree?
  A flow-chart-like tree: internal nodes, branches, and leaf nodes
How to construct a decision tree? What are the major steps in decision tree induction?
  Test attribute selection
  Sample partitioning
How to select the attribute to split the node?
  Select the attribute with the highest information gain:
    Calculate the information of the node
    Calculate the entropy of the attribute
    Calculate the difference between the information and the entropy
What are the other issues?