CS B551: Decision Trees
AGENDA
- Decision trees
- Complexity
- Learning curves
- Combatting overfitting
- Boosting
RECAP
Still in the supervised setting, with logical (boolean) attributes.

Find a representation of CONCEPT in the form:

CONCEPT(x) ⇔ S(A,B,…)

where S(A,B,…) is a sentence built from the observable attributes, e.g.:

CONCEPT(x) ⇔ A(x) ∧ (B(x) ∨ C(x))
PREDICATE AS A DECISION TREE
The predicate CONCEPT(x) ⇔ A(x) ∧ (B(x) ∨ C(x)) can be represented by the following decision tree:

- Test A: if False, predict False; if True, test B
- Test B: if True, predict True; if False, test C
- Test C: if True, predict True; if False, predict False

Example: a mushroom is poisonous iff it is yellow and small, or yellow, big, and spotted:
- x is a mushroom
- CONCEPT = POISONOUS
- A = YELLOW
- B = BIG
- C = SPOTTED
PREDICATE AS A DECISION TREE
The same predicate CONCEPT(x) ⇔ A(x) ∧ (B(x) ∨ C(x)) and the same decision tree as above, with two additional observable attributes:
- D = FUNNEL-CAP
- E = BULKY
TRAINING SET
| Ex. # | A | B | C | D | E | CONCEPT |
|---|---|---|---|---|---|---|
| 1 | False | False | True | False | True | False |
| 2 | False | True | False | False | False | False |
| 3 | False | True | True | True | True | False |
| 4 | False | False | True | False | False | False |
| 5 | False | False | False | True | True | False |
| 6 | True | False | True | False | False | True |
| 7 | True | False | False | True | False | True |
| 8 | True | False | True | False | True | True |
| 9 | True | True | True | False | True | True |
| 10 | True | True | True | True | True | True |
| 11 | True | True | False | False | False | False |
| 12 | True | True | False | False | True | False |
| 13 | True | False | True | True | True | True |
POSSIBLE DECISION TREE
[Figure: a larger decision tree consistent with the training set, testing D at the root with further tests on C, E, B, and A along its branches.]
POSSIBLE DECISION TREE
[Figure: the larger tree and the small tree for A ∧ (B ∨ C), side by side.]

The larger tree computes:

CONCEPT ⇔ (D ∧ (E ∨ A)) ∨ (¬D ∧ C ∧ (B ∨ (¬B ∧ ((E ∧ ¬A) ∨ (¬E ∧ A)))))

The small tree computes:

CONCEPT ⇔ A ∧ (B ∨ C)
POSSIBLE DECISION TREE
[Figure: the two trees from the previous slide.]

CONCEPT ⇔ (D ∧ (E ∨ A)) ∨ (¬D ∧ C ∧ (B ∨ (¬B ∧ ((E ∧ ¬A) ∨ (¬E ∧ A)))))

CONCEPT ⇔ A ∧ (B ∨ C)

KIS ("keep it simple") bias: build the smallest decision tree.

Finding the smallest tree is a computationally intractable problem, so use a greedy algorithm.
TOP-DOWN INDUCTION OF A DT

DTL(D, Predicates)
1. If all examples in D are positive, then return True
2. If all examples in D are negative, then return False
3. If Predicates is empty, then return the majority rule
4. A ← an error-minimizing predicate in Predicates
5. Return the tree whose:
   - root is A,
   - left branch is DTL(D+A, Predicates − A),
   - right branch is DTL(D−A, Predicates − A)

[Figure: the tree induced from the training set, testing A, then C, then B.]
COMMENTS
- Widely used algorithm
- Greedy
- Robust to noise (incorrect examples)
- Not incremental
LEARNABLE CONCEPTS
Some simple concepts cannot be represented compactly in DTs:
- Parity(x) = X1 xor X2 xor … xor Xn
- Majority(x) = 1 if most of the Xi are 1, 0 otherwise

These require a tree of size exponential in the number of attributes, and an exponential number of examples to learn exactly. The ease of learning depends on shrewdly (or luckily) chosen attributes that correlate with CONCEPT.
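Parity shows why greedy induction struggles: splitting on any single attribute leaves both branches perfectly balanced, so no attribute looks better than any other. A quick check over 4 attributes (a sketch; the helper names are illustrative):

```python
import itertools

# Parity over 4 boolean attributes: label is 1 iff an odd number of bits are set.
examples = [(bits, sum(bits) % 2)
            for bits in itertools.product((0, 1), repeat=4)]

def split_errors(i):
    """Majority-rule errors after splitting on attribute i."""
    err = 0
    for v in (0, 1):
        branch = [y for x, y in examples if x[i] == v]
        # the minority label count is the number of errors on this branch
        err += min(branch.count(0), branch.count(1))
    return err

# majority-rule errors with no split at all
labels = [y for _, y in examples]
baseline = min(labels.count(0), labels.count(1))
```

Every attribute leaves the error at the baseline of 8 (out of 16): the split gives the greedy learner no guidance at all.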
MISCELLANEOUS ISSUES
Assessing performance:
- Training set and test set
- Learning curve

[Figure: typical learning curve, plotting % correct on the test set (approaching 100) against the size of the training set.]
MISCELLANEOUS ISSUES
Assessing performance:
- Training set and test set
- Learning curve

[Figure: the typical learning curve again.]

Some concepts are unrealizable within a machine's capacity.
MISCELLANEOUS ISSUES
Assessing performance:
- Training set and test set
- Learning curve

Overfitting: the risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set.

[Figure: the typical learning curve again.]
MISCELLANEOUS ISSUES
Assessing performance:
- Training set and test set
- Learning curve
- Overfitting
- Tree pruning

Overfitting: the risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set.

Tree pruning: terminate the recursion when the number of errors or the information gain is small.
MISCELLANEOUS ISSUES
Assessing performance:
- Training set and test set
- Learning curve
- Overfitting
- Tree pruning

Tree pruning: terminate the recursion when the number of errors or the information gain is small.

Overfitting: the risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set. The resulting decision tree plus majority rule may not classify all examples in the training set correctly.
MISCELLANEOUS ISSUES
Assessing performance:
- Training set and test set
- Learning curve
- Overfitting
- Tree pruning
- Incorrect examples
- Missing data
- Multi-valued and continuous attributes
USING INFORMATION THEORY

Rather than minimizing the probability of error, minimize the expected number of questions needed to decide whether an object x satisfies CONCEPT:
- Use the information-theoretic quantity known as information gain
- Split on the variable with the highest information gain
ENTROPY / INFORMATION GAIN

Entropy encodes the quantity of uncertainty in a random variable:

H(X) = −Σ_{x∈Val(X)} P(x) log P(x)

Properties:
- H(X) = 0 if X is known, i.e., P(x) = 1 for some value x
- H(X) > 0 if X is not known with certainty
- H(X) is maximal if P(X) is the uniform distribution

Information gain measures the reduction in uncertainty in X given knowledge of Y:

I(X,Y) = E_y[H(X) − H(X|Y)] = Σ_y P(y) [Σ_x P(x|y) log P(x|y) − Σ_x P(x) log P(x)]

Properties:
- Always nonnegative
- Equal to 0 if X and Y are independent
- If Y is a choice, maximizing information gain is equivalent to minimizing E_y[H(X|Y)]
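These quantities can be computed directly on the 13 training examples from earlier (a sketch; the `data` rows transcribe the table's columns A, B, C, D, E, CONCEPT, with True = 1):

```python
import math

data = [
    (0, 0, 1, 0, 1, 0), (0, 1, 0, 0, 0, 0), (0, 1, 1, 1, 1, 0),
    (0, 0, 1, 0, 0, 0), (0, 0, 0, 1, 1, 0), (1, 0, 1, 0, 0, 1),
    (1, 0, 0, 1, 0, 1), (1, 0, 1, 0, 1, 1), (1, 1, 1, 0, 1, 1),
    (1, 1, 1, 1, 1, 1), (1, 1, 0, 0, 0, 0), (1, 1, 0, 0, 1, 0),
    (1, 0, 1, 1, 1, 1),
]

def entropy(labels):
    """H(X) = -sum_x P(x) log2 P(x), estimated from a list of labels."""
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

def info_gain(col):
    """I(CONCEPT, attribute) = H(CONCEPT) - E_y[H(CONCEPT | attribute)]."""
    labels = [row[-1] for row in data]
    gain = entropy(labels)
    for v in (0, 1):
        part = [row[-1] for row in data if row[col] == v]
        if part:
            gain -= (len(part) / len(data)) * entropy(part)
    return gain

gains = {name: info_gain(i) for i, name in enumerate("ABCDE")}
```

On this data A has the largest gain (about 0.50 bits), so a gain-driven greedy learner splits on A first.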
MAXIMIZING IG / MINIMIZING CONDITIONAL ENTROPY IN DECISION TREES

E_y[H(X|Y)] = −Σ_y P(y) Σ_x P(x|y) log P(x|y)

- Let n be the number of examples
- Let n+, n− be the numbers of examples on the True/False branches of Y
- Let p+, p− be the accuracies on the True/False branches of Y
- P(Y) = n+/n, P(correct|Y) = p+, P(correct|¬Y) = p−

Then:

E_y[H(X|Y)] = −(1/n) (n+ [p+ log p+ + (1−p+) log(1−p+)] + n− [p− log p− + (1−p−) log(1−p−)])
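The branch-based scoring above can be written directly in terms of n+, n−, p+, p− (a sketch; the 1/n normalization is included so the result matches E_y[H(X|Y)] exactly, and `plog` is an illustrative helper):

```python
import math

def plog(p):
    """p * log2(p), with the 0 log 0 = 0 convention."""
    return p * math.log2(p) if p > 0 else 0.0

def expected_cond_entropy(n_pos, n_neg, p_pos, p_neg):
    """E_y[H(X|Y)] for a boolean test Y: n_pos/n_neg examples reach the
    True/False branch, and p_pos/p_neg are the accuracies there."""
    n = n_pos + n_neg
    h_pos = -(plog(p_pos) + plog(1 - p_pos))   # H(X | Y)
    h_neg = -(plog(p_neg) + plog(1 - p_neg))   # H(X | not Y)
    return (n_pos / n) * h_pos + (n_neg / n) * h_neg
```

For attribute A on the training set (8 examples on the True branch with accuracy 6/8, 5 on the False branch with accuracy 1), this gives about 0.50 bits.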
STATISTICAL METHODS FOR ADDRESSING OVERFITTING / NOISE

There may be few training examples that match the path leading to a deep node in the decision tree, making the learner more susceptible to choosing irrelevant or incorrect attributes when the sample is small.

Idea:
- Make a statistical estimate of predictive power (which increases with larger samples)
- Prune branches with low predictive power

Chi-squared pruning
TOP-DOWN DT PRUNING

Consider an inner node X that by itself (majority rule) predicts p examples correctly and n examples incorrectly. At its k leaf nodes, the numbers of correct/incorrect examples are p1/n1, …, pk/nk.

Chi-squared test:
- Null hypothesis: example labels are randomly chosen with distribution p/(p+n) (X is irrelevant)
- Alternative hypothesis: examples are not randomly chosen (X is relevant)

Let Z = Σ_i [(pi − pi′)²/pi′ + (ni − ni′)²/ni′], where pi′ = p(pi+ni)/(p+n) and ni′ = n(pi+ni)/(p+n) are the expected numbers of correct/incorrect examples at leaf i if the null hypothesis holds.

Z is a statistic that is approximately drawn from the chi-squared distribution with k degrees of freedom. Look up the p-value of Z in a table, and prune if the p-value > α for some α (usually ≈ 0.05).
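A sketch of the test, assuming each leaf i contributes an observed (p_i, n_i) pair. The hard-coded 95% critical values stand in for the p-value table: Z below the critical value is the same as p-value above 0.05, which is the pruning condition.

```python
def chi_square_stat(leaf_counts):
    """Z = sum_i (p_i - p_i')^2 / p_i' + (n_i - n_i')^2 / n_i', with
    expectations taken under the null hypothesis that the split is
    irrelevant. leaf_counts is a list of (correct, incorrect) pairs."""
    p = sum(pi for pi, _ in leaf_counts)
    n = sum(ni for _, ni in leaf_counts)
    z = 0.0
    for pi, ni in leaf_counts:
        m = pi + ni
        p_exp = p * m / (p + n)   # expected correct at leaf i under H0
        n_exp = n * m / (p + n)   # expected incorrect at leaf i under H0
        if p_exp:
            z += (pi - p_exp) ** 2 / p_exp
        if n_exp:
            z += (ni - n_exp) ** 2 / n_exp
    return z

# 95% points of the chi-squared distribution, indexed by degrees of freedom
CHI2_95 = {1: 3.84, 2: 5.99, 3: 7.81, 4: 9.49, 5: 11.07}

def should_prune(leaf_counts, dof):
    """Prune when Z looks like noise, i.e. the p-value exceeds 0.05."""
    return chi_square_stat(leaf_counts) < CHI2_95[dof]
```

An uninformative split, such as two leaves with counts (2, 2) and (2, 2), gives Z = 0 and is pruned; a clean split such as (4, 0) and (0, 4) gives Z = 8 and is kept.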
CONTINUOUS ATTRIBUTES
Continuous attributes can be converted into logical ones via thresholds: X becomes the test X < a.

When considering a split on X, pick the threshold a that minimizes the number of errors.

[Figure: candidate thresholds along the sorted examples, annotated with error counts 7 7 6 5 6 5 4 5 4 3 4 5 4 5 6 7.]
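A minimal way to pick the threshold is a scan over midpoints between distinct values (a sketch with illustrative names):

```python
def best_threshold(values, labels):
    """Choose a for the boolean test X < a minimizing training errors."""
    distinct = sorted(set(values))
    # candidate thresholds: midpoints between consecutive distinct values
    cuts = [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]
    best_err, best_a = len(labels) + 1, None
    for a in cuts:
        # errors if we predict True exactly when v < a
        errs = sum((v < a) != y for v, y in zip(values, labels))
        errs = min(errs, len(labels) - errs)  # or flip the test to X >= a
        if errs < best_err:
            best_err, best_a = errs, a
    return best_a, best_err
```

For values [1, 2, 3, 4, 5, 6] labeled False below 4 and True from 4 up, the best cut is a = 3.5 with 0 errors (realized by the flipped test X ≥ a).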
APPLICATIONS OF DECISION TREES

- Medical diagnosis / drug design
- Evaluation of geological systems for assessing gas and oil basins
- Early detection of problems (e.g., jamming) during oil-drilling operations
- Automatic generation of rules in expert systems
HUMAN-READABILITY
DTs also have the advantage of being easily understood by humans. This is a legal requirement in many areas:
- Loans & mortgages
- Health insurance
- Welfare
ENSEMBLE LEARNING (BOOSTING)
![Page 28: CS B551: Decision Trees](https://reader035.vdocuments.us/reader035/viewer/2022062314/56812bcb550346895d902381/html5/thumbnails/28.jpg)
IDEA
It may be difficult to search for a single hypothesis that explains the data. Instead, construct multiple hypotheses (an ensemble) and combine their predictions.

"Can a set of weak learners construct a single strong learner?" – Michael Kearns, 1988
MOTIVATION
Suppose we have 5 classifiers, each with 60% accuracy. On a new example, run them all and pick the prediction by majority vote.

If the errors are independent, the combined classifier is correct about 68% of the time, and the advantage grows with the number of classifiers. (In reality the errors will not be independent, but we hope they will be mostly uncorrelated.)
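The independence claim can be checked with the binomial distribution (a sketch; 5 classifiers and 60% accuracy are the numbers from this slide):

```python
from math import comb

def majority_accuracy(k, p):
    """P(majority of k independent classifiers is correct), each correct
    with probability p; k is assumed odd so there are no ties."""
    return sum(comb(k, i) * p ** i * (1 - p) ** (k - i)
               for i in range(k // 2 + 1, k + 1))
```

majority_accuracy(5, 0.6) is about 0.683, and the figure climbs toward 1 as k grows: with 101 such classifiers it exceeds 0.95.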
BOOSTING
Weighted training set:

| Ex. # | Weight | A | B | C | D | E | CONCEPT |
|---|---|---|---|---|---|---|---|
| 1 | w1 | False | False | True | False | True | False |
| 2 | w2 | False | True | False | False | False | False |
| 3 | w3 | False | True | True | True | True | False |
| 4 | w4 | False | False | True | False | False | False |
| 5 | w5 | False | False | False | True | True | False |
| 6 | w6 | True | False | True | False | False | True |
| 7 | w7 | True | False | False | True | False | True |
| 8 | w8 | True | False | True | False | True | True |
| 9 | w9 | True | True | True | False | True | True |
| 10 | w10 | True | True | True | True | True | True |
| 11 | w11 | True | True | False | False | False | False |
| 12 | w12 | True | True | False | False | True | False |
| 13 | w13 | True | False | True | True | True | True |
BOOSTING
- Start with uniform weights wi = 1/N
- Use learner 1 to generate hypothesis h1
- Adjust the weights to give higher importance to the examples misclassified by h1
- Use learner 2 to generate hypothesis h2
- …
- Weight the hypotheses according to their performance, and return the weighted majority
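The loop above can be sketched in AdaBoost style (names are illustrative; hypotheses are functions from an example to a boolean):

```python
import math

def weighted_error(h, examples, w):
    """Total weight of the examples that hypothesis h misclassifies."""
    return sum(wi for wi, (x, y) in zip(w, examples) if h(x) != y)

def boost(examples, hypotheses, rounds):
    """Returns (hypothesis, vote weight) pairs built over `rounds` rounds."""
    n = len(examples)
    w = [1.0 / n] * n                      # start with uniform weights
    ensemble = []
    for _ in range(rounds):
        h = min(hypotheses, key=lambda g: weighted_error(g, examples, w))
        err = weighted_error(h, examples, w)
        if err == 0:                       # perfect hypothesis: decisive vote
            ensemble.append((h, 1e9))
            break
        if err >= 0.5:                     # no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - err) / err)   # hypothesis vote weight
        ensemble.append((h, alpha))
        # raise the weight of misclassified examples, lower the rest
        w = [wi * math.exp(alpha if h(x) != y else -alpha)
             for wi, (x, y) in zip(w, examples)]
        total = sum(w)
        w = [wi / total for wi in w]       # renormalize
    return ensemble

def weighted_majority(ensemble, x):
    return sum(a if h(x) else -a for h, a in ensemble) > 0
```

The vote weight 0.5 · ln((1 − err)/err) is AdaBoost's particular choice; other weighting schemes fit the same loop.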
MUSHROOM EXAMPLE
"Decision stumps": single-attribute DTs.

| Ex. # | Weight | A | B | C | D | E | CONCEPT |
|---|---|---|---|---|---|---|---|
| 1 | 1/13 | False | False | True | False | True | False |
| 2 | 1/13 | False | True | False | False | False | False |
| 3 | 1/13 | False | True | True | True | True | False |
| 4 | 1/13 | False | False | True | False | False | False |
| 5 | 1/13 | False | False | False | True | True | False |
| 6 | 1/13 | True | False | True | False | False | True |
| 7 | 1/13 | True | False | False | True | False | True |
| 8 | 1/13 | True | False | True | False | True | True |
| 9 | 1/13 | True | True | True | False | True | True |
| 10 | 1/13 | True | True | True | True | True | True |
| 11 | 1/13 | True | True | False | False | False | False |
| 12 | 1/13 | True | True | False | False | True | False |
| 13 | 1/13 | True | False | True | True | True | True |
MUSHROOM EXAMPLE
Pick C first; learner 1 returns the hypothesis CONCEPT = C.
MUSHROOM EXAMPLE
Update the weights; the examples misclassified by CONCEPT = C (1, 3, 4, 7) get higher weight:

| Ex. # | Weight | A | B | C | D | E | CONCEPT |
|---|---|---|---|---|---|---|---|
| 1 | .125 | False | False | True | False | True | False |
| 2 | .056 | False | True | False | False | False | False |
| 3 | .125 | False | True | True | True | True | False |
| 4 | .125 | False | False | True | False | False | False |
| 5 | .056 | False | False | False | True | True | False |
| 6 | .056 | True | False | True | False | False | True |
| 7 | .125 | True | False | False | True | False | True |
| 8 | .056 | True | False | True | False | True | True |
| 9 | .056 | True | True | True | False | True | True |
| 10 | .056 | True | True | True | True | True | True |
| 11 | .056 | True | True | False | False | False | False |
| 12 | .056 | True | True | False | False | True | False |
| 13 | .056 | True | False | True | True | True | True |
MUSHROOM EXAMPLE
Next try A; learner 2 returns the hypothesis CONCEPT = A.
MUSHROOM EXAMPLE
Update the weights; examples 11 and 12, misclassified by CONCEPT = A, now carry the most weight:

| Ex. # | Weight | A | B | C | D | E | CONCEPT |
|---|---|---|---|---|---|---|---|
| 1 | 0.07 | False | False | True | False | True | False |
| 2 | 0.03 | False | True | False | False | False | False |
| 3 | 0.07 | False | True | True | True | True | False |
| 4 | 0.07 | False | False | True | False | False | False |
| 5 | 0.03 | False | False | False | True | True | False |
| 6 | 0.03 | True | False | True | False | False | True |
| 7 | 0.07 | True | False | False | True | False | True |
| 8 | 0.03 | True | False | True | False | True | True |
| 9 | 0.03 | True | True | True | False | True | True |
| 10 | 0.03 | True | True | True | True | True | True |
| 11 | 0.25 | True | True | False | False | False | False |
| 12 | 0.25 | True | True | False | False | True | False |
| 13 | 0.03 | True | False | True | True | True | True |
MUSHROOM EXAMPLE
Next try E; learner 3 returns the hypothesis CONCEPT = E.
MUSHROOM EXAMPLE
Update the weights again, and continue in the same fashion with the remaining stumps…
MUSHROOM EXAMPLE
- The final classifier uses the stumps in the order C, A, E, D, B
- Weights on the hypotheses are determined by their overall error
- Weighted-majority weights: A = 2.1, B = 0.9, C = 0.8, D = 1.4, E = 0.09
- 100% accuracy on the training set
BOOSTING STRATEGIES
The preceding weighting strategy is the popular AdaBoost algorithm (see R&N p. 667). Many other strategies exist. Typically, as the number of hypotheses increases, accuracy increases as well. Does this conflict with Occam's razor?
ANNOUNCEMENTS
- Next class: neural networks & function learning (R&N 18.6-7)
- HW3 graded, solutions online
- HW4 due today; HW5 out today