
Page 1:

CSE 446: Decision Trees, Winter 2012

Slides adapted from Carlos Guestrin and Andrew Moore by Luke Zettlemoyer & Dan Weld

Page 2:

Overview of Learning

What is being learned? × Type of supervision (e.g., experience, feedback):

• Discrete function, labeled examples: Classification
• Discrete function, no supervision: Clustering
• Continuous function, labeled examples: Regression
• Policy, labeled examples (demonstrations): Apprenticeship Learning
• Policy, reward: Reinforcement Learning

Page 3:

A learning problem: predict fuel efficiency

From the UCI repository (thanks to Ross Quinlan)

• 40 Records

• Discrete data (for now)

• Predict MPG

mpg    cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good   4          low           low         low     high          75to78     asia
bad    6          medium        medium      medium  medium        70to74     america
bad    4          medium        medium      medium  low           75to78     europe
bad    8          high          high        high    low           70to74     america
bad    6          medium        medium      medium  medium        70to74     america
bad    4          low           medium      low     medium        70to74     asia
bad    4          low           medium      low     low           70to74     asia
bad    8          high          high        high    low           75to78     america
:      :          :             :           :       :             :          :
bad    8          high          high        high    low           70to74     america
good   8          high          medium      high    high          79to83     america
bad    8          high          high        high    low           75to78     america
good   4          low           low         low     low           79to83     america
bad    6          medium        medium      medium  high          75to78     america
good   4          medium        low         low     low           79to83     america
good   4          low           low         medium  high          79to83     america
bad    8          high          high        high    low           70to74     america
good   4          low           medium      low     medium        75to78     europe
bad    5          medium        medium      medium  medium        75to78     europe

Page 4:

Hypotheses: decision trees f : X → Y

• Each internal node tests an attribute xi

• Each branch assigns an attribute value xi=v

• Each leaf assigns a class y

• To classify an input x: traverse the tree from root to leaf and output the label y at that leaf

[Figure: an example decision tree. The root tests Cylinders (branches 3, 4, 5, 6, 8); the 4-cylinder branch tests Maker (america, asia, europe) and the 8-cylinder branch tests Horsepower (low, med, high); every other branch ends in a leaf labeled good or bad.]

Page 5:

Hypothesis Space

• How many possible hypotheses?
• What functions can be represented?
• How many will be consistent with a given dataset?
• How will we choose the best one?

[The MPG training table and the example decision tree from the previous slides are repeated here.]

Page 6:

Are all decision trees equal?
• Many trees can represent the same concept
• But not all trees will have the same size!
  e.g., y = (A ∧ B) ∨ (¬A ∧ C)

[Figure: a compact tree for this function tests A at the root; the true branch tests B and the false branch tests C, with + and − leaves.]

• Which tree do we prefer?

[Figure: an equivalent but larger tree tests B at the root, tests C on both branches, and only tests A near the leaves, duplicating structure.]

Page 7:

Learning decision trees is hard!!!

• Finding the simplest (smallest) decision tree is an NP-complete problem [Hyafil & Rivest '76]

• Resort to a greedy heuristic:
  – Start from an empty decision tree
  – Split on the next best attribute (feature)
  – Recurse

Page 8:

Improving Our Tree

predict mpg=bad

Page 9:

Recursive Step

Partition the records by the value of the split attribute:
• Records in which cylinders = 4 → build a tree from these records
• Records in which cylinders = 5 → build a tree from these records
• Records in which cylinders = 6 → build a tree from these records
• Records in which cylinders = 8 → build a tree from these records

Page 10:

A full tree

Page 11:

Two Questions

1. Which attribute gives the best split?
2. When to stop recursion?

Greedy algorithm:
– Start from an empty decision tree
– Split on the best attribute (feature)
– Recurse

Page 12:

Which attribute gives the best split?

A1: The one with the highest information gain
    – Defined in terms of entropy

A2: Actually, there are many alternatives, e.g., accuracy
    – Seeks to reduce the misclassification rate

Page 13:

Entropy

Entropy H(Y) of a random variable Y:
    H(Y) = − Σ_i P(Y = y_i) log2 P(Y = y_i)

More uncertainty, more entropy!

Information-theoretic interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code).

Page 14:

Entropy Example

X1 X2 Y

T T T

T F T

T T T

T F T

F T T

F F F

P(Y = t) = 5/6
P(Y = f) = 1/6

H(Y) = − 5/6 log2(5/6) − 1/6 log2(1/6) ≈ 0.65
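As a quick check on this number, here is a minimal Python sketch of the entropy computation (the helper name `entropy` is mine, not from the course):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y): expected bits to encode a value drawn from the empirical distribution of `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# The six Y values from the table above: five T's and one F.
print(round(entropy(list("TTTTTF")), 2))  # 0.65
```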

Page 15:

Conditional Entropy

Conditional entropy H(Y|X) of a random variable Y conditioned on a random variable X:
    H(Y|X) = Σ_x P(X = x) H(Y | X = x)

Splitting on X1 in the table below:
    X1 = t branch: Y = t : 4, Y = f : 0        P(X1 = t) = 4/6
    X1 = f branch: Y = t : 1, Y = f : 1        P(X1 = f) = 2/6

X1 X2 Y

T T T

T F T

T T T

T F T

F T T

F F F

Example:
    H(Y|X1) = − 4/6 (1 log2 1 + 0 log2 0) − 2/6 (1/2 log2 1/2 + 1/2 log2 1/2)
            = 2/6
            ≈ 0.33

Page 16:

Information Gain

Advantage of an attribute = the decrease in entropy (uncertainty) of Y after splitting on it:
    IG(X) = H(Y) − H(Y|X)

X1 X2 Y

T T T

T F T

T T T

T F T

F T T

F F F

In our running example:

    IG(X1) = H(Y) − H(Y|X1) = 0.65 − 0.33 = 0.32

IG(X1) > 0, so we prefer the split!
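Continuing the earlier sketch, a hedged Python version of conditional entropy and information gain that reproduces the numbers on these slides (the helper names are mine; `entropy` is repeated so the snippet runs on its own):

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def cond_entropy(xs, ys):
    """H(Y|X) = sum over x of P(X=x) * H(Y | X=x)."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    n = len(ys)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def info_gain(xs, ys):
    """IG(X) = H(Y) - H(Y|X)."""
    return entropy(ys) - cond_entropy(xs, ys)

# Running example: the X1 and Y columns of the table above.
x1 = list("TTTTFF")
y  = list("TTTTTF")
print(round(cond_entropy(x1, y), 2))  # 0.33
print(round(info_gain(x1, y), 2))     # 0.32
```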

Page 17:

Alternate Splitting Criteria

• Misclassification impurity: the minimum probability that a training pattern will be misclassified
    M(Y) = 1 − max_i P(Y = y_i)

• Misclassification gain
    IG_M(X) = [1 − max_i P(Y = y_i)] − [1 − max_j max_i P(Y = y_i | X = x_j)]

Page 18:

Learning Decision Trees

• Start from an empty decision tree
• Split on the next best attribute (feature)
  – Use information gain (or…?) to select the attribute
• Recurse

Page 19:

Suppose we want to predict MPG

Now, look at all the information gains…

predict mpg=bad

Page 20:

Tree After One Iteration

Page 21:

When to Terminate?

Page 22:

Base Case One

Don't split a node if all matching records have the same output value.

Page 23:

Base Case Two

Don't split a node if none of the attributes can create multiple [non-empty] children.

Page 24:

Base Case Two: No attributes can distinguish the records

Page 25:

Base Cases: An Idea

• Base Case One: If all records in the current data subset have the same output, then don't recurse
• Base Case Two: If all records have exactly the same set of input attributes, then don't recurse

Proposed Base Case 3: If all attributes have zero information gain, then don't recurse

Is this a good idea?

Page 26:

The problem with Base Case 3

a b y
0 0 0
0 1 1
1 0 1
1 1 0

y = a XOR b

The information gains: [both IG(a) and IG(b) are zero]
The resulting decision tree: [with Base Case 3, a single root leaf predicting the majority class]

Page 27:

But Without Base Case 3:

y = a XOR b

a b y
0 0 0
0 1 1
1 0 1
1 1 0

The resulting decision tree: [splits on a, then on b, and classifies all four records correctly]

So: Base Case 3? Include or omit?

Page 28:

Building Decision Trees

BuildTree(DataSet, Output):
  If all output values are the same in DataSet,
    then return a leaf node that says "predict this unique output"
  If all input values are the same,
    then return a leaf node that says "predict the majority output"
  Else find the attribute X with the highest information gain
    Suppose X has nX distinct values (i.e., X has arity nX)
    Create and return a non-leaf node with nX children
    The ith child is built by calling BuildTree(DS_i, Output),
    where DS_i consists of all records in DataSet for which X = the ith distinct value of X
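As an illustration only, here is a minimal runnable Python sketch of this pseudocode. It reuses the `info_gain` helper from the earlier snippet, represents a dataset as a list of dicts, and makes no claim to match the course's reference implementation:

```python
from collections import Counter

def build_tree(records, target):
    """records: list of dicts mapping attribute -> value; target: name of the output column.
    Mirrors the BuildTree pseudocode above (no pruning, no Base Case 3)."""
    labels = [r[target] for r in records]
    majority = Counter(labels).most_common(1)[0][0]
    attrs = [a for a in records[0] if a != target]
    # Base Case One: all outputs identical -> leaf predicting that output.
    if len(set(labels)) == 1:
        return labels[0]
    # Base Case Two: no attribute can split the records -> leaf predicting the majority output.
    if all(len(set(r[a] for r in records)) == 1 for a in attrs):
        return majority
    # Otherwise split on the attribute with the highest information gain.
    best = max(attrs, key=lambda a: info_gain([r[a] for r in records], labels))
    children = {}
    for v in set(r[best] for r in records):
        subset = [r for r in records if r[best] == v]
        children[v] = build_tree(subset, target)
    return (best, children)  # internal node: (attribute, {value: subtree})
```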

Page 29:

General View of a Classifier

[Figure: a 2D feature space (axes roughly 0.0–6.0 by 0.0–3.0) scattered with positive (+) and negative (−) training points. The hypothesis is a decision boundary for the labeling function: one region is labeled +, the other −. Several unlabeled query points (?) must be classified.]

Page 30:

Page 31:

Ok, so how does it perform?

Page 32:

MPG Test set error

Page 33:

MPG test set error

The test set error is much worse than the training set error…

…why?

Page 34:

Decision trees will overfit

• Our decision trees have no learning bias
  – Training set error is always zero! (if there is no label noise)
  – Lots of variance
  – Will definitely overfit!!!
  – Must introduce some bias towards simpler trees
• Why might one pick simpler trees?

Page 35:

Occam's Razor

• Why favor short hypotheses?
• Arguments for:
  – There are fewer short hypotheses than long ones
    → a short hypothesis is less likely to fit the data by coincidence
    → a long hypothesis that fits the data may be a coincidence
• Arguments against:
  – The argument above really uses only the fact that the hypothesis space is small!!!
  – What is so special about small sets based on the complexity of each hypothesis?

Page 36:

How to Build Small Trees

Several reasonable approaches:
• Stop growing the tree before it overfits
  – Bound the depth or number of leaves
  – Base Case 3
  – Doesn't work well in practice
• Grow the full tree, then prune
  – Optimize on a held-out (development) set
    • If growing the tree hurts performance, then cut back
    • Con: requires a larger amount of data…
  – Use statistical significance testing
    • Test if the improvement for any split is likely due to noise
    • If so, then prune the split!
  – Convert to logical rules, then simplify the rules

Page 37:

Reduced Error Pruning

Split data into training & validation sets (10–33%)

Train on the training set (overfitting)
Do until further pruning is harmful:
  1) Evaluate the effect on the validation set of pruning each possible node (and the tree below it)
  2) Greedily remove the node whose removal most improves validation set accuracy
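A rough sketch of pruning in this spirit, written against the `(attribute, children)` tree representation from the earlier snippet. Note this is a simplified bottom-up variant (collapse a subtree to its majority leaf whenever the validation records reaching it are classified at least as well), not the exact greedy loop described above; all helper names are mine:

```python
from collections import Counter

def classify(tree, record, default):
    """Follow (attribute, children) nodes down to a leaf label; unseen values fall back to `default`."""
    while isinstance(tree, tuple):
        attr, children = tree
        tree = children.get(record[attr], default)
    return tree

def accuracy(tree, records, target, default):
    return sum(classify(tree, r, default) == r[target] for r in records) / len(records)

def reduced_error_prune(tree, train, valid, target, default):
    """Bottom-up: replace a subtree by a majority leaf if that does not hurt validation accuracy."""
    if not isinstance(tree, tuple) or not valid:
        return tree
    attr, children = tree
    pruned_children = {}
    for v, child in children.items():
        sub_train = [r for r in train if r[attr] == v] or train
        sub_valid = [r for r in valid if r[attr] == v]
        pruned_children[v] = reduced_error_prune(child, sub_train, sub_valid, target, default)
    pruned = (attr, pruned_children)
    leaf = Counter(r[target] for r in train).most_common(1)[0][0]
    leaf_acc = sum(leaf == r[target] for r in valid) / len(valid)
    return leaf if leaf_acc >= accuracy(pruned, valid, target, default) else pruned
```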

Page 38:

Effect of Reduced-Error Pruning

[Figure, annotated: "Cut tree back to here".]

Page 39:

Alternatively

• Chi-squared pruning
  – Grow the tree fully
  – Consider the leaves in turn
    • Is the parent split worth it?
    • Compared to Base Case 3?

Page 40:

Consider this split

Page 41:

A chi-square test

• Suppose that mpg was completely uncorrelated with maker.
• What is the chance we'd have seen data with at least this apparent level of association anyway?

Using a particular kind of chi-square test, the answer is 13.5%.

Such hypothesis tests are relatively easy to compute, but the details are involved.

Page 42:

Using chi-squared to avoid overfitting

• Build the full decision tree as before
• But when you can grow it no more, start to prune:
  – Beginning at the bottom of the tree, delete splits for which pchance > MaxPchance
  – Continue working your way up until there are no more prunable nodes

MaxPchance is a magic parameter you must specify to the decision tree, indicating your willingness to risk fitting noise.
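One hedged way to compute a pchance-style value for a candidate split is a generic chi-square test of independence on the contingency table of attribute value versus class, e.g. via SciPy. This is an illustration of the idea, not necessarily the exact test the slides have in mind:

```python
import numpy as np
from scipy.stats import chi2_contingency

def p_chance(xs, ys):
    """p-value for the null hypothesis 'X and Y are independent', from the observed counts."""
    x_vals, y_vals = sorted(set(xs)), sorted(set(ys))
    table = np.zeros((len(x_vals), len(y_vals)))
    for x, y in zip(xs, ys):
        table[x_vals.index(x), y_vals.index(y)] += 1
    _, p, _, _ = chi2_contingency(table)
    return p

# Prune a split whenever p_chance(attribute_values, labels) > MaxPchance (e.g. 0.05).
```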

Page 43:

Regularization

• Note for the future: MaxPchance is a regularization parameter that helps us bias towards simpler models

[Figure: expected test set error plotted against MaxPchance; decreasing MaxPchance yields smaller trees, increasing it yields larger trees.]

We’ll learn to choose the value of magic parameters like this one later!

Page 44:

Missing Data

[The MPG training data table from the earlier slides is shown again here.]

What to do??

Page 45:

Missing Data 1

Day  Temp  Humid  Wind  Tennis?
d1   h     h      weak  n
d2   h     h      s     n
d8   m     h      weak  n
d9   c     ?      weak  yes
d11  m     n      s     yes

• Don't use this instance for learning?
• Assign the attribute the most common value for that attribute, or the most common value given the classification
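A hedged pandas sketch of these two imputation options, using the toy table above (the column names and helper columns are my own):

```python
import pandas as pd

df = pd.DataFrame({
    "Day":    ["d1", "d2", "d8", "d9", "d11"],
    "Temp":   ["h", "h", "m", "c", "m"],
    "Humid":  ["h", "h", "h", None, "n"],   # d9's Humid value is missing
    "Wind":   ["weak", "s", "weak", "weak", "s"],
    "Tennis": ["n", "n", "n", "yes", "yes"],
})

# Option A: fill with the attribute's most common value overall.
df["Humid_filled"] = df["Humid"].fillna(df["Humid"].mode()[0])

# Option B: fill with the most common value among rows with the same classification.
df["Humid_by_class"] = df.groupby("Tennis")["Humid"].transform(
    lambda s: s.fillna(s.mode()[0]) if not s.mode().empty else s
)
```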

Page 46:

Attributes with Many Values?

Page 47:

Attributes with many values

• An attribute may have so many values that it
  – Divides the examples into tiny sets
  – Each set is then likely uniform → high information gain
  – But it is a poor predictor…
• Need to penalize these attributes

Page 48:

Real-valued inputs

What should we do if some of the inputs are real-valued?

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          97            75          2265    18.2          77         asia
bad   6          199           90          2648    15            70         america
bad   4          121           110         2600    12.8          77         europe
bad   8          350           175         4100    13            73         america
bad   6          198           95          3102    16.5          74         america
bad   4          108           94          2379    16.5          73         asia
bad   4          113           95          2228    14            71         asia
bad   8          302           139         3570    12.8          78         america
:     :          :             :           :       :             :          :
good  4          120           79          2625    18.6          82         america
bad   8          455           225         4425    10            70         america
good  4          107           86          2464    15.5          76         europe
bad   5          131           103         2830    15.9          78         europe

Infinite number of possible split values!!!

Finite dataset, only finite number of relevant splits!

Page 49:

"One branch for each numeric value" idea:

Hopeless: with such a high branching factor, we will shatter the dataset and overfit.

Page 50:

Threshold splits

• Binary tree: split on attribute X at value t
  – One branch: X < t
  – Other branch: X ≥ t

Page 51:

The set of possible thresholds

• Binary tree: split on attribute X
  – One branch: X < t
  – Other branch: X ≥ t
• Search through possible values of t
  – Seems hard!!!
• But only a finite number of t's are important
  – Sort the data according to X into {x1, …, xm}
  – Consider split points of the form xi + (xi+1 − xi)/2

Page 52:

Picking the best threshold

• Suppose X is real-valued
• Define IG(Y|X:t) as H(Y) − H(Y|X:t)
• Define H(Y|X:t) = H(Y|X < t) P(X < t) + H(Y|X ≥ t) P(X ≥ t)
• IG(Y|X:t) is the information gain for predicting Y if all you know is whether X is greater than or less than t
• Then define IG*(Y|X) = max_t IG(Y|X:t)
• For each real-valued attribute, use IG*(Y|X) to assess its suitability as a split
• Note: we may split on an attribute multiple times, with different thresholds
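A small sketch of this scan, reusing the `entropy` helper from the earlier snippet and generating midpoint candidates as on the previous slide (the function name and return convention are mine):

```python
def best_threshold(values, labels):
    """Return (IG*(Y|X), best t) by scanning midpoints between adjacent sorted values of X."""
    pairs = sorted(zip(values, labels))
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    h_y = entropy(ys)
    best_gain, best_t = 0.0, None
    for i in range(len(xs) - 1):
        if xs[i] == xs[i + 1]:
            continue  # no boundary between equal values
        t = xs[i] + (xs[i + 1] - xs[i]) / 2
        left, right = ys[: i + 1], ys[i + 1:]
        # H(Y|X:t) = H(Y|X<t) P(X<t) + H(Y|X>=t) P(X>=t)
        h_cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        if h_y - h_cond > best_gain:
            best_gain, best_t = h_y - h_cond, t
    return best_gain, best_t

# e.g. gain, t = best_threshold(horsepower_column, mpg_labels)
```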

Page 53:

Example with MPG

Page 54:

Example tree for our continuous dataset

Page 55:

What you need to know about decision trees

• Decision trees are one of the most popular ML tools
  – Easy to understand, implement, and use
  – Computationally cheap (to solve heuristically)
• Information gain to select attributes (ID3, C4.5, …)
• Presented for classification; can be used for regression and density estimation too
• Decision trees will overfit!!!
  – Must use tricks to find "simple trees", e.g.,
    • Fixed depth / early stopping
    • Pruning
    • Hypothesis testing

Page 56:

Acknowledgements

• Some of the material in the decision trees presentation is courtesy of Andrew Moore, from his excellent collection of ML tutorials:
  – http://www.cs.cmu.edu/~awm/tutorials
• Improved by
  – Carlos Guestrin
  – Luke Zettlemoyer