Machine Learning in Practice, Lecture 18

Carolyn Penstein Rosé
Language Technologies Institute / Human-Computer Interaction Institute
Plan for the Day
- Announcements
- Questions?
- Quiz Feedback
- Rule Based Learning
- Revisit the Tic Tac Toe Problem
- Start thinking about Optimization and Tuning
Quiz Feedback
- Only one person got everything right
- Were the readings confusing this time?
- Association Rule Mining vs. Rule Learning
Rule Based Learning
Rules Versus Trees
- Tree based learning is divide-and-conquer
  - Decisions are based on what will have the biggest overall effect on “purity” at leaf nodes
- Rule based learning is separate-and-conquer
  - Considers only one class at a time (usually starting with the smallest)
  - What separates this class from the default class?
Trees vs. Rules

[Weka screenshots: J48 output]
Locally Optimal Solutions

[Diagram: a search landscape in which a locally optimal solution sits apart from the optimal solution]
Covering Algorithms
- Rule based algorithms are called covering algorithms
- Whereas tree based algorithms take all classes into account at the same time, covering algorithms only consider one class at a time
- Rule based algorithms look for a set of conditions that achieves high accuracy on one class at a time
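To make separate-and-conquer concrete, here is a minimal sketch of a PRISM-style covering learner. This is an illustration, not Weka's implementation; the Instance and Condition types are invented for the example.

```java
import java.util.*;

// A minimal sketch of a separate-and-conquer (covering) learner in the
// spirit of PRISM. Not Weka's actual implementation; the Instance and
// Condition types here are hypothetical stand-ins.
public class Covering {
    record Instance(Map<String, String> attrs, String label) {}
    record Condition(String attr, String value) {
        boolean matches(Instance i) { return value.equals(i.attrs().get(attr)); }
    }

    // For one target class: grow a rule by greedily adding the condition
    // with the highest accuracy on the instances the rule still covers,
    // then remove the covered instances and learn the next rule.
    static List<List<Condition>> learnRules(List<Instance> data, String target,
                                            List<Condition> candidates) {
        List<List<Condition>> rules = new ArrayList<>();
        List<Instance> remaining = new ArrayList<>(data);
        while (remaining.stream().anyMatch(i -> i.label().equals(target))) {
            List<Condition> rule = new ArrayList<>();
            List<Instance> covered = remaining;
            // Grow until the rule covers only target-class instances.
            while (covered.stream().anyMatch(i -> !i.label().equals(target))) {
                Condition best = null;
                double bestAcc = -1.0;
                for (Condition c : candidates) {
                    if (rule.contains(c)) continue;   // don't re-add a condition
                    List<Instance> cov = covered.stream().filter(c::matches).toList();
                    if (cov.isEmpty()) continue;
                    double acc = cov.stream().filter(i -> i.label().equals(target)).count()
                                 / (double) cov.size();
                    if (acc > bestAcc) { bestAcc = acc; best = c; }
                }
                if (best == null) break;              // no usable condition left
                rule.add(best);
                Condition chosen = best;
                covered = covered.stream().filter(chosen::matches).toList();
            }
            rules.add(rule);
            remaining.removeAll(covered);  // "conquer": drop what this rule covers
        }
        return rules;
    }
}
```

Each pass grows one rule greedily by accuracy, then removes the instances it covers and starts the next rule: exactly the one-class-at-a-time behavior described above.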
Accuracy versus Information Gain

Parent set: [A A A A B B B B B]
Split 1: [A A A B] [A B B B B] (Accuracy: 78%, Information: .76)
Split 2: [A A A A B B] [B B B] (Accuracy: 78%, Information: .61)

* Note that lower resulting information means higher information gain.
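Checking the slide's numbers with the standard entropy formula: accuracy cannot distinguish the two splits (both classify 7 of 9 instances correctly), but the weighted average entropy of the children can.

```latex
\begin{align*}
H(p, q) &= -p \log_2 p - q \log_2 q\\
I_{\text{split 1}} &= \tfrac{4}{9} H\!\left(\tfrac{3}{4}, \tfrac{1}{4}\right)
  + \tfrac{5}{9} H\!\left(\tfrac{1}{5}, \tfrac{4}{5}\right)
  \approx \tfrac{4}{9}(0.811) + \tfrac{5}{9}(0.722) \approx 0.76\\
I_{\text{split 2}} &= \tfrac{6}{9} H\!\left(\tfrac{2}{3}, \tfrac{1}{3}\right)
  + \tfrac{3}{9} H(0, 1)
  \approx \tfrac{6}{9}(0.918) + \tfrac{3}{9}(0) \approx 0.61
\end{align*}
```

Information gain therefore prefers Split 2, whose [B B B] child is completely pure.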
Accuracy vs Information Gain
Rules Don’t Need to be Applied in Order
- Rules that predict the same class can be re-ordered without affecting performance
- If rules are treated as un-ordered, rules associated with different classes might match at the same time
  - In that case you need to have a tie breaker
  - Maybe rule accuracy
  - Maybe based on prior probabilities of each class
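As a sketch of such a tie breaker, combining both criteria from the list above. The Rule type, its stored accuracy, and the class priors are all hypothetical stand-ins:

```java
import java.util.*;

// Hypothetical tie-breaker for unordered rule sets: when several rules match
// the same instance, prefer higher rule accuracy, then higher class prior.
public class TieBreak {
    record Rule(String predictedClass, double accuracy) {}

    static Rule pick(List<Rule> matching, Map<String, Double> classPrior) {
        return matching.stream()
                .max(Comparator.comparingDouble(Rule::accuracy)
                        .thenComparingDouble(r -> classPrior.get(r.predictedClass())))
                .orElseThrow();
    }

    public static void main(String[] args) {
        List<Rule> matching = List.of(new Rule("yum", 0.90), new Rule("ok", 0.90));
        Map<String, Double> prior = Map.of("yum", 0.5, "ok", 0.33);
        // Accuracies tie, so the class prior decides: yum wins.
        System.out.println(pick(matching, prior));
    }
}
```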
Rule Learning

@relation is-yummy

@attribute ice-cream {chocolate, vanilla, coffee, rocky-road, strawberry}
@attribute cake {chocolate, vanilla}
@attribute yummy {yum,good,ok}

@data
chocolate,chocolate,yum
vanilla,chocolate,good
coffee,chocolate,yum
coffee,vanilla,ok
rocky-road,chocolate,yum
strawberry,vanilla,ok

- Note that the rules below for each class consider different subsets of attributes
- Note that two conditions were necessary to most accurately predict yum: rule learning algorithms add conditions to rules until accuracy is high enough
- The more complex a rule becomes, the more likely it is to over-fit

If chocolate cake and not vanilla ice cream then yum
If vanilla ice cream then good
If vanilla cake then ok
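To try this in Weka itself, the data above could be run through JRip (Weka's RIPPER implementation). A sketch, assuming the ARFF is saved as is-yummy.arff (the file name is an assumption); note that on only six instances JRip's default minimum-coverage settings may collapse the output to a single default rule.

```java
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: learn rules with Weka's JRip on the ice-cream data above.
public class LearnRules {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("is-yummy.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // "yummy" is the class

        JRip ripper = new JRip();
        ripper.buildClassifier(data);
        System.out.println(ripper);                     // prints the learned rules
    }
}
```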
Rule Induction by Pruning Rules from Trees
- Rules can be read off of trees
- They will be overly complex
- But they can be pruned in a “greedy” fashion using the same principles discussed here
- You might get duplicate rules then, so remove those
- In practice this is very inefficient
Rules versus Trees
- Decision tree learning is a divide and conquer approach
- Top-down, looking for attributes that achieve useful splits in the data
- Trees can be converted into sets of rules

If you then Tutor
If not(you) and Imperative then Tutor
If not(you) and not(Imperative) and good then Tutor
If not(you) and not(Imperative) and not(good) and WordCount > 2 and not(all-I) then Tutor
If not(you) and not(Imperative) and not(good) and WordCount > 2 and all-I and not(So) then Student
If not(you) and not(Imperative) and not(good) and WordCount > 2 and all-I and So then Tutor
If not(you) and not(Imperative) and not(good) and WordCount <= 2 and not(on) then Student
If not(you) and not(Imperative) and not(good) and WordCount <= 2 and on then Tutor
Ordered Rules More Compact
- If rules are applied in order, then you can use an if-then-else structure
- But then you’re back to a tree representation

If you then Tutor
If not(you) and Imperative then Tutor
else if good then Tutor
else if WordCount > 2 then
  if not(all-I) then Tutor
  else if …..
Advantages of Classification Rules
- Decision trees can’t easily represent disjunctions
- Sometimes subtrees have to be repeated; this introduces a greater chance of error
- So rules are a more powerful representation, but more power can lead to more over-fitting!!!

If a and b then x
If c and d then x

[Tree diagram: root tests a; the subtree that tests c and then d has to appear twice, once under the failing branch of b and once under the failing branch of a]
Advantages of Classification Rules
- Classification rules express disjunctions more concisely
- Decision lists are meant to be applied in order (so context is assumed)
- Easy to encode “else” conditions

If a and b then x
If c and d then x

[Same tree diagram as above, with the repeated c/d subtree]
Rules Versus Trees
- Because both algorithms make one selection at a time, they will prefer different choices since the criteria are different
- Rule learning is more prone to over-fitting
  - Rule representations have more power (e.g., disjunctions)
- Rule learning algorithms tend to make decisions based on more local information
  - Even when Information Gain is used for choosing between options, the set of options considered is different
Pruning Rules
- Just as trees are grown and then pruned, rules are also grown and then pruned
  - Rather than one growth stage followed by one pruning stage, you alternate growth and pruning
- With rules, only reduced error pruning is used
  - Trees can be pruned using reduced error pruning or by estimating error on training data using confidence intervals
- Rules only have one pruning operation; trees have two pruning operations
Rule Learning Manipulations
- Pruning paradigms: How would this rule perform over the whole set by itself, versus how would this rule perform after other rules have fired? Do you start with a default? If so, what is that default?
- Pruning rule: remove the condition that improves the performance of the rule the most over a validation set (or remove conditions in reverse order), as sketched below
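A minimal sketch of that pruning rule, assuming a hypothetical accuracy function evaluated over a held-out validation set (the condition representation and Evaluator interface are invented for the example):

```java
import java.util.*;

// Sketch of reduced-error pruning for one rule: repeatedly remove the single
// condition whose removal most improves validation-set accuracy, stopping
// when no removal helps.
public class PruneRule {
    interface Evaluator { double accuracy(List<String> conditions); }

    static List<String> prune(List<String> conditions, Evaluator onValidation) {
        List<String> rule = new ArrayList<>(conditions);
        double best = onValidation.accuracy(rule);
        boolean improved = true;
        while (improved && !rule.isEmpty()) {
            improved = false;
            String toDrop = null;
            for (String c : rule) {
                List<String> candidate = new ArrayList<>(rule);
                candidate.remove(c);                        // try dropping one condition
                double acc = onValidation.accuracy(candidate);
                if (acc > best) { best = acc; toDrop = c; }
            }
            if (toDrop != null) { rule.remove(toDrop); improved = true; }
        }
        return rule;
    }
}
```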
Tic Tac Toe
[Diagram: a tic-tac-toe board position]
Tic Tac Toe: Remember this?
- Decision Trees: .67 Kappa
- SMO: .96 Kappa
- Naïve Bayes: .28 Kappa

[Diagram: the same tic-tac-toe board position]
Decision Trees

[Weka screenshot: the learned decision tree]

How do you think the rule model would be different?
Rules from JRIP

[Weka screenshot: the learned rule set]

.95 Kappa! When will it fail?
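One way to check this yourself is to run JRip over the tic-tac-toe endgame data. A sketch, assuming the UCI dataset is saved as tic-tac-toe.arff (the file name is an assumption); the exact kappa you get may differ from the slide's .95.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: cross-validate JRip on the tic-tac-toe data and report kappa.
public class TicTacToeRules {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("tic-tac-toe.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        JRip ripper = new JRip();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(ripper, data, 10, new Random(1));
        System.out.printf("Kappa: %.2f%n", eval.kappa());

        ripper.buildClassifier(data);   // train on all data to inspect the rules
        System.out.println(ripper);
    }
}
```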
Optimization
Why Trees and Rules are Sometimes Counter-Intuitive
- All machine learning algorithms are designed to avoid doing an exhaustive search of the vector space
- In order to reduce search time, they make simplifying assumptions that sometimes lead to counter-intuitive results
- We have talked about some variations on basic tree and rule learning
  - These affect which options are visible at each point in the search
Locally Optimal Solutions

[Diagram: the same search landscape as before, contrasting the optimal solution with a locally optimal solution]
Why Trees and Rules are Sometimes Counter-Intuitive
- The simplifying assumptions bias the search to favor certain regions of the hypothesis space
  - Different algorithms have different biases, so they look at a different subset of solutions
- When this bias leads the algorithm to an optimal or near-optimal solution, it is a useful bias
  - This depends largely on quirky characteristics of your data set
Why Trees and Rules are Sometimes Counter-Intuitive
- Simplifying assumptions increase efficiency but may decrease the quality of the derived solutions
  - Tunnel vision
  - Spurious regularities in the data lead to unpredictable results
- Tuning the parameters of an algorithm changes its bias (e.g., binary splits vs. not)
- You have to guard against overfitting!
Optimizing Parameter Settings

[Diagram: five numbered folds, each split into Train, Validation, and Test portions]

Use a modified form of cross-validation:
- Iterate over settings
- Compare performance over the validation set; pick the optimal setting
- Test on the test set
- Alternatively, you can have a hold-out validation set that you use for all folds
- Still N folds, but each fold has less training data than with standard cross-validation
Optimizing Parameter Settings

[Same diagram of folds as above]

- This approach assumes that you want to estimate the generalization you will get from your learning and tuning approach together.
- If you just want to know the best performance you can get on *this* set by tuning, you can just use standard cross-validation (see the sketch below).
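Weka's CVParameterSelection does roughly what these two slides describe: an inner cross-validation picks the parameter setting, and wrapping it in an outer cross-validation estimates the learning-plus-tuning procedure as a whole. A sketch, with the dataset file and the tuned parameter (-C, J48's pruning confidence) as assumptions:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.CVParameterSelection;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: tune a parameter on inner validation folds, then estimate
// generalization of learning + tuning together on outer folds.
public class TuneAndEstimate {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("tic-tac-toe.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        CVParameterSelection tuned = new CVParameterSelection();
        tuned.setClassifier(new J48());
        // Try -C from 0.1 to 0.5 in 5 steps, chosen by internal cross-validation.
        tuned.addCVParameter("C 0.1 0.5 5");

        // Outer cross-validation evaluates the whole tuned procedure.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tuned, data, 10, new Random(1));
        System.out.printf("Kappa with tuning: %.2f%n", eval.kappa());
    }
}
```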
Take Home Message
- Tree Based and Rule Based Learners are similar
  - Rules are readable
  - Greedy algorithms
  - Locally optimal solutions
- Tree Based and Rule Based Learners are different
  - Information gain versus accuracy
  - Representational power with respect to disjunctions