Machine Learning in Practice, Lecture 18

Carolyn Penstein Rosé
Language Technologies Institute / Human-Computer Interaction Institute
Plan for the Day
- Announcements
- Questions?
- Quiz Feedback
- Rule Based Learning
- Revisit the Tic Tac Toe Problem
- Start thinking about Optimization and Tuning
Quiz Feedback
- Only one person got everything right
- Were the readings confusing this time?
- Association Rule Mining vs. Rule Learning
Rule Based Learning
Rules Versus Trees
- Tree based learning is divide-and-conquer
  - Decisions are based on what will have the biggest overall effect on “purity” at leaf nodes
- Rule based learning is separate-and-conquer
  - Considers only one class at a time (usually starting with the smallest)
  - What separates this class from the default class?
Trees vs. Rules

[Weka screenshots: J48 output]
Locally Optimal Solutions

[Diagram: a search landscape in which a locally optimal solution sits apart from the optimal solution]
Covering Algorithms
- Rule based algorithms are called covering algorithms
- Whereas tree based algorithms take all classes into account at the same time, covering algorithms only consider one class at a time
- Rule based algorithms look for a set of conditions that achieves high accuracy on one class at a time
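To make separate-and-conquer concrete, here is a minimal sketch of a PRISM-style covering learner. This is an illustration, not Weka's implementation; the Instance and Condition types are invented for the example.

```java
import java.util.*;

// A minimal sketch of a separate-and-conquer (covering) learner in the
// spirit of PRISM. Not Weka's actual implementation; the Instance and
// Condition types here are hypothetical stand-ins.
public class Covering {
    record Instance(Map<String, String> attrs, String label) {}
    record Condition(String attr, String value) {
        boolean matches(Instance i) { return value.equals(i.attrs().get(attr)); }
    }

    // For one target class: grow a rule by greedily adding the condition
    // with the highest accuracy on the instances the rule still covers,
    // then remove the covered instances and learn the next rule.
    static List<List<Condition>> learnRules(List<Instance> data, String target,
                                            List<Condition> candidates) {
        List<List<Condition>> rules = new ArrayList<>();
        List<Instance> remaining = new ArrayList<>(data);
        while (remaining.stream().anyMatch(i -> i.label().equals(target))) {
            List<Condition> rule = new ArrayList<>();
            List<Instance> covered = remaining;
            // Grow until the rule covers only target-class instances.
            while (covered.stream().anyMatch(i -> !i.label().equals(target))) {
                Condition best = null;
                double bestAcc = -1.0;
                for (Condition c : candidates) {
                    if (rule.contains(c)) continue;   // don't re-add a condition
                    List<Instance> cov = covered.stream().filter(c::matches).toList();
                    if (cov.isEmpty()) continue;
                    double acc = cov.stream().filter(i -> i.label().equals(target)).count()
                                 / (double) cov.size();
                    if (acc > bestAcc) { bestAcc = acc; best = c; }
                }
                if (best == null) break;              // no usable condition left
                rule.add(best);
                Condition chosen = best;
                covered = covered.stream().filter(chosen::matches).toList();
            }
            rules.add(rule);
            remaining.removeAll(covered);  // "conquer": drop what this rule covers
        }
        return rules;
    }
}
```

Each pass grows one rule greedily by accuracy, then removes the instances it covers and starts the next rule: exactly the one-class-at-a-time behavior described above.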
Accuracy versus Information Gain

Parent set: [A A A A B B B B B]
Split 1: [A A A B] [A B B B B] (Accuracy: 78%, Information: .76)
Split 2: [A A A A B B] [B B B] (Accuracy: 78%, Information: .61)

* Note that lower resulting information means higher information gain.
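Checking the slide's numbers with the standard entropy formula: accuracy cannot distinguish the two splits (both classify 7 of 9 instances correctly), but the weighted average entropy of the children can.

```latex
\begin{align*}
H(p, q) &= -p \log_2 p - q \log_2 q\\
I_{\text{split 1}} &= \tfrac{4}{9} H\!\left(\tfrac{3}{4}, \tfrac{1}{4}\right)
  + \tfrac{5}{9} H\!\left(\tfrac{1}{5}, \tfrac{4}{5}\right)
  \approx \tfrac{4}{9}(0.811) + \tfrac{5}{9}(0.722) \approx 0.76\\
I_{\text{split 2}} &= \tfrac{6}{9} H\!\left(\tfrac{2}{3}, \tfrac{1}{3}\right)
  + \tfrac{3}{9} H(0, 1)
  \approx \tfrac{6}{9}(0.918) + \tfrac{3}{9}(0) \approx 0.61
\end{align*}
```

Information gain therefore prefers Split 2, whose [B B B] child is completely pure.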
Accuracy vs Information Gain
Rules Don’t Need to be Applied in Order
- Rules that predict the same class can be re-ordered without affecting performance
- If rules are treated as un-ordered, rules associated with different classes might match at the same time
  - In that case you need to have a tie breaker
  - Maybe rule accuracy
  - Maybe based on prior probabilities of each class
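As a sketch of such a tie breaker, combining both criteria from the list above. The Rule type, its stored accuracy, and the class priors are all hypothetical stand-ins:

```java
import java.util.*;

// Hypothetical tie-breaker for unordered rule sets: when several rules match
// the same instance, prefer higher rule accuracy, then higher class prior.
public class TieBreak {
    record Rule(String predictedClass, double accuracy) {}

    static Rule pick(List<Rule> matching, Map<String, Double> classPrior) {
        return matching.stream()
                .max(Comparator.comparingDouble(Rule::accuracy)
                        .thenComparingDouble(r -> classPrior.get(r.predictedClass())))
                .orElseThrow();
    }

    public static void main(String[] args) {
        List<Rule> matching = List.of(new Rule("yum", 0.90), new Rule("ok", 0.90));
        Map<String, Double> prior = Map.of("yum", 0.5, "ok", 0.33);
        // Accuracies tie, so the class prior decides: yum wins.
        System.out.println(pick(matching, prior));
    }
}
```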
Rule Learning

@relation is-yummy

@attribute ice-cream {chocolate, vanilla, coffee, rocky-road, strawberry}
@attribute cake {chocolate, vanilla}
@attribute yummy {yum,good,ok}

@data
chocolate,chocolate,yum
vanilla,chocolate,good
coffee,chocolate,yum
coffee,vanilla,ok
rocky-road,chocolate,yum
strawberry,vanilla,ok

- Note that the rules below for each class consider different subsets of attributes
- Note that two conditions were necessary to most accurately predict yum: rule learning algorithms add conditions to rules until accuracy is high enough
- The more complex a rule becomes, the more likely it is to over-fit

If chocolate cake and not vanilla ice cream then yum
If vanilla ice cream then good
If vanilla cake then ok
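To try this in Weka itself, the data above could be run through JRip (Weka's RIPPER implementation). A sketch, assuming the ARFF is saved as is-yummy.arff (the file name is an assumption); note that on only six instances JRip's default minimum-coverage settings may collapse the output to a single default rule.

```java
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: learn rules with Weka's JRip on the ice-cream data above.
public class LearnRules {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("is-yummy.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // "yummy" is the class

        JRip ripper = new JRip();
        ripper.buildClassifier(data);
        System.out.println(ripper);                     // prints the learned rules
    }
}
```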
Rule Induction by Pruning Rules from Trees
- Rules can be read off of trees
- They will be overly complex
- But they can be pruned in a “greedy” fashion using the same principles discussed here
- You might get duplicate rules then, so remove those
- In practice this is very inefficient
Rules versus Trees
- Decision tree learning is a divide and conquer approach
- Top-down, looking for attributes that achieve useful splits in the data
- Trees can be converted into sets of rules

If you then Tutor
If not(you) and Imperative then Tutor
If not(you) and not(Imperative) and good then Tutor
If not(you) and not(Imperative) and not(good) and WordCount > 2 and not(all-I) then Tutor
If not(you) and not(Imperative) and not(good) and WordCount > 2 and all-I and not(So) then Student
If not(you) and not(Imperative) and not(good) and WordCount > 2 and all-I and So then Tutor
If not(you) and not(Imperative) and not(good) and WordCount <= 2 and not(on) then Student
If not(you) and not(Imperative) and not(good) and WordCount <= 2 and on then Tutor
Ordered Rules More Compact
- If rules are applied in order, then you can use an if-then-else structure
- But then you’re back to a tree representation

If you then Tutor
If not(you) and Imperative then Tutor
else if good then Tutor
else if WordCount > 2 then
  if not(all-I) then Tutor
  else if …..
Advantages of Classification Rules
- Decision trees can’t easily represent disjunctions
- Sometimes subtrees have to be repeated; this introduces a greater chance of error
- So rules are a more powerful representation, but more power can lead to more over-fitting!!!

If a and b then x
If c and d then x

[Tree diagram: root tests a; the subtree that tests c and then d has to appear twice, once under the failing branch of b and once under the failing branch of a]
Advantages of Classification Rules
- Classification rules express disjunctions more concisely
- Decision lists are meant to be applied in order (so context is assumed)
- Easy to encode “else” conditions

If a and b then x
If c and d then x

[Same tree diagram as above, with the repeated c/d subtree]
Rules Versus Trees
- Because both algorithms make one selection at a time, they will prefer different choices since the criteria are different
- Rule learning is more prone to over-fitting
  - Rule representations have more power (e.g., disjunctions)
- Rule learning algorithms tend to make decisions based on more local information
  - Even when Information Gain is used for choosing between options, the set of options considered is different
Pruning Rules
- Just as trees are grown and then pruned, rules are also grown and then pruned
  - Rather than one growth stage followed by one pruning stage, you alternate growth and pruning
- With rules, only reduced error pruning is used
  - Trees can be pruned using reduced error pruning or by estimating error on training data using confidence intervals
- Rules only have one pruning operation; trees have two pruning operations
Rule Learning Manipulations
- Pruning paradigms: How would this rule perform over the whole set by itself, versus how would this rule perform after other rules have fired? Do you start with a default? If so, what is that default?
- Pruning rule: remove the condition that improves the performance of the rule the most over a validation set (or remove conditions in reverse order), as sketched below
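A minimal sketch of that pruning rule, assuming a hypothetical accuracy function evaluated over a held-out validation set (the condition representation and Evaluator interface are invented for the example):

```java
import java.util.*;

// Sketch of reduced-error pruning for one rule: repeatedly remove the single
// condition whose removal most improves validation-set accuracy, stopping
// when no removal helps.
public class PruneRule {
    interface Evaluator { double accuracy(List<String> conditions); }

    static List<String> prune(List<String> conditions, Evaluator onValidation) {
        List<String> rule = new ArrayList<>(conditions);
        double best = onValidation.accuracy(rule);
        boolean improved = true;
        while (improved && !rule.isEmpty()) {
            improved = false;
            String toDrop = null;
            for (String c : rule) {
                List<String> candidate = new ArrayList<>(rule);
                candidate.remove(c);                        // try dropping one condition
                double acc = onValidation.accuracy(candidate);
                if (acc > best) { best = acc; toDrop = c; }
            }
            if (toDrop != null) { rule.remove(toDrop); improved = true; }
        }
        return rule;
    }
}
```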
Tic Tac Toe
[Diagram: a tic-tac-toe board position]
Tic Tac Toe: Remember this?
- Decision Trees: .67 Kappa
- SMO: .96 Kappa
- Naïve Bayes: .28 Kappa

[Diagram: the same tic-tac-toe board position]
Decision Trees

[Weka screenshot: the learned decision tree]

How do you think the rule model would be different?
Rules from JRIP

[Weka screenshot: the learned rule set]

.95 Kappa! When will it fail?
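One way to check this yourself is to run JRip over the tic-tac-toe endgame data. A sketch, assuming the UCI dataset is saved as tic-tac-toe.arff (the file name is an assumption); the exact kappa you get may differ from the slide's .95.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: cross-validate JRip on the tic-tac-toe data and report kappa.
public class TicTacToeRules {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("tic-tac-toe.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        JRip ripper = new JRip();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(ripper, data, 10, new Random(1));
        System.out.printf("Kappa: %.2f%n", eval.kappa());

        ripper.buildClassifier(data);   // train on all data to inspect the rules
        System.out.println(ripper);
    }
}
```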
Optimization
Why Trees and Rules are Sometimes Counter-Intuitive
- All machine learning algorithms are designed to avoid doing an exhaustive search of the vector space
- In order to reduce search time, they make simplifying assumptions that sometimes lead to counter-intuitive results
- We have talked about some variations on basic tree and rule learning
  - These affect which options are visible at each point in the search
Locally Optimal Solutions

[Diagram: the same search landscape as before, contrasting the optimal solution with a locally optimal solution]
Why Trees and Rules are Sometimes Counter-Intuitive
- The simplifying assumptions bias the search to favor certain regions of the hypothesis space
  - Different algorithms have different biases, so they look at a different subset of solutions
- When this bias leads the algorithm to an optimal or near-optimal solution, it is a useful bias
  - This depends largely on quirky characteristics of your data set
Why Trees and Rules are Sometimes Counter-Intuitive
- Simplifying assumptions increase efficiency but may decrease the quality of the derived solutions
  - Tunnel vision
  - Spurious regularities in the data lead to unpredictable results
- Tuning the parameters of an algorithm changes its bias (e.g., binary splits vs. not)
- You have to guard against overfitting!
Optimizing Parameter Settings

[Diagram: five numbered folds, each split into Train, Validation, and Test portions]

Use a modified form of cross-validation:
- Iterate over settings
- Compare performance over the validation set; pick the optimal setting
- Test on the test set
- Alternatively, you can have a hold-out validation set that you use for all folds
- Still N folds, but each fold has less training data than with standard cross-validation
Optimizing Parameter Settings

[Same diagram of folds as above]

- This approach assumes that you want to estimate the generalization you will get from your learning and tuning approach together.
- If you just want to know the best performance you can get on *this* set by tuning, you can just use standard cross-validation (see the sketch below).
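Weka's CVParameterSelection does roughly what these two slides describe: an inner cross-validation picks the parameter setting, and wrapping it in an outer cross-validation estimates the learning-plus-tuning procedure as a whole. A sketch, with the dataset file and the tuned parameter (-C, J48's pruning confidence) as assumptions:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.CVParameterSelection;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: tune a parameter on inner validation folds, then estimate
// generalization of learning + tuning together on outer folds.
public class TuneAndEstimate {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("tic-tac-toe.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        CVParameterSelection tuned = new CVParameterSelection();
        tuned.setClassifier(new J48());
        // Try -C from 0.1 to 0.5 in 5 steps, chosen by internal cross-validation.
        tuned.addCVParameter("C 0.1 0.5 5");

        // Outer cross-validation evaluates the whole tuned procedure.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tuned, data, 10, new Random(1));
        System.out.printf("Kappa with tuning: %.2f%n", eval.kappa());
    }
}
```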
Take Home Message
- Tree Based and Rule Based Learners are similar
  - Rules are readable
  - Greedy algorithms
  - Locally optimal solutions
- Tree Based and Rule Based Learners are different
  - Information gain versus accuracy
  - Representational power with respect to disjunctions