Machine Learning in Practice, Lecture 7
Carolyn Penstein Rosé
Language Technologies Institute / Human-Computer Interaction Institute
Plan for the Day
Announcements:
No new homework this week
No quiz this week
Project proposal due by the end of the week
Naïve Bayes review
Linear model review
Tic Tac Toe across models
Project Proposals
If you are using one of the prefabricated projects on Blackboard, let me know which one.
Otherwise, tell me:
What data you are using
Number of instances
What you're predicting
What features you are working with
Give a short description of what your ideas are for improving performance.
If convenient, let me know what the baseline performance is.
Note: you can use your own data for the assignments from now on.
Example of ideas: How could you expand on what's here?
Add features that describe the source.
Add features that describe things that were going on during the time when the poll was taken.
Add features that describe personal characteristics of the candidates.
Getting the Baseline Performance
Percent correct
Percent correct, controlling for correct by chance
Performance on individual categories
Confusion matrix
* Right click in the Result list and select Save Result Buffer to save performance stats.
Clarification about Cohen's Kappa
Assume 2 coders were assigning instances to category A or category B, and you want to measure their agreement.

                  Coder 2's Codes
                    A     B   Total
Coder 1's    A      5     2      7
Codes        B      1     8      9
Total               6    10     16

Total agreements = 13
Percent agreement = 13/16 = .81
Agreement by chance = Sum_i (Row_i * Col_i) / OverallTotal = 7*6/16 + 9*10/16 = 2.63 + 5.63 = 8.3
Kappa = (Total agreement - Agreement by chance) / (Overall total - Agreement by chance) = (13 - 8.3) / (16 - 8.3) = 4.7 / 7.7 = .61
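The same computation as a minimal Python sketch; the function name cohens_kappa and the list-of-lists table format are my own choices, not anything from the lecture.

```python
def cohens_kappa(table):
    """table[i][j] = # of instances coder 1 put in category i, coder 2 in j."""
    overall = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    agreements = sum(table[i][i] for i in range(len(table)))
    # Agreement expected by chance: sum over i of Row_i * Col_i / OverallTotal
    chance = sum(r * c / overall for r, c in zip(row_totals, col_totals))
    return (agreements - chance) / (overall - chance)

# The table from the slide: rows are coder 1's codes, columns are coder 2's
print(round(cohens_kappa([[5, 2], [1, 8]]), 2))   # 0.61
```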
Naïve Bayes Review
Naïve Bayes Simulation
You can modify the class counts and the counts for each attribute value within each class.
You can also turn smoothing on or off.
Finally, you can manipulate the attribute values for the instance you want to classify with your model.
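A minimal sketch of what the simulation lets you manipulate, written here in Python with invented class names, attributes, and counts (the real simulation is a separate tool): class counts, per-class counts for each attribute value, and a smoothing toggle.

```python
class_counts = {"yes": 6, "no": 4}

# counts[cls][attr][value] = how often `value` occurs for `attr` within `cls`
counts = {
    "yes": {"outlook": {"sunny": 2, "rainy": 4}},
    "no":  {"outlook": {"sunny": 3, "rainy": 1}},
}

def classify(instance, smoothing=True):
    total = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total                      # prior probability P(cls)
        for attr, value in instance.items():
            value_counts = counts[cls][attr]
            add = 1 if smoothing else 0            # Laplace smoothing: add 1 ...
            denom = n_cls + (len(value_counts) if smoothing else 0)  # ... per value
            score *= (value_counts.get(value, 0) + add) / denom      # P(value | cls)
        scores[cls] = score
    return max(scores, key=scores.get)

print(classify({"outlook": "sunny"}))              # -> "no"
print(classify({"outlook": "sunny"}, smoothing=False))
```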
Linear Model Review
Remember this: What do concepts look like?

What are we learning?
We're learning to draw a line through a multidimensional space (really a "hyperplane").
Each function we learn is like a single split in a decision tree, but it can take many features into account at one time rather than just one.
F(x) = C0 + C1X1 + C2X2 + C3X3
X1-Xn are our attributes.
C0-Cn are coefficients.
We're learning the coefficients, which are weights.
What do linear models do?
Notice that what you want to predict is a number.
You use the number to order instances.
You want to learn a function that can produce the same ordering.
Linear models literally add evidence.
Result = 2*A - B - 3*C
Actual values fall between 2 and -4, rather than between 1 and 5, but the order is the same.
Order affects correlation; the actual value affects absolute error.
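A small sketch of that point with invented instances and actual values: the function's outputs range from -4 to 2 while the targets range from 1 to 5, yet the two orderings agree, so rank order (and hence correlation) is perfect even though absolute errors are large.

```python
def f(a, b, c):
    return 2 * a - 1 * b - 3 * c   # the slide's Result = 2*A - B - 3*C

data = [                 # ((A, B, C), actual value) -- made-up instances
    ((1, 0, 0), 5),
    ((1, 1, 0), 4),
    ((0, 1, 0), 2),
    ((0, 1, 1), 1),
]

predicted = [f(*x) for x, _ in data]   # [2, 1, -1, -4]
actual = [y for _, y in data]          # [5, 4, 2, 1]

def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    return [order.index(i) for i in range(len(values))]

print(ranks(predicted) == ranks(actual))           # True: same ordering
print([p - a for p, a in zip(predicted, actual)])  # large absolute errors
```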
What do linear models do?
If what you want to predict is a category, you can assign values to ranges:
Sort instances based on predicted value.
Cut based on a threshold, i.e., Val1 where f(x) < 0, Val2 otherwise.
Result = 2*A - B - 3*C
Actual values fall between 2 and -4, rather than between 1 and 5, but the order is the same.
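A sketch of that thresholding idea, again with made-up instances; the cut at 0 follows the slide's Val1/Val2 rule.

```python
def f(a, b, c):
    return 2 * a - 1 * b - 3 * c   # the slide's Result = 2*A - B - 3*C

instances = [(1, 0, 0), (1, 1, 0), (0, 1, 0), (0, 0, 1)]

# Sort by predicted value, then cut at the threshold 0
for x in sorted(instances, key=lambda inst: f(*inst), reverse=True):
    label = "Val1" if f(*x) < 0 else "Val2"
    print(x, f(*x), label)
```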
What do linear models do?
F(x) = C0 + C1X1 + C2X2 + C3X3
X1-Xn are our attributes.
C0-Cn are coefficients.
We're learning the coefficients, which are weights.
Think of linear models as imposing a ranking on instances:
Features associated with one class get negative weights.
Features associated with the other class get positive weights.
More on Linear Regression
Linear regressions try to minimize the sum of the squares of the differences between predicted values and actual values over all training instances:
Sum over all instances [ (predicted value of instance - actual value of instance)^2 ]
Note that this is different from backpropagation for neural nets, which minimizes the error at the output nodes considering only one training instance at a time.
What is learned is a set of weights (not probabilities!).
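To make the sum-of-squares objective concrete, here is a sketch of an ordinary least-squares fit with numpy on invented data (the course itself uses Weka; this is just illustration).

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0]])  # attributes
y = np.array([3.0, 4.0, 7.0, 9.0])                              # actual values

Xb = np.hstack([np.ones((len(X), 1)), X])          # column of 1s for C0
weights, *_ = np.linalg.lstsq(Xb, y, rcond=None)   # minimizes the sum of squares

predictions = Xb @ weights
sse = np.sum((predictions - y) ** 2)               # the minimized quantity
print(weights, sse)
```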
Limitations of Linear Regressions
Can only handle numeric attributes.
What do you do with your nominal attributes?
You could turn them into numeric attributes, for example: red = 1, blue = 2, orange = 3.
But is red really less than blue? Is red closer to blue than it is to orange?
If you treat your attributes in an unnatural way, your algorithms may make unwanted inferences about relationships between instances.
Another option is to turn nominal attributes into sets of binary attributes (see the sketch below).
Note: Some people said on the homework that linear models don't handle nominal attributes, and I disagreed. The reason I disagreed is that you CAN have nominal attributes; you just have to represent them in a way the model can deal with.
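A sketch of that binary-attribute option, using the slide's colors; the helper name one_hot is mine.

```python
values = ["red", "blue", "orange"]

def one_hot(color):
    # One binary attribute per possible value; no ordering is implied.
    return [1 if color == v else 0 for v in values]

print(one_hot("red"))     # [1, 0, 0]
print(one_hot("orange"))  # [0, 0, 1]
```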
Performing well with skewed class distributions
Naïve Bayes has trouble with skewed class distributions because of the contribution of prior probabilities.
Remember our math problem case.
Linear models can compensate for this: they don't have any notion of prior probability per se.
If there is a good split in the data, they will find it wherever it is.
It is a problem if there is not a good split.
[Figure: skewed but clean separation]
[Figure: skewed but no clean separation]
Taking a Step Back
Linear models have rules composed of numbers, so they "look" more like Naïve Bayes than like Decision Trees.
But the numbers are obtained through a focus on achieving accuracy, so the learning process is more like Decision Trees.
Given these two properties, what can you say about the assumptions about the form of the solution and the assumptions about the world that are made?
Tic Tac Toe
Tic Tac Toe
What algorithm do you think would work best?
How would you represent the feature space? (See the sketch below the board.)
What cases do you think would be hard?
[Figure: example Tic Tac Toe board]
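One plausible representation, as a sketch (not necessarily the format of the course data set): one nominal attribute per square, expanded into binary attributes as discussed earlier so a linear model can use them. The board contents are illustrative.

```python
board = ["x", "x", "o",
         "o", "o", "x",
         "x", "b", "b"]        # b = blank; read left-to-right, top-to-bottom

def encode(board):
    features = []
    for square in board:
        features += [1 if square == s else 0 for s in ("x", "o", "b")]
    return features            # 9 nominal attributes -> 27 binary features

print(len(encode(board)), encode(board)[:6])
```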
Tic Tac Toe
Decision Trees: .67 Kappa
SMO: .96 Kappa
Naïve Bayes: .28 Kappa
What do you think is different about what these algorithms are learning?
Decision Trees

Naïve Bayes
Each conditional probability is based on each square in isolation.
Can you guess which square is most informative?
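A sketch of what "each square in isolation" means: Naïve Bayes only ever sees, for each square, how often each mark appears within each class. The boards below are invented stand-ins; on the real data, tallies like these are what make the center square (index 4) stand out.

```python
from collections import Counter

games = [                      # (board as a 9-character string, label)
    ("xxooxoxbx", "win"),
    ("oxbbxobxo", "win"),
    ("xxoooxxob", "loss"),
    ("obxxobxbo", "loss"),
]

for square in range(9):
    counts = {"win": Counter(), "loss": Counter()}
    for board, label in games:
        counts[label][board[square]] += 1
    # Per-square mark counts within each class -> conditional probabilities
    print(square, dict(counts["win"]), dict(counts["loss"]))
```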
Linear Function
Counts every X as evidence of winning.
If there are more X's, then it's a win for X.
Usually right, except in the case of a tie.
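A sketch of that failure mode with a concrete tie board; the board and the toy scoring function are invented.

```python
def predict_win_for_x(board):
    score = board.count("x") - board.count("o")   # positive weight on every x
    return score > 0

tie_board = ["x", "o", "x",
             "x", "o", "o",
             "o", "x", "x"]    # full board, no three in a row for anyone

print(predict_win_for_x(tie_board))               # True, but the game is a tie
```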
Take Home Message
Naïve Bayes is affected by prior probabilities in two places.
Note that prior probabilities have an indirect effect on all conditional probabilities.
Linear functions are not directly affected by prior probabilities, so sometimes they can perform better on skewed data sets.
Even with the same data representation, different algorithms learn something different:
Naïve Bayes learned that the center square is important.
Decision trees memorized important trees.
The linear function counted Xs.