Machine Learning in Practice Lecture 7 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute


Page 1

Machine Learning in Practice, Lecture 7

Carolyn Penstein Rosé

Language Technologies Institute / Human-Computer Interaction Institute

Page 2

Plan for the Day
- Announcements
  - No new homework this week
  - No quiz this week
  - Project proposal due by the end of the week
- Naïve Bayes review
- Linear model review
- Tic Tac Toe across models

[Tic Tac Toe board shown on slide]

Page 3

Project Proposals
- If you are using one of the prefabricated projects on Blackboard, let me know which one
- Otherwise, tell me what data you are using
  - Number of instances
  - What you're predicting
  - What features you are working with
- Short description of what your ideas are for improving performance
- If convenient, let me know what the baseline performance is
- Note: you can use your own data for the assignments from now on….

Page 4

Example of ideas: How could you expand on what’s here?

Page 5

Example of ideas: How could you expand on what’s here?

Add features that describe the source

Page 6

Example of ideas: How could you expand on what’s here?

Add features that describe things that were going on during the time when the poll was taken

Page 7

Example of ideas: How could you expand on what’s here?

Add features that describe personal characteristics of the candidates

Page 8

Getting the Baseline Performance
- Percent correct
- Percent correct, controlling for what would be correct by chance
- Performance on individual categories
- Confusion matrix
* Right click in the Result list and select Save Result Buffer to save performance stats.

Page 9

Clarification about Cohen’s Kappa

Assume 2 coders were assigning instances to category A or category B, and you want to measure their agreement.

                      Coder 2's Codes
                       A      B    Total
Coder 1's Codes   A    5      2      7
                  B    1      8      9
Total                  6     10     16

Total agreements = 13
Percent agreement = 13/16 = .81
Agreement by chance = Σ_i (Row_i × Col_i) / Overall Total = 7×6/16 + 9×10/16 = 2.63 + 5.63 = 8.3
Kappa = (Total agreement − Agreement by chance) / (Overall total − Agreement by chance) = (13 − 8.3) / (16 − 8.3) = 4.7 / 7.7 = .61
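As a sanity check, here is a minimal Python sketch that reproduces the kappa calculation above from the agreement table (the counts are the ones on the slide):

# Minimal sketch: Cohen's kappa for the 2x2 agreement table above.
table = [[5, 2],   # Coder 1 said A: Coder 2 said A 5 times, B 2 times
         [1, 8]]   # Coder 1 said B: Coder 2 said A 1 time,  B 8 times

total = sum(sum(row) for row in table)                    # 16
agreements = sum(table[i][i] for i in range(len(table)))  # 13
row_totals = [sum(row) for row in table]                  # [7, 9]
col_totals = [sum(col) for col in zip(*table)]            # [6, 10]

# Expected agreements by chance, expressed in counts rather than proportions
chance = sum(r * c / total for r, c in zip(row_totals, col_totals))  # 8.25

kappa = (agreements - chance) / (total - chance)
print(round(kappa, 2))  # 0.61, matching the slide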

Page 10

Naïve Bayes Review

Page 11

Naïve Bayes Simulation
- You can modify the class counts and the counts for each attribute value within each class.
- You can also turn smoothing on or off.
- Finally, you can manipulate the attribute values for the instance you want to classify with your model.
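The calculation behind the simulation can be sketched in a few lines of Python. Everything concrete here (the class names, the single attribute, and the counts) is made up for illustration; smoothing is a simple Laplace add-one toggled by a flag:

# Toy Naive Bayes calculation mirroring the simulation (hypothetical numbers).
class_counts = {"win": 6, "lose": 4}
# counts[class][attribute][value] = how many training instances matched
counts = {
    "win":  {"center": {"X": 5, "O": 1}},
    "lose": {"center": {"X": 1, "O": 3}},
}

def score(label, instance, smoothing=True):
    # Unnormalized P(class) * product of P(value | class)
    total = sum(class_counts.values())
    prob = class_counts[label] / total          # prior probability
    for attr, value in instance.items():
        value_counts = counts[label][attr]
        k = 1 if smoothing else 0               # Laplace smoothing on/off
        num = value_counts.get(value, 0) + k
        den = class_counts[label] + k * len(value_counts)
        prob *= num / den
    return prob

instance = {"center": "X"}
scores = {c: score(c, instance) for c in class_counts}
norm = sum(scores.values())
print({c: round(s / norm, 2) for c, s in scores.items()})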

Page 12

Naïve Bayes Simulation


Page 13

Naïve Bayes Simulation


Page 14

Linear Model Review

Page 15

Remember this: What do concepts look like?

Page 16

What are we learning?
- We're learning to draw a line through a multidimensional space
  - Really a "hyperplane"
- Each function we learn is like a single split in a decision tree
  - But it can take many features into account at one time, rather than just one
- F(x) = C0 + C1X1 + C2X2 + C3X3
  - X1-Xn are our attributes
  - C0-Cn are coefficients
  - We're learning the coefficients, which are weights

Page 17

What do linear models do?
- Notice that what you want to predict is a number
- You use the number to order instances
- You want to learn a function that can get the same ordering
- Linear models literally add evidence
Result = 2*A - B - 3*C
The actual values are between 2 and -4, rather than between 1 and 5, but the order is the same.
Order affects correlation; the actual value affects absolute error.
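As a hedged sketch of this point, the toy data below is invented, but it shows a linear score that lives on a different scale from the target (2 down to -4 instead of 5 down to 1) while preserving the same ordering:

# Hypothetical instances with binary features A, B, C and a target rating 1-5.
instances = [
    {"A": 1, "B": 0, "C": 0, "actual": 5},
    {"A": 1, "B": 1, "C": 0, "actual": 4},
    {"A": 0, "B": 0, "C": 1, "actual": 2},
    {"A": 0, "B": 1, "C": 1, "actual": 1},
]

def predict(x):
    # The linear model from the slide: Result = 2*A - B - 3*C
    return 2 * x["A"] - x["B"] - 3 * x["C"]

for x in instances:
    print(x["actual"], predict(x))  # (5, 2), (4, 1), (2, -3), (1, -4)

# Sorting by prediction gives the same order as sorting by "actual":
# the order drives correlation, the raw values drive absolute error.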

Page 18

What do linear models do?
- If what you want to predict is a category, you can assign values to ranges
  - Sort instances based on predicted value
  - Cut based on a threshold, i.e., Val1 where f(x) < 0, Val2 otherwise
Result = 2*A - B - 3*C
The actual values are between 2 and -4, rather than between 1 and 5, but the order is the same.
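Continuing the toy sketch from the previous slide (reusing predict and instances from there), turning the numeric prediction into a category is just a threshold test; the cutoff of 0 and the placeholder labels Val1/Val2 come from the slide:

# Assign a category based on which side of the threshold the score falls.
def classify(x, threshold=0.0):
    return "Val1" if predict(x) < threshold else "Val2"

for x in instances:
    print(predict(x), classify(x))  # 2 Val2, 1 Val2, -3 Val1, -4 Val1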

Page 19

What do linear models do?
F(x) = C0 + C1X1 + C2X2 + C3X3
- X1-Xn are our attributes
- C0-Cn are coefficients
- We're learning the coefficients, which are weights
- Think of linear models as imposing a ranking on instances
  - Features associated with one class get negative weights
  - Features associated with the other class get positive weights

Page 20

More on Linear Regression
- Linear regressions try to minimize the sum of the squares of the differences between predicted values and actual values over all training instances
  - Sum over all instances [ Square(predicted value of instance − actual value of instance) ]
- Note that this is different from back propagation for neural nets, which minimizes the error at the output nodes considering only one training instance at a time
- What is learned is a set of weights (not probabilities!)
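As an illustration of minimizing the summed squared error over all training instances at once (a generic least-squares sketch, not a description of any particular Weka implementation), the coefficients can be found in closed form with numpy:

import numpy as np

# Toy training data: 3 numeric attributes per instance, plus a target value.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0]])
y = np.array([3.0, 1.0, 4.0, 2.0])

# Add a column of ones so C0 (the intercept) is learned along with C1..C3.
X1 = np.column_stack([np.ones(len(X)), X])

# Least squares: pick the coefficients minimizing sum((X1 @ c - y) ** 2)
# over *all* training instances at once.
coeffs, residuals, rank, _ = np.linalg.lstsq(X1, y, rcond=None)
print("C0..C3 =", np.round(coeffs, 3))
print("predictions =", np.round(X1 @ coeffs, 3))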

Page 21

Limitations of Linear Regressions
- Can only handle numeric attributes
- What do you do with your nominal attributes?
  - You could turn them into numeric attributes
  - For example: red = 1, blue = 2, orange = 3
  - But is red really less than blue? Is red closer to blue than it is to orange?
  - If you treat your attributes in an unnatural way, your algorithm may make unwanted inferences about relationships between instances
- Another option is to turn nominal attributes into sets of binary attributes (see the sketch after this list)
- Note: Some people said on the homework that linear models don't handle nominal attributes, and I disagreed. The reason is that you CAN have nominal attributes; you just have to represent them in a way the model can deal with.
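A minimal sketch of that binary-attribute option (the attribute name and values are hypothetical): each nominal value becomes its own 0/1 attribute, so no artificial ordering like red < blue is implied.

# Turn one nominal attribute into a set of binary ("one-hot") attributes.
def to_binary(value, all_values):
    return {"color=" + v: int(value == v) for v in all_values}

colors = ["red", "blue", "orange"]
print(to_binary("blue", colors))
# {'color=red': 0, 'color=blue': 1, 'color=orange': 0}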

Page 22

Performing well with skewed class distributions
- Naïve Bayes has trouble with skewed class distributions because of the contribution of prior probabilities
  - Remember our math problem case
- Linear models can compensate for this
  - They don't have any notion of prior probability per se
  - If they can find a good split on the data, they will find it wherever it is
  - Problem if there is not a good split

Page 23

Skewed but clean separation

Page 24

Skewed but clean separation

Page 25

Skewed but no clean separation

Page 26

Skewed but no clean separation

Page 27

Taking a Step Back
- Linear models have rules composed of numbers
  - So they "look" more like Naïve Bayes than like Decision Trees
- But the numbers are obtained through a focus on achieving accuracy
  - So the learning process is more like Decision Trees
- Given these two properties, what can you say about the assumptions that are made about the form of the solution and about the world?

Page 28

Tic Tac Toe

Page 29

Tic Tac Toe
- What algorithm do you think would work best?
- How would you represent the feature space? (one possible encoding is sketched after the board below)
- What cases do you think would be hard?

[Tic Tac Toe board shown on slide]
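One plausible encoding for the feature-space question above (an assumption on my part, not necessarily the representation used in class): nine nominal attributes, one per square, each taking the value x, o, or b for blank, plus a class label saying whether X ends up winning.

# A finished board as nine square-valued features plus a class label.
# Squares are read left-to-right, top-to-bottom; values are "x", "o", or "b".
board = {
    "top-left": "x", "top-middle": "x", "top-right": "o",
    "mid-left": "x", "mid-middle": "o", "mid-right": "o",
    "bot-left": "x", "bot-middle": "x", "bot-right": "o",
}
label = "x-wins"  # X holds the whole left column on this hypothetical board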

Page 30

Tic Tac Toe

[Tic Tac Toe board shown on slide]

Page 31

Tic Tac Toe
- Decision Trees: .67 Kappa
- SMO: .96 Kappa
- Naïve Bayes: .28 Kappa
- What do you think is different about what these algorithms are learning?

[Tic Tac Toe board shown on slide]

Page 32

Decision Trees

Page 33

Naïve Bayes

Each conditional probability is based on each square in isolation

Can you guess which square is most informative?

[Tic Tac Toe board shown on slide]

Page 34

Linear Function
- Counts every X as evidence of winning
- If there are more X's, then it's a win for X
- Usually right, except in the case of a tie

[Tic Tac Toe board shown on slide]
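A rough caricature of the behavior described on this slide (my own simplification, not the actual learned weights): give every X the same positive weight and every O the same negative weight, and the sign of the sum usually picks the winner.

def count_evidence(board):
    # Sum +1 for each X and -1 for each O, as a crude linear score.
    return sum(+1 if v == "x" else -1 if v == "o" else 0 for v in board.values())

score = count_evidence(board)  # reusing the hypothetical board from the earlier sketch
print(score, "x-wins" if score > 0 else "o-wins-or-draw")
# 5 X's vs 4 O's gives +1, so it says X won -- usually right,
# but a drawn board that also has 5 X's and 4 O's would be scored the same way.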

Page 35

Take Home Message
- Naïve Bayes is affected by prior probabilities in two places
  - Note that prior probabilities have an indirect effect on all conditional probabilities
- Linear functions are not directly affected by prior probabilities
  - So sometimes they can perform better on skewed data sets
- Even with the same data representation, different algorithms learn something different
  - Naïve Bayes learned that the center square is important
  - Decision trees memorized important trees
  - Linear function counted Xs