cs 4100: artificial intelligence...cs 4100: artificial intelligence perceptronsand logistic...

31
CS 4100: Artificial Intelligence Perceptrons and Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Upload: others

Post on 08-Mar-2021

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

CS 4100: Artificial IntelligencePerceptrons and Logistic Regression

Jan-Willem van de Meent, Northeastern University

[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Page 2: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Linear Classifiers

Page 3: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Feature Vectors

Hello,

Do you want free printrcartriges? Why pay more when you can get them ABSOLUTELY FREE! Just

# free : 2YOUR_NAME : 0MISSPELLED : 2FROM_FRIEND : 0...

SPAMor+

PIXEL-7,12 : 1PIXEL-7,13 : 0...NUM_LOOPS : 1...

“2”

Page 4: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Some (Simplified) Biology

• Very loose inspiration: human neurons

Page 5: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Linear Classifiers

• Inputs are feature values• Each feature has a weight• Sum is the activation

• If the activation is:• Positive, output +1• Negative, output -1

Sf1f2f3

w1

w2

w3>0?

Page 6: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Weights• Binary case: compare features to a weight vector• Learning: figure out the weight vector from examples

# free : 2YOUR_NAME : 0MISSPELLED : 2FROM_FRIEND : 0...

# free : 4YOUR_NAME :-1MISSPELLED : 1FROM_FRIEND :-3...

# free : 0YOUR_NAME : 1MISSPELLED : 1FROM_FRIEND : 1...

Dot product positive means the positive class

Page 7: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Decision Rules

Page 8: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Binary Decision Rule

• In the space of feature vectors• Examples are points• Any weight vector is a hyperplane• One side corresponds to Y=+1• Other corresponds to Y=-1

BIAS : -3free : 4money : 2... 0 1

0

1

2

freem

oney

+1 = SPAM

-1 = HAM

Page 9: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Weight Updates

Page 10: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Learning: Binary Perceptron• Start with weights = 0• For each training instance:

• Classify with current weights

• If correct: (i.e., y=y*), no change!

• If wrong: adjust the weight vector

Page 11: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Learning: Binary Perceptron• Start with weights = 0• For each training instance:

• Classify with current weights

• If correct: (i.e., y=y*), no change!• If wrong: adjust the weight vector

by adding or subtracting the feature vector. Subtract if y* is -1.

Page 12: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Examples: Perceptron

• Separable Case

Page 13: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Multiclass Decision Rule

• If we have multiple classes:• A weight vector for each class:

• Score (activation) of a class y:

• Prediction with highest score wins

Binary = multiclass where the negative class has weight zero

Page 14: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Learning: Multiclass Perceptron

• Start with all weights = 0• Pick training examples one by one

• Predict with current weights

• If correct: no change!• If wrong: lower score of wrong answer,

raise score of right answer

Page 15: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Example: Multiclass Perceptron

BIAS : 1win : 0game : 0 vote : 0 the : 0 ...

BIAS : 0win : 0game : 0vote : 0the : 0...

BIAS : 0 win : 0 game : 0 vote : 0 the : 0 ...

Question: What will the weights w be for each class after 3 updates?

y1 = “politics”, x1 = “win the vote”y2 = “politics”, x2 = “win the election”y3 = “sports”, x3 = “win the game”

Page 16: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Example: Multiclass Perceptron

BIAS : 1win : 0game : 0 vote : 0 the : 0 ...

BIAS : 0win : 0game : 0vote : 0the : 0...

BIAS : 0 win : 0 game : 0 vote : 0 the : 0 ...

Question: What will the weights w be for each class after 3 updates?

wpolitics f(x1) = 0wsports f(x1) = 1

wtech f(x1) = 0

+ 1 + 1+ 0+ 1+ 1

Prediction:“sports”(wrong)

- 1 - 1 - 0- 1- 1

1 1011

f(x1) =

y1 = “politics”, x1 = “win the vote”y2 = “politics”, x2 = “win the election”y3 = “sports”, x3 = “win the game”

Page 17: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Example: Multiclass Perceptron

BIAS : 0win : -1game : 0 vote : -1 the : -1 ...

BIAS : 1win : 1game : 0vote : 1the : 1...

BIAS : 0 win : 0 game : 0 vote : 0the : 0 ...

y1 = “politics”, x1 = “win the vote”y2 = “politics”, x2 = “win the election”y3 = “sports”, x3 = “win the game”

Question: What will the weights w be for each class after 3 updates?

wpolitics f(x1) = 3wsports f(x1) = -2

wtech f(x1) = -3

Prediction:“politics”(correct)

1 1001

f(x2) =

Page 18: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Example: Multiclass Perceptron

BIAS : 0win : -1game : 0 vote : -1 the : -1 ...

BIAS : 1win : 1game : 0vote : 1the : 1...

BIAS : 0win : 0game : 0 vote : 0the : 0...

y1 = “politics”, x1 = “win the vote”y2 = “politics”, x2 = “win the election”y3 = “sports”, x3 = “win the game”

Question: What will the weights w be for each class after 3 updates?

wpolitics f(x1) = 3wsports f(x1) = -2

wtech f(x1) = -3

Prediction:“politics”(wrong)

1 1101

f(x3) =

- 1 - 1- 1- 0- 1

+ 1 + 1 + 1+ 0+ 1

Page 19: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Example: Multiclass Perceptron

BIAS : 1win : 0game : 1 vote : -1the : 0 ...

BIAS : 0win : 0game : -1vote : 1the : 0...

BIAS : 0win : 0game : 0vote : 0the : 0...

y1 = “politics”, x1 = “win the vote”y2 = “politics”, x2 = “win the election”y3 = “sports”, x3 = “win the game”

Question: What will the weights w be for each class after 3 updates?

Page 20: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

δ

Properties of Perceptrons

• Separability: true if there exists weights w that get the training set perfectly correct

• Convergence: if the training data are separable, a perceptron will eventually converge (binary case)

• Mistake Bound: the maximum number of mistakes (updates)(binary case) is related to the number of features kand the margin δ or degree of separability

Separable

Non-Separable

Page 21: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Problems with the Perceptron

• Noise: if the data isn’t separable, weights might thrash• Averaging weight vectors over time

can help (averaged perceptron)

• Mediocre generalization: finds a “barely” separating solution

• Overtraining: test / held-out accuracy usually rises, then falls• Overtraining is a kind of overfitting

Page 22: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Improving the Perceptron

Page 23: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Non-Separable Case: Deterministic DecisionEven the best linear boundary makes at least one mistake

Page 24: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Non-Separable Case: Probabilistic Decision

0.5 | 0.50.3 | 0.7

0.1 | 0.9

0.7 | 0.30.9 | 0.1

Page 25: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

How to get probabilistic decisions?

• Perceptron scoring:• If very positive à want probability going to 1• If very negative à want probability going to 0

• Sigmoid function

z = w · f(x)z = w · f(x)

z = w · f(x)

�(z) =1

1 + e�z

Page 26: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Best w?

• Maximum likelihood estimation:

with:

maxw

ll(w) = maxw

X

i

logP (y(i)|x(i);w)

P (y(i) = +1|x(i);w) =1

1 + e�w·f(x(i))

P (y(i) = �1|x(i);w) = 1� 1

1 + e�w·f(x(i))

This is called Logistic Regression

Page 27: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Separable Case: Deterministic Decision – Many Options

Page 28: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Separable Case: Probabilistic Decision – Clear Preference

0.5 | 0.50.3 | 0.7

0.7 | 0.3

0.5 | 0.50.3 | 0.7

0.7 | 0.3

Page 29: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Multiclass Logistic Regression• Recall Perceptron:

• A weight vector for each class:

• Score (activation) of a class y:

• Prediction with highest score wins

• How to turn scores into probabilities?

z1, z2, z3 ! ez1

ez1 + ez2 + ez3,

ez2

ez1 + ez2 + ez3,

ez3

ez1 + ez2 + ez3

original activations softmax activations

Page 30: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Best w?

• Maximum likelihood estimation:

with:

maxw

ll(w) = maxw

X

i

logP (y(i)|x(i);w)

P (y(i)|x(i);w) =ewy(i) ·f(x(i))

Py e

wy·f(x(i))

This is called Multi-Class Logistic Regression

Page 31: CS 4100: Artificial Intelligence...CS 4100: Artificial Intelligence Perceptronsand Logistic Regression Jan-Willem van de Meent, Northeastern University [These slides were created by

Next Lecture

• Optimization

• i.e., how do we solve:

maxw

ll(w) = maxw

X

i

logP (y(i)|x(i);w)