Machine Learning (CSE 446): Perceptron

Noah Smith © 2017
University of Washington
[email protected]
October 11, 2017
No office hours for Noah today.
John’s office hours will be 12–2 today.
Patrick’s office hours will run 12:30–2:30 on Thursday.
Happy Medium?

Decision trees (that aren't too deep): use relatively few features to classify.
K-nearest neighbors: all features weighted equally.
Today: use all features, but weight them.

For today's lecture, assume that y ∈ {−1, +1} instead of {0, 1}, and that x ∈ R^d.
Remember that x[j] = φj(x). (Features have already been applied to the data.)
Inspiration from Neurons

[Image from Wikimedia Commons.]

Input signals come in through dendrites; the output signal passes out through the axon.
Neuron-Inspired Classifier

[Diagram: each input x[1], . . . , x[d] is multiplied by its weight parameter w[1], . . . , w[d]; the products are summed (∑) together with a bias parameter b to form the "activation," which determines whether the neuron fires, giving the output ŷ.]
Neuron-Inspired Classifier

f(x) = sign(w · x + b)

remembering that w · x = ∑_{j=1}^{d} w[j] · x[j]

Learning requires us to set the weights w and the bias b.
Perceptron Learning Algorithm

Data: D = ⟨(xn, yn)⟩, n = 1, . . . , N; number of epochs E
Result: weights w and bias b
initialize: w = 0 and b = 0;
for e ∈ {1, . . . , E} do
    for n ∈ {1, . . . , N}, in random order do
        # predict
        ŷ = sign(w · xn + b);
        if ŷ ≠ yn then
            # update
            w ← w + yn · xn;
            b ← b + yn;
        end
    end
end
return w, b

Algorithm 1: PerceptronTrain
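A minimal NumPy rendering of Algorithm 1 (a sketch under my own naming and data-layout assumptions, not official course code):

```python
import numpy as np

def perceptron_train(X, y, epochs):
    """PerceptronTrain: X is N x d; y has entries in {-1, +1}."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0                  # initialize: w = 0 and b = 0
    rng = np.random.default_rng()
    for _ in range(epochs):                  # for e in {1, ..., E}
        for n in rng.permutation(N):         # examples in random order
            y_hat = np.sign(w @ X[n] + b)    # predict
            if y_hat != y[n]:                # mistake, so update
                w += y[n] * X[n]
                b += y[n]
    return w, b
```

Prediction is then np.sign(X @ w + b), matching f(x) = sign(w · x + b) above.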
Linear Decision Boundary

[Figures: the hyperplane w·x + b = 0 is the decision boundary. Points on one side have w·x + b > 0; points on the other have w·x + b < 0. The activation of a point is w·x + b.]
Interpretation of Weight Values

What does it mean when . . .
- w[12] = 100? (feature 12 contributes 100 · x[12] to the activation: a large influence)
- w[12] = −1? (feature 12 contributes −x[12]: a modest push in the opposite direction)
- w[12] = 0? (feature 12 is ignored)

What if we multiply w by 2? (See the numeric check below.)
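One way to probe that last question numerically (a made-up example, not from the slides): scaling w and b by the same positive constant doubles every activation but flips no signs, so predictions are unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
w, b = np.array([2.0, -1.0, 0.5]), -0.25    # arbitrary illustrative parameters
X = rng.normal(size=(5, 3))                 # a few random inputs

before = np.sign(X @ w + b)                 # original predictions
after = np.sign(X @ (2 * w) + 2 * b)        # after scaling w and b by 2
print(np.array_equal(before, after))        # True: only the activations changed
```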
Linear Separability

A dataset D = ⟨(xn, yn)⟩, n = 1, . . . , N, is linearly separable if there exists some linear classifier (defined by w, b) such that, for all n, yn = sign(w · xn + b).

margin(D, w, b) = { min_n yn · (w · xn + b)   if w and b separate D
                  { −∞                         otherwise

margin(D) = sup_{w,b} margin(D, w, b)
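The definition above translates directly to code; a minimal NumPy sketch (the function name and array layout are my choices):

```python
import numpy as np

def margin(X, y, w, b):
    """min_n y_n (w . x_n + b) if (w, b) separates the data, else -infinity."""
    activations = y * (X @ w + b)
    return activations.min() if np.all(activations > 0) else -np.inf
```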
Perceptron Convergence

Due to Rosenblatt (1958).

If D is linearly separable with margin γ > 0 and, for all n ∈ {1, . . . , N}, ‖xn‖₂ ≤ 1, then the perceptron algorithm will converge after making at most 1/γ² updates. (For instance, γ = 0.1 guarantees at most 100 updates.)

- The proof can be found in Daume (2017), pp. 50–51.
- The theorem does not guarantee that the perceptron's classifier will achieve margin γ.
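The bound can also be sanity-checked empirically. A toy sketch (my own code, not from the lecture) that counts updates until a full clean pass; when the data satisfy the theorem's conditions, the count should not exceed 1/γ²:

```python
import numpy as np

def count_perceptron_updates(X, y, max_epochs=1000):
    """Run PerceptronTrain, returning the number of updates made
    before an epoch with zero mistakes (i.e., convergence)."""
    N, d = X.shape
    w, b, updates = np.zeros(d), 0.0, 0
    rng = np.random.default_rng(0)
    for _ in range(max_epochs):
        mistakes = 0
        for n in rng.permutation(N):
            if np.sign(w @ X[n] + b) != y[n]:
                w, b = w + y[n] * X[n], b + y[n]
                updates += 1
                mistakes += 1
        if mistakes == 0:          # converged: one full pass with no errors
            break
    return updates
```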
What if D isn’t linearly separable?
22 / 37
What if D isn’t linearly separable?
No guarantees.
23 / 37
Voted Perceptron

Let w(e,n) and b(e,n) be the parameters after updating based on the nth example on epoch e.

ŷ = sign( ∑_{e=1}^{E} ∑_{n=1}^{N} sign(w(e,n) · x + b(e,n)) )

Do we need to store E · N parameter settings? We can do a little better by storing a "survival" time for each vector (the number of times an update didn't need to be made because ŷ = y), as in the sketch below.
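A sketch of training with survival counts and of the voted prediction rule (names and layout are mine; sign(0) edge cases are ignored for brevity):

```python
import numpy as np

def voted_perceptron_train(X, y, epochs):
    """Store each weight vector with its survival count c
    instead of all E * N snapshots."""
    N, d = X.shape
    w, b, c = np.zeros(d), 0.0, 1
    history = []                                   # list of (w, b, c)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for n in rng.permutation(N):
            if np.sign(w @ X[n] + b) != y[n]:
                history.append((w.copy(), b, c))   # retire current vector
                w, b, c = w + y[n] * X[n], b + y[n], 1
            else:
                c += 1                             # current vector survives
    history.append((w.copy(), b, c))
    return history

def voted_predict(history, x):
    """Each stored classifier votes, weighted by its survival count."""
    return np.sign(sum(c * np.sign(w @ x + b) for w, b, c in history))
```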
Voted Perceptron

[A sequence of figure-only slides followed here; the images did not survive transcription.]
Averaged Perceptron

ŷ = sign( ∑_{e=1}^{E} ∑_{n=1}^{N} ( w(e,n) · x + b(e,n) ) )

  = sign( ( ∑_{e=1}^{E} ∑_{n=1}^{N} w(e,n) ) · x + ( ∑_{e=1}^{E} ∑_{n=1}^{N} b(e,n) ) )

Do we need to store E · N parameter settings? No: the sums distribute, so a single summed weight vector and bias suffice.

What kind of decision boundary does the averaged perceptron have? Still a linear one, since the prediction is the sign of a single dot product plus a bias (see the sketch below).
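A sketch of the running-sum trick the identity above licenses (my own code): accumulate ∑ w and ∑ b during training, then predict with a single linear classifier.

```python
import numpy as np

def averaged_perceptron_train(X, y, epochs):
    """Keep running sums of (w, b) over every (e, n)
    instead of E * N snapshots."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    w_sum, b_sum = np.zeros(d), 0.0
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for n in rng.permutation(N):
            if np.sign(w @ X[n] + b) != y[n]:
                w, b = w + y[n] * X[n], b + y[n]
            w_sum, b_sum = w_sum + w, b_sum + b   # accumulate after each example
    return w_sum, b_sum

def averaged_predict(w_sum, b_sum, x):
    return np.sign(w_sum @ x + b_sum)             # still a linear boundary
```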
References

Hal Daume. A Course in Machine Learning (v0.9). Self-published at http://ciml.info/, 2017.

Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408, 1958.