Machine Learning (CSE 446): Perceptron

Noah Smith © 2017
University of Washington
[email protected]
October 11, 2017
No office hours for Noah today.
John’s office hours will be 12–2 today.
Patrick’s office hours will run 12:30–2:30 on Thursday.
Happy Medium?

Decision trees (that aren't too deep): use relatively few features to classify.
K-nearest neighbors: all features weighted equally.
Today: use all features, but weight them.

For today's lecture, assume that y ∈ {−1, +1} instead of {0, 1}, and that x ∈ R^d.
Remember that x[j] = φj(x). (Features have already been applied to the data.)
Inspiration from Neurons

[Image from Wikimedia Commons.]

Input signals come in through dendrites; the output signal passes out through the axon.
Neuron-Inspired Classifier

[Diagram: each input x[1], . . . , x[d] is multiplied by its weight parameter w[1], . . . , w[d]; the products are summed (∑) together with a bias parameter b to form the "activation," which determines whether the neuron fires, giving the output ŷ.]
Neuron-Inspired Classifier

f(x) = sign(w · x + b)

remembering that w · x = ∑_{j=1}^{d} w[j] · x[j]

Learning requires us to set the weights w and the bias b.
Perceptron Learning Algorithm

Data: D = ⟨(xn, yn)⟩, n = 1, . . . , N; number of epochs E
Result: weights w and bias b
initialize: w = 0 and b = 0;
for e ∈ {1, . . . , E} do
    for n ∈ {1, . . . , N}, in random order do
        # predict
        ŷ = sign(w · xn + b);
        if ŷ ≠ yn then
            # update
            w ← w + yn · xn;
            b ← b + yn;
        end
    end
end
return w, b

Algorithm 1: PerceptronTrain
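A minimal NumPy rendering of Algorithm 1 (a sketch under my own naming and data-layout assumptions, not official course code):

```python
import numpy as np

def perceptron_train(X, y, epochs):
    """PerceptronTrain: X is N x d; y has entries in {-1, +1}."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0                  # initialize: w = 0 and b = 0
    rng = np.random.default_rng()
    for _ in range(epochs):                  # for e in {1, ..., E}
        for n in rng.permutation(N):         # examples in random order
            y_hat = np.sign(w @ X[n] + b)    # predict
            if y_hat != y[n]:                # mistake, so update
                w += y[n] * X[n]
                b += y[n]
    return w, b
```

Prediction is then np.sign(X @ w + b), matching f(x) = sign(w · x + b) above.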
Linear Decision Boundary

[Figures: the hyperplane w·x + b = 0 is the decision boundary. Points on one side have w·x + b > 0; points on the other have w·x + b < 0. The activation of a point is w·x + b.]
Interpretation of Weight Values

What does it mean when . . .
- w[12] = 100? (feature 12 contributes 100 · x[12] to the activation: a large influence)
- w[12] = −1? (feature 12 contributes −x[12]: a modest push in the opposite direction)
- w[12] = 0? (feature 12 is ignored)

What if we multiply w by 2? (See the numeric check below.)
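One way to probe that last question numerically (a made-up example, not from the slides): scaling w and b by the same positive constant doubles every activation but flips no signs, so predictions are unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
w, b = np.array([2.0, -1.0, 0.5]), -0.25    # arbitrary illustrative parameters
X = rng.normal(size=(5, 3))                 # a few random inputs

before = np.sign(X @ w + b)                 # original predictions
after = np.sign(X @ (2 * w) + 2 * b)        # after scaling w and b by 2
print(np.array_equal(before, after))        # True: only the activations changed
```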
Linear Separability

A dataset D = ⟨(xn, yn)⟩, n = 1, . . . , N, is linearly separable if there exists some linear classifier (defined by w, b) such that, for all n, yn = sign(w · xn + b).

margin(D, w, b) = { min_n yn · (w · xn + b)   if w and b separate D
                  { −∞                         otherwise

margin(D) = sup_{w,b} margin(D, w, b)
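The definition above translates directly to code; a minimal NumPy sketch (the function name and array layout are my choices):

```python
import numpy as np

def margin(X, y, w, b):
    """min_n y_n (w . x_n + b) if (w, b) separates the data, else -infinity."""
    activations = y * (X @ w + b)
    return activations.min() if np.all(activations > 0) else -np.inf
```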
Perceptron Convergence

Due to Rosenblatt (1958).

If D is linearly separable with margin γ > 0 and, for all n ∈ {1, . . . , N}, ‖xn‖₂ ≤ 1, then the perceptron algorithm will converge after making at most 1/γ² updates. (For instance, γ = 0.1 guarantees at most 100 updates.)

- The proof can be found in Daume (2017), pp. 50–51.
- The theorem does not guarantee that the perceptron's classifier will achieve margin γ.
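The bound can also be sanity-checked empirically. A toy sketch (my own code, not from the lecture) that counts updates until a full clean pass; when the data satisfy the theorem's conditions, the count should not exceed 1/γ²:

```python
import numpy as np

def count_perceptron_updates(X, y, max_epochs=1000):
    """Run PerceptronTrain, returning the number of updates made
    before an epoch with zero mistakes (i.e., convergence)."""
    N, d = X.shape
    w, b, updates = np.zeros(d), 0.0, 0
    rng = np.random.default_rng(0)
    for _ in range(max_epochs):
        mistakes = 0
        for n in rng.permutation(N):
            if np.sign(w @ X[n] + b) != y[n]:
                w, b = w + y[n] * X[n], b + y[n]
                updates += 1
                mistakes += 1
        if mistakes == 0:          # converged: one full pass with no errors
            break
    return updates
```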
What if D isn’t linearly separable?
22 / 37
What if D isn’t linearly separable?
No guarantees.
23 / 37
Voted Perceptron

Let w(e,n) and b(e,n) be the parameters after updating based on the nth example on epoch e.

ŷ = sign( ∑_{e=1}^{E} ∑_{n=1}^{N} sign(w(e,n) · x + b(e,n)) )

Do we need to store E · N parameter settings? We can do a little better by storing a "survival" time for each vector (the number of times an update didn't need to be made because ŷ = y), as in the sketch below.
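A sketch of training with survival counts and of the voted prediction rule (names and layout are mine; sign(0) edge cases are ignored for brevity):

```python
import numpy as np

def voted_perceptron_train(X, y, epochs):
    """Store each weight vector with its survival count c
    instead of all E * N snapshots."""
    N, d = X.shape
    w, b, c = np.zeros(d), 0.0, 1
    history = []                                   # list of (w, b, c)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for n in rng.permutation(N):
            if np.sign(w @ X[n] + b) != y[n]:
                history.append((w.copy(), b, c))   # retire current vector
                w, b, c = w + y[n] * X[n], b + y[n], 1
            else:
                c += 1                             # current vector survives
    history.append((w.copy(), b, c))
    return history

def voted_predict(history, x):
    """Each stored classifier votes, weighted by its survival count."""
    return np.sign(sum(c * np.sign(w @ x + b) for w, b, c in history))
```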
Voted Perceptron

[A sequence of figure-only slides followed here; the images did not survive transcription.]
Averaged Perceptron

ŷ = sign( ∑_{e=1}^{E} ∑_{n=1}^{N} ( w(e,n) · x + b(e,n) ) )

  = sign( ( ∑_{e=1}^{E} ∑_{n=1}^{N} w(e,n) ) · x + ( ∑_{e=1}^{E} ∑_{n=1}^{N} b(e,n) ) )

Do we need to store E · N parameter settings? No: the sums distribute, so a single summed weight vector and bias suffice.

What kind of decision boundary does the averaged perceptron have? Still a linear one, since the prediction is the sign of a single dot product plus a bias (see the sketch below).
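A sketch of the running-sum trick the identity above licenses (my own code): accumulate ∑ w and ∑ b during training, then predict with a single linear classifier.

```python
import numpy as np

def averaged_perceptron_train(X, y, epochs):
    """Keep running sums of (w, b) over every (e, n)
    instead of E * N snapshots."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    w_sum, b_sum = np.zeros(d), 0.0
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for n in rng.permutation(N):
            if np.sign(w @ X[n] + b) != y[n]:
                w, b = w + y[n] * X[n], b + y[n]
            w_sum, b_sum = w_sum + w, b_sum + b   # accumulate after each example
    return w_sum, b_sum

def averaged_predict(w_sum, b_sum, x):
    return np.sign(w_sum @ x + b_sum)             # still a linear boundary
```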
References

Hal Daume. A Course in Machine Learning (v0.9). Self-published at http://ciml.info/, 2017.

Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408, 1958.