ORIE 4741: Learning with Big Messy Data
The Perceptron
Professor Udell
Operations Research and Information Engineering, Cornell University
September 12, 2019
Announcements
- sections next week (Monday and Wednesday): Julia tutorial
- resources from section will be posted on the website and at https://github.com/ORIE4741/section
- homework 1 has been released, due in two weeks
- start finding project teams...
A simple classifier: the perceptron
classification problem: e.g., credit card approval
- X = R^d, Y = {−1, +1}
- data D = {(x_1, y_1), ..., (x_n, y_n)}, x_i ∈ X, y_i ∈ Y for each i = 1, ..., n
- for the picture: X = R², Y = {red, blue}
Linear classification
- X = R^d, Y = {−1, +1}
- data D = {(x_1, y_1), ..., (x_n, y_n)}, x_i ∈ X, y_i ∈ Y for each i = 1, ..., n

make a decision using a linear function:
- approve credit if
  ∑_{j=1}^d w_j x_j = wᵀx ≥ b;
  deny otherwise.
- parametrized by weights w ∈ R^d
- decision boundary is the hyperplane {x : wᵀx = b}
Feature transformation
simplify notation: remove the offset b using a feature transformation

example: approve credit if wᵀx ≥ b

e.g., X = R, w = 1, b = 2 (picture)

Q: Can we represent this decision rule by another with no offset?
A: yes, by a projective transformation (picture):
- let x̃ = (1, x), w̃ = (−b, w)
- then w̃ᵀx̃ = wᵀx − b

now rename x̃ and w̃ as x and w
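The offset-removal trick is easy to sanity-check numerically. A minimal sketch, using the slide's numbers X = R, w = 1, b = 2 (NumPy and the variable names are my own choices):

```python
import numpy as np

# decision rule with an offset: approve if w^T x >= b
w, b = np.array([1.0]), 2.0
x = np.array([3.0])

# augmented versions: x~ = (1, x), w~ = (-b, w)
x_aug = np.concatenate(([1.0], x))
w_aug = np.concatenate(([-b], w))

# the two rules agree: w~^T x~ == w^T x - b
assert np.isclose(w_aug @ x_aug, w @ x - b)
print(w_aug @ x_aug)  # 1.0, i.e. 3 - 2
```

Approving when w̃ᵀx̃ ≥ 0 is then the same decision as approving when wᵀx ≥ b.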
![Page 11: ORIE 4741: Learning with Big Messy Data [2ex] The Perceptron · 2019-09-12 · ORIE 4741: Learning with Big Messy Data The Perceptron Professor Udell Operations Research and Information](https://reader033.vdocuments.us/reader033/viewer/2022042102/5e7e8eb6ae1392026f781be5/html5/thumbnails/11.jpg)
How good is your classifier?
- X = R^d, Y = {−1, +1}
- data D = {(x_1, y_1), ..., (x_n, y_n)}, x_i ∈ X, y_i ∈ Y for each i = 1, ..., n
- approve credit if wᵀx ≥ 0; deny otherwise.

if ‖w‖ = 1, the inner product wᵀx measures the distance of x to the classification boundary:
- define θ to be the angle between x and w
- geometry: the distance from x to the boundary is ‖x‖ cos(θ)
- definition of the inner product:
  wᵀx = ‖w‖ ‖x‖ cos(θ) = ‖x‖ cos(θ)
  since ‖w‖ = 1
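This geometric reading can be checked with a quick computation. A sketch with made-up values for the unit vector w and the point x (NumPy assumed):

```python
import numpy as np

# with ||w|| = 1, w^T x is the signed distance from x to the boundary {x : w^T x = 0}
w = np.array([0.0, 1.0])   # a unit vector
x = np.array([3.0, 4.0])

signed_dist = w @ x        # = ||x|| cos(theta), since ||w|| = 1
theta = np.arccos(signed_dist / np.linalg.norm(x))

# check the geometry: distance to the boundary is ||x|| cos(theta)
assert np.isclose(np.linalg.norm(x) * np.cos(theta), signed_dist)
print(signed_dist)  # 4.0: x lies at distance 4 on the positive side
```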
Linear classification
- X = R^d, Y = {−1, +1}
- data D = {(x_1, y_1), ..., (x_n, y_n)}, x_i ∈ X, y_i ∈ Y for each i = 1, ..., n

make a decision using a linear function h : X → Y,
  h(x) = sign(wᵀx)

Definition
The sign function is defined as
  sign(z) = +1 if z > 0, 0 if z = 0, and −1 if z < 0.
Linear classification
- X = R^d, Y = {−1, +1}
- data D = {(x_1, y_1), ..., (x_n, y_n)}

make a decision using a linear function h : X → Y,
  h(x) = sign(wᵀx)

Definition
The hypothesis set H is the set of candidate functions we might choose to map X to Y.

Here, H = {h : X → Y | h(x) = sign(wᵀx) for some w ∈ R^d}
The perceptron learning rule
how to learn h(x) = sign(wᵀx) so that h(x_i) ≈ y_i?
Frank Rosenblatt's Mark I Perceptron machine was the first implementation of the perceptron algorithm. The machine was connected to a camera that used 20×20 cadmium sulfide photocells to produce a 400-pixel image. The main visible feature is a patchboard that allowed experimentation with different combinations of input features. To the right of that are arrays of potentiometers that implemented the adaptive weights.
The perceptron learning rule
how to learn h(x) = sign(wᵀx) so that h(x_i) ≈ y_i?

perceptron algorithm [Rosenblatt, 1962]:
- initialize w = 0
- while there is a misclassified example (x, y):
  - w ← w + y x
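In code, the rule is only a few lines. A sketch implementation (the function name, toy data, and `max_iters` safeguard are my own choices; any misclassified example may be chosen, here the first one found):

```python
import numpy as np

def perceptron(X, y, max_iters=1000):
    """Perceptron on data X (n x d, offset folded into a constant feature), labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])                 # initialize w = 0
    for _ in range(max_iters):
        margins = y * (X @ w)                # margin y_i w^T x_i of every example
        wrong = np.where(margins <= 0)[0]    # misclassified (sign(0) counts as a mistake)
        if len(wrong) == 0:
            return w                         # no misclassified example: done
        w = w + y[wrong[0]] * X[wrong[0]]    # update on a misclassified example
    return w

# toy linearly separable data; the first column is the constant feature 1
X = np.array([[1., 2., 2.], [1., 1., 3.], [1., -1., -1.], [1., -2., 0.]])
y = np.array([1., 1., -1., -1.])
w = perceptron(X, y)
assert np.all(np.sign(X @ w) == y)           # every example now classified correctly
```

On linearly separable data the loop is guaranteed to stop, as the convergence proof later in the lecture shows.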
(pictures: perceptron iterations 1, 3, 5, 7, 9, 11, 13, showing the decision boundary after successive updates)
Margin of classifier
correct classification means
  wᵀx > 0 when y = +1, and wᵀx < 0 when y = −1,
  ⟹ y wᵀx > 0

Definition
The margin of classifier w on example (x, y) is
  y wᵀx
- positive margin means (x, y) is correctly classified by w
- negative margin means (x, y) is not correctly classified by w
- a bigger margin means (x, y) is more confidently classified
The perceptron learning rule
notation: use superscripts w^(t) for iterates

perceptron algorithm [Rosenblatt, 1962]:
- initialize w^(0) = 0
- for t = 1, ...:
  - if there is a misclassified example (x^(t), y^(t)):
    w^(t+1) = w^(t) + y^(t) x^(t)
  - else quit
The perceptron learning rule
perceptron algorithm: for a misclassified example (x^(t), y^(t)),
  w^(t+1) = w^(t) + y^(t) x^(t)

Q: why is this a good idea?
A: classification is "better" for w^(t+1) than for w^(t): we will show the margin on (x^(t), y^(t)) is bigger for w^(t+1). recall:

Definition
The margin of classifier w on example (x, y) is
  y wᵀx
- positive margin means (x, y) is correctly classified by w
- negative margin means (x, y) is not correctly classified by w
The perceptron learning rule
perceptron algorithm: for a misclassified example (x^(t), y^(t)),
  w^(t+1) = w^(t) + y^(t) x^(t)

why is this a good idea?
- example (x^(t), y^(t)) is misclassified at time t
  ⟺ sign(w^(t)ᵀ x^(t)) ≠ y^(t)
  ⟺ y^(t) w^(t)ᵀ x^(t) < 0.
- compute
  y^(t) w^(t+1)ᵀ x^(t) = y^(t) (w^(t) + y^(t) x^(t))ᵀ x^(t)
                      = y^(t) w^(t)ᵀ x^(t) + (y^(t))² ‖x^(t)‖²
                      ≥ y^(t) w^(t)ᵀ x^(t)
- so w^(t+1) classifies (x^(t), y^(t)) better than w^(t) did (but possibly still not correctly)
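The margin-increase identity above is easy to verify numerically. A sketch with random vectors and a fixed seed, where the label is flipped so that (x, y) is deliberately misclassified (NumPy assumed):

```python
import numpy as np

# construct a (w, x, y) triple where (x, y) is misclassified by w
rng = np.random.default_rng(0)       # fixed seed; w^T x is nonzero here
w = rng.normal(size=3)
x = rng.normal(size=3)
y = -np.sign(w @ x)                  # flip the label so the margin is negative

w_new = w + y * x                    # perceptron update
old_margin = y * (w @ x)
new_margin = y * (w_new @ x)

# the derivation: new margin = old margin + (y^2) ||x||^2 = old margin + ||x||^2
assert old_margin < 0                # (x, y) really was misclassified
assert np.isclose(new_margin, old_margin + x @ x)
assert new_margin > old_margin       # the update increased the margin
```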
Linearly separable data
Definition
The data D = {(x_1, y_1), ..., (x_n, y_n)} is linearly separable if
  y_i = sign((w♮)ᵀ x_i), i = 1, ..., n
for some vector w♮.

that is, there is some hyperplane that (strictly) separates the data into positive and negative examples:
- w♮ has positive margin y_i (w♮)ᵀ x_i > 0 on every example
- so the minimum margin ρ = min_{i=1,...,n} y_i x_iᵀ w♮ > 0
The perceptron learning rule works
how do we know that the perceptron algorithm will work?
we’ll prove that
Theorem
If the data is linearly separable, then the perceptron algorithm eventually makes no mistakes.
downside: it could take a long time...
Proof of convergence (I)
Let w♮ be a vector that strictly separates the data into positive and negative examples. So the minimum margin is positive:
  ρ = min_{i=1,...,n} y_i x_iᵀ w♮ > 0.

Suppose for simplicity that we start with w^(0) = 0.
- Notice that w^(t) becomes aligned with w♮:
  (w♮)ᵀ w^(t+1) = (w♮)ᵀ (w^(t) + y^(t) x^(t))
               = (w♮)ᵀ w^(t) + y^(t) (w♮)ᵀ x^(t)
               ≥ (w♮)ᵀ w^(t) + ρ.
- So by induction, as long as there is a misclassified example at time t,
  (w♮)ᵀ w^(t) ≥ ρt.
Proof of convergence (II)
- Define R = max_{i=1,...,n} ‖x_i‖.
- Notice that ‖w^(t)‖ doesn't grow too fast:
  ‖w^(t+1)‖² = ‖w^(t) + y^(t) x^(t)‖²
            = ‖w^(t)‖² + ‖x^(t)‖² + 2 y^(t) w^(t)ᵀ x^(t)
            ≤ ‖w^(t)‖² + ‖x^(t)‖²
            ≤ ‖w^(t)‖² + R²
  because (x^(t), y^(t)) was misclassified by w^(t), so y^(t) w^(t)ᵀ x^(t) < 0.
- So by induction, ‖w^(t)‖² ≤ tR².
Proof of convergence (III)
- So as long as there is a misclassified example at time t,
  (w♮)ᵀ w^(t) ≥ ρt and ‖w^(t)‖² ≤ tR².
- Put it together: if there is a misclassified example at time t,
  ρt ≤ (w♮)ᵀ w^(t) ≤ ‖w♮‖ ‖w^(t)‖ ≤ ‖w♮‖ √t R,
  so
  t ≤ (‖w♮‖ R / ρ)².

This bounds the maximum running time of the algorithm!
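The bound can be checked empirically. A sketch under made-up assumptions: a known separator `w_sep` plays the role of w♮, and Gaussian points are filtered to leave a comfortable minimum margin so the bound stays small (NumPy assumed):

```python
import numpy as np

def perceptron_updates(X, y):
    """Count perceptron updates until no training example is misclassified."""
    w = np.zeros(X.shape[1])
    t = 0
    while True:
        wrong = np.where(y * (X @ w) <= 0)[0]   # sign(0) also counts as a mistake
        if len(wrong) == 0:
            return t
        w = w + y[wrong[0]] * X[wrong[0]]       # perceptron update
        t += 1

# build separable data with a known separator w_sep
rng = np.random.default_rng(1)
w_sep = np.array([1.0, -1.0])
pts = rng.normal(size=(200, 2))
margins = pts @ w_sep
keep = np.abs(margins) >= 0.5                   # enforce a comfortable minimum margin
X, y = pts[keep], np.sign(margins[keep])

rho = np.min(y * (X @ w_sep))                   # minimum margin rho of w_sep
R = np.max(np.linalg.norm(X, axis=1))           # R = max_i ||x_i||
bound = (np.linalg.norm(w_sep) * R / rho) ** 2  # t <= (||w_sep|| R / rho)^2

t = perceptron_updates(X, y)
assert t <= bound                               # the theorem's bound holds
print(t, "<=", round(bound, 1))
```

Note the bound uses the margin of the separator we happen to know, so it is usually loose; the best possible bound would use the maximum-margin separator.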
Understanding the bound
- is the bound tight? why or why not?
- what does the bound tell us about non-separable data?
20 / 24
![Page 49: ORIE 4741: Learning with Big Messy Data [2ex] The Perceptron · 2019-09-12 · ORIE 4741: Learning with Big Messy Data The Perceptron Professor Udell Operations Research and Information](https://reader033.vdocuments.us/reader033/viewer/2022042102/5e7e8eb6ae1392026f781be5/html5/thumbnails/49.jpg)
Perceptron with outlier: iteration 1
20 / 24
![Page 50: ORIE 4741: Learning with Big Messy Data [2ex] The Perceptron · 2019-09-12 · ORIE 4741: Learning with Big Messy Data The Perceptron Professor Udell Operations Research and Information](https://reader033.vdocuments.us/reader033/viewer/2022042102/5e7e8eb6ae1392026f781be5/html5/thumbnails/50.jpg)
Perceptron with outlier: iteration 2
20 / 24
Perceptron with outlier

[Figures: snapshots of the decision boundary at iterations 3, 4, 5, 47, 48, 49, and 50. With the outlier the data is no longer linearly separable, and the perceptron keeps updating rather than settling.]
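The behavior in the figures can be sketched in code. A minimal illustration (the toy dataset, the constant feature for the offset, and the iteration cap are assumptions, not from the slides): on data made non-separable by a single outlier, the perceptron update never drives the number of mistakes to zero.

```python
import numpy as np

def perceptron_errors(X, y, iters):
    """Run the perceptron update rule for a fixed number of iterations
    and report how many points are still misclassified at the end."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        mistakes = np.where(y * (X @ w) <= 0)[0]
        if len(mistakes) == 0:
            break  # all points correct: would only happen on separable data
        i = mistakes[0]
        w = w + y[i] * X[i]  # the perceptron update
    return int(np.sum(y * (X @ w) <= 0))

# Toy 1-D data, with a constant second feature playing the role of the offset.
# The point x = 3 labeled -1 is an outlier that makes the data non-separable.
X = np.array([[1.0, 1.0], [2.0, 1.0], [-1.0, 1.0], [-2.0, 1.0], [3.0, 1.0]])
y = np.array([1, 1, -1, -1, -1])
print(perceptron_errors(X, y, 50))  # stays above 0: no linear classifier fits all five points
```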
How to measure error?

Q: How to measure the quality of an (imperfect) linear classifier?

I Number of misclassifications:
\[
\sum_{i=1}^n \mathbf{1}\!\left[\, y_i \neq \operatorname{sign}(w^\top x_i) \,\right]
\]
I Size of misclassifications (attempt 1):
\[
\sum_{i=1}^n \max(-y_i w^\top x_i,\; 0)
\]
I Size of misclassifications (attempt 2):
\[
\sum_{i=1}^n \max(1 - y_i w^\top x_i,\; 0)
\]
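All three error measures are simple functions of the margins y_i w^T x_i. A short sketch (the toy dataset and candidate weight vector w are hypothetical, chosen only to exercise each formula):

```python
import numpy as np

# Hypothetical data: rows of X are examples x_i, labels y_i in {-1, +1}.
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, 1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 0.5])  # candidate linear classifier

margins = y * (X @ w)  # y_i * w^T x_i: positive iff example i is classified correctly

num_misclassified = int(np.sum(margins <= 0))        # number of misclassifications
perceptron_loss = np.sum(np.maximum(-margins, 0.0))  # size of misclassifications, attempt 1
hinge_loss = np.sum(np.maximum(1.0 - margins, 0.0))  # size of misclassifications, attempt 2

print(num_misclassified, perceptron_loss, hinge_loss)  # → 0 0.0 0.5
```

Note that attempt 1 is zero whenever every point is classified correctly, while attempt 2 still penalizes correct points whose margin is less than 1 (here the point with margin 0.5 contributes 0.5).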
Recap: Perceptron

I a simple learning algorithm to learn a linear classifier
I themes we’ll see again: linear functions, iterative updates, margin
I how we plotted the data: axes = X , color = Y
I vector w ∈ Rd defines the linear decision boundary
I simplify the algorithm with a feature transformation
I proof of convergence: induction, Cauchy–Schwarz, linear algebra
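The full algorithm recapped above fits in a few lines. A minimal sketch (the toy dataset and the iteration cap are assumptions; on linearly separable data the loop terminates by the convergence proof, and the cap only guards the non-separable case):

```python
import numpy as np

def perceptron(X, y, max_iters=1000):
    """Learn a linear classifier w with the perceptron algorithm.
    X: (n, d) array of examples; y: length-n labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        # find an example the current w misclassifies (margin <= 0)
        mistakes = np.where(y * (X @ w) <= 0)[0]
        if len(mistakes) == 0:
            return w  # every training point classified correctly
        i = mistakes[0]
        w = w + y[i] * X[i]  # nudge w toward the misclassified example
    return w  # cap reached (data may not be separable)

# Separable toy data: label matches the sign of the first coordinate.
X = np.array([[2.0, 1.0], [1.0, -1.0], [-1.0, 2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.all(y * (X @ w) > 0))  # → True: the learned w separates the data
```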
Schema for supervised learning

I unknown target function f : X → Y
I training examples D = {(x1, y1), . . . , (xn, yn)}
I hypothesis set H
I learning algorithm A
I final hypothesis g : X → Y
Generalization
how well will our classifier do on new data?
I if we know nothing about the new data, no guarantees
I but if the new data looks statistically like the old. . .