Midterm review
CS 446
1. Lecture review
(Lec1.) Basic setting: supervised learning

Training data: labeled examples

    (x_1, y_1), (x_2, y_2), ..., (x_n, y_n)

where
- each input x_i is a machine-readable description of an instance (e.g., image, sentence), and
- each corresponding label y_i is an annotation relevant to the task—typically not easy to obtain automatically.

Goal: learn a function f from labeled examples that accurately "predicts" the labels of new (previously unseen) inputs. (Note: 0 training error is easy; test/population error is what matters.)

[Diagram: past labeled examples → learning algorithm → learned predictor; a new (unlabeled) example goes into the learned predictor, which outputs a predicted label.]
(Lec2.) k-nearest neighbors classifier

Given: labeled examples D := {(x_i, y_i)}_{i=1}^n.

Predictor f_{D,k} : X → Y: on input x,
1. Find the k points x_{i_1}, x_{i_2}, ..., x_{i_k} among {x_i}_{i=1}^n "closest" to x (the k nearest neighbors).
2. Return the plurality label of y_{i_1}, y_{i_2}, ..., y_{i_k}.

(Break ties in both steps arbitrarily.)
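A minimal NumPy sketch of this predictor (my own illustration, not lecture code), taking "closest" to mean Euclidean distance:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Predict the plurality label of the k nearest training points to x."""
    # Euclidean distances from x to every training point.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices i_1, ..., i_k of the k closest points (ties broken arbitrarily).
    nearest = np.argsort(dists)[:k]
    # Plurality label among y_{i_1}, ..., y_{i_k}.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```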
(Lec2.) Choosing k

The hold-out set approach:
1. Pick a subset V ⊂ S (hold-out set, a.k.a. validation set).
2. For each k ∈ {1, 3, 5, ...}:
   - Construct the k-NN classifier f_{S∖V,k} using S ∖ V.
   - Compute the error rate of f_{S∖V,k} on V (the "hold-out error rate").
3. Pick the k that gives the smallest hold-out error rate.

(There are many other approaches.)
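A sketch of the hold-out procedure under the same assumptions (random split, odd k's only; all names and the split fraction are illustrative):

```python
import numpy as np

def knn_predict(X, y, x, k):
    nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    labels, counts = np.unique(y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

def choose_k(X, y, ks=(1, 3, 5, 7, 9), holdout_frac=0.2, seed=0):
    """Pick the k with the smallest hold-out error rate."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(holdout_frac * len(X))
    val, train = idx[:n_val], idx[n_val:]      # V and S \ V
    def holdout_error(k):
        preds = [knn_predict(X[train], y[train], x, k) for x in X[val]]
        return np.mean(np.asarray(preds) != y[val])
    return min(ks, key=holdout_error)
```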
(Lec2.) Decision trees

Directly optimize tree structure for good classification.

A decision tree is a function f : X → Y, represented by a binary tree in which:
- each tree node is associated with a splitting rule g : X → {0, 1};
- each leaf node is associated with a label y ∈ Y.

When X = R^d, typically one only considers splitting rules of the form

    g(x) = 1{x_i > t}

for some i ∈ [d] and t ∈ R, called axis-aligned or coordinate splits.
(Notation: [d] := {1, 2, ..., d}.)

[Example tree: the root tests x_1 > 1.7 with a leaf y = 1 on one side; the other side tests x_2 > 2.8, with leaves y = 2 and y = 3.]
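A hedged sketch of evaluating such a tree by hand, matching the example above; which of y = 2 / y = 3 sits on which branch of the x_2 > 2.8 split is my guess from the figure:

```python
def tree_predict(x):
    """Evaluate the example tree: axis-aligned splits, labels at leaves."""
    if x[0] > 1.7:          # root split: x_1 > 1.7
        if x[1] > 2.8:      # second split: x_2 > 2.8
            return 3        # branch assignment guessed from the figure
        return 2
    return 1                # leaf on the x_1 <= 1.7 side
```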
(Lec2.) Decision tree example

Classifying irises by sepal and petal measurements:
- X = R^2, Y = {1, 2, 3}
- x_1 = ratio of sepal length to width
- x_2 = ratio of petal length to width

[Scatter plot: x_1 (sepal length/width) on the horizontal axis, x_2 (petal length/width) on the vertical axis; successive overlays grow the tree one split at a time, ending with the tree from the previous slide: x_1 > 1.7, then x_2 > 2.8, with leaves y = 1, y = 2, y = 3.]
(Lec2.) Nearest neighbors and decision trees

Today we covered two standard machine learning methods.

[Figure: an example 2-D dataset over coordinates (x_1, x_2).]

Nearest neighbors.
Training/fitting: memorize data.
Testing/predicting: find the k closest memorized points, return the plurality label.
Overfitting? Vary k.

Decision trees.
Training/fitting: greedily partition space, reducing "uncertainty".
Testing/predicting: traverse the tree, output the leaf label.
Overfitting? Limit or prune the tree.

Note: both methods can output real numbers (regression, not classification); return the median/mean of the {neighbors, points reaching the leaf}.
(Lec3-4.) ERM setup for least squares

- Predictors/model: f(x) = w^T x, a linear predictor/regressor.
  (For linear classification: x ↦ sgn(w^T x).)
- Loss/penalty: the least squares loss

      ℓ_ls(ŷ, y) = ℓ_ls(y, ŷ) = (y − ŷ)^2.

  (Some conventions scale this by 1/2.)
- Goal: minimize the least squares empirical risk

      R_ls(f) = (1/n) ∑_{i=1}^n ℓ_ls(y_i, f(x_i)) = (1/n) ∑_{i=1}^n (y_i − f(x_i))^2.

- Specifically, we choose w ∈ R^d according to

      argmin_{w ∈ R^d} R_ls(x ↦ w^T x) = argmin_{w ∈ R^d} (1/n) ∑_{i=1}^n (y_i − w^T x_i)^2.

- More generally, this is the ERM approach: pick a model and minimize empirical risk over the model parameters.
(Lec3-4.) ERM in general

- Pick a family of models/predictors F. (For today, we use linear predictors.)
- Pick a loss function ℓ. (For today, we chose squared loss.)
- Minimize the empirical risk over the model parameters.

We haven't discussed: true risk and overfitting; how to minimize; why this is a good idea.

Remark: ERM is convenient in PyTorch: just pick a model, a loss, and an optimizer, and tell it to minimize.
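To make the remark concrete, a minimal PyTorch sketch of the recipe for least squares (synthetic data; hyperparameters are arbitrary):

```python
import torch

n, d = 100, 5
X, y = torch.randn(n, d), torch.randn(n, 1)   # stand-in data

model = torch.nn.Linear(d, 1, bias=False)     # the model: x -> w^T x
loss_fn = torch.nn.MSELoss()                  # the loss: least squares
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(500):                          # tell it to minimize
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()
```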
(Lec3-4.) Empirical risk minimization in matrix notation

Define the n × d matrix A and the n × 1 column vector b by

    A := (1/√n) [x_1^T; ...; x_n^T],    b := (1/√n) [y_1; ...; y_n].

Can write the empirical risk as

    R(w) = (1/n) ∑_{i=1}^n (y_i − x_i^T w)^2 = ‖Aw − b‖_2^2.

Necessary condition for w to be a minimizer of R:

    ∇R(w) = 0, i.e., w is a critical point of R.

This translates to (A^T A) w = A^T b, a system of linear equations called the normal equations.

In an upcoming lecture we'll prove every critical point of R is a minimizer of R.
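A small NumPy check of this slide (a sketch on synthetic data, not lecture code): solve the normal equations directly and compare against a least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
Xs, ys = rng.normal(size=(n, d)), rng.normal(size=n)

A = Xs / np.sqrt(n)                      # rows x_i^T / sqrt(n)
b = ys / np.sqrt(n)

# Solve the normal equations (A^T A) w = A^T b directly...
w_normal = np.linalg.solve(A.T @ A, A.T @ b)
# ...and via a least-squares solver for ||Aw - b||_2^2.
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

assert np.allclose(w_normal, w_lstsq)
```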
(Lec3-4.) Full (factorization) SVD (new slide)

Given M ∈ R^{n×d}, let M = U S V^T denote the singular value decomposition (SVD), where
- U ∈ R^{n×n} is orthonormal, thus U^T U = U U^T = I,
- V ∈ R^{d×d} is orthonormal, thus V^T V = V V^T = I,
- S ∈ R^{n×d} has singular values s_1 ≥ s_2 ≥ ... ≥ s_{min{n,d}} along the diagonal and zeros elsewhere, where the number of positive singular values equals the rank of M.

Some facts:
- The SVD is not unique when the singular values are not distinct; e.g., we can write I = U I U^T where U is any orthonormal matrix.
- The pseudoinverse S^+ ∈ R^{d×n} of S is obtained by starting with S^T and taking the reciprocal of each positive entry.
- The pseudoinverse of M is M^+ = V S^+ U^T.
- If M^{-1} exists, then M^{-1} = M^+.
(Lec3-4.) Thin (decomposition) SVD (new slide)

Given M ∈ R^{n×d}, (s, u, v) form a singular value with corresponding left and right singular vectors if Mv = su and M^T u = sv.

The thin SVD of M is M = ∑_{i=1}^r s_i u_i v_i^T, where r is the rank of M, and
- the left singular vectors (u_1, ..., u_r) are orthonormal (but we might have r < min{n, d}!) and span the column space of M,
- the right singular vectors (v_1, ..., v_r) are orthonormal (but we might have r < min{n, d}!) and span the row space of M,
- the singular values satisfy s_1 ≥ ... ≥ s_r > 0.

Some facts:
- Pseudoinverse M^+ = ∑_{i=1}^r (1/s_i) v_i u_i^T.
- (u_i)_{i=1}^r span the range of M.
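A quick NumPy illustration of the pseudoinverse formula (a sketch with synthetic data; note np.linalg.svd returns V^T rather than V):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))

# Thin SVD: here U is 6x4, s holds the singular values, Vt is 4x4.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# M^+ = sum_i (1/s_i) v_i u_i^T, keeping only positive singular values.
r = np.sum(s > 1e-12)                    # rank of M
M_pinv = sum((1.0 / s[i]) * np.outer(Vt[i], U[:, i]) for i in range(r))

assert np.allclose(M_pinv, np.linalg.pinv(M))
```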
(Lec3-4.) SVD and least squares

Recall: we'd like to find w such that A^T A w = A^T b.

If w = A^+ b, then

    A^T A w = (∑_{i=1}^r s_i v_i u_i^T)(∑_{i=1}^r s_i u_i v_i^T)(∑_{i=1}^r (1/s_i) v_i u_i^T) b
            = (∑_{i=1}^r s_i v_i u_i^T)(∑_{i=1}^r u_i u_i^T) b = A^T b.

Henceforth, define w_ols = A^+ b as the OLS solution. (OLS = "ordinary least squares".)

Note: in general, A A^+ = ∑_{i=1}^r u_i u_i^T ≠ I.
(Lec3-4.) Normal equations imply optimality

Consider w with A^T A w = A^T y, and any w′; then

    ‖Aw′ − y‖^2 = ‖Aw′ − Aw + Aw − y‖^2
                = ‖Aw′ − Aw‖^2 + 2(Aw′ − Aw)^T(Aw − y) + ‖Aw − y‖^2.

Since

    (Aw′ − Aw)^T(Aw − y) = (w′ − w)^T(A^T A w − A^T y) = 0,

then ‖Aw′ − y‖^2 = ‖Aw′ − Aw‖^2 + ‖Aw − y‖^2 ≥ ‖Aw − y‖^2. This means w is optimal.

Moreover, writing A = ∑_{i=1}^r s_i u_i v_i^T,

    ‖Aw′ − Aw‖^2 = (w′ − w)^T (A^T A)(w′ − w) = (w′ − w)^T (∑_{i=1}^r s_i^2 v_i v_i^T)(w′ − w),

so w′ is optimal iff w′ − w is in the right nullspace of A.

(We'll revisit all this with convexity later.)
(Lec3-4.) Regularized ERM

Combine the two concerns: for a given λ ≥ 0, find a minimizer of

    R(w) + λ‖w‖_2^2

over w ∈ R^d.

Fact: if λ > 0, then the solution is always unique (even if n < d)!
- This is called ridge regression. (λ = 0 is ERM / ordinary least squares.)
  Explicit solution: (A^T A + λI)^{-1} A^T b.
- The parameter λ controls how much attention is paid to the regularizer ‖w‖_2^2 relative to the data-fitting term R(w).
- Choose λ using cross-validation.

Note: in deep networks, this regularization is called "weight decay". (Why?)
Note: another popular regularizer for linear regression is ℓ_1.
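A sketch of the explicit ridge solution (synthetic data with n < d to illustrate that the solution is still unique when λ > 0):

```python
import numpy as np

def ridge(A, b, lam):
    """Minimize ||Aw - b||_2^2 + lam * ||w||_2^2 in closed form."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 10))             # n < d: A^T A is singular...
b = rng.normal(size=5)
w = ridge(A, b, lam=0.1)                 # ...but the ridge solution is unique.
```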
(Lec5-6.) Geometry of linear classifiers

[Figure: a hyperplane H in coordinates (x_1, x_2), with normal vector w.]

A hyperplane in R^d is a linear subspace of dimension d − 1.
- A hyperplane in R^2 is a line.
- A hyperplane in R^3 is a plane.
- As a linear subspace, a hyperplane always contains the origin.

A hyperplane H can be specified by a (non-zero) normal vector w ∈ R^d. The hyperplane with normal vector w is the set of points orthogonal to w:

    H = {x ∈ R^d : x^T w = 0}.

Given w and its corresponding H: H splits the points labeled positive, {x : w^T x > 0}, from those labeled negative, {x : w^T x < 0}.
(Lec5-6.) Classification with a hyperplane

[Figure: H with normal w; a point x projected onto span{w} at angle θ, with coordinate ‖x‖_2 · cos θ.]

The projection of x onto span{w} (a line) has coordinate

    ‖x‖_2 · cos(θ),  where  cos θ = x^T w / (‖w‖_2 ‖x‖_2).

(The distance to the hyperplane is ‖x‖_2 · |cos(θ)|.)

The decision boundary is the hyperplane (oriented by w):

    x^T w > 0 ⟺ ‖x‖_2 · cos(θ) > 0 ⟺ x is on the same side of H as w.

What should we do if we want a hyperplane decision boundary that doesn't (necessarily) go through the origin?
(Lec5-6.) Linear separability

Is it always possible to find w with sign(w^T x_i) = y_i? Is it always possible to find a hyperplane separating the data? (Appending a 1 to x means it need not go through the origin.)

[Figure: two 2-D datasets; the left one is linearly separable, the right one is not.]
(Lec5-6.) Cauchy-Schwarz (new slide)

Cauchy-Schwarz inequality. |a^T b| ≤ ‖a‖ · ‖b‖.

Proof. If ‖a‖ = ‖b‖,

    0 ≤ ‖a − b‖^2 = ‖a‖^2 − 2a^T b + ‖b‖^2 = 2‖a‖ · ‖b‖ − 2a^T b,

which rearranges to give a^T b ≤ ‖a‖ · ‖b‖.

For the case ‖a‖ < ‖b‖, apply the preceding to the equal-norm pair (a, ‖a‖b/‖b‖) and scale: a^T b = (‖b‖/‖a‖) · (a^T(‖a‖b/‖b‖)) ≤ (‖b‖/‖a‖) · ‖a‖^2 = ‖a‖ · ‖b‖.

For the absolute value, apply the preceding to (a, −b). □
(Lec5-6.) Logistic loss

Let's state our classification goal with a generic margin loss ℓ:

    R(w) = (1/n) ∑_{i=1}^n ℓ(y_i w^T x_i);

the key properties we want:
- ℓ is continuous;
- ℓ(z) ≥ c·1[z ≤ 0] = c·ℓ_zo(z) for some c > 0 and any z ∈ R, which implies R_ℓ(w) ≥ c·R_zo(w);
- ℓ′(0) < 0 (pushes stuff from wrong to right).

Examples.
- Squared loss, written in margin form: ℓ_ls(z) := (1 − z)^2; note ℓ_ls(yŷ) = (1 − yŷ)^2 = y^2(y − ŷ)^2 = (y − ŷ)^2, using y^2 = 1.
- Logistic loss: ℓ_log(z) = ln(1 + exp(−z)).
(Lec5-6.) Squared and logistic losses on linearly separable data I

[Figure: contour plots of the learned predictor on a linearly separable 2-D dataset; left panel: logistic loss, right panel: squared loss.]
(Lec5-6.) Squared and logistic losses on linearly separable data II

[Figure: contour plots of the learned predictor on a second linearly separable 2-D dataset; left panel: logistic loss, right panel: squared loss.]
(Lec5-6.) Logistic risk and separation

If there exists a perfect linear separator, empirical logistic risk minimization should find it.

Theorem. If there exists w with y_i w^T x_i > 0 for all i, then every w with R_log(w) < ln(2)/(2n) + inf_v R_log(v) also satisfies y_i w^T x_i > 0.

Proof. Omitted.
(Lec5-6.) Least squares and logistic ERM

Least squares:
- Take the gradient of ‖Aw − b‖^2, set it to 0; obtain the normal equations A^T A w = A^T b.
- One choice is the minimum norm solution A^+ b.

Logistic loss:
- Take the gradient of R_log(w) = (1/n) ∑_{i=1}^n ln(1 + exp(−y_i w^T x_i)) and set it to 0 ???

Remark. Is A^+ b a "closed form expression"?
(Lec5-6.) Gradient descent

Given a function F : R^d → R, gradient descent is the iteration

    w_{i+1} := w_i − η_i ∇_w F(w_i),

where w_0 is given, and η_i is a learning rate / step size.

[Figure: gradient descent iterates on the contours of a convex quadratic.]

Does this work for least squares? Later we'll show it works for least squares and logistic regression due to convexity.
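A sketch of gradient descent on the least squares risk F(w) = ‖Aw − b‖_2^2 (synthetic data; the constant step size is an arbitrary choice that happens to be small enough here):

```python
import numpy as np

def gradient_descent_ls(A, b, eta=0.1, steps=2000):
    """Gradient descent on F(w) = ||Aw - b||_2^2, with grad F(w) = 2 A^T (Aw - b)."""
    w = np.zeros(A.shape[1])
    for _ in range(steps):
        w -= eta * 2 * (A.T @ (A @ w - b))
    return w

rng = np.random.default_rng(0)
n, d = 50, 3
A = rng.normal(size=(n, d)) / np.sqrt(n)   # as on the matrix-notation slide
b = rng.normal(size=n) / np.sqrt(n)
w = gradient_descent_ls(A, b)
# The iterates approach a solution of the normal equations.
assert np.allclose(A.T @ A @ w, A.T @ b, atol=1e-6)
```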
(Lec5-6.) Multiclass?

All our methods so far handle multiclass:
- k-NN and decision tree: plurality label.
- Least squares: argmin_{W ∈ R^{d×k}} ‖AW − B‖_F^2 with B ∈ R^{n×k}; W ∈ R^{d×k} is k separate linear regressors in R^d.

How about linear classifiers?
- At prediction time, x ↦ argmax_y f(x)_y.
- As in the binary case: interpretation f(x)_y = Pr[Y = y | X = x].

What is a good loss function?
(Lec5-6.) Cross-entropy

Given two probability vectors p, q ∈ Δ_k = {p ∈ R^k_{≥0} : ∑_i p_i = 1},

    H(p, q) = −∑_{i=1}^k p_i ln q_i    (cross-entropy).

- If p = q, then H(p, q) = H(p) (entropy); indeed

      H(p, q) = −∑_{i=1}^k p_i ln(p_i · (q_i/p_i)) = H(p) [entropy] + KL(p, q) [KL divergence].

  Since KL ≥ 0, and moreover KL = 0 iff p = q, this is the cost/entropy of p plus a penalty for differing.
- Choose the encoding y = e_y for y ∈ {1, ..., k}, and ŷ ∝ exp(f(x)) with f : R^d → R^k;

      ℓ_ce(y, f(x)) = H(e_y, ŷ) = −∑_{i=1}^k (e_y)_i ln( exp(f(x)_i) / ∑_{j=1}^k exp(f(x)_j) )
                    = −ln( exp(f(x)_y) / ∑_{j=1}^k exp(f(x)_j) )
                    = −f(x)_y + ln ∑_{j=1}^k exp(f(x)_j).

(In PyTorch, use torch.nn.CrossEntropyLoss()(f(x), y).)
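A small check that the displayed formula matches PyTorch's loss (which averages over the batch by default); the batch size and class count below are arbitrary:

```python
import torch

f_x = torch.randn(4, 3)                  # logits f(x) for 4 examples, k = 3
y = torch.tensor([0, 2, 1, 2])           # class indices

# Manual: -f(x)_y + ln sum_j exp(f(x)_j), averaged over the batch.
manual = (-f_x[torch.arange(4), y] + torch.logsumexp(f_x, dim=1)).mean()

assert torch.allclose(manual, torch.nn.CrossEntropyLoss()(f_x, y))
```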
(Lec5-6.) Cross-entropy, classification, and margins

The zero-one loss for classification is

    ℓ_zo(y, f(x)) = 1[ y ≠ argmax_j f(x)_j ].

In the multiclass case, can define the margin as

    f(x)_y − max_{j≠y} f(x)_j,

interpreted as "the distance by which f is correct". (Can be negative!)

Since ln ∑_j exp(z_j) ≈ max_j z_j, cross-entropy satisfies

    ℓ_ce(y, f(x)) = −f(x)_y + ln ∑_j exp(f(x)_j) ≈ −f(x)_y + max_j f(x)_j,

thus minimizing cross-entropy maximizes margins.
(Lec7-8.) The ERM perspective

These lectures will follow an ERM perspective on deep networks:
- Pick a model/predictor class (network architecture). (We will spend most of our time on this!)
- Pick a loss/risk. (We will almost always use cross-entropy!)
- Pick an optimizer. (We will mostly treat this as a black box!)

The goal is low test error, whereas the above only gives low training error; we will briefly discuss this as well.
(Lec7-8.) Basic deep networks

A self-contained expression is

    x ↦ σ_L(W_L σ_{L−1}(⋯ (W_2 σ_1(W_1 x + b_1) + b_2) ⋯) + b_L),

with equivalent "functional form"

    x ↦ (f_L ∘ ⋯ ∘ f_1)(x),  where f_i(z) = σ_i(W_i z + b_i).

Some further details (many more to come!):
- (W_i)_{i=1}^L with W_i ∈ R^{d_i × d_{i−1}} are the weights, and (b_i)_{i=1}^L are the biases.
- (σ_i)_{i=1}^L with σ_i : R^{d_i} → R^{d_i} are called nonlinearities, or activations, or transfer functions, or link functions.
- This is only the basic setup; many things can and will change, please ask many questions!
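A minimal sketch of the basic form for L = 2 (ReLU inside, identity last layer); the dimensions and names are illustrative, not from the lecture:

```python
import torch

d, d1, d2 = 4, 8, 3                      # input, hidden, and output dimensions
W1, b1 = torch.randn(d1, d), torch.randn(d1)
W2, b2 = torch.randn(d2, d1), torch.randn(d2)

def forward(x):
    """x -> sigma_2(W_2 sigma_1(W_1 x + b_1) + b_2), with sigma_1 = ReLU."""
    z1 = torch.relu(W1 @ x + b1)         # f_1(x) = sigma_1(W_1 x + b_1)
    return W2 @ z1 + b2                  # f_2(z) with sigma_2 = identity

x = torch.randn(d)
print(forward(x))                        # a vector in R^{d_2}
```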
(Lec7-8.) Choices of activation

Basic form:

    x ↦ σ_L(W_L σ_{L−1}(⋯ W_2 σ_1(W_1 x + b_1) + b_2 ⋯) + b_L).

Choices of activation (univariate, coordinate-wise):
- Indicator/step/Heaviside/threshold z ↦ 1[z ≥ 0]. This was the original choice (1940s!).
- Sigmoid σ_s(z) := 1/(1 + exp(−z)). This was popular roughly 1970s–2005?
- Hyperbolic tangent z ↦ tanh(z). Similar to sigmoid, used during the same interval.
- Rectified Linear Unit (ReLU) σ_r(z) = max{0, z}. It (and slight variants, e.g., Leaky ReLU, ELU, ...) are the dominant choice now; popularized in the "ImageNet/AlexNet" paper (Krizhevsky-Sutskever-Hinton, 2012).
- Identity z ↦ z; we'll often use this as the last layer when we use cross-entropy loss.
- NON-coordinate-wise choices: we will discuss "softmax" and "pooling" a bit later.
(Lec7-8.) "Architectures" and "models"

Basic form:

    x ↦ σ_L(W_L σ_{L−1}(⋯ W_2 σ_1(W_1 x + b_1) + b_2 ⋯) + b_L).

((W_i, b_i))_{i=1}^L, the weights and biases, are the parameters. Let's roll them into W := ((W_i, b_i))_{i=1}^L, and consider the network as a function of two arguments, F_W(x) = F(x; W).

- The model or class of functions is {F_W : all possible W}. F (both arguments unset) is also called an architecture.
- When we fit/train/optimize, typically we leave the architecture fixed and vary W to minimize risk.

(More on this in a moment.)
(Lec7-8.) ERM recipe for basic networks

Standard ERM recipe:
- First we pick a class of functions/predictors; for deep networks, that means an F(·, ·).
- Then we pick a loss function and write down an empirical risk minimization problem; in these lectures we will pick cross-entropy:

      argmin_W (1/n) ∑_{i=1}^n ℓ_ce(y_i, F(x_i; W))
        = argmin_{W_1 ∈ R^{d_1×d}, b_1 ∈ R^{d_1}, ..., W_L ∈ R^{d_L×d_{L−1}}, b_L ∈ R^{d_L}} (1/n) ∑_{i=1}^n ℓ_ce(y_i, F(x_i; ((W_i, b_i))_{i=1}^L))
        = argmin over the same parameters of (1/n) ∑_{i=1}^n ℓ_ce(y_i, σ_L(⋯ σ_1(W_1 x_i + b_1) ⋯)).

- Then we pick an optimizer. In this class, we only use gradient descent variants. It is a miracle that this works. (A minimal training-loop sketch follows below.)
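A minimal PyTorch sketch of the whole recipe (architecture, cross-entropy, a gradient-descent variant) on synthetic data; all sizes and hyperparameters are arbitrary:

```python
import torch

n, d, k = 256, 10, 3
X = torch.randn(n, d)
y = torch.randint(k, (n,))

# Architecture F: two layers, ReLU inside, identity last layer
# (the softmax is baked into the cross-entropy loss).
model = torch.nn.Sequential(
    torch.nn.Linear(d, 32), torch.nn.ReLU(), torch.nn.Linear(32, k))
loss_fn = torch.nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(20):
    for i in range(0, n, 32):            # minibatches of size 32
        xb, yb = X[i:i+32], y[i:i+32]
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()   # risk on the minibatch
        opt.step()
```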
(Lec7-8.) Sometimes, linear just isn't enough

[Figure: contour plots of two predictors on the same 2-D dataset.]

Linear predictor: x ↦ w^T [x; 1]. Some blue points are misclassified.

ReLU network: x ↦ W_2 σ_r(W_1 x + b_1) + b_2. 0 misclassifications!
(Lec7-8.) Classical example: XOR

Classical "XOR problem" (Minsky-Papert '69). (Check Wikipedia for "AI Winter".)

Theorem. On this data, any linear classifier (with affine expansion) makes at least one mistake.

Picture proof. Recall: linear classifiers correspond to separating hyperplanes.
- If the hyperplane splits the blue points, it's incorrect on one of them.
- If it doesn't split the blue points, then one halfspace contains the common midpoint, and is therefore wrong on at least one red point.
(Lec7-8.) One layer was not enough. How about two?

Theorem (Cybenko '89, Hornik-Stinchcombe-White '89, Funahashi '89, Leshno et al. '92, ...). Given any continuous function f : R^d → R and any ε > 0, there exist parameters (W_1, b_1, W_2) so that

    sup_{x ∈ [0,1]^d} | f(x) − W_2 σ(W_1 x + b_1) | ≤ ε,

as long as σ is "reasonable" (e.g., ReLU or sigmoid or threshold).

Remarks.
- Together with the XOR example, this justifies using nonlinearities.
- Does not justify (very) deep networks.
- Only says these networks exist, not that we can optimize for them!
(Lec7-8.) General graph-based view

Classical graph-based perspective.
- The network is a directed acyclic graph; sources are inputs, sinks are outputs, and intermediate nodes compute z ↦ σ(w^T z + b) (each with its own (σ, w, b)).
- Nodes at distance 1 from the inputs are the first layer, distance 2 is the second layer, and so on.

"Modern" graph-based perspective.
- Edges in the graph can be multivariate, meaning vectors or general tensors, and not just scalars.
- Edges will often "skip" layers; "layer" is therefore ambiguous.
- Diagram conventions differ; e.g., tensorflow graphs include nodes for parameters.
(Lec7-8.) 2-D convolution in deep networks (pictures)

[Animated convolution figures omitted; taken from https://github.com/vdumoulin/conv_arithmetic by Vincent Dumoulin, Francesco Visin.]
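For reference, a shape-level sketch of a 2-D convolution layer in PyTorch (the numbers are illustrative):

```python
import torch

# One 3-channel 32x32 image (batch size 1), convolved with 16 filters of
# size 3x3; padding=1 keeps the spatial dimensions unchanged.
conv = torch.nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(1, 3, 32, 32)
print(conv(x).shape)                     # torch.Size([1, 16, 32, 32])
```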
(Lec7-8.) Softmax

Replace the vector input z with z′ ∝ e^z, meaning

    z ↦ ( e^{z_1}/∑_j e^{z_j}, ..., e^{z_k}/∑_j e^{z_j} ).

- Converts the input into a probability vector; useful for interpreting the network output as Pr[Y = y | X = x].
- We have baked it into our cross-entropy definition; the last lectures' networks with cross-entropy training had an implicit softmax.
- If some coordinate j of z dominates the others, then the softmax output is close to e_j.
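A two-line illustration (my own, not from the lecture):

```python
import torch

z = torch.tensor([1.0, 2.0, 3.0])
print(torch.softmax(z, dim=0))           # a probability vector summing to 1

# When one coordinate dominates, softmax is close to a standard basis vector.
print(torch.softmax(torch.tensor([1.0, 2.0, 30.0]), dim=0))   # approximately e_3
```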
(Lec7-8.) Max pooling

Example: 3 × 3 max pooling with stride 1 applied to a 5 × 5 input:

    input:            output:
    2 2 3 0 3
    0 0 1 0 3         3 3 3
    0 0 2 1 2         2 3 3
    0 2 2 3 1         3 3 3
    1 2 3 1 0

(Taken from https://github.com/vdumoulin/conv_arithmetic by Vincent Dumoulin, Francesco Visin.)

- Often used together with convolution layers; shrinks/downsamples the input.
- Another variant is average pooling.
- Implementation: torch.nn.MaxPool2d.
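A check of the example above with torch.nn.MaxPool2d, assuming the 5 × 5 grid is read row by row (the pooled values then match the slide):

```python
import torch

x = torch.tensor([[2., 2., 3., 0., 3.],
                  [0., 0., 1., 0., 3.],
                  [0., 0., 2., 1., 2.],
                  [0., 2., 2., 3., 1.],
                  [1., 2., 3., 1., 0.]]).reshape(1, 1, 5, 5)   # (b, c, h, w)

pool = torch.nn.MaxPool2d(kernel_size=3, stride=1)
print(pool(x).squeeze())
# tensor([[3., 3., 3.],
#         [2., 3., 3.],
#         [3., 3., 3.]])
```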
(Lec9-10.) Multivariate network single-example gradients

Define G_j(W) = σ_j(W_j ⋯ σ_1(W_1 x + b_1) ⋯).

The multivariate chain rule tells us

    ∇_W F(Wx) = J^T x^T,

where J ∈ R^{l×k} is the Jacobian matrix of F : R^k → R^l at Wx, the matrix of all coordinate-wise derivatives. Then

    dG_L/dW_L = J_L^T G_{L−1}(W)^T,
    dG_L/db_L = J_L^T,
    ⋮
    dG_L/dW_j = (J_L W_L J_{L−1} W_{L−1} ⋯ J_j)^T G_{j−1}(W)^T,
    dG_L/db_j = (J_L W_L J_{L−1} W_{L−1} ⋯ J_j)^T,

with J_j the Jacobian of σ_j at W_j G_{j−1}(W) + b_j. For example, when σ_j applies some σ : R → R coordinate-wise,

    J_j = diag( σ′([W_j G_{j−1}(W) + b_j]_1), ..., σ′([W_j G_{j−1}(W) + b_j]_{d_j}) ).
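A sketch verifying the coordinate-wise Jacobian formula against autograd, for a tiny two-layer network with scalar output (shapes and names are my own):

```python
import torch

d, h = 4, 5
x = torch.randn(d)
W1 = torch.randn(h, d, requires_grad=True)
b1 = torch.randn(h, requires_grad=True)
w2 = torch.randn(h)

z = W1 @ x + b1                          # pre-activation
out = w2 @ torch.relu(z)                 # scalar output w_2^T sigma_r(W_1 x + b_1)
out.backward()

# Chain rule by hand: J_1 = diag(sigma'(z)), so
# d out / d W_1 = (w_2 * sigma'(z)) x^T  and  d out / d b_1 = w_2 * sigma'(z).
sigma_prime = (z > 0).float()            # ReLU derivative (almost everywhere)
assert torch.allclose(W1.grad, torch.outer(w2 * sigma_prime, x))
assert torch.allclose(b1.grad, w2 * sigma_prime)
```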
(Lec9-10.) Initialization

Recall

    dG_L/dW_j = (J_L W_L J_{L−1} W_{L−1} ⋯ J_j)^T G_{j−1}(W)^T.

- What if we set W = 0? What if σ = σ_r is a ReLU?
- What if we set two rows of W_j (two nodes) identically?
- Resolving this issue is called symmetry breaking.
- Standard linear/dense layer initializations:

      N(0, 2/d_in)              "He et al.",
      N(0, 2/(d_in + d_out))    Glorot/Xavier,
      U(−1/√d_in, 1/√d_in)      torch default.

(Convolution layers are adjusted to have similar distributions.)
Random initialization is emerging as a key story in deep networks!
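A sketch of these initializations via torch.nn.init (each call below overwrites the previous one; the gain conventions match the stated variances):

```python
import torch

layer = torch.nn.Linear(256, 128)

# He init: N(0, 2/d_in), intended for ReLU networks.
torch.nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')

# Glorot/Xavier init: N(0, 2/(d_in + d_out)).
torch.nn.init.xavier_normal_(layer.weight)

# The torch default for Linear, U(-1/sqrt(d_in), 1/sqrt(d_in)), is applied
# automatically at construction time; an all-zeros weight would break no
# symmetry and is a bad choice.
```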
(Lec9-10; SGD slide.) Minibatches

We used the linearity of gradients to write

    ∇_w R(w) = (1/n) ∑_{i=1}^n ∇_w ℓ(F(x_i; w), y_i).

What happens if we replace ((x_i, y_i))_{i=1}^n with a minibatch ((x′_i, y′_i))_{i=1}^b?

- Random minibatch ⟹ the two gradients are equal in expectation.
- Most torch layers take minibatch input:
  - torch.nn.Linear has input shape (b, d), output (b, d′).
  - torch.nn.Conv2d has input shape (b, c, h, w), output (b, c′, h′, w′).
- This is used heavily outside deep learning as well. It is an easy way to use parallel floating point operations (as on GPUs and CPUs).
- Setting the batch size is black magic and depends on many things (prediction problem, GPU characteristics, ...).
(Lec9-10.) Convex sets

A set S is convex if, for every pair of points {x, x′} in S, the line segment between x and x′ is also contained in S. ({x, x′} ⊆ S ⟹ [x, x′] ⊆ S.)

[Figure: four example sets, labeled convex, not convex, convex, convex.]

Examples:
- All of R^d.
- The empty set.
- Half-spaces: {x ∈ R^d : a^T x ≤ b}.
- Intersections of convex sets.
- Polyhedra: {x ∈ R^d : Ax ≤ b} = ⋂_{i=1}^m {x ∈ R^d : a_i^T x ≤ b_i}.
- Convex hulls: conv(S) := {∑_{i=1}^k α_i x_i : k ∈ N, x_i ∈ S, α_i ≥ 0, ∑_{i=1}^k α_i = 1}.
  (Infinite convex hulls: intersection of all convex supersets.)
(Lec9-10.) Convex functions from convex sets

The epigraph of a function f is the area above the curve:

    epi(f) := {(x, y) ∈ R^{d+1} : y ≥ f(x)}.

A function is convex if its epigraph is convex.

[Figure: a non-convex f and a convex f, with their epigraphs shaded.]
(Lec9-10.) Convex functions (standard definition)

A function f : R^d → R is convex if for any x, x′ ∈ R^d and α ∈ [0, 1],

    f((1 − α)x + αx′) ≤ (1 − α)·f(x) + α·f(x′).

[Figure: a chord crossing the graph for a non-convex f, and chords lying above the graph for a convex f.]

Examples:
- f(x) = cx for any c > 0 (on R).
- f(x) = |x|^c for any c ≥ 1 (on R).
- f(x) = b^T x for any b ∈ R^d.
- f(x) = ‖x‖ for any norm ‖·‖.
- f(x) = x^T A x for symmetric positive semidefinite A.
- f(x) = ln(∑_{i=1}^d exp(x_i)), which approximates max_i x_i.
(Lec9-10.) Convexity of differentiable functions

Differentiable functions. If f : R^d → R is differentiable, then f is convex if and only if

    f(x) ≥ f(x_0) + ∇f(x_0)^T(x − x_0)

for all x, x_0 ∈ R^d.
Note: this implies increasing slopes: (∇f(x) − ∇f(y))^T(x − y) ≥ 0.

[Figure: the tangent a(x) = f(x_0) + f′(x_0)(x − x_0) lies below the graph of f.]

Twice-differentiable functions. If f : R^d → R is twice-differentiable, then f is convex if and only if

    ∇²f(x) ⪰ 0

for all x ∈ R^d (i.e., the Hessian, or matrix of second derivatives, is positive semidefinite for all x).
(Lec9-10.) Convex optimization problems

Standard form of a convex optimization problem:

    min_{x ∈ R^d} f_0(x)
    s.t. f_i(x) ≤ 0 for all i = 1, ..., n,

where f_0, f_1, ..., f_n : R^d → R are convex functions.

Fact: the feasible set

    A := {x ∈ R^d : f_i(x) ≤ 0 for all i = 1, 2, ..., n}

is a convex set.

(SVMs next week will give us an example.)
(Lec9-10.) Subgradients: Jensen's inequality

If f : R^d → R is convex, then E f(X) ≥ f(E X).

Proof. Set y := E X, and pick any s ∈ ∂f(E X). Then

    E f(X) ≥ E( f(y) + s^T(X − y) ) = f(y) + s^T E(X − y) = f(y).

Note. This inequality comes up often!
(Lec11-12.) Maximum margin linear classifier

The solution w to the following mathematical optimization problem:

    min_{w ∈ R^d} (1/2)‖w‖_2^2
    s.t. y x^T w ≥ 1 for all (x, y) ∈ S

gives the linear classifier with the largest minimum margin on S—i.e., the maximum margin linear classifier or support vector machine (SVM) classifier.

This is a convex optimization problem; it can be solved in polynomial time.

If there is a solution (i.e., S is linearly separable), then the solution is unique.

We can solve this in a variety of ways (e.g., projected gradient descent); we will work with the dual.

Note: Can also explicitly include affine expansion, so the decision boundary need not pass through the origin. We'll do our derivations without it.
(Lec11-12.) SVM Duality summary

Lagrangian:

    L(w, α) = (1/2)‖w‖_2^2 + ∑_{i=1}^n α_i(1 − y_i x_i^T w).

The primal maximum margin problem was

    P(w) = max_{α ≥ 0} L(w, α) = max_{α ≥ 0} [ (1/2)‖w‖_2^2 + ∑_{i=1}^n α_i(1 − y_i x_i^T w) ].

Dual problem:

    D(α) = min_{w ∈ R^d} L(w, α) = L(∑_{i=1}^n α_i y_i x_i, α)
         = ∑_{i=1}^n α_i − (1/2)‖∑_{i=1}^n α_i y_i x_i‖_2^2
         = ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n α_i α_j y_i y_j x_i^T x_j.

Given a dual optimum α:
- the corresponding primal optimum is w = ∑_{i=1}^n α_i y_i x_i;
- strong duality: P(w) = D(α);
- α_i > 0 implies y_i x_i^T w = 1, and these y_i x_i are the support vectors.
(Lec11-12.) Nonseparable case: bottom line

Unconstrained primal:

    min_{w ∈ R^d} (1/2)‖w‖^2 + C ∑_{i=1}^n [1 − y_i x_i^T w]_+.

Dual:

    max_{α ∈ R^n, 0 ≤ α_i ≤ C} ∑_{i=1}^n α_i − (1/2)‖∑_{i=1}^n α_i y_i x_i‖^2.

The dual solution α gives the primal solution w = ∑_{i=1}^n α_i y_i x_i.

Remarks.
- Can take C → ∞ to recover the separable case.
- The dual is a constrained convex quadratic (can be solved with projected gradient descent; see the sketch below).
- Some presentations include a bias in the primal (x_i^T w + b); this introduces a constraint ∑_{i=1}^n y_i α_i = 0 in the dual.
- Some presentations replace 1/2 and C with λ/2 and 1/n, respectively.
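Following the remark above, a hedged sketch of projected gradient ascent on this dual (synthetic data; the fixed step size is an unprincipled choice that is small enough here, and no bias term is included):

```python
import numpy as np

def svm_dual_pga(X, y, C=1.0, eta=0.002, steps=5000):
    """Projected gradient ascent on
    D(alpha) = sum_i alpha_i - 0.5 * ||sum_i alpha_i y_i x_i||^2
    over the box 0 <= alpha_i <= C."""
    G = y[:, None] * X                   # rows y_i x_i^T
    Q = G @ G.T                          # Q_ij = y_i y_j x_i^T x_j
    alpha = np.zeros(len(y))
    for _ in range(steps):
        grad = 1.0 - Q @ alpha           # gradient of D at alpha
        alpha = np.clip(alpha + eta * grad, 0.0, C)   # ascent step, then project
    return alpha, G.T @ alpha            # dual variables; primal w = sum_i alpha_i y_i x_i

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1, 1, (20, 2)), rng.normal(-1, 1, (20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
alpha, w = svm_dual_pga(X, y)
```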
(Lec11-12.) Looking at the dual again

The SVM dual problem only depends on the x_i through inner products x_i^T x_j:

    max_{α_1, ..., α_n ≥ 0} ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n α_i α_j y_i y_j x_i^T x_j.

If we use feature expansion (e.g., quadratic expansion) x ↦ φ(x), this becomes

    max_{α_1, ..., α_n ≥ 0} ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n α_i α_j y_i y_j φ(x_i)^T φ(x_j).

The solution w = ∑_{i=1}^n α_i y_i φ(x_i) is used in the following way:

    x ↦ φ(x)^T w = ∑_{i=1}^n α_i y_i φ(x)^T φ(x_i).

Key insight:
- Training and prediction only use φ(x)^T φ(x′), never an isolated φ(x);
- Sometimes computing φ(x)^T φ(x′) is much easier than computing φ(x).
(Lec11-12.) Quadratic expansion

- φ : R^d → R^{1+2d+C(d,2)}, where

      φ(x) = (1, √2·x_1, ..., √2·x_d, x_1^2, ..., x_d^2, √2·x_1x_2, ..., √2·x_1x_d, ..., √2·x_{d−1}x_d).

  (Don't mind the √2's...)
- Computing φ(x)^T φ(x′) in O(d) time:

      φ(x)^T φ(x′) = (1 + x^T x′)^2.

- Much better than d^2 time.
- What if we change the exponent "2"?
- What if we replace the additive "1" with 0?
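A numeric check of the O(d) identity (the expansion below uses my own ordering of the coordinates; the inner product is order-independent):

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Explicit quadratic feature expansion (dimension 1 + 2d + C(d,2))."""
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x**2, cross])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=5), rng.normal(size=5)

# The inner product of the expansions equals the O(d)-time kernel value.
assert np.isclose(phi(x) @ phi(xp), (1 + x @ xp) ** 2)
```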
(Lec11-12; RBF kernel.) Infinite dimensional feature expansion

For any σ > 0, there is an infinite feature expansion φ : R^d → R^∞ such that

    φ(x)^T φ(x′) = exp( −‖x − x′‖_2^2 / (2σ^2) ),

which can be computed in O(d) time.

(This is called the Gaussian kernel with bandwidth σ.)
(Lec11-12.) Kernels

A (positive definite) kernel function K : X × X → R is a symmetric function satisfying:

    For any x_1, x_2, ..., x_n ∈ X, the n × n matrix whose (i, j)-th entry is K(x_i, x_j) is positive semidefinite.

(This matrix is called the Gram matrix.)

For any kernel K, there is a feature mapping φ : X → H such that

    φ(x)^T φ(x′) = K(x, x′).

H is a Hilbert space—i.e., a special kind of inner product space—called the Reproducing Kernel Hilbert Space corresponding to K.
(Lec11-12.) Kernel SVMs (Boser, Guyon, and Vapnik, 1992)

Solve

    max_{α_1, ..., α_n ≥ 0} ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n α_i α_j y_i y_j K(x_i, x_j).

The solution w = ∑_{i=1}^n α_i y_i φ(x_i) is used in the following way:

    x ↦ φ(x)^T w = ∑_{i=1}^n α_i y_i K(x, x_i).

- To represent the classifier, we need to keep the support vector examples (x_i, y_i) and the corresponding α_i's.
- To compute a prediction on x, iterate through the support vector examples and compute K(x, x_i) for each support vector x_i ...

Very similar to the nearest neighbor classifier: the predictor is represented using (a subset of) the training data.
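A sketch of the resulting predictor, assuming the α_i's have already been found (e.g., by a dual solver) and using the Gaussian kernel from earlier:

```python
import numpy as np

def rbf_kernel(x, xp, sigma=1.0):
    """Gaussian kernel with bandwidth sigma."""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def kernel_svm_predict(x, sv_X, sv_y, sv_alpha, kernel=rbf_kernel):
    """Sign of phi(x)^T w = sum_i alpha_i y_i K(x, x_i), over support vectors only."""
    score = sum(a * yi * kernel(x, xi)
                for a, yi, xi in zip(sv_alpha, sv_y, sv_X))
    return np.sign(score)
```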
(Lec11-12.) Nonlinear support vector machines (again)
[Four decision-surface contour plots over [−1, 1]²: ReLU network; Quadratic SVM; RBF SVM (σ = 1); RBF SVM (σ = 0.1).]
57 / 61
(Lec13.) The Perceptron Algorithm
Perceptron update (Rosenblatt '58): initialize w := 0, and thereafter
w ← w + 1[ywᵀx ≤ 0] yx.
Remarks.
I Can interpret the algorithm as: either we are correct with a margin (ywᵀx > 0) and we do nothing, or we are not and we update w ← w + yx.
I Therefore: if we update, we do so by rotating towards yx.
I This makes sense: (w + yx)ᵀ(yx) = wᵀ(yx) + ‖x‖²; i.e., we increase wᵀ(yx).
Scenario 1: the current vector wt is comparable to xt in length (yt = 1, xtᵀwt ≤ 0). After the update, wt+1 correctly classifies (xt, yt).
Scenario 2: the current vector wt is much longer than xt (yt = 1, xtᵀwt ≤ 0). After the update, wt+1 still does not correctly classify (xt, yt).
Not obvious that Perceptron will eventually terminate! (We'll return to this.)
58 / 61
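A minimal sketch of the full mistake-driven loop (my own code, assuming numpy; the toy data are hypothetical). The classical convergence theorem says the loop stops on linearly separable data, which is exactly the termination question raised above:

import numpy as np

def perceptron(X, y, max_passes=100):
    """Perceptron (Rosenblatt '58): start at w = 0; on each mistake
    (y_i * w^T x_i <= 0), rotate toward the example: w <- w + y_i * x_i.
    Stop after a pass with no mistakes, or after max_passes."""
    w = np.zeros(X.shape[1])
    for _ in range(max_passes):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
                mistakes += 1
        if mistakes == 0:
            break
    return w

# Tiny linearly separable example.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.all(y * (X @ w) > 0))  # True: every point has positive margin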
2. Homework review
Nice homework problems
I HW1-P3: SVD practice.
I HW2-P1: SVD practice; definitions of ‖A‖2 and ‖A‖F.
I HW3-P1: convexity practice.
I HW3-P2: deep network and probability practice.
I HW3-P4, HW3-P5: kernel practice.
59 / 61
3. Stuff added 3/12/2019
Belaboring Cauchy-Schwarz
Theorem (Cauchy-Schwarz). |aᵀb| ≤ ‖a‖ · ‖b‖.
Proof (another one. . . ). If either a or b is zero, then aᵀb = 0 = ‖a‖ · ‖b‖.
If both a and b are nonzero, note for any r > 0 that
0 ≤ ‖ra − b/r‖² = r²‖a‖² − 2aᵀb + ‖b‖²/r²,
which can be rearranged into
aᵀb ≤ r²‖a‖²/2 + ‖b‖²/(2r²).
Choosing r = √(‖b‖/‖a‖),
aᵀb ≤ (‖a‖²/2)(‖b‖/‖a‖) + (‖b‖²/2)(‖a‖/‖b‖) = ‖a‖ · ‖b‖.
Lastly, applying the bound to (a, −b) gives
−aᵀb = aᵀ(−b) ≤ ‖a‖ · ‖−b‖ = ‖a‖ · ‖b‖,
so together it follows that |aᵀb| ≤ ‖a‖ · ‖b‖. ∎
60 / 61
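A numeric spot-check of the two steps in the proof (my own sketch, assuming numpy): the bound holds for every r > 0, and the choice r = √(‖b‖/‖a‖) makes it tight.

import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(4), rng.standard_normal(4)

# aᵀb ≤ r²‖a‖²/2 + ‖b‖²/(2r²) for every r > 0 ...
rs = np.linspace(0.1, 3.0, 50)
bounds = rs ** 2 * (a @ a) / 2 + (b @ b) / (2 * rs ** 2)
print(np.all(a @ b <= bounds + 1e-12))

# ... and at r = √(‖b‖/‖a‖) the bound equals ‖a‖·‖b‖.
r = np.sqrt(np.linalg.norm(b) / np.linalg.norm(a))
tightest = r ** 2 * (a @ a) / 2 + (b @ b) / (2 * r ** 2)
print(np.isclose(tightest, np.linalg.norm(a) * np.linalg.norm(b)))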
SVD again
1. SV triples: (s, u, v) satisfies Mv = su and Mᵀu = sv.
2. Thin decomposition SVD: M = ∑_{i=1}^r si ui viᵀ.
3. Full factorization SVD: M = USVᵀ.
4. "Operational" view of SVD: for M ∈ R^{n×d},
M = [u1 ⋯ ur ur+1 ⋯ un] · [ diag(s1, . . . , sr)  0 ; 0  0 ] · [v1 ⋯ vr vr+1 ⋯ vd]ᵀ.
The first parts of U, V span the column / row space (respectively); the second parts span the left / right nullspaces (respectively).
Personally: I internalize the SVD and use it to reason about matrices. E.g., the "rank-nullity theorem".
61 / 61
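All four views can be checked mechanically (a minimal numpy sketch; the random rank-r test matrix is hypothetical):

import numpy as np

rng = np.random.default_rng(0)
n, d, r = 5, 4, 2
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))  # rank-r matrix

# Full factorization: M = U S Vᵀ with U (n×n), V (d×d) orthogonal.
U, s, Vt = np.linalg.svd(M, full_matrices=True)

# Thin decomposition: sum of the top-r rank-one terms s_i u_i v_iᵀ.
print(np.allclose(M, sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(r))))

# SV triple: M v1 = s1 u1 and Mᵀ u1 = s1 v1.
print(np.allclose(M @ Vt[0], s[0] * U[:, 0]))
print(np.allclose(M.T @ U[:, 0], s[0] * Vt[0]))

# Operational view / rank-nullity: v_{r+1}, ..., v_d span the right nullspace.
print(np.allclose(M @ Vt[r:].T, 0))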