Midterm review
CS 446
1. Lecture review
(Lec1.) Basic setting: supervised learning

Training data: labeled examples

    (x_1, y_1), (x_2, y_2), ..., (x_n, y_n)

where
- each input x_i is a machine-readable description of an instance (e.g., image, sentence), and
- each corresponding label y_i is an annotation relevant to the task—typically not easy to obtain automatically.

Goal: learn a function f from labeled examples that accurately "predicts" the labels of new (previously unseen) inputs. (Note: 0 training error is easy; test/population error is what matters.)

[Diagram: past labeled examples → learning algorithm → learned predictor; a new (unlabeled) example goes into the learned predictor, which outputs a predicted label.]
(Lec2.) k-nearest neighbors classifier

Given: labeled examples D := {(x_i, y_i)}_{i=1}^n.

Predictor f_{D,k} : X → Y: on input x,
1. Find the k points x_{i_1}, x_{i_2}, ..., x_{i_k} among {x_i}_{i=1}^n "closest" to x (the k nearest neighbors).
2. Return the plurality label of y_{i_1}, y_{i_2}, ..., y_{i_k}.

(Break ties in both steps arbitrarily.)
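A minimal NumPy sketch of this predictor (my own illustration, not lecture code), taking "closest" to mean Euclidean distance:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Predict the plurality label of the k nearest training points to x."""
    # Euclidean distances from x to every training point.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices i_1, ..., i_k of the k closest points (ties broken arbitrarily).
    nearest = np.argsort(dists)[:k]
    # Plurality label among y_{i_1}, ..., y_{i_k}.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```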
(Lec2.) Choosing k

The hold-out set approach:
1. Pick a subset V ⊂ S (hold-out set, a.k.a. validation set).
2. For each k ∈ {1, 3, 5, ...}:
   - Construct the k-NN classifier f_{S∖V,k} using S ∖ V.
   - Compute the error rate of f_{S∖V,k} on V (the "hold-out error rate").
3. Pick the k that gives the smallest hold-out error rate.

(There are many other approaches.)
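A sketch of the hold-out procedure under the same assumptions (random split, odd k's only; all names and the split fraction are illustrative):

```python
import numpy as np

def knn_predict(X, y, x, k):
    nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    labels, counts = np.unique(y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

def choose_k(X, y, ks=(1, 3, 5, 7, 9), holdout_frac=0.2, seed=0):
    """Pick the k with the smallest hold-out error rate."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(holdout_frac * len(X))
    val, train = idx[:n_val], idx[n_val:]      # V and S \ V
    def holdout_error(k):
        preds = [knn_predict(X[train], y[train], x, k) for x in X[val]]
        return np.mean(np.asarray(preds) != y[val])
    return min(ks, key=holdout_error)
```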
(Lec2.) Decision trees

Directly optimize tree structure for good classification.

A decision tree is a function f : X → Y, represented by a binary tree in which:
- each tree node is associated with a splitting rule g : X → {0, 1};
- each leaf node is associated with a label y ∈ Y.

When X = R^d, typically one only considers splitting rules of the form

    g(x) = 1{x_i > t}

for some i ∈ [d] and t ∈ R, called axis-aligned or coordinate splits.
(Notation: [d] := {1, 2, ..., d}.)

[Example tree: the root tests x_1 > 1.7 with a leaf y = 1 on one side; the other side tests x_2 > 2.8, with leaves y = 2 and y = 3.]
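A hedged sketch of evaluating such a tree by hand, matching the example above; which of y = 2 / y = 3 sits on which branch of the x_2 > 2.8 split is my guess from the figure:

```python
def tree_predict(x):
    """Evaluate the example tree: axis-aligned splits, labels at leaves."""
    if x[0] > 1.7:          # root split: x_1 > 1.7
        if x[1] > 2.8:      # second split: x_2 > 2.8
            return 3        # branch assignment guessed from the figure
        return 2
    return 1                # leaf on the x_1 <= 1.7 side
```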
(Lec2.) Decision tree example

Classifying irises by sepal and petal measurements:
- X = R^2, Y = {1, 2, 3}
- x_1 = ratio of sepal length to width
- x_2 = ratio of petal length to width

[Scatter plot: x_1 (sepal length/width) on the horizontal axis, x_2 (petal length/width) on the vertical axis; successive overlays grow the tree one split at a time, ending with the tree from the previous slide: x_1 > 1.7, then x_2 > 2.8, with leaves y = 1, y = 2, y = 3.]
(Lec2.) Nearest neighbors and decision trees

Today we covered two standard machine learning methods.

[Figure: an example 2-D dataset over coordinates (x_1, x_2).]

Nearest neighbors.
Training/fitting: memorize data.
Testing/predicting: find the k closest memorized points, return the plurality label.
Overfitting? Vary k.

Decision trees.
Training/fitting: greedily partition space, reducing "uncertainty".
Testing/predicting: traverse the tree, output the leaf label.
Overfitting? Limit or prune the tree.

Note: both methods can output real numbers (regression, not classification); return the median/mean of the {neighbors, points reaching the leaf}.
(Lec3-4.) ERM setup for least squares

- Predictors/model: f(x) = w^T x, a linear predictor/regressor.
  (For linear classification: x ↦ sgn(w^T x).)
- Loss/penalty: the least squares loss

      ℓ_ls(ŷ, y) = ℓ_ls(y, ŷ) = (y − ŷ)^2.

  (Some conventions scale this by 1/2.)
- Goal: minimize the least squares empirical risk

      R_ls(f) = (1/n) ∑_{i=1}^n ℓ_ls(y_i, f(x_i)) = (1/n) ∑_{i=1}^n (y_i − f(x_i))^2.

- Specifically, we choose w ∈ R^d according to

      argmin_{w ∈ R^d} R_ls(x ↦ w^T x) = argmin_{w ∈ R^d} (1/n) ∑_{i=1}^n (y_i − w^T x_i)^2.

- More generally, this is the ERM approach: pick a model and minimize empirical risk over the model parameters.
(Lec3-4.) ERM in general

- Pick a family of models/predictors F. (For today, we use linear predictors.)
- Pick a loss function ℓ. (For today, we chose squared loss.)
- Minimize the empirical risk over the model parameters.

We haven't discussed: true risk and overfitting; how to minimize; why this is a good idea.

Remark: ERM is convenient in PyTorch: just pick a model, a loss, and an optimizer, and tell it to minimize.
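To make the remark concrete, a minimal PyTorch sketch of the recipe for least squares (synthetic data; hyperparameters are arbitrary):

```python
import torch

n, d = 100, 5
X, y = torch.randn(n, d), torch.randn(n, 1)   # stand-in data

model = torch.nn.Linear(d, 1, bias=False)     # the model: x -> w^T x
loss_fn = torch.nn.MSELoss()                  # the loss: least squares
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(500):                          # tell it to minimize
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()
```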
(Lec3-4.) Empirical risk minimization in matrix notation

Define the n × d matrix A and the n × 1 column vector b by

    A := (1/√n) [x_1^T; ...; x_n^T],    b := (1/√n) [y_1; ...; y_n].

Can write the empirical risk as

    R(w) = (1/n) ∑_{i=1}^n (y_i − x_i^T w)^2 = ‖Aw − b‖_2^2.

Necessary condition for w to be a minimizer of R:

    ∇R(w) = 0, i.e., w is a critical point of R.

This translates to (A^T A) w = A^T b, a system of linear equations called the normal equations.

In an upcoming lecture we'll prove every critical point of R is a minimizer of R.
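A small NumPy check of this slide (a sketch on synthetic data, not lecture code): solve the normal equations directly and compare against a least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
Xs, ys = rng.normal(size=(n, d)), rng.normal(size=n)

A = Xs / np.sqrt(n)                      # rows x_i^T / sqrt(n)
b = ys / np.sqrt(n)

# Solve the normal equations (A^T A) w = A^T b directly...
w_normal = np.linalg.solve(A.T @ A, A.T @ b)
# ...and via a least-squares solver for ||Aw - b||_2^2.
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

assert np.allclose(w_normal, w_lstsq)
```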
(Lec3-4.) Full (factorization) SVD (new slide)

Given M ∈ R^{n×d}, let M = U S V^T denote the singular value decomposition (SVD), where
- U ∈ R^{n×n} is orthonormal, thus U^T U = U U^T = I,
- V ∈ R^{d×d} is orthonormal, thus V^T V = V V^T = I,
- S ∈ R^{n×d} has singular values s_1 ≥ s_2 ≥ ... ≥ s_{min{n,d}} along the diagonal and zeros elsewhere, where the number of positive singular values equals the rank of M.

Some facts:
- The SVD is not unique when the singular values are not distinct; e.g., we can write I = U I U^T where U is any orthonormal matrix.
- The pseudoinverse S^+ ∈ R^{d×n} of S is obtained by starting with S^T and taking the reciprocal of each positive entry.
- The pseudoinverse of M is M^+ = V S^+ U^T.
- If M^{-1} exists, then M^{-1} = M^+.
(Lec3-4.) Thin (decomposition) SVD (new slide)

Given M ∈ R^{n×d}, (s, u, v) form a singular value with corresponding left and right singular vectors if Mv = su and M^T u = sv.

The thin SVD of M is M = ∑_{i=1}^r s_i u_i v_i^T, where r is the rank of M, and
- the left singular vectors (u_1, ..., u_r) are orthonormal (but we might have r < min{n, d}!) and span the column space of M,
- the right singular vectors (v_1, ..., v_r) are orthonormal (but we might have r < min{n, d}!) and span the row space of M,
- the singular values satisfy s_1 ≥ ... ≥ s_r > 0.

Some facts:
- Pseudoinverse M^+ = ∑_{i=1}^r (1/s_i) v_i u_i^T.
- (u_i)_{i=1}^r span the range of M.
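A quick NumPy illustration of the pseudoinverse formula (a sketch with synthetic data; note np.linalg.svd returns V^T rather than V):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))

# Thin SVD: here U is 6x4, s holds the singular values, Vt is 4x4.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# M^+ = sum_i (1/s_i) v_i u_i^T, keeping only positive singular values.
r = np.sum(s > 1e-12)                    # rank of M
M_pinv = sum((1.0 / s[i]) * np.outer(Vt[i], U[:, i]) for i in range(r))

assert np.allclose(M_pinv, np.linalg.pinv(M))
```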
(Lec3-4.) SVD and least squares

Recall: we'd like to find w such that A^T A w = A^T b.

If w = A^+ b, then

    A^T A w = (∑_{i=1}^r s_i v_i u_i^T)(∑_{i=1}^r s_i u_i v_i^T)(∑_{i=1}^r (1/s_i) v_i u_i^T) b
            = (∑_{i=1}^r s_i v_i u_i^T)(∑_{i=1}^r u_i u_i^T) b = A^T b.

Henceforth, define w_ols = A^+ b as the OLS solution. (OLS = "ordinary least squares".)

Note: in general, A A^+ = ∑_{i=1}^r u_i u_i^T ≠ I.
(Lec3-4.) Normal equations imply optimality

Consider w with A^T A w = A^T y, and any w′; then

    ‖Aw′ − y‖^2 = ‖Aw′ − Aw + Aw − y‖^2
                = ‖Aw′ − Aw‖^2 + 2(Aw′ − Aw)^T(Aw − y) + ‖Aw − y‖^2.

Since

    (Aw′ − Aw)^T(Aw − y) = (w′ − w)^T(A^T A w − A^T y) = 0,

then ‖Aw′ − y‖^2 = ‖Aw′ − Aw‖^2 + ‖Aw − y‖^2 ≥ ‖Aw − y‖^2. This means w is optimal.

Moreover, writing A = ∑_{i=1}^r s_i u_i v_i^T,

    ‖Aw′ − Aw‖^2 = (w′ − w)^T (A^T A)(w′ − w) = (w′ − w)^T (∑_{i=1}^r s_i^2 v_i v_i^T)(w′ − w),

so w′ is optimal iff w′ − w is in the right nullspace of A.

(We'll revisit all this with convexity later.)
(Lec3-4.) Regularized ERM

Combine the two concerns: for a given λ ≥ 0, find a minimizer of

    R(w) + λ‖w‖_2^2

over w ∈ R^d.

Fact: if λ > 0, then the solution is always unique (even if n < d)!
- This is called ridge regression. (λ = 0 is ERM / ordinary least squares.)
  Explicit solution: (A^T A + λI)^{-1} A^T b.
- The parameter λ controls how much attention is paid to the regularizer ‖w‖_2^2 relative to the data-fitting term R(w).
- Choose λ using cross-validation.

Note: in deep networks, this regularization is called "weight decay". (Why?)
Note: another popular regularizer for linear regression is ℓ_1.
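A sketch of the explicit ridge solution (synthetic data with n < d to illustrate that the solution is still unique when λ > 0):

```python
import numpy as np

def ridge(A, b, lam):
    """Minimize ||Aw - b||_2^2 + lam * ||w||_2^2 in closed form."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 10))             # n < d: A^T A is singular...
b = rng.normal(size=5)
w = ridge(A, b, lam=0.1)                 # ...but the ridge solution is unique.
```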
(Lec5-6.) Geometry of linear classifiers

[Figure: a hyperplane H in coordinates (x_1, x_2), with normal vector w.]

A hyperplane in R^d is a linear subspace of dimension d − 1.
- A hyperplane in R^2 is a line.
- A hyperplane in R^3 is a plane.
- As a linear subspace, a hyperplane always contains the origin.

A hyperplane H can be specified by a (non-zero) normal vector w ∈ R^d. The hyperplane with normal vector w is the set of points orthogonal to w:

    H = {x ∈ R^d : x^T w = 0}.

Given w and its corresponding H: H splits the points labeled positive, {x : w^T x > 0}, from those labeled negative, {x : w^T x < 0}.
(Lec5-6.) Classification with a hyperplane

[Figure: H with normal w; a point x projected onto span{w} at angle θ, with coordinate ‖x‖_2 · cos θ.]

The projection of x onto span{w} (a line) has coordinate

    ‖x‖_2 · cos(θ),  where  cos θ = x^T w / (‖w‖_2 ‖x‖_2).

(The distance to the hyperplane is ‖x‖_2 · |cos(θ)|.)

The decision boundary is the hyperplane (oriented by w):

    x^T w > 0 ⟺ ‖x‖_2 · cos(θ) > 0 ⟺ x is on the same side of H as w.

What should we do if we want a hyperplane decision boundary that doesn't (necessarily) go through the origin?
(Lec5-6.) Linear separability

Is it always possible to find w with sign(w^T x_i) = y_i? Is it always possible to find a hyperplane separating the data? (Appending a 1 to x means it need not go through the origin.)

[Figure: two 2-D datasets; the left one is linearly separable, the right one is not.]
(Lec5-6.) Cauchy-Schwarz (new slide)

Cauchy-Schwarz inequality. |a^T b| ≤ ‖a‖ · ‖b‖.

Proof. If ‖a‖ = ‖b‖,

    0 ≤ ‖a − b‖^2 = ‖a‖^2 − 2a^T b + ‖b‖^2 = 2‖a‖ · ‖b‖ − 2a^T b,

which rearranges to give a^T b ≤ ‖a‖ · ‖b‖.

For the case ‖a‖ < ‖b‖, apply the preceding to the equal-norm pair (a, ‖a‖b/‖b‖) and scale: a^T b = (‖b‖/‖a‖) · (a^T(‖a‖b/‖b‖)) ≤ (‖b‖/‖a‖) · ‖a‖^2 = ‖a‖ · ‖b‖.

For the absolute value, apply the preceding to (a, −b). □
(Lec5-6.) Logistic loss

Let's state our classification goal with a generic margin loss ℓ:

    R(w) = (1/n) ∑_{i=1}^n ℓ(y_i w^T x_i);

the key properties we want:
- ℓ is continuous;
- ℓ(z) ≥ c·1[z ≤ 0] = c·ℓ_zo(z) for some c > 0 and any z ∈ R, which implies R_ℓ(w) ≥ c·R_zo(w);
- ℓ′(0) < 0 (pushes stuff from wrong to right).

Examples.
- Squared loss, written in margin form: ℓ_ls(z) := (1 − z)^2; note ℓ_ls(yŷ) = (1 − yŷ)^2 = y^2(y − ŷ)^2 = (y − ŷ)^2, using y^2 = 1.
- Logistic loss: ℓ_log(z) = ln(1 + exp(−z)).
(Lec5-6.) Squared and logistic losses on linearly separable data I

[Figure: contour plots of the learned predictor on a linearly separable 2-D dataset; left panel: logistic loss, right panel: squared loss.]
(Lec5-6.) Squared and logistic losses on linearly separable data II

[Figure: contour plots of the learned predictor on a second linearly separable 2-D dataset; left panel: logistic loss, right panel: squared loss.]
(Lec5-6.) Logistic risk and separation

If there exists a perfect linear separator, empirical logistic risk minimization should find it.

Theorem. If there exists w with y_i w^T x_i > 0 for all i, then every w with R_log(w) < ln(2)/(2n) + inf_v R_log(v) also satisfies y_i w^T x_i > 0.

Proof. Omitted.
(Lec5-6.) Least squares and logistic ERM

Least squares:
- Take the gradient of ‖Aw − b‖^2, set it to 0; obtain the normal equations A^T A w = A^T b.
- One choice is the minimum norm solution A^+ b.

Logistic loss:
- Take the gradient of R_log(w) = (1/n) ∑_{i=1}^n ln(1 + exp(−y_i w^T x_i)) and set it to 0 ???

Remark. Is A^+ b a "closed form expression"?
(Lec5-6.) Gradient descent

Given a function F : R^d → R, gradient descent is the iteration

    w_{i+1} := w_i − η_i ∇_w F(w_i),

where w_0 is given, and η_i is a learning rate / step size.

[Figure: gradient descent iterates on the contours of a convex quadratic.]

Does this work for least squares? Later we'll show it works for least squares and logistic regression due to convexity.
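A sketch of gradient descent on the least squares risk F(w) = ‖Aw − b‖_2^2 (synthetic data; the constant step size is an arbitrary choice that happens to be small enough here):

```python
import numpy as np

def gradient_descent_ls(A, b, eta=0.1, steps=2000):
    """Gradient descent on F(w) = ||Aw - b||_2^2, with grad F(w) = 2 A^T (Aw - b)."""
    w = np.zeros(A.shape[1])
    for _ in range(steps):
        w -= eta * 2 * (A.T @ (A @ w - b))
    return w

rng = np.random.default_rng(0)
n, d = 50, 3
A = rng.normal(size=(n, d)) / np.sqrt(n)   # as on the matrix-notation slide
b = rng.normal(size=n) / np.sqrt(n)
w = gradient_descent_ls(A, b)
# The iterates approach a solution of the normal equations.
assert np.allclose(A.T @ A @ w, A.T @ b, atol=1e-6)
```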
(Lec5-6.) Multiclass?

All our methods so far handle multiclass:
- k-NN and decision tree: plurality label.
- Least squares: argmin_{W ∈ R^{d×k}} ‖AW − B‖_F^2 with B ∈ R^{n×k}; W ∈ R^{d×k} is k separate linear regressors in R^d.

How about linear classifiers?
- At prediction time, x ↦ argmax_y f(x)_y.
- As in the binary case: interpretation f(x)_y = Pr[Y = y | X = x].

What is a good loss function?
(Lec5-6.) Cross-entropy

Given two probability vectors p, q ∈ Δ_k = {p ∈ R^k_{≥0} : ∑_i p_i = 1},

    H(p, q) = −∑_{i=1}^k p_i ln q_i    (cross-entropy).

- If p = q, then H(p, q) = H(p) (entropy); indeed

      H(p, q) = −∑_{i=1}^k p_i ln(p_i · (q_i/p_i)) = H(p) [entropy] + KL(p, q) [KL divergence].

  Since KL ≥ 0, and moreover KL = 0 iff p = q, this is the cost/entropy of p plus a penalty for differing.
- Choose the encoding y = e_y for y ∈ {1, ..., k}, and ŷ ∝ exp(f(x)) with f : R^d → R^k;

      ℓ_ce(y, f(x)) = H(e_y, ŷ) = −∑_{i=1}^k (e_y)_i ln( exp(f(x)_i) / ∑_{j=1}^k exp(f(x)_j) )
                    = −ln( exp(f(x)_y) / ∑_{j=1}^k exp(f(x)_j) )
                    = −f(x)_y + ln ∑_{j=1}^k exp(f(x)_j).

(In PyTorch, use torch.nn.CrossEntropyLoss()(f(x), y).)
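A small check that the displayed formula matches PyTorch's loss (which averages over the batch by default); the batch size and class count below are arbitrary:

```python
import torch

f_x = torch.randn(4, 3)                  # logits f(x) for 4 examples, k = 3
y = torch.tensor([0, 2, 1, 2])           # class indices

# Manual: -f(x)_y + ln sum_j exp(f(x)_j), averaged over the batch.
manual = (-f_x[torch.arange(4), y] + torch.logsumexp(f_x, dim=1)).mean()

assert torch.allclose(manual, torch.nn.CrossEntropyLoss()(f_x, y))
```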
(Lec5-6.) Cross-entropy, classification, and margins

The zero-one loss for classification is

    ℓ_zo(y, f(x)) = 1[ y ≠ argmax_j f(x)_j ].

In the multiclass case, can define the margin as

    f(x)_y − max_{j≠y} f(x)_j,

interpreted as "the distance by which f is correct". (Can be negative!)

Since ln ∑_j exp(z_j) ≈ max_j z_j, cross-entropy satisfies

    ℓ_ce(y, f(x)) = −f(x)_y + ln ∑_j exp(f(x)_j) ≈ −f(x)_y + max_j f(x)_j,

thus minimizing cross-entropy maximizes margins.
(Lec7-8.) The ERM perspective

These lectures will follow an ERM perspective on deep networks:
- Pick a model/predictor class (network architecture). (We will spend most of our time on this!)
- Pick a loss/risk. (We will almost always use cross-entropy!)
- Pick an optimizer. (We will mostly treat this as a black box!)

The goal is low test error, whereas the above only gives low training error; we will briefly discuss this as well.
(Lec7-8.) Basic deep networks

A self-contained expression is

    x ↦ σ_L(W_L σ_{L−1}(⋯ (W_2 σ_1(W_1 x + b_1) + b_2) ⋯) + b_L),

with equivalent "functional form"

    x ↦ (f_L ∘ ⋯ ∘ f_1)(x),  where f_i(z) = σ_i(W_i z + b_i).

Some further details (many more to come!):
- (W_i)_{i=1}^L with W_i ∈ R^{d_i × d_{i−1}} are the weights, and (b_i)_{i=1}^L are the biases.
- (σ_i)_{i=1}^L with σ_i : R^{d_i} → R^{d_i} are called nonlinearities, or activations, or transfer functions, or link functions.
- This is only the basic setup; many things can and will change, please ask many questions!
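A minimal sketch of the basic form for L = 2 (ReLU inside, identity last layer); the dimensions and names are illustrative, not from the lecture:

```python
import torch

d, d1, d2 = 4, 8, 3                      # input, hidden, and output dimensions
W1, b1 = torch.randn(d1, d), torch.randn(d1)
W2, b2 = torch.randn(d2, d1), torch.randn(d2)

def forward(x):
    """x -> sigma_2(W_2 sigma_1(W_1 x + b_1) + b_2), with sigma_1 = ReLU."""
    z1 = torch.relu(W1 @ x + b1)         # f_1(x) = sigma_1(W_1 x + b_1)
    return W2 @ z1 + b2                  # f_2(z) with sigma_2 = identity

x = torch.randn(d)
print(forward(x))                        # a vector in R^{d_2}
```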
(Lec7-8.) Choices of activation

Basic form:

    x ↦ σ_L(W_L σ_{L−1}(⋯ W_2 σ_1(W_1 x + b_1) + b_2 ⋯) + b_L).

Choices of activation (univariate, coordinate-wise):
- Indicator/step/Heaviside/threshold z ↦ 1[z ≥ 0]. This was the original choice (1940s!).
- Sigmoid σ_s(z) := 1/(1 + exp(−z)). This was popular roughly 1970s–2005?
- Hyperbolic tangent z ↦ tanh(z). Similar to sigmoid, used during the same interval.
- Rectified Linear Unit (ReLU) σ_r(z) = max{0, z}. It (and slight variants, e.g., Leaky ReLU, ELU, ...) are the dominant choice now; popularized in the "ImageNet/AlexNet" paper (Krizhevsky-Sutskever-Hinton, 2012).
- Identity z ↦ z; we'll often use this as the last layer when we use cross-entropy loss.
- NON-coordinate-wise choices: we will discuss "softmax" and "pooling" a bit later.
(Lec7-8.) "Architectures" and "models"

Basic form:

    x ↦ σ_L(W_L σ_{L−1}(⋯ W_2 σ_1(W_1 x + b_1) + b_2 ⋯) + b_L).

((W_i, b_i))_{i=1}^L, the weights and biases, are the parameters. Let's roll them into W := ((W_i, b_i))_{i=1}^L, and consider the network as a function of two arguments, F_W(x) = F(x; W).

- The model or class of functions is {F_W : all possible W}. F (both arguments unset) is also called an architecture.
- When we fit/train/optimize, typically we leave the architecture fixed and vary W to minimize risk.

(More on this in a moment.)
(Lec7-8.) ERM recipe for basic networks

Standard ERM recipe:
- First we pick a class of functions/predictors; for deep networks, that means an F(·, ·).
- Then we pick a loss function and write down an empirical risk minimization problem; in these lectures we will pick cross-entropy:

      argmin_W (1/n) ∑_{i=1}^n ℓ_ce(y_i, F(x_i; W))
        = argmin_{W_1 ∈ R^{d_1×d}, b_1 ∈ R^{d_1}, ..., W_L ∈ R^{d_L×d_{L−1}}, b_L ∈ R^{d_L}} (1/n) ∑_{i=1}^n ℓ_ce(y_i, F(x_i; ((W_i, b_i))_{i=1}^L))
        = argmin over the same parameters of (1/n) ∑_{i=1}^n ℓ_ce(y_i, σ_L(⋯ σ_1(W_1 x_i + b_1) ⋯)).

- Then we pick an optimizer. In this class, we only use gradient descent variants. It is a miracle that this works. (A minimal training-loop sketch follows below.)
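A minimal PyTorch sketch of the whole recipe (architecture, cross-entropy, a gradient-descent variant) on synthetic data; all sizes and hyperparameters are arbitrary:

```python
import torch

n, d, k = 256, 10, 3
X = torch.randn(n, d)
y = torch.randint(k, (n,))

# Architecture F: two layers, ReLU inside, identity last layer
# (the softmax is baked into the cross-entropy loss).
model = torch.nn.Sequential(
    torch.nn.Linear(d, 32), torch.nn.ReLU(), torch.nn.Linear(32, k))
loss_fn = torch.nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(20):
    for i in range(0, n, 32):            # minibatches of size 32
        xb, yb = X[i:i+32], y[i:i+32]
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()   # risk on the minibatch
        opt.step()
```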
(Lec7-8.) Sometimes, linear just isn't enough

[Figure: contour plots of two predictors on the same 2-D dataset.]

Linear predictor: x ↦ w^T [x; 1]. Some blue points are misclassified.

ReLU network: x ↦ W_2 σ_r(W_1 x + b_1) + b_2. 0 misclassifications!
(Lec7-8.) Classical example: XOR

Classical "XOR problem" (Minsky-Papert '69). (Check Wikipedia for "AI Winter".)

Theorem. On this data, any linear classifier (with affine expansion) makes at least one mistake.

Picture proof. Recall: linear classifiers correspond to separating hyperplanes.
- If the hyperplane splits the blue points, it's incorrect on one of them.
- If it doesn't split the blue points, then one halfspace contains the common midpoint, and is therefore wrong on at least one red point.
(Lec7-8.) One layer was not enough. How about two?

Theorem (Cybenko '89, Hornik-Stinchcombe-White '89, Funahashi '89, Leshno et al. '92, ...). Given any continuous function f : R^d → R and any ε > 0, there exist parameters (W_1, b_1, W_2) so that

    sup_{x ∈ [0,1]^d} | f(x) − W_2 σ(W_1 x + b_1) | ≤ ε,

as long as σ is "reasonable" (e.g., ReLU or sigmoid or threshold).

Remarks.
- Together with the XOR example, this justifies using nonlinearities.
- Does not justify (very) deep networks.
- Only says these networks exist, not that we can optimize for them!
(Lec7-8.) General graph-based view

Classical graph-based perspective.
- The network is a directed acyclic graph; sources are inputs, sinks are outputs, and intermediate nodes compute z ↦ σ(w^T z + b) (each with its own (σ, w, b)).
- Nodes at distance 1 from the inputs are the first layer, distance 2 is the second layer, and so on.

"Modern" graph-based perspective.
- Edges in the graph can be multivariate, meaning vectors or general tensors, and not just scalars.
- Edges will often "skip" layers; "layer" is therefore ambiguous.
- Diagram conventions differ; e.g., tensorflow graphs include nodes for parameters.
(Lec7-8.) 2-D convolution in deep networks (pictures)

[Animated convolution figures omitted; taken from https://github.com/vdumoulin/conv_arithmetic by Vincent Dumoulin, Francesco Visin.]
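For reference, a shape-level sketch of a 2-D convolution layer in PyTorch (the numbers are illustrative):

```python
import torch

# One 3-channel 32x32 image (batch size 1), convolved with 16 filters of
# size 3x3; padding=1 keeps the spatial dimensions unchanged.
conv = torch.nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(1, 3, 32, 32)
print(conv(x).shape)                     # torch.Size([1, 16, 32, 32])
```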
(Lec7-8.) Softmax

Replace the vector input z with z′ ∝ e^z, meaning

    z ↦ ( e^{z_1}/∑_j e^{z_j}, ..., e^{z_k}/∑_j e^{z_j} ).

- Converts the input into a probability vector; useful for interpreting the network output as Pr[Y = y | X = x].
- We have baked it into our cross-entropy definition; the last lectures' networks with cross-entropy training had an implicit softmax.
- If some coordinate j of z dominates the others, then the softmax output is close to e_j.
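A two-line illustration (my own, not from the lecture):

```python
import torch

z = torch.tensor([1.0, 2.0, 3.0])
print(torch.softmax(z, dim=0))           # a probability vector summing to 1

# When one coordinate dominates, softmax is close to a standard basis vector.
print(torch.softmax(torch.tensor([1.0, 2.0, 30.0]), dim=0))   # approximately e_3
```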
(Lec7-8.) Max pooling

Example: 3 × 3 max pooling with stride 1 applied to a 5 × 5 input:

    input:            output:
    2 2 3 0 3
    0 0 1 0 3         3 3 3
    0 0 2 1 2         2 3 3
    0 2 2 3 1         3 3 3
    1 2 3 1 0

(Taken from https://github.com/vdumoulin/conv_arithmetic by Vincent Dumoulin, Francesco Visin.)

- Often used together with convolution layers; shrinks/downsamples the input.
- Another variant is average pooling.
- Implementation: torch.nn.MaxPool2d.
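A check of the example above with torch.nn.MaxPool2d, assuming the 5 × 5 grid is read row by row (the pooled values then match the slide):

```python
import torch

x = torch.tensor([[2., 2., 3., 0., 3.],
                  [0., 0., 1., 0., 3.],
                  [0., 0., 2., 1., 2.],
                  [0., 2., 2., 3., 1.],
                  [1., 2., 3., 1., 0.]]).reshape(1, 1, 5, 5)   # (b, c, h, w)

pool = torch.nn.MaxPool2d(kernel_size=3, stride=1)
print(pool(x).squeeze())
# tensor([[3., 3., 3.],
#         [2., 3., 3.],
#         [3., 3., 3.]])
```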
(Lec9-10.) Multivariate network single-example gradients

Define G_j(W) = σ_j(W_j ⋯ σ_1(W_1 x + b_1) ⋯).

The multivariate chain rule tells us

    ∇_W F(Wx) = J^T x^T,

where J ∈ R^{l×k} is the Jacobian matrix of F : R^k → R^l at Wx, the matrix of all coordinate-wise derivatives. Then

    dG_L/dW_L = J_L^T G_{L−1}(W)^T,
    dG_L/db_L = J_L^T,
    ⋮
    dG_L/dW_j = (J_L W_L J_{L−1} W_{L−1} ⋯ J_j)^T G_{j−1}(W)^T,
    dG_L/db_j = (J_L W_L J_{L−1} W_{L−1} ⋯ J_j)^T,

with J_j the Jacobian of σ_j at W_j G_{j−1}(W) + b_j. For example, when σ_j applies some σ : R → R coordinate-wise,

    J_j = diag( σ′([W_j G_{j−1}(W) + b_j]_1), ..., σ′([W_j G_{j−1}(W) + b_j]_{d_j}) ).
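A sketch verifying the coordinate-wise Jacobian formula against autograd, for a tiny two-layer network with scalar output (shapes and names are my own):

```python
import torch

d, h = 4, 5
x = torch.randn(d)
W1 = torch.randn(h, d, requires_grad=True)
b1 = torch.randn(h, requires_grad=True)
w2 = torch.randn(h)

z = W1 @ x + b1                          # pre-activation
out = w2 @ torch.relu(z)                 # scalar output w_2^T sigma_r(W_1 x + b_1)
out.backward()

# Chain rule by hand: J_1 = diag(sigma'(z)), so
# d out / d W_1 = (w_2 * sigma'(z)) x^T  and  d out / d b_1 = w_2 * sigma'(z).
sigma_prime = (z > 0).float()            # ReLU derivative (almost everywhere)
assert torch.allclose(W1.grad, torch.outer(w2 * sigma_prime, x))
assert torch.allclose(b1.grad, w2 * sigma_prime)
```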
(Lec9-10.) Initialization

Recall

    dG_L/dW_j = (J_L W_L J_{L−1} W_{L−1} ⋯ J_j)^T G_{j−1}(W)^T.

- What if we set W = 0? What if σ = σ_r is a ReLU?
- What if we set two rows of W_j (two nodes) identically?
- Resolving this issue is called symmetry breaking.
- Standard linear/dense layer initializations:

      N(0, 2/d_in)              "He et al.",
      N(0, 2/(d_in + d_out))    Glorot/Xavier,
      U(−1/√d_in, 1/√d_in)      torch default.

(Convolution layers are adjusted to have similar distributions.)
Random initialization is emerging as a key story in deep networks!
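A sketch of these initializations via torch.nn.init (each call below overwrites the previous one; the gain conventions match the stated variances):

```python
import torch

layer = torch.nn.Linear(256, 128)

# He init: N(0, 2/d_in), intended for ReLU networks.
torch.nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')

# Glorot/Xavier init: N(0, 2/(d_in + d_out)).
torch.nn.init.xavier_normal_(layer.weight)

# The torch default for Linear, U(-1/sqrt(d_in), 1/sqrt(d_in)), is applied
# automatically at construction time; an all-zeros weight would break no
# symmetry and is a bad choice.
```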
(Lec9-10; SGD slide.) Minibatches

We used the linearity of gradients to write

    ∇_w R(w) = (1/n) ∑_{i=1}^n ∇_w ℓ(F(x_i; w), y_i).

What happens if we replace ((x_i, y_i))_{i=1}^n with a minibatch ((x′_i, y′_i))_{i=1}^b?

- Random minibatch ⟹ the two gradients are equal in expectation.
- Most torch layers take minibatch input:
  - torch.nn.Linear has input shape (b, d), output (b, d′).
  - torch.nn.Conv2d has input shape (b, c, h, w), output (b, c′, h′, w′).
- This is used heavily outside deep learning as well. It is an easy way to use parallel floating point operations (as on GPUs and CPUs).
- Setting the batch size is black magic and depends on many things (prediction problem, GPU characteristics, ...).
(Lec9-10.) Convex sets

A set S is convex if, for every pair of points {x, x′} in S, the line segment between x and x′ is also contained in S. ({x, x′} ⊆ S ⟹ [x, x′] ⊆ S.)

[Figure: four example sets, labeled convex, not convex, convex, convex.]

Examples:
- All of R^d.
- The empty set.
- Half-spaces: {x ∈ R^d : a^T x ≤ b}.
- Intersections of convex sets.
- Polyhedra: {x ∈ R^d : Ax ≤ b} = ⋂_{i=1}^m {x ∈ R^d : a_i^T x ≤ b_i}.
- Convex hulls: conv(S) := {∑_{i=1}^k α_i x_i : k ∈ N, x_i ∈ S, α_i ≥ 0, ∑_{i=1}^k α_i = 1}.
  (Infinite convex hulls: intersection of all convex supersets.)
(Lec9-10.) Convex functions from convex sets

The epigraph of a function f is the area above the curve:

    epi(f) := {(x, y) ∈ R^{d+1} : y ≥ f(x)}.

A function is convex if its epigraph is convex.

[Figure: a non-convex f and a convex f, with their epigraphs shaded.]
(Lec9-10.) Convex functions (standard definition)

A function f : R^d → R is convex if for any x, x′ ∈ R^d and α ∈ [0, 1],

    f((1 − α)x + αx′) ≤ (1 − α)·f(x) + α·f(x′).

[Figure: a chord crossing the graph for a non-convex f, and chords lying above the graph for a convex f.]

Examples:
- f(x) = cx for any c > 0 (on R).
- f(x) = |x|^c for any c ≥ 1 (on R).
- f(x) = b^T x for any b ∈ R^d.
- f(x) = ‖x‖ for any norm ‖·‖.
- f(x) = x^T A x for symmetric positive semidefinite A.
- f(x) = ln(∑_{i=1}^d exp(x_i)), which approximates max_i x_i.
(Lec9-10.) Convexity of differentiable functions

Differentiable functions. If f : R^d → R is differentiable, then f is convex if and only if

    f(x) ≥ f(x_0) + ∇f(x_0)^T(x − x_0)

for all x, x_0 ∈ R^d.
Note: this implies increasing slopes: (∇f(x) − ∇f(y))^T(x − y) ≥ 0.

[Figure: the tangent a(x) = f(x_0) + f′(x_0)(x − x_0) lies below the graph of f.]

Twice-differentiable functions. If f : R^d → R is twice-differentiable, then f is convex if and only if

    ∇²f(x) ⪰ 0

for all x ∈ R^d (i.e., the Hessian, or matrix of second derivatives, is positive semidefinite for all x).
(Lec9-10.) Convex optimization problems

Standard form of a convex optimization problem:

    min_{x ∈ R^d} f_0(x)
    s.t. f_i(x) ≤ 0 for all i = 1, ..., n,

where f_0, f_1, ..., f_n : R^d → R are convex functions.

Fact: the feasible set

    A := {x ∈ R^d : f_i(x) ≤ 0 for all i = 1, 2, ..., n}

is a convex set.

(SVMs next week will give us an example.)
(Lec9-10.) Subgradients: Jensen's inequality

If f : R^d → R is convex, then E f(X) ≥ f(E X).

Proof. Set y := E X, and pick any s ∈ ∂f(E X). Then

    E f(X) ≥ E( f(y) + s^T(X − y) ) = f(y) + s^T E(X − y) = f(y).

Note. This inequality comes up often!
(Lec11-12.) Maximum margin linear classifier

The solution w to the following mathematical optimization problem:

    min_{w ∈ R^d} (1/2)‖w‖_2^2
    s.t. y x^T w ≥ 1 for all (x, y) ∈ S

gives the linear classifier with the largest minimum margin on S—i.e., the maximum margin linear classifier or support vector machine (SVM) classifier.

This is a convex optimization problem; it can be solved in polynomial time.

If there is a solution (i.e., S is linearly separable), then the solution is unique.

We can solve this in a variety of ways (e.g., projected gradient descent); we will work with the dual.

Note: Can also explicitly include affine expansion, so the decision boundary need not pass through the origin. We'll do our derivations without it.
(Lec11-12.) SVM Duality summary

Lagrangian:

    L(w, α) = (1/2)‖w‖_2^2 + ∑_{i=1}^n α_i(1 − y_i x_i^T w).

The primal maximum margin problem was

    P(w) = max_{α ≥ 0} L(w, α) = max_{α ≥ 0} [ (1/2)‖w‖_2^2 + ∑_{i=1}^n α_i(1 − y_i x_i^T w) ].

Dual problem:

    D(α) = min_{w ∈ R^d} L(w, α) = L(∑_{i=1}^n α_i y_i x_i, α)
         = ∑_{i=1}^n α_i − (1/2)‖∑_{i=1}^n α_i y_i x_i‖_2^2
         = ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n α_i α_j y_i y_j x_i^T x_j.

Given a dual optimum α:
- the corresponding primal optimum is w = ∑_{i=1}^n α_i y_i x_i;
- strong duality: P(w) = D(α);
- α_i > 0 implies y_i x_i^T w = 1, and these y_i x_i are the support vectors.
(Lec11-12.) Nonseparable case: bottom line

Unconstrained primal:

    min_{w ∈ R^d} (1/2)‖w‖^2 + C ∑_{i=1}^n [1 − y_i x_i^T w]_+.

Dual:

    max_{α ∈ R^n, 0 ≤ α_i ≤ C} ∑_{i=1}^n α_i − (1/2)‖∑_{i=1}^n α_i y_i x_i‖^2.

The dual solution α gives the primal solution w = ∑_{i=1}^n α_i y_i x_i.

Remarks.
- Can take C → ∞ to recover the separable case.
- The dual is a constrained convex quadratic (can be solved with projected gradient descent; see the sketch below).
- Some presentations include a bias in the primal (x_i^T w + b); this introduces a constraint ∑_{i=1}^n y_i α_i = 0 in the dual.
- Some presentations replace 1/2 and C with λ/2 and 1/n, respectively.
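Following the remark above, a hedged sketch of projected gradient ascent on this dual (synthetic data; the fixed step size is an unprincipled choice that is small enough here, and no bias term is included):

```python
import numpy as np

def svm_dual_pga(X, y, C=1.0, eta=0.002, steps=5000):
    """Projected gradient ascent on
    D(alpha) = sum_i alpha_i - 0.5 * ||sum_i alpha_i y_i x_i||^2
    over the box 0 <= alpha_i <= C."""
    G = y[:, None] * X                   # rows y_i x_i^T
    Q = G @ G.T                          # Q_ij = y_i y_j x_i^T x_j
    alpha = np.zeros(len(y))
    for _ in range(steps):
        grad = 1.0 - Q @ alpha           # gradient of D at alpha
        alpha = np.clip(alpha + eta * grad, 0.0, C)   # ascent step, then project
    return alpha, G.T @ alpha            # dual variables; primal w = sum_i alpha_i y_i x_i

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1, 1, (20, 2)), rng.normal(-1, 1, (20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
alpha, w = svm_dual_pga(X, y)
```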
(Lec11-12.) Looking at the dual again

The SVM dual problem only depends on the x_i through inner products x_i^T x_j:

    max_{α_1, ..., α_n ≥ 0} ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n α_i α_j y_i y_j x_i^T x_j.

If we use feature expansion (e.g., quadratic expansion) x ↦ φ(x), this becomes

    max_{α_1, ..., α_n ≥ 0} ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n α_i α_j y_i y_j φ(x_i)^T φ(x_j).

The solution w = ∑_{i=1}^n α_i y_i φ(x_i) is used in the following way:

    x ↦ φ(x)^T w = ∑_{i=1}^n α_i y_i φ(x)^T φ(x_i).

Key insight:
- Training and prediction only use φ(x)^T φ(x′), never an isolated φ(x);
- Sometimes computing φ(x)^T φ(x′) is much easier than computing φ(x).
(Lec11-12.) Quadratic expansion

- φ : R^d → R^{1+2d+C(d,2)}, where

      φ(x) = (1, √2·x_1, ..., √2·x_d, x_1^2, ..., x_d^2, √2·x_1x_2, ..., √2·x_1x_d, ..., √2·x_{d−1}x_d).

  (Don't mind the √2's...)
- Computing φ(x)^T φ(x′) in O(d) time:

      φ(x)^T φ(x′) = (1 + x^T x′)^2.

- Much better than d^2 time.
- What if we change the exponent "2"?
- What if we replace the additive "1" with 0?
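A numeric check of the O(d) identity (the expansion below uses my own ordering of the coordinates; the inner product is order-independent):

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Explicit quadratic feature expansion (dimension 1 + 2d + C(d,2))."""
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x**2, cross])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=5), rng.normal(size=5)

# The inner product of the expansions equals the O(d)-time kernel value.
assert np.isclose(phi(x) @ phi(xp), (1 + x @ xp) ** 2)
```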
(Lec11-12; RBF kernel.) Infinite dimensional feature expansion

For any σ > 0, there is an infinite feature expansion φ : R^d → R^∞ such that

    φ(x)^T φ(x′) = exp( −‖x − x′‖_2^2 / (2σ^2) ),

which can be computed in O(d) time.

(This is called the Gaussian kernel with bandwidth σ.)
(Lec11-12.) Kernels

A (positive definite) kernel function K : X × X → R is a symmetric function satisfying:

    For any x_1, x_2, ..., x_n ∈ X, the n × n matrix whose (i, j)-th entry is K(x_i, x_j) is positive semidefinite.

(This matrix is called the Gram matrix.)

For any kernel K, there is a feature mapping φ : X → H such that

    φ(x)^T φ(x′) = K(x, x′).

H is a Hilbert space—i.e., a special kind of inner product space—called the Reproducing Kernel Hilbert Space corresponding to K.
(Lec11-12.) Kernel SVMs (Boser, Guyon, and Vapnik, 1992)

Solve

    max_{α_1, ..., α_n ≥ 0} ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n α_i α_j y_i y_j K(x_i, x_j).

The solution w = ∑_{i=1}^n α_i y_i φ(x_i) is used in the following way:

    x ↦ φ(x)^T w = ∑_{i=1}^n α_i y_i K(x, x_i).

- To represent the classifier, we need to keep the support vector examples (x_i, y_i) and the corresponding α_i's.
- To compute a prediction on x, iterate through the support vector examples and compute K(x, x_i) for each support vector x_i ...

Very similar to the nearest neighbor classifier: the predictor is represented using (a subset of) the training data.
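A sketch of the resulting predictor, assuming the α_i's have already been found (e.g., by a dual solver) and using the Gaussian kernel from earlier:

```python
import numpy as np

def rbf_kernel(x, xp, sigma=1.0):
    """Gaussian kernel with bandwidth sigma."""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def kernel_svm_predict(x, sv_X, sv_y, sv_alpha, kernel=rbf_kernel):
    """Sign of phi(x)^T w = sum_i alpha_i y_i K(x, x_i), over support vectors only."""
    score = sum(a * yi * kernel(x, xi)
                for a, yi, xi in zip(sv_alpha, sv_y, sv_X))
    return np.sign(score)
```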
(Lec11-12.) Nonlinear support vector machines (again)
[Four decision-surface contour plots over [−1, 1]²: ReLU network; Quadratic SVM; RBF SVM (σ = 1); RBF SVM (σ = 0.1).]
57 / 61
(Lec13.) The Perceptron Algorithm
Perceptron update (Rosenblatt '58): initialize w := 0, and thereafter
w ← w + 1[ywᵀx ≤ 0] yx.
Remarks.
I Can interpret the algorithm as: either we are correct with a margin (ywᵀx > 0) and we do nothing, or we are not and we update w ← w + yx.
I Therefore: if we update, we do so by rotating towards yx.
I This makes sense: (w + yx)ᵀ(yx) = wᵀ(yx) + ‖x‖²; i.e., we increase wᵀ(yx).
Scenario 1: the current vector wt is comparable to xt in length (yt = 1, xtᵀwt ≤ 0). After the update, wt+1 correctly classifies (xt, yt).
Scenario 2: the current vector wt is much longer than xt (yt = 1, xtᵀwt ≤ 0). After the update, wt+1 still does not correctly classify (xt, yt).
Not obvious that Perceptron will eventually terminate! (We'll return to this.)
58 / 61
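A minimal sketch of the full mistake-driven loop (my own code, assuming numpy; the toy data are hypothetical). The classical convergence theorem says the loop stops on linearly separable data, which is exactly the termination question raised above:

import numpy as np

def perceptron(X, y, max_passes=100):
    """Perceptron (Rosenblatt '58): start at w = 0; on each mistake
    (y_i * w^T x_i <= 0), rotate toward the example: w <- w + y_i * x_i.
    Stop after a pass with no mistakes, or after max_passes."""
    w = np.zeros(X.shape[1])
    for _ in range(max_passes):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
                mistakes += 1
        if mistakes == 0:
            break
    return w

# Tiny linearly separable example.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.all(y * (X @ w) > 0))  # True: every point has positive margin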
2. Homework review
Nice homework problems
I HW1-P3: SVD practice.
I HW2-P1: SVD practice; definitions of ‖A‖2 and ‖A‖F.
I HW3-P1: convexity practice.
I HW3-P2: deep network and probability practice.
I HW3-P4, HW3-P5: kernel practice.
59 / 61
3. Stuff added 3/12/2019
Belaboring Cauchy-Schwarz
Theorem (Cauchy-Schwarz). |aᵀb| ≤ ‖a‖ · ‖b‖.
Proof (another one. . . ). If either a or b is zero, then aᵀb = 0 = ‖a‖ · ‖b‖.
If both a and b are nonzero, note for any r > 0 that
0 ≤ ‖ra − b/r‖² = r²‖a‖² − 2aᵀb + ‖b‖²/r²,
which can be rearranged into
aᵀb ≤ r²‖a‖²/2 + ‖b‖²/(2r²).
Choosing r = √(‖b‖/‖a‖),
aᵀb ≤ (‖a‖²/2)(‖b‖/‖a‖) + (‖b‖²/2)(‖a‖/‖b‖) = ‖a‖ · ‖b‖.
Lastly, applying the bound to (a, −b) gives
−aᵀb = aᵀ(−b) ≤ ‖a‖ · ‖−b‖ = ‖a‖ · ‖b‖,
so together it follows that |aᵀb| ≤ ‖a‖ · ‖b‖. ∎
60 / 61
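A numeric spot-check of the two steps in the proof (my own sketch, assuming numpy): the bound holds for every r > 0, and the choice r = √(‖b‖/‖a‖) makes it tight.

import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(4), rng.standard_normal(4)

# aᵀb ≤ r²‖a‖²/2 + ‖b‖²/(2r²) for every r > 0 ...
rs = np.linspace(0.1, 3.0, 50)
bounds = rs ** 2 * (a @ a) / 2 + (b @ b) / (2 * rs ** 2)
print(np.all(a @ b <= bounds + 1e-12))

# ... and at r = √(‖b‖/‖a‖) the bound equals ‖a‖·‖b‖.
r = np.sqrt(np.linalg.norm(b) / np.linalg.norm(a))
tightest = r ** 2 * (a @ a) / 2 + (b @ b) / (2 * r ** 2)
print(np.isclose(tightest, np.linalg.norm(a) * np.linalg.norm(b)))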
SVD again
1. SV triples: (s, u, v) satisfies Mv = su and Mᵀu = sv.
2. Thin decomposition SVD: M = ∑_{i=1}^r si ui viᵀ.
3. Full factorization SVD: M = USVᵀ.
4. "Operational" view of SVD: for M ∈ R^{n×d},
M = [u1 ⋯ ur ur+1 ⋯ un] · [ diag(s1, . . . , sr)  0 ; 0  0 ] · [v1 ⋯ vr vr+1 ⋯ vd]ᵀ.
The first parts of U, V span the column / row space (respectively); the second parts span the left / right nullspaces (respectively).
Personally: I internalize the SVD and use it to reason about matrices. E.g., the "rank-nullity theorem".
61 / 61
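All four views can be checked mechanically (a minimal numpy sketch; the random rank-r test matrix is hypothetical):

import numpy as np

rng = np.random.default_rng(0)
n, d, r = 5, 4, 2
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))  # rank-r matrix

# Full factorization: M = U S Vᵀ with U (n×n), V (d×d) orthogonal.
U, s, Vt = np.linalg.svd(M, full_matrices=True)

# Thin decomposition: sum of the top-r rank-one terms s_i u_i v_iᵀ.
print(np.allclose(M, sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(r))))

# SV triple: M v1 = s1 u1 and Mᵀ u1 = s1 v1.
print(np.allclose(M @ Vt[0], s[0] * U[:, 0]))
print(np.allclose(M.T @ U[:, 0], s[0] * Vt[0]))

# Operational view / rank-nullity: v_{r+1}, ..., v_d span the right nullspace.
print(np.allclose(M @ Vt[r:].T, 0))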