Midterm review (CS 446)

Source: mjt.cs.illinois.edu/courses/ml-s19/files/slides-midterm_review.pdf (3/12/2019)


Midterm review

CS 446


1. Lecture review

(Lec1.) Basic setting: supervised learning

Training data: labeled examples

(x1, y1), (x2, y2), . . . , (xn, yn)

where

- each input xi is a machine-readable description of an instance (e.g., image, sentence), and
- each corresponding label yi is an annotation relevant to the task; typically it is not easy to obtain automatically.

Goal: learn a function f from labeled examples that accurately “predicts” the labels of new (previously unseen) inputs. (Note: 0 training error is easy; test/population error is what matters.)

[Diagram: past labeled examples → learning algorithm → learned predictor; new (unlabeled) example → predicted label.]

1 / 61

(Lec2.) k-nearest neighbors classifier

Given: labeled examples D := {(xi, yi)}_{i=1}^n

Predictor: f_{D,k} : X → Y. On input x,

1. Find the k points x_{i1}, x_{i2}, . . . , x_{ik} among {xi}_{i=1}^n “closest” to x (the k nearest neighbors).

2. Return the plurality label of y_{i1}, y_{i2}, . . . , y_{ik}.

(Break ties in both steps arbitrarily.)

2 / 61
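For concreteness, a minimal numpy sketch of the predictor f_{D,k} above (the function and variable names are my own, not from the course):

```python
# A minimal k-NN sketch in numpy (illustration only).
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by a plurality vote among the k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)        # Euclidean "closeness"
    nearest = np.argsort(dists)[:k]                    # indices of the k nearest neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                   # plurality label (ties broken arbitrarily)

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.9], [0.1, 0.2]])
y_train = np.array([0, 1, 1, 0])
print(knn_predict(X_train, y_train, np.array([0.9, 1.0]), k=3))  # -> 1
```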

(Lec2.) Choosing k

The hold-out set approach:

1. Pick a subset V ⊂ S (hold-out set, a.k.a. validation set).

2. For each k ∈ {1, 3, 5, . . . }:
   - Construct the k-NN classifier f_{S\V,k} using S \ V.
   - Compute the error rate of f_{S\V,k} on V (the “hold-out error rate”).

3. Pick the k that gives the smallest hold-out error rate.

(There are many other approaches.)

3 / 61
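A sketch of the hold-out procedure, reusing the hypothetical knn_predict from the previous sketch (the split fraction and candidate k's are arbitrary choices):

```python
# Hold-out selection of k, reusing the knn_predict sketch above (illustrative only).
import numpy as np

def holdout_error(X_tr, y_tr, X_val, y_val, k):
    preds = np.array([knn_predict(X_tr, y_tr, x, k) for x in X_val])
    return np.mean(preds != y_val)

def choose_k(X, y, ks=(1, 3, 5, 7), val_frac=0.25, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(val_frac * len(X))
    val, tr = idx[:n_val], idx[n_val:]                  # V and S \ V
    errs = {k: holdout_error(X[tr], y[tr], X[val], y[val], k) for k in ks}
    return min(errs, key=errs.get)                      # k with the smallest hold-out error

print(choose_k(X_train, y_train, ks=(1, 3)))
```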

(Lec2.) Decision trees

Directly optimize tree structure for good classification.

A decision tree is a function f : X → Y, represented by a binary tree in which:

- Each tree node is associated with a splitting rule g : X → {0, 1}.
- Each leaf node is associated with a label y ∈ Y.

When X = Rd, typically we only consider splitting rules of the form

g(x) = 1{xi > t}

for some i ∈ [d] and t ∈ R. These are called axis-aligned or coordinate splits.

(Notation: [d] := {1, 2, . . . , d}.)

[Figure: example tree with root split x1 > 1.7, an internal split x2 > 2.8, and leaves y = 1, y = 2, y = 3.]

4 / 61
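A toy sketch of prediction with axis-aligned splits; the tuple representation is my own, and which branch holds which leaf is an assumption about the figure:

```python
# A tiny sketch of axis-aligned splits and tree prediction (toy representation, not course code):
# a node is either a label (leaf) or a tuple (i, t, left, right).
def predict(tree, x):
    if not isinstance(tree, tuple):       # leaf: return its label
        return tree
    i, t, left, right = tree
    return predict(right if x[i] > t else left, x)   # g(x) = 1{x_i > t} picks the branch

# The example tree from the slide: root tests x1 > 1.7, one branch tests x2 > 2.8
# (branch orientation is assumed).
tree = (0, 1.7, 1, (1, 2.8, 2, 3))        # indices 0, 1 correspond to x1, x2
print(predict(tree, [1.5, 4.0]))          # -> 1
print(predict(tree, [2.0, 2.0]))          # -> 2
print(predict(tree, [2.0, 4.0]))          # -> 3
```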

(Lec2.) Decision tree example

Classifying irises by sepal and petal measurements:

- X = R2, Y = {1, 2, 3}
- x1 = ratio of sepal length to width
- x2 = ratio of petal length to width

[Figure: scatter plot with sepal length/width (roughly 1.5 to 3) on the horizontal axis and petal length/width (roughly 2 to 6) on the vertical axis, partitioned by the splits x1 > 1.7 and x2 > 2.8 into regions labeled y = 1, y = 2, y = 3.]

5 / 61

(Lec2.) Nearest neighbors and decision trees

Today we covered two standard machine learning methods.

[Figure: an example 2-d dataset in the (x1, x2) plane.]

Nearest neighbors. Training/fitting: memorize data. Testing/predicting: find the k closest memorized points, return the plurality label. Overfitting? Vary k.

Decision trees. Training/fitting: greedily partition space, reducing “uncertainty”. Testing/predicting: traverse the tree, output the leaf label. Overfitting? Limit or prune the tree.

Note: both methods can output real numbers (regression, not classification); return the median/mean of the neighbors or of the points reaching the leaf.

6 / 61

(Lec3-4.) ERM setup for least squares

- Predictors/model: f(x) = wᵀx; a linear predictor/regressor.
  (For linear classification: x ↦ sgn(wᵀx).)

- Loss/penalty: the least squares loss

  ℓls(ŷ, y) = ℓls(y, ŷ) = (ŷ − y)².

  (Some conventions scale this by 1/2.)

- Goal: minimize the least squares empirical risk

  Rls(f) = (1/n) ∑_{i=1}^n ℓls(yi, f(xi)) = (1/n) ∑_{i=1}^n (yi − f(xi))².

- Specifically, we choose w ∈ Rd according to

  arg min_{w ∈ Rd} Rls(x ↦ wᵀx) = arg min_{w ∈ Rd} (1/n) ∑_{i=1}^n (yi − wᵀxi)².

- More generally, this is the ERM approach: pick a model and minimize empirical risk over the model parameters.

7 / 61

(Lec3-4.) ERM in general

- Pick a family of models/predictors F. (For today, we use linear predictors.)

- Pick a loss function ℓ. (For today, we chose the squared loss.)

- Minimize the empirical risk over the model parameters.

We haven’t discussed: true risk and overfitting; how to minimize; why this is a good idea.

Remark: ERM is convenient in pytorch; just pick a model, a loss, an optimizer, and tell it to minimize.

8 / 61

(Lec3-4.) Empirical risk minimization in matrix notation

Define the n × d matrix A and the n × 1 column vector b by

A := (1/√n) [ x1ᵀ ; . . . ; xnᵀ ],   b := (1/√n) [ y1 ; . . . ; yn ].

We can write the empirical risk as

R(w) = (1/n) ∑_{i=1}^n (yi − xiᵀw)² = ‖Aw − b‖₂².

Necessary condition for w to be a minimizer of R:

∇R(w) = 0, i.e., w is a critical point of R.

This translates to

(AᵀA)w = Aᵀb,

a system of linear equations called the normal equations.

In an upcoming lecture we’ll prove every critical point of R is a minimizer of R.

9 / 61
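A small numpy sketch of this setup on synthetic data (all names and the data are made up):

```python
# Least squares ERM in the matrix notation above (sketch; synthetic data).
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

A = X / np.sqrt(n)                      # A := (1/sqrt(n)) [x_1^T; ...; x_n^T]
b = y / np.sqrt(n)                      # b := (1/sqrt(n)) [y_1; ...; y_n]

w_normal = np.linalg.solve(A.T @ A, A.T @ b)     # solve the normal equations (A^T A) w = A^T b
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)  # library least squares solver, same minimizer here
print(np.allclose(w_normal, w_lstsq))            # -> True when A^T A is invertible
```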

(Lec3-4.) Full (factorization) SVD (new slide)

Given M ∈ R^{n×d}, let M = USVᵀ denote the singular value decomposition (SVD), where

- U ∈ R^{n×n} is orthonormal, thus UᵀU = UUᵀ = I,
- V ∈ R^{d×d} is orthonormal, thus VᵀV = VVᵀ = I,
- S ∈ R^{n×d} has singular values s1 ≥ s2 ≥ · · · ≥ s_{min(n,d)} along the diagonal and zeros elsewhere, where the number of positive singular values equals the rank of M.

Some facts:

- The SVD is not unique when the singular values are not distinct; e.g., we can write I = UIUᵀ where U is any orthonormal matrix.
- The pseudoinverse S⁺ ∈ R^{d×n} of S is obtained by starting with Sᵀ and taking the reciprocal of each positive entry.
- The pseudoinverse of M is VS⁺Uᵀ.
- If M⁻¹ exists, then M⁻¹ = M⁺.

10 / 61

(Lec3-4.) Thin (decomposition) SVD (new slide)

Given M ∈ R^{n×d}, (s, u, v) are a singular value with corresponding left and right singular vectors if Mv = su and Mᵀu = sv.

The thin SVD of M is M = ∑_{i=1}^r si ui viᵀ, where r is the rank of M, and

- the left singular vectors (u1, . . . , ur) are orthonormal (but we might have r < min{n, d}!) and span the column space of M,
- the right singular vectors (v1, . . . , vr) are orthonormal (but we might have r < min{n, d}!) and span the row space of M,
- the singular values satisfy s1 ≥ · · · ≥ sr > 0.

Some facts:

- Pseudoinverse M⁺ = ∑_{i=1}^r (1/si) vi uiᵀ.
- (ui)_{i=1}^r span the column space of M.

11 / 61

(Lec3-4.) SVD and least squares

Recall: we’d like to find w such that AᵀAw = Aᵀb.

If w = A⁺b, then

AᵀAw = (∑_{i=1}^r si vi uiᵀ)(∑_{i=1}^r si ui viᵀ)(∑_{i=1}^r (1/si) vi uiᵀ) b
     = (∑_{i=1}^r si vi uiᵀ)(∑_{i=1}^r ui uiᵀ) b = Aᵀb.

Henceforth, define wols = A⁺b as the OLS solution. (OLS = “ordinary least squares”.)

Note: in general, AA⁺ = ∑_{i=1}^r ui uiᵀ ≠ I.

12 / 61
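Continuing the synthetic example above, a sketch of wols = A⁺b via the thin SVD (assumes the A and b defined in the earlier sketch):

```python
# OLS via the pseudoinverse, as on the slide (sketch; reuses the synthetic A, b from above).
import numpy as np

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: A = U diag(s) V^T
r = np.sum(s > 1e-12)                              # numerical rank
A_pinv = Vt[:r].T @ np.diag(1.0 / s[:r]) @ U[:, :r].T
w_ols = A_pinv @ b                                 # w_ols = A^+ b
print(np.allclose(A_pinv, np.linalg.pinv(A)))      # matches numpy's pseudoinverse
print(np.allclose(A.T @ A @ w_ols, A.T @ b))       # satisfies the normal equations
```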

(Lec3-4.) Normal equations imply optimality

Consider w with AᵀAw = Aᵀy, and any w′; then

‖Aw′ − y‖² = ‖Aw′ − Aw + Aw − y‖²
           = ‖Aw′ − Aw‖² + 2(Aw′ − Aw)ᵀ(Aw − y) + ‖Aw − y‖².

Since

(Aw′ − Aw)ᵀ(Aw − y) = (w′ − w)ᵀ(AᵀAw − Aᵀy) = 0,

we get ‖Aw′ − y‖² = ‖Aw′ − Aw‖² + ‖Aw − y‖² ≥ ‖Aw − y‖². This means w is optimal.

Moreover, writing A = ∑_{i=1}^r si ui viᵀ,

‖Aw′ − Aw‖² = (w′ − w)ᵀ(AᵀA)(w′ − w) = (w′ − w)ᵀ (∑_{i=1}^r si² vi viᵀ) (w′ − w),

so w′ is also optimal iff w′ − w is in the right nullspace of A.

(We’ll revisit all this with convexity later.)

13 / 61

(Lec3-4.) Regularized ERM

Combine the two concerns: for a given λ ≥ 0, find a minimizer of

R(w) + λ‖w‖₂²

over w ∈ Rd.

Fact: if λ > 0, then the solution is always unique (even if n < d)!

- This is called ridge regression. (λ = 0 is ERM / ordinary least squares.)
  Explicit solution: (AᵀA + λI)⁻¹Aᵀb.
- The parameter λ controls how much attention is paid to the regularizer ‖w‖₂² relative to the data-fitting term R(w).
- Choose λ using cross-validation.

Note: in deep networks, this regularization is called “weight decay”. (Why?)
Note: another popular regularizer for linear regression is ℓ1.

14 / 61
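A sketch of the explicit ridge solution, again reusing the synthetic A, b (and w_ols) from the earlier sketches:

```python
# Ridge regression via its explicit solution (sketch).
import numpy as np

def ridge(A, b, lam):
    d = A.shape[1]
    # (A^T A + lambda I)^{-1} A^T b; unique for lambda > 0 even when n < d
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

print(ridge(A, b, lam=0.1))                      # shrinks toward 0 as lam grows
print(np.allclose(ridge(A, b, lam=0.0), w_ols))  # lam = 0 recovers OLS when A^T A is invertible
```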

(Lec5-6.) Geometry of linear classifiers

[Figure: a hyperplane H in the (x1, x2) plane with normal vector w.]

A hyperplane in Rd is a linear subspace of dimension d − 1.

- A hyperplane in R2 is a line.
- A hyperplane in R3 is a plane.
- As a linear subspace, a hyperplane always contains the origin.

A hyperplane H can be specified by a (non-zero) normal vector w ∈ Rd.

The hyperplane with normal vector w is the set of points orthogonal to w:

H = {x ∈ Rd : xᵀw = 0}.

Given w and its corresponding H: H separates the points labeled positive, {x : wᵀx > 0}, from those labeled negative, {x : wᵀx < 0}.

15 / 61

(Lec5-6.) Classification with a hyperplane

[Figure: hyperplane H with normal w; a point x at angle θ to w, whose projection onto span{w} has coordinate ‖x‖₂ · cos θ.]

The projection of x onto span{w} (a line) has coordinate

‖x‖₂ · cos(θ), where cos θ = xᵀw / (‖w‖₂‖x‖₂).

(The distance to the hyperplane is ‖x‖₂ · |cos(θ)|.)

The decision boundary is the hyperplane (oriented by w):

xᵀw > 0 ⟺ ‖x‖₂ · cos(θ) > 0 ⟺ x is on the same side of H as w.

What should we do if we want a hyperplane decision boundary that doesn’t (necessarily) go through the origin?

16 / 61

(Lec5-6.) Linear separability

Is it always possible to find w with sign(wᵀxi) = yi?
Is it always possible to find a hyperplane separating the data?
(Appending a 1 to x means the hyperplane need not go through the origin.)

[Figure: two 2-d datasets; the left is linearly separable, the right is not linearly separable.]

17 / 61
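A tiny sketch of the "append a 1" trick mentioned above (the weights here are made up):

```python
# The "append a 1" trick for an affine decision boundary (sketch).
import numpy as np

def predict_sign(w, x):
    x_aug = np.append(x, 1.0)            # [x; 1] so the last coordinate of w acts as a bias
    return np.sign(w @ x_aug)            # sign(w^T [x; 1])

w = np.array([1.0, -2.0, 0.5])           # last entry is the offset/bias term
print(predict_sign(w, np.array([3.0, 1.0])))   # -> 1.0  (since 3 - 2 + 0.5 > 0)
print(predict_sign(w, np.array([0.0, 1.0])))   # -> -1.0 (since 0 - 2 + 0.5 < 0)
```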

(Lec5-6.) Cauchy-Schwarz (new slide)

Cauchy-Schwarz inequality: |aᵀb| ≤ ‖a‖ · ‖b‖.

Proof. If ‖a‖ = ‖b‖,

0 ≤ ‖a − b‖² = ‖a‖² − 2aᵀb + ‖b‖² = 2‖a‖ · ‖b‖ − 2aᵀb,

which rearranges to give aᵀb ≤ ‖a‖ · ‖b‖.

For the case ‖a‖ < ‖b‖, apply the preceding to a and ‖a‖b/‖b‖ (which have equal norms):
aᵀb = (‖b‖/‖a‖) · aᵀ(‖a‖b/‖b‖) ≤ (‖b‖/‖a‖) · ‖a‖ · ‖a‖ = ‖a‖ · ‖b‖.

For the absolute value, apply the preceding to (a, −b). □

18 / 61

(Lec5-6.) Logistic loss

Let’s state our classification goal with a generic margin loss ℓ:

R(w) = (1/n) ∑_{i=1}^n ℓ(yi wᵀxi);

the key properties we want:

- ℓ is continuous;
- ℓ(z) ≥ c · 1[z ≤ 0] = c · ℓzo(z) for some c > 0 and every z ∈ R, which implies Rℓ(w) ≥ c · Rzo(w);
- ℓ′(0) < 0 (pushes examples from the wrong side toward the right side).

Examples.

- Squared loss, written in margin form: ℓls(z) := (1 − z)²;
  note ℓls(yŷ) = (1 − yŷ)² = y²(y − ŷ)² = (y − ŷ)², since y ∈ {−1, +1}.
- Logistic loss: ℓlog(z) = ln(1 + exp(−z)).

19 / 61
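A short numpy sketch of these margin losses and the corresponding empirical margin risk (the data and names are made up):

```python
# The two margin losses above and the empirical margin risk (sketch).
import numpy as np

def logistic_loss(z):
    return np.log(1.0 + np.exp(-z))          # l_log(z) = ln(1 + exp(-z))

def squared_margin_loss(z):
    return (1.0 - z) ** 2                     # l_ls(z) = (1 - z)^2

def margin_risk(loss, w, X, y):
    margins = y * (X @ w)                     # z_i = y_i w^T x_i
    return np.mean(loss(margins))

X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.5, -1.0]])
y = np.array([1.0, -1.0, 1.0])
w = np.array([1.0, 0.0])
print(margin_risk(logistic_loss, w, X, y), margin_risk(squared_margin_loss, w, X, y))
```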

(Lec5-6.) Squared and logistic losses on linearly separable data I

[Figure: contour plots of the learned predictor on a linearly separable 2-d dataset. Left: logistic loss. Right: squared loss.]

20 / 61

(Lec5-6.) Squared and logistic losses on linearly separable data II

[Figure: a second pair of contour plots on linearly separable data. Left: logistic loss. Right: squared loss.]

21 / 61

(Lec5-6.) Logistic risk and separation

If there exists a perfect linear separator, empirical logistic risk minimization should find it.

Theorem. If there exists w with yi wᵀxi > 0 for all i, then every w′ with

Rlog(w′) < ln(2)/(2n) + inf_v Rlog(v)

also satisfies yi (w′)ᵀxi > 0 for all i.

Proof. Omitted.

22 / 61

(Lec5-6.) Least squares and logistic ERM

Least squares:

- Take the gradient of ‖Aw − b‖², set it to 0; obtain the normal equations AᵀAw = Aᵀb.
- One choice is the minimum norm solution A⁺b.

Logistic loss:

- Take the gradient of Rlog(w) = (1/n) ∑_{i=1}^n ln(1 + exp(−yi wᵀxi)) and set it to 0 ???

Remark. Is A⁺b a “closed form expression”?

23 / 61

(Lec5-6.) Gradient descent

Given a function F : Rd → R, gradient descent is the iteration

w_{i+1} := w_i − η_i ∇_w F(w_i),

where w_0 is given, and η_i is a learning rate / step size.

[Figure: contour plot of an objective with gradient descent iterates moving toward the minimum.]

Does this work for least squares? Later we’ll show it works for least squares and logistic regression due to convexity.

24 / 61
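A sketch of this iteration applied to the empirical logistic risk with a constant step size; it reuses the toy X, y from the margin-loss sketch, and the gradient formula is the standard derivative of ln(1 + exp(−z)):

```python
# Gradient descent on the empirical logistic risk (sketch; constant step size, toy data from above).
import numpy as np

def logistic_risk_grad(w, X, y):
    margins = y * (X @ w)
    # d/dw (1/n) sum ln(1 + exp(-y_i w^T x_i)) = -(1/n) sum y_i x_i / (1 + exp(y_i w^T x_i))
    coeffs = -y / (1.0 + np.exp(margins))
    return (X.T @ coeffs) / len(y)

w = np.zeros(X.shape[1])
eta = 0.5
for _ in range(200):
    w = w - eta * logistic_risk_grad(w, X, y)   # w_{i+1} = w_i - eta * grad F(w_i)
print(w, np.all(y * (X @ w) > 0))               # margins should become positive if the data is separable
```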

(Lec5-6.) Multiclass?

All our methods so far handle multiclass:

- k-nn and decision tree: plurality label.
- Least squares: arg min_{W ∈ R^{d×k}} ‖AW − B‖²_F with B ∈ R^{n×k}; W ∈ R^{d×k} is k separate linear regressors in Rd.

How about linear classifiers?

- At prediction time, x ↦ arg max_y f(x)_y.
- As in the binary case: interpretation f(x)_y = Pr[Y = y | X = x].

What is a good loss function?

25 / 61

(Lec5-6.) Cross-entropy

Given two probability vectors p, q ∈ Δ_k = {p ∈ R^k_{≥0} : ∑_i p_i = 1},

H(p, q) = −∑_{i=1}^k p_i ln q_i   (cross-entropy).

- If p = q, then H(p, q) = H(p) (entropy); indeed

  H(p, q) = −∑_{i=1}^k p_i ln(p_i · q_i/p_i) = H(p) [entropy] + KL(p, q) [KL divergence].

  Since KL ≥ 0, and moreover KL(p, q) = 0 iff p = q, this is the cost/entropy of p plus a penalty for differing.

- Choose the encoding e_y for y ∈ {1, . . . , k}, and ŷ ∝ exp(f(x)) with f : Rd → R^k;

  ℓce(y, f(x)) = H(e_y, ŷ) = −∑_{i=1}^k (e_y)_i ln( exp(f(x)_i) / ∑_{j=1}^k exp(f(x)_j) )
               = −ln( exp(f(x)_y) / ∑_{j=1}^k exp(f(x)_j) )
               = −f(x)_y + ln ∑_{j=1}^k exp(f(x)_j).

(In pytorch, use torch.nn.CrossEntropyLoss()(f(x), y).)

26 / 61
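A quick check of the last identity against the pytorch call mentioned on the slide (single example, made-up logits):

```python
# Cross-entropy as -f(x)_y + logsumexp(f(x)), checked against torch.nn.CrossEntropyLoss (sketch).
import torch

logits = torch.tensor([[2.0, -1.0, 0.5]])           # f(x) for a single example, k = 3 classes
y = torch.tensor([0])                                # true label index

manual = -logits[0, y[0]] + torch.logsumexp(logits[0], dim=0)
library = torch.nn.CrossEntropyLoss()(logits, y)     # expects raw logits; the softmax is implicit
print(manual.item(), library.item())                 # the two values agree
```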

(Lec5-6.) Cross-entropy, classification, and margins

The zero-one loss for classification is

ℓzo(yi, f(x)) = 1[ yi ≠ arg max_j f(x)_j ].

In the multiclass case, we can define the margin as

f(x)_y − max_{j≠y} f(x)_j,

interpreted as “the distance by which f is correct”. (It can be negative!)

Since ln ∑_j exp(z_j) ≈ max_j z_j, cross-entropy satisfies

ℓce(yi, f(x)) = −f(x)_y + ln ∑_j exp(f(x)_j) ≈ −f(x)_y + max_j f(x)_j,

thus minimizing cross-entropy maximizes margins.

27 / 61

(Lec7-8.) The ERM perspective

These lectures will follow an ERM perspective on deep networks:

- Pick a model/predictor class (network architecture). (We will spend most of our time on this!)
- Pick a loss/risk. (We will almost always use cross-entropy!)
- Pick an optimizer. (We will mostly treat this as a black box!)

The goal is low test error, whereas the above only gives low training error; we will briefly discuss this as well.

28 / 61

(Lec7-8.) Basic deep networks

A self-contained expression is

x ↦ σ_L( W_L σ_{L−1}( · · · (W_2 σ_1(W_1 x + b_1) + b_2) · · · ) + b_L ),

with the equivalent “functional form”

x ↦ (f_L ∘ · · · ∘ f_1)(x) where f_i(z) = σ_i(W_i z + b_i).

Some further details (many more to come!):

- (W_i)_{i=1}^L with W_i ∈ R^{d_i × d_{i−1}} are the weights, and (b_i)_{i=1}^L are the biases.
- (σ_i)_{i=1}^L with σ_i : R^{d_i} → R^{d_i} are called nonlinearities, or activations, or transfer functions, or link functions.
- This is only the basic setup; many things can and will change, please ask many questions!

29 / 61
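A direct numpy sketch of this functional form for L = 2 with a ReLU and an identity output layer (the dimensions are arbitrary choices):

```python
# A 2-layer instance of the "functional form" above (sketch).
import numpy as np

rng = np.random.default_rng(0)
d0, d1, d2 = 4, 8, 3                                   # input dim, hidden width, output dim
W1, b1 = rng.normal(size=(d1, d0)), np.zeros(d1)
W2, b2 = rng.normal(size=(d2, d1)), np.zeros(d2)

relu = lambda z: np.maximum(0.0, z)                    # sigma_1 = ReLU
identity = lambda z: z                                 # sigma_2 = identity (logits for cross-entropy)

def forward(x):
    f1 = relu(W1 @ x + b1)                             # f_1(x) = sigma_1(W_1 x + b_1)
    return identity(W2 @ f1 + b2)                      # f_2(f_1(x))

print(forward(rng.normal(size=d0)))                    # a vector of d2 = 3 logits
```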

(Lec7-8.) Choices of activation

Basic form:

x ↦ σ_L( W_L σ_{L−1}( · · · W_2 σ_1(W_1 x + b_1) + b_2 · · · ) + b_L ).

Choices of activation (univariate, applied coordinate-wise):

- Indicator/step/Heaviside/threshold z ↦ 1[z ≥ 0]. This was the original choice (1940s!).
- Sigmoid σ_s(z) := 1/(1 + exp(−z)). This was popular roughly 1970s - 2005?
- Hyperbolic tangent z ↦ tanh(z). Similar to the sigmoid, used during the same interval.
- Rectified Linear Unit (ReLU) σ_r(z) = max{0, z}. It (and slight variants, e.g., Leaky ReLU, ELU, . . . ) are the dominant choice now; popularized in the “Imagenet/AlexNet” paper (Krizhevsky-Sutskever-Hinton, 2012).
- Identity z ↦ z; we’ll often use this as the last layer when we use the cross-entropy loss.
- NON-coordinate-wise choices: we will discuss “softmax” and “pooling” a bit later.

30 / 61

(Lec7-8.) “Architectures” and “models”

Basic form:

x ↦ σ_L( W_L σ_{L−1}( · · · W_2 σ_1(W_1 x + b_1) + b_2 · · · ) + b_L ).

((W_i, b_i))_{i=1}^L, the weights and biases, are the parameters.

Let’s roll them into W := ((W_i, b_i))_{i=1}^L, and consider the network as a two-parameter function F_W(x) = F(x; W).

- The model or class of functions is {F_W : all possible W}. F (both arguments unset) is also called an architecture.
- When we fit/train/optimize, typically we leave the architecture fixed and vary W to minimize risk.

(More on this in a moment.)

31 / 61

(Lec7-8.) ERM recipe for basic networks

Standard ERM recipe:

- First we pick a class of functions/predictors; for deep networks, that means an F(·, ·).
- Then we pick a loss function and write down an empirical risk minimization problem; in these lectures we will pick cross-entropy:

  arg min_W (1/n) ∑_{i=1}^n ℓce(yi, F(xi; W))
    = arg min over ((W_i, b_i))_{i=1}^L of (1/n) ∑_{i=1}^n ℓce(yi, F(xi; ((W_i, b_i))_{i=1}^L))
    = arg min over ((W_i, b_i))_{i=1}^L of (1/n) ∑_{i=1}^n ℓce(yi, σ_L(· · · σ_1(W_1 xi + b_1) · · · )).

- Then we pick an optimizer. In this class, we only use gradient descent variants. It is a miracle that this works.

32 / 61
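A minimal pytorch sketch of the whole recipe on random data (the architecture, step size, and step count are arbitrary choices, not the course's):

```python
# The recipe in pytorch: pick a model, a loss, an optimizer, and minimize (sketch; random data).
import torch

X = torch.randn(64, 10)                              # made-up inputs, d = 10
y = torch.randint(0, 3, (64,))                       # made-up labels, k = 3 classes

model = torch.nn.Sequential(                         # the architecture F
    torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 3))
loss_fn = torch.nn.CrossEntropyLoss()                # the loss
opt = torch.optim.SGD(model.parameters(), lr=0.1)    # the optimizer

for step in range(100):                              # minimize the empirical risk
    opt.zero_grad()
    risk = loss_fn(model(X), y)
    risk.backward()
    opt.step()
print(risk.item())
```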

(Lec7-8.) Sometimes, linear just isn’t enough

[Figure: decision-boundary contour plots on the same 2-d dataset for the two predictors below.]

Linear predictor: x ↦ wᵀ[x; 1]. Some blue points misclassified.

ReLU network: x ↦ W_2 σ_r(W_1 x + b_1) + b_2. 0 misclassifications!

33 / 61

(Lec7-8.) Classical example: XOR

Classical “XOR problem” (Minsky-Papert ’69). (Check wikipedia for “AI Winter”.)

[Figure: the four XOR points, with the two blue points and the two red points at opposite corners.]

Theorem. On this data, any linear classifier (with affine expansion) makes at least one mistake.

Picture proof. Recall: linear classifiers correspond to separating hyperplanes.

- If the hyperplane splits the blue points, it is incorrect on one of them.
- If it doesn’t split the blue points, then one halfspace contains the common midpoint, and is therefore wrong on at least one red point.

34 / 61

(Lec7-8.) One layer was not enough. How about two?

Theorem (Cybenko ’89, Hornik-Stinchcombe-White ’89, Funahashi ’89, Leshno et al ’92, . . . ). Given any continuous function f : Rd → R and any ε > 0, there exist parameters (W_1, b_1, W_2) so that

sup_{x ∈ [0,1]^d} | f(x) − W_2 σ(W_1 x + b_1) | ≤ ε,

as long as σ is “reasonable” (e.g., ReLU or sigmoid or threshold).

Remarks.

- Together with the XOR example, this justifies using nonlinearities.
- It does not justify (very) deep networks.
- It only says these networks exist, not that we can optimize for them!

35 / 61

(Lec7-8.) General graph-based view

Classical graph-based perspective.

- The network is a directed acyclic graph; sources are inputs, sinks are outputs, and intermediate nodes compute z ↦ σ(wᵀz + b) (each with its own (σ, w, b)).
- Nodes at distance 1 from the inputs are the first layer, distance 2 is the second layer, and so on.

“Modern” graph-based perspective.

- Edges in the graph can be multivariate, meaning vectors or general tensors, and not just scalars.
- Edges will often “skip” layers; “layer” is therefore ambiguous.
- Diagram conventions differ; e.g., tensorflow graphs include nodes for parameters.

36 / 61

(Lec7-8.) 2-D convolution in deep networks (pictures)

[Figure: animations of a small kernel sliding over an input grid to produce an output feature map. Taken from https://github.com/vdumoulin/conv_arithmetic by Vincent Dumoulin, Francesco Visin.]

37 / 61

(Lec7-8.) Softmax

Replace the vector input z with z′ ∝ e^z, meaning

z ↦ ( e^{z_1} / ∑_j e^{z_j}, . . . , e^{z_k} / ∑_j e^{z_j} ).

- Converts the input into a probability vector; useful for interpreting the network output as Pr[Y = y | X = x].
- We have baked it into our cross-entropy definition; the last lecture’s networks with cross-entropy training had an implicit softmax.
- If some coordinate j of z dominates the others, then the softmax is close to e_j.

38 / 61
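A small sketch checking both bullet points (the max subtraction is a standard numerical-stability trick, not something stated on the slide):

```python
# A quick check of the softmax properties above (sketch).
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))      # subtracting max(z) avoids overflow; the output is unchanged
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 0.5])).sum())      # -> 1.0 (a probability vector)
print(softmax(np.array([10.0, 0.0, 0.0])))           # close to e_1 when one coordinate dominates
```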

(Lec7-8.) Max pooling

[Figure: a small input grid and the smaller output of max pooling; each output entry is the maximum over the corresponding window of the input. Taken from https://github.com/vdumoulin/conv_arithmetic by Vincent Dumoulin, Francesco Visin.]

- Often used together with convolution layers; shrinks/downsamples the input.
- Another variant is average pooling.
- Implementation: torch.nn.MaxPool2d.

39 / 61
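A shape-only sketch of the torch layers mentioned here (the sizes are arbitrary):

```python
# Shapes for the torch convolution/pooling layers mentioned on this slide (sketch).
import torch

x = torch.randn(1, 3, 8, 8)                           # (batch, channels, height, width)
conv = torch.nn.Conv2d(in_channels=3, out_channels=6, kernel_size=3)
pool = torch.nn.MaxPool2d(kernel_size=2)              # downsamples by a factor of 2

h = conv(x)
print(h.shape)                                        # torch.Size([1, 6, 6, 6])
print(pool(h).shape)                                  # torch.Size([1, 6, 3, 3])
```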

(Lec9-10.) Multivariate network single-example gradients

Define G_j(W) = σ_j(W_j · · · σ_1(W_1 x + b_1) · · · ).

The multivariate chain rule tells us

∇_W F(Wx) = Jᵀxᵀ,

where J ∈ R^{l×k} is the Jacobian matrix of F : R^k → R^l at Wx, the matrix of all coordinate-wise derivatives.

dG_L/dW_L = J_Lᵀ G_{L−1}(W)ᵀ,   dG_L/db_L = J_Lᵀ,
. . .
dG_L/dW_j = (J_L W_L J_{L−1} W_{L−1} · · · J_j)ᵀ G_{j−1}(W)ᵀ,
dG_L/db_j = (J_L W_L J_{L−1} W_{L−1} · · · J_j)ᵀ,

with J_j the Jacobian of σ_j at W_j G_{j−1}(W) + b_j. For example, when σ_j applies a coordinate-wise σ : R → R,

J_j = diag( σ′([W_j G_{j−1}(W) + b_j]_1), . . . , σ′([W_j G_{j−1}(W) + b_j]_{d_j}) ).

40 / 61

(Lec9-10.) Initialization

Recall

dG_L/dW_j = (J_L W_L J_{L−1} W_{L−1} · · · J_j)ᵀ G_{j−1}(W)ᵀ.

- What if we set W = 0? What if σ = σ_r is a ReLU?
- What if we set two rows of W_j (two nodes) identically?
- Resolving this issue is called symmetry breaking.
- Standard linear/dense layer initializations:

  N(0, 2/d_in)             “He et al.”,
  N(0, 2/(d_in + d_out))   Glorot/Xavier,
  U(−1/√d_in, 1/√d_in)     torch default.

  (Convolution layers are adjusted to have similar distributions.)

Random initialization is emerging as a key story in deep networks!

41 / 61
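A sketch of the three initializations written out for one dense layer (the shapes are arbitrary; the last line mimics, rather than calls, torch's default initializer):

```python
# The three initializations above for a dense layer (sketch).
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 100, 50

W_he = rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_out, d_in))                    # N(0, 2/d_in)
W_glorot = rng.normal(0.0, np.sqrt(2.0 / (d_in + d_out)), size=(d_out, d_in))      # N(0, 2/(d_in + d_out))
W_torch_like = rng.uniform(-1/np.sqrt(d_in), 1/np.sqrt(d_in), size=(d_out, d_in))  # U(-1/sqrt(d_in), 1/sqrt(d_in))

# Setting W = 0 or duplicating rows would give identical gradients for those nodes;
# the random draws break that symmetry.
print(W_he.std(), W_glorot.std(), W_torch_like.std())
```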

(Lec9-10; SGD slide.) Minibatches

We used the linearity of gradients to write

∇_w R(w) = (1/n) Σ_{i=1}^n ∇_w ℓ(F(x_i; w), y_i).

What happens if we replace ((x_i, y_i))_{i=1}^n with a minibatch ((x′_i, y′_i))_{i=1}^b?

▶ Random minibatch =⇒ two gradients equal in expectation.
▶ Most torch layers take minibatch input:
  ▶ torch.nn.Linear has input shape (b, d), output (b, d′).
  ▶ torch.nn.Conv2d has input shape (b, c, h, w), output (b, c′, h′, w′).
▶ This is used heavily outside deep learning as well. It is an easy way to use parallel floating point operations (as in GPU and CPU).
▶ Setting batch size is black magic and depends on many things (prediction problem, gpu characteristics, . . . ).
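A minimal torch sketch of the minibatch shapes listed above (all sizes are arbitrary):

```python
import torch

b, d, d_out = 32, 10, 5
lin = torch.nn.Linear(d, d_out)
print(lin(torch.randn(b, d)).shape)            # torch.Size([32, 5])

b, c, h, w, c_out = 32, 3, 28, 28, 8
conv = torch.nn.Conv2d(c, c_out, kernel_size=3)
print(conv(torch.randn(b, c, h, w)).shape)     # torch.Size([32, 8, 26, 26])

# The minibatch gradient is the average of the b per-example gradients; when the
# minibatch is drawn uniformly at random, it equals the full average in expectation.
```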

(Lec9-10.) Convex sets

A set S is convex if, for every pair of points {x, x′} in S, the line segment between x and x′ is also contained in S. ({x, x′} ⊆ S =⇒ [x, x′] ⊆ S.)

(Figures: four example sets: convex, not convex, convex, convex.)

Examples:

▶ All of R^d.
▶ Empty set.
▶ Half-spaces: {x ∈ R^d : a^T x ≤ b}.
▶ Intersections of convex sets.
▶ Polyhedra: {x ∈ R^d : Ax ≤ b} = ∩_{i=1}^m {x ∈ R^d : a_i^T x ≤ b_i}.
▶ Convex hulls: conv(S) := { Σ_{i=1}^k α_i x_i : k ∈ N, x_i ∈ S, α_i ≥ 0, Σ_{i=1}^k α_i = 1 }.
  (Infinite convex hulls: intersection of all convex supersets.)


(Lec9-10.) Convex functions from convex sets

The epigraph of a function f is the area above the curve:

epi(f) := { (x, y) ∈ R^{d+1} : y ≥ f(x) }.

A function is convex if its epigraph is convex.

(Figures: an example f that is not convex, and one that is.)

(Lec9-10.) Convex functions (standard definition)

A function f : R^d → R is convex if for any x, x′ ∈ R^d and α ∈ [0, 1],

f((1 − α)x + αx′) ≤ (1 − α) · f(x) + α · f(x′).

(Figures: a non-convex f and a convex f, with points x and x′ marked.)

Examples:

▶ f(x) = cx for any c > 0 (on R).
▶ f(x) = |x|^c for any c ≥ 1 (on R).
▶ f(x) = b^T x for any b ∈ R^d.
▶ f(x) = ‖x‖ for any norm ‖·‖.
▶ f(x) = x^T A x for symmetric positive semidefinite A.
▶ f(x) = ln( Σ_{i=1}^d exp(x_i) ), which approximates max_i x_i.
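A quick numerical spot-check (not a proof; dimension, distribution, and sample count are arbitrary) of the defining inequality for the log-sum-exp example above:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.log(np.sum(np.exp(x)))   # log-sum-exp, the last example in the list

worst = -np.inf
for _ in range(10_000):
    x, xp = rng.normal(size=5), rng.normal(size=5)
    a = rng.uniform()
    lhs = f((1 - a) * x + a * xp)          # f((1-a) x + a x')
    rhs = (1 - a) * f(x) + a * f(xp)       # (1-a) f(x) + a f(x')
    worst = max(worst, lhs - rhs)

print("largest observed lhs - rhs:", worst)   # <= 0, as convexity requires
```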


(Lec9-10.) Convexity of differentiable functions

Differentiable functions. If f : R^d → R is differentiable, then f is convex if and only if

f(x) ≥ f(x_0) + ∇f(x_0)^T (x − x_0)

for all x, x_0 ∈ R^d. Note: this implies increasing slopes: (∇f(x) − ∇f(y))^T (x − y) ≥ 0.

(Figure: the tangent line a(x) = f(x_0) + f′(x_0)(x − x_0) lies below the graph of f.)

Twice-differentiable functions. If f : R^d → R is twice-differentiable, then f is convex if and only if

∇²f(x) ⪰ 0

for all x ∈ R^d (i.e., the Hessian, or matrix of second derivatives, is positive semi-definite for all x).
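A small numpy sketch (arbitrary sizes; not from the lecture) checking both characterizations for the convex quadratic f(x) = x^T A x with A positive semidefinite:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = B.T @ B                      # symmetric positive semidefinite, so f is convex

f      = lambda x: x @ A @ x
grad_f = lambda x: 2 * A @ x     # gradient of x^T A x (A symmetric)
hess_f = 2 * A                   # constant Hessian

print(np.linalg.eigvalsh(hess_f).min())        # >= 0: Hessian PSD everywhere

# first-order condition: f lies above every tangent plane
x0, x = rng.normal(size=4), rng.normal(size=4)
print(f(x) >= f(x0) + grad_f(x0) @ (x - x0))   # True
```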


(Lec9-10.) Convex optimization problems

Standard form of a convex optimization problem:

min_{x ∈ R^d} f_0(x)
s.t. f_i(x) ≤ 0 for all i = 1, . . . , n,

where f_0, f_1, . . . , f_n : R^d → R are convex functions.

Fact: the feasible set

A := { x ∈ R^d : f_i(x) ≤ 0 for all i = 1, 2, . . . , n }

is a convex set.

(SVMs next week will give us an example.)
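Concretely (this is the SVM example the slide alludes to, which also appears below): the hard-margin problem min_w (1/2)‖w‖₂² subject to y x^T w ≥ 1 for all (x, y) ∈ S is in standard form with objective f_0(w) = (1/2)‖w‖₂² and one convex (affine) constraint f_{(x,y)}(w) = 1 − y x^T w ≤ 0 per training example.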


(Lec9-10.) Subgradients: Jensen's inequality

If f : R^d → R is convex, then E f(X) ≥ f(E X).

Proof. Set y := E X, and pick any s ∈ ∂f(E X). Then

E f(X) ≥ E( f(y) + s^T (X − y) ) = f(y) + s^T E(X − y) = f(y).

Note. This inequality comes up often!
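A tiny Monte Carlo illustration (the distribution and the convex f are chosen arbitrarily; this is an estimate, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.linalg.norm(x)              # a norm, hence convex
X = rng.normal(size=(100_000, 3)) + 0.5      # samples of a random vector X with E X = (0.5, 0.5, 0.5)

print(np.mean([f(x) for x in X]))            # Monte Carlo estimate of E f(X)
print(f(X.mean(axis=0)))                     # f(E X): smaller, as Jensen's inequality predicts
```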

(Lec11-12.) Maximum margin linear classifier

The solution w to the following mathematical optimization problem:

min_{w ∈ R^d} (1/2) ‖w‖₂²
s.t. y x^T w ≥ 1 for all (x, y) ∈ S

gives the linear classifier with the largest minimum margin on S—i.e., the maximum margin linear classifier or support vector machine (SVM) classifier.

This is a convex optimization problem; can be solved in polynomial time.

If there is a solution (i.e., S is linearly separable), then the solution is unique.

We can solve this in a variety of ways (e.g., projected gradient descent); we will work with the dual.

Note: Can also explicitly include affine expansion, so the decision boundary need not pass through the origin. We'll do our derivations without it.


(Lec11-12.) SVM Duality summary

Lagrangian:

L(w, α) = (1/2) ‖w‖₂² + Σ_{i=1}^n α_i (1 − y_i x_i^T w).

Primal maximum margin problem was

P(w) = max_{α ≥ 0} L(w, α) = max_{α ≥ 0} [ (1/2) ‖w‖₂² + Σ_{i=1}^n α_i (1 − y_i x_i^T w) ].

Dual problem:

D(α) = min_{w ∈ R^d} L(w, α) = L( Σ_{i=1}^n α_i y_i x_i, α )
     = Σ_{i=1}^n α_i − (1/2) ‖ Σ_{i=1}^n α_i y_i x_i ‖₂²
     = Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j x_i^T x_j.

Given dual optimum α,

▶ Corresponding primal optimum w = Σ_{i=1}^n α_i y_i x_i;
▶ Strong duality P(w) = D(α);
▶ α_i > 0 implies y_i x_i^T w = 1, and these y_i x_i are support vectors.

(Lec11-12.) Nonseparable case: bottom line

Unconstrained primal:

min_{w ∈ R^d} (1/2) ‖w‖² + C Σ_{i=1}^n [1 − y_i x_i^T w]_+.

Dual:

max_{α ∈ R^n, 0 ≤ α_i ≤ C} Σ_{i=1}^n α_i − (1/2) ‖ Σ_{i=1}^n α_i y_i x_i ‖².

Dual solution α gives primal solution w = Σ_{i=1}^n α_i y_i x_i.

Remarks.

▶ Can take C → ∞ to recover the separable case.
▶ Dual is a constrained convex quadratic (can be solved with projected gradient descent).
▶ Some presentations include a bias in the primal (x_i^T w + b); this introduces a constraint Σ_{i=1}^n y_i α_i = 0 in the dual.
▶ Some presentations replace 1/2 and C with λ/2 and 1/n, respectively.
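A minimal numpy sketch of the projected-gradient remark above: ascend D(α) = Σ_i α_i − (1/2)‖Σ_i α_i y_i x_i‖² and clip α back into the box [0, C]^n after every step. The toy data, step size, and iteration count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 2-D data: positives near (+3, +3), negatives near (-3, -3)
n = 40
X = np.vstack([rng.normal(+3, 1, size=(n // 2, 2)),
               rng.normal(-3, 1, size=(n // 2, 2))])
y = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])

C, eta, T = 1.0, 1e-3, 10_000
alpha = np.zeros(n)
for _ in range(T):
    w = (alpha * y) @ X                           # w = sum_i alpha_i y_i x_i
    grad = 1.0 - y * (X @ w)                      # dD/dalpha_i = 1 - y_i x_i^T w
    alpha = np.clip(alpha + eta * grad, 0.0, C)   # ascent step, then project onto [0, C]^n

w = (alpha * y) @ X
print("training error:", np.mean(y * (X @ w) <= 0))          # 0.0 for this separable toy data
print("number of alpha_i > 0:", int(np.sum(alpha > 1e-6)))   # only (near-)margin points survive
```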


(Lec11-12.) Looking at the dual again

SVM dual problem only depends on x_i through inner products x_i^T x_j:

max_{α_1, α_2, . . . , α_n ≥ 0} Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j x_i^T x_j.

If we use feature expansion (e.g., quadratic expansion) x ↦ φ(x), this becomes

max_{α_1, α_2, . . . , α_n ≥ 0} Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j φ(x_i)^T φ(x_j).

Solution w = Σ_{i=1}^n α_i y_i φ(x_i) is used in the following way:

x ↦ φ(x)^T w = Σ_{i=1}^n α_i y_i φ(x)^T φ(x_i).

Key insight:

▶ Training and prediction only use φ(x)^T φ(x′), never an isolated φ(x);
▶ Sometimes computing φ(x)^T φ(x′) is much easier than computing φ(x).


(Lec11-12.) Quadratic expansion

▶ φ : R^d → R^(1 + 2d + C(d,2)), where
  φ(x) = ( 1, √2 x_1, . . . , √2 x_d, x_1², . . . , x_d², √2 x_1 x_2, . . . , √2 x_1 x_d, . . . , √2 x_{d−1} x_d ).
  (Don't mind the √2's. . . )
▶ Computing φ(x)^T φ(x′) in O(d) time:
  φ(x)^T φ(x′) = (1 + x^T x′)².
▶ Much better than d² time.
▶ What if we change the exponent "2"?
▶ What if we replace the additive "1" with 0?
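A quick numpy check (arbitrary dimension) that the explicit expansion and the O(d) kernel evaluation agree:

```python
import numpy as np
from itertools import combinations

def phi(x):
    # quadratic expansion from the slide: constant, sqrt(2)*linear, squares, sqrt(2)*cross terms
    d = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(d), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=6), rng.normal(size=6)

print(phi(x) @ phi(xp))     # explicit inner product over O(d^2) features
print((1 + x @ xp) ** 2)    # kernel form, O(d) time; same value
```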


(Lec11-12; RBF kernel.) Infinite dimensional feature expansion

For any σ > 0, there is an infinite feature expansion φ : R^d → R^∞ such that

φ(x)^T φ(x′) = exp( −‖x − x′‖₂² / (2σ²) ),

which can be computed in O(d) time.

(This is called the Gaussian kernel with bandwidth σ.)

(Lec11-12.) Kernels

A (positive definite) kernel function K : X × X → R is a symmetric function satisfying:

  For any x_1, x_2, . . . , x_n ∈ X, the n×n matrix whose (i, j)-th entry is K(x_i, x_j) is positive semidefinite.

(This matrix is called the Gram matrix.)

For any kernel K, there is a feature mapping φ : X → H such that

φ(x)^T φ(x′) = K(x, x′).

H is a Hilbert space—i.e., a special kind of inner product space—called the Reproducing Kernel Hilbert Space corresponding to K.
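A small numpy sketch (arbitrary points and bandwidth) forming the Gram matrix of the Gaussian kernel from the previous slide and confirming it is symmetric positive semidefinite:

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    # K(x, x') = exp(-||x - x'||_2^2 / (2 sigma^2))
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))      # 20 points in R^3

G = np.array([[gaussian_kernel(a, b) for b in X] for a in X])   # Gram matrix
print(np.allclose(G, G.T))                 # symmetric
print(np.linalg.eigvalsh(G).min())         # smallest eigenvalue >= 0 (up to rounding): PSD
```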


(Lec11-12.) Kernel SVMs (Boser, Guyon, and Vapnik, 1992)

Solve

max_{α_1, α_2, . . . , α_n ≥ 0} Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j K(x_i, x_j).

Solution w = Σ_{i=1}^n α_i y_i φ(x_i) is used in the following way:

x ↦ φ(x)^T w = Σ_{i=1}^n α_i y_i K(x, x_i).

▶ To represent the classifier, need to keep the support vector examples (x_i, y_i) and the corresponding α_i's.
▶ To compute a prediction on x, iterate through the support vector examples and compute K(x, x_i) for each support vector x_i . . .

Very similar to a nearest neighbor classifier: the predictor is represented using (a subset of) the training data.
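A minimal sketch of the prediction rule above; the fitted α's and support vectors here are made-up placeholder values, just to show the calling convention.

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def kernel_svm_predict(x, support_X, support_y, support_alpha, K=gaussian_kernel):
    # x -> sign( sum_i alpha_i y_i K(x, x_i) ), summing only over support vectors
    score = sum(a * yi * K(x, xi)
                for a, yi, xi in zip(support_alpha, support_y, support_X))
    return np.sign(score)

# hypothetical fitted values (in practice these come from solving the dual above)
support_X = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
support_y = [+1.0, -1.0]
support_alpha = [0.7, 0.7]

print(kernel_svm_predict(np.array([0.9, 1.2]), support_X, support_y, support_alpha))   # 1.0
```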


(Lec11-12.) Nonlinear support vector machines (again)

(Figures: decision-surface contour plots on [−1, 1]² for four predictors: a ReLU network, a quadratic SVM, an RBF SVM with σ = 1, and an RBF SVM with σ = 0.1.)

(Lec13.) The Perceptron Algorithm

Perceptron update (Rosenblatt '58): initialize w := 0, and thereafter

w ← w + 1[y w^T x ≤ 0] y x.

Remarks.

▶ Can interpret the algorithm as: either we are correct with a margin (y w^T x > 0) and we do nothing, or we are not and we update w ← w + y x.
▶ Therefore: if we update, we do so by rotating towards y x.
▶ This makes sense: (w + y x)^T (y x) = w^T (y x) + ‖x‖²; i.e., we increase w^T (y x).

Not obvious that Perceptron will eventually terminate! (We'll return to this.)


(Figures: two update scenarios, each with y_t = 1 and x_t^T w_t ≤ 0.)

Scenario 1: the current vector w_t is comparable to x_t in length; the updated vector w_{t+1} now correctly classifies (x_t, y_t).

Scenario 2: the current vector w_t is much longer than x_t; the updated vector w_{t+1} still does not correctly classify (x_t, y_t).
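A minimal numpy implementation of the update above on a separable toy dataset (the data, pass limit, and seed are arbitrary):

```python
import numpy as np

def perceptron(X, y, max_passes=100):
    # Rosenblatt's update: w <- w + 1[y w^T x <= 0] y x, sweeping over the data
    w = np.zeros(X.shape[1])
    for _ in range(max_passes):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:      # wrong (or zero margin): rotate towards yi * xi
                w = w + yi * xi
                mistakes += 1
        if mistakes == 0:               # a full pass with no mistakes: done
            return w
    return w

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+3, 1, size=(20, 2)), rng.normal(-3, 1, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

w = perceptron(X, y)
print("remaining mistakes:", int(np.sum(y * (X @ w) <= 0)))   # 0 once it has terminated
```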


2. Homework review

Nice homework problems

▶ HW1-P3: SVD practice.
▶ HW2-P1: SVD practice; definition of ‖A‖_2 and ‖A‖_F.
▶ HW3-P1: convexity practice.
▶ HW3-P2: deep network and probability practice.
▶ HW3-P4, HW3-P5: kernel practice.


3. Stuff added 3/12/2019

Belaboring Cauchy-Schwarz

Theorem (Cauchy-Schwarz). |a^T b| ≤ ‖a‖ · ‖b‖.

Proof (another one. . . ). If either a or b is zero, then a^T b = 0 = ‖a‖ · ‖b‖. If both a and b are nonzero, note for any r > 0 that

0 ≤ ‖ r a − b/r ‖² = r² ‖a‖² − 2 a^T b + ‖b‖² / r²,

which can be rearranged into

a^T b ≤ r² ‖a‖² / 2 + ‖b‖² / (2 r²).

Choosing r = √( ‖b‖ / ‖a‖ ),

a^T b ≤ (‖a‖² / 2) (‖b‖ / ‖a‖) + (‖b‖² / 2) (‖a‖ / ‖b‖) = ‖a‖ · ‖b‖.

Lastly, applying the bound to (a, −b) gives

−a^T b = a^T (−b) ≤ ‖a‖ · ‖−b‖ = ‖a‖ · ‖b‖,

so together it follows that |a^T b| ≤ ‖a‖ · ‖b‖. □
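A tiny numpy spot-check (random vectors, arbitrary dimension) of the final bound and of the intermediate inequality with the proof's choice of r:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)
na, nb = np.linalg.norm(a), np.linalg.norm(b)

r = np.sqrt(nb / na)                                     # the choice of r from the proof
print(a @ b <= r**2 * na**2 / 2 + nb**2 / (2 * r**2))    # True: the rearranged inequality
print(abs(a @ b) <= na * nb)                             # True: Cauchy-Schwarz
```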


SVD again

1. SV triples: (s, u, v) satisfies Mv = su and M^T u = sv.
2. Thin decomposition SVD: M = Σ_{i=1}^r s_i u_i v_i^T.
3. Full factorization SVD: M = U S V^T.
4. "Operational" view of SVD: for M ∈ R^{n×d},

   M = [ u_1 · · · u_r  u_{r+1} · · · u_n ] · [ diag(s_1, . . . , s_r)  0 ; 0  0 ] · [ v_1 · · · v_r  v_{r+1} · · · v_d ]^T.

   First part of U, V span the col / row space (respectively), second part the left / right nullspaces (respectively).

Personally: I internalize SVD and use it to reason about matrices. E.g., "rank-nullity theorem".
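A small numpy sketch (arbitrary sizes) illustrating the three views and the column-/null-space statement above:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 3)) @ rng.normal(size=(3, 4))    # a 5x4 matrix with rank <= 3

U, s, Vt = np.linalg.svd(M, full_matrices=True)          # full factorization M = U S V^T
r = int(np.sum(s > 1e-10))                               # numerical rank
print("rank:", r)                                        # 3

# thin SVD: M = sum_{i=1}^r s_i u_i v_i^T
M_thin = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(r))
print(np.allclose(M, M_thin))                            # True

# SV triples: M v_1 = s_1 u_1 and M^T u_1 = s_1 v_1
print(np.allclose(M @ Vt[0], s[0] * U[:, 0]), np.allclose(M.T @ U[:, 0], s[0] * Vt[0]))

# trailing v's span the (right) nullspace, trailing u's span the left nullspace
print(np.allclose(M @ Vt[r:].T, 0), np.allclose(M.T @ U[:, r:], 0))
```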
