
Support vector machines

CS 446 / ECE 449

2021-02-16 14:33:55 -0600 (d94860f)

Another algorithm for linear prediction. Why?!



“pytorch meta-algorithm”.

...

5. Tweak 1-4 until training error is small.

6. Tweak 1-5, possibly reducing model complexity, until testing error is small.

Recall: step 5 is easy, step 6 is complicated.
A nice way to reduce complexity: a favorable inductive bias.

- Support vector machines (SVMs) popularized a very influential inductive bias: maximum margin predictors.
  This concept has consequences beyond linear predictors and beyond supervised learning.

- Nonlinear SVMs, via kernels, are also highly influential.
  E.g., we will revisit them in the deep network lectures.

Plan for SVM

- Hard-margin SVM.

- Soft-margin SVM.

- SVM duality.

- Nonlinear SVM: kernels.


Linearly separable data

Linearly separable data means there exists u ∈ R^d which perfectly classifies all training points:

    min_i y_i x_i^T u > 0.

We have many ways to solve for u:

- Logistic regression.

- Perceptron (a small sketch follows below).

- Convex programming (a convex function subject to a convex set constraint).
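As a small illustration of the perceptron option above, here is a hedged sketch on synthetic separable data; the data-generating direction, the margin filter, and the iteration cap are assumptions made for this example, not lecture material.

```python
import numpy as np

rng = np.random.default_rng(0)
u_true = np.array([1.0, -0.5])
X = rng.normal(size=(500, 2))
# keep only points with a clear margin so the perceptron provably converges
X = X[np.abs(X @ u_true) / np.linalg.norm(u_true) > 0.3][:100]
y = np.sign(X @ u_true)                  # linearly separable by construction

u = np.zeros(2)
for _ in range(1000):                    # perceptron: fix any mistake you find
    mistakes = np.flatnonzero(y * (X @ u) <= 0)
    if mistakes.size == 0:
        break                            # min_i y_i x_i^T u > 0: done
    i = mistakes[0]
    u += y[i] * X[i]

print("minimum margin:", np.min(y * (X @ u)))   # positive once separated
```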


Maximum margin solution

[Figure: the best linear classifier on the population, an arbitrary linear separator on the training data S, and the maximum margin solution on the training data S.]

Why use the maximum margin solution?

(i) It is uniquely determined by S, even if there are many perfect linear classifiers.

(ii) It is a particular inductive bias, i.e., an assumption about the problem, that seems to be commonly useful.

- We've seen inductive bias before: least squares and logistic regression choose different predictors.

- This particular bias (margin maximization) is common in machine learning and has many nice properties.

Key insight: we can express this as another convex program.


Distance to decision boundary

Suppose w ∈ R^d satisfies min_{(x,y)∈S} y x^T w > 0.

- “Maximum margin” shouldn't care about scaling; w and 10w should be equally good.

- Thus for each direction w/‖w‖, we can fix a scaling.

Let (x, y) be any example in S that achieves the minimum min_i y_i x_i^T w.

[Figure: hyperplane H with normal vector w; the point yx and its projection onto w.]

- Rescale w so that y x^T w = 1. (Now the scaling is fixed.)

- The distance from yx to H is y x^T w / ‖w‖ = 1/‖w‖. This is the (normalized minimum) margin.

- This gives the optimization problem

      max 1/‖w‖   subj. to   min_{(x,y)∈S} y x^T w = 1.

  Refinements: (a) we can make the constraint ∀i, y_i x_i^T w ≥ 1; (b) we can replace max 1/‖w‖ with min ‖w‖².
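The rescaling step can be made concrete in a few lines; the data and the initial w below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = np.sign(X @ np.array([2.0, 1.0]))
w = np.array([2.0, 1.0])                      # any w with positive margins works

m = np.min(y * (X @ w))                       # minimum (unnormalized) margin
w_scaled = w / m                              # rescale so min_i y_i x_i^T w = 1
print(np.min(y * (X @ w_scaled)))             # exactly 1
print(1.0 / np.linalg.norm(w_scaled))         # the normalized minimum margin
```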


Maximum margin linear classifier

The solution w to the following mathematical optimization problem:

    min_{w ∈ R^d} (1/2)‖w‖₂²   s.t.   y x^T w ≥ 1 for all (x, y) ∈ S

gives the linear classifier with the largest minimum margin on S, i.e., the maximum margin linear classifier or support vector machine (SVM) classifier.

- This is a convex optimization problem: minimization of a convex function, subject to a convex constraint.

- There are many solvers for this problem; it is an area of active research. You will code up an easy one in homework 2! (A sketch using an off-the-shelf solver follows below.)

- The feasible set is nonempty when S is linearly separable; in this case, the solution is unique.

- Note: some presentations explicitly include and encode the appended “1” feature on x; we will not.
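Here is a hedged sketch of handing this quadratic program to an off-the-shelf solver (cvxpy) on synthetic separable data; this is only an illustration, not the homework-2 solver.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
u = np.array([1.0, 0.5])
X = rng.normal(size=(200, 2))
X = X[np.abs(X @ u) / np.linalg.norm(u) > 0.2][:40]   # keep only points with a clear margin
y = np.sign(X @ u)

w = cp.Variable(X.shape[1])
objective = cp.Minimize(0.5 * cp.sum_squares(w))      # (1/2)||w||^2
constraints = [cp.multiply(y, X @ w) >= 1]            # y_i x_i^T w >= 1 for all i
cp.Problem(objective, constraints).solve()

w_hat = w.value
print("minimum margin (should be ~1):", np.min(y * (X @ w_hat)))
print("margin width 1/||w||:", 1.0 / np.linalg.norm(w_hat))
```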

Soft-margin SVM

The convex program above makes no sense if the data is not linearly separable. It is sometimes called the hard-margin SVM. Now we develop a soft-margin SVM for non-separable data.


Soft-margin SVMs (Cortes and Vapnik, 1995)

Start from the hard-margin program:

    min_{w ∈ R^d} (1/2)‖w‖₂²   s.t.   y_i x_i^T w ≥ 1 for all i = 1, 2, ..., n.

Suppose it is infeasible (no linear separators).

Introduce slack variables ξ_1, ..., ξ_n ≥ 0, and a trade-off parameter C > 0:

    min_{w ∈ R^d, ξ_1,...,ξ_n ∈ R} (1/2)‖w‖₂² + C ∑_{i=1}^n ξ_i
    s.t.  y_i x_i^T w ≥ 1 − ξ_i for all i = 1, 2, ..., n,
          ξ_i ≥ 0 for all i = 1, 2, ..., n,

which is always feasible. This is called the soft-margin SVM.

(Slack variables are auxiliary variables; they are not needed to form the linear classifier.)
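For completeness, the displayed program with explicit slack variables can also be handed directly to a solver; this is a hedged sketch on made-up non-separable data (C = 1 is an arbitrary choice), not a recommended pipeline.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
n, C = 60, 1.0
X = rng.normal(size=(n, 2))
y = np.sign(X @ np.array([1.0, 0.5]))
y[rng.random(n) < 0.1] *= -1                      # flip ~10% of labels: not separable

w, xi = cp.Variable(2), cp.Variable(n)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print("number of violated margins (xi_i > 0):", int(np.sum(xi.value > 1e-6)))
print("training error:", np.mean(np.sign(X @ w.value) != y))
```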

Interpretation of slack variables

    min_{w ∈ R^d, ξ_1,...,ξ_n ∈ R} (1/2)‖w‖₂² + C ∑_{i=1}^n ξ_i
    s.t.  y_i x_i^T w ≥ 1 − ξ_i for all i = 1, 2, ..., n,
          ξ_i ≥ 0 for all i = 1, 2, ..., n.

[Figure: hyperplane H with a margin-violating example.]

For a given w, ξ_i/‖w‖₂ is the distance that x_i would have to move to satisfy y_i x_i^T w ≥ 1.


Another interpretation of slack variables

Constraints with non-negative slack variables:

    min_{w ∈ R^d, ξ_1,...,ξ_n ∈ R} (1/2)‖w‖₂² + C ∑_{i=1}^n ξ_i
    s.t.  y_i x_i^T w ≥ 1 − ξ_i for all i = 1, 2, ..., n,
          ξ_i ≥ 0 for all i = 1, 2, ..., n.

Equivalent unconstrained form: given w, the optimal ξ_i is max{0, 1 − y_i x_i^T w}, thus

    min_{w ∈ R^d} (1/2)‖w‖₂² + C ∑_{i=1}^n [1 − y_i x_i^T w]₊.

Notation: [a]₊ := max{0, a} (ReLU!).

[1 − y x^T w]₊ is the hinge loss of w on example (x, y).

Unconstrained soft-margin SVM

This form is a regularized ERM:

    min_{w ∈ R^d} (1/2)‖w‖₂² + C ∑_{i=1}^n ℓ_hinge(y_i x_i^T w),   where ℓ_hinge(z) = max{0, 1 − z}.

- We can equivalently substitute C = 1 and write (λ/2)‖w‖² for the regularization term.

- C is a hyper-parameter, with no good search procedure.
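A minimal sketch of minimizing this regularized ERM objective directly with plain subgradient descent; the step size, iteration count, and synthetic data are arbitrary choices for illustration, and this is not claimed to be the method used in homework 2.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, C, step = 100, 5, 1.0, 1e-3
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d))
y[rng.random(n) < 0.05] *= -1                     # a few noisy labels

w = np.zeros(d)
for _ in range(2000):
    margins = y * (X @ w)
    active = margins < 1                          # examples with nonzero hinge loss
    # subgradient of (1/2)||w||^2 + C * sum_i [1 - y_i x_i^T w]_+
    grad = w - C * (y[active, None] * X[active]).sum(axis=0)
    w -= step * grad

obj = 0.5 * np.dot(w, w) + C * np.maximum(0.0, 1 - y * (X @ w)).sum()
print("final objective:", obj)
print("training error:", np.mean(np.sign(X @ w) != y))
```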


SVM duality

A convex program is an optimization problem (minimization or maximization) where a convex objective is minimized over a convex constraint (feasible) set.

Every convex program has a corresponding dual program. There is a rich theory about this correspondence. For the SVM, the dual has many nice properties:

- Clarifies the role of support vectors.

- Leads to a nice nonlinear approach via kernels.

- Gives another choice for optimization algorithms. (Used in homework 2!)


SVM hard-margin duality. Define the two optimization problems

    min { (1/2)‖w‖² : w ∈ R^d, ∀i, 1 − y_i x_i^T w ≤ 0 }   (primal),

    max { ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n α_i α_j y_i y_j x_i^T x_j : α ∈ R^n, α ≥ 0 }   (dual).

If the primal is feasible, then primal optimal value = dual optimal value. Given a primal optimum w and a dual optimum α, then

    w = ∑_{i=1}^n α_i y_i x_i.

- The dual variables α have dimension n, the same as the number of examples.

- We can write the primal optimum as a linear combination of examples.

- The dual objective is a concave quadratic.

- We will derive this duality using Lagrange multipliers.
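A hedged numerical check of this duality statement on synthetic separable data: solve both programs with cvxpy and compare the optimal values and the reconstruction w = ∑_i α_i y_i x_i. The data generation is an assumption made for the example.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
u = np.array([1.0, -0.5, 0.25])
X = rng.normal(size=(200, 3))
X = X[np.abs(X @ u) / np.linalg.norm(u) > 0.3][:30]   # separable with a clear margin
y = np.sign(X @ u)
n, d = X.shape

# Primal: min (1/2)||w||^2  s.t.  y_i x_i^T w >= 1.
w = cp.Variable(d)
primal = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                    [cp.multiply(y, X @ w) >= 1])
primal.solve()

# Dual: max sum(alpha) - (1/2)||sum_i alpha_i y_i x_i||^2  s.t.  alpha >= 0.
alpha = cp.Variable(n)
dual = cp.Problem(cp.Maximize(cp.sum(alpha)
                              - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y))),
                  [alpha >= 0])
dual.solve()

w_from_dual = X.T @ (alpha.value * y)
print("primal value:", primal.value, " dual value:", dual.value)    # should match
print("||w_primal - w_from_dual||:", np.linalg.norm(w.value - w_from_dual))
```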


Lagrange multipliers

Move constraints to the objective using Lagrange multipliers.

Original problem:

    min_{w ∈ R^d} (1/2)‖w‖₂²   s.t.   1 − y_i x_i^T w ≤ 0 for all i = 1, ..., n.

- For each constraint 1 − y_i x_i^T w ≤ 0, associate a dual variable (Lagrange multiplier) α_i ≥ 0.

- Move the constraints to the objective by adding ∑_{i=1}^n α_i (1 − y_i x_i^T w) and maximizing over α = (α_1, ..., α_n) ∈ R^n s.t. α ≥ 0 (i.e., α_i ≥ 0 for all i).

Lagrangian L(w, α):

    L(w, α) := (1/2)‖w‖₂² + ∑_{i=1}^n α_i (1 − y_i x_i^T w).

Maximizing over α ≥ 0 recovers the primal problem: for any w ∈ R^d,

    P(w) := sup_{α ≥ 0} L(w, α) = (1/2)‖w‖₂² if min_i y_i x_i^T w ≥ 1, and ∞ otherwise.

What if we leave α fixed, and minimize over w?


Dual problem

Lagrangian:

    L(w, α) := (1/2)‖w‖₂² + ∑_{i=1}^n α_i (1 − y_i x_i^T w).

Primal hard-margin SVM:

    P(w) = sup_{α ≥ 0} L(w, α) = sup_{α ≥ 0} [ (1/2)‖w‖₂² + ∑_{i=1}^n α_i (1 − y_i x_i^T w) ].

Dual problem D(α) = min_w L(w, α): given α ≥ 0, the map w ↦ L(w, α) is a convex quadratic with minimum at w = ∑_{i=1}^n α_i y_i x_i, giving

    D(α) = min_{w ∈ R^d} L(w, α) = L( ∑_{i=1}^n α_i y_i x_i, α )
         = ∑_{i=1}^n α_i − (1/2) ‖ ∑_{i=1}^n α_i y_i x_i ‖₂²
         = ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n α_i α_j y_i y_j x_i^T x_j.


Summarizing,

    L(w, α) = (1/2)‖w‖₂² + ∑_{i=1}^n α_i (1 − y_i x_i^T w)   (Lagrangian),

    P(w) = max_{α ≥ 0} L(w, α)   (primal problem),

    D(α) = min_w L(w, α)   (dual problem).

- For general Lagrangians, we have weak duality

      P(w) ≥ D(α),

  since P(w) = max_{α′ ≥ 0} L(w, α′) ≥ L(w, α) ≥ min_{w′} L(w′, α) = D(α).

- By convexity, we have strong duality min_w P(w) = max_{α ≥ 0} D(α), and an optimum α for D gives an optimum w for P via

      w = ∑_{i=1}^n α_i y_i x_i = arg min_{w′} L(w′, α).


Support vectors

Optimal solutions w and α = (α_1, ..., α_n) satisfy

- w = ∑_{i=1}^n α_i y_i x_i = ∑_{i : α_i > 0} α_i y_i x_i,

- α_i > 0 ⇒ y_i x_i^T w = 1 for all i = 1, ..., n (complementary slackness).

The y_i x_i where α_i > 0 are called support vectors.

[Figure: hyperplane H with the support vectors lying on the margin boundary.]

- Support vector examples satisfy the “margin” constraints with equality.

- We get the same solution if the non-support vectors are deleted (checked numerically below).

- The primal optimum is a linear combination of the support vectors.
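A small self-contained check of this picture; it is an illustration only, and it assumes cvxpy's convention of exposing the nonnegative multipliers of a "≥" constraint via dual_value, plus an arbitrary 1e-6 threshold for calling a multiplier nonzero.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
u = np.array([1.0, 1.0])
X = rng.normal(size=(200, 2))
X = X[np.abs(X @ u) / np.linalg.norm(u) > 0.3][:40]   # well-separated synthetic data
y = np.sign(X @ u)

def hard_margin(Xs, ys):
    w = cp.Variable(Xs.shape[1])
    cons = [cp.multiply(ys, Xs @ w) >= 1]
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), cons).solve()
    return w.value, cons[0].dual_value       # dual_value: the multipliers alpha_i

w_all, alpha = hard_margin(X, y)
sv = alpha > 1e-6
print("support vector indices:", np.flatnonzero(sv))
print("their margins y_i x_i^T w:", (y * (X @ w_all))[sv])    # ~= 1

w_sv, _ = hard_margin(X[sv], y[sv])          # re-solve on support vectors only
print("||w_all - w_sv||:", np.linalg.norm(w_all - w_sv))      # ~= 0
```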


Proof of complementary slackness

For the optimal (feasible) solutions w and α, we have

    P(w) = D(α) = min_{w′ ∈ R^d} L(w′, α)   (by strong duality)
         ≤ L(w, α)
         = (1/2)‖w‖₂² + ∑_{i=1}^n α_i (1 − y_i x_i^T w)
         ≤ (1/2)‖w‖₂²   (constraints are satisfied)
         = P(w).

Therefore the sum ∑_{i=1}^n α_i (1 − y_i x_i^T w) is zero; since each term is nonpositive (α_i ≥ 0 and the constraint 1 − y_i x_i^T w ≤ 0 holds), every term must be zero:

    α_i (1 − y_i x_i^T w) = 0 for all i = 1, ..., n.

If α_i > 0, then we must have 1 − y_i x_i^T w = 0. (Not iff!)

SVM (hard-margin) duality summary

Lagrangian:

    L(w, α) = (1/2)‖w‖₂² + ∑_{i=1}^n α_i (1 − y_i x_i^T w).

The primal maximum margin problem was

    P(w) = sup_{α ≥ 0} L(w, α) = sup_{α ≥ 0} [ (1/2)‖w‖₂² + ∑_{i=1}^n α_i (1 − y_i x_i^T w) ].

Dual problem:

    D(α) = min_{w ∈ R^d} L(w, α) = ∑_{i=1}^n α_i − (1/2) ‖ ∑_{i=1}^n α_i y_i x_i ‖₂².

Given a dual optimum α,

- the corresponding primal optimum is w = ∑_{i=1}^n α_i y_i x_i;

- strong duality holds: P(w) = D(α);

- α_i > 0 implies y_i x_i^T w = 1, and these y_i x_i are support vectors.


SVM soft-margin dual

Similarly,

    L(w, ξ, α) = (1/2)‖w‖₂² + C ∑_{i=1}^n ξ_i + ∑_{i=1}^n α_i (1 − ξ_i − y_i x_i^T w)   (Lagrangian),

    P(w, ξ) = sup_{α ≥ 0} L(w, ξ, α)   (primal)
            = (1/2)‖w‖₂² + C ∑_{i=1}^n ξ_i  if ∀i, 1 − ξ_i − y_i x_i^T w ≤ 0, and ∞ otherwise,

    D(α) = min_{w ∈ R^d, ξ ∈ R^n, ξ ≥ 0} L(w, ξ, α)   (dual),

and the dual problem is

    max_{α ∈ R^n, 0 ≤ α_i ≤ C} ∑_{i=1}^n α_i − (1/2) ‖ ∑_{i=1}^n α_i y_i x_i ‖².

Remarks.

- A dual solution α still gives the primal solution w = ∑_{i=1}^n α_i y_i x_i.

- We can take C → ∞ to recover the hard-margin case.

- The dual is still a constrained concave quadratic (used in many solvers; a projected-ascent sketch follows below).

- Some presentations include a bias in the primal (x_i^T w + b); this introduces a constraint ∑_{i=1}^n y_i α_i = 0 in the dual.
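A minimal sketch of one such solver choice, projected gradient ascent on the box-constrained dual above; the step size, iteration count, C, and synthetic data are ad hoc choices for illustration, and this is not necessarily the homework-2 algorithm.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, C = 80, 3, 1.0
X = rng.normal(size=(n, d))
y = np.sign(X @ np.array([1.0, -1.0, 0.5]))
y[rng.random(n) < 0.1] *= -1                       # make the data non-separable

Q = (y[:, None] * X) @ (y[:, None] * X).T          # Q_ij = y_i y_j x_i^T x_j
eta = 1.0 / (np.linalg.eigvalsh(Q).max() + 1e-9)   # conservative step size

alpha = np.zeros(n)
for _ in range(5000):
    # gradient of sum(alpha) - (1/2) alpha^T Q alpha is 1 - Q alpha;
    # step, then project back onto the box [0, C]^n
    alpha = np.clip(alpha + eta * (1.0 - Q @ alpha), 0.0, C)

w = X.T @ (alpha * y)                              # primal solution from the dual
print("dual objective:", alpha.sum() - 0.5 * w @ w)
print("training error:", np.mean(np.sign(X @ w) != y))
```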


Nonlinear SVM: feature mapping annoying in the primal?

SVM hard-margin primal, with a feature mapping φ : R^d → R^p:

    min { (1/2)‖w‖² : w ∈ R^p, ∀i, y_i φ(x_i)^T w ≥ 1 }.

Now the search space has p dimensions; potentially p ≫ d.

Can we do better?


Feature mapping in the dual

SVM hard-margin dual, with a feature mapping φ : R^d → R^p:

    max_{α_1, α_2, ..., α_n ≥ 0} ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n α_i α_j y_i y_j φ(x_i)^T φ(x_j).

Given a dual optimum α, since w = ∑_{i=1}^n α_i y_i φ(x_i), we can predict on a future x with

    x ↦ φ(x)^T w = ∑_{i=1}^n α_i y_i φ(x)^T φ(x_i).

- The dual form never needs φ(x) ∈ R^p, only φ(x)^T φ(x_i) ∈ R.

- Kernel trick: replace every φ(x)^T φ(x′) with a kernel evaluation k(x, x′). Sometimes k(·, ·) is much cheaper than φ(x)^T φ(x′).

- This idea started with SVMs, but appears in many other places.

- Downside: implementations usually store the Gram matrix G ∈ R^{n×n} where G_ij := k(x_i, x_j).
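A hedged sketch of a kernelized (nonlinear) SVM built from these pieces: the same projected-ascent loop as before, but with every inner product replaced by an RBF kernel evaluation, so only the Gram matrix is ever formed and prediction uses ∑_i α_i y_i k(x_i, x). The circular toy data, σ, C, and iteration budget are assumptions for the example.

```python
import numpy as np

sigma, C = 0.5, 10.0

def rbf(A, B):
    # pairwise k(a, b) = exp(-||a - b||^2 / (2 sigma^2))
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(6)
n = 200
X = rng.uniform(-1, 1, size=(n, 2))
y = np.where(np.linalg.norm(X, axis=1) < 0.6, 1.0, -1.0)   # not linearly separable

G = rbf(X, X)                                   # Gram matrix G_ij = k(x_i, x_j)
Q = (y[:, None] * y[None, :]) * G               # Q_ij = y_i y_j k(x_i, x_j)
eta = 1.0 / (np.linalg.eigvalsh(Q).max() + 1e-9)

alpha = np.zeros(n)
for _ in range(5000):
    alpha = np.clip(alpha + eta * (1.0 - Q @ alpha), 0.0, C)

def predict(Xnew):
    # x -> sum_i alpha_i y_i k(x_i, x); only kernel evaluations, never phi(x)
    return np.sign(rbf(Xnew, X) @ (alpha * y))

print("training error:", np.mean(predict(X) != y))   # should be small
```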

Kernel example: affine features

Affine features: φ : R^d → R^{1+d}, where

    φ(x) = (1, x_1, ..., x_d).

Kernel form: φ(x)^T φ(x′) = 1 + x^T x′.


Kernel example: quadratic features

Consider re-normalized quadratic features φ : R^d → R^{1 + 2d + (d choose 2)}, where

    φ(x) = ( 1, √2 x_1, ..., √2 x_d, x_1², ..., x_d², √2 x_1 x_2, ..., √2 x_1 x_d, ..., √2 x_{d−1} x_d ).

Just writing this down takes time O(d²). Meanwhile,

    φ(x)^T φ(x′) = (1 + x^T x′)²,

in time O(d).

Tweaks:

- What if we change the exponent “2”?

- What if we replace the additive “1” with 0?
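A quick numerical check (illustration only) that the explicit O(d²) feature map above and the O(d) kernel evaluation agree.

```python
import numpy as np

def phi_quad(x):
    d = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return np.concatenate(([1.0], np.sqrt(2) * x, x**2, cross))

rng = np.random.default_rng(7)
x, xp = rng.normal(size=5), rng.normal(size=5)
lhs = phi_quad(x) @ phi_quad(xp)          # explicit features: O(d^2) numbers
rhs = (1.0 + x @ xp) ** 2                 # kernel evaluation: O(d) work
print(lhs, rhs, np.isclose(lhs, rhs))     # the two agree
```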

RBF kernel

For any σ > 0, there is an infinite feature expansion φ : R^d → R^∞ such that

    φ(x)^T φ(x′) = exp( −‖x − x′‖₂² / (2σ²) ),

which can be computed in O(d) time.

This is called a Gaussian kernel or RBF kernel. It has some similarities to nearest neighbor methods (later lecture).

φ maps to an infinite-dimensional space, but nothing we compute ever needs to know that.


Defining kernels without φ

A (positive definite) kernel function k : X × X → R is a symmetric function so that for any n and any data examples (x_i)_{i=1}^n, the corresponding Gram matrix G ∈ R^{n×n} with G_ij := k(x_i, x_j) is positive semi-definite.

- There is a ton of theory about this formalism; e.g., keywords RKHS, representer theorem, Mercer's theorem.

- Given any such k, there always exists a corresponding φ.

- This definition ensures the SVM dual is still concave.
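A quick empirical sanity check of this definition (illustration only): build the Gram matrix of the RBF kernel on random points and confirm it is symmetric with numerically nonnegative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(50, 4))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
G = np.exp(-sq / 2.0)                              # RBF kernel with sigma = 1

print("symmetric:", np.allclose(G, G.T))
print("smallest eigenvalue:", np.linalg.eigvalsh(G).min())   # >= 0 up to roundoff
```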

[Figure: four panels over [−1, 1]²: the source data, and decision-boundary level sets for a quadratic SVM, an RBF SVM (σ = 1), and an RBF SVM (σ = 0.1).]

Summary for SVM

- Hard-margin SVM.

- Soft-margin SVM.

- SVM duality.

- Nonlinear SVM: kernels.

(Appendix.)


Soft-margin dual derivation

Let's derive the final dual form: starting from

    L(w, ξ, α) = (1/2)‖w‖₂² + C ∑_{i=1}^n ξ_i + ∑_{i=1}^n α_i (1 − ξ_i − y_i x_i^T w),

we will show that maximizing

    D(α) = min_{w ∈ R^d, ξ ∈ R^n, ξ ≥ 0} L(w, ξ, α)

over α ≥ 0 gives the dual problem  max_{α ∈ R^n, 0 ≤ α_i ≤ C} ∑_{i=1}^n α_i − (1/2) ‖ ∑_{i=1}^n α_i y_i x_i ‖².

Given α and ξ, the minimizing w is still w = ∑_{i=1}^n α_i y_i x_i; plugging in,

    D(α) = min_{ξ ∈ R^n, ξ ≥ 0} L( ∑_{i=1}^n α_i y_i x_i, ξ, α )
         = min_{ξ ∈ R^n, ξ ≥ 0} [ ∑_{i=1}^n α_i − (1/2) ‖ ∑_{i=1}^n α_i y_i x_i ‖² + ∑_{i=1}^n ξ_i (C − α_i) ].

The goal is to maximize D; if α_i > C, then ξ_i ↑ ∞ gives D(α) = −∞. Otherwise, the minimum is attained at ξ_i = 0. Therefore the dual problem is

    max_{α ∈ R^n, 0 ≤ α_i ≤ C} ∑_{i=1}^n α_i − (1/2) ‖ ∑_{i=1}^n α_i y_i x_i ‖².


Gaussian kernel feature expansion

First consider d = 1, meaning φ : R → R^∞. What φ has φ(x)φ(y) = e^{−(x−y)²/(2σ²)}?

Reverse engineer using a Taylor expansion:

    e^{−(x−y)²/(2σ²)} = e^{−x²/(2σ²)} · e^{−y²/(2σ²)} · e^{xy/σ²}
                      = e^{−x²/(2σ²)} · e^{−y²/(2σ²)} · ∑_{k=0}^∞ (1/k!) (xy/σ²)^k.

So let

    φ(x) := e^{−x²/(2σ²)} ( 1, x/σ, (1/√(2!)) (x/σ)², (1/√(3!)) (x/σ)³, ... ).

How to handle d > 1?

    e^{−‖x−y‖²/(2σ²)} = e^{−‖x‖²/(2σ²)} · e^{−‖y‖²/(2σ²)} · e^{x^T y/σ²}
                      = e^{−‖x‖²/(2σ²)} · e^{−‖y‖²/(2σ²)} · ∑_{k=0}^∞ (1/k!) (x^T y/σ²)^k.
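A small numerical check of the d = 1 expansion (illustration; the truncation level K = 30 is an arbitrary choice): truncating φ after K coordinates already reproduces the Gaussian kernel to high accuracy.

```python
import numpy as np
from math import factorial, exp

def phi_K(x, sigma=1.0, K=30):
    # phi_K(x)_k = e^{-x^2/(2 sigma^2)} * (x/sigma)^k / sqrt(k!),  k = 0..K-1
    ks = np.arange(K)
    scale = np.sqrt([factorial(int(k)) for k in ks])
    return exp(-x**2 / (2 * sigma**2)) * (x / sigma) ** ks / scale

x, y, sigma = 0.7, -1.3, 1.0
approx = phi_K(x, sigma) @ phi_K(y, sigma)
exact = exp(-(x - y) ** 2 / (2 * sigma**2))
print(approx, exact)      # agree to many digits for K = 30
```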

Kernel example: products of all subsets of coordinates

Consider φ : R^d → R^{2^d}, where

    φ(x) = ( ∏_{i ∈ S} x_i )_{S ⊆ {1, 2, ..., d}}.

It takes time O(2^d) just to write this down. The kernel evaluation takes time O(d):

    φ(x)^T φ(x′) = ∏_{i=1}^d (1 + x_i x′_i).
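A small check (illustration only, for a modest d) that the 2^d subset-product features and the O(d) product formula give the same inner product.

```python
import numpy as np
from itertools import combinations

def phi_subsets(x):
    d = len(x)
    feats = []
    for r in range(d + 1):
        for S in combinations(range(d), r):
            feats.append(np.prod(x[list(S)]) if S else 1.0)   # empty product = 1
    return np.array(feats)                                    # length 2^d

rng = np.random.default_rng(9)
x, xp = rng.normal(size=6), rng.normal(size=6)
lhs = phi_subsets(x) @ phi_subsets(xp)     # O(2^d) features
rhs = np.prod(1.0 + x * xp)                # O(d) kernel evaluation
print(lhs, rhs, np.isclose(lhs, rhs))
```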


Other kernels

Suppose k_1 and k_2 are positive definite kernel functions.

1. Does k(x, y) := k_1(x, y) + k_2(x, y) define a positive definite kernel?

2. Does k(x, y) := c · k_1(x, y) (for c ≥ 0) define a positive definite kernel?

3. Does k(x, y) := k_1(x, y) · k_2(x, y) define a positive definite kernel?

Another approach: random features, k(x, x′) = E_w F(w, x)^T F(w, x′) for some F; we will revisit this with deep networks and the neural tangent kernel (NTK). (A small random-features sketch follows below.)
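A hedged sketch of the random-features idea with one particular choice of F, random Fourier features for the RBF kernel; the sampling distributions for w and b follow the standard Rahimi–Recht construction, and D, σ, and the test points are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(10)
d, D, sigma = 4, 20000, 1.0
W = rng.normal(scale=1.0 / sigma, size=(D, d))   # w_j ~ N(0, I / sigma^2)
b = rng.uniform(0, 2 * np.pi, size=D)            # b_j ~ Uniform[0, 2*pi]

def z(x):
    # F(w, x) = sqrt(2/D) cos(w^T x + b);  E[z(x)^T z(x')] = RBF kernel
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, xp = rng.normal(size=d), rng.normal(size=d)
approx = z(x) @ z(xp)
exact = np.exp(-np.sum((x - xp) ** 2) / (2 * sigma**2))
print(approx, exact)      # close for large D (Monte Carlo approximation)
```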

Kernel ridge regression

Kernel ridge regression:

    min_w (1/(2n)) ‖Xw − y‖² + (λ/2) ‖w‖².

Solution:

    w = (X^T X + λnI)^{−1} X^T y.

Linear algebra fact:

    (X^T X + λnI)^{−1} X^T y = X^T (XX^T + λnI)^{−1} y.

Therefore predict with

    x ↦ x^T w = (Xx)^T (XX^T + λnI)^{−1} y = ∑_{i=1}^n (x_i^T x) [ (XX^T + λnI)^{−1} y ]_i.

Kernel approach:

- Compute α := (G + λnI)^{−1} y, where G ∈ R^{n×n} is the Gram matrix: G_ij = k(x_i, x_j).

- Predict with x ↦ ∑_{i=1}^n α_i k(x_i, x). (A numerical check follows below.)
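A small numerical check (illustration only) that the primal ridge solution and the kernelized form above give identical predictions when k is the plain inner product k(x, x′) = x^T x′.

```python
import numpy as np

rng = np.random.default_rng(11)
n, d, lam = 50, 8, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

w = np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)   # primal ridge solution
G = X @ X.T                                                   # Gram matrix, linear kernel
alpha = np.linalg.solve(G + lam * n * np.eye(n), y)           # alpha = (G + lambda n I)^{-1} y

x_test = rng.normal(size=d)
print(x_test @ w, (X @ x_test) @ alpha)    # identical predictions (up to roundoff)
```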

Multiclass SVM

There are a few ways; one is to use one-against-all as in lecture.

Many researchers have proposed various multiclass SVM methods, but some of them can be shown to fail in trivial cases.

Supplemental reading

- Shalev-Shwartz/Ben-David: chapter 15.

- Murphy: chapter 14.
