Indirect Rule Learning: Support Vector Machines

Donglin Zeng, Department of Biostatistics, University of North Carolina


Indirect learning: loss optimization

- Indirect learning does not estimate the prediction rule $f(x)$ directly, since most loss functions do not have explicit optimizers.

- Indirect learning aims to directly minimize an empirical approximation of the expected loss function.

  – Most often, it minimizes the empirical risk $\sum_{i=1}^n L(Y_i, f(X_i))$ (empirical risk minimization).

- For example, least squares estimation minimizes $\sum_{i=1}^n (Y_i - f(X_i))^2$; the classification problem minimizes $\sum_{i=1}^n I(Y_i \neq f(X_i))$.
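A minimal NumPy sketch of these two empirical risks, evaluated for a fixed candidate linear rule on hypothetical data:

```python
import numpy as np

# Hypothetical data and a fixed candidate linear rule f(x) = x @ beta
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta = np.array([1.0, -0.5, 0.2])
Y_cont = X @ beta + rng.normal(scale=0.5, size=100)   # continuous outcome for regression
Y_cls = np.sign(X @ beta + rng.normal(size=100))      # binary outcome coded as -1/+1

f = X @ beta
least_squares_risk = np.sum((Y_cont - f) ** 2)        # sum_i (Y_i - f(X_i))^2
zero_one_risk = np.sum(np.sign(f) != Y_cls)           # sum_i I(Y_i != f(X_i)), using sign(f) as the classifier
```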


Potential challenges

- What is a good approximation of the expected loss function? Empirical risk is most commonly used, but there are other alternatives.

- What is the choice of candidate $f$ for optimization?
- How do we avoid overfitting?
- Will computation be feasible?
  – finding a global minimizer
  – computational complexity


Least squares estimation

- The empirical risk is $\sum_{i=1}^n (Y_i - f(X_i))^2$.

- $f(x)$ can be from
  – a class of linear functions;
  – a sieve space of basis functions (splines, wavelets, radial basis);
  – or fully nonparametric (kernel estimation).

- Overfitting can be addressed using regularization:
  – variable selection for linear models;
  – penalized splines, shrinkage for sieve approximation;
  – cross-validation for tuning parameter selection.

- Computation:
  – convex optimization
  – coordinate descent optimization for large $p$
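As one illustration of shrinkage, a minimal sketch of ridge regression, which has a closed-form penalized least squares solution; the penalty level would typically be chosen by cross-validation:

```python
import numpy as np

def ridge(X, Y, lam):
    """Shrinkage estimator minimizing sum_i (Y_i - X_i^T beta)^2 + lam * ||beta||^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# lam is an illustrative tuning parameter; in practice it is selected by cross-validation.
```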


Support Vector Machines

- Consider the binary classification problem and use labels $\{-1, 1\}$ for the two classes.

- We start from a simple classification rule that is a linear function of the feature variables $X$.

- The idea of SVM is to identify a hyperplane in the feature space that separates the classes as much as possible.


SVM illustration


Mathematical formulation of SVM

- The goal is to find a hyperplane $\beta_0 + X^T\beta$ such that $Y_i(\beta_0 + X_i^T\beta) > 0$ for all $i = 1, \dots, n$.
- Furthermore, we wish to maximize the margin, denoted $M$.
- That is, we solve
  $$\max_{\|\beta\|=1} M \quad \text{subject to } Y_i(\beta_0 + X_i^T\beta) \ge M, \; i = 1, \dots, n.$$


Equivalent optimization

- It is equivalent to
  $$\min_{\beta_0, \beta} \frac{1}{2}\|\beta\|^2 \quad \text{subject to } Y_i(\beta_0 + X_i^T\beta) \ge 1, \; i = 1, \dots, n.$$

- There are two difficulties in practice:
  – classes may not be separable, so no solution exists;
  – classes may be separable, but the separation is nonlinear.


Extension to data with imperfect separation

- For imperfect separation, $Y_i(\beta_0 + X_i^T\beta)$ may not be positive, i.e., the prediction is wrong.
- We should allow such misclassifications but impose a penalty on wrong predictions.
- This can be done by introducing slack variables $\xi_1, \dots, \xi_n$, one for each subject.
- $\xi_i \ge 0$ describes how far the observation falls on the wrong side of its margin.
- However, we should restrict the total penalty from becoming too large.


SVM optimization

- The optimization is
  $$\max_{\|\beta\|=1} M \quad \text{subject to } Y_i(\beta_0 + X_i^T\beta) \ge M(1 - \xi_i), \; i = 1, \dots, n,$$
  where $\xi_i \ge 0$ and $\sum_{i=1}^n \xi_i \le$ a pre-specified constant.

- Equivalently,
  $$\min_{\beta_0, \beta, \xi} \; \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n \xi_i \quad \text{subject to } Y_i(\beta_0 + X_i^T\beta) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \dots, n,$$
  where $C$ is a given constant (called the cost parameter).
- This is a convex minimization problem with linear constraints.
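A minimal sketch of this soft-margin primal problem, assuming the cvxpy package and a small synthetic dataset (both are illustrative choices):

```python
import numpy as np
import cvxpy as cp

# Synthetic, roughly separable data: X is n x p, Y in {-1, +1}
rng = np.random.default_rng(1)
n, p = 50, 2
X = rng.normal(size=(n, p))
Y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)

C = 1.0                                   # cost parameter (illustrative value)
beta = cp.Variable(p)
beta0 = cp.Variable()
xi = cp.Variable(n, nonneg=True)          # slack variables

constraints = [cp.multiply(Y, X @ beta + beta0) >= 1 - xi]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(beta) + C * cp.sum(xi)),
                     constraints)
problem.solve()                           # convex QP with linear constraints
```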


Solve SVM problem using duality

- The Lagrange (primal) function is
  $$\frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i\Big[Y_i(\beta_0 + X_i^T\beta) - (1 - \xi_i)\Big] - \sum_{i=1}^n \mu_i\xi_i,$$
  where $\alpha_i \ge 0$ and $\mu_i \ge 0$ are the Lagrange multipliers.
- Differentiating with respect to $\beta_0$, $\beta$, and $\xi_i$ gives
  $$\beta = \sum_{i=1}^n \alpha_i Y_i X_i, \qquad 0 = \sum_{i=1}^n \alpha_i Y_i, \qquad \alpha_i = C - \mu_i, \quad i = 1, \dots, n.$$


Dual problem

- After plugging $\beta$ into the primal function and using the above equations, the dual objective function is
  $$L_D = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j Y_i Y_j X_i^T X_j.$$
- The dual problem becomes $\max L_D$ subject to $0 \le \alpha_i \le C$, $i = 1, \dots, n$, and $\sum_{i=1}^n \alpha_i Y_i = 0$.
- Furthermore, the KKT conditions give
  $$\alpha_i\Big[Y_i(X_i^T\beta + \beta_0) - (1 - \xi_i)\Big] = 0, \qquad \mu_i\xi_i = 0, \qquad Y_i(X_i^T\beta + \beta_0) - (1 - \xi_i) \ge 0.$$
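A sketch of this dual QP under the same illustrative setup, again using cvxpy; the tiny ridge added to the Gram matrix is only a numerical safeguard, not part of the formulation:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
n, p = 50, 2
X = rng.normal(size=(n, p))
Y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
C = 1.0

G = (Y[:, None] * X) @ (Y[:, None] * X).T      # G_ij = Y_i Y_j X_i^T X_j
G = G + 1e-8 * np.eye(n)                       # numerical safeguard to keep G PSD

alpha = cp.Variable(n)
L_D = cp.sum(alpha) - 0.5 * cp.quad_form(alpha, G)
dual = cp.Problem(cp.Maximize(L_D),
                  [alpha >= 0, alpha <= C, Y @ alpha == 0])
dual.solve()
```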


KKT conditions


On SVM optimization

- Solving the dual problem is a simple convex quadratic programming problem (there are many solvers available in standard packages).
- Since $\beta = \sum_{i=1}^n \alpha_i Y_i X_i$, the hyperplane is determined by those observations with $\alpha_i \ne 0$, called support vectors.
- Among the support vectors, some lie on the margin edges ($\xi_i = 0$) and the remainder have $\alpha_i = C$.
- Any support vector with $\xi_i = 0$ can be used to solve for $\beta_0$ (often taken to be the average over all such support vectors), as sketched below.
- Sometimes $\beta_0$ can also be obtained by directly minimizing the primal function.
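Continuing the dual sketch above, a minimal helper for recovering $\beta$ and $\beta_0$ from the fitted $\alpha_i$'s (the tolerance is an arbitrary numerical cutoff):

```python
import numpy as np

def recover_hyperplane(alpha, X, Y, C, tol=1e-6):
    """Recover (beta, beta0) from fitted dual variables alpha (see the dual sketch above)."""
    beta = (alpha * Y) @ X                          # beta = sum_i alpha_i Y_i X_i
    on_margin = (alpha > tol) & (alpha < C - tol)   # support vectors with xi_i = 0
    # Each such point satisfies Y_i (X_i^T beta + beta0) = 1, i.e. beta0 = Y_i - X_i^T beta;
    # average over them (assumes at least one support vector lies exactly on the margin).
    beta0 = np.mean(Y[on_margin] - X[on_margin] @ beta)
    return beta, beta0
```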


Illustrative example


Go beyond linear SVM

- The most common way to obtain a nonlinear prediction rule is to restrict $f$ to an RKHS $\mathcal{H}_K$ (the kernel trick).
- Recall that an RKHS is generated by a kernel function $K(x, y)$, which has the eigen-expansion
  $$K(x, y) = \sum_{k=1}^\infty \gamma_k \phi_k(x)\phi_k(y),$$
  where $\{\sqrt{\gamma_k}\,\phi_k\}$ are the normalized basis functions for $\{\mathcal{H}_K, \langle\cdot, \cdot\rangle_{\mathcal{H}_K}\}$.
- We can represent $f(x)$ using these basis functions:
  $$f(x) = \beta_0 + \sum_{k=1}^\infty \beta_k \sqrt{\gamma_k}\,\phi_k(x).$$


Dual problem with kernel trick

- Following the same derivation as for the linear SVM (replacing $X_i$ by the feature vector $(\sqrt{\gamma_1}\,\phi_1(X_i), \sqrt{\gamma_2}\,\phi_2(X_i), \dots)^T$), the dual objective function becomes
  $$\sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j Y_i Y_j\Big(\sum_{k=1}^\infty \gamma_k\,\phi_k(X_i)\phi_k(X_j)\Big) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j Y_i Y_j K(X_i, X_j).$$

- The prediction function becomes
  $$f(x) = \beta_0 + \sum_{k=1}^\infty\sum_{i=1}^n \alpha_i Y_i\,\gamma_k\,\phi_k(X_i)\phi_k(x) = \beta_0 + \sum_{i=1}^n \alpha_i Y_i K(X_i, x).$$
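A minimal sketch of this prediction function, assuming fitted dual variables `alpha`, an intercept `beta0`, and any kernel supplied as a Python callable `K` (names are illustrative; the kernels listed on the next slide can be used):

```python
import numpy as np

def svm_decision(x, X_train, Y_train, alpha, beta0, K):
    """f(x) = beta0 + sum_i alpha_i Y_i K(X_i, x)."""
    return beta0 + sum(a * y * K(xi, x)
                       for a, y, xi in zip(alpha, Y_train, X_train))

def svm_classify(x, X_train, Y_train, alpha, beta0, K):
    """Predicted label is the sign of the decision function."""
    return np.sign(svm_decision(x, X_train, Y_train, alpha, beta0, K))
```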


Advantage of kernel trick

- Our conclusions are:
  – (a) restricting $f$ to $\mathcal{H}_K$ leads to a nonlinear prediction function that depends on the kernel function;
  – (b) solving the dual problem for the prediction function only requires knowing the kernel function $K(x, y)$ (not necessarily the basis functions);
  – (c) the optimization in the dual problem depends on the number of observations ($n$) but not on the dimensionality of the $X_i$'s.


Choice of the kernel functions

- Polynomial kernel: $K(x, x') = (1 + x^T x')^d$
- Radial basis or Gaussian kernel: $K(x, x') = \exp\{-\gamma\|x - x'\|^2\}$
- Neural network kernel: $K(x, x') = \tanh(\kappa_1 x^T x' + \kappa_2)$
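These kernels can be written directly; a minimal sketch with illustrative default parameter values:

```python
import numpy as np

def polynomial_kernel(x, z, d=3):
    """(1 + x^T z)^d"""
    return (1.0 + x @ z) ** d

def gaussian_kernel(x, z, gamma=1.0):
    """exp(-gamma * ||x - z||^2)"""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def neural_network_kernel(x, z, k1=1.0, k2=0.0):
    """tanh(k1 * x^T z + k2)"""
    return np.tanh(k1 * (x @ z) + k2)
```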


Revisit SVM example


Loss formulation for SVM

- Revisit the linear SVM formulation: we minimize $\|\beta\|$ subject to the separation constraints
  $$Y_i(\beta_0 + X_i^T\beta) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n,$$
  with $\sum_{i=1}^n \xi_i$ controlled by a constant.
- We need to understand exactly what empirical loss this optimization minimizes, because by doing so
  – we can characterize how SVM relates to minimizing the classification loss (Fisher consistency);
  – we can study the stochastic variability of the SVM classifier (convergence rate and risk bound).


Loss formulation: continued

- Equivalently, we minimize (for a given constant $C$)
  $$\frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n \xi_i$$
  subject to $\xi_i \ge [1 - Y_i(\beta_0 + X_i^T\beta)]_+$, where $(1 - z)_+ = \max(1 - z, 0)$.
- Hence, SVM is equivalent to minimizing the loss
  $$\sum_{i=1}^n [1 - Y_i(\beta_0 + X_i^T\beta)]_+ + \frac{\lambda}{2}\|\beta\|^2.$$
- For nonlinear SVM, the loss is
  $$\sum_{i=1}^n [1 - Y_i f(X_i)]_+ + \frac{\lambda}{2}\|f\|^2_{\mathcal{H}_K}.$$
- We call $L(y, f) = [1 - yf]_+$ the hinge loss function.
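A minimal sketch evaluating this penalized hinge-loss objective for a linear rule (names and the penalty parameter are illustrative):

```python
import numpy as np

def hinge(y, f):
    """Hinge loss [1 - y f]_+, elementwise."""
    return np.maximum(0.0, 1.0 - y * f)

def svm_objective(beta0, beta, X, Y, lam):
    """sum_i [1 - Y_i (beta0 + X_i^T beta)]_+  +  (lam / 2) * ||beta||^2"""
    f = beta0 + X @ beta
    return hinge(Y, f).sum() + 0.5 * lam * beta @ beta
```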


Plot of the hinge loss


Fisher consistency of SVM

- Fisher consistency: suppose $f^*$ minimizes $E[(1 - Yf(X))_+]$. Then $\mathrm{sign}(f^*(x))$ is the Bayes rule for the classification problem.
- Proof: note that
  $$E\big[(1 - Yf(X))_+ \,\big|\, X = x\big] = (1 - f(x))_+\, P(Y = 1 \mid X = x) + (1 + f(x))_+\, P(Y = -1 \mid X = x),$$
  as a function of $f(x)$, is piecewise linear with three pieces: decreasing on $(-\infty, -1]$, linear on $(-1, 1]$, and increasing on $[1, \infty)$.
- The minimum is therefore attained at $f(x) = 1$ if $P(Y = -1 \mid X = x) < P(Y = 1 \mid X = x)$ and at $f(x) = -1$ otherwise.
- This establishes the Fisher consistency.
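A quick numerical check of this argument, with the illustrative value $P(Y = 1 \mid X = x) = 0.7$: the grid minimizer of the conditional hinge risk is $f(x) = 1$, matching the Bayes rule when $P(Y = 1 \mid X = x) > 1/2$.

```python
import numpy as np

p = 0.7                                                 # P(Y = 1 | X = x), illustrative
f_grid = np.linspace(-3, 3, 601)
cond_risk = (np.maximum(0, 1 - f_grid) * p              # (1 - f)_+ P(Y =  1 | x)
             + np.maximum(0, 1 + f_grid) * (1 - p))     # (1 + f)_+ P(Y = -1 | x)
print(f_grid[np.argmin(cond_risk)])                     # prints ~1.0, the minimizer
```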


Extension of the hinge loss

- The hinge loss is a special case of the so-called large-margin losses of the form $\phi(yf)$ for some convex function $\phi$.
- Additional examples include
  – binomial deviance: $\log(1 + e^{-yf})$
  – squared loss: $(1 - yf)^2$
  – squared hinge loss: $(1 - yf)_+^2$
  – AdaBoost (exponential) loss: $\exp\{-yf\}$.
- A sufficient condition for Fisher consistency is that $\phi$ is differentiable at 0 and $\phi'(0) < 0$.


SVM for regression

- The extension of SVM to continuous $Y$ is based on a modification of the SVM loss.
- Consider the prediction $f(X)$ for a subject with features $X$, where $Y$ is the subject's true outcome.
- The inaccuracy of the prediction can be characterized by the so-called $\varepsilon$-insensitive loss:
  $$L(Y, f(X)) = \max(|Y - f(X)| - \varepsilon, 0).$$
- The loss is zero if the prediction error is within $\varepsilon$.
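A minimal sketch of this loss:

```python
import numpy as np

def eps_insensitive(y, f, eps=0.1):
    """max(|y - f| - eps, 0): zero whenever the prediction error is within eps."""
    return np.maximum(np.abs(y - f) - eps, 0.0)
```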


ε-insensitive loss


Optimization problem in SVM for regression

- The objective function for linear prediction is $\min \; \|\beta\|^2/2 + C\sum_{i=1}^n (\xi_i + \xi_i')$ subject to
  $$-\xi_i' - \varepsilon \le Y_i - (\beta_0 + X_i^T\beta) \le \varepsilon + \xi_i, \quad \xi_i \ge 0, \; \xi_i' \ge 0, \; i = 1, \dots, n.$$

- The dual problem is
  $$\min \; \varepsilon\sum_{i=1}^n (\alpha_i + \alpha_i') - \sum_{i=1}^n Y_i(\alpha_i - \alpha_i') + \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n (\alpha_i - \alpha_i')(\alpha_j - \alpha_j')\, X_i^T X_j$$
  subject to
  $$0 \le \alpha_i, \alpha_i' \le C, \qquad \sum_{i=1}^n (\alpha_i - \alpha_i') = 0, \qquad \alpha_i\alpha_i' = 0.$$

- The prediction function is $\beta_0 + X^T\beta$ with $\beta = \sum_{i=1}^n (\alpha_i - \alpha_i') X_i$.
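In practice, SVM regression is usually fit with an off-the-shelf solver; a minimal sketch using scikit-learn's SVR on synthetic data (all settings illustrative):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

# C is the cost parameter, epsilon the half-width of the insensitive tube.
model = SVR(kernel="linear", C=1.0, epsilon=0.1)
model.fit(X, y)
y_hat = model.predict(X)
```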
