Indirect Rule Learning: Support Vector Machines
Donglin Zeng, Department of Biostatistics, University of North Carolina
Indirect learning: loss optimization
- Indirect learning does not estimate the prediction rule $f(x)$ directly, since most loss functions do not have explicit optimizers.
- Instead, it aims to directly minimize an empirical approximation of the expected loss function.
  – Most often, it minimizes the empirical risk (empirical risk minimization): $\sum_{i=1}^n L(Y_i, f(X_i))$.
- For example, least squares estimation uses $\sum_{i=1}^n (Y_i - f(X_i))^2$; the classification problem uses $\sum_{i=1}^n I(Y_i \neq f(X_i))$.
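As a concrete illustration, the two empirical risks above can be computed directly. The toy data and the threshold rule `f` below are hypothetical, chosen only so the sums are easy to check by hand:

```python
import numpy as np

# Hypothetical toy data and a simple threshold rule f (for illustration only).
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = np.array([-1, -1, 1, 1])

def f(x):
    # classify by thresholding the feature at 1.5
    return np.where(x > 1.5, 1, -1)

# Empirical risks: sum of the loss over the sample.
squared_risk = np.sum((Y - f(X)) ** 2)     # least squares loss
zero_one_risk = np.sum(Y != f(X))          # 0-1 classification loss
```

Here the rule classifies every point correctly, so both empirical risks are zero; a worse rule would increase both sums.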
Potential challenges
- What is a good approximation of the expected loss function? Empirical risk is most commonly used, but there are other alternatives.
- What is the choice of candidate $f$ for optimization?
- How do we avoid overfitting?
- Will computation be feasible?
  – finding a global minimizer
  – computational complexity
Least squares estimation
- The empirical risk is $\sum_{i=1}^n (Y_i - f(X_i))^2$.
- $f(x)$ can be from
  – a class of linear functions;
  – a sieve space of basis functions (splines, wavelets, radial basis);
  – or fully nonparametric (kernel estimation).
- Overfitting can be addressed using regularization:
  – variable selection for linear models;
  – penalized splines and shrinkage for sieve approximation;
  – cross-validation for tuning parameter selection.
- Computation:
  – convex optimization;
  – coordinate descent optimization for large $p$.
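A minimal numpy sketch of regularized least squares for a linear model (the data are simulated here purely for illustration); the ridge penalty $\lambda$ shrinks the coefficients and in practice is tuned by cross-validation:

```python
import numpy as np

# Ridge-regularized least squares: minimize sum (y - Xb)^2 + lam * ||b||^2,
# which has the closed form b = (X'X + lam I)^{-1} X'y.
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])     # hypothetical true coefficients
y = X @ beta_true + 0.1 * rng.normal(size=n)

lam = 0.1
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

With modest noise and $n \gg p$, the estimate lands close to the true coefficients; larger $\lambda$ would shrink it further toward zero.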
Support Vector Machines
- Consider a binary classification problem and use the labels $\{-1, 1\}$ for the two classes.
- We start from a simple classification rule that is a linear function of the feature variables $X$.
- The idea of SVM is to identify a hyperplane in feature space that separates the classes as much as possible.
SVM illustration
Mathematical formulation of SVM
- The goal is to find a hyperplane $\beta_0 + X^T\beta$ such that
  $Y_i(\beta_0 + X_i^T\beta) > 0$ for all $i = 1, \ldots, n$.
- Furthermore, we wish to maximize the margin, denoted $M$.
- That is, we solve
  $\max_{\|\beta\| = 1} M \quad \text{subject to} \quad Y_i(\beta_0 + X_i^T\beta) \ge M, \; i = 1, \ldots, n.$
Equivalent optimization
- It is equivalent to
  $\min \tfrac{1}{2}\|\beta\|^2 \quad \text{subject to} \quad Y_i(\beta_0 + X_i^T\beta) \ge 1, \; i = 1, \ldots, n.$
- There are two practical difficulties:
  – the classes may not be separable, so no solution exists;
  – the classes may be separable, but the separation is nonlinear.
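The equivalence follows from a standard rescaling argument, sketched here:

```latex
% With \|\beta\| = 1, the margin constraint reads Y_i(\beta_0 + X_i^T\beta) \ge M.
% Dropping the norm constraint and rescaling (\beta_0, \beta) \mapsto (\beta_0, \beta)/M,
% the constraint becomes Y_i(\beta_0 + X_i^T\beta) \ge 1, with margin M = 1/\|\beta\|.
% Maximizing M = 1/\|\beta\| is therefore the same as solving
\min_{\beta_0,\,\beta} \; \tfrac{1}{2}\|\beta\|^2
\quad \text{subject to} \quad Y_i(\beta_0 + X_i^T\beta) \ge 1, \; i = 1, \ldots, n.
```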
Extension to imperfectly separated data
- For imperfect separation, $Y_i(\beta_0 + X_i^T\beta)$ may not be positive, i.e., the prediction is wrong.
- We should allow such misclassification but impose some penalty for wrong predictions.
- This can be done by introducing slack variables $\xi_1, \ldots, \xi_n$, one for each subject.
- $\xi_i \ge 0$ describes the distance by which observation $i$ falls on the wrong side of its margin.
- However, we should restrict the total penalty from being too large.
SVM optimization
- The optimization is
  $\max_{\|\beta\| = 1} M \quad \text{subject to} \quad Y_i(\beta_0 + X_i^T\beta) \ge M(1 - \xi_i), \; i = 1, \ldots, n,$
  where $\xi_i \ge 0$ and $\sum_{i=1}^n \xi_i \le$ a pre-specified constant.
- Equivalently,
  $\min \tfrac{1}{2}\|\beta\|^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad Y_i(\beta_0 + X_i^T\beta) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \ldots, n,$
  where $C$ is a given constant (called the cost parameter).
- This is a convex minimization problem with linear constraints.
Solving the SVM problem using duality
- The Lagrange (primal) function is
  $\tfrac{1}{2}\|\beta\|^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \left[ Y_i(\beta_0 + X_i^T\beta) - (1 - \xi_i) \right] - \sum_{i=1}^n \mu_i \xi_i,$
  where $\alpha_i \ge 0$ and $\mu_i \ge 0$ are the Lagrange multipliers.
- Differentiating with respect to $\beta_0$, $\beta$, and $\xi_i$ gives
  $\beta = \sum_{i=1}^n \alpha_i Y_i X_i, \qquad 0 = \sum_{i=1}^n \alpha_i Y_i, \qquad \alpha_i = C - \mu_i, \; i = 1, \ldots, n.$
Dual problem
- After plugging $\beta$ into the primal function and using these equations, the dual objective function is
  $L_D = \sum_{i=1}^n \alpha_i - \tfrac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j Y_i Y_j X_i^T X_j.$
- The dual problem becomes $\max L_D$ subject to
  $0 \le \alpha_i \le C, \; i = 1, \ldots, n, \qquad \sum_{i=1}^n \alpha_i Y_i = 0.$
- Furthermore, the KKT conditions give
  $\alpha_i \left[ Y_i(X_i^T\beta + \beta_0) - (1 - \xi_i) \right] = 0, \qquad \mu_i \xi_i = 0, \qquad Y_i(X_i^T\beta + \beta_0) - (1 - \xi_i) \ge 0.$
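The dual can be checked on a tiny example. For the two points $x_1 = 1, y_1 = +1$ and $x_2 = -1, y_2 = -1$ (chosen so everything is solvable by hand), the equality constraint forces $\alpha_1 = \alpha_2 = \alpha$, and $L_D$ reduces to $2\alpha - 2\alpha^2$, maximized at $\alpha = 1/2$. A sketch:

```python
import numpy as np

# Two-point toy problem: x1 = +1 (y1 = +1), x2 = -1 (y2 = -1).
X = np.array([1.0, -1.0])
Y = np.array([1.0, -1.0])
C = 10.0

# sum_i alpha_i Y_i = 0 forces alpha_1 = alpha_2 = a; for this data
# L_D = 2a - 2a^2, which we maximize over a grid in [0, C].
a_grid = np.linspace(0.0, C, 100001)
LD = 2 * a_grid - 2 * a_grid ** 2
a_star = a_grid[np.argmax(LD)]            # maximizer: a = 1/2

beta = np.sum(a_star * Y * X)             # beta = sum_i alpha_i Y_i x_i
# A support vector on the margin (xi = 0) pins down beta_0 via the KKT
# condition Y_1 (beta * x_1 + beta_0) = 1:
beta0 = 1.0 / Y[0] - beta * X[0]
```

The recovered hyperplane is $x = 0$ with margin edges at $x = \pm 1$, exactly as geometry suggests for this symmetric pair.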
KKT conditions
On SVM optimization
- Solving the dual problem is a simple convex quadratic programming problem (many solvers are available).
- Since $\beta = \sum_{i=1}^n \alpha_i Y_i X_i$, the hyperplane is determined by the observations with $\alpha_i \neq 0$, called support vectors.
- Among the support vectors, some lie on the margin edges ($\xi_i = 0$) and the remainder have $\alpha_i = C$.
- Any support vector with $\xi_i = 0$ can be used to solve for $\beta_0$ (often taken as the average if there are several).
- Sometimes $\beta_0$ can be obtained by directly minimizing the primal function.
Illustrative example
Go beyond linear SVM
- The most commonly used nonlinear prediction rule restricts $f$ to an RKHS $\mathcal{H}_K$ (the kernel trick).
- Recall that an RKHS is generated by a kernel function $K(x, y)$, which has the eigen-expansion
  $K(x, y) = \sum_{k=1}^\infty \gamma_k \phi_k(x) \phi_k(y),$
  where $\sqrt{\gamma_k}\,\phi_k$ is the normalized basis function for $\{\mathcal{H}_K, \langle \cdot, \cdot \rangle_{\mathcal{H}_K}\}$.
- We can represent $f(x)$ using these basis functions:
  $f(x) = \beta_0 + \sum_{k=1}^\infty \beta_k \sqrt{\gamma_k}\,\phi_k(x).$
Dual problem with kernel trick
- Following the same derivation as for linear SVM (replace $X_i$ by the feature vector $(\sqrt{\gamma_1}\,\phi_1(X_i), \sqrt{\gamma_2}\,\phi_2(X_i), \ldots)^T$), the dual objective function becomes
  $\sum_{i=1}^n \alpha_i - \tfrac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j Y_i Y_j \left( \sum_{k=1}^\infty \gamma_k \phi_k(X_i) \phi_k(X_j) \right) = \sum_{i=1}^n \alpha_i - \tfrac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j Y_i Y_j K(X_i, X_j).$
- The prediction function becomes
  $f(x) = \beta_0 + \sum_{k=1}^\infty \sum_{i=1}^n \alpha_i Y_i \gamma_k \phi_k(X_i) \phi_k(x) = \beta_0 + \sum_{i=1}^n \alpha_i Y_i K(X_i, x).$
Advantages of the kernel trick
- (a) Restricting $f$ to $\mathcal{H}_K$ leads to a nonlinear prediction function that depends on the kernel function.
- (b) Solving the dual problem for the prediction function only requires knowing the kernel function $K(x, y)$, not the basis functions themselves.
- (c) The optimization in the dual problem depends on the number of observations ($n$) but not on the dimensionality of the $X_i$'s.
Choice of the kernel function
- Polynomial kernel: $K(x, x') = (1 + x^T x')^d$
- Radial basis (Gaussian) kernel: $K(x, x') = \exp\{-\gamma \|x - x'\|^2\}$
- Neural network (sigmoid) kernel: $K(x, x') = \tanh(k_1 x^T x' + k_2)$
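These three kernels are straightforward to evaluate; a minimal sketch (the parameter values $d$, $\gamma$, $k_1$, $k_2$ below are arbitrary defaults for illustration):

```python
import numpy as np

def poly_kernel(x, z, d=2):
    """Polynomial kernel (1 + x'z)^d."""
    return (1.0 + x @ z) ** d

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (radial basis) kernel exp(-gamma ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def nn_kernel(x, z, k1=1.0, k2=0.0):
    """Sigmoid ('neural network') kernel tanh(k1 x'z + k2)."""
    return np.tanh(k1 * (x @ z) + k2)

x = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])
vals = (poly_kernel(x, z), rbf_kernel(x, z), nn_kernel(x, z))
```

Note the sigmoid kernel is not positive definite for all parameter choices, so it does not always define a genuine RKHS.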
Revisit SVM example
Loss formulation for SVM
- Revisit the linear SVM formulation: we minimize $\|\beta\|$ subject to the separation constraints
  $Y_i(\beta_0 + X_i^T\beta) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \ldots, n,$
  with $\sum_{i=1}^n \xi_i$ controlled by a constant.
- We need to understand exactly which empirical loss this optimization minimizes, because by doing so
  – we can characterize how SVM minimizes the classification loss (Fisher consistency);
  – we can study the stochastic variability of the SVM classifier (convergence rates and risk bounds).
Loss formulation: continued
- Equivalently, for a given constant $C$, we minimize
  $\tfrac{1}{2}\|\beta\|^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad \xi_i \ge [1 - Y_i(\beta_0 + X_i^T\beta)]_+,$
  where $(1 - z)_+ = \max(1 - z, 0)$.
- Hence, SVM is equivalent to minimizing the loss
  $\sum_{i=1}^n [1 - Y_i(\beta_0 + X_i^T\beta)]_+ + \frac{\lambda}{2}\|\beta\|^2.$
- For nonlinear SVM, the loss is
  $\sum_{i=1}^n [1 - Y_i f(X_i)]_+ + \frac{\lambda}{2}\|f\|_{\mathcal{H}_K}^2.$
- We call $L(y, f) = [1 - yf]_+$ the hinge loss function.
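This unconstrained hinge-loss form can be minimized directly by subgradient descent. The Pegasos-style sketch below uses the averaged objective $\frac{\lambda}{2}\|\beta\|^2 + \frac{1}{n}\sum_i [1 - Y_i(\beta_0 + X_i^T\beta)]_+$ (a rescaling of the sum form), with data simulated purely for illustration:

```python
import numpy as np

# Subgradient descent on the averaged hinge-loss form of the linear SVM.
rng = np.random.default_rng(1)
n = 200
y = np.where(rng.random(n) < 0.5, 1.0, -1.0)
# Two Gaussian clusters separated along the first coordinate.
X = np.column_stack([2.0 * y + rng.normal(size=n), rng.normal(size=n)])

lam = 0.1
b = np.zeros(2)
b0 = 0.0
for t in range(1, 2001):
    eta = 1.0 / (lam * t)                 # standard decaying step size
    viol = y * (X @ b + b0) < 1           # points violating the margin
    # Subgradient of (lam/2)||b||^2 + mean hinge loss:
    gb = lam * b - (y[viol, None] * X[viol]).sum(axis=0) / n
    gb0 = -y[viol].sum() / n
    b -= eta * gb
    b0 -= eta * gb0

accuracy = np.mean(np.sign(X @ b + b0) == y)
```

On this well-separated toy problem the learned rule classifies nearly all points correctly; the same loop scales to large $n$ because each step costs $O(np)$.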
Plot of the hinge loss
Fisher consistency of SVM
- Fisher consistency: suppose $f^*$ minimizes $E[(1 - Yf(X))_+]$. Then $\mathrm{sign}(f^*(x))$ is the Bayes rule for the classification problem.
- Proof: note that
  $E[(1 - Yf(X))_+ \mid X = x] = (1 - f(x))_+ P(Y = 1 \mid X = x) + (1 + f(x))_+ P(Y = -1 \mid X = x),$
  as a function of $f(x)$, is piecewise linear with three pieces: decreasing on $(-\infty, -1]$, linear on $(-1, 1]$, and increasing on $[1, \infty)$.
- The minimum is attained at $f(x) = 1$ if $P(Y = -1 \mid X = x) < P(Y = 1 \mid X = x)$ and at $f(x) = -1$ otherwise.
- This establishes Fisher consistency.
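The three-piece argument can be verified numerically: fix $p = P(Y = 1 \mid X = x)$ (here $p = 0.7$, an arbitrary choice above $1/2$) and minimize the conditional risk over a grid of $f$ values:

```python
import numpy as np

# Conditional hinge risk R(f) = (1 - f)_+ p + (1 + f)_+ (1 - p) at a fixed x.
p = 0.7                                   # arbitrary P(Y = 1 | X = x) > 1/2
f_grid = np.linspace(-3.0, 3.0, 6001)
risk = (np.maximum(1 - f_grid, 0) * p
        + np.maximum(1 + f_grid, 0) * (1 - p))
f_star = f_grid[np.argmin(risk)]          # minimizer; its sign is the Bayes rule
```

Since $p > 1/2$, the numerical minimizer sits at $f = 1$, whose sign agrees with the Bayes rule; taking $p < 1/2$ instead moves it to $f = -1$.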
Extension of the hinge loss
- The hinge loss is a special case of the so-called large-margin losses, of the form $\phi(yf)$ for some convex function $\phi$.
- Additional examples include:
  – binomial deviance: $\log(1 + e^{-yf})$;
  – squared loss: $(1 - yf)^2$;
  – squared hinge loss: $(1 - yf)_+^2$;
  – AdaBoost (exponential) loss: $\exp\{-yf\}$.
- A sufficient condition for Fisher consistency is that $\phi$ is differentiable at $0$ with $\phi'(0) < 0$.
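Each listed loss satisfies the sufficient condition; a quick numerical check of $\phi'(0) < 0$ using central differences:

```python
import numpy as np

# The large-margin losses phi(z) listed above, where z = y * f.
losses = {
    "hinge":        lambda z: np.maximum(1 - z, 0.0),
    "deviance":     lambda z: np.log(1 + np.exp(-z)),
    "squared":      lambda z: (1 - z) ** 2,
    "square_hinge": lambda z: np.maximum(1 - z, 0.0) ** 2,
    "adaboost":     lambda z: np.exp(-z),
}

# phi'(0) approximated by a central difference; all should be negative.
h = 1e-6
slopes = {name: (phi(h) - phi(-h)) / (2 * h) for name, phi in losses.items()}
```

For instance, the hinge loss has slope $-1$ at zero and the binomial deviance has slope $-1/2$, so both push the margin $yf$ upward near the decision boundary.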
SVM for regression
- The extension of SVM to continuous $Y$ is based on modifying the loss used in SVM.
- Consider a prediction $f(X)$ for a subject with feature $X$ whose true outcome is $Y$.
- The inaccuracy of the prediction can be characterized by the so-called $\epsilon$-insensitive loss:
  $L(Y, f(X)) = \max(|Y - f(X)| - \epsilon, 0).$
- The loss is zero if the prediction error is within $\epsilon$.
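A direct implementation of the loss (the value $\epsilon = 0.5$ is an arbitrary illustration):

```python
import numpy as np

def eps_insensitive(y, f, eps=0.5):
    """Epsilon-insensitive loss: zero inside the tube, linear outside it."""
    return np.maximum(np.abs(y - f) - eps, 0.0)

inside = eps_insensitive(1.0, 1.3)    # |error| = 0.3 < eps, so the loss is 0
outside = eps_insensitive(1.0, 2.0)   # |error| = 1.0, so the loss is 0.5
```

Errors smaller than $\epsilon$ are ignored entirely, which is what makes the fitted regression function depend only on the points at or outside the tube.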
ε-insensitive loss
Optimization problem in SVM for regression
- The objective function for a linear prediction rule is
  $\min \tfrac{1}{2}\|\beta\|^2 + C \sum_{i=1}^n (\xi_i + \xi_i') \quad \text{subject to}$
  $-\xi_i' - \epsilon \le Y_i - (\beta_0 + X_i^T\beta) \le \epsilon + \xi_i, \quad \xi_i \ge 0, \; \xi_i' \ge 0, \; i = 1, \ldots, n.$
- The dual problem is
  $\min \; \epsilon \sum_{i=1}^n (\alpha_i + \alpha_i') - \sum_{i=1}^n Y_i(\alpha_i - \alpha_i') + \tfrac{1}{2} \sum_{i=1}^n \sum_{j=1}^n (\alpha_i - \alpha_i')(\alpha_j - \alpha_j') X_i^T X_j$
  subject to
  $0 \le \alpha_i, \alpha_i' \le C, \quad \sum_{i=1}^n (\alpha_i - \alpha_i') = 0, \quad \alpha_i \alpha_i' = 0.$
- The prediction function is $\beta_0 + X^T\beta$ with
  $\beta = \sum_{i=1}^n (\alpha_i - \alpha_i') X_i.$
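The effect of the $\epsilon$-tube can be seen by brute-forcing the primal objective on a 1-D toy problem (the grid search is for illustration only; real solvers use the dual quadratic program):

```python
import numpy as np

# Brute-force the 1-D primal SVR objective
#   ||beta||^2 / 2 + C * sum_i max(|y_i - (beta0 + beta * x_i)| - eps, 0)
# on exactly linear toy data with true slope 2 and intercept 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
eps, C = 0.1, 10.0

best = (np.inf, 0.0, 0.0)               # (objective, beta0, beta)
for b0 in np.linspace(0.0, 2.0, 201):
    for b in np.linspace(1.0, 3.0, 201):
        slack = np.maximum(np.abs(y - (b0 + b * x)) - eps, 0.0).sum()
        obj = 0.5 * b ** 2 + C * slack
        if obj < best[0]:
            best = (obj, b0, b)

obj_star, b0_star, b_star = best
```

The optimizer flattens the slope slightly below the true value 2.0: errors inside the $\epsilon$-tube are free, so the $\|\beta\|^2/2$ term shrinks $\beta$ until the fit touches the tube boundary.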