Indirect Rule Learning: Support Vector Machines
Donglin Zeng, Department of Biostatistics, University of North Carolina
Indirect learning: loss optimization
- Indirect learning does not estimate the prediction rule $f(x)$ directly, since most loss functions do not have explicit optimizers.
- Instead, it aims to directly minimize an empirical approximation of the expected loss function.
  – Most often, it minimizes the empirical risk (empirical risk minimization): $\sum_{i=1}^n L(Y_i, f(X_i))$.
- For example, least squares estimation uses $\sum_{i=1}^n (Y_i - f(X_i))^2$; the classification problem uses $\sum_{i=1}^n I(Y_i \neq f(X_i))$.
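As a concrete illustration, the two empirical risks above can be computed directly. The toy data and the threshold rule `f` below are hypothetical, chosen only so the sums are easy to check by hand:

```python
import numpy as np

# Hypothetical toy data and a simple threshold rule f (for illustration only).
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = np.array([-1, -1, 1, 1])

def f(x):
    # classify by thresholding the feature at 1.5
    return np.where(x > 1.5, 1, -1)

# Empirical risks: sum of the loss over the sample.
squared_risk = np.sum((Y - f(X)) ** 2)     # least squares loss
zero_one_risk = np.sum(Y != f(X))          # 0-1 classification loss
```

Here the rule classifies every point correctly, so both empirical risks are zero; a worse rule would increase both sums.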
Potential challenges
- What is a good approximation of the expected loss function? Empirical risk is most commonly used, but there are other alternatives.
- What is the choice of candidate $f$ for optimization?
- How do we avoid overfitting?
- Will computation be feasible?
  – finding a global minimizer
  – computational complexity
Least squares estimation
- The empirical risk is $\sum_{i=1}^n (Y_i - f(X_i))^2$.
- $f(x)$ can be from
  – a class of linear functions;
  – a sieve space of basis functions (splines, wavelets, radial basis);
  – or fully nonparametric (kernel estimation).
- Overfitting can be addressed using regularization:
  – variable selection for linear models;
  – penalized splines and shrinkage for sieve approximation;
  – cross-validation for tuning parameter selection.
- Computation:
  – convex optimization;
  – coordinate descent optimization for large $p$.
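A minimal numpy sketch of regularized least squares for a linear model (the data are simulated here purely for illustration); the ridge penalty $\lambda$ shrinks the coefficients and in practice is tuned by cross-validation:

```python
import numpy as np

# Ridge-regularized least squares: minimize sum (y - Xb)^2 + lam * ||b||^2,
# which has the closed form b = (X'X + lam I)^{-1} X'y.
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])     # hypothetical true coefficients
y = X @ beta_true + 0.1 * rng.normal(size=n)

lam = 0.1
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

With modest noise and $n \gg p$, the estimate lands close to the true coefficients; larger $\lambda$ would shrink it further toward zero.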
Support Vector Machines
- Consider a binary classification problem and use the labels $\{-1, 1\}$ for the two classes.
- We start from a simple classification rule that is a linear function of the feature variables $X$.
- The idea of SVM is to identify a hyperplane in feature space that separates the classes as much as possible.
SVM illustration
Mathematical formulation of SVM
- The goal is to find a hyperplane $\beta_0 + X^T\beta$ such that
  $Y_i(\beta_0 + X_i^T\beta) > 0$ for all $i = 1, \ldots, n$.
- Furthermore, we wish to maximize the margin, denoted $M$.
- That is, we solve
  $\max_{\|\beta\| = 1} M \quad \text{subject to} \quad Y_i(\beta_0 + X_i^T\beta) \ge M, \; i = 1, \ldots, n.$
Equivalent optimization
- It is equivalent to
  $\min \tfrac{1}{2}\|\beta\|^2 \quad \text{subject to} \quad Y_i(\beta_0 + X_i^T\beta) \ge 1, \; i = 1, \ldots, n.$
- There are two practical difficulties:
  – the classes may not be separable, so no solution exists;
  – the classes may be separable, but the separation is nonlinear.
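The equivalence follows from a standard rescaling argument, sketched here:

```latex
% With \|\beta\| = 1, the margin constraint reads Y_i(\beta_0 + X_i^T\beta) \ge M.
% Dropping the norm constraint and rescaling (\beta_0, \beta) \mapsto (\beta_0, \beta)/M,
% the constraint becomes Y_i(\beta_0 + X_i^T\beta) \ge 1, with margin M = 1/\|\beta\|.
% Maximizing M = 1/\|\beta\| is therefore the same as solving
\min_{\beta_0,\,\beta} \; \tfrac{1}{2}\|\beta\|^2
\quad \text{subject to} \quad Y_i(\beta_0 + X_i^T\beta) \ge 1, \; i = 1, \ldots, n.
```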
Extension to imperfectly separated data
- For imperfect separation, $Y_i(\beta_0 + X_i^T\beta)$ may not be positive, i.e., the prediction is wrong.
- We should allow such misclassification but impose some penalty for wrong predictions.
- This can be done by introducing slack variables $\xi_1, \ldots, \xi_n$, one for each subject.
- $\xi_i \ge 0$ describes the distance by which observation $i$ falls on the wrong side of its margin.
- However, we should restrict the total penalty from being too large.
SVM optimization
- The optimization is
  $\max_{\|\beta\| = 1} M \quad \text{subject to} \quad Y_i(\beta_0 + X_i^T\beta) \ge M(1 - \xi_i), \; i = 1, \ldots, n,$
  where $\xi_i \ge 0$ and $\sum_{i=1}^n \xi_i \le$ a pre-specified constant.
- Equivalently,
  $\min \tfrac{1}{2}\|\beta\|^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad Y_i(\beta_0 + X_i^T\beta) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \ldots, n,$
  where $C$ is a given constant (called the cost parameter).
- This is a convex minimization problem with linear constraints.
Solving the SVM problem using duality
- The Lagrange (primal) function is
  $\tfrac{1}{2}\|\beta\|^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \left[ Y_i(\beta_0 + X_i^T\beta) - (1 - \xi_i) \right] - \sum_{i=1}^n \mu_i \xi_i,$
  where $\alpha_i \ge 0$ and $\mu_i \ge 0$ are the Lagrange multipliers.
- Differentiating with respect to $\beta_0$, $\beta$, and $\xi_i$ gives
  $\beta = \sum_{i=1}^n \alpha_i Y_i X_i, \qquad 0 = \sum_{i=1}^n \alpha_i Y_i, \qquad \alpha_i = C - \mu_i, \; i = 1, \ldots, n.$
Dual problem
- After plugging $\beta$ into the primal function and using these equations, the dual objective function is
  $L_D = \sum_{i=1}^n \alpha_i - \tfrac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j Y_i Y_j X_i^T X_j.$
- The dual problem becomes $\max L_D$ subject to
  $0 \le \alpha_i \le C, \; i = 1, \ldots, n, \qquad \sum_{i=1}^n \alpha_i Y_i = 0.$
- Furthermore, the KKT conditions give
  $\alpha_i \left[ Y_i(X_i^T\beta + \beta_0) - (1 - \xi_i) \right] = 0, \qquad \mu_i \xi_i = 0, \qquad Y_i(X_i^T\beta + \beta_0) - (1 - \xi_i) \ge 0.$
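The dual can be checked on a tiny example. For the two points $x_1 = 1, y_1 = +1$ and $x_2 = -1, y_2 = -1$ (chosen so everything is solvable by hand), the equality constraint forces $\alpha_1 = \alpha_2 = \alpha$, and $L_D$ reduces to $2\alpha - 2\alpha^2$, maximized at $\alpha = 1/2$. A sketch:

```python
import numpy as np

# Two-point toy problem: x1 = +1 (y1 = +1), x2 = -1 (y2 = -1).
X = np.array([1.0, -1.0])
Y = np.array([1.0, -1.0])
C = 10.0

# sum_i alpha_i Y_i = 0 forces alpha_1 = alpha_2 = a; for this data
# L_D = 2a - 2a^2, which we maximize over a grid in [0, C].
a_grid = np.linspace(0.0, C, 100001)
LD = 2 * a_grid - 2 * a_grid ** 2
a_star = a_grid[np.argmax(LD)]            # maximizer: a = 1/2

beta = np.sum(a_star * Y * X)             # beta = sum_i alpha_i Y_i x_i
# A support vector on the margin (xi = 0) pins down beta_0 via the KKT
# condition Y_1 (beta * x_1 + beta_0) = 1:
beta0 = 1.0 / Y[0] - beta * X[0]
```

The recovered hyperplane is $x = 0$ with margin edges at $x = \pm 1$, exactly as geometry suggests for this symmetric pair.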
KKT conditions
On SVM optimization
- Solving the dual problem is a simple convex quadratic programming problem (many solvers are available).
- Since $\beta = \sum_{i=1}^n \alpha_i Y_i X_i$, the hyperplane is determined by the observations with $\alpha_i \neq 0$, called support vectors.
- Among the support vectors, some lie on the margin edges ($\xi_i = 0$) and the remainder have $\alpha_i = C$.
- Any support vector with $\xi_i = 0$ can be used to solve for $\beta_0$ (often taken as the average if there are several).
- Sometimes $\beta_0$ can be obtained by directly minimizing the primal function.
Illustrative example
Go beyond linear SVM
- The most commonly used nonlinear prediction rule restricts $f$ to an RKHS $\mathcal{H}_K$ (the kernel trick).
- Recall that an RKHS is generated by a kernel function $K(x, y)$, which has the eigen-expansion
  $K(x, y) = \sum_{k=1}^\infty \gamma_k \phi_k(x) \phi_k(y),$
  where $\sqrt{\gamma_k}\,\phi_k$ is the normalized basis function for $\{\mathcal{H}_K, \langle \cdot, \cdot \rangle_{\mathcal{H}_K}\}$.
- We can represent $f(x)$ using these basis functions:
  $f(x) = \beta_0 + \sum_{k=1}^\infty \beta_k \sqrt{\gamma_k}\,\phi_k(x).$
Dual problem with kernel trick
- Following the same derivation as for linear SVM (replace $X_i$ by the feature vector $(\sqrt{\gamma_1}\,\phi_1(X_i), \sqrt{\gamma_2}\,\phi_2(X_i), \ldots)^T$), the dual objective function becomes
  $\sum_{i=1}^n \alpha_i - \tfrac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j Y_i Y_j \left( \sum_{k=1}^\infty \gamma_k \phi_k(X_i) \phi_k(X_j) \right) = \sum_{i=1}^n \alpha_i - \tfrac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j Y_i Y_j K(X_i, X_j).$
- The prediction function becomes
  $f(x) = \beta_0 + \sum_{k=1}^\infty \sum_{i=1}^n \alpha_i Y_i \gamma_k \phi_k(X_i) \phi_k(x) = \beta_0 + \sum_{i=1}^n \alpha_i Y_i K(X_i, x).$
Advantages of the kernel trick
- (a) Restricting $f$ to $\mathcal{H}_K$ leads to a nonlinear prediction function that depends on the kernel function.
- (b) Solving the dual problem for the prediction function only requires knowing the kernel function $K(x, y)$, not the basis functions themselves.
- (c) The optimization in the dual problem depends on the number of observations ($n$) but not on the dimensionality of the $X_i$'s.
Choice of the kernel function
- Polynomial kernel: $K(x, x') = (1 + x^T x')^d$
- Radial basis (Gaussian) kernel: $K(x, x') = \exp\{-\gamma \|x - x'\|^2\}$
- Neural network (sigmoid) kernel: $K(x, x') = \tanh(k_1 x^T x' + k_2)$
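These three kernels are straightforward to evaluate; a minimal sketch (the parameter values $d$, $\gamma$, $k_1$, $k_2$ below are arbitrary defaults for illustration):

```python
import numpy as np

def poly_kernel(x, z, d=2):
    """Polynomial kernel (1 + x'z)^d."""
    return (1.0 + x @ z) ** d

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (radial basis) kernel exp(-gamma ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def nn_kernel(x, z, k1=1.0, k2=0.0):
    """Sigmoid ('neural network') kernel tanh(k1 x'z + k2)."""
    return np.tanh(k1 * (x @ z) + k2)

x = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])
vals = (poly_kernel(x, z), rbf_kernel(x, z), nn_kernel(x, z))
```

Note the sigmoid kernel is not positive definite for all parameter choices, so it does not always define a genuine RKHS.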
Revisit SVM example
Loss formulation for SVM
- Revisit the linear SVM formulation: we minimize $\|\beta\|$ subject to the separation constraints
  $Y_i(\beta_0 + X_i^T\beta) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \ldots, n,$
  with $\sum_{i=1}^n \xi_i$ controlled by a constant.
- We need to understand exactly which empirical loss this optimization minimizes, because by doing so
  – we can characterize how SVM minimizes the classification loss (Fisher consistency);
  – we can study the stochastic variability of the SVM classifier (convergence rates and risk bounds).
Loss formulation: continued
- Equivalently, for a given constant $C$, we minimize
  $\tfrac{1}{2}\|\beta\|^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad \xi_i \ge [1 - Y_i(\beta_0 + X_i^T\beta)]_+,$
  where $(1 - z)_+ = \max(1 - z, 0)$.
- Hence, SVM is equivalent to minimizing the loss
  $\sum_{i=1}^n [1 - Y_i(\beta_0 + X_i^T\beta)]_+ + \frac{\lambda}{2}\|\beta\|^2.$
- For nonlinear SVM, the loss is
  $\sum_{i=1}^n [1 - Y_i f(X_i)]_+ + \frac{\lambda}{2}\|f\|_{\mathcal{H}_K}^2.$
- We call $L(y, f) = [1 - yf]_+$ the hinge loss function.
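This unconstrained hinge-loss form can be minimized directly by subgradient descent. The Pegasos-style sketch below uses the averaged objective $\frac{\lambda}{2}\|\beta\|^2 + \frac{1}{n}\sum_i [1 - Y_i(\beta_0 + X_i^T\beta)]_+$ (a rescaling of the sum form), with data simulated purely for illustration:

```python
import numpy as np

# Subgradient descent on the averaged hinge-loss form of the linear SVM.
rng = np.random.default_rng(1)
n = 200
y = np.where(rng.random(n) < 0.5, 1.0, -1.0)
# Two Gaussian clusters separated along the first coordinate.
X = np.column_stack([2.0 * y + rng.normal(size=n), rng.normal(size=n)])

lam = 0.1
b = np.zeros(2)
b0 = 0.0
for t in range(1, 2001):
    eta = 1.0 / (lam * t)                 # standard decaying step size
    viol = y * (X @ b + b0) < 1           # points violating the margin
    # Subgradient of (lam/2)||b||^2 + mean hinge loss:
    gb = lam * b - (y[viol, None] * X[viol]).sum(axis=0) / n
    gb0 = -y[viol].sum() / n
    b -= eta * gb
    b0 -= eta * gb0

accuracy = np.mean(np.sign(X @ b + b0) == y)
```

On this well-separated toy problem the learned rule classifies nearly all points correctly; the same loop scales to large $n$ because each step costs $O(np)$.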
Plot of the hinge loss
Fisher consistency of SVM
- Fisher consistency: suppose $f^*$ minimizes $E[(1 - Yf(X))_+]$. Then $\mathrm{sign}(f^*(x))$ is the Bayes rule for the classification problem.
- Proof: note that
  $E[(1 - Yf(X))_+ \mid X = x] = (1 - f(x))_+ P(Y = 1 \mid X = x) + (1 + f(x))_+ P(Y = -1 \mid X = x),$
  as a function of $f(x)$, is piecewise linear with three pieces: decreasing on $(-\infty, -1]$, linear on $(-1, 1]$, and increasing on $[1, \infty)$.
- The minimum is attained at $f(x) = 1$ if $P(Y = -1 \mid X = x) < P(Y = 1 \mid X = x)$ and at $f(x) = -1$ otherwise.
- This establishes Fisher consistency.
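The three-piece argument can be verified numerically: fix $p = P(Y = 1 \mid X = x)$ (here $p = 0.7$, an arbitrary choice above $1/2$) and minimize the conditional risk over a grid of $f$ values:

```python
import numpy as np

# Conditional hinge risk R(f) = (1 - f)_+ p + (1 + f)_+ (1 - p) at a fixed x.
p = 0.7                                   # arbitrary P(Y = 1 | X = x) > 1/2
f_grid = np.linspace(-3.0, 3.0, 6001)
risk = (np.maximum(1 - f_grid, 0) * p
        + np.maximum(1 + f_grid, 0) * (1 - p))
f_star = f_grid[np.argmin(risk)]          # minimizer; its sign is the Bayes rule
```

Since $p > 1/2$, the numerical minimizer sits at $f = 1$, whose sign agrees with the Bayes rule; taking $p < 1/2$ instead moves it to $f = -1$.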
Extension of the hinge loss
- The hinge loss is a special case of the so-called large-margin losses, of the form $\phi(yf)$ for some convex function $\phi$.
- Additional examples include:
  – binomial deviance: $\log(1 + e^{-yf})$;
  – squared loss: $(1 - yf)^2$;
  – squared hinge loss: $(1 - yf)_+^2$;
  – AdaBoost (exponential) loss: $\exp\{-yf\}$.
- A sufficient condition for Fisher consistency is that $\phi$ is differentiable at $0$ with $\phi'(0) < 0$.
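Each listed loss satisfies the sufficient condition; a quick numerical check of $\phi'(0) < 0$ using central differences:

```python
import numpy as np

# The large-margin losses phi(z) listed above, where z = y * f.
losses = {
    "hinge":        lambda z: np.maximum(1 - z, 0.0),
    "deviance":     lambda z: np.log(1 + np.exp(-z)),
    "squared":      lambda z: (1 - z) ** 2,
    "square_hinge": lambda z: np.maximum(1 - z, 0.0) ** 2,
    "adaboost":     lambda z: np.exp(-z),
}

# phi'(0) approximated by a central difference; all should be negative.
h = 1e-6
slopes = {name: (phi(h) - phi(-h)) / (2 * h) for name, phi in losses.items()}
```

For instance, the hinge loss has slope $-1$ at zero and the binomial deviance has slope $-1/2$, so both push the margin $yf$ upward near the decision boundary.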
SVM for regression
- The extension of SVM to continuous $Y$ is based on modifying the loss used in SVM.
- Consider a prediction $f(X)$ for a subject with feature $X$ whose true outcome is $Y$.
- The inaccuracy of the prediction can be characterized by the so-called $\epsilon$-insensitive loss:
  $L(Y, f(X)) = \max(|Y - f(X)| - \epsilon, 0).$
- The loss is zero if the prediction error is within $\epsilon$.
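A direct implementation of the loss (the value $\epsilon = 0.5$ is an arbitrary illustration):

```python
import numpy as np

def eps_insensitive(y, f, eps=0.5):
    """Epsilon-insensitive loss: zero inside the tube, linear outside it."""
    return np.maximum(np.abs(y - f) - eps, 0.0)

inside = eps_insensitive(1.0, 1.3)    # |error| = 0.3 < eps, so the loss is 0
outside = eps_insensitive(1.0, 2.0)   # |error| = 1.0, so the loss is 0.5
```

Errors smaller than $\epsilon$ are ignored entirely, which is what makes the fitted regression function depend only on the points at or outside the tube.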
ε-insensitive loss
Optimization problem in SVM for regression
- The objective function for a linear prediction rule is
  $\min \tfrac{1}{2}\|\beta\|^2 + C \sum_{i=1}^n (\xi_i + \xi_i') \quad \text{subject to}$
  $-\xi_i' - \epsilon \le Y_i - (\beta_0 + X_i^T\beta) \le \epsilon + \xi_i, \quad \xi_i \ge 0, \; \xi_i' \ge 0, \; i = 1, \ldots, n.$
- The dual problem is
  $\min \; \epsilon \sum_{i=1}^n (\alpha_i + \alpha_i') - \sum_{i=1}^n Y_i(\alpha_i - \alpha_i') + \tfrac{1}{2} \sum_{i=1}^n \sum_{j=1}^n (\alpha_i - \alpha_i')(\alpha_j - \alpha_j') X_i^T X_j$
  subject to
  $0 \le \alpha_i, \alpha_i' \le C, \quad \sum_{i=1}^n (\alpha_i - \alpha_i') = 0, \quad \alpha_i \alpha_i' = 0.$
- The prediction function is $\beta_0 + X^T\beta$ with
  $\beta = \sum_{i=1}^n (\alpha_i - \alpha_i') X_i.$
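The effect of the $\epsilon$-tube can be seen by brute-forcing the primal objective on a 1-D toy problem (the grid search is for illustration only; real solvers use the dual quadratic program):

```python
import numpy as np

# Brute-force the 1-D primal SVR objective
#   ||beta||^2 / 2 + C * sum_i max(|y_i - (beta0 + beta * x_i)| - eps, 0)
# on exactly linear toy data with true slope 2 and intercept 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
eps, C = 0.1, 10.0

best = (np.inf, 0.0, 0.0)               # (objective, beta0, beta)
for b0 in np.linspace(0.0, 2.0, 201):
    for b in np.linspace(1.0, 3.0, 201):
        slack = np.maximum(np.abs(y - (b0 + b * x)) - eps, 0.0).sum()
        obj = 0.5 * b ** 2 + C * slack
        if obj < best[0]:
            best = (obj, b0, b)

obj_star, b0_star, b_star = best
```

The optimizer flattens the slope slightly below the true value 2.0: errors inside the $\epsilon$-tube are free, so the $\|\beta\|^2/2$ term shrinks $\beta$ until the fit touches the tube boundary.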