A Guide to Support Vector Machines
Chih-Jen Lin
July 13, 2006
Contents
I Basic Topics 7
1 Introduction to Classification Problems 9
1.1 Data Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 Training and Testing Error . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Nearest Neighbor Methods . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Linear Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Overfitting and Underfitting . . . . . . . . . . . . . . . . . . . . . . . 12
1.6 Cross Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Support Vector Classification 17
2.1 Linear Separating Hyperplane with Maximal Margin . . . . . . . . . 17
2.2 Mapping Data to Higher Dimensional Spaces . . . . . . . . . . . . . . 19
2.3 The Dual Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Kernel and Decision Functions . . . . . . . . . . . . . . . . . . . . . . 23
2.5 Multi-class SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5.1 One-against-all Multi-class SVM . . . . . . . . . . . . . . . . . 26
2.5.2 One-against-one Multi-class SVM . . . . . . . . . . . . . . . . 26
2.6 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3 Training and Testing a Data set 29
3.1 Categorical Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Data Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.1 RBF Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.2 Cross-validation and Grid-search . . . . . . . . . . . . . . . . 32
3.4 A General Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5 Using LIBSVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.6 Other Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.7 A Large Practical Example . . . . . . . . . . . . . . . . . . . . . . . . 40
3.8 A Failed Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.9 Homework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4 Support Vector Regression 45
4.1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Support Vector Regression . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 A Practical Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
II Advanced Topics 51
5 The Dual Problem 53
5.1 SVM and Convex Optimization . . . . . . . . . . . . . . . . . . . . . 53
5.2 Lagrangian Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3 Proof of (5.5) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.4 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6 Solving the Quadratic Problems 65
6.1 Solving Optimization Problems . . . . . . . . . . . . . . . . . . . . . 65
6.2 The Decomposition Method . . . . . . . . . . . . . . . . . . . . . . . 66
6.3 Working Set Selection and Stopping Criteria . . . . . . . . . . . . . . 68
6.4 Analytical Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.5 The Calculation of b . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.6 Shrinking and Caching . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.6.1 Shrinking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.6.2 Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.7 Convergence of the Decomposition Method . . . . . . . . . . . . . . . 78
6.8 Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . 79
6.9 Computational Complexity for Multi-class SVM . . . . . . . . . . . . 80
6.10 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Preface
These are the course notes that I have been using in the past few years. The main
purpose is to let students without a machine learning background learn how to use
support vector machines effectively as a tool.
The first part (basic topics) is suitable for people who would like to use SVM. The
second part (advanced topics) is for people who would like to know implementation
details of SVM software.
Part I
Basic Topics
Chapter 1
Introduction to Classification Problems
1.1 Data Classification
The basic idea of the data classification problem can be described as follows: given
training data with known labels (classes), we would like to learn a model so that it
can be used to predict data with unknown labels.
For example, suppose the height and weight of the following six persons are
available and medical experts have identified that some of them are over-weighted or
under-weighted:
Table 1.1.1: Six training points
ID             1    2    3    4    5    6
Weight (kg)    50   60   70   70   80   90
Height (m)     1.6  1.7  1.9  1.5  1.7  1.6
Over-weighted  No   No   No   Yes  Yes  Yes
They are considered as “two classes” of data and the label is whether the person
is over- or under-weighted. The above data can also be illustrated by Figure 1.1.
Consulting with experts may be expensive, so we would like to construct a model
from available information. Then for any person, this model could easily predict
whether he/she is over- or under-weighted.
Figure 1.1: Six training points (weight on the horizontal axis, height on the vertical axis)
1.2 Training and Testing Error
A model can be a rule like
If weight ≥ 60, then over-weighted.
Clearly, this rule does not make sense, as some tall people may be thin even though
their weights are more than 60. A better model may be the following rule:
If weight/(height)² ≥ 23, then over-weighted.
The goal of classification is to identify a good model so that future prediction is accurate.
Here “weight” and “height” are called features or attributes. In statistics, they are
called variables. Each person is considered as a data instance (or a data observation).
Mathematically, we have x = [weight, height] as a data instance and y = 1 or −1
as the label of each instance. Here, there are six training instances x1, . . . ,x6 with
corresponding class labels y = [−1,−1,−1, 1, 1, 1]T (if −1 and 1 mean under-weighted
and over-weighted, respectively).
1.3 Nearest Neighbor Methods
Here, we introduce a simple classifier: nearest neighbor. For any new person, we check
that his/her (height, weight) is closest to which one in the training set. For example,
if a person has weight=70 and height=1.8, then the closest one in the training set is
the third person, who is under-weighted. Thus, the new one is predicted as under-
weighted as well. Note that weight and height are now in two very different ranges,
so we may have to scale them before calculating the distance. This issue will be
discussed in Chapter 3.2.
If the Euclidean distance is considered, then for any given x, the nearest neighbor
method essentially predicts it to be in the class of the training instance closest to x,
i.e., the one attaining arg min_i ‖x − xi‖².
We may worry that the closest instance is wrongly recorded data, so sometimes
the k closest points are considered and the prediction is by a majority vote. This
method is called k-nearest neighbor. How to select k is an issue and will be discussed
in Chapter 1.5.
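The nearest neighbor rule above can be sketched in a few lines. This is an illustrative sketch (not the exercise solution), using the six training points of Table 1.1.1; note that the features are not scaled, so weight dominates the Euclidean distance, illustrating the scaling issue just mentioned.

```python
# 1-nearest-neighbor sketch on the six (weight, height) training points.
import math

train = [((50, 1.6), -1), ((60, 1.7), -1), ((70, 1.9), -1),
         ((70, 1.5), 1), ((80, 1.7), 1), ((90, 1.6), 1)]

def predict_1nn(x):
    # Find the training instance closest to x and return its label.
    point, label = min(train, key=lambda t: math.dist(t[0], x))
    return label
```

For the person with weight 70 and height 1.8 discussed above, the closest training point is (70, 1.9), so the prediction is under-weighted (−1).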
1.4 Linear Classifiers
A model after the training procedure can be, for example, a rule set, or the whole
training set like that by the nearest neighbor. Here we show that a straight line can
be a model as well.
Figure 1.2: Example of a linear classifier
In Figure 1.2, a line
0.2× weight− 10× height + 3 = 0
separates all the training data. In general we represent such a line as
wTx + b = 0,
where x = [weight, height]T , w = [0.2,−10]T , and b = 3. Then for any new data x,
we check whether it is on the right- or left-hand side of the line. That is,
if wTx + b > 0, predict x as “over-weighted”,
if wTx + b < 0, predict x as “under-weighted”.
How to find such a straight line will be discussed in Chapter 2.
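The decision rule above can be sketched directly with the line of Figure 1.2, where w = [0.2, −10]T and b = 3:

```python
# Linear classifier sketch: predict by the sign of w^T x + b.
w, b = (0.2, -10.0), 3.0

def predict(x):
    # x = (weight, height)
    value = w[0] * x[0] + w[1] * x[1] + b
    return "over-weighted" if value > 0 else "under-weighted"
```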
1.5 Overfitting and Underfitting
(a) Training data and an overfitting classifier
(b) Applying an overfitting classifier on testing data
(c) Training data and a better classifier
(d) Applying a better classifier on testing data

Figure 1.3: An overfitting classifier and a better classifier (● and ▲: training data; ○ and △: testing data).
Note that it may not be useful to achieve high training accuracy (i.e., classifiers
accurately predict training data whose class labels are indeed known). A clear illus-
tration is in Figure 1.3. It is a problem with two classes of data: triangles and circles.
Filled circles and triangles are the training data while hollow circles and triangles
are the testing data. The testing accuracy of the classifier in Figures 1.3(a) and 1.3(b)
is not good since it overfits the training data. On the other hand, the classifier in
Figures 1.3(c) and 1.3(d) does not overfit the training data and hence gives better
testing accuracy.
Some training data may be wrongly recorded, so sometimes we should allow train-
ing errors. That is, under the obtained model, we predict some training data to be
in their opposite class. Note that if there are no duplicated data instances, we can
always fit training data so that training accuracy is 100%. An example is in Figure
1.4.
Figure 1.4: We can always achieve 100% training accuracy
Perfect training accuracy is not good, so we should avoid overfitting training data.
On the other hand, a good model should also avoid underfitting, which means the
model does not extract enough information from the training data. An example is in
Figure 1.5. Clearly, the linear classifier does not use the information that most circles
are at the upper-right corner and most triangles are at the lower-left. Therefore, from
the discussion in this section, we conclude that a good classifier should
avoid overfitting and avoid underfitting.
1.6 Cross Validation
The above discussion also hints at another important fact about classification problems:
Training accuracy is not important; only test accuracy counts.
Figure 1.5: An underfitting example
This statement is quite obvious as for training data we already know their class
labels. However, as the true class labels of test data are not known, how do we find
the performance on predicting them? A common way is to separate the training data
into two parts, of which one is considered unknown when training the classifier. Then
the prediction accuracy on this set can more precisely reflect the performance on
classifying unknown data. An improved version of this procedure is cross-validation.
In v-fold cross-validation, we first divide the training set into v subsets of equal
size. Sequentially one subset is tested using the classifier trained on the remaining
v − 1 subsets. Thus, each instance of the whole training set is predicted once so
the cross-validation accuracy is the percentage of data which are correctly classified.
Usually v ≥ 5 is used.
In Chapter 1.3, we mentioned the k-nearest neighbor method; the selection of
k can be done via cross validation. For example, we sequentially try k = 1, 3, 5, 7, . . .
and calculate the corresponding cross validation accuracy. The k with the highest
accuracy is the best and is used for future prediction. Note that we consider only odd
k, so the majority vote in prediction produces a single winner.
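The whole procedure, v-fold cross-validation plus selecting k among odd values, can be sketched as follows; the 1-D data set here is made up purely for illustration:

```python
# v-fold cross-validation for choosing k in k-nearest neighbor.
import math

def knn_predict(train, x, k):
    # Majority vote among the k closest training points (k odd).
    neighbors = sorted(train, key=lambda t: math.dist(t[0], x))[:k]
    return 1 if sum(label for _, label in neighbors) > 0 else -1

def cv_accuracy(data, k, v=5):
    folds = [data[i::v] for i in range(v)]   # v subsets of equal size
    correct = 0
    for i in range(v):
        rest = [t for j in range(v) if j != i for t in folds[j]]
        correct += sum(knn_predict(rest, x, k) == y for x, y in folds[i])
    return correct / len(data)               # each instance predicted once

# Made-up 1-D data: negative class at 0..9, positive class at 10..19.
data = [((float(i), 0.0), 1 if i >= 10 else -1) for i in range(20)]
best_k = max([1, 3, 5, 7], key=lambda k: cv_accuracy(data, k))
```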
1.7 Exercises
1. Write a k-nearest neighbor code to train
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/binary/ijcnn1.bz2
and test
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/binary/ijcnn1.t.bz2
You need to conduct cross-validation on the training data in order to select a
good k.
Chapter 2
Support Vector Classification
2.1 Linear Separating Hyperplane with Maximal Margin
The original idea of SVM classification is to use a linear separating hyperplane to
create a classifier. Given training vectors xi, i = 1, . . . , l of length n, and a vector y
defined as follows:

yi = +1 if xi is in class 1,
yi = −1 if xi is in class 2,
the support vector technique tries to find the separating hyperplane with the largest
margin between two classes, measured along a line perpendicular to the hyperplane.
For example, in Figure 2.1, two classes could be fully separated by a dotted line
wTx + b = 0. We would like to decide the line with the largest margin. In other
words, intuitively we think that the distance between two classes of training data
should be as large as possible. That means we find a line with parameters w and b
such that the distance between wTx + b = ±1 is maximized.
The distance between wTx + b = 1 and −1 can be calculated by the following
way. Consider a point x on wTx + b = −1:
[Figure: a point x on the line wTx + b = −1 and the point x + tw where, moving along the normal direction w, it touches the line wTx + b = 1.]
Figure 2.1: Separating hyperplane (the three lines are wTx + b = +1, 0, and −1)
As w is the “normal vector” of the line wTx + b = −1, w and the line are
perpendicular to each other. Starting from x and moving along the direction w, we
assume x + tw touches the line wTx + b = 1. Thus,
wT (x + tw) + b = 1 and wT x + b = −1.
Then, twTw = 2, so the distance (i.e., the length of tw) is ‖tw‖ = 2‖w‖/(wTw) =
2/‖w‖. Note that ‖w‖ = √(w₁² + · · · + wₙ²). As maximizing 2/‖w‖ is equivalent to
minimizing wTw/2, we have the following problem:
min_{w,b}   (1/2) wTw
subject to   yi(wTxi + b) ≥ 1, i = 1, . . . , l.        (2.1)
The constraint yi(wTxi + b) ≥ 1 means
wTxi + b ≥ 1 if yi = 1,
wTxi + b ≤ −1 if yi = −1.
That is, data in the class 1 must be on the right-hand side of wTx+b = 0 while data in
the other class must be on the left-hand side. Note that the reason of maximizing the
distance between wTx + b = ±1 is based on Vapnik’s Structural Risk Minimization
(Vapnik, 1998).
The following example gives a simple illustration of maximal-margin separating
hyperplanes:
Example 2.1.1 Given two training data in R1 as in the following figure:
[Figure: two training points on the real line: △ at 0, ○ at 1.]
What is the separating hyperplane ?
Now two data are x1 = 1,x2 = 0 with y = [+1,−1]T . Furthermore, w ∈ R1, so
(2.1) becomes
min_{w,b}   (1/2) w²
subject to   w · 1 + b ≥ 1,        (2.2)
             −1 · (w · 0 + b) ≥ 1.        (2.3)
From (2.3), −b ≥ 1. Putting this into (2.2), w ≥ 2. In other words, any (w, b)
which satisfies (2.2) and (2.3) has w ≥ 2. As we are minimizing (1/2)w², the smallest
possibility is w = 2. Thus, (w, b) = (2,−1) is the optimal solution. The separating
hyperplane is 2x− 1 = 0, in the middle of the two training data:
[Figure: △ at 0, ○ at 1, and the separating point • at x = 1/2.]
2.2 Mapping Data to Higher Dimensional Spaces
Figure 2.2: An example which is not linearly separable
In practice, however, problems may not be linearly separable; an example
is in Figure 2.2. That is, there is no (w, b) which satisfies the constraints of (2.1). In
this situation, we say (2.1) is “infeasible.” In (Cortes and Vapnik, 1995) the authors
introduced slack variables ξi, i = 1, . . . , l in the constraints:
min_{w,b,ξ}   (1/2) wTw + C ∑_{i=1}^{l} ξi
subject to     yi(wTxi + b) ≥ 1 − ξi,        (2.4)
               ξi ≥ 0, i = 1, . . . , l.
That is, constraints (2.4) allow that training data may not be on the correct side of
the separating hyperplane wTx + b = 0. This situation happens when ξi > 1 and an
example is in the following figure
wTxi + b = 1− ξi < −1
We require ξi ≥ 0 because if ξi < 0, then yi(wTxi + b) ≥ 1 − ξi ≥ 1 and the training
data is already on the correct side. The new problem is always feasible since for any (w, b), setting

ξi ≡ max(0, 1 − yi(wTxi + b)), i = 1, . . . , l,

makes (w, b, ξ) a feasible solution.
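This construction is easy to compute; below is a sketch with made-up data and an arbitrary (w, b):

```python
# Computing the slack values xi_i = max(0, 1 - y_i(w^T x_i + b)) that make
# (w, b, xi) feasible. Data and (w, b) are made-up illustrations.
X = [(1.0, 2.0), (2.0, 0.5), (-1.0, -1.0)]
y = [1, 1, -1]
w, b = (0.5, 0.5), -0.5

def slack(x, label):
    margin = label * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)

xis = [slack(x, label) for x, label in zip(X, y)]
# Every constraint y_i(w^T x_i + b) >= 1 - xi_i now holds by construction.
```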
Using this setting, we may worry that for linearly separable data, some ξi > 1
and hence corresponding data are wrongly classified. For the case that most data
except some noisy ones are separable by a linear function, we would like wTx+ b = 0
correctly classifies the majority of points. Thus, in the objective function we add a
penalty term C ∑_{i=1}^{l} ξi, where C > 0 is the penalty parameter. To keep the objective
value as small as possible, most ξi should be zero, so the constraint goes back to its
original form. Theoretically we can prove that if data are linearly separable and C is
larger than a certain number, problem (2.4) goes back to (2.1) and all ξi are zero
(Lin, 2001a).
Unfortunately, such a setting is not enough for practical use. If data are distributed
in a highly nonlinear way, employing only a linear function causes many training
instances to be on the wrong side of the hyperplane. So underfitting occurs and the
decision function does not perform well.
To fit the training data better, we may think of using a nonlinear curve like that
in Figure 2.2. The problem is that it is very difficult to model nonlinear curves. All
we are familiar with are elliptic, hyperbolic, or parabolic curves, which are far from
enough in practice. Instead of using more sophisticated curves, another approach is
to map data into a higher dimensional space. For example, in the example in Chapter 1,
each data instance has two features (attributes): height and weight. We may
consider two other attributes
height-weight, weight/(height)².
Such features may provide more information for separating under-weighted/over-weighted
people. Each new data instance is now in a four-dimensional space, so if the two new
features are good, it should be easier to have a separating hyperplane so that most ξi
are zero.
Thus SVM non-linearly transforms the original input space into a higher dimen-
sional feature space. More precisely, the training data x is mapped into a (possibly
infinite) vector in a higher dimensional space:
φ(x) = [φ1(x), φ2(x), . . .].
In this higher dimensional space, it is more likely that the data can be linearly
separated. An example mapping x from R³ to R¹⁰ is as follows:

φ(x) = (1, √2x₁, √2x₂, √2x₃, x₁², x₂², x₃², √2x₁x₂, √2x₁x₃, √2x₂x₃).
An extreme example is to map a data instance x ∈ R1 to an infinite dimensional
space:
φ(x) = [1, x/1!, x²/2!, x³/3!, . . .]T.
We then try to find a linear separating plane in a higher dimensional space so
(2.4) becomes
min_{w,b,ξ}   (1/2) wTw + C ∑_{i=1}^{l} ξi
subject to     yi(wT φ(xi) + b) ≥ 1 − ξi,        (2.5)
               ξi ≥ 0, i = 1, . . . , l.
2.3 The Dual Problem
The remaining problem is how to effectively solve (2.5). Especially after data are
mapped into a higher dimensional space, the number of variables (w, b) becomes very
large or even infinite. We handle this difficulty by solving the dual problem of (2.5):
min_α   (1/2) ∑_{i=1}^{l} ∑_{j=1}^{l} αiαjyiyjφ(xi)T φ(xj) − ∑_{i=1}^{l} αi
subject to   0 ≤ αi ≤ C, i = 1, . . . , l,        (2.6)
             ∑_{i=1}^{l} yiαi = 0.
This new problem of course has some relation with the original problem (2.5), and
we hope that it can be solved more easily. Sometimes we write (2.6) in a matrix form
for convenience:
min_α   (1/2) αT Qα − eT α
subject to   0 ≤ αi ≤ C, i = 1, . . . , l,        (2.7)
             yT α = 0.
In (2.7), e is the vector of all ones, C is the upper bound, Q is an l by l positive
semidefinite matrix, Qij ≡ yiyjK(xi,xj), and K(xi,xj) ≡ φ(xi)T φ(xj) is the kernel,
which will be addressed in Chapter 2.4.
If (2.7) is called the “dual” problem of (2.5), we refer to (2.5) as the “primal” problem.
Suppose (w, b, ξ) and α are optimal solutions of the primal and dual problems,
respectively. Then the following two properties hold:
w = ∑_{i=1}^{l} αiyiφ(xi)        (2.8)

(1/2) wT w + C ∑_{i=1}^{l} ξi = eT α − (1/2) αT Qα.        (2.9)
In other words, if the dual problem is solved with a solution α, the optimal primal
solution w is easily obtained from (2.8). Suppose an optimal b is also easily found,
the decision function is hence determined.
Thus, the crucial point is whether the dual is easier to solve than the primal.
The number of variables in the dual is the size of the training set, l, which is a fixed
number. In contrast, the number of variables in the primal problem varies depending
on how data are mapped to a higher dimensional space. Therefore, moving from the
primal to the dual means that we solve a finite-dimensional optimization problem
instead of a possibly infinite-dimensional problem.
We illustrate this primal-dual relationship using data in Example 2.1.1 without
mapping them to a higher dimensional space. As the problem is linearly separable, it
is fine to consider the formulation (2.1) without slack variables ξi, i = 1, . . . , l. Then
the dual is
min_α   (1/2) ∑_{i=1}^{l} ∑_{j=1}^{l} αiαjyiyjxiT xj − ∑_{i=1}^{l} αi
subject to   0 ≤ αi, i = 1, . . . , l,
             ∑_{i=1}^{l} yiαi = 0.
Using data in Example 2.1.1, the objective function is
(1/2)α₁² − (α₁ + α₂)
= (1/2) [α₁ α₂] [1 0; 0 0] [α₁; α₂] − [1 1] [α₁; α₂].
Constraints are
α1 − α2 = 0, 0 ≤ α1, 0 ≤ α2.
Substituting α2 = α1 into the objective function,
(1/2)α₁² − 2α₁
has the smallest value at α1 = 2. As [2, 2]T satisfies constraints 0 ≤ α1 and 0 ≤ α2,
it is the optimal solution. Using the primal-dual relation (2.8),
w = y₁α₁x₁ + y₂α₂x₂ = 1 · 2 · 1 + (−1) · 2 · 0 = 2,

the same as what is obtained by directly solving the primal problem.
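This one-variable dual can also be checked numerically; a coarse grid search is enough for this sketch:

```python
# Checking the dual of Example 2.1.1: with alpha2 = alpha1 (from the
# equality constraint alpha1 - alpha2 = 0), the objective reduces to
# 0.5*a1**2 - 2*a1 over a1 >= 0.
def dual_obj(a1):
    a2 = a1
    return 0.5 * a1 ** 2 - (a1 + a2)

a_star = min((a / 100.0 for a in range(501)), key=dual_obj)  # grid on [0, 5]
# Primal-dual relation (2.8): w = y1*a1*x1 + y2*a2*x2.
w = 1 * a_star * 1 + (-1) * a_star * 0
```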
The calculation of b is easy, but is left in Chapter 6 for implementation details. The
remaining issue of using the dual problem concerns the inner product φ(xi)T φ(xj).
If φ(x) is an infinitely long vector, there is no way to fully write it down and then
calculate the inner product. Thus, even though the dual possesses the advantage of
having a finite number of variables, we cannot even write the problem down before
solving it. This is resolved by using special mapping functions φ so that φ(xi)T φ(xj)
can be efficiently calculated. Details are in the next section.
2.4 Kernel and Decision Functions
Consider a special φ(x) mentioned earlier (assume x ∈ R3):
φ(x) = (1, √2x₁, √2x₂, √2x₃, x₁², x₂², x₃², √2x₁x₂, √2x₁x₃, √2x₂x₃).
In this case it is easy to see that φ(xi)T φ(xj) = (1 + xiT xj)², which is easier to
calculate than a direct inner product. To be more precise, a direct calculation
of φ(xi)T φ(xj) takes 10 multiplications and 9 additions, but using (1 + xiT xj)², only
four multiplications and three additions are needed. Therefore, if a special φ(x) is
considered, even though it is a long vector, φ(xi)T φ(xj) may still be easily available.
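This identity can be verified numerically; the two test points below are made up:

```python
# Verifying phi(x)^T phi(z) == (1 + x^T z)^2 for the degree-2 mapping
# of x in R^3 given above.
import math

def phi(x):
    s = math.sqrt(2.0)
    x1, x2, x3 = x
    return [1.0, s * x1, s * x2, s * x3, x1 * x1, x2 * x2, x3 * x3,
            s * x1 * x2, s * x1 * x3, s * x2 * x3]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0, 3.0), (0.5, -1.0, 2.0)
explicit = dot(phi(x), phi(z))     # inner product of two R^10 vectors
kernel = (1.0 + dot(x, z)) ** 2    # far fewer operations
```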
We call such inner products the “kernel function.” Some popular kernels are, for
example,
1. e^{−γ‖xi−xj‖²} (Gaussian kernel or radial basis function (RBF) kernel),
2. (xiT xj/γ + δ)^d (polynomial kernel),
where γ, d, and δ are kernel parameters. The following calculation shows that the
Gaussian (RBF) kernel indeed is an inner product of two vectors in an infinite di-
mensional space. Assume x ∈ R1 and γ > 0.
e^{−γ‖xi−xj‖²} = e^{−γ(xi−xj)²}
= e^{−γxi² + 2γxixj − γxj²}
= e^{−γxi² − γxj²} (1 + (2γxixj)/1! + (2γxixj)²/2! + (2γxixj)³/3! + · · ·)
= e^{−γxi² − γxj²} (1 · 1 + √(2γ/1!) xi · √(2γ/1!) xj + √((2γ)²/2!) xi² · √((2γ)²/2!) xj²
  + √((2γ)³/3!) xi³ · √((2γ)³/3!) xj³ + · · ·)
= φ(xi)T φ(xj),

where

φ(x) = e^{−γx²} [1, √(2γ/1!) x, √((2γ)²/2!) x², √((2γ)³/3!) x³, · · ·]T.
Note that γ > 0 is needed for the existence of terms such as √(2γ/1!), √((2γ)³/3!), etc.
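The expansion can be checked numerically by truncating the series; γ and the two points below are made-up values:

```python
# Truncated series check of e^{-gamma (xi - xj)^2} = phi(xi)^T phi(xj)
# for x in R^1.
import math

gamma, xi, xj = 0.7, 0.9, -0.4

def phi(x, terms=30):
    # k-th component: exp(-gamma x^2) * sqrt((2 gamma)^k / k!) * x^k
    return [math.exp(-gamma * x * x) *
            math.sqrt((2 * gamma) ** k / math.factorial(k)) * x ** k
            for k in range(terms)]

series = sum(a * b for a, b in zip(phi(xi), phi(xj)))
exact = math.exp(-gamma * (xi - xj) ** 2)
# series converges to exact as the number of terms grows
```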
After (2.7) is solved with a solution α, the vectors xi for which αi > 0 are called
support vectors. Then, from (2.8), the decision function is written as

f(x) = sign(wT φ(x) + b) = sign( ∑_{i=1}^{l} yiαiφ(xi)T φ(x) + b ).        (2.10)
In other words, for a test vector x, if ∑_{i=1}^{l} yiαiφ(xi)T φ(x) + b > 0, we classify it to
be in the class 1. Otherwise, we think it is in the second class. We can see that only
support vectors will affect results in the prediction stage. In general, the number of
support vectors is not large. Therefore we can say SVM is used to find important
data (support vectors) from training data.
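A small sketch of prediction via (2.10): only the instances with αi > 0 appear in the sum. The support vectors, coefficients, b, and γ below are made-up values, not a trained model:

```python
# Evaluating the decision function (2.10) with an RBF kernel.
import math

def rbf(x, z, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

# (x_i, y_i, alpha_i) for the support vectors only (alpha_i > 0).
support = [((0.0, 1.0), 1, 0.8), ((1.0, 0.0), -1, 0.8)]
b = 0.0

def decide(x):
    s = sum(alpha * y * rbf(xi, x) for xi, y, alpha in support) + b
    return 1 if s > 0 else -1
```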
We use Figure 2.2 as an illustration. Two classes of training data are not linearly
separable. Using the RBF kernel, we obtain a hyperplane wTφ(x) + b = 0. In the
original space, it is indeed a nonlinear curve
∑_{i=1}^{l} yiαiφ(xi)T φ(x) + b = 0.        (2.11)
In the figure, the support vectors (marked as +) are selected from both classes of
training data. Clearly, support vectors, being close to the nonlinear curve (2.11),
are the more important points.
Figure 2.3: Support vectors (marked as +) are important data from training data
2.5 Multi-class SVM
The discussion so far assumes that data are in only two classes. Many practical
applications involve more classes. For example, hand-written digit recognition
considers data in 10 classes: digits 0 to 9. There are many ways to extend SVM for
such cases. Here, we discuss two simple methods.
2.5.1 One-against-all Multi-class SVM
This commonly mis-named method should be called “one-against-the-rest.” It
constructs binary SVM models so that each one is trained with one class as positive and
the rest as negative. We illustrate this method by a simple situation of four classes.
The four two-class SVMs are
yi = 1    yi = −1          Decision function
class 1   classes 2,3,4    f^1(x) = (w^1)T x + b^1
class 2   classes 1,3,4    f^2(x) = (w^2)T x + b^2
class 3   classes 1,2,4    f^3(x) = (w^3)T x + b^3
class 4   classes 1,2,3    f^4(x) = (w^4)T x + b^4
For any test data x, if it is in the ith class, we would expect that

f^i(x) ≥ 1 and f^j(x) ≤ −1 for j ≠ i.

This “expectation” directly follows from our setting of training the four two-class
problems and from the assumption that data are correctly separated. Therefore, f^i(x)
has the largest value among f^1(x), . . . , f^4(x), and hence the decision rule is

Predicted class = arg max_{i=1,...,4} f^i(x).
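The rule above can be sketched as follows; the four linear decision functions use made-up weights:

```python
# One-against-the-rest: pick the class whose decision function
# f^i(x) = (w^i)^T x + b^i is largest. The weights are made up.
models = {  # class label -> (w, b)
    1: ((1.0, 0.0), 0.0),
    2: ((0.0, 1.0), 0.0),
    3: ((-1.0, 0.0), 0.0),
    4: ((0.0, -1.0), 0.0),
}

def predict(x):
    def f(w, b):
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(models, key=lambda c: f(*models[c]))
```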
2.5.2 One-against-one Multi-class SVM
This method also constructs several two-class SVMs, but each one uses training data
from only two different classes. Thus, this method is sometimes called a “pairwise”
approach. For the same example of four classes, six two-class problems are constructed:
yi = 1    yi = −1    Decision function
class 1   class 2    f^12(x) = (w^12)T x + b^12
class 1   class 3    f^13(x) = (w^13)T x + b^13
class 1   class 4    f^14(x) = (w^14)T x + b^14
class 2   class 3    f^23(x) = (w^23)T x + b^23
class 2   class 4    f^24(x) = (w^24)T x + b^24
class 3   class 4    f^34(x) = (w^34)T x + b^34
For any test data x, we put it into the six functions. If the problem of classes
i and j indicates the data x should be in i, the class i gets one vote. For example,
assume
Classes    Winner
1 vs 2     1
1 vs 3     1
1 vs 4     1
2 vs 3     2
2 vs 4     4
3 vs 4     3
Then, we have
class      1  2  3  4
# votes    3  1  1  1
Thus, x is predicted to be in the first class.
For a data set with k different classes, this method constructs k(k − 1)/2 two-class
SVMs. We may worry that sometimes more than one class obtains the highest
number of votes. In practice this situation does not happen often, and there are
further strategies to handle it.
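The voting step can be sketched as follows, reusing the pairwise outcomes of the example table:

```python
# One-against-one voting: each pairwise winner gets one vote; predict
# the class with the most votes.
from collections import Counter

pairwise_winner = {(1, 2): 1, (1, 3): 1, (1, 4): 1,
                   (2, 3): 2, (2, 4): 4, (3, 4): 3}

votes = Counter(pairwise_winner.values())
predicted = max(votes, key=lambda c: votes[c])
```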
2.6 Notes
The formulas of SVM were developed in (Boser et al., 1992; Cortes and Vapnik, 1995),
where the mapping function and the dual problem are introduced. Other general SVM
references are, for example, (Cristianini and Shawe-Taylor, 2000; Scholkopf and Smola,
2002).
Many works have shown that data in a higher dimensional space have a larger
chance of being separable (e.g., Cover (1965)). Such results explain why our mapping
here should be useful.
The one-against-one method was introduced in (Knerr et al., 1990; Friedman,
1996), and its first use for SVM was in (Kreßel, 1999). A comparison of one-against-all,
one-against-one, and other approaches for multi-class SVM is (Hsu and Lin, 2002).
2.7 Exercises
1. Given three training data in R2 as in the following figure:
[Figure: three training points in R²: ○ at (0, 0); ● at (0, 1) and (1, 0).]
What is the separating hyperplane ?
2. Given four training data in R2 as in the following figure:
[Figure: four training points in R²: ○ at (0, 0) and (1, 1); ● at (0, 1) and (1, 0).]
What is the separating hyperplane ?
3. Solve problem 2 via its dual optimization formula.
4. Assume x ∈ R² and γ > 0. Show that e^{−γ‖xi−xj‖²} can be written in the form φ(xi)T φ(xj).
Chapter 3
Training and Testing a Data set
3.1 Categorical Features
SVM requires that each data instance is represented as a vector of real numbers.
Hence, if there are categorical attributes, we first have to convert them into numeric
data:
1. Use one integer number to represent an m-category attribute. For example, a
three-category attribute such as {red, green, blue} can be represented as 1,2,3.
2. Use m binary values to represent an m-category attribute. Only one of the m
numbers is one, and others are zero. Thus, {red, green, blue} can be represented
as (0,0,1), (0,1,0), and (1,0,0).
Our experience indicates that if the number of values in an attribute is not too large,
the second coding might be more stable than using a single number to represent a
categorical attribute.
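The two encodings can be sketched as follows; which position of the m binary values holds the 1 is an arbitrary but fixed choice:

```python
# Two ways to encode a three-category attribute {red, green, blue}.
categories = ["red", "green", "blue"]

def as_integer(value):
    # Encoding 1: a single integer 1, 2, or 3.
    return categories.index(value) + 1

def as_binary(value):
    # Encoding 2: m binary values, exactly one of them set to 1.
    return [1 if c == value else 0 for c in categories]
```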
3.2 Data Scaling
Assume there are three training instances as follows
      height  gender  y
x1    150     F       −1
x2    180     M        1
x3    185     M        1
The attribute “height” is in centimeters. Clearly, the second attribute, “gender,” is
consistent with the target class y and hence should be more important. However, if F and M are
transformed to be 0 and 1, respectively, training data and the separating hyperplane
are
[Figure: the three training points and the separating hyperplane.]
The separating hyperplane is nearly a vertical line, so the decision strongly de-
pends on the first attribute. This result is not good as the second attribute should
play a more important role.
If we linearly scale the first attribute to the range [0, 1] by

(1st attribute − 150) / (185 − 150),
then new points and the separating hyperplane are
[Figure: the scaled training points and the new separating hyperplane.]
This, transformed back to the original space, is
[Figure: the hyperplane in the original space.]
Therefore, the second attribute plays a role in the decision function.
This example explains that when features are in different numerical ranges, those
in larger ranges may dominate the others. Thus, a proper scaling of features before
training SVM can be very important.
Another reason for doing data scaling is to avoid numerical difficulties during the
calculation. For example, if the polynomial kernel
(xiT xj + 1)⁸        (3.1)
is used, the first attribute, which ranges from 150 to 185, will cause the value of (3.1)
to be larger than (10²)⁸ = 10¹⁶. Computer overflow easily happens when dealing with
such numbers. Moreover, if the following RBF kernel is used:
e^{−‖xi−xj‖²},        (3.2)

we have values smaller than e^{−10000}. This is so small (< 10^{−300}) that the decision
function has

∑_{i=1}^{l} αiyiK(xi, x) + b ≈ b,
if x 6= xi, i = 1, . . . , l. Apparently, this decision function is not good.
A simple linear scaling is formally stated as follows. Assume Mi and mi are
respectively the largest and smallest values of the ith attribute, and we would like to
scale the ith attribute to the range [−1, +1]:
[mi is mapped to −1 and Mi to +1.]
If x is the original value of the ith attribute in one data instance, the new value
should be
x′ = (x − (Mi + mi)/2) / ((Mi − mi)/2) = 2(x − mi)/(Mi − mi) − 1.
There are many other possible ways of data scaling, but they will not be discussed here.
Of course we must use the same method to scale testing data before testing. For
example, suppose that we scaled the first attribute of training data from [-10, +10]
to [-1, +1]. If the first attribute of testing data is lying in the range [-11, +8], we
must scale the testing data to [-1.1, +0.8].
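The procedure can be sketched as follows, reproducing the [−10, +10] example above:

```python
# Fit the linear scaling x' = 2(x - m)/(M - m) - 1 on the training range,
# then apply the SAME transformation to testing data.
def fit_scaler(train_column):
    m, M = min(train_column), max(train_column)
    def scale(x):
        return 2.0 * (x - m) / (M - m) - 1.0
    return scale

scale = fit_scaler([-10.0, 0.0, 10.0])   # training range [-10, +10]
low, high = scale(-11.0), scale(8.0)     # testing values map to -1.1, 0.8
```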
Data scaling is important for many other classification methods. Sarle (1997)
explains why we scale data while using Neural Networks, and most considerations
also apply to SVM.
Some ask about the difference between scaling each feature to [−1, +1] and to [0, +1]. In Homework 1, we show that for the linear and RBF kernels, if different parameters have been considered, the two choices are fully equivalent. However, for polynomial kernels they are different: scaling to [0, 1] causes all kernel elements to be nonnegative. It is not yet clear whether this is a good property or not.
3.3 Model Selection
Though only a few common kernels are mentioned in Chapter 2, we must decide which one to try first. Then we also need to choose the penalty parameter C and the kernel parameters.
3.3.1 RBF Kernel
We suggest that in general the RBF kernel is a reasonable first choice. The RBF kernel nonlinearly maps samples into a higher dimensional space, so, unlike the linear kernel, it can handle the case where the relation between class labels and attributes is nonlinear. Furthermore, the linear kernel is a special case of RBF, as Keerthi and Lin (2003) show that the linear kernel with a penalty parameter C has the same performance as the RBF kernel with some parameters (C, γ).
The second reason is the number of hyperparameters, which influences the complexity of model selection. The polynomial kernel has more hyperparameters than the RBF kernel.
Finally, the RBF kernel has fewer numerical difficulties. One key point is that 0 < K(xi, xj) ≤ 1, in contrast to polynomial kernels, whose values may go to infinity (xiᵀxj/γ + δ > 1) or zero (xiᵀxj/γ + δ < 1) when the degree is large.
3.3.2 Cross-validation and Grid-search
There are two parameters to choose when using the RBF kernel: C and γ. It is not known beforehand which C and γ are best for a given problem; consequently, some kind of model selection (parameter search) must be done. The goal is to identify good (C, γ) so that the classifier can accurately predict unknown data (i.e., testing data). Note that Chapter 1.5 has explained that it may not be useful to achieve high training accuracy (i.e., a classifier that accurately predicts training data whose class labels are indeed known). Similar to selecting k for k-nearest neighbors in Chapter 1.6, cross validation estimates the performance of the model.
We recommend a “grid-search” on C and γ using cross-validation. Basically, pairs of (C, γ) are tried and the one with the best cross-validation accuracy is picked. We found that trying exponentially growing sequences of C and γ is a practical method to identify good parameters (for example, C = 2^−5, 2^−3, …, 2^15 and γ = 2^−15, 2^−13, …, 2^3).
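Generating this exponential grid is straightforward in Python (a sketch; each pair would then be scored by cross-validation, e.g. with LIBSVM's svm-train -v option):

```python
# Exponentially growing sequences of C and gamma, as recommended.
C_range = [2.0 ** p for p in range(-5, 16, 2)]      # 2^-5, 2^-3, ..., 2^15
gamma_range = [2.0 ** p for p in range(-15, 4, 2)]  # 2^-15, 2^-13, ..., 2^3

# Every (C, gamma) pair is tried; the evaluations are independent of
# each other, so they can run in parallel.
grid = [(C, g) for C in C_range for g in gamma_range]
print(len(grid))  # 110 candidate pairs
```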
The grid-search is straightforward but may seem naive. In fact, there are several advanced methods which can save computational cost by, for example, approximating the cross-validation rate. However, there are two reasons why we prefer the simple grid-search approach.
One is that, psychologically, we may not feel safe using methods which avoid an exhaustive parameter search by approximations or heuristics. The other reason is that the computational time needed to find good parameters by grid-search is not much more than that of advanced methods, since there are only two parameters. Furthermore, the grid-search can be easily parallelized because each (C, γ) is independent. Many advanced methods are iterative processes, e.g., walking along a path, which might be difficult to parallelize.
Figure 3.1: Loose grid search on C = 2^−5, 2^−3, …, 2^15 and γ = 2^−15, 2^−13, …, 2^3.
Since doing a complete grid-search may still be time-consuming, we recommend using a coarse grid first. After identifying a “better” region on the grid, a finer grid search on that region can be conducted. To illustrate this, we do an experiment on the problem german from the Statlog collection (Michie et al., 1994). After scaling this set, we first use a coarse grid (Figure 3.1) and find that the best (C, γ) is (2^3, 2^−5) with the cross-validation rate 77.5%. Next we conduct a finer grid search on the neighborhood of (2^3, 2^−5) (Figure 3.2) and obtain a better cross-validation rate of 77.6% at (2^3.25, 2^−5.25). After the best (C, γ) is found, the whole training set is trained again to generate the final classifier. Note that there is no need to conduct a very fine grid
Figure 3.2: Fine grid-search on C = 2^1, 2^1.25, …, 2^5 and γ = 2^−7, 2^−6.75, …, 2^−3.
search. Figure 3.1 clearly shows that good parameters are in a quite wide region.
The above approach works well for problems with thousands or more data points.
For very large data sets, a feasible approach is to randomly choose a subset of the data set, conduct grid-search on it, and then do a better-region-only grid-search on the complete data set.
3.4 A General Procedure
To use SVM, we propose trying the following procedure first:
• Conduct simple scaling on the data.
• Consider the RBF kernel K(x, y) = e^(−γ‖x−y‖²).
• Use cross-validation to find the best parameters C and γ.
• Use the best parameters C and γ to train on the whole training set.
• Test.
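The whole procedure can be sketched with scikit-learn, whose SVC classifier is built on LIBSVM (a sketch on synthetic data, not part of the original guide):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for a real data set, to keep the sketch self-contained.
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. Simple scaling to [-1, 1]; factors come from the training set only.
scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# 2-3. RBF kernel + cross-validation over an exponential (C, gamma) grid.
search = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": [2.0 ** p for p in range(-5, 16, 2)],
     "gamma": [2.0 ** p for p in range(-15, 4, 2)]},
    cv=5,
)
search.fit(X_train_s, y_train)

# 4-5. GridSearchCV retrains on the whole training set with the best
# parameters (refit=True by default); then we test.
print(search.best_params_)
print(search.score(X_test_s, y_test))
```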
3.5 Using LIBSVM
We use LIBSVM, a library for support vector machines, to demonstrate the training
and testing procedure (Chang and Lin, 2001b). It is available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm
Instructions for installation on different platforms are in the README file of the
package.
The format of a training or testing data file is:

<label> <index1>:<value1> <index2>:<value2> ...

<label> is the target value of the training data. For classification, it should be an integer which identifies a class (multi-class classification is supported). <index> is an integer starting from 1 and <value> is a real number. The indices must be in ascending order. An example follows:
1.0 1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02
1.0 1:5.707397e+01 2:2.214040e+02 3:8.607959e-02 4:1.229114e+02
1.0 1:1.725900e+01 2:1.734360e+02 3:-1.298053e-01 4:1.250318e+02
1.0 1:2.177940e+01 2:1.249531e+02 3:1.538853e-01 4:1.527150e+02
0.0 1:2.391101e+01 2:3.890001e+01 3:4.704049e-01 4:1.257871e+02
0.0 1:2.230670e+01 2:2.262220e+01 3:2.117224e-01 4:1.012818e+02
0.0 1:1.640820e+01 2:3.920219e+01 3:-9.912787e-02 4:3.248707e+01
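Such lines are easy to parse; a small Python sketch (the function name is ours):

```python
def parse_libsvm_line(line):
    """Parse one '<label> <index>:<value> ...' line into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    feats = {}
    for item in parts[1:]:
        idx, val = item.split(":")
        feats[int(idx)] = float(val)
    return label, feats

label, feats = parse_libsvm_line(
    "1.0 1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02")
print(label, feats[4])  # 1.0 125.1225
```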
Clearly this data set contains four features. Next we consider a data set in
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/data/train.1
It includes 3,089 training instances.
A simple training shows
$./svm-train train.1
......*
optimization finished, #iter = 6131
nu = 0.606144
obj = -1061.528899, rho = -0.495258
nSV = 3053, nBSV = 724
Total nSV = 3053
From the output, obj is the optimal objective value of the dual SVM problem (2.7). The value rho is −b in the decision function (2.10). nSV and nBSV are the numbers of support vectors (i.e., 0 < αi ≤ C) and bounded support vectors (i.e., αi = C). Other information such as #iter will be explained in Chapter 6.
The training procedure generates a model file train.1.model, which includes information such as support vectors and dual optimal solutions. The test file has the same format as the training file. If labels of the testing data are available, we can calculate the test accuracy. If they are unknown, this column must still be filled with an arbitrary number. The testing procedure is as follows:
$./svm-predict test.1 train.1.model test.1.predict
Accuracy = 66.925% (2677/4000)
The test accuracy is not satisfactory. If we predict the training data using the same
model:
$./svm-predict train.1 train.1.model o
Accuracy = 99.7734% (3082/3089)
We find that training and testing accuracy are rather different. In fact, overfitting has occurred. From the discussion earlier in this chapter, we understand that scaling and parameter selection may be needed.
Thus, we use the program svm-scale provided in LIBSVM to conduct data scaling:
$./svm-scale -l -1 -u 1 train.1 > train.1.scale
This means that each attribute is linearly scaled to the range [−1, 1]. A common mistake is to then scale the test data in the same way:
$./svm-scale -l -1 -u 1 test.1 > test.1.scale
Remember that we should use the same scaling factors for the training and testing sets. A correct way is:
$./svm-scale -s range1 train.1 > train.1.scale
$./svm-scale -r range1 test.1 > test.1.scale
That is, we store the scaling factors used in training and apply them to the testing set. By training and predicting the scaled sets we obtain:
$./svm-train train.1.scale
$./svm-predict test.1.scale train.1.scale.model test.1.predict
→ Accuracy = 96.15%
The test accuracy is now much better. The parameters used so far are the default ones: C = 1 and the RBF kernel with γ = 1/n = 0.25, where n is the number of features. Note that different parameters can lead to very different performance. For example, if we use C = 20, γ = 400:
$./svm-train -c 20 -g 400 train.1.scale
$./svm-predict train.1.scale train.1.scale.model o
Accuracy = 100% (3089/3089) (classification)
we obtain 100% training accuracy but very bad accuracy
$./svm-predict test.1.scale train.1.scale.model o
Accuracy = 82.7% (3308/4000) (classification)
Thus parameter selection is quite important.
In LIBSVM there is a simple tool grid.py for parameter selection:
$./grid.py train.1.scale
[local] -1 -7 85.1408 (best c=0.5, g=0.0078125, rate=85.1408)
[local] 5 -7 95.4354 (best c=32.0, g=0.0078125, rate=95.4354)
...
Best c=2.0, g=2.0
A contour like Figures 3.1 and 3.2 showing cross validation accuracy is generated.
3.6 Other Examples
Table 3.6.1 presents some real-world examples. These data sets were reported by users who could not obtain reasonable accuracy in the beginning. Using the procedure illustrated in Section 3.4, we helped them achieve better performance.
∗ Courtesy of Jan Conrad from Uppsala University, Sweden.
† Courtesy of Cory Spencer from Simon Fraser University, Canada (Gardy et al., 2003).
‡ Courtesy of a user from Germany.
§ As there are no testing data, cross-validation accuracy instead of testing accuracy is presented here.
Table 3.6.1: Problem characteristics and performance comparisons.

Application      #training  #testing  #features  #classes  Accuracy  Accuracy by
                 data       data                           by users  our procedure
Astroparticle∗   3,089      4,000     4          2         75.2%     96.9%
Bioinformatics†  391        0§        20         3         36%       85.2%
Vehicle‡         1,243      41        21         2         4.88%     87.8%
These sets are at http://www.csie.ntu.edu.tw/~cjlin/papers/guide/data/.
The first set has been discussed in Section 3.5. Here we present details of the other two. We also demonstrate how to use the script easy.py, which performs exactly the general procedure of Section 3.4.
• Bioinformatics
– Original sets with default parameters
$./svm-train -v 5 train.2
→ Cross Validation Accuracy = 56.5217%
– Scaled sets with default parameters
$./svm-scale -l -1 -u 1 train.2 > train.2.scale
$./svm-train -v 5 train.2.scale
→ Cross Validation Accuracy = 78.5166%
– Scaled sets with parameter selection
$python grid.py train.2.scale
...
2.0 0.5 85.1662
→ Cross Validation Accuracy = 85.1662%
(Best C=2.0, γ=0.5 with five fold cross-validation rate=85.1662%)
– Using an automatic script
$python easy.py train.2
Scaling training data...
Cross validation...
Best c=2.0, g=0.5
Training...
• Vehicle
– Original sets with default parameters
$./svm-train train.3
$./svm-predict test.3 train.3.model test.3.predict
→ Accuracy = 2.43902%
– Scaled sets with default parameters
$./svm-scale -l -1 -u 1 -s range3 train.3 > train.3.scale
$./svm-scale -r range3 test.3 > test.3.scale
$./svm-train train.3.scale
$./svm-predict test.3.scale train.3.scale.model test.3.predict
→ Accuracy = 12.1951%
– Scaled sets with parameter selection
$python grid.py train.3.scale
...
128.0 0.125 84.8753
(Best C=128.0, γ=0.125 with five-fold cross-validation rate=84.8753%)
$./svm-train -c 128 -g 0.125 train.3.scale
$./svm-predict test.3.scale train.3.scale.model test.3.predict
→ Accuracy = 87.8049%
– Using an automatic script
$python easy.py train.3 test.3
Scaling training data...
Cross validation...
Best c=128.0, g=0.125
Training...
Scaling testing data...
Testing...
Accuracy = 87.8049% (36/41) (classification)
Figure 3.3: Cross validation using 3,000 points
3.7 A Large Practical Example
In this section we discuss how to deal with a large and practical data set. It comes from the first problem of IJCNN Challenge 2001, organized by Ford Scientific Research Labs (Prokhorov, 2001). We summarize the approach of the winning entry (Chang and Lin, 2001a). The training set consists of 50,000 instances like the following:
0.000000 -0.999991 0.169769 0.000000 1.000000
0.000000 -0.659538 0.169769 0.000292 1.000000
0.000000 -0.660738 0.169128 -0.020372 1.000000
1.000000 -0.660307 0.169128 0.007305 1.000000
0.000000 -0.660159 0.169525 0.002519 1.000000
0.000000 -0.659091 0.169525 0.018198 1.000000
0.000000 -0.660532 0.169525 -0.024526 1.000000
0.000000 -0.659798 0.169525 0.012458 1.000000
Figure 3.4: Cross validation using 10,000 points
They are results at 50,000 time points and hence form a time series. There are 100,000 testing points. The kth instance contains

    x1(k), x2(k), x3(k), x4(k), x5(k), y(k),

where y(k) = ±1 is the class label. For time-series data, past and future information may affect the current class label. Moreover, it is known that the fifth attribute x5(k) is independent of y(k); a test instance is considered for evaluation only if x5(k) = 1. Therefore, among the 100,000 test instances, only around 90,000 are evaluated. It is also known that x4(k) is more important than the other features.
To begin, we analyze the features in more detail. The first feature x1(k) has a periodicity such that sequentially we have nine 0s, one 1, nine 0s, one 1, and so on. The other attributes, x2(k), …, x4(k), are real numbers in the range of ±1.5. An interesting
Figure 3.5: Cross validation using 50,000 points
observation is that for the 50,000 training data, 90% of the y(k) are −1. Thus, simply guessing that all test data are −1 already gives about 90% accuracy. The difficulty is how to use learning techniques to achieve higher accuracy.
To use SVM for constructing a model, first we have to decide the attributes (i.e., features) of each instance. There are many possible variables which may affect y(k). In addition, for each attribute we need an encoding scheme. For example, to represent the periodicity of x1(k), we can include x1(k−5), …, x1(k+4) as 10 binary attributes of the kth instance. Alternatively, we can use a single integer between 1 and 10 which indicates the position of the 1 in x1(k−5), …, x1(k+4). Based on our experience, we choose the former way as it might be better for support vector machines.
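A sketch of the former encoding (the helper name is ours, and boundary handling is simplified to zero-padding):

```python
def encode_x1_window(x1, k):
    """10 binary attributes x1(k-5), ..., x1(k+4) for the kth instance
    (indices outside the series are padded with 0.0 in this sketch)."""
    n = len(x1)
    return [x1[t] if 0 <= t < n else 0.0 for t in range(k - 5, k + 5)]

# x1 is periodic: nine 0s, one 1, nine 0s, one 1, ...
x1 = ([0.0] * 9 + [1.0]) * 5
window = encode_x1_window(x1, k=12)
print(window)             # exactly one of the ten attributes is 1
print(window.index(1.0))  # the position of the 1 within the window
```

The alternative encoding would simply be this position plus one, a single integer between 1 and 10.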
We directly use x2(k) and x3(k) as they are. As x4(k) is more important, we consider some past and future elements. After conducting some cross validation tests, we decide to use x4(k−5), …, x4(k+4). Therefore, each training instance consists
of 22 attributes.
For learning techniques like Neural Networks or Support Vector Machines, it is
recommended to scale each attribute of data into an appropriate range such as [−1, 1]
or [0, 1]. Since all raw data under our encoding scheme are already in a small region
[−1.5, 1.5], we do not conduct any scaling.
After preparing the training data, we do model selection by 5-fold cross validation. We consider only the RBF kernel K(xi, xj) = e^(−γ‖xi−xj‖²). Thus the two parameters are the kernel parameter γ and the penalty parameter C in (2.5).
First we work on a small subset of the training data: 3,000 randomly selected points. The contour of cross validation accuracy is in Figure 3.3, where the two axes are log2 C and log2 γ. It can be seen that the best cross validation rates happen at around C = 2^7 and γ = 2^0. We then work on a larger subset with 10,000 data points. Results are in Figure 3.4. Parameters with the best cross validation rate are at around C = 2^4 to 2^5 and γ = 2^0 to 2^1, a different range from that in Figure 3.3. Finally we do the model selection on all 50,000 training data, with results shown in Figure 3.5. Again we note that the best parameters slightly move to another region.
Therefore, the experiment seems to show that the best parameters depend on the size of the training data. Remember that the objective function of SVM is

    (1/2)wᵀw + C Σ_{i=1}^l ξi.

Under the same w, a larger number of instances causes a larger Σ_{i=1}^l ξi. Thus, some have proposed using

    (1/2)wᵀw + (C/l) Σ_{i=1}^l ξi.

This matches our experimental results, as roughly

    2^7 · 3,000 ≈ 2^5 · 10,000 ≈ 2^3 · 50,000.
Though using a subset for parameter selection saves time, a procedure using all available training points may still be the most reliable.
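The rough constancy of C · l across the three runs can be checked directly (sizes and best C values taken from Figures 3.3-3.5):

```python
# (best C, training-set size) for the 3,000-, 10,000-, and 50,000-point runs.
pairs = [(2 ** 7, 3000), (2 ** 5, 10000), (2 ** 3, 50000)]

products = [C * l for C, l in pairs]
print(products)  # [384000, 320000, 400000] -- all roughly equal
```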
Finally, we select C = 2^4 and γ = 2^2 to train on all 50,000 data and obtain a model for testing. There are 1,293 test errors. This was the winning entry of the IJCNN 2001 competition; the second-place entry had more than 2,000 errors. For the 50,000 training data, the number of support vectors is around 3,000. Note that the number of SVs varies across data sets.
3.8 A Failed Example
3.9 Homework
1. (a) Show that for the linear kernel, data scaled to [−1, +1] and trained by SVM with parameter C is equivalent to the same data scaled to [0, +1] and solved by SVM with 4C.
(b) Show that for the RBF kernel, data scaled to [−1, +1] and trained by SVM with parameter γ is equivalent to the same data scaled to [0, +1] and solved by SVM with 4γ.
Chapter 4
Support Vector Regression
4.1 Linear Regression
Given training data (x1, y1), …, (xl, yl) in Figure 4.1, where the xi are input vectors and yi is the associated output value of xi, traditional linear regression finds a linear function wᵀx + b such that (w, b) is an optimal solution of

    min_{w,b} Σ_{i=1}^l (yi − (wᵀxi + b))².          (4.1)

In other words, wᵀx + b approximates the training data by minimizing the sum of squared errors.
Figure 4.1: Linear Regression
Note that n, the number of features, is in general less than l. Otherwise, a function passing through all points, making (4.1) zero, is an optimal wᵀx + b. In such cases, overfitting occurs.
Similar to classification, if the data are nonlinearly distributed, a linear function is not good enough. Therefore, we also map data to a higher dimensional space by a function φ(x). Then l ≤ the dimensionality of φ(x), so again overfitting happens. An example is in Figure 4.2.
(a) Overfitting (b) Better approximation
Figure 4.2: Nonlinear regression
4.2 Support Vector Regression
To remedy the overfitting problem after using φ, we consider the following reformulation of (4.1):

    min_{w,b,ξ,ξ*} Σ_{i=1}^l (ξi² + (ξi*)²)
    subject to −ξi* ≤ yi − (wᵀφ(xi) + b) ≤ ξi,          (4.2)
               ξi, ξi* ≥ 0, i = 1, …, l.

It is easy to see that (4.1) (with x replaced by φ(x)) and (4.2) are equivalent: if (w, b, ξ, ξ*) is optimal for (4.2), as ξi² + (ξi*)² is minimized, we have

    ξi = max(yi − (wᵀφ(xi) + b), 0) and
    ξi* = max(−yi + (wᵀφ(xi) + b), 0).

Thus,

    ξi² + (ξi*)² = (yi − (wᵀφ(xi) + b))².

Moreover, ξi ξi* = 0 at an optimal solution.
Instead of using square errors, we can use linear ones:

    min_{w,b,ξ,ξ*} Σ_{i=1}^l (ξi + ξi*)
    subject to −ξi* ≤ yi − (wᵀφ(xi) + b) ≤ ξi,
               ξi, ξi* ≥ 0, i = 1, …, l.
Figure 4.3: Support Vector Regression
Support vector regression (SVR) then employs two modifications to avoid overfitting:

1. A threshold ε is given so that if the ith instance satisfies

       −ε ≤ yi − (wᵀφ(xi) + b) ≤ ε,

   it is considered a correct approximation. Then ξi = ξi* = 0.

2. To smooth the function wᵀφ(x) + b, an additional term wᵀw is added to the objective function.

Thus, support vector regression solves the following optimization problem:

    min_{w,b,ξ,ξ*} (1/2)wᵀw + C Σ_{i=1}^l (ξi + ξi*)          (4.3)
    subject to (wᵀφ(xi) + b) − yi ≤ ε + ξi,
               yi − (wᵀφ(xi) + b) ≤ ε + ξi*,
               ξi, ξi* ≥ 0, i = 1, …, l.
Clearly, ξi is the upper training error (and ξi* the lower) with respect to the ε-insensitive tube |y − (wᵀφ(x) + b)| ≤ ε. This can be clearly seen from Figure 4.3. If xi is not in the tube, there is an error ξi or ξi*, which we would like to minimize in the objective function. SVR avoids underfitting and overfitting the training data by minimizing the training error C Σ_{i=1}^l (ξi + ξi*) as well as the regularization term (1/2)wᵀw. The addition of the term wᵀw can be explained in a way similar to that for classification problems. In Figure 4.4, under the condition that training data are in the ε-insensitive tube, we would like the approximating function to be as general as possible to represent the data distribution.
Figure 4.4: More general approximate function by maximizing the distance between wᵀx + b = ±ε.
The parameters which control the regression quality are the cost of error C, the width ε of the tube, and the mapping function φ. Similar to support vector classification, as w may be a huge vector variable, we solve the dual problem:

    min_{α,α*} (1/2)(α − α*)ᵀQ(α − α*) + ε Σ_{i=1}^l (αi + αi*) + Σ_{i=1}^l yi(αi − αi*)
    subject to Σ_{i=1}^l (αi − αi*) = 0, 0 ≤ αi, αi* ≤ C, i = 1, …, l,          (4.4)

where Qij = K(xi, xj) ≡ φ(xi)ᵀφ(xj). The derivation of the dual uses the same procedure as for support vector classification. The primal-dual relation shows that

    w = Σ_{i=1}^l (−αi + αi*) φ(xi),

so the approximating function is

    Σ_{i=1}^l (−αi + αi*) K(xi, x) + b.
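As a usage sketch (our own illustration, not part of the original text): scikit-learn's SVR solves this formulation, with C, ε (epsilon), and the kernel as the controlling parameters:

```python
import numpy as np
from sklearn.svm import SVR

# Noisy sine curve as a stand-in training set.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# RBF kernel; points strictly inside the epsilon-tube incur no loss,
# so only points on or outside the tube become support vectors.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(len(model.support_), "support vectors out of", len(X))
```

Widening the tube (larger epsilon) typically reduces the number of support vectors, at the cost of a coarser approximation.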
4.3 A Practical Example
4.4 Exercises
1. If you have read Chapter 6, derive the SVR dual using the same procedure.
Part II
Advanced Topics
Chapter 5
The Dual Problem
5.1 SVM and Convex Optimization
A convex set is a region like the one in the figure (omitted): any line segment connecting two points x1 and x2 of the region is still in the set.
Its formal definition is as follows
Definition 5.1.2 (Convex region) A set A is convex if for any x1,x2 ∈ A,
λx1 + (1− λ)x2 ∈ A, ∀0 ≤ λ ≤ 1.
Note that λx1 + (1− λ)x2, 0 ≤ λ ≤ 1, includes all points on the segment connecting
x1 and x2. The following set is non-convex, as part of a line segment connecting two of its points is not in the set (figure omitted).
A closely related concept is the convex function (two figures omitted: a convex function and a non-convex function).
The main difference between the above two figures is that the convex one has a
unique “local minimum,” while the non-convex one has two. By a local minimum,
we mean a point which is the smallest in its surrounding region. Formally,
Definition 5.1.3 (Convex function) A function f(x) is convex if for any x1,x2,
f(λx1 + (1− λ)x2) ≤ λf(x1) + (1− λ)f(x2), ∀0 ≤ λ ≤ 1.
That is, in the figure (omitted), any line segment connecting two points (x1, f(x1)) and (x2, f(x2)) is above the graph of f: for x = λx1 + (1−λ)x2 between x1 and x2, the point (x, λf(x1) + (1−λ)f(x2)) on the segment lies above (x, f(x)).
Example 5.1.4 The SVM objective function is convex.

We would like to have

    (1/2)(λw¹ + (1−λ)w²)ᵀ(λw¹ + (1−λ)w²) + C Σ_{i=1}^l (λξi¹ + (1−λ)ξi²)
    ≤ λ((1/2)(w¹)ᵀw¹ + C Σ_{i=1}^l ξi¹) + (1−λ)((1/2)(w²)ᵀw² + C Σ_{i=1}^l ξi²).
It is easy to cancel out the terms with ξi. Then, the difference between the right-hand and left-hand sides is

    λ(1/2)(w¹)ᵀw¹ + (1−λ)(1/2)(w²)ᵀw² − (1/2)(λw¹ + (1−λ)w²)ᵀ(λw¹ + (1−λ)w²)
    = (1/2)λ(1−λ)(w¹ − w²)ᵀ(w¹ − w²) ≥ 0.          (5.1)
The above discussion clearly shows that linear functions (i.e., lines in 2-dimensional spaces) are convex. From this, we can further define “strictly convex” functions (figures omitted: a convex function versus a strictly convex one).
Formally,

Definition 5.1.5 (Strictly convex) A function f(x) is strictly convex if for any x1 ≠ x2,

    f(λx1 + (1−λ)x2) < λf(x1) + (1−λ)f(x2), ∀ 0 < λ < 1.

Example 5.1.6 The function f(w) = (1/2)wᵀw is a strictly convex function of w.

In (5.1), if 0 < λ < 1 and w¹ ≠ w², then “≥” is replaced by “>”. Thus, we get the strict convexity of wᵀw.
For such a strictly convex function, if it is quadratic, we can easily characterize its global minimum:

Theorem 5.1.7 Consider a strictly convex quadratic function

    f(x) = (1/2)xᵀQx + pᵀx.

Then

    x̄ is the unique global minimum ⇔ Qx̄ + p = 0.
Proof.
First we claim that for a strictly convex quadratic function,

    dᵀQd > 0, ∀ d ≠ 0.

For any two vectors x and d,

    f(x + λd) = (1/2)(x + λd)ᵀQ(x + λd) + pᵀ(x + λd)
              = f(x) + λ(Qx + p)ᵀd + (1/2)λ²dᵀQd.

On the other hand, by strict convexity, for any d ≠ 0 and 0 < λ < 1,

    f(x + λd) = f((1−λ)x + λ(x + d))
              < (1−λ)f(x) + λf(x + d)
              = (1−λ)f(x) + λf(x) + λ(Qx + p)ᵀd + (λ/2)dᵀQd.

Thus,

    (λ²/2)dᵀQd < (λ/2)dᵀQd,

and hence, since 0 < λ < 1,

    dᵀQd > 0.
Now we are ready to prove the theorem.

“⇐” Suppose Qx̄ + p = 0. For any x ≠ x̄, let d ≡ x − x̄. Then

    f(x) = (1/2)(x̄ + d)ᵀQ(x̄ + d) + pᵀ(x̄ + d)
         = f(x̄) + (Qx̄ + p)ᵀd + (1/2)dᵀQd
         = f(x̄) + (1/2)dᵀQd > f(x̄).

Thus, x̄ is the unique global minimum.

“⇒” If Qx̄ + p ≠ 0, we consider d = −t(Qx̄ + p), where t > 0. Then

    f(x̄ + d) = f(x̄) + (Qx̄ + p)ᵀd + (1/2)dᵀQd
             = f(x̄) − t(Qx̄ + p)ᵀ(Qx̄ + p) + (1/2)t²(Qx̄ + p)ᵀQ(Qx̄ + p).

If (Qx̄ + p)ᵀQ(Qx̄ + p) > 0 and

    t < 2(Qx̄ + p)ᵀ(Qx̄ + p) / ((Qx̄ + p)ᵀQ(Qx̄ + p)),
then

    f(x̄ + d) < f(x̄),

and x̄ is not a global minimum, a contradiction. On the other hand, if (Qx̄ + p)ᵀQ(Qx̄ + p) = 0, then for any t > 0, f(x̄ + d) < f(x̄), which also causes a contradiction. □
Note that if Q is positive definite, Qx + p = 0 has the unique solution x̄ = −Q⁻¹p. Thus, f(x) has a unique global minimum.
5.2 Lagrangian Duality
We begin by considering the simpler problem (2.1). Its Lagrangian dual is defined as

    max_{α≥0} (min_{w,b} L(w, b, α)),          (5.2)

where

    L(w, b, α) = (1/2)‖w‖² − Σ_{i=1}^l αi (yi(wᵀφ(xi) + b) − 1).          (5.3)
The Lagrangian L has to be minimized with respect to the primal variables w and
b and maximized with respect to the dual variables αi. For a nonlinear problem like
(2.1), called the primal problem, there are several closely related problems of which
the Lagrangian dual is an important one. Under certain conditions, the primal and
dual problems have the same optimal objective values. Therefore, we can instead
solve the dual which may be an easier problem than the primal. In particular, when
data are mapped into a higher dimensional space, solving the dual may be the only
way to train SVM.
Assume (w̄, b̄) is an optimal solution of the primal with optimal objective value γ = (1/2)‖w̄‖².* Then no (w, b) satisfies

    (1/2)‖w‖² < γ and yi(wᵀφ(xi) + b) ≥ 1, i = 1, …, l.          (5.4)

With (5.4), there is ᾱ ≥ 0 such that for all w, b,

    (1/2)‖w‖² − γ − Σ_{i=1}^l ᾱi (yi(wᵀφ(xi) + b) − 1) ≥ 0.          (5.5)
∗ We do not give a rigorous treatment of the existence of an optimal solution here. In fact, if the primal's feasible set is non-empty, an optimal (w̄, b̄) exists. Details are in (Lin, 2001a).
This result is quite intuitive, but a detailed proof is complicated and is deferred to Chapter 5.3.
Therefore, (5.5) implies

    max_{α≥0} min_{w,b} L(w, b, α) ≥ min_{w,b} L(w, b, ᾱ) ≥ γ.          (5.6)

On the other hand, for any α,

    min_{w,b} L(w, b, α) ≤ L(w̄, b̄, α),

so

    max_{α≥0} min_{w,b} L(w, b, α) ≤ max_{α≥0} L(w̄, b̄, α) = (1/2)‖w̄‖² = γ.          (5.7)

Note that to have the last equality we use the property that (w̄, b̄) is feasible for the primal problem (i.e., yi(w̄ᵀφ(xi) + b̄) − 1 ≥ 0).

Therefore, with (5.6), the inequality in (5.7) becomes an equality. This property is called strong duality: the primal and dual have the same optimal objective value. Thus, ᾱ is an optimal solution of the dual problem. In addition, putting (w̄, b̄) into (5.5), with ᾱi ≥ 0 and yi(w̄ᵀφ(xi) + b̄) − 1 ≥ 0, we obtain

    ᾱi [yi(w̄ᵀφ(xi) + b̄) − 1] = 0, i = 1, …, l,          (5.8)

which is usually called the complementarity condition.
To simplify the dual, note that when α is fixed,

    min_{w,b} L(w, b, α) =
        −∞                                                 if Σ_{i=1}^l αi yi ≠ 0,
        min_w (1/2)wᵀw − Σ_{i=1}^l αi (yi wᵀφ(xi) − 1)     if Σ_{i=1}^l αi yi = 0.          (5.9)

Note that if Σ_{i=1}^l αi yi ≠ 0, we can decrease the term −b Σ_{i=1}^l αi yi of L(w, b, α) as much as we want. For the case Σ_{i=1}^l αi yi = 0, following Example 5.1.6,

    (1/2)wᵀw − Σ_{i=1}^l αi (yi wᵀφ(xi) − 1)

is a strictly convex function of w. Theorem 5.1.7 then implies that the unique optimum happens when

    ∂L(w, b, α)/∂wi = 0, i = 1, …, n.

Thus,

    w = Σ_{i=1}^l αi yi φ(xi).          (5.10)
Therefore, by substituting (5.10) into (5.2), the dual problem can be written as

    max_{α≥0}
        Σ_{i=1}^l αi − (1/2) Σ_{i=1}^l Σ_{j=1}^l αi αj yi yj φ(xi)ᵀφ(xj)    if Σ_{i=1}^l αi yi = 0,
        −∞                                                                   if Σ_{i=1}^l αi yi ≠ 0.          (5.11)

As −∞ is certainly not the maximal objective value of the dual, the dual optimum does not occur where Σ_{i=1}^l αi yi ≠ 0. Therefore, the dual problem simplifies to finding optimal αi of

    max_{α∈R^l} Σ_{i=1}^l αi − (1/2) Σ_{i=1}^l Σ_{j=1}^l αi αj yi yj φ(xi)ᵀφ(xj)
    subject to αi ≥ 0, i = 1, …, l, and Σ_{i=1}^l αi yi = 0.

This is the dual SVM problem that we usually refer to. Note that (5.8), Σ_{i=1}^l αi yi = 0, αi ≥ 0 ∀i, and (5.10) are called the Karush-Kuhn-Tucker (KKT) optimality conditions of the primal problem. That is, (w, b) is an optimal solution of the primal if and only if it is feasible and there is α satisfying the KKT conditions.
As practically we solve the dual, in the following we give a formal discussion about
the dual problem and its KKT conditions.
Theorem 5.2.8 α is an optimal solution of the dual if and only if α is feasible and
there is b so that
yi(wT φ(xi) + b)− 1 ≥ 0, i = 1, . . . , l, and (5.12)
αi[yi(wTφ(xi) + b)− 1] = 0, i = 1, . . . , l, (5.13)
where w is defined as in (5.10) using α.
Proof.
“⇒”
Since α is an optimal solution of the dual,
γ = max_{α′≥0} min_{w,b} L(w, b, α′) = min_{w,b} L(w, b, α).
From the discussion in the primal-dual relation,
γ = min_{w,b} L(w, b, α) ≤ L(w∗, b∗, α) ≤ γ,      (5.14)
where (w∗, b∗) is an optimal solution of the primal. Since y^T α = 0, using Theorem
5.1.7, L(w, b, α) has a global minimum at w = Σ_{i=1}^l αi yi φ(xi). Since L(w∗, b∗, α) = γ
from (5.14), w∗ is essentially w. Then, using this b∗ as b, we have the conditions
(5.12) and (5.13).
“⇐”
(5.12) implies (w, b) is a feasible solution of the primal. This and (5.13) imply
L(w, b, α) ≥ γ. From the discussion around (5.10) and the assumption on (w, b),
Σ_{i=1}^l αi − (1/2) Σ_{i=1}^l Σ_{j=1}^l αi αj yi yj φ(xi)^T φ(xj) = min_{w,b} L(w, b, α) = L(w, b, α) ≥ γ.      (5.15)
Since γ is the optimal objective value of the dual from the earlier discussion and α is
feasible for the dual, the “≥” in (5.15) must be an “=”, and α is an optimal solution of the
dual. □
Using a similar derivation, the dual of (2.5) is (2.7). Now the Lagrangian dual is
max_{α≥0, β≥0} ( min_{w,b,ξ} L(w, b, ξ, α, β) ),      (5.16)
where
L(w, b, ξ, α, β) = (1/2)‖w‖² + C Σ_{i=1}^l ξi − Σ_{i=1}^l αi ( yi(w^T φ(xi) + b) − 1 + ξi ) − Σ_{i=1}^l βi ξi.      (5.17)
From this formula, we understand that for each inequality constraint, there is a
corresponding nonnegative dual variable. If the constraint is an equality, then the
associated dual variable is unrestricted (i.e., it could be negative).
Details of deriving (2.7) are left in Exercise 1.
5.3 Proof of (5.5)
The derivation here follows from Lemma 6.2.3 of (Bazaraa et al., 1993). To simplify
the proof, we make an assumption that {(w, b) | yi(wT φ(xi) + b) > 1, i = 1, . . . , l} is
nonempty. (5.5) remains valid without this assumption.
As there is no (w, b) which satisfies (5.4), (0, 0) is not in the following set
A = {(p, q) | p > (1/2)‖w‖² − γ, qi ≥ −(yi(w^T φ(xi) + b) − 1), i = 1, . . . , l, for some (w, b)}.
A is convex, as for any (p1, q1), (p2, q2) ∈ A with associated (w1, b1) and (w2, b2),
λp1 + (1 − λ)p2 > (1/2)λ‖w1‖² + (1/2)(1 − λ)‖w2‖² − γ ≥ (1/2)‖λw1 + (1 − λ)w2‖² − γ
and
λ(q1)i + (1 − λ)(q2)i ≥ −( yi( (λw1 + (1 − λ)w2)^T φ(xi) + (λb1 + (1 − λ)b2) ) − 1 ).
That is,
(λp1 + (1− λ)p2, λq1 + (1− λ)q2) ∈ A,
for all 0 ≤ λ ≤ 1. If the set A is like the following (assume q has only one component),

[Figure: the convex set A in the (p, q) plane, lying on one side of the line p u0 + q u = 0.]

there are (u0, u) ≠ (0, 0) such that
u0 p + u^T q ≥ 0, ∀(p, q) ∈ Cl(A).      (5.18)
We use a dotted boundary as points on the boundary may not be in the set A. Note
that for points on the boundary, u0 p + u^T q ≥ 0 still holds. So in (5.18), Cl(A) means
points in A or on its boundary†.
Now we have (0, 0) ∉ A, so an extreme situation may be as the following:

[Figure: the set A touching the line p u0 + q u = 0, which passes through the origin.]

(5.18) still holds, and clearly the key is that A is a convex set.
Let us look at (5.18) in more detail. As p ∈ A can be arbitrarily large, u0 ≥ 0.
Otherwise, a negative u0 with large p and some fixed q would violate (5.18). Similarly,
u ≥ 0. As (p, q) = [(1/2)w^T w − γ, −(y1(w^T φ(x1) + b) − 1), . . . , −(yl(w^T φ(xl) + b) − 1)] ∈ Cl(A),
u0((1/2)w^T w − γ) − Σ_{i=1}^l ui (yi(w^T φ(xi) + b) − 1) ≥ 0, ∀(w, b).      (5.19)
† (5.18) is indeed not trivial, but we omit a formal proof here.
If u0 = 0, the assumption that there are (w, b) such that yi(wT φ(xi) + b)− 1 > 0, i =
1, . . . , l implies ui = 0, i = 1, . . . , l. This contradicts the fact that (u0,u) 6= (0, 0).
Thus, u0 > 0 and (5.19) can be written as
((1/2)w^T w − γ) − Σ_{i=1}^l (ui/u0) (yi(w^T φ(xi) + b) − 1) ≥ 0, ∀(w, b),
which is exactly (5.5).
5.4 Notes
The duality theory holds only for primal problems satisfying certain convexity condi-
tions and the so-called “constraint qualification.” Here we have linear inequalities
as constraints, so the constraint qualification is satisfied. In Exercise 6, we show an
example for which the duality results do not hold.
5.5 Exercises
1. Show that the dual of (2.5) is (2.7).
2. Consider the following primal optimization problem
min_{w,b,ξ}   (1/2)w^T w + C Σ_{i=1}^l ξi²
subject to    yi(w^T φ(xi) + b) ≥ 1 − ξi,      (5.20)
              ξi ≥ 0, i = 1, . . . , l.
(a) Prove that the constraint ξi ≥ 0 is not needed.
(b) Derive its dual by the Lagrangian dual procedure.
(c) Show that by transforming (5.20) to
min_{w,b}    (1/2)w^T w
subject to   yi(w^T xi + b) ≥ 1, i = 1, . . . , l,
we can directly obtain the dual using results in (5.2).
3. Consider the following primal optimization problem
min_{w,b,ξ}   (1/2)w^T w + (1/2)b² + C Σ_{i=1}^l ξi
subject to    yi(w^T φ(xi) + b) ≥ 1 − ξi,
              ξi ≥ 0, i = 1, . . . , l.
Derive its dual by the Lagrangian dual procedure.
4. Consider the following primal optimization problem
min_{w,b,ξ}   (1/2)w^T w + C+ Σ_{i:yi=1} ξi + C− Σ_{i:yi=−1} ξi
subject to    yi(w^T φ(xi) + b) ≥ 1 − ξi,
              ξi ≥ 0, i = 1, . . . , l.
Derive its dual by the Lagrangian dual procedure.
5. In Exercise 2.2, we directly solve the primal/dual problem to obtain the sepa-
rating hyperplane. Verify that the primal/dual solutions satisfy the KKT conditions.
6. Derive the Lagrangian dual of the following two problems and show that the
objective values of primal and dual are not equal.
(a)
min_{x1,x2}   x1
subject to    x1³ ≥ x2,
              x2 ≥ 0.
(b)
min_{x1,x2}   x1
subject to    x1 ≥ x2³,
              x2 ≥ 0.
The first problem comes from (Fletcher, 1987) and its feasible region is not
convex. The second problem, however, has a convex feasible region. If the con-
straints are not linear, even for convex programming, there are situations
where the dual does not exist. By convex programming we mean that the ob-
jective function is convex and the feasible region is a convex set. To have
the primal-dual relationship mentioned in this chapter, existing theorems
require that the constraints, written as gi(x) ≤ 0, ∀i, satisfy (1) gi are convex
functions, and (2) certain constraint qualifications. For problems with only
linear inequalities as constraints (e.g., SVM), both conditions hold.
7. If the error term C Σ_{i=1}^l ξi of the SVM formulation (2.5) becomes C Σ_{i=1}^l e(ξi),
where e is a function defined as
e(ξ) ≡ { (0.5/γ)ξ²   if ξ ≤ γ,
       { ξ − 0.5γ    if ξ ≥ γ.
Here γ is a positive constant. Derive its dual.
This is a difficult exercise.
Chapter 6
Solving the Quadratic Problems
6.1 Solving Optimization Problems
We consider the following optimization problem:
min_α        (1/2)α^T Qα − e^T α
subject to   y^T α = ∆,      (6.1)
             0 ≤ αt ≤ C, t = 1, . . . , l,
where yt = ±1, t = 1, . . . , l. If ∆ = 0, (6.1) is the SVM dual problem (2.7).
Optimization is a well-developed area. For differentiable objective functions, the
solution process usually involves the first or second derivative. For example, in
Theorem 5.1.7, if f(x) = (1/2)x^T Qx − p^T x and Q is positive definite, the optimal
solution is obtained by solving the linear system Qx − p = 0, where Qx − p is the
first derivative of f(x). Many optimization applications involve a sparse Q (i.e.,
many of Q's components are zero), so the linear system can be solved easily even if
the number of variables is huge.
Unfortunately, such matrix operations are not possible here, as Q is in general a
fully dense matrix (i.e., all of Q's components are non-zero). For example, if the RBF
kernel is used,
Qij = yi yj e^{−γ‖xi−xj‖²} > 0.
Thus, if there are 30,000 training instances, Q is a 30,000 × 30,000 dense matrix with
30,000² = 9 × 10^8 elements. The matrix Q cannot fit into the computer memory,
so operations on Q cannot be done directly. In this chapter, we consider the decomposition
method to conquer this difficulty.
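As a quick sanity check of this estimate, the memory needed to store Q in double precision can be computed directly (a hypothetical back-of-the-envelope sketch, assuming 8-byte entries):

```python
l = 30_000                              # number of training instances
entries = l * l                         # Q is l-by-l and fully dense
bytes_needed = entries * 8              # 8 bytes per double-precision entry
print(entries)                          # 900000000, i.e. 9 x 10^8
print(round(bytes_needed / 2**30, 1))   # about 6.7 GiB
```

Even a machine with several gigabytes of memory cannot afford to keep such a matrix around, which is exactly why the decomposition method below only ever materializes a few columns of Q at a time.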
6.2 The Decomposition Method
This method is an iterative process; in each iteration only a few variables are up-
dated. We illustrate this method by a simple example:

[Figure: four training points, x1 = (0, 0) and x4 = (1, 1) in one class, and x2 = (1, 0) and x3 = (0, 1) in the other.]
Using the linear kernel and C = 1, the dual optimization problem is

min_α   (1/2) [α1 α2 α3 α4] [ 0  0  0  0
                              0  1  0 −1
                              0  0  1 −1
                              0 −1 −1  2 ] [α1; α2; α3; α4] − [1 1 1 1] [α1; α2; α3; α4]
subject to   [1 −1 −1 1] α = 0,
             0 ≤ α1, . . . , α4 ≤ 1.
Now αi = 0, i = 1, . . . , 4 satisfies all constraints, so we use it as the initial solution.
Then the decomposition method proceeds as follows:
Iteration 1:
α3 = α4 = 0 are fixed and we minimize the function over α1 and α2:

min_{α1,α2}   (1/2) [α1 α2 0 0] [ 0  0  0  0
                                  0  1  0 −1
                                  0  0  1 −1
                                  0 −1 −1  2 ] [α1; α2; 0; 0] − [1 1 1 1] [α1; α2; 0; 0]
            = (1/2) [α1 α2] [ 0 0
                              0 1 ] [α1; α2] − [1 1] [α1; α2]
subject to    α1 − α2 = α3 − α4 = 0,
              0 ≤ α1, α2 ≤ 1.

By substituting α1 = α2 into the objective function, we get
min_{α2}     (1/2)α2² − 2α2
subject to   0 ≤ α2 ≤ 1.
Thus, α2 = α1 = 1.
Iteration 2:
α1 = α2 = 1 are fixed and we minimize the function over α3 and α4:

min_{α3,α4}   (1/2) [1 1 α3 α4] [ 0  0  0  0
                                  0  1  0 −1
                                  0  0  1 −1
                                  0 −1 −1  2 ] [1; 1; α3; α4] − [1 1 1 1] [1; 1; α3; α4]
            = (1/2) [α3 α4] [  1 −1
                              −1  2 ] [α3; α4] − α4 + 1 − [1 1] [α3; α4] − 2
subject to    −α3 + α4 = −α1 + α2 = −1 + 1 = 0,
              0 ≤ α3, α4 ≤ 1.

By substituting α3 = α4 into the objective function, we get
min_{α3}     (1/2)α3² − 3α3 − 1
subject to   0 ≤ α3 ≤ 1.
Thus, α3 = α4 = 1.
Later we will show that α1 = · · · = α4 = 1 is already an optimal solution, so the
procedure stops.
Calling the indices of the variables to be minimized the working set of that
iteration, a general description of decomposition methods is as follows:
Algorithm 6.2.9 (Decomposition method)
1. Given a number q ≪ l as the size of the working set. Find α¹ as the initial
solution. Set k = 1.
2. If αᵏ is an optimal solution of (2.7), stop. Otherwise, find a working set B ⊂
{1, . . . , l} of size q. Define N ≡ {1, . . . , l}\B, and let αᵏ_B and αᵏ_N be the
sub-vectors of αᵏ corresponding to B and N, respectively.
3. Solve the following sub-problem with the variable α_B:

min_{α_B}   (1/2) [α_B^T (αᵏ_N)^T] [ Q_BB Q_BN
                                     Q_NB Q_NN ] [α_B; αᵏ_N] − [e_B^T e_N^T] [α_B; αᵏ_N]
          = (1/2) α_B^T Q_BB α_B + (−e_B + Q_BN αᵏ_N)^T α_B + constant
subject to   0 ≤ (α_B)_t ≤ C, t = 1, . . . , q,      (6.2)
             y_B^T α_B = ∆ − y_N^T αᵏ_N,

where [ Q_BB Q_BN; Q_NB Q_NN ] is a permutation of the matrix Q.
4. Set α^{k+1}_B to be the optimal solution of (6.2) and α^{k+1}_N ≡ αᵏ_N. Set k ← k + 1
and go to Step 2.
In each iteration, the indices {1, . . . , l} of the training set are separated into two sets B
and N. The vector α_N is fixed, so the objective value becomes (1/2)α_B^T Q_BB α_B + (−e_B +
Q_BN α_N)^T α_B + (1/2)α_N^T Q_NN α_N − e_N^T α_N. Then a sub-problem in the variable α_B, i.e.
(6.2), is solved. Note that B is updated in each iteration. To simplify the notation,
we simply use B instead of Bᵏ.
Clearly, in each iteration of Algorithm 6.2.9, only Q_BB and Q_BN of Q are needed.
If ql memory spaces can be allocated to store them, we calculate Q_BB and Q_BN when
needed. Thus, Q does not have to be fully stored, and this fact resolves the memory
difficulty of traditional optimization procedures. Of course, the check whether αᵏ is optimal
in Step 2 may still cause a memory problem. This issue will be discussed in the
next section.
If the working set B is restricted to only two elements, the method is called
“Sequential Minimal Optimization” (SMO).
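The whole procedure fits in a few dozen lines. The following is a minimal SMO-style sketch, not LIBSVM itself: it assumes NumPy, ∆ = 0, the maximal-violating-pair selection and analytic two-variable solution developed in the next two sections, and a small guard against a singular two-variable sub-problem. It reproduces the four-point example above.

```python
import numpy as np

def solve_dual(Q, y, C, eps=1e-6, max_iter=1000):
    """Decomposition (SMO) sketch for min 0.5*a'Qa - e'a
    s.t. y'a = 0, 0 <= a_t <= C (the case Delta = 0)."""
    l = len(y)
    alpha = np.zeros(l)                  # feasible initial solution
    grad = -np.ones(l)                   # grad f(a) = Qa - e = -e at a = 0
    for _ in range(max_iter):
        # working set selection: the maximal violating pair (i, j)
        up = [t for t in range(l)
              if (alpha[t] < C and y[t] == 1) or (alpha[t] > 0 and y[t] == -1)]
        low = [t for t in range(l)
               if (alpha[t] < C and y[t] == -1) or (alpha[t] > 0 and y[t] == 1)]
        i = max(up, key=lambda t: -y[t] * grad[t])
        j = min(low, key=lambda t: -y[t] * grad[t])
        if -y[i] * grad[i] <= -y[j] * grad[j] + eps:    # stopping condition
            break
        # analytic solution of the two-variable sub-problem, then clipping;
        # max(..., 1e-12) guards against a singular 2-by-2 Q_BB
        if y[i] != y[j]:
            quad = max(Q[i, i] + Q[j, j] + 2 * Q[i, j], 1e-12)
            dj = (-grad[i] - grad[j]) / quad
            diff = alpha[i] - alpha[j]                  # fixed along the constraint
            aj = min(max(alpha[j] + dj, max(0.0, -diff)), min(C, C - diff))
            ai = aj + diff
        else:
            quad = max(Q[i, i] + Q[j, j] - 2 * Q[i, j], 1e-12)
            dj = (grad[i] - grad[j]) / quad
            s = alpha[i] + alpha[j]                     # fixed along the constraint
            aj = min(max(alpha[j] + dj, max(0.0, s - C)), min(C, s))
            ai = s - aj
        # gradient update touches only two columns of Q
        grad += Q[:, i] * (ai - alpha[i]) + Q[:, j] * (aj - alpha[j])
        alpha[i], alpha[j] = ai, aj
    return alpha

Q = np.array([[0, 0, 0, 0], [0, 1, 0, -1],
              [0, 0, 1, -1], [0, -1, -1, 2]], dtype=float)
y = np.array([1, -1, -1, 1])
print(solve_dual(Q, y, C=1.0))           # [1. 1. 1. 1.]
```

For real data, Q would of course be computed from the kernel on demand rather than stored as a full matrix.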
6.3 Working Set Selection and Stopping Criteria
An immediate question about the simple example in Section 6.2 is how we select
{α1, α2} in the first iteration and then {α3, α4} in the second. We call this issue
the “working set selection.” Here, we discuss a method, which, in a sense, selects
variables violating the optimality condition.
Recall that in Theorem 5.2.8 we stated that α is a dual optimal solution if and only if it
satisfies certain conditions. At that time we considered the simpler formulation without
the penalty term C Σ_{t=1}^l ξt in the primal. Now for (6.1), Theorem 5.2.8 becomes
α is an optimal solution of (6.1) if and only if α is feasible and there are b, ξ ≥ 0
such that
yt(wTφ(xt) + b)− 1 + ξt ≥ 0, (6.3)
αt[yt(wT φ(xt) + b)− 1 + ξt] = 0, and (6.4)
(C − αt)ξt = 0, t = 1, . . . , l, (6.5)
where w = Σ_{t=1}^l αt yt φ(xt). Moreover, (w, b, ξ) is optimal for the primal.
(6.3)–(6.5) do not have to involve w, as we can always write
yt(w^T φ(xt) + b) = (Qα)t + b yt.
6.3. WORKING SET SELECTION AND STOPPING CRITERIA 69
To study the working set selection, we rewrite (6.3)–(6.5) as equivalent conditions:
There are b, ξ ≥ 0 such that
if αt = 0       then ξt = 0 and (Qα)t + byt − 1 ≥ 0,      (6.6)
if 0 < αt < C   then ξt = 0 and (Qα)t + byt − 1 = 0,      (6.7)
if αt = C       then ξt ≥ 0 and (Qα)t + byt − 1 = −ξt ≤ 0.      (6.8)
Thus, depending on (Qα)t + byt − 1, we know how to choose ξ, and there is no
need to write it. By separating (6.7) into (6.6) and (6.8), we have
There is b such that
if αt > 0 then (Qα)t + byt − 1 ≤ 0,
if αt < C then (Qα)t + byt − 1 ≥ 0.
Using the property yt = ±1, t = 1, . . . , l, these are equivalent to
There is b such that
if αt > 0 and yt = 1    then (Qα)t − 1 ≤ −b,
if αt > 0 and yt = −1   then (Qα)t − 1 ≤ b,
if αt < C and yt = 1    then (Qα)t − 1 ≥ −b,
if αt < C and yt = −1   then (Qα)t − 1 ≥ b.
Finally, we are able to remove b:

Theorem 6.3.10 α is an optimal solution of (6.1) if and only if α is feasible and
max_{t∈Iup(α)} −yt∇f(α)t ≤ min_{t∈Ilow(α)} −yt∇f(α)t,      (6.9)
where
f(α) ≡ (1/2)α^T Qα − e^T α,   ∇f(α) ≡ Qα − e,
and
Iup(α)  ≡ {t | αt < C, yt = 1 or αt > 0, yt = −1},
Ilow(α) ≡ {t | αt < C, yt = −1 or αt > 0, yt = 1}.
An illustration of the above results is in the following figure:

[Figure 6.1: Illustration of the dual optimality condition, showing the values −yt∇f(α)t for t ∈ Iup(α) and t ∈ Ilow(α); in panels (a) and (b) α is optimal, while in panel (c) α is not optimal.]
If q, an even number, is the size of the working set B and αᵏ is the current iterate,
we can select q/2 indices from elements in Iup(αᵏ) and the other q/2 indices from
Ilow(αᵏ) so that
−y_{i1}∇f(αᵏ)_{i1} ≥ · · · ≥ −y_{i_{q/2}}∇f(αᵏ)_{i_{q/2}} > −y_{j_{q/2}}∇f(αᵏ)_{j_{q/2}} ≥ · · · ≥ −y_{j1}∇f(αᵏ)_{j1}.      (6.10)
Therefore, essentially the q/2 most violating pairs are put into the working set, and we
call (i1, j1) a “maximal violating pair.”
Taking the example used earlier, initially α1 = · · · = α4 = 0, so
Iup(α) = {1, 4} and Ilow(α) = {2, 3}.
Then,
1 = −y1∇f(α)1 = −y4∇f(α)4 = max_{t∈Iup(α)} −yt∇f(α)t
  > min_{t∈Ilow(α)} −yt∇f(α)t = −y2∇f(α)2 = −y3∇f(α)3 = −1.
Thus, if we would like to choose two variables for minimization, one should be from
{1, 4} and the other should be from {2, 3}. After the first iteration, α = [1, 1, 0, 0]^T, so
∇f(α) = Qα − e = [ 0  0  0  0
                   0  1  0 −1
                   0  0  1 −1
                   0 −1 −1  2 ] [1; 1; 0; 0] − [1; 1; 1; 1] = [−1; 0; −1; −2].
Then −yt∇f(α)t, t = 1, . . . , 4, are
[1 0 −1 2]^T,
and
Iup(α) = {2, 4}, Ilow(α) = {1, 3}.
As
−y4∇f(α)4 = max_{t∈Iup(α)} −yt∇f(α)t > min_{t∈Ilow(α)} −yt∇f(α)t = −y3∇f(α)3,
we select {3, 4} as the working set. After the second iteration, α = [1, 1, 1, 1]^T, so
∇f(α) = Qα − e = [ 0  0  0  0
                   0  1  0 −1
                   0  0  1 −1
                   0 −1 −1  2 ] [1; 1; 1; 1] − [1; 1; 1; 1] = [−1; −1; −1; −1].
Then −yt∇f(α)t, t = 1, . . . , 4, are
[1 −1 −1 1]^T,
and
Iup(α) = {2, 3}, Ilow(α) = {1, 4}.
As
1 = min_{t∈Ilow(α)} −yt∇f(α)t ≥ max_{t∈Iup(α)} −yt∇f(α)t = −1,
from Theorem 6.3.10, α is an optimal solution. Note that during the iterative proce-
dure, we always maintain feasibility.
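The selection rule is mechanical enough to code directly. A small sketch (assuming NumPy and 0-based indices, so index t here corresponds to α_{t+1} in the text) that reproduces the choice of working set {3, 4} after the first iteration:

```python
import numpy as np

def violating_pair(alpha, grad, y, C):
    """Return the maximal violating pair (i, j) of Section 6.3."""
    l = len(y)
    up = [t for t in range(l)
          if (alpha[t] < C and y[t] == 1) or (alpha[t] > 0 and y[t] == -1)]
    low = [t for t in range(l)
           if (alpha[t] < C and y[t] == -1) or (alpha[t] > 0 and y[t] == 1)]
    i = max(up, key=lambda t: -y[t] * grad[t])   # largest -y_t grad_t in I_up
    j = min(low, key=lambda t: -y[t] * grad[t])  # smallest -y_t grad_t in I_low
    return i, j

Q = np.array([[0, 0, 0, 0], [0, 1, 0, -1],
              [0, 0, 1, -1], [0, -1, -1, 2]], dtype=float)
y = np.array([1, -1, -1, 1])
alpha = np.array([1.0, 1.0, 0.0, 0.0])            # after the first iteration
grad = Q @ alpha - 1                              # = [-1, 0, -1, -2]
print(violating_pair(alpha, grad, y, C=1.0))      # (3, 2): the pair {4, 3} in the text
```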
For the sub-problem (6.2), the optimality condition is
max_{t∈Iup(α)∩B} −yt∇f(α)t ≤ min_{t∈Ilow(α)∩B} −yt∇f(α)t.      (6.11)
Under our working set selection, in Algorithm 6.2.9, αᵏ_B does not satisfy (6.11). In
other words, αᵏ_B is not optimal for the sub-problem either, so we are guaranteed to
find a better α_B such that
f(α^{k+1}) < f(αᵏ).
As the decomposition method may take infinitely many iterations before (6.9) is satisfied,
practically we replace the stopping condition in Step 2 of Algorithm 6.2.9 with
max_{t∈Iup(α)} −yt∇f(α)t ≤ min_{t∈Ilow(α)} −yt∇f(α)t + ǫ,      (6.12)
where ǫ, the stopping tolerance, is a small positive number.
To use (6.12), ∇f(α) must be maintained throughout all iterations. A memory
problem may occur, as ∇f(α) = Qα − e involves the matrix Q. This issue is solved
by the following tricks:
1. α¹ = 0 satisfies all constraints of (2.7). Thus, the initial ∇f(α¹) = −e is easily
obtained.
2. We can easily update ∇f(α) using only Q_BB and Q_BN:
∇f(α^{k+1}) = ∇f(αᵏ) + Q(α^{k+1} − αᵏ) = ∇f(αᵏ) + Q_{:,B}(α^{k+1} − αᵏ)_B,
where
Q_{:,B} = [ Q_BB
            Q_NB ].
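A small numerical check of trick 2 (the data here are hypothetical; the point is that the update touches only the columns of Q indexed by B):

```python
import numpy as np

rng = np.random.default_rng(0)
l = 6
M = rng.standard_normal((l, l))
Q = M @ M.T                           # any symmetric Q serves for the identity
alpha_k = rng.uniform(0.0, 1.0, l)
grad_k = Q @ alpha_k - 1              # grad f(a) = Qa - e

B = [1, 4]                            # working set of this iteration
alpha_next = alpha_k.copy()
alpha_next[B] = [0.3, 0.7]            # whatever the sub-problem returned

# grad f(a_{k+1}) = grad f(a_k) + Q[:, B] @ (a_{k+1} - a_k)_B
grad_next = grad_k + Q[:, B] @ (alpha_next[B] - alpha_k[B])
assert np.allclose(grad_next, Q @ alpha_next - 1)
```

The full product Qα is never formed after the first iteration; only two kernel columns per iteration are needed.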
6.4 Analytical Solutions
The remaining problem in implementing the decomposition method is how to solve
the sub-problem (6.2). Here, we consider the simple situation q = 2. That is, in
each iteration only two elements are selected as the working set. Thus,
i ≡ arg max_{t∈Iup(α)} −yt∇f(α)t,   j ≡ arg min_{t∈Ilow(α)} −yt∇f(α)t.
Then, (6.2) is a simple problem with only two variables:
min_{αi,αj}   (1/2) [αi αj] [ Qii Qij
                              Qji Qjj ] [αi; αj] + (Q_{i,N} α_N − 1)αi + (Q_{j,N} α_N − 1)αj      (6.13)
subject to    yi αi + yj αj = ∆ − y_N^T αᵏ_N,      (6.14)
              0 ≤ αi, αj ≤ C.
In this section, we discuss simple ways to solve (6.13). To begin, we consider the
case yi ≠ yj. Instead of the variables αi and αj, we introduce di and dj so that
αi = αᵏi + di,   αj = αᵏj + dj,      (6.15)
where di = dj follows from (6.14). Thus, without considering 0 ≤ αi, αj ≤ C, (6.13) becomes
(1/2) [αᵏi + di  αᵏj + dj] [ Qii Qij
                             Qji Qjj ] [αᵏi + di; αᵏj + dj] + (Q_{i,N} αᵏ_N − 1)(αᵏi + di) +
(Q_{j,N} αᵏ_N − 1)(αᵏj + dj)
= (1/2)(Qii + Qjj + 2Qij) dj² + (∇f(αᵏ)i + ∇f(αᵏ)j) dj + constant.      (6.16)
The minimum of (6.16) happens at
dj = (−∇f(αᵏ)i − ∇f(αᵏ)j) / (Qii + Qjj + 2Qij).      (6.17)
Similarly, if yi = yj, (6.15) becomes
αj = αᵏj + dj,   αi = αᵏi − dj,
and (6.17) becomes
dj = (∇f(α)i − ∇f(α)j) / (Qii + Qjj − 2Qij).
Thus, we can write
α^{k+1}j = αᵏj + { (−∇f(αᵏ)i − ∇f(αᵏ)j) / (Qii + Qjj + 2Qij)   if yi ≠ yj,
                 { (∇f(αᵏ)i − ∇f(αᵏ)j) / (Qii + Qjj − 2Qij)    if yi = yj.      (6.18)
Due to the constraints 0 ≤ αi, αj ≤ C, α^{k+1}j or α^{k+1}i may be outside the allowed
region. In this case, the value of (6.18) is clipped into the feasible region. For example,
if yi ≠ yj, the line yi αi + yj αj = ∆ − y_N^T αᵏ_N can be like one of the two situations in
Figure 6.2.
[Figure 6.2: Two situations of the line yi αi + yj αj = ∆ − y_N^T αᵏ_N in the (αi, αj) plane when yi ≠ yj.]
We can explicitly check the situation where α^{k+1}i or α^{k+1}j is not in [0, C]:

δ ≡ αᵏi − αᵏj
If δ > 0
    If α^{k+1}j < 0
        α^{k+1}j ← 0
        α^{k+1}i ← δ
    If α^{k+1}i > C
        α^{k+1}i ← C
        α^{k+1}j ← C − δ
else
    If α^{k+1}i < 0
        α^{k+1}i ← 0
        α^{k+1}j ← −δ
    If α^{k+1}j > C
        α^{k+1}j ← C
        α^{k+1}i ← C + δ

The situation for yi = yj is similar.
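The rules above translate directly into code. A sketch for the yi ≠ yj case, checked on the first iteration of the Section 6.2 example (where the unconstrained step gives αi = αj = 2 with C = 1):

```python
def clipped_update(alpha_i, alpha_j, new_i, new_j, C):
    """Clip the unconstrained values from (6.18) back into [0, C],
    keeping alpha_i - alpha_j fixed (the case y_i != y_j)."""
    delta = alpha_i - alpha_j             # invariant along the constraint line
    if delta > 0:
        if new_j < 0:
            new_j, new_i = 0.0, delta
        if new_i > C:
            new_i, new_j = C, C - delta
    else:
        if new_i < 0:
            new_i, new_j = 0.0, -delta
        if new_j > C:
            new_j, new_i = C, C + delta
    return new_i, new_j

print(clipped_update(0.0, 0.0, 2.0, 2.0, C=1.0))   # (1.0, 1.0)
```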
Another minor problem is that the denominator in (6.18) is sometimes zero. When
this happens,
Qij = ±(Qii + Qjj)/2,
so
Qii Qjj − Qij² = Qii Qjj − (Qii + Qjj)²/4 = −(Qii − Qjj)²/4 ≤ 0.
Therefore, we know that if Q_BB is positive definite, a zero denominator in (6.18) never
happens. Hence this problem happens only if Q_BB is a singular 2 × 2 matrix. We
discuss some situations where Q_BB may be singular.
1. The function φ does not map data to linearly independent vectors in a higher-dimensional
space, so Q is only positive semidefinite — for example, when using the linear or low-
degree polynomial kernels. Then it is possible that a singular Q_BB is picked.
2. Some kernels have the nice property that φ(xi), i = 1, . . . , l, are linearly indepen-
dent if the xi are distinct. Thus Q, as well as every possible Q_BB, is positive definite. An
example is the RBF kernel. However, for many practical data sets we have encoun-
tered, some of the xi, i = 1, . . . , l, are identical. Therefore, several rows (columns)
of Q are exactly the same, so Q_BB may be singular.
However, even if the denominator of (6.18) is zero, there are no numerical prob-
lems: From (6.12), we note that
−yi∇f(α)i + yj∇f(α)j > ǫ
during the iterative process. Since
−yi∇f(α)i + yj∇f(α)j = { ±(∇f(α)i + ∇f(α)j)   if yi ≠ yj, and
                        { ±(∇f(α)i − ∇f(α)j)   if yi = yj,
the numerator of (6.18) is never zero, so the situation 0/0, which is defined as NaN by the IEEE
standard, does not appear. Therefore, (6.18) returns ±∞ if the denominator is zero, which can be
detected as a special quantity of the IEEE standard and clipped to a regular floating-point number.
6.5 The Calculation of b
After the solution α of the dual optimization problem is obtained, the variable b must
be calculated as it is used in the decision function.
(6.7) shows that for an optimal α, if αi satisfies 0 < αi < C, then
b = −yi∇f(α)i.
Practically, to avoid numerical errors, we average over all such αi:
b = Σ_{0<αi<C} −yi∇f(α)i / Σ_{0<αi<C} 1.
On the other hand, if there is no such αi, Theorem 6.3.10 implies that b can be
any number satisfying
−yi∇f(α)i ≤ b ≤ −yj∇f(α)j;
in this case, we can simply take the midpoint of the range:
b = (−yi∇f(α)i − yj∇f(α)j) / 2.
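Both cases combine into one small routine (a sketch assuming NumPy; the Section 6.2 example ends with no free αi, so b falls back to the midpoint rule):

```python
import numpy as np

def intercept(alpha, grad, y, C, tol=1e-12):
    """Compute b: average -y_t grad_t over free support vectors, or take
    the midpoint of the feasible range if no variable is free."""
    free = (alpha > tol) & (alpha < C - tol)
    if free.any():
        return float(np.mean(-y[free] * grad[free]))
    up = [-y[t] * grad[t] for t in range(len(y))
          if (alpha[t] < C and y[t] == 1) or (alpha[t] > 0 and y[t] == -1)]
    low = [-y[t] * grad[t] for t in range(len(y))
           if (alpha[t] < C and y[t] == -1) or (alpha[t] > 0 and y[t] == 1)]
    return (max(up) + min(low)) / 2

Q = np.array([[0, 0, 0, 0], [0, 1, 0, -1],
              [0, 0, 1, -1], [0, -1, -1, 2]], dtype=float)
y = np.array([1, -1, -1, 1])
alpha = np.ones(4)              # the example's final solution, all at the bound C = 1
grad = Q @ alpha - 1            # = -e
print(intercept(alpha, grad, y, C=1.0))    # 0.0, the midpoint of [-1, 1]
```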
6.6 Shrinking and Caching
6.6.1 Shrinking
Since for many problems the number of free support vectors (i.e., 0 < αi < C) is small,
the shrinking technique reduces the size of the working problem by removing some
bounded variables from consideration (Joachims, 1998). Near the end of the iterative process, the
decomposition method identifies a possible set A in which all final free αi may reside.
Indeed, we can have the following theorem, which shows that at the final iterations
of the decomposition method proposed in Section 6.3, only variables corresponding to a small
set can still be modified (Lin, 2002b, Theorem II.3):
Theorem 6.6.11 If lim_{k→∞} αᵏ = α, then from Theorem 6.7.13, α is an optimal
solution. Furthermore, after k is large enough, only elements in
{i | −yi∇f(α)i = max( max_{αt<C, yt=1} −∇f(α)t, max_{αt>0, yt=−1} ∇f(α)t )
             = min( min_{αt<C, yt=−1} ∇f(α)t, min_{αt>0, yt=1} −∇f(α)t ) }      (6.19)
can still possibly be modified.
Therefore, we tend to guess that if a variable αi is equal to C for several iterations,
then at the final solution it is still at the upper bound. Hence, instead of solving the
whole problem (2.7), the decomposition method works on a smaller problem:
min_{α_A}    (1/2) α_A^T Q_AA α_A − (e_A − Q_AN αᵏ_N)^T α_A
subject to   0 ≤ (α_A)_t ≤ C, t = 1, . . . , |A|,      (6.20)
             y_A^T α_A = ∆ − y_N^T αᵏ_N,
where N = {1, . . . , l}\A.
Of course, this heuristic may fail if the optimal solution of (6.20) is not the cor-
responding part of that of (2.7). When that happens, the whole problem (2.7) is
reoptimized starting from a point α where α_A is an optimal solution of (6.20) and
α_N contains the bounded variables identified before the shrinking process. Note that while
solving the shrunken problem (6.20), we only know the gradient Q_AA α_A + Q_AN α_N − e_A
of (6.20). Hence, when problem (2.7) is reoptimized, we also have to reconstruct the
whole gradient ∇f(α), which is quite expensive.
Many implementations begin the shrinking procedure near the end of the iterative
process; in LIBSVM, however, we start the shrinking process from the beginning. The
procedure is as follows:
1. After every min(l, 1000) iterations, we try to shrink some variables. Note that
during the iterative process,
min( {∇f(αᵏ)t | yt = −1, αt < C} ∪ {−∇f(αᵏ)t | yt = 1, αt > 0} ) = −gj
  < gi = max( {−∇f(αᵏ)t | yt = 1, αt < C} ∪ {∇f(αᵏ)t | yt = −1, αt > 0} ),      (6.21)
as Theorem 6.3.10 is not satisfied yet.
We conjecture that for those t with
gt ≡ { −∇f(α)t   if yt = 1, αt < C,
     { ∇f(α)t    if yt = −1, αt > 0,      (6.22)
if
gt ≤ −gj,      (6.23)
and αt resides at a bound, then the value of αt may not change any more. Hence,
we inactivate this variable. Similarly, for those t with
gt ≡ { −∇f(α)t   if yt = −1, αt < C,
     { ∇f(α)t    if yt = 1, αt > 0,      (6.24)
if
−gt ≥ gi,      (6.25)
and αt is at a bound, it is inactivated. Thus, the set A of activated variables is
dynamically reduced every min(l, 1000) iterations.
2. Of course, the above shrinking strategy may be too aggressive. Since the decom-
position method has very slow convergence and a large portion of the iterations
is spent achieving the final digit of the required accuracy, we would not
like those iterations to be wasted because of a wrongly shrunken problem (6.20).
Hence, when the decomposition method first achieves the tolerance
gi ≤ −gj + 10ǫ,
where ǫ is the specified stopping tolerance, we reconstruct the whole gradient.
Then, based on the correct information, we use criteria like (6.22) and (6.24) to
inactivate some variables, and the decomposition method continues.
Therefore, in LIBSVM, the size of the set A of (6.20) is dynamically reduced.
To decrease the cost of reconstructing the gradient ∇f(α), during the iterations we
always keep
Gi = C Σ_{αj=C} Qij,   i = 1, . . . , l.
Then for the gradient ∇f(α)i, i ∉ A, we have
∇f(α)i = Σ_{j=1}^l Qij αj − 1 = Gi + Σ_{0<αj<C} Qij αj − 1.
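A numerical check of this bookkeeping (hypothetical random data; keeping G lets a gradient entry be rebuilt from the free variables alone, without touching the variables at zero or at C):

```python
import numpy as np

rng = np.random.default_rng(1)
l, C = 8, 1.0
M = rng.standard_normal((l, l))
Q = M @ M.T                                  # symmetric stand-in for a kernel matrix
alpha = rng.choice([0.0, 0.4, C], size=l)    # variables at bounds and free variables

at_C = alpha == C
free = (alpha > 0) & (alpha < C)
G = C * Q[:, at_C].sum(axis=1)               # maintained during the iterations

# grad_i = G_i + sum over free j of Q_ij alpha_j - 1 (the identity holds for all i)
grad_rebuilt = G + Q[:, free] @ alpha[free] - 1
assert np.allclose(grad_rebuilt, Q @ alpha - 1)
```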
6.6.2 Caching
Another technique for reducing the computational time is caching. Since Q is fully
dense and may not be stored in the computer memory, elements Qij are calculated
as needed. Then usually a special storage using the idea of a cache is used to store
recently used Qij (Joachims, 1998). Hence the computational cost of later iterations
can be reduced.
Theorem 6.6.11 also supports the use of the cache as in final iterations only some
columns of the matrix Q are still needed. Thus if the cache can contain these columns,
we can avoid most kernel evaluations in final iterations.
In LIBSVM, we implement a simple least-recently-used strategy for the cache: we
dynamically cache only recently used columns of Q_AA of (6.20).
6.7 Convergence of the Decomposition Method
The convergence of decomposition methods was first studied in (Chang et al., 2000),
but the algorithms discussed there do not coincide with existing implementations. In this
section we discuss only convergence results related to the specific decomposition
method in Section 6.3.
From (Keerthi and Gilbert, 2002) we have
Theorem 6.7.12 Given any ǫ > 0, after a finite number of iterations (6.12) will be
satisfied.
This theorem establishes the so-called “finite termination” property, so we are sure
that after finitely many steps the algorithm will stop.
For asymptotic convergence, from (Lin, 2002a), we have
Theorem 6.7.13 If {αᵏ} is the sequence generated by the decomposition method in
Section 6.3, the limit of any of its convergent subsequences is an optimal solution of (6.1).
Note that Theorem 6.7.12 does not imply Theorem 6.7.13: if we consider gi and
gj in (6.12) as functions of α, they are not continuous. Hence, we cannot take the limit
on both sides of (6.12) and claim that any convergent point already satisfies the
KKT condition.
Theorem 6.7.13 was first proved as a special case of general results in (Lin, 2001c),
where some assumptions are needed. The newer proof in (Lin, 2002a) does not require
any assumptions.
For local convergence, as the algorithm used here is a special case of the one
discussed in (Lin, 2001b), we have the following theorem
Theorem 6.7.14 If Q is positive definite and the dual optimization problem is de-
generate (see Assumption 2 in (Lin, 2001b)), then there is c < 1 such that after k is
large enough,
f(αk+1)− f(α∗) ≤ c(f(αk)− f(α∗)),
where α∗ is the optimal solution of (6.1).
That is, LIBSVM is linearly convergent.
The discussion here is about global and local convergence. We investigate the
computational complexity in Section 6.8.
6.8 Computational Complexity
The discussion in Section 6.7 is about the asymptotic global convergence of the decom-
position method. In addition, the linear convergence (Theorem 6.7.14) is a property
of the local convergence rate. Here, we discuss the computational complexity.
The main operations are forming −e_B + Q_BN αᵏ_N of (6.2) and updating
∇f(αᵏ) to ∇f(α^{k+1}). Note that ∇f(α) is used in the working set selection as well
as in the stopping condition. They can be considered together as
−e_B + Q_BN αᵏ_N = ∇f(αᵏ)_B − Q_BB αᵏ_B,      (6.26)
and
∇f(α^{k+1}) = ∇f(αᵏ) + Q_{:,B}(α^{k+1}_B − αᵏ_B),      (6.27)
where Q_{:,B} is the sub-matrix of Q with column indices in B. That is, at the kth
iteration, as we already have ∇f(αᵏ), the right-hand side of (6.26) is used to construct
the sub-problem. After the sub-problem is solved, (6.27) is employed to obtain the next
∇f(α^{k+1}). As B has only two elements and solving the sub-problem is easy, the main
cost is Q_{:,B}(α^{k+1}_B − αᵏ_B) of (6.27). The operation itself takes O(2l), but if Q_{:,B} is not
available in the cache and each kernel evaluation costs O(n), one column of Q_{:,B}
already needs O(ln). Therefore, the complexity is:
1. #Iterations × O(l) if most columns of Q are cached during the iterations.
2. #Iterations × O(nl) if most columns of Q are not cached during the iterations and
each kernel evaluation costs O(n).
Note that if shrinking is incorporated, l will gradually decrease during iterations.
Unfortunately, so far we do not know much about the complexity of the number of
iterations. An earlier work is in (Hush and Scovel, 2003). However, its result applies
only to decomposition methods discussed in (Chang et al., 2000) but not LIBSVM or
other existing software.
6.9 Computational Complexity for Multi-class SVM
In Chapter 2.5 we discussed two multi-class methods: one-against-all and one-against-
one. As the latter trains k(k − 1)/2 classifiers, where k is the number of classes, we
may think that “one-against-one” takes more training time. In fact, this statement
may not be right. Assume the time for training one two-class problem (i.e., solving
the dual problem (6.1)) is O((·)^d), where (·) is the number of variables. Then the training
times of the two multi-class approaches are compared in Table 6.9.1. Clearly, if d ≥ 2,
the one-against-one approach has a shorter training time.

Table 6.9.1: Training complexity of two multi-class approaches
Method                            one-against-all    one-against-one
Average size per two-class SVM    l                  2l/k
# two-class SVMs                  k                  k(k − 1)/2
Training time                     kO(l^d)            (k(k − 1)/2) O((2l/k)^d)
To discuss the testing time, we first look at the decision function of two-class
SVMs (2.10). Kernel evaluations are the main part, and some subsequent multipli-
cations/additions of these kernel values follow. For multi-class SVMs, a training
instance may be a support vector of several two-class SVMs. There is no need to do
the same kernel evaluation several times, so indeed we conduct such evaluations in
the beginning, before calculating the decision values. Therefore, if the percentage of
training data that become support vectors in the k and k(k − 1)/2 two-class SVMs is about
the same and k is not too large, the two multi-class approaches have similar testing
times.
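The estimates in Table 6.9.1 are easy to compare numerically. A hypothetical sketch, assuming training a two-class problem with m variables costs about m^d operations (the values of l and k below are made up for illustration):

```python
def one_against_all(l, k, d):
    return k * l ** d                        # k problems, each with l variables

def one_against_one(l, k, d):
    # k(k-1)/2 problems, each with about 2l/k variables
    return k * (k - 1) // 2 * (2 * l / k) ** d

l, k = 10000, 10
for d in (1, 2, 3):
    print(d, one_against_all(l, k, d), one_against_one(l, k, d))
# for d >= 2, one-against-one is clearly cheaper despite training more classifiers
```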
6.10 Notes
Some early works on decomposition methods are, for example, (Osuna et al., 1997;
Joachims, 1998; Platt, 1998). The idea of using only two elements for the working set
is from the Sequential Minimal Optimization (SMO) method of (Platt, 1998). The working
set selection discussed here is indeed a special version of those discussed in (Joachims,
1998; Keerthi et al., 2001).
6.11 Exercises
1. In this chapter, we use a more complicated version of Theorem 5.2.8. Prove it
by the same derivation as for that theorem.
2. Consider the following SVM formulation:
min_{w,b,ξ}   (1/2)w^T w + C+ Σ_{i:yi=1} ξi + C− Σ_{i:yi=−1} ξi
subject to    yi(w^T φ(xi) + b) ≥ 1 − ξi,
              ξi ≥ 0, i = 1, . . . , l.
How will you change the derivation in Chapter 6.4?
3. Prove that the working set selection in Chapter 6.3 is equivalent to solving the
following problem:
min_d        ∇f(α)^T d
subject to   y^T d = 0,   −1 ≤ dt ≤ 1,      (6.28)
             dt ≥ 0 if αt = 0,   dt ≤ 0 if αt = C,
             |{dt | dt ≠ 0}| = 2.      (6.29)
Note that |{dt | dt ≠ 0}| means the number of components of d which are not
zero. The constraint (6.29) implies that a descent direction involving only two
variables is obtained. Then components of α with non-zero dt are included in
the working set B, which is used to construct the sub-problem (6.2). Note that
d is only used for identifying B, not as a search direction.
4. Consider x1, x2, x3 with ‖x1 − x2‖ = ‖x1 − x3‖ = ‖x2 − x3‖, y = [1, 1, −1]^T,
and C = ∞. If the RBF kernel is used, the dual SVM problem is
min_{α1,α2,α3}   (1/2) [α1 α2 α3] [  1  a −a
                                     a  1 −a
                                    −a −a  1 ] [α1; α2; α3] − (α1 + α2 + α3)
subject to       α1 + α2 − α3 = 0,
                 0 ≤ α1, α2, α3,
where a = e^{−γ‖xi−xj‖²}. We assume C is large, so it is not needed here. Prove that
at the optimal solution,
α∗ = [ 2/(3(1−a))   2/(3(1−a))   4/(3(1−a)) ]^T.      (6.30)
Then prove that if the initial solution is zero and the decomposition method in
Chapter 6.2 is considered, after k is large enough,
(α^{k+1} − α∗)^T Q(α^{k+1} − α∗) = (1/4)(αᵏ − α∗)^T Q(αᵏ − α∗).      (6.31)
This is a simple example for which the decomposition method is linearly con-
vergent. Hence, in theory, the linear convergence is already the best worst-case
analysis for the decomposition method in Chapter 6.2.
5. Write a simple SVM classifier using the algorithm in this chapter. Consider the dense input format only. Use the simple working set selection and solve two-variable optimization problems until the given stopping tolerance is satisfied. You may write the predictor in the same program so that the test accuracy is reported immediately. Kernel elements should be calculated on demand.
Requirements:
• The code should be fewer than 300 lines in any high-level language.
• The RBF kernel is enough.
• Run some small data sets from
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/binary
with some given parameters. On some (maybe most) parameter settings your code should be faster than LIBSVM, but on others it is the other way around. Explain why.
• Moreover, your code should be able to train the 50,000-instance ijcnn1 data set from the same URL.
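As a starting point, the two-variable decomposition loop can be sketched as below. This is a rough illustration, not a full solution to the exercise: the function names are made up, it precomputes the whole kernel matrix instead of computing elements on demand, and it does no caching or shrinking. It uses the maximal-violating-pair selection and the update direction Δα_i = y_i t, Δα_j = −y_j t, which keeps yᵀα = 0:

```python
import numpy as np

def rbf(X1, X2, gamma):
    """RBF kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2)."""
    d = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def svm_train(X, y, C=1.0, gamma=1.0, eps=1e-3, max_iter=100000):
    l = len(y)
    K = rbf(X, X, gamma)                 # the exercise computes these on demand
    alpha = np.zeros(l)
    G = -np.ones(l)                      # gradient of the dual objective at alpha = 0
    for _ in range(max_iter):
        yG = -y * G
        up = ((y == 1) & (alpha < C - 1e-10)) | ((y == -1) & (alpha > 1e-10))
        low = ((y == 1) & (alpha > 1e-10)) | ((y == -1) & (alpha < C - 1e-10))
        i = np.where(up)[0][np.argmax(yG[up])]     # maximal violating pair
        j = np.where(low)[0][np.argmin(yG[low])]
        if yG[i] - yG[j] < eps:                    # stopping tolerance satisfied
            break
        # Unconstrained step along the two-variable direction, then clip to the box.
        eta = max(K[i, i] + K[j, j] - 2 * K[i, j], 1e-12)
        t = (yG[i] - yG[j]) / eta
        hi_i = C - alpha[i] if y[i] == 1 else alpha[i]
        hi_j = alpha[j] if y[j] == 1 else C - alpha[j]
        t = min(t, hi_i, hi_j)
        alpha[i] += y[i] * t
        alpha[j] -= y[j] * t
        G += t * y * (K[:, i] - K[:, j])           # cheap rank-two gradient update
    yG = -y * G
    free = (alpha > 1e-8) & (alpha < C - 1e-8)
    b = yG[free].mean() if free.any() else (yG[i] + yG[j]) / 2
    return alpha, b

def svm_predict(Xtrain, y, alpha, b, gamma, Xtest):
    return np.sign(rbf(Xtest, Xtrain, gamma) @ (alpha * y) + b)

# Two well-separated clusters; training accuracy should be 1.0.
X = np.array([[0.0, 0.0], [0.2, 0.1], [-0.1, 0.3], [0.1, -0.2],
              [3.0, 3.0], [3.2, 2.9], [2.8, 3.1], [3.1, 3.3]])
y = np.array([1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0])
alpha, b = svm_train(X, y, C=10.0, gamma=1.0)
print((svm_predict(X, y, alpha, b, 1.0, X) == y).mean())
```

Replacing the precomputed matrix K with per-column kernel evaluations (and a cache) is what the exercise actually asks for, and is where the timing differences against LIBSVM come from.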
Bibliography
Bazaraa, M. S., H. D. Sherali, and C. M. Shetty (1993). Nonlinear Programming: Theory and Algorithms (Second ed.). Wiley.

Boser, B., I. Guyon, and V. Vapnik (1992). A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152. ACM Press.

Chang, C.-C., C.-W. Hsu, and C.-J. Lin (2000). The analysis of decomposition methods for support vector machines. IEEE Transactions on Neural Networks 11 (4), 1003–1008.

Chang, C.-C. and C.-J. Lin (2001a). IJCNN 2001 challenge: Generalization ability and text decoding. In Proceedings of IJCNN. IEEE.

Chang, C.-C. and C.-J. Lin (2001b). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Cortes, C. and V. Vapnik (1995). Support-vector networks. Machine Learning 20, 273–297.

Cover, T. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers EC-14, 326–334.

Cristianini, N. and J. Shawe-Taylor (2000). An Introduction to Support Vector Machines. Cambridge, UK: Cambridge University Press.

Fletcher, R. (1987). Practical Methods of Optimization. John Wiley and Sons.

Friedman, J. (1996). Another approach to polychotomous classification. Technical report, Department of Statistics, Stanford University. Available at http://www-stat.stanford.edu/reports/friedman/poly.ps.Z.

Gardy, J. L., C. Spencer, K. Wang, M. Ester, G. E. Tusnady, I. Simon, S. Hua, K. deFays, C. Lambert, K. Nakai, and F. S. Brinkman (2003). PSORT-B: improving protein subcellular localization prediction for gram-negative bacteria. Nucleic Acids Research 31 (13), 3613–3617.

Hsu, C.-W. and C.-J. Lin (2002). A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks 13 (2), 415–425.

Hush, D. and C. Scovel (2003). Polynomial-time decomposition algorithms for support vector machines. Machine Learning 51, 51–71.

Joachims, T. (1998). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel Methods — Support Vector Learning, Cambridge, MA. MIT Press.

Keerthi, S. S. and E. G. Gilbert (2002). Convergence of a generalized SMO algorithm for SVM classifier design. Machine Learning 46, 351–360.

Keerthi, S. S. and C.-J. Lin (2003). Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation 15 (7), 1667–1689.

Keerthi, S. S., S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy (2001). Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation 13, 637–649.

Knerr, S., L. Personnaz, and G. Dreyfus (1990). Single-layer learning revisited: a stepwise procedure for building and training a neural network. In J. Fogelman (Ed.), Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag.

Kreßel, U. (1999). Pairwise classification and support vector machines. In B. Schölkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel Methods — Support Vector Learning, Cambridge, MA, pp. 255–268. MIT Press.

Lin, C.-J. (2001a). Formulations of support vector machines: a note from an optimization point of view. Neural Computation 13 (2), 307–317.

Lin, C.-J. (2001b). Linear convergence of a decomposition method for support vector machines. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan.

Lin, C.-J. (2001c). On the convergence of the decomposition method for support vector machines. IEEE Transactions on Neural Networks 12 (6), 1288–1298.

Lin, C.-J. (2002a). Asymptotic convergence of an SMO algorithm without any assumptions. IEEE Transactions on Neural Networks 13 (1), 248–250.

Lin, C.-J. (2002b). A formal analysis of stopping criteria of decomposition methods for support vector machines. IEEE Transactions on Neural Networks 13 (5), 1045–1052.

Michie, D., D. J. Spiegelhalter, and C. C. Taylor (1994). Machine Learning, Neural and Statistical Classification. Englewood Cliffs, N.J.: Prentice Hall. Data available at http://www.ncc.up.pt/liacc/ML/statlog/datasets.html.

Osuna, E., R. Freund, and F. Girosi (1997). Training support vector machines: An application to face detection. In Proceedings of CVPR'97, New York, NY, pp. 130–136. IEEE.

Platt, J. C. (1998). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel Methods — Support Vector Learning, Cambridge, MA. MIT Press.

Prokhorov, D. (2001). IJCNN 2001 neural network competition. Slide presentation in IJCNN'01, Ford Research Laboratory. http://www.geocities.com/ijcnn/nnc_ijcnn01.pdf.

Sarle, W. S. (1997). Neural Network FAQ. Periodic posting to the Usenet newsgroup comp.ai.neural-nets.

Schölkopf, B. and A. J. Smola (2002). Learning with Kernels. MIT Press.

Vapnik, V. (1998). Statistical Learning Theory. New York, NY: Wiley.
Index

attributes, 6
complementarity condition, 42
data instance, 6
features, 6
kernel function, 18
Nearest Neighbor Methods, 6
strong duality, 42