A Guide to Support Vector Machines
Chih-Jen Lin
July 13, 2006
Contents
I Basic Topics 7
1 Introduction to Classification Problems 9
1.1 Data Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 Training and Testing Error . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Nearest Neighbor Methods . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Linear Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Overfitting and Underfitting . . . . . . . . . . . . . . . . . . . . . . . 12
1.6 Cross Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Support Vector Classification 17
2.1 Linear Separating Hyperplane with Maximal Margin . . . . . . . . . 17
2.2 Mapping Data to Higher Dimensional Spaces . . . . . . . . . . . . . . 19
2.3 The Dual Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Kernel and Decision Functions . . . . . . . . . . . . . . . . . . . . . . 23
2.5 Multi-class SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5.1 One-against-all Multi-class SVM . . . . . . . . . . . . . . . . . 26
2.5.2 One-against-one Multi-class SVM . . . . . . . . . . . . . . . . 26
2.6 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3 Training and Testing a Data set 29
3.1 Categorical Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Data Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.1 RBF Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.2 Cross-validation and Grid-search . . . . . . . . . . . . . . . . 32
3.4 A General Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5 Using LIBSVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.6 Other Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.7 A Large Practical Example . . . . . . . . . . . . . . . . . . . . . . . . 40
3.8 A Failed Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.9 Homework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4 Support Vector Regression 45
4.1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Support Vector Regression . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 A Practical Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
II Advanced Topics 51
5 The Dual Problem 53
5.1 SVM and Convex Optimization . . . . . . . . . . . . . . . . . . . . . 53
5.2 Lagrangian Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3 Proof of (5.5) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.4 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6 Solving the Quadratic Problems 65
6.1 Solving Optimization Problems . . . . . . . . . . . . . . . . . . . . . 65
6.2 The Decomposition Method . . . . . . . . . . . . . . . . . . . . . . . 66
6.3 Working Set Selection and Stopping Criteria . . . . . . . . . . . . . . 68
6.4 Analytical Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.5 The Calculation of b . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.6 Shrinking and Caching . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.6.1 Shrinking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.6.2 Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.7 Convergence of the Decomposition Method . . . . . . . . . . . . . . . 78
6.8 Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . 79
6.9 Computational Complexity for Multi-class SVM . . . . . . . . . . . . 80
6.10 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Preface
These are the course notes that I have been using in the past few years. The main
purpose is to let students without a machine learning background learn how to use
support vector machines effectively as a tool.
The first part (basic topics) is suitable for people who would like to use SVM. The
second part (advanced topics) is for people who would like to know implementation
details of SVM software.
Part I
Basic Topics
Chapter 1
Introduction to Classification Problems
1.1 Data Classification
The basic idea of the data classification problem can be described as follows: given
training data with known labels (classes), we would like to learn a model so that it
can be used to predict data with unknown labels.
For example, suppose the height and weight of the following six persons are
available and medical experts have identified that some of them are over-weighted or
under-weighted:
Table 1.1.1: Six training points
ID             1    2    3    4    5    6
Weight (kg)    50   60   70   70   80   90
Height (m)     1.6  1.7  1.9  1.5  1.7  1.6
Over-weighted  No   No   No   Yes  Yes  Yes
They are considered as “two classes” of data and the label is whether the person
is over- or under-weighted. The above data can also be illustrated by Figure 1.1.
Consulting with experts may be expensive, so we would like to construct a model
from available information. Then for any person, this model could easily predict
whether he/she is over- or under-weighted.
Figure 1.1: Six training points (weight on the horizontal axis, height on the vertical axis)
1.2 Training and Testing Error
A model can be a rule like
If weight ≥ 60, then over-weighted.
Clearly, this rule does not make sense, as some tall people may be thin even though
their weights are more than 60. A better model may be the following rule:
If weight/(height)² ≥ 23, then over-weighted.
The goal of classification is to identify a good model so that future prediction is accurate.
Here “weight” and “height” are called features or attributes. In statistics, they are
called variables. Each person is considered as a data instance (or a data observation).
Mathematically, we have x = [weight, height] as a data instance and y = 1 or −1
as the label of each instance. Here, there are six training instances x1, . . . ,x6 with
corresponding class labels y = [−1,−1,−1, 1, 1, 1]T (if −1 and 1 mean under-weighted
and over-weighted, respectively).
1.3 Nearest Neighbor Methods
Here, we introduce a simple classifier: nearest neighbor. For any new person, we check
that his/her (height, weight) is closest to which one in the training set. For example,
if a person has weight=70 and height=1.8, then the closest one in the training set is
the third person, who is under-weighted. Thus, the new one is predicted as under-
weighted as well. Note that weight and height are now in two very different ranges,
so we may have to scale them before calculating the distance. This issue will be
discussed in Chapter 3.2.
If the Euclidean distance is considered, then for any given x, the nearest neighbor
method essentially predicts it to be in the class of the training instance closest to x,
i.e., the one attaining arg min_i ‖x − xi‖².
We may worry that the closest instance is wrongly recorded data, so sometimes
the k closest points are considered and the prediction is by a majority vote. This
method is called k-nearest neighbor. How to select k is an issue and will be discussed
in Chapter 1.5.
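The nearest neighbor rule above can be sketched in a few lines. This is an illustrative sketch (not the exercise solution), using the six training points of Table 1.1.1; note that the features are not scaled, so weight dominates the Euclidean distance, illustrating the scaling issue just mentioned.

```python
# 1-nearest-neighbor sketch on the six (weight, height) training points.
import math

train = [((50, 1.6), -1), ((60, 1.7), -1), ((70, 1.9), -1),
         ((70, 1.5), 1), ((80, 1.7), 1), ((90, 1.6), 1)]

def predict_1nn(x):
    # Find the training instance closest to x and return its label.
    point, label = min(train, key=lambda t: math.dist(t[0], x))
    return label
```

For the person with weight 70 and height 1.8 discussed above, the closest training point is (70, 1.9), so the prediction is under-weighted (−1).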
1.4 Linear Classifiers
A model after the training procedure can be, for example, a rule set, or the whole
training set like that by the nearest neighbor. Here we show that a straight line can
be a model as well.
Figure 1.2: Example of a linear classifier
In Figure 1.2, a line
0.2× weight− 10× height + 3 = 0
separates all the training data. In general we represent such a line as
wTx + b = 0,
where x = [weight, height]T , w = [0.2,−10]T , and b = 3. Then for any new data x,
we check whether it is on the right- or left-hand side of the line. That is,
if wTx + b > 0, predict x as “over-weighted”,
if wTx + b < 0, predict x as “under-weighted”.
How to find such a straight line will be discussed in Chapter 2.
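The decision rule above can be sketched directly with the line of Figure 1.2, where w = [0.2, −10]T and b = 3:

```python
# Linear classifier sketch: predict by the sign of w^T x + b.
w, b = (0.2, -10.0), 3.0

def predict(x):
    # x = (weight, height)
    value = w[0] * x[0] + w[1] * x[1] + b
    return "over-weighted" if value > 0 else "under-weighted"
```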
1.5 Overfitting and Underfitting
(a) Training data and an overfitting classifier
(b) Applying an overfitting classifier on testing data
(c) Training data and a better classifier
(d) Applying a better classifier on testing data

Figure 1.3: An overfitting classifier and a better classifier (● and ▲: training data; ○ and △: testing data).
Note that it may not be useful to achieve high training accuracy (i.e., classifiers
accurately predict training data whose class labels are indeed known). A clear illus-
tration is in Figure 1.3. It is a problem with two classes of data: triangles and circles.
Filled circles and triangles are the training data while hollow circles and triangles
are the testing data. The testing accuracy of the classifier in Figures 1.3(a) and 1.3(b)
is not good since it overfits the training data. On the other hand, the classifier in
Figures 1.3(c) and 1.3(d) does not overfit the training data and hence gives better
testing accuracy.
Some training data may be wrongly recorded, so sometimes we should allow train-
ing errors. That is, under the obtained model, we predict some training data to be
in their opposite class. Note that if there are no duplicated data instances, we can
always fit training data so that training accuracy is 100%. An example is in Figure
1.4.
Figure 1.4: We can always achieve 100% training accuracy
Perfect training accuracy is not good, so we should avoid overfitting training data.
On the other hand, a good model should also avoid underfitting, which means the
model does not extract enough information from the training data. An example is in
Figure 1.5. Clearly, the linear classifier does not use the information that most circles
are at the upper-right corner and most triangles are at the lower-left. Therefore, from
the discussion in this section, we conclude that a good classifier should
avoid overfitting and avoid underfitting.
1.6 Cross Validation
The above discussion also hints at another important fact about classification problems:
Training accuracy is not important; only test accuracy counts.
Figure 1.5: An underfitting example
This statement is quite obvious as for training data we already know their class
labels. However, as the true class labels of test data are not known, how do we find
the performance on predicting them? A common way is to separate the training data
into two parts, of which one is considered unknown when training the classifier. Then
the prediction accuracy on this set can more precisely reflect the performance on
classifying unknown data. An improved version of this procedure is cross-validation.
In v-fold cross-validation, we first divide the training set into v subsets of equal
size. Sequentially one subset is tested using the classifier trained on the remaining
v − 1 subsets. Thus, each instance of the whole training set is predicted once so
the cross-validation accuracy is the percentage of data which are correctly classified.
Usually v ≥ 5 is used.
In Chapter 1.3, we mentioned the k-nearest neighbor method; the selection of
k can be done via cross validation. For example, we sequentially try k = 1, 3, 5, 7, . . .
and calculate the corresponding cross validation accuracy. The k with the highest
accuracy is the best and is used for future prediction. Note that we consider only odd
k, so the majority vote in prediction produces a single winner.
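The whole procedure, v-fold cross-validation plus selecting k among odd values, can be sketched as follows; the 1-D data set here is made up purely for illustration:

```python
# v-fold cross-validation for choosing k in k-nearest neighbor.
import math

def knn_predict(train, x, k):
    # Majority vote among the k closest training points (k odd).
    neighbors = sorted(train, key=lambda t: math.dist(t[0], x))[:k]
    return 1 if sum(label for _, label in neighbors) > 0 else -1

def cv_accuracy(data, k, v=5):
    folds = [data[i::v] for i in range(v)]   # v subsets of equal size
    correct = 0
    for i in range(v):
        rest = [t for j in range(v) if j != i for t in folds[j]]
        correct += sum(knn_predict(rest, x, k) == y for x, y in folds[i])
    return correct / len(data)               # each instance predicted once

# Made-up 1-D data: negative class at 0..9, positive class at 10..19.
data = [((float(i), 0.0), 1 if i >= 10 else -1) for i in range(20)]
best_k = max([1, 3, 5, 7], key=lambda k: cv_accuracy(data, k))
```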
1.7 Exercises
1. Write a k-nearest neighbor code to train
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/binary/ijcnn1.bz2
and test
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/binary/ijcnn1.t.bz2
You need to conduct cross-validation on the training data in order to select a
good k.
Chapter 2
Support Vector Classification
2.1 Linear Separating Hyperplane with Maximal Margin
The original idea of SVM classification is to use a linear separating hyperplane to
create a classifier. Given training vectors xi, i = 1, . . . , l of length n, and a vector y
defined as follows:

yi = +1 if xi is in class 1,
yi = −1 if xi is in class 2,
the support vector technique tries to find the separating hyperplane with the largest
margin between two classes, measured along a line perpendicular to the hyperplane.
For example, in Figure 2.1, two classes could be fully separated by a dotted line
wTx + b = 0. We would like to decide the line with the largest margin. In other
words, intuitively we think that the distance between two classes of training data
should be as large as possible. That means we find a line with parameters w and b
such that the distance between wTx + b = ±1 is maximized.
The distance between wTx + b = 1 and −1 can be calculated by the following
way. Consider a point x on wTx + b = −1:
[Figure: a point x on the line wTx + b = −1 and the point x + tw where, moving along the normal direction w, it touches the line wTx + b = 1.]
Figure 2.1: Separating hyperplane (the three lines are wTx + b = +1, 0, and −1)
As w is the “normal vector” of the line wTx + b = −1, w and the line are
perpendicular to each other. Starting from x and moving along the direction w, we
assume x + tw touches the line wTx + b = 1. Thus,
wT (x + tw) + b = 1 and wT x + b = −1.
Then, twTw = 2, so the distance (i.e., the length of tw) is ‖tw‖ = 2‖w‖/(wTw) =
2/‖w‖. Note that ‖w‖ = √(w₁² + · · · + wₙ²). As maximizing 2/‖w‖ is equivalent to
minimizing wTw/2, we have the following problem:
min_{w,b}   (1/2) wTw
subject to   yi(wTxi + b) ≥ 1, i = 1, . . . , l.        (2.1)
The constraint yi(wTxi + b) ≥ 1 means
wTxi + b ≥ 1 if yi = 1,
wTxi + b ≤ −1 if yi = −1.
That is, data in the class 1 must be on the right-hand side of wTx+b = 0 while data in
the other class must be on the left-hand side. Note that the reason of maximizing the
distance between wTx + b = ±1 is based on Vapnik’s Structural Risk Minimization
(Vapnik, 1998).
The following example gives a simple illustration of maximal-margin separating
hyperplanes:
Example 2.1.1 Given two training data in R1 as in the following figure:
[Figure: two training points on the real line: △ at 0, ○ at 1.]
What is the separating hyperplane ?
Now two data are x1 = 1,x2 = 0 with y = [+1,−1]T . Furthermore, w ∈ R1, so
(2.1) becomes
min_{w,b}   (1/2) w²
subject to   w · 1 + b ≥ 1,        (2.2)
             −1 · (w · 0 + b) ≥ 1.        (2.3)
From (2.3), −b ≥ 1. Putting this into (2.2), w ≥ 2. In other words, any (w, b)
which satisfies (2.2) and (2.3) has w ≥ 2. As we are minimizing (1/2)w², the smallest
possibility is w = 2. Thus, (w, b) = (2,−1) is the optimal solution. The separating
hyperplane is 2x− 1 = 0, in the middle of the two training data:
[Figure: △ at 0, ○ at 1, and the separating point • at x = 1/2.]
2.2 Mapping Data to Higher Dimensional Spaces
Figure 2.2: An example which is not linearly separable
In practice, however, problems may not be linearly separable; an example
is in Figure 2.2. That is, there is no (w, b) which satisfies the constraints of (2.1). In
this situation, we say (2.1) is “infeasible.” In (Cortes and Vapnik, 1995) the authors
introduced slack variables ξi, i = 1, . . . , l in the constraints:
min_{w,b,ξ}   (1/2) wTw + C ∑_{i=1}^{l} ξi
subject to     yi(wTxi + b) ≥ 1 − ξi,        (2.4)
               ξi ≥ 0, i = 1, . . . , l.
That is, constraints (2.4) allow that training data may not be on the correct side of
the separating hyperplane wTx + b = 0. This situation happens when ξi > 1 and an
example is in the following figure
wTxi + b = 1− ξi < −1
We require ξi ≥ 0 because if ξi < 0, then yi(wTxi + b) ≥ 1 − ξi ≥ 1 and the training
data is already on the correct side. The new problem is always feasible since for any (w, b), setting

ξi ≡ max(0, 1 − yi(wTxi + b)), i = 1, . . . , l,

makes (w, b, ξ) a feasible solution.
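This construction is easy to compute; below is a sketch with made-up data and an arbitrary (w, b):

```python
# Computing the slack values xi_i = max(0, 1 - y_i(w^T x_i + b)) that make
# (w, b, xi) feasible. Data and (w, b) are made-up illustrations.
X = [(1.0, 2.0), (2.0, 0.5), (-1.0, -1.0)]
y = [1, 1, -1]
w, b = (0.5, 0.5), -0.5

def slack(x, label):
    margin = label * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)

xis = [slack(x, label) for x, label in zip(X, y)]
# Every constraint y_i(w^T x_i + b) >= 1 - xi_i now holds by construction.
```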
Using this setting, we may worry that for linearly separable data, some ξi > 1
and hence corresponding data are wrongly classified. For the case that most data
except some noisy ones are separable by a linear function, we would like wTx+ b = 0
correctly classifies the majority of points. Thus, in the objective function we add a
penalty term C ∑_{i=1}^{l} ξi, where C > 0 is the penalty parameter. To keep the objective
value as small as possible, most ξi should be zero, so the constraint goes back to its
original form. Theoretically we can prove that if data are linearly separable and C is
larger than a certain number, problem (2.4) goes back to (2.1) and all ξi are zero
(Lin, 2001a).
Unfortunately, such a setting is not enough for practical use. If data are distributed
in a highly nonlinear way, employing only a linear function causes many training
instances to be on the wrong side of the hyperplane. So underfitting occurs and the
decision function does not perform well.
To fit the training data better, we may think of using a nonlinear curve like that
in Figure 2.2. The problem is that it is very difficult to model nonlinear curves. All
we are familiar with are elliptic, hyperbolic, or parabolic curves, which are far from
enough in practice. Instead of using more sophisticated curves, another approach is
to map data into a higher dimensional space. For example, in the example in Chapter 1,
each data instance has two features (attributes): height and weight. We may
consider two other attributes
height-weight, weight/(height)².
Such features may provide more information for separating under-weighted/over-weighted
people. Each new data instance is now in a four-dimensional space, so if the two new
features are good, it should be easier to have a separating hyperplane so that most ξi
are zero.
Thus SVM non-linearly transforms the original input space into a higher dimen-
sional feature space. More precisely, the training data x is mapped into a (possibly
infinite) vector in a higher dimensional space:
φ(x) = [φ1(x), φ2(x), . . .].
In this higher dimensional space, it is more likely that the data can be linearly
separated. An example mapping x from R³ to R¹⁰ is as follows:

φ(x) = (1, √2x₁, √2x₂, √2x₃, x₁², x₂², x₃², √2x₁x₂, √2x₁x₃, √2x₂x₃).
An extreme example is to map a data instance x ∈ R1 to an infinite dimensional
space:
φ(x) = [1, x/1!, x²/2!, x³/3!, . . .]T.
We then try to find a linear separating plane in a higher dimensional space so
(2.4) becomes
min_{w,b,ξ}   (1/2) wTw + C ∑_{i=1}^{l} ξi
subject to     yi(wT φ(xi) + b) ≥ 1 − ξi,        (2.5)
               ξi ≥ 0, i = 1, . . . , l.
2.3 The Dual Problem
The remaining problem is how to effectively solve (2.5). Especially after data are
mapped into a higher dimensional space, the number of variables (w, b) becomes very
large or even infinite. We handle this difficulty by solving the dual problem of (2.5):
min_α   (1/2) ∑_{i=1}^{l} ∑_{j=1}^{l} αiαjyiyjφ(xi)T φ(xj) − ∑_{i=1}^{l} αi
subject to   0 ≤ αi ≤ C, i = 1, . . . , l,        (2.6)
             ∑_{i=1}^{l} yiαi = 0.
This new problem of course has some relation with the original problem (2.5), and
we hope that it can be solved more easily. Sometimes we write (2.6) in a matrix form
for convenience:
min_α   (1/2) αT Qα − eT α
subject to   0 ≤ αi ≤ C, i = 1, . . . , l,        (2.7)
             yT α = 0.
In (2.7), e is the vector of all ones, C is the upper bound, Q is an l by l positive
semidefinite matrix, Qij ≡ yiyjK(xi,xj), and K(xi,xj) ≡ φ(xi)T φ(xj) is the kernel,
which will be addressed in Chapter 2.4.
If (2.7) is called the “dual” problem of (2.5), we refer to (2.5) as the “primal” problem.
Suppose (w, b, ξ) and α are optimal solutions of the primal and dual problems,
respectively. Then the following two properties hold:
w = ∑_{i=1}^{l} αiyiφ(xi)        (2.8)

(1/2) wT w + C ∑_{i=1}^{l} ξi = eT α − (1/2) αT Qα.        (2.9)
In other words, if the dual problem is solved with a solution α, the optimal primal
solution w is easily obtained from (2.8). Suppose an optimal b is also easily found,
the decision function is hence determined.
Thus, the crucial point is whether the dual is easier to solve than the primal.
The number of variables in the dual is the size of the training set, l, which is a fixed
number. In contrast, the number of variables in the primal problem varies depending
on how data are mapped to a higher dimensional space. Therefore, moving from the
primal to the dual means that we solve a finite-dimensional optimization problem
instead of a possibly infinite-dimensional problem.
We illustrate this primal-dual relationship using data in Example 2.1.1 without
mapping them to a higher dimensional space. As the problem is linearly separable, it
is fine to consider the formulation (2.1) without slack variables ξi, i = 1, . . . , l. Then
the dual is
min_α   (1/2) ∑_{i=1}^{l} ∑_{j=1}^{l} αiαjyiyjxiT xj − ∑_{i=1}^{l} αi
subject to   0 ≤ αi, i = 1, . . . , l,
             ∑_{i=1}^{l} yiαi = 0.
Using data in Example 2.1.1, the objective function is
(1/2)α₁² − (α₁ + α₂)
= (1/2) [α₁ α₂] [1 0; 0 0] [α₁; α₂] − [1 1] [α₁; α₂].
Constraints are
α1 − α2 = 0, 0 ≤ α1, 0 ≤ α2.
Substituting α2 = α1 into the objective function,
(1/2)α₁² − 2α₁
has the smallest value at α1 = 2. As [2, 2]T satisfies constraints 0 ≤ α1 and 0 ≤ α2,
it is the optimal solution. Using the primal-dual relation (2.8),
w = y₁α₁x₁ + y₂α₂x₂ = 1 · 2 · 1 + (−1) · 2 · 0 = 2,

the same as what is obtained by directly solving the primal problem.
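This one-variable dual can also be checked numerically; a coarse grid search is enough for this sketch:

```python
# Checking the dual of Example 2.1.1: with alpha2 = alpha1 (from the
# equality constraint alpha1 - alpha2 = 0), the objective reduces to
# 0.5*a1**2 - 2*a1 over a1 >= 0.
def dual_obj(a1):
    a2 = a1
    return 0.5 * a1 ** 2 - (a1 + a2)

a_star = min((a / 100.0 for a in range(501)), key=dual_obj)  # grid on [0, 5]
# Primal-dual relation (2.8): w = y1*a1*x1 + y2*a2*x2.
w = 1 * a_star * 1 + (-1) * a_star * 0
```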
The calculation of b is easy, but is left in Chapter 6 for implementation details. The
remaining issue of using the dual problem concerns the inner product φ(xi)T φ(xj).
If φ(x) is an infinitely long vector, there is no way to fully write it down and then
calculate the inner product. Thus, even though the dual possesses the advantage of
having a finite number of variables, we cannot even write the problem down before
solving it. This is resolved by using special mapping functions φ so that φ(xi)T φ(xj)
can be efficiently calculated. Details are in the next section.
2.4 Kernel and Decision Functions
Consider a special φ(x) mentioned earlier (assume x ∈ R3):
φ(x) = (1, √2x₁, √2x₂, √2x₃, x₁², x₂², x₃², √2x₁x₂, √2x₁x₃, √2x₂x₃).
In this case it is easy to see that φ(xi)T φ(xj) = (1 + xiT xj)², which is easier to
calculate than a direct inner product. To be more precise, a direct calculation
of φ(xi)T φ(xj) takes 10 multiplications and 9 additions, but using (1 + xiT xj)², only
four multiplications and three additions are needed. Therefore, if a special φ(x) is
considered, even though it is a long vector, φ(xi)T φ(xj) may still be easily available.
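This identity can be verified numerically; the two test points below are made up:

```python
# Verifying phi(x)^T phi(z) == (1 + x^T z)^2 for the degree-2 mapping
# of x in R^3 given above.
import math

def phi(x):
    s = math.sqrt(2.0)
    x1, x2, x3 = x
    return [1.0, s * x1, s * x2, s * x3, x1 * x1, x2 * x2, x3 * x3,
            s * x1 * x2, s * x1 * x3, s * x2 * x3]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0, 3.0), (0.5, -1.0, 2.0)
explicit = dot(phi(x), phi(z))     # inner product of two R^10 vectors
kernel = (1.0 + dot(x, z)) ** 2    # far fewer operations
```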
We call such inner products the “kernel function.” Some popular kernels are, for
example,
1. e^{−γ‖xi−xj‖²} (Gaussian kernel or radial basis function (RBF) kernel),
2. (xiT xj/γ + δ)^d (polynomial kernel),
where γ, d, and δ are kernel parameters. The following calculation shows that the
Gaussian (RBF) kernel indeed is an inner product of two vectors in an infinite di-
mensional space. Assume x ∈ R1 and γ > 0.
e^{−γ‖xi−xj‖²} = e^{−γ(xi−xj)²}
= e^{−γxi² + 2γxixj − γxj²}
= e^{−γxi² − γxj²} (1 + (2γxixj)/1! + (2γxixj)²/2! + (2γxixj)³/3! + · · ·)
= e^{−γxi² − γxj²} (1 · 1 + √(2γ/1!) xi · √(2γ/1!) xj + √((2γ)²/2!) xi² · √((2γ)²/2!) xj²
  + √((2γ)³/3!) xi³ · √((2γ)³/3!) xj³ + · · ·)
= φ(xi)T φ(xj),

where

φ(x) = e^{−γx²} [1, √(2γ/1!) x, √((2γ)²/2!) x², √((2γ)³/3!) x³, · · ·]T.
Note that γ > 0 is needed for the existence of terms such as √(2γ/1!), √((2γ)³/3!), etc.
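The expansion can be checked numerically by truncating the series; γ and the two points below are made-up values:

```python
# Truncated series check of e^{-gamma (xi - xj)^2} = phi(xi)^T phi(xj)
# for x in R^1.
import math

gamma, xi, xj = 0.7, 0.9, -0.4

def phi(x, terms=30):
    # k-th component: exp(-gamma x^2) * sqrt((2 gamma)^k / k!) * x^k
    return [math.exp(-gamma * x * x) *
            math.sqrt((2 * gamma) ** k / math.factorial(k)) * x ** k
            for k in range(terms)]

series = sum(a * b for a, b in zip(phi(xi), phi(xj)))
exact = math.exp(-gamma * (xi - xj) ** 2)
# series converges to exact as the number of terms grows
```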
After (2.7) is solved with a solution α, the vectors xi for which αi > 0 are called
support vectors. Then, from (2.8), the decision function is written as

f(x) = sign(wT φ(x) + b) = sign( ∑_{i=1}^{l} yiαiφ(xi)T φ(x) + b ).        (2.10)
In other words, for a test vector x, if ∑_{i=1}^{l} yiαiφ(xi)T φ(x) + b > 0, we classify it to
be in the class 1. Otherwise, we think it is in the second class. We can see that only
support vectors will affect results in the prediction stage. In general, the number of
support vectors is not large. Therefore we can say SVM is used to find important
data (support vectors) from training data.
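A small sketch of prediction via (2.10): only the instances with αi > 0 appear in the sum. The support vectors, coefficients, b, and γ below are made-up values, not a trained model:

```python
# Evaluating the decision function (2.10) with an RBF kernel.
import math

def rbf(x, z, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

# (x_i, y_i, alpha_i) for the support vectors only (alpha_i > 0).
support = [((0.0, 1.0), 1, 0.8), ((1.0, 0.0), -1, 0.8)]
b = 0.0

def decide(x):
    s = sum(alpha * y * rbf(xi, x) for xi, y, alpha in support) + b
    return 1 if s > 0 else -1
```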
We use Figure 2.2 as an illustration. Two classes of training data are not linearly
separable. Using the RBF kernel, we obtain a hyperplane wTφ(x) + b = 0. In the
original space, it is indeed a nonlinear curve
∑_{i=1}^{l} yiαiφ(xi)T φ(x) + b = 0.        (2.11)
In the figure, the support vectors (marked as +) are selected from both classes of
training data. Clearly, support vectors, being close to the nonlinear curve (2.11),
are the more important points.
Figure 2.3: Support vectors (marked as +) are important data from training data
2.5 Multi-class SVM
The discussion so far assumes that data are in only two classes. Many practical
applications involve more classes. For example, hand-written digit recognition
considers data in 10 classes: digits 0 to 9. There are many ways to extend SVM for
such cases. Here, we discuss two simple methods.
2.5.1 One-against-all Multi-class SVM
This commonly mis-named method should be called “one-against-the-rest.” It
constructs binary SVM models so that each one is trained with one class as positive and
the rest as negative. We illustrate this method by a simple situation of four classes.
The four two-class SVMs are
yi = 1    yi = −1          Decision function
class 1   classes 2,3,4    f^1(x) = (w^1)T x + b^1
class 2   classes 1,3,4    f^2(x) = (w^2)T x + b^2
class 3   classes 1,2,4    f^3(x) = (w^3)T x + b^3
class 4   classes 1,2,3    f^4(x) = (w^4)T x + b^4
For any test data x, if it is in the ith class, we would expect that

f^i(x) ≥ 1 and f^j(x) ≤ −1 for j ≠ i.

This “expectation” directly follows from our setting of training the four two-class
problems and from the assumption that data are correctly separated. Therefore, f^i(x)
has the largest value among f^1(x), . . . , f^4(x), and hence the decision rule is

Predicted class = arg max_{i=1,...,4} f^i(x).
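The rule above can be sketched as follows; the four linear decision functions use made-up weights:

```python
# One-against-the-rest: pick the class whose decision function
# f^i(x) = (w^i)^T x + b^i is largest. The weights are made up.
models = {  # class label -> (w, b)
    1: ((1.0, 0.0), 0.0),
    2: ((0.0, 1.0), 0.0),
    3: ((-1.0, 0.0), 0.0),
    4: ((0.0, -1.0), 0.0),
}

def predict(x):
    def f(w, b):
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(models, key=lambda c: f(*models[c]))
```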
2.5.2 One-against-one Multi-class SVM
This method also constructs several two-class SVMs, but each one uses training data
from only two different classes. Thus, this method is sometimes called a “pairwise”
approach. For the same example of four classes, six two-class problems are constructed:
yi = 1    yi = −1    Decision function
class 1   class 2    f^12(x) = (w^12)T x + b^12
class 1   class 3    f^13(x) = (w^13)T x + b^13
class 1   class 4    f^14(x) = (w^14)T x + b^14
class 2   class 3    f^23(x) = (w^23)T x + b^23
class 2   class 4    f^24(x) = (w^24)T x + b^24
class 3   class 4    f^34(x) = (w^34)T x + b^34
For any test data x, we put it into the six functions. If the problem of classes
i and j indicates the data x should be in i, the class i gets one vote. For example,
assume
Classes    Winner
1 vs 2     1
1 vs 3     1
1 vs 4     1
2 vs 3     2
2 vs 4     4
3 vs 4     3
Then, we have
class      1  2  3  4
# votes    3  1  1  1
Thus, x is predicted to be in the first class.
For a data set with k different classes, this method constructs k(k − 1)/2 two-class
SVMs. We may worry that sometimes more than one class obtains the highest
number of votes. In practice this situation does not happen often, and there are
further strategies to handle it.
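The voting step can be sketched as follows, reusing the pairwise outcomes of the example table:

```python
# One-against-one voting: each pairwise winner gets one vote; predict
# the class with the most votes.
from collections import Counter

pairwise_winner = {(1, 2): 1, (1, 3): 1, (1, 4): 1,
                   (2, 3): 2, (2, 4): 4, (3, 4): 3}

votes = Counter(pairwise_winner.values())
predicted = max(votes, key=lambda c: votes[c])
```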
2.6 Notes
The formulas of SVM were developed in (Boser et al., 1992; Cortes and Vapnik, 1995),
where the mapping function and the dual problem are introduced. Other general SVM
references are, for example, (Cristianini and Shawe-Taylor, 2000; Scholkopf and Smola,
2002).
Many works have shown that data in a higher dimensional space have a larger
chance of being separable (e.g., Cover (1965)). Such results explain why our mapping
here should be useful.
The one-against-one method was introduced in (Knerr et al., 1990; Friedman,
1996), and its first use for SVM was in (Kreßel, 1999). A comparison of one-against-all,
one-against-one, and other approaches for multi-class SVM is (Hsu and Lin, 2002).
2.7 Exercises
1. Given three training data in R2 as in the following figure:
[Figure: three training points in R²: ○ at (0, 0); ● at (0, 1) and (1, 0).]
What is the separating hyperplane ?
2. Given four training data in R2 as in the following figure:
[Figure: four training points in R²: ○ at (0, 0) and (1, 1); ● at (0, 1) and (1, 0).]
What is the separating hyperplane ?
3. Solve problem 2 via its dual optimization formula.
4. Assume x ∈ R² and γ > 0. Show that e^{−γ‖xi−xj‖²} can be written in the form φ(xi)T φ(xj).
Chapter 3
Training and Testing a Data set
3.1 Categorical Features
SVM requires that each data instance is represented as a vector of real numbers.
Hence, if there are categorical attributes, we first have to convert them into numeric
data:
1. Use one integer number to represent an m-category attribute. For example, a
three-category attribute such as {red, green, blue} can be represented as 1,2,3.
2. Use m binary values to represent an m-category attribute. Only one of the m
numbers is one, and others are zero. Thus, {red, green, blue} can be represented
as (0,0,1), (0,1,0), and (1,0,0).
Our experience indicates that if the number of values in an attribute is not too large,
the second coding might be more stable than using a single number to represent a
categorical attribute.
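The two encodings can be sketched as follows; which position of the m binary values holds the 1 is an arbitrary but fixed choice:

```python
# Two ways to encode a three-category attribute {red, green, blue}.
categories = ["red", "green", "blue"]

def as_integer(value):
    # Encoding 1: a single integer 1, 2, or 3.
    return categories.index(value) + 1

def as_binary(value):
    # Encoding 2: m binary values, exactly one of them set to 1.
    return [1 if c == value else 0 for c in categories]
```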
3.2 Data Scaling
Assume there are three training instances as follows
      height  gender  y
x1    150     F       −1
x2    180     M        1
x3    185     M        1
The attribute “height” is in centimeters. Clearly, the second attribute, “gender,” is
consistent with the target class y and hence should be more important. However, if F and M are
transformed to be 0 and 1, respectively, training data and the separating hyperplane
are
[Figure: the three training points and the separating hyperplane.]
The separating hyperplane is nearly a vertical line, so the decision strongly de-
pends on the first attribute. This result is not good as the second attribute should
play a more important role.
If we linearly scale the first attribute to the range [0, 1] by

(1st attribute − 150) / (185 − 150),
then new points and the separating hyperplane are
[Figure: the scaled training points and the new separating hyperplane.]
This, transformed back to the original space, is
[Figure: the hyperplane in the original space.]
Therefore, the second attribute plays a role in the decision function.
This example explains that when features are in different numerical ranges, those
in larger ranges may dominate the others. Thus, a proper scaling of features before
training SVM can be very important.
Another reason for doing data scaling is to avoid numerical difficulties during the
calculation. For example, if the polynomial kernel
(xiT xj + 1)⁸        (3.1)
is used, the first attribute, which ranges from 150 to 185, will cause the value of (3.1)
to be larger than (10²)⁸ = 10¹⁶. Computer overflow easily happens when dealing with
such numbers. Moreover, if the following RBF kernel is used:
e^{−‖xi−xj‖²},        (3.2)

we have values smaller than e^{−10000}. This is so small (< 10^{−300}) that the decision
function has

∑_{i=1}^{l} αiyiK(xi, x) + b ≈ b,
if x 6= xi, i = 1, . . . , l. Apparently, this decision function is not good.
A simple linear scaling is formally stated as follows. Assume Mi and mi are
respectively the largest and smallest values of the ith attribute, and we would like to
scale the ith attribute to the range [−1, +1]:
[mi is mapped to −1 and Mi to +1.]
If x is the original value of the ith attribute in one data instance, the new value
should be
x′ = (x − (Mi + mi)/2) / ((Mi − mi)/2) = 2(x − mi)/(Mi − mi) − 1.
There are many other possible ways of data scaling, but they will not be discussed here.
Of course we must use the same method to scale testing data before testing. For
example, suppose that we scaled the first attribute of training data from [-10, +10]
to [-1, +1]. If the first attribute of testing data is lying in the range [-11, +8], we
must scale the testing data to [-1.1, +0.8].
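The procedure can be sketched as follows, reproducing the [−10, +10] example above:

```python
# Fit the linear scaling x' = 2(x - m)/(M - m) - 1 on the training range,
# then apply the SAME transformation to testing data.
def fit_scaler(train_column):
    m, M = min(train_column), max(train_column)
    def scale(x):
        return 2.0 * (x - m) / (M - m) - 1.0
    return scale

scale = fit_scaler([-10.0, 0.0, 10.0])   # training range [-10, +10]
low, high = scale(-11.0), scale(8.0)     # testing values map to -1.1, 0.8
```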
Data scaling is important for many other classification methods. Sarle (1997)
explains why we scale data while using Neural Networks, and most considerations
also apply to SVM.
Some ask about the difference between scaling each feature to [−1, +1] and to [0, +1]. In Homework 1, we show that for the linear and RBF kernels, if different parameters have been considered, the two choices are fully equivalent. However, for polynomial kernels they are different: scaling to [0, 1] causes all kernel elements to be nonnegative. It is not yet clear whether this is a good property or not.
3.3 Model Selection
Though only a few common kernels are mentioned in Chapter 2, we must decide which one to try first. Then we also need to choose the penalty parameter C and the kernel parameters.
3.3.1 RBF Kernel
We suggest that in general the RBF kernel is a reasonable first choice. The RBF kernel nonlinearly maps samples into a higher dimensional space, so, unlike the linear kernel, it can handle the case where the relation between class labels and attributes is nonlinear. Furthermore, the linear kernel is a special case of RBF, as Keerthi and Lin (2003) show that the linear kernel with a penalty parameter C has the same performance as the RBF kernel with some parameters (C, γ).
The second reason is the number of hyperparameters, which influences the complexity of model selection. The polynomial kernel has more hyperparameters than the RBF kernel.
Finally, the RBF kernel has fewer numerical difficulties. One key point is that 0 < K(xi, xj) ≤ 1, in contrast to polynomial kernels, whose values may go to infinity (xiᵀxj/γ + δ > 1) or zero (xiᵀxj/γ + δ < 1) when the degree is large.
3.3.2 Cross-validation and Grid-search
There are two parameters to choose when using the RBF kernel: C and γ. It is not known beforehand which C and γ are best for a given problem; consequently, some kind of model selection (parameter search) must be done. The goal is to identify good (C, γ) so that the classifier can accurately predict unknown data (i.e., testing data). Note that Chapter 1.5 has explained that it may not be useful to achieve high training accuracy (i.e., a classifier that accurately predicts training data whose class labels are indeed known). Similar to selecting k for k-nearest neighbors in Chapter 1.6, cross validation estimates the performance of the model.
We recommend a “grid-search” on C and γ using cross-validation. Basically, pairs of (C, γ) are tried and the one with the best cross-validation accuracy is picked. We found that trying exponentially growing sequences of C and γ is a practical method to identify good parameters (for example, C = 2^−5, 2^−3, …, 2^15 and γ = 2^−15, 2^−13, …, 2^3).
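Generating this exponential grid is straightforward in Python (a sketch; each pair would then be scored by cross-validation, e.g. with LIBSVM's svm-train -v option):

```python
# Exponentially growing sequences of C and gamma, as recommended.
C_range = [2.0 ** p for p in range(-5, 16, 2)]      # 2^-5, 2^-3, ..., 2^15
gamma_range = [2.0 ** p for p in range(-15, 4, 2)]  # 2^-15, 2^-13, ..., 2^3

# Every (C, gamma) pair is tried; the evaluations are independent of
# each other, so they can run in parallel.
grid = [(C, g) for C in C_range for g in gamma_range]
print(len(grid))  # 110 candidate pairs
```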
The grid-search is straightforward but may seem naive. In fact, there are several advanced methods which can save computational cost by, for example, approximating the cross-validation rate. However, there are two reasons why we prefer the simple grid-search approach.
One is that, psychologically, we may not feel safe using methods which avoid an exhaustive parameter search by approximations or heuristics. The other reason is that the computational time needed to find good parameters by grid-search is not much more than that of advanced methods, since there are only two parameters. Furthermore, the grid-search can be easily parallelized because each (C, γ) is independent. Many advanced methods are iterative processes, e.g., walking along a path, which might be difficult to parallelize.
Figure 3.1: Loose grid search on C = 2^−5, 2^−3, …, 2^15 and γ = 2^−15, 2^−13, …, 2^3.
Since doing a complete grid-search may still be time-consuming, we recommend using a coarse grid first. After identifying a “better” region on the grid, a finer grid search on that region can be conducted. To illustrate this, we do an experiment on the problem german from the Statlog collection (Michie et al., 1994). After scaling this set, we first use a coarse grid (Figure 3.1) and find that the best (C, γ) is (2^3, 2^−5) with the cross-validation rate 77.5%. Next we conduct a finer grid search on the neighborhood of (2^3, 2^−5) (Figure 3.2) and obtain a better cross-validation rate of 77.6% at (2^3.25, 2^−5.25). After the best (C, γ) is found, the whole training set is trained again to generate the final classifier. Note that there is no need to conduct a very fine grid
Figure 3.2: Fine grid-search on C = 2^1, 2^1.25, …, 2^5 and γ = 2^−7, 2^−6.75, …, 2^−3.
search. Figure 3.1 clearly shows that good parameters are in a quite wide region.
The above approach works well for problems with thousands or more data points.
For very large data sets, a feasible approach is to randomly choose a subset of the data set, conduct grid-search on it, and then do a better-region-only grid-search on the complete data set.
3.4 A General Procedure
To use SVM, we propose trying the following procedure first:
• Conduct simple scaling on the data.
• Consider the RBF kernel K(x, y) = e^(−γ‖x−y‖²).
• Use cross-validation to find the best parameters C and γ.
• Use the best parameters C and γ to train on the whole training set.
• Test.
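The whole procedure can be sketched with scikit-learn, whose SVC classifier is built on LIBSVM (a sketch on synthetic data, not part of the original guide):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for a real data set, to keep the sketch self-contained.
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. Simple scaling to [-1, 1]; factors come from the training set only.
scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# 2-3. RBF kernel + cross-validation over an exponential (C, gamma) grid.
search = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": [2.0 ** p for p in range(-5, 16, 2)],
     "gamma": [2.0 ** p for p in range(-15, 4, 2)]},
    cv=5,
)
search.fit(X_train_s, y_train)

# 4-5. GridSearchCV retrains on the whole training set with the best
# parameters (refit=True by default); then we test.
print(search.best_params_)
print(search.score(X_test_s, y_test))
```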
3.5 Using LIBSVM
We use LIBSVM, a library for support vector machines, to demonstrate the training
and testing procedure (Chang and Lin, 2001b). It is available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm
Instructions for installation on different platforms are in the README file of the
package.
The format of a training or testing data file is:

<label> <index1>:<value1> <index2>:<value2> ...

<label> is the target value of the training data. For classification, it should be an integer which identifies a class (multi-class classification is supported). <index> is an integer starting from 1 and <value> is a real number. The indices must be in ascending order. An example follows:
1.0 1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02
1.0 1:5.707397e+01 2:2.214040e+02 3:8.607959e-02 4:1.229114e+02
1.0 1:1.725900e+01 2:1.734360e+02 3:-1.298053e-01 4:1.250318e+02
1.0 1:2.177940e+01 2:1.249531e+02 3:1.538853e-01 4:1.527150e+02
0.0 1:2.391101e+01 2:3.890001e+01 3:4.704049e-01 4:1.257871e+02
0.0 1:2.230670e+01 2:2.262220e+01 3:2.117224e-01 4:1.012818e+02
0.0 1:1.640820e+01 2:3.920219e+01 3:-9.912787e-02 4:3.248707e+01
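Such lines are easy to parse; a small Python sketch (the function name is ours):

```python
def parse_libsvm_line(line):
    """Parse one '<label> <index>:<value> ...' line into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    feats = {}
    for item in parts[1:]:
        idx, val = item.split(":")
        feats[int(idx)] = float(val)
    return label, feats

label, feats = parse_libsvm_line(
    "1.0 1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02")
print(label, feats[4])  # 1.0 125.1225
```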
Clearly this data set contains four features. Next we consider a data set in
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/data/train.1
It includes 3,089 training instances.
A simple training shows
$./svm-train train.1
......*
optimization finished, #iter = 6131
nu = 0.606144
obj = -1061.528899, rho = -0.495258
nSV = 3053, nBSV = 724
Total nSV = 3053
From the output, obj is the optimal objective value of the dual SVM problem (2.7). The value rho is −b in the decision function (2.10). nSV and nBSV are the numbers of support vectors (i.e., 0 < αi ≤ C) and bounded support vectors (i.e., αi = C). Other information such as #iter will be explained in Chapter 6.
The training procedure generates a model file train.1.model, which includes information such as support vectors and dual optimal solutions. The test file has the same format as the training file. If labels of the testing data are available, we can calculate the test accuracy. If they are unknown, this column must still be filled with an arbitrary number. The testing procedure is as follows:
$./svm-predict test.1 train.1.model test.1.predict
Accuracy = 66.925% (2677/4000)
The test accuracy is not satisfactory. If we predict the training data using the same
model:
$./svm-predict train.1 train.1.model o
Accuracy = 99.7734% (3082/3089)
We find that training and testing accuracy are rather different. In fact, overfitting has occurred. From the discussion earlier in this chapter, we understand that scaling and parameter selection may be needed.
Thus, we use the program svm-scale provided in LIBSVM to conduct data scaling:
$./svm-scale -l -1 -u 1 train.1 > train.1.scale
This means that each attribute is linearly scaled to the range [−1, 1]. A common mistake is to then scale the test data in the same way:
$./svm-scale -l -1 -u 1 test.1 > test.1.scale
Remember that we should use the same scaling factors for the training and testing sets. A correct way is:
$./svm-scale -s range1 train.1 > train.1.scale
$./svm-scale -r range1 test.1 > test.1.scale
That is, we store the scaling factors used in training and apply them to the testing set. By training and predicting the scaled sets we obtain:
$./svm-train train.1.scale
$./svm-predict test.1.scale train.1.scale.model test.1.predict
→ Accuracy = 96.15%
The test accuracy is now much better. The parameters used so far are the default ones: C = 1 and the RBF kernel with γ = 1/n = 0.25, where n is the number of features. Note that different parameters can lead to very different performance. For example, if we use C = 20, γ = 400:
$./svm-train -c 20 -g 400 train.1.scale
$./svm-predict train.1.scale train.1.scale.model o
Accuracy = 100% (3089/3089) (classification)
we obtain 100% training accuracy but very bad accuracy
$./svm-predict test.1.scale train.1.scale.model o
Accuracy = 82.7% (3308/4000) (classification)
Thus parameter selection is quite important.
In LIBSVM there is a simple tool grid.py for parameter selection:
$./grid.py train.1.scale
[local] -1 -7 85.1408 (best c=0.5, g=0.0078125, rate=85.1408)
[local] 5 -7 95.4354 (best c=32.0, g=0.0078125, rate=95.4354)
...
Best c=2.0, g=2.0
A contour like Figures 3.1 and 3.2 showing cross validation accuracy is generated.
3.6 Other Examples
Table 3.6.1 presents some real-world examples. These data sets were reported by users who could not obtain reasonable accuracy in the beginning. Using the procedure illustrated in Section 3.4, we helped them achieve better performance.
∗ Courtesy of Jan Conrad from Uppsala University, Sweden.
† Courtesy of Cory Spencer from Simon Fraser University, Canada (Gardy et al., 2003).
‡ Courtesy of a user from Germany.
§ As there are no testing data, cross-validation accuracy instead of testing accuracy is presented here.
Table 3.6.1: Problem characteristics and performance comparisons.

Application      #training  #testing  #features  #classes  Accuracy  Accuracy by
                 data       data                           by users  our procedure
Astroparticle∗   3,089      4,000     4          2         75.2%     96.9%
Bioinformatics†  391        0§        20         3         36%       85.2%
Vehicle‡         1,243      41        21         2         4.88%     87.8%
These sets are at http://www.csie.ntu.edu.tw/~cjlin/papers/guide/data/.
The first set has been discussed in Section 3.5. Here we present details of the other two. We also demonstrate how to use the script easy.py, which performs exactly the general procedure of Section 3.4.
• Bioinformatics
– Original sets with default parameters
$./svm-train -v 5 train.2
→ Cross Validation Accuracy = 56.5217%
– Scaled sets with default parameters
$./svm-scale -l -1 -u 1 train.2 > train.2.scale
$./svm-train -v 5 train.2.scale
→ Cross Validation Accuracy = 78.5166%
– Scaled sets with parameter selection
$python grid.py train.2.scale
...
2.0 0.5 85.1662
→ Cross Validation Accuracy = 85.1662%
(Best C=2.0, γ=0.5 with five fold cross-validation rate=85.1662%)
– Using an automatic script
$python easy.py train.2
Scaling training data...
Cross validation...
Best c=2.0, g=0.5
Training...
• Vehicle
– Original sets with default parameters
$./svm-train train.3
$./svm-predict test.3 train.3.model test.3.predict
→ Accuracy = 2.43902%
– Scaled sets with default parameters
$./svm-scale -l -1 -u 1 -s range3 train.3 > train.3.scale
$./svm-scale -r range3 test.3 > test.3.scale
$./svm-train train.3.scale
$./svm-predict test.3.scale train.3.scale.model test.3.predict
→ Accuracy = 12.1951%
– Scaled sets with parameter selection
$python grid.py train.3.scale
...
128.0 0.125 84.8753
(Best C=128.0, γ=0.125 with five-fold cross-validation rate=84.8753%)
$./svm-train -c 128 -g 0.125 train.3.scale
$./svm-predict test.3.scale train.3.scale.model test.3.predict
→ Accuracy = 87.8049%
– Using an automatic script
$python easy.py train.3 test.3
Scaling training data...
Cross validation...
Best c=128.0, g=0.125
Training...
Scaling testing data...
Testing...
Accuracy = 87.8049% (36/41) (classification)
Figure 3.3: Cross validation using 3,000 points
3.7 A Large Practical Example
In this section we discuss how to deal with a large and practical data set. It comes from the first problem of IJCNN Challenge 2001, organized by Ford Scientific Research Labs (Prokhorov, 2001). We summarize the approach of the winning entry (Chang and Lin, 2001a). The training set consists of 50,000 instances like the following:
0.000000 -0.999991 0.169769 0.000000 1.000000
0.000000 -0.659538 0.169769 0.000292 1.000000
0.000000 -0.660738 0.169128 -0.020372 1.000000
1.000000 -0.660307 0.169128 0.007305 1.000000
0.000000 -0.660159 0.169525 0.002519 1.000000
0.000000 -0.659091 0.169525 0.018198 1.000000
0.000000 -0.660532 0.169525 -0.024526 1.000000
0.000000 -0.659798 0.169525 0.012458 1.000000
Figure 3.4: Cross validation using 10,000 points
They are results at 50,000 time points and hence form a time series. There are 100,000 testing points. The kth instance contains

    x1(k), x2(k), x3(k), x4(k), x5(k), y(k),

where y(k) = ±1 is the class label. For time-series data, past and future information may affect the current class label. Moreover, it is known that the fifth attribute x5(k) is independent of y(k); a test instance is considered for evaluation only if x5(k) = 1. Therefore, among the 100,000 test instances, only around 90,000 are evaluated. It is also known that x4(k) is more important than the other features.
To begin, we analyze the features in more detail. The first feature x1(k) has a periodicity such that sequentially we have nine 0s, one 1, nine 0s, one 1, and so on. The other attributes, x2(k), …, x4(k), are real numbers in the range of ±1.5. An interesting
Figure 3.5: Cross validation using 50,000 points
observation is that for the 50,000 training data, 90% of the y(k) are −1. Thus, simply guessing that all test data are −1 already gives about 90% accuracy. The difficulty is how to use learning techniques to achieve higher accuracy.
To use SVM for constructing a model, first we have to decide the attributes (i.e., features) of each instance. There are many possible variables which may affect y(k). In addition, for each attribute we need an encoding scheme. For example, to represent the periodicity of x1(k), we can include x1(k−5), …, x1(k+4) as 10 binary attributes of the kth instance. Alternatively, we can use a single integer between 1 and 10 which indicates the position of the 1 in x1(k−5), …, x1(k+4). Based on our experience, we choose the former way as it might be better for support vector machines.
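A sketch of the former encoding (the helper name is ours, and boundary handling is simplified to zero-padding):

```python
def encode_x1_window(x1, k):
    """10 binary attributes x1(k-5), ..., x1(k+4) for the kth instance
    (indices outside the series are padded with 0.0 in this sketch)."""
    n = len(x1)
    return [x1[t] if 0 <= t < n else 0.0 for t in range(k - 5, k + 5)]

# x1 is periodic: nine 0s, one 1, nine 0s, one 1, ...
x1 = ([0.0] * 9 + [1.0]) * 5
window = encode_x1_window(x1, k=12)
print(window)             # exactly one of the ten attributes is 1
print(window.index(1.0))  # the position of the 1 within the window
```

The alternative encoding would simply be this position plus one, a single integer between 1 and 10.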
We directly use x2(k) and x3(k) as they are. As x4(k) is more important, we consider some past and future elements. After conducting some cross validation tests, we decide to use x4(k−5), …, x4(k+4). Therefore, each training instance consists
of 22 attributes.
For learning techniques like Neural Networks or Support Vector Machines, it is
recommended to scale each attribute of data into an appropriate range such as [−1, 1]
or [0, 1]. Since all raw data under our encoding scheme are already in a small region
[−1.5, 1.5], we do not conduct any scaling.
After preparing the training data, we do model selection by 5-fold cross validation. We consider only the RBF kernel K(xi, xj) = e^(−γ‖xi−xj‖²). Thus the two parameters are the kernel parameter γ and the penalty parameter C in (2.5).
First we work on a small subset of the training data: 3,000 randomly selected points. The contour of cross validation accuracy is in Figure 3.3, where the two axes are log2 C and log2 γ. It can be seen that the best cross validation rates happen at around C = 2^7 and γ = 2^0. We then work on a larger subset with 10,000 data points. Results are in Figure 3.4. Parameters with the best cross validation rate are at around C = 2^4 to 2^5 and γ = 2^0 to 2^1, a different range from that in Figure 3.3. Finally we do the model selection on all 50,000 training data, with results shown in Figure 3.5. Again we note that the best parameters slightly move to another region.
Therefore, the experiment seems to show that the best parameters depend on the size of the training data. Remember that the objective function of SVM is

    (1/2)wᵀw + C Σ_{i=1}^l ξi.

Under the same w, a larger number of instances causes a larger Σ_{i=1}^l ξi. Thus, some have proposed using

    (1/2)wᵀw + (C/l) Σ_{i=1}^l ξi.

This matches our experimental results, as roughly

    2^7 · 3,000 ≈ 2^5 · 10,000 ≈ 2^3 · 50,000.
Though using a subset for parameter selection saves time, a procedure using all available training points may still be the most reliable.
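The rough constancy of C · l across the three runs can be checked directly (sizes and best C values taken from Figures 3.3-3.5):

```python
# (best C, training-set size) for the 3,000-, 10,000-, and 50,000-point runs.
pairs = [(2 ** 7, 3000), (2 ** 5, 10000), (2 ** 3, 50000)]

products = [C * l for C, l in pairs]
print(products)  # [384000, 320000, 400000] -- all roughly equal
```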
Finally, we select C = 2^4 and γ = 2^2 to train on all 50,000 data and obtain a model for testing. There are 1,293 test errors. This was the winning entry of the IJCNN 2001 competition; the second-place entry had more than 2,000 errors. For the 50,000 training data, the number of support vectors is around 3,000. Note that the number of SVs varies across data sets.
3.8 A Failed Example
3.9 Homework
1. (a) Show that for the linear kernel, data scaled to [−1, +1] and trained by SVM with parameter C is equivalent to the same data scaled to [0, +1] and solved by SVM with 4C.
(b) Show that for the RBF kernel, data scaled to [−1, +1] and trained by SVM with parameter γ is equivalent to the same data scaled to [0, +1] and solved by SVM with 4γ.
Chapter 4
Support Vector Regression
4.1 Linear Regression
Given training data (x1, y1), …, (xl, yl) in Figure 4.1, where the xi are input vectors and yi is the associated output value of xi, traditional linear regression finds a linear function wᵀx + b such that (w, b) is an optimal solution of

    min_{w,b} Σ_{i=1}^l (yi − (wᵀxi + b))².          (4.1)

In other words, wᵀx + b approximates the training data by minimizing the sum of squared errors.
Figure 4.1: Linear Regression
Note that n, the number of features, is in general less than l. Otherwise, a function passing through all points, making (4.1) zero, is an optimal wᵀx + b. In such cases, overfitting occurs.
Similar to classification, if the data are nonlinearly distributed, a linear function is not good enough. Therefore, we also map data to a higher dimensional space by a function φ(x). Then l ≤ the dimensionality of φ(x), so again overfitting happens. An example is in Figure 4.2.
(a) Overfitting (b) Better approximation
Figure 4.2: Nonlinear regression
4.2 Support Vector Regression
To remedy the overfitting problem after using φ, we consider the following reformulation of (4.1):

    min_{w,b,ξ,ξ*} Σ_{i=1}^l (ξi² + (ξi*)²)
    subject to −ξi* ≤ yi − (wᵀφ(xi) + b) ≤ ξi,          (4.2)
               ξi, ξi* ≥ 0, i = 1, …, l.

It is easy to see that (4.1) (with x replaced by φ(x)) and (4.2) are equivalent: if (w, b, ξ, ξ*) is optimal for (4.2), as ξi² + (ξi*)² is minimized, we have

    ξi = max(yi − (wᵀφ(xi) + b), 0) and
    ξi* = max(−yi + (wᵀφ(xi) + b), 0).

Thus,

    ξi² + (ξi*)² = (yi − (wᵀφ(xi) + b))².

Moreover, ξi ξi* = 0 at an optimal solution.
Instead of using square errors, we can use linear ones:

    min_{w,b,ξ,ξ*} Σ_{i=1}^l (ξi + ξi*)
    subject to −ξi* ≤ yi − (wᵀφ(xi) + b) ≤ ξi,
               ξi, ξi* ≥ 0, i = 1, …, l.
Figure 4.3: Support Vector Regression
Support vector regression (SVR) then employs two modifications to avoid overfitting:

1. A threshold ε is given so that if the ith instance satisfies

       −ε ≤ yi − (wᵀφ(xi) + b) ≤ ε,

   it is considered a correct approximation. Then ξi = ξi* = 0.

2. To smooth the function wᵀφ(x) + b, an additional term wᵀw is added to the objective function.

Thus, support vector regression solves the following optimization problem:

    min_{w,b,ξ,ξ*} (1/2)wᵀw + C Σ_{i=1}^l (ξi + ξi*)          (4.3)
    subject to (wᵀφ(xi) + b) − yi ≤ ε + ξi,
               yi − (wᵀφ(xi) + b) ≤ ε + ξi*,
               ξi, ξi* ≥ 0, i = 1, …, l.
Clearly, ξi is the upper training error (and ξi* the lower) with respect to the ε-insensitive tube |y − (wᵀφ(x) + b)| ≤ ε. This can be clearly seen from Figure 4.3. If xi is not in the tube, there is an error ξi or ξi*, which we would like to minimize in the objective function. SVR avoids underfitting and overfitting the training data by minimizing the training error C Σ_{i=1}^l (ξi + ξi*) as well as the regularization term (1/2)wᵀw. The addition of the term wᵀw can be explained in a way similar to that for classification problems. In Figure 4.4, under the condition that training data are in the ε-insensitive tube, we would like the approximating function to be as general as possible to represent the data distribution.
Figure 4.4: More general approximate function by maximizing the distance between wᵀx + b = ±ε.
The parameters which control the regression quality are the cost of error C, the width ε of the tube, and the mapping function φ. Similar to support vector classification, as w may be a huge vector variable, we solve the dual problem:

    min_{α,α*} (1/2)(α − α*)ᵀQ(α − α*) + ε Σ_{i=1}^l (αi + αi*) + Σ_{i=1}^l yi(αi − αi*)
    subject to Σ_{i=1}^l (αi − αi*) = 0, 0 ≤ αi, αi* ≤ C, i = 1, …, l,          (4.4)

where Qij = K(xi, xj) ≡ φ(xi)ᵀφ(xj). The derivation of the dual uses the same procedure as for support vector classification. The primal-dual relation shows that

    w = Σ_{i=1}^l (−αi + αi*) φ(xi),

so the approximating function is

    Σ_{i=1}^l (−αi + αi*) K(xi, x) + b.
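As a usage sketch (our own illustration, not part of the original text): scikit-learn's SVR solves this formulation, with C, ε (epsilon), and the kernel as the controlling parameters:

```python
import numpy as np
from sklearn.svm import SVR

# Noisy sine curve as a stand-in training set.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# RBF kernel; points strictly inside the epsilon-tube incur no loss,
# so only points on or outside the tube become support vectors.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(len(model.support_), "support vectors out of", len(X))
```

Widening the tube (larger epsilon) typically reduces the number of support vectors, at the cost of a coarser approximation.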
4.3 A Practical Example
4.4 Exercises
1. If you have read Chapter 6, derive the SVR dual using the same procedure.
Part II
Advanced Topics
Chapter 5
The Dual Problem
5.1 SVM and Convex Optimization
A convex set is a region like the one in the figure (omitted): any line segment connecting two points x1 and x2 of the region is still in the set.
Its formal definition is as follows
Definition 5.1.2 (Convex region) A set A is convex if for any x1,x2 ∈ A,
λx1 + (1− λ)x2 ∈ A, ∀0 ≤ λ ≤ 1.
Note that λx1 + (1− λ)x2, 0 ≤ λ ≤ 1, includes all points on the segment connecting
x1 and x2. The following set is non-convex, as part of a line segment connecting two of its points is not in the set (figure omitted).
A closely related concept is the convex function (two figures omitted: a convex function and a non-convex function).
The main difference between the above two figures is that the convex one has a
unique “local minimum,” while the non-convex one has two. By a local minimum,
we mean a point which is the smallest in its surrounding region. Formally,
Definition 5.1.3 (Convex function) A function f(x) is convex if for any x1,x2,
f(λx1 + (1− λ)x2) ≤ λf(x1) + (1− λ)f(x2), ∀0 ≤ λ ≤ 1.
That is, in the figure (omitted), any line segment connecting two points (x1, f(x1)) and (x2, f(x2)) is above the graph of f: for x = λx1 + (1−λ)x2 between x1 and x2, the point (x, λf(x1) + (1−λ)f(x2)) on the segment lies above (x, f(x)).
Example 5.1.4 The SVM objective function is convex.

We would like to have

    (1/2)(λw¹ + (1−λ)w²)ᵀ(λw¹ + (1−λ)w²) + C Σ_{i=1}^l (λξi¹ + (1−λ)ξi²)
    ≤ λ((1/2)(w¹)ᵀw¹ + C Σ_{i=1}^l ξi¹) + (1−λ)((1/2)(w²)ᵀw² + C Σ_{i=1}^l ξi²).
It is easy to cancel out the terms with ξi. Then, the difference between the right-hand and left-hand sides is

    λ(1/2)(w¹)ᵀw¹ + (1−λ)(1/2)(w²)ᵀw² − (1/2)(λw¹ + (1−λ)w²)ᵀ(λw¹ + (1−λ)w²)
    = (1/2)λ(1−λ)(w¹ − w²)ᵀ(w¹ − w²) ≥ 0.          (5.1)
The above discussion clearly shows that linear functions (i.e., lines in 2-dimensional spaces) are convex. From this, we can further define “strictly convex” functions (figures omitted: a convex function versus a strictly convex one).
Formally,

Definition 5.1.5 (Strictly convex) A function f(x) is strictly convex if for any x1 ≠ x2,

    f(λx1 + (1−λ)x2) < λf(x1) + (1−λ)f(x2), ∀ 0 < λ < 1.

Example 5.1.6 The function f(w) = (1/2)wᵀw is a strictly convex function of w.

In (5.1), if 0 < λ < 1 and w¹ ≠ w², then “≥” is replaced by “>”. Thus, we get the strict convexity of wᵀw.
For such a strictly convex function, if it is quadratic, we can easily characterize its global minimum:

Theorem 5.1.7 Consider a strictly convex quadratic function

    f(x) = (1/2)xᵀQx + pᵀx.

Then

    x̄ is the unique global minimum ⇔ Qx̄ + p = 0.
Proof.
First we claim that for a strictly convex quadratic function,

    dᵀQd > 0, ∀ d ≠ 0.

For any two vectors x and d,

    f(x + λd) = (1/2)(x + λd)ᵀQ(x + λd) + pᵀ(x + λd)
              = f(x) + λ(Qx + p)ᵀd + (1/2)λ²dᵀQd.

On the other hand, by strict convexity, for any d ≠ 0 and 0 < λ < 1,

    f(x + λd) = f((1−λ)x + λ(x + d))
              < (1−λ)f(x) + λf(x + d)
              = (1−λ)f(x) + λf(x) + λ(Qx + p)ᵀd + (λ/2)dᵀQd.

Thus,

    (λ²/2)dᵀQd < (λ/2)dᵀQd,

and hence, since 0 < λ < 1,

    dᵀQd > 0.
Now we are ready to prove the theorem.

“⇐” Suppose Qx̄ + p = 0. For any x ≠ x̄, let d ≡ x − x̄. Then

    f(x) = (1/2)(x̄ + d)ᵀQ(x̄ + d) + pᵀ(x̄ + d)
         = f(x̄) + (Qx̄ + p)ᵀd + (1/2)dᵀQd
         = f(x̄) + (1/2)dᵀQd > f(x̄).

Thus, x̄ is the unique global minimum.

“⇒” If Qx̄ + p ≠ 0, we consider d = −t(Qx̄ + p), where t > 0. Then

    f(x̄ + d) = f(x̄) + (Qx̄ + p)ᵀd + (1/2)dᵀQd
             = f(x̄) − t(Qx̄ + p)ᵀ(Qx̄ + p) + (1/2)t²(Qx̄ + p)ᵀQ(Qx̄ + p).

If (Qx̄ + p)ᵀQ(Qx̄ + p) > 0 and

    t < 2(Qx̄ + p)ᵀ(Qx̄ + p) / ((Qx̄ + p)ᵀQ(Qx̄ + p)),
then

    f(x̄ + d) < f(x̄),

and x̄ is not a global minimum, a contradiction. On the other hand, if (Qx̄ + p)ᵀQ(Qx̄ + p) = 0, then for any t > 0, f(x̄ + d) < f(x̄), which also causes a contradiction. □
Note that if Q is positive definite, Qx + p = 0 has the unique solution x̄ = −Q⁻¹p. Thus, f(x) has a unique global minimum.
5.2 Lagrangian Duality
We begin by considering the simpler problem (2.1). Its Lagrangian dual is defined as

    max_{α≥0} (min_{w,b} L(w, b, α)),          (5.2)

where

    L(w, b, α) = (1/2)‖w‖² − Σ_{i=1}^l αi (yi(wᵀφ(xi) + b) − 1).          (5.3)
The Lagrangian L has to be minimized with respect to the primal variables w and
b and maximized with respect to the dual variables αi. For a nonlinear problem like
(2.1), called the primal problem, there are several closely related problems of which
the Lagrangian dual is an important one. Under certain conditions, the primal and
dual problems have the same optimal objective values. Therefore, we can instead
solve the dual which may be an easier problem than the primal. In particular, when
data are mapped into a higher dimensional space, solving the dual may be the only
way to train SVM.
Assume (w̄, b̄) is an optimal solution of the primal with optimal objective value γ = (1/2)‖w̄‖².* Then no (w, b) satisfies

    (1/2)‖w‖² < γ and yi(wᵀφ(xi) + b) ≥ 1, i = 1, …, l.          (5.4)

With (5.4), there is ᾱ ≥ 0 such that for all w, b,

    (1/2)‖w‖² − γ − Σ_{i=1}^l ᾱi (yi(wᵀφ(xi) + b) − 1) ≥ 0.          (5.5)
∗ We do not give a rigorous treatment of the existence of an optimal solution here. In fact, if the primal's feasible set is non-empty, an optimal (w̄, b̄) exists. Details are in (Lin, 2001a).
This result is quite intuitive, but a detailed proof is complicated and is deferred to Chapter 5.3.
Therefore, (5.5) implies

    max_{α≥0} min_{w,b} L(w, b, α) ≥ min_{w,b} L(w, b, ᾱ) ≥ γ.          (5.6)

On the other hand, for any α,

    min_{w,b} L(w, b, α) ≤ L(w̄, b̄, α),

so

    max_{α≥0} min_{w,b} L(w, b, α) ≤ max_{α≥0} L(w̄, b̄, α) = (1/2)‖w̄‖² = γ.          (5.7)

Note that to have the last equality we use the property that (w̄, b̄) is feasible for the primal problem (i.e., yi(w̄ᵀφ(xi) + b̄) − 1 ≥ 0).

Therefore, with (5.6), the inequality in (5.7) becomes an equality. This property is called strong duality: the primal and dual have the same optimal objective value. Thus, ᾱ is an optimal solution of the dual problem. In addition, putting (w̄, b̄) into (5.5), with ᾱi ≥ 0 and yi(w̄ᵀφ(xi) + b̄) − 1 ≥ 0, we obtain

    ᾱi [yi(w̄ᵀφ(xi) + b̄) − 1] = 0, i = 1, …, l,          (5.8)

which is usually called the complementarity condition.
To simplify the dual, note that when α is fixed,

    min_{w,b} L(w, b, α) =
        −∞                                                 if Σ_{i=1}^l αi yi ≠ 0,
        min_w (1/2)wᵀw − Σ_{i=1}^l αi (yi wᵀφ(xi) − 1)     if Σ_{i=1}^l αi yi = 0.          (5.9)

Note that if Σ_{i=1}^l αi yi ≠ 0, we can decrease the term −b Σ_{i=1}^l αi yi of L(w, b, α) as much as we want. For the case Σ_{i=1}^l αi yi = 0, following Example 5.1.6,

    (1/2)wᵀw − Σ_{i=1}^l αi (yi wᵀφ(xi) − 1)

is a strictly convex function of w. Theorem 5.1.7 then implies that the unique optimum happens when

    ∂L(w, b, α)/∂wi = 0, i = 1, …, n.

Thus,

    w = Σ_{i=1}^l αi yi φ(xi).          (5.10)
Therefore, by substituting (5.10) into (5.2), the dual problem can be written as

    max_{α≥0}
        Σ_{i=1}^l αi − (1/2) Σ_{i=1}^l Σ_{j=1}^l αi αj yi yj φ(xi)ᵀφ(xj)    if Σ_{i=1}^l αi yi = 0,
        −∞                                                                   if Σ_{i=1}^l αi yi ≠ 0.          (5.11)

As −∞ is certainly not the maximal objective value of the dual, the dual optimum does not occur where Σ_{i=1}^l αi yi ≠ 0. Therefore, the dual problem simplifies to finding optimal αi of

    max_{α∈R^l} Σ_{i=1}^l αi − (1/2) Σ_{i=1}^l Σ_{j=1}^l αi αj yi yj φ(xi)ᵀφ(xj)
    subject to αi ≥ 0, i = 1, …, l, and Σ_{i=1}^l αi yi = 0.

This is the dual SVM problem that we usually refer to. Note that (5.8), Σ_{i=1}^l αi yi = 0, αi ≥ 0 ∀i, and (5.10) are called the Karush-Kuhn-Tucker (KKT) optimality conditions of the primal problem. That is, (w, b) is an optimal solution of the primal if and only if it is feasible and there is α satisfying the KKT conditions.
As practically we solve the dual, in the following we give a formal discussion about
the dual problem and its KKT conditions.
Theorem 5.2.8 α is an optimal solution of the dual if and only if α is feasible and
there is b so that
yi(wT φ(xi) + b)− 1 ≥ 0, i = 1, . . . , l, and (5.12)
αi[yi(wTφ(xi) + b)− 1] = 0, i = 1, . . . , l, (5.13)
where w is defined as in (5.10) using α.
Proof.
“⇒”
Since α is an optimal solution of the dual,
γ = max_{α′≥0} min_{w,b} L(w, b, α′) = min_{w,b} L(w, b, α).
From the discussion in the primal-dual relation,
γ = min_{w,b} L(w, b, α) ≤ L(w∗, b∗, α) ≤ γ,      (5.14)
where (w∗, b∗) is an optimal solution of the primal. Since y^T α = 0, using Theorem
5.1.7, L(w, b, α) has a global minimum at w = Σ_{i=1}^l αi yi φ(xi). Since L(w∗, b∗, α) = γ
from (5.14), w∗ is essentially w. Then, using this b∗ as b, we have the conditions
(5.12) and (5.13).
“⇐”
(5.12) implies (w, b) is a feasible solution of the primal. This and (5.13) imply
L(w, b, α) ≥ γ. From the discussion around (5.10) and the assumption on (w, b),
Σ_{i=1}^l αi − (1/2) Σ_{i=1}^l Σ_{j=1}^l αi αj yi yj φ(xi)^T φ(xj) = min_{w,b} L(w, b, α) = L(w, b, α) ≥ γ.      (5.15)
Since γ is the optimal objective value of the dual from the earlier discussion and α is
feasible for the dual, the “≥” in (5.15) must be an “=”, and α is an optimal solution of the
dual. □
Using a similar derivation, the dual of (2.5) is (2.7). Now the Lagrangian dual is
max_{α≥0, β≥0} ( min_{w,b,ξ} L(w, b, ξ, α, β) ),      (5.16)
where
L(w, b, ξ, α, β) = (1/2)‖w‖² + C Σ_{i=1}^l ξi − Σ_{i=1}^l αi ( yi(w^T φ(xi) + b) − 1 + ξi ) − Σ_{i=1}^l βi ξi.      (5.17)
From this formula, we understand that for each inequality constraint, there is a
corresponding nonnegative dual variable. If the constraint is an equality, then the
associated dual variable is unrestricted (i.e., it could be negative).
Details of deriving (2.7) are left in Exercise 1.
5.3 Proof of (5.5)
The derivation here follows from Lemma 6.2.3 of (Bazaraa et al., 1993). To simplify
the proof, we make an assumption that {(w, b) | yi(wT φ(xi) + b) > 1, i = 1, . . . , l} is
nonempty. (5.5) remains valid without this assumption.
As there is no (w, b) which satisfies (5.4), (0, 0) is not in the following set
A = {(p, q) | p > (1/2)‖w‖² − γ, qi ≥ −(yi(w^T φ(xi) + b) − 1), i = 1, . . . , l, for some (w, b)}.
A is convex, as for any (p1, q1), (p2, q2) ∈ A with associated (w1, b1) and (w2, b2),
λp1 + (1 − λ)p2 > (1/2)λ‖w1‖² + (1/2)(1 − λ)‖w2‖² − γ ≥ (1/2)‖λw1 + (1 − λ)w2‖² − γ
and
λ(q1)i + (1 − λ)(q2)i ≥ −( yi( (λw1 + (1 − λ)w2)^T φ(xi) + (λb1 + (1 − λ)b2) ) − 1 ).
That is,
(λp1 + (1− λ)p2, λq1 + (1− λ)q2) ∈ A,
for all 0 ≤ λ ≤ 1. If the set A is like the following (assume q has only one component),

[Figure: the convex set A in the (p, q) plane, lying on one side of the line p u0 + q u = 0.]

there are (u0, u) ≠ (0, 0) such that
u0 p + u^T q ≥ 0, ∀(p, q) ∈ Cl(A).      (5.18)
We use a dotted boundary as points on the boundary may not be in the set A. Note
that for points on the boundary, u0 p + u^T q ≥ 0 still holds. So in (5.18), Cl(A) means
points in A or on its boundary†.
Now we have (0, 0) ∉ A, so an extreme situation may be as the following:

[Figure: the set A touching the line p u0 + q u = 0, which passes through the origin.]

(5.18) still holds, and clearly the key is that A is a convex set.
Let us look at (5.18) in more detail. As p ∈ A can be arbitrarily large, u0 ≥ 0.
Otherwise, a negative u0 with large p and some fixed q would violate (5.18). Similarly,
u ≥ 0. As (p, q) = [(1/2)w^T w − γ, −(y1(w^T φ(x1) + b) − 1), . . . , −(yl(w^T φ(xl) + b) − 1)] ∈ Cl(A),
u0((1/2)w^T w − γ) − Σ_{i=1}^l ui (yi(w^T φ(xi) + b) − 1) ≥ 0, ∀(w, b).      (5.19)
† (5.18) is indeed not trivial, but we omit a formal proof here.
If u0 = 0, the assumption that there are (w, b) such that yi(wT φ(xi) + b)− 1 > 0, i =
1, . . . , l implies ui = 0, i = 1, . . . , l. This contradicts the fact that (u0,u) 6= (0, 0).
Thus, u0 > 0 and (5.19) can be written as
((1/2)w^T w − γ) − Σ_{i=1}^l (ui/u0) (yi(w^T φ(xi) + b) − 1) ≥ 0, ∀(w, b),
which is exactly (5.5).
5.4 Notes
The duality theory holds only for primal problems satisfying certain convexity condi-
tions and the so-called “constraint qualification.” Here we have linear inequalities
as constraints, so the constraint qualification is satisfied. In Exercise 6, we show an
example for which the duality results do not hold.
5.5 Exercises
1. Show that the dual of (2.5) is (2.7).
2. Consider the following primal optimization problem
min_{w,b,ξ}   (1/2)w^T w + C Σ_{i=1}^l ξi²
subject to    yi(w^T φ(xi) + b) ≥ 1 − ξi,      (5.20)
              ξi ≥ 0, i = 1, . . . , l.
(a) Prove that the constraint ξi ≥ 0 is not needed.
(b) Derive its dual by the Lagrangian dual procedure.
(c) Show that by transforming (5.20) to
min_{w,b}    (1/2)w^T w
subject to   yi(w^T xi + b) ≥ 1, i = 1, . . . , l,
we can directly obtain the dual using results in (5.2).
3. Consider the following primal optimization problem
min_{w,b,ξ}   (1/2)w^T w + (1/2)b² + C Σ_{i=1}^l ξi
subject to    yi(w^T φ(xi) + b) ≥ 1 − ξi,
              ξi ≥ 0, i = 1, . . . , l.
Derive its dual by the Lagrangian dual procedure.
4. Consider the following primal optimization problem
min_{w,b,ξ}   (1/2)w^T w + C+ Σ_{i:yi=1} ξi + C− Σ_{i:yi=−1} ξi
subject to    yi(w^T φ(xi) + b) ≥ 1 − ξi,
              ξi ≥ 0, i = 1, . . . , l.
Derive its dual by the Lagrangian dual procedure.
5. In Exercise 2.2, we directly solve the primal/dual problem to obtain the sepa-
rating hyperplane. Verify that the primal/dual solutions satisfy the KKT conditions.
6. Derive the Lagrangian dual of the following two problems and show that the
objective values of primal and dual are not equal.
(a)
min_{x1,x2}   x1
subject to    x1³ ≥ x2,
              x2 ≥ 0.
(b)
min_{x1,x2}   x1
subject to    x1 ≥ x2³,
              x2 ≥ 0.
The first problem comes from (Fletcher, 1987) and its feasible region is not
convex. The second problem, however, has a convex feasible region. If the con-
straints are not linear, even for convex programming, there are situations
where the dual does not exist. By convex programming we mean that the ob-
jective function is convex and the feasible region is a convex set. To have
the primal-dual relationship mentioned in this chapter, existing theorems
require that the constraints, written as gi(x) ≤ 0, ∀i, satisfy (1) gi are convex
functions, and (2) certain constraint qualifications. For problems with only
linear inequalities as constraints (e.g., SVM), both conditions hold.
7. If the error term C Σ_{i=1}^l ξi of the SVM formulation (2.5) becomes C Σ_{i=1}^l e(ξi),
where e is a function defined as
e(ξ) ≡ { (0.5/γ)ξ²   if ξ ≤ γ,
       { ξ − 0.5γ    if ξ ≥ γ.
Here γ is a positive constant. Derive its dual.
This is a difficult exercise.
Chapter 6
Solving the Quadratic Problems
6.1 Solving Optimization Problems
We consider the following optimization problem:
min_α        (1/2)α^T Qα − e^T α
subject to   y^T α = ∆,      (6.1)
             0 ≤ αt ≤ C, t = 1, . . . , l,
where yt = ±1, t = 1, . . . , l. If ∆ = 0, (6.1) is the SVM dual problem (2.7).
Optimization is a well-developed area. For differentiable objective functions, the
solution process usually involves the first or second derivative. For example, in
Theorem 5.1.7, if f(x) = (1/2)x^T Qx − p^T x and Q is positive definite, the optimal
solution is obtained by solving the linear system Qx − p = 0, where Qx − p is the
first derivative of f(x). Many optimization applications involve a sparse Q (i.e.,
many of Q's components are zero), so the linear system can be solved easily even if
the number of variables is huge.
Unfortunately, such matrix operations are not possible here, as Q is in general a
fully dense matrix (i.e., all of Q's components are non-zero). For example, if the RBF
kernel is used,
Qij = yi yj e^{−γ‖xi−xj‖²} > 0.
Thus, if there are 30,000 training instances, Q is a 30,000 × 30,000 dense matrix with
30,000² = 9 × 10^8 elements. The matrix Q cannot fit into the computer memory,
so operations on Q cannot be done directly. In this chapter, we consider the decomposition
method to conquer this difficulty.
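As a quick sanity check of this estimate, the memory needed to store Q in double precision can be computed directly (a hypothetical back-of-the-envelope sketch, assuming 8-byte entries):

```python
l = 30_000                              # number of training instances
entries = l * l                         # Q is l-by-l and fully dense
bytes_needed = entries * 8              # 8 bytes per double-precision entry
print(entries)                          # 900000000, i.e. 9 x 10^8
print(round(bytes_needed / 2**30, 1))   # about 6.7 GiB
```

Even a machine with several gigabytes of memory cannot afford to keep such a matrix around, which is exactly why the decomposition method below only ever materializes a few columns of Q at a time.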
6.2 The Decomposition Method
This method is an iterative process; in each iteration only a few variables are up-
dated. We illustrate this method by a simple example:

[Figure: four training points, x1 = (0, 0) and x4 = (1, 1) in one class, and x2 = (1, 0) and x3 = (0, 1) in the other.]
Using the linear kernel and C = 1, the dual optimization problem is

min_α   (1/2) [α1 α2 α3 α4] [ 0  0  0  0
                              0  1  0 −1
                              0  0  1 −1
                              0 −1 −1  2 ] [α1; α2; α3; α4] − [1 1 1 1] [α1; α2; α3; α4]
subject to   [1 −1 −1 1] α = 0,
             0 ≤ α1, . . . , α4 ≤ 1.
Now αi = 0, i = 1, . . . , 4 satisfies all constraints, so we use it as the initial solution.
Then the decomposition method proceeds as follows:
Iteration 1:
α3 = α4 = 0 are fixed and we minimize the function over α1 and α2:

min_{α1,α2}   (1/2) [α1 α2 0 0] [ 0  0  0  0
                                  0  1  0 −1
                                  0  0  1 −1
                                  0 −1 −1  2 ] [α1; α2; 0; 0] − [1 1 1 1] [α1; α2; 0; 0]
            = (1/2) [α1 α2] [ 0 0
                              0 1 ] [α1; α2] − [1 1] [α1; α2]
subject to    α1 − α2 = α3 − α4 = 0,
              0 ≤ α1, α2 ≤ 1.

By substituting α1 = α2 into the objective function, we get
min_{α2}     (1/2)α2² − 2α2
subject to   0 ≤ α2 ≤ 1.
Thus, α2 = α1 = 1.
Iteration 2:
α1 = α2 = 1 are fixed and we minimize the function over α3 and α4:

min_{α3,α4}   (1/2) [1 1 α3 α4] [ 0  0  0  0
                                  0  1  0 −1
                                  0  0  1 −1
                                  0 −1 −1  2 ] [1; 1; α3; α4] − [1 1 1 1] [1; 1; α3; α4]
            = (1/2) [α3 α4] [  1 −1
                              −1  2 ] [α3; α4] − α4 + 1 − [1 1] [α3; α4] − 2
subject to    −α3 + α4 = −α1 + α2 = −1 + 1 = 0,
              0 ≤ α3, α4 ≤ 1.

By substituting α3 = α4 into the objective function, we get
min_{α3}     (1/2)α3² − 3α3 − 1
subject to   0 ≤ α3 ≤ 1.
Thus, α3 = α4 = 1.
Later we will show that α1 = · · · = α4 = 1 is already an optimal solution, so the
procedure stops.
Calling the indices of the variables to be minimized the working set of that
iteration, a general description of decomposition methods is as follows:
Algorithm 6.2.9 (Decomposition method)
1. Given a number q ≪ l as the size of the working set. Find α¹ as the initial
solution. Set k = 1.
2. If αᵏ is an optimal solution of (2.7), stop. Otherwise, find a working set B ⊂
{1, . . . , l} of size q. Define N ≡ {1, . . . , l}\B, and let αᵏ_B and αᵏ_N be the
sub-vectors of αᵏ corresponding to B and N, respectively.
3. Solve the following sub-problem with the variable α_B:

min_{α_B}   (1/2) [α_B^T (αᵏ_N)^T] [ Q_BB Q_BN
                                     Q_NB Q_NN ] [α_B; αᵏ_N] − [e_B^T e_N^T] [α_B; αᵏ_N]
          = (1/2) α_B^T Q_BB α_B + (−e_B + Q_BN αᵏ_N)^T α_B + constant
subject to   0 ≤ (α_B)_t ≤ C, t = 1, . . . , q,      (6.2)
             y_B^T α_B = ∆ − y_N^T αᵏ_N,

where [ Q_BB Q_BN; Q_NB Q_NN ] is a permutation of the matrix Q.
4. Set α^{k+1}_B to be the optimal solution of (6.2) and α^{k+1}_N ≡ αᵏ_N. Set k ← k + 1
and go to Step 2.
In each iteration, the indices {1, . . . , l} of the training set are separated into two sets B
and N. The vector α_N is fixed, so the objective value becomes (1/2)α_B^T Q_BB α_B + (−e_B +
Q_BN α_N)^T α_B + (1/2)α_N^T Q_NN α_N − e_N^T α_N. Then a sub-problem in the variable α_B, i.e.
(6.2), is solved. Note that B is updated in each iteration. To simplify the notation,
we simply use B instead of Bᵏ.
Clearly, in each iteration of Algorithm 6.2.9, only Q_BB and Q_BN of Q are needed.
If ql memory spaces can be allocated to store them, we calculate Q_BB and Q_BN when
needed. Thus, Q does not have to be fully stored, and this fact resolves the memory
difficulty of traditional optimization procedures. Of course, the check whether αᵏ is optimal
in Step 2 may still cause a memory problem. This issue will be discussed in the
next section.
If the working set B is restricted to only two elements, the method is called
“Sequential Minimal Optimization” (SMO).
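The whole procedure fits in a few dozen lines. The following is a minimal SMO-style sketch, not LIBSVM itself: it assumes NumPy, ∆ = 0, the maximal-violating-pair selection and analytic two-variable solution developed in the next two sections, and a small guard against a singular two-variable sub-problem. It reproduces the four-point example above.

```python
import numpy as np

def solve_dual(Q, y, C, eps=1e-6, max_iter=1000):
    """Decomposition (SMO) sketch for min 0.5*a'Qa - e'a
    s.t. y'a = 0, 0 <= a_t <= C (the case Delta = 0)."""
    l = len(y)
    alpha = np.zeros(l)                  # feasible initial solution
    grad = -np.ones(l)                   # grad f(a) = Qa - e = -e at a = 0
    for _ in range(max_iter):
        # working set selection: the maximal violating pair (i, j)
        up = [t for t in range(l)
              if (alpha[t] < C and y[t] == 1) or (alpha[t] > 0 and y[t] == -1)]
        low = [t for t in range(l)
               if (alpha[t] < C and y[t] == -1) or (alpha[t] > 0 and y[t] == 1)]
        i = max(up, key=lambda t: -y[t] * grad[t])
        j = min(low, key=lambda t: -y[t] * grad[t])
        if -y[i] * grad[i] <= -y[j] * grad[j] + eps:    # stopping condition
            break
        # analytic solution of the two-variable sub-problem, then clipping;
        # max(..., 1e-12) guards against a singular 2-by-2 Q_BB
        if y[i] != y[j]:
            quad = max(Q[i, i] + Q[j, j] + 2 * Q[i, j], 1e-12)
            dj = (-grad[i] - grad[j]) / quad
            diff = alpha[i] - alpha[j]                  # fixed along the constraint
            aj = min(max(alpha[j] + dj, max(0.0, -diff)), min(C, C - diff))
            ai = aj + diff
        else:
            quad = max(Q[i, i] + Q[j, j] - 2 * Q[i, j], 1e-12)
            dj = (grad[i] - grad[j]) / quad
            s = alpha[i] + alpha[j]                     # fixed along the constraint
            aj = min(max(alpha[j] + dj, max(0.0, s - C)), min(C, s))
            ai = s - aj
        # gradient update touches only two columns of Q
        grad += Q[:, i] * (ai - alpha[i]) + Q[:, j] * (aj - alpha[j])
        alpha[i], alpha[j] = ai, aj
    return alpha

Q = np.array([[0, 0, 0, 0], [0, 1, 0, -1],
              [0, 0, 1, -1], [0, -1, -1, 2]], dtype=float)
y = np.array([1, -1, -1, 1])
print(solve_dual(Q, y, C=1.0))           # [1. 1. 1. 1.]
```

For real data, Q would of course be computed from the kernel on demand rather than stored as a full matrix.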
6.3 Working Set Selection and Stopping Criteria
An immediate question about the simple example in Section 6.2 is how we select
{α1, α2} in the first iteration and then {α3, α4} in the second. We call this issue
the “working set selection.” Here, we discuss a method, which, in a sense, selects
variables violating the optimality condition.
Recall that in Theorem 5.2.8 we stated that α is a dual optimal solution if and only if it
satisfies certain conditions. At that time we considered the simpler formulation without
the penalty term C Σ_{t=1}^l ξt in the primal. Now for (6.1), Theorem 5.2.8 becomes
α is an optimal solution of (6.1) if and only if α is feasible and there are b, ξ ≥ 0
such that
yt(wTφ(xt) + b)− 1 + ξt ≥ 0, (6.3)
αt[yt(wT φ(xt) + b)− 1 + ξt] = 0, and (6.4)
(C − αt)ξt = 0, t = 1, . . . , l, (6.5)
where w = Σ_{t=1}^l αt yt φ(xt). Moreover, (w, b, ξ) is optimal for the primal.
(6.3)–(6.5) do not have to involve w, as we can always write
yt(w^T φ(xt) + b) = (Qα)t + b yt.
6.3. WORKING SET SELECTION AND STOPPING CRITERIA 69
To study the working set selection, we rewrite (6.3)–(6.5) as equivalent conditions:
There are b, ξ ≥ 0 such that
if αt = 0       then ξt = 0 and (Qα)t + byt − 1 ≥ 0,      (6.6)
if 0 < αt < C   then ξt = 0 and (Qα)t + byt − 1 = 0,      (6.7)
if αt = C       then ξt ≥ 0 and (Qα)t + byt − 1 = −ξt ≤ 0.      (6.8)
Thus, depending on (Qα)t + byt − 1, we know how to choose ξ, and there is no
need to write it. By separating (6.7) into (6.6) and (6.8), we have
There is b such that
if αt > 0 then (Qα)t + byt − 1 ≤ 0,
if αt < C then (Qα)t + byt − 1 ≥ 0.
Using the property yt = ±1, t = 1, . . . , l, these are equivalent to
There is b such that
if αt > 0 and yt = 1    then (Qα)t − 1 ≤ −b,
if αt > 0 and yt = −1   then (Qα)t − 1 ≤ b,
if αt < C and yt = 1    then (Qα)t − 1 ≥ −b,
if αt < C and yt = −1   then (Qα)t − 1 ≥ b.
Finally, we are able to remove b:

Theorem 6.3.10 α is an optimal solution of (6.1) if and only if α is feasible and
max_{t∈Iup(α)} −yt∇f(α)t ≤ min_{t∈Ilow(α)} −yt∇f(α)t,      (6.9)
where
f(α) ≡ (1/2)α^T Qα − e^T α,   ∇f(α) ≡ Qα − e,
and
Iup(α)  ≡ {t | αt < C, yt = 1 or αt > 0, yt = −1},
Ilow(α) ≡ {t | αt < C, yt = −1 or αt > 0, yt = 1}.
An illustration of the above results is in the following figure:

[Figure 6.1: Illustration of the dual optimality condition, showing the values −yt∇f(α)t for t ∈ Iup(α) and t ∈ Ilow(α); in panels (a) and (b) α is optimal, while in panel (c) α is not optimal.]
If q, an even number, is the size of the working set B and αᵏ is the current iterate,
we can select q/2 indices from elements in Iup(αᵏ) and the other q/2 indices from
Ilow(αᵏ) so that
−y_{i1}∇f(αᵏ)_{i1} ≥ · · · ≥ −y_{i_{q/2}}∇f(αᵏ)_{i_{q/2}} > −y_{j_{q/2}}∇f(αᵏ)_{j_{q/2}} ≥ · · · ≥ −y_{j1}∇f(αᵏ)_{j1}.      (6.10)
Therefore, essentially the q/2 most violating pairs are put into the working set, and we
call (i1, j1) a “maximal violating pair.”
Taking the example used earlier, initially α1 = · · · = α4 = 0, so
Iup(α) = {1, 4} and Ilow(α) = {2, 3}.
Then,
1 = −y1∇f(α)1 = −y4∇f(α)4 = max_{t∈Iup(α)} −yt∇f(α)t
  > min_{t∈Ilow(α)} −yt∇f(α)t = −y2∇f(α)2 = −y3∇f(α)3 = −1.
Thus, if we would like to choose two variables for minimization, one should be from
{1, 4} and the other should be from {2, 3}. After the first iteration, α = [1, 1, 0, 0]^T, so
∇f(α) = Qα − e = [ 0  0  0  0
                   0  1  0 −1
                   0  0  1 −1
                   0 −1 −1  2 ] [1; 1; 0; 0] − [1; 1; 1; 1] = [−1; 0; −1; −2].
Then −yt∇f(α)t, t = 1, . . . , 4, are
[1 0 −1 2]^T,
and
Iup(α) = {2, 4}, Ilow(α) = {1, 3}.
As
−y4∇f(α)4 = max_{t∈Iup(α)} −yt∇f(α)t > min_{t∈Ilow(α)} −yt∇f(α)t = −y3∇f(α)3,
we select {3, 4} as the working set. After the second iteration, α = [1, 1, 1, 1]^T, so
∇f(α) = Qα − e = [ 0  0  0  0
                   0  1  0 −1
                   0  0  1 −1
                   0 −1 −1  2 ] [1; 1; 1; 1] − [1; 1; 1; 1] = [−1; −1; −1; −1].
Then −yt∇f(α)t, t = 1, . . . , 4, are
[1 −1 −1 1]^T,
and
Iup(α) = {2, 3}, Ilow(α) = {1, 4}.
As
1 = min_{t∈Ilow(α)} −yt∇f(α)t ≥ max_{t∈Iup(α)} −yt∇f(α)t = −1,
from Theorem 6.3.10, α is an optimal solution. Note that during the iterative proce-
dure, we always maintain feasibility.
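The selection rule is mechanical enough to code directly. A small sketch (assuming NumPy and 0-based indices, so index t here corresponds to α_{t+1} in the text) that reproduces the choice of working set {3, 4} after the first iteration:

```python
import numpy as np

def violating_pair(alpha, grad, y, C):
    """Return the maximal violating pair (i, j) of Section 6.3."""
    l = len(y)
    up = [t for t in range(l)
          if (alpha[t] < C and y[t] == 1) or (alpha[t] > 0 and y[t] == -1)]
    low = [t for t in range(l)
           if (alpha[t] < C and y[t] == -1) or (alpha[t] > 0 and y[t] == 1)]
    i = max(up, key=lambda t: -y[t] * grad[t])   # largest -y_t grad_t in I_up
    j = min(low, key=lambda t: -y[t] * grad[t])  # smallest -y_t grad_t in I_low
    return i, j

Q = np.array([[0, 0, 0, 0], [0, 1, 0, -1],
              [0, 0, 1, -1], [0, -1, -1, 2]], dtype=float)
y = np.array([1, -1, -1, 1])
alpha = np.array([1.0, 1.0, 0.0, 0.0])            # after the first iteration
grad = Q @ alpha - 1                              # = [-1, 0, -1, -2]
print(violating_pair(alpha, grad, y, C=1.0))      # (3, 2): the pair {4, 3} in the text
```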
For the sub-problem (6.2), the optimality condition is
max_{t∈Iup(α)∩B} −yt∇f(α)t ≤ min_{t∈Ilow(α)∩B} −yt∇f(α)t.      (6.11)
Under our working set selection, in Algorithm 6.2.9, αᵏ_B does not satisfy (6.11). In
other words, αᵏ_B is not optimal for the sub-problem either, so we are guaranteed to
find a better α_B such that
f(α^{k+1}) < f(αᵏ).
As the decomposition method may take infinitely many iterations before (6.9) is satisfied,
practically we replace the stopping condition in Step 2 of Algorithm 6.2.9 with
max_{t∈Iup(α)} −yt∇f(α)t ≤ min_{t∈Ilow(α)} −yt∇f(α)t + ǫ,      (6.12)
where ǫ, the stopping tolerance, is a small positive number.
To use (6.12), ∇f(α) must be maintained throughout all iterations. A memory
problem may occur, as ∇f(α) = Qα − e involves the matrix Q. This issue is solved
by the following tricks:
1. α¹ = 0 satisfies all constraints of (2.7). Thus, the initial ∇f(α¹) = −e is easily
obtained.
2. We can easily update ∇f(α) using only Q_BB and Q_BN:
∇f(α^{k+1}) = ∇f(αᵏ) + Q(α^{k+1} − αᵏ) = ∇f(αᵏ) + Q_{:,B}(α^{k+1} − αᵏ)_B,
where
Q_{:,B} = [ Q_BB
            Q_NB ].
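A small numerical check of trick 2 (the data here are hypothetical; the point is that the update touches only the columns of Q indexed by B):

```python
import numpy as np

rng = np.random.default_rng(0)
l = 6
M = rng.standard_normal((l, l))
Q = M @ M.T                           # any symmetric Q serves for the identity
alpha_k = rng.uniform(0.0, 1.0, l)
grad_k = Q @ alpha_k - 1              # grad f(a) = Qa - e

B = [1, 4]                            # working set of this iteration
alpha_next = alpha_k.copy()
alpha_next[B] = [0.3, 0.7]            # whatever the sub-problem returned

# grad f(a_{k+1}) = grad f(a_k) + Q[:, B] @ (a_{k+1} - a_k)_B
grad_next = grad_k + Q[:, B] @ (alpha_next[B] - alpha_k[B])
assert np.allclose(grad_next, Q @ alpha_next - 1)
```

The full product Qα is never formed after the first iteration; only two kernel columns per iteration are needed.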
6.4 Analytical Solutions
The remaining problem in implementing the decomposition method is how to solve
the sub-problem (6.2). Here, we consider the simple situation q = 2. That is, in
each iteration only two elements are selected as the working set. Thus,
i ≡ arg max_{t∈Iup(α)} −yt∇f(α)t,   j ≡ arg min_{t∈Ilow(α)} −yt∇f(α)t.
Then, (6.2) is a simple problem with only two variables:
min_{αi,αj}   (1/2) [αi αj] [ Qii Qij
                              Qji Qjj ] [αi; αj] + (Q_{i,N} α_N − 1)αi + (Q_{j,N} α_N − 1)αj      (6.13)
subject to    yi αi + yj αj = ∆ − y_N^T αᵏ_N,      (6.14)
              0 ≤ αi, αj ≤ C.
In this section, we discuss simple ways to solve (6.13). To begin, we consider the
case yi ≠ yj. Instead of the variables αi and αj, we introduce di and dj so that
αi = αᵏi + di,   αj = αᵏj + dj,      (6.15)
where di = dj follows from (6.14). Thus, without considering 0 ≤ αi, αj ≤ C, (6.13) becomes
(1/2) [αᵏi + di  αᵏj + dj] [ Qii Qij
                             Qji Qjj ] [αᵏi + di; αᵏj + dj] + (Q_{i,N} αᵏ_N − 1)(αᵏi + di) +
(Q_{j,N} αᵏ_N − 1)(αᵏj + dj)
= (1/2)(Qii + Qjj + 2Qij) dj² + (∇f(αᵏ)i + ∇f(αᵏ)j) dj + constant.      (6.16)
The minimum of (6.16) happens at
dj = (−∇f(αᵏ)i − ∇f(αᵏ)j) / (Qii + Qjj + 2Qij).      (6.17)
Similarly, if yi = yj, (6.15) becomes
αj = αᵏj + dj,   αi = αᵏi − dj,
and (6.17) becomes
dj = (∇f(α)i − ∇f(α)j) / (Qii + Qjj − 2Qij).
Thus, we can write
α^{k+1}j = αᵏj + { (−∇f(αᵏ)i − ∇f(αᵏ)j) / (Qii + Qjj + 2Qij)   if yi ≠ yj,
                 { (∇f(αᵏ)i − ∇f(αᵏ)j) / (Qii + Qjj − 2Qij)    if yi = yj.      (6.18)
Due to the constraints 0 ≤ αi, αj ≤ C, α^{k+1}j or α^{k+1}i may be outside the allowed
region. In this case, the value of (6.18) is clipped into the feasible region. For example,
if yi ≠ yj, the line yi αi + yj αj = ∆ − y_N^T αᵏ_N can be like one of the two situations in
Figure 6.2.
[Figure 6.2: Two situations of the line yi αi + yj αj = ∆ − y_N^T αᵏ_N in the (αi, αj) plane when yi ≠ yj.]
We can explicitly check the situation where α^{k+1}i or α^{k+1}j is not in [0, C]:

δ ≡ αᵏi − αᵏj
If δ > 0
    If α^{k+1}j < 0
        α^{k+1}j ← 0
        α^{k+1}i ← δ
    If α^{k+1}i > C
        α^{k+1}i ← C
        α^{k+1}j ← C − δ
else
    If α^{k+1}i < 0
        α^{k+1}i ← 0
        α^{k+1}j ← −δ
    If α^{k+1}j > C
        α^{k+1}j ← C
        α^{k+1}i ← C + δ

The situation for yi = yj is similar.
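The rules above translate directly into code. A sketch for the yi ≠ yj case, checked on the first iteration of the Section 6.2 example (where the unconstrained step gives αi = αj = 2 with C = 1):

```python
def clipped_update(alpha_i, alpha_j, new_i, new_j, C):
    """Clip the unconstrained values from (6.18) back into [0, C],
    keeping alpha_i - alpha_j fixed (the case y_i != y_j)."""
    delta = alpha_i - alpha_j             # invariant along the constraint line
    if delta > 0:
        if new_j < 0:
            new_j, new_i = 0.0, delta
        if new_i > C:
            new_i, new_j = C, C - delta
    else:
        if new_i < 0:
            new_i, new_j = 0.0, -delta
        if new_j > C:
            new_j, new_i = C, C + delta
    return new_i, new_j

print(clipped_update(0.0, 0.0, 2.0, 2.0, C=1.0))   # (1.0, 1.0)
```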
Another minor problem is that the denominator in (6.18) is sometimes zero. When
this happens,
Qij = ±(Qii + Qjj)/2,
so
Qii Qjj − Qij² = Qii Qjj − (Qii + Qjj)²/4 = −(Qii − Qjj)²/4 ≤ 0.
Therefore, we know that if Q_BB is positive definite, a zero denominator in (6.18) never
happens. Hence this problem happens only if Q_BB is a singular 2 × 2 matrix. We
discuss some situations where Q_BB may be singular.
1. The function φ does not map data to linearly independent vectors in a higher-dimensional
space, so Q is only positive semidefinite — for example, when using the linear or low-
degree polynomial kernels. Then it is possible that a singular Q_BB is picked.
2. Some kernels have the nice property that φ(xi), i = 1, . . . , l, are linearly indepen-
dent if the xi are distinct. Thus Q, as well as every possible Q_BB, is positive definite. An
example is the RBF kernel. However, for many practical data sets we have encoun-
tered, some of the xi, i = 1, . . . , l, are identical. Therefore, several rows (columns)
of Q are exactly the same, so Q_BB may be singular.
However, even if the denominator of (6.18) is zero, there are no numerical prob-
lems: From (6.12), we note that
−yi∇f(α)i + yj∇f(α)j > ǫ
during the iterative process. Since
−yi∇f(α)i + yj∇f(α)j = { ±(∇f(α)i + ∇f(α)j)   if yi ≠ yj, and
                        { ±(∇f(α)i − ∇f(α)j)   if yi = yj,
the numerator of (6.18) is never zero, so the situation 0/0, which is defined as NaN by the IEEE
standard, does not appear. Therefore, (6.18) returns ±∞ if the denominator is zero, which can be
detected as a special quantity of the IEEE standard and clipped to a regular floating-point number.
6.5 The Calculation of b
After the solution α of the dual optimization problem is obtained, the variable b must
be calculated as it is used in the decision function.
(6.7) shows that for an optimal α, if αi satisfies 0 < αi < C, then
b = −yi∇f(α)i.
Practically, to avoid numerical errors, we average over all such αi:
b = Σ_{0<αi<C} −yi∇f(α)i / Σ_{0<αi<C} 1.
On the other hand, if there is no such αi, Theorem 6.3.10 implies that b can be
any number satisfying
−yi∇f(α)i ≤ b ≤ −yj∇f(α)j;
in this case, we can simply take the midpoint of the range:
b = (−yi∇f(α)i − yj∇f(α)j) / 2.
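Both cases combine into one small routine (a sketch assuming NumPy; the Section 6.2 example ends with no free αi, so b falls back to the midpoint rule):

```python
import numpy as np

def intercept(alpha, grad, y, C, tol=1e-12):
    """Compute b: average -y_t grad_t over free support vectors, or take
    the midpoint of the feasible range if no variable is free."""
    free = (alpha > tol) & (alpha < C - tol)
    if free.any():
        return float(np.mean(-y[free] * grad[free]))
    up = [-y[t] * grad[t] for t in range(len(y))
          if (alpha[t] < C and y[t] == 1) or (alpha[t] > 0 and y[t] == -1)]
    low = [-y[t] * grad[t] for t in range(len(y))
           if (alpha[t] < C and y[t] == -1) or (alpha[t] > 0 and y[t] == 1)]
    return (max(up) + min(low)) / 2

Q = np.array([[0, 0, 0, 0], [0, 1, 0, -1],
              [0, 0, 1, -1], [0, -1, -1, 2]], dtype=float)
y = np.array([1, -1, -1, 1])
alpha = np.ones(4)              # the example's final solution, all at the bound C = 1
grad = Q @ alpha - 1            # = -e
print(intercept(alpha, grad, y, C=1.0))    # 0.0, the midpoint of [-1, 1]
```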
6.6 Shrinking and Caching
6.6.1 Shrinking
Since for many problems the number of free support vectors (i.e., 0 < αi < C) is small,
the shrinking technique reduces the size of the working problem by removing some
bounded variables from consideration (Joachims, 1998). Near the end of the iterative process, the
decomposition method identifies a possible set A in which all final free αi may reside.
Indeed, we can have the following theorem, which shows that at the final iterations
of the decomposition method proposed in Section 6.3, only variables corresponding to a small
set can still be modified (Lin, 2002b, Theorem II.3):
Theorem 6.6.11 If lim_{k→∞} αᵏ = α, then from Theorem 6.7.13, α is an optimal
solution. Furthermore, after k is large enough, only elements in
{i | −yi∇f(α)i = max( max_{αt<C, yt=1} −∇f(α)t, max_{αt>0, yt=−1} ∇f(α)t )
             = min( min_{αt<C, yt=−1} ∇f(α)t, min_{αt>0, yt=1} −∇f(α)t ) }      (6.19)
can still possibly be modified.
Therefore, we tend to guess that if a variable αi is equal to C for several iterations,
then at the final solution it is still at the upper bound. Hence, instead of solving the
whole problem (2.7), the decomposition method works on a smaller problem:
min_{α_A}    (1/2) α_A^T Q_AA α_A − (e_A − Q_AN αᵏ_N)^T α_A
subject to   0 ≤ (α_A)_t ≤ C, t = 1, . . . , |A|,      (6.20)
             y_A^T α_A = ∆ − y_N^T αᵏ_N,
where N = {1, . . . , l}\A.
Of course, this heuristic may fail if the optimal solution of (6.20) is not the cor-
responding part of that of (2.7). When that happens, the whole problem (2.7) is
reoptimized starting from a point α where α_A is an optimal solution of (6.20) and
α_N contains the bounded variables identified before the shrinking process. Note that while
solving the shrunken problem (6.20), we only know the gradient Q_AA α_A + Q_AN α_N − e_A
of (6.20). Hence, when problem (2.7) is reoptimized, we also have to reconstruct the
whole gradient ∇f(α), which is quite expensive.
Many implementations begin the shrinking procedure near the end of the iterative
process; in LIBSVM, however, we start the shrinking process from the beginning. The
procedure is as follows:
1. After every min(l, 1000) iterations, we try to shrink some variables. Note that
during the iterative process,
min( {∇f(αᵏ)t | yt = −1, αt < C} ∪ {−∇f(αᵏ)t | yt = 1, αt > 0} ) = −gj
  < gi = max( {−∇f(αᵏ)t | yt = 1, αt < C} ∪ {∇f(αᵏ)t | yt = −1, αt > 0} ),      (6.21)
as Theorem 6.3.10 is not satisfied yet.
We conjecture that for those t with
gt ≡ { −∇f(α)t   if yt = 1, αt < C,
     { ∇f(α)t    if yt = −1, αt > 0,      (6.22)
if
gt ≤ −gj,      (6.23)
and αt resides at a bound, then the value of αt may not change any more. Hence,
we inactivate this variable. Similarly, for those t with
gt ≡ { −∇f(α)t   if yt = −1, αt < C,
     { ∇f(α)t    if yt = 1, αt > 0,      (6.24)
if
−gt ≥ gi,      (6.25)
and αt is at a bound, it is inactivated. Thus, the set A of activated variables is
dynamically reduced every min(l, 1000) iterations.
2. Of course, the above shrinking strategy may be too aggressive. Since the decom-
position method has very slow convergence and a large portion of the iterations
is spent achieving the final digit of the required accuracy, we would not
like those iterations to be wasted because of a wrongly shrunken problem (6.20).
Hence, when the decomposition method first achieves the tolerance
gi ≤ −gj + 10ǫ,
where ǫ is the specified stopping tolerance, we reconstruct the whole gradient.
Then, based on the correct information, we use criteria like (6.22) and (6.24) to
inactivate some variables, and the decomposition method continues.
Therefore, in LIBSVM, the size of the set A of (6.20) is dynamically reduced.
To decrease the cost of reconstructing the gradient ∇f(α), during the iterations we
always keep
Gi = C Σ_{αj=C} Qij,   i = 1, . . . , l.
Then for the gradient ∇f(α)i, i ∉ A, we have
∇f(α)i = Σ_{j=1}^l Qij αj − 1 = Gi + Σ_{0<αj<C} Qij αj − 1.
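A numerical check of this bookkeeping (hypothetical random data; keeping G lets a gradient entry be rebuilt from the free variables alone, without touching the variables at zero or at C):

```python
import numpy as np

rng = np.random.default_rng(1)
l, C = 8, 1.0
M = rng.standard_normal((l, l))
Q = M @ M.T                                  # symmetric stand-in for a kernel matrix
alpha = rng.choice([0.0, 0.4, C], size=l)    # variables at bounds and free variables

at_C = alpha == C
free = (alpha > 0) & (alpha < C)
G = C * Q[:, at_C].sum(axis=1)               # maintained during the iterations

# grad_i = G_i + sum over free j of Q_ij alpha_j - 1 (the identity holds for all i)
grad_rebuilt = G + Q[:, free] @ alpha[free] - 1
assert np.allclose(grad_rebuilt, Q @ alpha - 1)
```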
6.6.2 Caching
Another technique for reducing the computational time is caching. Since Q is fully
dense and may not be stored in the computer memory, elements Qij are calculated
as needed. Then usually a special storage using the idea of a cache is used to store
recently used Qij (Joachims, 1998). Hence the computational cost of later iterations
can be reduced.
Theorem 6.6.11 also supports the use of the cache as in final iterations only some
columns of the matrix Q are still needed. Thus if the cache can contain these columns,
we can avoid most kernel evaluations in final iterations.
In LIBSVM, we implement a simple least-recently-used strategy for the cache: we
dynamically cache only recently used columns of Q_AA of (6.20).
6.7 Convergence of the Decomposition Method
The convergence of decomposition methods was first studied in (Chang et al., 2000),
but the algorithms discussed there do not coincide with existing implementations. In this
section we discuss only convergence results related to the specific decomposition
method in Section 6.3.
From (Keerthi and Gilbert, 2002) we have
Theorem 6.7.12 Given any ǫ > 0, after a finite number of iterations (6.12) will be
satisfied.
This theorem establishes the so-called “finite termination” property, so we are sure
that after finitely many steps the algorithm will stop.
For asymptotic convergence, from (Lin, 2002a), we have
Theorem 6.7.13 If {αᵏ} is the sequence generated by the decomposition method in
Section 6.3, the limit of any of its convergent subsequences is an optimal solution of (6.1).
Note that Theorem 6.7.12 does not imply Theorem 6.7.13: if we consider gi and
gj in (6.12) as functions of α, they are not continuous. Hence, we cannot take the limit
on both sides of (6.12) and claim that any convergent point already satisfies the
KKT condition.
Theorem 6.7.13 was first proved as a special case of general results in (Lin, 2001c),
where some assumptions are needed. The newer proof in (Lin, 2002a) does not require
any assumptions.
For local convergence, as the algorithm used here is a special case of the one
discussed in (Lin, 2001b), we have the following theorem
Theorem 6.7.14 If Q is positive definite and the dual optimization problem is de-
generate (see Assumption 2 in (Lin, 2001b)), then there is c < 1 such that after k is
large enough,
f(αk+1)− f(α∗) ≤ c(f(αk)− f(α∗)),
where α∗ is the optimal solution of (6.1).
That is, LIBSVM is linearly convergent.
The discussion here is about global and local convergence. We investigate the
computational complexity in Section 6.8.
6.8 Computational Complexity
The discussion in Section 6.7 is about the asymptotic global convergence of the decom-
position method. In addition, the linear convergence (Theorem 6.7.14) is a property
of the local convergence rate. Here, we discuss the computational complexity.
The main operations are forming −e_B + Q_BN αᵏ_N of (6.2) and updating
∇f(αᵏ) to ∇f(α^{k+1}). Note that ∇f(α) is used in the working set selection as well
as in the stopping condition. They can be considered together as
−e_B + Q_BN αᵏ_N = ∇f(αᵏ)_B − Q_BB αᵏ_B,      (6.26)
and
∇f(α^{k+1}) = ∇f(αᵏ) + Q_{:,B}(α^{k+1}_B − αᵏ_B),      (6.27)
where Q_{:,B} is the sub-matrix of Q with column indices in B. That is, at the kth
iteration, as we already have ∇f(αᵏ), the right-hand side of (6.26) is used to construct
the sub-problem. After the sub-problem is solved, (6.27) is employed to obtain the next
∇f(α^{k+1}). As B has only two elements and solving the sub-problem is easy, the main
cost is Q_{:,B}(α^{k+1}_B − αᵏ_B) of (6.27). The operation itself takes O(2l), but if Q_{:,B} is not
available in the cache and each kernel evaluation costs O(n), one column of Q_{:,B}
already needs O(ln). Therefore, the complexity is:
1. #Iterations × O(l) if most columns of Q are cached during the iterations.
2. #Iterations × O(nl) if most columns of Q are not cached during the iterations and
each kernel evaluation costs O(n).
Note that if shrinking is incorporated, l will gradually decrease during iterations.
Unfortunately, so far we do not know much about the complexity of the number of
iterations. An earlier work is in (Hush and Scovel, 2003). However, its result applies
only to decomposition methods discussed in (Chang et al., 2000) but not LIBSVM or
other existing software.
6.9 Computational Complexity for Multi-class SVM
In Chapter 2.5 we discussed two multi-class methods: one-against-all and one-against-
one. As the latter trains k(k − 1)/2 classifiers, where k is the number of classes, we
may think that “one-against-one” takes more training time. In fact, this statement
may not be right. Assume the time for training one two-class problem (i.e., solving
the dual problem (6.1)) is O((·)^d), where (·) is the number of variables. Then the training
times of the two multi-class approaches are compared in Table 6.9.1. Clearly, if d ≥ 2,
the one-against-one approach has a shorter training time.

Table 6.9.1: Training complexity of two multi-class approaches
Method                            one-against-all    one-against-one
Average size per two-class SVM    l                  2l/k
# two-class SVMs                  k                  k(k − 1)/2
Training time                     kO(l^d)            (k(k − 1)/2) O((2l/k)^d)
To discuss the testing time, we first look at the decision function of two-class
SVMs (2.10). Kernel evaluations are the main part, and some subsequent multipli-
cations/additions of these kernel values follow. For multi-class SVMs, a training
instance may be a support vector of several two-class SVMs. There is no need to do
the same kernel evaluation several times, so indeed we conduct such evaluations in
the beginning, before calculating the decision values. Therefore, if the percentage of
training data that become support vectors in the k and k(k − 1)/2 two-class SVMs is about
the same and k is not too large, the two multi-class approaches have similar testing
times.
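The estimates in Table 6.9.1 are easy to compare numerically. A hypothetical sketch, assuming training a two-class problem with m variables costs about m^d operations (the values of l and k below are made up for illustration):

```python
def one_against_all(l, k, d):
    return k * l ** d                        # k problems, each with l variables

def one_against_one(l, k, d):
    # k(k-1)/2 problems, each with about 2l/k variables
    return k * (k - 1) // 2 * (2 * l / k) ** d

l, k = 10000, 10
for d in (1, 2, 3):
    print(d, one_against_all(l, k, d), one_against_one(l, k, d))
# for d >= 2, one-against-one is clearly cheaper despite training more classifiers
```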
6.10 Notes
Some early works on decomposition methods are, for example, (Osuna et al., 1997;
Joachims, 1998; Platt, 1998). The idea of using only two elements for the working set
is from the Sequential Minimal Optimization (SMO) method of (Platt, 1998). The working
set selection discussed here is indeed a special version of those discussed in (Joachims,
1998; Keerthi et al., 2001).
6.11 Exercises
1. In this chapter, we use a more complicated version of Theorem 5.2.8. Prove it
by the same derivation as for that theorem.
2. Consider the following SVM formulation:
min_{w,b,ξ}   (1/2)w^T w + C+ Σ_{i:yi=1} ξi + C− Σ_{i:yi=−1} ξi
subject to    yi(w^T φ(xi) + b) ≥ 1 − ξi,
              ξi ≥ 0, i = 1, . . . , l.
How will you change the derivation in Chapter 6.4?
3. Prove that the working set selection in Chapter 6.3 is equivalent to solving the
following problem:
min_d        ∇f(α)^T d
subject to   y^T d = 0,   −1 ≤ dt ≤ 1,      (6.28)
             dt ≥ 0 if αt = 0,   dt ≤ 0 if αt = C,
             |{dt | dt ≠ 0}| = 2.      (6.29)
Note that |{dt | dt ≠ 0}| means the number of components of d which are not
zero. The constraint (6.29) implies that a descent direction involving only two
variables is obtained. Then components of α with non-zero dt are included in
the working set B, which is used to construct the sub-problem (6.2). Note that
d is only used for identifying B, not as a search direction.
4. Consider x1, x2, x3 with ‖x1 − x2‖ = ‖x1 − x3‖ = ‖x2 − x3‖, y = [1, 1, −1]^T,
and C = ∞. If the RBF kernel is used, the dual SVM problem is
min_{α1,α2,α3}   (1/2) [α1 α2 α3] [  1  a −a
                                     a  1 −a
                                    −a −a  1 ] [α1; α2; α3] − (α1 + α2 + α3)
subject to       α1 + α2 − α3 = 0,
                 0 ≤ α1, α2, α3,
where a = e^{−γ‖xi−xj‖²}. We assume C is large, so it is not needed here. Prove that
at the optimal solution,
α∗ = [ 2/(3(1−a))   2/(3(1−a))   4/(3(1−a)) ]^T.      (6.30)
Then prove that if the initial solution is zero and the decomposition method in
Chapter 6.2 is considered, after k is large enough,
(α^{k+1} − α∗)^T Q(α^{k+1} − α∗) = (1/4)(αᵏ − α∗)^T Q(αᵏ − α∗).      (6.31)
This is a simple example for which the decomposition method is linearly con-
vergent. Hence, in theory, the linear convergence is already the best worst-case
analysis for the decomposition method in Chapter 6.2.
5. Write a simple SVM classifier using the algorithm in this chapter. Consider the dense input format only. Use the simple working set selection and solve two-variable optimization problems until the given stopping tolerance is satisfied. You may write the predictor in the same program so that the test accuracy is reported immediately. Kernel elements should be calculated on demand.
Requirements:
• The code should be fewer than 300 lines in any high-level language.
• The RBF kernel is enough.
• Run some small data sets from
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/binary
with some given parameters. On some (maybe most) parameter settings your code should be faster than LIBSVM, but on others it is the other way around. Explain why.
• Moreover, your code should be able to train the 50,000-instance ijcnn1 data set from the same URL.
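As a starting point, the two-variable decomposition loop can be sketched as below. This is a rough illustration, not a full solution to the exercise: the function names are made up, it precomputes the whole kernel matrix instead of computing elements on demand, and it does no caching or shrinking. It uses the maximal-violating-pair selection and the update direction Δα_i = y_i t, Δα_j = −y_j t, which keeps yᵀα = 0:

```python
import numpy as np

def rbf(X1, X2, gamma):
    """RBF kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2)."""
    d = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def svm_train(X, y, C=1.0, gamma=1.0, eps=1e-3, max_iter=100000):
    l = len(y)
    K = rbf(X, X, gamma)                 # the exercise computes these on demand
    alpha = np.zeros(l)
    G = -np.ones(l)                      # gradient of the dual objective at alpha = 0
    for _ in range(max_iter):
        yG = -y * G
        up = ((y == 1) & (alpha < C - 1e-10)) | ((y == -1) & (alpha > 1e-10))
        low = ((y == 1) & (alpha > 1e-10)) | ((y == -1) & (alpha < C - 1e-10))
        i = np.where(up)[0][np.argmax(yG[up])]     # maximal violating pair
        j = np.where(low)[0][np.argmin(yG[low])]
        if yG[i] - yG[j] < eps:                    # stopping tolerance satisfied
            break
        # Unconstrained step along the two-variable direction, then clip to the box.
        eta = max(K[i, i] + K[j, j] - 2 * K[i, j], 1e-12)
        t = (yG[i] - yG[j]) / eta
        hi_i = C - alpha[i] if y[i] == 1 else alpha[i]
        hi_j = alpha[j] if y[j] == 1 else C - alpha[j]
        t = min(t, hi_i, hi_j)
        alpha[i] += y[i] * t
        alpha[j] -= y[j] * t
        G += t * y * (K[:, i] - K[:, j])           # cheap rank-two gradient update
    yG = -y * G
    free = (alpha > 1e-8) & (alpha < C - 1e-8)
    b = yG[free].mean() if free.any() else (yG[i] + yG[j]) / 2
    return alpha, b

def svm_predict(Xtrain, y, alpha, b, gamma, Xtest):
    return np.sign(rbf(Xtest, Xtrain, gamma) @ (alpha * y) + b)

# Two well-separated clusters; training accuracy should be 1.0.
X = np.array([[0.0, 0.0], [0.2, 0.1], [-0.1, 0.3], [0.1, -0.2],
              [3.0, 3.0], [3.2, 2.9], [2.8, 3.1], [3.1, 3.3]])
y = np.array([1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0])
alpha, b = svm_train(X, y, C=10.0, gamma=1.0)
print((svm_predict(X, y, alpha, b, 1.0, X) == y).mean())
```

Replacing the precomputed matrix K with per-column kernel evaluations (and a cache) is what the exercise actually asks for, and is where the timing differences against LIBSVM come from.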
Bibliography
Bazaraa, M. S., H. D. Sherali, and C. M. Shetty (1993). Nonlinear Programming: Theory and Algorithms (Second ed.). Wiley.

Boser, B., I. Guyon, and V. Vapnik (1992). A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152. ACM Press.

Chang, C.-C., C.-W. Hsu, and C.-J. Lin (2000). The analysis of decomposition methods for support vector machines. IEEE Transactions on Neural Networks 11 (4), 1003–1008.

Chang, C.-C. and C.-J. Lin (2001a). IJCNN 2001 challenge: Generalization ability and text decoding. In Proceedings of IJCNN. IEEE.

Chang, C.-C. and C.-J. Lin (2001b). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Cortes, C. and V. Vapnik (1995). Support-vector networks. Machine Learning 20, 273–297.

Cover, T. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers EC-14, 326–334.

Cristianini, N. and J. Shawe-Taylor (2000). An Introduction to Support Vector Machines. Cambridge, UK: Cambridge University Press.

Fletcher, R. (1987). Practical Methods of Optimization. John Wiley and Sons.

Friedman, J. (1996). Another approach to polychotomous classification. Technical report, Department of Statistics, Stanford University. Available at http://www-stat.stanford.edu/reports/friedman/poly.ps.Z.

Gardy, J. L., C. Spencer, K. Wang, M. Ester, G. E. Tusnady, I. Simon, S. Hua, K. deFays, C. Lambert, K. Nakai, and F. S. Brinkman (2003). PSORT-B: improving protein subcellular localization prediction for gram-negative bacteria. Nucleic Acids Research 31 (13), 3613–3617.

Hsu, C.-W. and C.-J. Lin (2002). A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks 13 (2), 415–425.

Hush, D. and C. Scovel (2003). Polynomial-time decomposition algorithms for support vector machines. Machine Learning 51, 51–71.

Joachims, T. (1998). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel Methods — Support Vector Learning, Cambridge, MA. MIT Press.

Keerthi, S. S. and E. G. Gilbert (2002). Convergence of a generalized SMO algorithm for SVM classifier design. Machine Learning 46, 351–360.

Keerthi, S. S. and C.-J. Lin (2003). Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation 15 (7), 1667–1689.

Keerthi, S. S., S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy (2001). Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation 13, 637–649.

Knerr, S., L. Personnaz, and G. Dreyfus (1990). Single-layer learning revisited: a stepwise procedure for building and training a neural network. In J. Fogelman (Ed.), Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag.

Kreßel, U. (1999). Pairwise classification and support vector machines. In B. Schölkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel Methods — Support Vector Learning, Cambridge, MA, pp. 255–268. MIT Press.

Lin, C.-J. (2001a). Formulations of support vector machines: a note from an optimization point of view. Neural Computation 13 (2), 307–317.

Lin, C.-J. (2001b). Linear convergence of a decomposition method for support vector machines. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan.

Lin, C.-J. (2001c). On the convergence of the decomposition method for support vector machines. IEEE Transactions on Neural Networks 12 (6), 1288–1298.

Lin, C.-J. (2002a). Asymptotic convergence of an SMO algorithm without any assumptions. IEEE Transactions on Neural Networks 13 (1), 248–250.

Lin, C.-J. (2002b). A formal analysis of stopping criteria of decomposition methods for support vector machines. IEEE Transactions on Neural Networks 13 (5), 1045–1052.

Michie, D., D. J. Spiegelhalter, and C. C. Taylor (1994). Machine Learning, Neural and Statistical Classification. Englewood Cliffs, N.J.: Prentice Hall. Data available at http://www.ncc.up.pt/liacc/ML/statlog/datasets.html.

Osuna, E., R. Freund, and F. Girosi (1997). Training support vector machines: An application to face detection. In Proceedings of CVPR'97, New York, NY, pp. 130–136. IEEE.

Platt, J. C. (1998). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel Methods — Support Vector Learning, Cambridge, MA. MIT Press.

Prokhorov, D. (2001). IJCNN 2001 neural network competition. Slide presentation in IJCNN'01, Ford Research Laboratory. http://www.geocities.com/ijcnn/nnc_ijcnn01.pdf.

Sarle, W. S. (1997). Neural Network FAQ. Periodic posting to the Usenet newsgroup comp.ai.neural-nets.

Schölkopf, B. and A. J. Smola (2002). Learning with Kernels. MIT Press.

Vapnik, V. (1998). Statistical Learning Theory. New York, NY: Wiley.
Index

attributes, 6
complementarity condition, 42
data instance, 6
features, 6
kernel function, 18
Nearest Neighbor Methods, 6
strong duality, 42