Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis


Page 1: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Support Vector Machines (SVMs)

Chapter 5 (Duda et al.)

CS479/679 Pattern Recognition
Dr. George Bebis

Page 2: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Learning through “empirical risk” minimization

• Estimate g(x) from a finite set of observations by minimizing an error function, for example, the training error (also called empirical risk):

class labels: $z_k = +1$ if $x_k \in \omega_1$, $z_k = -1$ if $x_k \in \omega_2$

$$R_{emp} = \frac{1}{2n} \sum_{k=1}^{n} \left| z_k - g(x_k) \right|$$
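As a concrete illustration (not from the slides), a minimal NumPy sketch of the empirical risk above; the names `empirical_risk`, `g`, `X`, and `z` are our own, and `g` is assumed to return labels in {-1, +1}:

```python
import numpy as np

def empirical_risk(g, X, z):
    """R_emp = (1/2n) * sum_k |z_k - g(x_k)| for labels z_k in {-1, +1}."""
    preds = np.array([g(x) for x in X])            # g(x_k) in {-1, +1}
    return np.abs(z - preds).sum() / (2 * len(z))  # equals the fraction misclassified

# Toy example: a hypothetical linear decision rule.
w, w0 = np.array([1.0, -1.0]), 0.0
g = lambda x: 1 if w @ x + w0 > 0 else -1

X = np.array([[2.0, 1.0], [0.0, 3.0], [1.0, 0.5]])
z = np.array([1, -1, 1])
print(empirical_risk(g, X, z))   # 0.0 on this toy set
```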

Page 3: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Learning through “empirical risk” minimization (cont’d)

• Conventional empirical risk minimization does not imply good generalization performance.

– There could be several different functions g(x) which all approximate the training data set well.

– Difficult to determine which function would have the best generalization performance.

Page 4: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Learning through “empirical risk” minimization (cont’d)

[Figure: two candidate decision boundaries, B1 (Solution 1) and B2 (Solution 2), both separating the training data.]

Which solution is better?

Page 5: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Statistical Learning: Capacity and VC dimension

• To guarantee good generalization performance, the capacity (i.e., complexity) of the learned functions must be controlled.

• Functions with high capacity are more complicated (i.e., have many degrees of freedom).

[Figure: example decision functions of low capacity (simple) vs. high capacity (complex).]

Page 6: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Statistical Learning: Capacity and VC dimension (cont’d)

• How do we measure capacity?

– In statistical learning, the Vapnik-Chervonenkis (VC) dimension is a popular measure of capacity.

– The VC dimension can predict a probabilistic upper bound on the generalization error of a classifier.

Page 7: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Statistical Learning: Capacity and VC dimension (cont’d)

• A function that (1) minimizes the empirical risk and (2) has low VC dimension will generalize well regardless of the dimensionality of the input space:

$$err_{true} \leq err_{train} + \sqrt{\frac{VC\,\big(\log(2n/VC) + 1\big) - \log(\delta/4)}{n}}$$

with probability $(1 - \delta)$; ($n$: # of training examples)

(Vapnik, 1995, “Structural Risk Minimization Principle”)
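As a hedged sketch (not part of the slides), the square-root term of the bound can be computed directly; `vc_confidence` is our own name:

```python
import numpy as np

def vc_confidence(n, vc, delta):
    """sqrt( [VC*(log(2n/VC) + 1) - log(delta/4)] / n ): the capacity term of the bound."""
    return np.sqrt((vc * (np.log(2 * n / vc) + 1) - np.log(delta / 4)) / n)

# err_true <= err_train + vc_confidence(n, vc, delta) with probability 1 - delta.
print(vc_confidence(n=10_000, vc=50, delta=0.05))    # ~0.19
print(vc_confidence(n=10_000, vc=500, delta=0.05))   # ~0.48: higher capacity -> looser bound
```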

Page 8: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

VC dimension and margin of separation

• Vapnik has shown that maximizing the margin of separation (i.e., empty space between classes) is equivalent to minimizing the VC dimension.

• The optimal hyperplane is the one giving the largest margin of separation between the classes.

Page 9: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Margin of separation and support vectors

• How is the margin defined?

– The margin is defined by the distance of the nearest training samples from the hyperplane.

– We refer to these samples as support vectors.

– Intuitively speaking, these are the most difficult samples to classify.

Page 10: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Margin of separation and support vectors (cont’d)

[Figure: two different solutions B1 and B2 with their corresponding margins, bounded by b11, b12 and b21, b22 respectively.]

Page 11: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

SVM Overview

• Primarily two-class classifiers but can be extended to multiple classes.

• It performs structural risk minimization to achieve good generalization performance.

• The optimization criterion is the margin of separation between classes.

• Training is equivalent to solving a quadratic programming problem with linear constraints.

Page 12: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Linear SVM: separable case

• Linear discriminant: $g(x) = w^t x + w_0$; decide $\omega_1$ if $g(x) > 0$ and $\omega_2$ if $g(x) < 0$.

• Class labels: $z_k = +1$ if $x_k \in \omega_1$, $z_k = -1$ if $x_k \in \omega_2$.

• Consider the equivalent problem: $z_k\, g(x_k) > 0$, or $z_k (w^t x_k + w_0) > 0$, for $k = 1, 2, \ldots, n$.

Page 13: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Linear SVM: separable case (cont’d)

• The distance of a point $x_k$ from the separating hyperplane should satisfy the constraint:

$$\frac{z_k\, g(x_k)}{\|w\|} \geq b, \quad b > 0$$

• To constrain the length of $w$ (for uniqueness), we impose: $b\, \|w\| = 1$

• Using the above constraint:

$$z_k\, g(x_k) \geq 1 \quad \text{or} \quad z_k (w^t x_k + w_0) \geq 1, \quad \text{where } b = \frac{1}{\|w\|}$$

Page 14: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Linear SVM: separable case (cont’d)

maximize margin: $\dfrac{2}{\|w\|}$

subject to: $z_k (w^t x_k + w_0) \geq 1$, for $k = 1, 2, \ldots, n$

(quadratic programming problem)

Page 15: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Linear SVM: separable case (cont’d)

• Using Lagrange optimization, minimize:

$$L(w, w_0, \lambda) = \frac{1}{2} \|w\|^2 - \sum_{k=1}^{n} \lambda_k \big[ z_k (w^t x_k + w_0) - 1 \big], \quad \lambda_k \geq 0$$

• Easier to solve the “dual” problem (Kuhn-Tucker construction): maximize

$$\sum_{k=1}^{n} \lambda_k - \frac{1}{2} \sum_{k,j} \lambda_k \lambda_j z_k z_j\, x_k^t x_j$$

subject to $\lambda_k \geq 0$ and $\sum_{k} \lambda_k z_k = 0$.
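For a small data set the dual can be maximized directly with a general-purpose solver; below is a sketch (our own, not the slides' method) using `scipy.optimize.minimize` with SLSQP on the negated dual objective, under the constraints λ_k ≥ 0 and Σ_k λ_k z_k = 0:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up, linearly separable toy data.
X = np.array([[2.0, 2.0], [0.0, 1.0], [-2.0, -2.0], [-1.0, 0.0]])
z = np.array([1.0, 1.0, -1.0, -1.0])
n = len(z)

G = (z[:, None] * z[None, :]) * (X @ X.T)     # G_kj = z_k z_j (x_k . x_j)

def neg_dual(lam):
    # Negative of: sum_k lam_k - 0.5 * sum_{k,j} lam_k lam_j z_k z_j (x_k . x_j)
    return -(lam.sum() - 0.5 * lam @ G @ lam)

res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=[(0, None)] * n,
               constraints=[{"type": "eq", "fun": lambda lam: lam @ z}])
print("lambdas:", np.round(res.x, 4))   # non-zero only for the support vectors
```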

Page 16: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Linear SVM: separable case (cont’d)

• The solution is given by:

$$w = \sum_{k=1}^{n} \lambda_k z_k x_k, \qquad w_0 = z_k - w^t x_k \ \ \text{(computed from any support vector } x_k\text{)}$$

$$g(x) = w^t x + w_0 = \sum_{k=1}^{n} \lambda_k z_k (x_k^t x) + w_0 = \sum_{k=1}^{n} \lambda_k z_k (x_k \cdot x) + w_0 \quad \text{(dot product)}$$
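These relations can be checked numerically; a sketch (not from the slides) using scikit-learn, whose `dual_coef_` attribute stores λ_k z_k for the support vectors, so w = Σ λ_k z_k x_k can be rebuilt and compared with the solver's own weight vector:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data.
X = np.array([[2.0, 2.0], [0.0, 1.0], [-2.0, -2.0], [-1.0, 0.0]])
z = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, z)

lam_z = clf.dual_coef_[0]                 # lambda_k * z_k, support vectors only
sv = clf.support_vectors_

w = lam_z @ sv                            # w = sum_k lambda_k z_k x_k
w0 = clf.intercept_[0]
print(np.allclose(w, clf.coef_[0]))       # True: matches the solver's weight vector
print("g at the support vectors:", sv @ w + w0)   # approximately +/- 1
```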

Page 17: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Linear SVM: separable case (cont’d)

• It can be shown that if xk is not a support vector, then the corresponding λk=0.

Only the support vectors contribute to the solution!

$$g(x) = \sum_{k=1}^{n} \lambda_k z_k (x_k \cdot x) + w_0$$

Page 18: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Linear SVM: non-separable case

• Allow misclassifications (i.e., soft margin classifier) by introducing positive error (slack) variables $\psi_k$:

$$z_k (w^t x_k + w_0) \geq 1 - \psi_k, \quad k = 1, 2, \ldots, n$$

Page 19: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Linear SVM: non-separable case (cont’d)

• The new objective is to minimize $\frac{1}{2}\|w\|^2 + c \sum_{k=1}^{n} \psi_k$ subject to the constraints above; the constant $c$ controls the trade-off between the margin and misclassification errors.

• This aims to prevent outliers from affecting the optimal hyperplane.
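To see the trade-off in code, a short sketch (not from the slides) using scikit-learn's `SVC`, whose `C` parameter plays the role of the constant c; the data are made up and include one outlier:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up data with one -1 "outlier" sitting among the +1 points.
X = np.array([[2, 2], [3, 1], [2.5, 3],
              [-2, -2], [-3, -1], [-2.5, -3],
              [1.8, 1.8]], dtype=float)
z = np.array([1, 1, 1, -1, -1, -1, -1])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, z)
    width = 2 / np.linalg.norm(clf.coef_[0])
    errors = int((clf.predict(X) != z).sum())
    print(f"C={C:>6}: margin width = {width:.3f}, training errors = {errors}")
# Small C: wider margin, errors tolerated; large C: errors penalized more heavily.
```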

Page 20: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Linear SVM: non-separable case (cont’d)

1 ,

1

2

n nt

k k j k j j kk k j

z z

x x

• Easier to solve the “dual” problem (Kuhn-Tucker construction):

01

( ) ( . )n

k k kk

g z w

x x x

Page 21: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Nonlinear SVM

• Extending these concepts to the non-linear case involves mapping the data to a high-dimensional space of dimension $h$:

$$x_k \to \Phi(x_k) = \begin{bmatrix} \varphi_1(x_k) \\ \varphi_2(x_k) \\ \vdots \\ \varphi_h(x_k) \end{bmatrix}$$

• Mapping the data to a sufficiently high-dimensional space is likely to make the data linearly separable in that space.
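As one concrete instance of such a Φ (our own illustration, possibly related to the slides' later h = 6 example), the degree-2 polynomial map for 2-D inputs:

```python
import numpy as np

def phi(x):
    """Degree-2 polynomial feature map: R^2 -> R^6 (so h = 6)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     x1 ** 2,
                     x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
print(phi(x))   # data not linearly separable in R^2 may become separable in this R^6
```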

Page 22: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Nonlinear SVM (cont’d)

Example:

Page 23: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Nonlinear SVM (cont’d)

linear SVM: $g(x) = \sum_{k=1}^{n} \lambda_k z_k (x_k \cdot x) + w_0$

non-linear SVM: $g(x) = \sum_{k=1}^{n} \lambda_k z_k \big(\Phi(x_k) \cdot \Phi(x)\big) + w_0$

Page 24: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Nonlinear SVM (cont’d)

non-linear SVM: $g(x) = \sum_{k=1}^{n} \lambda_k z_k \big(\Phi(x_k) \cdot \Phi(x)\big) + w_0$

• The disadvantage of this approach is that the mapping $x_k \to \Phi(x_k)$ might be very computationally intensive to compute!

• Is there an efficient way to compute $\Phi(x_k) \cdot \Phi(x)$?

Page 25: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

The kernel trick

• Compute dot products using a kernel function: $K(x_k, x) = \Phi(x_k) \cdot \Phi(x)$

$$g(x) = \sum_{k=1}^{n} \lambda_k z_k \big(\Phi(x_k) \cdot \Phi(x)\big) + w_0 \;\;\Longrightarrow\;\; g(x) = \sum_{k=1}^{n} \lambda_k z_k\, K(x_k, x) + w_0$$

Page 26: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

The kernel trick (cont’d)

• Comments

– Kernel functions which can be expressed as a dot product in some space satisfy Mercer’s condition (see Burges’ paper).

– Mercer’s condition does not tell us how to construct Φ() or even what the high-dimensional space is.

• Advantages of the kernel trick

– No need to know Φ().

– Computations remain feasible even if the feature space has high dimensionality.

Page 27: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Polynomial Kernel

$K(x, y) = (x \cdot y)^d$
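A quick numerical check of the kernel trick for d = 2 (our own sketch): the explicit map Φ(x) = (x1², √2·x1x2, x2²) satisfies Φ(x)·Φ(y) = (x·y)², so the kernel computes a feature-space dot product while staying in the input space:

```python
import numpy as np

def phi2(x):
    """Explicit map whose feature-space dot product equals (x . y)^2 for x, y in R^2."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(x, y, d=2):
    return (x @ y) ** d

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi2(x) @ phi2(y))   # ~1.0 (dot product computed in feature space)
print(poly_kernel(x, y))   # 1.0  (same value up to rounding, computed in input space)
```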

Page 28: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Polynomial Kernel - Example

Page 29: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Common Kernel functions
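The table of kernels on this slide did not survive extraction; as a stand-in (our own definitions, with our own parameter names), the kernels most commonly listed for SVMs:

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, d=2, c=1.0):
    return (x @ y + c) ** d

def rbf_kernel(x, y, sigma=1.0):
    # Gaussian / radial basis function kernel.
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    # Satisfies Mercer's condition only for some choices of kappa and theta.
    return np.tanh(kappa * (x @ y) + theta)
```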

Page 30: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Example

Page 31: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Example (cont’d)

h=6

Page 32: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Example (cont’d)

Page 33: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Example (cont’d)

(Problem 4)

Page 34: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Example (cont’d)

$w_0 = 0$

Page 35: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Example (cont’d)

w =

Page 36: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Example (cont’d)

The discriminant

w =

Page 37: Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Comments

• SVM training is based on exact optimization, not approximate methods (i.e., it is a global optimization method with no local optima).

• SVMs appear to avoid overfitting in high-dimensional spaces and to generalize well using a small training set.

• Performance depends on the choice of the kernel and its parameters.

• Its complexity depends on the number of support vectors, not on the dimensionality of the transformed space.
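To illustrate the last point, a brief sketch (made-up data) showing that what a trained scikit-learn SVM keeps around is its support vectors, and prediction cost scales with their number rather than with the dimensionality of the (here implicit) RBF feature space:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
z = np.sign(X[:, 0] * X[:, 1])     # XOR-like labels: not linearly separable
z[z == 0] = 1

clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, z)
print("support vectors per class:", clf.n_support_)
print("total support vectors:", clf.support_vectors_.shape[0], "of", len(X), "training samples")
# Evaluating g(x) requires one kernel evaluation per support vector.
```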