kernels in support vector machines - biocomp.unibo.it

65
Kernels in Support Vector Machines Based on lectures of Martin Law, University of Michigan

Upload: others

Post on 04-May-2022

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Kernels in Support Vector Machines - biocomp.unibo.it

Kernels in Support Vector Machines

Based on lectures of Martin Law, University of Michigan

Page 2: Kernels in Support Vector Machines - biocomp.unibo.it

Non Linear separable problems

AND OR NOT(1)

XOR

The XOR problem cannot be solved with a perceptron.

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 3: Kernels in Support Vector Machines - biocomp.unibo.it

With NN:Multi-layer feed-forward neural networks

Neurons are organized into hierarchical layers

Each layer receive their inputs from the previous one and transmits the output to the next one

w1ij

w2ij

111

ji

i

ijj xwgz

2122

ji

i

ijj zwgz

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 4: Kernels in Support Vector Machines - biocomp.unibo.it

2

(12)

1

(11)

1

(21)

2

1w1

11

w122

w121

w112

w211

w221

XORw1

11 = 0.7 w121 = 0.7 1

1 = 0. 5 w1

12 = 0.3 w122 = 0.3 1

2 = 0. 5 w2

11 = 0.7 w221 = -0.7 1

2 = 0. 5

a11 = -0.5 z1

1 = 0

a12 = -0.5 z1

2 = 0

a21 = -0.5 z1

2 = 0

x1 = 0 x2 = 0

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 5: Kernels in Support Vector Machines - biocomp.unibo.it

2

(12)

1

(11)

1

(21)

2

1w1

11

w122

w121

w112

w211

w221

XORw1

11 = 0.7 w121 = 0.7 1

1 = 0. 5 w1

12 = 0.3 w122 = 0.3 1

2 = 0. 5 w2

11 = 0.7 w221 = -0.7 1

2 = 0. 5

a11 = 0.2 z1

1 = 1

a12 = -0.2 z1

2 = 0

a21 = 0.2 z1

2 = 1

x1 = 1 x2 = 0

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 6: Kernels in Support Vector Machines - biocomp.unibo.it

2

(12)

1

(11)

1

(21)

2

1w1

11

w122

w121

w112

w211

w221

XORw1

11 = 0.7 w121 = 0.7 1

1 = 0. 5 w1

12 = 0.3 w122 = 0.3 1

2 = 0. 5 w2

11 = 0.7 w221 = -0.7 1

2 = 0. 5

a11 = 0.2 z1

1 = 1

a12 = -0.2 z1

2 = 0

a21 = 0.2 z1

2 = 1

x1 = 0 x2 = 1

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 7: Kernels in Support Vector Machines - biocomp.unibo.it

2

(12)

1

(11)

1

(21)

2

1w1

11

w122

w121

w112

w211

w221

XORw1

11 = 0.7 w121 = 0.7 1

1 = 0. 5 w1

12 = 0.3 w122 = 0.3 1

2 = 0. 5 w2

11 = 0.7 w221 = -0.7 1

2 = 0. 5

a11 = 0.9 z1

1 = 1

a12 = 0.1 z1

2 = 1

a21 = -0.5 z1

2 = 0

x1 = 1 x2 = 1

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 8: Kernels in Support Vector Machines - biocomp.unibo.it

The hidden layer REMAPS the input in a new representation that is linearly separable

Input Desired Activation ofoutput hidden neurons

0 0 0 0 01 0 1 0 10 1 1 0 11 1 0 1 1

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 9: Kernels in Support Vector Machines - biocomp.unibo.it

Extension to Non-linear Decision Boundary

So far, we have only considered large-margin classifier with a linear decision boundary

How to generalize it to become nonlinear?

Key idea: transform xi to a higher dimensional space to “make life easier”

Input space: the space the point xi are located

Feature space: the space of f(xi) after transformation

Why to transform?

Linear operation in the feature space is equivalent to non-linear operation in input space

Classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature of x1x2 make the problem linearly separable

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 10: Kernels in Support Vector Machines - biocomp.unibo.it

XOR

X Y

0 0 0

0 1 1

1 0 1

1 1 0

Is not linearly separable

X Y XY

0 0 0 0

0 1 0 1

1 0 0 1

1 1 1 0

Is linearly separable

X

X

Y

Y

XY

Page 11: Kernels in Support Vector Machines - biocomp.unibo.it

Find a feature space

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 12: Kernels in Support Vector Machines - biocomp.unibo.it

Transforming the Data

Computation in the feature space can be costly because it is high dimensional

The feature space can be infinite-dimensional!

The kernel trick comes to rescue

f( )

f( )

f( )f( )f( )

f( )

f( )f( )

f(.)f( )

f( )

f( )

f( )f( )

f( )

f( )

f( )f( )

f( )

Feature spaceInput spaceNote: feature space is of higher dimension

than the input space in practice

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 13: Kernels in Support Vector Machines - biocomp.unibo.it

The Kernel Trick

Recall the SVM optimization problem

The data points only appear as scalar product

As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly

Many common geometric operations (angles, distances) can be expressed by inner products

Define the kernel function K by

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 14: Kernels in Support Vector Machines - biocomp.unibo.it

An Example for f(.) and K(.,.)

Suppose f(.) is given as follows

An inner product in the feature space is

So, if we define the kernel function as follows, there is no need to carry out f(.) explicitly

This use of kernel function to avoid carrying out f(.) explicitly is known as the kernel trick

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 15: Kernels in Support Vector Machines - biocomp.unibo.it

Kernels

Given a mapping:

a kernel is represented as the inner product

A kernel must satisfy the Mercer’s condition:

φ(x)x

i

ii φφK (y)(x)yx ),(

0)()()()( that such )( 2yxyxyx,xxx ddggKdgg

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Analogous to positive-semidefinite matrices M for which

00 zMzz T

Page 16: Kernels in Support Vector Machines - biocomp.unibo.it

Modification Due to Kernel Function

Change all inner products to kernel functions

For training,

Original

With kernel function

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 17: Kernels in Support Vector Machines - biocomp.unibo.it

Modification Due to Kernel Function

For testing, the new data z is classified as class 1 if f >0, and as class 2 if f <0

Original

With kernel function

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 18: Kernels in Support Vector Machines - biocomp.unibo.it

More on Kernel Functions

Since the training of SVM only requires the value of K(xi, xj), there is no restriction of the form of xi and xj

xi can be a sequence or a tree, instead of a feature vector

K(xi, xj) is just a similarity measure comparing xi and xj

For a test object z, the discriminat function essentially is a weighted sum of the similarity between z and a pre-selected set of objects (the support vectors)

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 19: Kernels in Support Vector Machines - biocomp.unibo.it

Example

Suppose we have 5 1D data points

x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2 y1=1, y2=1, y3=-1, y4=-1, y5=1

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 20: Kernels in Support Vector Machines - biocomp.unibo.it

Example

1 2 4 5 6

class 2 class 1class 1

Suppose we have 5 1D data points

x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2 y1=1, y2=1, y3=-1, y4=-1, y5=1

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 21: Kernels in Support Vector Machines - biocomp.unibo.it

Example

We use the polynomial kernel of degree 2

K(x,y) = (xy+1)2

C is set to 100

We first find ai (i=1, …, 5) by

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 22: Kernels in Support Vector Machines - biocomp.unibo.it

Example

By using a QP solver, we get

a1=0, a2=2.5, a3=0, a4=7.333, a5=4.833

Note that the constraints are indeed satisfied

The support vectors are {x2=2, x4=5, x5=6}

The discriminant function is

b is recovered by solving f(2)=1 or by f(5)=-1 or by f(6)=1,

All three give b=9

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 23: Kernels in Support Vector Machines - biocomp.unibo.it

Example

Value of discriminant function

1 2 4 5 6

class 2 class 1class 1

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

f(z) f(z)>0

f(z)<0

Page 24: Kernels in Support Vector Machines - biocomp.unibo.it

Kernel Functions

In practical use of SVM, the user specifies the kernel function; the transformation f(.) is not explicitly stated

Given a kernel function K(xi, xj), the transformation f(.) is given by its eigenfunctions (a concept in functional analysis)

Eigenfunctions can be difficult to construct explicitly

This is why people only specify the kernel function without worrying about the exact transformation

Another view: kernel function, being an scalar product, is really a similarity measure between the objects

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 25: Kernels in Support Vector Machines - biocomp.unibo.it

A kernel is associated to a transformation

Given a kernel, in principle it should be recovered the transformation in the feature space that originates it.

K(x,y) = (xy+1)2= x2y2+2xy+1

If x and y are numbers it corresponds the transformation

What if x and y are 2-dimensional vectors?

1

2

2

x

x

x

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 26: Kernels in Support Vector Machines - biocomp.unibo.it

A kernel is associated to a transformation

) ) )

) ) ) ) ) ) ) ) ) ) ) )2211222222112121

222112

2221

1,1,

jijijijijiji

jijijiji

xx+xx+xx+xxxx+xx+

xxxxxx+=xxK

) ) ) ) ) ) Tiiiiiii x,x,xxx,x=)(x 21222121 22,21,f

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 27: Kernels in Support Vector Machines - biocomp.unibo.it

XOR

Simple example (XOR problem)

) N

=i

N

j=

jijiji

N

=i

i xxKyyααα=αL1 11

),(2

1Input vector Y

[-1,-1] -1

[-1,+1] +1

[+1,-1] +1

[+1,+1] -1

)21, jiji xx+=)xK(x

0 1

1

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

(-1,-1) (-1,+1) (+1,-1) (+1,+1)

(-1,-1) 9 1 1 1

(-1,+1) 1 9 1 1

(+1,-1) 1 1 9 1

(+1,+1) 1 1 1 9

Page 28: Kernels in Support Vector Machines - biocomp.unibo.it

)α+ααα+αααα+α

+ααααααα(+α+α+αα(αL

2443

234232

22

413121214321

929229

22292

1)

09α1αL

09α1αL

09α1αL

09α1αL

43214

43213

43212

43211

=ααα

=ααα

=ααα

=ααα

4

1

8

14321

=L

=α=α=α=α

The four Input vectors are

All support vectors

N

=i

iii )(xyα=w1

W = [0, 0, –1/sqrt(2), 0, 0, 0] T

XOR

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 29: Kernels in Support Vector Machines - biocomp.unibo.it

N

=i

iii )(xyα=w1

W = [0, 0, –1/sqrt(2), 0, 0, 0]’

XOR

Input vector Y x1x2

[-1,-1] -1 +1

[-1,+1] +1 -1

[+1,-1] +1 -1

[+1,+1] -1 +1

Input vector Y

[-1,-1] -1

[-1,+1] +1

[+1,-1] +1

[+1,+1] -1

0 1

1

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

) ) ) ) ) ) Tiiiiiii x,x,xxx,x=)(x 21222121 22,21,f

Page 30: Kernels in Support Vector Machines - biocomp.unibo.it

Examples of Kernel Functions

Polynomial kernel up to degree d

Polynomial kernel up to degree d

Radial basis function kernel with width s

Sigmoid with parameter k and

It does not satisfy the Mercer condition on all k and

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 31: Kernels in Support Vector Machines - biocomp.unibo.it

Ploynomial kernel

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Bishop C, Pattern recognition and Machine Learning, Springer

Page 32: Kernels in Support Vector Machines - biocomp.unibo.it

Fonte: http://www.ivanciuc.org/

Page 33: Kernels in Support Vector Machines - biocomp.unibo.it

Examples of Kernel Functions

Radial basis function (or gaussian) kernel with width s

222 2expexp

2exp),(

sss

yyxyxxyxK

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 34: Kernels in Support Vector Machines - biocomp.unibo.it

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 35: Kernels in Support Vector Machines - biocomp.unibo.it

Examples of Kernel Functions

With 1-dim vectors:

It corresponds to the scalar product in the infinite dimensional feature space:

For vector in m-dim the feature space is more complicated

2

2

22

2

2expexp

2exp),(

sss

yxyxyxK

)

.....

!,....,

!3,

2,,1

2exp)(

3

3

2

2

2

2

n

n

T

n

xxxxxx

sssssf

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 36: Kernels in Support Vector Machines - biocomp.unibo.it

Without slack variables

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Bishop C, Pattern recognition and Machine Learning, Springer

Page 37: Kernels in Support Vector Machines - biocomp.unibo.it

With slack variables

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Bishop C, Pattern recognition and Machine Learning, Springer

Page 38: Kernels in Support Vector Machines - biocomp.unibo.it

Gaussian RBF kernel

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Bishop C, Pattern recognition and Machine Learning, Springer

Page 39: Kernels in Support Vector Machines - biocomp.unibo.it

Building new kernels

If k1(x,y) and k2(x,y) are two valid kernels then the following kernels are valid

Linear Combination

Exponential

Product

Polymomial transformation (Q: polymonial with non negative coefficients)

Function product (f: any function)

),(),(),( 2211 yxkcyxkcyxk

),(exp),( 1 yxkyxk

),(),(),( 21 yxkyxkyxk

),(),( 1 yxkQyxk

)(),()(),( 1 yfyxkxfyxk

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 40: Kernels in Support Vector Machines - biocomp.unibo.it

Choosing the Kernel Function

Probably the most tricky part of using SVM.

The kernel function is important because it creates the kernel matrix, which summarizes all the data

Many principles have been proposed (diffusion kernel, Fisher kernel, string kernel, …)

There is even research to estimate the kernel matrix from available information

In practice, a low degree polynomial kernel or RBF kernel with a reasonable width is a good initial try

Note that SVM with RBF kernel is closely related to RBF neural networks, with the centers of the radial basis functions automatically chosen for SVM

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 41: Kernels in Support Vector Machines - biocomp.unibo.it

Kernels can be defined also for structures other than vectors

Computational biology often deals with structures different from vectors:

Sequences (DNA, RNA, proteins)

Trees (Phylogenetic relationships)

Graphs (Interaction networks)

3-D structures (proteins)

Is it possible to build kernels for that structures?

Transform data onto a feature space made of n-dimensional real vectors and then compute the scalar product.

Write a kernel without writing explicitly the feature space (but.. What is a kernel?)

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 42: Kernels in Support Vector Machines - biocomp.unibo.it
Page 43: Kernels in Support Vector Machines - biocomp.unibo.it
Page 44: Kernels in Support Vector Machines - biocomp.unibo.it

Defining kernels without defining feature transformationWhat a kernel represent?

Distance in feature space

Page 45: Kernels in Support Vector Machines - biocomp.unibo.it

Defining kernels without defining feature transformationWhat a kernel represent?

Distance in feature space

Kernel is a SIMILARITY measure

Moreover it has to fullfill a «positivity» condition

Page 46: Kernels in Support Vector Machines - biocomp.unibo.it
Page 47: Kernels in Support Vector Machines - biocomp.unibo.it
Page 48: Kernels in Support Vector Machines - biocomp.unibo.it

Spectral kernel for sequences

Given a DNA sequence x we can count the number of bases (4-D feature space)

Or the number of dimers (16-D space)

Or l-mers (4l –D space)

The spectral kernel is

),,,()(1 TGCA nnnnx f

,..),,,,,,,()(2 CTCGCCCAATAGACAA nnnnnnnnx f

) )yxyxk lll ff ),(

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 49: Kernels in Support Vector Machines - biocomp.unibo.it
Page 50: Kernels in Support Vector Machines - biocomp.unibo.it
Page 51: Kernels in Support Vector Machines - biocomp.unibo.it
Page 52: Kernels in Support Vector Machines - biocomp.unibo.it
Page 53: Kernels in Support Vector Machines - biocomp.unibo.it

l is usually lower than 1

Page 54: Kernels in Support Vector Machines - biocomp.unibo.it
Page 55: Kernels in Support Vector Machines - biocomp.unibo.it
Page 56: Kernels in Support Vector Machines - biocomp.unibo.it
Page 57: Kernels in Support Vector Machines - biocomp.unibo.it
Page 58: Kernels in Support Vector Machines - biocomp.unibo.it
Page 59: Kernels in Support Vector Machines - biocomp.unibo.it

Kernel out of generative models

Given a generative model associating a probability p(x|θ) to a given input, we define :

Fisher Kernel

)()(),( ypxpyxK

),(),(),(

),(),(1

),(),(

)|(ln),(

1

1

ygFxgyxK

xgxgN

xgxgEF

xpxg

T

N

i

ii

T

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 60: Kernels in Support Vector Machines - biocomp.unibo.it

Other Aspects of SVM

How to use SVM for multi-class classification?

One can change the QP formulation to become multi-class

More often, multiple binary classifiers are combined

One can train multiple one-versus-all classifiers, or combine multiple pairwise classifiers “intelligently”

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 61: Kernels in Support Vector Machines - biocomp.unibo.it

Other Aspects of SVM

How to interpret the SVM discriminant function value as probability?

By performing logistic regression on the SVM output of a set of data (validation set) that is not used for training

Some SVM software (like libsvm) have these features built-in

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 62: Kernels in Support Vector Machines - biocomp.unibo.it

Software

A list of SVM implementation can be found at http://www.kernel-machines.org/software.html

Some implementation (such as LIBSVM) can handle multi-class classification

SVMLight is among one of the earliest implementation of SVM

Several Matlab toolboxes for SVM are also available

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 63: Kernels in Support Vector Machines - biocomp.unibo.it

Summary: Steps for Classification

Prepare the pattern matrix

Select the kernel function to use

Select the parameter of the kernel function and the value of C

You can use the values suggested by the SVM software, or you can set apart a validation set to determine the values of the parameter

Execute the training algorithm and obtain the ai

Unseen data can be classified using the ai and the support vectors

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 64: Kernels in Support Vector Machines - biocomp.unibo.it

Strengths and Weaknesses of SVM

Strengths

Training is relatively easy

No local optimal, unlike in neural networks

It scales relatively well to high dimensional data

Tradeoff between classifier complexity and error can be controlled explicitly

Non-traditional data like strings and trees can be used as input to SVM, instead of feature vectors

Weaknesses

Need to choose a “good” kernel function.

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna

Page 65: Kernels in Support Vector Machines - biocomp.unibo.it

Other Types of Kernel Methods

A lesson learnt in SVM: a linear algorithm in the feature space is equivalent to a non-linear algorithm in the input space

Standard linear algorithms can be generalized to its non-linear version by going to the feature space

Kernel principal component analysis, kernel independent component analysis, kernel canonical correlation analysis, kernel k-means, 1-class SVM are some examples

Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna