Kernels in Support Vector Machines
Based on lectures of Martin Law, University of Michigan
Non-linearly separable problems
AND, OR, NOT are linearly separable
XOR
The XOR problem cannot be solved with a perceptron.
Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna
With NNs: multi-layer feed-forward neural networks
Neurons are organized into hierarchical layers
Each layer receives its inputs from the previous one and transmits its output to the next one

z_j^(1) = g( sum_i w_ij^(1) x_i )      (first, hidden layer)
z_j^(2) = g( sum_i w_ij^(2) z_i^(1) )  (second, output layer)
XOR network: weights w1_11 = 0.7, w1_21 = 0.7, w1_12 = 0.3, w1_22 = 0.3 (hidden layer) and w2_11 = 0.7, w2_21 = -0.7 (output layer); all thresholds are 0.5, and z = 1 if a > 0, else z = 0.

x1 = 0, x2 = 0:  a11 = -0.5 -> z1_1 = 0;  a12 = -0.5 -> z1_2 = 0;  a21 = -0.5 -> output 0
x1 = 1, x2 = 0:  a11 = 0.2  -> z1_1 = 1;  a12 = -0.2 -> z1_2 = 0;  a21 = 0.2  -> output 1
x1 = 0, x2 = 1:  a11 = 0.2  -> z1_1 = 1;  a12 = -0.2 -> z1_2 = 0;  a21 = 0.2  -> output 1
x1 = 1, x2 = 1:  a11 = 0.9  -> z1_1 = 1;  a12 = 0.1  -> z1_2 = 1;  a21 = -0.5 -> output 0
The hidden layer REMAPS the input in a new representation that is linearly separable
Input (x1 x2) | Desired output | Activation of hidden neurons (z1 z2)
0 0 | 0 | 0 0
1 0 | 1 | 1 0
0 1 | 1 | 1 0
1 1 | 0 | 1 1
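The worked XOR network above can be reproduced in a few lines; this is a sketch assuming a hard step activation (z = 1 if a > 0) and the weights and thresholds given on the slides.

```python
# XOR with the 2-2-1 feed-forward network of the slides.
# Step activation: z = 1 if a > 0, else 0; all thresholds are 0.5.

def step(a):
    return 1 if a > 0 else 0

def xor_net(x1, x2):
    # Hidden layer (weights w1_11 = w1_21 = 0.7, w1_12 = w1_22 = 0.3)
    z1 = step(0.7 * x1 + 0.7 * x2 - 0.5)
    z2 = step(0.3 * x1 + 0.3 * x2 - 0.5)
    # Output layer (weights w2_11 = 0.7, w2_21 = -0.7)
    return step(0.7 * z1 - 0.7 * z2 - 0.5)

for x in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x, "->", xor_net(*x))  # 0, 1, 1, 0
```

The hidden activations reproduce the remapping table: inputs (1,0) and (0,1) collapse onto the same hidden point (1,0), which makes the problem separable for the output neuron.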
Extension to Non-linear Decision Boundary
So far, we have only considered large-margin classifiers with a linear decision boundary
How can it be generalized to become non-linear?
Key idea: transform xi to a higher-dimensional space to "make life easier"
Input space: the space where the points xi are located
Feature space: the space of f(xi) after transformation
Why transform?
A linear operation in the feature space is equivalent to a non-linear operation in the input space
Classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature x1x2 makes the problem linearly separable
XOR
X Y | XOR
0 0 | 0
0 1 | 1
1 0 | 1
1 1 | 0
is not linearly separable

X Y XY | XOR
0 0 0 | 0
0 1 0 | 1
1 0 0 | 1
1 1 1 | 0
is linearly separable
[Figure: the XOR points plotted in the (X, Y) plane and in the (X, Y, XY) space: the task is to find a feature space in which the classes become linearly separable]
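As a quick check (not from the slides): once the XY feature is added, a single hand-picked linear threshold over (X, Y, XY) computes XOR.

```python
# XOR as a linear threshold over the augmented features (x, y, xy).
# The weights (1, 1, -2) and threshold 0.5 are one hand-picked choice.

def xor_linear(x, y):
    return 1 if (1 * x + 1 * y - 2 * (x * y)) > 0.5 else 0

print([xor_linear(x, y) for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```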
Transforming the Data
Computation in the feature space can be costly because it is high-dimensional
The feature space can even be infinite-dimensional!
The kernel trick comes to the rescue
[Figure: points mapped by f(.) from the input space to the feature space. Note: in practice the feature space is of higher dimension than the input space]
The Kernel Trick
Recall the SVM optimization problem
The data points only appear as scalar products
As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly
Many common geometric operations (angles, distances) can be expressed as inner products
Define the kernel function K by
K(xi, xj) = f(xi) . f(xj)
An Example for f(.) and K(.,.)
Suppose f(.) is given as follows (for 2-dimensional input x = (x1, x2)):
f(x) = (1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2)
An inner product in the feature space is
f(x) . f(y) = (1 + x1 y1 + x2 y2)^2
So, if we define the kernel function as K(x, y) = (1 + x1 y1 + x2 y2)^2, there is no need to carry out f(.) explicitly
This use of the kernel function to avoid carrying out f(.) explicitly is known as the kernel trick
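A minimal numeric check of this identity (the function names are illustrative):

```python
import math

def phi(x):
    # Explicit degree-2 feature map for 2-D input
    x1, x2 = x
    r2 = math.sqrt(2)
    return [1.0, x1 * x1, r2 * x1 * x2, x2 * x2, r2 * x1, r2 * x2]

def K(x, y):
    # Kernel computed directly in the input space
    return (1 + x[0] * y[0] + x[1] * y[1]) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(y)))
print(explicit, K(x, y))  # equal (up to round-off): no need to build phi
```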
Kernels
Given a mapping x -> φ(x), a kernel is represented as the inner product

K(x, y) = sum_i φ_i(x) φ_i(y) = φ(x) . φ(y)

A kernel must satisfy Mercer's condition:

for every g(x) such that ∫ g(x)^2 dx is finite,  ∫∫ K(x, y) g(x) g(y) dx dy >= 0

This is analogous to positive-semidefinite matrices M, for which z^T M z >= 0 for all z ≠ 0
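Mercer's condition can be illustrated numerically in its discrete (matrix) form: the Gram matrix of a valid kernel satisfies z^T G z >= 0 for any z. A sketch with the degree-2 polynomial kernel on an arbitrary point set:

```python
import random

# For a valid kernel, every Gram matrix G[i][j] = K(x_i, x_j) is
# positive semidefinite: z^T G z >= 0 for all z.
def K(x, y):
    return (x * y + 1) ** 2          # degree-2 polynomial kernel, 1-D inputs

pts = [1.0, 2.0, 4.0, 5.0, 6.0]
G = [[K(a, b) for b in pts] for a in pts]

random.seed(0)
for _ in range(1000):
    z = [random.uniform(-1, 1) for _ in pts]
    q = sum(z[i] * G[i][j] * z[j] for i in range(5) for j in range(5))
    assert q >= -1e-9                # never significantly negative
print("all quadratic forms non-negative")
```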
Modification Due to Kernel Function
Change all inner products to kernel functions
For training,
Original:  maximize sum_i ai - 1/2 sum_i sum_j ai aj yi yj (xi . xj), subject to 0 <= ai <= C and sum_i ai yi = 0
With kernel function:  maximize sum_i ai - 1/2 sum_i sum_j ai aj yi yj K(xi, xj), subject to the same constraints
Modification Due to Kernel Function
For testing, the new data z is classified as class 1 if f > 0, and as class 2 if f < 0
Original:  f(z) = sum over support vectors ai yi (xi . z) + b
With kernel function:  f(z) = sum over support vectors ai yi K(xi, z) + b
More on Kernel Functions
Since the training of an SVM only requires the values of K(xi, xj), there is no restriction on the form of xi and xj
xi can be a sequence or a tree, instead of a feature vector
K(xi, xj) is just a similarity measure comparing xi and xj
For a test object z, the discriminant function is essentially a weighted sum of the similarities between z and a pre-selected set of objects (the support vectors)
Example
Suppose we have 5 1D data points
x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2 y1=1, y2=1, y3=-1, y4=-1, y5=1
The points on the line: 1, 2 (class 1); 4, 5 (class 2); 6 (class 1)
Example
We use the polynomial kernel of degree 2:
K(x, y) = (xy + 1)^2
C is set to 100
We first find ai (i = 1, ..., 5) by maximizing
sum_i ai - 1/2 sum_i sum_j ai aj yi yj (xi xj + 1)^2, subject to 0 <= ai <= 100 and sum_i ai yi = 0
Example
By using a QP solver, we get
a1 = 0, a2 = 2.5, a3 = 0, a4 = 7.333, a5 = 4.833
Note that the constraints are indeed satisfied: 0 <= ai <= 100 and 2.5 - 7.333 + 4.833 = 0
The support vectors are {x2 = 2, x4 = 5, x5 = 6}
The discriminant function is
f(z) = 2.5 (2z + 1)^2 - 7.333 (5z + 1)^2 + 4.833 (6z + 1)^2 + b = 0.6667 z^2 - 5.333 z + b
b is recovered by solving f(2) = 1, or f(5) = -1, or f(6) = 1; all three give b = 9
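The worked example can be verified numerically with the ai and b reported above (the ai are rounded to three decimals, so the check uses a small tolerance):

```python
# Support vectors (x_i, y_i, a_i) and bias from the worked example.
sv = [(2, 1, 2.5), (5, -1, 7.333), (6, 1, 4.833)]
b = 9

def f(z):
    # Discriminant: weighted sum of kernel similarities to the SVs
    return sum(a * y * (x * z + 1) ** 2 for x, y, a in sv) + b

for z in [1, 2, 4, 5, 6]:
    print(z, round(f(z), 2))  # positive for 1, 2, 6; negative for 4, 5
```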
Example
[Figure: value of the discriminant function f(z) over the points 1, 2 (class 1), 4, 5 (class 2), 6 (class 1): f(z) > 0 on the class-1 regions, f(z) < 0 on the class-2 region]
Kernel Functions
In practical use of SVM, the user specifies the kernel function; the transformation f(.) is not explicitly stated
Given a kernel function K(xi, xj), the transformation f(.) is given by its eigenfunctions (a concept in functional analysis)
Eigenfunctions can be difficult to construct explicitly
This is why people only specify the kernel function, without worrying about the exact transformation
Another view: the kernel function, being a scalar product, is really a similarity measure between the objects
A kernel is associated to a transformation
Given a kernel, in principle one should be able to recover the transformation to the feature space that originates it.
K(x, y) = (xy + 1)^2 = x^2 y^2 + 2xy + 1
If x and y are numbers, it corresponds to the transformation
f(x) = (x^2, sqrt(2) x, 1)^T
What if x and y are 2-dimensional vectors?
A kernel is associated to a transformation
K(xi, xj) = (1 + xi1 xj1 + xi2 xj2)^2 =
= 1 + xi1^2 xj1^2 + 2 xi1 xj1 xi2 xj2 + xi2^2 xj2^2 + 2 xi1 xj1 + 2 xi2 xj2
which is the inner product in the feature space given by
f(xi) = (1, xi1^2, sqrt(2) xi1 xi2, xi2^2, sqrt(2) xi1, sqrt(2) xi2)^T
XOR
Simple example (XOR problem)
L(α) = sum_{i=1}^N αi - 1/2 sum_{i=1}^N sum_{j=1}^N αi αj yi yj K(xi, xj)

Input vector | Y
[-1, -1] | -1
[-1, +1] | +1
[+1, -1] | +1
[+1, +1] | -1

K(xi, xj) = (1 + xi^T xj)^2
The kernel matrix:
         (-1,-1) (-1,+1) (+1,-1) (+1,+1)
(-1,-1)     9       1       1       1
(-1,+1)     1       9       1       1
(+1,-1)     1       1       9       1
(+1,+1)     1       1       1       9
L(α) = α1 + α2 + α3 + α4 - 1/2 (9α1^2 - 2α1α2 - 2α1α3 + 2α1α4 + 9α2^2 + 2α2α3 - 2α2α4 + 9α3^2 - 2α3α4 + 9α4^2)

Setting the derivatives dL/dαi to zero:
9α1 - α2 - α3 + α4 = 1
-α1 + 9α2 + α3 - α4 = 1
-α1 + α2 + 9α3 - α4 = 1
α1 - α2 - α3 + 9α4 = 1

whose solution is α1 = α2 = α3 = α4 = 1/8, with L = 1/4

The four input vectors are all support vectors:
w = sum_{i=1}^N αi yi f(xi) = [0, 0, -1/sqrt(2), 0, 0, 0]^T
XOR
w = sum_{i=1}^N αi yi f(xi) = [0, 0, -1/sqrt(2), 0, 0, 0]^T
with f(xi) = (1, xi1^2, sqrt(2) xi1 xi2, xi2^2, sqrt(2) xi1, sqrt(2) xi2)^T

The only non-zero component of w multiplies sqrt(2) x1 x2, so the discriminant is w . f(x) = -x1 x2, which reproduces the labels:

Input vector | Y | x1 x2
[-1, -1] | -1 | +1
[-1, +1] | +1 | -1
[+1, -1] | +1 | -1
[+1, +1] | -1 | +1
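A sketch verifying this solution numerically, assuming the feature-map ordering f(x) = (1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2):

```python
import math

# XOR training set, kernel, and the dual solution a_i = 1/8.
X = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
Y = [-1, 1, 1, -1]
a = [1.0 / 8] * 4
K = lambda u, v: (1 + u[0] * v[0] + u[1] * v[1]) ** 2

# Stationarity of the dual: sum_j a_j y_i y_j K(x_i, x_j) = 1 for each i
for i in range(4):
    print(sum(a[j] * Y[i] * Y[j] * K(X[i], X[j]) for j in range(4)))  # 1.0

def phi(x):
    x1, x2 = x
    r2 = math.sqrt(2)
    return [1, x1 * x1, r2 * x1 * x2, x2 * x2, r2 * x1, r2 * x2]

# w = sum_i a_i y_i phi(x_i): only the sqrt(2) x1 x2 component survives
w = [sum(a[i] * Y[i] * phi(X[i])[d] for i in range(4)) for d in range(6)]
print([round(v, 4) for v in w])  # [0.0, 0.0, -0.7071, 0.0, 0.0, 0.0]
```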
Examples of Kernel Functions
Polynomial kernel of degree d:  K(x, y) = (x . y)^d
Polynomial kernel up to degree d:  K(x, y) = (x . y + 1)^d
Radial basis function kernel with width s:  K(x, y) = exp(-||x - y||^2 / (2 s^2))
Sigmoid with parameters k and θ:  K(x, y) = tanh(k (x . y) + θ)
It does not satisfy the Mercer condition for all k and θ
Pier Luigi Martelli - Systems and In Silico Biology 2015-2016- University of Bologna
Polynomial kernel
Bishop C, Pattern recognition and Machine Learning, Springer
Source: http://www.ivanciuc.org/
Examples of Kernel Functions
Radial basis function (or Gaussian) kernel with width s:
K(x, y) = exp(-||x - y||^2 / (2 s^2)) = exp(-||x||^2 / (2 s^2)) exp(x . y / s^2) exp(-||y||^2 / (2 s^2))
Examples of Kernel Functions
With 1-dim vectors:
K(x, y) = exp(-(x - y)^2 / (2 s^2)) = exp(-x^2 / (2 s^2)) exp(x y / s^2) exp(-y^2 / (2 s^2))
It corresponds to the scalar product in the infinite-dimensional feature space:
f(x) = exp(-x^2 / (2 s^2)) (1, x/s, x^2 / (s^2 sqrt(2!)), x^3 / (s^3 sqrt(3!)), ..., x^n / (s^n sqrt(n!)), ...)^T
For vectors in m dimensions the feature space is more complicated
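The infinite expansion can be checked by truncation: with enough terms, the explicit inner product matches exp(-(x - y)^2 / (2 s^2)) to machine precision. A sketch (function names illustrative):

```python
import math

def phi(x, s=1.0, terms=30):
    # Truncated feature map: exp(-x^2/(2 s^2)) * x^n / (s^n sqrt(n!))
    pref = math.exp(-x * x / (2 * s * s))
    return [pref * x ** n / (s ** n * math.sqrt(math.factorial(n)))
            for n in range(terms)]

def rbf(x, y, s=1.0):
    return math.exp(-((x - y) ** 2) / (2 * s * s))

x, y = 0.7, -0.4
approx = sum(a * b for a, b in zip(phi(x), phi(y)))
print(approx, rbf(x, y))  # agree to machine precision
```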
Without slack variables
Bishop C, Pattern recognition and Machine Learning, Springer
With slack variables
Bishop C, Pattern recognition and Machine Learning, Springer
Gaussian RBF kernel
Bishop C, Pattern recognition and Machine Learning, Springer
Building new kernels
If k1(x, y) and k2(x, y) are two valid kernels, then the following kernels are valid:
Linear combination (c1, c2 >= 0):  k(x, y) = c1 k1(x, y) + c2 k2(x, y)
Exponential:  k(x, y) = exp(k1(x, y))
Product:  k(x, y) = k1(x, y) k2(x, y)
Polynomial transformation (Q: polynomial with non-negative coefficients):  k(x, y) = Q(k1(x, y))
Function product (f: any function):  k(x, y) = f(x) k1(x, y) f(y)
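These closure rules can be spot-checked numerically: build each combined kernel from two simple valid kernels and verify that random quadratic forms over its Gram matrix stay non-negative. A sketch with arbitrary sample points:

```python
import math, random

# Combine two valid base kernels by the closure rules and check that
# each resulting Gram matrix stays positive semidefinite (z^T G z >= 0).
k1 = lambda x, y: x * y                    # linear kernel, 1-D inputs
k2 = lambda x, y: (x * y + 1) ** 2         # polynomial kernel

combos = {
    "linear combination": lambda x, y: 2 * k1(x, y) + 3 * k2(x, y),
    "exponential":        lambda x, y: math.exp(k1(x, y)),
    "product":            lambda x, y: k1(x, y) * k2(x, y),
    "function product":   lambda x, y: math.sin(x) * k1(x, y) * math.sin(y),
}
pts = [-2.0, -0.5, 0.3, 1.0, 2.5]
random.seed(1)
for name, k in combos.items():
    G = [[k(a, b) for b in pts] for a in pts]
    ok = all(
        sum(z[i] * G[i][j] * z[j] for i in range(5) for j in range(5)) >= -1e-9
        for z in ([random.uniform(-1, 1) for _ in pts] for _ in range(500))
    )
    print(name, ok)
```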
Choosing the Kernel Function
Probably the most tricky part of using SVM.
The kernel function is important because it creates the kernel matrix, which summarizes all the data
Many principles have been proposed (diffusion kernel, Fisher kernel, string kernel, …)
There is even research to estimate the kernel matrix from available information
In practice, a low degree polynomial kernel or RBF kernel with a reasonable width is a good initial try
Note that SVM with RBF kernel is closely related to RBF neural networks, with the centers of the radial basis functions automatically chosen for SVM
Kernels can be defined also for structures other than vectors
Computational biology often deals with structures different from vectors:
Sequences (DNA, RNA, proteins)
Trees (Phylogenetic relationships)
Graphs (Interaction networks)
3-D structures (proteins)
Is it possible to build kernels for such structures?
Either transform the data into a feature space of n-dimensional real vectors and then compute the scalar product,
or write a kernel without writing the feature space explicitly (but... what is a kernel?)
Defining kernels without defining the feature transformation: what does a kernel represent?
A distance in feature space
A kernel is a SIMILARITY measure
Moreover, it has to fulfill a "positivity" condition (Mercer's condition)
Spectral kernel for sequences
Given a DNA sequence x we can count the number of bases (4-D feature space):
f1(x) = (nA, nC, nG, nT)
or the number of dimers (16-D space):
f2(x) = (nAA, nAC, nAG, nAT, nCA, nCC, nCG, nCT, ...)
or, in general, of l-mers (4^l-D space)
The spectral kernel is
kl(x, y) = fl(x) . fl(y)
In practice l is usually larger than 1
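A minimal implementation of the spectral kernel (function names illustrative), counting l-mers with a hash map instead of building the full 4^l-dimensional vector:

```python
from collections import Counter

def lmer_counts(seq, l):
    # Count every length-l substring (l-mer) of the sequence
    return Counter(seq[i:i + l] for i in range(len(seq) - l + 1))

def spectral_kernel(x, y, l=2):
    cx, cy = lmer_counts(x, l), lmer_counts(y, l)
    # Inner product of the two count vectors; only shared l-mers contribute
    return sum(cx[w] * cy[w] for w in cx)

print(spectral_kernel("ACGTACGT", "ACGTACGT"))  # 13
print(spectral_kernel("ACGTACGT", "TTTTTTTT"))  # 0 (no dimer in common)
```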
Kernel out of generative models
Given a generative model associating a probability p(x|θ) to a given input, we can define:
K(x, y) = p(x|θ) p(y|θ)

Fisher Kernel:
g(θ, x) = ∇_θ ln p(x|θ)
F = E[ g(θ, x) g(θ, x)^T ] ≈ (1/N) sum_{i=1}^N g(θ, xi) g(θ, xi)^T
K(x, y) = g(θ, x)^T F^(-1) g(θ, y)
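A toy sketch of the Fisher kernel for a 1-D Gaussian p(x|μ) with fixed unit variance, where the score reduces to g(μ, x) = x - μ; the value of μ and the sample used to estimate F are made up for illustration:

```python
mu = 0.5                               # model parameter (illustrative)
data = [0.1, 0.4, 0.9, 1.2, -0.3]      # sample used to estimate F

def score(x):
    # g(mu, x) = d/dmu ln p(x | mu) for a unit-variance Gaussian
    return x - mu

# Empirical Fisher information: F = (1/N) sum_i g(x_i)^2  (a scalar here)
F = sum(score(x) ** 2 for x in data) / len(data)

def fisher_kernel(x, y):
    return score(x) * (1.0 / F) * score(y)

print(round(fisher_kernel(1.0, 2.0), 3))
```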
Other Aspects of SVM
How to use SVM for multi-class classification?
One can change the QP formulation to become multi-class
More often, multiple binary classifiers are combined
One can train multiple one-versus-all classifiers, or combine multiple pairwise classifiers “intelligently”
Other Aspects of SVM
How to interpret the SVM discriminant function value as probability?
By performing logistic regression on the SVM output of a set of data (validation set) that is not used for training
Some SVM software (like libsvm) have these features built-in
Software
A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
Some implementations (such as LIBSVM) can handle multi-class classification
SVMLight is one of the earliest implementations of SVM
Several Matlab toolboxes for SVM are also available
Summary: Steps for Classification
Prepare the pattern matrix
Select the kernel function to use
Select the parameters of the kernel function and the value of C
You can use the values suggested by the SVM software, or you can set apart a validation set to determine the values of the parameters
Execute the training algorithm and obtain the ai
Unseen data can be classified using the ai and the support vectors
Strengths and Weaknesses of SVM
Strengths
Training is relatively easy
No local optima, unlike in neural networks
It scales relatively well to high dimensional data
Tradeoff between classifier complexity and error can be controlled explicitly
Non-traditional data like strings and trees can be used as input to SVM, instead of feature vectors
Weaknesses
Need to choose a “good” kernel function.
Other Types of Kernel Methods
A lesson learnt in SVM: a linear algorithm in the feature space is equivalent to a non-linear algorithm in the input space
Standard linear algorithms can be generalized to their non-linear versions by going to the feature space
Kernel principal component analysis, kernel independent component analysis, kernel canonical correlation analysis, kernel k-means, 1-class SVM are some examples