Kernel Machines: From Linear to Kernel SVM
Gilles Gasso, Benoit Gauzere, Stephane Canu
INSA Rouen - ASI Department, Laboratory LITIS
November 12, 2019
Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 1 / 35
Road map
1 Intuition and Motivation
2 Kernel: Theoretical framework
3 Kernel functions: Kernel on vectors; Kernels on generic data; Reproducing kernel
4 Non-Linear Kernel Machine: SVM for classification; SVM for regression
Intuition and Motivation
Linear SVM
[Figure: linearly separable data in the (X1, X2) plane]
Inputs are vectors of IR^d
Linear SVM seeks a linear classification function
Limitations
Data are not always vectors (strings, time series, graphs, images, ...)
The decision function cannot always be linear (text categorization; email filtering; gene detection; protein image classification; handwriting recognition; prediction of loan defaulting)
Figure: How do you classify these data?
From linear to non-linear decision function
Data might be separable with a non-linear function

Non-linear embedding of x = (x(1), x(2))ᵀ:

Φ : IR² → H
x ↦ Φ(x) = ( x(1)², x(2)², √2 x(1) x(2) )ᵀ

Train a linear SVM with the samples (Φ(xi), yi)

Resulting SVM model:

f(x) = b + ∑_{i∈SV} αi yi Φ(xi)ᵀ Φ(x)
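This embedding can be checked numerically. A minimal sketch in plain NumPy (toy points of our choosing), verifying that the explicit map Φ reproduces the squared dot product (xᵀz)²:

```python
import numpy as np

def phi(x):
    """Explicit non-linear embedding of x = (x(1), x(2)) into R^3."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = phi(x) @ phi(z)   # inner product in the feature space H
rhs = (x @ z) ** 2      # squared dot product in the input space

assert np.isclose(lhs, rhs)
```

This is exactly the identity the kernel trick exploits: the right-hand side never builds Φ(x) explicitly.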
Non-linear decision function
Decision function
f(x) = b + ∑_{i∈SV} αi yi Φ(xi)ᵀ Φ(x)

where the inner product Φ(xi)ᵀ Φ(x) is the kernel k(xi, x)
Kernel function: the trick
No explicit knowledge of Φ(x) is needed
We only need to define a function k(·, ·) : X × X → IR
The problem is linearly non-separable in the original space X but linearly separable in the space H induced by the kernel k
Also. . .
How do we classify a dataset composed of proteins?
Use the kernel trick: f(protein) = b + ∑_{i∈SV} αi yi k(proteini, protein)
Remark
Intuition
By simply modifying the dot product, the algorithm works in another space
Kernel Trick
1 Linear SVM relies on the inner product between the support vectors and the sample to predict
2 → Replace the inner product between samples in the ambient space by a kernel k(·, ·)
3 → This leads to a non-linear version of the SVM
Motivation of Kernel Machine
From
linear techniques operating on vector spaces
to
non-linear prediction models operating on various, structured, high-dimensional data
Using a well-known mathematical framework
leading to efficient and powerful algorithms and tools
−→ Kernel methods
What is a kernel?
What can k be?
Kernel: Theoretical framework
Prerequisites
Some definitions and notations
X : non-empty input space (IR^N, graphs, objects, ...)
x ∈ X
H : feature space endowed with a dot product 〈·, ·〉_H
Φ : X → H : embedding function from X to H
Kernel
Kernel
A kernel k is a function k : X × X → IR:
k(x, z) = 〈Φ(x),Φ(z)〉H
Which kernels for the SVM? −→ positive definite kernels
Why? −→ To ensure a well-defined SVM dual problem
Positive Definite Kernels (1)
Positive definite kernel
A kernel k(x, z) on X × X is said to be positive definite
if it is symmetric: k(x, z) = k(z, x)
and if for any finite positive integer n:

∀ α1, ..., αn ∈ IR, ∀ x1, ..., xn ∈ X :   ∑_{i=1}^n ∑_{j=1}^n αi αj k(xi, xj) ≥ 0

It is strictly positive definite if, whenever the αi are not all zero,

∑_{i=1}^n ∑_{j=1}^n αi αj k(xi, xj) > 0
Positive Definite Kernels (2)
Gram Matrix
Given a kernel k : X × X → IR and samples x1, ..., xn, the Gram matrix K is the n × n matrix with entries K_{ij} := k(xi, xj)

Another way to characterize positive definite kernels
If for any set of n ∈ IN samples x1, ..., xn the associated Gram matrix K ∈ IR^{n×n} is positive definite, then k is a positive definite kernel on X.
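As an illustration, this characterization can be checked numerically for the linear kernel. A small NumPy sketch on random toy data (not from the slides): a symmetric Gram matrix whose eigenvalues are all non-negative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # 20 toy samples in R^3

# Gram matrix of the linear kernel k(x, z) = x^T z
K = X @ X.T

# Symmetric matrix -> real eigenvalues; all must be >= 0 (up to round-off)
eigvals = np.linalg.eigvalsh(K)
assert np.all(eigvals > -1e-8)
```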
Positive definite Kernels
Kernel functions Kernel on vectors
Linear Kernel
k(x, z) = xᵀz,   x, z ∈ IR^d

symmetric: xᵀz = zᵀx

positive:

∑_{i=1}^n ∑_{j=1}^n αi αj k(xi, xj) = ∑_{i=1}^n ∑_{j=1}^n αi αj xiᵀ xj
                                    = ( ∑_{i=1}^n αi xi )ᵀ ( ∑_{j=1}^n αj xj )
                                    = ‖ ∑_{i=1}^n αi xi ‖² ≥ 0
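The chain of equalities above can be verified numerically. A quick NumPy sketch with random coefficients αi and samples xi (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 4
X = rng.normal(size=(n, d))      # toy samples x_1, ..., x_n
alpha = rng.normal(size=n)       # arbitrary coefficients

# Double sum  sum_i sum_j alpha_i alpha_j x_i^T x_j ...
quad = sum(alpha[i] * alpha[j] * (X[i] @ X[j])
           for i in range(n) for j in range(n))
# ... equals the squared norm of sum_i alpha_i x_i, hence is >= 0
norm_sq = np.linalg.norm(alpha @ X) ** 2

assert np.isclose(quad, norm_sq) and quad >= 0
```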
Finite kernel
Let φ_j, j = 1, ..., p be a finite dictionary of functions from X to IR (polynomials, wavelets, ...)

The feature map and linear kernel

feature map:
Φ : X → IR^p
x ↦ Φ(x) = ( φ1(x), ..., φp(x) )ᵀ

Linear kernel in the feature space:
k(x, z) = ( φ1(x), ..., φp(x) )ᵀ ( φ1(z), ..., φp(z) )

e.g. the quadratic kernel: x, z ∈ IR^d, k(x, z) = ( xᵀz + b )²

feature map:
Φ : IR^d → IR^p with p = 1 + d + d(d+1)/2
x ↦ Φ(x) = ( 1, √2 x1, ..., √2 xj, ..., √2 xd, x1², ..., xj², ..., xd², ..., √2 xi xj, ... )ᵀ

Here the xj denote the components of x ∈ IR^d
Closed form kernel: the quadratic kernel
Quadratic kernel: x, z ∈ IR^d,

k(x, z) = ( xᵀz + 1 )² = 1 + 2 xᵀz + ( xᵀz )²

It computes the dot product of the dictionary

Φ : IR^d → IR^p with p = 1 + d + d(d+1)/2
x ↦ Φ(x) = ( 1, √2 x1, √2 x2, ..., √2 xd, x1², x2², ..., xd², ..., √2 xi xj, ... )ᵀ

p = 1 + d + d(d+1)/2 multiplications with the explicit map vs. d + 1 with the kernel
→ use the kernel to save computation
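A small sketch contrasting the explicit feature map with the kernel evaluation; the helper `phi_quad` and the test points are ours, not from the slides:

```python
import numpy as np
from itertools import combinations

def phi_quad(x):
    """Explicit feature map of the quadratic kernel (x^T z + 1)^2."""
    d = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(d), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
d = len(x)

# Kernel evaluation (d + 1 multiplications) matches the explicit map
assert np.isclose(phi_quad(x) @ phi_quad(z), (x @ z + 1) ** 2)
# Feature dimension p = 1 + d + d(d+1)/2
assert len(phi_quad(x)) == 1 + d + d * (d + 1) // 2
```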
Gaussian kernel
k(x, z) = exp( −‖x − z‖² / (2σ²) )

For σ = 1 the feature map has one component per degree j and per multi-index (n1, ..., nk) with n1 + ⋯ + nk = j:

Φ(x) = ( e^{−‖x‖²/2} (j!)^{−1/2} ( j; n1, ..., nk )^{1/2} x1^{n1} ⋯ xk^{nk} ),  j = 0, ..., ∞,  ∑_{i=1}^k ni = j

where ( j; n1, ..., nk ) denotes the multinomial coefficient

Feature space has an infinite dimension
Risk of overfitting
σ controls the influence area of the kernel
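A minimal sketch of the Gaussian kernel and of how σ sets its influence area (toy points of our choosing):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2.0 * sigma ** 2))

x = np.array([0.0, 0.0])
z = np.array([3.0, 4.0])   # ||x - z|| = 5

assert np.isclose(gaussian_kernel(x, x), 1.0)     # any point vs itself
assert gaussian_kernel(x, z, sigma=0.5) < 1e-5    # small sigma: z "invisible"
assert gaussian_kernel(x, z, sigma=100.0) > 0.99  # large sigma: z ~ identical
```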
Kernel functions Kernels on generic data
Kernels on structures
X may not be a vector space.
We can define kernels on any kind of data:
strings, time series, graphs, images, ...
[Figure: two molecular graphs compared by a graph kernel k]
Positive definite kernels: some common examples
Type       | Name        | k(x, z)
radial     | Gaussian    | exp( −‖x − z‖² / (2σ²) )
radial     | Laplacian   | exp( −‖x − z‖ / σ )
non stat.  | χ²          | exp( −r/σ ), r = ∑_k (x_k − z_k)² / (x_k + z_k)
projective | polynomial  | ( xᵀz + σ )^p
projective | cosine      | xᵀz / ( ‖x‖ ‖z‖ )
projective | correlation | exp( xᵀz / (‖x‖ ‖z‖) − σ )

The kernel may involve hyper-parameter(s) to tune (polynomial order p, bandwidth σ)
Their values have to be set by cross-validation
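A few of the table's kernels written out as plain NumPy functions (a sketch; function names and test points are ours):

```python
import numpy as np

def laplacian(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) / sigma)

def polynomial(x, z, sigma=1.0, p=2):
    return (x @ z + sigma) ** p

def cosine(x, z):
    return (x @ z) / (np.linalg.norm(x) * np.linalg.norm(z))

x = np.array([1.0, 0.0])
z = np.array([1.0, 1.0])

assert np.isclose(cosine(x, x), 1.0)                      # a vector vs itself
assert np.isclose(polynomial(x, z, sigma=1.0, p=2), 4.0)  # (1 + 1)^2
assert 0.0 < laplacian(x, z) < 1.0
```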
Gram matrices with different bandwidths
[Figure: raw data and Gaussian Gram matrices for σ = 0.5, σ = 2 and σ = 10]
Kernel functions: Reproducing kernel
Reproducing Kernel Map
A feature space defined by the kernel.
Feature Map
Let IR^X := { f : X → IR } and let k be a kernel.
Map from the input space to IR^X:

Φ : X → IR^X
x ↦ Φ(x) = k(·, x)

[Figure: x, x′ ∈ X mapped to Φ(x), Φ(x′) ∈ IR^X]
Reproducing Kernel Hilbert Space (RKHS)
Hilbert Space
H is a Hilbert space:
a pre-Hilbert space of functions endowed with an inner product 〈·, ·〉_H inducing the norm ‖f‖ := √〈f, f〉_H
Hilbert space: generalization of Euclidean space to finite or infinite dimension

Reproducing kernel Hilbert space (RKHS)
A Hilbert space H endowed with the inner product 〈·, ·〉_H is said to have a reproducing kernel if there exists a positive definite kernel k such that
∀ x ∈ X , k(·, x) ∈ H
∀ f ∈ H, f(x) = 〈f(·), k(x, ·)〉_H
Beware: f = f(·) is a function while f(x) is the real value of f at point x
RKHS (2)
Positive definite kernel ⇔ RKHS

it defines the inner product and the regularity (smoothness) of H
there exists a "mapping" function Φ : X → H such that ∀ x, z ∈ X the inner product 〈Φ(x), Φ(z)〉_H = k(x, z)

Examples:
the quadratic kernel on IR^d induces a feature space of dimension 1 + d + d(d+1)/2
the space induced by the Gaussian kernel is of infinite dimension
Kernel Machine
How to exploit the kernel trick to implement non-linear methods?
Non-Linear Kernel Machine: SVM for classification
Non-Linear SVM: formulation
Let k(·, ·) be a positive definite kernel inducing the space H
There exists a mapping function Φ : X → H such that ∀ x, z ∈ X we get 〈Φ(x), Φ(z)〉_H = k(x, z)

Non-Linear SVM: general case
Dataset D = { (xi, yi) ∈ IR^d × {−1, 1} }, i = 1, ..., n

Problem formulation

min_{w∈H, b, ξ}   (1/2) ‖w‖²_H + C ∑_{i=1}^n ξi
s.t.   yi ( 〈w, Φ(xi)〉_H + b ) ≥ 1 − ξi,   ∀ i = 1, ..., n
       ξi ≥ 0,   ∀ i = 1, ..., n
Non-Linear SVM: the solution

Dual problem

max_α   ∑_{i=1}^n αi − (1/2) ∑_{i,j=1}^n αi αj yi yj k(xi, xj)
s.t.    0 ≤ αi ≤ C,  ∀ i = 1, ..., n
        ∑_{i=1}^n αi yi = 0

Matrix form

max_{α∈IR^n}   −(1/2) αᵀGα + 1ᵀα
s.t.   0 ≤ α ≤ C1,   αᵀy = 0

with G ∈ IR^{n×n} the matrix with entries G_{ij} = yi yj k(xi, xj)
G is positive definite for a positive definite kernel k

Classification function

f(x) = ∑_{i=1}^n αi yi k(xi, x) + b

Linear SVM = SVM with a linear kernel k(x, z) = xᵀz
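With scikit-learn (one of the solvers listed at the end of this deck), the non-linear SVM is a one-liner. A sketch on toy concentric circles, where the RBF kernel succeeds and a linear kernel cannot:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the input space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Non-linear SVM with the Gaussian (RBF) kernel
clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)
# Same problem with a linear kernel
lin = SVC(kernel="linear", C=1.0).fit(X, y)

assert clf.score(X, y) > 0.9              # RBF separates the circles
assert lin.score(X, y) < clf.score(X, y)  # linear kernel does much worse
print(clf.n_support_)                     # support vectors per class
```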
Illustration: kernel mapping to decision function
Classification function:
f(x) = ∑_{i=1}^n αi yi k(xi, x) + b
ftp://ftp.cea.fr/pub/unati/people/educhesnay/pystatml/StatisticsMachineLearningPythonDraft.pdf
Sparsity of the SVM
f(x) = ∑_{i=1}^n αi k(x, xi),   D(x) = sign( f(x) + b )

α = 0     : useless data (well classified)
0 < α < C : important data (support vectors)
α = C     : suspicious data
Influence of kernel parameter
Gaussian kernel exp( −‖x − z‖² / (2σ²) ) with bandwidth σ

[Figure: error vs. kernel bandwidth (log scale) — too small a σ overfits, too large a σ underfits, a "nice" σ lies in between]
−→ select σ using cross-validation
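A sketch of this selection with scikit-learn's GridSearchCV. Note that scikit-learn parameterizes the Gaussian kernel by gamma = 1/(2σ²), so the grid below is over gamma rather than σ (toy data, illustrative grid):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# scikit-learn parameterizes the Gaussian kernel by gamma = 1 / (2 sigma^2)
grid = GridSearchCV(SVC(kernel="rbf", C=1.0),
                    param_grid={"gamma": np.logspace(-3, 2, 6)},
                    cv=5)
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)
assert grid.best_score_ > 0.8   # a well-chosen bandwidth does well here
```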
SVM with non symmetric costs
Useful for imbalanced classification problem
Set different weights C+ and C− according to the label yi = 1 or yi = −1
Primal problem

min_{f∈H, b, ξ∈IR^n}   (1/2) ‖f‖²_H + C+ ∑_{i | yi=1} ξi + C− ∑_{i | yi=−1} ξi
with   yi ( f(xi) + b ) ≥ 1 − ξi,   ξi ≥ 0,   i = 1, ..., n

Dual formulation

max_{α∈IR^n}   −(1/2) αᵀGα + αᵀ1
with   αᵀy = 0   and   0 ≤ αi ≤ C+ (if yi = 1) or C− (if yi = −1),   i = 1, ..., n
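In scikit-learn the weights C+ and C− are obtained through the `class_weight` argument, which rescales C per class. A sketch on an imbalanced toy problem (the weights are chosen for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.svm import SVC

# Imbalanced toy problem: ~90% class 0, ~10% class 1
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# class_weight rescales C per class: here C_+ = 9 C for the minority class
weighted = SVC(kernel="rbf", C=1.0, class_weight={0: 1.0, 1: 9.0}).fit(X, y)
plain = SVC(kernel="rbf", C=1.0).fit(X, y)

# The weighted SVM should recover at least as many positives
r_w = recall_score(y, weighted.predict(X))
r_p = recall_score(y, plain.predict(X))
assert r_w >= r_p
```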
Support vector regression (SVR)
Regression with absolute value error

min_{f∈H}   (1/2) ‖f‖²_H
s.t.   |f(xi) − yi| ≤ t,   i = 1, ..., n

Support vector regression introduces slack variables:

(SVR)   min_{f∈H}   (1/2) ‖f‖²_H + C ∑_i ξi
such that   |f(xi) − yi| ≤ t + ξi,   ξi ≥ 0,   i = 1, ..., n
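A sketch of SVR with scikit-learn, whose `epsilon` parameter plays the role of the tube width t above (toy sinusoidal data of our choosing):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 8.0, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=80)

# epsilon plays the role of the tube width t in the formulation above
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)

# Only points outside (or on the border of) the tube become support vectors
assert len(svr.support_) < len(X)
assert svr.score(X, y) > 0.8
```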
[Figures: support vector machine regression fits on 1-D toy data, with the ε-insensitive tube around the prediction]
Non-Linear Kernel Machine: SVM for regression
SVM Solvers
LibSVM (multi-language): http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Scikit-learn (Python): http://scikit-learn.org/stable/modules/svm.html
PyML: http://pyml.sourceforge.net/
...
Conclusion
What we’ve seen
Kernels correspond to a scalar product in some Hilbert space:
the kernel value corresponds to a high-dimensional scalar product,
on a non-linear embedding,
without any explicit representation of Φ
Kernels can be defined on any kind of data, provided we are able to define a measure of similarity

Applications
Algorithms operating on these functions
Non-linear prediction models for classification and regression