kernel machines - from linear to kernel svm€¦ · road map 1 intuition and motivation 2 kernel...

Kernel MachinesFrom linear to Kernel SVM

Gilles GassoBenoit Gauzere - Stephane Canu

INSA Rouen - ASI DepartmentLaboratory LITIS

November 12, 2019

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 1 / 35

Road map

1 Intuition and Motivation

2 KernelTheoretical framework

3 Kernel functionsKernel on vectorsKernels on generic dataReproducing kernel

4 Non-Linear Kernel MachineSVM for classificationSVM for regression


Intuition and Motivation

Linear SVM

X1

X2

Inputs are vectors of IRd

Linear SVM seeks linear classification function



Limitations

Data are not always vectors: (string, time series, graphs, images . . . )

The decision function can not always be linear (text categorization;email filtering; gene detection; protein image classification;handwriting recognition; prediction of loan defaulting)

Figure: How do you classify these data?



From linear to non-linear decision function

Data might be separable with anon-linear function

Non-linear embedding of x =

(x(1)

x(2)

)R2 → H

x 7→ Φ(x) =

x2(1)

x2(2)√

2x(1)x(2)

Train a linear SVM with samples(Φ(xi ), yi )

Resulting SVM model

f (x) = b +∑i∈SV

αiyiΦ(xi )>Φ(x)

‘Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 5 / 35


Non-linear decision function

Decision function

f (x) = b +∑i∈SV

αiyi Φ(xi )>Φ(x)︸︷︷︸

Kernel k(xi ,x)

Kernel function: the trick

No explicit knowledge of Φ(x)

We only need to define afunction k(·, ·) : X × X → R

Problem linearly non-separable inthe original space X

but linear separable in the space H induced bythe kernel k



Also. . .

How do we classify dataset composed of proteins?

Use the kernel trick: f (protein) = b +∑

i∈SV αiyik(proteini , protein)



Remark

Intuition

By simply modifying the dot product, the algorithm works in another space

Kernel Trick

1 Linear SVM relies on inner product between the SV and the sample topredict

2 → Replaces the inner product between the sample in the ambientspace by a kernel k(·, ·)

3 → Leads to a non-linear version of the SVM



Motivation of Kernel Machine

From

linear techniquesoperating on vector spaces

to

non linear prediction modelsoperating on various, structured, high-dimensional data

Using a:well known mathematical framework

leading to:efficient and powerful algorithm and tools

−→ Kernel method


What is a kernel ?

What can be k ?

Kernel Theoretical framework

Prerequisites

Some definitions and notations

X : non empty input space (IRN , graphs, objects, . . . )

x ∈ X ,

H: feature space endowed with a dot product 〈·, ·〉HΦ : X → H: embedding function from X to H



Kernel

Kernel

A kernel k is a function k : X × X → IR:

k(x, z) = 〈Φ(x),Φ(z)〉H

Which kernel for SVM ? −→ positive definite kernelsWhy ? −→ To ensure a well-defined SVM dual problem



Positive Definite Kernels (1)

Positive definite kernel

A kernel k(x, z) on X × X is said to be positive definite

if it is symmetric: k(x, z) = k(z, x)

and if for any finite positive integer n:

∀αii=1,n ∈ IR,∀xii=1,n ∈ Ω,n∑

i=1

n∑j=1

αiαjk(xi , xj) ≥ 0

it is strictly positive definite if for αi 6= 0n∑

i=1

n∑j=1

αiαjk(xi , xj) > 0



Positive Definite Kernels (2)

Gram Matrix

Given a kernel k : X × X → IR, and samples x1, . . . , xn, the GramMatrix K is a n × n matrix with entries Ki ,j := k(xi , xj)

Another way to characterize positive definite kernel

If for any set of n ∈ IN samples x1, . . . , xn the associated Gram MatrixK ∈ IRn×n is positive definite, then k is a positive definite kernel on X .


Positive definite Kernels

Kernel functions Kernel on vectors

Linear Kernel

k(x, z) = x>z

x, z ∈ IRd

symmetric: x>z = x>z

positive:

n∑i=1

n∑j=1

αiαjk(xi , xj) =n∑

i=1

n∑j=1

αiαjx>i xj

=

(n∑

i=1

αixi

)> n∑j=1

αjxj

=

∥∥∥∥∥n∑

i=1

αixi

∥∥∥∥∥2



Finite kernel

let φj , j = 1, p be a finite dictionary of functions from X to IR (polynomials,wavelets...)

the feature map and linear kernel

feature map:Φ : X → IRp

x 7→ Φ =(φ1(x), ..., φp(x)

)Linear kernel in the feature space:

k(x, z) =(φ1(x), ..., φp(x)

)>(φ1(z), ..., φp(z)

)e.g. the quadratic kernel: x, z ∈ IRd , k(x, z) =

(x>z + b

)2

feature map:

Φ : IRd → IRp=1+d+ d(d+1)2

x 7→ Φ =(1,√

2x1, ...,√

2xj , ...,√

2xd , x21 , ..., x

2j , ..., x

2d , ...,

√2xixj , ...

)Here the xj represent the variables of x ∈ IRd



Closed form kernel: the quadratic kernel

Quadratic kernel k(x, z) =(x>z + 1

)2= 1 + 2x>z +

(x>z

)2

x, z ∈ IRd ,

It computes the dot product of the dictionary

Φ : IRd → IRp=1+d+ d(d+1)2

x 7→ Φ =(1,√

2x1,√

2x2, ...,√

2xd , x21 , x

22 , ..., x

2d , ...,

√2xixj , ...

)

p = 1 + d + d(d+1)2 multiplications vs. d + 1

use kernel to save computation



Gaussian kernel

k(x, z) = exp(−‖x−z‖

2

2σ2

)for σ = 1:

Φ(x) =

exp ‖x‖2

2j√j!

1/j

(j

n1, . . . , nk

)1/2

xn11 . . . xnkk

j=0,...,∞,

∑ki=1 ni=j

Feature space has an infinite dimension

Overlearning

σ controls the influence area of the kernel


Kernel functions Kernels on generic data

Kernels on structures

X may not be a vector space.

we can define kernels on any kind of data :

StringsTime seriesGraphsImages. . .

O

C N

N

CC

CC

C

C

D

N

N

D

CC

C

CC

CkGilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 20 / 35

Kernel functions Kernels on generic data

Positive definite kernels: some common examples

Type Name k(x, z)

radial Gaussian exp(−‖x−z‖

2

2σ2

)radial Laplacian exp(−‖x− z‖/σ)

non stat. χ2 exp(−r/σ), r =∑

k(xk−zk )2

xk+zk

projectif polynomial (x>z + σ)p

projectif cosinus x>z/‖x‖‖z‖projectif correlation exp

(x>z‖x‖‖z‖ − σ

)The kernel may involve hyper-parameter(s) to tune (polynom order p,bandwidth σ)

Their value has to be set by cross-validation


Gram matrices with different bandwidths

raw data Gram matrix for σ = 2

σ = .5 σ = 10

Kernel functions Reproducing kernel

Reproducing Kernel Map

A feature space defined by the kernel.

Feature Map

set IRX := X → IRkernel k

Map from input space to IRX .

Φ : X → IRX

x 7→ k(·, x)

Φ

x x'Φ(x) Φ(x')



Reproducing Kernel Hilbert Space (RKHS)

Hilbert Space

H is Hilbert Space:

pre-Hilbert space of functionsendowed with a inner product 〈·, ·〉H inducing a norm ‖f‖ :=

√〈f, f〉

Hilbert space: generalization of euclidean space to finite or infinitedimension

Reproducing kernel Hilbert space (RKHS)

A Hilbert space H endowed with the inner product 〈·, ·〉H is said to bewith reproducing kernel if it exists a positive definite kernel k such that

∀x ∈ X , k(·, x) ∈ H∀f ∈ H, f (x) = 〈f (·), k(x, ·)〉H

Beware: f = f (·) is a function while f (x) is the real value of f at point x



RKHS (2)

Positive kernel ⇔ RKHS

it defines the inner product and the regularity (smoothness) of H

there exists a ”mapping” function Φ : X → H such that ∀x, z ∈ X the innerproduct 〈Φ(x),Φ(z)〉H = k(x, z)

Examples:

the polynomial kernel of degree p induces a feature of dim = 1 + p + p(p+1)2

the space induced by the gaussian kernel is of infinite dimension.


Kernel Machine

How to exploit the kernel trick to implement non-linear methods ?

Non-Linear Kernel Machine SVM for classification

Non-Linear SVM: formulation

Let k(·, ·) be a positive definite kernel inducing the space H

There exits the mapping function Φ : X → H defined such that∀x, z ∈ X we get 〈Φ(x),Φ(z)〉H = k(x, z)

Non-Linear SVM: general case

Dataset D = (xi , yi ) ∈ Rd × −1, 1ni=1

Problem formulation

minw∈H,b,ξini=1

12‖w‖

2H + C

∑Ni=1 ξi

s.t. yi (〈w, φ(xi )〉H + b) ≥ 1−ξi ∀i = 1, · · · , nξi ≥ 0 ∀i = 1, . . . , n



Non-Linear SVM: the solutionDual problem

maxαi

N∑i=1

αi −1

2

N∑i,j=1

αiαjyiyj 〈Φ(xi ),Φ(xj)〉H︸︷︷︸k(xi ,xj )

s.t. 0 ≤ αi ≤ C , ∀ i = 1, · · · ,NN∑i=1

αiyi = 0

Matrix form

maxα∈RN

−1

2α>Gα + 1

>α

s.t. 0 ≤ α ≤ C 1>,α>y = 0

with G ∈ Rn×n a matrix suchthat Gij = yiyjk(xi , xj)

G is positive definite for a

positive definite kernel k

Classification function

f (x) =n∑

i=1

αiyik(xi , x) + b

Linear SVM = SVM with a linear kernel k(x, z) = x>z



Illustration: kernel mapping to decision function

Classification function:

f (x) =n∑

i=1

αiyik(xi , x) + b

ftp://ftp.cea.fr/pub/unati/people/educhesnay/pystatml/StatisticsMachineLearningPythonDraft.pdf


ftp://ftp.cea.fr/pub/unati/people/educhesnay/pystatml/StatisticsMachineLearningPythonDraft.pdf


Sparsity of the SVM

f (x) =n∑

i=1

αik(x, xi )

D(x) = sign(f (x) + b

)

useless data important data suspicious datawell classified support

α = 0 0 < α < C α = C



Influence of kernel parameter

Gaussian kernel : exp(−‖x−z‖

2

2σ2

)with bandwidth σ

Kernel bandwidth (log. scale)-3 -2 -1 0 1 2

Err

or

0.2

0.4

0.6

< too small

0

0

0

0

0

0

0

0

0 0

nice <

0

0

< too large

0

0

−→ select σ using cross-validation



SVM with non symmetric costs

Useful for imbalanced classification problem

Set different weights C+ and C− according to the label yi = 1 or yi = −1

Primal problem minf∈H,b,ξ∈IRn

12‖f ‖

2H + C+

∑i|yi=1

ξi + C−∑

i|yi=−1

ξi

with yi(f (xi ) + b

)≥ 1− ξi , ξi ≥ 0, i = 1, n

Dual formulation maxα∈IRn

− 12α>Gα + α>1I

with α>y = 0and 0 ≤ αi ≤ C+ or C− i = 1, n


Support vector regression (SVR)

Regression with absolute value error

minf∈H

12‖f ‖

2H

s. t. |f (xi )− yi | ≤ t, i = 1, n

The support vector regression introduce slack variables

(SVR)

minf∈H

12‖f ‖

2H + C

∑|ξi |

such that |f (xi )− yi | ≤ t + ξi 0 ≤ ξi i = 1, n

0 1 2 3 4 5 6 7 8−1

−0.8

−0.6

−0.4

−0.2

0

0.2

0.4

0.6

0.8

1Support Vector Machine Regression

x

y

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5−1.5

−1

−0.5

0

0.5

1

1.5Support Vector Machine Regression

x

y

Non-Linear Kernel Machine SVM for regression

SVM Solvers

LibSVM (Multi-language) :http://www.csie.ntu.edu.tw/~cjlin/libsvm/

ScikitLearn (Python)http://scikit-learn.org/stable/modules/svm.html

PyML http://pyml.sourceforge.net/#

...


http://www.csie.ntu.edu.tw/~cjlin/libsvm/

http://scikit-learn.org/stable/modules/svm.html

http://pyml.sourceforge.net/#

Non-Linear Kernel Machine SVM for regression

Conclusion

What we’ve seen

Kernels corresponds to scalar product in some Hilbert space:

value corresponds to high dimensional scalar product,on non linear embeddingwithout explicit representations of Φ

Can be defined on any kind of data provided we are able to define ameasure of similarity

Applications

Algorithms operating on these functions

Non linear prediction models for classification and regression


kernel machines - from linear to kernel svm€¦ · road map 1 intuition and motivation 2 kernel...

Documents