kernel machines - from linear to kernel svm€¦ · road map 1 intuition and motivation 2 kernel...

35
Kernel Machines From linear to Kernel SVM Gilles Gasso Benoit Ga¨ uz` ere - St´ ephane Canu INSA Rouen - ASI Department Laboratory LITIS November 12, 2019 Gilles Gasso Benoit Ga¨ uz` ere - St´ ephane Canu Kernel Machines 1 / 35

Upload: others

Post on 04-May-2020

15 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Kernel MachinesFrom linear to Kernel SVM

Gilles GassoBenoit Gauzere - Stephane Canu

INSA Rouen - ASI DepartmentLaboratory LITIS

November 12, 2019

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 1 / 35

Page 2: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Road map

1 Intuition and Motivation

2 KernelTheoretical framework

3 Kernel functionsKernel on vectorsKernels on generic dataReproducing kernel

4 Non-Linear Kernel MachineSVM for classificationSVM for regression

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 2 / 35

Page 3: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Intuition and Motivation

Linear SVM

X1

X2

Inputs are vectors of IRd

Linear SVM seeks linear classification function

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 3 / 35

Page 4: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Intuition and Motivation

Limitations

Data are not always vectors: (string, time series, graphs, images . . . )

The decision function can not always be linear (text categorization;email filtering; gene detection; protein image classification;handwriting recognition; prediction of loan defaulting)

Figure: How do you classify these data?

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 4 / 35

Page 5: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Intuition and Motivation

From linear to non-linear decision function

Data might be separable with anon-linear function

Non-linear embedding of x =

(x(1)

x(2)

)R2 → H

x 7→ Φ(x) =

x2(1)

x2(2)√

2x(1)x(2)

Train a linear SVM with samples(Φ(xi ), yi )

Resulting SVM model

f (x) = b +∑i∈SV

αiyiΦ(xi )>Φ(x)

‘Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 5 / 35

Page 6: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Intuition and Motivation

Non-linear decision function

Decision function

f (x) = b +∑i∈SV

αiyi Φ(xi )>Φ(x)︸ ︷︷ ︸

Kernel k(xi ,x)

Kernel function: the trick

No explicit knowledge of Φ(x)

We only need to define afunction k(·, ·) : X × X → R

Problem linearly non-separable inthe original space X

but linear separable in the space H induced bythe kernel k

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 6 / 35

Page 7: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Intuition and Motivation

Also. . .

How do we classify dataset composed of proteins?

Use the kernel trick: f (protein) = b +∑

i∈SV αiyik(proteini , protein)

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 7 / 35

Page 8: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Intuition and Motivation

Remark

Intuition

By simply modifying the dot product, the algorithm works in another space

Kernel Trick

1 Linear SVM relies on inner product between the SV and the sample topredict

2 → Replaces the inner product between the sample in the ambientspace by a kernel k(·, ·)

3 → Leads to a non-linear version of the SVM

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 8 / 35

Page 9: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Intuition and Motivation

Motivation of Kernel Machine

From

linear techniquesoperating on vector spaces

to

non linear prediction modelsoperating on various, structured, high-dimensional data

Using a:well known mathematical framework

leading to:efficient and powerful algorithm and tools

−→ Kernel method

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 9 / 35

Page 10: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

What is a kernel ?

What can be k ?

Page 11: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Kernel Theoretical framework

Prerequisites

Some definitions and notations

X : non empty input space (IRN , graphs, objects, . . . )

x ∈ X ,

H: feature space endowed with a dot product 〈·, ·〉HΦ : X → H: embedding function from X to H

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 11 / 35

Page 12: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Kernel Theoretical framework

Kernel

Kernel

A kernel k is a function k : X × X → IR:

k(x, z) = 〈Φ(x),Φ(z)〉H

Which kernel for SVM ? −→ positive definite kernelsWhy ? −→ To ensure a well-defined SVM dual problem

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 12 / 35

Page 13: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Kernel Theoretical framework

Positive Definite Kernels (1)

Positive definite kernel

A kernel k(x, z) on X × X is said to be positive definite

if it is symmetric: k(x, z) = k(z, x)

and if for any finite positive integer n:

∀αii=1,n ∈ IR,∀xii=1,n ∈ Ω,n∑

i=1

n∑j=1

αiαjk(xi , xj) ≥ 0

it is strictly positive definite if for αi 6= 0n∑

i=1

n∑j=1

αiαjk(xi , xj) > 0

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 13 / 35

Page 14: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Kernel Theoretical framework

Positive Definite Kernels (2)

Gram Matrix

Given a kernel k : X × X → IR, and samples x1, . . . , xn, the GramMatrix K is a n × n matrix with entries Ki ,j := k(xi , xj)

Another way to characterize positive definite kernel

If for any set of n ∈ IN samples x1, . . . , xn the associated Gram MatrixK ∈ IRn×n is positive definite, then k is a positive definite kernel on X .

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 14 / 35

Page 15: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Positive definite Kernels

Page 16: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Kernel functions Kernel on vectors

Linear Kernel

k(x, z) = x>z

x, z ∈ IRd

symmetric: x>z = x>z

positive:

n∑i=1

n∑j=1

αiαjk(xi , xj) =n∑

i=1

n∑j=1

αiαjx>i xj

=

(n∑

i=1

αixi

)> n∑j=1

αjxj

=

∥∥∥∥∥n∑

i=1

αixi

∥∥∥∥∥2

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 16 / 35

Page 17: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Kernel functions Kernel on vectors

Finite kernel

let φj , j = 1, p be a finite dictionary of functions from X to IR (polynomials,wavelets...)

the feature map and linear kernel

feature map:Φ : X → IRp

x 7→ Φ =(φ1(x), ..., φp(x)

)Linear kernel in the feature space:

k(x, z) =(φ1(x), ..., φp(x)

)>(φ1(z), ..., φp(z)

)e.g. the quadratic kernel: x, z ∈ IRd , k(x, z) =

(x>z + b

)2

feature map:

Φ : IRd → IRp=1+d+ d(d+1)2

x 7→ Φ =(1,√

2x1, ...,√

2xj , ...,√

2xd , x21 , ..., x

2j , ..., x

2d , ...,

√2xixj , ...

)Here the xj represent the variables of x ∈ IRd

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 17 / 35

Page 18: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Kernel functions Kernel on vectors

Closed form kernel: the quadratic kernel

Quadratic kernel k(x, z) =(x>z + 1

)2= 1 + 2x>z +

(x>z

)2

x, z ∈ IRd ,

It computes the dot product of the dictionary

Φ : IRd → IRp=1+d+ d(d+1)2

x 7→ Φ =(1,√

2x1,√

2x2, ...,√

2xd , x21 , x

22 , ..., x

2d , ...,

√2xixj , ...

)

p = 1 + d + d(d+1)2 multiplications vs. d + 1

use kernel to save computation

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 18 / 35

Page 19: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Kernel functions Kernel on vectors

Gaussian kernel

k(x, z) = exp(−‖x−z‖

2

2σ2

)for σ = 1:

Φ(x) =

exp ‖x‖2

2j√j!

1/j

(j

n1, . . . , nk

)1/2

xn11 . . . xnkk

j=0,...,∞,

∑ki=1 ni=j

Feature space has an infinite dimension

Overlearning

σ controls the influence area of the kernel

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 19 / 35

Page 20: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Kernel functions Kernels on generic data

Kernels on structures

X may not be a vector space.

we can define kernels on any kind of data :

StringsTime seriesGraphsImages. . .

O

C N

N

CC

CC

C

C

D

N

N

D

CC

C

CC

CkGilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 20 / 35

Page 21: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Kernel functions Kernels on generic data

Positive definite kernels: some common examples

Type Name k(x, z)

radial Gaussian exp(−‖x−z‖

2

2σ2

)radial Laplacian exp(−‖x− z‖/σ)

non stat. χ2 exp(−r/σ), r =∑

k(xk−zk )2

xk+zk

projectif polynomial (x>z + σ)p

projectif cosinus x>z/‖x‖‖z‖projectif correlation exp

(x>z‖x‖‖z‖ − σ

)The kernel may involve hyper-parameter(s) to tune (polynom order p,bandwidth σ)

Their value has to be set by cross-validation

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 21 / 35

Page 22: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Gram matrices with different bandwidths

raw data Gram matrix for σ = 2

σ = .5 σ = 10

Page 23: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Kernel functions Reproducing kernel

Reproducing Kernel Map

A feature space defined by the kernel.

Feature Map

set IRX := X → IRkernel k

Map from input space to IRX .

Φ : X → IRX

x 7→ k(·, x)

Φ

x x'Φ(x) Φ(x')

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 23 / 35

Page 24: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Kernel functions Reproducing kernel

Reproducing Kernel Hilbert Space (RKHS)

Hilbert Space

H is Hilbert Space:

pre-Hilbert space of functionsendowed with a inner product 〈·, ·〉H inducing a norm ‖f‖ :=

√〈f, f〉

Hilbert space: generalization of euclidean space to finite or infinitedimension

Reproducing kernel Hilbert space (RKHS)

A Hilbert space H endowed with the inner product 〈·, ·〉H is said to bewith reproducing kernel if it exists a positive definite kernel k such that

∀x ∈ X , k(·, x) ∈ H∀f ∈ H, f (x) = 〈f (·), k(x, ·)〉H

Beware: f = f (·) is a function while f (x) is the real value of f at point x

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 24 / 35

Page 25: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Kernel functions Reproducing kernel

RKHS (2)

Positive kernel ⇔ RKHS

it defines the inner product and the regularity (smoothness) of H

there exists a ”mapping” function Φ : X → H such that ∀x, z ∈ X the innerproduct 〈Φ(x),Φ(z)〉H = k(x, z)

Examples:

the polynomial kernel of degree p induces a feature of dim = 1 + p + p(p+1)2

the space induced by the gaussian kernel is of infinite dimension.

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 25 / 35

Page 26: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Kernel Machine

How to exploit the kernel trick to implement non-linear methods ?

Page 27: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Non-Linear Kernel Machine SVM for classification

Non-Linear SVM: formulation

Let k(·, ·) be a positive definite kernel inducing the space H

There exits the mapping function Φ : X → H defined such that∀x, z ∈ X we get 〈Φ(x),Φ(z)〉H = k(x, z)

Non-Linear SVM: general case

Dataset D = (xi , yi ) ∈ Rd × −1, 1ni=1

Problem formulation

minw∈H,b,ξini=1

12‖w‖

2H + C

∑Ni=1 ξi

s.t. yi (〈w, φ(xi )〉H + b) ≥ 1−ξi ∀i = 1, · · · , nξi ≥ 0 ∀i = 1, . . . , n

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 27 / 35

Page 28: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Non-Linear Kernel Machine SVM for classification

Non-Linear SVM: the solutionDual problem

maxαi

N∑i=1

αi −1

2

N∑i,j=1

αiαjyiyj 〈Φ(xi ),Φ(xj)〉H︸ ︷︷ ︸k(xi ,xj )

s.t. 0 ≤ αi ≤ C , ∀ i = 1, · · · ,NN∑i=1

αiyi = 0

Matrix form

maxα∈RN

−1

2α>Gα + 1

s.t. 0 ≤ α ≤ C 1>,α>y = 0

with G ∈ Rn×n a matrix suchthat Gij = yiyjk(xi , xj)

G is positive definite for a

positive definite kernel k

Classification function

f (x) =n∑

i=1

αiyik(xi , x) + b

Linear SVM = SVM with a linear kernel k(x, z) = x>z

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 28 / 35

Page 29: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Non-Linear Kernel Machine SVM for classification

Illustration: kernel mapping to decision function

Classification function:

f (x) =n∑

i=1

αiyik(xi , x) + b

ftp://ftp.cea.fr/pub/unati/people/educhesnay/pystatml/StatisticsMachineLearningPythonDraft.pdf

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 29 / 35

Page 30: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Non-Linear Kernel Machine SVM for classification

Sparsity of the SVM

f (x) =n∑

i=1

αik(x, xi )

D(x) = sign(f (x) + b

)

useless data important data suspicious datawell classified support

α = 0 0 < α < C α = C

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 30 / 35

Page 31: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Non-Linear Kernel Machine SVM for classification

Influence of kernel parameter

Gaussian kernel : exp(−‖x−z‖

2

2σ2

)with bandwidth σ

Kernel bandwidth (log. scale)-3 -2 -1 0 1 2

Err

or

0.2

0.4

0.6

< too small

0

0

0

0

0

0

0

0

0 0

nice <

0

0

< too large

0

0

−→ select σ using cross-validation

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 31 / 35

Page 32: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Non-Linear Kernel Machine SVM for classification

SVM with non symmetric costs

Useful for imbalanced classification problem

Set different weights C+ and C− according to the label yi = 1 or yi = −1

Primal problem minf∈H,b,ξ∈IRn

12‖f ‖

2H + C+

∑i|yi=1

ξi + C−∑

i|yi=−1

ξi

with yi(f (xi ) + b

)≥ 1− ξi , ξi ≥ 0, i = 1, n

Dual formulation maxα∈IRn

− 12α>Gα + α>1I

with α>y = 0and 0 ≤ αi ≤ C+ or C− i = 1, n

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 32 / 35

Page 33: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Support vector regression (SVR)

Regression with absolute value error

minf∈H

12‖f ‖

2H

s. t. |f (xi )− yi | ≤ t, i = 1, n

The support vector regression introduce slack variables

(SVR)

minf∈H

12‖f ‖

2H + C

∑|ξi |

such that |f (xi )− yi | ≤ t + ξi 0 ≤ ξi i = 1, n

0 1 2 3 4 5 6 7 8−1

−0.8

−0.6

−0.4

−0.2

0

0.2

0.4

0.6

0.8

1Support Vector Machine Regression

x

y

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5−1.5

−1

−0.5

0

0.5

1

1.5Support Vector Machine Regression

x

y

Page 34: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Non-Linear Kernel Machine SVM for regression

SVM Solvers

LibSVM (Multi-language) :http://www.csie.ntu.edu.tw/~cjlin/libsvm/

ScikitLearn (Python)http://scikit-learn.org/stable/modules/svm.html

PyML http://pyml.sourceforge.net/#

...

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 34 / 35

Page 35: Kernel Machines - From linear to Kernel SVM€¦ · Road map 1 Intuition and Motivation 2 Kernel Theoretical framework 3 Kernel functions Kernel on vectors Kernels on generic data

Non-Linear Kernel Machine SVM for regression

Conclusion

What we’ve seen

Kernels corresponds to scalar product in some Hilbert space:

value corresponds to high dimensional scalar product,on non linear embeddingwithout explicit representations of Φ

Can be defined on any kind of data provided we are able to define ameasure of similarity

Applications

Algorithms operating on these functions

Non linear prediction models for classification and regression

Gilles Gasso Benoit Gauzere - Stephane Canu Kernel Machines 35 / 35