Multiple Kernel Learning
DESCRIPTION
Multiple Kernel Learning. Marius Kloft, Technische Universität Berlin / Korea University. Machine learning with multiple kernels; example: object detection in images.

TRANSCRIPT
![Page 1: Multiple Kernel Learning](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816617550346895dd96623/html5/thumbnails/1.jpg)
Multiple Kernel Learning
Marius Kloft
Technische Universität Berlin / Korea University
![Page 2: Multiple Kernel Learning](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816617550346895dd96623/html5/thumbnails/2.jpg)
Marius Kloft (TU Berlin)
Machine Learning
• Aim
▫ Learning the relation of two random quantities from observations
• Kernel-based learning
• Example
▫ Object detection in images
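Kernel-based learning represents the data only through pairwise similarities k(x, z). As a minimal illustration (not from the slides; the Gaussian kernel and its width `gamma` are standard choices, assumed here):

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Gaussian (RBF) kernel matrix: K[i, j] = exp(-gamma * ||x_i - z_j||^2).

    X: (n, d) array, Z: (m, d) array; gamma is an illustrative width parameter.
    """
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)
```

Any learner that needs only inner products (an SVM, kernel ridge regression) can then operate on the matrix K instead of the raw features.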
![Page 3: Multiple Kernel Learning](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816617550346895dd96623/html5/thumbnails/3.jpg)
Multiple Views / Kernels
How to combine the views? Via weightings (Lanckriet, 2004).
Views: shape, space, color.
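The weighted combination of views amounts to a weighted sum of base kernel matrices; a minimal sketch (the toy matrices and weights below are hypothetical):

```python
import numpy as np

def combine_kernels(kernels, theta):
    """Combined kernel K = sum_m theta_m * K_m for view weights theta."""
    return sum(t * K_m for t, K_m in zip(theta, kernels))

# hypothetical 3x3 Gram matrices for shape, space, and color views
K_shape = np.eye(3)
K_space = np.ones((3, 3))
K_color = np.diag([1.0, 2.0, 3.0])
theta = np.array([0.5, 0.3, 0.2])  # view weights
K = combine_kernels([K_shape, K_space, K_color], theta)
```

Multiple kernel learning is then the question of learning theta from data rather than fixing it by hand.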
![Page 4: Multiple Kernel Learning](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816617550346895dd96623/html5/thumbnails/4.jpg)
Computation of Weights?
• State of the art (Bach, 2008)
▫ Sparse weights: kernels/views are discarded entirely
▫ But why discard information?
![Page 5: Multiple Kernel Learning](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816617550346895dd96623/html5/thumbnails/5.jpg)
From Vision to Reality?
• State of the art: sparse methods
▫ empirically ineffective (Gehler et al., Noble et al., Shawe-Taylor et al., NIPS 2008)
• My dissertation: a new methodology
▫ established as a standard
▫ breakthrough in learning bounds: O(M/n)
▫ more efficient and effective in practice, across applications
![Page 6: Multiple Kernel Learning](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816617550346895dd96623/html5/thumbnails/6.jpg)
Presentation of Methodology: Non-sparse Multiple Kernel Learning
![Page 7: Multiple Kernel Learning](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816617550346895dd96623/html5/thumbnails/7.jpg)
New Methodology (Kloft et al., ECML 2010; JMLR 2011)
• Computation of weights?
▫ Model: weighted combination of kernels
▫ Mathematical program: optimization over the weights; a convex problem
• Generalized formulation
▫ arbitrary loss
▫ arbitrary norms, e.g. lp-norms; the 1-norm leads to sparsity
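The formulas on this slide did not survive extraction. Using the notation of Kloft et al. (JMLR 2011), with M kernels, feature maps φ_m, and loss ℓ, the lp-norm MKL primal can be sketched as:

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\theta} \ge 0}\;
  \frac{1}{2} \sum_{m=1}^{M} \frac{\|\mathbf{w}_m\|_2^2}{\theta_m}
  \;+\; C \sum_{i=1}^{n} \ell\!\left( \sum_{m=1}^{M} \langle \mathbf{w}_m, \phi_m(x_i) \rangle + b,\; y_i \right)
  \quad \text{s.t.} \quad \|\boldsymbol{\theta}\|_p \le 1 .
```

For p = 1 the constraint induces sparse weights θ; for p > 1 all kernels keep nonzero weight, which is the non-sparse case advocated here.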
![Page 8: Multiple Kernel Learning](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816617550346895dd96623/html5/thumbnails/8.jpg)
Theoretical Analysis
• Theoretical foundations
▫ an active research topic (NIPS workshop 2010)
▫ We show:
Theorem (Kloft & Blanchard). The local Rademacher complexity of MKL is bounded by:
• Corollaries (learning bounds)
▫ upper bound with rate …; best previously known rate: … (Cortes et al., ICML 2010)
▫ generally an improvement; for …, an improvement of two orders of magnitude
(Kloft & Blanchard, NIPS 2011; JMLR 2012)
![Page 9: Multiple Kernel Learning](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816617550346895dd96623/html5/thumbnails/9.jpg)
Proof (Sketch)
1. Relating the original class to the centered class
2. Bounding the complexity of the centered class
3. Applying the Khintchine-Kahane and Rosenthal inequalities
4. Bounding the complexity of the original class
5. Relating the bound to the truncation of the spectra of the kernels
![Page 10: Multiple Kernel Learning](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816617550346895dd96623/html5/thumbnails/10.jpg)
Optimization (Kloft et al., JMLR 2011)
• Algorithms
1. Newton method
2. Sequential quadratically constrained programming with level-set projections
3. Block-coordinate descent (sketch): alternate between solving (P) w.r.t. w and solving (P) w.r.t. the kernel weights (analytical update), until convergence (convergence proved)
• Implementation
▫ in C++ ("SHOGUN Toolbox"), with Matlab/Octave/Python/R support
▫ runtime: ~1-2 orders of magnitude faster
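The block-coordinate descent of algorithm 3 can be sketched in Python. This is a hedged sketch, not the SHOGUN implementation: kernel ridge regression stands in for the SVM solve of (P) w.r.t. w, the analytical weight update follows the form θ_m ∝ ||w_m||^(2/(p+1)) from Kloft et al. (JMLR 2011), and all parameter values are illustrative.

```python
import numpy as np

def lp_mkl_bcd(kernels, y, p=2.0, lam=1.0, n_iter=20):
    """Block-coordinate descent sketch for lp-norm MKL.

    Alternates (a) a closed-form kernel ridge solve under the current
    combined kernel (standing in for the SVM step) with (b) the
    analytical lp-norm update of the kernel weights theta.
    """
    M, n = len(kernels), len(y)
    theta = np.full(M, M ** (-1.0 / p))  # uniform start with ||theta||_p = 1
    for _ in range(n_iter):
        K = sum(t * K_m for t, K_m in zip(theta, kernels))
        alpha = np.linalg.solve(K + lam * np.eye(n), y)  # ridge dual solve
        # squared block norms ||w_m||^2 = theta_m^2 * alpha' K_m alpha
        norms_sq = np.array([t ** 2 * alpha @ K_m @ alpha
                             for t, K_m in zip(theta, kernels)])
        norms_sq = np.maximum(norms_sq, 1e-12)
        # analytical update: theta_m ∝ ||w_m||^{2/(p+1)}, then renormalize
        theta = norms_sq ** (1.0 / (p + 1))
        theta /= np.linalg.norm(theta, ord=p)
    return theta, alpha
```

The weight step is analytical, so each iteration costs one linear solve plus M quadratic forms; this is what makes the alternating scheme cheap compared with solving the joint problem directly.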
![Page 11: Multiple Kernel Learning](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816617550346895dd96623/html5/thumbnails/11.jpg)
Toy Experiment
• Design
▫ two 50-dimensional Gaussians; mean μ1 on the simplex
▫ zero-mean features irrelevant; 50 training examples; one linear kernel per feature
▫ six scenarios: vary the % of irrelevant features; variance chosen so that the Bayes error stays constant
• Results
[Figure: test error (0-50%) vs. sparsity (0%, 44%, 64%, 82%, 92%, 98%) for each method]
▫ the choice of p is crucial
▫ the optimal p depends on the true sparsity; the SVM fails in the sparse scenarios
▫ the 1-norm is best in the sparsest scenario only
▫ p-norm MKL proves robust in all scenarios
• Bounds
▫ can minimize the bounds w.r.t. p
▫ the bounds reflect the empirical results well
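The toy design above can be sketched as follows. The helper is hypothetical (`sep` and `n_relevant` are illustrative, and the slide's exact variance scaling that keeps the Bayes error constant is not reproduced):

```python
import numpy as np

def toy_data(n=50, d=50, n_relevant=10, sep=1.0, rng=None):
    """Two d-dimensional Gaussians whose means differ only in the first
    n_relevant features (a simplex-style mean direction); the remaining
    zero-mean features are irrelevant, as in the slide's design."""
    rng = rng or np.random.RandomState(0)
    mu = np.zeros(d)
    mu[:n_relevant] = sep / n_relevant
    y = rng.choice([-1, 1], size=n)
    X = rng.randn(n, d) + np.outer(y, mu)
    return X, y
```

One linear kernel per feature then gives d base kernels `K_j = X[:, j:j+1] @ X[:, j:j+1].T`, of which only the first `n_relevant` carry class signal.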
![Page 12: Multiple Kernel Learning](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816617550346895dd96623/html5/thumbnails/12.jpg)
Applications
• Applications studied:
▫ computer vision: visual object recognition
▫ bioinformatics: subcellular localization, protein fold prediction, DNA transcription start site detection, metabolic network reconstruction
• Lesson learned:
▫ the optimality of a method depends on the true underlying sparsity of the problem (motivating non-sparsity)
![Page 13: Multiple Kernel Learning](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816617550346895dd96623/html5/thumbnails/13.jpg)
Application Domain: Computer Vision
• Visual object recognition
▫ aim: annotation of visual media (e.g., images)
▫ motivation: content-based image retrieval
Example categories: aeroplane, bicycle, bird
![Page 14: Multiple Kernel Learning](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816617550346895dd96623/html5/thumbnails/14.jpg)
• Multiple kernels, based on
▫ color histograms
▫ shapes (gradients)
▫ local features (SIFT words)
▫ spatial features
![Page 15: Multiple Kernel Learning](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816617550346895dd96623/html5/thumbnails/15.jpg)
Application Domain: Computer Vision
• Datasets
▫ 1. VOC 2009 challenge: 7054 train / 6925 test images; 20 object categories (aeroplane, bicycle, …)
▫ 2. ImageCLEF2010 challenge: 8000 train / 10000 test images, taken from Flickr; 93 concept categories (partylife, architecture, skateboard, …)
• 32 kernels
▫ 4 types:
▫ varied over color-channel combinations and spatial tilings (levels of a spatial pyramid)
▫ one one-vs.-rest classifier for each visual concept
|                  | Bag of Words | Global Histogram |
|------------------|--------------|------------------|
| Gradients        | BoW-SIFT     | Ho(O)G           |
| Color Histograms | BoW-C        | HoC              |
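The one-vs.-rest setup mentioned above can be sketched as follows; this is an illustrative sketch in which kernel ridge classifiers stand in for the SVMs actually used, operating on a precomputed (combined) kernel matrix:

```python
import numpy as np

def ovr_train(K, labels, classes, lam=1.0):
    """One-vs-rest training on a precomputed (combined) kernel matrix K:
    one +1/-1 kernel ridge classifier per class."""
    A = K + lam * np.eye(K.shape[0])
    return {c: np.linalg.solve(A, np.where(labels == c, 1.0, -1.0))
            for c in classes}

def ovr_predict(K_test, dual_coefs):
    """Predict, for each test point, the class whose classifier scores highest."""
    classes = list(dual_coefs)
    scores = np.stack([K_test @ dual_coefs[c] for c in classes], axis=1)
    return np.array(classes)[scores.argmax(axis=1)]
```

With 20 (VOC) or 93 (CLEF) concepts, this yields one binary classifier per concept, all sharing the same combined kernel.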
![Page 16: Multiple Kernel Learning](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816617550346895dd96623/html5/thumbnails/16.jpg)
Application Domain: Computer Vision
• Preliminary results (Binder et al., 2010, 2011)
▫ using BoW-S only gives worse results → BoW-S alone is not sufficient

|          | SVM   | MKL   |
|----------|-------|-------|
| VOC2009  | 55.85 | 56.76 |
| CLEF2010 | 36.45 | 37.02 |

• Challenge results
▫ employed our approach in the ImageCLEF2011 Photo Annotation challenge
▫ achieved the winning entries in 3 categories
![Page 17: Multiple Kernel Learning](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816617550346895dd96623/html5/thumbnails/17.jpg)
Application Domain: Computer Vision
• Why can MKL help?
▫ some images are better captured by certain kernels; different images may have different kernels that capture them well
• Experiment
▫ disagreement of single-kernel classifiers
▫ the BoW-S kernels induce more or less the same predictions
![Page 18: Multiple Kernel Learning](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816617550346895dd96623/html5/thumbnails/18.jpg)
Application Domain: Genetics
• Detection of transcription start sites
• By means of kernels based on:
▫ sequence alignments
▫ the distribution of nucleotides downstream / upstream
▫ folding properties (binding energies and angles)
• Empirical analysis (Kloft et al., NIPS 2009; JMLR 2011)
▫ detection accuracy (AUC): higher accuracies than sparse MKL and ARTS
▫ ARTS: winner of an international comparison of 19 models (Abeel et al., 2009)
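The AUC used as the evaluation measure here can be computed directly from the pairwise-ranking (Mann-Whitney) identity; a minimal sketch:

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve: the fraction of (positive, negative)
    pairs that the scores rank correctly, counting ties as half."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    correct = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return correct + 0.5 * ties
```

Unlike accuracy, this measure is insensitive to the strong class imbalance typical of site-detection tasks, which is why it is the standard metric here.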
![Page 19: Multiple Kernel Learning](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816617550346895dd96623/html5/thumbnails/19.jpg)
Application Domain: Genetics
• Theoretical analysis (Kloft et al., NIPS 2009; JMLR 2011)
▫ impact of the lp-norm on the bound
▫ confirms the experimental results: stronger theoretical guarantees for the proposed approach (p > 1); empirical and theoretical results approximately agree
• Empirical analysis
▫ detection accuracy (AUC): higher accuracies than sparse MKL and ARTS, the winner of an international comparison of 19 models (Abeel et al., 2009)
![Page 20: Multiple Kernel Learning](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816617550346895dd96623/html5/thumbnails/20.jpg)
Application Domain: Pharmacology
• Protein fold prediction
▫ prediction of the fold class of a protein; the fold class is related to the protein's function (e.g., important for drug design)
▫ data set and kernels from Y. Ying: 27 fold classes, fixed train and test sets, 12 biologically inspired kernels (e.g., hydrophobicity, polarity, van der Waals volume)
• Results
▫ accuracy: 1-norm MKL and the SVM on par; p-norm MKL performs best, with 6% higher accuracy than the baselines
![Page 21: Multiple Kernel Learning](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816617550346895dd96623/html5/thumbnails/21.jpg)
Further Applications
• Computer vision
▫ visual object recognition
• Bioinformatics
▫ subcellular localization
▫ protein fold prediction
▫ DNA transcription start / splice site detection
▫ metabolic network reconstruction
• Non-sparsity
![Page 22: Multiple Kernel Learning](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816617550346895dd96623/html5/thumbnails/22.jpg)
Conclusion: Non-sparse Multiple Kernel Learning
• Training with > 100,000 data points and > 1,000 kernels
• Sharp learning bounds
• Applications
▫ visual object recognition: an established standard; winner of the ImageCLEF 2011 challenge
▫ computational biology: a more accurate gene detector than the winner of an international comparison
![Page 23: Multiple Kernel Learning](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816617550346895dd96623/html5/thumbnails/23.jpg)
Thank you for your attention. I will be pleased to answer any additional questions.
![Page 24: Multiple Kernel Learning](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816617550346895dd96623/html5/thumbnails/24.jpg)
References
▫ Abeel, Van de Peer, Saeys (2009). Toward a gold standard for promoter prediction evaluation. Bioinformatics.
▫ Bach (2008). Consistency of the Group Lasso and Multiple Kernel Learning. Journal of Machine Learning Research (JMLR).
▫ Kloft, Brefeld, Laskov, and Sonnenburg (2008). Non-sparse Multiple Kernel Learning. NIPS Workshop on Kernel Learning.
▫ Kloft, Brefeld, Sonnenburg, Laskov, Müller, and Zien (2009). Efficient and Accurate Lp-norm Multiple Kernel Learning. Advances in Neural Information Processing Systems (NIPS 2009).
▫ Kloft, Rückert, and Bartlett (2010). A Unifying View of Multiple Kernel Learning. ECML.
▫ Kloft, Blanchard (2011). The Local Rademacher Complexity of Lp-Norm Multiple Kernel Learning. Advances in Neural Information Processing Systems (NIPS 2011).
▫ Kloft, Brefeld, Sonnenburg, and Zien (2011). Lp-Norm Multiple Kernel Learning. Journal of Machine Learning Research (JMLR), 12(Mar):953-997.
▫ Kloft and Blanchard (2012). On the Convergence Rate of Lp-norm Multiple Kernel Learning. Journal of Machine Learning Research (JMLR), 13(Aug):2465-2502.
▫ Lanckriet, Cristianini, Bartlett, El Ghaoui, Jordan (2004). Learning the Kernel Matrix with Semidefinite Programming. Journal of Machine Learning Research (JMLR).