
Multiple Kernel Learning

Vivek Gupta² and Anurendra Kumar¹

¹ Department of Electrical Engineering
² Department of Computer Science

Abstract

We present current state-of-the-art techniques for multiple kernel learning. Multiple kernel learning uses several kernels instead of a single kernel and thus captures more complex features of the data. It is a promising idea for selecting the kernel automatically and accommodating different notions of similarity on a particular dataset. We follow the work of [7] to reformulate the QCQP optimization problem as a semi-infinite linear program, which can be solved efficiently in less time using wrapper and chunking algorithms. We then run experiments with the toolbox provided by the authors of the paper and show that multiple kernel learning gives better results, improving accuracy by a significant amount compared with a single kernel on all types of datasets, independent of their distribution and separability. We also show that this approach provides a deeper understanding of the kernels. Finally, we experiment with combining kernels with feature combination in a multi-modal setting where the data comes from heterogeneous sources.

Under the guidance of Professor Harish Karnick

Indian Institute of Technology, Kanpur

Contents

1 Introduction

2 Motivation

3 Problem Formulation
  3.1 Primal Problem
  3.2 Interesting Observations
  3.3 Conic primal problem
  3.4 Dual Problem for M.K.L.
  3.5 Saddle Point Problem
  3.6 Semi-Infinite Linear Program

4 Algorithms for solving the Semi-Infinite Linear Program

5 Shogun

6 Experiments
  6.1 Experiment 1
  6.2 Experiment 2
  6.3 Experiment 3
  6.4 Experiment 4
  6.5 Experiment 5
  6.6 Experiment 6
  6.7 Experiment 7

7 Observation and Conclusion

8 Future Work


1 Introduction

The support vector machine (SVM) is a discriminative classifier that has gained huge popularity. It employs a kernel function k(x_1, x_2) that embeds a notion of similarity between x_1 and x_2. The classifier is trained by solving the following primal problem:

$$
\begin{aligned}
\min_{w,b,\xi}\quad & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i \\
\text{w.r.t.}\quad & w \in \mathbb{R}^{D},\; \xi \in \mathbb{R}^{N},\; b \in \mathbb{R} \\
\text{s.t.}\quad & \xi_i \ge 0 \;\text{ and }\; y_i\big(\langle w, \phi(x_i)\rangle + b\big) \ge 1 - \xi_i \quad \forall\, i = 1,\dots,N
\end{aligned}
$$

Instead of solving the primal problem we solve the dual problem, because it enables us to apply the kernel trick. The kernel trick allows the problem to be solved using the kernel matrix only, without ever working in the feature space explicitly. The dual problem is:

$$
\begin{aligned}
\max_{\alpha}\quad & -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j \langle x_i, x_j\rangle + \sum_{i=1}^{N}\alpha_i \\
\text{w.r.t.}\quad & \alpha \in \mathbb{R}^{N} \\
\text{s.t.}\quad & \sum_{i=1}^{N}\alpha_i y_i = 0,\qquad 0 \le \alpha_i \le C \quad \forall\, i = 1,\dots,N
\end{aligned}
$$

The dual problem is powerful in the sense that we can choose any positive semidefinite kernel to define the inner product. By using a non-linear kernel we can embed a non-linear notion of similarity.
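As a minimal illustration of the kernel trick (our own example, not part of [7]): the dual only needs the kernel matrix, so a classifier can be trained from a precomputed Gaussian kernel matrix without ever constructing φ(x) explicitly.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)  # non-linearly separable labels

K = rbf_kernel(X, X, gamma=0.5)                   # Gaussian kernel matrix, shape (200, 200)
clf = SVC(C=1.0, kernel="precomputed").fit(K, y)  # dual SVM solved from K alone
print("training accuracy:", clf.score(K, y))
```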

2 Motivation

Though single-kernel learning incorporates non-linear similarities, it is impossible to find a single kernel which performs best for all types of datasets. The Gaussian kernel often performs comparatively better than other kernels because its feature space is infinite dimensional. Still, most of the time we have to choose a kernel for the SVM and tune its parameters, which is non-intuitive. Multiple kernel learning is an efficient way of automating this model selection.

Multiple kernel learning also provides flexibility when learning problems involve multiple heterogeneous data sources. Since multiple kernels can embed different notions of similarity at the same time, we can exploit this property to solve such problems. For example, consider video data with subtitles: it contains video features, audio features and text features, and each set of features requires a different notion of similarity.

Often the resulting decision function is hard to interpret. We show in our experiments that multiple kernel learning helps in interpreting the resulting decision function and in extracting relevant knowledge about the training set. Thus it provides a deeper understanding of the kernels.


3 Problem Formulation

Without loss of generality we focus only on binary classification for the moment. Some authors [3] have considered a kernel mixture framework where each kernel and each example are assigned a weight, but this does not offer useful insights. We therefore follow the linear convex combination of kernels

$$ k(x_i, x_j) = \sum_{k=1}^{K}\beta_k\, k_k(x_i, x_j) \qquad (1) $$

with β_k ≥ 0 and ∑_{k=1}^{K} β_k = 1, where each kernel k_k may use only a subset of the features. If we choose appropriate kernels k_k and find a sparse weighting β_k, an interpretable decision function and feature selection come almost for free, which is missing in standard single-kernel algorithms.
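In matrix form, Equation (1) simply takes a convex combination of the base kernel matrices; the short sketch below (our own, with arbitrary weights β) checks that the combined matrix is still positive semidefinite.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

X = np.random.default_rng(1).normal(size=(100, 5))
base = [rbf_kernel(X, gamma=0.1), linear_kernel(X), polynomial_kernel(X, degree=2)]
beta = np.array([0.7, 0.2, 0.1])                 # beta_k >= 0 and sum_k beta_k = 1
K = sum(b * Kk for b, Kk in zip(beta, base))     # k(x_i, x_j) = sum_k beta_k k_k(x_i, x_j)
assert np.all(np.linalg.eigvalsh(K) > -1e-8)     # still positive semidefinite (up to tolerance)
```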

3.1 Primal Problem

We are given N data points (x_i, y_i) with y_i ∈ {±1} and K mappings φ_k : x ↦ R^{D_k} from the input space into K feature spaces φ_1(x), φ_2(x), ..., φ_K(x), where D_k is the dimensionality of the k-th feature space. The primal problem is

$$
\begin{aligned}
\min_{w,b,\xi}\quad & \frac{1}{2}\Big(\sum_{k=1}^{K}\|w_k\|\Big)^2 + C\sum_{i=1}^{N}\xi_i \\
\text{w.r.t.}\quad & w_k \in \mathbb{R}^{D_k},\; \xi \in \mathbb{R}^{N},\; b \in \mathbb{R} \\
\text{s.t.}\quad & \xi_i \ge 0 \;\text{ and }\; y_i\Big(\sum_{k=1}^{K}\langle w_k, \phi_k(x_i)\rangle + b\Big) \ge 1 - \xi_i \quad \forall\, i = 1,\dots,N
\end{aligned}
$$

3.2 Interesting Observations

Bach et al. [8] showed that the solution can be written as w_k = β_k w'_k with β_k ≥ 0 and ∑_{k=1}^{K} β_k = 1. Since β is constrained by an ℓ1-norm, the solution is sparse, which is desirable because for a particular dataset only a few kernels should matter. Each w_k is constrained by an ℓ2-norm and is therefore not sparse, which is again what we expect.

3.3 Conic primal problem

The objective function contains a sum of norms. There are many ways to formulate the dual of the above problem. One way is to convert it into a conic primal problem and then take its dual. This is done by using the epigraph technique and observing that the constraints lie in a convex cone. The conic primal problem obtained after these transformations is:

$$
\begin{aligned}
\min\quad & \frac{1}{2}u^2 + C\sum_{i=1}^{N}\xi_i \\
\text{w.r.t.}\quad & u \in \mathbb{R},\; t_k \in \mathbb{R},\; (w_k, t_k) \in \mathcal{K}_{D_k},\; \xi \in \mathbb{R}^{N},\; b \in \mathbb{R} \\
\text{s.t.}\quad & \xi_i \ge 0,\quad y_i\Big(\sum_{k=1}^{K}\langle w_k, \phi_k(x_i)\rangle + b\Big) \ge 1 - \xi_i \quad \forall\, i = 1,\dots,N, \\
& \sum_{k=1}^{K} t_k \le u
\end{aligned}
$$

where K_{D_k} denotes the second-order convex cone and (w_k, t_k) ∈ K_{D_k} means ||w_k|| ≤ t_k; the constraint ∑_k t_k ≤ u ties the cone variables to u so that the conic problem is equivalent to the original one.

3.4 Dual Problem for M.K.L.

Using the conic duality technique given in the book by Boyd and Vandenberghe [4] on the above problem, and after a few mathematical manipulations (see [7]), we get

$$
\begin{aligned}
\min\quad & \gamma \\
\text{w.r.t.}\quad & \gamma \in \mathbb{R},\; \alpha \in \mathbb{R}^{N} \\
\text{s.t.}\quad & 0 \le \alpha \le C\mathbf{1},\qquad \sum_{i=1}^{N}\alpha_i y_i = 0, \\
& S_k(\alpha) = \frac{1}{2}\sum_{i,j=1}^{N}\alpha_i\alpha_j y_i y_j k_k(x_i, x_j) - \sum_{i=1}^{N}\alpha_i \;\le\; \gamma \qquad \forall\, k = 1,\dots,K
\end{aligned}
$$

3.5 Saddle Point Problem

We see that the above dual contains the S_k(α) as quadratic constraints. Our aim is to convert it into a linear program, even if that program has infinitely many constraints. The above problem is equivalent to the following saddle point problem:

$$
\begin{aligned}
\max_{\beta}\;\min_{\alpha}\quad & L = \gamma + \sum_{k=1}^{K}\beta_k\big(S_k(\alpha) - \gamma\big) \\
\text{s.t.}\quad & 0 \le \alpha \le C\mathbf{1},\qquad \beta_k \ge 0,\qquad \sum_{i=1}^{N}\alpha_i y_i = 0
\end{aligned}
$$

Setting the derivative of L with respect to γ to zero gives ∑_{k=1}^{K} β_k = 1; substituting this back, we get the following simplified problem:

$$
\begin{aligned}
\max_{\beta}\;\min_{\alpha}\quad & \sum_{k=1}^{K}\beta_k S_k(\alpha) \\
\text{s.t.}\quad & 0 \le \alpha \le C\mathbf{1},\qquad \sum_{i=1}^{N}\alpha_i y_i = 0,\qquad \sum_{k=1}^{K}\beta_k = 1
\end{aligned}
$$

3.6 Semi-Infinite Linear Program

Applying the epigraph technique to the inner minimization over α, we get the following semi-infinite linear program (SILP):

$$
\begin{aligned}
\max\quad & \theta \\
\text{w.r.t.}\quad & \theta \in \mathbb{R},\; \beta \in \mathbb{R}^{K} \\
\text{s.t.}\quad & \beta \ge 0,\qquad \sum_{k=1}^{K}\beta_k = 1,\qquad \sum_{k=1}^{K}\beta_k S_k(\alpha) \ge \theta \\
& \forall\, \alpha \in \mathbb{R}^{N} \;\text{ with }\; 0 \le \alpha \le C\mathbf{1} \;\text{ and }\; \sum_{i=1}^{N}\alpha_i y_i = 0
\end{aligned}
$$

This is a linear program in θ and β with infinitely many constraints, one for each α ∈ R^N satisfying 0 ≤ α ≤ C·1 and ∑_{i=1}^{N} α_i y_i = 0.

4 Algorithms for solving the Semi-Infinite Linear Program

Using [6] and [5], it can be shown that the SILP is feasible if the primal is feasible and bounded. The two algorithms generally used for solving the SILP are:


Wrapper Algorithm: It divides the SILP into an inner and an outer subproblem and iterates between the two until convergence.

• Inner loop: Find α using a standard SVM, replacing the single kernel k(x_i, x_j) with ∑_{k=1}^{K} β_k k_k(x_i, x_j).

• Outer loop: The restricted problem now has finitely many constraints, one for each stored α. Solve it for β and θ.

Advantages of the wrapper algorithm: easy, generic, and reasonably fast for easy and medium-sized problems. Disadvantage of the wrapper algorithm: if β is far away from the global optimum, determining α is costly. A Python sketch of this alternation is given below.
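The sketch below is our own minimal Python version of the wrapper alternation, written with scikit-learn and scipy rather than the Shogun/SVM^light implementation of [7]; the helper names (wrapper_mkl, S_k) are ours, while the inner SVM and outer linear program follow the formulations of Sections 3.4 and 3.6.

```python
import numpy as np
from scipy.optimize import linprog
from sklearn.svm import SVC

def S_k(K_k, alpha, y):
    """S_k(alpha) = 1/2 sum_ij alpha_i alpha_j y_i y_j k_k(x_i, x_j) - sum_i alpha_i."""
    ay = alpha * y
    return 0.5 * ay @ K_k @ ay - alpha.sum()

def wrapper_mkl(kernels, y, C=1.0, max_iter=50, tol=1e-4):
    """Illustrative wrapper algorithm; kernels is a list of precomputed N x N matrices."""
    K_num, N = len(kernels), len(y)
    beta = np.ones(K_num) / K_num                  # start from uniform kernel weights
    cuts, theta = [], -np.inf
    alpha = np.zeros(N)
    for _ in range(max_iter):
        # inner loop: standard SVM on the combined kernel sum_k beta_k K_k
        K_comb = sum(b * K for b, K in zip(beta, kernels))
        svm = SVC(C=C, kernel="precomputed").fit(K_comb, y)
        alpha = np.zeros(N)
        alpha[svm.support_] = np.abs(svm.dual_coef_.ravel())
        S = np.array([S_k(K, alpha, y) for K in kernels])
        if np.isfinite(theta) and theta != 0 and abs(1 - (beta @ S) / theta) <= tol:
            break                                  # new alpha no longer violates theta much
        cuts.append(S)
        # outer loop: LP  max theta  s.t.  sum_k beta_k S_k(alpha_t) >= theta for every
        # stored alpha_t, beta >= 0, sum_k beta_k = 1; variables x = [beta_1..beta_K, theta]
        c = np.zeros(K_num + 1)
        c[-1] = -1.0                               # maximize theta
        A_ub = np.hstack([-np.array(cuts), np.ones((len(cuts), 1))])
        A_eq = np.hstack([np.ones((1, K_num)), np.zeros((1, 1))])
        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(len(cuts)),
                      A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, None)] * K_num + [(None, None)])
        beta, theta = res.x[:K_num], res.x[-1]
    return beta, alpha
```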

Chunking Algorithm: This is an extension of the chunking algorithm used in standard SVM implementations, designed to counter the case in the wrapper algorithm where β is far away from the global optimum. It recomputes β at intermediate steps of the SVM optimization, avoiding a full SVM recomputation each time, by modifying the SVM^light optimizer (refer to [7] for more details).

5 Shogun

We run all of our experiments on the Shogun machine learning toolbox provided by [7], using its Python interface; the scikit-learn and numpy libraries were used alongside Shogun to implement the results. Shogun is a free, open-source toolbox originally designed for large-scale kernel methods and bioinformatics. It has the following features (a minimal usage sketch follows the list):

• A large number of kernels, including string kernels.

• Modular and optimized for very large numbers of examples and hundreds of kernels to be combined.

• Allows easy combination of multiple data representations, algorithm classes, and general-purpose tools.

• Originally written in C++, with a unified interface available for C++, Python, Octave, R, Java, Lua, C and Matlab.

• Algorithms: HMM, LDA, LPM, Perceptron, SVR, and many more.
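A minimal usage sketch with the modular Python interface is given below. The class and method names follow the classic modshogun API and may differ between Shogun versions, so treat this as an illustrative outline rather than version-exact code; the data here is random placeholder data.

```python
import numpy as np
# Classic Shogun modular interface; newer releases may expose these under "shogun" instead.
from modshogun import (BinaryLabels, CombinedFeatures, CombinedKernel,
                       GaussianKernel, MKLClassification, RealFeatures)

X = np.random.randn(2, 200)                      # Shogun expects dim x N (column-wise) features
y = np.where(np.random.randn(200) > 0, 1.0, -1.0)

feats, kernel = CombinedFeatures(), CombinedKernel()
for width in [2.0, 5.0, 7.0, 10.0]:              # the Gaussian widths used in Experiment 1
    feats.append_feature_obj(RealFeatures(X))
    kernel.append_kernel(GaussianKernel(10, width))
kernel.init(feats, feats)

mkl = MKLClassification()
mkl.set_mkl_norm(1)                              # l1-norm on beta gives a sparse kernel weighting
mkl.set_kernel(kernel)
mkl.set_labels(BinaryLabels(y))
mkl.set_C(1.0, 1.0)
mkl.train()
print("kernel weights beta:", kernel.get_subkernel_weights())
```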

6 Experiments

We experimented with binary classification and regression problems.

6.1 Experiment 1

Dataset: Two concentric circles, with the binary classes on opposite halves with respect to the horizontal axis passing through the center (Fig. 1).

Experiment: Starting with the non-separable case of zero distance between the circles (highest classification error), we gradually increase the distance between the circles (low classification error) and perform binary classification using multiple kernels. As the boundary needed for separation gradually tends towards the larger circle when the distance is increased, a high-width Gaussian kernel should be preferred. We performed several experiments to demonstrate this, using multi-kernel binary classification with 4 Gaussian kernels of varying widths (2, 5, 7, 10). We then compare the mean weight acquired by each kernel during multi-kernel learning as a function of the distance between the circles (Fig. 2). We also compare the mean error of the multi-kernel classifier and of each individual kernel on a similar randomly generated noisy test set (Fig. 3).
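A hedged sketch of this sweep follows. The report does not list its exact data generator, so scikit-learn's make_circles (whose factor argument controls the gap between the circles and whose labels are inner vs. outer circle) stands in for our dataset; wrapper_mkl is the illustrative helper from Section 4, and the width-to-gamma conversion is one common convention, not necessarily Shogun's.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel

widths = [2.0, 5.0, 7.0, 10.0]                   # Gaussian widths from the report
for factor in np.linspace(0.9, 0.3, 4):          # smaller factor = larger gap between circles
    X, y = make_circles(n_samples=400, noise=0.05, factor=factor, random_state=0)
    y = np.where(y == 1, 1, -1)
    kernels = [rbf_kernel(X, gamma=1.0 / (2 * w ** 2)) for w in widths]
    beta, _ = wrapper_mkl(kernels, y, C=1.0)     # illustrative helper from Section 4
    print(f"factor={factor:.2f}  kernel weights={np.round(beta, 3)}")
```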

Figure 1: Dataset and Binary Classification Boundaries

Figure 2: Kernel weights with varying distance between the circles

Figure 3: Classification error with varying distance between the circles

Results:

• As the distance increases, higher weight is given to the larger-width Gaussian kernel in the multi-kernel setting.

• Error reduces as the distance between the circles increases, i.e. with more separability in the data.

• Multi-kernel learning gives the minimum error at all distances, i.e. for varying separability.

• Multi-kernel learning gives more weight to the efficient kernel, the one which separates the data better.

• Multi-kernel learning performs better than single kernels regardless of data separability.

6.2 Experiment 2

Dataset: Random samples generated from a Gaussian mixture model of four Gaussians with different means and different (non-diagonal) covariance matrices. The samples generated from Gaussians 1 and 4 are assigned to the positive class (red) and those from Gaussians 2 and 3 to the negative class (blue), as shown in Fig. 4.

Experiment: Binary classification is performed using multi-kernel learning with two Gaussian kernels of widths 0.25 and 25, and the resulting separating boundary and error are compared with single-kernel classification using each of the kernels from the multi-kernel setting. The lower-width kernel tends to overfit and the higher-width kernel tends to underfit. Validation is done on data points taken from a rectangular region covering all four Gaussians.
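A hedged sketch of this setup is shown below; the means and covariance are placeholders rather than the values used in the report, and wrapper_mkl is the illustrative helper from Section 4.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
means = [(-2, -2), (-2, 2), (2, -2), (2, 2)]     # placeholder means, not the report's values
cov = np.array([[1.0, 0.4], [0.4, 1.0]])         # non-diagonal covariance (placeholder)
X = np.vstack([rng.multivariate_normal(m, cov, 100) for m in means])
y = np.repeat([1, -1, -1, 1], 100)               # Gaussians 1 and 4 -> +1;  2 and 3 -> -1

kernels = [rbf_kernel(X, gamma=1.0 / (2 * w ** 2)) for w in (0.25, 25.0)]
beta, _ = wrapper_mkl(kernels, y, C=1.0)         # illustrative helper from Section 4
print("learned kernel weights:", beta)
```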


Figure 4: Dataset of samples from four Gaussians with binary classes

Figure 5: Decision boundary in the multi-kernel setting

Figure 6: Decision boundaries with various kernels: MKL, Gaussian(0.25), Gaussian(25)

Results:

Kernel          Accuracy
MKL             92.26
Gaussian(0.25)  87.40
Gaussian(25)    89.43

• Multi-kernel learning gives more weight to the efficient kernel, the one which separates the data better.

• Multi-kernel learning avoids both overfitting (width 0.25) and underfitting (width 25).

• Multi-kernel learning performs better than single kernels regardless of the data distribution.

6.3 Experiment 3

Dataset: Random samples generated from a Gaussian mixture model of two Gaussians with different means and different (non-diagonal) covariance matrices (Fig. 7). The samples generated from the smaller Gaussian are assigned to the positive class (red) and those from the bigger one to the negative class (blue), as shown in Fig. 7. The difference between the means of the two Gaussians is varied during classification.

Experiment: Starting with the non-separable case of zero distance between the two Gaussians (highest classification error), we gradually increase the distance between them (low classification error) and perform binary classification using multiple kernels. As the boundary needed for separation becomes broader when the distance is increased, a high-width Gaussian kernel is preferred. We performed several experiments to demonstrate this, using multi-kernel binary classification with 2 Gaussian kernels of widths 0.25 and 200. We then compare the mean weight acquired by each kernel during multi-kernel learning with respect to the distance between the means (Fig. 8). We also compare the mean error of the multi-kernel classifier and of each individual kernel on a similar randomly generated noisy test set (Fig. 9).


Figure 7: Gaussian classes approaching and then drifting away

Figure 8: Kernel weight vs. distance between means (separability)

Figure 9: Classification error vs. distance between means (separability)

Results:

• Higher weight is given to the larger-width kernel as the distance increases in multi-kernel learning.

• Error reduces as the distance between the Gaussians increases, i.e. with more separability in the data.

• Multi-kernel learning gives the minimum error at all distances, i.e. for varying separability.

• Multi-kernel learning gives more weight to the efficient kernel, the one which separates the data better.

• Multi-kernel learning performs better than single kernels regardless of data separability.

6.4 Experiment 4

Datasets: Multiple datasets with binary classes, complex boundaries, and varying amounts of noise were used.

Experiment: To study further how multi-kernel learning performs over different data distributions, we formed multiple datasets and performed multi-kernel learning with several kernel types: Gaussian (width = 1), polynomial (degree = 4), sigmoid, and linear. The first 4 datasets have minimal noise of about 10%, whereas the next 4 are the same datasets with noise of around 20-40%. Decision boundaries and kernel weights were analysed for all of the datasets.
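A hedged sketch for one of the datasets (the moons dataset) is given below; kernel parameters not stated in the text are guesses, and wrapper_mkl is the illustrative helper from Section 4. The learned weights can then be compared against the table of kernel weights below.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import (linear_kernel, polynomial_kernel,
                                      rbf_kernel, sigmoid_kernel)

X, y = make_moons(n_samples=400, noise=0.1, random_state=0)
y = np.where(y == 1, 1, -1)
kernels = [rbf_kernel(X, gamma=0.5),             # Gaussian (width = 1)
           polynomial_kernel(X, degree=4),       # polynomial (degree = 4)
           sigmoid_kernel(X),                    # sigmoid
           linear_kernel(X)]                     # linear
beta, _ = wrapper_mkl(kernels, y, C=1.0)         # illustrative helper from Section 4
for name, b in zip(["Gaussian", "Polynomial", "Sigmoid", "Linear"], beta):
    print(f"{name:10s} {b:.3e}")
```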

Figure 10: Dataset 1: Closely spaced concentric circles with binary classes

Figure 11: Dataset 2: Far-spaced concentric circles with binary classes

Figure 12: Dataset 3: Moon-separated binary classes

Figure 13: Dataset 4: Closely spaced Gaussian blobs with noise (5%)

Figure 14: Dataset 5: Linearly separable binary classes

Figure 15: Dataset 6: Moon-separated binary classes with high noise (40%)

Figure 16: Dataset 7: Concentric circles with high noise (40%)

Figure 17: Dataset 8: Linearly separable binary classes with high noise

Results:

Dataset    Gaussian   Polynomial  Sigmoid    Linear
Dataset1   9.99e-01   2.09e-08    1.02e-10   1.05e-10
Dataset2   2.77e-03   9.97e-01    8.75e-07   9.10e-07
Dataset3   8.77e-01   8.27e-05    1.95e-06   1.23e-01
Dataset4   8.08e-01   9.56e-05    1.31e-07   1.91e-01
Dataset5   9.85e-04   9.99e-01    1.40e-07   6.13e-07
Dataset6   7.18e-01   1.79e-01    6.36e-08   1.02e-01
Dataset7   9.99e-01   2.33e-06    1.79e-08   3.10e-08
Dataset8   8.44e-01   1.61e-07    6.63e-09   1.55e-01

• Multi-kernel learning performs better regardless of the data distribution.

• With an increase in noise, the Gaussian and sigmoid kernels are preferred.

• The distribution of weights is sparse, which avoids complexity and overfitting.

• Different points in the dataset use different boundaries, i.e. different kernels, for separability.

• Complex boundaries can be learned using multi-kernel learning.

• The same multi-kernel model learns good boundaries for all datasets.

6.5 Experiment 5

Datasets: A 4-dimensional dataset in which two dimensions (1 & 2) show a concentric circular relation with respect to the binary classes, and the other two dimensions (3 & 4) show linear separability with respect to the binary classes. Random label noise of around 20-25% was introduced separately for both types of dimensions. Thus a heterogeneous-source dataset was created, with two dimensions following a radial relation and the other two a linear relation, and with different portions of examples misclassified. The relation between two dimensions of the different types is shown in Fig. 18.

Experiment: Binary classification was performed with two kernels, Gaussian (width = 5) and linear, combined in various ways. The feature-based multi-kernel classifier uses a linear combination of the Gaussian kernel applied to the radial features (dimensions 1 & 2) and the linear kernel applied to the linear features (dimensions 3 & 4). A reverse feature-based multi-kernel classifier was trained in the same way but with the kernels interchanged. A third classifier applied multi-kernel learning to all four dimensions combined. The fourth classifier used only dimensions 1 & 2 with a single linear kernel, and a fifth used only dimensions 3 & 4 with a single Gaussian kernel.
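A hedged sketch of the feature-based combination is given below; the placeholder data generator only mimics the structure described above (radial classes in dimensions 1 & 2, linearly separable classes in dimensions 3 & 4, label noise omitted for brevity), and wrapper_mkl is the illustrative helper from Section 4.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

rng = np.random.default_rng(0)
n = 400
angles = rng.uniform(0, 2 * np.pi, n)
radii = np.where(rng.random(n) < 0.5, 1.0, 2.0)               # inner vs. outer circle
X_rad = np.c_[radii * np.cos(angles), radii * np.sin(angles)]  # dims 1-2: radial structure
y = np.where(radii == 1.0, 1, -1)
X_lin = rng.normal(size=(n, 2)) + 1.5 * y[:, None]            # dims 3-4: linear structure
X4 = np.hstack([X_rad, X_lin])                                # the 4-dimensional dataset

kernels = [rbf_kernel(X4[:, :2], gamma=1.0 / (2 * 5.0 ** 2)),  # Gaussian (width 5) on dims 1-2
           linear_kernel(X4[:, 2:])]                           # linear kernel on dims 3-4
beta, _ = wrapper_mkl(kernels, y, C=1.0)                       # illustrative helper from Section 4
print("feature-based kernel weights:", beta)
```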

Figure 18: Four-dimensional data from linear & radial sources

Figure 19: Error & weight comparison across combination methods

Results:

• Feature-based multi-kernel learning performs better since the different dimensions separate the classes with different hyperplanes.

• The reverse feature-based kernel performed worst, as it uses exactly the wrong kernel for each feature type.

• The better performance of feature-based multi-kernel learning compared to the single linear and Gaussian kernels shows that it also correctly classifies some examples whose labels were flipped by noise.

• Feature-based kernels work well when each feature type (with or without noise) is paired with the correct kernel and the kernels are combined together.

6.6 Experiment 6

Dataset: Random samples generated for a regression problem of fitting a sine curve with a particular frequency. Several such sets were created by varying the sine curve frequency from 1 to 5.


Experiment: Starting with a low-frequency sine curve (easy to fit, low error) and gradually increasing the frequency (harder to fit, higher error), we performed regression using multiple kernels. As the curve becomes smoother at lower frequencies, a higher-width Gaussian kernel is preferred. We performed several experiments to demonstrate this, using multi-kernel regression with 5 Gaussian kernels of varying widths. We then compare the mean weight acquired by each kernel during multi-kernel learning with respect to the frequency of the curve (Fig. 20). We also compare the mean error of the multi-kernel model and of each individual kernel when fitting a similar randomly generated noisy test set (Fig. 21).

Figure 20: Kernel weights with varying frequency of the sine curve

Figure 21: Error with varying frequency of the sine curve

Results:

• Multi-kernel learning works well for regression problems such as fitting a complex sine curve.

• Higher weight is given to the larger-width kernel as the sine frequency decreases in multi-kernel learning.

• Error reduces as the frequency of the sine wave decreases, i.e. as the curve becomes easier to fit.

• Multi-kernel learning gives the minimum error at all frequencies.

• Multi-kernel learning gives more weight to the efficient kernel, the one which fits the data better.

• Multi-kernel learning performs better than single kernels regardless of the closeness of the points.


6.7 Experiment 7

Dataset 1: USPS handwritten digit data [2] with 4650 training and test examples, ten classes (one per digit), and feature dimension 256.

Dataset 2: Ionosphere dataset from the UCI repository [1] with 280 training and 71 test examples (10-fold cross-validated), binary classes, and feature dimension 34.

Experiment: Multi-kernel classification (multiclass for USPS, binary for Ionosphere) was performed with a mixture of two kernels, polynomial (degree 2) and Gaussian (width 15), and compared with classification using the individual kernels.

Results:

• Results on the Ionosphere dataset using single kernels and the multi-kernel classifier. The multi-kernel classifier performs better than both the polynomial and the Gaussian kernel.

Kernel      Accuracy
MKL         94.9
Gaussian    93.1
Polynomial  93.5

• Results on the USPS handwritten digit dataset using single kernels and the multi-kernel classifier. The multi-kernel classifier performs better than both the polynomial and the Gaussian kernel.

Kernel      Accuracy
MKL         94.1
Gaussian    92.1
Polynomial  91

7 Observation and Conclusion

• Multiple kernel learning automatically learns an efficient weighted distribution over kernels.

• Multiple kernel learning gives lower generalization error than any of the individual kernels, independent of data distribution and separability.

• Multiple kernel learning also learns efficiently in the presence of outliers and noisy data.

• Multiple kernel learning gives a sparse distribution of weights, acting as a regularizer.

• Multi-kernel learning avoids both underfitting and overfitting by finding the best distribution of weights.

• Complex boundaries can be learnt and curves can be fit using multi-kernel learning.

• Multiple kernel learning can learn from data coming from heterogeneous sources, i.e. multi-modal settings.

8 Future Work

• Apply the approach to real multi-modal datasets such as video with audio and subtitles.

• Experiment with non-convex and non-linear combinations of kernels.


References

[1] UCI Ionosphere dataset. https://archive.ics.uci.edu/ml/datasets/ionosphere

[2] USPS handwritten digit dataset. http://www.gaussianprocess.org/gpml/data/

[3] Bi, J., Zhang, T., and Bennett, K. P. Column-generation boosting methods for mixture of kernels. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2004), ACM, pp. 521-526.

[4] Boyd, S., and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

[5] Hettich, R., and Kortanek, K. O. Semi-infinite programming: theory, methods, and applications. SIAM Review 35, 3 (1993), 380-429.

[6] Ratsch, G., Demiriz, A., and Bennett, K. P. Sparse regression ensembles in infinite and finite hypothesis spaces. Machine Learning 48, 1-3 (2002), 189-218.

[7] Sonnenburg, S., Ratsch, G., Schafer, C., and Scholkopf, B. Large scale multiple kernel learning. The Journal of Machine Learning Research 7 (2006), 1531-1565.

[8] Bach, F. R., Lanckriet, G. R., and Jordan, M. I. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the Twenty-First International Conference on Machine Learning (2004), ACM, p. 6.
