
Page 1: Multiple Kernel Learning based Approach to Representation and Feature Selection for Image Classification using Support Vector Machines


Multiple Kernel Learning based Approach to Representation and Feature Selection for

Image Classification using Support Vector Machines

Dr. C. Chandra Sekhar

Speech and Vision Laboratory, Dept. of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai-600036

[email protected]

Presented at ICAC-09, Tiruchirappalli, on August 8, 2009

Page 2

Outline of the Talk

• Kernel methods for pattern analysis
• Support vector machines for pattern classification
• Multiple kernel learning
• Representation and feature selection
• Image classification

Page 3

Image Classification

• Residential Interiors
• Mountains
• Military Vehicles
• Sacred Places
• Sunsets & Sunrises

Page 4

Pattern Analysis

• Pattern: Any regularity, relation or structure in data or in the source of data
• Pattern analysis: Automatic detection and characterization of relations in data
• Statistical and machine learning methods for pattern analysis assume that the data is in vectorial form
• Relations among data are expressed as:
  – Classification rules
  – Regression functions
  – Cluster structures

Page 5

Evolution of Pattern Analysis Methods

• Stage I (1960s):
  – Learning methods to detect linear relations among data
  – Perceptron models
• Stage II (mid 1980s):
  – Learning methods to detect nonlinear patterns
  – Gradient descent methods: local minima problem
  – Multilayer feedforward neural networks
• Stage III (mid 1990s):
  – Kernel methods to detect nonlinear relations
  – Problem of local minima does not exist
  – Methods are applicable to non-vectorial data also: graphs, sets, texts, strings, biosequences, trees
  – Support vector machines

Page 6

Artificial Neural Networks

Page 7

Neuron with Threshold Logic Activation Function

• McCulloch-Pitts neuron
• Inputs x_1, …, x_M with weights w_1, …, w_M
• Activation value: a = Σ_{i=1}^{M} w_i x_i − θ
• Output signal: s = f(a), where f is the threshold logic activation function

Page 8

Perceptron

• Linearly separable classes: regions of two classes are separable by a linear surface (line, plane or hyperplane)

Page 9

Hard Problems

• Nonlinearly separable classes

Page 10

Neuron with Continuous Activation Function

• Inputs x_1, …, x_M with weights w_1, …, w_M
• Activation value: a = Σ_{i=1}^{M} w_i x_i − θ
• Output signal: s = f(a), where f is the sigmoidal activation function

Page 11

Multilayer Feedforward Neural Network

• Architecture:
  – Input layer: linear neurons
  – One or more hidden layers: nonlinear neurons
  – Output layer: linear or nonlinear neurons

[Figure: fully connected network with an input layer of I neurons (inputs x_1, …, x_I), a hidden layer of J neurons, and an output layer of K neurons (outputs s_1, …, s_K)]

Page 12

Gradient Descent Method

[Figure: error as a function of a weight, showing the global minimum and multiple local minima]

Page 13

Artificial Neural Networks: Summary

• Perceptrons, with the threshold logic function as the activation function, are suitable for linearly separable classes
• Multilayer feedforward neural networks, with the sigmoidal function as the activation function, are suitable for nonlinearly separable classes
  – Complexity of the model depends on:
    • Dimension of the input pattern vector
    • Number of classes
    • Shapes of the decision surfaces to be formed
  – Architecture of the model is empirically determined
  – A large number of training examples is required when the complexity of the model is high
  – Local minima problem
  – Suitable for vectorial data only

Page 14

Kernel Methods

Page 15

Key Aspects of Kernel Methods

• Kernel methods involve:
  – Nonlinear transformation of data to a higher dimensional feature space induced by a Mercer kernel
  – Detection of optimal linear solutions in the kernel feature space
• Transformation to a higher dimensional space is expected to help convert nonlinear relations into linear relations (Cover's theorem):
  – Nonlinearly separable patterns to linearly separable patterns
  – Nonlinear regression to linear regression
  – Nonlinear separation of clusters to linear separation of clusters
• Pattern analysis methods are implemented in such a way that the kernel feature space representation is not explicitly required; they involve computation of pair-wise inner products only
• The pair-wise inner products are computed efficiently, directly from the original representation of the data, using a kernel function (the kernel trick)
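The kernel trick can be made concrete with a minimal numpy sketch (illustrative values): for the homogeneous degree-2 polynomial kernel in two dimensions, evaluating k(x, z) = (x·z)² in the input space gives exactly the inner product in the 3-D feature space Φ(x) = (x₁², x₂², √2·x₁x₂), without ever forming Φ explicitly.

```python
import numpy as np

# Explicit feature map for the degree-2 homogeneous polynomial kernel in 2-D.
def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

# Kernel function: computed directly in the original 2-D input space.
def poly_kernel(x, z):
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(poly_kernel(x, z))           # inner product computed via the kernel trick: 16.0
print(np.dot(phi(x), phi(z)))      # same value, computed in the feature space: 16.0
```

Both evaluations agree, which is the whole point: algorithms that need only pair-wise inner products never have to construct the higher dimensional representation.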

Page 16

Illustration of Transformation

• Mapping from the 2-D input space to a 3-D feature space: Φ(x, y) = (x², y², √2·xy)

Page 17

Support Vector Machines for Pattern Classification

Page 18

Optimal Separating Hyperplane


Page 19

Maximum Margin Hyperplane

Page 20

Maximum Margin Hyperplane

• Separating hyperplane: wᵀx + b = 0
• Canonical planes through the support vectors: wᵀx + b = +1 and wᵀx + b = −1
• Margin = 1 / ||w||
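The margin geometry can be checked numerically. This is a sketch with hand-picked illustrative values (not a trained SVM): given a weight vector w and bias b, any point on a canonical plane wᵀx + b = ±1 lies at distance 1/||w|| from the separating hyperplane.

```python
import numpy as np

# Illustrative hyperplane parameters (||w|| = 5).
w = np.array([3.0, 4.0])
b = -2.0

# Margin as stated on the slide: distance from the separating hyperplane
# to either canonical plane.
margin = 1.0 / np.linalg.norm(w)
print(margin)  # 0.2

# A point on the canonical plane w^T x + b = +1 (3*1 + 4*0 - 2 = 1).
x_sv = np.array([1.0, 0.0])

# Point-to-hyperplane distance |w^T x + b| / ||w|| equals the margin.
dist = abs(np.dot(w, x_sv) + b) / np.linalg.norm(w)
print(np.isclose(dist, margin))  # True
```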


Page 31

Kernel Functions

• Mercer kernels: kernel functions that satisfy Mercer's theorem
• Kernel Gram matrix: the matrix containing the values of the kernel function on all pairs of data points in the training data set
• The kernel Gram matrix should be positive semi-definite (i.e., its eigenvalues should be non-negative) for convergence of the iterative method used for solving the constrained optimization problem
• Kernels for vectorial data:
  – Linear kernel
  – Polynomial kernel
  – Gaussian kernel
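The positive semi-definiteness condition is easy to verify empirically. The sketch below (random illustrative data; kernel parameters chosen arbitrarily) builds the Gram matrix for the three vectorial kernels named above and checks that no eigenvalue is negative beyond numerical tolerance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))          # 6 sample points in R^3

def gram(kernel, X):
    """Kernel Gram matrix: kernel evaluated on all pairs of data points."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

linear   = lambda x, z: np.dot(x, z)
poly     = lambda x, z: (np.dot(x, z) + 1.0) ** 2          # degree-2 polynomial
gaussian = lambda x, z: np.exp(-np.sum((x - z) ** 2))      # width parameter = 1

for k in (linear, poly, gaussian):
    K = gram(k, X)
    # Mercer condition in matrix form: all eigenvalues non-negative
    # (allow a tiny negative tolerance for floating-point error).
    print(np.linalg.eigvalsh(K).min() >= -1e-9)  # True
```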

Page 32

Kernel Functions

• Kernels for non-vectorial data:
  – Kernels on graphs
  – Kernels on sets
  – Kernels for texts
    • Vector space kernels
    • Semantic kernels
    • Latent semantic kernels
  – Kernels on strings
    • String kernels
    • Trie-based kernels
  – Kernels on trees
  – Kernels from generative models
    • Fisher kernels
  – Kernels on images
    • Hausdorff kernels
    • Histogram intersection kernels
• Kernels learnt from data
• Kernels optimized for data


Page 36

Applications of Kernel Methods

• Genomics and computational biology
• Knowledge discovery in clinical microarray data analysis
• Recognition of white blood cells of leukaemia
• Classification of interleaved human brain tasks in fMRI
• Image classification and retrieval
• Hyperspectral image classification
• Perceptual representations for image coding
• Complex SVM approach to OFDM coherent demodulation
• Smart antenna array processing
• Speaker verification
• Kernel CCA for learning the semantics of text

G. Camps-Valls, J. L. Rojo-Alvarez and M. Martinez-Ramon, Kernel Methods in Bioengineering, Signal and Image Processing, Idea Group Publishing, 2007

Page 37

Work on Kernel Methods at IIT Madras

Focus: Design of suitable kernels and development of kernel methods for speech, image and video processing tasks

Tasks:

• Speech recognition

• Speech segmentation

• Speaker segmentation

• Voice activity detection

• Handwritten character recognition

• Face detection

• Image classification

• Video shot boundary detection

• Time series data processing

Page 38

Kernel Methods: Summary

• Kernel methods involve:
  – Nonlinear transformation of data to a higher dimensional feature space induced by a Mercer kernel
  – Construction of optimal linear solutions in the kernel feature space
• Complexity of the model does not depend on the dimension of the data
• Models can be trained with small datasets
• Kernel methods can be used for non-vectorial data also
• Performance of kernel methods depends on the choice of the kernel; design of a suitable kernel is a research problem
• Approaches to designing kernels, by learning them from data or by constructing kernels optimized for a given dataset, are being explored

Page 39

Multiple Kernel Learning based Approach to Representation and Feature Selection for

Image Classification using Support Vector Machines

Dileep A. D. and C. Chandra Sekhar, "Representation and feature selection using multiple kernel learning", in Proceedings of the International Joint Conference on Neural Networks, vol. 1, pp. 717-722, Atlanta, USA, June 2009.

Page 40

Representation and Feature Selection

• Early fusion for combining multiple representations
• Feature selection:
  – Compute a class separability measuring criterion function
  – Rank features in decreasing order of the criterion function
  – Features corresponding to the best values of the criterion function are then selected
• The multiple kernel learning (MKL) approach develops an integrated framework to perform both representation selection and feature selection

[Figure: early fusion — Representation 1, Representation 2, …, Representation m concatenated and fed to a single classifier]
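The ranking step described above can be sketched as follows. The slide does not name the criterion function, so this sketch assumes a Fisher-style score (squared between-class mean difference over within-class variance) per feature, on synthetic illustrative data where only one feature separates the classes.

```python
import numpy as np

rng = np.random.default_rng(2)
X0 = rng.normal(0.0, 1.0, size=(50, 4))   # class 0 samples, 4 features
X1 = rng.normal(0.0, 1.0, size=(50, 4))   # class 1 samples
X1[:, 2] += 3.0                           # only feature 2 separates the classes

def fisher_score(a, b):
    """Class separability criterion per feature (assumed: Fisher-style score)."""
    return (a.mean(0) - b.mean(0)) ** 2 / (a.var(0) + b.var(0) + 1e-12)

scores = fisher_score(X0, X1)
ranking = np.argsort(scores)[::-1]        # decreasing order of the criterion
print(ranking[0])  # 2 -- the discriminative feature ranks first
```

Features at the top of `ranking` would then be kept; MKL replaces this two-step recipe with a single optimization.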

Page 41

Multiple Kernel Learning (MKL)

• A new kernel can be obtained as a linear combination of many base kernels:

  k(x_i, x_j) = Σ_l γ_l k_l(x_i, x_j)

  – Each of the base kernels k_l is a Mercer kernel
  – The non-negative coefficient γ_l represents the weight of the lth base kernel
• Issues:
  – Choice of the base kernels
  – Technique for determining the values of the weights γ_l
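A quick numerical sketch of why the combination is again a Mercer kernel: with non-negative weights γ_l, the combined Gram matrix is a non-negatively weighted sum of positive semi-definite matrices, hence itself positive semi-definite. The data, base kernels and weights below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 4))

def gram(kernel):
    """Gram matrix of a kernel over the sample set X."""
    return np.array([[kernel(x, z) for z in X] for x in X])

base_kernels = [
    lambda x, z: np.dot(x, z),                         # linear
    lambda x, z: (np.dot(x, z) + 1.0) ** 3,            # degree-3 polynomial
    lambda x, z: np.exp(-0.5 * np.sum((x - z) ** 2)),  # Gaussian
]
gamma = np.array([0.2, 0.3, 0.5])                      # non-negative weights

# Combined kernel Gram matrix: K = sum_l gamma_l * K_l
K = sum(g * gram(k) for g, k in zip(gamma, base_kernels))
print(np.linalg.eigvalsh(K).min() >= -1e-9)  # True: still PSD
```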

Page 42

MKL Framework for SVM

• In the SVM framework, the MKL task is treated as optimizing the kernel weights while training the SVM
• The optimization problem of the SVM (dual problem) is given as:

  max_α  Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} Σ_{l=1}^{m} α_i α_j y_i y_j γ_l k_l(x_i, x_j)

  s.t.   Σ_{i=1}^{n} α_i y_i = 0
         0 ≤ α_i ≤ C,  i = 1, …, n
         γ_l ≥ 0,  l = 1, …, m

• Both the Lagrange coefficients α_i and the base kernel weights γ_l are determined by solving the optimization problem

Page 43

Representation and Feature Selection using MKL

• A base kernel is computed for each representation of the data
• The MKL method is used to determine the weights for the base kernels
• Base kernels with non-zero weights indicate the relevant representations
• Similarly, a base kernel is computed for each feature in a representation
• The MKL method is used to select the relevant features, namely those whose corresponding base kernels have non-zero weights
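The selection step itself is simple once the MKL weights are available. The sketch below assumes an MKL solver has already returned a weight γ_l per base kernel (one kernel per representation); the representation names and weight values are taken from the Gaussian-kernel results later in the deck, purely as an illustration.

```python
# gamma weights per representation, as returned by some MKL solver
# (values here are illustrative, borrowed from the Gaussian-kernel study).
weights = {
    "Haar transform of quantized HSV colour histogram": 0.6045,
    "Average CIE L*a*b* colour": 0.2326,
    "MPEG-7 edge histogram": 0.1535,
    "Histogram of four Sobel edge directions": 0.0,
}

# Representations whose base kernels received a non-zero weight are relevant.
selected = [name for name, g in weights.items() if g > 0.0]
print(selected)
```

The same pattern applies at the feature level: one base kernel per feature, then keep the features with non-zero weights.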

Page 44

Studies on Image Categorization

• Dataset:
  – Data of 5 classes from the Corel image database
  – Number of images per class: 100 (Training: 90, Test: 10; 10-fold evaluation)
• Classes:
  – Class 1: African specialty animals
  – Class 2: Alaskan wildlife
  – Class 3: Arabian horses
  – Class 4: North American wildlife
  – Class 5: Wildlife of Galapagos

Page 45

Representations

Information | Representation | Tiling | Dimension
Colour | DCT coefficients of average colour in 20 x 20 grid | Global | 12
Colour | CIE L*a*b* colour coordinates of two dominant colour clusters | Global | 6
Shape | Magnitude of the 16 x 16 FFT of Sobel edge image | Global | 128
Colour and structure | Haar transform of quantized HSV colour histogram | Global | 256
Shape | MPEG-7 edge histogram | 4 x 4 | 80
Colour | Three first central moments of L*a*b* CIE colour distribution | 5 | 45
Colour | Average CIE L*a*b* colour | 5 | 15
Shape | Co-occurrence matrix of four Sobel edge directions | 5 | 80
Shape | Histogram of four Sobel edge directions | 5 | 20
Texture | Texture feature based on histogram of relative brightness of neighbouring pixels | 5 | 40

Page 46

Classification Performance for Different Representations

• Classification accuracy (in %) of SVM based classifier

Representation | Linear Kernel | Gaussian Kernel
DCT coefficients of average colour in 20 x 20 grid | 60.8 | 69.2
CIE L*a*b* colour coordinates of two dominant colour clusters | 60.8 | 70.2
Magnitude of the 16 x 16 FFT of Sobel edge image | 46.6 | 57.6
Haar transform of quantized HSV colour histogram | 84.2 | 86.8
MPEG-7 edge histogram | 53.8 | 53.4
Three first central moments of L*a*b* CIE colour distribution | 83.8 | 85.6
Average CIE L*a*b* colour | 68.2 | 78.6
Co-occurrence matrix of four Sobel edge directions | 44.0 | 49.8
Histogram of four Sobel edge directions | 36.8 | 49.2
Texture feature based on histogram of relative brightness of neighbouring pixels | 46.0 | 51.6

Page 47

Class-wise Performance

• Gaussian kernel SVM based classifier; classification accuracy (in %)

Representation | Class1 | Class2 | Class3 | Class4 | Class5
DCT coefficients of average colour in 20 x 20 grid | 77 | 40 | 97 | 74 | 73
CIE L*a*b* colour coordinates of two dominant colour clusters | 78 | 39 | 96 | 65 | 80
Magnitude of the 16 x 16 FFT of Sobel edge image | 53 | 29 | 83 | 61 | 72
Haar transform of quantized HSV colour histogram | 86 | 80 | 99 | 90 | 92
MPEG-7 edge histogram | 38 | 31 | 82 | 44 | 56
Three first central moments of L*a*b* CIE colour distribution | 86 | 79 | 100 | 86 | 80
Average CIE L*a*b* colour | 75 | 52 | 96 | 81 | 84
Co-occurrence matrix of four Sobel edge directions | 59 | 30 | 77 | 41 | 52
Histogram of four Sobel edge directions | 49 | 22 | 71 | 24 | 55
Texture feature based on histogram of relative brightness of neighbouring pixels | 40 | 33 | 89 | 43 | 52

Page 48

Performance for Combination of All Representations

Combination of representations | Dimension | Linear Kernel (%) | Gaussian Kernel (%)
Supervector obtained using all the representations | 682 | 90.40 | 91.20

• The supervector obtained using all representations gives better performance than any single representation
• The dimension of the combined representation is high

Page 49

Representations Selected using MKL

• Linear kernel as base kernel; weights of the selected (non-zero weight) representations:
  – Haar transform of quantized HSV colour histogram (dim: 256): 0.5627
  – Three first central moments of L*a*b* CIE colour distribution (dim: 45): 0.1615
  – MPEG-7 edge histogram (dim: 80): 0.1225
  – Magnitude of the 16 x 16 FFT of Sobel edge image (dim: 128): 0.1170
  – Co-occurrence matrix of four Sobel edge directions (dim: 80): 0.0261
  – Texture feature based on histogram of relative brightness of neighbouring pixels (dim: 40): 0.0101
• Representations with zero weight: DCT coefficients of average colour in 20 x 20 grid, CIE L*a*b* colour coordinates of two dominant colour clusters, Average CIE L*a*b* colour, Histogram of four Sobel edge directions

Page 50

Representations Selected using MKL

• Gaussian kernel as base kernel; weights of the selected (non-zero weight) representations:
  – Haar transform of quantized HSV colour histogram (dim: 256): 0.6045
  – Average CIE L*a*b* colour (dim: 15): 0.2326
  – MPEG-7 edge histogram (dim: 80): 0.1535
  – CIE L*a*b* colour coordinates of two dominant colour clusters (dim: 6): 0.0064
  – Magnitude of the 16 x 16 FFT of Sobel edge image (dim: 128): 0.0030
• Representations with zero weight: DCT coefficients of average colour in 20 x 20 grid, Three first central moments of L*a*b* CIE colour distribution, Co-occurrence matrix of four Sobel edge directions, Histogram of four Sobel edge directions, Texture feature based on histogram of relative brightness of neighbouring pixels

Page 51

Performance for Different Methods of Combining Representations

• Classification accuracy (in %)

Combination of representations | Dimension (Linear) | Accuracy (Linear) | Dimension (Gaussian) | Accuracy (Gaussian)
Supervector of all the representations | 682 | 90.4 | 682 | 91.2
MKL approach | 682 | 92.0 | 682 | 92.4
Supervector of representations with non-zero weights determined using the MKL method | 629 | 91.8 | 485 | 93.0

• The supervector obtained using representations with non-zero weights gives performance comparable to the MKL approach, with a significantly reduced dimension of the combined representation

Page 52

Feature Selection using MKL

• Representations and features selected using MKL with the Gaussian kernel as base kernel
• Classification accuracy (in %)

Representation | Dim (Actual) | Dim (Selected) | Accuracy (Actual features) | Accuracy (Weighted features)
CIE L*a*b* colour coordinates of two dominant colour clusters | 6 | 6 | 70.2 | 70.6
Magnitude of the 16 x 16 FFT of Sobel edge image | 128 | 62 | 57.6 | 60.4
Haar transform of quantized HSV colour histogram | 256 | 63 | 86.8 | 88.0
MPEG-7 edge histogram | 80 | 53 | 53.4 | 53.8
Average CIE L*a*b* colour | 15 | 14 | 78.6 | 79.0

Page 53

Performance of Different Methods for Feature Selection

Method | Dimension | Classification accuracy (in %)
Supervector of representations with non-zero weight values determined using the MKL method | 485 | 93.0
Supervector of features with non-zero weight values determined using the MKL method | 198 | 93.0

Page 54

Multiple Kernel Learning: Summary

• The multiple kernel learning (MKL) approach is useful for combining representations as well as selecting features, with the weights of the representations/features determined automatically
• Selection of representations and features using the MKL approach is studied for an image categorization task
• Combining representations using the MKL approach performs marginally better than the combination of all representations
• The dimensionality of the feature vector obtained using the MKL approach is significantly lower

Page 55

Text Books

1. Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
2. B. Yegnanarayana, Artificial Neural Networks, Prentice-Hall of India, 1999.
3. Satish Kumar, Neural Networks - A Class Room Approach, Tata McGraw-Hill, 2004.
4. Simon Haykin, Neural Networks - A Comprehensive Foundation, 2nd Edition, Prentice Hall, 1999.
5. John Shawe-Taylor and Nello Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
6. B. Scholkopf and A. J. Smola, Learning with Kernels - Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, 2002.
7. R. Herbrich, Learning Kernel Classifiers - Theory and Algorithms, MIT Press, 2002.
8. B. Scholkopf, K. Tsuda and J.-P. Vert, Kernel Methods in Computational Biology, MIT Press, 2004.
9. G. Camps-Valls, J. L. Rojo-Alvarez and M. Martinez-Ramon, Kernel Methods in Bioengineering, Signal and Image Processing, Idea Group Publishing, 2007.
10. F. Camastra and A. Vinciarelli, Machine Learning for Audio, Image and Video Analysis - Theory and Applications, Springer, 2008.

Web resource: www.kernelmachines.org

Page 56

Thank You