object recognition techniques (2)



ANALYSIS OF HIERARCHICAL OBJECT RECOGNITION TECHNIQUES

    For the Degree of

    Doctor of Philosophy

    In

    SUBJECT

    Submitted to

    SHRI VENKATESHWARA UNIVERSITY,

    Gajraula, Amroha (UTTAR PRADESH)

Research Supervisor: Dr. Name

Research Scholar:

    2014


    DECLARATION

    I hereby declare that this submission is my own work and that, to the best of my

    knowledge and belief, it contains no material previously published or written by

    another person nor material which to a substantial extent has been accepted for the

    award of any other degree or diploma of the university or other institute of higher

    learning, except where due acknowledgment has been made in the text.

    Signature of Research Scholar

    Name :

    Enrollment No.


    CERTIFICATE

Certified that Name of Student (Enrollment No.) has carried out the research work presented in this thesis entitled Title of Thesis for the award of Doctor of Philosophy from Shri Venkateshwara University, Gajraula under my/our (print only that which is applicable) supervision. The thesis embodies results of original work and studies carried out by the student himself/herself (print only that which is applicable), and the contents of the thesis do not form the basis for the award of any other degree to the candidate or to anybody else from this or any other University/Institution.

    Signature Signature

    (Name of Supervisor) (Name of Supervisor)

(Designation) (Designation)

    (Address) (Address)

    Date:


    SHRI VENKATESHWARA UNIVERSITY, GAJRAULA

CERTIFICATE OF THESIS SUBMISSION FOR EVALUATION

(To be submitted in duplicate)

1. Name: ..........

2. Enrollment No.:

3. Thesis title: ...........

4. Degree for which the thesis is submitted:

5. Department of the University to which the thesis is submitted: ..........

6. Faculty of the University to which the thesis is submitted: ..........

    7. Thesis Preparation Guide was referred to for preparing the thesis. Yes No

    8. Specifications regarding thesis format have been closely followed. Yes No

    9. The contents of the thesis have been organized based on the guidelines. Yes No

    10. The thesis has been prepared without resorting to plagiarism. Yes No

    11. All sources used have been cited appropriately. Yes No

    12. The thesis has not been submitted elsewhere for a degree. Yes No

    13. Submitted two copies of spiral bound thesis plus one CD. Yes No

    14. Submitted five copies of synopsis approved by RDC. Yes No

    15. Submitted two copies of spiral bound research summary. Yes No

Name: ...    Enrollment No.: ...


    SHRI VENKATESHWARA UNIVERSITY, GAJRAULA

    CERTIFICATE OF FINAL THESIS SUBMISSION

(To be submitted in duplicate)

1. Name: .........

    2. Enrollment No. :

    3. Thesis title:...

    ....

    4. Degree for which the thesis is submitted: ........

5. Department of the University to which the thesis is submitted :

    ...

    6. Faculty of the University to which the thesis is submitted :

    7. Thesis Preparation Guide was referred to for preparing the thesis. Yes No

    8. Specifications regarding thesis format have been closely followed. Yes No

    9. The contents of the thesis have been organized based on the guidelines. Yes No

10. The thesis has been prepared without resorting to plagiarism. Yes No

11. All sources used have been cited appropriately. Yes No

12. The thesis has not been submitted elsewhere for a degree. Yes No

13. All the corrections have been incorporated. Yes No

14. Submitted five hard bound copies of the thesis plus one CD. Yes No

15. Submitted five copies of research summary. Yes No

(Signature(s) of the Supervisor(s)) (Signature of the Candidate)


    Name(s): Name..

    Enrollment No

    1) ABSTRACT

Object recognition systems constitute a deeply entrenched and omnipresent

    component of modern intelligent systems. Research on object recognition algorithms

    has led to advances in factory and office automation through the creation of optical

    character recognition systems, assembly-line industrial inspection systems, as well as

    chip defect identification systems. It has also led to significant advances in medical

    imaging, defence and biometrics. In this paper we discuss the evolution of computer-

    based object recognition systems over the last fifty years, and overview the successes

    and failures of proposed solutions to the problem. We survey the breadth of

    approaches adopted over the years in attempting to solve the problem, and highlight

    the important role that active and attentive approaches must play in any solution that

    bridges the semantic gap in the proposed object representations, while simultaneously

    leading to efficient learning and inference algorithms. From the earliest systems

    which dealt with the character recognition problem, to modern visually-guided agents

    that can purposively search entire rooms for objects, we argue that a common thread

    of all such systems is their fragility and their inability to generalize as well as the

    human visual system can. At the same time, however, we demonstrate that the

    performance of such systems in strictly controlled environments often vastly

    outperforms the capabilities of the human visual system. We conclude our survey by


    arguing that the next step in the evolution of object recognition algorithms will

    require radical and bold steps forward in terms of the object representations, as well

    as the learning and inference algorithms used.

    LIST OF TABLES

Table 1: Comparison of kernel descriptors (KDES) and hierarchical kernel descriptors (HKDES) on CIFAR10.

Table 2: Comparisons on the RGB-D Object Dataset. RGB denotes features over RGB images and depth denotes features over depth images.

Table 3: Comparisons to existing recognition approaches using a combination of depth features and image features. Nonlinear SVMs use a Gaussian kernel.

    LIST OF FIGURES

Figure 1:


    Different components of an object recognition system are shown

    Figure 2:

    Hierarchical Kernel Descriptors

    Figure 3:

    Examples of correspondences established between frames of a database image (left) and a

    query image (right).

    Figure 4:

    Examples of corresponding query (left columns) and database (right columns) images from

the ZuBuD dataset. The image pairs exhibit occlusion, varying illumination, and viewpoint and orientation changes.

    Figure 5:

    Examples of corresponding query (left columns) and database (right columns) images from

the ZuBuD dataset. The image pairs exhibit occlusion, varying illumination, and viewpoint and orientation changes.

    Figure 6:

Image retrieval on the FOCUS dataset: query localisation results. Query images, database images, and query localisations.

    Figure 7:

    An example of matches established on a wide-baseline stereo pair.

    Figure 8:

    Overview of the spatiotemporal (4-D) approach to dynamic vision (adapted from [50, 268]).


Chart 1: Summary of the 1989-2009 papers in Table 5 on active object detection. By definition, search efficiency is not the primary concern in these systems, since by assumption the object is always in the sensor's field of view. However, inference scalability constitutes a significant component of such systems. We notice very little use of function and context in these systems. Furthermore, training such systems is often non-trivial.

    Figure 9:

    A sequence of viewpoints from which the system developed by Wilkes and Tsotsos [266]

    actively recognizes an origami object.

    Figure 10:

The object verification and next viewpoint selection algorithm used in [280] (diagram adapted from [280]).

    Figure 11:

    Graphical model for next-view-planning as proposed in [284, 285].

    Figure 12:

    The aspects of an object and its congruence classes (adapted from Gremban and Ikeuchi

    [287]).

    Figure 13:

An aspect resolution tree used to determine whether there is a single interval of parameter values that satisfies certain constraints (adapted from Gremban and Ikeuchi [287]).

    Figure 14:

    The two types of view degeneracies proposed by Dickinson et al. [49].

Chart 2: Summary of the 1992-2012 papers on active object localization and recognition from Table 6. As expected, search efficiency and the role of 3D information are significantly more prominent in these papers (as compared to Chart 7).

    Figure 15:


Reconstructionist vision vs. Selective Perception, after Rimey and Brown [302].

Laporte and Arbel [291] build upon this work and choose the best next viewpoint by calculating the symmetric KL divergence (Jeffrey divergence) of the likelihood of the observed data given the assumption that this data resulted from two views of two distinct objects. By weighing each Jeffrey divergence by the product of the probabilities of observing the two competing objects and their two views, they can determine the next view which provides the object identity hypothesis, thus again demonstrating the active vision system's direct applicability in the standard recognition pipeline (see Fig. 1).

    Figure 16:

    A PART-OF Bayes net for a table-top scenario, similar to what was proposed by Rimey and

    Brown [302].

Figure 17:

An IS-A Bayes tree for a table-top scenario that was used by Rimey and Brown [302].

    Figure 18:

    The direct-search model, which includes nodes that affect direct search efficiency (unboxed

    nodes) and explicit model parameters (boxed nodes). Adapted from Wixson and Ballard

    [303].

Figure 19:

Junction types proposed by Malik [321] and used by Brunnstrom et al. [306] for recognizing man-made objects.

    Figure 20:

    An ASIMO humanoid robot was used by Andreopoulos et al. [24] to actively search an

    indoor environment.

    Figure 21:

    An example of ASIMO pointing at an object once the target object is successfully localized

    in a 3D environment [24].

    Figure 22:


    The twenty object classes that the 2011 PASCAL dataset contains. Some of the earlier

    versions of the PASCAL dataset only used subsets of these object classes. Adapted from

    [324]

    Chart 3: Summary of the PASCAL Challenge papers from Table 7 which correspond to

algorithms published between 2002 and 2011. Notice that the winning PASCAL challenge algorithms typically make little use of function, context, and 3D, and make moderate use of texture.

    Figure 23:

    The HOG detector of Dalal and Triggs (from [335] with permission). (a): The average

    gradient image over a set of registered training images. (b), (c): Each pixel demonstrates the

    maximum and minimum (respectively) SVM weight of the corresponding block. (d): The test

image used in the rest of the subfigures. (e): The computed R-HOG descriptor of the image in subfigure (d). (f), (g): The R-HOG descriptor weighted by the positive and negative SVM weights respectively.

    Figure 24:

    Examples of the Harris-Laplace detector and the Laplacian detector, which were used

    extensively in [142] as interest-point/region detectors (figure reproduced from [142] with

    permission).

Figure 25:

The distributions of various object classes corresponding to six feature classes.

    Figure 26:

    Example of the algorithm by Felzenszwalb et al. [366] localizing a person using the coarse

    template representation and the higher resolution subpart templates of the person (from [366]

    with permission).

    Figure 27:

    The HOG feature pyramid used in [366], showing the coarse root-level template and the

higher resolution templates of the person's subparts (from [366] with permission).

    Figure 28:


    The distribution of edges and appearance patches of certain car model training images used

    by Chum and Zisserman [365], with the learned regions of interest overlaid (from [365], with

    permission).

    Figure 29:

The 35 most frequent 2AS constructed from 10 outdoor images (from [367] with permission).

It is easier to understand the left image's contents (e.g., a busy road with mountains in the background) if the cars in the image have first been localized. Conversely, in the right image, occlusions make the object localization problem difficult. Thus, prior knowledge that the image contains exclusively cars can make the localization problem easier (from [361] with permission).

    Figure 30:

Demonstrating how top-down category-specific attentional biases can modulate the shape-words during the bag-of-words histogram construction (from [358] with permission).

Such approaches start from low-level features (e.g., edges, color) and group them in more complex ways in order to achieve more universal representations of object parts. In terms of object verification and object hypothesizing (see Fig. 1), the work by Felzenszwalb et al. [366] represents the most successful approach tested in PASCAL 2007 for using a coarse generative model of object parts to improve recognition performance.

    Figure 31:

(a) The 3-layer tree-like object representation in [348]. (b) A reference template without any part displacement, showing the root-node bounding box (blue), the centers of the 9 parts in the 2nd layer (yellow dots), and the 36 parts at the last layer (purple). (c) and (d) show object localizations (from [348] with permission).

    Figure 32:

    On using context to mitigate the negative effects of ambiguous localizations [350]. The

greater the ambiguities, the greater the role contextual knowledge plays (from [350] with permission).

    Figure 33:


An example of a feature extraction stage of the type F_CSG - R_abs - N - P_A. An input image (or a feature map) is passed through a non-linear filter bank, followed by rectification, local contrast normalization and spatial pooling/sub-sampling.

    Figure 34:

Test error rate vs. number of training samples per class on the NORB dataset. Although pure random features perform surprisingly well when training data is very scarce, learning improves the performance significantly for larger amounts of training data. Absolute value rectification (R_abs) and local normalization (N) are shown to improve the performance in all cases.

    Figure 35:

Left: random stage-1 filters, and corresponding optimal inputs that maximize the response of each corresponding complex cell in an F_CSG - R_abs - N - P_A architecture.

    Figure 36:

Left: A dictionary with 128 elements, learned with a patch-based sparse coding model. Right: A dictionary with 128 elements, learned with a convolutional sparse coding model. The dictionary learned with the convolutional model spans the orientation space much more uniformly. In addition, it can be seen that the diversity of filters obtained by the convolutional sparse model is much richer compared to the patch-based one.

    Figure 37:

Top Left: Smooth shrinkage function. Parameters β and b control the smoothness and location of the kink of the function. As β increases, it converges to the soft-thresholding operator. Top Right: Total loss as a function of the number of iterations. The vertical dotted line marks the iteration number when the diagonal Hessian approximation was updated. It is clear that for both encoder functions, the Hessian update improves the convergence significantly. Bottom: 128 convolutional filters (k) learned in the encoder using the smooth shrinkage function.

    Figure 38:

    Second stage filters. Left: Encoder kernels that correspond to the dictionary elements.

Right: 128 dictionary elements; each row shows 16 dictionary elements connecting to a


single second-layer feature map. It can be seen that each group extracts a similar type of features from its corresponding inputs.

    Figure 39:

Results on the INRIA dataset with the per-image metric. Left: Comparing the two best systems with unsupervised initialization (UU) vs. random initialization (RR). Right: Effect of bootstrapping on final performance for the unsupervised initialized system.

    Figure 40:

Results on the INRIA dataset with the per-image metric. These curves are computed from the bounding boxes and confidences made available by Dollar et al. (2009b). Comparing our two best systems, labeled U+U+ and R+R+, with all the other methods.

    Figure 41:

Reconstruction error vs. ℓ1-norm sparsity penalty for coordinate descent sparse coding and variational free energy minimization.

    Figure 42:

Angle between representations obtained for two consecutive frames, for different parameter values, using sparse coding and variational free energy minimization.


    TABLE OF CONTENTS

    1. INTRODUCTION

    a. System Component

    b. Complexity of Object Recognition

    c. Two Dimensional

    d. Three Dimensional

    e. Segmented

2. HIERARCHICAL KERNEL DESCRIPTOR

    a. Kernel Descriptor

    b. Kernel Descriptor Over Kernel Descriptor

    c. Everyday Object Recognition Using RGB-D

    d. Experiments

i. CIFAR10

    ii. RGB-D Object Dataset

    3. OBJECT RECOGNITION METHOD BASED ON TRANSFORMATION

    a. Classes of object recognition methods

    i. Appearance based methods

    ii. Geometry Based Methods

4. RECOGNITION AS CORRESPONDENCE OF LOCAL FEATURES

a. The approach of David Lowe

b. The approach of Mikolajczyk & Schmid

c. The approach of Tuytelaars, Ferrari & Gool

    d. The LAF approach of Matas

    e. The approach of Zisserman


    i. Indexing and Matching

    ii. Verification

    f. Scale Saliency by Kadir & Brady

    g. Local PCA, approaches of Jugessur & Ohba

    h. The approach of Selinger and Nelson

    i. Applications

    5. LITERATURE SURVEY

    a. Active and Dynamic Vision

    b. Active object detection literature survey

    c. Active object localization

    6. CASE STUDIES FROM RECOGNITION CHALLENGES AND THE EVOLVING

    LANDSCAPE

    a. Dataset and evaluation techniques

    b. Sampling the current state of the art in the recognition literature

    i. Pascal 2005

    ii. Pascal 2006

    iii. Pascal 2008

    iv. Pascal 2009

    v. Pascal 2010

    vi. Pascal 2011

    c. The evolving landscape

    7. MULTI STAGE ARCHITECTURE FOR OBJECT RECOGNITION

    a. Modules for hierarchical systems

    b. Combining module into hierarchy

    c. Training protocol

    d. Experiment with Caltech101 Dataset

    e. Using a single stage of feature extraction

    f. Using two stage of feature extraction

g. NORB dataset


    h. Random filter performance

    i. Handwritten digit recognition

j. Convolutional sparse coding

    k. Algorithms and Methods

    i. Learning convolutional dictionaries

ii. Learning an efficient encoder

    iii. Patch based vs convolutional sparse modelling

    l. Multi stage architecture

    m. Experiments

    i. Object recognition using Caltech 101 dataset

    ii. Pedestrian detection

    iii. Architecture and Training

    iv. Per image evaluation

    n. Sparse coding by variational marginalization

o. Variational marginalization for sparse coding

    p. Stability experiments

    8. CONCLUSION

    9. References


    2) Introduction

    Object recognition is a fundamental and challenging problem and is a major focus of

    research in computer vision, machine learning and robotics. The task is difficult partly

    because images are in high-dimensional space and can change with viewpoint, while

the objects themselves may be deformable, leading to large intra-class variation. The

    core of building object recognition systems is to extract meaningful representations

    (features) from high-dimensional observations such as images, videos, and 3D point

    clouds. This paper aims to discover such representations using machine learning

    methods. An object recognition system finds objects in the real world from an image

    of the world, using object models which are known a priori. This task is surprisingly

    difficult. Humans perform object recognition effortlessly and instantaneously.

    Algorithmic description of this task for implementation on machines has been very

    difficult. In this chapter we will discuss different steps in object recognition and

    introduce some techniques that have been used for object recognition in many

    applications. We will discuss the different types of recognition tasks that a vision

    system may need to perform. We will analyze the complexity of these tasks and

    present approaches useful in different phases of the recognition task.


    Over the past few years, there has been increasing interest in feature learning for

object recognition using machine learning methods. Deep belief nets (DBNs) are

    appealing feature learning methods that can learn a hierarchy of features. DBNs are

    trained one layer at a time using contrastive divergence, where the feature learned by

    the current layer becomes the data for training the next layer. Deep belief nets have

    shown impressive results on handwritten digit recognition, speech recognition and

    visual object recognition. Convolutional neural networks (CNNs) are another example

    that can learn multiple layers of nonlinear features. In CNNs, the parameters of the

    entire network, including a final layer for recognition, are jointly optimized using the

    back-propagation algorithm.
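As a minimal, hedged illustration of such a jointly optimized CNN pipeline (the layer sizes, the 10-class output, and the random dummy batch are illustrative assumptions, not details from this thesis), a sketch in PyTorch might look like:

```python
import torch
import torch.nn as nn

# Minimal sketch of a CNN whose feature layers and final recognition layer are
# optimized jointly by back-propagation, as described above. Layer sizes and
# the 10-class output are illustrative assumptions.
class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # assumes 32x32 inputs

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = SmallCNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# One joint back-propagation step over a dummy batch (random data as a stand-in).
images, labels = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
loss = criterion(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```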

    The object recognition problem can be defined as a labeling problem based on models

    of known objects. Formally, given an image containing one or more objects of interest

    (and background) and a set of labels corresponding to a set of models known to the

    system, the system should assign correct labels to regions, or a set of regions, in the

    image. The object recognition problem is closely tied to the segmentation problem:

    without at least a partial recognition of objects, segmentation cannot be done, and

    without segmentation, object recognition is not possible.

In this chapter, we discuss basic aspects of the object recognition hierarchy and its analysis.

    We present the architecture and main components of object recognition and discuss

    their role in object recognition systems of varying complexity.


Figure 1: Different components of an object recognition system are shown

    a) System Component

    An object recognition system must have the following components to perform the

    task:

    Model database (also called modelbase)

    Feature detector

    Hypothesizer

    Hypothesis verifier

    A block diagram showing interactions and information flow among different

components of the system is given in Figure 1. The model database contains all the

    models known to the system. The information in the model database depends on the


    approach used for the recognition. It can vary from a qualitative or functional

    description to precise geometric surface information. In many cases, the models of

    objects are abstract feature vectors, as discussed later in this section. A feature is some

    attribute of the object that is considered important in describing and recognizing the

    object in relation to other objects. Size, color, and shape are some commonly used

    features.

    The feature detector applies operators to images and identifies locations of features

    that help in forming object hypotheses. The features used by a system depend on the

    types of objects to be recognized and the organization of the model database. Using

    the detected features in the image, the hypothesizer assigns likelihoods to objects

    present in the scene. This step is used to reduce the search space for the recognizer

    using certain features.

    The model base is organized using some type of indexing scheme to facilitate

    elimination of unlikely object candidates from possible consideration. The verifier

    then uses object models to verify the hypotheses and refines the likelihood of objects.

    The system then selects the object with the highest likelihood, based on all the

    evidence, as the correct object.
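A minimal sketch of this control flow, with all class and function names chosen for illustration only (they are not from the text), could look like:

```python
# Skeleton of the recognition pipeline described above: feature detection,
# hypothesis formation against an indexed model database, and verification.
# All names and the scoring details are illustrative assumptions.

class ObjectRecognizer:
    def __init__(self, model_database, feature_detector, hypothesizer, verifier):
        self.models = model_database          # all object models known to the system
        self.detect_features = feature_detector
        self.hypothesize = hypothesizer       # assigns likelihoods using detected features
        self.verify = verifier                # refines likelihoods using full object models

    def recognize(self, image):
        features = self.detect_features(image)
        # Indexing into the model base prunes unlikely candidates early.
        candidates = self.hypothesize(features, self.models)
        scored = [(self.verify(image, model, features), model) for model in candidates]
        best_score, best_model = max(scored, key=lambda pair: pair[0])
        return best_model, best_score
```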

    All object recognition systems use models either explicitly or implicitly and employ

    feature detectors based on these object models. The hypothesis formation and

    verification components vary in their importance in different approaches to object

    recognition. Some systems use only hypothesis formation and then select the object


    with highest likelihood as the correct object. Pattern classification approaches are a

    good example of this approach. Many artificial intelligence systems, on the other

    hand, rely little on the hypothesis formation and do more work in the verification

    phases. In fact, one of the classical approaches, template matching, bypasses the

    hypothesis formation stage entirely.

    An object recognition system must select appropriate tools and techniques for the

    steps discussed above. Many factors must be considered in the selection of

    appropriate methods for a particular application. The central issues that should be

    considered in designing an object recognition system are:

    Object or model representation: How should objects be represented in the

    model database? What are the important attributes or features of objects that must be

    captured in these models? For some objects, geometric descriptions may be available

    and may also be efficient, while for another class one may have to rely on generic or

    functional features.

    The representation of an object should capture all relevant information without any

    redundancies and should organize this information in a form that allows easy access

    by different components of the object recognition system.

    Feature extraction: Which features should be detected, and how can they be

detected reliably? Most features can be computed in two-dimensional images but they

    are related to three-dimensional characteristics of objects. Due to the nature of the


    image formation process, some features are easy to compute reliably while others are

    very difficult. Feature detection issues were discussed in many chapters in this book.

    Feature-model matching: How can features in images be matched to models

    in the database? In most object recognition tasks, there are many features and

    numerous objects. An exhaustive matching approach will solve the recognition

    problem but may be too slow to be useful. Effectiveness of features and efficiency of

    a matching technique must be considered in developing a matching approach.

    Hypotheses formation: How can a set of likely objects based on the feature

    matching be selected, and how can probabilities be assigned to each possible object?

    The hypothesis formation step is basically a heuristic to reduce the size of the search

    space. This step uses knowledge of the application domain to assign some kind of

    probability or confidence measure to different objects in the domain. This measure

    reflects the likelihood of the presence of objects based on the detected features.

    Object verification: How can object models be used to select the most likely

    object from the set of probable objects in a given image? The presence of each likely

    object can be verified by using their models. One must examine each plausible

    hypothesis to verify the presence of the object or ignore it. If the models are

    geometric, it is easy to precisely verify objects using camera location and other scene

    parameters. In other cases, it may not be possible to verify a hypothesis.


Depending on the complexity of the problem, one or more modules in Figure 1 may

    become trivial. For example, pattern recognition-based object recognition systems do

    not use any feature-model matching or object verification; they directly assign

    probabilities to objects and select the object with the highest probability.

    b) Complexity of Object Recognition

    As we studied in earlier chapters in this book, images of scenes depend on

    illumination, camera parameters, and camera location. Since an object must be

    recognized from images of a scene containing multiple entities, the complexity of

    object recognition depends on several factors. A qualitative way to consider the

    complexity of the object recognition task would consider the following factors:

    Scene constancy: The scene complexity will depend on whether the images are

    acquired in similar conditions (illumination, background, camera parameters, and

viewpoint) as the models. As seen in earlier chapters, scene conditions affect images

    of the same object dramatically. Under different scene conditions, the performance of

    different feature detectors will be significantly different. The nature of the

    background, other objects, and illumination must be considered to determine what

    kind of features can be efficiently and reliably detected.

    Image-models spaces: In some applications, images may be obtained such that

    three-dimensional objects can be considered two-dimensional. The models in such


    recognition problem into the following classes.

    c) Two-dimensional

    In many applications, images are acquired from a distance sufficient to consider the

    projection to be orthographic. If the objects are always in one stable position in the

    scene, then they can be considered two-dimensional. In these applications, one can

    use a two-dimensional modelbase. There are two possible cases:

    Objects will not be occluded, as in remote sensing and many industrial

applications.

    Objects may be occluded by other objects of interest or be partially visible, as

    in the bin of parts problem.

    In some cases, though the objects may be far away, they may appear in different

    positions resulting in multiple stable views. In such cases also, the problem may be

    considered inherently as two-dimensional object recognition.

    d) Three-dimensional

    If the images of objects can be obtained from arbitrary viewpoints, then an object may

    appear very different in its two views. For object recognition using three-dimensional


    models, the perspective effect and viewpoint of the image have to be considered. The

    fact that the models are three-dimensional and the images contain only two-

    dimensional information affects object recognition approaches. Again, the two factors

    to be considered are whether objects are separated from other objects or not.

    For three-dimensional cases, one should consider the information used in the object

    recognition task. Two different cases are:

    Intensity: There is no surface information available explicitly in intensity

    images. Using intensity values, features corresponding to the three-dimensional

structure of objects should be recognized.

    2.5-dimensional images: In many applications, surface representations with

    viewer-centered coordinates are available, or can be computed, from images. This

    information can be used in object recognition.

    Range images are also 2.5-dimensional. These images give the distance to different

    points in an image from a particular view point.

    e) Segmented

    The images have been segmented to separate objects from the background. As

    discussed in Chapter 3 on segmentation, object recognition and segmentation

    problems are closely linked in most cases. In some applications, it is possible to


    segment out an object easily. In cases when the objects have not been segmented, the

    recognition problem is closely linked with the segmentation problem.

3) HIERARCHICAL KERNEL DESCRIPTORS

    Kernel descriptors highlight the kernel view of orientation histograms, such as SIFT

    and HOG, and show that they are a particular type of match kernels over patches. This

    novel view suggests a unified framework for turning pixel attributes (gradient, color,

    local binary pattern, etc.) into patch-level features:

    (1) design match kernels using pixel attributes;

    (2) learn compact basis vectors using kernel principal component analysis (KPCA);

(3) construct kernel descriptors by projecting the infinite-dimensional feature vectors

    to the learned basis vectors.

    The key idea of this work is that we can apply the kernel descriptor framework not

    only over sets of pixels (patches), but also sets of kernel descriptors. Hierarchical

    kernel descriptors aggregate spatially nearby patch-level features to form higher level

features by using kernel descriptors recursively, as shown in Fig. 2. This procedure

    can be repeated until we reach the final image-level features.
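The three steps above can be sketched for a single pixel attribute as follows; the attribute dimensionality, basis sampling, kernel width, and number of retained components are all illustrative assumptions rather than values from the text:

```python
import numpy as np

# Hedged sketch of the three-step kernel-descriptor recipe described above,
# for a single pixel attribute (e.g. normalized color). Basis sampling, the
# Gaussian kernel width and the attribute itself are illustrative assumptions.

def gaussian_kernel(X, Y, gamma=5.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Step 2: learn compact basis vectors with kernel PCA over sampled attributes.
basis = np.random.rand(200, 3)               # sampled basis attribute vectors
K = gaussian_kernel(basis, basis)
eigvals, eigvecs = np.linalg.eigh(K)
top = eigvecs[:, -50:] / np.sqrt(np.maximum(eigvals[-50:], 1e-12))  # top 50 components

def kernel_descriptor(patch_attributes, weights):
    """Step 3: project the (implicit, infinite-dimensional) match-kernel feature
    of a patch onto the learned basis. `patch_attributes` is (n_pixels, 3) and
    `weights` are per-pixel weights (e.g. gradient magnitudes), per Step 1."""
    # Evaluate the match kernel between every pixel and every basis vector,
    # then aggregate over the patch: sum_z w_z k(a_z, x_i).
    k_zb = gaussian_kernel(patch_attributes, basis)
    aggregated = weights @ k_zb
    return aggregated @ top                   # compact patch-level feature

patch = np.random.rand(64, 3)                 # attributes of an 8x8 patch
descriptor = kernel_descriptor(patch, np.ones(64) / 64)
```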


    a) Kernel Descriptors

    Patch-level features are critical for many computer vision tasks. Orientation

    histograms like SIFT and HOG are popular patch-level features for object recognition.

    Kernel descriptors include SIFT and HOG as special cases, and provide a principled

    way to generate rich patch-level features from various pixel attributes.

Figure 2: Hierarchical Kernel Descriptors. In the first layer, pixel attributes are aggregated into patch-level features. In the second layer, patch-level features are turned into aggregated patch-level features. In the final layer, aggregated patch-level features are converted into image-level features. Kernel descriptors are used in every layer.

The gradient match kernel, $K_{grad}$, is based on the pixel gradient attribute:

$$K_{grad}(P, Q) = \sum_{z \in P} \sum_{z' \in Q} \tilde{m}_z \, \tilde{m}_{z'} \, k_o(\tilde{\theta}_z, \tilde{\theta}_{z'}) \, k_p(z, z'),$$

where $P$ and $Q$ are patches from two different images, $z$ denotes the 2D position of a pixel in an image patch, and $\tilde{\theta}_z$ and $\tilde{m}_z$ are the normalized orientation and magnitude of the image gradient at pixel $z$; $k_o$ and $k_p$ are Gaussian kernels over gradient orientations and pixel positions, respectively.

The color kernel descriptor, $K_{col}$, is based on the pixel intensity attribute:

$$K_{col}(P, Q) = \sum_{z \in P} \sum_{z' \in Q} k_c(c_z, c_{z'}) \, k_p(z, z'),$$

where $c_z$ is the pixel color at position $z$ (intensity for gray images and RGB values for color images) and $k_c(c_z, c_{z'}) = \exp(-\gamma_c \|c_z - c_{z'}\|^2)$ is a Gaussian kernel. The shape kernel descriptor, $K_{shape}$, is based on the local binary pattern attribute.

    Gradient, color and shape kernel descriptors are strong in their own right and

complement one another. Their combination turns out to be always (much) better than the best individual feature. Kernel descriptors are able to generate rich visual

    feature sets by turning various pixel attributes into patch-level features, and are

    superior to the current state-of-the-art recognition algorithms on many standard visual

    object recognition datasets.
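As a concrete, hedged reading of the color match kernel defined above, a direct (unoptimized) evaluation over two small patches might be sketched as follows; the kernel parameters gamma_c and gamma_p are illustrative values:

```python
import numpy as np

# Direct evaluation of the color match kernel described above:
# K_col(P, Q) = sum_{z in P} sum_{z' in Q} k_c(c_z, c_z') k_p(z, z'),
# with Gaussian kernels over pixel color and normalized pixel position.
# gamma_c and gamma_p are illustrative values, not from the text.

def gaussian(X, Y, gamma):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def color_match_kernel(colors_p, pos_p, colors_q, pos_q, gamma_c=4.0, gamma_p=3.0):
    k_c = gaussian(colors_p, colors_q, gamma_c)   # k_c(c_z, c_z')
    k_p = gaussian(pos_p, pos_q, gamma_p)         # k_p(z, z') over 2D positions in [0, 1]
    return float((k_c * k_p).sum())

# Two 8x8 patches with RGB colors in [0, 1] and normalized pixel coordinates.
grid = np.stack(np.meshgrid(np.linspace(0, 1, 8), np.linspace(0, 1, 8)), -1).reshape(-1, 2)
P, Q = np.random.rand(64, 3), np.random.rand(64, 3)
print(color_match_kernel(P, grid, Q, grid))
```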

    b) Kernel Descriptors over Kernel Descriptors

    The match kernels used to aggregate patch-level features have similar structure to

those used to aggregate pixel attributes:


where $A$ and $A'$ denote image patches, and $P$ and $Q$ are sets of image patches.

The patch position Gaussian kernel $k_C(C_A, C_{A'}) = \exp(-\gamma_C \|C_A - C_{A'}\|^2) = \phi_C(C_A)^\top \phi_C(C_{A'})$ describes the spatial relationship between two patches, where $C_A$ is the center position of patch $A$ (normalized to $[0, 1]$). The patch Gaussian kernel $k_F(F_A, F_{A'}) = \exp(-\gamma_F \|F_A - F_{A'}\|^2) = \phi_F(F_A)^\top \phi_F(F_{A'})$ measures the similarity of two patch-level features, where $F_A$ are gradient, shape or color kernel descriptors in our case. The linear kernel $W_A W_{A'}$ weights the contribution of each patch-level feature, where $W_A$ (up to a small positive constant) is the average of gradient magnitudes for the gradient kernel descriptor, the average of standard deviations for the shape kernel descriptor, and is always 1 for the color kernel descriptor.

Note that although efficient match kernels [1] used match kernels to aggregate patch-level features, they don't consider spatial information in match kernels, and so a spatial pyramid is required to integrate spatial information. In addition, they also do not weight the contribution of each patch, which can be suboptimal. The novel joint match kernels (5) provide a way to integrate patch-level features, patch variation, and spatial information jointly.


which can be written as a single Gaussian kernel. This procedure is optimal in the sense of minimizing the least squares approximation error. However, it is intractable to compute the eigenvectors of a 125,000 × 125,000 matrix on a modern personal computer. Here we propose a fast algorithm for finding the eigenvectors of the Kronecker product of kernel matrices. Since kernel matrices are symmetric positive definite, we have

$$K_F \otimes K_C = (V_F \Lambda_F V_F^\top) \otimes (V_C \Lambda_C V_C^\top) = (V_F \otimes V_C)(\Lambda_F \otimes \Lambda_C)(V_F \otimes V_C)^\top,$$

which suggests that the top $r$ eigenvectors of $K_F \otimes K_C$ can be chosen from the Kronecker product of the eigenvectors of $K_F$ and those of $K_C$, which significantly reduces computational cost. The second-layer kernel descriptors then take an analogous form. Recursively applying kernel descriptors in a similar manner, we can get kernel descriptors of more layers, which represent features at different levels.
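The shortcut can be checked numerically on small matrices; the sizes below are toy assumptions, but the identity being exercised is exactly the one stated above:

```python
import numpy as np

# Numerical check of the shortcut described above: because K_F and K_C are
# symmetric positive definite, eigenvectors of K_F (x) K_C are Kronecker
# products of the factor eigenvectors, and eigenvalues are products of the
# factor eigenvalues. Matrix sizes are toy assumptions.

rng = np.random.default_rng(0)

def random_spd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

K_F, K_C = random_spd(6), random_spd(4)
wF, VF = np.linalg.eigh(K_F)
wC, VC = np.linalg.eigh(K_C)

# Candidate eigenpairs of the Kronecker product, obtained without forming the
# full eigendecomposition of the large matrix.
eigvals = np.kron(wF, wC)
eigvecs = np.kron(VF, VC)

# Pick the top-r eigenvectors by sorting the product eigenvalues.
r = 5
top_idx = np.argsort(eigvals)[::-1][:r]
top_vectors = eigvecs[:, top_idx]

# Verify against the direct eigendecomposition of K_F (x) K_C.
K = np.kron(K_F, K_C)
direct_vals = np.sort(np.linalg.eigvalsh(K))[::-1][:r]
assert np.allclose(np.sort(eigvals)[::-1][:r], direct_vals)
v0 = top_vectors[:, 0]
assert np.allclose(K @ v0, eigvals[top_idx[0]] * v0)   # v0 really is an eigenvector
```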

    c) Everyday Object Recognition using RGB-D

    We recorded with the camera mounted at three different heights relative to the

turntable, giving viewing angles of approximately 30, 45 and 60 degrees with the horizon. One revolution of each object was recorded at each height. Each video

    sequence is recorded at 20 Hz and contains around 250 frames, giving a total of

    250,000 RGB + Depth frames. A combination of visual and depth cues (Mixture-of-

    Gaussian fitting on RGB, RANSAC plane fitting on depth) produces a segmentation


for each frame, separating the object of interest from the background. The objects are organized into a hierarchy taken from WordNet hypernym/hyponym relations, which is a subset of the categories in ImageNet. Each of the 300 objects in the dataset belongs to one of 51 categories.

Our hierarchical kernel descriptors, being a generic approach based on kernels, have no trouble generalizing from color images to depth images. Treating a depth image as a grayscale image, i.e. using depth values as intensity, gradient and shape kernel descriptors can be directly extracted, and they capture edge and shape information in the depth channel. However, color kernel descriptors extracted over the raw depth image do not have any significant meaning. Instead, we make the observation that the distance d of an object from the camera is inversely proportional to the square root of its area s in RGB images for a given object.

Since we have the segmentation of objects, we can represent s using the number of pixels belonging to the object mask. Finally, we multiply depth values by √s before extracting color kernel descriptors over this normalized depth image. This yields a feature that is sensitive to the physical size of the object.
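A minimal sketch of this normalization step, assuming only a depth image and a binary object mask (both arrays below are placeholders), might be:

```python
import numpy as np

# Hedged sketch of the depth normalization described above: scale depth values
# by the square root of the object's pixel area s (taken from the segmentation
# mask) so that the normalized values reflect the object's physical size.
# The arrays here are illustrative placeholders.

def normalize_depth(depth_image, object_mask):
    s = float(object_mask.sum())              # s = number of pixels on the object
    return depth_image * np.sqrt(s)           # d * sqrt(s) is roughly distance-invariant

depth = np.random.rand(120, 160).astype(np.float32)       # dummy depth frame
mask = np.zeros((120, 160), dtype=bool)
mask[40:80, 60:110] = True                                 # dummy object segmentation
normalized = normalize_depth(depth, mask)
```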


In the experiments section, we will compare in detail the performance of our hierarchical kernel descriptors on RGB-D object recognition to that in [15]. Our approach consistently outperforms the state of the art in [15]. In particular, our hierarchical kernel descriptors on the depth image perform much better than the combination of depth features (including spin images) used in [15], increasing the depth-only object category recognition from 53.1% (linear SVMs) and 64.7% (nonlinear SVMs) to 75.7% (hierarchical kernel descriptors and linear SVMs). Moreover, our depth features served as the backbone in the object-aware situated interactive system that was successfully demonstrated at the Consumer Electronics Show 2011 despite adverse lighting conditions.

    d) Experiments

In this section, we evaluate hierarchical kernel descriptors on CIFAR10 and the RGB-D Object Dataset. We also provide extensive comparisons with current state-of-the-art algorithms in terms of accuracy.

Table 1. Comparison of kernel descriptors (KDES) and hierarchical kernel descriptors (HKDES) on CIFAR10.

Features        KDES [1]    HKDES (this work)
Color           53.9        63.4
Shape           68.2        69.4
Gradient        66.3        71.2
Combination     76.0        80.0

    In all experiments we use the same parameter settings as the original kernel

descriptors for the first layer of hierarchical kernel descriptors. For SIFT as well as gradient and shape kernel descriptors, all images are transformed into grayscale ([0, 1]). Image intensity and RGB values are normalized to [0, 1]. Like HOG [5], we compute gradients using the mask [-1, 0, 1] for gradient kernel descriptors. We also

    evaluate the performance of the combination of the three hierarchical kernel

    descriptors by concatenating the image-level feature vectors. Our experiments suggest

    that this combination always improves accuracy.

    i) CIFAR10

    CIFAR10 is a subset of the 80 million tiny images dataset [26, 14]. These images are

downsampled to 32 × 32 pixels. The training set contains 5,000 images per category,

    while the test set contains 1,000 images per category.

    Due to the tiny image size, we use two-layer hierarchical kernel descriptors to obtain

    image-level features. We keep the first layer the same as kernel descriptors. Kernel


descriptors are extracted over 8 × 8 image patches over dense regular grids with a spacing of 2 pixels. We split the whole training set into 10,000/40,000 training/validation set, and optimize the kernel parameters of the second layer kernel descriptors on the validation set using grid search. Finally, we train linear SVMs on

    the full training set using the optimized kernel parameter setting. Our hierarchical

    model can handle large numbers of basis vectors. We tried both 1000 and 5000 basis

    vectors for the patch-level Gaus-sian kernel kF , and found that a larger number of

    visual words is slightly better (0.5% to 1% improvement depend-ing on the type of

    kernel descriptor). In the second layer, we use 1000 basis vector, enforce KPCA to

    keep 97% of the energy for all kernel descriptors, and produce roughly 6000-

    dimensional image-level features. Note that the sec-ond layer of hierarchical kernel

    descriptors are image-level features, and should be compared to that of image-level

    features formed by EMK, rather than that of kernel descriptors over image patches.

    The dimensionality of EMK features [1] in is 14000, higher than that of hierarchical

    kernel descriptors.

    We compare kernel descriptors and hierarchical kernel

    Method Accuracy

    Logistic regression 36.0

    Support Vector Machines 39.5

    GIST 54.7

  • 8/11/2019 Object Recognition Techniques (2)

    38/224

    38

    SIFT 65.6

    fine-tuning GRBM 64.8

    GRBM two layers 56.6

    mcRBM 68.3

    mcRBM-DBN 71.0

    Tiled CNNs 73.1

    improved LCC 74.5

    KDES + EMK + linear SVMs 76.0

    Convolutional RBM 78.9

    K-means (Triangle, 4k features) 79.6

    HKDES + linear SVMs (this

    work) 80.0

    descriptors in Table 1. As we see, hierarchical kernel de-scriptors consistently

    outperform kernel descriptors. The shape hierarchical kernel descriptor is slightly

better than the shape kernel descriptor. The other two hierarchical kernel descriptors are much better than their counterparts: the gradient hierarchical kernel descriptor is about 5 percent higher than the gradient kernel descriptor, and the color hierarchical kernel descriptor is 10 percent better than the color kernel descriptor. Finally, the combination of all three hierarchical kernel descriptors outperforms the combination of all three kernel descriptors by 4 percent. We were not able to run nonlinear SVMs with


    Laplacian kernels on the scale of this dataset in reasonable time, given the high

    dimensionality of image-level features. Instead, we make comparisons on a subset of

5,000 training images, and our experiments suggest that nonlinear SVMs have similar performance to linear SVMs when hierarchical kernel descriptors are used.

We compare hierarchical kernel descriptors with the current state-of-the-art feature learning algorithms in Table 2. Deep belief nets and sparse coding have been extensively evaluated on this dataset [25, 31]. mcRBM can model pixel intensities and pairwise dependencies between them jointly. A factorized third-order restricted Boltzmann machine, followed by deep belief nets, has an accuracy of 71.0%. Tiled CNNs have the best accuracy among deep networks. The improved LCC extends the original local coordinate coding by including local tangent directions and is able to integrate geometric information. As we have seen, sophisticated feature extraction can significantly boost accuracy and is much better than using raw pixel features. SIFT features have an accuracy of 65.6% and work reasonably well even on tiny images. The combination of three hierarchical kernel descriptors has an accuracy of 80.0%, higher than all other competing techniques; its accuracy is 14.4 percent higher than SIFT, 9.0 percent higher than mcRBM combined with DBNs, and 5.5 percent higher than the improved LCC. Hierarchical kernel descriptors slightly outperform the very recent work: the convolutional RBM and the triangle K-means with 4000 centers.

    ii) RGB-D Object Dataset


    determining the category name of an object (e.g. coffee mug). One category usually

    contains many different object instances.

    To test the generalization ability of our approaches, for category recognition we train

    models on a set of objects and at test time present to the system objects that were not

    present in the training set [15]. At each trial, we randomly leave one object out from

    each category for testing and train classifiers on the remaining 300 - 51 = 249 objects.

For instance recognition we also follow the experimental setting suggested by [15]: train models on the video sequences of each object where the viewing angles are 30° and 60° with the horizon and test them on the 45° video sequence.

For category recognition, the average accuracy over 10 random train/test splits is reported in the second column of Table 2. For instance recognition, the accuracy on the test set is reported in the third column of Table 2. As we expect, the combination of hierarchical kernel descriptors is much better than any single descriptor. The underlying reason is that each depth descriptor captures different information, and the weights learned by linear SVMs using supervised information can automatically balance the importance of each descriptor across objects.

Method                          Category      Instance
Color HKDES (RGB)               60.1 ± 2.1    58.4
Shape HKDES (RGB)               72.6 ± 1.9    74.6
Gradient HKDES (RGB)            70.1 ± 2.9    75.9
Combination of HKDES (RGB)      76.1 ± 2.2    79.3
Color HKDES (depth)             61.8 ± 2.4    28.8
Shape HKDES (depth)             65.8 ± 1.8    36.7
Gradient HKDES (depth)          70.8 ± 2.7    39.3
Combination of HKDES (depth)    75.7 ± 2.6    46.8
Combination of all HKDES        84.1 ± 2.2    82.4

Table 2: Comparisons on the RGB-D Object Dataset. RGB denotes features over RGB images and depth denotes features over depth images.

Approaches                      Category      Instance
Linear SVMs [15]                81.9 ± 2.8    73.9
Nonlinear SVMs [15]             83.8 ± 3.5    74.8
Random Forest [15]              79.6 ± 4.0    73.1
Combination of all HKDES        84.1 ± 2.2    82.4

Table 3: Comparisons to existing recognition approaches using a combination of depth features and image features. Nonlinear SVMs use a Gaussian kernel.


In Table 3, we compare hierarchical kernel descriptors with the rich feature set used in [15], where SIFT, color and textons were extracted from RGB images, and 3-D bounding boxes and spin images over depth images. Hierarchical kernel descriptors are slightly better than this rich feature set for category recognition, and much better for instance recognition.

    It is worth noting that, using depth alone, we improve the category recognition

accuracy in [15] from 53.1% (linear SVMs) to 75.7% (hierarchical kernel descriptors and

    linear SVMs). This shows the power of our hierarchical kernel descriptor formulation

    when being applied to a non-conventional domain. The depth-alone results are

meaningful for many scenarios where color images are not used for privacy or

    robustness reasons.

    As a comparison, we also extracted SIFT features on both RGB and depth images and

    trained linear SVMs over image-level features formed by spatial pyramid EMK. The

    resulting classifier has an accuracy of 71.9% for category recognition, much lower

    than the result of the combination of hierarchical kernel descriptors (84.2%). This is

    not sur-prising since SIFT fails to capture shape and object size information.

    Nevertheless, hierarchical kernel descriptors provide a unified way to generate rich

    feature sets over both RGB and depth images, giving significantly better accuracy.


    4) OBJECT RECOGNITION METHODS BASED ON TRANSFORMATION

    Recognition of general three-dimensional objects from 2D images and videos is a

challenging task. The common formulation of the problem is essentially: given some knowledge of how certain objects may appear, plus an image of a scene possibly containing those objects, find which objects are present in the scene and where. Recognition is accomplished by matching features of an image and a model of an

    object. The two most important issues that a method must address are the definition of

    a feature, and how the matching is found.

What is the goal in designing an object recognition system? Achieving generality, i.e. the ability to recognise any object without hand-crafted adaptation to a specific task; robustness, i.e. the ability to recognise objects in arbitrary conditions; and easy learning, i.e. avoiding special or demanding procedures to obtain the database of models. Obviously these requirements are generally impossible to achieve, as it is for example impossible to recognise objects in images taken in complete darkness. The challenge is then to develop a method with minimal constraints.

    Object recognition methods can be classified according to a number of characteristics.

We focus on model acquisition (learning) and invariance to image formation conditions. Historically, two main trends can be identified. In the so-called geometry- or

    model-based object recognition, the knowledge of an object appearance is provided

    by the user as an explicit CAD-like model. Typically, such a model describes only the


    3D shape, omitting other properties such as colour and texture. On the other end of

    the spectrum are the appearance-based methods, where no explicit user-provided

    model is required. The object representations are usually acquired through an

    automatic learning phase (but not necessarily), and the model typically relies on

surface reflectance (albedo) properties. Recently, methods which put local image patches into correspondence have emerged. Models are learned automatically; objects are represented by the appearance of small local elements. The global arrangement of the

    representation is constrained by weak or strong geometric models.

The rest of the paper is structured as follows. In Section 2, an overview of classes of object recognition methods is given. A survey of methods based on matching of local features is presented in Section 3, and Section 4 describes some of their successful applications. Section 5 concludes the paper.

a) Classes of Object Recognition Methods

    i) Appearance Based Methods

    The central idea behind appearance-based methods is the following. Having seen all

    possible appearances of an object, can recognition be achieved by just efficiently

    remembering all of them? Could recognition be thus implemented as an efficient

    visual (pictorial) memory? The answer obviously depends on what is meant by all

appearances. The approach has been successfully demonstrated for scenes with unoccluded objects on a black background [34]. But remembering all possible object


The family of appearance-based object recognition methods includes global histogram matching methods. In [66, 67], Swain and Ballard proposed to represent an object by a colour histogram. Objects are identified by matching histograms of image regions to histograms of a model image. While the technique is robust to object orientation, scaling, and occlusion, it is very sensitive to lighting conditions, and it is not suitable for recognising objects that cannot be identified by colour alone. The approach was later modified by Healey and Slater [14] and by Funt and Finlayson [12] to exploit illumination invariants. More recently, the concept of histogram matching was generalised by Schiele [52, 51, 50]: instead of pixel colours, responses of various filters are used to form the histograms (then called receptive field histograms).
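As an illustration of the idea (not the exact formulation of Swain and Ballard), the following sketch builds coarse joint RGB histograms with OpenCV and compares them by histogram intersection; the bin count, the normalisation step, and the file names are arbitrary choices made for the example.

```python
import cv2
import numpy as np

def colour_histogram(image_bgr, bins=8):
    # Joint 3D histogram over the three colour channels, normalised to sum to 1.
    hist = cv2.calcHist([image_bgr], [0, 1, 2], None,
                        [bins, bins, bins], [0, 256] * 3)
    return hist / hist.sum()

def histogram_intersection(h_query, h_model):
    # Swain-and-Ballard-style intersection: 1.0 means identical histograms.
    return float(np.minimum(h_query, h_model).sum())

# Hypothetical usage: 'model.png' and 'query.png' are stand-in file names.
model = cv2.imread("model.png")
query = cv2.imread("query.png")
score = histogram_intersection(colour_histogram(query), colour_histogram(model))
print("intersection score:", score)
```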

To summarise, appearance-based approaches are attractive since they do not require image features or geometric primitives to be detected and matched. Their limitations, however (the necessity of dense sampling of training views and the low robustness to occlusion and cluttered backgrounds), make them suitable mainly for applications with limited or controlled variation in the image formation conditions, e.g. industrial inspection.

    iii) Geometry-Based Methods

In geometry- (or shape-, or model-) based methods, the information about the objects is represented explicitly. Recognition can then be interpreted as deciding whether (a part of) a given image can be a projection of the known (usually 3D) model [41] of an object.

Generally, two representations are needed: one to represent the object model, and another to represent the image content. To facilitate finding a match between model and image, the two representations should be closely related. In the ideal case there is a simple relation between the primitives used to describe the model and those used to describe the image. If the object were, for example, described by a wireframe model, the image might best be described in terms of linear intensity edges; each edge could then be matched directly to one of the model wires. However, the model and image representations often have distinctly different meanings. The model may describe the 3D shape of an object, while the image edges correspond only to visible manifestations of that shape, mixed together with false edges (discontinuities in surface albedo) and illumination effects (shadows).

To achieve pose and illumination invariance, it is preferable to employ model primitives that are at least somewhat invariant with respect to changes in these conditions. Considerable effort has therefore been directed at identifying primitives that are invariant with respect to viewpoint change.

The main disadvantages of geometry-based methods are: the dependency on reliable extraction of geometric primitives (lines, circles, etc.), the ambiguity in interpretation of the detected primitives (the presence of primitives that are not modelled), the restricted


correspondences. Since it is not required that all local features match, these approaches are robust to occlusion and cluttered backgrounds.

To recognise objects from different views, it is necessary to handle all variations in object appearance. These variations may be complex in general, but at the scale of the local features they can be modelled by simple, e.g. affine, transformations. Thus, by allowing simple transformations at the local scale, significant viewpoint invariance is achieved even for objects with complicated shapes. As a result, it is possible to obtain models of objects from only a few views, taken e.g. 90 degrees apart.

The main advantages of the approaches based on matching local features are summarised below.

[1] Learning, i.e. the construction of internal models of known objects, is done automatically from images depicting the objects. No user intervention is required except for providing the training images.

[2] The local representation is based on appearance. There is no need to extract geometric primitives (e.g. lines), which are generally hard to detect reliably.

[3] Segmentation of objects from the background is not required prior to recognition, and yet objects are recognised against unknown backgrounds.

[4] Objects of interest are recognised even if partially occluded by other, unknown objects in the scene.


to these small misalignments. Such a descriptor might be based, e.g., on colour moments (integral statistics over the whole region) or on local histograms.

It follows that the major factors affecting a method's discriminative potential, and thus its ability to handle large object databases, are the repeatability and the localisation precision of the detector.

Indexing. During the learning of object models, descriptors of local appearance are stored in a database. In the recognition phase, descriptors are computed on the query image, and the database is searched for similar descriptors (potential matches). The database should be organised (indexed) in a way that allows efficient retrieval of similar descriptors. The character of a suitable indexing structure depends generally on the properties of the descriptors (e.g. their dimensionality) and on the distance measure used to decide which descriptors are similar (e.g. Euclidean distance). Generally, for optimal performance of the index (fast retrieval times), a combination of descriptor and distance measure should be sought that minimises the ratio of distances to correct and to false matches.

The choice of indexing scheme has a major effect on the speed of the recognition process, especially on how the speed scales to large object databases. Commonly, though, the database searches are done simply by sequential scan, i.e. without using any indexing structure.


Matching. When recognising objects in an unknown query image, local features are computed in the same form as for the database images. None, one, or possibly several tentative correspondences are then established for every feature detected in the query image. Searching the database, Euclidean or Mahalanobis distance is typically evaluated between the query feature and the features stored in the database, and the closest match, if close enough, is retrieved. These tentative correspondences are based purely on the similarity of the descriptors. A database object which exhibits a high (non-random) number of established correspondences is considered a candidate match.
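A minimal sketch of this matching step is given below; it assumes two NumPy arrays of descriptors and uses a plain Euclidean nearest-neighbour search with an absolute distance threshold (the threshold value is an arbitrary illustration, not a value taken from any of the surveyed methods).

```python
import numpy as np

def tentative_matches(query_desc, db_desc, max_dist=0.5):
    """Return (query_index, database_index) pairs of tentative correspondences.

    query_desc: (Nq, D) array of descriptors from the query image.
    db_desc:    (Nd, D) array of descriptors stored in the database.
    """
    matches = []
    for qi, q in enumerate(query_desc):
        dists = np.linalg.norm(db_desc - q, axis=1)   # Euclidean distances
        di = int(np.argmin(dists))                    # closest database feature
        if dists[di] < max_dist:                      # accept only if close enough
            matches.append((qi, di))
    return matches
```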

Verification. The similarity of descriptors, on its own, is not a reliable enough measure to guarantee that an established correspondence is correct. As a final step of the recognition process, the presence of the model in the query image is verified. A global transformation connecting the images is estimated in a robust way (e.g. using the RANSAC algorithm). Typically, the global transformation takes the form of an epipolar geometry constraint for general (but rigid) 3D objects, or of a homography for planar objects. More complex transformations can be derived for non-rigid or articulated (piecewise rigid) objects.
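For the planar case, the following sketch shows how such a robust verification could look using OpenCV's RANSAC-based homography estimation; the matched point arrays and the reprojection threshold are illustrative assumptions.

```python
import cv2
import numpy as np

def verify_homography(query_pts, model_pts, reproj_thresh=3.0):
    """Estimate a homography with RANSAC and return it with the inlier count.

    query_pts, model_pts: (N, 2) arrays of matched point coordinates (N >= 4).
    """
    q = np.asarray(query_pts, dtype=np.float32).reshape(-1, 1, 2)
    m = np.asarray(model_pts, dtype=np.float32).reshape(-1, 1, 2)
    H, inlier_mask = cv2.findHomography(m, q, cv2.RANSAC, reproj_thresh)
    inliers = int(inlier_mask.sum()) if inlier_mask is not None else 0
    # The number of correspondences consistent with H serves as the match score.
    return H, inliers
```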

As mentioned before, if a detector cannot recover certain parameters of the image transformation, the descriptor must be made invariant to them. It is preferable, though, to have a covariant detector rather than an invariant descriptor, as that allows for more powerful global consistency verification. If, for example, the detector does not provide the orientations of the image elements, rotational invariants have to be employed in the descriptor; in such a case, it is impossible to verify that all of the matched elements agree in their orientation.

Finally, tentative correspondences which are not consistent with the estimated global transformation are rejected, and only the remaining correspondences are used to compute the final score of the match.

In the following, the main contributions to the field of object recognition based on local correspondences are reviewed. The approaches follow the structure outlined above, but differ in the individual steps: in how the local features are obtained (detectors) and in what the features themselves are (descriptors).


5) RECOGNITION AS A CORRESPONDENCE OF LOCAL FEATURES - A SURVEY

    a) The Approach of David Lowe

David Lowe has developed an object recognition system with emphasis on efficiency, achieving real-time recognition times. Anchor points of interest are detected with invariance to scale, rotation and translation. Since local patches undergo more complicated transformations than similarities, a local-histogram-based descriptor is proposed which is robust to imprecisions in the alignment of the patches.

    Detector. The detection of regions of interest proceeds as follows:

[1] Detection of scale-space peaks. Circular regions with maximal response of the difference-of-Gaussians (DoG) filter are detected at all scales and image locations. An efficient implementation exploits the scale-space pyramid: the initial image is repeatedly convolved with a Gaussian filter to produce a set of scale-space images, and adjacent scale-space images are then subtracted to produce a set of DoG images. In these images, local minima and maxima (i.e. extrema of the DoG filter response) are detected, both in the spatial and the scale domains. The result of the first phase is thus a set of triplets (x, y, σ): image locations and their characteristic scales.

[2] Refinement of the locations of the detected points. The DoG responses are locally fitted with a 3D quadratic function, and the location and characteristic scale of the circular regions are determined with subpixel accuracy. The refinement is necessary,


Verification. The Hough transform is used to identify clusters of tentative correspondences with a consistent geometric transformation. Since the actual transformation is approximated by a similarity, the Hough accumulator is 4-dimensional and is partitioned into rather broad bins. Only clusters with at least 3 entries in a bin are considered further. Each such cluster is then subjected to a geometric verification procedure in which an iterative least-squares fit is used to find the best affine projection relating the query and database images.
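The sketch below illustrates the spirit of this pose-clustering step (it is not Lowe's implementation): each tentative correspondence between keypoints carrying position, scale and orientation votes for a coarsely binned similarity transform, and bins with at least three votes are kept as candidate clusters. The keypoint tuple layout and the bin widths are assumptions made for the example.

```python
import math
from collections import defaultdict

def pose_clusters(matches, rot_bin=math.radians(30), scale_bin=2.0, loc_bin=64.0):
    """Cluster tentative correspondences by a coarse similarity transform (Hough voting).

    matches: list of pairs (query_kp, model_kp), each keypoint being a tuple
             (x, y, scale, orientation_in_radians).
    Returns the clusters (lists of matches) that received at least 3 votes.
    """
    accumulator = defaultdict(list)
    for q, m in matches:
        d_theta = (q[3] - m[3]) % (2 * math.pi)        # relative rotation
        d_scale = q[2] / m[2]                          # relative scale
        dx, dy = q[0] - m[0], q[1] - m[1]              # coarse translation
        key = (int(d_theta / rot_bin),
               int(math.log(d_scale, scale_bin)),
               int(dx / loc_bin), int(dy / loc_bin))
        accumulator[key].append((q, m))
    return [votes for votes in accumulator.values() if len(votes) >= 3]
```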

    b) The Approach of Mikolajczyk & Schmid

The approach of Schmid et al. is described in [44, 28, 56, 54, 53, 55, 27, 10]. Based on an affine generalisation of the Harris corner detector, anchor points are detected and described by Gaussian derivatives of image intensities computed over shape-adapted elliptical neighbourhoods.

Detector. In their work, Mikolajczyk and Schmid implement an affine-adapted Harris point detector. Since the three-parametric affine Gaussian scale space is too complex to be practically useful, they propose a solution which iteratively searches for an affine shape adaptation in the neighbourhoods of points detected in the uniform scale space. For initialisation, approximate locations and scales of interest points are extracted by the standard multi-scale Harris detector. These points are not affine invariant because of the uniform Gaussian kernel used. Given the initial approximate solution, the algorithm iteratively modifies the shape, the scale and the spatial location of the neighbourhood of each point, and converges to affine-invariant interest points. For more details see [28].

Descriptors and Matching. The descriptors are composed of Gaussian derivatives computed over the shape-normalised regions. Invariance to rotation is obtained by steering the derivatives in the direction of the gradient. Using derivatives up to 4th order, the descriptors are 12-dimensional. The similarity of descriptors is, in a first approximation, measured by the Mahalanobis distance. Promising close matches are then confirmed or rejected by a cross-correlation measure computed over normalised neighbourhood windows.
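As a side note, the Mahalanobis distance used here (and by several of the other methods below) can be sketched as follows; in practice the covariance matrix would be estimated from a training set of descriptors, which is an assumption of this example.

```python
import numpy as np

def mahalanobis(d1, d2, cov):
    """Mahalanobis distance between two descriptors given a covariance matrix."""
    diff = np.asarray(d1, dtype=float) - np.asarray(d2, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Illustrative use: covariance estimated from a (hypothetical) set of 12-dimensional descriptors.
training = np.random.rand(1000, 12)            # stand-in for real training descriptors
cov = np.cov(training, rowvar=False)
print(mahalanobis(training[0], training[1], cov))
```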

Verification. Once the point-to-point correspondences are obtained, a robust estimate of the geometric transformation between the two images is computed using the RANSAC algorithm. The transformation used is either a homography or a fundamental matrix.

Recently, Dorko and Schmid [10] extended the approach towards object categorisation. Local image patches are detected and described in the same way as above. Patches from several examples of objects of a given category (e.g. cars) are collected together, and a classifier is trained to distinguish them from patches of different categories and from background patches.

    c) The Approach of Tuytelaars, Ferrari & van Gool

  • 8/11/2019 Object Recognition Techniques (2)

    60/224

    60

Luc van Gool and his collaborators developed an approach based on the matching of local image features [73, 75, 11, 72, 71, 74, 69]. They start with the detection of elliptical or parallelogram image regions. The regions are described by a vector of photometrically invariant generalised colour moments, and matching is typically verified by the epipolar geometry constraint.

Detector. Two methods for the extraction of affinely invariant regions are proposed, yielding geometry- and intensity-based regions. The regions are affine covariant: they adapt their shape to the underlying intensity profile in order to keep representing the same physical part of an object. Apart from the geometric invariance, photometric invariance allows for independent scaling and offsets in each of the three colour channels. The region extraction always starts by detecting stable anchor points. The anchor points are either Harris points [13] or local extrema of the image intensity. Although the detection of Harris points is not truly affine invariant, since the support set over which the response is computed is circular, the points are still fairly stable under viewpoint changes and can be localised precisely (even to subpixel accuracy). Intensity extrema, on the other hand, are invariant to any continuous geometric transformation and to any monotonic transformation of the intensity, but are not localised as accurately. On colour images, the detection is performed three times, separately on each of the colour bands.

Descriptors and Matching. In the case of geometry-based regions, each region is described by a vector of 18 generalised colour moments [29], invariant to photometric transformations. For the intensity-based regions, 9 rotation-invariant generalised colour moments are used. The similarity between the descriptors is given by the Mahalanobis distance, and correspondences between two images are formed from regions whose distance is mutually smallest. Once corresponding regions have been found, the cross-correlation between them is computed as a final check before accepting the match. In the case of the intensity-based regions, where the rotation is unknown, the cross-correlation is maximised over all rotations. Good matches are further fine-tuned by non-linear optimisation: the cross-correlation is maximised over small deviations of the transformation parameters.
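Generalised colour moments combine spatial powers with powers of the colour channels over a region. The sketch below computes moments of the form M = Σ x^p y^q R^a G^b B^c over the pixels of a region mask; it is a simplified, unnormalised illustration of the idea in [29], with the particular exponent set chosen arbitrarily.

```python
import numpy as np

def generalised_colour_moment(image_rgb, mask, p, q, a, b, c):
    """Sum of x^p * y^q * R^a * G^b * B^c over the masked region (unnormalised)."""
    ys, xs = np.nonzero(mask)                       # pixel coordinates inside the region
    r, g, bch = (image_rgb[ys, xs, i].astype(float) / 255.0 for i in range(3))
    return float(np.sum(xs**p * ys**q * r**a * g**b * bch**c))

def moment_descriptor(image_rgb, mask):
    # A small, arbitrary selection of exponents, just to illustrate building a vector.
    exponents = [(0, 0, 1, 0, 0), (0, 0, 0, 1, 0), (0, 0, 0, 0, 1),
                 (1, 0, 1, 0, 0), (0, 1, 0, 1, 0), (1, 1, 0, 0, 1)]
    return np.array([generalised_colour_moment(image_rgb, mask, *e) for e in exponents])
```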

Verification. The set of tentative correspondences is pruned by both geometric and photometric constraints. The geometric constraint essentially rejects correspondences contradicting the epipolar geometry. The photometric constraint assumes that there is always a group of corresponding regions that undergo the same transformation of intensities; correspondences with a singular photometric transformation are rejected. Recently, a growing flexible homography approach was presented which allows for accurate model alignment even for non-rigid objects. The size of the aligned area is then used as a measure of the match quality.

d) The LAF Approach of Matas

The approach of Matas et al. [25, 37, 26, 36] starts with the detection of Maximally Stable Extremal Regions. Affine covariant local coordinate systems (called Local Affine Frames, LAFs) are then established, and measurements taken relative to them describe the regions.


    Figure 3: Examples of correspondences established between frames of a database image (left)

    and a query image (right).

Detector. The Maximally Stable Extremal Regions (MSERs) were introduced in [25]. The attractive properties of MSERs are: 1. invariance to affine transformations of image coordinates; 2. invariance to monotonic transformations of intensity; 3. computational complexity almost linear in the number of pixels, and consequently near real-time run time; and 4. since no smoothing is involved, both very fine and very coarse image structures are detected. Starting from the contours of a detected region, local frames (coordinate systems) are constructed in several affine covariant ways, exploiting the affine covariant properties of the covariance matrix, bi-tangent lines, and line parallelism. As demonstrated in Figure 3, local affine frames facilitate the normalisation of image patches into a canonical frame and enable direct comparison of photometrically normalised intensity values, eliminating the need for invariants.
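OpenCV ships an MSER detector that can reproduce the first step of this pipeline (the local affine frame construction itself is not part of OpenCV and is omitted here); the sketch below simply detects the extremal regions on a greyscale image, with the file name as a placeholder.

```python
import cv2

# Detect Maximally Stable Extremal Regions on a greyscale image.
image = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)   # placeholder file name
mser = cv2.MSER_create()
regions, bounding_boxes = mser.detectRegions(image)
print("detected", len(regions), "extremal regions")
```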

Descriptor. Three different descriptors were used. The first is directly the intensities of the local patches: the intensities are discretised into 15 × 15 × 3 rasters, yielding 675-dimensional descriptors. This size is discriminative enough to distinguish between a large number of database objects, yet coarse enough to tolerate moderate misalignments in the frame localisation. The second type of descriptor employs the discrete cosine transformation (DCT), which is applied to the discretised patches [38]; the number of low-frequency DCT coefficients kept in the database is used to trade descriptor discriminability against localisation tolerance. Finally, rotational invariants were used.
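A rough sketch of the second descriptor type: a geometrically and photometrically normalised patch is resampled to a fixed raster and only a block of low-frequency DCT coefficients is kept. The raster is set to 16 × 16 here because OpenCV's DCT requires even array sizes, and the number of retained coefficients (6 × 6 per channel) is an arbitrary choice for the example.

```python
import cv2
import numpy as np

def dct_descriptor(patch_bgr, raster=16, keep=6):
    """Low-frequency DCT coefficients of a normalised colour patch."""
    small = cv2.resize(patch_bgr, (raster, raster)).astype(np.float32)
    coeffs = []
    for ch in range(3):
        dct = cv2.dct(small[:, :, ch])                # 2D DCT of one colour channel
        coeffs.append(dct[:keep, :keep].ravel())      # keep the low-frequency block
    return np.concatenate(coeffs)
```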

Verification. In wide-baseline stereo problems, the correspondences are verified by robustly selecting only those conforming to the epipolar geometry constraint. For object recognition it is typically sufficient to approximate the global geometric transformation by a homography, with a flexible tolerance increasing towards the object boundaries.

    e) The Approach of Zisserman

A. Zisserman and his collaborators developed strategies for the matching of local features mainly in the context of the wide-baseline stereo problem [43, 42, 48, 45, 46]. Recently they presented an interesting line of work relating the image retrieval problem to text retrieval [63, 47, 49], introducing an image retrieval system, called Video Google, which is capable of processing and indexing full-length movies.

Detectors and Descriptors. Two types of detectors of local image elements are employed. One is the shape-adapted elliptical regions by Mikolajczyk and Schmid, as described in


entation of the objects, and to occlusion. Ohba and Ikeuchi, and Jugessur and Dudek, propose appearance-based object recognition methods robust to variations in the background and to occlusion of a substantial fraction of the image.

In order to apply the eigenspace analysis to the recognition of partially occluded objects, they propose to divide the object appearance into small windows, referred to as eigen windows, and to apply eigenspace analysis to them. As in other approaches exploiting local appearance, even if some of the windows are occluded, the remaining ones are still effective and can recover the object identity and pose.

In addition to robustness to occlusions, Jugessur and Dudek [16] also address the problem of rotation invariance. The proposed solution is to compute the PCA not on the intensity patches directly, but in the frequency domain of the windows represented in polar coordinates.
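The eigen-window idea can be sketched with a plain PCA over vectorised local windows; the window dimensionality and the number of retained components below are arbitrary, and a real system would also store the per-window projections for pose recovery.

```python
import numpy as np

def eigen_windows(windows, n_components=20):
    """PCA over vectorised local windows (rows of `windows`, shape (N, w*w)).

    Returns the mean window and the top principal directions (the eigen windows).
    """
    X = np.asarray(windows, dtype=float)
    mean = X.mean(axis=0)
    # SVD of the centred data gives the principal directions in the rows of Vt.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def project(window, mean, components):
    # Coefficients of one window in the eigen-window basis.
    return components @ (np.asarray(window, dtype=float) - mean)
```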

    g) The Approach of Selinger & Nelson

The object recognition system developed by Nelson and Selinger at the University of Rochester exploits a four-level hierarchy of grouping processes [35, 59, 61, 58, 57, 60]. The system architecture is similar to other local feature-based approaches, though a different terminology is used. Inspired by the Gestalt laws and perceptual grouping principles, a four-level grouping hierarchy is built, where higher levels contain groups of elements from lower levels.

The hierarchy is constructed as follows. At the fourth, highest level, a 3D object is represented as a topologically structured set of flexible 2D views; the geometric relations between the views are stored here. This level is used for geometric reasoning, but not for recognition. Recognition takes place at the third level, the level of the component views. In these views the visual appearance of an object, derived from a training image, is represented as a loosely structured combination of a number of local context regions. The local context regions (local features) are represented at the second level; they can be thought of as local image patches surrounding first-level features. At the first level are the features (detected image elements) that result from grouping processes run on the image, typically representing connected contour fragments or locally homogeneous regions. Only


Figure 4: Examples of corresponding query (left columns) and database (right columns) images from the ZuBuD dataset. The image pairs exhibit occlusion, varying illumination, and viewpoint and orientation changes.

Efficient recognition is achieved by using a database implemented as an associative memory of keyed context patches. An unknown keyed context patch recalls associated hypotheses for all known views of objects that could have produced such a context patch. These hypotheses are processed by a second associative memory, indexed by the view parameters, which partitions the hypotheses into clusters that are mutually consistent within a loose geometric framework (these clusters are the third-level groups). The looseness is obtained by tolerating a specified deviation in position, size, and orientation; the bounds are set to be consistent with a given distance between training views (e.g. approximately 20 degrees). The output of the recognition stage is a set of third-level groupings that represent hypotheses of the identity and pose of objects in the scene, ranked by the total evidence for each hypothesis.

    h) APPLICATIONS

Approaches matching local features have been experimentally shown to obtain state-of-the-art results. Here we present a few examples of the addressed problems. Results are demonstrated using the approach of Matas et al. [37, 36, 38], although comparable results have been shown by others.


Figure 5: Image retrieval on the FOCUS dataset: query images, database images, and query localisation results.

Object Recognition. In object recognition experiments, the Columbia Object Image Library (COIL-100) [1], or more often its subset COIL-20, has been widely used, and for comparison purposes it has become a de facto standard benchmark dataset.

Figure 6: An example of matches established on a wide-baseline stereo pair.

COIL-100 is a set of colour images of 100 different objects, where 72 images of each object were taken at pose intervals of 5 degrees. The objects are unoccluded and on an uncluttered black background. Such a configuration is benign for appearance-based methods. Table 1 compares the recognition rates achieved by the LAF approach with the rates of several


appearance-based object recognition methods. Results are presented for five experimental set-ups differing in the number of training views per object. Decreasing the number of training views increases the demands on a method's generalisation ability and on its insensitivity to image deformations. The LAF approach performs best in all experiments, regardless of the number of training views. With only four training views the recognition rate is almost 95%, demonstrating a remarkable robustness to local affine distortions.

Image retrieval. The retrieval performance of the LAF method was evaluated on the FOCUS dataset, containing 360 high-resolution colour images of advertisements scanned from magazines. The task was to retrieve the adverts for a given product, given a query image of the product logo. Examples of query logos, retrieved images, and visualised localisations of the logos are shown in Figure 5.

Another challenging retrieval problem involved the recognition of buildings in urban scenes. Given an image of an unknown building, taken from an unknown viewpoint, the algorithm had to identify the building. The experiments were conducted on a set of images of 201 different buildings. The dataset was provided by ETH Zurich and is publicly available [62]; it contains five photographs of each of the 201 buildings, and a separate set of 115 query images is provided. Examples of corresponding query and database images are shown in Figure 4. The LAF method achieved a 100% recognition rate at rank 1.

Video retrieval. The problem of retrieving video frames from full-length movies was addressed in [63]. Local descriptors were computed on key frames and stored in a database. To reduce the otherwise enormous database size, the descriptors were clustered according to their similarity. Impressive real-time retrieval was achieved for a closed system, i.e. for the case of query images originating from the movie itself.
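The clustering step corresponds to building a visual vocabulary. A minimal sketch using k-means from SciPy is shown below; the vocabulary size is an arbitrary choice, and the descriptor extraction itself is assumed to have happened already.

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

def build_vocabulary(descriptors, n_words=1000):
    """Cluster local descriptors into a vocabulary of visual words (cluster centres)."""
    data = np.asarray(descriptors, dtype=float)
    centroids, _ = kmeans2(data, n_words, minit="++")
    return centroids

def quantise(descriptors, centroids):
    """Map each descriptor to the index of its nearest visual word."""
    words, _ = vq(np.asarray(descriptors, dtype=float), centroids)
    return words
```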

Wide-baseline stereo matching. For a significant variety of scenes, the epipolar geometry can be computed automatically from two (or possibly more) uncalibrated images showing the scene from significantly different viewpoints. The role of matching in the wide-baseline stereo problem is to provide corresponding points, i.e. points which in the two images represent the same element of the 3D scene. Correspondences found in a difficult stereo pair are shown in Figure 6.
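Complementing the homography sketch given earlier, the epipolar constraint for such uncalibrated pairs can be estimated robustly with OpenCV as follows; the matched point arrays and the thresholds are again placeholders.

```python
import cv2
import numpy as np

def epipolar_inliers(pts_left, pts_right, ransac_thresh=1.0):
    """Robustly estimate the fundamental matrix and count the consistent matches."""
    pl = np.asarray(pts_left, dtype=np.float32)
    pr = np.asarray(pts_right, dtype=np.float32)
    F, mask = cv2.findFundamentalMat(pl, pr, cv2.FM_RANSAC, ransac_thresh, 0.99)
    return F, (int(mask.sum()) if mask is not None else 0)
```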


1. Vision is a difficult problem consisting of many building blocks that can be characterised in isolation. Eye movements are one such building block.

2. Since visual sensitivity is highest in the fovea, eye movements are in general needed for recognising small stimuli.

3. During a fixation, a number of things happen concurrently: the visual information around the fixation point is analysed, and visual information away from the current fixation is analysed to help select the next saccade target.

The exact processes involved are still largely unknown. Findlay and Gilchrist [33] also pose a number of questions in order to demonstrate that numerous basic problems in vision remain open for research:

    1. What visual information determines the target of the next eye movement?

    2. What visual information determines when eyes move?

    3. What information is combined across eye movements to form a stable representation of the

    environment?

As discussed earlier [29], a brute-force approach to object localisation subject to a cost constraint is often intractable as the search space size increases. Furthermore, the human brain would have to be some hundreds of thousands of times larger than it currently is if visual sensitivity across the whole visual field were the same as that in the fovea [29]. Thus, active and attentive approaches to the problem are usually proposed as a means of addressing these constraints.

We will show in this section that, within the context of the general framework for object recognition illustrated in Fig. 1, previous work on active object recognition systems has demonstrated that active vision systems can lead to significant improvements in both the learning and the inference phases of object recognition. This includes improvements in the robustness of all components of the feature-extraction, feature-grouping, object-hypothesis, object-verification, object-recognition pipeline.

Some of the problems inherent in single-view object recognition include [266]:

1. The im