object recognition techniques (2)
TRANSCRIPT
-
8/11/2019 Object Recognition Techniques (2)
1/224
1
ANALYSIS OF HIERARCHICAL
OBJECT RECOGNITION
TECHNIQUES
For the Degree of
Doctor of Philosophy
In
SUBJECT
Submitted to
SHRI VENKATESHWARA UNIVERSITY,
Gajraula, Amroha (UTTAR PRADESH)
Research Supervisor: Dr. Name
Research Scholar:
2014
DECLARATION
I hereby declare that this submission is my own work and that, to the best of my
knowledge and belief, it contains no material previously published or written by
another person nor material which to a substantial extent has been accepted for the
award of any other degree or diploma of the university or other institute of higher
learning, except where due acknowledgment has been made in the text.
Signature of Research Scholar
Name :
Enrollment No.
CERTIFICATE
Certified that Name of student (enrollment no.) has carried out the research work
presented in this thesis entitled "Title of Thesis" for the award of Doctor
of Philosophy from Shri Venkateshwara University, Gajraula under my/our (print only
that is applicable) supervision. The thesis embodies results of original work and
studies as carried out by the student himself/herself (print only that is applicable),
and the contents of the thesis do not form the basis for the award of any other degree to
the candidate or to anybody else from this or any other University/Institution.
Signature Signature
(Name of Supervisor) (Name of Supervisor)
(Designation) (Designation)
(Address) (Address)
Date:
SHRI VENKATESHWARA UNIVERSITY, GAJRAULA
CERTIFICATE OF THESIS SUBMISSION FOR
EVALUATION (To be submitted in duplicate)
1. Name: ..........
2. Enrollment No.:
3. Thesis title: ...........
4. Degree for which the thesis is submitted:
5. Department of the University to which the thesis is submitted:
6. Faculty of the University to which the thesis is submitted:
7. Thesis Preparation Guide was referred to for preparing the thesis. Yes No
8. Specifications regarding thesis format have been closely followed. Yes No
9. The contents of the thesis have been organized based on the guidelines. Yes No
10. The thesis has been prepared without resorting to plagiarism. Yes No
11. All sources used have been cited appropriately. Yes No
12. The thesis has not been submitted elsewhere for a degree. Yes No
13. Submitted two copies of spiral bound thesis plus one CD. Yes No
14. Submitted five copies of synopsis approved by RDC. Yes No
15. Submitted two copies of spiral bound research summary. Yes No
Name: ...  Enrollment No.:
SHRI VENKATESHWARA UNIVERSITY, GAJRAULA
CERTIFICATE OF FINAL THESIS SUBMISSION
(To be submitted in duplicate)
1. Name: .........
2. Enrollment No.:
3. Thesis title: ...
4. Degree for which the thesis is submitted: ........
5. Department of the University to which the thesis is submitted: ...
6. Faculty of the University to which the thesis is submitted:
7. Thesis Preparation Guide was referred to for preparing the thesis. Yes No
8. Specifications regarding thesis format have been closely followed. Yes No
9. The contents of the thesis have been organized based on the guidelines. Yes No
10. The thesis has been prepared without resorting to plagiarism. Yes No
11. All sources used have been cited appropriately. Yes No
12. The thesis has not been submitted elsewhere for a degree. Yes No
13. All the corrections have been incorporated. Yes No
14. Submitted five hard bound copies of the thesis plus one CD. Yes No
15. Submitted five copies of research summary. Yes No
(Signature(s) of the Supervisor(s) (Signature of the Candidate)
Name(s): Name
Enrollment No.:
1) ABSTRACT
Object recognition systems constitute a deeply entrenched and omnipresent
component of modern intelligent systems. Research on object recognition algorithms
has led to advances in factory and office automation through the creation of optical
character recognition systems, assembly-line industrial inspection systems, as well as
chip defect identification systems. It has also led to significant advances in medical
imaging, defence and biometrics. In this paper we discuss the evolution of
computer-based object recognition systems over the last fifty years, and review
the successes and failures of proposed solutions to the problem. We survey the breadth of
approaches adopted over the years in attempting to solve the problem, and highlight
the important role that active and attentive approaches must play in any solution that
bridges the semantic gap in the proposed object representations, while simultaneously
leading to efficient learning and inference algorithms. From the earliest systems
which dealt with the character recognition problem, to modern visually-guided agents
that can purposively search entire rooms for objects, we argue that a common thread
of all such systems is their fragility and their inability to generalize as well as the
human visual system can. At the same time, however, we demonstrate that the
performance of such systems in strictly controlled environments often vastly
outperforms the capabilities of the human visual system. We conclude our survey by
arguing that the next step in the evolution of object recognition algorithms will
require radical and bold steps forward in terms of the object representations, as well
as the learning and inference algorithms used.
LIST OF TABLES
Table: 1.
Comparison of kernel descriptors (KDES) and hierarchical kernel descriptors (HKDES)
on CIFAR-10, with extensive comparisons against current state-of-the-art algorithms
in terms of accuracy.
Table: 2.
Comparisons on the RGB-D Object Dataset. RGB denotes features over RGB images
and depth denotes features over depth images.
Table: 3.
Comparisons to existing recognition approaches using a combination of depth features
and image features. Nonlinear SVMs use Gaussian kernel.
LIST OF FIGURES
Figure 1:
Different components of an object recognition system are shown
Figure 2:
Hierarchical Kernel Descriptors
Figure 3:
Examples of correspondences established between frames of a database image (left) and a
query image (right).
Figure 4:
Examples of corresponding query (left columns) and database (right columns) images from
the ZuBuD dataset. The image pairs exhibit occlusion, varying illumination, viewpoint
and orientation changes.
Figure 5:
Examples of corresponding query (left columns) and database (right columns) images from
the ZuBuD dataset. The image pairs exhibit occlusion, varying illumination, viewpoint
and orientation changes.
Figure 6:
Image retrieval on the FOCUS dataset: query localisation results, showing query images,
database images, and query localisations.
Figure 7:
An example of matches established on a wide-baseline stereo pair.
Figure 8:
Overview of the spatiotemporal (4-D) approach to dynamic vision (adapted from [50, 268]).
Chart 1: Summary of the 1989-2009 papers in Table 5 on active object detection. By
definition, search efficiency is not the primary concern in these systems, since by
assumption the object is always in the sensor's field of view. However, inference
scalability constitutes a significant component of such systems. We notice very little
use of function and context in these systems. Furthermore, training such systems is
often non-trivial.
Figure 9:
A sequence of viewpoints from which the system developed by Wilkes and Tsotsos [266]
actively recognizes an origami object.
Figure 10:
The object verification and next viewpoint selection algorithm used in [280] (diagram
adapted from [280]).
Figure 11:
Graphical model for next-view-planning as proposed in [284, 285].
Figure 12:
The aspects of an object and its congruence classes (adapted from Gremban and Ikeuchi
[287]).
Figure 13:
An aspect resolution tree used to determine if there is a single interval of values
that satisfies certain constraints (adapted from Gremban and Ikeuchi [287]).
Figure 14:
The two types of view degeneracies proposed by Dickinson et al. [49].
Chart 2: Summary of the 1992-2012 papers on active object localization and recognition from
Table 6. As expected, search efficiency and the role of 3D information is significantly more
prominent in these papers (as compared to Chart 7)
Figure 15:
Reconstructionist vision vs. Selective Perception, after Rimey and Brown [302].
feature vector c. Laporte and Arbel [291] build upon this work and choose the best next
viewpoint by calculating the symmetric KL divergence (Jeffrey divergence) of the likelihood
of the observed data given the assumption that this data resulted from two views of two
distinct objects. By weighing each Jeffrey divergence by the product of the probabilities
of observing the two competing objects and their two views, they can determine the next view
which provides the object identity hypothesis, thus again demonstrating the active vision
system's direct applicability in the standard recognition pipeline (see Fig. 1).
Figure 16:
A PART-OF Bayes net for a table-top scenario, similar to what was proposed by Rimey and
Brown [302].
Figure 17:
An IS-A Bayes tree for a table-top scenario that was used by Rimey and Brown [302].
Figure 18:
The direct-search model, which includes nodes that affect direct search efficiency (unboxed
nodes) and explicit model parameters (boxed nodes). Adapted from Wixson and Ballard
[303].
Figure 19:
Junction types proposed by Malik [321] and used by Brunnstrom et al. [306] for recognizing
man-made objects.
Figure 20:
An ASIMO humanoid robot was used by Andreopoulos et al. [24] to actively search an
indoor environment.
Figure 21:
An example of ASIMO pointing at an object once the target object is successfully localized
in a 3D environment [24].
Figure 22:
The twenty object classes that the 2011 PASCAL dataset contains. Some of the earlier
versions of the PASCAL dataset only used subsets of these object classes. Adapted from
[324].
Chart 3: Summary of the PASCAL Challenge papers from Table 7, corresponding to
algorithms published between 2002 and 2011. Notice that the winning PASCAL challenge
algorithms typically make little use of function, context and 3D, and make moderate use
of texture.
Figure 23:
The HOG detector of Dalal and Triggs (from [335] with permission). (a): The average
gradient image over a set of registered training images. (b), (c): Each pixel demonstrates
the maximum and minimum (respectively) SVM weight of the corresponding block. (d): The test
image used in the rest of the subfigures. (e): The computed R-HOG descriptor of the image in
subfigure (d). (f), (g): The R-HOG descriptor weighted by the positive and negative SVM
weights respectively.
Figure 24:
Examples of the Harris-Laplace detector and the Laplacian detector, which were used
extensively in [142] as interest-point/region detectors (figure reproduced from [142] with
permission).
Figure 25:
The distributions of various object classes corresponding to six feature classes.
Figure 26:
Example of the algorithm by Felzenszwalb et al. [366] localizing a person using the coarse
template representation and the higher resolution subpart templates of the person (from [366]
with permission).
Figure 27:
The HOG feature pyramid used in [366], showing the coarse root-level template and the
higher resolution templates of the person's subparts (from [366] with permission).
Figure 28:
The distribution of edges and appearance patches of certain car model training images used
by Chum and Zisserman [365], with the learned regions of interest overlaid (from [365], with
permission).
Figure 29:
The 35 most frequent 2AS constructed from 10 outdoor images (from [367] with permission).
It is easier to understand the left image's contents (e.g., a busy road with mountains in
the background) if the cars in the image have first been localized. Conversely, in the
right image, occlusions make the object localization problem difficult. Thus, prior
knowledge that the image contains exclusively cars can make the localization problem
easier (from [361] with permission).
Figure 30:
Demonstrating how top-down category-specific attentional biases can modulate the
shape-words during the bag-of-words histogram construction (from [358] with permission).
low-level features (e.g., edges, color) and grouping them in more complex ways in order
to achieve more universal representations of object parts. In terms of object verification
and object hypothesizing (see Fig. 1), the work by Felzenszwalb et al. [366] represents the
most successful approach tested in PASCAL 2007 for using a coarse generative model of object
parts to improve recognition performance.
Figure 31:
(a) The 3-layer tree-like object representation in [348]. (b) A reference template without
any part displacement, showing the root-node bounding box (blue), the centers of the 9 parts
in the 2nd layer (yellow dots), and the 36 parts at the last layer (purple). (c) and (d)
denote object localizations (from [348] with permission).
Figure 32:
On using context to mitigate the negative effects of ambiguous localizations [350]. The
greater the ambiguities, the greater the role contextual knowledge plays (from [350] with
permission).
Figure 33:
An example of a feature extraction stage of the type F_CSG - R_abs - N - P_A. An input
image (or a feature map) is passed through a non-linear filter bank, followed by
rectification, local contrast normalization and spatial pooling/sub-sampling.
Figure 34:
Test error rate vs. number of training samples per class on the NORB dataset. Although
pure random features perform surprisingly well when training data is very scarce, for
large numbers of training samples learning improves the performance significantly.
Absolute value rectification (R_abs) and local normalization (N) are shown to improve
the performance in all cases.
Figure 35:
Left: random stage-1 filters, and corresponding optimal inputs that maximize the response
of each corresponding complex cell in a F_CSG - R_abs - N - P_A architecture.
Figure 36:
Left: A dictionary with 128 elements, learned with a patch-based sparse coding model.
Right: A dictionary with 128 elements, learned with a convolutional sparse coding model.
The dictionary learned with the convolutional model spans the orientation space much more
uniformly. In addition, the diversity of filters obtained by the convolutional sparse
model is much richer compared to the patch-based one.
Figure 37:
Top Left: Smooth shrinkage function. Parameters and b control the smoothness and
location of the kink of the function. As it converges more closely to the soft
thresholding operator. Top Right: Total loss as a function of the number of iterations.
The vertical dotted line marks the iteration number when the diagonal Hessian
approximation was updated. It is clear that for both encoder functions, the Hessian
update improves the convergence significantly. Bottom: 128 convolutional filters (k)
learned in the encoder using the smooth shrinkage function.
Figure 38:
Second stage filters. Left: Encoder kernels that correspond to the dictionary elements.
Right: 128 dictionary elements, each row shows 16 dictionary elements, connecting to a
single second-layer feature map. It can be seen that each group extracts similar types
of features from its corresponding inputs.
Figure 39:
Results on the INRIA dataset with the per-image metric. Left: Comparing the two best
systems with unsupervised initialization (UU) vs. random initialization (RR). Right:
Effect of bootstrapping on final performance for the unsupervised initialized system.
Figure 40:
Results on the INRIA dataset with the per-image metric. These curves are computed from
the bounding boxes and confidences made available by (Dollar et al., 2009b), comparing
our two best systems, labeled (U+U+ and R+R+), with all the other methods.
Figure 41:
Reconstruction error vs. l1-norm sparsity penalty for coordinate descent sparse coding
and variational free energy minimization.
Figure 42:
Angle between representations obtained for two consecutive frames, for different
parameter values, using sparse coding and variational free energy minimization.
TABLE OF CONTENTS
1. INTRODUCTION
a. System Component
b. Complexity of Object Recognition
c. Two Dimensional
d. Three Dimensional
e. Segmented
2. HIERARCHICAL KERNEL DESCRIPTOR
a. Kernel Descriptor
b. Kernel Descriptor Over Kernel Descriptor
c. Everyday Object Recognition Using RGB-D
d. Experiments
i. Cifar 10
ii. RGB-D Object Dataset
3. OBJECT RECOGNITION METHOD BASED ON TRANSFORMATION
a. Classes of object recognition methods
i. Appearance based methods
ii. Geometry Based Methods
4. RECOGNITION AS A CORRESPONDENCE LOCAL FEATURE
a. The approach of David Lowe
b. The approach of Mikolajczyk & Schmid
c. The approach of Tuytelaars, Ferrari & Gool
d. The LAF approach of Matas
e. The approach of Zisserman
i. Indexing and Matching
ii. Verification
f. Scale Saliency by Kadir & Brady
g. Local PCA, approaches of Jugessur & Ohba
h. The approach of Selinger and Nelson
i. Applications
5. LITERATURE SURVEY
a. Active and Dynamic Vision
b. Active object detection literature survey
c. Active object localization
6. CASE STUDIES FROM RECOGNITION CHALLENGES AND THE EVOLVING
LANDSCAPE
a. Dataset and evaluation techniques
b. Sampling the current state of the art in the recognition literature
i. Pascal 2005
ii. Pascal 2006
iii. Pascal 2008
iv. Pascal 2009
v. Pascal 2010
vi. Pascal 2011
c. The evolving landscape
7. MULTI STAGE ARCHITECTURE FOR OBJECT RECOGNITION
a. Modules for hierarchical systems
b. Combining module into hierarchy
c. Training protocol
d. Experiment with Caltech101 Dataset
e. Using a single stage of feature extraction
f. Using two stage of feature extraction
g. NORB dataset
h. Random filter performance
i. Handwritten digit recognition
j. Convolutional sparse coding
k. Algorithms and Methods
i. Learning convolutional dictionaries
ii. Learning an efficient encoder
iii. Patch based vs convolutional sparse modelling
l. Multi stage architecture
m. Experiments
i. Object recognition using Caltech 101 dataset
ii. Pedestrian detection
iii. Architecture and Training
iv. Per image evaluation
n. Sparse coding by variational marginalization
o. Variational marginalization for sparse coding
p. Stability experiments
8. CONCLUSION
9. References
2) Introduction
Object recognition is a fundamental and challenging problem and is a major focus of
research in computer vision, machine learning and robotics. The task is difficult partly
because images live in a high-dimensional space and can change with viewpoint, while
the objects themselves may be deformable, leading to large intra-class variation. The
core of building object recognition systems is to extract meaningful representations
(features) from high-dimensional observations such as images, videos, and 3D point
clouds. This paper aims to discover such representations using machine learning
methods. An object recognition system finds objects in the real world from an image
of the world, using object models which are known a priori. This task is surprisingly
difficult. Humans perform object recognition effortlessly and instantaneously.
Algorithmic description of this task for implementation on machines has been very
difficult. In this chapter we will discuss different steps in object recognition and
introduce some techniques that have been used for object recognition in many
applications. We will discuss the different types of recognition tasks that a vision
system may need to perform. We will analyze the complexity of these tasks and
present approaches useful in different phases of the recognition task.
Over the past few years, there has been increasing interest in feature learning for
object recognition using machine learning methods. Deep belief nets (DBNs) are
appealing feature learning methods that can learn a hierarchy of features. DBNs are
trained one layer at a time using contrastive divergence, where the features learned by
the current layer become the data for training the next layer. Deep belief nets have
shown impressive results on handwritten digit recognition, speech recognition and
visual object recognition. Convolutional neural networks (CNNs) are another example
that can learn multiple layers of nonlinear features. In CNNs, the parameters of the
entire network, including a final layer for recognition, are jointly optimized using the
back-propagation algorithm.
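As a rough illustration of what a single stage of such a learned feature hierarchy computes, the sketch below applies a small filter bank followed by a ReLU non-linearity. The filters here are hand-picked for illustration, not learned by contrastive divergence or back-propagation as in a real DBN or CNN:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation of a single-channel image with a kernel."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def feature_layer(image, kernels):
    """One convolutional layer: a filter bank followed by a ReLU non-linearity."""
    return [np.maximum(conv2d(image, k), 0.0) for k in kernels]

# Toy image with a vertical step edge, and two hand-picked gradient filters.
img = np.zeros((8, 8)); img[:, 4:] = 1.0
kx = np.array([[-1.0, 1.0]])        # horizontal gradient filter
ky = np.array([[-1.0], [1.0]])      # vertical gradient filter
maps = feature_layer(img, [kx, ky])
# The horizontal-gradient map responds along the vertical edge; the vertical one does not.
```

Stacking several such stages, with the filters learned from data, is exactly what gives DBNs and CNNs their multi-layer nonlinear features.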
The object recognition problem can be defined as a labeling problem based on models
of known objects. Formally, given an image containing one or more objects of interest
(and background) and a set of labels corresponding to a set of models known to the
system, the system should assign correct labels to regions, or a set of regions, in the
image. The object recognition problem is closely tied to the segmentation problem:
without at least a partial recognition of objects, segmentation cannot be done, and
without segmentation, object recognition is not possible.
In this chapter, we discuss basic aspects of the object recognition hierarchy and its analysis.
We present the architecture and main components of object recognition and discuss
their role in object recognition systems of varying complexity.
Figure 1: Different components of an object recognition system are shown
a) System Component
An object recognition system must have the following components to perform the
task:
Model database (also called modelbase)
Feature detector
Hypothesizer
Hypothesis verifier
A block diagram showing interactions and information flow among different
components of the system is given in Figure 1. The model database contains all the
models known to the system. The information in the model database depends on the
approach used for the recognition. It can vary from a qualitative or functional
description to precise geometric surface information. In many cases, the models of
objects are abstract feature vectors, as discussed later in this section. A feature is some
attribute of the object that is considered important in describing and recognizing the
object in relation to other objects. Size, color, and shape are some commonly used
features.
The feature detector applies operators to images and identifies locations of features
that help in forming object hypotheses. The features used by a system depend on the
types of objects to be recognized and the organization of the model database. Using
the detected features in the image, the hypothesizer assigns likelihoods to objects
present in the scene. This step is used to reduce the search space for the recognizer
using certain features.
The model base is organized using some type of indexing scheme to facilitate
elimination of unlikely object candidates from possible consideration. The verifier
then uses object models to verify the hypotheses and refines the likelihood of objects.
The system then selects the object with the highest likelihood, based on all the
evidence, as the correct object.
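The four components above can be sketched as follows. The model features, the distance-based likelihood, and the acceptance threshold are all hypothetical toy choices for illustration, not values prescribed by the text:

```python
# A minimal sketch of the modelbase / feature detector / hypothesizer / verifier
# pipeline. All feature names and numbers are illustrative assumptions.

MODELBASE = {                      # model database: object -> abstract feature vector
    "bolt":   {"size": 1.0, "elongation": 3.0},
    "washer": {"size": 0.5, "elongation": 1.0},
}

def detect_features(image):
    """Feature detector: stands in for real operators applied to an image."""
    return image  # here the 'image' is already a dict of measured features

def hypothesize(features):
    """Hypothesizer: assign a likelihood to every model via feature distance."""
    scores = {}
    for name, model in MODELBASE.items():
        d = sum((features[k] - model[k]) ** 2 for k in model)
        scores[name] = 1.0 / (1.0 + d)   # smaller distance -> higher likelihood
    return scores

def verify(scores, threshold=0.5):
    """Verifier: accept the best hypothesis only if its likelihood is high enough."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

obs = detect_features({"size": 0.9, "elongation": 2.8})
print(verify(hypothesize(obs)))    # -> bolt
```

In a real system the hypothesizer would also consult the modelbase's indexing scheme to prune unlikely candidates before verification.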
All object recognition systems use models either explicitly or implicitly and employ
feature detectors based on these object models. The hypothesis formation and
verification components vary in their importance in different approaches to object
recognition. Some systems use only hypothesis formation and then select the object
with highest likelihood as the correct object. Pattern classification approaches are a
good example of this approach. Many artificial intelligence systems, on the other
hand, rely little on the hypothesis formation and do more work in the verification
phases. In fact, one of the classical approaches, template matching, bypasses the
hypothesis formation stage entirely.
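Template matching of this kind can be sketched as an exhaustive sliding-window search with no separate hypothesis-formation stage. The negative sum-of-squared-differences score used below is one common (assumed) choice of similarity:

```python
import numpy as np

def match_template(image, template):
    """Slide the template over the image and return the best-matching position."""
    th, tw = template.shape
    best, best_pos = -np.inf, None
    for i in range(image.shape[0] - th + 1):
        for j in range(image.shape[1] - tw + 1):
            window = image[i:i+th, j:j+tw]
            score = -np.sum((window - template) ** 2)  # negative SSD: 0 is a perfect match
            if score > best:
                best, best_pos = score, (i, j)
    return best_pos

# Toy example: a 2x2 bright square hidden in a dark image.
img = np.zeros((6, 6)); img[2:4, 3:5] = 1.0
tmpl = np.ones((2, 2))
print(match_template(img, tmpl))   # -> (2, 3)
```

The exhaustive scan is exactly why template matching scales poorly, which is what motivates the hypothesize-then-verify structure described above.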
An object recognition system must select appropriate tools and techniques for the
steps discussed above. Many factors must be considered in the selection of
appropriate methods for a particular application. The central issues that should be
considered in designing an object recognition system are:
Object or model representation: How should objects be represented in the
model database? What are the important attributes or features of objects that must be
captured in these models? For some objects, geometric descriptions may be available
and may also be efficient, while for another class one may have to rely on generic or
functional features.
The representation of an object should capture all relevant information without any
redundancies and should organize this information in a form that allows easy access
by different components of the object recognition system.
Feature extraction: Which features should be detected, and how can they be
detected reliably? Most features can be computed in two-dimensional images but they
are related to three-dimensional characteristics of objects. Due to the nature of the
image formation process, some features are easy to compute reliably while others are
very difficult. Feature detection issues were discussed in many chapters in this book.
Feature-model matching: How can features in images be matched to models
in the database? In most object recognition tasks, there are many features and
numerous objects. An exhaustive matching approach will solve the recognition
problem but may be too slow to be useful. Effectiveness of features and efficiency of
a matching technique must be considered in developing a matching approach.
Hypothesis formation: How can a set of likely objects based on the feature
matching be selected, and how can probabilities be assigned to each possible object?
The hypothesis formation step is basically a heuristic to reduce the size of the search
space. This step uses knowledge of the application domain to assign some kind of
probability or confidence measure to different objects in the domain. This measure
reflects the likelihood of the presence of objects based on the detected features.
Object verification: How can object models be used to select the most likely
object from the set of probable objects in a given image? The presence of each likely
object can be verified by using their models. One must examine each plausible
hypothesis to verify the presence of the object or ignore it. If the models are
geometric, it is easy to precisely verify objects using camera location and other scene
parameters. In other cases, it may not be possible to verify a hypothesis.
Depending on the complexity of the problem, one or more modules in Figure 1 may
become trivial. For example, pattern recognition-based object recognition systems do
not use any feature-model matching or object verification; they directly assign
probabilities to objects and select the object with the highest probability.
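A minimal sketch of such a pattern-classification recognizer, which turns feature distances directly into class probabilities and picks the highest. The class means and the softmax conversion are illustrative assumptions:

```python
import math

# Toy pattern classifier: no feature-model matching or verification stage,
# just a probability per known object and an argmax. Class means are hypothetical.
CLASS_MEANS = {"screw": [2.0, 0.1], "nut": [0.5, 0.9]}

def classify(feature_vector):
    # Negative squared distance to each class mean, turned into probabilities.
    logits = {c: -sum((f - m) ** 2 for f, m in zip(feature_vector, mean))
              for c, mean in CLASS_MEANS.items()}
    z = sum(math.exp(v) for v in logits.values())
    probs = {c: math.exp(v) / z for c, v in logits.items()}
    return max(probs, key=probs.get), probs

label, probs = classify([1.8, 0.2])
# label is "screw"; the probabilities sum to 1
```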
b) Complexity of Object Recognition
As we studied in earlier chapters in this book, images of scenes depend on
illumination, camera parameters, and camera location. Since an object must be
recognized from images of a scene containing multiple entities, the complexity of
object recognition depends on several factors. A qualitative way to consider the
complexity of the object recognition task would consider the following factors:
Scene constancy: The scene complexity will depend on whether the images are
acquired in similar conditions (illumination, background, camera parameters, and
viewpoint) as the models. As seen in earlier chapters, scene conditions affect images
of the same object dramatically. Under different scene conditions, the performance of
different feature detectors will be significantly different. The nature of the
background, other objects, and illumination must be considered to determine what
kind of features can be efficiently and reliably detected.
Image-model spaces: In some applications, images may be obtained such that
three-dimensional objects can be considered two-dimensional. The models in such
recognition problem into the following classes.
c) Two-dimensional
In many applications, images are acquired from a distance sufficient to consider the
projection to be orthographic. If the objects are always in one stable position in the
scene, then they can be considered two-dimensional. In these applications, one can
use a two-dimensional modelbase. There are two possible cases:
Objects will not be occluded, as in remote sensing and many industrial
applications.
Objects may be occluded by other objects of interest or be partially visible, as
in the bin of parts problem.
In some cases, though the objects may be far away, they may appear in different
positions resulting in multiple stable views. In such cases also, the problem may be
considered inherently as two-dimensional object recognition.
d) Three-dimensional
If the images of objects can be obtained from arbitrary viewpoints, then an object may
appear very different in its two views. For object recognition using three-dimensional
models, the perspective effect and viewpoint of the image have to be considered. The
fact that the models are three-dimensional and the images contain only two-
dimensional information affects object recognition approaches. Again, the two factors
to be considered are whether objects are separated from other objects or not.
For three-dimensional cases, one should consider the information used in the object
recognition task. Two different cases are:
Intensity: There is no surface information available explicitly in intensity
images. Using intensity values, features corresponding to the three-dimensional
structure of objects should be recognized.
2.5-dimensional images: In many applications, surface representations with
viewer-centered coordinates are available, or can be computed, from images. This
information can be used in object recognition.
Range images are also 2.5-dimensional. These images give the distance to different
points in an image from a particular view point.
e) Segmented
The images have been segmented to separate objects from the background. As
discussed in Chapter 3 on segmentation, object recognition and segmentation
problems are closely linked in most cases. In some applications, it is possible to
segment out an object easily. In cases when the objects have not been segmented, the
recognition problem is closely linked with the segmentation problem.
3) HIERARCHICAL KERNEL DESCRIPTORS
Kernel descriptors highlight the kernel view of orientation histograms, such as SIFT
and HOG, and show that they are a particular type of match kernel over patches. This
novel view suggests a unified framework for turning pixel attributes (gradient, color,
local binary pattern, etc.) into patch-level features:
(1) design match kernels using pixel attributes;
(2) learn compact basis vectors using kernel principal component analysis (KPCA);
(3) construct kernel descriptors by projecting the infinite-dimensional feature vectors
onto the learned basis vectors.
The key idea of this work is that we can apply the kernel descriptor framework not
only over sets of pixels (patches), but also over sets of kernel descriptors.
Hierarchical kernel descriptors aggregate spatially nearby patch-level features to form
higher-level features by using kernel descriptors recursively, as shown in Figure 2.
This procedure can be repeated until we reach the final image-level features.
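As a concrete illustration, the three-step construction above can be sketched with a
Gaussian match kernel and KPCA over sampled pixel attributes. This is a minimal sketch
under illustrative assumptions: the attribute dimensionality, the gamma parameter and
the helper names are hypothetical, not the authors' implementation.

```python
import numpy as np

def rbf(X, Y, gamma):
    # Step 1: Gaussian (RBF) match kernel between attribute vectors.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def learn_basis(samples, gamma, r):
    # Step 2: KPCA on sampled pixel attributes -> r compact basis vectors,
    # represented by their expansion coefficients over the samples.
    K = rbf(samples, samples, gamma)
    w, V = np.linalg.eigh(K)
    idx = np.argsort(w)[::-1][:r]                    # top-r eigenpairs
    return V[:, idx] / np.sqrt(np.maximum(w[idx], 1e-12))

def kernel_descriptor(patch_attrs, samples, alphas, gamma):
    # Step 3: project the (implicit) infinite-dimensional patch feature,
    # the mean of the pixel feature maps, onto the learned basis.
    K = rbf(patch_attrs, samples, gamma)             # n_pixels x n_samples
    return K.mean(axis=0) @ alphas                   # r-dimensional descriptor
```

The same projection applies unchanged at higher layers, which is what makes the
recursive, hierarchical construction possible.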
a) Kernel Descriptors
Patch-level features are critical for many computer vision tasks. Orientation
histograms like SIFT and HOG are popular patch-level features for object recognition.
Kernel descriptors include SIFT and HOG as special cases, and provide a principled
way to generate rich patch-level features from various pixel attributes.
The gradient match kernel, Kgrad, is based on the pixel gradient attribute.
Figure 2: Hierarchical Kernel Descriptors.
In the first layer, pixel attributes are aggregated into patch-level features. In the
second layer, patch-level features are turned into aggregated patch-level features. In
the final layer, aggregated patch-level features are converted into image-level
features. Kernel descriptors are used in every layer.
where P and Q are patches from two different images, and z denotes the 2D position of
a pixel in an image patch. Let $\theta_z$ and $m_z$ be the orientation and magnitude of
the image gradient at a pixel z.
The color kernel descriptor Kcol is based on the pixel intensity attribute,
where $c_z$ is the pixel color at position z (intensity for gray images and RGB values
for color images) and $k_c(c_z, c_{z'}) = \exp(-\gamma_c \|c_z - c_{z'}\|^2)$ is a
Gaussian kernel. The shape kernel descriptor, Kshape, is based on the local binary
pattern attribute.
Gradient, color and shape kernel descriptors are strong in their own right and
complement one another. Their combination turns out to be consistently (much) better
than the best individual feature. Kernel descriptors are able to generate rich visual
feature sets by turning various pixel attributes into patch-level features, and are
superior to the current state-of-the-art recognition algorithms on many standard visual
object recognition datasets.
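A direct (if slow) transcription of such a match kernel may help fix ideas. The gamma
values and the representation of each pixel as a (position, orientation vector,
magnitude) triple are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def gauss(x, y, gamma):
    # Gaussian kernel between two attribute vectors.
    return float(np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(y)) ** 2)))

def gradient_match_kernel(P, Q, gamma_o=5.0, gamma_p=3.0):
    # K(P, Q) = sum over pixel pairs (z in P, z' in Q) of
    #   m_z * m_z' * k_o(theta_z, theta_z') * k_p(z, z'),
    # i.e. gradient magnitudes weight an orientation kernel and a
    # position kernel, exactly the structure described in the text.
    total = 0.0
    for pz, oz, mz in P:
        for pq, oq, mq in Q:
            total += mz * mq * gauss(oz, oq, gamma_o) * gauss(pz, pq, gamma_p)
    return total
```

The double sum over pixel pairs is quadratic in patch size; the KPCA projection of the
previous subsection exists precisely to avoid evaluating it explicitly.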
b) Kernel Descriptors over Kernel Descriptors
The match kernels used to aggregate patch-level features have a similar structure to
those used to aggregate pixel attributes:
where A and A' denote image patches, and P and Q are sets of image patches.
The patch position Gaussian kernel
$k_C(C_A, C_{A'}) = \exp(-\gamma_C \|C_A - C_{A'}\|^2) = \phi_C(C_A)^\top \phi_C(C_{A'})$
describes the spatial relationship between two patches, where $C_A$ is the center
position of patch A (normalized to [0, 1]). The patch Gaussian kernel
$k_F(F_A, F_{A'}) = \exp(-\gamma_F \|F_A - F_{A'}\|^2) = \phi_F(F_A)^\top \phi_F(F_{A'})$
measures the similarity of two patch-level features, where $F_A$ are gradient, shape or
color kernel descriptors in our case. The linear kernel $W_A W_{A'}$ weights the
contribution of each patch-level feature, where the weights are regularized by a small
positive constant. $W_A$ is the average of gradient magnitudes for the gradient kernel
descriptor, the average of standard deviations for the shape kernel descriptor, and is
always 1 for the color kernel descriptor.
Note that although efficient match kernels [1] used match kernels to aggregate
patch-level features, they do not consider spatial information in the match kernels, so
a spatial pyramid is required to integrate spatial information. In addition, they do
not weight the contribution of each patch, which can be suboptimal. The novel joint
match kernels (5) provide a way to integrate patch-level features, patch variation,
and spatial information jointly.
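Assuming patch centers normalized to [0, 1]^2 and treating the gamma values as free
parameters (both illustrative choices), the joint match kernel above can be sketched in
vectorized form:

```python
import numpy as np

def joint_match_kernel(C_P, F_P, w_P, C_Q, F_Q, w_Q, g_c=3.0, g_f=1.0):
    # K(P, Q) = sum_{A in P} sum_{A' in Q} W_A W_A' k_C(C_A, C_A') k_F(F_A, F_A')
    # C_*: patch centers in [0,1]^2; F_*: patch-level kernel descriptors;
    # w_*: per-patch weights (e.g. average gradient magnitude).
    dc = ((C_P[:, None, :] - C_Q[None, :, :]) ** 2).sum(-1)   # center distances
    df = ((F_P[:, None, :] - F_Q[None, :, :]) ** 2).sum(-1)   # feature distances
    K = np.exp(-g_c * dc) * np.exp(-g_f * df)
    return float(w_P @ K @ w_Q)
```

Note the structure is identical to the pixel-level match kernel; only the elements
(patches instead of pixels) and their attributes change, which is what allows the
recursion to further layers.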
which can be written as a single Gaussian kernel. This procedure is optimal in the
sense of minimizing the least-squares approximation error. However, it is intractable
to compute the eigenvectors of a 125,000 × 125,000 matrix on a modern personal
computer. Here we propose a fast algorithm for finding the eigenvectors of the
Kronecker product of kernel matrices. Since kernel matrices are symmetric positive
definite, the top r eigenvectors of $K_F \otimes K_C$ can be chosen from the Kronecker
product of the eigenvectors of $K_F$ and those of $K_C$, which significantly reduces
computational cost. The second-layer kernel descriptors then take an analogous form.
Recursively applying kernel descriptors in a similar manner, we can get kernel
descriptors of more layers, which represent features at different levels.
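The shortcut can be checked numerically on small matrices (the sizes here are
arbitrary): the spectrum of $K_F \otimes K_C$ is exactly the set of pairwise products
of the factors' eigenvalues, with eigenvectors given by Kronecker products of the
factors' eigenvectors.

```python
import numpy as np

def spd(n, seed):
    # Random symmetric positive definite (kernel-like) matrix.
    A = np.random.default_rng(seed).standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

def kron_eigvals(KF, KC):
    # Eigendecompose the small factors instead of the large product:
    # the eigenvalues of KF (x) KC are all pairwise products wF[i] * wC[j].
    wF = np.linalg.eigvalsh(KF)
    wC = np.linalg.eigvalsh(KC)
    return np.sort(np.outer(wF, wC).ravel())
```

Comparing against a direct eigendecomposition of `np.kron(KF, KC)` confirms the
identity, while only ever factorizing the small matrices.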
c) Everyday Object Recognition using RGB-D
We recorded with the camera mounted at three different heights relative to the
turntable, giving viewing angles of approximately 30, 45 and 60 degrees with the
horizon. One revolution of each object was recorded at each height. Each video
sequence is recorded at 20 Hz and contains around 250 frames, giving a total of
250,000 RGB + depth frames. A combination of visual and depth cues
(mixture-of-Gaussian fitting on RGB, RANSAC plane fitting on depth) produces a
segmentation for each frame separating the object of interest from the background. The
objects are organized into a hierarchy taken from WordNet hypernym/hyponym relations,
which is a subset of the categories in ImageNet. Each of the 300 objects in the
dataset belongs to one of 51 categories.
Our hierarchical kernel descriptors, being a generic approach based on kernels, have
no trouble generalizing from color images to depth images. Treating a depth image as a
grayscale image, i.e. using depth values as intensity, gradient and shape kernel
descriptors can be directly extracted, and they capture edge and shape information in
the depth channel. However, color kernel descriptors extracted over the raw depth
image do not have any significant meaning. Instead, we make the observation that the
distance d of an object from the camera is inversely proportional to the square root
of its area s in RGB images; for a given object, the product d√s is therefore
approximately constant. Since we have the segmentation of objects, we can represent s
using the number of pixels belonging to the object mask. Finally, we multiply depth
values by √s before extracting color kernel descriptors over this normalized depth
image. This yields a feature that is sensitive to the physical size of the object.
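A minimal sketch of this normalization, assuming the multiplier is √s so that the
normalized value tracks d·√s, which by the relation above is roughly proportional to
physical object size; the function name and array conventions are hypothetical.

```python
import numpy as np

def size_normalized_depth(depth, mask):
    # depth: per-pixel depth map; mask: boolean object segmentation.
    # Apparent area s scales as (physical_size / d)^2, so d * sqrt(s)
    # is roughly proportional to the physical size of the object.
    s = float(mask.sum())              # object area in pixels
    out = depth.astype(float).copy()
    out[mask] *= np.sqrt(s)            # normalized depth over the object
    return out
```

Feature extraction over this normalized depth image then responds to physical size
rather than to apparent (distance-dependent) size.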
In the experiments section, we will compare in detail the performance of our
hierarchical kernel descriptors on RGB-D object recognition to that in [15]. Our
approach consistently outperforms the state of the art in [15]. In particular, our
hierarchical kernel descriptors on the depth image perform much better than the
combination of depth features (including spin images) used in [15], increasing the
depth-only object category recognition from 53.1% (linear SVMs) and 64.7% (nonlinear
SVMs) to 75.7% (hierarchical kernel descriptors and linear SVMs). Moreover, our depth
features served as the backbone of the object-aware situated interactive system that
was successfully demonstrated at the Consumer Electronics Show 2011 despite adverse
lighting conditions.
d) Experiments
In this section, we evaluate hierarchical kernel descriptors on CIFAR10 and the RGB-D
Object Dataset. We also provide extensive comparisons with current state-of-the-art
algorithms in terms of accuracy.

Features        KDES [1]    HKDES (this work)
Color           53.9        63.4
Shape           68.2        69.4
Gradient        66.3        71.2
Combination     76.0        80.0

Table 1: Comparison of kernel descriptors (KDES) and hierarchical kernel descriptors
(HKDES) on CIFAR10.
In all experiments we use the same parameter settings as the original kernel
descriptors for the first layer of hierarchical kernel descriptors. For SIFT as well
as gradient and shape kernel descriptors, all images are transformed into grayscale
([0, 1]). Image intensity and RGB values are normalized to [0, 1]. Like HOG [5], we
compute gradients using the mask [−1, 0, 1] for gradient kernel descriptors. We also
evaluate the performance of the combination of the three hierarchical kernel
descriptors by concatenating the image-level feature vectors. Our experiments suggest
that this combination always improves accuracy.
i) CIFAR10
CIFAR10 is a subset of the 80 million tiny images dataset [26, 14]. The images are
downsampled to 32 × 32 pixels. The training set contains 5,000 images per category,
while the test set contains 1,000 images per category.
Due to the tiny image size, we use two-layer hierarchical kernel descriptors to obtain
image-level features. We keep the first layer the same as kernel descriptors. Kernel
descriptors are extracted over 8 × 8 image patches on a dense regular grid with a
spacing of 2 pixels. We split the whole training set into a 10,000/40,000
training/validation split, and optimize the kernel parameters of the second-layer
kernel descriptors on the validation set using grid search. Finally, we train linear
SVMs on the full training set using the optimized kernel parameter setting. Our
hierarchical model can handle large numbers of basis vectors. We tried both 1,000 and
5,000 basis vectors for the patch-level Gaussian kernel $k_F$, and found that a larger
number of visual words is slightly better (0.5% to 1% improvement depending on the
type of kernel descriptor). In the second layer, we use 1,000 basis vectors, enforce
KPCA to keep 97% of the energy for all kernel descriptors, and produce roughly
6,000-dimensional image-level features. Note that the second layer of hierarchical
kernel descriptors yields image-level features, and should be compared to image-level
features formed by EMK, rather than to kernel descriptors over image patches. The
dimensionality of EMK features [1] is 14,000, higher than that of hierarchical kernel
descriptors.
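The "97% of the energy" rule can be read as: keep the smallest number of KPCA
components whose eigenvalues sum to at least 97% of the total. A sketch (the function
is illustrative, not the authors' code):

```python
import numpy as np

def components_for_energy(K, energy=0.97):
    # K: symmetric PSD kernel matrix over the basis vectors.
    w = np.linalg.eigvalsh(K)[::-1]        # eigenvalues, descending
    w = np.clip(w, 0.0, None)              # guard tiny negative round-off
    frac = np.cumsum(w) / np.sum(w)        # cumulative energy fraction
    return int(np.searchsorted(frac, energy) + 1)
```

The returned count is then the dimensionality kept per descriptor before the
image-level features are concatenated.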
We compare kernel descriptors and hierarchical kernel descriptors in Table 1. As we
see, hierarchical kernel descriptors consistently outperform kernel descriptors. The
shape hierarchical kernel descriptor is slightly better than the shape kernel
descriptor. The other two hierarchical kernel descriptors are much better than their
counterparts: the gradient hierarchical kernel descriptor is about 5 percent higher
than the gradient kernel descriptor, and the color hierarchical kernel descriptor is
10 percent better than the color kernel descriptor. Finally, the combination of all
three hierarchical kernel descriptors outperforms the combination of all three kernel
descriptors by 4 percent.

Method                              Accuracy
Logistic regression                 36.0
Support Vector Machines             39.5
GIST                                54.7
SIFT                                65.6
fine-tuning GRBM                    64.8
GRBM two layers                     56.6
mcRBM                               68.3
mcRBM-DBN                           71.0
Tiled CNNs                          73.1
improved LCC                        74.5
KDES + EMK + linear SVMs            76.0
Convolutional RBM                   78.9
K-means (Triangle, 4k features)     79.6
HKDES + linear SVMs (this work)     80.0

Table 2: Comparison with state-of-the-art methods on CIFAR10.

We were not able to run nonlinear SVMs with
Laplacian kernels at the scale of this dataset in reasonable time, given the high
dimensionality of the image-level features. Instead, we make comparisons on a subset
of 5,000 training images; our experiments suggest that nonlinear SVMs have performance
similar to linear SVMs when hierarchical kernel descriptors are used.
We compare hierarchical kernel descriptors with the current state-of-the-art feature
learning algorithms in Table 2. Deep belief nets and sparse coding have been
extensively evaluated on this dataset [25, 31]. mcRBM can model pixel intensities and
pairwise dependencies between them jointly. A factorized third-order restricted
Boltzmann machine, followed by deep belief nets, has an accuracy of 71.0%. Tiled CNNs
have the best accuracy among deep networks. The improved LCC extends the original
local coordinate coding by including local tangent directions and is able to integrate
geometric information. As we have seen, sophisticated feature extraction can
significantly boost accuracy and is much better than using raw pixel features. SIFT
features have an accuracy of 65.6% and work reasonably well even on tiny images. The
combination of three hierarchical kernel descriptors has an accuracy of 80.0%, higher
than all other competing techniques; its accuracy is 14.4 percent higher than SIFT,
9.0 percent higher than mcRBM combined with DBNs, and 5.5 percent higher than the
improved LCC. Hierarchical kernel descriptors slightly outperform the very recent
work: the convolutional RBM and triangle K-means with 4,000 centers.
ii) RGB-D Object Dataset
determining the category name of an object (e.g. coffee mug). One category usually
contains many different object instances.
To test the generalization ability of our approaches, for category recognition we
train models on a set of objects and at test time present to the system objects that
were not in the training set [15]. At each trial, we randomly leave one object out
from each category for testing and train classifiers on the remaining 300 − 51 = 249
objects. For instance recognition we also follow the experimental setting suggested by
[15]: train models on the video sequences of each object where the viewing angles are
30° and 60° with the horizon, and test them on the 45° video sequences.
For category recognition, the average accuracy over 10 random train/test splits is
reported in the second column of Table 2. For instance recognition, the accuracy on
the test set is reported in the third column of Table 2. As we expect, the combination
of hierarchical kernel descriptors is much better than any single descriptor. The
underlying reason is that each depth descriptor captures different information, and
the weights learned by linear SVMs using supervised information can automatically
balance the importance of each descriptor across objects.
Method                           Category      Instance
Color HKDES (RGB)                60.1 ± 2.1    58.4
Shape HKDES (RGB)                72.6 ± 1.9    74.6
Gradient HKDES (RGB)             70.1 ± 2.9    75.9
Combination of HKDES (RGB)       76.1 ± 2.2    79.3
Color HKDES (depth)              61.8 ± 2.4    28.8
Shape HKDES (depth)              65.8 ± 1.8    36.7
Gradient HKDES (depth)           70.8 ± 2.7    39.3
Combination of HKDES (depth)     75.7 ± 2.6    46.8
Combination of all HKDES         84.1 ± 2.2    82.4

Table 2: Comparisons on the RGB-D Object Dataset. RGB denotes features over RGB images
and depth denotes features over depth images.
Approaches                  Category      Instance
Linear SVMs [15]            81.9 ± 2.8    73.9
Nonlinear SVMs [15]         83.8 ± 3.5    74.8
Random Forest [15]          79.6 ± 4.0    73.1
Combination of all HKDES    84.1 ± 2.2    82.4

Table 3: Comparisons to existing recognition approaches using a combination of depth
features and image features. Nonlinear SVMs use a Gaussian kernel.
In Table 3, we compare hierarchical kernel descriptors with the rich feature set used
in [15], where SIFT, color and texton features were extracted from RGB images, and 3-D
bounding boxes and spin images over depth images. Hierarchical kernel descriptors are
slightly better than this rich feature set for category recognition, and much better
for instance recognition.
It is worth noting that, using depth alone, we improve the category recognition
accuracy in [15] from 53.1% (linear SVMs) to 75.7% (hierarchical kernel descriptors
and linear SVMs). This shows the power of our hierarchical kernel descriptor
formulation when applied to a non-conventional domain. The depth-alone results are
meaningful for many scenarios where color images are not used for privacy or
robustness reasons.
As a comparison, we also extracted SIFT features on both RGB and depth images and
trained linear SVMs over image-level features formed by spatial pyramid EMK. The
resulting classifier has an accuracy of 71.9% for category recognition, much lower
than the result of the combination of hierarchical kernel descriptors (84.1%). This is
not surprising, since SIFT fails to capture shape and object size information.
Nevertheless, hierarchical kernel descriptors provide a unified way to generate rich
feature sets over both RGB and depth images, giving significantly better accuracy.
4) OBJECT RECOGNITION METHODS BASED ON TRANSFORMATION
Recognition of general three-dimensional objects from 2D images and videos is a
challenging task. The common formulation of the problem is essentially: given some
knowledge of how certain objects may appear, plus an image of a scene possibly
containing those objects, find which objects are present in the scene and where.
Recognition is accomplished by matching features of an image and a model of an object.
The two most important issues that a method must address are the definition of a
feature and how the matching is found.
What is the goal in designing an object recognition system? Achieving generality, i.e.
the ability to recognise any object without hand-crafted adaptation to a specific
task; robustness, i.e. the ability to recognise the objects in arbitrary conditions;
and easy learning, i.e. avoiding special or demanding procedures to obtain the
database of models. Obviously these requirements are impossible to achieve in full
generality, as it is for example impossible to recognise objects in images taken in
complete darkness. The challenge is then to develop a method with minimal constraints.
Object recognition methods can be classified according to a number of characteristics.
We focus on model acquisition (learning) and invariance to image formation conditions.
Historically, two main trends can be identified. In so-called geometry- or model-based
object recognition, the knowledge of an object's appearance is provided by the user as
an explicit CAD-like model. Typically, such a model describes only the 3D shape,
omitting other properties such as colour and texture. At the other end of the spectrum
are the appearance-based methods, where no explicit user-provided model is required.
The object representations are usually (but not necessarily) acquired through an
automatic learning phase, and the model typically relies on surface reflectance
(albedo) properties. Recently, methods which put local image patches into
correspondence have emerged. Models are learned automatically, and objects are
represented by the appearance of small local elements. The global arrangement of the
representation is constrained by weak or strong geometric models.
The rest of this chapter is structured as follows. First, an overview of the classes
of object recognition methods is given. A survey of methods based on matching of local
features is then presented, followed by a description of some of their successful
applications.
5) CLASSES OF OBJECT RECOGNITION METHODS
i) Appearance-Based Methods
The central idea behind appearance-based methods is the following. Having seen all
possible appearances of an object, can recognition be achieved by just efficiently
remembering all of them? Could recognition thus be implemented as an efficient visual
(pictorial) memory? The answer obviously depends on what is meant by all appearances.
The approach has been successfully demonstrated for scenes with unoccluded objects on
a black background [34]. But remembering all possible object
The family of appearance-based object recognition methods includes global histogram
matching methods. In [66, 67], Swain and Ballard proposed to represent an object by a
colour histogram. Objects are identified by matching histograms of image regions to
histograms of a model image. While the technique is robust to object orientation,
scaling, and occlusion, it is very sensitive to lighting conditions, and it is not
suitable for recognition of objects that cannot be identified by colour alone. The
approach was later modified by Healey and Slater [14] and Funt and Finlayson [12] to
exploit illumination invariants. Recently, the concept of histogram matching was
generalised by Schiele [52, 51, 50], where, instead of pixel colours, responses of
various filters are used to form the histograms (then called receptive field
histograms).
To summarise, appearance-based approaches are attractive since they do not require
image features or geometric primitives to be detected and matched. But their
limitations, i.e. the necessity of dense sampling of training views and the low
robustness to occlusion and cluttered background, make them suitable mainly for
applications with limited or controlled variations in the image formation conditions,
e.g. industrial inspection.
iii) Geometry-Based Methods
In geometry- (or shape-, or model-) based methods, the information about the objects
is represented explicitly. Recognition can then be interpreted as deciding whether
(a part of) a given image can be a projection of the known (usually 3D) model [41] of
an object.
Generally, two representations are needed: one to represent the object model, and
another to represent the image content. To facilitate finding a match between model
and image, the two representations should be closely related. In the ideal case there
will be a simple relation between primitives used to describe the model and those used
to describe the image. If the object were, for example, described by a wireframe
model, the image might best be described in terms of linear intensity edges: each edge
can then be matched directly to one of the model wires. However, the model and image
representations often have distinctly different meanings. The model may describe the
3D shape of an object, while the image edges correspond only to visible manifestations
of that shape, mixed together with false edges (discontinuities in surface albedo) and
illumination effects (shadows).
To achieve pose and illumination invariance, it is preferable to employ model
primitives that are at least somewhat invariant with respect to changes in these
conditions. Considerable effort has been directed at identifying primitives that are
invariant with respect to viewpoint change.
The main disadvantages of geometry-based methods are: the dependency on reliable
extraction of geometric primitives (lines, circles, etc.), the ambiguity in
interpretation of the detected primitives (presence of primitives that are not
modelled), and the restricted
correspondences. Since it is not required that all local features match, the
approaches are robust to occlusion and cluttered background.
To recognise objects from different views, it is necessary to handle all variations in
object appearance. The variations may be complex in general, but at the scale of the
local features they can be modelled by simple, e.g. affine, transformations. Thus, by
allowing simple transformations at the local scale, significant viewpoint invariance
is achieved even for objects with complicated shapes. As a result, it is possible to
obtain models of objects from only a few views, taken e.g. 90 degrees apart.
The main advantages of the approaches based on matching local features are summarised
below.
[1] Learning, i.e. the construction of internal models of known objects, is done
automatically from images depicting the objects. No user intervention is required
except for providing the training images.
[2] The local representation is based on appearance. There is no need to extract
geometric primitives (e.g. lines), which are generally hard to detect reliably.
[3] Segmentation of objects from background is not required prior to recognition, and
yet objects are recognised against an unknown background.
[4] Objects of interest are recognised even if partially occluded by other unknown
objects in the scene.
to these small misalignments. Such a descriptor might be based e.g. on colour moments
(integral statistics over the whole region), or on local histograms.
It follows that the major factors that affect the discriminative potential of a
method, and thus its ability to handle large object databases, are the repeatability
and the localisation precision of the detector.
Indexing. During learning of object models, descriptors of local appearance are stored
in a database. In the recognition phase, descriptors are computed on the query image,
and the database is searched for similar descriptors (potential matches). The database
should be organised (indexed) in a way that allows efficient retrieval of similar
descriptors. The character of a suitable indexing structure depends generally on the
properties of the descriptors (e.g. their dimensionality) and on the distance measure
used to determine which descriptors are similar (e.g. Euclidean distance). Generally,
for optimal performance of the index (fast retrieval times), a combination of
descriptor and distance measure should be sought that minimises the ratio of distances
to correct and to false matches.
The choice of indexing scheme has a major effect on the speed of the recognition
process, especially on how the speed scales to large object databases. Commonly,
though, the database searches are done simply by sequential scan, i.e. without using
any indexing structure.
Matching. When recognising objects in an unknown query image, local features are
computed in the same form as for the database images. None, one, or possibly more
tentative correspondences are then established for every feature detected in the query
image. Searching the database, the Euclidean or Mahalanobis distance is typically
evaluated between the query feature and the features stored in the database. The
closest match, if close enough, is retrieved. These tentative correspondences are
based purely on the similarity of the descriptors. A database object which exhibits a
high (non-random) number of established correspondences is considered a candidate
match.
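The retrieval step can be sketched with a brute-force Euclidean scan (the distance
threshold is an illustrative parameter; practical systems would use an indexing
structure and often a distance-ratio test instead of an absolute threshold):

```python
import numpy as np

def tentative_matches(query_desc, db_desc, max_dist=0.5):
    # For each query descriptor, retrieve the closest database descriptor
    # (Euclidean distance); keep the match only if it is close enough.
    matches = []
    for i, q in enumerate(query_desc):
        d = np.linalg.norm(db_desc - q, axis=1)   # distances to all entries
        j = int(np.argmin(d))
        if d[j] <= max_dist:
            matches.append((i, j, float(d[j])))   # (query idx, db idx, dist)
    return matches
```

Counting how many matches point at each database object then yields the candidate
objects described in the text.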
Verification. The similarity of descriptors, on its own, is not a measure reliable
enough to guarantee that an established correspondence is correct. As a final step of
the recognition process, the presence of the model in the query image is verified. A
global transformation connecting the images is estimated in a robust way (e.g. using
the RANSAC algorithm). Typically, the global transformation has the form of an
epipolar geometry constraint for general (but rigid) 3D objects, or of a homography
for planar objects. More complex transformations can be derived for non-rigid or
articulated (piecewise rigid) objects.
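A compact sketch of such robust verification, using a 2D affine model as the global
transformation (the tolerance, iteration count and the affine choice are illustrative;
epipolar or homography models follow the same sample-fit-count loop):

```python
import numpy as np

def fit_affine(src, dst):
    # Least-squares 2D affine transform mapping src points to dst points.
    A = np.hstack([src, np.ones((len(src), 1))])
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)
    return M                                     # 3x2 affine matrix

def ransac_verify(src, dst, iters=200, tol=1e-2, seed=0):
    # Repeatedly fit the model to minimal random samples; correspondences
    # consistent with the best model are inliers, the rest are rejected.
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    ones = np.ones((len(src), 1))
    for _ in range(iters):
        idx = rng.choice(len(src), 3, replace=False)   # minimal sample
        M = fit_affine(src[idx], dst[idx])
        residual = np.linalg.norm(np.hstack([src, ones]) @ M - dst, axis=1)
        inliers = residual < tol
        if inliers.sum() > best.sum():
            best = inliers
    return best
```

Only the surviving inlier correspondences contribute to the final match score.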
As mentioned before, if a detector cannot recover certain parameters of the image
transformations, the descriptor must be made invariant to them. It is preferable,
though, to have a covariant detector rather than an invariant descriptor, as that
allows for more powerful global consistency verification. If, for example, the
detector does not provide the orientations of the image elements, rotational
invariants have to be
employed in the descriptor. In such a case, it is impossible to verify that all of the
matched elements agree in their orientation.
Finally, tentative correspondences which are not consistent with the estimated global
transformation are rejected, and only the remaining correspondences are used to
estimate the final score of the match.
In the following, the main contributions to the field of object recognition based on
local correspondences are reviewed. The approaches follow the aforementioned
structure, but differ in the individual steps: in how the local features are obtained
(detectors) and in what the features themselves are (descriptors).
6) RECOGNITION AS A CORRESPONDENCE OF LOCAL FEATURES - A SURVEY
a) The Approach of David Lowe
David Lowe has developed an object recognition system with emphasis on efficiency,
achieving real-time recognition. Anchor points of interest are detected with
invariance to scale, rotation and translation. Since local patches undergo more
complicated transformations than similarities, a local-histogram-based descriptor is
proposed, which is robust to imprecision in the alignment of the patches.
Detector. The detection of regions of interest proceeds as follows:
[1] Detection of scale-space peaks. Circular regions with maximal response of the
difference-of-Gaussians (DoG) filter are detected at all scales and image locations.
An efficient implementation exploits the scale-space pyramid: the initial image is
repeatedly convolved with a Gaussian filter to produce a set of scale-space images.
Adjacent scale-space images are then subtracted to produce a set of DoG images. In
these images, local minima and maxima (i.e. extrema of the DoG filter response) are
detected, both in the spatial and scale domains. The result of the first phase is thus
a set of triplets (x, y, σ): image locations and characteristic scales.
[2] Refinement of locations. The DoG responses are locally fitted with a 3D quadratic
function, and the location and characteristic scale of the circular regions are
determined with subpixel accuracy. The refinement is necessary,
Verification. The Hough transform is used to identify clusters of tentative
correspondences with a consistent geometric transformation. Since the actual
transformation is approximated by a similarity, the Hough accumulator is 4-dimensional
and is partitioned into rather broad bins. Only clusters with at least 3 entries in a
bin are considered further. Each such cluster is then subjected to a geometric
verification procedure in which iterative least-squares fitting is used to find the
best affine projection relating the query and database images.
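The accumulator step can be sketched with a dictionary of coarse pose bins (the bin
widths and the pose encoding are illustrative assumptions; Lowe's implementation also
votes into neighbouring bins to soften bin-boundary effects, which is omitted here):

```python
import math
from collections import defaultdict

def hough_clusters(correspondences, min_votes=3):
    # Each tentative correspondence votes for one coarse 4D pose bin
    # (x, y, log-scale, orientation); bins with >= min_votes survive.
    acc = defaultdict(list)
    for corr in correspondences:
        x, y, s, theta = corr["pose"]
        key = (int(x // 0.25), int(y // 0.25),    # broad location bins
               int(round(math.log2(s))),          # octave-wide scale bins
               int(theta // 30.0))                # 30-degree orientation bins
        acc[key].append(corr)
    return [votes for votes in acc.values() if len(votes) >= min_votes]
```

Each surviving cluster is then passed to the least-squares affine fit for geometric
verification.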
b) The Approach of Mikolajczyk & Schmid
The approach of Schmid et al. is described in [44, 28, 56, 54, 53, 55, 27, 10]. Based
on an affine generalisation of the Harris corner detector, anchor points are detected
and described by Gaussian derivatives of the image intensities in shape-adapted
elliptical neighbourhoods.
Detector. In their work, Mikolajczyk and Schmid imple-ment affine-adapted Harris point
detector. Since the three-parametric affine Gaussian scale space is too complex to be
practically useful, they propose a solution which itera-tively search for affine shape
adaptation in neighbourhoods of points detected in uniform scale space. For initialisa-tion,
approximate locations and scales of interest points are extracted by standard multi-scale
Harris detector. These points are not affine invariant because of the uniform Gaus-sian kernel
used. Given the initial approximate solution, their algorithm iteratively modifies the shape,
the scale and the spatial location of the neighbourhood of each point, and converges to
affine-invariant interest points. For more details see [28].
Descriptors and Matching. The descriptors are composed of Gaussian derivatives
computed over the shape-normalised regions. Invariance to rotation is obtained by steering
the derivatives in the direction of the gradient. Using derivatives up to 4th order, the
descriptors are 12-dimensional. The similarity of descriptors is, to a first approximation,
measured by the Mahalanobis distance. Promising close matches are then confirmed or
rejected by a cross-correlation measure computed over normalised neighbourhood windows.
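The Mahalanobis comparison amounts to decorrelating and rescaling the descriptor components before measuring distance; a minimal sketch, assuming the component covariance has been estimated over a training set:

```python
import numpy as np

def mahalanobis(d1, d2, cov):
    """Mahalanobis distance between two descriptor vectors,
    given the covariance matrix of the descriptor components.
    With an identity covariance this reduces to the Euclidean
    distance."""
    diff = np.asarray(d1, float) - np.asarray(d2, float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
```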
Verification. Once the point-to-point correspondences are obtained, a robust estimate of the
geometric transformation between the two images is computed using the RANSAC algorithm.
The transformation used is either a homography or a fundamental matrix.
Recently, Dorko and Schmid [10] extended the approach towards object categorisation. Local
image patches are detected and described by the same approach as above. Patches
from several examples of objects from a given category (e.g. cars) are collected together, and
a classifier is trained to distinguish them from patches of other categories and from
background patches.
c) The Approach of Tuytelaars, Ferrari & van Gool
Luc van Gool and his collaborators developed an approach based on matching of local image
features [73, 75, 11, 72, 71, 74, 69]. They start with the detection of elliptical or parallelogram
image regions. The regions are described by a vector of photometrically invariant generalised
colour moments, and matching is typically verified by the epipolar geometry constraint.
Detector. Two methods for the extraction of affinely invariant regions are proposed, yielding
geometry- and intensity-based regions. The regions are affine covariant: they adapt their
shape to the underlying intensity profile in order to keep representing the same physical
part of an object. Apart from the geometric invariance, photometric invariance allows for
independent scalings and offsets for each of the three colour channels. The region extraction
always starts by detecting stable anchor points. The anchor points are either Harris points
[13] or local extrema of image intensity. Although the detection of Harris points is not truly
affine invariant, since the support set over which the response is computed is circular, the points
are still fairly stable under viewpoint changes and can be precisely localised (even to
sub-pixel accuracy). Intensity extrema, on the other hand, are invariant to any continuous
geometric transformation and to any monotonic transformation of the intensity, but are not
localised as accurately. On colour images, the detection is performed three times, separately
on each of the colour bands.
Descriptors and Matching. In the case of the geometry-based regions, each region is
described by a vector of 18 generalised colour moments [29], invariant to photometric
transformations. For the intensity-based regions, 9 rotation-invariant generalised colour
moments are used. The similarity between the descriptors is given by the Mahalanobis
distance; correspondences between two images are formed from regions whose distance is
mutually smallest. Once corresponding regions have been found, the cross-correlation
between them is computed as a final check before accepting the match. In the case of the
intensity-based regions, where the rotation is unknown, the cross-correlation is maximised
over all rotations. Good matches are further fine-tuned by non-linear optimisation: the
cross-correlation is maximised over small deviations of the transformation parameters.
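The cross-correlation check can be sketched as a normalised cross-correlation between two equally sized patches; a score of 1.0 means the patches are identical up to an affine change of intensity (gain and offset), which is exactly the photometric invariance wanted here.

```python
import numpy as np

def ncc(a, b):
    """Normalised cross-correlation between two equally sized
    image patches: centre each patch, then take the cosine of
    the angle between them. Invariant to intensity gain and
    offset."""
    a = np.asarray(a, float).ravel()
    b = np.asarray(b, float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

For the intensity-based regions, this score would be evaluated over a set of candidate rotations of one patch and the maximum kept.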
Verification. The set of tentative correspondences is pruned by both geometric and
photometric constraints. The geometric constraint rejects correspondences contradicting
the epipolar geometry. The photometric constraint assumes that there is always a group
of corresponding regions that undergo the same transformation of intensities;
correspondences with a singular photometric transformation are rejected. Recently, a
growing flexible homography approach was presented, which allows for accurate model
alignment even for non-rigid objects. The size of the aligned area is then used as a measure of
the match quality.
d) The LAF Approach of Matas
The approach of Matas et al. [25, 37, 26, 36] starts with the detection of Maximally Stable
Extremal Regions. Affine-covariant local coordinate systems (called Local Affine Frames,
LAFs) are then established, and measurements taken relative to them describe the regions.
Figure 3: Examples of correspondences established between frames of a database image (left)
and a query image (right).
Detector. Maximally Stable Extremal Regions (MSERs) were introduced in [25].
The attractive properties of MSERs are:
1. invariance to affine transformations of image coordinates,
2. invariance to monotonic transformations of intensity,
3. computational complexity almost linear in the number of pixels, and consequently near real-time run time, and
4. since no smoothing is involved, both very fine and very coarse image structures are detected.
Starting from the contours of a detected region, local frames (coordinate systems) are
constructed in several affine-covariant ways. Affine-covariant properties of the covariance
matrix, bi-tangent lines, and line parallelism are exploited. As demonstrated in Figure 3, local
affine frames facilitate normalisation of image patches into a canonical frame and enable
direct comparison of photometrically normalised intensity values, eliminating the need for
invariants.
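The covariance-matrix route to an affine-covariant frame can be sketched as a whitening step: translate the region's pixel coordinates to its centroid and rescale them by the inverse Cholesky factor of the covariance. This illustrates the principle only; the contour-based and bi-tangent frame constructions of [25, 37] are not reproduced here.

```python
import numpy as np

def shape_normalise(points):
    """Covariance-based shape normalisation of a region. Input:
    (N, 2) pixel coordinates of the region. The whitened output
    has zero mean and identity covariance, so two affine-related
    regions map to frames that differ only by a rotation (and
    possibly a reflection)."""
    pts = np.asarray(points, float)
    mu = pts.mean(axis=0)
    cov = np.cov((pts - mu).T, bias=True)
    L = np.linalg.cholesky(cov)          # cov = L @ L.T
    return (pts - mu) @ np.linalg.inv(L).T
```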
Descriptor. Three different descriptors were used. The first is directly the intensities of the
local patches. The intensities are discretised into 15 × 15 × 3 rasters, yielding 675-
dimensional descriptors. This size is discriminative enough to distinguish between a large
number of database objects, yet coarse enough to be tolerant to moderate misalignments in the
frame localisation. The second type of descriptor employs the discrete cosine transform,
which is applied to the discretised patches [38]. The number of low-frequency DCT
coefficients kept in the database is used to trade the descriptor's discriminative power
against its localisation tolerance. Finally, rotational invariants were used.
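The DCT descriptor idea can be sketched as follows: transform the normalised patch and keep only the top-left (low-frequency) block of coefficients. The patch size and the number of kept coefficients below are illustrative assumptions.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)
    M = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    M[0] *= np.sqrt(1.0 / n)
    M[1:] *= np.sqrt(2.0 / n)
    return M

def dct_descriptor(patch, keep=4):
    """Describe a grey-level patch by its low-frequency 2-D DCT
    coefficients (the top-left keep x keep block). Fewer
    coefficients -> more tolerance to misalignment; more
    coefficients -> more discriminative power."""
    patch = np.asarray(patch, float)
    M = dct_matrix(patch.shape[0])
    N = dct_matrix(patch.shape[1])
    coeffs = M @ patch @ N.T
    return coeffs[:keep, :keep].ravel()
```

A constant patch, for instance, is fully captured by the single DC coefficient.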
Verification. In wide-baseline stereo problems, the correspondences are verified by
robustly selecting only those conforming to the epipolar geometry constraint. For object
recognition it is typically sufficient to approximate the global geometric transformation by a
homography, with a flexible tolerance increasing towards the object boundaries.
e) The Approach of Zisserman
A. Zisserman and his collaborators developed strategies for matching local features, mainly
in the context of the wide-baseline stereo problem [43, 42, 48, 45, 46]. Recently they
presented interesting work relating the image retrieval problem to text retrieval [63, 47, 49].
They introduced an image retrieval system, called Video Google, which is capable of
processing and indexing full-length movies.
Detectors and Descriptors. Two types of detectors of local image elements are employed.
One is the shape-adapted elliptical regions by Mikolajczyk and Schmid, as described in
entation of the objects, and to occlusion. Ohba and Ikeuchi, and Jugessur and Dudek,
propose appearance-based object recognition methods robust to variations in the background
and to occlusion of a substantial fraction of the image.
In order to apply eigenspace analysis to the recognition of partially occluded objects, they
propose to divide the object appearance into small windows, referred to as eigen windows,
and to apply eigenspace analysis to them. As in other approaches exploiting local
appearance, even if some of the windows are occluded, the remaining ones are still effective
and can recover the object identity and pose.
In addition to robustness to occlusions, Jugessur and Dudek [16] also address the problem of
rotation invariance. The proposed solution is to compute the PCA not on the intensity
patches, but rather in the frequency domain of the windows represented in polar coordinates.
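The eigenspace analysis of windows can be sketched as a PCA over flattened patches: subtract the mean window and take the leading right singular vectors as the eigen-window basis. This is a generic PCA sketch of the idea, not the pipeline of either cited system; window size and component count are arbitrary here.

```python
import numpy as np

def eigen_windows(patches, n_components=8):
    """PCA over local appearance windows. Stacks the flattened
    windows as rows, subtracts the mean window, and returns the
    mean plus the leading right singular vectors as the
    eigen-window basis."""
    X = np.asarray([np.ravel(p) for p in patches], float)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def project(patch, mean, basis):
    """Coordinates of a window in the eigen-window basis."""
    return basis @ (np.ravel(np.asarray(patch, float)) - mean)
```

Windows that actually lie in a low-dimensional appearance subspace are reconstructed exactly from their projections.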
g) The Approach of Selinger & Nelson
The object recognition system developed by Nelson and Selinger at the University of
Rochester exploits a four-level hierarchy of grouping processes [35, 59, 61, 58, 57, 60]. The
system architecture is similar to other local feature-based approaches, though a different
terminology is used. Inspired by the Gestalt laws and perceptual grouping principles, a
four-level grouping hierarchy is built, in which higher levels contain groups of elements from
lower levels.
The hierarchy is constructed as follows. At the fourth, highest level, a 3D object is represented
as a topologically structured set of flexible 2D views. The geometric relations between the
views are stored here. This level is used for geometric reasoning, but not for recognition.
Recognition takes place at the third level, the level of the component views. In these views
the visual appearance of an object, derived from a training image, is represented as a loosely
structured combination of a number of local context regions. Local context regions (local
features) are represented at the second level. These regions can be thought of as local image
patches surrounding first-level features. At the first level are features (detected image
elements) that are the result of grouping processes run on the image, typically representing
connected contour fragments or locally homogeneous regions. Only
Figure 4: Examples of corresponding query (left columns) and database (right columns)
images from the ZuBuD dataset. The image pairs exhibit occlusion, varying illumination,
and viewpoint and orientation changes.
Efficient recognition is achieved by using a database implemented as an associative memory
of keyed context patches. An unknown keyed context patch recalls associated hypotheses for
all known views of objects that could have produced such a context patch. These hypotheses
are processed by a second associative memory, indexed by the view parameters, which
partitions the hypotheses into clusters that are mutually consistent within a loose geometric
framework (these clusters are the third-level groups). The looseness is obtained by tolerating
a specified deviation in position, size, and orientation. The bounds are set to be consistent
with a given distance between training views (e.g. approximately 20 degrees). The output of
the recognition stage is a set of third-level groupings that represent hypotheses of the identity
and pose of objects in the scene, ranked by the total evidence for each hypothesis.
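The indexing scheme can be sketched as a toy associative memory: each stored patch key recalls (view, pose) hypotheses, and recalled hypotheses are pooled per view, with the vote count serving as the evidence. Keys and poses are simplified placeholders, not the representation used by the original system.

```python
from collections import defaultdict

class ContextPatchMemory:
    """Toy associative memory of keyed context patches."""

    def __init__(self):
        self._store = defaultdict(list)

    def add(self, patch_key, view_id, pose):
        # Store one hypothesis (object view, rough pose) per key.
        self._store[patch_key].append((view_id, pose))

    def recognise(self, patch_keys):
        # Recall hypotheses for each observed key and rank the
        # candidate views by accumulated evidence.
        votes = defaultdict(list)
        for key in patch_keys:
            for view_id, pose in self._store.get(key, []):
                votes[view_id].append(pose)
        return sorted(votes.items(), key=lambda kv: -len(kv[1]))
```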
h) Applications
Approaches matching local features have been experimentally shown to obtain state-of-the-art
results. Here we present a few examples of the addressed problems. Results are demonstrated
using the approach of Matas et al. [37, 36, 38], although comparable results have been shown
by others.
Figure 5: Image retrieval on the FOCUS dataset: query images, database images, and
visualised query localisations.
Figure 6: An example of matches established on a wide-baseline stereo pair.
Object Recognition. In object recognition experiments, the Columbia Object Image Library
(COIL-100) [1], or more often its subset COIL-20, has been widely used, and for comparison
purposes has become a de facto standard benchmark dataset. COIL-100 is a set of colour
images of 100 different objects, where 72 images of each object were taken at pose intervals
of 5 degrees. The objects are unoccluded and on an uncluttered black background. Such a
configuration is benign for appearance-based methods. Table 1 compares recognition rates
achieved by the LAF approach with the rates of several
appearance-based object recognition methods. Results are presented for five experimental
set-ups differing in the number of training views per object. Decreasing the number of
training views increases the demands on a method's generalisation ability and on its
insensitivity to image deformations. The LAF approach performs best in all experiments,
regardless of the number of training views. Even with only four training views, the
recognition rate is almost 95%, demonstrating remarkable robustness to local affine
distortions.
Image retrieval. The retrieval performance of the LAF method was evaluated on the FOCUS
dataset, containing 360 colour high-resolution images of advertisements scanned from
magazines. The task was to retrieve adverts for a given product, given a query image of the
product logo. Examples of query logos, retrieved images, and visualised localisations of the
logos are depicted in Figure 5.
Another challenging retrieval problem involved the recognition of buildings in urban scenes.
Given an image of an unknown building, taken from an unknown viewpoint, the algorithm
was to identify the building. The experiments were conducted on a set of images of 201
different buildings. The dataset was provided by ETH Zurich and is publicly available [62].
The database contains five photographs of each of the 201 buildings, and a separate set of
115 query images is provided. Examples of corresponding query and database images are
shown in Figure 4. The LAF method achieved a 100% recognition rate at rank 1.
Video retrieval. The problem of retrieving video frames from full-length movies was
addressed in [63]. Local descriptors were computed on key frames and stored in a database.
To reduce the otherwise enormous database size, descriptors were clustered according to their
similarity. Impressive real-time retrieval was achieved for a closed system, i.e. for the case of
query images originating from the movie itself.
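The clustering-plus-indexing scheme follows the text-retrieval analogy: quantise each descriptor to its nearest cluster centre (a "visual word") and keep an inverted file from words to frames. The sketch below is a deliberately minimal illustration of that idea, with unweighted vote counting rather than the tf-idf scoring used in text retrieval.

```python
import numpy as np
from collections import defaultdict

def quantise(desc, vocab):
    """Index of the nearest cluster centre ('visual word') in a
    pre-built vocabulary (k x d array)."""
    return int(np.argmin(np.linalg.norm(vocab - np.asarray(desc, float),
                                        axis=1)))

class InvertedIndex:
    """Minimal inverted-file index over visual words: each frame
    is a 'document' of words, and a query counts word matches
    per frame."""

    def __init__(self, vocab):
        self.vocab = np.asarray(vocab, float)
        self.postings = defaultdict(set)

    def add_frame(self, frame_id, descriptors):
        for d in descriptors:
            self.postings[quantise(d, self.vocab)].add(frame_id)

    def query(self, descriptors):
        scores = defaultdict(int)
        for d in descriptors:
            for frame_id in self.postings.get(quantise(d, self.vocab), ()):
                scores[frame_id] += 1
        return sorted(scores.items(), key=lambda kv: -kv[1])
```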
Wide-baseline stereo matching. For a significant variety of scenes, the epipolar geometry can
be computed automatically from two (or possibly more) uncalibrated images showing the
scene from significantly different viewpoints. The role of matching in the wide-baseline
stereo problem is to provide corresponding points, i.e. points which in the two images
represent the same element of the 3D scene. Correspondences found in a difficult stereo pair
are shown in Figure 6.
Vision is a difficult problem consisting of many building blocks that can be characterised in
isolation. Eye movements are one such building block.
2. Since visual sensitivity is highest in the fovea, eye movements are in general needed
for recognising small stimuli.
3. During a fixation, a number of things happen concurrently: the visual information around
the fixation is analysed, and visual information away from the current fixation is analysed to
help select the next saccade target.
The exact processes involved are still largely unknown. Findlay and Gilchrist [33] also
pose a number of questions, in order to demonstrate that numerous basic problems in vision
still remain open for research.
1. What visual information determines the target of the next eye movement?
2. What visual information determines when the eyes move?
3. What information is combined across eye movements to form a stable representation of the
environment?
As discussed earlier [29], a brute-force approach to object localization subject to a cost
constraint is often intractable, as the search space size increases. Furthermore, the human
brain would have to be some hundreds of thousands of times larger than it currently is if
visual sensitivity across the visual field were the same as that in the fovea [29]. Thus, active and
attentive approaches to the problem are usually proposed as a means of addressing these
constraints.
We will show in this section that, within the context of the general framework for object
recognition illustrated in Fig. 1, previous work on active object recognition systems
has demonstrated that active vision systems can lead to significant improvements in both the
learning and inference phases of object recognition. This includes improvements in the
robustness of all components of the feature-extraction, feature-grouping, object-hypothesis,
object-verification, object-recognition pipeline.
Some of the problems inherent in single-view object recognition include [266]:
1. The im