object recognition techniques (2)



ANALYSIS OF HIERARCHICAL OBJECT RECOGNITION TECHNIQUES

    For the Degree of

    Doctor of Philosophy

    In

    SUBJECT

    Submitted to

    SHRI VENKATESHWARA UNIVERSITY,

    Gajraula, Amroha (UTTAR PRADESH)

Research Supervisor: Dr. Name

Research Scholar:

    2014


    DECLARATION

    I hereby declare that this submission is my own work and that, to the best of my

    knowledge and belief, it contains no material previously published or written by

    another person nor material which to a substantial extent has been accepted for the

    award of any other degree or diploma of the university or other institute of higher

    learning, except where due acknowledgment has been made in the text.

    Signature of Research Scholar

    Name :

    Enrollment No.


    CERTIFICATE

Certified that Name of Student (Enrollment No.) has carried out the research work presented in this thesis entitled Title of Thesis for the award of Doctor of Philosophy from Shri Venkateshwara University, Gajraula under my/our (print only that which is applicable) supervision. The thesis embodies results of original work and studies carried out by the student himself/herself (print only that which is applicable), and the contents of the thesis do not form the basis for the award of any other degree to the candidate or to anybody else from this or any other University/Institution.

    Signature Signature

    (Name of Supervisor) (Name of Supervisor)

(Designation) (Designation)

    (Address) (Address)

    Date:


    SHRI VENKATESHWARA UNIVERSITY, GAJRAULA

CERTIFICATE OF THESIS SUBMISSION FOR EVALUATION

(To be submitted in duplicate)

1. Name: ..........

2. Enrollment No.:

3. Thesis title: ...........

4. Degree for which the thesis is submitted:

5. Department of the University to which the thesis is submitted: ..........

6. Faculty of the University to which the thesis is submitted: ..........

    7. Thesis Preparation Guide was referred to for preparing the thesis. Yes No

    8. Specifications regarding thesis format have been closely followed. Yes No

    9. The contents of the thesis have been organized based on the guidelines. Yes No

    10. The thesis has been prepared without resorting to plagiarism. Yes No

    11. All sources used have been cited appropriately. Yes No

    12. The thesis has not been submitted elsewhere for a degree. Yes No

    13. Submitted two copies of spiral bound thesis plus one CD. Yes No

    14. Submitted five copies of synopsis approved by RDC. Yes No

    15. Submitted two copies of spiral bound research summary. Yes No

Name: ...    Enrollment No.: ...


    SHRI VENKATESHWARA UNIVERSITY, GAJRAULA

    CERTIFICATE OF FINAL THESIS SUBMISSION

(To be submitted in duplicate)

1. Name: .........

    2. Enrollment No. :

    3. Thesis title:...

    ....

    4. Degree for which the thesis is submitted: ........

5. Department of the University to which the thesis is submitted :

    ...

    6. Faculty of the University to which the thesis is submitted :

    7. Thesis Preparation Guide was referred to for preparing the thesis. Yes No

    8. Specifications regarding thesis format have been closely followed. Yes No

    9. The contents of the thesis have been organized based on the guidelines. Yes No

10. The thesis has been prepared without resorting to plagiarism. Yes No

11. All sources used have been cited appropriately. Yes No

12. The thesis has not been submitted elsewhere for a degree. Yes No

13. All the corrections have been incorporated. Yes No

14. Submitted five hard bound copies of the thesis plus one CD. Yes No

15. Submitted five copies of research summary. Yes No

(Signature(s) of the Supervisor(s)) (Signature of the Candidate)


    Name(s): Name..

    Enrollment No

    1) ABSTRACT

Object recognition systems constitute a deeply entrenched and omnipresent

    component of modern intelligent systems. Research on object recognition algorithms

    has led to advances in factory and office automation through the creation of optical

    character recognition systems, assembly-line industrial inspection systems, as well as

    chip defect identification systems. It has also led to significant advances in medical

    imaging, defence and biometrics. In this paper we discuss the evolution of computer-

    based object recognition systems over the last fifty years, and overview the successes

    and failures of proposed solutions to the problem. We survey the breadth of

    approaches adopted over the years in attempting to solve the problem, and highlight

    the important role that active and attentive approaches must play in any solution that

    bridges the semantic gap in the proposed object representations, while simultaneously

    leading to efficient learning and inference algorithms. From the earliest systems

    which dealt with the character recognition problem, to modern visually-guided agents

    that can purposively search entire rooms for objects, we argue that a common thread

    of all such systems is their fragility and their inability to generalize as well as the

    human visual system can. At the same time, however, we demonstrate that the

    performance of such systems in strictly controlled environments often vastly

    outperforms the capabilities of the human visual system. We conclude our survey by


    arguing that the next step in the evolution of object recognition algorithms will

    require radical and bold steps forward in terms of the object representations, as well

    as the learning and inference algorithms used.

    LIST OF TABLES

Table 1: Comparison of kernel descriptors (KDES) and hierarchical kernel descriptors (HKDES) on CIFAR10.

Table 2: Comparisons on the RGB-D Object Dataset. RGB denotes features over RGB images and depth denotes features over depth images.

Table 3: Comparisons to existing recognition approaches using a combination of depth features and image features. Nonlinear SVMs use a Gaussian kernel.

    LIST OF FIGURES

Figure 1:


    Different components of an object recognition system are shown

    Figure 2:

    Hierarchical Kernel Descriptors

    Figure 3:

    Examples of correspondences established between frames of a database image (left) and a

    query image (right).

    Figure 4:

    Examples of corresponding query (left columns) and database (right columns) images from

the ZuBuD dataset. The image pairs exhibit occlusion, varying illumination, and viewpoint and orientation changes.

    Figure 5:

    Examples of corresponding query (left columns) and database (right columns) images from

the ZuBuD dataset. The image pairs exhibit occlusion, varying illumination, and viewpoint and orientation changes.

    Figure 6:

Image retrieval on the FOCUS dataset: query localisation results. Query images, database images, and query localisations.

    Figure 7:

    An example of matches established on a wide-baseline stereo pair.

    Figure 8:

    Overview of the spatiotemporal (4-D) approach to dynamic vision (adapted from [50, 268]).


Chart 1: Summary of the 1989-2009 papers in Table 5 on active object detection. By definition, search efficiency is not the primary concern in these systems, since by assumption the object is always in the sensor's field of view. However, inference scalability constitutes a significant component of such systems. We notice very little use of function and context in these systems. Furthermore, training such systems is often non-trivial.

    Figure 9:

    A sequence of viewpoints from which the system developed by Wilkes and Tsotsos [266]

    actively recognizes an origami object.

    Figure 10:

The object verification and next viewpoint selection algorithm used in [280] (diagram adapted from [280]).

    Figure 11:

    Graphical model for next-view-planning as proposed in [284, 285].

    Figure 12:

    The aspects of an object and its congruence classes (adapted from Gremban and Ikeuchi

    [287]).

    Figure 13:

An aspect resolution tree used to determine whether there is a single interval of parameter values that satisfies certain constraints (adapted from Gremban and Ikeuchi [287]).

    Figure 14:

    The two types of view degeneracies proposed by Dickinson et al. [49].

Chart 2: Summary of the 1992-2012 papers on active object localization and recognition from Table 6. As expected, search efficiency and the role of 3D information are significantly more prominent in these papers (as compared to Chart 7).

    Figure 15:


Reconstructionist vision vs. Selective Perception, after Rimey and Brown [302].

Laporte and Arbel [291] build upon this work and choose the best next viewpoint by calculating the symmetric KL divergence (Jeffrey divergence) of the likelihood of the observed data given the assumption that this data resulted from two views of two distinct objects. By weighing each Jeffrey divergence by the product of the probabilities of observing the two competing objects and their two views, they can determine the next view which provides the object identity hypothesis, thus again demonstrating the active vision system's direct applicability in the standard recognition pipeline (see Fig. 1).

    Figure 16:

    A PART-OF Bayes net for a table-top scenario, similar to what was proposed by Rimey and

    Brown [302].

Figure 17:

An IS-A Bayes tree for a table-top scenario that was used by Rimey and Brown [302].

    Figure 18:

    The direct-search model, which includes nodes that affect direct search efficiency (unboxed

    nodes) and explicit model parameters (boxed nodes). Adapted from Wixson and Ballard

    [303].

Figure 19:

Junction types proposed by Malik [321] and used by Brunnstrom et al. [306] for recognizing man-made objects.

    Figure 20:

    An ASIMO humanoid robot was used by Andreopoulos et al. [24] to actively search an

    indoor environment.

    Figure 21:

    An example of ASIMO pointing at an object once the target object is successfully localized

    in a 3D environment [24].

    Figure 22:


    The twenty object classes that the 2011 PASCAL dataset contains. Some of the earlier

    versions of the PASCAL dataset only used subsets of these object classes. Adapted from

    [324]

    Chart 3: Summary of the PASCAL Challenge papers from Table 7 which correspond to

algorithms published between 2002 and 2011. Notice that the winning PASCAL challenge algorithms typically make little use of function, context, and 3D, and make moderate use of texture.

    Figure 23:

    The HOG detector of Dalal and Triggs (from [335] with permission). (a): The average

    gradient image over a set of registered training images. (b), (c): Each pixel demonstrates the

    maximum and minimum (respectively) SVM weight of the corresponding block. (d): The test

image used in the rest of the subfigures. (e): The computed R-HOG descriptor of the image in subfigure (d). (f), (g): The R-HOG descriptor weighted by the positive and negative SVM weights respectively.

    Figure 24:

    Examples of the Harris-Laplace detector and the Laplacian detector, which were used

    extensively in [142] as interest-point/region detectors (figure reproduced from [142] with

    permission).

Figure 25:

The distributions of various object classes corresponding to six feature classes.

    Figure 26:

    Example of the algorithm by Felzenszwalb et al. [366] localizing a person using the coarse

    template representation and the higher resolution subpart templates of the person (from [366]

    with permission).

    Figure 27:

    The HOG feature pyramid used in [366], showing the coarse root-level template and the

higher resolution templates of the person's subparts (from [366] with permission).

    Figure 28:


    The distribution of edges and appearance patches of certain car model training images used

    by Chum and Zisserman [365], with the learned regions of interest overlaid (from [365], with

    permission).

    Figure 29:

The 35 most frequent 2AS constructed from 10 outdoor images (from [367] with permission).

It is easier to understand the left image's contents (e.g., a busy road with mountains in the background) if the cars in the image have first been localized. Conversely, in the right image, occlusions make the object localization problem difficult. Thus, prior knowledge that the image contains exclusively cars can make the localization problem easier (from [361] with permission).

    Figure 30:

Demonstrating how top-down category-specific attentional biases can modulate the shape-words during the bag-of-words histogram construction (from [358] with permission).

Such approaches start from low-level features (e.g., edges, color) and group them in more complex ways in order to achieve more universal representations of object parts. In terms of object verification and object hypothesizing (see Fig. 1), the work by Felzenszwalb et al. [366] represents the most successful approach tested in PASCAL 2007 for using a coarse generative model of object parts to improve recognition performance.

    Figure 31:

(a) The 3-layer tree-like object representation in [348]. (b) A reference template without any part displacement, showing the root-node bounding box (blue), the centers of the 9 parts in the 2nd layer (yellow dots), and the 36 parts at the last layer (purple). (c) and (d) show object localizations (from [348] with permission).

    Figure 32:

    On using context to mitigate the negative effects of ambiguous localizations [350]. The

greater the ambiguities, the greater the role contextual knowledge plays (from [350] with permission).

    Figure 33:


An example of a feature extraction stage of the type F_CSG - R_abs - N - P_A. An input image (or a feature map) is passed through a non-linear filter bank, followed by rectification, local contrast normalization and spatial pooling/sub-sampling.

    Figure 34:

Test error rate vs. number of training samples per class on the NORB dataset. Although pure random features perform surprisingly well when training data is very scarce, learning improves the performance significantly for larger amounts of training data. Absolute value rectification (R_abs) and local normalization (N) are shown to improve the performance in all cases.

    Figure 35:

Left: random stage-1 filters, and corresponding optimal inputs that maximize the response of each corresponding complex cell in an F_CSG - R_abs - N - P_A architecture.

    Figure 36:

Left: A dictionary with 128 elements, learned with a patch-based sparse coding model. Right: A dictionary with 128 elements, learned with a convolutional sparse coding model. The dictionary learned with the convolutional model spans the orientation space much more uniformly. In addition, it can be seen that the diversity of filters obtained by the convolutional sparse model is much richer compared to the patch-based one.

    Figure 37:

Top Left: Smooth shrinkage function. Parameters β and b control the smoothness and location of the kink of the function. As β increases, it converges to the soft-thresholding operator. Top Right: Total loss as a function of the number of iterations. The vertical dotted line marks the iteration number when the diagonal Hessian approximation was updated. It is clear that for both encoder functions, the Hessian update improves the convergence significantly. Bottom: 128 convolutional filters (k) learned in the encoder using the smooth shrinkage function.

    Figure 38:

    Second stage filters. Left: Encoder kernels that correspond to the dictionary elements.

Right: 128 dictionary elements; each row shows 16 dictionary elements connecting to a


single second-layer feature map. It can be seen that each group extracts a similar type of features from its corresponding inputs.

    Figure 39:

Results on the INRIA dataset with the per-image metric. Left: Comparing the two best systems with unsupervised initialization (UU) vs. random initialization (RR). Right: Effect of bootstrapping on final performance for the unsupervised initialized system.

    Figure 40:

Results on the INRIA dataset with the per-image metric. These curves are computed from the bounding boxes and confidences made available by Dollar et al. (2009b). Comparing our two best systems, labeled U+U+ and R+R+, with all the other methods.

    Figure 41:

Reconstruction error vs. ℓ1-norm sparsity penalty for coordinate descent sparse coding and variational free energy minimization.

    Figure 42:

Angle between representations obtained for two consecutive frames, for different parameter values, using sparse coding and variational free energy minimization.


    TABLE OF CONTENTS

    1. INTRODUCTION

    a. System Component

    b. Complexity of Object Recognition

    c. Two Dimensional

    d. Three Dimensional

    e. Segmented

2. HIERARCHICAL KERNEL DESCRIPTOR

    a. Kernel Descriptor

    b. Kernel Descriptor Over Kernel Descriptor

    c. Everyday Object Recognition Using RGB-D

    d. Experiments

i. CIFAR10

    ii. RGB-D Object Dataset

    3. OBJECT RECOGNITION METHOD BASED ON TRANSFORMATION

    a. Classes of object recognition methods

    i. Appearance based methods

    ii. Geometry Based Methods

4. RECOGNITION AS CORRESPONDENCE OF LOCAL FEATURES

a. The approach of David Lowe

b. The approach of Mikolajczyk & Schmid

c. The approach of Tuytelaars, Ferrari & Gool

    d. The LAF approach of Matas

    e. The approach of Zisserman


    i. Indexing and Matching

    ii. Verification

    f. Scale Saliency by Kadir & Brady

    g. Local PCA, approaches of Jugessur & Ohba

    h. The approach of Selinger and Nelson

    i. Applications

    5. LITERATURE SURVEY

    a. Active and Dynamic Vision

    b. Active object detection literature survey

    c. Active object localization

    6. CASE STUDIES FROM RECOGNITION CHALLENGES AND THE EVOLVING

    LANDSCAPE

    a. Dataset and evaluation techniques

    b. Sampling the current state of the art in the recognition literature

    i. Pascal 2005

    ii. Pascal 2006

    iii. Pascal 2008

    iv. Pascal 2009

    v. Pascal 2010

    vi. Pascal 2011

    c. The evolving landscape

    7. MULTI STAGE ARCHITECTURE FOR OBJECT RECOGNITION

    a. Modules for hierarchical systems

    b. Combining module into hierarchy

    c. Training protocol

    d. Experiment with Caltech101 Dataset

    e. Using a single stage of feature extraction

    f. Using two stage of feature extraction

g. NORB dataset


    h. Random filter performance

    i. Handwritten digit recognition

j. Convolutional sparse coding

    k. Algorithms and Methods

    i. Learning convolutional dictionaries

ii. Learning an efficient encoder

    iii. Patch based vs convolutional sparse modelling

    l. Multi stage architecture

    m. Experiments

    i. Object recognition using Caltech 101 dataset

    ii. Pedestrian detection

    iii. Architecture and Training

    iv. Per image evaluation

    n. Sparse coding by variational marginalization

o. Variational marginalization for sparse coding

    p. Stability experiments

    8. CONCLUSION

    9. References


    2) Introduction

    Object recognition is a fundamental and challenging problem and is a major focus of

    research in computer vision, machine learning and robotics. The task is difficult partly

    because images are in high-dimensional space and can change with viewpoint, while

the objects themselves may be deformable, leading to large intra-class variation. The

    core of building object recognition systems is to extract meaningful representations

    (features) from high-dimensional observations such as images, videos, and 3D point

    clouds. This paper aims to discover such representations using machine learning

    methods. An object recognition system finds objects in the real world from an image

    of the world, using object models which are known a priori. This task is surprisingly

    difficult. Humans perform object recognition effortlessly and instantaneously.

    Algorithmic description of this task for implementation on machines has been very

    difficult. In this chapter we will discuss different steps in object recognition and

    introduce some techniques that have been used for object recognition in many

    applications. We will discuss the different types of recognition tasks that a vision

    system may need to perform. We will analyze the complexity of these tasks and

    present approaches useful in different phases of the recognition task.


    Over the past few years, there has been increasing interest in feature learning for

object recognition using machine learning methods. Deep belief nets (DBNs) are

    appealing feature learning methods that can learn a hierarchy of features. DBNs are

    trained one layer at a time using contrastive divergence, where the feature learned by

    the current layer becomes the data for training the next layer. Deep belief nets have

    shown impressive results on handwritten digit recognition, speech recognition and

    visual object recognition. Convolutional neural networks (CNNs) are another example

    that can learn multiple layers of nonlinear features. In CNNs, the parameters of the

    entire network, including a final layer for recognition, are jointly optimized using the

    back-propagation algorithm.
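As a minimal, hedged illustration of such a jointly optimized CNN pipeline (the layer sizes, the 10-class output, and the random dummy batch are illustrative assumptions, not details from this thesis), a sketch in PyTorch might look like:

```python
import torch
import torch.nn as nn

# Minimal sketch of a CNN whose feature layers and final recognition layer are
# optimized jointly by back-propagation, as described above. Layer sizes and
# the 10-class output are illustrative assumptions.
class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # assumes 32x32 inputs

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = SmallCNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# One joint back-propagation step over a dummy batch (random data as a stand-in).
images, labels = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
loss = criterion(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```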

    The object recognition problem can be defined as a labeling problem based on models

    of known objects. Formally, given an image containing one or more objects of interest

    (and background) and a set of labels corresponding to a set of models known to the

    system, the system should assign correct labels to regions, or a set of regions, in the

    image. The object recognition problem is closely tied to the segmentation problem:

    without at least a partial recognition of objects, segmentation cannot be done, and

    without segmentation, object recognition is not possible.

In this chapter, we discuss basic aspects of the object recognition hierarchy and its analysis.

    We present the architecture and main components of object recognition and discuss

    their role in object recognition systems of varying complexity.


Figure 1: Different components of an object recognition system are shown

    a) System Component

    An object recognition system must have the following components to perform the

    task:

    Model database (also called modelbase)

    Feature detector

    Hypothesizer

    Hypothesis verifier

    A block diagram showing interactions and information flow among different

components of the system is given in Figure 1. The model database contains all the

    models known to the system. The information in the model database depends on the


    approach used for the recognition. It can vary from a qualitative or functional

    description to precise geometric surface information. In many cases, the models of

    objects are abstract feature vectors, as discussed later in this section. A feature is some

    attribute of the object that is considered important in describing and recognizing the

    object in relation to other objects. Size, color, and shape are some commonly used

    features.

    The feature detector applies operators to images and identifies locations of features

    that help in forming object hypotheses. The features used by a system depend on the

    types of objects to be recognized and the organization of the model database. Using

    the detected features in the image, the hypothesizer assigns likelihoods to objects

    present in the scene. This step is used to reduce the search space for the recognizer

    using certain features.

    The model base is organized using some type of indexing scheme to facilitate

    elimination of unlikely object candidates from possible consideration. The verifier

    then uses object models to verify the hypotheses and refines the likelihood of objects.

    The system then selects the object with the highest likelihood, based on all the

    evidence, as the correct object.
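A minimal sketch of this control flow, with all class and function names chosen for illustration only (they are not from the text), could look like:

```python
# Skeleton of the recognition pipeline described above: feature detection,
# hypothesis formation against an indexed model database, and verification.
# All names and the scoring details are illustrative assumptions.

class ObjectRecognizer:
    def __init__(self, model_database, feature_detector, hypothesizer, verifier):
        self.models = model_database          # all object models known to the system
        self.detect_features = feature_detector
        self.hypothesize = hypothesizer       # assigns likelihoods using detected features
        self.verify = verifier                # refines likelihoods using full object models

    def recognize(self, image):
        features = self.detect_features(image)
        # Indexing into the model base prunes unlikely candidates early.
        candidates = self.hypothesize(features, self.models)
        scored = [(self.verify(image, model, features), model) for model in candidates]
        best_score, best_model = max(scored, key=lambda pair: pair[0])
        return best_model, best_score
```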

    All object recognition systems use models either explicitly or implicitly and employ

    feature detectors based on these object models. The hypothesis formation and

    verification components vary in their importance in different approaches to object

    recognition. Some systems use only hypothesis formation and then select the object


    with highest likelihood as the correct object. Pattern classification approaches are a

    good example of this approach. Many artificial intelligence systems, on the other

    hand, rely little on the hypothesis formation and do more work in the verification

    phases. In fact, one of the classical approaches, template matching, bypasses the

    hypothesis formation stage entirely.

    An object recognition system must select appropriate tools and techniques for the

    steps discussed above. Many factors must be considered in the selection of

    appropriate methods for a particular application. The central issues that should be

    considered in designing an object recognition system are:

    Object or model representation: How should objects be represented in the

    model database? What are the important attributes or features of objects that must be

    captured in these models? For some objects, geometric descriptions may be available

    and may also be efficient, while for another class one may have to rely on generic or

    functional features.

    The representation of an object should capture all relevant information without any

    redundancies and should organize this information in a form that allows easy access

    by different components of the object recognition system.

    Feature extraction: Which features should be detected, and how can they be

detected reliably? Most features can be computed in two-dimensional images but they

    are related to three-dimensional characteristics of objects. Due to the nature of the


    image formation process, some features are easy to compute reliably while others are

    very difficult. Feature detection issues were discussed in many chapters in this book.

    Feature-model matching: How can features in images be matched to models

    in the database? In most object recognition tasks, there are many features and

    numerous objects. An exhaustive matching approach will solve the recognition

    problem but may be too slow to be useful. Effectiveness of features and efficiency of

    a matching technique must be considered in developing a matching approach.

    Hypotheses formation: How can a set of likely objects based on the feature

    matching be selected, and how can probabilities be assigned to each possible object?

    The hypothesis formation step is basically a heuristic to reduce the size of the search

    space. This step uses knowledge of the application domain to assign some kind of

    probability or confidence measure to different objects in the domain. This measure

    reflects the likelihood of the presence of objects based on the detected features.

    Object verification: How can object models be used to select the most likely

    object from the set of probable objects in a given image? The presence of each likely

    object can be verified by using their models. One must examine each plausible

    hypothesis to verify the presence of the object or ignore it. If the models are

    geometric, it is easy to precisely verify objects using camera location and other scene

    parameters. In other cases, it may not be possible to verify a hypothesis.


Depending on the complexity of the problem, one or more modules in Figure 1 may

    become trivial. For example, pattern recognition-based object recognition systems do

    not use any feature-model matching or object verification; they directly assign

    probabilities to objects and select the object with the highest probability.

    b) Complexity of Object Recognition

    As we studied in earlier chapters in this book, images of scenes depend on

    illumination, camera parameters, and camera location. Since an object must be

    recognized from images of a scene containing multiple entities, the complexity of

    object recognition depends on several factors. A qualitative way to consider the

    complexity of the object recognition task would consider the following factors:

    Scene constancy: The scene complexity will depend on whether the images are

    acquired in similar conditions (illumination, background, camera parameters, and

viewpoint) as the models. As seen in earlier chapters, scene conditions affect images

    of the same object dramatically. Under different scene conditions, the performance of

    different feature detectors will be significantly different. The nature of the

    background, other objects, and illumination must be considered to determine what

    kind of features can be efficiently and reliably detected.

    Image-models spaces: In some applications, images may be obtained such that

    three-dimensional objects can be considered two-dimensional. The models in such


    recognition problem into the following classes.

    c) Two-dimensional

    In many applications, images are acquired from a distance sufficient to consider the

    projection to be orthographic. If the objects are always in one stable position in the

    scene, then they can be considered two-dimensional. In these applications, one can

    use a two-dimensional modelbase. There are two possible cases:

    Objects will not be occluded, as in remote sensing and many industrial

applications.

    Objects may be occluded by other objects of interest or be partially visible, as

    in the bin of parts problem.

    In some cases, though the objects may be far away, they may appear in different

    positions resulting in multiple stable views. In such cases also, the problem may be

    considered inherently as two-dimensional object recognition.

    d) Three-dimensional

    If the images of objects can be obtained from arbitrary viewpoints, then an object may

    appear very different in its two views. For object recognition using three-dimensional


    models, the perspective effect and viewpoint of the image have to be considered. The

    fact that the models are three-dimensional and the images contain only two-

    dimensional information affects object recognition approaches. Again, the two factors

    to be considered are whether objects are separated from other objects or not.

    For three-dimensional cases, one should consider the information used in the object

    recognition task. Two different cases are:

    Intensity: There is no surface information available explicitly in intensity

    images. Using intensity values, features corresponding to the three-dimensional

structure of objects should be recognized.

    2.5-dimensional images: In many applications, surface representations with

    viewer-centered coordinates are available, or can be computed, from images. This

    information can be used in object recognition.

    Range images are also 2.5-dimensional. These images give the distance to different

    points in an image from a particular view point.

    e) Segmented

    The images have been segmented to separate objects from the background. As

    discussed in Chapter 3 on segmentation, object recognition and segmentation

    problems are closely linked in most cases. In some applications, it is possible to


    segment out an object easily. In cases when the objects have not been segmented, the

    recognition problem is closely linked with the segmentation problem.

3) HIERARCHICAL KERNEL DESCRIPTORS

    Kernel descriptors highlight the kernel view of orientation histograms, such as SIFT

    and HOG, and show that they are a particular type of match kernels over patches. This

    novel view suggests a unified framework for turning pixel attributes (gradient, color,

    local binary pattern, etc.) into patch-level features:

    (1) design match kernels using pixel attributes;

    (2) learn compact basis vectors using kernel principal component analysis (KPCA);

(3) construct kernel descriptors by projecting the infinite-dimensional feature vectors

    to the learned basis vectors.

    The key idea of this work is that we can apply the kernel descriptor framework not

    only over sets of pixels (patches), but also sets of kernel descriptors. Hierarchical

    kernel descriptors aggregate spatially nearby patch-level features to form higher level

features by using kernel descriptors recursively, as shown in Fig. 2. This procedure

    can be repeated until we reach the final image-level features.
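The three steps above can be sketched for a single pixel attribute as follows; the attribute dimensionality, basis sampling, kernel width, and number of retained components are all illustrative assumptions rather than values from the text:

```python
import numpy as np

# Hedged sketch of the three-step kernel-descriptor recipe described above,
# for a single pixel attribute (e.g. normalized color). Basis sampling, the
# Gaussian kernel width and the attribute itself are illustrative assumptions.

def gaussian_kernel(X, Y, gamma=5.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Step 2: learn compact basis vectors with kernel PCA over sampled attributes.
basis = np.random.rand(200, 3)               # sampled basis attribute vectors
K = gaussian_kernel(basis, basis)
eigvals, eigvecs = np.linalg.eigh(K)
top = eigvecs[:, -50:] / np.sqrt(np.maximum(eigvals[-50:], 1e-12))  # top 50 components

def kernel_descriptor(patch_attributes, weights):
    """Step 3: project the (implicit, infinite-dimensional) match-kernel feature
    of a patch onto the learned basis. `patch_attributes` is (n_pixels, 3) and
    `weights` are per-pixel weights (e.g. gradient magnitudes), per Step 1."""
    # Evaluate the match kernel between every pixel and every basis vector,
    # then aggregate over the patch: sum_z w_z k(a_z, x_i).
    k_zb = gaussian_kernel(patch_attributes, basis)
    aggregated = weights @ k_zb
    return aggregated @ top                   # compact patch-level feature

patch = np.random.rand(64, 3)                 # attributes of an 8x8 patch
descriptor = kernel_descriptor(patch, np.ones(64) / 64)
```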


    a) Kernel Descriptors

    Patch-level features are critical for many computer vision tasks. Orientation

    histograms like SIFT and HOG are popular patch-level features for object recognition.

    Kernel descriptors include SIFT and HOG as special cases, and provide a principled

    way to generate rich patch-level features from various pixel attributes.

Figure 2: Hierarchical Kernel Descriptors. In the first layer, pixel attributes are aggregated into patch-level features. In the second layer, patch-level features are turned into aggregated patch-level features. In the final layer, aggregated patch-level features are converted into image-level features. Kernel descriptors are used in every layer.

The gradient match kernel, $K_{grad}$, is based on the pixel gradient attribute:

$$K_{grad}(P, Q) = \sum_{z \in P} \sum_{z' \in Q} \tilde{m}_z \, \tilde{m}_{z'} \, k_o(\tilde{\theta}_z, \tilde{\theta}_{z'}) \, k_p(z, z'),$$

where $P$ and $Q$ are patches from two different images, $z$ denotes the 2D position of a pixel in an image patch, and $\tilde{\theta}_z$ and $\tilde{m}_z$ are the normalized orientation and magnitude of the image gradient at pixel $z$; $k_o$ and $k_p$ are Gaussian kernels over gradient orientations and pixel positions, respectively.

The color kernel descriptor, $K_{col}$, is based on the pixel intensity attribute:

$$K_{col}(P, Q) = \sum_{z \in P} \sum_{z' \in Q} k_c(c_z, c_{z'}) \, k_p(z, z'),$$

where $c_z$ is the pixel color at position $z$ (intensity for gray images and RGB values for color images) and $k_c(c_z, c_{z'}) = \exp(-\gamma_c \|c_z - c_{z'}\|^2)$ is a Gaussian kernel. The shape kernel descriptor, $K_{shape}$, is based on the local binary pattern attribute.

    Gradient, color and shape kernel descriptors are strong in their own right and

complement one another. Their combination turns out to be always (much) better than the best individual feature. Kernel descriptors are able to generate rich visual

    feature sets by turning various pixel attributes into patch-level features, and are

    superior to the current state-of-the-art recognition algorithms on many standard visual

    object recognition datasets.
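As a concrete, hedged reading of the color match kernel defined above, a direct (unoptimized) evaluation over two small patches might be sketched as follows; the kernel parameters gamma_c and gamma_p are illustrative values:

```python
import numpy as np

# Direct evaluation of the color match kernel described above:
# K_col(P, Q) = sum_{z in P} sum_{z' in Q} k_c(c_z, c_z') k_p(z, z'),
# with Gaussian kernels over pixel color and normalized pixel position.
# gamma_c and gamma_p are illustrative values, not from the text.

def gaussian(X, Y, gamma):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def color_match_kernel(colors_p, pos_p, colors_q, pos_q, gamma_c=4.0, gamma_p=3.0):
    k_c = gaussian(colors_p, colors_q, gamma_c)   # k_c(c_z, c_z')
    k_p = gaussian(pos_p, pos_q, gamma_p)         # k_p(z, z') over 2D positions in [0, 1]
    return float((k_c * k_p).sum())

# Two 8x8 patches with RGB colors in [0, 1] and normalized pixel coordinates.
grid = np.stack(np.meshgrid(np.linspace(0, 1, 8), np.linspace(0, 1, 8)), -1).reshape(-1, 2)
P, Q = np.random.rand(64, 3), np.random.rand(64, 3)
print(color_match_kernel(P, grid, Q, grid))
```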

    b) Kernel Descriptors over Kernel Descriptors

    The match kernels used to aggregate patch-level features have similar structure to

those used to aggregate pixel attributes:


where $A$ and $A'$ denote image patches, and $P$ and $Q$ are sets of image patches.

The patch position Gaussian kernel $k_C(C_A, C_{A'}) = \exp(-\gamma_C \|C_A - C_{A'}\|^2) = \phi_C(C_A)^\top \phi_C(C_{A'})$ describes the spatial relationship between two patches, where $C_A$ is the center position of patch $A$ (normalized to $[0, 1]$). The patch Gaussian kernel $k_F(F_A, F_{A'}) = \exp(-\gamma_F \|F_A - F_{A'}\|^2) = \phi_F(F_A)^\top \phi_F(F_{A'})$ measures the similarity of two patch-level features, where $F_A$ are gradient, shape or color kernel descriptors in our case. The linear kernel $W_A W_{A'}$ weights the contribution of each patch-level feature, where $W_A$ (up to a small positive constant) is the average of gradient magnitudes for the gradient kernel descriptor, the average of standard deviations for the shape kernel descriptor, and is always 1 for the color kernel descriptor.

Note that although efficient match kernels [1] used match kernels to aggregate patch-level features, they don't consider spatial information in match kernels, and so a spatial pyramid is required to integrate spatial information. In addition, they also do not weight the contribution of each patch, which can be suboptimal. The novel joint match kernels (5) provide a way to integrate patch-level features, patch variation, and spatial information jointly.


which can be written as a single Gaussian kernel. This procedure is optimal in the sense of minimizing the least squares approximation error. However, it is intractable to compute the eigenvectors of a 125,000 × 125,000 matrix on a modern personal computer. Here we propose a fast algorithm for finding the eigenvectors of the Kronecker product of kernel matrices. Since kernel matrices are symmetric positive definite, we have

$$K_F \otimes K_C = (V_F \Lambda_F V_F^\top) \otimes (V_C \Lambda_C V_C^\top) = (V_F \otimes V_C)(\Lambda_F \otimes \Lambda_C)(V_F \otimes V_C)^\top,$$

which suggests that the top $r$ eigenvectors of $K_F \otimes K_C$ can be chosen from the Kronecker product of the eigenvectors of $K_F$ and those of $K_C$, which significantly reduces computational cost. The second-layer kernel descriptors then take an analogous form. Recursively applying kernel descriptors in a similar manner, we can get kernel descriptors of more layers, which represent features at different levels.
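The shortcut can be checked numerically on small matrices; the sizes below are toy assumptions, but the identity being exercised is exactly the one stated above:

```python
import numpy as np

# Numerical check of the shortcut described above: because K_F and K_C are
# symmetric positive definite, eigenvectors of K_F (x) K_C are Kronecker
# products of the factor eigenvectors, and eigenvalues are products of the
# factor eigenvalues. Matrix sizes are toy assumptions.

rng = np.random.default_rng(0)

def random_spd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

K_F, K_C = random_spd(6), random_spd(4)
wF, VF = np.linalg.eigh(K_F)
wC, VC = np.linalg.eigh(K_C)

# Candidate eigenpairs of the Kronecker product, obtained without forming the
# full eigendecomposition of the large matrix.
eigvals = np.kron(wF, wC)
eigvecs = np.kron(VF, VC)

# Pick the top-r eigenvectors by sorting the product eigenvalues.
r = 5
top_idx = np.argsort(eigvals)[::-1][:r]
top_vectors = eigvecs[:, top_idx]

# Verify against the direct eigendecomposition of K_F (x) K_C.
K = np.kron(K_F, K_C)
direct_vals = np.sort(np.linalg.eigvalsh(K))[::-1][:r]
assert np.allclose(np.sort(eigvals)[::-1][:r], direct_vals)
v0 = top_vectors[:, 0]
assert np.allclose(K @ v0, eigvals[top_idx[0]] * v0)   # v0 really is an eigenvector
```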

    c) Everyday Object Recognition using RGB-D

    We recorded with the camera mounted at three different heights relative to the

turntable, giving viewing angles of approximately 30, 45 and 60 degrees with the horizon. One revolution of each object was recorded at each height. Each video

    sequence is recorded at 20 Hz and contains around 250 frames, giving a total of

    250,000 RGB + Depth frames. A combination of visual and depth cues (Mixture-of-

    Gaussian fitting on RGB, RANSAC plane fitting on depth) produces a segmentation


for each frame, separating the object of interest from the background. The objects are organized into a hierarchy taken from WordNet hypernym/hyponym relations, which is a subset of the categories in ImageNet. Each of the 300 objects in the dataset belongs to one of 51 categories.

Our hierarchical kernel descriptors, being a generic approach based on kernels, have no trouble generalizing from color images to depth images. Treating a depth image as a grayscale image, i.e. using depth values as intensity, gradient and shape kernel descriptors can be directly extracted, and they capture edge and shape information in the depth channel. However, color kernel descriptors extracted over the raw depth image do not have any significant meaning. Instead, we make the observation that the distance d of an object from the camera is inversely proportional to the square root of its area s in RGB images for a given object.

Since we have the segmentation of objects, we can represent s using the number of pixels belonging to the object mask. Finally, we multiply depth values by √s before extracting color kernel descriptors over this normalized depth image. This yields a feature that is sensitive to the physical size of the object.
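A minimal sketch of this normalization step, assuming only a depth image and a binary object mask (both arrays below are placeholders), might be:

```python
import numpy as np

# Hedged sketch of the depth normalization described above: scale depth values
# by the square root of the object's pixel area s (taken from the segmentation
# mask) so that the normalized values reflect the object's physical size.
# The arrays here are illustrative placeholders.

def normalize_depth(depth_image, object_mask):
    s = float(object_mask.sum())              # s = number of pixels on the object
    return depth_image * np.sqrt(s)           # d * sqrt(s) is roughly distance-invariant

depth = np.random.rand(120, 160).astype(np.float32)       # dummy depth frame
mask = np.zeros((120, 160), dtype=bool)
mask[40:80, 60:110] = True                                 # dummy object segmentation
normalized = normalize_depth(depth, mask)
```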


In the experiments section, we will compare in detail the performance of our hierarchical kernel descriptors on RGB-D object recognition to that in [15]. Our approach consistently outperforms the state of the art in [15]. In particular, our hierarchical kernel descriptors on the depth image perform much better than the combination of depth features (including spin images) used in [15], increasing the depth-only object category recognition from 53.1% (linear SVMs) and 64.7% (nonlinear SVMs) to 75.7% (hierarchical kernel descriptors and linear SVMs). Moreover, our depth features served as the backbone in the object-aware situated interactive system that was successfully demonstrated at the Consumer Electronics Show 2011 despite adverse lighting conditions.

    d) Experiments

In this section, we evaluate hierarchical kernel descriptors on CIFAR10 and the RGB-D Object Dataset. We also provide extensive comparisons with current state-of-the-art algorithms in terms of accuracy.

Table 1. Comparison of kernel descriptors (KDES) and hierarchical kernel descriptors (HKDES) on CIFAR10.

Features        KDES [1]    HKDES (this work)
Color           53.9        63.4
Shape           68.2        69.4
Gradient        66.3        71.2
Combination     76.0        80.0

    In all experiments we use the same parameter settings as the original kernel

descriptors for the first layer of hierarchical kernel descriptors. For SIFT as well as gradient and shape kernel descriptors, all images are transformed into grayscale ([0, 1]). Image intensity and RGB values are normalized to [0, 1]. Like HOG [5], we compute gradients using the mask [-1, 0, 1] for gradient kernel descriptors. We also

    evaluate the performance of the combination of the three hierarchical kernel

    descriptors by concatenating the image-level feature vectors. Our experiments suggest

    that this combination always improves accuracy.

    i) CIFAR10

    CIFAR10 is a subset of the 80 million tiny images dataset [26, 14]. These images are

downsampled to 32 × 32 pixels. The training set contains 5,000 images per category,

    while the test set contains 1,000 images per category.

    Due to the tiny image size, we use two-layer hierarchical kernel descriptors to obtain

    image-level features. We keep the first layer the same as kernel descriptors. Kernel


descriptors are extracted over 8 × 8 image patches over dense regular grids with a spacing of 2 pixels. We split the whole training set into 10,000/40,000 training/validation set, and optimize the kernel parameters of the second layer kernel descriptors on the validation set using grid search. Finally, we train linear SVMs on

    the full training set using the optimized kernel parameter setting. Our hierarchical

    model can handle large numbers of basis vectors. We tried both 1000 and 5000 basis

    vectors for the patch-level Gaus-sian kernel kF , and found that a larger number of

    visual words is slightly better (0.5% to 1% improvement depend-ing on the type of

    kernel descriptor). In the second layer, we use 1000 basis vector, enforce KPCA to

    keep 97% of the energy for all kernel descriptors, and produce roughly 6000-

    dimensional image-level features. Note that the sec-ond layer of hierarchical kernel

    descriptors are image-level features, and should be compared to that of image-level

    features formed by EMK, rather than that of kernel descriptors over image patches.

    The dimensionality of EMK features [1] in is 14000, higher than that of hierarchical

    kernel descriptors.

    We compare kernel descriptors and hierarchical kernel

    Method Accuracy

    Logistic regression 36.0

    Support Vector Machines 39.5

    GIST 54.7

  • 8/11/2019 Object Recognition Techniques (2)

    38/224

    38

    SIFT 65.6

    fine-tuning GRBM 64.8

    GRBM two layers 56.6

    mcRBM 68.3

    mcRBM-DBN 71.0

    Tiled CNNs 73.1

    improved LCC 74.5

    KDES + EMK + linear SVMs 76.0

    Convolutional RBM 78.9

    K-means (Triangle, 4k features) 79.6

    HKDES + linear SVMs (this

    work) 80.0

    descriptors in Table 1. As we see, hierarchical kernel de-scriptors consistently

    outperform kernel descriptors. The shape hierarchical kernel descriptor is slightly

better than the shape kernel descriptor. The other two hierarchical kernel descriptors are much better than their counterparts: the gradient hierarchical kernel descriptor is about 5 percent higher than the gradient kernel descriptor, and the color hierarchical kernel descriptor is 10 percent better than the color kernel descriptor. Finally, the combination of all three hierarchical kernel descriptors outperforms the combination of all three kernel descriptors by 4 percent. We were not able to run nonlinear SVMs with


    Laplacian kernels on the scale of this dataset in reasonable time, given the high

    dimensionality of image-level features. Instead, we make comparisons on a subset of

5,000 training images, and our experiments suggest that nonlinear SVMs have similar performance to linear SVMs when hierarchical kernel descriptors are used.

We compare hierarchical kernel descriptors with the current state-of-the-art feature learning algorithms in Table 2. Deep belief nets and sparse coding have been extensively evaluated on this dataset [25, 31]. mcRBM can model pixel intensities and pairwise dependencies between them jointly. A factorized third-order restricted Boltzmann machine, followed by deep belief nets, has an accuracy of 71.0%. Tiled CNNs have the best accuracy among deep networks. The improved LCC extends the original local coordinate coding by including local tangent directions and is able to integrate geometric information. As we have seen, sophisticated feature extraction can significantly boost accuracy and is much better than using raw pixel features. SIFT features have an accuracy of 65.6% and work reasonably well even on tiny images. The combination of three hierarchical kernel descriptors has an accuracy of 80.0%, higher than all other competing techniques; its accuracy is 14.4 percent higher than SIFT, 9.0 percent higher than mcRBM combined with DBNs, and 5.5 percent higher than the improved LCC. Hierarchical kernel descriptors slightly outperform the very recent work: the convolutional RBM and the triangle K-means with 4000 centers.

    ii) RGB-D Object Dataset


    determining the category name of an object (e.g. coffee mug). One category usually

    contains many different object instances.

    To test the generalization ability of our approaches, for category recognition we train

    models on a set of objects and at test time present to the system objects that were not

    present in the training set [15]. At each trial, we randomly leave one object out from

    each category for testing and train classifiers on the remaining 300 - 51 = 249 objects.

For instance recognition we also follow the experimental setting suggested by [15]: train models on the video sequences of each object where the viewing angles are 30° and 60° with the horizon and test them on the 45° video sequence.

For category recognition, the average accuracy over 10 random train/test splits is reported in the second column of Table 2. For instance recognition, the accuracy on the test set is reported in the third column of Table 2. As we expect, the combination of hierarchical kernel descriptors is much better than any single descriptor. The underlying reason is that each depth descriptor captures different information, and the weights learned by linear SVMs using supervised information can automatically balance the importance of each descriptor across objects.

Method                          Category      Instance
Color HKDES (RGB)               60.1 ± 2.1    58.4
Shape HKDES (RGB)               72.6 ± 1.9    74.6
Gradient HKDES (RGB)            70.1 ± 2.9    75.9
Combination of HKDES (RGB)      76.1 ± 2.2    79.3
Color HKDES (depth)             61.8 ± 2.4    28.8
Shape HKDES (depth)             65.8 ± 1.8    36.7
Gradient HKDES (depth)          70.8 ± 2.7    39.3
Combination of HKDES (depth)    75.7 ± 2.6    46.8
Combination of all HKDES        84.1 ± 2.2    82.4

Table 2: Comparisons on the RGB-D Object Dataset. RGB denotes features over RGB images and depth denotes features over depth images.

Approaches                      Category      Instance
Linear SVMs [15]                81.9 ± 2.8    73.9
Nonlinear SVMs [15]             83.8 ± 3.5    74.8
Random Forest [15]              79.6 ± 4.0    73.1
Combination of all HKDES        84.1 ± 2.2    82.4

Table 3: Comparisons to existing recognition approaches using a combination of depth features and image features. Nonlinear SVMs use a Gaussian kernel.


In Table 3, we compare hierarchical kernel descriptors with the rich feature set used in [15], where SIFT, color and textons were extracted from RGB images, and 3-D bounding boxes and spin images over depth images. Hierarchical kernel descriptors are slightly better than this rich feature set for category recognition, and much better for instance recognition.

    It is worth noting that, using depth alone, we improve the category recognition

accuracy in [15] from 53.1% (linear SVMs) to 75.7% (hierarchical kernel descriptors and

    linear SVMs). This shows the power of our hierarchical kernel descriptor formulation

    when being applied to a non-conventional domain. The depth-alone results are

meaningful for many scenarios where color images are not used for privacy or

    robustness reasons.

    As a comparison, we also extracted SIFT features on both RGB and depth images and

    trained linear SVMs over image-level features formed by spatial pyramid EMK. The

    resulting classifier has an accuracy of 71.9% for category recognition, much lower

    than the result of the combination of hierarchical kernel descriptors (84.2%). This is

    not sur-prising since SIFT fails to capture shape and object size information.

    Nevertheless, hierarchical kernel descriptors provide a unified way to generate rich

    feature sets over both RGB and depth images, giving significantly better accuracy.


    4) OBJECT RECOGNITION METHODS BASED ON TRANSFORMATION

    Recognition of general three-dimensional objects from 2D images and videos is a

challenging task. The common formulation of the problem is essentially: given some knowledge of how certain objects may appear, plus an image of a scene possibly containing those objects, find which objects are present in the scene and where. Recognition is accomplished by matching features of an image and a model of an

    object. The two most important issues that a method must address are the definition of

    a feature, and how the matching is found.

What is the goal in designing an object recognition system? Achieving generality, i.e. the ability to recognise any object without hand-crafted adaptation to a specific task; robustness, i.e. the ability to recognise objects in arbitrary conditions; and easy learning, i.e. avoiding special or demanding procedures to obtain the database of models. Obviously these requirements are generally impossible to achieve, as it is for example impossible to recognise objects in images taken in complete darkness. The challenge is then to develop a method with minimal constraints.

    Object recognition methods can be classified according to a number of characteristics.

We focus on model acquisition (learning) and invariance to image formation conditions. Historically, two main trends can be identified. In the so-called geometry- or

    model-based object recognition, the knowledge of an object appearance is provided

    by the user as an explicit CAD-like model. Typically, such a model describes only the


    3D shape, omitting other properties such as colour and texture. On the other end of

    the spectrum are the appearance-based methods, where no explicit user-provided

    model is required. The object representations are usually acquired through an

    automatic learning phase (but not necessarily), and the model typically relies on

surface reflectance (albedo) properties. Recently, methods which put local image patches into correspondence have emerged. Models are learned automatically; objects are represented by the appearance of small local elements. The global arrangement of the

    representation is constrained by weak or strong geometric models.

The rest of the paper is structured as follows. In Section 2, an overview of classes of object recognition methods is given. A survey of methods based on matching of local features is presented in Section 3, and Section 4 describes some of their successful applications. Section 5 concludes the paper.

a) Classes of Object Recognition Methods

    i) Appearance Based Methods

    The central idea behind appearance-based methods is the following. Having seen all

    possible appearances of an object, can recognition be achieved by just efficiently

    remembering all of them? Could recognition be thus implemented as an efficient

    visual (pictorial) memory? The answer obviously depends on what is meant by all

appearances. The approach has been successfully demonstrated for scenes with unoccluded objects on a black background [34]. But remembering all possible object


The family of appearance-based object recognition methods includes global histogram matching methods. In [66, 67], Swain and Ballard proposed to represent an object by a colour histogram. Objects are identified by matching histograms of image regions to histograms of a model image. While the technique is robust to object orientation, scaling, and occlusion, it is very sensitive to lighting conditions, and it is not suitable for recognising objects that cannot be identified by colour alone. The approach was later modified by Healey and Slater [14] and by Funt and Finlayson [12] to exploit illumination invariants. More recently, the concept of histogram matching was generalised by Schiele [52, 51, 50]: instead of pixel colours, responses of various filters are used to form the histograms (then called receptive field histograms).
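As an illustration of the idea (not the exact formulation of Swain and Ballard), the following sketch builds coarse joint RGB histograms with OpenCV and compares them by histogram intersection; the bin count, the normalisation step, and the file names are arbitrary choices made for the example.

```python
import cv2
import numpy as np

def colour_histogram(image_bgr, bins=8):
    # Joint 3D histogram over the three colour channels, normalised to sum to 1.
    hist = cv2.calcHist([image_bgr], [0, 1, 2], None,
                        [bins, bins, bins], [0, 256] * 3)
    return hist / hist.sum()

def histogram_intersection(h_query, h_model):
    # Swain-and-Ballard-style intersection: 1.0 means identical histograms.
    return float(np.minimum(h_query, h_model).sum())

# Hypothetical usage: 'model.png' and 'query.png' are stand-in file names.
model = cv2.imread("model.png")
query = cv2.imread("query.png")
score = histogram_intersection(colour_histogram(query), colour_histogram(model))
print("intersection score:", score)
```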

To summarise, appearance-based approaches are attractive since they do not require image features or geometric primitives to be detected and matched. Their limitations, however (the necessity of dense sampling of training views and the low robustness to occlusion and cluttered backgrounds), make them suitable mainly for applications with limited or controlled variation in the image formation conditions, e.g. industrial inspection.

    iii) Geometry-Based Methods

In geometry- (or shape-, or model-) based methods, the information about the objects is represented explicitly. Recognition can then be interpreted as deciding whether (a part of) a given image can be a projection of the known (usually 3D) model [41] of an object.

Generally, two representations are needed: one to represent the object model, and another to represent the image content. To facilitate finding a match between model and image, the two representations should be closely related. In the ideal case there is a simple relation between the primitives used to describe the model and those used to describe the image. If the object were, for example, described by a wireframe model, the image might best be described in terms of linear intensity edges; each edge could then be matched directly to one of the model wires. However, the model and image representations often have distinctly different meanings. The model may describe the 3D shape of an object, while the image edges correspond only to visible manifestations of that shape, mixed together with false edges (discontinuities in surface albedo) and illumination effects (shadows).

To achieve pose and illumination invariance, it is preferable to employ model primitives that are at least somewhat invariant with respect to changes in these conditions. Considerable effort has therefore been directed at identifying primitives that are invariant with respect to viewpoint change.

The main disadvantages of geometry-based methods are: the dependency on reliable extraction of geometric primitives (lines, circles, etc.), the ambiguity in interpretation of the detected primitives (the presence of primitives that are not modelled), the restricted


correspondences. Since it is not required that all local features match, these approaches are robust to occlusion and cluttered backgrounds.

To recognise objects from different views, it is necessary to handle all variations in object appearance. These variations may be complex in general, but at the scale of the local features they can be modelled by simple, e.g. affine, transformations. Thus, by allowing simple transformations at the local scale, significant viewpoint invariance is achieved even for objects with complicated shapes. As a result, it is possible to obtain models of objects from only a few views, taken e.g. 90 degrees apart.

The main advantages of the approaches based on matching local features are summarised below.

[1] Learning, i.e. the construction of internal models of known objects, is done automatically from images depicting the objects. No user intervention is required except for providing the training images.

[2] The local representation is based on appearance. There is no need to extract geometric primitives (e.g. lines), which are generally hard to detect reliably.

[3] Segmentation of objects from the background is not required prior to recognition, and yet objects are recognised against unknown backgrounds.

[4] Objects of interest are recognised even if partially occluded by other, unknown objects in the scene.


to these small misalignments. Such a descriptor might be based, e.g., on colour moments (integral statistics over the whole region) or on local histograms.

It follows that the major factors affecting a method's discriminative potential, and thus its ability to handle large object databases, are the repeatability and the localisation precision of the detector.

Indexing. During the learning of object models, descriptors of local appearance are stored in a database. In the recognition phase, descriptors are computed on the query image, and the database is searched for similar descriptors (potential matches). The database should be organised (indexed) in a way that allows efficient retrieval of similar descriptors. The character of a suitable indexing structure depends generally on the properties of the descriptors (e.g. their dimensionality) and on the distance measure used to decide which descriptors are similar (e.g. Euclidean distance). Generally, for optimal performance of the index (fast retrieval times), a combination of descriptor and distance measure should be sought that minimises the ratio of distances to correct and to false matches.

The choice of indexing scheme has a major effect on the speed of the recognition process, especially on how the speed scales to large object databases. Commonly, though, the database searches are done simply by sequential scan, i.e. without using any indexing structure.


Matching. When recognising objects in an unknown query image, local features are computed in the same form as for the database images. None, one, or possibly several tentative correspondences are then established for every feature detected in the query image. Searching the database, Euclidean or Mahalanobis distance is typically evaluated between the query feature and the features stored in the database, and the closest match, if close enough, is retrieved. These tentative correspondences are based purely on the similarity of the descriptors. A database object which exhibits a high (non-random) number of established correspondences is considered a candidate match.
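A minimal sketch of this matching step is given below; it assumes two NumPy arrays of descriptors and uses a plain Euclidean nearest-neighbour search with an absolute distance threshold (the threshold value is an arbitrary illustration, not a value taken from any of the surveyed methods).

```python
import numpy as np

def tentative_matches(query_desc, db_desc, max_dist=0.5):
    """Return (query_index, database_index) pairs of tentative correspondences.

    query_desc: (Nq, D) array of descriptors from the query image.
    db_desc:    (Nd, D) array of descriptors stored in the database.
    """
    matches = []
    for qi, q in enumerate(query_desc):
        dists = np.linalg.norm(db_desc - q, axis=1)   # Euclidean distances
        di = int(np.argmin(dists))                    # closest database feature
        if dists[di] < max_dist:                      # accept only if close enough
            matches.append((qi, di))
    return matches
```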

Verification. The similarity of descriptors, on its own, is not a reliable enough measure to guarantee that an established correspondence is correct. As a final step of the recognition process, the presence of the model in the query image is verified. A global transformation connecting the images is estimated in a robust way (e.g. using the RANSAC algorithm). Typically, the global transformation takes the form of an epipolar geometry constraint for general (but rigid) 3D objects, or of a homography for planar objects. More complex transformations can be derived for non-rigid or articulated (piecewise rigid) objects.
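For the planar case, the following sketch shows how such a robust verification could look using OpenCV's RANSAC-based homography estimation; the matched point arrays and the reprojection threshold are illustrative assumptions.

```python
import cv2
import numpy as np

def verify_homography(query_pts, model_pts, reproj_thresh=3.0):
    """Estimate a homography with RANSAC and return it with the inlier count.

    query_pts, model_pts: (N, 2) arrays of matched point coordinates (N >= 4).
    """
    q = np.asarray(query_pts, dtype=np.float32).reshape(-1, 1, 2)
    m = np.asarray(model_pts, dtype=np.float32).reshape(-1, 1, 2)
    H, inlier_mask = cv2.findHomography(m, q, cv2.RANSAC, reproj_thresh)
    inliers = int(inlier_mask.sum()) if inlier_mask is not None else 0
    # The number of correspondences consistent with H serves as the match score.
    return H, inliers
```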

As mentioned before, if a detector cannot recover certain parameters of the image transformation, the descriptor must be made invariant to them. It is preferable, though, to have a covariant detector rather than an invariant descriptor, as that allows for more powerful global consistency verification. If, for example, the detector does not provide the orientations of the image elements, rotational invariants have to be employed in the descriptor; in such a case, it is impossible to verify that all of the matched elements agree in their orientation.

Finally, tentative correspondences which are not consistent with the estimated global transformation are rejected, and only the remaining correspondences are used to compute the final score of the match.

In the following, the main contributions to the field of object recognition based on local correspondences are reviewed. The approaches follow the structure outlined above, but differ in the individual steps: in how the local features are obtained (detectors) and in what the features themselves are (descriptors).


5) RECOGNITION AS A CORRESPONDENCE OF LOCAL FEATURES - A SURVEY

    a) The Approach of David Lowe

David Lowe has developed an object recognition system with emphasis on efficiency, achieving real-time recognition times. Anchor points of interest are detected with invariance to scale, rotation and translation. Since local patches undergo more complicated transformations than similarities, a local-histogram-based descriptor is proposed which is robust to imprecisions in the alignment of the patches.

    Detector. The detection of regions of interest proceeds as follows:

[1] Detection of scale-space peaks. Circular regions with maximal response of the difference-of-Gaussians (DoG) filter are detected at all scales and image locations. An efficient implementation exploits the scale-space pyramid: the initial image is repeatedly convolved with a Gaussian filter to produce a set of scale-space images, and adjacent scale-space images are then subtracted to produce a set of DoG images. In these images, local minima and maxima (i.e. extrema of the DoG filter response) are detected, both in the spatial and the scale domains. The result of the first phase is thus a set of triplets (x, y, σ): image locations and their characteristic scales.

[2] Refinement of the locations of the detected points. The DoG responses are locally fitted with a 3D quadratic function, and the location and characteristic scale of the circular regions are determined with subpixel accuracy. The refinement is necessary,


Verification. The Hough transform is used to identify clusters of tentative correspondences with a consistent geometric transformation. Since the actual transformation is approximated by a similarity, the Hough accumulator is 4-dimensional and is partitioned into rather broad bins. Only clusters with at least 3 entries in a bin are considered further. Each such cluster is then subjected to a geometric verification procedure in which an iterative least-squares fit is used to find the best affine projection relating the query and database images.
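The sketch below illustrates the spirit of this pose-clustering step (it is not Lowe's implementation): each tentative correspondence between keypoints carrying position, scale and orientation votes for a coarsely binned similarity transform, and bins with at least three votes are kept as candidate clusters. The keypoint tuple layout and the bin widths are assumptions made for the example.

```python
import math
from collections import defaultdict

def pose_clusters(matches, rot_bin=math.radians(30), scale_bin=2.0, loc_bin=64.0):
    """Cluster tentative correspondences by a coarse similarity transform (Hough voting).

    matches: list of pairs (query_kp, model_kp), each keypoint being a tuple
             (x, y, scale, orientation_in_radians).
    Returns the clusters (lists of matches) that received at least 3 votes.
    """
    accumulator = defaultdict(list)
    for q, m in matches:
        d_theta = (q[3] - m[3]) % (2 * math.pi)        # relative rotation
        d_scale = q[2] / m[2]                          # relative scale
        dx, dy = q[0] - m[0], q[1] - m[1]              # coarse translation
        key = (int(d_theta / rot_bin),
               int(math.log(d_scale, scale_bin)),
               int(dx / loc_bin), int(dy / loc_bin))
        accumulator[key].append((q, m))
    return [votes for votes in accumulator.values() if len(votes) >= 3]
```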

    b) The Approach of Mikolajczyk & Schmid

The approach of Schmid et al. is described in [44, 28, 56, 54, 53, 55, 27, 10]. Based on an affine generalisation of the Harris corner detector, anchor points are detected and described by Gaussian derivatives of image intensities computed over shape-adapted elliptical neighbourhoods.

Detector. In their work, Mikolajczyk and Schmid implement an affine-adapted Harris point detector. Since the three-parametric affine Gaussian scale space is too complex to be practically useful, they propose a solution which iteratively searches for an affine shape adaptation in the neighbourhoods of points detected in the uniform scale space. For initialisation, approximate locations and scales of interest points are extracted by the standard multi-scale Harris detector. These points are not affine invariant because of the uniform Gaussian kernel used. Given the initial approximate solution, the algorithm iteratively modifies the shape, the scale and the spatial location of the neighbourhood of each point, and converges to affine-invariant interest points. For more details see [28].

Descriptors and Matching. The descriptors are composed of Gaussian derivatives computed over the shape-normalised regions. Invariance to rotation is obtained by steering the derivatives in the direction of the gradient. Using derivatives up to 4th order, the descriptors are 12-dimensional. The similarity of descriptors is, in a first approximation, measured by the Mahalanobis distance. Promising close matches are then confirmed or rejected by a cross-correlation measure computed over normalised neighbourhood windows.
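As a side note, the Mahalanobis distance used here (and by several of the other methods below) can be sketched as follows; in practice the covariance matrix would be estimated from a training set of descriptors, which is an assumption of this example.

```python
import numpy as np

def mahalanobis(d1, d2, cov):
    """Mahalanobis distance between two descriptors given a covariance matrix."""
    diff = np.asarray(d1, dtype=float) - np.asarray(d2, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Illustrative use: covariance estimated from a (hypothetical) set of 12-dimensional descriptors.
training = np.random.rand(1000, 12)            # stand-in for real training descriptors
cov = np.cov(training, rowvar=False)
print(mahalanobis(training[0], training[1], cov))
```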

Verification. Once the point-to-point correspondences are obtained, a robust estimate of the geometric transformation between the two images is computed using the RANSAC algorithm. The transformation used is either a homography or a fundamental matrix.

Recently, Dorko and Schmid [10] extended the approach towards object categorisation. Local image patches are detected and described in the same way as above. Patches from several examples of objects of a given category (e.g. cars) are collected together, and a classifier is trained to distinguish them from patches of different categories and from background patches.

    c) The Approach of Tuytelaars, Ferrari & van Gool

  • 8/11/2019 Object Recognition Techniques (2)

    60/224

    60

Luc van Gool and his collaborators developed an approach based on the matching of local image features [73, 75, 11, 72, 71, 74, 69]. They start with the detection of elliptical or parallelogram image regions. The regions are described by a vector of photometrically invariant generalised colour moments, and matching is typically verified by the epipolar geometry constraint.

Detector. Two methods for the extraction of affinely invariant regions are proposed, yielding geometry- and intensity-based regions. The regions are affine covariant: they adapt their shape to the underlying intensity profile in order to keep representing the same physical part of an object. Apart from the geometric invariance, photometric invariance allows for independent scaling and offsets in each of the three colour channels. The region extraction always starts by detecting stable anchor points. The anchor points are either Harris points [13] or local extrema of the image intensity. Although the detection of Harris points is not truly affine invariant, since the support set over which the response is computed is circular, the points are still fairly stable under viewpoint changes and can be localised precisely (even to subpixel accuracy). Intensity extrema, on the other hand, are invariant to any continuous geometric transformation and to any monotonic transformation of the intensity, but are not localised as accurately. On colour images, the detection is performed three times, separately on each of the colour bands.

Descriptors and Matching. In the case of geometry-based regions, each region is described by a vector of 18 generalised colour moments [29], invariant to photometric transformations. For the intensity-based regions, 9 rotation-invariant generalised colour moments are used. The similarity between the descriptors is given by the Mahalanobis distance, and correspondences between two images are formed from regions whose distance is mutually smallest. Once corresponding regions have been found, the cross-correlation between them is computed as a final check before accepting the match. In the case of the intensity-based regions, where the rotation is unknown, the cross-correlation is maximised over all rotations. Good matches are further fine-tuned by non-linear optimisation: the cross-correlation is maximised over small deviations of the transformation parameters.
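Generalised colour moments combine spatial powers with powers of the colour channels over a region. The sketch below computes moments of the form M = Σ x^p y^q R^a G^b B^c over the pixels of a region mask; it is a simplified, unnormalised illustration of the idea in [29], with the particular exponent set chosen arbitrarily.

```python
import numpy as np

def generalised_colour_moment(image_rgb, mask, p, q, a, b, c):
    """Sum of x^p * y^q * R^a * G^b * B^c over the masked region (unnormalised)."""
    ys, xs = np.nonzero(mask)                       # pixel coordinates inside the region
    r, g, bch = (image_rgb[ys, xs, i].astype(float) / 255.0 for i in range(3))
    return float(np.sum(xs**p * ys**q * r**a * g**b * bch**c))

def moment_descriptor(image_rgb, mask):
    # A small, arbitrary selection of exponents, just to illustrate building a vector.
    exponents = [(0, 0, 1, 0, 0), (0, 0, 0, 1, 0), (0, 0, 0, 0, 1),
                 (1, 0, 1, 0, 0), (0, 1, 0, 1, 0), (1, 1, 0, 0, 1)]
    return np.array([generalised_colour_moment(image_rgb, mask, *e) for e in exponents])
```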

Verification. The set of tentative correspondences is pruned by both geometric and photometric constraints. The geometric constraint essentially rejects correspondences contradicting the epipolar geometry. The photometric constraint assumes that there is always a group of corresponding regions that undergo the same transformation of intensities; correspondences with a singular photometric transformation are rejected. Recently, a growing flexible homography approach was presented which allows for accurate model alignment even for non-rigid objects. The size of the aligned area is then used as a measure of the match quality.

d) The LAF Approach of Matas

The approach of Matas et al. [25, 37, 26, 36] starts with the detection of Maximally Stable Extremal Regions. Affine covariant local coordinate systems (called Local Affine Frames, LAFs) are then established, and measurements taken relative to them describe the regions.


    Figure 3: Examples of correspondences established between frames of a database image (left)

    and a query image (right).

Detector. The Maximally Stable Extremal Regions (MSERs) were introduced in [25]. The attractive properties of MSERs are: 1. invariance to affine transformations of image coordinates; 2. invariance to monotonic transformations of intensity; 3. computational complexity almost linear in the number of pixels, and consequently near real-time run time; and 4. since no smoothing is involved, both very fine and very coarse image structures are detected. Starting from the contours of a detected region, local frames (coordinate systems) are constructed in several affine covariant ways, exploiting the affine covariant properties of the covariance matrix, bi-tangent lines, and line parallelism. As demonstrated in Figure 3, local affine frames facilitate the normalisation of image patches into a canonical frame and enable direct comparison of photometrically normalised intensity values, eliminating the need for invariants.
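OpenCV ships an MSER detector that can reproduce the first step of this pipeline (the local affine frame construction itself is not part of OpenCV and is omitted here); the sketch below simply detects the extremal regions on a greyscale image, with the file name as a placeholder.

```python
import cv2

# Detect Maximally Stable Extremal Regions on a greyscale image.
image = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)   # placeholder file name
mser = cv2.MSER_create()
regions, bounding_boxes = mser.detectRegions(image)
print("detected", len(regions), "extremal regions")
```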

Descriptor. Three different descriptors were used. The first is directly the intensities of the local patches: the intensities are discretised into 15 × 15 × 3 rasters, yielding 675-dimensional descriptors. This size is discriminative enough to distinguish between a large number of database objects, yet coarse enough to tolerate moderate misalignments in the frame localisation. The second type of descriptor employs the discrete cosine transformation (DCT), which is applied to the discretised patches [38]; the number of low-frequency DCT coefficients kept in the database is used to trade descriptor discriminability against localisation tolerance. Finally, rotational invariants were used.
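A rough sketch of the second descriptor type: a geometrically and photometrically normalised patch is resampled to a fixed raster and only a block of low-frequency DCT coefficients is kept. The raster is set to 16 × 16 here because OpenCV's DCT requires even array sizes, and the number of retained coefficients (6 × 6 per channel) is an arbitrary choice for the example.

```python
import cv2
import numpy as np

def dct_descriptor(patch_bgr, raster=16, keep=6):
    """Low-frequency DCT coefficients of a normalised colour patch."""
    small = cv2.resize(patch_bgr, (raster, raster)).astype(np.float32)
    coeffs = []
    for ch in range(3):
        dct = cv2.dct(small[:, :, ch])                # 2D DCT of one colour channel
        coeffs.append(dct[:keep, :keep].ravel())      # keep the low-frequency block
    return np.concatenate(coeffs)
```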

Verification. In wide-baseline stereo problems, the correspondences are verified by robustly selecting only those conforming to the epipolar geometry constraint. For object recognition it is typically sufficient to approximate the global geometric transformation by a homography, with a flexible tolerance increasing towards the object boundaries.

    e) The Approach of Zisserman

A. Zisserman and his collaborators developed strategies for the matching of local features mainly in the context of the wide-baseline stereo problem [43, 42, 48, 45, 46]. Recently they presented an interesting line of work relating the image retrieval problem to text retrieval [63, 47, 49], introducing an image retrieval system, called Video Google, which is capable of processing and indexing full-length movies.

Detectors and Descriptors. Two types of detectors of local image elements are employed. One is the shape-adapted elliptical regions by Mikolajczyk and Schmid, as described in


entation of the objects, and to occlusion. Ohba and Ikeuchi, and Jugessur and Dudek, propose appearance-based object recognition methods robust to variations in the background and to occlusion of a substantial fraction of the image.

In order to apply the eigenspace analysis to the recognition of partially occluded objects, they propose to divide the object appearance into small windows, referred to as eigen windows, and to apply eigenspace analysis to them. As in other approaches exploiting local appearance, even if some of the windows are occluded, the remaining ones are still effective and can recover the object identity and pose.

In addition to robustness to occlusions, Jugessur and Dudek [16] also address the problem of rotation invariance. The proposed solution is to compute the PCA not on the intensity patches directly, but in the frequency domain of the windows represented in polar coordinates.
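The eigen-window idea can be sketched with a plain PCA over vectorised local windows; the window dimensionality and the number of retained components below are arbitrary, and a real system would also store the per-window projections for pose recovery.

```python
import numpy as np

def eigen_windows(windows, n_components=20):
    """PCA over vectorised local windows (rows of `windows`, shape (N, w*w)).

    Returns the mean window and the top principal directions (the eigen windows).
    """
    X = np.asarray(windows, dtype=float)
    mean = X.mean(axis=0)
    # SVD of the centred data gives the principal directions in the rows of Vt.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def project(window, mean, components):
    # Coefficients of one window in the eigen-window basis.
    return components @ (np.asarray(window, dtype=float) - mean)
```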

    g) The Approach of Selinger & Nelson

The object recognition system developed by Nelson and Selinger at the University of Rochester exploits a four-level hierarchy of grouping processes [35, 59, 61, 58, 57, 60]. The system architecture is similar to other local feature-based approaches, though a different terminology is used. Inspired by the Gestalt laws and perceptual grouping principles, a four-level grouping hierarchy is built, where higher levels contain groups of elements from lower levels.

The hierarchy is constructed as follows. At the fourth, highest level, a 3D object is represented as a topologically structured set of flexible 2D views; the geometric relations between the views are stored here. This level is used for geometric reasoning, but not for recognition. Recognition takes place at the third level, the level of the component views. In these views the visual appearance of an object, derived from a training image, is represented as a loosely structured combination of a number of local context regions. The local context regions (local features) are represented at the second level; they can be thought of as local image patches surrounding first-level features. At the first level are the features (detected image elements) that result from grouping processes run on the image, typically representing connected contour fragments or locally homogeneous regions. Only


Figure 4: Examples of corresponding query (left columns) and database (right columns) images from the ZuBuD dataset. The image pairs exhibit occlusion, varying illumination, and viewpoint and orientation changes.

Efficient recognition is achieved by using a database implemented as an associative memory of keyed context patches. An unknown keyed context patch recalls associated hypotheses for all known views of objects that could have produced such a context patch. These hypotheses are processed by a second associative memory, indexed by the view parameters, which partitions the hypotheses into clusters that are mutually consistent within a loose geometric framework (these clusters are the third-level groups). The looseness is obtained by tolerating a specified deviation in position, size, and orientation; the bounds are set to be consistent with a given distance between training views (e.g. approximately 20 degrees). The output of the recognition stage is a set of third-level groupings that represent hypotheses of the identity and pose of objects in the scene, ranked by the total evidence for each hypothesis.

    h) APPLICATIONS

Approaches matching local features have been experimentally shown to obtain state-of-the-art results. Here we present a few examples of the addressed problems. Results are demonstrated using the approach of Matas et al. [37, 36, 38], although comparable results have been shown by others.


Figure 5: Image retrieval on the FOCUS dataset: query images, database images, and query localisation results.

Object Recognition. In object recognition experiments, the Columbia Object Image Library (COIL-100) [1], or more often its subset COIL-20, has been widely used, and for comparison purposes it has become a de facto standard benchmark dataset.

Figure 6: An example of matches established on a wide-baseline stereo pair.

COIL-100 is a set of colour images of 100 different objects, where 72 images of each object were taken at pose intervals of 5 degrees. The objects are unoccluded and on an uncluttered black background. Such a configuration is benign for appearance-based methods. Table 1 compares the recognition rates achieved by the LAF approach with the rates of several


appearance-based object recognition methods. Results are presented for five experimental set-ups differing in the number of training views per object. Decreasing the number of training views increases the demands on a method's generalisation ability and on its insensitivity to image deformations. The LAF approach performs best in all experiments, regardless of the number of training views. With only four training views the recognition rate is almost 95%, demonstrating a remarkable robustness to local affine distortions.

Image retrieval. The retrieval performance of the LAF method was evaluated on the FOCUS dataset, containing 360 high-resolution colour images of advertisements scanned from magazines. The task was to retrieve the adverts for a given product, given a query image of the product logo. Examples of query logos, retrieved images, and visualised localisations of the logos are shown in Figure 5.

Another challenging retrieval problem involved the recognition of buildings in urban scenes. Given an image of an unknown building, taken from an unknown viewpoint, the algorithm had to identify the building. The experiments were conducted on a set of images of 201 different buildings. The dataset was provided by ETH Zurich and is publicly available [62]; it contains five photographs of each of the 201 buildings, and a separate set of 115 query images is provided. Examples of corresponding query and database images are shown in Figure 4. The LAF method achieved a 100% recognition rate at rank 1.

Video retrieval. The problem of retrieving video frames from full-length movies was addressed in [63]. Local descriptors were computed on key frames and stored in a database. To reduce the otherwise enormous database size, the descriptors were clustered according to their similarity. Impressive real-time retrieval was achieved for a closed system, i.e. for the case of query images originating from the movie itself.
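The clustering step corresponds to building a visual vocabulary. A minimal sketch using k-means from SciPy is shown below; the vocabulary size is an arbitrary choice, and the descriptor extraction itself is assumed to have happened already.

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

def build_vocabulary(descriptors, n_words=1000):
    """Cluster local descriptors into a vocabulary of visual words (cluster centres)."""
    data = np.asarray(descriptors, dtype=float)
    centroids, _ = kmeans2(data, n_words, minit="++")
    return centroids

def quantise(descriptors, centroids):
    """Map each descriptor to the index of its nearest visual word."""
    words, _ = vq(np.asarray(descriptors, dtype=float), centroids)
    return words
```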

Wide-baseline stereo matching. For a significant variety of scenes, the epipolar geometry can be computed automatically from two (or possibly more) uncalibrated images showing the scene from significantly different viewpoints. The role of matching in the wide-baseline stereo problem is to provide corresponding points, i.e. points which in the two images represent the same element of the 3D scene. Correspondences found in a difficult stereo pair are shown in Figure 6.
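Complementing the homography sketch given earlier, the epipolar constraint for such uncalibrated pairs can be estimated robustly with OpenCV as follows; the matched point arrays and the thresholds are again placeholders.

```python
import cv2
import numpy as np

def epipolar_inliers(pts_left, pts_right, ransac_thresh=1.0):
    """Robustly estimate the fundamental matrix and count the consistent matches."""
    pl = np.asarray(pts_left, dtype=np.float32)
    pr = np.asarray(pts_right, dtype=np.float32)
    F, mask = cv2.findFundamentalMat(pl, pr, cv2.FM_RANSAC, ransac_thresh, 0.99)
    return F, (int(mask.sum()) if mask is not None else 0)
```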


1. Vision is a difficult problem consisting of many building blocks that can be characterised in isolation. Eye movements are one such building block.

2. Since visual sensitivity is highest in the fovea, eye movements are in general needed for recognising small stimuli.

3. During a fixation, a number of things happen concurrently: the visual information around the fixation point is analysed, and visual information away from the current fixation is analysed to help select the next saccade target.

The exact processes involved are still largely unknown. Findlay and Gilchrist [33] also pose a number of questions in order to demonstrate that numerous basic problems in vision remain open for research:

    1. What visual information determines the target of the next eye movement?

    2. What visual information determines when eyes move?

    3. What information is combined across eye movements to form a stable representation of the

    environment?

As discussed earlier [29], a brute-force approach to object localisation subject to a cost constraint is often intractable as the search space size increases. Furthermore, the human brain would have to be some hundreds of thousands of times larger than it currently is if visual sensitivity across the whole visual field were the same as that in the fovea [29]. Thus, active and attentive approaches to the problem are usually proposed as a means of addressing these constraints.

We will show in this section that, within the context of the general framework for object recognition illustrated in Fig. 1, previous work on active object recognition systems has demonstrated that active vision systems can lead to significant improvements in both the learning and the inference phases of object recognition. This includes improvements in the robustness of all components of the feature-extraction, feature-grouping, object-hypothesis, object-verification, object-recognition pipeline.

Some of the problems inherent in single-view object recognition include [266]:

1. The im