CVPR 2007

Book of Abstracts


CVPR Short Course descriptions

Short Course: Visual Search and Its Applications on Web

Lecturer: Burak Gokturk

Search has proven to be a powerful tool and is among the most heavily used applications on the web. Visual search, on the other hand, is a recently developing application of computer vision and machine learning. Several applications have already shown new ways of traversing and searching the web via visual search. This tutorial describes multiple concepts of visual search, including visual-to-visual and visual-to-text search. Visual-to-visual search uses images to initiate a search and retrieves images similar to the query. Traditionally, the image search problem has been addressed in the content-based image retrieval literature by extracting and matching visual feature sets for pairs of images. However, this approach is insufficient in some cases, such as face, text and object recognition, where more sophisticated object-based modeling is necessary. In that case, more complex processing, including steps such as object detection, alignment/registration and segmentation, needs to be carried out prior to feature extraction. Visual-to-text search can be thought of as a reverse mapping of visual features in order to enrich text indexing on the web. The combination of visual and text search has also been introduced in several applications and will be covered during this lecture. The outline of the lecture is as follows (a toy retrieval sketch appears after the outline):

A. Introduction

• What is Visual Search
• Paradigms of Visual Search (Object-Based and Image-Retrieval-Based)
• Examples of Visual Search on the Web

B. Steps of Visual Search

• Detection
• Preprocessing
• Segmentation
• Feature Extraction (FE)
• Recognition/Similarity

C. Important Issues with Visual Search

• Registration
• Learning in High Dimensional Space
• Importance of Context
• Combining features and detectors
• Local features
• Scalability


D. Combining Text and Visual Search

• Text-to-Visual Search
• Visual-to-Text Search
• Using Words and Pictures Together
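As a concrete illustration of the image-retrieval-based paradigm described above, here is a minimal sketch (an illustrative toy, not the course's code): images are reduced to intensity-histogram feature vectors and the gallery is ranked by cosine similarity to the query. The feature choice and all names are assumptions for the example.

```python
# Toy visual-to-visual search: represent images by normalized intensity
# histograms and rank gallery images by cosine similarity to the query.
import numpy as np

def histogram_feature(image: np.ndarray, bins: int = 32) -> np.ndarray:
    """Flatten an image into a normalized intensity histogram."""
    hist, _ = np.histogram(image, bins=bins, range=(0, 255))
    hist = hist.astype(float)
    return hist / (hist.sum() + 1e-9)

def search(query: np.ndarray, gallery: list, top_k: int = 5) -> np.ndarray:
    """Return indices of the top_k gallery images most similar to the query."""
    q = histogram_feature(query)
    scores = []
    for img in gallery:
        g = histogram_feature(img)
        cos = float(q @ g) / (np.linalg.norm(q) * np.linalg.norm(g) + 1e-9)
        scores.append(cos)
    return np.argsort(scores)[::-1][:top_k]

# Usage with random stand-in "images":
rng = np.random.default_rng(0)
gallery = [rng.integers(0, 256, size=(64, 64)) for _ in range(20)]
print(search(gallery[3], gallery, top_k=3))  # index 3 should rank first
```

Object-based visual search, as described above, would precede this feature step with detection, registration and segmentation of the object of interest.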

Short Course: Tensor Voting – A Perceptual Organization Approach to Computer Vision and Machine Learning

Lecturer: P. Mordohai

Tensor voting is a perceptual organization approach based on the Gestalt principles of proximity and good continuation. It is based on data representation by second-order, symmetric, non-negative definite tensors and on information propagation in the form of votes among pairs of neighboring points. The votes are also second-order tensors and convey the voter's support for the presence of a structure that passes through the voter and the receiver. The analysis of the accumulated votes at each point provides estimates of its saliency and orientation as part of a structure such as a surface, curve or junction. Under this approach, many problems can be formulated as the perceptual organization of primitives, and the solutions can be found by detecting the most salient structures formed by the primitives. For instance, in the case of stereo vision, the primitives are potential pixel correspondences reconstructed in 3D. The true scene surfaces can be inferred based on the assumption that correct correspondences support each other and form salient surfaces, while erroneous correspondences are scattered and do not form salient structures. Our approach is data-driven and local, does not include any global computations, and requires a minimal number of assumptions. The course will have little overlap with a similar course presented during CVPR 2003: there have been major improvements at the theoretical level, as well as exciting new applications in core computer vision and machine learning problems. It will not be presented as a historical recounting of the work, and the sequence of the presentation is not chronological. The emphasis will be on presenting a unified approach to a wide range of problems in computer vision and machine learning that may initially seem heterogeneous. After an overview of the theory and the introduction of recent enhancements to the framework, the most important among them being an N-D implementation, we will discuss our algorithms for addressing problems in computer vision including figure completion, binocular stereo, multiple-view stereo, motion analysis, epipolar geometry estimation and texture synthesis, as well as new results we obtained in the field of instance-based learning, including dimensionality estimation, geodesic distance estimation and function approximation.
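For orientation, the eigen-decomposition at the core of the vote analysis can be written down explicitly (a standard summary of the tensor voting formalism, not the course notes). In 3D, the tensor accumulated at a point decomposes as

```latex
T \;=\; \lambda_1 \mathbf{e}_1\mathbf{e}_1^{\top} + \lambda_2 \mathbf{e}_2\mathbf{e}_2^{\top} + \lambda_3 \mathbf{e}_3\mathbf{e}_3^{\top}, \qquad \lambda_1 \ge \lambda_2 \ge \lambda_3 \ge 0,
```

where the spectral gap λ1 − λ2 measures surface saliency (with surface normal e1), λ2 − λ3 measures curve saliency, and λ3 measures junction saliency; selecting the locally most salient interpretation yields the surfaces, curves and junctions mentioned above.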

Short Course: Novel Biometrics

Lecturer: I. Pavlidis

a) Physiology-Based Face Recognition

A1) Introduction
A2) Facial physiology
A3) Sensing modalities
A4) Structural and Functional Feature Extraction
A5) Pattern Matching and Recognition


b) Psycho-Physiology: Biometrics of Hostile Intent

B1) Introduction
B2) Intent assessment in context
B3) Sensing modalities
B4) Methodologies
B5) Experimental Analysis

Short Course: Generalized PCA

Lecturers: Yi Ma, R. Vidal

Over the past two decades, we have seen tremendous advances in the simultaneous segmentation and estimation of a collection of models from sample data points, without knowing which points correspond to which model. Most existing segmentation methods treat this as a "chicken-and-egg" problem and iterate between model estimation and data segmentation.

This course will show that for a wide variety of data segmentation problems (e.g. mixtures of subspaces), the "chicken-and-egg" dilemma can be tackled using an algebraic geometric technique called Generalized Principal Component Analysis (GPCA). This technique is a natural extension of classical PCA from one to multiple subspaces.
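The algebraic core of the technique can be stated in one line (the standard GPCA identity, quoted here for context): a union of n hyperplanes with normal vectors b_i is exactly the zero set of a single homogeneous polynomial that is linear in its coefficient vector c,

```latex
p_n(\mathbf{x}) \;=\; \prod_{i=1}^{n} \big(\mathbf{b}_i^{\top}\mathbf{x}\big) \;=\; \mathbf{c}^{\top}\,\nu_n(\mathbf{x}) \;=\; 0,
```

where ν_n is the Veronese map of degree n (the vector of all degree-n monomials). Hence c can be estimated by solving a linear system on the embedded data, and individual subspace normals are recovered from the gradient of p_n evaluated at points on each subspace; no iteration between estimation and segmentation is needed.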

The course will also include several applications of GPCA to computer vision problems such as image/video segmentation, 3-D motion segmentation, and dynamic texture segmentation.

List of topics

I Introduction to Generalized Principal Component Analysis

II Basic GPCA Theory and Algorithms

a. Review of Principal Component Analysis (PCA)
b. Introductory Cases: Line, Plane and Hyperplane Segmentation
c. Segmentation with Known Number of Subspaces
d. Segmentation with Unknown Number of Subspaces

III Advanced Statistical and Algebraic Methods for GPCA

a. Model Selection for Subspace Arrangements
b. Robust Sampling Techniques for Subspace Segmentation
c. Voting Techniques for Subspace Segmentation

IV Applications to Motion and Video Segmentation

a. 2-D and 3-D Motion Segmentation
b. Temporal Video Segmentation
c. Segmentation of Dynamic Textures


V Applications to Image Representation and Segmentation

a. Multi-Scale Hybrid Linear Models for Sparse Image Representation
b. Hybrid Linear Models for Image Segmentation

Short Course: Numerical Geometry of Non-Rigid Shapes

Lecturers: A. Bronstein, M. Bronstein, R. Kimmel

The short course deals with modern methods for the analysis of non-rigid objects, an important emerging field bringing together different disciplines of mathematics and computer science, such as differential and metric geometry, numerical analysis, optimization, computer graphics, machine learning, computer vision and computational geometry. The short course objective is to give theoretical and numerical tools for the analysis and comparison of surfaces from the perspective of recent advances in the field. The first part of the short course will include a brief introduction to topology, metric and Riemannian geometry, as well as numerical geometry, numerical analysis and state-of-the-art tools in numerical optimization. The second part is dedicated to the representation of the intrinsic geometry of surfaces. We will discuss the notion of isometric embedding, distinguish between local and global isometries, and study different numerical methods for the analysis and comparison of non-rigid surfaces. A major emphasis will be placed on multidimensional scaling. We will introduce an axiomatic construction of isometry-invariant distances and discuss their different aspects. The last part of the short course is dedicated to applications. We will see how many important problems can be addressed within the framework of non-rigid surface matching. We will demonstrate an expression-invariant three-dimensional face recognition system based on an intrinsic geometric representation of faces. The short course is an abbreviated version of the Advanced Topics graduate course "Analysis of non-rigid surfaces" (236611) taught by us in the spring semester of 2006 at the Department of Computer Science, Technion - Israel Institute of Technology.
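Since multidimensional scaling is the course's central computational tool, here is a minimal classical-MDS sketch (an illustration of the standard double-centering formulation, not the lecturers' code): given a matrix of pairwise distances (e.g. geodesic distances on a surface), it recovers a Euclidean embedding whose distances approximate them.

```python
# Classical multidimensional scaling (MDS): embed points in R^k so that
# Euclidean distances approximate a given pairwise distance matrix D.
import numpy as np

def classical_mds(D: np.ndarray, k: int = 2) -> np.ndarray:
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                   # double-centered Gram matrix
    w, V = np.linalg.eigh(B)                      # eigenvalues, ascending
    idx = np.argsort(w)[::-1][:k]                 # top-k eigenpairs
    scale = np.sqrt(np.maximum(w[idx], 0.0))
    return V[:, idx] * scale                      # n x k embedding

# Usage: the distances of a unit square are reproduced up to a rotation.
pts = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], float)
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
X = classical_mds(D, k=2)
print(np.round(np.linalg.norm(X[:, None] - X[None, :], axis=-1), 3))
```

In the isometric-embedding setting discussed in the course, D would hold geodesic rather than Euclidean distances, so that the embedding is invariant to non-rigid bending of the surface.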

Short Course: Recognizing and Learning Object Categories: Year 2007

Lecturers: L. Fei-Fei, R. Fergus, A. Torralba

1. Introduction:

• define the problem of object categorization (OC)
• brief history
• invariance issues in OC
• representation
• learning
• recognition


2. Bag of words models:

• model representation
• learning
• recognition
• demo
• related work

3. Part-based models:

• model representation
• learning
• recognition
• demo
• related work

4. Discriminative models:

• model representation
• learning
• recognition
• demo
• related work

5. Objects and their contexts:

• segmentation-based recognition
• context-facilitated recognition
• recognition and geometry

Short Course: Distributed Vision Processing in Smart Camera Networks

Lecturers: H. Aghajan, W. Wolf, H. Bishop, B. Rinner, F. Berry, R. Kleihorst

Distributed smart cameras combine techniques from computer vision, distributed processing, and embedded computing. Technological advances in the design of sensors and processors have facilitated the development of efficient embedded vision-based techniques. Distributed algorithms can provide more confident deductions about the events of interest, or reduce ambiguities in a view caused by occlusion or other factors. Because they operate in real time, a variety of smart environment applications can be enabled by the development of efficient architectures and algorithms for distributed vision networks. Building upon the premise of distributed vision-based sensing and processing, ambient intelligence can be conceived as electronic environments that are aware of and responsive to the presence of people. Most vision-based application development efforts have focused on monitoring scenes and persons. Distributed image sensing networks not only enhance the performance and reliability of such applications, they also enable novel ambient intelligence application areas in which the network provides useful information services to users by monitoring the events and context they are involved in. This provides a lively field for vision-based research, pushing technology and relevant applications in smart homes, offices and factories, as well as in entertainment and gaming domains.


Short Course: Feature Extraction and Classification

Lecturer: A. Martinez

Description: Fundamentals and research directions of feature extraction algorithms. Feature extraction techniques are essential in many applications in science and engineering, including computer vision. The course will start with a review of unsupervised techniques and will follow with a description of the most significant supervised methods defined to date. The effects of data noise and the use of linear methods with kernels and non-linear algorithms will be sketched. (A brief unsupervised-versus-supervised illustration follows the topic list.)

List of topics:

• Review of unsupervised feature extraction methods (PCA, MDS, ICA, FA, NMF, etc.)
• Supervised feature extraction: discriminant analysis (DA)
• The use of metrics in DA
• Alternative criteria
• Effects of data noise in feature extraction
• Selecting kernels in feature extraction
• Nonlinear methods
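To make the unsupervised/supervised distinction concrete, a minimal sketch (the dataset and scikit-learn usage are illustrative assumptions, not course material): PCA finds directions of maximum variance while ignoring labels, whereas discriminant analysis uses labels to find a class-separating direction.

```python
# PCA (unsupervised) vs. linear discriminant analysis (supervised) on the
# same data: PCA ignores class labels, LDA uses them.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Two elongated Gaussian classes whose means differ along the LOW-variance axis.
cov = np.array([[10.0, 0.0], [0.0, 0.1]])
X0 = rng.multivariate_normal([0, 0], cov, size=200)
X1 = rng.multivariate_normal([0, 1], cov, size=200)
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

Z_pca = PCA(n_components=1).fit_transform(X)                   # ignores y
Z_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)

# PCA picks the high-variance but uninformative axis; LDA picks the
# class-separating one, so class means differ markedly only under LDA.
print("PCA class-mean gap:", abs(Z_pca[y == 0].mean() - Z_pca[y == 1].mean()))
print("LDA class-mean gap:", abs(Z_lda[y == 0].mean() - Z_lda[y == 1].mean()))
```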

Short Course: Fundamentals of Linking Discrete and Continuous Approaches to Computer Vision - A Topological View

Lecturer: Leo Grady

Description: Continuous and combinatorial methods of image analysis have developed in computer vision largely along separate lines of research. The primary difference between a combinatorial and a continuous algorithm for computer vision is whether an image is treated as a large set of regular samples approximating a continuum domain or as a set of discrete objects. Although the former approach leads to mathematical exposition and analysis in terms of partial differential equations, and the latter to graph-theoretic methods, both toolsets may be derived from the common language of topology. In the presenter's experience with members of the computer vision field, there seems to be a misconception that combinatorial and continuum mechanics are wholly separate from each other, and that there can therefore be no meaningful cross-pollination of ideas between continuum and combinatorial algorithms. The primary goal of this short course is to clarify this confusion and rebuild a common framework from first principles that accommodates both approaches. The short course will start from the fundamentals of topology (specifically algebraic topology) and derive both the combinatorial and continuous frameworks in a common setting. The goal is to give attendees working with disparate mathematical tools the ability to translate continuum methods to a combinatorial formulation and vice versa, in order to facilitate the exchange of ideas. The latter part of the course will take practical examples from the computer vision literature and show how they fit into this common setting. A running theme of the short course is that neither continuum nor combinatorial methods have any inherent primacy in computer vision, and that both mathematical (and physical) traditions are rich enough to support most concepts; therefore, more justification should be given for why a new idea is better expressed in a continuum or a combinatorial language. Most of the examples given throughout the short course will be in the area of image segmentation, since this is the presenter's particular area of focus.
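One standard instance of this translation (a textbook correspondence, quoted for context rather than taken from the course materials): with A the edge-node incidence matrix of a graph and C the diagonal matrix of edge weights, the combinatorial Laplacian has the same algebraic structure as the continuous diffusion operator,

```latex
\nabla \cdot \big( c\, \nabla u \big) \quad\longleftrightarrow\quad L\,u \;=\; A^{\top} C\, A\, u,
```

so an energy written with gradients and divergences on a continuum domain translates directly into one written with incidence matrices on a graph, and vice versa.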


Abstracts: IEEE International Workshop on Projector-Camera Systems (PROCAMS 2007)

Poster/Demo Session

Inter-Reflection Compensation for Immersive Projection Display

Hitoshi Habe, Nobuo Saeki, Takashi Matsuyama

This paper proposes an effective method for compensating inter-reflection in immersive projection displays (IPDs). Because IPDs project images onto a screen that surrounds the viewer, both geometric and photometric corrections have to be performed. Our method compensates inter-reflection on the screen. It requires no special device, and it approximates both diffuse and specular reflections on the screen using block-based photometric calibration.
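A generic way to frame such compensation (my sketch of the standard linear light-transport view, not the authors' algorithm): if block-based calibration yields a matrix F whose entry F[i, j] says how much of block j's projected radiance reaches block i via inter-reflection, then the desired appearance d is obtained by solving (I + F) p = d for the projector input p.

```python
# Inter-reflection compensation as linear light-transport inversion (sketch).
# Observed block radiance = direct term + inter-reflected term:  o = p + F p.
# To make the observed image equal a desired image d, solve (I + F) p = d.
import numpy as np

rng = np.random.default_rng(1)
n_blocks = 16
# Hypothetical calibrated inter-reflection matrix (small off-diagonal energy).
F = 0.05 * rng.random((n_blocks, n_blocks))
np.fill_diagonal(F, 0.0)

d = rng.random(n_blocks)                       # desired block radiances
p = np.linalg.solve(np.eye(n_blocks) + F, d)   # compensated projector input
o = p + F @ p                                  # what the viewer would observe
print(np.allclose(o, d))                       # True: inter-reflection cancelled
```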

Analysis of Light Transport based on the Separation of Direct and Indirect Components

Osamu Nasu, Shinsaku Hiura, Kosuke Sato

Cordless portable multi-view fringe projection system for 3D reconstruction

C. Munkelt, I. Schmidt, C. Bräuer-Burchardt, P. Kühmstedt and G. Notni

A newly devised lightweight sensor head is presented, combining a digital LED projector and two cameras in a stereo arrangement that gives access even to complicated object details. It uses a pre-calibrated, epipolar-constrained, phase-correlation-based fringe projection approach. The mobile unit is battery powered, and data transfer is done via WLAN to enable flexible use in complex measurement situations. Multi-view measurement is realized using the phasogrammetric approach with virtual landmarks. Thereby the system enables whole-body measurement without matching procedures or markers. Its mobile character suggests applications in art, design, archaeology and criminology.

High-Speed Visual Tracking of the Nearest Point of an Object Using 1,000-fps Adaptive Pattern Projection

Tomoyuki Inoue, Shingo Kagami, Joji Takei, Koichi Hashimoto

A 1,000-fps camera-projector system in which projected patterns are adaptively controlled according to image processing results is described. Adaptive structured light projection enables fast and efficient 3-D information acquisition. The prototype system is applied to the tracking of the nearest point of an object, and experimental results show that the system successfully tracked an apex of a fast-moving target object.


Projector Calibration using Arbitrary Planes and Calibrated Camera

Makoto Kimura, Masaaki Mochimaru, Takeo Kanade

In this paper, an easy calibration method for a projector is proposed. The calibration addressed here is the projective relation between 3D space and the 2D pattern, not the correction of trapezoidal distortion in the projected pattern. In projector-camera systems, especially for 3D measurement, such calibration is the basis of the whole process. Projection from a projector can be modeled as the inverse projection of a pinhole camera, which is generally treated as perspective projection. In existing systems, special objects or devices are often used to calibrate the projector, so that the 3D-2D projection map can be measured with typical camera calibration methods. The proposed method utilizes the projective geometry between camera and projector, so that it requires only a pre-calibrated camera and a plane. It is easy to carry out, easy to compute, and reasonably accurate.
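The plane-induced relation such methods build on can be illustrated as follows (a generic sketch of camera-projector homography estimation, not the paper's full calibration pipeline): on a single plane, points of the projected pattern and their observations in the camera are related by a 3x3 homography. The point coordinates below are hypothetical.

```python
# For a single plane, projector pattern points and their images in the camera
# are related by a homography H (x_cam ~ H x_proj). Generic sketch with OpenCV.
import numpy as np
import cv2

# Hypothetical correspondences: pattern coordinates in the projector image and
# their detected positions in the camera image (at least 4 points).
proj_pts = np.array([[100, 100], [700, 100], [700, 500], [100, 500]], np.float32)
cam_pts  = np.array([[212, 143], [905, 180], [867, 640], [170, 590]], np.float32)

H, _ = cv2.findHomography(proj_pts, cam_pts)

# Map a new projector pixel onto the camera image plane through H.
p = np.array([400.0, 300.0, 1.0])
q = H @ p
print(q[:2] / q[2])   # predicted camera-image location of the projected point
```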

Paper Session I: Calibration and Measurement

Geometric Modeling and Calibration of Planar Multi-Projector Displays Using Rational Bezier Patches

Ezekiel Bhasker, Aditi Majumder

In order to achieve seamless imagery in a planar multiprojector display, geometric distortions and misalignment of images within and across projectors have to be removed. Camera-based calibration methods are popular for achieving this in an automated fashion. Previous methods for geometric calibration fall into two categories: (a) Methods that model the geometric function relating the projectors to cameras using simple linear models, like homography, to calibrate the display. These models assume perfect linear devices and cannot address projector non-linearities, like lens distortions, which are common in most commodity projectors. (b) Methods that use piecewise linear approximations to model the relationship between projectors and cameras. These require a dense sampling of the function space to achieve good calibration. In this paper, we present a new closed-form model that relates projectors to cameras in planar multi-projector displays, using rational Bezier patches. This model overcomes the shortcomings of the previous methods by allowing for projectors with significant lens distortion. It can be further used to develop an efficient and accurate geometric calibration method with a sparse sampling of the function.
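For reference, a rational Bézier patch of degree (m, n), with control points P_ij, weights w_ij and Bernstein polynomials B_i^m, has the standard form (quoted for context; the paper's fitting procedure is not reproduced here):

```latex
Q(u,v) \;=\; \frac{\displaystyle\sum_{i=0}^{m}\sum_{j=0}^{n} w_{ij}\, B_i^m(u)\, B_j^n(v)\, P_{ij}}{\displaystyle\sum_{i=0}^{m}\sum_{j=0}^{n} w_{ij}\, B_i^m(u)\, B_j^n(v)},
```

whose extra degrees of freedom (the weights and higher-degree control net) are what allow a single closed-form patch to absorb projector lens distortion that a plain homography cannot.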

High-Speed Measurement of BRDF using an Ellipsoidal Mirror and a Projector

Yasuhiro Mukaigawa, Kohei Sumino, Yasushi Yagi

Measuring the BRDF (Bi-directional Reflectance Distribution Function) requires huge amounts of time, because a target object must be illuminated from all incident angles and the reflected light must be measured over all reflection angles. In this paper, we present a high-speed method for measuring BRDFs using an ellipsoidal mirror and a projector. Our method makes it possible to change incident angles without a mechanical drive. Moreover, the omni-directional reflected light from the object can be measured by one static camera at once. Our prototype requires only fifty minutes to measure an anisotropic BRDF, even if the lighting interval is one degree.
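For context, the quantity being measured is the ratio of reflected differential radiance to incident irradiance (standard definition):

```latex
f_r(\omega_i, \omega_o) \;=\; \frac{\mathrm{d}L_o(\omega_o)}{L_i(\omega_i)\,\cos\theta_i\,\mathrm{d}\omega_i},
```

which makes clear why a full measurement must sweep both the incident direction ω_i and the outgoing direction ω_o: the ellipsoidal mirror lets all outgoing directions be captured in a single camera frame, while the projector varies the incident direction without mechanical motion.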


Photometric Self-Calibration of a Projector-Camera System

Ray Juang, Aditi Majumder

In this paper, we present a method for photometric self-calibration of a projector-camera system. In addition to the input transfer functions (commonly called gamma functions), we also reconstruct the spatial intensity fall-off from the center to the fringe (commonly called the vignetting effect) for both the projector and the camera. Projector-camera systems are becoming popular in a large number of applications like scene capture, 3D reconstruction, and the calibration of multi-projector displays. Our method enables the use of photometrically uncalibrated projectors and cameras in all such applications.
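One common way to write the model being recovered (my summary of the usual gamma-plus-vignetting formulation; the paper's exact notation may differ): for projector input i at pixel x, the camera measurement m is

```latex
m(x) \;=\; f_c\big(\, v_c(x)\; v_p(x)\; f_p(i(x)) \,\big),
```

where f_p and f_c are the projector and camera input transfer (gamma) functions and v_p and v_c are their spatial vignetting fall-offs; self-calibration means estimating all four from images the system projects and captures itself.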

Paper Session II: Real-Time Applications

Real-Time Projector Tracking on Complex Geometry Using Ordinary Imagery

Tyler Johnson, Henry Fuchs

Calibration techniques for projector-based displays typically require that the display configuration remain fixed, since they are unable to adapt to changes such as the movement of a projector. In this paper, we present a technique that is able to automatically recalibrate a projector in real time without interrupting the display of user imagery. In contrast to previous techniques, our approach can be used on surfaces of complex geometry without requiring the quality of the projected imagery to be degraded. By matching features between the projector and a stationary camera, we obtain a new pose estimate for the projector during each frame. Since matching features between a projector and camera can be difficult due to the nature of the images, we obtain these correspondences indirectly by first matching between the camera and an image rendered to predict what the camera will capture.

Shadow Removal in Front Projection Environments using Object Tracking

Samuel Audet, Jeremy Cooperstock

When an occluding object, such as a person, stands between a projector and a display surface, a shadow results. We can compensate by positioning multiple projectors so they produce identical and overlapping images and by using a system to locate shadows. Existing systems work by detecting either the shadows or the occluders. Shadow detection methods cannot remove shadows before they appear and are sensitive to video projection, while current occluder detection methods require near infrared cameras and illumination. Instead, we propose using a camera-based object tracker to locate the occluder and an algorithm to model the shadows. The algorithm can adapt to other tracking technologies as well. Despite imprecision in the calibration and tracking process, we found that our system performs effective shadow removal with sufficiently low processing delay for interactive applications with video projection.


DigiTable: An Interactive Multiusers Table for Collocated and Remote Collaboration Enabling Remote Gesture Visualization

François Coldefy, Stéphane Louis dit Picard

We present DigiTable, an experimental platform that we hope will lessen the gap between co-present and distant interaction. DigiTable combines a multiuser tactile interactive tabletop, a video-communication system enabling eye contact with real-size visualization of the distant user, and a spatialized sound system for speech transmission. Based on a robust computer vision module, it provides fluid gesture visualization of each distant participant, whether he or she is moving virtual digital objects or intending to do so. Remote gesture visualization contributes to the efficiency of distant collaboration tasks because it enables coordination among participants' actions and talk. Our main contribution addresses the development and integration of robust, real-time projector-camera processing techniques for Computer Supported Cooperative Work.

Displaying a Moving Image By Multiple Steerable Projectors

Ikuhisa Mitsugami, Norimichi Ukita, Masatsugu Kidode

This paper proposes a method for the precise overlapping of projected images from multiple steerable projectors. When they are controlled simultaneously, two problems arise: (1) even a slight positional error of the projected image, which does not matter in the case of a single projector, causes misalignments of the multiple projected images that are clearly perceptible; and (2) since the projectors usually do not have architectures for synchronization, it is impossible to display a moving image by precisely tiling or overlaying the multiple projected images. To overcome (1), a method is proposed that first measures the misalignments on every plane in the environment and then displays the image without misalignment. For (2), a consideration and a new proposal for the synchronization of multiple projectors are also discussed.

Posters

Projector-Camera Guided Fast Environment Restoration of a Biofeedback System for Rehabilitation

Yufei Liu, Gang Qian

An Embodied User Interface for Increasing Physical Activities in Game (Poster)

Si-Jung Kim, Woodrow W. Winchester, Yun-Bum Choi, Juck-Sik Lee

The rate of obesity has been increasing, and obesity has emerged as a significant threat not only to health but also to society. Obesity has adverse effects such as changes in physical appearance, psychosocial consequences, and metabolic disturbances. One of the reasons for this phenomenon is that most games have static and stationary user interfaces as input devices. These kinds of interfaces hold users at their computers and not only weaken their health but also block communication between family members. In this paper, we propose a physical-activity-based interactive exercise game called Punch Punch, which is played with virtual objects displayed on a large screen. An informal study revealed that Punch Punch enhanced physical and social activities during game play. The goal of this study is to find embodied user interfaces that increase physical and social activities.


A Real-Time ProCam System for Interaction with Chinese Ink-and-Wash Cartoons

Ming Jin, Hui Zhang, Xubo Yang, Shuangjiu Xiao

This poster describes our recently developed real-time projector-camera system for interaction with Chinese ink-and-wash cartoons. We implement a real-time interactive water simulation accelerated on the GPU, together with a sensing subsystem using computer vision techniques. Combined with stylized Chinese fish rendered in the process, the system provides real-time interaction with traditional Chinese paintings. By stirring up the still water, fish and other essential elements of Chinese paintings, we hope to present new interaction techniques and livelier Chinese painting sceneries compared to those in traditional static settings.

Virtual Recovery of the Deteriorated Art Object based on AR Technology

Toshiyuki Amano, Ryo Suzuki

In this paper, a method for the virtual restoration of art objects in the real world is proposed. The correction pattern for restoration is generated from an image of the undamaged object scanned in advance. To recover the original beauty, marker tracking is used to detect the art object's position, and the correction image is projected from a calibrated LCD projector. In the experiment, color restoration using a sample image was performed. The experimental results confirmed that projecting the correction image can restore discoloration in real-world imagery.

Automatic texture mapping on real 3D model

Molinier Thierry, Fofi David, Patrick Gorria, Joaquim Salvi

We propose a fully automatic technique to project virtual texture onto a real textureless 3D object. Our system is composed of cameras and a projector, which are used to determine the pose of the object in the real world with the projector as reference, and then to estimate the image the projector would see if it were a camera.

Paper Session III: Image Quality

Realizing Super-Resolution with Superimposed Projection

Niranjan Damera-Venkata, Nelson L. Chang

We consider the problem of rendering high-resolution images on a display composed of multiple superimposed lower-resolution projectors. A theoretical analysis of this problem in the literature previously concluded that the superimposition of low-resolution projectors cannot produce high-resolution images. In our recent work, we showed to the contrary that super-resolution via multiple superimposed projectors is indeed theoretically achievable. This paper derives practical algorithms for real multi-projector systems that account for intra- and inter-projector variations and that render high-quality, high-resolution content at real-time interactive frame rates. A camera is used to estimate the geometric, photometric, and color properties of each component projector in a calibration step. Given this parameter information, we demonstrate novel methods for efficiently generating optimal subframes so that the resulting projected image is as close as possible to the given high-resolution images.
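The stated goal can be summarized as an optimization (my paraphrase of the abstract, not the paper's exact formulation): with A_k the calibrated geometric, photometric and color operator mapping subframe x_k of projector k onto the display surface, the subframes are chosen so that the superimposed result best matches the target image y,

```latex
\{x_k^{\ast}\} \;=\; \arg\min_{\{x_k\}} \;\Big\|\, y \;-\; \sum_{k} A_k\, x_k \,\Big\|^{2}.
```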


Improved Legibility of Text for Multiprojector Tiled Displays

Philip Tuddenham, Peter Robinson

Displaying small text on large multiprojector tiled displays is challenging. Problems arise because text is badly affected by the image-warping techniques that these displays apply to rectify projector misalignment. As a consequence, there has been little progress with important large display applications that require small text, such as collaborative tutoring or web-browsing. In this paper we present a new warping technique designed to preserve crisp text, based on recent work by Hereld and Stevens. Our technique produces good results, free of artifacts, when used in today’s multiprojector displays. We evaluate the legibility of our technique against conventional interpolation-based warping and find that users prefer our technique. We describe an efficient and reusable implementation, and show how the increased legibility has allowed us to investigate two new applications.

Focal Pre-Correction of Projected Image for Deblurring Screen Image

Yuji Oyamada, Hideo Saito

We propose a method for reducing out-of-focus blur caused by projector projection. In this method, we estimate the Point-Spread-Function (PSF) of the out-of-focus blur in the image projected onto the screen by comparing the screen image captured by a camera with the original image projected by the projector. According to the estimated PSF, the projected image is pre-corrected, so that the screen image can be deblurred. Experimental results show that our method can reduce out-of-focus projection blur.
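A minimal way to see the pre-correction step (a generic frequency-domain sketch under an assumed known PSF; the paper estimates the PSF from captured images and may use a different corrector): divide the image spectrum by the PSF spectrum with Wiener-style regularization, so that the subsequent optical blur approximately restores the original.

```python
# Pre-correct an image so that a known out-of-focus blur (PSF) applied by the
# projector-screen optics approximately cancels out. Wiener-style sketch.
import numpy as np

def precorrect(image: np.ndarray, psf: np.ndarray, k: float = 1e-2) -> np.ndarray:
    """Divide by the PSF in the Fourier domain, regularized by k."""
    H = np.fft.fft2(np.fft.ifftshift(psf), s=image.shape)
    W = np.conj(H) / (np.abs(H) ** 2 + k)          # regularized inverse filter
    return np.real(np.fft.ifft2(np.fft.fft2(image) * W))

# Hypothetical Gaussian PSF standing in for the estimated out-of-focus blur.
y, x = np.mgrid[-16:16, -16:16]
psf = np.exp(-(x ** 2 + y ** 2) / (2 * 3.0 ** 2))
psf /= psf.sum()

img = np.zeros((32, 32)); img[8:24, 8:24] = 1.0    # toy "slide"
pre = precorrect(img, psf)
# Simulate projection: blur the pre-corrected image with the same PSF.
screen = np.real(np.fft.ifft2(np.fft.fft2(pre) *
                              np.fft.fft2(np.fft.ifftshift(psf), s=img.shape)))
print(np.abs(screen - img).mean())  # smaller than projecting img unmodified
```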


Abstracts: Workshop on Online Learning for Classification

Oral Session

On-line Simultaneous Learning and Tracking of Visual Feature Graphs

Arnaud Declercq, Justus H. Piater

Model learning and tracking are two important topics in computer vision. While there are many applications where one of them is used to support the other, there are currently only a few where both aid each other simultaneously. In this work, we seek to incrementally learn a graphical model from tracking and to simultaneously use whatever has been learned to improve the tracking in the next frames. The main problem encountered in this situation is that the current intermediate model may be inconsistent with future observations, creating a bias in the tracking results. We propose an uncertain model that explicitly accounts for such uncertainties by representing relations as an appropriately weighted sum of informative (parametric) and uninformative (uniform) components. The method is completely unsupervised and operates in real time.

Online Spatio-temporal Data Fusion for Robust Adaptive Tracking

Jixu Chen, Qiang Ji

One problem with adaptive tracking is that the data used to train the new target model often contain errors, and these errors affect the quality of the new target model. Over time, these errors accumulate and eventually cause the tracker to drift away. In this paper, we propose a new method based on online data fusion to alleviate this tracking-drift problem. By combining spatial and temporal data through a Dynamic Bayesian Network, the proposed method can improve the quality of online data labeling, thereby minimizing the error associated with model updating and alleviating the tracking drift problem. Experiments show that the proposed method significantly improves the performance of an existing adaptive tracking method.

Practical Online Active Learning for Classification

Claire Monteleoni, Matti Kaariainen

We compare the practical performance of several recently proposed algorithms for active learning in the online classification setting. We consider two active learning algorithms (and their combined variants) that are strongly online, in that they access the data sequentially and do not store any previously labeled examples, and for which formal guarantees have recently been proven under various assumptions. We motivate an optical character recognition (OCR) application that we argue to be appropriately served by online active learning. We compare the practical efficacy, for this application, of the algorithm variants, and show significant reductions in label-complexity over random sampling.
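To illustrate the "strongly online" setting (a generic margin-based uncertainty-sampling sketch, not one of the specific algorithms compared in the paper): the learner sees examples sequentially, requests a label only when its current linear predictor is uncertain, and never stores past examples.

```python
# Strongly online active learning sketch: a linear classifier processes a
# stream, queries a label only when the prediction margin is small, and
# discards each example after the update. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(2)                       # linear model; no stored examples
threshold, lr = 0.5, 0.1
labels_used = 0

for t in range(2000):
    x = rng.normal(size=2)
    y_true = 1.0 if x[0] + 0.5 * x[1] > 0 else -1.0   # hidden concept
    margin = abs(w @ x)
    if margin < threshold:            # uncertain: pay for a label and update
        labels_used += 1
        if y_true * (w @ x) <= 0:
            w += lr * y_true * x      # perceptron-style update

print("labels queried:", labels_used, "out of 2000")  # far fewer than 2000
```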


Online Learning for Human-Robot Interaction

Bogdan Raducanu, Jordi Vitria

This paper presents a novel approach for incremental subspace learning based on an online version of Nonparametric Discriminant Analysis (NDA). For many real-world applications (such as the study of visual processes), it is impossible to know beforehand the total number of classes or the exact number of instances per class. This motivated us to propose a new algorithm in which new samples can be added asynchronously, at different time stamps, as soon as they become available. The proposed technique for NDA-eigenspace representation has been applied to the problem of online face recognition in a human-robot interaction scenario.

A Bayesian Non-Gaussian Mixture Analysis: Application to Eye Modeling

Nizar Bouguila, Djemel Ziou, Riad I. Hammoud

Many computer vision and pattern recognition problems involve the use of finite Gaussian mixture models. Finite mixture models using the generalized Dirichlet distribution have been shown to be a robust alternative to normal mixtures. In this paper, we adopt a Bayesian approach for generalized Dirichlet mixture estimation and selection. This approach offers a solid theoretical framework for combining statistical model learning and knowledge acquisition. The estimation of the parameters is based on the Monte Carlo simulation technique of Gibbs sampling mixed with a Metropolis-Hastings step. For the selection of the number of clusters, we use Bayes factors. We have successfully applied the proposed Bayesian framework to model eyes in IR images. Experimental results demonstrate the robustness, efficiency, and accuracy of the algorithm.
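For reference, the model-selection criterion used here is the Bayes factor between two candidate models (standard definition), e.g. mixtures with different numbers of clusters M_1 and M_2:

```latex
B_{12} \;=\; \frac{p(\mathcal{D} \mid M_1)}{p(\mathcal{D} \mid M_2)},
```

where D is the observed data; values of B_12 substantially greater than 1 favor M_1.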

Poster Session

A Bio-inspired Learning Approach for the Classification of Risk Zones in a Smart Space

Alessio Dore, Matteo Pinasco, Carlo Regazzoni

Learning from experience is a basic task of the human brain that is not yet fulfilled satisfactorily by computers. Therefore, to cope with this issue, bio-inspired approaches have gathered the attention of several researchers in recent years. In this work, a learning method is proposed based on a model derived from neurophysiological observations of the generation of the sense of self, which is connected to the memorization of interactions with external entities. The application domain of this algorithm is a Cognitive Surveillance system that aims at detecting intruders and communicating guidance messages to a user (a guard) equipped with a mobile device, in order to chase the intruder. The proposed method is intended to allow the system to establish an efficient interaction with the user by sending messages only when necessary. To this end, the zones of the monitored area are classified according to the probability that a change in pursuit strategy will occur, learned online from the motion of the user and the intruder. The proposed algorithm has been tested on real-world data, demonstrating its capacity to learn this information and use it to tag the zones of the area under examination.


Fast Sparse Gaussian Processes Learning for Man-Made Structure Classification

Hang Zhou, David Suter

The Informative Vector Machine (IVM) is an efficient sparse Gaussian process (GP) method previously suggested for active learning. It greatly reduces the computational cost of GP classification and brings GP learning close to real time. We apply the IVM to man-made structure classification (a two-class problem). Our work includes an investigation of the performance of the IVM with varied numbers of active data points, as well as of the effects of different choices of GP kernels. Satisfactory results have been obtained, showing that the approach retains full GP classification performance and yet is significantly faster (by virtue of using a subset of the whole training data).

Online Detection of Fire in Video

B. Ugur Toreyin, A. Enis Cetin

This paper describes an online-learning-based method to detect flames in video by processing the data generated by an ordinary camera monitoring a scene. Our fire detection method consists of weak classifiers based on temporal and spatial modeling of flames. Markov models representing flames and flame-colored ordinary moving objects are used to distinguish the temporal flame flicker process from the motion of flame-colored moving objects. Boundaries of flames are represented in the wavelet domain, and the high-frequency nature of the boundaries of fire regions is also used as a clue to model the flame flicker spatially. Results from the temporal and spatial weak classifiers, based on flame flicker and the irregularity of flame region boundaries, are updated online to reach a final decision. False alarms due to ordinary and periodic motion of flame-colored moving objects are greatly reduced compared to existing video-based fire detection systems.


Abstracts: IEEE Computer Society Workshop on Biometrics

Session 1: Face recognition in video

Pose and Illumination Invariant Face Recognition in Video

Yilei Xu, Amit Roy-Chowdhury, Keyur Patel

The use of video sequences for face recognition has been relatively less studied than image-based approaches. In this paper, we present a framework for face recognition from video sequences that is robust to large changes in facial pose and lighting conditions. Our method is based on a recently obtained theoretical result that can integrate the effects of motion, lighting and shape in generating an image using a perspective camera. This result can be used to estimate the pose and illumination conditions for each frame of the probe sequence. Then, using a 3D face model, we synthesize images corresponding to the pose and illumination conditions estimated in the probe sequences. Similarity between the synthesized images and the probe video is computed by integrating over the entire sequence. The method can handle situations where the pose and lighting conditions in the training and testing data are completely disjoint.

Online Appearance Model Learning for Video-based Face Recognition

Liang Liu, Yunhong Wang, Tieniu Tan

In this paper, we propose a novel online learning method which can learn appearance models incrementally from a given video stream. The data of each frame in the video can be discarded as soon as it has been processed. We only need to maintain a few linear eigenspace models and a transition matrix to approximately construct face appearance manifolds. It is convenient to use these learnt models for video-based face recognition. There are two main contributions in this paper. First, we propose an algorithm which can learn appearance models online without using a pretrained model. Second, we propose a method for eigenspace splitting to prevent most samples from clustering into the same eigenspace, which is useful for clustering and classification. Experimental results show that the proposed method can both learn appearance models online and achieve a high recognition rate.
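As a rough analogue of the eigenspace-maintenance idea (my illustration using scikit-learn's IncrementalPCA, which is not the authors' algorithm): an eigenspace model can be updated batch by batch while each frame's data is discarded after processing.

```python
# Rough analogue of online eigenspace learning: update a PCA model from a
# video stream batch by batch, discarding frames after each update.
# Uses scikit-learn's IncrementalPCA; NOT the authors' algorithm.
import numpy as np
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=8)
rng = np.random.default_rng(0)

for batch in range(10):                      # stand-in for a video stream
    frames = rng.random((16, 32 * 32))       # 16 face crops of 32x32 pixels
    ipca.partial_fit(frames)                 # update eigenspace, drop frames

new_face = rng.random((1, 32 * 32))
coeffs = ipca.transform(new_face)            # project onto learnt eigenspace
print(coeffs.shape)                          # (1, 8)
```

The paper goes further by maintaining several such eigenspaces plus a transition matrix between them, and by splitting an eigenspace when too many samples fall into it.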

Face Recognition in Video: Adaptive Fusion of Multiple Matchers

Unsang Park, Anil Jain, Arun Ross

Face recognition in video is being actively studied as a covert method of human identification in surveillance systems. Identifying human faces in video is a difficult problem due to the presence of large variations in facial pose and lighting, and due to poor image resolution. However, by taking advantage of the diversity of the information contained in video, the performance of a face recognition system can be enhanced. In this work we explore (a) the adaptive use of multiple face matchers in order to enhance the performance of face recognition in video, and (b) the possibility of appropriately populating the database (gallery) in order to succinctly capture intra-class variations. To extract the dynamic information in video, the facial poses in various frames are explicitly estimated using an Active Appearance Model (AAM) and a factorization-based 3D face reconstruction technique. We also estimate the motion blur using the Discrete Cosine Transform (DCT). Our experimental results on 204 subjects in CMU's Face-In-Action (FIA) database show that the proposed recognition method provides consistent improvements in the matching performance using three different face matchers.


Session 2: Iris recognition

Non-intrusive Iris Image Capturing System Using Light Stripe Projection and Pan-Tilt-Zoom Camera

Sowon Yoon, Ho Gi Jung, Jae Kyu Suhr, Jaihie Kim

This paper proposes a non-intrusive iris image capturing system, which consists of a pan-tilt-zoom camera and light stripe projection. Light stripe projection provides the position of the user. After panning according to the user's position, AdaBoost-based face detection finds the tilt angle. With the user's position and tilt angle, the zoom and focus positions are initialized. Knowing the user's position replaces a 2D face search with a 1D face search. Exact zoom and focus positions enable fast control and a narrow search range. Consequently, experimental results show that the proposed system can capture an iris image within an acceptable time.

On the Efficacy of Correcting for Refractive Effects in Iris Recognition

Jeffery Price, Timothy Gee, Vincent Paquit, Kenneth Tobin

In this study, we aim to determine if iris recognition accuracy might be improved by correcting for the refractive effects of the human eye when the optical axes of the eye and camera are misaligned. We undertake this investigation using an anatomically-approximated, three-dimensional model of the human eye and ray-tracing. We generate synthetic iris imagery from different viewing angles using first a simple pattern of concentric rings on the iris for analysis, and then synthetic texture maps on the iris for experimentation. We estimate the distortion from the concentric-ring iris images and use the results to guide the sampling of textured iris images that are distorted by refraction. Using the well-known Gabor filter phase quantization approach, our model-based results indicate that the Hamming distances between iris signatures from different viewing angles can be significantly reduced by accounting for refraction. Over our experimental conditions comprising viewing angles from 0 to 60 degrees, we observe a median reduction in Hamming distance of 27.4% and a maximum reduction of 70.0% when we compensate for refraction. Maximum improvements are observed at viewing angles of 20°-25°.
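For context, the matching step referenced above works roughly as follows (a minimal sketch of Gabor phase quantization and Hamming-distance matching in the style of iris codes; the filter parameters and layout are illustrative, not the paper's):

```python
# Iris-code matching sketch: quantize complex Gabor filter phase into 2 bits
# per location, then compare codes by fractional Hamming distance.
import numpy as np

def gabor_phase_bits(signal: np.ndarray, freq: float = 0.1) -> np.ndarray:
    """2-bit phase quantization of a complex Gabor response along a row."""
    n = signal.size
    t = np.arange(n)
    kernel = (np.exp(2j * np.pi * freq * t) *
              np.exp(-((t - n / 2) ** 2) / (2 * (n / 8) ** 2)))
    resp = np.convolve(signal - signal.mean(), kernel, mode="same")
    return np.stack([resp.real > 0, resp.imag > 0]).astype(np.uint8)

def hamming(code_a: np.ndarray, code_b: np.ndarray) -> float:
    """Fraction of disagreeing bits between two iris codes."""
    return float(np.mean(code_a != code_b))

rng = np.random.default_rng(0)
iris_row = rng.random(256)                           # unwrapped iris texture
ref = gabor_phase_bits(iris_row)
same = gabor_phase_bits(iris_row + 0.05 * rng.random(256))
other = gabor_phase_bits(rng.random(256))
print(hamming(ref, same), hamming(ref, other))       # small vs. near 0.5
```

The paper's point is that off-axis refraction distorts the unwrapped texture before this quantization, so compensating for it shrinks the Hamming distance between genuine pairs.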

Automated Individualization of Deformable Eye Region Model and Its Application to Eye Motion Analysis

Tsuyoshi Moriyama, Takeo Kanade

This paper proposes a method for the automated individualization of an eye region model. The eye region model, proposed in past research, parameterizes both the structure and the motion of the eye region. Without prior knowledge, one can never determine whether a given appearance of the eye region is neutral to any expression (i.e., the inherent structure of the eye region) or the result of motion due to a facial expression. The past method manually individualized the model with respect to the structure parameters in the initial frame and tracked the motion parameters automatically across the rest of the image sequence, assuming the initial frame contains only neutral faces. Under the same assumption, we automatically determine the structure parameters for a given eye region image. We train Active Appearance Models (AAMs) to parameterize the variance of individuality. The system projects a given eye region image onto the low-dimensional subspace spanned by the AAM, retrieves the structure parameters of the nearest training sample, and initializes the eye region model with them. The AAMs are trained on subregions, namely the upper eyelid region, the palpebral fissure (eye aperture) region, and the lower eyelid region, which enables each AAM to effectively represent fine structures. Experimental results show that the proposed method provides initialization as good as manual labor and yields comparable tracking results for a comprehensive set of eye motions.


Session 3: Security & privacy enhancement in biometrics

Anonymous and Revocable Fingerprint Recognition

Faisal Farooq, Nalini Ratha, Tsai-Yang Jea, Ruud Bolle

Biometric identification has numerous advantages over conventional ID and password systems; however, the lack of anonymity and revocability of biometric templates is of concern. Several methods have been proposed to address these problems. Many of the approaches require a precise registration before matching in the anonymous domain. We introduce binary string representations of fingerprints that obviate the need for registration and can be matched directly. We describe several techniques for creating anonymous and revocable representations using these binary strings. The matching performance of these representations is evaluated using a large database of fingerprint images. We prove that, given an anonymous representation, it is computationally infeasible to invert it to the original fingerprint, thereby preserving privacy. To the best of our knowledge, this is the first linear, anonymous and revocable fingerprint representation that is implicitly registered.

Real-time Automatic Deceit Detection from Involuntary Facial Expressions

Zhi Zhang, Vartika Singh, Thomas Slowe, Sergey Tulyakov, Venugopal Govindaraju

As the most broadly used tool for deceit measurement, the polygraph is a limited method: it suffers from human operator subjectivity and from the fact that target subjects are aware of the measurement, which invites the opportunity to alter their behavior or plan counter-measures in advance. The approach presented in this paper attempts to circumvent these problems by unobtrusively and automatically measuring several previously identified Deceit Indicators (DIs) based upon involuntary, so-called reliable facial expressions, through computer vision analysis of image sequences in real time. Reliable expressions are expressions said by the psychology community to be impossible for a significant percentage of the population to simulate convincingly without feeling a true, inner-felt emotion. The strategy is to detect the difference between those expressions which arise from internal emotion, implying verity, and those which are simulated, implying deceit. First, a group of Facial Action Units (AUs) related to the reliable expressions is detected based on distance- and texture-based features. The DIs can then be measured, and finally a decision of deceit or verity is made accordingly. The performance of the proposed approach is evaluated in a real-time implementation for deceit detection.

Using Genetic Algorithms to Improve Matching Performance of Changeable biometrics from Combining PCA and ICA Methods

MinYi Jeong, Jeung-Yoon Choi, Jaihie Kim

Biometrics is personal authentication that uses an individual's information. In terms of user authentication, biometric systems have many advantages. Despite these advantages, however, they also have disadvantages in the area of privacy. Changeable biometrics is a solution to the problem of privacy protection. In this paper, we propose a changeable face biometrics system to overcome this problem. The proposed method uses PCA, ICA, and genetic algorithms. PCA and ICA coefficient vectors extracted from an input face image are normalized using their norms. The two normalized vectors are transformed using a weighting matrix derived with genetic algorithms and then scrambled randomly. A new transformed face coefficient vector is generated by adding the two weighted normalized vectors. Through experiments, we show that we can achieve accuracy better than that of conventional methods. It is also shown that the changeable templates are non-invertible and provide sufficient reproducibility.


Secure Biometric Templates from Fingerprint-Face Features

Yagiz Sutcu, Qiming Li, Nasir Memon

Since biometric data cannot be easily replaced or revoked, it is important that biometric templates used in biometric applications should be constructed and stored in a secure way, such that attackers would not be able to forge biometric data easily even when the templates are compromised. This is a challenging goal since biometric data are “noisy” by nature, and the matching algorithms are often complex, which make it difficult to apply traditional cryptographic techniques, especially when multiple modalities are considered. In this paper, we consider a “fusion” of a minutiae-based fingerprint authentication scheme and an SVD-based face authentication scheme, and show that by employing a recently proposed cryptographic primitive called “secure sketch”, and a known geometric transformation on minutiae, we can make it easier to combine different modalities, and at the same time make it computationally infeasible to forge an “original” combination of fingerprint and face image that passes the authentication. We evaluate the effectiveness of our scheme using real fingerprints and face images from publicly available sources.

Session 4: Multi-modal

Fusing palmprint and palm vein images by an integrated line-preserving and contrast-enhancing method for person recognition based on "Laplacianpalm" feature

Jian-Gang Wang, Wei-Yun Yau, Andy Suwandy, Eric Sung

Unimodal analysis of palmprint and palm vein images has been investigated for person recognition. However, it is not robust to noise and spoof attacks. In this paper, we present a multimodal personal identification system using palmprint and palm vein images, with fusion applied at the image level. The palmprint and palm vein images are fused by a novel integrated line-preserving and contrast-enhancing fusion method. Based on our proposed fusion rule, the modified multiscale edges of the palmprint and palm vein images are combined, while the image contrast and the interaction points (IPs) of the palm and vein lines are enhanced. The IPs are novel features obtained in our fused images. A novel palm representation, called the "Laplacianpalm" feature, is extracted from the fused images by Locality Preserving Projections (LPP). We compare the recognition performance using the unimodal and the proposed fused images. We also compare the proposed "Laplacianpalm" approach with Fisherpalm and Eigenpalm on a large dataset. Experimental results show that the proposed multimodal approach provides a better representation and achieves lower error rates in palm recognition.

A Novel Approach to Improve Biometric Recognition Accuracy Using Rank Level Fusion

Jay Bhatnagar, Ajay Kumar, Nipun Saggar

This paper proposes a novel approach to rank-level fusion which gives an improved performance gain, verified by experimental results. In the absence of ranked features, and instead of using the entire template, we propose using K partitions of the template. The approach proposed in this paper is useful for generating sequential ranks and survivor lists on partitions of the template, boosting confidence levels by incorporating information from the partitions. The proposed algorithm iteratively generates ranks for each partition of the user template. Ranks from the template partitions are consolidated to estimate the fusion rank for classification. This paper investigates rank-level fusion for the palmprint biometric using two approaches: (1) a fixed threshold and the resulting survivor list, and (2) iterative thresholds and an iteratively refined survivor list. The two approaches achieve similar performance as related manifestations of the fusion architecture. The experimental results support the proposition of high in-template similarity of the palmprint for a user and its relevance to the intra-modal fusion framework. Experimental results using the proposed approach on real palmprint data from 100 users show superior performance, with a recognition accuracy of 99% compared to the 95% achieved with the conventional approach.

Multi-modal Person Identification in a Smart Environment

Hazim Ekenel, Mika Fischer, Qin Jin, Rainer Stiefelhagen

In this paper, we present a detailed analysis of multimodal fusion for person identification in a smart environment. The multi-modal system consists of a video-based face recognition system and a speaker identification system. We investigated different score normalization, modality weighting and modality combination schemes during the fusion of the individual modalities. We introduced two new modality weighting schemes, namely the cumulative ratio of correct matches (CRCM) and distance-to-second-closest (DT2ND) measures. In addition, we also assessed the effects of well-known score normalization and classifier combination methods on the identification performance. Experimental results obtained on the CLEAR 2007 evaluation corpus, which contains audio-visual recordings from different smart rooms, show that CRCM-based modality weighting improves the correct identification rates significantly.
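
As a point of reference for the fusion pipeline described above, the following sketch combines two per-identity score vectors with min-max normalization and a weighted sum. The fixed weight is a placeholder; in the paper the weights come from the CRCM and DT2ND statistics, which are not reproduced here.

```python
import numpy as np

def min_max_norm(scores):
    """Map a score vector to [0, 1]; a common normalization choice."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + 1e-12)

def weighted_sum_fusion(face_scores, speaker_scores, w_face=0.6):
    """Combine per-identity scores from two modalities.

    w_face is a placeholder weight; the paper derives weights from
    CRCM / DT2ND statistics rather than fixing them by hand.
    """
    f = min_max_norm(np.asarray(face_scores, float))
    s = min_max_norm(np.asarray(speaker_scores, float))
    fused = w_face * f + (1.0 - w_face) * s
    return int(np.argmax(fused))   # identity with the highest fused score

print(weighted_sum_fusion([0.2, 0.9, 0.4], [0.6, 0.5, 0.8]))
```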

Improving Iris Identification using User Quality and Cohort Information

Arun Passi, Ajay Kumar

The iris is one of the most distinguishable features of the human body and remains fairly stable throughout the lifetime of an individual. This makes iris recognition one of the most reliable methods for biometric-based identification. This paper investigates a new technique to improve system performance by using cohort information and user quality as weights in the matching. The proposed approach uses the cohort information at the decision stage as cascaded classifiers. However, the second stage is only used if the first-stage classifier is uncertain of its decision. Experimental results from the decision-level classifier combination are presented, which show that the cascaded classification system significantly outperforms the single classifier, especially at lower values of FAR, which are the most likely operating points for any system. This paper also proposes a new approach to ascertain the user (iris) quality and illustrates its usage in performance improvement.
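
A minimal sketch of the cascade idea, under assumed thresholds: the first-stage matcher decides alone when confident, and a cohort-normalized second stage is consulted only in the uncertain band. The thresholds and the z-score cohort statistic are illustrative, not the paper's exact formulation.

```python
import statistics

def cascaded_decision(score, cohort_scores, t_accept=0.8, t_reject=0.4):
    """Two-stage decision: consult cohort information only when the
    first-stage matcher is uncertain. Thresholds are illustrative."""
    if score >= t_accept:
        return True                     # confident accept
    if score <= t_reject:
        return False                    # confident reject
    # Uncertain band: normalize against the cohort (T-norm style).
    mu = statistics.mean(cohort_scores)
    sd = statistics.pstdev(cohort_scores) or 1e-9
    return (score - mu) / sd > 2.0      # illustrative cohort threshold

print(cascaded_decision(0.6, [0.30, 0.35, 0.28, 0.33]))
```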

Session 5: Fingerprint + hand geometry + palm

Biometric Authentication Using Finger-Back Surface

Ravikanth Ch., Ajay Kumar

This paper investigates a new biometric system based on the texture of the hand knuckles. The texture pattern produced by finger knuckle bending is highly unique and makes the surface a distinctive biometric identifier. Hand geometry features can be acquired from the same image at the same time and integrated to improve the performance of the system. The finger-back surface images from each user are used to extract scale-, translation- and rotation-invariant knuckle images. The proposed system, especially with its peg-free and non-contact imaging setup, achieves promising results when tested over a database of 105 users.

A Robust Warping Method for Fingerprint Matching

Dongjin Kwon, Il Dong Yun, Sang Uk Lee

This paper presents a robust warping method for minutiae-based fingerprint matching approaches. In this method, a deformable fingerprint surface is described using a triangular mesh model. Given two extracted minutiae sets and their correspondences, the proposed method constructs an energy function from a robust correspondence energy estimator and a smoothness measure on the mesh model. We obtain a convergent deformation pattern using an efficient gradient-based energy optimization method. This energy optimization approach deals successfully with deformation errors caused by outliers, which are more difficult to handle with the thin-plate spline (TPS) model. The proposed method is fast, and its run-time performance is comparable with that of the TPS-based method. In the experiments, we provide a visual inspection of warping results on given correspondences and quantitative results on a database.

A Component-Based Approach to Hand Verification

Gholamreza Amayeh, George Bebis, Ali Erol, Mircea Nicolescu

This paper describes a novel hand-based verification system based on palm-finger segmentation and fusion. The proposed system operates on 2D hand images acquired by placing the hand on a planar lighting table without any guidance pegs. The segmentation of the palm and the fingers is performed without requiring the extraction of any landmark points on the hand. First, the hand is segmented from the forearm using a robust, iterative methodology based on morphological operators. Then, the hand is segmented into six regions corresponding to the palm and the fingers, again using morphological operators. The geometry of each component of the hand is represented using high-order Zernike moments, which are computed using an efficient methodology. Finally, verification is performed by fusing information from different parts of the hand. The proposed system has been evaluated on a database of 101 subjects, illustrating high accuracy and robustness. Comparisons with competitive approaches that use the whole hand illustrate the superiority of the proposed component-based approach in terms of both accuracy and robustness. Qualitative comparisons with state-of-the-art systems illustrate that the proposed system has comparable or better performance.

Session 6: Behavioral Biometrics

Are Digraphs Good for Free-Text Keystroke Dynamics?

Terence Sim, Rajkumar Janakiraman

Research in keystroke dynamics has largely focused on the typing patterns found in fixed text (e.g., userids and passwords). In this regard, digraphs and trigraphs have proven to be discriminative features. Recently, however, there is increasing interest in free-text keystroke dynamics, in which the user to be authenticated is free to type whatever he/she wants, rather than a pre-determined text. The natural question that arises is whether digraphs and trigraphs are just as discriminative for free text as they are for fixed text. We attempt to answer this question in this paper. We show that digraphs and trigraphs, if computed without regard to what word was typed, are no longer discriminative. Instead, word-specific digraphs/trigraphs are required. We also show that the typing dynamics for some words depend on whether they are part of a larger word. Our study is the first to investigate these issues, and we hope our work will help guide researchers looking for good features for free-text keystroke dynamics.
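
The word-specific digraph features the paper argues for can be illustrated with a small helper that extracts per-word key-pair latencies from a keystroke log. The log format (word, key, press time) is an assumption made for illustration; in practice one would also need per-word-instance identifiers to avoid bridging repeated words.

```python
from collections import defaultdict

def word_specific_digraphs(events):
    """Extract per-word digraph latencies from a keystroke log.

    events: list of (word, key, press_time_ms) tuples in typing order.
    Returns {(word, digraph): [latencies...]}. Keying on the word
    reflects the paper's finding that digraphs computed without regard
    to the word typed lose their discriminative power.
    """
    feats = defaultdict(list)
    for (w1, k1, t1), (w2, k2, t2) in zip(events, events[1:]):
        if w1 == w2:                      # digraph inside one word
            feats[(w1, k1 + k2)].append(t2 - t1)
    return dict(feats)

log = [("the", "t", 0), ("the", "h", 95), ("the", "e", 180),
       ("cat", "c", 400), ("cat", "a", 510), ("cat", "t", 605)]
print(word_specific_digraphs(log))
```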

Facial Expression Biometrics Using Tracker Displacement Features

Sergey Tulyakov, Thomas Slowe, Zhi Zhang, Venu Govindaraju

In this paper we investigate the possibility of using facial expression information for person biometrics. The idea of this research is that a person's emotional facial expressions are repeatable, so facial expression features can be used for person identification. In order to avoid using the person-specific geometric or textural features traditionally used in face biometrics, we restrict ourselves to tracker displacement features only. In contrast to previous research in facial expression biometrics, we extract features only from a pair of face images, neutral and the apex of the emotional expression, instead of using a sequence of images from video. The experiments, performed on two facial expression databases, confirm that the proposed features can indeed be used for biometric purposes.

Improving Variance Estimation in Biometric Systems

Ross Micheals, Terry Boult

Measuring system performance seems conceptually straightforward. However, interpreting the results and predicting future performance remain exceptional challenges in system evaluation. Robust experimental design is critical in evaluation, but there have been very few techniques to check designs for either overlooked associations or weak assumptions. For biometric and vision system evaluation, the complexity of the systems makes a thorough exploration of the problem space impossible — this lack of verifiability in experimental design is a serious issue. In this paper, we present a new evaluation methodology that improves the accuracy of the variance estimator via the discovery of false assumptions about the homogeneity of cofactors — i.e., when the data is not “well mixed.” The new methodology is then applied in the context of a biometric system evaluation with highly influential cofactors.

Session 7: Advances in face recognition

Recognizing Faces of Moving People by Hierarchical Image-Set Matching

Masashi Nishiyama, Mayumi Yuasa, Tomoyuki Shibata, Tomokazu Wakasugi, Tomokazu Kawahara, Osamu Yamaguchi

This paper proposes a novel method for recognizing faces in a cluster of moving people. In this task, there are two problems caused by motion: occlusions, and changes in facial pose and illumination. Multiple cameras are used to acquire near-frontal faces, avoiding occlusions and profile faces. Hierarchical Image-Set Matching (HISM) creates a distribution for each individual by integrating a set of face images of the same individual acquired from the multiple cameras. By adopting a method that compares test and training distributions in identification, variation in pose and illumination is alleviated and good recognition accuracy can be obtained. Experimental results using video sequences containing 349 people show that the proposed method achieves high recognition performance compared with conventional methods that use frame-by-frame identification or a distribution obtained from a single camera.

Robustness of the New Owner-Tester Approach for Face Recognition Experiments

Wai Han Ho, Paul Watters, Dominic Verity

With the broad application of face identification, it is important that the performance estimated for an algorithm from a sample generalizes to the population performance. We propose using an Owner-Tester setup to replace the current approach for experiments on face identification (or other biometric and pattern recognition systems). This paper looks into the robustness of the Owner-Tester setup in terms of goodness of fit and performance estimation using misidentification risk, a newly suggested performance evaluation metric. Testing results indicate that the approach is robust in terms of both goodness of fit and performance estimation.

Kernel Fukunaga-Koontz Transform Subspaces For Enhanced Face Recognition

Yung-hui Li, Marios Savvides

The traditional linear Fukunaga-Koontz Transform (FKT) [1] is a powerful approach for building discriminative subspaces. Previous work has successfully extended FKT to deal with the small-sample-size problem. In this paper, we extend the traditional linear FKT to work on multi-class problems and in higher-dimensional (kernel) subspaces, thereby providing enhanced discrimination ability. We verify the proposed Kernel Fukunaga-Koontz Transform by demonstrating its effectiveness in face recognition applications; however, the proposed non-linear generalization can be applied to any other domain-specific problems.
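
For readers unfamiliar with the classical transform being generalized, here is a sketch of the linear two-class FKT in NumPy: whiten the summed scatter matrices, then eigendecompose one whitened scatter. Because the whitened class scatters share eigenvectors with eigenvalues summing to one, directions dominant for one class are weakest for the other. The kernel and multi-class extensions proposed in the paper are not shown.

```python
import numpy as np

def fkt(S1, S2, eps=1e-10):
    """Classical linear Fukunaga-Koontz Transform for two classes.

    Whitens S1 + S2, then eigendecomposes the whitened S1. The
    whitened S1 and S2 share eigenvectors, with eigenvalue pairs
    summing to 1, so eigenvalues near 1 (or 0) mark directions
    discriminative for class 1 (or class 2).
    """
    d, V = np.linalg.eigh(S1 + S2)
    keep = d > eps                        # drop the null space of S1 + S2
    P = V[:, keep] / np.sqrt(d[keep])     # whitening operator
    lam, U = np.linalg.eigh(P.T @ S1 @ P)
    return P @ U, lam                     # FKT basis, class-1 eigenvalues

rng = np.random.default_rng(1)
A, B = rng.standard_normal((50, 5)), rng.standard_normal((60, 5))
basis, lam = fkt(np.cov(A.T), np.cov(B.T))
print(lam)   # values near 1 favor class A, near 0 favor class B
```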

An Active Illumination and Appearance (AIA) Model for Face Alignment

Fatih Kahraman, Muhittin Gokmen, Sune Darkner, Rasmus Larsen

Face recognition systems are typically required to work under highly varying illumination conditions. This leads to complex effects imposed on the acquired face image that pertain little to the actual identity. Consequently, illumination normalization is required to reach acceptable recognition rates in face recognition systems. In this paper, we propose an approach that integrates face identity and illumination models under the widely used Active Appearance Model framework, as an extension to the texture model, in order to obtain illumination-invariant face localization.

Robust Face Alignment For Illumination and Pose Invariant Face Recognition

Fatih Kahraman, Binnur Kurt, Muhittin Gokmen

In building a face recognition system for real-life scenarios, one usually faces the problem of selecting a feature space and preprocessing methods, such as alignment, under varying illumination conditions and poses. In this study, we develop a robust face alignment approach based on the Active Appearance Model (AAM) by inserting an illumination normalization module into the standard AAM searching procedure and inserting different poses of the same identity into the training set. The modified AAM search can now handle both illumination and pose variations in the same epoch, and hence provides better convergence in both point-to-point and point-to-curve senses. We also investigate how face recognition performance is affected by the selection of the feature space as well as by the proposed alignment method. The experimental results show that the combined pose alignment and illumination normalization methods increase the recognition rates considerably for all feature spaces.

Abstracts: International Workshop on Semantic Learning Applications in Multimedia (SLAM)

Session 1: Semantic Image Annotation

Kernel Sharing With Joint Boosting For Multi-Class Concept Detection

Wei Jiang, Shih-Fu Chang, Alexander Loui

Object/scene detection by discriminative kernel-based classification has gained great interest due to its promising performance and flexibility. In this paper, unlike traditional approaches that independently build binary classifiers to detect individual concepts, we propose a new framework for multi-class concept detection based on kernel sharing and joint learning. By sharing “good” kernels among concepts, the accuracy of individual weak detectors can be greatly improved; by jointly learning common detectors among classes, the required kernels and the computational complexity for detecting each individual concept can be reduced. We demonstrate our approach by developing an extended JointBoost framework, which is used to choose the optimal kernel and subset of sharing classes in an iterative boosting process. In addition, we construct multi-resolution visual vocabularies by hierarchical clustering and compute kernels based on spatial matching. We tested our method by detecting 12 concepts (objects, scenes, etc.) over 80+ hours of broadcast news video from the challenging TRECVID 2005 corpus. Significant performance gains were achieved: 10% in mean average precision (MAP) and up to 34% in average precision (AP) for some concepts such as maps, building, and boat-ship. Extensive analysis of the results also revealed interesting and important underlying relations among concepts.

Automatic Image Annotation by Ensemble of Visual Descriptors

Emre Akbas, Fatos Yarman Vural

Automatic image annotation systems available in the literature concatenate color, texture and/or shape features into a single feature vector to learn a set of high-level semantic categories using a single learning machine. This approach is quite naive for mapping visual features to the high-level semantic information of the categories. Concatenating many features with different visual properties and wide dynamic ranges may result in curse-of-dimensionality and redundancy problems. Additionally, it usually requires normalization, which may cause undesirable distortion of the feature space. An elegant way of reducing the effects of these problems is to design a dedicated feature space for each image category, depending on its content, and to learn a range of visual properties of the whole image from a variety of feature sets. For this purpose, a two-layer ensemble learning system, called Supervised Annotation by Descriptor Ensemble (SADE), is proposed. SADE initially extracts a variety of low-level visual descriptors from the image. Each descriptor is then fed to a separate learning machine in the first layer. Finally, the meta-layer classifier is trained on the outputs of the first-layer classifiers, and the images are annotated using the decision of the meta-layer classifier. This approach not only avoids normalization, but also reduces the effects of the curse of dimensionality and redundancy. The proposed system outperforms a state-of-the-art automatic image annotation system in an equivalent experimental setup.
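
The two-layer idea is essentially stacking, sketched below with scikit-learn as a stand-in: one first-layer classifier per descriptor set, with a meta-layer trained on their out-of-fold class probabilities. The choice of logistic regression for both layers is illustrative, not the paper's learners.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def train_descriptor_ensemble(descriptor_sets, y):
    """Two-layer ensemble in the spirit of SADE.

    descriptor_sets: list of (n_samples, d_i) matrices, one per
    descriptor type; y: (n_samples,) category labels.
    """
    base = [LogisticRegression(max_iter=1000) for _ in descriptor_sets]
    # Out-of-fold probabilities keep the meta-layer from seeing
    # first-layer training outputs (standard stacking practice).
    meta_feats = np.column_stack([
        cross_val_predict(clf, X, y, cv=3, method="predict_proba")
        for clf, X in zip(base, descriptor_sets)
    ])
    for clf, X in zip(base, descriptor_sets):
        clf.fit(X, y)
    meta = LogisticRegression(max_iter=1000).fit(meta_feats, y)
    return base, meta

def predict(base, meta, descriptor_sets):
    feats = np.column_stack([clf.predict_proba(X)
                             for clf, X in zip(base, descriptor_sets)])
    return meta.predict(feats)

rng = np.random.default_rng(0)
X1, X2 = rng.standard_normal((90, 8)), rng.standard_normal((90, 5))
y = rng.integers(0, 3, 90)
base, meta = train_descriptor_ensemble([X1, X2], y)
print(predict(base, meta, [X1, X2])[:10])
```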

Home Interior Classification using SIFT Keypoint Histograms

Brian Ayers, Matthew Boutell

Semantic scene classification, the process of categorizing photographs into a discrete set of classes using pattern recognition techniques, is a useful ability for image annotation, organization and retrieval. The literature has focused on classifying outdoor scenes such as beaches and sunsets. Here, we focus on a much more difficult problem, that of differentiating between typical rooms in home interiors, such as bedrooms or kitchens. This requires robust image feature extraction and classification techniques, such as SIFT (Scale-Invariant Feature Transform) features and Adaboost classifiers. To this end, we derived SIFT keypoint histograms, an efficient image representation that utilizes variance information from linear discriminant analysis. We compare SIFT keypoint histograms with other features such as spatial color moments and compare Adaboost with Support Vector Machine classifiers. We outline the various techniques used, show their advantages, disadvantages, and actual performance, and determine the most effective algorithm of those tested for home interior classification. Furthermore, we present results of pairwise classification of 7 rooms typically found in homes.
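
A generic bag-of-keypoints construction, shown below, conveys the flavor of the representation: cluster SIFT-like descriptors into a vocabulary and describe each image by its normalized keypoint-assignment histogram. The paper's LDA-based variance weighting is not reproduced, and the random vectors stand in for real SIFT descriptors.

```python
import numpy as np
from sklearn.cluster import KMeans

def keypoint_histograms(train_descs, image_descs, k=50):
    """Bag-of-keypoints histograms over SIFT-like descriptors.

    train_descs: (N, 128) stacked descriptors from training images;
    image_descs: list of (n_i, 128) arrays, one per image.
    """
    vocab = KMeans(n_clusters=k, n_init=3, random_state=0).fit(train_descs)
    hists = []
    for d in image_descs:
        words = vocab.predict(d)                      # visual-word labels
        h = np.bincount(words, minlength=k).astype(float)
        hists.append(h / max(h.sum(), 1.0))           # L1-normalize per image
    return np.vstack(hists)

rng = np.random.default_rng(0)
pool = rng.standard_normal((500, 128))
imgs = [rng.standard_normal((int(rng.integers(20, 40)), 128)) for _ in range(3)]
print(keypoint_histograms(pool, imgs, k=10).shape)    # (3, 10)
```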

Session 2: Object and Event Recognition

Recognizing Groceries in situ Using in vitro Training Data

Michele Merler, Carolina Galleguillos, Serge Belongie

The problem of using pictures of objects captured under ideal imaging conditions (here referred to as in vitro) to recognize objects in natural environments (in situ) is an emerging area of interest in computer vision and pattern recognition. Examples of tasks in this vein include assistive vision systems for the blind and object recognition for mobile robots; the proliferation of image databases on the web is bound to lead to more examples in the near future. Despite its importance, there is still a need for a freely available database to facilitate the study of this kind of training/testing dichotomy. In this work, one of our contributions is a new multimedia database of 120 grocery products, GroZi-120. For every product, two different recordings are available: in vitro images extracted from the web, and in situ images extracted from camcorder video collected inside a grocery store. As an additional contribution, we present the results of applying three commonly used object recognition/detection algorithms (color histogram matching, SIFT matching, and boosted Haar-like features) to the dataset. Finally, we analyze the successes and failures of these algorithms against product type and imaging conditions, both in terms of recognition rate and localization accuracy, in order to suggest ways forward for further research in this domain.

Hierarchical Recognition of Human Activities Interacting with Objects

Michael Ryoo, Jake Aggarwal

The paper presents a system that recognizes humans interacting with objects. We delineate a new framework that integrates object recognition, motion estimation, and semantic-level recognition for the reliable recognition of hierarchical human-object interactions. The framework is designed to integrate the recognition decisions made by each component, and to probabilistically compensate for the failure of one component using the decisions made by the others. As a result, human-object interactions in an airport-like environment, such as ‘a person carrying a baggage’, ‘a person leaving his/her baggage’, or ‘a person snatching another's baggage’, are recognized. The experimental results show that not only is the performance of the final activity recognition superior to that of previous approaches, but the accuracy of the object recognition and the motion estimation also increases using feedback from the semantic layer. Several real examples illustrate the superior performance in recognition and semantic description of occurring events.

Accurate Dynamic Sketching of Faces from Video

Zijian Xu, Jiebo Luo

A sketch captures the most informative part of an object in a much more concise and potentially robust representation (e.g., for face recognition or new capabilities for manipulating faces). We have previously developed a framework for generating face sketches from still images. A more interesting question is: can we generate an animated sketch from video? We adopt the same hierarchical compositional graph model originally developed for still images for face representation, where each graph node corresponds to a multimodal model of a certain facial feature (e.g., closed mouth, open mouth, and wide-open mouth). To enforce temporal-spatial consistency and improve tracking efficiency, we constrain the transition of a graph node to be only between immediately neighboring modes (e.g., from closed mouth to open mouth but not to wide-open mouth), as well as by its corresponding parts in the neighboring frames. To improve the matching accuracy, we model the local structure of a given mode as a shape-constrained Markov network (SCMN) of image patches. Preliminary results show accurate sketching from video clips.

Scene Segmentation and Categorization Using NCuts

YanJun Zhao, Tao Wang, Peng Wang, Yangzhou Du

For video summarization and retrieval, an important module is scene segmentation, which groups temporally and spatially coherent shots into high-level semantic video clips. In this paper, we propose a novel scene segmentation and categorization approach using normalized graph cuts (NCuts). Starting from a set of shots, we first calculate shot similarity from shot key frames. Then, modeling scene segmentation as a graph partition problem in which each node is a shot and each edge weight represents the similarity between two shots, we employ NCuts to find the optimal scene segmentation and automatically decide the optimal number of scenes using a Q function. To discover more useful information from scenes, we analyze the temporal layout patterns of shots and automatically categorize scenes into two different types, i.e., parallel event scenes and serial event scenes. Extensive experiments were conducted on movies and TV series. The promising results demonstrate that the proposed NCuts-based scene segmentation and categorization methods are effective in practice.
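
The graph formulation can be sketched as follows, using scikit-learn's SpectralClustering as a stand-in for an NCuts solver. Edge weights combine key-frame similarity with temporal proximity; the Q function that selects the number of scenes automatically is not reproduced, so the scene count is passed in.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def segment_scenes(shot_feats, n_scenes, sigma=1.0, tau=5.0):
    """Group shots into scenes by spectral (NCuts-style) partitioning.

    shot_feats: (n_shots, d) key-frame features. Edge weights mix
    visual similarity with temporal proximity so a scene stays a
    roughly contiguous run of shots.
    """
    X = np.asarray(shot_feats, float)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # visual distances
    idx = np.arange(len(X))
    t2 = (idx[:, None] - idx[None, :]) ** 2               # temporal distances
    W = np.exp(-d2 / (2 * sigma**2)) * np.exp(-t2 / (2 * tau**2))
    return SpectralClustering(n_clusters=n_scenes, affinity="precomputed",
                              random_state=0).fit_predict(W)

rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(0, .1, (6, 4)), rng.normal(3, .1, (6, 4))])
print(segment_scenes(feats, 2))   # two blocks of shots, two scenes
```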

Session 3: Multimedia Search and Retrieval

Fusing Local Image Descriptors for Large-Scale Image Retrieval

Eva Hoerster, Rainer Lienhart

Online image repositories such as Flickr contain hundreds of millions of images and are growing quickly. Along with that, the need to support indexing, searching and browsing is becoming more and more pressing. Here we employ the image content as a source of information to retrieve images, and study the representation of images by topic models for content-based image retrieval. We focus on incorporating different types of visual descriptors into the topic modeling context. Three different fusion approaches are explored. The image representations for each fusion approach are learned in an unsupervised fashion, and each image is modeled as a mixture of topics/object parts depicted in the image. However, not all object classes will benefit from all visual descriptors. Therefore, we also investigate which visual descriptor (set) is most appropriate for each of the twelve classes under consideration. We evaluate the presented models on a real-world image database consisting of more than 246,000 images.

Diverse Active Ranking for Multimedia Search

Shyamsundar Rajaram, Charlie Dagli, Nemanja Petrovic, Thomas Huang

Interactively learning from a small sample of unlabeled examples is an enormously challenging task, one that often arises in vision applications. Relevance feedback and, more recently, active learning are two standard techniques that have received much attention towards solving this interactive learning problem. How to best utilize the user's effort for labeling, however, remains unanswered. It has been shown in the past that labeling a diverse set of points is helpful; however, the notion of diversity has either been dependent on the learner used or computationally expensive. In this paper, we address these issues in the bipartite ranking setting. First, we introduce a scheme for picking the query set to be labeled by an oracle so that it aids us in learning the ranker in as few active learning rounds as possible. Secondly, we propose a fundamentally motivated, information-theoretic view of diversity and its use in a fast, non-degenerate active learning-based relevance feedback setting. Finally, we report comparative testing and results in a real-time image retrieval setting.

Using Group Prior to Identify People in Consumer Images

Andrew Gallagher, Tsuhan Chen

While face recognition techniques have rapidly advanced in the last few years, most of the work is in the domain of security applications. For consumer imaging applications, person recognition is an important tool that is useful for searching and retrieving images from a personal image collection. It has been shown that when recognizing a single person in an image, a maximum likelihood classifier requires the prior probability for each candidate individual. In this paper, we extend this idea and describe the benefits of using a group prior for identifying people in consumer images with multiple people. The group prior describes the probability of a group of individuals appearing together in an image. In our application, we have a subset of ambiguously labeled images for a consumer image collection, where we seek to identify all of the people in the collection. We describe a simple algorithm for resolving the ambiguous labels. We show that despite errors in resolving ambiguous labels, useful classifiers can be trained with the resolved labels. Recognition performance is further improved with a group prior learned from the ambiguous labels. In summary, by modeling the relationships between the people with the group prior, we improve classification performance.
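
The role of the group prior can be illustrated with a brute-force joint labeling over the faces in one image: each assignment is scored by the product of per-face likelihoods and the prior probability of that group co-occurring. The factorization and names are illustrative, not the paper's exact model.

```python
from itertools import permutations

def identify_group(face_likelihoods, group_prior):
    """Jointly label the people in one image using a group prior.

    face_likelihoods: list (one per detected face) of
    {candidate: P(face features | candidate)} dictionaries.
    group_prior: {frozenset of candidates: P(group appears together)}.
    Brute force is fine for the few faces in a typical photo.
    """
    candidates = list(face_likelihoods[0])
    best, best_score = None, -1.0
    for assign in permutations(candidates, len(face_likelihoods)):
        score = group_prior.get(frozenset(assign), 1e-6)   # unseen groups
        for face, person in zip(face_likelihoods, assign):
            score *= face.get(person, 1e-6)
        if score > best_score:
            best, best_score = assign, score
    return best

faces = [{"ann": 0.7, "bob": 0.2, "eve": 0.1},
         {"ann": 0.3, "bob": 0.4, "eve": 0.3}]
prior = {frozenset({"ann", "eve"}): 0.5, frozenset({"ann", "bob"}): 0.1}
print(identify_group(faces, prior))   # the prior pulls toward (ann, eve)
```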

Session 4: Audio and Video Analysis

A Multimodality Framework for Creating Speaker/Non-Speaker Profile

Jehanzeb Abbas, Charlie Dagli, Thomas Huang

We propose a complete solution to full-modality person profiling for speakers and sub-modality person profiling for non-speakers in real-world videos. This is a step towards building an elaborate database of face, name and voice correspondences for speakers appearing in news videos. In addition, we are interested in a name-and-face correspondence database for non-speakers who appear during voice-overs. We use an unsupervised technique for creating a speaker identification database, and a unique primary feature matching and parallel line matching algorithm for creating a non-speaker identification database. We tested our approach on real-world data, and the results show good performance on news videos. The solution can be incorporated as part of a larger multimedia news video analysis system, or of a multimedia search system for efficient news video retrieval and browsing.

Segmental Hidden Markov Models for View-based Sport Video Analysis

Yi Ding, Guoliang Fan

We present a generative model approach to explore intrinsic semantic structures in sport videos, e.g., the camera view in American football games. We will invoke the concept of semantic space to explicitly define the semantic structure in the video in terms of latent states. A dynamic model is used to govern the transition between states, and an observation model is developed to characterize visual features pertaining to different states. Then the problem is formulated as a statistical inference process where we want to infer latent states (i.e., camera views) from observations (i.e., visual features). Two generative models, the hidden Markov model (HMM) and the Segmental HMM (SHMM), are involved in this research. In the HMM, both latent states and visual features are shot-based, and in the SHMM, latent states and visual features are defined for shots and frames respectively. Both models provide promising performance for view-based shot classification, and the SHMM outperforms the HMM by involving a two-layer observation model to accommodate the variability of visual features. This approach is also applicable to other video mining tasks.

Salient Object Detection on Large-Scale Video Data

Hong Lu

Recently, more and more research has focused on concept extraction from unstructured video data. To bridge the semantic gap between low-level features and high-level video concepts, a mid-level understanding of the video content is built: salient objects are detected based on image segmentation and machine learning techniques. Specifically, 21 salient object detectors are developed and tested on the TRECVID 2005 development video corpus. In addition, a boosting method is proposed to select the most representative features, achieving higher performance than using a single modality only, and lower complexity than taking all features into account.

CVPR 2007 Abstracts

Matching and Features

Learning Visual Similarity Measures for Comparing Never Seen Objects

Eric Nowak and Frederic Jurie

In this paper we propose and evaluate an algorithm that learns a similarity measure for comparing never-seen objects. The measure is learned from pairs of training images labeled “same” or “different”. This is far less informative than the commonly used individual image labels (e.g., “car model X”), but it is cheaper to obtain. The proposed algorithm learns the characteristic differences between local descriptors sampled from pairs of “same” and “different” images. These differences are vector quantized by an ensemble of extremely randomized binary trees, and the similarity measure is computed from the quantized differences. The extremely randomized trees are fast to learn, robust due to the redundant information they carry, and have been proven to be very good clusterers. Furthermore, the trees efficiently combine different feature types (SIFT and geometry). We evaluate our innovative similarity measure on four very different datasets and consistently outperform the state-of-the-art competitive approaches.
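
A simplified stand-in for the tree-based approach: train extremely randomized trees directly on absolute descriptor differences labeled same/different, and read similarity off the predicted "same" probability. The paper instead quantizes the differences with the trees and computes the measure from the quantized codes; this sketch only conveys the flavor.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def learn_similarity(pairs_a, pairs_b, same):
    """Learn a 'same vs. different' similarity on descriptor pairs.

    pairs_a, pairs_b: (n, d) descriptor arrays, row i forming a pair;
    same: (n,) binary labels. Returns a similarity function.
    """
    diffs = np.abs(np.asarray(pairs_a) - np.asarray(pairs_b))
    forest = ExtraTreesClassifier(n_estimators=50, random_state=0)
    forest.fit(diffs, same)
    return lambda a, b: forest.predict_proba(
        np.abs(np.asarray(a) - np.asarray(b)).reshape(1, -1))[0, 1]

rng = np.random.default_rng(0)
a = rng.standard_normal((200, 16))
b_same = a + 0.05 * rng.standard_normal(a.shape)   # perturbed "same" pairs
b_diff = rng.standard_normal(a.shape)              # unrelated "different" pairs
A, B = np.vstack([a, a]), np.vstack([b_same, b_diff])
y = np.array([1] * 200 + [0] * 200)
sim = learn_similarity(A, B, y)
print(sim(a[0], b_same[0]), sim(a[0], b_diff[0]))
```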

A contextual dissimilarity measure for accurate and efficient image search

Hervé Jégou, Hedi Harzallah, and Cordelia Schmid

In this paper we present two contributions to improve accuracy and speed of an image search system based on bag-of-features: a contextual dissimilarity measure (CDM) and an efficient search structure for visual word vectors. Our measure (CDM) takes into account the local distribution of the vectors and iteratively estimates distance correcting terms. These terms are subsequently used to update an existing distance, thereby modifying the neighborhood structure. Experimental results on the Nistér-Stewénius dataset show that our approach significantly outperforms the state-of-the-art in terms of accuracy. Our efficient search structure for visual word vectors is a two-level scheme using inverted files. The first level partitions the image set into clusters of images. At query time, only a subset of clusters of the second level has to be searched. This method allows fast querying in large sets of images. We evaluate the gain in speed and the loss in accuracy on large datasets (up to 500k images).

Learning Local Image Descriptors

Simon Winder and Matthew Brown

In this paper we study interest point descriptors for image matching and 3D reconstruction. We examine the building blocks of descriptor algorithms and evaluate numerous combinations of components. Various published descriptors such as SIFT, GLOH, and Spin Images can be cast into our framework. For each candidate algorithm we learn good choices for parameters using a training set consisting of patches from a multi-image 3D reconstruction where accurate ground-truth matches are known. The best descriptors were those with log polar histogramming regions and feature vectors constructed from rectified outputs of steerable quadrature filters. At a 95% detection rate these gave one third of the incorrect matches produced by SIFT.

Principal Curvature-Based Region Detector for Object Recognition

Hongli Deng, Wei Zhang, Eric Mortensen, Thomas Dietterich, and Linda Shapiro

This paper presents a new structure-based interest region detector called Principal Curvature-Based Regions (PCBR), which we use for object class recognition. The PCBR interest operator detects stable watershed regions in the multi-scale principal curvature image. To detect robust watershed regions, we "clean" the principal curvature image using a combination of grayscale morphological closing and a new "eigenvector flow" hysteresis thresholding. Robustness across scales is achieved by selecting the maximally stable regions across consecutive scales. PCBR typically detects distinctive patterns distributed evenly over the objects, and it shows significant robustness to local intensity perturbations and intra-class variations. We evaluate PCBR both qualitatively (through visual inspection) and quantitatively (by measuring repeatability and classification accuracy on real-world object-class recognition problems). Experiments on different benchmark datasets show that PCBR is comparable or superior to state-of-the-art detectors for feature matching and object recognition problems. Moreover, we demonstrate the application of PCBR to symmetry detection.

Image Matching via Salient Region Correspondences

Alexander Toshev, Jianbo Shi, and Kostas Daniilidis

We introduce the notion of co-saliency for image matching. Our matching algorithm combines the discriminative power of feature correspondences with the descriptive power of matching segments. Co-saliency matching score favors correspondences that are consistent with 'soft' image segmentation as well as with local point feature matching. We express the matching model via a joint image graph (JIG) whose edge weights represent intra- as well as inter-image relations. The dominant spectral components of this graph lead to simultaneous pixel-wise alignment of the images and saliency-based synchronization of 'soft' image segmentation. The co-saliency score function, which characterizes these spectral components, can be directly used as a similarity metric as well as a positive feedback for updating and establishing new point correspondences. We present experiments showing the extraction of matching regions and pointwise correspondences, and the utility of the global image similarity in the context of place recognition.

Motion Segmentation and Tracking

A Benchmark for the Comparison of 3-D Motion Segmentation Algorithms

Roberto Tron and Rene Vidal

Over the past few years, several methods for segmenting a scene containing multiple rigidly moving objects have been proposed. However, most existing methods have been tested on a handful of sequences only, and each method has been often tested on a different set of sequences. Therefore, the comparison of different methods has been fairly limited. In this paper, we compare four 3-D motion segmentation algorithms for affine cameras on a benchmark of 155 motion sequences of checkerboard, traffic, and articulated scenes.

Two-view Motion Segmentation from Linear Programming Relaxation

Hongdong Li

This paper studies the problem of multibody motion segmentation, an important but challenging problem due to its well-known chicken-and-egg-type recursive character. We propose a new Mixture-of-Fundamental-Matrices model to describe multibody motions from two views. Based on maximum likelihood estimation, in conjunction with a random sampling scheme, we show that the problem can be naturally formulated as a Linear Programming (LP) problem. Consequently, the motion segmentation problem can be solved efficiently by linear programming relaxation. Experiments demonstrate that, without assuming the actual number of motions, our method produces accurate segmentation results. The LP formulation also has other advantages, such as the ease of handling outliers and of enforcing prior knowledge.

A Nonparametric Treatment for Location/Segmentation Based Visual Tracking

Le Lu and Gregory Hager

In this paper, we address two closely related visual tracking problems: 1) localizing a target's position in low- or moderate-resolution videos, and 2) segmenting a target's image support in moderate- to high-resolution videos. Both tasks are treated as an online binary classification problem using dynamic foreground/background appearance models. Our major contribution is a novel nonparametric approach that successfully maintains a temporally changing appearance model for both foreground and background. The appearance models are formulated as “bags of image patches” that approximate the true two-class appearance distributions. They are maintained using a temporally adaptive importance resampling procedure based on simple nonparametric statistics of the appearance patch bags. The overall framework is independent of any specific foreground/background classification process and thus offers the freedom to use different classifiers. We demonstrate the effectiveness of our approach with extensive comparative experimental results on sequences from previous visual tracking and video matting work as well as on our own data.

A Lagrangian Particle Dynamics Approach for Crowd Flow Segmentation and Stability Analysis

Saad Ali and Mubarak Shah

This paper proposes a framework in which Lagrangian Particle Dynamics is used for the segmentation of high density crowd flows and detection of flow instabilities. For this purpose, a flow field generated by a moving crowd is treated as an aperiodic dynamical system. A grid of particles is overlaid on the flow field, and is advected using a numerical integration scheme. The evolution of particles through the flow is tracked using a Flow Map, whose spatial gradients are subsequently used to setup a Cauchy Green Deformation tensor for quantifying the amount by which the neighboring particles have diverged over the length of the integration. The maximum eigenvalue of the tensor is used to construct a Finite Time Lyapunov Exponent (FTLE) field, which reveals the Lagrangian Coherent Structures (LCS) present in the underlying flow. The LCS divide flow into regions of qualitatively different dynamics and are used to locate boundaries of the flow segments in a normalized cuts framework. Any change in the number of flow segments over time is regarded as an instability, which is detected by establishing correspondences between flow segments over time. The experiments are conducted on a challenging set of videos taken from Google Video and a National Geographic documentary.
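
The FTLE computation chain described above (advect a particle grid, differentiate the flow map, form the Cauchy-Green tensor, take its largest eigenvalue) can be sketched in NumPy as below. For simplicity the velocity field is assumed stationary and advection is plain Euler stepping, whereas the paper integrates particles through a time-varying flow estimated from video.

```python
import numpy as np

def ftle(flow_u, flow_v, T=5.0, n_steps=25):
    """Finite Time Lyapunov Exponent field for a stationary
    velocity field given in pixels per frame."""
    H, W = flow_u.shape
    py, px = np.mgrid[0:H, 0:W].astype(float)   # particle grid positions
    dt = T / n_steps
    for _ in range(n_steps):                    # Euler advection of the grid
        ix = np.clip(px, 0, W - 1).astype(int)
        iy = np.clip(py, 0, H - 1).astype(int)
        px = px + dt * flow_u[iy, ix]
        py = py + dt * flow_v[iy, ix]
    # Flow-map gradients (np.gradient returns d/drow, then d/dcol).
    dpx_dy, dpx_dx = np.gradient(px)
    dpy_dy, dpy_dx = np.gradient(py)
    # Cauchy-Green deformation tensor C = F^T F.
    C11 = dpx_dx**2 + dpy_dx**2
    C12 = dpx_dx * dpx_dy + dpy_dx * dpy_dy
    C22 = dpx_dy**2 + dpy_dy**2
    # Largest eigenvalue of the symmetric 2x2 tensor, then FTLE.
    lam = 0.5 * (C11 + C22 + np.sqrt((C11 - C22) ** 2 + 4 * C12 ** 2))
    return np.log(np.sqrt(np.maximum(lam, 1e-12))) / abs(T)

# Two opposing lanes of motion: the FTLE ridge marks their boundary.
u = np.fromfunction(lambda i, j: np.sign(i - 32.0), (64, 64))
v = np.zeros((64, 64))
print(ftle(u, v).max())
```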

Differential Camera Tracking through Linearizing the Local Appearance Manifold

Hua Yang, Marc Pollefeys, Greg Welch, Jan-Michael Frahm, and Adrian Ilie

The appearance of a scene is a function of the scene contents, the lighting, and the camera pose. A set of n-pixel images of a non-degenerate scene captured from different perspectives lies on a 6D nonlinear manifold in R^n. In general, this nonlinear manifold is complicated and numerous samples are required to learn it globally. In this paper, we present a novel method and some preliminary results for incrementally tracking camera motion through sampling and linearizing the local appearance manifold. At each frame time, we use a cluster of calibrated and synchronized small baseline cameras to capture scene appearance samples at different camera poses. We compute a first-order approximation of the appearance manifold around the current camera pose. Then, as new cluster samples are captured at the next frame time, we estimate the incremental camera motion using a linear solver. By using intensity measurements and directly sampling the appearance manifold, our method avoids the commonly-used feature extraction and matching processes, and does not require 3D correspondences across frames. Thus it can be used for scenes with complicated surface materials, geometries, and view-dependent appearance properties, situations where many other camera tracking methods would fail.

Poster session 1

Learning and Pattern Recognition 1

Learning Gaussian Conditional Random Fields for Low-Level Vision

Marshall Tappen, Ce Liu, William Freeman, and Edward Adelson

Markov Random Field (MRF) models are a popular tool for vision and image processing. Gaussian MRF models are particularly convenient to work with because they can be implemented using matrix and linear algebra routines. However, recent research has focused on discrete-valued and non-convex MRF models because Gaussian models tend to over-smooth images and blur edges. In this paper, we show how to train a Gaussian Conditional Random Field (GCRF) model that overcomes this weakness and can outperform the non-convex Field of Experts model on the task of denoising images. A key advantage of the GCRF model is that the parameters of the model can be optimized efficiently on relatively large images. The competitive performance of the GCRF model and the ease of optimizing its parameters make the GCRF model an attractive option for vision and image processing applications.

Mapping Natural Image Patches by Explicit and Implicit Manifolds

Kent Shi and Song-Chun Zhu

Image patches are fundamental elements for object modeling and recognition. However, there has not been a panoramic study of the structures of the whole ensemble of natural image patches in the literature. In this article, we study the structures of this ensemble by mapping natural image patches into two types of subspaces which we call “explicit manifolds” and “implicit manifolds” respectively. On explicit manifolds, one finds those simple and regular image primitives, such as edges, bars, corners and junctions. On implicit manifolds, one finds those complex and stochastic image patches, such as textures and clutters. On different types of manifolds, different perceptual metrics are used. We propose a method for learning a probabilistic distribution on the space of patches by pursuing both types of manifolds using a common information theoretical criterion. The connection between the two types of manifolds is realized by image scaling, which changes the entropy of the image patches. The explicit manifolds live in low entropy regimes while the implicit manifolds live in high entropy regimes. We study the transition between the two types of manifolds over scale and show that the complexity of the manifolds peaks in a middle entropy regime.

Hierarchical Structuring of Data on Manifolds

Jun Li and Pengwei Hao

Manifold learning methods are promising data analysis tools. However, to locate a new test sample on the manifold, one has to find its embedding by making use of the learned embedded representation of the training samples. This process often involves accessing a considerable volume of data for large sample sets. In this paper, an approach for selecting “landmark points” from the given samples is proposed for hierarchical structuring of data on manifolds. The selection is made such that if one uses the Voronoi diagram generated by the landmark points in the ambient space to partition the embedded manifold, the topology of the manifold is preserved. The landmark points are then used to recursively construct a hierarchical structure of the data, which can speed up queries in a manifold data set. It is a general framework that fits any manifold learning algorithm as long as its result on an input can be predicted from the results on neighboring inputs. Compared to existing techniques for organizing data based on spatial partitioning, our method preserves the topology of the latent space of the data. Unlike manifold learning algorithms that use landmark points to reduce complexity, our approach is designed for fast retrieval of samples. It may find its way into high-dimensional data analysis tasks such as indexing, clustering, and progressive compression. More importantly, it extends manifold learning methods to applications in which they were previously considered not fast enough. Our algorithm is stable and fast, and its validity is proved mathematically.

Learning GMRF Structures for Spatial Priors

Lie Gu, Eric Xing, and Takeo Kanade

We present a method that learns sparse spatial dependencies among parts of visual objects using GMRFs, and we propose a greedy search algorithm that takes advantage of the graph sparsity to perform the object recognition task. The graph structure learned from training data characterizes both the statistical dependencies among object parts and the intrinsic physical structure of the object. We demonstrate that the corresponding GMRF model can faithfully capture the spatial constraints of object parts revealed in the training data and the key geometrical deformations. We illustrate the representational power of the model by drawing samples and comparing them with those of other models with pre-specified graph structures. We also demonstrate that the graph leads to an efficient solution for localizing an object and its parts in new images.

Trace Ratio vs. Ratio Trace for Dimensionality Reduction

Huan Wang, Shuicheng Yan, Dong Xu, Xiaoou Tang, and Thomas Huang

A large family of algorithms for dimensionality reduction ends with solving a Trace Ratio problem of the form $\arg\max_W \mathrm{Tr}(W^T S_p W)/\mathrm{Tr}(W^T S_l W)$, which is generally transformed into the corresponding Ratio Trace form $\arg\max_W \mathrm{Tr}[(W^T S_l W)^{-1}(W^T S_p W)]$ to obtain a closed-form but inexact solution. In this work, an efficient iterative procedure is presented to directly solve the Trace Ratio problem. In each step, a Trace Difference problem $\arg\max_W \mathrm{Tr}[W^T (S_p - \lambda S_l) W]$ is solved, with $\lambda$ being the trace ratio value computed from the previous step. Convergence of the projection matrix $W$, as well as the global optimum of the trace ratio value $\lambda$, are proven based on point-to-set map theories. In addition, this procedure is further extended to solve trace ratio problems with the more general constraint $W^T C W = I$ and to provide exact solutions for kernel-based subspace learning problems. Extensive experiments on face and UCI data demonstrate the high convergence speed of the proposed solution, as well as its superiority in classification capability over the corresponding solutions to the ratio trace problem.
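
The iteration is straightforward to implement; a sketch under the stated setup (symmetric $S_p$, positive-definite $S_l$) follows. Each step takes the top-dim eigenvectors of $S_p - \lambda S_l$ and updates $\lambda$ to the new trace ratio value.

```python
import numpy as np

def trace_ratio(Sp, Sl, dim, n_iter=50, tol=1e-10):
    """Iterative trace ratio optimization as the abstract describes:
    each step solves the trace difference problem for orthonormal W
    (top eigenvectors of Sp - lam*Sl), with lam the trace ratio value
    from the previous step."""
    n = Sp.shape[0]
    W = np.eye(n)[:, :dim]            # any orthonormal initialization
    lam = 0.0
    for _ in range(n_iter):
        _, V = np.linalg.eigh(Sp - lam * Sl)
        W = V[:, -dim:]               # top-dim eigenvectors
        new_lam = np.trace(W.T @ Sp @ W) / np.trace(W.T @ Sl @ W)
        if abs(new_lam - lam) < tol:
            break
        lam = new_lam
    return W, lam

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8)); Sp = A @ A.T
B = rng.standard_normal((8, 8)); Sl = B @ B.T + np.eye(8)
W, lam = trace_ratio(Sp, Sl, dim=2)
print(lam)   # converged trace ratio value
```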

Element Rearrangement for Tensor-Based Subspace Learning

Shuicheng Yan, Dong Xu, Stephen Lin, Thomas Huang, and Shih-Fu Chang

The success of tensor-based subspace learning depends heavily on reducing correlations along the column vectors of each mode-k flattened matrix. In this work, we study the problem of rearranging elements within a tensor in order to maximize these correlations, so that information redundancy in image data can be more extensively removed by existing tensor-based dimensionality reduction algorithms. An efficient iterative algorithm is proposed to tackle this essentially integer optimization problem. In each step, the tensor structure is refined with a spatially-constrained Earth Mover's Distance procedure that incrementally rearranges tensors to become more similar to their low rank approximations, which have high correlation along the tensor directions. Monotonic convergence of the algorithm is proven using an auxiliary function analogous to that used for proving convergence of the Expectation-Maximization algorithm. In addition, we present an extension of the algorithm for conducting supervised subspace learning with tensor data. Experiments in both unsupervised and supervised subspace learning demonstrate the effectiveness of our proposed algorithms in improving data compression performance and classification accuracy.

Incremental Linear Discriminant Analysis Using Sufficient Spanning Set Approximations

Tae-Kyun Kim, Shu-Fai Wong, Bjorn Stenger, Josef Kittler, and Roberto Cipolla

This paper presents a new incremental learning solution for Linear Discriminant Analysis (LDA). We apply the concept of the sufficient spanning set approximation in each update step, i.e. for the between-class scatter matrix, the projected data matrix as well as the total scatter matrix. The algorithm yields a more general and efficient solution to incremental LDA than previous methods. It also significantly reduces the computational complexity while providing a solution which closely agrees with the batch LDA result. The proposed algorithm has a time complexity of O(Nd²) and requires O(Nd) space, where d is the reduced subspace dimension and N the data dimension. We show two applications of incremental LDA: First, the method is applied to semi-supervised learning by integrating it into an EM framework. Secondly, we apply it to the task of merging large databases which were collected during MPEG standardization for face image retrieval.

Unsupervised Clustering using Multi-resolution Perceptual Grouping

Tanveer Syeda-Mahmood and Fei Wang

Clustering is a common operation for data partitioning in many practical applications. Often, such data distributions exhibit higher level structures which are important for problem characterization, but are not explicitly discovered by existing clustering algorithms. In this paper, we introduce multi-resolution perceptual grouping as an approach to unsupervised clustering. Specifically, we use the perceptual grouping constraints of proximity, density, contiguity and orientation similarity. We apply these constraints in a multi-resolution fashion, to group sample points in high dimensional spaces into salient clusters. We present an extensive evaluation of the clustering algorithm against state-of-the-art supervised and unsupervised clustering methods on large datasets.

Optical Flow and Tracking 1

Object Tracking by Asymmetric Kernel Mean Shift with Automatic Scale and Orientation Selection

Alper Yilmaz

Tracking objects using the mean shift method is performed by iteratively translating a kernel in the image space such that the past and current object observations are similar. The traditional mean shift method requires a symmetric kernel, such as a circle or an ellipse, and assumes constancy of the object scale and orientation during the course of tracking. In a tracking scenario, it is not uncommon to observe objects with complex shapes whose scale and orientation constantly change due to camera and object motions. In view of this observation, the requirements of the traditional mean shift method are usually not met while tracking such objects. In this paper, we present an object tracking method based on the asymmetric kernel mean shift, in which the scale and orientation of the kernel adaptively change depending on the observations at each iteration. The proposed method extends traditional mean shift tracking, which is performed in the spatial image coordinates, by including the scale and orientation as additional dimensions, and simultaneously estimates all the unknowns in a small number of mean shift iterations. The experimental results show that the proposed method is superior to traditional mean shift tracking in the following aspects: 1) it provides consistent object tracking throughout the video; 2) it is not affected by the scale and orientation changes of the tracked objects; 3) it is less prone to background clutter.

Free-Form Nonrigid Image Registration Using Generalized Elastic Nets

Andriy Myronenko, Xubo Song, and Miguel Carreira-Perpinan

We introduce a novel probabilistic approach for nonparametric nonrigid image registration using generalized elastic nets, a model previously used for topographic maps. The idea of the algorithm is to adapt an elastic net (a constrained Gaussian mixture) in the spatial-intensity space of one image to fit the second image. The resulting net directly represents the correspondence between image pixels in a probabilistic way and recovers the underlying image deformation. We regularize the net with a differential prior and develop an efficient optimization algorithm using linear conjugate gradients. The nonparametric formulation allows for complex transformations having local deformation. The method is generally applicable to registering point sets of arbitrary features. The accuracy and effectiveness of the method are demonstrated on different medical image and point set registration examples with locally nonlinear underlying deformations.

Improved Video Registration using Non-Distinctive Local Image Features

Robin Hess and Alan Fern

The task of registering video frames with a static model is a common problem in many computer vision domains. The standard approach to registration involves finding point correspondences between the video and the model and using those correspondences to numerically determine registration transforms. Current methods locate video-to-model point correspondences by assembling a set of reference images to represent the model and then detecting and matching invariant local image features between the video frames and the set of reference images. These methods work well when all video frames can be guaranteed to contain a sufficient number of distinctive visual features. However, as we demonstrate, these methods are prone to severe misregistration errors in domains where many video frames lack distinctive image features. To overcome these errors, we introduce a concept of local distinctiveness which allows us to find model matches for nearly all video features, regardless of their distinctiveness on a global scale. We present results from the American football domain—where many video frames lack distinctive image features—which show a drastic improvement in registration accuracy over current methods. In addition, we introduce a simple, empirical stability test that allows our method to be fully automated. Finally, we present a registration dataset from the American football domain that we hope can be used as a benchmarking tool for registration methods.

Robust Estimation of Texture Flow Distortion via Dense Feature Sampling

Yu-Wing Tai, Michael Brown, and Chi-Keung Tang

Texture distortion flow measured in terms of scale and orientation is invaluable in texture analysis, segmentation, shape-from-texture, and texture remapping. This paper describes a novel and effective technique to estimate a texture distortion map given a small example patch. The key idea consists of sampling a dense set of features for all pixels in the example patch, where all discrete orientations are encapsulated into the feature vector such that texture rotation can be simulated as a linear shift of the feature vector. The resulting feature space is compressed by PCA and clustered using EM to produce a set of principal features. These principal features are used to compute the per-pixel scale and orientation likelihoods for a distorted texture in a new image. The final texture distortion flow field is formulated as the MAP solution of a labeling Markov network, which is solved by belief propagation. Experimental results on both synthetic and real distorted texture images demonstrate good results even for highly distorted examples. We also demonstrate the usefulness of our extracted flow field in texture remapping.

Multiple Target Tracking Using Spatio-Temporal Markov Chain Monte Carlo Data Association

Qian Yu, Gérard Medioni, and Isaac Cohen

We propose a framework for general multiple target tracking, where the input is a set of candidate regions in each frame, as obtained from state-of-the-art background learning, and the goal is to recover trajectories of targets over time from noisy observations. Due to occlusions by targets and static objects, noisy segmentation and false alarms, one foreground region may not correspond to one target faithfully. Therefore the one-to-one assumption used in most data association algorithms is not always satisfied. Our method overcomes the one-to-one assumption by formulating the visual tracking problem in terms of finding the best spatial and temporal association of observations, which maximizes the consistency of both motion and appearance of trajectories. To avoid enumerating all possible solutions, we take a Data Driven Markov Chain Monte Carlo (DD-MCMC) approach to sample the solution space efficiently. The sampling is driven by an informed proposal scheme controlled by a joint probability model combining motion and appearance. To ensure that the Markov chain converges to the desired distribution, we propose an automatic approach to determine the parameters in the target distribution. Comparative experiments with quantitative evaluations are provided.
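
The DD-MCMC machinery rests on a standard Metropolis-Hastings loop; the generic skeleton below is a sketch only, with `propose` (track splits, merges and swaps) and `log_target` (the joint motion-appearance model) left as placeholder callables rather than the paper's actual components.

```python
import numpy as np

def mh_sample(init, propose, log_target, n_iters, seed=0):
    """Generic Metropolis-Hastings loop over association hypotheses.
    `propose(state, rng)` returns a candidate and the log proposal ratio
    log q(state|cand) - log q(cand|state); `log_target` scores a hypothesis.
    Both callables are placeholders, not the paper's actual moves/model."""
    rng = np.random.default_rng(seed)
    state, lp = init, log_target(init)
    best, best_lp = state, lp
    for _ in range(n_iters):
        cand, log_q_ratio = propose(state, rng)
        cand_lp = log_target(cand)
        if np.log(rng.random()) < cand_lp - lp + log_q_ratio:
            state, lp = cand, cand_lp
            if lp > best_lp:
                best, best_lp = state, lp  # keep the best hypothesis seen so far
    return best, best_lp
```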

In situ Evaluation of Tracking Algorithms using Time Reversed Chains

Hao Wu, Aswin Sankaranarayanan, and Rama Chellappa

Automatic evaluation of visual tracking algorithms in the absence of ground truth is a very challenging and important problem. In the context of online appearance modeling, there is an additional ambiguity involving the correctness of the appearance model. In this paper, we propose a novel performance evaluation strategy for tracking systems based on particle filters using a time reversed Markov chain. Starting from the latest observation, the time reversed chain is propagated back till the starting time t=0 of the tracking algorithm. The posterior density of the time reversed chain is also computed. The distance between the posterior density of the time reversed chain (at t=0) and the prior density used to initialize the tracking algorithm forms the decision statistic for evaluation. It is postulated that when the data is generated true to the underlying models, the decision statistic takes a low value. We empirically demonstrate the performance of the algorithm against various common failure modes in the generic visual tracking problem. Finally, we derive a small frame approximation that allows for very efficient computation of the decision statistic.

Real-time Visual Tracking under Arbitrary Illumination Changes

Geraldo Silveira and Ezio Malis

In this paper, we investigate how to improve the robustness of visual tracking methods with respect to generic lighting changes. We propose a new approach to the direct image alignment of either Lambertian or non-Lambertian objects under shadows, inter-reflections, glints as well as ambient, diffuse and specular reflections which may vary in power, type, number and space. The method is based on a proposed model of illumination changes together with an appropriate geometric model of image motion. The parameters related to these models are obtained through an efficient second-order optimization technique which minimizes directly the intensity discrepancies. Comparison results with existing direct methods show significant improvements in the tracking performance. Extensive experiments confirm the robustness and reliability of our method.

Tracking Large Variable Numbers of Objects in Clutter

Margrit Betke, Diane Hirsh, Angshuman Bagchi, Nickolay Hristov, Nicholas Makris, and Thomas Kunz

We propose statistical data association techniques for visual tracking of enormously large numbers of objects. We do not assume any prior knowledge about the numbers involved, and the objects may appear or disappear anywhere in the image frame and at any time in the sequence. Our approach combines the techniques of multitarget track initiation, recursive Bayesian tracking, clutter modeling, event analysis, and multiple hypothesis filtering. The original multiple hypothesis filter addresses an NP-hard problem and is thus not practical. We propose two cluster-based data association approaches that are linear in the number of detections and tracked objects. We applied the method to track wildlife in infrared video. We have successfully tracked hundreds of thousands of bats which were flying at high speeds and in dense formations.

Learning Features for Tracking

Michael Grabner, Helmut Grabner, and Horst Bischof

We treat tracking as a matching problem of detected keypoints between successive frames. The novelty of this paper is to learn classifier-based keypoint descriptions that allow us to incorporate background information. Contrary to existing approaches, we are able to start tracking the object from scratch, requiring no off-line training phase before tracking. The tracker is initialized by a region of interest in the first frame. Afterwards, an on-line boosting technique is used for learning descriptions of detected keypoints lying within the region of interest. New frames provide new samples for updating the classifiers, which increases their stability. A simple mechanism incorporates temporal information for selecting stable features. In order to ensure correct updates, a verification step based on estimating homographies using RANSAC is performed. The approach can be used for real-time applications since on-line updating and evaluating classifiers can be done efficiently.

Classifying Video with Kernel Dynamic Textures

Antoni Chan and Nuno Vasconcelos

The dynamic texture is a stochastic video model that treats the video as a sample from a linear dynamical system. This simple model has been shown to be surprisingly useful in domains such as video synthesis, video segmentation, and video classification. However, one major disadvantage of the dynamic texture is that it can only model video where the motion is smooth, i.e. video textures where the pixel values change smoothly. In this work, we propose an extension of the dynamic texture to address this issue. Instead of learning a linear observation function with PCA, we learn a non-linear observation function using kernel-PCA. The resulting kernel dynamic texture is capable of modeling a wider range of video motion, such as chaotic motion (e.g. turbulent water) or camera motion (e.g. panning). We derive the necessary steps to compute the Martin distance between kernel dynamic textures, and then validate the new model through classification experiments on video containing camera motion.
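
The kernel-PCA step that replaces PCA can be sketched compactly; the snippet below assumes a precomputed frame-by-frame kernel matrix (e.g. a Gaussian kernel on vectorized frames) and omits the subsequent AR fit of the state sequence.

```python
import numpy as np

def kernel_pca_states(K, n_components):
    """Latent state sequence from a kernel matrix K (n_frames x n_frames).
    Double-centers K, then projects onto the leading eigenvectors; this plays
    the role of the PCA observation step in the ordinary dynamic texture."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                               # center in feature space
    vals, vecs = np.linalg.eigh(Kc)
    order = np.argsort(vals)[::-1][:n_components]
    vals, vecs = vals[order], vecs[:, order]
    # Projections of the training frames onto the kernel principal components
    return Kc @ (vecs / np.sqrt(np.maximum(vals, 1e-12)))  # n x n_components
```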

Sensing, Photometrics, and Image Processing 1

Discontinuity Preserving Filtering over Analytic Manifolds

Raghav Subbarao and Peter Meer

Discontinuity preserving filtering of images is an important low-level vision task. With the development of new imaging techniques like diffusion tensor imaging (DTI), where the data does not lie in a vector space, previous methods like the original mean shift are not applicable. In this paper, we use the nonlinear mean shift algorithm to develop filtering methods for data lying on analytic manifolds. We work out the computational details of using mean shift on Sym_n^+, the manifold of n x n symmetric positive definite matrices. We apply our algorithm to chromatic noise filtering, which requires mean shift over the Grassmann manifold G_{3,1}, and obtain better results than standard mean shift filtering. We also use our method for DTI filtering, which requires smoothing over Sym_3^+.
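
Mean shift on Sym_n^+ replaces Euclidean averaging with averaging in tangent spaces through the manifold's log and exp maps. The sketch below shows those maps under the affine-invariant metric and the intrinsic (Karcher) mean they enable; nonlinear mean shift would use a kernel-weighted, rather than uniform, tangent-space average.

```python
import numpy as np
from scipy.linalg import expm, logm, sqrtm

def spd_log(X, P):
    """Riemannian log map of SPD matrix P at base point X (affine-invariant metric)."""
    Xs = np.real(sqrtm(X))
    Xsi = np.linalg.inv(Xs)
    return Xs @ np.real(logm(Xsi @ P @ Xsi)) @ Xs

def spd_exp(X, V):
    """Riemannian exp map of tangent matrix V at base point X."""
    Xs = np.real(sqrtm(X))
    Xsi = np.linalg.inv(Xs)
    return Xs @ np.real(expm(Xsi @ V @ Xsi)) @ Xs

def karcher_mean(mats, n_iters=20):
    """Intrinsic mean of SPD matrices: average in the tangent space at the
    current estimate, map back, repeat. Nonlinear mean shift replaces this
    uniform average with a kernel-weighted one."""
    mu = mats[0]
    for _ in range(n_iters):
        mu = spd_exp(mu, sum(spd_log(mu, P) for P in mats) / len(mats))
    return mu
```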

Generalized Thin-Plate Spline Warps

Adrien Bartoli, Mathieu Perriollat, and Sylvie Chambon

Thin-Plate Spline warps have been shown to be very effective as a parameterized model of the optic flow field between images of various deforming surfaces. Examples include a sheet of paper being manually handled. Recent work has used such warps for images of smooth rigid surfaces. Standard Thin-Plate Spline warps are not rigid, in the sense that they do not satisfy the epipolar geometry constraint, and are intrinsically affine, in the sense of the affine camera model. We propose three types of warps based on the Thin-Plate Spline. The first one is a flexible rigid warp. It describes the optic flow field induced by a smooth rigid surface, and satisfies the affine epipolar geometry constraint. The second and third ones extend the standard Thin-Plate Spline and the proposed rigid flexible warp to the perspective camera model. The properties of these warps are studied in detail, and a hierarchy is defined. Experimental results on simulated and real data are reported, showing that the proposed warps outperform the standard one in several cases of interest.

Simultaneous depth reconstruction and restoration of noisy stereo images using Non-local Pixel Distribution

Yong Seok Heo, Kyoung Mu Lee, and Sang Uk Lee

In this paper, we propose a new algorithm that solves both the stereo matching and the image denoising problem simultaneously for a pair of noisy stereo images. Most stereo algorithms employ L1 or L2 intensity error-based data costs in the MAP-MRF framework by assuming naive intensity constancy. These data costs make typical stereo algorithms suffer severely from the effects of noise. In this study, a new noise-robust stereo algorithm is presented that performs the stereo matching and the image denoising simultaneously. In our approach, we redefine the data cost by two terms. The first term is the restored intensity difference, instead of the observed intensity difference. The second term is the non-local pixel distribution dissimilarity around the matched pixels. We adopt the NL-means (Non Local-means) algorithm for restoring the intensity value as a function of disparity, and the pixel distribution dissimilarity is calculated using the PMHD (Perceptually Modified Hausdorff Distance). The restored intensity values in each image are determined by inferring the optimal disparity map at the same time. Experimental results show that the proposed algorithm is more robust and accurate than other conventional algorithms in both stereo matching and denoising.
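
For reference, the NL-means restoration used in the first data term amounts to a patch-similarity-weighted average; the sketch below handles one pixel, assumes it lies far enough from the image border, and uses illustrative values for the patch size, search radius and decay parameter h.

```python
import numpy as np

def nl_means_pixel(img, y, x, patch=3, search=10, h=10.0):
    """NL-means estimate of img[y, x]: a weighted average of pixels whose
    surrounding patches resemble the patch around (y, x). Assumes (y, x)
    is at least `search + patch // 2` pixels from every image border."""
    r = patch // 2
    ref = img[y - r:y + r + 1, x - r:x + r + 1].astype(float)
    num = den = 0.0
    for yy in range(y - search, y + search + 1):
        for xx in range(x - search, x + search + 1):
            cand = img[yy - r:yy + r + 1, xx - r:xx + r + 1].astype(float)
            w = np.exp(-((ref - cand) ** 2).mean() / h ** 2)  # patch similarity
            num += w * img[yy, xx]
            den += w
    return num / den
```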

Using Geometry Invariants for Camera Response Function Estimation

Tian-Tsong Ng, Shih-Fu Chang, and Mao-Pei Tsui

In this paper, we present a new single-image camera response function (CRF) estimation method using geometry invariants (GI). We derive mathematical properties and a geometric interpretation for GI, which lend insight into addressing various algorithm implementation issues in a principled way. In contrast to the previous single-image CRF estimation methods, our method provides a constraint equation for selecting the potential target data points. Compared to prior work, our experiment is conducted over more extensive data, and our method is flexible in that its estimation accuracy and stability can be improved whenever more than one image is available. The geometry invariance theory is novel and may be of wide interest.

Image Hallucination Using Neighbor Embedding over Visual Primitive Manifolds

Wei Fan and Dit-Yan Yeung

In this paper, we propose a novel learning-based method for image hallucination, with image super-resolution being a specific application that we focus on here. Given a low-resolution image, its underlying higher-resolution details are synthesized based on a set of training images. In order to build a compact yet descriptive training set, we investigate the characteristic local structures contained in large volumes of small image patches. Inspired by recent progress in manifold learning research, we make the assumption that small image patches in the low-resolution and high-resolution images form manifolds with similar local geometry in the corresponding image feature spaces. This assumption leads to a super-resolution approach which reconstructs the feature vector corresponding to an image patch from its neighbors in the feature space. In addition, the residue errors associated with the reconstructed image patches are also estimated to compensate for the information loss in the local averaging process. Experimental results show that our hallucination method can synthesize higher-quality images compared with other methods.
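
The neighbor-embedding reconstruction has a small closed form: solve for the weights that best reconstruct a low-resolution feature from its neighbors, then apply the same weights to the neighbors' high-resolution patches. A sketch with an assumed regularization constant:

```python
import numpy as np

def reconstruction_weights(x, neighbors, reg=1e-3):
    """Weights summing to one that best reconstruct feature x from its
    k neighbors (rows of `neighbors`), as in locally linear embedding."""
    Z = neighbors - x                      # center the neighbors on the query
    G = Z @ Z.T                            # k x k local Gram matrix
    G += reg * np.trace(G) * np.eye(len(G)) / len(G)  # regularize for stability
    w = np.linalg.solve(G, np.ones(len(G)))
    return w / w.sum()

# Usage idea: hr_patch = reconstruction_weights(lr_feat, lr_neighbors) @ hr_neighbors
```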

A Blind Source Separation Perspective on Image Restoration

Lidan Miao and Hairong Qi

This paper re-investigates the physical image formation process, leading to a new interpretation of the classic image restoration problem from a blind source separation (BSS) perspective. The observed distorted image is considered as a linear combination of a set of shifted versions of the point spread function (PSF), with the weight coefficients determined by the actual image. The new interpretation brings two immediate benefits to the practice of image restoration. First, we can utilize the rich set of BSS methods to solve the blind image restoration problem. Second, the new formulation in terms of matrix products has merit equivalent to the conventional matrix-vector notation in the theoretical study of restoration algorithms. We develop a smoothness and block-decorrelation constrained nonnegative matrix factorization method (termed CNMF) to blindly recover both the PSF and the actual image. The experimental results, compared to one of the state-of-the-art methods, demonstrate the merit of the proposed approach.

Statistics of Infrared Images

Nigel Morris, Shai Avidan, Wojciech Matusik, and Hanspeter Pfister

The proliferation of low-cost infrared cameras gives us a new angle for attacking many unsolved vision problems by leveraging a larger range of the electromagnetic spectrum. A first step to utilizing these images is to explore the statistics of infrared images and compare them to the corresponding statistics in the visible spectrum. In this paper, we analyze the power spectra as well as the marginal and joint wavelet coefficient distributions of datasets of indoor and outdoor natural images. We note that infrared images have noticeably less texture indoors, where temperatures are more homogeneous. The joint wavelet statistics also show strong correlation between object boundaries in IR and visible images, leading to high potential for vision applications using a combined statistical model.

Sensor noise modeling using the Skellam distribution: Application to the color edge detection

Youngbae Hwang, Jun-Sik Kim, and In-So Kweon

In this paper, we introduce the Skellam distribution as a sensor noise model for CCD or CMOS cameras. This is derived from the Poisson distribution of photons that determine the sensor response. We show that the Skellam distribution can be used to measure the intensity difference of pixels in the spatial domain, as well as in the temporal domain. In addition, we show that Skellam parameters are linearly related to the intensity of the pixels. This property means that the brighter pixels tolerate greater variation of intensity than the darker pixels. This enables us to decide automatically whether two pixels have different colors. We apply this modeling to detect the edges in color images. The resulting algorithm requires only a confidence interval for a hypothesis test, because it uses the distribution of image noise directly. More importantly, we demonstrate that without conventional Gaussian smoothing the noise model-based approach can automatically extract the fine details of image structures, such as edges and corners, independent of camera setting.
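
The resulting test is easy to sketch with SciPy's Skellam distribution: declare two pixels different only when their intensity difference (in photon-count units) falls outside a confidence interval of the noise model. The linear intensity-to-parameter mapping below uses illustrative constants, standing in for the per-camera fit the paper describes.

```python
from scipy.stats import skellam

def skellam_params(i1, i2, a=0.1, b=0.5):
    """Skellam parameters for the difference i1 - i2. The paper shows these
    grow linearly with pixel intensity; a and b here are illustrative only."""
    return a * i1 + b, a * i2 + b

def same_color(i1, i2, alpha=0.01):
    """Accept the same-color hypothesis iff the observed difference lies
    inside the (1 - alpha) interval of the Skellam noise distribution."""
    mu1, mu2 = skellam_params(i1, i2)
    lo = skellam.ppf(alpha / 2, mu1, mu2)
    hi = skellam.ppf(1 - alpha / 2, mu1, mu2)
    return lo <= (i1 - i2) <= hi
```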

A Probabilistic Intensity Similarity Measure based on Noise Distributions

Yasuyuki Matsushita and Stephen Lin

We derive a probabilistic measure of similarity between two observed image intensities that is based on the noise properties of the camera. In many vision algorithms, the effect of camera noise is either neglected or reduced in a preprocessing stage. However, noise reduction cannot be performed with high accuracy due to lack of knowledge about the true intensity signal. Our similarity metric specifically represents the likelihood that two intensity observations correspond to the same unknown noise-free scene radiance. By directly accounting for noise in the evaluation of similarity, the proposed measure makes noise reduction unnecessary and enhances many vision algorithms that involve matching of image intensities. Real-world experiments demonstrate the effectiveness of the proposed similarity measure in comparison to the standard L2 norm.
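
Under a Gaussian approximation of the sensor noise (an assumption of this sketch; the paper uses the camera's measured noise distribution, which need not be Gaussian), the similarity reduces to the likelihood of the observed difference under zero mean and summed intensity-dependent variances:

```python
import numpy as np

def intensity_similarity(i1, i2, noise_var):
    """Likelihood that observations i1 and i2 come from one noise-free
    radiance. `noise_var(i)` maps intensity to noise variance; zero-mean
    Gaussian noise is an assumption made for this sketch only."""
    v = noise_var(i1) + noise_var(i2)
    return np.exp(-(i1 - i2) ** 2 / (2.0 * v)) / np.sqrt(2.0 * np.pi * v)

# Example with an illustrative affine noise level function
sim = intensity_similarity(100.0, 104.0, lambda i: 0.05 * i + 2.0)
```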

Optimized Color Sampling for Robust Matting

Jue Wang and Michael Cohen

Image matting is the problem of determining, for each pixel in an image, whether it is foreground or background, or estimating the mixing parameter, "alpha", for those pixels that are a mixture of foreground and background. Matting is inherently an ill-posed problem. Previous matting approaches either use naive color sampling methods to estimate foreground and background colors for unknown pixels, or use propagation-based methods to avoid color sampling under weak assumptions about image statistics. We argue that neither approach by itself is enough to generate good results for complex natural images. We analyze the weaknesses of previous matting approaches, and propose a new robust matting algorithm. In our approach we also sample foreground and background colors for unknown pixels, but more importantly, we analyze the confidence of these samples. Only high confidence samples are chosen to contribute to the matting energy function, which is minimized by Random Walk. The energy function we define also contains a neighborhood term to enforce the smoothness of the matte. To validate the approach, we present an extensive and quantitative comparison between our algorithm and a number of previous approaches in hopes of providing a benchmark for future matting research.

Segmentation 1

Iterative MAP and ML Estimations for Image Segmentation

Shifeng Chen, Liangliang Cao, Jianzhuang Liu, and Xiaoou Tang

Image segmentation plays an important role in computer vision and image analysis. In this paper, the segmentation problem is formulated as a labeling problem under a probability maximization framework. To estimate the label configuration, an iterative optimization scheme is proposed to alternately carry out the maximum a posteriori (MAP) estimation and the maximum-likelihood (ML) estimation. The MAP estimation problem is modeled with Markov random fields (MRFs). A graph-cut algorithm is used to find the solution to the MAP-MRF estimation. The ML estimation is achieved by finding the means of region features. Our algorithm can automatically segment an image into regions with relevant textures or colors without the need to know the number of regions in advance. In addition, under the same framework, it can be extended to another algorithm that extracts objects of a particular class from a group of images. Extensive experiments have shown the effectiveness of our approach.
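
The alternation itself is short; in the sketch below a pointwise nearest-mean assignment stands in for the paper's graph-cut MAP-MRF step so that the loop is self-contained, with `features` as per-pixel color or texture vectors and `labels` an initial labeling.

```python
import numpy as np

def iterate_map_ml(features, labels, n_iters=10):
    """Alternate ML estimation of region means with a label update.
    The paper's MAP step is a graph cut over an MRF with a smoothness
    term; a plain nearest-mean assignment stands in here for illustration."""
    for _ in range(n_iters):
        ids = np.unique(labels)
        means = np.stack([features[labels == k].mean(axis=0) for k in ids])  # ML step
        d2 = ((features[:, None, :] - means[None]) ** 2).sum(axis=2)
        labels = ids[d2.argmin(axis=1)]                                      # "MAP" stand-in
    return labels, means
```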

Tree-based Classifiers for Bilayer Video Segmentation

Pei Yin, Antonio Criminisi, John Winn, and Irfan Essa

This paper presents an algorithm for the automatic segmentation of monocular videos into foreground and background layers. Correct segmentations are produced even in the presence of large background motion with nearly stationary foreground. There are three key contributions. The first is the introduction of a novel motion representation, “motons”, inspired by research in object recognition. Second, we propose learning the segmentation likelihood from the spatial context of motion. The learning is efficiently performed by Random Forests. The third contribution is a general taxonomy of tree-based classifiers, which facilitates theoretical and experimental comparisons of several known classification algorithms, as well as spawning new ones. Diverse visual cues such as motion, motion context, colour, contrast and spatial priors are fused together by means of a Conditional Random Field (CRF) model. Segmentation is then achieved by binary min-cut. Our algorithm requires no initialization. Experiments on many video-chat type sequences demonstrate the effectiveness of our algorithm in a variety of scenes. The segmentation results are comparable to those obtained by stereo systems.

Shape statistics for image segmentation with prior

Guillaume Charpiat, Olivier Faugeras, and Renaud Keriven

We propose a new approach to compute non-linear, intrinsic shape statistics and to incorporate them into a shape prior for an image segmentation task. Given a sample set of contours, we first define their mean shape as the one which is simultaneously closest to all samples up to rigid motions, and compute it in a gradient descent framework. We consider here a differentiable approximation of the Hausdorff distance between shapes. Statistics on the instantaneous deformation fields that the mean shape should undergo to move towards each sample lead to sensible characteristic modes of deformation that convey the shape variability. Contour statistics are turned into a shape prior which is rigid-motion invariant. Image segmentation results show the improvement gained by the shape prior.

Segmenting Images on the Tensor Manifold

Yogesh Rathi, Oleg Michailovich, and Allen Tannenbaum

In this note, we propose a method to perform segmentation on the tensor manifold, that is, the space of positive definite symmetric matrices of given dimension. In this work, we explicitly use the Riemannian structure of the tensor space in designing our algorithm. This structure has already been utilized in several approaches based on active contour models which separate the mean and/or variance inside and outside the evolving contour. We generalize these methods by proposing a new technique for performing segmentation by separating the entire probability distributions of the regions inside and outside the contour using the Bhattacharyya metric. In particular, this allows for segmenting objects with multimodal probability distributions (on the space of tensors). We demonstrate the effectiveness of our algorithm by segmenting various textured images using the structure tensor. A level set based scheme is proposed to implement the curve flow evolution equation.
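
For discrete histograms, the Bhattacharyya metric driving the separation is a one-liner; the sketch below is the generic form, which the paper evaluates between the tensor-valued distributions inside and outside the contour.

```python
import numpy as np

def bhattacharyya_distance(p, q, eps=1e-12):
    """Bhattacharyya distance between two (possibly unnormalized) histograms.
    The paper evaluates this between the distributions of structure tensors
    inside and outside the evolving contour."""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    return -np.log(np.sum(np.sqrt(p * q)) + eps)
```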

Unsupervised Segmentation of Objects using Efficient Learning

Himanshu Arora, Nicolas Loeff, David Forsyth, and Narendra Ahuja

We describe an unsupervised method to segment objects detected in images using a novel variant of an interest point template, which is very efficient to train and evaluate. Once an object has been detected, our method segments an image using a Conditional Random Field (CRF) model. This model integrates image gradients, the location and scale of the object, the presence of object parts, and the tendency of these parts to have characteristic patterns of edges nearby. We enhance our method using multiple unsegmented images of objects to learn the parameters of the CRF, in an iterative conditional maximization framework. We show quantitative results on images of real scenes that demonstrate the accuracy of segmentation.

Nonlinear Dynamical Shape Priors for Level Set Segmentation

Daniel Cremers

The introduction of statistical shape knowledge into level set based segmentation methods was shown to improve the segmentation of familiar structures in the presence of noise, clutter or partial occlusions. While most work has been focused on shape priors which are constant in time, it is clear that when tracking deformable shapes certain silhouettes may become more or less likely over time. In fact, the deformations of familiar objects such as the silhouettes of a walking person are often characterized by pronounced temporal correlations. In this paper, we propose a nonlinear dynamical shape prior for level set based image segmentation. Specifically, we propose to approximate the temporal evolution of the eigenmodes of the level set function by means of a mixture of autoregressive models. We detail how such shape priors “with memory” can be integrated into a variational framework for level set segmentation. As an application, we experimentally validate that the nonlinear dynamical prior drastically improves the tracking of a person walking in different directions, despite large amounts of clutter and noise.
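
Each component of such a prior is an autoregressive model on the level-set eigenmodes; a least-squares fit of a single AR model of given order (the mixture and the eigenmode projection are omitted) can be sketched as:

```python
import numpy as np

def fit_ar(states, order=2):
    """Least-squares fit of x_t = [x_{t-1}; ...; x_{t-order}] A + w for a
    sequence `states` of shape (T, d). Returns the stacked coefficient
    matrix and the residual covariance, used as the process noise."""
    T = len(states)
    Y = states[order:]                                        # targets
    Z = np.hstack([states[order - i:T - i] for i in range(1, order + 1)])
    A, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    resid = Y - Z @ A
    return A, np.cov(resid, rowvar=False)
```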

A Variational Approach to the Evolution of Radial Basis Functions for Image Segmentation

Greg Slabaugh, H. Quynh Dinh, and Gozde Unal

In this paper we derive differential equations for evolving radial basis functions (RBFs) to solve segmentation problems. The differential equations result from applying variational calculus to energy functionals designed for image segmentation. Our methodology supports evolution of all parameters of each RBF, including its position, weight, orientation, and anisotropy, if present. Our framework is general and can be applied to numerous RBF interpolants. The resulting approach retains some of the ideal features of implicit active contours, like topological adaptivity, while requiring low storage overhead due to the sparsity of our representation, which is an unstructured list of RBFs. We present the theory behind our technique and demonstrate its usefulness for image segmentation.

Implicit Active Contours Driven by Local Binary Fitting Energy

Chunming Li, Chiu-Yen Kao, John C. Gore, and Zhaohua Ding

Local image information is crucial for accurate segmentation of images with intensity inhomogeneity. However, image information in local regions is not embedded in popular region-based active contour models, such as the piecewise constant models. In this paper, we propose a region-based active contour model that is able to utilize image information in local regions. The major contribution of this paper is the introduction of a local binary fitting energy with a kernel function, which enables the extraction of accurate local image information. Therefore, our model can be used to segment images with intensity inhomogeneity, which overcomes the limitation of piecewise constant models. Comparisons with other major region-based models, such as the piecewise smooth model, show the advantages of our method in terms of computational efficiency and accuracy. In addition, the proposed method has promising applications in image denoising.

A Probabilistic Model for Object Recognition, Segmentation, and Non-Rigid Correspondence

Ian Simon and Steven M. Seitz

We describe a method for fully automatic object recognition and segmentation using a set of reference images to specify the appearance of each object. Our method uses a generative model of image formation that takes into account occlusions, simple lighting changes, and object deformations. We take advantage of local features to identify, locate, and extract multiple objects in the presence of large viewpoint changes, nonrigid motions with large numbers of degrees of freedom, occlusions, and clutter. We simultaneously compute an object-level segmentation and a dense correspondence between the pixels of the appropriate reference images and the image to be segmented.

A Graph Reduction Method for 2D Snake Problems

Jianhua Yan, Keqi Zhang, Zhengcui Zhang, Shu-Ching Chen, and Giri Narasimhan

Energy-minimizing active contour models (snakes) have been proposed for solving many computer vision problems such as object segmentation, surface reconstruction, and object tracking. Dynamic programming, which allows natural enforcement of constraints, is an effective method for computing the global minima of energy functions. However, this method is limited to snake problems with one-dimensional (1D) topology (i.e., a contour) and cannot handle problems with two-dimensional (2D) topology. In this paper, we extend the dynamic programming method to address snake problems with 2D topology using a novel graph reduction algorithm. Given a 2D snake with first order energy terms, a set of reduction operations is defined and used to simplify the graph of the 2D snake into one single vertex while retaining the minimal energy of the snake. The proposed algorithm has a polynomial-time complexity bound, and the optimality of the solution for a reducible 2D snake is guaranteed. However, not all types of 2D snakes can be reduced into one single vertex using the proposed algorithm; the reduction of general planar snakes is an NP-Complete problem. The proposed method has been applied to optimize 2D building topology extracted from airborne LIDAR data to examine the effectiveness of the algorithm. The results demonstrate that the proposed approach successfully found the global optima for over 98% of building topologies in polynomial time.

Image Segmentation by Probabilistic Bottom-Up Aggregation and Cue Integration

Sharon Alpert, Meirav Galun, Ronen Basri, and Achi Brandt

We present a bottom-up aggregation approach to image segmentation. Beginning with an image, we execute a sequence of steps in which pixels are gradually merged to produce larger and larger regions. In each step we consider pairs of adjacent regions and provide a probability measure to assess whether or not they should be included in the same segment. Our probabilistic formulation takes into account intensity and texture distributions in a local area around each region. It further incorporates priors based on the geometry of the regions. Finally, posteriors based on intensity and texture cues are combined using a “mixture of experts” formulation. This probabilistic approach is integrated into a graph coarsening scheme, providing a complete hierarchical segmentation of the image. The algorithm complexity is linear in the number of image pixels. We test our method on a variety of gray scale images and compare our results to several existing segmentation algorithms.

Shape 1

Hierarchical Matching of Deformable Shapes

Pedro Felzenszwalb and Joshua Schwartz

We describe a new hierarchical representation for two-dimensional objects that captures shape information at multiple levels of resolution. This representation is based on a hierarchical description of an object's boundary and can be used in an elastic matching framework, both for comparing pairs of objects and for detecting objects in cluttered images. In contrast to classical elastic models, our representation explicitly captures global shape information. This leads to richer geometric models and more accurate recognition results. Our experiments demonstrate classification results that are significantly better than the current state-of-the-art in several shape datasets. We also show initial experiments in matching shapes to cluttered images.

Delaunay Deformable Models: Topology-Adaptive Meshes Based on the Restricted Delaunay Triangulation

Jean-Philippe Pons and Jean-Daniel Boissonnat

In this paper, we propose a robust and efficient Lagrangian approach, which we call Delaunay Deformable Models, for modeling moving surfaces undergoing large deformations and topology changes. Our work uses the concept of restricted Delaunay triangulation, borrowed from computational geometry. In our approach, the interface is represented by a triangular mesh embedded in the Delaunay tetrahedralization of interface points. The mesh is iteratively updated by computing the restricted Delaunay triangulation of the deformed objects. Our method has many advantages over popular Eulerian techniques such as the level set method and over hybrid Eulerian-Lagrangian techniques such as the particle level set method: localization accuracy, adaptive resolution, ability to track properties associated to the interface, seamless handling of triple junctions. Our work brings a rigorous and efficient alternative to existing topology-adaptive mesh techniques such as T-snakes.

Shape from Planar Curves: A Linear Escape from Flatland

Ady Ecker, Kiriakos Kutulakos, and Allan Jepson

We revisit the problem of recovering 3D shape from the projection of planar curves on a surface. This problem is strongly motivated by perception studies. Applications include single-view modeling and fully uncalibrated structured light. When the curves intersect, the problem leads to a linear system for which a direct least-squares method is sensitive to noise. We derive a more stable solution and show examples where the same method produces plausible surfaces from the projection of parallel (non-intersecting) planar cross sections.
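
The paper derives its own stabilized solution; purely for orientation, the generic flavor of such a fix is damped (Tikhonov) least squares, sketched below, and not to be read as the authors' formulation.

```python
import numpy as np

def damped_lstsq(A, b, lam=1e-3):
    """Solve min ||A z - b||^2 + lam ||z||^2 via the normal equations:
    a standard way to stabilize a noise-sensitive direct least-squares solve.
    The damping weight `lam` is an illustrative choice."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)
```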

Learning and Matching Line Aspects for Articulated Objects

Xiaofeng Ren

Traditional aspect graphs are topology-based and are impractical for articulated objects. In this work we learn a small number of aspects, or prototypical views, from video data. Groundtruth segmentations in video sequences are utilized for both training and testing aspect models that operate on static images. We represent aspects of an articulated object as collections of line segments. In learning aspects, where object centers are known, a linear matching based on line location and orientation is used to measure similarity between views. We use K-medoids to find cluster centers. When using line aspects in recognition, matching is based on pairwise cues of relative location and relative orientation, as well as adjacency and parallelism. Matching with pairwise cues leads to a quadratic optimization that we solve with a spectral approximation. We show that our line aspect matching is capable of locating people in a variety of poses. Line aspect matching performs significantly better than an alternative approach using Hausdorff distance, showing the merits of the line representation.

A Multi-Scale Tikhonov Regularization Scheme for Implicit Surface Modelling

Jianke Zhu, Steven C.H. Hoi, and Michael R. Lyu

Kernel machines have recently been considered as a promising solution for implicit surface modelling. A key challenge for machine learning solutions is how to fit implicit shape models from large-scale sets of point cloud samples efficiently. In this paper, we propose a fast solution for approximating implicit surfaces based on a multi-scale Tikhonov regularization scheme. The optimization of our scheme is formulated as a sparse linear equation system, which can be efficiently solved by factorization methods. Different from traditional approaches, our scheme does not employ auxiliary off-surface points, which not only saves computational cost but also avoids the problem of injected noise. To further speed up our solution, we present a multi-scale, coarse-to-fine surface fitting algorithm. We conduct comprehensive experiments to evaluate the performance of our solution on a number of datasets of different scales. The promising results show that our suggested scheme is considerably more efficient than the state-of-the-art approach.
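
At a single scale, the regularized fit reduces to a linear system in the RBF weights. The dense, small-scale sketch below uses Gaussian RBFs and an assumed regularization constant; the paper's system is sparse, multi-scale, and solved by factorization.

```python
import numpy as np

def fit_rbf_weights(centers, points, values, sigma, lam=1e-3):
    """Tikhonov-regularized weights w minimizing ||Phi w - f||^2 + lam ||w||^2
    for a Gaussian RBF implicit function sampled at `points`. Dense solve
    shown for clarity; a sparse factorization is what the paper uses."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2.0 * sigma ** 2))        # n_points x n_centers
    m = len(centers)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ values)
```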

Groupwise Shape Registration on Raw Edge Sequence via A Spatio-Temporal Generative Model

Huijun Di, Naveed Rao, Guangyou Xu, and Linmi Tao

Groupwise shape registration of raw edge sequences is addressed. Automatically extracted edge maps are treated as noisy input shapes of the deformable object, and their registration is considered; the results can be used to build statistical shape models without a laborious manual labeling process. Dealing with raw edges poses several challenges; to address them, a novel spatio-temporal generative model is proposed which couples shape registration and trajectory tracking. The mean shape, consistent correspondences along the edge sequence, and the associated non-rigid transformations are jointly inferred in an EM framework. Our algorithm is tested on real video sequences of a dancing ballerina, a talking face, and a walking person. The results achieved are promising and demonstrate the robustness of our method. Potential applications can be found in statistical shape analysis, action recognition, object tracking, etc.

Navigation and SLAM

Fast Terrain Classification Using Variable-Length Representation for Autonomous Navigation

Anelia Angelova, Larry Matthies, Daniel Helmick, and Pietro Perona

We propose a method for learning using a set of feature representations which retrieve different amounts of information at different costs. The goal is to create a more efficient terrain classification algorithm which can be used in real time, onboard an autonomous vehicle. Instead of building a monolithic classifier with a uniformly complex representation for each class, the main idea here is to actively consider the labels or misclassification cost while constructing the classifier. For example, some terrain classes might be easily separable from the rest, so a very simple representation will be sufficient to learn and detect these classes. This is taken advantage of during learning, so the algorithm automatically builds a variable-length visual representation which varies according to the complexity of the classification task. This enables fast recognition of different terrain types during testing. We also show how to select a set of feature representations so that the desired terrain classification task is accomplished with high accuracy while remaining efficient. The proposed approach achieves a good trade-off between recognition performance and speedup on data collected by an autonomous robot.

Large scale vision based navigation without an accurate global reconstruction

Siniša Šegvić, Anthony Remazeilles, Albert Diosi, and François Chaumette

Autonomous cars will likely play an important role in the future. A vision system designed to support outdoor navigation for such vehicles has to deal with large dynamic environments, changing imaging conditions, and temporary occlusions by other moving objects. This paper presents a novel appearance-based navigation framework relying on a single perspective vision sensor, which is aimed at resolving the above issues. The solution is based on a hierarchical environment representation created during a teaching stage, when the robot is controlled by a human operator. At the top level, the representation contains a graph of key-images with extracted 2D features enabling robust navigation by visual servoing. The information stored at the bottom level enables efficient prediction of the locations of the features which are currently not visible, and eventually (re-)starting their tracking. The outstanding property of the proposed framework is that it enables robust and scalable navigation without requiring a globally consistent map, even in interconnected environments. This result has been confirmed by realistic off-line experiments and successful real-time navigation trials in public urban areas.

Robust Real-Time Visual SLAM Using Scale Prediction and Exemplar Based Feature Description

Denis Chekhlov, Mark Pupilli, Walterio Mayol, and Andrew Calway

Two major limitations of real-time visual SLAM algorithms are the restricted range of views over which they can operate and their lack of robustness when faced with erratic camera motion or severe visual occlusion. In this paper we describe a visual SLAM algorithm which addresses both of these problems. The key component is a novel feature description method which is both fast and capable of repeatable correspondence matching over a wide range of viewing angles and scales. This is achieved in real-time by using a SIFT-like spatial gradient descriptor in conjunction with efficient scale prediction and exemplar based feature representation. Results are presented illustrating robust real-time SLAM operation within an office environment.

Wide-Area Egomotion Estimation from Known 3D Structure

Olivier Koch and Seth Teller

Robust egomotion recovery for extended camera excursions has long been a challenge for machine vision researchers. Existing algorithms handle spatially limited environments and tend to consume prohibitive computational resources with increasing excursion time and distance. We describe an egomotion estimation algorithm that takes as input a coarse 3D model of an environment, and an omnidirectional video sequence captured within the environment, and produces as output a reconstruction of the camera's 6-DOF egomotion expressed in the coordinates of the input model. The principal novelty of our method is a robust matching algorithm that associates 2D image edges with 3D model segments. Our system handles 3-DOF and 6-DOF camera excursions of hundreds of meters within real, cluttered environments. It uses a novel prior visibility analysis to speed initialization and dramatically accelerate image-to-model matching. We demonstrate the method's operation, and qualitatively and quantitatively evaluate its performance, on both synthetic and real image sequences.

Enhancement 1: Blur and Resolution

Soft Edge Smoothness Prior for Alpha Channel Super Resolution

Shengyang Dai, Mei Han, Wei Xu, Ying Wu, and Yihong Gong

An effective image prior is necessary for image super resolution, due to its severely under-determined nature. Although the edge smoothness prior can be effective, it is generally difficult to have analytical forms to evaluate the edge smoothness, especially for soft edges that exhibit gradual intensity transitions. This paper finds the connection between the soft edge smoothness and a soft cut metric on an image grid by generalizing the Geocuts method, and proves that the soft edge smoothness measure approximates the average length of all level lines in an intensity image. This new finding not only leads to an analytical characterization of the soft edge smoothness prior, but also gives an intuitive geometric explanation. Regularizing the super resolution problem by this new form of prior simultaneously minimizes the length of all level lines, resulting in visually appealing results. In addition, this paper presents a novel combination of this soft edge smoothness prior and the alpha matting technique for color image super resolution, by normalizing edge segments with their alpha channel description, to achieve a unified treatment of edges with different contrast and scale.

Single Image Motion Deblurring Using Transparency

Jiaya Jia

One of the key problems of restoring a degraded image from motion blur is the estimation of the unknown shift-invariant linear blur filter. Several algorithms have been proposed using image intensity or gradient information. In this paper, we separate the image deblurring into filter estimation and image deconvolution processes, and propose a novel algorithm to estimate the motion blur filter from a perspective of alpha values. The relationship between the object boundary transparency and the image motion blur is investigated. We formulate the filter estimation as solving a Maximum a Posteriori (MAP) problem with the defined likelihood and prior on transparency. Our unified approach can be applied to handle both the camera motion blur and the object motion blur.

Resolving Objects at Higher Resolution from a Single Motion-blurred Image

Amit Agrawal and Ramesh Raskar

Motion blur can degrade the quality of images and is considered a nuisance for computer vision problems. In this paper, we show that motion blur can in fact be used to increase the resolution of a moving object. Our approach utilizes the information in a single motion-blurred image without any image priors or training images. As the blur size increases, the resolution of the moving object can be enhanced by a larger factor, albeit with a corresponding increase in reconstruction noise. Traditionally, motion deblurring and super-resolution have been ill-posed problems. Using a coded-exposure camera that preserves high spatial frequencies in the blurred image, we present a linear algorithm for the combined problem of deblurring and resolution enhancement and analyze the invertibility of the resulting linear system. We also show a method to selectively enhance the resolution of a narrow region of high-frequency features, when the resolution of the entire moving object cannot be increased due to small motion blur. Results on real images showing up to four times resolution enhancement are presented.
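
The enabling property, that a fluttered exposure keeps the blur system invertible, can be checked numerically: build the 1D smear matrix of an exposure code and compare condition numbers. The code below uses an arbitrary illustrative flutter pattern, not the authors' optimized code.

```python
import numpy as np

def smear_matrix(code, n):
    """Tall convolution matrix mapping a length-n sharp signal to its
    motion-blurred observation under the given binary exposure code."""
    k = len(code)
    A = np.zeros((n + k - 1, n))
    for i, c in enumerate(code):
        A[i:i + n] += c * np.eye(n)    # each open-shutter chop shifts the signal
    return A / sum(code)

box = smear_matrix([1] * 8, 64)                      # ordinary (box) exposure
coded = smear_matrix([1, 0, 1, 1, 0, 0, 1, 1], 64)   # illustrative fluttered exposure
# The coded system is typically far better conditioned than the box blur:
print(np.linalg.cond(box), np.linalg.cond(coded))
```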

Medical

Inferring Grammar-based Structure Models from 3D Microscopy Data

Joseph Schlecht, Kobus Barnard, Ekaterina Spriggs, and Barry Pryor

We present a new method to fit grammar-based stochastic models for biological structure to stacks of microscopic images captured at incremental focal lengths. Providing the ability to quantitatively represent structure and automatically fit it to image data enables important biological research. We consider the case where individuals can be represented as an instance of a stochastic grammar, similar to L-systems used in graphics to produce realistic plant models. In particular, we construct a stochastic grammar of Alternaria, a genus of fungus, and fit instances of it to microscopic image stacks. We express the image data as the result of a generative process composed of the underlying probabilistic structure model together with the parameters of the imaging system. Fitting the model then becomes probabilistic inference. For this we create a reversible-jump MCMC sampler to traverse the parameter space. We observe that incorporating spatial structure helps fit the model parts, and that simultaneously fitting the imaging system is also very helpful.

Model-Guided Segmentation of 3D Neuroradiological Image Using Statistical Surface Wavelet Model

Yang Li, Tiow-Seng Tan, Ihar Volkau, and Wieslaw Nowinski

This paper proposes a novel model-guided segmentation framework utilizing a statistical surface wavelet model as a shape prior. In the model building process, a set of training shapes is decomposed through the subdivision surface wavelet scheme. By interpreting the resultant wavelet coefficients as random variables, we compute prior probability distributions of the wavelet coefficients to model the shape variations of the training set at different scales and spatial locations. With this statistical shape model, the segmentation task is formulated as an optimization problem to best fit the statistical shape model to an input image. Due to the localization property of the wavelet shape representation both in scale and space, this multi-dimensional optimization problem can be efficiently solved in a multiscale and spatially localized manner. We have applied our method to segment cerebral caudate nuclei from MRI images. The experimental results have been validated against segmentations obtained by a human expert. These show that our method is robust, computationally efficient, and achieves a high degree of segmentation accuracy.

Hierarchical Learning of Curves; Application to Guidewire Localization in Fluoroscopy

Adrian Barbu, Vassilis Athitsos, Bogdan Georgescu, Stefan Boehm, Peter Durlak, and Dorin Comaniciu

In this paper we present a method for learning a curve model for detection and segmentation by closely integrating a hierarchical curve representation using generative and discriminative models with a hierarchical inference algorithm. We apply this method to the problem of automatic localization of the guidewire in fluoroscopic sequences. In fluoroscopic sequences, the guidewire appears as a hardly visible, non-rigid one-dimensional curve. Our paper has three main contributions. Firstly, we present a novel method to learn the complex shape and appearance of a free-form curve using a hierarchical model of curves of increasing degrees of complexity and a database of manual annotations. Secondly, we present a novel computational paradigm in the context of Marginal Space Learning, in which the algorithm is closely integrated with the hierarchical representation to obtain fast parameter inference. Thirdly, to our knowledge this is the first full system which robustly localizes the whole guidewire and has extensive validation on thousands of frames. We present very good quantitative and qualitative results on real fluoroscopic video sequences, obtained in just one second per frame.

Poster session 2

Learning and Pattern Recognition 2

Compositional Boosting for Computing Hierarchical Image Structures

Tian-Fu Wu, Gui-Song Xia, and Song-Chun Zhu

In this paper, we present a compositional boosting algorithm for detecting and recognizing 17 common image structures in low-middle level vision tasks. These structures, called “graphlets”, are the most frequently occurring primitives, junctions and composite junctions in natural images, and are arranged in a 3-layer And-Or graph representation. In this hierarchical model, larger graphlets are decomposed (at And-nodes) into smaller graphlets in multiple alternative ways (at Or-nodes), and parts are shared and re-used between graphlets. We then present a compositional boosting algorithm for computing the 17 graphlet categories collectively in the Bayesian framework. The algorithm runs recursively for each node A in the And-Or graph and iterates between two steps – bottom-up proposal and top-down validation. The bottom-up step includes two types of boosting methods: (i) detecting instances of A (often in low resolution) using the Adaboost method through a sequence of tests (weak classifiers) on image features; (ii) proposing instances of A (often in high resolution) by binding existing children nodes of A through a sequence of compatibility tests on their attributes (e.g., angles, relative size, etc.). The Adaboosting and binding methods generate a number of candidates for node A which are verified by a top-down process in a way similar to Data-Driven Markov Chain Monte Carlo. Both the Adaboosting and binding methods are trained off-line for each graphlet category, and the compositional nature of the model means the algorithm is recursive and can be learned from a small training set. We apply this algorithm to a wide range of indoor and outdoor images with satisfactory results.

Learning Generative Models via Discriminative Approaches

Zhuowen Tu

Generative model learning is one of the key problems in machine learning and computer vision. Yet, the use of generative models is limited due to the difficulty in effectively learning them. A new learning framework is proposed in this paper which progressively learns a target generative distribution through discriminative approaches. This algorithm contributes many interesting aspects to the literature. From the generative model side: (1) a reference distribution is used to assist the learning process, which removes sampling processes in the early stages; (2) the modeling and computing processes are directly combined; (3) the discrimination/classification power of discriminative approaches, e.g. boosting, is directly utilized; (4) the ability to select/explore features from a large pool of candidates allows us to make nearly no assumptions about the training data. From the discriminative model side: (1) this framework improves the modeling capability of discriminative models; (2) it can start with source training data only and gradually “invent” negative samples; (3) we show how sampling schemes can be introduced to discriminative models; (4) the learning procedure helps to tighten up the decision boundaries for classification, and therefore improves robustness. In this paper, we show a variety of applications such as texture modeling and classification, non-photorealistic rendering, learning image statistics/denoising, and face modeling. The framework handles both homogeneous patterns, e.g. textures, and inhomogeneous patterns, e.g. faces, and we use a nearly identical parameter setting for all tasks in the learning stage. The proposed framework is very general and the results obtained are promising.

Unsupervised Learning of Image Transformations

Roland Memisevic and Geoffrey Hinton

We describe a probabilistic model for learning rich, distributed representations of image transformations. The model is defined as a gated conditional random field that is trained to predict transformations of its inputs using a factorial set of latent variables. Inference in the model consists in extracting the transformation, given a pair of images, and can be performed exactly and efficiently. We show that, when trained on natural videos, the model develops domain specific motion features, in the form of fields of locally transformed edge filters. When trained on affine, or more general, transformations of still images, the model develops codes for these transformations, and can subsequently perform recognition tasks that are invariant under these transformations. It can also fantasize new transformations on previously unseen images. We describe several variations of the basic model and provide experimental results that demonstrate its applicability to a variety of tasks.

Utilizing Variational Optimization to Learn Markov Random Fields

Marshall Tappen

Markov Random Field, or MRF, models are a powerful tool for modeling images. While much progress has been made in algorithms for inference in MRFs, learning the parameters of an MRF is still a challenging problem. In this paper, we show how variational optimization can be used to learn the parameters of an MRF. This method for learning, which we refer to as Variational Mode Learning, finds the MRF parameters by minimizing a loss function that penalizes the difference between ground-truth images and an approximate, variational solution to the MRF. In particular, we focus on learning parameters for the Field of Experts model of Roth and Black. In addition to demonstrating the effectiveness of this method, we show that a model based on derivative filters performs quite similarly to the Field of Experts model. This suggests that the Field of Experts model, which is difficult to interpret, can be understood as imposing piecewise continuity on the image.
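
In schematic form (our notation, not copied from the paper), the training criterion described above is

    L(\theta) = \sum_n \| x^*(y_n; \theta) - t_n \|^2,   with   x^*(y; \theta) \approx \arg\min_x E_{MRF}(x; y, \theta),

where x^*(y_n; \theta) is the approximate, variational solution of the MRF for observation y_n, t_n is the corresponding ground-truth image, and \theta are the MRF (e.g., Field of Experts) parameters being learned.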

Connecting the Out-of-Sample and Pre-Image Problems in Kernel Methods

Pablo Arias, Gregory Randall, and Guillermo Sapiro

Kernel methods have been widely studied in the field of pattern recognition. These methods implicitly map the data (the “kernel trick”) into a space that is more appropriate for analysis. Many manifold learning and dimensionality reduction techniques are simply kernel methods for which the mapping is explicitly computed. In such cases, two problems related to the mapping arise: the out-of-sample extension and the pre-image computation. In this paper we propose a new pre-image method based on the Nyström formulation for the out-of-sample extension, showing the connections between the two problems. We also address the importance of normalization in the feature space, which has been ignored by standard pre-image algorithms. As an example, we apply these ideas to the Gaussian kernel, and relate our approach to other popular pre-image methods. Finally, we show the application of these techniques to the study of dynamic shapes.
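
For reference, the standard Nyström out-of-sample extension that the method builds on (stated here in its textbook form, not quoted from the paper) extends the k-th eigenvector \phi_k of the training kernel matrix to a new point x as

    \hat{\phi}_k(x) = \frac{1}{\lambda_k} \sum_{i=1}^{n} k(x, x_i)\, \phi_k(x_i),

where \lambda_k is the corresponding eigenvalue and x_1, ..., x_n are the training points; the pre-image problem then asks, conversely, for the input-space point whose mapping best matches a given point in the feature space.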

Local and Weighted Maximum Margin Discriminant Analysis

Haixian Wang, Wenming Zheng, Zilan Hu, and Sibao Chen

In this paper, we propose a new approach, called local and weighted maximum margin discriminant analysis (LWMMDA), to object discrimination. LWMMDA is a subspace learning method that identifies the underlying nonlinear manifold for discrimination. The goal of LWMMDA is to seek a transformation such that data points of different classes are projected as far apart as possible while points within the same class remain as compact as possible. The projections are obtained by maximizing a new discriminant criterion, called the local and weighted maximum margin criterion (LWMMC). Unlike the previous maximum margin criterion (MMC), which exploits only the global Euclidean structure of the data points, LWMMC takes the local structure into account, which makes it more accurate in finding discriminant information. LWMMC has an additional weighting parameter β that further broadens the average margin between different classes. Computationally, LWMMDA completely avoids the singularity problem. Besides, LWMMDA incorporates the QR decomposition into its framework, which makes it very efficient and stable in implementation. Finally, the LWMMDA framework is straightforwardly extended to the reproducing kernel Hilbert space induced by a nonlinear function φ. Experiments on digit visualization, face recognition, and facial expression recognition are presented to show the effectiveness of the proposed method.
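
Schematically (our notation; the paper's exact definitions of the weighted scatters may differ), an MMC-style criterion with locality weights and the parameter β has the form

    J(W) = tr( W^T ( \tilde{S}_b - β \tilde{S}_w ) W ),

where \tilde{S}_b and \tilde{S}_w are between-class and within-class scatter matrices computed with locality weights (e.g., larger weights on nearby point pairs), and W is the sought projection; β > 1 further penalizes within-class spread, widening the average margin.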

Integrating Global and Local Structures: A Least Squares Framework for Dimensionality Reduction

Jianhui Chen, Jieping Ye, and Qi Li

Linear Discriminant Analysis (LDA) is a popular statistical approach for dimensionality reduction. LDA captures the global geometric structure of the data by simultaneously maximizing the between-class distance and minimizing the within-class distance. However, local geometric structure has recently been shown to be effective for dimensionality reduction. In this paper, a novel dimensionality reduction algorithm is proposed which integrates both global and local structures. The main contributions of this paper are: (1) we present a least squares formulation for dimensionality reduction, which facilitates the integration of global and local structures; (2) we design an efficient model selection scheme for the optimal integration, which balances the tradeoff between the global and local structures; and (3) we present a detailed theoretical analysis of the intrinsic relationship between the proposed framework and LDA. Our extensive experimental studies on benchmark data sets show that the proposed integration framework is competitive with traditional dimensionality reduction algorithms that use global or local structure only.
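
One plausible instantiation of such an integrated least-squares objective (an assumption for illustration only, not the paper's exact formulation) is

    min_W || X^T W - Y ||_F^2 + γ tr( W^T X L X^T W ),

where the first term is a least-squares problem known to be equivalent to LDA under suitable class-indicator encodings Y, L is a graph Laplacian built on nearest neighbors that captures the local structure, and γ is the parameter that the model selection scheme would tune to balance the two.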

Eigenboosting: Combining Discriminative and Generative Information

Helmut Grabner, Peter Roth, and Horst Bischof

A major shortcoming of discriminative recognition and detection methods is their noise sensitivity, both during training and recognition. This may lead to very sensitive and brittle recognition systems that focus on irrelevant information. This paper proposes a method that selects generative and discriminative features. In particular, we boost classical Haar-like features and use the same features to approximate a generative model (i.e., eigenimages). A modified error function for boosting ensures that only features showing both good discrimination and good reconstruction are selected. This allows robust feature selection using boosting. Thus, we can handle problems where discriminant classifiers fail while still retaining the discriminative power. Our experiments show that we can significantly improve the recognition performance when learning from noisy data. Moreover, the feature type used allows efficient recognition and reconstruction.
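
One natural way to read the modified error function (our schematic rendering, not the authors' exact formula) is as a convex combination of the weighted classification error and the reconstruction error of the feature-based generative approximation,

    err(f) = α err_disc(f) + (1 - α) err_rec(f),

so that each boosting round only selects features f that score well on both criteria.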

Recognition and Detection 1

Simultaneous Object Detection and Segmentation by Boosting Local Shape Feature based Classifier

Bo Wu and Ram Nevatia

This paper proposes an approach to simultaneously detect and segment objects of a known category. Edgelet features are used to capture the local shape of the objects. For each feature, a pair of base classifiers for detection and segmentation is built. The base segmentor is designed to predict the per-pixel figure-ground assignment in a neighborhood of the edgelet based on the feature response; the neighborhood is represented as an effective field determined by the shape of the edgelet. A boosting algorithm is used to learn the ensemble classifier, with a cascade decision strategy, from the pool of base classifiers. Simultaneity is achieved for both training and testing. The system is evaluated on a number of public image sets and compared with several previous methods.

Accurate Object Detection with Deformable Shape Models Learnt from Images

Vittorio Ferrari, Frederic Jurie, and Cordelia Schmid

We present an object class detection approach which fully integrates the complementary strengths offered by shape matchers. Like an object detector, it can learn class models directly from images, and localize novel instances in the presence of intra-class variations, clutter, and scale changes. Like a shape matcher, it finds the accurate boundaries of the objects, rather than just their bounding-boxes. This is made possible by 1) a novel technique for learning a shape model of an object class given images of example instances; 2) the combination of Hough-style voting with a non-rigid point matching algorithm to localize the model in cluttered images. As demonstrated by an extensive evaluation, our method can localize object boundaries accurately, while needing no segmented examples for training (only bounding-boxes).

Virtual Training for Multi-View Object Class Recognition

Han-Pang Chiu, Tomas Lozano-Perez, and Leslie Kaelbling

Our goal is to circumvent one of the roadblocks to using existing approaches for single-view recognition for achieving multi-view recognition, namely, the need for sufficient training data for many viewpoints. We show how to construct virtual training examples for multi-view recognition using a simple model of objects (nearly planar facades centered at arbitrary 3D positions). We also show how the models can be learned from a few labeled images for each class.

3D LayoutCRF for Multi-View Object Class Recognition and Segmentation

Derek Hoiem, Carsten Rother, and John Winn

We introduce an approach to accurately detect and segment partially occluded objects in various viewpoints and scales. Our main contribution is a novel framework for combining object-level descriptions (such as position, shape, and color) with pixel-level appearance, boundary, and occlusion reasoning. In training, we exploit a rough 3D object model to learn physically localized part appearances. To find and segment objects in an image, we generate proposals based on the appearance and layout of local parts. The proposals are then refined after incorporating object-level information, and overlapping objects compete for pixels to produce a final description and segmentation of objects in the scene. A further contribution is a novel instance penalty, which is handled very efficiently during inference. We experimentally validate our approach on the challenging PASCAL'06 car database.

Feature Mining for Image Classification

Piotr Dollár, Zhuowen Tu, Hai Tao, and Serge Belongie

The efficiency and robustness of a vision system is often largely determined by the quality of the image features available to it. In data mining, one typically works with immense volumes of raw data, which demands effective algorithms to explore the data space. In analogy to data mining, the space of meaningful features for image analysis is also quite vast. Recently, the challenges associated with these problem areas have become more tractable through progress made in machine learning and concerted research effort in manual feature design by domain experts. In this paper, we propose a feature mining paradigm for image classification and examine several feature mining strategies. We also derive a principled approach for dealing with features with varying computational demands. Our goal is to alleviate the burden of manual feature design, which is a key problem in computer vision and machine learning. We include an in-depth empirical study on three typical data sets and offer theoretical explanations for the performance of various feature mining strategies. As a final confirmation of our ideas, we show results of a system that, utilizing feature mining strategies, matches or outperforms the best reported results on pedestrian classification (where considerable effort has been devoted to expert feature design).

Learning to Detect A Salient Object

Tie Liu, Jian Sun, Nan-Ning Zheng, Xiaoou Tang, and Heung-Yeung Shum

In this paper, we study visual attention by detecting a salient object in an input image. We formulate salient object detection as an image segmentation problem, where we separate the salient object from the image background. We propose a set of novel features, including multi-scale contrast, center-surround histogram, and color spatial distribution, to describe a salient object locally, regionally, and globally. A Conditional Random Field is learned to effectively combine these features for salient object detection. We also constructed a large image database containing tens of thousands of images carefully labeled by multiple users. To our knowledge, it is the first large image database for quantitative evaluation of visual attention algorithms. We validate our approach on this image database, which we intend to make available to the computer vision research community with this paper.
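
As a toy illustration of the first of these features (illustrative only; the paper's exact definition and choice of scales may differ), a multi-scale contrast map can be computed by accumulating squared differences between each pixel and its neighborhood mean over several window sizes:

import numpy as np
from scipy.ndimage import uniform_filter

def multiscale_contrast(gray, sizes=(9, 17, 33)):
    """Sum of squared center-surround differences over several scales."""
    img = gray.astype(float)
    return sum((img - uniform_filter(img, size)) ** 2 for size in sizes)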

OPTIMOL: automatic Object Picture collecTion via Incremental MOdel Learning

Li-Jia Li, Gang Wang, and Li Fei-Fei

A well-built dataset provides a good starting point for advanced computer vision research. It plays a crucial role in evaluation and provides a continuous challenge to state-of-the-art algorithms. Dataset collection is, however, a tedious and time-consuming task. In this paper, a novel approach to automatic dataset collection and model learning is presented, using object recognition techniques in an incremental fashion. The goal of this work is to use the tremendous resources of the web to learn robust object category models in order to detect and search for objects in real-world cluttered scenes. It mimics the human learning process of iteratively accumulating model knowledge and image examples. We adapt a non-parametric graphical model and propose an incremental learning framework. Our algorithm is capable of automatically collecting much larger object category datasets for 22 randomly selected classes from the Caltech 101 dataset. Furthermore, we offer not only more images in each object category dataset, but also a robust object model and meaningful image annotation. Our experiments show that OPTIMOL is capable of collecting image datasets that are superior to those of Caltech 101 and LabelMe.
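
The incremental flavor of the framework can be summarized by a short, hypothetical loop (the names and the confidence-threshold acceptance rule below are our stand-ins, not the paper's non-parametric model):

def collect_and_learn(model, downloaded_images, batch_size=100, threshold=0.9):
    """Alternate between accepting confidently classified web images and
    refining the category model on what was just accepted."""
    dataset = []
    while downloaded_images:
        batch = downloaded_images[:batch_size]
        downloaded_images = downloaded_images[batch_size:]
        accepted = [im for im in batch if model.score(im) > threshold]
        if accepted:
            dataset.extend(accepted)
            model.update(accepted)     # incremental learning step
    return dataset, model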

Image Classification with Segmentation Graph Kernels

Zaid Harchaoui and Francis Bach

We propose a family of kernels between images, defined as kernels between their respective segmentation graphs. The kernels are based on soft matching of subtree-patterns of the respective graphs, leveraging the natural structure of images while remaining robust to the uncertainty of the associated segmentation process. Indeed, the output of morphological segmentation is often represented by a labelled graph, with each vertex corresponding to a segmented region and edges joining neighboring regions. However, such image representations have mostly remained underused for learning tasks, partly because of the observed instability of the segmentation process and the inherent hardness of inexact matching between uncertain graphs. Our kernels count common virtual substructures among images, which makes it possible to perform efficient supervised classification of natural images with a support vector machine. Moreover, the kernel machinery allows us to take advantage of recent advances in kernel-based learning: i) semi-supervised learning reduces the required number of labelled images, while ii) multiple kernel learning algorithms efficiently select the most relevant similarity measures between images within our family.

An Exemplar Model for Learning Object Classes

Ondrej Chum and Andrew Zisserman

We present an exemplar method for weakly supervised learning of an object class model from a set of images. The model enables the detection (localization) of multiple instances of the object class in test images. In the training phase, image regions that maximize an objective function are automatically located in the training images, without requiring any user annotation such as bounding boxes. The objective function measures visual similarity between training image pairs, using the spatial distribution of both appearance patches and edges. Both the training and detection stages are fully translation and scale invariant. The detection performance of the model is assessed on the PASCAL Visual Object Classes Challenge 2006 test set. For a number of object classes the performance far exceeds the current state of the art.

Recognizing objects by piecing together the Segmentation Puzzle

Timothee Cour and Jianbo Shi

We present an algorithm that recognizes objects of a given category using a small number of hand-segmented images as references. Our method first over-segments an input image into superpixels, and then finds a shortlist of optimal combinations of superpixels that best fit one of the template parts, under affine transformations. Second, we develop a contextual interpretation of the parts, gluing image segments using top-down fiducial points and checking overall shape similarity. In contrast to previous work, the search for candidate superpixel combinations is not exponential in the number of segments, and in fact leads to a very efficient detection scheme. Both the storage and the detection of templates require only space and time proportional to the length of the template boundary, allowing us to store potentially millions of templates and to detect a template anywhere in a large image in roughly 0.01 seconds. We apply our algorithm to the Weizmann horse database and show that our method is comparable to the state of the art while offering a simpler and more efficient alternative to previous work.

Faces and Biometrics 1

Quality-Driven Face Occlusion Detection and Recovery

Dahua Lin and Xiaoou Tang

This paper presents a framework to automatically detect and recover the occluded facial region. We first derive a Bayesian formulation unifying the occlusion detection and recovery stages. A quality assessment model is then developed to drive both the detection and recovery processes; it captures the face priors in both global correlation and local patterns. Based on this formulation, we further propose GraphCut-based Detection and Confidence-Oriented Sampling to attain optimal detection and recovery, respectively. Compared to traditional work on image repair, our approach is distinct in three aspects: (1) it frees the user from marking the occlusion area by incorporating an automatic occlusion detector; (2) it learns a face quality model as a criterion to guide the whole procedure; and (3) it couples the detection and recovery stages to simultaneously achieve two goals: accurate occlusion detection and high-quality recovery. Comparative experiments show that our method can recover occluded faces with both global coherence and local details well preserved.

3D Face Recognition Founded on the Structural Diversity of Human Faces

Shalini Gupta, J. K. Aggarwal, Mia K. Markey, and Alan C. Bovik

We present a systematic procedure for selecting facial fiducial points associated with diverse structural characteristics of a human face. We identify such characteristics from the existing literature on anthropometric facial proportions. We also present three dimensional (3D) face recognition algorithms, which employ Euclidean/geodesic distances between these anthropometric fiducial points as features along with linear discriminant analysis classifiers. Furthermore, we show that in our algorithms, when anthropometric distances are replaced by distances between arbitrary regularly spaced facial points, their performances decrease substantially. This demonstrates that incorporating domain specific knowledge about the structural diversity of human faces significantly improves the performance of 3D human face recognition algorithms.

Learning a Spatially Smooth Subspace for Face Recognition

Deng Cai, Xiaofei He, Yuxiao Hu, Jiawei Han, and Thomas Huang

Subspace learning based face recognition methods have attracted considerable interest in recent years, including Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Locality Preserving Projection (LPP), Neighborhood Preserving Embedding (NPE), Marginal Fisher Analysis (MFA) and Local Discriminant Embedding (LDE). These methods consider an n1×n2 image as a vector in R^{n1×n2} and treat the pixels of each image as independent. However, an image represented in the plane is intrinsically a matrix, and pixels spatially close to each other may be correlated. Even though we have n1×n2 pixels per image, this spatial correlation suggests that the real number of degrees of freedom is far smaller. In this paper, we introduce a regularized subspace learning model that uses a Laplacian penalty to constrain the coefficients to be spatially smooth. All the existing subspace learning algorithms above fit into this model and produce a spatially smooth subspace, which is better for image representation than their original versions. Recognition, clustering and retrieval can then be performed in the image subspace. Experimental results on face recognition demonstrate the effectiveness of our method.
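
Schematically (our notation; the exact discretization used in the paper may differ), the spatial-smoothness regularizer penalizes a basis vector a, reshaped as an n1×n2 image A, by its discrete Laplacian energy,

    R(a) = || Δ A ||_F^2,   with   (Δ A)_{ij} = 4 A_{ij} - A_{i-1,j} - A_{i+1,j} - A_{i,j-1} - A_{i,j+1},

and this R(a), weighted by a tradeoff parameter, is added to the objective of PCA, LDA, LPP, etc., so that the learned basis images vary smoothly across neighboring pixels.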

A Multi-Resolution Dynamic Model for Face Aging Simulation

Jinli Suo, Feng Min, SongChun Zhu, Shiguang Shan, and Xilin Chen

In this paper we present a dynamic model for simulating the face aging process. We adopt a high-resolution grammatical face model [1] and augment it with age and hair features. This model represents all face images by a multi-layer And-Or graph and integrates the three most prominent aspects of aging-related change: global appearance changes in hair style and shape, deformations and aging effects of facial components, and the appearance of wrinkles in various facial zones. Face aging is then modeled as a dynamic Markov process on this graph representation, which is learned from a large dataset. Given an input image, we first compute the graph representation, then sample graph structures over various age groups according to the learned dynamic model, and finally generate new face images from the sampled graphs. Our approach has three novel aspects: (1) the aging model is learned from a dataset of 50,000 adult faces at different ages; (2) we explicitly model the uncertainty in face aging and can sample multiple plausible aged faces for an input image; and (3) we conduct a simple human experiment to validate the simulated aging process.

Filtered Component Analysis to Increase Robustness to Local Minima in Appearance Models

Fernando De la Torre, Alvaro Collet, Manuel Quero, Jeff Cohn, and Takeo Kanade

Appearance Models (AMs) are commonly used to model the appearance and shape variation of objects in images. In particular, they have proven useful for detection, tracking, and synthesis of people's faces from video. While AMs have numerous advantages relative to alternative approaches, they have at least two important drawbacks. First, they are especially prone to local minima in fitting; this becomes increasingly problematic as the number of parameters to estimate grows. Second, often few if any of the local minima correspond to the correct location of the model error. To address these problems, we propose Filtered Component Analysis (FCA), an extension of traditional Principal Component Analysis (PCA). FCA learns an optimal set of filters with which to build a multi-band representation of the object. FCA representations were found to be more robust than either grayscale or Gabor filter representations to problems of local minima. The effectiveness and robustness of the proposed algorithm are demonstrated on both synthetic and real data.

Contextual Identity Recognition in Personal Photo Albums

Dragomir Anguelov, Kuang-Chih Lee, Salih Burak Gokturk, and Baris Sumengen

We present an efficient probabilistic method for identity recognition in personal photo albums. Personal photos are usually taken under uncontrolled conditions -- the captured faces exhibit significant variations in pose, expression and illumination that limit the success of traditional face recognition algorithms. We show how to improve recognition rates by incorporating additional cues present in personal photo collections, such as clothing appearance and information about when the photo was taken. This is done by constructing a Markov Random Field (MRF) that effectively combines all available contextual cues in a principled recognition framework. Performing inference in the MRF produces markedly improved recognition results in a challenging dataset consisting of the personal photo collections of multiple people. At the same time, the computational cost of our approach remains comparable to that of standard face recognition approaches.

Monocular and Stereo Methods for AAM Learning from Video

Jason Saragih and Roland Goecke

The active appearance model (AAM) is a powerful method for modeling deformable visual objects. One of the major drawbacks of the AAM is that it requires a training set of pseudo-dense correspondences over the whole database. In this work, we investigate the utility of stereo constraints for automatic model building from video. First, we propose a new method for automatic correspondence finding in monocular images which is based on an adaptive template tracking paradigm. We then extend this method to take the scene geometry into account, proposing three approaches, each accounting for the availability of the fundamental matrix and calibration parameters or the lack thereof. The performance of the monocular method was first evaluated on a pre-annotated database of a talking face. We then compared the monocular method against its three stereo extensions using a stereo database.

Boosting Coded Dynamic Features for Facial Action Units and Facial Expression Recognition

Peng Yang, Qingshan Liu, and Dimitris Metaxas

This paper proposes a novel approach to representing facial expressions and facial action units (AUs) in an image sequence using dynamic features. During the training process, code books are created from the training samples. Based on the code books, we code the dynamic features and combine correlated dynamic features into one coded feature. Weak classifiers are built on the coded features, and a boosting method is used to construct a strong classifier that recognizes facial expressions and facial AUs from the coded features. Experiments on the CMU expression database and our own AU database show that our method performs with high accuracy and is much better than methods using static features. The method can be easily extended to video-based face recognition.

Automatic Face Recognition from Skeletal Remains

Peter Tu, Rebecca Book, Xiaoming Liu, Nils Krahnstoever, Carl Adrian, and Phil Williams

The ability to determine the identity of a skull found at a crime scene is of critical importance to the law enforcement community. Traditional clay-based methods attempt to reconstruct the face so as to enable identification of the deceased by members of the general public. However, these reconstructions lack consistency from practitioner to practitioner, and it has been shown that human recognition of these reconstructions against a photo gallery of potential victims is little better than chance. In this paper we propose the automation of the reconstruction process. For a given skull, a data-driven 3D generative model of the face is constructed using a database of CT head scans. The reconstruction can be constrained based on prior knowledge such as age and/or weight. To determine whether or not these reconstructions have merit, geometric methods for comparing reconstructions against a gallery of facial images are proposed. First, Active Shape Models are used to automatically detect a set of facial landmarks in each image. These landmarks are associated with 3D points on the reconstruction. Direct comparison of the reconstruction is problematic since, in general, the camera geometry used for image capture is unknown and there are uncertainties associated with the reconstruction and landmark detection processes. The first method of comparison uses constrained optimization to determine the optimal projection of the reconstruction onto the image; residuals are then analyzed, resulting in a ranking of the gallery. The second method uses boosting to learn which points are both reliable and discriminating, resulting in a match/no-match classifier. Experimental evidence indicating that skull recognition from facial images can be achieved is presented.

Quantifying Facial Expression Abnormality in Schizophrenia by Combining 2D and 3D Features

Peng Wang, Christian Kohler, Fred Barrett, Raquel Gur, Ruben Gur, and Ragini Verma

Most current computer-based facial expression analysis methods focus on the recognition of perfectly posed expressions and hence are incapable of handling individuals with expression impairments. In particular, patients with schizophrenia usually have impaired expressions in the form of "flat" or "inappropriate" affect, which makes the quantification of their facial expressions a challenging problem. This paper presents methods to quantify the group differences between patients with schizophrenia and healthy controls by extracting specialized features and analyzing group differences on a feature manifold. The features include 2D and 3D geometric features and moment invariants combining both 3D geometry and 2D texture. Facial expression recognition experiments on actors demonstrate that our combined features characterize facial expressions better than either 2D geometric or texture features alone. The features are then embedded into an ISOMAP manifold to quantify the group differences between controls and patients. Experiments show that our results are strongly supported by the human rating results and clinical findings, thus providing a framework that is able to quantify the abnormality in patients with schizophrenia.

Geometry and Structure-From-Motion 1

Algorithms for Batch Matrix Factorization with Application to Structure-from-Motion

Jean-Philippe Tardif, Adrien Bartoli, Martin Trudeau, Nicolas Guilbert, and Sébastien Roy

Matrix factorization is a key component for solving several computer vision problems. It is particularly challenging in the presence of missing or erroneous data, which often arise in Structure-from-Motion. We propose batch algorithms for matrix factorization based on closure and basis constraints, which are applied either to the cameras or to the structure, leading to four possible algorithms. The constraints are robustly computed from complete measurement sub-matrices, obtained e.g. by random data sampling. The cameras and 3D structure are then recovered through Linear Least Squares. Prior information about the scene, such as identical camera positions or orientations, a smooth camera trajectory, known 3D points and coplanarity of some 3D points, can be directly incorporated. We demonstrate our algorithms on challenging image sequences with tracking errors and more than 95% missing data.

A minimal solution to the autocalibration of radial distortion

Zuzana Kukelova and Tomas Pajdla

Epipolar geometry and relative camera pose computation are examples of tasks that can be formulated as minimal problems and solved from a minimal number of image points. Finding the solution leads to solving systems of algebraic equations. Often these systems are not trivial, and special algorithms have to be designed to achieve numerical robustness and computational efficiency. In this paper we provide a solution to the problem of estimating radial distortion and epipolar geometry from eight correspondences in two images. Unlike previous algorithms, which were able to solve the problem from nine correspondences only, we enforce the determinant of the fundamental matrix to be zero. This leads to a system of eight quadratic equations and one cubic equation in nine variables. We simplify this system by eliminating six of these variables, and then solve it by finding the eigenvectors of an action matrix of a suitably chosen polynomial. We show how to construct the action matrix without computing a complete Groebner basis, which provides an efficient and robust solver. The quality of the solver is demonstrated on synthetic and real data.
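
A common way to set up this kind of problem (the widely used one-parameter division model, stated here as background rather than quoted from the paper) lifts each distorted point (x_d, y_d) to

    p_u = ( x_d, y_d, 1 + λ (x_d^2 + y_d^2) )^T   (up to scale),

so each of the eight correspondences contributes one equation p_u'^T F p_u = 0, quadratic in the entries of F and the distortion parameter λ, and det(F) = 0 supplies the final cubic constraint.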

On the Direct Estimation of the Fundamental Matrix

Yaser Sheikh, Asaad Hakeem, and Mubarak Shah

The fundamental matrix is a central construct in the analysis of images captured from a pair of cameras, and many feature-based methods have been proposed for its computation. In this paper, we propose a direct method for estimating the fundamental matrix when the motion between the frames is small (e.g., between successive frames of a video). To achieve this, a warping function is presented for the fundamental matrix by using the brightness constancy constraint in conjunction with geometric constraints. Using this warping function, an iterative hierarchical algorithm is described to recover accurate estimates of the fundamental matrix. We present experimental results evaluating the performance of the proposed approach and demonstrate improved accuracy in the computation of the fundamental matrix.

Nine-point Algorithm for Para-catadioptric Fundamental Matrices

Christopher Geyer and Henrik Stewénius

We present a solution to finding the fundamental matrix for catadioptric cameras of the parabolic type. Central catadioptric cameras (an optical combination of a mirror and a lens that yields an imaging device equivalent, within hemispheres, to a perspective camera) have found wide application in robotics, tele-immersion, and providing enhanced situational awareness for remote operation. We use an uncalibrated structure-from-motion framework developed for these cameras to consider the problem of estimating the fundamental matrix for such cameras. We present a solution that can compute the parabolic catadioptric fundamental matrix from nine point correspondences, which is the smallest possible number because of the dimension of the manifold of such matrices. Since there exists no algorithm to take a linear estimate and project it onto this manifold, this is the first method that can yield an estimate guaranteed by construction to lie on the manifold of parabolic catadioptric fundamental matrices. We compare this algorithm to alternatives and show some results of using the algorithm in conjunction with random sample consensus (RANSAC).

On Constant Focal Length Self-Calibration From Multiple Views

Benoît Bocquillon, Adrien Bartoli, Pierre Gurdjos, and Alain Crouzil

We investigate the problem of finding the metric structure of a general 3D scene viewed by a moving camera with square pixels and constant unknown focal length. While the problem has a concise and well-understood formulation in the stratified framework thanks to the absolute dual quadric, two open issues remain. The first issue concerns the generic Critical Motion Sequences, i.e. camera motions for which self-calibration is ambiguous. Most previous work focuses on the varying focal length case; we provide a thorough study of the constant focal length case. The second issue is solving the nonlinear set of equations in four unknowns arising from the dual quadric formulation. Most previous work either performs local nonlinear optimization, thereby requiring an initial solution, or linearizes the problem, which introduces artificial degeneracies, most of which are likely to arise in practice. We use interval analysis to solve this problem. The resulting algorithm is guaranteed to find the solution and is not subject to artificial degeneracies. Directly using interval analysis usually results in computationally expensive algorithms; we propose a carefully chosen set of inclusion functions, making it possible to find the solution within a few seconds. Comparisons of the proposed algorithm with existing ones are reported for simulated and real data.

Autocalibration via Rank-Constrained Estimation of the Absolute Quadric

Manmohan Chandraker, Sameer Agarwal, Fredrik Kahl, David Nistér, and David Kriegman

We present an autocalibration algorithm for upgrading a projective reconstruction to a metric reconstruction by estimating the absolute dual quadric. The algorithm enforces the rank degeneracy and the positive semidefiniteness of the dual quadric as part of the estimation procedure, rather than as a post-processing step. Furthermore, the method allows the user, if he or she so desires, to enforce conditions on the plane at infinity so that the reconstruction satisfies the chirality constraints. The algorithm works by constructing low degree polynomial optimization problems, which are solved to their global optimum using a series of convex linear matrix inequality relaxations. We show extensive results on synthetic as well as real datasets to validate our algorithm.

A Practical Algorithm for L∞ Triangulation with Outliers

Hongdong Li

This paper addresses the problem of robust optimal multi-view triangulation. We propose an abstract framework, as well as a practical algorithm, that finds the best 3D reconstruction with guaranteed global optimality even in the presence of outliers. Our algorithm is founded on the theory of LP-type problems; we show that L∞ triangulation is a concrete example of an LP-type problem, and we propose a set of non-trivial basis-operation subroutines that implement the idea. Experiments validate the effectiveness and efficiency of the proposed algorithm.
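
For intuition, the classical way to exploit the structure of L∞ problems is a feasibility search on the maximal residual; the sketch below (hypothetical names, and plain bisection rather than the paper's LP-type basis subroutines) shows the overall shape of such a solver.

def linf_triangulate(feasible, lo=0.0, hi=10.0, tol=1e-6):
    """`feasible(gamma)` returns a 3D point whose maximum reprojection
    residual is <= gamma, or None if no such point exists."""
    best = None
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        point = feasible(mid)
        if point is not None:
            best, hi = point, mid     # shrink the residual bound
        else:
            lo = mid
    return best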

Surveillance and Change Detection

PEET: Prototype Embedding and Embedding Transition for Matching Vehicles over Disparate Viewpoints

Yanlin Guo, Ying Shan, Harpreet Sawhney, and Rakesh Kumar

This paper presents a novel framework, Prototype Embedding and Embedding Transition (PEET), for matching objects, specifically vehicles, that undergo drastic pose, appearance, and even modality changes. The problem of matching objects seen under drastic variations is reduced to matching embeddings of object appearances instead of matching the object images directly. An object appearance is first embedded in the space of a representative set of model prototypes. Objects captured at disparate temporal and spatial sites are embedded in the space of prototypes rendered with the pose of the cameras at the respective sites, and the resulting low-dimensional embedding vectors are matched. A significant feature of our approach is that no mapping function is needed to compute the distance between embedding vectors extracted from objects viewed under disparate pose and appearance changes; instead, an Embedding Transition (ET) scheme is utilized to implicitly realize the complex and non-linear mapping with high accuracy. The heterogeneous nature of matching between high-resolution and low-resolution image objects in PEET is discussed, and an unsupervised learning scheme that exploits this heterogeneity is developed to improve the overall matching performance on mixed-resolution objects. The proposed approach has been applied to vehicular object classification and query applications, and extensive experimental results demonstrate the efficacy and versatility of the PEET framework.
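
The core representational idea can be illustrated in a few lines (toy code; the distance function, the rendering step, and all names are our assumptions):

import numpy as np

def prototype_embedding(appearance, rendered_prototypes):
    """Represent an object appearance by its vector of distances to model
    prototypes rendered under the pose of the observing camera."""
    return np.array([np.linalg.norm(appearance - p) for p in rendered_prototypes])

# Matching across sites then compares these low-dimensional embedding
# vectors (via the embedding transition scheme) instead of raw images.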

Motion and Appearance Contexts for Tracking and Re-Acquiring Targets in Aerial Videos

Saad Ali, Vladimir Reilly, and Mubarak Shah

This paper attempts to solve the problem of persistently tracking and reacquiring objects in aerial videos. Persistent tracking implies the ability to track objects through occlusions, while reacquiring implies the ability to maintain the correct labels of objects when they leave and re-enter the field of view. Persistent tracking and reacquisition is a challenging problem for a number of reasons: i) the motion of the platform on which the camera is mounted is unconstrained; ii) targets of interest move independently of the motion of the camera; iii) the field of view of the camera is restricted; iv) the appearances and shapes of objects change due to illumination and pose variation; and v) terrain features act as sources of occlusion, resulting in loss of observations of target objects.

Surveillance in Virtual Reality: System Design and Multicamera Control

Faisal Qureshi and Demetri Terzopoulos

This paper advocates a Virtual Vision paradigm and demonstrates its usefulness in camera sensor network research. Virtual vision prescribes the use of a visually and behaviorally realistic virtual environment simulator in the design and evaluation of surveillance systems. The impediments to deploying and experimenting with appropriately complex camera networks make virtual vision an attractive alternative for many vision researchers who are motivated to investigate high-level multi-camera control issues within such networks. In particular, we present two prototype surveillance systems comprising passive and active pan/tilt/zoom cameras. We deploy these systems in a virtual train station environment populated by autonomous, lifelike virtual pedestrians. The easily reconfigurable virtual cameras situated throughout this environment generate synthetic video feeds that emulate those acquired by real surveillance cameras monitoring extensive public spaces. Our novel multi-camera control strategies enable the cameras to collaborate in persistently observing pedestrians of interest that move across their fields of view and in capturing close-up videos of pedestrians as they travel through designated areas. The sensor networks support task-dependent camera node selection and aggregation through local decision-making and inter-node communication. Our approach to multi-camera control is robust to node failures and message loss.

Unsupervised Activity Perception by Hierarchical Bayesian Models

Xiaogang Wang, Xiaoxu Ma, and Eric Grimson

We propose a novel unsupervised learning framework for activity perception. To understand activities in complicated scenes from visual data, we propose a hierarchical Bayesian model to connect three elements: low-level visual features, simple “atomic” activities, and multi-agent interactions. Atomic activities are modeled as distributions over low-level visual features, and interactions are modeled as distributions over atomic activities. Our models improve existing language models such as Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Process (HDP) by modeling interactions without supervision. Our data sets are challenging video sequences from crowded traffic scenes with different kinds of activities co-occurring. Our approach provides a summary of typical atomic activities and interactions in the scene. Unusual activities and interactions are found with natural probabilistic explanations. Our method supports flexible high-level queries on activities and interactions using atomic activities as components.

Change Detection in a 3-d World

Thomas Pollard and Joseph Mundy

This paper examines the problem of detecting changes in a 3-d scene from a sequence of images, taken by cameras with arbitrary but known pose. No prior knowledge of the state of normal appearance and geometry of object surfaces is assumed, and abnormal changes can occur in any image of the sequence. To the authors' knowledge, this paper is the first to address the change detection problem in such a general framework. Existing change detection algorithms that exploit multiple image viewpoints typically can detect only motion changes or assume a planar world geometry which cannot cope effectively with appearance changes due to occlusion and un-modeled 3-d scene geometry (ego-motion parallax). The approach presented here can manage the complications of unknown and sometimes changing world surfaces by maintaining a 3-d voxel-based model, where probability distributions for surface occupancy and image appearance are stored in each voxel. The probability distributions at each voxel are continuously updated as new images are received. The key question of convergence of this joint estimation problem is answered by a formal proof based on realistic assumptions about the nature of real world scenes. A series of experiments are presented that evaluate change detection accuracy under laboratory-controlled conditions as well as aerial reconnaissance scenarios.
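
A drastically simplified per-voxel update, to convey the idea only (scalar intensities, a single Gaussian per voxel, and invented field names; the paper maintains richer distributions), might look like:

import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def update_voxel(v, intensity, p_clutter=1.0 / 256, lr=0.05):
    """v: {'p': surface-occupancy probability, 'mean': ..., 'var': ...}."""
    like_surface = gaussian_pdf(intensity, v['mean'], v['var'])
    # Bayes update of the occupancy probability against a clutter model.
    num = like_surface * v['p']
    v['p'] = num / (num + p_clutter * (1.0 - v['p']))
    # Running update of the stored appearance distribution.
    v['mean'] = (1 - lr) * v['mean'] + lr * intensity
    v['var'] = (1 - lr) * v['var'] + lr * (intensity - v['mean']) ** 2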

Leveraging temporal, contextual and ordering constraints for recognizing complex activities in video

Benjamin Laxton, Jongwoo Lim, and David Kriegman

We present a scalable approach to recognizing and describing complex activities in video sequences. We are interested in long-term, sequential activities that may have several parallel streams of action. Our approach integrates temporal, contextual and ordering constraints with output from low-level visual detectors to recognize complex, long-term activities. We argue that a hierarchical, object-oriented design lends our solution to be scalable in that higher-level reasoning components are independent from the particular low-level detector implementation and that recognition of additional activities and actions can easily be added. Three major components to realize this design are: a dynamic Bayesian network structure for representing activities comprised of partially ordered sub-actions, an object-oriented action hierarchy for building arbitrarily complex action detectors and an approximate Viterbi-like algorithm for inferring the most likely observed sequence of actions. Additionally, this study proposes the Erlang distribution as a comprehensive model of idle time between actions and frequency of observing new actions. We show results for our approach on real video sequences containing complex activities.

Learning Dynamic Event Descriptions in Image Sequences

Harini Veeraraghavan, Paul Schrater, and Nikos Papanikolopoulos

Automatic detection of dynamic events in video sequences has a variety of applications including visual surveillance and monitoring, video highlight extraction, intelligent transportation systems, video summarization, and many more. Learning an accurate description of the various events in real-world scenes is challenging owing to the limited user-labeled data as well as the large variations in the pattern of the events. Pattern differences arise either due to the nature of the events themselves such as the spatio-temporal events or due to missing or ambiguous data interpretation using computer vision methods. In this work, we introduce a novel method for representing and classifying events in video sequences using reversible context-free grammars. The grammars are learned using a semi-supervised learning method. More concretely, by using the classification entropy as a heuristic cost function, the grammars are iteratively learned using a search method. Experimental results demonstrating the efficacy of the learning algorithm and the event detection method applied to traffic video sequences are presented.

Trajectory Series Analysis based Event Rule Induction for Visual Surveillance

Zhang Zhang, Kaiqi Huang, Tieniu Tan, and Liangsheng Wang

In this paper, a generic rule induction framework based on trajectory series analysis is proposed to learn the event rules of a surveillance scene automatically. First, the trajectories acquired by a tracking system are mapped into a set of primitive events that represent basic motion patterns of a moving object passing through a semantic region. Then a grammar induction algorithm based on the minimum description length (MDL) principle is adopted to infer meaningful rules from the primitive event series. Compared with previous grammar-rule-based work on event recognition, where the rules are all defined manually, our work aims to learn the event rules automatically. Within the grammar induction framework, a frequency constraint and an attribute constraint are proposed to filter out redundant rule candidates and speed up the induction process. Experiments at a traffic crossroad demonstrate the effectiveness of our methods. As shown in the experimental results, most of the grammar rules obtained by our algorithm coincide with the true semantic events at the crossroad. Furthermore, the traffic-light rules at the crossroad can also be learned correctly with the help of eliminating irrelevant trajectories.

Graphics and Computational Photography

Efficient new-view synthesis using pairwise dictionary priors

Oliver Woodford, Ian Reid, and Andrew Fitzgibbon

New-view synthesis (NVS) using texture priors (as opposed to surface-smoothness priors) can yield high quality results, but the standard formulation is in terms of large-clique Markov Random Fields (MRFs). Only local optimization methods such as iterated conditional modes, which are prone to fall into local minima close to the initial estimate, are practical for solving these problems. In this paper we replace the large-clique energies with pairwise potentials, by restricting the patch dictionary for each clique to image regions suitable for that clique. This enables for the first time the use of a global optimization method, such as tree-reweighted message passing, to solve the NVS problem with image-based priors. We employ a robust, truncated quadratic kernel to reject outliers caused by occlusions, specularities and moving objects, within our global optimization. Because the MRF optimization is thus fast, computing the unary potentials becomes the new performance bottleneck. An additional contribution of this paper is a novel, fast method for enumerating color modes of the per-pixel unary potentials, despite the non-convex nature of our robust kernel. We compare the results of our technique with other rendering methods, and discuss the relative merits and flaws of regularizing color, and of local versus global dictionaries.

Seamless Mosaicing of Image-Based Texture Maps

Victor Lempitsky and Denis Ivanov

Image-based object modeling has emerged as an important computer vision application. Typically, the process starts with the acquisition of image views of an object. These views are registered within a global coordinate system using structure-and-motion techniques, and the geometric shape of the object is then recovered using stereo and/or silhouette cues. This paper considers the final step, which creates the texture map for the recovered geometry model. The approach proposed in the paper naturally starts by backprojecting the original views onto the obtained surface. A texture is then mosaiced from these backprojections, with the quality of the mosaic maximized within a Markov Random Field energy optimization. Finally, the residual seams between the mosaic components are removed via a seam-levelling procedure, similar to gradient-domain stitching techniques recently proposed for image editing. Unlike previous approaches to the same problem, intensity blending as well as image resampling are avoided at all stages of the process, which ensures that the resolution of the produced texture is essentially the same as that of the original views. Importantly, due to the restriction to non-greedy energy optimization techniques, good results are produced even in the presence of significant errors in the image registration and geometric estimation steps.

Simultaneous Matting and Compositing

Jue Wang and Michael Cohen

Recent work in matting, hole filling, and compositing allows image elements to be mixed in a new composite image. Previous algorithms for matting foreground elements have assumed that the new background for compositing is unknown. We show that, if the new background is known, the matting algorithm has more freedom to create a successful matte by simultaneously optimizing the matting and compositing operations. We propose a new algorithm that integrates matting and compositing into a single optimization process. The system is able to compose foreground elements onto a new background more efficiently and with fewer artifacts than previous approaches. In our examples, we show how one can enlarge the foreground while maintaining the wide-angle view of the background. We also demonstrate composing a foreground element on top of similar backgrounds to help remove unwanted portions of the background or to re-scale or re-arrange the composite. We compare and contrast our method with a number of previous matting and compositing systems.
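
As background (the classical compositing equation, not a quotation from the paper), an observed image C mixes foreground F and background B through the matte α,

    C = α F + (1 - α) B,

and when the new background B' is known, α and F can be chosen to jointly explain the observed C and produce a visually good composite C' = α F + (1 - α) B'; this extra degree of freedom is what simultaneous matting and compositing exploits.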

Flash Cut: Foreground Extraction with Flash and No-flash Image Pairs

Jian Sun, Jian Sun, Sing Bing Kang, Zongben Xu, Xiaoou Tang, and Heung-Yeung Shum

In this paper, we propose a novel approach for foreground layer extraction using flash/no-flash image pairs, which we call flash cut. Flash cut is based on the simple observation that only the foreground is significantly brightened by the flash, while the appearance change of the background is very small if the background is distant. Changes due to flash, motion, and color information are fused in an MRF framework to produce high-quality segmentation results. Flash cut handles some amount of camera shake and foreground motion, which makes it practical for anyone with a flash-equipped camera to use. We validate our approach on a variety of indoor and outdoor examples.
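
The driving observation admits a two-line toy illustration (crude thresholding only, with arbitrary constants of our choosing; the actual system fuses flash, motion, and color cues in an MRF):

import numpy as np

def flash_foreground_cue(flash_img, noflash_img, eps=1.0, min_ratio=2.0):
    """Pixels much brighter under flash are likely foreground, since a
    distant background barely changes between the two exposures."""
    ratio = (flash_img.astype(float) + eps) / (noflash_img.astype(float) + eps)
    return ratio > min_ratio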

Texture-Preserving Shadow Removal in Color Images Containing Curved Surfaces

Eli Arbel and Hagit Hel-Or

Several approaches to shadow removal in color images have been introduced in recent years. Yet these methods fail in removing shadows that are cast on curved surfaces, as well as retaining the original texture of the image in shadow boundaries, known as penumbra regions. In this paper, we propose a novel approach which effectively removes shadows from curved surfaces while retaining the textural information in the penumbra, yielding high quality shadow-free images. Our approach aims at finding scale factors to cancel the effect of shadows, including penumbra regions where illumination changes gradually. Due to the fact that surface geometry is also taken into account when computing the scale factors, our method can handle a wider range of shadow images than current state-of-the-art methods, as demonstrated by several examples.

Minimal Solutions for Panoramic Stitching

Matthew Brown, Richard Hartley, and David Nister

This paper presents minimal solutions for the geometric parameters of a camera rotating about its optical centre. In particular, we present new 2- and 3-point solutions for the homography induced by a rotation with 1 and 2 unknown focal length parameters. Using tests on real data, we show that these algorithms outperform the standard 4-point linear homography solution in terms of accuracy of focal length estimation and image-based projection errors.
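
The counting behind such minimal solvers is worth spelling out (standard background, not quoted from the paper): for a purely rotating camera the inter-image homography is H = K_2 R K_1^{-1} with K = diag(f, f, 1). With one unknown focal length (K_1 = K_2) this leaves 3 + 1 = 4 degrees of freedom, and since each point correspondence gives 2 constraints, 2 points suffice; with 2 unknown focal lengths there are 5 degrees of freedom, hence 3 points.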

Learning and Shape

Online Learning Asymmetric Boosted Classifiers for Object Detection

Minh-Tri Pham and Tat-Jen Cham

We present an integrated framework for online learning of asymmetric boosted classifiers, applicable to object detection problems. In particular, our method seeks to balance the skewness of the labels presented to the weak classifiers, allowing them to be trained more equally. In online learning, we introduce an extra constraint when propagating the weights of the data points from one weak classifier to the next, allowing the algorithm to converge faster. Compared with the Online Boosting algorithm recently applied to object detection problems, we observe an increase of about 0-10% in accuracy and a gain of about 5-30% in learning speed.
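
One simple way to realize the balancing idea (our illustration, not the authors' exact update) is to renormalize the boosting weights of each class to equal total mass before training each weak classifier:

import numpy as np

def balance_weights(w, y):
    """w: sample weights; y: labels in {+1, -1}. Gives each class half of
    the total weight so the weak learner sees a balanced problem."""
    w = w.copy()
    w[y == +1] *= 0.5 / w[y == +1].sum()
    w[y == -1] *= 0.5 / w[y == -1].sum()
    return w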

Local Ensemble Kernel Learning for Object Category Recognition

Yen-Yu Lin, Tyng-Luh Liu, and Chiou-Shann Fuh

This paper describes a local ensemble kernel learning technique to recognize/classify objects from a large number of diverse categories. Due to the possibly large intraclass feature variations, using only a single unified kernel-based classifier may not satisfactorily solve the problem. Our approach is to carry out the recognition task with adaptive ensemble kernel machines, each of which is derived from proper localization and regularization. Specifically, for each training sample, we learn a distinct ensemble kernel constructed in a way to give good classification performance for data falling within the corresponding neighborhood. We achieve this effect by aligning each ensemble kernel with a locally adapted target kernel, followed by smoothing out the discrepancies among kernels of nearby data. Our experimental results on various image databases demonstrate that the technique of optimizing local ensemble kernels is effective and consistent for object recognition.

Accurate Object Localization with Shape Masks

Marcin Marszalek and Cordelia Schmid

This paper proposes an approach for object class localization which goes beyond bounding boxes, as it also determines the outline of the object. Unlike most current localization methods, our approach does not require any hypothesis parameter space to be defined. Instead, it directly generates, evaluates and clusters shape masks. Thus, the presented framework produces more informative results for object class localization. For example, it easily learns and detects possible object viewpoints and articulations, which are often well characterized by the object outline. We evaluate the proposed approach on the challenging natural-scene Graz-02 object classes dataset. The results demonstrate the extended localization capabilities of our method.

Semantic Hierarchies for Recognizing Objects and Parts

Boris Epshtein and Shimon Ullman

This paper describes a novel representation for the recognition of objects and their parts, the semantic hierarchy. The advantages of this representation include better classification performance, improved detection and localization of object parts and sub-parts, and the explicit identification of the different appearances of each object part. The semantic hierarchy algorithm starts by constructing a minimal feature hierarchy and then proceeds by adding semantically equivalent representatives to each node, using the entire hierarchy as a context for determining the identity and locations of added features. Unlike previous approaches, the semantic hierarchy learns to represent the set of possible appearances of object parts at all levels, together with their statistical dependencies. The algorithm is fully automatic and is shown experimentally to substantially improve the recognition of objects and their parts.

3D and Geometry

Visual Odometry System Using Multiple Stereo Cameras and Inertial Measurement Unit

Taragay Oskiper, Zhiwei Zhu, Supun Samarasekera, and Rakesh Kumar

Over the past decade, a tremendous amount of research activity has focused on the problem of localization in GPS-denied environments. The challenges of localization are highlighted in human-wearable systems, where the operator can move freely both indoors and outdoors. In this paper, we present a robust method that addresses these challenges using a human-wearable system with two pairs of backward- and forward-looking stereo cameras together with an inertial measurement unit (IMU). The algorithm runs in real time with a 15 Hz update rate on a dual-core 2 GHz laptop PC and is designed to be a highly accurate local (relative) pose estimation mechanism acting as the front-end to a Simultaneous Localization and Mapping (SLAM) type method capable of global corrections through landmark matching. Extensive tests of our prototype system so far reveal that, without any global landmark matching, we achieve between 0.5% and 1% accuracy in localizing a person over 500 meters of travel indoors and outdoors. To our knowledge, such performance results with a real-time system have not been reported before.

Inferring Temporal Order of Images From 3D Structure

Grant Schindler, Frank Dellaert, and Sing Bing Kang

In this paper, we describe a technique to temporally sort a collection of photos that span many years. By reasoning about persistence of visible structures, we show how this sorting task can be formulated as a constraint satisfaction problem (CSP). Casting this problem as a CSP allows us to efficiently find a suitable ordering of the images despite the large size of the solution space (factorial in the number of images) and the presence of occlusions. We present experimental results for photographs of a city acquired over a one hundred year period.
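
To make the constraint-satisfaction idea concrete, here is a toy version of one plausible persistence constraint: each structure should be visible over a contiguous run of the ordering (it is built once and, once gone, never returns). This brute-force sketch enumerates permutations, which is exactly the factorial blow-up the paper's CSP search avoids; the data and constraint are hypothetical illustrations:

```python
from itertools import permutations

# visibility[s] = set of image ids in which structure s is visible.
visibility = {
    "old_mill": {0, 1, 2},
    "clock_tower": {1, 2, 3, 4},
    "new_bank": {3, 4},
}

def is_consistent(order):
    """True if every structure occupies a contiguous run of `order`."""
    pos = {img: i for i, img in enumerate(order)}
    for imgs in visibility.values():
        slots = sorted(pos[i] for i in imgs)
        if slots[-1] - slots[0] + 1 != len(slots):
            return False
    return True

# Candidate temporal orders (each valid order's reversal is also valid).
orderings = [o for o in permutations(range(5)) if is_consistent(o)]
print(orderings[:3])
```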

Using Galois Theory to Prove Structure from Motion Algorithms are Optimal

David Nister, Richard Hartley, and Henrik Stewenius

This paper presents a general method for establishing that a problem cannot be solved by a 'machine' that is capable of the standard arithmetic operations, extraction of radicals (that is, m-th roots for any m), and extraction of roots of polynomials of degree smaller than n, but no other numerical operations. The method is based on Galois theory. It is applied to two well-known structure-from-motion procedures: five-point calibrated relative orientation, which can be realized by solving a tenth-degree polynomial, and L2-optimal two-view triangulation, which can be realized by solving a sixth-degree polynomial. It is shown that both of these problems have been optimally solved, in the sense that neither problem can be solved by even repeated root extraction of polynomials of any lesser degree. This rules out the possibility of undiscovered symmetries in the problems. It also follows that neither of these problems can be solved in closed form.

Projective Factorization of Multiple Rigid-Body Motions

Ting Li, Vinutha Kallem, Dheeraj Singaraju, and Rene Vidal

Given point correspondences in multiple perspective views of a scene containing multiple rigid-body motions, we present an algorithm for segmenting the correspondences according to the multiple motions. We exploit the fact that when the depths of the points are known, the point trajectories associated with a single motion live in a subspace of dimension at most four. Thus motion segmentation with known depths can be achieved by methods of subspace separation, such as GPCA or LSA. When the depths are unknown, we proceed iteratively. Given the segmentation, we compute the depths using standard techniques. Given the depths, we use GPCA or LSA to segment the scene into multiple motions. Experiments on the Hopkins155 motion segmentation database show that our method compares favorably against existing affine motion segmentation methods in terms of segmentation error and execution time.
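
The subspace fact underlying the method is easy to check numerically: with correct projective depths, the stacked depth-scaled trajectories lambda_ij * x_ij = P_i X_j of a single rigid motion form a matrix of rank at most four. A synthetic numpy check (illustration only; F, N and the random data are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
F, N = 6, 30                          # number of views and of 3D points
Ps = rng.standard_normal((F, 3, 4))   # random projective cameras
X = rng.standard_normal((4, N))       # homogeneous 3D points

# Depth-scaled image points for one rigid motion: lambda_ij * x_ij = P_i X_j.
W = np.vstack([P @ X for P in Ps])    # shape (3F, N)

s = np.linalg.svd(W, compute_uv=False)
print(np.sum(s > 1e-9 * s[0]))        # prints 4: trajectories span a 4-dim subspace
```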

Recognition, Learning, and Optimization

Beyond Local Appearance: Category Recognition from Pairwise Interactions of Simple Features

Marius Leordeanu, Martial Hebert, and Rahul Sukthankar

We present a discriminative shape-based algorithm for object category localization and recognition. Our method learns object models in a weakly-supervised fashion, without requiring the specification of object locations or pixel masks in the training data. We represent object models as cliques of fully-interconnected parts, exploiting only the pairwise geometric relationships between them. The use of pairwise relationships enables our algorithm to successfully overcome several problems that are common to previously-published methods. Even though our algorithm can easily incorporate local appearance information from richer features, we purposefully do not use them, in order to demonstrate that simple geometric relationships can match (or exceed) the performance of state-of-the-art object recognition algorithms.

What makes a good model of natural images?

Yair Weiss and Bill Freeman

Many low-level vision algorithms assume a prior probability over images, and there has been great interest in trying to learn this prior from examples. Since images are highly non-Gaussian, high-dimensional, continuous signals, learning their distribution presents a tremendous computational challenge. Perhaps the most successful recent algorithm is the Fields of Experts (FOE) model, which has shown impressive performance by modeling image statistics with a product of potentials defined on filter outputs. However, as in previous models of images based on filter outputs, calculating the probability of an image given the model requires evaluating an intractable partition function. This makes learning very slow (it requires Monte-Carlo sampling at every step) and makes it virtually impossible to compare the likelihood of two different models. Given this computational difficulty, it is hard to say whether the nonintuitive features learned by such models represent a true property of natural images or an artifact of the approximations used during learning. In this paper we present (1) tractable lower and upper bounds on the partition function of models based on filter outputs and (2) efficient learning algorithms that do not require any sampling. Our results are based on recent results in machine learning that deal with Gaussian potentials. We extend these results to non-Gaussian potentials and derive a novel basis rotation algorithm for approximating the maximum likelihood filters. Our results allow us to (1) rigorously compare the likelihood of different models and (2) calculate high likelihood models of natural image statistics in a matter of minutes. Applying our results to previous models shows that the nonintuitive features are not an artifact of the learning process but rather capture robust properties of natural images.

Joint Optimization of Cascaded Classifiers for Computer Aided Detection

Murat Dundar and Jinbo Bi

The existing methods for offline training of cascade classifiers take a greedy search to optimize individual classifiers in the cascade, leading to inefficient overall performance. We propose a new design of the cascaded classifier where all classifiers are optimized for the final objective function. The key contribution of this paper is the AND-OR framework for learning the classifiers in the cascade. In earlier work, each classifier is trained independently using the examples labeled as positive by the previous classifiers in the cascade, and optimized to have the best performance for that specific local stage. The proposed approach takes into account the fact that an example is classified as positive by the cascade if it is labeled as positive by all the stages, and it is classified as negative if it is rejected at any stage in the cascade. An offline training scheme is introduced based on the joint optimization of the classifiers in the cascade to minimize an overall objective function. We apply the proposed approach to the problem of automatically detecting polyps from multi-slice CT images. Our approach significantly speeds up the execution of the Computer Aided Detection (CAD) system while yielding comparable performance with the current state-of-the-art, and also demonstrates favorable results over Cascade AdaBoost, both in terms of performance and online execution speed.

Efficient Belief Propagation for Vision Using Linear Constraint Nodes

Brian Potetz

Belief propagation over pairwise connected Markov Random Fields has become a widely used approach, and has been successfully applied to several important computer vision problems. However, pairwise interactions are often insufficient to capture the full statistics of the problem. Higher-order interactions are sometimes necessary. Unfortunately, the complexity of belief propagation is exponential in the size of the largest clique. In this paper, we introduce a new technique to compute belief propagation messages in time linear with respect to clique size for a large class of potential functions over real-valued variables. We demonstrate this technique in two applications. First, we perform efficient inference in graphical models where the spatial prior of natural images is captured by 2 x 2 cliques. This approach shows significant improvement over the commonly used pairwise-connected models, and may benefit a variety of applications using belief propagation to infer images or range images. Finally, we apply these techniques to shape-from-shading and demonstrate significant improvement over previous methods, both in quality and in flexibility.

Fast, Approximately Optimal Solutions for Single and Dynamic MRFs

Nikos Komodakis, Georgios Tziritas, and Nikos Paragios

A new efficient MRF optimization algorithm, called Fast-PD, is proposed, which generalizes α-expansion. One of its main advantages is that it offers a substantial speedup over that method, e.g. it can be at least 3-9 times faster than α-expansion. Its efficiency is a result of the fact that Fast-PD exploits information coming not only from the original MRF problem, but also from a dual problem. Furthermore, besides static MRFs, it can also be used for boosting the performance of dynamic MRFs, i.e. MRFs varying over time. On top of that, Fast-PD makes no compromise about the optimality of its solutions: it can compute exactly the same answer as α-expansion, but, unlike that method, it can also guarantee an almost optimal solution for a much wider class of NP-hard MRF problems. Results on static and dynamic MRFs demonstrate the algorithm's efficiency and power. E.g., Fast-PD has been able to compute disparity for stereoscopic sequences in real time, with the resulting disparity coinciding with that of α-expansion.

Poster session 1

Learning and Pattern Recognition 3

Fiber Tract Clustering on Manifolds With Dual Rooted-Graphs

Andy Tsai, Carl-Fredrik Westin, Alfred Hero, and Alan Willsky

We propose a manifold learning approach to fiber tract clustering using a novel similarity measure between fiber tracts constructed from dual-rooted graphs. In particular, to generate this similarity measure, the chamfer or Hausdorff distance is initially employed as a local distance metric to construct minimum spanning trees between pairwise fiber tracts. These minimum spanning trees are effective in capturing the intrinsic geometry of the fiber tracts. Hence, they are used to capture the neighborhood structures of the fiber tract data set. We next assume the high-dimensional input fiber tracts to lie on low-dimensional nonlinear manifolds. We apply Locally Linear Embedding, a popular manifold learning technique, to define a low-dimensional embedding of the fiber tracts that preserves the neighborhood structures of the high-dimensional data as captured by the method of dual-rooted graphs. Clustering is then performed on this low-dimensional data using the k-means algorithm. We illustrate the resulting clustering technique on both synthetic data and real fiber tract data obtained from diffusion tensor imaging.
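
A rough stand-in for this pipeline using off-the-shelf pieces: symmetric Hausdorff distances and a precomputed-distance MDS embedding substitute for the dual-rooted-graph similarity and LLE (scikit-learn's LLE cannot consume a precomputed metric). Illustration only, on synthetic tracts:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff
from sklearn.manifold import MDS
from sklearn.cluster import KMeans

# Synthetic "fiber tracts": each tract is an (n_points, 3) polyline.
rng = np.random.default_rng(0)
tracts = [np.cumsum(rng.standard_normal((50, 3)), axis=0) for _ in range(40)]

# Pairwise symmetric Hausdorff distances between tracts.
n = len(tracts)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = max(directed_hausdorff(tracts[i], tracts[j])[0],
                directed_hausdorff(tracts[j], tracts[i])[0])
        D[i, j] = D[j, i] = d

# Low-dimensional embedding that respects the distances, then k-means.
emb = MDS(n_components=3, dissimilarity="precomputed",
          random_state=0).fit_transform(D)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(emb)
```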

Capturing long-range correlations with patch models

Vincent Cheung, Nebojsa Jojic, and Dimitris Samaras

The use of image patches to capture local correlations between pixels has been growing in popularity for use in various low-level vision tasks. There is a trade-off between using larger patches to obtain additional high-order statistics and smaller patches to capture only the elemental features of the image. Previous work has leveraged short-range correlations between patches that share pixel values for use in patch matching. The idea here is that if a patch in one image matches well to a patch in a second image, then a second patch in the first image which shares pixels with the first patch should match well to a patch similarly displaced in the second image. In this paper, long-range correlations between patches are introduced, where relations between patches that do not necessarily share pixels are learnt. Such correlations arise as an inherent property of the data itself. These long-range patch correlations are shown to be particularly important for video sequences, where the patches have an additional time dimension and are three-dimensional constructs, with correlation links in both space and time. We illustrate the power of our model on tasks such as multiple object registration and detection, as well as missing data interpolation, including the difficult task of photograph relighting, where a single photograph is assumed to be the only observed part of a 3D volume whose two coordinates are the image x and y coordinates and whose third coordinate is the illumination angle θ. We show that in some cases, the long-range correlations observed among the mappings of different volume patches in a small training set are sufficient to infer the possible complex intensity changes in a new photograph due to illumination angle variation.

Region Classification with Markov Field Aspect Models

Jakob Verbeek and Bill Triggs

In recent years considerable advances have been made in learning to recognize and localize visual object classes from images annotated with global image-level labels, bounding boxes, or pixel-level segmentations. A second line of research uses unsupervised learning methods such as aspect models to automatically discover the latent object classes of unlabeled image collections. Here we learn spatial aspect models from image-level labels and use them to recover labeled regions in new images. Our models combine low-level texture, color and position cues with spatial random field models that capture the local coherence of region labels. We study two spatial inference models: one based on averaging over forests of minimal spanning trees linking neighboring image regions, the other on an efficient chain-merging Expectation Propagation method for regular 8-neighbor Markov random fields. Experimental results on the MSR Cambridge data sets show that incorporating spatial terms in the aspect model significantly improves the region-level classification rates. So much so, that the spatial random field model trained from image labels only outperforms PLSA trained from segmented images.

Modeling Appearances with Low-Rank SVM

Lior Wolf, Hueihan Jhuang, and Tamir Hazan

Several authors have noticed that the common representation of images as vectors is sub-optimal. The process of vectorization eliminates spatial relations between some of the nearby image measurements and produces a vector of a dimension which is the product of the measurements' dimensions. It seems that images may be better represented when taking into account their structure as a 2D (or multi-D) array. Our work bears similarities to recent work such as 2DPCA or Coupled Subspace Analysis in that we treat images as 2D arrays. The main difference, however, is that unlike previous work, which separated representation from the discriminative learning stage, we achieve both with the same method. Our framework, "Low-Rank separators", studies the use of separating hyperplanes that are constrained to have the structure of low-rank matrices. We first prove that the low-rank constraint provides preferable generalization properties. We then define two "Low-rank SVM problems" and propose algorithms to solve them. Finally, we provide supporting experimental evidence for the Low-rank SVM framework.

Hybrid Learning of Large Jigsaws

Julia Lasserre, Anitha Kannan, and John Winn

A jigsaw is a recently proposed generative model that describes an image as a composition of non-overlapping patches of varying shape, extracted from a latent image. By learning the latent jigsaw image which best explains a set of images, it is possible to discover the shape, size and appearance of repeated structures in the images. A challenge when learning this model is the very large space of possible jigsaw pixels which can potentially be used to explain each image pixel. The previous method of inference for this model scales linearly with the number of jigsaw pixels, making it unusable for learning the large jigsaws needed for many practical applications. In this paper, we make three contributions that enable the learning of large jigsaws - a novel sparse belief propagation algorithm, a hybrid method which significantly improves the sparseness of this algorithm, and a method that uses these techniques to make learning of large jigsaws feasible. We provide detailed analysis of how our hybrid inference method leads to significant savings in memory and computation time. To demonstrate the success of our method, we present experimental results applying large jigsaws to an object recognition task.

Variational Bayes Approach to Robust Subspace Learning

Takayuki Okatani and Koichiro Deguchi

This paper presents a new algorithm for the problem of robust subspace learning (RSL), i.e., the estimation of linear subspace parameters from a set of data points in the presence of outliers (and missing data). The algorithm is derived on the basis of the variational Bayes (VB) method, which is a Bayesian generalization of the EM algorithm. For the purpose of the derivation of the algorithm as well as the comparison with existing algorithms, we present two formulations of the EM algorithm for RSL. One yields a variant of the IRLS algorithm, which is the standard algorithm for RSL. The other is an extension of Roweis's formulation of an EM algorithm for PCA, which yields a robust version of the alternated least squares (ALS) algorithm. This ALS-based algorithm can only deal with a certain type of outliers (termed vector-wise outliers). The VB method is used to resolve this limitation, which results in the proposed algorithm. Experimental results using synthetic data show that the proposed algorithm outperforms the IRLS algorithm in terms of the convergence property and the computational time.

A Variational Bayesian Approach for Classification with Corrupted Inputs

Chao Yuan and Claus Neubauer

Classification of corrupted images, for example due to occlusion or noise, is a challenging problem. Most existing methods tackle this problem using a two-step strategy: image reconstruction followed by classification of the reconstructed image. However, their performance relies heavily on the accuracy of reconstruction and parameter estimation. We present a full Bayesian approach which infers the class label from the corrupted image by marginalizing over the original image and the parameters. Overfitting is effectively overcome through Bayesian integration. Our system consists of two models. The original image model, which specifies the original image generation process, is described by a Gaussian mixture model. The observation model, which relates the corrupted image to the original image, is described by an additive deviation model. Normal pixel and corrupted pixel values are elegantly handled by the covariance of the Gaussian deviation. We employ variational approximation to make the Bayesian integration tractable. The advantage of the proposed method is demonstrated by classification tests on the USPS digit database and the PIE face database with pose and illumination variations.

Adaptive Distance Metric Learning for Clustering

Jieping Ye, Zheng Zhao, and Huan Liu

A good distance metric is crucial for unsupervised learning from high-dimensional data. To learn a metric without any constraint or class label information, most unsupervised metric learning algorithms appeal to projecting the observed data onto a low-dimensional manifold, where geometric relationships such as local or global pairwise distances are preserved. However, the projection may not necessarily improve the separability of the data, which is the desirable outcome of clustering. In this paper, we propose a novel unsupervised Adaptive Metric Learning algorithm, called AML, which performs clustering and distance metric learning simultaneously. AML projects the data onto a low-dimensional manifold where the separability of the data is maximized. We show that joint clustering and distance metric learning can be formulated as a trace maximization problem, which can be solved via an iterative procedure in the EM framework. Experimental results on real-world data sets demonstrate the effectiveness of the proposed algorithm.
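
A generic sketch of the kind of alternation such a method implies: cluster in the projected space, then refit a discriminative projection from the cluster labels. This is my own simplified stand-in (k-means plus LDA on cluster labels), not the paper's trace-maximization updates:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def adaptive_metric_clustering(X, n_clusters=3, n_dims=2, n_iters=5):
    """Alternate k-means clustering with a discriminative projection."""
    # Start from a random orthonormal projection.
    W = np.linalg.qr(np.random.default_rng(0)
                     .standard_normal((X.shape[1], n_dims)))[0]
    labels = None
    for _ in range(n_iters):
        # Cluster in the current low-dimensional space.
        labels = KMeans(n_clusters, n_init=10,
                        random_state=0).fit_predict(X @ W)
        # Refit a projection that separates the current clusters.
        lda = LinearDiscriminantAnalysis(n_components=n_dims).fit(X, labels)
        W = lda.scalings_[:, :n_dims]
    return labels, W

rng = np.random.default_rng(1)
labels, W = adaptive_metric_clustering(rng.standard_normal((150, 10)))
```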

Faces and Biometrics 2

Discriminant Mutual Subspace Learning for Indoor and Outdoor Face Recognition

Zhifeng Li, Dahua Lin, Helen Meng, and Xiaoou Tang

Outdoor face recognition is among the most challenging problems in face recognition. In this paper, we develop a discriminant mutual subspace learning (DMSL) algorithm for indoor and outdoor face recognition. Unlike traditional algorithms that use one subspace to model both indoor and outdoor face images, our algorithm simultaneously learns two related subspaces, one for indoor and one for outdoor images, and thus can model both better. To further improve the recognition performance, we develop a DMSL-based multi-classifier fusion framework on Gabor images using a new fusion method called the adaptive informative fusion scheme. Experimental results clearly show that this framework can greatly enhance the recognition performance.

Face Recognition Using Kernel Ridge Regression

Senjian An, Wanquan Liu, and Svetha Venkatesh

In this paper, we present novel ridge regression (RR) and kernel ridge regression (KRR) techniques for multivariate labels and apply the methods to the problem of face recognition. Motivated by the fact that the vertices of a regular simplex are separated points with the highest degree of symmetry, we choose such vertices as the targets for the distinct individuals in recognition and apply RR or KRR to map the training face images into a face subspace where the training images from each individual lie near their individual target. We identify a new face image by mapping it into this face subspace and comparing its distances to all individual targets. An efficient cross-validation algorithm is also provided for selecting the regularization and kernel parameters. Experiments were conducted on two face databases and the results demonstrate that the proposed algorithm significantly outperforms three popular linear face recognition techniques (Eigenfaces, Fisherfaces and Laplacianfaces) and also performs comparably with the recently developed Orthogonal Laplacianfaces, with the advantage of computational speed. Experimental results also demonstrate that KRR outperforms RR as expected, since KRR can utilize the nonlinear structure of the face images. Although we concentrate on face recognition in this paper, the proposed method is general and may be applied to general multi-category classification problems.
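
A minimal sketch of the target-coding idea, assuming scikit-learn's KernelRidge and the simple centered-one-hot construction of regular simplex vertices; the kernel parameters are placeholders, and the paper's efficient cross-validation routine is omitted:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def simplex_targets(n_classes):
    """Regular-simplex target vectors: centered one-hot codes.

    The C centered unit vectors are pairwise equidistant, so each
    class receives a maximally symmetric target.
    """
    return np.eye(n_classes) - 1.0 / n_classes

def fit_predict(X_train, y_train, X_test, n_classes, gamma=0.5, alpha=1e-3):
    T = simplex_targets(n_classes)
    Y = T[y_train]                      # one target vector per sample
    krr = KernelRidge(kernel="rbf", gamma=gamma, alpha=alpha).fit(X_train, Y)
    Z = krr.predict(X_test)             # map test images into target space
    # Classify by nearest class target.
    d = ((Z[:, None, :] - T[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

rng = np.random.default_rng(0)
X, y = rng.standard_normal((60, 30)), rng.integers(0, 4, 60)
pred = fit_predict(X, y, X[:5], n_classes=4)
```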

Face Re-Lighting from a Single Image under Harsh Lighting Conditions

Yang Wang, Zicheng Liu, Gang Hua, Zhen Wen, Zhengyou Zhang, and Dimitris Samaras

In this paper, we present a new method to change the illumination condition of a face image, with unknown face geometry and albedo information. This problem is particularly difficult when only a single image of the subject is available and it was taken under a harsh lighting condition. Recent research demonstrates that the set of images of a convex Lambertian object obtained under a wide variety of lighting conditions can be approximated accurately by a low-dimensional linear subspace using a spherical harmonic representation. However, the approximation error can be large under harsh lighting conditions, making it difficult to recover albedo information. In order to address this problem, we propose a subregion-based framework that uses a Markov Random Field to model the statistical distribution and spatial coherence of face texture, which makes our approach not only robust to harsh lighting conditions but also insensitive to partial occlusions. The performance of our framework is demonstrated through various experimental results, including the improvement of the face recognition rate under harsh lighting conditions.
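
For reference, the low-dimensional subspace referred to above is spanned by the first nine spherical harmonics evaluated at the surface normals. An unnormalized numpy sketch of that basis (normalization constants omitted, so the columns are only proportional to the usual real SH functions):

```python
import numpy as np

def sh_basis_9(normals):
    """First nine spherical-harmonic basis functions at unit normals.

    normals: (N, 3) array of unit surface normals.
    Returns an (N, 9) matrix; constants are omitted for brevity.
    """
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    return np.stack([
        np.ones_like(x),      # l = 0
        x, y, z,              # l = 1
        x * y, x * z, y * z,  # l = 2
        x**2 - y**2,
        3 * z**2 - 1,
    ], axis=1)

# An image of a convex Lambertian surface is then approximately
# albedo * (sh_basis_9(normals) @ lighting_coeffs) for some 9-vector.
normals = np.array([[0.0, 0.0, 1.0], [0.6, 0.0, 0.8]])
B = sh_basis_9(normals)  # shape (2, 9)
```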

Face Recognition using Discriminatively Trained Orthogonal Rank One Tensor Projections

Gang Hua, Paul Viola, and Steven Drucker

We propose a method for face recognition based on a discriminative linear projection. In this formulation images are treated as tensors, rather than the more conventional vector of pixels. Projections are pursued sequentially and take the form of a rank one tensor, i.e., a tensor which is the outer product of a set of vectors. A novel and effective technique is proposed to ensure that the rank one tensor projections are orthogonal to one another. These constraints on the tensor projections provide a strong inductive bias and result in better generalization on small training sets. Our work is related to spectrum methods, which achieve orthogonal rank one projections by pursuing consecutive projections in the complement space of previous projections. Although this may be meaningful for applications such as reconstruction, it is less meaningful for pursuing discriminant projections. Our new scheme iteratively solves an eigenvalue problem with orthogonality constraints on one dimension, and solves unconstrained eigenvalue problems on the other dimensions. Experiments demonstrate that on small and medium sized face recognition datasets, this approach outperforms previous embedding methods. On large face datasets this approach achieves results comparable with the best, often using fewer discriminant projections.

Canonical Face Depth Map: A Robust 3D Representation for Face Verification

Dirk Colbry and George Stockman

The Canonical Face Depth Map (CFDM) is a standardized representation for storing and manipulating 3D data from human faces. We provide an automatic transformation of a 3D face scan into the canonical form, which eliminates the need for hand-labeled anchor points and proves more robust to noise than using automatically detected anchor points. The experimental results demonstrate that the CFDM is robust to noise and occlusion, reduces memory requirements, and improves the efficiency of matching algorithms. The CFDM also supports some 2D methods, such as convolution, PCA, and LDA, which are readily used for feature localization and face recognition in our 3D data.

Graphical Model Approach to Iris Matching Under Deformation and Occlusion

Ryan Kerekes, Balakrishnan Narayanaswamy, Jason Thornton, Marios Savvides, and B.V.K. Vijaya Kumar

Template matching of iris images for biometric recognition typically suffers from both local deformations between the template and query images and large occlusions from the eyelid. In this work, we model deformation and occlusion as a set of hidden variables for each iris comparison. We use a field of directional vectors to represent deformation and a field of binary variables to represent occlusion. We impose a probability distribution on these fields using a lattice-type undirected graphical model, in which the graph edges represent interdependencies between neighboring iris regions. Gabor wavelet-based similarity scores and intensity statistics are used as observations in the model. Loopy belief propagation is applied to estimate the conditional distributions on the hidden variables, which are in turn used to compute final match scores. We present underlying theory as well as experimental results from both the CASIA iris database and the database provided for the Iris Challenge Evaluation (ICE). We show that our proposed method significantly improves recognition accuracy on these datasets over existing methods.

Revocable Fingerprint Biotokens: Accuracy and Security Analysis

Terrance Boult, Walter Scheirer, and Robert Woodworth

This paper reviews the biometric dilemma, the pending threat that may limit the long-term value of biometrics in security applications. Unlike passwords, if a biometric database is ever compromised or improperly shared, the underlying biometric data cannot be changed. The concept of revocable or cancelable biometric-based identity tokens (biotokens), if properly implemented, can provide significant enhancements in both privacy and security and address the biometric dilemma. The key to effective revocable biotokens is the need to support the highly accurate approximate matching needed in any biometric system, as well as protecting the privacy and security of the underlying data. We briefly review prior work and show why it is insufficient in both accuracy and security. This paper then adapts a recently introduced approach that separates each datum into two fields, one of which is encoded and one of which is left to support the approximate matching. Previously applied to faces, this approach is used here to enhance an existing fingerprint system. Unlike previous work in privacy-enhanced biometrics, our approach improves the accuracy of the underlying system. The security analysis of these biotokens includes addressing the critical issue of protection of small fields. The resulting algorithm is tested on three different fingerprint verification challenge datasets and shows an average decrease in the Equal Error Rate of over 30%, providing improved security and improved privacy.

Using Stereo Matching for 2-D Face Recognition Across Pose

Carlos Castillo and David Jacobs

We propose using stereo matching for 2-D face recognition across pose. The cost of this matching is used to evaluate the similarity of two faces. We show that this cost is robust to pose variations. To illustrate this idea we built a face recognition system on top of several dynamic programming stereo matching algorithms. The method works well even when the epipolar lines we use do not exactly fit the viewpoints. We have tested our approach on the PIE dataset. In all the experiments, our method demonstrates effective performance compared with other algorithms.

On the Performance Prediction and Validation for Multisensor Fusion

Rong Wang and Bir Bhanu

Multiple sensors are commonly fused to improve the detection and recognition performance of computer vision and pattern recognition systems. The traditional approach for fusion is to try all possible combinations of sensors by performing exhaustive experiments to determine the optimal combination. In this paper, we present a theoretical approach that predicts the performance of sensor fusion that allows us to select the optimal combination. We start with the characteristics of each sensor by computing the match score and non-match score distributions of objects to be recognized. These distributions are modeled as a mixture of Gaussians. Then, we use an explicit Phi transformation that maps a receiver operating characteristic (ROC) curve to a straight line in 2-D space whose axes are related to the false alarm rate and the correct recognition rate. Finally, using this representation, we derive a set of metrics to evaluate the sensor fusion performance and find the optimal sensor combination. We verify our prediction approach on the publicly available XM2VTS database as well as other databases.
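
The straightening step is plausibly the classic normal-deviate transform: for Gaussian score models, plotting Phi^{-1}(correct-recognition rate) against Phi^{-1}(false-alarm rate) turns the ROC into a straight line. A sketch under that assumption (my reading of the abstract, not necessarily the paper's exact mapping):

```python
import numpy as np
from scipy.stats import norm

# Gaussian match / non-match score models for one sensor.
mu_match, sd_match = 2.0, 1.0
mu_non, sd_non = 0.0, 1.5

taus = np.linspace(-4, 6, 200)                 # decision thresholds
hit = 1 - norm.cdf(taus, mu_match, sd_match)   # correct-recognition rate
fa = 1 - norm.cdf(taus, mu_non, sd_non)        # false-alarm rate

# Inverse-normal ("Phi") axes: the ROC becomes a straight line whose
# slope is sd_non / sd_match and whose intercept encodes the mean gap.
u, v = norm.ppf(fa[1:-1]), norm.ppf(hit[1:-1])
slope, intercept = np.polyfit(u, v, 1)
print(slope, sd_non / sd_match)  # these agree for Gaussian scores
```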

Geometry and Structure-From-Motion 2

An Efficient Minimal Solution for Infinitesimal Camera Motion

Henrik Stewenius, Chris Engels, and David Nister

Given five motion vectors observed in a calibrated camera, what is the rotational and translational velocity of the camera? This problem is the infinitesimal motion analogue to the five-point relative orientation problem, which has previously been solved through the derivation of a tenth-degree polynomial and extraction of its roots. Here, we present the first efficient solution to the infinitesimal version of the problem. The solution is faster than its finite counterpart. In our experiments, we investigate over which range of motions and scene distances the infinitesimal approximation is valid and show that the infinitesimal approximation works well in applications such as camera tracking.

On the Spacetime Geometry of Galilean Cameras

Yaser Sheikh, Alexei Gritai, and Mubarak Shah

In this paper, we extend the conventional model of image projection to video captured by a camera moving at constant velocity. To that end, we introduce the concept of spacetime projection and develop the corresponding epipolar geometry between two such cameras. Both perspective imaging and linear pushbroom imaging are specializations of the proposed model, and we show how six different "fundamental" matrices, including the classic fundamental matrix, the Linear Pushbroom (LP) fundamental matrix, and a fundamental matrix relating Epipolar Plane Images (EPIs), are related and can be directly recovered from a general fundamental matrix defined between two such cameras. We provide linear algorithms for estimating the parameters of this fundamental matrix and the mapping between videos in the case of planar scenes.

Robust Rotation and Translation Estimation in Multiview Reconstruction

Daniel Martinec and Tomas Pajdla

It is known that the problem of multiview reconstruction can be solved in two steps: first estimate camera rotations, and then estimate translations using them. This paper presents new robust techniques for both of these steps. (i) Given pairwise relative rotations, global camera rotations are estimated linearly in the least-squares sense. (ii) Camera translations are estimated using a standard technique based on Second Order Cone Programming. Robustness is achieved by using only a subset of points, according to a new criterion that diminishes the risk of choosing a mismatch. It is shown that only four points chosen in a special way are sufficient to represent a pairwise reconstruction almost as well as all of the points. This leads to a significant speedup. In image sets with repetitive or similar structures, non-existent epipolar geometries may be found; due to them, some rotations, and consequently translations, may be estimated incorrectly. It is shown that iterative removal of the pairwise reconstructions with the largest residual, followed by re-registration, removes most non-existent epipolar geometries. The performance of the proposed method is demonstrated on difficult wide-baseline image sets.
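
A compact sketch of step (i): stack the constraints R_j ≈ R_ij R_i into a linear least-squares system (ignoring orthonormality during the solve), fix the gauge R_0 = I, and project each recovered 3x3 block back onto SO(3). My own minimal numpy illustration:

```python
import numpy as np

def average_rotations(n, pairs):
    """Linear rotation averaging: R_j ≈ R_ij @ R_i for (i, j) in pairs.

    Solves for all R_k in least squares, one matrix column at a time,
    with the gauge R_0 = I, then projects each estimate onto SO(3).
    """
    Rs = np.zeros((n, 3, 3))
    for c in range(3):
        rows, rhs = [], []
        g = np.zeros((3, 3 * n))          # gauge: column c of R_0 = e_c
        g[:, :3] = np.eye(3)
        rows.append(g); rhs.append(np.eye(3)[:, c])
        for (i, j), Rij in pairs.items():  # column c of (R_j - R_ij R_i) = 0
            a = np.zeros((3, 3 * n))
            a[:, 3 * j:3 * j + 3] = np.eye(3)
            a[:, 3 * i:3 * i + 3] = -Rij
            rows.append(a); rhs.append(np.zeros(3))
        x, *_ = np.linalg.lstsq(np.vstack(rows), np.concatenate(rhs), rcond=None)
        Rs[:, :, c] = x.reshape(n, 3)
    out = []
    for M in Rs:                           # nearest rotation via SVD
        U, _, Vt = np.linalg.svd(M)
        D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])
        out.append(U @ D @ Vt)
    return out

Rij = np.array([[1, 0, 0],
                [0, np.cos(0.3), -np.sin(0.3)],
                [0, np.sin(0.3), np.cos(0.3)]])
Rs = average_rotations(2, {(0, 1): Rij})  # Rs[0] ~ I, Rs[1] ~ Rij
```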

Evaluation of Epipole Estimation Methods with/without Rank-2 Constraint across Algebraic/Geometric Error Functions

Tsuyoshi Migita and Takeshi Shakunaga

The best method for estimating the fundamental matrix and/or the epipole over a given set of point correspondences between two images is nonlinear minimization, which searches for a rank-2 fundamental matrix that minimizes the geometric error cost function. When convenience is preferred to accuracy, a linear approximation method is often used instead, which searches for a rank-3 matrix that minimizes the algebraic error. Although it has been reported that the algebraic error causes very poor results, it is currently thought that the relatively inaccurate results of a linear estimation method are a consequence of neglecting the rank-2 constraint, and not a result of exploiting the algebraic error. However, the reason has not been analyzed fully. In the present paper, we analyze the effects of the cost function selection and the rank-2 constraint based on covariance matrix analyses, and we show theoretically and experimentally that it is more important to enforce the rank-2 constraint than to minimize the geometric cost function.
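
For context, the rank-2 constraint discussed above is usually imposed on a linear estimate in one SVD step: zero the smallest singular value. A standard numpy sketch of that post-hoc correction (not the paper's covariance analysis):

```python
import numpy as np

def enforce_rank2(F):
    """Project an estimated fundamental matrix onto the rank-2 set.

    Returns the closest (Frobenius-norm) rank-2 matrix: keep the two
    largest singular values and zero the third.
    """
    U, s, Vt = np.linalg.svd(F)
    s[2] = 0.0
    return U @ np.diag(s) @ Vt

F = np.random.default_rng(0).standard_normal((3, 3))  # stand-in estimate
F2 = enforce_rank2(F)
print(np.linalg.matrix_rank(F2))  # 2; the epipoles are now well defined
```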

Moving Forward in Structure From Motion

Andrea Vedaldi, Gregorio Guidi, and Stefano Soatto

It is well-known that forward motion induces a large number of local minima in the instantaneous least-squares reprojection error. This is caused in part by singularities in the error landscape around the forward direction, and it presents a challenge in using existing structure-from-motion algorithms in autonomous navigation applications. In this paper we prove that imposing a bound on the reconstructed depth of the scene makes the least-squares reprojection error continuous. This has implications for autonomous navigation, as it suggests simple modifications to existing algorithms that minimize the effects of local minima in forward translation.

Robust Metric Reconstruction from Challenging Video Sequences

Guofeng Zhang, Xueying Qin, Wei Hua, Tien-Tsin Wong, Pheng-Ann Heng, and Hujun Bao

Although camera self-calibration and metric reconstruction have been extensively studied during the past decades, automatic metric reconstruction from long video sequences with varying focal length is still very challenging. Several critical issues in practical implementations are not adequately addressed. For example, how should the initial frames for initializing the projective reconstruction be selected? What criteria should be used? How should the large zooming problem be handled? How should an appropriate moment be chosen for upgrading the projective reconstruction to a metric one? This paper gives a careful investigation of all these issues, and practical and effective approaches are proposed. In particular, we show that the existing image-based distance is not an adequate measurement for selecting the initial frames. We propose a novel measurement that takes into account the zoom degree, the self-calibration quality, as well as the image-based distance. We then introduce a new strategy for deciding when to upgrade the projective reconstruction to a metric one. Finally, to alleviate the heavy computational cost of bundle adjustment, a local on-demand approach is proposed. Our method is also extensively compared with state-of-the-art commercial software to demonstrate its robustness and stability.

Kinematics from Lines in a Single Rolling Shutter Image

Omar Ait-Aider, Adrien Bartoli, and Nicolas Andreff

Recent work shows that recovering pose and velocity from a single view of a moving rigid object is possible with a rolling shutter camera, based on feature point correspondences. We extend this method to line correspondences. Owing to the combined effect of rolling shutter and object motion, straight lines are distorted to curves as they get imaged with a rolling shutter camera. Lines thus capture more information than points, which is not the case with standard projection models for which both points and lines give two constraints. We extend the standard line reprojection error, and propose a nonlinear method for retrieving a solution to the pose and velocity computation problem. A careful inspection of the design matrix in the normal equations reveals that it is highly sparse and patterned. We propose a blockwise solution procedure based on bundle-adjustment-like sparse inversion. This makes nonlinear optimization fast and numerically stable. The method is validated using real data.

Features, Regions, and Boundaries

Maximally Stable Colour Regions for Recognition and Matching

Per-Erik Forssen

This paper introduces a novel colour-based affine covariant region detector. Our algorithm is an extension of the maximally stable extremal region (MSER) to colour. The extension to colour is done by looking at successive time-steps of an agglomerative clustering of image pixels. The selection of time-steps is stabilised against intensity scalings and image blur by modelling the distribution of edge magnitudes. The algorithm contains a novel edge significance measure based on a Poisson image noise model, which we show performs better than the commonly used Euclidean distance. We compare our algorithm to the original MSER detector and a competing colour-based blob feature detector, and show through a repeatability test that our detector performs better. We also extend the state of the art in feature repeatability tests, by introducing a way to test stability across view changes even though the scene does not consist of a single world plane. This new test better evaluates how useful a feature is for recognition of objects that are far from being planar.

Discriminant Interest Points are Stable

Dashan Gao and Nuno Vasconcelos

A study of the performance of recently introduced discriminant methods for interest point detection is presented. It has been previously shown that the resulting interest points are more informative for object recognition than those produced by the detectors currently used in computer vision. Little is, however, known about the properties of discriminant points with respect to the metrics, such as repeatability, that have traditionally been used to evaluate interest point detection. A thorough experimental evaluation of the stability of discriminant points is presented, and this stability is compared to that of four popular methods. In particular, we consider image correspondence under geometric and photometric transformations, and extend the experimental protocol proposed by Mikolajczyk et al. for the evaluation of stability with respect to such transformations. The extended protocol is suitable for the evaluation of both bottom-up and top-down (learned) detectors. It is shown that the stability of discriminant interest points is comparable, and frequently superior, to that of interest points produced by various currently popular techniques.

Image representations beyond histograms of gradients: The role of Gestalt descriptors

Stanley Bileschi and Lior Wolf

Histograms of image filter responses and the statistics derived from them have proven to be effective image representations for various visual recognition tasks. Here, an attempt is made to advance the performance envelope by including novel complex image representations that capture higher, mid-level visual cues. Four new image features are proposed, explicitly approximating the Gestalt concepts of continuity, shape-closure, symmetry and repetition. These new image features are used jointly alongside existing state-of-the-art features to improve the accuracy of detectors on challenging real-world data sets. This is shown via improved accuracy on common computer vision benchmarks. As baseline features, we use the biological model features of Poggio et al. and the oriented-gradient-histogram features of Dalal and Triggs. Given that both of these baseline features have been optimized for performance, the fact that our new mid-level representations can further improve detection results warrants special consideration. The performance improvement data for the new features are further analyzed alongside psychophysics data from rapid presentation experiments. The results suggest that the new features which appear to aid object detection most are the same ones that are easily detected by human observers within the first 100 ms of visual presentation. The conclusion drawn is that the human visual apparatus can be tested to suggest improvements in computer vision experiments.

Fast Keypoint Recognition in Ten Lines of Code

Mustafa Özuysal, Pascal Fua, and Vincent Lepetit

While feature point recognition is a key component of modern approaches to object detection, existing approaches require computationally expensive patch preprocessing to handle perspective distortion. In this paper, we show that formulating the problem in a Naive Bayesian classification framework makes such preprocessing unnecessary and produces an algorithm that is simple, efficient, and robust. Furthermore, it scales well to handle a large number of classes. To recognize the patches surrounding keypoints, our classifier uses hundreds of simple binary features and models class posterior probabilities. We make the problem computationally tractable by assuming independence between arbitrary sets of features. Even though this is not strictly true, we demonstrate that our classifier nevertheless performs remarkably well on image datasets containing very significant perspective changes.
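
A toy version of the semi-naive Bayesian scheme: group binary tests into small independent sets (often called ferns), learn per-group value-given-class frequencies from counts, and score a class by summing per-group log-probabilities. A compact sketch of my own, not the authors' code:

```python
import numpy as np

class FernClassifier:
    """Semi-naive Bayes over groups of binary features ('ferns')."""

    def __init__(self, n_ferns, fern_size, n_classes):
        self.shape = (n_ferns, fern_size)
        self.counts = np.ones((n_ferns, 2 ** fern_size, n_classes))  # +1 prior

    def _indices(self, bits):
        # bits: (n_samples, n_ferns * fern_size) binary array.
        b = bits.reshape(len(bits), *self.shape)
        return b @ (2 ** np.arange(self.shape[1]))  # per-fern integer values

    def fit(self, bits, labels):
        labels = np.asarray(labels)
        idx = self._indices(bits)
        for f in range(self.shape[0]):
            np.add.at(self.counts[f], (idx[:, f], labels), 1)
        return self

    def predict(self, bits):
        idx = self._indices(bits)
        probs = self.counts / self.counts.sum(axis=1, keepdims=True)
        # Independence across ferns: sum log P(fern value | class).
        scores = sum(np.log(probs[f][idx[:, f]]) for f in range(self.shape[0]))
        return scores.argmax(axis=1)

rng = np.random.default_rng(0)
bits = rng.integers(0, 2, (200, 50 * 4))  # 50 ferns of 4 binary tests
labels = rng.integers(0, 10, 200)
clf = FernClassifier(50, 4, 10).fit(bits, labels)
pred = clf.predict(bits)
```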

Feature Extraction by Maximizing the Average Neighborhood Margin

Wang Fei and Zhang Changshui

A novel algorithm for supervised linear feature extraction is proposed in this paper. The algorithm, called Average Neighborhood Margin Maximization (ANMM), aims at maximizing the average neighborhood margin: for each data point, the average distance to its heterogeneous neighbors minus the average distance to its homogeneous neighbors. We show that features extracted by ANMM separate the data from different classes well, and that the method avoids the small sample size problem that exists in traditional Linear Discriminant Analysis (LDA). The kernelized (nonlinear) counterpart of ANMM is also established in this paper. Moreover, as in many computer vision applications the data are more naturally represented by higher-order tensors (e.g. images and videos), we develop a tensorized (multilinear) form of ANMM, which can directly extract features from tensors. The experimental results of applying ANMM to face recognition are presented to show the effectiveness of our method.
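
A small numpy sketch of that objective: accumulate a "scatterness" matrix from each point's heterogeneous neighbors and a "compactness" matrix from its homogeneous neighbors, then take the top eigenvectors of their difference. A simplified illustration with my own parameter names, not the released algorithm:

```python
import numpy as np

def anmm(X, y, n_dims=2, k=5):
    """Average Neighborhood Margin Maximization (simplified sketch).

    Returns a (d, n_dims) projection maximizing, on average, distance
    to k heterogeneous neighbors minus distance to k homogeneous ones.
    """
    n, d = X.shape
    S = np.zeros((d, d))  # scatterness (heterogeneous neighbors)
    C = np.zeros((d, d))  # compactness (homogeneous neighbors)
    D = ((X[:, None] - X[None, :]) ** 2).sum(-1)  # squared distances
    for i in range(n):
        same = np.where((y == y[i]) & (np.arange(n) != i))[0]
        diff = np.where(y != y[i])[0]
        for idx, M in ((diff[np.argsort(D[i, diff])[:k]], S),
                       (same[np.argsort(D[i, same])[:k]], C)):
            for j in idx:
                e = (X[i] - X[j])[:, None]
                M += (e @ e.T) / len(idx)
    # The top eigenvectors of S - C span the ANMM subspace.
    w, V = np.linalg.eigh(S - C)
    return V[:, np.argsort(w)[::-1][:n_dims]]

X = np.random.default_rng(0).standard_normal((60, 5))
y = np.repeat([0, 1, 2], 20)
W = anmm(X, y)  # (5, 2) projection; use X @ W as the new features
```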

Linear Laplacian Discrimination for Feature Extraction

Deli Zhao, Zhouchen Lin, Rong Xiao, and Xiaoou Tang

Discriminant feature extraction plays a fundamental role in pattern recognition. In this paper, we propose the Linear Laplacian Discrimination (LLD) algorithm for discriminant feature extraction. LLD is an extension of Linear Discriminant Analysis (LDA). Our motivation is to address the issue that LDA cannot work well in cases where sample spaces are non-Euclidean. Specifically, we define the within-class scatter and the between-class scatter using similarities based on pairwise distances in sample spaces. Thus the structural information of classes is contained in the within-class and between-class Laplacian matrices, which are free from the metrics of sample spaces. The optimal discriminant subspace can be derived by controlling the structural evolution of the Laplacian matrices. Experiments are performed on the facial database of FRGC version 2. Experimental results show that LLD is superior in extracting discriminant features.

Local Image Structure Detection with Orientation-invariant Radial Configuration

Lech Szumilas, René Donner, Georg Langs, and Allan Hanbury

Local image descriptors have proved to be useful tools for many computer vision tasks, such as matching points between multiple images of a scene and object recognition. Current descriptors, such as SIFT, are designed to match image features with unique local neighborhoods. However, the interest point detectors used with SIFT often fail to select perceptible local structures in the image, and the SIFT descriptor does not directly encode the local neighborhood shape. In this paper we propose a symmetry-based interest point detector and a radial local structure descriptor which consistently captures the majority of basic local image structures and provides a geometrical description of the structure boundaries. This approach concentrates on the extraction of shape properties in image patches, which are an intuitive way to represent local appearance for matching and classification. We explore the specificity and sensitivity of this local descriptor in the context of classification of natural patterns. The implications of the performance comparison with standard approaches like SIFT are discussed.

High Distortion and Non-Structural Image Matching via Feature Co-occurrence

Xi Chen and Tat-Jen Cham

We propose a novel approach for determining whether a pair of images match each other under a high-distortion transformation or a non-structural relation. The co-occurrence statistics between features across a pair of images are learned from a training set comprising matched and mismatched image pairs; these are expressed in the form of a cross-feature ratio table. The proposed method does not require feature-to-feature correspondences, but instead identifies and exploits feature co-occurrences that provide discriminative evidence for the transformation. The method not only allows for the matching of test image pairs whose visual content differs substantially from that of the training set, but also caters for transformations and relations that do not preserve image structure.

Body Tracking, Gait, and Gesture 1

Scaled Motion Dynamics for Markerless Motion Capture

Bodo Rosenhahn, Thomas Brox, and Hans-Peter Seidel

This work proposes a way to use a-priori knowledge of motion dynamics for markerless human motion capture (MoCap). Specifically, we match tracked motion patterns to training patterns in order to predict states in successive frames. Modeling the motion by means of twists allows for a proper scaling of the prior; consequently, there is no need for training data at different frame rates or velocities. Moreover, the method allows very different motion patterns to be combined. Experiments in indoor and outdoor scenarios demonstrate the continuous tracking of familiar motion patterns in the case of artificial frame drops or in situations insufficiently constrained by the image data.

Fast Human Pose Estimation using Appearance and Motion via Multi-Dimensional Boosting Regression

Alessandro Bissacco, Ming-Hsuan Yang, and Stefano Soatto

We address the problem of estimating human pose in video sequences, where the rough location of the human has been detected. We exploit both appearance and motion information by defining suitable features of an image and its temporal neighbors, and by learning a regression map to the parameters of a model of the human body using boosting techniques. Our algorithm can be viewed as a fast initialization step for human body trackers, or as a tracker itself. We extend gradient boosting techniques to learn a multi-dimensional map from (rotated and scaled) Haar features to the entire set of joint angles representing the full body pose. Compared to prior work that either advocated learning a separate regressor for each joint angle or proposed schemes where all regressors shared the same scaling, our approach is more efficient (all joint angle estimators share the same features), more general (it handles inputs with elements having different scales) and more robust (it exploits the high degree of correlation between joint angles in natural human poses). We test our approach by learning a map from image patches to body joint angles from synchronized video and motion capture walking data. We show how our technique outperforms previous boosting approaches and allows us to learn an efficient real-time pose estimator, trading off estimation accuracy for execution speed.
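
For contrast with the shared-feature regressor described above, the per-dimension baseline (one independent booster per joint angle, the design the authors argue against) takes a few lines with scikit-learn; the data shapes here are hypothetical stand-ins:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

# Toy stand-ins: 500 feature vectors -> 20 joint angles.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 64))  # e.g. Haar-feature responses
Y = rng.standard_normal((500, 20))  # full-body joint angles

# One independent gradient-boosted regressor per output dimension.
model = MultiOutputRegressor(
    GradientBoostingRegressor(n_estimators=100, max_depth=3)
).fit(X, Y)
pred = model.predict(X[:5])  # (5, 20) joint-angle estimates
```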

Tracking-as-Recognition for Articulated Full-Body Human Motion Analysis

Patrick Peursum, Svetha Venkatesh, and Geoff West

This paper addresses the problem of markerless tracking of a human in full 3D with a high-dimensional (29D) body model. Most work in this area has focused on achieving accurate tracking in order to replace marker-based motion capture, but does so at the cost of relying on relatively clean observing conditions. This paper takes a different perspective, proposing a body-tracking model that is explicitly designed to handle real-world conditions such as occlusions by scene objects, failure recovery, long-term tracking, auto-initialisation, generalisation to different people and integration with action recognition. To achieve these goals, an action's motions are modelled with a variant of the hierarchical hidden Markov model. The model is quantitatively evaluated with several tests, including comparison to the annealed particle filter, tracking different people and tracking with a reduced resolution and frame rate.

Single View Human Action Recognition using Key Pose Matching and Viterbi Path Searching

Fengjun Lv and Ramakant Nevatia

3D human pose recovery is considered a fundamental step in view-invariant human action recognition. However, inferring 3D poses from a single view is usually slow due to the large number of parameters that need to be estimated, and the recovered poses are often ambiguous due to the perspective projection. We present an approach that does not explicitly infer the 3D pose at each frame. Instead, from existing action models we search for a series of actions that best match the input sequence. In our approach, each action is modeled as a series of synthetic 2D human poses rendered from a wide range of viewpoints. The constraints on transitions between the synthetic poses are represented by a graph model called the Action Net. Given the input, silhouette matching between the input frames and the key poses is performed first using an enhanced Pyramid Match Kernel algorithm. The best-matched sequence of actions is then tracked using the Viterbi algorithm. We demonstrate this approach on a challenging video set consisting of 15 complex action classes.
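
As a concrete illustration of the decoding step, this sketch runs the Viterbi algorithm over log-domain transition and matching scores; the scores below are synthetic placeholders standing in for the Action Net and the silhouette-matching costs.

    import numpy as np

    def viterbi(log_trans, log_emit):
        """log_trans: (S, S) log pose-transition scores.
        log_emit: (T, S) log silhouette-matching scores per frame.
        Returns the best key-pose index for each frame."""
        T, S = log_emit.shape
        delta = log_emit[0].copy()             # best score ending in each state
        back = np.zeros((T, S), dtype=int)     # backpointers
        for t in range(1, T):
            cand = delta[:, None] + log_trans  # (prev state, next state)
            back[t] = cand.argmax(axis=0)
            delta = cand.max(axis=0) + log_emit[t]
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]

    rng = np.random.default_rng(0)
    lt = np.log(np.array([[.8, .2, 0], [0, .8, .2], [.2, 0, .8]]) + 1e-9)
    le = np.log(rng.random((5, 3)) + 1e-9)
    print(viterbi(lt, le))                     # one key pose per frame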

A Hierarchical Model of Shape and Appearance for Human Action Classification

Juan Carlos Niebles and Li Fei-Fei

We present a novel model for human action categorization. A video sequence is represented as a collection of spatial and spatio-temporal features by extracting static and dynamic interest points. We propose a hierarchical model that can be characterized as a constellation of bags-of-features and that is able to combine both spatial and spatio-temporal features. Given a novel video sequence, the model is able to categorize human actions on a frame-by-frame basis. We test the model on a publicly available human action dataset and show that our new method performs well on the classification task. We also conducted control experiments showing that the use of the proposed mixture of hierarchical models improves classification performance over bag-of-features models. An additional experiment shows that using both dynamic and static features provides a richer representation of human actions than a single feature type, as demonstrated by our evaluation on the classification task.

Bilattice-based Logical Reasoning for Human Detection

Vinay Shet, Jan Neumann, Visvanathan Ramesh, and Larry Davis

The capacity to robustly detect humans in video is a critical component of automated visual surveillance systems. This paper describes a bilattice-based logical reasoning approach that exploits contextual information and knowledge about interactions between humans, and augments it with the output of different low-level detectors for human detection. Detections from low-level parts-based detectors are treated as logical facts and used to reason explicitly about the presence or absence of humans in the scene. Positive and negative information from different sources, as well as uncertainties from detections and logical rules, are integrated within the bilattice framework. This approach also generates proofs or justifications for each hypothesis it proposes. These justifications (or lack thereof) are further employed by the system to explain and validate, or reject, potential hypotheses. This allows the system to explicitly reason about complex interactions between humans and handle occlusions. These proofs are also available to the end user as an explanation of why the system thinks a particular hypothesis is actually a human. We employ a boosted cascade of gradient-histogram-based detectors to detect individual body parts. We have applied this framework to analyze the presence of humans in static images from different datasets.

Detecting Pedestrians by Learning Shapelet Features

Payam Sabzmeydani and Greg Mori

In this paper, we address the problem of detecting pedestrians in still images. We introduce an algorithm for learning shapelet features, a set of mid-level features. These features are focused on local regions of the image and are built from low-level gradient information that discriminates between pedestrian and non-pedestrian classes. Using AdaBoost, these shapelet features are created as combinations of oriented gradient responses. To train the final classifier, we use AdaBoost a second time to select a subset of our learned shapelets. By first focusing locally on smaller feature sets, our algorithm attempts to harvest more useful information than by examining all the low-level features together. We present quantitative results demonstrating the effectiveness of our algorithm. In particular, we obtain an error rate 14 percentage points lower (at 10^-6 FPPW) than the previous state-of-the-art detector of Dalal and Triggs [1] on the INRIA dataset.
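
A minimal sketch of the second boosting stage, using scikit-learn's AdaBoost with depth-1 stumps so that each round selects (roughly) one feature; the features and labels are synthetic stand-ins for learned shapelet responses.

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical data: rows are detection windows, columns are
    # pooled oriented-gradient responses (stand-ins for shapelets).
    rng = np.random.default_rng(1)
    X = rng.normal(size=(400, 64))
    y = (X[:, 3] + 0.5 * X[:, 17] > 0).astype(int)   # synthetic labels

    # Depth-1 stumps make the ensemble double as a feature selector.
    clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=50)
    clf.fit(X, y)
    selected = {t.tree_.feature[0] for t in clf.estimators_}
    print("features kept by boosting:", sorted(selected))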

Epitomic Representation of Human Activities

Naresh Cuntoor and Rama Chellappa

We introduce an epitomic representation for modeling human activities in video sequences. A video sequence is divided into segments within which the dynamics of objects are assumed to be linear and are modeled using linear dynamical systems. The tuple consisting of the estimated system matrix, the statistics of the input signal and the initial state value is said to form an epitome. The system matrices are decomposed using the Iwasawa matrix decomposition to isolate the effects of rotation, scaling and projective action on the state vector. We demonstrate the usefulness of the proposed representation and decomposition for activity recognition using the TSA airport surveillance dataset and the UCF indoor human action dataset.

Incorporating On-demand Stereo for Real Time Recognition

Thomas Deselaers, Antonio Criminisi, John Winn, and Ankur Agarwal

A new method for localising and recognising hand poses and objects in real time is presented. This problem is important in vision-driven applications where it is natural for a user to combine hand gestures and common objects when interacting with a machine; for example, using a real eraser to remove words from an electronic document on a tablet display. This paper employs a random forest algorithm to adaptively select a minimal combination of appearance, shape and stereo features which achieves maximum class discrimination for a given image. This minimal set leads to both efficiency at run time and good generalisation. Unlike previous stereo work, which explicitly constructs disparity maps, here stereo cues are computed only at image points where they are necessary for recognition and are then used in combination with other features. This leads to improvements in efficiency. The proposed method is assessed on a database of a variety of objects and hand poses selected for interaction on a flat surface in an office environment.

Tensor Canonical Correlation Analysis for Action Classification

Tae-Kyun Kim, Shu-Fai Wong, and Roberto Cipolla

We introduce a new framework, Tensor Canonical Correlation Analysis (TCCA), which extends classical Canonical Correlation Analysis (CCA) to multidimensional data arrays (tensors), and apply it to action/gesture classification in videos. With Tensor CCA, joint space-time linear relationships between two video volumes are inspected to yield flexible and descriptive similarity features for the two videos. The TCCA features are combined with a discriminative feature selection scheme and a nearest neighbor classifier for action classification. In addition, we propose a time-efficient action detection method based on dynamic learning of subspaces for Tensor CCA for the case where actions are not aligned in the space-time domain. The proposed method delivered significantly better accuracy than state-of-the-art methods, at comparable detection speed, on the KTH action data set as well as on self-recorded hand gesture data sets.
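
For reference, classical two-view CCA, the method TCCA generalizes to tensors, reduces to an SVD of the whitened cross-covariance; this sketch recovers the top canonical correlations on synthetic paired data.

    import numpy as np

    def cca(X, Y, k=2, eps=1e-8):
        """Top-k canonical correlations between paired samples
        X (n, p) and Y (n, q)."""
        X = X - X.mean(0)
        Y = Y - Y.mean(0)
        n = len(X)
        Cxx = X.T @ X / n + eps * np.eye(X.shape[1])
        Cyy = Y.T @ Y / n + eps * np.eye(Y.shape[1])
        Cxy = X.T @ Y / n
        Wx = np.linalg.inv(np.linalg.cholesky(Cxx))   # whitening transforms
        Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
        s = np.linalg.svd(Wx @ Cxy @ Wy.T, compute_uv=False)
        return s[:k]                                  # correlations in [0, 1]

    rng = np.random.default_rng(2)
    Z = rng.normal(size=(200, 2))                     # shared latent signal
    X = Z @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))
    Y = Z @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(200, 4))
    print(cca(X, Y))                                  # both close to 1.0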

Shape Variation-Based Frieze Pattern for Robust Gait Recognition

Seungkyu Lee, Yanxi Liu, and Robert Collins

Gait is an attractive biometric for vision-based human identification. Previous work on existing public data sets has shown that shape cues yield improved recognition rates compared to pure motion cues. However, shape cues are fragile to gross appearance variations of an individual, for example, walking while carrying a ball or a backpack. We introduce a novel, spatiotemporal Shape Variation-Based Frieze Pattern (SVB frieze pattern) representation for gait, which captures motion information over time. The SVB frieze pattern represents normalized frame differences over gait cycles. Rows/columns of the vertical/horizontal SVB frieze pattern contain motion variation information augmented by key frame information with body shape. A temporal symmetry map of gait patterns is also constructed and combined with the vertical/horizontal SVB frieze patterns for measuring the dissimilarity between gait sequences. Experimental results show that our algorithm improves gait recognition performance on sequences with and without gross differences in silhouette shape. We demonstrate superior performance of this computational framework over previous algorithms using shape cues alone on both the CMU MoBo and USF HumanID gait databases.
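
A rough sketch of the core construction, assuming binary silhouette masks: frame differences are thresholded and projected along rows and columns to give vertical/horizontal patterns. This illustrative code omits the normalization, key-frame augmentation and symmetry map described above.

    import numpy as np

    def svb_frieze(silhouettes, thresh=0.5):
        """silhouettes: (T, H, W) binary masks over gait cycles.
        Returns (H, T-1) and (W, T-1) frame-difference projections."""
        diff = np.abs(np.diff(silhouettes.astype(float), axis=0)) > thresh
        vert = diff.sum(axis=2).T     # project along image columns
        horiz = diff.sum(axis=1).T    # project along image rows
        return vert, horiz

    # Toy input: a rectangular "walker" drifting sideways.
    T, H, W = 8, 32, 24
    frames = np.zeros((T, H, W))
    for t in range(T):
        frames[t, 10:22, 4 + t:12 + t] = 1
    v, h = svb_frieze(frames)
    print(v.shape, h.shape)           # (32, 7) (24, 7)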

Medical Imaging 1

A boosting regression approach to medical anatomy detection

Shaohua Kevin Zhou, Jinghao Zhou, and Dorin Comaniciu

The state-of-the-art object detection algorithm learns a binary classifier to differentiate the foreground object from the background. Since the detection algorithm exhaustively scans the input image for object instances by testing the classifier, its computational complexity depends linearly on the image size and, if, say, orientation and scale are scanned, on the number of configurations in orientation and scale. We argue that exhaustive scanning is unnecessary when detecting medical anatomy, because a medical image offers strong contextual information. We then present an approach to effectively leveraging the medical context, leading to a solution that needs only one scan in theory, or several sparse scans in practice, and only one integral image even when rotation is considered. The core is to learn a regression function, based on an annotated database, that maps the appearance observed in a scan window to a displacement vector, which measures the difference between the configuration being scanned and that of the target object. To achieve this learning task, we propose an image-based boosting ridge regression algorithm, which exhibits good generalization capability and training efficiency. Coupled with a binary classifier as a confidence scorer, the regression approach becomes an effective tool for detecting the left ventricle in echocardiograms, achieving improved accuracy over the state-of-the-art object detection algorithm with significantly less computation.

Speckle Tracking in 3D Echocardiography with Motion Coherence

Xubo Song, Andriy Myronenko, and David Sahn

Tracking of speckles in echocardiography enables the study of myocardium deformation, and thus can provide insights about heart structure and function. Most current methods are based on 2D speckle tracking, which suffers from errors due to through-plane decorrelation. Speckle tracking in 3D overcomes this limitation. However, 3D speckle tracking is a challenging problem due to the relatively low spatial and temporal resolution of 3D echocardiography. To ensure accurate and robust tracking, high-level spatial and temporal constraints need to be incorporated. In this paper, we introduce a novel method for speckle tracking in 3D echocardiography. Instead of tracking each speckle independently, we enforce a motion coherence constraint in conjunction with a dynamic model for the speckles. The method is validated on in vivo porcine hearts and shown to be accurate and robust.

Multiple Instance Learning of Pulmonary Embolism Detection with Geodesic Distance along Vascular Structure

Jinbo Bi and Jianming Liang

We propose a novel classification approach for automatically detecting pulmonary embolism (PE) from computed tomography angiography images. Unlike most existing approaches, which require vessel segmentation to restrict the search space for PEs, our toboggan-based candidate generator is capable of searching the entire lung for any suspicious regions quickly and efficiently. We then exploit the spatial information supplied by the vascular structure in a post-candidate-generation step by designing classifiers with geodesic distances between candidates along the vascular tree. Moreover, a PE represents a cluster of voxels in an image, so multiple candidates can be associated with a single PE, and the PE is identified if any of its candidates is correctly classified. The proposed algorithm also provides an efficient solution to the problem of learning with multiple positive instances. Our experiments with 177 clinical cases demonstrate that the proposed approach outperforms existing detection methods, achieving 81% sensitivity on an independent test set at 4 false positives per study.
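
The geodesic-distance ingredient can be illustrated with a shortest-path computation on a hypothetical vessel-tree graph, where nodes are centerline points, edge weights are centerline segment lengths, and PE candidates sit at nodes.

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import dijkstra

    # Hypothetical vessel tree: 5 centerline nodes, weighted edges.
    edges = {(0, 1): 2.0, (1, 2): 1.5, (1, 3): 3.0, (3, 4): 1.0}
    n = 5
    rows, cols, w = zip(*[(i, j, d) for (i, j), d in edges.items()])
    G = csr_matrix((w, (rows, cols)), shape=(n, n))

    # Geodesic distances along the vasculature from candidate 0.
    dist = dijkstra(G, directed=False, indices=0)
    print(dist)       # [0.  2.  3.5 5.  6. ]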

Deformable Motion Tracking of Cardiac Structures (DEMOTRACS) for Improved MR Imaging

Maneesh Dewan, Christine Lorenz, and Gregory Hager

The speed and quality of imaging cardiac structures (coronary arteries, cardiac valves, etc.) in MR can be improved by tracking and predicting their motion in MR images. The problem is challenging not only because of the complex motion of these structures, which significantly changes the appearance of the region of interest, but also because of the need to track at different spatial and temporal resolutions depending on the application. We have developed a multiple-template tracking approach to track cardiac structures in MR images. The algorithm has two novel features. First, a bidirectional coordinate-descent algorithm is derived to improve the accuracy and performance of tracking. Second, we propose a method for choosing an optimal set of templates for tracking. The efficacy of the algorithm has been validated by tracking the coronary artery and cardiac valves reliably and accurately in thousands of high-resolution cine and low-resolution real-time MR images.

A Fast 3D Correspondence Method for Statistical Shape Modeling

Pahal Dalal, Brent Munsell, Song Wang, Jijun Tang, Kenton Oliver, Hiroaki Ninomiya, Xiangrong Zhou, and Hiroshi Fujita

Accurately identifying corresponding landmarks across a population of shape instances is the major challenge in constructing statistical shape models. In this paper, we address this landmark-based shape-correspondence problem for 3D cases by developing a highly efficient landmark-sliding algorithm. This algorithm is able to quickly refine all the landmarks in parallel by sliding them on the 3D shape surfaces. We use 3D thin-plate splines to model the shape-correspondence error so that the proposed algorithm is invariant to affine transformations and more accurately reflects the nonrigid biological shape deformations between different shape instances. In addition, the proposed algorithm can handle both open- and closed-surface shapes, while most current 3D shape-correspondence methods can only handle genus-0 closed surfaces. We conduct experiments on 3D hippocampus data and compare the performance of the proposed algorithm to the state-of-the-art MDL and SPHARM methods. We find that, while the proposed algorithm produces a shape correspondence of better or comparable quality to the other two, it takes substantially less CPU time. We also apply the proposed algorithm to establish correspondences for 3D diaphragm data, which have an open-surface shape.

Topology Preserving Log-Unbiased Nonlinear Image Registration: Theory and Implementation

Igor Yanovsky, Paul Thompson, Stanley Osher, and Alex Leow

In this paper, we present a novel framework for constructing large deformation log-unbiased image registration models that generate theoretically and intuitively correct deformation maps. Such registration models do not rely on regridding and are inherently topology preserving. We apply information theory to quantify the magnitude of deformations and examine the statistical distributions of Jacobian maps in the logarithmic space. To demonstrate the power of the proposed framework, we generalize the well known viscous fluid registration model to compute log-unbiased deformations. We tested the proposed method using a pair of binary corpus callosum images, a pair of two-dimensional serial MRI images, and a set of three-dimensional serial MRI brain images. We compared our results to those computed using the viscous fluid registration method, and demonstrated that the proposed method is advantageous when recovering voxel-wise maps of local tissue change.

3D/Graphics

3D Occlusion Inference from Silhouette Cues

Li Guan, Jean-Sebastien Franco, and Marc Pollefeys

We consider the problem of detecting and accounting for the presence of occluders in a 3D scene based on silhouette cues in video streams obtained from multiple, calibrated views. While well studied and robust in controlled environments, silhouette-based reconstruction of dynamic objects fails in general environments where uncontrolled occlusions are commonplace, due to inherent silhouette corruption by occluders. We show that occluders in the interaction space of dynamic objects can be detected and their 3D shape fully recovered as a byproduct of shape-from-silhouette analysis. We provide a Bayesian sensor fusion formulation to process all occlusion cues occurring in a multiple view sequence. Results show that the shape of static occluders can be robustly recovered from pure dynamic object motion, and that this information can be used for online self-correction and reinforcement of dynamic object shape reconstruction.

Dynamic 3D Scene Analysis from a Moving Vehicle

Bastian Leibe, Nico Cornelis, Kurt Cornelis, and Luc Van Gool

In this paper, we present a system that integrates fully automatic scene geometry estimation, 2D object detection, 3D localization, trajectory estimation, and tracking for dynamic scene interpretation from a moving vehicle. Our only input is two video streams from a calibrated stereo rig on top of a car. From these streams, we estimate Structure-from-Motion (SfM) and scene geometry in real time. In parallel, we perform multi-view/multi-category object recognition to detect cars and pedestrians in both camera images. Using the SfM self-localization, 2D object detections are converted to 3D observations, which are accumulated in a world coordinate frame. A subsequent tracking module analyzes the resulting 3D observations to find physically plausible spacetime trajectories. Finally, a global optimization criterion takes object-object interactions into account to arrive at accurate 3D localization and trajectory estimates for both cars and pedestrians. We demonstrate the performance of our integrated system on challenging real-world data showing car passages through crowded city areas.

Spectral Matting

Anat Levin, Alex Rav-Acha, and Dani Lischinski

We present spectral matting: a new approach to natural image matting that automatically computes a set of fundamental fuzzy matting components from the smallest eigenvectors of a suitably defined Laplacian matrix. Thus, our approach extends spectral segmentation techniques, whose goal is to extract hard segments, to the extraction of fuzzy matting components. These components may then be used as building blocks to easily construct semantically meaningful foreground mattes, either in an unsupervised fashion, or based on a small amount of user input.
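
The spectral step can be sketched with an ordinary graph Laplacian: the smallest eigenvectors of a 4-connected grid Laplacian stand in for those of the matting Laplacian (which is affinity-based and image-dependent), and their reshaped columns are the raw material for fuzzy components.

    import numpy as np

    h = w = 16
    n = h * w
    A = np.zeros((n, n))              # 4-connected grid adjacency
    for y in range(h):
        for x in range(w):
            i = y * w + x
            for dy, dx in ((0, 1), (1, 0)):
                yy, xx = y + dy, x + dx
                if yy < h and xx < w:
                    j = yy * w + xx
                    A[i, j] = A[j, i] = 1.0
    L = np.diag(A.sum(1)) - A         # combinatorial Laplacian

    vals, vecs = np.linalg.eigh(L)
    print(vals[:6])                   # smallest eigenvalues; first is ~0
    components = vecs[:, :6].T.reshape(6, h, w)   # candidate fuzzy layers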

Poster Session 2

Recognition and Detection 2

Improving Part based Object Detection by Unsupervised, Online Boosting

Bo Wu and Ram Nevatia

Detection of objects of a given class is important for many applications. However, it is difficult to learn a general detector with a high detection rate as well as a low false alarm rate. In particular, the labor needed to manually label a huge training sample set is usually unaffordable. We propose an unsupervised, incremental learning approach based on online boosting to improve, for specific applications, the performance of a set of general part detectors that are learned from a small amount of labeled data and have moderate accuracy. Our oracle for unsupervised learning, which has high precision, is based on a combination of a set of shape-based part detectors learned by off-line boosting. Our online boosting algorithm, which is designed for cascade-structured detectors, is able to adapt the simple features, the base classifiers, the cascade decision strategy, and the complexity of the cascade automatically to the specific application. We integrate two noise-restraining strategies in both the oracle and the online learner. The system is evaluated on two public video corpora.

Flexible Object Models for Category-Level 3D Object Recognition

Akash Kushal, Cordelia Schmid, and Jean Ponce

Today's category-level object recognition systems largely focus on fronto-parallel views of objects with characteristic texture patterns. To overcome these limitations, we propose a novel framework for visual object recognition where object classes are represented by assemblies of partial surface models (PSMs) obeying loose local geometric constraints. The PSMs themselves are formed of dense, locally rigid assemblies of image features. Since our model only enforces local geometric consistency, both at the level of model parts and at the level of individual features within the parts, it is robust to viewpoint changes and intra-class variability. The proposed approach has been implemented, and it outperforms the state-of-the-art algorithms for object detection and localization on the Pascal 2005 VOC Challenge Cars Test 1 data.

City-Scale Location Recognition

Grant Schindler, Matthew Brown, and Richard Szeliski

We look at the problem of location recognition in a large image dataset using a vocabulary tree. This entails finding the location of a query image in a large dataset containing 30,000 streetside images of a city. We investigate how the traditional invariant feature matching approach breaks down as the size of the database grows. In particular, we show that by carefully selecting the vocabulary using the most informative features, retrieval performance is significantly improved, allowing us to increase the number of database images by a factor of 10. We also introduce a generalization of the traditional vocabulary tree search algorithm which improves performance by effectively increasing the branching factor of a fixed vocabulary tree.
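
As a point of reference, the baseline being improved is inverted-file retrieval with tf-idf-weighted visual words; this toy sketch scores a handful of hypothetical images against a query bag of word ids.

    import numpy as np
    from collections import Counter

    docs = {"img_a": [3, 7, 7, 9], "img_b": [3, 3, 5], "img_c": [9, 9, 1]}
    N = len(docs)
    df = Counter(w for words in docs.values() for w in set(words))
    idf = {w: np.log(N / df[w]) for w in df}

    def score(query_words):
        q = Counter(query_words)
        ranked = {}
        for name, words in docs.items():
            d = Counter(words)
            ranked[name] = sum(q[w] * d[w] * idf.get(w, 0.0) ** 2 for w in q)
        return sorted(ranked.items(), key=lambda kv: -kv[1])

    print(score([7, 9]))      # img_a ranks first: word 7 is rare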

Learning Kernel Expansions for Image Classification

Fernando De la Torre and Oriol Vinyals

Kernel machines (e.g. SVM, KLDA) have shown state-of-the-art performance in several visual classification tasks. The classification performance of kernel machines greatly depends on the choice of kernels and their parameters. In this paper, we propose a method to search over a space of parameterized kernels using a gradient-descent based method. Our method effectively learns a non-linear representation of the data useful for classification and simultaneously performs dimensionality reduction. In addition, we suggest a new matrix formulation that simplifies and unifies previous approaches. The effectiveness and robustness of the proposed algorithm are demonstrated on both synthetic and real examples of pedestrian and mouth detection in images.

Concurrent Multiple Instance Learning for Image Categorization

Guo-Jun Qi, Xian-Sheng Hua, Yong Rui, Tao Mei, Jinhui Tang, and Hong-Jiang Zhang

We propose a novel Multiple Instance Learning (MIL) approach based on concurrent tensor inference to learn image categories. Unlike existing MIL algorithms, in which the individual instances in a bag are assumed to be independent of each other, we relax this assumption and explicitly formulate the inter-dependency as a concurrent tensor representation. Rank-1 tensor factorization is then applied to obtain the label of each instance. Moreover, we extend instance label prediction to the whole feature space by extending the above tensor inference to a kernelized framework. A regularizer is introduced in this framework to avoid overfitting of the concurrent likelihood model by controlling the complexity of the high-order tensor model. We report highly competitive categorization performance on both the Corel and Caltech datasets.

Parameter Sensitive Detectors

Quan Yuan, Ashwin Thangali, Vitaly Ablavsky, and Stan Sclaroff

Object detection can be challenging when the object class exhibits large variations. One commonly-used strategy is to first partition the space of possible object variations and then train separate classifiers for each portion. However, with continuous spaces the partitions tend to be arbitrary, since there are no natural boundaries (consider, for example, the continuous range of human body poses). In this paper, a new formulation is proposed, where the detectors themselves are associated with continuous parameters and reside in a parameterized function space. There are two advantages to this strategy. First, a priori partitioning of the parameter space is not needed; the detectors themselves are in a parameterized space. Second, the underlying parameters for object variations can be learned from training data in an unsupervised manner. For profile face detection, our detection rate outperforms the Viola-Jones method by 5% at 90 false alarms. On a hand shape data set, our method improves the detection rate from 98% to 99.5% at a false positive rate of 0.1%, compared with partition-based methods. On a pedestrian data set, our method reduces the missed-detection rate by a factor of three at a false positive rate of 1%, compared with the Dalal-Triggs method.

Learning the Compositional Nature of Visual Objects

Björn Ommer and Joachim Buhmann

The compositional nature of visual objects significantly limits their representation complexity and renders learning of structured object models tractable. Adopting this modeling strategy, we (i) automatically decompose objects into a hierarchy of relevant compositions and (ii) learn such a compositional representation for each category without supervision. The compositional structure supports feature sharing already at the lowest level of small image patches. Compositions are represented as probability distributions over their constituent parts and the relations between them. The global shape of objects is captured by a graphical model which combines all compositions. Inference based on the underlying statistical model is then employed to obtain a category-level object recognition system. Experiments on large standard benchmark datasets underline the competitive recognition performance of this approach and provide insights into the learned compositional structure of objects.

Composite Models of Objects and Scenes for Category Recognition

David Crandall and Daniel Huttenlocher

This paper presents a method of learning and recognizing generic object categories using part-based spatial models. The models are multiscale, with a scene component that specifies relationships between the object and surrounding scene context, and an object component that specifies relationships between parts of the object. The underlying graphical model forms a tree-structure, with a star topology for both the contextual and object components. A weakly supervised paradigm is used for learning the models, where each training image is labeled with bounding boxes indicating the overall location of object instances, but parts or regions of the objects and scene are not specified. The parts, regions and spatial relationships are learned automatically. We demonstrate the method on the detection task for the PASCAL 2006 Visual Object Classes Challenge dataset, where objects must be correctly localized. Our results demonstrate better overall performance than that of previously reported techniques, in terms of the average precision measure used in the PASCAL detection evaluation. Our results also show that incorporating scene context into the models improves performance in comparison with not using such contextual information.

Matrix-Structural Learning (MSL) of Cascaded Classifier from Enormous Training Set

Shengye Yan, Shiguang Shan, Xilin Chen, Wen Gao, and Jie Chen

Aiming at the problem where both the positive and negative training sets are enormous, this paper proposes a novel Matrix-Structural Learning (MSL) method for learning a cascaded classifier, based on Viola and Jones’ AdaBoost method for face detection. Briefly, unlike Viola and Jones’ method, which learns linearly by bootstrapping only negative samples, the proposed MSL method bootstraps both positive and negative samples in a matrix-like structure. In MSL, two forms of “inheritance” are further presented to improve training efficiency. One is the inheritance of the positive samples of the last sub-classifier, i.e., these samples are used directly by the current sub-classifier, which can greatly decrease the number of bootstrapping iterations. The other is the inheritance of features learned previously during the training procedure, which can also greatly reduce the time cost of selecting “new” features. The proposed method is evaluated on the face detection problem. On a positive set containing 230,000 face samples, only 12 hours are needed on a common PC with a 3.20GHz Pentium IV processor to learn a classifier with a false alarm rate of less than 1/1,000,000. The accuracy of the learned detector exceeds state-of-the-art results on the CMU+MIT frontal face test set.

Unsupervised learning of invariant feature hierarchies, with application to object recognition

Marc'Aurelio Ranzato, Fu-Jie Huang, Y-Lan Boureau, and Yann LeCun

We present an unsupervised method for learning a hierarchy of sparse feature detectors that are invariant to small shifts and distortions. The resulting feature extractor consists of multiple convolution filters, followed by a point-wise sigmoid non-linearity, and a feature-pooling layer that computes the max of each filter output within adjacent windows. A second level of larger and more invariant features is obtained by training the same algorithm on patches of features from the first level. Training a supervised classifier on these features yields 0.64% error on MNIST, and 54% average recognition rate on Caltech 101 with 30 training samples per category. While the resulting architecture is similar to convolutional networks, the layer-wise unsupervised training procedure alleviates the over-parameterization problems that plague purely supervised learning procedures, and yields good performance with very few labeled training samples.

3D Reconstruction and Processing 1

A Closed-form Solution to 3D Reconstruction of Piecewise Planar Objects from Single Images

Zhenguo Li, Jianzhuang Liu, and Xiaoou Tang

This paper proposes a new approach to 3D reconstruction of piecewise planar objects based on two image regularities, connectivity and perspective symmetry. First, we formulate the whole shape of the objects in an image as a shape vector consisting of the normals of all the faces of the objects. Then, we impose several linear constraints on the shape vector using connectivity and perspective symmetry of the objects. Finally, we obtain a closed-form solution to the 3D reconstruction problem. We also develop an efficient algorithm to detect a face of perspective symmetry. Experimental results on real images are shown to demonstrate the effectiveness of our approach.

Light Fall-off Stereo

Miao Liao, Liang Wang, Ruigang Yang, and Minglun Gong

We present light fall-off stereo (LFS), a new method for computing depth from scenes beyond Lambertian reflectance and texture. LFS takes a number of images from a stationary camera as the illumination source moves away from the scene. Based on the inverse square law for light intensity, the ratio images are directly related to scene depth from the perspective of the light source. Using this as the invariant, we develop both local and global methods for depth recovery. Compared to previous reconstruction methods for non-Lambertian scenes, LFS needs as few as two images and does not require calibrated cameras or light sources, or reference objects in the scene. We demonstrate the effectiveness of LFS on a variety of real-world scenes.
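
Assuming an ideal point source and the inverse square law, the local method admits a closed form: if the light moves back by delta between two shots, then I1/I2 = ((d + delta)/d)^2, so d = delta / (sqrt(I1/I2) - 1). A toy sketch of that computation:

    import numpy as np

    def lfs_depth(I_near, I_far, delta):
        """Per-pixel distance to the light from a ratio image; a sketch
        of the local method (the paper also develops a global one)."""
        r = np.sqrt(I_near / np.maximum(I_far, 1e-12))
        return delta / np.maximum(r - 1.0, 1e-12)

    # Synthesize the two images from a known distance map and recover it.
    d_true = np.full((4, 4), 2.0)
    delta = 0.5
    I1 = 1.0 / d_true ** 2
    I2 = 1.0 / (d_true + delta) ** 2
    print(lfs_depth(I1, I2, delta))   # ~2.0 everywhere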

Simultaneous Optimization of Structure and Motion in Dynamic Scenes Using Unsynchronized Stereo Cameras

Akihito Seki and Masatoshi Okutomi

In this paper, we propose a method for simultaneous estimation of structure and motion in dynamic scenes. Usual methods for obtaining structure and motion using stereo cameras require two kinds of operations: stereo correspondence and tracking. Therefore, correspondence must be determined separately between stereo images and between sequential images. This complicates the algorithm and increases the possibility of mismatches because of object motion and visibility changes in the images. Our proposed method makes two contributions. The first is a method for establishing correspondence across all stereo and sequential images at once, so that structure and motion can be obtained simultaneously and more accurately. On the other hand, most stereo correspondence algorithms are limited to synchronized cameras. In a stereo rig using unsynchronized cameras, as most commercially available cameras are, the structure cannot be obtained by stereo correspondence and triangulation because of the unknown time offset between the cameras. Therefore, our second contribution is a method for estimating structure, motion, and time offset simultaneously using unsynchronized stereo cameras; this is accomplished by taking advantage of the first contribution. Additionally, our method requires no preprocessing, such as motion segmentation for separating identically moving objects, and no advance calibration of the time offset. Finally, we present experimental results using both synthetic and real images.

Efficiently Determining Silhouette Consistency

Li Yi and David Jacobs

Volume intersection is a frequently used technique for solving the Shape-From-Silhouette problem, which constructs a 3D object estimate from a set of silhouettes taken with calibrated cameras. It is natural to develop an efficient algorithm to determine the consistency of a set of silhouettes before performing time-consuming reconstruction, so that inaccurate silhouettes can be omitted. In this paper we first present a fast algorithm to determine the consistency of three silhouettes from known (but arbitrary) viewing directions, assuming scaled orthographic projection. The time complexity of the algorithm is linear in the number of points on the silhouette boundaries. We further prove that a set of more than three convex silhouettes is consistent if and only if any three of them are consistent. Another possible application of our approach is to identify miscalibrated cameras in a large camera system: a consistent subset of cameras can be determined on the fly, and miscalibrated cameras can be recalibrated at a coarse scale. Real and synthesized data are used to demonstrate our results.

Illumination Multiplexing within Fundamental Limits

Netanel Ratner and Yoav Y. Schechner

Taking a sequence of photographs using multiple illumination sources or settings is central to many computer vision and graphics problems. Recently, a growing number of methods use multiple sources, rather than a single point source, in each frame of the sequence. Potential benefits include increased signal-to-noise ratio and accommodation of scene dynamic range. However, existing multiplexing schemes, including Hadamard-based codes, are inhibited by fundamental limits set by Poisson-distributed photon noise and by sensor saturation; the prior schemes may actually be counterproductive due to these effects. We derive multiplexing codes that are optimal under these fundamental effects. The novel codes thus generalize the prior schemes and have much broader applicability. Our approach is based on formulating the problem as a constrained optimization, and we further suggest an algorithm to solve this optimization problem. The superiority and effectiveness of the method are demonstrated in experiments involving object illumination.
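
For context, the classical scheme the paper generalizes can be sketched as follows: illuminate with rows of a Hadamard-derived S-matrix (half the sources on per frame) and demultiplex by solving the linear system; under Poisson photon noise and saturation this code stops being optimal, which is the paper's point.

    import numpy as np
    from scipy.linalg import hadamard

    n = 8
    H = hadamard(n)                   # Sylvester Hadamard matrix
    S = (1 - H[1:, 1:]) // 2          # classical 0/1 S-matrix, order n-1

    x = np.linspace(1.0, 2.0, n - 1)  # true single-source intensities
    rng = np.random.default_rng(3)
    y = S @ x + 0.01 * rng.normal(size=n - 1)   # multiplexed measurements
    x_hat = np.linalg.solve(S, y)     # demultiplexed estimates
    print(np.round(x_hat, 2))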

Dense mirroring surface recovery from 1D homographies and sparse correspondences

Stas Rozenfeld, Ilan Shimshoni, and Michael Lindenbaum

In this work we recover the 3D shape of mirroring objects such as mirrors, sunglasses, and stainless steel objects. A computer monitor displays several images of parallel stripes, each image at a different angle. Reflections of these stripes in a mirroring surface are captured by the camera. For every image point, the directions of the displayed stripes and of their reflections in the image are related by a 1D homography, which can be computed robustly, using the statistically accurate heteroscedastic model, without the monitor-image correspondence that is generally required by other techniques. Focusing on a small set of image points for which monitor-image correspondence is computed, the depth and the local shape may be calculated relying on this homography. This is done by an optimization process related to the one proposed by Savarese, Chen and Perona [10], but different and more stable. Dense surface recovery is then performed using constrained interpolation, which does not simply interpolate the surface depth values, but rather solves for the depth, the correspondence, and the local surface shape simultaneously at each interpolated point; consistency with the 1D homography is thus required. The proposed method, as well as the method described in [10], is inherently unstable on a small part of the surface. We propose a method to detect these instabilities and correct them. The method was implemented, and the shapes of a mirror, sunglasses, and a stainless steel ashtray were recovered at sub-millimeter accuracy.

Toward Flexible 3D Modeling using a Catadioptric Camera

Maxime Lhuillier

Fully automatic 3D modeling from a catadioptric image sequence has rarely been addressed until now, although this is a long-standing problem for perspective images. All previous catadioptric approaches have been limited to dense reconstruction for a few view points, and the majority of them require calibration of the camera. This paper presents a method which deals with hundreds of images, and does not require precise calibration knowledge. In this context, the same 3D point of the scene may be visible and reconstructed in a large number of images at very different accuracies. So the main part of this paper concerns the selection of reconstructed points, a problem largely ignored in previous works. Summaries of the structure from motion and dense stereo steps are also given. Experiments include the 3D model reconstruction of indoor and outdoor scenes, and a walkthrough in a city.

Optimal Step Nonrigid ICP Algorithms For Surface Registration

Brian Amberg, Sami Romdhani, and Thomas Vetter

We show how to extend the ICP framework to nonrigid registration, while retaining the convergence properties of the original algorithm. The resulting optimal step nonrigid ICP framework allows the use of different regularisations, as long as they have an adjustable stiffness parameter. The registration loops over a series of decreasing stiffness weights, and incrementally deforms the template towards the target, recovering the whole range of global and local deformations. To find the optimal deformation for a given stiffness, optimal iterative closest point steps are used. Preliminary correspondences are estimated by a nearest-point search. Then the optimal deformation of the template for these fixed correspondences and the active stiffness is calculated. Afterwards the process continues with new correspondences found by searching from the displaced template vertices. We present an algorithm using a locally affine regularisation which assigns an affine transformation to each vertex and minimises the difference in the transformation of neighbouring vertices. It is shown that for this regularisation the optimal deformation for fixed correspondences and fixed stiffness can be determined exactly and efficiently. The method succeeds for a wide range of initial conditions, and handles missing data robustly. It is compared qualitatively and quantitatively to other algorithms using synthetic examples and real world data.
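
For orientation, the rigid inner loop that the optimal-step nonrigid variant builds on alternates nearest-point correspondences with a closed-form (Kabsch) transform update; the nonrigid version replaces the rigid update with a stiffness-regularised per-vertex affine solve. A plain rigid sketch:

    import numpy as np
    from scipy.spatial import cKDTree

    def rigid_icp(src, dst, iters=20):
        """Classical rigid ICP: closest points, then Procrustes update."""
        tree = cKDTree(dst)
        R, t = np.eye(3), np.zeros(3)
        for _ in range(iters):
            cur = src @ R.T + t
            _, idx = tree.query(cur)           # preliminary correspondences
            corr = dst[idx]
            mu_s, mu_c = cur.mean(0), corr.mean(0)
            U, _, Vt = np.linalg.svd((cur - mu_s).T @ (corr - mu_c))
            d = np.sign(np.linalg.det(Vt.T @ U.T))
            Rd = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T   # optimal rotation
            R, t = Rd @ R, Rd @ (t - mu_s) + mu_c      # compose updates
        return R, t

    rng = np.random.default_rng(4)
    dst = rng.normal(size=(200, 3))
    a = 0.3
    Rz = np.array([[np.cos(a), -np.sin(a), 0],
                   [np.sin(a),  np.cos(a), 0],
                   [0, 0, 1]])
    src = (dst - [0.1, 0.2, 0.0]) @ Rz         # rotated, shifted copy
    R, t = rigid_icp(src, dst)
    print(np.abs(src @ R.T + t - dst).max())   # typically near zero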

Simultaneous Covariance Driven Correspondence (CDC) and Transformation Estimation

Michal Sofka, Gehua Yang, and Charles Stewart

This paper proposes a new registration algorithm, Covariance Driven Correspondences (CDC), that depends fundamentally on the estimation of uncertainty in point correspondences. This uncertainty is derived from the covariance matrices of the individual point locations and from the covariance matrix of the estimated transformation parameters. Based on this uncertainty, CDC uses a robust objective function and an EM-like algorithm to simultaneously estimate the transformation parameters, their covariance matrix, and the likely correspondences. Unlike the Robust Point Matching (RPM) algorithm, CDC requires neither an annealing schedule nor an explicit outlier process. Experiments on synthetic and real images using polynomial transformation models in 2D and 3D show that CDC has a broader domain of convergence than the well-known Iterative Closest Point (ICP) algorithm and is more robust to missing or extraneous structures in the data than RPM.

Precise Registration of 3D Models To Images by Swarming Particles

Joerg Liebelt and Klaus Schertler

The precise alignment of a 3D model to 2D sensor images to recover the pose of an object in a scene is an important topic in computer vision. In this work, we outline a registration scheme to align arbitrary standard 3D models to optical and Synthetic Aperture Radar (SAR) images in order to recover the full 6 degrees of freedom of the object. We propose a novel similarity measure which combines perspective contour matching and an appearance-based Mutual Information (MI) measure. Unlike previous work, the resulting similarity measure is optimized using an evolutionary Particle Swarming strategy, parallelized to exploit the hardware acceleration potential of current generation graphics processors (GPUs). The performance of our registration scheme is systematically evaluated on an object tracking task using synthetic as well as real input images. We show that our approach leads to precise registration results, even for significant image noise, small object dimensions and partial occlusion where other methods would fail.

Retrieval and Search 1

Searching video for complex activities with finite state models

Nazlı İkizler and David Forsyth

We describe a method of representing human activities that allows a collection of motions to be queried without examples, using a simple and effective query language. Our approach is based on units of activity at segments of the body that can be composed across space and across the body to produce complex queries. The presence of search units is inferred automatically by tracking the body, lifting the tracks to 3D and comparing them to models trained using motion capture data. We show results for a large range of queries applied to a collection of complex motions and activities. Our models of short-timescale limb behaviour are built using a labelled motion capture set. We compare with discriminative methods applied to tracker data; our method offers significantly improved performance. We show experimental evidence that our method is robust to view direction and is unaffected by changes of clothing.

Probabilistic Reverse Annotation for Large Scale Image Retrieval

Pramod Sankar K. and C. V. Jawahar

Automatic annotation is an elegant alternative to explicit recognition in images. In annotation, the image is matched with keyword models, and the most relevant keywords are assigned to the image. Using existing techniques, the annotation time for large collections is very high, while annotation performance degrades as the number of keywords increases. Towards the goal of large-scale annotation, we present an approach called “Reverse Annotation”. Unlike traditional annotation, where keywords are identified for a given image, in Reverse Annotation the relevant images are identified for each keyword. With this seemingly simple shift in perspective, the annotation time is reduced significantly. To be able to rank relevant images, the approach is extended to Probabilistic Reverse Annotation. Our framework is applicable to a wide variety of multimedia documents and scalable to large collections. Here, we demonstrate the framework on a large collection of 75,000 document images containing 21 million word segments, annotated with 35,000 keywords. Our image retrieval system matches text-based search engines in response time.

From Videos to Verbs: Mining Videos for Activities using a cascade of dynamical systems

Pavan Turaga, Ashok Veeraraghavan, and Rama Chellappa

Clustering video sequences in order to infer and extract activities from a single video stream is an extremely important problem with significant potential in video indexing, surveillance, activity discovery and event recognition. Clustering a video sequence into activities requires one to simultaneously recognize activity boundaries (activity-consistent subsequences) and cluster these activity subsequences. To do this, we build a generative model for activities in video using a cascade of dynamical systems and show that this model is able to capture and represent a diverse class of activities. We then derive algorithms to learn the model parameters from a video stream and show how a single video sequence may be clustered into different clusters, where each cluster represents an activity. We also propose a novel technique to build affine, view, and rate invariance of the activity into the distance metric for clustering. Experiments show that the clusters found by the algorithm correspond to semantically meaningful activities.

Weighted Substructure Mining for Image Analysis

Sebastian Nowozin, Koji Tsuda, Takeaki Uno, Taku Kudo, and Gökhan Bakir

In web-related applications of image categorization, it is desirable to derive an interpretable classification rule with high accuracy. Using the bag-of-words representation and the linear support vector machine, one can partly fulfill the goal, but the accuracy of linear classifiers is not high and the obtained features are not informative for users. We propose to combine item set mining and large margin classifiers to select features from the power set of all visual words. Our resulting classification rule is easier to browse and simpler to understand, because each feature has richer information. As a next step, each image is represented as a graph where nodes correspond to local image features and edges encode geometric relations between features. Combining graph mining and boosting, we can obtain a classification rule based on subgraph features that contain more information than the set features. We evaluate our algorithm in a web-retrieval ranking task where the goal is to reject outliers from a set of images returned for a keyword query. Furthermore, it is evaluated on the supervised classification tasks with the challenging VOC2005 data set. Our approach yields excellent accuracy in the unsupervised ranking task compared to a recently proposed probabilistic model and competitive results in the supervised classification task.

Object retrieval with large vocabularies and fast spatial matching

James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman

We present a large-scale object retrieval system. The user supplies a query object by selecting a region of a query image, and the system returns a ranked list of images that contain the same object, retrieved from a large corpus. We demonstrate the system on a corpus of over 50,000 high-resolution images, crawled from the photo-sharing site, Flickr. We view this work as a step towards much larger (web-scale) image corpora. Consequently we compare different scalable methods for building an image-feature vocabulary. We also reproduce earlier results showing that enforcing a spatial model on the data after retrieval improves the average precision of the system. Previous spatial matching schemes were only feasible over a small corpus, and we remove this limitation using a RANSAC procedure that can rapidly handle many-to-many correspondences between two images. We evaluate several variants of the RANSAC procedure and demonstrate that it is practical to use as a re-ranking step for the most probable result images.
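
The spirit of the spatial verification step can be sketched with a two-point RANSAC for a 2D similarity transform, whose inlier count re-scores a retrieved image; the paper's procedure is stronger, handling many-to-many correspondences and other transformation classes.

    import numpy as np

    def ransac_similarity(p, q, iters=200, tol=3.0):
        """Inlier count of a 2D similarity mapping putative matches p->q.
        Complex numbers encode scale+rotation (a) and translation (b)."""
        pc, qc = p[:, 0] + 1j * p[:, 1], q[:, 0] + 1j * q[:, 1]
        rng = np.random.default_rng(0)
        best = 0
        for _ in range(iters):
            i, j = rng.choice(len(p), size=2, replace=False)
            denom = pc[i] - pc[j]
            if abs(denom) < 1e-9:
                continue
            a = (qc[i] - qc[j]) / denom
            b = qc[i] - a * pc[i]
            best = max(best, int((np.abs(a * pc + b - qc) < tol).sum()))
        return best                     # spatial-consistency score

    rng = np.random.default_rng(5)
    p = rng.uniform(0, 100, size=(50, 2))
    q = p * 1.2 + [10.0, -5.0]                    # geometrically consistent
    q[::5] = rng.uniform(0, 100, size=(10, 2))    # 20% corrupted matches
    print(ransac_similarity(p, q))                # about 40 inliers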

Learning Visual Representations using Images with Captions

Ariadna Quattoni, Michael Collins, and Trevor Darrell

Current methods for learning visual categories work well when a large amount of labeled data is available, but can run into severe difficulties when the number of labeled examples is small. When labeled data is scarce it may be beneficial to use unlabeled data to learn an image representation that is low-dimensional, but nevertheless captures the information required to discriminate between image categories. This paper describes a method for learning representations from large quantities of unlabeled images which have associated captions; the goal is to improve learning in future image classification problems. Experiments show that our method significantly outperforms (1) a fully-supervised baseline model, (2) a model that ignores the captions and learns a visual representation by performing PCA on the unlabeled images alone and (3) a model that uses the output of word classifiers trained using captions and unlabeled data. Our current work concentrates on captions as the source of meta-data, but more generally other types of meta-data could be used.

A Binning Scheme for Fast Hard Drive Based Image Search

Friedrich Fraundorfer, Henrik Stewénius, and David Nistér

In this paper we investigate how to scale a content-based image retrieval approach beyond the RAM limits of a single computer and to make use of its hard drive to store the feature database. The feature vectors describing the images in the database are binned in multiple independent ways. Each bin contains images similar to a representative prototype. Each binning is processed in two stages. First, the prototype closest to the query is found. Second, the bin corresponding to the closest prototype is fetched from disk and searched completely. The query process repeatedly performs these two stages, each time with a binning independent of the previous ones. The scheme cuts down hard drive access significantly and results in a major speed-up. An experimental comparison between the binning scheme and a raw search shows competitive retrieval quality.

Optical Flow and Tracking 2

Multi-class object tracking algorithm that handles fragmentation and grouping

Biswajit Bose, Xiaogang Wang, and Eric Grimson

We propose a framework for detecting and tracking multiple interacting objects, while explicitly handling the dual problems of fragmentation (an object may be broken into several blobs) and grouping (multiple objects may appear as a single blob). We use foreground blobs obtained by background subtraction from a stationary camera as measurements. The main challenge is to associate blob measurements with objects, given the fragment-object-group ambiguity, when the number of objects is variable and unknown and object-class-specific models are not available. We first track foreground blobs until they merge or split. We then build an inference graph representing merge-split relations between the tracked blobs. Using this graph and a generic object model based on spatial connectedness and coherent motion, we label the tracked blobs as whole objects, fragments of objects or groups of interacting objects. The outputs of our algorithm are entire tracks of objects, which may include corresponding tracks from groups during interactions. Experimental results on multiple video sequences are shown.

Nearest First Traversing Graph for Simultaneous Object Tracking and Recognition

Toshikazu Wada and Junya Sakagaito

This paper presents a new method for simultaneous object tracking and recognition using an object image database. This application requires two searches: a search over object appearances stored in the database, and a search over pose parameters (position, scale, orientation, and so on) of the tracked object in each image frame. To simplify this problem, we propose a new method, pose parameter embedding (PPE), that transforms the original search problem into an appearance search problem. The nearest neighbor (NN) appearance search in this problem has a special property: a sequence of gradually changing queries is given. For this problem, graph-based NN search is suitable, because the preceding search result can be used as the starting point of the next search. A Delaunay graph can be used for this search; however, both the graph construction cost and the degree (number of edges connected to a vertex) increase in high-dimensional spaces. Instead, we propose the nearest first traversing graph (NFTG) to avoid these problems. Based on these two techniques, NFTG and PPE, we realized video-rate processing on a standard PC.
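
Graph-based NN search with a warm start can be sketched generically: walk greedily from the previous frame's result to whichever neighbor is closest to the new query and stop at a local minimum. NFTG's contribution is a graph construction that keeps vertex degrees manageable in high dimensions; the graph below is a trivial stand-in.

    import numpy as np

    def greedy_graph_search(vectors, neighbors, query, start=0):
        """Greedy best-first walk on a proximity graph."""
        cur = start
        d_cur = np.linalg.norm(vectors[cur] - query)
        while True:
            cand = min(neighbors[cur],
                       key=lambda j: np.linalg.norm(vectors[j] - query))
            d_cand = np.linalg.norm(vectors[cand] - query)
            if d_cand >= d_cur:
                return cur               # no neighbor improves: stop
            cur, d_cur = cand, d_cand

    # Toy graph: points on a line, each linked to its index neighbors.
    pts = np.arange(10, dtype=float).reshape(-1, 1)
    nbrs = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
    # Seeding `start` with the previous frame's result keeps walks short.
    print(greedy_graph_search(pts, nbrs, query=np.array([6.2])))   # -> 6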

Tracking as Repeated Figure/Ground Segmentation

Xiaofeng Ren and Jitendra Malik

Tracking over a long period of time is challenging as the appearance, shape and scale of the object in question may vary. We propose a paradigm of tracking by repeatedly segmenting figure from background. Accurate spatial support obtained in segmentation provides rich information about the track and enables reliable tracking of non-rigid objects without drifting. Figure/ground segmentation operates sequentially in each frame by utilizing both static image cues and temporal coherence cues, which include an appearance model of brightness (or color) and a spatial model propagating figure/ground masks through low-level region correspondence. A superpixel-based conditional random field linearly combines cues and loopy belief propagation is used to estimate marginal posteriors of figure vs background. We demonstrate our approach on long sequences of sports video, including figure skating and football.

99

99

Page 100: CVPR 2007 Book of Abstracts - ViGIR-labvigir.missouri.edu/~gdesouza/Research/Conference... · computational geometry. The short course objective is to give theoretical and numerical

Spatial selection for attentional visual tracking

Ming Yang, Junsong Yuan, and Ying Wu

Long-duration tracking of general targets is quite challenging for computer vision, because in practice the target may undergo large uncertainties in its visual appearance and the unconstrained environments may be cluttered and distractive, although it has never been a challenge to the human visual system. Psychological and cognitive findings indicate that human perception is attentional and selective, and that both early attentional selection, which may be innate, and late attentional selection, which may be learned, are necessary for human visual tracking. This paper proposes a new visual tracking approach that reflects some aspects of spatial selective attention, and presents a novel attentional visual tracking (AVT) algorithm. In AVT, the early selection process extracts a pool of attentional regions (ARs), defined as salient image regions with good localization properties, and the late selection process dynamically identifies a subset of discriminative attentional regions (D-ARs) through discriminative learning on the historical data on the fly. The computationally demanding process of matching the AR pool is done efficiently using the idea of the locality-sensitive hashing (LSH) technique. The proposed AVT algorithm is general, robust and computationally efficient, as shown in extensive experiments on a large variety of real-world videos.

Linear and Quadratic Subsets for Template-Based Tracking

Selim Benhimane, Alexander Ladikos, Vincent Lepetit, and Nassir Navab

We propose a method that dramatically improves the performance of template-based matching in terms of size of convergence region and computation time. This is done by selecting a subset of the template that verifies the assumption (made during optimization) of linearity or quadraticity with respect to the motion parameters. We call these subsets linear or quadratic subsets. While subset selection approaches have already been proposed, they generally do not attempt to provide linear or quadratic subsets and rely on heuristics such as texturedness. Because a naive search for the optimal subset would result in a combinatorial explosion for large templates, we propose a simple algorithm that does not aim for the optimal subset but provides a very good linear or quadratic subset at low cost, even for large templates. Simulation results and experiments with real sequences show the superiority of the proposed method compared to existing subset selection approaches.

A Linear Programming Approach for Multiple Object Tracking

Hao Jiang, Sidney Fels, and James Little

We propose a linear programming relaxation scheme for the class of multiple object tracking problems where the inter-object interaction metric is convex and the intra-object term quantifying object state continuity may use any metric. The proposed scheme models object tracking as a multi-path searching problem. It explicitly models track interaction, such as object spatial layout consistency or mutual occlusion, and optimizes multiple object tracks simultaneously. The proposed scheme does not rely on track initialization and complex heuristics. It has much lower average complexity than previous efficient exhaustive search methods, such as extended dynamic programming, and is found to be able to find the global optimum with high probability. We have successfully applied the proposed method to multiple object tracking in video streams.
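
To make the path-searching view concrete, here is a hedged single-object simplification (our own toy formulation using scipy; the paper couples many such paths with convex interaction terms, which is not shown):

    import numpy as np
    from scipy.optimize import linprog

    def track_single(costs):
        # costs[t][i, j]: cost of linking candidate i in frame t to candidate
        # j in frame t+1. One unit of flow from frame 0 to the last frame
        # traces the best track; this flow LP has an integral optimal vertex,
        # so no rounding heuristics are needed for the single-object case.
        sizes = [c.shape[0] for c in costs] + [costs[-1].shape[1]]
        edges, c = [], []
        for t, M in enumerate(costs):
            for i in range(M.shape[0]):
                for j in range(M.shape[1]):
                    edges.append((t, i, j))
                    c.append(M[i, j])
        A_eq, b_eq = [], []
        # exactly one edge leaves frame 0
        A_eq.append([1.0 if e[0] == 0 else 0.0 for e in edges])
        b_eq.append(1.0)
        # flow conservation at every candidate of the intermediate frames
        for t in range(1, len(costs)):
            for k in range(sizes[t]):
                row = [0.0] * len(edges)
                for idx, (tt, i, j) in enumerate(edges):
                    if tt == t - 1 and j == k:
                        row[idx] += 1.0
                    if tt == t and i == k:
                        row[idx] -= 1.0
                A_eq.append(row)
                b_eq.append(0.0)
        res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                      bounds=[(0, 1)] * len(edges), method="highs")
        return [edges[i] for i in np.flatnonzero(res.x > 0.5)]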

Crisp Weighted Support Vector Regression for robust model estimation: application to object tracking in image sequences

Franck Dufrenois, Johan Colliez, and Denis Hamad

Based on Support Vector Machine theory, Support Vector Regression (SVR) is now a well-established method for estimating real-valued functions. However, standard SVR is not effective in dealing with the outliers and structured outliers in training data sets commonly encountered in computer vision applications. In this paper, we present a weighted version of SVR. The proposed approach introduces an adaptive binary function that allows a dominant model to be extracted from a degraded training data set. This binary function progressively separates inliers from outliers following a one-against-all decomposition. Experimental tests show the high robustness of the proposed approach against outliers and residual structured outliers. Finally, we validate our algorithm on object tracking and optic flow estimation.

Trajectory Association across Non-overlapping Moving Cameras in Planar Scenes

Yaser Sheikh, Xin Li, and Mubarak Shah

The ability to associate objects across multiple views allows co-operative use of an ensemble of cameras for scene understanding. In this paper, we present a principled solution to object association in which both the scene and the object motion are modeled. By making explicit the motion model of each object with respect to time, we are able to solve the trajectory association problem in a unified manner for overlapping or non-overlapping cameras. We recover the assignment of associations while simultaneously computing the maximum likelihood estimates of the inter-camera homographies and the trajectory parameters using the Expectation Maximization algorithm. Quantitative results on simulations are reported along with several results on real data.

Optimizing Distribution-based Matching by Random Subsampling

Alex Leung and Shaogang Gong

We boost the efficiency and robustness of distribution-based matching by random subsampling, which results in the minimum number of samples required to achieve a specified probability that a candidate sampling distribution is a good approximation to the model distribution. The improvement is demonstrated with applications to object detection, Mean-Shift tracking using color distributions, and tracking with improved robustness for low-resolution video sequences. The problem of minimizing the number of samples required for robust distribution matching is formulated as a constrained optimization problem with the specified probability as the objective function. We show that, surprisingly, Mean-Shift tracking using our method requires very few samples. Our experiments demonstrate that robust tracking can be achieved with even as few as 5 random samples from the distribution of the target candidate. This leads to a considerably reduced computational complexity that is also independent of object size. We show that random subsampling speeds up tracking by two orders of magnitude for typical object sizes.
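
A minimal sketch of the core idea, assuming a color-histogram target model and Bhattacharyya similarity (standard mean-shift ingredients; the paper's sample-size bound derivation is not reproduced, and the helper names are ours):

    import numpy as np

    def bhattacharyya(p, q):
        # Similarity between two normalized histograms.
        return np.sum(np.sqrt(p * q))

    def sampled_histogram(bin_ids, n_bins, n_samples, rng):
        # Candidate distribution from a small random subsample of the
        # region's pixels; the cost no longer depends on the object's size.
        idx = rng.choice(len(bin_ids), size=n_samples, replace=False)
        h = np.bincount(bin_ids[idx], minlength=n_bins).astype(float)
        return h / h.sum()

    # rng = np.random.default_rng(0)
    # q_cand = sampled_histogram(region_bin_ids, 64, 5, rng)  # even 5 samples
    # score = bhattacharyya(q_model, q_cand)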

Belief Propagation in a 3D Spatio-temporal MRF for Moving Object Detection

Zhaozheng Yin and Robert Collins

Previous pixel-level change detection methods either contain a background updating step that is costly for moving cameras (background subtraction) or cannot locate object position and shape accurately (frame differencing). In this paper we present a Belief Propagation approach for moving object detection using a 3D Markov Random Field (MRF) model. Each hidden state in the 3D MRF model represents a pixel's motion likelihood and is estimated using message passing in a 6-connected spatio-temporal neighborhood. This approach deals effectively with difficult moving object detection problems like objects camouflaged by similar appearance to the background, or objects with uniform color that frame differencing methods can only partially detect. Three examples are presented where moving objects are detected and tracked successfully while handling appearance change, shape change, varied moving speed/direction, scale change and occlusion/clutter.

Shape 2

A Novel Representation for Riemannian Analysis of Elastic Curves in R^n

Shantanu Joshi, Eric Klassen, Anuj Srivastava, and Ian Jermyn

We propose a novel representation of continuous, closed curves in R^n that is quite efficient for analyzing their shapes. We combine the strengths of two important ideas - elastic shape metric and path-straightening methods - in shape analysis and present a fast algorithm for finding geodesics in shape spaces. The elastic metric allows for optimal matching of features while path-straightening provides geodesics between curves. Efficiency results from the fact that the elastic metric becomes the simple L2 metric in the proposed representation. We present step-by-step algorithms for computing geodesics in this framework, and demonstrate them with 2-D as well as 3-D examples.
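
The change of variables that makes the elastic metric Euclidean can be sketched as follows (a square-root-velocity-style transform; the names are ours, and the geodesic itself additionally requires path-straightening and optimization over reparametrizations, which are not shown):

    import numpy as np

    def sqrt_velocity(curve, dt=1.0):
        # curve: T x n array of samples. The representation q = v / sqrt(|v|)
        # turns the elastic metric into the plain L2 metric.
        v = np.gradient(curve, dt, axis=0)
        speed = np.linalg.norm(v, axis=1, keepdims=True)
        return v / np.sqrt(np.maximum(speed, 1e-12))

    def l2_distance(curve1, curve2):
        # Pre-alignment distance between the transformed curves.
        q1, q2 = sqrt_velocity(curve1), sqrt_velocity(curve2)
        return np.sqrt(np.sum((q1 - q2) ** 2))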

On Stable Evolution of Parametric Curves

Srikrishnan V, Subhasis Chaudhuri, Sumantra Dutta Roy, and Daniel Ševčovič

Parametric active contours have been used extensively in computer vision for different tasks like segmentation and tracking. However, all parametric contours are known to suffer from the problem of frequent bunching and spacing out of curve points locally during the curve evolution. In a spline-based implementation of active contours, this leads to the occasional formation of loops locally, and subsequently the curve blows up due to instabilities. It has been shown earlier that, in addition to the usual evolution along the normal direction, the curve should also be evolved in the tangential direction for stability purposes. In this paper, we provide a mathematical basis for selecting a suitable tangential component for stabilisation. We prove the boundedness of the evolved curve and discuss its physical significance. We demonstrate the usefulness of the proposed method with a number of experiments.

Visual Curvature

HaiRong Liu, Longin Latecki, WenYu Liu, and Xiang Bai

In this paper, we propose a new definition of curvature, called visual curvature. It is based on statistics of the extreme points of the height functions computed over all directions. By gradually ignoring relatively small heights, a single-parameter multi-scale curvature is obtained. It does not modify the original contour, and the scale parameter has an obvious geometric meaning. The theoretical properties and the experiments presented demonstrate that multi-scale visual curvature is stable, even in the presence of significant noise. In particular, it can deal with contours with significant gaps. We also show a relation between multi-scale visual curvature and the convexity of simple closed curves. To the best of our knowledge, the proposed definition of visual curvature is the first that applies both to regular curves as defined in differential geometry and to turn angles of polygonal curves. Moreover, it yields stable curvature estimates of curves in digital images even under severe distortions.
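
The underlying statistic is easy to state in code (a bare sketch of counting height-function extrema over directions; the paper's per-point localization and the height threshold that yields the multi-scale version are omitted):

    import numpy as np

    def extrema_counts(contour, n_dirs=64):
        # contour: N x 2 array of boundary points. For each direction theta,
        # project onto it (the 'height function') and count local extrema.
        counts = []
        for theta in np.linspace(0.0, np.pi, n_dirs, endpoint=False):
            h = contour @ np.array([np.cos(theta), np.sin(theta)])
            s = np.sign(np.diff(h))
            counts.append(int(np.sum(s[:-1] * s[1:] < 0)))
        return np.array(counts)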

Riemannian Analysis of Probability Density Functions with Applications in Vision

Anuj Srivastava, Ian Jermyn, and Shantanu Joshi

Applications in computer vision involve statistically analyzing an important class of constrained, non-negative functions, including probability density functions (in texture analysis), dynamic time-warping functions (in activity analysis), and re-parametrization or non-rigid registration functions (in shape analysis of curves). For this one needs to impose a Riemannian structure on the spaces formed by these functions. We propose a “spherical” version of the Fisher-Rao metric that provides closed-form expressions for geodesics and distances, and allows fast computation of sample statistics. To demonstrate this approach, we present an application in planar shape classification.
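
The “spherical” structure gives particularly simple formulas: under the square-root map, densities become points on the unit sphere in L2 and geodesics are great-circle arcs. A minimal sketch (densities discretized on a common grid; the variable names are ours):

    import numpy as np

    def fisher_rao_distance(p, q, dx):
        # psi = sqrt(p) has unit norm since sum(p) * dx = 1, so the geodesic
        # distance is just the arc length between the two sphere points.
        inner = np.sum(np.sqrt(p * q)) * dx
        return np.arccos(np.clip(inner, -1.0, 1.0))

    def geodesic(p, q, t, dx):
        # Closed-form geodesic between sqrt-densities, evaluated at t in
        # [0, 1]; assumes p != q so the angle is nonzero.
        theta = fisher_rao_distance(p, q, dx)
        psi = (np.sin((1 - t) * theta) * np.sqrt(p)
               + np.sin(t * theta) * np.sqrt(q)) / np.sin(theta)
        return psi ** 2   # map back to a density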

Shape Representation and Registration using Vector Distance Functions

Hossam Abd El Munim and Aly Farag

This paper introduces a new method for shape registration by matching vector distance functions. The vector distance function representation is more flexible than the conventional signed distance map, since it enables us to better control the shape registration process by using more general transformations. Based on this model, a variational framework is proposed for the global and local registration of shapes which does not need any point correspondences. The optimization criterion can efficiently handle the estimation of the global registration parameters. A closed-form solution is provided to handle an incremental free-form deformation model covering the local deformations. This is an advantage over gradient descent optimization, which is biased towards the initialization and more time consuming. Results on real shape registration are demonstrated to show the efficiency of the proposed approach with small and large global/local deformations.

Layered Graph Match with Graph Editing

Liang Lin, Song-Chun Zhu, and Yongtian Wang

Many vision tasks are posed as either graph partition (coloring) or graph matching (correspondence) problems. The former include segmentation and grouping, and the latter include wide baseline stereo, large motion, object tracking and recognition. In this paper, we present an integrated solution for both graph matching and graph partition using an effective sampling algorithm in a Bayesian framework. Given two images for matching, we extract two graphs using a primal sketch algorithm [4]. The graph nodes are linelets and primitives (junctions). Both graphs are automatically partitioned into an unknown number of K+1 layers of subgraphs so that K pairs of subgraphs are matched and the remaining layer contains unmatched backgrounds. Each matched pair represents a "moving object" with a TPS (Thin-Plate-Spline) transform to account for its deformations and a set of graph operators to edit the pair of subgraphs to achieve a perfect structural match. The matching energy between two subgraphs includes geometric deformations, appearance dissimilarities, and the cost of graph editing operators. The key contribution of this paper is a stochastic algorithm for simultaneous graph partition, matching, and editing, based on an attributed sketch graph representation. We demonstrate its application on two tasks: (i) large motion with occlusion, and (ii) automatic detection and recognition of common objects in a pair of images.

Stereo 1

Learning Conditional Random Fields for Stereo

Daniel Scharstein and Chris Pal

State-of-the-art stereo vision algorithms utilize color changes as important cues for object boundaries. Most methods impose heuristic restrictions or priors on disparities, for example by modulating local smoothness costs with intensity gradients. In this paper we seek to replace such heuristics with explicit probabilistic models of disparities and intensities learned from real images. We have constructed a large number of stereo datasets with ground-truth disparities, and we use a subset of these datasets to learn the parameters of Conditional Random Fields (CRFs). We present experimental results illustrating the potential of our approach for automatically learning the parameters of models with richer structure than standard hand-tuned MRF models.

Stereo Matching via Disparity Estimation and Surface Modeling

Jong Dae Oh, Siwei Ma, and C.-C. Jay Kuo

Two new techniques are proposed in this work to improve stereo matching performance. First, to address the disparity discontinuity problem in occluded regions, we present a disparity estimation procedure which consists of two steps: a greedy disparity filling algorithm and a least-squared-errors (LSE) fitting method. Second, it is observed that the existing fronto-parallel model with color segmentation is built upon a piecewise constant surface approximation, which is not efficient in approximating slanted or curved objects. We use a piecewise linear surface model to represent the 3-dimensional (3D) geometric structure for better surface modeling. The proposed stereo matching system with these two new components is evaluated on the Middlebury data sets with excellent quantitative and qualitative results.

Probabilistic visibility for multi-view stereo

Carlos Hernandez, George Vogiatzis, and Roberto Cipolla

We present a new formulation of multi-view stereo that treats the problem as probabilistic 3D segmentation. Previous work has used the stereo photo-consistency criterion as a detector of the boundary between the 3D scene and the surrounding empty space. Here we show how the same criterion can also provide a foreground/background model that can predict whether a 3D location is inside or outside the scene. This model replaces the commonly used naive foreground model based on ballooning, which is known to perform poorly in concavities. We demonstrate how probabilistic visibility is linked to previous work on depth-map fusion, and we present a multi-resolution graph-cut implementation using the new ballooning term that is very efficient both in terms of computation time and memory requirements.

Stereo Matching on Objects with Fractional Boundary

Wei Xiong and Jiaya Jia

Conventional stereo matching algorithms assume color constancy on the corresponding opaque pixels in the stereo images. However, when foreground objects with fractional boundaries are blended into the scene behind them using unknown alpha values, color constancy no longer holds, due to the spatially varying disparities of the different layers. In this paper, we address the fractional stereo matching problem. A probabilistic framework is introduced to establish the correspondences of pixel colors, disparities, and alpha values in different layers. We propose an automatic optimization method that solves a maximum a posteriori (MAP) problem using Expectation-Maximization (EM), given the input of only a narrow-band stereo image pair. Our method naturally encodes pixel occlusion in the formulation of layer blending, without a special detection process. We demonstrate the effectiveness of our method on difficult stereo images.

A Surface-Growing Approach to Multi-View Stereo Reconstruction

Martin Habbecke and Leif Kobbelt

We present a new approach to reconstructing the shape of a 3D object or scene from a set of calibrated images. The central idea of our method is to combine the topological flexibility of a point-based geometry representation with the robust reconstruction properties of scene-aligned planar primitives. This can be achieved by approximating the shape with a set of surface elements (surfels) in the form of planar disks which are independently fitted such that their footprints in the input images match. Instead of using an artificial energy functional to promote the smoothness of the recovered surface during fitting, we use the smoothness assumption only to initialize planar primitives and to check the feasibility of the fitting result. After an initial disk has been found, the recovered region is iteratively expanded by growing further disks in the tangent direction. The expansion stops when a disk rotates by more than a given threshold during the fitting step. A global sampling strategy guarantees that eventually the whole surface is covered. Our technique does not depend on a shape prior or silhouette information for the initialization, and it can automatically and simultaneously recover the geometry, topology, and visibility information, which makes it superior to other state-of-the-art techniques. We demonstrate with several high-quality reconstruction examples that our algorithm performs robustly and is tolerant to a wide range of image capture modalities.

Mumford-Shah Meets Stereo: Integration of Weak Depth Hypotheses

Thomas Pock, Christopher Zach, and Horst Bischof

Recent results on stereo indicate that an accurate segmentation is crucial for obtaining faithful depth maps, especially near depth discontinuities. Variational methods have successfully been applied to both image segmentation and computational stereo. In this paper we propose a combination in a unified framework. In particular, we use a Mumford-Shah-like functional to compute a piecewise smooth depth map of a stereo pair. Our approach has two novel features: First, the regularization term of the functional combines edge information obtained from the color segmentation with flow-driven depth discontinuities emerging during the optimization procedure. Second, we propose a robust data term which adaptively selects the best matches obtained from different weak stereo algorithms. We integrate these features in a theoretically consistent framework. The final depth map is the minimizer of the energy functional, which can be solved by the associated functional derivatives. The underlying numerical scheme allows an efficient implementation on modern graphics hardware. We illustrate the performance of our algorithm using the Middlebury database as well as on real imagery.

Detection, Matching, Tracking

Human Detection via Classification on Riemannian Manifolds

Oncel Tuzel, Fatih Porikli, and Peter Meer

We present a new algorithm to detect humans in still images utilizing covariance matrices as object descriptors. Since these descriptors do not lie on a vector space, well known machine learning techniques are not adequate to learn the classifiers. The space of d-dimensional nonsingular covariance matrices can be represented as a connected Riemannian manifold. We present a novel approach for classifying points lying on a Riemannian manifold by incorporating a priori information about the geometry of the space. The algorithm is tested on the INRIA human database, where superior detection rates are observed over previous approaches.
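
For context, the descriptor and the affine-invariant distance it is compared under can be sketched as follows (standard formulas for symmetric positive-definite matrices; the manifold-aware classifier that is the paper's contribution is not shown):

    import numpy as np
    from scipy.linalg import eigh

    def covariance_descriptor(features):
        # features: N x d array of per-pixel features (e.g. position,
        # intensity, gradients) inside a detection window.
        return np.cov(features, rowvar=False)

    def geodesic_distance(X, Y):
        # Affine-invariant Riemannian distance between SPD matrices:
        # sqrt(sum_i log^2 lambda_i), with lambda_i the generalized
        # eigenvalues of the pencil (X, Y).
        lam = eigh(X, Y, eigvals_only=True)
        return np.sqrt(np.sum(np.log(lam) ** 2))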

Matching Local Self-Similarities across Images and Videos

Eli Shechtman and Michal Irani

We present an approach for measuring similarity between visual entities (images or videos) based on matching internal self-similarities. What is correlated across images (or across video sequences) is the internal layout of local self-similarities (up to some distortions), even though the patterns generating those local self-similarities are quite different in each of the images/videos. These internal self-similarities are efficiently captured by a compact local “self-similarity descriptor”, measured densely throughout the image/video, at multiple scales, while accounting for local and global geometric distortions. This gives rise to matching capabilities of complex visual data, including detection of objects in real cluttered images using only rough hand-sketches, handling textured objects with no clear boundaries, and detecting complex actions in cluttered video data with no prior learning. We compare our measure to commonly used image-based and video-based similarity measures, and demonstrate its applicability to object detection, retrieval, and action detection.
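
The core of the descriptor is a small correlation surface, sketched below (a bare SSD version with our own normalization; the published descriptor additionally bins this surface log-polarly and normalizes by local patch variance):

    import numpy as np

    def self_similarity_surface(img, y, x, patch=5, region=41):
        # Compare the patch at (y, x) against every displacement inside the
        # surrounding region; assumes (y, x) is far enough from the border.
        pr, rr = patch // 2, region // 2
        p = img[y - pr:y + pr + 1, x - pr:x + pr + 1].astype(float)
        ssd = np.empty((region, region))
        for dy in range(-rr, rr + 1):
            for dx in range(-rr, rr + 1):
                q = img[y + dy - pr:y + dy + pr + 1,
                        x + dx - pr:x + dx + pr + 1].astype(float)
                ssd[dy + rr, dx + rr] = np.sum((p - q) ** 2)
        return np.exp(-ssd / (ssd.std() + 1e-12))   # SSD -> similarity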

Tracking in Low Frame Rate Video: A Cascade Particle Filter with Discriminative Observers of Different Lifespans

Yuan Li, Haizhou Ai, Takayoshi Yamashita, Shihong Lao, and Masato Kawade

Tracking objects in low frame rate video, or with abrupt motion, poses two main difficulties which conventional tracking methods can barely handle: 1) poor motion continuity and an increased search space; 2) fast appearance variation of the target and more background clutter due to the increased search space. In this paper, we address the problem from a view which unifies conventional tracking and detection, and present a temporal probabilistic combination of discriminative observers of different lifespans. Each observer is learned from different ranges of samples, with different subsets of features, to achieve a varying level of discriminative power at varying cost. Efficient fusion and temporal inference are then performed by a cascade particle filter which consists of multiple stages of importance sampling. Extensive experiments show significantly improved accuracy of the proposed approach in comparison with existing tracking methods, under the condition of low frame rate data and abrupt motion of both target and camera.

Progressive Finite Newton Approach To Real-time Nonrigid Surface Detection

Jianke Zhu and Michael R. Lyu

Detecting nonrigid surfaces is an interesting research problem for computer vision and image analysis. One important challenge of nonrigid surface detection is how to register a nonrigid surface mesh having a large number of free deformation parameters. This is particularly significant for detecting nonrigid surfaces from noisy observations. Nonrigid surface detection is usually regarded as a robust parameter estimation problem, which is typically solved iteratively from a good initialization in order to avoid local minima. In this paper, we propose a novel progressive finite Newton optimization scheme for the nonrigid surface detection problem, which is reduced to only solving a set of linear equations. The key to our approach is to formulate nonrigid surface detection as an unconstrained quadratic optimization problem which has a closed-form solution for a given set of observations. Moreover, we employ a progressive active-set selection scheme, which takes advantage of the rank information of detected correspondences. We have conducted extensive experiments for performance evaluation in various environments, whose promising results show that the proposed algorithm is more efficient and effective than existing iterative methods.

Search and Optimization

Approximate Nearest Subspace Search with Applications to Pattern Recognition

Ronen Basri, Tal Hassner, and Lihi Zelnik-Manor

Linear and affine subspaces are commonly used to describe appearance of objects under different lighting, viewpoint, articulation, and identity. A natural problem arising from their use is - given a query image portion represented as a point in some high dimensional space - find a subspace near to the query. This paper presents an efficient solution to the approximate nearest subspace problem for both linear and affine subspaces. Our method is based on a simple reduction to the problem of nearest point search, and can thus employ tree based search or locality sensitive hashing to find a near subspace. Further speedup may be achieved by using random projections to lower the dimensionality of the problem. We provide theoretical proofs of correctness and error bounds of our construction and demonstrate its capabilities on synthetic and real data. Our experiments demonstrate that an approximate nearest subspace can be located significantly faster than the exact nearest subspace, while at the same time it can find better matches compared to a similar search on points, in the presence of variations due to viewpoint, lighting etc.

Solving Large Scale Binary Quadratic Problems: Spectral Methods vs. Semidefinite Programming

Carl Olsson, Anders Eriksson, and Fredrik Kahl

In this paper we introduce two new methods for solving binary quadratic problems. While spectral relaxation methods have been the workhorse subroutine for a wide variety of computer vision problems - segmentation, clustering, image restoration to name a few - they have recently been challenged by semidefinite programming (SDP) relaxations. In fact, it can be shown that SDP relaxations produce better lower bounds than spectral relaxations on binary problems with a quadratic objective function. On the other hand, the computational complexity for SDP increases rapidly as the number of decision variables grows, making them inapplicable to large scale problems. Our methods combine the merits of both spectral and SDP relaxations - better (lower) bounds than traditional spectral methods and considerably faster execution times than SDP. The first method is based on spectral subgradients and can be applied to large scale SDPs with binary decision variables, and the second one is based on the trust region problem. Both algorithms have been applied to several large scale vision problems with good performance.
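
As a baseline for what the two new methods improve on, the classical spectral relaxation fits in a few lines (a standard formulation; the subgradient and trust-region machinery of the paper is not shown):

    import numpy as np

    def spectral_relaxation(A):
        # Relax  max x^T A x,  x in {-1, +1}^n  by keeping only ||x||^2 = n.
        # The relaxed optimum is n * lambda_max, attained at the top
        # eigenvector, which is then rounded back to binary values.
        w, V = np.linalg.eigh((A + A.T) / 2.0)
        x = np.sign(V[:, -1])
        x[x == 0] = 1
        bound = len(A) * w[-1]           # upper bound on the true optimum
        return x, bound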

Optimizing Binary MRFs via Extended Roof Duality

Carsten Rother, Vladimir Kolmogorov, Victor Lempitsky, and Martin Szummer

Many computer vision applications rely on the efficient optimization of challenging, so-called non-submodular, binary pairwise MRFs. A promising graph cut-based approach for optimizing such MRFs known as “roof duality” was recently introduced into computer vision. We study two methods which extend this approach. First, we discuss an efficient implementation of the “probing” technique introduced recently by Boros et al. [5]. It simplifies the MRF while preserving the global optimum. Our code is 400-700 times faster on some graphs than the software of [5]. Second, we present a new technique which takes an arbitrary input labeling and tries to improve its energy. We give theoretical characterizations of local minima of this procedure. We applied both techniques to many applications, including image segmentation, new view synthesis, super-resolution, diagram recognition, parameter learning, texture restoration, and image deconvolution. For several applications we see that we are able to find the global minimum very efficiently, and considerably outperform the original roof duality approach. In comparison to existing techniques, such as graph cut, TRW, BP, ICM, and simulated annealing, we nearly always find a lower energy.

P3 and Beyond: Solving Energies with Higher Order Cliques

Pushmeet Kohli, Pawan Mudigonda, and Philip Torr

In this paper we extend the class of energy functions for which the optimal α-expansion and αβ-swap moves can be computed in polynomial time. Specifically, we introduce a class of higher order clique potentials and show that the expansion and swap moves for any energy function composed of these potentials can be found by minimizing a submodular function. We also show that for a subset of these potentials, the optimal move can be found by solving an st-mincut problem. We refer to this subset as the P^n Potts model. Our results enable the use of powerful move making algorithms, i.e. α-expansion and αβ-swap, for the minimization of energy functions involving higher order cliques. Such functions have the capability of modelling the rich statistics of natural scenes and can be used for many applications in computer vision. We demonstrate their use on one such application, i.e. the texture-based video segmentation problem.

Efficient MRF Deformation Model for Non-Rigid Image Matching

Alexander Shekhovtsov, Ivan Kovtun, and Václav Hlaváč

We propose a novel MRF-based model for deformable image matching. Given two images, the task is to estimate a mapping from one image to the other maximizing the quality of the match. We consider mappings defined by a discrete deformation field constrained to preserve 2D continuity. We pose the task as finding MAP configurations of a pairwise MRF. We propose a more compact MRF representation of the problem which leads to a weaker, though computationally more tractable, linear programming relaxation -- the approximation technique we choose to apply. The number of dual LP variables grows linearly with the search window side, rather than quadratically as in previous approaches. To solve the relaxed problem (suboptimally), we apply the TRW-S (Sequential Tree-Reweighted Message Passing) algorithm [13,5]. Using our representation and the chosen optimization scheme, we are able to match much wider deformations than were previously considered in a global optimization framework. We further elaborate on the continuity and data terms to achieve a more appropriate description of smooth deformations. The performance of our technique is demonstrated on both synthetic and real-world experiments.

Physics

Color Constancy Using Natural Image Statistics

Arjan Gijsenij and Theo Gevers

Although many color constancy methods exist, they are all based on specific assumptions such as the set of possible light sources, or the spatial and spectral characteristics of images. As a consequence, no algorithm can be considered universal. However, with the large variety of available methods, the question is how to select the method that induces equivalent classes for different image characteristics. Furthermore, the subsequent question is how to combine the different algorithms in a proper way. To achieve selection and combining of color constancy algorithms, in this paper, natural image statistics are used to identify the most important characteristics of color images. Then, based on these image characteristics, the proper color constancy algorithm (or best combination of algorithms) is selected for a specific image. To capture the image characteristics, the Weibull parameterization (e.g. texture and contrast) is used. Experiments show that, on a large data set of 11,000 images, our approach outperforms current state-of-the-art single algorithms, as well as simple alternatives for combining several algorithms.

Isotropy, Reciprocity and the Generalized Bas-Relief Ambiguity

Ping Tan, Satya Mallick, Long Quan, David Kriegman, and Todd Zickler

A set of images of a Lambertian surface under varying lighting directions defines its shape up to a three-parameter Generalized Bas-Relief (GBR) ambiguity. In this paper, we examine this ambiguity in the context of surfaces having an additive non-Lambertian reflectance component, and we show that the GBR ambiguity is resolved by any non-Lambertian reflectance function that is isotropic and spatially invariant. The key observation is that each point on a curved surface under directional illumination is a member of a family of points that are in isotropic or reciprocal configurations. We show that the GBR can be resolved in closed form by identifying members of these families in two or more images. Based on this idea, we present an algorithm for recovering full Euclidean geometry from a set of uncalibrated photometric stereo images, and we evaluate it empirically on a number of examples.
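
For reference, the GBR ambiguity referred to in this and the following abstract is the standard three-parameter transform from the photometric stereo literature: with b(x) = ρ(x) n(x) the albedo-scaled normal field and s a light vector,

    G = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ \mu & \nu & \lambda \end{pmatrix},
    \qquad \lambda \neq 0, \qquad
    b'(x) = G^{-\top} b(x), \qquad s' = G s,

    \text{so that}\qquad
    b'(x)^{\top} s' = b(x)^{\top} G^{-1} G\, s = b(x)^{\top} s,

i.e. every image is left unchanged. This is exactly the family of solutions that the isotropy and reciprocity constraints (here) and the albedo-entropy prior (below) single out.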

Resolving the Generalized Bas-Relief Ambiguity by Entropy Minimization

Neil Alldrin, Satya Mallick, and David Kriegman

It is well known in the photometric stereo literature that uncalibrated photometric stereo, where light source strength and direction are unknown, can recover the surface geometry of a Lambertian object up to a 3-parameter linear transform known as the generalized bas relief (GBR) ambiguity. Many techniques have been proposed for resolving the GBR ambiguity, typically by exploiting prior knowledge of the light sources, the object geometry, or non-Lambertian effects such as specularities. A less celebrated consequence of the GBR transformation is that the albedo at each surface point is transformed along with the geometry. Thus, it should be possible to resolve the GBR ambiguity by exploiting priors on the albedo distribution. To the best of our knowledge, the only time the albedo distribution has been used to resolve the GBR is in the case of uniform albedo. We propose a new prior on the albedo distribution: that the entropy of the distribution should be low. This prior is justified by the fact that many objects in the real world are composed of a small finite set of albedo values.

Polarization and Phase-shifting for 3D Scanning of Translucent Objects

Tongbo Chen, Hendrik Lensch, Christian Fuchs, and Hans-Peter Seidel

Translucent objects pose a difficult problem for traditional structured light 3D scanning techniques. Subsurface scattering corrupts the range estimation in two ways: by drastically reducing the signal-to-noise ratio and by shifting the intensity peak beneath the surface to a point which does not coincide with the point of incidence. In this paper we analyze and compare two descattering methods in order to obtain reliable 3D coordinates for translucent objects. By using polarization-difference imaging, subsurface scattering can be filtered out, because multiple scattering randomizes the polarization direction of light while the surface reflectance partially keeps the polarization direction of the illumination. The descattered reflectance can be used for reliable 3D reconstruction using traditional optical 3D scanning techniques, such as structured light. Phase-shifting is another effective descattering technique if the frequency of the projected pattern is sufficiently high. We demonstrate the performance of these two techniques, and of their combination, on scanning real-world translucent objects.

Autocalibration and Uncalibrated Reconstruction of Shape from Defocus

Yifei Lou, Paolo Favaro, Andrea Bertozzi, and Stefano Soatto

Most algorithms for reconstructing shape from defocus assume that the images are obtained with a camera that has been previously calibrated, so that the aperture, focal plane, and focal length are known. In this manuscript we characterize the set of scenes that can be reconstructed from defocused images regardless of the calibration parameters. In the absence of knowledge about the camera or the scene, reconstruction is possible only up to an equivalence class that is described analytically. When weak knowledge about the scene is available, however, we show how it can be exploited in order to auto-calibrate the imaging device. This includes imaging a slanted plane or generic assumptions on the restoration of the deblurred images.

Poster session 1

Sensing, Photometrics, and Image Processing 2

Spatial-Depth Super Resolution for Range Images

Qingxiong Yang, Ruigang Yang, James Davis, and David Nistér

We present a new post-processing step to enhance the resolution of range images. Using one or two registered and potentially high-resolution color images as reference, we iteratively refine the input low-resolution range image, in terms of both its spatial resolution and depth precision. Evaluation using the Middlebury benchmark shows across-the-board improvement for sub-pixel accuracy. We also demonstrate its effectiveness for spatial resolution enhancement up to 100×100 with a single reference image.
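
As a rough idea of color-guided refinement, here is a plain joint-bilateral upsampling baseline (a well-known related technique, not the paper's iterative algorithm; it assumes the high-resolution size is an exact multiple of the factor, and all names are ours):

    import numpy as np

    def joint_bilateral_upsample(depth_lr, color_hr, factor,
                                 sigma_s=4.0, sigma_c=10.0, rad=8):
        # Each high-res depth value is a weighted mean of nearby low-res
        # depths; weights combine spatial distance and color similarity in
        # the registered high-res image, so depth edges follow color edges.
        H, W = color_hr.shape[:2]
        out = np.zeros((H, W))
        for y in range(H):
            for x in range(W):
                num = den = 0.0
                for dy in range(-rad, rad + 1, factor):
                    for dx in range(-rad, rad + 1, factor):
                        yy, xx = y + dy, x + dx
                        if not (0 <= yy < H and 0 <= xx < W):
                            continue
                        d = depth_lr[yy // factor, xx // factor]
                        ws = np.exp(-(dy * dy + dx * dx) / (2 * sigma_s ** 2))
                        dc = color_hr[y, x].astype(float) - color_hr[yy, xx]
                        wc = np.exp(-np.sum(dc * dc) / (2 * sigma_c ** 2))
                        num += ws * wc * d
                        den += ws * wc
                out[y, x] = num / max(den, 1e-12)
        return out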

The hyperbolic geometry of illumination induced chromaticity changes

Reiner Lenz, Pedro Latorre Carmona, and Peter Meer

The non-negativity of color signals implies that they span a conical space with a hyperbolic geometry. We use perspective projections to separate intensity from chromaticity, and for 3-D color descriptors the chromatic properties are represented by points on the unit disk. Descriptors derived from the same object point but under different imaging conditions can be joined by a hyperbolic geodesic. The properties of this model are investigated using multichannel images of natural scenes and black body illuminants of different temperatures. We show, over a series of static scenes with different illuminants, how illumination changes influence the hyperbolic distances and the geodesics. Descriptors derived from conventional RGB images are also addressed.
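
Once chromaticities sit on the unit disk, the hyperbolic distance between two of them takes the familiar Poincaré form (a standard formula; obtaining the disk coordinates from the perspective projection of the 3-D descriptors is not shown here):

    import numpy as np

    def poincare_distance(z1, z2):
        # z1, z2: chromaticity points inside the unit disk, as complex numbers.
        m = abs(z1 - z2) / abs(1 - z1 * np.conj(z2))
        return 2.0 * np.arctanh(m)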

Radiometric Calibration from Noise Distributions

Yasuyuki Matsushita and Stephen Lin

A method is proposed for estimating radiometric response functions from noise observations. From the statistical properties of noise sources, the noise distribution of each pixel is shown to be symmetric for a radiometrically calibrated camera. However, due to the non-linearity of camera response functions, the observed noise distributions become skewed in an uncalibrated camera. In this paper, we capitalize on these asymmetric profiles of measured noise distributions to estimate radiometric response functions. Unlike prior approaches, the proposed method is not sensitive to noise level, and is therefore particularly useful when the noise level is high. Also, the proposed method does not require registered input images taken with different exposures; only statistical noise distributions at multiple intensity levels are used. Real-world experiments demonstrate the effectiveness of the proposed approach in comparison to standard calibration techniques.

Automatic Removal of Chromatic Aberration from a Single Image

Sing Bing Kang

Many high resolution images exhibit chromatic aberration (CA), where the color channels appear shifted. Unfortunately, merely compensating for these shifts is sometimes inadequate, because the intensities are modified by other effects such as spatially-varying defocus and (surprisingly) in-camera sharpening. In this paper, we start from the basic principles of image formation to characterize CA, and show how its effects can be substantially reduced. We also show results of CA correction on a number of high-resolution images taken with different cameras.

Detecting specular surfaces on natural images

Andrey DelPozo and Silvio Savarese

Recognizing and localizing specular (or mirror-like) surfaces from a single image is a great challenge for computer vision. Unlike other materials, the appearance of a specular surface changes as a function of the surrounding environment as well as the position of the observer. Even though the reflection on a specular surface has an intrinsic ambiguity that might be resolved by high level reasoning, we argue that we can take advantage of low level features to recognize specular surfaces. This intuition stems from the observation that the surrounding scene is highly distorted when reflected off regions of high curvature or occluding contours. We call these features static specular flows (SSF). We show how to characterize SSF and use them for identifying specular surfaces. To evaluate our results we collect a dataset of 120 images containing specular surfaces. Our algorithm achieves good performance on this challenging dataset. In particular, our results outperform other methods that follow a more naive approach.

Variable Bandwidth Image Denoising Using Image-based Noise Models

Noura Azzabou, Nikos Paragios, Frédéric Guichard, and Frédéric Cao

This paper introduces a variational formulation for image denoising based on a quadratic function over kernels of variable bandwidth. These kernels are scale adaptive and reflect spatial and photometric similarities between pixels. The bandwidth of the kernels is observation-dependent, towards improving the accuracy of the reconstruction process, and is constrained to be locally smooth. We analyze the evolution of the noise model from the RAW space to the RGB one, by propagating it over the image formation process. The experimental results demonstrate that the use of a variable bandwidth approach and an image intensity dependent noise variance ensures better restoration quality.

Estimating Scale of a Scene from a Single Image Based on Defocus Blur and Scene Geometry

Takayuki Okatani and Koichiro Deguchi

Using an imaging system in which the image plane can be tilted with respect to the optical axis of the lens, an image of a large-scale scene that appears to be a miniature to human eyes can be captured. This phenomenon suggests that the image contains information regarding the scale of the scene and that human vision can extract this information and recognize the scene scale from a single image. In this study, we consider how human vision can perform this single-view scale estimation. Although it is obvious that the existence of defocus blur in the image, simulating a shallow DOF, plays an essential role in the scale estimation, we propose that this alone is not sufficient to explain the estimation mechanism. By incorporating a few assumptions, we theoretically show that scale estimation is made possible when (1) the 3D structure of the scene can be recovered from the image and, furthermore, (2) the structure is combined with the defocus blur. Further, we present a simple algorithm for scale recognition and demonstrate how it works on a real image.

Learning Color Names from Real-World Images

Joost van der Weijer, Cordelia Schmid, and Jakob Verbeek

Within a computer vision context, color naming is the action of assigning linguistic color labels to image pixels. In general, research on color naming applies the following paradigm: a collection of color chips is labelled with color names within a well-defined experimental setup by multiple test subjects. The collected data set is subsequently used to label RGB values in real-world images with a color name. Apart from the fact that this collection process is time consuming, it is unclear to what extent color naming within a controlled setup is representative of color naming in real-world images. In this paper we propose to learn color names from real-world images. We avoid test subjects by using Google Image to collect a data set of color names. An adapted PLSA model is applied to extract the color names from this database, where the topics represent color distributions. Experiments show that for computer vision applications color names learned from real-world images significantly outperform color names learned in controlled settings.

Active Aperture Control and Sensor Modulation for Flexible Imaging

Chunyu Gao, Narendra Ahuja, and Hong Hua

In this paper, we describe an optical system which is capable of providing external access to both the sensor and the lens aperture (i.e., the projection center) of a conventional camera. The proposed optical system is attached in front of the camera, and is the equivalent of adding an externally accessible intermediate image plane and projection center. The system offers control of the response of each pixel, which could be used to realize many added imaging functions, such as high dynamic range imaging, image modulation, and optical computation. The ability to access the optical center could enable a wide variety of applications by simply allowing manipulation of the geometric properties of the optical center. For instance, panoramic imaging can be implemented by rotating a planar mirror about the camera axis, and small-baseline stereo can be implemented by shifting the camera center. We have implemented a bench setup to demonstrate some of these functions. The experimental results are included.

Retrieval and Search 2

A Topic-Motion Model for Unsupervised Video Object Discovery

David Liu and Tsuhan Chen

The bag-of-words representation has attracted a lot of attention recently in the field of object recognition. Based on the bag-of-words representation, topic models such as Probabilistic Latent Semantic Analysis (PLSA) have been applied to unsupervised object discovery in still images. In this paper, we extend topic models from still images to motion videos with the integration of a temporal model. We propose a novel spatial-temporal framework that uses topic models for appearance modeling, and the Probabilistic Data Association (PDA) filter for motion modeling. The spatial and temporal models are tightly integrated so that motion ambiguities can be resolved by appearance, and appearance ambiguities can be resolved by motion. We show promising results that cannot be achieved by appearance or motion modeling alone.

Content-Based Image Annotation Refinement

Changhu Wang, Feng Jing, Lei Zhang, and Hong-Jiang Zhang

Automatic image annotation has been an active research topic due to its great importance in image retrieval and management. However, results of the state-of-the-art image annotation methods are often unsatisfactory. Despite continuous efforts in inventing new annotation algorithms, it would be advantageous to develop a dedicated approach that could refine imprecise annotations. In this paper, a novel approach to automatically refining the original annotations of images is proposed. For a query image, an existing image annotation method is first employed to obtain a set of candidate annotations. Then, the candidate annotations are re-ranked and only the top ones are reserved as the final annotations. By formulating the annotation refinement process as a Markov process and defining the candidate annotations as the states of a Markov chain, a content-based image annotation refinement (CIAR) algorithm is proposed to re-rank the candidate annotations. It leverages both corpus information and the content feature of a query image. Experimental results on a typical Corel dataset show not only the validity of the refinement, but also the superiority of the proposed algorithm over existing ones.
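
To illustrate the Markov-chain view of refinement, here is a hypothetical random-walk re-ranker (a PageRank-style sketch under our own assumptions; the paper's transition probabilities combine corpus statistics and image content in their own way):

    import numpy as np

    def rerank(trans, init_scores, alpha=0.85, n_iter=100):
        # trans: row-stochastic affinity between candidate annotations (the
        # states of the chain); init_scores: scores from the base annotator.
        base = init_scores / init_scores.sum()
        r = base.copy()
        for _ in range(n_iter):
            r = alpha * trans.T @ r + (1 - alpha) * base
        return np.argsort(-r)   # new ranking; keep only the top words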

Discovery of Collocation Patterns: from Visual Words to Visual Phrases

Junsong Yuan, Ying Wu, and Ming Yang

A visual object can be represented as a set of primitive features. Such a “bag-of-words” representation has led to many significant results in various vision tasks including object recognition and categorization. However, in practice, the use of primitive visual features tends to result in synonymous visual words that over-represent visual patterns, as well as polysemous visual words that bring large uncertainties and ambiguities into the representation. This paper proposes a novel “bag-of-phrases” model, where each visual phrase is composed of spatially collocated visual words. By automatically discovering meaningful visual phrases in images, this new model suffers less from synonymy and polysemy than bag-of-words models. The contributions of this paper include: (1) a fast and principled solution to the discovery of significant spatial collocation patterns using frequent itemset mining; (2) a pattern summarization method that deals with the compositional uncertainties in visual phrases; and (3) a top-down refinement scheme for the visual word dictionary that feeds discovered phrases back to tune the similarity measure through metric learning.
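
The first contribution reduces, in its simplest form, to counting co-occurring word pairs within local neighborhoods (an Apriori-flavored sketch restricted to pairs; the paper's mining and summarization are more general, and the names here are ours):

    from collections import Counter
    from itertools import combinations

    def frequent_pairs(transactions, min_support):
        # A transaction is the set of visual-word ids falling inside one
        # spatial neighborhood; frequent pairs are candidate visual phrases.
        counts = Counter()
        for t in transactions:
            counts.update(combinations(sorted(set(t)), 2))
        return {pair: n for pair, n in counts.items() if n >= min_support}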

Multi-modal Clustering for Multimedia Collections

Ron Bekkerman and Jiwoon Jeon

Most online multimedia collections, such as picture galleries or video archives, are categorized in a fully manual process, which is very expensive and may soon be infeasible with the rapid growth of multimedia repositories. In this paper, we present an effective method for automating this process within the unsupervised learning framework. We exploit the truly multi-modal nature of multimedia collections---they have multiple views, or modalities, each of which contributes its own perspective to the collection's organization. For example, in picture galleries, image captions are often provided that form a separate view on the collection. Color histograms (or any other set of global features) form another view. Additional views are blobs, interest points and other sets of local features. Our model, called Comraf* (pronounced Comraf-Star), efficiently incorporates various views in multi-modal clustering, which allows great modeling flexibility. Comraf* is a light-weight version of the recently introduced combinatorial Markov random field (Comraf). We show how to translate an arbitrary Comraf into a series of Comraf* models, and give empirical evidence for the comparable effectiveness of the two. Comraf* demonstrates excellent results on two real-world image galleries: it obtains 2.5-3 times higher accuracy compared with uni-modal k-means.

Reducing correspondence ambiguity in loosely labeled training data

Kobus Barnard and Quanfu Fan

We develop an approach to reduce correspondence ambiguity in training data where data items are associated with sets of plausible labels. Our domain is images annotated with keywords, where it is not known which part of the image a keyword refers to. In contrast to earlier approaches that build predictive models or classifiers despite the ambiguity, we argue that it is better to first address the correspondence ambiguity, and then build more complex models from the improved training data. This addresses the difficulties of fitting complex models in the face of ambiguity while exploiting all the constraints available from the training data. We contribute a simple and flexible formulation of the problem, and show results validated by a recently developed comprehensive evaluation data set and corresponding evaluation methodology.

Pyramid Match Hashing: Sub-Linear Time Indexing Over Partial Correspondences

Kristen Grauman and Trevor Darrell

Matching local features across images is often useful when comparing or recognizing objects or scenes, and efficient techniques for obtaining image-to-image correspondences have been developed. However, given a query image, searching a very large image database with such measures remains impractical. We introduce a sub-linear time randomized hashing algorithm for indexing sets of feature vectors under their partial correspondences. We develop an efficient embedding function for the normalized partial matching similarity between sets, and show how to exploit random hyperplane properties to construct hash functions that satisfy locality-sensitive constraints. The result is a bounded approximate similarity search algorithm that finds (1+ε)-approximate nearest neighbor images in O(N^(1/(1+ε))) time for a database containing N images represented by (varying numbers of) local features. We demonstrate our approach applied to image retrieval for images represented by sets of local appearance features, and show that searching over correspondences is now scalable to large image databases.
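
The random-hyperplane ingredient is compact enough to sketch (generic cosine-similarity LSH; the set-to-vector embedding that makes it approximate the partial-match similarity is the paper's contribution and is not shown):

    import numpy as np

    def make_hash(dim, n_bits, rng):
        # Draw the random hyperplanes once; reuse them for database and
        # queries. Vectors with high cosine similarity agree on most bits,
        # so hash-bucket collisions surface near neighbors.
        R = rng.standard_normal((dim, n_bits))
        return lambda v: (v @ R > 0).astype(np.uint8)

    # h = make_hash(dim=128, n_bits=32, rng=np.random.default_rng(0))
    # db_codes = h(db_embeddings)    # one binary code per database image
    # q_code = h(query_embedding)    # candidates: rows matching most bits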

Visual Event Recognition in News Video using Kernel Methods with Multi-Level Temporal Alignment

Dong Xu and Shih-Fu Chang

In this work, we systematically study the problem of visual event recognition in unconstrained news video sequences. We adopt the discriminative kernel-based method, for which video clip similarity plays an important role. First, we represent a video clip as a bag of orderless descriptors extracted from all of the constituent frames and apply Earth Mover's Distance (EMD) to integrate similarities among frames from two clips. Observing that a video clip is usually comprised of multiple sub-clips corresponding to event evolution over time, we further build a multi-level temporal pyramid. At each pyramid level, we integrate the information from different sub-clips with integer-value constrained EMD to explicitly align the sub-clips. By fusing the information from the different pyramid levels, we develop Temporally Aligned Pyramid Matching (TAPM) for measuring video similarity. We conduct comprehensive experiments on the TRECVID 2005 corpus, which contains more than 6,800 clips. Our experiments demonstrate that 1) the multi-level TAPM method clearly outperforms single-level EMD, and 2) single-level EMD outperforms basic detection methods that use only a single key-frame by a large margin (43.0% in Mean Average Precision). Extensive analysis of the results also reveals an intuitive interpretation of sub-clip alignment at different levels.
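
A rough illustration of the frame-level similarity: for equal-size bags of frames with uniform weights, EMD reduces to a minimum-cost assignment, which the sketch below solves directly (the paper handles general weights and adds an integer-constrained EMD over the temporal pyramid; the descriptors here are random stand-ins):

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def emd_uniform(frames_a, frames_b):
    """EMD between two equal-size bags of frame descriptors with uniform
    weights, which reduces to a min-cost perfect matching."""
    cost = cdist(frames_a, frames_b)          # pairwise ground distances
    rows, cols = linear_sum_assignment(cost)  # optimal flow
    return cost[rows, cols].mean()

rng = np.random.default_rng(1)
clip_a = rng.standard_normal((20, 64))  # 20 frames, 64-D descriptor each
clip_b = rng.standard_normal((20, 64))
print(emd_uniform(clip_a, clip_b))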

Efficient Indexing For Articulation Invariant Shape Matching and Retrieval

Soma Biswas, Gaurav Aggarwal, and Rama Chellappa

Most shape matching methods are either fast but too simplistic to give the desired performance, or promising as far as performance is concerned but computationally demanding. In this paper, we present a very simple and efficient approach that not only performs almost as well as many state-of-the-art techniques but also scales up to large databases. In the proposed approach, each shape is indexed based on a variety of simple and easily computable features which are invariant to articulations and rigid transformations. The features characterize pairwise geometric relationships between interest points on the shape, thereby providing robustness to the approach. Shapes are retrieved using an efficient scheme which does not involve costly operations like shape-wise alignment or establishing correspondences. Even for a moderate-size database of 1000 shapes, the retrieval process is several times faster than most techniques with similar performance. Extensive experimental results are presented to illustrate the advantages of our approach as compared to the best in the field.

Segmentation 2

A Topological Approach to Hierarchical Segmentation using Mean Shift

Sylvain Paris and Frédo Durand

Mean shift is a popular method to segment images and videos. Pixels are represented by feature points, and the segmentation is driven by the point density in feature space. In this paper, we introduce the use of Morse theory to interpret mean shift as a topological decomposition of the feature space into density modes. This allows us to build on the watershed technique and design a new algorithm to compute mean-shift segmentations of images and videos. In addition, we introduce the use of topological persistence to create a segmentation hierarchy. We validated our method by clustering images using color cues. In this context, our technique runs faster than previous work, especially on videos and large images. We evaluated accuracy on a classical benchmark, which shows results on par with existing low-level techniques, i.e., we do not sacrifice accuracy for speed.
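
For reference, the mode-seeking iteration that the paper reinterprets topologically is the kernel-weighted mean update below; this toy sketch runs Gaussian mean shift on 2-D feature points, while the Morse-theoretic decomposition and persistence hierarchy themselves are not shown:

import numpy as np

def mean_shift(points, bandwidth=1.0, n_iter=30):
    """Move each feature point uphill in density via Gaussian mean shift."""
    modes = points.copy()
    for _ in range(n_iter):
        for i in range(len(modes)):
            sq = np.sum((points - modes[i]) ** 2, axis=1)
            w = np.exp(-sq / (2 * bandwidth ** 2))
            modes[i] = (w[:, None] * points).sum(axis=0) / w.sum()
    return modes  # points sharing a converged mode belong to one segment

rng = np.random.default_rng(2)
pts = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
modes = mean_shift(pts, bandwidth=0.5)
print(np.unique(np.round(modes), axis=0))  # two modes: near (0,0) and (3,3)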

Multiple Class Segmentation Using A Unified Framework over Mean-Shift Patches

Lin Yang, Peter Meer, and David Foran

Object-based segmentation is a challenging topic. Most previous algorithms focused on segmenting a single object or a small set of objects. In this paper, multiple class object-based segmentation is achieved using appearance and bag-of-keypoints models integrated over mean-shift patches. We also propose a novel affine invariant descriptor to model the spatial relationship of keypoints and apply the Elliptical Fourier Descriptor to describe the global shapes. The algorithm is computationally efficient and has been tested on three real datasets using fewer training samples. Our algorithm provides better results than other studies reported in the literature.

Multi-label image segmentation via max-sum solver

Branislav Micusik and Tomas Pajdla

We formulate single-image multi-label segmentation into regions coherent in texture and color as a max-sum problem for which efficient linear programming based solvers have recently appeared. By handling more than two labels, we go beyond widespread binary segmentation methods, e.g., min-cut or normalized cut based approaches. We show that the max-sum solver is a very powerful tool for obtaining the MAP estimate of a Markov random field (MRF). We build the MRF on superpixels to speed up the segmentation while preserving color and texture. We propose new quality functions for setting up the MRF, exploiting priors from small representative image seeds, provided either manually or automatically. We show that the proposed automatic segmentation method outperforms previous techniques in terms of the Global Consistency Error evaluated on the Berkeley segmentation database.
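
To make the MAP-MRF formulation concrete, here is a minimal stand-in: a Potts-prior MRF over a superpixel adjacency graph, approximately minimized with ICM. The paper itself uses an LP-based max-sum solver rather than ICM, and derives its quality functions from image seeds; the unary costs and graph below are toy values:

import numpy as np

def icm_map(unary, edges, lam=1.0, n_iter=10):
    """Approximate MAP labeling of a Potts MRF on a superpixel graph via
    iterated conditional modes. unary: (n_nodes, n_labels) data costs;
    edges: list of (i, j) adjacency pairs; lam: Potts penalty weight."""
    n, k = unary.shape
    nbrs = [[] for _ in range(n)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    labels = unary.argmin(axis=1)
    for _ in range(n_iter):
        for i in range(n):
            cost = unary[i].copy()
            for j in nbrs[i]:
                cost += lam * (np.arange(k) != labels[j])  # Potts prior
            labels[i] = cost.argmin()
    return labels

unary = np.array([[0.1, 2.0], [0.2, 1.5], [1.8, 0.3], [1.0, 1.1]])
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(icm_map(unary, edges, lam=0.5))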

ROI-SEG: Unsupervised Color Segmentation by Combining Differently Focused Sub Results

Michael Donoser and Horst Bischof

This paper presents a novel unsupervised color segmentation scheme named ROI-SEG, which is based on the main idea of combining a set of different sub-segmentation results. We propose an efficient algorithm to compute sub-segmentations using an integral image approach for calculating Bhattacharyya distances and a modified version of the Maximally Stable Extremal Region (MSER) detector. The sub-segmentation algorithm takes a region-of-interest (ROI) as input and detects connected regions with a color appearance similar to that of the ROI. We further introduce a method to identify ROIs representing the predominant color and texture regions of an image. Passing each of the identified ROIs to the sub-segmentation algorithm provides a set of different segmentations, which are then combined by analyzing a local quality criterion. The entire approach is fully unsupervised and does not need a priori information about the image scene. The method is compared to state-of-the-art algorithms on the Berkeley image database, where it shows competitive results at reduced computational cost.
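
The integral-image trick can be sketched as follows: keeping one integral image per histogram bin makes the color histogram of any axis-aligned rectangle, and hence its Bhattacharyya coefficient against a reference, available in time proportional to the number of bins. This is a simplified single-channel sketch; the MSER-based sub-segmentation and the ROI identification are not shown:

import numpy as np

def bin_integral_images(img, n_bins=8):
    """One integral image per histogram bin: rectangle histograms in
    O(n_bins) regardless of rectangle size."""
    h, w = img.shape
    bins = np.minimum((img * n_bins).astype(int), n_bins - 1)
    ii = np.zeros((n_bins, h + 1, w + 1))
    for b in range(n_bins):
        ii[b, 1:, 1:] = np.cumsum(np.cumsum(bins == b, axis=0), axis=1)
    return ii

def rect_hist(ii, top, left, bottom, right):
    h = (ii[:, bottom, right] - ii[:, top, right]
         - ii[:, bottom, left] + ii[:, top, left])
    return h / h.sum()

def bhattacharyya_coeff(p, q):
    return np.sqrt(p * q).sum()  # distance is sqrt(1 - coefficient)

rng = np.random.default_rng(3)
img = rng.random((64, 64))  # toy single-channel image with values in [0, 1)
ii = bin_integral_images(img)
p = rect_hist(ii, 0, 0, 32, 32)    # ROI histogram
q = rect_hist(ii, 32, 32, 64, 64)  # candidate region histogram
print(bhattacharyya_coeff(p, q))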

Combining Region and Edge Cues for Image Segmentation in a Probabilistic Gaussian Mixture Framework

Omer Rotem, Hayit Greenspan, and Jacob Goldberger

In this paper we propose a new segmentation algorithm which combines patch-based information with edge cues under a probabilistic framework. We use a mixture of multiple Gaussians for building the statistical model with color and spatial features, and we incorporate edge information based on texture, color and brightness differences into the EM algorithm. We evaluate our results qualitatively and quantitatively on a large data-set of natural images and compare our results to other state-of-the-art methods.

CRF-driven Implicit Deformable Model

Gabriel Tsechpenakis and Dimitris Metaxas

We present a topology independent solution for segmenting objects with texture patterns of any scale, using an implicit deformable model driven by Conditional Random Fields (CRFs). Our model integrates region and edge information as image driven terms, whereas the probabilistic shape and internal (smoothness) terms use representations similar to the level-set based methods. The evolution of the model is solved as a MAP estimation problem, where the target conditional probability is decomposed into the internal term and the image-driven term. For the latter, we use discriminative CRFs at two scales, pixel- and patch-based, to obtain smooth probability fields based on the corresponding image features. The advantages and novelties of our approach are (i) the integration of CRFs with implicit deformable models in a tightly coupled scheme, (ii) the use of CRFs which avoids ambiguities in the probability fields, (iii) the handling of local feature variations by updating the model interior statistics and processing at different spatial scales, and (iv) the independence from the topology. We demonstrate the performance of our method on a wide variety of images, from the zebra and cheetah examples to the left and right ventricles in cardiac images.

Joint Object Segmentation and Behavior Classification in Image Sequences

Laura Gui, Jean-Philippe Thiran, and Nikos Paragios

In this paper, we propose a general framework for fusing bottom-up segmentation with top-down object behavior classification over an image sequence. This approach is beneficial for both tasks, since it enables them to cooperate so that knowledge relevant to each can aid in the resolution of the other, thus enhancing the final result. In particular, classification offers dynamic probabilistic priors to guide segmentation, while segmentation supplies its results to classification, ensuring that they are consistent both with prior knowledge and with new image information. We demonstrate the effectiveness of our framework via a particular implementation for a hand gesture recognition application. The prior models are learned from training data using principal components analysis and they adapt dynamically to the content of new images. Our experimental results illustrate the robustness of our joint approach to segmentation and behavior classification in challenging conditions involving occlusions of the target object in front of a complex background.

Segmenting Motions of Different Types by Unsupervised Manifold Clustering

Alvina Goh and René Vidal

We propose a novel algorithm for segmenting multiple motions of different types from point correspondences in multiple affine or perspective views. Since point trajectories associated with different motions live in different manifolds, traditional approaches deal with only one manifold type: linear subspaces for affine views, and homographic, bilinear and trilinear varieties for two and three perspective views. As real motion sequences contain motions of different types, we cast motion segmentation as a problem of clustering manifolds of different types. Rather than explicitly modeling each manifold as a linear, bilinear or multilinear variety, we use nonlinear dimensionality reduction to learn a low-dimensional representation of the union of all manifolds. We show that for a union of separated manifolds, the LLE algorithm computes a matrix whose null space contains vectors giving the segmentation of the data. An analysis of the variance of these vectors allows us to distinguish them from other vectors in the null space. This leads to a new algorithm for clustering both linear and nonlinear manifolds. Although this algorithm is theoretically designed for separated manifolds, our experiments demonstrate its performance on real data where this assumption does not hold. We test our algorithm on the Hopkins 155 motion segmentation database and achieve an average classification error of 4.8%, which compares favorably against state-of-the-art multiframe motion segmentation methods.
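
The LLE-based machinery can be sketched generically: compute barycentric reconstruction weights W from the k nearest neighbors of each point, form M = (I - W)^T (I - W), and cluster the eigenvectors of its near-null space. The toy sketch below separates two synthetic 2-D manifolds; the paper's variance analysis for selecting segmentation vectors, and the application to real point trajectories, are omitted:

import numpy as np
from sklearn.cluster import KMeans

def lle_weights(X, k=8, reg=1e-3):
    """Barycentric LLE reconstruction weights from k nearest neighbors."""
    n = len(X)
    W = np.zeros((n, n))
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]       # skip the point itself
        Z = X[nbrs] - X[i]
        G = Z @ Z.T
        G += reg * np.trace(G) * np.eye(k)  # regularize for stability
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()
    return W

def cluster_manifolds(X, n_clusters=2, k=8):
    W = lle_weights(X, k)
    M = (np.eye(len(X)) - W).T @ (np.eye(len(X)) - W)
    _, vecs = np.linalg.eigh(M)
    return KMeans(n_clusters, n_init=10).fit_predict(vecs[:, :n_clusters])

line = np.column_stack([np.linspace(0, 1, 60), np.zeros(60)])
t = np.linspace(0, 2 * np.pi, 60, endpoint=False)
circle = np.column_stack([3 + np.cos(t), np.sin(t)])
print(cluster_manifolds(np.vstack([line, circle])))  # two clean groups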

Optical Flow and Tracking 3

Combining Local and Global Motion Models for Feature Point Tracking

Aeron Buchanan and Andrew Fitzgibbon

Accurate feature point tracks through long sequences are a valuable substrate for many computer vision applications, e.g. non-rigid body tracking, video segmentation, video matching, and even object recognition. Existing algorithms may be arranged along an axis indicating how global the motion model used to constrain tracks is. Local methods, such as the KLT tracker, depend on local models of feature appearance, and are easily distracted by occlusions, repeated structure, and image noise. This leads to short tracks, many of which are incorrect. Alone, these require considerable postprocessing to obtain a useful result. In restricted scenes, for example a rigid scene through which a camera is moving, such postprocessing can make use of global motion models to allow “guided matching” which yields long high-quality feature tracks. However, many scenes of interest contain multiple motions or significant non-rigid deformations which mean that guided matching cannot be applied. In this paper we propose a general amalgam of local and global models to improve tracking even in these difficult cases. By viewing rank-constrained tracking as a probabilistic model of 2D tracks rather than 3D motion, we obtain a strong, robust motion prior, derived from the global motion in the scene. The result is a simple and powerful prior whose strength is easily tuned, enabling its use in any existing tracking algorithm.

On-the-fly Object Modeling while Tracking

Zhaozheng Yin and Robert Collins

To implement a persistent tracker, we build a set of view-dependent object appearance models adaptively and automatically while tracking an object under different viewing angles. This collection of acquired models is indexed with respect to the view sphere. The acquired models aid recovery from tracking failure due to occlusion and changing view angle. In this paper, view-dependent object appearance is represented by intensity patches around detected Harris corners. The intensity patches from a model are matched to the current frame by solving a bipartite linear assignment problem with outlier exclusion and missed inlier recovery. Based on these reliable matches, the change in object rotation, translation and scale between consecutive frames is estimated using Procrustes analysis. The experimental results show good performance using a collection of view-specific patch-based models for detection and tracking of vehicles in low-resolution airborne video.
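
The Procrustes step has a closed form; a generic least-squares sketch for recovering rotation, scale, and translation from matched point sets is shown below (standard orthogonal Procrustes with scale, not the authors' exact pipeline; the matched points here are synthetic):

import numpy as np

def procrustes(src, dst):
    """Least-squares R, s, t such that dst ~ s * R @ src + t (row-wise)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    A, B = src - mu_s, dst - mu_d
    U, S, Vt = np.linalg.svd(B.T @ A)
    R = U @ Vt
    if np.linalg.det(R) < 0:  # enforce a proper rotation, not a reflection
        U[:, -1] *= -1
        R = U @ Vt
    s = S.sum() / (A ** 2).sum()
    t = mu_d - s * R @ mu_s
    return R, s, t

rng = np.random.default_rng(5)
pts = rng.standard_normal((30, 2))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
obs = 1.5 * pts @ R_true.T + np.array([2.0, -1.0])
R, s, t = procrustes(pts, obs)
print(round(s, 3), np.round(t, 3))  # recovers scale 1.5 and shift (2, -1)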

Deformable Surface Tracking Ambiguities

Mathieu Salzmann, Vincent Lepetit, and Pascal Fua

We study from a theoretical standpoint the ambiguities that occur when tracking a generic deformable surface under monocular perspective projection given 3D-to-2D correspondences. We show that, in addition to the known scale ambiguity, a set of potential ambiguities can be clearly identified. From this, we deduce a minimal set of constraints required to disambiguate the problem and incorporate them into a working algorithm that runs on real noisy data.

Regression tracking with data relevance determination

Ioannis Patras and Edwin Hancock

This paper addresses the problem of efficient visual 2D template tracking in image sequences. We adopt a discriminative approach in which the observations at each frame yield direct predictions of a parametrisation of the state (e.g. position/scale/rotation) of the tracked target. To this end, a Bayesian Mixture of Experts (BME) is trained on a dataset of image patches that are generated by applying artificial transformations to the template at the first frame. In contrast to other methods in the literature, we explicitly address the problem that the prediction accuracy can deteriorate drastically for observations that are not similar to the ones in the training set; such observations are common in cases of partial occlusion or fast motion. To do so, we couple the BME with a probabilistic kernel-based classifier which, when trained, can determine the probability that a new/unseen observation can accurately predict the state of the target (the 'relevance' of the observation in question). In addition, in the particle filtering framework, we derive a recursive scheme for maintaining an approximation of the posterior probability of the target's state in which the probabilistic predictions of multiple observations are moderated by their corresponding relevance. We apply the algorithm to the problem of 2D template tracking and demonstrate that the proposed scheme outperforms classical methods for discriminative tracking in cases of large motions and partial occlusions.

Kernel-based Tracking from a Probabilistic Viewpoint

Quang Anh Nguyen, Antonio Robles-Kelly, and Chunhua Shen

In this paper, we present a probabilistic formulation of kernel-based tracking methods based upon maximum likelihood estimation. To this end, we view the coordinates of the pixels in both the target model and its candidate as random variables and make use of a generative model so as to cast the tracking task into a maximum likelihood framework. This, in turn, permits the use of the EM algorithm to estimate a set of latent variables that can be used to update the target-center position. Once the latent variables have been estimated, we use the Kullback-Leibler divergence to minimise the mutual information between the target model and candidate distributions in order to develop a target-center update rule and a kernel bandwidth adjustment scheme. The method is very general in nature. We illustrate the utility of our approach for purposes of tracking on real-world video sequences using two alternative kernel functions.

High-dimensional statistical distance for region-of-interest tracking: Application to combining a soft geometric constraint with radiometry

Sylvain Boltz, Eric Debreuve, and Michel Barlaud

This paper deals with region-of-interest (ROI) tracking in video sequences. The goal is to determine in successive frames the region which best matches, in terms of a similarity measure, an ROI defined in a reference frame. Two aspects of a similarity measure between a reference region and a candidate region can be distinguished: radiometry, which checks if the regions have similar colors, and geometry, which checks if these colors appear at the same locations in the regions. Measures based solely on radiometry include distances between probability density functions (PDFs) of color. The absence of a geometric constraint increases the number of potential matches. A soft geometric constraint can be added to a PDF-based measure by enriching the color information with location, thus increasing the dimension of the domain of definition of the PDFs. However, high-dimensional PDF estimation is not trivial. Instead, we propose to compute the Kullback-Leibler distance between high-dimensional PDFs without explicitly estimating the PDFs. The distance is expressed directly from the samples using the k-th nearest neighbor framework. Tracking experiments were performed on several standard sequences.
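
Such sample-based divergences can be computed from k-th nearest-neighbor distances alone; below is a minimal sketch of a k-NN estimator of KL(P||Q) in the style of Wang, Kulkarni, and Verdu, shown here for illustration rather than taken verbatim from the paper:

import numpy as np
from scipy.spatial import cKDTree

def kl_knn(p_samples, q_samples, k=5):
    """k-NN estimate of KL(P||Q) directly from samples, with no explicit
    density estimation."""
    n, d = p_samples.shape
    m = len(q_samples)
    # k + 1 because each point is its own nearest neighbor within P.
    rho = cKDTree(p_samples).query(p_samples, k + 1)[0][:, -1]
    nu = cKDTree(q_samples).query(p_samples, k)[0]
    if k > 1:
        nu = nu[:, -1]
    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))

rng = np.random.default_rng(6)
p = rng.normal(0.0, 1.0, (2000, 5))  # P = N(0, I) in 5 dimensions
q = rng.normal(0.5, 1.0, (2000, 5))  # Q = N(0.5 * 1, I)
print(kl_knn(p, q))  # true KL = 5 * 0.5**2 / 2 = 0.625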

Discriminative Learning of Dynamical Systems for Motion Tracking

Minyoung Kim and Vladimir Pavlovic

We introduce novel discriminative learning algorithms for dynamical systems. Models such as Conditional Random Fields or Maximum Entropy Markov Models outperform generative Hidden Markov Models in sequence tagging problems in discrete domains. However, continuous state domains introduce a set of constraints that can prevent direct application of these traditional models. Instead, we propose learning generative dynamic models with discriminative cost functionals. For Linear Dynamical Systems, the proposed methods provide significantly lower prediction error than the standard maximum likelihood estimator, often comparable to nonlinear models. As a result, models that have lower representational capacity but are computationally more tractable than nonlinear models can be used for accurate and efficient state estimation. We evaluate the generalization performance of our methods on the 3D human pose tracking problem from monocular videos. The experiments indicate that discriminative learning can lead to improved accuracy of pose estimation with no increase in the computational cost of tracking.

Closed-Loop Tracking and Change Detection in Multi-Activity Sequences

Bi Song, Namrata Vaswani, and Amit Roy-Chowdhury

We present a novel framework for tracking a long sequence of human activities, including the time instances of change from one activity to the next, using a closed-loop, non-linear dynamical feedback system. A composite feature vector describing the shape, color and motion of the objects, and a non-linear, piecewise stationary, stochastic dynamical model describing its spatio-temporal evolution, are used for tracking. The tracking error or expected log likelihood, which serves as a feedback signal, is used to automatically detect changes and switch between activities happening one after another in a long video sequence. Whenever a change is detected, the tracker is reinitialized automatically by comparing the input image with learned models of the activities. Unlike some other approaches that can track a sequence of activities, we do not need to know the transition probabilities between the activities, which can be difficult to estimate in many application scenarios. We demonstrate the effectiveness of the method on multiple indoor and outdoor real-life videos and analyze its performance.

Detection and segmentation of moving objects in highly dynamic scenes

Aurélie Bugeau and Patrick Pérez

Detecting and segmenting moving objects in dynamic scenes is a hard but essential task in a number of applications such as surveillance. Most existing methods only give good results in the case of persistent or slowly changing background, or if both the objects and the background are rigid. In this paper, we propose a new method for direct detection and segmentation of foreground moving objects in the absence of such constraints. First, groups of pixels having similar motion and photometric features are extracted. For this first step, only a sub-grid of image pixels is used to reduce computational cost and improve robustness to noise. We introduce the use of p-values to validate optical flow estimates and of automatic bandwidth selection in the mean shift clustering algorithm. In a second stage, segmentation of the object associated with a given cluster is performed in a MAP/MRF framework. Our method is able to handle a moving camera and several different motions in the background. Experiments on challenging sequences show the performance of the proposed method and its utility for video analysis in complex scenes.

Stereo 2

Real-Time Plane-Sweeping Stereo with Multiple Sweeping Directions

David Gallup, Jan-Michael Frahm, Philippos Mordohai, Qingxiong Yang, and Marc Pollefeys

Recent research has focused on systems for obtaining automatic 3D reconstructions of urban environments from video acquired at street level. These systems record enormous amounts of video; therefore a key component is a stereo matcher which can process this data at speeds comparable to the recording frame rate. Furthermore, urban environments are unique in that they exhibit mostly planar surfaces. These surfaces, which are often imaged at oblique angles, pose a challenge for many window-based stereo matchers which suffer in the presence of slanted surfaces. We present a multi-view plane-sweep-based stereo algorithm which correctly handles slanted surfaces and runs in real-time using the graphics processing unit (GPU). Our algorithm consists of (1) identifying the scene's principal plane orientations, (2) estimating depth by performing a plane-sweep for each direction, and (3) combining the results of each sweep. The latter can optionally be performed using graph cuts. Additionally, by incorporating priors on the locations of planes in the scene, we can increase the quality of the reconstruction and reduce computation time, especially for uniform textureless surfaces. We demonstrate our algorithm on a variety of scenes and show the improved accuracy obtained by accounting for slanted surfaces.
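
The core sweep is easiest to see in the fronto-parallel special case, where hypothesized planes reduce to disparities on rectified images; the toy winner-take-all version below builds a cost volume of window-averaged absolute differences (the paper additionally sweeps slanted plane orientations, warps on the GPU, and can fuse sweeps with graph cuts):

import numpy as np
from scipy.ndimage import uniform_filter

def disparity_sweep(left, right, max_disp=16, win=5):
    """Fronto-parallel sweep on rectified stereo: one cost slice per
    hypothesized disparity, then per-pixel winner-take-all."""
    h, w = left.shape
    cost = np.full((max_disp, h, w), np.inf)
    for d in range(max_disp):
        diff = np.abs(left[:, d:] - right[:, :w - d])    # shift and compare
        cost[d, :, d:] = uniform_filter(diff, size=win)  # windowed SAD
    return cost.argmin(axis=0)

rng = np.random.default_rng(7)
right_img = rng.random((60, 80))
left_img = np.roll(right_img, 6, axis=1)  # toy scene at constant disparity 6
disp = disparity_sweep(left_img, right_img)
print(np.bincount(disp.ravel()).argmax())  # dominant disparity: 6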

Accurate, Dense, and Robust Multi-View Stereopsis

Yasutaka Furukawa and Jean Ponce

This paper proposes a novel algorithm for calibrated multi-view stereopsis that outputs a (quasi) dense set of rectangular patches covering the surfaces visible in the input images. This algorithm does not require any initialization in the form of a bounding volume, and it automatically detects and discards outliers and obstacles. It does not perform any smoothing across nearby features, yet is currently the top performer in terms of both coverage and accuracy for four of the six benchmark datasets presented in [Seitz et al. 2006]. The keys to its performance are effective techniques for enforcing local photometric consistency and global visibility constraints. Stereopsis is implemented as a match, expand, and filter procedure, starting from a sparse set of matched keypoints, and repeatedly expanding these to nearby pixel correspondences before using visibility constraints to filter away false matches. A simple but effective method for turning the resulting patch model into a mesh appropriate for image-based modeling is also presented. The proposed approach is demonstrated on various datasets including objects with fine surface details, deep concavities, and thin structures, outdoor scenes observed from a restricted set of viewpoints, and “crowded” scenes where moving obstacles appear in different places in multiple images of a static structure of interest.

Quasi-dense wide baseline matching using match propagation

Juho Kannala and Sami S. Brandt

In this paper we propose extensions to the match propagation algorithm, a technique for computing quasi-dense point correspondences between two views. The extensions make match propagation applicable to wide baseline matching, i.e., to cases where the camera pose can vary a lot between the views. Our first extension is to use a local affine model for the geometric transformation between the images. The estimate of the local transformation is obtained from affine covariant interest regions, which are used as seed matches. The second extension is to use the second-order intensity moments to adapt the current estimate of the local affine transformation during the propagation. This allows a single seed match to propagate into regions where the local transformation between the views differs from the initial one. Experiments with real data show that the proposed techniques improve both the quality and coverage of the quasi-dense disparity map.

Evaluation of Cost Functions for Stereo Matching

Heiko Hirschmüller and Daniel Scharstein

Stereo correspondence methods rely on matching costs for computing the similarity of image locations. In this paper we evaluate the insensitivity of different matching costs with respect to radiometric variations of the input images. We consider both pixel-based and window-based variants and measure their performance in the presence of global intensity changes (e.g., due to gain and exposure differences), local intensity changes (e.g., due to vignetting, non-Lambertian surfaces, and varying lighting), and noise. Using existing stereo datasets with ground-truth disparities as well as six new datasets taken under controlled changes of exposure and lighting, we evaluate the different costs with a local, a semi-global, and a global stereo method.
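
To illustrate what radiometric (in)sensitivity means for a matching cost, the sketch below compares a pixel-wise absolute-difference cost with window-based normalized cross-correlation under a synthetic gain-and-offset change; these are generic textbook costs, not the paper's full evaluation protocol:

import numpy as np

def ad_cost(a, b):
    """Mean absolute difference: sensitive to radiometric changes."""
    return np.abs(a - b).mean()

def ncc_cost(a, b):
    """Negative normalized cross-correlation: invariant to gain/offset."""
    a, b = a - a.mean(), b - b.mean()
    return -np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(8)
patch = rng.random((9, 9))
brighter = 1.5 * patch + 0.1  # global gain and exposure change
print(ad_cost(patch, brighter))   # grows with the radiometric change
print(ncc_cost(patch, brighter))  # stays at -1: a perfect match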

Graph Cut Based Optimization for MRFs with Truncated Convex Priors

Olga Veksler

Optimization with graph cuts has become very popular in recent years. Progress in problems such as stereo correspondence and image segmentation can be attributed, in part, to the development of efficient graph cut based optimization. A recent evaluation of optimization techniques shows that the popular expansion and swap graph cut algorithms perform extremely well for energies where the underlying MRF has the Potts prior, which corresponds to the assumption that the true labeling is piecewise constant. For more general priors, however, such as those corresponding to a piecewise smoothness assumption, both swap and expansion algorithms do not perform as well. We develop several optimization algorithms for truncated convex priors, which imply a piecewise smoothness assumption. Both expansion and swap algorithms are based on moves that give each pixel a choice of only two labels. Our insight is that to obtain a good approximation under the piecewise smoothness assumption, a pixel should have a choice among more than two labels. We develop new “range” moves which act on a larger set of labels than the expansion and swap algorithms. We evaluate our method on problems of image restoration, inpainting, and stereo correspondence. Our results show that we obtain better answers, both in terms of the energy, which is the direct goal, and in terms of accuracy, which is an indirect but more important goal.

A Direct and Efficient Method for Piecewise-Planar Surface Reconstruction from Stereo Images

Shigeki Sugimoto and Masatoshi Okutomi

In this paper, we propose a direct method for 3D surface reconstruction from stereo images. We reconstruct a 3D surface by estimating all depths of the vertices of a mesh composed of piecewise triangular patches on the reference (template) image. We assume that the deformation of the mesh between the stereo images is specified by homographies, each of which represents the deformation of a single patch. The homography that deforms each patch has three degrees of freedom under epipolar constraints. We first formulate a fast direct method for estimating the three parameters of a 3D plane by incorporating an inverse compositional expression in the SSD (Sum of Squared Differences) function of two stereo images. This method is about three times faster than a conventional method. Then we extend the direct method to the estimation of the vertex depths in the mesh for reconstructing piecewise-planar surfaces. The validity of the proposed method is shown by experiments using synthetic and real images.

Document Processing

Multi-View Document Rectification using Boundary

Yau-Chat Tsoi and Michael Brown

We present a new technique that uses multiple images of bound and folded documents to rectify the imaged content such that it appears flat and photometrically uniform. Our approach works from a sparse set of uncalibrated views of the document which are mapped to a canonical coordinate frame using the document's boundary. A composite image is constructed from these canonical views that significantly reduces the effects of depth distortion without the blurring artifacts that are problematic in single-image approaches. In addition, we propose a new technique to estimate illumination variation in the individual images, allowing the final composited content to be photometrically rectified. Our approach is straightforward, robust, and produces good results.

Handwritten Carbon Form Preprocessing Based on Markov Random Field

Huaigu Cao and Venu Govindaraju

This paper proposes a statistical approach to degraded handwritten form image preprocessing, including binarization and form line removal. The degraded image is modeled by a Markov Random Field (MRF), where the prior is learnt from a training set of high-quality binarized images and the probabilistic density is learnt on-the-fly from the gray-level histogram of the input image. We also modify the MRF model to implement form line removal. Test results of our approach show excellent performance on a data set of handwritten carbon form images.

Combining Static Classifiers and Class Syntax Models for Logical Entity Recognition in Scanned Historical Documents

Song Mao, Praveer Mansukhani, and George Thoma

Class syntax can be used to 1) model the temporal or locational evolution of the class labels of feature observation sequences, 2) correct classification errors of static classifiers when feature observations from different classes overlap in feature space, and 3) eliminate redundant features whose discriminative information is already represented in the class syntax. In this paper, we describe a novel method that combines static classifiers with class syntax models for supervised feature subset selection and classification in unified algorithms. Posterior class probabilities given feature observations are first estimated from the output of static classifiers, and then integrated into a parsing algorithm to find an optimal class label sequence for the given feature observation sequence. Finally, both static classifiers and class syntax models are used to search for an optimal subset of features. An optimal feature subset, associated static classifiers, and class syntax models are all learned from training data. We apply this method to logical entity recognition in scanned historical U.S. Food and Drug Administration (FDA) documents containing court case Notices of Judgments (NJs) of different layout styles, and show that the use of class syntax models not only corrects most classification errors of static classifiers, but also significantly reduces the dimensionality of feature observations with negligible impact on classification performance.

Towards Automatic Photometric Correction of Casually Illuminated Documents

George V. Landon, Yun Lin, and W. Brent Seales

Creating uniform lighting for archival-quality document acquisition remains a non-trivial problem. We propose a novel method for automatic photometric correction of non-planar documents by estimating a single point light-source using a simple light probe. By adding a simple piece of folded white paper with a known 3D surface to a scene, we are able to extract the 3D position of a light source, automatically perform white balance correction, and determine areas of poor illumination. Furthermore, this method is designed to fit into an existing document digitization pipeline. To justify our claims, we provide an accuracy analysis of our correction technique using simulated ground-truth data, which allows individual sources of error to be determined and compared. These techniques are then applied to real documents that have been acquired using a 3D scanner.

Multi-scale Structural Saliency for Signature Detection

Guangyu Zhu, Yefeng Zheng, David Doermann, and Stefan Jaeger

Detecting and segmenting free-form objects from cluttered backgrounds is a challenging problem in computer vision. Signature detection in document images is one classic example for which no reasonable solutions have yet been presented. In this paper, we propose a novel multi-scale approach to jointly detecting and segmenting signatures from documents with diverse layouts and complex backgrounds. Rather than focusing on local features that typically have large variations, our approach aims to capture the structural saliency of a signature by searching over multiple scales. This detection framework is general and computationally tractable. We present a saliency measure based on a signature production model that effectively quantifies the dynamic curvature of 2-D contour fragments. Our evaluation using large real-world collections of handwritten and machine-printed documents demonstrates the effectiveness of this joint detection and segmentation approach.

Miscellaneous Applications

Imaging the Finger Force Direction

Yu Sun, John Hollerbach, and Stephen Mascaro

This paper presents a method of imaging the coloration pattern in the fingernail and surrounding skin to infer fingertip force direction during planar contact. Nail images from 7 subjects were registered to reference images with RANSAC and then warped to an atlas with elastic registration. Recognition of fingertip force direction, based on Linear Discriminant Analysis, shows that there are common color pattern features in the fingernail and surrounding skin for different subjects. Based on the common features, the overall recognition accuracy is 92%.

Multi-scale Features for Detection and Segmentation of Rocks in Mars Images

Heather Dunlop, David Thompson, and David Wettergreen

Geologists and planetary scientists will benefit from methods for accurate segmentation of rocks in natural scenes. However, rocks are poorly suited for current segmentation techniques --- they exhibit diverse morphologies but have no uniform property to distinguish them from background soil. We address this challenge with a novel detection and segmentation method incorporating features from multiple scales. These features include local attributes such as texture, object attributes such as shading and 2D shape, and scene attributes such as the direction of illumination. Our method uses a superpixel segmentation followed by region-merging to search for the most probable groups of superpixels. A learned model of rock appearances identifies whole rocks by scoring candidate superpixel groups. We evaluate our method's performance on representative images from the Mars Exploration Rover catalog.

Consistent Temporal Variations in Many Outdoor Scenes

Nathan Jacobs, Nathaniel Roman, and Robert Pless

This paper details an empirical study of large image sets taken by static cameras. These images have consistent correlations over the entire image and over time scales of days to months. Simple second-order statistics of such image sets show vastly more structure than exists in generic natural images or video from moving cameras. Using a slight variant of PCA, we can decompose all cameras into comparable components and annotate images with respect to surface orientation, weather, and seasonal change. Experiments are based on a data set from 538 cameras across the United States which have collected more than 17 million images over the last 6 months.
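
A toy version of the decomposition: stack the images from one camera as rows of a matrix and run PCA; when a shared cause such as the diurnal lighting cycle dominates, the first components capture exactly such consistent temporal variations. The data below are synthetic stand-ins, not the authors' webcam archive:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(11)
hours = np.arange(500)
daylight = np.sin(2 * np.pi * hours / 24)[:, None]  # diurnal lighting cycle
layout = rng.random((1, 32 * 32))                   # fixed scene appearance
images = daylight * layout + 0.05 * rng.standard_normal((500, 32 * 32))

pca = PCA(n_components=3).fit(images)
print(pca.explained_variance_ratio_)  # first component dominates: lighting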

Towards Fog-Free In-Vehicle Vision Systems through Contrast Restoration

Nicolas Hautiere, Jean-Philippe Tarel, and Didier Aubert

In foggy weather, the contrast of images grabbed by in-vehicle cameras in the visible light range is drastically degraded, which makes current applications very sensitive to weather conditions. An onboard vision system should take fog effects into account. The effects of fog vary across the scene and are exponential with respect to the depth of scene points. Because, unlike fixed-camera surveillance, it is not possible in this context to compute the road scene structure beforehand, a new scheme is proposed: weather conditions are first estimated and then used to restore the contrast according to a scene structure which is inferred a priori and refined during the restoration process. Depending on the target application, different algorithms of increasing complexity are proposed. Results are presented on sample road scenes under foggy weather and assessed by computing the contrast before and after restoration.
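
The attenuation model underlying such restoration is Koschmieder's law: the observed intensity is I = J*t + A*(1 - t), with transmission t = exp(-beta*d) for scene radiance J, airlight A, and depth d. A minimal inversion sketch, assuming beta, the airlight, and a flat-road depth prior are all known (the paper instead estimates the weather conditions and refines the scene structure during restoration):

import numpy as np

def restore_contrast(img, depth, beta, airlight=1.0):
    """Invert I = J * t + A * (1 - t), t = exp(-beta * depth), for J."""
    t = np.exp(-beta * depth)
    return (img - airlight) / t + airlight

rng = np.random.default_rng(12)
scene = rng.random((40, 40))                       # fog-free radiance
depth = np.tile(np.linspace(5, 100, 40), (40, 1))  # flat-road depth prior
t = np.exp(-0.02 * depth)
foggy = scene * t + 1.0 * (1 - t)                  # simulate fog
print(np.abs(restore_contrast(foggy, depth, 0.02) - scene).max())  # ~0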

Enhancement 2: Noise

Removal of Image Artifacts Due to Sensor Dust

Changyin Zhou and Stephen Lin

Image artifacts that result from sensor dust are a common but annoying problem for many photographers. To reduce the appearance of dust in an image, we first formulate a model of artifact formation due to sensor dust. With this artifact formation model, we make use of contextual information in the image and a color consistency constraint on dust to remove these artifacts. When multiple images are available from the same camera, even under different camera settings, this approach can also be used to reliably detect dust regions on the sensor. In contrast to image inpainting or other hole-filling methods, the proposed technique utilizes image information within a dust region to guide the use of contextual data. Joint use of these multiple cues leads to image recovery results that are not only visually pleasing, but also faithful to the actual scene. The effectiveness of this method is demonstrated in experiments with various cameras.

Spatio-Temporal Markov Random Field for Video Denoising

Jia Chen and Chi-Keung Tang

This paper presents a novel spatio-temporal Markov random field (MRF) for video denoising. Two main issues are addressed in this paper, namely, the estimation of the noise model and the proper use of motion estimation in the denoising process. Unlike previous algorithms which estimate the level of noise, our method learns the full noise distribution nonparametrically, which serves as the likelihood model in the MRF. Instead of using deterministic motion estimation to align pixels, we set up a temporal likelihood by combining a probabilistic motion field with the learned noise model. The prior of this MRF is modeled by piecewise smoothness. The main advantage of the proposed spatio-temporal MRF is that it integrates spatial and temporal information adaptively into a statistical inference framework, where the posterior is optimized using graph cuts with alpha expansion. We demonstrate the performance of the proposed approach on benchmark data sets and real videos to show the advantages of our algorithm compared with previous single-frame and multi-frame algorithms.

Variational Distance-Dependent Image Restoration

Ran Kaftory, Yoav Schechner, and Yehoshua Zeevi

There is a need to restore color images that suffer from distance-dependent degradation during acquisition. This occurs, for example, when imaging through scattering media, where signal attenuation worsens with the distance of an object from the camera. A ‘naive’ restoration may attempt to restore the image by amplifying the signal in each pixel according to the distance of its corresponding object. This, however, would amplify the noise in a non-uniform manner. Moreover, standard space-invariant denoising over-blurs nearby objects (which have low noise) or insufficiently smooths distant objects (which are very noisy). We present a variational method to overcome this problem. It uses a regularization operator which is distance dependent, in addition to being edge-preserving and color-channel coupled. Minimizing this functional results in a scheme of reconstruction-while-denoising. It preserves important features, such as the texture of nearby objects and the edges of distant ones. A restoration algorithm is presented for reconstructing color images taken through haze. The algorithm also restores the path radiance, which is equivalent to the distance map. We demonstrate the approach experimentally.

Handwriting and Faces

Offline Signature Verification Using Online Handwriting Registration

Yu Qiao, Jianzhuang Liu, and Xiaoou Tang

This paper proposes a novel framework for offline signature verification. Different from previous methods, our approach makes use of online handwriting instead of handwritten images for registration. The online registrations enable robust recovery of the writing trajectory from an input offline signature and thus allow effective shape matching between registration and verification signatures. In addition, we propose several new techniques to improve the performance of the new signature verification system: 1) we formulate and solve the recovery of the writing trajectory within the framework of Conditional Random Fields; 2) we propose a new shape descriptor, online context, for aligning signatures; and 3) we develop a verification criterion which combines the duration and amplitude variances of handwriting. Experiments on a benchmark database show that the proposed method significantly outperforms well-known offline signature verification methods and achieves performance comparable to online signature verification methods.

Skin Detail Analysis for Face Recognition

Jean-Sébastien Pierrard and Thomas Vetter

This paper presents a novel framework to localize prominent irregularities in facial skin, in particular nevi (moles, birthmarks), in a photograph. Their characteristic configuration over a face is used to encode the person's identity independent of pose and illumination. This approach extends conventional recognition methods, which usually disregard such small-scale variations and thereby miss potentially highly discriminative features. Our system detects potential nevi with a very sensitive multi-scale template matching procedure. The candidate points are filtered according to their discriminative potential, using two complementary methods. One is a novel skin segmentation scheme based on gray-scale texture analysis that we developed to perform outlier detection in the face. Unlike most other skin detection/segmentation methods, it does not require color input. The second is a local saliency measure to express a point's uniqueness and confidence, taking the neighborhood's texture characteristics into account. We experimentally evaluate the suitability of the detected features for identification under different poses and illumination on a subset of the FERET face database.

Generic Face Alignment using Boosted Appearance Model

Xiaoming Liu

This paper proposes a discriminative framework for efficiently aligning images. Although conventional Active Appearance Model (AAM)-based approaches have achieved some success, they suffer from the generalization problem, i.e., how to align any image with a generic model. We treat the iterative image alignment problem as a process of maximizing the score of a trained two-class classifier that is able to distinguish correct alignment (positive class) from incorrect alignment (negative class). During the modeling stage, given a set of images with ground truth landmarks, we train a conventional Point Distribution Model (PDM) and a boosting-based classifier, which we call a Boosted Appearance Model (BAM). When tested on an image with initial landmark locations, the proposed algorithm iteratively updates the shape parameters of the PDM via gradient ascent such that the classification score of the warped image is maximized. The proposed framework is applied to the face alignment problem. Through extensive experimentation, we show that, compared to the AAM-based approach, this framework greatly improves the robustness, accuracy, and efficiency of face alignment, especially for unseen data.

Poster session 2

Recognition and Detection 3

Fisher Kernels on Visual Vocabularies for Image Categorization

Florent Perronnin and Christopher Dance

Within the field of pattern classification, the Fisher kernel is a powerful framework which combines the strengths of generative and discriminative approaches. The idea is to characterize a signal with a gradient vector derived from a generative probability model and to subsequently feed this representation to a discriminative classifier. We propose to apply this framework to image categorization where the input signals are images and where the underlying generative model is a visual vocabulary: a Gaussian mixture model which approximates the distribution of low-level features in images. We show that Fisher kernels can actually be understood as an extension of the popular bag-of-visterms. Our approach demonstrates excellent performance on two challenging databases: an in-house database of 19 object/scene categories and the recently released VOC 2006 database. It is also very practical: it has low computational needs both at training and test time and vocabularies trained on one set of categories can be applied to another set without any significant loss in performance.
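
The construction is concrete for the most common case, the gradient with respect to the GMM means; the sketch below computes that Fisher-vector block on toy features, using the standard diagonal-covariance normalization rather than the paper's exact one:

import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector_means(descriptors, gmm):
    """Normalized gradient of the mean log-likelihood w.r.t. GMM means."""
    n = len(descriptors)
    gamma = gmm.predict_proba(descriptors)  # (n, K) responsibilities
    blocks = []
    for k in range(gmm.n_components):
        diff = (descriptors - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])
        g = (gamma[:, k:k + 1] * diff).sum(axis=0)
        blocks.append(g / (n * np.sqrt(gmm.weights_[k])))
    return np.concatenate(blocks)

rng = np.random.default_rng(9)
train = rng.standard_normal((2000, 16))  # toy low-level features
gmm = GaussianMixture(n_components=8, covariance_type='diag').fit(train)
image_features = rng.standard_normal((300, 16))
print(fisher_vector_means(image_features, gmm).shape)  # (8 * 16,) = (128,)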

General Purpose Object Detection: A Spectral Residual Approach

Xiaodi Hou and Liqing Zhang

The ability of the human visual system to detect visual saliency is extraordinarily fast and reliable. However, computational modeling of this basic intelligent behavior still remains a challenge. This paper presents a simple method for visual saliency detection. Our model is independent of features, categories, or other forms of prior knowledge about the objects. By analyzing the log-spectrum of an input image, we extract the spectral residual of the image in the spectral domain, and propose a fast method to construct the corresponding saliency map in the spatial domain. We test this model on both natural pictures and artificial images such as psychological patterns. The results indicate that our method detects saliency quickly and robustly.
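
The method is short enough to sketch in full: take the log amplitude spectrum, subtract a local average to isolate the spectral residual, and invert with the original phase; the minimal single-scale reimplementation below follows that description on a toy image:

import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def spectral_residual_saliency(img):
    """Saliency map from the spectral residual of the log amplitude."""
    f = np.fft.fft2(img)
    log_amp = np.log(np.abs(f) + 1e-8)
    phase = np.angle(f)
    residual = log_amp - uniform_filter(log_amp, size=3)  # drop smooth part
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    return gaussian_filter(sal, sigma=2)

rng = np.random.default_rng(10)
img = 0.1 * rng.random((64, 64))
img[20:30, 20:30] = 1.0  # a salient block on a noise background
sal = spectral_residual_saliency(img)
print(np.unravel_index(sal.argmax(), sal.shape))  # peak near the block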

Partially Occluded Object-Specific Segmentation in View-Based Recognition

Min Su Cho and Kyoung Mu Lee

We present a novel object-specific segmentation method which can be used in view-based object recognition systems. Previous object segmentation approaches generate inexact results, especially in partially occluded and cluttered environments, because their top-down strategies fail to explain the details of various specific objects. On the contrary, our segmentation method efficiently exploits the information of the matched model views in view-based recognition, because the model view aligned to the input image can serve as the best top-down cue for object segmentation. In this paper, we cast the problem of partially occluded object segmentation as that of labelling displacement and foreground status simultaneously for each pixel between the aligned model view and an input image. The problem is formulated by a maximum a posteriori Markov random field (MAP-MRF) model which minimizes a particular energy function. Our method overcomes complex occlusion and clutter and provides accurate segmentation boundaries by incorporating a bottom-up segmentation cue. We demonstrate its efficiency and robustness by experimental results on various objects in occluded and cluttered environments.

Towards Scalable Representations of Object Categories: Learning a Hierarchy of Parts

Sanja Fidler and Aleš Leonardis

This paper proposes a novel approach to constructing a hierarchical representation of visual input that aims to enable recognition and detection of a large number of object categories. Inspired by the principles of efficient indexing (bottom-up), robust matching (top-down), and ideas of compositionality, our approach learns a hierarchy of spatially flexible compositions, i.e. parts, in an unsupervised, statistics-driven manner. Starting with simple, frequent features, we learn the statistically most significant compositions (parts composed of parts), which consequently define the next layer. Parts are learned sequentially, layer after layer, optimally adjusting to the visual data. Lower layers are learned in a category-independent way to obtain complex, yet sharable visual building blocks, which is a crucial step towards a scalable representation. Higher layers of the hierarchy, on the other hand, are constructed by using specific categories, achieving a category representation with a small number of highly generalizable parts that gained their structural flexibility through composition within the hierarchy. Built in this way, new categories can be efficiently and continuously added to the system by adding a small number of parts only in the higher layers. The approach is demonstrated on a large collection of images and a variety of object categories. Detection results confirm the effectiveness and robustness of the learned parts.

Discriminative Cluster Refinement: Improving Object Category Recognition Given Limited Training Data

Liu Yang, Rong Jin, Caroline Pantofaru, and Rahul Sukthankar

A popular approach to problems in image classification is to cluster high-dimensional features into a set of visual words, represent the image as a bag of words, and then employ a classifier, such as a support vector machine (SVM), to categorize the image. This idea has recently shown promising results for applications in object category recognition and content-based video retrieval. Unfortunately, a significant shortcoming of this approach is that the clustering and classification are disconnected. Since the clustering into visual words is unsupervised (e.g., k-means), the representation does not necessarily capture the aspects of the data that are most useful for classification. More seriously, the semantic relationship between clusters is lost, causing the overall classification performance to suffer. This problem becomes much more significant when the number of training examples is small and the approach is inadequate for estimating the association between classes and a large number of clusters. We introduce “discriminative cluster refinement” (DCR), a method that explicitly models the pairwise relationships between different visual words by exploiting their co-occurrence information. Furthermore, the assigned class labels are used to identify the co-occurrence patterns that are most informative for object classification. DCR employs a maximum-margin approach to generate an optimal kernel matrix for classification by extending the theory of the support vector machine. One important benefit of DCR is that it integrates smoothly into existing bag-of-words information retrieval systems by employing the set of visual words generated by any clustering method. While DCR could improve a broad class of information retrieval systems, this paper focuses on object category recognition. We present a direct comparison with a state-of-the-art method on the PASCAL 2006 database and show that cluster refinement results in a significant improvement in classification accuracy given a small number of training examples.

Using segmentation to verify object hypotheses

Deva Ramanan

We present an approach for object recognition that combines detection and segmentation within an efficient hypothesize/test framework. Scanning-window template classifiers are the state-of-the-art for many object classes such as faces, cars, and pedestrians. Such approaches, though quite successful, can be hindered by their lack of explicit encoding of object shape/structure -- one might, for example, find faces in trees. We adopt the following strategy: we first use these systems as attention mechanisms, generating many possible object locations by tuning them for low missed-detections and high false-positives. At each hypothesized detection, we compute a local figure-ground segmentation using a window of slightly larger extent than that used by the classifier. We learn offline from training data those segmentations that are consistent with true positives. We then prune away those hypotheses with bad segmentations. We show this strategy leads to significant improvements (10-20%) over established approaches such as Viola-Jones and Dalal-Triggs on a variety of benchmark datasets including the PASCAL challenge, LabelMe, and the INRIA Person dataset.

Semantic Hierarchies for Visual Object Recognition

Marcin Marszalek and Cordelia Schmid

In this paper we propose to use lexical semantic networks to extend state-of-the-art object recognition techniques. We use the semantics of image labels to integrate prior knowledge about inter-class relationships into the visual appearance learning. We show how to build and train a semantic hierarchy of discriminative classifiers and how to use it to perform object detection. We evaluate how our approach influences classification accuracy and speed on the PASCAL VOC challenge 2006 dataset, a set of challenging real-world images. We also demonstrate additional features that become available to object recognition due to the extension with semantic inference tools: we can classify high-level categories, such as animals, and we can train part detectors, for example a window detector, by pure inference in the semantic network.

Discriminant Additive Tangent Spaces for Object Recognition

Liang Xiong, Jianguo Li, and Changshui Zhang

Pattern variation is a major factor that affects the performance of recognition systems. In this paper, a novel manifold tangent modeling method called Discriminant Additive Tangent Spaces (DATS) is proposed for invariant pattern recognition. In DATS, the intra-class variations used in traditional tangent learning are called positive tangent samples. In addition, extra-class variations are introduced as negative tangent samples. We use log-odds to measure the significance of samples being positive or negative, and then directly characterize this log-odds using generalized additive models (GAM). The model is estimated to maximally discriminate positive and negative samples. Moreover, since the traditional GAM fitting algorithm cannot handle the high-dimensional data encountered in visual recognition tasks, we also present an efficient, sparse solution for GAM estimation. The resulting DATS is a nonparametric discriminant model based on quite weak prior hypotheses, hence it can depict various pattern variations effectively. Experiments demonstrate the effectiveness of our method in several recognition tasks.

Detector Ensemble

Shengyang Dai, Ming Yang, Ying Wu, and Aggelos Katsaggelos

Component-based detection methods have demonstrated their promise by integrating a set of part-detectors to deal with large appearance variations of the target. However, an essential and critical issue, i.e., how to handle the imperfection of part-detectors in the integration, is not well addressed in the literature. This paper proposes a detector ensemble model that consists of a set of substructure-detectors, each of which is composed of several part-detectors. Two important issues are studied both in theory and in practice: (1) finding an optimal detector ensemble, and (2) detecting targets based on an ensemble. Based on a theoretical analysis, a new model selection strategy is proposed to learn an optimal detector ensemble that has a minimum number of false positives and satisfies the design requirement on the capacity for tolerating missing parts. In addition, this paper links ensemble-based detection to inference in a Markov random field, and shows that target detection can be done by a max-product belief propagation algorithm.

Joint Real-time Object Detection and Pose Estimation Using Probabilistic Boosting Network

Jingdan Zhang, Shaohua Kevin Zhou, Leonard McMillan, and Dorin Comaniciu

In this paper, we present a learning procedure called the probabilistic boosting network (PBN) for joint real-time object detection and pose estimation. Grounded in the law of total probability, PBN integrates evidence from two building blocks, namely a multiclass boosting classifier for pose estimation and a boosted detection cascade for object detection. By inferring the pose parameter, we avoid exhaustive scanning over pose, which would hamper real-time performance. In addition, we need only one integral image/volume, with no need for image/volume rotation. We implement PBN using a graph-structured network that alternates the two tasks of foreground/background discrimination and pose estimation in order to reject negatives as quickly as possible. Compared with previous approaches, we gain accuracy in object localization and pose estimation while noticeably reducing the computation. We invoke PBN to detect the left ventricle from a 3D ultrasound volume, processing about 10 volumes per second, and the left atrium from 2D images in real time.
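
The law of total probability underlying PBN can be written as p(object | x) = sum over theta of p(object | theta, x) p(theta | x), with theta the pose parameter. A toy sketch of this combination, with hypothetical classifier callables:

# Combine a multiclass pose classifier with pose-specific detector cascades.
import numpy as np

def joint_detect(x, pose_classifier, cascades):
    pose_probs = pose_classifier(x)                 # p(pose_i | x), sums to 1
    det_probs = np.array([c(x) for c in cascades])  # p(object | pose_i, x)
    p_object = float(pose_probs @ det_probs)        # law of total probability
    best_pose = int(np.argmax(pose_probs * det_probs))
    return p_object, best_pose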

Faces and Biometrics 3

A New Performance Evaluation Method for Face Identification - Regression Analysis of Misidentification Risk

Wai Ho and Paul Watters

The performance of a face identification system varies with its enrollment size. However, most experiments evaluate the performance of algorithms at only one enrollment size, using the rank-1 identification rate. This practice does not demonstrate the usability of algorithms thoroughly. The problem is that, in order to measure identification performance at different sizes, experimenters have to repeat the evaluation with samples of those sizes, which is almost impossible when they are large. Approaches using the Binomial theorem with match and non-match scores have been proposed to estimate performance at different sizes, but as a separate process from the evaluation itself. This paper presents a new way of evaluating identification algorithms that allows performance at different sizes to be estimated and compared, using regression analysis of misidentification risk.

3D Face Recognition in the Presence of Expression: A Guidance-based Constraint Deformation Approach

Yueming Wang, Gang Pan, and Zhaohui Wu

Three-dimensional human face recognition in the presence of expression is a major challenge, since the shape distortion caused by facial expression greatly weakens rigid matching. This paper proposes a guidance-based constraint deformation (GCD) model to cope with the shape distortion caused by expression. The basic idea is that the face model with a non-neutral expression is deformed toward its neutral one under a certain constraint, so that the distortion is reduced while inter-class discriminative information is preserved. The GCD model exploits the neutral 3D face shape to guide the deformation, while applying a rigid constraint on it. Both steps are smoothly unified in the Poisson equation framework. The GCD approach needs only one neutral model for each person in the gallery. The experimental results, carried out on the large 3D face database FRGC v2.0, demonstrate that our method significantly outperforms the ICP method in both identification and authentication modes. This shows the GCD model is promising for coping with shape distortion in 3D face recognition.

A Unified Probabilistic Framework for Facial Activity Modeling and Understanding

Yan Tong, Wenhui Liao, Zheng Xue, and Qiang Ji

Facial activities are the most natural and powerful means of human communication. Spontaneous facial activity is characterized by rigid head movements, non-rigid facial muscular movements, and their interactions. Current research in facial activity analysis is limited to recognizing rigid or non-rigid motion separately, often ignoring their interactions. Furthermore, although some methods analyze the temporal properties of facial features during facial feature extraction, they often recognize the facial activity statically, ignoring its dynamics. In this paper, we propose to explicitly exploit prior knowledge about facial activities and systematically combine it with image measurements to achieve accurate, robust, and consistent facial activity understanding. Specifically, we propose a unified probabilistic framework based on the dynamic Bayesian network (DBN) to simultaneously and coherently represent the rigid and non-rigid facial motions, their interactions, and their image observations, as well as to capture the temporal evolution of the facial activities. Robust computer vision methods are employed to obtain measurements of both rigid and non-rigid facial motions. Finally, facial activity recognition is accomplished through probabilistic inference by systematically integrating the visual measurements with the facial activity model.

Robust 3D Face Recognition Using Learned Visual Codebook

Cheng Zhong, Zhenan Sun, and Tieniu Tan

In this paper, we propose a novel learned visual codebook (LVC) for 3D face recognition. In our method, we first extract intrinsic discriminative information embedded in 3D faces using Gabor filters; then K-means clustering is adopted to learn centers from the filter response vectors. We construct the LVC from these learned centers. Finally, we represent 3D faces based on the LVC and achieve recognition using a nearest neighbor (NN) classifier. The novelty of this paper is twofold: (1) we are the first to apply texton-based methods to 3D face recognition; (2) we combine the efficiency of Gabor features for face recognition with the robustness of the texton strategy for texture classification. Our experiments are based on two challenging databases, the CASIA 3D face database and the FRGC 2.0 3D face database. Experimental results show that LVC performs better than many commonly used methods.
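
A sketch in the spirit of the LVC pipeline, with a placeholder filter bank and random range images standing in for real data; the actual Gabor parameters and matching details are the paper's and are not reproduced here.

# Gabor responses -> k-means codebook -> histogram representation -> 1-NN.
import numpy as np
from scipy.ndimage import convolve
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def response_vectors(depth_image, filter_bank):
    """Per-pixel response vectors: one coordinate per filter in the bank."""
    responses = [convolve(depth_image, f) for f in filter_bank]
    return np.stack(responses, axis=-1).reshape(-1, len(filter_bank))

def lvc_histogram(depth_image, filter_bank, codebook):
    words = codebook.predict(response_vectors(depth_image, filter_bank))
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

faces = [np.random.rand(64, 64) for _ in range(10)]   # placeholder range images
bank = [np.random.rand(7, 7) for _ in range(4)]       # placeholder filter bank
codebook = KMeans(n_clusters=32, n_init=5).fit(
    np.vstack([response_vectors(f, bank) for f in faces]))
X = [lvc_histogram(f, bank, codebook) for f in faces]
nn = KNeighborsClassifier(n_neighbors=1).fit(X, np.arange(10))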

Biased Manifold Embedding: A Framework for Person-Independent Head Pose Estimation

Vineeth Nallure Balasubramanian, Jieping Ye, and Sethuraman Panchanathan

The estimation of head pose angle from face images is an integral component of face recognition systems, human-computer interfaces and other human-centered computing applications. To determine the head pose, face images with varying pose angles can be considered to lie on a smooth low-dimensional manifold in a high-dimensional feature space. While manifold learning techniques capture the geometrical relationship between data points in the high-dimensional image feature space, the pose label information of the training data samples is neglected in the computation of these embeddings. In this paper, we propose a novel supervised approach to manifold-based non-linear dimensionality reduction for head pose estimation. The Biased Manifold Embedding (BME) framework is built on the idea of using the pose angle information of the face images to compute a biased neighborhood of each point in the feature space, before determining the low-dimensional embedding. The proposed BME approach is formulated as an extensible framework, and validated with the Isomap, Locally Linear Embedding (LLE) and Laplacian Eigenmap techniques. A Generalized Regression Neural Network (GRNN) is used to learn the non-linear mapping, and linear multivariate regression is finally applied in the low-dimensional space to obtain the pose angle. We tested this approach on face images of 24 individuals with pose angles varying from -90 to +90 degrees, with a granularity of 2 degrees. The results showed a substantial reduction in the error of pose angle estimation, and robustness to variations in feature spaces, dimensionality of embedding and other parameters.
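
The biasing idea can be sketched by inflating pairwise feature distances between samples whose pose labels differ before handing the matrix to a manifold learner; the multiplicative bias below is an illustrative assumption, not the paper's exact formula.

# Pose-biased neighborhoods for manifold embedding (Isomap variant shown).
import numpy as np
from sklearn.manifold import Isomap

def biased_distances(X, pose_labels, alpha=1.0):
    feat_d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    pose_d = np.abs(pose_labels[:, None] - pose_labels[None, :])
    return feat_d * (1.0 + alpha * pose_d)  # larger pose gap => weaker neighbor

X = np.random.rand(100, 50)                    # placeholder face features
poses = np.random.uniform(-90, 90, size=100)   # pose angles in degrees
D = biased_distances(X, poses)
embedding = Isomap(n_neighbors=8, n_components=2,
                   metric="precomputed").fit_transform(D)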

On Constructing Facial Similarity Maps

Alex Holub, Yun-hsueh Liu, and Pietro Perona

Automatically determining facial similarity is a difficult and open question in computer vision. The problem is complicated both because it is unclear what facial features humans use to determine facial similarity and because facial similarity is subjective in nature: similarity judgements change from person to person. In this work we suggest a system which places facial similarity on a solid computational footing. First we describe methods for acquiring facial similarity ratings from humans in an efficient manner. Next we show how to create feature vector representations for each face by extracting patches around facial key-points. Finally we show how to use the acquired similarity ratings to learn functional mappings which project facial-feature vectors into Face Spaces which correspond to our notions of facial similarity. We use different collections of images to both create and validate the Face Spaces, including: perceptual similarity data obtained from humans, morphed faces between two different individuals, and the CMU PIE collection, which contains images of the same individual under different lighting conditions. We demonstrate that using our methods we can effectively create Face Spaces which correspond to human notions of facial similarity.

A Face Annotation Framework with Partial Clustering and Interactive Labeling

Yuandong Tian, Wei Liu, Rong Xiao, Fang Wen, and Xiaoou Tang

Face annotation technology is important for a photo management system. In this paper, we propose a novel interactive face annotation framework combining unsupervised and interactive learning. There are two main contributions in our framework. In the unsupervised stage, a partial clustering algorithm is proposed to find the most evident clusters instead of grouping all instances into clusters, which leads to a good initial labeling for later user interaction. In the interactive stage, an efficient labeling procedure based on minimization of both global system uncertainty and estimated number of user operations is proposed to reduce user interaction as much as possible. Experimental results show that the proposed annotation framework can significantly reduce the face annotation workload and is superior to existing solutions in the literature.

A Minutiae-based Fingerprint Individuality Model

Jiansheng Chen and Yiu-Sang Moon

Fingerprint individuality study deals with the crucial problem of the discriminative power of fingerprints for recognizing people. In this paper, we present a novel fingerprint individuality model based on minutiae, the most commonly used fingerprint feature. The probability of a false correspondence among fingerprints from different fingers is calculated by combining the distinctiveness of the spatial locations and directions of the minutiae. To validate our model, experiments were performed using different fingerprint databases. The matching score distribution predicted by our model fits the observed experimental results satisfactorily. Compared to most previous fingerprint individuality models, our model makes a more reasonably conservative estimate of the fingerprint discriminative power, making it a powerful tool for studying fingerprint individuality as well as for the performance evaluation of fingerprint verification systems.

3D Probabilistic Feature Point Model for Object Detection and Recognition

Sami Romdhani and Thomas Vetter

This paper presents a novel statistical shape model that can be used to detect and localise feature points of a class of objects in images. The shape model is inspired by the 3D Morphable Model (3DMM) and has the property of being viewpoint invariant. This shape model is used to estimate the probability of the position of a feature point given the positions of reference feature points, accounting for the uncertainty of the position of the reference points and for the intrinsic variability of the class of objects. The viewpoint-invariant detection algorithm maximises a foreground/background likelihood ratio over the relative position of the feature points, their appearance, scale, orientation and occlusion state. Computational efficiency is obtained by using the Bellman principle and an early rejection rule based on 3D-to-2D projection constraints. Evaluations of the detection algorithm on the CMU-PIE face images and on a large set of non-face images show high levels of accuracy (zero false alarms for more than 90% detection rate). As well as locating feature points, the detection algorithm also estimates the pose of the object and a few shape parameters. It is shown that it can be used to initialise a 3DMM fitting algorithm and thus enables a fully automatic viewpoint- and lighting-invariant image analysis solution.

3D Reconstruction and Processing 2

Multiple View Image Reconstruction: A Harmonic Approach

Justin Domke and Yiannis Aloimonos

This paper presents a new constraint connecting the signals in multiple views of a surface. The constraint arises from a harmonic analysis of the geometry of the imaging process and it gives rise to a new technique for multiple view image reconstruction. Given several views of a surface from different positions, fundamentally different information is present in each image, owing to the fact that cameras measure the incoming light only after the application of a low-pass filter. Our analysis shows how the geometry of the imaging is connected to this filtering. This leads to a technique for constructing a single output image containing all the information present in the input images.

Shape from Shading Under Various Imaging Conditions

Abdelrehim Ahmed and Aly Farag

Most shape from shading (SFS) algorithms have been developed under the simplifying assumptions of a Lambertian surface, an orthographic projection, and a distant light source. Due to the difficulty of the SFS problem, only a small number of algorithms have been proposed for surfaces with non-Lambertian reflectance, and among those, only very few are applicable to surfaces with both specular and diffuse reflectance. In this paper we propose a unified framework that is capable of solving the SFS problem under various settings of imaging conditions, i.e., Lambertian or non-Lambertian reflectance, orthographic or perspective projection, and distant or nearby light source. The proposed algorithm represents the image irradiance equation of each setting as an explicit partial differential equation (PDE). In our implementation we use the Lax-Friedrichs sweeping method to solve this PDE. To demonstrate the efficiency of the proposed algorithm, several comparisons with the state of the art in the SFS literature are given.
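
For reference, in the classical setting the abstract starts from (Lambertian reflectance, orthographic projection, distant light source l = (l_1, l_2, l_3)), the image irradiance equation takes the standard textbook form (a common formulation, not quoted from the paper):

I(x,y) = \rho \, \frac{-p\,l_1 - q\,l_2 + l_3}{\sqrt{p^2 + q^2 + 1}},
\qquad p = \frac{\partial z}{\partial x}, \quad q = \frac{\partial z}{\partial y},

which for frontal lighting l = (0, 0, 1) and unit albedo reduces to the eikonal equation p^2 + q^2 = 1/I^2 - 1, the kind of first-order PDE that Lax-Friedrichs sweeping schemes solve.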

Shape from Shading Based on Lax-Friedrichs Fast Sweeping and Regularization Techniques With Applications to Document Image Restoration

Li Zhang, Andy M. Yip, and Chew Lim Tan

In this paper, we describe a 2-pass iterative scheme to solve the general partial differential equation (PDE) related to the shape-from-shading (SFS) problem under both distant and close point light sources. In particular, we discuss its application to restoring warped document images, which often appear in daily snapshots. The proposed method consists of two steps. First, the image irradiance equation is formulated as a static Hamilton-Jacobi (HJ) equation and solved using a fast sweeping strategy with the Lax-Friedrichs Hamiltonian. However, abrupt errors may arise when this is applied to real document images due to noise in the approximated shading image. To reduce the noise sensitivity, a minimization method thus follows to smooth out the abrupt ridges in the initial result and produce a better reconstruction. Experiments on synthetic surfaces show promising results compared to the ground truth data. Moreover, a general framework is developed, which demonstrates that the SFS method can help to remove both geometric and photometric distortions in warped document images for better visual appearance and a higher recognition rate.

ShadowCuts: Photometric Stereo with Shadows

Manmohan Chandraker, Sameer Agarwal, and David Kriegman

We present an algorithm for performing Lambertian photometric stereo in the presence of shadows. The algorithm has three novel features. First, a fast graph-cuts-based method is used to estimate per-pixel light source visibility. Second, it allows images to be acquired with multiple illuminants, and there can be fewer images than light sources. This leads to better surface coverage and improves the reconstruction accuracy by enhancing the signal-to-noise ratio and the condition number of the light source matrix. The ability to use fewer images than light sources means that the computational and imaging effort grows sublinearly with the number of light sources. Finally, the recovered shadow maps are combined with shading information to perform constrained surface normal integration. This reduces the low-frequency bias inherent in the normal integration process and ensures that the recovered surface is consistent with the shadowing configuration. The algorithm works with as few as four light sources and four images. We report results for light source visibility detection and high-quality surface reconstructions for synthetic and real datasets.
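
For context, the per-pixel estimation step that follows visibility detection is standard Lambertian photometric stereo; below is a minimal sketch assuming the shadow mask is already known (the quantity ShadowCuts recovers with graph cuts), with a placeholder data layout.

# Least-squares normal/albedo recovery restricted to visible (unshadowed) lights.
import numpy as np

def normals_from_visibility(I, L, visible):
    """I: (n_lights, n_pixels) intensities; L: (n_lights, 3) light directions;
    visible: (n_lights, n_pixels) boolean shadow mask."""
    N = np.zeros((I.shape[1], 3))
    for p in range(I.shape[1]):
        v = visible[:, p]
        if v.sum() >= 3:                          # need >= 3 lights per pixel
            g, *_ = np.linalg.lstsq(L[v], I[v, p], rcond=None)
            norm = np.linalg.norm(g)
            if norm > 0:
                N[p] = g / norm                   # albedo = ||g||, normal = g/||g||
    return N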

Symmetric Objects are Hardly Ambiguous

Gaurav Aggarwal, Soma Biswas, and Rama Chellappa

Given any two images taken under different illumination conditions, there always exists a physically realizable object which is consistent with both images, even if the lighting in each scene is constrained to be a known point light source at infinity. In this paper, we show that images are much less ambiguous for the class of bilaterally symmetric Lambertian objects. In fact, the set of such objects can be partitioned into equivalence classes such that it is always possible to distinguish between two objects belonging to different equivalence classes using just one image per object. The conditions required for two objects to belong to the same equivalence class are very restrictive, thereby leading to the conclusion that images of symmetric objects are hardly ambiguous. This observation leads to an illumination-invariant matching algorithm to compare images of bilaterally symmetric Lambertian objects. Experiments on real data are performed to show the implications of the theoretical result even when the symmetry and Lambertian assumptions are not strictly satisfied.

Inferring 3D Volumetric Shape of Both Moving Objects and Static Background Observed by a Moving Camera

Chang Yuan and Gérard Medioni

We present a novel approach to inferring the 3D volumetric shape of both moving objects and static background from video sequences shot by a moving camera, with the assumption that the objects move rigidly on a ground plane. The 3D scene is divided into a set of volume elements, termed voxels, organized in an adaptive octree structure. Each voxel is assigned a label at each time instant: empty, belonging to the background structure, or belonging to a moving object. The task of shape inference is then formulated as assigning each voxel a dynamic label which minimizes photo and motion variance between the voxels and the original sequence. We propose a three-step voxel labeling method based on a robust photo-motion variance measure. First, a sparse set of surface points is used to initialize a subset of voxels. Then, a deterministic voxel coloring scheme carves away the voxels with large variance. Finally, the labeling results are refined by a Graph Cuts based optimization method to enforce global smoothness. Experimental results on both indoor and outdoor sequences demonstrate the effectiveness and robustness of our method.

Fast 3D Scanning with Automatic Motion Compensation

Thibaut Weise, Bastian Leibe, and Luc Van Gool

We present a novel 3D scanning system combining stereo and active illumination based on phase-shift for robust and accurate scene reconstruction. Stereo overcomes the traditional phase discontinuity problem and allows for the reconstruction of complex scenes containing multiple objects. Due to the sequential recording of three patterns, motion will introduce artifacts in the reconstruction. We develop a closed-form expression for the motion error in order to apply motion compensation on a pixel level. The resulting scanning system can capture accurate depth maps of complex dynamic scenes at 17 fps and can cope with both rigid and deformable objects.

Viewpoint-Coded Structured Light

Mark Young, Erik Beeson, James Davis, Szymon Rusinkiewicz, and Ravi Ramamoorthi

We introduce a theoretical framework and practical algorithms for augmenting time-coded structured light patterns with additional camera locations. Current structured light methods typically use log(N) light patterns, encoded over time, to unambiguously reconstruct N unique depths. We demonstrate that each additional camera location may replace one frame in a temporal binary code. Our theoretical viewpoint coding analysis shows that, by using a high frequency stripe pattern and placing cameras in carefully selected locations, the epipolar projection in each camera can be made to mimic the binary encoding patterns normally projected over time. Results from our practical implementation demonstrate reliable depth reconstruction that makes neither temporal nor spatial continuity assumptions about the scene being captured.
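
For contrast with the viewpoint coding proposed here, the temporal scheme whose frames it replaces can be sketched in a few lines: log2(N) binary stripe images are thresholded per pixel and the bits assembled into a stripe index. The images below are placeholders.

# Decode a temporal binary structured-light code (most significant bit first).
import numpy as np

def decode_binary_code(frames, white, black):
    """frames: list of ceil(log2 N) stripe images; white/black: reference images."""
    half = (white + black) / 2.0                  # per-pixel decision threshold
    code = np.zeros(frames[0].shape, dtype=np.int64)
    for f in frames:
        bit = (f > half).astype(np.int64)
        code = (code << 1) | bit
    return code                                   # per-pixel stripe index in [0, N)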

Global Optimization for Shape Fitting

Victor Lempitsky and Yuri Boykov

We propose a global optimization framework for 3D shape reconstruction from the sparse noisy 3D measurements frequently encountered in range scanning, sparse feature-based stereo, and shape-from-X. In contrast to earlier local or banded optimization methods for shape fitting, we compute the global optimum in the whole volume, removing dependence on the initial guess and sensitivity to numerous local minima. Our global method is based on two main ideas. First, we suggest a new regularization functional with a data alignment term that maximizes the number of (weakly-oriented) data points contained by a surface while allowing for some measurement errors. Second, we propose a "touch-expand" algorithm for finding a minimum cut on a huge 3D grid using an automatically adjusted band. This overcomes the prohibitively high memory cost of graph cuts when computing globally optimal surfaces at high resolution. Our results for sparse or incomplete 3D data from laser scanning and passive multi-view stereo are robust to noise, outliers, missing parts, and varying sampling density.

Topology matching for 3D video compression

Tony Tung, Francis Schmitt, and Takashi Matsuyama

This paper presents a new technique to reduce the storage cost of high quality 3D video. In 3D video [12], a sequence of 3D objects represents scenes in motion. Every frame is composed of one or several accurate 3D meshes with attached high fidelity properties such as color and texture. Each frame is acquired at video rate, so the entire video sequence requires a huge amount of disk space. To overcome this issue, we propose an original approach using Reeb graphs, which are well-known topology-based shape descriptors. In particular, we take advantage of the properties of the augmented multiresolution Reeb graph [18] to store the relevant information of the 3D model of each frame. This graph structure has shown its efficiency as a motion descriptor, being able to track similar nodes all along the 3D video sequence. Therefore we can describe and reconstruct the 3D models of all frames with a very small amount of data. The algorithm has been implemented as a fully automatic 3D video compression system. Our experiments show the robustness and accuracy of the proposed technique by comparing reconstructed sequences against challenging real ones.

Layered Depth Panorama

Ke Colin Zheng, Sing Bing Kang, Michael Cohen, and Richard Szeliski

Representations for interactive photorealistic visualization of scenes range from compact 2D panoramas to data-intensive 4D light fields. In this paper, we propose a technique for creating a layered representation from a sparse set of images taken with a hand-held camera. This representation, which we call layered depth panorama, allows the user to experience 3D by off-axis panning. It combines the compelling experience of panoramas with limited 3D navigation. Our choice of representation is motivated by ease of capture and compactness. We formulate the problem of constructing the layered depth panorama as the recovery of color and geometry in a multi-perspective cylindrical disparity space. We leverage a graph cut approach to sequentially determine the disparity and color of each layer using multi-view stereo. Geometry visible through the cracks at depth discontinuities in a front-most disparity layer is determined and assigned to layers behind the front-most layer. All layers are then used to render novel panoramic views with parallax. We demonstrate our approach on a variety of complex outdoor and indoor scenes.

Body Tracking, Gait, and Gesture 2

Marker-less Deformable Mesh Tracking for Human Shape and Motion Capture

Edilson de Aguiar, Christian Theobalt, Carsten Stoll, and Hans-Peter Seidel

We present a novel algorithm to jointly capture the motion and the dynamic shape of humans from multiple video streams without using optical markers. Instead of relying on kinematic skeletons, as traditional motion capture methods do, our approach uses a deformable high-quality mesh of a human as the scene representation. It jointly uses an image-based 3D correspondence estimation algorithm and a fast Laplacian mesh deformation scheme to capture both the motion and the surface deformation of the actor from the input video footage. As opposed to many related methods, our algorithm can track people wearing wide apparel, it can straightforwardly be applied to any type of subject, e.g. animals, and it preserves the connectivity of the mesh over time. We demonstrate the performance of our approach using synthetic and captured real-world video sequences and validate its accuracy by comparison to the ground truth.

Bridging the Gap between Detection and Tracking for 3D Monocular Video-Based Motion Capture

Andrea Fossati, Miodrag Dimitrijevic, Vincent Lepetit, and Pascal Fua

We combine detection and tracking techniques to achieve robust 3-D motion recovery of people seen from arbitrary viewpoints by a single and potentially moving camera. We rely on detecting key postures, which can be done reliably, using a motion model to infer 3-D poses between consecutive detections, and finally refining them over the whole sequence using a generative model. We demonstrate our approach in the case of people walking against cluttered backgrounds and filmed using a moving camera, which precludes the use of simple background subtraction techniques. In this case, the easy-to-detect posture is the one that occurs at the end of each step when people have their legs furthest apart.

Recognizing Human Activities from Silhouettes: Motion Subspace and Factorial Discriminative Graphical Model

Liang Wang and David Suter

This paper describes a probabilistic framework for recognizing human activities in monocular video based on simple silhouette observations. The methodology combines kernel principal component analysis (KPCA) based feature extraction and factorial conditional random field (FCRF) based motion modeling. Silhouette data is represented in a compact manner by nonlinear dimensionality reduction that explores the underlying structure of the articulated action space and preserves explicit temporal orders in projection trajectories of motions (PTM). The FCRF models temporal sequences in multiple interacting ways, thus increasing joint accuracy by information sharing, with the ideal advantages of discriminative models over generative ones, e.g., relaxing the independence assumption between observations and the ability to incorporate both overlapping features and long-range dependencies. The experimental results on two recent datasets have shown that the proposed framework can not only accurately recognize human activities with temporal, intra- and inter-person variations, but is also considerably robust to noise and other factors such as partial occlusion and irregularities in motion styles.

Latent-Dynamic Discriminative Models for Continuous Gesture Recognition

Louis-Philippe Morency, Ariadna Quattoni, and Trevor Darrell

Many problems in vision involve the prediction of a class label for each frame in an unsegmented sequence. In this paper, we develop a discriminative framework for simultaneous sequence segmentation and labeling which can capture both intrinsic and extrinsic class dynamics. Our approach incorporates hidden state variables which model the sub-structure of a class sequence and learn dynamics between class labels. Each class label has a disjoint set of associated hidden states, which enables efficient training and inference in our model. We evaluated our method on the task of recognizing human gestures from unsegmented video streams and performed experiments on three different datasets of head and eye gestures. Our results demonstrate that our model compares favorably to Support Vector Machines, Hidden Markov Models, and Conditional Random Fields on visual gesture recognition tasks.

Towards Reliable Pedestrian Detection in Crowded Image Sequences

Edgar Seemann, Mario Fritz, and Bernt Schiele

Object class detection in scenes of realistic complexity remains a challenging task in computer vision. Most recent approaches focus on a single and general model for object class detection. However, particularly in the context of image sequences, it may be advantageous to adapt the general model to a more object-instance-specific model in order to detect that particular object reliably within the image sequence. In this work we present a generative object model that is capable of scaling from a general object class model to a more specific object-instance model. This allows us to detect class instances as well as to distinguish between individual object instances reliably. We experimentally evaluate the performance of the proposed system on both still images and image sequences.

Bottom-up Recognition and Parsing of the Human Body

Praveen Srinivasan and Jianbo Shi

Recognizing humans, estimating their pose and segmenting their body parts are key to high-level image understanding. Because humans are highly articulated, the range of deformations they undergo makes this task extremely challenging. Previous methods have focused largely on heuristics or pairwise part models in approaching this problem. We propose a bottom-up growing, similar to parsing, of increasingly more complete partial body masks guided by a composition tree. At each level of the growing process, we evaluate the partial body masks directly via shape matching with exemplars, without regard to how the hypotheses are formed. The body is evaluated as a whole, not as the sum of its parts, unlike previous approaches. Multiple image segmentations are included at each level of the growing/parsing, to augment existing hypotheses or to introduce new ones. Our method yields both a pose estimate and a segmentation of the human. We demonstrate competitive results on this challenging task with relatively few training examples on a dataset of baseball players with wide pose variation. Our method is comparatively simple and could easily be extended to other objects.

Accurately measuring human movement using articulated ICP with soft-joint constraints and a repository of articulated models

Lars Muendermann, Stefano Corazza, and Thomas Andriacchi

A novel approach for accurate markerless motion capture combining a precise tracking algorithm with a database of articulated models is presented. The tracking approach employs an articulated iterative closest point algorithm with soft joint constraints for tracking body segments in visual hull sequences. The algorithm extends previous approaches for tracking articulated models that enforced hard constraints on the joints of the articulated body. Two adjacent body segments each predict the location of their common joint center, and the deviation between the two predictions is penalized in least-squares terms. The balance between matching surface properties and preserving joint consistency can be varied continuously. The database of articulated models is derived from a combination of human shapes and anthropometric data; it contains a large variety of models and closely mimics the variation found in the human population. The database provides articulated models that closely match the outer appearance of the visual hulls, e.g. in overall height and volume. This information is paired with a kinematic chain enhanced through anthropometric regression equations. Deviations of the kinematic chain from true joint center locations are compensated by the soft joint constraints approach. As a result, a more accurate and anatomically correct outcome is obtained, suitable for biomechanical and clinical applications. Joint kinematics obtained using this approach closely matched joint kinematics obtained from a marker-based motion capture system.
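
The soft-constraint objective can be sketched as a weighted sum of a surface matching term and a joint consistency penalty; the residual callables and data layout below are illustrative assumptions, not the authors' implementation.

# Articulated ICP cost with soft joint constraints: each joint contributes the
# squared disagreement between the centers predicted by its two segments.
import numpy as np

def tracking_cost(params, surface_residuals, joint_pairs, weight):
    """surface_residuals(params) -> per-point fit errors;
    joint_pairs(params) -> list of (c_i, c_j) joint-center predictions
    from the two segments sharing each joint."""
    e_surf = np.sum(surface_residuals(params) ** 2)
    e_joint = sum(np.sum((ci - cj) ** 2) for ci, cj in joint_pairs(params))
    return e_surf + weight * e_joint  # weight -> infinity recovers hard joints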

Real-time Gesture Recognition with Minimal Training Requirements and On-line Learning

Stjepan Rajko, Gang Qian, Todd Ingalls, and Jodi James

In this paper, we introduce the semantic network model (SNM), a generalization of the hidden Markov model (HMM) that uses factorization of state transition probabilities to reduce training requirements, increase the efficiency of gesture recognition and on-line learning, and allow more precision in gesture modeling. We demonstrate the advantages both formally and experimentally, using examples such as full-body multimodal gesture recognition via optical motion capture and a pressure sensitive floor, as well as mouse / pen gesture recognition. Our results show that our algorithm performs much better than the traditional approach in situations where training samples are limited and/or the precision of the gesture model is high.

Objects in Action: An Approach for Combining Action Understanding and Object Perception

Abhinav Gupta and Larry Davis

Analysis of videos involving human-object interactions involves understanding human movements, locating and recognizing objects and observing the effects of human movements on those objects. While each of these can be conducted independently, recognition improves when interactions between these elements are considered. Motivated by psychological studies of human perception, we present a Bayesian approach which unifies the inference processes involved in object classification and localization, action understanding and perception of object reaction. Traditional approaches for object classification and action understanding have relied on shape features and movement analysis respectively. By placing object classification and localization in a video interpretation framework, we can localize and classify objects which are either hard to localize due to clutter or hard to recognize due to lack of discriminative features. Similarly, by applying context on human movements from the objects on which these movements impinge and the effects of these movements, we can segment and recognize actions which are either too subtle to perceive or too hard to recognize using motion features alone.

Learning Motion Categories using both Semantic and Structural Information

Shu-Fai Wong, Tae-Kyun Kim, and Roberto Cipolla

Current approaches to motion category recognition typically focus on either full spatiotemporal volume analysis (the holistic approach) or analysis of the content of spatiotemporal interest points (the part-based approach). Holistic approaches tend to be more sensitive to noise, e.g. geometric variations, while part-based approaches usually ignore structural dependencies between parts. This paper presents a novel generative model, which extends probabilistic latent semantic analysis (pLSA), to capture both semantic (content of parts) and structural (connections between parts) information for motion category recognition. The structural information learnt can also be used to infer the location of motion for the purpose of motion detection. We test our algorithm on challenging datasets involving human actions, facial expressions and hand gestures and show that its performance is better than existing unsupervised methods in both tasks of motion localisation and recognition.

On the Blind Classification of Time Series

Alessandro Bissacco and Stefano Soatto

We propose a cord distance in the space of dynamical models that takes into account their dynamics, including transients, output maps and input distributions. In data analysis applications, as opposed to control, the input is often not known and is inferred as part of the (blind) identification, so it is an integral part of the model that should be considered when comparing different time series. Previous work on kernel distances between dynamical models was restricted to models having either identical or independent inputs. We extend it to arbitrary input distributions, highlighting connections with system identification, independent component analysis, and optimal transport. The increased modeling power is demonstrated empirically on gait classification from simple visual features.

Medical Imaging 2 and Biologically Motivated

An Optimal Reduced Representation of a MoG with Applications to Medical Image Database Classification

Jacob Goldberger, Hayit Greenspan, and Jeremie Dreyfuss

This work focuses on a general framework for image categorization, classification and retrieval that may be appropriate for medical image archives. The proposed methodology comprises a continuous and probabilistic image representation scheme using Gaussian mixture modeling (MoG) along with information-theoretic image matching measures based on the Kullback-Leibler (KL) divergence. A category model is obtained by learning a reduced model from all the images in the category. We propose a novel algorithm for learning a reduced representation of a MoG that is based on the Unscented Transform. The superiority of the proposed method is validated on both simulation experiments and categorization of a real medical image database.

Topology-preserving Geometric Deformable Model on Adaptive Quadtree Grid

Ying Bai, Xiao Han, and Jerry Prince

Topology-preserving geometric deformable models (TGDMs) are used to segment objects that have a known topology. Their accuracy is inherently limited, however, by the resolution of the underlying computational grid. Although this can be overcome by using fine-resolution grids, both the computational cost and the size of the resulting contour increase dramatically. In order to maintain computational efficiency and to keep the contour size manageable, we have developed a new framework, termed QTGDMs, for topology-preserving geometric deformable models on balanced quadtree grids (BQGs). In order to do this, definitions and concepts from digital topology on regular grids were extended to BQGs so that characterization of simple points could be made. Other issues critical to the implementation of geometric deformable models are also addressed and a strategy for adapting a BQG during contour evolution is presented. We demonstrate the performance of the QTGDM method using both mathematical phantoms and real medical images.

Statistical Shape Analysis of Multi-Object Complexes

Kevin Gorczowski, Martin Styner, Ja-Yeon Jeong, James S Marron, Joseph Piven, Heather Cody Hazlett, Stephen M. Pizer, and Guido Gerig

An important goal of statistical shape analysis is the discrimination between populations of objects, exploring group differences in morphology not explained by standard volumetric analysis. Certain applications additionally require analysis of objects in their embedding context by joint statistical analysis of sets of interrelated objects. In this paper, we present a framework for discriminant analysis of populations of 3-D multi-object sets. In view of the driving medical applications, a skeletal object parametrization of shape is chosen since it naturally encodes thickening, bending and twisting. In a multi-object setting, we not only consider a joint analysis of sets of shapes but also must take into account differences in pose. Statistics on features of medial descriptions and pose parameters, which include rotational frames and distances, use a Riemannian symmetric space instead of the standard Euclidean metric. Our choice of discriminant method is the distance weighted discriminant (DWD) because of its generalization ability in high-dimensional, low sample size settings. Joint analysis of 10 subcortical brain structures in a pediatric autism study demonstrates that multi-object analysis of shape results in better group discrimination than pose, and that the combination of pose and shape performs better than shape alone. Finally, given a discriminating axis of shape and pose, we can visualize the differences between the populations.

Beyond bottom-up: Incorporating task-dependent influences into a computational model of spatial attention

Robert Peters and Laurent Itti

A critical function in both machine vision and biological vision systems is attentional selection of scene regions worthy of further analysis by higher-level processes such as object recognition. Here we present the first model of spatial attention that (1) can be applied to arbitrary static and dynamic image sequences with interactive tasks and (2) combines a general computational implementation of both bottom-up (BU) saliency and dynamic top-down (TD) task relevance; the claimed novelty lies in the combination of these elements and in the fully computational nature of the model. The BU component computes a saliency map from 12 low-level multi-scale visual features. The TD component computes a low-level signature of the entire image, and learns to associate different classes of signatures with the different gaze patterns recorded from human subjects performing a task of interest. We measured the ability of this model to predict the eye movements of people playing contemporary video games. We found that the TD model alone predicts where humans look about twice as well as does the BU model alone; in addition, a combined BU*TD model performs significantly better than either individual component. Qualitatively, the combined model predicts some easy-to-describe but hard-to-compute aspects of attentional selection, such as shifting attention leftward when approaching a left turn along a racing track. Thus, our study demonstrates the advantages of integrating BU factors derived from a saliency map and TD factors learned from image and task contexts in predicting where humans look while performing complex visually-guided behavior.
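
The BU*TD combination itself is a pointwise product of the two maps; a one-function sketch with placeholder arrays:

# Multiply bottom-up saliency by top-down relevance and renormalize.
import numpy as np

def combined_gaze_map(bu_saliency, td_relevance, eps=1e-8):
    m = bu_saliency * td_relevance       # the BU*TD combination
    return m / (m.sum() + eps)           # normalize to a probability map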

Artificial Complex Cells via the Tropical Semiring

Lior Wolf and Moshe Guttmann

The seminal work of Hubel and Wiesel and the vast amount of work that followed it prove that hierarchies of increasingly complex cells play a central role in cortical computations. Computational models, pioneered by Fukushima, suggest that these hierarchies contain feature-building cells (“S-cells”) and pooling cells (“C-cells”). More recently, Riesenhuber & Poggio have developed the HMAX model, in which S-cells perform linear combinations, while C-cells perform a MAX operation. We note that methods for computing the connectivity of S-cells abound, since there exist many algorithms for suggesting informative linear combinations. There are, however, only a few published methods that are suitable for the construction of C-cells. Here, we build a novel dimensionality reduction algorithm for learning the connectivity of C-cells, using the framework of the max-plus (“tropical”) semiring.
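
In the max-plus semiring, addition is max and multiplication is +, so a "linear" map becomes a MAX over shifted inputs, which is exactly the C-cell pooling operation. A toy illustration (not the paper's learning algorithm):

# Tropical matrix-vector product: result_i = max_j (W[i, j] + x[j]).
import numpy as np

def maxplus_matvec(W, x):
    return (W + x[None, :]).max(axis=1)

# A row with weight 0 on selected inputs and -inf elsewhere computes a plain
# MAX over that subset, i.e., a C-cell pooling its chosen S-cell afferents.
W = np.array([[0.0, 0.0, -np.inf],
              [-np.inf, 0.0, 0.0]])
x = np.array([1.0, 3.0, 2.0])
print(maxplus_matvec(W, x))  # [3. 3.]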

One-class Machine Learning for Brain Activation Detection

Xiaomu Song, George Iordanescu, and Alice Wyrwicz

Machine learning methods, such as the support vector machine (SVM), have been applied to fMRI data analysis, where most studies focus on supervised detection and classification of cognitive states. In this work, we study general fMRI activation detection using an SVM in an unsupervised way, instead of classifying cognitive states. Specifically, activation detection is formulated as an outlier (activated voxels) detection problem for the one-class support vector machine (OCSVM). An OCSVM implementation, ν-SVM, is used, where the parameter ν controls the outlier ratio and is usually unknown. We propose a detection method that is not sensitive to ν when it is set randomly within a range known a priori. In cases where this range is also unknown, we consider estimating ν using geometry and texture features. Results from both synthetic and experimental data demonstrate the effectiveness of the proposed methods.
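
The ν-SVM formulation is available off the shelf, so the outlier-detection framing can be sketched directly; the per-voxel features here are random placeholders, and this is the plain OCSVM baseline rather than the authors' ν-insensitive method.

# One-class SVM activation detection: voxels labeled -1 are outliers
# (candidate activations); nu upper-bounds the fraction of outliers.
import numpy as np
from sklearn.svm import OneClassSVM

voxels = np.random.rand(5000, 10)            # placeholder per-voxel features
labels = OneClassSVM(kernel="rbf", nu=0.05).fit_predict(voxels)
activated = np.flatnonzero(labels == -1)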

Shape and Boundaries

Detailed Human Shape and Pose from Images

Alexandru Balan, Leonid Sigal, Michael Black, James Davis, and Horst Haussecker

Much of the research on video-based human motion capture assumes the body shape is known a priori and is represented coarsely (e.g. using cylinders or superquadrics to model limbs). These body models stand in sharp contrast to the richly detailed 3D body models used by the graphics community. Here we propose a method for recovering such models directly from images. Specifically, we represent the body using a recently proposed triangulated mesh model called SCAPE which employs a low-dimensional, but detailed, parametric model of shape and pose-dependent deformations that is learned from a database of range scans of human bodies. Previous work showed that the parameters of the SCAPE model could be estimated from marker-based motion capture data. Here we go further to estimate the parameters directly from image data. We define a cost function between image observations and a hypothesized mesh and formulate the problem as optimization over the body shape and pose parameters using stochastic search. Our results show that such rich generative models enable the automatic recovery of detailed human shape and pose from images.

Semi-supervised Hierarchical Models for 3D Human Pose Reconstruction

Atul Kanaujia, Cristian Sminchisescu, and Dimitris Metaxas

Recent research in visual inference from monocular images has shown that discriminatively trained image-based predictors can provide fast, automatic qualitative 3D reconstructions of human body pose or scene structure in real-world environments. However, the stability of existing image representations tends to be perturbed by deformations and misalignments in the training set, which, in turn, degrades the quality of learning and generalization. In this paper we advocate the semi-supervised learning of hierarchical image descriptions in order to better tolerate variability at multiple levels of detail. We combine multilevel encodings having improved stability to geometric transformations with metric learning and semi-supervised manifold regularization methods in order to further profile them for task invariance: resistance to background clutter and to variance within the same human pose class. We quantitatively analyze the effectiveness of both the descriptors and the learning methods, and show that each can contribute, sometimes substantially, to more reliable 3D human pose estimates in cluttered images.

Physics-Based Person Tracking Using Simplified Lower-Body Dynamics

Marcus A. Brubaker, David J. Fleet, and Aaron Hertzmann

We introduce a physics-based model for 3D person tracking. Based on a biomechanical characterization of lower-body dynamics, the model captures important physical properties of bipedal locomotion such as balance and ground contact, generalizes naturally to variations in style due to changes in speed, step-length, and mass, and avoids common problems such as footskate that arise with existing trackers. The model dynamics comprises a two degree-of-freedom representation of human locomotion with inelastic ground contact. A stochastic controller generates impulsive forces during the toe-off stage of walking and spring-like forces between the legs. A higher-dimensional kinematic observation model is then conditioned on the underlying dynamics. We use the model for tracking walking people from video, including examples with turning, occlusion, and varying gait.

Detecting Object Boundaries Using Low-, Mid-, and High-level Information

Songfeng Zheng, Zhuowen Tu, and Alan Yuille

Object boundary detection and segmentation is a central problem in computer vision. The importance of combining low-level, mid-level, and high-level cues has been recognized in recent literature. However, it is unclear how to efficiently and effectively engage and fuse different levels of information. In this paper, we emphasize a learning-based approach to explore different levels of information, both implicitly and explicitly. First, we learn low-level cues for object boundaries and interior regions using a probabilistic boosting tree (PBT) [17,6]. Second, we learn short- and long-range context information based on the results from the first stage. Both stages implicitly contain object-specific information such as texture and local geometry, and it is shown that this implicit knowledge is extremely powerful. Third, we use high-level shape information explicitly to further refine the object segmentation and to parse the object into components. The algorithm is trained and tested on a challenging dataset of horses [2], and the results obtained are very encouraging compared with other approaches. In detailed experiments we show significantly better performance (e.g. F-values of 0.75 compared to 0.66) than the best comparable reported performance on this dataset [14]. Furthermore, the system needs only 1.5 minutes for a typical image. Although our system is illustrated on horse images, the approach can be directly applied to detecting/segmenting other types of objects.

Multimodal and Sign Language

Harmony in Motion

Zohar Barzelay and Yoav Schechner

Cross-modal analysis offers information beyond that extracted from individual modalities. Consider a camcorder with a single microphone at a cocktail party: it captures several moving visual objects which emit sounds. A task for audio-visual analysis is to identify the number of independent audio-associated visual objects (AVOs), pinpoint the AVOs' spatial locations in the video, and isolate each corresponding audio component. Parts of this problem were considered by prior studies, which were limited to simple cases, e.g., a single AVO or stationary sounds. We describe an approach that seeks to overcome these challenges. It acknowledges the importance of temporal features that are based on significant changes in each modality. A probabilistic formalism identifies temporal coincidences between these features, yielding cross-modal association and visual localization. This association is of particular benefit for harmonic sounds, as it enables subsequent isolation of each audio source. We demonstrate this in challenging experiments with multiple, simultaneous, highly nonstationary AVOs.

Microphone Arrays as Generalized Cameras for Integrated Audio Visual Processing

Adam O'Donovan, Ramani Duraiswami, and Jan Neumann

Combinations of microphones and cameras allow the joint audio-visual sensing of a scene. Such arrangements of sensors are common in biological organisms and in applications such as meeting recording and surveillance, where both modalities are necessary to provide scene understanding. Microphone arrays provide geometrical information on the source location, and allow the sound sources in the scene to be separated and the noise suppressed, while cameras allow the scene geometry and the location and motion of people and other objects to be estimated. In most previous work the fusion of the audio-visual information occurs at a relatively late stage. In contrast, we take the viewpoint that both cameras and microphone arrays are geometry sensors, and treat the microphone arrays as generalized cameras. We employ computer-vision inspired algorithms to treat the combined system of arrays and cameras. In particular, we consider the geometry introduced by a general microphone array and by spherical microphone arrays. The latter exhibit a geometry that is very close to that of central projection cameras, and we show how standard vision-based calibration algorithms can be profitably applied to them. Experiments are presented that demonstrate the usefulness of the considered approach.

Transfer Learning in Sign Language

Ali Farhadi, David Forsyth, and Ryan White

We build word models for American Sign Language (ASL) that transfer between different signers and different aspects. This is advantageous because one could use large amounts of labelled avatar data in combination with a smaller amount of labelled human data to spot a large number of words in human data. Transfer learning is possible because we represent blocks of video with a novel intermediate discriminative feature based on splits of the data. By constructing the same splits in avatar and human data and clustering appropriately, our features are both discriminative and semantically similar: across signers, similar features imply similar words. We demonstrate transfer learning in two scenarios: from an avatar to a frontally viewed human signer and from an avatar to a human signer in a 3/4 view.

Enhanced Level Building Algorithm for the Movement Epenthesis Problem in Sign Language

Ruiduo Yang, Sudeep Sarkar, and Barbara Loeding

One of the hard problems in automated sign language recognition is the movement epenthesis (ME) problem. Movement epenthesis is the gesture movement that bridges two consecutive signs. This effect can extend over a long duration and involve variations in hand shape, position, and movement, making it hard to explicitly model these intervening segments. This creates a problem when trying to match individual signs to full sign sentences, since for many chunks of the sentence, corresponding to these MEs, we do not have models. We present an approach based on a version of a dynamic programming framework, called Level Building, to simultaneously segment and match signs to continuous sign language sentences in the presence of movement epenthesis. We enhance the classical Level Building framework so that it can accommodate ME labels for which we do not have explicit models. This enhanced Level Building algorithm is then coupled with a trigram grammar model to optimally segment and label sign language sentences. We demonstrate the efficiency of the algorithm using a single-view video dataset of continuous sign language sentences. We obtain an 83% word-level recognition rate with the enhanced Level Building approach, as opposed to a 20% recognition rate using a classical Level Building framework on the same dataset. The proposed approach is novel since it does not need explicit models for movement epenthesis frames.
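
A much-simplified sketch of the idea (not the authors' implementation, and omitting the trigram grammar): dynamic programming over frame segmentations in which each segment is either a modeled sign or an unmodeled ME segment charged a flat per-frame penalty:

    def level_building(n_frames, sign_cost, max_levels, me_penalty=1.0):
        """Minimal Level-Building-style DP: partition frames [0, n_frames)
        into at most max_levels segments, each explained either by the
        best-matching sign model (cost sign_cost(s, e)) or by an ME
        segment at a flat per-frame penalty."""
        INF = float("inf")
        # best[l][e] = min cost of explaining frames [0, e) with l segments
        best = [[INF] * (n_frames + 1) for _ in range(max_levels + 1)]
        best[0][0] = 0.0
        for level in range(1, max_levels + 1):
            for end in range(1, n_frames + 1):
                for start in range(end):
                    prev = best[level - 1][start]
                    if prev == INF:
                        continue
                    seg = min(sign_cost(start, end), me_penalty * (end - start))
                    best[level][end] = min(best[level][end], prev + seg)
        return min(best[level][n_frames] for level in range(1, max_levels + 1))

    # Toy usage: frames 0-4 match a sign cheaply, frames 5-7 match nothing (ME).
    cost = lambda s, e: 0.1 * (e - s) if (s, e) == (0, 5) else 10.0
    print(level_building(8, cost, max_levels=3))  # 0.5 (sign) + 3.0 (ME) = 3.5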

Abstracts: Workshop on Component Analysis Methods for Classification, Clustering, Modeling, and Estimation Problems in Computer Vision

Regularized Mixed Dimensionality and Density Learning in Computer Vision

Gloria Haro, Gregory Randall, Guillermo Sapiro

A framework for the regularized estimation of nonuniform dimensionality and density in high-dimensional data is introduced in this work. This leads to learning stratifications, that is, mixtures of manifolds representing different characteristics and complexities in the data set. The basic idea relies on modeling the high-dimensional sample points as arising from a mixture of Poisson processes, with regularizing restrictions and spatial continuity constraints. Theoretical asymptotic results for the model are presented as well. The presentation of the framework is complemented with artificial and real examples showing the importance of regularized stratification learning in computer vision applications.

The Hierarchical Isometric Self-Organizing Map for Manifold Representation

Haiying Guan, Matthew Turk

We present an algorithm, the Hierarchical ISOmetric Self-Organizing Map (H-ISOSOM), for a concise, organized manifold representation of complex, non-linear, large-scale, high-dimensional input data in a low-dimensional space. The main contribution of our algorithm is threefold. First, we modify the previous ISOSOM algorithm with a local linear interpolation (LLI) technique, which maps the data samples from the low-dimensional space back to the high-dimensional space and makes the complete mapping pseudo-invertible. The modified ISOSOM (M-ISOSOM) follows the global geometric structure of the data, and also preserves local geometric relations to reduce the nonlinear mapping distortion and make the learning more accurate. Second, we propose the H-ISOSOM algorithm to address the computational complexity of Isomap, SOM, and LLI, and the nonlinear complexity of highly twisted manifolds. H-ISOSOM learns an organized structure of a non-convex, large-scale manifold and represents it by a set of hierarchically organized maps. The hierarchical structure follows a coarse-to-fine strategy. According to the coarse global structure, it “unfolds” the manifold at the coarse level and decomposes the sample data into small patches, then iteratively learns the nonlinearity of each patch at finer levels. The algorithm simultaneously reorganizes and clusters the data samples in a low-dimensional space to obtain the concise representation. Third, we give quantitative comparisons of the proposed method with similar methods on standard data sets. Finally, we apply H-ISOSOM to the problem of appearance-based hand pose estimation. Encouraging experimental results validate the effectiveness and efficiency of H-ISOSOM.

Manifold Learning Techniques in Image Analysis of High-dimensional Diffusion Tensor Magnetic Resonance Images

Parmeshwar Khurd, Sajjad Baloch, Ruben Gur, Christos Davatzikos, Ragini Verma

Diffusion tensor magnetic resonance imaging (DT-MRI) provides a comprehensive characterization of white matter (WM) in the brain and therefore plays a crucial role in the investigation of diseases in which WM is suspected to be compromised, such as multiple sclerosis and neuropsychiatric disorders like schizophrenia. However, changes induced by pathology may be subtle, and affected regions of the brain can only be revealed by a group-based analysis of patients in comparison with healthy controls. This in turn requires voxel-based statistical analysis of spatially normalized brain DT images, as in the case of conventional MR images. However, this process is rendered extremely challenging in DT-MRI due to the high dimensionality of the data and its inherent non-linearity, which causes linear component analysis methods to be inapplicable. We therefore propose a novel framework for the statistical analysis of DT-MRI data using manifold-based techniques such as Isomap and kernel PCA that determine the underlying manifold structure of the data, embed it into a manifold, and help perform high-dimensional statistics on the manifold to determine regions of difference between the groups of patients and controls. The framework has been successfully applied to DT-MRI data from patients with schizophrenia, as well as to study developmental changes in small animals, both of which identify regional changes, indicating the need for manifold-based methods for the statistical analysis of DTI.

A Graph Cut Approach to Image Segmentation in Tensor Space

James Malcolm, Yogesh Rathi, Allen Tannenbaum

This paper proposes a novel method for applying the standard graph cut technique to segmenting multimodal tensor-valued images. The Riemannian nature of the tensor space is explicitly taken into account by first mapping the data to a Euclidean space, where nonparametric kernel density estimates of the regional distributions may be calculated from user-initialized regions. These distributions are then used as regional priors in calculating graph edge weights. Hence this approach utilizes the true variation of the tensor data by respecting its Riemannian structure when calculating the distances used to form probability distributions. Further, the non-parametric model generalizes to arbitrary tensor distributions, unlike the Gaussian assumption made in previous works. Casting the segmentation problem in a graph cut framework yields a segmentation that is robust with respect to initialization on the data tested.

Nonnegative Tucker Decomposition

Yong-Deok Kim, Seungjin Choi

Nonnegative tensor factorization (NTF) is a recent multiway (multilinear) extension of nonnegative matrix factorization (NMF), where nonnegativity constraints are imposed on the CANDECOMP/PARAFAC model. In this paper we consider the Tucker model with nonnegativity constraints and develop a new tensor factorization method, referred to as nonnegative Tucker decomposition (NTD). The main contributions of this paper include: (1) multiplicative updating algorithms for NTD; (2) an initialization method for speeding up convergence; (3) a sparseness control method in tensor factorization. Through several computer vision examples, we show the useful behavior of NTD over existing NTF and NMF methods.
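
For orientation only, a decomposition in this spirit can be run with the tensorly library's non_negative_tucker routine, which is also based on multiplicative updates; the data and ranks below are arbitrary stand-ins, and the paper's own initialization and sparseness control are not reproduced:

    import numpy as np
    import tensorly as tl
    from tensorly.decomposition import non_negative_tucker

    # Toy nonnegative 3-way tensor, e.g. pixels x pixels x images.
    X = tl.tensor(np.random.rand(20, 20, 10))

    # Tucker core of shape (5, 5, 3) with nonnegative core and factors.
    core, factors = non_negative_tucker(X, rank=[5, 5, 3], n_iter_max=200)
    print(core.shape, [f.shape for f in factors])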

A linear estimation method for 3D pose and facial animation tracking

José Ybanez, Franck Davoine, Maurice Charbit

This paper presents an approach that incorporates Canonical Correlation Analysis (CCA) for monocular 3D face pose and facial animation estimation. The CCA is used to find the dependency between texture residuals and 3D face pose and facial gesture. The texture residuals are obtained from observed raw brightness shape-free 2D image patches that we build by means of a parameterized 3D geometric face model. This method is used to correctly estimate the pose of the face and the model’s animation parameters controlling the lip, eyebrow and eye movements (encoded in 15 parameters). Extensive experiments on tracking faces in long real video sequences show the effectiveness of the proposed method and the value of using CCA in the tracking context.
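
A bare-bones illustration of this regression use of CCA with scikit-learn; the residual and parameter data are random stand-ins, and the shape-free patch construction and the paper's 15-parameter face model are not shown:

    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 300))   # texture residual vectors (stand-in)
    W = rng.normal(size=(300, 15))
    Y = X @ W + 0.01 * rng.normal(size=(500, 15))  # pose/animation deltas

    cca = CCA(n_components=15).fit(X, Y)
    # At tracking time: map an observed residual to a parameter update.
    delta = cca.predict(X[:1])
    print(delta.shape)  # (1, 15)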

Estimating Cluster Overlap on Manifolds and its Application to Neuropsychiatric Disorders

Peng Wang, Christian Kohler, Ragini Verma

Although it is usually assumed in many pattern recognition problems that different patterns are distinguishable, some patterns may have inseparable overlap. For example, some facial expressions involve subtle muscle movements, and are difficult to separate from other expressions or neutral faces. In this paper, we consider such overlapped patterns as “clusters”, and present a novel method to quantify cluster overlap based on Bayes error estimation on manifolds. Our method first applies a manifold learning method, ISOMAP, to discover the intrinsic structure of the data, and then measures the overlap of different clusters using k-NN Bayes error estimation on the learned manifolds. Due to ISOMAP’s capability of preserving geodesic distances and k-NN’s localized estimation, the method can provide an accurate measure of the overlap between clusters, as demonstrated by our simulation experiments. The method is further applied to an analysis of a specific type of facial expression impairment in schizophrenia, i.e., “flat affect”, which refers to a severe reduction in emotional expressiveness. In this study, we capture facial expressions of individuals, and quantify their expression flatness by estimating the overlap between different facial expressions. The experimental results show that the patient group has much larger facial expression overlap than the control group, and demonstrate that flat affect is an important symptom in diagnosing schizophrenia patients.
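
A rough sketch of the overlap measure using scikit-learn, with a cross-validated k-NN error rate standing in for the paper's k-NN Bayes error estimator; the neighborhood sizes and synthetic data are arbitrary:

    import numpy as np
    from sklearn.manifold import Isomap
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def cluster_overlap(X, y, n_neighbors=8, n_components=3, k=5):
        """Toy overlap measure: k-NN error rate on an Isomap embedding."""
        Z = Isomap(n_neighbors=n_neighbors,
                   n_components=n_components).fit_transform(X)
        acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), Z, y, cv=5)
        return 1.0 - acc.mean()  # higher -> more overlap between clusters

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (100, 10)), rng.normal(0.5, 1, (100, 10))])
    y = np.repeat([0, 1], 100)
    print(cluster_overlap(X, y))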

Dimensionality Reduction and Clustering on Statistical Manifolds

Sang-Mook Lee, A. Lynn Abbott, Philip A. Araman

Dimensionality reduction and clustering on statistical manifolds is presented. A statistical manifold [16] is a 2D Riemannian manifold which is statistically defined by maps that transform a parameter domain onto a set of probability density functions. Principal component analysis (PCA) based dimensionality reduction is performed on the manifold, and therefore estimates of the mean and variance of the set of probability distributions are needed. First, the probability distributions are transformed by an isometric transform that maps the distributions onto the surface of a hypersphere. The sphere forms a Riemannian manifold with a simple geodesic distance measure. Then, a Fréchet mean is estimated on the Riemannian manifold in order to perform PCA on the tangent plane at the mean. Experimental results show that clustering in the Riemannian space produces more accurate and stable classification than clustering in Euclidean space.
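
A compact sketch under the common assumption that the isometric transform is the square-root map, which sends densities to the unit hypersphere; the Fréchet mean is then found by iterating the spherical exp/log maps (illustrative only, not the paper's code):

    import numpy as np

    def sqrt_map(hists):
        """Map histograms (rows summing to 1) to the unit hypersphere."""
        h = np.sqrt(np.asarray(hists, dtype=float))
        return h / np.linalg.norm(h, axis=1, keepdims=True)

    def log_map(p, q):
        t = np.arccos(np.clip(p @ q, -1.0, 1.0))
        return np.zeros_like(p) if t < 1e-12 else t * (q - np.cos(t) * p) / np.sin(t)

    def exp_map(p, v):
        n = np.linalg.norm(v)
        return p if n < 1e-12 else np.cos(n) * p + np.sin(n) * v / n

    def frechet_mean(pts, n_iter=50):
        mu = pts[0]
        for _ in range(n_iter):
            v = np.mean([log_map(mu, q) for q in pts], axis=0)
            mu = exp_map(mu, v)
        return mu  # PCA can then run on the log_map(mu, q) tangent vectors

    pts = sqrt_map(np.random.dirichlet(np.ones(16), size=40))
    print(frechet_mean(pts)[:4])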

Sparse Kernels for Bayes Optimal Discriminant Analysis

Onur Hamsici, Aleix Martinez

Discriminant Analysis (DA) methods have demonstrated their utility in countless applications in computer vision and other areas of research – especially in the C-class classification problem. The most popular approach is Linear DA (LDA), which provides the (C − 1)-dimensional Bayes optimal solution, but only when all the class covariance matrices are identical. This is rarely the case in practice. To alleviate this restriction, Kernel LDA (KLDA) has been proposed. In this approach, we first (intrinsically) map the original nonlinear problem to a linear one and then use LDA to find the (C − 1)-dimensional Bayes optimal subspace. However, the use of KLDA is hampered by its computational cost, which is given by the number of training samples available, and by the restriction of LDA to a (C − 1)-dimensional solution space. In this paper, we first extend the definition of LDA to provide subspaces of q < C − 1 dimensions where the Bayes error is minimized. Then, to reduce the computational burden of the derived solution, we define a sparse kernel representation, which is able to automatically select the most appropriate sample feature vectors to represent the kernel. We demonstrate the superiority of the proposed approach on several standard datasets. Comparisons are drawn with a large number of known DA algorithms.

Conformal Embedding Analysis with Local Graph Modeling on the Unit Hypersphere

Yun Fu, Ming Liu, Thomas Huang

We present Conformal Embedding Analysis (CEA) for feature extraction and dimensionality reduction. Incorporating both conformal mapping and discriminant analysis, CEA projects the high-dimensional data onto the unit hypersphere and preserves intrinsic neighbor relations with local graph modeling. Through the embedding, data pairs from the same class keep their original angle and distance information on the hypersphere, whereas neighboring points of different classes are kept apart to boost discriminating power. The subspace learned by CEA is tolerant to gray-level variation, since the cosine-angle metric and the normalization processing enhance the robustness of the conformal feature extraction. We demonstrate the effectiveness of the proposed method with comprehensive comparisons on visual classification experiments.

Optimal Dimensionality Discriminant Analysis and Its Application to Image Recognition

Feiping Nie, Shiming Xiang, Yangqiu Song, Changshui Zhang

Dimensionality reduction is an important issue when facing high-dimensional data. For supervised dimensionality reduction, Linear Discriminant Analysis (LDA) is one of the most popular methods and has been successfully applied in many classification problems. However, there are several drawbacks in LDA. First, it suffers from the singularity problem, which makes it hard to perform. Second, LDA makes a distribution assumption which may cause it to fail in applications where the distribution is more complex than Gaussian. Third, LDA cannot determine the optimal dimensionality for discriminant analysis, which is an important issue but has often been neglected previously. In this paper, we propose a new algorithm that endeavors to solve all three of these problems. Furthermore, we show that our method can be extended to the two-dimensional case, in which the optimal dimensionalities of the two projection matrices can be determined simultaneously. Experimental results show that our methods are effective and demonstrate much higher performance in comparison to LDA.

Abstracts: Workshop on Beyond Multiview Geometry – Robust Estimation and Organization of Shapes from Multiple Cues

Session 1: Shape Estimation

Multiview Normal Field Integration Using Level Set Methods

Ju Yong Chang, Kyoung Mu Lee, Sang Uk Lee

In this paper, we propose a new method to integrate multiview normal fields using level sets. In contrast with conventional normal integration algorithms used in shape from shading and photometric stereo, which reconstruct a 2.5D surface using a single-view normal field, our algorithm can combine multiview normal fields simultaneously and recover the full 3D shape of a target object. We formulate this multiview normal integration problem in an energy minimization framework and find an optimal solution in a least-squares sense using a variational technique. A level set method is applied to solve the resultant geometric PDE that minimizes the proposed error functional. It is shown that the resultant flow is composed of the well-known mean curvature and flux-maximizing flows. In particular, we apply the proposed algorithm to the problem of 3D shape modeling in a multiview photometric stereo setting. Experimental results on various synthetic data show the validity of our approach.

Constrained Optimization for Retinal Curvature Estimation Using an Affine Camera

Thitiporn Chanwimaluang, Guoliang Fan

We study retinal curvature estimation from multiple images, which provides the fundamental geometry of the human retina. We use an affine camera model due to its simplicity, linearity, and robustness. Moreover, the affine camera is suitable in this research because (1) NIH's retinal imaging protocols specify a narrow 30° field-of-view in each eye and (2) each field has small depth variation. A major challenge is that there is a series of optics involved in the imaging process, including an actual fundus camera, a digital camera, and the human cornea, all of which cause significant non-linear distortions in the retinal images. In this work, we develop a new constrained optimization procedure that considers both the geometric shape of the human retina and lens distortions. Moreover, the constrained optimization is implemented in the affine space because it is computationally efficient and robust to noise. Specifically, we amend the affine bundle adjustment algorithm by including a quadratic surface fitting error and a lens distortion correction in the cost function for constrained optimization. Experiments on both synthetic data and real retinal images show the effectiveness and robustness of the proposed algorithm.

Session 2: Shape Matching

An MRF and Gaussian Curvature Based Shape Representation for Shape Matching

Pengdong Xiao, Nick Barnes, Tiberio Caetano, Paulette Lieby

Matching and registration of shapes is a key issue in Computer Vision, Pattern Recognition, and Medical Image Analysis. This paper presents a shape representation framework based on Gaussian curvature and Markov random fields (MRFs) for the purpose of shape matching. The method is based on a surface mesh model in R3, which is projected into a two-dimensional space and there modeled as an extended boundary-closed Markov random field. The surface is homeomorphic to S2. The MRF encodes in the nodes entropy features of the corresponding similarities based on Gaussian curvature, and in the edges the spatial consistency of the meshes. Correspondence between two surface meshes is then established by performing probabilistic inference on the MRF via Gibbs sampling. The technique combines geometric, topological, and probabilistic information, which can be used to represent shapes in three-dimensional space, and can be generalized to higher-dimensional spaces. As a result, the representation can be used for shape matching, registration, and statistical shape analysis.

Robust Click-Point Linking: Matching Visually Dissimilar Local Regions

Kazunori Okada, Xiaolei Huang

This paper presents robust click-point linking: a novel localized registration framework that allows users to interactively prescribe where the accuracy has to be high. By emphasizing locality and interactivity, our solution is faithful to how registration results are used in practice. Given a user-specified point, click-point linking provides a single point-wise correspondence between a data pair. In order to link visually dissimilar local regions, a correspondence is sought by using only geometrical context, without comparing the local appearances. Our solution is formulated as maximum likelihood estimation (MLE) without estimating a domain transformation explicitly. A spatial likelihood of Gaussian mixture form is designed to capture geometrical configurations between the point-of-interest and a hierarchy of global-to-local 3D landmarks that are detected using machine learning and entropy-based feature detectors. A closed-form formula is derived to specify each Gaussian component by exploiting geometric invariances under a specific group of domain transformations via RANSAC-like random sampling. A mean shift algorithm is applied to robustly and efficiently solve the local MLE problem, replacing the standard consensus step of RANSAC. Two transformation groups, pure translation and scaling/translation, are considered in this paper. We test the feasibility of the proposed approach with 16 pairs of whole-body CT data, demonstrating its effectiveness.

Session 3: Sensors

Opti-Acoustic Stereo Imaging: On System Calibration and 3-D Reconstruction

Shahriar Negahdaripour, Hicham Sekkati, Hamed Pirsiavash

Utilization of an acoustic camera for range measurements is a key advantage for 3-D shape recovery of underwater targets by opti-acoustic stereo imaging, where the associated epipolar geometry of optical and acoustic image correspondences can be described in terms of conic sections. In this paper, we propose methods for system calibration and 3-D scene reconstruction by maximum likelihood estimation from noisy image measurements. The recursive 3-D reconstruction method utilizes as its initial condition a closed-form solution that integrates the advantages of the so-called range and azimuth solutions. Synthetic data tests are given to provide insight into the merits of the new target imaging and 3-D reconstruction paradigm, while experiments with real data confirm the findings based on computer simulations and demonstrate the merits of this novel 3-D reconstruction paradigm.

Scene-space Feature Detectors

Yves Jean

Derivative, spectral, and other operators implemented in filter kernels and applied with convolution have been well studied in image recognition problems. In this paper we explore a novel system that applies such filters in the actual 3D scene. The system projects images of filters into the scene as controlled illumination. The projected filters are subjected to object surface reflectance (BRDF) and scene composition, and the filter response is captured by camera pixels. Our approach utilizes a pixel-based convolution formulation, exploiting the intrinsic integration performed by individual pixel sensors and diffraction-limited optics. For example, we project a grid of Gaussian derivative filters as images and simply retrieve a filter response as a pixel value. This is a local (pixel-only) computation that is not limited by camera resolution. Our system can serve as a front-end to existing techniques for surface or geometry estimation from texture analysis. A projector and camera pair is the basic equipment requirement.

Session 4: Optimization and Sampling

Active Visual Object Reconstruction using D-, E-, and T-Optimal Next Best Views

Stefan Wenhardt, Benjamin Deutsch, Elli Angelopoulou, Heinrich Niemann

In visual 3-D reconstruction tasks with mobile cameras, one wishes to move the cameras so that they provide the views that lead to the best reconstruction result. When the camera motion is adapted during the reconstruction, the view of interest is the next best view for the current shape estimate. We present such a next best view planning approach for visual 3-D reconstruction. The reconstruction is based on a probabilistic state estimation with sensor actions. The next best view is determined by a metric of the state estimation's uncertainty. We compare three metrics: D-optimality, which is based on the entropy and corresponds to the (D)eterminant of the covariance matrix of a Gaussian distribution, E-optimality, and T-optimality, which are based on (E)igenvalues or on the (T)race of this matrix, respectively. We show the validity of our approach with a simulation as well as real-world experiments, and compare reconstruction accuracy and computation time for the optimality criteria.
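
The three criteria reduce to simple functionals of the state covariance; a minimal sketch in which the covariance below is a random stand-in for the shape estimate's uncertainty:

    import numpy as np

    def optimality_scores(P):
        """D-, E-, and T-optimality of a covariance matrix P
        (smaller is better for all three)."""
        eigvals = np.linalg.eigvalsh(P)
        return {
            "D": float(np.linalg.det(P)),   # volume of uncertainty ellipsoid
            "E": float(eigvals.max()),      # worst-direction variance
            "T": float(np.trace(P)),        # total variance
        }

    A = np.random.default_rng(2).normal(size=(6, 6))
    P = A @ A.T  # symmetric positive semi-definite stand-in
    print(optimality_scores(P))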

Joint Priors for Variational Shape and Appearance Modeling

Jeremy Jackson, Anthony Yezzi, Stefano Soatto

We are interested in modeling the variability of different images of the same scene, or class of objects, obtained by changing the imaging conditions, for instance the viewpoint or the illumination. Understanding such variability is key to reconstructing objects despite changes in their appearance (e.g. due to non-Lambertian reflection), to recognizing classes of objects (e.g. cars), and to recognizing individual objects seen from different vantage points. We propose a model that can account for changes in shape or viewpoint, appearance, and also occlusions of the line of sight. We learn a prior model of each factor (shape, motion and appearance) from a collection of samples using principal component analysis, akin to a generalization of active appearance models to dense domains affected by occlusions. The ultimate goal of this work is stereo reconstruction in 3D, but we have first developed this approach by addressing the simpler case of 2D shape/radiance detection in single images. We illustrate our model on a collection of images of different cars and show how the learned prior can be used to improve segmentation and 3D stereo reconstruction.

Abstracts: Workshop on Object Tracking and Classification Beyond the Visible Spectrum (OTCBVS'07)

Session 1 - Thermal imaging

Thermal Imaging of the Superficial Temporal Artery: An Arterial Pulse Recovery Model

Sergey Chekmenev, Aly Farag and Edward Essock

We present a novel model for measurement of the arterial pulse from the Superficial Temporal Artery (STA) using passive thermal infrared (IR) sensors. The proposed approach has a physical and physiological basis and as such is of a fundamental nature. A thermal IR camera is used to capture the heat pattern from superficial arteries, and a blood vessel model is used to describe the pulsatile nature of the blood flow. A multiresolution wavelet-based signal analysis approach is used to extract the arterial pulse waveform, which lends itself to various physiological measurements. We validate the results using a traditional contact vital-sign monitor as ground truth. Eight people of different ages, races and genders have been tested in our study, consistent with IRB approval. The resultant arterial pulse waveforms exactly matched the ground-truth readings. The essence of our approach is the automatic detection of the Region Of arterial pulse Measurement (ROM), from which the arterial pulse waveform is extracted. To the best of our knowledge, the correspondence between non-contact thermal IR imaging based measurements of the arterial pulse in the time domain and traditional contact approaches has never been reported in the literature.
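
A rough illustration of the wavelet step only, assuming the PyWavelets package; the frame rate, wavelet, and retained detail band are guesses targeted at a roughly 1 Hz pulse, and the ROM detection is not reproduced:

    import numpy as np
    import pywt

    fs = 30.0  # assumed thermal camera frame rate (Hz)
    t = np.arange(0, 20, 1 / fs)
    # Stand-in skin-temperature trace: ~1.2 Hz pulse buried in drift + noise.
    sig = (0.02 * np.sin(2 * np.pi * 1.2 * t)
           + 0.5 * t / 20 + 0.01 * np.random.randn(t.size))

    coeffs = pywt.wavedec(sig, "db4", level=5)
    # coeffs = [cA5, cD5, cD4, cD3, cD2, cD1]; cD4 covers ~0.9-1.9 Hz at fs=30.
    kept = [c if i == 2 else np.zeros_like(c) for i, c in enumerate(coeffs)]
    pulse = pywt.waverec(kept, "db4")  # band-limited pulse waveform estimate
    print(pulse.shape)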

Thermal-Visible Video Fusion for Moving Target Tracking and Pedestrian Classification

Alex Leykin, Yang Ran and Riad Hammoud

The paper presents a fusion tracker and pedestrian classifier for color and thermal cameras. The tracker builds a background model as a multi-modal distribution of colors and temperatures. It is constructed as a particle filter that makes a number of informed reversible transformations to sample the model probability space in order to maximize the posterior probability of the scene model. Observation likelihoods of moving objects account for their 3D locations with respect to the camera and occlusions by other tracked objects as well as static obstacles. After capturing the coordinates and dimensions of moving objects we apply a pedestrian classifier based on periodic gait analysis. To separate humans from other moving objects, such as cars, we detect in human gait a symmetrical double helical pattern, which can then be analyzed using Frieze group theory. The results of tracking on color and thermal sequences demonstrate that our algorithm is robust to illumination noise and performs well in outdoor environments.

Non-linear IR Scene Prediction for Range Video Surveillance

Mehmet Celenk, James Graham and Kai-jen Cheng

This paper describes a non-linear IR (infrared) scene prediction method for range video surveillance and navigation. A Gabor-filter bank is selected as the primary detector for any changes in a given IR range image sequence. The detected ROI (region of interest) involving arbitrary motion is fed to a non-linear Kalman filter for predicting the next scene in time-varying 3D IR video. Potential applications of this research are mainly in indoor/outdoor heat-change based range measurement, synthetic IR scene generation, rescue missions, and autonomous navigation. Experimental results reported herein show that non-linear Kalman filtering based scene prediction can perform more accurately than linear estimation of future frames in range and intensity driven sensing. The low least mean square error (LMSE), averaging about 2% with a bank of 8 Gabor filters, also demonstrates the reliability of the IR scene estimator (or predictor) developed in this work.
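
For concreteness, a bank of 8 oriented Gabor filters like the one mentioned above can be built with OpenCV as follows; the kernel parameters are arbitrary illustrative choices, not the paper's:

    import numpy as np
    import cv2

    def gabor_bank(ksize=21, sigma=4.0, lambd=10.0, gamma=0.5, n_orient=8):
        """Bank of Gabor kernels at n_orient evenly spaced orientations."""
        return [
            cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma, psi=0)
            for theta in np.linspace(0, np.pi, n_orient, endpoint=False)
        ]

    frame = np.random.rand(120, 160).astype(np.float32)  # stand-in IR frame
    responses = [cv2.filter2D(frame, -1, k) for k in gabor_bank()]
    # Large response changes between frames would flag the ROI that is
    # handed to the (non-linear) Kalman predictor.
    print(len(responses), responses[0].shape)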

A Bayesian algorithm for tracking multiple moving objects in outdoor surveillance video

Manjunath Narayana and Donna Haverkamp

Reliable tracking of multiple moving objects in video is an interesting challenge, made difficult in real-world video by various sources of noise and uncertainty. We propose a Bayesian approach to find correspondences between moving objects over frames. By using color values and position information of the moving objects as observations, we probabilistically assign tracks to those objects. We allow for tracks to be lost and then recovered when they resurface. The probabilistic assignment method, along with the ability to recover lost tracks, adds robustness to the tracking system. We present results that show that the Bayesian method performs well in difficult tracking cases and compare the probabilistic results to a Euclidean distance based method.

Moving Object Detection on a Runway Prior to Landing Using an Onboard Infrared Camera

Cheng-Hua Pai, Gérard Medioni, Yu-Ping Lin and Ray Rida Hamza

Determining the status of a runway prior to landing is essential for any aircraft, whether manned or unmanned. In this paper, we present a method that can detect moving objects on the runway from an onboard infrared camera prior to the landing phase. Since the runway is a planar surface, we first locally stabilize the sequence to automatically selected reference frames using feature points in the neighborhood of the runway. Next, we normalize the stabilized sequence to compensate for the global intensity variation caused by the gain control of the infrared camera. We then create a background model to learn an appearance model of the runway. Finally, we identify moving objects by comparing the image sequence with the background model. We have tested our system with both synthetic and real world data and show that it can detect distant moving objects on the runway. We also provide a quantitative analysis of the performance with respect to variations in size, direction and speed of the target.

Detector adaptation by maximising agreement between independent data sources

Ciarán Ó Conaire, Noel O'Connor and Alan Smeaton

Traditional methods for creating classifiers have two main disadvantages. Firstly, it is time consuming to acquire, or manually annotate, the training collection. Secondly, the data on which the classifier is trained may be over-generalised or too specific. This paper presents our investigations into overcoming both of these drawbacks simultaneously, by providing example applications where two data sources train each other. This removes both the need for supervised annotation or feedback, and allows rapid adaptation of the classifier to different data. Two applications are presented: one using thermal infrared and visual imagery to robustly learn changing skin models, and another using changes in saturation and luminance to learn shadow appearance parameters.

Object of Interest Segmentation and Tracking by Using Feature Selection and Active Contours

Mohand Saïd Allili and Djemel Ziou

Most image segmentation algorithms in the past are based on optimizing an objective function that aims to achieve the similarity between several low-level features to build a partition of the image into homogeneous regions. In the present paper, we propose to incorporate the relevance (selection) of the grouping features to drive the segmentation toward the capture of objects of interest. The relevance of the features is determined through a set of positive and negative examples of a specific object defined a priori by the user. The calculation of the relevance of the features is performed by maximizing an objective function defined on the mixture likelihoods of the positive and negative object example sets. The incorporation of the feature relevance into the object segmentation is formulated through an energy functional, which is minimized by using level set active contours. We show the efficiency of the approach on several examples of object-of-interest segmentation and tracking where the feature relevance is used.

Application of the Reeb Graph Technique to Vehicle Occupant's Head Detection in Low-resolution Range Images

Pandu Devarakota, Marta Castillo-Franco, Romuald Ginhoux, Bruno Mirbach and Bjorn Ottersten

In [3], a low-resolution range sensor was investigated for an occupant classification system that distinguishes persons from child seats or an empty seat. The optimal deployment of vehicle airbags for maximum protection moreover requires information about the occupant's size and position. The detection of the occupant's position involves the detection and localization of the occupant's head. This is a challenging problem, as approaches based on local shape analysis (in 2D or 3D) alone are not robust enough, since other parts of the person's body, such as the shoulders and knees, may have shapes similar to the head. This paper discusses and investigates the potential of a Reeb graph approach to describe the topology of vehicle occupants in terms of a skeleton. The essence of the proposed approach is that an occupant sitting in a vehicle has a typical topology which leads to different branches of a Reeb graph, and the possible locations of the occupant's head are thus the end points of the Reeb graph. The proposed method is applied to real 3D range images and compared to ground-truth information. Results show the feasibility of using topological information to identify the position of the occupant's head.

Generative Graphical Models for Maneuvering Object Tracking and Dynamics Analysis

Xin Fan and Guoliang Fan

We study the challenging problem of maneuvering object tracking with unknown dynamics, i.e., forces or torques. We investigate the underlying causes of object kinematics, and propose a generative model approach that encodes the Newtonian dynamics of a rigid body by relating forces and torques to the object's kinematics in a graphical model. This model also accommodates the physical constraints between maneuvering dynamics and object kinematics in a probabilistic form, allowing more accurate and efficient object tracking. Additionally, we develop a sequential Monte Carlo inference algorithm that is embedded with Markov Chain Monte Carlo (MCMC) steps to rejuvenate the particle paths. The proposed algorithm can estimate both maneuvering dynamics and object kinematics simultaneously. Experiments performed on both simulated and real-world data of ground vehicles show the robustness and effectiveness of the proposed graphical model-based approach along with the sampling-based inference algorithm.

Pedestrian Detection in Infrared Images based on Local Shape Features

Li Zhang, Ramakant Nevatia and Bo Wu

Use of IR images is advantageous for many surveillance applications where systems must operate around the clock and external illumination is not always available. We investigate methods derived from visible spectrum analysis for the task of human detection. Two feature classes (edgelets and HOG features) and two classification models (AdaBoost and SVM cascade) are extended to IR images. We find that it is possible to get detection performance in IR images that is comparable to state-of-the-art results for visible spectrum images. It is also shown that the two domains share many features, likely originating from the silhouettes, in spite of the starkly different appearances of the two modalities.

Robust Occlusion Handling in Object Tracking

Jiyan Pan and Bo Hu

In object tracking, occlusions significantly undermine the performance of tracking algorithms. Unlike existing methods that depend solely on the observed target appearance to detect occluders, we propose an algorithm that progressively analyzes the occlusion situation by exploiting spatiotemporal context information, which is further double-checked against the reference target and motion constraints. This strategy enables our proposed algorithm to make a clearer distinction between the target and occluders than existing approaches. To further improve tracking performance, we rectify the occlusion-interfered erroneous target location by employing a variant-mask template matching operation. As a result, the correct target location can always be obtained regardless of the occlusion situation. Using these techniques, the robustness of tracking under occlusions is significantly improved. Experimental results confirm the effectiveness of our proposed algorithm.
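
The variant-mask matching idea can be caricatured as normalized correlation restricted to pixels deemed unoccluded; a toy sketch in which the mask is simply given rather than inferred from the spatiotemporal context analysis:

    import numpy as np

    def masked_ncc(patch, template, mask):
        """Normalized cross-correlation over pixels where mask is True."""
        p, t = patch[mask].astype(float), template[mask].astype(float)
        p -= p.mean()
        t -= t.mean()
        denom = np.linalg.norm(p) * np.linalg.norm(t)
        return float(p @ t / denom) if denom > 0 else 0.0

    rng = np.random.default_rng(3)
    template = rng.random((16, 16))
    patch = template.copy()
    patch[:8, :] = 0.0                       # top half occluded
    mask = np.ones((16, 16), bool)
    mask[:8, :] = False                      # exclude the occluded pixels
    print(masked_ncc(patch, template, mask))  # ~1.0 despite the occluder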

A New Method for Object Tracking Based on Regions Instead of Contours

Nicolas Amezquita, Rene Alquezar and Francesc Serratosa

This paper presents a new method for object tracking in video sequences that is especially suitable in very noisy environments. In such situations, segmented images from one frame to the next are usually so different that it is very hard or even impossible to match the corresponding regions or contours of both images. With the aim of tracking objects in these situations, our approach has two main characteristics. On one hand, we assume that tracking approaches based on contours cannot be applied, and therefore, our system uses object recognition results computed from regions (specifically, colour spots from segmented images). On the other hand, we do not attempt to match the spots of consecutive segmented images and, consequently, we discard methods that represent the objects by structures such as graphs or skeletons, since the structures obtained may be too different in consecutive frames. Thus, we represent the location of tracked objects through images of probabilities that are updated dynamically using both recognition and tracking results from previous steps. From these probabilities and a simple prediction of the apparent motion of the object in the image, a binary decision can be made for each pixel and object.

Target Tracking with Online Feature Selection in FLIR Imagery

Vijay Venkataraman, Guoliang Fan and Xin Fan

We present a particle filter-based target tracking algorithm for FLIR imagery. A dual foreground and background model is proposed for target representation that supports robust and accurate target tracking and size estimation. A novel online feature selection technique is introduced that is able to adaptively select the optimal feature to maximize the tracking confidence. Moreover, a coupled particle filtering approach is developed for joint target tracking and feature selection in a unified Bayesian estimation framework. The experimental results show that the proposed algorithm can accurately track poorly-visible targets in FLIR imagery even with strong ego-motion. The tracking performance is improved when compared to a tracker with a foreground-based target model and without online feature selection.

Extraction of objects' surface normals and indices of refraction using a pair of passive polarimetric sensors

Firooz Sadjadi

This paper summarizes a method for extracting 3D information and indices of refraction from a scene by means of a pair of polarimetric passive imaging sensors. Each sensor provides the Stokes vector at each sensor pixel location, from which the degree and angle of linear polarization are computed. The angle of linear polarization provides the azimuth angle of the surface normal vector. Two cases are considered. For the special case when the two sensors have a common azimuth plane, the index of refraction can be found analytically in terms of the degrees of polarization and the angle between the lines of sight from the two sensors, from which the depression angle of the surface normal can be computed. For the second and more general case, the surface normal is estimated from the cross-product of the azimuth vectors from the two sensors and the inner product of the line of sight vectors and surface normal. Once the depression angles are estimated, the index of refraction can be computed. Results of the application of this approach on simulated infrared polarimetric data are provided.
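
The per-pixel quantities used here follow from the first three Stokes components via the standard definitions; a minimal sketch:

    import numpy as np

    def linear_polarization(S0, S1, S2):
        """Degree (DoLP) and angle (AoLP) of linear polarization from the
        Stokes components measured at each pixel."""
        dolp = np.sqrt(S1**2 + S2**2) / np.maximum(S0, 1e-12)
        aolp = 0.5 * np.arctan2(S2, S1)  # yields the surface-normal azimuth
        return dolp, aolp

    S0, S1, S2 = np.full((4, 4), 2.0), np.full((4, 4), 0.5), np.full((4, 4), 0.5)
    dolp, aolp = linear_polarization(S0, S1, S2)
    print(dolp[0, 0], np.degrees(aolp[0, 0]))  # ~0.354, 22.5 degrees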

Modulation Domain Template Tracking

Chuong Nguyen, Joseph Havlicek and Mark Yeary

For the first time, we perform normalized correlation template tracking in the modulation domain. For each frame of the video sequence, we compute a multi-component AM-FM image model that characterizes the local texture structure of objects and backgrounds. Tracking is carried out by formulating a modulation domain correlation function in the derived feature space. Using visible and longwave infrared sequences as illustrative examples, we study the performance of this new approach relative to two basic pixel domain correlation template trackers. We also present preliminary results from a new dual domain tracker that operates simultaneously in both the pixel and modulation domains.

Learning Object Material Categories via Pairwise Discriminant Analysis

Zhouyu Fu and Antonio Robles-Kelly

In this paper, we investigate linear discriminant analysis (LDA) methods for multiclass classification problems in hyperspectral imaging. We note that LDA does not consider pairwise relations between different classes; rather, it assumes equal within- and between-class scatter matrices. As a result, we present a pairwise discriminant analysis algorithm for learning class categories. Our pairwise linear discriminant analysis measures the separability of two classes making use of the class centroids and variances. Our approach is based upon a novel cost function with unitary constraints based on the aggregation of pairwise costs for binary classes. We view the minimisation of this cost function as an unconstrained optimisation problem over a Grassmann manifold and solve it using a projected gradient method. Our approach does not require matrix inversion operations and, therefore, does not suffer from stability problems for small training sets. We demonstrate the utility of our algorithm for purposes of learning material categories in hyperspectral images.

Part-based Face Recognition Using Near Infrared Images

Ke Pan, Shengcai Liao, Zhijian Zhang, Stan Z. Li and Peiren Zhang

Recently, we developed NIR based face recognition for highly accurate face recognition under illumination variations [10]. In this paper, we present a part-based method for improving its robustness with respect to pose variations. An NIR face is decomposed into parts. A part classifier is built for each part, using the most discriminative LBP histogram features selected by AdaBoost learning. The outputs of the part classifiers are fused to give the final score. Experiments show that the present method outperforms the whole-face-based method [10] by 4.53%.

Abstracts: Workshop on Visual Surveillance (VS2007)

Oral Session 1: Alternative Modalities

Real-Time Posture Analysis in a Crowd using Thermal Imaging

Quoc-Cuong Pham, Laetitia Gond, Julien Begard, Nicolas Allezard, Patrick Sayd

This article describes a video-surveillance system developed within the ISCAPS project. Thermal imaging provides a robust solution to visibility changes (illumination, smoke) and is a relevant technology for discriminating humans in complex scenes. In this article, we demonstrate its efficiency for posture analysis in dense groups of people. The objective is to automatically detect several persons lying down in a very crowded area. The presented method is based on the detection and segmentation of individuals within groups of people using a combination of several weak classifiers. The classification of extracted silhouettes enables the detection of abnormal situations. This approach was successfully applied to the detection of terrorist gas attacks on railway platforms and experimentally validated in the project. Some of the results are presented here.

Oral Session 2: Segmentation

Multi-Layer Background Subtraction Based on Color and Texture

Jian Yao and Jean-Marc Odobez

In this paper, we propose a robust multi-layer background subtraction technique which takes advantage of local texture features represented by local binary patterns (LBP) and photometric invariant color measurements in RGB color space. LBP works robustly with respect to light variation on rich texture regions, but not so efficiently on uniform regions. In the latter case, color information should overcome LBP’s limitation. Due to the illumination invariance of both the LBP feature and the selected color feature, the method is able to handle local illumination changes such as cast shadows from moving objects. Due to the use of a simple layer-based strategy, the approach can model moving background pixels with quasi-periodic flickering as well as background scenes which may vary over time due to the addition and removal of long-time stationary objects. Finally, the use of a cross-bilateral filter allows detection results to be implicitly smoothed over regions of similar intensity while preserving object boundaries. Numerical and qualitative experimental results on both simulated and real data demonstrate the robustness of the proposed method.
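
For orientation, the texture half of such a model can be prototyped with scikit-image's LBP; the radius, neighbor count, and histogram-intersection comparison are illustrative choices, not the paper's:

    import numpy as np
    from skimage.feature import local_binary_pattern

    def lbp_histogram(gray_patch, P=8, R=1):
        """Uniform-LBP histogram of a grayscale patch (texture descriptor)."""
        codes = local_binary_pattern(gray_patch, P, R, method="uniform")
        hist, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
        return hist

    def histogram_intersection(h1, h2):
        """Similarity used to compare a region against its background model."""
        return float(np.minimum(h1, h2).sum())

    bg = (np.random.rand(16, 16) * 255).astype(np.uint8)
    fg = (np.random.rand(16, 16) * 255).astype(np.uint8)
    print(histogram_intersection(lbp_histogram(bg), lbp_histogram(fg)))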

Simultaneous Detection and Segmentation of Pedestrians using Top-down and Bottom-up Processing

Vinay Sharma and James W. Davis

We present a method for the simultaneous detection and segmentation of people from static images. The proposed technique requires no manual segmentation during training, and exploits top-down and bottom-up processing within a single framework for both object localization and 2D shape estimation. First, the coarse shape of the object is learned from a simple training phase utilizing low-level edge features. Motivated by the observation that most object categories have regular shapes and closed boundaries, relations between these features are then exploited to derive mid-level cues, such as continuity and closure. A novel Markov random field defined on the edge features is presented that integrates the coarse shape information with our expectation that objects are likely to have boundaries that are regular and closed. The algorithm is evaluated on pedestrian datasets of varying difficulty, including a wide range of camera viewpoints and person orientations. Quantitative results are presented for person detection and segmentation, demonstrating the effectiveness of the proposed technique in simultaneously addressing both tasks.

Sequential Architecture for Efficient Car Detection

Zhenfeng Zhu, Yao Zhao and Hanqing Lu

Based on multi-cue integration and hierarchical SVMs, we present a sequential architecture for efficient car detection in complex outdoor scenes. On the low level, two novel area templates based on edge and interest-point cues, respectively, are first constructed; these capture perceptual identities of the objects to some extent and are used to rapidly reject most of the negative non-car objects at the cost of missing only a few true ones. On the high level, both global structure and local texture cues are exploited to characterize the car objects precisely. To improve the computational efficiency of a general SVM, a solution-approximation based two-level hierarchical SVM is proposed. The experimental results show that the integration of global structure and local texture properties provides more powerful discrimination of car objects from non-car ones. The final high detection performance is also attributable to the use of the two novel low-level visual cues and the hierarchical SVM.

Two thresholds are better than one

Tao Zhang, Terry Boult, R.C. Johnson

The concept of the Bayesian optimal single threshold is a well established and widely used classification technique. In this paper, we prove that when spatial cohesion is assumed for targets, a better classification result than the “optimal” single threshold classification can be achieved. Under the assumption of spatial cohesion and certain prior knowledge about the target and background, the method can be further simplified as dual threshold classification. In core-dual threshold classification, spatial cohesion within the target core allows values falling between the two thresholds to be linked, by “continuation”, to the target core; classical Bayesian classification is employed beyond the dual thresholds. The core-dual threshold algorithm can be built into a Markov Random Field (MRF) model. From this MRF model, the dual thresholds can be obtained and optimal classification can be achieved. In some practical applications, a simple method called symmetric subtraction may be employed to determine effective dual thresholds in real time. Given dual thresholds, the Quasi-Connected Component algorithm is shown to be a deterministic implementation of the MRF core-dual threshold model, combining the dual thresholds, extended neighborhoods and efficient connected component computation.
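
The core-dual threshold rule behaves like hysteresis: pixels above the high threshold seed target cores, and pixels between the thresholds are kept only when connected to a core. A small sketch with scipy; the thresholds are placeholders, and the MRF formulation and symmetric subtraction are not shown:

    import numpy as np
    from scipy import ndimage

    def dual_threshold(diff, t_low, t_high):
        """Keep strong pixels plus weak pixels connected to a strong core."""
        strong = diff >= t_high
        weak = diff >= t_low
        labels, _ = ndimage.label(weak)            # connected components
        core_labels = np.unique(labels[strong])
        core_labels = core_labels[core_labels != 0]
        return np.isin(labels, core_labels)

    diff = np.array([[0.1, 0.4, 0.9, 0.4],
                     [0.0, 0.1, 0.5, 0.1],
                     [0.6, 0.0, 0.0, 0.0]])
    print(dual_threshold(diff, t_low=0.3, t_high=0.8).astype(int))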

Oral Session 3: Tracking

Kernel-Based 3D Tracking

Ambrish Tyagi, Mark Keck, James W. Davis and Gerasimos Potamianos

We present a computer vision system for robust object tracking in 3D by combining evidence from multiple calibrated cameras. This kernel-based 3D tracker is automatically bootstrapped by constructing 3D point clouds. These point clouds are then clustered and used to initialize the trackers and validate their performance. The framework describes a complete tracking system that fuses appearance features from all available camera sensors and is capable of automatic initialization and drift detection. Its elegance resides in its inherent ability to handle problems encountered by various 2D trackers, including scale selection, occlusion, view-dependence, and correspondence across views. Tracking results for an indoor smart room and a multi-camera outdoor surveillance scenario are presented. We demonstrate the effectiveness of this unified approach by comparing its performance to a baseline 3D tracker that fuses the results of independent 2D trackers, as well as comparing the re-initialization results to known ground truth.

Automatic Person Detection and Tracking using Fuzzy Controlled Active Cameras

Keni Bernardin, Florian van de Camp and Rainer Stiefelhagen

This paper presents an automatic system for the monitoring of indoor environments using pan-tilt-zoomable cameras. A combination of Haar-feature classifier-based detection and color histogram filtering is used to achieve reliable initialization of person tracks even in the presence of camera movement. A combination of adaptive color and KLT feature trackers for the face and upper body allows for robust tracking and track recovery in the presence of occlusion or interference. The continuous recomputation of camera parameters, coupled with a fuzzy control scheme, allows for smooth tracking of moving targets as well as acquisition of stable facial closeups, similar to the natural behavior of a human cameraman. The system is tested on a series of natural indoor monitoring scenarios and shows a high degree of naturalness, flexibility and robustness.

Oral Session 4: Classification

Real-time Object Classification in Video Surveillance Based on Appearance Learning

Lun Zhang, Stan Z. Li, Xiaotong Yuan and Shiming Xiang

Classifying moving objects into semantically meaningful categories is important for automatic visual surveillance. However, this is a challenging problem due to limited object size, large intra-class variations of objects in the same class owing to different viewing angles and lighting, and the real-time performance requirement of real-world applications. This paper describes an appearance-based method to achieve real-time and robust object classification across diverse camera viewing angles. A new descriptor, the Multi-block Local Binary Pattern (MB-LBP), is proposed to capture large-scale structures in object appearances. Based on MB-LBP features, an AdaBoost algorithm is introduced to select a subset of discriminative features as well as to construct the strong two-class classifier. To deal with the non-metric feature values of MB-LBP features, a multi-branch regression tree is developed as the weak classifier of the boosting. Finally, the Error Correcting Output Code (ECOC) is introduced to achieve robust multi-class classification performance. Experimental results show that our approach can achieve real-time and robust object classification in diverse scenes.
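
To make the descriptor concrete, the sketch below computes a single MB-LBP code: the mean intensities of the eight blocks surrounding the centre of a 3x3 grid of s-by-s blocks are compared against the central block's mean, giving an 8-bit pattern. The block size, anchor position and clockwise bit order are assumptions for illustration.

import numpy as np

def mb_lbp_code(img, x, y, s):
    # Mean intensity of the s-by-s block with top-left corner (bx, by).
    def block_mean(bx, by):
        return img[by:by + s, bx:bx + s].mean()

    center = block_mean(x + s, y + s)
    # The eight surrounding blocks, visited clockwise from the top-left.
    offsets = [(0, 0), (s, 0), (2 * s, 0), (2 * s, s),
               (2 * s, 2 * s), (s, 2 * s), (0, 2 * s), (0, s)]
    code = 0
    for bit, (dx, dy) in enumerate(offsets):
        if block_mean(x + dx, y + dy) >= center:
            code |= 1 << bit
    return code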

Hidden Markov Models with Kernel Density Estimation of Emission Probabilities and their Use in Activity Recognition

Massimo Piccardi and Oscar Perez

In this paper, we present a modified hidden Markov model with emission probabilities modelled by kernel density estimation and its use for activity recognition in videos. In the proposed approach, kernel density estimation of the emission probabilities is carried out simultaneously with that of all the other model parameters by an adapted Baum-Welch algorithm. This allows us to retain maximum-likelihood estimation while overcoming the known limitations of mixtures of Gaussians in modelling certain probability distributions. Experiments on activity recognition have been performed on ground-truthed data from the CAVIAR video surveillance database and are reported in the paper. The error on the training and validation sets with kernel density estimation remains around 14-16%, while that of the conventional Gaussian mixture approach varies between 15% and 24%, depending strongly on the initial values chosen for the parameters. Overall, kernel density estimation proves capable of providing more flexible modelling of the emission probabilities and, unlike Gaussian mixtures, does not suffer from being highly parametric or from difficult initialisation.
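
A hedged sketch of the basic ingredient, a kernel-density emission model, is given below: the probability of an observation under a state is a Gaussian-kernel average over that state's training observations. The bandwidth h is an assumed free parameter, and the paper's adapted Baum-Welch re-estimation is not reproduced here.

import numpy as np

def kde_emission_prob(obs, state_samples, h):
    # obs: observation of dimension d; state_samples: (n, d) training
    # observations assigned to one hidden state; h: kernel bandwidth.
    diffs = (state_samples - obs) / h
    d = state_samples.shape[1]
    norm = (2.0 * np.pi) ** (d / 2.0) * h ** d
    kernels = np.exp(-0.5 * np.sum(diffs ** 2, axis=1)) / norm
    return kernels.mean()  # average of Gaussian kernels centred on the samples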

Human Activity Recognition Based on R Transform

Ying Wang, Kaiqi Huang and Tieniu Tan

This paper addresses human activity recognition based on a new feature descriptor. For a binary human silhouette, an extended Radon transform, the R transform, is employed to represent low-level features. The advantages of the R transform lie in its low computational complexity and geometric invariance. A set of HMMs based on the extracted features is then trained to recognize activities. Compared with other commonly used feature descriptors, the R transform is robust to frame loss in video, disjoint silhouettes and holes in the shape, and thus achieves better performance in recognizing similar activities. Extensive experiments have demonstrated the efficiency of the proposed method.
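
A common definition of the R transform integrates the squared Radon transform over the radial coordinate, giving a compact one-dimensional descriptor per projection angle. The sketch below follows that definition using scikit-image; the angle sampling and the normalisation by the maximum are assumed conventions.

import numpy as np
from skimage.transform import radon

def r_transform(silhouette, n_angles=180):
    theta = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    # Radon transform: rows index the radial coordinate rho, columns the angle.
    sinogram = radon(silhouette.astype(float), theta=theta)
    r = np.sum(sinogram ** 2, axis=0)  # integrate squared values over rho
    return r / r.max()                 # normalise for scale robustness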

Posters

Multi-Object Tracking Using Color, Texture and Motion

Valtteri Takala and Matti Pietikäinen, University of Oulu

In this paper, we introduce a novel real-time tracker based on color, texture and motion information. An RGB color histogram and correlogram (autocorrelogram) are exploited as color cues, and texture properties are represented by local binary patterns (LBP). An object's motion is taken into account through its location and trajectory. After extraction, these features are used to build a unifying distance measure. The measure is utilized in tracking and in classifying events in which an object leaves a group. The initial object detection is done by a texture-based background subtraction algorithm. Experiments on indoor and outdoor surveillance videos show that the unified system works better than versions based on single features. It also copes well with low illumination conditions and low frame rates, which are common in large-scale surveillance systems.
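
One plausible form of such a unifying measure is a fixed-weight sum of per-cue distances, as sketched below. The weights, the Bhattacharyya-style histogram distance, and the field names (color, lbp, xy, pred_xy, gate) are hypothetical choices for illustration, not the paper's exact measure.

import numpy as np

def hist_dist(h1, h2):
    # Bhattacharyya-style distance between normalised histograms.
    return np.sqrt(max(0.0, 1.0 - float(np.sum(np.sqrt(h1 * h2)))))

def unified_distance(obj, cand, w=(0.4, 0.3, 0.3)):
    d_color = hist_dist(obj["color"], cand["color"])   # colour histogram cue
    d_texture = hist_dist(obj["lbp"], cand["lbp"])     # LBP histogram cue
    # Motion cue: predicted location versus candidate location, gated.
    d_motion = np.linalg.norm(obj["pred_xy"] - cand["xy"]) / obj["gate"]
    return w[0] * d_color + w[1] * d_texture + w[2] * min(d_motion, 1.0)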

EDA Approach for Model Based Localization and Recognition of Vehicles

Zhaoxiang Zhang, Weishan Dong, Kaiqi Huang and Tieniu Tan

We address the problem of model-based recognition. Our aim is to localize and recognize road vehicles from monocular images in calibrated scenes. A deformable 3D geometric vehicle model with 12 parameters is set up as prior information, and the Bayesian classification error is adopted to evaluate the fitness between the model and images. Using a novel evolutionary computing method called EDA (Estimation of Distribution Algorithm), we can not only determine the 3D pose of the vehicle, but also obtain a 12-dimensional vector corresponding to the 12 shape parameters of the model. By clustering the obtained vectors in the parameter space, we can recognize different types of vehicles. Experimental results demonstrate the effectiveness of the approach on vehicles of different types and poses. Thanks to EDA, we can not only localize and recognize vehicles, but also show the whole evolution procedure of the deformable model as it gradually fits the image more and more closely.

Euclidean Path Modeling from Ground and Aerial Views

Imran N. Junejo and Hassan Foroosh

We address the issue of Euclidean path modeling in a single camera for activity monitoring in a multi-camera video surveillance system. The paper proposes a novel linear solution to auto-calibrate any camera observing pedestrians, and uses these calibrated cameras to detect unusual object behavior. The input trajectories are metric-rectified, the input sequences are registered to satellite imagery, and prototype path models are constructed. During the testing phase, using our simple yet efficient similarity measures, we seek a relation between the input trajectories derived from a sequence and the prototype path models. Real-world pedestrian sequences are used to demonstrate the practicality of the proposed method.

Semi-supervised Learning on Semantic Manifold for Event Analysis in Dynamic Scenes

Lun Xin and Tieniu Tan

Events can be considered as evident changes in important properties that carry semantic meaning. Usually, these properties are measurable and continuous, with complex formats and high dimensionality. It is hard to define and measure semantic events on the original observed data. However, following the perception process of human beings, these spatio-temporally continuous data can be mapped onto corresponding smooth manifolds, where different appearances indicate different semantic meanings. In this paper, we propose a semi-supervised learning method, based on partially labeled data, that maps the original observed data onto semantic manifolds for event definition and analysis in dynamic scenes. Furthermore, we construct semantic representations for various events in real-world scenes. Finally, we present experimental results to evaluate the performance of our method.

Cast Shadow Removal Combining Local and Global Features

Zhou Liu, Kaiqi Huang, Tieniu Tan and Liang Sheng Wang

In this paper, we present a method that uses pixel-level, local region-level and global-level information to remove shadows. At the pixel level, we employ a GMM to model the behavior of cast shadows for every pixel in the HSV color space, as it can deal with complex illumination conditions. However, unlike the GMM for the background, which obtains a sample every frame, this model for shadows needs more frames to gather the same number of samples, because a shadow may not appear at the same pixel in each frame. Therefore, it would take a long time to converge. To overcome this drawback, we use local region-level information to obtain more samples, and global-level information to improve a pre-classifier which then yields samples that are more likely to be shadow. Also, at the local region level, we use Markov random fields to represent dependencies between the label of a single pixel and the labels of its neighborhood. Moreover, to make the global-level information more robust, tracking information is used. Experimental results show that the proposed method is efficient and robust.

Capturing People in Surveillance Video

Rogerio Feris, Ying-li Tian and Arun Hampapur

This paper presents reliable techniques for detecting, tracking, and storing keyframes of people in surveillance video. The first component of our system is a novel face detection algorithm, which first learns local adaptive features for each training image and then uses Adaboost learning to select the most general features for detection. This method provides a powerful mechanism for combining multiple features, allowing faster training and better detection rates. The second component is a face tracking algorithm that interleaves multiple view-based classifiers along the temporal domain of a video sequence. This interleaving technique, combined with a correlation-based tracker, enables fast and robust face tracking over time. Finally, the third component of our system is a keyframe selection method that combines a person classifier with a face classifier. The basic idea is to generate a person keyframe when the face is not visible, in order to reduce the number of false negatives. We performed quantitative evaluation of our techniques on standard datasets and on surveillance videos captured by a camera over several days.

Spatio-temporal Shape and Flow Correlation for Action Recognition

Yan Ke, Rahul Sukthankar and Martial Hebert

This paper explores the use of volumetric features for action recognition. First, we propose a novel method to correlate spatio-temporal shapes to video clips that have been automatically segmented. Our method works on oversegmented videos, which means that we do not require background subtraction for reliable object segmentation. Next, we discuss and demonstrate the complementary nature of shape- and flow-based features for action recognition. Our method, when combined with a recent flow-based correlation technique, can detect a wide range of actions in video, as demonstrated by results on a long tennis video. Although not specifically designed for whole-video classification, we also show that our method’s performance is competitive with current action classification techniques on a standard video classification dataset.

Recognizing Night Walkers Based on One Pseudoshape Representation of Gait

Daoliang Tan, Kaiqi Huang, Shiqi Yu, and Tieniu Tan

Gait is a promising biometric cue which can facilitate the recognition of human beings, particularly when other biometrics are unavailable. Existing work on gait recognition, however, places most of its emphasis on daytime walker recognition and overlooks the significance of walker recognition at night. This paper deals with the problem of recognizing nighttime walkers. We take advantage of infrared gait patterns to accomplish this task: 1) walker detection is improved using intensity-compensation-based background subtraction; 2) pseudoshape-based features are proposed to describe gait patterns; 3) the dimension of the gait features is reduced through principal component analysis (PCA) and linear discriminant analysis (LDA); 4) temporal cues are exploited in the form of relevant component analysis (RCA) learning; 5) a nearest neighbor classifier is used to recognize unknown gaits. Experimental results justify the effectiveness of our method and show that it has encouraging potential for application in surveillance systems.
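
The dimensionality-reduction and classification stages (steps 3 and 5) map directly onto standard tools; a minimal scikit-learn sketch is given below, with the PCA component count as an assumed parameter and the RCA step deliberately omitted.

from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

def build_gait_classifier(X_train, y_train, n_pca=50):
    # PCA for compression, LDA for class separation, 1-NN for recognition.
    clf = make_pipeline(PCA(n_components=n_pca),
                        LinearDiscriminantAnalysis(),
                        KNeighborsClassifier(n_neighbors=1))
    return clf.fit(X_train, y_train)

# Usage: model = build_gait_classifier(X_train, y_train)
#        predicted_ids = model.predict(X_test)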

Object Classification In Visual Surveillance Using Adaboost

J.R.Renno, D. Makris and G.A. Jones

In this paper, we present a method of object classification within the context of Visual Surveillance. Our goal is the classification of tracked objects into one of the two classes: people and cars. Using training data comprised of trajectories tracked from our car-park, a weighted ensemble of Adaboost classifiers is developed. Each ensemble is representative of a particular feature, evaluated and normalised by its significance. Classification is performed using the sub-optimal hyperplane derived by selection of the N-best performing feature ensembles. The resulting performance is compared to a similar Adaboost classifier, trained using a single ensemble over all dimensions.

A Sequential Monte Carlo Approach to Anomaly Detection in Tracking Visual Events

Peng Cui, Lifeng Sun, Zhi-qiang Liu and Shi-qiang Yang

In this paper we propose a technique to detect anomalies in individual and interactive event sequences. We categorize anomalies into two classes, abnormal events and abnormal contexts, and model them in a Sequential Monte Carlo framework which is extended by a Markov Random Field for tracking interactive events. First, we propose a novel pixel-wise event representation method to construct feature images, in which each blob corresponds to a visual event. Then we transform the original blob-level features into subspaces to model probabilistic appearance manifolds for each event class. With the probability of an observation under each event class (or state) derived from the probabilistic manifolds, and the state transition probabilities, the prior and posterior state distributions can be estimated. We demonstrate in experiments that the approach can reliably detect such anomalies with low false alarm rates.

Robust Change-Detection by Normalised Gradient-Correlation

Robert O'Callaghan and Tetsuji Haga

A novel algorithm for robustly segmenting changes between different images of a scene is presented. This computationally efficient algorithm is based on a non-linear comparison of gradient structure in overlapping image regions and offers intrinsic invariance to changing illumination, without recourse to background-model adaptation. High accuracy is demonstrated on test video data with and without illumination changes. The technique is applicable to motion segmentation as well as to measuring longer-term object changes.
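
The core idea can be illustrated with a windowed, normalised correlation of gradient fields between a reference image and the current frame, flagging change where structure disagrees. The Sobel gradients, window size and threshold below are illustrative choices rather than the paper's exact non-linear comparison.

import numpy as np
from scipy import ndimage

def gradient_correlation_change(ref, cur, win=7, thresh=0.5):
    def grads(im):
        im = im.astype(float)
        return ndimage.sobel(im, axis=1), ndimage.sobel(im, axis=0)
    rx, ry = grads(ref)
    cx, cy = grads(cur)
    box = lambda a: ndimage.uniform_filter(a, size=win)
    # Windowed inner product of gradient fields, normalised by their energies.
    num = box(rx * cx + ry * cy)
    den = np.sqrt(box(rx ** 2 + ry ** 2) * box(cx ** 2 + cy ** 2)) + 1e-9
    corr = num / den          # in [-1, 1]; near 1 where structure agrees
    return corr < thresh      # changed where gradient structure disagrees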

Progressive Learning for Interactive Surveillance Scenes Retrieval

Jerome Meessen, Xavier Desurmont, Jean-Francois Delaigle, Christophe De Vleeschouwer and Benoit Macq

This paper tackles the challenge of interactively retrieving visual scenes within surveillance sequences acquired with a fixed camera. Contrary to today's solutions, we assume that no a priori knowledge is available, so the system must progressively learn the target scenes through interactive labelling of a few frames by the user. The proposed method is based on very low-cost feature extraction and integrates relevance feedback, multiple-instance SVM classification and active learning. Each of these three steps runs iteratively over the session and takes advantage of the progressively increasing training set. Repeatable experiments on both simulated and real data demonstrate the efficiency of the approach and show how it achieves high retrieval performance.

OVVV: Using Virtual Worlds to Design and Evaluate Surveillance Systems

Geoffrey R. Taylor, Andrew J. Chosak and Paul C. Brewer

ObjectVideo Virtual Video (OVVV) is a publicly available visual surveillance simulation test bed based on a commercial game engine. The tool simulates multiple synchronized video streams from a variety of camera configurations, including static, PTZ and omni-directional cameras, in a virtual environment populated with computer or player controlled humans and vehicles. To support performance evaluation, OVVV generates detailed automatic ground truth for each frame including target centroids, bounding boxes and pixel-wise foreground segmentation. We describe several realistic, controllable noise effects including pixel noise, video ghosting and radial distortion to improve the realism of synthetic video and provide additional dimensions for performance testing. Several indoor and outdoor virtual environments developed by the authors are described to illustrate the range of testing scenarios possible using OVVV. Finally, we provide a practical demonstration of using OVVV to develop and evaluate surveillance algorithms.

Multiple-View Face Tracking For Modeling and Analysis Based On Non-Cooperative Video Imagery

Scott Von Duhn, Lijun Yin, Myung Jin Ko, and Terry Hung

3D face analysis has been researched intensively in recent decades. Most 3D data (so-called range facial data) are obtained from 3D range imaging systems. Such data representations have proven effective for face recognition in 3D space. However, obtaining such data requires subject cooperation in a constrained environment, which is not practical for many real applications of video surveillance. There is therefore strong demand for using regular video cameras to generate 3D face models for further classification. The goal of our research is to develop a method of tracking feature points on a face in multiple views in order to build 3D models of individual faces. We propose a three-view-based video tracking and model creation algorithm built on the Active Appearance Model and a generic facial model. We describe how to build useful individual models over time, and validate the created dynamic model sequences through the application of face recognition. Tracking multi-view fiducial points of a face over a time sequence can also be used for facial expression analysis. Our experiments demonstrate the feasibility of the proposed work.

List of Papers: Workshop on Multimodal Sentient Computing – Sensors, Algorithms, and Systems

Extended Abstracts for papers at this meeting are available on the conference DVD-ROM.

Introduction (Tom Huang, Zhigang Zhu, Ying-li Tian)

Session I: Multimodal Biometrics

Multi-Modal Biometrics Involving the Human Ear, Christopher Middendorff, Kevin W. Bowyer and Ping Yan (University of Notre Dame)

Fusion of Face and Palmprint for Personal Identification Based on Ordinal Features, Rufeng Chu, Shengcai Liao, Yufei Han, Zhenan Sun, Stan Z. Li and Tieniu Tan (Center for Biometrics and Security Research & National Laboratory of Pattern Recognition, Chinese Academy of Sciences)

Human Identification Using Gait and Face, Rama Chellappa (University of Maryland), Amit K. Roy-Chowdhury (University of California Riverside), Amit Kale (University of Maryland; University of Kentucky)

Multimodal Biometric Systems: Applications and Usage Scenarios, Michael Thieme, Director of Special Projects (International Biometric Group)

Session II: Multimodal Sentient Computing

Automatic Audio-Visual Speech Recognition, Stephen M. Chu (IBM T.J. Watson Research Center), and Thomas S. Huang (UIUC)

Multimodal Tracking for Smart Videoconferencing and Video Surveillance, Dmitry Zotkin, Vikas Raykar, Ramani Duraiswami and Larry S. Davis (U. Maryland)

Sensor Fusion and Environmental Modeling for Multimodal Sentient Computing, Christopher Town (University of Cambridge Computer Laboratory, UK)

SATware: Middleware for Sentient Spaces, Bijit Hore, Hojjat Jafarpour, Ramesh Jain, Shengyue Ji, Daniel Massaguer, Sharad Mehrotra, Nalini Venkatasubramanian, Utz Westermann (University of California at Irvine)

Session III: Multimodal Surveillance (Chair: Ying-Li Tian)

An End-to-End eChronicling System for Mobile Human Surveillance, Gopal Pingali, Ying-Li Tian, Shahram Ebadollahi, Mark Podlaseck, Jason Pelecanos, Harry Stavropoulos (IBM T.J. Watson Research Center)

Systems issues in distributed multi-modal surveillance, Li Yu (ObjectVideo Inc.), Terry Boult (University of Colorado at Colorado Springs)

A Multimodal Workbench for Automatic Surveillance, Dragos Datcu, Zhenke Yang, L.J.M. Rothkrantz (Delft University of Technology)

Automatic 3D Modeling of Cities with Multimodal Air and Ground Sensors, Avideh Zakhor and Christian Frueh (University of California at Berkeley)

Session IV: Multimodal Sensors

The ARL Multi-Modal Sensor: a research tool for target signature collection, algorithm validation, and emplacement studies, Jeff Houser, Lei Zong (Army Research Lab)

Multimodal Image Fusion Systems, Diego Socolinsky (Equinox Corporation)

LDV Sensing and Processing for Remote Hearing in a Multimodal Surveillance System, Zhigang Zhu, Weihong Li, Edgardo Molina and George Wolberg (CUNY City College and Graduate Center)

Sensor and Data Systems, Audio-Assisted Cameras and Acoustic Doppler Sensors, Paris Smaragdis, Bhiksha Raj and Kaustubh Kalgaonkar (MERL Research Lab)

Abstracts: Workshop on Embedded Computer Vision

Applications

Real-Time License Plate Recognition on an Embedded DSP-Platform

Clemens Arth, Florian Limberger, Horst Bischof

In this paper we present a full-featured license plate detection and recognition system. The system is implemented on an embedded DSP platform and processes a video stream in real time. It consists of a detection module and a character recognition module. The detector is based on the AdaBoost approach presented by Viola and Jones. Detected license plates are segmented into individual characters using a region-based approach, and character classification is performed with support vector classification. In order to speed up the detection process on the embedded device, a Kalman tracker is integrated into the system: the search area of the detector is limited to locations where the next position of a license plate is predicted. Furthermore, classification results from subsequent frames are combined to improve classification accuracy. The major advantages of our system are its real-time capability and the fact that it does not require any sensor input (e.g. from infrared sensors) besides a video stream. We evaluate our system on a large number of vehicles and license plates using poor-quality video and show that the low resolution can be partly compensated by combining classification results from subsequent frames.

PrivacyCam: a Privacy Preserving Camera Using uCLinux on the Blackfin DSP

Ankur Chattopadhyay, Terry Boult

Considerable research has been done in the areas of surveillance and biometrics, where the goals have always been high performance, robust security and cost optimization. With the emergence of more intelligent and complex video surveillance mechanisms, the issue of “privacy invasion” has been looming large, yet very little investment or effort has gone into addressing it in an efficient and cost-effective way. PICO (Privacy through Invertible Cryptographic Obscuration) combines cryptographic techniques with image processing and video surveillance to provide a practical solution to this critical issue. This paper presents the idea and an example of a real-time embedded application of the PICO technique, using uCLinux on the tiny Blackfin DSP architecture along with a small Omnivision camera. It demonstrates how the practical problem of “privacy invasion” can be successfully addressed with DSP hardware of small size and low cost. After a review of previous approaches to privacy protection and of the system components, we discuss the “embedded jpeg-space” detection of regions of interest to improve privacy while allowing general surveillance to continue. The resulting approach permits full access (violation of privacy) only through possession of the private key needed to recover the decryption key, thereby striking a fine trade-off among privacy, security, cost, and space.

Real time planar surface segmentation in disparity space

Ninad Thakoor, Sungyong Jung, Jean Gao

An iterative segmentation-estimation framework for the segmentation of planar surfaces in disparity space is implemented on a Digital Signal Processor (DSP). The disparity of a scene is modeled by approximating the various surfaces in the scene as planar. The surface labels are estimated during the segmentation phase of the framework with the help of the underlying plane parameters. After segmentation, planar surfaces are separated into spatially continuous regions. The largest of these regions is used to compute the estimates of the plane parameters. The iterative process is continued until convergence. The algorithm was optimized and implemented on a TMS320DM642-based embedded system that operates at 3 to 5 frames per second on images of size 320 × 240.
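
The estimation half of this loop amounts to a least-squares fit of the planar disparity model d(x, y) = ax + by + c to the pixels of one region; a minimal sketch follows, with variable names chosen for illustration.

import numpy as np

def fit_disparity_plane(xs, ys, ds):
    # xs, ys: pixel coordinates of one region; ds: their disparities.
    A = np.column_stack([xs, ys, np.ones(len(xs))])
    (a, b, c), *_ = np.linalg.lstsq(A, ds, rcond=None)
    return a, b, c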

Architecture

A Specialized Processor Suitable for AdaBoost-Based Detection with Haar-like Features

Masayuki Hiromoto, Kentaro Nakahara, Hiroki Sugano, Yukihiro Nakamura, Ryusuke Miyamoto

Robust and rapid object detection is one of the great challenges in the field of computer vision. This paper proposes a hardware architecture suitable for the object detector of Viola and Jones [9], based on an AdaBoost learning algorithm with Haar-like features as weak classifiers. Our architecture realizes rapid and robust detection through two major features: hybrid parallel execution and an image scaling method. The first exploits the cascade structure of the classifiers, in which classifiers located near the beginning of the cascade are used more frequently than subsequent ones; we therefore assign more parallel resources to the earlier classifiers than to the later ones. This dramatically improves the total processing speed without a great increase in circuit area. The second feature is a method of scaling the input images instead of scaling the classifiers. This increases the efficiency of the hardware implementation while retaining a high detection rate. In addition, we implement the proposed architecture on a Virtex-5 FPGA to show that it achieves real-time object detection at 30 frames per second on VGA video.

OpenVL: Towards A Novel Software Architecture for Computer Vision

Changsong Shen, James Little, Sidney Fels

This paper presents our progress on OpenVL, a novel software architecture that addresses efficiency (through facilitating hardware acceleration), reusability and scalability for computer vision. A logical image understanding pipeline is introduced to allow parallel processing. We also discuss our middleware, VLUT, which enables applications to operate transparently over a heterogeneous collection of hardware implementations. OpenVL works as a state machine, with an event-driven mechanism to provide users with application-level interaction. Various explicit or implicit synchronization and communication methods are supported among distributed processes in the logical pipelines. The intent of OpenVL is to allow users to quickly and easily recover useful information from multiple scenes across various software environments and hardware platforms. We implement two different human tracking systems to validate the critical underlying concepts of OpenVL.

Hardware Implementation of an SAD-Based Stereo Vision Algorithm

Kristian Ambrosch, Martin Humenberger, Wilfried Kubinger, Andreas Steininger

This paper presents the hardware implementation of a stereo vision core algorithm that runs in real time and is targeted at automotive applications. The algorithm is based on the Sum of Absolute Differences (SAD) and computes the disparity map using 320 × 240 input images with a maximum disparity of 100 pixels. The hardware operates at a frequency of 65 MHz and achieves a frame rate of 425 fps by computing the data in a highly parallel and pipelined fashion. It thereby outperforms a basically optimized software implementation, running on an Intel Pentium 4 with 3 GHz clock frequency, by a factor of 166.
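
For reference, a minimal software formulation of SAD block matching (the kind of baseline such hardware accelerates) is sketched below: for each pixel, the disparity with the smallest windowed sum of absolute differences wins. The window size and the unoptimised vectorised form are illustrative assumptions; the 100-pixel disparity range follows the abstract.

import numpy as np
from scipy.ndimage import uniform_filter

def sad_disparity(left, right, max_disp=100, win=9):
    h, w = left.shape
    cost = np.full((max_disp + 1, h, w), np.inf)
    for d in range(max_disp + 1):
        # Absolute differences between the left image and the right image
        # shifted by the candidate disparity d.
        diff = np.abs(left[:, d:].astype(float) - right[:, :w - d])
        # Box filtering equals the windowed SAD up to a constant scale.
        cost[d, :, d:] = uniform_filter(diff, size=win)
    return np.argmin(cost, axis=0)  # winner-takes-all disparity per pixel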

Analysis

Multimodal Mean Adaptive Backgrounding for Embedded Real-Time Video Surveillance

Scott Wills, Linda Wills, Senyo Apewokin, Brian Valentine, Antonio Gentile

Automated video surveillance applications require accurate separation of foreground and background image content. Cost-sensitive embedded platforms place real-time performance and efficiency demands on the techniques used to accomplish this task. In this paper we evaluate pixel-level foreground extraction techniques for a low-cost integrated surveillance system. We introduce a new adaptive technique, multimodal mean (MM), which balances accuracy, performance, and efficiency to meet embedded system requirements. Our evaluation compares several pixel-level foreground extraction techniques in terms of their computation and storage requirements, and their accuracy on three representative video sequences. The proposed MM algorithm delivers accuracy comparable to the best alternative (Mixture of Gaussians) with a 6x improvement in execution time and an 18% reduction in required storage.
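
A plausible per-pixel reading of a multimodal-mean model is a small fixed set of cells, each holding a running mean and a support count; the sketch below matches an incoming value to the nearest cell within a threshold and otherwise recycles the weakest cell. The cell count K and threshold tau are assumptions, not the paper's tuned values.

import numpy as np

class MultimodalMeanPixel:
    def __init__(self, K=4, tau=15.0):
        self.means = np.zeros(K)   # one running mean per cell
        self.counts = np.zeros(K)  # support of each cell
        self.tau = tau

    def update(self, v):
        d = np.abs(self.means - v)
        i = int(np.argmin(d))
        if self.counts[i] > 0 and d[i] < self.tau:
            # Matched an existing background mode: fold the value in.
            self.counts[i] += 1
            self.means[i] += (v - self.means[i]) / self.counts[i]
            return True   # background
        # No match: recycle the least-supported cell for the new value.
        j = int(np.argmin(self.counts))
        self.means[j], self.counts[j] = v, 1.0
        return False      # candidate foreground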

Robust Local Features and their Application in Self-Calibration and Object Recognition on Embedded Systems

Clemens Arth, Christian Leistner, Horst Bischof

In recent years many powerful computer vision algorithms have been invented, making automatic or semi-automatic solutions to many popular vision tasks, such as visual object recognition or camera calibration, possible. Meanwhile, embedded vision platforms and solutions such as smart cameras have successfully emerged, yet they offer only limited computational and memory resources. The first contribution of this paper is an investigation of a set of robust local feature detectors and descriptors for use on embedded systems. We briefly describe the methods involved, namely the DoG (Difference of Gaussians) and MSER (Maximally Stable Extremal Regions) detectors as well as the PCA-SIFT descriptor, and discuss their suitability for smart systems and given tasks. The second contribution of this work is the experimental evaluation of these methods on two challenging tasks: fully embedded object recognition on a moderately sized database, and robust camera calibration. We present encouraging results at length.

A Human Action Recognition System for Embedded Computer Vision Application

Hongying Meng, Nick Pears, Chris Bailey

In this paper, we propose a human action recognition system suitable for embedded computer vision applications in security systems, human-computer interaction and intelligent environments. Our system is suitable for embedded computer vision for three reasons. First, the system is based on a linear Support Vector Machine (SVM) classifier, whose classification step can be implemented easily and quickly in embedded hardware. Second, we use compact motion features that are easily obtained from videos. We address the limitations of the well-known Motion History Image (MHI) and propose a new Hierarchical Motion History Histogram (HMHH) feature to represent the motion information; HMHH not only provides rich motion information, but also remains computationally inexpensive. Finally, we combine MHI and HMHH and extract a low-dimensional feature vector for use in the SVM classifiers. Experimental results show that our system achieves a significant improvement in recognition performance.

Performance Benchmark of DSP and FPGA Implementations of Low-Level Vision Algorithms

Daniel Baumgartner, Peter Roessler, Wilfried Kubinger

Selecting an embedded hardware platform for image processing has a big influence on the achievable performance. This paper reports our work on a performance benchmark of different implementations of some low-level vision algorithms. The algorithms are implemented on both Digital Signal Processor (DSP) and Field Programmable Gate Array (FPGA) high-speed embedded platforms. The target platforms are a TI TMS320C6414 DSP and an Altera Stratix FPGA. The implementations are evaluated, compared and discussed. The DSP implementations outperform the FPGA implementations, but at the cost of spending all of the DSP's resources on these tasks. FPGAs, however, are well suited to algorithms that benefit from parallel execution.

Abstracts: Workshop on Image Registration and Fusion

Session 1: Image Registration I

Metropolis-Hastings techniques for finite-element based registration (Invited)

Frederic Richard, Adeline Samon

In this paper, we focus on the design of Markov Chain Monte Carlo techniques in a statistical registration framework based on a finite-element (FE) basis. Due to the use of the FE basis, this framework has specific features, the main one being that the displacement random fields are Markovian. We construct two hybrid Gibbs/Metropolis-Hastings algorithms which take full advantage of this Markovian property. The second technique is defined in a coarse-to-fine way by introducing a penalization on the sampled posterior distribution. We present promising results suggesting that both techniques can accurately register images. Experiments also show that the penalized technique is more robust to local maxima of the posterior distribution than the first. This study is a preliminary step towards the estimation of model parameters in complex image registration problems.

Research issues in image registration for remote sensing (Invited)

Roger D. Eastman, Jacqueline Le Moigne, Nathan Netanyahu

Image registration is an important element in data processing for remote sensing with many applications and a wide range of solutions. Despite considerable investigation the field has not settled on a definitive solution for most applications and a number of questions remain open. This article looks at selected research issues by surveying the experience of operational satellite teams, application-specific requirements for Earth science, and our experiments in the evaluation of image registration algorithms with emphasis on the comparison of algorithms for subpixel accuracy. We conclude that remote sensing applications put particular demands on image registration algorithms to take into account domain-specific knowledge of geometric transformations and image content.

Fourier methods for nonparametric image registration

Nathan Cahill, Alison Noble, David Hawkes

Nonparametric image registration algorithms use deformation fields to define nonrigid transformations relating two images. Typically, these algorithms operate by successively solving linear systems of partial differential equations. These PDE systems arise by linearizing the Euler-Lagrange equations associated with the minimization of a functional defined to contain an image similarity term and a regularizer. Iterative linear system solvers can be used to solve the linear PDE systems, but they can be extremely slow. Some faster techniques based on Fourier methods, multigrid methods, and additive operator splitting, exist for solving the linear PDE systems for specific combinations of regularizers and boundary conditions. In this paper, we show that Fourier methods can be employed to quickly solve the linear PDE systems for every combination of standard regularizers (diffusion, curvature, elastic, and fluid) and boundary conditions (Dirichlet, Neumann, and periodic).
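
As a worked instance of the Fourier approach, consider the diffusion regularizer with periodic boundary conditions: the linearized system (I - alpha * Laplacian) u = f is diagonal in the Fourier basis, so it can be solved with one FFT, a pointwise division, and an inverse FFT. The sketch below assumes this specific regularizer/boundary combination, with alpha as an illustrative weight.

import numpy as np

def solve_diffusion_periodic(f, alpha):
    # Solves (I - alpha * Laplacian) u = f on a periodic grid via the FFT.
    ny, nx = f.shape
    ky = 2.0 * np.pi * np.fft.fftfreq(ny)
    kx = 2.0 * np.pi * np.fft.fftfreq(nx)
    # Eigenvalues of the negated 5-point discrete Laplacian.
    lap = (2.0 - 2.0 * np.cos(ky))[:, None] + (2.0 - 2.0 * np.cos(kx))[None, :]
    u_hat = np.fft.fft2(f) / (1.0 + alpha * lap)
    return np.real(np.fft.ifft2(u_hat))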

Session 2: Image Registration II

Gradient intensity: A new mutual information-based registration method

Ramtin Shams, Parastoo Sadeghi, Rodney A. Kennedy

Conventional mutual information (MI)-based registration using pixel intensities is time-consuming and ignores spatial information, which can lead to misalignment. We propose a method to overcome these limitations by acquiring initial estimates of the transformation parameters. We introduce the concept of ‘gradient intensity’ as a measure of the spatial strength of an image in a given direction. We determine the rotation parameter by maximizing the MI between gradient intensity histograms; calculation of this gradient intensity MI function is extremely efficient. Our method is designed to be invariant to scale and translation between the images. We then obtain estimates of the scale and translation parameters using methods based on the centroids of gradient images. The estimated parameters are used to initialize an optimization algorithm designed to converge more quickly than the standard Powell algorithm in the close proximity of the minimum. Experiments show that our method significantly improves the performance of the registration task and reduces the overall computational complexity by an order of magnitude.
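
One way to picture the rotation search is to bin gradient magnitude by direction into a per-image ‘gradient intensity’ profile and then maximise an MI score over cyclic shifts of one profile. The binning, the histogram-based MI estimate and the shift search below are illustrative reconstructions, not the authors' exact formulation.

import numpy as np

def gradient_intensity(img, n_bins=90):
    gy, gx = np.gradient(img.astype(float))
    mag, ang = np.hypot(gx, gy), np.arctan2(gy, gx) % np.pi
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    return np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)

def mutual_information(a, b, n_bins=16):
    h, _, _ = np.histogram2d(a, b, bins=n_bins)
    p = h / h.sum()
    px, py = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / (px @ py)[nz])))

def best_rotation(img_a, img_b, n_bins=90):
    ga = gradient_intensity(img_a, n_bins)
    gb = gradient_intensity(img_b, n_bins)
    scores = [mutual_information(ga, np.roll(gb, s)) for s in range(n_bins)]
    return int(np.argmax(scores)) * (180.0 / n_bins)  # estimate in degrees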

Keypoint descriptors for matching across multiple image modalities and non-linear intensity variation

Avi Kelman, Michal Sofka, Charles Stewart

In this paper, we investigate the effect of substantial inter-image intensity changes and changes in modality on the performance of keypoint detection, description, and matching algorithms in the context of image registration. In doing so, we modify widely-used keypoint descriptors such as SIFT and shape contexts, attempting to capture the insight that some structural information is indeed preserved between images despite dramatic appearance changes. These extensions include (a) pairing opposite-direction gradients in the formation of orientation histograms and (b) focusing on edge structures only. We also compare the stability of MSER, Laplacian-of-Gaussian, and Harris corner keypoint location detection and the impact of detection errors on matching results. Our experiments on multimodal image pairs and on image pairs with significant intensity differences show that indexing based on our modified descriptors produces more correct matches on difficult pairs than current techniques at the cost of a small decrease in performance on easier pairs. This extends the applicability of image registration algorithms such as the Dual-Bootstrap which rely on correctly matching only a small number of keypoints.

Local shape registration using boundary-constrained match of skeletons

Yun Zhu, Xenophon Papademetris, Albert Sinusas, James Duncan

This paper presents a new shape registration algorithm that establishes “meaningful correspondence” between objects, in that it preserves the local shape correspondence between the source and target objects. Observing that an object's skeleton corresponds to its local shape peaks, we use the skeleton to characterize the local shape of the source and target objects. Unlike traditional graph-based skeleton matching algorithms, which focus on matching skeletons alone and ignore the overall alignment of the boundaries, our algorithm is formulated in a variational framework which aligns local shape by registering two potential fields associated with the skeletons. We also add a boundary constraint term to the energy functional, so that our algorithm can be applied to matching bulky objects whose skeleton and boundary are far from each other. To increase the robustness of our algorithm, we incorporate an M-estimator and a dynamic pruning algorithm to form a feedback system that eliminates local shape outliers caused by nonrigid deformation, occlusion, and missing parts. Experiments on 2D binary shapes and 3D cardiac sequences validate the accuracy and robustness of the algorithm.

Map-enhanced UAV image sequence registration and synchronization of multiple image sequences

Yu-Ping Lin, Gérard Medioni

Registering consecutive images from an airborne sensor into a mosaic is an essential tool for image analysts. Strictly local methods tend to accumulate errors, resulting in distortion. We propose to use a reference image (such as a high-resolution map image) to overcome this limitation. In our approach, we register each frame of an image sequence to the map using frame-to-frame registration and frame-to-map registration iteratively. In frame-to-frame registration, a frame is registered to its previous frame; with the previous frame having been registered to the map in the previous iteration, we can derive an estimated transformation from the current frame to the map. In frame-to-map registration, we warp the frame to the map with this transformation to compensate for scale and rotation differences, and then perform area-based matching using mutual information to find correspondences between the warped frame and the map. These correspondences, together with the correspondences from previous frames, can be regarded as correspondences between the partial local mosaic and the map. By registering the partial local mosaic to the map, we derive a transformation from the frame to the map. With this two-step registration, errors between consecutive frames do not accumulate. We then extend our approach to synchronize multiple image sequences by tracking moving objects in each sequence and aligning the frames based on the objects' coordinates in the reference image.

Robust Bayesian estimation and normalized convolution for super-resolution image reconstruction

Antonis Katartzis, Maria Petrou

We investigate new ways of improving the performance of Bayesian-based super-resolution image reconstruction by using a discontinuity adaptive image prior distribution based on robust statistics and a fast and efficient way of initializing the optimization process. The latter is an adapted Normalized Convolution (NC) technique that incorporates the uncertainty induced by registration errors. We present both qualitative and quantitative results on real video sequences and demonstrate the advantages of the proposed method compared to conventional methodologies.
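
Normalized convolution itself admits a very compact statement: filter the certainty-weighted signal and divide by the filtered certainty map, so that unreliable samples (for example, pixels with high registration error) contribute less. The Gaussian applicability function in the sketch below is an assumed choice.

import numpy as np
from scipy.ndimage import gaussian_filter

def normalized_convolution(f, c, sigma=2.0):
    # f: signal (e.g. a registered low-resolution frame); c: certainty in [0, 1].
    num = gaussian_filter(f * c, sigma)   # certainty-weighted smoothing
    den = gaussian_filter(c, sigma)       # smoothed certainty for re-normalisation
    return num / np.maximum(den, 1e-9)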

Real-time image matching based on multiple view kernel projection

Quan Wang, Suya You

This paper proposes a novel matching method for finding, in real time, correspondences among different images containing the same object. The method utilizes an efficient kernel projection scheme to describe the image patch around a detected feature point. In order to achieve invariance and tolerance to geometric distortions, it incorporates a training stage based on generated synthetic views of the object. These two reliable and efficient components together form the core of our novel Multiple View Kernel Projection (MVKP) method. Finally, considering the properties and distribution of the described feature vectors, we search for the best correspondence between two sets of features using a Fast Filtering Vector Approximation (FFVA) algorithm, which can be viewed as a fast lower-bound rejection scheme. Extensive experimental results on both synthetic and real data demonstrate the effectiveness of the proposed approach.

Multiphase segmentation of deformation using logarithmic priors

Igor Yanovsky, Paul Thompson, Stanley Osher, Luminita Vese, Alex Leow

In [8], the authors proposed the large-deformation log-unbiased diffeomorphic nonlinear image registration model, which has been successfully used to obtain theoretically and intuitively correct deformation maps. In this paper, we extend this idea to simultaneously registering and tracking deforming objects in a sequence of two or more images. We generalize a level-set based Chan-Vese multiphase segmentation model to consider Jacobian fields while segmenting regions of growth and shrinkage in deformations. Deforming objects are thus classified based on the magnitude of homogeneous deformation. Numerical experiments demonstrating our results include a pair of two-dimensional synthetic images and pairs of two-dimensional and three-dimensional serial MRI images.

Shape matching through particle dynamics warping

Gady Agam, Suneel Suresh

Shape matching is fundamental to numerous computer vision algorithms and may be used for similarity determination and registration. Establishing correspondence and measuring similarity between shapes is of great importance. Shape matching often involves simultaneous estimation of both a correspondence and an alignment transformation. Such an estimate is particularly difficult when the alignment transformation is non-linear and so contains a large number of degrees of freedom. We describe a novel approach for shape matching that is based on shape contexts and uses particle dynamics warping to maximize the similarity of shapes while satisfying structural constraints. The approach is based on an iterative solution of a system of first order ordinary differential equations. The main advantage of the proposed approach is its ability to incorporate shape constraints into the matching process. Furthermore, the proposed approach does not require a solution of an optimal assignment problem which is sensitive to outliers, and does not require thin-plate spline warping which is computationally expensive. To illustrate the applicability of our approach we address the problem of offline signature recognition which in contrast to online signature recognition does not provide for a simple parametrization of the signature curves. The proposed approach is evaluated by measuring the precision and recall rates of documents based on signature similarity. To facilitate a realistic evaluation, the signature data we use was collected from real world documents spanning a period of several decades.

Session 3: Image Fusion

The effect of pixel-level fusion on object tracking in multi-sensor surveillance video

N. Cvejic, S.G. Nikolov, H. D. Knowles, A. Loza, A. Achim, D. R. Bull, C. N. Canagarajah

This paper investigates the impact of pixel-level fusion of videos from visible (VIZ) and infrared (IR) surveillance cameras on object tracking performance, as compared to tracking in single-modality videos. Tracking is accomplished by means of a particle filter which fuses a colour cue and the structural similarity measure (SSIM). The highest tracking accuracy was obtained on the IR sequences, whereas the VIZ video showed the worst tracking performance due to higher levels of clutter. However, metrics for fusion assessment clearly point towards the supremacy of the multiresolution methods, especially the Dual-Tree Complex Wavelet Transform method. Thus, a new, tracking-oriented metric is needed that can accurately assess how fusion affects the performance of the tracker.

An adaptive focal connectivity algorithm for multifocus fusion

Harishwaran Hariharan, Andreas Koschan, Mongi Abidi

Multifocus fusion is the process of fusing focal information from a set of input images into one all-in-focus image. Here, a versatile multifocus fusion algorithm is presented for application-independent fusion. A focally connected region is a region, or a set of regions, in an input image that falls within the depth of field of the imaging system. Such regions are segmented adaptively under the predicate of focal connectivity and fused by partition synthesis. The fused image contains information from all focal planes while maintaining the visual verisimilitude of the scene. In order to validate the fusion performance of our method, we have compared our results with those of tiling and multiscale fusion techniques. In addition to performing seamless fusion of the focally connected regions, our method outperforms the competing methods in overall sharpness in all our experiments. Several illustrative examples of multifocus fusion are shown and objective comparisons are provided.
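
For contrast with such region-based methods, the simplest per-pixel baseline picks, at each pixel, the input with the largest local focus measure. The Laplacian-energy measure and window size below are illustrative, and the paper's adaptive focal-connectivity segmentation is deliberately not reproduced.

import numpy as np
from scipy import ndimage

def fuse_multifocus(images, win=9):
    stack = np.stack([im.astype(float) for im in images])
    # Local Laplacian energy as a simple per-pixel focus measure.
    focus = np.stack([ndimage.uniform_filter(ndimage.laplace(im) ** 2, win)
                      for im in stack])
    best = np.argmax(focus, axis=0)  # index of the sharpest input per pixel
    return np.take_along_axis(stack, best[None], axis=0)[0]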

Abstracts: Workshop – Towards Benchmarking Automated Calibration, Orientation and Surface Reconstruction from Images (BenCOS 2007)

Evaluation

A Comparison of PMD-cameras and stereo-vision for the task of surface reconstruction using patchlets

C. Beder, B. Bartczak, and R. Koch

Recently, real-time active 3D range cameras based on time-of-flight technology (PMD) have become available. These cameras can be considered a competing technique for stereo-vision-based surface reconstruction. Since such systems directly yield accurate 3D measurements, they can also be used for benchmarking vision-based approaches, especially in highly dynamic environments. A comparative study of the two approaches is therefore relevant. In this work the achievable accuracy of the two techniques, PMD and stereo, is compared on the basis of patchlet estimation, where a patchlet is defined as a small oriented planar 3D patch with an associated surface normal. Least-squares estimation schemes for estimating patchlets from PMD range images as well as from a pair of stereo images are derived, and it is shown how the achievable accuracy can be estimated for both systems. Experiments under optimal conditions for both systems are performed and the achievable accuracies are compared. We find that the PMD system outperforms the stereo system in terms of achievable accuracy for distance measurements, while the estimation of the normal direction is comparable for both systems.

A benchmarking dataset for performance evaluation of automatic surface reconstruction algorithms

A. Bellmann, O. Hellwich, V. Rodehorst, and U. Yilmaz

Numerous techniques were invented in computer vision and photogrammetry to obtain spatial information from digital images. We intend to describe and improve the performance of these vision techniques by providing test objectives, data, metrics and test protocols. In this paper we propose a comprehensive benchmarking dataset for evaluating a variety of automatic surface reconstruction algorithms (shape-from-X) and a methodology for comparing their results.

Feasibility boundary in dense and semi-dense stereo matching

J. Kostlivá, J. Cech, and R. Sára

In the stereo literature, there is no standard method for evaluating semi-dense stereo matching algorithms. Moreover, existing evaluations of dense methods require a fixed parameter setting for the tested algorithms. In this paper, we propose a method that overcomes these drawbacks while still comparing algorithms by a simple numerical value, so that reporting results does not take up much space in a paper. We propose evaluating stereo algorithms using Receiver Operating Characteristics (ROC), which capture both errors and sparsity. By comparing the ROC curves of all tested algorithms we obtain the feasibility boundary, the best possible performance achieved by the set of tested stereo algorithms, which allows users to select the proper method and parameter setting for a given application.
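
Concretely, each algorithm-parameter setting contributes one operating point of (error rate over matched pixels, sparsity); sweeping parameters traces a ROC curve, and the envelope over all tested algorithms gives the feasibility boundary. A minimal sketch of one operating point follows; the 1-pixel correctness tolerance is an assumed convention.

import numpy as np

def roc_point(disp, gt, valid, tol=1.0):
    # disp: estimated disparities with NaN where unmatched; gt: ground truth;
    # valid: mask of pixels where ground truth exists.
    labelled = ~np.isnan(disp) & valid
    errors = np.abs(disp[labelled] - gt[labelled]) > tol
    error_rate = float(errors.mean()) if labelled.any() else 0.0
    sparsity = 1.0 - labelled.sum() / valid.sum()
    return error_rate, sparsity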

Influence of numerical conditioning on the accuracy of relative orientation

S. Šegvič, G. Schweighofer, and A. Pinz

We study the influence of numerical conditioning on the accuracy of two closed-form solutions to the overconstrained relative orientation problem. We consider the well-known eight-point algorithm and the recent five-point algorithm, and evaluate changes in their performance due to Hartley’s normalization and Muehlich’s equilibration. The need for numerical conditioning is introduced by explaining the known bias of the eight-point algorithm towards forward motion. It is then shown how conditioning can be used to improve the results of the recent five-point algorithm. This is not straightforward, since the conditioning disturbs the calibration of the input data; the conditioning therefore needs to be reverted before enforcing the internal cubic constraints of the essential matrix. The obtained improvements are less dramatic than in the case of the eight-point algorithm, for which we offer a plausible explanation. The theoretical claims are backed up with extensive experimentation on noisy artificial datasets, under a variety of geometric and imaging parameters.

Orientation / Pose estimation

3D pose estimation based on multiple monocular cues

B. Barrois and C. Wöhler

In this study we propose an integrated approach to the problem of 3D pose estimation. The main difference from the majority of known methods is the use of complementary image information, including the intensity and polarization state of the light reflected from the object surface, edge information, and absolute depth values obtained with a depth-from-defocus approach. Our method is based on comparing the input image to synthetic images generated by an OpenGL-based renderer using model information about the object provided by CAD data. This comparison provides an error term which is minimised by an iterative optimisation algorithm. Although all six degrees of freedom are estimated, our method requires only a monocular camera, circumventing disadvantages of multiocular camera systems such as the need for external camera calibration. Our framework is open to the inclusion of independently acquired depth data. We evaluate our method on a toy example as well as in two realistic scenarios in the domain of industrial quality inspection. Our experiments with complex real-world objects located at a distance of about 0.5 m from the camera show that the algorithm achieves typical accuracies of better than 1 degree for the rotation angles, 1–2 image pixels for the lateral translations, and several millimetres, or about 1 percent, for the object distance.

Image-based localization using hybrid feature correspondences

K. Josephson, M. Byröd, F. Kahl, and K. Åström

Where am I and what am I seeing? This is a classical vision problem, and this paper presents a solution based on the efficient use of a combination of 2D and 3D features. Given a model of a scene, the objective is to find the relative camera location of a new input image. Unlike traditional hypothesize-and-test methods that estimate the unknown camera position based on 3D model features only, or alternatively on 2D model features only, we show that using a mixture of such features, that is, a hybrid correspondence set, may improve performance. We use minimal cases of structure-from-motion for hypothesis generation in a RANSAC engine. For this purpose, several new and useful minimal cases are derived for calibrated, semi-calibrated and uncalibrated settings. Based on algebraic geometry methods, we show how these minimal hybrid cases can be solved efficiently. The whole approach has been validated on both synthetic and real data, and we demonstrate improvements compared to previous work.

Integration of motion cues in optical and sonar video imaging for 3-D positioning

S. Negahdaripour, H. Pirsiavash, and H. Sekkati

Target-based positioning and 3-D target reconstruction are critical capabilities in deploying submersible platforms for a range of underwater applications, e.g., search and inspection missions. While optical cameras provide high resolution and rich target detail, they are constrained by a limited visibility range. In highly turbid waters, targets at distances of up to tens of meters can be recorded by high-frequency (MHz) 2-D sonar imaging systems that have been introduced to the commercial market in recent years. Because of their lower resolution and SNR and inferior target detail compared to an optical camera in favorable visibility conditions, the integration of both sensing modalities can enable operation over a wider range of conditions, with generally better performance than deploying either system alone. In this paper, the estimation of the 3-D motion of the integrated system and the 3-D reconstruction of scene features are addressed. We do not require establishing matches between optical and sonar features, referred to as opti-acoustic correspondences, but rather matches within either the sonar or the optical motion sequence. In addition to improving the motion estimation accuracy, the advantages of the system include overcoming certain inherent ambiguities of monocular vision, e.g., the scale-factor ambiguity and the dual interpretation of planar scenes. We discuss how the proposed solution provides an effective strategy for addressing the rather complex opti-acoustic stereo matching problem. Experiments with real data demonstrate our technical contributions.

Efficient sampling of disparity space for fast and accurate matching

J. Čech and R. Šára

A simple stereo matching algorithm is proposed that visits only a small fraction of the disparity space in order to find a semi-dense disparity map. It works by growing from a small set of correspondence seeds. Unlike known seed-growing algorithms, it guarantees matching accuracy and correctness even in the presence of repetitive patterns. This success is based on the fact that it solves a global optimization task. The algorithm can recover from wrong initial seeds, to the extent that they can even be random. The quality of the correspondence seeds influences computing time, not the quality of the final disparity map. We show that the proposed algorithm achieves results similar to an exhaustive disparity space search while being two orders of magnitude faster. This is very unlike existing growing algorithms, which are fast but error-prone. Accurate matching on 2-megapixel images of complex scenes is routinely obtained in a few seconds on a common PC from a small number of seeds, without limiting the disparity search range.
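
A minimal best-first growing loop conveys the flavour of seed-growing matchers; this sketch uses plain normalised cross-correlation with a fixed acceptance threshold and omits the global optimality machinery that distinguishes the proposed algorithm from naive growing.

    import heapq
    import numpy as np

    def ncc(left, right, x, y, d, w=2):
        # Normalised cross-correlation of (2w+1)^2 windows around
        # (x, y) in the left image and (x - d, y) in the right image.
        if (y - w < 0 or y + w + 1 > left.shape[0] or x - w < 0
                or x + w + 1 > left.shape[1] or x - d - w < 0
                or x - d + w + 1 > right.shape[1]):
            return -1.0
        a = left[y - w:y + w + 1, x - w:x + w + 1].astype(float).ravel()
        b = right[y - w:y + w + 1,
                  x - d - w:x - d + w + 1].astype(float).ravel()
        a, b = a - a.mean(), b - b.mean()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b) / denom if denom else -1.0

    def grow_disparity(left, right, seeds, tau=0.6):
        # Best-first growth from seeds (x, y, d): repeatedly accept the
        # most correlated pending match and enqueue its neighbours,
        # assuming disparity varies slowly between adjacent pixels.
        heap = [(-ncc(left, right, x, y, d), x, y, d)
                for x, y, d in seeds]
        heapq.heapify(heap)
        disparity = {}
        while heap:
            score, x, y, d = heapq.heappop(heap)
            if -score < tau or (x, y) in disparity:
                continue
            disparity[(x, y)] = d
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                for dd in (-1, 0, 1):
                    if (x + dx, y + dy) not in disparity:
                        c = ncc(left, right, x + dx, y + dy, d + dd)
                        if c >= tau:
                            heapq.heappush(
                                heap, (-c, x + dx, y + dy, d + dd))
        return disparity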

A quasi-minimal model for paper-like surfaces

M. Perriollat and A. Bartoli

Smoothly bent paper-like surfaces are developable. They are, however, difficult to parameterize minimally, since the number of meaningful parameters depends intrinsically on the actual deformation. Previous generative models are either incomplete, i.e. limited to subsets of developable surfaces, or depend on huge parameter sets. We propose a generative model governed by a quasi-minimal set of intuitive parameters, namely rules and angles. More precisely, a flat mesh is bent along guiding rules, while a number of extra rules control the level of smoothness. The generated surface is guaranteed to be developable. A fully automatic multi-camera three-dimensional reconstruction algorithm, including model-based bundle adjustment, demonstrates our model on real images.
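
The elementary bending operation, folding a flat mesh about one rule by a given angle, preserves developability and can be sketched as follows; the vertex layout and the function interface are illustrative assumptions, and the extra smoothing rules of the full model are not shown.

    import numpy as np

    def bend(vertices, p, d, angle):
        # Fold a flat mesh (Nx3 vertices in the z = 0 plane) along the
        # rule through point p with in-plane unit direction d: every
        # vertex on the positive side of the rule is rotated about the
        # rule's axis by the given angle (Rodrigues' formula).
        d = np.asarray(d, float) / np.linalg.norm(d)
        normal = np.cross([0.0, 0.0, 1.0], d)  # in-plane normal of the rule
        side = (vertices - p) @ normal > 0     # vertices to fold
        K = np.array([[0, -d[2], d[1]],
                      [d[2], 0, -d[0]],
                      [-d[1], d[0], 0]])       # cross-product matrix of d
        R = np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)
        out = vertices.astype(float).copy()
        out[side] = (vertices[side] - p) @ R.T + p
        return out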

Abstracts: Beyond Patches Workshop – Patches Everywhere

Using Multiple Patches for 3D Object Recognition

Andrea Selinger Salgian

Image patches have become increasingly popular in a variety of applications, due to their resistance to clutter and partial occlusion, as well as their partial insensitivity to object pose. Recently, Mikolajczyk and Schmid [10] compared a number of local descriptors and concluded that the SIFT-based ones perform best in image matching tasks. In this paper we analyze the performance of three patch descriptors in the context of 3D object recognition: SIFT [9], PCA-SIFT [6] and keyed context patches [15]. We use a data set containing images of six objects on clean and cluttered backgrounds, taken around the whole viewing sphere, and we look at individual and fused performances. Individually, the keyed context patches perform best overall, but they are outperformed for some objects by SIFT and PCA-SIFT. Recognition is improved by fusing the rankings generated by these classifiers.

Extraction of 3D Transform and Scale Invariant Patches from Range Scans

Erdem Akagündüz, İlkay Ulusoy

An algorithm is proposed to extract transformation and scale invariant 3D fundamental elements from the surface structure of 3D range scan data. The surface is described by mean and Gaussian curvature values at every data point at various scales and a scale-space search is performed in order to extract the fundamental structures and to estimate the location and the scale of each fundamental structure. The extracted fundamental structures can later be used as nodes in a topological graph where the links between the nodes can be defined as the spatial and geometric relations between the fundamental elements.
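
At a single scale, the per-point mean and Gaussian curvatures of a range image can be computed from its smoothed derivatives using the standard Monge-patch formulas; in this sketch the amount of Gaussian smoothing stands in for the scale parameter, which is an assumption rather than the paper's exact scale-space construction.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def curvatures(z, sigma=1.0):
        # Mean (H) and Gaussian (K) curvature of a range image z(x, y);
        # larger sigma gives a coarser scale, so evaluating H and K over
        # a range of sigmas yields a curvature scale-space to search.
        zs = gaussian_filter(z.astype(float), sigma)
        zy, zx = np.gradient(zs)
        zyy, _ = np.gradient(zy)
        zxy, zxx = np.gradient(zx)
        g = 1 + zx ** 2 + zy ** 2
        K = (zxx * zyy - zxy ** 2) / g ** 2
        H = ((1 + zx ** 2) * zyy - 2 * zx * zy * zxy
             + (1 + zy ** 2) * zxx) / (2 * g ** 1.5)
        return H, K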

Invariant Features of Local Textures - A Rotation Invariant Local Texture Descriptor

Pranam Janney, Zhenghua Yu

In this paper, we present a new rotation-invariant texture descriptor algorithm called Invariant Features of Local Textures (IFLT). The proposed algorithm extracts rotation-invariant features from a small neighbourhood of pixels around a centre pixel, or a texture patch. The intensity vector derived from a texture patch is normalized and Haar-wavelet filtered to derive rotation-invariant features. Texture classification experiments on the Brodatz album and Outex databases have shown that the proposed algorithm achieves a high rate of correct classification.
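
The normalise-then-filter pipeline can be sketched as below; the zero-mean, unit-norm normalisation (which cancels affine illumination changes) and one level of Haar filtering follow the description above, while the pixel ordering that makes the result rotation invariant is a detail of the paper that this sketch omits.

    import numpy as np

    def iflt_features(patch):
        # Normalise the intensity vector of a texture patch and apply
        # one level of the Haar wavelet (pairwise averages and
        # differences) to obtain illumination-insensitive features.
        v = patch.astype(float).ravel()
        v -= v.mean()
        n = np.linalg.norm(v)
        if n > 0:
            v /= n
        if len(v) % 2:          # the Haar step pairs up samples
            v = v[:-1]
        pairs = v.reshape(-1, 2)
        approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)
        detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)
        return np.concatenate([approx, detail])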

The Effective Resolution of Correlation Filters Applied to Natural Scenes

Michel Vidal-Naquet, Manabu Tanifuji

In this paper, we measure the responses of image patches, used as filters, on different image ensembles and examine how the responses are affected by reducing the resolution of the image ensembles. By comparing the sets of responses obtained at high and reduced resolutions, we find that for the ensembles of natural and object images (cars), there is a limiting resolution of about 15x15 and 10x10 pixels, respectively, beyond which the filter responses are significantly affected by resolution reduction. We support the result with a simple theoretical analysis based on image ensemble statistics. There are two consequences of this result. First, it provides a natural working resolution, determined solely from the image ensemble statistics, to which higher-resolution templates can be reduced without losing a significant amount of information. This can be used, in particular, to reduce the search space for useful visual features in many applications. Second, in contrast to many studies, it suggests that features more complex than Gabor patches can be effectively used as first-layer filters and combined in order to represent more complex shapes and appearances.
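
The measurement itself can be sketched as a comparison of normalised responses before and after downsampling; the zero-mean, unit-variance response and the use of SciPy's zoom for resampling are assumptions of the sketch, not the paper's exact protocol.

    import numpy as np
    from scipy.ndimage import zoom

    def mean_response_change(filters, patches, factor=0.5):
        # For each filter/patch pair (same-sized grey-level arrays),
        # compare the normalised correlation at full resolution with
        # the one obtained after downsampling both by `factor`; the
        # mean absolute change is a proxy for the information lost.
        def response(f, p):
            f = (f - f.mean()) / (f.std() + 1e-9)
            p = (p - p.mean()) / (p.std() + 1e-9)
            return float(np.mean(f * p))
        changes = [abs(response(f, p)
                       - response(zoom(f, factor), zoom(p, factor)))
                   for f in filters for p in patches]
        return float(np.mean(changes))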

Modelling Objects using Distribution and Topology of Multiscale Region Pairs

Himanshu Arora, Narendra Ahuja

We propose a method for the simultaneous detection, localization and segmentation of objects of a known category. We show that this is possible by using segments as features. To this end, we propose an object model in which the image is represented as a tree that captures containment relationships among the segments. Using segments as features has the advantage that object detection and segmentation are done simultaneously, forgoing the need for a separate sophisticated model for object segmentation. A generative model of an object category is estimated in a supervised mode, in terms of the characteristics of its constituent regions, their relative locations, and their mutual containment. The novel aspect of this work lies in simplifying the description of the hierarchy in terms of constraints that apply only to pairs of nodes, instead of all nodes in the tree. We show that this indeed improves the speed of the learning algorithm. Inference is done using graph cuts. We report the performance of the model on standard datasets.

Unsupervised Learning of Hierarchical Semantics of Objects (hSOs)

Devi Parikh, Tsuhan Chen

A successful representation of objects in the literature is as a collection of patches, or parts, with a certain appearance and position. The relative locations of the different parts of an object are constrained by the geometry of the object. Going beyond the patches on a single object, consider a collection of images of a particular class of scenes containing multiple (recurring) objects. The parts belonging to different objects are not constrained by such a geometry. However, the objects, arguably due to their semantic relationships, themselves demonstrate a pattern in their relative locations, which also propagates to their parts. Analyzing the interactions between the parts across the collection of images would reflect these patterns, and the parts can be grouped accordingly. These groupings are typically hierarchical. We introduce the hSO, the Hierarchical Semantics of Objects, which is learnt from a collection of images of a particular scene and captures this hierarchical grouping. We propose an approach for the unsupervised learning of the hSO. The hSO simply holds objects, as clusters of patches, at its nodes, but it goes much beyond that and also captures interactions between the objects through its structure. In addition to providing the semantic layout of the scene, learnt hSOs can have several useful applications, such as providing context for enhanced object detection and a compact scene representation for scene category classification.

Adaptive Patch Features for Object Class Recognition with Learned Hierarchical Models

Fabien Scalzo, Justus Piater

We present a hierarchical generative model for object recognition that is constructed by weakly-supervised learning. A key component is a novel, adaptive patch feature whose width and height are automatically determined. The optimality criterion is based on minimum-variance analysis, which first computes the variance of the appearance model for various patch deformations, and then selects the patch dimensions that yield the minimum variance over the training data. They are integrated into each level of our hierarchical representation that is learned in an iterative, bottom-up fashion. At each level of the hierarchy, pairs of features are identified that tend to occur at stable positions relative to each other, by clustering the configurational distributions of observed feature co-occurrences using Expectation-Maximization. For recognition, evidence is propagated using Nonparametric Belief Propagation. Discriminative models are learned on the basis of our feature hierarchy by combining an SVM classifier with feature selection based on the Fisher score. Experiments on two very different, challenging image databases demonstrate the effectiveness of this framework for object class recognition, as well as the contribution of the adaptive patch features towards attaining highly competitive results.
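
The minimum-variance choice of patch dimensions can be sketched directly from the description above; the centred cropping of pre-aligned training examples and the candidate grid of sizes are illustrative assumptions.

    import numpy as np

    def select_patch_dims(examples, candidate_dims):
        # examples: aligned grey-level arrays, all larger than any
        # candidate (height, width). Return the dimensions whose
        # appearance model has minimum variance over the training data.
        best, best_var = None, np.inf
        for h, w in candidate_dims:
            crops = []
            for e in examples:
                cy, cx = e.shape[0] // 2, e.shape[1] // 2
                crops.append(e[cy - h // 2:cy - h // 2 + h,
                               cx - w // 2:cx - w // 2 + w].ravel())
            var = float(np.var(np.stack(crops), axis=0).mean())
            if var < best_var:
                best, best_var = (h, w), var
        return best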

Complex Salient Regions for Computer Vision Problems

Sergio Escalera, Oriol Pujol, Petia Radeva

The goal of interest point detectors is to find, in an unsupervised way, keypoints that are easy to extract and at the same time robust to image transformations. We present a novel set of saliency features based on image singularities that takes into account the region content in terms of intensity and local structure. The region complexity is estimated by means of the entropy of the grey-level information; shape information is obtained by measuring the entropy of significant orientations. The regions are located at their representative scale and categorized by their complexity level. Thus, the regions are highly discriminable and less sensitive to confusion and false alarms than traditional approaches. We compare the novel complex salient regions with state-of-the-art keypoint detectors. The presented interest points show robustness to a wide set of image transformations and high repeatability, and allow matching from different camera viewpoints. Moreover, we show the temporal robustness of the novel salient regions in real video sequences, making them potentially useful for matching, image retrieval, and object categorization problems.
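
The two entropy measures can be sketched as follows; the histogram binning and the weighting of orientations by gradient magnitude are common choices assumed here, not necessarily the authors' exact estimator.

    import numpy as np

    def region_entropies(patch, bins=16):
        # Grey-level entropy (region complexity) and the entropy of the
        # gradient-orientation distribution (shape information).
        def entropy(hist):
            q = hist / max(hist.sum(), 1e-9)
            q = q[q > 0]
            return float(-np.sum(q * np.log2(q)))
        p = patch.astype(float)
        grey_hist, _ = np.histogram(p, bins=bins)
        gy, gx = np.gradient(p)
        ori_hist, _ = np.histogram(np.arctan2(gy, gx), bins=bins,
                                   range=(-np.pi, np.pi),
                                   weights=np.hypot(gx, gy))
        return entropy(grey_hist), entropy(ori_hist)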

Patch-based Image Correlation with Rapid Filtering

Guodong Guo, Charles Dyer

This paper describes a patch-based approach to rapid image correlation or template matching. By representing a template image with an ensemble of patches, the method is robust with respect to variations such as local appearance variation, partial occlusion, and scale changes. Rectangle filters are applied to each image patch for fast filtering based on the integral image representation. A new method is developed for feature dimension reduction by detecting the "salient" image structures given a single image. Experiments on a variety of images show the success of the method in dealing with different variations in the test images. In terms of computation time, the approach is faster than traditional methods by up to two orders of magnitude and is at least three times faster than a fast implementation of normalized cross-correlation.
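
The integral image construction behind such rectangle filters is standard and easy to state in code; the zero padding used below is one common convention.

    import numpy as np

    def integral_image(img):
        # ii[y, x] holds the sum of all pixels above and to the left of
        # (y, x); a padded zero row and column remove boundary cases.
        ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
        ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
        return ii

    def box_sum(ii, y0, x0, y1, x1):
        # Sum of img[y0:y1, x0:x1] from four table lookups, in O(1)
        # regardless of the rectangle size.
        return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]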

Toward A Discriminative Codebook: Codeword Selection across Multi-resolution

Lei Wang

In patch-based object recognition, there are two important issues in codebook generation: (1) resolution: a coarse codebook lacks sufficient discriminative power, while an over-fine one is sensitive to noise; (2) codeword selection: non-discriminative codewords not only increase the codebook size but can also hurt the recognition performance. To achieve a discriminative codebook for better recognition, this paper argues that these two issues are strongly related and should be solved as a whole. In this paper, a multi-resolution codebook is first designed via hierarchical clustering. With a reasonable size, it includes codewords spanning a large number of resolution levels. More importantly, it forms a diverse candidate codeword set, which is critical to codeword selection. A boosting-based feature selection approach is modified to select the discriminative codewords from this multi-resolution codebook. By doing so, the obtained codebook is composed of the most discriminative codewords culled from different levels of resolution. An experimental study demonstrates the better recognition performance attained by this codebook.
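
A minimal way to pool codewords across several cuts of a cluster hierarchy is sketched below; the Ward linkage, the particular cut levels, and the use of cluster means as codewords are illustrative assumptions, and the boosting-based selection step is not shown.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def multiresolution_codebook(descriptors, levels=(10, 50, 250)):
        # Cut one hierarchical clustering of the descriptors at several
        # resolutions and pool the cluster means, giving the diverse
        # coarse-to-fine candidate set from which discriminative
        # codewords are then selected.
        Z = linkage(descriptors, method="ward")
        codebook = []
        for k in levels:
            labels = fcluster(Z, t=k, criterion="maxclust")
            for c in np.unique(labels):
                codebook.append(descriptors[labels == c].mean(axis=0))
        return np.stack(codebook)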

Scene Classification Using Bag-of-Regions Representations

Demir Gökalp, Selim Aksoy
