Int J Comput Vis (2012) 97:191–209 · DOI 10.1007/s11263-011-0479-2

Accurate Object Recognition with Shape Masks

Marcin Marszałek · Cordelia Schmid

Received: 6 August 2009 / Accepted: 13 June 2011 / Published online: 1 July 2011
© Springer Science+Business Media, LLC 2011

Abstract In this paper we propose an object recognition approach that is based on shape masks—generalizations of segmentation masks. As shape masks carry information about the extent (outline) of objects, they provide a convenient tool to exploit the geometry of objects. We apply our ideas to two common object class recognition tasks—classification and localization. For classification, we extend the orderless bag-of-features image representation. In the proposed setup shape masks can be seen as weak geometrical constraints over bag-of-features. Those constraints can be used to reduce background clutter and help recognition. For localization, we propose a new recognition scheme based on high-dimensional hypothesis clustering. Shape masks allow us to go beyond bounding boxes and determine the outline (approximate segmentation) of the object during localization. Furthermore, the method easily learns and detects possible object viewpoints and articulations, which are often well characterized by the object outline. Our experiments reveal that shape masks can improve the recognition accuracy of state-of-the-art methods while returning richer recognition answers at the same time. We evaluate the proposed approach on the challenging natural-scene Graz-02 object classes dataset.

Keywords Shape masks · Object recognition · Object segmentation · Local features · Bag-of-features · Graz-02

M. Marszałek · C. Schmid
INRIA Grenoble, LEAR - LJK, 665 av de l’Europe, 38330 Montbonnot, France
e-mail: [email protected]

C. Schmid
e-mail: [email protected]

1 Introduction

The recognition of object categories is one of the most challenging problems in computer vision, especially in the presence of pose changes, intra-class variation, occlusion and background clutter. In this work we introduce shape masks and make a connection between the three arguably most common visual object class recognition tasks—classification, localization and segmentation. We propose a novel general object recognition scheme and apply it to the aforementioned tasks in a challenging environment of natural images.

In the proposed recognition scheme we use local appearance image features to cast object localization hypotheses in the form of shape masks. Shape masks divide the image into object and background areas, i.e., approximate object segmentation. Search for consistent hypotheses leads to: (a) a set of consistent features that can be used to classify the object; (b) a consistent object outline that can be used for object localization and segmentation. Consequently, the contribution of this paper is two-fold. First, we use shape masks to improve classification. Second, we determine the pixel-level segmentation of localized objects.

For classification, we extend the orderless bag-of-features image representation. In the proposed setup shape masks can be seen as weak geometrical constraints over bag-of-features. We combine shape masks cast by many local features and find groups of consistent features that support each other. Removal of inconsistent features allows us to reduce the background clutter and makes the classification task easier.

For localization, we propose a new recognition scheme based on high-dimensional hypothesis clustering. Consistent object outlines—maxima in the shape mask space—correspond to localization decisions. At the same time shape masks allow us to go beyond commonly used bounding boxes. They determine the outline of the localized objects, i.e., approximate object segmentation. Using shape masks as localization hypotheses, we also implicitly handle global consistency issues and address multiple object aspects.

The paper is organized as follows. In Sect. 2 we review related work. Section 3 explains the components of our method. We briefly review local features (Sect. 3.1) and describe their rectification parameters (Sect. 3.2). We introduce shape masks with their alignment (Sect. 3.3) and define the shape mask similarity measure we use (Sect. 3.4). Finally we apply our ideas to the classification task in Sect. 4 and the combined localization and segmentation task in Sect. 5. Experimental results are given in Sect. 6. We evaluate the components of our framework in Sects. 6.1–6.2 and compare to the state of the art in Sects. 6.3–6.4. We conclude the paper in Sect. 7.

2 Related Work

In the traditional view of object recognition, image segmentation was often seen as a necessary preprocessing step for recognition (Marr 1982). The success of appearance-based methods, which provide recognition results without prior segmentation, led to the separation of the two areas. In recent years, however, those areas have converged again. This is mainly due to a growing insight that recognition and segmentation are interleaved processes in the human brain (Peterson 1994; Vecera 1998) and that intermediate recognition results can be used to drive a segmentation process (Borenstein and Ullman 2002; Yu and Shi 2003). Many vision researchers build today’s recognition systems on segments (Gu et al. 2009) and consider segmentation to be a key element of scene understanding (Li et al. 2009). There is also a growing interest in integrating an object segmentation capability into successful appearance-based recognition methods (Fussenegger et al. 2006; Wu and Nevatia 2007; Galleguillos et al. 2008; Leibe et al. 2008). We take the object recognition perspective with its two standard tasks—localization and classification—to discuss the work related to our approach.

A preliminary version of this work has appeared in our earlier publications (Marszałek and Schmid 2006, 2007).

2.1 Classification

In order to deal with the high intra-class variations typical for the visual world, image classification methods based on sparse local features (Mikolajczyk and Schmid 2004) and bag-of-features (Sivic and Zisserman 2003; Csurka et al. 2004) were proposed. They have been shown to give excellent recognition results (Agarwal et al. 2004; Sivic et al. 2005; Fergus et al. 2007; Zhang et al. 2007). However, in the object class recognition context, they are sensitive to background clutter, because they cannot distinguish between objects and background. Efforts have been made to overcome this problem by using feature selection (Dorkó and Schmid 2003), boosting (Opelt et al. 2004b) or designing novel kernels with high discriminative power (Grauman and Darrell 2005; Lyu 2005). Robustness to occlusions was also improved by introducing similarity measures based on partial matching (e.g. the EMD distance, see Rubner et al. 2000) or histogram comparison (e.g. the χ2 distance, see Hayman et al. 2004). However, there still seems to be a strong potential for improving the background clutter robustness of the bag-of-features representation (Zhang et al. 2007), and segmentation was shown to be an important cue, surprisingly often missing from recent state-of-the-art recognition systems (Ramanan 2007).

Lazebnik et al. (2005) have shown that considering spatial relationships between features, which are ignored by the standard bag-of-features representation, may lead to high classification accuracy. This motivates us to extend the original bag-of-features representation to incorporate spatial information in the form of shape masks. Since interest points can generate accurate hypotheses about the localization of the object in the image (Lowe 2004; Leibe et al. 2008), we introduce a method in which the features that agree on the localization and shape of the object boost the importance of each other.

2.2 Localization

The criteria for measuring localization accuracy have evolved over time. Agarwal and Roth (2002) evaluated the center point of an object and classified a localization as correct when the marked point was in the close neighborhood of the real center of the object. During the PASCAL Visual Object Classes challenges (Everingham et al. 2006, 2008) the participants have to return bounding boxes for the objects. We believe that modern localization methods should go even further, e.g., return some additional information about object pose (viewpoint, articulation), aspect (sub-type) or even state and properties. This can, to some extent, be achieved by returning the object segmentation and motivates us to integrate segmentation to help the standard object recognition tasks.

A few methods which perform interleaved object detection and segmentation have been developed recently—see for example the work of Leibe et al. (2008) and of Fussenegger et al. (2006). Most of those methods, however, attack the segmentation problem after the decision about the object location is already made. And since the authors solve the localization problem by voting in the generalized Hough space, they are limited to parametrized hypotheses like rectangles, ellipses, etc. The information present in the object segmentation cannot be fully exploited in such a setup. As the Hough space cannot deal directly with arbitrary shapes due to their high dimensionality, it generates a few problems with the otherwise very successful Implicit Shape Model (ISM) of Leibe et al. (2008). Firstly, the low dimensionality of the hypotheses causes the final answers to reveal problems with global consistency. This was addressed by Leibe et al. (2005), but only in the form of a postprocessing step appended to the original ISM. We approach this problem directly by using the high-dimensional shape masks as hypotheses. Such hypotheses can be considered similar only if the object outlines are globally similar. Secondly, the low dimensionality of the hypotheses makes it difficult to deal with multiple object viewpoints and articulations. This was addressed by Seemann et al. (2006), but as aspect parametrization is difficult in the general case, the proposed solution was limited to aspect clustering and treating each aspect separately. Treating each aspect separately, however, prohibits aspect combination during recognition. This was resolved through semi-local matching in a later paper by Seemann and Schiele (2006). An alternative solution to the same problem was proposed by Thomas et al. (2006), but finding the multi-view tracks that link single-view detectors requires a special training procedure with over 10 viewpoints of each training object. We implicitly deal with multiple viewpoints and articulations. Object aspects are detected during training and similar ones can be combined during recognition.

Fritz et al. (2005) have recently shown that combining the power of generative modeling with a discriminative classifier allows one to obtain good results for object category localization. They extend the Implicit Shape Model mentioned earlier by appending a Support Vector Machine classifier to its output. In our framework, however, we propose to evaluate the hypotheses (shape masks cast using local features) before the evidence collection step. This allows us to easily deal with false hypotheses caused by local ambiguities and makes the search for maxima in the hypothesis space easier.

We would also like to underline a significant difference between our object class localization approach and recent object class segmentation approaches, like the ones of Winn and Jojic (2005), Todorovic and Ahuja (2006), Russell et al. (2006) or Shotton et al. (2008). We perform object localization instead of scene segmentation. Our goal is to localize separate object instances within a test image, handling occlusions and strong background clutter. Our method does not use any segmentation or edge information of a test image and therefore we expect to get only approximate shapes for the localized objects—shapes that reveal additional object properties and do not segment out the visible object parts. Pixel-level accurate segmentation, however, should be easier after the object localization problem that we address in this paper is solved.

Fig. 1 Visualization of shape mask alignment. Three central features (two heads and one drop) agree on the object localization, two others (one head and one drop) are mistakes, the “Big Ben” feature represents the background. Note the scale compensation for all features and the ambiguity introduced by the “head” feature

3 Shape Masks

A shape mask encodes an object outline and is here predicted by local features. By collecting and combining shape masks cast by individual features we obtain an estimate of the object shape, cf. Fig. 1.

By definition, a shape mask S : ℝ² → ℝ is a generalization of the discrete binary segmentation mask S_b : ℤ² → {0, 1}. It can be initialized from a segmentation mask and simplifies the notation by postponing the problem of spatial quantization through the extension of the domain. Furthermore, additional operations become possible due to the extension of the codomain, e.g., a weighted sum of shape masks can be computed.

In the following, we describe the key operations we need to perform on shape masks in order to use them for classification (in Sect. 4) and localization (in Sect. 5). First, we review local features and their rectification parameters. Those parameters, typically encoded in a rectification matrix, permit shape mask casting and alignment. Aligned shape masks, which are described next, can be collected and combined to produce approximate object segmentations. An approximate image segmentation derived from such object segmentations is used to reduce background clutter for classification. Finally, we define a shape mask similarity measure, which allows shape mask clustering. We cluster shape masks to localize individual objects and to learn canonical object aspects.

3.1 Local Features

We use two local region detectors to extract salient image structures: the Harris-Laplace detector (Mikolajczyk and Schmid 2004), responding to corner-like regions, and the Laplacian detector (Lindeberg 1998), extracting blob-like regions.

Fig. 2 Sample interest point detections

Both detectors are invariant to scale transformations; they output circular regions at a certain characteristic scale. To achieve rotation invariance, we may rotate the circular regions in the direction of the dominant gradient orientation (Lowe 2004; Mikolajczyk and Schmid 2004). The affine adaptation procedure (Gårding and Lindeberg 1996; Mikolajczyk and Schmid 2004) allows us to obtain an affine-invariant version of the detectors. Affinely adapted detectors output elliptical regions which are then normalized into circles.

Figure 2 demonstrates the detection results on two sample images. Note that interest regions are allocated around salient image structures. This effect is especially prominent for the Harris-Laplace detector, which responds to corners. On the other hand, the Laplacian detector generally returns more regions and covers most parts of the image well.

In our experiments the Harris-Laplace detector usually produces comparable results with a lower number of detections, thus it is our preferred choice for efficiency reasons. However, for small images, like the ones in Shotton’s horses dataset, we choose the Laplacian detector that detects enough interest points for our method to work.

It is also unreasonable to use a more invariant description than required for a given dataset (Zhang et al. 2007). For most natural object datasets the vertical direction is well defined, and the orientation of the features contains valuable information. Thus, even though our framework supports affinely adapted features as discussed in the following, in our experiments we use only the scale-invariant version of the detectors.

To compute appearance-based descriptors on the patches obtained by the detectors we employ the SIFT (Lowe 2004) descriptor. For each interest region it computes a gradient orientation histogram which leads to a 128-dimensional feature vector. The design of the descriptor makes it less sensitive to small changes in the position of the support region and puts more emphasis on the gradients that are near the center of the region. The descriptors are normalized in order to obtain robustness to illumination changes.
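
For illustration, a minimal sketch of this feature extraction step in Python. It uses OpenCV's DoG-based SIFT detector as a convenient stand-in for the Harris-Laplace and Laplacian detectors described above (an assumption made for brevity, not the detectors used in the paper); the descriptor is the same 128-dimensional gradient orientation histogram.

```python
import cv2
import numpy as np

def extract_local_features(image_path):
    """Detect scale-invariant keypoints and compute 128-D SIFT descriptors."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()          # DoG detector + SIFT descriptor
    keypoints, descriptors = sift.detectAndCompute(img, None)
    # L2-normalize descriptors for robustness to illumination changes
    descriptors = descriptors / (np.linalg.norm(descriptors, axis=1, keepdims=True) + 1e-9)
    # keep position and characteristic scale as rectification parameters (cf. Sect. 3.2)
    rectification = np.array([(kp.pt[0], kp.pt[1], kp.size) for kp in keypoints])
    return descriptors, rectification
```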

3.2 Rectification Parameters

In our framework rectification parameters complement the invariant description of a local image feature. For example, if a local region description (feature) is invariant to scale transformations, the rectification parameters have to include the scale to compensate for this invariance.

Precisely, if a description d(·) of a local image region is invariant to a transformation T(·, ρ) with parameters ρ, then for each local image region i the parameters of this transformation ρ_i ∈ D_ρ(T) are included in its rectification θ_i ∈ Θ. This can be written as

    \Theta = \bigotimes_{T \in \mathcal{T}} D_\rho(T),    (1)

where ⊗ denotes a Cartesian product, D_ρ(T) is the domain of the transformation and 𝒯 is the set of transformations to which the description is invariant:

    \mathcal{T} = \{ T : \forall \rho, r \;\; d(r) = d(T(r, \rho)) \}.    (2)

In our framework features are made invariant to a chosen type of affine transformations by normalizing the local image region before computing the description. Therefore, the rectification matrix θ_i transforming the image coordinates to the normalized patch coordinates (Rothganger et al. 2003) can be used to encode the rectification parameters of a feature i.

3.3 Shape Masks Alignment

Let us assume a match between two features i and j. If a shape mask is associated with feature i, we can project it to the reference frame of feature j—we call this mask alignment. To align shape masks, we compose them with the transformation matrix P_ij computed as

    P_{ij} = \theta_i^{-1} \theta_j,    (3)

where θ_i and θ_j are the rectification matrices corresponding to the matching features.
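
As an illustration, a minimal sketch of mask alignment for scale- and translation-invariant features, where a rectification matrix maps image coordinates to the normalized patch frame of a feature at (x, y) with characteristic scale s. The parametrization of `rectification_matrix` and the array indexing convention are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.ndimage import affine_transform

def rectification_matrix(x, y, scale):
    """3x3 matrix mapping (homogeneous) image coordinates to the normalized
    patch frame of a feature located at (x, y) with the given scale."""
    return np.array([[1.0 / scale, 0.0, -x / scale],
                     [0.0, 1.0 / scale, -y / scale],
                     [0.0, 0.0, 1.0]])

def align_mask(train_mask, theta_i, theta_j, out_shape):
    """Project the shape mask of training feature i into the test image of the
    matching feature j by composing it with P_ij = theta_i^{-1} theta_j (Eq. (3))."""
    P_ij = np.linalg.inv(theta_i) @ theta_j
    # affine_transform maps output (test image) coordinates to input (training
    # mask) coordinates, which is what P_ij does; the mask array is assumed to
    # use the same axis ordering as the rectification matrices.
    return affine_transform(train_mask, P_ij[:2, :2], offset=P_ij[:2, 2],
                            output_shape=out_shape, order=1, cval=0.0)
```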

Figure 1 shows a toy example where shape masks get aligned with regard to position and scale. Let us assume that a “drop” feature indicates the presence of an umbrella just below, while a “head” feature suggests that one is to be found just above and to the side. Since a head might appear to the left or to the right, there is an ambiguity—an umbrella may be found more to the left or to the right of a head. Note how the position of a feature determines the position of a shape mask and how larger features cast larger masks.

Looking further at Fig. 1, one can quickly notice that aligned shape masks can be used to produce approximate object segmentations—just as shown by the dark umbrella shape in the center of the figure. Given a set of features, each feature may produce a hypothesis about the localization of the object. Note that true foreground features which agree on the position and shape of the object will quickly produce a strong response—the features indicate that an umbrella is in the center of the visualization.

Fig. 3 Test images of Graz-02 dataset (on the left), generated masks (in the middle) and multiplication of the two (on the right)

Figure 3 presents the sums of all generated masks for sample Graz-02 images. The quality of the resulting approximate segmentation varies, but the masks usually tend to focus on the object. We can expect to find some consistent hypotheses even in blurred mask clouds.

3.4 Shape Masks Similarity

To measure the shape mask similarity we adapt a common measure defined as the ratio of the overlap area to the union area, see Fig. 4 for illustration. For binary masks Q_b and R_b the overlap area measure can be written as

    o_b(Q_b, R_b) = \frac{|Q_b^1 \cap R_b^1|}{|Q_b^1 \cup R_b^1|} = \frac{\sum \min(Q_b, R_b)}{\sum \max(Q_b, R_b)},    (4)

where Q_b^1 resp. R_b^1 denotes the level set of mask Q_b resp. R_b at 1, min(Q_b, R_b)(x, y) = min(Q_b(x, y), R_b(x, y)) and the sum is taken over the whole domain. Thus, we define the overlap-based similarity measure o_s for shape masks Q and R as

    o_s(Q, R) = \frac{\int \min(Q, R)}{\int \max(Q, R)}.    (5)

Note that this similarity measure will return 1 for identical shape masks and 0 for non-overlapping ones.

Fig. 4 Illustration of the overlap area similarity measure for two sample shapes. The measure is defined as o(A, B) = |A ∩ B| / |A ∪ B|

A straightforward implementation of the similarity measure given in (5) leads to very inefficient code. Note, however, that it can be rewritten as

    o_s(Q, R) = \frac{C}{\int Q + \int R - C}, \qquad C = \int \min(Q, R).    (6)

Sums of all mask pixels can be cached and C needs to be computed only on the intersection of the supports of the shape masks. This makes the computation very efficient.
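
This rewriting translates directly into a few lines of NumPy; a sketch assuming masks are non-negative arrays of equal shape, with per-mask sums optionally cached by the caller:

```python
import numpy as np

def mask_similarity(Q, R, sum_Q=None, sum_R=None):
    """Overlap similarity o_s of Eq. (5), computed via Eq. (6):
    o_s(Q, R) = C / (sum(Q) + sum(R) - C) with C = sum(min(Q, R))."""
    sum_Q = Q.sum() if sum_Q is None else sum_Q
    sum_R = R.sum() if sum_R is None else sum_R
    # only the intersection of the supports contributes to C; for large sparse
    # masks this sum can be restricted to the overlap of their bounding boxes
    C = np.minimum(Q, R).sum()
    denom = sum_Q + sum_R - C
    return C / denom if denom > 0 else 0.0
```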

Finally, we need to compute the similarity measure o_f between two shape masks ζ_i and ζ_j associated with features i and j after aligning them. We define it as

    o_f(i, j) = o_s(\zeta_i \circ P_{ij}, \zeta_j) = o_s(\zeta_i, \zeta_j \circ P_{ji}),    (7)

where o_s is defined by (5) and P_ij (P_ji) by (3). We call such a similarity measure between two features a featured shape mask similarity.

4 Classification

We first apply shape masks to image classification. In the standard bag-of-features approach each feature equally influences the bag-of-features representation. We propose to reduce the influence of background clutter by employing spatial relationships between the features and giving lower weights to background features. This is achieved by having each feature boost other features that, from its spatial point of view, should belong to an object, e.g., a feature belonging to the wheel of a car should increase the weights of the features belonging to the other parts of the car. The spatial information is conveyed through shape masks, cf. Sect. 3. Shape masks produce an approximate object segmentation based on direct matches between test and training features. This segmentation is then used to weight features within the bag-of-features image representation, boosting the foreground features and suppressing the background clutter. Our experiments will show that if we remove background clutter we are able to improve the classification results.

Our image classification framework, based on the work of Zhang et al. (2007), is described in Sects. 4.1–4.2. The details of our spatial weighting procedure are given in Sect. 4.3.

4.1 Bag-of-Features Representation

Given a set of local invariant descriptors (see Sect. 3.1), we want to represent their distributions in training and test images. We therefore build a visual vocabulary by clustering the descriptors from the training set and then represent each image in the dataset as a histogram of visual words (vocabulary entries, see Csurka et al. 2004; Sivic and Zisserman 2003). Each histogram entry h_ij ∈ H_i is the proportion of all descriptors in image i having label j to the total number of descriptors computed for the image.

Our evaluation has shown that vocabulary construction has little impact on the final results. We therefore randomly subsample the training features and cluster 50k features using k-means as the clustering method to create a 1000-element vocabulary.
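
A minimal sketch of this vocabulary construction and histogram computation, using scikit-learn's k-means; the sample size and vocabulary size follow the description above, everything else is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(train_descriptors, n_words=1000, n_sample=50000, seed=0):
    """Cluster a random subsample of training descriptors into visual words."""
    rng = np.random.default_rng(seed)
    n = min(n_sample, len(train_descriptors))
    sample = train_descriptors[rng.choice(len(train_descriptors), size=n, replace=False)]
    return KMeans(n_clusters=n_words, n_init=1, random_state=seed).fit(sample)

def bof_histogram(descriptors, vocabulary):
    """Histogram of visual-word proportions (entry h_ij is the fraction of
    descriptors of image i assigned to word j)."""
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```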

4.2 Classification with Non-linear SVMs

For classification, we use non-linear Support Vector Machines (SVMs, see Schölkopf and Smola 2002). In a two-class setup that we use for binary detection, i.e., classifying images as containing or not containing a given object class, the decision function for a test sample x has the following form:

    g(x) = \sum_i \alpha_i y_i K(x_i, x) - b,    (8)

where K(x_i, x) is the value of a kernel function for the training sample x_i and the test sample x, y_i ∈ {+1, −1} is the class label of x_i, α_i is a learned weight of the training sample x_i, and b is a learned threshold. The training samples with weight α_i > 0 are usually called support vectors.

To obtain a detector response, we use the raw output of the SVM, given by (8). By placing different thresholds on this output, we influence the decision and obtain Receiver Operating Characteristic (ROC) or precision-recall (RP) curves.

We use an extended Gaussian kernel (Chapelle et al. 1999; Jing et al. 2003)

    K(H_i, H_j) = e^{-\frac{1}{A} D(H_i, H_j)},    (9)

where H_i = {h_in} and H_j = {h_jn} are image histograms and D(H_i, H_j) is the χ2 distance defined as

    D(H_i, H_j) = \frac{1}{2} \sum_{n=1}^{N} \frac{(h_{in} - h_{jn})^2}{h_{in} + h_{jn}},    (10)

where N is the size of the vocabulary (N = 1000 in our experiments). The resulting χ2 kernel is a Mercer kernel (Fowlkes et al. 2004). The parameter A is the mean value of the distances between all training images (Zhang et al. 2007). We may combine different feature channels by summing their distances, so that D = \sum_n D_n, where D_n is the χ2 distance for channel n.
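
A small sketch of the χ2 distance and the resulting kernel matrix; such a precomputed Gram matrix can then be handed to any SVM implementation that accepts precomputed kernels:

```python
import numpy as np

def chi2_distance(H1, H2, eps=1e-10):
    """Chi-square distance of Eq. (10) between two normalized histograms."""
    return 0.5 * np.sum((H1 - H2) ** 2 / (H1 + H2 + eps))

def chi2_kernel_matrix(hists_a, hists_b, A=None):
    """Extended Gaussian kernel of Eq. (9), K = exp(-D / A). If A is not given
    it is set to the mean of the computed distances (the paper estimates A on
    the training images)."""
    D = np.array([[chi2_distance(h1, h2) for h2 in hists_b] for h1 in hists_a])
    A = D.mean() if A is None else A
    return np.exp(-D / A)
```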

4.3 Spatial Weighting

The overview of our shape mask alignment and feature weighting procedure for classification is given in Fig. 5. Given a test image, local features and their rectification parameters are computed first. Next, localization hypotheses are cast by matching test features with training features and aligning the training shape masks. Then, the hypotheses are stacked to produce an approximate object segmentation. Finally, features are weighted according to this segmentation and a classifier is executed. In the following we explain each of these steps in detail.

Fig. 5 Overview of the spatial weighting procedure

Compute Sparse Local Features

Given an image, we first compute a set of sparse local features as described in Sect. 3.1. During training, ground-truth segmentation information is used to learn (remember) the shape mask of the object from a “point of view” relative to the training features. This is encoded in the rectification parameters (see Sect. 3.2) of each feature and in the pointer to the corresponding shape mask. Shape masks are initialized from the provided image segmentations.

In fact, as we have ground-truth data for the training set, we first filter the training data to include foreground features only. We perform this operation to avoid the noise that would be introduced by matching test features with background features. The position of background features cannot be correlated with the position of the object, even if those features would give hints about the object category, e.g., a street sign could give us hints about the cars object category, but it is impossible to draw any precise conclusions about the location of a car from the position of the sign. However, as the segmentation information only roughly follows object edges and local descriptors need some support area, it is worth dilating (our choice) or blurring the segmentation image by some pixels (we chose 32) before filtering out the background features.

Given a test image, for each test feature we look for the features from the training data that are closest (we use the Euclidean distance here) in the 128-dimensional feature space (for our SIFT implementation). We choose the N = 100 most similar training features in our experiments.
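
A minimal sketch of this matching step with scikit-learn's exact nearest-neighbor search (any other index would do); `train_descriptors` is assumed to contain the foreground training features only, as described above:

```python
from sklearn.neighbors import NearestNeighbors

def match_test_features(test_descriptors, train_descriptors, n_neighbors=100):
    """Return, for each test descriptor, the Euclidean distances to and indices
    of the N most similar training descriptors (N = 100 in the paper)."""
    nn = NearestNeighbors(n_neighbors=n_neighbors, metric="euclidean")
    nn.fit(train_descriptors)
    distances, indices = nn.kneighbors(test_descriptors)
    return distances, indices
```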

Cast Hypotheses

Pairs of similar training and test features can generate hypotheses about possible object locations and shapes. Rectification parameters are used to align the retrieved shape masks as described in Sect. 3.3. For example, in the case of scale-invariant features, a feature detected at scale 6 may correspond to a feature detected at scale 3 in the training image. Then we need to shift the mask of the training image to the relative position in the test image and rescale the mask by a factor of 2. Similarly, we can use rotation compensation for rotation-invariant features and an affine transformation for affine-invariant features. A hypothesis is represented by an aligned shape mask.

Stack Hypotheses

We collect all the hypotheses cast by a given test feature and sum the corresponding shape masks. Masks in the sum are weighted with a Gaussian function G(0, σ) of the distance between the training and test feature; we have found σ = 0.15 to be a reasonable value. The sum of weights is normalized to unity such that the contribution of each test feature is equal. The hypotheses generated by all the test features are summed and form the approximate segmentation mask. Figure 3 shows some examples of the resulting masks.
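
A sketch of the casting and stacking steps, assuming the matching results from the previous step and an `align_fn` that projects a training shape mask into the test image (cf. Sect. 3.3); all names are placeholders for the surrounding pipeline, not functions defined by the paper:

```python
import numpy as np

def stack_hypotheses(distances, matches, train_masks, align_fn, image_shape, sigma=0.15):
    """Sum Gaussian-weighted aligned shape masks over all test features."""
    total = np.zeros(image_shape, dtype=float)
    for test_idx, (dists, idxs) in enumerate(zip(distances, matches)):
        weights = np.exp(-0.5 * (np.asarray(dists) / sigma) ** 2)  # Gaussian of descriptor distance
        if weights.sum() == 0:
            continue
        weights /= weights.sum()          # each test feature contributes equally
        for train_idx, w in zip(idxs, weights):
            aligned = align_fn(train_masks[train_idx], test_idx, train_idx, image_shape)
            total += w * aligned
    return total
```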

Weight Features

The final mask represents a score map describing the likelihood that a given image pixel belongs to an object. The score value reflects the number of features agreeing on the hypothesis that a given pixel is a foreground pixel, i.e., belongs to an object. The scores are then used to boost the importance of the features lying on the object and suppress the background features.

Let us recall that we use a bag-of-features representation for classification. When a feature distribution (histogram) is computed, we weight features according to the mask value corresponding to their positions, denoted by M_i(x_f, y_f) in the following. Precisely, we construct the histogram H_i for image i by computing

    H_i = \left\| \sum_{f \in F_i} M_i(x_f, y_f) \cdot I_{w_f} \right\|_1,    (11)

where F_i is the set of features obtained for image i, M_i(x_f, y_f) are the values of the final mask computed for this image, x_f and y_f are the coordinates of feature f, w_f is the vocabulary word assigned to this feature, and I_{w_f} is the unit vector indicating this vocabulary word.

Consequently, the features considered to be foreground features will significantly contribute to the representation and the background features will have a minor impact on the histogram. Note that the feature weighting procedure could be run iteratively—weighted features could be used to compute a subsequent segmentation mask—yet we have not noticed a significant gain in recognition performance in practice.

Weighted histograms represent the original images with suppressed background clutter. They can be classified with any generic classifier; we use Support Vector Machines with the χ2 kernel.
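
Interpreting Eq. (11) as an L1-normalized, mask-weighted word count (an assumption about the notation), the weighting step can be sketched as follows; the array indexing convention is also an assumption:

```python
import numpy as np

def weighted_bof_histogram(words, positions, mask, n_words=1000):
    """Bag-of-features histogram where each feature's unit vote for its visual
    word is weighted by the stacked-mask value at the feature position."""
    hist = np.zeros(n_words)
    for w, (x, y) in zip(words, positions):
        hist[w] += mask[int(round(y)), int(round(x))]  # mask assumed indexed (row, col)
    total = hist.sum()
    return hist / total if total > 0 else hist
```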

5 Localization

As we have shown in the previous section, shape masks cast by local features can be used to perform approximate image segmentation. The obtained segmentation masks seem to be promising for localization. Note, however, that a single image segmentation is not sufficient for the localization task, where one has to distinguish between multiple separate object instances. To distinguish between separate objects in a single image we propose to cluster the shape masks.

The overview of the localization approach is given in Fig. 6. First, local features are computed and used to cast localization hypotheses. The hypotheses are evaluated and clustered to obtain object localizations. Finally, decisions can be filtered to reduce the number of false positives. Note that the three main blocks of the procedure, i.e., casting, evaluating and online clustering of the hypotheses, can be executed in a pipe to optimize storage. The details of each step are given below.

Fig. 6 Overview of the localization approach. The main operation block is executed in a pipe to reduce memory requirements

Fig. 7 Main points of our localization framework. (a) Ambiguities introduced by local features may generate false hypotheses (left). Hypothesis evaluation helps to avoid them in our framework (right). (b) Occlusion weakens the discriminative classifier response and the object may be missed (left). This is reduced in our framework by collecting the local evidence provided by agreeing features (right)

Compute Sparse Local Features

First, sparse local features are computed over all training images as described in Sect. 3.1. The descriptors are clustered using k-means with k = 1000. Cluster centers form a visual vocabulary and each feature is assigned to the closest vocabulary word. Given the object segmentations, we discard the features which do not lie on any object. For the remaining features we keep the pointer to the shape mask created from the relevant object segmentation. For each feature we also keep its rectification parameters.

Given a test image, a set of features is computed and the feature space is quantized using the vocabulary created during training, i.e., each test feature is assigned to the nearest vocabulary word. Each test feature is then matched with training features. Features are considered matching if they belong to the same vocabulary word.¹

Note that this step of the procedure is analogous to the corresponding step of spatial weighting for classification, cf. Sect. 4. The only difference is that we additionally quantize features using a visual vocabulary. This allows faster matching, but sacrifices the match distance. Since we replace hypothesis stacking with hypothesis clustering for localization, this is a reasonable trade-off.

Cast Hypotheses

The hypotheses are generated by investigating all test image features in arbitrary order. For each test feature we consider all matching training features, each pointing to a shape mask. The rectification parameters of a training feature and a test feature determine the alignment, as shown in (3). Thus, analogously to the corresponding step in Sect. 4, we project the training shape masks into the test image and cast a hypothesis about a possible object location.

¹ Note that for image classification in Sect. 4 we only quantize features to build a bag-of-features image representation; for the spatial weighting with shape masks, features are directly matched. In this section we match quantized features in order to speed up computation, without any noticeable impact on performance.

Evaluate Hypotheses

Each hypothesis, i.e., each aligned shape mask generated by a test feature, is evaluated with an SVM classifier. A bag-of-features representation is built from all features encompassed by the shape mask. Precisely, we construct a histogram H_ih for test image i and hypothesis h represented by mask ζ_h by computing

    H_{ih} = \left\| \sum_{f \in F_i} \zeta_h(x_f, y_f) \cdot I_{w_f} \right\|_1,    (12)

where F_i is the set of features obtained for test image i, x_f and y_f are the coordinates of feature f, w_f is the vocabulary word assigned to this feature, I_{w_f} is the unit vector indicating this vocabulary word, and ζ_h(x_f, y_f) denotes the value of the shape mask at position (x_f, y_f).

Given a histogram, a confidence measure is provided by the classifier, see Sect. 4.2 for details. After the evaluation only the hypotheses for which a positive confidence measure is returned are kept. The confidence measure is stored with each hypothesis. Performing the evaluation step immediately after the hypothesis is cast allows us to deal with ambiguities intrinsic to local features before they could influence the hypothesis-space clustering, see the left part of Fig. 7. It also eliminates wrong hypotheses caused by background clutter.
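
A compact sketch of this evaluation step; `svm_decision` is assumed to be a callable returning the raw SVM output of Eq. (8) for a bag-of-features histogram (e.g., the decision function of a classifier trained with the χ2 kernel of Sect. 4.2), and the histogram weighting mirrors Eq. (12):

```python
import numpy as np

def evaluate_hypothesis(mask, words, positions, svm_decision, n_words=1000):
    """Score one localization hypothesis (an aligned shape mask); hypotheses
    with non-positive confidence are discarded."""
    hist = np.zeros(n_words)
    for w, (x, y) in zip(words, positions):
        hist[w] += mask[int(round(y)), int(round(x))]
    if hist.sum() > 0:
        hist /= hist.sum()
    confidence = float(svm_decision(hist))
    return confidence if confidence > 0 else None
```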

Cluster Hypotheses

After the hypotheses are evaluated, we could look for the strongest ones and consider them as localization decisions. We use, however, a discriminative classifier, which is relatively sensitive to occlusions. Thus, combining its output with the generative evidence provided by local features should be beneficial, see the right part of Fig. 7. To collect the evidence from multiple hypotheses we perform online agglomerative hypothesis (or shape mask) clustering.

It is computationally prohibitive to store all generated shape masks in memory. To overcome this problem, we use an online approach. We pipe the hypotheses through the system as they are generated and we keep a limited number of them in memory at the same time. The number of hypotheses is reduced during the clustering step by merging similar shape masks and dropping non-promising ones. To improve the stability we shuffle the test features before generating hypotheses, which reduces the number of early overlaps. Furthermore, we keep as many shape masks as memory permits, thus our heuristic of dropping the weakest hypotheses is valid.

At this point of the algorithm each hypothesis consists of two elements—a shape mask and an associated confidence value, computed by the SVM in the previous step. We can measure the similarity between two shape masks as defined by (5). When the number of collected shape masks exceeds the limit of L = 100 elements, the pair of hypotheses with the most similar masks is considered for a merge. If the similarity is above a merge threshold, U = 0.7 in our experiments, the hypotheses are merged. Otherwise, the hypothesis with the lowest confidence value is removed.

When two hypotheses are merged, a combined shape mask needs to be computed. The resulting shape mask is the average of the masks being merged, weighted by the confidence values associated with each of the two masks. The confidence of the resulting hypothesis is the sum of the confidence values of the hypotheses to be combined. Thus, the merge of shape masks Q and R associated with confidence values η_Q and η_R can be expressed as

    S = \frac{\eta_Q}{\eta_Q + \eta_R} \cdot Q + \frac{\eta_R}{\eta_Q + \eta_R} \cdot R, \qquad \eta_S = \eta_Q + \eta_R,    (13)

where S is the resulting shape mask and η_S its confidence.

After all hypotheses are generated, evaluated and collected, the agglomerative clustering continues until no more hypotheses can be merged, i.e., all the remaining hypothesis pairs have a shape mask similarity below the threshold. The remaining hypotheses are then passed to the next step.
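
A sketch of the online clustering loop described above; `similarity` is the overlap measure of Eq. (5) and `hypotheses` yields (mask, confidence) pairs, so the buffer never holds more than `max_buffer` elements:

```python
import numpy as np

def online_cluster(hypotheses, similarity, max_buffer=100, merge_thresh=0.7):
    """Online agglomerative clustering of (mask, confidence) hypotheses."""
    buffer = []  # list of [mask, confidence]

    def merge_best():
        """Merge the most similar pair (Eq. (13)) if it exceeds the threshold."""
        best, best_sim = None, -1.0
        for a in range(len(buffer)):
            for b in range(a + 1, len(buffer)):
                s = similarity(buffer[a][0], buffer[b][0])
                if s > best_sim:
                    best, best_sim = (a, b), s
        if best is None or best_sim < merge_thresh:
            return False
        a, b = best
        (Q, eQ), (R, eR) = buffer[a], buffer[b]
        merged = [(eQ * Q + eR * R) / (eQ + eR), eQ + eR]
        buffer[:] = [h for i, h in enumerate(buffer) if i not in (a, b)] + [merged]
        return True

    for mask, conf in hypotheses:
        buffer.append([np.asarray(mask, dtype=float), conf])
        while len(buffer) > max_buffer:
            if not merge_best():
                buffer.pop(int(np.argmin([c for _, c in buffer])))  # drop weakest
    while merge_best():   # offline finish: merge until no similar pair remains
        pass
    return buffer
```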

Filter Decisions

Finally, the decisions are filtered to reduce the number of false positives. This is a standard way to proceed (Rowley et al. 1998; Viola and Jones 2004). We have implemented a simple approach for situations where no significant self-occlusion of objects is expected. We reduce significantly overlapping decisions to the one with the highest confidence value. This allows us to avoid false positives resulting from subsequent detections of an already detected object.

5.1 Learning Aspects

It is possible to apply the shape mask clustering procedure during the training phase in order to learn object aspects. By clustering the training shape masks, common object types, viewpoints and states can be discovered. Since the clustering reduces the number of training shape masks, it also reduces the recognition complexity. Moreover, the clustering procedure allows us to clean the training data from outliers—unusual object aspects that appear in the training set only once and are unlikely to be seen again.

Fig. 8 Overview of the aspect learning procedure. The main operation blocks are executed iteratively

The overview of the aspect learning procedure is given in Fig. 8. It is an agglomerative clustering approach in which close shape mask pairs are found based on feature similarities and outline match. In each iteration of the algorithm the closest shape mask pair is merged. The two main blocks of the procedure, i.e., computation of the aspect similarities and the merge of the two most similar aspects, are performed iteratively until no more merges are possible. The details of each step are given below.

Compute Sparse Local Features

First, sparse local features are computed over all training images and the features are assigned to the nearest vocabulary word.

Compute Feature Similarities

We assume that two aspects are similar if they result in globally similar object shapes and are also supported by similar local features appearing on the object at approximately the same locations. Thus, for each visual vocabulary word we consider all pairs of features that were assigned to the same vocabulary word and come from different training images. For each such feature pair we compute the featured shape mask similarity as defined by (7). Thresholding the similarity measure allows us to find visually matching (belonging to one feature cluster) feature pairs that would cast similar (as defined by (5)) shape masks. We set the threshold to T = 0.85 in our experiments.

Vote for Shape Mask Pairs

Each pair of matching features determined in the previous step casts a vote for the aspect pair it belongs to. The pair of shape masks with the highest number of votes is a candidate to be merged. This choice assures that the aspects will have similar object outlines (a similarity measure above the threshold T is necessary to cast a vote) and that the aspects with similar appearance (the ones with a high number of matching features) are merged first. If there are no more merge candidates left, the iterative part ends and singletons are pruned, as sketched below.
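
A sketch of the vote accumulation, assuming each training feature record carries its visual word, its source image and the aspect (shape mask) it points to, and that `featured_similarity(i, j)` implements Eq. (7); the record layout is an assumption for illustration:

```python
from collections import defaultdict
from itertools import combinations

def vote_for_merge(features, featured_similarity, T=0.85):
    """Return the pair of aspects with the most votes, or None if no matching
    feature pair exceeds the featured shape mask similarity threshold T."""
    by_word = defaultdict(list)
    for idx, f in enumerate(features):
        by_word[f["word"]].append(idx)

    votes = defaultdict(int)
    for idxs in by_word.values():
        for i, j in combinations(idxs, 2):
            if features[i]["image"] == features[j]["image"]:
                continue                                   # pairs must span different training images
            if featured_similarity(i, j) >= T:
                pair = tuple(sorted((features[i]["mask"], features[j]["mask"])))
                votes[pair] += 1
    return max(votes, key=votes.get) if votes else None
```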

Find the Best Merge Geometry

To merge two shape masks we first determine a geometrical transformation between them. We choose the transformation defined by the feature pair with the highest featured shape mask similarity. This assures a good overlap of both shape masks (high similarity of the aligned shape masks) and features (they get aligned according to the best matched feature pair).

Merge the Shape Masks

After the transformation is determined, the common reference frame is established by the feature with the higher scale parameter (in a practical implementation this assures the best mask resolution). The other shape mask is transformed according to (3). The same transformation is applied to the rectification parameters of the features associated with the transformed shape mask. This allows the features to follow the geometry change.

The weighted average of the registered shape masks is computed and the features pointing to the merged masks are associated with the new mask. The featured shape mask similarities affected by the merge are recomputed before continuing.

Merge Similar Features

To speed up the computation we reduce the number of features. We detect and merge only redundant features, so the recognition performance is not affected. We exploit the fact that after merging two shape masks there will be similar features appearing at approximately the same locations. Therefore, it is possible and desirable to combine those features. We compute the weighted average of features that (i) are visually similar (belong to one feature cluster), (ii) point to the same shape mask and (iii) would cast similar shape masks, i.e., their featured shape similarity is above the threshold T. The last condition assures that the rectification parameters of merged features are also similar, i.e., that we do not merge, e.g., front and back wheels of a car. The featured shape mask similarities for merged features have to be recomputed before launching a new clustering iteration.

Prune the Singletons

As our experiments show, it can be beneficial to prune the single training shape masks that are not merged with any other shape mask during the agglomerative shape mask clustering procedure.

6 Experimental Results

As we have stated in the introduction, object class recognition is especially challenging in the presence of pose changes, intra-class variation, occlusion and background clutter. We choose the Graz-02 (Opelt et al. 2004a) dataset to evaluate our framework, as it contains natural real-world images with a significant amount of intra-class variations, occlusions and background clutter (cf. Fig. 9). We also compare to other methods on the PASCAL VOC’05 dataset (Everingham et al. 2006), the PASCAL VOC’09 dataset (Everingham et al. 2009) and the Weizmann horse dataset (Shotton et al. 2005).

For the results presented in this section we use two local region detectors to extract salient image structures: the Harris-Laplace detector (Mikolajczyk and Schmid 2004) (denoted HS) and the Laplacian detector (Lindeberg 1998) (denoted LS). To compute appearance-based descriptors on the patches obtained by the detectors we employ the SIFT (Lowe 2004) descriptor. See Sect. 3.1 for details. SVMs with the χ2 kernel are used for classification, refer to Sect. 4.2 for details.

In the following we first evaluate the components of our shape masks framework (Sects. 6.1 and 6.2). Then we compare to the state of the art. We apply our framework to the classification task (Sect. 6.3) and the localization task (Sect. 6.4).

6.1 Evaluation of the Recognition Components

In this subsection we evaluate our recognition components on the Graz-02 dataset using the original ground-truth annotation. These annotations do not give the outline for individual objects, but only the segmentation mask for each image, i.e., it is impossible to know how many objects are present. Clustering shape masks requires object-specific annotations and will therefore not be used in this subsection. Here, we approximate the object shape mask with the segmentation mask of the entire image.

Fig. 9 Sample Graz-02 images. Note the high intra-class variations, significant amount of background clutter and difficult occlusions

Table 1 Graz-02 pixel-based recall/precision curve EER measuring the impact of hypothesis evaluation and evidence collection on the recognition performance

                               Cars     People   Bicycles
  No hypothesis evaluation     40.4%    28.4%    46.6%
  No evidence collection       50.3%    40.3%    48.9%
  Our full framework           53.8%    44.1%    61.8%

We run our framework on all three object classes: bikes, cars and people (cf. Fig. 9). For each class we use the first 150 odd-numbered images for training and the first 150 even-numbered images for testing, i.e., we follow the experimental setup defined by Opelt and Pinz (2005). To evaluate the results, we use pixel-based recall-precision curves (RPs). Based on the ground-truth segmentation maps we count an object pixel as a true positive when it is detected and as a false negative otherwise. The pixels incorrectly detected as object pixels are false positives.
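
A small sketch of this pixel-based evaluation for one image and one detection threshold; sweeping the threshold over the predicted score map yields the recall-precision curve, and the EER is the point where precision equals recall:

```python
import numpy as np

def pixel_precision_recall(score_map, gt_mask, threshold):
    """Pixel-based precision/recall against a ground-truth segmentation map."""
    detected = score_map >= threshold
    gt = gt_mask.astype(bool)
    tp = np.logical_and(detected, gt).sum()       # object pixels that are detected
    fp = np.logical_and(detected, ~gt).sum()      # background pixels detected as object
    fn = np.logical_and(~detected, gt).sum()      # object pixels that are missed
    precision = tp / (tp + fp) if tp + fp > 0 else 1.0
    recall = tp / (tp + fn) if tp + fn > 0 else 1.0
    return precision, recall
```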

Table 1 shows the equal error rates of the recall-precision curves for each of the classes. We compare our full recognition system with image-based shape masks to two modified versions. “No hypothesis evaluation” does not use the hypothesis evaluation step and assumes the same confidence for each hypothesis cast by the local features. “No evidence collection” does not collect the evidence provided by the features, but selects the hypotheses with the highest classifier response instead. For each class the performance of our combined framework is significantly better than the performance of the approaches where the hypothesis evaluation or evidence collection are missing. This confirms that both elements are necessary in order to perform precise object class localization and proves that our framework is able to combine them. Note that evidence collection is crucial for bicycles, as the discriminative classifier may easily get distracted by the background surrounding thin bicycle parts.

Fig. 10 Several car aspects detected by agglomerative clustering of Graz-02 data

6.2 Aspect Clustering

Clustering shape masks requires additional annotations of the Graz-02 dataset. We have therefore extended the annotations by separating the available per-image annotations into per-object segmentations. Moreover, each object has been marked as truncated by an image border or difficult to recognize if appropriate. The images were divided into equally large training and test sets. For training, images containing at least one non-truncated object were randomly drawn. The remaining images, on which all the appearing objects were annotated, define the test set.² The improved annotations and the training and test image numbers are available online at http://lear.inrialpes.fr/data.

Fig. 11 Recognition rate for Graz-02 cars given as recall as a function of FP per image. We can observe the impact of aspect clustering

Figure 10 shows fragments of the agglomerative aspect clustering trees computed during training for the cars. We have chosen 3 aspect clusters from the 6 largest ones (grouping at least 17 shape masks). Next to the shape mask resulting from the clustering we present the earliest merged (thus most similar) training objects that initialized each of the clusters. We can see that the detected aspects reveal more than just the viewpoint at which the object is observed. When the object outline is significant, different car types (like sedans and minivans) are clustered together and form separate aspects. Also 2 sample singletons (from the total number of 72) are shown. Zooming in, we can see that the singletons are true outliers—neither an ambulance nor a car with an open trunk is present in any of the remaining images of the dataset.

The influence of aspect clustering on recognition accuracy for cars is presented in Fig. 11. We use the shape mask overlap similarity (cf. (5)) with threshold 0.3 as the criterion for correct localization, and display recall as a function of false positives per image. We can observe a slight improvement in accuracy due to aspect clustering, and a further improvement due to singleton pruning performed after the clustering, referred to as the full framework. When the number of training samples is small, outliers might waste modeling effort on highly unlikely poses, so it is better to discard those. What can be even more important in some applications is that aspect clustering also improves the recognition speed and memory requirements, as there are fewer features and shape masks to be kept and considered during recognition. Furthermore, it opens the possibility of annotating aspects with pose or sub-type information.

² Due to the image content requirements mentioned above some images were not used at all, but at the same time we have annotated and used some images that were not used in the original setup, but are present in the dataset. There are 354 car images, 280 people images and 324 bike images in total.

Table 2 Equal error rates of ROC curves for the classification of Graz-02 images. Note that only the baseline is shown as no further gain is possible

              Opelt et al.    Pure bag-of-features
              (2006)          HS SIFT     LS SIFT     (HS+LS) SIFT
  Cars        70.2%           100.0%      100.0%      100.0%
  People      81.0%           100.0%      100.0%      100.0%
  Bikes       76.5%           100.0%      99.3%       100.0%

For people and bicycles, due to the large number of possible variations in articulations and poses, the relatively small Graz-02 dataset turns out to be insufficient to determine good aspect clusters. They have either low support, or become blurred after we lowered the threshold T for merging shapes. One could expect that learning canonical poses is simpler for rigid objects. Yet, with singleton pruning turned off we could still perform successful localization. For people, the recall is 43% for 5 FPs/image. The number of false positives may appear high. Note, however, that people are often small, close to each other and occluded. It may also be difficult to match the correct articulation, and a shape mismatch can result in a false positive. For bicycles, with a localization accuracy criterion of 0.2, the recall is 59% for 2 FPs/image. We had to lower the criterion due to the transparent structure of a bike, which lowers the overlap measure even for a small misalignment of the mask.

6.3 Classification

In this subsection we apply our framework to the classification task, as discussed in Sect. 4. Table 2 shows the classification results obtained with our baseline on the Graz-02 classification task (Opelt et al. 2006). We follow the setup established by the dataset authors, i.e., we use the first 150 odd images of a category and the first 150 odd images of the counter-set for training, then the first 150 even images of a category and the first 150 even images of the counter-set for testing. Even though the dataset is challenging for segmentation and localization, cf. Sects. 6.1, 6.2 and 6.4, it turns out to be quite simple for classification. Our baseline method achieves perfect or almost perfect recognition, which leaves no room for improvement.

Consequently, to evaluate our spatial weighting procedure for classification, we take the PASCAL VOC’05 challenge dataset (Everingham et al. 2006). It is built around a similar set of classes as the Graz-02 dataset (with the addition of motorbikes) and the standard evaluation procedure is practically the same (EERs are computed for each binary classification task). Unfortunately, precise foreground/background segmentations are not available, but since all images are annotated with bounding boxes, we can obtain a very rough approximation of image segmentation. Using bounding boxes instead of true segmentations probably penalizes our method, but we still expect an overall improvement.

Table 3 summarizes the classification results on the PASCAL VOC’05 dataset (Everingham et al. 2006). For each of the eight sub-tasks we present the best reported result of the challenge, the performance of the evaluated method without spatial weighting and the performance after introducing our technique. We also show the gain achieved by our method. The results are already quite high, but our method shows consistent improvement. Furthermore, the final recognition accuracy improves over the baseline in most cases.

The experiment on the PASCAL VOC'05 challenge dataset also allows for a comparison with the Implicit Shape Model (ISM). It was shown by Leibe et al. (2008) that ISM in the classification setup performs slightly below the baseline of Zhang et al. (2007). As shown by the above experiment, our method slightly improves over this baseline.

Figure 12 presents some of the computed ROC curves. Plots are shown for the evaluated method without spatial weighting, for the same method with spatial weighting and for approximate background suppression using ground-truth bounding boxes. The curves were selected to show cases with and without improvement. Results for other test sets are consistent with the ones selected for brevity. One can notice that the method does not give an improvement for all sub-tasks. However, comparing the achieved gain with the potential estimated using the ground-truth bounding-box segmentation, we see that the improvement is proportional to the gap between the ground truth and the baseline. This can be easily understood. If there is little background clutter, removal of background does not help. For example, the background of motorbikes in test set 1 is mostly uniform and the EER cannot be improved by using shape masks. On the other hand, where there is significant background clutter, our method can reduce it and improves recognition. For instance, the background of bicycles in test set 2 is heavily cluttered and spatial weighting gives a 2.0% accuracy gain in this case.
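
For reference, the equal error rate corresponds to the ROC point where the false positive and false negative rates coincide; the percentages in Table 3 appear to report accuracy at that point. A minimal sketch of this computation (our own threshold sweep with an illustrative function name, not the official evaluation code) could be:

import numpy as np

def eer_accuracy(scores, labels):
    # Sweep thresholds, find the point where FPR and FNR are closest,
    # and return 100 * (1 - EER) as a percentage.
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_gap, best_err = np.inf, 1.0
    for t in np.unique(scores):
        pred = scores >= t
        fpr = np.logical_and(pred, ~labels).sum() / max((~labels).sum(), 1)
        fnr = np.logical_and(~pred, labels).sum() / max(labels.sum(), 1)
        if abs(fpr - fnr) < best_gap:
            best_gap, best_err = abs(fpr - fnr), (fpr + fnr) / 2.0
    return 100.0 * (1.0 - best_err)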

6.4 Localization

Figure 13 shows sample detections after applying our localization framework (see Sect. 5) to the Graz-02 dataset. It demonstrates that our method is able to successfully localize object instances, even under difficult conditions of background clutter and occlusion. We can observe that the computed masks give information about the pose of the object. Note that in the case of the images with more than one

object, the subsequent objects are localized with subsequent hypotheses—see the bottom row. Note that the third hypothesis in the last row has a very low score of 4.9 and can be discarded. Interestingly, it is probably caused by the small car part in the bottom left corner.

To compare our approach to the state of the art on the Graz-02 dataset we follow the experimental setting of Opelt and Pinz (2005). We use the first 150 odd-numbered images for training and the first 150 even-numbered images for testing. We also keep the localization accuracy criterion chosen by the authors, which is based on the criterion established by Agarwal et al. (2004). It requires the position given by the system to fall within an ellipse drawn at the center of the localized object. However, due to different parameter settings, Opelt's ellipse is larger and thus the criterion is weaker than Agarwal's. Note that for a fair comparison we could not use shape mask clustering due to limitations of the original annotations, as discussed earlier.
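
The ellipse criterion can be sketched as follows; the fractions of the bounding box used for the ellipse axes differ between Agarwal et al. (2004) and Opelt and Pinz (2005), so the values of alpha and beta below are placeholders rather than the actual parameter settings, and the function name is ours:

def inside_ellipse(pred_center, gt_box, alpha=0.25, beta=0.25):
    # The predicted object centre must fall inside an ellipse centred on the
    # ground-truth box; semi-axes are fractions of the box width and height.
    x_min, y_min, x_max, y_max = gt_box
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    a, b = alpha * (x_max - x_min), beta * (y_max - y_min)
    px, py = pred_center
    return ((px - cx) / a) ** 2 + ((py - cy) / b) ** 2 <= 1.0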

Table 4 presents the comparison with Opelt's approach. Even in its basic form our method achieves superior accuracy. The most significant gain can be observed for people, where the richness of the articulations makes the task difficult for methods based on parametrized hypotheses.

To compare the full localization capabilities of our method to the state of the art we evaluate our framework on the Weizmann horse dataset. We closely follow the setup of Shotton et al. (2005)—we use the first 50 images of horses and the corresponding object segmentations plus the first 50 background images for training. The next 277 images from each set are used for testing. We also use the same criterion for localization accuracy—we determine the centroid of the computed segmentation mask and compute the distance to the centroid in the ground truth. If the distance is less than 25 pixels, the localization is considered to be correct. Note that we follow the protocol strictly by using scale-normalized images and running our system at a single scale. Due to the small size of the images we use Laplacian features to get enough detections, as discussed in Sect. 3.1.
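
The centroid-distance criterion is straightforward to state in code; the sketch below is our own illustration and assumes non-empty binary masks:

import numpy as np

def centroid(mask):
    # Centroid (x, y) of a non-empty binary mask.
    ys, xs = np.nonzero(mask)
    return xs.mean(), ys.mean()

def horse_localization_correct(pred_mask, gt_mask, max_dist=25.0):
    # Localization counts as correct if the predicted centroid lies
    # within 25 pixels of the ground-truth centroid.
    px, py = centroid(pred_mask)
    gx, gy = centroid(gt_mask)
    return np.hypot(px - gx, py - gy) <= max_dist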

Table 5 compares our results to Shotton et al. (2005). We can observe that our approach improves the performance. Our results are reported for a training procedure without singleton pruning and the standard shape merging threshold T = 0.85, as well as for a lower shape merging threshold with singleton pruning. If we do not lower the threshold, too few aspect clusters are formed due to the large number of horse articulations, i.e., most of the aspects are singletons. Figure 14 shows a few example results. We can observe that the shapes of the horses are detected very accurately in the test images. We can even judge from the detected shape masks whether the horse is standing or running.

We have also tested our method on the PASCAL VOC'09 segmentation challenge.


Table 3 Equal error rate percentages for the classification task of the PASCAL VOC'05 challenge. Best result achieved during the challenge ("winner"), performance of our baseline method and improvement introduced with shape masks are presented

                          Winner                    Reimpl. of Zhang et al. (2007)       Shape masks
                          (see Everingham           HS SIFT   LS SIFT   (HS+LS) SIFT     HS SIFT   LS SIFT   (HS+LS) SIFT     Gain
                          et al. 2006)

Test set 1   Bikes        93.0                      85.1      90.4      92.1             86.8      91.2      92.1
             Cars         96.1                      93.5      93.8      94.5             93.5      94.9      96.0             +1.5
             Motorbikes   97.7                      94.0      95.8      96.3             92.6      95.4      96.3
             People       91.7                      89.3      88.1      91.7             89.3      89.3      92.9             +1.2

Test set 2   Bikes        72.8                      72.6      73.4      74.8             75.3      75.9      76.8             +2.0
             Cars         72.0                      72.5      73.9      75.8             73.7      73.9      76.8             +1.0
             Motorbikes   79.8                      72.9      77.1      78.8             74.3      78.2      79.3             +0.5
             People       71.9                      75.1      74.5      76.9             76.3      74.9      77.9             +1.0

Fig. 12 Selected Receiver Operating Characteristic (ROC) curves for classification on the PASCAL VOC'05 dataset. Results for our baseline, our spatial weighting method and ground truth based segmentation are presented. Please note that only the most interesting parts of the curves are shown to improve readability


Fig. 13 Sample results on the Graz-02 dataset. Note the precise object shape estimations despite occlusions and background clutter. Multiple object instances are detected with subsequent hypotheses, as shown in the bottom row

Table 4 Percentage of the Graz-02 images that satisfy the localization criterion of Opelt and Pinz (2005)

                          Bikes     Cars      People
Opelt and Pinz (2005)     76.7%     55.3%     48.0%
Our framework             78.7%     62.7%     83.3%

Table 5 Recall/precision curve EER for the Weizmann horse dataset

                                              Horses
Shotton et al. (2005)                         92.1%
Our framework (T = 0.85, with singletons)     94.6%
Our framework (T = 0.7, no singletons)        94.6%

For this task object instance and object class segmentations for 20 object categories3 are provided. Object class segmentation is evaluated on a set of test images. Figure 15 shows images, ground-truth segmentations and actual recognition results with our approach.

The training set for the segmentation task consists of 3786 images, of which 749 are annotated with ground-truth segmentations. The validation set consists of 3908 images, of which 750 are segmented. Finally, the test set consists of 750 images to segment. We use only the segmented images from the training and validation sets to train our localization system based on shape masks, but use all the images for training a classifier. Since no ground truth is available for the test data, we submitted our recognition results to the official evaluation server and compare the returned accuracy to other submitted methods.

3 The 20 object categories of the PASCAL VOC'09 are: person, bird, cat, cow, dog, horse, sheep, aeroplane, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa, tv/monitor.

Fig. 14 Sample results on the Weizmann horses dataset. Note that the shape masks are very accurate: the horse articulations are visible

The PASCAL VOC'2009 segmentation task provides object segmentations, which our system takes as input, but does not require detection of separate object instances in the test data. Furthermore, the task is inherently multiclass, whereas we have focused on one-class problems only. Therefore, we first determine which object classes are present in a given test image using the classification approach of Zhang et al. (2007). Then we use our method to localize object instances of each class and sum all detections (computed shape masks) to obtain a class segmentation of a test image. If multiple classes could be assigned to a pixel, we keep the class for which the shape mask value is the highest.
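
The per-pixel class assignment described above amounts to accumulating the shape-mask scores of each class and taking the class with the highest value. The following sketch is our own illustration (class ids are assumed to be positive integers, with 0 reserved for background):

import numpy as np

def compose_class_segmentation(class_masks, image_shape):
    # class_masks: {class_id: list of per-detection shape-mask score maps (H x W)}.
    # Sum the detections of each class, then label every pixel with the class whose
    # accumulated mask value is highest; pixels with no support stay background (0).
    h, w = image_shape[:2]
    class_ids = sorted(class_masks)
    scores = np.zeros((len(class_ids), h, w), dtype=float)
    for i, c in enumerate(class_ids):
        for mask in class_masks[c]:
            scores[i] += mask
    labels = np.zeros((h, w), dtype=int)
    if class_ids:
        best = scores.argmax(axis=0)
        support = scores.max(axis=0) > 0
        labels[support] = np.asarray(class_ids)[best[support]]
    return labels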

We compare the segmentation accuracy of our method to other localization and segmentation methods submitted to the PASCAL VOC'09 challenge. The distinction is made such that methods designed for different tasks are not mixed. A direct comparison of our method to state-of-the-art segmentation approaches would be somewhat unfair, since our method does not use pixel-level image information to match segments to image edges. Furthermore, our approach spends a lot of modeling effort to recognize individual object instances, instead of directly labeling image areas.


Fig. 15 Ground-truth segmentation (top two rows) and recognition results (bottom two rows) for PASCAL VOC'09 images. Note that only class segmentation is shown, even though both the ground truth and our recognition method distinguish separate object instances. This information, however, is not yet used in the official challenge evaluation

We did not optimize our method for the segmentation task, and the use of pixel-level information is future work. Consequently, a fair evaluation is to compare to methods designed for and submitted to the localization task. Most of the state-of-the-art localization methods submitted to the challenge do not use image segmentation for training, but they use the object bounding boxes provided for a significantly larger set of images. We estimate that the annotation effort to train each system can be considered roughly comparable.

Table 6 compares the segmentation accuracy of our method to state-of-the-art localization and segmentation methods. We split the methods as in the original challenge report, except that methods submitted to both tasks are included in both parts of the table. Our method performs favorably compared to other localization methods (upper part of the table). This confirms that our shape-based method can better determine the location of the object in the image than an approach based on a bounding box. The segmentation accuracy of our method cannot compete with the state-of-the-art segmentation methods (lower part of the table), as we solve a different problem. Still, our localization method improves over some of the segmentation methods. Combining our shape-based localization approach with image segmentation techniques would certainly help to improve the segmentation accuracy, but is beyond the scope of this paper.

Figure 15 (bottom two rows) shows a few example segmentation results. Since we do not use edge information the object outlines are approximate. Nevertheless, they often follow the object shape, and even in crowded environments some objects like monitors, tables and plants are found and correctly localized. The overall performance is clearly lower than for simpler datasets like Graz-02 or Weizmann horses. This can be explained by the difficulty of the VOC'09 segmentation test set, where objects occur in unusual poses, under significant occlusion and varying lighting conditions.

7 Conclusion

In this paper we have proposed an object recognition approach that is based on shape masks—generalizations of segmentation masks. We have developed a shape mask alignment procedure that allows us to transfer information about the extent (outline) of objects. We have shown that shape masks provide a convenient tool to exploit the geometry of objects. Our ideas were applied to two common object class recognition tasks—classification as well as localization.

For classification, we have proposed an extension to bag-of-features that incorporates spatial relations between features. We have introduced the shape-mask-based "spatial weighting" technique, which uses spatial relations to boost the weights of foreground features and to decrease the influence of background features on the representation, thus making it more robust to background clutter. The experimental evaluation has shown that applying the proposed extension improves the classification results.
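
A minimal sketch of such a weighting, assuming a soft shape mask aligned with the image and one visual-word assignment per keypoint (the function name and the simple per-feature weighting are ours, and the combination of multiple hypothesis masks is omitted), is:

import numpy as np

def spatially_weighted_bof(visual_words, keypoints, mask, vocab_size):
    # Bag-of-features histogram where each feature votes with the (soft) shape-mask
    # value at its keypoint location, so background features contribute less.
    hist = np.zeros(vocab_size, dtype=float)
    for word, (x, y) in zip(visual_words, keypoints):
        hist[word] += mask[int(round(y)), int(round(x))]
    if hist.sum() > 0:
        hist /= hist.sum()
    return hist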


Table 6 Segmentation accuracy on the PASCAL VOC'09 segmentation challenge. We compare to other localization (upper part) and segmentation (lower part) methods, as proposed by the challenge organizers. Columns (in order): Mean, Background, Aeroplane, Bicycle, Bird, Boat, Bottle, Bus, Car, Cat, Chair, Cow, Dining table, Dog, Horse, Motorbike, Person, Potted plant, Sheep, Sofa, Train, Tv/monitor

(a) Comparison to all other submitted localization methods

NECUIUC-dtct     29.7  81.8  41.9  23.1  22.4  22.0  27.8  43.2  51.8  25.9   4.5  18.5  18.0  23.5  26.9  36.6  34.8   8.8  28.3  14.0  35.5  34.7
UoCTTI           29.0  78.9  35.3  22.5  19.1  23.5  36.2  41.2  50.1  11.7   8.9  28.5   1.4   5.9  24.0  35.3  33.4  35.1  27.7  14.2  34.1  41.8
Ours             15.1  53.2  16.3   7.8   6.8  13.8  10.2  22.8  16.0  16.7   2.4  10.9  14.3   9.8  12.6  15.8  14.2   7.6  10.9  13.4  21.7  20.1
UC3M             14.5  69.8  20.8   9.7   6.3   4.3   7.9  19.7  21.8   7.7   3.8   7.5   9.6   9.5  12.3  16.5  16.4   1.5  14.2  11.0  14.1  20.3
UVA              12.6  12.3  10.9   5.9   5.4  10.7   7.8  36.4  17.6   9.9   4.6  11.7  12.9   7.2  17.8  19.1  16.3   2.0  15.5   6.7  22.2  11.0
MPI-struct       11.0  10.6   9.8   5.1   6.1   7.2  12.0  29.1  17.2   9.6   2.7  12.8   7.7   9.5  11.5  13.4  13.5   5.0  10.2   7.9  15.6  14.3
Oxford           10.9   2.0   9.2   7.7   5.3   6.1  20.1  36.7  18.2   8.5   2.9   6.5   1.4   6.2  10.9  12.2  15.5   4.2   8.9   5.1  20.8  19.7
CASIA            10.3  24.5   8.2   4.3   6.5   4.8  14.7  27.2  10.0   7.0   4.2   6.5   4.9   3.6  10.3  14.5  11.2   8.1  10.6   4.9  13.3  17.6
CVC               9.4   2.3   9.4   6.1   3.4   5.2  13.5  21.8  15.2   6.6   2.2   7.0   1.6   5.6  11.1  11.7  16.8   3.2  13.8   7.3  17.3  16.5
LEAR              8.4   7.1   7.4   4.2   2.9   5.2  15.1  19.7  17.7   5.8   2.3   6.3   2.6   3.7   7.0   6.3   9.7   7.6   9.2   7.5  12.8  16.8
Mizzou            7.5   0.6   1.2   2.5   0.0   0.0  18.4  33.7  13.3   7.6   0.7   2.2   1.1   2.8  13.8  22.7  14.9   0.6  15.5   0.0   0.0   5.5
LEAR-cls          5.9   6.8   4.0   2.1   1.4   1.8   3.7  19.5  11.8   5.3   3.8   0.1   3.6   3.0   2.1   2.8   8.0   0.9   4.4   8.3   6.9  22.8
Mizzou-wo/ctx     5.4   0.6   1.3   2.9   0.0   0.0  22.7   5.0   9.5   3.2   1.3   2.8   0.3   1.4   8.9   7.9  11.9   0.9  10.9   2.8   3.5  14.9
CVC-flat          0.9   2.0   1.7   0.2   0.1   0.3   0.6   0.1   0.2   1.2   0.3   0.8   0.4   0.1   0.2   0.1   1.7   1.1   1.1   2.0   0.2   4.6

(b) Comparison to all other submitted segmentation methods

Bonn             36.3  83.9  64.3  21.8  21.7  32.0  40.2  57.3  49.4  38.8   5.2  28.5  22.0  19.6  33.6  45.5  33.6  27.3  40.4  18.1  33.6  46.1
CVC-HOCRF        34.5  80.2  67.1  26.6  30.3  31.6  30.0  44.5  41.6  25.2   5.9  27.8  11.0  23.1  40.5  53.2  32.0  22.2  37.4  23.6  40.3  30.2
NECUIUC-dtct     29.7  81.8  41.9  23.1  22.4  22.0  27.8  43.2  51.8  25.9   4.5  18.5  18.0  23.5  26.9  36.6  34.8   8.8  28.3  14.0  35.5  34.7
UoCTTI           29.0  78.9  35.3  22.5  19.1  23.5  36.2  41.2  50.1  11.7   8.9  28.5   1.4   5.9  24.0  35.3  33.4  35.1  27.7  14.2  34.1  41.8
NECUIUC-seg      28.3  81.5  39.3  20.9  22.6  21.7  26.1  37.1  51.5  25.2   5.7  17.5  15.7  24.2  27.4  35.3  33.0   7.9  23.4  12.5  32.1  33.3
LEAR-segdet      25.7  79.1  44.6  15.5  20.5  13.3  28.8  29.3  35.8  25.4   4.4  20.3   1.3  16.4  28.2  30.0  24.5  12.2  31.5  18.3  28.8  31.9
BrookesMSRC      24.8  79.6  48.3   6.7  19.1  10.0  16.6  32.7  38.1  25.3   5.5   9.4  25.1  13.3  12.3  35.5  20.7  13.4  17.1  18.4  37.5  36.4
UCI              24.7  80.7  38.3  30.9   3.4   4.4  31.7  45.5  47.3  10.4   4.8  14.3   8.8   6.1  21.5  25.0  38.9  14.8  14.4   3.0  29.1  45.5
Ours             15.1  53.2  16.3   7.8   6.8  13.8  10.2  22.8  16.0  16.7   2.4  10.9  14.3   9.8  12.6  15.8  14.2   7.6  10.9  13.4  21.7  20.1
MPI-A2           15.0  70.9  16.4   8.7   8.6   8.3  20.8  21.6  14.4  10.5   0.0  14.2  17.2   7.3   9.3  20.3  18.2   6.9  14.1   0.0  13.2  13.2
UC3M             14.5  69.8  20.8   9.7   6.3   4.3   7.9  19.7  21.8   7.7   3.8   7.5   9.6   9.5  12.3  16.5  16.4   1.5  14.2  11.0  14.1  20.3
UCLA             13.8  51.2  13.9   7.0   3.9   6.4   8.1  14.4  24.3  12.1   6.4  10.3  14.5   6.7   9.7  23.6  20.0   2.3  12.6  12.3  17.0  13.2


For localization, we have demonstrated that shape masks enrich the localization decisions by revealing additional information about object viewpoint, articulation, sub-type or state. At the same time, the experimental results show that the standard localization performance of our method is superior to the state of the art. We have successfully combined the clustering of the generated hypotheses with a discriminative hypothesis classifier, showing that both elements are necessary for good localization accuracy. Our method performs well on natural images, robustly handling multiple object aspects, significant intra-class variations, occlusions and background clutter.

Future research could focus on improving the generated hypotheses by using edges or image segmentations. We think that, given a good object localization hypothesis, the object segmentation task should be less difficult and good segmentation accuracy easier to achieve. Furthermore, we will explore the possibility of detecting pose or sub-types by annotating the aspects.

Acknowledgements M. Marszałek was supported by a grant from the European Community under the Marie-Curie project VISITOR. This work was supported by the European Network of Excellence PASCAL.

References

Agarwal, S., & Roth, D. (2002). Learning a sparse representation for object detection. In ECCV.

Agarwal, S., Awan, A., & Roth, D. (2004). Learning to detect objects in images via a sparse, part-based representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(11), 1475–1490.

Borenstein, E., & Ullman, S. (2002). Class-specific, top-down segmentation. In ECCV.

Chapelle, O., Haffner, P., & Vapnik, V. (1999). Support vector machines for histogram-based image classification. IEEE Transactions on Neural Networks, 10(5), 1055–1064.

Csurka, G., Dance, C., Fan, L., Willamowski, J., & Bray, C. (2004). Visual categorization with bags of keypoints. In ECCV workshop on statistical learning in computer vision.

Dorkó, G., & Schmid, C. (2003). Selection of scale-invariant parts for object class recognition. In ICCV.

Everingham, M., Zisserman, A., Williams, C., Van Gool, L., et al. (2006). The 2005 PASCAL visual object classes challenge. In Selected proceedings of the first PASCAL challenges workshop.

Everingham, M., Van Gool, L., Williams, C., Winn, J., & Zisserman, A. (2008). Overview and results of the detection challenge. In The PASCAL VOC'08 challenge workshop in conj. with ECCV.

Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2009). The PASCAL visual object classes challenge 2009 (VOC2009) results. http://www.pascal-network.org/challenges/VOC/voc2009/workshop/index.html.

Fergus, R., Perona, P., & Zisserman, A. (2007). Weakly supervised scale-invariant learning of models for visual recognition. International Journal of Computer Vision, 71(3), 273–303.

Fowlkes, C., Belongie, S., Chung, F., & Malik, J. (2004). Spectral grouping using the Nyström method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2), 1–12.

Fritz, M., Leibe, B., Caputo, B., & Schiele, B. (2005). Integrating representative and discriminant models for object category detection. In ICCV.

Fussenegger, M., Opelt, A., & Pinz, A. (2006). Object localization/segmentation using generic shape priors. In ICPR.

Galleguillos, C., Babenko, B., Rabinovich, A., & Belongie, S. (2008). Weakly supervised object localization with stable segmentations. In ECCV.

Gårding, J., & Lindeberg, T. (1996). Direct computation of shape cues using scale-adapted spatial derivative operators. International Journal of Computer Vision, 17(2), 163–191.

Grauman, K., & Darrell, T. (2005). The pyramid match kernel: Discriminative classification with sets of image features. In ICCV.

Gu, C., Lim, J., Arbelaez, P., & Malik, J. (2009). Recognition using regions. In CVPR.

Hayman, E., Caputo, B., Fritz, M., & Eklundh, J.-O. (2004). On the significance of real-world conditions for material classification. In ECCV.

Jing, F., Li, M., Zhang, H. J., & Zhang, B. (2003). Support vector machines for region-based image retrieval. In ICME.

Lazebnik, S., Schmid, C., & Ponce, J. (2005). A maximum entropy framework for part-based texture and object recognition. In ICCV.

Leibe, B., Seemann, E., & Schiele, B. (2005). Pedestrian detection in crowded scenes. In CVPR.

Leibe, B., Leonardis, A., & Schiele, B. (2008). Robust object detection with interleaved categorization and segmentation. International Journal of Computer Vision, 77(1–3), 259–289.

Li, L. J., Socher, R., & Fei-Fei, L. (2009). Towards total scene understanding: classification, annotation and segmentation in an unsupervised framework. In CVPR.

Lindeberg, T. (1998). Feature detection with automatic scale selection. International Journal of Computer Vision, 30(2), 79–116.

Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.

Lyu, S. (2005). Mercer kernels for object recognition with local features. In CVPR.

Marr, D. (1982). Vision. New York: Freeman.

Marszałek, M., & Schmid, C. (2006). Spatial weighting for bag-of-features. In CVPR.

Marszałek, M., & Schmid, C. (2007). Accurate object localization with shape masks. In CVPR.

Mikolajczyk, K., & Schmid, C. (2004). Scale and affine invariant interest point detectors. International Journal of Computer Vision, 60(1), 63–86.

Opelt, A., & Pinz, A. (2005). Object localization with boosting and weak supervision for generic object recognition. In SCIA.

Opelt, A., Fussenegger, M., Pinz, A., & Auer, P. (2004a). Generic object recognition with boosting. Tech. rep. TR-EMT-2004-01, TU Graz.

Opelt, A., Fussenegger, M., Pinz, A., & Auer, P. (2004b). Weak hypotheses and boosting for generic object detection and recognition. In ECCV.

Opelt, A., Pinz, A., Fussenegger, M., & Auer, P. (2006). Generic object recognition with boosting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(3), 416–431.

Peterson, M. (1994). Object recognition processes can and do operate before figure-ground organization. Current Directions in Psychological Science, 3, 105–111.

Ramanan, D. (2007). Using segmentation to verify object hypotheses. In CVPR.

Rothganger, F., Lazebnik, S., Schmid, C., & Ponce, J. (2003). 3D object modeling and recognition using affine-invariant patches and multi-view spatial constraints. In CVPR.

Rowley, H., Baluja, S., & Kanade, T. (1998). Neural networks based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1), 22–38.


Rubner, Y., Tomasi, C., & Guibas, L. (2000). The Earth Mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2), 99–121.

Russell, B., Efros, A., Sivic, J., Freeman, W., & Zisserman, A. (2006). Using multiple segmentations to discover objects and their extents in image collections. In CVPR.

Schölkopf, B., & Smola, A. (2002). Learning with kernels: support vector machines, regularization, optimization and beyond. Cambridge: MIT Press.

Seemann, E., & Schiele, B. (2006). Cross-articulation learning for robust detection of pedestrians. In DAGM.

Seemann, E., Leibe, B., & Schiele, B. (2006). Multi-aspect detection of articulated objects. In CVPR.

Shotton, J., Blake, A., & Cipolla, R. (2005). Contour-based learning for object detection. In ICCV.

Shotton, J., Johnson, M., & Cipolla, R. (2008). Semantic texton forests for image categorization and segmentation. In CVPR.

Sivic, J., & Zisserman, A. (2003). Video Google: a text retrieval approach to object matching in videos. In ICCV.

Sivic, J., Russell, B., Efros, A., Zisserman, A., & Freeman, W. (2005). Discovering objects and their location in images. In ICCV.

Thomas, A., Ferrari, V., Leibe, B., Tuytelaars, T., Schiele, B., & Van Gool, L. (2006). Towards multi-view object class detection. In CVPR.

Todorovic, S., & Ahuja, N. (2006). Extracting subimages of an unknown category from a set of images. In CVPR.

Vecera, S. (1998). Figure-ground organization and object recognition processes: an interactive account. Journal of Experimental Psychology. Human Perception and Performance, 24(2), 441–462.

Viola, P., & Jones, M. (2004). Robust real-time object detection. International Journal of Computer Vision, 57(2), 137–154.

Winn, J., & Jojic, N. (2005). LOCUS: learning object classes with unsupervised segmentation. In ICCV.

Wu, B., & Nevatia, R. (2007). Simultaneous object detection and segmentation by boosting local shape feature based classifier. In CVPR.

Yu, S., & Shi, J. (2003). Object-specific figure-ground segregation. In CVPR.

Zhang, J., Marszałek, M., Lazebnik, S., & Schmid, C. (2007). Local features and kernels for classification of texture and object categories: a comprehensive study. International Journal of Computer Vision, 73(2), 213–238.