
372 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 12, NO. 5, AUGUST 2010

An Efficient and Robust Algorithm for Shape Indexing and Retrieval

Soma Biswas, Graduate Student Member, IEEE, Gaurav Aggarwal, Student Member, IEEE, and Rama Chellappa, Fellow, IEEE

Abstract—Many shape matching methods are either fast but too simplistic to give the desired performance or promising as far as performance is concerned but computationally demanding. In this paper, we present a very simple and efficient approach that not only performs almost as well as many state-of-the-art techniques but also scales up to large databases. In the proposed approach, each shape is indexed based on a variety of simple and easily computable features which are invariant to articulations, rigid transformations, etc. The features characterize pairwise geometric relationships between interest points on the shape. The fact that each shape is represented using a number of distributed features instead of a single global feature that captures the shape in its entirety provides robustness to the approach. Shapes in the database are ordered according to their similarity with the query shape and similar shapes are retrieved using an efficient scheme which does not involve costly operations like shape-wise alignment or establishing correspondences. Depending on the application, the approach can be used directly for matching or as a first step for obtaining a short list of candidate shapes for more rigorous matching. We show that the features proposed to perform shape indexing can be used to perform the rigorous matching as well, to further improve the retrieval performance.

To illustrate the computational and performance advantages of the proposed approach, extensive experiments have been performed on several challenging problems that involve matching shapes. We also highlight the effectiveness of the approach to perform robust and efficient shape matching in real images and videos for different applications like human pose estimation and activity classification.

Index Terms—Fast retrieval, indexing, shape matching.

I. INTRODUCTION

NUMEROUS applications of shape matching and recognition have made it a very important area of research in the field of computer vision (see Fig. 1). Character recognition, trademark logo retrieval, activity recognition, object recognition, and human pose estimation are a few of the challenging applications that can benefit from accurate and efficient shape matching techniques. Different applications require different representations and hence different matching algorithms to handle the large variations in shapes. Also, with the recent advancement in technology and the availability of different kinds of sensors, the amount of data to be handled has increased tremendously over the last few decades. So even though research in the area of shape matching has matured, the challenges involved in achieving high performance in terms of both accuracy and computational complexity continue to interest researchers.

Manuscript received September 13, 2009; revised February 11, 2010; accepted April 14, 2010. Date of publication May 20, 2010; date of current version July 16, 2010. This work was supported in part by UNISYS, in part by NSF-ITR Grant 03-25119, and in part by ONR MURI Grant N00014–08–1–0638. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. James Z. Wang.

S. Biswas was with the Center for Automation Research and the Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742 USA, and is now with the Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556 USA (e-mail: [email protected]).

G. Aggarwal was with the Center for Automation Research and Department of Computer Science, University of Maryland, College Park, MD 20742 USA, and is now with the Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556 USA (e-mail: [email protected]).

R. Chellappa is with the Center for Automation Research and the Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TMM.2010.2050735

Fig. 1. A few applications that can benefit from robust and efficient shape matching. (a) Matching and retrieval of 2-D shapes [1], like trademark retrieval [2], leaf recognition [3], etc. (b) Activity classification [4]. (c) Gesture recognition. (d) Pose estimation in sports clips [5].

Shapes show a great deal of intra-class variations including rotations, translations, articulations, missing portions, and other inexplicable deformations which make the problem of shape matching quite challenging. Errors in extracting shapes from input images or videos further add to the complexity. Matching shapes across complex deformations has been the main focus of most works in recent times. Many existing shape matching algorithms require computationally demanding matching schemes to be able to handle the aforementioned variations, making them not so effective for large databases. On the other hand, much research has also been focused on efficient retrieval of shapes. But many of these approaches are not designed to handle complex deformations like articulations of part structures. In contrast, we propose an indexing system for fast and robust matching and retrieval of shapes across both rigid and nonrigid transformations. We envisage a shape matching system which can efficiently scale to large databases without compromising on the retrieval performance obtained by state-of-the-art shape matching algorithms.

1520-9210/$26.00 © 2010 IEEE

We model a shape as a collection of landmark points arranged in a plane (2-D) or in 3-D space. In our approach, each shape is characterized by features that are used to index it to a table. The table is analogous to the inverted page table used to index web pages using words/phrases. Given a test shape, similar ones from a pre-indexed collection are determined based on its characterizing features. The computational overhead (of establishing point-wise correspondence) involved in the traditional way of matching the query with each shape in the dataset is thereby avoided.

As we deal with shapes, the only information usually available is the underlying geometry. Appropriate features are chosen to encode this geometry as richly as possible, without compromising on robustness. Quite clearly, the set of useful features varies depending on the particular application at hand. For example, invariance to articulations of part structures is very important in applications like gait-based human identification whereas the same feature is not desired for applications like retrieval based on human pose. Our goal here is to develop a system that supports fast retrieval of shapes without needing any costly correspondence step during matching. To this end, we use (or propose) features that address most challenges faced by shape matching tasks including invariance to object translation, rotation, scale, articulations, etc. In the proposed indexing framework, a given shape is represented using a collection of feature vectors, each characterizing a geometrical relationship between a pair of landmark points. The features should be easily computable for the matching algorithm to be efficient and to be able to scale up to large database sizes. For each landmark pair, depending on the application, all or a subset of the following geometrical characteristics are encoded in the corresponding feature vector.

1) distance between the points. This can be the inner distance [3] or the standard Euclidean distance, depending on whether or not articulation-invariance is desired;

2) relative angles between the line segment joining the two points and tangents to the contour at the points;

3) contour distance between the points (analogous to geodesic distance in case of 3-D shapes);

4) distances of the points from the center of mass. For applications requiring invariance to articulations, we propose an articulation-invariant center of mass which is analogous to the standard center of mass with the added feature of being invariant to articulations.

Clearly, more suitable (indexable) features can be easily added to this list to make the representation richer. The feature vectors are suitably quantized for indexing. The fact that feature vectors depend only on a few points and are quantized provides the necessary robustness required to generalize across large intra-class variations. As shown by the results, the matching speed and ability to generalize do not come at the cost of discriminability across shapes.

Since all the desired characteristics of the shape matching algorithm like invariance to rotation, articulation, etc. are incorporated in the feature vectors themselves, this kind of representation allows the proposed system to have a very simple and efficient retrieval scheme. Given a test shape, the matching bins in the index table are determined. A single parse through the matching bins returns the most similar shapes. This does not require any alignment or correspondences, making it extremely fast and scalable. Depending on the requirements of the application, the top matches returned by the single parse retrieval algorithm may directly be used as the similar shapes or there may be a need to further compare the query against the top few matches using a more rigorous algorithm to refine their ordering. Such a refinement stage will typically be more computationally expensive as compared to the proposed retrieval algorithm, but this will need to be performed only for a few top matches instead of the whole shape database, which makes such a two-stage scheme computationally attractive. In this paper, we show how we can use the same set of features to do a more rigorous matching on the short-list candidates returned by the indexing phase to further improve the matching performance.

The proposed algorithm has been rigorously tested for several shape matching applications. We first evaluate the approach by providing performance comparisons with several existing methods using standard shape silhouette datasets like the MPEG7 shape dataset [1], the articulation dataset [3], Kimia (1 and 2) datasets [6], [7], and ETH-80 object database [8]. The computational advantage obtained using the proposed approach is also highlighted. In addition, we perform experiments on human pose estimation and activity classification on challenging datasets.

A. Organization of the Paper

The rest of the paper is organized as follows. Section II discusses some of the related works. Section III introduces the indexing framework proposed in the paper. Section IV describes the indexable shape representation. A detailed description of the indexing and retrieval algorithms is given in Section V. The details of the refinement algorithm to re-rank top matches returned by the indexing system are presented in Section VI. Section VII presents the results of extensive evaluations done to compare the proposed algorithm with others. Some real applications of shape matching are shown in Section VIII. The paper concludes with a summary and discussion. A preliminary version of this work was reported in [9].

II. PREVIOUS WORK

The problem of shape matching has been around for quite some time, probably due to its universality. Though significant advancements have been made, the demands on computational efficiency and accuracy continue to interest researchers. In this section, we discuss some of the previous efforts that are related to the approach proposed in this paper. The various approaches for shape matching in the literature have their respective advantages and limitations with respect to the kind of input they can handle, computational complexity, etc.

A. Related Work on Shape Matching

Shape context-based matching [2] has been the theme of several recent works [10]–[13] on shape matching. In the original version [2], each point is characterized by the spatial distribution of the other points relative to it. Similarity computation involves establishing correspondences using bipartite graph matching and thin plate spline (TPS)-based alignment. The shape context framework has since been extended in various ways to suit different requirements of the shape matching problem. Mori and Malik [10] propose using statistics of the tangent vectors along with the point counts to perform object recognition in clutter. A figural continuity constraint has been incorporated to yield reliable correspondences in cluttered scenes [12]. Tu and Yuille [11] incorporate softassign [14] in a shape context framework [11] for shape matching. After aligning the shapes using the correspondence given by shape context, Daliri and Torre [15] transform each contour into a string of symbols which is then matched using a modified edit distance. A recent extension by Ling and Jacobs [3] accounts for movement of part structures, by replacing the Euclidean distance in the classical version with inner distance, which is robust to articulations.

McNeill and Vijayakumar [16] propose the hierarchical Procrustes matching algorithm which generalizes the idea of finding a point-to-point correspondence between two shapes to that of finding a segment-to-segment correspondence. In another recent work, Felzenszwalb and Schwartz [17] use a new hierarchical representation called shape tree for two-dimensional objects that captures shape information at multiple levels of resolution. Peter et al. [18] represent point-set shapes as the square root of probability densities expanded in the wavelet basis and use a linear assignment solver to account for nonrigid transformations prior to matching. There is another body of work for capturing part structures in which shapes are represented using shock graphs [6], [19]. The shock graph grammar helps to reduce the shock graph representation to a unique rooted shock tree which is then matched using a tree matching algorithm. To handle shape deformations, Sebastian et al. [7] propose finding the optimal deformation path of shock graphs that brings the two graphs (shapes) into correspondence. Many of the approaches discussed above require finding correspondence between points/curve segments of two shapes which usually requires computationally expensive methods. Readers are referred to several other interesting approaches for matching shapes [20]–[26].

B. Related Work on Efficient Matching and Indexing

Fast nearest neighbor searches in Euclidean space for finding closest points in metric spaces have a rich history [27]. Due to the tremendous increase in the amount of data that needs to be handled, indexing techniques are becoming increasingly popular for the development of fast retrieval algorithms for documents, images, etc. The indexing approach used in the paper is inspired by the work on fingerprint indexing using minutiae triangles as features [28]. Unlike classical geometrical hashing [29], the triangle-based approach hashes a set of points based on local invariants (which depend only on three minutiae, though these need not be spatially local), which is more robust and leads to faster retrieval. For fast matching and retrieval of images, a vocabulary tree-based representation has been recently proposed by Nister and Stewenius [30]. Similar to their approach, our indexing system relies on invariant and robust shape representation, to make the retrieval process extremely fast. In [31], Mori et al. propose solutions to improve the computational efficiency of shape contexts-based approaches. They show how pruning and vector quantization techniques can be utilized to make shape context useful for large databases.

Another approach for fast shape matching is to reduce the shape matching problem to the comparison of probability distributions, which does not require pose registration, feature correspondence, or model fitting. Osada et al. [32] use shape distributions sampled from a shape function and measure global geometric properties of an object for fast matching of 3-D models. Ohbuchi et al. [33] use joint 2-D histogram of distance and orientation of pairs of points for improved performance. Hamza and Krim [34] use geodesic shape distribution that measures the global geodesic distance between two arbitrary points on the surface to be able to better capture the (nonlinear) intrinsic geometric structure of the data. The idea of describing 3-D models using distance between pairs of points and/or their mutual orientations has also appeared in [35] and [36]. Apart from these, other approaches have also been proposed which focus on efficient shape matching [37]–[41].

Existing shape matching methods can also be classified based on the kind of input they require. Some methods require the shape to be represented as a closed contour [3], [24] while some others are more flexible in the kind of input they can work with and just require a set of points as their input [2], [18]. Our approach falls in the first category, but has the advantage of being efficient while being able to handle complex deformations like articulations of part structures.

III. INDEXING FRAMEWORK—A GLANCE

In many of the existing approaches, a query needs to be compared with every shape in the dataset to return the most similar ones and the comparisons often involve computationally demanding operations like registration, establishing correspondence, etc. Since for each query, these costly operations have to be repeated for each shape in the database, the computational load can become prohibitively high as the size of the database increases. Our goal is to come up with a fast and efficient framework for shape indexing and retrieval that can perform robust shape matching.

Fig. 2 illustrates a prototype of our shape indexing framework. In the proposed approach, a shape is represented using a set of indexable feature vectors which are appropriately mapped to a hash table. For a shape $S_i$, a bin $b$ in the hash table stores an entry $(S_i, n_i^b)$, where $n_i^b$ is the number of feature vectors from shape $S_i$ that get hashed to bin $b$. This is repeated for each shape in the database and the hash table is populated. Thus, typically, each bin of the hash table has several 2-tuples corresponding to the different shapes. The quantization scheme determines how uniformly the entries are distributed across the hash table.

For a query shape $Q$, the feature vectors are extracted and its hash table entries $(Q, n_Q^b)$ are determined as done for the case of the database shapes. Then a single parse through the set of matching bins that contain a 2-tuple $(Q, n_Q^b)$ determines its similarity with all the shapes in the database. In such a retrieval scheme, the processing time depends only on the number of 2-tuples $(Q, n_Q^b)$ and the number of database entries in the matching bins. So the more uniformly distributed the hash table is, the lower the average time required to process a query. Typically, the processing time increases much more slowly than the database size. The details of the algorithm are described later in Section V.

Fig. 2. Prototype of the proposed shape indexing framework. Each shape in the database is indexed to a hash table using a set of indexable feature vectors extracted from the shape.
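The bin-and-count structure described in this section can be sketched as a small inverted index. This is only an illustrative sketch: the function names `build_index` and `retrieve` are hypothetical, the bin keys are assumed to be already-quantized feature tuples, and the per-bin score `min(n_query, n_shape)` is an assumed stand-in, since the actual similarity measure is described later in Section V.

```python
from collections import Counter, defaultdict


def build_index(shape_bins):
    """Map each bin key to a list of (shape_id, count) 2-tuples.

    shape_bins: dict of shape_id -> iterable of hashable bin keys,
    one key per quantized feature vector of that shape.
    """
    table = defaultdict(list)
    for shape_id, bins in shape_bins.items():
        for bin_key, count in Counter(bins).items():
            table[bin_key].append((shape_id, count))
    return table


def retrieve(table, query_bins, top_k=5):
    """Single pass over the bins hit by the query; no alignment step."""
    scores = defaultdict(int)
    for bin_key, n_query in Counter(query_bins).items():
        for shape_id, n_shape in table.get(bin_key, ()):
            # Assumed score: number of feature vectors shared in this bin.
            scores[shape_id] += min(n_query, n_shape)
    # Highest score first; ties broken by shape id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
```

Only bins that the query actually hits are visited, which is why the cost tracks the number of entries in the matching bins rather than the size of the database.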

IV. SHAPE REPRESENTATION

In this section, we describe suitable features that seamlessly integrate with the proposed indexing framework. To ensure that the single pass retrieval algorithm directly returns the most similar shapes, the features, in addition to being indexable, should be invariant to different rigid and nonrigid transformations as required by the application at hand. The choice of features affects both the generalizability and discriminability of the approach. Here we use features that depend only on a few points on the shape and also take the global shape into account. The dependence on only a few points ensures robustness while their relative configuration with respect to the global shape provides discriminability. Complexity of a typical matching algorithm depends on the complexity of the type of transformations that need to be handled, which in turn depends on the application. Articulation of part structures being one of the most difficult kinds of deformations addressed by several recent shape matching techniques, we describe representative features that are invariant to articulations in addition to rigid transformations.

A. Pairwise Geometrical Features

Following these guidelines, each shape is characterized by a set of feature vectors where each vector encodes pairwise geometrical relationships on the shape. Each vector consists of the following features that are robust to different deformations.

1) Inner Distance Between Two Points: The Euclidean distance between two interest points is invariant to rigid transformations of the shapes and is useful for applications where it is required to preserve articulation-dependent discriminability. But even small articulations can change the Euclidean distance significantly for several point-pairs on the shape. Therefore, for applications requiring invariance to articulations, we use the inner distance (ID) [3] which is robust to articulations of part structures. The inner distance between two points is the length of the shortest path within the silhouette of the shape. Fig. 3 (left) illustrates how the inner distance differs from the standard Euclidean one.

Fig. 3. Inner distance and relative angles. The two human silhouettes on the left show the insensitivity of inner distance with articulation of part structures.

Fig. 4. Contour distance. The shown shapes illustrate the insensitivity of contour distance to length-preserving deformations.

Computation of inner distance involves forming a graph with landmark points on the shape forming the nodes. Two nodes in this graph are connected if there is a straight line path between the corresponding points which is completely inside the shape contour. The corresponding edge weight is the Euclidean distance between the two. From this graph, any standard shortest path algorithm can be used to compute the inner-distance for all the unconnected nodes.
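The graph construction above can be sketched as follows, assuming the landmarks are the ordered vertices of a simple closed polygon. The helper names are hypothetical, the visibility test (no proper edge crossing, plus an inside test on each sub-segment between collinear vertices) is one possible choice, and Floyd–Warshall is used simply as one standard all-pairs shortest path algorithm.

```python
import math

EPS = 1e-9


def _cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def _properly_cross(p, q, a, b):
    """True if segments pq and ab cross at a point interior to both."""
    return (_cross(p, q, a) * _cross(p, q, b) < 0
            and _cross(a, b, p) * _cross(a, b, q) < 0)


def _inside(pt, poly):
    """Even-odd ray casting; points on the boundary count as inside."""
    n, (x, y), inside = len(poly), pt, False
    for k in range(n):
        (x1, y1), (x2, y2) = poly[k], poly[(k + 1) % n]
        on_line = abs(_cross((x1, y1), (x2, y2), pt)) <= EPS
        in_box = (min(x1, x2) - EPS <= x <= max(x1, x2) + EPS
                  and min(y1, y2) - EPS <= y <= max(y1, y2) + EPS)
        if on_line and in_box:
            return True
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside


def visible(p, q, poly):
    """True if the straight segment pq stays inside the closed polygon."""
    n = len(poly)
    dx, dy = q[0] - p[0], q[1] - p[1]
    ts = [0.0, 1.0]
    for k in range(n):
        a, b = poly[k], poly[(k + 1) % n]
        if _properly_cross(p, q, a, b):
            return False
        if abs(_cross(p, q, a)) <= EPS:  # vertex a lies on the line pq
            t = ((a[0] - p[0]) * dx + (a[1] - p[1]) * dy) / (dx * dx + dy * dy)
            if 0.0 < t < 1.0:
                ts.append(t)
    ts.sort()
    # Every sub-segment between touching vertices must lie inside.
    return all(_inside((p[0] + 0.5 * (t0 + t1) * dx,
                        p[1] + 0.5 * (t0 + t1) * dy), poly)
               for t0, t1 in zip(ts, ts[1:]))


def inner_distances(poly):
    """All-pairs inner distances between the polygon's landmark points."""
    n = len(poly)
    d = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 0.0
        for j in range(i + 1, n):
            if visible(poly[i], poly[j], poly):
                d[i][j] = d[j][i] = math.dist(poly[i], poly[j])
    for k in range(n):  # Floyd-Warshall relaxation
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d
```

For a U-shaped polygon, the two tips are close in the Euclidean sense, but the inner distance between them is the length of the path around the notch. The cubic cost of Floyd–Warshall is acceptable for on the order of a hundred landmarks; a per-source Dijkstra would be a natural alternative for larger point sets.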

2) Relative Angles: Relative angles (A1 and A2) encode the angular relationship between a pair of points. Since absolute orientation of the line segment connecting the points is not invariant to rotations, we use the relative orientation of the connecting line segment with respect to the incident tangents at each end point. If the inner distance is used, this is the relative orientation of the first segment of the path corresponding to the inner distance (see Fig. 3, right).
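A minimal sketch of the Euclidean case is given below. It assumes the tangent at a landmark can be approximated by the central difference of its two contour neighbors and that signed angles are acceptable; both choices, and the function name `relative_angles`, are illustrative assumptions rather than details taken from the paper. The inner-distance variant would replace the chord with the first segment of the shortest inner path.

```python
import math


def _signed_angle(u, v):
    """Signed angle from vector u to vector v, in (-pi, pi]."""
    return math.atan2(u[0] * v[1] - u[1] * v[0], u[0] * v[0] + u[1] * v[1])


def relative_angles(points, i, j):
    """Angles (A1, A2) between the chord i-j and the tangents at i and j.

    points: ordered landmarks on a closed contour.
    """
    n = len(points)

    def tangent(k):
        # Central-difference approximation of the contour tangent.
        a, b = points[(k - 1) % n], points[(k + 1) % n]
        return (b[0] - a[0], b[1] - a[1])

    pi_, pj = points[i], points[j]
    chord = (pj[0] - pi_[0], pj[1] - pi_[1])
    a1 = _signed_angle(tangent(i), chord)
    a2 = _signed_angle(tangent(j), (-chord[0], -chord[1]))
    return a1, a2
```

Because both the tangent and the chord rotate with the shape, the angle between them is unchanged by a rigid rotation, which is the property the text relies on.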

3) Contour Distance: The contour distance (CD) is analogous to geodesic distance for 3-D shapes and captures the relative positions of the two points with respect to the entire shape contour. The contour distance between two points for 2-D silhouettes is simply the length of the contour between the two points. The distance is robust to both articulations and contour length preserving deformations and complements inner distance in characterizing the relative location of the point pair with respect to the entire shape. Fig. 4 shows the contour distance between two points of an object across several deformations.
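As a rough sketch, the contour distance between two landmarks can be approximated by the length of the polyline through the intermediate landmarks. A closed contour offers two arcs between any pair of points; taking the shorter one is an assumption made here, as is the function name `contour_distance`.

```python
import math


def contour_distance(points, i, j):
    """Polyline arc length between landmarks i and j on a closed contour."""
    n = len(points)
    # Length of each contour segment k -> k+1 (wrapping around at the end).
    seg = [math.dist(points[k], points[(k + 1) % n]) for k in range(n)]
    total = sum(seg)
    lo, hi = sorted((i, j))
    arc = sum(seg[lo:hi])
    # Assumption: use the shorter of the two arcs joining the points.
    return min(arc, total - arc)
```

With densely and uniformly sampled landmarks the polyline length is a close approximation of the true contour length, and it does not change when the contour is bent without stretching.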

4) Articulation-Invariant Center of Mass: The features described so far depend on the entire shape, but none of them capture much information about the relative placement of various point pairs in the shape. Though robust, such a representation may not be able to provide the desired level of discriminability. For matching across rigid transformations, the distance of the points and the line segment joining them from the center of mass can be used as additional features to encode their relative placement. Clearly, since the center of mass can change appreciably with articulations, these features are not invariant to articulations. We propose an articulation-insensitive alternative to the traditional center of mass if invariance to articulation is required.

Fig. 5. Articulation-invariant center of mass. Row 1: Original shapes. Row 2: Transformed shapes after MDS.

We first describe how the location of articulation invariant center of mass is determined followed by a description of the features derived from it. Directly determining such a point is not easy. The proposed approach first transforms a given shape to an articulation-invariant space. All objects related by articulations of their part structures get transformed to the same shape in the new space. This essentially means that the distances between the transformed points are invariant to articulations.

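The transformation is done using multidimensional scaling (MDS) [42]. MDS essentially places the points in a new Euclidean space such that the inter-point distances are as close as possible to the given inner distances in a collective manner. We use the classical MDS as opposed to other more accurate but iterative algorithms for efficiency. The transformation computation involves spectral decomposition of the inner product matrix $B$, which is related to the (squared) inner-distance matrix $D$ as follows:

$$B = -\tfrac{1}{2} J D J, \qquad J = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{T} \tag{1}$$

where $J$ is the centering matrix and $n$ is the number of points. The matrix $B$ is symmetric, positive semidefinite and can be expressed as

$$B = U \Lambda U^{T} \tag{2}$$

where $\Lambda$ is the diagonal matrix of eigenvalues and the columns of $U$ are the corresponding eigenvectors. The required transformed coordinates in an $m$-dimensional output space can be obtained by

$$X = \Lambda_m^{1/2} U_m^{T} \tag{3}$$

where $\Lambda_m$ contains the $m$ largest eigenvalues and $U_m$ the corresponding eigenvectors.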

Fig. 5 shows the result of performing MDS on a few shapes. As desired, the transformed shapes (Fig. 5, second row) look quite similar across articulations. Here the dimensionality of the output space is taken to be two for visualization. The approximation improves with the dimensionality of the output space. The desired articulation-invariant center of mass is the center of mass of the transformed shape.
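The classical MDS step described above can be sketched with NumPy as follows. This is an illustrative sketch, not the authors' implementation: the function name `classical_mds` is hypothetical, the input is assumed to be a precomputed matrix of pairwise inner distances, and small negative eigenvalues are clipped to zero as a practical safeguard.

```python
import numpy as np

def classical_mds(dist, m=2):
    """Embed n points in m dimensions from an n x n distance matrix."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    b = -0.5 * j @ d2 @ j                    # inner-product (Gram) matrix
    w, v = np.linalg.eigh(b)                 # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:m]            # indices of the m largest
    w_top = np.clip(w[idx], 0.0, None)       # guard against small negatives
    return v[:, idx] * np.sqrt(w_top)        # n x m array, one row per point
```

Because of the double centering, the embedded points returned by this sketch have zero mean, so the centroid of the transformed landmark points sits at the origin of the new space.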

Given the articulation-invariant center of mass of a shape, we derive features which capture the relative positioning of the point pairs. For each point pair, distances (D1, D2, D3) of the points and the line segment joining them from the estimated center of mass are computed. This is done in the transformed space itself as the distances in the transformed space are insensitive to articulations.
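A small sketch of the three distances for one point pair is given below. The function name `center_features` is hypothetical, and D3 is computed here as the distance from the center to the closest point on the segment (rather than to the infinite line through the two points), which is an assumption of this sketch.

```python
import numpy as np

def center_features(p, q, c):
    """Distances of points p, q and of segment pq from the center c."""
    p, q, c = (np.asarray(v, dtype=float) for v in (p, q, c))
    d1 = np.linalg.norm(p - c)
    d2 = np.linalg.norm(q - c)
    pq = q - p
    denom = pq @ pq
    # Parameter of the closest point on the segment, clamped to [0, 1].
    t = 0.0 if denom == 0 else np.clip((c - p) @ pq / denom, 0.0, 1.0)
    d3 = np.linalg.norm(p + t * pq - c)
    return d1, d2, d3
```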

TABLE I. NUMBER OF QUANTIZATION BITS FOR THE USED FEATURES

B. Bag of Features

Given a shape, the pairwise geometrical features are computed for each pair of landmark points on the shape. Here, each point pair is characterized by a seven-dimensional feature vector comprising the features described above. The distance-based features in the vector are made robust to variations in scale by normalizing each by its median. Note that here, we provide a basic set of features that are robust to rigid transformations and articulations of part structures. The exact choice of the set of features may depend upon the application at hand. The collection of such feature vectors for all pairs of landmark points characterizes the shape. In all the experiments, we have used 100 landmark points sampled uniformly on the contour for each shape. The inner points of the shape boundary (if present) have not been considered.
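The median normalization mentioned above can be sketched as follows. The function name `median_normalize` is hypothetical, and the array is assumed to hold only the distance-based columns of the feature vectors of one shape, one row per point pair.

```python
import numpy as np

def median_normalize(features):
    """features: n_pairs x d array of distance-based features for one shape."""
    features = np.asarray(features, dtype=float)
    med = np.median(features, axis=0)
    med[med == 0] = 1.0   # avoid division by zero for a column with zero median
    return features / med
```

Scaling the whole shape by a constant factor scales every distance and its median by the same factor, so the normalized values do not change.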

V. INDEXING AND RETRIEVAL OF SHAPES

In this section, we describe in detail the shape indexing and retrieval algorithm using the proposed representation. Hashing the feature vectors of each shape to the index table requires discretization of the space of feature vectors. Here, we quantize each dimension of the vector independently using a suitably chosen number of levels for each. Suppose $f = (f_1, \ldots, f_7)$ denotes the seven-dimensional feature vector. If the number of quantization levels for feature $f_i$ is given by $n_i$, then $b_i = \log_2 n_i$ bits are required to represent the feature. So each feature vector consisting of seven features is represented using $b = \sum_{i=1}^{7} b_i$ bits. There are $2^{b}$ possible combinations of the feature vectors, and hence, any vector can belong to one of the $2^{b}$ bins in the hash table. Though the appropriate number of bits assigned to each feature may vary depending on the application, Table I shows the typical number of bits assigned to each feature in our system.

The quantization boundaries for each feature are chosen such that there are almost the same number of feature vectors in each bin. This is done for each of the seven features independently by using a set of training shapes which are representative of the database. In all our experiments, we use roughly 10% of the dataset as training shapes to determine these boundaries. In addition to being the basic requirement of an indexing system, quantization provides robustness to variations in actual values of the features across different instances of the same shape.
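One way to realize such equal-frequency quantization is to place the boundaries at quantiles of the training values and then pack the per-feature codes into a single integer key. The sketch below is illustrative only: the names `fit_boundaries` and `hash_key` are hypothetical, and the bit allocation passed in `bits` is a free parameter, not the allocation of Table I.

```python
import numpy as np

def fit_boundaries(train, bits):
    """Equal-frequency bin edges per feature from training vectors (n x d)."""
    train = np.asarray(train, dtype=float)
    edges = []
    for col, b in zip(train.T, bits):
        levels = 2 ** b
        # Interior quantiles split the training values into `levels` groups.
        qs = np.arange(1, levels) / levels
        edges.append(np.quantile(col, qs))
    return edges

def hash_key(vec, edges, bits):
    """Quantize one feature vector and pack the codes into a single integer."""
    key = 0
    for x, e, b in zip(vec, edges, bits):
        code = int(np.searchsorted(e, x, side="right"))  # in [0, 2**b - 1]
        key = (key << b) | code
    return key
```

With bit counts summing to a total of b bits, every key falls in the range 0 to 2^b - 1, so the key can directly address a bin of the hash table.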

A. Indexing

Fig. 2 illustrates the overall indexing procedure. The steps in the indexing are described below in detail.

1) For each shape in the database, landmark points are extracted from the shape contour. Though one can judiciously choose these points, we simply pick points uniformly on the shape contours.

2) For each pair of landmark points, features are computed as described in Section IV. This results in a collection of feature vectors for each shape. If there are $N$ landmark points, we have $\binom{N}{2} = N(N-1)/2$ feature vectors.

Fig. 6. (Left) Retrieval algorithm. (Right) Post-retrieval rank refinement to improve accuracy.

3) Each feature vector is quantized using the proposed quantization scheme.

4) The quantized feature vectors are mapped onto the appropriate bins in the hash table. The $j$th bin contains 2-tuples of the form $(S_i, n_{ij})$, where $S_i$ is the $i$th shape in the database and $n_{ij}$ denotes the number of feature vectors of shape $S_i$ that hash to bin $j$.
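The indexing steps above can be sketched with a dictionary that maps each bin to per-shape counts. The function name `build_index` is hypothetical, and the hash keys are assumed to have been computed already by whatever quantization scheme is in use.

```python
from collections import Counter, defaultdict

def build_index(shape_keys):
    """shape_keys: {shape_id: iterable of hash keys, one per feature vector}."""
    table = defaultdict(dict)          # bin key -> {shape_id: count}
    for shape_id, keys in shape_keys.items():
        for key, count in Counter(keys).items():
            table[key][shape_id] = count
    return table
```

Only bins that actually receive a feature vector are stored, so the memory footprint depends on the number of occupied bins rather than on the total number of possible bins.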

B. Retrieval

Given a query shape, the aim is to retrieve the similar shapes in the database as efficiently as possible. Fig. 6 illustrates the retrieval phase using a flow chart. The different steps involved in the retrieval phase are enumerated below.

1) Feature vectors for the query shape are extracted in a manner similar to the one used for indexing.

2) Each vector is quantized using the same quantization steps as used for the shapes enrolled in the database.

3) Hashing each feature vector to the index table results in a list of matching bins $\{(b_k, q_k)\}_{k=1}^{K}$, where $q_k$ is the number of query feature vectors which hash to bin $b_k$. In general, the number of matching bins $K$ is much less than the total number of bins in the hash table.

4) The distance $d(Q, S_i)$ of the query $Q$ with each shape $S_i$ in the database is initialized to zero.

5) Now we parse through the list and update the distance of the query with each enrolled shape at every step using the following distance metric:

$$d(Q, S_i) \leftarrow d(Q, S_i) + \frac{(q_k - n_{ik})^2}{q_k + n_{ik}} \tag{4}$$

where the shape $S_i$ has an entry $(S_i, n_{ik})$ in the $k$th matching bin, i.e., $n_{ik}$ feature vectors of $S_i$ hash to that bin. If there is no such entry for a shape in the bin, $n_{ik}$ is taken to be zero. The choice of distance metric is inspired by the standard $\chi^2$ statistic.

6) If, during parsing, the distance for any particular shape in the database exceeds a pre-specified threshold, then that shape is discarded from further computation.

7) At the end of the parse, we get a list of shapes from the database which are most similar to the query shape.
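The retrieval steps above can be sketched as a single pass over the bins hit by the query. This is a hedged illustration: the function name `retrieve` is hypothetical, and the per-bin term `(q - n)**2 / (q + n)` is a chi-square-style form assumed for this sketch. The loop visits every surviving shape for every matching bin, which favors clarity over speed.

```python
from collections import Counter

def retrieve(query_keys, table, shape_ids, threshold=float("inf")):
    """Rank shapes by an accumulated chi-square-style distance (smaller is closer)."""
    dist = {s: 0.0 for s in shape_ids}
    for key, q in Counter(query_keys).items():
        entries = table.get(key, {})
        for s in list(dist):
            n = entries.get(s, 0)      # no entry in this bin -> count of zero
            dist[s] += (q - n) ** 2 / (q + n)
            if dist[s] > threshold:    # prune shapes that are already too far
                del dist[s]
    return sorted(dist.items(), key=lambda kv: kv[1])
```

Here `table` is assumed to map each bin key to a dictionary of per-shape counts. Since every bin in the loop is hit by at least one query vector, the denominator is always positive.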

C. Computational Complexity

The computational complexity of the indexing phase depends on the complexity of feature extraction. For a shape with $N$ landmarks, the inner distance computation is of complexity $O(N^3)$. Computation of relative angles and contour distances takes $O(N^2)$. The complexity of calculating the articulation-invariant center of mass is $O(N^3)$, while deriving features based on it takes $O(N^2)$. Therefore, the complexity of indexing a shape is $O(N^3)$. Note that indexing can be done offline so that query processing time is not affected. To ensure fairness, all running times reported in the paper include the time spent in indexing.


TABLE II. NUMBER OF TOP MATCHES VERSUS ERROR RATE

As in the indexing phase, for a query shape with $N$ landmarks, feature extraction and hashing is $O(N^3)$. Hashing results in $K$ matching bins. Suppose each bin has $E$ entries, with $E \le M$, where $M$ is the number of shapes in the database; we then need to perform $KE$ distance updates (4). This does not take into account the fact that many shapes are discarded during retrieval, which would further reduce the query processing time. It is difficult to put a bound on how large $K$ and $E$ can be. In the worst case, $K$ can be as large as $\binom{N}{2}$ and $E$ as large as $M$, but that does not happen in practice. With suitable quantization, $E$ increases much slower than $M$. Moreover, if elimination of dissimilar shapes during the retrieval process is taken into account, the complexity of the process depends on the number of those database shapes which are somewhat similar to the query. These attributes make the system quite scalable.

VI. RANK REFINEMENT: RE-RANKING TOP MATCHES USING DETAILED MATCHING

As illustrated by the experimental results (Section VII), in most cases, the proposed indexing and retrieval approach performs very well in terms of accuracy while being extremely fast and efficient. But depending on the application and the accuracy requirement, this can be followed by a detailed matching stage where the query shape is compared with a subset of the database shapes returned by the proposed indexing algorithm. Though in the indexing stage the shapes have been represented by a number of descriptors, including some global features, to get a rich representation, the information about the relative positioning of the point pairs cannot be fully captured due to the bag-of-features kind of distributed representation. The goal of the refinement stage is to re-rank the top matches returned by the first stage according to the global similarity. At this stage, we can potentially use any of the already available algorithms. Since, for each query, this matching needs to be done for a very small subset of the entire database, the increase in computational overhead will be much smaller than using the same algorithm for matching the query with each and every shape in the database. In this paper, we propose such a detailed second matching stage based on the same feature vectors computed in the indexing stage.

Before we describe the matching algorithm, we first investigate the usefulness of the proposed indexing/retrieval algorithm as a first step before a more rigorous matcher is used. We follow the pruning protocol used by Mori et al. [31] on the ETH-80 dataset, which consists of eight categories of objects with ten examples of each. Each example has 41 images from different viewpoints. As in [31], the gallery is composed of one randomly picked example for each category (all views), leading to 328 images in the gallery. The remaining 2952 images are used as queries. The reported error rate at rank “r” represents the average probability of not finding a correct match in the top “r” matches as returned by the proposed indexing/retrieval system. The experiment is repeated 100 times for different random selections of the gallery. The proposed hashing approach gives an average error rate of 5.28% for 40-fold pruning (top 8 ranks), which is much better than the performance reported in [31] (10% using representative shape contexts and 14% using shapemes). Table II shows the variation of error rates with respect to the number of top matches being considered. So we see that, though for a query shape the best matching shape is not always the one with the highest similarity score, it comes within the top few matches, and so a more rigorous matcher has the potential to further improve the matching performance by appropriately re-ranking the top matches returned by the proposed indexing framework.
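The rank-"r" error rate used in this pruning experiment can be sketched as the fraction of queries whose correct category does not appear among the first r retrieved gallery items. The function name `error_rate_at_rank` and the list-of-labels input format are assumptions made for illustration.

```python
def error_rate_at_rank(ranked_labels, query_labels, r):
    """ranked_labels[i]: gallery category labels for query i, best match first."""
    misses = sum(
        q not in ranked[:r] for ranked, q in zip(ranked_labels, query_labels)
    )
    return misses / len(query_labels)
```

Averaging this value over repeated random gallery selections gives the kind of figure reported per rank.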

A. Dynamic Programming-Based Re-Ranking Algorithm

We make use of the same features as used for indexing to re-rank the top matches, thereby avoiding extra computational overhead for feature extraction. To this end, we propose a tighter representation of shape by characterizing each landmark of the shape. Suppose each shape has $N$ landmarks ($N$ is 100 for all our experiments). Each landmark corresponds to $N-1$ pair-wise feature vectors, each of which corresponds to a hashing bin id. In our algorithm, we create a histogram of these $N-1$ bin ids to characterize each shape landmark. The different steps of the proposed re-ranking algorithm (Fig. 6) are enumerated as follows.

1) Compute histograms for all landmarks of all database shapes as a one-time pre-processing step. Each shape is characterized using an ordered set of $N$ histograms corresponding to its $N$ landmarks.

2) Given a query, characterize its landmarks in a similar fashion.

3) Using this representation, compute the similarity of the query shape with each of the top matches returned by the proposed indexing/retrieval system using a dynamic programming-based approach (described below).

Suppose the $N$ landmark points on the contour of the query shape are denoted as $p_1, p_2, \ldots, p_N$ and those of a database shape as $q_1, q_2, \ldots, q_N$. If $\pi$ denotes the mapping between the two shapes such that the $i$th landmark of the query shape is matched to the $\pi(i)$th landmark of the database shape, the matching cost of the two shapes is given by

$$C(\pi) = \sum_{i=1}^{N} c\left(p_i, q_{\pi(i)}\right) \tag{5}$$

The mapping should be chosen in such a way that it minimizes the matching cost given by (5). A penalty can be imposed if $p_i$ is left unmatched, but for all our experiments the penalty is taken to be zero. The cost $c(p_i, q_{\pi(i)})$ of matching the landmarks $p_i$ and $q_{\pi(i)}$ is given by the $\chi^2$ distance between the histograms corresponding to the two landmark locations. Since the shape contours provide information about the ordering of the points $p_1, \ldots, p_N$ and $q_1, \ldots, q_N$, this can be used to restrict the mapping to this order, thereby making it possible to use dynamic programming (DP) [43] to perform this matching.
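An order-preserving matching of two landmark sequences can be sketched with a dynamic-time-warping style recurrence. This is a swapped-in, commonly used formulation and not necessarily the recurrence used by the authors: the names `chi2` and `dp_match_cost` are hypothetical, the chi-square histogram distance is assumed as the per-landmark cost, every landmark is matched at least once, and the choice of starting landmark on the closed contour is ignored.

```python
import numpy as np

def chi2(h1, h2):
    """Chi-square distance between two histograms (empty bins are skipped)."""
    h1, h2 = np.asarray(h1, dtype=float), np.asarray(h2, dtype=float)
    s = h1 + h2
    mask = s > 0
    return 0.5 * np.sum((h1[mask] - h2[mask]) ** 2 / s[mask])

def dp_match_cost(query_hists, db_hists):
    """Minimum total cost of a monotonic alignment of two histogram sequences."""
    n, m = len(query_hists), len(db_hists)
    cost = np.array([[chi2(a, b) for b in db_hists] for a in query_hists])
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend the cheapest alignment that ends just before (i, j).
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    return acc[n, m]
```

The table has (n + 1) x (m + 1) cells and each is filled in constant time once the pairwise costs are known, which keeps a single comparison cheap for a short list of candidates.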


TABLE III. PERFORMANCE COMPARISON ON MPEG7 DATASET. � : SHAPE CONTEXT DISTANCE

Fig. 7. (Left) Example shapes from MPEG7 CE Shape 1 dataset [1]. (Right) Articulation database [3].

VII. EXPERIMENTS

In this section, we report the results of empirical evaluation of the proposed system and compare it with many state-of-the-art matching algorithms on standard datasets. In addition, we highlight the computational advantages of our indexing approach and the usefulness of the proposed refinement stage in terms of improvement in accuracy. In the next section, we also perform experiments on human pose estimation and activity classification to further illustrate the usefulness of the proposed framework for real-world problems that involve large databases. In all the experiments, we take 100 uniformly sampled points on the shape contour as landmarks.

A. MPEG7 Shape Dataset

As our focus is to show the efficiency of the proposed system along with its accuracy, we first test it on the MPEG7 CE-Shape-1 [1] dataset, which is probably one of the largest benchmarks used for evaluating shape matching algorithms. The dataset consists of 1400 silhouettes with 20 images each for 70 different objects (see Fig. 7, Left). The standard test for this dataset is the Bullseye test. It is a leave-one-out kind of test where the 40 most similar shapes are determined for every query shape. The final score is given by the ratio of the number of correct hits to the best possible number of hits (20 × 1400).
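The Bullseye score described above can be sketched as follows, given a full matrix of pairwise dissimilarities. The function name `bullseye_score` is hypothetical. In line with the normalizer of 20 hits per query, the query itself is assumed to be included in its own ranking.

```python
import numpy as np

def bullseye_score(dist, labels, top=40):
    """dist: n x n dissimilarity matrix; labels: class label of each shape."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    hits = 0
    for i in range(len(labels)):
        nearest = np.argsort(dist[i], kind="stable")[:top]
        hits += np.sum(labels[nearest] == labels[i])
    # Best possible hits per query: its class size, capped at `top`.
    _, inv, counts = np.unique(labels, return_inverse=True, return_counts=True)
    best = np.minimum(counts[inv], top).sum()
    return hits / best
```

For a dataset with 1400 shapes in classes of 20 and `top=40`, the denominator evaluates to 20 × 1400.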

Table III compares the performance and computation time of the proposed approach with many algorithms reported in the literature. In terms of accuracy, the algorithm (without refinement) performs quite well, though the performance is not exactly on par with some of the very recently published approaches. On the other hand, as can be seen from Table III, the proposed approach takes several orders of magnitude less time than other approaches. We also report results obtained by applying the proposed refinement step using the top 100 shapes retrieved by the hashing approach. As desired, the refinement step results in significant improvement in performance. Each comparison in the refinement step takes around 0.07 s. Since this is done only for the top 100 matches, the overall computation time required is still smaller than that of the state-of-the-art approaches. The system runs on a regular desktop and is implemented in MATLAB.

The run-times reported for other algorithms are directly taken from the respective references and may vary slightly due to differences in machine configurations. References to some other papers which have reported results on this dataset can be found at http://knight.cis.temple.edu/shape/MPEG7/results.html. A recent method proposed by Yang et al. [26], which takes into account the influence of the other shapes while computing the similarity of a pair of shapes, has reported an accuracy of 93.32% on this dataset.

Performance Analysis With Respect to Variation in Quantization Scheme: We perform experiments on the MPEG7 dataset using different quantization levels for each of the seven features. We observe that reducing the number of levels for a particular feature by half leads to an average accuracy of 81.16%, which is less than 0.7% below the one obtained using the quantization suggested in Table I. On the other hand, doubling the quantization levels for a feature leads to an average accuracy of 79.44%.

Performance Analysis With Variation in Number of Landmark Points: We perform experiments on the MPEG7 dataset using a varying number of landmarks. The results show that the retrieval accuracy degrades gracefully as the number of landmarks is reduced from 100. With 75, 50, and 25 landmarks, the accuracy is 81.77%, 79.73%, and 70.90%, respectively, compared to 81.8% for the 100 landmark points used in all our experiments.

B. Articulation Database

The features used in our approach were chosen so as to support articulation-invariant matching. Therefore, it is important to evaluate the performance of the system on a dataset which explicitly deals with large articulations. Here we use the articulation dataset introduced in [3], which consists of eight objects with five shapes each, as shown in Fig. 7 (right). We use the same test scheme as in [3]. For each shape, the four most similar shapes are selected and the number of correct hits for ranks 1, 2, 3, and 4 are calculated. Clearly, the best possible performance of any system is to get 40 correct matches at all four ranks. Table IV (left) summarizes the results obtained. The proposed approach compares favorably with other approaches. It is noteworthy that, unlike other approaches, the proposed hashing approach does not require any alignment or costly matching for computing similarity with each shape in the dataset. The accuracy improves when the proposed refinement step is used for matching, further signifying the efficacy of the proposed shape representation.

Since the proposed set of features is meant to be insensitive to articulations, we perform an analysis of the features on the articulation dataset. For this analysis, we divide the features into three sets, namely, inner distance + relative angles, contour distance, and articulation-invariant center of mass (AICM)-based features. Table IV (right) summarizes the performance of these feature sets on the articulation dataset.

TABLE IV. ARTICULATION DATASET: (LEFT) RETRIEVAL RESULT. (RIGHT) ANALYSIS OF THE VARIOUS FEATURES USED

TABLE V. RETRIEVAL RESULTS ON KIMIA 1 (LEFT) AND KIMIA 2 (RIGHT) DATASETS

Fig. 8. Kimia database. (a) Kimia dataset 1 [6]. (b) Kimia dataset 2 [7].

C. Kimia Dataset 1 and 2

Kimia dataset 1 [6] [see Fig. 8(a)] consists of 25 shapes from five categories. The experiment is run in a leave-one-out pattern. The performance is measured by accumulating the correct matches at ranks 1, 2, and 3. The best one can get at any rank is 25. Table V (left) compares the results obtained with other approaches. The proposed approach compares well with other approaches.

Kimia dataset 2 [7] [see Fig. 8(b)] is a larger version of dataset 1. It consists of 99 silhouettes from nine categories. The performance is measured by examining the correct matches at the top 10 ranks for each query. The best one can get for each rank is 99. Table V (right) summarizes the results obtained. In addition to being extremely efficient, the proposed approach compares favorably with many existing algorithms.

D. ETH-80 Database

The ETH-80 database [8] contains a total of 80 objects, ten each from eight different categories (Fig. 9). Each object is represented by 41 images taken from viewpoints spaced equally over the upper viewing hemisphere, resulting in a total of 3280 images. We follow the standard testing protocol for the database, which is leave-one-object-out cross-validation. Each image in the database is compared with all the images (all 41 views) from the other 79 objects, and if the correct category label is assigned, the recognition is considered successful. The recognition rate is averaged over all the objects.

Fig. 9. Eight object categories of the ETH-80 database [8]. Each category contains ten objects with 41 views per object.

Table VI summarizes the results obtained. The approaches listed in the table use a single cue (either appearance or shape) for performing object recognition [8]. The best reported result on this dataset (to the best of our knowledge) is 93.02%, which is obtained using a decision-tree-based approach [8] that combines the first seven approaches (i.e., combines multiple cues of shape, color, etc.) for better performance. We also report the accuracy obtained using the proposed refinement step on the top 5, top 10, and top 20 matches obtained from the efficient retrieval process. As desired, the refinement step improves the accuracy, making it even better than the one reported using multiple cues.


TABLE VI. RECOGNITION RESULT ON THE ETH-80 DATASET. COMPARED ACCURACIES ARE FROM [8]

VIII. APPLICATIONS

Efficient shape matching and retrieval is useful for many practical applications. Here, we describe two such applications, namely human pose estimation and activity classification.

A. Human Pose Estimation

Data retrieval based on content, rather than on human annotation which might be absent or erroneous, has received much attention recently. The ability to automatically describe human activities in long video sequences is very useful for automatic video archiving, browsing, and retrieval. Though motion is a very important cue, human activities in videos can often be described by the body pose in still frames [5]. In our context, human pose estimation implies matching the corresponding human silhouettes in the 2-D images based on their body posture and not explicitly estimating the 3-D pose.

1) Evaluation Protocol: As the underlying pose space is continuous, exemplars cannot be easily classified into positive and negative samples. Here, we use the same evaluation protocol as followed by Tresadern and Reid [48]. If the body joint locations are known, then for each query image $q$, the sum of squared errors between corresponding joint center projections in the query image and in each image $i$ in the database is calculated. Let this distance in the pose space be denoted by $d_{\text{pose}}(q, i)$. The database poses are then ranked in order of similarity to the query as determined by the shape descriptor. Let the index of the closest training example be $r_1$ and the furthest be $r_N$, where $N$ is the number of images in the database. The curve $f(k)$, given by

$$f(k) = \frac{1}{k} \sum_{j=1}^{k} d_{\text{pose}}(q, r_j) \tag{6}$$

represents the mean distance of the $k$ highest ranking database examples to the query for $k = 1, \ldots, N$. Intuitively speaking, the function determines how well the ranking obtained using the shape descriptors correlates with the one given by joint locations.

2) Experiments on MOCAP Data: We first evaluate the proposed shape indexing method using binary silhouettes of a human body model generated from motion capture data which contains information about the joint centers (http://mocap.cs.cmu.edu). Fig. 10 (left) shows a few examples of binary silhouettes. The training data consists of 1500 binary silhouettes of size 128 × 128 from different motions. The evaluation is performed on over 400 synthetically generated test silhouettes. The silhouettes generated from the synthetic data were automatically labeled with the image projections of the joint centers for evaluation. Fig. 10 (right) shows the normalized curve of $f(k)$ against $k/N$, where $N$ is the total number of training images. As mentioned earlier, the lower the curve is, the better is the performance.

Fig. 10. (Left) Example silhouettes from the CMU MOCAP dataset. (Right) Evaluation of the proposed method for human pose estimation. Comparison with (a) Lipschitz embeddings (lipschitz) and (b) histogram of shape contexts (hists) is also shown.

Fig. 11. Sample frames from the figure skating data [5].
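The evaluation curve described in the protocol above, i.e., the mean pose-space distance of the top-ranked database examples, can be sketched for a single query as follows. The function name `mean_pose_distance_curve` is hypothetical. Both inputs are assumed to be one-dimensional arrays over the database, holding the descriptor-based dissimilarity and the joint-location distance to the query, respectively.

```python
import numpy as np

def mean_pose_distance_curve(shape_dist, pose_dist):
    """Entry k-1 is the mean pose distance of the k best-ranked examples."""
    order = np.argsort(shape_dist, kind="stable")   # ranking by shape descriptor
    ranked = np.asarray(pose_dist, dtype=float)[order]
    return np.cumsum(ranked) / np.arange(1, len(ranked) + 1)
```

A ranking that agrees perfectly with pose distance yields the lowest possible curve, which is why lower curves indicate better descriptors.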

To illustrate the effectiveness of the proposed approach for human pose estimation, we compare the results with two different approaches, viz. Lipschitz embeddings [49] and histogram of shape contexts [50], that were recently evaluated for this task [48]. The comparison of these approaches with the proposed approach is shown in Fig. 10 (right). We see that the performance of the proposed approach compares favorably with other shape descriptors. The dash-dot curve indicates the best possible ranking where distance in image space correlates perfectly with distance in pose space. Though the histogram of shape contexts-based approach gives similar performance, it is about 2.5 times slower than the proposed indexing framework (985 s versus 393 s for the entire experiment).

3) Experiment on Figure Skating Data [5]: We also perform human pose estimation on a real figure skating dataset [5]. The videos are unconstrained and involve swift motion of the skater and real-world motion of the camera including pan, tilt, and zoom, making it very challenging (Fig. 11).

We first perform simple pre-processing of the raw video data to obtain the binary silhouettes of the skater. The foreground pixels are separated from the background by building color models for both, which is followed by median filtering to reject small isolated blobs. The extracted silhouettes are noisy and present quite a challenge for any shape matching algorithm. Since the pose space here is continuous, it is not straightforward to divide the data into separate classes and perform quantitative evaluation of the retrieval results. Here, we use MDS to analyze the effectiveness of the proposed method for representing the different poses of the skater. As described in Section IV, MDS places the input binary silhouettes in a new Euclidean space such that the inter-point distances (here each point represents an input silhouette) in the new space are as close as possible to the inter-silhouette distances obtained using the proposed shape matching approach. Fig. 12 shows the result of performing MDS on a subset of the figure skating data. Here the output space is taken to be two-dimensional for visualization purposes. As desired, similar poses appear closer to one another and different poses appear farther apart in the transformed space.

Fig. 12. Visualization of similarity of the different poses of the skater using MDS. MDS places the input silhouettes in a new Euclidean space such that the inter-silhouette distances in the transformed space are as close as possible to the distances obtained using the proposed shape matching approach. We see that similar poses appear closer to each other, even after the dimensionality of the transformed space is reduced to two.

We also perform a retrieval experiment to retrieve similar poses from the database. Fig. 13 shows the top 5 matches for a few query images (shown in the first column). In the figure, other than for the second query, the algorithm successfully returns images having a similar pose to that in the query. These examples show the ability of the proposed framework to effectively match complicated shapes using noisy silhouettes extracted from real data.

B. Activity Classification

The goal of activity classification is to classify the content of human activity sequences in an unsupervised manner without any prior knowledge of the type of actions being performed. Many activity classification methods have addressed this task from a shape matching perspective [51]–[54]. Here, we present a very simple approach to show the usefulness of the proposed indexing approach for the task of activity classification. In addition to analyzing the sequence of silhouettes to characterize the spatial information, we propose a novel temporal shape representation to capture the temporal characteristics of the observed activity. Note that any method which transforms the activity classification task into a shape matching problem can benefit from the computational efficiency provided by our framework, irrespective of the exact form of representation. The following discussion provides the details of the approach and the results of the experiments performed for its evaluation.

Fig. 13. Image retrieval based on pose. First column shows query image. Second to sixth columns show the top 5 matches.

Fig. 14. Silhouettes (first column) and temporal shapes (second column) for a few activities as chosen by our algorithm.

TABLE VII. ACTIVITY CLASSIFICATION PERFORMANCE OBTAINED FROM SILHOUETTES-BASED SPATIAL AND TEMPORAL CHARACTERIZATION. THE TWO NUMBERS IN EACH TABLE ENTRY SHOW THE PERFORMANCE OBTAINED USING THE PROPOSED SPATIAL AND TEMPORAL CHARACTERIZATIONS, RESPECTIVELY

Spatial Characterization: Depending on the input video sequence, the foreground silhouettes are obtained using low-level image processing techniques. Temporal clustering is performed on these silhouettes to obtain a fixed number of clusters based on the pose. We use the distance transform to do the clustering. However, the representative shapes can instead be taken as key frames or as any shape representation from the approaches which view activity classification as a shape matching problem. Temporal clustering results in a set of key silhouettes which provide the spatial characterization of the sequence of foreground silhouettes.

Temporal Characterization: The indexing approach presented is useful for efficient matching of shapes. In order to efficiently utilize the temporal information for activity classification, we transform it to another shape matching problem. An activity sequence can be represented using a 3-D space-time volume. The silhouettes are essentially slices of this volume taken at different instances along the temporal axis. In a similar manner, one can slice the space-time 3-D volume along one of the spatial axes (here the y-axis) to obtain 2-D space-time shapes, which we call temporal shapes. Similar to the temporal clustering of the silhouettes, spatial clustering is performed on these temporal shapes to obtain a fixed number of key temporal shapes. Fig. 14 shows the landmark silhouettes and temporal shapes for a few activities. From the figure, we see that this representation seems to contain discriminative information which can be utilized for classifying different activities.
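The two ways of slicing the space-time volume can be sketched with NumPy indexing. The function name `spatial_and_temporal_slices` is hypothetical, and the `[t, y, x]` axis order of the volume is an assumption made for illustration.

```python
import numpy as np

def spatial_and_temporal_slices(volume):
    """volume: binary space-time array indexed as [t, y, x]."""
    volume = np.asarray(volume)
    # One y-x image per time instant: the ordinary silhouettes.
    silhouettes = [volume[t] for t in range(volume.shape[0])]
    # One t-x image per image row: the 2-D space-time (temporal) shapes.
    temporal_shapes = [volume[:, y, :] for y in range(volume.shape[1])]
    return silhouettes, temporal_shapes
```

Both lists contain ordinary 2-D binary images, so the same contour-based features and indexing can be applied to either kind of slice.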

Each video sequence is represented with a set of 2-D shapes (the key silhouettes and the key temporal shapes). Note that these 2-D shapes are ordered (in time and space, respectively). Each shape is then indexed based on the computed features, resulting in separate hash tables. During retrieval, each shape of the query video is used to retrieve similar shapes from the corresponding hash table in a manner similar to the one described in the previous sections. The similarity scores of the retrieved shapes are then fused in an additive manner to obtain the final similarity scores.

1) Experimental Evaluation: We evaluate the proposed approach on the activity dataset introduced in [54]. The dataset consists of 90 video sequences of nine different persons performing ten different activities, namely, run, walk, skip, jumping jack (or jack in short), jump forward on two legs (or jump in short), jump in place with two legs (pjump), gallop sideways (side), wave with two hands (wave2), wave with one hand (wave1), and bend. We follow a leave-one-out protocol as suggested in [54], i.e., for each query sequence, we remove the entire sequence from the database and compare it against the remaining 89 sequences. Table VII shows the performance obtained in this experiment using the proposed spatial and temporal characterization of activity sequences. The performance is measured by verifying whether the best match for each query sequence is from the same category. Clearly, the best performance possible is to get nine correct matches in all the diagonal entries (as there are nine instances per category that act as queries in a leave-one-out fashion). The performance is comparable to the approach in [54], which computes features from the complete space-time volume for classification.
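The leave-one-out protocol and the resulting confusion matrix can be sketched as below. The function `leave_one_out_confusion`, the scalar toy features, and the `similarity` callback are assumptions for illustration; in the experiment the similarity would be the fused score between two activity sequences.

```python
def leave_one_out_confusion(labels, similarity):
    """labels[i] is the category of sequence i; similarity(i, j) returns a
    score that is larger for more similar sequences. Each sequence is used
    as a query against all remaining sequences, and the category of its best
    match is tallied in the row of the query's own category."""
    categories = sorted(set(labels))
    index = {c: k for k, c in enumerate(categories)}
    confusion = [[0] * len(categories) for _ in categories]
    for q in range(len(labels)):
        candidates = [j for j in range(len(labels)) if j != q]
        best = max(candidates, key=lambda j: similarity(q, j))
        confusion[index[labels[q]]][index[labels[best]]] += 1
    return categories, confusion


# Toy data: one scalar feature per sequence (purely illustrative).
labels = ["run", "run", "walk", "walk"]
features = [0.0, 0.1, 1.0, 1.2]
categories, confusion = leave_one_out_confusion(
    labels, lambda i, j: -abs(features[i] - features[j]))
print(categories)
print(confusion)
```

With nine sequences per category, a perfect result under this protocol places a count of nine in every diagonal entry.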

IX. SUMMARY AND DISCUSSION

We presented an efficient and robust approach for fast matching and retrieval of shapes. The following attributes of the approach contribute towards its robustness and hence graceful degradation of performance in the presence of noise, outliers, and other deformations: 1) pair-wise geometric feature-based representation, 2) feature quantization, and 3) invariance of features to rigid transformations and articulations of part structures. A rich and robust feature representation is important even for the retrieval process. It helps to achieve robust matching using an extremely simple algorithm that does not involve any correspondence matching, as required by most state-of-the-art techniques. In most existing techniques, the alignment process has to be repeated for every shape in the database for retrieval, making them much slower than the proposed scheme. As dissimilar shapes are eliminated very early during our retrieval process, little effort is wasted in comparing a query to database shapes which are very different, making the system scalable. We also proposed a refinement stage to further highlight the usefulness of the proposed shape representation and indexing framework. The extensive experimental evaluations performed illustrate the effectiveness of the proposed approach. Due to the increase in the amount of data to be handled, most real-life applications require efficient algorithms which can scale up to large databases. The results obtained are quite promising and make a strong case for such an efficient indexing-based framework for shape matching.

REFERENCES

[1] L. J. Latecki, R. Lakamper, and U. Eckhardt, “Shape descriptors for non-rigid shapes with a single closed contour,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2000, pp. 424–429.

[2] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and object recognition using shape contexts,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 4, pp. 509–522, Apr. 2002.

[3] H. Ling and D. W. Jacobs, “Shape classification using the inner-distance,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 2, pp. 286–299, Feb. 2007.

[4] C. Rao, A. Yilmaz, and M. Shah, “View-invariant representation and recognition of actions,” Int. J. Comput. Vis., vol. 50, no. 2, pp. 203–226, 2002.

[5] Y. Wang, H. Jiang, M. Drew, L. Ze-Nian, and G. Mori, “Unsupervised discovery of action classes,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2006, pp. 1654–1661.

[6] D. Sharvit, J. Chan, H. Tek, and B. B. Kimia, “Symmetry-based indexing of image databases,” J. Vis. Commun. Image Represent., vol. 9, no. 4, pp. 366–380, 1998.

[7] T. B. Sebastian, P. N. Klein, and B. B. Kimia, “Recognition of shapes by editing their shock graphs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 5, pp. 550–571, May 2004.

[8] B. Leibe and B. Schiele, “Analyzing appearance and contour based methods for object categorization,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2003.

[9] S. Biswas, G. Aggarwal, and R. Chellappa, “Efficient indexing for articulation invariant shape matching and retrieval,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007, pp. 1–8.

[10] G. Mori and J. Malik, “Recognizing objects in adversarial clutter: Breaking a visual captcha,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2003, pp. 134–141.

[11] Z. Tu and A. L. Yuille, “Shape matching and recognition: Using generative models and informative features,” in Proc. Eur. Conf. Computer Vision, 2004, pp. 195–209.

[12] A. Thayananthan, B. Stenger, P. H. S. Torr, and R. Cipolla, “Shape context and chamfer matching in cluttered scenes,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2003, pp. 127–133.

[13] G. Mori, S. Belongie, and J. Malik, “Shape contexts enable efficient retrieval of similar shapes,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2001, pp. 723–730.

[14] H. Chui and A. Rangarajan, “A new point matching algorithm for non-rigid registration,” Comput. Vis. Image Understand., vol. 89, pp. 114–141, 2003.

[15] M. Daliri and V. Torre, “Robust symbolic representation for shape recognition and retrieval,” Pattern Recognit., vol. 41, no. 5, pp. 1799–1815, 2008.

[16] G. McNeill and S. Vijayakumar, “Hierarchical procrustes matching for shape retrieval,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2006, pp. 885–894.

[17] P. Felzenszwalb and J. Schwartz, “Hierarchical matching of deformable shapes,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007, pp. 1–8.

[18] A. Peter, A. Rangarajan, and J. Ho, “Shape l’âne rouge: Sliding wavelets for indexing and retrieval,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008, pp. 1–8.

[19] K. Siddiqi, A. Shokoufandeh, S. J. Dickinson, and S. W. Zucker, “Shock graphs and shape matching,” Int. J. Comput. Vis., vol. 35, no. 1, pp. 13–32, 1999.

[20] N. Alajlan, M. Kamel, and G. Freeman, “Geometry-based image retrieval in binary image databases,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 6, pp. 1003–1013, Jun. 2008.

[21] C. Grigorescu and N. Petkov, “Distance sets for shape filters and shape recognition,” IEEE Trans. Image Process., vol. 12, no. 7, pp. 729–739, Jul. 2003.

[22] J. Xie, P. Heng, and M. Shah, “Shape matching and modeling using skeletal context,” Pattern Recognit., vol. 41, no. 5, pp. 1756–1767, 2008.

[23] E. Attalla and P. Siy, “Robust shape similarity retrieval based on contour segmentation polygonal multiresolution and elastic matching,” Pattern Recognit., vol. 38, no. 12, pp. 2229–2241, 2005.

[24] T. Adamek and N. O’Connor, “A multiscale representation method for nonrigid shapes with a single closed contour,” IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 5, pp. 742–753, May 2004.

[25] B. Super, “Retrieval from shape databases using chance probability functions and fixed correspondence,” Int. J. Pattern Recognit. Artif. Intell., vol. 20, no. 8, pp. 1117–1138, 2006.

[26] X. Yang, S. Koknar-Tezel, and L. Latecki, “Locally constrained diffusion process on locally densified distance spaces with applications to shape retrieval,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009, pp. 357–364.

[27] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms. Cambridge, MA: MIT Press, 2001.

[28] R. S. Germain, A. Califano, and S. Colville, “Fingerprint matching using transformation parameter clustering,” Comput. Sci. Eng., vol. 4, no. 4, pp. 42–49, 1997.

[29] Y. Lamdan and H. J. Wolfson, “Geometric hashing: A general and efficient model-based recognition scheme,” in Proc. Int. Conf. Computer Vision, 1988, pp. 238–249.

[30] D. Nister and H. Stewenius, “Scalable recognition with a vocabulary tree,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2006, pp. 2161–2168.

[31] G. Mori, S. Belongie, and J. Malik, “Efficient shape matching using shape contexts,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 11, pp. 1832–1837, Nov. 2005.

[32] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin, “Shape distributions,” ACM Trans. Graph., vol. 21, no. 4, pp. 807–832, 2002.

[33] R. Ohbuchi, T. Minamitani, and T. Takei, “Shape-similarity search of 3D models by using enhanced shape functions,” in Proc. Theory and Practice of Computer Graphics, 2003, pp. 97–104.

[34] A. B. Hamza and H. Krim, “Geodesic object representation and recognition,” in DGCI, LNCS 2886, 2003, pp. 378–387.

[35] C. Y. Ip, D. Lapadat, L. Sieger, and W. C. Regli, “Using shape distributions to compare solid models,” in Proc. ACM Symp. Solid Modeling and Applications, 2002, pp. 273–280.

[36] Y. Liu, H. Zha, and H. Qin, “The generalized shape distributions for shape matching and analysis,” in Proc. Int. Conf. Shape Modeling and Applications, 2002.

[37] J. Beis and D. Lowe, “Shape indexing using approximate nearest-neighbour search in high dimensional spaces,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1997, pp. 984–989.

[38] I. Fudos and L. Palios, “An efficient shape-based approach to image retrieval,” Pattern Recognit., vol. 23, no. 6, pp. 731–741, 2002.

[39] D. Rafiei and A. Mendelzon, “Efficient retrieval of similar shapes,” Int. J. Very Large Data Bases, vol. 11, no. 1, pp. 17–27, 2002.

[40] S. Berretti, A. Del Bimbo, and P. Pala, “Retrieval by shape similarity with perceptual distance and effective indexing,” IEEE Trans. Multimedia, vol. 2, no. 4, pp. 225–239, Dec. 2000.

[41] J. Wang, W. Chang, and R. Acharya, “Efficient and effective similar shape retrieval,” in Proc. IEEE Int. Conf. Multimedia Computing and Systems, 1999.

[42] A. Elad and R. Kimmel, “On bending invariant signatures for surfaces,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 10, pp. 1285–1295, Oct. 2003.

[43] E. Petrakis, A. Diplaros, and E. Milios, “Matching and retrieval of distorted and occluded shapes using dynamic programming,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 11, pp. 1501–1516, Nov. 2002.

[44] F. Mokhtarian, F. Abbasi, and J. Kittler, “Efficient and robust retrieval by shape content through curvature scale space,” in Proc. Image Databases and Multimedia Search, 1997, pp. 51–58.

[45] L. J. Latecki and R. Lakamper, “Shape similarity measure based on correspondence of visual parts,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 10, pp. 1185–1190, Oct. 2000.

[46] T. B. Sebastian, P. N. Klein, and B. B. Kimia, “On aligning curves,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 1, pp. 116–125, Jan. 2003.

[47] Y. Gdalyahu and D. Weinshall, “Flexible syntactic matching of curves and its applications to automatic hierarchical classification of silhouettes,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 12, pp. 1312–1328, Dec. 1999.

[48] P. Tresadern and I. Reid, “An evaluation of shape descriptors for image retrieval in human pose estimation,” in Proc. British Machine Vision Conf., 2007.

[49] G. R. Hjaltason and H. Samet, “Properties of embedding methods for similarity searching in metric spaces,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 530–549, May 2003.

[50] A. Agarwal and B. Triggs, “Recovering 3D human pose from monocular images,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 1, pp. 44–58, Jan. 2006.

[51] S. Carlsson and J. Sullivan, “Action recognition by shape matching to key frames,” in Proc. IEEE Comput. Soc. Workshop Models versus Exemplars in Computer Vision, 2001.

[52] A. F. Bobick and J. W. Davis, “The recognition of human movement using temporal templates,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 3, pp. 257–267, Mar. 2001.

[53] A. Yilmaz and M. Shah, “Actions sketch: A novel action representation,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005, pp. 984–989.

[54] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, “Actions as space-time shapes,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 12, pp. 2247–2253, Dec. 2007.

Soma Biswas (GS’09) received the B.E. degree in electrical engineering from Jadavpur University, Kolkata, India, in 2001, the M.Tech. degree from the Indian Institute of Technology, Kanpur, in 2004, and the Ph.D. degree in electrical and computer engineering from the University of Maryland, College Park, in 2009.

She is currently working as a Research Assistant Professor at the University of Notre Dame. Her research interests are in signal, image, and video processing, computer vision, and pattern recognition.

Gaurav Aggarwal (S’02) received the B.Tech. degree in computer science and engineering from the Indian Institute of Technology, Madras, in 2002 and the M.S. and Ph.D. degrees in computer science from the University of Maryland, College Park, in 2004 and 2008, respectively.

He is currently working as a Research Scientist with Object Video, Reston, VA. His research interests are in image and video processing, computer vision, and pattern recognition.

Rama Chellappa (F’92) received the B.E. (Hons.) degree from the University of Madras, Madras, India, in 1975, the M.E. (Distinction) degree from the Indian Institute of Science, Bangalore, in 1977, and the M.S.E.E. and Ph.D. degrees in electrical engineering from Purdue University, West Lafayette, IN, in 1978 and 1981, respectively.

Since 1991, he has been a Professor of Electrical Engineering and an affiliate Professor of Computer Science at the University of Maryland, College Park. He is also affiliated with the Center for Automation Research (Director) and the Institute for Advanced Computer Studies (Permanent Member). In 2005, he was named a Minta Martin Professor of Engineering. Prior to joining the University of Maryland, he was an Assistant (1981–1986) and Associate Professor (1986–1991) and Director of the Signal and Image Processing Institute (1988–1990) at the University of Southern California (USC), Los Angeles. Over the last 29 years, he has published numerous book chapters and peer-reviewed journal and conference papers. He has co-authored and edited books on MRFs, face and gait recognition, and collected works on image processing and analysis. His current research interests are face and gait analysis, markerless motion capture, 3-D modeling from video, image and video-based recognition and exploitation, compressive sensing, and hyperspectral processing.

Prof. Chellappa served as an Associate Editor of four IEEE Transactions, as a Co-Editor-in-Chief of Graphical Models and Image Processing, and as the Editor-in-Chief of the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE. He served as a member of the IEEE Signal Processing Society Board of Governors and as its Vice President of Awards and Membership. He is serving a two-year term as the President of the IEEE Biometrics Council. He has received several awards, including an NSF Presidential Young Investigator Award, four IBM Faculty Development Awards, an Excellence in Teaching Award from the School of Engineering at USC, and two paper awards from the International Association of Pattern Recognition. He received the Society, Technical Achievement, and Meritorious Service Awards from the IEEE Signal Processing Society. He also received the Technical Achievement and Meritorious Service Awards from the IEEE Computer Society. At the University of Maryland, he has been elected as a Distinguished Faculty Research Fellow and as a Distinguished Scholar-Teacher, and has received the Outstanding Faculty Research Award from the College of Engineering, an Outstanding Innovator Award from the Office of Technology Commercialization, and an Outstanding GEMSTONE Mentor Award. He is a Fellow of the International Association for Pattern Recognition and the Optical Society of America. He has served as a General and Technical Program Chair for several IEEE international and national conferences and workshops. He is a Golden Core Member of the IEEE Computer Society and served a two-year term as a Distinguished Lecturer of the IEEE Signal Processing Society.