

EURASIP Journal on Applied Signal Processing

Image Analysis for Multimedia Interactive Services—Part I

Guest Editors: Moncef Gabbouj, Faouzi Alaya Cheikh, Bogdan Cramariuc, and Geoff Morrison


Copyright © 2002 Hindawi Publishing Corporation. All rights reserved.

This is a special issue published in volume 2002 of “EURASIP Journal on Applied Signal Processing.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Editor-in-Chief
K. J. Ray Liu, University of Maryland, College Park, USA

Associate Editors
Kiyoharu Aizawa, Japan; Gonzalo Arce, USA; Jaakko Astola, Finland; Mauro Barni, Italy; Sankar Basu, USA; Shih-Fu Chang, USA; Jie Chen, USA; Tsuhan Chen, USA; M. Reha Civanlar, USA; Tony Constantinides, UK; Luciano Costa, Brazil; Irek Defee, Finland; Ed Deprettere, The Netherlands; Zhi Ding, USA; Jean-Luc Dugelay, France; Pierre Duhamel, France; Tariq Durrani, UK; Sadaoki Furui, Japan; Ulrich Heute, Germany; Yu Hen Hu, USA; Jiri Jan, Czech Republic; Shigeru Katagiri, Japan; Mos Kaveh, USA; Bastiaan Kleijn, Sweden; Ut Va Koc, USA; Aggelos Katsaggelos, USA; C. C. Jay Kuo, USA; S. Y. Kung, USA; Chin-Hui Lee, USA; Kyoung Mu Lee, Korea; Y. Geoffrey Li, USA; Heinrich Meyr, Germany; Ferran Marques, Spain; Jerry M. Mendel, USA; Marc Moonen, Belgium; José M. F. Moura, USA; Ryohei Nakatsu, Japan; King N. Ngan, Singapore; Takao Nishitani, Japan; Naohisa Ohta, Japan; Antonio Ortega, USA; Mukund Padmanabhan, USA; Ioannis Pitas, Greece; Raja Rajasekaran, USA; Phillip Regalia, France; Hideaki Sakai, Japan; William Sandham, UK; Wan-Chi Siu, Hong Kong; Piet Sommen, The Netherlands; John Sorensen, Denmark; Michael G. Strintzis, Greece; Ming-Ting Sun, USA; Tomohiko Taniguchi, Japan; Sergios Theodoridis, Greece; Yuke Wang, USA; Andy Wu, Taiwan; Xiang-Gen Xia, USA; Zixiang Xiong, USA; Kung Yao, USA


Contents

Editorial, Moncef Gabbouj, Faouzi Alaya Cheikh, Bogdan Cramariuc, and Geoff Morrison
Volume 2002 (2002), Issue 4, Pages 341-342

Overview of the MPEG-7 Standard and of Future Challenges for Visual Information Analysis, Philippe Salembier
Volume 2002 (2002), Issue 4, Pages 343-353

Using MPEG-7 at the Consumer Terminal in Broadcasting, Alan Pearmain, Mounia Lalmas, Ekaterina Moutogianni, Damien Papworth, Pat Healey, and Thomas Rölleke
Volume 2002 (2002), Issue 4, Pages 354-361

Ordinal-Measure Based Shape Correspondence, Faouzi Alaya Cheikh, Bogdan Cramariuc, Mari Partio, Pasi Reijonen, and Moncef Gabbouj
Volume 2002 (2002), Issue 4, Pages 362-371

Audio Classification in Speech and Music: A Comparison between a Statistical and a Neural Approach, Alessandro Bugatti, Alessandra Flammini, and Pierangelo Migliorati
Volume 2002 (2002), Issue 4, Pages 372-378

Video Segmentation Using Fast Marching and Region Growing Algorithms, Eftychis Sifakis, Ilias Grinias, and Georgios Tziritas
Volume 2002 (2002), Issue 4, Pages 379-388

Stand-Alone Objective Segmentation Quality Evaluation, Paulo Lobato Correia and Fernando Pereira
Volume 2002 (2002), Issue 4, Pages 389-400

Objective Evaluation Criteria for 2D-Shape Estimation Results of Moving Objects, Roland Mech and Ferran Marqués
Volume 2002 (2002), Issue 4, Pages 401-409

Using Invariant Image Features for Synchronization in Spread Spectrum Image Watermarking, Ebroul Izquierdo
Volume 2002 (2002), Issue 4, Pages 410-417

Segmentation and Content-Based Watermarking for Color Image and Image Region Indexing and Retrieval, Nikolaos V. Boulgouris, Ioannis Kompatsiaris, Vasileios Mezaris, Dimitrios Simitopoulos, and Michael G. Strintzis
Volume 2002 (2002), Issue 4, Pages 418-431


EURASIP Journal on Applied Signal Processing 2002:4, 341–342
© 2002 Hindawi Publishing Corporation

Editorial

Moncef Gabbouj
Signal Processing Laboratory, Tampere University of Technology, P.O. Box 553, FIN-33101 Tampere, Finland
Email: [email protected]

Faouzi Alaya Cheikh
Digital Media Institute, P.O. Box 553, FIN-33101 Tampere, Finland
Email: [email protected]

Bogdan Cramariuc
Digital Media Institute, P.O. Box 553, FIN-33101 Tampere, Finland
Email: [email protected]

Geoff Morrison
BTexact Technologies, Adastral Park, Ipswich, IP5 3RE, UK
Email: [email protected]

There has been a large volume of research and development work in the area of image analysis for multimedia interactive services in the past decade. The European Union COST 211 Action has actively contributed to this work for many years. The recent focus has been on two major topics: multimedia indexing and retrieval, and video segmentation. Five years ago, COST 211 organized the first international workshop on Image Analysis for Multimedia Interactive Services, WIAMIS, which was held in Louvain-la-Neuve, Belgium. Since then, further workshops have been organized in Berlin in 1999 and in Tampere in 2001. The next workshop will be held in London in 2003.

WIAMIS proved to be a major window to the outside world for the closed collaborative Action COST 211. Cross-fertilization of ideas is the prime goal of the workshops, in addition to attracting new members working in areas related to COST 211. To reach a broader audience, the Editor-in-Chief of the EURASIP Journal on Applied Signal Processing kindly accepted the proposal to publish a selection of the papers presented at WIAMIS 2001. The authors' response to this call for papers exceeded our expectations, and the editorial board of the journal agreed to allocate two issues for the selected papers. The next issue is planned to appear in June 2002.

This part (Part I) of the special issue covers four main topics. We open the issue with an invited tutorial on the MPEG-7 standard and future challenges for visual information analysis and retrieval. The tutorial is authored by Philippe Salembier, who has been actively involved in the MPEG-7 standard development activity since its beginning. Three other papers in the area of indexing and retrieval cover the use of MPEG-7 at the consumer terminal in broadcasting, retrieval based on shape correspondences using ordinal measures, and audio classification in speech and music.

Video segmentation is the focus of the next three papers. As mentioned earlier, video segmentation has been a key research area in COST 211quat. The group has developed a software package called the COST AM (Analysis Model), which can produce, among other outputs, a set of segmentation masks for an input video. A Call for Comparison with the COST AM has been issued, and some proposals have been received and evaluated. One of these is presented here in a paper authored by E. Sifakis et al., in which they propose a novel algorithm for video segmentation using fast marching and region growing. A major issue related to video segmentation is segmentation quality, that is, how one can objectively measure segmentation results. Two papers in this issue propose novel solutions to this important problem.

The last two papers in this part of the special issue focus on watermarking. The first presents a scheme in which characteristics of both spatial and frequency techniques are combined to achieve robustness against image processing and geometric transformations, while the second considers segmentation and content-based watermarking for color image and image region indexing and retrieval.

Part II of the special issue will focus on video segmentation, detection, tracking, motion estimation, and post-processing. We hope the reader will enjoy this issue. The Guest Editors would like to thank all contributing authors for their efforts in meeting the quality requirements of the journal as well as the tight schedule. We are indebted to the Editor-in-Chief, Ray Liu, for his support and patience.

Moncef Gabbouj
Faouzi Alaya Cheikh
Bogdan Cramariuc
Geoff Morrison

Moncef Gabbouj received his B.S. degree in electrical engineering in 1985 from Oklahoma State University, Stillwater, and his M.S. and Ph.D. degrees in electrical engineering from Purdue University, West Lafayette, Indiana, in 1986 and 1989, respectively. Dr. Gabbouj is currently a Professor and Head of the Institute of Signal Processing of Tampere University of Technology, Tampere, Finland. From 1995 to 1998 he was a Professor with the Department of Information Technology of Pori School of Technology and Economics, Pori, and during 1997 and 1998 he was on sabbatical leave with the Academy of Finland. From 1994 to 1995 he was an Associate Professor with the Signal Processing Laboratory of Tampere University of Technology, Tampere, Finland. From 1990 to 1993 he was a Senior Research Scientist with the Research Institute for Information Technology, Tampere, Finland. His research interests include nonlinear signal and image processing and analysis, content-based analysis and retrieval, and mathematical morphology. Dr. Gabbouj is the Vice-Chairman of the IEEE-EURASIP NSIP (Nonlinear Signal and Image Processing) Board. He is currently the Technical Committee Chairman of the EC COST 211quat. He served as Associate Editor of the IEEE Transactions on Image Processing, and was Guest Editor of the European Journal Signal Processing, Special Issue on Nonlinear Digital Signal Processing (August 1994). He is the Chairman of the IEEE Finland Section and past Chair of the IEEE Circuits and Systems Society, Technical Committee on Digital Signal Processing, and the IEEE SP/CAS Finland Chapter. He was also the TPC Chair of EUSIPCO 2000, the DSP Track Chair of the 1996 IEEE ISCAS, and the Program Chair of NORSIG '96. He is also a member of the EURASIP AdCom. Dr. Gabbouj is the Director of the International University Program in Information Technology and a member of the Council of the Department of Information Technology at Tampere University of Technology. He is also the Secretary of the International Advisory Board of the Tampere International Center of Signal Processing, TICSP. He is a member of Eta Kappa Nu, Phi Kappa Phi, and the IEEE SP and CAS societies. Dr. Gabbouj was co-recipient of the Myril B. Reed Best Paper Award from the 32nd Midwest Symposium on Circuits and Systems and co-recipient of the NORSIG 94 Best Paper Award from the 1994 Nordic Signal Processing Symposium.

Faouzi Alaya Cheikh received his B.S. degree in electrical engineering in 1992 from Ecole Nationale d'Ingenieurs de Tunis, Tunisia. He received his M.S. degree in electrical engineering (Major in Signal Processing) from Tampere University of Technology, Finland, in 1996. Mr. Alaya Cheikh is currently a Ph.D. candidate and works as a Researcher at the Institute of Signal Processing, Tampere University of Technology, Tampere, Finland. From 1994 to 1996, he was a Research Assistant at the Institute of Signal Processing, and since 1997 he has been a Researcher with the same institute. His research interests include nonlinear signal and image processing and analysis, pattern recognition, and content-based analysis and retrieval. He has been an active member of many Finnish and European research projects, among them Nobless Esprit, COST 211quat, and MUVI. He served as Associate Editor of the EURASIP Journal on Applied Signal Processing, Special Issue on Image Analysis for Multimedia Interactive Services. He serves as a reviewer for several conferences and journals. He has co-authored over 30 publications.

Bogdan Cramariuc received his M.S. degree in electrical engineering in 1993 from the Polytechnica University of Bucharest, Faculty of Electronics and Telecommunications, Bucharest, Romania. Mr. Cramariuc is currently a Ph.D. candidate and works as a Researcher for the Institute of Signal Processing at Tampere University of Technology, Tampere, Finland. From 1993 to 1994 he worked as a Teaching Assistant at the Faculty of Electronics and Telecommunications at the Polytechnica University of Bucharest. During this period he was also involved as a Researcher with Electrostatica S.A., a national research institute in Bucharest, Romania. Since 1995 he has been with the Institute of Signal Processing at Tampere University of Technology, Tampere, Finland. His research interests include signal and image analysis, image segmentation, texture analysis, content-based indexing and retrieval in multimedia databases, mathematical morphology, computer vision, parallel processing, data mining, and artificial intelligence. Mr. Cramariuc has been an active member of several Finnish and European projects, such as Nobless, Esprit, and MUVI. He served as Associate Editor of the EURASIP Journal on Applied Signal Processing, Special Issue on Image Analysis for Multimedia Interactive Services.

Geoff Morrison graduated from the University of Cambridge, UK, and joined the British Post Office Research Department. He worked on analogue video transmission systems, processing, and switching, mainly for videoconferencing and videotelephony services. Subsequently his research activities centered on digital video. After a six-month secondment to NTT Laboratories in Japan, he was an active contributor to the CCITT group which developed Recommendation H.261. Simultaneously he led a group at BT Labs which constructed the first European real-time hardware implementation of it. His theoretical and practical knowledge of video compression contributed to MPEG-1 and MPEG-2, where he chaired the Implementation Studies Group for several years. He also participated in many European collaborative projects, including COST 211bis through to the current COST 211quat, which he chairs. Geoff gained his doctorate in 1997 from the University of Waseda in Tokyo following a secondment there. He is an Honorary Fellow of the University of Essex. Currently he is Senior Research Advisor in the Content and Coding Laboratory of BTexact Technologies.


EURASIP Journal on Applied Signal Processing 2002:4, 343–353
© 2002 Hindawi Publishing Corporation

Overview of the MPEG-7 Standard and of Future Challenges for Visual Information Analysis

Philippe Salembier
Universitat Politecnica de Catalunya, Campus Nord, Modulo D5, Jordi Girona, 1-3, 08034 Barcelona, Spain
Email: [email protected]

Received 31 July 2001

This paper presents an overview of the MPEG-7 standard: the Multimedia Content Description Interface. It focuses on visual information description, including low-level visual Descriptors and Segment Description Schemes. The paper also discusses some challenges in visual information analysis that will have to be faced in the future to allow efficient MPEG-7-based applications.

Keywords and phrases: MPEG-7, indexing, search, retrieval, browsing, navigation, multimedia, description schemes, descriptors, search engine.

1. INTRODUCTION

The goal of the MPEG-7 standard is to allow interoperable searching, indexing, filtering, and access of audio-visual (AV) content by enabling interoperability among devices and applications that deal with AV content description. MPEG-7 specifies the description of features related to the AV content as well as information related to the management of AV content. As illustrated in Figure 1, the scope of the standard is to define the representation of the description, that is, the syntax and the semantics of the structures used to create MPEG-7 descriptions. For most description tools, the standard does not provide normative tools for the generation nor for the consumption of the description. This is not necessary to guarantee interoperability and, moreover, this allows future improvements to be included in MPEG-7 compliant applications. However, as will be discussed in this paper, in order to guarantee interoperability for some low-level features, MPEG-7 also specifies part of the extraction process.

MPEG-7 descriptions take two possible forms: (1) a textual XML form suitable for editing, searching, filtering, and browsing and (2) a binary form suitable for storage, transmission, and streaming. Overall, the standard specifies four types of normative elements illustrated in Figure 2: Descriptors, Description Schemes (DSs), a Description Definition Language (DDL), and coding schemes.

Figure 1: Scope of the MPEG-7 standard (description generation and consumption lie outside the normative scope).

In order to describe AV content, a set of Descriptors has to be used. In MPEG-7, a Descriptor defines the syntax and the semantics of an elementary feature. A Descriptor can deal with low-level features, which represent the signal characteristics, such as color, texture, shape, motion, audio energy, or audio spectrum, as well as high-level features such as the title or the author. The main constraint on a Descriptor is that it should describe an elementary feature. In MPEG-7, the syntax of Descriptors is defined by the Description Definition Language (DDL), which is an extension of the XML Schema language [1]. The DDL is used not only to define the syntax of MPEG-7 Descriptors but also to allow developers to declare the syntax of new Descriptors that are related to the specific needs of their application.

In general, the description of AV content involves a large number of Descriptors. The Descriptors are structured and related within a common framework based on Description Schemes (DSs). As shown in Figure 2, the DSs define a model of the description using the Descriptors as building blocks. The syntax of DSs is also defined with the DDL and, for specific applications, new DSs can also be created.

Figure 2: Main components of the MPEG-7 standard (Descriptors, Description Schemes, the Description Definition Language, and encoding and delivery).

Figure 3: Overview of the MPEG-7 Multimedia DSs (basic elements, content management, content description, navigation and access, content organization, and user interaction).

When the set of DSs and Descriptors is instantiated to describe a piece of AV content, the resulting description takes the form of an XML document [2]. This is the first normative format in MPEG-7. This format is very efficient for editing, searching, filtering, and processing. Moreover, a very large number of XML-aware tools are available. However, XML documents are verbose, difficult to stream, and not resilient with respect to transmission errors. To solve these problems, MPEG-7 defines a binary format (BiM: Binary format for MPEG-7) and the corresponding encoding and decoding tools. This second format is particularly efficient in terms of compression and streaming functionality. Note that the XML and BiM representations are equivalent and can be encoded and decoded losslessly.
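To make the textual form concrete, here is a minimal Python sketch that builds a toy description reusing the scene/time/camera/annotation element names sketched in Figure 2. These element names are illustrative only, not the normative MPEG-7 schema, and BiM is not reproduced; the final line simply measures how much of the serialization is markup rather than content, which is the overhead that motivates a binary form.

```python
# Toy illustration of the textual (XML) form of a description. The element
# names follow the scene fragment sketched in Figure 2 and are illustrative
# only; they are not the normative MPEG-7 schema, and BiM is not reproduced.
import xml.etree.ElementTree as ET

scene = ET.Element("scene", id="1")
ET.SubElement(scene, "time").text = "00:01:30"
ET.SubElement(scene, "camera").text = "panning"
ET.SubElement(scene, "annotation").text = "Dribble and kick"

textual = ET.tostring(scene, encoding="unicode")
print(textual)

# Rough measure of the markup overhead that motivates a binary encoding.
content = "".join(scene.itertext())
print(len(textual), "characters in total,", len(content), "of element content")
```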

The objective of this paper is to provide an overview of the MPEG-7 DSs and Descriptors, focusing on the visual aspects (Section 2), and then to discuss a set of visual information analysis challenges that could be studied to lead to very efficient MPEG-7-based applications (Section 3).

2. OVERVIEW OF MPEG-7

2.1. Multimedia Description Schemes

Figure 3 provides an overview of the organization of the Multimedia DSs into different functional areas: Basic Elements, Content Management, Content Description, Navigation and Access, Content Organization, and User Interaction. The MPEG-7 DSs can be considered as a library of description tools and, in practice, an application should select an appropriate subset of relevant DSs. This section discusses each of the different functional areas of the Multimedia DSs.

2.1.1 Basic elements

The first set of DSs can be seen in the lower part of Figure 3. They are called Basic Elements because they provide elementary description functions and are intended to be used as building blocks for descriptions or DSs.


MPEG-7 provides a number of Schema tools that assist in the formation, packaging, and annotation of MPEG-7 descriptions. An MPEG-7 description begins with a root element that signifies whether the description is complete or partial. A complete description provides a complete, stand-alone description of AV content for an application. On the other hand, a description unit carries only partial or incremental information that possibly adds to an existing description. In the case of a complete description, an MPEG-7 top-level element follows the root element. The top-level element orients the description around a specific description task, such as the description of a particular type of AV content (for instance an image, video, audio, or multimedia) or a particular function related to content management (such as creation, usage, summarization, etc.). The top-level elements collect together the appropriate tools for carrying out the specific description task.

In the case of description units, the root element can be followed by an instance of an arbitrary MPEG-7 DS or Descriptor. Unlike a complete description, which usually contains a "semantically complete" MPEG-7 description, a description unit can be used to send a partial description as required by an application, such as the description of a place, a shape descriptor, a texture descriptor, and so on. It is also used to define an elementary piece of information to be transported or streamed in case the complete description is too large.

Beside the schema tools, a number of basic elements are used as fundamental constructs in defining the MPEG-7 DSs. The basic data types provide a set of extended data types and mathematical structures such as vectors and matrices, which are needed by the DSs for describing AV content. The basic elements also include constructs for linking media files, localizing pieces of content, and describing time, places, persons, individuals, groups, organizations, textual annotation (including free text, structured annotation, or annotation with syntactic dependency), classification schemes, and controlled terms.

2.1.2 Content management

MPEG-7 provides DSs for AV content management. They describe information that generally cannot be perceived in the content itself but that is of vital importance for many applications. These tools describe the following information: (1) creation and production, (2) media coding, storage, and file formats, and (3) content usage.

The Creation Information provides a title (which may itself be textual or another piece of AV content) and information such as creators, creation locations, and dates. It also includes classification information describing how the AV material may be categorized into genre, subject, purpose, language, and so forth. Finally, review and guidance information such as age classification, parental guidance, and subjective reviews are also given.

The Media Information describes the storage media, including the format, the compression, and the coding of the AV content. The Media Information identifies the master media, which is the original source from which different instances of the AV content are produced. The instances of the AV content are referred to as Media Profiles, which are versions of the master obtained by using different encodings, or storage and delivery formats. Each Media Profile is described individually in terms of the encoding parameters, storage media information, and location.

The Usage Information describes usage rights, usage record, and financial information. The rights information is not explicitly included in the MPEG-7 description; instead, links are provided to the rights holders and to other information related to rights management and protection.

2.1.3 Content Description: structural aspects

In MPEG-7, Content Description refers to information that can be perceived in the content. Two different viewpoints are provided: the first one emphasizes the structural aspects of the signal, whereas the second one focuses on the conceptual aspects of the content. This section presents the structural aspects in some detail. Conceptual aspects will be briefly discussed in Section 2.1.4.

The description of the structure of the AV content relies on the notion of segments. The Segment DS describes the result of a spatial, temporal, or spatio-temporal partitioning of the AV content. It can describe a hierarchical decomposition resulting in a segment tree. Moreover, the SegmentRelation DS describes additional relationships among segments and allows the creation of graphs.

The Segment DS forms the base type of the different specialized segment types such as audio segments, video segments, audio-visual segments, moving regions, and still regions. As a result, a segment may have spatial and/or temporal properties. For example, the AudioSegment DS describes a temporal interval of an audio sequence. The VideoSegment DS describes a set of video frames. The AudioVisualSegment DS describes a combination of audio and visual information, such as a video with synchronized audio. The StillRegion DS describes a region of an image or a frame in a video. Finally, the MovingRegion DS describes a moving region of a video sequence.

There also exists a set of specialized segments for specific types of AV content. For example, the Mosaic DS is a specialized type of StillRegion. It describes a mosaic or panoramic view of a video segment [3]. The VideoText DS is a subclass of the MovingRegion DS and describes a region of video content corresponding to text or captions. This includes superimposed text as well as text appearing in the scene. Another example of a specialized DS is the InkSegment DS, which describes a segment of an electronic ink document created by a pen-based system or an electronic white-board.

The Segment DS contains elements and attributes that are common to the different segment types. Among the common properties of segments is information related to creation, usage, media location, and text annotation. The Segment DS can be used to describe segments that are not necessarily connected, but composed of several nonconnected components. Connectivity refers here to both the spatial and temporal domains. A temporal segment (VideoSegment, AudioSegment, and AudioVisualSegment) is said to be temporally connected if it is a sequence of continuous video frames or audio samples. A spatial segment (StillRegion) is said to be spatially connected if it is a group of connected pixels. A spatio-temporal segment (MovingRegion) is said to be spatially and temporally connected if the temporal segment where it is instantiated is temporally connected and if each one of its temporal instantiations in frames is spatially connected. (Note that this is not the classical connectivity in a 3D space.)

Figure 4: Examples of segments: (a) and (b) segments composed of one single connected component; (c) and (d) segments composed of three connected components.

Figure 4 illustrates several examples of temporal or spatial segments and their connectivity. Figures 4a and 4b illustrate a temporal and a spatial segment composed of a single connected component. Figures 4c and 4d illustrate a temporal and a spatial segment composed of three connected components. Note that, in all cases, the Descriptors and DSs attached to the segment are global to the union of the connected components building the segment. At this level, it is not possible to describe the connected components of the segment individually. If connected components have to be described individually, then the segment has to be decomposed into various subsegments corresponding to its individual connected components.

The Segment DS may be subdivided into subsegments, and thus may form a hierarchy (tree). The resulting segment tree is used to describe the media source and the temporal and/or spatial structure of the AV content. For example, a video program may be temporally segmented into various levels of scenes, shots, and micro-segments. A table of contents may thus be generated based on this structure. Similar strategies can be used for spatial and spatio-temporal segments.

Figure 5: Examples of Segment Decomposition: (a) and (b) Segment Decompositions without gaps nor overlaps; (c) and (d) Segment Decompositions with gap or overlap.

A segment may also be decomposed into various media sources, such as various audio tracks or viewpoints from several cameras. The hierarchical decomposition is useful to design efficient search strategies (from global search to local search). It also allows the description to be scalable: a segment may be described by its direct set of Descriptors and DSs, but it may also be described by the union of the Descriptors and DSs that are related to its subsegments. Note that a segment may be subdivided into subsegments of different types; for example, a video segment may be decomposed into moving regions that are themselves decomposed into still regions.

The decomposition is described by a set of attributes defining the type of subdivision: temporal, spatial, spatio-temporal, or media source. Moreover, the spatial and temporal subdivisions may leave gaps and overlaps between the subsegments. Several examples of decompositions are described for temporal segments in Figure 5. Figures 5a and 5b describe two examples of decompositions without gaps nor overlaps (partitions in the mathematical sense). In both cases the union of the children corresponds exactly to the temporal extension of the parent, even if the parent is itself nonconnected (see the example of Figure 5b). Figure 5c shows an example of decomposition with gaps but no overlaps. Finally, Figure 5d illustrates a more complex case where the parent is composed of two connected components and its decomposition creates three children: the first one is itself composed of two connected components, while the two remaining children are composed of a single connected component.
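As an illustration of the gap and overlap attributes, the following sketch checks a temporal decomposition of a possibly nonconnected parent segment against its children. The interval representation and helper names are hypothetical, not part of the standard.

```python
# Sketch: check whether a temporal decomposition leaves gaps or overlaps.
# Segments are unions of [start, end) intervals; names are hypothetical.

def merge(intervals):
    """Merge a list of [start, end) intervals into connected components."""
    out = []
    for s, e in sorted(intervals):
        if out and s <= out[-1][1]:
            out[-1][1] = max(out[-1][1], e)
        else:
            out.append([s, e])
    return out

def decomposition_flags(parent, children):
    """Return (has_gap, has_overlap) of a set of child segments w.r.t. a parent."""
    child_intervals = [iv for child in children for iv in child]
    covered = merge(child_intervals)
    # Overlap: total child length exceeds the length of their union.
    total = sum(e - s for s, e in child_intervals)
    union = sum(e - s for s, e in covered)
    has_overlap = total > union
    # Gap: the union of the children does not cover the whole parent.
    parent_len = sum(e - s for s, e in merge(parent))
    has_gap = union < parent_len
    return has_gap, has_overlap

# A parent made of two connected components, in the spirit of Figure 5.
parent = [(0, 40), (60, 100)]
children = [[(0, 40)], [(60, 80)], [(75, 100)]]   # gap-free but overlapping
print(decomposition_flags(parent, children))       # (False, True)
```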

Figure 6: Examples of image description with still regions (still regions SR1–SR8 with their instantiated features, such as shape, color histogram, and textual annotation, and decomposition steps marked as with or without gaps and overlaps).

The decomposition allows gaps and overlaps. Note that, in any case, the decomposition implies that the union of the spatio-temporal space defined by the children segments is included in the spatio-temporal space defined by their ancestor segment (children are contained in their ancestors).

As described above, any segment may be described by creation information, usage information, media information, and textual annotation. However, specific low-level features depending on the segment type are also allowed. An example of image description is illustrated in Figure 6. The original image is described as a StillRegion, SR1, which is described by creation information (title, creator), usage information (copyright), and media information (file format), as well as a textual annotation (summarizing the image content), a color histogram, and a texture descriptor. This initial region can be further decomposed into individual regions. For each decomposition step, we indicate whether gaps and overlaps are present. The segment tree is composed of eight StillRegions (note that SR8 is a single segment made of two connected components). For each region, Figure 6 shows the type of feature that is instantiated. Note that it is not necessary to repeat the creation, usage, and media information in the tree hierarchy, since the child segments are assumed to inherit their parent's values (unless re-instantiated).
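The inheritance of creation, usage, and media information down a still-region tree such as the one in Figure 6 can be sketched as follows. The class and field names are hypothetical, not normative MPEG-7 types, and the small tree built here does not reproduce the exact decomposition of Figure 6.

```python
# Sketch of a still-region tree in the spirit of Figure 6.
# Class and field names are hypothetical, not normative MPEG-7 types.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class StillRegion:
    name: str
    features: set = field(default_factory=set)    # instantiated visual features
    metadata: dict = field(default_factory=dict)  # creation, usage, media info
    children: list = field(default_factory=list)
    parent: Optional["StillRegion"] = None

    def add(self, child: "StillRegion") -> "StillRegion":
        child.parent = self
        self.children.append(child)
        return child

    def meta(self, key):
        """Child segments inherit metadata from ancestors unless re-instantiated."""
        node = self
        while node is not None:
            if key in node.metadata:
                return node.metadata[key]
            node = node.parent
        return None

sr1 = StillRegion("SR1", features={"color histogram", "texture", "text annotation"},
                  metadata={"creation": "title, creator", "usage": "copyright",
                            "media": "file format"})
sr2 = sr1.add(StillRegion("SR2", features={"shape", "color histogram", "text annotation"}))
sr8 = sr2.add(StillRegion("SR8", features={"color histogram", "text annotation"}))
print(sr8.meta("usage"))   # inherited from SR1 -> 'copyright'
```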

The description of the content structure is not constrained to rely on trees. Although hierarchical structures such as trees are adequate for efficient access, retrieval, and scalable description, they imply constraints that may make them inappropriate for certain applications. In such cases, the SegmentRelation DS has to be used. The graph structure is defined very simply by a set of nodes, each corresponding to a segment, and a set of edges, each corresponding to a relationship between two nodes. To illustrate the use of graphs, consider the example shown in Figure 7.

This example shows an excerpt from a soccer match.

Two video segments, one still region, and three moving regions are considered. A possible graph describing the structure of the content is shown in Figure 7. The video segment Dribble & Kick involves the Ball, the Goalkeeper, and the Player. The Ball remains close to the Player, who is moving toward the Goalkeeper. The Player appears on the right of the Goalkeeper. The Goal Score video segment involves the same moving regions plus the still region called Goal. In this part of the sequence, the Player is on the left of the Goalkeeper and the Ball moves toward the Goal. This very simple example illustrates the flexibility of this kind of representation. Note that this description is mainly structural because the relations specified in the graph edges are purely physical and the nodes represent segments (still and moving regions in this example). The only explicit semantic information is available from the textual annotation (where keywords such as Ball, Player, or Goalkeeper can be specified).
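A minimal sketch of such a relation graph for the soccer example follows; the node labels and relation vocabulary are illustrative, not the normative SegmentRelation types.

```python
# Sketch of the soccer-sequence relation graph in the spirit of Figure 7.
# Node labels and relation names are illustrative, not normative.
from collections import defaultdict

class SegmentGraph:
    def __init__(self):
        self.edges = defaultdict(list)          # node -> [(relation, node), ...]

    def relate(self, source, relation, target):
        self.edges[source].append((relation, target))

    def relations_of(self, node):
        return self.edges[node]

g = SegmentGraph()
g.relate("VS:Dribble&Kick", "is composed of", "MR:Ball")
g.relate("VS:Dribble&Kick", "is composed of", "MR:Player")
g.relate("VS:Dribble&Kick", "is composed of", "MR:Goalkeeper")
g.relate("MR:Ball", "is close to", "MR:Player")
g.relate("MR:Player", "is right of", "MR:Goalkeeper")
g.relate("MR:Player", "moves toward", "MR:Goalkeeper")
g.relate("VS:GoalScore", "is composed of", "SR:Goal")
g.relate("MR:Ball", "moves toward", "SR:Goal")

print(g.relations_of("MR:Player"))
```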

2.1.4 Content Description: conceptual aspects

For some applications, the viewpoint described in Section 2.1.3 is not appropriate because it highlights the structural aspects of the content. For applications where the structure is of no real use, but where the user is mainly interested in the semantics of the content, an alternative approach is provided by the Semantic DS. In this approach, the emphasis is not on segments but on events, objects in narrative worlds, concepts, and abstractions. As shown in Figure 8, the SemanticBase DS describes narrative worlds and semantic entities in a narrative world. In addition, a number of specialized DSs are derived from the generic SemanticBase DS, which describe specific types of semantic entities, such as narrative worlds, objects, agent objects, events, places, time, and abstractions.

Figure 7: Example of a segment graph.

Figure 8: Tools for the description of conceptual aspects.

As in the case of the Segment DS, the conceptual aspects of the description can be organized in a tree or in a graph. The graph structure is defined by a set of nodes, representing semantic notions, and a set of edges specifying the relationships between the nodes. Edges are described by the SemanticRelation DSs.

Finally, as an example of the combination of structural and conceptual aspects, Figure 9 illustrates the description of a video sequence inspired by the classical way of describing the content of written documents such as books: the table of contents and the index [4]. The table of contents is a hierarchical representation that splits the document into elementary pieces (chapters, sections, subsections, etc.). The order in which the items are presented follows the linear structure of the book itself. As a result, the table of contents is a representation of the linear, one-dimensional structure of the book. The goal of the index is not to define the linear structure of the book, but to define a set of potentially interesting items and to provide references to the book sections where these items are discussed. In order to be of practical interest to human users, the items are selected based on their semantic value. In many cases, the index is also presented in a hierarchical fashion to allow fast access to the item of interest for the user.

Figure 9 shows an example of a video description. The description involves two hierarchical structures represented by trees. The first one is devoted to the structural aspects and is based on a segment tree, whereas the second one describes what is happening, that is, the conceptual aspects, and is termed the Event Tree. The links from the Event Tree to the Segment Tree relate semantic notions (events) with one or several occurrences of these notions in time. As a result, the description is itself a graph built around two trees.

Figure 9: Example of table of contents and index combining structural and conceptual aspects (the segment tree of a news program linked to an event tree of items such as Introduction, News Items, Sports, and Closing).

2.1.5 Navigation and access

MPEG-7 facilitates navigation and access of AV content by describing summaries, views, and variations. The Summary DS describes semantically meaningful summaries and abstracts of AV content. The summary descriptions allow the AV content to be navigated in either a hierarchical or a sequential fashion. The HierarchicalSummary DS describes the organization of summaries into multiple levels of detail. The main navigation mode is from coarse to fine and vice versa. Note that the hierarchy may be based on the quantity of information (e.g., a few key-frames for a coarse representation versus a large number of key-frames for a fine representation) or on specific features (e.g., only the most important events are highlighted in the coarse representation, whereas a large number of less important events may be shown in the fine representation).

The SequentialSummary DS describes a summary consisting of a sequence of images or video frames, which is possibly synchronized with audio and text. The SequentialSummary may also contain a sequence of audio clips. The main navigation mode is linear (forward-backward).

The view DS describes structural views of the AV signals in the space or frequency domain in order to enable multi-resolution access and progressive retrieval.

Finally, the Variation DS describes relationships between different variations of AV programs. The variations of the AV content include compressed or low-resolution versions, summaries, different languages, and different modalities, such as audio, video, image, text, and so forth. One of the targeted functionalities is to allow a server or proxy to select the most suitable variation of the AV content for delivery according to the capabilities of terminal devices, network conditions, or user preferences.
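A server-side selection of the most suitable variation, in the spirit of the Variation DS, might look like the following sketch; the variation records and capability fields are hypothetical, not normative.

```python
# Sketch: pick the most suitable variation of an AV program for a terminal.
# The variation records and capability fields are hypothetical.

variations = [
    {"modality": "video", "bitrate_kbps": 4000, "fidelity": 1.0},   # original
    {"modality": "video", "bitrate_kbps": 500,  "fidelity": 0.6},   # low-resolution version
    {"modality": "image", "bitrate_kbps": 50,   "fidelity": 0.3},   # key-frame summary
    {"modality": "text",  "bitrate_kbps": 1,    "fidelity": 0.1},   # textual summary
]

def select_variation(variations, max_kbps, supported_modalities):
    usable = [v for v in variations
              if v["bitrate_kbps"] <= max_kbps
              and v["modality"] in supported_modalities]
    # Among the variations the terminal can handle, keep the most faithful one.
    return max(usable, key=lambda v: v["fidelity"]) if usable else None

print(select_variation(variations, max_kbps=600,
                       supported_modalities={"video", "image", "text"}))
```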

2.1.6 Content Organization

The Content Organization is built around two main DSs: the Collection DS and the Model DS. The Collection DS includes tools for describing collections of AV material, collections of AV content descriptions, collections of semantic concepts, mixed collections (content, descriptions, and concepts), and collection structures in terms of the relationships among collections.

The Model DS describes parametrized models of AV content, Descriptors, or collections. It involves two important DSs: the ProbabilityModel DS and the AnalyticModel DS. The ProbabilityModel DS describes different statistical functions and probabilistic structures, which can be used to describe samples of AV content and classes of Descriptors using statistical approximation. The AnalyticModel DS describes a collection of examples of AV content or clusters of Descriptors that are used to provide a model for a particular semantic class. For example, a collection of art images labeled with a tag indicating that the paintings are examples of the impressionist period forms an analytic model. The AnalyticModel DS also optionally describes the confidence with which the semantic labels are assigned.

2.1.7 User Interaction

The UserInteraction DS describes preferences of users pertaining to the consumption of the AV content, as well as usage history. The MPEG-7 AV content descriptions can be matched to the preference descriptions in order to select and personalize AV content for more efficient and effective access, presentation, and consumption. The UserPreference DS describes preferences for different types of content and modes of browsing, including context dependency in terms of time and place. The UsageHistory DS describes the history of actions carried out by a user of a multimedia system. The usage history descriptions can be exchanged between consumers, their agents, content providers, and devices, and may in turn be used to determine the user's preferences with regard to AV content.

2.2. Visual features

The low-level visual features described in MPEG-7 are color, texture, shape and localization, motion, and low-level face characterization. With respect to the Multimedia DSs described in Section 2.1, the Descriptors or DSs that handle low-level visual features are to be considered as a characterization of segments. Not all Descriptors and DSs are appropriate for all segments, and the set of allowable Descriptors or DSs for each segment type is defined by the standard. This section summarizes the most important description tools dealing with low-level visual features.

2.2.1 Color feature

MPEG-7 has standardized eight color Descriptors: Color space, Color quantization, Dominant colors, Scalable color histogram, Color structure, Color layout, and GoF/GoP color. The first two Descriptors, color space and quantization, are intended to be used in conjunction with other color Descriptors. Possible color spaces include {R, G, B}, {Y, Cr, Cb}, {H, S, V}, monochrome, and any linear combination of {R, G, B}. The color quantization supports linear and nonlinear quantizers as well as lookup tables.

The DominantColor Descriptor is suitable for representing local features where a small number of colors are enough to characterize the color information in the region of interest. It can also be used for whole images. The Descriptor defines the set of dominant colors, the percentage of each color in the region of interest, and, optionally, the spatial coherence. This Descriptor is mainly used in retrieval by similarity.

The ScalableColorHistogram Descriptor represents a color histogram in the {H, S, V} color space. The histogram is encoded with a Haar transform to provide scalability in terms of bin numbers and accuracy. It is particularly attractive for image-to-image matching and color-based retrieval.
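A rough sketch of that idea follows: build an HSV histogram and run a 1-D Haar decomposition over the bins, so that a coarser descriptor is obtained by keeping only a prefix of the coefficients. Bin counts, normalization, and the Haar formulation are illustrative, not the normative extraction.

```python
# Sketch: HSV histogram encoded with a 1-D Haar transform so that coarser
# versions can be obtained by keeping fewer coefficients. Illustrative only.
import numpy as np

def haar_1d(x):
    """One full 1-D Haar decomposition of a length-2^k vector."""
    out = np.asarray(x, dtype=float).copy()
    n = len(out)
    while n > 1:
        half = n // 2
        sums = (out[0:n:2] + out[1:n:2]) / 2.0
        diffs = (out[0:n:2] - out[1:n:2]) / 2.0
        out[:half], out[half:n] = sums, diffs
        n = half
    return out

def scalable_color_sketch(hsv_image, bins=(16, 4, 4)):
    """hsv_image: H in [0, 360), S and V in [0, 1]; returns Haar coefficients."""
    h, s, v = hsv_image[..., 0], hsv_image[..., 1], hsv_image[..., 2]
    hist, _ = np.histogramdd(
        np.stack([h.ravel(), s.ravel(), v.ravel()], axis=1),
        bins=bins, range=((0, 360), (0, 1), (0, 1)))
    hist = hist.ravel() / hist.sum()          # 256-bin normalized histogram
    return haar_1d(hist)                      # low-pass coefficients come first

rng = np.random.default_rng(0)
hsv = np.stack([rng.uniform(0, 360, (32, 32)),
                rng.uniform(0, 1, (32, 32)),
                rng.uniform(0, 1, (32, 32))], axis=-1)
coeffs = scalable_color_sketch(hsv)
print(coeffs.shape, coeffs[:4])               # keep only a prefix for a coarse descriptor
```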

The ColorStructure Descriptor captures both color content and its structure. Its main functionality is image-to-image matching. The extraction method essentially computes the relative frequency of 8 × 8 windows that contain a particular color. Therefore, unlike a color histogram, this Descriptor can distinguish between two images in which a given color is present with the same probability but where the structures of the corresponding pixels are different.
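A sketch of that window-counting idea, assuming a pre-quantized color index image; quantization, window handling, and stride are illustrative rather than normative.

```python
# Sketch of the colour-structure idea: count, for each quantized colour,
# the number of 8x8 windows in which it appears at least once, rather than
# the number of pixels. Quantization and window stride are illustrative.
import numpy as np

def color_structure_sketch(quantized, n_colors, window=8):
    """quantized: 2-D array of colour indices in [0, n_colors)."""
    h, w = quantized.shape
    counts = np.zeros(n_colors)
    n_windows = 0
    for y in range(0, h - window + 1):
        for x in range(0, w - window + 1):
            patch = quantized[y:y + window, x:x + window]
            counts[np.unique(patch)] += 1       # each colour counted once per window
            n_windows += 1
    return counts / n_windows                   # relative window frequency

rng = np.random.default_rng(0)
img = rng.integers(0, 32, size=(64, 64))        # 32 quantized colours
print(color_structure_sketch(img, n_colors=32)[:8])
```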

The ColorLayout Descriptor specifies the spatial distribution of colors for high-speed retrieval and browsing. It targets not only image-to-image matching and video-clip-to-video-clip matching, but also layout-based retrieval for color, such as sketch-to-image matching, which is not supported by other color Descriptors. The Descriptor represents the DCT values of an image or a region that has been previously partitioned into 8 × 8 blocks and where each block is represented by its dominant color.
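A sketch of the layout idea follows, using block averages as a stand-in for the dominant color of each block and an illustrative low-frequency coefficient selection.

```python
# Sketch of the colour-layout idea: shrink the picture to an 8x8 grid of
# representative colours and keep a few 2-D DCT coefficients per channel.
# Block averages stand in for dominant colours; selection is illustrative.
import numpy as np
from scipy.fft import dctn

def color_layout_sketch(image, n_coeffs=6):
    """image: (H, W, 3) float array; returns a (3, n_coeffs) descriptor."""
    h, w, _ = image.shape
    grid = np.zeros((8, 8, 3))
    ys = np.linspace(0, h, 9, dtype=int)
    xs = np.linspace(0, w, 9, dtype=int)
    for i in range(8):
        for j in range(8):
            block = image[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            grid[i, j] = block.reshape(-1, 3).mean(axis=0)
    desc = []
    for c in range(3):
        coeffs = dctn(grid[:, :, c], norm="ortho")
        # Keep low-frequency coefficients in a simple zig-zag-like order.
        order = sorted(((i + j, i, j) for i in range(8) for j in range(8)))
        desc.append([coeffs[i, j] for _, i, j in order[:n_coeffs]])
    return np.array(desc)

rng = np.random.default_rng(0)
print(color_layout_sketch(rng.random((120, 160, 3))).shape)   # (3, 6)
```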

The last color Descriptor is the GroupOfFrames/GroupOfPicturesColor Descriptor. It extends the ScalableColorHistogram Descriptor defined for still images to video sequences or collections of still images. The extension describes how the individual histograms computed for each image have been combined: by average, median, or intersection. It has been shown that this information allows the matching between VideoSegments to be more accurate.
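The three aggregation modes can be sketched directly; the bin layout and normalization are illustrative.

```python
# Sketch of GoF/GoP colour: combine per-frame histograms of a video segment
# by average, median, or intersection before matching. Illustrative only.
import numpy as np

def gof_color_sketch(frame_histograms, aggregation="average"):
    """frame_histograms: (n_frames, n_bins) array of normalized histograms."""
    h = np.asarray(frame_histograms, dtype=float)
    if aggregation == "average":
        return h.mean(axis=0)
    if aggregation == "median":
        return np.median(h, axis=0)
    if aggregation == "intersection":
        return h.min(axis=0)          # bin-wise minimum over the frames
    raise ValueError("unknown aggregation")

rng = np.random.default_rng(0)
hists = rng.random((30, 64))
hists /= hists.sum(axis=1, keepdims=True)
for mode in ("average", "median", "intersection"):
    print(mode, gof_color_sketch(hists, mode)[:3])
```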

2.2.2 Texture feature

There are three texture Descriptors: Homogeneous Texture, Texture Browsing, and Edge Histogram. Homogeneous texture has emerged as an important visual primitive for searching and browsing through large collections of similar-looking patterns. The HomogeneousTexture Descriptor provides a quantitative representation. The extraction relies on a frequency decomposition with a filter bank based on Gabor functions. The frequency bands are defined by a scale parameter and an orientation parameter. The first and second moments of the energy in the frequency bands are then used as the components of the Descriptor. The number of filters used is 5 × 6 = 30, where 5 is the number of scales and 6 is the number of orientations used in the Gabor decomposition.
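A sketch of that filter-bank idea follows; the Gabor kernel parameters (center frequencies, bandwidth, kernel size) are illustrative and do not reproduce the normative filter definitions.

```python
# Sketch of the homogeneous-texture idea: filter the image with a Gabor-like
# bank (5 scales x 6 orientations) and keep the first and second moments of
# the energy in each band. Kernel parameters here are illustrative only.
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(frequency, theta, sigma=4.0, size=31):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * frequency * xr)

def homogeneous_texture_sketch(image, n_scales=5, n_orients=6):
    image = image.astype(float)
    features = []
    for s in range(n_scales):
        frequency = 0.025 * (2 ** s)            # octave-spaced centre frequencies
        for o in range(n_orients):
            theta = np.pi * o / n_orients
            response = fftconvolve(image, gabor_kernel(frequency, theta), mode="same")
            energy = response ** 2
            features += [energy.mean(), energy.std()]   # 1st and 2nd moments
    return np.array(features)                           # 30 bands x 2 = 60 values

rng = np.random.default_rng(0)
print(homogeneous_texture_sketch(rng.random((64, 64))).shape)   # (60,)
```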

The TextureBrowsing Descriptor provides a qualitative representation of the texture similar to a human characterization, in terms of dominant direction, regularity, and coarseness. It is useful for texture-based browsing applications. The Descriptor represents one or two dominant directions and, for each dominant direction, the regularity (four possible levels) and the coarseness (four possible values) of the texture.

The EdgeHistogram Descriptor represents the histogram of five possible types of edges, namely four directional edges and one nondirectional edge. The Descriptor primarily targets image-to-image matching (query by example or by sketch), especially for natural images with nonuniform edge distribution.
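A sketch of the underlying block-classification idea follows; the 2 × 2 operators and the threshold used here are common approximations found in the literature, not the normative definition, and the sub-image partition of the standard is omitted.

```python
# Sketch of the edge-histogram idea: classify 2x2 blocks into one of five edge
# types (vertical, horizontal, 45, 135, non-directional) and histogram the
# labels. The operators and threshold are approximations, not normative.
import numpy as np

EDGE_FILTERS = {
    "vertical":        np.array([[1, -1], [1, -1]], dtype=float),
    "horizontal":      np.array([[1, 1], [-1, -1]], dtype=float),
    "diag_45":         np.array([[np.sqrt(2), 0], [0, -np.sqrt(2)]]),
    "diag_135":        np.array([[0, np.sqrt(2)], [-np.sqrt(2), 0]]),
    "non_directional": np.array([[2, -2], [-2, 2]], dtype=float),
}

def edge_histogram_sketch(gray, threshold=10.0):
    """gray: 2-D array; returns relative frequency of the five edge types."""
    names = list(EDGE_FILTERS)
    hist = np.zeros(len(names))
    h, w = gray.shape
    for y in range(0, h - 1, 2):
        for x in range(0, w - 1, 2):
            block = gray[y:y + 2, x:x + 2].astype(float)
            strengths = [abs((block * f).sum()) for f in EDGE_FILTERS.values()]
            best = int(np.argmax(strengths))
            if strengths[best] >= threshold:     # weak blocks count as "no edge"
                hist[best] += 1
    return dict(zip(names, hist / max(hist.sum(), 1)))

rng = np.random.default_rng(0)
print(edge_histogram_sketch(rng.integers(0, 256, (64, 64)).astype(float)))
```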

2.2.3 Shape and localization features

There are five shape or localization descriptors: Region-based Shape, Contour-based Shape, Region Locator, Spatio-temporal Locator, and 3D shape.

The Region-based and Contour-based Shape Descriptors are intended for shape matching. They do not provide enough information to reconstruct the shape nor to define its position in the image. Two shape Descriptors have been defined because, in terms of applications, there are at least two major interpretations of shape similarity. For example, the shapes represented in Figures 10a and 10b are similar because they correspond to a cross. The similarity is based on the contours of the shape and, in particular, on the presence of points of high curvature along the contours. This type of similarity is handled by the Contour-based Shape Descriptor. The shapes illustrated in Figures 10b and 10c can also be considered as similar. However, the similarity does not rely on the contours but on the distribution of pixels belonging to the region. This second similarity notion is represented by the Region-based Shape Descriptor.

Figure 10: Illustration of region and contour similarity (panels (a), (b), and (c)).

The Contour-based Shape Descriptor captures characteristics of a shape based on its contour. It relies on the so-called Curvature Scale-Space [5] representation, which captures perceptually meaningful features of the shape. The Descriptor essentially represents the points of high curvature along the contour (position of the point and value of the curvature). This representation has a number of important properties; namely, it captures characteristic features of the shape, enabling efficient similarity-based retrieval. It is robust to non-rigid deformation and partial occlusion.

The Region-based Shape Descriptor captures the distribution of all pixels within a region. Note that, in contrast with the Contour-based Shape Descriptor, this Descriptor can deal with regions made of several connected components or including holes. The Descriptor is based on an Angular Radial Transform (ART), which is a 2D complex transform defined with polar coordinates on the unit disk. The ART basis functions are separable along the angular and radial dimensions. Twelve angular and three radial basis functions are used. The Descriptor represents the set of coefficients resulting from the projection of the binary region onto the 36 ART basis functions.
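A sketch of the projection follows, using the commonly cited ART basis definition (angular complex exponentials times cosine radial functions); the pixel sampling, disk mapping, and normalization are illustrative.

```python
# Sketch of the region-based shape idea: project a binary region, mapped onto
# the unit disk, onto 12 angular x 3 radial complex basis functions and keep
# the coefficient magnitudes. Sampling and normalization are illustrative.
import numpy as np

def art_descriptor_sketch(mask, n_angular=12, n_radial=3):
    """mask: 2-D boolean array; returns |coefficients| of shape (n_radial, n_angular)."""
    h, w = mask.shape
    y, x = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    rho = np.hypot(y - cy, x - cx) / np.hypot(cy, cx)    # map image onto unit disk
    theta = np.arctan2(y - cy, x - cx)
    inside = (rho <= 1.0) & mask
    coeffs = np.zeros((n_radial, n_angular), dtype=complex)
    for n in range(n_radial):
        radial = np.ones_like(rho) if n == 0 else 2.0 * np.cos(np.pi * n * rho)
        for m in range(n_angular):
            basis = radial * np.exp(1j * m * theta) / (2 * np.pi)
            coeffs[n, m] = np.conj(basis)[inside].sum()
    return np.abs(coeffs) / max(inside.sum(), 1)         # crude normalization

# A filled square as the binary region of interest.
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True
print(art_descriptor_sketch(mask).round(4))
```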

The RegionLocator and the Spatio-temporalLocator combine shape and localization information. Although they may be less efficient in terms of matching for certain applications, they allow the shape to be (partially) reconstructed and positioned in the image. The RegionLocator Descriptor represents the region with a compact and scalable representation of a polygon. The Spatio-temporalLocator has the same functionality but describes moving regions in a video sequence. The Descriptor specifies the shape of a region within one frame together with its temporal evolution based on motion.

3D shape information can also be described in MPEG-7. Most of the time, 3D information is represented by polygonal meshes. The 3D shape Descriptor provides an intrinsic shape description of 3D mesh models. It exploits some local attributes of the 3D surface. The Descriptor represents the 3D mesh shape spectrum, which is the histogram of the shape indexes [6] calculated over the entire mesh. The main applications targeted by this Descriptor are search, retrieval, and browsing of 3D model databases.

2.2.4 Motion feature

There are four motion Descriptors: camera motion, object motion trajectory, parametric object motion, and motion activity. The CameraMotion Descriptor characterizes 3D camera motion parameters. It supports the following basic camera operations: fixed, tracking (horizontal transverse movement, also called traveling in the film industry), booming (vertical transverse movement), dollying (translation along the optical axis), panning (horizontal rotation), tilting (vertical rotation), rolling (rotation around the optical axis), and zooming (change of the focal length). The Descriptor is based on time intervals characterized by their start time and duration, the type(s) of camera motion during the interval, and the focus-of-expansion (FoE) (or focus-of-contraction, FoC). The Descriptor can describe a mixture of different camera motion types. The mixture mode globally captures information about the camera motion parameters, disregarding detailed temporal information.

The MotionTrajectory Descriptor characterizes the temporal evolution of key-points. It is composed of a list of key-points (x, y, z, t) along with a set of optional interpolating functions that describe the trajectory between key-points. The speed is implicitly known from the key-point specification, and the acceleration between two key-points can be estimated if a second-order interpolating function is used. The key-points are specified by their time instant and their 2D or 3D Cartesian coordinates, depending on the intended application. The interpolating functions are defined for each component x(t), y(t), and z(t) independently. The description is independent of the spatio-temporal resolution of the content (e.g., 24 Hz, 30 Hz, 50 Hz, CIF, SIF, SD, HD, etc.). The granularity of the Descriptor is chosen through the number of key-points used for each time interval.
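A sketch of a 2D trajectory stored as key-points with linear interpolation between them; the optional second-order interpolating functions of the standard are not reproduced, and the helper names are hypothetical.

```python
# Sketch of a motion trajectory as key-points (t, x, y) with linear
# interpolation; second-order interpolating functions are not reproduced.
import numpy as np

def trajectory_position(keypoints, t):
    """keypoints: list of (t, x, y) sorted by time; returns (x, y) at time t."""
    times = np.array([k[0] for k in keypoints], dtype=float)
    xs = np.array([k[1] for k in keypoints], dtype=float)
    ys = np.array([k[2] for k in keypoints], dtype=float)
    return float(np.interp(t, times, xs)), float(np.interp(t, times, ys))

def mean_speed(keypoints):
    """Speed is implicit in the key-points: distance between them over elapsed time."""
    pts = np.array(keypoints, dtype=float)
    d = np.hypot(np.diff(pts[:, 1]), np.diff(pts[:, 2]))
    return float(d.sum() / (pts[-1, 0] - pts[0, 0]))

keypoints = [(0.0, 10.0, 20.0), (1.0, 40.0, 25.0), (2.5, 60.0, 60.0)]
print(trajectory_position(keypoints, 1.75))   # interpolated position
print(mean_speed(keypoints))                  # average speed over the trajectory
```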

Parametric motion models have been extensively used within various image processing and analysis applications. The ParametricMotion Descriptor defines the motion of regions in video sequences as a 2D parametric model. Specifically, affine models include translations, rotations, scaling, and combinations of them. Planar perspective models make it possible to take into account global deformations associated with perspective projections. Finally, quadratic models make it possible to describe more complex movements. The parametric model is associated with arbitrary regions over a specified time interval. The motion is captured in a compact manner as a reduced set of parameters.
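A minimal sketch of the affine case is shown below; the parameter ordering and the displacement convention are assumptions chosen for illustration, not the Descriptor's normative syntax.

    import numpy as np

    def affine_displacement(params, points):
        # params = (a1, ..., a6); the displacement convention
        #   dx = a1 + a2*x + a3*y,   dy = a4 + a5*x + a6*y
        # is one common affine parameterisation, assumed here for illustration.
        a1, a2, a3, a4, a5, a6 = params
        x, y = points[:, 0], points[:, 1]
        dx = a1 + a2 * x + a3 * y
        dy = a4 + a5 * x + a6 * y
        return points + np.stack([dx, dy], axis=1)

    # A translation (a1, a4) combined with a small rotation/scaling (a2, a3, a5, a6).
    region = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 10.0]])
    print(affine_displacement((2.0, 0.01, -0.05, -1.0, 0.05, 0.01), region))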

A human watching a video or animation sequence perceives it as being a “slow” sequence, a “fast paced” sequence, an “action” sequence, and so forth. The MotionActivity Descriptor captures this intuitive notion of “intensity of action” or “pace of action” in a video segment. Examples of high activity include scenes such as “scoring in a basketball game,” “a high speed car chase,” and so forth. On the other hand, scenes such as “news reader shot” or “an interview scene” are perceived as low action shots. The motion activity descriptor is based on five main features: the intensity of the motion activity (value between 1 and 5), the direction of the activity (optional), the spatial localization, the spatial and the temporal distribution of the activity.
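One hedged way to make the intensity attribute concrete is sketched below: the standard deviation of the motion-vector magnitudes of a segment is quantized to a value between 1 and 5. The threshold values are purely illustrative assumptions, not the normative ones.

    import numpy as np

    def motion_activity_intensity(motion_vectors, thresholds=(4.0, 9.0, 18.0, 32.0)):
        # motion_vectors: (N, 2) array of block motion vectors in pixels.
        # The standard deviation of the vector magnitudes is quantized to an
        # intensity between 1 and 5; the threshold values are illustrative only.
        magnitudes = np.hypot(motion_vectors[:, 0], motion_vectors[:, 1])
        return int(np.searchsorted(thresholds, magnitudes.std())) + 1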

2.2.5 Face descriptor

The FaceRecognition Descriptor can be used to retrieve face images that match a query face image. The Descriptor is based on the classical eigenfaces approach [7]. It represents the projection of a face region onto a set of basis vectors (49 vectors) which span the space of possible face vectors.
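A sketch of such a projection is given below, assuming that a mean face and a 49-vector basis are available from an offline PCA stage; the L2 matching distance shown is a stand-in, since matching itself is left non-normative by the standard.

    import numpy as np

    def face_descriptor(face_region, mean_face, basis):
        # face_region: size-normalised luminance image of the face.
        # mean_face:   mean face of the training set (same shape).
        # basis:       (49, H*W) matrix spanning the face space, assumed to come
        #              from an offline PCA stage (not reproduced here).
        vector = (np.asarray(face_region, float) - mean_face).reshape(-1)
        return basis @ vector                 # 49 projection coefficients

    def face_distance(desc_a, desc_b):
        # Plain L2 distance as a stand-in matching measure.
        return float(np.linalg.norm(np.asarray(desc_a) - np.asarray(desc_b)))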

3. CHALLENGES FOR VISUAL INFORMATION ANALYSIS

As mentioned in the introduction, the scope of the MPEG-7 standard is to define the syntax and semantics of the DSs and Descriptors. The description generation and consumption are out of the scope of the standard. In practice, this means that feature extraction, the indexing process, annotation and authoring tools, as well as search and retrieval engines, filtering and browsing devices, are non-normative parts of the MPEG-7 standard and can lead to future improvements. It has to be mentioned, however, that for low-level features, the distinction between the definition of the semantics of a tool and its extraction may become fuzzy. A typical example is represented by the HomogeneousTexture Descriptor (see Section 2.2.2). In order to support interoperability, MPEG-7 has defined the set of filters to be used in the decomposition (Gabor filters and their parameters). Besides the implementation, this leaves little room for future studies and improvements. A similar situation can be found for most visual descriptors described in Section 2.2: the definition of their semantics partially defines the extraction process. The main exceptions are the TextureBrowsing and the MotionActivity Descriptors. Indeed, the characterization of the “Texture regularity” or of the “Motion intensity” is done qualitatively. The CameraMotion Descriptor is a special case, because either one has access to the real parameters of the camera or one has to estimate the camera motion from the observed sequence.

The definition of a low-level Descriptor may also lead to the use of a natural matching distance. However, the standardization of matching distances is not considered as being necessary to support interoperability, and the standard only provides informative sections in this area. This will certainly be a challenging area in the future.

Most of the Descriptors corresponding to low-level features can be extracted automatically from the original content. Most of the time, the main issue is to define the temporal interval or the region of interest that has to be characterized by the descriptor. This is a classical segmentation problem for which a large number of tools have been reported in the literature (see [8, 9, 10] and the references therein). An area which has been less worked out is the instantiation of the decomposition involved in the Segment DS. It can be viewed as a hierarchical segmentation problem where elementary entities (region, video segment, etc.) have to be defined and structured by an inclusion relationship within a tree. This process leads, for example, to the extraction of Tables of Contents or Indexes from the AV content, as illustrated in Figure 9. Although some preliminary results have been reported in the literature, this area still represents a challenge for the future.
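As a small illustration of such an inclusion tree, the sketch below structures elementary video segments hierarchically and flattens them into a Table of Contents; the field names are illustrative and do not reproduce the Segment DS syntax.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Segment:
        # One node of the inclusion tree: an elementary entity (video segment,
        # region, ...) and its children. Field names are illustrative only.
        label: str
        start: float          # seconds from the beginning of the content
        duration: float
        children: List["Segment"] = field(default_factory=list)

        def table_of_contents(self, depth=0):
            # Flatten the hierarchy into an indented Table of Contents.
            lines = ["  " * depth + f"{self.label} (start {self.start:.0f}s, {self.duration:.0f}s)"]
            for child in self.children:
                lines.extend(child.table_of_contents(depth + 1))
            return lines

    programme = Segment("Programme", 0, 1800, [
        Segment("Scene 1", 0, 600, [Segment("Shot 1.1", 0, 200), Segment("Shot 1.2", 200, 400)]),
        Segment("Scene 2", 600, 1200),
    ])
    print("\n".join(programme.table_of_contents()))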

One of the most challenging aspects of the MPEG-7 standard in terms of application is to use it efficiently. The selection of the optimum set of DSs and Descriptors for a given application is an open issue. Even if the identification of the basic features that have to be represented is a simple task, the selection of specific descriptors may not be straightforward: for example, DominantColor versus ScalableColorHistogram, or MotionTrajectory versus ParametricMotion, and so forth. Moreover, the real power of the standard will be obtained when DSs and Descriptors are jointly used and when the entire description is considered as a whole, for example, taking into account the various relationships between segments in trees or graphs.

In terms of research, one of the most challenging issues


Figure 11: Localization of the recognition process depending on the feature types.

may be the mapping between low-level and high-level descriptions. First, we discuss the relation between low-level and high-level descriptions and recognition processes. Consider the two situations represented in Figure 11: on the top, the description is assumed to rely mainly on high-level features. This implies that the automatic or manual indexing process has performed a recognition step during description generation. This approach is very powerful but not very flexible. Indeed, if, during the description generation, the high-level feature of interest for the end user has been identified, then the matching and retrieval will be very easy to do. However, if the end user wants to use a feature that has not been recognized during the indexing phase, then it is extremely difficult to do anything. The alternative solution is represented in the lower part of Figure 11. In this case, we assume that the description relies mainly on low-level features. No recognition process is required during the description generation. However, for many applications, the mapping between low-level descriptions and high-level queries will have to be done during the description consumption. That is, the search engine or the filtering device will have to analyze the low-level features and, on this basis, perform the recognition process. This is a very challenging task for visual analysis research. Today, the technology related to intelligent search and filtering engines using low-level visual features, possibly together with high-level features, is still very limited. As a final remark, we mention that this challenging issue also has some implications for the description generation. Indeed, a major open question is to know what is the useful set of low-level Descriptors that has to be used to allow a certain class of recognition tasks to be performed on the description itself.

REFERENCES

[1] D. C. Fallside, Ed., XML Schema Part 0: Primer, W3C Recommendation, May 2001, http://www.w3.org/TR/xmlschema-0/.

[2] T. Bray, J. Paoli, C. M. Sperberg-McQueen, and E. Maler, Eds., XML: Extensible Markup Language 1.0, 2nd edition, October 2000, http://www.w3.org/TR/REC-xml.

[3] H. Sawhney and S. Ayer, “Compact representations of videos through dominant and multiple motion estimation,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 18, no. 8, pp. 814–830, 1996.

[4] Y. Rui, T. S. Huang, and S. Mehrotra, “Exploring video structure beyond the shots,” in Proc. IEEE International Conference on Multimedia Computing and Systems, pp. 237–240, Austin, Tex, USA, 28 June–1 July 1998.

[5] F. Mokhtarian and A. K. Mackworth, “A theory of multi-scale, curvature-based shape representation for planar curves,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 14, no. 8, pp. 789–805, 1992.

[6] J. J. Koenderink and A. J. van Doorn, “Surface shape and curvature scales,” Image and Vision Computing, vol. 10, no. 8, pp. 557–565, 1992.

[7] B. Moghaddam and A. Pentland, “Probabilistic visual learning for object representation,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 696–710, 1997.

[8] B. S. Manjunath, T. Huang, A. M. Tekalp, and H. J. Zhang, Eds., “Special issue on image and video processing for digital libraries,” IEEE Trans. Image Processing, vol. 9, no. 1, 2000.

[9] K. N. Ngan, S. Panchanathan, T. Sikora, and M. T. Sun, Eds., “Special issue on segmentation, description and retrieval of video content,” IEEE Trans. Circuits and Systems for Video Technology, vol. 8, no. 5, pp. 521–524, 1998.

[10] F. Pereira, S. F. Chang, R. Koenen, A. Puri, and O. Avaro, Eds., “Special issue on object-based video coding and description,” IEEE Trans. Circuits and Systems for Video Technology, vol. 9, no. 8, p. 1144, 1999.

Philippe Salembier received a degree from the Ecole Polytechnique, Paris, France, in 1983 and a degree from the Ecole Nationale Superieure des Telecommunications, Paris, France, in 1985. He received the Ph.D. from the Swiss Federal Institute of Technology (EPFL) in 1991. He was a Postdoctoral Fellow at the Harvard Robotics Laboratory, Cambridge, MA, in 1991. From 1985 to 1989 he worked at Laboratoires d'Electronique Philips, Limeil-Brevannes, France, in the fields of digital communications and signal processing for HDTV. In 1989, he joined the Signal Processing Laboratory of the Swiss Federal Institute of Technology in Lausanne, Switzerland, to work on image processing. At the end of 1991, after a stay at the Harvard Robotics Lab., he joined the Polytechnic University of Catalonia, Barcelona, Spain, where he is lecturing on the area of digital signal and image processing. His current research interests include image and sequence coding, compression and indexing, image modeling, segmentation problems, video sequence analysis, mathematical morphology, and nonlinear filtering. In terms of standardization activities, he has been involved in the definition of the MPEG-7 standard (“Multimedia Content Description Interface”) as Chair of the “Multimedia Description Scheme” group between 1999 and 2001. He served as an Area Editor of the Journal of Visual Communication and Image Representation (Academic Press) from 1995 until 1998 and as an AdCom Officer of the European Association for Signal Processing (EURASIP) in charge of the edition of the Newsletter from 1994 until 1999. He has edited (as Guest Editor) special issues of Signal Processing on Mathematical Morphology (1994) and on Video Sequence Analysis (1998). He has also co-edited a special issue of Signal Processing: Image Communication on MPEG-7 proposals (2000). Currently, he is associate editor of IEEE Transactions on Image Processing and Co-Editor-in-Chief of Signal Processing. Finally, he is a member of the Image and Multidimensional Signal Processing Technical Committee of the IEEE Signal Processing Society.


EURASIP Journal on Applied Signal Processing 2002:4, 354–361
© 2002 Hindawi Publishing Corporation

Using MPEG-7 at the Consumer Terminal in Broadcasting

Alan Pearmain
Electronic Engineering Department, Queen Mary, University of London, Mile End Road, London E1 4NS, England, UK
Email: [email protected]

Mounia Lalmas
Computer Science Department, Queen Mary, University of London, Mile End Road, London E1 4NS, England, UK
Email: [email protected]

Ekaterina Moutogianni
Computer Science Department, Queen Mary, University of London, Mile End Road, London E1 4NS, England, UK
Email: [email protected]

Damien Papworth
Computer Science Department, Queen Mary, University of London, Mile End Road, London E1 4NS, England, UK
Email: damien [email protected]

Pat Healey
Computer Science Department, Queen Mary, University of London, Mile End Road, London E1 4NS, England, UK
Email: [email protected]

Thomas Rolleke
Computer Science Department, Queen Mary, University of London, Mile End Road, London E1 4NS, England, UK
Email: [email protected]

Received 1 August 2001 and in revised form 14 January 2002

The European Union IST research programme SAMBITS (System for Advanced Multimedia Broadcast and IT Services) project is using Digital Video Broadcasting (DVB), the DVB Multimedia Home Platform (MHP) standard, MPEG-4, and MPEG-7 in a studio production and multimedia terminal system to integrate broadcast data and Internet data. This involves using data delivery over multiple paths and the use of a back channel for interaction. MPEG-7 is being used to identify programme content and to construct queries to allow users to identify and retrieve interesting related content. Searching for content is being carried out using the HySpirit search engine. The paper deals with terminal design issues, the use of MPEG-7 for broadcasting applications, and using a consumer broadcasting terminal for searching for material related to a broadcast.

Keywords and phrases: MPEG-7, digital television, information retrieval, MPEG-4, multimedia home platform.

1. INTRODUCTION

SAMBITS is a European Union IST research programme project investigating the ways in which digital television can enhance programmes and provide the viewer with a personalised service. Part of this enhancement requires broadcasting and the Internet to work together. The project is working on studio systems for producing content that allow a broadcaster to add additional information to the broadcasts and to link broadcasting and the Internet. The project is also working on terminals capable of displaying the enhanced content in a way that is accessible to ordinary users [1].

The broadcasting chain starts with normal MPEG-2 broadcast content that is sent by standard DVB techniques, but this is linked to extra content, including MPEG-4 audio-video sequences and HTML pages. MPEG-2 and MPEG-4 multimedia information has MPEG-7 [2, 3, 4] metadata added at the studio which describes certain features of the content. The extra MPEG-4 content may be sent over the MPEG-2 transport stream as separate streams, as part of the


Figure 1: The SAMBITS system.

data carousel, in private sections, or it may be sent over the Internet.

The terminal is based on the Multimedia Home Platform (MHP) [5] reference software running on a set-top box. MHP currently only supports MPEG-2, so the project is adding software to support MPEG-4 and MPEG-7, storage of multimedia content, and searching of multimedia content. It is intended that the user will be able to access this content with a system that is an advanced set-top box and television with a remote control.

The SAMBITS project has twelve partners: Institut für Rundfunktechnik GmbH, European Broadcasting Union, British Broadcasting Corporation, Brunel University, Heinrich-Hertz-Institut für Nachrichtentechnik Berlin GmbH, KPN Research, Philips Research, Queen Mary, University of London, Siemens AG, Telenor AS, Fraunhofer-Institut für Integrierte Publikations- und Informationssysteme, and Bayerischer Rundfunk. Queen Mary is contributing to the consumer terminal: the MPEG-7 descriptors, information retrieval, and the user interface. The project started in January 2000 and finished at the end of December 2001. There was a demonstration of the project at IBC2001 in Amsterdam in September 2001.

2. BACKGROUND

The outline of the complete system that is being developed is shown in Figure 1. The studio system involves the development of various authoring and visualization tools. Standard equipment is being used for the broadcast and Internet servers, and the terminal development is based on a Fujitsu-Siemens ACTIVY set-top box.

Some of the functions that are available in the terminal are:

• enhanced programmes containing additional content and metadata information;

• instant access to the additional content, which may be provided via DVB or via the Internet;

• access to information about the current programme;

• searching for additional information, either using metadata from the current programme or using a stored user profile.

One of the features of the system is that it provides a platform for investigating how MPEG-7 descriptors can be used at the consumer end in a broadcasting environment. The first problem was to choose a suitable set of descriptors. The descriptors that are useful to a user are high-level descriptions of the content. The studio will also include lower-level descriptors such as the percentage of different colours in a scene or camera information (since the studio involves expert users, e.g., programme editors, etc.), but these would not be useful at the terminal.

User interaction is limited to remote control buttons, rather than a keyboard, as many television users do not feel comfortable having to use a keyboard, and keyboards are bulky and relatively expensive. This produces some challenges for the user interface design, particularly in the construction of queries.

The user will have the option of whether or not to display the MPEG-7 data that is associated with the current programme, via an Info button on the remote control. Searches are constructed based on the MPEG-7 metadata available for the current programme. The retrieval engine uses HySpirit (http://www.hyspirit.de), a retrieval framework based on probabilistic relational algebra [6].

3. THE TERMINAL HARDWARE

The Fujitsu-Siemens ACTIVY box, which is used for the terminal, has the following characteristics:

• Win98 operating system.
• Integrated DVB-receiver.
• Optimisation of the graphical subsystem for display on a TV-screen.
• DVB-C or DVB-S input.
• TV output via SCART, FBAS, S-Video, either in PAL or NTSC norm, including macrovision, flicker reduction, and hardware support for transparent overlays.
• VGA-output.
• 2 MPEG-2 decoder chips.
• Common Interface for Conditional Access Module (DVB compliant).
• AC97 codec, AC3 pass through.
• S/P-DIF I/O (digital audio I/O interface).
• 600 MHz Celeron processor.

The box has a similar form factor to the current generation of set-top boxes.

4. THE TERMINAL SOFTWARE

The terminal receives an MPEG-2 transport stream and additional material. The additional material can be of several types:

• MPEG-7 metadata, either information about the main MPEG-2 programme or the MPEG-4 or other additional material;

• an MPEG-4 stream that is synchronised with the main programme and displayed as an object overlaid on the MPEG-2 picture. The display of this stream will be at user discretion. A typical application of this feature is displaying a signer for people who are deaf;


Figure 2: Content review and storage.

• MPEG-4 material that could be an additional stream in the multiplex, or could be transmitted via the data carousel, or could be available from the broadcaster's web server via the Internet;

• web pages transmitted in the data carousel or available via the Internet;

• other material, such as 3D models or games, transmitted via the data carousel or from the broadcaster's web server via the Internet.

One of the uses of MPEG-7 metadata is to indicate the extra content that is available at different times during the programme. The overall architecture of content management in the terminal is shown in Figure 2.

To synchronise the MPEG-7 data with the MPEG-2 stream, UDP packets containing time data are sent from the studio system to the terminal. The MPEG-7 user interface uses an integrated browser based on the Mozilla HTML browser. The MPEG-7 information is transformed from XML to HTML using style sheets, and the embedded browser then renders the HTML.

Additional controls for the MPEG-7 engine, such as searching for related material, are also placed in the generated HTML pages.
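A minimal sketch of this style-sheet transformation, using the lxml XSLT processor and hypothetical file names (the project's actual style sheets and descriptions are not reproduced here), is:

    from lxml import etree

    # File names are hypothetical; the actual SAMBITS style sheets and
    # descriptions are not reproduced here.
    description = etree.parse("programme_description.xml")     # MPEG-7 description (XML)
    transform = etree.XSLT(etree.parse("mpeg7_to_html.xsl"))    # XSL style sheet

    html_doc = transform(description)                           # XML to HTML
    with open("programme_info.html", "wb") as f:
        f.write(etree.tostring(html_doc, method="html", pretty_print=True))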

5. MPEG-7 CONTENT DESCRIPTION

The MPEG-7 standard specifies a rich set of description structures for audio-visual (AV) content, which can be instantiated by any application to describe various features and information related to the AV content. A Descriptor (D) defines the syntax and the semantics of an elementary feature. This can be either a low-level feature that represents a characteristic such as colour or texture, or a high-level feature such as the title or the author of a video. A Description Scheme (DS) uses Descriptors as building blocks in order to define the syntax and semantics of a more complex description. The syntax of Ds and DSs is defined by the Description Definition Language (DDL). The DDL is an extension of the XML Schema language [7] and can also be used by developers for creating new Ds and DSs according to the specific needs of an application.

The set of description structures that MPEG-7 standardises is very broad, so each application is responsible for selecting an appropriate subset to instantiate, according to the application's functionality requirements. The choice of the MPEG-7 descriptions that were considered to be suitable for the SAMBITS terminal functionality was based on what was available at the time at working level [2]. The project contributed to the standardisation process. Elements that were still evolving and the use of which was not clear were not considered. The names were later updated to conform to the Final Committee Draft (FCD) elements [3]. The use of MPEG-7 has also been discussed in [4, 8].

After examining the available description schemes, those areas of MPEG-7 that were considered potentially useful to any SAMBITS application were identified, that is, the Multimedia Description Schemes part. In particular, these are the Basic Elements on which the high-level descriptions are built, the Content Creation and Production which provide information related to the programme, the Structural Aspects which allow a detailed structured description of the programme, and the User Preferences. Ds and DSs that describe low-level visual or audio aspects of the content were not considered to be useful for the terminal functionality desired, where high-level descriptions meaningful for the viewers were required. Elements from the above areas were then selected, so that the minimum functionality could be achieved at the SAMBITS terminal. The selection is shown in Table 1: for each chosen element type listed in the first column, the related elements (Ds and DSs) which are used are listed in the second column.

The following sections describe in more detail the MPEG-7 elements that were implemented for the terminal functionality.

5.1. Structural aspects

The MPEG-7 descriptions at the terminal focus on the structural aspects of the programme. The Segment DS is used to describe the structure of the broadcast programme. Specifically, the Video Segment DS, which describes temporal


Table 1: MPEG-7 elements selected for the SAMBITS terminal.

Structural aspects
  SegmentType: MediaLocator, CreationInformation, TextAnnotation
  SegmentDecompositionType: Segment (type="VideoSegmentType")
  VideoSegmentType: MediaTime (datatype)

Content creation and production
  CreationInformationType: Creation, Classification, RelatedMaterial
  CreationType: Title, Abstract, Creator
  ClassificationType: Genre, Language
  RelatedMaterialType: MediaLocator

Basic elements
  TextAnnotationType: FreeTextAnnotation, StructuredAnnotation
  StructuredAnnotationType: Who, WhatObject, WhatAction, Where, When, Why, How

User preferences
  UserPreferencesType: UserIdentifier, UsagePreferences
  UsagePreferencesType: FilteringAndSearchPreferences, BrowsingPreferences
  FilteringAndSearchPreferencesType: ClassificationPreferences, CreationPreferences
  BrowsingPreferencesType: SummaryPreferences

segments of the video, is used. The Video Segment Decomposition tools are then used for temporally decomposing segments into subsegments to capture the hierarchical nature of the content. The result is called a Table of Contents where, for example, a video programme can be temporally segmented into various levels of scenes, sub-scenes, and shots. MediaLocator and MediaTime Ds contain the reference to the media and time information, respectively.

The Table of Contents allows a granular description of the content, which is needed at the terminal to support the user navigation through the programme and to provide information at various levels of detail. It is also useful for the search functionality of the terminal, as this allows the retrieval to return the most relevant part within a video.

The hierarchically structured content description allows further descriptions to be attached at the different segments of the hierarchy, in order to provide a high-level representation of the content at a given granularity. The MPEG-7 structures used in SAMBITS are described in the next subsections.

5.2. Creation information

The Creation Information DS, which is part of the Content Creation & Production set of DSs, was used to provide general background information related to the videos. In particular, Creation DS provides a Title, an Abstract, and the Creator. Classification DS describes how the material may be categorised into Genre and Language. For our Classification instance, free text is used instead of any classification schemes or controlled terms.

The Creation and Classification descriptions are useful for the search functionality by performing matching on the basis of these features. The creation and classification information can also be used in combination with profile information to perform ranking of search results according to user preferences.

The Related Material DS, which describes additional material that is linked to the content, was also implemented. In particular, only the MediaLocator of the referenced material is included, as it is assumed that the referenced material has also been described.

The Related Material descriptions at the terminal allow an integrated view of the main broadcast programme and all the linked content.

5.3. Textual annotation

Free Text Annotation and Structured Annotation DSs provide the main description of each segment that is meaningful for a viewer. In particular, the following elements: Who, WhatObject, WhatAction, Where, When, Why, How, are used for the Structured Annotation.

The textual annotation provides the main features of the multimedia material that are used for matching the queries and the material when searching. It is therefore used for representing the material for the search engine and for representing the queries. It is envisaged that the structured annotation will also allow users to specify some keywords that most nearly represent the type of information that they wish to locate.

An example description of a video segment as used in SAMBITS can be seen in Figure 3. The video is described as audio-visual content, which is described by creation (Title, Abstract, Creator), classification (Genre, Language), and media information (Time, Location). The video is temporally decomposed into scenes and scenes are decomposed into shots. For the segments at any level, there may be textual annotation, both free text and structured (Who, WhatObject, Where, etc.). Related material for each segment can also be specified, using links to its location. Note that it is enough to have the creation and classification information only at the root level, since it is inherited by the child segments of the decomposition (unless they are instantiated again).

Note that the description of the structure of the video is generated semi-automatically, by first using a segmentation


Figure 3: Example description of a video segment.

algorithm that identifies the shots and then editing the structure to achieve the desired hierarchical structure. The Creation Information and Textual Annotation DSs have then to be attached manually, with the support of existing tools.

To illustrate the procedure, we use the extract shown in Figure 4 of a sample MPEG-7 description of a soccer game. The extract consists of an audio-visual segment (AudioVisualSegmentType), composed of two subsegments (SegmentDecomposition). Creation information is provided for the audio-visual segment, such as a Title, an Abstract, the Creator, the Genre, and Language (the content management part of MPEG-7). The segment also has a free text annotation. The subsegments (VideoSegmentType) correspond to video shots. Each subsegment has a free text annotation component.

5.4. User preferences

The User Preference DS is used to specify user preferences with respect to the content. Browsing Preferences that describe preferred views of the content (i.e., summary preferences) can be used for displaying the search results. Filtering and Search Preferences that describe preferences for content in terms of genre or language can be exploited to classify the search results.

The User Preferences that are best for the terminal functionality have not yet been fully determined. A number of user studies are currently taking place to investigate which of the standardised preferences best correspond to viewer needs. Note that the user preferences are created at the terminal side, as opposed to the content description that is created at the studio side.

The definition of descriptors within the MPEG-7 standard was still ongoing at the time of the project, but these descriptors were in the set that was the candidate for adoption in the standard.

5.5. Binary MPEG-7

MPEG-7 data can be transported over the broadcast channel either as text or as a binary representation. The binary representation is a recent development within the standardisation process. If a binary form is used, it must first be decoded to the text description, which is an XML structure. An XSLT processor is then used, together with a style sheet, to produce an HTML version of the description. The HTML is sent to a local web server on the terminal. If the user requests the MPEG-7 data about the current programme, the HTML browser on the terminal is used to send a request to this local web server.

The MPEG-7 content data is displayed overlaid on an area of the screen. Another overlay strip on the screen shows the control buttons that have different uses depending on the mode of the terminal. In the MPEG-7 information display mode, the round circle button allows the data display to be turned on or off.

6. SEARCHING

Queries are constructed from MPEG-7 data for the current programme. An example of query construction is shown in Figure 5.

The user has asked for further information and has been presented with the information that is immediately available. He or she can choose which of these items to select with the up/down buttons on the remote control. Users are also presented with an option to extend the search. The search could then be extended either to the Internet server of the broadcaster or to the whole of the Internet.

The query is formulated as an XML query and sent to the HySpirit search engine. This is a retrieval framework based on a probabilistic extension of well-known database data models, such as the relational model, the deductive model, and the object-oriented model, for information retrieval purposes. HySpirit allows content (e.g., terms), facts (e.g., authors), and structure (e.g., XML) to be captured in retrieving information from semi-structured and heterogeneous data sources.

We also use MPEG-7 descriptions of the multimedia content in order to develop integrated search mechanisms that assist users in identifying additional material that is specifically of interest to them.

The MPEG-7 data associated with the programme, or programme elements, is used to build queries. Context-sensitive buttons are used to display the MPEG-7 description of the current programme element, and the user then uses check boxes to select the terms that they would like to use


    <AudioVisual xsi:type="AudioVisualSegmentType">
      <CreationInformation>
        <Creation>
          <Title>Spain vs Sweden (July 1998)</Title>
          <Abstract>
            <FreeTextAnnotation>Spain scores a goal quickly in this World Cup soccer game against Sweden. The scoring player is Morientes.</FreeTextAnnotation>
          </Abstract>
          <Creator>BBC</Creator>
        </Creation>
        <Classification>
          <Genre type="main">Sports</Genre>
          <Language type="original">English</Language>
        </Classification>
      </CreationInformation>
      <TextAnnotation>
        <FreeTextAnnotation>Soccer game between Spain and Sweden.</FreeTextAnnotation>
      </TextAnnotation>
      <SegmentDecomposition decompositionType="temporal" id="shots">
        <Segment xsi:type="VideoSegmentType" id="ID84">
          <MediaLocator> (?) </MediaLocator>
          <TextAnnotation>
            <FreeTextAnnotation>Introduction.</FreeTextAnnotation>
          </TextAnnotation>
        </Segment>
        <Segment xsi:type="VideoSegmentType" id="ID88">
          <MediaLocator> (?) </MediaLocator>
          <TextAnnotation>
            <FreeTextAnnotation>Game.</FreeTextAnnotation>
          </TextAnnotation>
        </Segment>
      </SegmentDecomposition>
    </AudioVisual>

Figure 4: Extract of an MPEG-7 description.

as the basis of an additional search. Thus, a description of a segment of a football match might include the name of a player, the name of a team, the stadium where the game is being played, and so on. Any of these items could form the basis of a search for further information.

In addition, MPEG-7 is used to represent the available content and provide indexing information that is taken into account during the search in order to match the queries and retrieve the relevant material. Information retrieval techniques for MPEG-7 data were developed and implemented using the HySpirit framework. The method returns a list of ranked results, so that the parts most interesting to the user are presented first.

Figure 6 shows the system for processing a query. The query is formed in the HTML browser and sent to the local web server. The query is then sent to the search engine as an XML query. An indexing module in the search engine converts the XML query to a Probabilistic Relational Algebra (PRA) query that is suitable for submission to HySpirit, and HySpirit returns PRA results that are converted to XML. These are then processed with an XSLT style sheet to give the results in rank order as HTML to send to the browser. Figure 7 shows the results of the search with the ranking as a percentage. We may present this ranking information in some alternative graphical form. A filter module for the system is not shown in Figure 6, but this can be included so that the results presented are based on a user profile (see below).

7. USE OF USER PREFERENCES

Users are able to store a profile of their preferences, and both the metadata about the current programme and the search results will be filtered according to this profile before display. Some preferences will relate to the type of data to be displayed; for example, a user could select that he was not interested in place information or that he only wanted to see the two best match results from a search. Other options could be added about the interests of the user, exploiting the Navigation and Access component of MPEG-7. Such parameters are, for example, the number of results per page, whether they see a thumbnail or not, or the level of detail of the description of the results [9]. A possible development would be to monitor user searches and requests to automatically build a user profile to filter results. Note that viewers will have the option of editing a set of parameters in their personal profile concerning the display of the list of search results.

8. PROJECT DEMONSTRATION

The whole studio and terminal system, which was developed in SAMBITS, was demonstrated at IBC2001 in Amsterdam. Two scenarios were used in the demonstration: one based on a programme about dinosaurs and the other based on the 2001 Eurovision song contest broadcast.


Figure 5: Construction of a query.

Figure 6: The search system.

The dinosaur programme offers MPEG-4 clips of background technical data that are available at appropriate times during the programme, and related HTML pages. There is also MPEG-7 indexing of the main programme content and additional content, so that particular programme segments can be found by content and the content description can be displayed superimposed on the main programme. A signer to assist the deaf is also broadcast as a synchronised MPEG-4 stream, and the signer can, at viewer discretion, be superimposed on either the main programme or the MPEG-4 related content.

The Eurovision song contest programme allowed metadata on singer, song title, country, and so forth, to be provided, and searches to be carried out to find more information about the singer, songs from previous years, information about the singer's country of origin, and an extra backstage camera view of the contest. Some of this material was available as MPEG-4 multimedia content via the object carousel and some was the type of information that would normally be available from the Internet.

Figure 7: Display of search results.

Some of the material was provided by web pages, and there is a 3D Hall of Fame that could be navigated to select previous contest winning entries.

9. CONCLUSION

The SAMBITS project has developed a system for enhancing and personalising digital television broadcasts by adding MPEG-4, MPEG-7, and other data to MPEG-2 transport streams or via the Internet. This paper has described some of the work in developing a consumer terminal to support the use of MPEG-7 metadata. The MPEG-7 metadata associated with broadcast programme content can be displayed on screen, at viewer request, and used to select additional programme data or to construct queries. One use of additional MPEG-4 material is to broadcast a signer that can be superimposed on part of the screen for the hard of hearing and that can be turned on or off by the viewer. The consumer terminal for the enhanced broadcasts can be implemented on an advanced version of a set-top box using the Multimedia Home Platform software with extensions to support MPEG-4 and MPEG-7.

As part of this work, suitable descriptors and description schemes for use at a consumer broadcast terminal have been developed. The system also allows the construction of queries from this metadata using the set-top box remote control. The queries are submitted to the HySpirit search engine and the results are returned in rank order. Results are filtered according to user preferences.

The initial tests of the system have used material related to a programme about dinosaurs and material related to the 2001 Eurovision song contest broadcast. The system should form an excellent platform for evaluating user reaction to these functions for integrating the Internet with television.

REFERENCES

[1] P. Healey, M. Lalmas, E. Moutogianni, Y. Paker, and A. Pearmain, “Integrating internet and digital video broadcast data,” in 4th World Multiconference on Systemics, Cybernetics and Informatics, vol. 1, pp. 624–627, Orlando, Fla, USA, July 2000.

[2] ISO MPEG-7, “MPEG-7 Multimedia Description Schemes WD (Version 4.0),” ISO/IEC JTC 1/SC 29/WG 11/N3465, July 2000, Beijing, China.

[3] ISO MPEG-7, “Text of 15938-5 FCD Information Technology—Multimedia Content Description Interface—Part 5 Multimedia Description Schemes,” ISO/IEC JTC 1/SC 29/WG 11 N3966, March 2001, Singapore.

[4] P. Salembier, “An overview of the MPEG-7 standard and of future challenges for visual information analysis,” in Workshop on Image Analysis for Multimedia Interactive Services, pp. 75–82, Tampere, Finland, May 2001.

[5] J.-P. Evain, “The multimedia home platform—an overview,” EBU Technical Review, no. 275, pp. 4–10, Spring 1998.

[6] N. Fuhr and T. Rolleke, “HySpirit—a probabilistic inference engine for hypermedia retrieval in large databases,” in Advances in Database Technology, 6th International Conference on Extending Database Technology, vol. 1377 of Lecture Notes in Computer Science, pp. 24–38, Springer-Verlag, Valencia, Spain, March 1998.

[7] D. C. Fallside, Ed., XML Schema Part 0: Primer, W3C Recommendation, May 2001, http://www.w3.org/TR/xmlschema-0/.

[8] W. Putz, “The usage of MPEG-7 metadata in a broadcast application,” in Proc. International Conference on Media Futures, pp. 133–136, Florence, Italy, May 2001.

[9] D. Ileperuma, M. Lalmas, and T. Roelleke, “MPEG-7 for an integrated access to broadcast and Internet data,” in Proc. International Conference on Media Futures, pp. 235–238, Florence, Italy, May 2001.

Alan Pearmain is a Senior Lecturer in the Department of Electronic Engineering. He joined the department in 1979 after previously working at University College, Dublin and Brookhaven National Laboratory, New York. He has worked in several European and UK multimedia projects, having previously worked in VLSI design. He has B.S. (Eng) and Ph.D. degrees from Southampton University, England.

Mounia Lalmas is a Reader in Information Retrieval at Queen Mary, which she joined as a Lecturer in 1999. Prior to this, she was a Research Scientist at the University of Dortmund in 1998, a Lecturer from 1995 to 1997 and a Research Fellow from 1997 to 1998 at the University of Glasgow, where she received her Ph.D. in 1996. Her research interests centre around the development of effective formalisms able to model information in the places and in the forms that it appears in a multimedia interactive information retrieval system.

Ekaterina Moutogianni is a research assistant at Queen Mary, University of London. She has previously worked at the London Institute and at SENA S.A. in Athens. She obtained her B.S. degree from University of Athens, Greece in 1997 and her M.S. from Queen Mary, London in 1999.

Damien Papworth is a Research Assistant at Queen Mary, University of London. He has previously worked at University College, London. He obtained his B.S. degree from University College, London in 1998.

Pat Healey is a Senior Lecturer in Media and Communication in the Computer Science Department and an Associate Researcher with ATR MIC laboratories, Kyoto, Japan. Dr. Healey joined QMW in 1998. From 1995 to 1997 he held an HCM fellowship in Requirements Analysis for Cooperative work. He has published papers and book chapters in the areas of CSCW, HCI, and Cognitive Science.

Thomas Rolleke is a Research Fellow in the Department of Computer Science. From 1994 to 1999 he was a Researcher at Chair VI, Information Retrieval group, University of Dortmund, and he is co-founder of the HySpirit company that has developed an information retrieval engine.


EURASIP Journal on Applied Signal Processing 2002:4, 362–371
© 2002 Hindawi Publishing Corporation

Ordinal-Measure Based Shape Correspondence

Faouzi Alaya Cheikh
Signal Processing Laboratory, Tampere University of Technology, P.O. Box 553, FIN-33101 Tampere, Finland
Email: [email protected]

Bogdan Cramariuc
Signal Processing Laboratory, Tampere University of Technology, P.O. Box 553, FIN-33101 Tampere, Finland
Email: [email protected]

Mari Partio
Signal Processing Laboratory, Tampere University of Technology, P.O. Box 553, FIN-33101 Tampere, Finland
Email: [email protected]

Pasi Reijonen
Signal Processing Laboratory, Tampere University of Technology, P.O. Box 553, FIN-33101 Tampere, Finland
Email: [email protected]

Moncef Gabbouj
Signal Processing Laboratory, Tampere University of Technology, P.O. Box 553, FIN-33101 Tampere, Finland
Email: [email protected]

Received 31 July 2001 and in revised form 10 February 2002

We present a novel approach to shape similarity estimation based on distance transformation and ordinal correlation. The proposed method operates in three steps: object alignment, contour to multilevel image transformation, and similarity evaluation. This approach is suitable for use in shape classification, content-based image retrieval, and performance evaluation of segmentation algorithms. The two latter applications are addressed in this paper. Simulation results show that in both applications our proposed measure performs quite well in quantifying shape similarity. The scores obtained using this technique reflect well the correspondence between object contours as humans perceive it.

Keywords and phrases: shape, ordinal, correlation, content, retrieval, indexing, segmentation, performance.

1. INTRODUCTION

Shape representation techniques are generally characterized as being boundary-based or region-based. The former (also known as contour-based) represents the shape by its outline, while the latter considers the shape as being formed of a set of two-dimensional regions. The human visual system itself focuses on edges and ignores uniform regions [1, 2]. This capability is hardwired into the retina. Connected directly to the rods and cones of the retina are two layers of neurons that perform an operation similar to the Laplacian. This operation is called lateral inhibition and helps us to extract boundaries and edges. Therefore, in this paper we focus on this aspect of the shapes and not on the regions they may contain. Object contours, however, will have intrinsic intra-class variations. Moreover, object boundary deformation is expected in most imaging applications due to the varying imaging conditions, sensor noise, occlusion, and imperfect segmentation.

Estimating the similarity between object shapes can be described in a simplistic way in two steps: shape feature extraction and feature comparison. Each of these two steps, however, represents a difficult problem by itself. Selecting a set of features to characterize a shape for a certain application is not easy, since one must take into consideration the variability of the shapes and the application domain specificity. Feature comparison can be understood as a way of quantifying the similarity/dissimilarity between the corresponding objects. This is a very difficult problem since it tries to mimic human perception [2].

Several shape features have been proposed in the literature for shape characterization [3]. Many of these techniques, however, cannot be used for content-based indexing and retrieval due to their complexity or because they have no


Figure 1: (a) The original bird contour before the alignment step; (b) the contour after alignment using three universal axes.

obvious counterpart in the human vision. Therefore, techniques based on simple and visually meaningful shape features have been used in several content-based indexing and retrieval (CBIR) systems, for example, QBIC [4] and MUVIS [5, 6]; such features include high curvature points [7, 8, 9], polygonal approximation [10], morphological and topological features, and others [3, 11].

In this paper, we introduce a novel boundary-based approach to shape similarity estimation. This technique is applied to two problems: shape-based image retrieval and performance evaluation of segmentation algorithms. The rest of the paper is organized as follows: Section 2 presents an overview of the proposed method, followed by a detailed description of each step. Experimental results for both applications are presented in Section 3, using a subset of the MPEG-7 shape test data and the segmentation masks obtained by the COST Analysis Model (COST AM) [12, 13] segmentation algorithm. In Section 4, conclusions are drawn.

2. THE PROPOSED METHOD

Images in the target applications represent either a single object outline or a segmentation mask. Therefore, we will not discuss how to obtain the contour or the segmentation masks. Our goal is to compute a similarity score between any two shapes or two segmentation masks. The proposed method operates in three steps: alignment, boundary to multilevel image transformation, and similarity evaluation. The alignment step is not needed in the case of the segmentation performance evaluation, since we are comparing segmentation masks corresponding to the same image.

Once the boundaries are aligned, the binary images containing the boundaries are transformed into multilevel images through distance transformation [3]. The obtained images are then compared using the ordinal correlation measure introduced in [14, 15]. This ordinal measure estimates the similarity between the two shapes based on the correlation of their corresponding transform images. In the rest of this section we give a detailed description of each one of the steps mentioned above.

2.1. Object alignment based on universal axes

The alignment is performed by first detecting three universal axes [16] (those with the largest magnitude) for each shape, then orienting the shape in such a way that these axes are aligned in a standard way for all the objects to be compared. We use the same notation as in [16].

In this implementation of the universal axes determination algorithm we use the version number μ = 2l. The steps of the alignment algorithm are detailed below.

Step 1. Translate the coordinate system so that the origin becomes the center of gravity of the shape S.

Step 2. Compute

$$ x_{\mu}^{(l)} + i\,y_{\mu}^{(l)} = \iint_S \left(\sqrt{x^2+y^2}\right)^{\mu} \left(\frac{x+iy}{\sqrt{x^2+y^2}}\right)^{l} dx\,dy = \iint_S r^{\mu} e^{il\theta}\,dx\,dy, \tag{1} $$

and its normalized counterpart (called the Universal Axes (UA))

$$ \overline{x_{\mu}^{(l)} + i\,y_{\mu}^{(l)}} = \frac{x_{\mu}^{(l)} + i\,y_{\mu}^{(l)}}{\iint_S \left(\sqrt{x^2+y^2}\right)^{\mu} dx\,dy}, \quad \text{for } l = 1, 2, 3. \tag{2} $$

Step 3. Compute the polar angle Θ_μ ∈ [0, 2π] so that

$$ R_{\mu} e^{i\Theta_{\mu}} = \overline{x_{\mu}^{(l_1)} + i\,y_{\mu}^{(l_1)}}, \tag{3} $$

with R_μ being the magnitude of this normalized complex number; l_1 is the number of axes needed to align an object.

Step 4. Compute the directional angles of the l_1 universal axes of the shape S as follows:

$$ \theta_j = \frac{\Theta_{\mu}}{l_1} + (j-1)\,\frac{2\pi}{l_1}, \quad \text{for } j = 1, 2, \ldots, l_1. \tag{4} $$

In our implementation we used l_1 = 3 (see Figure 1), since for l_1 = 2 the orientations of the two universal axes satisfy θ_2 = θ_1 + π. Therefore, they cannot be used alone to determine whether or not an object is flipped around the direction they define.

Step 5. Once the three universal axes are determined, rotate the contour so that the most dominant UA (the UA with the largest magnitude) will be aligned with the positive x-axis, see Figure 1.

Step 6. Then, if the y-component of the second most dominant UA is positive, flip the contour around the x-axis.

To illustrate the alignment performance we applied it to the set of contours in Figure 2. The results of the alignment are presented in Figure 3. It can be noticed that this alignment scheme solved both problems of rotation and mirroring.
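A compact sketch of Steps 1-6 is given below; the integrals of (1)-(2) are approximated by sums over the shape samples, only the l = l_1 moment is computed, and the way the most dominant axis is selected is an assumption of the sketch, since the paper only states that it is the axis with the largest magnitude.

    import numpy as np

    def universal_axes_alignment(points, l1=3):
        # points: (N, 2) array of x, y samples of the shape. The integrals in
        # (1)-(2) are approximated by sums over the samples and only the l = l1
        # moment is used, which is a simplification of Step 2.
        pts = np.array(points, dtype=float)
        pts -= pts.mean(axis=0)                      # Step 1: origin at the centre of gravity
        x, y = pts[:, 0], pts[:, 1]
        r = np.hypot(x, y)
        theta = np.arctan2(y, x)
        mu = 2.0 * l1                                # the paper's version number, mu = 2l

        z = np.sum(r**mu * np.exp(1j * l1 * theta)) / np.sum(r**mu)   # eqs. (1)-(2)
        Theta = np.angle(z) % (2.0 * np.pi)                           # eq. (3)
        axes = Theta / l1 + np.arange(l1) * 2.0 * np.pi / l1          # eq. (4)

        # Step 5: which of the l1 directions dominates is approximated here by the
        # energy of the positive projections of the samples onto each direction
        # (an assumption; the paper only speaks of the largest magnitude).
        energy = [np.sum(np.clip(x * np.cos(a) + y * np.sin(a), 0, None) ** 2) for a in axes]
        order = np.argsort(energy)[::-1]
        a0 = axes[order[0]]
        c, s = np.cos(-a0), np.sin(-a0)
        aligned = pts @ np.array([[c, -s], [s, c]]).T   # rotate the dominant axis onto +x

        # Step 6: flip around the x-axis if the second most dominant axis now has
        # a positive y-component.
        if np.sin(axes[order[1]] - a0) > 0:
            aligned[:, 1] = -aligned[:, 1]
        return aligned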


Figure 2: Bird contours from the MPEG-7 shape test set B.

Figure 3: The contours in Figure 2 after alignment.

2.2. Boundary to multilevel image transformation

Let S be a shape represented by its contour C in a binary image. The binary image is transformed into a multilevel (grayscale) image G using a mapping function φ, such that the pixel values in G, {G_1, G_2, ..., G_n}, depend on their relative position to the contour pixels C_1, C_2, ..., C_p:

$$ G_i = \varphi\left(C_k : k = 1, 2, \ldots, p\right), \quad \text{for } i = 1, 2, \ldots, n, \tag{5} $$

where C_k is the position of the contour pixel k in the image G. It should be observed that several transformations satisfy this requirement, including any distance transform [3].

As a result of this mapping, the information contained in the shape boundary will be spread throughout all the pixels of the image. Computing the similarity in the transform domain will benefit from the boundary information redundancy in the new image. We expect that there is no single optimal mapping; different mappings will emphasize different features of the contour. We distinguish, however, two special cases:

• the first is defined as follows:

$$ G_i = \begin{cases} V_0 - d(P_i, C), & \text{if } d(P_i, C) < Th, \\ 0, & \text{otherwise}, \end{cases} \tag{6} $$

for i = 1, 2, ..., n, and V_0 > 0 is the value assigned to the pixels on the boundary. The larger the distance d(P_i, C) from the contour points is, the smaller the new pixel value will be. This mapping function emphasizes the details on the boundary;

• the second mapping is when V_0 = 0 and G_i = d(P_i, C), using a geodesic distance [17]. When this mapping is applied inside the contour only, the emphasis is on the shape skeleton, which is a very important feature of a shape, see Figure 4. Applying the distance mapping inside and outside the contour can lead to a better evaluation of segmentation results. One can even assign different weights inside and outside of the contours.

In this work, we implemented the second mapping based on the geodesic distance. The metric is integer and its application is done through an iterative wave propagation process [18]. The contour points are considered as seeds during the construction of the distance map. The distance map can be generated inside and/or outside the contour, as stated earlier. The values can increase or decrease starting from the contour and can be limited. The pixel values in the distance map can therefore be written as follows:

Gi =∣∣V0 ± d

(Pi, C

)∣∣, for i = 1, 2, . . . , n, (7)

where, V0 is the value on the contour and d(Pi, C) is the dis-tance from any point Pi in the image to the contour C.

Figure 4a represents an example of a distance map gen-erated only inside the contour of a bird contour. Figure 4bshows a 3D visualization of this distance map.

2.2.1 Similarity evaluation

The evaluation of image similarity is based on the frame-work for ordinal-based image correspondence introduced in[14]. Figure 5 gives a general overview of this region-basedapproach.

Suppose we have two images, X and Y , of equal size.In a practical setting, images are resized to a common size.Let {X1, X2, . . . , Xn} and {Y1, Y2, . . . , Yn} be the pixels of im-ages X and Y , respectively. We select a number of areas{R1, R2, . . . , Rm} and extract the pixels from both images thatbelong to these areas. Let RX

j and RYj be the pixels from im-

ages X and Y , respectively, which belong to areas Rj , withj = 1, 2, . . . , m.

The goal is to compare the two images using a region-based approach. To this end, we will be comparing RX

j and RYj

for each j = 1, 2, . . . , m. Thus, each block in image X is com-pared to the corresponding block in image Y in an ordinalfashion. The ordinal comparison of the two regions meansthat only the ranks of the pixels are utilized. For every pixelXk, we construct a so-called slice SXk = {Sk,l : l = 1, 2, . . . , n},

Page 31: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

Ordinal-Measure Based Shape Correspondence 365

(a)

2040

6080

100120

140160

180200

220

0100200

180 160 140 120 100 80 60 40 20

(b)

Figure 4: (a) The distance map generated for the “bird-19” contour in Figure 1, (b) a 3D view of this distance map.

where

SXk,l =

{1, if Xk < Xl,

0, otherwise.(8)

As can be seen, slice SXk corresponds to pixel Xk and is abinary image of size equal to image X . Slices are built in asimilar manner for image Y as well.

To compare regions RXj and RY

j , we first combine theslices from image X , corresponding to all the pixels belong-ing to region RX

j . The slices are combined using the operation

OP1(·) into a metaslice MXj .

Figure 6 shows an illustration of the slices and metaslicescreation for a 4 × 4 image and blocks of 2 × 2. The four slicesS1, S2, S5, and S6 shown in this figure are computed for thefour pixels in block B1. The operation used in this illustrationto create the metaslice M1 is OP1(·) =∑(·).

More formally, MXj = OP1(SXk : Xk ∈ RX

j ) for j = 1,2, . . . , m. Similarly, we combine the slices from image Y toform MY

j for j = 1, 2, . . . , m. It should be noted that themetaslices are equal in size to the original images and couldbe multivalued, depending on the operation OP1(·). Eachmetaslice represents the relation between the region it cor-responds to and the entire image.

The next step is a comparison between all pairs ofmetaslices MX

j and MYj by using operation OP2, resulting

in the metadifference Dj .That is, Dj = OP2(MXj ,M

Yj ), j =

1, 2, . . . , m. We thus construct a set of metadifferences D ={D1, D2, . . . , Dm}. The final step is to extract a scalar mea-sure of correspondence from set D, using operation OP3(·).In other words, λ = OP3(D). It was shown in [14] that thisstructure could be used to model the well-known Kendall’s τand Spearman’s ρ measures [19].

The image similarity measure used in this paper is aninstance of the previously mentioned framework. This mea-sure has been analyzed more extensively by Cramariuc et al.

[15]. Following is a short description of the operationsOPk(·), k = 1, 2, 3 adopted for this measure. OperationOP1(·) is chosen to be the component-wise summation op-eration; that is, metaslice Mj is the summation of all slicescorresponding to the pixels in block j or in other words,Mj =

∑k:Xk∈Rj

Sk.Next, operation OP2(·) is chosen to be the squared Eu-

clidean distance between corresponding metaslices. That is,Dj = ||MX

j −MYj ||22. Finally, operation OP3(·) sums together

all metadifferences to produce λ =∑

j Dj , for j = 1, 2, . . . , m.Small values of λ mean similar objects.

One advantage of this approach over classical ordinal cor-relation measures is its capability to take into account differ-ences between images at a scale related to the chosen blocksize.

3. EXPERIMENTAL RESULTS

The proposed technique is applied to two important prob-lems: content-based retrieval of shape images and perfor-mance evaluation of segmentation algorithms. The experi-ments performed are presented and their results analyzed inthe rest of this section.

3.1. Shape similarity estimation

The shape similarity estimation experiments were conductedon two sets of 20 images. The two sets are taken from theMPEG-7 CE Shape test set B, which contains 1400 imagesgrouped in 70 categories. These test sets are chosen in sucha way as to assess the performance of our technique in esti-mating the object similarity within a single category (intra-category) and between contours from different categories(inter-category). Therefore, the first test set contains all thesamples in the bird category of the MPEG-7 Shape test set B,see Figure 2. While, the second set contains 20 objects takenfrom four different categories, see Figure 7. In both exper-iments, the similarity score λ is computed for all the pairs

Page 32: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

366 EURASIP Journal on Applied Signal Processing

λ

OP3

D1 · · · Dj · · · Dm

OP2

Metadifference

Metaslice MXj Metaslice MY

j

OP1

...

OP1

Slice SXk

Slice SYk

Region Rj

Pixel YkRegion Rj

Pixel Xk

Image X Image Y

Figure 5: The general framework for ordinal correlation of images.

16 17

5 20

4 4 9 9

4 4 9 9

B1

I

0 1 1 1

1 1 0 1

0 0 1 1

0 0 1 1

S1

0 0 1 1

0 1 0 1

0 0 0 0

0 0 0 0

S2

0 1 1 1

0 1 0 1

0 0 0 0

0 0 0 0

S5

0 0 1 1

0 0 0 1

0 0 0 0

0 0 0 0

S6

0 2 4 4

1 3 0 4

0 0 1 1

0 0 1 1

M1

Figure 6: Example of slices and metaslice for a 4 × 4 image using blocks of 2 × 2.

of shapes in the set. The similarity scores obtained are pre-sented in Tables 1 and 2. All the scores are multiplied by 103

when they are presented in the tables and the figures. Thedistance maps were generated inside the objects only withV0 = 50. This setting emphasizes the shape skeleton andgives less importance to contour pixels. The distance trans-formed images are resized to 32 × 32 pixels and blocks ofsize 4 × 4 were used. Larger images can be used if moreprecision is needed, this would imply the creation of moreslices and therefore more computational power would beneeded.

Figure 8 represents a surface plot of the similarity scoresin Table 1. It shows that within the same category, the scores

have small values, which means that they are quite simi-lar according to our measure. It is worth noticing that thescores on the diagonals are zero which means that each ob-ject is identical to itself, so there is no bias in the similarityscores. It is worth noticing that the scores obtained betweenthe “bird-3,” “bird-4,” “bird-5,” “bird-6,” and the rest of thebirds in this category are larger than the rest of the scores.This can be explained by the fact that these four birds havemuch shorter tails and have a more circular contours com-pared the rest of the birds. The similarity scores are low be-tween themselves, moreover, the scores for the pairs (bird-3,bird-4), (bird-5, bird-6), (bird-7, bird-8), and (bird-9, bird-11) are very small. By visual inspection one can verify that

Page 33: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

Ordinal-Measure Based Shape Correspondence 367

Table 1: Similarity scores for the contours in Figure 3. These scores are multiplied by 103.

Object bird

-1

bird

-2

bird

-3

bird

-4

bird

-5

bird

-6

bird

-7

bird

-8

bird

-9

bird

-10

bird

-11

bird

-12

bird

-13

bird

-14

bird

-15

bird

-16

bird

-17

bird

-18

bird

-19

bird

-20

bird-1 0 32 75 74 80 81 46 44 52 44 52 49 64 49 53 63 60 51 64 59bird-2 32 0 71 70 79 79 39 38 60 59 62 61 65 56 53 68 58 52 62 55bird-3 75 71 0 0 48 48 80 77 88 84 85 87 102 53 71 117 112 105 116 106bird-4 74 70 0 0 48 48 80 76 88 83 85 87 102 53 71 116 112 104 116 106bird-5 80 79 48 48 0 1 101 98 105 89 102 98 120 76 95 125 122 112 125 121bird-6 81 79 48 48 1 0 101 98 106 90 103 99 120 76 95 126 122 113 126 121bird-7 46 39 80 80 101 101 0 7 39 42 42 40 42 54 28 53 42 38 46 43bird-8 44 38 77 76 98 98 7 0 38 39 39 38 44 52 23 58 47 41 51 47bird-9 52 60 88 88 105 106 39 38 0 25 7 17 32 51 33 51 44 39 49 60bird-10 44 59 84 83 89 90 42 39 25 0 22 12 45 53 32 55 48 43 53 60bird-11 52 62 85 85 102 103 42 39 7 22 0 16 36 49 31 56 49 45 55 63bird-12 49 61 87 87 98 99 40 38 17 12 16 0 37 55 31 52 46 41 50 61bird-13 64 65 102 102 120 120 42 44 32 45 36 37 0 60 42 41 35 37 40 57bird-14 49 56 53 53 76 76 54 52 51 53 49 55 60 0 47 86 80 75 86 83bird-15 53 53 71 71 95 95 28 23 33 32 31 31 42 47 0 67 56 52 59 61bird-16 63 68 117 116 125 126 53 58 51 55 56 52 41 86 67 0 17 22 19 44bird-17 60 58 112 112 122 122 42 47 44 48 49 46 35 80 56 17 0 18 8 39bird-18 51 52 105 104 112 113 38 41 39 43 45 41 37 75 52 22 18 0 17 35bird-19 64 62 116 116 125 126 46 51 49 53 55 50 40 86 59 19 8 17 0 38bird-20 59 55 106 106 121 121 43 47 60 60 63 61 57 83 61 44 39 35 38 0

bird-16 bird-17 bird-18 bird-19 bird-20

cattle-5 cattle-6 cattle-7 cattle-8 cattle-9

fork-5 fork-6 fork-7 fork-8 fork-9

frog-10 frog-6 frog-7 frog-8 frog-9

Figure 7: Contours of test set 2 after alignment.

each pair of contours represent the same bird contour ro-tated or rotated and scaled. Therefore, we can safely say thatour measures have a 0.5% error, which can be explained bythe small contour variation introduced by rotation and thesize reduction of the distance maps. Lower error can be ob-tained by increasing the size of the distance map images andreducing the block size used for the metaslices creation.

Dark blue regions in Figure 8 represent very low scores(close to zero), which shows that there are quite many objectsin this category which are very similar or even identical.

To find out which are the most similar contours to a givencontour in Figure 7 we sort the scores on the raw correspond-ing to this contour in Table 2. Using Figure 9, one can easily

24

68

10

1214

1618

20

050

100

2 4 6 8 10 12 14 16 18 20

Figure 8: The similarity scores for the bird contours in Figure 3,dark blue cells mean most similar contours.

estimate which are the most similar objects within this cat-egory, based on the clustered dark blue cells. Figure 9 showsthat similarity scores between subjects from the same cate-gory are low, while those obtained for subjects from differentcategories are relatively high. Therefore, sorting the scores inascending order will yield the most similar object first.

The inter-category scores obtained by our similarity es-timation technique are larger than the inter-category ones.Therefore, this technique can be used as a shape classificationtechnique. Moreover, it is sensitive to intra-category shapevariations thus it can be used in a content-based retrievalsystem.

Page 34: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

368 EURASIP Journal on Applied Signal Processing

Table 2: Similarity scores for the contours in Figure 7. These scores are multiplied by 103.

Objects bird

-16

bird

-17

bird

-18

bird

-19

bird

-20

catt

le-5

catt

le-6

catt

le-7

catt

le-8

catt

le-9

fork

-5

fork

-6

fork

-7

fork

-8

fork

-9

frog

-10

frog

-6

frog

-7

frog

-8

frog

-9

bird-16 0 39 63 39 101 282 322 321 351 348 358 368 361 379 401 196 209 210 303 302bird-17 39 0 58 22 94 273 306 304 342 341 337 345 339 363 377 199 206 208 297 297bird-18 63 58 0 58 83 256 293 291 325 322 325 330 323 346 364 182 185 187 285 284bird-19 39 22 58 0 74 298 332 331 369 366 360 368 363 388 402 215 227 228 318 318bird-20 101 94 83 74 0 304 333 334 370 367 365 370 363 387 392 234 233 234 330 330cattle-5 282 273 256 298 304 0 73 71 75 73 247 221 184 206 244 162 152 153 136 138cattle-6 322 306 293 332 333 73 0 1 52 53 246 204 168 198 234 195 192 193 143 145cattle-7 321 304 291 331 334 71 1 0 51 52 247 206 169 199 236 193 191 192 141 144cattle-8 351 342 325 369 370 75 52 51 0 5 244 207 172 191 228 206 199 200 141 143cattle-9 348 341 322 366 367 73 53 52 5 0 244 209 173 190 232 202 196 197 140 142fork-5 358 337 325 360 365 247 246 247 244 244 0 59 118 84 63 373 347 346 356 359fork-6 368 345 330 368 370 221 204 206 207 209 59 0 62 74 77 355 337 338 326 329fork-7 361 339 323 363 363 184 168 169 172 173 118 62 0 81 123 320 304 305 287 290fork-8 379 363 346 388 387 206 198 199 191 190 84 74 81 0 79 351 331 331 319 321fork-9 401 377 364 402 392 244 234 236 228 232 63 77 123 79 0 379 354 352 339 340frog-10 196 199 182 215 234 162 195 193 206 202 373 355 320 351 379 0 50 51 132 129frog-6 209 206 185 227 233 152 192 191 199 196 347 337 304 331 354 50 0 1 138 138frog-7 210 208 187 228 234 153 193 192 200 197 346 338 305 331 352 51 1 0 139 139frog-8 303 297 285 318 330 136 143 141 141 140 356 326 287 319 339 132 138 139 0 2frog-9 302 297 284 318 330 138 145 144 143 142 359 329 290 321 340 129 138 139 2 0

24

68

1012

1416

1820

5

10

15

20

0200400

Figure 9: Similarity scores obtained for the contours in test set 2presented in Figure 7.

3.2. Segmentation quality evaluation

The objective evaluation of the performance of segmentationalgorithms is an important problem [20, 21, 22]. Even whena reference mask is available, comparing two segmentationmasks is still a difficult problem. Several factors make suchevaluation difficult, among the most important factors is thedifficulty to discriminate between many small distributed er-ror segments and few larger error segments.

Our shape correspondence technique proposed inSection 2, discriminates easily between the two cases of seg-mentation errors. The geodesic distance transformation isapplied inside each segment of the mask. Therefore, smallregions yield small distances inside them and therefore willgenerate pixels with low gray values.

0 5 10 15 20 25 30 35 40 45 50Frame number

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Err

orm

easu

re

COST AM ver 5.0COST AM ver 5.1

Figure 10: The segmentation performance scores for frames 4–49of the sequence “Erik,” for COST AM versions 5.0 and 5.1.

In this experiment, the segmentation masks resultingfrom the COST AM versions 5.0 and 5.1 [12, 13], are com-pared against a reference mask. The plot in Figure 10, showsthe segmentation performance scores obtained by our tech-nique, for the frames 4–49 of “Erik” sequence. The plots inFigures 11 and 12, show quantitative measures of the errorsin number of pixels from both COST AM versions 5.0 and5.1, respectively. Three different numbers are computed foreach frame:

• number of pixels of the background segmented as fore-ground pixels,

Page 35: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

Ordinal-Measure Based Shape Correspondence 369

5 10 15 20 25 30 35 40 45 50Frame number

0

500

1000

1500

2000

2500

3000

3500

4000

Nu

mbe

rof

pixe

ls

Total error pixelsObject to background errorBackground to object error

Figure 11: Plot of the segmentation errors, for COST AM 5.0.

5 10 15 20 25 30 35 40 45 50Frame number

0

500

1000

1500

2000

2500

3000

3500

4000

Nu

mbe

rof

pixe

ls

Total error pixelsObject to background errorBackground to object error

Figure 12: Plot of the segmentation errors, for COST AM 5.1.

• number of pixels of the foreground segmented as back-ground pixels,

• sum of the two previous numbers.

For illustration we present the colored segmentationmasks for frames 15 and 20 from “Erik” sequence in Figures13, 14, 15, and 16. The frame pixels are colored as follows:

• black represents the background,• white is the region where the reference and estimated

masks overlap,• green represents the areas of the background seg-

mented as part of the object,• purple represents the regions from the object merged

with the background.

Figure 13: The colored segmentation error of frame 15 from thesequence “Erik,” segmented using COST AM 5.0.

Figure 14: The colored segmentation result of frame 15 from thesequence “Erik,” segmented using COST AM 5.1.

Figure 15: The colored segmentation result of frame 20 from thesequence “Erik,” segmented using COST AM 5.0.

Figure 16: The colored segmentation result of frame 20 from thesequence “Erik,” segmented using COST AM 5.1.

Page 36: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

370 EURASIP Journal on Applied Signal Processing

It can be easily seen that our segmentation performancescores in Figure 10, correlate very well with the variation ofthe total number of pixel errors. Moreover, it reflects the vari-ation in both types of segmentation errors. Our measure in-herently resolves the case of many small errors and the caseof a single large error region.

4. CONCLUSIONS AND FUTURE WORK

In this paper, we proposed a contour correspondence mea-sure, based on distance transformation and ordinal correla-tion. The similarity scores obtained are in line with the vi-sual perception of the similarity between shapes. We showedthat the proposed measure can be used in two applications:shape similarity estimation in the context of content-basedimage retrieval, and for performance evaluation of segmen-tation algorithms. Simulation results were presented for bothapplications using images from the MPEG-7 shape test set Bfor the first application, and 50 frames from “Erik” sequencefor the second. The proposed technique produced encourag-ing results in both experiments. Further study is needed tooptimize the proposed technique in order to select appropri-ate parameters for the application at hand. Finally, furtheranalysis of the behavior of the proposed technique may iden-tify new applications.

REFERENCES

[1] C. Hildreth, “The detection of intensity changes by com-puter and biological vision systems,” in Proc. Computer Vision,Graphics, and Image Processing, vol. 22, pp. 1–27, 1983.

[2] T. V. Papathomas, “Special issue on visual-perception: guesteditorial,” International Journal of Imaging Systems and Tech-nology, vol. 7, no. 2, pp. 63–64, 1996.

[3] L. da F. Costa and R. M. Cesar Jr., Shape Analysis and Classifi-cation: Theory and Practice, CRC Press, Boca Raton, Fla, USA,2001.

[4] M. Flickner, H. Sawhney, W. Niblack, et al., “Query by imageand video content: The qbic system,” IEEE Computer Maga-zine, vol. 28, no. 9, pp. 23–32, 1995.

[5] C. F. Alaya, B. Cramariuc, C. Reynaud, et al., “Muvis: a sys-tem for content-based indexing and retrieval in large imagedatabases,” in Proc. SPIE/EI ’99 Conference on Storage and Re-trieval for Image and Video Databases VII, vol. 3656, pp. 98–106, San Jose, Calif, USA, January 1999.

[6] M. Trimeche, C. F. Alaya, M. Gabbouj, and B. Cramariuc,“Content-based description of images for retrieval in largedatabases: Muvis,” in Proc. X European Signal Processing Con-ference, Tampere, Finland, September 2000.

[7] C. H. Teh and R. T. Chin, “On the detection of dominantpoints on digital curves,” IEEE Trans. on Pattern Analysis andMachine Intelligence, vol. 11, no. 8, pp. 859–872, 1989.

[8] S. Abbasi, F. Mokhtarian, and J. Kittler, “Curvature scale spaceimage in shape similarity retrieval,” Springer Journal of Multi-media Systems, vol. 7, no. 6, pp. 467–476, 1999.

[9] A. Quddus, C. F. Alaya, and M. Gabbouj, “Wavelet-basedmulti-level object retrieval in contour images,” in Proc. In-ternational Workshop on Very Low Bit Rate Video Coding, pp.1–5, Kyoto, Japan, October 1999.

[10] M. W. Koch and R. L. Kashyap, “Using polygon to recognizeand locate partially occluded objects,” IEEE Trans. on PatternAnalysis and Machine Intelligence, vol. 9, no. 4, pp. 483–494,1987.

[11] J. C. Russ, The Image Processing Handbook, CRC, Springerand IEEE Press, 3rd edition, 1999.

[12] A. Alatan, L. Onural, M. Wollborn, R. Mech, E. Tuncel, andT. Sikora, “Image sequence analysis for emerging interactivemultimedia services—the European COST 211 framework,”IEEE Trans. Circuits and Systems for Video Technology, vol. 8,no. 7, pp. 802–813, 1998.

[13] M. Gabbouj, G. Morrison, C. F. Alaya, and R. Mech, “Redun-dancy reducation techniques and content analysis for mul-timedia services—the European COST 211quat action,” inProc. Workshop on Image Analysis for Multimedia InteractiveServices, Berlin, Germany, 31 May–1 June 1999.

[14] I. Shmulevich, B. Cramariuc, and M. Gabbouj, “A frameworkfor ordinal-based image correspondence,” in Proc. X Euro-pean Signal Processing Conference, Tampere, Finland, Septem-ber 2000.

[15] B. Cramariuc, I. Shmulevich, M. Gabbouj, and A. Makela, “Anew image similarity measure based on ordinal correlation,”in Proc. International Conference on Image Processing, vol. 3,pp. 718–721, Vancouver, BC, Canada, September 2000.

[16] J. C. Lin, “The family of universal axes,” Pattern Recognition,vol. 29, no. 3, pp. 477–485, 1996.

[17] P. J. Toivanen, “New geodesic distance transforms for gray-scale images,” Pattern Recognition Letters, vol. 17, no. 5, pp.437–450, 1996.

[18] I. Ragnelmam, “Neighborhoods for distance transformationsusing ordered propagation,” Computer Vision, Graphics andImage Processing, vol. 56, no. 3, pp. 399–409, 1992.

[19] M. Kendall and J. D. Gibbons, Rank Correlation Methods, Ed-ward Arnold, New York, 5th edition, 1990.

[20] P. Villegas, X. Marichal, and A. Salcedo, “Objective evaluationof segmentation masks in video sequences,” in Proc. Workshopon Image Analysis for Multimedia Services (WIAMIS ’1999),Berlin, Germany, May 1999.

[21] R. Mech, “Objective evaluation criteria for 2D-shape estima-tion results of moving objects,” in Proc. Workshop on ImageAnalysis for Multimedia Services, pp. 23–28, Tampere, Finland,May 2001.

[22] R. Mech and F. Marqu s, “Objective evaluation criteria for 2D-shape estimation results of moving objects,” EURASIP Jour-nal of Applied Signal Processing, Special Issue: Image Analysisfor Multimedia Interactive Services, 2002.

Faouzi Alaya Cheikh received his B.S.degree in electrical engineering in 1992from Ecole Nationale d’Ingenieurs de Tu-nis, Tunisia. He received his M.S. degree inelectrical engineering (Major in Signal Pro-cessing) from Tampere University of Tech-nology, Finland, in 1996. Mr. Alaya Cheikhis currently a Ph.D. candidate and works asa Researcher at the Institute of Signal Pro-cessing, Tampere University of Technology,Tampere, Finland. From 1994 to 1996, he was a Research Assis-tant at the Institute of Signal Processing, and from 1997 he hasbeen a Researcher with the same institute. His research interests in-clude nonlinear signal and image processing and analysis, patternrecognition and content-based analysis and retrieval. He has beenan active member in many Finnish and European research projectsamong them Nobless esprit, COST 211 quat, and MUVI. He servedas Associate Editor of the EURASIP Journal on Applied Signal Pro-cessing, Special Issue on Image Analysis for Multimedia InteractiveServices. He serves as a reviewer to several conferences and jour-nals. He co-authored over 30 publications.

Page 37: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

Ordinal-Measure Based Shape Correspondence 371

Bogdan Cramariuc received his M.S. de-gree in electrical engineering in 1993 fromPolytechnica University of Bucharest, Fac-ulty of Electronics and Telecommunica-tions, Bucharest, Romania. Mr. Cramariucis currently a Ph.D. candidate and works asResearcher for the Institute of Signal Pro-cessing at Tampere University of Technol-ogy, Tampere, Finland. From 1993 to 1994he worked as Teaching Assistant at the Fac-ulty of Electronics and Telecommunications at the PolytechnicaUniversity of Bucharest. During this period he has also been in-volved as Researcher with Electrostatica S.A., a national researchinstitute in Bucharest, Romania. Since 1995 he has been with theInstitute of Signal Processing at Tampere University of Technology,Tampere, Finland. His research interests include signal and imageanalysis, image segmentation, texture analysis, content-based in-dexing and retrieval in multimedia databases, mathematical mor-phology, computer vision, parallel processing, data mining, and ar-tificial intelligence. Mr. Cramariuc has been an active member inseveral Finnish and European projects, such as Nobless, Esprit andMUVI. He served as Associate Editor of the EURASIP Journal onApplied Signal Processing, Special Issue on Image Analysis for Mul-timedia Interactive Services.

Mari Partio was born 1979 in Finland. Autumn 1998 she startedher M.S. studies at Tampere University of Technology in depart-ment of Electric Engineering. She is majoring in Signal Process-ing and her minor is Software Engineering. Since summer 2000 shehas been working as a research assistant at the Institute of SignalProcessing. Her research interests include content-based image andvideo retrieval.

Pasi Reijonen was born 1976 in Finland. Started at Tampere Uni-versity of Technology in Automation Degree Program 1996 andreceived M.S. degree 2001. Major: Electronics. Minor: IndustrialManagement, Automation and Control Engineering. For the aca-demic year 1999–2000, he was as an Erasmus student in England(University of Leeds). Currently he is working as a researcher atTampere University of Technology at Institute of Signal Processing.His research interests are Shape Analysis for Content-Based ImageRetrieval.

Moncef Gabbouj received his B.S. degreein electrical engineering in 1985 from Ok-lahoma State University, Stillwater, and hisM.S. and Ph.D. degrees in electrical en-gineering from Purdue University, WestLafayette, Indiana, in 1986 and 1989, re-spectively. Dr. Gabbouj is currently a pro-fessor and Head of the Institute of SignalProcessing of Tampere University of Tech-nology, Tampere, Finland. From 1995 to1998 he was a professor with the Department of InformationTechnology of Pori School of Technology and Economics, Pori,and during 1997 and 1998 he was on sabbatical leave with theAcademy of Finland. From 1994 to 1995 he was an associate pro-fessor with the Signal Processing Laboratory of Tampere Uni-versity of Technology, Tampere, Finland. From 1990 to 1993 hewas a senior research scientist with the Research Institute forInformation Technology, Tampere, Finland. His research inter-ests include nonlinear signal and image processing and analysis,content-based analysis and retrieval and mathematical morphol-ogy. Dr. Gabbouj is the Vice-Chairman of the IEEE-EURASIP NSIP

(Nonlinear Signal and Image Processing) Board. He is currentlythe Technical Committee Chairman of the EC COST 211quat. Heserved as associate editor of the IEEE Transactions on Image Process-ing, and was guest editor of the European journal Signal Processing,special issue on nonlinear digital signal processing (August 1994).He is the chairman of the IEEE Finland Section and past chair of theIEEE Circuits and Systems Society, Technical Committee on Digi-tal Signal Processing, and the IEEE SP/CAS Finland Chapter. Hewas also the TPC Chair of EUSIPCO 2000 and the DSP track chairof the 1996 IEEE ISCAS and the program chair of NORSIG’ 96,and is the technical program chair of EUSIPCO 2000. He is alsomember of EURASIP AdCom. Dr. Gabbouj is the Director of theInternational University Program in Information Technology andmember of the Council of the Department of Information Tech-nology at Tampere University of Technology. He is also the Secre-tary of the International Advisory Board of Tampere InternationalCenter of Signal Processing, TICSP. He is a member of Eta KappaNu, Phi Kappa Phi, IEEE SP and CAS societies. Dr. Gabbouj wasco-recipient of the Myril B. Reed Best Paper Award from the 32ndMidwest Symposium on Circuits and Systems and co-recipient ofthe NORSIG 94 Best Paper Award from the 1994 Nordic SignalProcessing Symposium. He is co-author of over 150 publications.Dr. Gabbouj was the prime investigator in two ESPRIT projects, 1HCM project and several Tempus projects. He also served as Eval-uator of IST proposals, and Auditor of a number of ACTS and ISTprojects on multimedia security, augmented and virtual reality, im-age and video signal processing.

Page 38: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

EURASIP Journal on Applied Signal Processing 2002:4, 372–378c© 2002 Hindawi Publishing Corporation

Audio Classification in Speech and Music: A ComparisonBetween a Statistical and a Neural Approach

Alessandro BugattiDepartment of Electronics for Automation, University of Brescia, Via Branze 38, 25123 Brescia, ItalyEmail: [email protected]

Alessandra FlamminiDepartment of Electronics for Automation, University of Brescia, Via Branze 38, 25123 Brescia, ItalyEmail: [email protected]

Pierangelo MiglioratiDepartment of Electronics for Automation, University of Brescia, Via Branze 38, 25123 Brescia, ItalyEmail: [email protected]

Received 27 July 2001 and in revised form 8 January 2002

We focus the attention on the problem of audio classification in speech and music for multimedia applications. In particular,we present a comparison between two different techniques for speech/music discrimination. The first method is based on zerocrossing rate and Bayesian classification. It is very simple from a computational point of view, and gives good results in case ofpure music or speech. The simulation results show that some performance degradation arises when the music segment containsalso some speech superimposed on music, or strong rhythmic components. To overcome these problems, we propose a secondmethod, that uses more features, and is based on neural networks (specifically a multi-layer Perceptron). In this case we obtainbetter performance, at the expense of a limited growth in the computational complexity. In practice, the proposed neural networkis simple to be implemented if a suitable polynomial is used as the activation function, and a real-time implementation is possibleeven if low-cost embedded systems are used.

Keywords and phrases: speech/music discrimination, indexing of audio-visual documents, neural networks, multimediaapplications.

1. INTRODUCTIONEffective navigation through multimedia documents is nec-essary to enable widespread use and access to richer andnovel information sources.

Design of efficient indexing techniques to retrieve rele-vant information is another important requirement. Allow-ing for possible automatic procedures to semantically indexaudio-video material represents therefore a very importantchallenge. Such methods should be designed to create indicesof the audio-visual material, which characterize the temporalstructure of a multimedia document from a semantic pointof view.

The International Standard Organization (ISO) startedin October 1996 a standardization process for the descriptionof the content of multimedia documents, namely MPEG-7:the “Multimedia Content Description Interface” [1, 2]. How-ever, the standard specifications do not indicate methods forthe automatic selection of indices.

A possible mean is to identify series of consecutive seg-

ments, which exhibit a certain coherence, according to someproperty of the audio-visual material. By organizing the de-gree of coherence, according to more abstract criteria, itis possible to construct a hierarchical representation of in-formation, so as to create a Table of Content descriptionof the document. Such description appears quite adequatefor the sake of navigation through the multimedia docu-ment, thanks to the multi-layered summary that it provides[3, 4].

Traditionally, the most common approach to create anindex of an audio-visual document has been based on the au-tomatic detection of the changes of camera records and thetypes of involved editing effects. This kind of approach hasgenerally demonstrated satisfactory performance and leadto a good low-level temporal characterization of the visualcontent. However, the reached semantic level remains poorsince the description is very fragmented considering the highnumber of shot transitions occurring in typical audio-visualprograms.

Page 39: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

Audio Classification in Speech and Music: A Comparison Between a Statistical and a Neural Approach 373

Alternatively, there have been recent research efforts tobase the analysis of audio-visual documents by a joint audioand video processing so as to provide for a higher-level orga-nization of information [5, 6, 7, 8]. In [7, 8] these two sourcesof information have been jointly considered for the identifi-cation of simple scenes that compose an audio-visual pro-gram. The video analysis associated to cross-modal proce-dures can be very computationally intensive (by relying, e.g.,on identifying correlation between nonconsecutive shots).

We believe that audio information carries out by itself arich level of semantic significance, and this paper focuses onthis issue.

In particular, we propose and compare two simplespeech/music discrimination schemes for audio segments.

The first approach, based mainly on Zero Crossing Rate(ZCR) and Bayesian classification, is very simple from a com-putational complexity point of view, and gives good results incase of pure music or speech. Some problems arises when themusic segment contains also some speech superimposed onmusic, or strong rhythmic components.

To overcome this problem, we propose an alternativemethod, that uses more features and is based on neural net-works (specifically a Multi Layer Perceptron, MLP). In thiscase we obtain better performance, at the expense of anincreased computational complexity. Anyway, the proposedneural network is simple to be implemented if a suitablepolynomial is used as the activation function, and a real-timeimplementation is possible even if low-cost embedded sys-tems are used.

The paper is organized as follows. Section 2 is devoted toa brief description of the solutions for speech/music discrim-ination presented in the literature. The proposed algorithmsare described, respectively, in Sections 3 and 4, whereas inSection 5 we report and discuss the experimental results.Some concluding remarks are given in Section 6.

2. STATE OF THE ART SOLUTIONS

In this section, we focus the attention on the solutions pro-posed in the literature to the problem of speech/music dis-crimination.

Saunders [9] proposed a method based on the statisticalparameters of the ZCR, plus a measure of the short time en-ergy contour. Then, using a multivariate Gaussian classifier,he obtained a good percentage of class discrimination. Thisapproach is successful for discriminating speech from musicon a broadcast FM radio program, and it allows achievingthe goal for the low computational complexity and for therelative homogeneity of this type of audio signal.

Scheirer and Slaney [10] developed another approachto the same problem, which exploits different features stillachieving similar results. Even in this case the algorithmachieves real-time performance and uses time domain fea-tures (short-term energy, ZCR) and frequency domain fea-tures (4 Hz modulation energy, spectral rolloff point, cen-troid and flux, . . . ), extracting also their variance in onesecond segments. In this case, they use some methods for theclassification (Gaussian mixture model, K-nearest neighbor),

and they obtain similar results.Foote [11] adopted a technique purely data-driven, and

he did not extract subjectively “meaningful” acoustic param-eters. In his work, the audio signal is first parameterized intoMel-scaled Cepstral coefficients plus an energy term, obtain-ing a 13-dimensional feature vector (12 MFCC plus energy)at a 100 Hz frame rate. Then using a tree-based quantiza-tion the audio is classified into speech, music, and novocalsounds.

Saraceno and Leonardi [7], and Zhang and Kuo [12] pro-posed more sophisticated approaches to achieve a finest de-composition of the audio stream. In both works the audiosignal is decomposed at least in four classes: silence, music,speech, and environmental sounds.

In the first work, at the first stage, a silence detector isused, which divides the silence frames from the others with ameasure of the short time energy. It considers also their tem-poral evolution by dynamic updating of the statistical param-eters, and by means of a finite state machine, to avoid mis-classification errors. Hence, the three remaining classes aredivided using autocorrelation measures, local as well as con-textual, and the ZCR, obtaining good results, where misclas-sifications occur mainly at the boundary between segmentsbelonging to different classes.

In [12] the classification is performed at two levels: acoarse and a fine level. For the first level, it is used a mor-phological and statistical analysis of the energy function,average ZCR and the fundamental frequency. Then a rule-based heuristic procedure is proposed to classify audio sig-nals based on these features. At the second level, a furtherclassification is performed for each type of sounds. Becausethis finest classification is inherently semantic, for each classa different approach could be used. The results for the coarselevel show a good accuracy, and misclassification usually oc-curs in hybrid sounds, which contains more than one basictype of audio.

Liu et al. [13] used another kind of approach, becausetheir aim was to analyze the audio signal for a scene classifi-cation of TV programs. The features selected for this task areboth in time and frequency domain, and they are meaning-ful for the scene separation and classification. These featuresare: no-silence ratio, volume standard deviation, volume dy-namic range, frequency component at 4 Hz, pitch standarddeviation, voice of music ratio, noise or unvoiced ratio, fre-quency centroid, bandwidth and energy in 4 sub-bands ofthe signal. Feedforward neural networks are used successfullyas pattern classifiers in this work. The recognized classes areadvertisement, basketball, football, news, weather forecasts,and the results show the usefulness of using audio featuresfor the purpose of scene classifications.

An alternative approach in audio data partitioning con-sists in a supervised partitioning. The supervision concernsthe ability to train the models of the various clusters con-sidered in the partitioning. In literature, the Gaussian Mix-ture Models (GMM) [14] are frequently used to train themodels of the chosen clusters. From a reference segmentedand labeled database, the GMMs are trained on acoustic datafor modeling characterized clusters (e.g., speech, music, and

Page 40: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

374 EURASIP Journal on Applied Signal Processing

background). The great variability of noises (e.g., rumbling,explosion, creaking), and of music (e.g., classic, pop) ob-served on the audio-video databases (e.g., broadcast news,movie films) makes difficult to select a suitable training strat-egy of the models of the various clusters characterizing thesesounds. The main problem to train the models is the segmen-tation/labeling of large audio databases allowing a statisticaltraining. So long as the automatic partitioning is not perfect,the labeling of databases is time consuming of human ex-perts. To avoid this cost and to cover the processing of anyaudio document, the characterization must be generic, andan adaptation of the techniques of data partitioning on theaudio signals is required to minimize the training of the var-ious clusters of sounds.

In general, these algorithms suffer some performancedegradation when the music segment contains some speechsuperimposed on music, or strong rhythmic components. Aspreviously mentioned, in our work, to overcome these prob-lems, we propose a method that is based on neural networks,that gives good performance also in these specific cases, atthe expense of a limited growth in the computational com-plexity. The performance of this method are then comparedto that obtained using a statistical approach based on ZCRand Bayesian classification.

3. ZCR WITH BAYESIAN CLASSIFIER

As previously mentioned, several researchers assume an au-dio model composed of four classes: silence, music, speech,and noise.

In this work, we focus the attention on the specific prob-lem of audio classification in music and speech, assumingthat the silence segments have already been identified using,for example, the method proposed in [8].

For this purpose, we use a speech characteristic to dis-criminate it from the music; the speech shows a very regu-lar structure where the music does not show it. Indeed, thespeech is composed of a succession of vowels and conso-nants: while the vowels are high energy events with most ofthe spectral energy contained at low frequencies, the conso-nant are noise-like, with the spectral energy distributed moretowards the higher frequencies.

Saunders [9] used the ZCR, which is a good indicator ofthis behavior, as shown in Figure 1.

In our algorithm, depicted in Figure 2, the audio file ispartitioned into segments of 2.04 second; each of them iscomposed of 150 consecutive nonoverlapping frames. Thesevalues allow a statistical significance of the frame numberand, using a 22050 Hz sample frequency, each frame con-tains 300 samples, which is an adequate tradeoff betweenthe quasi-stationary properties of the signal and a sufficientlength to accurately evaluate the ZCR. For every frame, thevalue of the ZCR is then calculated using the definition givenin [9].

These 150 values of the ZCR are then used to estimate thefollowing statistical measures:

• variance: which indicates the dispersion with respectto the mean value;

0

20

40

60

80

100

0

10

20

30

4050

60

0 1 2 3 4 5 6 7 8 9

0 1 2 3 4 5 6 7 8 9 10Second

Second

Voice

Music

Zer

ocr

ossi

ng

rate

Zer

ocr

ossi

ng

rate

Figure 1: The ZCR behaviour for voice and music segments.

• third-order moment: which indicates the degree ofskewness with respect to the mean value;

• difference between the number of ZCR samples, whichare above and below the mean value.

Each segment of 2.04 seconds is thus associated with a 3-dimensional vector.

To achieve the separation between speech and music us-ing a computationally efficient implementation, a multivari-ate Gaussian classifier has been used. A set of about 4004-second-long audio sample, equally distributed betweenspeech and music, have been used to characterize the clas-sifier. At the end of this step we obtain a set of consecutivesegments labeled like speech or no-speech.

The next step is justified by an empirical observation: theprobability to observe a single segment of speech surroundedof music segments is very low, and vice versa. Therefore, asimple regularization procedure is applied to properly set thelabels of these spurious segments.

The boundaries between segments of different classes areplaced in fixed positions, inherently to the nature of theZCR algorithm. Obviously these boundaries are not placedin a sharp manner, thus a fine-level analysis of the segmentsacross the boundaries is needed to determine a sharp place-ment of them. In particular, the ZCR values of the neighbor-ing segments are processed to identify the exact position ofthe transition between speech and music signal. A new signalis obtained from these ZCR values, applying this function

y[n] =1P

n+P/2∑m=n−P/2

(x[m] − xn

)2 withP

2< n < 300 − P

2,

(1)

where x[n] is the nth ZCR value of the current segment, andxn is defined as

xn =1P

n+P/2∑m=n−p/2

x[m]. (2)

Page 41: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

Audio Classification in Speech and Music: A Comparison Between a Statistical and a Neural Approach 375

2.04 sec. segments of speech

2.04 sec. segments of no-speech

Flat audio stream

No silence segments Silence segments

Speech/no-speech segmentsseparation using ZCR

Silence detection usingshort-time energy

Recovering of spurioussegments

Fine level placementof boundaries

Figure 2: The proposed ZB algorithm.

Therefore, y[n] is an estimation of the ZCR variance in ashort window. A low-pass filter is then applied to this signalto obtain a smoother version of it, and finally a peak extractoris used to identify the transition between speech and music.

4. NEURAL NETWORK CLASSIFIER

The second approach we propose is based on a Multi-LayerPerceptron (MLP) network [15].

The MLP has been trained using five classes of audiotraces, supposing other audio sources, as silence or noise, tobe previously removed. The classes of audio traces consid-ered have been, namely: instrumental music without voice, asBeethoven symphony no. 6 (class labeled as “Am”), melodicsongs, as “My heart will go on” from Titanic (class labeledas “Bm”), rhythmic songs, as rap music or Dire Straits song“Sultans of swing” (class labeled as “Cm”), pure speech (classlabeled as “Av”), and speech superimposed on music (classlabeled as “Bv”), as commercials.

In the literature main features have been suggested forspeech/music discrimination, for example, see [16]. In thiswork, we have analysed more than 30 features, and eight ofthem have been selected as the neural network inputs. Theseparameters have been computed considering 86 frames by1024 points each (sampling frequency fs = 22050 Hz), with atotal observing time of about 4 seconds.

To test the effectiveness of the various features, and totrain the MLP, a set of about 400 4-second-long audio sam-

ples have been considered belonging to the five classes labeledas Am, Bm, Cm, Av, Bv, and equally distributed betweenspeech (Av, Bv) and music (Am, Bm, Cm). The discrimina-tion power of the selected features has been firstly evaluatedby computing the index α, defined by (3), for each feature Pj ,with j = 1 to 8, where µm and σm are, respectively, the meanvalue and standard deviation of parameter Pj for music sam-ples, and µv and σv are the same for speech. If parameter Pj

follows a Gaussian distribution, an α-value equal to 1 yieldsto a statistical classification error of about 15%. α-values be-tween 0.7 and 1 result for the selected features

α =∣∣∣∣µm − µvσm + σv

∣∣∣∣. (3)

A short description of the eight selected features follows.Parameter P1 is the spectral flux, as suggested in [10]. It indi-cates how rapidly changes the frequency spectrum, with par-ticular attention to the low frequencies (up to 2.5 kHz), andit generally assumes higher values for speech.

Parameters P2 and P3 are related to the short-time energy[17]. Function E(n), with n = 1 to 86, is computed as the sumof the square value of the previous 1024 signal samples. Afourth-order high-pass Chebyshev filter is applied with about100 Hz as the cutting frequency. Parameter P2 is computed asthe standard deviation of the absolute value of the resultingsignal, and it is generally higher in speech. Parameter P3 is theminimum of the short-time energy and it is generally lowerin speech, due to the pauses that occur among words or syl-lables.

Parameters P4 and P5 are related to the cepstrum coeffi-cients, evaluated using

c(n) =1

∫π

−πlog

∣∣X(e jw)∣∣e jwndw. (4)

Cepstrum coefficients cj(n), suggested in [18] as goodspeech detectors, have been computed for each frame, thenthe mean value cµ(n) and the standard deviation cσ(n) havebeen calculated, and parameters P4 and P5 result as indicatedin

P4 = cµ(9) · cµ(11) · cµ(13),

P5 = cσ(2) · cσ(5) · cσ(9) · cσ(12).(5)

Parameter P6 is related to the centroid that is computedstarting from the spectrum module of each frame.

Parameter P6 is the product of the mean value by thestandard deviation computed by the 86 values of barycentre.In fact, due to the speech discontinuity, standard deviationmakes this parameter more distinctive.

Parameter P7 is related to the ratio of the high-frequencypower spectrum (7.5 kHz < f < 11 kHz) to the whole powerspectrum. The speech spectrum is usually considered up to4 kHz, but the lowest limit has been increased to consider sig-nals with speech over music. To consider the speech discon-tinuity and increase the discrimination between speech andmusic, P7 is the ratio of the mean value to the standard devi-ation obtained by the 86 values of the relative high-frequencypower spectrum.

Page 42: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

376 EURASIP Journal on Applied Signal Processing

P1P2P3P4P5P6P7P8

MLP

MAX

MAX

PAm

PBm

PCm

PAv

PBv

Pm

Pv

+

y y ≥ 0, music

y < 0, speech

Figure 3: The decision algorithm.

Parameter P8 is the syllabic frequency [10] computedstarting from the short-time energy calculated on 256 sam-ples (≈ 12 ms) instead of 1024. A 5-taps median filter hasfiltered this signal, and the computed syllabic frequency (P8)is the number of peaks detected in 4 seconds. As it is known,music should present a greater number of peaks [10].

The proposed MLP has eight input, corresponding to thenormalized features P1 ÷P8, fifteen hidden neurons, five out-put neurons, corresponding to the five considered classes,and uses normalized sigmoid activation function.

The 400 audio samples, that have been used also forthe ZCR with Bayesian classifier, have been divided intothree sets: training (200 samples), validation (100 samples),and test (100 samples). Each sample is formatted as {P1 ÷P8, PAv, PBv, PAm, PBm, PCm}, where PAv is the probability thatsample belongs to class Av.

The goal is to distinguish between speech and music andnot to identify the class; for this purpose a different andmore complex set of parameters should be designed. To per-form the proposed binary classification, target has been as-signed with “1” to the selected class, “0” to the farest class,a value between 0.8 and 0.9 to the similar classes, and avalue between 0.1 and 0.2 to the other classes. For instance,if a sample of Bm (melodic songs) is considered, PBm = 1,PAm = PCm = 0.8 because music is dominant, PBv = 0.2 be-cause it is anyway a mix of music and voice, and PAv = 0.1,because the selected sample contains voice.

If a pure music sample is considered (class Am), PAm = 1,PBm = PCm = 0.8 because it is a mix of music and voice wheremusic is dominant, PBv = 0.1 because it contains music, andPAv = 0, because pure speech is the farest class. In fact, clas-sifying the speech over music as speech inclines the MLP toclassify as speech some rhythmic songs: by adjusting the sam-ple target it is possible to incline to one side or another theMLP response.

The MLP has been trained using the Levenberg-Marquardt method [19] with a starting value of µ equal to1000 (slow and accurate behavior). The decision algorithmis depicted in Figure 3.

The mean square error related to the 400 samples wasabout 4%. It should be noticed that most of the music sam-ples wrongly classified as speech belonged to the class Cm,that is, rhythmic songs as, for example, rap music.

The selected features are rather simple to be computedeven by a low-cost device (DSP, microcontroller), except forparameters P4 and P5, related to the cepstrum coefficients.If P4 and P5 are neglected, and a 6-inputs MLP is used, themean square error related to the 400 samples increases toabout 5%.

186 ms 4 s4 s

f0 f1 f2 f3 f4 fi f86 f87 f88 f89 f90

Pj[k] Pj[k+1]

Figure 4: Features Pj updating frequency.

The neural network is simple to be implemented if a suit-able polynomial is used as the activation function [20], anda real-time implementation is possible even if low-cost em-bedded systems are used.

Output y is updated every 4 seconds, and this could bea limit to finely detect the exact position of class changes.To increase the output updating frequency, a circular framebuffer has been provided, and features pj , in terms of meanvalue and standard deviation, are updated every 186 ms, cor-responding to 4 frames fi, as shown in Figure 4.

The new updating frequency has been chosen as thefastest to be implemented on a low-cost DSP (TMS320C31).In addition, this operation allow low-pass filters to be appliedto the MLP output before the maximum value has been com-puted.

5. SIMULATION RESULTS

The proposed algorithms have been tested by computer sim-ulations to estimate the classification performance. The testscarried out can be divided into two categories: the first oneis about the misclassification errors, while the second oneis about the precision in music-speech and speech-musicchange detection.

Considering the misclassification errors, we defined threeparameters as follows:

• MER (Music Error Rate): it represents the ratio betweenthe total duration of the music segments misclassified, andthe total duration of the music test file.

• SER (Speech Error Rate): it represents the ratio betweenthe total duration of the speech segments misclassified, andthe total duration of the speech test file.

• TER (Total Error Rate): it represents the ratio betweenthe total duration of the segments misclassified in the wrongcategory (both music and speech), and the total duration ofthe test file.

The selection of the test files was carried out “manu-ally,” that is, each file is composed of many pieces of differenttypes of audio (different speakers over different environmen-tal noise, different kinds of music such as classical, pop, rap,funky, etc.) concatenated in order to have a five minutes seg-ment of speech followed by a five minutes segment of music,and so on, for a total duration of 30 minutes.

All the content of this file has been recorded from anFM radio receiver, and it has been sampled at a frequencyof 22050 Hz, with a 16-bit uniform quantization.

The classification results for both the proposed methodsare shown in Table 1.

Page 43: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

Audio Classification in Speech and Music: A Comparison Between a Statistical and a Neural Approach 377

Table 1: Classification results of the proposed algorithms (MLP:Multi Layer Perceptron; ZB: ZCR with Bayesian classifier).

MER SER TER

MLP 11.62% 0.17% 6.0%

ZB 29.3% 6.23% 17.7%

Music Music MusicSpeech Speech Speech

(a)

(b)

0 5 10 15 20 25 30

t[min]

Figure 5: Graphical display of the classification results ((a) MLP,(b) ZB).

From the analysis of the simulation results, we can seethat the MLP method gives better results compared to theZB one, having a lower error rate both in music and speech.

Moreover, both the methods show the worst perfor-mance in the classification of the music segments, that is,many segments of music are classified as speech than vicev-ersa. For a better understanding of these results, Figure 5.

In the figure, the white intervals represent the segmentsclassified as speech, whereas the black ones show the seg-ments classified as music. From this figure, it appears clearlythat the worst classification results are obtained in the thirdmusic segment, between the minutes 20 and 25. The expla-nation is that these pieces of music contain strong voicedcomponents, under a weak music component (e.g., rap andfunky). The neural network makes some mistakes only withthe rap song (minutes 23–24 referred in Figure 5), while theZB approach misclassifies the funky song (minutes 20–23)too. Commercials, that includes speech with music in back-ground, are present in the test file at minutes 17–18: in thiscase the ZB approach shows only some uncertainties.

The problem related to music identification is due mainlyto the following reasons:

• The MLP has been trained to recognize also music witha voiced component, and it gets wrong only if the voicedcomponent is too rhythmic (e.g., rap song in our case). Onthe other hand, the Bayesian classifier used in the ZB ap-proach does not take into account cases with mixed compo-nent (music and voice), and therefore in this case the classifi-cation results are significantly affected by the relative strong-ness of the spurious components.

• Furthermore, the ZB approach, that uses very few pa-rameters, is inherently unable to discriminate between pure

Table 2: MLP (a), and ZB (b) change detection results expressed inseconds.

PM2S PS2M

Min 0.56 0.19

Mean 1.30 1.53

Max 1.49 2.98

(a)

PM2S PS2M

Min 0.56 12.28

Mean 1.30 14.51

Max 2.79 16.74

(b)

speech and speech with music background, while the MLPnetwork, which uses more features, is able to make it.

Considering the precision of music-speech and speech-music change detection, we measured the distance betweenthe correct point in the time scale when a change occurred,and the nearest change point automatically extracted fromthe proposed algorithms. In particular, we have measuredthe maximum, minimum, and the mean interval between thereal change and the extracted one. The results are shown inTable 3(b), where PS2M (Precision Speech to Music) is theerror in speech to music change detection, and PM2S (Preci-sion Music to Speech) is the error in music to speech changedetection.

Also in this case, the MLP obtains better performancethan the ZB.

6. CONCLUSION

In this paper, we have proposed and compared two differ-ent algorithms for audio classification into speech and mu-sic. The first method is based mainly on ZCR and Bayesianclassification (ZB). It is very simple from a computationalpoint of view and gives good results in case of pure music orspeech. Anyway some performance degradation arises whenthe music segment contains also some speech superimposedon music, or strong rhythmic components. We have pro-posed therefore a second method that is based on a Multi-Layer Perceptron. In this case we obtain better performance,at the expense of a limited growth in the computational com-plexity. In practice, a real-time implementation is possibleeven if low-cost embedded systems are used.

ACKNOWLEDGMENTS

The authors wish to thank Profs. D. Marioli and R. Leonardifor numerous valuable comments and suggestions, and Ing.C. Pasin for providing some of the simulation results.

Page 44: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

378 EURASIP Journal on Applied Signal Processing

REFERENCES

[1] MPEG Requirement Group. MPEG-7: Overview of theMPEG-7 Standard. ISO/IEC JTC1/SC29/WG11 N3752,France, October 1998.

[2] MPEG-7: ISO/IEC 15938-5 Final Commitee Draft-Information Technology-Multimedia Content DescriptionInterface-Part 5 Multimedia Description Schemes, ISO/IECJTC1/SC29/WG11 MPEG00/N3966, Singapore, May 2001.

[3] N. Adami, A. Bugatti, R. Leonardi, P. Migliorati, and L. A.Rossi, “Describing multimedia documents in natural andsemantic-driven ordered hierarchies,” in Proc. IEEE Int. Conf.Acoustics, Speech, Signal Processing, pp. 2023–2026, Istanbul,Turkey, June 2000.

[4] N. Adami, A. Bugatti, R. Leonardi, P. Migliorati, and L. A.Rossi, “The ToCAI description scheme for indexing and re-trieval of multimedia document,” Multimedia Tools and Ap-plications, vol. 14, no. 2, pp. 153–173, 2001.

[5] Y. Wang, Z. Liu, and J. Huang, “Multimedia content analysisusing audio and visual information,” IEEE Signal ProcessingMagazine, vol. 17, no. 6, pp. 12–36, 2000.

[6] K. Minami, A. Akutsu, H. Hamada, and Y. Tonomura, “Videohandling with music and speech detection,” IEEE Multimedia,vol. 5, no. 3, pp. 17–25, 1998.

[7] C. Saraceno and R. Leonardi, “Indexing audio-visualdatabases through a joint audio and video processing,” Int.J. Image Syst. Technol., vol. 9, no. 5, pp. 320–331, 1998.

[8] A. Bugatti, R. Leonardi, and L. A. Rossi, “A video indexingapproach based on audio classification,” in Proc. InternationalWorkshop on Very Low Bit Rate Video, pp. 75–78, Kyoto, Japan,October 1999.

[9] J. Saunders, “Real-time discrimination of broadcastspeech/music,” in Proc. IEEE Int. Conf. Acoustics, Speech, Sig-nal Processing, vol. 2, pp. 993–996, Atlanta, Ga, USA, May1996.

[10] E. Scheirer and M. Slaney, “Construction and evaluationof a robust multifeature speech/music discriminator,” inProc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, pp.1331–1334, Munich, Germany, April 1997.

[11] J. T. Foote, “A similarity measure for automatic audio classi-fication,” in Proc. AAAI 1997 Spring Symposium on IntelligentIntegration and Use of Text, Image, Video, and Audio Corpora,Stanford, Calif, USA, March 1997.

[12] T. Zhang and C. C. J. Kuo, “Hierarchical classification ofaudio data for archiving and retrieving,” in Proc. IEEE Int.Conf. Acoustics, Speech, Signal Processing, vol. 6, pp. 3001–3004, Phoenix, Ariz, USA, March 1999.

[13] Z. Liu, J. Huang, Y. Wang, and T. Chen, “Audio feature ex-traction and analysis for scene classification,” in Proc. IEEE1997 Workshop on Multimedia Signal Processing, Princeton,NJ, USA, June 1997.

[14] J. L. Gauvain, L. Lamel, and G. Adda, “Partitioning and tran-scription of broadcast news data,” in Proc. International Con-ference on Speech and Language Processing, vol. 4, pp. 1335–1338, Sydney, Australia, December 1998.

[15] S. Haykin, Neural Networks: A Comprehensive Foundation,Prentice-Hall, Upper Saddle River, NJ, USA, 2nd edition,1999.

[16] M. J. Carey, E. S. Parris, and H. Lloyd-Thomas, “A comparisonof features for speech, music discrimination,” in Proc. IEEEInt. Conf. Acoustics, Speech, Signal Processing, pp. 1432–1436,Phoenix, Ariz, USA, March 1999.

[17] L. R. Rabiner and R. W. Schafer, Digital Processing of SpeechSignals, Prentice-Hall, Englewood Cliffs, NJ, USA, 1988.

[18] L. Rabiner and B. Juang, Fundamentals of Speech Recognition,Prentice-Hall, Englewood Cliffs, NJ, USA, 1993.

[19] D. G. Luenberger, Linear and Nonlinear Programming, Addi-son Wesley, Ontario, Canada, 2nd edition, 1989.

[20] A. Flammini, D. Marioli, D. Pinelli, and A. Taroni, “A sim-ple neural technique for sensor data processing,” in Proc.IEEE International Workshop on Emerging Technologies, Intelli-gent Measurement and Virtual Systems for Instrumentation andMeasurement, pp. 1–10, Minnesota Club, St. Paul, Minn, USA,May 1998.

Alessandro Bugatti was born in Brescia,Italy, in 1971. He received the Dr. Eng. de-gree in electronic engineering in 1998, fromthe University of Brescia. Currently he isworking at University of Brescia as externconsultant. He was involved in the activitiesof the European project AVIR from 1998 to2000 and in the MPEG-7 standardizationprocess. His research interest is in audio seg-mentation and classification, user interfacesand audio-video indexing. His background includes also studies inAI and expert system fields. His efforts are currently devoted toboth automatic audio sequences content analysis and cross-modalanalysis.

Alessandra Flammini was born in Brescia,Italy, in 1960. She graduated with honors inPhysics at the University of Rome, Italy, in1985. From 1985 to 1995 she worked on in-dustrial research and development on digi-tal drive control. Since 1995, she has been aResearcher at the Department of Electronicsfor Automation of the University of Brescia.Her main field activity is the design of dig-ital electronic circuits (FPGA, DSP, proces-sors) for measurement instrumentation.

Pierangelo Migliorati got the Laurea (cumlaude) in electronic engineering from thePolitecnico di Milano in 1988, and the Mas-ter in information technology from theCEFRIEL Research Centre, Milan, 1989,respectively. He joined CEFRIEL researchcenter in 1990. From 1995 he is AssistantProfessor at University of Brescia, where heis involved in activities related to channelequalization and indexing of multimediadocuments. He is a member of IEEE.

Page 45: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

EURASIP Journal on Applied Signal Processing 2002:4, 379–388c© 2002 Hindawi Publishing Corporation

Video Segmentation Using Fast Marchingand Region Growing Algorithms

Eftychis SifakisDepartment of Computer Science, University of Crete, P.O. Box 2208, Heraklion, GreeceEmail: [email protected]

Ilias GriniasDepartment of Computer Science, University of Crete, P.O. Box 2208, Heraklion, GreeceEmail: [email protected]

Georgios TziritasDepartment of Computer Science, University of Crete, P.O. Box 2208, Heraklion, GreeceEmail: [email protected]

Received 31 July 2001

The algorithm presented in this paper is comprised of three main stages: (1) classification of the image sequence and, in the case ofa moving camera, parametric motion estimation, (2) change detection having as reference a fixed frame, an appropriately selectedframe or a displaced frame, and (3) object localization using local colour features. The image sequence classification is based onstatistical tests on the frame difference. The change detection module uses a two-label fast marching algorithm. Finally, the objectlocalization uses a region growing algorithm based on the colour similarity. Video object segmentation results are shown usingthe COST 211 data set.

Keywords and phrases: video object segmentation, change detection, colour-based region growing.

1. INTRODUCTION

Video segmentation is a key step in image sequence analy-sis and its results are extensively used for determining mo-tion features of scene objects, as well as for coding purposesto reduce storage requirements. The development and wide-spread use of the international coding standard MPEG-4 [1],which relies on the concept of image/video objects as trans-mission elements, has raised the importance of these meth-ods. Moving objects could also be used for content descrip-tion in MPEG-7 applications.

Various approaches have been proposed for video orspatio-temporal segmentation. An overview of segmentationtools, as well as of region-based representations of image andvideo, are presented in [2]. The video object extraction couldbe based on change detection and moving object localization,or on motion field segmentation, particularly when the cam-era is moving. Our approach is based exclusively on changedetection. The costly and potentially inaccurate motion es-timation process is not needed. We present here some rele-vant work from the related literature for better situating ourcontribution.

Spatial Markov Random Fields (MRFs) through theGibbs distribution have been widely used for modelling thechange detection problem [3, 4, 5, 6, 7, 8]. These approachesare based on the construction of a global cost function, whereinteractions (possibly nonlinear) are specified among differ-ent image features (e.g., luminance, region labels). Multi-scale approaches have also been investigated in order to re-duce the computational overhead of the deterministic costminimization algorithms [7] and to improve the quality ofthe field estimates.

In [9], a motion detection method based on an MRFmodel was proposed, where two zero-mean generalizedGaussian distributions were used to model the interframedifference. For the localization problem, Gaussian distribu-tion functions were used to model the intensities at the samesite in two successive frames. In each problem, a cost func-tion was constructed based on the above distributions alongwith a regularization of the label map. Deterministic relax-ation algorithms were used for the minimization of the costfunction.

On the other hand, approaches based on contour evo-lution [10, 11] or on partial differential equations are also

Page 46: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

380 EURASIP Journal on Applied Signal Processing

proposed in the literature. In [12], a three-step algorithmis proposed, consisting of contour detection, estimation ofthe velocity field along the detected contours and finally thedetermination of moving contours. In [13], the contours tobe detected and tracked are modelled as geodesic active con-tours. For the change detection problem a new image is gen-erated, which exhibits large gradient values around the mov-ing area. The problem of object tracking is posed in a unifiedactive contour model including both change detection andobject localization.

In the framework of COST 211, an Analysis Model (AM)is proposed for image and video analysis and segmentation[14]. The essential feature of the AM is its ability to fuse in-formation from different sources: colour segmentation, mo-tion segmentation, and change detection. Kim et al. [15] pro-posed a method using global motion estimation, change de-tection, temporal and spatial segmentation.

Our algorithm, after the global motion estimation phase,is mainly based on change detection. The change detectionproblem is formulated as two-label classification. In [16]we introduce a new methodology for pixel labelling calledBayesian Level Sets, extending the level set method [17] topixel classification problems. We have also introduced theMulti-Label Fast Marching algorithm and applied it at firstto the change detection problem [18]. A more recent and de-tailed presentation is given in [19]. The algorithm presentedin this paper differs from previous work in the final stage,where the boundary-based object localization is replaced bya region-based object labelling.

In Section 2, the method for selecting the appropriateframe difference for detecting the moving object is presented.In Section 3, we present the multi-label fast marching algo-rithm, which uses the frame difference and an initial labellingfor segmenting the image into unchanged and changed re-gions with respect to the camera, that is, changes indepen-dent of the camera motion. The last step of the entire algo-rithm is presented in Section 4 where a region growing tech-nique extends an initial segmentation map. Section 5 con-cludes the paper, commenting on the obtained results.

2. FRAME DIFFERENCE

In our approach, the main step in video object segmenta-tion is change detection. Therefore, for each frame we mustfirst determine another frame which will be retained as a ref-erence frame and used for the comparison. Three differentmain situations may occur: (a) a constant reference frame, asin surveillance applications, (b) another frame appropriatelyselected, in the case of a still camera, and (c) a computed dis-placed frame, in the case of a moving camera.

The image sequence must be classified according to theabove categories. We use a hierarchical categorization basedon statistics of frame differences (Figure 1). At first the hy-pothesis (a) is tested against the other two. We can considerthere to exist a unique background reference image if, for anumber of frames, the observed frame differences are negli-gible. A test on the empirical probability distribution is thenused.

Independent motion

Testdifference

Test firstframes

Known background

Change detection

Figure 1: The tests of image sequence classification.

When the reference is not constant we have to determinethe more appropriate reference in order to identify indepen-dently moving objects. In order to determine the referenceframe, it must be ascertained whether the camera is mov-ing or not. The test is again based on the empirical proba-bility distribution of the frame differences. More precisely, ifthe probability that the observed frame difference is less than3, is less than 0.5, then the camera is considered as possiblymoving, and the parametric camera motion is estimated, ac-cording to an algorithm presented later.

Before considering the two possible cases we will presentthe statistical model used for the frame difference, becausethe determination of the appropriate reference frame is basedon this model. Let D = {d(x, y), (x, y) ∈ S} denote the graylevel difference image. The change detection problem con-sists of determining a “binary” label Θ(x, y) for each pixelon the image grid. We associate the random field Θ(x, y)with two possible events, Θ(x, y) = static (unchanged pixel),and Θ(x, y) = mobile (changed pixel). Let pD|static(d | static)(resp., pD|mobile(d | mobile)) be the probability density func-tion of the observed inter-frame difference under the H0

(resp., H1) hypothesis. These probability density functionsare assumed to be zero-mean Laplacian for both hypotheses(l = 0, 1)

p(d(x, y) | Θ(x, y) = l

)=λl2e−λl |d(x,y)|. (1)

Let P0 (resp., P1) be the a priori probability of hypothesis H0

(resp., H1). Thus the probability density function is given by

pD(d) = P0pD|0(d | static

)+ P1pD|1

(d | mobile

). (2)

In this mixture distribution {Pl, λl; l ∈ {0, 1}} are unknownparameters. The principle of Maximum Likelihood is used toobtain an estimate of these parameters [20].

In the case of a still camera, the current frame must becompared to another frame sufficiently distinct, that is, aframe where the moving object is displaced to be clearly de-tectable. For that the mixture of Laplacian distributions (2) isfirst identified. The degree of discrimination of the two dis-tributions is indicated by the ratio of the two corresponding

Page 47: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

Video Segmentation Using Fast Marching and Region Growing Algorithms 381

Coast guard Container ship Erik

Road surveillance Tennis table Urbicande

Figure 2: Initial labelled sets.

standard deviations, or, equivalently, by the ratio of the twoestimated parameters λ0 and λ1. Indeed, the Bhattacharyadistance between the two distributions is equal to ln(λ0 +λ1)/2

√λ0λ1. So we search for the closest frame which is suf-

ficiently discriminated from the current one. Indeed, a valuebelow the threshold means that the objects’ movement issmall, and therefore it is difficult to detect the object. Thethreshold (Tλ) on the ratio of standard deviations is suppliedby the user, and thus is determined by the frame difference.

In the case of a moving camera the frame differenceis determined by the displaced frame difference of succes-sive frames. The camera movement must be computed forobtaining the displaced frame difference. We use a three-parameter model for describing the camera motion, com-posed of two translation parameters, (u, v), and a zoom pa-rameter, ε. The estimation of the three parameters is based ona frame matching technique with a robust criterion of leastmedian of absolute displaced differences

min median{∣∣I(x, y, t)−I(x−u − εx, y − v − εy, t − 1)

∣∣}.(3)

Only a fixed number of possible values for the set of mo-tion parameters (u, v, ε) is considered. Assuming convexity,we perform a series of refinements on the parameter space,a three-dimensional “divide-and-conquer” which yields thedesired minimum within an acceptable accuracy after onlyfour steps. In our implementation this requires the compu-tation of roughly one hundred values of the median of ab-

solute differences. For reasons of computational complexitythe median is determined using the histogram of the absolutedisplaced frame differences.

3. CHANGE DETECTION USING FAST MARCHINGALGORITHM

3.1. Initial labelling

The labelling algorithm requires some initial correctly la-belled sets. For that we use statistical tests with high confi-dence for the initialisation of the label map. The percentageof points labelled by purely statistical tests depends on theability to discriminate the two classes, which is related to theamount of relative object motion. For the Coast Guard se-quence (Figure 2), where it is difficult to distinguish the lit-tle boat, less than one percent of pixels are initialized. Thebackground is shown in black, the foreground in white andunlabelled points in gray. For the Erik sequence (Figure 2),for which the two probability density functions are shown inFigure 3, a large number of pixels are classified in the initial-ization stage.

The first test detects changed sites with high confidence.The false alarm probability is set to a small value, say PF . Thethreshold for labelling a pixel as “changed” is

T1 =1λ0

ln1PF

. (4)

Page 48: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

382 EURASIP Journal on Applied Signal Processing

−25 −20 −15 −10 −5 0 5 10 15 20 250

0.05

0.1

0.15

0.2

0.25

Den

sity

0.3

0.35

0.4

Difference−25 −20 −15 −10 −5 0 5 10 15 20 250

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

Den

sity

Difference

Figure 3: Mixture decomposition in Laplacian distributions for the inter-frame difference (Erik sequence).

Table 1

w 3 4 5 6 7

γ1w 1.6 3.6 7.0 12.0 20.0

γ2w 0.4 1.0 1.6 4.0 10.0

Subsequently, a series of tests is used for finding unchangedsites with high confidence, that is, with a small probabilityof non-detection. For these tests a series of six windowsof dimension (2w + 1)2, w = 2, . . . , 7, is considered andthe corresponding thresholds are preset as a function of λ1.We denote by Bw the set of pixels labelled as unchangedwhen testing the window indexed by w. We set them asfollows:

Bw ={

(x, y) :w∑

k=−w

w∑l=−w

∣∣d(x + k, y + l)∣∣ < γw

λ1

}, (5)

for w = 2, . . . , 7. The probability of non-detection dependson the threshold γw, while λ1 is inversely proportional tothe dispersion of d(x, y) under the “changed” hypothesis.As the evaluation of this probability is not straightforward,the numerical value of γw is empirically fixed. The param-eter γ2 is chosen such that at least one pixel is labelled as“changed.” The other parameters (w = 3, . . . , 7) are such thatγw = γ1

w + γ2wvm, where vm is proportional to the amount of

camera motion. In Table 1 we give the values used in our im-plementation.

Finally, the union of the above sets ∪7w=2Bw determines

the initial set of “unchanged” pixels.

3.2. Label propagation

A multi-label fast marching level set algorithm is then ap-plied to all sets of points initially labelled. This algorithmis an extension of the well-known fast marching algorithm[17]. The contour of each region is propagated according

to a motion field, which depends on the label and on theabsolute inter-frame difference. The label-dependent prop-agation speed is set according to the a posteriori probabil-ity principle. As the same principle will be used later forother level set propagations and for their respective veloc-ities, we shall present here the fundamental aspects of thedefinition of the propagation speed. The candidate label isideally propagated with a speed in the interval [0, 1], equalin magnitude to the a posteriori probability of the candi-date label at the considered point. We define the propagationspeed at a site (x, y), for a candidate label l and for a datavector d,

vl(x, y) = Pr{l(x, y) | d(x, y)

}. (6)

Then we can write

vl(x, y) =p(d(x, y) | l(x, y)

)Pr{l(x, y)

}∑

k p(d(x, y) | k(x, y)

)Pr{k(x, y)

} . (7)

Therefore the propagation speed depends on the likelihoodratios and on the a priori probabilities. The likelihood ratioscan be evaluated according to assumptions on the data, andthe a priori probabilities could be estimated, either globallyor locally, or assumed all equal.

In the case of a decision between the “changed” and the“unchanged” labels according to the assumption of Laplaciandistributions, the likelihood ratios are exponential functionsof the absolute value of the inter-frame difference. In a pixel-based framework the decision process is highly noisy. More-over, the moving object might be non-rigid, its various com-ponents undergoing different movements. In regions of uni-form intensity the frame difference could be small, while theobject is moving. The memory of the “changed” area of theprevious frames should be used in the definition of the local apriori probabilities used in the propagation process. Accord-ing to (1) and (7) the two propagation velocities could be

Page 49: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

Video Segmentation Using Fast Marching and Region Growing Algorithms 383

written as follows:

v0(x, y)=1

1+(Q1(x, y; 0)λ1/Q0(x, y; 0)λ0

)e(λ0−λ1)|d(x,y)| ,

v1(x, y)=1

1+(Q0(x, y; 1)λ0/Q1(x, y; 1)λ1

)e−(λ0−λ1)|d(x,y)| ,

(8)

where the parameters λ0 and λ1 have been previously esti-mated. We distinguish the notation of the a priori probabili-ties defined here from those given in (2), because they shouldadapte to the conditions of propagation and to local situa-tions. Indeed, the above velocity definition is extended in or-der to include the neighbourhood of the considered point

vl(x, y) = Pr{l(x, y) | d(x, y), k

(x′, y′

),(

x′, y′) ∈ �(x, y)

},

(9)

where the neighbourhood �(x, y) may depend on the la-bel, and may be defined on the current frame as well as onprevious frames. Therefore, in this case the ratio of a prioriprobabilities is adapted to the local context, as in a Marko-vian model. A more detailed presentation of the approachfor defining and estimating these probabilities follows.

From the statistical analysis of the data’s mixture distri-bution we have an estimation of the a priori probabilities ofthe two labels (P0, P1). This is an estimation and not a prioriknowledge. However, the initially labelled points are not nec-essarily distributed according to the same probabilities, be-cause the initial detection depends on the amount of motion,which could be spatially and temporally variant. We define aparameter β measuring the divergence of the two probabilitydistributions as follows:

β =(P0P1

P1P0

)β0(P0+P1)

, (10)

where P0 + P1 + Pu = 1, Pu being the percentage of unlabelledpixels. The parameter β0 is fixed equal to 4 if the camera isnot moving, and to 2 if the camera is moving. Then β will bethe ratio of the a priori probabilities. In addition, for v1(x, y)the previous “change” map and local assignements are takeninto account, and we define

Q0(x, y; 1)

Q1(x, y; 1)=eθ1−(α(x,y)+n1(x,y)−n0(x,y))ζ

β, (11)

where α(x, y) = η(x, y) − 1, with η(x, y) the distance ofthe (interior) point from the border of the “changed” areaon the previous pair of frames, and n1(x, y) (resp., n0(x, y))the number of pixels in neighbourhood already labelled as“changed” (resp., “unchanged”). The parameter ζ is adoptedfrom the Markovian nature of the label process and it can beinterpreted as a potential characterizing the labels of a pairof points. Finally, the exact propagation velocity for the “un-changed” label is

v0(x, y) =1

1 + β(λ1/λ0

)eθ0+(λ0−λ1)|d(x,y)|−n∆(x,y)ζ

(12)

0 1 2 3 4 5 76 8 9 10Absolute frame difference

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Spee

d

Figure 4: The propagation speeds of the two labels; solid line:“changed” label, dashed line: “unchanged” label.

and for the “changed” label

v1(x, y)

=1

1 +(1/β)(λ0/λ1

)eθ1−(λ0−λ1)|d(x,y)|−(α(x,y)−n∆(x,y))ζ

,(13)

where n∆(x, y) = n0(x, y) − n1(x, y). In the tested imple-mentation the parameters are set as follows: θ0 = 4ζ andθ1 = 5ζ + 4. In Figure 4, the two speeds are mapped as func-tions of the absolute inter-frame difference for typical pa-rameter values near the boundary.

We use the fast marching algorithm for advancing thecontours towards the unlabelled space. Often in level set ap-proaches constraints on the boundary points are introducedin order to obtain a smooth and regularised contour andso that an automatic stopping criterion for the evolution isavailable. Our approach differs in that the propagation speeddepends on competitive region properties, which both sta-bilises the contour and provides automatic stopping for theadvancing contours. Only the smoothness of the boundaryis not guaranteed. Therefore, the dependence of the propa-gation speed on the pixel properties alone, and not on con-tour curvature measures, is not a strong disadvantage here.The main advantage is the computational efficiency of thefast marching algorithm.

The proposed algorithm is a variant of the fast march-ing algorithm which, while retaining the properties of theoriginal, is able to cope with multiple classes (or labels). Theexecution time of the new algorithm is effectively made in-dependent of the number of existing classes by handling allthe propagations in parallel and dynamically limiting therange of action for each label to the continually shrink-ing set of pixels for which a final decision has not yetbeen reached. The propagation speed may also have a dif-ferent definition for each class and the speed could takeinto account the statistical description of the consideredclass.

Page 50: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

384 EURASIP Journal on Applied Signal Processing

Hall monitor Mother and daughter Erik

Figure 5: Change detection results.

The high-level description of the algorithm is as follows:

InitTValueMap()InitTrialLists()while (ExistTrialPixels()){pxl = FindLeastTValue()MarkPixelAlive(pxl)UpdateLabelMap(pxl)AddNeighborsToTrialLists(pxl)UpdateNeighborTValues(pxl)}

The algorithm is supplied with a label map partially filledwith decisions. A map with pointers to linked lists of trialpixel candidacies is also maintained. These lists are initiallyempty except for sites neighbouring initial decisions. Forthose sites a trial pixel candidacy is added to the correspond-ing list for each different label of neighbouring decisions andan initial arrival time is assigned. The arrival time for theinitially labelled sites is set to zero, while for all others it isset to infinity. Apart from their participation in trial lists, alltrial candidacies are maintained in a common priority queue,in order to facilitate the selection of the candidacy with thesmallest arrival time.

While there are still unresolved trial candidacies, thetrial candidacy with the smallest arrival time is selected andturned alive. If no other alive candidacy exists for this site, itslabel is copied to the final label map. For each neighbour ofthis site a trial candidacy of the same label is added, if it doesnot already possess one, to its corresponding trial list. Finally,all neighbouring trial pixels of the same label update their ar-rival times according to the stationary level set equation

∥∥∇T(x, y)∥∥ =

1v(x, y)

, (14)

where v(x, y) corresponds to the propagation speed at point(x, y) of the evolving front, while T(x, y) is a map of crossingtimes.

While it may seem that for a given site trial pixels can ex-ist for all different labels, in fact there can be at most four,since a trial candidacy is only introduced by a finalised deci-sion of a neighbouring pixel. In practice, trial pixels of dif-

ferent labels coexist only in region boundaries; therefore, theaverage number of label candidacies per pixel is at most two.Even in the worst case, it is evident that the time and spacecomplexity of the algorithm is independent of the numberof different labels. Experiments indicate a running time nomore than twice that of the single contour fast marching al-gorithm.

4. MOVING OBJECT LOCALIZATION USING REGIONGROWING ALGORITHM

4.1. InitialisationThe change detection stage could be used for initialisation ofthe moving object tracker. The objective now is to localizethe boundary of the moving object. The ideal change area isthe union of sites which are occupied by the object in twosuccessive time instants

C(t, t + 1) = O(t) ∪O(t + 1), (15)

where O(t) is the set of points belonging to the moving objectat time t. We also consider the change area

C(t − 1, t) = O(t) ∪O(t − 1). (16)

It can easily be shown that the intersection of two successivechange maps C(t − 1, t) ∩ C(t, t + 1) is equal to

O(t) ∪ (O(t + 1) ∩O(t − 1)). (17)

This means that the intersection of two successive changemaps is a better initialisation for moving object localizationthan either one of them alone. In addition, sometimes

(O(t + 1) ∩O(t − 1)

) ⊂ O(t). (18)

If this is true, then

C(t, t + 1) ∩ C(t, t − 1) = O(t). (19)

Of course, the above described situation is an ideal one,and is a good approximation only in the case of a still camera.When the camera is moving, the camera motion is compen-sated, and the intersection is suitably adapted. Results of thechange detection algorithm are shown in Figure 5.

Page 51: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

Video Segmentation Using Fast Marching and Region Growing Algorithms 385

Mother and daughter Erik

Figure 6: Results on the uncertainty area.

Knowing also that there are some errors in change de-tection and that sometimes, under certain assumptions, theintersection of the two change maps gives the object approxi-mate location, we propose to initialize a region growing algo-rithm by this map, that is, the intersection of two successivechange maps. This search will be performed in two stages:first, an area containing the object’s boundary is extracted,and second, the boundary is detected. The description ofthese stages follows.

4.2. Extraction of the uncertainty area

The objective now is to determine the area that contains theobject’s boundary with extremely high confidence. Becauseof errors arising in the change detection stage, and also be-cause of the fact that the initial boundary is, in principle,placed outside the object, as shown in the previous subsec-tion, it is necessary to find an area large enough to containthe object’s boundary. This task is simplified if some knowl-edge about the background is available. In the absence ofknowledge concerning the background, the initial boundarycould be relaxed in both directions, inside and outside, witha constant speed, which may be different for the two direc-tions. Within this area then we search for the photometricboundary.

The objective is to place the inner border on the movingobject and the outer border on the background. We empha-sise here that inner means inside the object and outer meansoutside the object. Therefore, if an object contains holes theinner border corresponding to the hole includes the respec-tive outer border, in which case the inner border is expand-ing and the outer border is shrinking. In any case, the ob-ject contour is expected to be situated between them at everypoint and under this assumption it will be possible to deter-mine its location by the region-growing module describedin Section 4.3. Therefore, the inner border should advancerapidly for points on the background and slowly for pointson the object, whereas the opposite should be happen for theouter border.

For cases in which the background can be easily de-scribed, a level set approach extracts the zone of the ob-ject’s boundary. Suppose that the image intensity of the back-ground could be described by a Gaussian random variable

with mean µ and variance σ2. This model could be adaptedto local measurements.

The propagation speeds will be also determined by thea posteriori probability principle. If, as assumed, the inten-sity on the background points is distributed according to theGaussian distribution, the local average value of the intensityshould also follow the Gaussian distribution with the samemean value and variance proportional to σ2. The likelihoodtest on the validity of this hypothesis is based on the nor-malised difference between the average and the mean value

(I − µ

)2

σ2, (20)

where I is the average value of the intensity in a window ofsize 3 × 3 centered at the examined point. A low value meansa good fit with the background. Therefore, the inner bor-der should advance more rapidly for low values of the abovestatistics, while the outer border should be decelerated for thesame values.

On the other hand, it is almost certain that the borderresulting from the previous stages is located on the back-ground. Thus the probability of being on the background ismuch higher than the probability of being on the object. Forthe outer border the speed is defined as

vb =1

1 + cbe−4(I−µ)2/σ2, (21)

where it is considered that the variance of I is equal to σ2/8.According to (7) the constant cb is

cb =PbPo

σ√

2π, (22)

where Pb and Po are the a priori probabilities of being on thebackground or on the moving object, respectively. We haveassumed that in the absence of knowledge the intensity of theobject is uniformly distributed in an interval whose width is∆ (possibly equal to 255). As the initial contour is more likelylocated on the background, Po is given a smaller value thanPb (typically Pb/Po = 3). The outer border advances with thecomplementary speed

Page 52: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

386 EURASIP Journal on Applied Signal Processing

Coast guard Container ship Erik

Road surveillance Tennis table Urbicande

Hall monitor Mother and daughter Lion

Figure 7: Results of video object extraction.

vo = 1 − vb, (23)

using the same local variance computation.For cases in which the background is inhomogeneous,

the uncertainty area is a fixed zone, where the two propaga-tion velocities are constant. They may be different in order toachieve the objective of placing the inner border on the mov-ing object and the outer border on the background. Resulton the Erik and Mother and daughter sequences are shown inFigure 6.

The width of the uncertainty zone is determined by athreshold on the arrival times, which depends on the sizeof the detected objects and on the amount of motion andwhich provides the stopping criterion. At each point alongthe boundary, the distance from a corresponding “center”

point of the object is determined using a heuristic techniquefor fast computation. The uncertainty zone is a fixed per-centage of this radius modified in order to be adapted to themotion magnitude. However, motion is not estimated, andonly a global motion indicator is extracted from the compar-ison of the consecutive changed areas. The motion indicatoris equal to the ratio of the number of pixels with different la-bels on two consecutive “change” maps to the number of thedetected object points.

4.3. Region growing-based object localization

The last stage of object segmentation is carried out by aSeeded Region Growing (SRG) algorithm which was initiallyproposed for static image segmentation using a homogeneitymeasure on the intensity function [21]. It is a sequential la-

Page 53: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

Video Segmentation Using Fast Marching and Region Growing Algorithms 387

belling technique, in which each step of the algorithm labelsexactly one pixel, that with the lowest dissimilarity. In [22],the SRG algorithm was used for semi-automatic motion seg-mentation.

The segmentation result depends on the dissimilarity cri-terion, say δ(·, ·). The colour features of both backgroundand foreground are unknown in our case. In addition, localinhomogeneity is possible. For these reasons, we first deter-mine the connected components already labelled, with twopossible labels: background and foreground. On the bound-ary of all connected components we place representativepoints, for which we compute the locally average colour vec-tor in the Lab system. The dissimilarity of the candidate pointfrom the already labelled regions during region growing pro-cess is determined using this feature as well as the Euclideandistance. After every pixel labelling, the corresponding fea-ture is updated. Therefore, we search for sequential spatialsegmentation based on colour homogeneity, knowing thatboth background and foreground objects may be globally in-homogeneous, but presenting local colour similarities suffi-cient for their discrimination.

For the implementation of the SRG algorithm, a list thatkeeps its members (pixels) ordered according to the dissim-ilarity criterion is used, traditionally referred to as Sequen-tially Sorted List (SSL). With this data structure available, thecomplete SRG algorithm is as follows:

S1 Label the points of the initial sets.S2 Insert all neighbours of the initial sets into the SSL.S3 Compute the average local colour vector for a prede-

termined subset of the boundary points of the initialsets.

S4 While the SSL is not empty:S4.1 Remove the first point y from the SSL and la-

bel it.S4.2 Update the colour features of the representative

to which the point y was associated.S4.3 Test the neighbours of y and update the SSL:

S4.3.1 Add neighbours of y which are neither al-ready labelled nor already in the SSL, ac-cording to their value of δ(·, ·).

S4.3.2 Test for neighbours which are already in theSSL and now border on an additional set be-cause of y’s classification. These are flaggedas boundary points. Furthermore, if theirδ(·, ·) is reduced, they are promoted accord-ingly in the SSL.

When SRG is completed, every pixel is assigned one of thetwo possible labels: foreground or background.

5. RESULTS AND CONCLUSION

We applied the above described algorithm to the entireCOST data set. The results are given in our web pagehttp://www.csd.uoc.gr/tziritas/cost.html

We obtained results ranging from good to very good, de-pending on the image sequence. Some segmented frames areshown in Figure 7. For comparison the spatial quality mea-

0 5 10 15 20 25 3530 40 45 50Frame number

Erik sequence

−18

−16

−14

−12

−10

−8

−6

−4

−2

0

Spat

ialq

ual

ity

Figure 8: Comparison based on the spatial quality measure for theErik sequence.

0 50 100 150 200 250 300Frame number

Hall monitor sequence

−18

−16

−14

−12

−10

−8

−6

−4

−2

0Sp

atia

lqu

alit

y

Figure 9: Comparison based on the spatial quality measure for theHall monitor sequence.

sures [23] on the Erik (resp., Hall Monitor) sequence forthe COST AM algorithm [14] and that of our algorithm areshown together in Figure 8 (resp., Figure 9). Our algorithmgives results of quality either similar to or better than theCOST AM algorithm. The COST AM results, the referencesegmented sequences, and the evaluation tool are taken fromthe web site http://www.tele.ucl.ac.be/EXCHANGE/

For the algorithm proposed the image sequence classi-fication was always correct. The parametric motion modelwas estimated with sufficient accuracy. The independent mo-tion detection was confident in the case of camera motion.The mixture of Laplacians was accurately estimated, and theinitialization of the label map was correct, except for someproblems caused by shadows, reflexions, and homogeneousintensity on the moving objects. The fast marching algorithmwas very efficient and performant. The last stage of moving

Page 54: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

388 EURASIP Journal on Applied Signal Processing

object localization can be further improved. The modeliza-tion of local colour and texture content could be possible,leading to a more adaptive region growing, or eventually apixel labelling procedure.

ACKNOWLEDGMENTS

This work has been funded in part by the European ISTPISTE (“Personalized Immersive Sports TV Experience”)and the Greek “MPEG-4 Authoring Tools” projects.

REFERENCES

[1] T. Sikora, “The MPEG-4 video standard verification model,”IEEE Trans. Circuits and Systems for Video Technology, vol. 7,no. 1, pp. 19–31, 1997.

[2] P. Salembier and F. Marques, “Region-based representationsof image and video: segmentation tools for multimedia ser-vices,” IEEE Trans. Circuits and Systems for Video Technology,vol. 9, no. 8, pp. 1147–1169, 1999.

[3] T. Aach and A. Kaup, “Bayesian algorithms for adaptivechange detection in image sequences using Markov randomfields,” Signal Processing: Image Communication, vol. 7, no. 2,pp. 147–160, 1995.

[4] T. Aach, A. Kaup, and R. Mester, “Statistical model-basedchange detection in moving video,” Signal Processing, vol. 31,no. 2, pp. 165–180, 1993.

[5] M. Bischel, “Segmenting simply connected moving objects ina static scene,” IEEE Trans. on Pattern Analysis and MachineIntelligence, vol. 16, no. 11, pp. 1138–1142, 1994.

[6] K. Karmann, A. Brandt, and R. Gerl, “Moving object segmen-tation based on adaptive reference images,” in European SignalProcessing Conf., pp. 951–954, 1990.

[7] J.-M. Odobez and P. Bouthemy, “Robust multiresolution esti-mation of parametric motion models,” Journal of Visual Com-munication and Image Representation, vol. 6, no. 4, pp. 348–365, 1995.

[8] Z. Sivan and D. Malah, “Change detection and texture analysisfor image sequence coding,” Signal Processing: Image Commu-nication, vol. 6, no. 4, pp. 357–376, 1994.

[9] N. Paragios and G. Tziritas, “Adaptive detection and localiza-tion of moving objects in image sequences,” Signal Processing:Image Communication, vol. 14, no. 4, pp. 277–296, 1999.

[10] M. Kass, A. Witkin, and D. Terzopoulos, “Snakes: active con-tour models,” International Journal of Computer Vision, vol. 1,no. 4, pp. 321–331, 1988.

[11] A. Blake and M. Isard, Active Contours, Springer-Verlag, NY,USA, 1998.

[12] V. Caselles and B. Coll, “Snakes in movement,” SIAM Journalon Numerical Analysis, vol. 33, no. 6, pp. 2445–2456, 1996.

[13] N. Paragios and R. Deriche, “Geodesic active contours andlevel sets for the detection and tracking of moving objects,”IEEE Trans. on Pattern Analysis and Machine Intelligence, vol.22, no. 3, pp. 266–280, 2000.

[14] A. Alatan, L. Onural, M. Wollborn, R. Mech, E. Tuncel, andT. Sikora, “Image sequence analysis for emerging interactivemultimedia services—the European COST 211 framework,”IEEE Trans. Circuits and Systems for Video Technology, vol. 8,no. 7, pp. 802–813, 1998.

[15] M. Kim, J. G. Choi, D. Kim, et al., “A VOP generation tool: au-tomatic segmentation of moving objects on image sequencesbased on spatio-temporal information,” IEEE Trans. Circuitsand Systems for Video Technology, vol. 9, no. 8, pp. 1216–1226,1999.

[16] E. Sifakis, C. Garcia, and G. Tziritas, “Bayesian level sets for

image segmentation,” Journal of Visual Communication andImage Representation, vol. 13, no. 112, pp. 44–64, 2002.

[17] J. A. Sethian, “Theory, algorithms, and applications of levelset methods for propagating interfaces,” Acta Numerica, vol.5, pp. 309–395, 1996.

[18] E. Sifakis and G. Tziritas, “Fast marching to moving objectlocation,” in Proc. 2nd Int. Conf. on Scale-Space Theories inComputer Vision, pp. 447–452, 1999.

[19] E. Sifakis and G. Tziritas, “Moving object localisation using amulti-label fast marching algorithm,” Signal Processing: ImageCommunication, vol. 16, no. 10, pp. 963–976, 2001.

[20] R. O. Duda and P. E. Hart, Pattern Classification and SceneAnalysis, Wiley, NY, USA, 1973.

[21] R. Adams and L. Bischof, “Seeded region growing,” IEEETrans. on Pattern Analysis and Machine Intelligence, vol. 16,no. 6, pp. 641–647, 1994.

[22] I. Grinias and G. Tziritas, “A semi-automatic seeded regiongrowing algorithm for video object localization and tracking,”Signal Processing: Image Communication, vol. 16, no. 10, pp.977–986, 2001.

[23] R. Mech and F. Marques, “Objective evaluation criteria for2D-shape estimation results of moving objects,” in Proc.Workshop on Image Analysis for Multimedia Interactive Ser-vices, Tampere, Finland, May 2001.

Eftychis Sifakis was born in Heraklion,Crete on May 20, 1978. He received his B.S.in Computer Science (2000) from the Uni-versity of Crete. He received from Ericssonan award of Excellence in Telecommunica-tions for his B.S. thesis (2001). His researchinterests are in image analysis and patternrecognition.

Ilias Grinias received his B.S. (1997) and the M.S. (1999) in Com-puter Science from the University of Crete. His research interestsare in image analysis and pattern recognition.

Georgios Tziritas was born in Heraklion,Crete on January 7, 1954. He received theDiploma of Electrical Engineering (1977)from the Technical University of Athens, the“Diplome d’Etudes Approfondies” (DEA,1978), the “Diplome de Docteur Inge-nieur” (1981), and the “Diplome de Doc-teur d’Etat” (1985) from the “Institut Poly-technique de Grenoble.” From 1982 he wasa researcher of the “Centre National dela Recherche Scientifique,” with the “Centre d’Etudes des Phe-nomenes Aleatoires” (CEPHAG, until August 1985), with the “In-stitut National de Recherche en Informatique et Automatique” (IN-RIA, until January 1987), and with the “Laboratoire des Signaux etSystemes” (LSS). From September 1992 he is Associate Professor atthe University of Crete, Department of Computer Science, teach-ing digital signal processing, digital image processing, digital videoprocessing, and information and coding theory. G. Tziritas is coau-thor (with C. Labit) of a book on “Motion Analysis for Image Se-quence Coding” (Elsevier, 1994), and of more than 70 journal andconference papers on signal and image processing, and image andvideo analysis. His research interests are in the areas of signal pro-cessing, image processing and analysis, computer vision, motionanalysis, image and video indexing, and image and video commu-nication.

Page 55: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

EURASIP Journal on Applied Signal Processing 2002:4, 389–400c© 2002 Hindawi Publishing Corporation

Stand-Alone Objective Segmentation Quality Evaluation

Paulo Lobato CorreiaInstituto Superior Tecnico, Instituto de Telecomunicaoes, Av. Rovisco Pais, 1049-001 Lisboa, PortugalEmail: [email protected]

Fernando PereiraInstituto Superior Tecnico, Instituto de Telecomunicaoes, Av. Rovisco Pais, 1049-001 Lisboa, PortugalEmail: [email protected]

Received 31 July 2001 and in revised form 4 January 2002

The identification of objects in video sequences, that is, video segmentation, plays a major role in emerging interactive multimediaservices, such as those enabled by the ISO MPEG-4 and MPEG-7 standards. In this context, assessing the adequacy of the identifiedobjects to the application targets, that is, evaluating the segmentation quality, assumes a crucial importance. Video segmentationtechnology has received considerable attention in the literature, with algorithms being proposed to address various types of ap-plications. However, the segmentation quality performance evaluation of those algorithms is often ad hoc, and a well-establishedsolution is not available. In fact, the field of objective segmentation quality evaluation is still maturing; recently, some more ef-forts have been made, mainly following the emergence of the MPEG object-based coding and description standards. This paperdiscusses the problem of objective segmentation quality evaluation in its most difficult scenario: stand-alone evaluation, that is,when a reference segmentation is not available for comparative evaluation. In particular, objective metrics are proposed for theevaluation of stand-alone segmentation quality for both individual objects and overall segmentation partitions.

Keywords and phrases: video segmentation, segmentation quality evaluation, objective segmentation quality, stand-alone seg-mentation quality evaluation.

1. INTRODUCTION

With the publication of the MPEG-4 standard in the Springof 1999 [1], which allows to independently encode audiovi-sual objects, and the development of the MPEG-7 standard[2], allowing the content-based description of audiovisualmaterial, the MPEG committee has given a significant contri-bution for the development of a new generation of interactivemultimedia services. Innovative types of interaction are oftenbased on the understanding of a video scene as composed bya set of video objects, to which it is possible to associate spe-cific information as well as interactive “hooks” to deploy thedesired application behaviour.

To enable such type of interactive services, an under-standing of the scene semantics is required, notably in termsof the relevant objects that are present. It is in this contextthat video segmentation plays a determinant role. Segmenta-tion may be automatically obtained at the video productionstage, for example, by using chroma keying techniques, or itmay have to be obtained from the images captured by a cam-era by using appropriate segmentation algorithms.

The evaluation of the adequacy of a segmentation algo-rithm, and its parameters’ configuration, for a given applica-tion may be crucial to guarantee that the application interac-tive requirements can be fulfilled.

The current practice for segmentation quality evaluationmainly consists in subjective ad hoc assessment by a repre-sentative group of human viewers. This is a time-consumingand expensive process, whose subjectivity can be minimisedby following strict evaluation conditions, with the videoquality evaluation recommendations developed by ITU pro-viding valuable guidelines [3, 4].

Subjective segmentation quality evaluation differs de-pending on the availability, or not, of a reference segmenta-tion (often called the “ground truth” segmentation) to com-pare against the results of the segmentation algorithm understudy. For both cases, the evaluation proceeds by analysingthe segmentation quality of one object after another, with thehuman evaluators integrating the partial results and, finally,deciding on an overall segmentation quality score. It is worthnoting that these current practice evaluation methodologieshave not been formally presented, but they are regularly usedin fora such as the COST 211quat European project [5]; somedetails on this evaluation procedure are available in [6].

Alternatively, objective segmentation quality evaluationmethodologies can be used. Unfortunately, the amount of at-tention devoted to this issue in the past is not comparable tothe investment made on the development of the segmenta-tion algorithms themselves [7, 8, 9]. Some proposals for theobjective evaluation of segmentation quality have been made

Page 56: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/912571.pdf · Editor-in-Chief K. J. Ray Liu, University of Maryland, College Park, USA Associate Editors Kiyoharu

390 EURASIP Journal on Applied Signal Processing

since the 1970s, mainly regarding the assessment of the per-formance of edge detectors—see reviews in [9, 10, 11]. Morerecently, the emergence of the MPEG-4 and MPEG-7 stan-dards has given a new impulse, not only to the developmentof video segmentation technology, but also to the segmen-tation quality evaluation methodologies themselves—see forinstance [12, 13]. However, the metrics available for segmen-tation quality evaluation typically perform well only for veryconstrained applications scenarios.

This paper discusses the objective evaluation of segmen-tation quality, in particular when no “ground truth” segmen-tation is available to use as a reference for comparison, thismeans, the so-called stand-alone objective segmentation qual-ity evaluation.

The various types of stand-alone objective segmentationquality evaluation are discussed in Section 2. Metrics for in-dividual object and overall segmentation quality evaluationare proposed in Sections 3 and 4, respectively. Results are pre-sented in Section 5 and conclusions in Section 6.

2. TYPES OF STAND-ALONE SEGMENTATIONQUALITY EVALUATION

Stand-alone segmentation quality evaluation is performed when no reference video segmentation is available. Therefore, the a priori information that may be available about the expected video segmentation results has a decisive impact on the type of evaluation procedure to be applied, so that meaningful results can be achieved. In particular, stand-alone evaluation of segmentation quality is not expected to provide as reliable results as the evaluation relative to a reference segmentation. A discussion on the relative evaluation of segmentation quality has been presented by the authors in [14].

When performing segmentation quality evaluation, two types of measurements can be targeted:

• individual object segmentation quality evaluation: each of the objects identified by the segmentation algorithm can be independently evaluated in terms of its video segmentation quality;

• overall segmentation quality evaluation: the set of objects identified by the segmentation algorithm can also be globally evaluated as the set of elements that compose the video sequence under analysis. Besides the individual object evaluation, it is important to assess if the appropriate objects have been detected. To produce a meaningful overall segmentation quality evaluation metric, the relevance of each object present in the scene must also be taken into account, since segmentation errors in the most important objects are more noticeable to a human viewer.

The need for individual object segmentation quality evaluation is motivated by the fact that each video object may be independently stored in a database, or reused in a different context, depending on the adequacy of its segmentation quality for the new purpose targeted. An overall segmentation evaluation is also of great importance as it determines, for instance, if the segmentation goals for a certain application have been globally met, and thus if a given segmentation algorithm is appropriate for a given type of application.

Objective segmentation quality evaluation uses automatic analysis tools and thus produces objective evaluation measures. The automatic tools operate on segmentation results obtained for a selected set of video sequences; if individual object evaluation is being performed, the object whose segmentation quality is to be assessed has first to be selected.

Both the individual object and the overall segmentation quality measures are typically computed for each time instant, requiring some temporal processing of the instantaneous results to reflect the segmentation quality over the complete sequence or shot. For instance, a temporal mean or median may be computed.

Building on the existing knowledge on segmentation quality evaluation, and also on some relevant aspects from the video quality evaluation field, a set of relevant features for the objective evaluation of stand-alone segmentation quality, as well as appropriate objective quality metrics for both individual object and overall partition segmentation quality evaluation, are proposed in the following.

3. INDIVIDUAL OBJECT SEGMENTATION QUALITY EVALUATION

The stand-alone evaluation of segmentation quality is performed by applying the segmentation algorithms to the selected video sequences and then analysing the segmentation results produced. Since the evaluation is performed without using any reference segmentation for comparison, significant assessment results are only expected for well-constrained segmentation scenarios. These results will mainly provide the means for ranking partitions in terms of segmentation quality, that is, the results are expected to be more qualitative than quantitative.

The criteria to be applied in stand-alone objective segmentation quality evaluation may be generic, based on the human visual system (HVS) characteristics, or more adjusted to the specific application scenario targeted by considering the available a priori information. In the first case, all aspects considered important in terms of the HVS are included. Examples are the recognition that some types of shapes usually attract more of the human viewer's attention, or the unequal treatment of the various image components, with luminance receiving more attention. Additional assumptions, like a smooth temporal evolution implying limited changes in the object features for consecutive time instants, are usually more dependent on the specific application scenario. These assumptions can be clustered into the following classes: shape regularity, spatial uniformity, temporal stability, and motion uniformity, as discussed below.

The stand-alone evaluation of individual objects can rely on spatial and temporal features of the objects themselves (intra-object homogeneity features) as well as on the comparison of selected object features with neighbouring objects (inter-object disparity features). Intra-object features give an indication about the internal homogeneity of the objects, while inter-object features indicate if the objects were correctly identified as separate entities.

The desired metrics for stand-alone segmentation quality evaluation can thus be established based on the following types of features.

Intra-object homogeneity features

Intra-object homogeneity regards the internal homogeneity of each object, which can be evaluated by means of spatial and temporal object features.

(a) Spatial features: the stand-alone evaluation of an object's spatial features can be done by evaluating its shape regularity and spatial uniformity. However, the applicability and importance of shape regularity and spatial uniformity differ depending on the segmentation scenario considered.

Shape regularity: in some cases, the objects are expected to exhibit regular shapes, which can be evaluated by geometrical features such as the circularity, elongation, and compactness of the objects.

Spatial uniformity: in some circumstances, the texture of the object is expected to be reasonably uniform; features such as the spatial perceptual information [4] or the texture variance can be used to measure the spatial uniformity.

(b) Temporal features: the importance of the temporal features for segmentation quality evaluation also differs depending on the segmentation scenario being considered. Stand-alone evaluation of temporal features relies on the assumption of a smooth temporal evolution or on the uniformity of the motion within the object area.

Temporal stability: assuming that the temporal evolution of the object features is smooth, the variation between consecutive time instants can be checked to evaluate their temporal stability. Significant variations in the temporal stability metrics, in scenarios where they are supposed to be small, indicate the presence of segmentation errors.

Motion uniformity: when objects are supposed to exhibit uniform motion, properties such as the variance of the object's motion vector values or the criticality [15] can provide valuable segmentation evaluation metrics, since they are able to signal higher or lower segmentation qualities.

Inter-object disparity features

The comparison of an object's features against those of its neighbours can provide useful information for stand-alone evaluation: it is assumed that additional objects should be identified when they are sufficiently different from their neighbours. This comparison can be done locally, along the object boundaries, or it can be based on features computed for the entire objects.

(a) Local contrast to neighbours: one of the assumptions that holds in many circumstances is that there should be a significant contrast along the border between the inside and outside of an object. This can be evaluated by a local contrast metric.

(b) Neighbouring objects features difference: several features computed for the object area can be compared with the corresponding feature values for the neighbours, to check if they were correctly identified as separate entities. Examples are the shape regularity, spatial uniformity, temporal stability, or motion uniformity, whenever each of them is relevant for the target application.

Relevant metrics for each of these classes of features are presented below, followed by the proposal of composite metrics for two classes of content with different properties.

3.1. Elementary metrics for individual object evaluation

Metrics for individual object stand-alone segmentation quality evaluation can be established corresponding to the classes of features identified above.

In particular, intra-object homogeneity can be evaluated by means of spatial and temporal object features. The spatial features considered for individual object evaluation, and the corresponding metrics, are as follows.

Shape regularity: the regularity of shapes can be evaluated by geometrical metrics such as the compactness (compact), or a combination of the circularity and elongation (circ_elong) of the objects

compact(E) = max( perimeter^2(E) / (75 · area(E)), 1 ),
circ_elong(E) = max( circ(E), max( elong(E) / 5, 1 ) ),          (1)

with circularity and elongation defined by

circ(E) = (4 · π · area(E)) / perimeter^2(E),
elong(E) = area(E) / (2 · thickness(E))^2,          (2)

where thickness(E) is the number of morphological erosion steps that can be applied to the object until it disappears [16]. The normalizing constants were empirically determined after an exhaustive set of tests.
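As an illustration, a minimal sketch of how these shape regularity metrics could be computed from a binary object mask is given below; it transcribes equations (1) and (2) as printed, and the helper functions (area, perimeter as the count of 4-connected border pels, thickness via repeated binary erosion) are generic choices made here for illustration, not the authors' code.

```python
# Sketch of the shape-regularity metrics of eq. (1)-(2) for a boolean mask
# (True = object pel). Helper implementations are assumptions, not the paper's.
import numpy as np
from scipy.ndimage import binary_erosion

def area(mask):
    return int(mask.sum())

def perimeter(mask):
    # Border pels: object pels with at least one 4-neighbour outside the object.
    p = np.pad(mask, 1, constant_values=False)
    all_in = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    return int((mask & ~all_in).sum())

def thickness(mask):
    # Number of erosion steps until the object disappears [16].
    steps, m = 0, mask.copy()
    while m.any():
        m = binary_erosion(m)
        steps += 1
    return steps

def circ(mask):
    return 4.0 * np.pi * area(mask) / perimeter(mask) ** 2

def elong(mask):
    return area(mask) / (2.0 * thickness(mask)) ** 2

def compact(mask):
    return max(perimeter(mask) ** 2 / (75.0 * area(mask)), 1.0)

def circ_elong(mask):
    return max(circ(mask), max(elong(mask) / 5.0, 1.0))
```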

Spatial uniformity: spatial uniformity can be evaluated by metrics such as the spatial perceptual information (SI) [4] and the texture variance (text_var)—see for instance [11]

SI = maxtime( SIstdev(I) ),
text_var(E) = (3 · varY(E) + varU(E) + varV(E)) / 5,          (3)

with

SIstdev(I) = sqrt( (1/N) · Σ_i Σ_j Sobel(I)^2 − [ (1/N) · Σ_i Σ_j Sobel(I) ]^2 ).          (4)

The Sobel operator is specified, for instance, in Annex A of ITU-T Recommendation P.910 [4], and maxtime(E) is the maximal value of E taken over all the temporal instants considered. varY(E), varU(E), and varV(E) are the variances of the Y, U, and V components, respectively.
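The sketch below illustrates the spatial uniformity metrics of equations (3) and (4); it assumes each frame is given as separate Y, U, and V arrays with a boolean object mask, and uses a generic Sobel gradient magnitude as a stand-in for the exact operator of ITU-T Rec. P.910 [4] (names are mine, not the paper's).

```python
# Sketch of the spatial-uniformity metrics of eq. (3)-(4).
import numpy as np
from scipy.ndimage import sobel

def si_stdev(luma):
    # Standard deviation of the Sobel-filtered luminance, eq. (4);
    # a generic Sobel magnitude is used here as an approximation of P.910.
    f = np.asarray(luma, dtype=float)
    grad = np.hypot(sobel(f, axis=0), sobel(f, axis=1))
    return float(grad.std())

def spatial_information(luma_frames):
    # SI = maximum of SIstdev over all time instants considered, eq. (3).
    return max(si_stdev(f) for f in luma_frames)

def text_var(y, u, v, obj):
    # Weighted variance of the object's colour components, eq. (3);
    # `obj` is a boolean mask selecting the object pels.
    return (3.0 * y[obj].var() + u[obj].var() + v[obj].var()) / 5.0
```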

The metrics corresponding to the temporal features considered are as follows.

Temporal stability: a smooth temporal evolution of object features can be tested to check temporal stability. These features may include: size, position, temporal perceptual information [4], criticality [15], texture variance, circularity, elongation, and compactness. The selected metrics for temporal stability evaluation are

sizediff = | area(E_t) − area(E_{t−1}) |,
elongdiff = | elong(E_t) − elong(E_{t−1}) |,
critdiff = | crit(E_t) − crit(E_{t−1}) |,          (5)

with crit(E) being the criticality value as defined in [15]

crit = 4.68 − 0.54 · p1 − 0.46 · p2, (6)

where

p1 = log10( meantime( SIrms(I) · TIrms(I) ) ),
p2 = log10( maxtime( | SIrms(I_t) − SIrms(I_{t−1}) | ) ),
SIrms(I) = sqrt( (1/N) · Σ_i Σ_j Sobel(I)^2 ),
TIrms(I_t) = sqrt( (1/N) · Σ_i Σ_j (I_t − I_{t−1})^2 ).          (7)

Motion uniformity: the uniformity of motion can be evaluated by metrics such as the variance of the object's motion vector values (mot_var), or by the criticality (crit) as defined above

mot_var(E) = varXvec(E) + varYvec(E),          (8)

where varXvec(E) and varYvec(E) denote the variances of the x and y components of the motion vector field at a given time instant, respectively.
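For illustration, the temporal metrics could be sketched as follows, assuming per-frame feature values and motion vector components are already available; the criticality computation follows equations (6) and (7), but uses a generic Sobel filter that may differ in detail from the operator assumed in [15], so it should be read as an approximation.

```python
# Sketch of the temporal-stability differences of eq. (5), the criticality of
# eq. (6)-(7) [15], and the motion-variance metric of eq. (8). Names are mine.
import numpy as np
from scipy.ndimage import sobel

def temporal_diff(feat_curr, feat_prev):
    # |feature(E_t) - feature(E_{t-1})|, used for size, elongation, criticality.
    return abs(feat_curr - feat_prev)

def criticality(luma_frames):
    # crit = 4.68 - 0.54*p1 - 0.46*p2, with p1, p2 built from the RMS spatial
    # (SIrms) and temporal (TIrms) information of eq. (7).
    frames = [np.asarray(f, dtype=float) for f in luma_frames]
    si = [np.sqrt(np.mean(np.hypot(sobel(f, 0), sobel(f, 1)) ** 2)) for f in frames]
    ti = [np.sqrt(np.mean((frames[t] - frames[t - 1]) ** 2))
          for t in range(1, len(frames))]
    p1 = np.log10(np.mean([s * x for s, x in zip(si[1:], ti)]))
    p2 = np.log10(max(abs(si[t] - si[t - 1]) for t in range(1, len(si))))
    return 4.68 - 0.54 * p1 - 0.46 * p2

def mot_var(mvx, mvy):
    # Variance of the motion vector field inside the object, eq. (8).
    return float(np.var(mvx) + np.var(mvy))
```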

The above spatial and temporal features are not expected to be homogeneous for every segmented object; the applicability and importance of the corresponding metrics are conditioned by the type of application addressed.

Inter-object disparity: metrics can be computed either locally, along the object boundaries, or for the complete object area. Again, these metrics are applicable only in some circumstances, such as when a significant contrast, or some other significant feature value difference between neighbouring objects, is expected. The metrics considered are as follows.

Local contrast to neighbours: the following local contrast metric can be used to evaluate whether a significant contrast exists between the inside and outside of an object, along the object border

contrast = (1 / (4 · 255 · Nb)) · Σ_{i,j} ( 2 · max(DY_ij) + max(DU_ij) + max(DV_ij) ),          (9)

where Nb is the number of border pixels for the object, and DY_ij, DU_ij, and DV_ij are the differences between an object border pixel's Y, U, and V components, respectively, and those of its 4-neighbours.
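A possible reading of equation (9) is sketched below, assuming Y, U, V frames and a boolean object mask; the border-pel detection and neighbour handling shown are illustration choices, not the authors' implementation.

```python
# Sketch of the local contrast metric of eq. (9).
import numpy as np

def local_contrast(y, u, v, obj):
    y, u, v = (np.asarray(c, dtype=float) for c in (y, u, v))
    # Border pels: object pels with at least one 4-neighbour outside the object.
    p = np.pad(obj, 1, constant_values=False)
    all_in = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    border = obj & ~all_in
    n_b = int(border.sum())
    H, W = obj.shape
    total = 0.0
    for i, j in zip(*np.nonzero(border)):
        neigh = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
        neigh = [(a, b) for a, b in neigh if 0 <= a < H and 0 <= b < W]
        d_y = max(abs(y[i, j] - y[a, b]) for a, b in neigh)
        d_u = max(abs(u[i, j] - u[a, b]) for a, b in neigh)
        d_v = max(abs(v[i, j] - v[a, b]) for a, b in neigh)
        total += 2.0 * d_y + d_u + d_v
    return total / (4.0 * 255.0 * n_b)
```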

Neighbouring objects features difference: several features, for which objects are expected to differ from their neighbours, can be tested. Examples are the shape regularity, spatial uniformity, temporal stability, and motion uniformity values, whenever each of them is relevant taking the application characteristics into account. In particular, a metric for the motion uniformity feature is considered of interest:

mot_unif_neigh_diff = (1/N) · Σ_{j ∈ NS_i} | mot_unif_j − mot_unif_i |,          (10)

where i is the object under analysis, N and NS_i are, respectively, the number and the set of neighbours of object i, and the motion uniformity for each object is computed as

mot_unif_i = mot_var_i + crit_i.          (11)

Each of the elementary metrics considered for individual object segmentation quality evaluation is normalized to produce results in the interval [0, 1], with the highest values associated with the best segmentation quality results.
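As an example, the inter-object disparity metric of equations (10) and (11) could be computed as in the following sketch, assuming the per-object mot_var and crit values have already been obtained; the dictionary-based bookkeeping is an assumption made for illustration only.

```python
# Sketch of eq. (10)-(11): motion-uniformity difference of an object to its
# neighbours, given per-object mot_var and crit values keyed by object label.
def mot_unif(mot_var_by_obj, crit_by_obj, label):
    return mot_var_by_obj[label] + crit_by_obj[label]            # eq. (11)

def mot_unif_neigh_diff(mot_var_by_obj, crit_by_obj, label, neighbours):
    # Average absolute motion-uniformity difference to the neighbours, eq. (10).
    mu_i = mot_unif(mot_var_by_obj, crit_by_obj, label)
    diffs = [abs(mot_unif(mot_var_by_obj, crit_by_obj, j) - mu_i)
             for j in neighbours]
    return sum(diffs) / len(diffs)
```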

3.2. Composite metrics for individual object stand-alone segmentation quality evaluation

The proposal of composite metrics for individual object stand-alone segmentation quality evaluation depends on the type of application (and thus content) being considered, since the adequate elementary metrics depend on the expected characteristics of the content. Therefore, a single general-purpose composite metric cannot be established. Instead, the approach taken here is to select two major classes of content, differing in terms of their spatial and temporal characteristics, and to propose different composite metrics for each of them.

The distinction between the two classes of content is mainly associated with the temporal characteristics; this fact is reflected in the names adopted for the two content classes defined.

Content class I: stable content: this class corresponds to content that is temporally stable and has reasonably regular shapes. Additionally, the contrast between objects is expected to be strong.

Content class II: moving content: this class corresponds to content with strongly moving objects, and thus temporal stability is less relevant. Often, the motion of the objects is uniform, and neighbouring objects may be less spatially contrasted, while motion differences between neighbours are expected to be larger. Regular shapes are still expected, even if this characteristic assumes a lower importance here.

The proposed composite metrics for these two content classes are discussed below. Whenever the video content to analyse does not fit well into one of the two classes above, either the closest one is chosen and the results are interpreted with care, or a new combination of the various elementary metrics has to be selected to develop a more appropriate composite metric.

3.2.1 Composite metric for individual object evaluation of stable content

A composite metric to perform individual object stand-alone segmentation quality evaluation for content class I as reliably as possible is proposed below.

The composite metric includes some classes of elementary metrics and excludes others, to reflect the fact that, for this content class, object motion is expected to be weak and the objects are expected to have reasonably regular shapes. Among the excluded classes of metrics are the spatial uniformity (as defined for the elementary metrics proposed here), since arbitrary spatial patterns may be found in the expected objects, and the motion uniformity, as motion is not very relevant for this content class. Thus, the stand-alone evaluation of segmentation quality for this type of content includes the following classes of elementary metrics.

Shape regularity: the shape regularity class of metrics must be included in the composite metric, since shapes are expected to be reasonably regular. The two relevant elementary metrics, compact and circ_elong, are included in the composite metric with equal weights, as they complement each other well.

Temporal stability: content in this class is expected to be stable. Therefore, the size, elongation, and criticality stability metrics are combined to represent this class of metrics, all equally weighted.

Local contrast to neighbours: in most cases, the type of content considered will exhibit a significant contrast between neighbouring objects. Assuming that this is the case, the local contrast metric should be included in the composite metric.

The weights for each class of metrics within the composite metric have been adjusted according to their strength in capturing visual attention, and to their ability to match the human subjective evaluation of the segmented sequences with the objective segmentation quality evaluation values. The final weight values were selected after verifying the above assumptions by testing several combinations of elementary metric weights.

The proposed composite metric for individual object stand-alone segmentation quality evaluation for this class of content (SQ_io_std_stable) is given by

SQ_io_std_stable = (1/N) · Σ_{i=1}^{N} SQ_io_std_stable_i,          (12)

where N is the total number of images in the sequence whose segmentation is being evaluated, and the instantaneous values of SQ_io_std_stable_i are given by

SQ_io_std_stable_i = intra_i + inter_i,          (13)

with

intra_i = 0.30 · shape_reg_i + 0.33 · temp_stab_i,
inter_i = 0.37 · contrast_i,
shape_reg_i = 0.5 · circ_elong_i + 0.5 · compact_i,
temp_stab_i = 0.33 · sizediff_i + 0.33 · elongdiff_i + 0.33 · critdiff_i.          (14)
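A minimal sketch of this composite metric is given below; it assumes the per-frame elementary metrics have already been computed and normalized to [0, 1] as described in Section 3.1, and simply applies the weights of equations (12), (13), and (14).

```python
# Sketch of the stable-content composite metric, eq. (12)-(14). Inputs are
# assumed to be normalized elementary metrics (higher = better), one dict per frame.
def sq_io_std_stable_frame(circ_elong, compact, sizediff, elongdiff, critdiff, contrast):
    shape_reg = 0.5 * circ_elong + 0.5 * compact
    temp_stab = 0.33 * sizediff + 0.33 * elongdiff + 0.33 * critdiff
    intra = 0.30 * shape_reg + 0.33 * temp_stab
    inter = 0.37 * contrast
    return intra + inter                                          # eq. (13)-(14)

def sq_io_std_stable(per_frame_values):
    # Temporal average over the N images of the sequence, eq. (12).
    return sum(sq_io_std_stable_frame(**v) for v in per_frame_values) / len(per_frame_values)
```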

3.2.2 Composite metric for individual object evaluation of moving content

A composite metric to perform individual object stand-alone segmentation quality evaluation for content class II as reliably as possible is proposed below.

Again, the composite metric only includes the relevant classes of elementary metrics, to adequately reflect the characteristics of this content class. In this case, the content is not expected to be temporally stable, but the objects should have rather uniform motion, and the motion differences between neighbouring objects should be pronounced. The classes of metrics considered for the stand-alone evaluation of this type of content are as follows.

Shape regularity: the object shapes are expected to be regular in most of the applications envisioned, even if, due to the motion, this regularity may sometimes not be completely verified (for instance, a walking person will usually have a less regular shape than a person standing still). The compact and circ_elong elementary metrics are again used for the evaluation of shape regularity, with equal weights.

Motion uniformity: in this content class, objects are expected to exhibit reasonably uniform motion. This can be evaluated using the criticality elementary metric.

Local contrast to neighbours: in many cases, the various objects will exhibit a significant contrast to their neighbours. Contrast is not as important for segmentation quality evaluation as in the case of stable content, but the local contrast metric is still considered useful.

Neighbouring objects feature difference: neighbouring objects are expected to exhibit different motion characteristics. Therefore, the motion uniformity difference metric is used here for segmentation quality evaluation.

The proposed composite metric for individual object stand-alone segmentation quality evaluation for this class of content (SQ_io_std_moving) is given by

SQ_io_std_moving = (1/N) · Σ_{i=1}^{N} SQ_io_std_moving_i,          (15)

where N is the total number of images in the sequence whose segmentation is being evaluated, and the instantaneous values of SQ_io_std_moving_i are given by

SQ_io_std_moving_i = intra_i + inter_i,          (16)


with

intra_i = 0.28 · shape_reg_i + 0.29 · mot_unif_i,
inter_i = 0.19 · contrast_i + 0.24 · mot_unif_neigh_diff_i,
shape_reg_i = 0.5 · circ_elong_i + 0.5 · compact_i,
mot_unif_i = crit_i.          (17)
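Analogously, a sketch applying the weights of equations (15), (16), and (17) to pre-computed, normalized elementary metrics is shown below (the argument names are mine).

```python
# Sketch of the moving-content composite metric, eq. (15)-(17).
def sq_io_std_moving_frame(circ_elong, compact, crit, contrast, mot_unif_neigh_diff):
    shape_reg = 0.5 * circ_elong + 0.5 * compact
    mot_unif = crit                                               # eq. (17)
    intra = 0.28 * shape_reg + 0.29 * mot_unif
    inter = 0.19 * contrast + 0.24 * mot_unif_neigh_diff
    return intra + inter                                          # eq. (16)

def sq_io_std_moving(per_frame_values):
    # Temporal average over the N images of the sequence, eq. (15).
    return sum(sq_io_std_moving_frame(**v) for v in per_frame_values) / len(per_frame_values)
```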

4. OVERALL SEGMENTATION QUALITY EVALUATION

The overall objective segmentation quality evaluation combines each individual object's segmentation quality evaluation mark with the corresponding relevance in the scene and with a factor reflecting the similarity between the sets of target and estimated objects.

Individual object evaluation has been specified in Section 3. The relevance of an object in the scene is evaluated using a metric called Relative Contextual Relevance (RC_rel), which has been previously proposed by the authors in [17]. This metric computes a relevance mark reflecting how much a given object attracts the human viewer's attention, producing results in the [0, 1] range, with the restriction that the relevancies of all objects composing a partition at a given time instant have to sum to one. A mark of one corresponds to the highest possible relevance.

The assessment of the similarity of objects for stand-alone segmentation quality evaluation, and a proposal for the overall segmentation quality metric, are presented below.

4.1. Similarity of objects evaluation

The degree of correspondence between the objects found by a segmentation algorithm and those targeted by the application addressed must be taken into account by the overall segmentation quality metric. This is done in the similarity of objects evaluation step, by computing a metric called sim_obj_factor, which is a multiplicative factor to include in the computation of the overall segmentation quality evaluation.

For stand-alone segmentation quality evaluation, a first object similarity check can be done, if the target number of objects is known, by measuring a ratio between the target and estimated numbers of objects. The proposed ratio is defined by

num_obj_comparison = min(num_est_obj, num_target_obj) / max(num_est_obj, num_target_obj),          (18)

where num_est_obj and num_target_obj refer to the estimated and the target number of objects, respectively. The num_obj_comparison metric takes the value one when the estimated number of objects is equal to the target number, and smaller values as the two numbers become more different.

The metric above provides a limited amount of information about the correctness of the correspondence between estimated and target objects, since it does not distinguish between too many or too few objects in the estimated segmentation.

To make the sim_obj_factor metric more informed, it is possible to also consider a measure of the partition stability, applicable to the cases where the evolution of the number of objects in a segmentation partition is assumed to be smooth. In this case, not many objects are expected to enter or leave the scene too frequently, and thus an additional metric, or an alternative one if the number of target objects is not known, can be defined, evaluating the number of label changes between consecutive time instants

num_obj_stability = min(num_est_obj_{i−1}, num_est_obj_i) / max(num_est_obj_{i−1}, num_est_obj_i),          (19)

where num_est_obj_{i−1} and num_est_obj_i refer to the number of estimated objects in the previous and in the current time instants, respectively.

This num_obj_stability metric indicates if the number of objects in the partition has remained stable (value close to one) or not (metric value approaching zero).

The proposed sim_obj_factor metric for stand-alone segmentation quality evaluation is thus obtained by the multiplication of the two individual factors, num_obj_comparison and num_obj_stability, if both are available

sim_obj_factor = num_obj_comparison · num_obj_stability.          (20)

Whenever one of the two factors above cannot be computed, for instance if the number of target objects present at each time instant is not known, or if the stability hypothesis is not applicable, only the other factor is considered in the sim_obj_factor. If neither of the factors can be computed, then the sim_obj_factor cannot be taken into account for the final segmentation quality evaluation.

To obtain a sim_obj_factor representative of the complete sequence or shot, and since the two factors may vary as time evolves, a temporal integration of the instantaneous values can be done through their temporal average.
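The following sketch illustrates equations (18), (19), and (20), including the fallback described above when one or both factors cannot be computed; the None-based handling is an illustration choice, not part of the proposal.

```python
# Sketch of the similarity-of-objects factor, eq. (18)-(20).
def num_obj_comparison(num_est, num_target):
    return min(num_est, num_target) / max(num_est, num_target)    # eq. (18)

def num_obj_stability(num_est_prev, num_est_curr):
    return min(num_est_prev, num_est_curr) / max(num_est_prev, num_est_curr)  # eq. (19)

def sim_obj_factor(comparison=None, stability=None):
    # Multiply whichever factors are available; if none is, the factor
    # cannot be taken into account (returned as None here).
    factors = [f for f in (comparison, stability) if f is not None]
    if not factors:
        return None
    result = 1.0
    for f in factors:
        result *= f                                               # eq. (20)
    return result
```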

4.2. Metric for overall stand-alone segmentation quality evaluation

The proposed overall stand-alone segmentation quality evaluation metric combines the appropriate measures of individual object quality (depending on the type of content), the objects' relevance, and the similarity of objects factor. An initial proposal for an overall segmentation quality evaluation metric (SQ) is

SQ = sim_obj_factor · Σ_{j=1}^{num_objects} ( SQ_io(E_j) · RC_rel(E_j) ),          (21)

where SQ_io(E_j) is the individual object segmentation quality mark estimated for object j, RC_rel(E_j) is the corresponding relative contextual relevance, and sim_obj_factor is the factor evaluating the degree of correspondence between the detected and target objects. The sum is performed over all the estimated objects in the segmented scene.

Alternatively, to more explicitly include the temporal dimension in the computation of the overall segmentation quality evaluation, instead of taking the temporally averaged marks for its various components and multiplying them together, the overall segmentation quality may be computed by weighting the instantaneous qualities of the various objects by their instantaneous relevance values. This alternative is justified by the fact that one object may have large variations in its quality or relevance marks along time. For instance, if an object has a bad segmentation quality during the short temporal period where it is very relevant, the overall segmentation quality metric should be penalized more than what is expressed by (21), where the object's low average relevance is multiplied by its average quality. Also, the similarity of objects factor may fluctuate along time, and this should be instantaneously acknowledged by the composite metric. Thus, the final proposal for the overall stand-alone segmentation quality evaluation metric computes the temporal average of the instantaneous values, as given by

SQ = (1/N) · Σ_{i=1}^{N} [ sim_obj_factor_i · Σ_{j=1}^{num_objects} ( SQ_io_i(E_j) · RC_rel_i(E_j) ) ].          (22)

This overall segmentation quality evaluation metric expresses the overall segmentation quality as a sum of the individual object segmentation quality marks, weighted by the corresponding contextual relevance and affected by the similarity of objects factor, for each time instant. The higher the individual object quality is for the more relevant objects, the better the overall segmentation quality is, ensuring that the most relevant objects, which are the most visible to human observers, have a larger impact on the overall segmentation quality result. Furthermore, the mismatch between the target objects and the estimated ones is expressed through an object similarity corrective factor, taking values between zero and one, and penalizing the overall segmentation quality if the target objects are incorrectly matched.
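A minimal sketch of equation (22) is given below, assuming the instantaneous similarity factors, individual object quality marks, and relevance values have already been computed for each time instant (the input structure is an assumption for illustration).

```python
# Sketch of the overall metric of eq. (22): temporal average of the
# relevance-weighted object qualities, scaled by the instantaneous
# similarity-of-objects factor.
def overall_sq(frames):
    # `frames` is a list with one entry per time instant:
    # (sim_obj_factor_i, [(sq_io_i, rc_rel_i), ...]); the RC_rel values of a
    # frame are assumed to sum to one, as required by the relevance metric [17].
    total = 0.0
    for sim_factor, objects in frames:
        total += sim_factor * sum(sq * rel for sq, rel in objects)
    return total / len(frames)
```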

5. STAND-ALONE SEGMENTATION QUALITY EVALUATION RESULTS

This section presents and discusses the results obtained using the two composite metrics proposed for the stand-alone segmentation quality evaluation of individual objects and of entire segmentation partitions. Since the two proposed stand-alone metrics are applicable only under certain circumstances, each of the stand-alone composite metrics is tested with the appropriate content.

The test sequences and the corresponding segmentation partitions used are described below, before presenting the segmentation quality evaluation results obtained with the proposed composite metrics.

5.1. Test sequences and segmentation partitions

Several test sequences, mainly from the MPEG-4 test set, showing different spatial complexity and temporal activity characteristics, have been used to test the proposed segmentation quality evaluation metrics. For each sequence, several segmentation partitions with different segmentation qualities were considered.

Three subsets of the test sequences, each with 30 representative images of the desired objects' behaviour and characteristics, were used to illustrate the results obtained. These subsequences were the following.

Akiyo, images 0 to 29. This is a sequence with low temporal activity and not very complex texture. It contains two objects of interest: the woman and the background.

News, images 90 to 119. This is a sequence with low temporal activity and not very complex texture. It contains three objects of interest: the man, the woman, and the background.

Stefan, images 30 to 59. This is a sequence with high temporal activity and relatively complex texture. It contains two objects of interest: the tennis player and the background.

Samples of the original images and of the segmentation partitions are shown in Figures 1, 2, and 3, respectively, for the sequences Akiyo, News, and Stefan. The segmentation partitions labelled as reference are those made available by the MPEG group; the other partitions were created with different segmentation quality levels, ranging from a close match with the reference to more objectionable segmentations. Notice that for the sequence News the reference segmentation provided by the MPEG group is not used, as the objects of interest here are not the same as those considered by MPEG.

5.2. Results and analysis

Stand-alone segmentation quality evaluation metrics are applicable only in certain circumstances, and thus the two metrics proposed have been tested with the appropriate contents. Results for these metrics, considering both the individual object and the overall evaluation cases, are included below.

A set of preliminary experiments showed that similar segmentation quality evaluation results are produced independently of the input format, for example, CIF and QCIF, and thus the QCIF resolution was used to limit the algorithm execution time.

The results presented below include, for each test sequence, a graph representing the temporal evolution of the overall segmentation quality, and a table containing the temporal average of the instantaneous results, computed both for individual object and for overall segmentation quality evaluation.

Content class I corresponds to video sequences which have relatively simple shapes and present a limited amount of motion. To evaluate this type of content, the Akiyo and the News test sequences and the corresponding segmentation partitions were used.

Figure 1: Sample original images (a) and segmentation partitions: reference (b), seg1 (c), seg2 (d), seg3 (e), and seg4 (f) for the images number 0 and 29 of the sequence Akiyo.

For a human observer, the ranking of the segmentation partitions provided for the sequence Akiyo would most likely list the reference and segmentation 1 as having the best quality, followed by segmentation 2, then segmentation 3, and, finally, segmentation 4 would be considered the worst segmentation.

The results of the proposed objective evaluation algorithms, included in Figure 4 and in Table 1, show three segmentation quality groups for the woman object: the best quality is achieved by the reference, segmentation 1, and segmentation 2, then segmentation 3 achieves intermediate quality, and, finally, segmentation 4 gets the worst results. In this case, the reference segmentation does not get the best evaluation result since a part of the woman's hair is intensely illuminated, and when included as part of the woman it leads to a lower contrast to the background than when it is omitted, as happens with segmentations 1 and 2. Segmentation 4, for which the woman object captures a significant part of the background, is clearly identified as the worst segmentation. Table 1 also shows that the individual object stand-alone segmentation quality results for the background object are less discriminative than for the woman object, but they still clearly distinguish segmentation 4, for which the woman object captures a significant part of the background, as being worse than the other segmentations. The overall segmentation quality results also show the same three quality groups as the individual object results, following the same ordering, and matching well the subjective ranking performed by human viewers.

The behaviour of the stand-alone segmentation quality evaluation metric for stable content, for sequences with more than two objects, is illustrated using the sequence News.

From a human observer point of view, the ranking of the segmentation partitions provided for the sequence News in terms of their segmentation quality would be in the order of their numbering. In fact, segmentation 1 has object contours very close to their correct positions, thus corresponding to the best quality. Segmentation 2 includes some small errors in the object contours, being the second best segmentation. Then, segmentation 3 has incorrect contours, but the shapes resemble the newscasters' objects. Segmentation 4 also has incorrect contours, but since the shapes of the newscasters are less similar to the desired shapes it would, very likely, be considered the worst segmentation. In segmentation 5, the man is as well segmented as in segmentation 1, while the segmentation of the woman is somewhat worse than in segmentation 3; therefore, the subjective quality result would probably be some intermediate mark between those of segmentations 1 and 3.

Figure 2: Sample original images (a) and segmentation partitions: seg1 (b), seg2 (c), seg3 (d), seg4 (e), and seg5 (f) for the images number 90 and 119 of the sequence News.

The objective segmentation quality evaluation results for the sequence News are presented in Figure 5 and in Table 2. The overall results identify three levels of quality: the best quality is achieved by segmentation 1, then segmentation 2 achieves intermediate quality, and, finally, segmentations 3 and 4 get the worst values. As expected, segmentation 5 gets an intermediate overall segmentation quality value between those of segmentations 1 and 3. The main difference with respect to the subjective evaluation ranking mentioned above is that the automatic algorithm did not distinguish between the qualities of segmentations 3 and 4. This is explained by the type of segmentation errors observed in these two segmentation partitions, which are accounted for by the objective metrics in a similar manner: they both add part of the background (which is relatively homogeneous in texture) to the newscasters' objects; moreover, none of the considered object shapes is very irregular.

In terms of individual object segmentation quality results, the marks obtained for segmentation 5 show that the automatic evaluation algorithm is capable of distinguishing the quality of the different objects: the man object achieves the highest average individual object quality, together with segmentations 1 and 2, while the woman object gets the lowest average mark, together with segmentation 3, and the remaining background object gets an intermediate quality mark, as expected.

Figure 3: Sample original images (a) and segmentation partitions: reference (b), seg1 (c), seg2 (d), seg3 (e), and seg4 (f) for the images number 30 and 59 of the sequence Stefan.

As shown by the two examples above, the stand-alone segmentation quality evaluation algorithm reveals itself capable of ranking the qualities of the various segmentation partitions, but the results should be interpreted in a more qualitative and relative way (e.g., for ranking purposes or for mutual comparison), rather than in a quantitative and absolute manner.

Content class II corresponds to more complex video content than that of the previous case. Object shapes may not be so simple, and motion should be more important. The sequence Stefan and the corresponding segmentation partitions were used to evaluate the metric proposed in this paper for this type of content.

For the sequence Stefan, a human observer would most likely rank the segmentation partitions provided in the following order: the reference segmentation and segmentation 1 as having the best quality, closely followed by segmentation 2, then segmentation 3, and, finally, segmentation 4.

The results of the objective evaluation algorithm, included in Figure 6 and Table 3, show that segmentation 1 gets the best overall segmentation quality result, followed by a group formed by the reference and segmentations 2 and 3. Segmentation 4 gets the worst result. These results can be explained as follows: segmentation 1 is in fact more precise than the reference partition, as the reference is smoother and sometimes includes fragments of the background as belonging to the player object; the reference and segmentation 2 are correctly classified as the next quality group, while segmentation 3 receives a higher ranking than expected due to the fact that it always includes the moving player object, which is not very contrasted to the surrounding background area.


Figure 4: Stand-alone overall and individual object quality evaluation results for the sequence Akiyo (overall segmentation quality versus image number, for Ref and Seg1 to Seg4).

Table 1: Stand-alone overall and individual object quality evaluation results for the sequence Akiyo.

        Average segmentation quality
        Background   Woman   Overall
Ref        0.76       0.77     0.77
Seg1       0.79       0.79     0.79
Seg2       0.79       0.80     0.80
Seg3       0.73       0.73     0.73
Seg4       0.65       0.56     0.60

Finally, segmentation 4 is correctly ranked as the worst, since the detected object mask is static in time, including a large amount of background as part of the player object. For this case, the overall segmentation quality marks are always in the lower half of the segmentation quality scale, since the objective evaluation metrics do not find the objects to be very homogeneous either in texture or in motion, and thus cannot conclude that the best segmentations are rather good for a human observer (at least in the context of the assumptions made).

The results obtained show that the stand-alone segmentation quality evaluation algorithms proposed are capable of ranking the quality of the various segmentation partitions, but the results must be interpreted in a rather qualitative and relative way (e.g., for ranking purposes). Stand-alone evaluation results are not expected to be as reliable as those obtained with relative evaluation, when a ground truth segmentation is available, but they can still be very useful for identifying the segmentation quality classes among the various tested segmentations/algorithms, which is a major problem in the context of emerging interactive multimedia applications.

6. CONCLUSIONS

Video segmentation quality evaluation is a key element whenever the identification of a set of objects in a video sequence is required, since it allows the assessment of the performance of segmentation algorithms in view of a given application's targets. However, a satisfying solution for objective segmentation quality evaluation is not yet available.

Figure 5: Stand-alone overall and individual object quality evaluation results for the sequence News (overall segmentation quality versus image number, for Seg1 to Seg5).

Table 2: Stand-alone overall and individual object quality evaluation results for the sequence News.

        Average segmentation quality
        Background   Man   Woman   Overall
Seg1       0.63      0.57   0.69     0.64
Seg2       0.57      0.57   0.56     0.57
Seg3       0.47      0.49   0.49     0.49
Seg4       0.47      0.51   0.50     0.50
Seg5       0.53      0.57   0.49     0.54

This paper discusses the objective segmentation quality evaluation problem, in particular when a reference segmentation playing the role of “ground truth” is not available (stand-alone evaluation), and proposes metrics both for individual object and for overall stand-alone segmentation quality evaluation.

As expected, stand-alone evaluation revealed itself sensitive to the type of application/content considered. The various classes of elementary metrics available are not universally applicable, but when carefully selected metrics are employed for given classes of content, very useful segmentation quality evaluation results can be obtained. Two such metrics are proposed in this paper: one for stable content and one for moving content.

It is recognised that stand-alone objective segmentation quality evaluation is not as powerful as relative evaluation, but stand-alone evaluation results allow the comparative analysis of segmentation results, and thus of segmentation algorithms, which is an important functionality for the adequate design of video segmentation enabled systems.


Figure 6: Stand-alone overall and individual object quality evaluation results for the sequence Stefan (overall segmentation quality versus image number, for Ref and Seg1 to Seg4).

Table 3: Stand-alone overall and individual object quality evaluation results for the sequence Stefan.

        Average segmentation quality
        Background   Player   Overall
Ref        0.33       0.43      0.38
Seg1       0.34       0.49      0.42
Seg2       0.32       0.43      0.38
Seg3       0.34       0.45      0.39
Seg4       0.32       0.37      0.34

REFERENCES

[1] ISO/IEC 14496, “Information technology—coding of audio-visual objects,” 1999.
[2] MPEG Requirements Group, “MPEG-7 overview,” Doc. ISO/IEC JTC1/SC29/WG11 N4031, March 2001, Singapore MPEG Meeting.
[3] ITU-R, “Methodology for the subjective assessment of the quality of television pictures,” Recommendation BT.500-7, 1995.
[4] ITU-T, “Subjective video quality assessment methods for multimedia applications,” Recommendation P.910, August 1996.
[5] COST 211quat, “Redundancy reduction techniques and content analysis for multimedia services,” COST project, http://www.iva.cs.tut.fi/COST211/.
[6] COST 211quat, “Call for AM comparisons—compare your segmentation algorithm to the COST 211quat analysis model,” COST project, available at http://www.iva.cs.tut.fi/COST211/Call/Call.htm.
[7] G. Rees and P. Greenway, “Metrics for image segmentation,” in Workshop on Performance Characterisation and Benchmarking of Vision Systems, pp. 20–37, Essex, UK, January 1999.
[8] Y. Zhang and J. Gerbrands, “Objective and quantitative segmentation evaluation and comparison,” Signal Processing, vol. 39, no. 1–2, pp. 43–54, 1994.
[9] Y. Zhang, “A survey on evaluation methods for image segmentation,” Pattern Recognition, vol. 29, no. 8, pp. 1335–1346, 1996.
[10] M. Heath, S. Sarkar, T. Sanocki, and K. Bowyer, “A robust visual method for assessing the relative performance of edge-detection algorithms,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 19, no. 12, pp. 1338–1359, 1997.
[11] M. Levine and A. Nazif, “Dynamic measurement of computer generated image segmentations,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 7, no. 2, pp. 155–164, 1985.
[12] P. Villegas, X. Marichal, and A. Salcedo, “Objective evaluation of segmentation masks in video sequences,” in WIAMIS’99, pp. 85–88, Germany, 31 May–1 June 1999.
[13] M. Wollborn and R. Mech, “Refined procedure for objective evaluation of video object generation algorithms,” Doc. ISO/IEC JTC1/SC29/WG11 M3448, March 1998.
[14] P. Correia and F. Pereira, “Objective evaluation of relative segmentation quality,” in Int. Conference on Image Processing (ICIP), pp. 308–311, Vancouver, Canada, September 2000.
[15] S. Wolf and A. Webster, “Subjective and objective measures of scene criticality,” in ITU Meeting on Subjective and Objective Audiovisual Quality Assessment Methods, Turin, Italy, October 1997.
[16] J. Serra, Image Analysis and Mathematical Morphology, vol. 1, Academic Press, San Diego, Calif, USA, 1988.
[17] P. Correia and F. Pereira, “Estimation of video object’s relevance,” in EUSIPCO’2000, pp. 925–928, Finland, September 2000.

Paulo Lobato Correia graduated as an Engineer and obtained an M.S. in electrical and computers engineering from Instituto Superior Tecnico (IST), Universidade Tecnica de Lisboa, Portugal, in 1989 and 1993, respectively. He is currently working towards a Ph.D. in the area of image analysis for coding and indexing. Since 1990 he has been a Teaching Assistant at the Electrical and Computers Department of IST, and since 1994 he has been a researcher at the Image Communication Group of IST. His current research interests are in the area of video analysis and processing, including video segmentation, objective video segmentation quality evaluation, and content-based video description and representation.

Fernando Pereira was born in Vermelha, Portugal, in October 1962. He graduated in Electrical and Computers Engineering from Instituto Superior Tecnico (IST), Universidade Tecnica de Lisboa, Portugal, in 1985. He received the M.S. and Ph.D. degrees in Electrical and Computers Engineering from IST, in 1988 and 1991, respectively. He is currently Professor at the Electrical and Computers Engineering Department of IST. He is responsible for the participation of IST in many national and international research projects. He is a member of the Editorial Board and Area Editor on Image/Video Compression of the Signal Processing: Image Communication journal, and an Associate Editor of the IEEE Transactions on Circuits and Systems for Video Technology. He is a member of the Scientific Committee of several international conferences. He has contributed more than one hundred papers. He won the 1990 Portuguese IBM Award and an ISO Award for Outstanding Technical Contribution for his participation in the development of the MPEG-4 Visual standard, in October 1998. He has been participating in the work of ISO/MPEG for many years, notably as the head of the Portuguese delegation, and chairing many Ad Hoc Groups related to the MPEG-4 and MPEG-7 standards. His current areas of interest are video analysis, processing, coding and description, and multimedia interactive services.


EURASIP Journal on Applied Signal Processing 2002:4, 401–409
© 2002 Hindawi Publishing Corporation

Objective Evaluation Criteria for 2D-Shape Estimation Results of Moving Objects

Roland Mech
Institut für Theoretische Nachrichtentechnik und Informationsverarbeitung, Universität Hannover, Appelstrasse 9A, 30167 Hannover, Germany
Email: [email protected]

Ferran Marques
Universitat Politècnica de Catalunya, Campus Nord—Modul D5, C/ Jordi Girona 1-3, Barcelona 08034, Spain
Email: [email protected]

Received 3 August 2001 and in revised form 15 January 2002

The objective evaluation of 2D-shape estimation results for moving objects in a video sequence is still an open problem. First approaches in the literature evaluate the spatial accuracy and the temporal coherency of the estimated 2D object shape. However, they do not distinguish between several estimation errors located around the object contour and a few, but larger, estimation errors. Both cases would lead to similar evaluation results, although the 2D-shapes would be visually very different. To overcome this problem, a new evaluation approach is proposed in this paper, in which the evaluation of the spatial accuracy and the temporal coherency is based on the mean and the standard deviation of the 2D-shape estimation errors.

Keywords and phrases: shape evaluation, objective evaluation, shape estimation, segmentation, video object, MPEG.

1. INTRODUCTION

One major problem in the development of algorithms for 2D-shape estimation of moving objects is to assess the quality of the estimation results. Up to now, mainly subjective evaluation, that is, tape viewing, has been used in order to decide upon the quality of a certain algorithm. Although this is very helpful and already gives some indication of the resulting quality, this procedure depends very much on the subjective conditions, that is, the attending people, the time of viewing, the video equipment used, and so forth. In the sequel, since we are only dealing with 2D-shape, the term “shape” will be used.

In the literature, first approaches for the objective evaluation of shape estimation results can be found [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. During the standardization work of ISO/MPEG-4 [13], within the core experiment on automatic segmentation of moving objects, it became necessary to compare the results of different proposed shape estimators, not only by subjective evaluation, but also by objective evaluation. The proposal for objective evaluation [9], which was agreed by the working group, uses an a priori known shape to evaluate the estimation result. This shape is denoted as the reference shape, and has to be created once in an appropriate way, for example, by manual segmentation of each frame, by color-keying, or using synthetic image sequences, where the shapes are known. The shape of a moving object can be represented by a binary mask, where a pel has the object label if it is inside the object and the background label if it is outside the object. In [9], such a mask is called an object mask. Two objective evaluation criteria are defined:

(i) the first criterion evaluates the spatial accuracy of an estimated shape. The algorithm obtains the number of pels that have different labels in the estimated and the reference object masks. Then, this value is normalized by the size of the object in the reference object mask;

(ii) the most subjectively disturbing effect is the temporal incoherence of an estimated sequence of object masks. This is evaluated by the second criterion. The number of pels with opposite labels between two successive frames is calculated for the reference and the estimated sequences of object masks. For each frame, the difference of these two values is computed and normalized by the size of the object. A large resulting value hints at a large difference in activity between the reference and the estimated shapes.

Beside the ISO/MPEG-4 core experiment, this objective evaluation approach was used by the European projects COST 211 [14] and ACTS/MoMuSys [15]. However, the approach has the following shortcomings:

(1) the criterion for spatial accuracy does not distinguish between several small deviations between the estimated and the reference masks (case 1) and a few, but larger, deviations (case 2). Both cases can lead to the same value for spatial accuracy, although they are visually very different;

(2) the same problem appears for the temporal coherency criterion, where several areas of small contour activity and a few areas of larger activity may lead to similar results;

(3) the temporal coherency evaluation may lead to a second type of problem in the case of camera or object motion. In that case, changes in the object mask between two consecutive frames can be caused either by movement or by contour activity, which is not distinguished by the criterion.

Within the project COST 211, the above approach has been further developed [6, 8]:

• for the evaluation of the spatial accuracy, it is distinguished between pels that have the object label in the estimated object mask, but not in the reference object mask, and vice versa; that is, whether the estimated shape is too large or too small. Furthermore, the impact of a misclassified pel on the criterion for spatial accuracy depends on its distance to the object contour. By these improvements, the evaluation of shape estimation results can be adapted to specific applications;

• for evaluating the temporal coherency, two criteria are used. The first one analyzes local instabilities by comparing the variation of the spatial accuracy criterion between successive frames. The second one assumes that the shape is correctly estimated, but oscillates around the reference shape. For this case, the distance between the gravity center of the object in the estimated and in the reference object masks is analyzed for succeeding frames.

The use of the variation of the spatial accuracy criterion for evaluating the temporal coherency allows solving the third problem. However, these criteria do not solve the first and second ones, and neither does the approach in [3]. There, additional geometric features, such as the size and the position of an object as well as the average color within an object area, are evaluated based on the estimated and the reference object masks.

In this paper, a simple approach [16] for the objective evaluation of results from a 2D-shape estimation is proposed, which tackles the three mentioned problems. As in previous approaches, the spatial accuracy and the temporal coherency of an estimated shape are evaluated by comparing it with the corresponding reference shape. It is assumed that the reference shape does not contain any holes. In the case that the reference object consists of several components, each component is evaluated separately. The estimation error is defined as the spatial distance between the reference and the estimated shapes. In order to measure the distance, shapes are not represented as binary object masks, but as object contours. An object contour is the set of pels that have the object label in the corresponding object mask, and for which at least one of the four neighbouring pels has the background label. The evaluation approach is mainly based on calculating the mean and the standard deviation of the shape estimation errors.

The paper is organized as follows: in Section 2, the proposed evaluation method is described, and the criteria for spatial accuracy and temporal coherency are explained. After that, it is discussed how these criteria can be used to evaluate shape estimation results with respect to a given application. In Section 3, results of the proposed evaluation method are presented, and it is demonstrated that, in addition to the third problem, the first two problems are also solved. Section 4 summarizes the paper and gives conclusions.

2. OBJECTIVE EVALUATION CRITERIA

2.1. Spatial accuracy

The spatial accuracy of an estimated shape of a moving object can be defined by the spatial distance between the reference shape and the estimated one. In this paper, this distance is determined based on a given set of Nm measure points on the reference object contour. This means that for each measure point i, its distance di to the estimated object contour is measured. Here, the Euclidean distance is used. For the measured distance values, the mean and the standard deviation are calculated, which are then normalized by the maximal expansion ∅max of the object, resulting in the normalized mean md and the normalized standard deviation σd:

$$
m_d = \frac{1}{\varnothing_{\max}} \cdot \frac{1}{N_m} \sum_{i=1}^{N_m} d_i,
\qquad
\sigma_d = \frac{1}{\varnothing_{\max}} \cdot
\sqrt{\frac{1}{N_m - 1} \sum_{j=1}^{N_m} \Bigg[ d_j - \frac{1}{N_m} \sum_{i=1}^{N_m} d_i \Bigg]^2 } .
\quad (1)
$$

The maximal expansion ∅max of an object is defined as the length of the longest straight line segment between two pels of the reference object contour. Due to the normalization by ∅max, the mean and the standard deviation become independent of the object size. While the normalized mean md is a measure for the average distance between the reference and the estimated object contour, the normalized standard deviation σd represents how different the measured distances di are.
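For concreteness, here is a minimal sketch (not part of the paper) of equation (1), assuming the measured distances di and the reference contour pels are already available, for example from a contour extraction as sketched above; the function names are hypothetical and the maximal expansion is computed by brute force.

```python
import numpy as np

def max_expansion(ref_contour):
    """Maximal expansion: length of the longest straight line segment between
    two pels of the reference object contour (brute force, O(n^2))."""
    p = np.asarray(ref_contour, dtype=float)
    diff = p[:, None, :] - p[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).max()

def spatial_accuracy(distances, ref_contour):
    """Normalized mean m_d and standard deviation sigma_d of the measured
    distances d_i, as in equation (1)."""
    d = np.asarray(distances, dtype=float)
    dmax = max_expansion(ref_contour)
    m_d = d.mean() / dmax
    sigma_d = d.std(ddof=1) / dmax      # ddof=1 gives the 1/(N_m - 1) factor
    return m_d, sigma_d
```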

The algorithm for measuring the distance values di between the two object contours consists of two steps, which are shown in Figure 1: in the first step, the reference and the estimated object contours are split into parts that are assigned to each other. This is done by determining for each pel on the reference object contour the straight line which is perpendicular to the tangent line for that pel. The tangent line is estimated based on the two neighbour contour pels on each side of the considered pel. The intersection point between the perpendicular line and the estimated object contour defines the corresponding pel on the estimated object contour. The corresponding pels associated to two succeeding pels on the reference object contour define the corresponding segment in the estimated object contour (see zoomed area of Figure 2).

Figure 1: Block diagram of the algorithm for measuring the distance values di between the reference and the estimated object contour.

In the special cases that the perpendicular line intersects the reference object contour first, or that the intersection point belongs to an already assigned part of the estimated object contour, the assignment is invalid (dotted arrows in Figure 2), and therefore the next pel on the reference object contour is processed. This is continued until the reference object contour is not intersected first and the resulting intersection point is not yet assigned. The segment of the estimated object contour which is surrounded by this intersection point and the previous intersection point for which the assignment was valid (solid arrows in Figure 2) is assigned to the segment of the reference object contour which is surrounded by the latest processed pel and the preceding pel for which the assignment was valid.

In the second step in Figure 1, the distance between corresponding measure points on the reference and on the estimated object contours is calculated. A measure point is defined as the point on the reference or estimated object contour in the middle of two succeeding contour pels. For each measure point on the reference object contour (rhombs in Figure 3), the average distance to all measure points within the corresponding part of the estimated object contour (circles in Figure 3) is calculated. In the example shown in the zoomed area of Figure 3, there are two measure points on the estimated object contour assigned to measure point 12 on the reference object contour and, therefore, two distances are calculated. These two distances are averaged, resulting in the distance value d12 for the investigated measure point 12 on the reference object contour. If there is more than one measure point in the same part of the reference object contour, as in the case of measure points 14 to 29 in Figure 3, the calculation is done for each of them separately.

2.2. Temporal coherency

The temporal coherency of an estimated shape sequence is evaluated by the temporal variation of the two criteria for spatial accuracy between succeeding frames:

$$
\Delta m_{d,t} = \left| m_{d,t} - m_{d,t-1} \right|,
\qquad
\Delta \sigma_{d,t} = \left| \sigma_{d,t} - \sigma_{d,t-1} \right|,
\quad (2)
$$

where md,t is the normalized mean md and σd,t is the normalized standard deviation σd for the frame at time instance t. If the normalized mean md and the normalized standard deviation σd of the distance values di between the reference and the estimated object contour are similar for succeeding frames, their respective temporal variations ∆md,t and ∆σd,t are small. In this case the temporal coherency is judged as good.
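As a small illustration (not from the paper), the frame-to-frame variations of equation (2) can be computed from per-frame sequences of md and σd; the function name is hypothetical.

```python
import numpy as np

def temporal_variation(m_d_per_frame, sigma_d_per_frame):
    """Equation (2): |m_d,t - m_d,t-1| and |sigma_d,t - sigma_d,t-1| for t >= 1."""
    m = np.asarray(m_d_per_frame, dtype=float)
    s = np.asarray(sigma_d_per_frame, dtype=float)
    return np.abs(np.diff(m)), np.abs(np.diff(s))
```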

However, these two parameters do not analyze whether the measured distances keep the same value in succeeding frames while changing their spatial position. In order to detect such cases, a third criterion for the evaluation of the temporal coherency is used, which was proposed in [8] (see Figure 4):

$$
\Delta g_t = \left| \frac{1}{\varnothing_t} \left( g^{\mathrm{ref}}_t - g^{\mathrm{est}}_t \right)
- \frac{1}{\varnothing_{t-1}} \left( g^{\mathrm{ref}}_{t-1} - g^{\mathrm{est}}_{t-1} \right) \right| .
\quad (3)
$$

The vectors g^ref_t and g^est_t are the gravity centers of the evaluated object in the reference and the estimated object mask at time instance t, respectively. ∆gt is the amount of variation from time instance t − 1 to t of the difference between the gravity centers in the reference and the estimated object mask, normalized by the maximal object expansion of the corresponding frame. For this third criterion, it is assumed that changes of the position of the estimation errors are not symmetrically distributed with respect to the gravity center.
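A possible reading of equation (3) in code form is sketched below (not from the paper); it assumes binary reference and estimated masks for frames t − 1 and t, takes the gravity center as the mean pel coordinate, and normalizes by the maximal expansion of the reference object in the corresponding frame (for example via the max_expansion helper sketched above). Names are hypothetical.

```python
import numpy as np

def gravity_center(mask):
    """Mean (row, col) coordinate of all object pels in a binary mask."""
    return np.argwhere(mask).mean(axis=0)

def gravity_center_variation(ref_prev, est_prev, ref_cur, est_cur,
                             expansion_prev, expansion_cur):
    """Equation (3): variation of the normalized gravity-center difference
    between the frames at time instances t-1 and t."""
    diff_prev = (gravity_center(ref_prev) - gravity_center(est_prev)) / expansion_prev
    diff_cur = (gravity_center(ref_cur) - gravity_center(est_cur)) / expansion_cur
    return float(np.linalg.norm(diff_cur - diff_prev))
```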

2.3. Interpretation of results of the objective evaluation criteria

In Sections 2.1 and 2.2, criteria for evaluating the spatial accuracy and the temporal coherency of an estimated object shape are proposed. Furthermore, it is described how these criteria are measured. In this subsection it is discussed how these criteria can be interpreted.

For the evaluation of the spatial accuracy of an estimated shape, two criteria are used: the normalized mean md and the normalized standard deviation σd of the measured distances di. For a specific application both criteria should be lower than given thresholds in order to meet the demanded accuracy. For example, one class of applications, which would contain MPEG-4 [13] and MPEG-7 [17] content generation tools, demands a high spatial accuracy, which means low values for md and σd. Another class of applications could allow a few shape errors, but overall the shape should be well estimated. This class, which could include tools for scene interpretation, demands mainly a small mean value md. A third class of applications could allow larger shape errors, but the errors should be of constant amplitude, which is an advantage if the shape has to be coded. For this class, md can be larger, but σd must be small.

Figure 2: Example for the assignment of contour parts between the estimated and the reference object contour.

Figure 3: Example for the calculation of the distance to the estimated object contour for each measure point on the reference object contour.

Additionally, for the case that a human observer should not be disturbed by the spatial inaccuracy of an estimated shape, thresholds for md and σd can be found. This means that it is possible to represent the impression that a human observer gets from a shape estimation result by the two proposed criteria for spatial accuracy, opening the door to replacing subjective evaluation by objective criteria.

In an analogous way, the above statements are valid for the temporal evaluation criteria. For a given application, thresholds for ∆md,t, ∆σd,t, and ∆gt have to be fixed. Then, it can be decided if the temporal behavior of the shape estimation errors is good enough for a specific application.
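To make the threshold-based interpretation concrete, the following sketch (not from the paper) checks all five criteria against application-specific thresholds; the threshold values shown are purely illustrative placeholders.

```python
def meets_application_demands(m_d, sigma_d, delta_m, delta_sigma, delta_g, thresholds):
    """Accept a shape estimation result only if every criterion stays below its threshold."""
    checks = {
        "m_d": m_d <= thresholds["m_d"],
        "sigma_d": sigma_d <= thresholds["sigma_d"],
        "delta_m_d": delta_m <= thresholds["delta_m_d"],
        "delta_sigma_d": delta_sigma <= thresholds["delta_sigma_d"],
        "delta_g": delta_g <= thresholds["delta_g"],
    }
    return all(checks.values()), checks

# Illustrative thresholds in percent of the object expansion (placeholders only):
ok, details = meets_application_demands(
    0.9, 0.5, 0.3, 0.2, 0.1,
    thresholds={"m_d": 2.0, "sigma_d": 1.0, "delta_m_d": 1.0,
                "delta_sigma_d": 1.0, "delta_g": 1.0})
```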

3. EXPERIMENTAL RESULTS

The proposed evaluation method has been applied to shape estimation results for several test sequences. Thereby, a good correspondence with the visual impression of the results was established.

Figure 4: Evaluation of the temporal coherency by investigating the temporal variation of the gravity center difference (g^ref_t − g^est_t) between succeeding frames at time instances t − 1 and t.

Figure 5: Examples for 2D-shape estimation results for frame 30 of the MPEG-4 test sequence Akiyo. (a) Original frame. (b) Reference object mask. (c) First example of an estimated object mask. (d) Second example of an estimated object mask.

With the results in Figure 5, it is shown that the first two problems of previous approaches are solved by the proposed evaluation method. Figure 5a shows the reference frame of the MPEG-4 test sequence Akiyo. The corresponding reference shape, represented by an object mask, is shown in Figure 5b. Figures 5c and 5d present two examples of shape estimation results. The first one (Figure 5c) is the reference object mask after dilation [18], which therefore has various small estimation errors around the object contour. This corresponds to case 1 in the introduction. In the second one (Figure 5d), which corresponds to case 2 in the introduction, a part of the left arm and a part of the hair of the person are missing. Thus, there are large estimation errors mainly at two positions of the object contour. Although both shapes look very different, they would give similar values for the spatial accuracy if evaluated by an approach from the literature, for example, [6]. Using the evaluation method proposed in this paper, the two criteria for evaluating the spatial accuracy have the values (given as percentages) in Table 1.

Table 1

Estimation result             md [%]    σd [%]
Object mask in Figure 5c      0.856     0.437
Object mask in Figure 5d      0.824     1.485

Table 2

Estimation result             ∆md,30 [%]    ∆σd,30 [%]
Object mask in Figure 5c      0.856         0.437
Object mask in Figure 5d      0.824         1.485

The normalized mean md of the estimation errors of both shapes is nearly equal. However, their normalized standard deviation σd is quite different. Therefore, the spatial accuracy of the two results is judged differently when using the proposed evaluation method.

Assuming that the two estimation results in Figure 5 had been perfect for the preceding frame 29, both md and σd would have been zero for that frame. Then, the temporal coherency criteria for frame 30, ∆md,30 and ∆σd,30, would be as in Table 2.

The temporal variation of the normalized mean, ∆md,30, is nearly the same for both estimation results: in the case of the mask in Figure 5c there is small temporal shape activity around the whole object, while in the case of the mask in Figure 5d the temporal shape activity is much higher, but concentrated mainly at two positions of the object. For the same reason, the temporal variation of the normalized standard deviation, ∆σd,30, is small for the mask in Figure 5c and much larger for the mask in Figure 5d. This shows that the case of several areas of small contour activity can be distinguished from the case of only a few areas, but of larger activity. Therefore, also the second problem from the introduction is solved by the proposed evaluation method.

In Figures 6 and 7, the results of all criteria of the proposed evaluation method are shown for the shape estimation results of the MPEG-4 test sequences Akiyo and Hall-monitor generated by the COST 211 Analysis Model (Version 5.1) [14, 19]. In the results for Akiyo, it is visible that the two spatial criteria (Figures 6a and 6b) and the three temporal criteria (Figures 6c, 6d, and 6e) have quite large values for the first seven frames. In these first frames, the estimated shape tends to complete Akiyo's silhouette. Therefore, the shape is not correctly estimated, and it changes rapidly between these frames. For all following frames, Akiyo's shape is correctly estimated, which results in low values for the spatial criteria and also for the temporal criteria. At frames 31, 33, and 95, the estimated shape presents small estimation errors in the head area, and at frame 74 in the right arm area. Such errors explain the small peaks in Figure 6. Only the criterion for the variation of the difference between the gravity centers (Figure 6e) is not much affected by these estimation errors, because they are quite small.

Figure 6: Evaluation of 2D-shape estimation results for the MPEG-4 test sequence Akiyo (10 Hz) generated by the COST 211 Analysis Model (Version 5.1) using the proposed evaluation method. (a) Normalized mean of distances. (b) Normalized standard deviation of distances. (c) Variation of normalized mean of distances. (d) Variation of normalized standard deviation of distances. (e) Variation of normalized gravity center difference.

Figure 7: Evaluation of 2D-shape estimation results for the MPEG-4 test sequence Hall-monitor (10 Hz) generated by the COST 211 Analysis Model (Version 5.1) using the proposed evaluation method. (a) Normalized mean of distances. (b) Normalized standard deviation of distances. (c) Variation of normalized mean of distances. (d) Variation of normalized standard deviation of distances. (e) Variation of normalized gravity center difference.

Figure 8: 2D-shape estimation results of the COST 211 Analysis Model (Version 5.1) for the MPEG-4 test sequence Hall-monitor (10 Hz). (a) Original frame 29. (b) Estimated object mask for frame 29. (c) Original frame 84. (d) Estimated object mask for frame 84.

Figure 7 shows the evaluation results for the test sequence Hall-monitor. Here, only the shape of the person on the left side of the image is evaluated. This person becomes visible in the second frame, but appears in the estimation result for the first time at frame 6. Therefore, the two spatial criteria are zero for frame 0 and very large for frames 1 to 5 (Figures 7a and 7b). In the following frames the mean of the estimation errors lies between 2 and 15% of the object expansion. There are only two exceptions: the first one is between frames 26 and 29, where half of the body of the person is missing (see Figures 8a and 8b). The second one is between frames 80 and 84, where the person leaves the scene, so that he is no longer visible after frame 84 (see Figures 8c and 8d). Because of the memory usage in the COST 211 Analysis Model and some shadow effects in the scene, the disappearance of the person is not detected. This results in a growing estimation error, which is visible in Figure 7a. Figure 7b presents the normalized standard deviation of the distances. Its value is large especially between frames 26 and 29, where half of the body is missing. This is reasonable, because in the missing part of the body the estimation errors are much larger than in the remaining part, so the estimation errors are quite different. Of course, for these frames the temporal variation of the gravity center difference is also large, as can be seen in Figure 7e.

4. CONCLUSIONS

In this paper, a method for the objective evaluation of 2D-shape estimation results is proposed. The estimation error of an estimated object shape is defined as the distance between the reference and the estimated object contour, which is measured for several points of the reference object contour. For evaluating the spatial accuracy, the mean and the standard deviation of the measured distances are calculated.

It is shown that the normalized mean of the measured deviations between the estimated and the corresponding reference shape is a useful criterion to evaluate the spatial accuracy. Furthermore, by the normalized standard deviation it can be distinguished whether an estimated shape has several small estimation errors or only a few, but larger, estimation errors.

For evaluating the temporal coherency, the temporal variation between succeeding frames of the normalized mean and of the normalized standard deviation is investigated. It is shown that by these two criteria it can be assessed whether there are various small contour activity areas around the object contour between succeeding frames, or whether there is a higher contour activity, but only at a few positions of the object contour.

A third criterion is applied to detect changes of the spatial position of estimation errors. It evaluates the temporal variation of the difference between the gravity centers of the reference and the estimated shape.

The approach has been tested with shape estimation results for several test sequences. Thereby, a good correspondence with the visual impression of the results was established. This has led to the use of the evaluation approach within the project COST 211.

Finally, it is explained that the evaluation method can be adapted to a specific application by the definition of thresholds for the spatial and temporal criteria. Specifically, thresholds can be found that model a human observer's impression of estimation errors. Furthermore, it is possible to combine the proposed evaluation method with the ideas from [6, 8], where positive and negative distances are distinguished.

ACKNOWLEDGMENTS

This work has been partially supported by the grant CICYT TIC2001-0996 of the Spanish Government and by the German Fraunhofer Gesellschaft under contract no. E/E815/X5241/M0413.

REFERENCES

[1] M. Borsotti, P. Campadelli, and R. Schettini, "Quantitative evaluation of color image segmentation results," Pattern Recognition Lett., vol. 19, no. 8, pp. 741–747, 1998.

[2] P. Correia and F. Pereira, "Estimation of video object's relevance," in European Conference on Signal Processing (EUSIPCO '2000), Tampere, Finland, September 2000.

[3] P. Correia and F. Pereira, "Objective evaluation of relative segmentation quality," in Int. Conference on Image Processing (ICIP), pp. 308–311, Vancouver, Canada, September 2000.

[4] C. E. Eroglu and B. Sankur, "Performance evaluation metrics for object-based video segmentation," in 10th European Signal Processing Conference (EUSIPCO '2000), pp. 917–920, Tampere, Finland, September 2000.

[5] M. D. Levine and A. M. Nazif, "Dynamic measurement of computer generated image segmentations," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 7, no. 2, pp. 155–165, 1985.

[6] X. Marichal and P. Villegas, "Objective evaluation of segmentation masks in video sequences," in European Conference on Signal Processing (EUSIPCO '2000), vol. 4, pp. 2193–2196, Tampere, Finland, September 2000.

[7] K. McKoen, R. Navarro-Prieto, B. Duc, E. Durucan, F. Ziliani, and T. Ebrahimi, "Evaluation of video segmentation methods for surveillance applications," in Proc. European Signal Processing Conference 2000, Tampere, Finland, September 2000.

[8] P. Villegas, X. Marichal, and A. Salcedo, "Objective evaluation of segmentation masks in video sequences," in Proc. Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS '99), pp. 85–88, Berlin, Germany, May/June 1999.

[9] M. Wollborn and R. Mech, "Refined procedure for objective evaluation of VOP generation algorithms," Doc. ISO/IEC JTC1/SC29/WG11 MPEG98/3448, Fribourg, Switzerland, October 1997.

[10] W. A. Yasnoff, J. K. Mui, and J. W. Bacus, "Error measures for scene segmentation," Pattern Recognition, vol. 9, no. 4, pp. 217–231, 1977.

[11] Y. J. Zhang, "A survey on evaluation methods for image segmentation," Pattern Recognition, vol. 29, no. 8, pp. 1335–1346, 1996.

[12] Y. J. Zhang, "Evaluation and comparison of different segmentation algorithms," Pattern Recognition Lett., vol. 18, no. 10, pp. 963–974, 1997.

[13] MPEG-4: Doc. ISO/IEC JTC1/SC29/WG11 N2502, "Information Technology—Generic Coding of Audiovisual Objects, Part 2: Visual, Final Draft of International Standard," October 1998.

[14] M. Gabbouj, G. Morrison, F. Alaya-Cheikh, and R. Mech, "Redundancy reduction techniques and content analysis for multimedia services—The European COST 211quat action," in Proc. Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS '99), pp. 69–72, Berlin, Germany, May/June 1999.

[15] B. Marcotegui, P. Correia, F. Marques, et al., "A video object generation tool allowing friendly user interaction," in International Conference on Image Processing (ICIP '99), Kobe, Japan, October 1999.

[16] R. Mech and F. Marques, "Objective evaluation criteria for 2D-shape estimation results of moving objects," in Proc. Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS '01), Tampere, Finland, May 2001.

[17] MPEG-7: Doc. ISO/IEC JTC1/SC29/WG11 N2822, "Visual Part of Experimentation Model Version 2.0," Vancouver, Canada, July 1999.

[18] M. Sonka, V. Hlavac, and R. Boyle, Image Processing, Analysis, and Machine Vision, Chapman & Hall Computing, London, UK, 1993.

[19] A. Alatan, L. Onural, M. Wollborn, R. Mech, E. Tuncel, and T. Sikora, "Image sequence analysis for emerging interactive multimedia services—The European COST 211 framework," IEEE Trans. Circuits and Systems for Video Technology, vol. 8, no. 7, pp. 802–813, 1998.

Roland Mech received the Diplom-Informatiker degree in computer science from the University of Dortmund, Dortmund, Germany, in 1995. Since 1995 he has been with the "Institut für Theoretische Nachrichtentechnik und Informationsverarbeitung" at the University of Hannover, Germany, where he works in the areas of image sequence analysis and image sequence coding. He is a member of the European project COST 211. Furthermore, he was a member of the finished European project ACTS-MoMuSys and contributed actively to the ISO/MPEG-4 standardization activities. His present research interests cover image sequence analysis, especially 2D-shape estimation of moving objects, and the application of object-based image sequence coding.

Ferran Marques received the Electrical Engineering degree from the Polytechnic University of Catalunya (UPC), Barcelona, Spain, in 1988. From 1989 to June 1990, he worked at the Swiss Federal Institute of Technology in Lausanne (EPFL) in the group of "Digital image sequence processing and coding." In June 1990, he joined the Department of Signal Theory and Communications of the Polytechnic University of Catalunya (UPC). From June 1991 to September 1991, he was with the Signal and Image Processing Institute at USC in Los Angeles, California. He received the Ph.D. degree from the UPC in December 1992 and the Spanish Best Ph.D. Thesis on Electrical Engineering Award-1992. Since 1995, he has been Associate Professor at UPC, having served as Associate Dean for International Relations of the Telecommunication School (ETSETB) at UPC (1997–2000). He is lecturing in the area of digital signal and image processing. His current research interests include still image and sequence analysis, still image and sequence segmentation, image sequence coding, motion estimation and compensation, mathematical morphology, and biomedical applications. In the area of image coding and representation, he has been a very active partner in the MPEG-4 standardization process, mainly through the European project MoMuSys. In MoMuSys, he acted as Work Package Leader of the Video Algorithms work package that, among other tasks, implemented the MPEG-4 VM. He has served as the Officer responsible for Membership Development (1994–1998) of the EURASIP AdCom, as the elected member responsible for Member Services (1998–2000), and currently he is serving as Secretary and Treasurer. He was Associate Editor of the Journal of Electronic Imaging (SPIE) in the area of Image Communications (1996–2000) and he serves on the Editorial Board of the EURASIP Journal on Applied Signal Processing. He is author or co-author of more than 50 publications that have appeared as journal papers and proceedings articles, 4 book chapters, and 4 international patents.


EURASIP Journal on Applied Signal Processing 2002:4, 410–417
© 2002 Hindawi Publishing Corporation

Using Invariant Image Features for Synchronization in Spread Spectrum Image Watermarking

Ebroul Izquierdo
Multimedia and Vision Research Lab, Department of Electronic Engineering, Queen Mary, University of London, London, UK
Email: [email protected]

Received 10 August 2001 and in revised form 10 December 2001

A watermarking scheme is presented in which the characteristics of both spatial and frequency techniques are combined to achieve robustness against image processing and geometric transformations. The proposed approach consists of three basic steps: estimation of the just noticeable image distortion, watermark embedding by adaptive spreading of the watermark signal in the frequency domain, and extraction of relevant information relating to the spatial distribution of pixels in the original image. The just noticeable image distortion is used to insert a pseudo-random signal such that its amplitude is maintained below the distortion sensitivity of the pixel into which it is embedded. Embedding the watermark in the frequency domain guarantees robustness against compression and other common image processing transformations. In the spatial domain, the most salient image points are characterized using the set of Hilbert first-order differential invariants. This information is used to detect geometrical attacks in a frequency-domain watermarked image and to resynchronize the attacked image. The presented scheme has been evaluated experimentally. The obtained results show that the technique is resilient to most common attacks including rotation, translation, and scaling.

Keywords and phrases: watermarking, data hiding, image invariants.

1. INTRODUCTION

Conventional analog media distribution systems have an inherent built-in defense against copying, alteration, and fraud. Each time a new copy is issued, the quality of the duplicated content is degraded accordingly. In contrast to that, digital multimedia documents are completely susceptible to exact replication and alteration. This, together with the rapid proliferation of digital documents, multimedia processing tools, and the world-wide availability of internet access, has created an ideal medium for piracy, copyright fraud, and uncontrollable distribution of high quality but unregistered multimedia content. Since digital watermarking can be seen as a solution to this problem, both the number of activities in this area and the recognition of the difficulties and challenges involved in this new technology have increased in the last few years [1, 2]. Basically, the major challenge is to find a strategy that satisfies the conflicting objectives of performing image changes that are imperceptible to the human eye while being extremely robust against detection or removal, either accidental or intentional. These two objectives are conflicting in nature because it is not possible to simultaneously maximize robustness and imperceptibility. Indeed, maximization of robustness leads to the introduction of large distortions in the image or video and consequently strong perceivable image changes. On the other hand, keeping imperceptibility means keeping the embedded amount of information minimal and consequently very susceptible to removal or detection.

Usually, digital watermarks are classified according to the embedding and retrieval domain, that is, the transform coefficient magnitude in the frequency domain and the luminance intensity in the spatial domain. Frequency-based techniques are very robust against certain kinds of signal processing, such as compression and filtering [3, 4]. Since the watermark is spread throughout the image data, rather than targeting individual pixels, any attempt at attack means that the most fundamental structural components of the data must be targeted. This increases the chances of fidelity degradation, thereby rendering the attacked image useless. If image integrity has not been compromised by common image processing operations, the watermark can be reliably detected. In this context, many watermarking algorithms relying on spread spectrum and different transform domains have been proposed in the literature: using the discrete wavelet transform (DWT) [5], the discrete cosine transform (DCT) [3, 6], the discrete Fourier transform (DFT) [4], and so forth.

However, reliable detection of the frequency-based watermark is impeded when synchronization is lost as a result of geometric transformations. Although watermarking using spread spectrum in a transformed domain as described before is very resistant to amplitude distortions and additive noise, it becomes fragile if the starting point for decoding is lost. To deal with these attacks, a spatial watermark is more appropriate as it targets specific locations in the image. Watermarking in the spatial domain is less resilient to common image processing operations, since the watermark becomes undetectable when the intensity information is modified. However, spatial techniques allow relevant information relating to the spatial distribution of pixels to be extracted. With this information, following geometric transformations such as rotation, the image can be resynchronized. The most common strategy for detecting a watermark after geometric distortion is to try to identify what the distortions are and then to invert them before applying the detector, for example, by introducing a template [7, 8]. This requires the insertion and the detection of two watermarks: one of which does not carry information but helps to detect geometric transformations, and a second one in which the hidden information is represented. This approach has two drawbacks: it further affects image fidelity and it increases the probability of false negatives. Additionally, in general this technique requires exhaustive searches, thus resulting in a significant increase of workload. Since the watermarking strategy should be public, it is rather easy to destroy the synchronization template.

Figure 1: Distortions introduced by the log-polar map. (a) Original image Lena, (b) log-polar map, and (c) recovered image from the log-polar map.

A more elegant approach to achieve robustness against loss of synchronization is to use transformations that map the signal information into an invariant domain. The most commonly used transformations depend on the properties of the Fourier transform. They use well-established techniques in pattern recognition and the invariance properties of the Fourier-Mellin transform [9]. Combining the Fourier transform with a log-polar map results in an invertible rotation, translation, and scale invariant representation. Nevertheless, as noted by Ruanaidh and Pun [10], using this to achieve watermarking leads to difficulties. The main problem is that the exponential nature of the inverse log-polar mapping causes a loss of image information in the discrete space.

To illustrate this problem, Figure 1 shows the original image Lena along with the log-polar representation and the image reconstructed using the inverse log-polar mapping. Although a high resolution is used in these images, the loss of image quality in the reconstructed image is clearly visible. Another problem is that achieving high resolution in the log-polar space requires interpolation. This causes further numerical instabilities, since interpolation behaves badly if the interpolation nodes are not of the same scale [10]. Recently, Lin et al. [11] proposed another approach based on the Fourier-Mellin transform. The main idea in this technique is to use a noninvertible extraction function in order to overcome the instabilities arising when strong invariant representations are used. Although this approach tackles some implementation problems arising from the nature of the Fourier-Mellin transform, other aspects of robustness remain unsolved. For instance, strong resilience against image processing operations like JPEG compression is severely affected using this technique.

The main goal of this research is to overcome the robustness limitations of reported watermarking methods. We propose to combine frequency and spatial domain watermarking strategies into a new model in which the different robustness properties of both schemes are consistently exploited. The frequency domain is used to hide the watermark in the host image. DCT coefficients from the middle frequency bands are selected and a sequence of pseudo-random real numbers is inserted within these coefficients. Basically, this is a well-established approach proven to be extremely robust against additive noise, digital-analog-digital conversion, and compression. These are the required properties of any watermarking scheme that we want to keep. To maximize robustness and minimize the introduced distortions, the pseudo-random watermark signal is adaptively spread in the frequency domain. In this process the distortion amplitude is defined adaptively according to the shape of a just noticeable distortion mask. The just noticeable distortion (JND) is estimated from the characteristics of the image content. Finally, invariant primitives extracted from the pixel domain are used to achieve synchronization in case of geometric attacks. The proposed technique consists of three basic steps: (1) content-based estimation of the JND in the frequency domain; (2) adaptive spreading of a pseudo-random watermark signal in the frequency domain; and (3) extraction of relevant information related to the spatial distribution of pixels in the original image in order to resynchronize a geometrically distorted image. The useful combination of a frequency domain watermarking approach with spatial information leads to a very robust system. The proposed system benefits from the advantages of both spread spectrum and image primitive invariance with respect to affine image transformations. In Figure 2, an overview of the implemented system is given.

Figure 2: System overview (the message X and image I enter a perceptual model and watermark embedder; the watermarked image I′ passes through attacks, and the watermark extractor evaluates the received image I∗).

In the first step, the image is analyzed in both the frequency and the spatial domain in order to detect the distortion sensitivity of the image according to its content. Local information derived from texture, edge, and luminance information is used to define a measure of JND. The JND is used adaptively, according to the image content, to maximize the amount of information (signal) that will be embedded as the watermark. To insert the watermark signal, the middle DCT-frequency band of a block-wise transformed image is used. To maximize the capacity, a pseudo-random signal with amplitude just below the image distortion sensitivity is created according to the JND mask. Thus, the watermark signal is spread over the whole host by keeping its amplitude below the noise sensitivity of each pixel.

By transforming the original image into a suitable frequency domain representation, the redundancy of the spatial domain can be highly decorrelated and high energy compaction can be achieved. The popular DCT transform for coding and compression consists of dividing the image into smaller blocks of the same size, and then transforming each block to obtain equal-sized blocks of transform coefficients. These coefficients are then thresholded and quantized in order to remove subjective redundancies. Since one of the most common and useful image processing methods is compression, it seems reasonable to exploit the knowledge about the processing methodology to generate a watermarking scheme resilient to this image modification. Basically, we use the middle DCT-frequency band of a block-wise transformed image to insert the watermark signal. To maximize capacity, a pseudo-random signal with amplitude just below the image distortion sensitivity is created according to the JND mask. Thus, the watermark signal is spread over the whole host by keeping its amplitude below the noise sensitivity of each pixel.

The spatial domain is used to resynchronize the image. This is implemented by applying primitive matching using point characterization according to Hilbert differential invariants [12]. These invariants have been defined in the literature for gray-level images, where good results can be obtained using differentials up to order three. For colour images this characterization can be improved if the three colour planes are used; in this case, only first-order invariants are necessary, so a more efficient characterization can be achieved. In this work a generalization of the Harris corner detector [13] is used to detect salient image points. Furthermore, first-order Hilbert invariants [12] are selected to characterize the salient image primitives. Finally, to detect the affine transformations undergone by the attacked image, a matching technique is used in which the extracted image primitives are compared.

The paper is organized as follows: in Section 2, the method for the extraction of the JND mask is described. Section 3 deals with the watermark insertion and extraction in the DCT domain. Correction of geometric distortions using Hilbert invariants is described in Section 4. To assess the performance of the proposed watermarking method, several experiments have been conducted. The obtained results show that this approach is resilient to most common distortions due to image processing transformations, including high compression and geometrical distortions, as well as some malicious attacks. Selected results of computer simulations are reported in Section 5. The paper is closed with conclusions in Section 6.

2. JND-MASK GENERATION

The process of embedding a watermark in any image can be regarded in the same way as adding noise to the image. This process leads to an alteration of the host image. Obviously, altering a large number of pixel values arbitrarily will result in noticeable image distortions. These distortions are proportional to the amplitude of the embedded signal. Consequently, an image can be distorted only to a certain limit without making the difference between the original and the altered one perceptible. This limit varies according to the image content and is called the just noticeable distortion (JND). Several methods can be found in the literature to estimate JND masks in the context of compression and image quality assessment, and several perceptually-based algorithms have been proposed [14]. An overview of visual models for signal compression is presented in [15]. Most of these models use both frequency sensitivity and spatial masking based on image edges. The spatial masking is a modified version of the spatial masking model presented in [16]. The main principle of spatial masking is that edges in an image are able to mask signals of much greater amplitude than regions of near-constant intensity. For a given image, a tolerable-error level value may be formed for each pixel.

Figure 3: DT values for Baboon (a) and Lena (b).

To estimate the JND mask, three image characteristics are considered: texture, edgeness, and smoothness. According to several studies of the human vision system, it is well known that the distortion visibility in highly textured areas is very low. This means that these areas are the most suitable ones to hide the watermark signal; in fact, textured regions present a very high noise-sensitivity level. In contrast to that, the edge information of an image is the most important factor for human image perception. Consequently, edges have the least noise sensitivity, and it is essential to keep edge integrity in order to preserve the image quality. Edges have the lowest JND values. Similarly, smooth image areas have a general bandpass characteristic. They influence the human perception and consequently their JND values are also relatively low. The definition of a suitable JND mask depends essentially on the accurate extraction of image texture, edges, and smooth areas.

Texture information can be retrieved directly from the transformed domain by analyzing the DCT coefficients. Following the JPEG and MPEG-2 encoding strategies, the image is first split into 8×8 blocks. Each block is transformed into the DCT domain and the resulting 64 coefficients are analyzed. It is well known that in highly textured regions or along edges, the signal energy concentrates in the high frequency coefficients, while in uniform image areas the signal energy is concentrated in the low frequency components. Thus, the energy of the AC coefficients can be used as a measure for texture within each block. Using the following formula for the energy in the AC coefficients, a measure for texture DT within each block is derived:

$$
D_T = \log \left( \sum_{i=0}^{63} v_i^2 - v_0^2 \right), \quad (1)
$$

where vi, i = 0, . . . , 63, are the 64 DCT coefficients of the considered block. The obtained DT values are scaled to the range [0, 64] and the resulting normalized values are assigned to the corresponding blocks. The images in Figure 3 show the obtained DT values for the two test images Lena and Baboon. In this representation the DT values have been scaled to the range [0, 255] and rounded to the nearest integer.
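As an illustration (not from the paper), DT of equation (1) can be computed per 8×8 block as the logarithm of the block energy outside the DC coefficient; the orthonormal 2D DCT is built from its transform matrix so that no extra library is required, a small epsilon guards the logarithm for flat blocks, and the function names are hypothetical.

```python
import numpy as np

def dct2_8x8(block):
    """Orthonormal 8x8 2D DCT-II: V = C @ block @ C.T, with C the DCT matrix."""
    n = 8
    k = np.arange(n)
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C[0, :] = np.sqrt(1.0 / n)
    return C @ block @ C.T

def texture_measure(block):
    """D_T of equation (1): log of the total energy minus the DC energy."""
    v = dct2_8x8(np.asarray(block, dtype=float)).ravel()
    ac_energy = (v ** 2).sum() - v[0] ** 2    # energy of the 63 AC coefficients
    return float(np.log(ac_energy + 1e-12))   # epsilon avoids log(0) for flat blocks
```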

Edges and smooth areas are extracted from the pixel domain. The main difficulty here consists of discriminating between relevant image edges and spurious edges due to noise and texture. Relevant edges are detected using the Canny operator. The length of each single edge is calculated by traversing it. Edges whose length does not exceed a threshold are considered texture edges and are removed. Using the binary image containing the final edge information, the edgeness is calculated block-wise according to the formula

$$
D_E = \frac{64 \cdot P_E}{\max\left(P_E\right)}, \quad (2)
$$

where PE is the cardinality of the set of pixels within the block that lie at edge locations, and max(PE) is the maximum value of PE over all the blocks in the image.
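A small sketch (not from the paper) of equation (2), assuming a binary edge map (for example, the thresholded Canny output after short edges have been removed) and non-overlapping 8×8 blocks; the function name is hypothetical.

```python
import numpy as np

def edgeness_per_block(edge_map, block=8):
    """D_E of equation (2) for every block: 64 * P_E / max(P_E)."""
    e = np.asarray(edge_map, dtype=bool)
    hb, wb = e.shape[0] // block, e.shape[1] // block
    # P_E: number of edge pels inside each non-overlapping block
    p_e = e[:hb * block, :wb * block].reshape(hb, block, wb, block).sum(axis=(1, 3))
    return 64.0 * p_e / max(p_e.max(), 1)    # guard against an edge-free image
```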

Finally, the Moravec operator is used to extract uniform image regions. The same strategy introduced in [17] for the recognition of uniform regions is used. To avoid the effects of noise, the Moravec operator is applied on the Gaussian-smoothed image. Additionally, small peaks are removed using a median filter over the binary image obtained after thresholding. The uniformity DU in a block is defined as the number of pixels belonging to a uniform area within the block. The images in Figure 4 show the obtained DU values for the two test images Lena and Baboon. In this representation the DU values have been scaled to the range [0, 255] and rounded to the nearest integer.

Figure 4: Extracted DU values for Baboon (a) and Lena (b).

Using the three values DT, DE, and DU, an initial estimate for the JND value over each block is calculated as

$$
J = D_T - \frac{1}{2} \left( D_E + D_U \right). \quad (3)
$$

Since the human vision system is more sensitive to intensity changes in the mid-gray region, and its sensitivity falls off parabolically at both ends of the gray scale, a correction to J is applied. The final JND values are obtained as J = J + (128 − Ī)², where Ī is the average of the luminance values within the considered block.
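Putting the three block measures together, a literal reading of equation (3) and the luminance correction described above looks as follows (not from the paper; DT, DE, DU are assumed to be per-block arrays already scaled to [0, 64], the average block luminance is given separately, and all names are hypothetical).

```python
import numpy as np

def jnd_mask(d_t, d_e, d_u, mean_luma):
    """Per-block JND: J = D_T - (D_E + D_U)/2, then corrected by the average
    block luminance as J + (128 - mean_luma)**2, following the text literally."""
    j = np.asarray(d_t, dtype=float) - 0.5 * (np.asarray(d_e, dtype=float)
                                              + np.asarray(d_u, dtype=float))
    return j + (128.0 - np.asarray(mean_luma, dtype=float)) ** 2
```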

3. WATERMARK EMBEDDING IN THE DCT DOMAIN

The method of watermark insertion is to modify selected DCT coefficients by embedding a sequence of pseudo-random numbers within them. This technique does not target individual pixels; rather, upon inverse transformation, the watermark is dispersed over the entire image. The watermark itself consists of a sequence of real pseudo-random numbers X = x1, . . . , xm, with m ≤ 64, following the normal distribution η(0, 1). The array of DCT coefficients obtained from the transformation consists of a sequence of values V = v1, . . . , v64.

The watermark X = x1, . . . , xm is inserted into m selected coefficients from the middle frequency bands. This yields an adjusted set of values V′ = v′1, . . . , v′64. The inverse DCT is then performed to obtain the watermarked image I′. To insert the watermark we use the formula

$$
v'_i = v_i + \alpha \cdot J \cdot \left| v_i \right| \cdot x_i, \quad (4)
$$

where α is a scaling parameter and J is the distortion parameter previously defined. If the watermarked image I′ is transformed, by processing or attacking, a new image I∗ is generated.
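A minimal sketch of the embedding rule (4) for one 8×8 DCT block is shown below; it is not the paper's implementation. The particular set of middle-band positions, the scaling value α = 0.1, and the seed that stands in for a secret key are all assumptions.

```python
import numpy as np

# Hypothetical choice of middle-frequency positions inside an 8x8 DCT block
MID_BAND = [(0, 3), (1, 2), (2, 1), (3, 0), (0, 4), (1, 3),
            (2, 2), (3, 1), (4, 0), (1, 4), (2, 3), (3, 2), (4, 1)]

def embed_block(dct_block, watermark, jnd, alpha=0.1):
    """Equation (4): v'_i = v_i + alpha * J * |v_i| * x_i on selected coefficients."""
    v = np.array(dct_block, dtype=float)
    for (r, c), x in zip(MID_BAND, watermark):
        v[r, c] += alpha * jnd * abs(v[r, c]) * x
    return v

rng = np.random.default_rng(188)            # the seed plays the role of a secret key
x = rng.standard_normal(len(MID_BAND))      # pseudo-random watermark ~ eta(0, 1)
```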

The presence of the watermark can be detected in the transformed image I∗ by performing a correlation test

$$
R = \frac{1}{N} \left( V^{*} \cdot X \right), \quad (5)
$$

where V∗ is the vector containing those extracted DCT coefficients which have been modified and X is the original watermark. Since this is a statistical test, we must be aware of the possibility of detection errors. To decide whether the watermark is authentic, we must determine some threshold T and test whether the obtained correlation coefficient is greater than T. Setting the detection threshold is a decision based on the desire to minimize both false positives and false rejections.
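The correlation test (5) then reduces to a dot product; the sketch below (not from the paper) returns both the correlation value and the binary decision, with a purely illustrative threshold value.

```python
import numpy as np

def detect(v_star, watermark, threshold=0.05):
    """Equation (5): R = (1/N) * (V* . X); decide 'watermarked' if R > T.

    v_star holds the DCT coefficients extracted from the (possibly attacked)
    image at the positions that were modified during embedding.
    """
    v = np.asarray(v_star, dtype=float)
    x = np.asarray(watermark, dtype=float)
    r = float(v @ x) / len(x)
    return r, r > threshold
```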

Figure 5a shows the watermarked image obtained using this spread spectrum technique. Although high distortions have been introduced to some DCT values according to the JND mask, the image quality has not been compromised. In Figure 5b the detector response is shown for the same image after JPEG compression with a 50% quality factor. In Figure 6, similar results are shown after digital-analog-digital conversion.

4. GEOMETRIC ATTACKS AND RESYNCHRONIZATION

Since most digital images are coloured rather than gray-level, the entire colour information can be exploited by using well-known gray-level image attributes independently for each colour plane. We are interested in attributes that are invariant with respect to as large a group of geometric transformations as possible, but specifically in orthogonal and affine transformations. The study of these invariants can be traced back to the 19th century in the work undertaken by the renowned mathematician David Hilbert [12]. He showed that any invariant of finite order can be expressed as a polynomial function of a set of irreducible invariants. Although invariant theory has been widely applied in pattern recognition and computer vision [18], it has not been used to recover transformation parameters and resynchronize the signal in watermarking applications. In a gray-level image the fundamental set of irreducible invariants can be defined as I, Iη, Iηη, Iςη, Iςς, with the unit vector η given by η = ∇I/|∇I| and ς ⊥ η.

Figure 5: Watermarked image and response of the watermark detector after JPEG compression using a 50% quality factor to 1000 randomly generated watermarks including the genuine watermark at position 188.

Given an original image I, the first step in generating a feature space with invariant attributes is to detect some key points that are robust to common image processing and geometrical transformations. We have chosen the Harris detector, which uses only first-order derivatives and is well known as one of the most stable and robust corner detectors in image processing. The Harris detector is defined as the positive local extrema of the following operator:

$$
\operatorname{Det}(M) - k \operatorname{Trace}^2(M), \quad (6)
$$

where

$$
M = \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix}, \quad (7)
$$

with $C_{11} = \Gamma_\sigma\left(R_x^2 + G_x^2 + B_x^2\right)$, $C_{12} = C_{21} = \Gamma_\sigma\left(R_x \cdot R_y + G_x \cdot G_y + B_x \cdot B_y\right)$, $C_{22} = \Gamma_\sigma\left(R_y^2 + G_y^2 + B_y^2\right)$, $k = 0.04$ a scalar, and $\Gamma_\sigma$ the Gaussian convolution kernel of variance $\sigma$.

Figure 6: (a) Image obtained after printing, photocopying, and scanning the watermarked Lena and (b) response of the watermark detector to 1000 randomly generated watermarks including the genuine watermark at position 188.
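The colour Harris response of equations (6) and (7) can be sketched as follows (this is not the paper's implementation): central differences stand in for the first-order derivatives, scipy.ndimage provides the Gaussian smoothing Γσ, and the corner selection simply keeps the strongest positive local maxima. All names, the window size, and the number of kept corners are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def harris_response(rgb, sigma=1.5, k=0.04):
    """Colour Harris response Det(M) - k * Trace(M)^2 per pixel, equations (6), (7).

    rgb: float array of shape (H, W, 3) holding the R, G, B planes.
    """
    c11 = np.zeros(rgb.shape[:2])
    c12 = np.zeros(rgb.shape[:2])
    c22 = np.zeros(rgb.shape[:2])
    for ch in range(3):                        # accumulate over the R, G, B planes
        gy, gx = np.gradient(rgb[..., ch])     # first-order derivatives
        c11 += gx * gx
        c12 += gx * gy
        c22 += gy * gy
    c11 = gaussian_filter(c11, sigma)          # Gamma_sigma smoothing
    c12 = gaussian_filter(c12, sigma)
    c22 = gaussian_filter(c22, sigma)
    det = c11 * c22 - c12 * c12                # C21 = C12, so M is symmetric
    trace = c11 + c22
    return det - k * trace ** 2

def corner_points(response, num=200):
    """Keep positive local maxima of the response and return the strongest ones."""
    local_max = (response == maximum_filter(response, size=5)) & (response > 0)
    ys, xs = np.nonzero(local_max)
    order = np.argsort(response[ys, xs])[::-1][:num]
    return np.stack([ys[order], xs[order]], axis=1)
```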

As mentioned before, the basic idea behind the strategy for defining an invariant feature space is to use first-order Hilbert invariants. For each corner point, the vector comprising the following eight differential primitives is calculated:

$$
F = \left( R, \; |\nabla R|^2, \; G, \; |\nabla G|^2, \; B, \; |\nabla B|^2, \; \nabla R \cdot \nabla G, \; \nabla R \cdot \nabla B \right)^{T}. \quad (8)
$$

Figure 7: (a) Selected relevant corners using the Harris detector and (b) matched corners using first-order Hilbert invariants.

The vector (8) forms the space of feature invariants in the considered colour image. It is invariant with respect to rotation, translation, and scaling, which are the most common geometric transformations. It allows very robust characterization with regard to noise, since only first-order derivatives are involved. Additionally, the complexity of the method remains very low, since only simple pixel differences need to be calculated. The use of higher-order derivatives would involve a heavier workload and would cause severe instabilities. Finally, the transformation parameters are detected by matching the corners in the primitive space defined by (8). In the matching procedure, the conventional Euclidean distance between the primitives is used as the similarity measure.
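To make the feature space (8) and the matching step concrete, here is a short sketch (not from the paper) that evaluates the eight first-order primitives at given corner positions (as (row, col) pairs, for example from the Harris sketch above) and matches two point sets by nearest neighbour in Euclidean distance; the names are hypothetical.

```python
import numpy as np

def hilbert_features(rgb, points):
    """Vector F of equation (8) at each corner point of a float (H, W, 3) image."""
    grads = [np.gradient(rgb[..., ch]) for ch in range(3)]   # [(Ry, Rx), (Gy, Gx), (By, Bx)]
    feats = []
    for y, x in points:
        (ry, rx), (gy, gx), (by, bx) = [(gr[0][y, x], gr[1][y, x]) for gr in grads]
        r, g, b = rgb[y, x]
        feats.append([r, rx * rx + ry * ry,     # R, |grad R|^2
                      g, gx * gx + gy * gy,     # G, |grad G|^2
                      b, bx * bx + by * by,     # B, |grad B|^2
                      rx * gx + ry * gy,        # grad R . grad G
                      rx * bx + ry * by])       # grad R . grad B
    return np.asarray(feats, dtype=float)

def match_features(feats_a, feats_b):
    """Nearest neighbour in the 8D feature space using the Euclidean distance."""
    d = np.linalg.norm(feats_a[:, None, :] - feats_b[None, :, :], axis=-1)
    return np.argmin(d, axis=1)     # for each point in A, index of its match in B
```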

5. EXPERIMENTAL RESULTS

Several experiments have been conducted to test the performance of the proposed system. In this section some selected results are reported. The images in Figures 5 and 6 show the watermarked image Lena after compression and format conversion, as well as the detector response obtained using the technique described in Section 3.

Tests against geometric attacks such as cropping, rotation, scaling, and so forth have been carried out. In most cases, good synchronization has been achieved. For cropping attacks the image must be padded to the original size before decoding, so that frequency sampling is performed with the same step by both the encoder and the decoder. In this case, cropping robustness is obtained from the invariance of the magnitude spectrum to spatial shifting. In all cases, the parameters describing the performed geometrical attack have been detected using the technique described in Section 4. Figure 7 shows the detected corners in the watermarked image. In this representation corners are highlighted as small black spots. For the sake of visibility, five corner points appear surrounded by circles. The image in Figure 7b has been rotated clockwise by 15 degrees. The corresponding corners are also represented by small black spots. Matching between image primitives using the Hilbert invariant features, described in Section 4, reveals the parameters encoding the geometric transformations undergone by the attacked image. In this case the rotation angle has been recovered accurately.

Figure 8: Response of the watermark detector after re-synchronization.

The recovered parameters have been used to re-synchronize the transformed image and to detect the watermark in the DCT domain. Figure 8 shows the response of the watermark detector for the rotated image in Figure 7b. Using the technique described in Section 4, the rotation angle has been estimated and the attacked image rotated back. The watermark detector has then been applied to the resynchronized image.

6. CONCLUSIONS

First-order invariant image primitives have been combined with a spread spectrum approach to produce a novel watermarking scheme. In the introduced technique, robustness against usual image processing and distortions is achieved by combining a well-established spread spectrum approach with an adaptive masking of the JND. The JND is extracted from the image content itself. Unlike most common strategies that use a reference template to detect geometrical distortions, the scheme presented in this paper exploits relevant image features to perform resynchronization. The spatial domain is used to extract relevant information related to the pixel distribution in the original image. The underlying strategy consists of characterizing salient image points using first-order differential invariants. These primitives are invariant to orthogonal image transformations. To detect geometrical attacks in a frequency-domain watermarked image, a robust matching technique is applied. Features extracted from the original image are matched with those extracted from the attacked image. Their disparities reveal the transformation parameters and are used to re-synchronize the attacked image. Experimental results confirm the robustness of the proposed scheme.


Ebroul Izquierdo is a Lecturer in the Department of Electronic Engineering, Queen Mary University of London. He received the Dr. Rer. Nat. from the Humboldt University, Berlin, Germany, in 1993. From 1990 to 1992 he was a Teaching Assistant at the Department of Applied Mathematics, Technical University, Berlin. In 1993, Dr. Izquierdo joined the Heinrich-Hertz Institute for Communication Technology (HHI), Berlin, Germany, where he worked from 1993 to 1997. During this period he developed and implemented techniques for stereo vision, disparity and motion estimation, video compression, 3D modelling, and immersive telepresence. From 1998 to 1999, Dr. Izquierdo was with the Department of Electronic Systems Engineering of the University of Essex, UK, as a Senior Research Officer. He worked on research dealing with content-based coding, efficient indexing and retrieval of video, and nonlinear diffusion models for image processing. Dr. Izquierdo has been involved in research and management of projects in Germany, the UK, and two European projects in the multimedia field. In 1999, Dr. Izquierdo was awarded a short-term British Telecom fellowship. Currently, he is the coordinator of the European IST project BUSMAN and represents QMUL in the European IST Network of Excellence SCHEMA. Dr. Izquierdo is an IEE Chartered Engineer, a member of the IEEE, the IEE, and the British Machine Vision Association. He represents the UK in the European COST211 forum and is a member of the management committee of the Information Visualization Society. He is an active member of the Programme Committee of the IEEE Information Visualization, the EURASIP & IEEE International Conference on Video Processing and Multimedia Communication, and the COST211 sponsored European Workshop on Image Analysis for Multimedia Interactive Services. He has published over 100 papers including chapters in books.


EURASIP Journal on Applied Signal Processing 2002:4, 418–431
© 2002 Hindawi Publishing Corporation

Segmentation and Content-Based Watermarking for Color Image and Image Region Indexing and Retrieval

Nikolaos V. Boulgouris
Informatics and Telematics Institute (ITI), 1st Km Thermi-Panorama Road, Thermi-Thessaloniki, P.O. Box 361, Gr-57001, Greece
Email: [email protected]

Ioannis Kompatsiaris
Informatics and Telematics Institute (ITI), 1st Km Thermi-Panorama Road, Thermi-Thessaloniki, P.O. Box 361, Gr-57001, Greece
Email: [email protected]

Vasileios Mezaris
Information Processing Laboratory, Electrical and Computer Engineering Department, Aristotle University of Thessaloniki, Thessaloniki 540 06, Greece
Email: [email protected]

Dimitrios Simitopoulos
Informatics and Telematics Institute (ITI), 1st Km Thermi-Panorama Road, Thermi-Thessaloniki, P.O. Box 361, Gr-57001, Greece
Email: [email protected]

Michael G. Strintzis
Information Processing Laboratory, Electrical and Computer Engineering Department, Aristotle University of Thessaloniki, Thessaloniki 540 06, Greece
Email: [email protected]

Received 30 July 2001 and in revised form 14 January 2002

An entirely novel approach to image indexing is presented using content-based watermarking. The proposed system uses color image segmentation and watermarking in order to facilitate content-based indexing, retrieval, and manipulation of digital images and image regions. A novel segmentation algorithm is applied on reduced images and the resulting segmentation mask is embedded in the image using watermarking techniques. In each region of the image, indexing information is additionally embedded. In this way, the proposed system is endowed with content-based access and indexing capabilities which can be easily exploited via a simple watermark detection process. Several experiments have shown the potential of this approach.

Keywords and phrases: image segmentation, image analysis, watermarking, information hiding.

1. INTRODUCTION

In recent years, the proliferation of digital media has established the need for the development of tools for the efficient access and retrieval of visual information. At the same time, watermarking has received significant attention due to its applications on the protection of intellectual property rights (IPR) [1, 2]. However, many other applications can be conceived which involve information hiding [3, 4]. In this paper, we propose the employment of watermarking as a means to content-based indexing and retrieval of images from databases.

In order to endow the proposed scheme with content-based functionalities, information must be hidden region-wise in digital images. Thus, the success of any content-based approach depends largely on the segmentation of the image based on its content. In the present paper, a novel segmentation algorithm is used prior to information embedding.




Figure 1: Block diagram of the embedding scheme.

Segmentation methods for 2D images may be divided primarily into region-based and boundary-based methods [5, 6, 7, 8]. In this paper, a region-based [9, 10] approach is presented using a combination of position, intensity, and texture information, in order to form large connected regions that correspond to the objects contained in the image. The segmentation of the image into regions is followed by the extraction of a set of region descriptors for each region; these serve as indexing information.

The segmentation and indexing information are subsequently embedded into the images using digital watermarking techniques. Specifically, segmentation information is embedded using an M-ary symbol modulation technique in which each symbol corresponds to an image region. Indexing information is embedded as a binary stream followed by channel coding. In this way, both segmentation and indexing information can be easily extracted using a fast watermark detection procedure. This is an entirely novel concept that clearly differentiates our system from classical indexing and retrieval methodologies, in which feature information for each image is separately stored in database records.

Embedding segmentation and indexing information in image regions [11, 12] has the following advantages:

• each region in the image carries its own description and no additional information must be kept for it;

• the image can be moved from one database to another without the need to move any associated description;

• objects can be cropped at the decoder from images without the requirement for employing segmentation algorithms.

The above advantages can be exploited using our watermarking methodology.

The paper is organized as follows: the system overview is given in Section 2. The segmentation algorithm is presented in Section 3. In Section 4, the derivation of region descriptors used for indexing is described. The information embedding process is shown in Section 5. In Section 6, experimental evaluation is discussed, and finally, conclusions are drawn in Section 7.

2. SYSTEM OVERVIEW

The block diagram of the proposed system is shown in Figure 1. The system first segments an image into objects using a segmentation algorithm that forms only connected regions. The segmentation algorithm is applied to a reduced image consisting of the mean values of the pixel intensities in 8 × 8 blocks of the original image. Apart from speeding up the segmentation process, this approach has the additional advantage that it yields image regions comprising a number of 8 × 8 blocks (since a single pixel in the reduced image corresponds to a whole block in the original image). Following segmentation, watermarking can proceed immediately. Unlike segmentation, the watermarking process is applied to the full resolution image. Specifically, the segmentation information is embedded first. The indexing information is obtained from the reduced image, and the indexing bits are channel coded and then embedded in the full resolution image.
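For concreteness, a minimal sketch of the block-mean reduction step is given below, assuming one colour component stored as a 2D NumPy array; the function name and array layout are illustrative and not part of the original system.

```python
import numpy as np

def reduce_image(image, block=8):
    """Replace every block x block tile of `image` with its mean value.

    `image` is assumed to be a 2D NumPy array (one colour component).
    One pixel of the returned reduced image stands for an 8 x 8 block
    of the original image, mirroring the reduction step described above.
    """
    h, w = image.shape
    # Trim any partial blocks at the borders, then reshape into tiles.
    tiles = image[: h - h % block, : w - w % block].reshape(
        h // block, block, w // block, block
    )
    return tiles.mean(axis=(1, 3))
```

Segmentation then operates on the reduced array, whereas the watermark embedding described in Section 5 operates on the full-resolution image.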

Conversely, the first step in the watermark detection process is to detect the segmentation watermark and subsequently, based on this segmentation, to extract the information bits associated with each object (see Figure 2). If, due to unsuccessful watermark detection, the segmentation mask detected at the decoder is different than the one used at the encoder, then the detection process will not be synchronized with the embedding process and the embedded indexing information will not be retrieved correctly.

A segmentation algorithm employing color and texture information is described in the ensuing section.

3. COLOR IMAGE SEGMENTATION

3.1. Segmentation system overview

The segmentation system described in this section is based on a variant of the popular K-means algorithm: the K-means-with-connectivity-constraint (KMCC) algorithm [13, 14]. This is an algorithm that classifies the pixels into regions taking into account not only the intensity or texture information associated with each pixel but also the position of the pixel, thus producing connected regions rather than sets of chromatically similar pixels.




Figure 2: Block diagram of the detection scheme.


Figure 3: Overview of the segmentation algorithm.

Furthermore, the combination of intensity and texture information enables the algorithm to handle textured objects effectively, by forming large, chromatically nonuniform regions instead of breaking down the objects to a large number of chromatically uniform regions. To achieve this, the texture information is not only utilized by the KMCC algorithm, but is also used for determining whether and to which pixels of the image a moving average filter should be applied. Before the final application of the KMCC algorithm, a moving average filter alters the intensity information in those parts of the image where intensity fluctuations are particularly pronounced, since in these parts the KMCC algorithm does not perform efficiently. This stage of conditional filtering is described in more detail in the sequel.

The result of the application of the segmentation algorithm to a color image is the segmentation mask: a grayscale image in which different gray values correspond to different regions formed by the KMCC algorithm.

The segmentation algorithm consists of the following stages (Figure 3).
Stage 1. Extraction of the intensity and texture feature vectors corresponding to each pixel. These will be used along with the spatial features in the following stages.
Stage 2. Estimation of the initial number of regions and their spatial, intensity, and texture centers of the KMCC algorithm.
Stage 3. Conditional filtering using a moving average filter.
Stage 4. Final classification of the pixels, using the KMCC algorithm.

3.2. Color and texture features

The color features used are the three intensity coordinates of the CIE L*a*b* color space. This color space is related to the CIE XYZ standard through a nonlinear transformation. What makes CIE L*a*b* more suitable for the proposed algorithm than the widely used RGB color space is perceptual uniformity: the CIE L*a*b* space is approximately perceptually uniform, that is, the numerical distance in this color space is approximately proportional to the perceived color difference [15]. The color feature vector of pixel p, I(p), is defined as

I(p) = [I_L(p), I_a(p), I_b(p)]^T.   (1)

In order to detect and characterize texture properties in the neighborhood of each pixel, the discrete wavelet frames (DWF) decomposition is used. This is a method similar to the discrete wavelet transform (DWT) that uses a filter bank to decompose each intensity coordinate of the image to a set of subbands (Figure 4). The main difference between the two methods is that in the DWF decomposition, the output of the filter bank is not subsampled. The DWF approach has been proven to decrease the variability of the estimated texture features, thus improving classification performance [16].

The filter bank used is based on the lowpass Haar filter

H(z) = \frac{1}{2} (1 + z^{-1}),   (2)

which satisfies the lowpass condition H(z)|_{z=1} = 1. The complementary highpass filter G(z) is defined with respect to the lowpass H(z) as follows:

G(z) = z H(-z^{-1}).   (3)

The filters of the filter bank, H_{L_d}(z), G_i(z), i = 1, ..., L_d, are generated by the prototypes H(z), G(z), according to the following equations:

H_{i+1}(z) = H(z^{2^i}) H_i(z),
G_{i+1}(z) = G(z^{2^i}) H_i(z),   i = 0, ..., L_d - 1,   (4)

where H_0(z) = 1 is the necessary initial condition and L_d is the number of levels of decomposition. The frequency responses of these filters for L_d = 2 are presented in Figure 5.
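As an illustration of the recursion in (4), the short sketch below generates the impulse responses of the analysis filters H_i, G_i for a given number of levels. It is only a sketch: the pure delay factor z in (3) is dropped, since it does not affect the magnitude responses or the texture energies used later, and the helper names are my own.

```python
import numpy as np

# Prototype Haar filters as impulse responses in powers of z^{-1}.
# H(z) = (1 + z^{-1})/2; the delay factor z of eq. (3) is dropped here,
# so the highpass prototype is taken as (1 - z^{-1})/2 (an assumption).
h0 = np.array([0.5, 0.5])
g0 = np.array([0.5, -0.5])

def upsample(f, factor):
    """Insert factor - 1 zeros between taps: f(z) -> f(z^factor)."""
    out = np.zeros((len(f) - 1) * factor + 1)
    out[::factor] = f
    return out

def dwf_filter_bank(levels):
    """Impulse responses of the analysis filters H_i, G_i of eq. (4), i = 1..levels."""
    H = [np.array([1.0])]                                   # H_0(z) = 1
    G = []
    for i in range(levels):
        H.append(np.convolve(upsample(h0, 2 ** i), H[i]))   # H_{i+1}(z) = H(z^{2^i}) H_i(z)
        G.append(np.convolve(upsample(g0, 2 ** i), H[i]))   # G_{i+1}(z) = G(z^{2^i}) H_i(z)
    return H[1:], G
```

For levels = 2, the resulting filters correspond to the two-level Haar bank discussed above.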



Figure 4: 1D discrete wavelet frames decomposition of L_d levels.

Figure 5: Frequency responses of the Haar filter bank for 2 levels of decomposition.

Although the frequency localization of the filters is relatively poor, as shown in Figure 5, it has been shown in [16] that good space localization of the filter bank is more important than frequency localization; therefore, simple prototype filters like the Haar filter used here are good choices. The application of such simple filters also has the advantage of correspondingly reduced computational complexity.

The discrete wavelet frames decomposition can be extended to the two-dimensional space by successively processing the rows and columns of the image. In this case, for each intensity component and each level of decomposition, three detail components of the wavelet coefficients are produced. The fast iterative scheme used in our segmentation algorithm, for two levels of decomposition of the two-dimensional image, is presented in Figure 6.

The texture of pixel p is then characterized by the standard deviations of all detail components, calculated in a neighborhood F of pixel p. This neighborhood F is a square of dimension f × f, where in our system the value of f is odd and is chosen to be equal to the dimension of the blocks used for the initial clustering procedure (Section 3.3).

We have chosen to use a two-dimensional DWF decomposition of two levels: L_d = 2. Since three detail components are produced for each level of decomposition and each one of the three intensity components (Figure 6), the texture feature vector for pixel p, T(p), comprises 18 texture components σ_q(p), q = 1, ..., 18:

T(p) = [\sigma_1(p), \sigma_2(p), ..., \sigma_{18}(p)]^T.   (5)
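One possible way to compute this texture vector is sketched below, reusing the prototype filters h0, g0 and the upsample helper from the previous sketch, and relying on scipy.ndimage for the separable undecimated filtering and the local moving averages. The window size f = 9 and the subband naming are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import convolve1d, uniform_filter

def dwf_detail_subbands(channel, levels=2):
    """Undecimated 2D DWF detail subbands (three per level) of one colour component."""
    details = []
    approx = np.asarray(channel, dtype=float)
    for i in range(levels):
        h = upsample(h0, 2 ** i)                                  # a-trous lowpass at level i
        g = upsample(g0, 2 ** i)                                  # a-trous highpass at level i
        lo_r = convolve1d(approx, h, axis=1, mode="reflect")      # rows, lowpass
        hi_r = convolve1d(approx, g, axis=1, mode="reflect")      # rows, highpass
        details += [convolve1d(lo_r, g, axis=0, mode="reflect"),  # LH detail
                    convolve1d(hi_r, h, axis=0, mode="reflect"),  # HL detail
                    convolve1d(hi_r, g, axis=0, mode="reflect")]  # HH detail
        approx = convolve1d(lo_r, h, axis=0, mode="reflect")      # LL feeds the next level
    return details

def texture_features(lab_image, f=9, levels=2):
    """Per-pixel texture vector T(p): local std of every detail subband (eq. (5))."""
    feats = []
    for c in range(lab_image.shape[2]):                           # L*, a*, b* components
        for d in dwf_detail_subbands(lab_image[:, :, c], levels):
            mean = uniform_filter(d, size=f)
            mean_sq = uniform_filter(d * d, size=f)
            feats.append(np.sqrt(np.maximum(mean_sq - mean * mean, 0.0)))
    return np.stack(feats, axis=-1)                               # (H, W, 18) for levels = 2
```

The local standard deviation over the f × f neighborhood F is obtained from the running moments E[x^2] - (E[x])^2.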

3.3. Initial clustering

Similarly to any other variant of the K-means algorithm, the KMCC algorithm requires initial values; in our case, an initial estimation is needed of the number of regions in the image and their spatial, intensity, and texture centers. A set of values chosen randomly could be used as initial values, since all these values can and are expected to be altered during the execution of the algorithm. Nevertheless, a well-chosen starting point can lead to a more accurate representation of the objects of the image. It can also facilitate the convergence of the K-means-with-connectivity-constraint algorithm, thus reducing the time necessary for the segmentation mask to be produced.

In order to compute the appropriate initial values, the image is broken down into square, nonoverlapping blocks of dimension f × f. In this way, a total of L blocks b_l, l = 1, ..., L, are created. In our experiments, the value of f was chosen so that the number L of blocks created would be approximately 75; this was found to be a good compromise between the need for accuracy of the initial clustering, which improves as the number L of blocks increases, and the need for its fast completion. The center of block b_l is the pixel p^l_cntr. A color feature vector I(b_l) and a texture feature vector T(b_l) are assigned to each block, as follows:

I(b_l) = \frac{1}{f^2} \sum_{m=1}^{f^2} I(p^l_m),   T(b_l) = T(p^l_{cntr}),   (6)

where p^l_m, m = 1, ..., f^2, are the pixels belonging to block b_l.

The distance between two blocks is defined as follows:

D(b_{l_1}, b_{l_2}) = \|I(b_{l_1}) - I(b_{l_2})\| + \|T(b_{l_1}) - T(b_{l_2})\|,   (7)




Figure 6: Fast iterative 2D discrete wavelet frames decomposition of 2 levels. Subscripts R,C denote filters applied row-wise and column-wise, respectively.

where

\|I(b_{l_1}) - I(b_{l_2})\| = \sqrt{(I_L(b_{l_1}) - I_L(b_{l_2}))^2 + (I_a(b_{l_1}) - I_a(b_{l_2}))^2 + (I_b(b_{l_1}) - I_b(b_{l_2}))^2},
\|T(b_{l_1}) - T(b_{l_2})\| = \sqrt{\sum_{q=1}^{18} (\sigma_q(p^{l_1}_{cntr}) - \sigma_q(p^{l_2}_{cntr}))^2}.   (8)

The number of regions of the image is initially estimated by applying a variant of the maximin algorithm to this set of blocks. This algorithm consists of the following steps.
Step 1. The block in the upper left corner of the image is chosen to be the first intensity and texture center.
Step 2. For each block b_l, l = 1, ..., L, the distance between b_l and the first center is calculated; the block for which the distance is maximized is chosen to be the second intensity and texture center. The distance D_bmax between the first two centers is indicative of the intensity and texture contrast of the particular image.
Step 3. For each block b_l, the distances between b_l and all centers are calculated and the minimum of those distances is assigned to block b_l. The block that was assigned the maximum of the distances assigned to blocks is a new candidate center.
Step 4. If the distance that was assigned to the candidate center is greater than γ · D_bmax, where γ is a predefined parameter, the candidate center is accepted as a new center and Step 3 is repeated; otherwise, the candidate center is rejected and the maximin algorithm is terminated. In our experiments the value γ = 0.4 was used.
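A compact sketch of this maximin initialization is given below; block_feats and block_distance (an implementation of the block distance of eq. (7)) are hypothetical inputs assumed to be supplied by the caller.

```python
import numpy as np

def maximin_centers(block_feats, block_distance, gamma=0.4):
    """Maximin selection of the initial intensity/texture centers (Steps 1-4 above)."""
    centers = [0]                                              # Step 1: upper-left block
    d_first = [block_distance(block_feats[0], b) for b in block_feats]
    centers.append(int(np.argmax(d_first)))                    # Step 2: farthest block
    db_max = max(d_first)                                      # intensity/texture contrast D_bmax

    while True:                                                # Steps 3-4
        d_min = [min(block_distance(block_feats[c], b) for c in centers)
                 for b in block_feats]
        candidate = int(np.argmax(d_min))
        if d_min[candidate] > gamma * db_max:
            centers.append(candidate)                          # accept as a new center
        else:
            return [block_feats[c] for c in centers]           # reject and terminate
```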

The number of centers estimated by the maximin algorithm constitutes an estimate of the number of regions in the image. Nevertheless, it is not possible to determine whether these regions are connected or not. Furthermore, there is no information regarding their spatial centers. In order to solve these problems, a simple K-means algorithm is applied to the set of blocks, using the information produced by the maximin algorithm as initial values. The simple K-means algorithm consists of the following steps.
Step 1. The output of the maximin algorithm is used as a starting point, regarding the number of regions s_k, k = 1, ..., K, and their intensity and texture centers, I(s_k) and T(s_k), respectively.
Step 2. For every block b_l, l = 1, ..., L, the distance is evaluated between b_l and all region centers. The block b_l is assigned to the region for which the distance is minimized.
Step 3. Region centers are recalculated as the mean values of the intensity and texture vectors over the blocks belonging to the corresponding region:

I(s_k) = \frac{1}{M'_k} \sum_{m=1}^{M'_k} I(b^k_m),
T(s_k) = \frac{1}{M'_k} \sum_{m=1}^{M'_k} T(b^k_m),   (9)

where b^k_m, m = 1, ..., M'_k, are the blocks currently assigned to region s_k.
Step 4. If the new centers are equal to those calculated in the previous iteration of the algorithm, then stop; else go to Step 2.

When the K-means algorithm converges, the connectivity of the regions that were formed is evaluated.




Figure 7: (a) Original image “zebra.” (b) Filtered image.

Regions that are not connected are easily broken down to the minimum number of connected regions using a recursive four-connectivity component labelling algorithm [17], so that a total of K' connected regions are identified. Their centers, including their spatial centers S(s_k) = [S_x(s_k), S_y(s_k)]^T, k = 1, ..., K', will now be calculated. In order to obtain other useful information as well, such as the current size M_k of each region in pixels, we choose to perform the center calculation process not in the block domain but in the pixel domain, as we will do during the execution of the KMCC algorithm. These centers will be used as initial values by the KMCC algorithm:

I(s_k) = \frac{1}{M_k} \sum_{m=1}^{M_k} I(p^k_m),
T(s_k) = \frac{1}{M_k} \sum_{m=1}^{M_k} T(p^k_m),
S_x(s_k) = \frac{1}{M_k} \sum_{m=1}^{M_k} p^k_{m,x},
S_y(s_k) = \frac{1}{M_k} \sum_{m=1}^{M_k} p^k_{m,y},   (10)

where M_k is the number of pixels p^k_m, m = 1, ..., M_k, that belong to region s_k.

3.4. Conditional filtering

In some images, there are parts of the image where intensity fluctuations are particularly pronounced, even when all pixels in that part of the image belong to a single object (Figure 7). In order to facilitate the grouping of all these pixels in a single region based on their texture similarity, which is our objective, it would be of great importance to somehow reduce their intensity differences. This can be achieved by applying a moving average filter to the appropriate parts of the image, thus altering the intensity information of the corresponding pixels.

The decision of whether the filter should be applied to a particular pixel p or not is made by evaluating the norm of the texture feature vector T(p) (see Section 3.2); the filter is not applied if that norm is below a threshold T_th. The output of the conditional filtering module can thus be expressed as:

J(p) = \begin{cases} I(p), & \text{if } \|T(p)\| < T_{th}, \\ \frac{1}{f^2} \sum_{m=1}^{f^2} I(p_m), & \text{if } \|T(p)\| \ge T_{th}. \end{cases}   (11)

An appropriate value of the threshold T_th was experimentally found to be

T_{th} = \max\{0.65 \cdot T_{max}, 14\},   (12)

where T_max is the maximum value of the norm ‖T(p)‖ in the image. For computational efficiency purposes, the maximum of ‖T(p)‖ can be sought only among the pixels that served as block centers during the initial clustering described in Section 3.3. The term 0.65 · T_max in the threshold definition is used to make sure that the filter will not be applied outside the borders of the textured objects. In this way, the boundaries of the textured objects will not be corrupted, thus enabling the KMCC algorithm to accurately detect those boundaries. The constant term 14, on the other hand, is necessary for the system to be able to handle images composed of chromatically uniform objects; in such images, the value of T_max is expected to be relatively small and would correspond to pixels on edges between objects, where the application of a moving average filter is obviously undesirable.
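A sketch of this conditional filtering stage is given below, under the assumption that the texture features of the earlier sketch are available as an (H, W, 18) array; searching T_max over the whole image rather than only over the block centers is a simplification.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def conditional_filter(lab_image, texture, f=9):
    """Moving-average filter applied only where the texture norm is large (eqs. (11)-(12))."""
    t_norm = np.linalg.norm(texture, axis=-1)
    t_max = t_norm.max()                        # simplification: maximum over all pixels
    t_th = max(0.65 * t_max, 14.0)              # threshold of eq. (12)

    # Moving-average (f x f) version of every colour component.
    filtered = np.stack(
        [uniform_filter(lab_image[:, :, c].astype(float), size=f)
         for c in range(lab_image.shape[2])],
        axis=-1,
    )
    mask = (t_norm >= t_th)[..., None]          # filter only the strongly textured pixels
    return np.where(mask, filtered, lab_image)
```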

The output of the conditional filtering stage will now be used by the KMCC algorithm.

3.5. The K-means with connectivity constraint algorithm

Clustering based on the K-means algorithm is a widely used region segmentation method [18, 19, 20] which, however, tends to produce unconnected regions. This is due to the propensity of the classical K-means algorithm to ignore spatial information about the intensity values in an image, since it only takes into account the global intensity or color information. In order to alleviate this problem, we propose the use of an extended K-means algorithm: the K-means-with-connectivity-constraint algorithm. In this algorithm the spatial proximity of each region is also taken into account by defining a new center for the K-means algorithm and by integrating the K-means with a component labeling procedure.

The K-means with connectivity constraint (KMCC) algorithm that is applied on the pixels of the image consists of the following steps (Figure 8).
Step 1. An initial clustering is produced, using the estimation procedure of Section 3.3; thus the number of regions K is initialized as K = K'.
Step 2. For every pixel p, the distance between p and all region centers is calculated. The pixel is then assigned to the region for which the distance is minimized. A generalized distance of a pixel p from a region s_k is defined as follows (a sketch of this assignment step is given at the end of this subsection):

D(p, s_k) = \|J(p) - J(s_k)\| + \|T(p) - T(s_k)\| + \lambda \frac{A}{A_k} \|p - S(s_k)\|,   (13)




Figure 8: Block diagram of the KMCC algorithm.

where

\|J(p) - J(s_k)\| = \sqrt{(J_L(p) - J_L(s_k))^2 + (J_a(p) - J_a(s_k))^2 + (J_b(p) - J_b(s_k))^2},
\|T(p) - T(s_k)\| = \sqrt{\sum_{q=1}^{18} (\sigma_q(p) - T_q(s_k))^2},
\|p - S(s_k)\| = \sqrt{(p_x - S_x(s_k))^2 + (p_y - S_y(s_k))^2},   (14)

the area A_k of each region is defined as

A_k = M_k,   (15)

where M_k is the number of pixels assigned to region s_k, and A is the average area of all regions:

A = \frac{1}{K} \sum_{k=1}^{K} A_k.   (16)

The regularization parameter λ is defined as:

\lambda = \frac{0.4 \cdot D_{bmax}}{\sqrt{p_{x,max}^2 + p_{y,max}^2}}.   (17)

Normalization of the spatial distance ‖p − S(s_k)‖ by the relative area of each region, A/A_k, is necessary in order to encourage the creation of large connected regions; otherwise, pixels would tend to be assigned to smaller rather than larger regions due to greater spatial proximity to their centers. In that case, large objects would be broken down into several neighboring smaller regions instead of forming one single, larger region.

The regularization parameter λ is used to make sure that a pixel is assigned to a region primarily due to their intensity and texture similarity. Being proportional to the intensity and texture contrast D_bmax of the image, it ensures that even in low-contrast images, where intensity and texture differences are small, these differences do not become insignificant compared to spatial distances.

The opposite would result in the formation of regions that do not correspond to the objects of the image.
Step 3. The connectivity of the regions formed is evaluated; those which are not connected are broken down to the minimum number of connected regions using a recursive four-connectivity component labelling algorithm [17].
Step 4. Region centers are recalculated. Regions with areas below a size threshold th_size are dropped. The number of regions K is also recalculated, taking into account only the remaining regions.
Step 5. Two regions are merged if they are neighbors and if their intensity and texture distance is not greater than an appropriate merging threshold:

D(s_{k_1}, s_{k_2}) = \|J(s_{k_1}) - J(s_{k_2})\| + \|T(s_{k_1}) - T(s_{k_2})\| \le th_{merge}.   (18)

Step 6. Region number K and region centers are once again reevaluated.
Step 7. If the region number K is equal to the one calculated in Step 6 of the previous iteration and the difference between the new centers and those of Step 6 of the previous iteration is below the corresponding threshold for all centers, then stop; else go to Step 2. If the index "old" characterizes the region number and region centers calculated in Step 6 of the previous iteration, the convergence condition can be expressed as:

K = K^{old},
\|J(s_k) - J(s_k^{old})\| \le th_I,
\|T(s_k) - T(s_k^{old})\| \le th_T,
\|S(s_k) - S(s_k^{old})\| \le th_S,   (19)

for k = 1, ..., K.

Even though the region centers of particularly small regions are omitted in Step 4 and the formation of large regions is encouraged in Step 2, there is no guarantee that no such small regions will be present in the segmentation mask after the convergence of the algorithm.



Table 1: Threshold values.

Threshold description       Threshold value
Texture threshold           T_th = max{0.65 · T_max, 14}
Region size threshold       th_size = 0.75% of the total image area
Region merging threshold    th_merge = 20 if D_bmax > 75; 15 if D_bmax ≤ 75 and T_max ≥ 14; 10 if D_bmax ≤ 75 and T_max < 14
Convergence thresholds      th_I = 4.0, th_T = 1.0, th_S = 2.0

Since these regions are not wanted, they are forced to merge with one of their neighboring regions, based on intensity and texture similarity: a small region s_{k_1}, with M_{k_1} < th_size, is appended to the region s_{k_2}, k_2 = 1, ..., K, k_2 ≠ k_1, for which the distance

D(s_{k_1}, s_{k_2}) = \|J(s_{k_1}) - J(s_{k_2})\| + \|T(s_{k_1}) - T(s_{k_2})\|   (20)

is minimum. This procedure is performed for all small regions of the segmentation mask, until all such small regions are absorbed.

In Table 1, a summary of the thresholds required by the segmentation algorithm and the corresponding values used in our experiments is presented.
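The sketch below illustrates the generalized distance of eq. (13) and the pixel-assignment step (Step 2) referred to above; the per-region dictionary layout and the use of the full image diagonal for p_{x,max}, p_{y,max} in eq. (17) are my own assumptions.

```python
import numpy as np

def kmcc_distance(J_p, T_p, pos_p, region, A_bar, lam):
    """Generalized distance of a pixel from a region, eq. (13).

    `region` is assumed to be a dict with the region's intensity center 'J',
    texture center 'T', spatial center 'S' (in (x, y) order) and area 'A'.
    """
    d_color = np.linalg.norm(J_p - region["J"])
    d_text = np.linalg.norm(T_p - region["T"])
    d_space = np.linalg.norm(pos_p - region["S"])
    return d_color + d_text + lam * (A_bar / region["A"]) * d_space

def assign_pixels(J, T, regions, db_max, shape):
    """Step 2 of a KMCC iteration: assign every pixel to the closest region."""
    h, w = shape
    lam = 0.4 * db_max / np.hypot(w - 1, h - 1)         # eq. (17), assumed p_{x,max}, p_{y,max}
    A_bar = np.mean([r["A"] for r in regions])          # average region area, eq. (16)
    labels = np.empty((h, w), dtype=int)
    for y in range(h):
        for x in range(w):
            pos = np.array([x, y], dtype=float)
            dists = [kmcc_distance(J[y, x], T[y, x], pos, r, A_bar, lam) for r in regions]
            labels[y, x] = int(np.argmin(dists))
    return labels
```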

4. REGION DESCRIPTORS

As soon as the segmentation mask is produced, a set of descriptors that will be used for querying is calculated for each region. These region descriptors compactly characterize each region's color, position, and shape.

The color and position descriptors of a region are based on the intensity and spatial centers that were calculated for the region in the last iteration of the KMCC algorithm. In particular, the color descriptors of region s_k are the intensity centers of that region, I_L(s_k), I_a(s_k), I_b(s_k), whereas the position descriptors P_{k,x}, P_{k,y} are the spatial centers normalized by the dimensions of the image:

P_{k,x} = \frac{S_{k,x}}{x_{max}},   P_{k,y} = \frac{S_{k,y}}{y_{max}}.   (21)

The shape descriptors of a region are its area, eccentricity, and orientation. The area E_k is expressed by the number of pixels M_k that belong to region s_k, divided by the total number of pixels of the image:

E_k = \frac{M_k}{x_{max} \cdot y_{max}}.   (22)

The other two shape descriptors are calculated using the covariance or scatter matrix C_k of the region. This is defined as:

C_k = \frac{1}{M_k} \sum_{m=1}^{M_k} (p^k_m - S_k)(p^k_m - S_k)^T,   (23)

where p^k_m = [p^k_{m,x}, p^k_{m,y}]^T, m = 1, ..., M_k, are the pixels belonging to region s_k. Let ρ_i, u_i, i = 1, 2, be its eigenvalues and eigenvectors: C_k u_i = ρ_i u_i, with u_i^T u_i = 1, u_i^T u_j = 0 for i ≠ j, and ρ_1 ≥ ρ_2. As is known from Principal Component Analysis (PCA), the principal eigenvector u_1 defines the orientation of the region and u_2 is perpendicular to u_1. The two eigenvalues provide an approximate measure of the two dominant directions of the shape. Using these quantities, an approximation of the eccentricity ε_k and orientation θ_k of the region is calculated: the orientation θ_k is the argument of the principal eigenvector u_1 of C_k, and the eccentricity ε_k is defined as follows:

\epsilon_k = 1 - \frac{\rho_1}{\rho_2}.   (24)

The eight region descriptors mentioned above form a region descriptor vector D_k:

D_k = [I_L(s_k), I_a(s_k), I_b(s_k), P_{k,x}, P_{k,y}, E_k, \theta_k, \epsilon_k].   (25)

Using eight bits to express each one of the region descriptors, a total of 64 bits is required for the entire region descriptor vector D_k. This information, along with the segmentation mask, will be embedded in the image using digital watermarking techniques, as described in the ensuing section.
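A sketch of this descriptor computation for one region of a labelled segmentation mask is given below; the function signature and array conventions are illustrative, and the eccentricity follows eq. (24) as stated above.

```python
import numpy as np

def region_descriptors(mask, lab_image, k):
    """Descriptor vector D_k of eq. (25) for the region labelled `k` in `mask`."""
    ys, xs = np.nonzero(mask == k)
    h, w = mask.shape
    m_k = len(xs)

    color = lab_image[ys, xs].mean(axis=0)                 # I_L, I_a, I_b intensity centers
    s_x, s_y = xs.mean(), ys.mean()                        # spatial center S_k
    p_x, p_y = s_x / w, s_y / h                            # normalized position, eq. (21)
    area = m_k / float(h * w)                              # relative area, eq. (22)

    pts = np.stack([xs - s_x, ys - s_y], axis=1)
    cov = pts.T @ pts / m_k                                # scatter matrix, eq. (23)
    rho, u = np.linalg.eigh(cov)                           # eigenvalues in ascending order
    rho1, rho2 = rho[1], rho[0]                            # rho1 >= rho2
    u1 = u[:, 1]                                           # principal eigenvector
    theta = np.arctan2(u1[1], u1[0])                       # orientation theta_k
    ecc = 1.0 - rho1 / rho2 if rho2 > 0 else 0.0           # eccentricity as in eq. (24)

    return np.array([color[0], color[1], color[2], p_x, p_y, area, theta, ecc])
```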

5. CONTENT-BASED INFORMATION EMBEDDING

The information obtained for each image using the techniques of the preceding sections is embedded in the image itself. Two kinds of watermarks are embedded: one containing segmentation information and another that carries indexing information. Both are embedded in the spatial domain.




Figure 9: Example of embedding labels to each image block.

5.1. Segmentation information embedding

The segmentation watermark consists of a number of symbols. A different symbol is embedded in each image region in order to make possible the identification of the regions during the watermark detection process. In a sense, the segmentation watermark symbols can be seen as labels that are tagged to each image object (see Figure 9). We chose to embed the segmentation watermark in the blue component of the RGB images because the Human Visual System is less sensitive to blue color [21]. Even though the methodology that will be developed is entirely general and an arbitrary number of regions may be labelled, we shall describe it here, for the sake of brevity and simplicity, for the case where only four regions of each segmented image need be labelled, that is, three main objects and the background. It will be assumed that any other objects determined using the segmentation algorithm are small and insignificant objects which are labelled using the same label as the background and therefore cannot be indexed and retrieved independently.

The pixels of block (l_1, l_2) in the blue component I_B of the image will be modified due to the watermarking process as follows:

I'_{(l_1,l_2)B}[i, j] = I_B[8 l_1 + i, 8 l_2 + j] + a_{l_1,l_2} \cdot w[i, j],   (26)

where l_1, l_2 are the block indices, i, j are the indices specifying the position of the pixel inside the block, and a_{l_1,l_2} is a modulating factor valued as follows:

a_{l_1,l_2} = \begin{cases} -3, & \text{if block } (l_1, l_2) \text{ belongs to the background or is not classified,} \\ -1, & \text{if block } (l_1, l_2) \text{ belongs to the 1st region,} \\ +1, & \text{if block } (l_1, l_2) \text{ belongs to the 2nd region,} \\ +3, & \text{if block } (l_1, l_2) \text{ belongs to the 3rd region,} \end{cases}   (27)

Figure 10: Watermark matrix (an 8 × 8 checkerboard of alternating +1 and −1 values). The modulating factor corresponding to the embedded information symbol is multiplied by the matrix elements and the resulting signal is added to the image.

Figure 11: Distributions of the detector statistic for the four symbols.

and w[i, j] is the watermark matrix (see Figure 10), which is the same for all blocks in the image and is given by w[i, j] = (-1)^{i+j}, that is,

w[i, j] = \begin{cases} +1, & \text{if } i + j \text{ is even,} \\ -1, & \text{if } i + j \text{ is odd.} \end{cases}   (28)
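A minimal sketch of this embedding step is shown below, assuming the blue component as a 2D array and a per-block label map (values 0-3) obtained from the segmentation mask; clipping to the 8-bit range is an added practical detail.

```python
import numpy as np

# 8 x 8 checkerboard watermark matrix of eq. (28): +1 where i+j is even, -1 otherwise.
W = np.fromfunction(lambda i, j: 1.0 - 2.0 * ((i + j) % 2), (8, 8))

def embed_segmentation_watermark(blue, block_labels):
    """Embed the region label of every 8 x 8 block into the blue component (eqs. (26)-(27)).

    `block_labels` holds values 0..3 per block (0 = background / unclassified);
    they are mapped to the modulating factors -3, -1, +1, +3.
    """
    factors = np.array([-3.0, -1.0, 1.0, 3.0])
    marked = blue.astype(float).copy()
    n_rows, n_cols = block_labels.shape
    for l1 in range(n_rows):
        for l2 in range(n_cols):
            a = factors[block_labels[l1, l2]]
            marked[8 * l1:8 * l1 + 8, 8 * l2:8 * l2 + 8] += a * W
    return np.clip(marked, 0, 255)              # keep the marked pixels in the 8-bit range
```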

The extraction of the label of each block is achieved via a simple correlation-based watermark detection process. The watermark detector is applied to each block. To detect the watermark, we calculate the correlation between the intensities I'_{(l_1,l_2)B}[i, j] of the watermarked pixels in a block and w[i, j]. The detector output for block (l_1, l_2) is calculated by

q_{l_1,l_2} = \frac{1}{N} \sum_i \sum_j I'_{(l_1,l_2)B}[i, j] \cdot w[i, j],   (29)

where N is the number of pixels in a block. In our case, N = 64 since 64 pixels are included in an 8 × 8 block. The symbol that is extracted from each block depends on the detector output. The probability density function of the detector output can be approximated by a Gaussian distribution with mean equal to −3, −1, 1, or 3, depending on the symbol that was embedded (see Figure 11). By choosing the decision boundaries to coincide with the value of q at which adjacent distributions in Figure 11 cross, the probability of misclassification is minimized [22]. Clearly, the optimal rule for extracting the label of block (l_1, l_2) is

s = \begin{cases} 0, & \text{if } q_{l_1,l_2} < -2, \\ 1, & \text{if } -2 < q_{l_1,l_2} < 0, \\ 2, & \text{if } 0 < q_{l_1,l_2} < 2, \\ 3, & \text{if } 2 < q_{l_1,l_2}, \end{cases}   (30)

since the above choice minimizes the probability of erroneous symbol detection. This probability will be shown to be very small. In fact, using the conditional distributions of the detector statistic for each of the four symbols, the probabilities P_i, i = 0, ..., 3, that the ith symbol is extracted erroneously are

P_0 = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-2}^{\infty} e^{-(x+3)^2/2\sigma^2} \, dx,   (31)
P_1 = \frac{1}{\sqrt{2\pi}\,\sigma} \left( \int_{-\infty}^{-2} e^{-(x+1)^2/2\sigma^2} \, dx + \int_{0}^{\infty} e^{-(x+1)^2/2\sigma^2} \, dx \right),   (32)
P_2 = P_1,   (33)
P_3 = P_0.   (34)

The mean value of the detector output is

m = \frac{1}{N} \sum_i \sum_j \left( I_{(l_1,l_2)B}[i, j] + a_{l_1,l_2} \cdot w[i, j] \right) \cdot w[i, j]
  = \frac{1}{N} \sum_i \sum_j \left( I_{(l_1,l_2)B}[i, j] \cdot w[i, j] + a_{l_1,l_2} \cdot w[i, j]^2 \right)
  = \frac{1}{N} \left( \sum_i \sum_j I_{(l_1,l_2)B}[i, j] \cdot w[i, j] + a_{l_1,l_2} \sum_i \sum_j w[i, j]^2 \right).   (35)

From (28),

\frac{1}{N} \sum_i \sum_j I_{(l_1,l_2)B}[i, j] \cdot w[i, j] = \frac{1}{N} \left( \sum_{i+j\,\text{even}} I_{(l_1,l_2)B}[i, j] - \sum_{i+j\,\text{odd}} I_{(l_1,l_2)B}[i, j] \right),   (36)

which is a very small quantity. Furthermore,

\sum_i \sum_j w[i, j]^2 = N.   (37)

Thus, (35) yields

m = \frac{1}{N} \cdot a_{l_1,l_2} \cdot N = a_{l_1,l_2},   (38)

and the variance of the correlator output is

\sigma^2 = \left( \frac{1}{N} \sum_i \sum_j \left( I_{(l_1,l_2)B}[i, j] \cdot w[i, j] + a_{l_1,l_2} \cdot w[i, j]^2 \right) - a_{l_1,l_2} \right)^2
        = \left( \frac{1}{N} \sum_i \sum_j I_{(l_1,l_2)B}[i, j] \cdot w[i, j] + \frac{1}{N} a_{l_1,l_2} \sum_i \sum_j w[i, j]^2 - a_{l_1,l_2} \right)^2
        = \left( \frac{1}{N} \sum_i \sum_j I_{(l_1,l_2)B}[i, j] \cdot w[i, j] + \frac{1}{N} \cdot a_{l_1,l_2} \cdot N - a_{l_1,l_2} \right)^2
        = \left( \frac{1}{N} \sum_i \sum_j I_{(l_1,l_2)B}[i, j] \cdot (-1)^{i+j} \right)^2
        = \frac{1}{N^2} \left( \sum_{i+j\,\text{even}} I_{(l_1,l_2)B}[i, j] - \sum_{i+j\,\text{odd}} I_{(l_1,l_2)B}[i, j] \right)^2
        = \frac{1}{64^2} \left( \sum_{i+j\,\text{even}} I_{(l_1,l_2)B}[i, j] - \sum_{i+j\,\text{odd}} I_{(l_1,l_2)B}[i, j] \right)^2.   (39)

Since the term in brackets is far lower than 64^2 (intensities in a block are highly correlated), the resulting variance σ^2 will be a very small quantity (see Figure 11).

In most practical cases, if no attack is performed on the image, the variance σ^2 of the segmentation watermark detector is approximately equal to 0.09. By substituting this value in (31) and (32), the probability that the label of a block in an object is misinterpreted is less than 10^{-3}. Although this is a very small probability, there are some cases in which even such a small error could affect the synchronization capability of the system and the subsequent indexing information extraction. Such a case may occur if a block on region boundaries is misinterpreted (see Figure 12). As seen, detection errors occurring at blocks within a region can be easily corrected, since isolated blocks obviously cannot be considered objects. However, errors at block boundaries cannot be corrected, since there is ambiguity regarding the region to which they belong. For this reason, immediately before embedding indexing information, a dummy detection of segmentation information takes place in order to identify blocks which yield ambiguous segmentation labels. In such blocks, no indexing information is embedded.
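The corresponding correlation detector of eq. (29) and the decision rule of eq. (30) can be sketched as below, reusing the checkerboard matrix W from the embedding sketch; border handling is simplified.

```python
import numpy as np

def detect_block_labels(blue_marked):
    """Correlation detector of eq. (29) plus the decision rule of eq. (30).

    Returns, for every 8 x 8 block of the marked blue component, the detector
    output q and the decoded symbol 0-3 (the checkerboard matrix W defined in
    the embedding sketch is assumed).
    """
    h, w = blue_marked.shape
    n_rows, n_cols = h // 8, w // 8
    q = np.empty((n_rows, n_cols))
    for l1 in range(n_rows):
        for l2 in range(n_cols):
            block = blue_marked[8 * l1:8 * l1 + 8, 8 * l2:8 * l2 + 8].astype(float)
            q[l1, l2] = np.mean(block * W)               # (1/N) sum I' * w, eq. (29)
    # Decision boundaries at -2, 0, +2 map q to the symbols 0, 1, 2, 3 (eq. (30)).
    symbols = np.digitize(q, bins=[-2.0, 0.0, 2.0])
    return q, symbols
```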

5.2. Indexing information embedding

Indexing information is embedded in the red component of each image using binary symbols. For each region, eight feature values described by 8 bits each are ordered in a binary vector of 64 bits.




Figure 12: Incorrectly retrieved block segmentation information.

Each bit of this vector is embedded in a block of the corresponding region. After the embedding of the watermark, the red component of block (l_1, l_2) of the image is as follows:

I'_{(l_1,l_2)R}[i, j] = I_R[8 l_1 + i, 8 l_2 + j] + a_{l_1,l_2} \cdot w[i, j],   (40)

where w is the watermark matrix given in (28) and a_{l_1,l_2} is a modulating factor valued as follows:

a_{l_1,l_2} = \begin{cases} +1, & \text{if the embedded bit is 1,} \\ -1, & \text{if the embedded bit is 0.} \end{cases}   (41)

The green component is not altered. The detection is correlation-based, that is,

q_{l_1,l_2} = \frac{1}{N} \sum_i \sum_j I'_{(l_1,l_2)R}[i, j] \cdot w[i, j].   (42)

If the output q of the detector is less than zero, then the extracted bit is 0, otherwise 1. Using this rule, the resulting probability of erroneous detection is very small. However, in order to achieve lossless extraction, error correcting codes [23] can be used. Error correcting codes can detect and correct errors that may occur during the extraction of the embedded bitstream. In this paper, a simple Hamming code is used that adds three error control bits B_C1, B_C2, B_C3 for every four information bits B_I1, B_I2, B_I3, B_I4. The control bits are computed from the information bits in the following way:

B_{C1} = B_{I1} ⊕ B_{I2} ⊕ B_{I4},
B_{C2} = B_{I1} ⊕ B_{I3} ⊕ B_{I4},
B_{C3} = B_{I2} ⊕ B_{I3} ⊕ B_{I4},   (43)

where ⊕ denotes the XOR operation. Thus, the embedded bitstream takes the form B_C1, B_C2, B_I1, B_C3, B_I2, B_I3, B_I4 for every four indexing bits. If only a single error occurs while detecting the four indexing bits, the error can be corrected. The protection achieved using this approach is so strong (for the given application) that it practically guarantees the correct extraction of all indexing bits.
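This is the standard (7,4) Hamming code with the parity bits placed at positions 1, 2, and 4; a sketch of the encoder and of the single-error-correcting decoder implied by eq. (43) is given below (the bit-list convention is my own).

```python
def hamming74_encode(bits):
    """Encode 4 information bits as B_C1, B_C2, B_I1, B_C3, B_I2, B_I3, B_I4 (eq. (43))."""
    i1, i2, i3, i4 = bits
    c1 = i1 ^ i2 ^ i4
    c2 = i1 ^ i3 ^ i4
    c3 = i2 ^ i3 ^ i4
    return [c1, c2, i1, c3, i2, i3, i4]

def hamming74_decode(word):
    """Correct up to one bit error in the 7-bit word and return the 4 information bits."""
    c1, c2, i1, c3, i2, i3, i4 = word
    # Each syndrome bit re-checks one parity relation of eq. (43); the syndrome
    # value s1 + 2*s2 + 4*s3 is the 1-based position of the erroneous bit (0 = no error).
    s1 = c1 ^ i1 ^ i2 ^ i4
    s2 = c2 ^ i1 ^ i3 ^ i4
    s3 = c3 ^ i2 ^ i3 ^ i4
    pos = s1 * 1 + s2 * 2 + s3 * 4
    if pos:
        word = list(word)
        word[pos - 1] ^= 1                    # flip the erroneous bit
        c1, c2, i1, c3, i2, i3, i4 = word
    return [i1, i2, i3, i4]
```

For example, hamming74_decode(hamming74_encode([1, 0, 1, 1])) returns [1, 0, 1, 1] even if any single bit of the 7-bit word is flipped before decoding.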

6. EXPERIMENTAL RESULTS

The segmentation and watermarking algorithms described in the previous sections were tested for embedding information in a variety of test images [24]. As seen in Figures 13 and 14, the segmentation algorithm is endowed with the capability to handle efficiently both textured and non-textured objects. This is due to the combined use of intensity, texture, and position features for the image pixels. The derivation of the segmentation mask was followed by the extraction of indexing features for the formed regions as described in Section 4 (these features are also used by the ISTORAMA content-based image retrieval system, http://uranus.ee.auth.gr/Istorama/).

The above segmentation and indexing information was subsequently embedded in the images. Alternatively, instead of indexing information, any other kind of object-related information could be embedded, including a short text describing the object.

The segmentation information was embedded in the blue component of RGB images using the procedure described in the previous section. The indexing information was embedded in the red component of RGB images. Moreover, if the object was large enough, the same indexing bits were embedded twice or even more times, until all available region blocks were used. The average time for watermarking an image was 0.07 seconds and the average time for the extraction of indexing information was 0.035 seconds on a computer with a Pentium-III processor. No perceptual degradation of image quality was observed due to watermarking (see Figure 15). The 0.07 seconds include both mask and indexing information embedding but exclude the time needed for segmentation and feature extraction. The processes of segmenting an image and extracting indexing features from the formed regions are more time-consuming and in our system take several seconds per image. In practice, the entire process in Figure 1 takes roughly 15 seconds. However, the process in Figure 1 is performed only once (at the time an image is segmented and marked), whereas the detection process (Figure 2), which takes place many times (once for each different query), still needs 0.035 seconds per image.

The proposed system was subsequently tested for the retrieval of image regions using 1000 images from the ISTORAMA database. In all cases, due to the channel coding, 100% of the embedded indexing bits were reliably extracted from the watermarked image. A simple retrieval example is shown in Figure 16. In most cases, the system was able to respond in less than 20 seconds and present the image region which was close to the one required by the user.




Figure 13: Images segmented into regions.

Figure 14: Images segmented into regions.

Figure 15: (a) Original image. (b) Watermarked image. No perceptual degradation can be observed.

However, for applications in which the speed of the system in its present form is considered not satisfactory, a separate file could be built offline containing the feature values that are embedded in the images. In this way, the feature values could be accessed much faster than by extracting them from the images online.

7. CONCLUSIONS

A methodology was presented for the segmentation and content-based embedding of indexing information in digital images. The segmentation algorithm combines pixel position, intensity, and texture information in order to segment the image into a number of regions. Two types of watermarks are subsequently embedded in each region: a segmentation watermark and an indexing watermark.

The proposed system is appropriate for building flexible databases in which no side information needs to be kept for each image. Moreover, the semantic regions comprising each image can be easily extracted using the segmentation watermark detection procedure.

ACKNOWLEDGMENTS

The authors are with the Information Processing Laboratory, Electrical and Computer Engineering Department, Aristotle University of Thessaloniki, Thessaloniki, Greece, and the Informatics and Telematics Institute, Thessaloniki, Greece. This work was supported by the COST 211quat and the EU IST project ASPIS.



Figure 16: (a) Original image (730 × 490). (b) Segmentation mask. (c) The region used for querying is presented in white. (d)–(g) Retrieved images.

REFERENCES

[1] W. Zeng and B. Liu, "A statistical watermark detection technique without using original images for resolving rightful ownerships of digital images," IEEE Trans. Image Processing, vol. 8, no. 11, pp. 1534–1548, 1999.

[2] D. Simitopoulos, N. V. Boulgouris, A. Leontaris, and M. G. Strintzis, "Scalable detection of perceptual watermarks in JPEG 2000 images," in Conference on Communications and Multimedia Security, Darmstadt, Germany, May 2001.

[3] A. M. Alattar, "Smart images using Digimarc's watermarking technology," in Security and Watermarking of Multimedia Contents II, Proceedings of SPIE, vol. 3971, pp. 264–273, January 2000.

[4] N. V. Boulgouris, I. Kompatsiaris, V. Mezaris, and M. G. Strintzis, "Content-based watermarking for indexing using robust segmentation," in Proc. Workshop on Image Analysis For Multimedia Interactive Services, Tampere, Finland, May 2001.

[5] "Special issue," IEEE Trans. Circuits and Systems for Video Technology, Special Issue on Image and Video Processing, vol. 8, no. 5, September 1998.

[6] "Special issue," Signal Processing, Special Issue on Video Sequence Segmentation for Content-Based Processing and Manipulation, vol. 66, no. 2, April 1998.

[7] K. S. Fu and J. K. Mui, "A survey on image segmentation," Pattern Recognition, vol. 13, no. 1, pp. 3–16, 1981.

[8] R. M. Haralick and L. G. Shapiro, "Image segmentation techniques," Comput. Vision Graphics Image Process., vol. 29, no. 1, pp. 100–132, 1985.

[9] A. A. Alatan, L. Onural, M. Wollborn, R. Mech, E. Tuncel, and T. Sikora, "Image sequence analysis for emerging interactive multimedia services—the European COST 211 framework," IEEE Trans. Circuits and Systems for Video Technology, vol. 8, no. 7, pp. 19–31, 1998.

[10] D. Geiger and A. Yuille, "A common framework for image segmentation," International Journal of Computer Vision, vol. 6, no. 3, pp. 227–243, 1991.

[11] P. Bas, N. V. Boulgouris, F. D. Koravos, J. M. Chassery, M. G. Strintzis, and B. Macq, "Robust watermarking of video objects for MPEG-4 applications," in Proc. SPIE International Symposium on Optical Science and Technology, San Diego, Calif, USA, July–August 2001.

[12] N. V. Boulgouris, F. D. Koravos, and M. G. Strintzis, "Self-synchronizing watermark detection for MPEG-4 objects," in Proc. IEEE International Conference on Electronics, Circuits and Systems, Malta, September 2001.

[13] I. Kompatsiaris and M. G. Strintzis, "Content-based representation of colour image sequences," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Salt Lake City, Utah, USA, May 2001.

[14] I. Kompatsiaris and M. G. Strintzis, "Spatiotemporal segmentation and tracking of objects for visualization of videoconference image sequences," IEEE Trans. Circuits and Systems for Video Technology, vol. 10, no. 8, 2000.

[15] S. Liapis, E. Sifakis, and G. Tziritas, "Color and/or texture segmentation using deterministic relaxation and fast marching algorithms," in International Conference on Pattern Recognition, vol. 3, pp. 621–624, September 2000.

[16] M. Unser, "Texture classification and segmentation using wavelet frames," IEEE Trans. Image Processing, vol. 4, no. 11, pp. 1549–1560, 1995.

[17] R. Jain, R. Kasturi, and B. G. Schunck, Machine Vision, McGraw-Hill International Editions, New York, USA, 1995.

[18] S. Z. Selim and M. A. Ismail, "K-means-type algorithms," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 6, no. 1, pp. 81–87, 1984.

[19] S. Sakaida, Y. Shishikui, Y. Tanaka, and I. Yuyama, "Image segmentation by integration approach using initial dependence of k-means algorithm," in Picture Coding Symposium 97, pp. 265–269, Berlin, Germany, September 1997.

[20] I. Kompatsiaris and M. G. Strintzis, "3D representation of videoconference image sequences using VRML 2.0," in European Conference for Multimedia Applications Services and Techniques (ECMAST '98), pp. 3–12, Berlin, Germany, May 1998.

[21] M. Kutter, Digital image watermarking: hiding information in images, Ph.D. thesis, 1999.

[22] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, John Wiley, New York, USA, 1973.

[23] S. Lin and D. J. Costello, Error Control Coding: Fundamentals and Applications, Prentice-Hall, Englewood Cliffs, NJ, USA, 1983.

[24] Corel stock photo library, Corel Corp., Ontario, Canada.

Nikolaos V. Boulgouris was born in Greece in 1975. He received the Diploma and the Ph.D. degrees from the Electrical and Computer Engineering Department, Aristotle University of Thessaloniki, Thessaloniki, Greece, in 1997 and 2002, respectively. He is currently a Researcher in the Informatics and Telematics Institute, Thessaloniki, Greece. Since 1997, Dr. Boulgouris has participated in several research projects funded by the European Union and the Greek Secretariat of Research and Technology. His research interests include image/video communication, networking, content-based indexing and retrieval, wavelets, pattern recognition, and multimedia copyright protection. Nikolaos V. Boulgouris is a member of the Technical Chamber of Greece and a member of IEEE.

Ioannis Kompatsiaris received the Diploma degree in electrical engineering and the Ph.D. degree in 3D model based image sequence coding from Aristotle University of Thessaloniki (AUTH), Thessaloniki, Greece, in 1996 and 2001, respectively. He is a Senior Researcher with the Informatics and Telematics Institute, Thessaloniki. Previously, he was a Leading Researcher on 2D and 3D Imaging at AUTH. I. Kompatsiaris has participated in many research projects funded by the EC and the GSRT. His research interests include image processing, computer vision, model-based monoscopic and multiview image sequence analysis and coding, medical image processing, and video coding standards (MPEG-4, MPEG-7, MPEG-21). He is a representative of the Greek National Standardization Body (ELOT) to the ISO JTC 1/SC 29/WG 11 MPEG group. In the last 4 years, he has authored 6 articles in scientific journals and delivered over 20 scientific conference presentations in these and similar areas. I. Kompatsiaris is a member of the Technical Chamber of Greece.

Vasileios Mezaris was born in Athens, Greece, in 1979. He received the Diploma degree in Electrical and Computer Engineering in 2001 from the Electrical and Computer Engineering Department, Aristotle University of Thessaloniki, Greece, where he is currently working towards the Ph.D. degree. He is also a Graduate Research Assistant at the Informatics and Telematics Institute, Thessaloniki, Greece. His research interests include still image segmentation, video segmentation and object tracking, and video streaming.

Dimitrios Simitopoulos was born in Greece in 1977. He received his Diploma in Electrical and Computer Engineering from Aristotle University of Thessaloniki, Greece, in 1999. He is currently working towards the Ph.D. degree in the Department of Electrical and Computer Engineering of Aristotle University of Thessaloniki, where he holds a teaching assistantship position. Since 2000, he has been working as a research assistant in the Informatics and Telematics Institute. His research interests include watermarking, multimedia security, and image indexing and retrieval.

Michael G. Strintzis received the Diploma in Electrical Engineering from the National Technical University of Athens, Athens, Greece, in 1967, and the M.A. and Ph.D. degrees in Electrical Engineering from Princeton University, Princeton, NJ, USA, in 1969 and 1970, respectively. He then joined the Electrical Engineering Department at the University of Pittsburgh, Pittsburgh, Pa, USA, where he served as Assistant (1970–1976) and Associate (1976–1980) Professor. Since 1980 he has been Professor of Electrical and Computer Engineering at the University of Thessaloniki, and since 1999 Director of the Informatics and Telematics Research Institute in Thessaloniki, Greece. Since 1999 he has served as an Associate Editor of the IEEE Transactions on Circuits and Systems for Video Technology. His current research interests include 2D and 3D image coding, image processing, biomedical signal and image processing, and DVD and Internet data authentication and copy protection. In 1984, Dr. Strintzis was awarded one of the Centennial Medals of the IEEE.