Advances in Multimedia Computing

Toward Bridging the Annotation-Retrieval Gap in Image Search

Ritendra Datta, Weina Ge, Jia Li, and James Z. Wang, The Pennsylvania State University


By combining novel statistical modeling techniques and the WordNet ontology, we offer a promising new approach to image search that uses automatic image tagging directly to perform retrieval.

Quick ways to capture pictures, cheap devices to store them, and convenient mechanisms for sharing them are all part and parcel of our daily lives. With multitudes of pictures to deal with, everyone would benefit from smart programs to manage photo collections, tag them automatically, and make them searchable through keywords. To satisfy such needs, the multimedia, information retrieval, and computer vision communities have, time and again, attempted automated image annotation, as we have witnessed in the recent past.1-4

While many interesting ideas have emerged, we haven't seen much attention paid to the direct use of automatic annotation for image search. Usually, it is assumed that good annotation implies quality image search. Moreover, most past approaches are too slow to be of practical use for today's massive picture collections.

The problem would not be interesting if all pictures came with tags, which in turn were reliable. Unfortunately, for today's picture collections such as Yahoo! Flickr, this is seldom the case. These collections are characterized by their mammoth volumes, lack of reliable tags, and the diverse spectrum of topics they cover. In Web image search systems such as those of Yahoo! and Google, surrounding text forms the basis of keyword searches, which come with their own problems.

In this article, we discuss our attempt to build an image search system based on automatic tagging. Our goal is to treat automatic annotation as a means to satisfactory image search. We look at realistic scenarios that arise in image search, and propose a framework that can handle them through a unified approach. To achieve this, we look at how pictures can be accurately and rapidly placed into a large number of categories, how the categorization can be used effectively for automatic annotation, and how these annotations can be harnessed for image search. For this, we use novel statistical models and the WordNet ontology,5 as well as state-of-the-art content-based image retrieval (CBIR) methods6-8 for comparison. Our method significantly outperforms competing strategies for this problem, and also suggests nonintuitive results.

Bridging the gap
Our motivation to bridge the annotation-retrieval gap is driven by a desire to effectively handle common, challenging cases of image search in a unified manner. Four real-world scenarios, schematically presented in Figure 1, are as follows:

❚ Scenario 1. Either a tagged picture or a set of keywords is used as a query. The problem arises when all or part of the image database (such as Web images) is not tagged, making this portion inaccessible through text queries. We study how our annotation-driven image search approach performs in first annotating the untagged pictures, then performing multiple-keyword queries on the partially tagged picture collection.

❚ Scenario 2. An untagged image is used as a query, aiming to find semantically related pictures or documents from a tagged database or the Web. We look at how our approach performs in first tagging the query picture, then in retrieval.

❚ Scenario 3. The query image as well as all or part of the image database is untagged. This is the case that best motivates CBIR, since the only available information is visual content. We study the effectiveness of our approach in tagging the query image and the database, then in retrieval.

❚ Scenario 4. A tagged query image is used to search a tagged image database. The problem is that these tags might be noisy and unreliable, as is common in user-driven picture-tagging portals. We study how our approach helps improve tagging by reannotation, and subsequently performs retrieval.

Figure 1. Four common scenarios for real-world image retrieval.

In each case, we look at reasonable and practical alternative strategies for search, with the help of a state-of-the-art CBIR system. For scenario 4, we are also interested in analyzing the extent to which our approach helps improve annotation under varying noise levels.

Additional goals include the ability to generate precise annotations of pictures in near-real time. While most previous annotation systems assess performance based on annotation quality alone, this measure is only part of our goal. For us, the main challenge is to have the annotations help generate meaningful retrieval. To this end, we developed our approach by first building a near-real-time categorization algorithm (about 11 seconds per image) capable of producing accurate results and then generating categorization-based annotation, ensuring high precision and recall. With this annotation system in place, we assess its performance as a means of image search under the preceding scenarios.

Model-based categorization
We employ generative statistical models for accurate, near-real-time categorization of generic images. This implies training independent statistical models for each image category using a small set of training images. We can then assign category labels to new pictures via smart use of the likelihoods over all models. In our system, we use two generative models (per image category) to provide evidence for categorization from two different aspects of the images. We generate the final categorization by combining these evidences.

In the case of many generic image categories, it is challenging to build a robust classifier. Feature extraction becomes extremely critical, since it must have the discriminative power to differentiate between a broad range of image categories. We base our models on the following intuitions: for certain categories such as sky, marketplace, ocean, forests, and Hawaii, or those with dominant background colors such as paintings, color and texture features might be sufficient to characterize them. In fact, a structure or composition for these categories may be too hard to generalize. On the other hand, categories such as fruits, waterfalls, mountains, lions, and birds might not have dominating color or texture but often have an overall structure or composition that helps us identify them despite heavily varying color distributions.

Motivated by these intuitions, we built two models to capture different visual aspects: a structure-composition model that uses Beta distributions to capture color interactions in a very flexible manner, and a Gaussian mixture model in the joint color-texture feature space. In essence, we want to examine each picture from two separate viewpoints and place it in a category after consulting both. While our approach models color and texture explicitly, it models the segments and edges implicitly through the structure-composition (S-C) model.

Structure-composition models
We wanted a way to represent how the colors interact with each other in certain picture categories. The description of an average beach picture could comprise a set of relationships between different colored regions—for example, orange (sun) completely inside light blue (sky), and light blue sharing a long border with dark blue (ocean). For tiger pictures, we have yellow and black regions sharing similar borders with each other (stripes) and remaining colors interacting without much pattern. The description for coarse texture patterns, such as pictures of beads of different colors, could comprise any color (bead) surrounding any other color (bead), some color (background) completely containing most colors (beads), and so on. This idea led to a principled formulation of our rotation- and scale-invariant S-C models.

Given the set of all training images across categories, we take every pixel from each image and map it to the LUV color space, which is a perceptually uniform space consisting of the luminance component L and the chrominance components U and V, and perform K-means clustering on a manageable random subsample of the pixels. This process yields a set of 20 cluster centroids, such as shades of red or yellow, giving us a color palette representing the entire training set.

We then perform a nearest-neighbor-based segmentation on each training image I by assigning every pixel the cluster label closest to it in the color space, obtaining a new image J, which is essentially a color-quantized representation. This helps build a uniform model representation for all image categories. To get disjoint segments from the image, we perform an 8-connected component labeling (which groups pixels into islands, treating diagonally located pixels as adjacent, or neighboring) on the color-segmented image J.
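The article doesn't give implementation details for this segmentation step. As a rough illustration, a sketch along the following lines (assuming numpy, scikit-learn, and SciPy, none of which are named in the article) could learn the 20-color palette and produce the quantized, 8-connected segmentation:

    import numpy as np
    from scipy import ndimage
    from sklearn.cluster import KMeans

    def learn_palette(training_pixels_luv, n_colors=20, sample_size=50000, seed=0):
        """Cluster a random subsample of LUV pixels into a shared color palette."""
        rng = np.random.default_rng(seed)
        n = len(training_pixels_luv)
        idx = rng.choice(n, size=min(sample_size, n), replace=False)
        kmeans = KMeans(n_clusters=n_colors, n_init=10, random_state=seed)
        kmeans.fit(training_pixels_luv[idx])
        return kmeans.cluster_centers_               # the palette colors

    def quantize_and_segment(image_luv, palette):
        """Assign each pixel its nearest palette color (image J), then label
        8-connected components of constant color as disjoint segments."""
        h, w, _ = image_luv.shape
        pixels = image_luv.reshape(-1, 3)
        dists = ((pixels[:, None, :] - palette[None, :, :]) ** 2).sum(axis=2)
        color_labels = dists.argmin(axis=1).reshape(h, w)
        segments = np.zeros((h, w), dtype=int)
        eight = np.ones((3, 3), dtype=int)           # diagonal pixels count as neighbors
        next_id = 0
        for c in range(len(palette)):
            comp, n_comp = ndimage.label(color_labels == c, structure=eight)
            segments[comp > 0] = comp[comp > 0] + next_id
            next_id += n_comp
        return color_labels, segments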

For each segment si, consider its set of neighboring segments. Here, a neighborhood implies that for two segments si and sj, at least one pixel in si is 8-connected to a pixel in sj. We wish to characterize the interaction of colors by modeling how each color shares (if at all) boundaries with every other color. Let's denote by Δ the length of the shared border between a segment and one of its neighboring segments, and by Θ the total length of the segment's perimeter.

We want to model the Δ/Θ ratios for each color pair by the flexible Beta distribution, which is appropriate for modeling ratios in the [0, 1] range. The distribution is characterized by shape parameters α and β. We build models consisting of a set of Beta distributions, one for every color pair.

For each category, and for every color pair, we find instances in the training set (for that category) in which segments of that color pair share a common border. Let the number of such instances be η. We then compute the corresponding set of Δ/Θ ratios and estimate a Beta distribution—that is, parameters α and β—using these values for that color pair. Figure 2 shows the overall process of estimating S-C models, along with their representation.
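As an illustration of the estimation step, a hypothetical moment-matching routine for one category might look as follows; it assumes the Δ/Θ ratios have already been collected per ordered color pair, and the function names are ours, not the authors':

    import numpy as np

    def beta_moment_match(ratios):
        """Method-of-moments estimates (alpha, beta) for ratios in (0, 1)."""
        r = np.asarray(ratios, dtype=float)
        if len(r) < 2:
            return None                              # too few samples to estimate
        m, v = r.mean(), r.var()
        if v <= 0:
            return None
        common = m * (1.0 - m) / v - 1.0             # equals alpha + beta
        if common <= 0:
            return None
        return m * common, (1.0 - m) * common        # (alpha, beta)

    def fit_sc_model(ratio_lists):
        """ratio_lists maps an ordered color pair (i, j) to the Delta/Theta ratios
        observed for it in a category's training images.  Returns per-pair
        (alpha, beta, eta) entries plus category-wide prior parameters
        (alpha'_k, beta'_k) estimated from all ratios pooled together."""
        model, pooled = {}, []
        for pair, ratios in ratio_lists.items():
            pooled.extend(ratios)
            params = beta_moment_match(ratios)
            if params is not None:
                model[pair] = (params[0], params[1], len(ratios))
        prior = beta_moment_match(pooled)
        return model, prior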

Figure 2. Steps toward generating the structure-composition (S-C) model: (a) three training pictures from the bus category, their segmented forms, and a matrix representation of their segment adjacency counts; (b) the corresponding matrix representations over all three training pictures. These matrices combine to produce the S-C model, shown schematically as a matrix of Beta parameters and counts.

In the model representation, diagonal entries are not defined, and the matrix entries are asymmetric, since we treat colors as ordered pairs. The number of samples η used to estimate the α and β for each entry is stored alongside as well. We build and store such models for every category. In Figure 3, we show simple representations of the learned models for three such picture categories. (See Figure 2 to better understand these representations.) From these representations, computing the actual model parameters is straightforward.

To make the modeling process fast, we use a simple moment-matching method (statistical parameter estimation by equating sample moments to corresponding population moments) for estimating the Beta distributions. Depending on the number of samples η we have for a color pair, we face a parameter estimation issue. For low values of η, the estimates are undefined or poor. Yet, we might often have training pictures with few or no borders of a given color pair. But this does not necessarily mean that such borders won't occur in test pictures.

We play it safe here by treating those color pairs as "unknown." For this reason, we first estimate parameters α′k and β′k for the distribution of all Δ/Θ ratios across all color pairs in a given category k of training images, and then store them as part of that model. Now, for a test picture, we segment it the same way we did in training and find all the segment boundaries. For each such boundary x, we compute the Δ/Θ ratio coming from color pair (i, j). We then compute the probability of x given a model k as

PSC(x | k) = [η/(η + 1)] f(x | α, β) + [1/(η + 1)] f(x | α′k, β′k)

which has parameters defined in the usual way for color pair (i, j) in model k. Here, PSC is the conditional density for the S-C model, and f(· | α, β) is the Beta density function. What we have here essentially is a regularization measure often used in statistics when the amount of confidence in some estimate is low: a weighted probability is computed instead of the original one, with the weights varying with the number of samples used for estimation. Naturally, when η gets smaller, more importance is given to the prior estimates (α′k, β′k) over all color pairs, because we don't have high confidence in the estimates for that specific color pair.

Figure 3. Sample categories and corresponding S-C model representations: (a) sample training pictures; (b) matrices of segment adjacency counts; (c) matrices of mean Δ/Θ ratios. Brightness levels represent the relative magnitude of values.

Finally, we use a standard conditional independence assumption for computing the overall likelihood of a test image I given an S-C model k. This is simply the product of the individual PSC values corresponding to each border x in the image. We take the logarithm of this value and denote this log-likelihood as lSC(I|Mk). Here, the conditional independence assumption might not hold true in reality, but it helps reduce computation significantly, which is essential for producing rapid categorization.
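Continuing the earlier sketches (with the same caveat that the names and data layout are our illustrative choices, not the authors' code), the weighted density and the resulting log-likelihood could be computed as:

    import math
    from scipy.stats import beta as beta_dist

    def border_likelihood(ratio, pair, model, prior):
        """Weighted Beta density PSC(x|k) for one border's Delta/Theta ratio;
        confidence in the pair-specific estimate grows with its sample count eta."""
        alpha_prior, beta_prior = prior              # category-wide (alpha'_k, beta'_k)
        if pair in model:
            alpha, beta_, eta = model[pair]          # as stored by fit_sc_model above
            w = eta / (eta + 1.0)
            return (w * beta_dist.pdf(ratio, alpha, beta_)
                    + (1.0 - w) * beta_dist.pdf(ratio, alpha_prior, beta_prior))
        # color pair never seen for this category: rely entirely on the prior
        return beta_dist.pdf(ratio, alpha_prior, beta_prior)

    def sc_log_likelihood(borders, model, prior, eps=1e-300):
        """Sum of log border likelihoods (the log of their product) under the
        conditional independence assumption; borders is a list of (pair, ratio)."""
        return sum(math.log(border_likelihood(r, p, model, prior) + eps)
                   for p, r in borders)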

Color-texture models
Many image categories, especially those not containing specific objects, can probably be described best by their color and texture distributions. In fact, a well-defined structure might not even exist per se for high-level categories such as China and Europe, but the overall ambience that certain colors and textures form often characterizes them.

We use a mixture of multivariate normal distributions to model the joint color-texture feature space for a given category. The motivation is simple: in many cases, two or more representative regions in the color-texture feature space can represent the image category best. For example, beach pictures typically have one or more yellow areas (sand), a blue nontextured area (sky), and a blue textured region (sea). Mixture models for the normal distribution are well studied, with many tractable properties in statistics. Yet these simple models have not been widely exploited in image categorization.

We extract the same color and texture features used in the Automated Linguistic Indexing of Pictures (ALIP) system,3 taking nonoverlapping 4 × 4 blocks of the image. For each category, we obtain mixture-model parameter estimates from the features extracted from the training pictures. We use Bouman's cluster package for this estimation.9 This package implements the well-known expectation-maximization (EM) algorithm for mixture models. Using it, we compute and store color-texture models for each picture category. Thus, for a 4 × 4 block x and learned model Mk, we denote the probability of x given that model as Pct(x|Mk).

Ignoring spatial dependence among blocks, we finally compute the log-likelihood of a test image I given a category k by taking the log of the product of Pct(x|Mk) values over each 4 × 4 block of the image I. We denote this log-likelihood by lct(I|Mk). We argue that the conditional independence assumption here is reasonable, for three reasons: we intend rapid categorization, which is aided by this simplifying assumption; our S-C model already captures spatial relationships at a more meaningful granularity than fixed-size blocks; and dependencies among blocks have been explicitly modeled by Li and Wang's method,3 which our approach empirically outperforms.
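A sketch of the color-texture side, using scikit-learn's GaussianMixture as a stand-in for Bouman's cluster package; the per-block color-texture feature extraction is assumed to happen elsewhere and is not shown:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_ct_model(training_block_features, n_components=10, seed=0):
        """Fit a Gaussian mixture to the joint color-texture feature vectors of
        all 4 x 4 blocks from one category's training pictures (one row per block)."""
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type='full', random_state=seed)
        gmm.fit(training_block_features)
        return gmm

    def ct_log_likelihood(gmm, test_block_features):
        """lct(I|Mk): sum of per-block log densities, i.e., the log of the product
        of Pct(x|Mk) values, ignoring spatial dependence among blocks."""
        return float(np.sum(gmm.score_samples(test_block_features)))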

Annotation and retrieval
We use the categorization models for annotation, which in turn helps us with image search.

Automatic tagging basics
Three important considerations we make in automatic tagging are

❚ how strongly the categorization results favor a tag,

❚ how frequently we see that tag in the training set—that is, the likelihood of its chance appearance, and

❚ whether the tag is meaningful in the context of the picture's other tags.

Suppose we have a 600-category training-image data set (the setting for all our experiments), each category annotated by three to five tags—for example, [sail, boat, ocean] and [sea, fish, ocean]—with many tags shared among categories. Initially, all the tags from each category are pooled together. Tag saliency is measured in a way similar to computing inverse document frequency in the document retrieval domain. The total number of categories in the database is C. We count the number of categories that contain each unique tag t, and denote it by F(t). For a given test image I, the S-C models and the C-T models independently generate ranked lists of predicted categories. We generate these lists by placing all the categories in descending order of log-likelihood, for each model type. We choose the top 10 categories that each model predicts and pool them together for annotation. We denote the union of all unique words from both models by U(I), which forms the set of candidate tags. Let the frequency of occurrence of each unique tag t among the top 10 model predictions be fsc(t|I) and fct(t|I), respectively.

WordNet is a semantic lexicon that groups English words into sets of synonyms and records the semantic relations among the synonym sets.5 Based on this ontology, researchers have proposed numerous measures of word relatedness. A measure that we observed to produce reasonable relatedness scores is the Leacock-Chodorow (LCH) measure,10 which we use in our experiments. We convert this relatedness measure to a distance measure by taking the exponent and normalizing it to get a [0, 24] range of values. Inspired by Jin et al.,11 we compute a congruity score for a candidate tag t and image I, denoted by G(t|I). (See Datta et al. for further technical details.12)
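For illustration only, a relatedness-to-distance conversion of this general kind could be written with NLTK's WordNet interface; NLTK is our choice here, and the normalization below is a placeholder rather than the authors' exact mapping:

    import math
    from nltk.corpus import wordnet as wn

    def lch_relatedness(word1, word2):
        """Maximum Leacock-Chodorow similarity over noun synset pairs (0 if none)."""
        best = 0.0
        for s1 in wn.synsets(word1, pos=wn.NOUN):
            for s2 in wn.synsets(word2, pos=wn.NOUN):
                sim = s1.lch_similarity(s2)
                if sim is not None and sim > best:
                    best = sim
        return best

    def word_distance(word1, word2, scale=1.0):
        """Convert relatedness to a distance: less-related words map to larger
        values; 'scale' stands in for whatever normalization is desired."""
        return scale * math.exp(-lch_relatedness(word1, word2))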

In essence, a tag that is semantically distinct from the rest of the predicted words will likely have a low congruity score, while a closely related one will have a high score. The measure can potentially remove noisy and unrelated tags from consideration. Having computed the three measures, for each of which higher scores indicate better fitness for inclusion, the overall score for a candidate tag t is given by a linear combination as follows:

R(t|I) = a1 f(t|I) + a2 [log(C/(1 + F(t))) / log C] + a3 G(t|I)

Here, a1 + a2 + a3 = 1, and f(t|I) = b fsc(t|I) + (1 − b) fct(t|I) is the main model-combination step for annotation, linearly combining the evidence each model generated for the tag. Experiments show us that combining the models helps significantly over either model independently. The value of b is a measure of relative confidence in the S-C model. A tag t is chosen for annotation only when its score falls within a top percentile of the candidate tags' scores; this percentile threshold controls the number of annotations generated per image.

After performing experiments on a validation set of 1,000 pictures, we arrive at a satisfactory set of values for these constants, namely a1 = 0.4, a2 = 0.2, b = 0.3, and a percentile threshold of 60 percent. This choice of weights depends on the image database used and can be determined by an appropriate grid search. Also, after the system chooses candidate tags, the final set of tags is selected fairly independently of each other. Our hope is that candidate tag selection reflects the tags' cooccurrence in the training set and that the congruity measure implicitly introduces dependence, but a joint modeling of tags might produce better results.
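A sketch of the scoring and selection steps, assuming the congruity scores G(t|I) are computed separately (the article only summarizes that measure) and using the constants reported above; the helper names are ours:

    import math

    def tag_score(tag, f_sc, f_ct, F, C, G, a1=0.4, a2=0.2, b=0.3):
        """Overall score R(t|I) for one candidate tag.
        f_sc, f_ct: tag frequencies among the top-10 S-C and C-T predictions;
        F: number of training categories containing each tag; C: total number
        of categories; G: congruity scores G(t|I)."""
        a3 = 1.0 - a1 - a2                           # the weights sum to 1
        f = b * f_sc.get(tag, 0) + (1.0 - b) * f_ct.get(tag, 0)
        rarity = math.log(C / (1.0 + F.get(tag, 0))) / math.log(C)
        return a1 * f + a2 * rarity + a3 * G.get(tag, 0.0)

    def select_tags(candidates, scores, top_percentile=60):
        """Keep the tags whose scores fall in the top percentile of candidates."""
        ranked = sorted(candidates, key=lambda t: scores[t], reverse=True)
        keep = max(1, int(round(len(ranked) * top_percentile / 100.0)))
        return ranked[:keep]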

Performing annotation-driven image search
We are now equipped to search pictures, using automatic annotation and a bag-of-words distance. Whenever tags are missing in the query image and/or the database, the system performs automatic annotation. Next, a bag-of-words distance measure between the query picture's tags and the database tags helps rank the pictures; the same measure also supports multiple-keyword queries. When tags are present but are known to be noisy, the system combines these tags and the learned models to improve the tagging prior to performing search.

Our aim is to make the entire picture collection searchable by keywords and to allow all types of searches under a common framework. The bag-of-words distance measure we use is the average aggregated minimum distance.13 Put plainly, the approach attempts to match each word in bag 1 to the semantically closest word in bag 2 (again, using the WordNet-based LCH distance), then match each word in bag 2 to the closest in bag 1, and finally compute the weighted average of the matched distances based on the bag sizes to ensure that the measure is symmetric.
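The matching scheme can be sketched as follows, with word_distance standing for any pairwise semantic distance (such as the WordNet-based one sketched earlier); this illustrates the idea rather than reproducing the exact formulation in Li's paper:13

    def bag_distance(bag1, bag2, word_distance):
        """Average aggregated minimum distance between two bags of words: match
        each word to the closest word in the other bag, in both directions, then
        take a bag-size-weighted average so the measure is symmetric."""
        if not bag1 or not bag2:
            return float('inf')
        d12 = sum(min(word_distance(w1, w2) for w2 in bag2) for w1 in bag1)
        d21 = sum(min(word_distance(w1, w2) for w1 in bag1) for w2 in bag2)
        return (d12 + d21) / (len(bag1) + len(bag2))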


Experimental validation
We investigate our system's performance on four grounds: how accurately it identifies picture categories, how well it tags pictures, how well it performs reannotation of noisy tags, and how much it improves image search for the four scenarios we described earlier. However, the improvement in image search quality is the main focus of this work. The data sets we look at consist of

❚ 54,000 Corel stock photos encompassing 600 picture categories and

❚ a 1,000-picture collection from Yahoo! Flickr.

Of the Corel collection, we use 24,000 pictures to train the two statistical models, and use the rest for assessing performance.

Identifying picture categories
To fuse the two models for categorization, we use a simple combination strategy14 that results in impressive performance. Given a picture, we rank each category k based on the likelihoods from both models, obtaining ranks rSC(k) and rCT(k). We then linearly combine these two ranks for each category as r(k) = w rSC(k) + (1 − w) rCT(k), with w = 0.2 working best in practice, and assign to the picture the category with the best combined rank.
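A minimal sketch of this rank-fusion step, assuming each model exposes a per-category log-likelihood as in the earlier sketches:

    def combined_categories(sc_loglik, ct_loglik, w=0.2):
        """Rank categories under each model, combine the ranks linearly (weight w
        on the S-C rank), and return the categories ordered best-first."""
        cats = list(sc_loglik)
        sc_order = sorted(cats, key=lambda k: sc_loglik[k], reverse=True)
        ct_order = sorted(cats, key=lambda k: ct_loglik[k], reverse=True)
        sc_rank = {k: r for r, k in enumerate(sc_order, start=1)}
        ct_rank = {k: r for r, k in enumerate(ct_order, start=1)}
        combined = {k: w * sc_rank[k] + (1.0 - w) * ct_rank[k] for k in cats}
        return sorted(cats, key=lambda k: combined[k])   # lowest combined rank first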

We evaluate how well our system predicts categories using two picture data sets. The first one is a standard 10-class image data set that has been commonly used for the same research question. Using 40 training pictures per category, we assess the categorization results on another 50 per category. We compute accuracies while varying the number of mixture components in the C-T model.

We present our results—along with those previously reported15-17 on the same data—in Figure 4. We see that our combined model does a better job at identifying categories than previous attempts. Not surprisingly, as we increase the number of mixture components, the C-T models become more refined. We thus continue to get improved categorization with more components, although more components also mean more computation.

Our second data set consists of the same 600-category Corel images that were used in the ALIP system.3 With an identical training process for the two models (the number of mixture components is chosen as 10), we observe the categorization performance on a separate set of 27,000 pictures. We find that the actual picture categories coincide with our system's top choice 14.4 percent of the time, are within our system's top two choices 19.3 percent of the time, and are within our system's top three choices 22.7 percent of the time. The corresponding accuracy values for the ALIP system are 11.9, 17.1, and 20.8 percent, respectively.

Our system takes about 26 seconds to build a structure-composition category model and about 106 seconds to build a color-texture model, both on a 40-picture training set. Because the models are generative, we can build them for each category and model type independently and in parallel. To predict the top five ranked categories for a given test picture, our system takes about 11 seconds. As a result, our system is orders of magnitude faster than the ALIP system, which takes about 30 minutes to build a model and about 20 minutes to test on a picture. Most other automatic tagging systems in the literature do not explicitly report speed.

However, many of them depend on sophisticated image segmentation algorithms, which can easily bottleneck performance. The improved performance in model building means that even larger numbers of models can be built (for example, one model per unique tag), and the modeling process can be made dynamic (retraining at intervals) to accommodate changing picture collections, such as Web sites that let users upload pictures.

Figure 4. Categorization accuracies for the 10-class experiment: performance of our combined S-C + C-T model with varying numbers of mixture components in the C-T model, compared with previously reported best results (Li and Wang,3 Marée,15 Chen and Wang,16 and Li et al.17).

Tagging the pictures
We now look at how our system performs when it comes to automatic picture tagging. Tagging is fast, since it depends primarily on the categorization speed. Over a random test set of 10,000 Corel pictures, our system generates an average of seven tags per picture. We use standard metrics for evaluating annotation performance: precision, the fraction of predicted tags that are correct, and recall, the fraction of the picture's true tags that are correctly guessed. Average precision over this test set is 22.4 percent, while average recall is 40.7 percent.
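For concreteness, the two metrics for a single picture can be computed as follows; averaging over the test set is then straightforward:

    def tag_precision_recall(predicted, ground_truth):
        """Precision: fraction of predicted tags that are correct.
        Recall: fraction of the picture's true tags that are predicted."""
        predicted, ground_truth = set(predicted), set(ground_truth)
        correct = len(predicted & ground_truth)
        precision = correct / len(predicted) if predicted else 0.0
        recall = correct / len(ground_truth) if ground_truth else 0.0
        return precision, recall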

Thus, on average, roughly one in four of our system's predicted tags is correct, while our system guesses two in five of the correct tags. In general, results of this nature are useful for filtering and classification. A potential domain of thousands of tags can be reduced to a handful, making human tagging much easier, as in the Automatic Linguistic Indexing of Pictures—Real Time system (ALIPR; see http://alipr.com).18 Increased homogeneity and reduced ambiguity in the tagging process are additional benefits.

We make a more qualitative assessment of tagging performance on the 1,000 Flickr pictures. We point out that the training models are still those built with Corel pictures, but because they represent the spectrum of photographic images well, these models serve as fair knowledge bases.

In this case, most automatically generated tags are meaningful, and the results are generally very encouraging. In Figure 5, we present a sampling of these results. Getting quantitative performance is harder here because Flickr tags are often proper nouns (such as names of buildings and people) that aren't contained in our training base.

Figure 5. Sample automatic tagging results on Yahoo! Flickr pictures taken in Amsterdam, along with manual tags.

Reannotating noisy tags
We assess our annotation system's performance in improving tagging at high noise levels. The scenario of noisy tags, at a level denoted by e, is simulated in the following manner: for each of 10,000 test pictures with the original (reliable) tags, a new set of tags is generated by replacing each tag with a random tag with probability e and leaving it unchanged otherwise. The resulting noisy tags for these test images, when assessed for performance, give precision and recall values that directly correlate with e. In the absence of learned models, this is our baseline case.
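The noise simulation itself is simple to sketch; the vocabulary argument stands in for the pool of training tags from which random replacements are drawn:

    import random

    def add_tag_noise(tags, vocabulary, e, rng=random):
        """Replace each tag with a random vocabulary word with probability e;
        leave it unchanged otherwise."""
        return [rng.choice(vocabulary) if rng.random() < e else t for t in tags]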

When such models are available, we can use the noisy tags and the categorization models to reannotate the pictures, because the noisy tags still contain exploitable information. We perform reannotation by simply treating each noisy tag t of a picture I as an additional instance of the word in the pool of candidate tags.

In effect, we increment the values of fsc(t|I) and fct(t|I) by a constant Z, thus increasing the chance of t appearing as a tag. The value of Z controls how much we want to promote these tags and is naturally related to the noise in the tags. Figure 6 shows the annotation precision and recall this approach achieves, with e varying from 0.5 (moderately noisy) to 1 (completely noisy, no useful information) for the case of Z = 0.5. We notice that, at high levels of noise, our reannotation produces better performance than the baseline.
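A sketch of this boosting step, reusing the candidate-frequency dictionaries from the tag-scoring sketch given earlier:

    def boost_with_noisy_tags(f_sc, f_ct, noisy_tags, Z=0.5):
        """Treat each (possibly noisy) existing tag as extra evidence by adding a
        constant Z to its frequency under both models before rescoring with R(t|I)."""
        for t in noisy_tags:
            f_sc[t] = f_sc.get(t, 0) + Z
            f_ct[t] = f_ct.get(t, 0) + Z
        return f_sc, f_ct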

Figure 7 shows a summary of a more general analysis of these trends, with larger values of Z—that is, greater confidence in the noisy tags. The graph shows precision and recall for our reannotation, varying Z for each of five error levels. We observe that for the same value of Z, less noisy tags lead to better reannotation. Moreover, after reaching a peak near Z = 2.5, the recall starts to drop, while precision continues to improve. This graph can be useful in selecting parameters for a desired precision or recall level after reannotation, given an estimated level of noise in the tags.

Figure 6. (a) Precision and (b) recall achieved by reannotation, with varying noise levels in the original tags. Note that the linear correlation of the baseline case with e is intrinsic to the noise simulation.

Figure 7. (a) Precision and (b) recall achieved by reannotation, varying parameter Z, shown for five noise levels.

Searching for pictures
Compared to traditional methods, our approach improves image search performance. We assume that either the database is partially tagged, or the search is performed on a picture collection visually coherent with some standard knowledge base. In all our cases, the statistical models are learned from the Corel data set. For scenario 4, we assume that everything is tagged, but some tags are incorrect or inconsistent. Once again, we train a knowledge base of 600 picture categories, and then use it to do categorization and automatic tagging on the test set. This set consists of 10,000 randomly sampled pictures from among the remaining Corel pictures (those not used for training).

We now consider the four image search scenarios we discussed in the "Bridging the gap" section. For each scenario, we compare the results of our annotation-driven image search strategy with alternative strategies. For those alternative strategies involving CBIR, we use the IRM distance from the SIMPLIcity system8 to get around the missing-tag problem in the databases and queries.

We choose the alternative strategies and their parameters by considering a wide range of possible methods. We assess the methods based on the standard information retrieval concepts of precision (the percentage of retrieved pictures that are relevant) and recall (the percentage of relevant pictures that are retrieved). We consider two pictures/queries to be relevant to each other whenever there is overlap between their sets of tags. In this article, we report performance in terms of precision, which is usually considered the more useful metric in information retrieval. Recall performance can be found in our earlier work.12
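Under this protocol, the precision of one query's result list can be sketched as:

    def retrieval_precision(query_tags, retrieved_tag_lists):
        """Fraction of retrieved pictures that share at least one tag with the query."""
        query = set(query_tags)
        if not retrieved_tag_lists:
            return 0.0
        relevant = sum(1 for tags in retrieved_tag_lists if query & set(tags))
        return relevant / len(retrieved_tag_lists)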

Scenario 1. Here, the database doesn't have any tags. Queries may be in the form of either one or more keywords or tagged pictures.

Keyword queries on an untagged picture database are a key problem in real-world image search. We look at 40 randomly chosen pairs of query words, with each word chosen from the 417 unique words in our training set. In our strategy, we perform a search by first automatically tagging the database, and then retrieving images based on bag-of-words distances between the query tags and our predicted tags.

The alternative CBIR-based strategy used for comparison is as follows: without any image as a query, CBIR can't be performed directly on query keywords. Instead, suppose the system is provided access to a knowledge base of tagged Corel pictures. A random set of three pictures for each query word is chosen from the knowledge base, and we compute IRM distances between these images and the database. We then use the average IRM distance over the six pictures to rank the database pictures. We report these two results, along with the random results, in Figure 8a. Clearly, our method performs significantly better than the alternative approach.

Scenario 2. In this case, the query is an untagged picture, and the database is tagged. First, we tag the query picture automatically, and then rank the database pictures using the bag-of-words distance. We randomly choose 100 query pictures from Corel and test on the database of 10,000 pictures. The alternative CBIR-based strategy we use is as follows: the IRM distance is used to retrieve the five pictures (empirically observed to be the best count) most visually similar to the query, and the union of all their tags is filtered using the expression for R(t|I) to get automatic tags for the query (the same way we filter our own annotation, as described in the "Annotation and retrieval" section). The search then proceeds identically to ours. We present these results, along with the random scheme, in Figure 8b. As the figure shows, our strategy has a significant performance advantage over the alternative strategy. The CBIR-based strategy performs almost as poorly as the random scheme, probably because of the direct use of CBIR for tagging.

Scenario 3. In this case, neither the query picture nor the database is tagged. We test 100 random picture queries on the 10,000-image database. Our strategy is simply to tag both the query picture and the database automatically, and then perform bag-of-words-based retrieval. Without any tags present, the alternative CBIR-based strategy we use here is essentially a standard use of the IRM distance to rank pictures based on visual similarity to the query. We present these results, along with the random case, in Figure 8c. Once again, we see the advantage of our common image search framework over straightforward visual similarity-based retrieval. What we witness is how, in an indirect way, the learned knowledge base helps improve search performance over a strategy that doesn't involve statistical learning.

Figure 8. Retrieval precision under (a) scenario 1, (b) scenario 2, and (c) scenario 3, compared to baseline/random strategies.

Scenario 4. Here, the query picture and the database are both fully tagged, but many tags are incorrect. This is a situation that often arises under user-driven tagging, for reasons such as subjectivity. We use our reannotation approach to refine these noisy tags prior to performing retrieval. Introducing noise levels of e = 0.7, 0.8, and 0.9 and using parameter Z = 1.5, we test 100 random picture queries on the 10,000 images. For this, the queries and the database are first reannotated.

The alternative strategy is the baseline case, as we described in the "Reannotating noisy tags" section. Figure 9 shows the precision results over the top 100 retrieved pictures for the three noise levels. Interestingly, even at e = 0.7, where our reannotation approach doesn't surpass the baseline in annotation precision, it does so in retrieval precision, making it a useful approach at this noise level. Moreover, the difference from the baseline is largest at noise level 0.8. Note that for e ≤ 0.5, our approach didn't yield better performance than the baseline, since the tags were sufficiently clean. These results suggest that at relatively high noise levels, our reannotation approach can lead to significantly improved image retrieval performance compared to the baseline.

Figure 9. Retrieval precision for scenario 4 at three noise levels.

Conclusion
The framework for our novel annotation-driven image search is uniform across different scenarios and different types of queries, which should make implementation fairly straightforward. In each scenario we discussed, our approach yields more promising results than traditional methods. In fact, the categorization performance in itself improves on previous attempts. Moreover, we are able to categorize and tag the pictures in a very short time. All of these factors make our approach attractive for real-world implementation.

Acknowledgments
The research is supported in part by the US National Science Foundation, grants 0347148 and 0202007. We thank David M. Pennock at Yahoo! for providing some test images. We would also like to acknowledge the comments and constructive suggestions from the reviewers.

References
1. K. Barnard et al., "Matching Words and Pictures," J. Machine Learning Research, vol. 3, 2003, pp. 1107-1135.
2. E. Chang et al., "CBSA: Content-Based Soft Annotation for Multimodal Image Retrieval Using Bayes Point Machines," IEEE Trans. Circuits and Systems for Video Technology, vol. 13, no. 1, 2003, pp. 26-38.
3. J. Li and J.Z. Wang, "Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 9, 2003, pp. 1075-1088.
4. F. Monay and D. Gatica-Perez, "On Image Auto-Annotation with Latent Space Models," Proc. ACM Int'l Conf. Multimedia, ACM Press, 2003, pp. 275-278.
5. G. Miller, "WordNet: A Lexical Database for English," Comm. ACM, vol. 38, no. 11, 1995, pp. 39-41.
6. R. Datta, J. Li, and J.Z. Wang, "Content-Based Image Retrieval: Approaches and Trends of the New Age," Proc. ACM SIGMM Int'l Workshop Multimedia Information Retrieval, ACM Press, 2005, pp. 253-262.
7. A.W. Smeulders et al., "Content-Based Image Retrieval at the End of the Early Years," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 12, 2000, pp. 1349-1380.
8. J.Z. Wang, J. Li, and G. Wiederhold, "SIMPLIcity: Semantics-Sensitive Integrated Matching for Picture Libraries," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 9, 2001, pp. 947-963.
9. C.A. Bouman, "Cluster: An Unsupervised Algorithm for Modeling Gaussian Mixtures"; http://dynamo.ecn.purdue.edu/~bouman/software/cluster.
10. C. Leacock and M. Chodorow, "Combining Local Context and WordNet Similarity for Word Sense Identification," WordNet: An Electronic Lexical Database, C. Fellbaum, ed., MIT Press, 1998, pp. 265-283.
11. Y. Jin et al., "Image Annotations by Combining Multiple Evidence and WordNet," Proc. ACM Int'l Conf. Multimedia, ACM Press, 2005, pp. 706-715.
12. R. Datta et al., "Toward Bridging the Annotation-Retrieval Gap in Image Search by a Generative Modeling Approach," Proc. ACM Int'l Conf. Multimedia, ACM Press, 2006, pp. 977-986.
13. J. Li, "A Mutual Semantic Endorsement Approach to Image Retrieval and Context Provision," Proc. ACM SIGMM Int'l Workshop Multimedia Information Retrieval, ACM Press, 2005, pp. 173-182.
14. T.K. Ho, J.J. Hull, and S.N. Srihari, "Decision Combination in Multiple Classifier Systems," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 16, no. 1, 1994, pp. 66-75.
15. R. Marée, "Random Subwindows for Robust Image Classification," Proc. IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR), vol. 1, IEEE CS Press, 2005, pp. 34-40.
16. Y. Chen and J.Z. Wang, "Image Categorization by Learning and Reasoning with Regions," J. Machine Learning Research, vol. 5, 2004, pp. 913-939.
17. Y. Li, Y. Du, and X. Lin, "Kernel-Based Multifactor Analysis for Image Synthesis and Recognition," Proc. Int'l Conf. Computer Vision (ICCV), IEEE CS Press, pp. 114-119.
18. J. Li and J.Z. Wang, "Real-Time Computerized Annotation of Pictures," Proc. ACM Int'l Conf. Multimedia, ACM Press, 2006, pp. 911-920.

Ritendra Datta is a PhD candidate in computer science and engineering at The Pennsylvania State University. His research interests include statistical learning, computer vision, computational aesthetics, multimedia, and information retrieval. He has a BE in information technology from the Bengal Engineering and Science University, Shibpur, India. He is a recipient of the Glenn Singley Memorial Graduate Fellowship in Engineering.

Weina Ge is a PhD candidate and research assistant in the Computer Science and Engineering Department at The Pennsylvania State University. Her research interests include computer vision, image retrieval, and data mining. She has a BS in computer science from Zhejiang University, China.

Jia Li is an associate professor of statistics at The Pennsylvania State University. Her research interests include statistical learning, data mining, and image processing, retrieval, and annotation. She has a PhD in electrical engineering from Stanford University.

James Z. Wang is an associate professor at The Pennsylvania State University. His research interests include semantics-sensitive image retrieval, linguistic indexing, biomedical informatics, story picturing, and computational aesthetics. He has a master's degree in both mathematics and computer science, as well as a PhD in medical information sciences, all from Stanford University.

Readers may contact Ritendra Datta at datta@cse.psu.edu.
