

Pattern Recognition 61 (2017) 169–184


A multi-expert based framework for automatic image annotation

Abbas Bahrololoum, Hossein Nezamabadi-pour (corresponding author)

Department of Electrical Engineering, Shahid Bahonar University of Kerman, P.O. Box 76169-133, Kerman, Iran

Article info

Article history:
Received 28 April 2016
Received in revised form 25 June 2016
Accepted 22 July 2016
Available online 25 July 2016

Keywords:
Automatic image annotation
Feature space
Concept space
Prototype
Semantic gap
Fusion


Abstract

Automatic image annotation (AIA) for a wide-ranging collection of image data is a difficult and challenging topic and has attracted the interest of many researchers in the last decade. To achieve the goal of AIA, a multi-expert based framework is presented in this paper which is based on the combination of results obtained from feature space and concept space. Considering a real-world image dataset, a large storage is required; therefore, the idea of generating prototypes in both feature and concept spaces is used. The prototypes are generated in the learning phase using a clustering technique. The input unlabeled images are assigned to the nearest prototypes in both feature and concept spaces, and primary labels are obtained from the nearest prototypes. Eventually, these labels are fused and the final labels for a target image are chosen. Since not all feature types describe a concept label equally, some prototypes are more effective in representing a concept and bridging the semantic gap, so a metaheuristic algorithm is employed to search for the best subset of feature types and the best criterion of fusion. To evaluate the performance of the proposed framework, an example of its implementation is presented. A comparative experimental study with several state-of-the-art methods is reported on two standard databases of about 20k images. The obtained results confirm the effectiveness of the proposed framework in the field of automatic image annotation.

© 2016 Elsevier Ltd. All rights reserved.

1. Introduction

The considerable development of digital acquisition, computer hardware, storage techniques and Internet technology makes millions of images accessible to people. One widely adopted solution for accessing and retrieving digital images, in addition to video, is to annotate the content with semantically meaningful labels. Two types of annotation approaches are available: manual and automatic [1]. Manual image annotation is a time-consuming, laborious and expensive task; to address this, much research has focused on automatic image annotation.

The goal of automatic image annotation is to assign a collection of keywords (an annotation) from a given dictionary to a target image (previously unseen). That is, the input is the target (untagged) image and the output is a collection of keywords that describe the target image in the best possible way [2]. In other words, the automatic system semantically describes the content of an image. To do this, a set of semantic labels is assigned to each image to describe its content [3]. Then, a system is developed to provide a model for the relation between visual features and tags of images.

Automatic image annotation has been studied extensively for several years. Image annotation remains an extremely challenging task. The same object can be captured from different angles, distances or under different luminance conditions. Image content is subjective, and sometimes it is difficult to automatically describe it by keywords [1]. Additionally, an object of the real world with the same "name" may have different visual content (e.g., shape, color). The semantic gap between low-level features and high-level concepts (i.e. the interpretation of the images in the way that humans do) is a fundamental problem in a content-based image retrieval (CBIR) system.

To bridge the semantic gap, some systems use the relevance feedback technique [4–8] to incorporate user knowledge into the retrieval process. Some approaches attempt to reduce annotation errors by making use of word relations [9]. Other approaches make use of external resources such as auxiliary texts of web images, WordNet and ontologies, Google distance, click-through data, and Wikipedia articles [10]. Topic-based approaches model joint distributions of visual features and words [11]. On the other hand, multiple instance learning (MIL) approaches [12] focus on solving the problem of weak labeling in image annotation, that is, the absence of correspondence between labels and regions in images. Multiple feature spaces [13] have also been used to improve the performance of CBIR systems. Recently, studies [14] on jointly modeling scene classification and image annotation have been reported.

However, in the image annotation problem, images are often described by multiple feature spaces (multiview features). Different views, such as color, texture and shape features, describe different attributes of an image [15–17]. Each view describes a property of the image, and the weaknesses of one view can be compensated by the strengths of the others.

Multiview learning algorithms can be grouped into three categories: a) co-training, b) multiple kernel learning and c) subspace learning. Co-training style algorithms usually train separate learners on distinct views, which are then forced to be consistent across views. It is assumed that the features obtained from the different views are sufficient and conditionally independent of one another for training a classifier. Multiple kernel learning algorithms calculate separate kernels on each view, which are then combined with a kernel-based method. Subspace learning-based approaches aim to obtain an appropriate subspace that explores the complementary properties of different views by assuming that the input views are generated from a latent view [16].

1.1. Related works

Automatic image annotation methods are usually classified into two categories, namely probabilistic modelling-based methods [18,19] and classification-based methods [20–22]. One strategy for statistical annotation is unsupervised labeling, which estimates the joint density of visual features and words by applying an unsupervised learning algorithm to a training image dataset. These methods introduce a hidden variable and assume that features and words are independent given the hidden variable value. Another formulation for statistical annotation is supervised multi-class labeling [20], which estimates a conditional distribution for each semantic class to determine the labeling probability. The problem of multi-label classification generalizes the traditional multi-class classification problem; the former allows a set of labels to be associated with an instance whereas the latter allows only one. An image to be annotated can receive several labels simultaneously, which makes this a multi-label problem [23].

The authors in [24] present a multi-label classification framework for automatic image annotation. The proposed framework comprises an initial clustering phase that breaks the original training set into several disjoint clusters of data. It then trains a multi-label classifier from the data of each cluster. Given a new test instance, the framework first finds the nearest cluster and then applies the corresponding model.

The authors in [25] propose a solution to the problem of large-scale concept space learning and the mismatch between semantic and visual spaces (the semantic gap). To tackle the first issue, they present the use of a higher level semantic space with lower dimension by clustering correlated keywords into topics in a local neighborhood. The topics are used as a lexicon for assigning multiple labels to unlabeled images. To deal with the problem of the semantic gap, they propose a way to reduce the bias between visual and semantic spaces by finding optimal margins in both spaces. In particular, the proposed method is an iterative solution that alternately maximizes the sum of margins to reduce the gap between visual and semantic similarities.

In [26], the authors present multiview Hessian Regularization (mHR) for image annotation. The proposed method combines multiview features and Hessian regularizations obtained from different views. It is claimed that the method effectively explores the complementary properties of different features from different views and thus boosts the image annotation performance significantly. In [27], the authors propose the multiview Hessian discriminant sparse coding (mHDSC) scheme for image annotation.

The method employs Hessian regularization (HR) to encode the local geometry and applies it to multiview features. In addition, mHDSC treats the label information as an additional feature view to boost the discrimination of the dictionary.

The co-occurrence model proposed by Mori et al. [28] is perhaps one of the first attempts at image auto-annotation. They first divide images into rectangular tiles of the same size and calculate a feature descriptor of color and texture for each tile. All the descriptors are clustered into a number of groups, each of which is represented by a centroid. On the other hand, each tile inherits the whole set of labels from the original image. Second, for the set of segments, the probability of each keyword is estimated by using a vector quantization of the segment's features. This method has a relatively low annotation performance [1].

Duygulu et al. propose the machine translation model (TM) [29], which treats image annotation as a translation problem between two languages: one language is the visual vocabulary of image contents; the other is real text. They use the normalized-cut algorithm to segment images, and then use the K-means algorithm to cluster these regions. Image annotation can thus be regarded as a translation process from visual vocabulary blobs to semantic keywords. The mapping between blobs and keywords is learned using the expectation–maximization (EM) algorithm. One of the key problems of the model is the high computational complexity of the EM algorithm, so it is not suitable for large-scale datasets.

Inspired by relevance language models for information retrieval and cross-lingual retrieval, several relevance models have been proposed, such as the continuous relevance model (CRM), which directly uses continuous features of image regions and a non-parametric Gaussian kernel to continuously estimate the generation probability of visual contents [30]; the cross-media relevance model (CMRM), which uses the joint probability of semantic labels and visual words to annotate images [31]; the dual cross-media relevance model (DCMRM), which performs image annotation by maximizing the joint probability of images and words [32] and involves two types of relations, word-to-word and word-to-image, both estimated by using search techniques on web data; and multimodal latent binary embedding (MLBE) [33]. Feng et al. propose the multiple Bernoulli relevance model (MBRM) [34], which utilizes rectangular grids instead of complicated segmentation algorithms to partition images. They apply a Bernoulli distribution instead of a multinomial distribution to describe the distribution of the vocabulary, which takes image context into account, i.e., from training images it learns that a class is more associated with some classes and less associated with others. The authors claimed that this method is more effective for image annotation than the translation model. However, its drawback is that only images consistent with the training images can be annotated, with keywords from a limited vocabulary.

Amiri and Jamzad [3] developed an annotation system within a semi-supervised learning framework that constructs a generative model for each semantic class in two main steps. First, based on the Gamma distribution, a generative model is constructed for each semantic class using the labeled images in that class. The second step incorporates the unlabeled images by using a modified EM algorithm to update the parameters of the constructed models.

Metzler and Manmatha [35] segmented training images and connected them and their annotations in an inference network, whereby an unseen image is annotated by instantiating the network with its regions and propagating belief through the network to nodes representing the words.

A non-parametric density estimation approach and the technique of kernel smoothing have been proposed by Yavlinsky et al. [36]. They claimed that the results are comparable with the inference network and CRM. These automatic annotation approaches have achieved notable success, especially when the keywords have frequent occurrence and strong visual similarity. However, it remains a problem to accurately annotate less visually similar keywords.

The goal of classification-based methods is to build image classifiers that are trained to separate the training images associated with each keyword. After training, a classifier is able to identify the class of a target image, and the keywords of that class in the training dataset are used to annotate the target image. In [5], a K-nearest neighbor (K-NN) classifier is applied, where user feedback is required to improve system accuracy.

In [37], the authors present a two-stage method for multi-class image labeling. They first introduce a simple label-filtering algorithm, which removes most of the irrelevant labels for a query image while the potential labels are maintained. With only a small population of potential labels left, the relationship between the features to be used and each single class is then explored. Hence, they claim that specific and effective features are selected for each class to form a label-specific classifier.

In [38], the authors propose a method based on combining the learning vector quantization (LVQ) technique and an SVM classifier to improve annotation accuracy and speed. An affinity propagation algorithm-based LVQ technique is used to optimize the training set, and a small number of optimized representative feature vectors are used to train the SVM. Besides SVM and KNN, other typical representative classifiers are Hidden Markov Models (HMM) [39], Markov Random Fields (MRF) [40] and supervised multi-class labelling [41]. The general disadvantage of most classifiers is that they are designed for small-scale image datasets with a limited number of classes. It is still an open research problem to construct large-scale learning classifiers and, therefore, these methods are usually used for annotating specific objects.

Knowledge-based methods are among the recently widely used approaches for image annotation. The knowledge-based approach is used in the higher layers of image interpretation [42]. The authors in [42] proposed a fuzzy-knowledge based intelligent system for image annotation, which is able to deal with uncertain and ambiguous knowledge and can annotate images with concepts on different levels of abstraction, which is more human-like. They used a fuzzy knowledge-representation scheme based on the fuzzy Petri net (KRFPN). They claimed that the acquisition of knowledge is facilitated in a way that, besides the general knowledge provided by the expert, the computable facts and rules about the concepts as well as their reliability are produced automatically from data.

The authors in [2] suggested a method for automatic image annotation using a relatively large-scale image training dataset. They combined local and global features to ensure the robustness and generalization needed by complex queries, focusing on performance and scalability. To obtain the annotation for a given target image, they claimed that their approach is based on the way humans annotate images manually.

The authors in [43] proposed an annotation method, called TagProp, which discriminatively trains a nearest neighbor model. Image labels are predicted using a weighted nearest neighbor model to exploit the training images. Weights are based on neighbor rank or distance. TagProp allows the integration of metric learning by directly maximizing the log-likelihood of the label predictions in the training set.

The authors in [44] presented an AIA method with three separate phases, named Image Annotation Genetic Algorithm (IAGA). They use a GA as a feature selection method in the first phase to address the high-dimensionality problem. In the next phase, a multi-label KNN algorithm is applied to weight neighbors and generate a novel weighted matrix. Finally, the GA is used to combine the results and assign the related words to a query image.

1.2. Challenges and contributions

Although the problem of image annotation has been studied for many years, there are still many open challenges, which are discussed in the following:

– The performance (efficiency and/or effectiveness) of existing approaches on real-world image databases in the image annotation domain is not satisfactory enough, and more investigation is needed.

– To date, many methods for feature extraction in CBIR and AIA have been proposed. One of the main challenges for researchers is to find which subset of feature spaces (feature types), among the many possible spaces, should be chosen for better representation of concept labels and how they should be combined.

– Many image annotation methods are based on image segmentation [45], which is the main procedure to extract the descriptors of the image to be indexed, so that these descriptors can well describe the content of the image. However, image segmentation has some disadvantages, such as sensitive and erroneous results, over/under segmentation, etc.

– Furthermore, many existing methods for image annotation employ complicated mathematical models. Most of these complex models (e.g. generative models and discriminative models) fail to achieve good performance when they face growing image collections with a dictionary that covers a considerable number of potential semantics (labels).

The method proposed in this paper aims to develop an automatic image annotation system that addresses some of these challenges. To this end, it makes the following contributions:

• This paper presents an effective automatic image annotation framework that uses the idea of feature and concept prototypes in a simple manner, far from the current complicated mathematical modelling. To do this, a simple classical prototype generation method employing a clustering algorithm is used. Overall, the implementation of the proposed framework uses some existing algorithms, methods and techniques, which makes the framework user-friendly and easy to understand.

• Different types of visual features provide a multi-modal representation of images in the annotation task. This paper presents a suitable solution to the problem of choosing an effective subset of feature spaces among the possible employed spaces for better representation of the concepts. This addresses one of the main concerns in the field of AIA mentioned before.

• The proposed method extracts the visual feature prototypes and concept prototypes from each feature space separately, and in the performance phase it fuses the labels provided by each feature space. In this view, it acts as a multi-expert system. In this way, we try to build a strong method by combining several weak classifiers. How to fuse the labels suggested by different feature spaces (the opinions of different experts) is another contribution of the paper.

• The proposed method works with a suitable combination of local and global feature types (an appropriate subset of feature spaces). The feature spaces are selected automatically so that they compensate for each other's weaknesses in the representation of concepts; therefore, there is no need to extract features from a segmentation process. This makes it beneficial for the image annotation task.

• The proposed method, as a multiview learning approach, presents a solution to the view sufficiency and dependency issue by using a metaheuristic to select the appropriate feature spaces.

• An implementation of the proposed method containing many experiments is performed. The model's parameters are justified, the method is compared with several well-known state-of-the-art methods, and the results are presented.


1.3. Paper organization

The remainder of the paper is organized as follows. In Section 2, the basic idea behind the proposed system and the framework of the automatic image annotation are described. In Section 3, the implementation details such as databases, measures and different parts of the system are presented. Finally, experimental results and the conclusion are provided in Sections 4 and 5, respectively.

2. The proposed automatic image annotation: framework

2.1. Basic idea

In this section, we describe the basic idea behind the proposed framework of automatic image annotation. Our proposed method to achieve the goal of automatic image annotation is based on the combination of results obtained from feature space and concept space. Considering a real-world image dataset, considerable storage is required, so we use the idea of prototypes in both feature and concept spaces.

The prototypes are generated in the learning phase, and input unlabeled images are assigned to the nearest prototypes in both feature and concept spaces; next, primary labels are obtained in an uncertain manner from the selected prototypes. Lastly, the obtained labels are combined and the final labels for a target image are chosen.

Although the proposed image annotation method seems simple, and its simplicity can be considered a merit, there are some challenges:

• When there are several types (spaces) of different visual features, two important issues are which visual feature spaces are effective for our purpose and how much each feature type should influence the final labels.

• There are several feature spaces, and some of them have too many feature dimensions, which leads to the curse of dimensionality.

• The modeling of concepts based on visual features, known as the semantic gap, is a major concern in image retrieval systems.

• It has been inferred experimentally that a combination of ranked labels can improve the annotation and retrieval performance. The combination rule for the resulting labels is another issue.

• Dealing with unbalanced data is another key issue in data grouping.

The proposed framework tries to answer and solve the challenges described above. First, we briefly present the learning phase, and then the scenario of image annotation in the performance phase is explained.

2.2. Learning phase

In the learning phase, an unsupervised clustering algorithm is applied to the visual features to extract the clusters' centroids as feature prototypes, and a supervised clustering algorithm is used for extracting the concepts' prototypes. Since a significant number of feature types have been extracted and introduced in the considerable body of related work, selecting the more appropriate feature spaces (feature types) is the main problem of generating prototypes. Here, a basic question is from which visual feature spaces the prototypes should be extracted. To this end, a metaheuristic algorithm is employed to find the most effective feature spaces to bridge the semantic gap and reduce the effect of the curse of dimensionality.

Fig. 1 shows the block diagram of the suggested framework in the learning phase. As the figure shows, the proposed method has two separate stream lines: one for preparing feature prototypes (called the feature stream (FS)) and the other for creating concept prototypes (called the concept stream (CS)). The results obtained in these two streams are combined at the end of the process to finalize the labels to be assigned to the input image. To find the most effective feature spaces (more effective prototypes), a metaheuristic algorithm is employed in each stream line separately to maximize the performance of the system in the tagging task.

In the FS line, the training images are clustered based on visual features, and in each cluster some prototypes are selected as feature prototypes. Then, the a priori probability of the predefined labels is calculated for each cluster. On the other hand, in the CS line, the training images are first categorized based on their labels and, in the next step, a supervised clustering algorithm is applied to them, meaning that the images containing a similar label are clustered according to their feature vectors. After that, a number of prototypes are selected for each cluster as concept prototypes. Up to here, it seems simple, but the problem is that in image retrieval and annotation systems we face a significant number of feature spaces instead of one feature space. Therefore, concept prototypes are chosen for each feature space distinctively, which results in many prototypes. To improve annotation performance, we should select spaces with a positive annotation role and remove the spaces with a weak or negative role, in addition to choosing an optimum way of combining the results of different prototypes. The following subsections describe the training procedure of the proposed framework in detail.

2.2.1. Detailed description of the CS line

For each image in the database, many feature vectors are extracted and the image is indexed by them. Let us consider that every image is indexed in K feature spaces by extracting a suitable visual feature in each space. It means that for each image there are K feature vectors that present the attributes of that image based on the corresponding feature type. In this regard, the concept prototypes are generated separately for each feature space. With this introduction, more details of the CS line are elaborated as follows.

2.2.1.1. Concept prototypes. The training images are grouped together according to the predefined concepts indicated by their text annotations. To find concept prototypes, all training images in the database are categorized based on a special concept or tag. In the next step, for each keyword (tag), a clustering algorithm is applied to the feature vectors of the images that have the same tag. This process is repeated K times, each time for one feature space (Fig. 2). Therefore, for each of the K feature spaces, the concept prototypes are extracted separately by a suitable clustering algorithm. In this step, some representatives of a concept (the centroids of the clusters) are formed in the visual feature spaces and the results are stored in a database.
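To make this step concrete, the following minimal sketch builds concept prototypes per concept and per feature space. It is only an illustration of the idea, not the paper's exact procedure: scikit-learn's KMeans stands in for the gravity clustering used later in the implementation, and all names (features, labels, build_concept_prototypes) are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def build_concept_prototypes(features, labels, n_clusters=5):
    # features[k]: (N x d_k) matrix of feature space k; labels: (N x m) binary tag matrix.
    # Returns prototypes[k][c]: centroids of the images tagged with concept c in space k.
    prototypes = []
    for k in range(len(features)):
        per_concept = []
        for c in range(labels.shape[1]):
            X = features[k][labels[:, c] == 1]            # images carrying tag c
            if len(X) == 0:                               # concept absent from training set
                per_concept.append(np.empty((0, features[k].shape[1])))
                continue
            km = KMeans(n_clusters=min(n_clusters, len(X)), n_init=10).fit(X)
            per_concept.append(km.cluster_centers_)       # concept prototypes for (k, c)
        prototypes.append(per_concept)
    return prototypes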

2.2.1.2. Feature-space selection and fusion of resultant labels in the CS line. The multiplicity of visual feature spaces with different visual properties and the large number of feature dimensions may result in the curse of dimensionality. Although research on automatic image annotation systems focuses on developing a learning model, these studies do not explicitly address the problem of selecting suitable feature spaces and their effect on the overall performance.

Since all feature spaces do not describe a concept label equally, the challenge is to determine which of the prototypes are more effective in representing a concept. It is likely that a combination of some particular spaces is better than a single space or the combination of all spaces in bridging the semantic gap and in the semantic description of concepts based on visual features. Here, there are two problems: one is the selection of the best subset of feature spaces among all feature spaces, and the other is how the results of the different selected feature spaces should be fused. A useful way of handling these problems is to model them as an optimization problem where finding the best subset of feature spaces and the best criterion of fusion are the final goals (Fig. 3).

Fig. 1. The proposed framework of automatic image annotation in the learning phase.

From another viewpoint, each concept space can be considered as an expert which suggests some labels for an input unseen image. Therefore, in the proposed framework, the final decision is obtained by combining several experts. However, not all spaces have an equal degree of importance in the prediction of the correct labels for unseen images. We should select spaces which are able to cover each other's weaknesses in order to produce acceptable results for target images by bridging the semantic gap. Also, how the results of different experts are combined can affect the final results. In summary, the proposed method achieves two objectives simultaneously: (1) proper fusion of different modalities, and (2) selection of an optimal subset of feature spaces.

To select the best fusion type among the existing set of fusion types, the results obtained from the concept prototypes of the selected feature spaces are combined to classify the training images, and any criterion such as classifier accuracy, precision, etc. can be used to guide the optimizer algorithm in finding the best solution.

2.2.2. Detailed description of the FS line

It is noted that the visual features used to index the images are the same in both stream lines. Similar to the CS line, many feature vectors are employed to index the database images. It means that for each image there are K feature vectors that represent the attributes of that image, each one based on the corresponding feature type. In this regard, the visual feature prototypes are generated separately for each feature space. In this section, the process of finding feature prototypes is presented in more detail.

2.2.2.1. Visual feature prototypes. In this process, for each feature space, the visual features of all images are clustered to find cluster centroids (i.e. each centroid is considered as a visual feature prototype), as illustrated in detail in Fig. 4. In the first step, for each feature space, a suitable clustering algorithm is applied to the feature vectors, and the centroids of the clusters are used as representatives of the corresponding clusters, which are considered as feature prototypes. Next, according to the images in each cluster, a probability vector is computed for that cluster, which shows the frequency of occurrence of the predefined labels in that cluster. Therefore, the centroids (feature prototypes) along with their corresponding probability vectors are stored in a database in the final step. The probability vector of a feature prototype shows what types of images may exist in that cluster. It is noticed that this process is repeated K times, each time for one feature space. When images are clustered into several groups, each group represents some hidden concepts as semantic layers. Information from the image groups is used as a priori knowledge to help the annotation and retrieval process.
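The FS-line counterpart can be sketched in the same spirit. Again this is only an illustration with assumed names: KMeans replaces the gravity clustering, features_k is one feature space and labels is a binary tag matrix; the per-cluster label frequencies play the role of the probability vectors described above.

import numpy as np
from sklearn.cluster import KMeans

def build_feature_prototypes(features_k, labels, n_clusters=50):
    # features_k: (N x d) matrix of one feature space; labels: (N x m) binary tag matrix.
    # Returns (centroids, probs) where probs[c, j] is the frequency of label j in cluster c.
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(features_k)
    probs = np.zeros((n_clusters, labels.shape[1]))
    for c in range(n_clusters):
        members = labels[km.labels_ == c]
        if len(members):
            probs[c] = members.mean(axis=0)               # a priori label probabilities
    return km.cluster_centers_, probs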

2.2.2.2. Feature space selection and fusion of resultant labels in the FS line. As mentioned in the previous section, after applying a clustering method to each space, a significant number of centroids are extracted as visual feature prototypes. Accompanying each prototype, we calculate a probability vector whose i-th item describes the probability of the images containing the i-th concept in the corresponding cluster.

In other words, we calculate the distance of an input image (i.e. its visual feature vector) in each space to all feature prototypes and select the nearest one in that space. Then, we predict its uncertain labels based on the related probability vector. Now, we encounter the problem of diversity in feature spaces, where the key issues are: determining the number of feature spaces that take part in the annotation process, choosing a combination method, determining a degree of association and selecting the proper spaces.

Because of the semantic gap, it is clear that not all spaces have an equal degree of importance. We apply a metaheuristic algorithm to optimize both the number of feature spaces and the way of fusion based on an admissible criterion. Then, the final prototypes and the selected fusion method are stored in a dataset (see Fig. 3). The process of selecting the proper subset of feature spaces and the fusion method is done for the FS line independently from the CS line.


Fig. 2. Learning phase for obtaining concept prototypes.


2.3. Final fusion of resultant labels of CS and FS lines

With respect to the previous section, it is clear that fusion is the main process in the proposed framework of image annotation. At the end of each stream line, we have some labels which are ranked sequentially based on their importance. Now, we have two experts that each make a distinct decision regarding each image. It is necessary to combine the results of the two experts (two stream lines) to obtain the final tags. In this case, many aggregation methods are examined and the best one is selected in the training phase.

2.4. Performance phase

We introduce the query process to annotate an unlabeled input image based on both concept prototypes and visual feature prototypes in the performance phase. As mentioned in Sections 2.2 and 2.3, for each feature space, the training images are grouped along two different paths, based on unsupervised clustering and supervised clustering, to find visual feature and concept prototypes, respectively.

The proposed system in the performance phase, as described in Fig. 5, includes two main processes. The first process searches for the closest cluster to the query image in the selected visual feature spaces and finds the probability vector of labels related to the nearest cluster in each space, separately. Then it fuses them and a vector of labels with the highest chance is presented. If the number of keyword labels and the size of the selected feature space subset are denoted by $m$ and $r_f$ respectively, the fusion block in the FS line receives at its input $r_f$ vectors of length $m$, where each vector is provided by a selected feature space. Each entry of such a vector contains the probability value of the query image belonging to the corresponding label. Therefore, the fusion block fuses the $r_f$ probability vectors and gives out one vector at the output. Different types of fusion methods can be used in this block to aggregate the probability vectors provided by the different selected feature spaces.

Fig. 3. Learning phase to select feature spaces and fusion types in the CS line.

Fig. 4. Learning phase to find visual feature prototypes and their related concept probabilities.

The second process finds the input query's distances to the concept prototypes and ranks the predicted labels with respect to distance in all spaces. In the CS line, according to the distances between the query image and the concept prototypes in each feature space, a probability vector is constructed. If the number of keyword labels and the size of the selected feature space subset in the CS line are denoted by $m$ and $r_c$ respectively, the fusion block in the CS line receives at its input $r_c$ vectors of length $m$. Therefore, the fusion block fuses the $r_c$ probability vectors and gives out one vector at the output. Different types of fusion methods can also be used in this block to aggregate the probability vectors provided by the different selected feature spaces.

Considering the two presented vectors of labels, one for each of the aforesaid lines, the two obtained lists of labels are combined and a vector that sorts the proposed labels is presented. Finally, we select a number of labels with higher probability values by using a fusion method to annotate the input query image.



Fig. 5. The proposed framework of automatic image annotation in the performance phase.
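The first process of Fig. 5 can be sketched as follows for the FS line. This is an illustrative fragment under the assumptions of the earlier sketches (centroids and probs come from build_feature_prototypes); the returned r_f vectors would then be passed to the fusion block.

import numpy as np

def fs_line_vectors(query_feats, centroids, probs, selected):
    # query_feats[k]: query descriptor in space k; selected: indices of the chosen spaces.
    vectors = []
    for k in selected:
        d = np.linalg.norm(centroids[k] - query_feats[k], axis=1)   # distances to prototypes
        vectors.append(probs[k][np.argmin(d)])                      # labels of nearest cluster
    return vectors                                                  # r_f vectors of length m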


The worst-case running time, $T$, of the proposed algorithm in the performance phase is obtained as

$$T(n_f, n_c, n_s, d_s) = C + \sum_{i=1}^{n_s} \left( C_{i1}\, n_f + C_{i2}\, n_c \right) d_{s_i},$$

where $C_{i1}$, $C_{i2}$ and $C$ are time constants such that each of them can contain basic operations such as arithmetic operations (e.g., $+$, $\times$), assigning a value to a variable (e.g., $x = 0$), testing or comparing (e.g., $x < 0$), reading a primitive type such as an integer, float, character or boolean from memory, and writing a primitive type to memory. In addition, $n_f$, $n_c$, $d_{s_i}$ and $n_s$ are the number of clusters (feature prototypes), the number of concept prototypes, the feature dimension of the i-th space and the number of feature spaces, respectively, and $d_s$ is the whole feature dimension. Also, the time complexity of the algorithm is $O(n_f n_c n_s d_s)$. The big-O notation is used in the theory of complexity to describe the behavior of functions; basically, it demonstrates how fast a function grows or declines.

3. Implementation

In the previous section, the proposed framework for automatic image annotation was described. Many possible ways are available for implementing the proposed framework. In this section, we describe the details of one implementation of the proposed framework.

3.1. Databases

The proposed automatic image annotation framework is evaluated on two well-known benchmark image databases that are described in detail in this section.

3.1.1. IAPR-TC12 database

The IAPR TC-12 database [49] includes 19,627 images, 17,665 for training and 1962 images for testing. Overall, there are 291 keywords (an average of 4.7 keywords per image) that appear in both the train and the test set.

3.1.2. ESP Game database

The ESP Game database [50] consists of a set of 21,844 images. It is divided into two sets: a training set of 19,659 images and a test set of 2185 images. Overall, there are 269 keywords that appear in both the train and the test sets. Each image is associated with up to 15 keywords (an average of 4.6 keywords per image).

3.2. Visual features

To implement our framework and evaluate its performance, we use the same features as [44]. Table 1 shows a brief description of the used feature spaces. These are a combined set of local and global features. The local features contain the SIFT and hue descriptors obtained densely from a multi-scale grid and from Harris-Laplacian interest points. The global features include color histograms in the RGB, HSV and LAB color spaces, and the Gist features. This results in 15 distinct feature types, namely one Gist descriptor, 6 color histograms (3 color spaces × 2 layouts) and 8 bag-of-features (2 descriptors × 2 detectors × 2 layouts).

3.3. Normalization and preprocessing

Normalization of the features is a necessary pre-processing step to bring all the features into the same dynamic range. Data normalization is the process of reducing data to its canonical form. In this work, we apply a Gaussian normalization that puts equal emphasis on the values in each of the feature spaces. By doing so, we normalize the values of each descriptor into a normal distribution using

$$\tilde{d}_i = \frac{d_i - \mu}{3\,\delta},$$

where $\mu$ is the mean value and $\delta$ is the standard deviation, both calculated from the image feature dataset.

3.4. Gravity clustering method

One of the major challenges in clustering algorithms is the ability to deal with noise and outliers, imbalanced groups, as well as sensitivity to the initial position of the cluster centroids. Ref. [46] proposed a nature-inspired clustering algorithm which is able to overcome some of these problems, such as outliers (or noisy data), imbalanced clusters, overlapped clusters and sensitivity to the initial position of the cluster centroids. In this clustering algorithm, the data points to be clustered are considered as fixed celestial objects with unity mass that apply a gravity force to movable objects (centroids) and change their positions in the feature space. The aim is to find the best position of each cluster centroid (cluster representative), where each centroid is modeled by a movable agent with unity mass. The centroids move around the feature space under the influence of the gravity force exerted by the celestial objects to find the best position. One can expect that the centroids stop in the optimum positions. Due to the nature of the data in the image annotation problem, which contains noisy and outlier data as well as imbalanced classes of data, gravity clustering is selected for this purpose. On the other hand, gravity clustering is an incremental algorithm, and as data increase there is no need to restart the learning process for all data. The different steps of the gravity clustering algorithm and its detailed description are given in [46]; interested readers can find a brief summary in Appendix A.

Table 1. A brief description of the used feature spaces (feature types) [44].

Space no.  Space name      Abstract description                                                                      # of features
1          DenseHue        The robust HUE descriptor computed for regions on a dense multi-scale grid.               100
2          DenseHueV3H1    The HUE histograms computed over three horizontal regions.                                300
3          DenseSift       The SIFT descriptor computed for regions on a dense multi-scale grid.                     1000
4          DenseSiftV3H1   The SIFT histograms computed over three horizontal regions.                               3000
5          Gist            The Gist descriptor.                                                                      512
6          HarrisHue       The robust HUE descriptor computed for regions found using a Harris-Laplacian detector.   100
7          HarrisHueV3H1   The HarrisHue histograms computed over three horizontal regions.                          300
8          HarrisSift      The SIFT descriptor computed for regions found using a Harris-Laplacian detector.         1000
9          HarrisSiftV3H1  The HarrisSift histograms computed over three horizontal regions.                         3000
10         HSV             The color histogram computed in the HSV space.                                            4096
11         HSVV3H1         The HSV histograms computed over three horizontal regions.                                5184
12         Lab             The color histogram computed in the LAB space.                                            4096
13         LabV3H1         The LAB histograms computed over three horizontal regions.                                5184
14         RGB             The color histogram computed in the RGB space.                                            4096
15         RGBV3H1         The RGB histograms computed over three horizontal regions.                                5184

Fig. 6. The object representation in case 1.

3.5. Gravitational search algorithm as a metaheuristic optimizer

In the proposed framework, we need a metaheuristic search algorithm to handle the optimization tasks. During the last decades, several metaheuristics have been developed to solve complex engineering optimization problems, most of which have been inspired by natural phenomena and swarm behaviors. The gravitational search algorithm (GSA) is one of the recently created swarm-based metaheuristic search algorithms, which has a flexible and well-balanced mechanism to enhance exploration and exploitation abilities. GSA is inspired by the Newtonian laws of gravity and motion. In this algorithm, mass interactions are simulated and objects (agents) move through a multi-dimensional search space under the influence of gravitation. The original version of GSA operates in continuous space to optimize problems with real-valued variables [48]. However, many optimization problems are set in binary space; in this regard, some binary versions of the algorithm have been suggested. Among them, the binary quantum-inspired GSA (BQIGSA) [47] is one of the most successful binary search algorithms, and it is employed here.

In this paper, the optimization problem includes binary variables as well as real parameters to be optimized. The effectiveness of the GSA family and its success in solving different optimization problems motivated us to use it in this paper. In the cases where the problem contains both real-valued and binary-valued parameters, a hybrid GSA, combining the real-valued GSA [48] and BQIGSA [47], which is able to optimize both real and binary parameters of the problem at hand, is utilized. A brief review of GSA and BQIGSA is given in Appendix B for interested readers. The metaheuristic optimizer is employed in our implementation in the following cases.

3.5.1. Space subset selection and fusion of spaces' decisions

To select the optimal subset of the spaces in each stream line, the GSA family is applied. We perform feature space selection and fusion of the spaces' decisions simultaneously. To this end, two cases are suggested and examined in the experiments, which are detailed in the following.

• Case 1: Selection of optimal space subset and fusion method

In this case, each solution is represented by a string of bits and the metaheuristic algorithm searches for the value of each bit in the string. A direct approach using a metaheuristic search algorithm for feature space selection is to find an optimal binary vector, where each bit is associated with a space. If the i-th bit of this vector equals 1, then the i-th feature type (space) is allowed to participate in annotation; otherwise, the corresponding space is left out. Each resulting subset of spaces is evaluated according to a suitable criterion on a set of training data.

In case 1, along with finding the optimal space subset, some bits are added to the binary string to encode the type of fusion of the selected spaces. In the experiments, we have examined eight fusion methods, so only three bits are required to encode the fusion type. Overall, the length of each string is 18 bits for each stream line (15 bits for the 15 spaces and 3 bits for selecting the fusion method). The object representation is given in Fig. 6. To apply the BQIGSA, the Q-bit representation is used; the representation shown in Fig. 6 is the binary string after the observation process. It should be noted that the algorithm is run for each line separately. In this case, only the BQIGSA is used to find the optimal values of the binary-valued variables. The employed fusion methods are reported in Table 2.
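Decoding such a solution can be sketched as follows. The bit layout (15 selection bits followed by 3 fusion bits) follows the description above, while the function name and the big-endian interpretation of the fusion bits are assumptions made for illustration.

def decode_case1(bits):
    # bits: list of 18 values in {0, 1} produced by the BQIGSA observation step.
    selected = [i for i, b in enumerate(bits[:15]) if b == 1]      # chosen feature spaces
    fusion_id = bits[15] * 4 + bits[16] * 2 + bits[17]             # index 0..7 into Table 2
    return selected, fusion_id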

As explained before, in each line the results obtained by the selected feature spaces are fused using the selected fusion method. As shown in Table 2, we have used eight fusion methods, and the fusion type is encoded by three bits in the BQIGSA. Assume that, for a query input image, the probability vector provided by the selected feature space i in the FS line is denoted by $L_i^{FS}$, which is a vector of length $m$ (i.e. $L_i^{FS} = (l_{i,1}^{FS}, l_{i,2}^{FS}, \dots, l_{i,m}^{FS})$, $i = 1, 2, \dots, r_f$). For example, the "Maximum" fusion type combines the probability vectors obtained by the $r_f$ selected feature spaces (i.e. the spaces whose corresponding bit takes the value 1 in the object shown in Fig. 6) in the FS line as $l_{fin,j}^{FS} = \max_{i=1}^{r_f}(l_{i,j}^{FS})$, $j = 1, 2, \dots, m$; $L_{fin}^{FS} = (l_{fin,1}^{FS}, l_{fin,2}^{FS}, \dots, l_{fin,m}^{FS})$ is the final probability vector provided by the FS line. Similar to this process, the probability vector obtained by the CS line is created, called $L_{fin}^{CS} = (l_{fin,1}^{CS}, l_{fin,2}^{CS}, \dots, l_{fin,m}^{CS})$. Again, it is noted that the optimization process is performed separately for each stream line.

Table 2. Fusion methods.

Row  Method            Description
1    Maximum           $l_{fin,j}^{FS} = \max_{i=1}^{r_f}\big(l_{i,j}^{FS}\big),\ j \in \{1, 2, \dots, m\}$   (1)
2    Minimum           $l_{fin,j}^{FS} = \min_{i=1}^{r_f}\big(l_{i,j}^{FS}\big),\ j \in \{1, 2, \dots, m\}$   (2)
3    Arithmetic mean   $l_{fin,j}^{FS} = \frac{1}{r_f}\sum_{i=1}^{r_f} l_{i,j}^{FS},\ j \in \{1, 2, \dots, m\}$   (3)
4    Mean/min          $l_{fin,j}^{FS} = \big(\frac{1}{r_f}\sum_{i=1}^{r_f} l_{i,j}^{FS}\big) \big/ \min_{i=1}^{r_f}\big(l_{i,j}^{FS}\big),\ j \in \{1, 2, \dots, m\}$   (4)
5    Product/mean      $l_{fin,j}^{FS} = \big(\prod_{i=1}^{r_f} l_{i,j}^{FS}\big) \big/ \big(\frac{1}{r_f}\sum_{i=1}^{r_f} l_{i,j}^{FS}\big),\ j \in \{1, 2, \dots, m\}$   (5)
6    Median            Middle value of the given numbers in their ascending/descending order.   (6)
7    Geometric mean    $l_{fin,j}^{FS} = \big(\prod_{i=1}^{r_f} l_{i,j}^{FS}\big)^{1/r_f},\ j \in \{1, 2, \dots, m\}$   (7)
8    Harmonic mean     $l_{fin,j}^{FS} = r_f \big/ \big(\sum_{i=1}^{r_f} 1/l_{i,j}^{FS}\big),\ j \in \{1, 2, \dots, m\}$   (8)
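For illustration, the eight rules of Table 2 can be implemented element-wise on the stacked probability vectors as sketched below; the dictionary keys mirror the 3-bit fusion index of case 1, and strictly positive probabilities are assumed for the rules that divide or take products.

import numpy as np
from scipy.stats import gmean, hmean

FUSIONS = {
    0: lambda L: L.max(axis=0),                      # maximum
    1: lambda L: L.min(axis=0),                      # minimum
    2: lambda L: L.mean(axis=0),                     # arithmetic mean
    3: lambda L: L.mean(axis=0) / L.min(axis=0),     # mean / min
    4: lambda L: L.prod(axis=0) / L.mean(axis=0),    # product / mean
    5: lambda L: np.median(L, axis=0),               # median
    6: lambda L: gmean(L, axis=0),                   # geometric mean
    7: lambda L: hmean(L, axis=0),                   # harmonic mean
}

def fuse(vectors, fusion_id):
    # vectors: r_f probability vectors of length m; returns the fused length-m vector.
    return FUSIONS[fusion_id](np.vstack(vectors))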

• Case 2: Optimal weights of the weighted sum fusion method

In this case, a weighted sum fusion is applied to aggregate the opinions of the different experts (feature spaces). The value of each weight to be optimized can vary in the interval [0, 1]. Therefore, a real-valued vector is needed to encode the weights of the weighted sum fusion. Furthermore, similar to case 1, a string of binary bits is used for the optimal space selection. If the i-th bit of this vector equals 1, then the i-th feature type is allowed to participate in annotation; otherwise, the corresponding space is left out. For each space whose corresponding bit is 1, the corresponding weight is considered in the weighted sum fusion, and for a space whose bit is 0, the space is left out and cannot contribute to the weighted sum.

In this case, a hybrid algorithm containing a real GSA and a BQIGSA, called mixed GSA, is used to bridge the semantic gap between the low-level visual features and the high-level semantics. Each solution is represented by a mixed vector of binary-valued and real-valued variables with length 30, i.e. 15 binary variables for space subset selection and 15 real variables for their contribution weights. The mixed GSA is used to find the presence of the spaces as $B_i$ and the participation weights of those spaces as $W_i$. The object representation is given in Fig. 7. To apply the BQIGSA to the binary part, the Q-bit representation is used; the representation shown in Fig. 7 for the binary part is the binary string after the observation process. It should be noted that the algorithm is run for each stream line separately.

The weighted sum fusion is applied to aggregate the probability vectors provided by the feature spaces in the FS line using Eq. (9).

Fig. 7. In case 2, the object i is composed of two parts: a binary part and a real-valued part.

$$l_{fin,j}^{FS} = \sum_{i=1}^{n_s} B_i \, W_i \, l_{i,j}^{FS} \qquad (9)$$

where $n_s$ is the number of feature spaces. Here, $L_{fin}^{FS} = (l_{fin,1}^{FS}, l_{fin,2}^{FS}, \ldots, l_{fin,m}^{FS})$ is the final probability vector provided by the FS line. In a similar way, the probability vector obtained by the CS line is created, denoted $L_{fin}^{CS} = (l_{fin,1}^{CS}, l_{fin,2}^{CS}, \ldots, l_{fin,m}^{CS})$. It should be noted that the optimization process is performed separately for each stream line.
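A minimal sketch of Eq. (9), assuming a binary selection mask B and participation weights W as produced by the mixed GSA (the names and the toy values are ours, not the paper's):

import numpy as np

def weighted_sum_fusion(prob_vectors, B, W):
    """Eq. (9): l_fin_j = sum_i B_i * W_i * l_{i,j} over all n_s feature spaces."""
    P = np.asarray(prob_vectors, dtype=float)   # shape (n_s, m)
    B = np.asarray(B, dtype=float)              # binary selection mask, length n_s
    W = np.asarray(W, dtype=float)              # participation weights in [0, 1], length n_s
    return (B * W) @ P                          # length-m fused probability vector

# Toy example with n_s = 4 spaces and m = 3 labels
P = np.random.rand(4, 3)
B = np.array([1, 0, 1, 1])           # the second space is switched off
W = np.array([0.7, 0.9, 0.2, 0.5])   # its weight (0.9) is ignored because its bit is 0
L_fin_FS = weighted_sum_fusion(P, B, W)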

3.5.2. Fusion of stream lines' decisions

The weighted sum method is applied to combine the results of the concept prototypes and the feature prototypes. In this case, the real-valued GSA is used to find the optimal values of the two weights on the training datasets.

$$l_{fin,j} = W_{FS} \, l_{fin,j}^{FS} + W_{CS} \, l_{fin,j}^{CS}, \qquad j = 1, 2, \ldots, m \qquad (10)$$

where $l_{fin,j}$, $W_{FS}$ and $W_{CS}$ are the element $j$ of the final probability vector provided by the proposed method (combining the two stream lines), and the weights for combining the results of the FS and CS lines in the weighted sum fusion, respectively. The weights $W_{FS}$ and $W_{CS}$ are optimized by running the GSA on the training dataset. To annotate an input query image, the final fusion block receives the probability vectors $L_{fin}^{FS}$ and $L_{fin}^{CS}$ from the FS and CS lines, respectively, and aggregates them by Eq. (10), which results in $L_{fin} = (l_{fin,1}, l_{fin,2}, \ldots, l_{fin,m})$. After computing $L_{fin}$, its elements are sorted in descending order and the keyword labels corresponding to the largest values are assigned to the query input image as the final keywords.
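The final fusion of Eq. (10) and the selection of the top-ranked keywords can be sketched as follows (our own illustration; the vocabulary list and the choice of k = 5 tags are assumptions consistent with the experimental settings described later):

import numpy as np

def annotate(L_fin_FS, L_fin_CS, w_fs, w_cs, vocabulary, k=5):
    """Eq. (10): combine the FS and CS probability vectors and return the k top-scoring labels."""
    L_fin = w_fs * np.asarray(L_fin_FS) + w_cs * np.asarray(L_fin_CS)
    top = np.argsort(L_fin)[::-1][:k]           # indices of the k largest fused scores
    return [vocabulary[j] for j in top]

# Toy usage with a six-word dictionary
vocab = ["sky", "mountain", "tree", "water", "road", "person"]
tags = annotate(np.random.rand(6), np.random.rand(6), w_fs=0.4, w_cs=0.6, vocabulary=vocab)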

3.5.3. Fitness function design

A fitness function must be used to evaluate the quality of a solution in a metaheuristic algorithm. In this work, we use the F-measure criterion, which combines precision and recall, for the optimization processes described in Sections 3.5.1 and 3.5.2. These criteria are described in the next section.

4. Experimental results

Based on the implementation described in the previous section, a set of experiments is conducted to evaluate the effectiveness of the proposed framework.

4.1. Performance measures

Performance measures used to assess an AIA method can be categorized into two groups: label-based and example-based. The former is calculated for each label (annotation) and then the average over all labels is reported, whereas the latter is calculated for each test example and then averaged over the test set. Eqs. (11) and (12) show the precision and recall measures used in this work, respectively.

$$P = \frac{1}{m}\sum_{i=1}^{m} \frac{Nc_i}{Nt_i} \qquad (11)$$

$$R = \frac{1}{m}\sum_{i=1}^{m} \frac{Nc_i}{Nr_i} \qquad (12)$$

where $m$ is the total number of labels in the dataset and, for the $i$-th label, $Nc_i$, $Nt_i$ and $Nr_i$ denote the number of correctly annotated images, the number of images annotated with that label, and the number of images truly related to that label in the test set, respectively. In this work, mean precision, mean recall, the F-measure (Eq. (13)) and N+, the total number of keywords recalled (i.e. the number of keywords with non-zero recall (NZR)), are reported.

$$F\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall} \qquad (13)$$

The F-measure is computed as the harmonic mean of precision and recall. Thus, our goal can be viewed as designing an algorithm that generates high-quality solutions.
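For concreteness, a small sketch of the label-based measures of Eqs. (11)-(13) and N+ (our own bookkeeping, not code from the paper; labels that are never predicted or never present are counted with a score of zero here, which is an assumption):

import numpy as np

def label_based_scores(pred, truth, m):
    """Eqs. (11)-(13) and N+: pred/truth are lists of label-index sets, one pair per test image."""
    Nc = np.zeros(m); Nt = np.zeros(m); Nr = np.zeros(m)
    for p, t in zip(pred, truth):
        for j in p:
            Nt[j] += 1                      # image annotated with label j by the system
            if j in t:
                Nc[j] += 1                  # ...and label j is correct for this image
        for j in t:
            Nr[j] += 1                      # image truly carries label j
    per_label_P = np.divide(Nc, Nt, out=np.zeros(m), where=Nt > 0)
    per_label_R = np.divide(Nc, Nr, out=np.zeros(m), where=Nr > 0)
    P, R = per_label_P.mean(), per_label_R.mean()
    F = 2 * P * R / (P + R) if (P + R) > 0 else 0.0
    N_plus = int(np.count_nonzero(per_label_R))   # keywords with non-zero recall
    return P, R, F, N_plus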

4.2. Parameter settings

Parameter settings for the gravity clustering algorithm, the BQIGSA and the GSA follow the best parameters reported in the corresponding literature [46–48]. For all experiments, the population size and the number of iterations are set to 40 and 100, respectively.

The numbers of visual feature prototypes and concept prototypes are set experimentally. A set of experiments is performed on the training set of two feature spaces (spaces no. 3 and 9) of the IAPR-TC12 database to show the behavior of the model as the number of clusters (i.e. the number of prototypes) increases. In fact, this is a problem-dependent parameter, and the optimal value may change from one database to another as well as from one feature space to another. On the other hand, increasing the number of clusters should improve the precision and recall measures on the training set; however, it increases the probability of over-fitting on the test set, and it also increases the time complexity. In Fig. 8, the F-measure values (in percent) versus the number of clusters for spaces 3 and 9 of the IAPR-TC12 database are illustrated for the concept stream line (Fig. 8a) and the feature stream line (Fig. 8b), respectively. For all experiments in the current work, to ensure a fair comparison, the length of the assigned tag list is set to five words, as in all competing methods.

Considering all the aforementioned points, to achieve a trade-off between computational complexity and performance, the numbers of feature prototypes and concept prototypes are set to 300 and 11, respectively. All subsequent experiments with the proposed framework use these parameter settings.

Fig. 8. The curves of F-measure (in percent) versus the number of (a) concept prototypes and (b) visual feature prototypes, for feature spaces no. 3 and 9 of the IAPR-TC12 database.

4.3. Performance of the proposed method for each feature space

To assess the performance of each feature space and its ability in annotation, a set of experiments is reported in this section. The experiments are performed separately for each stream line. The results are summarized in Tables 3 and 4 for the training and test sets, respectively, in terms of the P, R and N+ measures. Based on these two tables, the power of the two stream lines can be compared in the training and test phases.

According to the obtained results, in the training phase the feature spaces 3, 4 and 9 perform better than the other spaces on both databases for both stream lines. The results also show that in the training phase the performance of each feature space in the CS line is better than its performance in the FS line.

Similar to the results presented in Table 3, the same experiments are performed on the test dataset, and the results are given in Table 4. The obtained results show the effect of the 15 feature spaces in both stream lines.

According to the results presented in Table 4, in the performance phase the feature spaces 3, 4 and 9 again perform better than the other spaces on both databases for both stream lines. However, in contrast to the training phase, in the performance phase the performance of each feature space in the CS line is weaker than its performance in the FS line. Comparing the results in the two tables also reveals that the performance of all feature spaces in the CS line drops drastically in the performance phase with respect to the training phase. The N+ value, which reflects the coverage level of the annotation words, is higher in the CS line than in the FS line; this indicates that the concept diversity in the CS line is higher than in the FS line.

4.4. Performance of the proposed method for the combination of feature spaces

To combine the results of the different spaces (the opinions of several experts), a metaheuristic algorithm is applied in two cases. In case 1, the BQIGSA is applied to select the optimal space subset and the best fusion method, separately for the two stream lines. In case 2, the mixed GSA is applied to select the optimal space subset and the optimal weights of the weighted sum fusion method, again separately for the two stream lines. Eventually, the results of the two stream lines are combined by applying the GSA with a weighted sum fusion method. The results on the training dataset in terms of the P, R and N+ measures for the two image databases are reported in Table 5. The table shows that the best fusion method in case 1 for both databases is the "max" function. Also, as can be seen, the results obtained by case 1 are similar to those of case 2 in the training phase for both databases. It is also revealed that the combination of spaces in the CS line yields better results than that in the FS line in the training phase.

Table 6 presents the results of the proposed method for the combination of spaces in both the CS and FS lines on the two databases in the performance phase. The results are reported in terms of P, R and N+. To annotate a test image, the labels assigned by the proposed method are sorted based on their integrated scores for that image, and the top five labels are selected.

The results show that case 2 outperforms case 1 in the performance phase for both databases. It is also revealed that the combination of spaces in the CS line yields better results than that in the FS line in the performance phase. Several sample images along with their human and system annotations are depicted in Fig. 9.


Table 3
The annotation performance of different feature spaces in terms of P, R and N+ in the training phase based on the proposed framework.

                 IAPR-TC12                            ESP-Game
                 FS                CS                 FS                CS
Space/criteria   P     R     N+    P     R     N+     P     R     N+    P     R     N+
1                25.5  22.2  141   18.4  16.0  291    23.6  25.1  115   11.9  12.7  268
2                26.0  22.7  143   24.7  21.6  291    23.4  25.0  118   16.5  17.6  268
3                32.5  28.4  156   48.3  42.2  291    24.2  25.8  160   35.5  38.7  268
4                33.3  29.1  163   58.2  50.8  291    24.4  26.0  177   45.7  48.8  268
5                29.2  25.5  124   35.0  30.6  291    23.5  25.1  117   21.6  23.1  268
6                25.1  22.2  141   17.8  15.6  291    23.5  25.0  126   11.1  11.9  268
7                25.8  22.5  142   24.3  21.2  291    23.2  24.8  133   16.2  17.3  268
8                30.6  26.7  178   50.2  43.8  291    22.6  24.1  154   37.1  39.6  268
9                31.0  27.0  180   58.0  50.5  291    22.8  24.4  181   47.4  50.5  268
10               25.2  22.0  184   26.4  23.0  291    22.8  24.4  175   23.4  24.9  268
11               25.8  22.5  194   31.7  27.6  291    22.8  24.4  179   26.2  27.9  268
12               25.0  21.9  174   16.8  14.7  291    22.5  24.0  186   16.7  15.6  268
13               25.4  22.2  196   19.6  17.1  291    22.7  24.2  180   17.4  18.5  268
14               24.9  21.7  198   20.6  18.0  291    23.3  24.9  186   21.4  22.8  268
15               25.1  21.9  206   24.6  21.5  291    22.8  24.4  196   23.4  24.9  268

Table 4
The annotation performance of different feature spaces in terms of P, R and N+ in the performance phase based on the proposed framework.

                 IAPR-TC12                            ESP-Game
                 FS                CS                 FS                CS
Space/criteria   P     R     N+    P     R     N+     P     R     N+    P     R     N+
1                22.3  19.7  86    9.7   8.6   225    22.3  23.7  62    7.9   8.4   183
2                22.9  20.3  83    13.2  11.7  238    22.0  23.4  62    10.9  11.6  196
3                28.6  25.3  112   23.5  20.8  248    22.3  23.7  87    18.1  19.3  227
4                30.1  26.7  119   29.2  25.9  244    22.7  24.2  95    21.7  23.1  217
5                25.6  22.7  88    15.0  13.3  236    21.3  22.7  80    11.1  11.8  224
6                22.5  20.0  72    0.95  0.84  223    21.7  23.1  57    0.72  0.76  181
7                22.6  20.1  70    12.2  10.8  221    21.3  22.7  53    1.0   10.5  188
8                27.2  24.1  108   23.3  20.7  238    21.2  22.6  66    17.7  18.9  215
9                28.3  25.1  109   27.3  24.2  227    21.1  22.4  68    21.7  23.1  223
10               22.8  20.3  80    14.2  12.6  226    21.0  22.3  64    14.6  15.6  203
11               23.4  20.8  87    17.8  15.8  231    21.1  22.5  75    16.6  17.7  215
12               21.9  19.4  88    8.2   7.3   210    20.3  21.7  65    10.3  11.0  191
13               22.5  20.0  76    10.3  9.2   229    20.9  22.3  70    12.0  12.8  204
14               22.6  20.0  63    10.9  9.6   225    21.5  22.9  65    14.2  15.2  206
15               22.6  20.1  77    13.2  11.8  233    21.2  22.5  65    15.6  16.6  206

Table 5
An overview of annotation performance for the combination of spaces in terms of P, R and N+ in the training phase (Case 1).

                  IAPR-TC12                           ESP-Game
                  FS                CS                FS                CS
Fusion/criteria   P     R     N+    P     R     N+    P     R     N+    P     R     N+
Weighted sum      36.7  32.0  179   69.4  60.0  291   29.2  31.2  175   56.7  60.5  268
Selected (max)    38.2  33.3  266   68.0  59.4  291   32.5  34.7  259   53.1  56.7  268
FS & CS           P=61.3, R=70.3, N+=291              P=57.8, R=60.9, N+=268

Table 6
An overview of annotation performance for the fusion methods in terms of P, R and N+ in the performance phase.

                  IAPR-TC12                           ESP-Game
                  FS                CS                FS                CS
Fusion/criteria   P     R     N+    P     R     N+    P     R     N+    P     R     N+
Weighted sum      32.7  29.0  96    37.6  33.4  252   27.0  28.6  60    30.0  32.0  224
Selected (max)    28.6  25.3  86    37.1  32.9  248   24.4  26.0  77    28.8  29.9  231
FS & CS           P=40.9, R=38.3, N+=240              P=32.8, R=36.7, N+=220


Fig. 9. Sample images which are annotated by the proposed framework.


4.5. Comparative study

We applied the proposed framework to the annotation problem, and the results are reported along with those of previous works by other scholars that have already been reported in the literature on the same testbed. To this end, a comparison with 16 well-known image annotation methods on the two standard databases is performed, and the results are reported in Table 7. The annotation performance is evaluated using the P, R, N+ and F-measure criteria.

The obtained results illustrate that the performance of the proposed method according to the R and F-measure criteria is higher than that of all competing methods on both databases. For the N+ and P criteria, its results are reasonably good and comparable with recently proposed methods. Overall, taking the comparative results into account, the proposed framework can be considered an alternative tool alongside other existing methods for automatic image annotation.

5. Conclusion

This paper presents an automatic image annotation framework including two stream lines for combining different feature spaces. Concept prototypes and feature prototypes are extracted in the different spaces by gravity clustering in the training phase. To select the optimal space subset and the related participation weights for combining the feature spaces in each stream line, a metaheuristic search algorithm is applied in the training phase. Finally, the results of the two stream lines are combined to annotate an unlabeled query image in the performance phase. The proposed framework is tested on two standard databases containing about 20k images, and a set of experiments is performed to explore the different aspects and abilities of the framework. Furthermore, the proposed framework is compared with 16 well-known algorithms. The experimental results confirm the effectiveness of the proposed framework and show that it can successfully be used as an image annotation tool alongside other algorithms that have proved their effectiveness thus far.


Table 7
The results of annotation performance of competing methods in terms of P, R, N+ and F-measure on the IAPR-TC12 and ESP-Game databases. The results of the competing methods have already been reported in the literature.

                       IAPR-TC12                      ESP-Game
Methods/criteria       P     R     N+    F-measure    P     R     N+    F-measure
Proposed method        40.9  38.3  240   39.6         32.8  36.7  220   34.6
MBRM from [2]          24    19    223   21.2         18    19    209   18.4
JE from [2]            28    24    250   25.8         22    25    224   23.4
TagProp [43]           46    21    266   28.8         39    27    239   31.9
ANNOR-L [2]            22    25    98    23.4         19    21    86    20.0
ANNOR-G [2]            38    29    242   32.9         36    29    231   32.1
ANNOR-LG [2]           48    32    272   38.4         39    28    241   32.6
Jec-15 from [43]       29    19    211   22.9         24    19    222   21.2
Makadia [13]           28    29    –     28.5         22    25    –     23.4
Lasso from [13]        28    29    246   28.5         21    24    224   22.4
Feng et al. from [5]   24    23    –     23.4         –     –     –     –
GS from [44]           32    29    252   30.4         –     –     –     –
PATSI from [44]        26    31    –     28.3         –     –     –     –
IAGA [44]              39.8  30    244   34.2         –     –     –     –
AICDM from [37]        –     –     –     –            24    26    231   25.0
WOLFA [37]             –     –     –     –            41    28    241   33.2
WI LFA [37]            –     –     –     –            35    25    228   29.2


Acknowledgement

The authors would like to thank the Editorial Board and the anonymous reviewers for their very helpful suggestions. In addition, the authors would like to extend their appreciation to Miss Elham Ghazizadeh for proofreading the manuscript and providing valuable comments.

Appendix A. Gravity clustering

Different steps of the gravity clustering algorithm are as follows [46]:

Input: data objects X = {X_1, ..., X_n}.

Step 1: Initialize the algorithm parameters, including the number of clusters (K), the maximum number of iterations (T) and the parameters used in the gravitational constant equation ($G_0$, $\alpha$), and set $t = 0$.

Step 2: Randomly generate the positions of K initial movable particles (agents) and set the mass values of these particles to one. In other words, the cluster centers $Z_1, Z_2, \ldots, Z_K$ are initialized, where $Z_i = (z_i^1, z_i^2, \ldots, z_i^D)$ is a vector in the D-dimensional feature space.

Step 3: Repeat the following until the maximum number of iterations is reached.

Step 3-1: Each fixed particle (data object $X_i$) is assigned to the nearest cluster center, i.e. for each data object the Euclidean distance from all cluster centers $Z_j$ is computed and the data object is assigned to the nearest one:

$$C_l = \bigl\{ X_i : \| X_i - Z_l \|^2 \le \| X_i - Z_j \|^2,\ \forall j,\ 1 \le j \le K \bigr\} \qquad (A\text{-}1)$$

Step 3-2: Update the cluster centers $Z_j$, i.e. the new positions of the movable particles (cluster centers) are computed under the influence of the force. To update the center of cluster $j$, $j = 1, 2, \ldots, K$, first the total gravitational force applied to the cluster center $Z_j$ by all fixed particles within cluster $C_j$ is computed as follows:

$$F_j(t) = \frac{G(t)}{|C_j|} \sum_{X_i \in C_j} r_i \, \frac{M_i M_j}{R_{ij}^{p}(t) + \varepsilon} \bigl( X_i(t) - Z_j(t) \bigr) \qquad (A\text{-}2)$$

where $M_i$ and $M_j$ denote the mass values of the fixed particle $i$ and the movable particle $j$, respectively; both are set to 1 (i.e. $M_i = M_j = 1$). $R_{ij}$ is the Euclidean distance between data object $i$ and cluster center $j$, defined as $\| X_i(t) - Z_j(t) \|$; $p$ is a parameter of the algorithm that tunes the effect of distance on the calculation of the force; $r_i$ is a D-dimensional random vector generated by a uniform distribution over the interval [0, 1]; $\varepsilon$ is a small number; the term $(X_i - Z_j)$ indicates the direction of the force vector; and the gravitational constant $G$ is a decreasing function of time, initialized to $G_0$ at $t = 0$ and decreased with the lapse of time as in Eq. (A-3):

$$G(t) = G(G_0, t) \qquad (A\text{-}3)$$

Then, for the movable particle $j$, the acceleration and velocity are calculated based on Newton's second law as follows:

$$a_j(t+1) = \frac{F_j(t)}{M_j} = \frac{G(t)}{|C_j|} \sum_{X_i \in C_j} r_i \, \frac{X_i(t) - Z_j(t)}{R_{ij}^{p}(t) + \varepsilon} \qquad (A\text{-}4)$$

$$V_j(t+1) = V_j(t) + a_j(t+1) \qquad (A\text{-}5)$$

And the new position of the movable particles (cluster centers) is calculated using Eq. (A-6):

$$Z_j(t+1) = Z_j(t) + V_j(t+1) \qquad (A\text{-}6)$$

Output: the optimal partition C = {C_1, ..., C_K} and the cluster centers Z = {Z_1, Z_2, ..., Z_K}.
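A compact sketch of these steps follows (our own illustration of Appendix A; the exponential decay of G(t) and the values of p and ε are assumptions, since the exact schedules are given in [46]):

import numpy as np

def gravity_clustering(X, K, T=100, G0=100.0, alpha=20.0, p=1.0, eps=1e-9, seed=0):
    """Gravity clustering (Appendix A): cluster centers move under a gravity-like force."""
    rng = np.random.default_rng(seed)
    n, D = X.shape
    Z = X[rng.choice(n, K, replace=False)].copy()     # Step 2: initial movable particles
    V = np.zeros((K, D))
    for t in range(T):
        G = G0 * np.exp(-alpha * t / T)               # Eq. (A-3), assumed exponential decay
        d = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)   # n x K distance matrix
        labels = d.argmin(axis=1)                     # Step 3-1 / Eq. (A-1)
        for j in range(K):
            members = X[labels == j]
            if len(members) == 0:
                continue
            R = np.linalg.norm(members - Z[j], axis=1, keepdims=True)
            r = rng.random((len(members), D))         # random vectors r_i in [0, 1]^D
            # Eqs. (A-2)/(A-4) with M_i = M_j = 1
            a = (G / len(members)) * (r * (members - Z[j]) / (R**p + eps)).sum(axis=0)
            V[j] = V[j] + a                           # Eq. (A-5)
            Z[j] = Z[j] + V[j]                        # Eq. (A-6)
    return Z, labels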

Appendix B. A brief review on GSA and BQIGSA

B-1. Gravitational Search Algorithm (GSA) [48]

To describe the GSA, consider a system with N objects (agents) in which the position of object $i$ is defined as $X_i(t) = \bigl[ x_i^1(t), x_i^2(t), \ldots, x_i^D(t) \bigr]$, $i = 1, 2, \ldots, N$, where $x_i^d(t)$ is the position of the $i$-th object in the $d$-th dimension and D is the dimension of the search space. The mass of each object/agent is calculated after computing the current population's fitness as follows:

$$M_i(t) = \frac{fit_i(t) - worst(t)}{\sum_{i=1}^{N} \bigl( fit_i(t) - worst(t) \bigr)} \qquad (B\text{-}1)$$

where $M_i(t)$ and $fit_i(t)$ represent the mass and the fitness value of agent $i$ at time $t$, and $worst(t)$ is the worst fitness value in the population. Then, the next position of each member of the population is calculated using the following equations:

$$F_i^d(t) = \sum_{j \in kbest,\, j \ne i} rand_j \, G(t) \, \frac{M_j(t)\, M_i(t)}{R_{ij}(t) + \varepsilon} \bigl( x_j^d(t) - x_i^d(t) \bigr) \qquad (B\text{-}2)$$

$$a_i^d(t) = \frac{F_i^d(t)}{M_i(t)} \qquad (B\text{-}3)$$

$$v_i^d(t+1) = rand_i \times v_i^d(t) + a_i^d(t) \qquad (B\text{-}4)$$

$$x_i^d(t+1) = x_i^d(t) + v_i^d(t+1) \qquad (B\text{-}5)$$

where $F_i^d$, $a_i^d$ and $v_i^d$ denote the resultant force acting on the $i$-th object and the acceleration and velocity of the $i$-th object in the $d$-th dimension, respectively; $rand_i$ and $rand_j$ are two uniformly distributed random numbers in the interval [0, 1]; $\varepsilon$ is a small value; and $R_{ij}$ is the Euclidean distance between the two agents $i$ and $j$. $kbest$ is the set of the first K agents with the best fitness values and largest masses; it is a function of time, initialized to $K_0$ at the beginning and decreased with time. Here, $K_0$ is set to N (the population size) and is decreased linearly to 1. The gravitational constant, $G$, is a decreasing function of time which is set to $G_0$ at the beginning and decreases exponentially towards zero at the last iteration (Figs. B-1 and B-2).
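A minimal sketch of one GSA iteration following Eqs. (B-1)-(B-5) for a maximization problem (our own illustration; the exponential G(t) schedule and the linear kbest schedule are assumptions in the spirit of [48]):

import numpy as np

def gsa_step(X, V, fitness, t, T, G0=100.0, alpha=20.0, eps=1e-9, rng=np.random):
    """One GSA iteration. X, V: (N, D) arrays of positions and velocities; fitness is maximized."""
    N, D = X.shape
    fit = np.array([fitness(x) for x in X])
    worst = fit.min()
    M = (fit - worst) / ((fit - worst).sum() + eps)              # Eq. (B-1), normalized masses
    G = G0 * np.exp(-alpha * t / T)                              # assumed exponential decay of G
    K = max(1, int(round(N - (N - 1) * t / T)))                  # kbest shrinks linearly from N to 1
    kbest = np.argsort(fit)[::-1][:K]                            # indices of the K fittest agents
    F = np.zeros((N, D))
    for i in range(N):
        for j in kbest:
            if j == i:
                continue
            R = np.linalg.norm(X[j] - X[i])
            F[i] += rng.rand() * G * M[j] * M[i] / (R + eps) * (X[j] - X[i])   # Eq. (B-2)
    a = F / (M[:, None] + eps)                                   # Eq. (B-3)
    V = rng.rand(N, 1) * V + a                                   # Eq. (B-4)
    X = X + V                                                    # Eq. (B-5)
    return X, V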

B-2. Binary Quantum-Inspired Gravitational Search Algorithm (BQIGSA)

The BQIGSA's steps are as follows [47]:

Step 1 (Initialization): Set $t = 0$ and $SB(t) = \{\}$, where SB is an archive maintaining the best solution obtained for each object by the observation process through the iterations. In this step, a swarm $Q(t) = \{ q_1(t), q_2(t), \ldots, q_N(t) \}$ consisting of N Q-bit objects in a D-dimensional search space is randomly generated, where the structure of the $i$-th Q-bit object is:

$$q_i(t) = \bigl[ q_i^1(t) \;\; q_i^2(t) \;\; \cdots \;\; q_i^D(t) \bigr] = \begin{bmatrix} \alpha_i^1(t) & \alpha_i^2(t) & \cdots & \alpha_i^D(t) \\ \beta_i^1(t) & \beta_i^2(t) & \cdots & \beta_i^D(t) \end{bmatrix} \qquad (B\text{-}6)$$

The value $\alpha_i^d$, $i = 1, 2, \ldots, N$ and $d = 1, 2, \ldots, D$, is randomly chosen between $-1$ and $1$, and the value $\beta_i^d$ is either positive or negative and is selected such that $\bigl(\beta_i^d\bigr)^2 = 1 - \bigl(\alpha_i^d\bigr)^2$.

Step 2 (Observation): In this step, the current solutions $SW(t) = \{ X_1(t), X_2(t), \ldots, X_N(t) \}$, which are binary solutions, are made by independently observing each Q-bit of $Q(t)$, where $X_i(t) = \bigl[ x_i^1(t), x_i^2(t), \ldots, x_i^D(t) \bigr]$, $i = 1, 2, \ldots, N$, and $x_i^d(t) \in \{0, 1\}$, $d = 1, 2, \ldots, D$.

Step 3 (Fitness evaluation): Each current solution $X_i(t)$ of the swarm $SW(t)$ is evaluated to obtain a measure of its fitness.

Step 4 (Updating $SB(t)$): Each solution in $SB(t) = \{ B_1(t), B_2(t), \ldots, B_N(t) \}$ is compared with its corresponding solution in $SW(t)$ and is replaced by it if $fit(X_i(t)) > fit(B_i(t))$, $i = 1, 2, \ldots, N$, where $fit(B_i(t))$ denotes the fitness value and $B_i(t) = \bigl[ b_i^1(t), b_i^2(t), \ldots, b_i^D(t) \bigr]$, $i = 1, 2, \ldots, N$. It is worth noting that at $t = 0$ all solutions in $SW(0)$ are transferred into $SB(0)$ to construct it.

Step 5 (Computing $M_i(t)$): This step calculates $M_i(t)$, $i = 1, 2, \ldots, N$, based on the fitness values of the best solutions in the archive $SB(t)$ using Eq. (B-1).

Fig. B-1. Pseudo-code for GSA.

Fig. B-2. Pseudo-code for BQIGSA.

Step 6 (Updating $Q(t)$): To use the modified quantum rotation gate (RQ-gate), the angular velocity is needed. It describes the speed of rotation of an object around the axis about which the object rotates. The angular and linear velocities are related to each other by:

$$\omega = \frac{v}{r} \qquad (B\text{-}7)$$

where $\omega$, $v$ and $r$ indicate the angular velocity, the linear velocity and the radius of the circular system, respectively. Hence, to calculate the angular velocity for each element of $Q(t)$, Eqs. (B-8) and (B-9) are used as follows:

$$\alpha_i^d(t) = \sum_{j \in kbest,\, j \ne i} rand_j \, G(t) \, \frac{M_j(t)}{R_{ij}(t) + \varepsilon} \bigl( b_j^d(t) - x_i^d(t) \bigr) \qquad (B\text{-}8)$$

$$\omega_i^d(t+1) = rand_i \, \omega_i^d(t) + \alpha_i^d(t) \qquad (B\text{-}9)$$

where $kbest$ is the set of the first K solutions in the archive $SB(t)$ with the best fitness values and largest masses; it is a function of time, initialized to N at the beginning and decreased linearly to 1 at the end. $R_{ij}(t)$ is computed as the Hamming distance between the two agents $i$ and $j$ in the binary space, normalized by dividing by the string length D. Therefore, depending on which quarter the current position of $\theta_i^d$ lies in, $\Delta\theta_i^d$ is computed as follows:

$$\Delta\theta_i^d = \begin{cases} +\,\omega_i^d(t+1) & \text{if } \alpha_i^d(t+1)\,\beta_i^d(t+1) \ge 0 \\ -\,\omega_i^d(t+1) & \text{if } \alpha_i^d(t+1)\,\beta_i^d(t+1) < 0 \end{cases} \qquad (B\text{-}10)$$

Now, we can use Eq. (B-11) to update each element of $Q(t)$:

$$\begin{bmatrix} \alpha_i^d(t+1) \\ \beta_i^d(t+1) \end{bmatrix} = \begin{bmatrix} \cos(\Delta\theta_i^d) & -\sin(\Delta\theta_i^d) \\ \sin(\Delta\theta_i^d) & \cos(\Delta\theta_i^d) \end{bmatrix} \begin{bmatrix} \alpha_i^d(t) \\ \beta_i^d(t) \end{bmatrix}, \quad d = 1, 2, \ldots, D \qquad (B\text{-}11)$$

Step 7 (Repeat): Repeat Steps (2)-(6) until the stopping criterion is met.
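A fragmentary sketch of the observation step and the rotation-gate update of Eq. (B-11) follows (our own illustration; the angular-velocity bookkeeping of Eqs. (B-8)-(B-10) is omitted and Δθ is passed in directly, and the observation convention P(x = 1) = β² is our assumption):

import numpy as np

def observe(alpha):
    """Step 2: collapse each Q-bit to a binary value; assumed convention P(x = 1) = beta^2 = 1 - alpha^2."""
    return (np.random.rand(*alpha.shape) >= alpha**2).astype(int)

def rotate(alpha, beta, dtheta):
    """Eq. (B-11): apply the quantum rotation gate element-wise to the (alpha, beta) amplitudes."""
    new_alpha = np.cos(dtheta) * alpha - np.sin(dtheta) * beta
    new_beta = np.sin(dtheta) * alpha + np.cos(dtheta) * beta
    return new_alpha, new_beta

# Toy usage: one object with D = 4 Q-bits, all initialized to an equal superposition
alpha = np.full(4, 1 / np.sqrt(2))
beta = np.sqrt(1 - alpha**2)
x = observe(alpha)                                     # a binary candidate solution
alpha, beta = rotate(alpha, beta, dtheta=np.full(4, 0.05))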

References

[1] A. Hanbury, A survey of methods for image annotation, J. Vis. Lang. Comput. 19(2008) 617–627.

[2] E. Kuric, M. Bielikova, ANNOR: efficient image annotation based on combining local and global features, Comput. Graph. 47 (2015) 1–15.

[3] S.H. Amiri, M. Jamzad, Automatic image annotation using semi-supervised generative modeling, Pattern Recognit. 48 (1) (2015) 174–188.

[4] Y. Rui, T.S. Huang, M. Ortega, S. Mehrotra, Relevance feedback: a power tool for interactive content-based image retrieval, IEEE Trans. Circuits Syst. Video Technol. 8 (5) (1998) 644–655.

[5] H. Nezamabadi-pour, E. Kabir, Concept learning by fuzzy k -NN classificationand relevance feedback for efficient image retrieval, Expert Syst. Appl. 36 (3)(2009) 5948–5954.

[6] E. Rashedi, H. Nezamabadi-pour, S. Saryazdi, A simultaneous feature adapta-tion and feature selection method for content-based image retrieval systems,Knowl.-Based Syst. 39 (2013) 85–94.

[7] E. Rashedi, H. Nezamabadi-pour, S. Saryazdi, Information fusion between shortterm learning and long term learning in content based image retrieval sys-tems, Multimedia Tools Appl. 74 (2015) 3799–3822.

[8] A. Shamsi, H. Nezamabadi-pour, S. Saryazdi, A short-term learning approachbased on similarity refinement in content-based image retrieval, MultimediaTools Appl. 72 (2014) 2025–2039.

[9] J. Liu, B. Wang, H. Lu, S. Ma, A graph-based image annotation framework,Pattern Recognit. Lett. 29 (4) (2008) 407–415.

[10] Y. Wang, S. Gong, Refining image annotation using contextual relations be-tween words, In CIVR '07: Proceedings of the 6th ACM international Con-ference on Image and video retrieval, pp. 425–432, New York, NY, USA, 2007.

[11] Y. Feng, M. Lapata, Topic Models for Image Annotation and Text Illustration, in:Human Language Technologies 2010 Annual Conference of the North AmericanChapter of the Association for Computational Linguistics, no. June, pp. 831–839, 2010.


[12] X. Qi, Y. Han, Incorporating multiple SVMs for automatic image annotation,Pattern Recognit. 40 (2007) 728–741.

[13] A. Makadia, V. Pavlovic, S. Kumar, A new baseline for image annotation, in:ECCV '08: Proceedings of the 10th European Conference on Computer Vision,pp. 316–329, 2008.

[14] J.F. Serrano-Talamantes, C. Avilés-Cruz, J. Villegas-Cortez, J.H. Sossa-Azuela,Self organizing natural scene image retrieval, Expert Syst. Appl. 40 (2013)2398–2409.

[15] S. Sun, A survey of multiview machine learning, Neural Comput. Appl. 23(2013) 2031–2038.

[16] C. Xu, D. Tao, C. Xu, A survey on multiview learning, arXiv Prepr. arXiv 1304(2013) 5634.

[17] C. Xu, D. Tao, C. Xu, Multiview intact space learning, IEEE Trans. Pattern Anal.Mach. Intell. 37 (12) (2015) 2531–2544.

[18] K. Pramod Sankar, C.V. Jawahar, Probabilistic reverse annotation for large scaleimage retrieval, in: Proceedings of the IEEE Computer Society Conference onComputer Vision and Pattern Recognition, 2007.

[19] X.X. Li, C.B. Sun, P. Lu, X.J. Wang, Y.X. Zhong, Simultaneous image classificationand annotation based on probabilistic model, J. China Univ. Posts Telecommun.19 (2) (2012) 107–115.

[20] A. Fakhari, A.M.E. Moghadam, Combination of classification and regression indecision tree for multi-labeling image annotation and retrieval, Appl. SoftComput. 13 (2) (2013) 1292–1302.

[21] P. Dong, K. Mei, N. Zheng, H. Lei, J. Fan, Training inter-related classifiers forautomatic image classification and annotation, Pattern Recognit. 46 (2013)1382–1395.

[22] Z. Lu, L. Wang, Learning descriptive visual representation for image classifi-cation and annotation, Pattern Recognit. 48 (2) (2015) 498–508.

[23] M.L. Zhang, Z.H. Zhou, A review on multi-label learning algorithms, IEEE Trans.Knowl. Data Eng. 26 (8) (2014) 1819–1837.

[24] G. Nasierding, G. Tsoumakas, A.Z. Kouzani, Clustering based multi-label clas-sification for image annotation and retrieval, in: IEEE International Conferenceon Systems, Man and Cybernetics (SMC 2009), pp. 4514–4519, 2009.

[25] M. Wang, X. Zhou, T.S Chua, Automatic image annotation via local multi-labelclassification, in: Proceedings of the 2008 international conference on Con-tent-based image and video retrieval, pp. 17–26, ACM, 2008.

[26] W. Liu, D. Tao, Multiview hessian regularization for image annotation, IEEETrans. Image Process. 22 (7) (2013) 2676–2687.

[27] W. Liu, D. Tao, J. Cheng, Y. Tang, Multiview hessian discriminative sparsecoding for image annotation, Comput. Vision Image Underst. 118 (2014)50–60.

[28] Y. Mori, H. Takahashi, R. Oka, Image-to-word transformation based on dividingand vector quantizing images with words, in: First International Workshop onMultimedia Intelligent Storage and Retrieval Management (in conjunctionwith ACM Multimedia Conference), Orlando, FL, USA, 1999.

[29] P. Duygulu, K. Barnard, N. de Freitas, D. Forsyth, Object recognition as machinetranslation: learning a lexicon for a fixed image vocabulary, in: Proceedings ofthe 7th European Conf. on Computer Vision, 2002, pp. 97–112.

[30] V. Lavrenko, R. Manmatha, J. Jeon, A Model for Learning the Semantics ofPictures, in: Proceedings of the Advances in Neutral Information ProcessingSystems, 2003.

[31] J. Jeon, V. Lavrenko, M. Manmath, Automatic image annotation and retrievalusing cross-media relevance models, in: Proceedings of the 26th Annual In-ternational ACM SIGIR Conf. on Research and Development in InformationRetrieval, SIGIR '03, Toronto, Canada, July 28–August 01, 2003, pp. 119–126.

[32] J. Liu, B. Wang, M. Li, Z. Li, W. Ma, H. Lu, et al., Dual cross-media relevancemodel for image annotation, in: Proceedings of the 15th International Con-ference on Multimedia. MULTIMEDIA '07, pp. 605–614, 2007.

[33] Y. Zhen, D.Y. Yeung, A probabilistic model for multimodal hash function.Learning, in: Proceedings of the 18th ACM SIGKDD international Conferenceon knowledge discovery and data mining, KDD '12, 2012.

[34] S. Feng, R. Manmatha, V. Laverenko, Multiple Bernoulli Relevance Models forImage and Video Annotation, in: IEEE Computer Society Conference onComputer Vision and Pattern Recognition, pp. 1002–1009, 2004.

[35] D. Metzler, R. Manmatha, An inference network approach to image retrieval,in: P. Enser, Y. Kompatsiaris, N. OConnor, A. Smeaton, A. Smeulders (Eds.),Image and video retrieval. Lecture Notes in Computer Science, vol. 3115, 2005,pp. 42–50.

[36] A. Yavlinsky, E. Schofield, S. Rüger, Automated image annotation using globalfeatures and robust nonparametric density estimation, in: Proceedings of the4th international Conference on image and video retrieval, CIVR'05, pp. 507–517, 2005.

[37] J. Hu, K.M. Lam, An efficient two-stage framework for image annotation,Pattern Recognit. 46 (2013) 936–947.

[38] P. Guo, Z. Jiang, S. Lin, Y. Yao, Combining LVQ with SVM technique for imagesemantic annotation, Neural Comput. Appl. 21 (2012) 735–746.

[39] Y. Zhao, Y. Zhao, Z. Zhu, TSVM-HMM: transductive SVM based hidden Markovmodel for automatic image annotation, Expert Syst. Appl. 36 (6) (2009)9813–9818.

[40] C. Wang, N. Komodakis, N. Paragios, Markov Random Field modeling, in-ference & learning in computer vision & image understanding: a survey,Comput. Vision. Image Underst. 117 (11) (2013) 1610–1627.

[41] G. Carneiro, A.B. Chan, P.J. Moreno, N. Vasconcelos, Supervised learning ofsemantic classes for image annotation and retrieval, IEEE Trans. Pattern Anal.Mach. Intell. 29 (3) (2007) 394–410.

[42] M. Ivasic-kos, I. Ipsic, S. Ribaric, A knowledge-based multi-layered image an-notation system, Expert Syst. Appl. 42 (24) (2015) 9539–9553.

[43] J. Verbeek, M. Guillaumin, T. Mensink, C. Schmid, Image annotation with TagProp on the MIRFLICKR set, 2011.

[44] S. Bahrami, M.S. Abadeh, Automatic Image Annotation Using an EvolutionaryAlgorithm (IAGA), in: 2014 7th International Symposium on Tele-communications (IST’2014), 2014, pp. 320–325.

[45] D. Zhang, M. Islam, G. Lu, A review on automatic image annotation techniques,Pattern Recognit. 45 (1) (2012) 346–362.

[46] A. Bahrololoum, H. Nezamabadi-pour, S. Saryazdi, A data clustering approachbased on universal gravity rule, Eng. Appl. Artif. Intell. 45 (2015) 415–428.

[47] H. Nezamabadi-pour, A quantum-inspired gravitational search algorithm forbinary encoded optimization problems, Eng. Appl. Artif. Intell. 40 (2015)62–75.

[48] E. Rashedi, H. Nezamabadi-pour, S.G. Saryazdi, A gravitational search algo-rithm, information sciences, Inf. Sci. 179 (13) (2009) 2232–2248.

[49] M. Grubinger, P. Clough, H. Muller, T. Deselears, The IAPR TC-12 Benchmark: anew evaluation resource for visual information system, in: Proceedings of theInternational Workshop OntoImage ’06 Language Resources for Content-Based Image Retrieval, Genoa, Italy, 2006, pp. 13–23.

[50] L. von Ahn and L. Dabbish, Labeling images with a computer game, In Proc.SIGCHI Conference on Human Factors in Computing Systems, 2004, pp. 319–326.

Abbas Bahrololoum received his B.S. degree in Computer Engineering from Isfahan University of Technology, Isfahan, Iran, in 1996, and his M.Sc. degree in Computer Engineering from Shiraz University, Shiraz, Iran, in 1999. He is currently pursuing his Ph.D. degree in the Electrical Engineering Department of Shahid Bahonar University of Kerman, Kerman, Iran. His research interests include computer vision, image retrieval and annotation, and content analysis and classification.

Hossein Nezamabadi-pour received his B.Sc. degree in Electrical Engineering from Shahid Bahonar University of Kerman in 1998, and his M.Sc. and Ph.D. degrees in Electrical Engineering from Tarbiat Modares University, Tehran, Iran, in 2000 and 2004, respectively. In 2004, he joined the Department of Electrical Engineering at Shahid Bahonar University of Kerman, Kerman, Iran, as an Assistant Professor, and was promoted to full Professor in 2012. Dr. Nezamabadi-pour is the author or co-author of more than 360 peer-reviewed journal and conference papers. His interests include image processing, pattern recognition, soft computing, and evolutionary computation.