
Signal Processing: Image Communication 76 (2019) 283–294


A probabilistic topic model for event-based image classification and multi-label annotation✩

Lakhdar Laib a, Mohand Saïd Allili b,∗, Samy Ait-Aoudia a

a Laboratoire des Méthodes de Conception des Systèmes, École Nationale Supérieure d'Informatique, BP 68M, 16309, Oued-Smar, Alger, Algeria
b Université du Québec en Outaouais, Département d'Informatique et d'Ingénierie, 101, St-Jean Bosco, Gatineau, QC, J8X 3X7, Canada

ARTICLE INFO

Keywords:
Event recognition
Image annotation
Topic modeling
Convolutional neural nets

ABSTRACT

We propose an enhanced latent topic model based on latent Dirichlet allocation and convolutional neural nets for event classification and annotation in images. Our model builds on the semantic structure relating events, objects and scenes in images. Based on initial labels extracted from convolutional neural networks (CNNs), and possibly user-defined tags, we estimate the event category and final annotation of an image through a refinement process based on the expectation–maximization (EM) algorithm. The EM steps progressively ascertain the event category and refine the final annotation of the image. Our model can be thought of as a two-level annotation system, where the first level derives the image event from CNN labels and image tags, and the second level derives the final annotation consisting of event-related objects/scenes. Experimental results show that the proposed model yields better classification and annotation performance on two standard datasets: UIUC-Sports and WIDER.

1. Introduction

The massive and rapid growth of the Internet and smart-phones has led to an ever-increasing amount of shared images and videos on social media (e.g., Facebook, Flickr). This has increased the difficulty of indexing and retrieving visual content related to events, which is an important issue for content-based image retrieval and event monitoring. To manage large databases, images should be complemented with metadata (i.e., annotations) summarizing their semantic content. Unfortunately, most images do not have clean textual labels reflecting a desirable annotation. Several learning techniques have been used to automatically mine these descriptions from images [1,2]. However, they have difficulty extracting coherent and structured semantic descriptions related to events depicted in images.

Events in single images can be defined as real-world activities (sport, social, etc.) performed by interacting objects in a given environment (i.e., scene). Event recognition aims at deriving a single label related to the depicted activity [3–5], whereas image annotation tries to associate multiple labels with an image reflecting its semantic content (e.g., objects, scenes, weather, actions, events, etc.) [6–10]. Both problems, however, are strongly related, since knowing an event class can often give a hint about the image content in terms of objects and scenes. Likewise, objects and scenes are important cues for identifying events captured in images [8,11,12]. Recently, significant progress has been made for

✩ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.image.2019.05.012.
∗ Corresponding author.

E-mail address: [email protected] (M.S. Allili).

event recognition in videos or collections of images [13–16]. However, little has been done for tackling simultaneous event classification and annotation in single images [8,17].

Early image annotation methods were based on the classification of low-level features such as SIFT, HOG and texture [2]. Recently, generative models, such as latent Dirichlet allocation (LDA) [6,18], have been successfully used in image classification/annotation [8,14,19–21]. To reduce the semantic gap between low-level features (e.g., regions, patches) and high-level semantics, these models introduce latent topics which facilitate inferring high-level concepts such as scene categories [5,20,21] or annotations [2,22–24]. To achieve simultaneous classification and annotation, supervised LDA (sLDA) has been proposed in [8,18], which links image classes and annotation labels to the same latent topics. This model was extended in [20,21] for recognizing social events from tagged images. However, by assuming classes and labels are independent, derived annotations can deviate from the scope of an event. Another limitation is that these models rely heavily on low-level features for image description (e.g., regions, patches, SIFT, etc.) [25–27]. Since objects/scenes can be composed of multiple regions with varying position, shape and color, associating semantic labels independently with individual regions can increase the number of false annotations by labeling regions belonging to the same object/scene differently [28]. Finally, some methods have successfully integrated high-level knowledge, such as context and location, for event

https://doi.org/10.1016/j.image.2019.05.012
Received 10 February 2019; Received in revised form 2 May 2019; Accepted 24 May 2019; Available online 28 May 2019
0923-5965/© 2019 Elsevier B.V. All rights reserved.


classification in multiple images [29,30]. However, since they rely on low-level features, which are void of semantic information, they are not guaranteed to achieve good event recognition.

In this paper, we propose a framework for simultaneous event classification and annotation in still images. We rely on the assumption that the event category can provide valuable information for image annotation and vice versa. The event category can indeed narrow the scope of relevant annotations and reduce the probability of irrelevant ones. Building on recent advances in convolutional neural nets (CNNs) for object and scene recognition [31,32], we propose a probabilistic generative model combining latent scene/object topics for predicting events in images and deriving event-related annotations. Oftentimes, images come with noisy tags which are imprecise and do not always reflect the high-level image semantics. While some of the tags can be helpful for inferring the event category of an image, others should be filtered out. Our model achieves this goal by encoding a two-level semantic hierarchy whose levels reinforce each other during inference. The first level reflects the image content in terms of objects and scenes, while the second one concerns the event/activity depicted in the image. These levels reinforce each other through the Expectation–Maximization (EM) algorithm, which estimates the most likely event category while producing image annotations consistent with the event.

To train our model, we use a collection of annotated images for the different event categories. Given a newly observed image, either unlabeled or accompanied by (noisy) tags, we first use CNNs to extract object and scene label candidates, which are used jointly with the tags to estimate the event category of the image. The labels and tags are then iteratively ranked to derive the final object/scene annotation of the image. We show that our model not only disambiguates CNN results, but also provides accurate event categorization and event-specific annotation. We conducted several experiments to assess the performance of the proposed approach on the well-known UIUC-Sports and WIDER datasets. The obtained results demonstrate that our method enhances event classification in general and generates more consistent annotations. Our main contributions in this paper can be summarized in the following points:

• We propose a generative model using event-based supervised LDA to jointly model object, scene and event information in single images. By assuming events are related to specific objects and scenes through latent topics, we propose an event-specific conditional distribution for each topic to generate objects and scenes in images. Finally, latent topics are linked to events through probabilistic logistic regression.

• Our model seamlessly integrates visual and textual information by combining CNN-based image outputs and user-defined tags. While CNN outputs represent the visual content, tags can give visual or abstract descriptions of the image content. This has the advantage of enforcing disambiguation of noisy labels that can be generated by CNNs [33] or given by users, by discarding irrelevant ones once the event class is fixed.

• For a newly seen image, possibly accompanied by noisy user tags, CNN outputs are first calculated and used along with the tags to define candidate labels and infer latent object/scene topics. Then, an iterative procedure, based on the EM algorithm, is used to ascertain the event category and relevant annotations of the image. In each EM iteration, event-conditional likelihoods for labels are calculated and the top-ranked ones are retained for the final image annotation.

The rest of this paper is organized as follows: Section 2 reviews literature relevant to our work. Sections 3 and 4 present the proposed model for event classification and image annotation, and its training and inference procedures. Section 5 presents experiments validating the proposed approach. We end the paper with conclusions and some future work perspectives.

2. Related work

Since our work deals with simultaneous event recognition and annotation in single images, we give a brief review of proposed methods dealing with event recognition and multi-label image annotation. For event recognition in videos, we refer the reader to more dedicated works such as [13,34].

2.1. Event classification in still images

Social event classification has gained increasing attention in computer vision research [27]. Web platforms such as Flickr and Instagram usually contain images capturing personal events or activities such as sports, politics, etc. Event recognition on its own is a challenging problem and has been investigated recently in several research works [14,26,29,35–37]. Unlike action recognition, which focuses on describing the behavior of a single person [34,38], event recognition tries to find meaningful activities involving one or several objects occurring within specific environments [14,26]. Since objects and scenes are the building blocks of events, identifying them is a key issue towards good event recognition [15,26].

Proposed approaches for event recognition can rely exclusively on visual information [14,26,39,40] or on a combination of visual and textual information [16,20,21,35]. To recognize sport events, Li et al. [26] proposed a graphical model integrating object and scene information extracted from local image patches. The authors in [29] used hidden Markov models (HMMs) to detect sub-events occurring in predefined time intervals, with visual information represented through low-level SURF features [41]. Most of these works have shown success for event classification in single or multiple images. However, since visual information is represented by local features (e.g., interest points, regions), variability in object/scene appearance can confuse event classification.

With the proliferation of social media, images often come with metadata in the form of tags, HTML or accompanying text that can be exploited for enhancing event recognition [20,21]. In [20], the authors proposed a model based on supervised LDA (sLDA) [18] to combine visual and text features for social event classification in Flickr images. An extension of this model was proposed in [21] by separating visual concepts (e.g., chair, car, etc.) from abstract ones (e.g., economy, politics, etc.), and more accurate event recognition results were reported. Most of these methods, however, are not efficient for annotation because of the semantic gap created by using low-level features for visual representation.

2.2. Multi-label image annotation (MLIA)

Early approaches for MLIA use relevance feedback in image retrieval for labeling images [2,42–44]. However, this is a time-consuming process when dealing with large datasets and label sets. To reduce this limitation, learning methods have been proposed for mapping low-level features (e.g., SIFT, region, GIST, etc.) to high-level concepts for image annotation [24,45]. Discriminative methods such as support vector machines (SVM) [46,47], decision trees (DT) [9,48] and neural networks (NN) [4] have been used to train models to recognize various concepts in still images. In [49], the authors proposed a patch-based approach to event recognition using multiple instance learning. However, given that a separate classifier is trained for each concept, these approaches are not scalable to a large number of labels and images. In addition, training independent concept classifiers can introduce redundancy and noise in the image annotation [24].

To overcome the above limitations, generative methods have been proposed for multi-label annotation [3,11,45]. These methods associate multiple labels with an image by maximizing the joint probability distribution over low-level image features and concept labels [8,22,50]. For example, Gaussian mixture models have been used to model


concepts in images where an image is considered as a bag of instances (e.g., regions), which is labeled "positive" if it contains a given concept and "negative" otherwise [19,50–52]. A new image is then annotated with the concepts having the highest likelihoods. However, since correlation between concepts is ignored in this method, it can lead to incoherent annotations. Besides, parameter estimation of GMMs becomes complex with a high number of concepts.

With the success of latent topic models for text analysis [53,54], several methods have been proposed for associating semantic concepts with images [3,5,6,8,20,21]. Latent Dirichlet allocation (LDA) [53] and its extensions are among the most investigated methods in this regard. Early works proposed mmLDA (multi-modal LDA) [22] and corr-LDA (correspondence LDA) [6] for associating caption words with image regions. mmLDA assumes image features and words are independent, whereas corr-LDA assumes that labels and image features are generated from the same latent topics. In [23,55], a mapping is estimated between latent visual and textual topics to infer image scene categories. However, a one-to-one correspondence between these topics is not always realistic since annotation can be based on non-visual aspects of the image. To remedy this issue, approaches such as sLDA [18] have been proposed to directly link classes to image features and annotations [3,5,8,17,21]. Such approaches reduce the learning complexity of topic distributions and improve classification accuracy [18]. Early sLDA models focused on deriving scene categories in images by analyzing local patches/regions [3,5,8] or the entire image [56–60]. However, given that these models are based on low-level image features, they can confuse discrimination of object/scene instances appearing differently, which can adversely affect image annotation.

2.3. Deep learning and image classification/annotation

Deep neural networks (DNNs) and, in particular, convolutional neural networks (CNNs) have recently achieved great success in visual recognition [31,32,61]. This success is largely attributable to the availability of large annotated datasets (e.g., Pascal [62], ImageNet [63]) and efficient training implementations on modern powerful GPUs. In addition, CNNs share several properties with the human visual system for integrating low/mid/high-level image features for recognition [64].

Building on this success, [14] used CNNs for detecting objects and scenes in images, which are then used to assign event classes to a collection of images. On the other hand, several annotation methods based on CNNs have been proposed [65–69]. Given the success of deep learning for natural language modeling, methods have been proposed to generate annotations for images by leveraging CNN responses and sequential text modeling using recurrent neural networks (RNNs) [67,69–72]. These methods achieve semantic descriptions close to human performance. However, they require a huge amount of images and valid text descriptions for their training.

3. The proposed model

Humans can generally recognize, in a single glance at an image, individual objects and the different environments of the scene, and infer complex social interactions/activities between objects [73]. On the one hand, detecting low-level semantic concepts (i.e., objects, scenes, pose) can facilitate inference of high-level concepts such as individual actions and group events [26]. On the other hand, knowing the type of event captured in an image (e.g., soccer or basketball) can help disambiguate instances of similar object and scene categories (e.g., balls, players, etc.) [74]. Although topic-based approaches have been proposed for image classification/annotation, most of them are based on low-level features for image description (e.g., regions, patches, SIFT), which lack the semantic meaning required for reliably identifying semantic concepts in the image. In addition, annotations describing semantic concepts (e.g., objects and scenes) are often derived from separate latent topics regardless of the event classes.

To address these limitations, we propose to harness the expressivity of CNNs in visual recognition and topic modeling for simultaneous event classification and annotation in single images. Our model takes its roots from supervised LDA [5,8,18], deriving event classes and image annotations through an iterative process based on the EM algorithm. To obtain image annotations that are consistent with the event category, we use a separate class-conditional distribution for each event category. A multi-level semantic structure is proposed to link low-level semantic concepts (i.e., objects and scenes) to high-level concepts (i.e., events, activities), which reinforce each other during the inference process.

We train our model using a large annotated dataset. Fig. 1 shows our graphical model structure encoding dependencies between the different semantic variables and latent topics. The graph contains two parallel modules inferring object and scene information separately, which are linked to the event classes and the final annotation of the image.

More formally, let $K$, $N_o$ and $N_s$ be the numbers of event, object and scene categories in our model, and $y \in \{1, 2, \dots, K\}$ be a discrete random variable representing the event category. We assume that the posterior distribution of each class $y$ is governed by the parameter vector $\eta_y$, and that we have $T_o$ object and $T_s$ scene latent topics, respectively. To enhance the topic representativeness of each class, we suppose that each event category has a separate word topic-conditional distribution, as suggested in the class-specific supervised LDA (css-LDA) model proposed in [5]. Our model therefore has the following generative process for an image and its associated event category:

• Draw proportions for object and scene topics: $\theta \sim Dir(\alpha)$ and $\Psi \sim Dir(\lambda)$.

• Having $M$ object instances and $L$ scene instances in the image:

  – For each object instance $o_m$, $m = 1, \dots, M$, draw an object topic $z^{(o)}_m \sim Mult(\theta)$.

  – For each scene instance $s_l$, $l = 1, \dots, L$, draw a scene topic $z^{(s)}_l \sim Mult(\Psi)$.

• Draw an event class $y^* \sim \arg\max_y \big\{ softmax([\bar{z}^{(o)}, \bar{z}^{(s)}]^T, \eta_y) \big\}$, where we have:

  $$\bar{z}^{(o)} = \frac{1}{M} \sum_{m=1}^{M} z^{(o)}_m; \qquad \bar{z}^{(s)} = \frac{1}{L} \sum_{l=1}^{L} z^{(s)}_l$$

• Having the image event class label $y^*$:

  – Draw each object instance $o_m \sim Mult\big(\Phi_{z^{(o)}_m, y^*}\big)$, $m = 1, \dots, M$.

  – Draw each scene instance $s_l \sim Mult\big(\Omega_{z^{(s)}_l, y^*}\big)$, $l = 1, \dots, L$.

where $Dir(\cdot)$ and $Mult(\cdot)$ stand for the Dirichlet and Multinoulli distributions, respectively. For more convenient notation, we take $z^{(o)}_m$ and $z^{(s)}_l$ to be binary vectors of $T_o$ and $T_s$ dimensions, representing the numbers of object and scene topics, respectively. The softmax function provides the following posterior probability for each event category:

$$p(y \,|\, \bar{z}^{(o)}, \bar{z}^{(s)}, \eta) = \exp(\eta_y^T \bar{z}) \Big/ \sum_{i=1}^{K} \exp(\eta_i^T \bar{z}), \quad (1)$$

where $\bar{z} = [\bar{z}^{(o)}; \bar{z}^{(s)}]$ is a $(T_o + T_s)$-dimensional vector representing the empirical proportions of object and scene topics in the image. Therefore, as in [18], the object/scene topic representation determines the class label of the image. However, as will be shown in the inference process, each class has its own topic-conditional distribution, enabling more accurate event-specific annotation.
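To make the generative process above concrete, the following minimal sketch (not the authors' code) simulates one image with NumPy; all sizes, hyperparameters and the "trained" parameters $\eta$, $\Phi$, $\Omega$ are hypothetical placeholders drawn at random.

```python
# Illustrative sketch of the generative process (assumed sizes and random parameters).
import numpy as np

rng = np.random.default_rng(0)

K, T_o, T_s = 8, 20, 20          # event classes, object topics, scene topics
N_o, N_s = 1000, 205             # object and scene vocabularies
M, L = 5, 3                      # object/scene instances in one image

alpha, lam = np.full(T_o, 0.5), np.full(T_s, 0.5)     # Dirichlet hyperparameters
eta = rng.normal(size=(K, T_o + T_s))                 # class parameters (assumed trained)
Phi = rng.dirichlet(np.ones(N_o), size=(T_o, K))      # object word-topic dist., per class
Omega = rng.dirichlet(np.ones(N_s), size=(T_s, K))    # scene word-topic dist., per class

# 1) Draw topic proportions.
theta = rng.dirichlet(alpha)
psi = rng.dirichlet(lam)

# 2) Draw a topic for every object and scene instance.
z_o = rng.choice(T_o, size=M, p=theta)
z_s = rng.choice(T_s, size=L, p=psi)

# 3) Empirical topic proportions and event class via the softmax of Eq. (1).
z_bar = np.concatenate([np.bincount(z_o, minlength=T_o) / M,
                        np.bincount(z_s, minlength=T_s) / L])
logits = eta @ z_bar
post = np.exp(logits - logits.max()); post /= post.sum()
y = int(post.argmax())

# 4) Draw object/scene words from the class-specific topic-conditional distributions.
objects = [rng.choice(N_o, p=Phi[z, y]) for z in z_o]
scenes = [rng.choice(N_s, p=Omega[z, y]) for z in z_s]
print(y, objects, scenes)
```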

4. Model training and inference for annotation

Real-world images on the Web are generally connected with rich documents, consisting of titles, captions and free user tags describing the images [75]. However, some of the tags can be imprecise or not


Fig. 1. Graphical representation of our model.

relevant for the image content, especially when the labeling should be related to specific events represented in the image [76]. Our model aims at generating event-related annotations describing images in terms of the objects and scenes they contain. To achieve this goal, we train our model using a labeled collection of images covering different event categories. The labels include event information as well as detailed image content in terms of objects and scenes. For each image, we added two to three labels representing non-visual aspects. Given an unlabeled or partially tagged new image, we aim to derive a complete and relevant annotation from evidence extracted only from the visual content and the (noisy) tags.

4.1. Model training

We suppose that each image in our training set is annotated with an event label and other labels describing concepts related either to visual aspects such as objects and scenes, or to non-visual event-related aspects such as object pose, action, weather, etc. Let $N$ be the number of images with their event classes and tags, $\mathcal{D} = \{(I_1, y_1, \mathbf{w}_1), (I_2, y_2, \mathbf{w}_2), \dots, (I_N, y_N, \mathbf{w}_N)\}$, where $y_d$ designates the event class of image $I_d$ and $\mathbf{w}_d$ the set of its associated tags. Note that since our training dataset is annotated, the learning phase of our model is performed only on textual information (i.e., annotations). Therefore, the variables $v^{(o)}$ and $v^{(s)}$ of our model depicted in Fig. 1 collapse to become the observed variables $w^{(o)}$ and $w^{(s)}$, respectively. In what follows, we use $\mathbf{w}^{(o)}_d$ and $\mathbf{w}^{(s)}_d$ to refer to the object/scene labels of the $d$th image, and $\mathbf{w}^{(o)}$ and $\mathbf{w}^{(s)}$ to refer to all the labels in the whole training dataset.

Exact inference to estimate the non-observed variables is often intractable in topic models [53]. Nonetheless, methods such as Gibbs sampling and variational inference can lead to approximate solutions. Following the sLDA model [18], we compute the conditional distribution of the latent structure given our model and the annotated images. Let $\mathbf{z}^{(o)}$ and $\mathbf{z}^{(s)}$ be the sets of latent object and scene variables, respectively, and $\mathbf{y}$ the set of all class categories in $\mathcal{D}$. We use collapsed Gibbs sampling by iteratively assigning values to the latent topic variables $\mathbf{z}^{(o)}$ and $\mathbf{z}^{(s)}$ conditioned on the previous state of the model, and finally estimating the unknown parameters $\Theta = \{\theta, \Psi, \Omega, \Phi, \eta\}$. For our model, the posterior joint probability of the variables can be factorized as follows:

$$p(\mathbf{z}^{(o)}, \mathbf{z}^{(s)}, \mathbf{y}, \mathbf{w}^{(o)}, \mathbf{w}^{(s)} \,|\, \Theta) \propto p(\mathbf{w}^{(o)} | \mathbf{z}^{(o)}, \mathbf{y}, \beta)\, p(\mathbf{w}^{(s)} | \mathbf{z}^{(s)}, \mathbf{y}, \xi)\, p(\mathbf{z}^{(o)} | \theta)\, p(\mathbf{z}^{(s)} | \Psi)\, p(\mathbf{y} | \bar{\mathbf{z}}, \eta)$$
$$\propto p(\mathbf{y} | \bar{\mathbf{z}}, \eta) \times \int p(\mathbf{z}^{(o)} | \theta) p(\theta | \alpha)\, d\theta \int p(\mathbf{w}^{(o)} | \mathbf{z}^{(o)}, \mathbf{y}, \Phi) p(\Phi | \beta)\, d\Phi \times \int p(\mathbf{z}^{(s)} | \Psi) p(\Psi | \lambda)\, d\Psi \int p(\mathbf{w}^{(s)} | \mathbf{z}^{(s)}, \mathbf{y}, \Omega) p(\Omega | \xi)\, d\Omega \quad (2)$$

Gibbs sampling uses an expectation–maximization (EM) strategy for parameter inference [77]:

In the E-step, the variables $\mathbf{z}^{(o)}$ and $\mathbf{z}^{(s)}$ are sampled given the parameters $\eta_1, \dots, \eta_K$. In the M-step, the event class parameters $\eta$ are updated by maximizing the joint likelihood of the variables. More specifically, the hidden variables $z^{(o)}_{di}$ and $z^{(s)}_{dj}$ are assigned for each object/scene word $w^{(o)}_{di}$ and $w^{(s)}_{dj}$, respectively, in image $I_d$ according to the following probabilities:

$$p(z^{(o)}_{di} = h, z^{(s)}_{dj} = l \,|\, \mathbf{z}^{(o)}_{\neg di}, \mathbf{z}^{(s)}_{\neg dj}, \mathbf{w}^{(o)}, \mathbf{w}^{(s)}, \mathbf{y}, \Theta) \propto p(z^{(o)}_{di} = h \,|\, \mathbf{z}^{(o)}_{\neg di}, \mathbf{w}^{(o)}, y_d, \Theta) \times p(z^{(s)}_{dj} = l \,|\, \mathbf{z}^{(s)}_{\neg dj}, \mathbf{w}^{(s)}, y_d, \Theta) \quad (3)$$

where we have:

$$p(z^{(o)}_{di} = h \,|\, \mathbf{z}^{(o)}_{\neg di}, \mathbf{w}^{(o)}, y_d, \Theta) \propto \frac{\alpha_h + n^{(o)}_{dh,\neg di}}{\sum_{t=1}^{T_o} \alpha_t + n^{(o)}_{dt,\neg di}} \times \frac{\beta_w + n^{(o)}_{hw,\neg di,y_d}}{\sum_{v=1}^{N_o} \beta_v + n^{(o)}_{hv,\neg di,y_d}}, \quad (4)$$

$$p(z^{(s)}_{dj} = l \,|\, \mathbf{z}^{(s)}_{\neg dj}, \mathbf{w}^{(s)}, y_d, \Theta) \propto \frac{\lambda_l + n^{(s)}_{dl,\neg dj}}{\sum_{t=1}^{T_s} \lambda_t + n^{(s)}_{dt,\neg dj}} \times \frac{\xi_u + n^{(s)}_{lu,\neg dj,y_d}}{\sum_{v=1}^{N_s} \xi_v + n^{(s)}_{lv,\neg dj,y_d}}. \quad (5)$$

The symbols $n^{(o)}_{dh,\neg di}$ (resp. $n^{(s)}_{dl,\neg dj}$) designate the number of times the $h$th object topic (resp. the $l$th scene topic) has been assigned to words in image $I_d$, excluding the considered $i$th object word (resp. $j$th scene word) of the image. Likewise, the symbols $n^{(o)}_{hw,\neg di,y_d}$ (resp. $n^{(s)}_{lu,\neg dj,y_d}$) designate the number of times the $i$th object word $w$ (resp. the $j$th scene word $u$) in $I_d$ has been assigned the $h$th object topic (resp. $l$th scene topic) in all images of $\mathcal{D}$ having the same class label as $y_d$. After running the Gibbs sampling for several iterations, the unknown parameters $\{\theta, \Psi, \Omega, \Phi\}$ are estimated as follows:

$$\theta_{dh} = \frac{\alpha_h + n^{(o)}_{dh}}{\sum_{t=1}^{T_o} \alpha_t + n^{(o)}_{dt}}, \quad (6)$$

$$\Psi_{dl} = \frac{\lambda_l + n^{(s)}_{dl}}{\sum_{t=1}^{T_s} \lambda_t + n^{(s)}_{dt}}, \quad (7)$$

$$\Phi_{w|h,y} = \frac{\beta_w + n^{(o)}_{hw,y}}{\sum_{v=1}^{N_o} \beta_v + n^{(o)}_{hv,y}}, \quad (8)$$

$$\Omega_{u|l,y} = \frac{\xi_u + n^{(s)}_{lu,y}}{\sum_{v=1}^{N_s} \xi_v + n^{(s)}_{lv,y}}, \quad (9)$$

where $n^{(o)}_{dh}$ and $n^{(s)}_{dl}$ designate the number of times the $h$th object topic and the $l$th scene topic have been assigned in image $I_d$, respectively. The symbols $n^{(o)}_{hw,y}$ (resp. $n^{(s)}_{lu,y}$) designate the number of times the object word $w$ (resp. scene word $u$) has been associated with the $h$th object topic (resp. $l$th scene topic) in images of class $y$ in the training dataset $\mathcal{D}$.
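The object side of this collapsed Gibbs sampler can be sketched as follows; this is an illustrative implementation of the update of Eq. (4) and the point estimates of Eqs. (6) and (8), not the published code. The count arrays and hyperparameters are hypothetical, and the scene side (Eqs. (5), (7) and (9)) is symmetric.

```python
# n_doc_topic:          (num_docs, T_o)  counts n^(o)_{dh}
# n_topic_word_class:   (T_o, N_o, K)    counts n^(o)_{hw,y}, aggregated per event class
import numpy as np

def resample_object_topic(w, d, y_d, old_topic,
                          n_doc_topic, n_topic_word_class,
                          alpha, beta, rng):
    """Resample z^(o)_{di} for object word w of document d with event class y_d (Eq. (4))."""
    T_o = n_doc_topic.shape[1]
    # Remove the current assignment from the counts (the "not di" terms).
    n_doc_topic[d, old_topic] -= 1
    n_topic_word_class[old_topic, w, y_d] -= 1
    # Document-topic term times class-specific topic-word term.
    left = (alpha + n_doc_topic[d]) / (alpha.sum() + n_doc_topic[d].sum())
    right = ((beta[w] + n_topic_word_class[:, w, y_d]) /
             (beta.sum() + n_topic_word_class[:, :, y_d].sum(axis=1)))
    p = left * right
    new_topic = rng.choice(T_o, p=p / p.sum())
    # Put the new assignment back into the counts.
    n_doc_topic[d, new_topic] += 1
    n_topic_word_class[new_topic, w, y_d] += 1
    return new_topic

def estimate_theta_phi(n_doc_topic, n_topic_word_class, alpha, beta):
    """Point estimates of Eq. (6) (theta_{dh}) and Eq. (8) (Phi_{w|h,y}) after sampling."""
    theta = (alpha + n_doc_topic) / (alpha.sum() + n_doc_topic.sum(axis=1, keepdims=True))
    Phi = ((beta[None, :, None] + n_topic_word_class) /
           (beta.sum() + n_topic_word_class.sum(axis=1, keepdims=True)))
    return theta, Phi
```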

In the M-step, the parameters $\eta_c$ are estimated for each class $c \in \{1, \dots, K\}$ by minimizing the negative log-likelihood of the data:

$$\eta^* = \arg\min_{\eta} \; -\sum_{d=1}^{N} \sum_{c=1}^{K} \mathbb{1}(y_d = c) \log \left[ \exp(\eta_c^T \bar{z}_d) \Big/ \sum_{t=1}^{K} \exp(\eta_t^T \bar{z}_d) \right] + \frac{\nu}{2} \sum_{t=1}^{K} \eta_t^T \eta_t, \quad (10)$$

where $\mathbb{1}(x)$ is the indicator function taking value 1 if its logical argument $x$ is true, and 0 otherwise. The symbol $\nu$ is a regularization constant preventing over-fitting of the model. The minimization of function (10) is carried out using the trust region Newton method [78].
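As a sketch of this M-step, the snippet below minimizes the L2-regularized softmax objective of Eq. (10) with SciPy's generic L-BFGS-B solver instead of the trust region Newton method of [78]; the inputs (empirical topic proportions and labels) are assumed to come from the E-step.

```python
# Minimal sketch of the M-step of Eq. (10) (regularized multi-class logistic regression).
import numpy as np
from scipy.optimize import minimize

def fit_eta(z_bar, y, K, nu=1.0):
    """z_bar: (N, T) empirical topic proportions; y: (N,) event labels in {0..K-1}."""
    N, T = z_bar.shape

    def objective(eta_flat):
        eta = eta_flat.reshape(K, T)
        logits = z_bar @ eta.T                        # (N, K)
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        log_post = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        nll = -log_post[np.arange(N), y].sum()        # negative log-likelihood of Eq. (10)
        return nll + 0.5 * nu * (eta ** 2).sum()      # + (nu/2) * sum_t eta_t^T eta_t

    res = minimize(objective, np.zeros(K * T), method="L-BFGS-B")
    return res.x.reshape(K, T)
```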

4.2. Model inference for image classification and annotation

Given a newly observed image $I$, which can be partially tagged, the goal is to infer the most likely event category of the image and derive an appropriate annotation describing the image content in terms of objects and scenes related to the event. For this purpose, we use convolutional neural networks (CNNs) [31,32] to extract candidate labels, which are used along with the tags, and we fold in our model to infer the event class of the image and its final annotation. Let $\mathbf{v}^{(o)}$ and $\mathbf{v}^{(s)}$ be the sets of the $M$ most likely object labels and the $L$ most likely scene labels returned by the CNNs, respectively (e.g., $M = L = 20$). Let also $\mathbf{t}^{(o)}$ and $\mathbf{t}^{(s)}$ be the sets of (noisy) object and scene tags accompanying the image. We denote by $\mathbf{w}^{(o)}_I = \mathbf{v}^{(o)} \cup \mathbf{t}^{(o)}$ and $\mathbf{w}^{(s)}_I = \mathbf{v}^{(s)} \cup \mathbf{t}^{(s)}$ the sets of all candidate object/scene labels. We suppose that we aim to retain $m \ll M$ object and $n \ll L$ scene labels for the final annotation, denoted by the sets $\mathbf{w}^{(o)}_c$ and $\mathbf{w}^{(s)}_c$, respectively. To derive the event category and the final annotation of the image, we apply the following EM steps:

• E-step:

  (1) Select the top $m \le M$ object candidate labels $\mathbf{w}^{(o)}_c$ from $\mathbf{w}^{(o)}_I$.

  (2) Select the top $n \le L$ scene candidate labels $\mathbf{w}^{(s)}_c$ from $\mathbf{w}^{(s)}_I$.

  (3) Estimate the latent topics for each word in $\mathbf{w}^{(o)}_c$ and $\mathbf{w}^{(s)}_c$, respectively:

  $$p(z^{(o)}_i = h \,|\, \mathbf{z}^{(o)}_{\neg i}, \mathbf{w}^{(o)}_c, \mathbf{w}^{(o)}, \Theta) \propto \frac{\alpha_h + n^{(o)}_{h,\neg i}}{\sum_{t=1}^{T_o} \alpha_t + n^{(o)}_{t,\neg i}} \times \frac{\beta_w + n^{(o)}_{hw,\neg i}}{\sum_{v=1}^{N_o} \beta_v + n^{(o)}_{hv,\neg i}}, \quad (11)$$

  $$p(z^{(s)}_j = l \,|\, \mathbf{z}^{(s)}_{\neg j}, \mathbf{w}^{(s)}_c, \mathbf{w}^{(s)}, \Theta) \propto \frac{\lambda_l + n^{(s)}_{l,\neg j}}{\sum_{t=1}^{T_s} \lambda_t + n^{(s)}_{t,\neg j}} \times \frac{\xi_u + n^{(s)}_{lu,\neg j}}{\sum_{v=1}^{N_s} \xi_v + n^{(s)}_{lv,\neg j}}, \quad (12)$$

  where the symbols $n^{(o)}_{h,\neg i}$ (resp. $n^{(s)}_{l,\neg j}$) designate the number of times the $h$th object topic (resp. the $l$th scene topic) has been assigned to words (other than the considered one) in image $I$. Likewise, the symbols $n^{(o)}_{hw,\neg i}$ (resp. $n^{(s)}_{lu,\neg j}$) designate the number of times the $i$th object word $w$ (resp. the $j$th scene word $u$) of image $I$ has been assigned the $h$th object topic (resp. $l$th scene topic) in the images of $\mathcal{D} \cup \{I\}$.

• M-step:

  (1) Calculate the event class probabilities using the softmax function of Eq. (1).

  (2) Calculate the probabilities of all words in the sets $\mathbf{w}^{(o)}_I$ and $\mathbf{w}^{(s)}_I$:

  $$p(w^{(o)} | I) = \sum_{y} \sum_{z^{(o)}} p(w^{(o)}, z^{(o)}, y \,|\, I) = \sum_{y} \sum_{z^{(o)}} p(w^{(o)} | z^{(o)}, y, I)\, p(z^{(o)} | I)\, p(y | I) = \sum_{y} \sum_{z^{(o)}} \Phi_{w^{(o)}|z^{(o)},y} \times \theta_{z^{(o)}} \times \sigma(y | I) \quad (13)$$

  $$p(w^{(s)} | I) = \sum_{y} \sum_{z^{(s)}} p(w^{(s)}, z^{(s)}, y \,|\, I) = \sum_{y} \sum_{z^{(s)}} p(w^{(s)} | z^{(s)}, y, I)\, p(z^{(s)} | I)\, p(y | I) = \sum_{y} \sum_{z^{(s)}} \Omega_{w^{(s)}|z^{(s)},y} \times \Psi_{z^{(s)}} \times \sigma(y | I) \quad (14)$$

  where $\theta_{z^{(o)}}$, $\Psi_{z^{(s)}}$, $\Phi_{w^{(o)}|z^{(o)},y}$ and $\Omega_{w^{(s)}|z^{(s)},y}$ are given by Eqs. (6) to (9), and $\sigma$ designates the softmax function given by Eq. (1).

  (3) Rank the words in $\mathbf{w}^{(o)}_I$ and $\mathbf{w}^{(s)}_I$ according to the calculated probabilities.

  (4) Go to the E-step.

The iterative process stops once the sets $\mathbf{w}^{(o)}_c$ and $\mathbf{w}^{(s)}_c$ are stable. The event category of the image then corresponds to the maximum a posteriori probability among the classes:

$$y^* = \arg\max_{y} \{ p(y \,|\, \mathbf{z}^{(o)}, \mathbf{z}^{(s)}, \eta) \} \quad (15)$$

The final annotation of the image corresponds to $\mathbf{w}^{(o)}_c \cup \mathbf{w}^{(s)}_c$. Note that, contrary to [8], which selects the final annotation independently of the class label, our model gives the most event-related annotations since the class is ascertained by the latent topics.

5. Experimental results

To evaluate the performance of the proposed approach, we conducted experiments on event-based image classification and annotation. We first train our model on labeled images randomly selected from our datasets. Given a newly observed image, which can be partially labeled, we aim at first predicting the event captured in the image and then deriving the most likely event-related object/scene concepts constituting the final annotation of the image. To assess the performance of our approach, we compared our results with recent methods based on topic models for both the classification and annotation problems.

5.1. Evaluation datasets

All our experiments have been performed on the challenging UIUC dataset [26] and WIDER dataset [15]. The UIUC dataset contains the following 8 sport event classes: rowing (250 images), rock climbing (194 images), croquet (236 images), polo (182 images), snow-boarding (190 images), bocce (137 images), sailing (190 images), and badminton (200 images). Following the setting used in [2,26], we randomly select 70 images per class to constitute the training set for our model learning. The WIDER dataset for event classification contains 50,574 images annotated with 61 different event categories. Following the experimental setting used in [15], we use 50% of the images for training and the remaining 50% for testing. Fig. 2 shows a sample of images from the two datasets.

5.2. Candidate labels extraction

The training of our model is performed using a sample of annotated images from each of our datasets. To test the performance of our model on classification and annotation, we take images without their annotations and extract candidate object/scene labels using deep learning.


Fig. 2. Some image examples in the evaluation datasets: (a) images from the UIUC dataset, (b) images from the WIDER dataset.

The candidate labels help identify the event category of the image, which in turn helps narrow the set of final annotations associated with the image.

Candidate label extraction is performed using convolutional neural networks (CNNs), which have recently enjoyed noticeable success in visual recognition thanks to their built-in structure [64]. Several CNN architectures have been proposed in the literature, namely AlexNet [31], VGGNet [79], GoogLeNet [61] and ResNet [80]. Other architectures have been proposed for scene recognition [81]. Given their performance, we use GoogLeNet [61] to extract object candidate labels and Places [81] to extract scene candidate labels in each tested image. Note that GoogLeNet is pre-trained on the ImageNet dataset to recognize 1000 object classes, while Places is pre-trained on the Places dataset to recognize 205 scene classes.
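As a hedged illustration of this step (not the authors' pipeline), the snippet below extracts the top-M object labels with an ImageNet-pretrained GoogLeNet from torchvision, assuming a recent torchvision release; scene labels would be extracted analogously with a Places-pretrained checkpoint, which is not shown here.

```python
# Extract the M most likely object labels from one image with a pretrained GoogLeNet.
import torch
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

model = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1).eval()

def top_object_labels(image_path, M=20):
    """Return the indices and probabilities of the M most likely object classes."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)[0]
    top = torch.topk(probs, M)
    return top.indices.tolist(), top.values.tolist()
```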

To adapt the CNN architectures to detect the objects and scenes related to our datasets, we fine-tuned them through transfer learning, using the annotated images from the datasets. Tables 1(a) and 2(a) show the top 5 most popular predicted object and scene concepts for the sport events Rowing and Snowmobile in the UIUC dataset [26]. Likewise,

Table 1
Top 5 most popular objects: (a) in events Rowing and Snowmobile in the UIUC dataset [26], (b) in events Traffic and Picnic in the WIDER dataset [15].

(a)
Rowing: paddle, canoe, sandbar, lakeside, speedboat
Snowmobile: ski, snowmobile, alp, shovel, ski mask

(b)
Traffic: traffic light, cab, street sign, minivan, racer, pole
Picnic: hamper, park bench, lakeside, folding chair, dining table

Tables 1(b) and 2(b) show the top 5 most popular predicted object and scene labels for the events Traffic and Picnic of the WIDER dataset [15].

5.3. Event classification in still images

We first measure the performance of our approach on event classification of images based on extracted object and scene labels. In the first part of our evaluation, we measured the performance of our model using objects and scenes separately, then combining them in


Table 2
Top 5 most popular scenes: (a) in events Rowing and Snowmobile in the UIUC dataset [26], (b) in events Traffic and Picnic in the WIDER dataset [15].

(a)
Rowing: boat_deck, harbor, river, ocean, coast
Snowmobile: ski_slope, ski_resort, mountain_snowy, crevasse, sky

(b)
Traffic: parking_lot, highway, crosswalk, construction_site, residential_neighborhood
Picnic: picnic_area, yard, patio, campsite, orchard

event classification. We also tested our method on partially labeled images. To assess the merit of the proposed method, we compared it with recent methods in the literature. The first comparison is made on the basis of the total number of object and scene topics used in the model. Fig. 3(a) and (b) show the graphs obtained for the UIUC and WIDER datasets, respectively. The graphs of the following LDA-based methods have been plotted as well: abc-corr-LDA [58], sLDA-ann [8] and SupDocNADE [55].

We can note that for the majority of methods, augmenting the number of topics increases the classification accuracy. However, our method starts overfitting at 40 topics on UIUC and 120 topics on WIDER, whereas the other methods start to overfit at smaller numbers (30 and 80 for the same datasets, respectively). This can be explained partly by having separate topic sets for objects and scenes in our method, which enables handling more latent topics than the other methods. On the other hand, having a separate word-topic distribution for each class helps increase the accuracy of event classification. Detailed evaluations on each dataset are given in the following sections.

5.3.1. UIUC dataset

Figs. 4 to 6 represent the confusion matrices of event classification on the UIUC dataset using objects, scenes and their combination, respectively. The results have been obtained using a total of 40 topics. As expected, we note that the combination of objects and scenes yields better results than using either one separately for event recognition. This is in line with past works [14,26] that reached the same conclusion for event classification in single images or groups of images, respectively (see Fig. 5).

We also compared the performance of our approach with the following state-of-the-art methods that used the UIUC dataset for their evaluation: Wang et al. [8], Li et al. [58], Zheng et al. [55], Li et al. [59], Jeon et al. [57], Huang et al. [56], Wang et al. [68] and Ahmad et al. [49]. These methods are dedicated to either event recognition and/or image annotation problems. Table 3 shows the best classification results obtained for each compared method. The performance obtained by our method is superior to most of the state-of-the-art methods, and we obtained a performance comparable to [82], which uses deep-learning features for object and scene information for event classification. Using partial random tagging of the images (20% of the ground-truth labels) significantly enhances the outcome of our method, which then outperforms all the others.

5.3.2. WIDER dataset

The same evaluation setting is used on the WIDER dataset. Figs. 7 to 9 represent the confusion matrices of event classification on 61 categories using objects, scenes and their combination, respectively. As for the UIUC dataset, we note that the combination of objects and scenes yields better results than using either one separately. Table 4 shows a comparison of the overall classification accuracy of our proposed model with several methods proposed in the literature using the WIDER dataset for their evaluation. We can clearly see that our method compares favorably to all methods. Using partial random tagging of the images (20% of the ground-truth labels) allowed our method to outperform all the others.

Table 3
Comparison of event classification accuracy for the different models on the UIUC dataset.

Methods  /  Accuracy
Wang et al. (sLDA-ann) [8]  67.56%
Li et al. (abc-corr-LDA) [58]  66.95%
Zheng et al. (SupDocNADE) [55]  77.16%
Zang et al. [60]  70.05%
Li et al. (MSS-sLDA) [59]  74.40%
Jeon et al. (scLDA) [57]  81.60%
Huang et al. (lcst-LDA-cnn-FC-SVM) [56]  91.45%
Wang et al. (Transferring-Deep) [82]  98.80%
Zhou et al. (GoogLeNet-GAP) [83]  95.00%
Ahmad et al. (region-based) [49]  98.38%
Our approach (Object only)  86.330%
Our approach (Scene only)  91.799%
Our approach (Object+Scene)  98.064%
Our approach (Object+Scene+Tags)  99.17%

Table 4
Comparison of event classification accuracy for the different models on the WIDER dataset.

Methods  /  Accuracy
Wang et al. (sLDA-ann) [8]  26.89%
Li et al. (abc-corr-LDA) [58]  36.90%
Zheng et al. (SupDocNADE) [55]  41.46%
Xiong et al. (Baseline CNN) [15]  39.70%
Xiong et al. (Deep channel fusion) [15]  42.40%
Rachmadi et al. (Combined CNN) [84]  44.06%
Wang et al. (Transferring-Deep) [82]  53.00%
Ahmad et al. (region-based) [49]  55.04%
Our approach (Object)  36.392%
Our approach (Scene)  39.764%
Our approach (Object+Scene)  54.091%
Our approach (Object+Scene+Tags)  58.091%

To examine more thoroughly the contribution of partial image tagging to enhancing event classification, Fig. 10(a) and (b) show the classification accuracy as a function of the percentage of correct tags available for each image. By varying this percentage between 0% and 50%, we observe an increase in the classification accuracy of our method for both datasets. This confirms the advantage of combining visual and textual information for enhancing the robustness of the model for event recognition (see Fig. 8).

5.4. Image annotation

To provide a quantitative evaluation of the annotation part of our method, we generated a ground-truth annotation for each image referring to the contained objects/scenes. More specifically, we selected the 40 most frequently occurring objects (resp. 40 most frequently occurring scenes) in the UIUC and WIDER datasets, respectively, to make our object and scene label sets. Most of these labels can be returned by GoogLeNet [61] and Places [81]. To assess the performance of our model, we compared it with three state-of-the-art classification/annotation methods, namely sLDA [8], corr-LDA [58] and SupDocNADE [55]. Given the precision ($P$) and recall ($R$), we use the $F$-measure ($F$) given by:

$$F = \frac{2 \times P \times R}{P + R} \quad (16)$$
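A small illustrative helper (assuming ground-truth and predicted annotations are given as label sets) computes the precision, recall and the $F$-measure of Eq. (16):

```python
def precision_recall_f(predicted, ground_truth):
    """Set-based precision, recall and F-measure (Eq. (16)) for one image."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    tp = len(predicted & ground_truth)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(ground_truth) if ground_truth else 0.0
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f

# Example with hypothetical annotations:
print(precision_recall_f({"canoe", "paddle", "river"}, {"canoe", "river", "harbor", "oar"}))
```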

We compare the overall annotation performance of our proposed model with several baseline methods performing simultaneous image classification and annotation. Note that [8,55,58] are based on models derived from LDA, whereas [75] combines multiple instance learning and CNNs. To show the performance of our model with respect to the number of topics, we plot the $F$-measure as a function of the number of topics. These graphs are shown in Fig. 11(a) and (b), respectively. We can clearly see that for all compared methods, increasing the number of topics results in an increase of the annotation accuracy. Our method obtained the


Fig. 3. Comparison of event classification as a function of the number of topics: (a) UIUC and (b) WIDER datasets, respectively.

Fig. 4. Confusion matrix (object only).

Fig. 5. Confusion matrix (scene only).

best performance among the compared methods thanks to its capacity of retrieving annotations related to events.

Table 5 gives the best $F$-measure values obtained by the compared methods, where we have fixed the number of topics to 40 for the UIUC and 120 for the WIDER datasets, respectively. We can see that our method obtained better annotation results. This is mainly obtained

thanks to several factors. First, extracting candidate labels using deep learning guarantees having more relevant object/scene labels than using low-level features. Second, our annotation procedure refines candidate labels/tags based on latent topics and the class category, which helps obtain more relevant event-related annotations. Fig. 12 shows the average annotation accuracy as a function of the percentage of tags available for each image. We used the following


Fig. 6. Confusion matrix (object and scene).

Fig. 7. Confusion matrix of WIDER dataset (object only).

Fig. 8. Confusion matrix of WIDER dataset (scene only).


Fig. 9. Confusion matrix of WIDER dataset (object and scene).

Fig. 10. Classification accuracy as a function of the percentage of partial labeling: (a) UIUC and (b) WIDER datasets, respectively.

Fig. 11. Comparison of annotation performance with respect to the number of topics: (a) UIUC dataset and (b) WIDER dataset.

procedure to generate noisy tags. If we want to generate $p\%$ of tags for an image having $n$ labels in the ground truth, we randomly select, without repetition, $q = n \times p\%$ labels from the ground truth of the image and $n/2 - q$ non-relevant labels from other images. By varying $p$ between 0% and 50% (see Fig. 12 for illustration), we can see that an increase in the number of tags results in an increase in the accuracy of the final annotation, even if the tags are noisy. This demonstrates the robustness of our method for generating correct event-related annotations.
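The noisy-tag generation procedure just described can be sketched as follows; the function and variable names are hypothetical and only illustrate the sampling scheme.

```python
# Keep q = round(n * p) correct labels and add n/2 - q distractor labels from other images.
import random

def make_noisy_tags(gt_labels, other_labels, p, rng=random.Random(0)):
    n = len(gt_labels)
    q = round(n * p)
    correct = rng.sample(list(gt_labels), q)                 # relevant tags, no repetition
    n_noise = max(int(n / 2) - q, 0)                         # number of distractors
    pool = [l for l in other_labels if l not in gt_labels]   # non-relevant label pool
    noise = rng.sample(pool, min(n_noise, len(pool)))
    return correct + noise

# Example with hypothetical labels: 25% correct tags for an image with 4 ground-truth labels.
print(make_noisy_tags(["canoe", "paddle", "river", "harbor"],
                      ["ski", "alp", "traffic light", "park bench"], p=0.25))
```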

6. Conclusions

We have proposed a structured probabilistic latent-topic model for simultaneous event classification and annotation in still images. Our


Table 5
Annotation performance comparison for the UIUC and WIDER datasets.

Methods  /  UIUC dataset  /  WIDER dataset
Wang et al. (sLDA) [8]  38.80%  36.90%
Li et al. (abc-corr-LDA) [58]  46.02%  37.99%
Zheng et al. (SupDocNADE) [55]  52.95%  44.89%
Wu et al. (DeepMIL) [75]  72.48%  60.03%
Our method  76.951%  61.581%

Fig. 12. Annotation accuracy as a function of the percentage of partial labeling for our datasets.

model is built on the assumption that object/scene occurrences depend on a latent topic structure related to event classes. Contrary to previous work on latent-topic-based modeling for image classification/annotation, our model assumes an event-specific distribution for object/scene generation, which yields consistent image annotations. Our model parses newly observed images by first generating image labels through CNNs, possibly accompanied by noisy tags. These are refined through an EM-based procedure which progressively ascertains the most likely event and the relevant annotations of the image. Experiments on two well-known datasets have enabled us to evaluate our method and demonstrate its performance in comparison with state-of-the-art methods.

The proposed work can be easily extended to analyzing and detecting more complex events and interactions between objects. For example, more visual semantic cues, such as human pose, gender, garments, text, etc., can be exploited to infer social interactions between objects and improve event recognition. For instance, our model in its current form cannot distinguish between instances of the same event that may involve only one of the two genders (e.g., men's/women's soccer). Another avenue that we are currently exploring is the possibility of analyzing multiple images at the same time to infer events recorded in albums. Indeed, multiple images can give more evidence in terms of objects and context for inferring social interactions and group activities.

Acknowledgment

This work has been achieved thanks to the support of the Government of Algeria and the Natural Sciences and Engineering Research Council of Canada (NSERC).

References

[1] A. Makadia, V. Pavlovic, S. Kumar, A new baseline for image annotation, in:European Conf. on Computer Vision, 2008, pp. 316–329.

[2] A.M. Tousch, S. Herbin, J.Y. Audibert, Semantic hierarchies for image annotation:A survey, Pattern Recognit. 45 (1) (2012) 333–345.

[3] L. Fei-Fei, P. Perona, A Bayesian hierarchical model for learning natural sceneCategories, in: IEEE Conf. on Computer Vision and Pattern Recognition, 2005,pp. 524–531.

[4] S.B. Park, J.W. Lee, S.K. Kim, Content-based image classification using a neuralnetwork, Pattern Recognit. Lett. 25 (3) (2004) 287–300.

[5] N. Rasiwasia, N. Vasconcelos, Latent dirichlet allocation models for imageclassification, IEEE Trans. Pattern Anal. Mach. Intell. 35 (11) (2013) 2665–2679.

[6] D.M. Blei, M.I. Jordan, Modeling annotated data, in: ACM SIGIR Conf. onResearch and Development in Information Retrieval, 2003, pp. 127–134.

[7] C.-T. Nguyen, D.-C. Zhan, Z.-H. Zhou, Multi-modal image annotation with multi-instance multi-label LDA, in: Int’L Joint Conf. on Artificial Intelligence, 2013,pp. 1558–1564.

[8] C. Wang, D. Blei, L. Fei-Fei, Simultaneous image classification and annotation, in:IEEE Conf. on Computer Vision and Pattern Recognition, 2009, pp. 1903–1910.

[9] R.C.F. Wong, C.H.C. Leung, Automatic semantic annotation of real-world webimages, IEEE Trans. Pattern Anal. Mach. Intell. 30 (11) (2008) 1933–1944.

[10] Y. Yang, F. Wu, F. Nie, H.T. Shen, Y. Zhuang, A.G. Hauptmann, Web andpersonal image annotation by mining label correlation with relaxed visual graphembedding, IEEE Trans. Image Process. 21 (3) (2012) 1339–1349.

[11] L.-J. Li, R. Socher, L. Fei-Fei, Towards total scene understanding: Classification,annotation and segmentation in an automatic framework, in: IEEE Conf. onComputer Vision and Pattern Recognition, 2009, pp. 2036–2043.

[12] Y. Wang, G. Mori, Max-margin latent dirichlet allocation for image classificationand annotation, in: British Machine Vision Conference, 2011, pp. 1–11.

[13] J.K. Aggarwal, M.S. Ryoo, Human activity analysis: A review, ACM Comput.Surv. 43 (3) (2011) 16.

[14] S. Bacha, M.S. Allili, N. Benblidia, Event recognition in photo albums usingprobabilistic graphical models and feature relevance, J. Vis. Commun. ImageRepresent. 40 (2016) 546–558.

[15] Y. Xiong, K. Zhu, D. Lin, X. Tang, Recognizing complex events from staticimages by fusing deep channels, in: IEEE Conf. on Computer Vision and PatternRecognition, 2015, pp. 1600–1609.

[16] M. Zeppelzauer, D. Schopfhauser, Multimodal classification of events in socialmedia, Image Vis. Comput. 53 (2016) 45–56.

[17] L. Yang, L. Jing, M.K. Ng, J. Yu, A discriminative and sparse topic model forimage classification and annotation, Image Vis. Comput. 51 (2016) 22–35.

[18] D.M. Blei, J.D. McAuliffe, Supervised topic models, Neural Inf. Process. Syst.(2007) 121–128.

[19] J. Li, J.Z. Wang, Real-time computerized annotation of pictures, IEEE Trans.Pattern Anal. Mach. Intell. 30 (6) (2008) 985–1002.

[20] S. Qian, T. Zhang, C. Xu, M.S. Hossain, Social event classification via boostedmultimodal supervised latent dirichlet allocation, ACM Trans. MultimediaComput. Commun. Appl. 11 (2) (2014) 1–22.

[21] S. Qian, T. Zhang, C. Xu, J. Shao, Multi-modal event topic for social eventanalysis, IEEE Trans. Multimedia 18 (2) (2016) 233–246.

[22] K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D.M. Blei, M.I. Jordan,Matching words and pictures, J. Mach. Learn. Res. 3 (2003) 1107–1135.

[23] D. Putthividhy, H.T. Attias, S.S. Nagarajan, Topic regression multi-modal latentdirichlet allocation for image annotation, in: IEEE Conf. on Computer Vision andPattern Recognition, 2010, pp. 3408–3415.

[24] D. Zhang, M. Islam, G. Lu, A review on automatic image annotation techniques,Pattern Recognit. 45 (1) (2012) 346–362.

[25] M. Hayat, S.H. Khan, M. Bennamoun, S. An, A spatial layout and scale invariantfeature representation for indoor scene classification, IEEE Trans. Image Process.25 (10) (2016) 4829–4841.

[26] L.J. Li, L. Fei-Fei. What, Where and who? classifying events by scene and objectrecognition, in: IEEE Int’L Conf. on Computer Vision, 2007, pp. 1–8.

[27] C. Tzelepis, Z. Ma, V. Mezaris, B. Ionescu, I. Kompatsiaris, Event-based media processing and analysis: A survey of the literature, Image Vis. Comput. 53 (2016) 3–19.

[28] X. Li, T. Uricchio, L. Ballan, M. Bertini, C.G.M. Snoek, A. Del Bimbo, Socializing the semantic gap: A comparative survey on image tag assignment, refinement, and retrieval, ACM Comput. Surv. 49 (1) (2016) 14.

[29] L. Bossard, M. Guillaumin, L. Van Gool, Event recognition in photo collections with a stopwatch HMM, in: IEEE Int'l Conf. on Computer Vision, 2013, pp. 1193–1200.

[30] R. Mattivi, J. Uijlings, F.G.B. De Natale, N. Sebe, Exploitation of time constraints for (sub-)event recognition, in: Joint ACM Workshop on Modeling and Representing Events, 2011, pp. 7–12.

[31] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, Neural Inf. Process. Syst. (2012) 1097–1105.

[32] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, A. Oliva, Learning deep features for scene recognition using places database, Neural Inf. Process. Syst. (2014) 487–495.

[33] A.M. Nguyen, J. Yosinski, J. Clune, Deep neural networks are easily fooled: High confidence predictions for unrecognizable images, in: IEEE Conf. on Computer Vision and Pattern Recognition, 2015, pp. 427–436.

[34] O. Ouyed, M.S. Allili, Feature weighting for multinomial kernel logistic regression and application to action recognition, Neurocomputing 275 (2018) 1752–1768.

[35] U. Ahsan, C. Sun, J. Hays, I. Essa, Complex event recognition from images with few training examples, in: IEEE Winter Conf. on Applications of Computer Vision, 2017, pp. 669–678.

[36] M.S. Allili, S. Bacha, Feature relevance in Bayesian network classifiers and application to image event recognition, in: FLAIRS Conference, 2016, pp. 760–763.

[37] S. Park, N. Kwak, Cultural event recognition by subregion classification with convolutional neural network, in: IEEE Conf. on Computer Vision and Pattern Recognition Workshop, 2015, pp. 45–50.

[38] O. Ouyed, M.S. Allili, Recognizing human interactions using group feature relevance in multinomial kernel logistic regression, in: Florida Artificial Intelligence Research Society Conference, 2018, pp. 541–546.

[39] L. Wang, Z. Wang, W. Du, Y. Qiao, Object-scene convolutional neural networks for event recognition in images, in: IEEE Conf. on Computer Vision and Pattern Recognition Workshops, 2015, pp. 30–35.

[40] L. Xie, H. Sundaram, M. Campbell, Event mining in multimedia streams, Proc. IEEE 96 (4) (2008) 623–647.

[41] H. Bay, T. Tuytelaars, L. Van Gool, SURF: Speeded up robust features, in: European Conf. on Computer Vision, 2006, pp. 404–417.

[42] Y. Gao, J. Fan, X. Xue, R. Jain, Automatic image annotation by incorporating feature hierarchy and boosting to scale up SVM classifiers, in: ACM Int'l Conf. on Multimedia, 2006, pp. 901–910.

[43] C. Jin, S. Jin, Image distance metric learning based on neighborhood sets for automatic image annotation, J. Vis. Commun. Image Represent. 34 (2016) 167–175.

[44] D. Tao, X. Tang, X. Li, X. Wu, Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 28 (7) (2006) 88–99.

[45] V. Lavrenko, R. Manmatha, J. Jeon, A model for learning the semantics of pictures, Neural Inf. Process. Syst. (2003) 553–560.

[46] A. Bosch, A. Zisserman, X. Muñoz, Scene classification using a hybrid generative/discriminative approach, IEEE Trans. Pattern Anal. Mach. Intell. 30 (4) (2008) 712–727.

[47] O. Chapelle, P. Haffner, V.N. Vapnik, Support vector machines for histogram-based image classification, IEEE Trans. Neural Netw. 10 (1999) 1055–1064.

[48] F. Moosmann, E. Nowak, F. Jurie, Randomized clustering forests for image classification, IEEE Trans. Pattern Anal. Mach. Intell. 30 (9) (2008) 1632–1646.

[49] K. Ahmad, N. Conci, F.G.B. De Natale, A saliency-based approach to event recognition, Signal Process., Image Commun. 60 (2018) 42–51.

[50] G. Carneiro, A.B. Chan, P.J. Moreno, N. Vasconcelos, Supervised learning of semantic classes for image annotation and retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 29 (3) (2007) 394–410.

[51] M.S. Allili, D. Ziou, Object of interest segmentation and tracking by using feature selection and active contours, in: IEEE Conf. on Computer Vision and Pattern Recognition, 2007, pp. 1–8.

[52] M.S. Allili, D. Ziou, Likelihood-based feature relevance for figure-ground segmentation in images and videos, Neurocomputing 167 (2015) 658–670.

[53] D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res. 3 (2003) 993–1022.

[54] T. Hofmann, Probabilistic latent semantic analysis, Uncertain. Artif. Intell. (1999) 289–296.

[55] Y. Zheng, Y.-J. Zhang, H. Larochelle, Topic modeling of multimodal data: An autoregressive approach, in: IEEE Conf. on Computer Vision and Pattern Recognition, 2014, pp. 1364–1377.

[56] C. Huang, W. Luo, Y. Xie, Local-class-shared-topic latent Dirichlet allocation based scene classification, Multimedia Tools Appl. 76 (14) (2017) 15661–15679.

[57] J. Jeon, M. Kim, A spatial class LDA model for classification of sports scene images, in: IEEE Int'l Conf. on Image Processing, 2015, pp. 4649–4653.

[58] X. Li, C. Sun, P. Lu, X. Wang, Y. Zhong, Simultaneous image classification and annotation based on probabilistic model, J. China Univ. Posts Telecommun. 19 (2) (2012) 107–115.

[59] X. Li, Z. Ma, P. Peng, X. Guo, F. Huang, X. Wang, J. Guo, Supervised latent Dirichlet allocation with a mixture of sparse softmax, Neurocomputing 312 (2018) 324–335.

[60] M. Zang, D. Wen, K. Wang, T. Liu, W. Song, A novel topic feature for image scene classification, Neurocomputing 148 (2015) 467–476.

[61] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: IEEE Conf. on Computer Vision and Pattern Recognition, 2015, pp. 1–9.

[62] M. Everingham, S.M. Ali Eslami, L.V. Gool, C.K.I. Williams, J. Winn, A. Zisserman, The Pascal visual object classes challenge: A retrospective, Int'l J. Comput. Vis. 111 (1) (2015) 98–136.

[63] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg, L. Fei-Fei, Imagenet large scale visual recognition challenge, Int'l J. Comput. Vis. 115 (3) (2015) 211–252.

[64] M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: European Conf. on Computer Vision, 2014, pp. 818–833.

[65] Y. Gong, Y. Jia, T.K. Leung, A. Toshev, S. Ioffe, Deep convolutional ranking for multilabel image annotation, arXiv:1312.4894 (2013).

[66] X. He, L. Deng, Deep learning for image-to-text generation, IEEE Signal Process. Mag. 34 (6) (2017) 109–116.

[67] A. Karpathy, L. Fei-Fei, Deep visual-semantic alignments for generating image descriptions, IEEE Trans. Pattern Anal. Mach. Intell. 39 (4) (2017) 664–676.

[68] R. Wang, Y. Xie, J. Yang, L. Xue, M. Hu, Q. Zhang, Large scale automatic image annotation based on convolutional neural network, J. Vis. Commun. Image Represent. 49 (2017) 213–224.

[69] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, in: Int'l Conf. on Machine Learning, 2015, pp. 2048–2057.

[70] A. Graves, A. Mohamed, G.E. Hinton, Speech recognition with deep recurrent neural networks, in: IEEE Int'l Conf. on Acoustics, Speech and Signal Processing, 2013, pp. 6645–6649.

[71] J. Krause, J. Johnson, R. Krishna, L. Fei-Fei, A hierarchical approach for generating descriptive image paragraphs, in: IEEE Conf. on Computer Vision and Pattern Recognition, 2017, pp. 3337–3345.

[72] R. Pascanu, T. Mikolov, Y. Bengio, On the difficulty of training recurrent neural networks, in: Int'l Conf. on Machine Learning, 2013, pp. 1310–1318.

[73] L. Fei-Fei, A. Iyer, C. Koch, P. Perona, What do we see in a glance of a scene?, J. Vis. 7 (1) (10) (2007) 1–29.

[74] C. Galleguillos, S. Belongie, Context based object categorization: A critical survey, Comput. Vis. Image Understand. 114 (2010) 712–722.

[75] J. Wu, Y. Yu, C. Huang, K. Yu, Deep multiple instance learning for image classification and auto-annotation, in: IEEE Conf. on Computer Vision and Pattern Recognition, 2015, pp. 3460–3469.

[76] D. Liu, X.-S. Hua, L. Yang, M. Wang, H.-J. Zhang, Tag ranking, in: Int'l Conf. on World Wide Web, 2009, pp. 351–360.

[77] T.L. Griffiths, M. Steyvers, Finding scientific topics, Proc. Natl. Acad. Sci. 101 (2004) 5228–5235.

[78] C.-J. Lin, R.C. Weng, S.S. Keerthi, Trust region Newton method for large-scale logistic regression, J. Mach. Learn. Res. 9 (2008) 625–650.

[79] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556 (2014).

[80] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: IEEE Conf. on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[81] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, A. Torralba, Places: A 10 million image database for scene recognition, IEEE Trans. Pattern Anal. Mach. Intell. 40 (6) (2018) 1452–1464.

[82] L. Wang, Z. Wang, Y. Qiao, L. Van Gool, Transferring deep object and scene representations for event recognition in still images, Int'l J. Comput. Vis. 126 (2–4) (2018) 390–409.

[83] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba, Learning deep features for discriminative localization, in: IEEE Conf. on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.

[84] R.-F. Rachmadi, K. Uchimura, G. Koutaki, Combined convolutional neural network for event recognition, in: Korea-Japan Joint Workshop on Frontiers of Computer Vision, 2016, pp. 85–90.

Lakhdar Laib has been a Ph.D. candidate at the École Nationale Supérieure d'Informatique, Algeria, since 2014. He works under the supervision of Professors Mohand Said Allili and Samy Ait-Aoudia. His main research interests include computer vision and graphics, image processing, pattern recognition, and machine learning.

Mohand Said Allili received the M.Sc. and Ph.D. degrees in computer science from the University of Sherbrooke, Sherbrooke, QC, Canada, in 2004 and 2008, respectively. Since June 2008, he has been an Assistant Professor of computer science with the Department of Computer Science and Engineering, Université du Québec en Outaouais, Canada. His main research interests include computer vision and graphics, image processing, pattern recognition, and machine learning. Dr. Allili was a recipient of the Best Ph.D. Thesis Award in engineering and natural sciences from the University of Sherbrooke for 2008, and of the Best Student Paper and Best Vision Paper awards for two of his papers at the Canadian Conference on Computer and Robot Vision in 2007 and 2010, respectively.

Samy Ait-Aoudia received a DEA "Diplôme d'Etudes Approfondies" in image processing from Saint-Etienne University, France, in 1990, and a Ph.D. degree in computer science from the Ecole des Mines, Saint-Etienne, France, in 1994. He is currently a Professor of computer science at the National High School in Computer Science in Algiers, Algeria, where he teaches at the BSc and MSc levels in computer science and software engineering. His areas of research include image processing, CAD/CAM and constraints management in solid modeling.
