Visual Semantic Information Pursuit: A Survey
Daqi Liu, Miroslaw Bober, Member, IEEE, Josef Kittler, Life Member, IEEE

Abstract—Visual semantic information comprises two important parts: the meaning of each visual semantic unit and the coherent visual semantic relation conveyed by these visual semantic units. Essentially, the former is a visual perception task while the latter corresponds to visual context reasoning. Remarkable advances in visual perception have been achieved due to the success of deep learning. In contrast, visual semantic information pursuit, a visual scene semantic interpretation task combining visual perception and visual context reasoning, is still in its early stage. It is the core task of many different computer vision applications, such as object detection, visual semantic segmentation, visual relationship detection or scene graph generation. Since it helps to enhance the accuracy and the consistency of the resulting interpretation, visual context reasoning is often incorporated with visual perception in current deep end-to-end visual semantic information pursuit methods. Surprisingly, a comprehensive review of this exciting area is still lacking. In this survey, we present a unified theoretical paradigm for all these methods, followed by an overview of the major developments and the future trends in each potential direction. The common benchmark datasets, the evaluation metrics and the comparisons of the corresponding methods are also introduced.

Index Terms—Semantic Scene Understanding, Visual Perception, Visual Context Reasoning, Deep Learning, Variational Free Energy Minimization, Message Passing.

1 INTRODUCTION

SEMANTICS is the linguistic and philosophical study of meaning in language, programming languages or formal logics. In linguistics, the semantic signifiers can be words, phrases, sentences or paragraphs. To interpret complicated signifiers such as phrases or sentences, we need to understand the meaning of each word as well as the semantic relation among those words. Here, words are the basic semantic units and a semantic relation is any relationship between two or more words based on the meaning of the words. In other words, semantic relations define the consistency among the associated semantic units in terms of meaning, which guarantees that the corresponding complex semantic signifier can be interpreted.

The above strategy can be seamlessly applied to the extraction of the meaning of visual information, in which the basic semantic units are potential pixels or potential region bounding boxes, while the visual semantic relation is represented as a local visual relationship structure or a holistic scene graph. For visual perception tasks such as visual semantic segmentation or object detection, visual semantic relations promote smoothness and consistency of the interpretation of the input visual semantic units. They act as a regularizer, biasing the associated visual semantic units towards configurations which are more likely to occur. For visual context reasoning applications, such as visual relationship detection or scene graph generation, the corresponding visual semantic units are considered as the associated context information and different inference methods are applied to pursue the visual semantic relation. In short, the visual semantic units and the visual semantic relations are complementary. Visual semantic units are the prerequisites of visual semantic relations, while the visual semantic relations can be explored to further improve the detection accuracy of the visual semantic units. Most importantly, it is the combination of visual semantic units and their relations which conveys the meaning of visual semantic information. Its essence will be task dependent and, to some extent, modulated by the dynamics of the successive goals, which may be influenced by the current state of visual information understanding. Its evolving nature renders visual semantic information extraction a task of inquiry, rather than a mapping. For this reason, we refer to it as visual semantic information pursuit.

• The authors are with the Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford GU2 7XH, U.K. E-mail: {daqi.liu, m.bober, j.kittler}@surrey.ac.uk

The extent to which a visual semantic information pursuit method can interpret the input visual stimuli is totally dependent on the prior knowledge of the observer. Vocabulary is one part of this knowledge, which defines the meaning of each visual semantic unit. The vocabulary itself may be enough for some specific visual perception tasks (such as weakly-supervised learning for object detection). However, for visual context reasoning applications, it is certainly not sufficient. We still need additional knowledge to identify and understand interpretable visual semantic relations. In most cases, besides the vocabulary, the associated benchmark datasets also have to provide ground-truth information about the visual semantic relations.

In this survey, four main research topics in visual semantic information pursuit are introduced: object detection, visual semantic segmentation, visual relationship detection and scene graph generation. The last two applications have only emerged in the deep learning era, while the first two tasks have existed for several decades. Before diving into the deep learning based pursuit algorithms, it is necessary to briefly introduce the conventional methods [1], [2] for the first two tasks. Typically, they are built upon hand-crafted features and models, such as the Viola-Jones detector [3], the Histogram of Oriented Gradients (HOG) feature descriptor [4], the Scale-Invariant Feature Transform (SIFT) [5] or the Deformable Part-based Model (DPM) [6]. Meanwhile, classical visual semantic segmentation algorithms like [7], [8] tend to apply probabilistic models like Markov Random Fields (MRF) [9] or Conditional Random Fields (CRF) [10] to explicitly characterize the correlations among the pixels being predicted. In addition, some early works focus on discovering the spatiotemporal relationships of visual objects [11], [12], [13], [14]. However, owing to their superior expressivity and scalability, deep learning based methods have come to dominate the visual semantic information pursuit area in recent years. They can automatically learn expressive feature representations from huge datasets, which would be impossible for the classical methods.

Fig. 1: Four main visual semantic information pursuit applications are introduced in this survey: object detection (OD), visual semantic segmentation (VSS), visual relationship detection (VRD) and scene graph generation (SGG).

Traditionally, visual semantic information pursuit tasks have generally been considered as visual perception problems. However, current visual semantic information pursuit methods have started to treat them as a combination of perception and reasoning. Therefore, unlike the previous surveys, all applications mentioned in this article include visual context reasoning modules and can be trained end-to-end through the associated deep learning models.

Specifically, object detection (OD) aims at detecting all possible objects appearing in the input image by assigning corresponding bounding boxes as well as their associated labels. Two main categories of conventional deep learning based object detectors [15] have been proposed in recent years: two-stage detection algorithms like Spatial Pyramid Pooling Networks (SPPNet) [16] or Faster Region-based Convolutional Neural Networks (Faster R-CNN) [17], and one-stage detection methods like You Only Look Once (YOLO) [18] or the Single Shot MultiBox Detector (SSD) [19]. The former follow a "proposal generation + verification" paradigm and usually achieve better detection accuracy, while the latter directly apply a fixed number of proposals on a grid and generally have a much higher detection speed. Essentially, the above conventional methods consider the potential objects to be independent and thus generally discard meaningful visual contextual information. In contrast, various modern deep learning based object detection methods with visual context reasoning modules [20], [21], [22], [23] have been proposed recently, which are the focus of this survey.

Visual semantic segmentation (VSS) [24], [25], [26] refers to labelling each pixel with one of the semantic categories. To robustly parse input images, effective visual context modelling is essential. Due to its intrinsic characteristics, visual semantic segmentation is often formulated as an undirected graphical model, such as an Undirected Cyclic Graph (UCG) [26] or a Conditional Random Field (CRF) [27]. In most cases, the energy function corresponding to the undirected graphical model is factorized into two potential functions: a unary function and a binary function. The former generates a predicted label for each input pixel, while the latter defines the pairwise interaction between adjacent pixels. As a constraint term, the binary potential function is used to regularize the predicted labels generated from the unary potential function to be spatially consistent and smooth.

Visual relationship detection (VRD) [28], [29], [30], [31], [32], [33], [34], [35], [36] focuses on recognizing the potential relationship between pairs of detected objects, in which the output is often formulated as a triplet of the form (subject, predicate, object). Generally, it is not sufficient to interpret the input image by only recognizing the individual objects. The visual relationship triplet, and in particular the predicate, plays an important role in understanding the input images. However, it is often hard to predict the predicates since they tend to exhibit a long-tail distribution. In most cases, for the same predicate, the diversity of the subject-object combinations is often enormous [37]. Compared with the individual objects, the corresponding predicates capture more general abstractions from the input images.

Scene graph generation (SGG) [38], [39], [40], [41], [42], [43], [44] builds a visually-grounded scene graph to explicitly model the objects and their relationships. Unlike visual relationship detection, scene graph generation aims to build a global scene graph instead of producing local visual relationship triplets. The contextual information conveyed within the scene graph is not limited to isolated triplets, but extends to all the related objects and predicates. To jointly infer the scene graph, message passing among the associated objects and predicates is essential in scene graph generation tasks.

The above visual semantic information pursuit applications, as shown in Fig. 1, try to interpret the input image at different semantic levels. For instance, object detection tries to interpret the visual semantic units, while visual semantic segmentation seeks to interpret the visual semantic regions (essentially, semantic regions are semantic units with different representation forms); visual relationship detection tries to interpret the visual semantic phrases, while scene graph generation attempts to interpret the visual semantic scene. Visual semantic information from the above low- and mid-level visual intelligence tasks is the basis of high-level visual intelligence tasks such as visual captioning [45], [46], [47] or visual question answering [48], [49], [50].

Specifically, to accomplish visual semantic information pursuit, three key questions need to be answered: 1) What kind of visual context information is required? 2) How to model the required visual context information? 3) How to infer the posterior distribution given the visual context information? This article presents a comprehensive survey of the state-of-the-art visual semantic information pursuit algorithms, which try to answer the above questions.

This survey is organized as follows: Section 2 presents the terminology and the fundamentals of visual semantic information pursuit. Section 3 introduces a unified paradigm for all visual semantic information pursuit methods. The major developments and the future research directions are covered in Section 4 and Section 5, respectively. The common benchmarks and evaluation metrics are summarized in Section 6. The experimental comparison of the key methods and the corresponding discussion are presented in Section 7 and Section 8, respectively. Finally, the conclusions are drawn in Section 9.

2 PRELIMINARY KNOWLEDGE

A typical visual semantic information pursuit method consists of two modules: a visual perception module and a visual context reasoning module. The visual perception module tries to detect visual semantic units from the input visual stimuli and assign specific meaning to them. Convolutional neural network (CNN) architectures such as fully convolutional networks (FCNs) [51] or Faster R-CNNs [17] are often used to model the visual perception tasks. A comprehensive introduction to these CNN models can be found in the previous surveys [15], [52]. In general, region proposal networks (RPNs) [17], [53] are used to produce the proposal bounding boxes, and region of interest (ROI) pooling [17] or bilinear feature interpolation [33] is applied to obtain the corresponding feature vectors. Three possible prior factors associated with region proposals - visual appearance, class information and relative spatial relationship - are often considered in forming the visual semantic perception module, as shown in Fig. 2.
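
To make the above pipeline concrete, the following minimal sketch (PyTorch/torchvision; the tensor shapes, stride and box values are invented for illustration and do not come from any surveyed method) shows how a fixed-size feature vector can be pooled for every proposal from a backbone feature map:

```python
import torch
from torchvision.ops import roi_align

# feature_map: backbone CNN output for one image, shape (1, C, H, W)
# proposals:   RPN boxes in (x1, y1, x2, y2) image coordinates, shape (N, 4)
feature_map = torch.randn(1, 256, 50, 50)
proposals = torch.tensor([[10., 20., 120., 180.],
                          [40., 35., 200., 210.]])

# Prepend the batch index expected by roi_align: (N, 5) = (batch_idx, x1, y1, x2, y2)
rois = torch.cat([torch.zeros(len(proposals), 1), proposals], dim=1)

# Pool a fixed-size (7x7) feature for every proposal; spatial_scale maps image
# coordinates onto the downsampled feature map (here an assumed stride of 16).
roi_feats = roi_align(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(roi_feats.shape)  # (N, 256, 7, 7), flattened and fed to the prediction heads
```

In a full system the proposals would come from an RPN and the pooled features would feed the classification, box-regression and relation heads.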

As discussed in [34], despite some state-of-the-art one-stage detection models being able to achieve better performance, current visual semantic information pursuit methods still prefer to employ two-stage detection algorithms to construct the visual perception modules, for the following reasons: 1) applying the same regression and classification strategy as the one-stage methods to visual semantic relations would require a quadratic number of anchor boxes, and thus becomes intractable at large scale; 2) the visual semantic relations are often open, which presents a huge challenge for current classification schemes; 3) compared with object hypotheses, proposing relationships is more challenging since it requires not only localizing salient regions but also evaluating the visual connection between regions.

Fig. 2: In current visual perception modules, three possible prior factors associated with region proposals - visual appearance, class information and relative spatial relationship - are often considered, since the current visual semantic relations, as discussed in Section 2, are normally generated via two-stage detection methods. To our knowledge, potential dense semantic segmentation maps have not yet been explored to generate visual semantic relations in current research.

Given the detected visual semantic units, the aim of the visual context reasoning module is to produce the most probable interpretation through maximum a posteriori (MAP) inference. Basically, this MAP inference is an NP-hard integer programming problem [54]. However, it is possible to address the integer programming problem via a linear relaxation, where the resulting linear programming problem can be presented as a variational free energy minimization approximation [55], [56]. Among the current visual semantic information pursuit methods, to accomplish the MAP inference, the marginal polytopes associated with the target variational free energies are often approximated by their corresponding feasible polytopes [57]. Such feasible polytopes can further be factorized into numerous regions, which can be trained sequentially or in parallel using the corresponding optimization methods. Essentially, the aim of the above approximation is to find an upper bound for the target variational free energy. The tighter the upper bound, the better the MAP inference will be.
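
To make the relaxation step explicit, the MAP problem and its linear programming surrogate can be written as follows. This is the standard textbook formulation in our own notation rather than one taken from a specific surveyed paper: writing the posterior as $P(x \mid I) \propto \exp(-E(x))$ (Boltzmann's law, Section 3.1), $\theta$ stacks the clique energies of the underlying factor graph $G$, $\mu$ collects the corresponding (pseudo-)marginals, $\mathcal{M}(G)$ denotes the marginal polytope and $\mathcal{L}(G)$ the tractable feasible polytope used in practice.

```latex
% MAP inference as an integer program and its linear relaxation over a feasible polytope.
\begin{align*}
x^{*} &= \operatorname*{argmax}_{x}\, P(x \mid I) \;=\; \operatorname*{argmin}_{x}\, E(x)
        && \text{(NP-hard integer program)} \\[2pt]
\mu^{*} &= \operatorname*{argmin}_{\mu \in \mathcal{M}(G)} \langle \theta, \mu \rangle
        \;\approx\; \operatorname*{argmin}_{\mu \in \mathcal{L}(G)} \langle \theta, \mu \rangle,
        && \mathcal{M}(G) \subseteq \mathcal{L}(G).
\end{align*}
```

The quality of the resulting MAP estimate then depends on how tightly the feasible polytope approximates the marginal polytope.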

Furthermore, the visual context reasoning module often incorporates prior knowledge to regularize the target variational free energy. Generally, there are two types of prior knowledge: internal prior knowledge and external prior knowledge. The internal prior knowledge is acquired from the visual stimulus itself. For instance, the adjacency of the visual stimuli is a typical piece of internal prior knowledge, i.e. adjacent objects tend to have a relationship and adjacent pixels tend to have the same label. The external prior knowledge is obtained from external sources, such as tasks, contexts or knowledge bases. For instance, linguistic resources like word2vec embeddings [58], [59] are often used as external prior knowledge, since the embeddings can be used to measure the semantic similarity among different words. Specifically, word embeddings are positioned in the vector space such that words sharing common contexts in the corpus are located in close proximity to one another. Accordingly, the current visual semantic pursuit algorithms can generally be divided into the following two categories: bottom-up methods and top-down methods. The former only use internal prior knowledge, while the latter incorporate both internal and external prior knowledge.
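
As a small illustration of how such embedding-based external knowledge is queried, the sketch below computes cosine similarities between word vectors; the 4-dimensional vectors are made-up placeholders standing in for real 300-dimensional word2vec embeddings:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Similarity of two embedding vectors; close to 1 for semantically related words."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Placeholder embeddings; a real system would load pretrained word2vec/GloVe vectors.
embeddings = {
    "horse":    np.array([0.9, 0.1, 0.3, 0.0]),
    "elephant": np.array([0.8, 0.2, 0.4, 0.1]),
    "wheel":    np.array([0.0, 0.9, 0.1, 0.8]),
}

# Words sharing contexts score higher, which is the prior exploited by the
# top-down methods discussed in Section 4.2.
print(cosine_similarity(embeddings["horse"], embeddings["elephant"]))  # high
print(cosine_similarity(embeddings["horse"], embeddings["wheel"]))     # low
```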

Furthermore, to resolve a visual semantic information pursuit application, one needs a model selection step, which aims to find the best model (an optimum member within a distribution family) by maximizing the corresponding conditional likelihood. Essentially, the current visual semantic information pursuit algorithms can be considered as variational Bayesian methods, in which deep learning-based message passing optimization strategies are often used to accomplish the variational inference steps, while the variational learning steps are generally implemented by stochastic gradient descent (SGD) methods.

3 UNIFIED PARADIGM

Within a visual semantic information pursuit system, the visual perception module initializes the visual context reasoning module, while the visual context reasoning module constrains the visual perception module. The two modules are complementary, since both can provide contextual information to each other. In recent years, deep learning models like CNNs have been shown to achieve superior performance in numerous visual perception tasks [60], [61], [62]. They have become the de facto choice for visual perception modules in current research. However, conventional CNNs are still not close to solving the inference tasks within the visual context reasoning modules.

3.1 Formulation

To accomplish an inference task, a probabilistic graphical model is often adopted as the visual semantic information pursuit framework. It uses a graph-based representation as the foundation for encoding a distribution over a multi-dimensional space, and it is a factorized representation of the set of independencies that hold in a specific distribution. Two types of graphical models are commonly used, namely Bayesian networks and Markov random fields. The former are directed acyclic graphical models with causality connections, while the latter are undirected graphical models, with cycles in most cases. Both of them can be reformulated as corresponding factor graph models. Specifically, within the associated factor graph model, the visual semantic units are represented as the variable nodes while the visual semantic relations are depicted as the factor nodes.

Given the associated factor graph model, the aim of the visual context reasoning module is to infer the most probable interpretation given the observed input visual stimuli. In other words, given the input images and other ground-truth information (such as the locations of the associated bounding boxes), we want to maximize the corresponding posterior distribution. This MAP inference is an NP-hard integer programming problem, and it is often reformulated as a linear programming problem via a linear relaxation [57]. Furthermore, within the probabilistic graphical model, the posterior can be derived from the corresponding energy functions according to Boltzmann's law: the lower the energy, the more probable the potential interpretation will be. The above linear programming problem can further be expressed as a variational free energy minimization problem. In short, instead of performing exact inference, the visual context reasoning module interprets the input visual stimuli using a relevant variational free energy minimization method.

Fig. 3: The unified paradigm of the current visual semantic information pursuit methods.

3.2 Unified Paradigm

Based on the above analysis, the current visual semantic information pursuit methods follow a unified paradigm. Specifically, the visual perception module applies a corresponding CNN model to identify and locate the visual semantic units, while the visual context reasoning module uses a relevant deep learning-based variational free energy minimization method to approximate the target visual semantic relations linking the above visual semantic units. Fig. 3 schematically represents the unified paradigm of the current visual semantic information pursuit methods.

In this survey, we use the more general scene graph generation task as an example to develop a mathematical model corresponding to the unified paradigm. Other visual semantic information pursuit applications are special cases of this formulation. To generate a visually-grounded scene graph, the corresponding visual perception module applies a CNN model like an RPN to automatically obtain an initial set of object bounding boxes $B_I$ from the input image $I$. For each proposal bounding box, the visual context reasoning module needs to infer three variables: 1) the associated object class label; 2) the corresponding four bounding box offsets relative to the proposal box coordinates; 3) the relevant predicate labels between the potential object pairs.

Given a set of object classes $\mathcal{C}$ and a set of relationship types $\mathcal{R}$, the above set of variables can be depicted as $X = \{x^{cls}_i, x^{bbox}_i, x_{i \to j} \mid i = 1, \dots, n,\ j = 1, \dots, n,\ i \neq j\}$, where $n$ is the number of proposal bounding boxes, $x^{cls}_i \in \mathcal{C}$ represents the class label of the $i$-th proposal bounding box, $x^{bbox}_i \in \mathbb{R}^4$ depicts the bounding box offsets relative to the $i$-th proposal box coordinates, and $x_{i \to j} \in \mathcal{R}$ is the relationship predicate between the $i$-th and the $j$-th proposal bounding boxes. Generally, the ground-truth posterior $P(x \mid I, B_I)$ is computationally intractable. Therefore, in current research, a tractable variational distribution $Q(x)$ (which is often selected from conditional conjugate exponential families [63] so that the target variational free energy can be computed analytically; please refer to [63] for more details) is generally used to approximate the ground-truth posterior, and we need to accomplish the following MAP inference to obtain the optimal interpretation:

$$x^{*} = \operatorname*{argmax}_{x \in X} P(x \mid I, B_I) \qquad (1)$$

where $x \in X$ is a possible configuration or interpretation and $P(x \mid I, B_I)$ is generally considered as the target posterior, which can also be derived as follows:

$$P(x \mid I, B_I) = \frac{\exp(-E(x, I, B_I))}{Z(I, B_I)} \qquad (2)$$

TABLE 1: Overview of the typical visual semantic information pursuit methods.

| Category  | Method          | Features                       | Pros                                | Cons                          |
|-----------|-----------------|--------------------------------|-------------------------------------|-------------------------------|
| Bottom-up | PMPS [30]       | Triplet-based reasoning        | Phrase-guided message passing       | Only predicate is highlighted |
| Bottom-up | CRF-as-RNN [52] | CRF-based reasoning            | End-to-end training                 | Binary potential terms only   |
| Bottom-up | MSDN [39]       | Semantic hierarchy reasoning   | Propagate via a dynamic graph       | Require hierarchy alignment   |
| Bottom-up | DAG-RNN [26]    | DAG-based reasoning            | Encode higher-order potential terms | Poor unary predictions        |
| Bottom-up | SMN [20]        | External memory reasoning      | Converge quickly                    | Higher computational cost     |
| Top-down  | LP [28]         | Semantic affinity distillation | Apply a language prior              | Unreliable predictions        |
| Top-down  | LK [32]         | Teacher-student distillation   | Employ a teacher-student scheme     | Enlarge parameter space       |

• Note: Due to limited space, in this table, only one representative method is selected for each sub-category in Section 4.

where $E(x, I, B_I)$ is the energy function, which computes the assignment cost for the potential interpretation, and $Z(I, B_I)$ is the associated partition function. Generally, the energy function can be factorized into a summation of numerous potential terms. For instance, the following equation demonstrates one possible factorization:

$$E(x, I, B_I) = \sum_{i} \psi_u(x_i, I, B_I) + \sum_{i \neq j} \psi_b(x_i, x_j) \qquad (3)$$

where $\psi_u(x_i, I, B_I)$ represents the unary potential term and $\psi_b(x_i, x_j)$ depicts the binary potential term. Essentially, the unary potential terms relate to the visual perception module, while the higher-order potential terms characterize the visual context reasoning module.

Furthermore, supposing the above energy function is parametrized as $E_\theta(x, I, B_I)$, the target variational free energy $F(\theta, Q)$ for the corresponding MAP inference is:

$$F(\theta, Q) = \sum_{x \in X} Q(x) E_\theta(x, I, B_I) \qquad (4)$$

where the minimization of $F(\theta, Q)$ generally requires specific annealing strategies since it is often NP-hard. Based on the applied decomposition strategy of $Q(x)$, typical visual semantic information pursuit tasks tend to find a surrogate variational free energy for $F(\theta, Q)$ and generally employ a variational Bayesian method to find the optimal $Q$ and $\theta$. In the current literature, deep learning-based message passing strategies are generally applied to implement the variational inference steps, while the variational learning steps are often accomplished by SGD methods. Conveniently, if one chooses to use a message passing optimization strategy, it is not necessary to state the variational free energy explicitly, since the message passing update rule implicitly accomplishes the variational inference step. In other words, one can choose different types of variational free energy by changing the corresponding message passing update rules.
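
For reference, under the fully factorized (mean-field) decomposition adopted by several of the methods reviewed in Section 4, the coordinate-wise update that minimizes the surrogate free energy takes the standard form below; this is the generic textbook update written in the notation of Eq. (4), not a formulation taken from any single surveyed paper:

```latex
% Mean-field decomposition of Q and the corresponding coordinate update.
Q(x) = \prod_{i} Q_i(x_i), \qquad
Q_i^{\text{new}}(x_i) \;\propto\; \exp\!\Big(-\,\mathbb{E}_{Q_{\setminus i}}\big[E_{\theta}(x, I, B_I)\big]\Big),
```

where the expectation is taken over all other variables under their current marginals. Iterating this update for every node is precisely the computation that the RNN-style message passing layers surveyed in Section 4 unroll as tensor operations.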

3.3 Training Strategy

The existing visual semantic information pursuit methods generally follow two main training strategies: modular training and end-to-end training. Within the model selection step, the error differentials of the former are only allowed to back-propagate within the visual context reasoning module, while the latter can further back-propagate the error differentials to the preceding visual perception module, so that the whole learning system can be trained end-to-end. Essentially, within the modular training strategy, the visual context reasoning module can be considered as a post-processing stage of visual perception. For instance, [64] formulates the sequential scene parsing task as a binary tree graphical model and proposes a Bayesian framework to infer the associated target posterior. Specifically, three variants of VGG nets [65] are applied as the visual perception module, while the proposed Bayesian framework is used as the post-processing visual context reasoning module. To accomplish the visual semantic segmentation task, [66] and [67] use FCNs as the visual perception modules and apply CRF models as the post-processing visual context reasoning modules.

As discussed in [24] and [52], instead of using modular training, the current visual semantic information pursuit methods tend to apply end-to-end training. Moreover, they tend to use deep learning based variational free energy minimization methods to model the visual context reasoning module. These changes have several advantages: 1) since the error differentials within the visual context reasoning module can be back-propagated to the preceding visual perception module, the whole system can be trained end-to-end and the final performance improves accordingly; 2) with deep learning models, the classical inference operations like message passing or aggregation can easily be accomplished by a simple tensor manipulation; 3) the visual context reasoning module based on deep learning models can fully utilize the advanced parallel capability of modern GPUs to achieve reasonable inference speeds.
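
As a concrete illustration of advantage 2) above, the sketch below performs one round of message aggregation over a toy graph as a single matrix multiplication (PyTorch; the adjacency matrix, feature dimensions and update rule are invented for illustration):

```python
import torch

# Toy graph: n nodes, binary adjacency A (1 where a visual semantic relation is hypothesized).
n, d = 5, 16
A = torch.randint(0, 2, (n, n)).float()
A.fill_diagonal_(0)                      # no self-messages
H = torch.randn(n, d)                    # current node states (e.g. ROI features)
W = torch.nn.Linear(d, d, bias=False)    # learnable message transform

# One round of aggregation: every node receives the transformed states of its
# neighbours in a single matrix multiplication -- the "simple tensor manipulation"
# referred to above. Normalizing by the node degree keeps the scale stable.
deg = A.sum(dim=1, keepdim=True).clamp(min=1)
messages = (A @ W(H)) / deg
H_new = torch.relu(H + messages)         # the update rule is method-specific; residual + ReLU here
```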

4 MAJOR DEVELOPMENTS

In this survey, we limit the discussion to the key deep learning based visual context reasoning methods. Specifically, based on the manner in which prior knowledge is applied, the current deep learning based visual context reasoning methods can be categorized as either bottom-up or top-down. Table 1 presents an overview of the typical visual semantic information pursuit methods in each category. In the following sections, we introduce the major developments in these two directions in terms of the applied variational free energies and the corresponding optimization methods.

4.1 Major Developments in Bottom-up Methods

A visual semantic information pursuit task can generally be represented in terms of an associated probabilistic graphical model. For instance, MRFs or CRFs are often used to model visual semantic segmentation tasks. Given a probabilistic graphical model, the visual context reasoning module is typically formulated as a variational free energy minimization problem. As this optimization problem is generally NP-hard, we need to relax the original tight constraints and use variational methods to approximate the target posterior. Specifically, we first need to define an associated variational free energy and then find a corresponding optimization method to minimize it. In most cases, the applied variational free energy depends on the corresponding relaxation strategy.

Even though numerous types of optimization methods [68] are capable of minimizing the target variational free energy, one particular type of optimization strategy - message passing [69], [70] - stands out from the competition and is widely applied in the current deep learning based visual context reasoning methods. This is because, unlike sequential optimization methods (such as the steepest descent algorithm [71] and its many variants), the message passing strategy is capable of optimizing different decomposed sub-problems (usually obtained from dual decomposition) in parallel. Furthermore, the message passing or aggregation operation can easily be accomplished by a simple tensor manipulation. For the bottom-up deep learning based visual context reasoning methods, only the internal prior knowledge is required to regularize the associated variational free energy. Within the existing bottom-up methods, numerous message passing variants have been proposed using various deep learning architectures, which can be summarized as follows:

4.1.1 Triplet-based Reasoning Models

The visual relationship triplet (subject, predicate, object) plays an important role in understanding the input image, in which (subject, object) represent a pair of detected entities while the predicate describes the relationship between those two entities. Instead of categorizing the triplet as a whole, the current bottom-up methods tend to classify each component jointly, which reduces the computational complexity from $O(N^2 R)$ to $O(N + R)$ (for $N$ objects and $R$ predicates). However, it is extremely hard to detect predicates since they often obey a long-tail distribution (the complexity becomes quadratic when considering all possible subject-object pairs). Given a set of object classes $\mathcal{C}$ and a set of relationship types $\mathcal{R}$, the triplet variable can be depicted as $X = \{x^{cls}_s, x^{bbox}_s, x^{cls}_o, x^{bbox}_o, x_p\}$, where $x^{cls}_* \in \mathcal{C}$ represents the class label of the associated proposal bounding box, $x^{bbox}_* \in \mathbb{R}^4$ depicts the bounding box offsets relative to the associated proposal box coordinates, and $x_p \in \mathcal{R}$ is the relationship predicate. The aim of the triplet-based reasoning model is to maximize the posterior $P(X \mid V_S, V_P, V_O)$, in which $V_S$, $V_P$, $V_O$ represent the associated observed feature vectors. By modeling the unary potential terms of the associated variational free energies, CNN-based visual perception modules are generally used to generate these feature vectors.

Fig. 4: Message passing strategies for different types of triplet-based reasoning models, in which the messages are passed among the corresponding triplet components.

In the current literature, the triplet-based reasoning models generally use CNN architectures to model the higher-order potential terms. Moreover, to minimize the variational free energy, the associated message passing strategies are often designed based on the connection configuration of the triplet graphical model, as shown in Fig. 4. One typical triplet-based reasoning model is the relationship proposal network [34], which formulates the triplet structure as a fully connected clique and only considers a third-order potential term within the associated variational free energy. Two CNN-based compatibility modules are proposed to model the third-order potential terms so that a consistent unary prediction combination is rendered more likely. Inspired by the Faster R-CNN model, the authors in [30] propose a CNN-based phrase-guided message passing structure (PMPS) to infer input triplet proposals, in which the subjects and the objects are only connected through the predicates. They place the predicate at the dominant position and specifically design a gather-and-broadcast message passing strategy, which is applied in both convolutional and fully connected layers. Unlike the above methods, [33] proposes to model the predicate as a vector translation between the subject and the object, in which both subject and object are mapped into a low-dimensional relation space with less variance. Instead of the conventional ROI pooling, the authors use bilinear feature interpolation to transfer knowledge between objects and predicates.

4.1.2 MRF-based or CRF-based Reasoning Models

MRFs and CRFs are commonly used undirected probabilistic graphical models in the computer vision community. They are capable of capturing the rich contextual information exhibited in natural images or videos. MRFs are generative models while CRFs are discriminative models. Most visual semantic information pursuit applications, in particular visual semantic segmentation tasks, are often formulated as MRFs or CRFs. Given an input stimulus $I$ and a variable $X$ representing the semantic information of interest, the aim of the MRF-based or CRF-based reasoning model is to maximize the posterior $P(X \mid I)$ or minimize the variational free energy $F(Q)$. Exact inference methods only exist for special MRF or CRF structures such as junction trees or local cliques. For instance, the authors in [29] propose a deep relational network to find the potential visual relationship triplets. They apply CRFs to model the associated fully connected triplet cliques and use sequential CNN computing layers to infer exactly the corresponding marginals factorized from the joint posterior.

Fig. 5: Message passing strategies for MRF-based or CRF-based reasoning models, in which the messages are generally passed within the same semantic level.

In the current literature relating to the bottom-up approach, general MRF or CRF structures often need variational inference methods such as mean field (MF) approximation [72], [73] or loopy belief propagation (BP) [74] to infer the target posterior. Specifically, the applied variational free energy is often devised depending on the relaxation strategy of the corresponding constrained optimization problem, and the message passing optimization methodology is generally applied to minimize this variational free energy. Fig. 5 shows the general message passing strategies of the MRF-based or CRF-based reasoning models. A well-known CRF-based reasoning model is CRF-as-RNN [24], [52], which incorporates an RNN-based visual context reasoning module into the FCN visual perception module so that the proposed visual semantic segmentation system can be trained end-to-end. Specifically, FCN layers are used to formulate the unary potentials of the DenseCRF model [75], while the binary potentials are formed by a sequence of CNN layers. As a result, the associated mean field inference method can be implemented by the corresponding RNN. Essentially, two relaxation measures are applied within the mean field approximation: 1) a tractable variational distribution $Q(X)$ is used to approximate the underlying posterior $P(X \mid I)$; 2) the joint variational distribution is fully factorized into a combination of independent nodes, which implies that maximizing the marginal of each independent node is assumed to accomplish the original MAP estimation. Inspired by the above methodology, the authors in [38] use a more general RNN architecture - gated recurrent units (GRUs) - to generate scene graphs from the input images using the mean field inference method. Specifically, they use the internal memory cells in GRUs to store the generated contextual information and apply a primal-dual update rule to speed up the inference procedure. Unlike the above methods that optimize the CRFs using an iterative strategy, the deep parsing network (DPN) proposed in [25] is able to achieve high visual semantic segmentation performance by applying only one iteration of MF; it can also be considered as a generalization of the existing models, since it can represent various types of binary potential terms.
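
The computation such models unroll can be sketched as below. This is a deliberately simplified mean-field step: the dense Gaussian filtering of CRF-as-RNN [24], [52] is replaced by a generic pairwise affinity matrix, so the snippet illustrates the structure of the update rather than any particular implementation:

```python
import torch
import torch.nn.functional as F

def mean_field_step(unary: torch.Tensor, pairwise: torch.Tensor,
                    compat: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    """One mean-field update for a CRF with n nodes and L labels.

    unary:    (n, L) unary scores (higher = more likely)
    pairwise: (n, n) pairwise affinities between nodes (e.g. spatial/appearance kernel)
    compat:   (L, L) label compatibility transform
    Q:        (n, L) current approximate marginals
    """
    msg = pairwise @ Q                     # aggregate neighbours' beliefs
    msg = msg @ compat                     # penalize incompatible label pairs
    return F.softmax(unary - msg, dim=1)   # renormalize into marginals

# Unrolling a fixed number of steps as network layers is what makes the inference
# end-to-end trainable together with the perception module.
n, L = 100, 21
Q = F.softmax(torch.randn(n, L), dim=1)
unary, pairwise, compat = torch.randn(n, L), torch.rand(n, n), torch.randn(L, L)
for _ in range(5):
    Q = mean_field_step(unary, pairwise, compat, Q)
```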

Fig. 6: Message passing strategies for the visual semantic hierarchy reasoning models, in which the messages are passed through different semantic levels.

4.1.3 Visual Semantic Hierarchy Reasoning Models

In visual semantic information pursuit tasks, visual semantic hierarchies are ubiquitous. They often consist of visual semantic units, visual semantic phrases, local visual semantic regions and the scene graph. Given a visual semantic hierarchy, the contextual information within the other semantic layers is often used to maximize the posterior of the current semantic layer. Essentially, such an inference procedure tends to yield a tighter upper bound for the target variational free energy and thus often achieves better inference performance. Specifically, to dynamically build visual semantic hierarchies, the visual semantic hierarchy reasoning models are often required to align the contextual information of different visual semantic levels. As shown in Fig. 6, by passing the contextual information among the different visual semantic levels within the generated visual semantic hierarchy, each visual semantic level can obtain a much more consistent posterior.

Unlike the previous methods that only model pairwise potential terms within the same visual semantic level, the structure inference network (SIN) [23] propagates contextual information from the holistic scene and the adjacent connected nodes to the target node. Within this object detection framework, GRUs are used to store the contextual information. Motivated by computational considerations, a max-pooling layer is applied to aggregate the messages from the adjacent connected nodes into an integrated message. To leverage the contextual information across different semantic levels, the multi-level scene description network (MSDN) [39] establishes a dynamic graph consisting of object nodes, phrase nodes and region nodes. For each semantic level, a CNN-based merge-and-refine strategy is proposed to pass the contextual information along the graph structure.
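
A hedged sketch of the kind of node update these models perform is given below: messages arriving from connected nodes are max-pooled (as in SIN [23]) and the pooled message is fed into a GRU cell that stores the node's contextual state. The dimensions and wiring are illustrative and are not taken from either paper:

```python
import torch
import torch.nn as nn

d = 128
gru = nn.GRUCell(input_size=d, hidden_size=d)   # per-node memory of contextual information

def update_node(node_state: torch.Tensor, neighbour_msgs: torch.Tensor) -> torch.Tensor:
    """node_state: (1, d) hidden state; neighbour_msgs: (k, d) messages from k connected nodes."""
    pooled = neighbour_msgs.max(dim=0, keepdim=True).values   # max-pooling aggregation
    return gru(pooled, node_state)                            # GRU integrates the message into memory

state = torch.zeros(1, d)
msgs = torch.randn(4, d)    # e.g. messages from the scene node and three adjacent objects
state = update_node(state, msgs)
```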

4.1.4 DAG-based Reasoning Models

Generally, the variational free energy employed in CRFs usually fails to enforce higher-order contextual consistency due to computational considerations. Furthermore, small-sized objects are often smoothed out by the CRFs, which degrades the semantic segmentation performance. Instead of applying conventional CRFs, some bottom-up methods tend to use undirected cyclic graph (UCG) models to formulate visual semantic segmentation tasks. However, due to their loopy property, UCG models generally cannot be formulated as RNNs. To resolve this issue, as shown in Fig. 7, UCGs are often decomposed into a sequence of directed acyclic graphs (DAGs), in which each DAG can be modelled by a corresponding RNN. Such an RNN architecture is known as a DAG-RNN, which explicitly propagates local contextual information along the directed graph structure. Essentially, a DAG-based reasoning model like the DAG-RNN has two main advantages: 1) compared with conventional CNN models (such as FCNs), it is empirically found to be significantly more effective at aggregating context; 2) it requires substantially fewer parameters and fewer computational operations, which makes it more favourable for applications on resource-limited embedded platforms.

Fig. 7: The applied UCG is decomposed into a sequence of DAGs (one possible decomposition), in which the messages in each DAG are passed along its specific structure.

In current DAG-based reasoning models [26], [76], the DAG-RNNs apply 8-neighbourhood UCGs to encode the long-range contextual information effectively, so that the discriminative capabilities of the local representations are greatly improved. Inspired by the conventional tree-reweighted max-product algorithm (TRW) [77], the applied UCGs are decomposed into a sequence of DAGs, in which any pair of vertices is mutually reachable. Furthermore, the DAG-RNNs are often integrated with convolution and deconvolution layers, and a class-weighted loss is applied to reflect the fact that the class occurrence frequencies are generally imbalanced in most visual semantic information pursuit tasks, especially the visual semantic segmentation applications.
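
The following minimal sketch illustrates propagation along one of the decomposed DAGs (a south-east scan over the feature map); a full DAG-RNN runs several such scans in different directions over the 8-neighbourhood structure and combines their hidden states, whereas the sketch uses only two predecessors per site for brevity:

```python
import torch
import torch.nn as nn

H, W, d = 8, 8, 32
feats = torch.randn(H, W, d)   # local features from the convolutional layers
U = nn.Linear(d, d)            # input transform
V = nn.Linear(d, d)            # recurrent transform of the aggregated predecessor states

# South-east DAG: the predecessors of a site are the sites above it and to its left.
hidden = [[None] * W for _ in range(H)]
for y in range(H):
    for x in range(W):
        preds = []
        if y > 0: preds.append(hidden[y - 1][x])
        if x > 0: preds.append(hidden[y][x - 1])
        agg = torch.stack(preds).sum(dim=0) if preds else torch.zeros(d)
        hidden[y][x] = torch.tanh(U(feats[y, x]) + V(agg))

context = torch.stack([torch.stack(row) for row in hidden])   # (H, W, d) context-enriched map
```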

4.1.5 External Memory Reasoning Models

A difficult issue for visual semantic information pursuit is the dataset imbalance problem. To address this issue, one generally needs to accomplish so-called few-shot learning tasks [78], [79], [80], since most categories in the datasets have only a few training samples. Unlike the above models, instead of using internal memory cells such as long short-term memory (LSTM) or gated recurrent units (GRU), the external memory reasoning models apply external memory cells, such as the neural Turing machine (NTM) [81] or the memory-augmented neural network (MANN) [82], to store the generated contextual information. More importantly, they tend to use a meta-learning strategy [82] within the inference procedure. Such a meta-learning strategy can be summarized as "learning to learn", which selects parameters $\theta$ to reduce the expected learning cost $\mathcal{L}$ across a distribution of datasets $p(D)$: $\theta^* = \operatorname*{argmin}_\theta \mathbb{E}_{D \sim p(D)}[\mathcal{L}(D; \theta)]$. To prevent the network from slowly learning specific sample-class bindings, it directly stores the new input stimuli in the corresponding external memory cells instead of relearning them. Through such meta-learning, the convergence speed of the visual semantic information pursuit task is greatly improved, so that only a few training samples are enough to converge to a stable status.

Fig. 8: Overview of 2-D spatial external memory iterations for object detection. The old detection is marked with a green box, and the new detection is marked with orange. Here, the spatial memory network is only unrolled for one iteration.

Instead of detecting objects in parallel like the conventional object detection methods, the authors in [20] propose an instance-level spatial reasoning strategy, which tries to recognize objects conditioned on the previous detections. To this end, a spatial memory network (SMN) [20] (a 2-D spatial external memory) is devised to store the generated contextual information and extract spatial patterns using an effective reasoning module. Essentially, this leads to a new sequential reasoning module where image and memory are processed in parallel to obtain detections, which update the memory again, as shown in Fig. 8. Unlike the above method, which makes sequential updates to memory, the authors in [21] propose to update the regions in parallel as an approximation, in which a cell can be covered multiple times by different regions in overlapping cases. Specifically, a weight matrix is devised to keep track of how much a region has contributed to a memory cell. The final value of each updated cell is the weighted average over all regions.
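
The content-based addressing typically used by such external memories can be sketched as follows (an NTM-style cosine-similarity read; the memory size and query encoding are invented for illustration, and the writing and meta-learning logic is omitted):

```python
import torch
import torch.nn.functional as F

slots, width = 64, 40
memory = torch.randn(slots, width)   # external memory matrix, written to at earlier steps

def read(key: torch.Tensor, beta: float = 5.0) -> torch.Tensor:
    """Content-based read: attend to memory slots by cosine similarity to the query key."""
    sim = F.cosine_similarity(memory, key.unsqueeze(0), dim=1)   # (slots,)
    weights = F.softmax(beta * sim, dim=0)                       # sharpened addressing weights
    return weights @ memory                                      # weighted sum over slots

query = torch.randn(width)   # e.g. an encoding of the current detection
context = read(query)        # retrieved contextual information conditioning the next detection
```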

4.2 Major Developments in Top-down Methods

For visual semantic information pursuit applications, the associated visual semantic relations generally reside in a huge semantic space. Unfortunately, only limited training samples are available, which implies that it is impossible to fully train every possible visual relation. To maximize the target posterior under this long-tail distribution, the existing top-down methods generally transform the MAP inference tasks into linear programming problems. More importantly, they often distill external linguistic prior knowledge into the associated learning systems so that the objective functions of the target constrained optimization problems can be further regularized accordingly. Therefore, compared with the bottom-up methods, the top-down methods generally converge relatively easily. In this section, based on their distillation strategies, we divide the existing top-down methods into the following categories:

Fig. 9: The diagram of the semantic affinity distillation models.

4.2.1 Semantic Affinity Distillation Models

Even though the visual semantic relations obey a long-tail distribution, they are often semantically related to each other, which means it is possible to infer an infrequent relation from similar relations. To this end, the semantic affinity distillation models project the corresponding feature vectors (which are often generated from the union of the bounding boxes of the associated objects) into a low-dimensional semantic relation embedding space and use their semantic affinities as the external prior knowledge to regularize the target optimization problem. In general, the projection function is trained by enforcing similar visual semantic relations to be close together in the semantic relation embedding space. For instance, the visual semantic relation (man-ride-horse) should be close to (man-ride-elephant) and far away from (car-has-wheel) in the associated semantic relation embedding space, as illustrated in Fig. 9. The semantic affinity distillation models are capable of addressing zero-shot learning tasks, since visual semantic relations without any training samples can still be recognized through the external linguistic knowledge, which is clearly impossible for the bottom-up methods that only use internal visual prior knowledge.

One of the pioneering works is the visual relationship detection with language priors method [28], which trains the visual models for objects and predicates individually and later combines them, applying the external semantic affinity-based linguistic knowledge to predict consistent visual semantic relations. Unlike the above algorithm, the context-aware visual relationship detection method [37] tries to recognize the predicates by incorporating the semantic contextual information of the subject-object pair. Specifically, the context is encoded via word2vec into a semantic embedding space and is applied to generate a classification result for the predicate. To summarize, the external semantic affinity linguistic knowledge can not only improve the inference speed but also lead to zero-shot generalization.
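
The basic mechanism can be sketched as below, with an invented projection layer and placeholder embeddings: the union-box visual feature is projected into the relation embedding space and scored against language-derived relation embeddings, so an unseen triplet can still be ranked by its proximity to seen, semantically similar ones:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

visual_dim, embed_dim, num_relations = 4096, 300, 70
project = nn.Linear(visual_dim, embed_dim)                    # learned visual-to-semantic projection
relation_embeddings = torch.randn(num_relations, embed_dim)   # e.g. word2vec phrase embeddings

def score_relations(union_box_feature: torch.Tensor) -> torch.Tensor:
    """Rank all candidate relations for one subject-object pair by semantic affinity."""
    v = F.normalize(project(union_box_feature), dim=0)
    e = F.normalize(relation_embeddings, dim=1)
    return e @ v                                              # cosine scores, one per relation

scores = score_relations(torch.randn(visual_dim))
top5 = scores.topk(5).indices    # most plausible predicates, including zero-shot ones
```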

4.2.2 Teacher-student Distillation Models

To resolve the long-tail distribution issue, instead of relying on semantic affinities, the teacher-student distillation models tend to use external linguistic knowledge (the conditional distribution of a visual semantic relation given specific visual semantic units) generated from public knowledge databases to constrain the target optimization problem. Given the input stimuli $X$ and the prediction $Y$, the optimal teacher network $t$ is selected from an associated candidate set $T$ by minimizing $\mathrm{KL}(t(Y) \| s_\phi(Y \mid X)) - C\,\mathbb{E}_t[\mathcal{L}(X, Y)]$, where $t(Y)$ and $s_\phi(Y \mid X)$ represent the prediction results of the teacher and student networks, respectively; $\phi$ is the parameter set of the student network and $C$ is a balancing term; $\mathcal{L}(X, Y)$ depicts the constraint function, in which the predictions that satisfy the constraints are rewarded and the remaining ones are penalized; $\mathrm{KL}$ measures the KL divergence between the teacher's and the student's prediction distributions. Essentially, by solving the above optimization problem, the teacher's output can be viewed as a projection of the student's output onto the feasible polytope constrained by the external linguistic prior knowledge.

Fig. 10: The diagram of the teacher-student distillation models.

However, the teacher network alone is, in most cases, unable to provide accurate predictions, since the external linguistic prior knowledge is often noisy. In general, the student network represents the architecture without any external linguistic knowledge, while the framework incorporating both internal visual and external linguistic knowledge is formulated as the teacher network. They each have their own advantages: the teacher outperforms the student in cases with sufficient training samples, while the student achieves superior performance in the few-shot or zero-shot learning scenarios. Therefore, unlike the previous distillation methods [83], [84], [85] that use only either the teacher or the student as the output, the current teacher-student distillation models [32], [35] tend to incorporate the prediction results from both the student and teacher networks, as shown in Fig. 10.
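
A minimal sketch of how the student is typically trained against both the ground truth and the soft teacher predictions is given below; the loss weighting is illustrative, and the construction of the teacher from external linguistic constraints is abstracted away:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_probs: torch.Tensor,
                      target: torch.Tensor,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend the usual supervised loss with a KL term pulling the student towards the teacher."""
    ce = F.cross_entropy(student_logits, target)
    kl = F.kl_div(F.log_softmax(student_logits, dim=1), teacher_probs, reduction="batchmean")
    return (1 - alpha) * ce + alpha * kl

student_logits = torch.randn(8, 70)               # 8 subject-object pairs, 70 candidate predicates
teacher_probs = F.softmax(torch.randn(8, 70), 1)  # teacher output regularized by external knowledge
target = torch.randint(0, 70, (8,))
loss = distillation_loss(student_logits, teacher_probs, target)
```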

5 FUTURE RESEARCH DIRECTIONS

Even though the current visual semantic information pursuit methods have achieved levels of performance not seen before, there are still numerous challenging yet exciting research directions to investigate in the future.


5.1 Weakly-supervised Pursuit Methods

Annotations for visual semantic information pursuit applications, especially the visual semantic segmentation tasks, are generally hard to obtain since we need to invest tremendous time and effort into the labelling process. Moreover, as most current visual semantic information pursuit methods are essentially fully-supervised algorithms, rich annotations are required if we want to train these methods. For instance, to locate objects, the fully-supervised methods require ground-truth locations of the associated bounding boxes. To alleviate the annotation burden, weakly-supervised pursuit methods [86], [87], [88], [89], [90] have been proposed in recent years, which only require the vocabulary information to extract the visual semantic information. Unfortunately, the current weakly-supervised pursuit methods do not achieve performance comparable to fully-supervised pursuit methods. This is the reason why the fully-supervised methods are prevalent in the visual semantic information pursuit literature.

5.2 Pursuit Methods using Region-based Decomposition

For most visual semantic information graphical models, it is generally computationally intractable to infer the target posteriors. To resolve this NP-hard problem, mean field approximation is often applied in the current visual semantic information pursuit methods, in which the associated graphical model is fully decomposed into independent nodes. Unfortunately, such a simple decomposition strategy only incorporates unary pseudo-marginals into the associated variational free energy, which is clearly not enough for complicated visual semantic information pursuit applications. Moreover, dense (fully connected) inference models are slow to converge. Inspired by the generalized belief propagation algorithm [91], some pursuit methods [42], [43] have started to apply a region-based decomposition strategy, in which the associated graphical model is factorized into various regions, with the nodes within each region no longer assumed independent. The applied region-based decomposition strategy can not only improve the inference speed in some cases, but also incorporate higher-order pseudo-marginals into the associated variational free energy. However, there are still several open questions: 1) How many regions are enough for most visual semantic information pursuit applications? 2) How to efficiently compute the higher-order pseudo-marginals given the decomposed regions? 3) How to properly propagate contextual information between different regions?
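To make the contrast concrete, the following minimal NumPy sketch (an illustrative construction of our own, with toy Potts-style potentials, not code from any surveyed method) shows the naive mean field update that fully decomposes a pairwise energy into per-node beliefs; a region-based decomposition would instead maintain joint pseudo-marginals over small groups of nodes.

import numpy as np

def naive_mean_field(unary, pairwise, edges, n_iters=10):
    """Naive mean-field inference for a pairwise energy
    E(x) = sum_i unary[i, x_i] + sum_(i,j) pairwise[x_i, x_j]."""
    n_nodes, n_labels = unary.shape
    q = np.full((n_nodes, n_labels), 1.0 / n_labels)   # uniform initial beliefs
    neighbours = {i: [] for i in range(n_nodes)}
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)
    for _ in range(n_iters):
        for i in range(n_nodes):
            # expected pairwise cost under the neighbours' current beliefs
            msg = sum(q[j] @ pairwise.T for j in neighbours[i])
            logits = -(unary[i] + msg)
            logits -= logits.max()                      # numerical stability
            q[i] = np.exp(logits) / np.exp(logits).sum()
    return q

# toy example: three nodes in a chain, two labels, Potts-style smoothness
unary = np.array([[0.0, 2.0], [1.0, 1.0], [2.0, 0.0]])
pairwise = np.array([[0.0, 1.0], [1.0, 0.0]])
print(naive_mean_field(unary, pairwise, edges=[(0, 1), (1, 2)]))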

5.3 Pursuit Methods with Higher-order Potential Terms

To extract the visual semantic information, scene modelling methods generally need to factorize the associated energy functions into various potential terms, in which the unary terms produce the predictions while the higher-order potential terms constrain the generated predictions to be consistent. However, the current visual semantic information pursuit methods typically incorporate only pair-wise potential terms into the associated variational free energy, which is clearly not enough. To resolve this issue, the current trend is to incorporate higher-order potential terms into the associated variational free energies. For instance, for the visual semantic segmentation task, the recently proposed UCG-based pursuit methods [26], [76] replace the pair-wise potential terms (applied in most CRF-based models) with higher-order potential terms, and thus achieve state-of-the-art segmentation performance. However, by incorporating higher-order potential terms, the target constrained optimization problems become much harder to solve since the polynomial higher-order potential terms inject more non-convexities into the objective function [92]. Therefore, further efforts are needed to address this non-convexity issue.
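For reference, a generic factorized energy of this kind can be written (in our own notation, not tied to any particular surveyed method) as

E(\mathbf{x}) = \sum_{i} \psi_i(x_i) + \sum_{(i,j)} \psi_{ij}(x_i, x_j) + \sum_{c} \psi_c(\mathbf{x}_c),

where the unary terms \psi_i score individual predictions, the pairwise terms \psi_{ij} encourage consistency between pairs of variables, and the higher-order terms \psi_c act on cliques c of more than two variables (e.g. all pixels in a superpixel, or all nodes along a DAG path); it is this last group that introduces the additional non-convexity discussed above.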

5.4 Pursuit Methods with Advanced Domain Adaptation

One of the most difficult issues for visual semantic information pursuit applications is the dataset imbalance problem. In most cases, only a few categories have enough training samples, while the remaining categories have few or even zero training samples. Due to the long-tail distribution, this situation becomes even worse when we try to pursue the semantic relation information. To address this few-shot or zero-shot learning problem, domain adaptation [93], [94], [95], [96] becomes a natural choice since its aim is to transfer the missing knowledge from related domains into the target domain. For instance, to resolve the few-shot learning problem, the top-down pursuit methods use distilled external linguistic knowledge to regularize the variational free energy, while the bottom-up pursuit methods achieve meta-learning by using external memory cells. Essentially, the domain adaptation strategies used in existing visual semantic information pursuit methods focus on learning generic feature vectors in one domain and transferring them to other domains. Unfortunately, they generally transfer unary features and largely ignore the more structured graphical representations [97]. To transfer structured graphs to the corresponding domains, more advanced domain adaptation methodologies are needed in the future.

5.5 Pursuit Methods without Message Passing

To resolve the target NP-hard constrained optimization problem, even though numerous constraint optimization strategies are available [70], the current deep learning based visual semantic information pursuit methods depend entirely on one specific optimization methodology: message passing. The technique is generally motivated by linear programming and variational optimization. For modern deep learning architectures, such a parallel optimization methodology has proven more effective than other optimization strategies. Besides being widely applied in different visual semantic information pursuit tasks, it has also been found successful in the quantum chemistry area [98], [99], [100], [101]. However, the parallel message passing strategies empirically underperform compared to sequential optimization methodologies and typically do not provide feasible integer solutions [70]. To address these issues, alternative visual semantic information pursuit methods will be needed in the future.


6 BENCHMARKS AND EVALUATION METRICS

In this section, we introduce the main benchmarks and evaluation metrics for the four research applications investigated in this survey: object detection, visual semantic segmentation, visual relationship detection and scene graph generation.

6.1 Object Detection

6.1.1 Benchmarks

Two benchmarks are commonly used in object detection applications: the test set of PASCAL VOC 2007 [102] and the validation set of MS COCO [103]. More specifically, the test set of PASCAL VOC 2007 contains 4,952 images and 14,976 object instances from 20 categories. To evaluate the performance of object detection methods in different image scenes, it includes a large number of objects with abundant variations within each category, in terms of viewpoint, scale, position, occlusion and illumination. Compared with the PASCAL VOC 2007 test set, the MS COCO benchmark is more challenging since its images are gathered from complicated day-to-day scenes that contain common objects in their natural contexts. Specifically, the MS COCO benchmark contains 80,000 training images and 500,000 instance annotations. To evaluate detection performance, most object detection methods use the first 5,000 MS COCO validation images. An additional 5,000 non-overlapping images have also been used as the validation dataset in some cases.

6.1.2 Evaluation Metrics

To evaluate object detection methods, one needs to consider the following two aspects: the object proposals generated by the methods and the corresponding objectness. In the existing literature, the metrics for evaluating object proposals are often functions of the intersection over union (IOU) between the proposal locations and the associated ground-truth annotations. Given the IOU, recall can be obtained as the fraction of ground-truth bounding boxes covered by proposal locations above a certain IOU overlap threshold. To evaluate the performance of objectness detection, the mean average precision (mAP) metric is commonly used for the VOC 2007 test benchmark (the IOU threshold is normally set to 0.5), while the MS COCO 2015 test-dev benchmark generally involves two types of metrics: average precision (AP) over all categories and different IOU thresholds, and average recall (AR) over all categories and IOUs (which is basically computed on a per-category basis, i.e. the maximum recall given a fixed number of detections per image). Specifically, AP, AP50 and AP70 represent the average precision over IOU thresholds from 0.5 to 0.95 with a step of 0.05 (written as 0.5:0.95), the average precision with an IOU threshold of 0.5 and the average precision with an IOU threshold of 0.7, respectively. AR1, AR10 and AR100 denote the average recall given 1, 10 and 100 detections per image, respectively. We refer interested readers to the relevant papers [17], [18] for the details and the mathematical formulations of the above metrics.
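As an illustration of these definitions (a simplified sketch using corner-coordinate boxes, not the exact evaluation code of [17], [18]), IOU and proposal recall can be computed as follows.

def iou(box_a, box_b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def proposal_recall(gt_boxes, proposals, thresh=0.5):
    """Fraction of ground-truth boxes covered by at least one proposal
    whose IOU exceeds the given threshold."""
    covered = sum(
        any(iou(gt, p) >= thresh for p in proposals) for gt in gt_boxes
    )
    return covered / len(gt_boxes)

gt = [(10, 10, 50, 50), (60, 60, 100, 100)]
props = [(12, 8, 48, 52), (200, 200, 240, 240)]
print(proposal_recall(gt, props, thresh=0.5))   # 0.5: one of two GT boxes covered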

6.2 Visual Semantic Segmentation

6.2.1 Benchmarks

In this survey, three main benchmarks are chosen from the abundant datasets for visual semantic segmentation methods: Pascal Context [104], Sift Flow [105] and COCO Stuff [106]. Specifically, the Pascal Context benchmark contains 10,103 images extracted from the Pascal VOC 2010 dataset, of which 4,998 images are used for training. The images are relabelled as pixel-wise segmentation maps which include 540 semantic categories (including the original 20 categories), and each image has a size of approximately 375 × 500. The Sift Flow dataset contains 2,688 images obtained from 8 specific kinds of outdoor scenes. Each image has a size of 256 × 256 and belongs to one of 33 semantic classes. COCO Stuff is a recently released scene segmentation dataset. It includes 10,000 images extracted from the Microsoft COCO dataset, of which 9,000 images are used for training, and the previously unlabelled stuff pixels are further densely annotated with 91 extra class labels.

6.2.2 Evaluation Metrics

To evaluate the performance of visual semantic segmentation methods, three main metrics are generally applied in the existing literature: Global Pixel Accuracy (GPA), Average per-Class Accuracy (ACA) and mean Intersection over Union (mIOU). Specifically, GPA represents the percentage of all correctly classified pixels, ACA denotes the mean of the class-wise pixel accuracies and mIOU is the mean of the class-wise IOU scores. The details and the corresponding mathematical formulations of the above metrics can be found in [51].
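For concreteness, the three metrics can be derived from a single confusion matrix, as in the short sketch below (a simplified illustration that assumes every class appears at least once in the ground truth; see [51] for the full formulations).

import numpy as np

def segmentation_metrics(conf):
    """GPA, ACA and mIOU from a confusion matrix where conf[i, j] counts
    pixels of ground-truth class i predicted as class j."""
    conf = conf.astype(float)
    tp = np.diag(conf)                       # correctly classified pixels per class
    gpa = tp.sum() / conf.sum()              # global pixel accuracy
    aca = np.mean(tp / conf.sum(axis=1))     # mean of the per-class accuracies
    union = conf.sum(axis=1) + conf.sum(axis=0) - tp
    miou = np.mean(tp / union)               # mean intersection over union
    return gpa, aca, miou

conf = np.array([[50, 5, 0],
                 [10, 30, 5],
                 [0, 5, 20]])
print(segmentation_metrics(conf))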

6.3 Visual Relationship Detection

6.3.1 Benchmarks

The current visual relationship detection methods often use two benchmarks: the visual relationship dataset [28] and visual genome [107]. Unlike the datasets for object detection, visual relationship datasets should contain more than just objects localized in the image; they should capture the rich variety of interactions between subject and object pairs. Various types of interactions are considered in the above visual relationship benchmark datasets, i.e. verbs (e.g. wear), spatial (e.g. in front of), prepositions (e.g. with) or comparative (e.g. higher than). Moreover, the number of predicate types per object category should be large enough. For instance, a man can be associated with predicates such as wear, with, play, etc. Specifically, the visual relationship dataset contains 5,000 images with 100 object categories and 70 predicates. In total, the dataset contains 37,993 relationships with 6,672 relationship types and 24.25 predicates per object category. Unlike the visual relationship dataset, the recently proposed visual genome dataset incorporates numerous kinds of annotations, one of which is visual relationships. The visual genome relationship dataset contains 108,077 images and 1,531,448 relationships. However, it generally needs to be cleansed since the corresponding annotations often contain misspellings and noisy characters, and the verbs and the nouns also appear in different forms.


6.3.2 Evaluation Metrics

The current visual relationship detection methods often use two evaluation metrics: recall@50 and recall@100. Here, recall@x [108] represents the fraction of times the correct relationship is predicted within the top x most confident relationship predictions. The reason why recall@x is used instead of the widely applied mean average precision (mAP) metric is that mAP is a pessimistic evaluation metric in this setting: since we cannot exhaustively annotate all possible relationships in an image, mAP would penalize a correct prediction whenever it is not congruent with the ground-truth annotations.
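The metric itself is simple to state in code; the sketch below (our own simplification that matches triplet labels only and ignores the localization/IOU requirement of the full protocol) illustrates the computation.

def recall_at_x(gt_triplets, pred_triplets, x):
    """recall@x: fraction of ground-truth (subject, predicate, object)
    triplets that appear among the top-x scored predictions.
    pred_triplets is a list of (score, triplet) pairs."""
    top_x = {t for _, t in sorted(pred_triplets, reverse=True)[:x]}
    hits = sum(1 for t in gt_triplets if t in top_x)
    return hits / len(gt_triplets)

gt = [("man", "ride", "horse"), ("horse", "on", "grass")]
preds = [(0.9, ("man", "ride", "horse")),
         (0.7, ("man", "wear", "hat")),
         (0.4, ("horse", "on", "grass"))]
print(recall_at_x(gt, preds, x=2))   # 0.5: only one GT triplet is in the top 2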

6.4 Scene Graph Generation

6.4.1 Benchmarks

The visual genome [107] is often used as the benchmark for scene graph generation applications. Unlike the previous visual relationship datasets, the visual relationships within a visual genome scene graph are generally not independent of each other. Specifically, the visual genome scene graph dataset contains 108,077 images with an average of 38 objects and 22 relationships per image. However, a substantial fraction of the object annotations have poor quality, with overlapping bounding boxes and/or ambiguous object names. Therefore, a cleaned visual genome scene graph generation dataset is needed for a meaningful evaluation procedure. In the current scene graph generation methods, instead of training on all possible categories and predicates, the most frequent categories and predicates are often chosen for evaluation.

6.4.2 Evaluation Metrics

Similar to the previous visual relationship detection methods, the scene graph generation methods generally apply the recall@x [108] metric instead of the mAP metric. Specifically, recall@50 and recall@100 are normally used to evaluate scene graph generation methods.

7 EXPERIMENTAL COMPARISON

In this section, we compare the performance of different visual semantic information pursuit methods for each potential application mentioned in this survey. Specifically, in each of the following subsections, we choose the most representative methods to compare the pursuit performance. The benchmarks and the evaluation metrics introduced in the previous section are the basis for the performance comparisons.

7.1 Object Detection

In this section, two benchmarks - VOC 2007 test [102] and MS COCO 2015 test-dev [103] - are applied to compare different cutting-edge object detection methods. For a fair comparison, all methods besides YOLOv2 [18] employ VGG-16 as the backbone. Mask RCNN [109] is not included in this comparison since it applies the much more complex ResNet-101 as the backbone. For the VOC 2007 test dataset, we select 6 current object detection methods, namely Fast R-CNN [53], Faster R-CNN [17], SSD500 [19], ION [110], SIN [23] and RFB Net300 [111], as shown in Table 2. For the MS COCO 2015 test-dev benchmark, Fast R-CNN [53], Faster R-CNN [17], YOLOv2 [18], ION [110] and SIN [23] are included in the performance comparison reported in Table 3. Among the above methods, only ION [110] and SIN [23] incorporate visual context reasoning modules within the learning procedure, while the others merely apply visual perception modules. The reason for including various state-of-the-art visual perception models in the comparison is to provide a complete overview and to gain further understanding of the impact of the visual context reasoning modules.

TABLE 2: Performance comparison on VOC 2007 test.

Method              Train     mAP
Fast R-CNN [53]     07 + 12   70.0
Faster R-CNN [17]   07 + 12   73.2
SSD500 [19]         07 + 12   75.1
ION [110]           07 + 12   75.6
SIN [23]            07 + 12   76.0
RFB Net300 [111]    07 + 12   80.5

• Note: 07 + 12 represents 07 trainval + 12 trainval.

TABLE 3: Performance comparison on COCO 2015 test-dev.

Method              Train        AP     AP50   AP70   AR1    AR10   AR100
Fast R-CNN [53]     train        20.5   39.9   19.4   21.3   29.5   30.1
Faster R-CNN [17]   train        21.1   40.9   19.9   21.5   30.4   30.8
YOLOv2 [18]         trainval35k  21.6   44.0   19.2   20.7   31.6   33.3
ION [110]           train        23.0   42.0   23.0   23.0   32.4   33.0
SIN [23]            train        23.2   44.5   22.0   22.6   31.6   32.0

• Note: trainval35k represents COCO train + 35k val [110].

In Tables 2 and 3, we observe that, apart from RFB Net300 [111], the object detection methods with visual context reasoning modules (such as ION and SIN) generally achieve better performance than the pure visual perception object detection algorithms (such as Fast R-CNN, Faster R-CNN, SSD500 and YOLOv2). This is because they treat the object detection task as a combination of perception and reasoning instead of concentrating only on perception. Specifically, they consider other detected objects or the holistic scene as contextual information and try to improve the detection performance by reasoning over this contextual information. In some cases, such contextual information can be quite important for detecting the target objects, for instance, when the target object is partly occluded by other objects or occupies only an extremely small region within the image. In such scenarios, it is almost impossible for the current visual perception object detection methods to detect the target objects. However, given the contextual information around the target objects, it is possible to infer their presence even under such harsh conditions.

As discussed in Section 2, the current visual semantic information pursuit methods tend to employ two-stage detection strategies to accomplish visual perception. However, as shown in Table 2, compared with those traditional two-stage detection algorithms, current one-stage detection methods like RFB Net300 [111] achieve much better performance. Theoretically, incorporating such one-stage detection strategies into the proposed unified paradigm could further boost the object detection performance, yet the biggest obstacle here is how to propose visual semantic relations within the framework of such one-stage detection strategies.

7.2 Visual Semantic Segmentation

In this section, we compare several state-of-the-art visual semantic segmentation methods on three benchmarks: Pascal Context [104], Sift Flow [105] and COCO Stuff [106]. For Pascal Context, only the most frequent 59 classes are selected for evaluation, and the classes whose frequencies are lower than 0.01 are considered rare classes according to the 85-15 percent rule. For Sift Flow, similar to paper [105], we split the whole dataset into training and test sets with 2,488 and 200 images, respectively. Each pixel within the above images can be classified as one of the most frequent 33 semantic categories. Based on the 85-15 percent rule, the classes whose frequencies are lower than 0.05 are considered rare classes. For COCO Stuff, each pixel can be categorized into one of 171 semantic classes in total, and the frequency threshold 0.4 is used to define the rare classes.

Specifically, 10 current visual semantic segmentation methods, including CFM [112], DeepLab [66], DeepLab + CRF [66], FCN-8s [113], CRF-RNN [24], ParseNet [25], ConvPP-8s [114], UoA-Context + CRF [115], DAG-RNN [26] and DAG-RNN + CRF [26], are compared on the Pascal Context benchmark, as shown in Table 4. For the Sift Flow dataset, besides state-of-the-art semantic segmentation methods such as ParseNet [25], ConvPP-8s [114], FCN-8s [113], UoA-Context + CRF [115], DAG-RNN [26] and DAG-RNN + CRF [26], we also compare various earlier methods, namely Byeon et al. [116], Liu et al. [105], Pinheiro et al. [117], Farabet et al. [118], Tighe et al. [119], Sharma et al. [120], Yang et al. [121] and Shuai et al. [122], as shown in Table 5. For the recently released COCO Stuff benchmark, we compare 5 different visual semantic segmentation methods: FCN [106], DeepLab [66], FCN-8s [113], DAG-RNN [26] and DAG-RNN + CRF [26], as depicted in Table 6.

Recently, owing to their effective feature generation, CNN-based visual semantic segmentation methods have become popular; FCN [106] and its variant FCN-8s [113] are the most well-known examples. However, it is generally effort-demanding to identify the optimal network architecture for images with different resolutions. The small convolution kernels in FCNs are sufficient for low-resolution images, but it is hard to extend them to encode relatively large receptive fields for high-resolution images. Various visual semantic segmentation methods use visual context reasoning modules, i.e. DeepLab + CRF [66], CRF-RNN [24], UoA-Context + CRF [115] and DAG-RNN + CRF [26], to solve the above issue. Essentially, combining the strengths of CNNs and CRFs for semantic segmentation becomes the focus. Among these methods, only DeepLab + CRF [66] trains an FCN [106] and applies a dense CRF method as a post-processing step; the other methods jointly learn the dense CRFs and CNNs. The majority of them only incorporate pairwise (binary) potential terms within their respective variational free energies.

TABLE 4: Performance comparison (%) on Pascal Context dataset (59 classes).

Method                    GPA    ACA    mIOU
CFM [112]                 −      −      31.5
DeepLab [66]              −      −      37.6
FCN-8s [113]              67.5   52.3   39.1
CRF-RNN [24]              −      −      39.3
DeepLab + CRF [66]        −      −      39.6
ParseNet [25]             −      −      40.4
ConvPP-8s [114]           −      −      41.0
UoA-Context + CRF [115]   71.5   53.9   43.3
DAG-RNN [26]              72.7   55.3   42.6
DAG-RNN + CRF [26]        73.6   55.8   43.7

• Note: For a fair comparison, all the above methods apply VGG-16 [65] as their visual perception module.

TABLE 5: Performance comparison (%) on Sift Flow dataset (33 classes).

Method                    GPA    ACA    mIOU
Byeon et al. [116]        70.1   22.6   −
Liu et al. [105]          74.8   −      −
Pinheiro et al. [117]     77.7   29.8   −
Farabet et al. [118]      78.5   29.4   −
Tighe et al. [119]        79.2   39.2   −
Sharma et al. [120]       79.6   33.6   −
Yang et al. [121]         79.8   48.7   −
Shuai et al. [122]        81.2   45.5   −
ParseNet [25]             86.8   52.0   40.4
ConvPP-8s [114]           −      −      40.7
FCN-8s [113]              85.9   53.9   41.2
DAG-RNN + CRF [26]        87.8   57.8   44.8
DAG-RNN [26]              87.3   60.2   44.4
UoA-Context + CRF [115]   88.1   53.4   44.9

• Note: For a fair comparison, all current methods from ParseNet [25] downwards apply VGG-16 [65] as their visual perception module. The methods above use their own bespoke solutions.

TABLE 6: Performance comparison (%) on COCO Stuff dataset (171 classes).

Method               GPA    ACA    mIOU
FCN [106]            52.0   34.0   22.7
DeepLab [66]         57.8   38.1   26.9
FCN-8s [113]         60.4   38.5   27.2
DAG-RNN [26]         62.2   42.3   30.4
DAG-RNN + CRF [26]   63.0   42.8   31.2

• Note: For a fair comparison, all the above methods apply VGG-16 [65] as their visual perception module.

According to the comparison results shown in Tables 4, 5 and 6, the recently proposed DAG-RNN + CRF [26] method achieves the best performance in most scenarios, while the previously favoured UoA-Context + CRF [115] algorithm outperforms it only in a few cases on the Sift Flow benchmark. This is because the DAG-RNN module within the DAG-RNN + CRF [26] method incorporates higher-order potential terms into the variational free energy instead of only pairwise potential terms. It is capable of enforcing both local consistency and higher-order semantic coherence [26]. Moreover, the CRF module boosts the unary predictions and improves the localization of object boundaries, compared to the DAG-RNN module alone [26].

7.3 Visual Relationship Detection

In this section, two main benchmarks - the visual relationship dataset [28] and visual genome [107] - are used to compare different visual relationship detection methods on three tasks: predicate recognition, where both the bounding boxes and the labels of the subjects and the objects are given; phrase recognition, which predicts the triplet labels given a triplet structure as a union bounding box; and relationship recognition, which also outputs triplet labels but detects separate bounding boxes of the subjects and the objects. Specifically, for the visual relationship dataset, 10 state-of-the-art visual relationship detection methods are chosen for the performance comparison, including LP [28], VTransE [33], CAI [37], ViP [30], VRL [31], LK [32], PPRFCN [123], SA-Full [124], Zoom-Net [125] and CAI + SCA-M [125], as shown in Table 7. For visual genome, we compare three current visual relationship detection methods: DR-Net [29], ViP [30] and Zoom-Net [125], as presented in Table 8. It should be noted that the recall@x [108] performance depends on the number of predicates per subject-object pair to be evaluated, i.e. the top k predictions. In this survey, we choose k = 1 for the visual relationship dataset and k = 100 for visual genome. The IOU between the predicted bounding boxes and the ground-truth is required to be above 0.5 for the above methods. Furthermore, for a fair comparison, all methods mentioned above apply RPN [17] and triplet NMS [30] to generate object proposals and remove redundant triplet candidates, respectively.

According to the results shown in Table 7, the recently proposed CAI + SCA-M method [125] outperforms earlier visual relationship detection methods on all comparison criteria. For a better understanding of the comparison results, we divide the methods into three different categories. Specifically, PPRFCN [123] and SA-Full [124] are essentially weakly-supervised visual relationship detection methods, which cannot deliver detection performance comparable to the fully-supervised algorithms. Among the fully-supervised methods, VTransE [33], ViP [30] and Zoom-Net [125] are essentially bottom-up visual relationship pursuit models, which only incorporate internal visual prior knowledge into the detection procedure. Unlike the above algorithms, the top-down visual relationship pursuit methods such as LP [28], CAI [37], VRL [31], LK [32] and CAI + SCA-M [125] distill external linguistic prior knowledge into their learning frameworks. Generally, the external linguistic prior knowledge regularizes the original constrained optimization problems so that the associated top-down methods tend to be biased towards certain feasible polytopes.

TABLE 7: Performance comparison on Visual Relationship dataset (k = 1).

                      Predicate        Phrase           Relationship
Method                R@50    R@100    R@50    R@100    R@50    R@100
LP [28]               47.87   47.87    16.17   17.03    13.86   14.70
VTransE [33]          44.76   44.76    19.42   22.42    14.07   15.20
PPRFCN [123]          47.43   47.43    19.62   23.15    14.41   15.72
SA-Full [124]         50.40   50.40    16.70   18.10    14.90   16.10
CAI [37]              53.59   53.59    17.60   19.24    15.63   17.39
ViP [30]              −       −        22.78   27.91    17.32   20.01
VRL [31]              −       −        21.37   22.60    18.19   20.79
Zoom-Net [125]        50.69   50.69    24.82   28.09    18.92   21.41
LK [32]               55.16   55.16    23.14   24.03    19.17   21.34
CAI + SCA-M [125]     55.98   55.98    25.21   28.89    19.54   22.39

• Note: All the above methods apply RPN [17] to generate object proposals, and remove the redundant triplet candidates using triplet NMS [30].

TABLE 8: Performance comparison on Visual Genome dataset (k = 100).

                      Predicate        Phrase           Relationship
Method                R@50    R@100    R@50    R@100    R@50    R@100
DR-Net [29]           62.05   71.96    13.51   17.23    12.56   16.06
ViP [30]              63.44   74.15    15.70   19.96    14.78   18.85
Zoom-Net [125]        67.25   77.51    20.84   26.16    19.97   25.07

• Note: All the above methods apply RPN [17] to generate object proposals, and remove the redundant triplet candidates using triplet NMS [30].

Unlike Table 7, all the methods in Table 8 are bottom-up visual relationship pursuit methods. We can observe in Table 8 that Zoom-Net [125] outperforms the other two methods by a large margin, especially on the visual relationship recognition task. As a visual semantic hierarchy reasoning model, Zoom-Net [125] propagates contextual information across different visual semantic levels. Essentially, within the associated variational inference, it can obtain a tighter upper bound on the target variational free energy and thus generally converges to a better local optimum, as shown in Table 8.

7.4 Scene Graph Generation

In this section, seven available scene graph generation methods - IMP [38], MSDN [39], NM-Freq [40], Graph R-CNN [41], MotifNet [40], GPI [43] and LinkNet [44] - are compared on the visual genome dataset [107], as shown in Table 9. Various visual genome dataset cleaning strategies exist in the current literature and, for a fair comparison, we choose the one used in the pioneering work [38] as the universal preprocessing model for all the above methods. This cleaning strategy generates training and test sets with 75,651 images and 32,422 images, respectively. Moreover, the 150 most frequent object classes and 50 relation classes are selected in this survey. In general, each image has around 11.5 objects and 6.2 relationships in its scene graph. Furthermore, three evaluation aspects - Predicate Classification (PredCls), Phrase Classification (PhrCls) and Scene Graph Generation (SGGen) - are considered in this survey. Specifically, PredCls represents the performance in recognizing the relation between two objects given the ground-truth locations; PhrCls depicts the performance in recognizing the two object categories and their relation given the ground-truth locations; SGGen indicates the performance in detecting objects (IOU > 0.5) and recognizing the predicates linking object pairs.

TABLE 9: Performance comparison on Visual Genome dataset.

                     PredCls          PhrCls           SGGen
Method               R@50    R@100    R@50    R@100    R@50    R@100
IMP [38]             40.8    45.2     20.6    22.4     6.4     8.0
MSDN [39]            53.2    57.9     27.6    29.9     7.0     9.1
NM-Freq [40]         41.8    48.8     23.8    27.2     6.9     9.1
Graph R-CNN [41]     54.2    59.1     29.6    31.6     11.4    13.7
MotifNet [40]        65.2    67.1     35.8    36.5     27.2    30.3
GPI [43]             65.1    66.9     36.5    38.8     −       −
LinkNet [44]         67.0    68.5     41.0    41.7     27.4    30.1

• Note: All the above methods apply the same cleaning strategy proposed in paper [38].

Unlike visual relationship detection, scene graph generation must model the global inter-dependency among the objects in the entire scene, rather than focus on local triplet relationships in isolation. Essentially, strong independence assumptions regarding local predictors limit the quality of global predictions [40]. As shown in Table 9, the first four methods (IMP [38], MSDN [39], NM-Freq [40] and Graph R-CNN [41]) use graph-based inference to propagate local contextual information in both directions between object and relationship nodes, while the last three methods (MotifNet [40], GPI [43] and LinkNet [44]) incorporate global contextual information within the inference procedure. From Table 9, it can be seen that the latter methods outperform the former ones by a large margin. Among them, the recently proposed LinkNet [44] achieves the best performance on almost all comparison criteria. The core of its success is a simple and effective relational embedding module designed to model the global contextual information explicitly.

8 DISCUSSIONS

To gain insight into the above visual semantic information pursuit algorithms, in this section we analyze them in terms of optimization strategy, loss function and supervision category. The link between the applied optimization strategy and the input visual semantic stimulus representation is also investigated.

8.1 Further Analysis

As discussed in Section 3, current visual semantic information pursuit methods follow a unified paradigm consisting of visual perception and visual context reasoning. Given the visual semantic units output by the visual perception module, one can infer the corresponding visual semantic relations via a deep learning based variational free energy minimization method. This amounts to defining a target variational free energy F(θ, Q) (also known as the loss function) and then minimizing it. In this section, instead of focusing on specific experimental comparisons, we concentrate on investigating representative visual semantic information pursuit algorithms in terms of supervision categories, loss functions and optimization strategies.

8.1.1 Supervision Category

In the current literature, excluding certain weakly-supervised object detection methods, all visual semantic information pursuit algorithms are either fully-supervised or semi-supervised. Even though it is often tedious and time-consuming to obtain rich annotations, the fully-supervised methods, due to their superior performance, currently dominate the visual semantic information pursuit landscape. For instance, most of the current bottom-up visual semantic information pursuit methods are essentially fully-supervised.

Unfortunately, for massive training datasets, such a fully-supervised training methodology may fail miserably in pursuing the visual semantic relations. This is because the relations often obey a long-tail distribution, which manifests itself in most categories having few or no training samples. To this end, various semi-supervised top-down visual semantic information pursuit methods like [28], [32], [37] have recently been proposed to resolve the above few-shot or zero-shot learning issues. Such methods often distill internal/external linguistic knowledge to regularize the target variational free energy. This kind of regularization normally leads to either few-shot or zero-shot generalization.

8.1.2 Loss Function

Generally speaking, most current research concentrates on improving the overall inference performance, since in most visual semantic information pursuit tasks, especially visual relationship detection and scene graph generation applications, it is still poor. Understandably, in the current literature, the computational complexity of the methods, especially the time complexity, is largely ignored. Therefore, in this section, the loss functions (variational free energies) are only evaluated in terms of inference performance.

In visual semantic information pursuit tasks, one can alter the target variational free energy in the following three ways: 1) apply an alternative approximation strategy for the variational distribution; 2) define a new energy function with higher-order potential terms; 3) distill internal or external prior knowledge to regularize the variational free energy.

Essentially, smart approximation strategies aim to find a surrogate variational free energy so that the variational distribution is as close to the target posterior as possible. As the adopted energy function specifies a factorization of the input joint distribution, it offers the opportunity to incorporate higher-order potential terms to model the commonly existing higher-order dependencies among objects. To resolve the dataset imbalance problem, internal or external prior knowledge can be distilled to further regularize the target variational free energy; without such regularization, visual semantic information pursuit methods may not converge during learning.

All the above strategies can improve the overall inference performance, provided one selects a surrogate variational free energy that is balanced in terms of bias (the complexity of the applied surrogate) and variance (the uncertainty of the applied estimation). This is also known as the bias-variance trade-off [126]. A better trade-off can be found by tuning the complexity of the surrogate energy. For instance, to obtain visual semantic segmentation, CRF-RNN [24] only applies binary potential terms to characterize the dependencies among adjacent pixels, while DAG-RNN [26] incorporates relatively higher-order potentials in its energy function and thus improves the inference performance, as shown in Table 4. To generate the scene graph, IMP [38] employs a mean field approximation strategy to decompose the variational distribution, while GPI [43] introduces a relatively complex region-based decomposition strategy and achieves much better performance, as evident in Table 9. Overall, in the current literature, a mean field approximation strategy is generally applied to decompose the variational distribution, with the energy function often incorporating only unary and binary potential terms.

Due to its superior efficiency and scalability, all visual semantic information pursuit methods, except [42], [43], tend to apply the same naive mean field approximation strategy to decompose the variational distribution. However, this strategy often induces an NP-hard, non-convex variational free energy [63], which may jeopardize the overall inference performance. This risk could partly explain why the overall performance of the current visual semantic information pursuit methods still falls short of expectations, especially for complex tasks such as visual relationship detection or scene graph generation.

In order to capture higher-order dependencies within the input probabilistic graphical model, some visual semantic segmentation methods [26], [76] have started to incorporate relatively higher-order potential terms into the associated energy functions. Instead of relying on conventional CRFs, they apply UCG models to formulate visual semantic information pursuit applications. Due to their intrinsic loopy property, such UCGs are generally decomposed into a sequence of DAGs, with each DAG being modelled by a corresponding DAG-RNN structure. Apart from those models, all the methods mentioned in the above section tend to consider only pairwise interactions.

To resolve the few-shot or zero-shot learning issues induced by dataset imbalance, top-down visual semantic information pursuit methods like [28], [32] tend to distill both internal and external linguistic knowledge to regularize the target variational free energy. In most cases, such regularization not only improves the inference speed but also leads to better zero-shot generalization.

8.1.3 Optimization Strategy

To minimize the target variational free energy, current visual semantic information pursuit methods often apply different deep learning based message passing strategies to implement the variational inference steps, while the associated variational learning steps are generally accomplished using the same classical SGD method. Specifically, given a target mean field variational free energy, the current visual semantic information pursuit methods tend to employ different deep learning architectures to simulate the associated message passing update rule. Apart from the external memory reasoning models like [20], [21], all the others tend to apply conventional RNN architectures such as LSTM or GRU. For instance, [38] proposes a GRU-based inference architecture to generate scene graphs from the input images, while [24] and [52] apply a specialized CRF-as-RNN structure to develop visual semantic segmentation applications.
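To make this pattern concrete, the following minimal NumPy sketch (our own illustrative construction, not the architecture of [38] or [24]; biases are omitted and the weights are random) shows the basic recipe these methods share: neighbouring node states are pooled into a message, which then drives a shared GRU-style update of each node's hidden state.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(h, m, W):
    """A single GRU update; the incoming message m plays the role of the input."""
    z = sigmoid(W["Wz"] @ m + W["Uz"] @ h)          # update gate
    r = sigmoid(W["Wr"] @ m + W["Ur"] @ h)          # reset gate
    h_tilde = np.tanh(W["Wh"] @ m + W["Uh"] @ (r * h))
    return (1 - z) * h + z * h_tilde

def message_passing(h, adjacency, W, n_steps=3):
    """Refine per-node hidden states by pooling neighbours' states into a
    message and feeding it to a shared GRU cell, once per step."""
    for _ in range(n_steps):
        messages = adjacency @ h                    # sum-pool neighbour states
        h = np.stack([gru_cell(h[i], messages[i], W) for i in range(len(h))])
    return h

d, n = 8, 4                                          # hidden size, number of nodes
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(d, d))
     for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
adjacency = np.array([[0, 1, 0, 1],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [1, 0, 1, 0]], dtype=float)
h0 = rng.normal(size=(n, d))
print(message_passing(h0, adjacency, W).shape)       # (4, 8)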

8.2 Correlation Investigation

In visual semantic information pursuit tasks, various visual semantic stimulus representations can be distilled from the associated visual perception modules, such as bounding box locations, instance-level class information or visual semantic features. As mentioned in Section 3, the current visual semantic information pursuit methods generally apply deep learning based message passing strategies to solve the associated variational inference task. In this section, we explore the link between such parallel optimization strategies and different visual semantic stimulus representations.

Compared with visual semantic units, it is generally much harder to predict the visual semantic relations, as they tend to reside in an extremely large semantic space. To this end, the associated NP-hard variational inference task is often modelled as a variational free energy minimization problem. Unfortunately, the target variational free energy is generally non-convex. One way to resolve the above NP-hard optimization problem is to regularize the non-convex variational free energy by selecting meaningful visual semantic stimulus representations. Specifically, the visual semantic features are vital for identifying the predicates since all related visual contextual information is encapsulated in such expressive representations. Moreover, the relative spatial relationships conveyed by the associated bounding box locations are also essential, since adjacent objects generally tend to have a semantic relationship. Besides, the instance-level class information also plays an important role, since the subject/object pairs often maintain certain statistical dependencies (not all pairs can generate meaningful predicates). Furthermore, current visual semantic information pursuit methods often employ an intuitive concatenation strategy to fuse the associated visual semantic stimulus representations.
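As an illustration (a simplified sketch with our own feature choices, not the exact fusion scheme of any particular surveyed method), the three kinds of cues might be fused by simple concatenation as follows.

import numpy as np

def relative_spatial_feature(subj_box, obj_box):
    """Simple translation/scale-normalized offsets between two boxes
    given as (x, y, w, h)."""
    sx, sy, sw, sh = subj_box
    ox, oy, ow, oh = obj_box
    return np.array([(ox - sx) / sw, (oy - sy) / sh,
                     np.log(ow / sw), np.log(oh / sh)])

def fuse_representations(visual_feat, subj_box, obj_box, subj_cls, obj_cls,
                         n_classes):
    """Concatenate appearance, relative spatial and class-label cues into a
    single input vector for predicate prediction."""
    subj_onehot = np.eye(n_classes)[subj_cls]
    obj_onehot = np.eye(n_classes)[obj_cls]
    spatial = relative_spatial_feature(subj_box, obj_box)
    return np.concatenate([visual_feat, spatial, subj_onehot, obj_onehot])

fused = fuse_representations(
    visual_feat=np.random.rand(512),          # e.g. pooled union-box CNN feature
    subj_box=(50, 40, 80, 160), obj_box=(60, 150, 120, 60),
    subj_cls=3, obj_cls=7, n_classes=100)
print(fused.shape)                            # (716,): 512 + 4 + 100 + 100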

Theoretically, if the target non-convex variational free energy is regularized by incorporating more meaningful visual semantic stimulus representations, its complexity/flexibility is reduced, which could potentially lead to a better trade-off between bias and variance and thus improve the overall inference performance. This is perhaps the main reason why most state-of-the-art visual semantic information pursuit methods, especially the visual relationship detection and scene graph generation algorithms, tend to incorporate all three types of visual semantic stimulus representations. For instance, to detect visual relationships, ViP [30] applies both visual semantic features and relative spatial relationships, while LP [28] incorporates all three types of visual semantic stimulus representations.

9 CONCLUSION

This survey presented a comprehensive review of the state-of-the-art visual semantic information pursuit methods. Specifically, we mainly focused on four related applications: object detection, visual semantic segmentation, visual relationship detection and scene graph generation. To understand the essence of the current visual semantic information pursuit methods, they were presented in the context of a unified paradigm. The main developments in the field were reviewed and future trends were identified. The most popular benchmarks and evaluation metrics were overviewed. Finally, the relative performance of the key algorithms was evaluated and discussed.

ACKNOWLEDGMENTS

This work was supported in part by the U.K. Defence Science and Technology Laboratory, and in part by the Engineering and Physical Sciences Research Council (collaboration between U.S. DOD, U.K. MOD and U.K. EPSRC through the Multidisciplinary University Research Initiative) under Grant EP/R018456/1.

REFERENCES

[1] P. Viola, M. Jones et al., “Robust real-time object detection,” International Journal of Computer Vision, vol. 4, no. 4, pp. 34–47, 2001.
[2] C. Zhang, J. C. Platt, and P. A. Viola, “Multiple instance boosting for object detection,” in Advances in Neural Information Processing Systems (NIPS), 2006, pp. 1417–1424.
[3] P. Viola and M. J. Jones, “Robust real-time face detection,” International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
[4] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2005, pp. 886–893.
[5] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[6] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2009.
[7] T. Zoller and J. M. Buhmann, “Robust image segmentation using resampling and shape constraints,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 7, pp. 1147–1164, 2007.
[8] S. Alpert, M. Galun, A. Brandt, and R. Basri, “Image segmentation by probabilistic bottom-up aggregation and cue integration,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 2, pp. 315–327, 2011.
[9] A. Blake, P. Kohli, and C. Rother, Markov Random Fields for Vision and Image Processing. MIT Press, 2011.
[10] C. Sutton, A. McCallum et al., “An introduction to conditional random fields,” Foundations and Trends in Machine Learning, vol. 4, no. 4, pp. 267–373, 2012.
[11] A. Hakeem and M. Shah, “Learning, detection and representation of multi-agent events in videos,” Artificial Intelligence, vol. 171, no. 8-9, pp. 586–605, 2007.
[12] A. Gupta, P. Srinivasan, J. Shi, and L. S. Davis, “Understanding videos, constructing plots: learning a visually grounded storyline model from annotated videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 2012–2019.
[13] D. Kuettel, M. D. Breitenstein, L. Van Gool, and V. Ferrari, “What’s going on? Discovering spatio-temporal dependencies in dynamic scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 1951–1958.
[14] Y. Zhang, Y. Zhang, E. Swears, N. Larios, Z. Wang, and Q. Ji, “Modeling temporal interactions with interval temporal Bayesian networks for complex activity recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 10, pp. 2468–2483, 2013.
[15] J. Han, D. Zhang, G. Cheng, N. Liu, and D. Xu, “Advanced deep-learning techniques for salient and category-specific object detection: a survey,” IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 84–100, 2018.
[16] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
[17] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 91–99.
[18] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6517–6525.
[19] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in Proceedings of European Conference on Computer Vision (ECCV), 2016, pp. 21–37.
[20] X. Chen and A. Gupta, “Spatial memory for context reasoning in object detection,” in Proceedings of IEEE International Conference on Computer Vision (ICCV), 2017, pp. 4106–4116.
[21] X. Chen, L.-J. Li, L. Fei-Fei, and A. Gupta, “Iterative visual reasoning beyond convolutions,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7239–7248.
[22] X. Zeng, W. Ouyang, J. Yan, H. Li, T. Xiao, K. Wang, Y. Liu, Y. Zhou, B. Yang, Z. Wang et al., “Crafting GBD-Net for object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 9, pp. 2109–2123, 2018.
[23] Y. Liu, R. Wang, S. Shan, and X. Chen, “Structure inference net: Object detection using scene-level context and instance-level relationships,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6985–6994.
[24] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr, “Conditional random fields as recurrent neural networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1529–1537.
[25] Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang, “Deep learning Markov random field for semantic segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 8, pp. 1814–1828, 2018.
[26] B. Shuai, Z. Zuo, B. Wang, and G. Wang, “Scene segmentation with DAG-recurrent neural networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1480–1493, 2018.
[27] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proceedings of the Eighteenth International Conference on Machine Learning (ICML), 2001, pp. 282–289.
[28] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei, “Visual relationship detection with language priors,” in Proceedings of European Conference on Computer Vision (ECCV). Springer, 2016, pp. 852–869.
[29] B. Dai, Y. Zhang, and D. Lin, “Detecting visual relationships with deep relational networks,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3298–3308.
[30] Y. Li, W. Ouyang, X. Wang, and X. Tang, “ViP-CNN: Visual phrase guided convolutional neural network,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 7244–7253.
[31] X. Liang, L. Lee, and E. P. Xing, “Deep variation-structured reinforcement learning for visual relationship and attribute detection,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4408–4417.
[32] R. Yu, A. Li, V. I. Morariu, and L. S. Davis, “Visual relationship detection with internal and external linguistic knowledge distillation,” in Proceedings of IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1068–1076.
[33] H. Zhang, Z. Kyaw, S.-F. Chang, and T.-S. Chua, “Visual translation embedding network for visual relation detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3107–3115.
[34] J. Zhang, M. Elhoseiny, S. Cohen, W. Chang, and A. Elgammal, “Relationship proposal networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5226–5234.
[35] F. Plesse, A. Ginsca, B. Delezoide, and F. Preteux, “Visual relationship detection based on guided proposals and semantic knowledge distillation,” arXiv preprint arXiv:1805.10802, 2018.
[36] J. Zhang, Y. Kalantidis, M. Rohrbach, M. Paluri, A. Elgammal, and M. Elhoseiny, “Large-scale visual relationship understanding,” arXiv preprint arXiv:1804.10660, 2018.
[37] B. Zhuang, L. Liu, C. Shen, and I. Reid, “Towards context-aware interaction recognition for visual relationship detection,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 589–598.
[38] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei, “Scene graph generation by iterative message passing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3097–3106.

[39] Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang, “Scenegraph generation from objects, phrases and region captions,” inProceedings of the IEEE Conference on Computer Vision and PatternRecognition (CVPR), 2017, pp. 1261–1270.

[40] R. Zellers, M. Yatskar, S. Thomson, and Y. Choi, “Neural mo-tifs: Scene graph parsing with global context,” in Proceedings ofthe IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2018, pp. 5831–5840.

[41] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh, “Graph r-cnn forscene graph generation,” in Proceedings of European Conference onComputer Vision (ECCV), September 2018, pp. 690–706.

[42] Y. Li, W. Ouyang, B. Zhou, J. Shi, C. Zhang, and X. Wang,“Factorizable net: An efficient subgraph-based framework forscene graph generation,” in Proceedings of European Conference onComputer Vision (ECCV), September 2018, pp. 346–363.

[43] R. Herzig, M. Raboh, G. Chechik, J. Berant, and A. Globerson,“Mapping images to scene graphs with permutation-invariantstructured prediction,” in Advances in Neural Information Process-ing Systems (NIPS), 2018, pp. 7211–7221.

[44] S. Woo, D. Kim, D. Cho, and I. S. Kweon, “Linknet: Relationalembedding for scene graph,” in Advances in Neural InformationProcessing Systems (NIPS), 2018, pp. 558–568.

[45] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignmentsfor generating image descriptions,” in Proceedings of the IEEEConference on Computer Vision and Pattern Recognition (CVPR),2015, pp. 3128–3137.

[46] X. He and L. Deng, “Deep learning for image-to-text generation:a technical overview,” IEEE Signal Processing Magazine, vol. 34,no. 6, pp. 109–116, 2017.

[47] K. Fu, J. Jin, R. Cui, F. Sha, and C. Zhang, “Aligning where to seeand what to tell: Image captioning with region-based attentionand scene-specific contexts,” IEEE Transactions on Pattern Analysisand Machine Intelligence, vol. 39, no. 12, pp. 2321–2334, 2017.

[48] M. Malinowski, M. Rohrbach, and M. Fritz, “Ask your neurons:A deep learning approach to visual question answering,” Inter-national Journal of Computer Vision, vol. 125, no. 1-3, pp. 110–135,2017.

[49] A. Agrawal, J. Lu, S. Antol, M. Mitchell, C. L. Zitnick, D. Parikh,and D. Batra, “Vqa: Visual question answering,” InternationalJournal of Computer Vision, vol. 123, no. 1, pp. 4–31, 2017.

[50] D. Teney, Q. Wu, and A. van den Hengel, “Visual questionanswering: A tutorial,” IEEE Signal Processing Magazine, vol. 34,no. 6, pp. 63–75, 2017.

[51] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional net-works for semantic segmentation,” in Proceedings of the IEEEConference on Computer Vision and Pattern Recognition (CVPR),2015, pp. 3431–3440.

[52] A. Arnab, S. Zheng, S. Jayasumana, B. Romera-Paredes, M. Lars-son, A. Kirillov, B. Savchynskyy, C. Rother, F. Kahl, and P. H.Torr, “Conditional random fields meet deep neural networksfor semantic segmentation: Combining probabilistic graphicalmodels with deep learning for structured prediction,” IEEE SignalProcessing Magazine, vol. 35, no. 1, pp. 37–52, 2018.

[53] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE InternationalConference on Computer Vision (ICCV), 2015, pp. 1440–1448.

[54] T. Werner, “A linear programming approach to max-sum prob-lem: A review,” IEEE Transactions on Pattern Analysis and MachineIntelligence, vol. 29, no. 7, pp. 1165–1179, 2007.

[55] Y. Weiss, C. Yanover, and T. Meltzer, “Map estimation, linearprogramming and belief propagation with convex free energies,”in Proceedings of the Twenty-Third Conference on Uncertainty inArtificial Intelligence (UAI), 2007, pp. 416–425.

[56] T. Meltzer, A. Globerson, and Y. Weiss, “Convergent messagepassing algorithms: a unifying view,” in Proceedings of the Twenty-fifth Conference on Uncertainty in Artificial Intelligence (UAI), 2009,pp. 393–401.

[57] Q. Liu and A. Ihler, “Variational algorithms for marginal map,”The Journal of Machine Learning Research, vol. 14, no. 1, pp. 3165–3200, 2013.

[58] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient esti-mation of word representations in vector space,” arXiv preprintarXiv:1301.3781, 2013.

[59] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean,“Distributed representations of words and phrases and their

compositionality,” in Advances in Neural Information ProcessingSystems (NIPS), 2013, pp. 3111–3119.

[60] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105.

[61] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.

[62] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.

[63] M. J. Wainwright, M. I. Jordan et al., “Graphical models, exponential families, and variational inference,” Foundations and Trends in Machine Learning, vol. 1, no. 1–2, pp. 1–305, 2008.

[64] E. Jahangiri, E. Yoruk, R. Vidal, L. Younes, and D. Geman, “Information pursuit: A bayesian framework for sequential scene parsing,” arXiv preprint arXiv:1701.02343, 2017.

[65] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[66] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected crfs,” in Proceedings of the International Conference on Learning Representations (ICLR), 2015.

[67] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.

[68] C. Zhang, J. Butepage, H. Kjellstrom, and S. Mandt, “Advances in variational inference,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[69] S. Ross, D. Munoz, M. Hebert, and J. A. Bagnell, “Learning message-passing inference machines for structured prediction,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 2737–2744.

[70] J. Kappes, B. Andres, F. Hamprecht, C. Schnorr, S. Nowozin, D. Batra, S. Kim, B. Kausler, J. Lellmann, N. Komodakis et al., “A comparative study of modern inference techniques for discrete energy minimization problems,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 1328–1335.

[71] R. Battiti, “First- and second-order methods for learning: between steepest descent and newton’s method,” Neural Computation, vol. 4, no. 2, pp. 141–166, 1992.

[72] A. Georges, G. Kotliar, W. Krauth, and M. J. Rozenberg, “Dynamical mean-field theory of strongly correlated fermion systems and the limit of infinite dimensions,” Reviews of Modern Physics, vol. 68, no. 1, p. 13, 1996.

[73] A.-L. Barabasi, R. Albert, and H. Jeong, “Mean-field theory for scale-free random networks,” Physica A: Statistical Mechanics and its Applications, vol. 272, no. 1-2, pp. 173–187, 1999.

[74] K. P. Murphy, Y. Weiss, and M. I. Jordan, “Loopy belief propagation for approximate inference: An empirical study,” in Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI), 1999, pp. 467–475.

[75] P. Krahenbuhl and V. Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,” in Advances in Neural Information Processing Systems (NIPS), 2011, pp. 109–117.

[76] B. Shuai, Z. Zuo, B. Wang, and G. Wang, “Dag-recurrent neural networks for scene labeling,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3620–3629.

[77] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky, “A new class of upper bounds on the log partition function,” IEEE Transactions on Information Theory, vol. 51, no. 7, pp. 2313–2335, 2005.

[78] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng, “Zero-shot learning through cross-modal transfer,” in Advances in Neural Information Processing Systems (NIPS), 2013, pp. 935–943.

[79] B. Romera-Paredes and P. Torr, “An embarrassingly simple approach to zero-shot learning,” in Proceedings of International Conference on Machine Learning (ICML), 2015, pp. 2152–2161.

[80] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra et al., “Matching networks for one shot learning,” in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 3630–3638.

[81] A. Graves, G. Wayne, and I. Danihelka, “Neural turing machines,” arXiv preprint arXiv:1410.5401, 2014.

[82] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap, “Meta-learning with memory-augmented neural networks,” in Proceedings of International Conference on Machine Learning (ICML), 2016, pp. 1842–1850.

[83] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.

[84] Z. Hu, X. Ma, Z. Liu, E. Hovy, and E. Xing, “Harnessing deep neural networks with logic rules,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2016, pp. 2410–2420.

[85] Z. Hu, Z. Yang, R. Salakhutdinov, and E. Xing, “Deep neural networks with massive learned knowledge,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1670–1679.

[86] H. Bilen, M. Pedersoli, and T. Tuytelaars, “Weakly supervised object detection with convex clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1081–1089.

[87] H. Bilen and A. Vedaldi, “Weakly supervised deep detection networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2846–2854.

[88] Y. Wei, X. Liang, Y. Chen, X. Shen, M.-M. Cheng, J. Feng, Y. Zhao, and S. Yan, “Stc: A simple to complex framework for weakly-supervised semantic segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 11, pp. 2314–2320, 2017.

[89] R. G. Cinbis, J. Verbeek, and C. Schmid, “Weakly supervised object localization with multi-fold multiple instance learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 1, pp. 189–203, 2017.

[90] D. Zhang, J. Han, L. Zhao, and D. Meng, “Leveraging prior-knowledge for weakly supervised object detection under a collaborative self-paced curriculum learning framework,” International Journal of Computer Vision, pp. 1–18, 2018.

[91] J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Generalized belief propagation,” in Advances in Neural Information Processing Systems (NIPS), 2001, pp. 689–695.

[92] G. Fazelnia and J. Paisley, “Crvi: Convex relaxation for variational inference,” in Proceedings of International Conference on Machine Learning (ICML), 2018, pp. 1476–1484.

[93] X. Glorot, A. Bordes, and Y. Bengio, “Domain adaptation for large-scale sentiment classification: A deep learning approach,” in Proceedings of the 28th International Conference on Machine Learning (ICML), 2011, pp. 513–520.

[94] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE Transactions on Neural Networks, vol. 22, no. 2, pp. 199–210, 2011.

[95] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in Proceedings of International Conference on Machine Learning (ICML), 2015, pp. 1180–1189.

[96] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa, “Visual domain adaptation: A survey of recent advances,” IEEE Signal Processing Magazine, vol. 32, no. 3, pp. 53–69, 2015.

[97] Z. Yang, J. Zhao, B. Dhingra, K. He, W. W. Cohen, R. R. Salakhutdinov, and Y. LeCun, “Glomo: Unsupervised learning of transferable relational graphs,” in Advances in Neural Information Processing Systems (NIPS), 2018, pp. 8964–8975.

[98] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph sequence neural networks,” in Proceedings of International Conference on Learning Representations (ICLR), 2016.

[99] P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende et al., “Interaction networks for learning about objects, relations and physics,” in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 4502–4510.

[100] K. T. Schutt, F. Arbabzadah, S. Chmiela, K. R. Muller, and A. Tkatchenko, “Quantum-chemical insights from deep tensor neural networks,” Nature Communications, vol. 8, p. 13890, 2017.

[101] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in Proceedings of International Conference on Machine Learning (ICML), 2017, pp. 1263–1272.

[102] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge 2007 (voc 2007) results,” 2008.

[103] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proceedings of European Conference on Computer Vision (ECCV), 2014, pp. 740–755.

[104] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille, “The role of context for object detection and semantic segmentation in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 891–898.

[105] C. Liu, J. Yuen, and A. Torralba, “Nonparametric scene parsing: Label transfer via dense scene alignment,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 1972–1979.

[106] H. Caesar, J. Uijlings, and V. Ferrari, “Coco-stuff: Thing and stuff classes in context,” arXiv preprint arXiv:1612.03716, 2016.

[107] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” International Journal of Computer Vision, vol. 123, no. 1, pp. 32–73, 2017.

[108] B. Alexe, T. Deselaers, and V. Ferrari, “Measuring the objectness of image windows,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2189–2202, 2012.

[109] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2961–2969.

[110] S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick, “Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2874–2883.

[111] S. Liu, D. Huang et al., “Receptive field block net for accurate and fast object detection,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 385–400.

[112] J. Dai, K. He, and J. Sun, “Convolutional feature masking for joint object and stuff segmentation,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3992–4000.

[113] E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 640–651, 2017.

[114] S. Xie, X. Huang, and Z. Tu, “Top-down learning for structured labeling with convolutional pseudoprior,” in Proceedings of European Conference on Computer Vision (ECCV), 2016, pp. 302–317.

[115] G. Lin, C. Shen, A. Van Den Hengel, and I. Reid, “Efficient piecewise training of deep structured models for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3194–3203.

[116] W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki, “Scene labeling with lstm recurrent neural networks,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3547–3555.

[117] P. H. Pinheiro and R. Collobert, “Recurrent convolutional neural networks for scene labeling,” in Proceedings of International Conference on Machine Learning (ICML), 2014, pp. 82–90.

[118] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1915–1929, 2013.

[119] J. Tighe and S. Lazebnik, “Finding things: Image parsing with regions and per-exemplar detectors,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 3001–3008.

[120] A. Sharma, O. Tuzel, and M.-Y. Liu, “Recursive context propagation network for semantic scene labeling,” in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2447–2455.

[121] J. Yang, B. Price, S. Cohen, and M.-H. Yang, “Context driven scene parsing with attention to rare classes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 3294–3301.

[122] B. Shuai, Z. Zuo, G. Wang, and B. Wang, “Scene parsing with integration of parametric and non-parametric models,” IEEE Transactions on Image Processing, vol. 25, no. 5, pp. 2379–2391, 2016.

[123] H. Zhang, Z. Kyaw, J. Yu, and S.-F. Chang, “Ppr-fcn: Weakly supervised visual relation detection via parallel pairwise r-fcn,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 4233–4241.

[124] J. Peyre, I. Laptev, C. Schmid, and J. Sivic, “Weakly-supervised learning of visual relations,” in Proceedings of IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5179–5188.

[125] G. Yin, L. Sheng, B. Liu, N. Yu, X. Wang, J. Shao, and C. Change Loy, “Zoom-net: Mining deep feature interactions for visual relationship recognition,” in Proceedings of European Conference on Computer Vision (ECCV), 2018, pp. 330–347.

[126] S. Geman, E. Bienenstock, and R. Doursat, “Neural networks and the bias/variance dilemma,” Neural Computation, vol. 4, no. 1, pp. 1–58, 1992.

Daqi Liu is a Research Fellow at the Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, U.K. His research interests include Bayesian machine learning, computer vision, pattern recognition and bio-inspired computational models. He has published several scientific papers in top-ranked journals, including IEEE Transactions on Cybernetics, IEEE Transactions on Neural Networks and Learning Systems, Neurocomputing, etc.

Miroslaw Bober (S’94-A’95-M’04) received the M.Sc. and Ph.D. degrees from the University of Surrey, Guildford, U.K., in 1991 and 1995, respectively.

He is a Professor of video processing with the University of Surrey, Guildford, U.K. Between 1997 and 2011, he headed the Mitsubishi Electric Corporate R&D Center Europe (MERCE), Livingston, U.K. He has been actively involved in the development of MPEG standards for more than 20 years, chairing the MPEG-7, CDVS, and CDVA groups. He is an inventor of more than 70 patents and several of his inventions are deployed in consumer and professional products. He has authored or coauthored more than 80 refereed publications, including three books and book chapters. His research interests include various aspects of computer vision and machine intelligence, with recent focus on image/video database retrieval and data mining.

Josef Kittler (M’74-LM’12) received the B.A., Ph.D., and D.Sc. degrees from the University of Cambridge, in 1971, 1974, and 1991, respectively. He is a distinguished Professor of Machine Intelligence at the Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, U.K. He conducts research in biometrics, video and image database retrieval, medical image analysis, and cognitive vision. He published the textbook Pattern Recognition: A Statistical Approach and over 700 scientific papers.

He is series editor of Springer Lecture Notes on Computer Science. He currently serves on the Editorial Boards of Pattern Recognition Letters, Pattern Recognition and Artificial Intelligence, and Pattern Analysis and Applications. He also served as a member of the Editorial Board of IEEE Transactions on Pattern Analysis and Machine Intelligence during 1982–1985.