features to text: a comprehensive survey of deep learning

Research ArticleFeatures to Text A Comprehensive Survey of Deep Learning onSemantic Segmentation and Image Captioning

Ariyo Oluwasammi 1 Muhammad Umar Aftab 2 Zhiguang Qin 1 Son Tung Ngo 3

Thang Van Doan 3 Son Ba Nguyen 3 Son Hoang Nguyen 3

and Giang Hoang Nguyen 3

1School of Information and Software Engineering University of Electronic Science and Technology of ChinaChengdu 610054 China2Department of Computer Science National University of Computer and Emerging Sciences IslamabadChiniot-Faisalabad Campus Chiniot 35400 Pakistan3ICT Department FPT University Hanoi 10000 Vietnam

Correspondence should be addressed to Zhiguang Qin qinzuestceducn

Received 8 January 2021 Revised 31 January 2021 Accepted 6 March 2021 Published 23 March 2021

Academic Editor Dan Selisteanu

Copyright copy 2021 Ariyo Oluwasammi et al -is is an open access article distributed under the Creative Commons AttributionLicense which permits unrestricted use distribution and reproduction in any medium provided the original work isproperly cited

With the emergence of deep learning computer vision has witnessed extensive advancement and has seen immense applications inmultiple domains Specifically image captioning has become an attractive focal direction for most machine learning experts whichincludes the prerequisite of object identification location and semantic understanding In this paper semantic segmentation and imagecaptioning are comprehensively investigated based on traditional and state-of-the-art methodologies In this survey we deliberate on theuse of deep learning techniques on the segmentation analysis of both 2D and 3D images using a fully convolutional network and otherhigh-level hierarchical feature extraction methods First each domainrsquos preliminaries and concept are described and then semanticsegmentation is discussed alongside its relevant features available datasets and evaluation criteria Also the semantic informationcapturing of objects and their attributes is presented in relation to their annotation generation Finally analysis of the existing methodstheir contributions and relevance are highlighted informing the importance of these methods and illuminating a possible researchcontinuation for the application of semantic image segmentation and image captioning approaches

1 Introduction

-e data of optical perception are becoming increasinglyavailable in large volume nowadays creating a crucial use inseveral real-world applications such as quality assurancemedical analysis surveillance autonomous vehicles facerecognition forensic and biometrics and 3D reconstruction[1ndash4] -is upsurge in the bulk of digital images and videohas directed the creation of computer vision (CV) a branchof computer science (CS) From a general overview com-puter vision relates to the use of the computer to gain a high-level understanding of images and videos [5] Rather thanmanual operations it encompasses the automatic acquisi-tion processing and analyzing of large data for the sole

purpose of extracting patterns and intuition In most casescomputer vision seeks to apply artificial intelligence (AI)theory equations tools frameworks and algorithms foraccomplishing the task of helping computers to see and alsounderstand the content of both digital and analog worldthrough the mimicking of the human visual system [6]Although seeing and understanding seems a trivial task orvery easy for humans it is nevertheless a complex problemfor computers partly because of our limited understandingof how the human brain works and how it processes things[7] However through years of research and technologicaladvancement some feats have been achieved and computervision has extensively evolved [8ndash11] Today semanticsegmentation remains a huge challenge in the scope of image

HindawiComplexityVolume 2021 Article ID 5538927 19 pageshttpsdoiorg10115520215538927

and video understanding alongside image captioning whichcombines computer vision with another branch of artificialintelligence called natural language processing (NLP) toderive sentence description of an image [12] Notwith-standing as with all other AI-related tasks a modern subsetof machine learning (ML) namely deep learning (DL) hasbeen the evolution of machine learning producing state-of-the-art results in almost all of the tasks compared to othertraditional algorithms such as decision trees naive Bayessupport vector machines (SVMs) ensembles and clusteringalgorithms [13ndash16]

Deep learning as a branch of machine learning useslayers of artificial neural networks to imitate the humanneural networks in decoding intuition from a large amountof data automatically [17] and is unlike other machinelearning algorithms which rely heavily on feature engi-neering utilizing domain knowledge in the creation offeature extractors [18] -e stacked layer of neural networksrepresents feature hierarchy as simple features at the initiallayers are reconstructed from one layer to another informing complex features [19] As a result the deepernetworks are computationally intensive to model and trainleading to the manufacture of more advanced computationalchips including Graphical Processing Units (GPUs) andTensor Processing Units (TPUs) [20 21] Presently severaldeep learning models exist and some of the most popularones include recurrent neural networks (RNNs) autoen-coders convolutional neural network (CNN) deep beliefnetworks (DBNs) and deep Boltzmann machine (DBM)[22ndash25] Among the most common deep learning algo-rithms the convolutional neural network is themost suitablefor analyzing visual imagery because of its shift and spaceinvariant characteristics taking advantage of hierarchicallearning in combining simpler patterns to form complexpatterns and structures [26] Using the shared weights ar-chitectural pattern of filters [27ndash29] each filter representsdifferent features of the input data which when summed canyield more complex structures [30ndash32]

In this paper our prime motivation is fixating on therecent deep learning segmentation techniques of both 2Dand 3D images using fully convolutional network andother high-level hierarchical feature extraction methodsas an integral component of computer vision -is isfurther expanded into the generation of captions forimages emerging as a subset of artificial intelligencersquosnatural language processing Furthermore we review thediscussed modelsrsquo accomplishments by comparing theirevaluation which indicates the most effective and effi-cient approaches for different tasks and challenges en-countered -is we believe is enlightening as it providesinsight for the further evolution of practical modeldesign

-is paper is organized as follows it introduces seg-mentation popular segmentation algorithms characteris-tics datasets and evaluation in Section 2 An introduction ofcaptioning and its various models are followed in Section 3alongside available datasets evaluation metrics and acomparative discussion of the models Finally Section 4concludes the paper with the overall summary of the typical

problems solutions and possible directions in semanticsegmentation and image captioning

2 Semantic Segmentation

Semantic segmentation relates to the process of pixel-levelclassification of images such that each pixel in an image isclassified into a distinguished class cluster [33] Since theinception of deep learning semantic segmentation has beena pivotal area of image processing and computer visionwhich has seen major research and application in severaldomains [34] Image segmentation recognizes boundariesbetween objects in an image by using line and curve seg-ments to categorize such objects while instance segmen-tation however classifies instances of all the available classesin an image such that all objects are identified as a separateentity All the same semantic segmentation differs fromordinary segmentation which on the one hand only ex-presses the partitioning of an image into clusters without atangible intuitive attempt at understanding the partitionedclusters or relating them with one another [35] Semanticsegmentation on the other hand as the name implies triesto describe semantically meaningful objects in an imagebased on their well-defined association and understanding[36] these differences are well depicted in Figure 1

21 Methods and Approaches

211 Traditional Methods During the pre-ANN era mostsegmentation and semantic segmentation approaches werepredominantly thresholding and clustering algorithmswhich are largely unsupervised methods In most casestraditional semantic segmentation methods consume lesstime for model computation Also most of the approachesrequire less data than the modern-day era of artificial neuralnetworks and deep learning -e simplest by all means isthe thresholding techniques which apply pixel intensity asthe criteria for distribution For binary segmentation asingle threshold value is required and pixels on both sides ofthe threshold are classified separately into two distinctclasses -ere are also advance forms of thresholding in-volving more classes and they are often grouped as histo-gram shape-based entropy-based object attribute-basedand spatial-based [37]

K-means clustering uses a predefined number of cen-troids to determine the number of clusters in which objectsare to be categorized -e centroids are randomly selected atthe beginning and then are iteratively adjusted by computingthe distance apart from other points in the dataset assigningeach point to the closest centroid [38] Fuzzy C-means(FCM) a technically advanced form of K-means allowsclassification of data points into many label classes based onthe level of membership [39] -is is of advantage in situ-ations where dataset texture overlaps or does not have a well-defined cluster [40] Gaussian mixture model (GMM) is alsooften used for both hard clustering and soft clustering byassigning the pixel to the component having the highestposterior probability [41] GMM assumes that the datarsquosGaussian distributions represent the number of clusters

2 Complexity

available in the data and it uses the expectation-maximi-zation (EM) algorithm to determine missing latent variables[42] Random forest [43] naive Bayes [44] ensemblemodeling [45] Markov random network (MRF) [46] andsupport vector machines (SVMs) [47] are also techniquesthat are useful for several tasks especially classification andregression [48]

212 Region-Based Models In the region-based semanticsegmentation design regions are first extracted in an imageand described based on their constituent features [49] -ena region classifier that has been trained is used to label pixelsper region with which it has the highest occurrence -eregion-based approaches use the divide and conquer methodsuch that many features are captured using multiscale fea-tures and then combined to form a whole In cases whereobjects overlap on several regions the classifier either de-termines the most suitable region or the model is set to selectthe region with the maximum value [50] -is often causespixel mismatch but a postprocessing operation is mostlyused to reduce the effects [51]

Region CNN (R-CNN) uses a bounding box to identifyand classify objects in an image by proposing several boxesconvolving an image and identifying if they correspond to anobject [52 53] -e process of selective search is used increating boundary boxes of varying window sizes for regionproposal and each of the boxes classifies objects based ondifferent properties making the algorithm quite impressivebut slow [54] To overcome the drawbacks of the R-CNNFast R-CNN [55] was proposed which eliminates the re-dundancy in the proposed region thereby lessening thecomputational requirements -e R-CNN model wasreplaced with a single CNN per image whose computationwould be shared among the proposals using the region ofinterest pooling technique and training all the models in-cluding the use of convolutional neural network to classifythe images and bounding boxes regressor as a single entity-e Faster R-CNN [56] uses a region proposal network(RPN) to obtain a proposal from the network instead whilethe Mask R-CNN [57] was extended to include a pixel-levelsegmentation Technically the Mask R-CNN replaces theregion of interest pooling module in the Faster R-CNN withanother which has an accurate alignment module Also itincludes an additional parallel branch for segmentationmask prediction [58]

213 Fully Convolutional Network-Based Models Fullyconvolutional network (FCN) models do not have denselayers such as in other traditional CNNs they are composedof 1times 1 convolutions that achieve the task of dense layers orfully connected layers Also fully convolutional network(FCN) as displayed in Figure 2 takes images of arbitrarysizes as the input and returns outputs of correspondingspatial dimensions -is model principally builds on theencoder-decoder model to classify pixels in an image intopredefined classes by using a convolution network in theencoder to extract features thereby reducing the featuremapsrsquo dimensionality before being upsampled by the de-coder (SegNet) [60] During convolutional neural networkcomputation input images are downsized resulting in asmaller output with reduced spatial features -is problem issolved via the upsample technique which transposes thedownsized images to a larger size making pixelwise com-parison efficient and effective Some upsampling methodssuch as transpose convolutions are learned thus increasingmodel complexity and computation while several othersexclude learning including nearest neighbor the bed of nailsand max unpooling [61] -e fully convolutional network ismajorly trained as an end-to-end model to compute pix-elwise loss and trained using the backpropagation approach-e FCN was firstly inspired by Long et al [59] using thepopular AlexNet CNN architecture in the encoder andtranspose convolution layers in the decoder to upsample thefeature to the desired dimension A variant FCN having skipconnections from previous layers in the network was pro-posed named UNet [62] UNet intends to compliment thelearned features with fine-grain details from contractingpaths to capture the context and enhance classificationaccuracy

Residual blocks were introduced with shorter skipconnections between the encoder and decoder grantingfaster convergence of deeper models during training [63]Multiscale induction in the dense blocks conveys low-levelfeatures across layers to ones with high-level featuresresulting in feature reuse [64] -is makes the model easierto train and improve the performance as well An archi-tecture having two-stream CNN uses its gates to process theimage shape in different branches then connected and fusedat a later stage -e model also proposed a new loss functionthat aligns segmentation prediction with the ground truthboundaries [65] An end-to-end single-pass model applyingdense decoder shortcut connections extracts semantics from

Boat

Person

Diningtable

(a) (b) (c)

Figure 1 Illustration of differences in segmentation (a) object detection (b) semantic segmentation and (c) instance segmentation [36]

Complexity 3

high-level features such that propagation of informationfrom one block to another combines multiple-level features[66] -e model designed based on the ResNeXtrsquos residualbuilding blocks helps it to aggregate blocks of featurecaptures which are fused as output resolutions [67]

ExFuse aims to connect the gap between low- and high-level features in convolutional networks by introducingsemantic information at the lower-level features as well ashigh resolutions into the high-level features -is wasachieved by proposing two fusion methods named explicitchannel resolution embedding and densely adjacent pre-diction [68] Contrary to most models a balance betweenmodel accuracy and speed was achieved in ICNet whichconsolidates several multiresolution branches by introduc-ing an image cascade network that allows real-time infer-ence -e network cascade image inputs into differentsegments as low medium and high resolution before beingtrained with this label guidance [69]

214 Refinement Network Because of the resolution re-duction caused by the encoder models in the typical FCN-based model the decoder has inherent problems of pro-ducing fine-grained segmentation especially at theboundaries and edges [70] -ough this problem has beentackled by incorporating skip connections adding globalinformation and others means the problem is by no meanssolved and some algorithms have involved several featuresor in some cases certain postprocessing functions to findalternative solutions [71] DeepLab1 [72] combines ideasfrom the deep convolutional neural network and probabi-listic graphical models to achieve pixel-level classification-e localization problem of the neural network output layerwas remedied using a fully connected conditional randomfield (CRF) as a means of performing segmentation withcontrolled signal extermination DeepLab1 applied atrousconvolutions instead of the regular convolutions whichaccomplish the learning of aggregate multiscale contextualfeatures Visible in Figure 3 DeepLab1 allows the expansionof kernel window sizes without increasing the number ofweights -e multiscale atrous convolutions help to over-come the problem of insensitivity to fine details by other

models and decrease output blurriness -is could result inadditional complexity in computation and time dependingon the postprocessing networkrsquos computational processes

-e ResNet deep convolutional network architecture wasapplied in DeepLab2 which enables the training of variousdistinct layers while preserving the performance [73] Be-sides DeepLab2 uses atrous spatial pyramid pooling (ASPP)to capture long-range context -e existence of objects atdifferent scales and the reduced feature resolution problemsof semantic segmentation are tackled by designing a cascadeof atrous convolutions which could run in parallel to capturevarious scales of context information alongside global av-erage pooling (GAP) to embed context information features[74] FastFCN implements joint pyramid upsampling whichsubstitutes atrous convolutions to free up memory andlessen computations Using a fully connected networkframework the joint pyramid upsampling technique ex-tracts feature maps of high resolution into a joint upsam-pling problem -e models used atrous spatial pyramidpooling to extract the last three-layer output features and aglobal context module to map out the final predictions [75]-e atrous spatial pyramid pooling limitation of lack ofdense feature resolution scale is attempted by concatenatingmultiple branches of atrous-convolved features at differentrates which are later fused into a final representationresulting in dense multiscale features [76]

215 Weakly Supervised and Semisupervised Approaches-ough most models depend on a large number of imagesand their annotated label the process of manually anno-tating labels is quite daunting and time-consuming so se-mantic segmentation models have been attempted withweakly supervised approaches Given weakly annotated dataat the image level the model was trained to assign higherweights to pixels corresponding with the class label Trainedon a subset of the ImageNet dataset during training thenetworks focus on recognizing important pixels relating to aprior labeled single-class object and matching them to theclass through inference [77] Bounding box annotation isused to train semantic labeling of image segmentation whichaccomplishes 95 quality of fully supervised models -e

96

256384 384 256 4096 4096 21

21

Backwardlearning

Forwardinference

Pixelwise prediction

Segmentation gt

Figure 2 Illustration of fully convolutional networks for semantic segmentation [59]

4 Complexity

bounding box and information of the constituent objectwere used prior to training [78] A model combining bothlabeled and weakly annotated images with a clue of thepresence or absence of a semantic class was developed usingthe deep convolutional neural network and expectation-maximization (EM) algorithm [79]

BoxSup iteratively generates automatic region proposalswhile training convolutional networks to obtain segmen-tation masks and as well as improve the modelrsquos ability toclassify objects in an image -e network uses bounding boxannotations as a substitute for supervised learning such thatregions are proposed during training to determine candidatemasks overtime improving the confidence of the segmen-tation masks [80] A variant of generative adversariallearning approach which constitutes a generator and dis-criminator was used to design a semisupervised semanticsegmentation model -e model was first trained using fulllabeled data which enable the modelrsquos generator to learn thedomain sample space of the dataset which is leveraged tosupervise unlabeled data Alongside the cross-entropy lossan adversarial loss was proposed to optimize the objectivefunction of tutoring the generator to generate images asclose as possible to the image labels [81]

22Datasets Deep learning requires an extensive amount oftraining data to comprehensively learn patterns and fine-tune the number of parameters needed for its gradientconvergence Accordingly there are several available data-sets specifically designed for the task of semantic segmen-tation which are as follows

PASCAL VOC PASCAL Visual Object Classes (VOC)[82] is arguably the most popular semantic segmentationdataset with 21 classes of predefined object labels back-ground included -e dataset contains images and anno-tations which could be used for detection classificationaction classification person layout and segmentation tasks-e datasetrsquos training validation and test set has 1464 1449and 1456 images respectively Yearly the dataset has beenused for public competitions since 2005

MS COCO Microsoft Common Objects in Context [83]was created to push the computer vision state of the art with

standard images annotation and evaluation Its objectdetection task dataset uses either instanceobject annotatedfeatures or a bounding box In total it has 80 object cate-gories with over 800000 available images for its trainingtest and validation sets as well as over 500000 object in-stances that are segmented

Cityscapes Cityscapes dataset [84] has a huge amount ofimages taken from 50 cites during different seasons andtimes of the year It was initially a video recording and theframes were extracted as images It has 30 label classes inabout 5000 densely annotated images and 20000 weaklyannotated images which have been categorized into 8 groupsof humans vehicles flat surfaces constructions objectsvoid nature and sky It was primarily designed for urbanscene segmentation and understanding

ADE20K ADE20K dataset [85] has 20210 trainingimages 2000 validation images and 3000 test images whichare well suited for scene parsing and object detectionAlongside the 3-channel images the dataset contains seg-mentation masks part segmentation masks and a text filethat contains information about the object classes identi-fication of instances of the same class and the description ofeach imagersquos content

CamVid CamVid [86] is also a video sequence of sceneswhich have been extracted into images of high resolution forsegmentation tasks It consists of 101 images of 960 lowast 720dimension and their annotations which have been classifiedinto 32 object classes including void indicating areas whichdo not belong to a proper object class -e dataset RGB classvalues are also available ranging from 0 to 255

KITTI KITTI [87] is popularly used for robotics andautonomous car training focusing extensively on 3Dtracking stereo optical flow 3D object detection and visualodometry -e images were obtained through two high-resolution cameras attached to a car driving around the cityof Karlsruhe Germany while their annotations were doneby a Velodyne laser scanner -e data aim to reduce bias inexisting benchmarks with a standard evaluation metric andwebsite

SYNTHIA SYNTHetic [88] Collection of Imagery andAnnotations (SYNTHIA) is a compilation of imaginaryimages from a virtual city that has a high pixel-level

Input DCNNAeroplane coarse

score map

Final output Fully connected CRF

Atrous convolution

Bilinear interpolation

Figure 3 DeepLab framework with fully connected CRFs [72]

Complexity 5

resolution -e dataset has 13 predefined label classesconsisting of road sidewalk fence sky building sign pe-destrian vegetation pole and car It has a total of 13407training images

23 Evaluation Metrics -e performance of segmentationmodels is computed mostly in the supervised scope wherebythe modelrsquos prediction is compared with the ground truth atthe pixel level -e common evaluation metrics are pixelaccuracy (PA) and intersection over union (IoU) Pixelaccuracy refers to the ratio of all the pixels classified in theircorrect classes to the total amount of pixels in the imagePixel accuracy is trivial and suffers from class imbalance suchthat certain classes immensely dominate other classes

Accuracy TP + TN

TP + TN + FP + FN

PA 1113936inii

1113936iti

(1)

where nc is represented as the number of classes and nii isalso represented as the number of pixels of class i which arepredicted to class i while nij represents the number of pixelsof class i which are predicted as class j with the total numberof pixels of a particular class i represented as ti 1113936jnij

Mean pixel accuracy (mPA) improves pixel accuracyslightly it computes the accuracy of the images per classinstead of a global computation of all the classes -e meanof the class accuracies is then computed to the overallnumber of classes

mPA 1nc

1113944i

nii

ti

(2)

Intersection over union (IoU) metric which is also re-ferred to as the Jaccard index measures the percentageoverlap of the ground truth to the model prediction at thepixel level thereby computing the amount of pixels commonwith the ground truth label and mask prediction [89]

mIoU 1nc

1113936inii

1113936jnij + 1113936jnji minus nii

(3)

24 Discussion Different machine learning and deeplearning algorithms and backbones yield different resultsbased on the modelsrsquo ability to learn mappings from inputimages to the label In tasks involving images CNN-basedapproaches are by far the most expedient Although they canbe computationally expensive compared to other simplermodels suchmodels occupy a bulk of the present state of theart Traditional machine learning algorithms such as randomforest naive Bayes ensemble modeling Markov randomfield (MRF) and support vector machines (SVMs) are toosimple and rely heavily on domain feature understanding orhandcrafted feature engineering and in some cases they arenot easy to fine-tune Also clustering algorithms such as K-means and fuzzy C-means mostly require that the number of

clusters is specified beforehand and they are not very ef-fective with multiple boundaries

Because of the CNNrsquos invariant property it is very ef-fective for spatial data and object detection and localizationBesides the modern backbone of the fully convolutionalnetwork has informed several methods of improving seg-mentation localization First the decoder uses upsamplingtechniques to increase the features resolution and then skipconnections are added to achieve the transfer of fine-grainfeatures to the other layers Furthermore postprocessingoperations as well as the context and attention networkshave been exploited

-e supervised learning approach still remains thedormant technique as there have been many options forgenerating datasets as displayed in Table 1 Data augmen-tation involving operations such as scaling flipping rotat-ing scaling cropping and translating has mademultiplication of data possible Also the application of thegenerative adversarial network (GAN) has played a majorrole in the replication of images and annotations

3 Image Captioning

Image captioning relates to the general idea of automaticallygenerating the textual description of an image and it is oftenalso referred to as image annotation It involves the appli-cation of both computer vision and natural language pro-cessing tools to achieve the transformation of imagerydepiction into a textual composition [111] Tasks such ascaptioning were almost impossible prior to the advent ofdeep learning and with advances in sophisticated algo-rithms multimodal techniques efficient hardware and alarge bulk of datasets such tasks are becoming easy to ac-complish [112] Image captioning has several applications tosolving some real-world problems including providing aid tothe blind autonomous cars academic bot and militarypurposes To a large extent the majority of image captioningsuccess so far has been from the supervised domain wherebyhuge amounts of data consisting of images and about two tofive label captions describing the actions of the images areprovided [113]-is way the network is tasked with learningthe imagesrsquo feature presentation and mapping them to alanguage model such that the end goal of a captioningnetwork is to generate a textual representation of an imagersquosdepiction [114]

-ough characterizing an image in the form of textseems trivial and straightforward for humans it is by nomeans simple to be replicated in an artificial system andrequires advanced techniques to extract the features fromthe images as well as map the features to the correspondinglanguage model Generally a convolutional neural network(CNN) is tasked with feature extraction while a recurrentneural network (RNN) relies upon to translate the trainingannotations with the image features [115] Aside from de-termining and extracting salient and intricate details in animage it is equally important to extract the interactions andsemantic relationship between such objects and how to il-lustrate them in the right manner using appropriate tensesand sentence structures [116] Also because the training

6 Complexity

labels which are texts are different from the features obtainedfrom the images language model techniques are required toanalyze the form meaning and context of a sequence ofwords -is becomes even more complex as keywords arerequired to be identified for emphasizing the action or scenebeing described [117]

Visual features deep convolutional neural network(DCNN) is often used as the feature extractor for images andvideos because of the CNNrsquos invariance property [118] suchthat it is able to recognize objects regardless of variation inappearances such as size illumination translation or ro-tation as displayed in Figure 4 -e distortion in pixel ar-rangement has less impact on the architecturersquos ability tolearn essential patterns in the identification and localizationof the crucial features Essential feature extraction is para-mount and this is easily achieved via the CNNrsquos operation ofconvolving filters over images subsequently generatingfeature maps from the receptive fields from which the filtersare applied Using backpropagation techniques the filterweights are updated to minimize the loss of the modelrsquosprediction compared to the ground truth [119] -ere havebeen several evolutions over the years and this has usheredconsiderate architectural development in the extraction

methodology Recently the use of a pretrained model hasbeen explored with the advantage of reducing time andcomputational cost while preserving efficiency -eseextracted features are passed along to other models such asthe language decoder in the visual space-basedmethods or toa shared model as in the multimodal space for image cap-tioning training [120]

Captioning image caption or annotation is an inde-pendent scope of artificial intelligence and it mostly com-bines two models consisting of a feature extractor as theencoder and a recurrent decoder model While the extractorobtains salient features in the images the decoder modelwhich is similar in pattern to the language model utilizes arecurrent neural network to learn sequential information[121] Most captioning tasks are undertaken in a supervisedmanner whereby the image features act as the input whichare learned and mapped to a textual label [122] -e labelcaptions are first transformed into a word vector and arecombined with the feature vector to generate a new textualdescription Most captioning architectures follow the partialcaption technique whereby part of the label vector iscombined with the image vector to predict the next word inthe sequence [123] -en the prior words are all combinedto predict the next word and continued till an end token isreached In most cases to infuse semantics and intuitiverepresentation into the label vector a pretrained wordembedding is used to map the dimensional representation ofthe embeddings into the word vector enriching its contentand generalization [124]

31 Image Captioning Techniques

311 Retrieval-Based Captioning Early works in imagecaptioning were based on caption retrieval Using thistechnique the caption to a target image is generated byretrieving the descriptive sentence of such an image from aset of predefined caption databases In some cases the newlygenerated caption would be one of the existing retrievedsentences or in other cases could be made up of severalexisting retrieved sentences [125] Initially the features of animage are compared to the available candidate captions orachieved by tagging the image property in a query Certainproperties such as color texture shape and size were usedfor similarity computation between a target image andpredefined images [126] Captioning via the retrieval methodcan be retrieved through the visual and multimodal spaceand these approaches produce good results generally but areoverdependent on the predefined captions [127]

Specific details such as the object in an image or the actionor scene depicted were used to connect images to the corre-sponding captions -is was computed by finding the ratio ofsimilarity between such information to the available sentences[128] Using the kernel canonical correlation analysis tech-nique images and related sentences were ranked based on theircosine similarities after which the most similar ones wereselected as the suitable labels [129] while the image featureswere used for reranking the ratio of image-text correlation[130]-e edges and contours of images were utilized to obtain

Table 1 Class pixel label distribution in the CamVid dataset

Dataset Method mIoU

CamVid

ApesNet [90] 480ENet [91] 513SegNet [60] 556LinkNet [92] 558FCN8 [59] 570

AttentionM [93] 601DeepLab-LFOV [72] 616

Dilation8 [66] 653BiseNet [94] 687PSPNet [60] 691

DenseDecoder [67] 709AGNet [95] 752

PASCAL VOC

Wails [96] 559FCN8 [59] 622

PSP-CRF [97] 654Zoom Out [98] 696

DCU [99] 717DeepLab1 [72] 716DeConvNet [61] 725GCRF [100] 732DPN [101] 741

Piecewise [102] 753

Cityscapes

FCN8 [59] 653DPN [101] 668

Dilation10 [103] 671LRR [104] 697

DeepLab2 [73] 704FRRN [105] 718

RefineNet [106] 736GEM [107] 7369PEARL [108] 754TuSimple [109] 776PSPNet [110] 784SPP-DCU [99] 789

Complexity 7

pattern and sketches such that they were used as a query forimage retrieval whereby the generated sketches and theoriginal images were structured into a database [131] Buildingon the logic of the original image and its sketch more imageswere dynamically included alongside their sketches to enhancelearning [132] Furthermore deep learning models were ap-plied to retrieval-based captioning by using convolutionalneural networks (CNNs) to extract features from differentregions in an image [133]

312 Template-Based Captioning Another common ap-proach of image annotation is template-based which in-volves the identification of certain attributes such as objecttype shape actions and scenes in an image which are thenused in forming sentences in a prestructured template [134]In this method the predetermined template has a constantnumber of slots such that all the detected attributes are thenpositioned in the slots to make up a sentence In this case thewords representing the detected features make up thecaption and are arranged such that they are syntactically andsemantically related thus generating grammatically correctrepresentations [135] -e fixed template problem of thetemplate-based approach was overcome by incorporating aparsed language model [136] giving the network higherflexibility and ability to generate better captions

-e underlying nouns scenes verbs and prepositionsdetermining the main idea of a sentence were explored and

trained using a language model to obtain the probabilitydistribution of such parameters [137] Certain humanpostures and orientation which do not involve the move-ment of the hands such as walking standing and seeing andthe position of the head were used to generate captions of animage [138] Furthermore the postures were extended todescribe human behavior and interactions by incorporatingmotion features and associating them with the corre-sponding action [139] Each body part and the action it isundergoing are identified and then this is compiled andintegrated to form a description of the complete humanbody Human posture position and direction of the headand position of the hands were selected as geometric in-formation for network modeling

313 Neural Network-Based Captioning Compared to othermachine learning algorithms or preexisting approaches deeplearning has achieved astonishing heights in image captioningsetting new benchmarks with almost all of the datasets in all ofthe evaluation metrics -ese deep learning approaches aremostly in the supervised setting which requires a huge amountof training data including both images and their correspondingcaption labels Several models have been applied such as ar-tificial neural network (ANN) convolutional neural network(CNN) recurrent neural network (RNN) autoencoder gen-erative adversarial network (GAN) and even a combination ofone or more of them

A female tennis player inaction on the court

(a)

A group of young men playinga game of soccer

(b)

A man riding a wave on topof a surfboard

(c)

A baseball game in progress with the batter up to plate

(d)

A brown bear standing on topof a lush green field

(e)

A person holding a cellphone in their hand

(f )

A close up of a personbrushing his teeth

(g)

A women laying on a bed ina bed-room

(h)

A black and white cat is sitting on a chair

(i)

A large clock mounted to theside of a building

(j)

A bunch of fruits that aresitting on a table

(k)

A toothbrush holder sitting on top of a white sink

(l)

Figure 4 Sample images and their corresponding captions [112]

8 Complexity

Dense captioning dense captioning emerges as a branchof computer vision whereby pictorial features are denselyannotated depending on the object objectrsquos motion and itsinteraction -e concept depends on the identification offeatures as well as their localization and finally expressingsuch features with short descriptive phrases [140] -is ideais drawn from the understanding that providing a singledescription for a whole picture can be complex or sometimesbias therefore a couple of annotations are generated re-lating to different recognized salient features of the image-e training data of a dense caption in comparison to aglobal caption are different in that various labels are givenfor individual features identified by bounding boxeswhereby a general description is given in the global cap-tioning without a need for placing bounding boxes on theimages [141] Visible in Figure 5 a phrase is generated fromeach region in the image and these regions could becompiled to form a complete caption of the image Gen-erally dense captioning models face a huge challenge asmost of the target regions in the images overlap which makesaccurate localization challenging and daunting [143]

-e intermodal alignment of the text and images wasinvestigated on region-level annotations which pioneers anew approach for captioning leveraging the alignmentbetween the feature embedding and the word vector se-mantic embedding [144] A fully convolutional localizationnetwork (FCLN) was developed to determine importantregions of interest in an image -e model combines a re-current neural network language model and a convolutionalneural network to enforce the logic of object detection andimage description -e designed network uses dense lo-calization layer and convolution anchors built on the FasterR-CNN technique to predict region proposal from the inputfeatures [142] A contextual information model that com-bines previous and future features of a video spanning up totwo minutes achieved dense captioning by transforming thevideo input into slices of frames With this an event pro-posal module helps to extract the context from the frameswhich are fed into a long short-term memory (LSTM) unitenabling it to generate different proposals at different timesteps [145] Also a framework having separate detectionnetwork and localization captioning network accomplishedimproved dense captioning with faster speed by directlyproducing detected features rather than via the common useof the region proposal network (RPN) [146]

Encoder-decoder framework most image captioningtasks are built on the encoder-decoder structure whereby theimages and texts are managed as separate entities by dif-ferent models In most cases a convolutional neural networkis presented as the encoder which acts as a feature extractorwhile a recurrent neural network is presented as the decoderwhich serves as a language model to process the extractedfeatures in parallel with the text label consequently gen-erating predicted captions for the input images [147] CNNhelps to identify and localize objects and their interactionand then this insight is combined with long-term depen-dencies from a recurrent network cell to predict a word at atime depending on the image context vector and previouslygenerated words [148] Multiple CNN-based encoders were

proposed to provide a more comprehensive and robustcapturing of objects and their interaction from images -eidea of applying multiple encoders is suggested to com-plement each unit of the encoder to obtain better featureextraction -ese interactions are then translated to a novelrecurrent fusion network (RFNet) which could fuse andembed the semantics from the multiple encoders to generatemeaningful textual representations and descriptions [149]

Laid out in Figure 6 a combination of two CNN modelsas both encoder and decoder was explored to speed upcomputational time for image captioning tasks BecauseRNNrsquos long-range information is computed step by step thiscauses very expensive computation and is solved by stackinglayers of convolution to mimic tree structure learning of thesentences [150] -ree distinct levels of features which areregional visual and semantic features were encoded in amodel to represent different analyses of the input image andthen this is fed into a two-layer LSTM decoder for generatingwell-defined captions [151] A concept-based sentencereranking technique was incorporated into the CNN-LSTMmodel such that concept detectors are added to the un-derlying sentence generation model for better image de-scription with minimal manual annotation [152]Furthermore the generative adversarial network (GAN) wasconditioned on a binary vector for captioning -e binaryvector represented some form of sentiment which the imageportrays and then was used to train the adversarial model-e model took both images and an adjective or adjective-noun pair as the input to determine if the network couldgenerate a caption describing the intended sentimentalstance [153]

Attention-guided captioning attention has become in-creasingly paramount and it has driven better benchmarksin several tasks such as machine translation languagemodeling and other natural language processing tasks aswell as computer vision tasks In fact attention has proven tocorrelate the meaning between features and this helps tounderstand how such a feature relates to one another [154]Incorporating this into a neural network it encourages themodel to focus on salient and relevant features and pay lessconsideration to other noisy aspects of the data space dis-tribution [155] To estimate the concept of attention in imageannotation a model is trained to concentrate its compu-tation on the identified salient regions while generatingcaptions using both soft and hard attention [156] -e de-terministic soft attention which is trainable via standardbackpropagation is learned by weighting the annotatedvector of the image features while the stochastic hard at-tention is trained via maximizing a variational lower boundsetting it to 1 when the feature is salient [157]

Following the where and what analysis of what the modelshould concentrate on adaptive attention used a hierarchicalstructure to fuse both high-level semantic information andvisual information from an image to form intuitive repre-sentation [120] -e top-down and bottom-up approaches arefused using semantic attention which first defines attributedetectors that dynamically enable it to switch between con-cepts -is empowers the detectors to determine suitablecandidates for attention computation based on the specified

Complexity 9

inputs [158] -e limitation of long-distance dependency andinference speed in medical image captioning was tackled usinga hierarchical transformer -e model includes an image en-coder that extracts features with the support of a bottom-upattention mechanism to capture and extract top-down visualfeatures as well as a nonrecurrent transformer captioningdecoder which helps to compile the generated medical illus-tration [159] Salient regions in an image are also extractedusing a convolutional model as region features were repre-sented as pooled feature vector -ese intraimage regionvectors are appropriately attended to obtain suitable weightsdescribing their influence before they are fed into a recurrentmodel that learns their semantic correlation -e corre-sponding sequence of the correlated features is transformedinto representations which are illustrated as sentences de-scribing the featuresrsquo interactions [160]

314 Unsupervised or Semisupervised CaptioningSupervised captioning has so far been productive andsuccessful partly due to the availability of sophisticated deeplearning algorithms and an increasing outpour of data-rough the supervised deep learning techniques a com-bination of models and frameworks which learn the jointdistribution of images and labels has displayed a very in-tuitive and meaningful illustration of images even similar tohumans However despite the achievement the process ofcompletely creating a captioning training set is quitedaunting and the manual effort required to annotate themyriad of images is very challenging As a result othermeans which are free of excessive training data are exploredAn unsupervised captioning approach that combines twosteps of query and retrieval was researched in [161] Firstseveral target images are obtained from the internet as well

Input image

Flow of gradients

ResNet +fasterRCNN

Featureextractor

Speakernetwork

Φ

A browncat

standingon two redsuitcases

Comparewith

naturallanguage

Naturalnessloss A large brown cat on

top of red luggageStanding on ared suitcaseNatural

description of the image

Discriminabilityloss

Target anddistractors

Listenernetwork

θGreen modules are trainedBlack modules are fixed

Figure 6 Sample architecture of a multimodal image captioning network [111]

Whole image Image regions Label density

DetectionClassification

Cat

Singlelabel Cat

Skateboard

Sequence

Labelcomplexity

Captioning

A catriding a

skateboard

Dense captioning

Orange spotted cat

Skateboard withred wheels

Cat riding askateboard

Brown hardwoodflooring

Figure 5 Dense captioning illustrating multiple annotations with a single forward pass [142]

10 Complexity

as a huge database of words describing such images For anychosen image words representing its visual display are usedto query captions from a reference dataset of sentences -isstrategy helps to eliminate manual annotation and also usesmultimodal textual-visual information to reduce the effect ofnoisy words in the vocabulary dataset

Transfer learning which has seen increasing applicationin other deep learning domains especially in computervision was applied to image captioning First the model istrained on a standard dataset in a supervised manner andthen the knowledge from the supervised model is transferredand applied on a different dataset whose sentences andimages are not paired For this purpose two autoencoderswere designed to train on the textual and visual datasetrespectively using the distribution of the learned supervisedembedding space to infer the unstructured dataset [162]Also the process of manual annotation of the training setwas semiautomated by evaluating an image into severalfeature spaces which are individually estimated by an un-supervised clustering algorithm -e centers of the clusteredgroups are then manually labeled and compiled into asentence through a voting scheme which compiles all theopinions suggested from each cluster [163] A set of naıveBayes model with AdaBoost was used for automatic imageannotation by first using a Bayesian classifier to identifyunlabeled images and then labeled by a succeeding classifierbased on the confidence measurement of the prior classifier[164] A combination of keywords which have been asso-ciated with both labeled and unlabeled images was trainedusing a graph model -e semantic consistency of the un-labeled images is computed and compared to the labeledimages -is is continued until all the unlabeled images aresuccessfully annotated [165]

315 Difference Captioning As presented in Figure 7 aspot-the-difference task which describes the differencesbetween two similar images using advance deep learningtechnique was first investigated in [166] -eir model used alatent variable to capture visual salience in an image pair byaligning pixels which differ in both images -eir workincluded different model designs such as nearest neighbormatching scheme captioning masked model and DifferenceDescription with Latent Alignment uniform for obtainingdifference captioning -e Difference Description with La-tent Alignment (DDLA) compares both input images at apixel level via a masked L2 distance function

Furthermore the Siamese Difference Captioning Model(SDCM) also combined techniques from deep Siameseconvolutional neural network soft attention mechanismword embedding and bidirectional long short-term mem-ory [167] -e features in each image input are computedusing the Siamese network and their differences are ob-tained using a weighted L1 distance function Differentfeatures are then recursively translated into text using arecurrent neural network and an attention network whichfocuses on the relevant region on the images -e idea of theSiamese Difference Captioning Model was extended byconverting the Siamese encoder into a Fully Convolutional

CaptionNet (FCC) through a fully convolutional network[168] -is helps to transform the extracted features into alarger dimension of the input images whichmakes differencecomputation more efficient Also a word embedding pre-trained model was used to embed semantics into the textdataset and beam search technique to ensure multiple op-tions for robustness

32 Datasets -ere are several publicly available datasetswhich are useful for training image captioning tasks -emost popular datasets include Flickr8k [169] Flickr30k[170] MS COCO dataset [83] Visual Genome dataset [171]Instagram dataset [172] and MIT-Adobe FiveK dataset[173]

Flickr30K dataset it has about 30000 images from Flickrand about 158000 captions describing the content of theimages Because of the huge volume of the data users areable to determine their preferred split size for using the data

Flickr8K dataset it has a total of 8000 images which aredivided as 6000 1000 and 1000 for the training test andvalidation set respectively All the images have 5 labelcaptions which are used as a supervised setting for trainingthe images

Microsoft COCO dataset it is perhaps the largest cap-tioning dataset and it also includes training data for objectrecognition and image segmentation tasks respectively -edataset contains around 300000 images with 5 captions foreach image

33 Evaluation Metrics -e automatically generated cap-tions are evaluated to confirm their correctness in describingthe given image In machine learning some of the commonimage captioning evaluation measures are as follows

BLEU (BiLingual Evaluation Understudy) [174] as ametric it counts the number of matching n-grams in themodelrsquos prediction compared to the ground truth With thisprecision is calculated based on the mean n-grams com-puted and the recall is computed via the introduction of abrevity penalty in the caption label

ROUGE (Recall-Oriented Understudy for GistingEvaluation) [175] it is useful for summary evaluation and iscalculated as the overlap of either 1-gram or bigrams be-tween the referenced caption and the predicted sequenceUsing the longest sequence available the co-occurrence F-score mean of the predicted sequencersquos recall and predictionis obtained

METEOR (Metric for Evaluation of Translation withExplicit Ordering) [176] it addresses the drawback of BLEUand it is based on a weighted F-score computation as well asa penalty function meant to check the order of the candidatesequence It adopts synonyms matching in the detection ofsimilarity between sentences

CIDEr (Consensus-based Image Description Evalua-tion) [177] it determines the consensus between a referencesequence and a predicted sequence via cosine similaritystemming and TF-IDF weighting-e predicted sequence iscompared to the combination of all available referencesequences

Complexity 11

SPICE (Semantic Propositional Image Caption Evalua-tion) [178] it is a relatively new caption metric which relateswith the semantic interrelationship between the generatedand referenced sequence Its graph-based methodology usesa scene graph of semantic representations to indicate detailsof objects and their interaction to describe their textualillustrations

34 Discussion With an increase in the generation of dataproduction of sophisticated computing hardware and

complex machine learning algorithms a lot of achievementshave been accomplished in the field of image captioning-ough there have been several implementations the bestresults in almost all of the metrics have been recordedthrough the use of deep learning models In most cases thecommon implementation has been the encoder-decoderarchitecture which has a feature extractor as the encoder anda language model as the decoder

Compared to other approaches this has proven useful asit has become the backbone for more recent designs Toachieve better feature computation attention mechanism

(a) (b)

Figure 7 Image pair difference annotations of the Spot-the-Diff dataset [166] (a)-e blue truck is no longer there (b) A car is approachingthe parking lot from the right


Dataset Method B-1 B-2 B-3 B-4 M C

MS COCO

LSTM-A-2 [179] 0734 0567 0430 0326 0254 100Att-Reg [180] 0740 0560 0420 0310 0260 mdash

Attend-tell [156] 0707 0492 0344 0243 0239 mdashSGC [181] 671 488 343 239 218 733

phi-LSTM [182] 666 489 355 258 231 821COMIC [183] 706 534 395 292 237 881TBVA [184] 695 521 386 287 241 919SCN [185] 0741 0578 0444 0341 0261 1041

CLGRU [186] 0720 0550 0410 0300 0240 0960A-Penalty [187] 721 551 415 314 247 956VD-SAN [188] 734 566 428 322 254 999ATT-CNN [189] 739 571 433 33 26 1016RTAN [190] 735 569 433 329 254 1033Adaptive [191] 0742 0580 0439 0332 0266 1085Full-SL [192] 0713 0539 0403 0304 0251 0937

Flickr30K

hLSTMat [193] 738 551 403 294 23 666SGC [181] 615 421 286 193 182 399

RA+ SF [194] 0649 0462 0324 0224 0194 0472gLSTM [195] 0646 0446 0305 0206 0179 mdash

Multi-Mod [196] 0600 0380 0254 0171 0169 mdashTBVA [184] 666 484 346 247 202 524

Attend-tell [156] 0669 0439 0296 0199 0185 mdashATT-FCN [158] 0647 0460 0324 0230 0189 mdash

VQA [197] 0730 0550 0400 0280 mdash mdashAlign-Mod [144] 0573 0369 0240 0157 mdash mdashm-RNN [198] 0600 0410 0280 0190 mdash mdashLRCN [112] 0587 0391 0251 0165 mdash mdashNIC [141] 0670 0450 0300 mdash mdash mdashRTAN [190] 671 487 349 239 201 5333-gated [199] 694 457 332 226 23 mdashVD-SAN [188] 652 471 336 239 199 mdashATT-CNN [189] 661 472 334 232 194 mdash

12 Complexity

concepts have been applied to help in focusing on the salientsection of images and their features thereby improvingfeature-text capturing and translation In the same mannerother approaches such as generative adversarial network andautoencoders have been thriving in achieving concise imageannotation and to this end such idea has been incorporatedwith other unsupervised concepts for captioning purposes aswell For example reinforced learning technique also gen-erated sequences which are able to succinctly describe im-ages in a timely manner Furthermore analyses of severalmodel designs and their results are displayed in Table 2depicting their efficiency and effectiveness in the BLEUMETEOR ROUGE-L CIDEr and SPICE metrics

4 Conclusion

In this survey the state-of-the-art advances in semantic seg-mentation and image captioning have been discussed -echaracteristics and effectiveness of the important techniqueshave been considered as well as their process of achieving bothtasks Some of the methods which have accomplished out-standing results have been illustrated including the extractionidentification and localization of objects in semantic seg-mentation Also the process of feature extraction and trans-formation into a language model has been studied in the imagecaptioning section In our estimation we believe that because ofthe daunting task of manually segmenting images into se-mantic classes as well as the human annotation of imagesinvolved in segmentation and captioning future researchwould move in the direction of an unsupervised setting ofaccomplishing this task -is would ensure more energy andfocus is invested solely in the development of complexmachinelearning algorithms and mathematical models which couldimprove the present state of the art

Data Availability

-e dataset used for the evalution of the models presented inthis study have been discussed in the manuscript as well astheir respective references and publications such as PASCALVOC PASCAL Visual Object Classes (VOC) [82]MSCOCO Microsoft Common Objects in Context [83]Cityscapes Cityscapes dataset [84] ADE20K ADE20Kdataset [85] CamVid [86] Flickr8k [168] Flickr30k [169] MSCOCO Dataset [83] VisualGenome Dataset [170] InstagramDataset [171] and MITndashAdobe FiveK dataset [172]

Conflicts of Interest

-e authors declare that they have no conflicts of interest

Acknowledgments

-is work was supported in part by the NSFC-GuangdongJoint Fund (Grant no U1401257) the National NaturalScience Foundation of China (Grant nos 6130009061133016 and 61272527) the Science and Technology PlanProjects in Sichuan Province (Grant no 2014JY0172) andthe Opening Project of Guangdong Provincial Key

Laboratory of Electronic Information Products ReliabilityTechnology (Grant no 2013A061401003)

References

[1] M Leo G Medioni M Trivedi T Kanade andGM Farinella ldquoComputer vision for assistive technologiesrdquoComputer Vision and Image Understanding vol 154 pp 1ndash15 2017

[2] A Kovashka O Russakovsky L Fei-Fei and K GraumanldquoCrowdsourcing in computer visionrdquo Foundations andTrends in Computer Graphics and Vision vol 10 no 3pp 177ndash243 2016

[3] H Ayaz M Ahmad M Mazzara and A Sohaib ldquoHyper-spectral imaging for minced meat classification using non-linear deep featuresrdquoApplied Sciences vol 10 no 21 p 77832020

[4] A Oluwasanmi M U Aftab A Shokanbi J JacksonB Kumeda and Z Qin ldquoAttentively conditioned generativeadversarial network for semantic segmentationrdquo IEEE Ac-cess vol 8 pp 31733ndash31741 2020

[5] A Kendall and Y Gal ldquoWhat uncertainties do we need inbayesian deep learning for computer visionrdquo in Proceedingsof the 2017 Conference on Neural Information ProcessingSystems Long Beach CA USA December 2017

[6] C Szegedy V Vanhoucke S Ioffe J Shlens and Z WojnaldquoRethinking the inception architecture for computer visionrdquoin Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition (CVPR) pp 2818ndash2826 Las VegasNV USA June 2016

[7] B G Weinstein ldquoA computer vision for animal ecologyrdquoJournal of Animal Ecology vol 87 no 3 pp 533ndash545 2018

[8] A Oluwasanmi M U Aftab Z Qin et al ldquoTransfer learningand semisupervised adversarial detection and classificationof COVID-19 in CT imagesrdquo Complexity vol 2021 ArticleID 6680455 11 pages 2021

[9] M Ahmad I Haq Q Mushtaq and M Sohaib ldquoA newstatistical approach for band clustering and band selectionusing k-means clusteringrdquo International Journal of Engi-neering and Technology vol 3 2011

[10] M Buckler S Jayasuriya and A Sampson ldquoReconfiguringthe imaging pipeline for computer visionrdquo in Proceedings ofthe 2017 IEEE International Conference on Computer Vision(ICCV) pp 975ndash984 Venice Italy October 2017

[11] M Leo A Furnari G G Medioni M M Trivedi andG M Farinella ldquoDeep learning for assistive computer vi-sionrdquo in Proceedings of the European Conference on Com-puter VisionWorkshops Munich Germany September 2018

[12] H Fang S Gupta F N Iandola et al ldquoFrom captions tovisual concepts and backrdquo in Proceedings of the 2015 IEEEConference on Computer Vision and Pattern Recognition(CVPR) pp 1473ndash1482 Boston MA USA June 2015

[13] D E Goldberg and J H Holland ldquoGenetic algorithms andmachine learningrdquo Machine Learning vol 3 no 23pp 95ndash99 1988

[14] S Shalev-Shwartz and S Ben-DavidUnderstanding MachineLearning From gteory to Algorithms Cambridge UniversityPress Cambridge UK 2014

[15] E Alpaydin ldquoIntroduction to machine learningrdquo AdaptiveComputation AndMachine Learning MIT Press CambridgeMA USA 2004

[16] J Snoek H Larochelle and R P Adams ldquoPractical bayesianoptimization of machine learning algorithmsrdquo in

Complexity 13

Proceedings of the Neural Information Processing Systems(NIPS) Lake Tahoe NV USA August 2012

[17] I G Goodfellow Y Bengio and A C Courville DeepLearning Nature vol 521 pp 436ndash444 2015

[18] F Hutter L Kotthoff and J Vanschoren ldquoAutomatedmachine learning methods systems challengesrdquo Auto-mated Machine Learning MIT Press Cambridge MA USA2019

[19] R Singh A Sonawane and R Srivastava ldquoRecent evolutionof modern datasets for human activity recognition a deepsurveyrdquo Multimedia Systems vol 26 2020

[20] H Kuehne H Jhuang E Garrote T Poggio and T SerreldquoHMDB a large video database for human motion recog-nitionrdquo in Proceedings of the 2011 International Conferenceon Computer Vision pp 2556ndash2563 Barcelona SpainNovember 2011

[21] R Poppe ldquoA survey on vision-based human action recog-nitionrdquo Image Vision Comput vol 28 pp 976ndash990 2011

[22] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquoNeural Computation vol 9 no 8 pp 1735ndash1780 1997

[23] R Dey and F M Salem ldquoGate-variants of gated recurrentunit (GRU) neural networksrdquo in Proceedings of the IEEE 60thInternational Midwest Symposium on Circuits and Systems(MWSCAS) pp 1597ndash1600 Boston MA USA August 2017

[24] K Simonyan and A Zisserman ldquoVery deep convolutionalnetworks for large-scale image recognitionrdquo 2014 httparxivorgabs14091556

[25] O Ivanov M Figurnov and D P Vetrov ldquoVariationalautoencoder with arbitrary conditioningrdquo in Proceedings ofthe International Conference on Learning RepresentationsVancouver Canada May 2018

[26] Y LeCun L Bottou Y Bengio and P Haffner ldquoGradient-based learning applied to document recognitionrdquo Proceed-ings of the IEEE vol 86 no 11 pp 2278ndash2324 1998

[27] A Krizhevsky I Sutskever and G E Hinton ldquoImageNetclassification with deep convolutional neural networksrdquoAdvances in Neural Information Processing Systems vol 252012

[28] K Khan M S Siddique M Ahmad and M Mazzara ldquoAhybrid unsupervised approach for retinal vessel segmenta-tionrdquo BioMed Research International vol 2020 Article ID8365783 20 pages 2020

[29] C Szegedy W Liu Y Jia et al ldquoGoing deeper with con-volutionsrdquo in Proceedings of the 2015 IEEE Conference onComputer Vision and Pattern Recognition (CVPR) BostonMA USA June 2015

[30] S Xie R B Girshick P Dollar Z Tu and K He ldquoAg-gregated residual transformations for deep neural networksrdquoin Proceedings of the 2017 IEEE Conference on ComputerVision and Pattern Recognition (CVPR) pp 5987ndash5995Honolulu HI USA July 2017

[31] K He X Zhang S Ren and J Sun ldquoDeep residual learningfor image recognitionrdquo in Proceedings of the 2016 IEEEConference on Computer Vision and Pattern Recognition(CVPR) pp 770ndash778 Las Vegas NV USA June 2016

[32] H Hassan A Bashir M Ahmad et al ldquoReal-time imagedehazing by superpixels segmentation and guidance filterrdquoJournal of Real-Time Image Processing 2020

[33] C Liu L Chen F Schroff et al ldquoAuto-DeepLab hierarchicalneural architecture search for semantic image segmenta-tionrdquo in Proceedings of the Conference on Computer Visionand Pattern Recognition Long Beach CA USA June 2019

[34] A Kirillov K He R B Girshick C Rother and P DollarldquoPanoptic segmentationrdquo in Proceedings of the Conference on

Computer Vision and Pattern Recognition Long Beach CAUSA June 2019

[35] H Kervadec J Dolz M Tang E Granger Y Boykov andI Ben Ayed ldquoConstrained-CNN losses for weakly supervisedsegmentationrdquo Medical Image Analysis vol 54 pp 88ndash992019

[36] A Arnab S Zheng S Jayasumana et al ldquoConditionalrandom fields meet deep neural networks for semanticsegmentation combining probabilistic graphical modelswith deep learning for structured predictionrdquo IEEE SignalProcessing Magazine vol 35 no 1 pp 37ndash52 2018

[37] M Sezgin and B Sankur ldquoSurvey over image thresholdingtechniques and quantitative performance evaluationrdquoJournal of Electronic Imaging vol 13 pp 146ndash168 2004

[38] A Oluwasanmi Z Qin and T Lan ldquoBrainMR segmentationusing a fusion of K-means and spatial Fuzzy C-meansrdquo inProceeding of International Conference on Computer Scienceand Application Engineering Wuhan China July 2017

[39] A Oluwasanmi Z Qin T Lan and Y Ding ldquoBrain tissuesegmentation in MR images with FGMrdquo in Proceeding of theInternational Conference on Artificial Intelligence andComputer Science Guilin China December 2016

[40] J Chen C Yang G Xu and L Ning ldquoImage segmentationmethod using Fuzzy C mean clustering based on multi-objective optimizationrdquo Journal of Physics Conference Seriesvol 1004 pp 12ndash35 2018

[41] A Oluwasanmi Z Qin and T Lan ldquoFusion of Gaussianmixture model and spatial Fuzzy C-means for brain MRimage segmentationrdquo in Proceedings of International Con-ference on Computer Science and Application EngineeringWuhan China July 2017

[42] B J Liu and L Cao ldquoSuperpixel segmentation usingGaussian mixture modelrdquo IEEE Transactions on ImageProcessing vol 27 no 8 pp 4105ndash4117 2018

[43] B Kang and T Q Nguyen ldquoRandom forest with learnedrepresentations for semantic segmentationrdquo IEEE Transac-tions on Image Processing vol 28 no 7 pp 3542ndash3555 2019

[44] Z Zhao and X Wang ldquoMulti-egments Naıve Bayes classifierin likelihood spacerdquo IET Computer Vision vol 12 no 6pp 882ndash891 2018

[45] T Y Tan L Zhang C P Lim B Fielding Y Yu andE Anderson ldquoEvolving ensemble models for image seg-mentation using enhanced particle swarm optimizationrdquoIEEE Access vol 7 pp 34004ndash34019 2019

[46] X Cao F Zhou L Xu D Meng Z Xu and J PaisleyldquoHyperspectral image classification with Markov randomfields and a convolutional neural networkrdquo IEEE Transac-tions on Image Processing vol 27 no 5 pp 2354ndash2367 2018

[47] S Mohapatra ldquoSegmentation using support vector machinesrdquoin Proceedings of the Second International Conference on Ad-vanced Computational and Communication Paradigms(ICACCP 2019) pp 1ndash4 Gangtok India November 2019

[48] Ccedil Kaymak andA Uccedilar ldquoA brief survey and an application ofsemantic image segmentation for autonomous drivingrdquoHandbook of Deep Learning Applications Springer BerlinGermany 2018

[49] B Hariharan P A Arbelaez R B Girshick and J MalikldquoSimultaneous detection and segmentationrdquo in Proceedingsof the European Conference on Computer Vision ZurichSwitzerland September 2014

[50] R Girshick J Donahue T Darrell and J Malik ldquoRegion-based convolutional networks for accurate object detectionand segmentationrdquo IEEE Transactions on Pattern Analysisand Machine Intelligence vol 38 no 1 pp 142ndash158 2016

14 Complexity

[51] Y Li H Qi J Dai X Ji and Y Wei ldquoFully convolutionalinstance-aware semantic segmentationrdquo in Proceedings of the2017 IEEE Conference on Computer Vision and PatternRecognition (CVPR) pp 4438ndash4446 Honolulu HI USAJuly 2017

[52] H Caesar J Uijlings and V Ferrari ldquoRegion-based semanticsegmentation with end-to-end trainingrdquo in Proceedings ofthe European Conference on Computer Vision pp 381ndash397Amsterdam -e Netherlands October 2016

[53] R B Girshick J Donahue T Darrell and J Malik ldquoRichfeature hierarchies for accurate object detection and se-mantic segmentationrdquo in Proceedings of the 2014 IEEEConference on Computer Vision and Pattern Recognitionpp 580ndash587 Columbus OH USA June 2014

[54] N Wang S Li A Gupta and D Yeung ldquoTransferring richfeature hierarchies for robust visual trackingrdquo 2015 httparxivorgabs150104587

[55] R B Girshick ldquoFast R-CNNrdquo in Proceedings of the 2015IEEE International Conference on Computer Vision (ICCV)pp 1440ndash1448 Santiago Chile December 2015

[56] S Ren K He R B Girshick and J Sun ldquoFaster R-CNNtowards real-time object detection with region proposalnetworksrdquo IEEE Transactions on Pattern Analysis and Ma-chine Intelligence vol 39 pp 1137ndash1149 2015

[57] K He G Gkioxari P Dollar and R B Girshick ldquoMaskR-CNNrdquo in Proceedings of the 2017 IEEE InternationalConference on Computer Vision (ICCV) pp 2980ndash2988Venice Italy October 2017

[58] A Salvador X Giro F Marques and S Satoh ldquoFasterR-CNN features for instance searchrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern Recogni-tionWorkshops (CVPRW 2016) pp 394ndash401 Las Vegas NVUSA June 2016

[59] J Long E Shelhamer and T Darrell ldquoFully convolutionalnetworks for semantic segmentationrdquo in Proceedings of the2015 IEEE Conference on Computer Vision and PatternRecognition (CVPR) pp 3431ndash3440 Boston MA USA June2015

[60] V Badrinarayanan A Kendall and R Cipolla ldquoSegNet adeep convolutional encoder-decoder architecture for imagesegmentationrdquo IEEE Transactions on Pattern Analysis andMachine Intelligence vol 39 no 12 pp 2481ndash2495 2017

[61] H Noh S Hong and B Han ldquoLearning deconvolutionnetwork for semantic segmentationrdquo in Proceedings of the2015 IEEE International Conference on Computer Vision(ICCV) pp 1520ndash1528 Santiago Chile December 2015

[62] O Ronneberger P Fischer T Brox and U-Net Convolu-tional Networks for Biomedical Image Segmentation MIC-CAI Munich Germany 2015

[63] M Drozdzal E Vorontsov G Chartrand S Kadoury andC J Pal ldquo-e importance of skip connections in biomedicalimage segmentationrdquo 2016 httparxivorgabs160804117

[64] S Jegou M Drozdzal D Vazquez A Romero andY Bengio ldquo-e one hundred layers tiramisu fully con-volutional DenseNets for semantic segmentationrdquo in Pro-ceedings of the 2017 IEEE Conference on Computer Vision andPattern Recognition Workshops (CVPRW) Honolulu HIUSA July 2017

[65] T Takikawa D Acuna V Jampani and S Fidler ldquoGated-SCNN gated shape CNNs for semantic segmentationrdquo 2019httparxivorgabs190705740

[66] F Yu and V Koltun ldquoMulti-scale context aggregation bydilated convolutionsrdquo 2015 httparxivorgabs151107122

[67] P Bilinski and V Prisacariu ldquoDense decoder shortcutconnections for single-pass semantic segmentationrdquo inProceedings of the 2018 IEEECVF Conference on ComputerVision and Pattern Recognition pp 6596ndash6605 Salt LakeCity UT USA June 2018

[68] Z Zhang X Zhang C Peng D Cheng and J Sun ldquoExFuseenhancing feature fusion for semantic segmentationrdquo inProceedings of the European Conference on Computer VisionMunich Germany September 2018

[69] H Zhao X Qi X Shen J Shi and J Jia ldquoICNet for real-timesemantic segmentation on high-resolution imagesrdquo inProceedings of the European Conference on Computer VisionKolding Denmark June 2017

[70] H Li P Xiong H Fan and J Sun ldquoDFANet deep featureaggregation for real-time semantic segmentationrdquo in Pro-ceedings of the Conference on Computer Vision and PatternRecognition Long Beach CA USA June 2019

[71] W Xiang H Mao and V Athitsos ldquo-underNet a turbounified network for real-time semantic segmentationrdquo inProceedings of the 2019 IEEE Winter Conference on Appli-cations of Computer Vision (WACV) pp 1789ndash1796 VillageHI USA January 2019

[72] L Chen G Papandreou I Kokkinos K Murphy andA L Yuille ldquoSemantic image segmentation with deepconvolutional nets and fully connected CRFsrdquo in Proceedingsof the International Conference on Learning Representationspp 11ndash25 San Diego CA USA May 2015

[73] L Chen G Papandreou I Kokkinos K Murphy andA L Yuille ldquoDeepLab semantic image segmentation withdeep convolutional nets atrous convolution and fullyconnected CRFsrdquo IEEE Transactions on Pattern Analysis andMachine Intelligence vol 40 pp 834ndash848 2016

[74] L Chen G Papandreou F Schroff and H Adam ldquoRe-thinking atrous convolution for semantic image segmenta-tionrdquo 2017 httparxivorgabs170605587

[75] H Wu J Zhang K Huang K Liang and Y Yu ldquoFastFCNrethinking dilated convolution in the backbone for semanticsegmentationrdquo 2019 httparxivorgabs190311816

[76] M Yang K Yu C Zhang Z Li and K Yang ldquoDenseASPPfor semantic segmentation in street scenesrdquo in Proceedings ofthe 2018 IEEECVF Conference on Computer Vision andPattern Recognition pp 3684ndash3692 Salt Lake City UT USAJune 2018

[77] P H Pinheiro and R Collobert ldquoFrom image-level to pixel-level labeling with Convolutional Networksrdquo in Proceedingsof the 2015 IEEE Conference on Computer Vision and PatternRecognition (CVPR) pp 1713ndash1721 Boston MA USA June2015

[78] A Khoreva R Benenson J H Hosang M Hein andB Schiele ldquoSimple does it weakly supervised instance andsemantic segmentationrdquo in Proceedings of the 2017 IEEEConference on Computer Vision and Pattern Recognition(CVPR) pp 1665ndash1674 Honolulu HI USA July 2017

[79] G Papandreou L Chen K P Murphy and A L YuilleldquoWeakly-and semi-supervised learning of a deep convolu-tional network for semantic image segmentationrdquo in Pro-ceedings of the 2015 IEEE International Conference onComputer Vision (ICCV) pp 1742ndash1750 Santiago ChileDecember 2015

[80] J Dai K He and J Sun ldquoBoxSup exploiting bounding boxesto supervise convolutional networks for semantic segmen-tationrdquo in Proceedings of the 2015 IEEE International Con-ference on Computer Vision (ICCV) pp 1635ndash1643 SantiagoChile December 2015

Complexity 15

[81] W Hung Y Tsai Y Liou Y Lin and M Yang ldquoAdversariallearning for semi-supervised semantic segmentationrdquo inProceedings of the British Machine Vision ConferenceNewcastle UK September 2018

[82] M Everingham L Van Gool C K I Williams J Winn andA Zisserman ldquo-e pascal visual object classes (VOC)challengerdquo International Journal of Computer Vision vol 88no 2 pp 303ndash338 2009

[83] T Lin M Maire S J Belongie et al ldquoMicrosoft COCOcommon objects in contextrdquo ECCV pp 740ndash755 SpringerBerlin Germany 2014

[84] M Cordts M Omran S Ramos et al ldquo-e Cityscapesdatasetrdquo in Proceedings of the CVPRWorkshop on the Futureof Datasets in Vision Boston MA USA June 2015

[85] B Zhou H Zhao X Puig S Fidler A Barriuso and A TorralbaldquoScene parsing through ADE20K datasetrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition(CVPR) pp 5122ndash5130 Honolulu HI USA July 2017

[86] A Geiger P Lenz C Stiller and R Urtasun ldquoVision meetsrobotics the KITTI datasetrdquo gte International Journal ofRobotics Research vol 32 no 11 pp 1231ndash1237 2013

[87] G Ros L Sellart J Materzynska D Vazquez andA M Lopez ldquo-e SYNTHIA dataset a large collection ofsynthetic images for semantic segmentation of urbanscenesrdquo in Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition (CVPR) pp 3234ndash3243 LasVegas NV USA June 2016

[88] S Nowozin ldquoOptimal decisions from probabilistic modelsthe intersection-over-union caserdquo in Proceedings of the 2014IEEE Conference on Computer Vision and Pattern Recogni-tion pp 548ndash555 Columbus OH USA June 2014

[89] C Wu H Cheng S Li H Li and Y Chen ldquoApesNet apixel-wise efficient segmentation network for embeddeddevicesrdquo in Proceedings of the 14th ACMIEEE Symposiumon Embedded Systems for Real-Time Multimedia pp 1ndash7Pittsburgh PA USA October 2016

[90] A Paszke A Chaurasia S Kim and E Culurciello ldquoEnet adeep neural network architecture for real-time semanticsegmentationrdquo 2016 httparxivorgabs160602147

[91] A Chaurasia and E Culurciello ldquoLinkNet exploiting en-coder representations for efficient semantic segmentationrdquoin Proceedings of the IEEE Visual Communications and ImageProcessing pp 1ndash4 St Petersburg FL USA December 2017

[92] L Fan W Wang F Zha and J Yan ldquoExploring newbackbone and attention module for semantic segmentationin street scenesrdquo IEEE Access vol 6 2018

[93] C Yu J Wang G Peng C Gao G Yu and N Sang ldquoBiSeNetbilateral segmentation network for real-time semantic seg-mentationrdquo in Proceedings of the European Conference onComputer Vision Munich Germany September 2018

[94] J Li Y Zhao J Fu J Wu and J Liu ldquoAttention-guidednetwork for semantic video segmentationrdquo IEEE Accessvol 7 pp 140680ndash140689 2019

[95] H Zhou K Song X Zhang W Gui and Q Qian ldquoWAILSwatershed algorithm with image-level supervision for weaklysupervised semantic segmentationrdquo IEEE Access vol 7pp 42745ndash42756 2019

[96] L Zhang P Shen G Zhu et al ldquoImproving semantic imagesegmentation with a probabilistic superpixel-based denseconditional random fieldrdquo IEEE Access vol 6 pp 15297ndash15310 2018

[97] M Mostajabi P Yadollahpour and G ShakhnarovichldquoFeedforward semantic segmentation with zoom-out fea-turesrdquo in Proceedings of the IEEE Conference on Computer

Vision and Pattern Recognition pp 3376ndash3385 Boston MAUSA June 2015

[98] C Han Y Duan X Tao and J Lu ldquoDense convolutionalnetworks for semantic segmentationrdquo IEEE Access vol 7pp 43369ndash43382 2019

[99] R Vemulapalli O Tuzel M Y Liu and R ChellapaldquoGaussian conditional random field network for semanticsegmentationrdquo in Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition pp 3224ndash3233Las Vegas NV USA June 2016

[100] Z Liu X Li P Luo C C Loy and X Tang ldquoSemantic imagesegmentation via deep parsing networkrdquo in Proceedings ofthe IEEE International Conference on Computer Visionpp 1377ndash1385 Santiago Chile December 2015

[101] G Lin C Shen A van den Hengel and I Reid ldquoEfficientpiecewise training of deep structured models for semanticsegmentationrdquo in Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition pp 3194ndash3203Las Vegas NV USA July 2016

[102] C X Peng X Zhang K Jia G Yu and J Sun ldquoMegDet alarge mini-batch object detectorrdquo in Proceedings of the IEEEConference on Computer Vision and Pattern Recognition(CVPR) pp 6181ndash6189 Salt Lake City UT USA June 2008

[103] G Ghiasi and C C Fowlkes ldquoLaplacian pyramid recon-struction and refinement for semantic segmentationrdquo inProceedings of the European Conference on Computer Visionpp 519ndash534 Amsterdam -e Netherlands October 2016

[104] T Pohlen A Hermans M Mathias and B Leibe ldquoFull-resolution residual networks for semantic segmentation instreet scenesrdquo in Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition (CVPR)pp 4151ndash4160 Honolulu HI USA July 2017

[105] G Lin A Milan C Shen and I Reid ldquoRefineNet multi-pathrefinement networks for high-resolution semantic segmen-tationrdquo in Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition (CVPR) pp 5168ndash5177Honolulu HI USA July 2017

[106] H-H Han and L Fan ldquoA new semantic segmentation modelfor supplementing more spatial informationrdquo IEEE Accessvol 7 pp 86979ndash86988 2019

[107] X Jin X Li H Xiao et al ldquoVideo scene parsing withpredictive feature learningrdquo in Proceedings of the 2017 IEEEInternational Conference on Computer Vision (ICCV)pp 5581ndash5589 Venice Italy October 2017

[108] P Wang P Chen Y Yuan et al ldquoUnderstanding convo-lution for semantic segmentationrdquo in Proceedings of the 2018IEEE Winter Conference on Applications of Computer Vision(WACV) pp 1451ndash1460 Lake Tahoe NV USAMarch 2018

[109] H Zhao J Shi X Qi X Wang and J Jia ldquoPyramid sceneparsing networkrdquo in Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition (CVPR)pp 6230ndash6239 Honolulu HI USA June 2017

[110] G Vered G Oren Y Atzmon and G Chechik ldquoCooper-ative image captioningrdquo 2019 httparxivorgabs190711565

[111] J Donahue L Hendricks S Guadarrama M Rohrbach andS Venugopalan ldquoLong-term recurrent convolutional net-works for visual recognition and descriptionrdquo in Proceedingsof the IEEE Conference on Computer Vision and PatternRecognition pp 2625ndash2634 Boston MA USA June 2015

[112] Z Fan Z Wei S Wang and X Huang Bridging by WordImage Grounded Vocabulary Construction for Visual Cap-tioning Association for Computational LinguisticsStroudsburg PA USA 2019

16 Complexity

[113] Y Zhou Y Sun and V Honavar ldquoImproving image cap-tioning by leveraging knowledge graphsrdquo in Proceedings ofthe 2019 IEEE Winter Conference on Applications of Com-puter Vision (WACV) pp 283ndash293 Waikoloa Village HIUSA January 2019

[114] X Li and S Jiang ldquoKnow more say less image captioningbased on scene graphsrdquo IEEE Transactions on Multimediavol 21 no 8 pp 2117ndash2130 2019

[115] Q Wang and A B Chan ldquoDescribing like humans ondiversity in image captioningrdquo in Proceedings of the Con-ference on Computer Vision and Pattern Recognition LongBeach CA USA June 2019

[116] J Gu S R Joty J Cai H Zhao X Yang and G WangldquoUnpaired image captioning via scene graph alignmentsrdquo2019 httparxivorgabs190310658

[117] X Zhang QWang S Chen and X Li ldquoMulti-scale croppingmechanism for remote sensing image captioningrdquo in Pro-ceedings of the IEEE International Geoscience and RemoteSensing Symposium (IGARSS) pp 10039ndash10042 YokohamaJapan August 2019

[118] S Wang L Lan X Zhang G Dong and Z Luo ldquoCascadesemantic fusion for image captioningrdquo IEEE Access vol 7pp 66680ndash66688 2019

[119] Y Su Y Li N Xu and A Liu ldquoHierarchical deep neuralnetwork for image captioningrdquo Neural Processing Lettersvol 52 pp 1ndash11 2019

[120] M Yang W Zhao W Xu et al ldquoMultitask learning forcross-domain image captioningrdquo IEEE Transactions onMultimedia vol 21 no 4 pp 1047ndash1061 2019

[121] Z Zha D Liu H Zhang Y Zhang and F Wu ldquoContext-aware visual policy network for fine-grained image cap-tioningrdquo IEEE Transactions on Pattern Analysis andMachineIntelligence 2019

[122] S Sheng K Laenen and M Moens ldquoCan image captioninghelp passage retrieval in multimodal question answeringrdquo inProceedings of European Conference on Information Retrieval(ECIR) pp 94ndash101 Springer Cologne Germany April 2019

[123] N Yu X Hu B Song J Yang and J Zhang ldquoTopic-orientedimage captioning based on order-embeddingrdquo IEEETransactions on Image Processing vol 28 no 6 pp 2743ndash2754 2019

[124] S Sreedevi and S Sebastian ldquoContent based image retrievalbased on Database revisionrdquo in Proceedings of the Interna-tional Conference on Machine Vision and Image Processingpp 29ndash32 Taipei Taiwan December 2012

[125] E R Vimina and J K Poulose ldquoImage retrieval using colourand texture features of Regions of Interestrdquo in Proceedings ofthe International Conference on Information Retrieval andKnowledge Management pp 240ndash243 Kuala LumpurMalaysia December 2012

[126] V Ordonez G Kulkarni and T R Berg ldquoIm2text de-scribing images using 1 million captioned photographsrdquo inAdvances in Neural Information Processing Systemspp 1143ndash1151 Springer Berlin Germany 2011

[127] J R Curran S Clark and J Bos ldquoLinguistically motivatedlarge-scale NLP with C and C and boxerrdquo in Proceedings ofthe Forty Fifth Annual Meeting of the ACL on Inter-ActivePoster and Demonstration Sessions pp 33ndash36 PragueCzech June 2007

[128] D R Hardoon S R Szedmak J R Shawe-TaylorS Szedmak and J Shawe-Taylor ldquoCanonical correlationanalysis an overview with application to learning methodsrdquoNeural Computation vol 16 no 12 pp 2639ndash2664 2004

[129] F R Bach andM I Jordan ldquoKernel independent componentanalysisrdquo Journal of Machine Learning Research vol 3pp 1ndash48 2002

[130] J Wang Y Zhao Q Qi et al ldquoMindCamera interactivesketch-based image retrieval and synthesisrdquo IEEE Accessvol 6 pp 3765ndash3773 2018

[131] D Xu X Alameda-Pineda J Song E Ricci and N SebeldquoCross-paced representation learning with partial curriculafor sketch-based image retrievalrdquo IEEE Transactions onImage Processing vol 27 no 9 pp 4410ndash4421 2018

[132] K Song F Li F Long J Wang J Wang and Q LingldquoDiscriminative deep feature learning for semantic-basedimage retrievalrdquo IEEE Access vol 6 pp 44268ndash44280 2018

[133] P Kuznetsova V Ordonez A C Berg T Berg and Y ChoildquoCollective generation of natural image descriptionsrdquo Associ-ation for Computational Linguistics vol 1 pp 359ndash368 2012

[134] P Kuznetsova V Ordonez T L Berg and Y ChoildquoTREETALK composition and compression of trees forimage descriptionsrdquo Transactions of the Association forComputational Linguistics vol 2 no 10 pp 351ndash362 2014

[135] M Mitchell ldquoMidge generating image descriptions fromcomputer vision detectionsrdquo in Proceedings of the 13thConference of the European Chapter of the Association forComputational Linguistics pp 747ndash756 Avignon FranceApril 2012

[136] Y Yang C L Teo H Daume and Y Aloimonos ldquoCorpus-guided sentence generation of natural imagesrdquo in Proceed-ings of the Conference on Empirical Methods in NaturalLanguage Processing (EMNLP) pp 444ndash454 Portland ORUSA June 2011

[137] A Kojima M Izumi T Tamura and K Fukunaga ldquoGen-erating natural language description of human behaviorfrom video imagesrdquo in Proceedings of the ICPR 2000 vol 4pp 728ndash731 Barcelona Spain September 2000

[138] A Kojima T Tamura and K Fukunaga ldquonatural languagedescription of human activities from video images based onconcept hierarchy of actionsrdquo International Journal ofComputer Vision vol 50 no 2 pp 171ndash184 2002

[139] A Tariq andH Foroosh ldquoA context-driven extractive frameworkfor generating realistic image descriptionsrdquo IEEE Transactions onImage Processing vol 26 no 2 pp 619ndash632 2017

[140] O Vinyals A Toshev S Bengio and D Erhan ldquoShow andtell a neural image caption generatorrdquo 2014 httparxivorgabs14114555

[141] J M Johnson A Karpathy and L Fei-Fei ldquoDenseCap fullyconvolutional localization networks for dense captioningrdquo inProceedings of the 2016 IEEE Conference on Computer Visionand Pattern Recognition (CVPR) pp 4565ndash4574 Las VegasNV US June 2016

[142] X Li W Lan J Dong and H Liu ldquoAdding Chinese captionsto imagesrdquo in Proceedings of the 2016 ACM on InternationalConference on Multimedia Retrieval pp 271ndash275 New YorkNY USA June 2016

[143] A Karpathy and F Li ldquoDeep visual-semantic alignments forgenerating image descriptionsrdquo 2014 httparxivorgabs14122306

[144] R Krishna K Hata F Ren F Li and J C Niebles ldquoDense-captioning events in videosrdquo in Proceedings of the 2017 IEEEInternational Conference on Computer Vision (ICCV)pp 706ndash715 Venice Italy October 2017

[145] L Yang K D Tang J Yang and L Li ldquoDense captioning withjoint inference and visual contextrdquo in Proceedings of the 2017IEEE Conference on Computer Vision and Pattern Recognition(CVPR) pp 1978ndash1987 Honolulu HI USA July 2017

Complexity 17

[146] G Srivastava and R SrivastavaA Survey on Automatic ImageCaptioning International Conference on Mathematics andComputing Springer Berlin Germany 2018

[147] X Li X Song L Herranz Y Zhu and S Jiang ldquoImagecaptioning with both object and scene informationrdquo in ACMMultimedia Springer Berlin Germany 2016

[148] W Jiang L Ma Y Jiang W Liu and T Zhang ldquoRecurrentfusion network for image captioningrdquo in Proceedings of theECCV Munich Germany September 2018

[149] Q Wang and A B Chan ldquoCNN+CNN convolutional de-coders for image captioningrdquo 2018 httparxivorgabs180509019

[150] K Zheng C Zhu S Lu and Y Liu Multiple-Level Feature-Based Network for Image Captioning Springer BerlinGermany 2018

[151] X Li and Q Jin Improving Image Captioning by Concept-Based Sentence Reranking Springer Berlin Germany 2016

[152] T Karayil A Irfan F Raue J Hees and A Dengel Con-ditional GANs for Image Captioning with SentimentsICANN Los Angeles CA USA 2019

[153] A Vaswani N Shazeer N Parmar et al ldquoAttention is all youneedrdquo in Proceedings of the NIPS Long Beach CA USADecember 2017

[154] Z Yang D Yang C Dyer X He A J Smola andE H Hovy ldquoHierarchical attention networks for documentclassificationrdquo in Proceedings of the 2016 Conference of theNorth American Chapter of the Association for Computa-tional Linguistics Human Language Technologies San DiegoCA USA January 2016

[155] K Xu J Ba R Kiros et al ldquoShow attend and tell neuralimage caption generation with visual attentionrdquo in Pro-ceedings of the International Conference on MachineLearning Lille France July 2015

[156] D Bahdanau K Cho and Y Bengio ldquoNeural machinetranslation by jointly learning to align and translaterdquo 2015httparxivorgabs14090473

[157] Q You H Jin Z Wang C Fang and J Luo ldquoImagecaptioning with semantic attentionrdquo in Proceedings of the2016 IEEE Conference on Computer Vision and PatternRecognition (CVPR) pp 4651ndash4659 Las Vegas NV USAJune 2016

[158] Y Xiong B Du and P Yan ldquoReinforced transformer formedical image captioningrdquo in Proceedings of the MLMIMICCAI Shenzhen China October 2019

[159] S Wang H Mo Y Xu W Wu and Z Zhou ldquoIntra-imageregion context for image captioningrdquo PCM Springer BerlinGermany 2018

[160] L Pellegrin J A Vanegas J Arevalo et al ldquoA two-stepretrieval method for image captioningrdquo Lecture Notes inComputer Science pp 150ndash161 Springer Berlin Germany2016

[161] A Carraggi M Cornia L Baraldi and R Cucchiara ldquoVi-sual-semantic alignment across domains using a semi-su-pervised approachrdquo in Proceedings of the EuropeanConference on Computer Vision Workshops pp 625ndash640Munich Germany September 2018

[162] S Vajda D You S Antani and G -oma ldquoLarge imagemodality labeling initiative using semi-supervised and op-timized clusteringrdquo International Journal of MultimediaInformation Retrieval vol 4 no 2 pp 143ndash151 2015

[163] H M Castro L E Sucar and E F Morales ldquoAutomaticimage annotation using a semi-supervised ensemble ofclassifiersrdquo Lecture Notes in Computer Science vol 4756pp 487ndash495 Springer Berlin Germany 2007

[164] Y Xiao Z Zhu N Liu and Y Zhao ldquoAn interactive semi-supervised approach for automatic image annotationrdquo inProceedings of the Pacific-Rim Conference on Multimediapp 748ndash758 Singapore December 2012

[165] H Jhamtani and T Berg-Kirkpatrick ldquoLearning to describedifferences between pairs of similar imagesrdquo in Proceedingsof the Conference on Empirical Methods in Natural LanguageProcessing (EMNLP) Brussels Belgium October 2018

[166] A Oluwasanmi M U Aftab E Alabdulkreem B KumedaE Y Baagyere and Z Qin ldquoCaptionNet automatic end-to-end siamese difference captioning model with attentionrdquoIEEE Access vol 7 pp 106773ndash106783 2019

[167] A Oluwasanmi E Frimpong M U Aftab E Y BaagyereZ Qin and K Ullah ldquoFully convolutional CaptionNet si-amese difference captioning attention modelrdquo IEEE Accessvol 7 pp 175929ndash175939 2019

[168] M Hodosh P Young and J Hockenmaier ldquoFraming imagedescription as a ranking task data models and evaluationmetricsrdquo Journal of Artificial Intelligence Research vol 47pp 853ndash899 2013

[169] B A Plummer L Wang C M Cervantes J C CaicedoJ Hockenmaier and S Lazebnik ldquoFlickr30k entities col-lecting region-to-phrase correspondences for richer image-to-sentence modelsrdquo in Proceedings of the IEEE InternationalConference on Computer Vision pp 2641ndash2649 SantiagoChile December 2015

[170] R Krishna Y Zhu O Groth et al ldquoVisual genome con-necting language and vision using crowdsourced denseimage annotationsrdquo International Journal of Computer Vi-sion vol 123 no 1 pp 32ndash73 2017

[171] K Tran X He L Zhang and J Sun ldquoRich image captioningin the wildrdquo in Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition Workshopspp 434ndash441 Las Vegas NA USA July 2016

[172] V Bychkovsky S Paris E Chan and F Durand ldquoLearningphotographic global tonal adjustment with a database ofinputoutput image pairsrdquo Computer Vision and PatternRecognition (CVPR) vol 97 2011

[173] K Papineni S Roukos T Ward and W Zhu Bleu AMethod For Automatic Evaluation Of Machine Translationpp 311ndash318 Association for Computational LinguisticsStroudsburg PA USA 2001

[174] C Lin ROUGE A Package For Automatic Evaluation OfSummaries pp 74ndash81 Association for ComputationalLinguistics (ACL) Stroudsburg PA USA 2004

[175] S Banerjee and A Lavie ldquoMETEOR an automatic metric formt evaluation with improved correlation with humanjudgmentsrdquo in Proceedings Of gte Meeting Of gte Associ-ation For Computational Linguistics pp 65ndash72 Ann ArborMI USA June 2005

[176] R Vedantam C Zitnick and D Parikh ldquoCIDEr consensus-based image description evaluationrdquo in Proceedings Of gteComputer Vision and Pattern Recognition (CVPR)pp 4566ndash4575 Boston MA USA June 2015

[177] P Anderson B Fernando M Johnson and S Gould ldquoSpicesemantic propositional image caption evaluationrdquo in Proceed-ings Of gte European Conference on Computer Visionpp 382ndash398 Springer Amsterdam -e Netherlands October2016

[178] T Yao Y Pan Y Li Z Qiu and T Mei ldquoBoosting imagecaptioning with attributesrdquo in Proceedings Of gte IEEEInternational Conference on Computer Vision (ICCV)pp 4904ndash4912 Venice Italy October 2017

18 Complexity

[179] Q Wu C Shen P Wang A Dick and A van den HengelldquoImage captioning and visual question answering based onattributes and external knowledgerdquo IEEE Transactions onPattern Analysis and Machine Intelligence vol 40 no 6pp 1367ndash1381 2018

[180] N Xu A-A Liu J Liu et al ldquoScene graph captioner imagecaptioning based on structural visual representationrdquoJournal of Visual Communication and Image Representationvol 58 pp 477ndash485 2019

[181] Y H Tan C S Chan and C S Chan ldquoPhrase-based imagecaption generator with hierarchical LSTM networkrdquo Neu-rocomputing vol 333 pp 86ndash100 2019

[182] J H Tan C S Chan and J H Chuah ldquoCOMIC towards acompact image captioning model with attentionrdquo IEEETransactions on Multimedia vol 99 2019

[183] C He and H Hu ldquoImage captioning with text-based visualattentionrdquo Neural Processing Letters vol 49 no 1pp 177ndash185 2019

[184] Z Gan C Gan X He et al ldquoSemantic compositionalnetworks for visual captioningrdquo in Proceedings of the IEEEConference on Computer Vision and Pattern Recognition(CVPR) Honolulu HI USA June 2017

[185] J Gu G Wang J Cai and T Chen ldquoAn empirical study oflanguage cnn for image captioningrdquo in Proceedings of theInternational Conference on Computer Vision (ICCV)Venice Italy October 2017

[186] J Li M K Ebrahimpour A Moghtaderi and Y-Y YuldquoImage captioning with weakly-supervised attention pen-altyrdquo 2019 httparxivorgabs190302507

[187] X He B Yang X Bai and X Bai ldquoVD-SAN visual-denselysemantic attention network for image caption generationrdquoNeurocomputing vol 328 pp 48ndash55 2019

[188] D Zhao Z Chang S Guo Z Chang and S Guo ldquoAmultimodal fusion approach for image captioningrdquo Neu-rocomputing vol 329 pp 476ndash485 2019

[189] W Wang and H Hu ldquoImage captioning using region-basedattention joint with time-varying attentionrdquo Neural Pro-cessing Letters vol 13 2019

[190] J Lu C Xiong D Parikh and R Socher ldquoKnowing when tolook adaptive attention via A visual sentinel for image cap-tioningrdquo in Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition (CVPR) Honolulu HI USAJuly 2017

[191] Z Ren X Wang N Zhang X Lv and L Li ldquoDeep rein-forcement learning-based image captioning with embeddingrewardrdquo in Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition (CVPR) Honolulu HI USAJuly 2017

[192] L Gao X Li J Song and H T Shen ldquoHierarchical LSTMswith adaptive attention for visual captioningrdquo IEEE Trans-actions on Pattern Analysis and Machine Intelligence vol 992019

[193] K Fu J Jin R Cui F Sha and C Zhang ldquoAligning where tosee and what to tell image captioning with region-basedattention and scene-specific contextsrdquo IEEE Transactions onPattern Analysis and Machine Intelligence vol 39 2016

[194] X Jia E Gavves B Fernando and T Tuytelaars ldquoGuidingthe long-short term memory model for image captiongenerationrdquo in Proceedings of the IEEE in- TernationalConference on Computer Vision pp 2407ndash2415 SantiagoChile December 2015

[195] R Kiros R Zemel and R Salakhutdinov ldquoMultimodalneural language modelsrdquo in Proceedings of the InternationalConference on Machine Learning Beijing China June 2014

[196] Q Wu C Shen P Wang A Dick and A V Hengel ldquoImagecaptioning and visual question answering based on attributesand external knowledgerdquo IEEE Transactions on PatternAnalysis and Machine Intelligence vol 40 pp 1367ndash13812016

[197] J Mao W Xu Y Yang J Wang Z Huang and A YuilleldquoDeep captioning with multimodal recurrent neural net-worksrdquo in Proceedings of the International Conference onLearning Representation San Diego CA USA May 2015

[198] A Yuan X Li and X Lu ldquo3G structure for image captiongenerationrdquo Neurocomputing vol 330 pp 17ndash28 2019

[199] G J Brostow J Shotton J Fauqueur and R CipollaldquoSegmentation and recognition using structure from motionpoint cloudsrdquo in Proceedings of the ECCV pp 44ndash57Marseille France October 2008

Complexity 19

and video understanding alongside image captioning whichcombines computer vision with another branch of artificialintelligence called natural language processing (NLP) toderive sentence description of an image [12] Notwith-standing as with all other AI-related tasks a modern subsetof machine learning (ML) namely deep learning (DL) hasbeen the evolution of machine learning producing state-of-the-art results in almost all of the tasks compared to othertraditional algorithms such as decision trees naive Bayessupport vector machines (SVMs) ensembles and clusteringalgorithms [13ndash16]

Deep learning as a branch of machine learning useslayers of artificial neural networks to imitate the humanneural networks in decoding intuition from a large amountof data automatically [17] and is unlike other machinelearning algorithms which rely heavily on feature engi-neering utilizing domain knowledge in the creation offeature extractors [18] -e stacked layer of neural networksrepresents feature hierarchy as simple features at the initiallayers are reconstructed from one layer to another informing complex features [19] As a result the deepernetworks are computationally intensive to model and trainleading to the manufacture of more advanced computationalchips including Graphical Processing Units (GPUs) andTensor Processing Units (TPUs) [20 21] Presently severaldeep learning models exist and some of the most popularones include recurrent neural networks (RNNs) autoen-coders convolutional neural network (CNN) deep beliefnetworks (DBNs) and deep Boltzmann machine (DBM)[22ndash25] Among the most common deep learning algo-rithms the convolutional neural network is themost suitablefor analyzing visual imagery because of its shift and spaceinvariant characteristics taking advantage of hierarchicallearning in combining simpler patterns to form complexpatterns and structures [26] Using the shared weights ar-chitectural pattern of filters [27ndash29] each filter representsdifferent features of the input data which when summed canyield more complex structures [30ndash32]

In this paper our prime motivation is fixating on therecent deep learning segmentation techniques of both 2Dand 3D images using fully convolutional network andother high-level hierarchical feature extraction methodsas an integral component of computer vision -is isfurther expanded into the generation of captions forimages emerging as a subset of artificial intelligencersquosnatural language processing Furthermore we review thediscussed modelsrsquo accomplishments by comparing theirevaluation which indicates the most effective and effi-cient approaches for different tasks and challenges en-countered -is we believe is enlightening as it providesinsight for the further evolution of practical modeldesign

-is paper is organized as follows it introduces seg-mentation popular segmentation algorithms characteris-tics datasets and evaluation in Section 2 An introduction ofcaptioning and its various models are followed in Section 3alongside available datasets evaluation metrics and acomparative discussion of the models Finally Section 4concludes the paper with the overall summary of the typical

problems solutions and possible directions in semanticsegmentation and image captioning

2 Semantic Segmentation

Semantic segmentation relates to the process of pixel-levelclassification of images such that each pixel in an image isclassified into a distinguished class cluster [33] Since theinception of deep learning semantic segmentation has beena pivotal area of image processing and computer visionwhich has seen major research and application in severaldomains [34] Image segmentation recognizes boundariesbetween objects in an image by using line and curve seg-ments to categorize such objects while instance segmen-tation however classifies instances of all the available classesin an image such that all objects are identified as a separateentity All the same semantic segmentation differs fromordinary segmentation which on the one hand only ex-presses the partitioning of an image into clusters without atangible intuitive attempt at understanding the partitionedclusters or relating them with one another [35] Semanticsegmentation on the other hand as the name implies triesto describe semantically meaningful objects in an imagebased on their well-defined association and understanding[36] these differences are well depicted in Figure 1

21 Methods and Approaches

211 Traditional Methods During the pre-ANN era mostsegmentation and semantic segmentation approaches werepredominantly thresholding and clustering algorithmswhich are largely unsupervised methods In most casestraditional semantic segmentation methods consume lesstime for model computation Also most of the approachesrequire less data than the modern-day era of artificial neuralnetworks and deep learning -e simplest by all means isthe thresholding techniques which apply pixel intensity asthe criteria for distribution For binary segmentation asingle threshold value is required and pixels on both sides ofthe threshold are classified separately into two distinctclasses -ere are also advance forms of thresholding in-volving more classes and they are often grouped as histo-gram shape-based entropy-based object attribute-basedand spatial-based [37]

K-means clustering uses a predefined number of cen-troids to determine the number of clusters in which objectsare to be categorized -e centroids are randomly selected atthe beginning and then are iteratively adjusted by computingthe distance apart from other points in the dataset assigningeach point to the closest centroid [38] Fuzzy C-means(FCM) a technically advanced form of K-means allowsclassification of data points into many label classes based onthe level of membership [39] -is is of advantage in situ-ations where dataset texture overlaps or does not have a well-defined cluster [40] Gaussian mixture model (GMM) is alsooften used for both hard clustering and soft clustering byassigning the pixel to the component having the highestposterior probability [41] GMM assumes that the datarsquosGaussian distributions represent the number of clusters

2 Complexity






Boat

Person

Diningtable

(a) (b) (c)


Complexity 3







96

256384 384 256 4096 4096 21

21

Backwardlearning

Forwardinference


Segmentation gt


4 Complexity













score map


Atrous convolution



Complexity 5



Accuracy TP + TN

TP + TN + FP + FN

PA 1113936inii

1113936iti

(1)



mPA 1nc

1113944i

nii

ti

(2)


mIoU 1nc

1113936inii


(3)





3 Image Captioning



6 Complexity









Dataset Method mIoU

CamVid





PASCAL VOC

Wails [96] 559FCN8 [59] 622

PSP-CRF [97] 654Zoom Out [98] 696


Piecewise [102] 753

Cityscapes

FCN8 [59] 653DPN [101] 668

Dilation10 [103] 671LRR [104] 697

DeepLab2 [73] 704FRRN [105] 718


Complexity 7







(a)


(b)


(c)


(d)


(e)


(f )


(g)


(h)


(i)


(j)


(k)


(l)


8 Complexity








Complexity 9



Input image

Flow of gradients

ResNet +fasterRCNN

Featureextractor

Speakernetwork

Φ

A browncat


Comparewith

naturallanguage






Listenernetwork





Cat

Singlelabel Cat

Skateboard

Sequence

Labelcomplexity

Captioning

A catriding a

skateboard

Dense captioning

Orange spotted cat





10 Complexity















Complexity 11





(a) (b)




MS COCO





Flickr30K

hLSTMat [193] 738 551 403 294 23 666SGC [181] 615 421 286 193 182 399

RA+ SF [194] 0649 0462 0324 0224 0194 0472gLSTM [195] 0646 0446 0305 0206 0179 mdash




12 Complexity


4 Conclusion


Data Availability




Acknowledgments



References

















Complexity 13





































14 Complexity































Complexity 15


































16 Complexity


































Complexity 17


































18 Complexity






















Complexity 19






Boat

Person

Diningtable

(a) (b) (c)


Complexity 3







96

256384 384 256 4096 4096 21

21

Backwardlearning

Forwardinference


Segmentation gt


4 Complexity













score map


Atrous convolution



Complexity 5



Accuracy TP + TN

TP + TN + FP + FN

PA 1113936inii

1113936iti

(1)



mPA 1nc

1113944i

nii

ti

(2)


mIoU 1nc

1113936inii


(3)





3 Image Captioning



6 Complexity









Dataset Method mIoU

CamVid





PASCAL VOC

Wails [96] 559FCN8 [59] 622

PSP-CRF [97] 654Zoom Out [98] 696


Piecewise [102] 753

Cityscapes

FCN8 [59] 653DPN [101] 668

Dilation10 [103] 671LRR [104] 697

DeepLab2 [73] 704FRRN [105] 718


Complexity 7







(a)


(b)


(c)


(d)


(e)


(f )


(g)


(h)


(i)


(j)


(k)


(l)


8 Complexity








Complexity 9



Input image

Flow of gradients

ResNet +fasterRCNN

Featureextractor

Speakernetwork

Φ

A browncat


Comparewith

naturallanguage






Listenernetwork





Cat

Singlelabel Cat

Skateboard

Sequence

Labelcomplexity

Captioning

A catriding a

skateboard

Dense captioning

Orange spotted cat





10 Complexity















Complexity 11





(a) (b)




MS COCO





Flickr30K

hLSTMat [193] 738 551 403 294 23 666SGC [181] 615 421 286 193 182 399

RA+ SF [194] 0649 0462 0324 0224 0194 0472gLSTM [195] 0646 0446 0305 0206 0179 mdash




12 Complexity


4 Conclusion


Data Availability




Acknowledgments



References

















Complexity 13





































14 Complexity































Complexity 15


































16 Complexity


































Complexity 17


































18 Complexity






















Complexity 19







96

256384 384 256 4096 4096 21

21

Backwardlearning

Forwardinference


Segmentation gt


4 Complexity













score map


Atrous convolution



Complexity 5



Accuracy TP + TN

TP + TN + FP + FN

PA 1113936inii

1113936iti

(1)



mPA 1nc

1113944i

nii

ti

(2)


mIoU 1nc

1113936inii


(3)





3 Image Captioning



6 Complexity









Dataset Method mIoU

CamVid





PASCAL VOC

Wails [96] 559FCN8 [59] 622

PSP-CRF [97] 654Zoom Out [98] 696


Piecewise [102] 753

Cityscapes

FCN8 [59] 653DPN [101] 668

Dilation10 [103] 671LRR [104] 697

DeepLab2 [73] 704FRRN [105] 718


Complexity 7







(a)


(b)


(c)


(d)


(e)


(f )


(g)


(h)


(i)


(j)


(k)


(l)


8 Complexity








Complexity 9



Input image

Flow of gradients

ResNet +fasterRCNN

Featureextractor

Speakernetwork

Φ

A browncat


Comparewith

naturallanguage






Listenernetwork





Cat

Singlelabel Cat

Skateboard

Sequence

Labelcomplexity

Captioning

A catriding a

skateboard

Dense captioning

Orange spotted cat





10 Complexity















Complexity 11





(a) (b)




MS COCO





Flickr30K

hLSTMat [193] 738 551 403 294 23 666SGC [181] 615 421 286 193 182 399

RA+ SF [194] 0649 0462 0324 0224 0194 0472gLSTM [195] 0646 0446 0305 0206 0179 mdash




12 Complexity


4 Conclusion


Data Availability




Acknowledgments



References

















Complexity 13





































14 Complexity































Complexity 15


































16 Complexity


































Complexity 17


































18 Complexity






















Complexity 19













score map


Atrous convolution



Complexity 5



Accuracy TP + TN

TP + TN + FP + FN

PA 1113936inii

1113936iti

(1)



mPA 1nc

1113944i

nii

ti

(2)


mIoU 1nc

1113936inii


(3)





3 Image Captioning



6 Complexity









Dataset Method mIoU

CamVid





PASCAL VOC

Wails [96] 559FCN8 [59] 622

PSP-CRF [97] 654Zoom Out [98] 696


Piecewise [102] 753

Cityscapes

FCN8 [59] 653DPN [101] 668

Dilation10 [103] 671LRR [104] 697

DeepLab2 [73] 704FRRN [105] 718


Complexity 7







(a)


(b)


(c)


(d)


(e)


(f )


(g)


(h)


(i)


(j)


(k)


(l)


8 Complexity








Complexity 9



Input image

Flow of gradients

ResNet +fasterRCNN

Featureextractor

Speakernetwork

Φ

A browncat


Comparewith

naturallanguage






Listenernetwork





Cat

Singlelabel Cat

Skateboard

Sequence

Labelcomplexity

Captioning

A catriding a

skateboard

Dense captioning

Orange spotted cat





10 Complexity















Complexity 11





(a) (b)




MS COCO





Flickr30K

hLSTMat [193] 738 551 403 294 23 666SGC [181] 615 421 286 193 182 399

RA+ SF [194] 0649 0462 0324 0224 0194 0472gLSTM [195] 0646 0446 0305 0206 0179 mdash




12 Complexity


4 Conclusion


Data Availability




Acknowledgments



References

















Complexity 13





































14 Complexity































Complexity 15


































16 Complexity


































Complexity 17


































18 Complexity






















Complexity 19



Accuracy TP + TN

TP + TN + FP + FN

PA 1113936inii

1113936iti

(1)



mPA 1nc

1113944i

nii

ti

(2)


mIoU 1nc

1113936inii


(3)





3 Image Captioning



6 Complexity









Dataset Method mIoU

CamVid





PASCAL VOC

Wails [96] 559FCN8 [59] 622

PSP-CRF [97] 654Zoom Out [98] 696


Piecewise [102] 753

Cityscapes

FCN8 [59] 653DPN [101] 668

Dilation10 [103] 671LRR [104] 697

DeepLab2 [73] 704FRRN [105] 718


Complexity 7







(a)


(b)


(c)


(d)


(e)


(f )


(g)


(h)


(i)


(j)


(k)


(l)


8 Complexity








Complexity 9



Input image

Flow of gradients

ResNet +fasterRCNN

Featureextractor

Speakernetwork

Φ

A browncat


Comparewith

naturallanguage






Listenernetwork





Cat

Singlelabel Cat

Skateboard

Sequence

Labelcomplexity

Captioning

A catriding a

skateboard

Dense captioning

Orange spotted cat





10 Complexity















Complexity 11





(a) (b)




MS COCO





Flickr30K

hLSTMat [193] 738 551 403 294 23 666SGC [181] 615 421 286 193 182 399

RA+ SF [194] 0649 0462 0324 0224 0194 0472gLSTM [195] 0646 0446 0305 0206 0179 mdash




12 Complexity


4 Conclusion


Data Availability




Acknowledgments



References

















Complexity 13





































14 Complexity































Complexity 15


































16 Complexity


































Complexity 17


































18 Complexity






















Complexity 19









Dataset Method mIoU

CamVid





PASCAL VOC

Wails [96] 559FCN8 [59] 622

PSP-CRF [97] 654Zoom Out [98] 696


Piecewise [102] 753

Cityscapes

FCN8 [59] 653DPN [101] 668

Dilation10 [103] 671LRR [104] 697

DeepLab2 [73] 704FRRN [105] 718


Complexity 7







(a)


(b)


(c)


(d)


(e)


(f )


(g)


(h)


(i)


(j)


(k)


(l)


8 Complexity








Complexity 9



Input image

Flow of gradients

ResNet +fasterRCNN

Featureextractor

Speakernetwork

Φ

A browncat


Comparewith

naturallanguage






Listenernetwork





Cat

Singlelabel Cat

Skateboard

Sequence

Labelcomplexity

Captioning

A catriding a

skateboard

Dense captioning

Orange spotted cat





10 Complexity















Complexity 11





(a) (b)




MS COCO





Flickr30K

hLSTMat [193] 738 551 403 294 23 666SGC [181] 615 421 286 193 182 399

RA+ SF [194] 0649 0462 0324 0224 0194 0472gLSTM [195] 0646 0446 0305 0206 0179 mdash




12 Complexity


4 Conclusion


Data Availability




Acknowledgments



References

















Complexity 13





































14 Complexity































Complexity 15


































16 Complexity


































Complexity 17


































18 Complexity






















Complexity 19







(a)


(b)


(c)


(d)


(e)


(f )


(g)


(h)


(i)


(j)


(k)


(l)


8 Complexity








Complexity 9



Input image

Flow of gradients

ResNet +fasterRCNN

Featureextractor

Speakernetwork

Φ

A browncat


Comparewith

naturallanguage






Listenernetwork





Cat

Singlelabel Cat

Skateboard

Sequence

Labelcomplexity

Captioning

A catriding a

skateboard

Dense captioning

Orange spotted cat





10 Complexity















Complexity 11





(a) (b)




MS COCO





Flickr30K

hLSTMat [193] 738 551 403 294 23 666SGC [181] 615 421 286 193 182 399

RA+ SF [194] 0649 0462 0324 0224 0194 0472gLSTM [195] 0646 0446 0305 0206 0179 mdash




12 Complexity


4 Conclusion


Data Availability




Acknowledgments



References

















Complexity 13





































14 Complexity































Complexity 15


































16 Complexity


































Complexity 17


































18 Complexity






















Complexity 19








Complexity 9



Input image

Flow of gradients

ResNet +fasterRCNN

Featureextractor

Speakernetwork

Φ

A browncat


Comparewith

naturallanguage






Listenernetwork





Cat

Singlelabel Cat

Skateboard

Sequence

Labelcomplexity

Captioning

A catriding a

skateboard

Dense captioning

Orange spotted cat





10 Complexity















Complexity 11





(a) (b)




MS COCO





Flickr30K

hLSTMat [193] 738 551 403 294 23 666SGC [181] 615 421 286 193 182 399

RA+ SF [194] 0649 0462 0324 0224 0194 0472gLSTM [195] 0646 0446 0305 0206 0179 mdash




12 Complexity


4 Conclusion


Data Availability




Acknowledgments



References

















Complexity 13





































14 Complexity































Complexity 15


































16 Complexity


































Complexity 17


































18 Complexity






















Complexity 19



Input image

Flow of gradients

ResNet +fasterRCNN

Featureextractor

Speakernetwork

Φ

A browncat


Comparewith

naturallanguage






Listenernetwork





Cat

Singlelabel Cat

Skateboard

Sequence

Labelcomplexity

Captioning

A catriding a

skateboard

Dense captioning

Orange spotted cat





10 Complexity















Complexity 11





(a) (b)




MS COCO





Flickr30K

hLSTMat [193] 738 551 403 294 23 666SGC [181] 615 421 286 193 182 399

RA+ SF [194] 0649 0462 0324 0224 0194 0472gLSTM [195] 0646 0446 0305 0206 0179 mdash




12 Complexity


4 Conclusion


Data Availability




Acknowledgments



References

















Complexity 13





































14 Complexity































Complexity 15


































16 Complexity


































Complexity 17


































18 Complexity






















Complexity 19















Complexity 11





(a) (b)




MS COCO





Flickr30K

hLSTMat [193] 738 551 403 294 23 666SGC [181] 615 421 286 193 182 399

RA+ SF [194] 0649 0462 0324 0224 0194 0472gLSTM [195] 0646 0446 0305 0206 0179 mdash




12 Complexity


4 Conclusion


Data Availability




Acknowledgments



References

















Complexity 13





































14 Complexity































Complexity 15


































16 Complexity


































Complexity 17


































18 Complexity






















Complexity 19





(a) (b)




MS COCO





Flickr30K

hLSTMat [193] 738 551 403 294 23 666SGC [181] 615 421 286 193 182 399

RA+ SF [194] 0649 0462 0324 0224 0194 0472gLSTM [195] 0646 0446 0305 0206 0179 mdash




12 Complexity


4 Conclusion


Data Availability




Acknowledgments



References

















Complexity 13





































14 Complexity































Complexity 15


































16 Complexity


































Complexity 17


































18 Complexity






















Complexity 19


4 Conclusion


Data Availability




Acknowledgments



References

















Complexity 13





































14 Complexity































Complexity 15


































16 Complexity


































Complexity 17


































18 Complexity






















Complexity 19





































14 Complexity































Complexity 15


































16 Complexity


































Complexity 17


































18 Complexity






















Complexity 19































Complexity 15


































16 Complexity


































Complexity 17


































18 Complexity






















Complexity 19


































16 Complexity


































Complexity 17


































18 Complexity






















Complexity 19


































Complexity 17


































18 Complexity






















Complexity 19


































18 Complexity






















Complexity 19






















Complexity 19

features to text: a comprehensive survey of deep learning

Documents