A survey on automatic image captioning using deep learning

S. Sindu 1* and Dr. R. Kousalya 2

1* Research Scholar, Dept. of Computer Science, Dr. N.G.P. Arts and Science College, Coimbatore, India

[email protected] * Corresponding Author: [email protected]

2 Professor and Head, Dept. of Computer Applications, Dr. N.G.P. Arts and Science College, Coimbatore, India

Abstract

Image captioning captures the semantic information of an image and expresses it through natural language generation. Automatic image description generation is a challenging problem that has recently received a large amount of interest from the computer vision and natural language processing communities. Image captioning requires a high-level understanding of the semantic content of the image, expressed in sentences similar to those a human would write. In content based image retrieval, image captioning helps to minimize the semantic gap: the text output of the captioning model captures the features of the image more accurately. In deep learning, image captioning uses a convolutional neural network to extract visual features from the image and a recurrent neural network to decode these features into a sentence. This paper presents a survey of image captioning models built with deep learning techniques and of their role in content based image retrieval.

Keywords: Image captioning, Convolutional neural network, Recurrent neural network, Natural language processing

1. Introduction

Large numbers of images are generated from sources such as social media, medical imaging and e-commerce. Image captioning [1] requires recognizing the important objects in an image, their attributes, and the relationships between them. Automatic caption generation requires both image understanding and natural language generation, as shown in Fig. 1; it therefore bridges the computer vision and natural language processing communities. Computer vision is the process of enabling computers to visualize an image as humans do, by recognizing and processing the image.

Content Based Image Retrieval (CBIR), an application of computer vision, is the mechanism by which a system retrieves images from an image collection according to the visual contents of the query image. CBIR automatically indexes images by extracting their low-level visual features, such as shape, color, and texture, and these indexed features are solely responsible for the retrieval of images [2].
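To make the indexing step concrete, the following minimal Python sketch (not taken from any of the surveyed systems) indexes images by a low-level color histogram and answers a query by nearest-neighbour search over those histograms; the file names and histogram parameters are illustrative assumptions.

# Minimal CBIR sketch: index images by a low-level color histogram and
# retrieve the nearest neighbours of a query image. Paths are placeholders.
import numpy as np
from PIL import Image

def color_histogram(path, bins=8):
    """Return a normalized RGB color histogram (bins^3 dims) for one image."""
    img = np.asarray(Image.open(path).convert("RGB").resize((128, 128)))
    hist, _ = np.histogramdd(img.reshape(-1, 3), bins=(bins,) * 3,
                             range=((0, 256),) * 3)
    hist = hist.flatten()
    return hist / hist.sum()

def retrieve(query_path, index_paths, top_k=5):
    """Rank indexed images by Euclidean distance to the query histogram."""
    query = color_histogram(query_path)
    index = np.stack([color_histogram(p) for p in index_paths])
    dists = np.linalg.norm(index - query, axis=1)
    order = np.argsort(dists)[:top_k]
    return [(index_paths[i], float(dists[i])) for i in order]

# Example usage (hypothetical file names):
# print(retrieve("query.jpg", ["a.jpg", "b.jpg", "c.jpg"]))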


Fig 1. Automatic Image Captioning

The natural language processing [3,4] task of natural language generation (NLG) takes a non-linguistic representation, in this case an image representation (e.g., a list of objects and their spatial relationships), and turns it into human-readable text, e.g., a sentence in a natural language. Generating text involves the following stages, illustrated by the sketch after the list:

- Content selection: deciding which aspects of the input to talk about.
- Text planning: deciding how to organize the content.
- Surface realization: verbalizing the content, which includes lexicalization (choosing the right word and surface form), referential expression generation (using pronouns whenever appropriate), and grouping of related information.
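The toy Python sketch below illustrates these three stages on an invented non-linguistic input of (subject, relation, object) triples; the triples and the sentence template are assumptions made purely for illustration, not taken from the surveyed systems.

# Toy NLG pipeline: content selection -> text planning -> surface realization.
# The input triples and the template are invented for illustration only.

def content_selection(triples, max_facts=2):
    """Keep only the most salient facts (here: simply the first few)."""
    return triples[:max_facts]

def text_planning(facts):
    """Order the facts; a real planner would group related information."""
    return sorted(facts, key=lambda t: t[1])  # order by relation name

def surface_realization(facts):
    """Verbalize each (subject, relation, object) triple with a template."""
    clauses = [f"a {s} is {r} a {o}" for s, r, o in facts]
    return (" and ".join(clauses)).capitalize() + "."

triples = [("dog", "beside", "bench"), ("man", "holding", "leash")]
print(surface_realization(text_planning(content_selection(triples))))
# -> "A dog is beside a bench and a man is holding a leash."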

2. Survey on image captioning models

Describing the main event of an image involves identifying the objects depicted and predicting the relationships between them [5]. The authors introduce visual dependency representations to capture the relationships between the objects in an image, and hypothesize that this representation can improve image description. The hypothesis is tested using a new data set of region-annotated images associated with visual dependency representations and gold-standard descriptions. Two template-based description generation models that operate over visual dependency representations are used. The authors show that visual dependency representations can be induced automatically using a standard dependency parser, and that the descriptions generated from the induced representations are as good as the ones generated from gold-standard representations.

In [6], the authors present a system that automatically generates natural language descriptions from images by exploiting both statistics obtained from parsing large quantities of text data and recognition algorithms from computer vision. It generates descriptions that are notably more true to the specific image content. A Conditional Random Field (CRF) is used to predict the best labeling for an image. A fully automatic system that generates natural language descriptions for images is demonstrated, and human evaluation validates the quality of the generated sentences.


The authors also automatically mine and parse large text collections to obtain statistical models of visually descriptive language.

Image annotation is frequently used in image base management. In [7], the authors discuss a method for automatic image description in natural language. The method relies on image indexation together with natural language processing and generation, using image processing for segmentation and indexing. The combination of these techniques with natural language processing generates a coherent and rich description in natural language, which corresponds to what a user might find more intuitive. Furthermore, indexing with natural language sentences improves the quality of the results by reducing the ambiguity present in a keyword index.

2.1 Survey on neural network based image captioning models

Xinlei Chen and Lawrence Zitnick [8] explored the bi-directional mapping between images and their sentence-based descriptions. A recurrent neural network is used to dynamically build a visual representation of the scene as a caption is generated or read. The representation automatically learns to remember long-term visual concepts. Their model is capable of both generating novel captions given an image and reconstructing visual features given an image description. The task is evaluated for sentence generation, sentence retrieval and image retrieval. When compared to human generated captions, the automatically generated captions are equal to or preferred by humans 21.0% of the time.

A deep neural network based image caption generation method is analysed systematically in [9]. In their work, the CNN part is replaced with three state-of-the-art architectures, VGGNet, AlexNet and GoogLeNet, and VGGNet performs best according to the BLEU score. A simplified version of the Gated Recurrent Unit (GRU) is proposed as a new recurrent layer and compared with the LSTM; it has fewer parameters, which saves memory and makes training faster. Multiple sentences are generated using beam search. The experiments show that the modified method can generate captions comparable to state-of-the-art methods with less training memory.

Oriol Vinyals et al. [10] presented a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation to generate natural sentences describing an image. The authors presented NIC, an end-to-end neural network system that can automatically view an image and generate a reasonable description in plain English. NIC is based on a convolutional neural network that encodes an image into a compact representation, followed by a recurrent neural network that generates a corresponding sentence. The model is trained to maximize the likelihood of the target description sentence given the training image. The performance metric used is BLEU, a metric from machine translation that evaluates the quality of generated sentences.
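A minimal PyTorch sketch of this encoder-decoder design is given below; it is not the authors' released implementation, and the backbone choice, vocabulary size, embedding size and hidden size are assumptions made for illustration.

# Sketch of a CNN encoder + LSTM decoder captioner in PyTorch (illustrative
# hyperparameters; not the original NIC implementation).
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    def __init__(self, embed_size=256):
        super().__init__()
        # Weights omitted here; in practice a pretrained CNN would be used.
        backbone = models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.fc = nn.Linear(backbone.fc.in_features, embed_size)

    def forward(self, images):                          # (B, 3, H, W)
        feats = self.cnn(images).flatten(1)             # (B, 2048)
        return self.fc(feats)                           # (B, embed_size)

class DecoderRNN(nn.Module):
    def __init__(self, vocab_size=10000, embed_size=256, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, img_embedding, captions):
        # The image embedding is fed as the first "word" of the sequence.
        word_emb = self.embed(captions[:, :-1])                  # (B, T-1, E)
        inputs = torch.cat([img_embedding.unsqueeze(1), word_emb], dim=1)
        hidden, _ = self.lstm(inputs)                            # (B, T, H)
        return self.fc(hidden)                                   # word logits

# Training minimizes the cross-entropy between these logits and the reference
# caption, i.e. it maximizes the likelihood of the sentence given the image.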

In [11], the authors discuss complex objects with multiple labels represented by multiple modal representations; for example, complex articles contain both text and image information and carry multiple annotations. In their work a novel Multi-modal Multi-instance Multi-label Deep Network (M3DN) is proposed, which learns the label prediction and exploits label correlation simultaneously based on Optimal Transport, by considering the consistency principle between the bag-level predictions of different modalities and the learned latent ground label metric. Experiments on benchmark datasets and the real-world WKG Game-Hub dataset validate the effectiveness of the proposed method.

Qi Wu et al. [12] addressed a method of incorporating high-level concepts into the successful CNN-RNN approach, and showed that it achieves a significant improvement over the state of the art in both image captioning and visual question answering. They also showed that the same mechanism can be used to incorporate external knowledge, which is critically important for answering high-level visual questions. A visual question answering model is designed that combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions. In particular, it allows questions to be asked where the image alone does not contain the information required to select the appropriate answer.

DeepSeek, a natural language processing based deep learning model, is presented in [13]; it allows users to enter a description of the kind of images they want to search for. Based on the query, the system retrieves all the images that semantically and contextually relate to it. The authors use ResNet-101 as the feature extraction backbone, initialize the network with weights pretrained on an MS-COCO object detection task, and then fine-tune it for caption generation on the MS-COCO dataset. Once the captions are generated, a skip-thought model converts them into vector embeddings; the same is done for the query provided by the user. Images are retrieved by minimizing the L2 distance between the two vectors.
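The retrieval step described in [13] reduces to a nearest-neighbour search over sentence embeddings; the sketch below illustrates the idea, with the embed function standing in for the skip-thought encoder (any sentence encoder could be substituted) and the captions and query invented for the example.

# Sketch of caption-based retrieval: embed the generated captions and the
# query, then return the images whose caption embedding is closest in L2
# distance. `embed` is only a stand-in for the skip-thought encoder of [13].
import numpy as np

def embed(sentence, dim=64):
    """Placeholder sentence encoder: a deterministic hash-seeded vector.
    Replace with a real encoder (skip-thought, sentence transformer, ...)."""
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    return rng.standard_normal(dim)

def search(query, image_captions, top_k=3):
    """image_captions: dict mapping image id -> generated caption."""
    q = embed(query)
    scored = [(img, float(np.linalg.norm(embed(cap) - q)))
              for img, cap in image_captions.items()]
    return sorted(scored, key=lambda x: x[1])[:top_k]

captions = {"img1.jpg": "a dog running on the beach",
            "img2.jpg": "two people riding bicycles",
            "img3.jpg": "a dog catching a frisbee"}
print(search("dog playing outside", captions))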

Jia Xu et al. [14] focused on the problem of image caption generation using an extension of the long short-term memory (LSTM) model called gLSTM. Semantic information extracted from the image is added as extra input to each unit of the LSTM block, with the aim of guiding the model towards solutions that are more tightly coupled to the image content. Different length normalization strategies for beam search are analyzed to avoid a bias towards short sentences. Benchmark datasets such as Flickr8K, Flickr30K and MS COCO are used. They show that the model can better stay "on track", describing the image content without drifting away to unrelated yet common phrases. Key aspects of other methods, such as attention mechanisms or model ensembles, are not used in their model.
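Length normalization in beam search simply rescales a candidate's accumulated log-probability by a function of its length so that longer sentences are not systematically penalised. The sketch below shows one common variant (dividing by the length raised to a power alpha), with the decoder's next-word distribution abstracted as a step_log_probs callback; it illustrates the general idea rather than the exact strategies compared in [14].

# Toy beam search with length normalization. `step_log_probs` is a stand-in
# for the decoder's next-word log-probability function; tokens are invented.
def beam_search(step_log_probs, vocab, eos="</s>", beam=3, max_len=10, alpha=0.7):
    beams = [(["<s>"], 0.0)]                        # (tokens, total log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for word, lp in step_log_probs(tokens, vocab):
                new = (tokens + [word], score + lp)
                (finished if word == eos else candidates).append(new)
        if not candidates:
            break
        candidates.sort(key=lambda b: b[1], reverse=True)
        beams = candidates[:beam]                   # keep the best partial beams
    finished.extend(beams)
    # Length-normalized score: divide the log-prob by len(tokens)^alpha so the
    # search does not systematically prefer short sentences.
    return max(finished, key=lambda b: b[1] / (len(b[0]) ** alpha))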

Rahul Singh and Aayush Sharma [15] developed a framework using artificial neural networks to caption an image based on its significant features. Recurrent neural networks (RNNs) are used as encoder-decoder frameworks in machine translation; their objective is to replace the encoder part with a convolutional neural network (CNN), transforming the image into relevant input data to feed into the decoder of the RNN. The image is converted into a multi-feature dataset characterizing its distinctive features. The analysis is carried out on the popular Flickr8K dataset. Image labeling and automatic machine translation are combined into an end-to-end hybrid neural network system. The developed model is capable of autonomously viewing an image and generating a reasonable description in natural language with reasonable accuracy and naturalness.


2.2 Performance metrics used in image captioning models

The performance metrics commonly used to evaluate image captioning models are CIDEr, METEOR, ROUGE and BLEU.

BLEU (Bilingual Evaluation Understudy) is a metric for evaluating a generated sentence against a reference sentence [16]. The BLEU value ranges from 0 to 1. The primary programming task for a BLEU implementor is to compare n-grams of the candidate with the n-grams of the reference translation and count the number of matches. These matches are position independent; the more matches, the better the candidate translation.
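As an illustration, the snippet below computes BLEU with NLTK's reference implementation on two invented token sequences; smoothing is applied because short sentences often have zero higher-order n-gram matches.

# BLEU compares candidate n-grams with reference n-grams (position
# independent) and combines the modified precisions; sentences are invented.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["a", "dog", "is", "running", "on", "the", "beach"]
candidate = ["a", "dog", "runs", "on", "the", "beach"]

score = sentence_bleu([reference], candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),     # BLEU-4
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.3f}")      # value in [0, 1]; higher means more overlap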

CIDEr (Consensus-based Image Description Evaluation) measures the similarity of a generated sentence to a set of ground-truth sentences written by humans [17]. The metric shows high agreement with consensus as assessed by humans. Through sentence similarity, the notions of grammaticality, saliency, importance and accuracy (precision and recall) are inherently captured. Given an image and a collection of human-generated reference sentences describing it, the goal of the consensus-based protocol is to measure how similar a candidate sentence is to the way most people describe the image (i.e., to the reference sentences).
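The sketch below is a greatly simplified, illustrative variant of the CIDEr idea: n-grams of the candidate and the references are weighted by TF-IDF over a reference corpus and compared by cosine similarity, averaged over n-gram orders. It omits the per-image document frequencies and the penalties of the official CIDEr-D implementation, and all sentences are invented.

# Greatly simplified CIDEr-style score: TF-IDF weighted n-gram vectors for the
# candidate and each reference, averaged cosine similarity over n = 1..4.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def cider_like(candidate, references, corpus, max_n=4):
    score = 0.0
    for n in range(1, max_n + 1):
        # Document frequency of each n-gram across the reference corpus.
        df = Counter(g for sent in corpus for g in ngrams(sent, n))
        def tfidf(tokens):
            counts = ngrams(tokens, n)
            return {g: c * math.log(len(corpus) / df[g])
                    for g, c in counts.items() if g in df}
        cand = tfidf(candidate)
        sims = []
        for ref in references:
            r = tfidf(ref)
            dot = sum(cand[g] * r[g] for g in cand if g in r)
            norm = math.sqrt(sum(v * v for v in cand.values())) * \
                   math.sqrt(sum(v * v for v in r.values()))
            sims.append(dot / norm if norm else 0.0)
        score += sum(sims) / len(sims)
    return score / max_n

refs = [["a", "dog", "plays", "on", "the", "beach"],
        ["a", "dog", "is", "running", "on", "the", "sand"]]
corpus = refs + [["two", "people", "ride", "bicycles"]]
print(cider_like(["a", "dog", "runs", "on", "the", "beach"], refs, corpus))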

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [18] automatically determines the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units, such as n-grams, word sequences, and word pairs, between the computer-generated summary being evaluated and the ideal summaries created by humans. To assess the effectiveness of the ROUGE measures, the correlation between ROUGE-assigned summary scores and human-assigned summary scores is computed.
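A simplified ROUGE-N computation, counting how many reference n-grams reappear in the candidate, is sketched below on invented sentences.

# Simplified ROUGE-N: recall of reference n-grams that also appear in the
# candidate (with clipped counts); example sentences are invented.
from collections import Counter

def rouge_n(candidate, reference, n=2):
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(c, cand[g]) for g, c in ref.items())
    return overlap / max(sum(ref.values()), 1)

reference = "a dog is running on the beach".split()
candidate = "a dog runs on the beach".split()
print(f"ROUGE-2 recall: {rouge_n(candidate, reference):.3f}")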

METEOR (Metric for Evaluation of Translation with Explicit ORdering) is an automatic metric for machine translation evaluation based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations [19]. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can easily be extended to include more advanced matching strategies.
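The core of METEOR can be illustrated with exact unigram matching and its recall-weighted harmonic mean; the sketch below omits the stemmed/synonym matching stages and the fragmentation penalty of the full metric, and the sentences are invented.

# Simplified METEOR core: exact unigram matching only, combined with the
# recall-weighted harmonic mean; stemming, synonyms and the fragmentation
# penalty of the full metric are omitted.
from collections import Counter

def meteor_core(candidate, reference, alpha=0.9):
    matches = sum((Counter(candidate) & Counter(reference)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(candidate)
    recall = matches / len(reference)
    # METEOR weights recall more heavily than precision.
    return precision * recall / (alpha * precision + (1 - alpha) * recall)

reference = "a dog is running on the beach".split()
candidate = "a dog runs on the beach".split()
print(f"simplified METEOR: {meteor_core(candidate, reference):.3f}")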

3. Image captioning for content based image retrieval

Image captioning is important for many reasons. For example, captions can be used for automatic image indexing. Image indexing is important for Content-Based Image Retrieval (CBIR) and can therefore be applied to many areas, including biomedicine, commerce, the military, education, digital libraries, and web search. Social media platforms such as Facebook and Twitter can directly generate descriptions from images, which can include the place (e.g., temple, school) together with what the persons in the image are doing.

In [20], content-based image retrieval using effective neural network-based solutions is proposed. The input to the algorithm is a collection of raw images in which the user would like to search, and a query sentence meant to describe the desired image. The output of the algorithm is a list of top images that are relevant to the query sentence. A recurrent neural network is trained to obtain a representation of the sentence that is properly aligned with the corresponding image features in a shared high-dimensional space. Images are found by nearest-neighbour search in that shared space.
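One common way to obtain such an alignment is a bidirectional max-margin ranking loss over cosine similarities between image and sentence embeddings; the hedged sketch below shows this standard formulation, which is not necessarily the exact objective used in [20], and the encoders and dimensions are placeholders.

# Sketch of aligning sentence and image embeddings in a shared space with a
# bidirectional max-margin ranking loss (a common choice; illustrative only).
import torch
import torch.nn.functional as F

def ranking_loss(img_emb, sent_emb, margin=0.2):
    """img_emb, sent_emb: (B, D) embeddings of matching image/sentence pairs."""
    img_emb = F.normalize(img_emb, dim=1)
    sent_emb = F.normalize(sent_emb, dim=1)
    scores = img_emb @ sent_emb.t()                    # (B, B) cosine similarities
    pos = scores.diag().unsqueeze(1)                   # similarities of true pairs
    # Hinge on every mismatched pair, in both retrieval directions.
    cost_s = (margin + scores - pos).clamp(min=0)      # image -> wrong sentence
    cost_i = (margin + scores - pos.t()).clamp(min=0)  # sentence -> wrong image
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_s, cost_i = cost_s.masked_fill(mask, 0), cost_i.masked_fill(mask, 0)
    return cost_s.mean() + cost_i.mean()

# At retrieval time the query sentence is embedded with the trained RNN and
# its nearest image embeddings in this shared space are returned.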

Image retrieval based only on image features emphasizes visual similarity and does not capture the semantic similarity between images. In order to capture semantic similarity, textual data associated with images can be very useful. In [21] the authors demonstrate that the semantics of an image, while poorly captured by the image alone, can be captured by text that accompanies the image. These semantics include artistic feel and socio-cultural events. Semantics are captured by modeling the topics generated by the accompanying text, referred to as captions, while visual features are extracted with a deep convolutional network. Adding an autoencoder to the network and fine-tuning the weights reduces computational requirements and can also improve the quality of the retrieval results. Image-only and text-only models are combined to obtain a joint model that extracts similarity captured in the image, the text data, or both.
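A hedged sketch of this joint text/image idea is given below: topic proportions from the captions (here obtained with LDA from scikit-learn) are concatenated with image features, and retrieval is a nearest-neighbour search over the joint vectors. The captions are invented, the visual features are random stand-ins for CNN activations, and the autoencoder refinement described in [21] is omitted.

# Sketch of joint text/image retrieval: LDA topic proportions from captions
# concatenated with (placeholder) CNN image features, then nearest-neighbour
# search over the joint vectors.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

captions = ["a dog running on the beach",
            "two people riding bicycles in the park",
            "a dog catching a frisbee in the park"]

# Topic proportions from the caption text.
counts = CountVectorizer().fit_transform(captions)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topics = lda.fit_transform(counts)                      # (N, n_topics)

# Stand-in for CNN visual features (would come from a pretrained network).
visual = np.random.default_rng(0).standard_normal((len(captions), 8))

joint = np.hstack([topics, visual])                     # joint representation
query = joint[0]
dists = np.linalg.norm(joint - query, axis=1)
print("ranking:", np.argsort(dists))                    # nearest images first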

Conclusion

The focus of this paper is to review various techniques for automatically generating captions for images, which is important for many image-related applications. The survey lists various approaches undertaken to close or reduce the gap between the query image and the retrieved results, usually expressed as the semantic gap. Text-based similarity is found to express the exact features in the image more accurately; hence text-similarity based image retrieval is expected to improve the overall results of CBIR, as it is able to reduce the semantic gap. The results so far show further scope for improving accuracy. There are limitations in the captioned image data, as existing datasets are prepared with only a limited number of words in each caption; more robust captions representing all the semantic information of the image could improve the search results. This also requires a corresponding increase in the computational capacity of the machines, so that the RNN part of the neural captioning model can be trained on longer sequences of words. Future research should therefore improve the results of CBIR with the image captioning based approach as computational capacity increases with better hardware and more robust captioned data become available for training image captioning models.

References

[1] Moses Soh, "Learning CNN-LSTM Architectures for Image Caption Generation", https://cs224d.stanford.edu/reports/msoh.pdf, 2016.
[2] Ying Liu, Dengsheng Zhang, Guojun Lu, and Wei-Ying Ma, "A survey of content based image retrieval with high-level semantics", Pattern Recognition, 40(1):262–282, January 2007.
[3] Kamenka Staykova, "Natural Language Generation and Semantic Technologies", Cybernetics and Information Technologies, Volume 14, No. 2, DOI: 10.2478/cait-2014-0015, 2014.
[4] Y. Yang, C. L. Teo, H. Daumé III, and Y. Aloimonos, "Corpus-guided sentence generation of natural images", in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 444–454, 2011.
[5] Desmond Elliott and Frank Keller, "Image description using visual dependency representations", in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1292–1302, 2013.
[6] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg, "BabyTalk: Understanding and generating simple image descriptions", IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2891–2903, 2013.
[7] P. Hede, P. Moellic, J. Bourgeoys, M. Joint, and C. Thomas, "Automatic generation of natural language descriptions for images", in Proc. Recherche d'Information Assistée par Ordinateur, 2004.
[8] X. Chen and C. L. Zitnick, "Mind's eye: A recurrent visual representation for image caption generation", in CVPR, 2015.
[9] Jianhui Chen, Wenqiang Dong, and Minchen Li, "Image Caption Generator Based On Deep Neural Networks", https://www.cs.ubc.ca/~carenini/TEACHING/CPSC503-19/FINAL-PROJECTS-2016/image_caption_generator_final_report.pdf, 2016.
[10] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan, "Show and tell: A neural image caption generator", IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164, 2015.
[11] Yang Yang, Yi-Feng Wu, De-Chuan Zhan, Zhi-Bin Liu, and Yuan Jiang, "Complex Object Classification: A Multi-Modal Multi-Instance Multi-Label Deep Network with Optimal Transport", in KDD '18: The 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, August 19–23, ACM, 2018.
[12] Qi Wu, Chunhua Shen, Peng Wang, Anthony Dick, and Anton van den Hengel, "Image Captioning and Visual Question Answering Based on Attributes and Their Related External Knowledge", IEEE Transactions on Pattern Analysis and Machine Intelligence, DOI: 10.1109/TPAMI.2017.2708709, 2017.
[13] Tanya Piplani and David Bamman, "DeepSeek: Content Based Image Search & Retrieval".
[14] Jia Xu, Efstratios Gavves, Basura Fernando, and Tinne Tuytelaars, "Guiding Long-Short Term Memory for Image Caption Generation", DOI: 10.1109/ICCV.2015.277, 2015.
[15] Rahul Singh and Aayush Sharma, "Image captioning using Deep Neural Networks".
[16] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation", in ACL, 2002.
[17] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, "CIDEr: Consensus-based image description evaluation", in CVPR, 2015.
[18] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries", in ACL Workshop, 2004.
[19] M. Denkowski and A. Lavie, "Meteor Universal: Language specific translation evaluation for any target language", in ACL, 2014.
[20] Junyang Qian and Giacomo Lamberti, "Neural caption image retrieval", http://cs229.stanford.edu/proj2018/report/59.pdf, 2018.
[21] Jo Boon, Akshay Sood, and Meenakshi Syamkumar, "Robust image retrieval using topic modeling on captioned image data", 2015.