Generating Chinese Captions for Flickr30K Images

Hao Peng
Indiana University, Bloomington

[email protected]

Nianhen Li
Indiana University, Bloomington

[email protected]

Abstract

We trained a Multimodal Recurrent Neural Network on the Flickr30K dataset with Chinese sentences. The RNN model is from Karpathy and Fei-Fei, 2015 [6]. Since Chinese sentences have no spaces between words, we applied the model to the Flickr30K dataset in two ways. In the first setting, we tokenized each Chinese sentence into a list of words and fed them to the RNN; in the second, we split each Chinese sentence into a list of characters and fed them into the same model. We compared the BLEU scores achieved by our two methods to those achieved by [6]. We found that the RNN model trained with the character-level method for Chinese captions outperforms the word-level one, and that the former performs very close to the model trained on English captions by [6]. We conclude that the RNN model works universally well, or at least equally well, for image captioning in different languages.

1. Introduction

Humans are good at describing and understanding the visual scene expressed in an image with just a glance, but it is a difficult task for computers to describe the content of an image, or even just to recognize all the objects in it. An automated image captioning system is therefore useful in many ways: self-driving cars and VR glasses both need this technology to build up their functionality, and such tools could also be used to provide richer descriptions of images for people who are blind or visually impaired.

The majority of previous work in visual recognition has focused on labeling images with a fixed set of visual categories, and great progress has been achieved in these endeavors [4, 11]. However, while closed sets of visual "words" or "vocabularies" are a convenient modeling assumption, they are vastly limited when compared to the descriptions articulated by humans.

Recently, much research on image captioning has been devoted to RNN models, as they are effective at modeling sequential data and at capturing context and semantic relations in language. However, all of these models are trained on images with English captions, so it is unknown how they perform in other languages and whether the method works universally. In this paper we test it on a Chinese captioning system.

Since Chinese sentences have no spaces between words, Chinese is very different from English. We implemented the RNN model with the same architecture used by [6] on the Flickr30K dataset with Chinese captions in two scenarios. The Chinese captions were obtained by translating the original English captions with the Google Translation API.

Our experiments show that the generated Chinese sentences align well with the translated Chinese captions. We also report the BLEU [10] score computed with the coco-caption code [1], a metric that evaluates a candidate sentence by measuring how well it matches a set of five reference sentences written by humans.

2. Related Work

Researchers have explored many directions in vision and language, such as image captioning (Lin et al., 2014; Karpathy and Fei-Fei, 2015; Vinyals et al., 2015; Xu et al., 2015; Chen et al., 2015; Young et al., 2014; Elliott and Keller, 2013), question answering (Antol et al., 2015; Ren et al., 2015; Gao et al., 2015; Malinowski and Fritz, 2014), visual phrases (Sadeghi and Farhadi, 2011), video understanding (Ramanathan et al., 2013), and visual concepts (Krishna et al., 2016; Fang et al., 2015).

To build a visual description system, recent state-of-the-art work [6, 14] has used a multimodal recurrent neural network (RNN) to create a "sequence to sequence" machine learning system similar to the kind other researchers have used for machine translation. In this case, however, instead of translating from, say, French to English, the system is trained to translate from images to sentences.

Multiple closely related works have also used RNNs to generate image descriptions [9, 14, 3, 8, 5, 2], but [6] claims that their model is simpler than most of the previous approaches. We therefore decided to apply their model to our Chinese captioning task on the same image dataset, Flickr30K, and to quantitatively compare the performance in our experiments with their original results.

3. Our Model

As noted above, the architecture of our RNN model is the same as the one used in [6], because we want to make a performance comparison in this paper. Thus some of the lines in this section (the training/testing process and the optimization as well) are borrowed from [6].

The RNN model accepts an image vector and outputs a corresponding sentence description. Each sentence is split into a sequence of elements and fed into the RNN (since we implemented both a word-level and a character-level method, we refer to the units as elements here). The model generates elements by defining a probability distribution over the next element in a sequence given the current element and the context from previous time steps. At the first time step, the probability of an element is conditioned only on the input image vector. At test time, the model can predict a variable-length sequence of elements given an image.

Specifically, our RNN model is trained as follows. It takes the image pixels $I$ and a sequence of one-hot encoded word vectors $(x_1, x_2, \dots, x_T)$. It then computes a sequence of hidden states $(h_1, h_2, \dots, h_T)$ and a sequence of outputs $(y_1, y_2, \dots, y_T)$ by iterating the following formulas for $t = 1$ to $T$:

$$b_v = W_{hi}\,[\mathrm{CNN}_{\theta_c}(I)] \tag{1}$$
$$h_t = f\big(W_{hx} x_t + W_{hh} h_{t-1} + b_h + \mathbb{1}(t{=}1) \odot b_v\big) \tag{2}$$
$$y_t = \mathrm{softmax}(W_{oh} h_t + b_o) \tag{3}$$

In the above equations, $W_{hi}, W_{hx}, W_{hh}, W_{oh}, x_t$ and $b_h, b_o$ are learnable parameters that are updated during training, and $\mathrm{CNN}_{\theta_c}(I)$ is the output of the last layer of the VGG [12] net (as shown in Figure 1). In our training, the image encoding size, word encoding size, and hidden size are all set to 256, which means $x_t, b_v, h_t, b_h$ and $b_o$ are all 256-dimensional vectors. The output vector $y_t$ holds the log probabilities of the words in the vocabulary plus one additional dimension for a special END token. We feed the image encoding vector $b_v$ into the RNN only at the first iteration.
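
For concreteness, the recurrence in Equations (1)-(3) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the code we trained with: the parameter names mirror the equations, `cnn_feat` stands for the precomputed VGG feature of the image, and the nonlinearity $f$ is taken to be ReLU here for illustration.

```python
import numpy as np

def rnn_forward(cnn_feat, x_seq, params):
    """One pass of equations (1)-(3).

    cnn_feat : CNN output for the image (e.g. the last-layer VGG feature).
    x_seq    : list of element embedding vectors x_1..x_T (256-d each).
    params   : dict with W_hi, W_hx, W_hh, W_oh, b_h, b_o.
    Returns the list of output distributions y_1..y_T.
    """
    W_hi, W_hx, W_hh = params["W_hi"], params["W_hx"], params["W_hh"]
    W_oh, b_h, b_o = params["W_oh"], params["b_h"], params["b_o"]

    b_v = W_hi @ cnn_feat                      # eq. (1): image bias vector
    h = np.zeros(W_hh.shape[0])                # h_0 = 0
    outputs = []
    for t, x_t in enumerate(x_seq):
        pre = W_hx @ x_t + W_hh @ h + b_h
        if t == 0:                             # image injected only at t = 1
            pre += b_v
        h = np.maximum(pre, 0)                 # eq. (2), with f = ReLU (assumed)
        logits = W_oh @ h + b_o                # eq. (3), before normalization
        y_t = np.exp(logits - logits.max())
        outputs.append(y_t / y_t.sum())        # softmax over vocabulary + END
    return outputs
```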

3.1. Training process

Our RNN model is trained to predict the next word $y_t$ from the input word $x_t$ and the previous context (hidden state) $h_{t-1}$; we simply treat the image encoding vector $b_v$ as a bias term on the first iteration. The training process is illustrated in Figure 2: we set $h_0 = \vec{0}$ and $x_1$ to a special START vector, and we expect $y_1$ to be close to the first word in the sequence. Similarly, we set $x_2$ to the first word vector and expect the network to predict the second word, and so on. Finally, $x_T$ is the last word vector in the sequence, and we expect the RNN to predict a special END token. The goal is to maximize the log probability assigned to the target labels.

Figure 1. The image vector produced by the VGG net.

Figure 2. Illustration of the RNN sentence generation process.
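
A rough sketch of the per-sentence training objective, under the same assumptions as the forward-pass sketch above (hypothetical START/END token indices and an `embed` matrix holding the learnable element embeddings), is:

```python
import numpy as np

START, END = 0, 1  # hypothetical special token indices

def sentence_loss(cnn_feat, target_ids, embed, params):
    """Negative log-likelihood of one caption given its image.

    target_ids : list of vocabulary indices of the caption elements.
    embed      : embedding matrix, one 256-d row per vocabulary entry.
    """
    # Inputs are START, w_1, ..., w_T; targets are w_1, ..., w_T, END.
    input_ids = [START] + list(target_ids)
    targets = list(target_ids) + [END]

    x_seq = [embed[i] for i in input_ids]
    y_seq = rnn_forward(cnn_feat, x_seq, params)   # softmax outputs from the sketch above

    # Sum of -log p(target_t) over the whole sequence.
    return -sum(np.log(y_seq[t][targets[t]] + 1e-12) for t in range(len(targets)))
```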

3.2. Testing process

To predict a sentence, we compute the image encoding vector $b_v$, set $h_0 = \vec{0}$ and $x_1$ to the START vector, and compute the distribution over the first word, $y_1$. We sample a word from that distribution, set its embedding vector as the next input word $x_2$, and repeat this process until the END token is generated or the length of the generated sequence exceeds 20. We also report the BLEU score under beam search with different beam sizes.
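
Reusing the hypothetical names from the sketches above, a minimal greedy/sampling decoder in this spirit looks as follows; beam search, which we use for the reported scores, instead keeps the top-k partial sentences at each step.

```python
import numpy as np

def generate_caption(cnn_feat, embed, params, max_len=20, greedy=True):
    """Generate one caption, stopping at END or after max_len elements."""
    W_hi, W_hx, W_hh = params["W_hi"], params["W_hx"], params["W_hh"]
    W_oh, b_h, b_o = params["W_oh"], params["b_h"], params["b_o"]

    b_v = W_hi @ cnn_feat
    h = np.zeros(W_hh.shape[0])
    token, caption = START, []
    for t in range(max_len):
        pre = W_hx @ embed[token] + W_hh @ h + b_h
        if t == 0:
            pre += b_v                         # image bias only at the first step
        h = np.maximum(pre, 0)
        logits = W_oh @ h + b_o
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        token = int(probs.argmax()) if greedy else int(np.random.choice(len(probs), p=probs))
        if token == END:
            break
        caption.append(token)
    return caption                             # list of vocabulary indices
```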

3.3. Optimization

As we are going to compare the performance of the RNN model on Chinese sentence generation to its performance on English sentence generation in [6], we keep the RNN architecture and training parameters the same as in [6]. We use SGD with mini-batches of 100 image-sentence pairs and momentum of 0.9 to optimize the alignment model, and we cross-validate the learning rate and the weight decay. We also use dropout regularization in all layers except the recurrent layers. We achieved the best results using RMSprop [13].
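
For reference, the RMSprop update [13] divides each gradient by a running average of its recent magnitude; a generic sketch (not tied to the exact hyperparameters we cross-validated) is:

```python
import numpy as np

def rmsprop_step(param, grad, cache, lr=1e-3, decay=0.99, eps=1e-8):
    """One RMSprop update; `cache` is the running average of squared gradients."""
    cache = decay * cache + (1.0 - decay) * grad ** 2
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache
```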

Figure 3. Some examples of English captions and their translated Chinese. The translated sentences are obtained using the Google Translation API.

Figure 4. An example of sentence segmentation in the word-level method.

Figure 5. An example of sentence segmentation in the character-level method.

4. Experiments

4.1. Dataset processing

We experiment on the Flickr30K [15] dataset, which contains 31,000 images, each accompanied by 5 Chinese sentences. Note that the original captions are in English; we obtained the Chinese captions using the Google Translation API. Some examples are shown in Figure 3. For Flickr30K, we use 1,000 images for validation, another 1,000 images for testing, and the rest for training (the same setting as [6]).
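
The split itself is straightforward; assuming the images have already been paired with their translated captions, a hypothetical helper for partitioning the 31,000 image ids might look like this:

```python
import random

def split_flickr30k(image_ids, n_val=1000, n_test=1000, seed=0):
    """Shuffle the Flickr30K image ids and carve out validation/test/train sets."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    val = ids[:n_val]
    test = ids[n_val:n_val + n_test]
    train = ids[n_val + n_test:]
    return train, val, test
```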

4.2. Methods

As can be seen from Figure 3, Chinese is very different from English: there are no spaces between Chinese characters. We therefore trained our model with two different methods. In the first method, we tokenized each Chinese sentence into a list of words; an example is shown in Figure 4. In the second method, we split each Chinese caption into a list of characters; an example is shown in Figure 5. A sketch of both segmentation schemes is given below.
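
The model does not depend on a particular segmenter; assuming, for illustration, the common open-source `jieba` segmenter for the word-level variant, the two preprocessing schemes can be sketched as:

```python
import jieba  # widely used Chinese word segmentation library (assumed here)

def to_words(sentence):
    """Word-level elements: segment the Chinese sentence into words."""
    return [w for w in jieba.lcut(sentence) if w.strip()]

def to_chars(sentence):
    """Character-level elements: simply take every Chinese character."""
    return [ch for ch in sentence if not ch.isspace()]

# Illustrative example (a caption similar to the one glossed in Figure 7):
# to_words("一个男人和一个女人在跳舞")  -> roughly ['一个', '男人', '和', '一个', '女人', '在', '跳舞']
# to_chars("一个男人和一个女人在跳舞")  -> ['一', '个', '男', '人', '和', ...]
```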

4.3. Model evaluation and comparison

We first trained the RNN model in the two ways described above; it produces reasonable descriptions of images at test time for both methods. Figure 6 shows an example of a Chinese caption generated for a test image by the character-level RNN model. Figure 7 shows an example of a Chinese caption generated by the word-level RNN model.

Figure 6. Chinese caption generated by the character-level RNN during testing. For understanding, the Chinese sentence at the bottom of this figure means "a young girl is wearing a red shirt and black trousers".

Figure 7. Chinese caption generated by the word-level RNN during testing. For understanding, the Chinese sentence at the bottom of this figure means "a man and a woman are dancing".

Figure 8. An example of different captions generated by the word-level (left) and character-level (right) methods. For understanding, the corresponding English sentences for each of them are shown at the bottom of this figure.

We also compared the captions generated by our model with the two methods on the same test images. The interesting part is that, although there is a slight difference between the two captions, each of the generated sentences still makes sense. An example is shown in Figure 8.

Before drawing a conclusion, we also compared the performance of both methods quantitatively with the results achieved by [6] (Fei-Fei's model) on English caption generation. To this end, we report the BLEU [10] scores (shown in Figure 9) for both methods with a beam size of 7 using the coco-caption code [1], which is the same setting as in [6].
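
For reference, BLEU [10] compares clipped n-gram overlap between a candidate sentence and its references and applies a brevity penalty; the coco-caption code [1] computes corpus-level BLEU-1 through BLEU-4, but a minimal sentence-level sketch (unigrams and bigrams only) is:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, references, max_n=2):
    """Tiny BLEU sketch: clipped n-gram precision (n = 1..max_n) plus brevity penalty.

    candidate  : list of elements (words or characters) of the generated caption.
    references : list of token lists, e.g. the five reference captions.
    """
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for g, c in ngrams(ref, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(max(clipped, 1e-9) / total)

    # Brevity penalty against the reference length closest to the candidate's.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```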

Figure 9. BLEU scores for the RNN model on English caption generation (Fei-Fei's) and for the word-level and character-level methods on Chinese caption generation, on the Flickr30K image dataset.

From the BLEU scores in Figure 9, we can see that the RNN model trained with the character-level method for Chinese captions outperforms the model trained with the word-level method. The character-level method performs very close to the original model trained on English captions [6], while the word-level method performs slightly worse.

Thus, we come to the conclusion that this RNN model works universally well for image captioning across languages. Before this work, we had not seen any application of this RNN model to image caption generation in a language other than English, so it was not clear whether this sequential model transfers to other languages. We have now tested the model on Chinese and reached a conclusion that we think is fair.

One surprise we noticed is that the character-level method works better than the word-level one in this task. As other researchers have shown, character-level convolutional neural networks can work even better than word-level ones for text and sentiment classification [16, 7]. Our finding opens up further research into the performance of character-level methods with RNN models in other tasks, such as LSTMs for text classification.

5. Limitations and Future Work

We used the Google Translation API to obtain the Chinese captions due to limited resources. However, automatic machine translation is not yet very accurate. The reported BLEU scores may not be affected too much, since BLEU measures the relative similarity between a generated sentence and the references, but the quality of the translation may compromise the quality of the generated image descriptions.

In the future, instead of using the sentences translated by Google, we could review a few thousand images and their translated sentences and manually correct them ourselves. With that small set of clean data, we could train the model again to see if it works better, or fine-tune the hyperparameters of the RNN model to see if it yields an even better result.

Acknowledgments

We highly appreciate the help we received from Professor David and all the AIs in this great course. Most of the knowledge and techniques used in this paper were learnt from the vision course. The idea of training an RNN model for image caption generation in Chinese was inspired by Professor David, and he provided us with many valuable suggestions and feedback.

We want to say thank you to all the course staff. Thanks for hosting attentive office hours and for the effort devoted to the course development and the poster session. We really learnt a lot from it.

References

[1] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollar, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.

[2] X. Chen and C. L. Zitnick. Learning a recurrent visual representation for image caption generation. arXiv preprint arXiv:1411.5654, 2014.

[3] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.

[4] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2009.

[5] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1473–1482, 2015.

[6] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.

[7] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

[8] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.

[9] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090, 2014.

[10] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics, 2002.

[11] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[12] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[13] T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4:2, 2012.

[14] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.

[15] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.

[16] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657, 2015.
