
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3643–3653, Brussels, Belgium, October 31 - November 4, 2018. © 2018 Association for Computational Linguistics


A Visual Attention Grounding Neural Model for Multimodal Machine Translation

Mingyang Zhou, Runxiang Cheng, Yong Jae Lee, Zhou Yu
Department of Computer Science
University of California, Davis
{minzhou, samcheng, yongjaelee, joyu}@ucdavis.edu

Abstract

We introduce a novel multimodal machine translation model that utilizes parallel visual and textual information. Our model jointly optimizes the learning of a shared visual-language embedding and a translator. The model leverages a visual attention grounding mechanism that links the visual semantics with the corresponding textual semantics. Our approach achieves competitive state-of-the-art results on the Multi30K and the Ambiguous COCO datasets. We also collected a new multilingual multimodal product description dataset to simulate a real-world international online shopping scenario. On this dataset, our visual attention grounding model outperforms other methods by a large margin.

1 Introduction

Multimodal machine translation is the problem of translating sentences paired with images into a different target language (Elliott et al., 2016). In this setting, translation is expected to be more accurate compared to purely text-based translation, as the visual context could help resolve ambiguous multi-sense words. Examples of real-world applications of multimodal (vision plus text) translation include translating multimedia news, web product information, and movie subtitles.

Several previous endeavours (Huang et al., 2016; Calixto et al., 2017a; Elliott and Kádár, 2017) have demonstrated improved translation quality when utilizing images. However, how to effectively integrate the visual information still remains a challenging problem. For instance, in the WMT 2017 multimodal machine translation challenge (Elliott et al., 2017), methods that incorporated visual information did not outperform pure text-based approaches by a large margin.

In this paper, we propose a new model called Visual Attention Grounding Neural Machine Translation (VAG-NMT) to leverage visual information more effectively. We train VAG-NMT with a multitask learning mechanism that simultaneously optimizes two objectives: (1) learning a translation model, and (2) constructing a vision-language joint semantic embedding. In this model, we develop a visual attention mechanism to learn an attention vector that values the words that have closer semantic relatedness with the visual context. The attention vector is then projected to the shared embedding space to initialize the translation decoder such that the source sentence words that are more related to the visual semantics have more influence during the decoding stage. When evaluated on the benchmark Multi30K and the Ambiguous COCO datasets, our VAG-NMT model demonstrates competitive performance compared to existing state-of-the-art multimodal machine translation systems.

Another important challenge for multimodal machine translation is the lack of a large-scale, realistic dataset. To our knowledge, the only existing benchmark is Multi30K (Elliott et al., 2016), which is based on an image captioning dataset, Flickr30K (Young et al., 2014), with manual German and French translations. There are roughly 30K parallel sentences, which is relatively small compared to text-only machine translation datasets that have millions of sentences, such as WMT14 EN→DE. Therefore, we propose a new multimodal machine translation dataset called IKEA to simulate the real-world problem of translating product descriptions from one language to another. Our IKEA dataset is a collection of parallel English, French, and German product descriptions and images from IKEA's and UNIQLO's websites. We have included a total of 3,600 products so far and will include more in the future.


2 Related Work

In the machine translation literature, there are two major streams for integrating visual information: approaches that (1) employ separate attention for different (text and vision) modalities, and (2) fuse visual information into the NMT model as part of the input. The first line of work learns independent context vectors from a sequence of text encoder hidden states and a set of location-preserving visual features extracted from a pre-trained convnet, and both sets of attentions affect the decoder's translation (Calixto et al., 2017a; Helcl and Libovický, 2017). The second line of work instead extracts a global semantic feature and initializes either the NMT encoder or decoder to fuse the visual context (Calixto et al., 2017b; Ma et al., 2017). While both approaches demonstrate significant improvement over their Text-Only NMT baselines, they still perform worse than the best monomodal machine translation system from the WMT 2017 shared task (Zhang et al., 2017).

The model that performs best in the multimodal machine translation task employed image context in a different way. (Huang et al., 2016) combine region features extracted from a region-proposal network (Ren et al., 2015) with the word sequence feature as the input to the encoder, which leads to significant improvement over their NMT baseline. The best multimodal machine translation system in WMT 2017 (Caglayan et al., 2017) performs element-wise multiplication of the target language embedding with an affine transformation of the convnet image feature vector as the mixed input to the decoder. While this method outperforms all other methods in the WMT 2017 shared task, the advantage over the monomodal translation system is still minor.

The proposed visual context grounding process in our model is closely related to the literature on multimodal shared space learning. (Kiros et al., 2014) propose a neural language model to learn a visual-semantic embedding space by optimizing a ranking objective, where the distributed representation helps generate image captions. (Karpathy and Li, 2014) densely align different objects in the image with their corresponding text captions in the shared space, which further improves the quality of the generated caption. In later work, multimodal shared space learning was extended to multimodal multilingual shared space learning. (Calixto et al., 2017c) learn a multi-modal multilingual shared space through optimization of a modified pairwise contrastive function, where the extra multilingual signal in the shared space leads to improvements in image-sentence ranking and the semantic textual similarity task. (Gella et al., 2017) extend the work from (Calixto et al., 2017c) by using the image as the pivot point to learn the multilingual multimodal shared space, which does not require large parallel corpora during training. Finally, (Elliott and Kádár, 2017) is the first to integrate the idea of multimodal shared space learning to help multimodal machine translation. Their multi-task architecture called "imagination" shares an encoder between a primary task of the classical encoder-decoder NMT and an auxiliary task of visual feature reconstruction.

Our VAG-NMT mechanism is inspired by (Elliott and Kádár, 2017), but has important differences. First, we modify the auxiliary task as a visual-text shared space learning objective instead of the simple image reconstruction objective. Second, we create a visual-text attention mechanism that captures the words that share a strong semantic meaning with the image, where the grounded visual context has more impact on the translation. We show that these enhancements lead to improvement in multimodal translation quality over (Elliott and Kádár, 2017).

Figure 1: An overview of the VAG-NMT structure

3 Visual Attention Grounding NMT

Given a set of parallel sentences in languages $X$ and $Y$, and a set of corresponding images $V$ paired with each sentence pair, the model aims to translate sentences $\{x_i\}_{i=1}^N \in X$ in language $X$ to sentences $\{y_i\}_{i=1}^N \in Y$ in language $Y$ with the assistance of images $\{v_i\}_{i=1}^N \in V$.

We treat the problem of multimodal machine translation as a joint optimization of two tasks: (1) learning a robust translation model and (2) constructing a visual-language shared embedding that grounds the visual semantics with text. Figure 1 shows an overview of our VAG-NMT model. We adopt a state-of-the-art attention-based sequence-to-sequence structure (Bahdanau et al., 2014) for translation. For the joint embedding, we obtain the text representation using a weighted sum of hidden states from the encoder of the sequence-to-sequence model, and we obtain the image representation from a pre-trained convnet. We learn the weights using a visual attention mechanism, which represents the semantic relatedness between the image and each word in the encoded text. We learn the shared space with a ranking loss and the translation model with a cross entropy loss.

The joint objective function is defined as:

J(\theta_T, \phi_V) = \alpha J_T(\theta_T) + (1 - \alpha) J_V(\phi_V)    (1)

where $J_T$ is the objective function for the sequence-to-sequence model, $J_V$ is the objective function for joint embedding learning, $\theta_T$ are the parameters in the translation model, $\phi_V$ are the parameters for the shared vision-language embedding learning, and $\alpha$ determines the contribution of the machine translation loss versus the visual grounding loss. Both $J_T$ and $J_V$ share the parameters of the encoder from the neural machine translation model. We describe details of the two objective functions in Sections 3.2 and 3.3.
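To make the weighting concrete, below is a minimal PyTorch-style sketch of combining the two losses as in Eq. 1; the function and argument names are illustrative placeholders, not the authors' implementation.

```python
import torch

def joint_loss(translation_loss: torch.Tensor,
               ranking_loss: torch.Tensor,
               alpha: float = 0.99) -> torch.Tensor:
    """Weighted combination of the translation objective J_T and the
    visual-grounding objective J_V (Eq. 1). The paper reports
    alpha = 0.99 (Appendix B)."""
    return alpha * translation_loss + (1.0 - alpha) * ranking_loss
```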

3.1 Encoder

We first encode an n-length source sentence as a sequence of tokens $x = \{x_1, x_2, \ldots, x_n\}$ with a bidirectional GRU (Schuster and Paliwal, 1997; Cho et al., 2014). Each token is represented by a one-hot vector, which is then mapped into an embedding $e_i$ through a pre-trained embedding matrix. The bidirectional GRU processes the embedded tokens in two directions: left-to-right (forward) and right-to-left (backward). At every time step, the encoder's GRU cell generates two corresponding hidden state vectors: $\overrightarrow{h_i} = \overrightarrow{\mathrm{GRU}}(h_{i-1}, e_i)$ and $\overleftarrow{h_i} = \overleftarrow{\mathrm{GRU}}(h_{i-1}, e_i)$. The two hidden state vectors are then concatenated together to serve as the encoder hidden state vector of the source token at step $i$: $h_i = [\overleftarrow{h_i}, \overrightarrow{h_i}]$.
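The following is a minimal PyTorch sketch of such a bidirectional GRU encoder; the class name is hypothetical, and the layer sizes merely follow the values reported in Appendix B rather than the released code.

```python
import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    """Illustrative bidirectional GRU encoder in the spirit of Section 3.1
    (256-dim embeddings, 512-dim hidden states per direction)."""

    def __init__(self, vocab_size: int, emb_dim: int = 256, hid_dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True, bidirectional=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) integer ids
        emb = self.embedding(tokens)      # (batch, seq_len, emb_dim)
        hidden, _ = self.gru(emb)         # (batch, seq_len, 2 * hid_dim)
        # Each position i holds the concatenation of the forward and
        # backward hidden states for that token, i.e. h_i.
        return hidden
```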

3.2 Shared embedding objective

After encoding the source sentence, we project both the image and text into the shared space to find a good distributed representation that can capture the semantic meaning across the two modalities. Previous work has shown that learning a multimodal representation is effective for grounding knowledge between two modalities (Kiros et al., 2014; Chrupala et al., 2015). Therefore, we expect the shared encoder between the two objectives to facilitate the integration of the two modalities and positively influence translation during decoding.

To project the image and the source sentence to a shared space, we obtain the visual embedding ($v$) from the pool5 layer of ResNet50 (He et al., 2015a) pre-trained on ImageNet classification (Russakovsky et al., 2015), and the source sentence embedding using the weighted sum of encoder hidden state vectors ($\{h_i\}$) to represent the entire source sentence ($t$). We project each $h_i$ to the shared space through an embedding layer. As different words in the source sentence will have different importance, we employ a visual-language attention mechanism, inspired by the attention mechanism applied in sequence-to-sequence models (Bahdanau et al., 2014), to emphasize words that have a stronger semantic connection with the image. For example, the highlighted word "cat" in the source sentence in Fig. 1 has a stronger semantic connection with the image.

Specifically, we produce a set of weights $\beta = \{\beta_1, \beta_2, \ldots, \beta_n\}$ with our visual-attention mechanism, where the attention weight $\beta_i$ for the $i$'th word is computed as:

\beta_i = \frac{\exp(z_i)}{\sum_{l=1}^{N} \exp(z_l)},    (2)

and $z_i = \tanh(W_v v) \cdot \tanh(W_h h_i)$ is computed by taking the dot product between the transformed encoder hidden state vector $h_i$ and the transformed image feature vector $v$, and $W_v$ and $W_h$ are the association transformation parameters.

Then, we can get a weighted sum of the encoder hidden state vectors $t = \sum_{i=1}^{n} \beta_i h_i$ to represent the semantic meaning of the entire source sentence. The next step is to project the source sentence feature vector $t$ and the image feature vector $v$ into the same shared space. The projected vector for text is $t_{emb} = \tanh(W_{t_{emb}} t + b_{t_{emb}})$ and the projected vector for the image is $v_{emb} = \tanh(W_{v_{emb}} v + b_{v_{emb}})$.
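A compact PyTorch sketch of the attention weights in Eq. 2 and the two shared-space projections might look as follows; the module name, the batch-first tensor shapes, and the feature dimensions are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualTextAttention(nn.Module):
    """Sketch of the visual-language attention (Eq. 2) and the t_emb / v_emb
    projections. Dimensions assume a 1024-dim bidirectional encoder state,
    a 2048-dim pool5 image feature, and a 512-dim shared space."""

    def __init__(self, hid_dim: int = 1024, img_dim: int = 2048, emb_dim: int = 512):
        super().__init__()
        self.W_v = nn.Linear(img_dim, emb_dim, bias=False)   # transforms image feature v
        self.W_h = nn.Linear(hid_dim, emb_dim, bias=False)   # transforms encoder state h_i
        self.W_t_emb = nn.Linear(hid_dim, emb_dim)           # projects text vector t
        self.W_v_emb = nn.Linear(img_dim, emb_dim)           # projects image vector v

    def forward(self, enc_states: torch.Tensor, img_feat: torch.Tensor):
        # enc_states: (batch, seq_len, hid_dim); img_feat: (batch, img_dim)
        z = (torch.tanh(self.W_v(img_feat)).unsqueeze(1)
             * torch.tanh(self.W_h(enc_states))).sum(dim=-1)   # dot products z_i
        beta = F.softmax(z, dim=-1)                            # attention weights (Eq. 2)
        t = (beta.unsqueeze(-1) * enc_states).sum(dim=1)       # weighted sum of states
        t_emb = torch.tanh(self.W_t_emb(t))                    # text point in shared space
        v_emb = torch.tanh(self.W_v_emb(img_feat))             # image point in shared space
        return beta, t, t_emb, v_emb
```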


We follow previous work on visual semantic embedding (Kiros et al., 2014) to minimize a pairwise ranking loss to learn the shared space:

J_V(\phi_V) = \sum_p \sum_k \max\{0, \gamma - s(v_p, t_p) + s(v_p, t_{k \neq p})\}
            + \sum_k \sum_p \max\{0, \gamma - s(t_k, v_k) + s(t_k, v_{p \neq k})\},    (3)

where $\gamma$ is a margin, and $s$ is the cosine similarity between two vectors in the shared space. $k$ and $p$ are the indexes of the images and text sentences, respectively. $t_{k \neq p}$ and $v_{p \neq k}$ are the contrastive examples with respect to the selected image and the selected source text, respectively. When the loss decreases, the distance between a paired image and sentence will drop while the distance between an unpaired image and sentence will increase.
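Below is a hedged PyTorch sketch of this max-margin objective. It uses every other item in a mini-batch as the contrastive example, which is a common in-batch approximation rather than necessarily the paper's exact sampling scheme; the function and argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(v_emb: torch.Tensor, t_emb: torch.Tensor,
                          margin: float = 0.1) -> torch.Tensor:
    """In-batch max-margin ranking loss in the spirit of Eq. 3.
    `margin` corresponds to gamma (0.1 in Appendix B)."""
    v = F.normalize(v_emb, dim=-1)            # cosine similarity via normalized dot products
    t = F.normalize(t_emb, dim=-1)
    scores = v @ t.t()                        # scores[p, k] = s(v_p, t_k)
    pos = scores.diag().unsqueeze(1)          # s(v_p, t_p) for each pair
    cost_t = (margin - pos + scores).clamp(min=0)      # contrastive captions for each image
    cost_v = (margin - pos.t() + scores).clamp(min=0)  # contrastive images for each caption
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_t.masked_fill(mask, 0).sum() + cost_v.masked_fill(mask, 0).sum()
```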

In addition to grounding the visual context into the shared encoder through the multimodal shared space learning, we also initialize the decoder with the learned attention vector $t$ such that the words that have more relatedness with the visual semantics will have more impact during the decoding (translation) stage. However, we may not want to rely solely on a few of the most important words. Thus, to produce the initial hidden state of the decoder, we take a weighted average of the attention vector $t$ and the mean of the encoder hidden states:

s_0 = \tanh\Big(W_{init}\Big(\lambda t + (1 - \lambda) \frac{1}{N} \sum_{i}^{N} h_i\Big)\Big),    (4)

where $\lambda$ determines the contribution from each vector. Through our experiments, we find the best value for $\lambda$ is 0.5.
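As a small illustration, Eq. 4 could be implemented as follows in PyTorch; the module name and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class DecoderInit(nn.Module):
    """Sketch of Eq. 4: initialize the decoder from a mix of the attention
    vector t and the mean encoder state; `lam` corresponds to lambda = 0.5."""

    def __init__(self, enc_dim: int = 1024, dec_dim: int = 512, lam: float = 0.5):
        super().__init__()
        self.W_init = nn.Linear(enc_dim, dec_dim)
        self.lam = lam

    def forward(self, t: torch.Tensor, enc_states: torch.Tensor) -> torch.Tensor:
        # t: (batch, enc_dim); enc_states: (batch, seq_len, enc_dim)
        mixed = self.lam * t + (1.0 - self.lam) * enc_states.mean(dim=1)
        return torch.tanh(self.W_init(mixed))   # s_0, the initial decoder hidden state
```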

3.3 Translation objective

During the decoding stage, at each time step $j$, the decoder generates a decoder hidden state $s_j$ from a conditional GRU cell (Sennrich et al., 2017) whose input is the previously generated translation token $y_{j-1}$, the previous decoder hidden state $s_{j-1}$, and the context vector $c_j$ at the current time step:

s_j = \mathrm{cGRU}(s_{j-1}, y_{j-1}, c_j)    (5)

The context vector $c_j$ is a weighted sum of the encoder hidden state vectors, and captures the relevant source words that the decoder should focus on when generating the current translated token $y_j$. The weight associated with each encoder hidden state is determined by a feed-forward network. From the hidden state $s_j$ we can predict the conditional distribution of the next token $y_j$ with a fully-connected layer $W_o$, given the previous token's language embedding $e_{j-1}$, the current decoder hidden state $s_j$, and the context vector $c_j$ for the current step:

p(y_j \mid y_{j-1}, x) = \mathrm{softmax}(W_o o_t),    (6)

where $o_t = \tanh(W_e e_{j-1} + W_d s_j + W_c c_j)$. The three inputs are transformed with $W_e$, $W_d$, and $W_c$, respectively, and then summed before being fed into the output layer.

We train the translation objective by optimizing a cross entropy loss function:

J_T(\theta_T) = -\sum_j \log p(y_j \mid y_{j-1}, x)    (7)

By jointly optimizing the translation and multimodal shared space learning objectives, along with the visual-language attention mechanism, we can simultaneously learn a general mapping between the linguistic signals in two languages and the grounding of relevant visual content in the text to improve the translation.
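The deep-output layer of Eq. 6 together with the per-token cross-entropy of Eq. 7 can be sketched as below; this is an illustrative reading of the equations, not the released decoder, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutputLayer(nn.Module):
    """Sketch of Eqs. 6-7: deep-output prediction and per-token cross entropy."""

    def __init__(self, emb_dim: int = 256, dec_dim: int = 512,
                 enc_dim: int = 1024, vocab_size: int = 10000):
        super().__init__()
        self.W_e = nn.Linear(emb_dim, dec_dim, bias=False)
        self.W_d = nn.Linear(dec_dim, dec_dim, bias=False)
        self.W_c = nn.Linear(enc_dim, dec_dim, bias=False)
        self.W_o = nn.Linear(dec_dim, vocab_size)

    def forward(self, prev_emb, dec_state, context, target):
        # prev_emb: e_{j-1}; dec_state: s_j; context: c_j; target: gold ids y_j
        o = torch.tanh(self.W_e(prev_emb) + self.W_d(dec_state) + self.W_c(context))
        logits = self.W_o(o)                      # unnormalized p(y_j | y_{<j}, x)
        return F.cross_entropy(logits, target)    # negative log-likelihood (Eq. 7)
```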

4 IKEA Dataset

Previously available multimodal machine translation models have only been tested on image caption datasets. We therefore propose a new dataset, IKEA, that reflects the real-world application of international online shopping. We create the dataset by crawling commercial product descriptions and images from the IKEA and UNIQLO websites. There are 3,600 products, and we plan to expand the data in the future. Each sample is composed of the web-crawled English description of a product, an image of the product, and the web-crawled German or French description of the product.

Unlike in image caption datasets, the German or French sentences in the IKEA dataset are not exact parallel translations of their English sentences. Commercial descriptions in different languages can be vastly different in surface form but still keep the core semantic meaning. We therefore consider IKEA a good dataset for simulating real-world multimodal translation problems. Sentences in the IKEA dataset contain 60-70 words on average, which is five times longer than the average text length in Multi30K (Elliott et al., 2016). These sentences not only describe the visual appearance of the product, but also the product usage. Therefore, capturing the connection between visual semantics and the text is more challenging on this dataset.


Method                                   English→German            English→French
                                         BLEU        METEOR        BLEU        METEOR
Imagination (Elliott and Kádár, 2017)    30.2        51.2          N/A         N/A
LIUMCVC (Caglayan et al., 2017)          31.1 ± 0.7  52.2 ± 0.4    52.7 ± 0.9  69.5 ± 0.7
Text-Only NMT                            31.6 ± 0.5  52.2 ± 0.3    53.5 ± 0.7  70.0 ± 0.7
VAG-NMT                                  31.6 ± 0.3  52.2 ± 0.3    53.8 ± 0.3  70.3 ± 0.5

Table 1: Translation results on the Multi30K dataset

Method                                   English→German            English→French
                                         BLEU        METEOR        BLEU        METEOR
Imagination (Elliott and Kádár, 2017)    28.0        48.1          N/A         N/A
LIUMCVC (Caglayan et al., 2017)          27.1 ± 0.9  47.2 ± 0.6    43.5 ± 1.2  63.2 ± 0.9
Text-Only NMT                            27.9 ± 0.6  47.8 ± 0.6    44.6 ± 0.6  64.2 ± 0.5
VAG-NMT                                  28.3 ± 0.6  48.0 ± 0.5    45.0 ± 0.4  64.7 ± 0.4

Table 2: Translation results on the Ambiguous COCO dataset

The dataset statistics and an example from the IKEA dataset are given in Appendix A.

5 Experiments and Results

5.1 Datasets

We evaluate our proposed model on three datasets: Multi30K (Elliott et al., 2016), Ambiguous COCO (Elliott et al., 2017), and our newly-collected IKEA dataset. The Multi30K dataset is the largest existing human-labeled dataset for multimodal machine translation. It consists of 31,014 images, where each image is annotated with an English caption and manual translations of the caption in German and French. There are 29,000 instances for training, 1,014 instances for validation, and 1,000 for testing. We also evaluate our model on the Ambiguous COCO dataset collected in the WMT 2017 multimodal machine translation challenge (Elliott et al., 2017). It contains 461 images from the MSCOCO dataset (Lin et al., 2014), whose captions contain verbs with ambiguous meanings.

5.2 Setting

We pre-process all English, French, and German sentences by normalizing the punctuation, lowercasing, and tokenizing with the Moses toolkit. A Byte-Pair-Encoding (BPE) (Sennrich et al., 2015) model with 10K merge operations is learned from the pre-processed data and then applied to segment words. We restore the original words by concatenating the subwords segmented by BPE in post-processing. During training, we apply early stopping if there is no improvement in BLEU score on validation data for 10 validation steps. We apply beam search decoding to generate translations with a beam size of 12. We evaluate the performance of all models using BLEU (Papineni et al., 2002) and METEOR (Denkowski and Lavie, 2014). The settings used for the IKEA dataset are the same as for Multi30K, except that we lower the default batch size from 32 to 12; since the IKEA dataset has long sentences and large variance in sentence length, we use smaller batches to make the training more stable. Full details of our hyperparameter choices can be found in Appendix B. We run all models five times with different random seeds and report the mean and standard deviation.

5.3 Results

We compare the performance of our model against the state-of-the-art multimodal machine translation approaches and the text-only baseline. The idea of our model is inspired by the "Imagination" model (Elliott and Kádár, 2017), which, unlike our model, simply averages the encoder hidden states for visual grounding learning. As "Imagination" does not report its performance on Multi30K 2017 and Ambiguous COCO in its original paper, we directly use their results reported in the WMT 2017 shared task as a comparison. LIUMCVC is the best multimodal machine translation model in the WMT 2017 multimodal machine translation challenge and exploits visual information with several different methods. We always compare our VAG-NMT with the method that has been reported to have the best performance on each dataset.


Method           English→German            English→French
                 BLEU        METEOR        BLEU        METEOR
LIUMCVC-Multi    59.9 ± 1.9  63.8 ± 0.4    58.4 ± 1.6  64.6 ± 1.8
Text-Only NMT    61.9 ± 0.9  65.6 ± 0.9    65.2 ± 0.7  69.0 ± 0.2
VAG-NMT          63.5 ± 1.2  65.7 ± 0.1    65.8 ± 1.2  68.9 ± 1.4

Table 3: Translation results on the IKEA dataset

Our VAG-NMT surpasses the results of the "Imagination" model and the LIUMCVC model by a noticeable margin in terms of BLEU score on both the Multi30K dataset (Table 1) and the Ambiguous COCO dataset (Table 2). The METEOR score of our VAG-NMT is slightly worse than that of "Imagination" for English→German on the Ambiguous COCO dataset. This is likely because the "Imagination" result was produced by ensembling the results of multiple runs, which typically leads to 1-2 higher BLEU and METEOR points compared to a single run. Thus, we expect our VAG-NMT to outperform the "Imagination" baseline if we also use an ensemble.

We observe that our multimodal VAG-NMT model achieves equal or slightly better results compared to the text-only neural machine translation model on the Multi30K dataset. On the Ambiguous COCO dataset, our VAG-NMT demonstrates a clearer improvement over this text-only baseline. We suspect this is because Multi30K does not have many cases where images can help improve translation quality, as most of the image captions are short and simple. In contrast, Ambiguous COCO was purposely curated such that the verbs in the captions can have ambiguous meanings. Thus, visual context plays a more important role in Ambiguous COCO; namely, to help clarify the sense of the source text and guide the translation to select the correct word in the target language.

We then evaluate all models on the IKEA dataset. Table 3 shows the results. Our VAG-NMT has a higher BLEU score and a comparable METEOR score compared to the Text-Only NMT baseline. Our VAG-NMT outperforms LIUMCVC's multimodal system by a large margin, which shows that the LIUMCVC multimodal system's good performance on Multi30K does not generalize to this real-world product dataset. The good performance may come from the visual attention mechanism that learns to focus on text segments that are related to the images. Such attention therefore teaches the decoder to apply the visual context when translating those words. This learned attention is especially useful for datasets with long sentences that contain text irrelevant to the image.

(a) a cyclist is wearing a helmet

(b) a black dog and his favorite toys.

Figure 2: Top five images retrieved using the given caption. The original corresponding image of the caption is highlighted with a red bounding box.

5.4 Multimodal embedding results

To assess the learned joint embedding, we perform an image retrieval task evaluated using the Recall@K metric (Kiros et al., 2014) on the Multi30K dataset.

We project the image feature vectors $V = \{v_1, v_2, \ldots, v_n\}$ and their corresponding captions $S = \{s_1, s_2, \ldots, s_n\}$ into the shared space. We follow the experiments conducted in previous visual semantic embedding work (Kiros et al., 2014): for each embedded text vector, we find the closest image vectors based on cosine similarity. We then compute the recall rate of the paired image among the top K nearest neighbors, also known as the R@K score. The shared space learned with VAG-NMT achieves 64% R@1, 88.6% R@5, and 93.8% R@10 on Multi30K, which demonstrates the good quality of the learned shared semantics. We also achieve 58.13% R@1, 87.38% R@5, and 93.74% R@10 on the IKEA dataset, and 41.35% R@1, 85.48% R@5, and 92.56% R@10 on the COCO dataset. Besides the quantitative results, we also show several qualitative examples in Figure 2: the top five images retrieved for two example captions. The retrieved images share the "cyclist", "helmet", and "dog" mentioned in the captions.
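For reference, below is a minimal sketch of how R@K can be computed from the two sets of embeddings; it assumes paired captions and images share the same index and is not the authors' evaluation script.

```python
import torch
import torch.nn.functional as F

def recall_at_k(t_emb: torch.Tensor, v_emb: torch.Tensor, k: int = 5) -> float:
    """R@K for caption-to-image retrieval in the style of Kiros et al. (2014):
    rank all images by cosine similarity to each caption and check whether
    the paired image (same index) appears in the top k."""
    t = F.normalize(t_emb, dim=-1)
    v = F.normalize(v_emb, dim=-1)
    sims = t @ v.t()                                  # (num_captions, num_images)
    topk = sims.topk(k, dim=-1).indices               # indices of the k closest images
    gold = torch.arange(sims.size(0), device=sims.device).unsqueeze(1)
    hits = (topk == gold).any(dim=-1).float()
    return hits.mean().item()
```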


a person is skiing or snowboarding down a mountainside .

a mountain climber trekking through the snow with a pick and a blue hat .

the snowboarder is jumping in the snow .

two women are water rafting .

three people in a blue raft on a river of brown water .

people in rafts watch as two men fall out of their own rafts .

Figure 3: Visual-text attention scores on sample data from Multi30K. The first and second rows show the three closest images to the captions "a person is skiing or snowboarding down a mountainside" and "two women are water rafting", respectively. The original caption is listed under each image. We highlight the three words with the highest attention in red.


5.5 Human evaluation

We used Facebook to hire raters who speak both German and English to evaluate German translation quality. As Text-Only NMT has the highest BLEU results among all baseline models, we compare the translation quality between Text-Only NMT and VAG-NMT on all three datasets. We randomly selected 100 examples from each dataset for evaluation. The raters were instructed to focus more on semantic meaning than on grammatical correctness when indicating their preference between the two translations. They were also given the option to choose a tie if they could not decide. We hired two raters, and the inter-annotator agreement is 0.82 in Cohen's Kappa.

We summarize the results in Table 4, where we list the number of times that raters prefer one translation over another or think they are of the same quality. Our VAG-NMT performs better than Text-Only NMT on the MSCOCO and IKEA datasets, which correlates with the automatic evaluation metrics. However, the result of VAG-NMT is slightly worse than that of Text-Only NMT on the Multi30K test dataset. This also correlates with the results of the automatic evaluation metrics.

Finally, we also compare the translation quality between LIUMCVC multimodal and VAG-NMT on 100 randomly selected examples from the IKEA dataset. VAG-NMT performs better than LIUMCVC multimodal: the raters prefer our VAG-NMT in 91 cases, LIUMCVC multimodal in 68 cases, and could not decide in 47 cases.

                 MSCOCO   Multi30K   IKEA
Text-Only NMT    76       72         75
VAG-NMT          94       71         82
Tie              30       57         43

Table 4: Human evaluation results

6 Discussion

To demonstrate the effectiveness of our visual attention mechanism, we show some qualitative examples in Figure 3. Each row contains three images that share similar semantic meaning, retrieved by querying the image caption in our learned shared space. The original caption of each image is shown below the image. We highlight the three words in each caption that have the highest weights assigned by the visual attention mechanism.

In the first row of Figure 3, the attention mechanism assigns high weights to the words "skiing", "snowboarding", and "snow". In the second row, it assigns high attention to "rafting" or "raft" for every caption of the three images. These examples provide evidence that our attention mechanism learns to assign high weights to words that have corresponding visual semantics in the image.



Source Caption: a tennis player is moving to the side and is gripping his racquet with both hands .

Text-Only NMT: ein tennisspieler bewegt sich um die seite und greift mit beiden händen an den boden .

VAG-NMT: ein tennisspieler bewegt sich zur seite und greift mit beiden händen den schläger .

Source Caption: three skiers skiing on a hill with two going down the hill and one moving up the hill .

Text-Only NMT: drei skifahrer fahren auf skiern einen hügel hinunter und eine person fährt den hügel hinunter .

VAG-NMT: drei skifahrer auf einem hügel fahren einen hügel hinunter und ein bewegt sich den hügel hinauf .

Source Caption: a blue , yellow and green surfboard sticking out of a sandy beach .

Text-Only NMT: ein blau , gelb und grünes surfbrett streckt aus einem sandstrand .

VAG-NMT: ein blau , gelb und grüner surfbrett springt aus einem sandstrand .

Figure 4: Translations generated by VAG-NMT and Text-Only NMT. VAG-NMT performs better in the first two examples, while Text-Only NMT performs better in the third example. We highlight the words that distinguish the two systems' results in red and blue. Red words mark a better translation from VAG-NMT and blue words mark a better translation from Text-Only NMT.

We also find that our visual grounding attention captures the dependency between words that have strong visual semantic relatedness. For example, in Figure 3, words with high attention, such as "raft", "river", and "water", appear in the image together. This shows that the visual dependence information is encoded into the weighted sum of attention vectors that is applied to initialize the translation decoder. When we apply the sequence-to-sequence model to translate a long sentence, the encoded visual dependence information strengthens the connection between the words with visual semantic relatedness. Such connections mitigate the problem of standard sequence-to-sequence models tending to forget distant history. This hypothesis is supported by the fact that our VAG-NMT outperforms all the other methods on the IKEA dataset, which has long sentences.

Lastly, in Figure 4 we provide some qualitative comparisons between the translations from VAG-NMT and Text-Only NMT. In the first example, our VAG-NMT properly translates the word "racquet" to "den schläger", while Text-Only NMT mistranslates it to "den boden", which means "ground" in English. We suspect the attention mechanism and visual shared space capture the visual dependence between the words "tennis" and "racquet". In the second example, our VAG-NMT model correctly translates the preposition "up" to "hinauf", but Text-Only NMT mistranslates it to "hinunter", which means "down" in English. We consistently observe that VAG-NMT translates prepositions better than Text-Only NMT. We think this is because the pre-trained convnet features capture the relative object positions, which leads to a better preposition choice. Finally, in the third example, we show a failure case where Text-Only NMT generates a better translation. Our VAG-NMT mistranslates the verb phrase "sticking out" to "springt aus", which means "jumps out" in English, while Text-Only NMT translates it to "streckt aus", which is correct. We find that VAG-NMT often makes mistakes when translating verbs. We think this is because the image vectors are pre-trained on an object classification task, which does not have any human action information.

7 Conclusion and Future Work

We proposed a visual grounding attention structure to take advantage of visual information for machine translation. The visual attention mechanism and visual context grounding module help to integrate the visual content into the sequence-to-sequence model, which leads to better translation quality compared to the model with only text information. We achieved state-of-the-art results on the Multi30K and Ambiguous COCO datasets. We also proposed a new product dataset, IKEA, to simulate a real-world online product description translation challenge.

In the future, we will continue exploring different methods to ground the visual context into the translation model, such as learning a multimodal shared space across image, source language text, and target language text. We also want to change the visual pre-training model from an image classification dataset to other datasets that have both objects and actions, to further improve translation performance.

Acknowledgments

We would like to thank Daniel Boyla for providing insightful discussions to help with this research. We also want to thank Chunting Zhou and Ozan Caglayan for suggestions on machine translation model implementation. This work was supported in part by NSF CAREER IIS-1751206.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Ozan Caglayan, Walid Aransa, Adrien Bardet, Mercedes García-Martínez, Fethi Bougares, Loïc Barrault, Marc Masana, Luis Herranz, and Joost van de Weijer. 2017. LIUM-CVC submissions for WMT17 multimodal translation task. CoRR, abs/1707.04481.

Iacer Calixto, Qun Liu, and Nick Campbell. 2017a. Doubly-attentive decoder for multi-modal neural machine translation. CoRR, abs/1702.01287.

Iacer Calixto, Qun Liu, and Nick Campbell. 2017b. Incorporating global visual features into attention-based neural machine translation. CoRR, abs/1701.06521.

Iacer Calixto, Qun Liu, and Nick Campbell. 2017c. Multilingual multi-modal embeddings for natural language processing. CoRR, abs/1702.01101.

Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078.

Grzegorz Chrupala, Ákos Kádár, and Afra Alishahi. 2015. Learning language through pictures. CoRR, abs/1506.03694.

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation.

Desmond Elliott, Stella Frank, Loïc Barrault, Fethi Bougares, and Lucia Specia. 2017. Findings of the second shared task on multimodal machine translation and multilingual image description. CoRR, abs/1710.07177.

Desmond Elliott, Stella Frank, Khalil Sima'an, and Lucia Specia. 2016. Multi30K: Multilingual English-German image descriptions. CoRR, abs/1605.00459.

Desmond Elliott and Ákos Kádár. 2017. Imagination improves multimodal translation. CoRR, abs/1705.04350.

Spandana Gella, Rico Sennrich, Frank Keller, and Mirella Lapata. 2017. Image pivoting for learning multilingual multimodal representations. CoRR, abs/1707.07601.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015a. Deep residual learning for image recognition. CoRR, abs/1512.03385.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015b. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. CoRR, abs/1502.01852.

Jindrich Helcl and Jindrich Libovický. 2017. CUNI system for the WMT17 multimodal translation task. CoRR, abs/1707.04550.

Po-Yao Huang, Frederick Liu, Sz-Rung Shiang, Jean Oh, and Chris Dyer. 2016. Attention-based multimodal neural machine translation. In WMT.

Andrej Karpathy and Fei-Fei Li. 2014. Deep visual-semantic alignments for generating image descriptions. CoRR, abs/1412.2306.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. CoRR, abs/1411.2539.

Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312.

Mingbo Ma, Dapeng Li, Kai Zhao, and Liang Huang. 2017. OSU multimodal machine translation system report. CoRR, abs/1710.02718.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2012. Understanding the exploding gradient problem. CoRR, abs/1211.5063.


Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 91–99, Cambridge, MA, USA. MIT Press.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252.

M. Schuster and K.K. Paliwal. 1997. Bidirectional recurrent neural networks. Trans. Sig. Proc., 45(11):2673–2681.

Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. 2017. Nematus: a toolkit for neural machine translation. CoRR, abs/1703.04357.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. CoRR, abs/1508.07909.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.

Jingyi Zhang, Masao Utiyama, Eiichiro Sumita, Graham Neubig, and Satoshi Nakamura. 2017. NICT-NAIST system for WMT17 multimodal translation task. In WMT.

A IKEA Dataset Stats and Examples

We summarize the statistics of our IKEA dataset in Figure 5, where we report the number of tokens, the length of the product descriptions, and the vocabulary size. We provide one example from the IKEA dataset in Figure 6.

Figure 5: Statistics of the IKEA dataset

(a) Product image

(b) Source description

(c) Target description in German

Figure 6: An example of a product description and the corresponding translation in German from the IKEA dataset. Both descriptions provide an accurate caption for the commercial characteristics of the product in the image, but the details in the descriptions are different.

B Hyperparameter Settings

In this appendix, we share details on the hyperparameter settings for our model and the training process. The word embedding size for both the encoder and decoder is 256. The encoder is a one-layer bidirectional recurrent neural network with Gated Recurrent Units (GRU), which has a hidden size of 512. The decoder is a recurrent neural network with conditional GRU of the same hidden size. The visual representation is a 2048-dimensional vector extracted from the pool5 layer of a pre-trained ResNet-50 network. The dimension of the shared visual-text semantic embedding space is 512. We set the decoder initialization weight value $\lambda$ to 0.5.

During training, we use Adam (Kingma and Ba, 2014) to optimize our model with a learning rate of 4e-4 for the German dataset and 1e-3 for the French dataset. The batch size is 32. The total gradient norm is clipped to 1 (Pascanu et al., 2012). Dropout is applied at the embedding layer in the encoder, the context vectors extracted from the encoder, and the output layer of the decoder. For the Multi30K German dataset the dropout probabilities are (0.3, 0.5, 0.5), and for the Multi30K French dataset the dropout probabilities are (0.2, 0.4, 0.4). For the multimodal shared space learning objective function, the margin size $\gamma$ is set to 0.1. The objective split weight $\alpha$ is set to 0.99. We initialize the weights of all the parameters with the method introduced in (He et al., 2015b).
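A minimal sketch of this optimization setup is shown below, assuming PyTorch and a placeholder module standing in for VAG-NMT; it illustrates only the optimizer and gradient clipping described above.

```python
import torch
import torch.nn as nn

# Placeholder module for illustration only; not the VAG-NMT implementation.
model = nn.GRU(256, 512)
# Adam with the German-dataset learning rate of 4e-4.
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)

def training_step(loss: torch.Tensor) -> None:
    optimizer.zero_grad()
    loss.backward()
    # Clip the total gradient norm to 1 (Pascanu et al., 2012).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```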

C Ablation Analysis on Visual-Text Attention

We conducted an ablation test to further evaluate the effectiveness of our visual-text attention mechanism. We created two comparison experiments where we reduced the impact of visual-text attention with different design options. In the first experiment, we remove the visual-attention mechanism from our pipeline and simply use the mean of the encoder hidden states to learn the shared embedding space. In the second experiment, we initialize the decoder with just the mean of the encoder hidden states, without the weighted sum of encoder states using the learned visual-text attention scores.

We run both experiments on the Multi30K German dataset five times and show the results in Table 5. The performance of both modified models is clearly worse than that of the full VAG-NMT. This observation suggests that the visual-attention mechanism is critical to improving translation performance. The improvement comes from the attention mechanism influencing both the model's objective function and the decoder's initialization.

Method                            English→German
                                  BLEU        METEOR
-attention in shared embedding    30.5 ± 0.6  51.7 ± 0.4
-attention in initialization      30.8 ± 0.8  51.9 ± 0.5
VAG-NMT                           31.6 ± 0.6  52.2 ± 0.3

Table 5: Ablation analysis of the visual-text attention mechanism on the Multi30K German dataset.