

Proceedings of the Conference on Machine Translation (WMT), Volume 1: Research Papers, pages 68–79, Copenhagen, Denmark, September 7–11, 2017. © 2017 Association for Computational Linguistics

Predicting Target Language CCG Supertags Improves Neural Machine Translation

Maria Nadejde1 and Siva Reddy1 and Rico Sennrich1 and Tomasz Dwojak1,2

Marcin Junczys-Dowmunt2 and Philipp Koehn3 and Alexandra Birch1

1 School of Informatics, University of Edinburgh   2 Adam Mickiewicz University

3 Dep. of Computer Science, Johns Hopkins University
{m.nadejde, siva.reddy, rico.sennrich, a.birch}@ed.ac.uk
{t.dwojak,junczys}@amu.edu.pl, [email protected]

Abstract

Neural machine translation (NMT) models are able to partially learn syntactic information from sequential lexical information. Still, some complex syntactic phenomena such as prepositional phrase attachment are poorly modeled. This work aims to answer two questions: 1) Does explicitly modeling target language syntax help NMT? 2) Is tight integration of words and syntax better than multitask training? We introduce syntactic information in the form of CCG supertags in the decoder, by interleaving the target supertags with the word sequence. Our results on WMT data show that explicitly modeling target-syntax improves machine translation quality for German→English, a high-resource pair, and for Romanian→English, a low-resource pair, and also several syntactic phenomena including prepositional phrase attachment. Furthermore, a tight coupling of words and syntax improves translation quality more than multitask training. By combining target-syntax with adding source-side dependency labels in the embedding layer, we obtain a total improvement of 0.9 BLEU for German→English and 1.2 BLEU for Romanian→English.

1 Introduction

Sequence-to-sequence neural machine translation (NMT) models (Sutskever et al., 2014; Cho et al., 2014b; Bahdanau et al., 2015) are state-of-the-art on a multitude of language pairs (Sennrich et al., 2016a; Junczys-Dowmunt et al., 2016). Part of the appeal of neural models is that they can learn to implicitly model phenomena which underlie high quality output, and some syntax is indeed captured by these models. In a detailed analysis, Bentivogli et al. (2016) show that NMT significantly improves over phrase-based SMT, in particular with respect to morphology and word order, but that results can still be improved for longer sentences and complex syntactic phenomena such as prepositional phrase (PP) attachment. Another study by Shi et al. (2016) shows that the encoder layer of NMT partially learns syntactic information about the source language, however complex syntactic phenomena such as coordination or PP attachment are poorly modeled.

Recent work which incorporates additional source-side linguistic information in NMT models (Luong et al., 2016; Sennrich and Haddow, 2016) shows that even though neural models have strong learning capabilities, explicit features can still improve translation quality. In this work, we examine the benefit of incorporating global syntactic information on the target-side. We also address the question of how best to incorporate this information. For language pairs where syntactic resources are available on both the source and target-side, we show that approaches to incorporate source syntax and target syntax are complementary.

We propose a method for tightly coupling words and syntax by interleaving the target syntactic representation with the word sequence. We compare this to loosely coupling words and syntax using a multitask solution, where the shared parts of the model are trained to produce either a target sequence of words or supertags, in a similar fashion to Luong et al. (2016).

We use CCG syntactic categories (Steedman, 2000), also known as supertags, to represent syntax explicitly. Supertags provide global syntactic information locally at the lexical level. They encode subcategorization information, capturing short and long range dependencies and attachments, and also tense and morphological aspects of the word in a given context. Consider the sentence in Figure 1. This sentence contains two PP attachments and could lead to several disambiguation possibilities ("in" can attach to "Netanyahu" or "receives", and "of" can attach to "capital", "Netanyahu" or "receives"). These alternatives may lead to different translations in other languages. However the supertag ((S[dcl]\NP)/PP)/NP of "receives" indicates that the preposition "in" attaches to the verb, and the supertag (NP\NP)/NP of "of" indicates that it attaches to "capital", thereby resolving the ambiguity.

Our research contributions are as follows:

• We propose a novel approach to integrating target syntax at word level in the decoder, by interleaving CCG supertags in the target word sequence.

• We show that the target language syntax improves translation quality for German→English and Romanian→English as measured by BLEU. Our results suggest that a tight coupling of target words and syntax (by interleaving) improves translation quality more than the decoupled signal from multitask training.

• We show that incorporating source-side linguistic information is complementary to our method, further improving the translation quality.

• We present a fine-grained analysis of SNMT and show consistent gains for different linguistic phenomena and sentence lengths.

2 Related work

Syntax has helped in statistical machine translation (SMT) to capture dependencies between distant words that impact morphological agreement, subcategorisation and word order (Galley et al., 2004; Menezes and Quirk, 2007; Williams and Koehn, 2012; Nadejde et al., 2013; Sennrich, 2015; Nadejde et al., 2016a,b; Chiang, 2007). There has been some work in NMT on modeling source-side syntax implicitly or explicitly. Kalchbrenner and Blunsom (2013); Cho et al. (2014a) capture the hierarchical aspects of language implicitly by using convolutional neural networks, while Eriguchi et al. (2016) use the parse tree of the source sentence to guide the recurrence and attention model in tree-to-sequence NMT. Luong et al. (2016) co-train a translation model and a source-side syntactic parser which share the encoder. Our multitask models extend their work to attention-based NMT models and to predicting target-side syntax as the secondary task. Sennrich and Haddow (2016) generalize the embedding layer of NMT to include explicit linguistic features such as dependency relations and part-of-speech tags, and we use their framework to show source and target syntax provide complementary information.

Applying more tightly coupled linguistic factors on the target for NMT has been previously investigated. Niehues et al. (2016) proposed a factored RNN-based language model for re-scoring an n-best list produced by a phrase-based MT system. In recent work, Martínez et al. (2016) implemented a factored NMT decoder which generated both lemmas and morphological tags. The two factors were then post-processed to generate the word form. Unfortunately no real gain was reported for these experiments. Concurrently with our work, Aharoni and Goldberg (2017) proposed serializing the target constituency trees, and Eriguchi et al. (2017) model target dependency relations by augmenting the NMT decoder with an RNN grammar (Dyer et al., 2016). In our work, we use CCG supertags which are a more compact representation of global syntax. Furthermore, we do not focus on model architectures, and instead we explore the more general problem of including target syntax in NMT: comparing tightly and loosely coupled syntactic information and showing source and target syntax are complementary.

Previous work on integrating CCG supertags in factored phrase-based models (Birch et al., 2007) made strong independence assumptions between the target word sequence and the CCG categories. In this work we take advantage of the expressive power of recurrent neural networks to learn representations that generate both words and CCG supertags, conditioned on the entire lexical and syntactic target history.

3 Modeling Syntax in NMT

CCG is a lexicalised formalism in which words are assigned with syntactic categories, i.e., supertags, that indicate context-sensitive morpho-syntactic properties of a word in a sentence. The combinators of CCG allow the supertags to capture global syntactic constraints locally.


Source-side
BPE: Obama receives Net+ an+ yahu in the capital of USA
IOB: O O B I E O O O O O
CCG: NP ((S[dcl]\NP)/PP)/NP NP NP NP PP/NP NP/N N (NP\NP)/NP NP

Target-side
NP Obama ((S[dcl]\NP)/PP)/NP receives NP Net+ an+ yahu PP/NP in NP/N the N capital (NP\NP)/NP of NP USA

Figure 1: Source and target representation of syntactic information in syntax-aware NMT.

Though NMT captures long range dependencies using long-term memory, short-term memory is cheap and reliable. Supertags can help by allowing the model to rely more on local information (short-term) and not having to rely heavily on long-term memory.

Consider a decoder that has to generate the following sentences:

1. What_(S[wq]/(S[q]/NP))/N city is_(S[q]/PP)/NP the Taj Mahal in?

2. Where_S[wq]/(S[q]/NP) is_(S[q]/NP)/NP the Taj Mahal?

If the decoding starts with predicting "What", it is ungrammatical to omit the preposition "in", and if the decoding starts with predicting "Where", it is ungrammatical to predict the preposition. Here the decision to predict "in" depends on the first word, a long range dependency. However if we rely on CCG supertags, the supertags of both these sequences look very different. The supertag (S[q]/PP)/NP for the verb "is" in the first sentence indicates that a preposition is expected in future context. Furthermore it is likely to see this particular supertag of the verb in the context of (S[wq]/(S[q]/NP))/N but it is unlikely in the context of S[wq]/(S[q]/NP). Therefore a succession of local decisions based on CCG supertags will result in the correct prediction of the preposition in the first sentence, and omitting the preposition in the second sentence. Since the vocabulary of CCG supertags is much smaller than that of possible words, the NMT model will do a better job at generalizing over and predicting the correct CCG supertag sequence.

CCG supertags also help during encoding if they are given in the input, as we saw with the case of PP attachment in Figure 1. Translation of the correct verb form and agreement can be improved with CCG since supertags also encode tense, morphology and agreements. For example, in the sentence "It is going to rain", the supertag (S[ng]\NP[expl])/(S[to]\NP) of "going" indicates the current word is a verb in continuous form looking for an infinitive construction on the right, and an expletive pronoun on the left.

We explore the effect of target-side syntax by using CCG supertags in the decoder and by combining these with source-side syntax in the encoder, as follows.

Baseline decoder The baseline decoder architecture is a conditional GRU with attention (cGRU_att) as implemented in the Nematus toolkit (Sennrich et al., 2017). The decoder is a recursive function computing a hidden state s_j at each time step j ∈ [1, T] of the target recurrence. This function takes as input the previous hidden state s_{j-1}, the embedding of the previous target word y_{j-1} and the output of the attention model c_j. The attention model computes a weighted sum over the hidden states h_i = [→h_i; ←h_i] of the bidirectional RNN encoder. The function g computes the intermediate representation t_j and passes this to a softmax layer which first applies a linear transformation (W_o) and then computes the probability distribution over the target vocabulary. The training objective for the entire architecture is minimizing the discrete cross-entropy, therefore the loss l is the negative log-probability of the reference sentence.

s'_j = GRU_1(y_{j-1}, s_{j-1})    (1)

c_j = ATT([h_1; \ldots; h_{|x|}], s'_j)    (2)

s_j = cGRU_{att}(y_{j-1}, s_{j-1}, c_j)    (3)

t_j = g(y_{j-1}, s_j, c_j)    (4)

p_y = \prod_{j=1}^{T} p(y_j \mid x, y_{1:j-1}) = \prod_{j=1}^{T} softmax(t_j W_o)    (5)

l = -\log(p_y)    (6)
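To make the flow of equations (1)-(6) concrete, the following is a toy NumPy sketch of a single decoder step. It is only an illustration of how the quantities feed into one another, not the Nematus cGRU implementation: the two GRU transitions are collapsed into single tanh layers, the attention scorer is simplified, and all parameters and dimensions are random stand-ins.

```python
# Toy sketch of one decoder step (eqs. 1-6). GRU transitions are replaced by
# single tanh layers and parameters are random stand-ins; this is not the
# Nematus cGRU implementation, only an illustration of the data flow.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

emb, hid, enc, vocab, src_len = 8, 16, 12, 50, 7       # hypothetical toy dimensions

W1 = rng.normal(size=(hid, emb + hid))                 # stand-in for GRU_1 (eq. 1)
Watt = rng.normal(size=(2 * enc + hid,))               # stand-in for the attention scorer
W2 = rng.normal(size=(hid, 2 * enc + hid))             # stand-in for the second GRU transition
Wo = rng.normal(size=(emb + hid + 2 * enc, vocab))     # output projection W_o

H = rng.normal(size=(src_len, 2 * enc))                # encoder states h_i = [->h_i; <-h_i]
s_prev = np.zeros(hid)                                 # previous decoder state s_{j-1}
y_prev = rng.normal(size=emb)                          # embedding of previous target symbol y_{j-1}

s_prime = np.tanh(W1 @ np.concatenate([y_prev, s_prev]))           # eq. 1: intermediate state s'_j
scores = np.array([Watt @ np.concatenate([h, s_prime]) for h in H])
c = softmax(scores) @ H                                             # eq. 2: context vector c_j
s = np.tanh(W2 @ np.concatenate([c, s_prime]))                      # eq. 3: new decoder state s_j
t = np.concatenate([y_prev, s, c])                                  # eq. 4: t_j = g(y_{j-1}, s_j, c_j)
p = softmax(t @ Wo)                                                 # eq. 5: distribution over the target vocabulary
print(p.shape, round(p.sum(), 3))                                   # (50,) 1.0
```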

Target-side syntax When modeling the target-side syntactic information we consider different strategies of coupling the CCG supertags with the translated words in the decoder: interleaving and multitasking with shared encoder. In Figure 2 we represent graphically the differences between the two strategies and in the next paragraphs we formalize them.

Figure 2: Integrating target syntax in the NMT decoder: a) interleaving and b) multitasking.

• Interleaving In this paper we propose a tight integration in the decoder of the syntactic representation and the surface forms. Before each word of the target sequence we include its supertag as an extra token. The new target sequence y' will have the length 2T, where T is the number of target words. With this representation, a single decoder learns to predict both the target supertags and the target words conditioned on previous syntactic and lexical context. We do not make changes to the baseline NMT decoder architecture, keeping equations (1) - (6) and the corresponding set of parameters unchanged. Instead, we augment the target vocabulary to include both words and CCG supertags. This results in a shared embedding space and the following probability of the target sequence y', where y'_j can be either a word or a tag:

y' = y^{tag}_1, y^{word}_1, \ldots, y^{tag}_T, y^{word}_T    (7)

p_{y'} = \prod_{j=1}^{2T} p(y'_j \mid x, y'_{1:j-1})    (8)

At training time we pre-process the target sequence to add the syntactic annotation and then split only the words into byte-pair-encoding (BPE) (Sennrich et al., 2016b) sub-units. At testing time we delete the predicted CCG supertags to obtain the final translation. Figure 1 gives an example of the target-side representation in the case of interleaving. The supertag NP corresponding to the word Netanyahu is included only once before the three BPE subunits Net+ an+ yahu.
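As a concrete illustration of this pre- and post-processing, here is a small Python sketch that inserts each word's supertag once before its BPE subunits and strips predicted supertags at test time. The BPE segmentation function and the "+" continuation marker mirror Figure 1 but are illustrative assumptions, not the exact pipeline scripts used in the paper.

```python
# Sketch of the interleaving pre/post-processing described above (assumptions:
# one supertag per original word, BPE subunits marked with a trailing "+").

def interleave(words, tags, bpe_segment):
    """Insert each word's supertag once, before all of its BPE subunits."""
    assert len(words) == len(tags)
    out = []
    for word, tag in zip(words, tags):
        out.append(tag)                  # the supertag token precedes the word
        out.extend(bpe_segment(word))    # followed by the word's subword units
    return out

def strip_tags(tokens, tag_vocab):
    """At test time, delete predicted supertags to recover the translation."""
    return [tok for tok in tokens if tok not in tag_vocab]

# Toy example mirroring Figure 1 (the BPE segmentation is hypothetical).
bpe = lambda w: ["Net+", "an+", "yahu"] if w == "Netanyahu" else [w]
words = ["Obama", "receives", "Netanyahu"]
tags = ["NP", "((S[dcl]\\NP)/PP)/NP", "NP"]

target = interleave(words, tags, bpe)
print(target)                         # ['NP', 'Obama', '((S[dcl]\\NP)/PP)/NP', 'receives', 'NP', 'Net+', 'an+', 'yahu']
print(strip_tags(target, set(tags)))  # ['Obama', 'receives', 'Net+', 'an+', 'yahu']
```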

• Multitasking – shared encoder A loose coupling of the syntactic representation and the surface forms can be achieved by co-training a translation model with a secondary prediction task, in our case CCG supertagging. In the multitask framework (Luong et al., 2016) the encoder part is shared while the decoder is different for each of the prediction tasks: translation and tagging. In contrast to Luong et al., we train a separate attention model for each task and perform multitask learning with target syntax. The two decoders take as input the same source context, represented by the encoder's hidden states h_i = [→h_i; ←h_i]. However, each task has its own set of parameters associated with the five components of the decoder: GRU_1, ATT, cGRU_att, g, softmax. Furthermore, the two decoders may predict a different number of target symbols, resulting in target sequences of different lengths T_1 and T_2. This results in two probability distributions over separate target vocabularies for the words and the tags:

p^{word}_y = \prod_{j=1}^{T_1} p(y^{word}_j \mid x, y^{word}_{1:j-1})    (9)

p^{tag}_y = \prod_{k=1}^{T_2} p(y^{tag}_k \mid x, y^{tag}_{1:k-1})    (10)

The final loss is the sum of the losses for the two decoders:

l = -(\log(p^{word}_y) + \log(p^{tag}_y))    (11)

We use EasySRL to label the English side of the parallel corpus with CCG supertags [1] instead of using a corpus with gold annotations as in Luong et al. (2016).
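A minimal sketch of how the multitask objective in equation (11) could be assembled is shown below, assuming one shared encoder and two task-specific decoders with their own attention and vocabularies. The ToyDecoder class and its interface are hypothetical placeholders standing in for the real model components, not Nematus code.

```python
# Sketch of the multitask objective (eq. 11): a shared encoder feeds two
# decoders (translation and CCG supertagging) with separate parameters.
import math

class ToyDecoder:
    """Hypothetical stand-in for a task-specific decoder (own attention, GRU,
    softmax). It returns a uniform distribution; a real decoder conditions on
    the encoder states and the target history."""
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
    def initial_state(self, enc_states):
        return None
    def step(self, y_prev, state, enc_states):
        return [1.0 / self.vocab_size] * self.vocab_size, state

def neg_log_prob(decoder, enc_states, targets, bos=0):
    # -log prod_j p(y_j | x, y_{1:j-1}) for one task (eqs. 9 and 10)
    nll, state, y_prev = 0.0, decoder.initial_state(enc_states), bos
    for y in targets:
        probs, state = decoder.step(y_prev, state, enc_states)
        nll -= math.log(probs[y])
        y_prev = y
    return nll

def multitask_loss(enc_states, word_decoder, tag_decoder, words, tags):
    # eq. 11: the two task losses over the same encoder states are summed;
    # the word and tag sequences may have different lengths T_1 and T_2.
    return (neg_log_prob(word_decoder, enc_states, words)
            + neg_log_prob(tag_decoder, enc_states, tags))

enc_states = [[0.0] * 4 for _ in range(5)]   # dummy shared encoder output
print(multitask_loss(enc_states, ToyDecoder(85000), ToyDecoder(500), [3, 7, 2], [1, 4]))
```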

Source-side syntax – shared embedding While our focus is on target-side syntax, we also experiment with including source-side syntax to show that the two approaches are complementary.

Sennrich and Haddow propose a framework for including source-side syntax as extra features in the NMT encoder. They extend the model of Bahdanau et al. by learning a separate embedding for several source-side features such as the word itself or its part-of-speech. All feature embeddings are concatenated into one embedding vector which is used in all parts of the encoder model instead of the word embedding. When modeling the source-side syntactic information, we include the CCG supertags or dependency labels as extra features. The baseline features are the subword units obtained using BPE together with the annotation of the subword structure using IOB format by marking if a symbol in the text forms the beginning (B), inside (I), or end (E) of a word. A separate tag (O) is used if a symbol corresponds to the full word. The word level supertag is replicated for each BPE unit. Figure 1 gives an example of the source-side feature representation.
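The sketch below illustrates this shared-embedding idea: per-feature lookup tables whose vectors are concatenated into a single source embedding. The total size of 500 and the 10 dimensions for dependency labels follow Section 4.1, but the split of the remaining dimensions between the subword and IOB embeddings, the label set, and the lookup tables themselves are illustrative assumptions.

```python
# Sketch of the shared-embedding source representation: embeddings of the
# subword unit, its IOB tag and its dependency label are concatenated and
# replace the plain word embedding. Dimension split and labels are assumed.
import numpy as np

rng = np.random.default_rng(1)

def make_table(symbols, dim):
    return {sym: rng.normal(size=dim) for sym in symbols}

subword_table = make_table(["Obama", "receives", "Net+", "an+", "yahu"], 486)
iob_table     = make_table(["O", "B", "I", "E"], 4)
dep_table     = make_table(["nsubj", "root", "obj"], 10)

def embed(subword, iob, dep):
    # Concatenation of all feature embeddings: 486 + 4 + 10 = 500 dimensions.
    return np.concatenate([subword_table[subword], iob_table[iob], dep_table[dep]])

source = [("Obama", "O", "nsubj"), ("receives", "O", "root"),
          ("Net+", "B", "obj"), ("an+", "I", "obj"), ("yahu", "E", "obj")]
X = np.stack([embed(*tok) for tok in source])
print(X.shape)   # (5, 500)
```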

4 Experimental Setup and Evaluation

4.1 Data and methods

We train the neural MT systems on all the parallel data available at WMT16 (Bojar et al., 2016) for the German↔English and Romanian↔English language pairs. The English side of the training data is annotated with CCG lexical tags [2] using EasySRL (Lewis et al., 2015) and the available pre-trained model [3]. Some longer sentences cannot be processed by the parser and therefore we eliminate them from our training and test data. We report the sentence counts for the filtered data sets in Table 1.

[1] We use the same data and annotations for the interleaving approach.

[2] The CCG tags include features such as the verb tense (e.g. [ng] for continuous form) or the sentence type (e.g. [pss] for passive).

[3] https://github.com/uwnlp/EasySRL

         train       dev     test
DE-EN    4,468,314   2,986   2,994
RO-EN    605,885     1,984   1,984

Table 1: Number of sentences in the training, development and test sets.

Dependency labels are annotated with ParZU (Sennrich et al., 2013) for German and SyntaxNet (Andor et al., 2016) for Romanian.

All the neural MT systems are attentional encoder-decoder networks (Bahdanau et al., 2015) as implemented in the Nematus toolkit (Sennrich et al., 2017). [4] We use similar hyper-parameters to those reported by Sennrich et al. (2016a) and Sennrich and Haddow (2016), with minor modifications: we used mini-batches of size 60 and the Adam optimizer (Kingma and Ba, 2014). We select the best single models according to BLEU on the development set and use the four best single models for the ensembles.

To show that we report results over strong baselines, Table 2 compares the scores obtained by our baseline system to the ones reported in Sennrich et al. (2016a). We normalize diacritics [5] for the English→Romanian test set. We did not remove or normalize Romanian diacritics for the other experiments reported in this paper. Our baseline systems are generally stronger than Sennrich et al. (2016a) due to training with a different optimizer for more iterations.

            This work   Sennrich et al.
DE→EN       31.0        28.5
EN→DE       27.8        26.8
RO→EN       28.0        27.8
EN→RO (a)   25.6        23.9

Table 2: Comparison of baseline systems in this work and in Sennrich et al. (2016a). Case-sensitive BLEU scores reported over newstest2016 with mteval-13a.perl. (a) Normalized diacritics.

During training we validate our models with BLEU (Papineni et al., 2002) on development sets: newstest2013 for German↔English and newsdev2016 for Romanian↔English.

[4] https://github.com/rsennrich/nematus

[5] There are different encodings for letters with cedilla (ş, ţ) used interchangeably throughout the corpus. https://en.wikipedia.org/wiki/Romanian_alphabet#ISO_8859


We evaluate the systems on newstest2016 test sets for both language pairs and use bootstrap resampling (Riezler and Maxwell, 2005) to test statistical significance. We compute BLEU with multi-bleu.perl over tokenized sentences both on the development sets, for early stopping, and on the test sets for evaluating our systems.

Words are segmented into sub-units that are learned jointly for source and target using BPE (Sennrich et al., 2016b), resulting in a vocabulary size of 85,000. The vocabulary size for CCG supertags was 500.

For the experiments with source-side features we use the BPE sub-units and the IOB tags as baseline features. We keep the total word embedding size fixed to 500 dimensions. We allocate 10 dimensions for dependency labels when using these as source-side features, and when using source-side CCG supertags we allocate 135 dimensions.

The interleaving approach to integrating target syntax increases the length of the target sequence. Therefore, at training time, when adding the CCG supertags in the target sequence we increase the maximum length of sentences from 50 to 100. On average, the length of English sentences for newstest2013 in BPE representation is 22.7, while the average length when adding the CCG supertags is 44. Increasing the length of the target recurrence results in larger memory consumption and slower training. [6] At test time, we obtain the final translation by post-processing the predicted target sequence to remove the CCG supertags.

4.2 Results

In this section, we first evaluate the syntax-aware NMT model (SNMT) with target-side CCG supertags as compared to the baseline NMT model described in the previous section (Bahdanau et al., 2015; Sennrich et al., 2016a). We show that our proposed method for tightly coupling target syntax via interleaving improves translation for both German→English and Romanian→English while the multitasking framework does not. Next, we show that SNMT with target-side CCG supertags can be complemented with source-side dependencies, and that combining both types of syntax brings the most improvement. Finally, our experiments with source-side CCG supertags confirm that global syntax can improve translation either as extra information in the encoder or in the decoder.

[6] Roughly 10h30 per 100,000 sentences (20,000 batches) for SNMT compared to 6h for NMT.


Target-side syntax We first evaluate the impact of target-side CCG supertags on overall translation quality. In Table 3 we report results for German→English, a high-resource language pair, and for Romanian→English, a low-resource language pair. We report BLEU scores for both the best single models and ensemble models. However, we will only refer to the results with ensemble models since these are generally better.

The SNMT system with target-side syntax improves BLEU scores by 0.9 for Romanian→English and by 0.6 for German→English. Although the training data for German→English is large, the CCG supertags still improve translation quality. These results suggest that the baseline NMT decoder benefits from modeling the global syntactic information locally via supertags.

Next, we evaluate whether there is a benefit to tight coupling between the target word sequence and syntax, as opposed to loose coupling. We compare our method of interleaving the CCG supertags with multitasking, which predicts target CCG supertags as a secondary task. The results in Table 3 show that the multitask approach does not improve BLEU scores for German→English, which exhibits long distance word reordering. For Romanian→English, which exhibits more local word reordering, multitasking improves BLEU by 0.6 relative to the baseline. In contrast, the interleaving approach improves translation quality for both language pairs and to a larger extent. Therefore, we conclude that a tight integration of the target syntax and word sequence is important. Conditioning the prediction of words on their corresponding CCG supertags is what sets SNMT apart from the multitasking approach.

Source-side and target-side syntax We now show that our method for integrating target-side syntax can be combined with the framework of Sennrich and Haddow (2016) for integrating source-side linguistic information, leading to further improvement in translation quality. We evaluate the syntax-aware NMT system, with CCG supertags as target-syntax and dependency labels as source-syntax. While the dependency labels do not encode global syntactic information, they disambiguate the grammatical function of words.


model          syntax           strategy            German→English      Romanian→English
                                                    single   ensemble   single   ensemble
NMT            -                -                   31.0     32.1       28.1     28.4
SNMT           target – CCG     interleaving        32.0     32.7*      29.2     29.3**
Multitasking   target – CCG     shared encoder      31.4     32.0       28.4     29.0*
SNMT           source – dep     shared embedding    31.4     32.2       28.2     28.9
               + target – CCG   + interleaving      32.1     33.0**     29.1     29.6**

Table 3: Experiments with target-side syntax for German→English and Romanian→English. BLEU scores reported for baseline NMT, syntax-aware NMT (SNMT) and multitasking. The SNMT system is also combined with source dependencies. Statistical significance is indicated with * p < 0.05 and ** p < 0.01, when comparing against the NMT baseline.

Initially, we had intended to use global syntax on the source-side as well for German→English, however the German CCG tree-bank is still under development.

From the results in Table 3 we first observe that for German→English the source-side dependency labels improve BLEU by only 0.1, while Romanian→English sees an improvement of 0.5. Source-syntax may help more for Romanian→English because the training data is smaller and the word order is more similar between the source and target languages than it is for German→English.

For both language pairs, target-syntax improves translation quality more than source-syntax. However, target-syntax is complemented by source-syntax when used together, leading to a final improvement of 0.9 BLEU points for German→English and 1.2 BLEU points for Romanian→English.

Finally, we show that CCG supertags are also an effective representation of global-syntax when used in the encoder. In Table 4 we present results for using CCG supertags as source-syntax in the embedding layer. Because we have CCG annotations only for English, we reverse the translation directions and report BLEU scores for English→German and English→Romanian. The BLEU scores reported are for the ensemble models over newstest2016.

For English→German BLEU increases by 0.7 points and for English→Romanian by 0.5 points. In contrast, Sennrich and Haddow (2016) obtain an improvement of only 0.2 for English→German using dependency labels which encode only the grammatical function of words. These results confirm that representing global syntax in the encoder provides complementary information that the baseline NMT model is not able to learn from the source word sequence alone.

model   syntax          EN→DE    EN→RO
NMT     -               28.3     25.6
SNMT    source – CCG    29.0*    26.1*

Table 4: Results for English→German and English→Romanian with source-side syntax. The SNMT system uses the CCG supertags of the source words in the embedding layer. * p < 0.05.


4.3 Analyses by sentence type

In this section, we make a finer grained analysis of the impact of target-side syntax by looking at a breakdown of BLEU scores with respect to different linguistic constructions and sentence lengths. [7]

We classify sentences into different linguistic constructions based on the CCG supertags that appear in them, e.g., the presence of category (NP\NP)/(S/NP) indicates a subordinate construction. Figure 3 a) shows the difference in BLEU points between the syntax-aware NMT system and the baseline NMT system for the following linguistic constructions: coordination (conj), control and raising (control), prepositional phrase attachment (pp), questions and subordinate clauses (subordinate). In the figure we use the symbol "*" to indicate that syntactic information is used on the target (eg. de-en*), or both on the source and target (eg. *de-en*). We report the number of sentences for each category in Table 5.
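A small sketch of this supertag-based bucketing is shown below, assuming a per-sentence list of CCG categories is available. Only the mapping "(NP\NP)/(S/NP) indicates a subordinate construction" is stated in the text; the marker categories for the other buckets are illustrative assumptions.

```python
# Sketch of classifying sentences into linguistic constructions by the CCG
# categories they contain. Only the subordinate marker is given in the text;
# the other marker sets are illustrative assumptions.

MARKERS = {
    "subordinate": {r"(NP\NP)/(S/NP)"},
    "pp":          {r"(NP\NP)/NP", r"PP/NP"},       # assumed PP-attachment markers
    "question":    {r"S[wq]/(S[q]/NP)"},            # assumed question marker
}

def constructions(supertags):
    """Return the construction labels whose marker categories occur in the sentence."""
    tags = set(supertags)
    return {label for label, markers in MARKERS.items() if tags & markers}

# Toy example: the supertags of the sentence in Figure 1.
fig1_tags = ["NP", r"((S[dcl]\NP)/PP)/NP", "NP", r"PP/NP", "NP/N", "N",
             r"(NP\NP)/NP", "NP"]
print(constructions(fig1_tags))   # {'pp'}
```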

With target-syntax, we see consistent improvements across all linguistic constructions for Romanian→English and across all but control and raising for German→English.

[7] Document-level BLEU is computed over each subset of sentences.



Figure 3: Difference in BLEU points between SNMT and NMT, relative to baseline NMT scores, with respect to a) linguistic constructs and b) sentence lengths. The numbers attached to the bars represent the BLEU score for the baseline NMT system. The symbol * indicates that syntactic information is used on the target (eg. de-en*), or both on the source and target (eg. *de-en*).

         sub.   qu.   pp      contr.   conj
RO↔EN    742    90    1,572   415      845
DE↔EN    936    114   2,321   546      1,129

Table 5: Sentence counts for different linguistic constructions.

In particular, the increase in BLEU scores for the prepositional phrase and subordinate constructions suggests that target word order is improved.

For German→English, there is a small decrease in BLEU for the control and raising constructions when using target-syntax alone. However, source-syntax adds complementary information to target-syntax, resulting in a small improvement for this category as well. Moreover, combining source and target-syntax increases translation quality across all linguistic constructions as compared to NMT and SNMT with target-syntax alone. For Romanian→English, combining source and target-syntax brings an additional improvement of 0.7 for subordinate constructs and 0.4 for prepositional phrase attachment. For German→English, on the same categories, there is an additional improvement of 0.4 and 0.3 respectively. Overall, BLEU scores improve by more than 1 BLEU point for most linguistic constructs and for both language pairs.

Next, we compare the systems with respect to sentence length. Figure 3 b) shows the difference in BLEU points between the syntax-aware NMT system and the baseline NMT system with respect to the length of the source sentence measured in BPE sub-units. We report the number of sentences for each category in Table 6.

         <15    15-25   25-35   >35
RO↔EN    491    540     433     520
DE↔EN    918    934     582     560

Table 6: Sentence counts for different sentence lengths.

With target-syntax, we see consistent improvements across all sentence lengths for Romanian→English and across all but short sentences for German→English. For German→English there is a decrease in BLEU for sentences up to 15 words. Since the German→English training data is large, the baseline NMT system learns a good model for short sentences with local dependencies and without subordinate or coordinate clauses. Including extra CCG supertags increases the target sequence without adding information about complex linguistic phenomena.


DE-EN Question
Source: Oder wollen Sie herausfinden , über was andere reden ?
Ref.:   Or do you want to find out what others are talking about ?
NMT:    Or would you like to find out about what others are talking about ?
SNMT:   Or do you want to find out what_NP/(S[dcl]/NP) others are_(S[dcl]\NP)/(S[ng]\NP) talking_(S[ng]\NP)/PP about_PP/NP ?

DE-EN Subordinate
Source: ...dass die Polizei jetzt sagt , ..., und dass Lamb in seinem Notruf Prentiss zwar als seine Frau bezeichnete ...
Ref.:   ...that police are now saying ..., and that while Lamb referred to Prentiss as his wife in the 911 call ...
NMT:    ...police are now saying ..., and that in his emergency call Prentiss he called his wife ...
SNMT:   ...police are now saying ..., and that lamb , in his emergency call , described_((S[dcl]\NP)/PP)/NP Prentiss as his wife ....

Figure 4: Comparison of baseline NMT and SNMT with target syntax for German→English.

However, when using both source and target syntax, the effect on short sentences disappears. For Romanian→English there is also a large improvement on short sentences when combining source and target syntax: 2.9 BLEU points compared to the NMT baseline and 1.2 BLEU points compared to SNMT with target-syntax alone.

With both source and target-syntax, translation quality increases across all sentence lengths as compared to NMT and SNMT with target-syntax alone. For German→English sentences that are more than 35 words, we see again the effect of increasing the target sequence by adding CCG supertags. Target-syntax helps, however BLEU improves by only 0.4, compared to 0.9 for sentences between 15 and 35 words. With both source and target syntax, BLEU improves by 0.8 for sentences with more than 35 words. For Romanian→English we see a similar result for sentences with more than 35 words: target-syntax improves BLEU by 0.6, while combining source and target syntax improves BLEU by 0.8. These results confirm as well that source-syntax adds complementary information to target-syntax and mitigates the problem of increasing the target sequence.

4.4 Discussion

Our experiments demonstrate that target-syntax improves translation for two translation directions: German→English and Romanian→English. Our proposed method predicts the target words together with their CCG supertags.

Although the focus of this paper is not improving CCG tagging, we can also measure that SNMT is accurate at predicting CCG supertags. We compare the CCG sequence predicted by the SNMT models with that predicted by EasySRL and obtain the following accuracies: 93.2 for Romanian→English, 95.6 for German→English, 95.8 for German→English with both source and target syntax. [8]
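These accuracies amount to a per-token comparison between the supertag sequence predicted by SNMT and the EasySRL annotation; a minimal sketch is given below, which skips sentence pairs whose tag sequences differ in length (relevant for the multitask model, cf. footnote 8). The exact evaluation script used in the paper is not specified, so this is an assumed reconstruction.

```python
# Sketch of per-token CCG supertag accuracy: SNMT-predicted vs. EasySRL tags.
# Sentences whose tag sequences differ in length are skipped (cf. footnote 8).

def supertag_accuracy(predicted, reference):
    correct = total = 0
    for pred_tags, ref_tags in zip(predicted, reference):
        if len(pred_tags) != len(ref_tags):
            continue                      # lengths differ: skip this sentence
        correct += sum(p == r for p, r in zip(pred_tags, ref_tags))
        total += len(ref_tags)
    return 100.0 * correct / total if total else 0.0

# Toy example with hypothetical tag sequences.
pred = [["NP", "(S[dcl]\\NP)/NP", "NP"], ["NP", "S[dcl]\\NP"]]
ref  = [["NP", "(S[dcl]\\NP)/NP", "NP/N"], ["NP", "S[dcl]\\NP"]]
print(supertag_accuracy(pred, ref))   # 80.0
```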

We conclude by giving a couple of examples in Figure 4 for which the SNMT system with target syntax produced more grammatical translations than the baseline NMT system.

In the example DE-EN Question the baseline NMT system translates the preposition "über" twice as "about". The SNMT system with target syntax predicts the correct CCG supertag for "what", which expects to be followed by a sentence and not a preposition: NP/(S[dcl]/NP). Therefore the SNMT correctly re-orders the preposition "about" to the end of the question.

In the example DE-EN Subordinate the baseline NMT system fails to correctly attach "Prentiss" as an object and "his wife" as a modifier to the verb "called (bezeichnete)" in the subordinate clause. In contrast the SNMT system predicts the correct sub-categorization frame of the verb "described" and correctly translates the entire predicate-argument structure.

5 Conclusions

This work introduces a method for modeling explicit target-syntax in a neural machine translation system, by interleaving target words with their corresponding CCG supertags. Earlier work on syntax-aware NMT mainly modeled syntax in the encoder, while our experiments suggest modeling syntax in the decoder is also useful.

[8] The multitasking model predicts a different number of CCG supertags than the number of target words. For the sentences where these numbers match, the CCG supertagging accuracy is 73.2.


Our results show that a tight integration of syntax in the decoder improves translation quality for both German→English and Romanian→English language pairs, more so than a loose coupling of target words and syntax as in multitask learning. Finally, by combining our method for integrating target-syntax with the framework of Sennrich and Haddow (2016) for source-syntax we obtain the most improvement over the baseline NMT system: 0.9 BLEU for German→English and 1.2 BLEU for Romanian→English. In particular, we see large improvements for longer sentences involving syntactic phenomena such as subordinate and coordinate clauses and prepositional phrase attachment. In future work, we plan to evaluate the impact of target-syntax when translating into a morphologically rich language, for example by using the Hindi CCGBank (Ambati et al., 2016).

Acknowledgements

We thank the anonymous reviewers for their comments and suggestions. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements 644402 (HimL), 644333 (SUMMA) and 645452 (QT21).

References

Roee Aharoni and Yoav Goldberg. 2017. Towards string-to-tree neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, Canada. Association for Computational Linguistics.

Bharat Ram Ambati, Tejaswini Deoskar, and Mark Steedman. 2016. Hindi CCGbank: CCG Treebank from the Hindi Dependency Treebank. In Language Resources and Evaluation.

Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2442–2452, Berlin, Germany. Association for Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. 2016. Neural versus phrase-based machine translation quality: a case study. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 257–267.

Alexandra Birch, Miles Osborne, and Philipp Koehn. 2007. CCG supertags in factored statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, StatMT '07, pages 9–16, Stroudsburg, PA, USA. Association for Computational Linguistics.

Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurelie Neveol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation, pages 131–198, Berlin, Germany. Association for Computational Linguistics.

David Chiang. 2007. Hierarchical phrase-based translation. Comput. Linguist., 33(2):201–228.

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014a. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar. Association for Computational Linguistics.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014b. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209, San Diego, California. Association for Computational Linguistics.

Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-sequence attentional neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 823–833, Berlin, Germany. Association for Computational Linguistics.

Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. 2017. Learning to parse and translate improves neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, Canada. Association for Computational Linguistics.

Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a translation rule? In Proceedings of Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL '04.

Marcin Junczys-Dowmunt, Tomasz Dwojak, and Hieu Hoang. 2016. Is Neural Machine Translation Ready for Deployment? A Case Study on 30 Translation Directions. In Proceedings of the IWSLT 2016.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709, Seattle, Washington, USA. Association for Computational Linguistics.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Mike Lewis, Luheng He, and Luke Zettlemoyer. 2015. Joint A* CCG parsing and semantic role labelling. In Empirical Methods in Natural Language Processing.

Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task sequence to sequence learning. In Proceedings of International Conference on Learning Representations (ICLR 2016).

Mercedes García Martínez, Loïc Barrault, and Fethi Bougares. 2016. Factored Neural Machine Translation Architectures. In International Workshop on Spoken Language Translation (IWSLT'16).

Arul Menezes and Chris Quirk. 2007. Using dependency order templates to improve generality in translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 1–8.

Maria Nadejde, Alexandra Birch, and Philipp Koehn. 2016a. Modeling selectional preferences of verbs and nouns in string-to-tree machine translation. In Proceedings of the First Conference on Machine Translation, pages 32–42, Berlin, Germany. Association for Computational Linguistics.

Maria Nadejde, Alexandra Birch, and Philipp Koehn. 2016b. A neural verb lexicon model with source-side syntactic context for string-to-tree machine translation. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT).

Maria Nadejde, Philip Williams, and Philipp Koehn. 2013. Edinburgh's Syntax-Based Machine Translation Systems. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 170–176, Sofia, Bulgaria.

Jan Niehues, Thanh-Le Ha, Eunah Cho, and Alex Waibel. 2016. Using factored word representation in neural network language models. In Proceedings of the First Conference on Machine Translation, Berlin, Germany.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics.

Stefan Riezler and John T. Maxwell. 2005. On some pitfalls in automatic evaluation and significance testing for MT. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 57–64, Ann Arbor, Michigan. Association for Computational Linguistics.

Rico Sennrich. 2015. Modelling and Optimizing on Syntactic N-Grams for Statistical Machine Translation. Transactions of the Association for Computational Linguistics, 3:169–182.

Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Laubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. 2017. Nematus: a toolkit for neural machine translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 65–68, Valencia, Spain. Association for Computational Linguistics.

Rico Sennrich and Barry Haddow. 2016. Linguistic input features improve neural machine translation. In Proceedings of the First Conference on Machine Translation, pages 83–91, Berlin, Germany.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Edinburgh neural machine translation systems for WMT 16. In Proceedings of the First Conference on Machine Translation, pages 371–376, Berlin, Germany. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany. Association for Computational Linguistics.

Rico Sennrich, Martin Volk, and Gerold Schneider. 2013. Exploiting Synergies Between Open Resources for German Dependency Parsing, POS-tagging, and Morphological Analysis. In Proceedings of the International Conference Recent Advances in Natural Language Processing 2013, pages 601–609, Hissar, Bulgaria.

Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural MT learn source syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1526–1534, Austin, Texas. Association for Computational Linguistics.

Mark Steedman. 2000. The syntactic process, volume 24. MIT Press.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS'14, pages 3104–3112.

Philip Williams and Philipp Koehn. 2012. GHKM rule extraction and scope-3 parsing in Moses. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 388–394.
