
Proceedings of SSST-8

Eighth Workshop on

Syntax, Semantics and Structure in Statistical Translation

Dekai Wu, Marine Carpuat, Xavier Carreras and Eva Maria Vecchi (editors)

EMNLP 2014 / SIGMT / SIGLEX Workshop
25 October 2014

Doha, Qatar


Production and Manufacturing by
Taberg Media Group AB
Box 94, 562 02 Taberg
Sweden

©2014 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: [email protected]

ISBN 978-1-937284-96-1


Introduction

The Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8) was held on 25 October 2014, preceding the EMNLP 2014 conference in Doha, Qatar. Like the first seven SSST workshops in 2007, 2008, 2009, 2010, 2011, 2012 and 2013, it aimed to bring together researchers from different communities working in the rapidly growing field of structured statistical models of natural language translation.

This year’s special theme focused on compositional distributional semantics, distributed representations, and continuous vector space models in MT.

We selected 19 papers and extended abstracts for this year’s workshop, many of which reflect statistical machine translation’s movement toward not only tree-structured and syntactic models incorporating stochastic synchronous/transduction grammars, but also increasingly semantic models and the closely linked issues of deep syntax and shallow semantics, and vector space representations to support these approaches.

Thanks are due once again to our authors and our Program Committee for making the eighth SSST workshop another success.

Dekai Wu, Marine Carpuat, Xavier Carreras and Eva Maria Vecchi


Organizers:

Dekai Wu, Hong Kong University of Science and Technology (HKUST)
Marine Carpuat, National Research Council (NRC) Canada
Xavier Carreras, Xerox Research Centre Europe
Eva Maria Vecchi, Cambridge University

Program Committee:

Timothy Baldwin, University of Melbourne
Srinivas Bangalore, AT&T Labs Research
Phil Blunsom, Oxford University
Colin Cherry, National Research Council Canada
David Chiang, USC/ISI
Shay B. Cohen, University of Edinburgh
Georgiana Dinu, University of Trento
Chris Dyer, Carnegie Mellon University
Marc Dymetman, Xerox Research Centre Europe
Philipp Koehn, University of Edinburgh
Alon Lavie, Carnegie Mellon University
Chi-kiu Lo, HKUST
Markus Saers, HKUST
Khalil Sima’an, University of Amsterdam
Ivan Vulic, University of Leuven
Taro Watanabe, NICT
François Yvon, LIMSI
Ming Zhou, Microsoft Research Asia

Invited Speaker:

Timothy Baldwin, University of Melbourne


Table of Contents

Vector Space Models for Phrase-based Machine Translation
Tamer Alkhouli, Andreas Guta and Hermann Ney . . . . . . . . . . . . . . . 1

Bilingual Markov Reordering Labels for Hierarchical SMT
Gideon Maillette de Buy Wenniger and Khalil Sima’an . . . . . . . . . . . . . . . 11

Better Semantic Frame Based MT Evaluation via Inversion Transduction Grammars
Dekai Wu, Chi-kiu Lo, Meriem Beloucif and Markus Saers . . . . . . . . . . . . . . . 22

Rule-based Syntactic Preprocessing for Syntax-based Machine Translation
Yuto Hatakoshi, Graham Neubig, Sakriani Sakti, Tomoki Toda and Satoshi Nakamura . . . . . . . . . . . . . . . 34

Applying HMEANT to English-Russian Translations
Alexander Chuchunkov, Alexander Tarelkin and Irina Galinskaya . . . . . . . . . . . . . . . 43

Reducing the Impact of Data Sparsity in Statistical Machine Translation
Karan Singla, Kunal Sachdeva, Srinivas Bangalore, Dipti Misra Sharma and Diksha Yadav . . . . . . . . . . . . . . . 51

Expanding the Language model in a low-resource hybrid MT system
George Tambouratzis, Sokratis Sofianopoulos and Marina Vassiliou . . . . . . . . . . . . . . . 57

Syntax and Semantics in Quality Estimation of Machine Translation
Rasoul Kaljahi, Jennifer Foster and Johann Roturier . . . . . . . . . . . . . . . 67

Overcoming the Curse of Sentence Length for Neural Machine Translation using Automatic Segmentation
Jean Pouget-Abadie, Dzmitry Bahdanau, Bart van Merrienboer, Kyunghyun Cho and Yoshua Bengio . . . . . . . . . . . . . . . 78

Ternary Segmentation for Improving Search in Top-down Induction of Segmental ITGs
Markus Saers and Dekai Wu . . . . . . . . . . . . . . . 86

A CYK+ Variant for SCFG Decoding Without a Dot Chart
Rico Sennrich . . . . . . . . . . . . . . . 94

On the Properties of Neural Machine Translation: Encoder–Decoder Approaches
Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau and Yoshua Bengio . . . . . . . . . . . . . . . 103

Transduction Recursive Auto-Associative Memory: Learning Bilingual Compositional Distributed Vector Representations of Inversion Transduction Grammars
Karteek Addanki and Dekai Wu . . . . . . . . . . . . . . . 112

Transformation and Decomposition for Efficiently Implementing and Improving Dependency-to-String Model In Moses
Liangyou Li, Jun Xie, Andy Way and Qun Liu . . . . . . . . . . . . . . . 122

Word’s Vector Representations meet Machine Translation
Eva Martinez Garcia, Jörg Tiedemann, Cristina España-Bonet and Lluís Màrquez . . . . . . . . . . . . . . . 132

Context Sense Clustering for Translation
João Casteleiro, Gabriel Lopes and Joaquim Silva . . . . . . . . . . . . . . . 135


Evaluating Word Order Recursively over Permutation-Forests
Miloš Stanojevic and Khalil Sima’an . . . . . . . . . . . . . . . 138

Preference Grammars and Soft Syntactic Constraints for GHKM Syntax-based Statistical Machine Translation
Matthias Huck, Hieu Hoang and Philipp Koehn . . . . . . . . . . . . . . . 148

How Synchronous are Adjuncts in Translation Data?
Sophie Arnoult and Khalil Sima’an . . . . . . . . . . . . . . . 157


Conference Program

Saturday, October 25, 2014

9:00–10:30 Session 1: Morning Orals

9:00–9:10 Opening Remarks
Dekai Wu, Marine Carpuat, Xavier Carreras, Eva Maria Vecchi

9:10–9:30 Vector Space Models for Phrase-based Machine Translation
Tamer Alkhouli, Andreas Guta and Hermann Ney

9:30–9:50 Bilingual Markov Reordering Labels for Hierarchical SMT
Gideon Maillette de Buy Wenniger and Khalil Sima’an

9:50–10:10 Better Semantic Frame Based MT Evaluation via Inversion Transduction Grammars
Dekai Wu, Chi-kiu Lo, Meriem Beloucif and Markus Saers

10:10–10:30 Rule-based Syntactic Preprocessing for Syntax-based Machine Translation
Yuto Hatakoshi, Graham Neubig, Sakriani Sakti, Tomoki Toda and Satoshi Nakamura

10:30–11:00 Coffee break

11:00–12:00 Invited talk by Timothy Baldwin

11:00–12:00 Composed, Distributed Reflections on Semantics and Statistical Machine Translation
Timothy Baldwin


Saturday, October 25, 2014 (continued)

12:00–12:30 Session 2: Morning Spotlights

12:00–12:05 Applying HMEANT to English-Russian Translations
Alexander Chuchunkov, Alexander Tarelkin and Irina Galinskaya

12:05–12:10 Reducing the Impact of Data Sparsity in Statistical Machine Translation
Karan Singla, Kunal Sachdeva, Srinivas Bangalore, Dipti Misra Sharma and Diksha Yadav

12:10–12:15 Expanding the Language model in a low-resource hybrid MT system
George Tambouratzis, Sokratis Sofianopoulos and Marina Vassiliou

12:15–12:20 Syntax and Semantics in Quality Estimation of Machine Translation
Rasoul Kaljahi, Jennifer Foster and Johann Roturier

12:20–12:25 Overcoming the Curse of Sentence Length for Neural Machine Translation using Automatic Segmentation
Jean Pouget-Abadie, Dzmitry Bahdanau, Bart van Merrienboer, Kyunghyun Cho and Yoshua Bengio

12:25–12:30 Ternary Segmentation for Improving Search in Top-down Induction of Segmental ITGs
Markus Saers and Dekai Wu

12:30–14:00 Lunch break

14:00–15:30 Session 3: Afternoon Orals and Spotlights

14:00–14:20 A CYK+ Variant for SCFG Decoding Without a Dot Chart
Rico Sennrich

14:20–14:40 On the Properties of Neural Machine Translation: Encoder–Decoder Approaches
Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau and Yoshua Bengio

14:40–15:00 Transduction Recursive Auto-Associative Memory: Learning Bilingual Compositional Distributed Vector Representations of Inversion Transduction Grammars
Karteek Addanki and Dekai Wu

15:00–15:20 Transformation and Decomposition for Efficiently Implementing and Improving Dependency-to-String Model In Moses
Liangyou Li, Jun Xie, Andy Way and Qun Liu


Saturday, October 25, 2014 (continued)

15:20–15:25 Word’s Vector Representations meet Machine Translation
Eva Martinez Garcia, Jörg Tiedemann, Cristina España-Bonet and Lluís Màrquez

15:25–15:30 Context Sense Clustering for Translation
João Casteleiro, Gabriel Lopes and Joaquim Silva

15:30–16:00 Coffee break

16:00–16:15 Session 4: Afternoon Spotlights

16:00–16:05 Evaluating Word Order Recursively over Permutation-Forests
Miloš Stanojevic and Khalil Sima’an

16:05–16:10 Preference Grammars and Soft Syntactic Constraints for GHKM Syntax-based Statistical Machine Translation
Matthias Huck, Hieu Hoang and Philipp Koehn

16:10–16:15 How Synchronous are Adjuncts in Translation Data?
Sophie Arnoult and Khalil Sima’an

16:15–17:30 Poster session

16:15–17:30 Poster session of all workshop papers
All workshop presenters


Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 1–10, October 25, 2014, Doha, Qatar. ©2014 Association for Computational Linguistics

Vector Space Models for Phrase-based Machine Translation

Tamer Alkhouli1, Andreas Guta1, and Hermann Ney1,2

1 Human Language Technology and Pattern Recognition Group, RWTH Aachen University, Aachen, Germany

2 Spoken Language Processing Group, Univ. Paris-Sud, France and LIMSI/CNRS, Orsay, France

{surname}@cs.rwth-aachen.de

Abstract

This paper investigates the application of vector space models (VSMs) to the standard phrase-based machine translation pipeline. VSMs are models based on continuous word representations embedded in a vector space. We exploit word vectors to augment the phrase table with new inferred phrase pairs. This helps reduce out-of-vocabulary (OOV) words. In addition, we present a simple way to learn bilingually-constrained phrase vectors. The phrase vectors are then used to provide additional scoring of phrase pairs, which fits into the standard log-linear framework of phrase-based statistical machine translation. Both methods result in significant improvements over a competitive in-domain baseline applied to the Arabic-to-English task of IWSLT 2013.

1 Introduction

Categorical word representation has been widely used in many natural language processing (NLP) applications including statistical machine translation (SMT), where words are treated as discrete random variables. Continuous word representations, on the other hand, have been applied successfully in many NLP areas (Manning et al., 2008; Collobert and Weston, 2008). However, their application to machine translation is still an open research question. Several works tried to address the question recently (Mikolov et al., 2013b; Zhang et al., 2014; Zou et al., 2013), and this work is but another step in that direction.

While categorical representations do not encode any information about word identities, continuous representations embed words in a vector space, resulting in geometric arrangements that reflect information about the represented words. Such embeddings open the potential for applying information retrieval approaches, where it becomes possible to define and compute similarity between different words. We focus on continuous representations whose training is influenced by the surrounding context of the token being represented. One motivation for such representations is to capture word semantics (Turney et al., 2010). This is based on the distributional hypothesis (Harris, 1954), which says that words that occur in similar contexts tend to have similar meanings.

We make use of continuous vectors learned using simple neural networks. Neural networks have been gaining increasing attention recently, and they have been able to enhance strong SMT baselines (Devlin et al., 2014; Sundermeyer et al., 2014). While neural language and translation modeling make intermediate use of continuous representations, there have also been attempts at explicit learning of continuous representations to improve translation (Zhang et al., 2014; Gao et al., 2013).

This work explores the potential of word semantics based on continuous vector representations to enhance the performance of phrase-based machine translation. We present a greedy algorithm that employs the phrase table to identify phrases in a training corpus. The phrase table serves to bilingually restrict the phrases spotted in the monolingual corpus. The algorithm is applied separately to the source and target sides of the training data, resulting in source and target corpora of phrases (instead of words). The phrase corpus is used to learn phrase vectors using the same methods that produce word vectors. The vectors are then used to provide semantic scoring of phrase pairs. We also learn word vectors and employ them to augment the phrase table with paraphrased entries. This leads to a reduction in the OOV rate, which translates to improved BLEU and TER scores. We apply the two methods on the IWSLT 2013 Arabic-to-English task and show significant improvements over a strong in-domain baseline.

The rest of the paper is structured as follows. Section 2 presents background on word and phrase vectors. The construction of the phrase corpus is discussed in Section 3, while Section 4 demonstrates how to use word and phrase vectors in the standard phrase-based SMT pipeline. Experiments are presented in Section 5, followed by an overview of the related work in Section 6; finally, Section 7 concludes the work.

2 Vector Space Models

One way to obtain context-based word vectors is through a neural network (Bengio et al., 2003; Schwenk, 2007). With a vocabulary size V, one-hot encoding of V-dimensional vectors is used to represent input words, effectively associating each word with a D-dimensional vector in the V×D input weight matrix, where D is the size of the hidden layer. Similarly, one-hot encoding on the output layer associates words with vectors in the output weight matrix.
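As a toy illustration of this lookup view (the sizes and the code are mine, not from the paper): with one-hot inputs, multiplying by the input weight matrix simply selects a row, so the learned rows serve directly as the word vectors.

```python
import numpy as np

V, D = 5, 3                      # toy vocabulary and hidden-layer sizes (illustrative only)
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, D))   # the V x D input weight matrix of the network

word_id = 2
one_hot = np.zeros(V)
one_hot[word_id] = 1.0

# The one-hot product selects a single row of W_in ...
assert np.allclose(one_hot @ W_in, W_in[word_id])
# ... so after training, row `word_id` of W_in is that word's D-dimensional vector.
```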

Alternatively, a count-based V-dimensional word co-occurrence vector can serve as a word representation (Lund and Burgess, 1996; Landauer and Dumais, 1997). Such representations are sparse and high-dimensional, which might require an additional dimensionality reduction step (e.g. using SVD). In contrast, learning word representations via neural models results directly in relatively low-dimensional, dense vectors. In this work, we follow the neural network approach to extract the feature vectors. Whether word vectors are extracted by means of a neural network or co-occurrence counts, the context surrounding a word influences its final representation by design. Such context-based representations can be used to determine semantic similarities.

The construction of phrase representations, on the other hand, can be done in different ways. The compositional approach constructs the vector representation of a phrase by resorting to its constituent words (or sub-phrases) (Gao et al., 2013; Chen et al., 2010). Kalchbrenner and Blunsom (2013) obtain continuous sentence representations by applying a sequence of convolutions, starting with word representations.

Another approach for phrase representation considers phrases as atomic units that cannot be divided further. The representations are learned directly in this case (Mikolov et al., 2013b; Hu et al., 2014).

In this work, we follow the second approach to obtain phrase vectors. To this end, we apply the same methods that yield word vectors, with the difference that phrases are used instead of words. In the case of neural word representations, a neural network that is presented with words at the input layer is presented with phrases instead. The resulting vocabulary size in this case would be the number of distinct phrases observed during training. Although learning phrase embeddings directly is susceptible to data sparsity issues, it provides us with a simple means to build phrase vectors making use of tools already developed for word vectors, focusing the effort on preprocessing the data, as will be discussed in the next section.

3 Phrase Corpus

When training word vectors using neural networks, the network is presented with a corpus. To build phrase vectors, we first identify phrases in the corpus and generate a phrase corpus. The phrase corpus is similar to the original corpus except that its words are joined to make up phrases. The new corpus is then used to train the neural network. The columns of the resulting input weight matrix of the network are the phrase vectors corresponding to the phrases encountered during training.

Mikolov et al. (2013b) identify phrases using a monolingual point-wise mutual information criterion with discounting. Since our end goal is to generate phrase vectors that are helpful for translation, we follow a different approach: we constrain the phrases by the conventional phrase table of phrase-based machine translation. This is done by limiting the phrases identified in the corpus to high quality phrases occurring in the phrase table. The quality is determined using bilingual scores of phrase pairs. While the phrase vectors of a language are eventually obtained by training the neural network on the monolingual phrase corpus of that language, the reliance on bilingual scores to construct the monolingual phrase corpus encodes bilingual information in the corpus: the corpus will include phrases that have a matching phrase in the other language, which is in line with the purpose for which the phrases are constructed, namely their use in the phrase-based machine translation pipeline explained in the next section. In addition, the aforementioned scoring serves to exclude noisy phrase-pair entries during the construction of the phrase corpus. Next, we explain the details of the construction algorithm.

Algorithm 1 Phrase Corpus Construction
 1: p ← 1
 2: for p ≤ numPasses do
 3:   i ← 2
 4:   for i ≤ corpus.size − 1 do
 5:     w ← join(t_i, t_{i+1})        ▷ create a phrase using the current and next tokens
 6:     v ← join(t_{i−1}, t_i)        ▷ create a phrase using the previous and current tokens
 7:     joinForward ← score(w)
 8:     joinBackward ← score(v)
 9:     if joinForward ≥ joinBackward and joinForward ≥ θ then
10:       t_i ← w
11:       remove t_{i+1}
12:       i ← i + 2                   ▷ newly created phrase not available for further merge during current pass
13:     else
14:       if joinBackward > joinForward and joinBackward ≥ θ then
15:         t_{i−1} ← v
16:         remove t_i
17:         i ← i + 2                 ▷ newly created phrase not available for further merge during current pass
18:       else
19:         i ← i + 1
20:       end if
21:     end if
22:   end for
23:   p ← p + 1
24: end for

3.1 Phrase Spotting

We propose Algorithm 1 as a greedy approach for phrase corpus construction. It is a multi-pass algorithm where each pass can extend tokens obtained during the previous pass by a single token at most. Before the first pass, all tokens are words. During the passes the tokens might remain as words or can be extended to become phrases. Given a token t_i at position i, a scoring function is used to score the phrase (t_i, t_{i+1}) and the phrase (t_{i−1}, t_i). The phrase having the higher score is adopted as long as its score exceeds a predefined threshold θ. The scoring function used in lines 7 and 8 is based on the phrase table. If the phrase does not belong to the phrase table, it is given a score θ′ < θ. If the phrase exists, a bilingual score is computed using the phrase table fields as follows:

\[ \mathrm{score}(f) = \max_{e} \left\{ \sum_{i=1}^{L} w_i \, g_i(f, e) \right\} \tag{1} \]

where g_i(f, e) is the i-th feature of the bilingual phrase pair (f, e). The maximization is carried out over all phrases e of the other language. The score is the weighted sum of the phrase-pair features. Throughout our experiments, we use 2 phrasal and 2 lexical features for scoring, with manual tuning of the weights w_i.

The resulting corpus is then used to train phrase vectors, following the same procedure as training word vectors.
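For concreteness, the greedy pass of Algorithm 1 can be sketched in Python. This is an illustrative reimplementation rather than the authors' code: the phrase-table layout ({source phrase: {target phrase: feature list}}), the underscore joining convention, and the function names are all assumptions, and the index bookkeeping after a merge is simplified.

```python
def phrase_score(phrase, phrase_table, weights, theta_prime=float("-inf")):
    """Bilingual score of Eq. 1: the maximum, over target phrases paired with
    `phrase`, of the weighted sum of phrase-pair features. Phrases missing
    from the table get theta_prime < theta, so they are never merged."""
    entries = phrase_table.get(phrase)   # assumed: {target_phrase: [g_1 .. g_L]}
    if not entries:
        return theta_prime
    return max(sum(w * g for w, g in zip(weights, feats))
               for feats in entries.values())

def spot_phrases(tokens, phrase_table, weights, theta, num_passes=5):
    """One sentence of the phrase corpus (a sketch of Algorithm 1): greedily
    merge adjacent tokens whose join scores at least theta, over several
    passes; '_' joins the words of a phrase into a single token."""
    tokens = list(tokens)
    for _ in range(num_passes):
        i = 1
        while i < len(tokens) - 1:
            fwd = tokens[i] + "_" + tokens[i + 1]   # current + next token
            bwd = tokens[i - 1] + "_" + tokens[i]   # previous + current token
            s_fwd = phrase_score(fwd, phrase_table, weights)
            s_bwd = phrase_score(bwd, phrase_table, weights)
            if s_fwd >= s_bwd and s_fwd >= theta:
                tokens[i:i + 2] = [fwd]             # merge forward
                i += 1                              # skip past the new phrase
            elif s_bwd > s_fwd and s_bwd >= theta:
                tokens[i - 1:i + 1] = [bwd]         # merge backward
                i += 1                              # skip past the new phrase
            else:
                i += 1
    return tokens
```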

4 End-to-end Translation

In this section we will show how to employ phrase vectors in the phrase-based statistical machine translation pipeline.


4.1 Phrase-based Machine Translation

The phrase-based decoder consists of a search using a log-linear framework (Och and Ney, 2002) as follows:

\[ \hat{e}_1^I = \operatorname*{argmax}_{I,\, e_1^I} \left\{ \max_{K,\, s_1^K} \sum_{m=1}^{M} \lambda_m h_m(e_1^I, s_1^K, f_1^J) \right\} \tag{2} \]

where e_1^I = e_1 ... e_I is the target sentence, f_1^J = f_1 ... f_J is the source sentence, and s_1^K = s_1 ... s_K is the hidden alignment or derivation. The models h_m(e_1^I, s_1^K, f_1^J) are weighted by the weights λ_m, which are tuned using minimum error rate training (MERT) (Och, 2003). The rest of the section presents two ways to integrate vector representations into the system described above.

4.2 Semantic Phrase Feature

Words that occur in similar contexts tend to have similar meanings. This idea is known as the distributional hypothesis (Harris, 1954), and it motivates the use of word context to learn word representations that capture word semantics (Turney et al., 2010). Extending this notion to phrases, phrase vectors that are learned based on the surrounding context encode phrase semantics. Since we will use phrase vectors to compute a feature of a phrase pair in the following, we refer to the feature as a semantic phrase feature.

Given a phrase pair (f, e), we can use the phrase vectors of the source and target phrases to compute a semantic phrase feature as follows:

\[ h_{M+1}(f, e) = \mathrm{sim}(W x_f, z_e) \tag{3} \]

where sim is a similarity function, and x_f and z_e are the S-dimensional source and T-dimensional target vectors corresponding to the source phrase f and target phrase e, respectively. W is an S×T linear projection matrix that maps the source space to the target space (Mikolov et al., 2013a). The matrix is estimated by optimizing the following criterion with stochastic gradient descent:

\[ \min_W \sum_{i=1}^{N} \lVert W x_i - z_i \rVert^2 \tag{4} \]

where the training data consists of the pairs {(x_1, z_1), ..., (x_N, z_N)} of corresponding source and target vectors.

Since the source and target phrase vectors are learned separately, we do not have an immediate mapping between them. As such a mapping is needed for the training of the projection matrix, we resort to the phrase table to obtain it. A source and a target phrase vector are paired if there is a corresponding phrase pair entry in the phrase table whose score exceeds a certain threshold. Scoring is computed using Eq. 1. Similarly, word vectors are paired using IBM 1 p(e|f) and p(f|e) lexica. Noisy entries are assumed to have a probability less than a certain threshold and are not used to pair word vectors.
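A minimal numpy sketch of Eqs. 3 and 4 follows, assuming cosine similarity as the sim function (the paper leaves sim generic here but uses cosine similarity for paraphrasing); the SGD details (learning rate, epochs, initialization) are my own choices, not the paper's.

```python
import numpy as np

def train_projection(X, Z, epochs=10, lr=0.01, seed=0):
    """SGD for Eq. 4: minimize sum_i ||W x_i - z_i||^2 over paired vectors.
    X: N x S source phrase vectors, Z: N x T target phrase vectors."""
    rng = np.random.default_rng(seed)
    N, S = X.shape
    T = Z.shape[1]
    W = rng.normal(scale=0.01, size=(T, S))   # stored T x S so that W @ x is T-dimensional
    for _ in range(epochs):
        for i in rng.permutation(N):
            err = W @ X[i] - Z[i]             # gradient of ||Wx - z||^2 w.r.t. W is 2 * err x^T
            W -= lr * 2.0 * np.outer(err, X[i])
    return W

def semantic_feature(W, x_f, z_e):
    """Eq. 3 with cosine similarity standing in for sim."""
    p = W @ x_f
    return float(p @ z_e / (np.linalg.norm(p) * np.linalg.norm(z_e) + 1e-12))
```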

4.3 Paraphrasing

While the standard phrase table is extracted using parallel training data, we propose to extend it and infer new entries relying on continuous representations. With a similarity measure (e.g. cosine similarity) that computes the similarity between two phrases, a new phrase pair can be generated by replacing either or both of its constituent phrases by similar phrases. The new phrase is referred to as a paraphrase of the phrase it replaces. This enables a richer use of the bilingual data, as a source paraphrase can be borrowed from a sentence that is not aligned to a sentence containing the target side of the phrase pair. It also enables the use of monolingual data, as the source and target paraphrases do not have to occur in the parallel data. The cross-interaction between sentences in the parallel data and the inclusion of the monolingual data to extend the phrase table are potentially capable of reducing the out-of-vocabulary (OOV) rate.

In order to generate a new phrase rule, we ensure that noisy rules do not contribute to the generation process, depending on the score of the phrase pair (cf. Eq. 1). High scoring entries are paraphrased as follows. To paraphrase the source side, we perform a k-nearest neighbor search over the source phrase vectors. The top-k similar entries are considered paraphrases of the given phrase. The same can be done for the target side. We assign the newly generated phrase pair the same feature values as the pair used to induce it. However, two extra phrase features are added: one measuring the similarity between the source phrase and its paraphrase, and another for the target phrase and its paraphrase. The new feature values for the original non-paraphrased entries are set to the highest similarity value.

We focus on a certain setting that avoids interference with original phrase rules, by extending the phrase table to cover OOVs only. That is, source-side paraphrasing is performed only if the source paraphrase does not already occur in the phrase table. This ensures that original entries are not interfered with and only OOVs are affected during translation. Reducing OOVs by extending the phrase table has the advantage of exploiting the full decoding capabilities (e.g. LM scoring), as opposed to post-decoding translation of OOVs, which would not exhibit any decoding benefits.

The k-nearest neighbor (k-NN) approach is computationally prohibitive for large phrase tables and large numbers of vectors. This can be alleviated by resorting to approximate k-NN search (e.g. locality sensitive hashing). Note that this search is performed at training time to generate additional phrase table entries, and does not affect decoding time, except through the increase of the phrase table size. In our experiments, the training time using exact k-NN search was acceptable, therefore no search approximations were made.
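The source-side step of this procedure might look as follows; this is a brute-force sketch under assumed data layouts (a {source: {target: features}} phrase table and a phrase-to-vector dict), not the authors' implementation. The two extra paraphrase-similarity features mentioned above are noted but not filled in.

```python
import numpy as np

def expand_phrase_table(table, vecs, k=3, r=0.3):
    """For each source phrase in the table, find its k nearest neighbors by
    cosine similarity (exact brute-force k-NN) and add entries for neighbors
    that are not already table sources, copying the original features."""
    phrases = list(vecs)
    M = np.stack([vecs[p] / np.linalg.norm(vecs[p]) for p in phrases])
    new_rules = {}
    for src, targets in table.items():
        if src not in vecs:
            continue
        q = vecs[src] / np.linalg.norm(vecs[src])
        sims = M @ q                              # cosine scores against all phrases
        for j in np.argsort(-sims)[:k + 1]:       # +1 because src itself ranks first
            para, sim = phrases[j], float(sims[j])
            if para == src or sim <= r or para in table:
                continue                          # only cover OOV source phrases
            new_rules[para] = dict(targets)       # a real system would also add the
                                                  # two paraphrase-similarity features
    return new_rules
```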

5 Experiments

In the following we first provide an analysis of the word vectors that are later used for translation experiments. We use word vectors (as opposed to phrase vectors) for phrase table paraphrasing to reduce the OOV rate. Next, we present end-to-end translation results using the proposed semantic feature and our OOV reduction method.

The experiments are based on vectors trained using the word2vec toolkit[1], setting vector dimensionality to 800 for Arabic and 200 for English vectors. We used the skip-gram model with a maximum skip length of 10. The phrase corpus was constructed using 5 passes, with scores computed according to Eq. 1 using 2 phrasal and 2 lexical features. The phrasal and lexical weights were set to 1 and 0.5 respectively, with all features being negative log-probabilities, and the scoring threshold θ was set to 10. All translation experiments are performed with the Jane toolkit (Vilar et al., 2010; Wuebker et al., 2012).

[1] https://code.google.com/p/word2vec/
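The original experiments used the word2vec C toolkit directly; for illustration, a roughly equivalent configuration in gensim (a substitution of mine; parameter names follow recent gensim versions, and the file names are placeholders) would be:

```python
from gensim.models import Word2Vec

# Skip-gram (sg=1), window 10, 800-dimensional Arabic vectors as in Section 5.
# `phrase_corpus.ar` is assumed to hold one tokenized sentence per line, with
# multi-word phrases already joined into single tokens (Section 3).
model = Word2Vec(corpus_file="phrase_corpus.ar",
                 vector_size=800, window=10, sg=1, workers=4)
model.wv.save_word2vec_format("vectors.ar.txt")
```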

5.1 Baseline System

Our phrase-based baseline system consists of two phrasal and two lexical translation models, trained using a word-aligned bilingual training corpus. Word alignment is automatically generated by GIZA++ (Och and Ney, 2003) given a sentence-aligned bilingual corpus. We also include binary count features and bidirectional hierarchical reordering models (Galley and Manning, 2008), with three orientation classes per direction, resulting in six reordering models. The baseline also includes word penalty, phrase penalty and a simple distance-based distortion model.

The language model (LM) is a 4-gram mixture LM trained on several data sets using modified Kneser-Ney discounting with interpolation, and combined with weights tuned to achieve the lowest perplexity on a development set using the SRILM toolkit (Stolcke, 2002). Data selection is performed using cross-entropy filtering (Moore and Lewis, 2010).
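Cross-entropy filtering ranks candidate LM training sentences by how much more probable they are under an in-domain model than under an out-of-domain model, and keeps the best-scoring portion. A sketch of the selection criterion follows; the LM interface with a cross_entropy method is a hypothetical stand-in, not a real SRILM API.

```python
def moore_lewis_select(sentences, in_lm, out_lm, keep_fraction=0.2):
    """Moore-Lewis data selection: keep sentences with the lowest
    H_in(s) - H_out(s), i.e. sentences that look in-domain but not generic."""
    ranked = sorted(sentences,
                    key=lambda s: in_lm.cross_entropy(s) - out_lm.cross_entropy(s))
    return ranked[:int(len(ranked) * keep_fraction)]
```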

5.2 Word Vectors

Here we analyze the quality of the word vectors used in the OOV reduction experiments. The vectors are trained using an unaltered word corpus. We build a lexicon using source and target word vectors together with the projection matrix, using the similarity score sim(W x_f, z_e), where the projection matrix W is used to project the source word vector x_f, corresponding to the source word f, to the target vector space. The similarity between the projection result W x_f and the target word vector z_e is computed. In the following we will refer to these scores computed using vector representations as VSM-based scores.

The resulting lexicon is compared to the IBM 1 lexicon[2]. Given a source word, we select the best target word according to the VSM-based score. This is compared to the best translation based on the IBM 1 probability. If both translations coincide, we refer to this as a 1-best match. We also check whether the best translation according to IBM 1 matches any of the top-5 translations based on the VSM model. A match in this case is referred to as a 5-best match.

[2] We assume for the purpose of this experiment that the IBM 1 lexicon provides perfect translations, which is not necessarily the case in practice.
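The 1-best and 5-best match rates can be computed with a few lines of numpy; a sketch assuming the projection matrix W from Section 4.2, dicts of word vectors, and a precomputed map from each source word to its best IBM 1 translation (all names are mine):

```python
import numpy as np

def match_rates(src_words, W, src_vecs, tgt_words, tgt_vecs, ibm1_best):
    """Fraction of source words whose IBM 1 best translation equals the
    VSM 1-best (1-best match) or appears in the VSM top 5 (5-best match)."""
    T = np.stack([tgt_vecs[w] / np.linalg.norm(tgt_vecs[w]) for w in tgt_words])
    hit1 = hit5 = 0
    for f in src_words:
        p = W @ src_vecs[f]                       # project into the target space
        sims = T @ (p / np.linalg.norm(p))        # cosine scores against all targets
        top5 = [tgt_words[j] for j in np.argsort(-sims)[:5]]
        hit1 += ibm1_best[f] == top5[0]
        hit5 += ibm1_best[f] in top5
    return hit1 / len(src_words), hit5 / len(src_words)
```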


corpus    Lang.  # tokens        # segments
WIT       Ar     3,185,357       147,256
UN        Ar     228,302,244     7,884,752
arGiga3   Ar     782,638,101     27,190,387
WIT       En     2,951,851       147,256
UN        En     226,280,918     7,884,752
news      En     1,129,871,814   45,240,651

Table 1: Arabic and English corpora statistics.

The vectors are trained on a mixture of in-domain data (WIT), which corresponds to TED talks, and out-of-domain data (UN). These sets are provided as part of the IWSLT 2013 evaluation campaign. We include the LDC2007T40 Arabic Gigaword v3 (arGiga3) and English news crawl articles (2007 through 2012) to experiment with the effect of increasing the size of the training corpus on the quality of the word vectors. Table 1 shows the corpora statistics obtained after preprocessing.

The fractions of the 1- and 5-best matches are shown in Table 2. The table is split into two halves. The upper part investigates the effect of increasing the amount of Arabic data while keeping the English data fixed (2nd row), the effect of increasing the amount of the English data while keeping the Arabic data fixed (3rd row), and the effect of using more data on both sides (4th row). The projection is done on the representation of the Arabic word f, and the similarity is computed between the projection and the representation of the English word e. In the lower half of the table, the same effects are explored, except that the projection is performed on the English side instead. The results indicate that the accuracy increases when increasing the amount of data only on the side being projected. More data on the corresponding side (i.e. the side being projected to) decreases the accuracy. The same behavior is observed whether the projected side is Arabic (upper half) or English (lower half). All in all, the accuracy values are low. The accuracy increases about three times when looking at the 5-best instead of the 1-best accuracy. While the accuracies 32.2% and 33.1% are low, they reflect that the word representations are encoding some information about the words, although this information might not be good enough to build a word-to-word lexicon. However, using this information for OOV reduction might still yield improvements, as we will see in the translation results.

                            Arabic   English
word corpus size            231M     229M
phrase corpus size          126M     115M
word corpus vocab. size     467K     421K
phrase corpus vocab. size   5.8M     5.3M
# phrase vectors            934K     913K

Table 3: Phrase vectors statistics.

5.3 Phrase Vectors

Translation experiments pertaining to the proposed semantic feature are presented here. The feature is based on phrase vectors, which are built with the word2vec toolkit in a similar way to word vectors, except that the training corpus is the phrase corpus containing phrases constructed as described in Section 3. Once trained, a new feature is added to the phrase table. The feature is computed for each phrase pair using phrase vectors as described in Eq. 3.

Table 3 shows statistics about the phrase corpus and the original word corpus it is based on. Algorithm 1 is used to build the phrase corpus using 5 passes. The number of phrase vectors trained using the phrase corpus is also shown. Note that the tool used does not produce vectors for all 5.8M Arabic and 5.3M English phrases in the vocabulary. Rather, noisy phrases are excluded from training, eventually leading to 934K Arabic and 913K English phrase embeddings.

We perform two experiments on the IWSLT 2013 Arabic-to-English evaluation data set. In the first experiment, we examine how the semantic feature affects a small phrase table (2.3M phrase pairs) trained on the in-domain data (WIT). The second experiment deals with a larger phrase table (34M phrase pairs), constructed by a linear interpolation between in- and out-of-domain phrase tables including UN data, resulting in a competitive baseline. The two baselines have hierarchical reordering models (HRMs) and a tuned mixture LM, in addition to the standard models, as described in Section 5.1. The results are shown in Table 4.

Arabic Data        English Data     1-best Match %   5-best Matches %
WIT+UN             WIT+UN           8.0              26.1
WIT+UN+arGiga3     WIT+UN           10.9             32.2
WIT+UN             WIT+UN+news      4.9              17.9
WIT+UN+arGiga3     WIT+UN+news      7.5              25.7
WIT+UN             WIT+UN           8.4              27.2
WIT+UN             WIT+UN+news      10.9             33.1
WIT+UN+arGiga3     WIT+UN           5.7              18.9
WIT+UN+arGiga3     WIT+UN+news      8.3              25.2

Table 2: The effect of increasing the amount of data on the quality of word vectors. VSM-based scores are compared to IBM model 1 p(e|f) (upper half) and p(f|e) (lower half), effectively regarding the IBM 1 models as the true probability distributions. In the upper part, the projection is done on the representation of the Arabic word f, and the similarity is computed between the projection and the representation of the English word e. In the lower half of the table, the roles of f and e are interchanged: the English side is projected.

system      dev2010           eval2013
            BLEU     TER      BLEU     TER
WIT         29.1     50.5     28.9     52.5
+ feature   29.1     ‡50.1    ‡29.3    ‡51.8
+ paraph.   29.2     ‡50.2    ‡29.5    ‡51.8
+ both      29.2     50.2     ‡29.4    ‡51.8
WIT+UN      29.7     49.3     30.5     50.5
+ feature   29.8     49.2     30.2     50.7

Table 4: Semantic feature and paraphrasing results. The symbol ‡ indicates statistical significance with p < 0.01.

In the small experiment, the semantic phrase feature improves TER by 0.7% and BLEU by 0.4% on the test set eval13. The translation seems to benefit from the contextual information encoded in the phrase vectors during training. This is in contrast to the training of the standard phrase features, which disregards context. As for the hierarchical reordering models which are part of the baseline, they do not capture lexical information about the context; they are limited to ordering information only. The skip-gram-based phrase vectors used for the semantic feature, on the other hand, discard ordering information but use contextual lexical information for phrase representation. In this sense, HRMs and the semantic feature can be said to complement each other. Using the semantic feature for the large phrase table did not yield improvements; the difference compared to the baseline in this case is not statistically significant.

All reported results are averages of 3 MERT optimizer runs. Statistical significance is computed using the approximate randomization (AR) test. We used the multeval toolkit (Clark et al., 2011) for evaluation.
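For reference, the core of an approximate randomization test can be sketched as follows. This is a simplification of mine, not multeval's implementation: it aggregates per-sentence scores with a user-supplied metric function, whereas corpus-level BLEU is computed from pooled statistics rather than averaged per sentence.

```python
import random

def approximate_randomization(scores_a, scores_b, metric, trials=10000, seed=0):
    """Shuffle the system assignment per sentence and count how often the
    shuffled metric difference is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(metric(scores_a) - metric(scores_b))
    hits = 0
    for _ in range(trials):
        a_shuf, b_shuf = [], []
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a                      # swap which system gets this sentence
            a_shuf.append(a)
            b_shuf.append(b)
        hits += abs(metric(a_shuf) - metric(b_shuf)) >= observed
    return (hits + 1) / (trials + 1)             # smoothed p-value
```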

5.4 Paraphrasing and OOV Reduction

The next set of experiments investigates the reduction of the OOV rate through paraphrasing, and its impact on translation. Paraphrasing is performed employing the cosine similarity, and the k-NN search is done on the source side, with k = 3. The nearest neighbors are required to satisfy a radius threshold r > 0.3, i.e., neighbors with a similarity value less than or equal to r are rejected. Training the projection matrices is performed using a small amount of training data, amounting to less than 30k translation pairs.

To examine the effect of OOV reduction, we perform paraphrasing on a resource-limited system, where a small amount of parallel data exists, but a larger amount of monolingual data is available. Such a system is simulated by training word vectors on the WIT+UN data monolingually, while extracting the phrase table using the much smaller in-domain WIT data set only. Table 5 shows the change in the number of OOV words after introducing the paraphrased rules to the WIT-based phrase table. 19% and 30% of the original OOVs are eliminated in the dev and eval13 sets, respectively. This reduction translates to an improvement of 0.6% BLEU and 0.7% TER, as indicated in Table 4.

                 # OOV
phrase table     dev     eval13
WIT              185     254
WIT+paraph.      150     183
Vocab. size      3,714   4,734

Table 5: OOV change due to paraphrasing. Vocabulary refers to the number of unique tokens in the Arabic dev and test sets.

VSM-based Translation   Reference
found                   unfolded
interested              keen
jail                    imprisoned
claim                   report
confusing               confounding
encourage               rallied for
villagers               redneck

Table 6: Examples of OOV words that were translated due to paraphrasing; the Arabic OOV column of the original table did not survive extraction and is omitted. The examples are extracted from the translation hypotheses of the small experiment.

Since BLEU and TER are based on word identities and do not detect semantic similarities, we make a comparison between the reference translations and the translations of the system that employed OOV reduction. Examples are shown in Table 6. Although the reference words are not matched exactly, the VSM translations are semantically close to them, suggesting that OOV reduction in these cases was somewhat successful, although not rewarded by either of the scoring measures used.

6 Related Work

Bilingually-constrained phrase embeddings were developed in (Zhang et al., 2014). Initial embeddings were trained in an unsupervised manner, followed by fine-tuning using bilingual knowledge to minimize the semantic distance between translation equivalents and maximize the distance between non-translation pairs. The embeddings are learned using recursive neural networks by decomposing phrases into their constituents. While our work includes bilingual constraints to learn phrase vectors, the constraints are implicit in the phrase corpus. Our approach is simple, focusing on the preprocessing step of preparing the phrase corpus, and therefore it can be used with different existing frameworks that were developed for word vectors.

Zou et al. (2013) learn bilingual word embeddings by designing an objective function that combines unsupervised training with bilingual constraints based on word alignments. Similar to our work, they compute an additional feature for phrase pairs using cosine similarity. Word vectors are averaged to obtain phrase representations. In contrast, our approach learns phrase representations directly.

Recurrent neural networks were used with minimum translation units (Hu et al., 2014), which are phrase pairs undergoing certain constraints. At the input layer, each of the source and target phrases is modeled as a bag of words, while the output phrase is predicted word-by-word assuming conditional independence. The approach seeks to alleviate data sparsity problems that would arise if phrases were to be uniquely distinguished. Our approach does not break phrases down to words, but learns phrase embeddings directly.

Chen et al. (2010) represent a rule in the hierarchical phrase table using a bag-of-words approach. Instead, we learn phrase vectors directly without resorting to their constituent words. Moreover, they apply a count-based approach and employ IBM model 1 probabilities to project the target space to the source space. In contrast, our mapping is similar to that of Mikolov et al. (2013a) and is learned directly from a small set of bilingual data.

Mikolov et al. (2013a) proposed an efficient method to learn word vectors through feed-forward neural networks by eliminating the hidden layer. They do not report end-to-end sentence translation results as we do in this work.

Mikolov et al. (2013b) learn direct representations of phrases after joining a training corpus using a simple monolingual point-wise mutual information criterion with discounting. Our work exploits the rich bilingual knowledge provided by the phrase table to join the corpus instead.

Gao et al. (2013) learn shared space mappings using a feed-forward neural network and represent a phrase vector as a bag-of-words vector. The vectors are learned aiming to optimize an expected BLEU criterion. Our work is different in that we learn two separate source and target mappings. We also do not follow their bag-of-words phrase model approach.

Marton et al. (2009) proposed to eliminate OOVs by looking for similar words using distributional vectors, but they prune the search space, limiting it to candidates observed in the same context as that of the OOV. We do not employ such a heuristic. Instead, we perform a k-nearest neighbor search spanning the full phrase table to paraphrase its rules and generate new entries.

Estimating phrase table scores using monolingual data was investigated in (Klementiev et al., 2012), by building co-occurrence context vectors and using a small dictionary to induce new scores for existing phrase rules. Our work explores the use of distributional vectors extracted from neural networks; moreover, we induce new phrase rules to extend the phrase table. New phrase rules were also generated in (Irvine and Callison-Burch, 2014), where new phrases were produced as a composition of unigram translations.

7 Conclusion

In this work we adapted vector space models to provide a state-of-the-art phrase-based statistical machine translation system with semantic information. We leveraged the bilingual knowledge of the phrase table to construct source and target phrase corpora to learn phrase vectors, which were used to provide semantic scoring of phrase pairs. Word vectors allowed us to extend the phrase table and eliminate OOVs. Both methods proved beneficial for low-resource tasks.

Future work would investigate decoder integration of semantic scoring that extends beyond phrase boundaries to provide semantically coherent translations.

Acknowledgments

This material is partially based upon work supported by the DARPA BOLT project under Contract No. HR0011-12-C-0015. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA. The research leading to these results has also received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no 287658.

References

Yoshua Bengio, Rejean Ducharme, and Pascal Vincent. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

Boxing Chen, George Foster, and Roland Kuhn. 2010. Bilingual sense similarity for statistical machine translation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 834–843.

Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In 49th Annual Meeting of the Association for Computational Linguistics: Short Papers, pages 176–181, Portland, Oregon, June.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, MD, USA, June.

Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, pages 848–856, Stroudsburg, PA, USA. Association for Computational Linguistics.

Jianfeng Gao, Xiaodong He, Wen-tau Yih, and Li Deng. 2013. Learning semantic representations for the phrase translation model. arXiv preprint arXiv:1312.0482.

Zellig S. Harris. 1954. Distributional structure. Word.

Yuening Hu, Michael Auli, Qin Gao, and Jianfeng Gao. 2014. Minimum translation modeling with recurrent neural networks. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 20–29, Gothenburg, Sweden, April. Association for Computational Linguistics.

Ann Irvine and Chris Callison-Burch. 2014. Hallucinating phrase translations for low resource MT. In Proceedings of the Conference on Computational Natural Language Learning (CoNLL).

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709, Seattle, Washington, USA, October. Association for Computational Linguistics.

Alexandre Klementiev, Ann Irvine, Chris Callison-Burch, and David Yarowsky. 2012. Toward statistical machine translation without parallel corpora. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 130–140. Association for Computational Linguistics.

Thomas K. Landauer and Susan T. Dumais. 1997. A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211.

Kevin Lund and Curt Burgess. 1996. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2):203–208.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze. 2008. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge.

Yuval Marton, Chris Callison-Burch, and Philip Resnik. 2009. Improved statistical machine translation using monolingually-derived paraphrases. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, pages 381–390. Association for Computational Linguistics.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013a. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

R.C. Moore and W. Lewis. 2010. Intelligent selection of language model training data. In ACL (Short Papers), pages 220–224, Uppsala, Sweden, July.

Franz Josef Och and Hermann Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proc. of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 295–302, Philadelphia, PA, July.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51, March.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proc. of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), pages 160–167, Sapporo, Japan, July.

Holger Schwenk. 2007. Continuous space language models. Computer Speech & Language, 21(3):492–518.

Andreas Stolcke. 2002. SRILM – an extensible language modeling toolkit. In Proc. of the Int. Conf. on Speech and Language Processing (ICSLP), volume 2, pages 901–904, Denver, CO, September.

Martin Sundermeyer, Tamer Alkhouli, Joern Wuebker, and Hermann Ney. 2014. Translation modeling with bidirectional recurrent neural networks. In Proceedings of the Conference on Empirical Methods on Natural Language Processing, October.

Peter D. Turney, Patrick Pantel, et al. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188.

David Vilar, Daniel Stein, Matthias Huck, and Hermann Ney. 2010. Jane: Open source hierarchical translation, extended with reordering and lexicon models. In ACL 2010 Joint Fifth Workshop on Statistical Machine Translation and Metrics MATR, pages 262–270, Uppsala, Sweden, July.

Joern Wuebker, Matthias Huck, Stephan Peitz, Malte Nuhn, Markus Freitag, Jan-Thorsten Peter, Saab Mansour, and Hermann Ney. 2012. Jane 2: Open source phrase-based and hierarchical statistical machine translation. In International Conference on Computational Linguistics, pages 483–491, Mumbai, India, December.

Jiajun Zhang, Shujie Liu, Mu Li, Ming Zhou, and Chengqing Zong. 2014. Bilingually-constrained phrase embeddings for machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Will Y. Zou, Richard Socher, Daniel M. Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In EMNLP, pages 1393–1398.


Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 11–21, October 25, 2014, Doha, Qatar. ©2014 Association for Computational Linguistics

Bilingual Markov Reordering Labels for Hierarchical SMT

Gideon Maillette de Buy Wenniger and Khalil Sima’an
Institute for Logic, Language and Computation
University of Amsterdam
Science Park 107, 1098 XG Amsterdam, The Netherlands
gemdbw AT gmail.com, k.simaan AT uva.nl

Abstract

Earlier work on labeling Hiero grammars with monolingual syntax reports improved performance, suggesting that such labeling may impact phrase reordering as well as lexical selection. In this paper we explore the idea of inducing bilingual labels for Hiero grammars without using any additional resources beyond those original Hiero itself uses. Our bilingual labels aim at capturing salient patterns of phrase reordering in the training parallel corpus. These bilingual labels originate from hierarchical factorizations of the word alignments in Hiero’s own training data. In this paper we take a Markovian view on synchronous top-down derivations over these factorizations, which allows us to extract 0th- and 1st-order bilingual reordering labels. Using exactly the same training data as Hiero, we show that the Markovian interpretation of word alignment factorization offers major benefits over the unlabeled version. We report extensive experiments with strict and soft bilingually labeled Hiero showing improved performance of up to 1 BLEU point for Chinese-English and about 0.1 BLEU points for German-English.

Phrase reordering in Hiero (Chiang, 2007) is modelled with synchronous rules consisting of phrase pairs with at most two nonterminal gaps, thereby embedding ITG permutations (Wu, 1997) in lexical context. It is by now recognized that Hiero’s reordering can be strengthened either by labeling (e.g., (Zollmann and Venugopal, 2006)) or by supplementing the grammar with extra-grammatical reordering models, e.g., (Xiao et al., 2011; Huck et al., 2013; Nguyen and Vogel, 2013). In this paper we concentrate on labeling approaches.

Conceptually, labeling Hiero rules aims at introducing a preference in the SCFG derivations for frequently occurring lexicalized ordering constellations over rare ones, which also affects lexical selection. In this paper, we present an approach for distilling phrase reordering labels directly from alignments (hence bilingual labels).

To extract bilingual labels from word alignments we must first interpret the alignments as a hierarchy of phrases. Luckily, every word alignment factorizes into Normalized Decomposition Trees (NDTs) (Zhang et al., 2008), showing explicitly how the word alignment recursively decomposes into phrase pairs. Zhang et al. (2008) employ NDTs for extracting Hiero grammars. In this work, we extend NDTs with explicit phrase permutation operators, also extracted from the original word alignment (Sima’an and Maillette de Buy Wenniger, 2013); every node in the NDT is equipped with a node operator that specifies how the order of the target phrases (children of this node) is produced from the corresponding source phrases. Subsequently, we cluster the node operators in these enriched NDTs according to their complexity, e.g., monotone (straight), inverted, non-binary but one-to-one, and the more complex case of discontinuous (Maillette de Buy Wenniger and Sima’an, 2013).

Inspired by work on parsing (Klein and Manning, 2003), we explore a vertical Markovian labeling approach: intuitively, 0th-order labels signify the reordering of the sub-phrases inside the phrase pair (Zhang et al., 2008), 1st-order labels signify reordering aspects of the direct context (an embedding, parent phrase pair) of the phrase pair, and so on. Like the phrase orientation models, this labeling approach does not employ external resources (e.g., taggers, parsers) beyond the training data used by Hiero.

We empirically explore this bucketing for 0th- and 1st-order labels, both as hard and soft labels. In experiments on German-English and Chinese-English we show that this extension of Hiero often significantly outperforms the unlabeled model while using no external data or monolingual labeling mechanisms. This suggests the viability of automatically inducing bilingual labels following the Markov labeling approach on operator-labelled NDTs as proposed in this paper.

1 Hierarchical models and related work

Hiero SCFGs (Chiang, 2005; Chiang, 2007) allow only up to two (pairs of) nonterminals on the right-hand-side (RHS) of synchronous rules. The types of permissible Hiero rules are:

X → 〈α, γ〉   (1)
X → 〈α X1 β, δ X1 ζ〉   (2)
X → 〈α X1 β X2 γ, δ X1 ζ X2 η〉   (3)
X → 〈α X1 β X2 γ, δ X2 ζ X1 η〉   (4)

Here α, β, γ, δ, ζ, η are terminal sequences, possibly empty. Equation 1 corresponds to a normal phrase pair, 2 to a rule with one gap, and 3 and 4 to the monotone and inverting rules respectively. For concreteness, these four shapes can also be written down as simple source/target templates, as in the sketch below.
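A sketch of our own (not from the paper), writing the four rule shapes as plain (source, target) template pairs; the lowercase names stand in for the terminal sequences α ... η:

```python
# The four permissible Hiero rule shapes as (source, target) templates.
# "X1"/"X2" are the linked nonterminal gaps; the lowercase names stand in
# for the (possibly empty) terminal sequences alpha ... eta.
PHRASE_PAIR = ("alpha", "gamma")                                    # rule (1)
ONE_GAP     = ("alpha X1 beta", "delta X1 zeta")                    # rule (2)
MONOTONE    = ("alpha X1 beta X2 gamma", "delta X1 zeta X2 eta")    # rule (3)
INVERTING   = ("alpha X1 beta X2 gamma", "delta X2 zeta X1 eta")    # rule (4)
```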

Given a Hiero SCFG G, a source sentence s is translated into a target sentence t by synchronous derivations d, each a finite sequence of well-formed substitutions of synchronous productions from G; see (Chiang, 2006). Existing phrase-based models score a derivation d with a linear interpolation of a finite set of feature functions Φ(d) of the derivation d, mostly working with local feature functions φ_i of individual productions, the target-side yield string t of d (target language model features) and other features (see the experimental section):

$$\arg\max_{d \in G} P(t, d \mid s) \;\approx\; \arg\max_{d \in G} \sum_{i=1}^{|\Phi(d)|} \lambda_i \, \phi_i$$

The parameters {λ_i} are optimized on a held-out parallel corpus by direct error minimization (Och, 2003).

A range of (distantly) related work exploits syntax for Hiero models, e.g. (Liu et al., 2006; Huang et al., 2006; Mi et al., 2008; Mi and Huang, 2008; Zollmann and Venugopal, 2006; Wu and Wong, 1998). In terms of labeling Hiero rules, SAMT (Zollmann and Venugopal, 2006; Mylonakis and Sima'an, 2011) exploits a "softer notion" of syntax by fitting CCG-like syntactic labels to non-constituent phrases. The work of (Xiao et al., 2011) adds a lexicalized orientation model to Hiero, akin to (Tillmann, 2004), and achieves significant gains. The work of (Huck et al., 2013; Nguyen and Vogel, 2013) overcomes technical limitations of (Xiao et al., 2011), making necessary changes to the decoder, which involve delayed (re-)scoring at hypernodes higher up in the derivation for nodes lower in the chart whose orientations are affected by them. This goes to show that phrase-orientation models are not mere labelings of Hiero.

Soft syntactic constraints have been around for some time now (Zhou et al., 2008; Venugopal et al., 2009; Chiang, 2010). In (Zhou et al., 2008) Hiero is reinforced with a linguistically motivated prior. This prior is based on the level of syntactic homogeneity between pairs of non-terminals and the associated syntactic forests rooted at these nonterminals, whereby tree kernels are applied to efficiently measure the amount of overlap between all pairs of sub-trees induced by the pairs of syntactic forests. Crucially, the syntactic prior encourages derivations that are more syntactically coherent but does not block derivations when they are not. In (Venugopal et al., 2009) the authors associate distributions over compatible syntactic labelings with grammar rules, and combine these preference distributions during decoding, thus achieving a summation rather than a competition between compatible label configurations. The latter approach requires significant changes to the decoder and comes at a considerable computational cost. An alternative approach (Chiang, 2010) uses labels similar to (Zollmann and Venugopal, 2006) together with boolean features for rule-label and substituted-label combinations; using discriminative training (MIRA), it is learned which combinations are associated with better translations.

The labeling approach presented next differs from existing approaches. It is inspired by soft labeling but employs novel, non-linguistic bilingual labels. And it shares the bilingual intuition with phrase orientation models, but it is based on a Markov approach for SCFG labeling, thereby remaining within the confines of Hiero SCFG and avoiding the need to make changes inside the decoder.[1]

[1] Soft constraint decoding can easily be implemented without adapting the decoder, through a smart application of "label bridging" unary rules. In practice however, adapting the decoder turns out to be computationally more efficient; therefore we used this solution in our experiments.


Figure 1: Example alignment from Europarl: English "we should tailor our policy accordingly" aligned with German "darauf müssen wir unsere politik ausrichten".

Figure 2: Normalized Decomposition Tree (Zhang et al., 2008) extended with pointers to the original alignment structure from Figure 1.

2 Bilingual reordering labels for Hiero

Figure 1 shows an alignment from Europarl German-English (Koehn, 2005) along with a tree showing the corresponding maximally decomposed phrase pairs. Phrase pairs can be grouped into a maximally decomposed tree (called a Normalized Decomposition Tree, NDT) (Zhang et al., 2008). Figure 2 shows the NDT for Figure 1, extended with pointers to the original alignment structure in Figure 1. The numbered boxes indicate how the phrases in the two representations correspond. In an NDT every phrase pair is recursively split up at every level into a minimum number (two or greater) of contiguous parts. In this example the root node splits into three phrase pairs, but these phrase pairs together do not cover the entire parent phrase pair because of the discontinuity: "tailor ... accordingly / darauf ... ausrichten".

Following (Zhang et al., 2008), we use the NDT factorizations of word alignments in the training data for extracting phrases. Every NDT shows the hierarchical structuring into phrases embedded in larger phrases, which together with the context of the original alignment exposes the reordering complexity of every phrase (Sima'an and Maillette de Buy Wenniger, 2013). We will exploit these elaborate distinctions based on the complexity of reordering for Hiero rule labels, as explained next.

Phrase-centric (0th-order) labels are based on the view of looking inside a phrase pair to see how it decomposes into sub-phrase pairs. The operator signifying how the sub-phrase pairs are reordered (target relative to source) is bucketed into a number of "permutation complexity" categories. Straightforwardly, we can start out by using the two well-known cases of Inversion Transduction Grammars (ITG), {Monotone, Inverted}, and label everything[2] that falls outside these two categories with a default label "X" (leaving some Hiero nodes unlabeled). This leads to the following coarse phrase-centric labeling scheme, which we name 0th-ITG+: (1) Monotone (Mono): binarizable, fully monotone, plus non-decomposable phrases; (2) Inverted (Inv): binarizable, fully inverted; (3) X: decomposable phrases that are not binarizable.

A clear limitation of the above ITG-like labeling approach is that all phrase pairs that decompose into complex non-binarizable reordering patterns are not further distinguished. Furthermore, non-decomposable phrases are lumped together with decomposable monotone phrases, although they are in fact quite different. To overcome these problems we extend ITG in a way that further distinguishes the non-binarizable phrases and also distinguishes non-decomposable phrases from the rest. This gives a labeling scheme we will call simply 0th-order labeling, abbreviated 0th, consisting of a more fine-grained set of five cases, ordered by increasing complexity (see examples in Figure 4): (1) Atomic: non-decomposable phrases; (2) Monotone (Mono): binarizable, fully monotone; (3) Inverted (Inv): binarizable, fully inverted; (4) Permutation (Perm): factorizes into a permutation of four or more sub-phrases; (5) Complex (Comp): does not factorize into a permutation and contains at least one embedded phrase.

In Figure 3, we show a phrase-complexity labeled derivation for the example of Figure 1. Observe how the phrase-centric labels reflect the relative reordering at the node.

[2] Non-decomposable phrases will still be grouped together with Monotone, since they are more similar to this category than to the catchall "X" category.


Figure 3: Synchronous trees (implicit derivations' end results) based on differently labelled Hiero grammars, for the sentence pair of Figure 1. The figure shows alternative labeling for every node: Phrase-Centric (0th-order, light gray) and Parent-Relative (1st-order, dark gray). Node labels shown: node 1: COMPLEX / TOP; node 2: INVERTED / E.F.D.; node 3: MONO / E.F.D.; node 4: ATOMIC / R.B.I.; node 7: ATOMIC / L.B.M.

Figure 4: Different types of Phrase-Centric Alignment Labels. Examples shown: Monotone ("this is an important matter / das ist ein wichtige angelegenheit"), Inversion ("we all agree on this / das sehen wir alle"), Permutation ("i want to stress two points / auf zwei punkte möchte ich hinweisen"), Complex ("we owe this to our citizens / das sind wir unsern bürgern schuldig"), and Atomic ("it would be possible / kann mann").

For example, the Inverted label of node pair 2 corresponds to the inversion in the alignment of 〈we should, müssen wir〉; in contrast, node pair 1 is complex and discontinuous and its label is Complex.

Parent-relative (1st-order) labels capture the reordering that a phrase undergoes relative to an embedding parent phrase.

1. For a binarizable mother phrase with orientation Xo ∈ {Mono, Inv}, the phrase itself can either group to the left only (Left-Binding-Xo), to the right only (Right-Binding-Xo), or with both sides (Fully-Xo).

2. Fully-Discontinuous: any phrase within a non-binarizable permutation or complex alignment containing discontinuity.

3. Top: phrases that span the entire aligned sentence pair.

In cases where multiple labels are applicable, the simplest applicable label is chosen according to the following preference order, as in the sketch below: {Fully-Monotone, Left/Right-Binding-Monotone, Fully-Inverted, Left/Right-Binding-Inverted, Fully-Discontinuous, TOP}.
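A minimal sketch (our own illustration) of this tie-breaking; the relative order of the Left/Right variants within each pair is our assumption, since the paper lists them jointly:

```python
# Preference order over parent-relative labels, simplest first.
PREFERENCE = [
    "Fully-Monotone", "Left-Binding-Monotone", "Right-Binding-Monotone",
    "Fully-Inverted", "Left-Binding-Inverted", "Right-Binding-Inverted",
    "Fully-Discontinuous", "TOP",
]

def parent_relative_label(applicable):
    """Return the simplest label among the set licensed by the alignment."""
    for label in PREFERENCE:
        if label in applicable:
            return label
    raise ValueError("no applicable parent-relative label")
```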

In Figure 3 the parent-relative labels in the derivation reflect the reordering taking place at the phrases with respect to their parent node. Node 4 has a parent node that inverts the order and the sibling node it binds is on the right; therefore it is labeled "right-binding inverted" (R.B.I.); E.F.D. and L.B.M. are similar abbreviations for "embedded fully discontinuous" and "left-binding monotone" respectively. As yet another example, node 7 in Figure 3 is labeled "left-binding monotone" (L.B.M.) since it is monotone, but the alignment allows it to bind only to the left at the parent node, as opposed to only to the right or to both sides, which would have yielded the "right-binding monotone" (R.B.M.) and "(embedded) fully monotone" (E.F.M.) parent-relative reordering labels respectively.

Note that for parent-relative labels the binding direction of monotone and inverted may not be informative. We therefore also form a set of coarse parent-relative labels ("1st-Coarse") by collapsing the label pairs Left/Right-Binding-Mono and Left/Right-Binding-Inverted into the single labels One-Side-Binding-Mono and One-Side-Binding-Inv.[3]

3 Features for soft bilingual labeling

Labels used in hierarchical Statistical Machine Translation (SMT) are typically adapted from external resources such as taggers and parsers. As in our case, these labels are typically not fitted to the training data, with very few exceptions, e.g., (Mylonakis and Sima'an, 2011; Mylonakis, 2012; Hanneman and Lavie, 2013). Unfortunately this means that the labels will either overfit or underfit, and when they are used as strict constraints on SCFG derivations they are likely to underperform. Experience with mismatch between syntactic labels and the data is abundant (Venugopal et al., 2009; Marton et al., 2012; Chiang, 2010), and using soft constraint decoding with suitable label substitution features has been shown to be an effective workaround solution. The intuition behind soft constraint decoding is that even though heuristic labels are not perfectly tailored to the data, they do provide useful information, provided the model is "allowed to learn" to use them only in as far as they can improve the final evaluation metric (usually BLEU).

[3] We could also further coarsen the 1st labels by removing entirely all sub-distinctions of binding type for the binarizable cases, but that would make the labeling essentially equal to the earlier mentioned 0th-ITG+, except for looking at the reordering occurring at the parent rather than inside the phrase itself. We did not explore this variant in this work, as the high similarity to the already explored 0th-ITG+ variant made it not seem to add much extra information.

Figure 5: Label substitution features, schematic view. Labels/gaps with the same filling in the figure correspond to the situation of a nonterminal/gap whose labels correspond (for N1/GAP1). Fillings of different shades (as for N2/GAP2 on the right in the two figures) indicate the situation where the labels of the nonterminal and the gap differ.

Next we introduce the set of label substitutionfeatures used in our experiments.

Label substitution features consist of a unique feature for every pair of labels 〈Lα, Lβ〉 in the grammar, signifying a rule with left-hand-side label Lβ substituting on a gap labeled Lα. These features are combined with two coarser features, "Match" and "Nomatch", indicating whether the substitution involves labels that match or not.

Figure 5 illustrates the concept of label substitution features schematically. In this figure the substituting rule is substituted onto two gaps in the chart, which induces two label substitution features, indicated by the two ellipses. The situation is analogous for rules with just one gap. To make things concrete, let's assume that both the first nonterminal of the rule, N1, and the first gap it is substituted onto, GAP1, have label MONO. Furthermore, let's assume the second nonterminal N2 has label COMPLEX while the label of the gap GAP2 it substitutes onto is INV. This situation results in the following two specific label substitution features:
• subst(MONO, MONO)
• subst(INV, COMPLEX)

Canonical labeled rules. Typically when labeling Hiero rules there can be many different labeled variants of every original Hiero rule. With soft constraint decoding this leads to prohibitive computational cost, and it also makes tuning the features more difficult. In practice, soft constraint decoding therefore usually exploits a single labeled version per Hiero rule, which we call the "canonical labeled rule". Following (Chiang, 2010), this canonical form is the most frequent labeled variant.
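A minimal sketch (our own illustration) of canonical labeled rule selection by frequency:

```python
# Keep, for every unlabeled Hiero rule, only its most frequent labeled variant.
from collections import Counter, defaultdict

def canonical_labeled_rules(labeled_rule_counts):
    """labeled_rule_counts: dict mapping (unlabeled_rule, labeled_variant) -> count."""
    by_rule = defaultdict(Counter)
    for (rule, variant), count in labeled_rule_counts.items():
        by_rule[rule][variant] += count
    return {rule: variants.most_common(1)[0][0]
            for rule, variants in by_rule.items()}
```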


System Name          | Matching Type | Label Order | Label Granularity
---------------------|---------------|-------------|------------------
Hiero-0th-ITG+       | Strict        | 0th order   | Coarse
Hiero-0th            | Strict        | 0th order   | Fine
Hiero-1st-Coarse     | Strict        | 1st order   | Coarse
Hiero-1st            | Strict        | 1st order   | Fine
Hiero-0th-ITG+-Sft   | Soft          | 0th order   | Coarse
Hiero-0th-Sft        | Soft          | 0th order   | Fine
Hiero-1st-Coarse-Sft | Soft          | 1st order   | Coarse
Hiero-1st-Sft        | Soft          | 1st order   | Fine

Table 1: Experiment names legend

System Name      | DEV: BLEU↑ METEOR↑ TER↓ KRS↑     | TEST: BLEU↑ METEOR↑ TER↓ KRS↑

German-English
Hiero            | 27.90   32.69   58.22    66.37   | 28.39   32.94   58.01    67.44
SAMT             | 27.76   32.67   58.05    66.84N  | 28.32   32.88   57.70NN  67.63
Hiero-0th-ITG+   | 27.85   32.70   58.04NN  66.27   | 28.36   32.90H  57.83NN  67.30
Hiero-0th        | 27.82   32.75   57.92NN  66.66   | 28.39   33.03NN 57.75NN  67.55
Hiero-1st-Coarse | 27.86   32.66   58.23    66.37   | 28.22H  32.90   57.93    67.47
Hiero-1st        | 27.74H  32.60HH 58.11    66.44   | 28.27   32.80HH 57.95    67.39

Chinese-English
Hiero            | 31.70   30.72   61.21    58.28   | 31.63   30.56   59.28    58.03
Hiero-0th-ITG+   | 31.54   30.97NN 62.79HH  59.54NN | 31.94NN 30.84NN 60.76HH  59.45NN
Hiero-0th        | 31.66   30.95NN 62.20HH  60.00NN | 31.90NN 30.79NN 60.11HH  59.68NN
Hiero-1st-Coarse | 31.64   30.75   61.37    59.48NN | 31.57   30.57   59.58HH  59.13NN
Hiero-1st        | 31.74   30.79   61.94HH  60.22NN | 31.77   30.62   60.13HH  59.89NN

Table 2: Mean results for bilingual labels with strict matching.[4] (N/H mark statistically significant improvement/worsening over the Hiero baseline at p ≤ .05; NN/HH at p ≤ .01.)

[4] Statistical significance is dependent on the variance of resampled scores, and hence sometimes differs for the same mean scores across different systems.


4 Experiments

We evaluate our method on two language pairs, using German/Chinese as source and English as target. In all experiments we decode with a 4-gram language model smoothed with modified Kneser-Ney discounting (Chen and Goodman, 1998). The data used for training the language models differs per language pair; details are given in the next paragraphs. All data is lowercased as a last pre-processing step. In all experiments we use our own grammar extractor for the generation of all grammars, including the baseline Hiero grammars. This enables us to use the same features (as far as applicable given the grammar formalism) and assures true comparability of the grammars under comparison.

German-English

The data for our German-English experiments is derived from parliament proceedings sourced from the Europarl corpus (Koehn, 2005), with WMT-07 development and test data. We used a maximum sentence length of 40 for filtering the training data. We employ 1M sentence pairs for training, 1K for development and 2K for testing (single reference per source sentence). Both source and target of all datasets are tokenized using the Moses (Hoang et al., 2007) tokenization script. For these experiments both the baseline and our method use a language model trained on the target side of the full original training set (approximately 1M sentences).

Chinese-English

The data for our Chinese-English experiments is derived from a combination of MultiUN (Eisele and Chen, 2010; Tiedemann, 2012)[5] data and Hong Kong Parallel Text data from the Linguistic Data Consortium[6]. The Hong Kong Parallel Text data is in traditional Chinese and is thus first converted to simplified Chinese to be compatible with the rest of the data.[7]

[5] Freely available and downloaded from http://opus.lingfil.uu.se/
[6] The LDC catalog number of this dataset is LDC2004T08.


System Name          | DEV: BLEU↑ METEOR↑ TER↓ KRS↑     | TEST: BLEU↑ METEOR↑ TER↓ KRS↑

German-English
Hiero                | 27.90   32.69   58.22    66.37   | 28.39   32.94   58.01    67.44
SAMT                 | 27.76   32.67   58.05    66.84N  | 28.32   32.88   57.70NN  67.63
Hiero-0th-ITG+-Sft   | 28.00N  32.76NN 57.90NN  66.17   | 28.48   32.98   57.79NN  67.32
Hiero-0th-Sft        | 28.01N  32.71   57.95NN  66.24   | 28.45   32.98   57.73NN  67.51
Hiero-1st-Coarse-Sft | 27.94   32.69   57.91NN  66.26   | 28.45N  32.94   57.75NN  67.36
Hiero-1st-Sft        | 28.13NN 32.80NN 57.92NN  66.32   | 28.45   33.00N  57.79NN  67.45

Chinese-English
Hiero                | 31.70   30.72   61.21    58.28   | 31.63   30.56   59.28    58.03
Hiero-0th-ITG+-Sft   | 31.88N  30.46HH 60.64NN  57.82H  | 31.93NN 30.37HH 58.86NN  57.60H
Hiero-0th-Sft        | 32.04NN 30.90NN 61.47HH  59.36NN | 32.20NN 30.74NN 59.45H   58.92NN
Hiero-1st-Coarse-Sft | 32.39NN 31.02NN 61.56HH  59.51NN | 32.55NN 30.86NN 59.57HH  59.03NN
Hiero-1st-Sft        | 32.63NN 31.22NN 62.00HH  60.43NN | 32.61NN 30.98NN 60.19HH  59.84NN

Table 3: Mean results for bilingual labels with soft matching.[4]

We used a maximum sentence length of 40 for filtering the training data. The combined dataset has 7.34M sentence pairs. The MultiUN dataset contains translated documents from the United Nations, similar in genre to the parliament domain. The Hong Kong Parallel Text in contrast contains a richer mix of domains, namely Hansards, Laws and News. For the dev and test sets we use the Multiple-Translation Chinese datasets from LDC, parts 1-4[8], which contain sentences from the News domain. We combined parts 2 and 3 to form the dev set (1813 sentence pairs) and parts 1 and 4 to form the test set (1912 sentence pairs). For both development and testing we use 4 references. The Chinese source side of all datasets is segmented using the Stanford Segmenter (Chang et al., 2008)[9]. The English target side of all datasets is tokenized using the Moses tokenization script.

For these experiments both the baseline and our method use a language model trained on 5.4M sentences of domain-specific[10] news data taken from the "Xinhua" subcorpus of the English Gigaword corpus of LDC[11].

[7] Using a simple conversion script downloaded from http://www.mandarintools.com/zhcode.html
[8] LDC catalog numbers: LDC2002T01, LDC2003T17, LDC2004T07 and LDC2004T07.
[9] Downloaded from http://nlp.stanford.edu/software/segmenter.shtml
[10] For Chinese-English translation the different domains of the train data (mainly parliament) and dev/test data (news) require the usage of a domain-specific language model to get optimal results. For German-English, all data is from the parliament domain, so a language model trained on the (translation model) training data is already domain-specific.
[11] The LDC catalog number of this dataset is LDC2003T05.

4.1 Experimental Structure

In our experiments we explore the influence of three dimensions of bilingual reordering labels on translation accuracy. These dimensions are:

• label granularity: the granularity of the labeling {Coarse, Fine}
• label order: the type/order of the labeling {0th, 1st}
• matching type: the type of label matching performed during decoding {Strict, Soft}

Combining these dimensions gives 8 different reordering-labeled systems per language pair, as sketched below. On top of that we use two baseline systems, namely Hiero and Syntax Augmented Machine Translation (SAMT), to measure these systems against. An overview of the naming of our reordering-labeled systems is given in Table 1.
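The 8 system names in Table 1 follow mechanically from the three dimensions; a tiny sketch of our own illustrating the naming scheme:

```python
from itertools import product

def system_name(order, granularity, matching):
    name = f"Hiero-{order}"
    if granularity == "Coarse":              # the coarse 0th scheme is ITG+
        name += "-ITG+" if order == "0th" else "-Coarse"
    if matching == "Soft":
        name += "-Sft"
    return name

systems = [system_name(o, g, m)
           for o, g, m in product(["0th", "1st"],
                                  ["Coarse", "Fine"],
                                  ["Strict", "Soft"])]
# 8 systems: Hiero-0th-ITG+, Hiero-0th, Hiero-1st-Coarse, Hiero-1st,
# and their -Sft counterparts.
```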

Training and decoding details. Our experiments use Joshua (Ganitkevitch et al., 2012) with Viterbi best derivation. Baseline experiments use normal decoding whereas soft labeling experiments use soft constraint decoding. For training we use standard Hiero grammar extraction constraints (Chiang, 2007) (phrase pairs with source spans up to 10 words; abstract rules are forbidden). During decoding, a maximum span of 10 on the source side is maintained. Following common practice, we use relative frequency estimates for phrase probabilities, lexical probabilities and the generative rule probability.

We train our systems using (batch-kbest) MIRA as borrowed by Joshua from the Moses codebase, allowing up to 30 tuning iterations. Following standard practice, we tune on BLEU, and after tuning we use the configuration with the highest scores on the dev set with actual (corpus-level) BLEU evaluation. We report lowercase BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2011) and TER (Snover et al., 2006) scores for the tuned test set and also for the tuned dev set, the latter mainly to observe any possible overfitting. We use Multeval version 0.5.1[12] for computing these metrics. We also use Multeval's implementation of statistical significance testing between systems, which is based on multiple optimizer runs and approximate randomization. Multeval (Clark et al., 2011) randomly swaps outputs between systems and estimates the probability that the observed score difference arose by chance. Differences that are statistically significant and correspond to improvement/worsening with respect to the baseline are marked with N/H at the p ≤ .05 level and NN/HH at the p ≤ .01 level. We also report the Kendall Reordering Score (KRS), which is the reordering-only variant of the LRscore (Birch and Osborne, 2010) (without the optional interpolation with BLEU) and which is a sentence-level score. For the computation of statistical significance of this metric we use our own implementation of the sign test[13] (Dixon and Mood, 1946), as also described in (Koehn, 2010).

In our experiments we repeated each experiment three times to counter unreliable conclusions due to optimizer variance. Scores are averages over three runs of tuning plus testing. Scores marked with N are significantly better than the baseline, those marked with H significantly worse, according to the resampling test of Multeval (Clark et al., 2011).
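A minimal sketch of the sign-test computation (our own reading of the procedure, including the 3 × 3 pairing scheme from footnote [13] below; the two-sided binomial p-value is the standard sign test detail we assume):

```python
from itertools import product
from math import comb

def weighted_sign_test(baseline_runs, system_runs, sent_score):
    """baseline_runs/system_runs: lists of per-run output lists (3 runs each);
    sent_score: a sentence-level score function (here, the KRS)."""
    wins = losses = 0.0
    weight = 1.0 / (len(baseline_runs) * len(system_runs))  # 1/9 for 3 x 3
    for b_out, s_out in product(baseline_runs, system_runs):
        for b_sent, s_sent in zip(b_out, s_out):
            diff = sent_score(s_sent) - sent_score(b_sent)
            if diff > 0:
                wins += weight
            elif diff < 0:
                losses += weight
    # standard two-sided sign test on the (rounded) weighted counts
    n, k = round(wins + losses), round(min(wins, losses))
    p = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)
    return wins, losses, p
```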

Preliminary experiment with strict matching. Initial experiments concerned 0th-order reordering labels in a strict matching approach (no soft constraints). The results are shown in Table 2 for both language pairs. The results for the Hiero and SAMT[14] baselines are shown in the first rows. Below them, the results for the 0th-order (phrase-centric) bilingual labeled systems with either the Coarse (Hiero-0th-ITG+) or Fine label variant (Hiero-0th) are shown, followed by the results for the Coarse and Fine variants of the 1st-order (parent-relative) bilingual labeled systems (Hiero-1st-Coarse and Hiero-1st). All these systems use the default decoding with strict label matching.

[12] https://github.com/jhclark/multeval
[13] To make optimal usage of the 3 runs we computed equally weighted improvement/worsening counts for all possible 3 × 3 baseline output / system output pairs and use those weighted counts in the sign test.
[14] SAMT could only be run for German-English and not for Chinese-English, due to memory constraints.

For German-English the effect of strict bilingual labels is mostly positive: although we have no improvement for BLEU, we do achieve significant improvements for METEOR and TER on the test set. For Chinese-English, overall Hiero-0th-ITG+ shows the biggest improvements, namely significant improvements of +0.31 BLEU, +0.28 METEOR and +1.42 KRS. TER is the only metric that worsens, and considerably so, with +1.48 points. Hiero-1st achieves the highest improvement of KRS, namely 1.86 points higher than the Hiero baseline. Overall, this preliminary experiment shows that strict labeling sometimes gives improvements over Hiero, but sometimes leads to worsening in terms of some of the metrics.

Results with soft bilingual constraints. Our initial experiments with strict bilingual labels in combination with strict matching by the decoder gave some hope that such constraints could be useful. At the same time the results showed no stable improvements across language pairs, and thus do not allow us to draw definite conclusions about the merit of bilingual labels.

Results for experiments with soft bilingual labeling are shown in Table 3. Here Hiero corresponds to the Hiero baseline. Below it are shown the systems that use soft constraint decoding: Hiero-0th-ITG+-Sft and Hiero-0th-Sft, using phrase-centric (0th-order) labels in Coarse or Fine form, and similarly Hiero-1st-Coarse-Sft and Hiero-1st-Sft, the analogous systems with 1st-order, parent-relative labels. For German-English there are only minor improvements for BLEU and METEOR, with somewhat bigger improvements for TER. For Chinese-English, however, the improvements are considerable: +0.98 BLEU over the Hiero baseline for Hiero-1st-Sft, as well as +0.42 METEOR and +1.81 KRS. TER worsens by +0.85 for this system. For Chinese-English the Fine version of the labels gives overall superior results for both 0th-order and 1st-order labels.

Discussion. Our best soft bilingual labeling system for German-English shows small but significant improvements of METEOR and TER, while improving BLEU and KRS as well, though not significantly. The results with soft-constraint matching are in general better than those for strict matching, while there is no clear winner between the Coarse and Fine variants of the labels.

For Chinese-English we see considerable improvements, and overall the best results, for the combination of soft-constraint matching with the Fine 1st-order variant of the labeled systems (Hiero-1st-Sft). For Chinese-English the improvement in word order is also particularly clear, as indicated by the +1.81 KRS improvement for this best system. Furthermore, the negative effects in terms of worsening of TER are also reduced in the soft-matching setting, dropping from +1.48 TER to +0.85 TER. The results for Hiero-0th-Sft are also competitive: though it gives somewhat lower improvements of BLEU and METEOR, it gives an improvement of +1.89 KRS, while TER only worsens by +0.17 for this system.

We conclude that bilingual Markov labels can make a big difference in improving hierarchical SMT. We observe that going beyond the basic reordering labels of ITG, refining the cases not captured by ITG and, even more effectively, taking a 1st-order rather than 0th-order perspective on reordering are major factors in the success of adding reordering information to hierarchical SMT through labeling. Crucial to the success of this undertaking is also the usage of a soft-constraint approach to label matching, as opposed to strict matching. Finally, comparison of the German-English results with the results for Syntax-Augmented Machine Translation (SAMT) reveals that SAMT loses performance compared to the Hiero baseline for BLEU, the metric upon which tuning is done, as well as METEOR, while only TER and KRS show improvement. Since the best bilingual labeled system for German-English (Hiero-1st-Sft) improves METEOR and TER significantly, while also improving BLEU and KRS, though not significantly, we believe our labeling is highly competitive with syntax-based labeling approaches, without the need for any additional resources in the form of parsers or taggers, as syntax-based systems require. The likely complementarity of reordering information and (target) syntax, which improves fluency, makes combining both a promising possibility we would like to explore in future work.

5 Conclusion

We presented a novel method to enrich hierarchical Statistical Machine Translation with bilingual labels that help to improve translation quality. Considerable and significant improvements of BLEU, METEOR and KRS are achieved simultaneously for Chinese-English translation while tuning on BLEU, where the Kendall Reordering Score is specifically designed to measure improvement of reordering in isolation. For German-English, more modest, statistically significant improvements of METEOR and TER (simultaneously) or BLEU (separately) are achieved. Our work differs from related approaches that use syntactic or part-of-speech information in the formation of reordering constraints in that it needs no such additional information. It also differs from related work on reordering constraints based on lexicalization in that it uses no such lexicalization, but instead strives to achieve more globally coherent translations, afforded by global, holistic constraints that take the local reordering history of the derivation directly into account. Our experiments also once again reinforce the established wisdom that soft, rather than strict, constraints are a necessity when aiming to add new information to an already strong system without the risk of effectively worsening performance through constraints that have not been directly tailored to the data by a proper learning approach. While lexicalized constraints on reordering have proven to have great potential, un-lexicalized soft bilingual constraints, which are more general and transcend the rule level, have their own place, providing another agenda for improving translation which focuses more on global coherence by directly putting soft, alignment-informed constraints on the combination of rules. Finally, while more research is necessary in this direction, there are strong reasons to believe that in the right setup these different approaches can be made to further reinforce each other.

Acknowledgements

This work is supported by The Netherlands Organization for Scientific Research (NWO) under grant nr. 612.066.929. The authors would like to thank Matt Post and Juri Ganitkevitch for their support with respect to the integration of Fuzzy Matching Decoding into the Joshua codebase.


References

Alexandra Birch and Miles Osborne. 2010. LRscore for evaluating lexical and reordering quality in MT. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 327–332.

Pi-Chuan Chang, Michel Galley, and Christopher D. Manning. 2008. Optimizing Chinese word segmentation for machine translation performance. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 224–232.

Stanley F. Chen and Joshua T. Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Computer Science Group, Harvard University.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the ACL, pages 263–270, June.

David Chiang. 2006. An introduction to synchronous grammars.

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.

David Chiang. 2010. Learning to translate with source and target syntax. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1443–1452.

Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, pages 176–181.

Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 85–91.

W. J. Dixon and A. M. Mood. 1946. The statistical sign test. Journal of the American Statistical Association, pages 557–566.

Andreas Eisele and Yu Chen. 2010. MultiUN: A multilingual corpus from United Nations documents. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010), pages 2868–2872.

Juri Ganitkevitch, Yuan Cao, Jonathan Weese, Matt Post, and Chris Callison-Burch. 2012. Joshua 4.0: Packing, PRO, and paraphrases. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 283–291, Montreal, Canada, June. Association for Computational Linguistics.

Greg Hanneman and Alon Lavie. 2013. Improving syntax-augmented machine translation by coarsening the label set. In HLT-NAACL, pages 288–297.

Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Richard Zens, RWTH Aachen, Alexandra Constantin, Marcello Federico, Nicola Bertoldi, Chris Dyer, Brooke Cowan, Wade Shen, Christine Moran, and Ondrej Bojar. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pages 177–180.

Liang Huang, Kevin Knight, and Aravind Joshi. 2006. A syntax-directed translator with extended domain of locality. In Proceedings of the Workshop on Computationally Hard Problems and Joint Inference in Speech and Language Processing, pages 1–8.

Matthias Huck, Joern Wuebker, Felix Rietig, and Hermann Ney. 2013. A phrase orientation model for hierarchical machine translation. In ACL 2013 Eighth Workshop on Statistical Machine Translation, pages 452–463.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pages 423–430.

P. Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proc. of MT Summit.

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York, NY, USA.

Yang Liu, Qun Liu, and Shouxun Lin. 2006. Tree-to-string alignment template for statistical machine translation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 609–616.

Gideon Maillette de Buy Wenniger and Khalil Sima'an. 2013. Hierarchical alignment decomposition labels for Hiero grammar rules. In Proceedings of the Seventh Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 19–28.

Yuval Marton, David Chiang, and Philip Resnik. 2012. Soft syntactic constraints for Arabic-English hierarchical phrase-based translation. Machine Translation, 26(1-2):137–157.

Haitao Mi and Liang Huang. 2008. Forest-based translation rule extraction. In Proceedings of EMNLP.

Haitao Mi, Liang Huang, and Qun Liu. 2008. Forest-based translation. In Proceedings of ACL: HLT, June.

Markos Mylonakis and Khalil Sima'an. 2011. Learning hierarchical translation structure with linguistic annotations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 642–652.

Markos Mylonakis. 2012. Learning the Latent Structure of Translation. Ph.D. thesis, University of Amsterdam.

ThuyLinh Nguyen and Stephan Vogel. 2013. Integrating phrase-based reordering features into a chart-based decoder for machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1587–1596.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pages 160–167.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318.

Khalil Sima'an and Gideon Maillette de Buy Wenniger. 2013. Hierarchical alignment trees: A recursive factorization of reordering in word alignments with empirical results. Internal Report.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, pages 223–231.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), pages 2868–2872.

Christoph Tillmann. 2004. A unigram orientation model for statistical machine translation. In Proceedings of HLT-NAACL 2004: Short Papers, pages 101–104.

Ashish Venugopal, Andreas Zollmann, Noah A. Smith, and Stephan Vogel. 2009. Preference grammars: Softening syntactic constraints to improve statistical machine translation. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 236–244.

Dekai Wu and Hongsing Wong. 1998. Machine translation with a stochastic grammatical channel. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 2, pages 1408–1415.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23:377–404.

Xinyan Xiao, Jinsong Su, Yang Liu, Qun Liu, and Shouxun Lin. 2011. An orientation model for hierarchical phrase-based translation. In Proceedings of the 2011 International Conference on Asian Language Processing, pages 165–168.

Hao Zhang, Daniel Gildea, and David Chiang. 2008. Extracting synchronous grammar rules from word-level alignments in linear time. In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, pages 1081–1088.

Bowen Zhou, Bing Xiang, Xiaodan Zhu, and Yuqing Gao. 2008. Prior derivation models for formally syntax-based translation using linguistically syntactic parsing and tree kernels. In Proceedings of the ACL-08: HLT Second Workshop on Syntax and Structure in Statistical Translation (SSST-2), pages 19–27.

Andreas Zollmann and Ashish Venugopal. 2006. Syntax augmented machine translation via chart parsing. In NAACL 2006 - Workshop on Statistical Machine Translation, June.


Better Semantic Frame Based MT Evaluation via Inversion Transduction Grammars

Dekai Wu, Lo Chi-kiu, Meriem Beloucif, Markus Saers
Human Language Technology Center
Department of Computer Science and Engineering
Hong Kong University of Science and Technology (HKUST)
{jackielo|mbeloucif|masaers|dekai}@cs.ust.hk

Abstract

We introduce an inversion transduction grammar based restructuring of the MEANT automatic semantic frame based MT evaluation metric which, by leveraging ITG language biases, is able to further improve upon MEANT's already-high correlation with human adequacy judgments. The new metric, called IMEANT, uses bracketing ITGs to biparse the reference and machine translations, but subject to obeying the semantic frames in both. The resulting improvements support the presumption that ITGs, which constrain the allowable permutations between compositional segments across the reference and MT output, score the phrasal similarity of the semantic role fillers more accurately than the simple word alignment heuristics (bag-of-word alignment or maximum alignment) used in previous versions of MEANT. The approach successfully integrates (1) the previously demonstrated extremely high coverage of cross-lingual semantic frame alternations by ITGs, with (2) the high accuracy of evaluating MT via weighted f-scores on the degree of semantic frame preservation.

1 Introduction

There has been to date relatively little use of inversion transduction grammars (Wu, 1997) to improve the accuracy of MT evaluation metrics, despite long empirical evidence that the vast majority of translation patterns between human languages can be accommodated within ITG constraints (and the observation that most current state-of-the-art SMT systems employ ITG decoders). We show that ITGs can be used to redesign the MEANT semantic frame based MT evaluation metric (Lo et al., 2012) to produce improvements in accuracy and reliability. This work is driven by the motivation that, especially when considering semantic MT metrics, ITGs would seem to be a natural basis for several reasons.

To begin with, it is quite natural to think of sentences as having been generated from an abstract concept using a rewriting system: a stochastic grammar predicts how frequently any particular realization of the abstract concept will be generated. The bilingual analogy is a transduction grammar generating a pair of possible realizations of the same underlying concept. Stochastic transduction grammars predict how frequently a particular pair of realizations will be generated, and thus represent a good way to evaluate how well a pair of sentences correspond to each other.

The particular class of transduction grammars known as ITGs tackles the problem that the (bi)parsing complexity for general syntax-directed transductions (Aho and Ullman, 1972) is exponential. By constraining a syntax-directed transduction grammar to allow only monotonic straight and inverted reorderings, or equivalently permitting only binary or ternary rank rules, it is possible to isolate the low end of that hierarchy into a single equivalence class of inversion transductions. ITGs are guaranteed to have a two-normal form similar to context-free grammars, and can be biparsed in polynomial time and space ($O(n^6)$ time and $O(n^4)$ space). It is also possible to do approximate biparsing in $O(n^3)$ time (Saers et al., 2009). These polynomial complexities make it feasible to estimate the parameters of an ITG using standard machine learning techniques such as expectation maximization (Wu, 1995b).

At the same time, inversion transductions have also been directly shown to be more than sufficient to account for the reordering that occurs within semantic frame alternations (Addanki et al., 2012). This makes ITGs an appealing alternative for evaluating the possible links between both semantic role fillers in different languages as well as the predicates, and how these parts fit together to form entire semantic frames. We believe that ITGs are not only capable of generating the desired structural correspondences between the semantic structures of two languages, but also provide meaningful constraints to prevent alignments from wandering off in the wrong direction.

In this paper we show that IMEANT, a new metric drawing from the strengths of both MEANT and inversion transduction grammars, is able to exploit bracketing ITGs (also known as BITGs or BTGs), which are ITGs containing only a single non-differentiated nonterminal category (Wu, 1995a), so as to produce even higher correlation with human adequacy judgments than any automatic MEANT variant, or than other common automatic metrics. We argue that the constraints provided by BITGs over the semantic frames and arguments of the reference and MT output sentences are essential for accurate evaluation of the phrasal similarity of the semantic role fillers.

In common with the various MEANT semantic MT evaluation metrics (Lo and Wu, 2011a, 2012; Lo et al., 2012; Lo and Wu, 2013b), our proposed IMEANT metric measures the degree to which the basic semantic event structure is preserved by translation: the "who did what to whom, for whom, when, where, how and why" (Pradhan et al., 2004), emphasizing that a good translation is one that can successfully be understood by a human. In the other versions of MEANT, similarity between the MT output and the reference translations is computed as a modified weighted f-score over the semantic predicates and role fillers. Across a variety of language pairs and genres, it has been shown that MEANT correlates better with human adequacy judgment than both n-gram based MT evaluation metrics such as BLEU (Papineni et al., 2002), NIST (Doddington, 2002), and METEOR (Banerjee and Lavie, 2005), as well as edit-distance based metrics such as CDER (Leusch et al., 2006), WER (Nießen et al., 2000), and TER (Snover et al., 2006) when evaluating MT output (Lo and Wu, 2011a, 2012; Lo et al., 2012; Lo and Wu, 2013b; Macháček and Bojar, 2013). Furthermore, tuning the parameters of MT systems with MEANT instead of BLEU or TER robustly improves translation adequacy across different genres and different languages (English and Chinese) (Lo et al., 2013a; Lo and Wu, 2013a; Lo et al., 2013b). This has motivated our choice of MEANT as the basis on which to experiment with deploying ITGs into semantic MT evaluation.

2 Related Work

2.1 ITGs and MT evaluation

Relatively little investigation into the potential benefits of ITGs is found in previous MT evaluation work. One exception is invWER, proposed by Leusch et al. (2003) and Leusch and Ney (2008). The invWER metric interprets weighted BITGs as a generalization of the Levenshtein edit distance, in which entire segments (blocks) can be inverted, as long as this is done strictly compositionally so as not to violate legal ITG biparse tree structures. The input and output languages are considered to be those of the reference and machine translations, and thus are over the same vocabulary (say, English). At the sentence level, correlation of invWER with human adequacy judgments was found to be among the best.

Our current approach differs in several key respects from invWER. First, invWER operates purely at the surface level of exact token match, whereas IMEANT mediates between segments of the reference translation and MT output using lexical BITG probabilities.

Secondly, there is no explicit semantic modeling in invWER. Providing they meet the BITG constraints, the biparse trees in invWER are completely unconstrained. In contrast, IMEANT employs the same explicit, strong semantic frame modeling as MEANT, on both the reference and machine translations. In IMEANT, the semantic frames always take precedence over pure BITG biases. Compared to invWER, this strongly constrains the space of biparses that IMEANT permits to be considered.

2.2 MT evaluation metrics

Like invWER, other common surface-form oriented metrics like BLEU (Papineni et al., 2002), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005; Denkowski and Lavie, 2014), CDER (Leusch et al., 2006), WER (Nießen et al., 2000), and TER (Snover et al., 2006) do not correctly reflect the meaning similarities of the input sentence. There are in fact several large-scale meta-evaluations (Callison-Burch et al., 2006; Koehn and Monz, 2006) reporting cases where BLEU strongly disagrees with human judgments of translation adequacy.

Such observations have generated a recent surge of work on developing MT evaluation metrics that would outperform BLEU in correlation with human adequacy judgment (HAJ). Like MEANT, the TINE automatic recall-oriented evaluation metric (Rios et al., 2011) aims to preserve basic event structure. However, its correlation with human adequacy judgment is comparable to that of BLEU and not as high as that of METEOR. Owczarzak et al. (2007a,b) improved correlation with human fluency judgments by using LFG to extend the approach of evaluating syntactic dependency structure similarity proposed by Liu and Gildea (2005), but did not achieve higher correlation with human adequacy judgments than metrics like METEOR. Another automatic metric, ULC (Giménez and Màrquez, 2007, 2008), incorporates several semantic similarity features and shows improved correlation with human judgment of translation quality (Callison-Burch et al., 2007; Giménez and Màrquez, 2007; Callison-Burch et al., 2008; Giménez and Màrquez, 2008), but no work has been done towards tuning an SMT system using a pure form of ULC, perhaps due to its expensive run time. Likewise, SPEDE (Wang and Manning, 2012) predicts the edit sequence needed to match the machine translation to the reference translation via an integrated probabilistic FSM and probabilistic PDA model. The semantic textual similarity metric Sagan (Castillo and Estrella, 2012) is based on a complex textual entailment pipeline. These aggregated metrics require sophisticated feature extraction steps, contain many parameters that need to be tuned, and employ expensive linguistic resources such as WordNet or paraphrase tables. The expensive training, tuning and/or running time renders these metrics difficult to use in the SMT training cycle.

3 IMEANT

In this section we give a contrastive description of IMEANT: we first summarize the MEANT approach, and then explain how IMEANT differs.

3.1 Variants of MEANT

MEANT and its variants (Lo et al., 2012) measure weighted f-scores over corresponding semantic frames and role fillers in the reference and machine translations. The automatic versions of MEANT replace humans with automatic SRL and alignment algorithms. MEANT typically outperforms BLEU, NIST, METEOR, WER, CDER and TER in correlation with human adequacy judgment, and is relatively easy to port to other languages, requiring only an automatic semantic parser and a monolingual corpus of the output language, which is used to gauge lexical similarity between the semantic role fillers of the reference and translation. MEANT is computed as follows:

1. Apply an automatic shallow semantic parser to both the reference and machine translations. (Figure 1 shows examples of automatic shallow semantic parses on both reference and MT.)

2. Apply the maximum weighted bipartite matching algorithm to align the semantic frames between the reference and machine translations according to the lexical similarities of the predicates. (Lo and Wu (2013a) proposed a backoff algorithm that evaluates the entire sentence of the MT output using the lexical similarity based on the context vector model if the automatic shallow semantic parser fails to parse the reference or machine translations.)

3. For each pair of aligned frames, apply the maximum weighted bipartite matching algorithm to align the arguments between the reference and MT output according to the lexical similarity of the role fillers.

4. Compute the weighted f-score over the matching role labels of these aligned predicates and role fillers according to the following definitions:

$q^0_{i,j}$ ≡ ARG $j$ of aligned frame $i$ in MT
$q^1_{i,j}$ ≡ ARG $j$ of aligned frame $i$ in REF
$w^0_i$ ≡ (# tokens filled in aligned frame $i$ of MT) / (total # tokens in MT)
$w^1_i$ ≡ (# tokens filled in aligned frame $i$ of REF) / (total # tokens in REF)
$w_{\mathrm{pred}}$ ≡ weight of similarity of predicates
$w_j$ ≡ weight of similarity of ARG $j$
$e_{i,\mathrm{pred}}$ ≡ the pred string of the aligned frame $i$ of MT
$f_{i,\mathrm{pred}}$ ≡ the pred string of the aligned frame $i$ of REF
$e_{i,j}$ ≡ the role fillers of ARG $j$ of the aligned frame $i$ of MT
$f_{i,j}$ ≡ the role fillers of ARG $j$ of the aligned frame $i$ of REF
$s(e,f)$ = lexical similarity of token $e$ and $f$


Figure 1: Examples of automatic shallow semantic parses. Both the reference and machine translations are parsed using automatic English SRL. There are no semantic frames for MT3 since there is no predicate in the MT output. Sentences shown in the figure: [IN] 至此 , 在 中国 内地 停售 了 近 两 个 月 的 SK-II 全线 产品 恢复 销售 。 [REF] Until after their sales had ceased in mainland China for almost two months , sales of the complete range of SK – II products have now been resumed . [MT1] So far , nearly two months sk - ii the sale of products in the mainland of China to resume sales . [MT2] So far , in the mainland of China to stop selling nearly two months of SK - 2 products sales resumed . [MT3] So far , the sale in the mainland of China for nearly two months of SK - II line of products .

$$\mathrm{prec}_{\mathbf{e},\mathbf{f}} = \frac{\sum_{e \in \mathbf{e}} \max_{f \in \mathbf{f}} s(e,f)}{|\mathbf{e}|}$$

$$\mathrm{rec}_{\mathbf{e},\mathbf{f}} = \frac{\sum_{f \in \mathbf{f}} \max_{e \in \mathbf{e}} s(e,f)}{|\mathbf{f}|}$$

$$s_{i,\mathrm{pred}} = \frac{2 \cdot \mathrm{prec}_{e_{i,\mathrm{pred}},f_{i,\mathrm{pred}}} \cdot \mathrm{rec}_{e_{i,\mathrm{pred}},f_{i,\mathrm{pred}}}}{\mathrm{prec}_{e_{i,\mathrm{pred}},f_{i,\mathrm{pred}}} + \mathrm{rec}_{e_{i,\mathrm{pred}},f_{i,\mathrm{pred}}}}$$

$$s_{i,j} = \frac{2 \cdot \mathrm{prec}_{e_{i,j},f_{i,j}} \cdot \mathrm{rec}_{e_{i,j},f_{i,j}}}{\mathrm{prec}_{e_{i,j},f_{i,j}} + \mathrm{rec}_{e_{i,j},f_{i,j}}}$$

$$\mathrm{precision} = \frac{\sum_i w^0_i \, \frac{w_{\mathrm{pred}}\, s_{i,\mathrm{pred}} + \sum_j w_j\, s_{i,j}}{w_{\mathrm{pred}} + \sum_j w_j\, |q^0_{i,j}|}}{\sum_i w^0_i}$$

$$\mathrm{recall} = \frac{\sum_i w^1_i \, \frac{w_{\mathrm{pred}}\, s_{i,\mathrm{pred}} + \sum_j w_j\, s_{i,j}}{w_{\mathrm{pred}} + \sum_j w_j\, |q^1_{i,j}|}}{\sum_i w^1_i}$$

$$\mathrm{MEANT} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
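To make steps 2-4 concrete, here is a minimal sketch of the alignment and scoring computation under the definitions above. It is illustrative rather than the authors' implementation: frames are assumed to be dicts mapping role labels to token lists, `sim` stands in for the context vector model similarity, arguments are matched by identical role label (the full metric aligns them by bipartite matching over role filler similarities), and SciPy's `linear_sum_assignment` serves as the maximum weighted bipartite matching of frames.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def phrasal_f(e_toks, f_toks, sim):
    """Token-level f-score aggregation used for s_{i,pred} and s_{i,j}."""
    if not e_toks or not f_toks:
        return 0.0
    prec = sum(max(sim(e, f) for f in f_toks) for e in e_toks) / len(e_toks)
    rec = sum(max(sim(e, f) for e in e_toks) for f in f_toks) / len(f_toks)
    return 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0

def align_frames(mt_frames, ref_frames, sim):
    """Maximum weighted bipartite matching over predicate similarities."""
    cost = np.array([[-phrasal_f(m['pred'], r['pred'], sim)
                      for r in ref_frames] for m in mt_frames])
    return list(zip(*linear_sum_assignment(cost)))

def meant(mt_frames, ref_frames, mt_len, ref_len, w_pred, w_arg, sim):
    """w_arg[j] is the weight of role label j (12 weights in the full metric)."""
    num_p = den_p = num_r = den_r = 0.0
    for i_mt, i_ref in align_frames(mt_frames, ref_frames, sim):
        mt, ref = mt_frames[i_mt], ref_frames[i_ref]
        w0 = sum(len(v) for v in mt.values()) / mt_len    # frame weight in MT
        w1 = sum(len(v) for v in ref.values()) / ref_len  # frame weight in REF
        score = w_pred * phrasal_f(mt['pred'], ref['pred'], sim)
        norm_p = norm_r = w_pred
        for j in (set(mt) | set(ref)) - {'pred'}:
            score += w_arg[j] * phrasal_f(mt.get(j, []), ref.get(j, []), sim)
            norm_p += w_arg[j] * (j in mt)    # counts |q0_{i,j}|
            norm_r += w_arg[j] * (j in ref)   # counts |q1_{i,j}|
        num_p += w0 * score / norm_p; den_p += w0
        num_r += w1 * score / norm_r; den_r += w1
    precision = num_p / den_p if den_p else 0.0
    recall = num_r / den_r if den_r else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```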

where $q^0_{i,j}$ and $q^1_{i,j}$ are the argument of type $j$ in frame $i$ of the MT and REF respectively, and $w^0_i$ and $w^1_i$ are the weights for frame $i$ in the MT/REF respectively. These weights estimate the degree of contribution of each frame to the overall meaning of the sentence. $w_{\mathrm{pred}}$ and $w_j$ are the weights of the lexical similarities of the predicates and the role fillers of the arguments of type $j$ of all frames between the reference translations and the MT output. There is a total of 12 weights for the set of semantic role labels in MEANT, as defined in Lo and Wu (2011b). For MEANT, they are determined using supervised estimation via a simple grid search to optimize the correlation with human adequacy judgments (Lo and Wu, 2011a). For UMEANT (Lo and Wu, 2012), they are estimated in an unsupervised manner using the relative frequency of each semantic role label in the references; UMEANT is thus useful when human adequacy judgments on the development set are unavailable.
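As an illustration of the supervised variant, the grid search might look like the following sketch. The helper `metric_scores` and the three-weight grid are illustrative placeholders (the actual metric searches over its 12 role-label weights), and Kendall correlation is computed with SciPy.

```python
import itertools
import numpy as np
from scipy.stats import kendalltau

def grid_search_weights(metric_scores, human_scores, n_weights=3,
                        grid=np.linspace(0.0, 1.0, 5)):
    """metric_scores(w) is assumed to return one metric score per sentence
    of the development set under weight vector w; we keep the w whose
    scores correlate best (Kendall tau) with human adequacy judgments."""
    best_tau, best_w = -2.0, None
    for w in itertools.product(grid, repeat=n_weights):
        tau, _ = kendalltau(metric_scores(w), human_scores)
        if tau > best_tau:
            best_tau, best_w = tau, w
    return best_w, best_tau
```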

$s_{i,\mathrm{pred}}$ and $s_{i,j}$ are the lexical similarities, based on a context vector model, of the predicates and role fillers of the arguments of type $j$ between the reference translations and the MT output. Lo et al. (2012) and Tumuluru et al. (2012) described how the lexical and phrasal similarities of the semantic role fillers are computed. A subsequent variant of the aggregation function, inspired by Mihalcea et al. (2006), that normalizes phrasal similarities according to the phrase length more accurately was used in more recent work (Lo et al., 2013a; Lo and Wu, 2013a; Lo et al., 2013b). In this paper, we will assess IMEANT against the latest version of MEANT (Lo et al., 2014) which, as shown, uses f-score to aggregate individual token similarities into the composite phrasal similarities of semantic role fillers, since this has been shown to be more accurate than the previously used aggregation functions.

Recent studies (Lo et al., 2013a; Lo and Wu, 2013a; Lo et al., 2013b) show that tuning MT systems against MEANT produces more robustly adequate translations than the common practice of tuning against BLEU or TER across different data genres, such as formal newswire text, informal web forum text and informal public speech.


In an alternative, quality-estimation oriented line of research, Lo et al. (2014) describe a cross-lingual variant called XMEANT capable of evaluating translation quality without the need for expensive human reference translations, by utilizing semantic parses of the original foreign input sentence instead of a reference translation. Since XMEANT's results could have been due to either (1) more accurate evaluation of phrasal similarity via cross-lingual translation probabilities, or (2) better matching of semantic frames without reference translations, there is no direct evidence whether ITGs contribute to the improvement in MEANT's correlation with human adequacy judgment. For the sake of better understanding whether ITGs improve semantic MT evaluation, we will also assess IMEANT against cross-lingual XMEANT.

3.2 The IMEANT metric

Although MEANT was previously shown to produce higher correlation with human adequacy judgments compared to other automatic metrics, our error analyses suggest that it still suffers from a common weakness among metrics employing lexical similarity, namely that word/token alignments between the reference and machine translations are severely underconstrained. No bijectivity or permutation restrictions are applied, even between compositional segments where this would be natural. This can cause role fillers to be aligned even when they should not be. IMEANT, in contrast, uses a bracketing inversion transduction grammar to constrain permissible token alignment patterns between aligned role filler phrases. The semantic frames above the token level also fit ITG compositional structure, consistent with the aforementioned semantic frame alternation coverage study of Addanki et al. (2012). Figure 2 illustrates how the ITG constraints are consistent with the needed permutations between semantic role fillers across the reference and machine translations for a sample sentence from our evaluation data, which, as we will see, leads to higher HAJ correlations than MEANT.

Subject to the structural ITG constraints, IMEANT scores sentence translations in a spirit similar to the way MEANT scores them: it utilizes an aggregated score over the matched semantic role labels of the automatically aligned semantic frames and their role fillers between the reference and machine translations. Despite the structural differences, like MEANT, at the conceptual level IMEANT still aims to evaluate MT output in terms of the degree to which the translation has preserved the essential “who did what to whom, for whom, when, where, how and why” of the foreign input sentence.

Unlike MEANT, however, IMEANT aligns and scores under ITG assumptions. MEANT uses a maximum alignment algorithm to align the tokens in the role fillers between the reference and machine translations, and then scores by aggregating the lexical similarities into a phrasal similarity using an f-measure. In contrast, IMEANT aligns and scores by utilizing a length-normalized weighted BITG (Wu, 1997; Zens and Ney, 2003; Saers and Wu, 2009; Addanki et al., 2012). To be precise in this regard, we can see IMEANT as differing from the foregoing description of MEANT in the definition of $s_{i,\mathrm{pred}}$ and $s_{i,j}$, as follows.

$$G \equiv \langle \{A\},\, W_0,\, W_1,\, \mathcal{R},\, A \rangle$$

$$\mathcal{R} \equiv \{\, A \rightarrow [AA],\; A \rightarrow \langle AA \rangle,\; A \rightarrow e/f \,\}$$

$$p([AA] \mid A) = p(\langle AA \rangle \mid A) = 1$$

$$p(e/f \mid A) = s(e, f)$$

$$s_{i,\mathrm{pred}} = \lg^{-1}\!\left( \frac{\lg\left(P\!\left(A \stackrel{*}{\Rightarrow} e_{i,\mathrm{pred}}/f_{i,\mathrm{pred}} \mid G\right)\right)}{\max(|e_{i,\mathrm{pred}}|, |f_{i,\mathrm{pred}}|)} \right)$$

$$s_{i,j} = \lg^{-1}\!\left( \frac{\lg\left(P\!\left(A \stackrel{*}{\Rightarrow} e_{i,j}/f_{i,j} \mid G\right)\right)}{\max(|e_{i,j}|, |f_{i,j}|)} \right)$$

where $G$ is a bracketing ITG whose only nonterminal is $A$, and $\mathcal{R}$ is a set of transduction rules with $e \in W_0 \cup \{\epsilon\}$ denoting a token in the MT output (or the null token) and $f \in W_1 \cup \{\epsilon\}$ denoting a token in the reference translation (or the null token). The rule probability (or more accurately, rule weight) function $p$ is set to 1 for structural transduction rules, and for lexical transduction rules it is defined using MEANT's context vector model based lexical similarity measure. To calculate the inside probability (or more accurately, inside score) of a pair of segments, $P(A \stackrel{*}{\Rightarrow} \mathbf{e}/\mathbf{f} \mid G)$, we use the algorithm described in Saers et al. (2009). Given this, $s_{i,\mathrm{pred}}$ and $s_{i,j}$ now represent the length-normalized BITG parse scores of the predicates and role fillers of the arguments of type $j$ between the reference and machine translations.
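For concreteness, the following sketch computes the BITG inside score of a segment pair with a straightforward CKY-style dynamic program over bispans. It is exact but O(|e|³|f|³), whereas the system itself uses the cubic-time biparsing algorithm of Saers et al. (2009); the null-alignment weight `null_sim` is an illustrative assumption standing in for the model's singleton rule weights.

```python
import math
from itertools import product

def bitg_inside(e, f, sim, null_sim=1e-4):
    """Inside score P(A =>* e/f | G) for the bracketing ITG with
    p([AA]|A) = p(<AA>|A) = 1 and p(e/f|A) = s(e, f)."""
    n, m = len(e), len(f)
    delta = {}  # delta[(s, t, u, v)] = inside score of A over e[s:t]/f[u:v]
    get = lambda s, t, u, v: delta.get((s, t, u, v), 0.0)
    for s, u in product(range(n), range(m)):
        delta[(s, s + 1, u, u + 1)] = sim(e[s], f[u])  # A -> e/f
    for s, u in product(range(n), range(m + 1)):
        delta[(s, s + 1, u, u)] = null_sim             # A -> e/epsilon
    for s, u in product(range(n + 1), range(m)):
        delta[(s, s, u, u + 1)] = null_sim             # A -> epsilon/f
    # combine bottom-up by total bispan width; empty child bispans have
    # score zero, so only strictly smaller bispans ever contribute
    for width in range(2, n + m + 1):
        for we in range(max(0, width - m), min(width, n) + 1):
            wf = width - we
            for s in range(n - we + 1):
                for u in range(m - wf + 1):
                    t, v = s + we, u + wf
                    total = get(s, t, u, v)  # keep any lexical mass
                    for S, U in product(range(s, t + 1), range(u, v + 1)):
                        total += get(s, S, u, U) * get(S, t, U, v)  # A -> [AA]
                        total += get(s, S, U, v) * get(S, t, u, U)  # A -> <AA>
                    delta[(s, t, u, v)] = total
    return get(0, n, 0, m)

def bitg_similarity(e, f, sim):
    """Length-normalized score: lg^-1( lg(P) / max(|e|, |f|) )."""
    p = bitg_inside(e, f, sim)
    return 10 ** (math.log10(p) / max(len(e), len(f))) if p > 0 else 0.0
```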

4 Experiments

In this section we discuss experiments indicating that IMEANT further improves upon MEANT's


[REF] The reduction in hierarchy helps raise the efficiency of inspection and supervisory work .

[MT2] The level of reduction is conducive to raising the inspection and supervision work efficiency .

Figure 2: An example of aligning automatic shallow semantic parses under ITGs, visualized using both biparse tree and alignment matrix depictions, for the Chinese input sentence 层级的减少有利于提高检查监督工作的效率。. Both the reference and machine translations are parsed using automatic English SRL. Compositional alignments between the semantic frames and the tokens within role filler phrases obey inversion transduction grammars.

already-high correlation with human adequacy judgments.

4.1 Experimental setup

We perform the meta-evaluation upon two different partitions of the DARPA GALE P2.5 Chinese-English translation test set. The corpus includes the Chinese input sentences, each accompanied by one English reference translation and the output of three participating state-of-the-art MT systems.

For the sake of consistent comparison, the first evaluation partition, GALE-A, is the same as the one used in Lo and Wu (2011a), and the second evaluation partition, GALE-B, is the same as the one used in Lo and Wu (2011b).

For both reference and machine translations, the ASSERT (Pradhan et al., 2004) semantic role labeler was used to automatically predict semantic parses.


Table 1: Sentence-level correlation with human adequacy judgments on different partitions of GALE P2.5 data. IMEANT always yields top correlations, and is more consistent than either MEANT or its recent cross-lingual XMEANT quality estimation variant. For reference, the human HMEANT upper bound is 0.53 for GALE-A and 0.37 for GALE-B; thus, the fully automated IMEANT approximation is not far from closing the gap.

metric                 GALE-A  GALE-B
IMEANT                 0.51    0.33
XMEANT                 0.51    0.20
MEANT                  0.48    0.33
METEOR 1.5 (2014)      0.43    0.10
NIST                   0.29    0.16
METEOR 0.4.3 (2005)    0.20    0.29
BLEU                   0.20    0.27
TER                    0.20    0.19
PER                    0.20    0.18
CDER                   0.12    0.16
WER                    0.10    0.26

4.2 Results

The sentence-level correlations in Table 1 show that IMEANT outperforms the other automatic metrics in correlation with human adequacy judgment. Note that this was achieved with no tuning whatsoever of the default rule weights (suggesting that the performance of IMEANT could be further improved in the future by optimizing the ITG weights).

On the GALE-A partition, IMEANT shows a 3-point improvement over MEANT, and is tied with the cross-lingual XMEANT quality estimator discussed earlier. IMEANT produces much higher HAJ correlations than any of the other metrics.

On the GALE-B partition, IMEANT is tied with MEANT, and is significantly better correlated with HAJ than the XMEANT quality estimator. Again, IMEANT produces much higher HAJ correlations than any of the other metrics.

We note that we have also observed this pattern consistently in smaller-scale experiments: while the monolingual MEANT metric and its cross-lingual XMEANT cousin vie with each other on different data sets, IMEANT robustly and consistently produces top HAJ correlations.

In both the GALE-A and GALE-B partitions, IMEANT comes within a few points of the human upper-bound benchmark HAJ correlations, computed using the human-labeled semantic frames and alignments used in HMEANT.

Data analysis reveals two reasons that IMEANT correlates with human adequacy judgment more closely than MEANT. First, BITG constraints indeed provide more accurate phrasal similarity aggregation, compared to the naive bag-of-words based heuristics employed in MEANT. Similar results have been observed in estimating word alignment probabilities, where BITG constraints outperformed alignments from GIZA++ (Saers and Wu, 2009).

Secondly, the permutation and bijectivity constraints enforced by the ITG provide better leverage to reject token alignments when they are not appropriate, compared with the maximal alignment approach, which tends to be rather promiscuous. A case of this can be seen in Figure 3, which shows the result on the same example sentence as in Figure 1. Disregarding the semantic parsing errors arising from the current limitations of automatic SRL tools, the ITG tends to provide clean, sparse alignments for role fillers like the ARG1 of the resumed PRED, preferring to leave tokens like complete and range unaligned instead of aligning them anyway as MEANT's maximal alignment algorithm tends to do. Note that it is not simply a matter of lowering thresholds for accepting token alignments: Tumuluru et al. (2012) showed that the competitive linking approach (Melamed, 1996), which also generally produces sparser alignments, does not work as well in MEANT, whereas the ITG appears to be selective about the token alignments in a manner that better fits the semantic structure.

For contrast, Figure 4 shows a case where IMEANT appropriately accepts dense alignments.

5 Conclusion

We have presented IMEANT, an inversion transduction grammar based rethinking of the MEANT semantic frame based MT evaluation approach, that achieves higher correlation with human adequacy judgments of MT output quality than MEANT and its variants, as well as other common evaluation metrics. Our results improve upon previous research showing that MEANT's explicit use of semantic frames leads to state-of-the-art automatic MT evaluation. IMEANT achieves this by aligning and scoring semantic frames under a simple, consistent ITG that provides empirically


[REF] Until after their sales had ceased in mainland China for almost two months , sales of the complete range of SK – II products have now been resumed .

[MT2] So far , in the mainland of China to stop selling nearly two months of SK - 2 products sales resumed .

Figure 3: An example where the ITG helps produce correctly sparse alignments by rejecting inappropriate token alignments in the ARG1 of the resumed PRED, instead of wrongly aligning tokens like the, complete, and range as MEANT tends to do. (The semantic parse errors are due to limitations of automatic SRL.)

informative permutation and bijectivity biases, instead of the maximal alignment and bag-of-words assumptions used by MEANT. At the same time, IMEANT retains the Occam's Razor style simplicity and representational transparency characteristics of MEANT.



[REF] Australian Prime Minister Howard said the government could cancel AWB 's monopoly in the wheat business next week .

[MT2] Australian Prime Minister John Howard said that the Government might cancel the AWB company wheat monopoly next week .

Figure 4: An example of dense alignments in IMEANT, for the Chinese input sentence 澳大利亚总理霍华德表示，政府可能于下周取消 AWB公司小麦专卖的业务。 (The semantic parse errors are due to limitations of automatic SRL.)

Given the absence of any tuning of the ITG weights in this first version of IMEANT, we speculate that IMEANT could perform even better than it already does here. We plan to investigate simple hyperparameter optimizations in the near future.


6 Acknowledgments

This material is based upon work supported in part by the Defense Advanced Research Projects Agency (DARPA) under BOLT contract nos. HR0011-12-C-0014 and HR0011-12-C-0016, and GALE contract nos. HR0011-06-C-0022 and HR0011-06-C-0023; by the European Union under the FP7 grant agreement no. 287658; and by the Hong Kong Research Grants Council (RGC) research grants GRF620811, GRF621008, and GRF612806. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, the EU, or RGC. Thanks to Karteek Addanki for supporting work, and to Pascale Fung, Yongsheng Yang and Zhaojun Wu for sharing the maximum entropy Chinese segmenter and C-ASSERT, the Chinese semantic parser.

References

Karteek Addanki, Chi-kiu Lo, Markus Saers, and Dekai Wu. LTG vs. ITG coverage of cross-lingual verb frame alternations. In 16th Annual Conference of the European Association for Machine Translation (EAMT-2012), Trento, Italy, May 2012.

Alfred V. Aho and Jeffrey D. Ullman. The Theory of Parsing, Translation, and Compiling. Prentice-Hall, Englewood Cliffs, New Jersey, 1972.

Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, Michigan, June 2005.

Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role of BLEU in machine translation research. In 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-2006), 2006.

Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) evaluation of machine translation. In Second Workshop on Statistical Machine Translation (WMT-07), 2007.

Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. Further meta-evaluation of machine translation. In Third Workshop on Statistical Machine Translation (WMT-08), 2008.

Julio Castillo and Paula Estrella. Semantic textual similarity for MT evaluation. In 7th Workshop on Statistical Machine Translation (WMT 2012), 2012.

Michael Denkowski and Alon Lavie. METEOR universal: Language specific translation evaluation for any target language. In 9th Workshop on Statistical Machine Translation (WMT 2014), 2014.

George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In The Second International Conference on Human Language Technology Research (HLT '02), San Diego, California, 2002.

Jesús Giménez and Lluís Màrquez. Linguistic features for automatic evaluation of heterogenous MT systems. In Second Workshop on Statistical Machine Translation (WMT-07), pages 256–264, Prague, Czech Republic, June 2007.

Jesús Giménez and Lluís Màrquez. A smorgasbord of features for automatic MT evaluation. In Third Workshop on Statistical Machine Translation (WMT-08), Columbus, Ohio, June 2008.

Philipp Koehn and Christof Monz. Manual and automatic evaluation of machine translation between European languages. In Workshop on Statistical Machine Translation (WMT-06), 2006.

Gregor Leusch and Hermann Ney. BLEUSP, INVWER, CDER: Three improved MT evaluation measures. In NIST Metrics for Machine Translation Challenge (MetricsMATR), at Eighth Conference of the Association for Machine Translation in the Americas (AMTA 2008), Waikiki, Hawaii, October 2008.

Gregor Leusch, Nicola Ueffing, and Hermann Ney. A novel string-to-string distance measure with applications to machine translation evaluation. In Machine Translation Summit IX (MT Summit IX), New Orleans, September 2003.

Gregor Leusch, Nicola Ueffing, and Hermann Ney. CDER: Efficient MT evaluation using block movements. In 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-2006), 2006.

Ding Liu and Daniel Gildea. Syntactic features for evaluation of machine translation. In Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, Michigan, June 2005.

Chi-kiu Lo and Dekai Wu. MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles. In 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL HLT 2011), 2011.

Chi-kiu Lo and Dekai Wu. SMT vs. AI redux: How semantic frames evaluate MT more accurately. In Twenty-second International Joint Conference on Artificial Intelligence (IJCAI-11), 2011.

Chi-kiu Lo and Dekai Wu. Unsupervised vs. supervised weight estimation for semantic MT evaluation metrics. In Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-6), 2012.

Chi-kiu Lo and Dekai Wu. Can informal genres be better translated by tuning on automatic semantic metrics? In 14th Machine Translation Summit (MT Summit XIV), 2013.

Chi-kiu Lo and Dekai Wu. MEANT at WMT 2013: A tunable, accurate yet inexpensive semantic frame based MT evaluation metric. In 8th Workshop on Statistical Machine Translation (WMT 2013), 2013.

Chi-kiu Lo, Anand Karthik Tumuluru, and Dekai Wu. Fully automatic semantic MT evaluation. In 7th Workshop on Statistical Machine Translation (WMT 2012), 2012.

Chi-kiu Lo, Karteek Addanki, Markus Saers, and Dekai Wu. Improving machine translation by training against an automatic semantic frame based evaluation metric. In 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), 2013.

Chi-kiu Lo, Meriem Beloucif, and Dekai Wu. Improving machine translation into Chinese by tuning against Chinese MEANT. In International Workshop on Spoken Language Translation (IWSLT 2013), 2013.

Chi-kiu Lo, Meriem Beloucif, Markus Saers, and Dekai Wu. XMEANT: Better semantic MT evaluation without reference translations. In 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), 2014.

Matouš Macháček and Ondřej Bojar. Results of the WMT13 metrics shared task. In Eighth Workshop on Statistical Machine Translation (WMT 2013), Sofia, Bulgaria, August 2013.

I. Dan Melamed. Automatic construction of clean broad-coverage translation lexicons. In 2nd Conference of the Association for Machine Translation in the Americas (AMTA-1996), 1996.

Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In The Twenty-first National Conference on Artificial Intelligence (AAAI-06), volume 21, 2006.

Sonja Nießen, Franz Josef Och, Gregor Leusch, and Hermann Ney. An evaluation tool for machine translation: Fast evaluation for MT research. In The Second International Conference on Language Resources and Evaluation (LREC 2000), 2000.

Karolina Owczarzak, Josef van Genabith, and Andy Way. Dependency-based automatic evaluation for machine translation. In Syntax and Structure in Statistical Translation (SSST), 2007.

Karolina Owczarzak, Josef van Genabith, and Andy Way. Evaluating machine translation with LFG dependencies. Machine Translation, 21:95–119, 2007.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In 40th Annual Meeting of the Association for Computational Linguistics (ACL-02), pages 311–318, Philadelphia, Pennsylvania, July 2002.

Sameer Pradhan, Wayne Ward, Kadri Hacioglu, James H. Martin, and Dan Jurafsky. Shallow semantic parsing using support vector machines. In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2004), 2004.

Miguel Rios, Wilker Aziz, and Lucia Specia. TINE: A metric to assess MT adequacy. In Sixth Workshop on Statistical Machine Translation (WMT 2011), 2011.

Markus Saers and Dekai Wu. Improving phrase-based translation via word alignments from stochastic inversion transduction grammars. In Third Workshop on Syntax and Structure in Statistical Translation (SSST-3), pages 28–36, Boulder, Colorado, June 2009.

Markus Saers, Joakim Nivre, and Dekai Wu. Learning stochastic bracketing inversion transduction grammars with a cubic time biparsing algorithm. In 11th International Conference on Parsing Technologies (IWPT'09), pages 29–32, Paris, France, October 2009.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In 7th Biennial Conference of the Association for Machine Translation in the Americas (AMTA 2006), pages 223–231, Cambridge, Massachusetts, August 2006.

Anand Karthik Tumuluru, Chi-kiu Lo, and Dekai Wu. Accuracy and robustness in measuring the lexical similarity of semantic role fillers for automatic semantic MT evaluation. In 26th Pacific Asia Conference on Language, Information, and Computation (PACLIC 26), 2012.

Mengqiu Wang and Christopher D. Manning. SPEDE: Probabilistic edit distance metrics for MT evaluation. In 7th Workshop on Statistical Machine Translation (WMT 2012), 2012.

Dekai Wu. An algorithm for simultaneously bracketing parallel texts by aligning words. In 33rd Annual Meeting of the Association for Computational Linguistics (ACL 95), pages 244–251, Cambridge, Massachusetts, June 1995.

Dekai Wu. Trainable coarse bilingual grammars for parallel text bracketing. In Third Annual Workshop on Very Large Corpora (WVLC-3), pages 69–81, Cambridge, Massachusetts, June 1995.

Dekai Wu. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403, 1997.

Richard Zens and Hermann Ney. A comparative study on reordering constraints in statistical machine translation. In 41st Annual Meeting of the Association for Computational Linguistics (ACL-2003), pages 144–151, Stroudsburg, Pennsylvania, 2003.


Rule-based Syntactic Preprocessing for Syntax-based Machine Translation

Yuto Hatakoshi, Graham Neubig, Sakriani Sakti, Tomoki Toda, Satoshi Nakamura
Nara Institute of Science and Technology
Graduate School of Information Science
Takayama, Ikoma, Nara 630-0192, Japan

{hatakoshi.yuto.hq8,neubig,ssakti,tomoki,s-nakamura}@is.naist.jp

Abstract

Several preprocessing techniques using syntactic information and linguistically motivated rules have been proposed to improve the quality of phrase-based machine translation (PBMT) output. On the other hand, there has been little work on similar techniques in the context of other translation formalisms such as syntax-based SMT. In this paper, we examine whether the sort of rule-based syntactic preprocessing approaches that have proved beneficial for PBMT can contribute to syntax-based SMT. Specifically, we tailor a highly successful preprocessing method for English-Japanese PBMT to syntax-based SMT, and find that while the gains achievable are smaller than those for PBMT, significant improvements in accuracy can be realized.

1 Introduction

In the widely-studied framework of phrase-based machine translation (PBMT) (Koehn et al., 2003), translation probabilities between phrases consisting of multiple words are calculated, and translated phrases are rearranged by the reordering model into the appropriate target language order. While PBMT provides a light-weight framework to learn translation models and achieves high translation quality for many language pairs, it does not directly incorporate morphological or syntactic information. Thus, many preprocessing methods for PBMT using these types of information have been proposed. Methods include preprocessing to obtain accurate word alignments by division of verb prefixes (Nießen and Ney, 2000), preprocessing to reduce errors in verb conjugation and noun case agreement (Avramidis and Koehn, 2008), and many others. The effectiveness of syntactic preprocessing for PBMT has been supported by these and various related works.

In particular, much attention has been paid to preordering (Xia and McCord, 2004; Collins et al., 2005), a class of preprocessing methods for PBMT. PBMT has well-known problems with language pairs that have very different word order, due to the fact that the reordering model has difficulty estimating the probability of long-distance reorderings. Therefore, preordering methods attempt to improve the translation quality of PBMT by rearranging source language sentences into an order closer to that of the target language. Preordering methods are often rule-based, and they have achieved great success in ameliorating the word ordering problems faced by PBMT (Collins et al., 2005; Xu et al., 2009; Isozaki et al., 2010b).

One particularly successful example of rule-based syntactic preprocessing is Head Finalization (Isozaki et al., 2010b), a method of syntactic preprocessing for English-to-Japanese translation that has significantly improved the translation quality of English-Japanese PBMT using simple rules based on the syntactic structure of the two languages. The most central part of the method, as indicated by its name, is a reordering rule that moves the English head word to the end of the corresponding syntactic constituent to match the head-final syntactic structure of Japanese sentences. Head Finalization also contains some additional preprocessing steps such as determiner elimination, particle insertion and singularization to generate a sentence that is closer to Japanese grammatical structure.

In addition to PBMT, there has also recently been interest in syntax-based SMT (Yamada and Knight, 2001; Liu et al., 2006), which translates using syntactic information. However, few attempts have been made at syntactic preprocessing for syntax-based SMT, as the syntactic information given by the parser is already incorporated directly in the translation model.


Notable exceptions include methods to perform tree transformations improving the correspondence between sentence structure and word alignment (Burkett and Klein, 2012), methods for binarizing parse trees to match word alignments (Zhang et al., 2006), and methods for adjusting label sets to be more appropriate for syntax-based SMT (Hanneman and Lavie, 2011; Tamura et al., 2013). It should be noted that these methods of syntactic preprocessing for syntax-based SMT are all based on automatically learned rules, and there has been little investigation of the manually-created, linguistically-motivated rules that have proved useful in preprocessing for PBMT.

In this paper, we examine whether rule-based syntactic preprocessing methods designed for PBMT can contribute anything to syntax-based machine translation. Specifically, we examine whether the reordering and lexical processing of Head Finalization contribute to the improvement of syntax-based machine translation as they did for PBMT. Additionally, we examine whether it is possible to incorporate the intuitions behind the Head Finalization reordering rules as soft constraints by incorporating them as a decoder feature. As a result of our experiments, we demonstrate that rule-based lexical processing can contribute to improving the translation quality of syntax-based machine translation.

2 Head Finalization

Head Finalization is a syntactic preprocessing method for English-to-Japanese PBMT, reducing grammatical errors through reordering and lexical processing. Isozaki et al. (2010b) have reported that the translation quality of English-Japanese PBMT is significantly improved using a translation model learned from English sentences preprocessed by Head Finalization and the corresponding Japanese sentences. In fact, this method achieved the highest results in the large-scale NTCIR 2011 evaluation (Sudoh et al., 2011), the first time a statistical machine translation (SMT) system surpassed rule-based systems for this very difficult language pair, demonstrating the utility of these simple syntactic transformations from the point of view of PBMT.

2.1 Reordering

The reordering process of Head Finalization uses a simple rule based on the features of Japanese grammar.

Figure 1: Head Finalization. The English parse of "John hit a ball" is reordered into head-final order ("John a ball hit"), the Japanese pseudo-particles va0 and va2 are then added, and determiners are eliminated and plural nouns singularized, yielding "John va0 ball va2 hit".

To convert an English sentence into Japanese word order, the English sentence is first parsed using a syntactic parser, and then head words are moved to the end of the corresponding syntactic constituents at each non-terminal node of the English syntax tree. This helps replicate the ordering of words in Japanese grammar, where syntactic head words come after non-head (dependent) words.

Figure 1 shows an example of the application of Head Finalization to an English sentence. The head node of the English syntax tree is connected to its parent node by a bold line. When this node is the first child node, we move it behind the dependent node in order to convert the English sentence into head-final order. In this case, moving the head node VBD of the (black) VP node to the end of this node, we obtain the sentence "John a ball hit", which is in a word order similar to Japanese. A sketch of this reordering rule follows.

2.2 Lexical Processing

In addition to reordering, Head Finalization conducts the following three steps that do not affect word order.


Although these steps do not change the word order, they still result in an improvement of translation quality, and it can be assumed that the effect of this variety of syntactic preprocessing is applicable not only to PBMT but also to other translation methods that do not share PBMT's problems of reordering, such as syntax-based SMT. The three steps included are as follows:

1. Pseudo-particle insertion

2. Determiner (“a”, “an”, “the”) elimination

3. Singularization

The motivation for the first step is that, in contrast to English, which has relatively rigid word order and marks the grammatical case of many noun phrases according to their position relative to the verb, Japanese marks the topic, subject, and object using case-marking particles. As Japanese particles are not found in English, Head Finalization inserts "pseudo-particles" to prevent mistranslation or omission of particles in the translation process. In the pseudo-particle insertion process (1), we insert the following three types of pseudo-particles, equivalent to the Japanese case markers "wa" (topic), "ga" (subject) or "wo" (object):

• va0: Subject particle of the main verb

• va1: Subject particle of other verbs

• va2: Object particle of any verb

In the example of Figure 1, we insert the topic particle va0 behind "John," which is the subject of the verb "hit," and the object particle va2 after the object "ball."

Another source of divergence between the two languages stems from the fact that Japanese contains no determiners and does not distinguish between singular and plural by inflection of nouns. Thus, to generate a sentence that is closer to Japanese, Head Finalization eliminates determiners (2) and singularizes plural nouns (3) in addition to the pseudo-particle insertion.

In Figure 1, we can see that applying these three processes to the source English sentence results in the sentence "John va0 (wa) ball va2 (wo) hit", which closely resembles the structure of the Japanese translation "jon wa bo-ru wo utta." A string-level sketch of these steps follows.
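As an illustration, the three steps might be sketched as follows on a POS-tagged token list. The tag set, the toy `singularize`, and the assumption that the parser supplies the subject and object positions are all simplifications of what a real parser-based pipeline provides.

```python
# Minimal sketch of pseudo-particle insertion, determiner elimination,
# and singularization on an already head-finalized token sequence.
DETERMINERS = {"a", "an", "the"}

def singularize(noun):
    # toy singularizer; a real system would use the parser's lemmas
    return noun[:-1] if noun.endswith("s") else noun

def lexical_processing(tagged, subj_idx, obj_idx, main_clause=True):
    """tagged: list of (token, POS); subj_idx/obj_idx: positions of the
    subject and object head nouns (assumed given by the parse)."""
    out = []
    for i, (tok, pos) in enumerate(tagged):
        if pos == "DT" and tok.lower() in DETERMINERS:
            continue                                  # 2. determiner elimination
        if pos == "NNS":
            tok = singularize(tok)                    # 3. singularization
        out.append(tok)
        if i == subj_idx:
            out.append("va0" if main_clause else "va1")  # 1. particle insertion
        elif i == obj_idx:
            out.append("va2")
    return out

print(lexical_processing(
    [("John", "NN"), ("a", "DT"), ("ball", "NN"), ("hit", "VBD")],
    subj_idx=0, obj_idx=2))
# -> ['John', 'va0', 'ball', 'va2', 'hit']
```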

3 Syntax-based Statistical Machine Translation

Syntax-based SMT is a method for statistical translation using syntactic information about the sentence (Yamada and Knight, 2001; Liu et al., 2006). By using translation patterns that follow the structure of linguistic syntax trees, syntax-based translation often makes it possible to achieve more grammatical translations and reorderings compared with PBMT. In this section, we describe tree-to-string (T2S) machine translation based on synchronous tree substitution grammars (STSG) (Graehl et al., 2008), the variety of syntax-based SMT that we use in our experiments.

T2S captures the syntactic relationship between the two languages by using the syntactic structure of the parse of the source sentence. Each translation pattern is expressed as a source sentence subtree, with rules that may include variables. The following example of a translation pattern includes two noun phrases NP0 and NP1, which are translated and inserted into the target placeholders X0 and X1 respectively. The decoder generates the translated sentence in consideration of the probability of the translation pattern itself and the translations of the subtrees NP0 and NP1.

S((NP0) (VP(VBD hit) (NP1))) → X0 wa X1 wo utta
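As a toy illustration (not Travatar's actual rule format or API), such a pattern can be thought of as a source subtree paired with a target template, applied by splicing in the translations of the subtrees bound to the variables:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    src_pattern: str    # source subtree with variable slots
    tgt_template: list  # target tokens; ints are variable indices

rule = Rule("S((NP0) (VP(VBD hit) (NP1)))", [0, "wa", 1, "wo", "utta"])

def apply_rule(rule, subtranslations):
    """subtranslations[i]: translation of the subtree bound to NPi,
    produced (in a real decoder) by recursively applying other rules."""
    return " ".join(
        subtranslations[t] if isinstance(t, int) else t
        for t in rule.tgt_template)

print(apply_rule(rule, {0: "jon", 1: "bo-ru"}))  # jon wa bo-ru wo utta
```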

T2S has several advantages over PBMT. First, because the space of translation candidates is reduced using the source sentence subtree, it is often possible to generate translations that are more accurate, particularly with regard to long-distance reordering, as long as the source parse is correct. Second, the time to generate translation results is also reduced because the search space is smaller than in PBMT. On the other hand, because T2S generates translation results using the result of automatic parsing, translation quality depends highly on the accuracy of the parser.

4 Applying Syntactic Preprocessing to Syntax-based Machine Translation

In this section, we describe our proposed method to apply Head Finalization to T2S translation. Specifically, we examine two methods for incorporating the Head Finalization rules into syntax-based SMT: through applying them as a preprocessing step to the trees used in T2S translation, and


through adding reordering information as a feature of the translation patterns.

4.1 Syntactic Preprocessing for T2S

We applied the two types of processing shown in Table 1 as preprocessing for T2S. This is similar to preprocessing for PBMT, with the exception that preprocessing for PBMT results in a transformed string, while preprocessing for T2S results in a transformed tree. In the following sections, we elaborate on methods for applying these preprocessing steps to T2S and the effects expected from them.

Table 1: Syntactic preprocessing applied to T2S

Preprocessing       Description
Reordering          Reordering based on Japanese's typical head-final grammatical structure
Lexical Processing  Pseudo-particle insertion, determiner elimination, singularization

4.1.1 Reordering for T2S

In the case of PBMT, reordering is used to change the source sentence word order to be closer to that of the target, reducing the burden on the relatively weak PBMT reordering models. On the other hand, because the translation patterns of T2S are expressed using source sentence subtrees, the effect of reordering problems is relatively small, and the majority of reordering rules specified by hand can be learned automatically in a well-trained T2S model. Therefore, preordering is not expected to bring large gains, unlike in the case of PBMT.

However, preordering can still have a positive influence on the translation model training process, particularly by increasing alignment accuracy. For example, training methods for word alignment such as the IBM or HMM models (Och and Ney, 2003) are affected by word order, and word alignment may be improved by bringing the word order of the two languages closer together. As alignment accuracy plays an important role in T2S translation (Neubig and Duh, 2014), it is reasonable to hypothesize that reordering may also have a positive effect on T2S. In terms of the actual incorporation into the T2S system, we simply follow the process in Figure 1, but output the reordered tree instead of only the reordered terminal nodes as is done for PBMT.

Figure 2: A method of applying Lexical Processing. The parse of "John hit a ball" is transformed in place: the determiner "a" is eliminated along with its DT node, and the pseudo-particles va0 and va2 are inserted as VA terminal nodes under the corresponding NPs, without changing the word order.

4.1.2 Lexical Processing for T2S

In comparison to reordering, Lexical Processing may be expected to have a larger effect on T2S, as it both has the potential to increase alignment accuracy and removes the burden of learning rules to perform simple systematic changes that can be written by hand. Figure 2 shows an example of the application of Lexical Processing to transform not strings, but trees.

In the pseudo-particle insertion component, the three pseudo-particles "va0," "va1," and "va2" (as described in Section 2.2) are added to the source English syntax tree as terminal nodes under the non-terminal node "VA". As illustrated in Figure 2, particles are inserted as children at the end of the corresponding NP node. For example, in the figure, the topic particle "va0" is inserted after "John," the subject of the verb "hit," and the object particle "va2" is inserted at the end of the NP for "ball," the object.

In the determiner elimination process, the terminal nodes "a," "an," and "the" are eliminated along with their non-terminal node DT. The determiner "a" and its corresponding non-terminal DT are eliminated in the Figure 2 example.

Singularization, as in the processing for PBMT, simply changes plural noun terminals to their base form.

4.2 Reordering Information as Soft Constraints

As described in Section 4.1.1, T2S works well on language pairs that have very different word order, but is sensitive to alignment accuracy. We know that in most cases Japanese word order tends to be head-final, and thus any rules that do not obey head-final order may be the result of bad alignments. On the other hand, there are some cases where head-final word order is not applicable (such as sentences that contain the determiner


“no,” or situations where non-literal translations are necessary), and a hard constraint to obey head-final word order could be detrimental.

In order to incorporate this intuition, we add a feature (HF-feature) to translation patterns that conform to the reordering rules of Head Finalization. This gives the decoder the ability to discern translation patterns that follow the canonical reordering patterns in English-Japanese translation, and has the potential to improve translation quality in the T2S translation model.

We use the log-linear approach (Och, 2003) to add the Head Finalization feature (HF-feature). As in the standard log-linear model, a source sentence $\mathbf{f}$ is translated into a target language sentence $\mathbf{e}$ by searching for the sentence maximizing the score:

$$\hat{\mathbf{e}} = \arg\max_{\mathbf{e}} \; \mathbf{w}^{T} \cdot \mathbf{h}(\mathbf{f}, \mathbf{e}) \qquad (1)$$

where $\mathbf{h}(\mathbf{f}, \mathbf{e})$ is a feature function vector and $\mathbf{w}$ is a weight vector that scales the contribution of each feature. Each feature can take any real value that is useful for improving translation quality, such as the log of the n-gram language model probability to represent fluency, or lexical/phrase translation probabilities to capture word- or phrase-wise correspondence. Thus, if we can incorporate the information about reordering expressed by the Head Finalization reordering rule as a feature in this model, we can learn weights that inform the decoder that it should generally follow this canonical ordering.

Figure 3 shows the procedure of Head Finalization feature (HF-feature) addition. To add the HF-feature to translation patterns, we examine the translation rules, along with the alignments between target and source terminals and non-terminals. First, we apply Reordering to the source side of the translation pattern subtree according to the canonical head-final reordering rule. Second, we examine whether the word order of the reordered translation pattern matches that of the target side, i.e., whether the word alignment is non-crossing, indicating that the target string is also in head-final word order. Finally, we set a binary feature ($h_{\mathrm{HF}}(\mathbf{f}, \mathbf{e}) = 1$) if the target word order obeys head-final order. This feature is only applied to translation patterns for which the number of target-side words is greater than or equal to two. A sketch of this test follows Figure 3.

Figure 3: Procedure of HF-feature addition. The source side of a translation pattern, e.g. VP(VBD hit)(x0:NP) with target side "x0 wo utta", is first reordered by the head-final rule; the HF-feature is then added if the word alignment between the reordered source side and the target side is non-crossing.
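The non-crossing test behind the HF-feature can be sketched as follows, under the simplifying assumption that a rule's word alignment is available as (source position, target position) pairs over the head-finalized source side:

```python
def non_crossing(alignment):
    """True if no two alignment links cross, i.e. the target order
    follows the reordered (head-final) source order."""
    links = sorted(alignment)             # sort by source position
    tgt = [t for _, t in links]
    return all(t1 <= t2 for t1, t2 in zip(tgt, tgt[1:]))

def hf_feature(reordered_alignment, num_target_words):
    # the binary feature fires only for rules with >= 2 target words
    if num_target_words < 2:
        return 0
    return 1 if non_crossing(reordered_alignment) else 0

# "hit x0:NP" head-finalized to "x0:NP hit", target "x0 wo utta":
print(hf_feature([(0, 0), (1, 2)], num_target_words=3))  # 1
```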

Table 2: The details of NTCIR7

Dataset  Lang  Words  Sentences  Average length
train    En    99.0M  3.08M      32.13
         Ja    117M   3.08M      37.99
dev      En    28.6k  0.82k      34.83
         Ja    33.5k  0.82k      40.77
test     En    44.3k  1.38k      32.11
         Ja    52.4k  1.38k      37.99

5 Experiment

In our experiments, we examined how much each of the preprocessing steps (Reordering, Lexical Processing) contributes to improving the translation quality of PBMT and T2S. We also examined the improvement in translation quality of T2S from the introduction of the Head Finalization feature.

5.1 Experimental Environment

For our English-to-Japanese translation experiments, we used the NTCIR7 PATENT-MT Patent corpus (Fujii et al., 2008). Table 2 shows the details of the training data (train), development data (dev), and test data (test).

As the PBMT and T2S engines, we used the Moses (Koehn et al., 2007) and Travatar (Neubig, 2013) translation toolkits with their default settings.


Enju (Miyao and Tsujii, 2002) is used to parse English sentences and KyTea (Neubig et al., 2011) is used as a Japanese tokenizer. We generated word alignments using GIZA++ (Och and Ney, 2003) and trained a Kneser-Ney smoothed 5-gram LM using SRILM (Stolcke et al., 2011). Minimum Error Rate Training (MERT) (Och, 2003) is used for tuning to optimize BLEU. MERT is replicated three times to provide performance stability in the test set evaluation (Clark et al., 2011).

We used BLEU (Papineni et al., 2002) and RIBES (Isozaki et al., 2010a) as evaluation measures of translation quality. RIBES is an evaluation method that focuses on word reordering information, and is known to have high correlation with human judgment for language pairs that have very different word order, such as English-Japanese.

5.2 Results

Table 3 shows the translation quality for each combination of HF-feature, Reordering, and Lexical Processing. Scores in boldface indicate no significant difference in comparison with the condition that has the highest translation quality, using the bootstrap resampling method (Koehn, 2004) (p < 0.05).

For PBMT, we can see that reordering plays an extremely important role, with the highest BLEU and RIBES scores being achieved when using Reordering preprocessing (lines 3, 4). Lexical Processing also provided a slight performance gain for PBMT. When we applied Lexical Processing to PBMT, BLEU and RIBES scores improved (line 1 vs. 2), although this gain was not significant when Reordering was performed as well.

Overall, T2S without any preprocessing achieved better translation quality than all conditions of PBMT (line 1 of T2S vs. lines 1-4 of PBMT). In addition, the BLEU and RIBES scores of T2S were clearly improved by Lexical Processing (lines 2, 4, 6, 8 vs. lines 1, 3, 5, 7), and these scores are the highest of all conditions. On the other hand, Reordering and HF-feature addition had no positive effect, and actually tended to slightly hurt translation accuracy.

5.3 Analysis of Preprocessing

With regard to PBMT, as previous works on preordering have already indicated, BLEU and RIBES scores were significantly improved by Reordering. In addition, Lexical Processing also

Table 5: Optimized weight of the HF-feature in each condition

HF-feature  Reordering  Lexical Processing  Weight of HF-feature
+           -           -                   -0.00707078
+           -           +                    0.00524676
+           +           -                    0.156724
+           +           +                   -0.121326

contributed slightly to improving the translation quality of PBMT. We also investigated the influence that each element of Lexical Processing (pseudo-particle insertion, determiner elimination, singularization) had on translation quality, and found that the gains were mainly provided by particle insertion, with little effect from determiner elimination or singularization.

Although Reordering was effective for PBMT, it did not provide any benefit for T2S. This indicates that T2S can already conduct long-distance word reordering relatively correctly, and word alignment quality was not improved as much as expected by closing the gap in word order between the two languages. This was verified by a subjective evaluation of the data, which found very few major reordering issues in the sentences translated by T2S.

On the other hand, Lexical Processing functioned effectively not only for PBMT but also for T2S. When added to the baseline, Lexical Processing on its own resulted in a gain of 0.57 BLEU and 0.99 RIBES points, a significant improvement, with similar gains being seen in other settings as well.

Table 4 demonstrates a typical example of the improvement in translation results due to Lexical Processing. It can be seen that the translation of particles (indicated by underlined words) was improved. The underlined particle is in the direct object position of the verb that corresponds to "comprises" in English, and thus should be given the object particle "を wo" as in the reference and the system using Lexical Processing. In the baseline system, however, the genitive "と to" is generated instead, due to misaligned particles being inserted in an incorrect position in the translation rules.

5.4 Analysis of Feature Addition

Our experimental results indicated that translation quality is not improved by HF-feature addition (lines 1-4 vs. lines 5-8). We conjecture that the reason the HF-feature did not contribute to an


Table 3: Translation quality by combination of HF-feature, Reordering, and Lexical Processing. Bold indicates results that are not statistically significantly different from the best result (39.60 BLEU in line 4 and 79.47 RIBES in line 2).

                                            PBMT            T2S
ID  HF-feature  Reordering  Lexical Proc.   BLEU   RIBES    BLEU   RIBES
1   -           -           -               32.11  69.06    38.94  78.48
2   -           -           +               33.16  70.19    39.51  79.47
3   -           +           -               37.62  77.56    38.44  78.48
4   -           +           +               37.77  77.71    39.60  79.26
5   +           -           -               —      —        38.74  78.33
6   +           -           +               —      —        39.29  79.23
7   +           +           -               —      —        38.48  78.44
8   +           +           +               —      —        39.38  79.21

Table 4: Improvement of translation results due to Lexical Processing

Source: another connector 96 , which is matable with this cable connector 90 , comprises a plurality of male contacts 98 aligned in a row in an electrically insulative housing 97 as shown in the figure .

Reference: このケーブルコネクタ90と嵌合接続される相手コネクタ96は、図示のように、絶縁ハウジング97内に雄コンタクト98を整列保持して構成される。

- Lexical Processing: このケーブルコネクタ90は相手コネクタ96は、図に示すように、電気絶縁性のハウジング97に一列に並ぶ複数の雄型コンタクト98とから構成されている。

+ Lexical Processing: このケーブルコネクタ90と相手コネクタ96は、図に示すように、電気絶縁性のハウジング97に一列に並ぶ複数の雄型コンタクト98を有して構成される。

improvement in translation quality is that the reordering quality achieved by T2S translation was already sufficiently high, and the initial feature led to confusion in MERT optimization.

Table 5 shows the optimized weight of the HF-feature in each condition. From this table, we can see that positive weights were learned in two of the conditions, and negative weights in the other two. This indicates that there is no consistent pattern of learned weights corresponding to our intuition that head-final rules should receive higher preference.

It is possible that other optimization methods, or a more sophisticated way of inserting these features into the translation rules, could help alleviate these problems.

6 Conclusion

In this paper, we analyzed the effect of applying syntactic preprocessing methods to syntax-based SMT. Additionally, we adapted reordering rules as a decoder feature. The results showed that lexical processing, specifically insertion of pseudo-particles, contributed to improving translation quality, and was effective as preprocessing for T2S.

While this paper demonstrates that the simple rule-based syntactic processing methods that have been useful for PBMT can also contribute to T2S in English-Japanese translation, more work is required to ensure that this will generalize to other settings. A next step in our inquiry is the generalization of these results to other proposed preprocessing techniques and other language pairs. In addition, we would like to pursue two further directions. First, it is likely that other tree transformations, for example changing the internal structure of the tree by moving children to different nodes, would help in cases where it is common to translate into highly divergent syntactic structures between the source and target languages. Second, we plan to investigate other ways of incorporating the preprocessing rules as soft constraints, such as using n-best lists or forests to encode many possible sentence interpretations.


Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 43–50, October 25, 2014, Doha, Qatar. ©2014 Association for Computational Linguistics

Applying HMEANT to English-Russian Translations

Alexander Chuchunkov, Alexander Tarelkin, Irina Galinskaya
Yandex LLC
Leo Tolstoy st. 16, Moscow, Russia
{madfriend,newtover,galinskaya}@yandex-team.ru

Abstract

In this paper we report the results of first experiments with HMEANT (a semi-automatic evaluation metric that assesses translation utility by matching semantic role fillers) on the Russian language. We developed a web-based annotation interface and, with its help, evaluated the practicability of this metric in the MT research and development process. We studied the reliability, language independence, labor cost and discriminatory power of HMEANT by evaluating English-Russian translations of several MT systems. Role labeling and alignment were done by two groups of annotators, with and without a linguistic background. Experimental results were mixed: very high inter-annotator agreement in role labeling dropped to much lower values at the role alignment stage, and the good correlation of HMEANT with human ranking at the system level decreased significantly at the sentence level. Analysis of the experimental results and of the annotators' feedback suggests that the HMEANT annotation guidelines need some adaptation for Russian.

1 Introduction

Measuring translation quality is one of the most important tasks in MT; its history began long ago, but most of the currently used approaches and metrics have been developed during the last two decades. The BLEU (Papineni et al., 2002), NIST (Doddington, 2002) and METEOR (Banerjee and Lavie, 2005) metrics require a reference translation against which the MT output is compared in a fully automatic mode, which has resulted in a dramatic speed-up for MT research and development. These metrics correlate with manual MT evaluation and provide reliable evaluation for many languages and for different types of MT systems.

However, the major problem of popular MT evaluation metrics is that they aim to capture the lexical similarity of the MT output and the reference translation (fluency), but fail to evaluate the semantics of the translation against the semantics of the reference (adequacy) (Lo and Wu, 2011a). An alternative approach worth mentioning is the one proposed by Snover et al. (2006), known as HTER, which measures the quality of machine translation in terms of post-editing. This method was shown to correlate well with human adequacy judgments, though it was not designed for the task of gisting. Moreover, HTER is not widely used in machine translation evaluation because of its high labor intensity.

A family of metrics called MEANT was proposed in 2011 (Lo and Wu, 2011a), which approaches MT evaluation differently: it measures how much of the event structure of the reference the machine translation preserves, using shallow semantic parsing (the MEANT metric) or human annotation (HMEANT) as a gold standard.

We applied HMEANT to a new language — Russian — and evaluated the usefulness of the metric. Its practicability for the Russian language was studied with respect to the following criteria provided by Birch et al. (2013):

Reliability – measured as inter-annotator agreement for the individual stages of the evaluation task.

Discriminatory Power – the correlation of the rankings of four MT systems (by manual evaluation, BLEU and HMEANT), measured at the sentence and test set levels.

Language Independence – we collected the problems with the original method and guidelines and compared them to those reported by Bojar and Wu (2012) and Birch et al. (2013).

Efficiency – we studied the labor cost of the annotation task, i.e. the average time required to evaluate translations with HMEANT. In addition, we tested the claim that semantic role labeling (SRL) does not require experienced annotators (in our case, annotators with a linguistic background).

Although the problems of HMEANT have been outlined before (by Bojar and Wu (2012) and Birch et al. (2013)) and several improvements have been proposed, we decided to step back and conduct experiments with HMEANT in its original form. No changes to the metric, except for annotation interface enhancements, were made.

This paper has the following structure. Section 2 reports on previous experiments with HMEANT; Section 3 summarizes the methods behind HMEANT; Section 4 describes the settings for our own experiments; Sections 5 and 6 are dedicated to results and discussion.

2 Related Work

Since the beginning of the machine translation era, the idea of a semantics-driven approach to translation has circulated in the MT research community (Weaver, 1955). Recent work by Lo and Wu (2011a) argues that this approach is still promising: for machine translation to be useful, it should convey the shallow semantic structure of the reference translation.

2.1 MEANT for Chinese-English Translations

The original paper on MEANT (Lo and Wu, 2011a) proposes a semi-automatic metric which evaluates machine translations using the annotated event structure of a sentence in both the reference and the machine translation. The basic assumption behind the metric can be stated as follows: a translation is considered "good" if it preserves the shallow semantic (predicate-argument) structure of the reference. This structure is described in the paper on shallow semantic parsing (Pradhan et al., 2004): essentially, the evaluation is approached by asking simple questions about the events in the sentence: "Who did what to whom, when, where, why and how?". These structures are annotated and aligned between the two translations. The authors of MEANT reported the results of several experiments, which used both human annotation and semantic role labeling (as a gold standard) and automatic shallow semantic parsing. The experiments show that HMEANT correlates with human adequacy judgments (for three MT systems) at a value of 0.43 (Kendall's tau, sentence level), which is very close to the correlation of HTER (BLEU reaches only 0.20). Inter-annotator agreement was also reported for two stages of annotation: role identification (selecting the word span) and role classification (labeling the word span with a role). For the former, IAA ranged from 0.72 to 0.93 (which can be interpreted as good agreement) and for the latter from 0.69 to 0.88 (still quite good, but to be treated with some caution). IAA for the alignment stage was not reported.

2.2 HMEANT for Czech-English Translations

The MEANT and HMEANT metrics were adopted for an experiment on the evaluation of Czech-English and English-Czech translations by Bojar and Wu (2012). These experiments were based on a human-evaluated set of 40 translations from WMT12 (http://statmt.org/wmt12), which were submitted by 13 systems; each system was evaluated by exactly one annotator, plus an extra annotator for the reference translations. This setting meant that inter-annotator agreement could not be examined. The HMEANT correlation with human assessments was reported as 0.28, which is significantly lower than the value obtained by Lo and Wu (2011a).

2.3 HMEANT for German-English Translations

Birch et al. (2013) examined HMEANT thoroughly with respect to four criteria which address the usefulness of a task-based metric: reliability, efficiency, discriminatory power and language independence. The authors conducted an experiment to evaluate three MT systems – rule-based, phrase-based and syntax-based – on a set of 214 sentences (142 German and 72 English). IAA was broken down into the different stages of annotation and alignment. The experimental results showed that whilst the IAA for HMEANT is satisfactory at the first stages of annotation, the compounding effect of disagreement at each stage (up to the alignment stage) greatly reduced the effective overall IAA – to 0.44 on role alignment for German and, only slightly better, 0.59 for English. HMEANT successfully distinguished the three types of systems; however, this result could not be considered reliable, as the IAA was not very high (and rank correlation was not reported). The efficiency of HMEANT was stated as reasonably good; however, it was not compared to the labor cost of, for example, HTER. Finally, the language independence of the metric was implied by the fact that the original guidelines can be applied to both English and German translations.

3 Methods

3.1 Evaluation with HMEANT

The underlying annotation cycle of HMEANT consists of two stages: semantic role labeling (SRL) and alignment. During the SRL stage, each annotator is asked to mark all the frames (a predicate and its associated roles) in the reference translation and the hypothesis translation. To annotate a frame, one marks the frame head – the predicate (a verb, but not a modal verb) – and its arguments, the role fillers, which are linked to that predicate. Each role filler is assigned a role from an inventory of 11 roles (Lo and Wu, 2011a). The role inventory is presented in Table 1, where each role corresponds to a specific question about the whole frame.

Who?    Agent
What?   Patient
Whom?   Benefactive
When?   Temporal
Where?  Locative
Why?    Purpose
How?    Manner, Degree, Negation, Modal, Other

Table 1. The role inventory.

At the second stage, the annotators are asked to align the elements of the frames from the reference and hypothesis translations. The annotators link both actions and roles, and these alignments can be marked as "Correct" or "Partially Correct" depending on how well the meaning was preserved. We used the original minimalistic guidelines for SRL and alignment provided by Lo and Wu (2011a) in English, with a small set of Russian examples.

3.2 Calculating HMEANT

After the annotation, the HMEANT score of the hypothesis translation is calculated as an F-score from the counts of matched predicates and their role fillers (Lo and Wu, 2011a). Predicates (and roles) without matches are not counted, but they lower the overall value. We used the uniform model of HMEANT, which is defined as follows:

#F_i – number of correct role fillers for predicate i in the machine translation;
#F_i(partial) – number of partially correct role fillers for predicate i in the MT;
#MT_i, #REF_i – total number of role fillers in the MT or the reference for predicate i;
N_mt, N_ref – total number of predicates in the MT or the reference;
w – weight of a partial match (0.5 in the uniform model).

P = \sum_{\mathrm{matched}\ i} \frac{\#F_i}{\#MT_i} \qquad R = \sum_{\mathrm{matched}\ i} \frac{\#F_i}{\#REF_i}

P_{part} = \sum_{\mathrm{matched}\ i} \frac{\#F_i(partial)}{\#MT_i} \qquad R_{part} = \sum_{\mathrm{matched}\ i} \frac{\#F_i(partial)}{\#REF_i}

P_{total} = \frac{P + w \cdot P_{part}}{N_{mt}} \qquad R_{total} = \frac{R + w \cdot R_{part}}{N_{ref}}

HMEANT = \frac{2 \cdot P_{total} \cdot R_{total}}{P_{total} + R_{total}}
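To make the computation concrete, here is a minimal Python sketch of the uniform model written from the formulas above; the frame dictionaries and their field names (f, f_part, mt, ref) are hypothetical representations of the counts, not part of the metric's specification.

def hmeant_uniform(matched_frames, n_mt, n_ref, w=0.5):
    # matched_frames: one dict per matched predicate i, with counts
    #   f      = #F_i          (correct role fillers)
    #   f_part = #F_i(partial) (partially correct role fillers)
    #   mt     = #MT_i, ref = #REF_i (total role fillers on each side)
    p      = sum(fr["f"] / fr["mt"] for fr in matched_frames)
    r      = sum(fr["f"] / fr["ref"] for fr in matched_frames)
    p_part = sum(fr["f_part"] / fr["mt"] for fr in matched_frames)
    r_part = sum(fr["f_part"] / fr["ref"] for fr in matched_frames)
    p_total = (p + w * p_part) / n_mt
    r_total = (r + w * r_part) / n_ref
    if p_total + r_total == 0.0:
        return 0.0
    return 2 * p_total * r_total / (p_total + r_total)

# One matched predicate: 2 correct and 1 partial filler out of 3 fillers
# in the MT and 4 in the reference; 2 predicates on each side overall.
print(hmeant_uniform([{"f": 2, "f_part": 1, "mt": 3, "ref": 4}], 2, 2))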

3.3 Inter-Annotator Agreement

Like Lo and Wu (2011a) and Birch et al. (2013), we studied inter-annotator agreement (IAA). It is defined as an F1-measure, for which we consider one of the annotators as a gold standard:

IAA = \frac{2 \cdot P \cdot R}{P + R}

Here precision (P) is the number of labels (roles, predicates or alignments) that match between annotators, divided by the total number of labels by annotator 1; recall (R) is the number of matching labels divided by the total number of labels by annotator 2. Following Birch et al. (2013), we consider only exact word span matches. We also adopted the individual stages of the annotation procedure described in Birch et al. (2013): role identification (selecting the word span), role classification (marking the word span with a role), action identification (marking the word span as a predicate), role alignment (linking roles between translations) and action alignment (linking frame heads). Calculating IAA for each stage separately helped us isolate the disagreements and see which stages resulted in a low overall agreement value. To examine the most common role disagreements, we also created a pairwise agreement matrix, in which every cell (i, j) is the number of times role i was confused with role j by any pair of annotators.
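As an illustration, the agreement computation reduces to a few lines; the sketch below is ours, and the (span, role) tuples are an assumed encoding of the labels, with only exact span matches counted as described above.

from collections import Counter

def iaa(labels_1, labels_2):
    # labels: (span, tag) pairs from each annotator; annotator 1 plays gold.
    c1, c2 = Counter(labels_1), Counter(labels_2)
    matches = sum((c1 & c2).values())   # labels identical in both annotations
    if matches == 0:
        return 0.0
    p = matches / len(labels_1)         # matching labels / labels by annotator 1
    r = matches / len(labels_2)         # matching labels / labels by annotator 2
    return 2 * p * r / (p + r)

ann1 = [((0, 2), "Agent"), ((3, 3), "Action"), ((4, 7), "Patient")]
ann2 = [((0, 2), "Agent"), ((3, 3), "Action"), ((4, 6), "Patient")]
print(iaa(ann1, ann2))                  # 2 of 3 exact matches -> 0.667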

3.4 Kendall's Tau Rank Correlation with Human Judgments

For the set of translations used in our experiments, we had a number of relative human judgments (the set was taken from WMT13, http://statmt.org/wmt13). We used the rank aggregation method described in (Callison-Burch et al., 2012) to build one ranking from these judgments. The method is called Expected Win Score (EWS), and for an MT system S_i from the set {S_j} it is defined as follows:

score(S_i) = \frac{1}{|\{S_j\}|} \sum_{j,\ j \neq i} \frac{win(S_i, S_j)}{win(S_i, S_j) + win(S_j, S_i)}

Here win(S_i, S_j) is the number of times system i was given a rank higher than system j. This aggregation method was used to obtain comparisons of systems whose outputs were never presented together to assessors during the evaluation procedure at WMT13.

After we had obtained the ranking of systems by human judgments, we compared it to the ranking by the HMEANT values of the machine translations. To do so, we used Kendall's tau rank correlation coefficient (Kendall, 1938) and reported the results as in Lo and Wu (2011a) and Bojar and Wu (2012).
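For illustration, the sketch below aggregates hypothetical pairwise win counts into EWS scores and compares two rankings with Kendall's tau via scipy.stats.kendalltau; the win-count dictionary and the metric values are assumed data for the example.

from scipy.stats import kendalltau

def ews(systems, wins):
    # wins[(a, b)]: number of times system a was ranked above system b.
    scores = {}
    for s_i in systems:
        total = 0.0
        for s_j in systems:
            if s_j == s_i:
                continue
            w, l = wins.get((s_i, s_j), 0), wins.get((s_j, s_i), 0)
            if w + l > 0:
                total += w / (w + l)
        scores[s_i] = total / len(systems)
    return scores

systems = ["A", "B", "C"]
wins = {("A", "B"): 7, ("B", "A"): 3, ("A", "C"): 6, ("C", "A"): 4,
        ("B", "C"): 5, ("C", "B"): 5}
human = ews(systems, wins)
hmeant_scores = {"A": 0.44, "B": 0.39, "C": 0.29}  # hypothetical metric values
tau, p_value = kendalltau([human[s] for s in systems],
                          [hmeant_scores[s] for s in systems])
print(human, tau)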

4 Experimental Setup

4.1 Test Set

For our experiments we used the set of translations from WMT13. We tested HMEANT on the set of the four best MT systems (Bojar et al., 2013) for the English-Russian language pair (Table 2).

From the set of direct English-Russian translations (500 sentences) we picked those which allowed us to build a ranking for the four systems (94 sentences); out of these we then randomly picked 50 and split them into 6 tasks of 25, so that each of the 50 sentences was present in exactly three tasks. Each task consisted of 25 reference translations and 100 hypothesis translations.
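The paper does not specify how the assignments were made; one simple scheme that satisfies the constraints (6 tasks of 25, each sentence in exactly three tasks, since 50 x 3 = 6 x 25) is sketched below.

def make_tasks(sentence_ids, n_tasks=6, copies=3):
    # Round-robin each sentence's copies over the tasks; consecutive slots
    # give three distinct tasks per sentence because copies < n_tasks.
    tasks = [[] for _ in range(n_tasks)]
    slot = 0
    for sid in sentence_ids:
        for _ in range(copies):
            tasks[slot % n_tasks].append(sid)
            slot += 1
    return tasks

tasks = make_tasks(list(range(50)))
assert all(len(t) == 25 for t in tasks)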


System        EWS (WMT)
PROMT         0.4949
Online-G      0.4750
Online-B      0.3898
CMU-Primary   0.3612

Table 2. The top four MT systems for the en-ru translation task at WMT13. The scores were calculated for the subset of translations which we used in our experiments.

4.2 Annotation Interface

As far as we know, there is no publicly available interface for HMEANT annotation. Thus, starting from the prototype (Lo and Wu, 2011b) and taking into account the comments and suggestions of Bojar and Wu (2012) (e.g., the ability to go back between the phases of annotation), we created a web-based interface for role labeling and alignment. The interface allows the annotator to annotate a set of references with one machine translation at a time (Figure 1) and to align actions and roles. We also provided a timer, which allowed us to measure the time required to label the predicates and roles.

4.3 Annotators

We asked two groups of annotators to participate: 6 researchers with a linguistic background (linguists) and 4 developers without one. Every annotator did exactly one task; each of the 50 sentences was annotated by three linguists and at least two developers.

5 Results

As a result of the experiment, 638 frames were annotated in the reference translations (overall) and 2,016 frames in the machine translations. More detailed annotation statistics are presented in Table 3. A closer look indicates that the ratio of aligned frames and roles in the references was larger than in any of the machine translations.

5.1 Manual Ranking

After the test set was annotated, we compared the manual ranking and the ranking by HMEANT; at the system level, these rankings were similar, but at the sentence level there was no correlation between the rankings at all. We therefore decided to take a closer look at the manual assessments. For the selected four systems, most of the pairwise comparisons had been obtained transitively, i.e. through comparisons with other systems. Furthermore, we encountered a number of useless rankings in which all the outputs were given the same rank. In the end, for many sentences the ranking of systems rested on a few pairwise comparisons provided by one or two annotators. These rankings did not seem reliable, so we decided to rank the four machine translations for each of the 50 sentences manually, to make sure that the ranking had a strong basis. We asked 6 linguists to do this task. The average pairwise rank correlation (between assessors) reached 0.77, making the overall ranking reliable; we aggregated the 6 rankings for each sentence using EWS.

[Figure 1. Screenshot of the SRL interface. The tables under the sentences contain information about the frames; the active frame has a red border and is highlighted in the sentence, while inactive frames (not shown) are semi-transparent.]

Source        #Frames   #Roles   Aligned frames, %   Aligned roles, %
Reference     638       1,671    86.21 %             74.15 %
PROMT         609       1,511    79.97 %             67.57 %
Online-G      499       1,318    77.96 %             66.46 %
Online-B      469       1,257    78.04 %             68.42 %
CMU-Primary   439       1,169    75.17 %             66.30 %

Table 3. Annotation statistics.

5.2 Correlation with Manual Assessments

To look at HMEANT at the system level, we compared the rankings produced during the manual assessment and HMEANT annotation tasks. Those rankings were then aggregated with EWS (Table 4).

System        Manual   HMEANT   BLEU
PROMT         0.532    0.443    0.126
Online-G      0.395    0.390    0.146
Online-B      0.306    0.374    0.147
CMU-Primary   0.267    0.292    0.136

Table 4. EWS over manual assessments, EWS over HMEANT, and BLEU scores for the MT systems.

It should be noted that HMEANT ranked the systems correctly. This indicates that HMEANT has good discriminatory power at the system level, which is a solid argument for the use of this metric. It is also worth noting that the ranking by HMEANT matched the ranking by the number of frames and roles (Table 3).

At the sentence level, we studied the rank correlation between the ranking by manual assessments and the ranking by HMEANT values for each of the annotators. The manual ranking was aggregated by EWS from the manual evaluation task (see Section 5.1). The results are reported in Table 5.

We see that the resulting correlation values are significantly lower than those reported by Lo and Wu (2011a): our rank correlation values did not reach 0.43 on average across all the annotators (nor even the 0.28 reported by Bojar and Wu (2012)).

Annotator     τ
Linguist 1    0.0973
Linguist 2    0.3845
Linguist 3    0.1157
Linguist 4    -0.0302
Linguist 5    0.1547
Linguist 6    0.1468
Developer 1   0.1794
Developer 2   0.2411
Developer 3   0.1279
Developer 4   0.1726

Table 5. Rank correlation coefficients between HMEANT and human judgments. Reliable results (p-value < 0.05) are in bold.

5.3 Inter-Annotator Agreement

Following Lo and Wu (2011a) and Birch et al. (2013), we report the IAA for the individual stages of annotation and alignment. These results are shown in Table 6.

Stage             Linguists         Developers
                  Max      Avg      Max      Avg
REF, id           0.959    0.803    0.778    0.582
MT, id            0.956    0.795    0.667    0.501
REF, class        0.862    0.715    0.574    0.466
MT, class         0.881    0.721    0.525    0.434
REF, actions      0.979    0.821    0.917    0.650
MT, actions       0.971    0.839    0.700    0.577
Actions – align   0.908    0.737    0.429    0.332
Roles – align     0.709    0.523    0.378    0.266

Table 6. Inter-annotator agreement for the individual stages of the annotation and alignment procedures. "Id", "class" and "align" stand for identification, classification and alignment, respectively.

The results are not very different from those reported in the papers mentioned above, apart from the even lower agreement among the developers. The fact that the results could be reproduced on a new language seems very promising; however, the lack of training for the annotators without a linguistic background resulted in lower inter-annotator agreement.

We also studied the most common role disagreements for each pair of annotators (either linguists or developers). As can be deduced from the IAA values, the agreement on all roles is lower for linguists; however, both groups of annotators share the roles on which agreement is highest: Predicate, Agent, Locative, Negation, Temporal. The most common disagreements are presented in Table 7.

Role A         Role B         %, L    %, D
Whom           What           18.0    15.2
Whom           Who            13.7    23.1
Why            None           17.0    22.3
How (manner)   What           10.5    –
How (manner)   How (degree)   –       19.0
How (modal)    Action         18.1    16.3

Table 7. Most common role disagreements. The last columns (L for linguists, D for developers) give the percentage of times Role A was confused with Role B across all label types (roles, predicate, none).

These disagreements can be explained by the fact that some annotators looked "deeper" into the sentence semantics, whereas others only tried to capture the shallow structure as quickly as possible. This explains, for example, the disagreement on the Whom role: for some sentences, e.g. "mogli by ubedit' politicheskikh liderov" ("could persuade the political leaders"), it takes some time to correctly mark "politicheskikh liderov" ("political leaders") as an answer to Whom, not What. The disagreement on Purpose (which was often annotated by only one expert) is explained by the fact that there were no clear instructions on how to mark clauses. As for Action and Modal, this disagreement stems from the requirement that an Action should consist of one word only; this requirement raised questions about complex verbs, e.g. "zakonchil delat'" ("stopped doing"). It is ambiguous how to annotate these verbs: some annotators decided to mark them as Modal+Action, some as Action+What. Probably the correct way is to mark them simply as the Action.

5.4 Efficiency

Additionally, we conducted an efficiency experiment in the group of linguists. We measured the average time required to annotate a predicate (in a reference or machine translation) and a role. The results are presented in Table 8.

Annotator    REF              MT
             Role    Action   Role    Action
Linguist 1   14      26       11      36
Linguist 2   10      12       8       12
Linguist 3   13      14       8       23
Linguist 4   16      15       9       15
Linguist 5   13      20       11      24
Linguist 6   17      35       9       32

Table 8. Average times (in seconds) required to annotate actions and roles.

These results look very promising; using the numbers in Table 3, we can compute the average time required to annotate a sentence: 1.5–2 minutes for a reference (up to 4 minutes for the slower linguists) and 1.5–2.5 minutes for a machine translation. Also, for the group of "slower" linguists (1, 5, 6), inter-annotator agreement was lower (by 0.05 on average) than among the "faster" linguists (2, 3, 4) for all stages of annotation and alignment. The average time to annotate an action is similar for the reference and MT outputs, but it takes more time to annotate roles in references than in machine translations.

6 Discussion

6.1 Problems with HMEANT

As we can see, HMEANT is an acceptably reliable and efficient metric. However, we met some obstacles and problems with the original instructions during the experiments with Russian translations. We believe that these obstacles are the main causes of the low inter-annotator agreement at the last stages of the annotation procedure and the low correlation of rankings.

Frame head (predicate) is required. This requirement does not allow frames without a predicate at all, e.g. "On moy drug" ("He is my friend"): the Russian translation of "is" (present tense) is a null verb.

One-word predicates. There are cases where complex verbs (e.g., those consisting of two verbs) can be correctly translated as a one-word verb. For example, "ostanovilsya" ("stopped") is correctly rephrased as "perestal delat'" ("ceased doing").

Only roles of the same type can be aligned. Sometimes one role can be correctly rephrased as another role, but roles of different types cannot be aligned. For example, "On uekhal iz goroda" ("He went away from the town") means the same as "On pokinul gorod" ("He left the town"). The former has the structure Who + Action + Where, the latter Who + Action + What.

Should we annotate as much as possible? It is not clear from the guidelines whether we should annotate almost everything that looks like a frame or can be interpreted as a role. There are some prepositional phrases which cannot easily be classified as one role or another. For example, in "Nam ne stoit ob etom volnovat'sya" ("We should not worry about this"), it is not clarified how to deal with the prepositional phrase "ob etom" ("about this").

7 Conclusion

In this paper we describe a preliminary series of experiments with HMEANT, a new semi-automatic MT evaluation metric based on semantic role labeling. To conduct these experiments, we developed a special web-based annotation interface with a timing feature. A team of 6 linguists and 4 developers annotated the Russian MT output of 4 systems. The test set of 50 English sentences, along with reference translations, was taken from the WMT13 data. We measured IAA for each stage of the annotation process, compared the HMEANT ranking with the manual assessment, and calculated the correlation between HMEANT and manual evaluation. We also measured annotation time and collected feedback from the annotators, which helped us locate the problems and better understand the SRL process. Analysis of the preliminary experimental results of Russian MT output annotation led us to the following conclusions about HMEANT as a metric.

Language Independence. For a relatively small set of Russian sentences, we encountered problems with the guidelines, but they were not specific to the Russian language. This can be naively interpreted as language independence of the metric.

Reliability. Inter-annotator agreement is high for the first stages of SRL, but we noted that it decreases in the last stages because of the compound effect of disagreements at previous stages.

Efficiency. HMEANT proved to be effective in terms of the time required to annotate references and MT outputs and can be used in a production environment, though the claim that the HMEANT annotation task does not require qualified annotators was not confirmed.

Discriminatory Power. At the system level, HMEANT correctly ranked the MT systems (according to the results of the manual assessment task). At the sentence level, the correlation with human rankings is low.

To sum up, our first experience with HMEANT was considered successful and allowed us to make a positive decision about the applicability of the new metric to the evaluation of English-Russian machine translations. The HMEANT guidelines, annotation procedures and inventory of roles work in general; however, the low inter-annotator agreement at the last stages of the annotation task and the low correlation with human judgments at the sentence level suggest that we should make the respective adaptations and conduct a new series of experiments.

Acknowledgements

We would like to thank our annotators for their efforts and constructive feedback. We also wish to express our great appreciation to Alexey Baytin and Maria Shmatova for valuable ideas and advice.

References

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.

Alexandra Birch, Barry Haddow, Ulrich Germann, Maria Nadejde, Christian Buck, and Philipp Koehn. 2013. The feasibility of HMEANT as a human MT evaluation metric. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 52–61, Sofia, Bulgaria, August. Association for Computational Linguistics.

Ondrej Bojar and Dekai Wu. 2012. Towards a predicate-argument evaluation for MT. In Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 30–38, Jeju, Republic of Korea, July. Association for Computational Linguistics.

Ondrej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44, Sofia, Bulgaria, August. Association for Computational Linguistics.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2012. Findings of the 2012 Workshop on Statistical Machine Translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 10–51, Montréal, Canada, June. Association for Computational Linguistics.

George Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research, HLT '02, pages 138–145, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Maurice G. Kendall. 1938. A new measure of rank correlation. Biometrika, pages 81–93.

Chi-kiu Lo and Dekai Wu. 2011a. MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 220–229, Portland, Oregon, USA, June. Association for Computational Linguistics.

Chi-kiu Lo and Dekai Wu. 2011b. A radically simple, effective annotation and alignment methodology for semantic frame based SMT and MT evaluation. LIHMT 2011, page 58.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, ACL '02, pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics.

Sameer S. Pradhan, Wayne Ward, Kadri Hacioglu, James H. Martin, and Daniel Jurafsky. 2004. Shallow semantic parsing using support vector machines. In HLT-NAACL, pages 233–240.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, pages 223–231.

Warren Weaver. 1955. Translation. Machine Translation of Languages, 14:15–23.


Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 51–56, October 25, 2014, Doha, Qatar. ©2014 Association for Computational Linguistics

Reducing the Impact of Data Sparsity in Statistical Machine Translation

Karan Singla1, Kunal Sachdeva1, Diksha Yadav1, Srinivas Bangalore2, Dipti Misra Sharma1

1 LTRC, IIIT Hyderabad    2 AT&T Labs-Research

Abstract

Morphologically rich languages generally require large amounts of parallel data to adequately estimate parameters in a statistical machine translation (SMT) system. However, it is time-consuming and expensive to create large collections of parallel data. In this paper, we explore two strategies for circumventing the sparsity caused by the lack of large parallel corpora. First, we explore the use of distributed representations in a Recurrent Neural Network based language model with different morphological features, and second, we explore the use of lexical resources such as WordNet to overcome the sparsity of content words.

1 Introduction

Statistical machine translation (SMT) models estimate parameters (lexical models and a distortion model) from parallel corpora. The reliability of these parameter estimates depends on the size of the corpora. For morphologically rich languages, this sparsity is compounded further by the lack of large parallel corpora.

In this paper, we present two approaches that address the issue of sparsity in SMT models for morphologically rich languages. First, we use a Recurrent Neural Network (RNN) based language model (LM) to re-rank the output of a phrase-based SMT (PB-SMT) system, and second, we use lexical resources such as WordNet to minimize the impact of out-of-vocabulary (OOV) words on MT quality. We further improve the accuracy of MT using a model combination approach.

The rest of the paper is organized as follows. We first present our approach to training the baseline model and source-side reordering. In Section 4, we present our experiments and results on re-ranking the MT output using an RNNLM. In Section 5, we discuss our approach to increasing the coverage of the model by using synset IDs from the English WordNet (EWN). Section 6 describes our experiments on combining the synset-ID model and the baseline model to further improve translation accuracy, followed by the results and observations sections. We conclude the paper with future work and conclusions.

2 Related Work

In this paper, we present our efforts on re-ranking the n-best hypotheses produced by a PB-MT (phrase-based MT) system using an RNNLM (Mikolov et al., 2010) in the context of an English-Hindi SMT system. The re-ranking task in machine translation can be defined as re-scoring the n-best list of translations, wherein a number of language models are deployed along with features of the source or target language. Dungarwal et al. (2014) described the benefits of re-ranking translation hypotheses using a simple n-gram based language model. In recent years, the use of RNNLMs has shown significant improvements over traditional n-gram models (Sundermeyer et al., 2013). Mikolov et al. (2010) and Liu et al. (2014) have shown significant improvements in speech recognition accuracy using RNNLMs. Shi (2012) also showed the benefits of using an RNNLM with contextual and linguistic features. We have explored the use of morphological features (Hindi being a morphologically rich language) in an RNNLM as well, and found that these features further improve the baseline RNNLM in re-ranking the n-best hypotheses.

Words in natural languages are richly diverse, so it is not possible to cover all source language words when training an MT system. Untranslated out-of-vocabulary (OOV) words tend to degrade the accuracy of the output produced by an MT model. Huang (2010) pointed to various types of OOV words which occur in a data set: segmentation errors in the source language, named entities, combination forms (e.g. "widebody") and abbreviations. Apart from these issues, Hindi, being a low-resourced language in terms of parallel corpora, suffers from data sparsity.

In the second part of the paper, we address the problem of data sparsity with the help of the English WordNet (EWN) for English-Hindi PB-SMT. We increase the coverage of content words (excluding named entities) by incorporating synset information in the source sentences.

Combining machine translation (MT) systems has become an important part of statistical MT in the past few years. Work by Razmara and Sarkar (2013) and Cohn and Lapata (2007) has shown that phrase coverage increases when different systems are combined. To obtain more coverage of unigrams in the phrase table, we explored system combination approaches to combine the models trained with and without synset information. We explored two methodologies for system combination, one based on confusion networks (dynamic) (Ghannay et al., 2014) and one based on mixing models (Cohn and Lapata, 2007).

3 Baseline Components

3.1 Baseline Model and Corpus Statistics

We used the ILCI corpora (Choudhary and Jha, 2011) for our experiments, which contain English-Hindi parallel sentences from the tourism and health domains. We randomly divided the data into training (48,970), development (500) and testing (500) sentences, and for language modelling we used the English news corpus distributed as part of the WMT'14 translation task. This data comprises about 3 million sentences and also contains the MT training data.

We trained a phrase-based (Koehn et al., 2003) MT system using the Moses toolkit, with word alignments extracted from GIZA++ (Och and Ney, 2000). We used SRILM (Stolcke et al., 2002) with Kneser-Ney smoothing (Kneser and Ney, 1995) to train a language model for the first stage of decoding. The result of this baseline system is shown in Table 1.

Training sentences   Development sentences   Evaluation sentences   BLEU
48,970               500                     500                    20.04

Table 1: Baseline scores for the phrase-based Moses model.

3.2 English Transformation Module

Hindi is a relatively free word order language which generally follows SOV (Subject-Object-Verb) order, while English follows SVO (Subject-Verb-Object) word order. Research has shown that pre-ordering the source language to conform to the target language word order significantly improves translation quality (Collins et al., 2005). We created a reordering module that transforms an English sentence into Hindi order, based on the reordering rules provided by Anusaaraka (Chaudhury et al., 2010). The reordering rules operate on the parse output produced by the Stanford Parser (Klein and Manning, 2003).

The transformation module requires the text to contain only the surface forms of words; however, we extended it to support the surface form along with its factors, such as lemma and part of speech (POS).

Input: the girl in blue shirt is my sister
Output: in blue shirt the girl is my sister
Hindi: neele shirt waali ladki meri bahen hai
(blue) (shirt) (Mod) (girl) (my) (sister) (Vaux)

With this transformation, the English sentence is structurally closer to the Hindi sentence, which leads to better phrase alignments. The model trained with the transformed corpus produces a new baseline score of 21.84 BLEU, an improvement over the earlier baseline of 20.04 BLEU points.
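As a toy illustration of the idea (the actual module applies Anusaaraka's rules to Stanford parse trees, which a few lines cannot reproduce), a crude SVO-to-SOV step on POS-tagged input might look like this:

def svo_to_sov(tagged):
    # Move all verb-group tokens (Penn Treebank VB* tags) to the end.
    verbs = [w for w, t in tagged if t.startswith("VB")]
    rest = [w for w, t in tagged if not t.startswith("VB")]
    return rest + verbs

tagged = [("the", "DT"), ("girl", "NN"), ("reads", "VBZ"),
          ("a", "DT"), ("book", "NN")]
print(" ".join(svo_to_sov(tagged)))   # "the girl a book reads"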

4 Re-Ranking Experiments

In this section, we describe the results of re-ranking the output of the translation model using Recurrent Neural Network (RNN) based language models, trained on the same data used for language modelling in the baseline models.

Unlike traditional n-gram based discrete language models, RNNs do not make the Markov assumption and can potentially take into account long-term dependencies between words. Words in RNNs are represented as continuous-valued vectors in low dimensions, which allows for the possibility of smoothing using syntactic and semantic features. In practice, however, learning long-term dependencies with gradient descent is difficult due to diminishing gradients, as described by Bengio et al. (1994).

We integrated the approach of re-scoring the n-best output using an RNNLM, which has also been shown to be helpful by Liu et al. (2014). Shi (2012) likewise showed the benefits of using an RNNLM with contextual and linguistic features. Following their work, we used three types of features for building an RNNLM for Hindi: lemma (root), POS and NC (number-case). The data used was a Wikipedia dump, the MT training data and news articles, approximately 500,000 Hindi sentences in total. Features were extracted using a paradigm-based Hindi morphological analyzer.[1]

[Figure 1: BLEU scores for re-ranking experiments with RNNLM using different feature combinations (Baseline, NONE, POS, Lemma, All and Oracle), plotted against the number of re-ranked hypotheses (100–500).]

Figure 1 illustrates the results of re-ranking performed with RNNLMs trained with various features. The Oracle score is the highest achievable score in a re-ranking experiment: it is computed from the best translation among the n-best translations, where the best translation is found using the cosine similarity between the hypothesis and the reference translation. It can be seen from Figure 1 that the LM with only word and POS information is inferior to all the other models, whereas morphological features such as lemma, number and case information help significantly in re-ranking the hypotheses. The RNNLM which uses all the features performed best, achieving a BLEU score of 26.91 after re-scoring the 500-best output obtained from the pre-ordered SMT model.

[1] We used the HCU morph-analyzer.
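A minimal sketch of the two selection procedures described above; lm_score is a stand-in for a trained RNNLM scoring function, and the oracle uses a simple bag-of-words cosine (the paper does not detail its exact vector representation, so this is an assumption).

from collections import Counter
from math import sqrt

def cosine(hyp, ref):
    # Bag-of-words cosine similarity between hypothesis and reference.
    h, r = Counter(hyp.split()), Counter(ref.split())
    dot = sum(h[w] * r[w] for w in h)
    norm = (sqrt(sum(v * v for v in h.values())) *
            sqrt(sum(v * v for v in r.values())))
    return dot / norm if norm else 0.0

def rerank(nbest, lm_score):
    # Pick the hypothesis the language model likes best.
    return max(nbest, key=lm_score)

def oracle(nbest, reference):
    # Highest achievable choice: the n-best entry closest to the reference.
    return max(nbest, key=lambda hyp: cosine(hyp, reference))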

System                            BLEU
Baseline                          21.84
Re-scoring 500-best with RNNLM features:
NONE                              25.77
POS                               24.36
Lemma (root)                      26.32
ALL (POS+Lemma+NC)                26.91

Table 2: Re-scoring results for the 500-best hypotheses using RNNLMs with different features.

5 Using WordNet to Reduce Data Sparsity

We extend the coverage of our source data by using synonyms from the English WordNet (EWN). Our main motivation is to reduce the impact of OOV words on output quality by replacing words in a source sentence with their corresponding synset IDs. However, it is important to choose the appropriate synset ID based on the word's context and morphological information. For sense selection, we followed the approach used by Tammewar et al. (2013), which is described further in this section in the context of our task. We ignored words regarded as named entities, as indicated by the Stanford NER tagger, since they should not have synonyms in any case.

5.1 Sense Selection

Words are ambiguous, independent of their sentence context. Choosing the appropriate sense of a lexical item according to its context is a challenging task, typically termed word-sense disambiguation. However, the syntactic category of a lexical item provides an initial cue for disambiguation. Among the various senses, we filter out those that do not have the same POS tag as the lexical item. But words are not only ambiguous across different syntactic categories; they are also ambiguous within a syntactic category. In the following, we discuss our approaches to selecting the sense of a lexical item best suited to a given context within a given category. Categories were also filtered so that only content words get replaced with synset IDs.

5.1.1 Intra-Category Sense Selection

First Sense: Among the different senses, we select the first sense listed in the EWN corresponding to the POS tag of the given lexical item. This choice is motivated by the observation that the senses of a lexical item are ordered in descending order of their frequency of usage in the lexical resource.

Merged Sense: In this approach, we merge all the senses listed in the EWN corresponding to the POS tag of the given lexical item. The motivation behind this strategy is that the senses in the EWN for a particular word-POS pair are too finely classified, with the result that words representing the same concept end up in different synsets. For example, "travel" and "go" can denote the same concept in a similar context, but the first sense given by the EWN differs for these two words. Therefore, we merge all the senses for a word into a super sense (the synset ID of the first word that occurred in the data), which is given to all its synonyms even if they occur in different synsets.
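Under our reading of this description, the super-sense mapping could be built as in the sketch below; senses_of is a hypothetical lookup returning the ordered EWN synset IDs for a word-POS pair.

def build_super_senses(word_pos_sequence, senses_of):
    sense_to_super = {}
    for word, pos in word_pos_sequence:   # in order of occurrence in the data
        ids = senses_of(word, pos)
        if not ids:
            continue
        # Reuse an existing super sense if any of this word's senses was
        # already merged; otherwise its first listed sense becomes the super sense.
        super_id = next((sense_to_super[s] for s in ids if s in sense_to_super),
                        ids[0])
        for s in ids:
            sense_to_super.setdefault(s, super_id)
    return sense_to_super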

5.2 Factored Model

Techniques such as factored modelling (Koehn and Hoang, 2007) are quite beneficial for translation from English to Hindi, as shown by Ramanathan et al. (2008). When we replace words in a source sentence with synset IDs, we tend to lose the morphological information associated with those words. We add inflections as features in a factored SMT model to minimize the impact of this replacement.

We show the results of the processing steps on an example sentence below; in the factored representation, each token is followed by its inflection factor (E appears to mark an empty inflection).

Original sentence: Ram is going to market to buy apples
With synset IDs: Ram is Synset(go.v.1) to Synset(market.n.0) to Synset(buy.v.1) Synset(apple.n.1)
With inflection factors: Ram E | is E | Synset(go.v.1) ing | to E | Synset(market.n.0) E | to E | Synset(buy.v.1) E | Synset(apple.n.1) s

The English sentences were then reordered into Hindi word order using the module discussed in Section 3:

Reordered sentence: Ram E | Synset(apple.n.1) s | Synset(buy.v.1) E | to E | Synset(market.n.0) E | to E | Synset(go.v.1) ing | is E
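For illustration, a first-sense replacement pass can be sketched with NLTK's WordNet interface. This is an assumption on our part: the paper does not say which WordNet API was used, NLTK's synset names (e.g. travel.v.01) differ from the go.v.1 notation above, and the content-word filter and named-entity handling are simplified here.

from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet as wn

# Map Penn Treebank tag prefixes to WordNet POS constants (content words only).
PTB_TO_WN = {"N": wn.NOUN, "V": wn.VERB, "J": wn.ADJ, "R": wn.ADV}

def replace_with_synsets(sentence):
    out = []
    for word, tag in pos_tag(word_tokenize(sentence)):
        wn_pos = PTB_TO_WN.get(tag[0])
        synsets = wn.synsets(word, pos=wn_pos) if wn_pos else []
        if synsets:
            # First Sense heuristic: WordNet lists senses by frequency of usage.
            out.append("Synset(%s)" % synsets[0].name())
        else:
            out.append(word)
    return " ".join(out)

print(replace_with_synsets("Ram is going to market to buy apples"))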

In Table 3, the second row shows the BLEU scores for the models in which the source side contains synset IDs. It can be seen that the factored model also yields a significant improvement in the results.

6 Combining MT Models

Combining machine translation (MT) systems has become an important part of statistical MT in the past few years. There are two dominant approaches: (1) a system combination approach based on confusion networks (CN) (Rosti et al., 2007), which can combine systems dynamically; and (2) combining the models by linear interpolation and then using MERT to tune the combined system.

6.1 Combination based on confusion networks

We used the MANY tool (Barrault, 2010) for system combination. However, since the tool is configured to work with the TERp evaluation metric, we modified it to use the METEOR metric (Gupta et al., 2010), since Kalyani et al. (2014) have shown that METEOR correlates better with human evaluation for morphologically rich Indian languages.

6.2 Linearly Interpolated Combination

In this approach, we combined the phrase tables of the two models (Eng(synset)-Hindi and the baseline) using linear interpolation with uniform weights (0.5 for each model in our case). We then tuned the model with the new interpolated phrase table using the standard MERT algorithm.
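A minimal sketch of the uniform mixing step, assuming Moses-style "src ||| tgt ||| p1 p2 p3 p4 ..." phrase table lines; a real pipeline would also need to carry over the alignment and count fields and feed the result back to MERT.

def read_phrase_table(path):
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split(" ||| ")
            src, tgt, scores = fields[0], fields[1], fields[2]
            table[(src, tgt)] = [float(s) for s in scores.split()]
    return table

def interpolate(table_a, table_b, w_a=0.5, w_b=0.5, n_scores=4):
    # Uniform linear interpolation; missing entries contribute zero scores.
    merged = {}
    for key in set(table_a) | set(table_b):
        s_a = table_a.get(key, [0.0] * n_scores)
        s_b = table_b.get(key, [0.0] * n_scores)
        merged[key] = [w_a * x + w_b * y for x, y in zip(s_a, s_b)]
    return merged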

7 Experiments and Results

As can be seen in Table 3, the model with synset information led to a reduction in OOV words. Although the BLEU score decreased, the METEOR score improved for all the experiments that used synset IDs in the source sentences, and Gupta et al. (2010) have shown that METEOR is a better evaluation metric for morphologically rich languages. When synset IDs are used instead of words in the source language, the system makes incorrect morphological choices: for example, "going" and "goes" are replaced by the same synset ID, Synset(go.v.1), which leads to a loss of information in the phrase table. METEOR captures these complexities, as it considers features like stems and synonyms in its evaluation, and hence showed larger improvements than the BLEU metric. The last two rows of Table 3 show the results of the combination experiments; the mixture model (the linearly interpolated model) showed the best results, with a significant reduction in OOV words and also some gains in METEOR score.

System                           #OOV words   BLEU   METEOR
Baseline                         253          21.8   .492
Eng(Synset ID)-Hindi, baseline   237          19.2   .494
Eng(Synset ID)-Hindi, factored   225          20.3   .506
Ensembled Decoding               213          21.0   .511
Mixture Model                    210          21.2   .519

Table 3: Results for the models in which synset IDs replace words in the English data.

8 Observations

In this section, we study the coverage of the different models by categorizing the OOV words into the following categories.

• NE (named entities): as the data was from the health and tourism domains, these words were mainly names of places and medicines.

• VB: types of verb forms.

• NN: types of nouns and pronouns.

• ADJ: all adjectives.

• AD: adverbs.

• OTH: words which did not mean anything in English.

• SM: occasional spelling mistakes seen in the test data.

Note: there were no function words among the OOV (untranslated) words.

Cat.   Baseline   Eng(synset)-Hin   Mixture Model
NE     120        121               115
VB     47         37                27
NN     76         60                47
ADJ    22         15                12
AD     5          5                 4
OTH    2          2                 2
SM     8          8                 8

Table 4: OOV words in the different models.

As this analysis was done on a small dataset and for a fixed domain, the OOV words were few in number, as can be seen in Table 4, but the number of OOV words decreased across the different models as expected. The NE words remained almost the same for all three models, while the OOV words in the VB, NN and ADJ categories decreased significantly for the Eng(synset)-Hin and Mixture models.

9 Future Work

In the future, we will work on using the two approaches discussed – re-ranking and using lexical resources to reduce sparsity – together in one system. We will explore syntax-based features for the RNNLM, and we plan to use a better method for sense selection and to extend this concept to more language pairs. Word-sense disambiguation can be used to choose a more appropriate sense when the translation model is trained on a bigger dataset. We are also looking into unsupervised techniques for learning word replacements to reduce sparsity, and into ways of adapting our system to different domains.

10 Conclusions

In this paper, we discussed two approaches to addressing the sparsity issues encountered when training SMT models for morphologically rich languages with limited amounts of parallel corpora. In the first approach, we used an RNNLM enriched with morphological features of the target words and showed a BLEU score improvement of 5 points. In the second approach, we used lexical resources such as WordNet to alleviate sparsity.

References

Loïc Barrault. 2010. MANY: Open source machine translation system combination. The Prague Bulletin of Mathematical Linguistics, 93:147–155.

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157–166.

Sriram Chaudhury, Ankitha Rao, and Dipti M. Sharma. 2010. Anusaaraka: An expert system based machine translation system. In Natural Language Processing and Knowledge Engineering (NLP-KE), 2010 International Conference on, pages 1–6. IEEE.

Narayan Choudhary and Girish Nath Jha. 2011. Creating multilingual parallel corpora in Indian languages. In Proceedings of the Language and Technology Conference.

Trevor Cohn and Mirella Lapata. 2007. Machine translation by triangulation: Making effective use of multi-parallel corpora. In Annual Meeting – Association for Computational Linguistics, volume 45, page 728.

Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 531–540. Association for Computational Linguistics.

Piyush Dungarwal, Rajen Chatterjee, Abhijit Mishra, Anoop Kunchukuttan, Ritesh Shah, and Pushpak Bhattacharyya. 2014. The IIT Bombay Hindi-English translation system at WMT 2014. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 90–96, Baltimore, Maryland, USA, June. Association for Computational Linguistics.

Sahar Ghannay, France Le Mans, and Loïc Barrault. 2014. Using hypothesis selection based features for confusion network MT system combination. In Proceedings of the 3rd Workshop on Hybrid Approaches to Translation (HyTra) @ EACL, pages 1–5.

Ankush Gupta, Sriram Venkatapathy, and Rajeev Sangal. 2010. METEOR-Hindi: Automatic MT evaluation metric for Hindi as a target language. In Proceedings of ICON-2010: 8th International Conference on Natural Language Processing.

Chung-chi Huang, Ho-ching Yen, and Jason S. Chang. 2010. Using sublexical translations to handle the OOV problem in MT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas (AMTA).

Aditi Kalyani, Hemant Kamud, Sashi Pal Singh, and Ajai Kumar. 2014. Assessing the quality of MT systems for Hindi to English translation. In International Journal of Computer Applications, volume 89.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics – Volume 1, pages 423–430. Association for Computational Linguistics.

Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on, volume 1, pages 181–184. IEEE.

Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In EMNLP-CoNLL, pages 868–876.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology – Volume 1, pages 48–54. Association for Computational Linguistics.

X. Liu, Y. Wang, X. Chen, M. J. F. Gales, and P. C. Woodland. 2014. Efficient lattice rescoring using recurrent neural network language models.

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH, pages 1045–1048.

Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, pages 440–447. Association for Computational Linguistics.

Ananthakrishnan Ramanathan, Jayprasad Hegde, Ritesh M. Shah, Pushpak Bhattacharyya, and M. Sasikumar. 2008. Simple syntactic and morphological processing can help English-Hindi statistical machine translation. In IJCNLP, pages 513–520.

Majid Razmara and Anoop Sarkar. 2013. Ensemble triangulation for statistical machine translation. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 252–260.

Antti-Veikko I. Rosti, Spyridon Matsoukas, and Richard Schwartz. 2007. Improved word-level system combination for machine translation. In Annual Meeting – Association for Computational Linguistics, volume 45, page 312.

Yangyang Shi, Pascal Wiggers, and Catholijn M. Jonker. 2012. Towards recurrent neural networks language models with linguistic and contextual features. In INTERSPEECH.

Andreas Stolcke et al. 2002. SRILM – an extensible language modeling toolkit. In INTERSPEECH.

Martin Sundermeyer, Ilya Oparin, J.-L. Gauvain, Ben Freiberg, R. Schlüter, and Hermann Ney. 2013. Comparison of feedforward and recurrent neural network language models. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8430–8434. IEEE.

Aniruddha Tammewar, Karan Singla, Srinivas Bangalore, and Michael Carl. 2013. Enhancing ASR by MT using semantic information from HindiWordNet. In Proceedings of ICON-2013: 10th International Conference on Natural Language Processing.

56

Page 66: › old_anthology › W › W14 › W14-40.pdfIntroduction The Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8) was held on 25 October 2014 preceding

Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 57–66,October 25, 2014, Doha, Qatar. c©2014 Association for Computational Linguistics

Expanding the Language Model in a Low-Resource Hybrid MT System

George Tambouratzis, Sokratis Sofianopoulos, Marina Vassiliou
ILSP, Athena R.C.
[email protected] [email protected] [email protected]

Abstract

The present article investigates the fusion of different language models to improve translation accuracy. A hybrid MT system, recently developed in the European Commission-funded PRESEMT project, which combines example-based MT and statistical MT principles, is used as a starting point. In this article, the syntactically-defined phrasal language models (NPs, VPs etc.) used by this MT system are supplemented by n-gram language models to improve translation accuracy. For specific structural patterns, n-gram statistics are consulted to determine whether the pattern instantiations are corroborated. Experiments indicate improvements in translation accuracy.

1 Introduction

Currently a major part of cutting-edge research in MT revolves around the statistical machine translation (SMT) paradigm. SMT has been inspired by the use of statistical methods to create language models for a number of applications including speech recognition. A number of different translation models of increasing complexity and translation accuracy have been developed (Brown et al., 1993). Today, several packages for developing statistical language models are available for free use, including SRILM (Stolcke et al., 2011), thus supporting research into statistical methods. A main reason for the widespread adoption of SMT is that it is directly amenable to new language pairs using the same algorithms. An integrated framework (MOSES) has been developed for the creation of SMT systems (Koehn et al., 2007). The more recent developments of SMT are summarised by Koehn (2010). One particular advance in SMT has been the integration of syntactically motivated phrases in order to establish correspondences between the source language (SL) and target language (TL) (Koehn et al., 2003). Recently SMT has been enhanced by using different levels of abstraction, e.g. word, lemma or part-of-speech (PoS), in factored SMT models so as to improve SMT performance (Koehn and Hoang, 2007).

The drawback of SMT is that SL-to-TL parallel corpora of the order of millions of tokens are required to extract meaningful models for translation. Such corpora are hard to obtain, particularly for less resourced languages. For this reason, SMT researchers are increasingly investigating the extraction of information from monolingual corpora, including lexica (Koehn and Knight, 2002; Klementiev et al., 2012), restructuring (Nuhn et al., 2012) and topic-specific information (Su et al., 2012).

As an alternative to pure SMT, the use of less specialised but more readily available resources has been proposed. Even if such approaches do not provide a translation quality as high as SMT, their ability to develop MT systems with very limited resources confers on them an important advantage. Carbonell et al. (2006) have proposed an MT method that requires no parallel text, but relies on a full-form bilingual dictionary and a decoder using long-range context. Other systems using low-cost resources include METIS (Dologlou et al., 2003) and METIS-II (Markantonatou et al., 2009), which rely only on large monolingual corpora to translate SL texts.

Another recent trend in MT has been towards hybrid MT systems, which combine characteristics from multiple MT paradigms. The idea is that by fusing characteristics from different paradigms, a better translation performance can be attained (Wu, 2005). In the present article, the PRESEMT hybrid MT method using predominantly monolingual corpora (Sofianopoulos et al., 2012; Tambouratzis et al., 2013) is extended by integrating n-gram information to improve the translation accuracy. The focus of the article is on how to extract, as comprehensively as possible, information from monolingual corpora by combining multiple models, to allow a higher quality translation.

A review of the base MT system is given in section 2. The TL language model is then detailed, allowing the new work to be presented in section 3. More specifically, via an error analysis, n-gram based extensions are proposed to augment the language model. Experiments are presented in section 4 and discussed in section 5.

2 The hybrid MT methodology in brief

The PRESEMT methodology can be broken down into the pre-processing stage, the post-processing stage and two translation steps, each of which addresses different aspects of the translation process. The first translation step establishes the structure of the translation by performing a structural transformation of the source side phrases based on a small bilingual corpus, to capture long-range reordering. The second step makes lexical choices and performs local word reordering within each phrase. By dividing the translation process into these two steps, the challenging task of both local and long-distance reordering is addressed.

Phrase-based SMT systems give accurate translations for language pairs that only require a limited number of short-range reorderings. On the contrary, when translating between languages with free word order, these models prove inefficient. Instead, reordering models need to be built, which require large parallel training data, as various reordering challenges must be tackled.

2.1 Pre-processing

This involves PoS tagging, lemmatising and shallow syntactic parsing (chunking) of the source text. In terms of resources, the methodology utilises a bilingual lemma dictionary, an extensive TL monolingual corpus annotated with PoS tags, lemmas and syntactic phrases (chunks), and a very small parallel corpus of 200 sentences, with a tagged and lemmatised source side and a tagged, lemmatised and chunked target side. The bilingual corpus provides samples of the structural transformation from SL to TL. During this phase, the translation methodology ports the chunking from the TL-side to the SL-side, alleviating the need for an additional parser in SL. An example of the pre-processing stage is shown in Figure 1, for a sentence translated from Greek to English. For this sentence, the chunk structure is shown at the bottom part of Figure 1.

2.2 Structure Selection

Structure selection transforms the input text using the limited bilingual corpus as a structural knowledge base, closely resembling the "translation by analogy" aspect of EBMT systems (Hutchins, 2005). Using the available structural information, namely the order of syntactic phrases, the PoS tag of the head token of each phrase and the case of the head token (if available), we retrieve the most similar source side sentence from the parallel corpus. Based on the alignment information from the bilingual corpus between SL and TL, the input sentence structure is transformed to the structure of the target side translation.

For the retrieval of the most similar source side sentence, an algorithm from the dynamic programming paradigm is adopted (Sofianopoulos et al., 2012), treating the structure selection process as a sequence alignment: the input sentence is aligned to an SL-side sentence from the aligned parallel corpus and assigned a similarity score. The implementation is based on the Smith-Waterman algorithm (Smith and Waterman, 1981), initially proposed for similarity detection between protein sequences. The algorithm finds the optimal local alignment between the two input sequences at clause level.

The similarity of two clauses is calculated by taking into account the edit operations (replacement, insertion or removal) that must be applied to the input sentence in order to transform it into a source side sentence from the corpus. Each of these operations has an associated cost, considered as a system parameter. The parallel corpus sentence that achieves the highest similarity score is the most similar one to the input source sentence. For the example of Figure 1, the comparison of the SL sentence structure to the parallel corpus is schematically depicted in Figure 2. The resulting TL sentence structure is shown in Figure 3 in terms of phrase types and heads.
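To make the structure-selection step concrete, the following minimal sketch aligns an input chunk-type sequence against the corpus templates and returns the best-matching entry. The unit match/mismatch/gap costs are illustrative assumptions: the actual system also compares head PoS tags and case, and treats the edit-operation costs as tunable parameters.

# Sketch of structure selection: Smith-Waterman local alignment over
# chunk-type sequences. Costs are placeholders, not the system's values.

def smith_waterman(input_chunks, corpus_chunks,
                   match=2.0, mismatch=-1.0, gap=-1.0):
    """Return the best local alignment score between two chunk sequences."""
    n, m = len(input_chunks), len(corpus_chunks)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if input_chunks[i-1] == corpus_chunks[j-1] else mismatch
            H[i][j] = max(0.0,
                          H[i-1][j-1] + sub,   # replacement / match
                          H[i-1][j] + gap,     # removal
                          H[i][j-1] + gap)     # insertion
            best = max(best, H[i][j])
    return best

def most_similar(input_chunks, corpus):
    """Pick the SL-side corpus sentence with the highest similarity score."""
    return max(corpus, key=lambda sl: smith_waterman(input_chunks, sl))

# The templates of Figure 2: the 4th entry should win for this input.
corpus = [["PC", "VC", "ADVC", "PC"], ["PC", "VC", "PC"],
          ["PC", "PC", "VC"], ["ADVC", "PC", "VC", "PC", "PC"]]
print(most_similar(["ADVC", "PC", "VC", "PC", "PC"], corpus))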

Ήδη από τα τέλη Μαΐου 1821 παρουσιάσθηκε ζωηρή κινητοποίηση υπέρ των αγωνιζόμενων Ελλήνων.
[Already by the end of May 1821 a lively mobilisation was presented in favour of Greek contestants.]
(Ήδη) (από τα τέλη Μαΐου 1821) (παρουσιάσθηκε) (ζωηρή κινητοποίηση) (υπέρ των αγωνιζόμενων Ελλήνων).
ADVC (ad,<ήδη>) PC (aspp,<από>,no,<Μάιος>) VC (vb,<παρουσιάζω>) PC (no,<κινητοποίηση>) PC (aspp,<υπέρ>,np,<Έλληνας>).
ADVC PC VC PC PC

Figure 1. Pre-processing of a sentence (its gloss in square brackets) into a chunk sequence.


Input sentence structure: ADVC PC VC PC PC

  SL-side               TL-side
  PC VC ADVC PC         pc advc vc pc
  PC VC PC              pc vc pc
  PC VC PC              pc pc vc
  ADVC PC VC PC PC      advc pc vc pc pc

Figure 2. Comparing the sentence structure to the parallel corpus templates, to determine the best-matching SL structure (here, the 4th entry).

SL: ADVC (ad,<ήδη>) PC (aspp,<από>,no,<Μάιος>) VC (vb,<παρουσιάζω>) PC (no,<κινητοποίηση>) PC (aspp,<υπέρ>,np,<Έλληνας>)
TL: advc (rb,<already>) pc (in,<of>,nn,<May>) pc (nn,<mobilisation>) vc (vv,<present>) pc (in,<in>,nn,<Greek>)

Figure 3. SL-to-TL structure transformation based on the chosen parallel corpus template.

2.3 Translation equivalent selection

The second translation step performs word translation disambiguation, local word reordering within each syntactic phrase, as well as the addition and/or deletion of auxiliary verbs, articles and prepositions. All of the above are performed using a syntactic phrase model extracted from a purely monolingual TL corpus. The final translation is produced by the token generation component, since all processing during the translation process is lemma-based.

Each sentence contained within the text to be translated is processed separately, so there is no exploitation of inter-sentential information. The first task is to select the correct TL translation of each word. The second task involves establishing the correct word order within each phrase. For each phrase of the sentence being translated, the algorithm searches the TL phrase model for similar phrases. All retrieved TL phrases are compared to the phrase to be translated. The comparison is based on the words included, their tags and lemmas and any other morphological features (case, number etc.). The stable-marriage algorithm (Gale and Shapley, 1962) is applied for calculating the similarity and aligning the words of a phrase pair.

This word reordering process is performed simultaneously with the translation disambiguation, using the same TL phrase model. During word reordering the algorithm also resolves issues regarding the insertion or deletion of articles and other auxiliary tokens. Though translation equivalent selection implements several tasks simultaneously, it produces encouraging results when translating from Greek (a free word order language) to English (an SVO language).
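The following sketch illustrates how a stable-marriage procedure of this kind can align the words of a phrase pair. The word_sim() scoring function is a hypothetical stand-in for the system's comparison of lemmas, tags and morphological features.

# Sketch of phrase-pair word alignment via Gale-Shapley stable marriage.
# Source words "propose" to target words in order of similarity.

def word_sim(w1, w2):
    """Toy similarity: 1.0 on lemma match, 0.5 on PoS match, else 0."""
    if w1["lemma"] == w2["lemma"]:
        return 1.0
    return 0.5 if w1["pos"] == w2["pos"] else 0.0

def stable_alignment(src_words, tgt_words):
    """Return a stable src-index -> tgt-index alignment."""
    prefs = {i: sorted(range(len(tgt_words)),
                       key=lambda j: -word_sim(src_words[i], tgt_words[j]))
             for i in range(len(src_words))}
    engaged = {}                       # tgt index -> src index
    free = list(range(len(src_words)))
    next_choice = {i: 0 for i in free}
    while free:
        i = free.pop(0)
        if next_choice[i] >= len(tgt_words):
            continue                   # i exhausted its choices: unaligned
        j = prefs[i][next_choice[i]]
        next_choice[i] += 1
        if j not in engaged:
            engaged[j] = i
        elif word_sim(src_words[i], tgt_words[j]) > \
                word_sim(src_words[engaged[j]], tgt_words[j]):
            free.append(engaged[j])    # displace the weaker match
            engaged[j] = i
        else:
            free.append(i)
    return {i: j for j, i in engaged.items()}

src = [{"lemma": "help", "pos": "VV"}, {"lemma": "can", "pos": "MD"}]
tgt = [{"lemma": "can", "pos": "MD"}, {"lemma": "help", "pos": "VV"}]
print(stable_alignment(src, tgt))      # {0: 1, 1: 0}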

2.4 Post-processing

In this stage, a token generator is applied to the lemmas of the translated sentences together with the morphological features of their equivalent source words, to produce the final word forms.

2.5 Comparison of the method to SMT

In the proposed methodology, the structure selection step performs long-distance reordering without resorting to syntactic parsers and without employing any rules. In phrase-based SMT, long-distance reordering is performed either by using SL syntax, with the use of complex reordering rules, or by using syntactic trees.

The similarity calculation algorithms used in the two translation steps of the proposed method are of a similar nature to the extraction of translation models in factored SMT. In SMT, different matrices are created for each model (i.e. one for lemmas and another one for PoS tags), while in the methodology studied here lemmas and tags are handled at the same time.

The main advantage of the method studied here is its ability to create a functioning MT system with a parallel corpus of only a few sentences (200 sentences in the present experiments). On the contrary, it would not be possible to create a working SMT system with such a corpus.

3 Information extraction from the monolingual corpus

3.1 Standard indexed phrase model

The TL monolingual corpus is processed to extract two complementary types of information, both employed in the second phase of the translation process (cf. sub-section 2.3). The first implements a disambiguation between multiple possible translations, while the second provides the micro-structural information to establish token order in the final translation.

Both these types of information are extracted from one model. More specifically, during pre-processing of the corpus, a phrase model is established that provides the micro-structural information on the translation output, to determine intra-phrasal word order. The model is stored in a file structure, where a separate file is created for phrases according to their (i) type, (ii) head and (iii) head PoS tag.

The TL phrases are then organised in a hash map that allows the storage of multiple values for each key, using the three aforementioned criteria as a key. For each phrase the number of occurrences within the corpus is also retained. Each hash map is stored independently in a file for very fast access by the search algorithm. As a result of this process hundreds of thousands of files are generated, one for each combination of the three aforementioned criteria. Each file is of a small size and thus can be retrieved quickly.

For creating the model used here, a corpus of 30,000 documents has been processed for the TL, where each document contains a concatenation of independent texts of approximately 1 MByte in size. The resulting phrase model consists of 380,000 distinct files, apportioned into 12,000 files of adjectival chunks, 348,000 of noun chunks, 17,000 of verb chunks and 3,000 of adverbial chunks. A sample of the indexed file corresponding to verb phrases with head 'help' is shown in Figure 4.

Occurrences  Phrase structure
41448        help (VV)
29575        to (TO) help (VV)
5896         will (MD) help (VV)
4795         can (MD) help (VV)
2632         have (VHD) help (VVN)

Figure 4. Example of indexed file for "help".
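A minimal sketch of such an indexed phrase model is given below. The JSON storage format and the file-naming scheme are illustrative assumptions (a real deployment would also have to sanitise head lemmas for use in file names); the essential point is one small, directly addressable file per (type, head, head PoS) key, with corpus counts retained.

# Sketch of the indexed phrase model: TL phrases grouped under the key
# (phrase type, head lemma, head PoS) with counts, one file per key.

import json
from collections import Counter, defaultdict
from pathlib import Path

def build_index(tl_phrases, out_dir="phrase_model"):
    """tl_phrases: iterable of (ptype, head_lemma, head_pos, phrase_string)."""
    index = defaultdict(Counter)
    for ptype, head, head_pos, phrase in tl_phrases:
        index[(ptype, head, head_pos)][phrase] += 1
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for (ptype, head, head_pos), phrases in index.items():
        fname = out / f"{ptype}_{head}_{head_pos}.json"
        fname.write_text(json.dumps(phrases.most_common()))

def lookup(ptype, head, head_pos, model_dir="phrase_model"):
    """Retrieve candidate phrase realisations, most frequent first."""
    fname = Path(model_dir) / f"{ptype}_{head}_{head_pos}.json"
    return json.loads(fname.read_text()) if fname.exists() else []

build_index([("VC", "help", "VV", "to(TO) help(VV)"),
             ("VC", "help", "VV", "to(TO) help(VV)"),
             ("VC", "help", "VV", "will(MD) help(VV)")])
print(lookup("VC", "help", "VV"))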

3.2 Error analysis on translation output

Table 1 displays the translation accuracy attained by the proposed hybrid approach in comparison to established systems. The proposed method occupies the middle ground between the two higher performing SMT-based systems (Bing and Google) and the Systran and WorldLingo commercial systems.

Though the BLEU score of the proposed method is 0.17 BLEU points lower than the Google score, the proposed method achieves a respectable score with a parallel corpus of only 200 sentences. Though the exact resources of Google or Bing are not disclosed, it is widely agreed that they are at least 3 orders of magnitude larger (very likely even more), which accounts for the lower scores achieved by the proposed low-resource method.

Number of sentences: 200; Reference translations: 1; Language pair: EL-EN; Resources: standard.

MT config.         BLEU    NIST   Meteor  TER
PRESEMT-baseline   0.3462  6.974  0.3947  51.05
Google             0.5259  8.538  0.4609  42.23
Bing               0.4974  8.279  0.4524  34.18
SYSTRAN            0.2930  6.466  0.3830  49.72
WorldLingo         0.2659  5.998  0.3666  50.63

Table 1. Values of performance metrics for dataset1, using the baseline version of the proposed method and other established systems.

The n-gram method proposed in this article for supplementary language modelling is intended to identify recurring errors in the output or to verify translation choices made by the indexed monolingual model. The errors mainly concern the generation of tokens out of lemmata, the positioning of tokens within phrases, as well as disambiguation choices. An indicative list of errors encountered for Greek to English translation follows:

Article introduction & deletion: Given that there is no 1:1 mapping between Greek and English concerning the use of the definite article, it is essential to check whether it is correctly introduced in specific cases (e.g. before proper names).

Generation of verb forms: Specific errors of the MT system involve cases of active/passive voice mismatches between SL and TL and deponent verbs, i.e. active verbs with mediopassive morphology. For example, the Greek deponent verb "έρχομαι" (come) is translated to "be come" by the system's token generation component, which takes into account the verb's passive morphology in SL. This erroneous translation should be corrected to "come", i.e. the auxiliary verb "be" must be deleted.

In-phrase token order: The correct ordering of tokens within a given phrase (which occasionally fails to be established by the proposed system) can be verified via the n-gram model.

Prepositional complements: When translating the prepositional complement of a verb (cf. "depend + on"), it is often the case that the incorrect preposition is selected during disambiguation, given that no context information is available. The n-gram model may be accessed to identify the appropriate preposition.

Double preposition: Prepositions appearing in succession within a sentence need to be reduced to one. For instance, the translation of the NP "κατά τη διάρκεια της πολιορκίας" (= during the siege) results in a prepositional sequence ("during of") due to the translation of the individual parts as follows: "κατά τη διάρκεια" = during; "της" = of the; "πολιορκίας" = siege. In this example a single preposition is needed.

3.3 Introducing n-gram models

A new model based on n-gram appearances is intended to supplement the phrase-based information already extracted from the monolingual corpus (cf. section 3.1). As the monolingual corpus is already lemmatised, both lemma-based and token-based n-grams are extracted. To simplify processing, no phrase-boundary information is retained in the n-gram models.

One issue is how the n-gram model is to be combined with the indexed phrase model of the hybrid MT algorithm. The new n-gram model can be applied at the same stage of the translation process. Alternatively, n-grams can be applied after the indexed phrase model, for verification or revision of the translation produced using the indexed corpus. In that case, the indexed phrase model generates a first translation, which represents a hypothesis Hi, upon which a number of tests are performed. If the n-gram model corroborates this hypothesis, no modification is applied, whilst if the n-gram likelihood estimates lead to the rejection of the hypothesis, the translation is revised accordingly.

Having adopted this set-up, the main task is to specify the hypotheses to be tested. To that end, a data-driven approach based on the findings of the error analysis (cf. section 3.2) is used.

The creation of the TL n-gram model is straightforward and employs the publicly available SRILM tool (Stolcke et al., 2011) to extract n-gram probabilities. Both 2-gram and 3-gram models have been extracted, creating both token-based and lemma-based models to support queries at factored representation levels. The n-gram models have used 20,000 documents in English, each document being an assimilation of web-posted texts with a cumulative size of 1 MByte (harvested without any restrictions in terms of domain). Following pre-processing to remove words with non-English characters, the final corpus contains a total of 707.6 million tokens and forms part of the EnTenTen corpus (http://www.sketchengine.co.uk/documentation/wiki/Corpora/enTenTen). When creating both 2-grams and 3-grams, Witten-Bell smoothing is used and all n-grams with fewer than 5 occurrences are filtered out to reduce the model size. Each n-gram model contains circa 25 million entries, which are the SRILM-derived logarithms of probabilities.
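As an illustration, the following sketch loads SRILM-derived log-probabilities from an ARPA-format file into an in-memory table and exposes the kind of p_lem()/p_tok() lookup used by the hypothesis tests of section 3.4. The parsing is deliberately naive (it keeps only the log10 probability column and the n-gram itself), and the floor value assumed for unseen n-grams is an assumption, not the system's actual back-off treatment.

# Sketch of loading an ARPA-format n-gram model (as written by SRILM)
# into a dictionary of log10 probabilities keyed by n-gram tuples.

def load_arpa(path):
    """Read an ARPA file into {ngram tuple: log10 probability}."""
    probs, in_data = {}, False
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line.startswith("\\") or not line:
                # section headers like "\2-grams:" open a data section;
                # "\data\", "\end\" and blank lines close it
                in_data = line.endswith("-grams:")
                continue
            if in_data:
                fields = line.split("\t")   # logp <TAB> ngram [<TAB> backoff]
                logp, ngram = float(fields[0]), tuple(fields[1].split())
                probs[ngram] = logp
    return probs

FLOOR = -99.0  # assumed log10 probability for unseen n-grams

def make_query(model):
    """Return a lookup function, e.g. p_lem = make_query(lemma_model)."""
    return lambda *ngram: model.get(tuple(ngram), FLOOR)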

3.4 Establishing translation hypotheses

A set of hypotheses has been established based on the error analysis, to improve the translation quality. Each hypothesis is expressed by a mathematical formula which checks the likelihood of an n-gram, via either the lemma-based n-gram model (the relevant entry being denoted as p_lem(), i.e. the probability of the n-gram of lemmas) or the token-based model (the relevant entry being denoted as p_tok()). The relevant 2-gram or 3-gram model is consulted depending on whether the number of arguments is 2 or 3.

Hypothesis H1: This hypothesis checks for the existence of a deponent verb, i.e. a verb which is in passive voice in SL but has an active voice translation. Instead of externally providing a list of deponent verbs in Greek, the n-gram model is used to determine translations for which the verb is always in active voice, by searching the frequency-of-occurrence in the TL corpus. As an example of a correct rejection of hypothesis H1, consider the verb "κοιμάμαι" [to sleep], which is translated by the hybrid MT system into "be slept", as in SL this verb has a mediopassive morphology. As the pattern "be slept" is extremely infrequent in the monolingual corpus, hypothesis H1 is rejected and the lemma "be" is correctly deleted, translating "κοιμάμαι" into "sleep". The corresponding hypothesis is:

H1: p_lem(A,B) > thres_h1, where lem(A) = "be" and PoS(B) = "VVN"

If this hypothesis does not hold (i.e. the 2-gram formed by the auxiliary verb and lemma B is very rare), then H1 is rejected and the auxiliary verb is deleted, as expressed by the following formula:

If (H1 == false) then {A, B} → {B}

Hypothesis H2: This hypothesis checks the inclusion of an article within a trigram of word forms. If this hypothesis is rejected based on n-gram evidence, the article is deleted. Hypothesis H2 is expressed as follows, where thres_h2 is a minimum threshold margin:

H2: min{p_lem(A,the), p_lem(the,B)} − p_lem(A,B) < thres_h2

An example of correctly rejecting H2 is the trigram {see, the, France}, which is revised to {see, France}:

If (H2 == false) then {A, the, B} → {A, B}

Hypothesis H3: This hypothesis is used to handle cases where two consecutive prepositions exist (for prepositions the PoS tag is "IN"). In this case one of these prepositions must be deleted, based on the n-gram information. This process is expressed as follows:

H3: max(p_lem(A,C), p_lem(B,C)), where PoS(A) == "IN" and PoS(B) == "IN"

If (H3 == true) then {A, B, C} → {A, C} or {B, C}

Hypothesis H4: This hypothesis checks whether there exists a more suitable preposition than the one currently selected for a given trigram {A, B, C}, where PoS(B) = "IN". H4 is expressed as:

H4: p_lem(A,B,C) − max(p_lem(A,D,C)) > thres_h4, over all D where PoS(D) == "IN"

If this hypothesis is rejected, B is replaced by D:

If (H4 == false) then {A, B, C} → {A, D, C}

Hypothesis H5: This hypothesis checks whether, for a bigram, the word forms should be replaced by the corresponding lemmas, as the wordform-based pattern is too infrequent. This is formulated as:

H5: p_tok(A,B) − p_tok(lem(A),lem(B)) > thres_h5

An example application would involve processing the bigram {can, is} and revising it into the correct {can, be} by rejecting H5:

If (H5 == false) then {A, B} → {lem(A), lem(B)}

Similarly, H5 can revise the plural form "informations" to the correct "information".

Hypothesis H6: This hypothesis also handles article deletion, studying however bigrams rather than trigrams (cf. H2). The hypothesis is that the bigram frequency exceeds a given threshold value (thres_h6):

H6: p_lem(A,B) > thres_h6, where PoS(A) = "DT"

If H6 is rejected, the corresponding article is deleted, as indicated by the following formula:

If (H6 == false) then {A, B} → {B}
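The following sketch shows how two of these tests (H1 and H6) can be implemented on top of n-gram tables of the kind loaded above. The token dictionaries and the in-place revision logic are illustrative assumptions; the thresholds follow Table 2, interpreted as log10 probabilities.

# Illustrative implementation of hypotheses H1 and H6. p_lem is any
# callable returning the log10 probability of a lemma n-gram, e.g. the
# result of make_query() above.

THRES_H1 = -4.50
THRES_H6 = -5.50

def apply_h1(tokens, p_lem):
    """H1: delete auxiliary 'be' before a past participle when the
    bigram is too unlikely (deponent-verb correction, e.g. 'be slept')."""
    out, i = [], 0
    while i < len(tokens):
        a = tokens[i]
        if (a["lemma"] == "be" and i + 1 < len(tokens)
                and tokens[i + 1]["pos"] == "VVN"
                and p_lem(a["lemma"], tokens[i + 1]["lemma"]) <= THRES_H1):
            i += 1            # H1 rejected: drop A, keep B
            continue
        out.append(a)
        i += 1
    return out

def apply_h6(tokens, p_lem):
    """H6: delete an article when the article+noun bigram is too rare."""
    out = []
    for i, a in enumerate(tokens):
        if (a["pos"] == "DT" and i + 1 < len(tokens)
                and p_lem(a["lemma"], tokens[i + 1]["lemma"]) <= THRES_H6):
            continue          # H6 rejected: {A, B} -> {B}
        out.append(a)
    return out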

4 Objective Evaluation Experiments

4.1 Experiment design

The experiments reported in the present article focus on the Greek-English language pair, the reason being that this is the language pair for which the most extensive experimentation has been reported for the PRESEMT system (Tambouratzis et al., 2013). Thus, improvements in the translation accuracy will be more difficult to attain. Two datasets are used to evaluate translation accuracy, a development set (dataset1) and a test set (dataset2), each containing 200 sentences of length ranging from 7 to 40 tokens. These sets of sentences are readily available for download from the project website (www.presemt.eu). Two versions of the bilingual lexicon have been used, a base version and an expanded one.

Both sets are manually translated by Greek native speakers and then cross-checked by English native speakers, with one reference translation per sentence. A range of evaluation metrics is employed, namely BLEU (Papineni et al., 2002), NIST (NIST, 2002), Meteor (Denkowski and Lavie, 2011) and TER (Snover et al., 2006).

4.2 Experimental results

The exact sequence in which hypotheses are tested affects the results of the translation, since at present only one hypothesis is allowed to be applied to each sentence token. This simplifies the evaluation of the hypotheses' effectiveness. As a result, hypotheses are applied in strict order (i.e. first H1, then H2 etc.). The threshold values of Table 2 were settled upon via limited experimentation using sentences from dataset1.

Hypothesis testing was applied to both datasets. Notably, dataset1 has been used in the development of the MT systems, and thus the results obtained with dataset2 should be considered the most representative ones, as they are completely unbiased: the set of sentences was unseen before the experiment and was only translated once. The number of times each hypothesis is tested for each dataset is quoted in Table 3, for both the standard (denoted as "stand.") and the enriched resources ("enrich.").

Parameter name (hypothesis)   Exper. value
thres_h1 (H1)                 -4.50
thres_h2 (H2)                 -4.00
thres_h4 (H4)                  1.50
thres_h5 (H5)                  1.50
thres_h6 (H6)                 -5.50

Table 2. Parameter values for the experiments.

Hypothesis activations per experiment

            dataset1           dataset2
Resource    stand.  enrich.    stand.  enrich.
H1          6       6          13      10
H2          1       1          0       0
H3          2       3          3       3
H4          7       8          9       8
H5          68      68         62      68
H6          32      32         32      44

Table 3. Tested hypotheses per dataset.

Since the first four hypotheses are only activated a few times each, the applications of hypotheses H1 to H4 are grouped together when reporting the results. As hypotheses H5 and H6 are tested more frequently, the application of each one of them is reported separately.

Number of sentences: 200; Reference translations: 1; Language pair: EL-EN; Resources: standard.

MT config.   BLEU    NIST   Meteor  TER
Baseline     0.3462  6.974  0.3947  51.05
H1 to H4     0.3479  6.985  0.3941  50.84
H1 to H5     0.3503  7.006  0.3944  50.80
H1 to H6     0.3517  7.049  0.3935  50.42

Table 4. Metric scores for dataset1, using the standard language resources, for the baseline system and for different hypotheses.

Table 4 depicts the results for the four objective MT evaluation metrics when using dataset1. As can be seen, the best BLEU score is obtained when checking all 6 hypotheses, and the same applies to NIST and TER. On the contrary, for Meteor the best result is obtained without resorting to the n-gram model information. Still, the difference in Meteor scores is minor (less than 0.3%). The improvements in BLEU, NIST and TER are respectively +1.6%, +1.0% and -1.2% over the baseline, when using all 6 hypotheses. Furthermore, as the number of hypotheses to be tested increases, the performance for all three metrics improves.

Number of sentences: 200; Reference translations: 1; Language pair: EL-EN; Resources: enriched.

MT config.   BLEU    NIST   Meteor  TER
Baseline     0.3518  7.046  0.3997  50.14
H1 to H4     0.3518  7.054  0.3990  50.00
H1 to H5     0.3541  7.094  0.3995  49.72
H1 to H6     0.3551  7.135  0.3984  49.37

Table 5. Metric scores for dataset1, using enriched language resources, for different systems.

In Table 5, the same experiment is repeated using an enriched set of lexical resources, including a bilingual lexicon with higher coverage. Notably, on a case-by-case comparison, the scores in Table 5 are higher than those of Table 4, confirming the benefits of using enriched lexical resources. Focusing on Table 5, and comparing the MT configurations without and with hypothesis testing, the results obtained are qualitatively similar to those of Table 4. Again, the best scores for Meteor are obtained when no hypotheses are tested. On the other hand, for the other metrics the n-gram modelling coupled with hypothesis testing improves the scores obtained. The improvements amount to approximately 1.0% for each of BLEU, NIST and TER over the baseline system scores, indicating a measurable improvement.

Tables 6 and 7 report the respective experiments using dataset2 instead of dataset1, with (i) standard and (ii) enriched lexical resources. With standard resources (Table 6), consistent improvements are achieved as more hypotheses are activated, for both BLEU and NIST. In the case of Meteor, the best performance is obtained when no hypotheses are activated, but once again the Meteor score varies minimally (by less than 0.2%). In contrast, the improvement obtained by activating hypothesis-checking is equal to 3.0% (BLEU), 1.4% (NIST) and 1.2% (TER). As can be seen, the improvement for the previously unused dataset2 is proportionally larger than for dataset1.

Number of sentences: 200; Reference translations: 1; Language pair: EL-EN; Resources: standard.

MT config.   BLEU    NIST   Meteor
Baseline     0.2747  6.193  0.3406
H1 to H4     0.2775  6.217  0.3403
H1 to H5     0.2815  6.246  0.3400
H1 to H6     0.2837  6.280  0.3401

Table 6. Metric scores for dataset2, using standard language resources, for different systems.

Number of sentences: 200; Reference translations: 1; Language pair: EL-EN; Resources: enriched.

MT config.   BLEU    NIST   Meteor  TER
Baseline     0.3008  6.541  0.3784  55.21
H1 to H4     0.3059  6.569  0.3790  54.96
H1 to H5     0.3105  6.593  0.3791  54.75
H1 to H6     0.3096  6.643  0.3779  54.64

Table 7. Metric scores for dataset2, using enriched language resources, for different systems.

Using the enriched resources, as indicated in Table 7, the best results for BLEU and Meteor are obtained with hypotheses 1 to 5, while for NIST and TER the best results are obtained when all six hypotheses are tested. In the case of Meteor any improvement is marginal (of the order of 0.2%). The improvements of the other metrics are more substantial, being 3.3% for BLEU, 1.6% for NIST and 1.0% for TER.

A statistical analysis has been undertaken to determine whether the additional n-gram modelling significantly improves the translation scores. More specifically, paired t-tests were carried out to determine whether the difference in translation accuracy was statistically significant, comparing the MT accuracy obtained with all six hypotheses against that of the baseline system. Two populations were formed by scoring each translated sentence independently with each of the NIST, BLEU and TER metrics, for dataset2. It was found that when using the standard resources (cf. Table 6), the translations were scored by TER to be significantly better when using the 6 hypotheses, in comparison to the baseline system, while for BLEU and NIST the translations of the two systems were equivalent (at a 0.05 confidence level). When using the enriched resources, no statistically significant difference was detected for any metric at a 0.05 confidence level, but significant differences were detected for all 3 metrics at a 0.10 confidence level (cf. Table 7).
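A sketch of this significance test, assuming per-sentence score lists for the two systems, is given below.

# Sketch of the paired t-test over per-sentence metric scores for the
# baseline system versus the system with all six hypotheses enabled.

from scipy import stats

def compare_systems(baseline_scores, hypo_scores, alpha=0.05):
    """baseline_scores, hypo_scores: per-sentence BLEU/NIST/TER values."""
    t, p = stats.ttest_rel(hypo_scores, baseline_scores)
    verdict = "significant" if p < alpha else "not significant"
    print(f"t = {t:.3f}, p = {p:.4f} -> {verdict} at alpha = {alpha}")
    return p < alpha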

5 Discussion

According to the experimental results, the addition of a new model to the hybrid MT system has contributed to an improved translation quality. These improvements have been achieved with a limited experimentation time and only a few hypotheses, on what is an extensively developed language pair for the proposed MT methodology. It is likely that as the suite of hypotheses is increased, larger improvements in objective metrics can be obtained.

When applying the hypotheses, the initial system translation is available both at token level and at lemma level. Out of the 6 hypotheses tested here, 5 involve token-based information and only one involves lemmas. If additional hypotheses operating on lemmas are added, a further improvement is expected.

Notably, the new n-gram modelling requires no collection or annotation of additional resources. The use of an established software package (SRILM) for assembling an n-gram database, via which hypotheses are rejected or confirmed, results in a straightforward implementation. In addition, multiple models can be effectively combined to improve translation accuracy by investigating different language aspects.

An interesting point is that the n-gram models created are factored (i.e. they include information at both lemma and token level). Thus, different types of queries may be supported, to improve translation quality.

6 Future work

The experiments reported here have shown that improvements can be achieved without specifying in detail the templates searched for, but allowing for more general formulations.

One aspect which should be addressed in future work concerns evaluation. Currently, this is limited to objective metrics. Still, it is well worth investigating the extent to which the translation improvement is reflected by subjective metrics, which are the preferred instrument for quality evaluation (Callison-Burch et al., 2011).

In addition, it is possible to achieve further improvements if the hypothesis templates are made more detailed, by supplementing the lexical information with detailed PoS information.

Tests performed so far have used empirically-set parameter values for the hypotheses. It is possible to adopt a systematic methodology such as MERT or genetic algorithms to optimise the actual values of the hypothesis parameters.

Another observation concerns the manner in which the two distinct language models are applied. In the present article, n-grams are used to correct a translation already established via the indexed phrase model, having a second-level, error-checking role. It is possible, however, to revise the mode of application of the language models so that, instead of a sequential application, the two model families are consulted at the same time. This leads to an MT system that exploits the information from multiple models concurrently, and is the focus of future research.

Acknowledgements

The research leading to these results has received funding from the POLYTROPON project (KRIPIS-GSRT, MIS: 448306).

References

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics.

Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar F. Zaidan. 2011. Findings of the 2011 Workshop on Statistical Machine Translation. In Proceedings of the 6th Workshop on Statistical Machine Translation, Edinburgh, Scotland, UK, pages 22-64.

Jaime Carbonell, Steve Klein, David Miller, Michael Steinbaum, Tomer Grassiany, and Jochen Frey. 2006. Context-Based Machine Translation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas, pages 19-28.

Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems. In EMNLP 2011 Workshop on Statistical Machine Translation, Edinburgh, Scotland, pages 85-91.

Yannis Dologlou, Stella Markantonatou, Olga Yannoutsou, Soula Fourla, and Nikos Ioannou. 2003. Using Monolingual Corpora for Statistical Machine Translation: The METIS System. In Proceedings of the EAMT-CLAW'03 Workshop, Dublin, Ireland, pages 61-68.

David Gale and Lloyd S. Shapley. 1962. College Admissions and the Stability of Marriage. American Mathematical Monthly, 69:9-14.

John Hutchins. 2005. Example-Based Machine Translation: a Review and Commentary. Machine Translation, 19:197-211.

Alexandre Klementiev, Ann Irvine, Chris Callison-Burch, and David Yarowsky. 2012. Toward Statistical Machine Translation without Parallel Corpora. In Proceedings of EACL 2012, Avignon, France, pages 130-140.

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, Cambridge.

Philipp Koehn and Hieu Hoang. 2007. Factored Translation Models. In Proceedings of EMNLP-CoNLL 2007, Prague, Czech Republic, pages 868-876.

Philipp Koehn and Kevin Knight. 2002. Learning a Translation Lexicon from Monolingual Corpora. In Proceedings of the ACL-02 Workshop on Unsupervised Lexical Acquisition, pages 9-16.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation. In Proceedings of HLT/NAACL 2003, pages 48-54.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the ACL 2007 Demo and Poster Sessions, Prague, pages 177-180.

Vladimir I. Levenshtein. 1966. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Soviet Physics Doklady, 10:707-710.

Stella Markantonatou, Sokratis Sofianopoulos, Olga Yannoutsou, and Marina Vassiliou. 2009. Hybrid Machine Translation for Low- and Middle-Density Languages. In S. Nirenburg (ed.), Language Engineering for Lesser-Studied Languages, pages 243-274. IOS Press.

NIST. 2002. Automatic Evaluation of Machine Translation Quality Using n-gram Co-occurrence Statistics.

Malte Nuhn, Arne Mauser, and Hermann Ney. 2012. Deciphering Foreign Language by Combining Language Models and Context Vectors. In Proceedings of ACL 2012, Jeju, Korea, pages 156-164.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of ACL 2002, Philadelphia, USA, pages 311-318.

Temple F. Smith and Michael S. Waterman. 1981. Identification of Common Molecular Subsequences. Journal of Molecular Biology, 147:195-197.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas, Cambridge, Massachusetts, USA, pages 223-231.

Sokratis Sofianopoulos, Marina Vassiliou, and George Tambouratzis. 2012. Implementing a Language-Independent MT Methodology. In Proceedings of the First Workshop on Multilingual Modeling (ACL 2012), Jeju, Korea, pages 1-10.

Andreas Stolcke, Jing Zheng, Wen Wang, and Victor Abrash. 2011. SRILM at Sixteen: Update and Outlook. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop.

Jinsong Su, Hua Wu, Haifeng Wang, Yidong Chen, Xiaodong Shi, Huailin Dong, and Qun Liu. 2012. Translation Model Adaptation for Statistical Machine Translation with Monolingual Topic Information. In Proceedings of ACL 2012, Jeju, Korea, pages 459-468.

George Tambouratzis, Sokratis Sofianopoulos, and Marina Vassiliou. 2013. Language-Independent Hybrid MT with PRESEMT. In Proceedings of the HyTra 2013 Workshop (ACL 2013), Sofia, Bulgaria, pages 123-130.

Dekai Wu. 2005. MT Model Space: Statistical versus Compositional versus Example-Based Machine Translation. Machine Translation, 19:213-227.


Syntax and Semantics in Quality Estimation of Machine Translation

Rasoul Kaljahi†‡, Jennifer Foster†, Johann Roturier‡
†NCLT, School of Computing, Dublin City University, Ireland
{rkaljahi, jfoster}@computing.dcu.ie
‡Symantec Research Labs, Dublin, Ireland
{johann roturier}@symantec.com

Abstract

We employ syntactic and semantic information in estimating the quality of machine translation from a new data set which contains source text from English customer support forums and target text consisting of its machine translation into French. These translations have been both post-edited and evaluated by professional translators. We find that quality estimation using syntactic and semantic information on this data set can hardly improve over a baseline which uses only surface features. However, the performance can be improved when they are combined with such surface features. We also introduce a novel metric to measure translation adequacy based on predicate-argument structure match using word alignments. While word alignments can be reliably used, the two main factors affecting the performance of all semantic-based methods seem to be the low quality of semantic role labelling (especially on ill-formed text) and the lack of nominal predicate annotation.

1 Introduction

The problem of evaluating machine translation output without reference translations is called quality estimation (QE) and has recently been the centre of attention (Bojar et al., 2014) following the seminal work of Blatz et al. (2003). Most QE studies have focused on surface and language-model-based features of the source and target. The quality of a translation is however closely related to the syntax and semantics of the languages involved, the former concerning fluency and the latter adequacy.

While there have been some attempts to utilize syntax in this task, semantics has received less attention. In this work, we aim to exploit both syntax and semantics in QE, with a particular focus on the latter. We use shallow semantic analysis obtained via semantic role labelling (SRL) and employ this information in QE in various ways, including statistical learning using both tree kernels and hand-crafted features. We also design a QE metric which is based on the Predicate-Argument structure Match (PAM) between the source and its translation. The semantic-based system is then combined with the syntax-based system to evaluate the full power of structural linguistic information. We also combine this system with a baseline system consisting of effective surface features.

A second contribution of the paper is the release of a new data set for QE.1 This data set comprises a set of 4.5K sentences chosen from customer support forum text. The machine translations of the sentences are not only evaluated in terms of adequacy and fluency, but also manually post-edited, allowing various metrics of interest to be applied to measure different aspects of quality. All experiments are carried out on this data set.

The rest of the paper is organized as follows: after reviewing the related work, the data is described and the semantic role labelling approach is explained. The baseline is then introduced, followed by the experiments with tree kernels, hand-crafted features, the PAM metric and finally the combination of all methods. The paper ends with a summary and suggestions for future work.

2 Related Work

Syntax has been exploited in QE in various ways, including tree kernels (Hardmeier et al., 2012; Kaljahi et al., 2013; Kaljahi et al., 2014b), parse probabilities and syntactic label frequencies (Avramidis, 2012), parseability (Quirk, 2004) and POS n-gram scores (Specia and Gimenez, 2010).

1 The data will be made publicly available; see http://www.computing.dcu.ie/mt/confidentmt.html


Turning to the role of semantic knowledge in QE and MT evaluation in general, Pighin and Marquez (2011) propose a method for ranking two translation hypotheses that exploits the projection of SRL from a sentence to its translation using word alignments. They first project the SRL of a source corpus to its parallel corpus and then build two translation models: 1) translations of proposition labelling sequences in the source to their projection in the target, and 2) translations of argument role fillers in the source to their counterparts in the target. The source SRL is then projected to its machine translation, and the above models are forced to translate source proposition labelling sequences to the projected ones. Finally, the confidence scores of these translations and their reachability are used to train a classifier which selects the better of the two translation hypotheses with an accuracy of 64%. Factors hindering their classifier are word alignment limitations and low SRL recall due to the lack of a verb or the loss of a predicate during translation.

In MT evaluation, where reference translations are available, Gimenez and Marquez (2007) use semantic roles in building several MT evaluation metrics which measure the full or partial lexical match between the fillers of the same semantic roles in the hypothesis and the reference translation, or simply the role label matches between them. They conclude that these features can only be useful in combination with other features and metrics reflecting different aspects of quality.

Lo and Wu (2011) introduce HMEANT, a manual MT evaluation metric based on predicate-argument structure matching which involves two steps of human engagement: 1) semantic role annotation of the reference and the machine translation, and 2) evaluation of the translation of predicates and arguments. The metric calculates the F1 score of the semantic frame match between the reference and the machine translation based on this evaluation. To keep the costs reasonable, the first step is carried out by amateur annotators who were minimally trained with a simplified list of 10 thematic roles. On a set of 40 examples, the metric is meta-evaluated in terms of correlation with human judgements of translation adequacy ranking, and a correlation as high as that of HTER is reported.

Lo et al. (2012) propose MEANT, a variant of HMEANT which automates its manual steps using 1) automatic SRL systems for (only) verb predicates, and 2) automatic alignment of predicates and their arguments in the reference and machine translation based on their lexical similarity. Once the predicates and arguments are aligned, their similarities are measured using a variety of methods such as cosine distance and even Meteor and BLEU. In the computation of the final score, the similarity scores replace the counts of correct and partial translations used in HMEANT. This metric outperforms several automatic metrics including BLEU, Meteor and TER, but it significantly under-performs HMEANT and HTER. Further analysis shows that automating the second step does not affect the performance of MEANT. Therefore, it seems to be the lower accuracy of the semantic role labelling that is responsible.

Bojar and Wu (2012) identify a set of flaws in HMEANT and propose solutions for them. The most important problems stem from the superficial SRL annotation guidelines. These problems are exacerbated in MEANT due to the automatic nature of its two steps. More recently, Lo et al. (2014) extend MEANT to ranking translations without a reference by using phrase translation probabilities for aligning the semantic role fillers of the source and its translation.

3 Data

We randomly select 4500 segments from a large collection of Symantec English Norton forum text.2 In order to be independent of any one MT system, we translate these segments into French with the following three systems and randomly choose 1500 distinct segments from each:

• ACCEPT3: a phrase-based Moses system trained on the training sets of the WMT12 releases of Europarl and News Commentary, plus Symantec translation memories

• SYSTRAN: a proprietary rule-based system augmented with domain-specific dictionaries

• Bing4: an online translation system

These translations are evaluated in two ways. The first method involves light post-editing by a professional human translator who is a native French speaker.5 Each sentence translation is then scored against its post-edit using BLEU6 (Papineni et al., 2002), TER (Snover et al., 2006) and Meteor (Denkowski and Lavie, 2011), which are the most widely used MT evaluation metrics. Following Snover et al. (2006), we consider this way of scoring MT output to be a variation of human-targeted scoring, where no reference translation is provided to the post-editor, so we call these metrics HBLEU, HTER and HMeteor. The average scores for the entire data set, together with their standard deviations, are presented in Table 1.7

2 http://community.norton.com
3 http://www.accept.unige.ch/Products/D_4_1_Baseline_MT_systems.pdf
4 http://www.bing.com/translator (on 24-Feb-2014)


Score  Adequacy          Fluency
5      All meaning       Flawless Language
4      Most of meaning   Good Language
3      Much of meaning   Non-native Language
2      Little meaning    Disfluent Language
1      None of meaning   Incomprehensible

Table 2: Adequacy/fluency score interpretation


In the second method, we asked three professional translators, who are again native French speakers, to assess the quality of the MT output in terms of adequacy and fluency on a 5-grade scale (LDC, 2002). The interpretation of the scores is given in Table 2. Each evaluator was given the entire data set for evaluation. We therefore collected three sets of scores and averaged them to obtain the final scores. The averages of these scores for the entire data set, together with their standard deviations, are presented in Table 1. To be easily comparable to the human-targeted scores, we scale these scores to the [0,1] range, i.e. adequacy/fluency scores of 1 and 5 are mapped to 0 and 1 respectively, and all the scores in between are scaled accordingly.
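A minimal sketch of these two normalisation steps, the [0,1] scaling of the 5-grade judgements and the 1-HTER conversion described in footnote 7:

# Normalising the scores so that higher always means better, on [0, 1].

def scale_judgement(score):
    """Map a 5-grade adequacy/fluency average onto [0, 1]."""
    return (score - 1.0) / 4.0

def human_targeted_ter(hter):
    """Cut HTER off at 1, then convert to 1-HTER (cf. footnote 7)."""
    return 1.0 - min(hter, 1.0)

assert scale_judgement(1) == 0.0 and scale_judgement(5) == 1.0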


5The post-editing guidelines are based on theTAUS/CNGL guidelines for achieving “good enough”quality downloaded from https://evaluation.taus.net/images/stories/guidelines/taus-cngl-machine-translation-postediting-guidelines.pdf.

6 Version 13a of the MTEval script was used at the segment level, which performs smoothing.

7 Note that HTER scores have no upper limit and can be higher than 1 when the number of errors is higher than the segment length. In addition, a higher HTER indicates lower translation quality. To be comparable to the other scores, we cut them off at 1 and convert them to 1-HTER.

          1-HTER   HBLEU    HMeteor  Adq      Flu
1-HTER    -        -        -        -        -
HBLEU     0.9111   -        -        -        -
HMeteor   0.9207   0.9314   -        -        -
Adq       0.6632   0.7049   0.6843   -        -
Flu       0.6447   0.7213   0.6652   0.8824   -

Table 3: Pearson r between pairs of metrics on the entire 4.5K data set

In other words, the difference between scores of 5 and 4 is treated the same as the difference between 5 and 2. To account for this, we use weighted Kappa instead. Specifically, we consider two scores with a difference of 1 to represent 75% agreement instead of 100%. All other differences are considered to be a disagreement. The average weighted Kappa computed in this way is 0.65 for adequacy and 0.63 for fluency. Though the weighting used is quite strict, the weighted Kappa values are in the substantial agreement range.
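For concreteness, the following is a minimal sketch of the weighted Kappa computation described above; the 0/0.25/1 disagreement weights come from the text, but the function layout and the 5-category assumption are our own.

```python
import numpy as np

def weighted_kappa(scores_a, scores_b, categories=5):
    """Weighted Cohen's kappa with the strict weighting above: identical
    scores count as full agreement, a one-point difference as 75%
    agreement, and any larger difference as full disagreement."""
    # Disagreement weights: 0 on the diagonal, 0.25 off by one, 1 otherwise.
    w = np.ones((categories, categories))
    np.fill_diagonal(w, 0.0)
    for i in range(categories - 1):
        w[i, i + 1] = w[i + 1, i] = 0.25

    # Observed joint distribution of the two annotators' 1-5 scores.
    observed = np.zeros((categories, categories))
    for a, b in zip(scores_a, scores_b):
        observed[a - 1, b - 1] += 1
    observed /= observed.sum()

    # Expected joint distribution under independence of the annotators.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    return 1.0 - (w * observed).sum() / (w * expected).sum()

# e.g. weighted_kappa([5, 4, 2, 3], [4, 4, 5, 3]) for adequacy scores
```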

Once we have both the human-targeted and the manual evaluation scores, it is interesting to know how they are correlated. We calculate the Pearson correlation coefficient r between each pair of the five scores and present them in Table 3. HBLEU has the highest correlation with both adequacy and fluency scores among the human-targeted metrics. HTER, on the other hand, has the lowest correlation. Moreover, HBLEU is more correlated with fluency than with adequacy, which is the opposite of HMeteor. This is expected given the definitions of BLEU and Meteor. There is also a high correlation between adequacy and fluency scores. Although this could be related to the fact that both scores come from the same evaluators, it indicates that if either the fluency or the adequacy of the MT output is low or high, the other tends to be the same.

The data is split into train, development and test sets of 3000, 500 and 1000 sentences respectively.

4 Semantic Role Labelling

The type of semantic information we use in this work is the predicate-argument structure, or semantic role labelling, of the sentence. This information needs to be extracted from both sides of the translation, i.e. English and French. Though the SRL of English has been well studied (Màrquez et al., 2008) thanks to the existence of two major hand-crafted resources, namely FrameNet (Baker et al., 1998) and PropBank (Palmer et al., 2005), French is one of the under-studied languages in this respect, mainly due to a lack of such resources.


                    1-HTER   HBLEU    HMeteor  Adequacy  Fluency
Average             0.6976   0.5517   0.7221   0.6230    0.4096
Standard Deviation  0.2446   0.2927   0.2129   0.2488    0.2780

Table 1: Average and standard deviation of the evaluation scores for the entire data set

The only available gold standard resource is a small set of 1000 sentences taken from Europarl (Koehn, 2005) and manually annotated with PropBank verb predicates (van der Plas et al., 2010). van der Plas et al. (2011) attempt to tackle this scarcity by automatically projecting SRL from the English side of a large parallel corpus to its French side. Our preliminary experiments (Kaljahi et al., 2014a), however, show that SRL models trained on the small manually annotated corpus have a higher quality than ones trained on the much larger projected corpus. We therefore use the 1K gold standard set to train a French SRL model. For English, we use all the data provided in the CoNLL 2009 shared task (Hajič et al., 2009).

We use LTH (Björkelund et al., 2009), a dependency-based SRL system, for both the English and French data. This system was among the best performing systems in the CoNLL 2009 shared task and is straightforward to use. It comes with a set of features tuned for each shared task language (English, German, Japanese, Spanish, Catalan, Czech, Chinese). We compared the performance of the English and Spanish feature sets on French and chose the former due to its higher performance (by 1 F1 point).

It should be noted that the English SRL data come with gold standard syntactic annotation. For our QE data set, on the other hand, such annotation is not available. Our preliminary experiments show that, since the SRL system relies heavily on syntactic features, the performance drops considerably when the syntactic annotation of the test data is obtained using a different parser than that of the training data. We therefore replace the parses of the training data with those obtained automatically by first parsing the data with the Lorg PCFG-LA parser8 (Attia et al., 2010) and then converting the output to dependencies using the Stanford converter (de Marneffe and Manning, 2008). The POS tags are also replaced with those output by the parser.

8 https://github.com/CNGLdlab/LORG-Release

For the same reason, we replace the original POS tags of the French 1K data with those obtained by the MElt tagger (Denis and Sagot, 2012).

The English SRL achieves 77.77 and 67.02 labelled F1 points when trained only on the training section of PropBank and tested on the WSJ and Brown test sets respectively.9 The French SRL is evaluated using 5-fold cross-validation on the 1K data set and obtains an average F1 of 67.66. When applied to the QE data set, these models identify 9133, 8875 and 8795 propositions on its source side, post-edits and MT output respectively.

5 Baseline

We compare the results of our experiments to a baseline built using the 17 baseline features of the WMT QE shared task (Bojar et al., 2014). These features provide a strong baseline and have been used in all three years of the shared task. We use support vector regression implemented in the SVMLight toolkit10 with a Radial Basis Function (RBF) kernel to build this baseline. To extract these features, a parallel English-French corpus is required to build a lexical translation table using GIZA++ (Och and Ney, 2003). We use the Europarl English-French parallel corpus (Koehn, 2005) plus around 1M segments of Symantec translation memory.
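As an illustration, here is a minimal sketch of such a baseline regressor; it substitutes scikit-learn for SVMLight, and the file names and hyper-parameter values are placeholders rather than the paper's settings.

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical inputs: one row of the 17 WMT baseline features per
# segment, and the score to predict (e.g. 1-HTER) as regression target.
X_train = np.load("wmt17_features_train.npy")   # shape (3000, 17), assumed file
y_train = np.load("hter_train.npy")             # shape (3000,), assumed file

# Support vector regression with an RBF kernel; C and gamma would be
# tuned on the development set with respect to Pearson r.
model = SVR(kernel="rbf", C=1.0, gamma=0.1)
model.fit(X_train, y_train)
predictions = model.predict(np.load("wmt17_features_test.npy"))
```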

Table 4 shows the performance of this system (WMT17) on the test set measured by Root Mean Square Error (RMSE) and the Pearson correlation coefficient (r). We only report the results on predicting four of the metrics introduced above, omitting HMeteor due to space constraints. The C and γ parameters are tuned on the development set with respect to r. The results show a significant difference between manual and human-targeted metric prediction. The higher r for the former suggests that the patterns of these scores are easier to learn.

9 Although the English SRL data are annotated for noun predicates as well as verb predicates, since the French data has only verb predicate annotations, we only consider verb predicates for English.

10 http://svmlight.joachims.org/


The RMSE seems to follow the standard deviation of the scores, as the same ranking is seen in both.

6 Tree Kernels

Tree kernels (Moschitti, 2006) have been successfully used in QE by Hardmeier et al. (2012) and in our previous work (Kaljahi et al., 2013; Kaljahi et al., 2014b), where syntactic trees are employed. Tree kernels eliminate the burden of manual feature engineering by efficiently utilizing all subtrees of a tree. We employ both syntactic and semantic information in learning quality scores, using SVMLight-TK11, a support vector machine (SVM) implementation of tree kernels.

We implement a syntactic tree kernel QE system with constituency and dependency trees of the source and target side, following our previous work (Kaljahi et al., 2013; Kaljahi et al., 2014b). The performance of this system (TKSyQE) is shown in Table 4. Unlike our previous results, where the syntax-based system significantly outperformed the WMT17 baseline, TKSyQE can only beat the baseline in HTER and fluency prediction, with neither difference being statistically significant, and it is below the baseline for HBLEU and adequacy prediction.12 It should be noted that in our previous work a WMT News data set was used as the QE data set which, unlike our new data set, is well-formed and in the same domain as the parsers' training data. The discrepancy between our new and old results suggests that the performance is strongly dependent on the data set.

Unlike syntactic parsing, semantic role labelling does not produce a tree that can be directly used in the tree kernel framework. There are various ways to accomplish this goal. We first try a method inspired by the PAS format introduced by Moschitti et al. (2006). In this format, a fixed number of nodes are gathered under a dummy root node as slots for one predicate and 6 arguments of a proposition (one tree per predicate). Each node dominates an argument label, or a dummy label for the predicate, which in turn dominates the POS tag of the argument or the predicate lemma. If a proposition has more than 6 arguments, the extra ones are ignored; if it has fewer than 6 arguments, the extra slots are attached to a dummy null label. Note that these trees are derived from the dependency-based SRL of both the source and target side (Figure 1(a)).

11 http://disi.unitn.it/moschitti/Tree-Kernel.htm

12 We use paired bootstrap resampling (Koehn, 2004) for statistical significance testing.

          1-HTER   HBLEU    Adq      Flu
RMSE
WMT17     0.2310   0.2696   0.2219   0.2469
TKSyQE    0.2267   0.2721   0.2258   0.2431
D-PAS     0.2489   0.2856   0.2423   0.2652
D-PST     0.2409   0.2815   0.2383   0.2606
C-PST     0.2400   0.2809   0.2410   0.2615
CD-PST    0.2394   0.2795   0.2373   0.2578
TKSSQE    0.2269   0.2722   0.2253   0.2425
Pearson r
WMT17     0.3661   0.3806   0.4710   0.4769
TKSyQE    0.3693   0.3559   0.4306   0.5013
D-PAS     0.1774   0.1843   0.2770   0.3252
D-PST     0.2136   0.2450   0.3169   0.3670
C-PST     0.2319   0.2541   0.2966   0.3616
CD-PST    0.2311   0.2714   0.3303   0.3923
TKSSQE    0.3682   0.3537   0.4351   0.5046

Table 4: RMSE and Pearson r of the 17 baseline features (WMT17) and the tree kernel systems; TKSyQE: syntax-based tree kernels; D-PAS: dependency-based PAS tree kernels of Moschitti et al. (2006); D-PST, C-PST and CD-PST: dependency-based and constituency-based proposition subtree kernels and their combination; TKSSQE: syntactic-semantic tree kernels

The results are shown in Table 4 (D-PAS). The performance is statistically significantly lower than the baseline.13

In order to encode more information in the trees, we propose another format in which the proposition subtrees (PST) of the sentence are gathered under a dummy root node. A dependency PST (Figure 1(b)) is formed by the predicate label under the root dominating its lemma and all its argument roles. Each of these nodes in turn dominates three nodes: the argument word form (the predicate word form in the case of a predicate lemma), its syntactic dependency relation to its head, and its POS tag. We preserve the order of arguments and predicate in the sentence.14 This system is named D-PST in Table 4. Tree kernels in this format significantly outperform D-PAS. However, the performance is still far lower than the baseline.
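To make the format concrete, the sketch below serializes one proposition into the bracketed tree notation commonly accepted by tree-kernel SVM tools; the data structure and the exact label strings are our assumptions, not the authors' implementation.

```python
from collections import namedtuple

# One element of a proposition: role is the predicate label or an
# argument role; word, deprel and pos describe the corresponding token.
Element = namedtuple("Element", "role word deprel pos")

def dpst(elements):
    """Build a dependency proposition subtree: each role node dominates
    the word form, dependency relation and POS tag, and elements are
    kept in sentence order under a dummy root."""
    subtrees = ["(%s (%s)(%s)(%s))" % (e.role, e.word, e.deprel, e.pos)
                for e in elements]
    return "(ROOT %s)" % "".join(subtrees)

# For "Can anyone help?": predicate help.01 with argument A0 "anyone".
prop = [Element("A0", "anyone", "SBJ", "NN"),
        Element("help.01", "help", "ROOT", "VB")]
print(dpst(prop))
# (ROOT (A0 (anyone)(SBJ)(NN))(help.01 (help)(ROOT)(VB)))
```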

The above formats are based on dependency trees. We try another PST format derived from constituency trees. These PSTs (Figure 1(c)) are the lowest common subtrees spanning the predicate node and its argument nodes, and are gathered under a dummy root node.

13 Note that the only lexical information in this format is the predicate lemma. We tried replacing the POS tags with argument word forms, which led to a slight degradation.

14 This format was chosen among several other variations due to its higher performance.


Figure 1: Semantic tree kernel formats for the sentence "Can anyone help?": (a) D-PAS, (b) D-PST, (c) C-PST, (d) D-TKSSQE, (e) C-TKSSQE.

The argument role labels are concatenated with the syntactic non-terminal category of the argument node. Predicates are not marked. However, our dependency-based SRL needs to be converted into a constituency-based format. While constituency-to-dependency conversion is straightforward using head-finding rules (Surdeanu et al., 2008), the other way around is not. We therefore approximate the conversion using a heuristic we call D2C.15 As shown in Table 4, the system built using these PSTs (C-PST) improves over D-PST for human-targeted metric prediction, but not for manual metric prediction. However, when they are combined in CD-PST, we see improvement over the highest scores of both systems, except for HTER prediction in terms of Pearson r. The fluency prediction improvement is statistically significant; the other changes are not.

An alternative approach to formulating semantic tree kernels is to augment syntactic trees with semantic information. We augment the trees in TKSyQE with semantic role labels. We attach semantic roles to the dependency labels of the argument nodes in the dependency trees, as in Figure 1(d). For constituency trees, we use the D2C heuristic to elevate the roles up from the terminal nodes and attach the labels to the syntactic non-terminal category of the node, as in Figure 1(e). The performance of the resulting system, TKSSQE, is shown in Table 4. It substantially outperforms its counterpart, CD-PST, all differences being statistically significant. However, compared to the plain syntactic tree kernels (TKSyQE), the changes are slight and inconsistent, rendering the augmentation not useful.

15 This heuristic (D2C) recursively elevates the argument role already assigned to a terminal node (based on the dependency-based argument position) to the parent node as long as 1) the argument node is not a root node or is not tagged as a POS (possessive), 2) the role is not an AM-NEG, AM-MOD or AM-DIS adjunct, and 3) the argument does not dominate its predicate's node or another argument node of the same proposition.

          1-HTER   HBLEU    Adq      Flu
RMSE
WMT17     0.2310   0.2696   0.2219   0.2469
HCSyQE    0.2435   0.2797   0.2334   0.2479
HCSeQE    0.2482   0.2868   0.2416   0.2612
Pearson r
WMT17     0.3661   0.3806   0.4710   0.4769
HCSyQE    0.2572   0.3080   0.3961   0.4696
HCSeQE    0.1794   0.1636   0.2972   0.3577

Table 5: RMSE and Pearson r of the 17 baseline features (WMT17) and the hand-crafted features

We consider this system to be our syntactic-semantic tree kernel system.

7 Hand-crafted Features

In our previous work (Kaljahi et al., 2014b), we experimented with a set of hand-crafted syntactic features extracted from both constituency and dependency trees on a different data set. We apply the same feature set to the new data set here. The results are reported in Table 5. The performance of this system (HCSyQE) is significantly lower than the baseline. This is the opposite of what we observed with the same feature set on a different data set, again showing that the role of the data is fundamental in understanding system performance. The main difference between these two data sets is that the former is extracted from well-formed text in the news domain, the same domain on which our parsers and SRL system have been trained, while the new data set does not necessarily contain well-formed text, nor is it from the same domain.

We design another set of feature types aiming at capturing the semantics of the source and translation via predicate-argument structure. The feature types are listed in Table 6. Feature types 1 to 8 each contain two features, one extracted from the source and the other from the translation. To compute argument span sizes (feature types 4 and 5), we use the constituency conversion of the SRL obtained using the D2C heuristic introduced in Section 6.


1  Number of propositions
2  Number of arguments
3  Average number of arguments per proposition
4  Sum of span sizes of arguments
5  Ratio of sum of span sizes of arguments to sentence length
6  Proposition label sequences
7  Constituency label sequences of proposition elements
8  Dependency label sequences of proposition elements
9  Percentage of predicate/argument word alignment mapping types

Table 6: Semantic feature types

The proposition label sequence (feature type 6) is the concatenation of the argument roles and predicate labels of the proposition in their original order (e.g. A0-go.01-A4). Similarly, the constituency and dependency label sequences (feature types 7 and 8) are extracted by replacing the argument and predicate labels with their constituency and dependency labels respectively. Feature type 9 consists of three features based on the word alignment of the source and target sentences: the number of non-aligned, one-to-many-aligned and many-to-one-aligned predicates and arguments. The word alignments are obtained using the grow-diag-final-and heuristic, as it performed slightly better than the other types.16

As in the baseline system, we use SVMs to build the QE systems using these hand-crafted features. The nominal features are binarized to be usable by the SVM. However, the set of possible feature values can be large, leading to a large number of binary features. For example, there are more than 5000 unique proposition label sequences in our data. Not only does this high dimensionality reduce the efficiency of the system, it can also affect its performance as these features are sparse. To tackle this issue, we impose a frequency cutoff on these features: we keep only frequent features, using a threshold set empirically on the development set.
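A minimal sketch of this binarization with a frequency cutoff; the function name and the threshold value are illustrative, with the actual threshold set on the development set as described.

```python
from collections import Counter

def binarize_with_cutoff(train_values, min_count=2):
    """Keep one binary feature per nominal value (e.g. a proposition
    label sequence such as "A0-go.01-A4") that occurs at least
    min_count times in the training data."""
    counts = Counter(train_values)
    vocab = sorted(v for v, c in counts.items() if c >= min_count)
    index = {v: i for i, v in enumerate(vocab)}

    def encode(value):
        # One-hot vector; rare or unseen values map to the zero vector.
        vec = [0] * len(vocab)
        if value in index:
            vec[index[value]] = 1
        return vec

    return encode, vocab

encode, vocab = binarize_with_cutoff(
    ["A0-go.01-A4", "A0-go.01-A4", "A1-be.01"], min_count=2)
```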

Table 5 shows the performance of the system (HCSeQE) built with these features. The semantic features perform substantially worse than the syntactic features and thus the baseline, especially in predicting human-targeted scores. Since these features are chosen from a comprehensive set of semantic features, and since they should ideally capture adequacy better than general features, a probable reason for their low performance is the quality of the underlying syntactic and semantic analysis.

16 It should be noted that a number of features in addition to those presented here were tried, e.g. the ratio and difference of the source and target values of the numerical features. However, through manual feature selection, we removed features which did not appear to contribute much.


8 Predicate-Argument Match (PAM)

Translation adequacy measures how much of the source meaning is preserved in the translated text. Predicate-argument structure, or semantic role labelling, expresses a substantial part of this meaning. Therefore, the match between the predicate-argument structure of the source and that of its translation could be an important clue to translation adequacy, independent of the language pair used. We attempt to exploit predicate-argument match (PAM) to create a metric that measures translation adequacy.

The algorithm to compute the PAM score starts by aligning the predicates and arguments of the source side to the target side using word alignments.17 It then treats the problem as one of SRL scoring, similar to the scoring scheme used in the CoNLL 2009 shared task (Hajič et al., 2009). Taking the source-side SRL as a reference, it computes the unlabelled precision and recall of the target-side SRL with respect to it:

UPrec = (# aligned predicates and their arguments) / (# target-side predicates and arguments)

URec = (# aligned predicates and their arguments) / (# source-side predicates and arguments)

Labelled precision and recall are calculated in the same way, except that they also require argument label agreement. UF1 and LF1 are the harmonic means of the unlabelled and labelled scores respectively. Inspired by the observation that most source sentences with no identified proposition are short and can be assumed to be easier to translate, and based on experiments on the dev set, we assign a score of 1 to such sentences. When no proposition is identified in the target side while there is a proposition in the source, we assign a score of 0.5.
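The following sketch implements the unlabelled variant of this scoring with the default values above; the input representation (predicate/argument token positions plus a set of word-alignment pairs) is our assumption.

```python
def pam_unlabelled(src_items, tgt_items, alignment):
    """src_items/tgt_items: token positions of predicates and arguments
    on each side; alignment: set of (src_pos, tgt_pos) word alignments.
    Returns UF1, with the default scores for predicate-less sentences."""
    if not src_items:                 # no proposition in the source
        return 1.0
    if not tgt_items:                 # proposition in the source only
        return 0.5
    aligned = sum(1 for s in src_items
                  if any((s, t) in alignment for t in tgt_items))
    if aligned == 0:
        return 0.0
    uprec = aligned / len(tgt_items)  # denominator: target-side items
    urec = aligned / len(src_items)   # denominator: source-side items
    return 2 * uprec * urec / (uprec + urec)
```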

We obtain word alignments using the Moses toolkit (Hoang et al., 2009), which can generate alignments in both directions and combine them using a number of heuristics. We tried intersection, union and source-to-target only, as well as the grow-diag-final-and heuristic, but only the source-to-target results are reported here as they slightly outperform the others.

Table 7 shows the RMSE and Pearson r for each of the unlabelled and labelled F1 against the adequacy and fluency scores on the test data set.18

17 We also tried lexical and phrase translation tables for this purpose in addition to word alignments, but they do not outperform word alignments.


        1-HTER   HBLEU    Adq      Flu
RMSE
UF1     0.3175   0.3607   0.3108   0.4033
LF1     0.4247   0.3903   0.3839   0.3586
Pearson r
UF1     0.2328   0.2179   0.2698   0.2865
LF1     0.1784   0.1835   0.2225   0.2688

Table 7: RMSE and Pearson r of the PAM unlabelled and labelled F1 scores as estimations of the MT evaluation metrics

            1-HTER   HBLEU    Adq      Flu
RMSE
PAM         0.2414   0.2833   0.2414   0.2661
HCSeQE      0.2482   0.2868   0.2416   0.2612
HCSeQEpam   0.2445   0.2822   0.2370   0.2575
Pearson r
PAM         0.2292   0.2195   0.2787   0.3210
HCSeQE      0.1794   0.1636   0.2972   0.3577
HCSeQEpam   0.2387   0.2368   0.3571   0.3908

Table 8: RMSE and Pearson r of the PAM scores used as features, alone (PAM) and combined with the hand-crafted semantic features (HCSeQEpam)


According to the results, the unlabelled F1 (UF1) is a closer estimation than the labelled one. Its Pearson correlation scores are overall competitive with those of the hand-crafted semantic features (HCSeQE in Table 5): they are better for the automatic metric cases but lower for the manual ones. However, the RMSE scores are considerably larger. Overall, the performance is not comparable to the baseline and the other well-performing systems. We investigate the reasons behind this result in the next section.

Another way to employ the PAM scores in QE is to use them in a statistical framework. We build an SVM model using all 6 PAM scores. The performance of this system (PAM) on the test set is shown in Table 8. The performance is considerably higher than when the PAM scores are used directly as estimations. Interestingly, compared to the 47 semantic hand-crafted features (HCSeQE), this small feature set performs better in predicting the human-targeted metrics.

We add these features to our set of hand-crafted features from Section 7 to yield a new system (HCSeQEpam in Table 8). All scores improve compared to the stronger of the two components. However, only the manual metric prediction improvements are statistically significant. The performance is still not close to the baseline.

18 Precision and recall scores were also tried. Precision proved to be the weakest estimator, whereas recall scores were highest for some settings.

8.1 Analyzing PAM

Ideally, PAM scores should capture the adequacy of the translation with high accuracy. The results are, however, far from ideal. There are two factors involved in the PAM scoring procedure whose quality can affect its performance: 1) the predicate-argument structure of the source and target side of the translation, and 2) the alignment of the predicate-argument structures of source and target.

The SRL systems for both English and French are trained on edited newswire. Our data, on the other hand, is neither from the same domain nor edited. The problem is exacerbated on the translation target side, where our French SRL system is trained on only a small data set and applied to machine translation output. To discover the contribution of each of these factors to the accuracy of PAM, we carry out a manual analysis. We randomly select 10% of the development set (50 sentences) and count the number of problems in each of these two categories.

We find only 8 cases in which a wrong word alignment misleads PAM scoring. On the other hand, there are 219 cases of SRL problems, including predicate and argument identification and labelling: 82 cases (37%) in the source and 138 cases (63%) in the target.

We additionally look for cases where a translation divergence causes a predicate-argument mismatch between the source and translation. For example, without sacrificing is translated into sans impact sur (without impact on), a case of transposition, where the source-side verb predicate is left unaligned, thus affecting the PAM score. We find only 9 such cases in the sample, which is similar to the proportion of word alignment problems.

As mentioned in the previous section, PAM scoring has to assign default values for cases in which there is no predicate in the source or target. This can be another source of estimation error. In order to verify its effect, we find such cases in the development set and manually categorize them based on the reason causing the sentence to be left without predicates. There are 79 (16%) source and 96 (19%) target sentences for which the SRL systems do not identify any predicate, out of which 64 cases have both sides without any predicate. Among such source sentences, 20 (25%) have no predicate due to a predicate identification error of the SRL system, 57 (72%) because of the sentence structure (e.g. copula verbs, which are not labelled as predicates in the SRL training data, titles, etc.), and the remaining 2 due to spelling errors misleading the SRL system.


              1-HTER   HBLEU    Adq      Flu
RMSE
WMT17         0.2310   0.2696   0.2219   0.2469
SyQE          0.2255   0.2711   0.2248   0.2419
SeQE          0.2249   0.2710   0.2242   0.2404
SSQE          0.2246   0.2696   0.2230   0.2402
SSQE+WMT17    0.2225   0.2673   0.2202   0.2379
Pearson r
WMT17         0.3661   0.3806   0.4710   0.4769
SyQE          0.3824   0.3650   0.4393   0.5087
SeQE          0.3884   0.3648   0.4447   0.5182
SSQE          0.3920   0.3768   0.4538   0.5196
SSQE+WMT17    0.4144   0.3953   0.4771   0.5331

Table 9: RMSE and Pearson r of the 17 baseline features (WMT17) and the system combinations

Among the target side sentences, most of the cases are due to the sentence structure (65, or 68%), and only 14 cases (15%) are caused by an SRL error. In 13 cases, no verb predicate in the source is translated correctly. Among the remaining cases, two are due to untranslated spelling errors in the source and the other two to tokenization errors misleading the SRL system.

These numbers show that the main reason for sentences being left without verbal predicates is the sentence structure. This problem could be alleviated by employing nominal predicates on both sides. While this is possible for the English side, there is currently no French resource in which nominal predicates have been annotated.

9 Combining Systems

We now combine the systems we have built so far (Table 9). We first combine the syntax-based and semantic-based systems individually. SyQE is the combination of the syntactic tree kernel system (TKSyQE) and the hand-crafted syntactic features (HCSyQE). Likewise, SeQE is the combination of the semantic tree kernel system (TKSSQE) and the semantic hand-crafted features including the PAM features (HCSeQEpam). These two systems are combined in SSQE, but without the syntactic tree kernels (TKSyQE) to avoid redundancy with TKSSQE, as the latter are the augmented syntactic tree kernels. We finally combine SSQE with the baseline.

SyQE significantly improves over its tree kernel and hand-crafted components. It also outperforms the baseline in HTER and fluency prediction, but is beaten by it in HBLEU and adequacy prediction. None of these differences are

statistically significant, however. SeQE also performs better than the stronger of its components. Except for adequacy prediction, the improvements are statistically significant. This system performs slightly better than SyQE. Its comparison to the baseline is the same as that of SyQE, except that its superiority to the baseline in fluency prediction is statistically significant.

The full syntactic-semantic system (SSQE) also improves over its syntactic and semantic components. However, the improvements are not statistically significant. Compared to the baseline, HTER and fluency prediction perform better, the latter being statistically significant. HBLEU prediction is around the same as the baseline, but adequacy prediction performance is lower, though not statistically significantly.

Finally, when we combine the syntactic-semantic system with the baseline system, the combination continues to improve further. Compared to the stronger component, however, only the HTER and fluency prediction improvements are statistically significant.

10 Conclusion

We introduced a new QE data set drawn from customer support forum text, machine translated and both post-edited and manually evaluated for adequacy and fluency. We used syntactic and semantic QE systems via both tree kernels and hand-crafted features. We found it hard to improve over a baseline, albeit a strong one, using such information, which is extracted by applying parsers and semantic role labellers to out-of-domain and unedited text. We also defined a metric for estimating translation adequacy based on the predicate-argument structure match between source and target. This metric relies on automatic word alignments and semantic role labelling. We find that word alignment and translation divergence have only minor effects on the performance of this metric, whereas the quality of the semantic role labelling is the main hindering factor. Another major issue affecting the performance of PAM is the unavailability of nominal predicate annotation.

Our PAM scoring method is based only on word matches, as there are no constituent SRL resources available for French; constituent-based arguments might make a more accurate comparison between the source and target predicate-argument structures possible.


Acknowledgments

This research has been supported by the Irish Research Council Enterprise Partnership Scheme (EPSPG/2011/102) and the computing infrastructure of the CNGL at DCU. We thank the reviewers for their helpful comments.

References

Mohammed Attia, Jennifer Foster, Deirdre Hogan, Joseph Le Roux, Lamia Tounsi, and Josef van Genabith. 2010. Handling unknown words in statistical latent-variable parsing models for Arabic, English and French. In Proceedings of the 1st Workshop on Statistical Parsing of Morphologically Rich Languages.

Eleftherios Avramidis. 2012. Quality estimation for machine translation output using linguistic analysis and decoding features. In Proceedings of the 7th WMT.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 36th ACL.

Anders Björkelund, Love Hafdell, and Pierre Nugues. 2009. Multilingual semantic role labeling. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task.

John Blatz, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing. 2003. Confidence estimation for machine translation. In JHU/CLSP Summer Workshop Final Report.

Ondřej Bojar and Dekai Wu. 2012. Towards a predicate-argument evaluation for MT. In Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation.

Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Hervé Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 Workshop on Statistical Machine Translation. In Proceedings of the 9th WMT.

Marie-Catherine de Marneffe and Christopher D. Manning. 2008. The Stanford typed dependencies representation. In Proceedings of the COLING Workshop on Cross-Framework and Cross-Domain Parser Evaluation.

Pascal Denis and Benoît Sagot. 2012. Coupling an annotated corpus and a lexicon for state-of-the-art POS tagging. Language Resources and Evaluation, 46(4):721–736.

Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of the 6th WMT.

Jesús Giménez and Lluís Màrquez. 2007. Linguistic features for automatic evaluation of heterogeneous MT systems. In Proceedings of the Second Workshop on Statistical Machine Translation.

Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–18.

Christian Hardmeier, Joakim Nivre, and Jörg Tiedemann. 2012. Tree kernels for machine translation quality estimation. In Proceedings of the Seventh WMT.

Hieu Hoang, Philipp Koehn, and Adam Lopez. 2009. A unified framework for phrase-based, hierarchical, and syntax-based statistical machine translation. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT).

Rasoul Kaljahi, Jennifer Foster, Raphael Rubino, Johann Roturier, and Fred Hollowood. 2013. Parser accuracy in quality estimation of machine translation: a tree kernel approach. In International Joint Conference on Natural Language Processing (IJCNLP).

Rasoul Kaljahi, Jennifer Foster, and Johann Roturier. 2014a. Semantic role labelling with minimal resources: Experiments with French. In Third Joint Conference on Lexical and Computational Semantics (*SEM).

Rasoul Kaljahi, Jennifer Foster, Raphael Rubino, and Johann Roturier. 2014b. Quality estimation of English-French machine translation: A detailed study of the role of syntax. In International Conference on Computational Linguistics (COLING).

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Conference Proceedings: the Tenth Machine Translation Summit.

LDC. 2002. Linguistic data annotation specification: Assessment of fluency and adequacy in Chinese-English translations. Technical report.

Chi-kiu Lo and Dekai Wu. 2011. MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility via semantic frames. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1.

Chi-kiu Lo, Anand Karthik Tumuluru, and Dekai Wu. 2012. Fully automatic semantic MT evaluation. In Proceedings of the Seventh WMT.


Chi-kiu Lo, Meriem Beloucif, Markus Saers, and Dekai Wu. 2014. XMEANT: Better semantic MT evaluation without reference translations. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), June.

Lluís Màrquez, Xavier Carreras, Kenneth C. Litkowski, and Suzanne Stevenson. 2008. Semantic role labeling: An introduction to the special issue. Computational Linguistics, 34(2):145–159, June.

Alessandro Moschitti, Daniele Pighin, and Roberto Basili. 2006. Tree kernel engineering for proposition re-ranking. In Proceedings of Mining and Learning with Graphs (MLG).

Alessandro Moschitti. 2006. Making tree kernels practical for natural language learning. In Proceedings of EACL.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the ACL, pages 311–318.

Daniele Pighin and Lluís Màrquez. 2011. Automatic projection of semantic structures: An application to pairwise translation ranking. In Proceedings of the Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 1–9.

Chris Quirk. 2004. Training a sentence-level machine translation confidence measure. In Proceedings of LREC.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of AMTA.

Lucia Specia and Jesús Giménez. 2010. Combining confidence estimation and reference-based metrics for segment level MT evaluation. In Proceedings of AMTA.

Mihai Surdeanu, Richard Johansson, Adam Meyers, Lluís Màrquez, and Joakim Nivre. 2008. The CoNLL-2008 shared task on joint parsing of syntactic and semantic dependencies. In Proceedings of the Twelfth Conference on Computational Natural Language Learning.

Lonneke van der Plas, Tanja Samardžić, and Paola Merlo. 2010. Cross-lingual validity of PropBank in the manual annotation of French. In Proceedings of the Fourth Linguistic Annotation Workshop, LAW IV '10.

Lonneke van der Plas, Paola Merlo, and James Henderson. 2011. Scaling up automatic cross-lingual semantic role annotation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.


Overcoming the Curse of Sentence Length for Neural Machine Translation using Automatic Segmentation

Jean Pouget-Abadie* (École Polytechnique, France)
Dzmitry Bahdanau* (Jacobs University Bremen, Germany)
Bart van Merriënboer, Kyunghyun Cho (Université de Montréal, Canada)
Yoshua Bengio (Université de Montréal, Canada; CIFAR Senior Fellow)

Abstract

The authors of (Cho et al., 2014a) have shown that the recently introduced neural network translation systems suffer from a significant drop in translation quality when translating long sentences, unlike existing phrase-based translation systems. In this paper, we propose a way to address this issue by automatically segmenting an input sentence into phrases that can be easily translated by the neural network translation model. Once each segment has been independently translated by the neural machine translation model, the translated clauses are concatenated to form a final translation. Empirical results show a significant improvement in translation quality for long sentences.

1 Introduction

Up to now, most research efforts in statistical machine translation (SMT) have relied on the use of a phrase-based system as suggested in (Koehn et al., 2003). Recently, however, an entirely new, neural network based approach has been proposed by several research groups (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014b), showing promising results, both as a standalone system and as an additional component in an existing phrase-based system. In this neural network based approach, an encoder 'encodes' a variable-length input sentence into a fixed-length vector and a decoder 'decodes' a variable-length target sentence from the fixed-length encoded vector.

It has been observed in (Sutskever et al., 2014), (Kalchbrenner and Blunsom, 2013) and (Cho et al., 2014a) that this neural network approach works well with short sentences (e.g., ≲ 20 words), but has difficulty with long sentences (e.g., ≳ 20 words), and particularly with sentences that are longer than those used for training.

* Research done while these authors were visiting Université de Montréal.

Training on long sentences is difficult because few available training corpora include sufficiently many long sentences, and because the computational overhead of each update iteration in training is linearly correlated with the length of the training sentences. Additionally, by the nature of encoding a variable-length sentence into a fixed-size vector representation, the neural network may fail to encode all the important details.

In this paper, hence, we propose to translate sentences piece-wise. We segment an input sentence into a number of short clauses that can be confidently translated by the model. We show empirically that this approach improves the translation quality of long sentences, compared to using a neural network to translate a whole sentence without segmentation.

2 Background: RNN Encoder–Decoder for Translation

The RNN Encoder–Decoder (RNNenc) model is a recent implementation of the encoder–decoder approach, proposed independently in (Cho et al., 2014b) and in (Sutskever et al., 2014). It consists of two RNNs, acting respectively as encoder and decoder.

The encoder of the RNNenc reads each word in a source sentence one by one while maintaining a hidden state. The hidden state computed at the end of the source sentence then summarizes the whole input sentence. Formally, given an input sentence x = (x_1, ..., x_{T_x}), the encoder computes

h_t = f(x_t, h_{t-1}),

where f is a nonlinear function computing the next hidden state given the previous one and the current input word.


[Figure 1: An illustration of the RNN Encoder–Decoder: the encoder reads x_1, x_2, ..., x_T into the context vector c, from which the decoder emits y_1, y_2, ..., y_{T'}. Reprinted from (Cho et al., 2014b).]

From the last hidden state of the encoder, we compute a context vector c on which the decoder will be conditioned:

c = g(h_{T_x}),

where g may simply be a linear affine transformation of h_{T_x}.

The decoder, on the other hand, generates the target words one at a time, until the end-of-sentence symbol is generated. It generates a word at each time step given the context vector (from the encoder), the previous hidden state (of the decoder) and the word generated at the last step. More formally, the decoder computes its hidden state at each time step by

s_t = f(y_{t-1}, s_{t-1}, c).

With the newly computed hidden state, the decoder outputs the probability distribution over all possible target words by:

p(f_{t,j} = 1 | f_{t-1}, ..., f_1, c) = \frac{\exp(w_j h^{\langle t \rangle})}{\sum_{j'=1}^{K} \exp(w_{j'} h^{\langle t \rangle})},    (1)

where f_{t,j} is the indicator variable for the j-th word in the target vocabulary at time t, and only a single indicator variable is on (= 1) at each time.

See Fig. 1 for a graphical illustration of the RNNenc.

The RNNenc in (Cho et al., 2014b) uses a special hidden unit that adaptively forgets or remembers the previous hidden state, such that the activation of a hidden unit h_j^{\langle t \rangle} at time t is computed by

h_j^{\langle t \rangle} = z_j h_j^{\langle t-1 \rangle} + (1 - z_j) \tilde{h}_j^{\langle t \rangle},

where

\tilde{h}_j^{\langle t \rangle} = f([W x]_j + [U (r \odot h^{\langle t-1 \rangle})]_j),
z_j = \sigma([W_z x]_j + [U_z h^{\langle t-1 \rangle}]_j),
r_j = \sigma([W_r x]_j + [U_r h^{\langle t-1 \rangle}]_j).

z_j and r_j are respectively the update and reset gates, and \odot denotes element-wise multiplication. In the remainder of this paper, we always assume that this hidden unit is used in the RNNenc.
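For concreteness, here is a minimal NumPy sketch of one step of this gated hidden unit; the dimensions and the toy driver loop are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, W, U, Wz, Uz, Wr, Ur):
    """One step of the gated unit above: the update gate z interpolates
    between the previous state and a candidate state whose recurrent
    input is modulated by the reset gate r."""
    z = sigmoid(Wz @ x + Uz @ h_prev)            # update gate z_j
    r = sigmoid(Wr @ x + Ur @ h_prev)            # reset gate r_j
    h_tilde = np.tanh(W @ x + U @ (r * h_prev))  # candidate activation
    return z * h_prev + (1.0 - z) * h_tilde

# Encoding a sentence: fold gru_step over word vectors x_1..x_T; the
# final hidden state summarizes the sentence and yields the context c.
rng = np.random.default_rng(0)
d, e = 4, 3                                      # hidden/embedding sizes
params = [rng.normal(size=(d, e)) if i % 2 == 0 else rng.normal(size=(d, d))
          for i in range(6)]                     # W, U, Wz, Uz, Wr, Ur
h = np.zeros(d)
for x in rng.normal(size=(5, e)):                # five dummy embeddings
    h = gru_step(x, h, *params)
```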

Although the model in (Cho et al., 2014b) was originally trained on phrase pairs, it is straightforward to train the same model on a bilingual, parallel corpus consisting of sentence pairs, as has been done in (Sutskever et al., 2014). In the remainder of this paper, we use the RNNenc trained on English–French sentence pairs (Cho et al., 2014a).

3 Automatic Segmentation and Translation

One hypothesis explaining the difficulty encountered by the RNNenc model when translating long sentences is that a plain, fixed-length vector lacks the capacity to encode a long sentence. When encoding a long input sentence, the encoder may lose track of all the subtleties in the sentence. Consequently, the decoder has difficulty recovering the correct translation from the encoded representation. One solution would be to build a larger model with a larger representation vector to increase the capacity of the model, at the price of a higher computational cost.

In this section, however, we propose to segment an input sentence such that each segmented clause can be easily translated by the RNN Encoder–Decoder. In other words, we wish to find a segmentation that maximizes the total confidence score, which is the sum of the confidence scores of the phrases in the segmentation. Once the confidence score is defined, the problem of finding the best segmentation can be formulated as an integer programming problem.

Let e = (e_1, ..., e_n) be a source sentence composed of words e_k. We denote a phrase, which is a subsequence of e, by e_{ij} = (e_i, ..., e_j).


We use the RNN Encoder–Decoder to measure how confidently we can translate a subsequence e_{ij} by considering the log-probability log p(f_k | e_{ij}) of a candidate translation f_k generated by the model. In addition to this log-probability, we also use the log-probability log q(e_{ij} | f_k) from a reverse RNN Encoder–Decoder (translating from the target language to the source language). With these two probabilities, we define the confidence score of a phrase pair (e_{ij}, f_k) as:

c(e_{ij}, f_k) = \frac{\log p(f_k | e_{ij}) + \log q(e_{ij} | f_k)}{2 |\log(j - i + 1)|},    (2)

where the denominator penalizes a short segment whose probability is known to be overestimated by an RNN (Graves, 2013).

The confidence score of a source phrase alone is then defined as

c_{ij} = \max_k c(e_{ij}, f_k).    (3)

We use an approximate beam search to search for the candidate translations f_k of e_{ij} that maximize the log-likelihood log p(f_k | e_{ij}) (Graves et al., 2013; Boulanger-Lewandowski et al., 2013).
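A minimal sketch of such a beam search over a generic decoder interface; the step callback (returning possible next words with their log-probabilities for a partial hypothesis) is our assumption, not the authors' implementation.

```python
import heapq

def beam_search(step, beam_width=10, max_len=50, eos="</s>"):
    """Keep the beam_width most likely partial hypotheses at each step;
    a hypothesis is finished when it emits the end-of-sentence symbol."""
    beams = [(0.0, [])]               # (log-likelihood, word list)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, words in beams:
            for word, logp in step(words):
                hyp = (score + logp, words + [word])
                (finished if word == eos else candidates).append(hyp)
        if not candidates:
            break
        beams = heapq.nlargest(beam_width, candidates, key=lambda h: h[0])
    return max(finished + beams, key=lambda h: h[0])
```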

Let x_{ij} be an indicator variable equal to 1 if we include a phrase e_{ij} in the segmentation, and 0 otherwise. We can rewrite the segmentation problem as the optimization of the following objective function:

\max_x \sum_{i \le j} c_{ij} x_{ij} = x \cdot c    (4)

subject to \forall k, n_k = 1,

where n_k = \sum_{i,j} x_{ij} 1_{i \le k \le j} is the number of source phrases chosen in the segmentation that contain the word e_k.

The constraint in Eq. (4) states that for each word e_k in the sentence, one and only one of the source phrases containing this word, (e_{ij})_{i \le k \le j}, is included in the segmentation. The constraint matrix is totally unimodular, making this integer programming problem solvable in polynomial time.

Let S_j^k be the first index of the k-th segment counting from the last phrase of the optimal segmentation of the subsequence e_{1j} (S_j := S_j^1), and let s_j be the corresponding score of this segmentation (s_0 := 0). Then the following relations hold:

s_j = \max_{1 \le i \le j} (c_{ij} + s_{i-1}),  \forall j \ge 1    (5)

S_j = \arg\max_{1 \le i \le j} (c_{ij} + s_{i-1}),  \forall j \ge 1    (6)

With Eq. (5) we can evaluate s_j incrementally. With the evaluated s_j's, we can compute S_j as well (Eq. (6)). By the definition of S_j^k, we find the optimal segmentation by decomposing e_{1n} into e_{S_n^k, S_n^{k-1} - 1}, ..., e_{S_n^2, S_n^1 - 1}, e_{S_n^1, n}, where k is the index of the first segment in the sequence S_n^k. The approach described above requires quadratic time with respect to the sentence length.

3.1 Issues and Discussion

The proposed segmentation approach does not avoid the problem of reordering clauses. Unless the source and target languages follow roughly the same order, as in English-to-French translation, a simple concatenation of translated clauses will not necessarily be grammatically correct.

Despite the lack of long-distance reordering1 in the current approach, we nonetheless find significant gains in the translation performance of neural machine translation. A mechanism to reorder the obtained clause translations is, however, an important future research question.

Another issue at the heart of any purely neural machine translation is the limited model vocabulary size for both the source and target languages. As shown in (Cho et al., 2014a), translation quality drops considerably with just a few unknown words present in the input sentence. Interestingly enough, the proposed segmentation approach appears to be more robust to the presence of unknown words (see Sec. 5). One intuition is that the segmentation leads to multiple short clauses with fewer unknown words, which leads to more stable translation of each clause by the neural translation model.

Finally, the proposed approach is computationally expensive as it requires scoring all the sub-phrases of an input sentence. However, the scoring process can easily be sped up by scoring phrases in parallel, since each phrase can be scored independently. Another way to speed up the segmentation, other than parallelization, would be to use an existing parser to segment a sentence into a set of clauses.

1 Note that, inside each clause, the words are reordered automatically when the clause is translated by the RNN Encoder–Decoder.


[Figure 2: The BLEU scores achieved by (a) the RNNenc without segmentation, (b) the RNNenc with the penalized reverse confidence score, and (c) the phrase-based translation system Moses on newstest12-14. Each panel plots BLEU score against sentence length, with separate curves for length measured on the source text, the reference text, or both.]


4 Experiment Settings

4.1 Dataset

We evaluate the proposed approach on the task of English-to-French translation. We use a bilingual, parallel corpus of 348M words selected by the method of (Axelrod et al., 2011) from a combination of Europarl (61M), news commentary (5.5M), UN (421M) and two crawled corpora of 90M and 780M words respectively.2

The performance of our models was tested on news-test2012, news-test2013, and news-test2014. When comparing with the phrase-based SMT system Moses (Koehn et al., 2007), the first two were used as a development set for tuning Moses, while news-test2014 was used as our test set.

To train the neural network models, we use only the sentence pairs in the parallel corpus where both the English and French sentences are at most 30 words long. Furthermore, we limit our vocabulary size to the 30,000 most frequent words for both English and French. All other words are considered unknown and mapped to a special token ([UNK]).

In both neural network training and automatic segmentation, we do not incorporate any domain-specific knowledge, except when tokenizing the original text data.

2 The datasets and trained Moses models can be downloaded from http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/ and the website of the ACL 2014 Ninth Workshop on Statistical Machine Translation (WMT 14).

4.2 Models and Approaches

We compare the proposed segmentation-based translation scheme against the same neural network model translating without segmentation. The neural machine translation is done by an RNN Encoder–Decoder (RNNenc) (Cho et al., 2014b) trained to maximize the conditional probability of a French translation given an English sentence. Once the RNNenc is trained, an approximate beam search is used to find possible translations with high likelihood.3

This RNNenc is used for the proposed segmentation-based approach together with another RNNenc trained to translate from French to English. The two RNNencs are used in the proposed segmentation algorithm to compute the confidence score of each phrase (see Eqs. (2)–(3)).

We also compare with the translations of a conventional phrase-based machine translation system, which we expect to be more robust when translating long sentences.

5 Results and Analysis

5.1 Validity of the Automatic Segmentation

We validate the proposed segmentation algorithm described in Sec. 3 by comparing it against two baseline segmentation approaches. The first one randomly segments an input sentence such that the distribution of the lengths of the random segments has mean and variance identical to those of the segments produced by our algorithm. The second approach follows the proposed algorithm, but uses a uniform random confidence score.

From Table 1 we can clearly see that the proposed segmentation algorithm results in significantly better performance.

3 In all experiments, the beam width is 10.


Model                     Test set
No segmentation           13.15
Random segmentation       16.60
Random confidence score   16.76
Proposed segmentation     20.86

Table 1: BLEU scores computed on news-test2014 for two control experiments. Random segmentation refers to randomly segmenting a sentence so that the mean and variance of the segment lengths correspond to those of our best segmentation method. Random confidence score refers to segmenting a sentence with a randomly generated confidence score for each segment.

One interesting phenomenon is that any random segmentation was better than direct translation without any segmentation. This indirectly agrees with the previous finding in (Cho et al., 2014a) that neural machine translation suffers from long sentences.

5.2 Importance of Using an Inverse Model

[Figure 3: BLEU score loss vs. maximum number of unknown words in the source and target sentence when translating with the RNNenc model with and without segmentation.]

The proposed confidence score averages the scores of a translation model p(f | e) and an inverse translation model p(e | f), and penalizes short phrases. However, it is possible to use alternative definitions of the confidence score. For instance,

one may use only the 'direct' translation model, or varying penalties for phrase lengths.

In this section, we test three different confidence scores:

p(f | e): Using a single translation model

p(f | e) + p(e | f): Using both the direct and reverse translation models without the short phrase penalty

p(f | e) + p(e | f) (p): Using both the direct and reverse translation models together with the short phrase penalty

The results in Table 2 clearly show the importance of using both the translation and inverse translation models. Furthermore, we were able to get the best performance by incorporating the short phrase penalty (the denominator in Eq. (2)). From here on, we thus only use the original formulation of the confidence score, which uses both models and the penalty.

5.3 Quantitative and Qualitative Analysis

Model                      Dev     Test
All sentences
RNNenc                     13.15   13.92
p(f | e)                   12.49   13.57
p(f | e) + p(e | f)        18.82   20.10
p(f | e) + p(e | f) (p)    19.39   20.86
Moses                      30.64   33.30
No UNK
RNNenc                     21.01   23.45
p(f | e)                   20.94   22.62
p(f | e) + p(e | f)        23.05   24.63
p(f | e) + p(e | f) (p)    23.93   26.46
Moses                      32.77   35.63

Table 2: BLEU scores computed on the development and test sets. See the text for the description of each approach. Moses refers to the scores of the conventional phrase-based translation system. The top five rows consider all sentences of each data set, whilst the bottom five rows include only sentences with no unknown words.

As expected, translation with the proposed approach helps significantly with translating long sentences (see Fig. 2). We observe that translation performance does not drop for sentences of lengths greater than those used to train the RNNenc (≤ 30 words).

Similarly, in Fig. 3 we observe that the translation quality of the proposed approach is more robust


Source: Between the early 1970s , when the Boeing 747 jumbo defined modern long-haul travel , and the turn of the century , the weight of the average American 40- to 49-year-old male increased by 10 per cent , according to U.S. Health Department Data .

Segmentation: [[ Between the early 1970s , when the Boeing 747 jumbo defined modern long-haul travel ,] [ and the turn of the century , the weight of the average American 40- to 49-year-old male] [increased by 10 per cent , according to U.S. Health Department Data .]]

Reference Entre le debut des annees 1970 , lorsque le jumbo 747 de Boeing a defini le voyage long-courriermoderne , et le tournant du siecle , le poids de l’ Americain moyen de 40 a 49 ans a augmentede 10 % , selon les donnees du departement americain de la Sante .

Withsegmentation

Entre les annees 70 , lorsque le Boeing Boeing a defini le transport de voyageurs modernes ; etla fin du siecle , le poids de la moyenne americaine moyenne a l’ egard des hommes a augmentede 10 % , conformement aux donnees fournies par le U.S. Department of Health Affairs .

Withoutsegmentation

Entre les annees 1970 , lorsque les avions de service Boeing ont depasse le prix du travail , letaux moyen etait de 40 % .

Source During his arrest Ditta picked up his wallet and tried to remove several credit cards but theywere all seized and a hair sample was taken fom him.

Segmentation [[During his arrest Ditta] [picked up his wallet and tried to remove several credit cards but theywere all seized and] [a hair sample was taken from him.]]

Reference Au cours de son arrestation , Ditta a ramasse son portefeuille et a tente de retirer plusieurs cartesde credit , mais elles ont toutes ete saisies et on lui a preleve un echantillon de cheveux .

Withsegmentation

Pendant son arrestation J’ ai utilise son portefeuille et a essaye de retirer plusieurs cartes decredit mais toutes les pieces ont ete saisies et un echantillon de cheveux a ete enleve.

Withoutsegmentation

Lors de son arrestation il a tente de recuperer plusieurs cartes de credit mais il a ete saisi de tousles coups et des blessures.

Source ”We can now move forwards and focus on the future and on the 90 % of assets that make up areally good bank, and on building a great bank for our clients and the United Kingdom,” newdirector general, Ross McEwan, said to the press .

Segmentation [[”We can now move forwards and focus on the future] [and] [on the 90 % of assets that makeup a really good bank, and on building] [a great bank for our clients and the United Kingdom,”][new director general, Ross McEwan, said to the press.]]

Reference ”Nous pouvons maintenant aller de l’avant , nous preoccuper de l’avenir et des 90 % des actifsqui constituent une banque vraiment bonne et construire une grande banque pour la clientele etpour le Royaume Uni”, a dit le nouveau directeur general Ross McEwan a la presse .

Withsegmentation

”Nous pouvons maintenant passer a l’avenir et se concentrer sur l avenir ou sur les 90 % d actifsqui constituent une bonne banque et sur la construction une grande banque de nos clients et duRoyaume-Uni” Le nouveau directeur general Ross Ross a dit que la presse.

Withoutsegmentation

”Nous pouvons maintenant passer et etudier les 90 % et mettre en place une banque importantepour la nouvelle banque et le directeur general” a souligne le journaliste .

Source There are several beautiful flashes - the creation of images has always been one of Chouinard’sstrong points - like the hair that is ruffled or the black fabric that extends the lines.

Segmentation [[There are several beautiful flashes - the creation of images has always been one of Chouinard’sstrong points -] [like the hair that is ruffled or the black fabric that extends the lines.]]

Reference Il y a quelques beaux flashs - la creation d’images a toujours ete une force chez Chouinard -comme ces ch eveux qui s’ebouriffent ou ces tissus noirs qui allongent les lignes .

Withsegmentation

Il existe plusieurs belles images - la creation d images a toujours ete l un de ses points forts .comme les cheveux comme le vernis ou le tissu noir qui etend les lignes.

Withoutsegmentation

Il existe plusieurs points forts : la creation d images est toujours l un des points forts .

Source Without specifying the illness she was suffering from, the star performer of ‘Respect’ confirmedto the media on 16 October that the side effects of a treatment she was receiving were ‘difficult’to deal with.

Segmentation [[Without specifying the illness she was suffering from, the star performer of ‘Respect’] [con-firmed to the media on 16 October that the side effects of a treatment she was receiving were][‘difficult’ to deal with.]]

Reference Sans preciser la maladie dont elle souffrait , la celebre interprete de Respect avait affirme auxmedias le 16 octobre que les effets secondaires d’un traitement qu’elle recevait etaient ”diffi-ciles”.

Withsegmentation

Sans preciser la maladie qu’elle souffrait la star de l’ ‘œuvre’ de ‘respect’. Il a ete confirmeaux medias le 16 octobre que les effets secondaires d’un traitement ont ete recus. ”difficile” detraiter .

Withoutsegmentation

Sans la precision de la maladie elle a eu l’impression de ”marquer le 16 avril’ les effets d’un tel‘traitement’.

Table 3: Sample translations with the RNNenc model taken from the test set along with the sourcesentences and the reference translations.


Source: He nevertheless praised the Government for responding to his request for urgent assistance which he first raised with the Prime Minister at the beginning of May.
Segmentation: [He nevertheless praised the Government for responding to his request for urgent assistance which he first raised] [with the Prime Minister at the beginning of May.]
Reference: Il a néanmoins félicité le gouvernement pour avoir répondu à la demande d'aide urgente qu'il a présentée au Premier ministre début mai.
With segmentation: Il a néanmoins félicité le Gouvernement de répondre à sa demande d'aide urgente qu'il a soulevée. avec le Premier ministre début mai.
Without segmentation: Il a néanmoins félicité le gouvernement de répondre à sa demande d'aide urgente qu'il a adressée au Premier Ministre début mai.

Table 4: An example where an incorrect segmentation negatively impacts fluency and punctuation.

We suspect that the existence of many unknown words makes it harder for the RNNenc to extract the meaning of the sentence clearly, while this is avoided with the proposed segmentation approach, as it effectively allows the RNNenc to deal with fewer unknown words.

In Table 3, we show the translations of randomly selected long sentences (40 or more words). Segmentation improves overall translation quality, agreeing well with our quantitative result. However, we can also observe a decrease in translation quality when an input sentence is not segmented into well-formed sentential clauses. Additionally, the concatenation of independently translated segments sometimes negatively impacts fluency, punctuation, and capitalization by the RNNenc model. Table 4 shows one such example.

6 Discussion and Conclusion

In this paper we propose an automatic segmentation solution to the 'curse of sentence length' in neural machine translation. By choosing an appropriate confidence score based on bidirectional translation models, we observed significant improvement in translation quality for long sentences.

Our investigation shows that the proposed segmentation-based translation is more robust to the presence of unknown words. However, since each segment is translated in isolation, a segmentation of an input sentence may negatively impact translation quality, especially the fluency of the translated sentence, the placement of punctuation marks and the capitalization of words.

An important future research direction is to investigate how to improve the quality of the translation obtained by concatenating translated segments.

Acknowledgments

The authors would like to acknowledge the support of the following agencies for research funding and computing support: NSERC, Calcul Québec, Compute Canada, the Canada Research Chairs and CIFAR.

References

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 355–362. Association for Computational Linguistics.

Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. 2013. Audio chord recognition with recurrent neural networks. In ISMIR.

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014a. On the properties of neural machine translation: Encoder–decoder approaches. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, October.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014b. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), October. To appear.

A. Graves, A. Mohamed, and G. Hinton. 2013. Speech recognition with deep recurrent neural networks. In ICASSP.

A. Graves. 2013. Generating sequences with recurrent neural networks. arXiv:1308.0850 [cs.NE], August.

Nal Kalchbrenner and Phil Blunsom. 2013. Two recurrent continuous translation models. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1700–1709. Association for Computational Linguistics.


Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 48–54, Stroudsburg, PA, USA. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic.

Ilya Sutskever, Oriol Vinyals, and Quoc Le. 2014. Anonymized. In Anonymized. Under review.


Ternary Segmentation for Improving Search in Top-down Induction of Segmental ITGs

Markus Saers and Dekai Wu
HKUST
Human Language Technology Center
Department of Computer Science and Engineering
Hong Kong University of Science and Technology
{masaers|dekai}@cs.ust.hk

Abstract

We show that there are situations where iteratively segmenting sentence pairs top-down will fail to reach valid segments, and propose a method for alleviating the problem. Due to the enormity of the search space, error analysis has indicated that it is often impossible to get to a desired embedded segment purely through binary segmentation that divides existing segmental rules in half – the strategy typically employed by existing search strategies – as it requires two steps. We propose a new method to hypothesize ternary segmentations in a single step, making the embedded segments immediately discoverable.

1 Introduction

One of the most important improvements to statistical machine translation to date was the move from token-based models to segmental models (also called phrasal models). This move accomplishes two things: it allows a flat surface-based model to memorize some relationships between word realizations, but more importantly, it allows the model to capture multi-word concepts or chunks. These chunks are necessary in order to translate fixed expressions, or other multi-word units that do not have a compositional meaning. If a sequence in one language can be broken down into smaller pieces which are then translated individually and reassembled in another language, the meaning of the sequence is compositional; if not, the only way to translate it accurately is to treat it as a single unit – a chunk. Existing surface-based models (Och et al., 1999) have high recall in capturing the chunks, but tend to over-generate, which leads to big models and low precision. Surface-based models have no concept of hierarchical composition; instead they make the assumption that a sentence consists of a sequence of segments that can be individually translated and reordered to form the translation. This is counter-intuitive, as the who-did-what-to-whoms of a sentence tend to be translated and reordered as units, rather than having their components mixed together. Transduction grammars (Aho and Ullman, 1972; Wu, 1997), also called hierarchical translation models (Chiang, 2007) or synchronous grammars, address this through a mechanism similar to context-free grammars. Inducing a segmental transduction grammar is hard, so the standard practice is to use a similar method as the surface-based models use to learn the chunks, which is problematic, since that method mostly relies on memorizing the relationships that the mechanics of a compositional model is designed to generalize. A compositional translation model would be able to translate lexical chunks, as well as generalize different kinds of compositions; a segmental transduction grammar captures this by having segmental lexical rules and different nonterminal symbols for different categories of compositions. In this paper, we focus on inducing the former: segmental lexical rules in inversion transduction grammars (ITGs).

One natural way would be to start with a token-based grammar and chunk adjacent tokens to form segments. The main problem with chunking is that the data becomes more and more likely as the segments get larger, with the degenerate end point of all sentence pairs being memorized as lexical items. Zhang et al. (2008) combat this tendency by introducing a sparsity prior over the rule probabilities, and variational Bayes to maximize the posterior probability of the data subject to this symmetric Dirichlet prior. To hypothesize possible chunks, they examine the Viterbi biparse of the existing model. Saers et al. (2012) use the entire parse forest to generate the hypotheses. They also bootstrap the ITG from linear and finite-state transduction grammars (LTGs, Saers (2011), and FSTGs), rather than initializing the lexical probabilities from IBM models.


Another way to arrive at a segmental ITG is to start with the degenerate chunking case: each sentence pair as a lexical item, and segment the existing lexical rules into shorter rules. Since the start point is the degenerate case when optimizing for data likelihood, this approach requires a different objective function to optimize against. Saers et al. (2013c) propose to use the description length of the model and of the data given the model, which is subsequently expressed in a Bayesian form with the addition of a prior over the rule probabilities (Saers and Wu, 2013). The way they generate hypotheses is restricted to segmenting an existing lexical item into two parts, which is problematic, because embedded lexical items are potentially overlooked.

There is also the option of implicitly defining all possible grammars, and sampling from that distribution. Blunsom et al. (2009) do exactly that; they induce with collapsed Gibbs sampling, which keeps one derivation for each training sentence that is altered and then resampled. The operations to change the derivations are split, join, delete and insert. The split-operator corresponds to binary segmentation, and the join-operator corresponds to chunking; the delete-operator removes an internal node, resulting in its parent having three children, and the insert-operator allows a parent with three children to be normalized to have only two. The existence of ternary nodes in the derivation means that the learned grammar contains ternary rules. Note that it still takes three operations, two split-operations and one delete-operation, for their model to do what we propose to do in a single ternary segmentation. Also, although we allow for single-step ternary segmentations, our grammar does not contain ternary rules; instead, the result of a ternary segmentation is immediately normalized to the 2-normal form. Although their model can theoretically sample from the entire model space, the split-operation alone is enough to do so; the other operations were added to get the model to do so in practice. Similarly, we propose ternary segmentation to be able to reach areas of the model space that we failed to reach with binary segmentation.

To illustrate the problem with embedded lexical items, we will introduce a small example corpus. Although Swedish and English are relatively similar, with the structure of basic sentences being identical, they already illustrate the common problem of rare embedded correspondences. Imagine a really simple corpus of three sentence pairs with identical structure:

he has a red book / han har en röd bok
she has a biology book / hon har en biologibok
it has begun / det har börjat

The main difference is that Swedish concatenates rather than juxtaposes compounds, such as biologibok instead of biology book. A bilingual person looking at this corpus would produce bilingual parse trees like those in Figure 1. Inducing this relatively simple segmental ITG from the data is, however, quite a challenge.

The example above illustrates a problem with the chunking approach, as one of the most common chunks is has a/har en, whereas the linguistically motivated chunk biology book/biologibok occurs only once. There is very little in this data that would lead the chunking approach towards the desired ITG. It also illustrates a problem with the binary segmentation approach, as all the bilingual prefixes and suffixes, the biaffixes, are unique; there is no way of discovering that all the above sentences have the exact same verb.

In this paper, we propose a method to allow bilingual infixes to be hypothesized and used to drive the minimization of description length, which would be able to induce the desired ITG from the above corpus.

The paper is structured so that we start by giving a definition of the grammar formalism we use: ITGs (Section 2). We then describe the notion of description length that we use (Section 3), and how ternary segmentation differs from and complements binary segmentation (Section 4). We then present our induction algorithm (Section 5) and give an example of a run-through (Section 6). Finally, we offer some concluding remarks (Section 7).

2 Inversion transduction grammars

Inversion transduction grammars, or ITGs (Wu, 1997), are an expressive yet efficient way to model translation. Much like context-free grammars (CFGs), they allow for sentences to be explained through composition of smaller units into larger units, but where CFGs are restricted to generating monolingual sentences, ITGs generate sets of sentence pairs – transductions – rather than languages. Naturally, the components of different languages may have to be ordered differently, which means that transduction grammars need to handle these differences in order.


Figure 1: Possible inversion transduction trees over the example sentence pairs.

Rather than allowing arbitrary reordering and paying the price of exponential time complexity, ITGs allow only monotonically straight or inverted order of the productions, which cuts the time complexity down to a manageable polynomial.

Formally, an ITG is a tuple ⟨N, Σ, ∆, R, S⟩, where N is a finite nonempty set of nonterminal symbols, Σ is a finite set of terminal symbols in L0, ∆ is a finite set of terminal symbols in L1, R is a finite nonempty set of inversion transduction rules, and S ∈ N is a designated start symbol. An inversion transduction rule is restricted to take one of the following forms:

S → [A],   A → [ψ+],   A → ⟨ψ+⟩

where S ∈ N is the start symbol, A ∈ N is a nonterminal symbol, and ψ+ is a nonempty sequence of nonterminals and biterminals. A biterminal is a pair of symbol strings, Σ∗ × ∆∗, where at least one of the strings has to be nonempty. The square and angled brackets signal straight and inverted order respectively. With straight order, both the L0 and the L1 productions are generated left-to-right, but with inverted order, the L1 production is generated right-to-left. The brackets are frequently left out when there is only one element on the right-hand side, which means that S → [A] is shortened to S → A.

Like CFGs, ITGs also have a 2-normal form, analogous to the Chomsky normal form for CFGs, where the rules are further restricted to only the following four forms:

S → A,   A → [BC],   A → ⟨BC⟩,   A → e/f

where S ∈ N is the start symbol, A, B, C ∈ N are nonterminal symbols, and e/f is a biterminal string.

A bracketing ITG, or BITG, has only one nonterminal symbol (other than the dedicated start symbol), which means that the nonterminals carry no information at all, other than the fact that their yields are discrete units. Rather than making a proper analysis of the sentence pair, they only produce a bracketing, hence the name.

A transduction grammar such as an ITG can be used in three modes: generation, transduction and biparsing. Generation derives a bisentence, a sentence pair, from the start symbol. Transduction derives a sentence in one language from a sentence in the other language and the start symbol. Biparsing verifies that a given bisentence can be derived from the start symbol. Biparsing is an integral part of any learning that requires expected counts, such as expectation maximization, and transduction is the actual translation process.

3 Description length

We follow the definition of description length from Saers et al. (2013a,b,c,d) and Saers and Wu (2013): the size of the model is determined by counting the number of symbols needed to encode the rules, and the size of the data given the model is determined by biparsing the data with the model. Formally, given a grammar Φ, its description length DL(Φ) is the sum of the lengths of the symbols needed to serialize the rule set. For convenience later on, the symbols are assumed to be uniformly distributed, with a length of −lg(1/N) bits each (where N is the number of different symbols). The description length of the data D given the model is defined as DL(D|Φ) = −lg P(D|Φ).


Figure 2: The four different kinds of binary segmentation hypotheses.

Figure 3: The two different hypotheses that can be made from an infix-to-infix link.

4 Segmenting lexical items

With a background in computer science, it is tempting to draw the conclusion that any segmentation can be made as a sequence of binary segmentations. This is true, but only relevant if the entire search space can be exhaustively explored. When inducing transduction grammars, the search space is prohibitively large; in fact, we are typically afforded only an estimate of a single step forward in the search process. In such circumstances, the kinds of steps you can take start to matter greatly, and adding ternary segmentation to the typically used binary segmentation adds expressive power.

Figure 2 contains a schematic illustration of binary segmentation: to the left is a lexical item where a good biaffix (an L0 prefix or suffix associated with an L1 prefix or suffix) has been found, as illustrated with the solid connectors. To the right is the segmentation that can be inferred. For binary segmentation, there is no uncertainty in this step.

When adding ternary segmentation, there are five more situations: one situation where an infix is linked to an infix, and four situations where an infix is linked to an affix. Figure 3 shows the infix-to-infix situation, where there is one additional piece of information to be decided: are the surroundings linked straight or inverted? Figure 4 shows the situations where one infix is linked to an affix. In these situations, there are two more pieces of information that need to be inferred: (a) where the sibling of the affix needs to be segmented, and (b) how the two pieces of the sibling of the affix link to the siblings of the infix. The infix-to-affix situations require a second monolingual segmentation decision to be made. As this is beyond the scope of this paper, we will limit ourselves to the infix-to-infix situation.

Figure 4: The eight different hypotheses that can be made from the four different infix-to-affix links.

5 Finding segmentation hypotheses

Previous work on binary hypothesis generation makes assumptions that do not hold with ternary segmentation; this section explains why that is, and how we get around it. The basic problem with binary segmentation is that any bisegment hypothesized to be good on its own has to be anchored to either the beginning or the end of an existing bisegment. An infix, by definition, does not.

While recording all affixes is possible, even for non-toy corpora (Saers and Wu, 2013; Saers et al., 2013b,c), recording all bilingual infixes is not, so collecting them all is not an option (while there are only O(n²) possible biaffixes for a parallel sentence of average length n, there are O(n⁴) possible bilingual infixes).

89

Page 99: › old_anthology › W › W14 › W14-40.pdfIntroduction The Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8) was held on 25 October 2014 preceding

Algorithm 1: Pseudocode for segmenting an ITG.

    Φ      ▷ The ITG being induced.
    Ψ      ▷ The token-based ITG used to evaluate lexical rules.
    hmax   ▷ The maximum number of hypotheses to keep from a single lexical rule.

    repeat
        δ ← 0
        H′ ▷ Initial hypotheses
        for all lexical rules A → e/f do
            p ← parse(Ψ, e/f)
            c ▷ Fractional counts of bispans
            for all bispans ⟨s, t, u, v⟩ ∈ e/f do
                c(s, t, u, v) ← 0
            H″ ← []
            for all items B_{s,t,u,v} ∈ p do
                c(s, t, u, v) ← c(s, t, u, v) + α(B_{s,t,u,v}) β(B_{s,t,u,v}) / α(S_{0,T,0,V})
                H″ ← [H″, ⟨s, t, u, v, c(s, t, u, v)⟩]
            sort H″ on c(s, t, u, v)
            for all ⟨s, t, u, v, c(s, t, u, v)⟩ ∈ H″[0..hmax] do
                H′(e_{s..t}/f_{u..v}) ← [H′(e_{s..t}/f_{u..v}), ⟨s, t, u, v, A → e/f⟩]
        H ▷ Evaluated hypotheses
        for all bisegments e_{s..t}/f_{u..v} ∈ keys(H′) do
            Φ′ ← Φ
            R ← []
            for all bispan-rule pairs ⟨s, t, u, v, A → e/f⟩ ∈ H′(e_{s..t}/f_{u..v}) do
                Φ′ ← make_grammar_change(Φ′, e/f, s, t, u, v)
                R ← [R, A → e/f]
            δ′ ← DL(Φ′) − DL(Φ) + DL(D|Φ′) − DL(D|Φ)
            if δ′ < 0 then
                H ← [H, ⟨e_{s..t}/f_{u..v}, R, δ′⟩]
        sort H on δ′
        for all ⟨e_{s..t}/f_{u..v}, R, δ′⟩ ∈ H do
            Φ′ ← Φ
            for all rules A → e/f ∈ R ∩ R_{Φ′} do
                Φ′ ← make_grammar_change(Φ′, e/f, s, t, u, v)
            δ′ ← DL(Φ′) − DL(Φ) + DL(D|Φ′) − DL(D|Φ)
            if δ′ < 0 then
                Φ ← Φ′
                δ ← δ + δ′
    until δ ≤ 0
    return Φ

A way to prioritize, within the scope of a single bisegment, which infixes and affixes to consider as hypotheses is crucial. In this paper we use an approach similar to Saers et al. (2013d), in which we use a token-based ITG to evaluate the lexical rules in the ITG that is being induced. Using a transduction grammar has the advantage of calculating fractional counts for hypotheses, which allows both long and short hypotheses to compete on a level playing field.

In Algorithm 1, we start by parsing all the lexical rules in the grammar Φ being learned using a token-based ITG Ψ. For each rule, we only keep the best hmax bispans. In the second part, all collected bispans are evaluated as if they were the only hypothesis being considered for changing Φ. Any hypothesis with a positive effect is kept for further processing.


These hypotheses are sorted and applied. Since the grammar may have changed since the effect of the hypothesis was estimated, we have to check that the hypothesis would have a positive effect on the updated grammar before committing to it. All this is repeated as long as there are improvements that can be made.

The make_grammar_change method deletes the old rule, and distributes its probability mass to the rules replacing it. For ternary segmentation, this will be three lexical rules and two structural rules (which happen to be identical in a bracketing grammar, giving that one rule two shares of the probability mass being distributed). For binary segmentation, it is two lexical rules and one structural rule.

Rather than calculating DL(D|Φ′) − DL(D|Φ) explicitly by biparsing the entire corpus, we estimate the change. For binary rules, we use the same estimate as Saers and Wu (2013): multiplying in the new rule probabilities and dividing out the old. For ternary rules, we make the assumption that the three new lexical rules are combined using structural rules the way they would be during parsing, which means two binary structural rules being applied. The infix-to-infix situation must be generated either by two straight combinations or by two inverted combinations, so for a bracketing grammar it is always two applications of a single structural rule. We thus multiply in the three new lexical rules and the structural rule twice, and divide out the old rule. In essence, both these methods recreate the situations in which the parser would have used the old rule, but now uses the new rules.
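The bookkeeping behind this estimate can be written out as follows; this is our own sketch, assuming we know the old rule's probability, the new rules' probabilities, and how many times the old rule is used in the corpus.

    import math

    def data_dl_delta_ternary(p_old, p_lex, p_struct, uses):
        """Estimated DL(D|Phi') - DL(D|Phi) in bits for a ternary segmentation
        in a bracketing ITG: at each of the old rule's `uses`, divide out its
        probability and multiply in the three new lexical rules (p_lex lists
        their probabilities) plus two applications of the structural rule."""
        new_logp = sum(math.log2(p) for p in p_lex) + 2 * math.log2(p_struct)
        return -uses * (new_logp - math.log2(p_old))

    def data_dl_delta_binary(p_old, p_lex, p_struct, uses):
        # same estimate for binary segmentation: two new lexical rules and
        # one structural rule application, as in Saers and Wu (2013)
        new_logp = sum(math.log2(p) for p in p_lex) + math.log2(p_struct)
        return -uses * (new_logp - math.log2(p_old))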

run expectation maximization to stabilize the pa-rameters. This step is not shown in the pseudocode.Examining the pseudocode closer reveals that

the outer loop will continue as long as the grammarchanges; since the only way the grammar changesis by making lexical rules shorter, this loop is guar-anteed to terminate. Inside the outer loop there arethree inner loops: one over the rule set, one overthe set of initial hypothesesH ′ and one over the setof evaluated hypothesesH . The sets of hypothesesare related such that |H| ≤ |H ′|, which means thatthe size of the initial set of hypotheses will dom-inate the time complexity. The size of this initialset of hypotheses is itself limited so that it cannotcontain more than hmax hypotheses from any one

rule. The dominating factor is thus the size of therule set, which we will further analyze.

The first thing we do is to parse the right-hand side of the rule, which requires O(n³) time with the Saers et al. (2009) algorithm, where n is the average length of the lexical items. We then initialize the counts, which does not actually require a specific step in implementation. We then iterate over all bispans in the parse, which has the same upper bound as the parsing process, since the approximate parsing algorithm avoids exploring the entire search space. We then sort the set of hypotheses derived from the current rule only, which is asymptotically bounded by O(n³ lg n), since there is exactly one hypothesis per parse item. Finally, there is a selection being made from the set of hypotheses derived from the current rule. In practice, the parsing is more complicated than the sorting, making the time complexity of the whole inner loop dominated by the time it takes to parse the rules.

6 Example

In this section we will trace through how the example from the introduction fails to go through binary segmentation, but succeeds when infix-to-infix segmentations are an option.

The initial grammar consists of all the sentence pairs as segmental lexical rules:

    S → A                                               1
    A → he has a red book/han har en röd bok            0.3
    A → she has a biology book/hon har en biologibok    0.3
    A → it has begun/det har börjat                     0.3

As noted before, there are no shared biaffixes among the three lexical rules, so binary segmentation cannot break this grammar down further. There are, however, three shared bisegments representing three different segmentation hypotheses: has a/har en, has/har and a/en. In this example it does not matter which hypothesis you choose, so we will go with the first one, since that is the one our implementation chose. Breaking out all occurrences of has a/har en gives the following grammar:


    S → A                              1
    A → [AA]                           0.36
    A → it has begun/det har börjat    0.09
    A → has a/har en                   0.18
    A → he/han                         0.09
    A → red book/röd bok               0.09
    A → she/hon                        0.09
    A → biology book/biologibok        0.09

At this point there are two bisegments that occur in more than one rule: has/har and a/en. Again, it does not matter for the final outcome which of the hypotheses we choose, so we will choose the first one, again because that is the one our implementation chose. Breaking out all occurrences of has/har gives the following grammar:

    S → A                          1
    A → [AA]                       0.421
    A → he/han                     0.053
    A → red book/röd bok           0.053
    A → she/hon                    0.053
    A → biology book/biologibok    0.053
    A → has/har                    0.158
    A → it/det                     0.053
    A → begun/börjat               0.053
    A → a/en                       0.105

There are no shared bisegments left in the grammar now, so no more segmentations can be done. Obviously, the probability of the data given this new grammar is much smaller, but the grammar itself has generalized far beyond the training data, to the point where it largely agrees with the proposed trees in Figure 1 (except that this grammar binarizes the constituents, and treats red book/röd bok as a segment).
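To make the generalization concrete, a toy recognizer (ours; straight combinations only, which is all this corpus needs) shows that the final grammar derives bisentences that never occurred in training:

    LEX = {
        ("he", "han"), ("she", "hon"), ("it", "det"), ("has", "har"),
        ("a", "en"), ("red book", "röd bok"),
        ("biology book", "biologibok"), ("begun", "börjat"),
    }

    def derivable(e, f):
        # True if A can yield e/f using the lexical rules above or A -> [AA]
        if (" ".join(e), " ".join(f)) in LEX:
            return True
        return any(derivable(e[:i], f[:j]) and derivable(e[i:], f[j:])
                   for i in range(1, len(e)) for j in range(1, len(f)))

    # a bisentence unseen in the training corpus is now derivable:
    assert derivable("he has begun".split(), "han har börjat".split())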

7 Conclusions

We have shown that there are situations in which a top-down segmenting approach that relies solely on binary segmentation will fail to generalize, despite there being ample evidence to a human that a generalization is warranted. We have proposed ternary segmentation as a solution to provide hypotheses that are considered good under a minimum description length objective. And we have shown that the proposed method can indeed perform generalizations that are clear to the human eye, but not discoverable through binary segmentation. The algorithm is comparable to previous segmentation approaches in terms of time and space complexity, so scaling up to non-toy training corpora is likely to work when the time comes.

Acknowledgements

This material is based upon work supported in part by the Defense Advanced Research Projects Agency (DARPA) under BOLT contract nos. HR0011-12-C-0014 and HR0011-12-C-0016, and GALE contract nos. HR0011-06-C-0022 and HR0011-06-C-0023; by the European Union under the FP7 grant agreement no. 287658; and by the Hong Kong Research Grants Council (RGC) research grants GRF620811, GRF621008, and GRF612806. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, the EU, or RGC.

References

Alfred V. Aho and Jeffrey D. Ullman. The Theory of Parsing, Translation, and Compiling. Prentice-Hall, Englewood Cliffs, New Jersey, 1972.

Phil Blunsom, Trevor Cohn, Chris Dyer, and Miles Osborne. A Gibbs sampler for phrasal synchronous grammar induction. In Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP 2009), pp. 782–790, Suntec, Singapore, August 2009.

David Chiang. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228, 2007.

Franz Josef Och, Christoph Tillmann, and Hermann Ney. Improved alignment models for statistical machine translation. In 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 20–28, University of Maryland, College Park, Maryland, June 1999.

Markus Saers and Dekai Wu. Bayesian induction of bracketing inversion transduction grammars. In Sixth International Joint Conference on Natural Language Processing (IJCNLP 2013), pp. 1158–1166, Nagoya, Japan, October 2013. Asian Federation of Natural Language Processing.


Markus Saers, Joakim Nivre, and Dekai Wu. Learning stochastic bracketing inversion transduction grammars with a cubic time biparsing algorithm. In 11th International Conference on Parsing Technologies (IWPT'09), pp. 29–32, Paris, France, October 2009.

Markus Saers, Karteek Addanki, and Dekai Wu. From finite-state to inversion transductions: Toward unsupervised bilingual grammar induction. In 24th International Conference on Computational Linguistics (COLING 2012), pp. 2325–2340, Mumbai, India, December 2012.

Markus Saers, Karteek Addanki, and Dekai Wu. Augmenting a bottom-up ITG with top-down rules by minimizing conditional description length. In Recent Advances in Natural Language Processing (RANLP 2013), Hissar, Bulgaria, September 2013.

Markus Saers, Karteek Addanki, and Dekai Wu. Combining top-down and bottom-up search for unsupervised induction of transduction grammars. In Seventh Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-7), pp. 48–57, Atlanta, Georgia, June 2013.

Markus Saers, Karteek Addanki, and Dekai Wu. Iterative rule segmentation under minimum description length for unsupervised transduction grammar induction. In Adrian-Horia Dediu, Carlos Martín-Vide, Ruslan Mitkov, and Bianca Truthe, editors, Statistical Language and Speech Processing, First International Conference, SLSP 2013, Lecture Notes in Artificial Intelligence (LNAI). Springer, Tarragona, Spain, July 2013.

Markus Saers, Karteek Addanki, and Dekai Wu. Unsupervised transduction grammar induction via minimum description length. In Second Workshop on Hybrid Approaches to Translation (HyTra), pp. 67–73, Sofia, Bulgaria, August 2013.

Markus Saers. Translation as Linear Transduction: Models and Algorithms for Efficient Learning in Statistical Machine Translation. PhD thesis, Uppsala University, Department of Linguistics and Philology, 2011.

Dekai Wu. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403, 1997.

Hao Zhang, Chris Quirk, Robert C. Moore, and Daniel Gildea. Bayesian learning of non-compositional phrases with synchronous parsing. In 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-08: HLT), pp. 97–105, Columbus, Ohio, June 2008.


A CYK+ Variant for SCFG Decoding Without a Dot Chart

Rico Sennrich
School of Informatics
University of Edinburgh
10 Crichton Street
Edinburgh EH8 9AB
Scotland, UK
[email protected]

Abstract

While CYK+ and Earley-style variants are popular algorithms for decoding unbinarized SCFGs, in particular for syntax-based statistical machine translation, these algorithms rely on a so-called dot chart, which suffers from high memory consumption. We propose a recursive variant of the CYK+ algorithm that eliminates the dot chart, without incurring an increase in time complexity for SCFG decoding. In an evaluation on a string-to-tree SMT scenario, we empirically demonstrate substantial improvements in memory consumption and translation speed.

1 Introduction

SCFG decoding can be performed with monolingual parsing algorithms, and various SMT systems implement the CYK+ algorithm or a close Earley-style variant (Zhang et al., 2006; Koehn et al., 2007; Venugopal and Zollmann, 2009; Dyer et al., 2010; Vilar et al., 2012). The CYK+ algorithm (Chappelier and Rajman, 1998) generalizes the CYK algorithm to n-ary rules by performing a dynamic binarization of the grammar during parsing through a so-called dot chart. The construction of the dot chart is a major cause of space inefficiency in SCFG decoding with CYK+, and memory consumption makes the algorithm impractical for long sentences without artificial limits on the span of chart cells.

We demonstrate that, by changing the traversal through the main parse chart, we can eliminate the dot chart from the CYK+ algorithm at no computational cost for SCFG decoding. Our algorithm improves space complexity, and an empirical evaluation confirms substantial improvements in memory consumption over the standard CYK+ algorithm, along with remarkable gains in speed.

This paper is structured as follows. As motivation, we discuss some implementation needs and complexity characteristics of SCFG decoding. We then describe our algorithm as a variant of CYK+, and finally perform an empirical evaluation of memory consumption and translation speed of several parsing algorithms.

2 SCFG Decoding

To motivate our algorithm, we want to highlight some important differences between (monolingual) CFG parsing and SCFG decoding.

Grammars in SMT are typically several orders of magnitude larger than for monolingual parsing, partially because of the large amounts of training data employed to learn SCFGs, and partially because SMT systems benefit from using contextually rich rules rather than only minimal rules (Galley et al., 2006). Also, the same right-hand-side rule on the source side can be associated with many translations, and different (source and/or target) left-hand-side symbols. Consequently, a compact representation of the grammar is of paramount importance.

We follow the implementation in the Moses SMT toolkit (Koehn et al., 2007), which encodes an SCFG as a trie in which each node represents a (partial or completed) rule, and a node has outgoing edges for each possible continuation of the rule in the grammar, either a source-side terminal symbol or a pair of non-terminal symbols. If a node represents a completed rule, it is also associated with a collection of left-hand-side symbols and the associated target-side rules and probabilities. A trie data structure allows for an efficient grammar lookup, since all rules with the same prefix are compactly represented by a single node.
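A simplified sketch (ours, not Moses's actual code) of such a trie, where edges are keyed by a source terminal or a non-terminal pair, and completed rules carry their target-side collections:

    class TrieNode:
        """One node = one (partial or completed) source-side rule prefix."""
        def __init__(self):
            # edge label: a source terminal, or a (source NT, target NT) pair
            self.edges = {}        # label -> TrieNode
            self.targets = []      # target-side rules and probabilities;
                                   # nonempty iff this prefix is a complete rule

        def insert(self, labels, target):
            node = self
            for label in labels:
                node = node.edges.setdefault(label, TrieNode())
            node.targets.append(target)   # rules sharing a prefix share nodes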


Rules are matched to the input in a bottom-up fashion, as described in the next section. A single rule or rule prefix can match the input many times, either by matching different spans of the input, or by matching the same span, but with different subspans for its non-terminal symbols. Each production is uniquely identified by a span, a grammar trie node, and back-pointers to its subderivations. The same is true for a partial production (dotted item).

A key difference between monolingual parsing and SCFG decoding, whose implications on time complexity are discussed by Hopkins and Langmead (2010), is that SCFG decoders need to consider language model costs when searching for the best derivation of an input sentence. This critically affects the parser's ability to discard dotted items early. For CFG parsing, we only need to keep one partial production per rule prefix and span, or k for k-best parsing, selecting the one(s) whose subderivations have the lower cost in case of ambiguity. For SCFG decoding, the subderivation with the higher local cost may be the globally better choice after taking language model costs into account. Consequently, SCFG decoders need to consider multiple possible productions for the same rule and span.

Hopkins and Langmead (2010) provide a runtime analysis of SCFG decoding, showing that time complexity depends on the number of choice points in a rule, i.e. rule-initial, consecutive, or rule-final non-terminal symbols.¹ The number of choice points (or scope) gives an upper bound on the number of productions that exist for a rule and span. If we define the scope of a grammar G to be the maximal scope of all rules in the grammar, decoding can be performed in O(n^scope(G)) time. If we retain all partial productions of the same rule prefix, this also raises the space complexity of the dot chart from O(n²) to O(n^scope(G)).²
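Under one natural reading of this definition, the scope of a rule can be computed by counting the free boundaries it contains: a rule-initial non-terminal, a rule-final non-terminal, and every adjacent non-terminal pair each contribute one. The helper below is ours, not from the paper.

    def scope(rhs, is_nt):
        # rhs: list of source-side symbols; is_nt: predicate for non-terminals
        s = 0
        if rhs and is_nt(rhs[0]):
            s += 1                            # rule-initial non-terminal
        if rhs and is_nt(rhs[-1]):
            s += 1                            # rule-final non-terminal
        s += sum(1 for a, b in zip(rhs, rhs[1:])
                 if is_nt(a) and is_nt(b))    # consecutive non-terminals
        return s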

Crucially, the inclusion of language model costs both increases the space complexity of the dot chart, and removes one of its benefits, namely the ability to discard partial productions early without risking search errors. Still, there is a second way

¹Assuming that there is a constant upper bound on the frequency of each symbol in the input sentence, and on the length of rules.
²In a left-to-right construction of productions, a rule prefix of a scope-x rule may actually have scope x + 1, namely if the rule prefix ends in a non-terminal, but the rule does not.


Figure 1: Traditional CYK/CYK+ chart traversal order (left) and proposed order (right).

in which a dot chart saves computational cost in the CYK+ algorithm. The exact chart traversal order is underspecified in CYK parsing, the only requirement being that all subspans of a given span need to be visited before the span itself. CYK+ or Earley-style parsers typically traverse the chart bottom-up, left-to-right, as in Figure 1 (left). The same partial productions are visited at different times during chart parsing, and storing them in a dot chart saves us the cost of recomputing them. For example, step 10 in Figure 1 (left) re-uses partial productions that were found in steps 1, 5 and 8.

We propose to specify the chart traversal order to be right-to-left, depth-first, as illustrated on the right-hand side in Figure 1. This traversal order groups all cells with the same start position together, and offers a useful guarantee: for each span, all spans that start at a later position have been visited before. Thus, whenever we generate a partial production, we can immediately explore all of its continuations, and then discard the partial production. This eliminates the need for a dot chart, without incurring any computational cost. We could also say that the dot chart exists in a minimal form with at most one item at a time, and a space complexity of O(1). We proceed with a description of the proposed algorithm, contrasted with the closely related CYK+ algorithm.
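The two traversal orders can be written as loop skeletons; this is our illustration of Figure 1, with 1-based inclusive spans (i, j):

    def bottom_up_left_to_right(n):
        # classical CYK/CYK+ order: all subspans before the spans containing them
        for width in range(1, n + 1):
            for i in range(1, n - width + 2):
                yield (i, i + width - 1)

    def right_to_left_depth_first(n):
        # proposed order: every cell starting at a later position comes first,
        # so a partial production can be extended immediately, then discarded
        for i in range(n, 0, -1):
            for j in range(i, n + 1):
                yield (i, j)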

3 Algorithm

3.1 The CYK+ algorithm

We here summarize the CYK+ algorithm, originally described by Chappelier and Rajman (1998).³

³Chappelier and Rajman (1998) add the restriction that rules may not be partially lexicalized; our description of CYK+, and our own algorithm, do not place this restriction. However, our description excludes non-lexical unary rules and epsilon rules.


The main data structure during decoding is a chart with one cell for each span of words in an input string w1...wn of length n. Each cell Ti,j corresponding to the span from wi to wj contains two lists of items:⁴

• a list of type-1 items, which are non-terminals (representing productions).

• a list of type-2 items (dotted items), which are strings of symbols α that parse the substring wi...wj and for which there is a rule in the grammar of the form A → αβ, with β being a non-empty string of symbols. Such an item may be completed into a type-1 item at a future point, and is denoted α•.

For each cell (i, j) of the chart, we perform the following steps:

1. if i = j, search for all rules A → wiγ. If γ is empty, add A to the type-1 list of cell (i, j); otherwise, add wi• to the type-2 list of cell (i, j).

2. if j > i, search for all combinations of a type-2 item α• in a cell (i, k) and a type-1 item B in a cell (k + 1, j) for which a rule of the form A → αBγ exists.⁵ If γ is empty, add the rule to the type-1 list of cell (i, j); otherwise, add αB• to the type-2 list of cell (i, j).

3. for each item B in the type-1 list of the cell (i, j), if there is a rule of the form A → Bγ, and γ is non-empty, add B• to the type-2 list of cell (i, j).

3.2 Our algorithm

The main idea behind our algorithm is that we can avoid the need to store type-2 lists if we process the individual cells in a right-to-left, depth-first order, as illustrated in Figure 1. Rules are still completed left-to-right, but processing the rightmost cells first allows us to immediately extend partial productions into full productions instead of storing them in memory.

We perform the following steps for each cell.

1. if i = j, if there is a rule A → wi, add A to the type-1 list of cell (i, j).

2. if j > i, search for all combinations of a type-2 item α• and a type-1 item B in a cell (j, k), with j ≤ k ≤ n, for which a rule of the form C → αBγ exists. In the initial call, we allow α• = A• for any type-1 item A in cell (i, j − 1).⁶ If γ is empty, add C to the type-1 list of cell (i, k); otherwise, recursively repeat this step, using αB• as α• and k + 1 as j.

⁴For simplicity, we describe a monolingual acceptor.
⁵To allow mixed-terminal rules, we also search for B = wj if j = k + 1.

To illustrate the difference between the two algorithms, let us consider the chart cell (1, 2), i.e. the chart cell spanning the substring it is, in Figure 1, and let us assume the following grammar:

S → NP V NP

NP → ART NN

NP → it

V → is

ART → a

NN → trap

In both algorithms, we can combine the symbols NP from cell (1, 1) and V from cell (2, 2) to partially parse the rule S → NP V NP. However, in CYK+, we cannot yet know if the rule can be completed with a cell (3, x) containing symbol NP, since the cell (3, 4) may be processed after cell (1, 2). Thus, the partial production is stored in a type-2 list for later processing.

In our algorithm, we require all cells (3, x) to be processed before cell (1, 2), so we can immediately perform a recursion with α = NP V and j = 3. In this recursive step, we search for a symbol NP in any cell (3, x), and upon finding it in cell (3, 4), add S as a type-1 item to cell (1, 4).

We provide side-by-side pseudocode of the two algorithms in Figure 2.⁷ The algorithms are aligned to highlight their similarity, the main difference between them being that type-2 items are added to the dot chart in CYK+, and recursively consumed in our variant. An attractive property of the dynamic binarization in CYK+ is that each partial production is constructed exactly once, and can be re-used to find parses for cells that cover a larger span. Our algorithm retains this property. Note that the chart traversal order is different between the algorithms, as illustrated earlier in Figure 1. While the original CYK+ algorithm works with either chart traversal order, our recursive variant requires a right-to-left, depth-first chart traversal.

⁶To allow mixed-terminal rules, we also allow α• = wi• if j = i + 1, and B = wj if k = j.
⁷Some implementation details are left out for simplicity. For instance, note that terminal and non-terminal grammar trie edges can be kept separate to avoid iterating over all terminal edges.


Algorithm 1: CYK+

    Input: array w of length N
    initialize chart[N,N], collections[N,N], dotchart[N]
    root ← root node of grammar trie
    for span in [1..N]:
        for i in [1..(N-span+1)]:
            j ← i + span - 1
            if i = j:                           # step 1
                if (w[i], X) in arc[root]:
                    addToChart(X, i, j)
            else:
                for B in chart[i, j-1]:         # step 3
                    if (B, X) in arc[root]:
                        if arc[X] is not empty:
                            add (X, j-1) to dotchart[i]
                for (a, k) in dotchart[i]:      # step 2
                    if k+1 = j:
                        if (w[j], X) in arc[a]:
                            addToChart(X, i, j)
                    for (B, X) in arc[a]:
                        if B in chart[k+1, j]:
                            addToChart(X, i, j)
            chart[i, j] = cube_prune(collections[i, j])

    def addToChart(trie node X, int i, int j):
        if X has target collection:
            add X to collections[i, j]
        if arc[X] is not empty:
            add (X, j) to dotchart[i]

Algorithm 2: recursive CYK+

    Input: array w of length N
    initialize chart[N,N], collections[N,N]
    root ← root node of grammar trie
    for i in [N..1]:
        for j in [i..N]:
            if i = j:                           # step 1
                if (w[i], X) in arc[root]:
                    addToChart(X, i, j, false)
            else:                               # step 2
                consume(root, i, i, j-1)
            chart[i, j] = cube_prune(collections[i, j])

    def consume(trie node a, int i, int j, int k):
        unary ← (i = j)
        if j = k:
            if (w[j], X) in arc[a]:
                addToChart(X, i, k, unary)
        for (B, X) in arc[a]:
            if B in chart[j, k]:
                addToChart(X, i, k, unary)

    def addToChart(trie node X, int i, int j, bool u):
        if X has target collection and u is false:
            add X to collections[i, j]
        if arc[X] is not empty:
            for k in [(j+1)..N]:
                consume(X, i, j+1, k)

Figure 2: Side-by-side pseudocode of CYK+ (left) and our algorithm (right). Our algorithm uses a new chart traversal order and a recursive consume function instead of a dot chart.



With our implementation of the SCFG as a trie, a type-2 item is identified by a trie node, an array of back-pointers to antecedent cells, and a span. We distinguish between type-1 items before and after cube pruning. Productions, or specifically the target collections and back-pointers associated with them, are first added to a collections object, either synchronously or asynchronously. Cube pruning is always performed synchronously after all productions of a cell have been found. Thus, the choice of algorithm does not change the search space in cube pruning, or the decoder output. After cube pruning, the chart cell is filled with a mapping from a non-terminal symbol to an object that compactly represents a collection of translation hypotheses and associated scores.

3.3 Chart Compression

Given a partial production for span (i, j), the number of chart cells in which the production can be continued is linear in sentence length. The recursive variant explicitly loops through all cells starting at position j + 1, but this search also exists in the original CYK+, in the form of the same type-2 item being re-used over time.

The guarantee that all cells (j + 1, k) are visited before cell (i, j) in the recursive algorithm allows for a further optimization. We construct a compressed matrix representation of the chart, which can be incrementally updated in O(|V| · n²), V being the vocabulary of non-terminal symbols. For each start position and non-terminal symbol, we maintain an array of possible end positions and the corresponding chart entry, as illustrated in Table 1. The array is compressed in that it does not represent empty chart cells. Using the previous example, instead of searching all cells (3, x) for a symbol NP, we only need to retrieve the array corresponding to start position 3 and symbol NP to obtain the array of cells which can continue the partial production.

While not affecting the time complexity of the algorithm, this compression technique reduces computational cost in two ways. If the chart is sparsely populated, i.e. if the size of the arrays is smaller than n − j, the algorithm iterates through fewer elements. Even if the chart is dense, we only perform one chart look-up per non-terminal and partial production, instead of n − j.
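A sketch (ours) of the compressed representation in Table 1: one append-only array per (start position, symbol) pair, which stays sorted by end position because cells sharing a start position are finalized in order of increasing end position.

    from collections import defaultdict

    class CompressedChart:
        def __init__(self):
            # (start, symbol) -> list of (end, chart entry); empty cells absent
            self.columns = defaultdict(list)

        def add(self, start, symbol, end, entry):
            self.columns[(start, symbol)].append((end, entry))

        def continuations(self, start, symbol):
            # one lookup instead of probing every cell (start, x) for `symbol`
            return self.columns[(start, symbol)]

With the running example, continuations(3, "NP") immediately yields [(4, 0x86)] after the entries of Table 1 have been added.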

    cell    S   NP     V   ART    NN
    (3,3)   -   -      -   0x81   -
    (3,4)   -   0x86   -   -      -

    start   symbol   compressed column
    3       ART      [(3, 0x81)]
    3       NP       [(4, 0x86)]
    3       S,V,NN   []

Table 1: Matrix representation of all chart entries starting at position 3 (top), and equivalent compressed representation (bottom). Chart entries are pointers to objects that represent collections of translation hypotheses and their scores.

4 Related Work

Our proposed algorithm is similar to the work by Leermakers (1992), who describes a recursive variant of Earley's algorithm. While they discuss function memoization, which takes the place of charts in their work, as a space-time trade-off, a key insight of our work is that we can order the chart traversal in SCFG decoding so that partial productions need not be tabulated or memoized, without incurring any trade-off in time complexity.

Dunlop et al. (2010) employ a similar matrix compression strategy for CYK parsing, but their method is different to ours in that they employ matrix compression on the grammar, which they assume to be in Chomsky Normal Form, whereas we represent n-ary grammars as tries, and use matrix compression for the chart.

An obvious alternative to n-ary parsing is the use of binary grammars, and early SCFG models for SMT allowed only binary rules, as in the hierarchical models by Chiang (2007),8 or binarizable ones as in inversion transduction grammar (ITG) (Wu, 1997). Whether an n-ary rule can be binarized depends on the rule-internal reorderings between non-terminals; Zhang et al. (2006) describe a synchronous binarization algorithm.

Hopkins and Langmead (2010) show that the complexity of parsing n-ary rules is determined by the number of choice points, i.e. non-terminals that are initial, consecutive, or final, since terminal symbols in the rule constrain which cells are possible application contexts of a non-terminal symbol. They propose pruning the SCFG to rules

8 Specifically, Chiang (2007) allows at most two non-terminals per rule, and no adjacent non-terminals on the source side.


with at most 3 choice points, or scope 3, as an alternative to binarization that allows parsing in cubic time. In a runtime evaluation, SMT with their pruned, unbinarized grammar offers a better speed-quality trade-off than synchronous binarization because, even though both have the same complexity characteristics, synchronous binarization increases both the overall number of rules and the number of non-terminals, which increases the grammar constant. In contrast, Chung et al. (2011) compare binarization and Earley-style parsing with scope-pruned grammars, and find Earley-style parsing to be slower. They attribute the comparative slowness of Earley-style parsing to the cost of building and storing the dot chart during decoding, which is exactly the problem that our paper addresses.
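To make the notion of scope concrete, the following Python helper (our own formalization, following a boundary-point reading of Hopkins and Langmead (2010), not code from either paper) computes the scope of a rule's source side: a boundary point between symbols, or at either rule edge, is a choice point unless it is anchored by an adjacent terminal.

    def scope(source_side, is_nonterminal):
        """Scope of a rule = number of unanchored boundary points.

        source_side: sequence of symbols; is_nonterminal: predicate.
        E.g. scope of "X a X" is 2 (rule-initial and rule-final
        non-terminals), while "a X b" has scope 0.
        """
        m = len(source_side)
        count = 0
        for i in range(m + 1):  # boundary i sits before source_side[i]
            left_terminal = i > 0 and not is_nonterminal(source_side[i - 1])
            right_terminal = i < m and not is_nonterminal(source_side[i])
            if not left_terminal and not right_terminal:
                count += 1
        return count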

Williams and Koehn (2012) describe a parsing algorithm motivated by Hopkins and Langmead (2010) in which they store the grammar in a compact trie with source terminal symbols or a generic gap symbol as edge labels. Each path through this trie corresponds to a rule pattern, and is associated with the set of grammar rules that share the same rule pattern. Their algorithm initially constructs a secondary trie that records all rule patterns that apply to the input sentence, and stores the positions of matching terminal symbols. Then, chart cells are populated by constructing a lattice for each rule pattern identified in the initial step, and traversing all paths through this lattice. Their algorithm is similar to ours in that they also avoid the construction of a dot chart, but they construct two other auxiliary structures instead: a secondary trie and a lattice for each rule pattern. In comparison, our algorithm is simpler, and we perform an empirical comparison of the two in the next section.

5 Empirical Results

We empirically compare our algorithm to the CYK+ algorithm and to the Scope-3 algorithm as described by Williams and Koehn (2012), in a string-to-tree SMT task. All parsing algorithms are equivalent in terms of translation output, and our evaluation focuses on memory consumption and speed.

5.1 Data

For SMT decoding, we use the Moses toolkit (Koehn et al., 2007) with KenLM for language model queries (Heafield, 2011). We use training

algorithm       n = 20   n = 40   n = 80
Scope-3          0.02     0.04     0.34
CYK+             0.32     2.63    51.64
+ recursive      0.02     0.04     0.15
+ compression    0.02     0.04     0.15

Table 2: Peak memory consumption (in GB) of the string-to-tree SMT decoder for sentences of different length n with different parsing algorithms.

data from the ACL 2014 Ninth Workshop on Statistical Machine Translation (WMT) shared translation task, consisting of 4.5 million sentence pairs of parallel data and a total of 120 million sentences of monolingual data. We build a string-to-tree translation system English→German, using target-side syntactic parses obtained with the dependency parser ParZu (Sennrich et al., 2013). A synchronous grammar is extracted with GHKM rule extraction (Galley et al., 2004; Galley et al., 2006), and the grammar is pruned to scope 3.

The synchronous grammar contains 38 million rule pairs with 23 million distinct source-side rules. We report decoding time for a random sample of 1000 sentences from the newstest2013/4 sets (average sentence length: 21.9 tokens), and peak memory consumption for sentences of 20, 40, and 80 tokens. We do not report the time and space required for loading the SMT models, which is stable for all experiments.9 The parsing algorithm only accounts for part of the cost during decoding, and the relative gains from optimizing the parsing algorithm are highest if the rest of the decoder is fast. For best speed, we use cube pruning with language model boundary word grouping (Heafield et al., 2013) in all experiments. We set no limit on the maximal span of SCFG rules, but only keep the best 100 productions per span for cube pruning. The cube pruning limit itself is set to 1000.

5.2 Memory consumption

Peak memory consumption for different sentence lengths is shown in Table 2. For sentences of length 80, we observe more than 50 GB in peak memory consumption for CYK+, which makes it impractical for long sentences, especially for multi-threaded decoding. Our recursive variants keep memory consumption small, as does the

9 The language model consumes 13 GB of memory, and the SCFG 37 GB. We leave the task of compacting the grammar to future research.


Figure 3: Decoding time per sentence (in seconds) as a function of sentence length for the four parsing variants (Scope-3 parser, CYK+, + recursive, + compression). Regression curves use least-squares fitting on a cubic function.

                 length 80         random
algorithm       parse    total   parse   total
Scope-3          74.5     81.1    1.9     2.6
CYK+            358.0    365.4    8.4     9.1
+ recursive      33.7     40.1    1.5     2.2
+ compression    15.0     21.2    1.0     1.7

Table 3: Parse time and total decoding time per sentence (in seconds) of the string-to-tree SMT decoder with different parsing algorithms.

Scope-3 algorithm. This is in line with our theoretical expectation, since both algorithms eliminate the dot chart, which is the costliest data structure in the original CYK+ algorithm.

5.3 Speed

While the main motivation for eliminating the dot chart was to reduce memory consumption, we also find that our parsing variants are markedly faster than the original CYK+ algorithm. Figure 3 shows decoding time for sentences of different length with the four parsing variants. Table 3 shows selected results numerically, and also distinguishes between total decoding time and time spent in the parsing block, the latter ignoring the cost of cube pruning and language model scoring. If we consider parse time for sentences of length 80, we observe a speed-up by a factor of 24 between our fastest variant (with recursion and chart compression) and the original CYK+.

The gains from chart compression over the recursive variant – a factor 2 reduction in parse time for sentences of length 80 – are attributable to a reduction in the number of computational steps. The large speed difference between CYK+ and the recursive variant is somewhat more surprising, given the similarity of the two algorithms. Profiling results show that the recursive variant is not only faster because it saves the computational overhead of creating and destroying the dot chart, but that it also has better locality of reference, with markedly fewer CPU cache misses.

Time differences are smaller for shorter sentences, both because less time is spent parsing and because the time spent outside of parsing is a higher proportion of the total. Still, we observe a factor 5 speed-up in total decoding time on our random translation sample from CYK+ to our fastest variant. We also observe speed-ups over the Scope-3 parser, ranging from a factor 5 speed-up (parsing time on sentences of length 80) to a 50% speed-up (total time on the random translation sample). It is unclear to what extent these speed differences reflect the cost of building the auxiliary data structures in the Scope-3 parser, and how far they are due to implementation details.

5.4 Rule prefix scope

For the CYK+ parser, the growth of both memory consumption and decoding time exceeds our cubic growth expectation. We earlier remarked that the rule prefix of a scope-3 rule may actually be scope-4 if the prefix ends in a non-terminal, but the rule itself does not. Since this could increase the space and time complexity of CYK+ to O(n⁴), we performed additional experiments in which we prune all scope-3 rules with a scope-4 prefix. This affected 1% of all source-side rules in our model, and only had a small effect on translation quality (19.76 BLEU → 19.73 BLEU on newstest2013). With this additional pruning, memory consumption with CYK+ is closer to our theoretical expectation, with a peak memory consumption of 23 GB for sentences of length 80 (≈ 2³ times more than for length 40). We also observe reductions in parse time, as shown in Table 4. While we do see marked reductions in parse time for all CYK+ variants, our recursive variants maintain their efficiency advantage over the original algorithm. Rule prefix scope is irrelevant for the Scope-3 parsing algorithm,10 and its speed is only marginally affected by this pruning procedure.

10 Despite its name, the Scope-3 parsing algorithm allows grammars of any scope, with a time complexity of O(n^scope(G)).


                 length 80         random
algorithm       full     pruned   full   pruned
Scope-3          74.5     70.1     1.9    1.8
CYK+            358.0    245.5     8.4    6.4
+ recursive      33.7     24.5     1.5    1.2
+ compression    15.0     10.5     1.0    0.8

Table 4: Average parse time (in seconds) of the string-to-tree SMT decoder with different parsing algorithms, before and after scope-3 rules with a scope-4 prefix have been pruned from the grammar.


6 Conclusion

While SCFG decoders with dot charts are still widespread, we argue that dot charts are of only limited use for SCFG decoding. The core contributions of this paper are the insight that a right-to-left, depth-first chart traversal order allows for the removal of the dot chart from the popular CYK+ algorithm without incurring any computational cost for SCFG decoding, and the presentation of a recursive CYK+ variant that is based on this insight. Apart from substantial savings in space complexity, we empirically demonstrate gains in decoding speed. The new chart traversal order also allows for a chart compression strategy that yields further speed gains.

Our parsing algorithm does not affect the search space or cause any loss in translation quality, and its speed improvements are orthogonal to improvements in cube pruning (Gesmundo et al., 2012; Heafield et al., 2013). The algorithmic modifications to CYK+ that we propose are simple, but we believe that the efficiency gains of our algorithm are of high practical importance for syntax-based SMT. An implementation of the algorithm has been released as part of the Moses SMT toolkit.

Acknowledgements

I thank Matt Post, Philip Williams, Marcin Junczys-Dowmunt and the anonymous reviewers for their helpful suggestions and feedback. This research was funded by the Swiss National Science Foundation under grant P2ZHP1_148717.

References

Jean-Cédric Chappelier and Martin Rajman. 1998. A Generalized CYK Algorithm for Parsing Stochastic CFG. In TAPD, pages 133–137.

David Chiang. 2007. Hierarchical Phrase-Based Translation. Computational Linguistics, 33(2):201–228.

Tagyoung Chung, Licheng Fang, and Daniel Gildea. 2011. Issues Concerning Decoding with Synchronous Context-free Grammar. In ACL (Short Papers), pages 413–417. The Association for Computer Linguistics.

Aaron Dunlop, Nathan Bodenstab, and Brian Roark. 2010. Reducing the grammar constant: an analysis of CYK parsing efficiency. Technical report CSLU-2010-02, OHSU.

Chris Dyer, Adam Lopez, Juri Ganitkevitch, Jonathan Weese, Ferhan Ture, Phil Blunsom, Hendra Setiawan, Vladimir Eidelman, and Philip Resnik. 2010. cdec: A Decoder, Alignment, and Learning framework for finite-state and context-free translation models. In Proceedings of the Association for Computational Linguistics (ACL).

Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a Translation Rule? In HLT-NAACL '04.

Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. 2006. Scalable inference and training of context-rich syntactic translation models. In ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 961–968, Sydney, Australia. Association for Computational Linguistics.

Andrea Gesmundo, Giorgio Satta, and James Henderson. 2012. Heuristic Cube Pruning in Linear Time. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, ACL '12, pages 296–300, Jeju Island, Korea. Association for Computational Linguistics.

Kenneth Heafield, Philipp Koehn, and Alon Lavie. 2013. Grouping Language Model Boundary Words to Speed K-Best Extraction from Hypergraphs. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 958–968, Atlanta, Georgia, USA.

Kenneth Heafield. 2011. KenLM: Faster and Smaller Language Model Queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, Edinburgh, UK. Association for Computational Linguistics.

Mark Hopkins and Greg Langmead. 2010. SCFG Decoding Without Binarization. In EMNLP, pages 646–655.


Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the ACL-2007 Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

René Leermakers. 1992. A recursive ascent Earley parser. Information Processing Letters, 41(2):87–91, February.

Rico Sennrich, Martin Volk, and Gerold Schneider. 2013. Exploiting Synergies Between Open Resources for German Dependency Parsing, POS-tagging, and Morphological Analysis. In Proceedings of the International Conference Recent Advances in Natural Language Processing 2013, pages 601–609, Hissar, Bulgaria.

Ashish Venugopal and Andreas Zollmann. 2009. Grammar based statistical MT on Hadoop: An end-to-end toolkit for large scale PSCFG based MT. The Prague Bulletin of Mathematical Linguistics, 91:67–78.

David Vilar, Daniel Stein, Matthias Huck, and Hermann Ney. 2012. Jane: an advanced freely available hierarchical machine translation toolkit. Machine Translation, 26(3):197–216.

Philip Williams and Philipp Koehn. 2012. GHKM Rule Extraction and Scope-3 Parsing in Moses. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 388–394, Montréal, Canada, June. Association for Computational Linguistics.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.

Hao Zhang, Liang Huang, Daniel Gildea, and Kevin Knight. 2006. Synchronous Binarization for Machine Translation. In HLT-NAACL. The Association for Computational Linguistics.


Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, October 25, 2014, Doha, Qatar. ©2014 Association for Computational Linguistics

On the Properties of Neural Machine Translation: Encoder–Decoder Approaches

Kyunghyun Cho and Bart van Merrienboer
Université de Montréal

Dzmitry Bahdanau*
Jacobs University Bremen, Germany

Yoshua Bengio
Université de Montréal, CIFAR Senior Fellow

Abstract

Neural machine translation is a relatively new approach to statistical machine translation based purely on neural networks. Neural machine translation models often consist of an encoder and a decoder. The encoder extracts a fixed-length representation from a variable-length input sentence, and the decoder generates a correct translation from this representation. In this paper, we focus on analyzing the properties of neural machine translation using two models: the RNN Encoder–Decoder and a newly proposed gated recursive convolutional neural network. We show that neural machine translation performs relatively well on short sentences without unknown words, but its performance degrades rapidly as the length of the sentence and the number of unknown words increase. Furthermore, we find that the proposed gated recursive convolutional network learns a grammatical structure of a sentence automatically.

1 Introduction

A new approach for statistical machine translation based purely on neural networks has recently been proposed (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014). This new approach, which we refer to as neural machine translation, is inspired by the recent trend of deep representational learning. All the neural network models used in (Sutskever et al., 2014; Cho et al., 2014) consist of an encoder and a decoder. The encoder extracts a fixed-length vector representation from a variable-length input sentence, and from this representation the decoder generates a correct, variable-length target translation.

* Research done while visiting Université de Montréal

The emergence of neural machine translation is highly significant, both practically and theoretically. Neural machine translation models require only a fraction of the memory needed by traditional statistical machine translation (SMT) models. The models we trained for this paper require only 500MB of memory in total. This stands in stark contrast with existing SMT systems, which often require tens of gigabytes of memory. This makes neural machine translation appealing in practice. Furthermore, unlike conventional translation systems, each and every component of the neural translation model is trained jointly to maximize the translation performance.

As this approach is relatively new, there has not been much work on analyzing the properties and behavior of these models. For instance: What are the properties of sentences on which this approach performs better? How does the choice of source/target vocabulary affect the performance? In which cases does neural machine translation fail?

It is crucial to understand the properties and behavior of this new neural machine translation approach in order to determine future research directions. Also, understanding the weaknesses and strengths of neural machine translation might lead to better ways of integrating SMT and neural machine translation systems.

In this paper, we analyze two neural machine translation models. One of them is the RNN Encoder–Decoder that was proposed recently in (Cho et al., 2014). The other model replaces the encoder in the RNN Encoder–Decoder model with a novel neural network, which we call a gated recursive convolutional neural network (grConv). We evaluate these two models on the task of translation from English to French.

Our analysis shows that the performance of the neural machine translation model degrades


quickly as the length of a source sentence increases. Furthermore, we find that the vocabulary size has a high impact on the translation performance. Nonetheless, qualitatively we find that both models are able to generate correct translations most of the time. Furthermore, the newly proposed grConv model is able to learn, without supervision, a kind of syntactic structure over the source language.

2 Neural Networks for Variable-Length Sequences

In this section, we describe two types of neural networks that are able to process variable-length sequences. These are the recurrent neural network and the proposed gated recursive convolutional neural network.

2.1 Recurrent Neural Network with Gated Hidden Neurons

Figure 1: The graphical illustration of (a) the recurrent neural network and (b) the hidden unit that adaptively forgets and remembers (with reset gate r, update gate z, candidate activation h̃, and input x).

A recurrent neural network (RNN, Fig. 1 (a)) works on a variable-length sequence $x = (x_1, x_2, \cdots, x_T)$ by maintaining a hidden state $h$ over time. At each timestep $t$, the hidden state $h^{(t)}$ is updated by

$$ h^{(t)} = f\left(h^{(t-1)}, x_t\right), $$

where $f$ is an activation function. Often $f$ is as simple as performing a linear transformation on the input vectors, summing them, and applying an element-wise logistic sigmoid function.

An RNN can be used effectively to learn a distribution over a variable-length sequence by learning the distribution over the next input $p(x_{t+1} \mid x_t, \cdots, x_1)$. For instance, in the case of a sequence of 1-of-K vectors, the distribution can be learned by an RNN which has as an output

$$ p(x_{t,j} = 1 \mid x_{t-1}, \ldots, x_1) = \frac{\exp\left(w_j h^{\langle t \rangle}\right)}{\sum_{j'=1}^{K} \exp\left(w_{j'} h^{\langle t \rangle}\right)}, $$

for all possible symbols $j = 1, \ldots, K$, where $w_j$ are the rows of a weight matrix $W$. This results in the joint distribution

$$ p(x) = \prod_{t=1}^{T} p(x_t \mid x_{t-1}, \ldots, x_1). $$

Recently, in (Cho et al., 2014) a new activation function for RNNs was proposed. The new activation function augments the usual logistic sigmoid activation function with two gating units called reset, r, and update, z, gates. Each gate depends on the previous hidden state $h^{(t-1)}$ and the current input $x_t$, and controls the flow of information. This is reminiscent of long short-term memory (LSTM) units (Hochreiter and Schmidhuber, 1997). For details about this unit, we refer the reader to (Cho et al., 2014) and Fig. 1 (b). For the remainder of this paper, we always use this new activation function.
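As a concrete illustration, here is a minimal NumPy sketch of one update step of such a gated hidden unit, together with the softmax output distribution defined above. It follows the commonly used formulation of this unit; all weight names (Wz, Uz, Wr, Ur, W, U, W_out) are our own assumptions rather than the paper's notation.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def gated_step(h_prev, x, p):
        """One hidden-state update with reset gate r and update gate z.
        p holds the weight matrices (illustrative names)."""
        z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev)          # update gate
        r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev)          # reset gate
        h_new = np.tanh(p["W"] @ x + p["U"] @ (r * h_prev))  # candidate state
        return (1.0 - z) * h_prev + z * h_new                # interpolate old/new

    def next_symbol_distribution(h, W_out):
        """p(x_{t,j} = 1 | x_{t-1}, ..., x_1): softmax over rows of W_out."""
        logits = W_out @ h
        e = np.exp(logits - logits.max())                    # stable softmax
        return e / e.sum()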

2.2 Gated Recursive Convolutional Neural Network

Besides RNNs, another natural approach to dealing with variable-length sequences is to use a recursive convolutional neural network where the parameters at each level are shared through the whole network (see Fig. 2 (a)). In this section, we introduce a binary convolutional neural network whose weights are recursively applied to the input sequence until it outputs a single fixed-length vector. In addition to a usual convolutional architecture, we propose to use the previously mentioned gating mechanism, which allows the recursive network to learn the structure of the source sentences on the fly.

Let $x = (x_1, x_2, \cdots, x_T)$ be an input sequence, where $x_t \in \mathbb{R}^d$. The proposed gated recursive convolutional neural network (grConv) consists of four weight matrices $W^l$, $W^r$, $G^l$ and $G^r$. At each recursion level $t \in [1, T-1]$, the activation of the $j$-th hidden unit $h_j^{(t)}$ is computed by

$$ h_j^{(t)} = \omega_c \tilde{h}_j^{(t)} + \omega_l h_{j-1}^{(t-1)} + \omega_r h_j^{(t-1)}, \quad (1) $$

where $\omega_c$, $\omega_l$ and $\omega_r$ are the values of a gater that sum to 1. The hidden unit is initialized as

$$ h_j^{(0)} = U x_j, $$

where $U$ projects the input into a hidden space.


Figure 2: The graphical illustration of (a) the recursive convolutional neural network and (b) the proposed gated unit for the recursive convolutional neural network. (c–d) The example structures that may be learned with the proposed gated unit.

The new activation $\tilde{h}_j^{(t)}$ is computed as usual:

$$ \tilde{h}_j^{(t)} = \phi\left(W^l h_{j-1}^{(t-1)} + W^r h_j^{(t-1)}\right), $$

where $\phi$ is an element-wise nonlinearity. The gating coefficients $\omega$ are computed by

$$ \begin{bmatrix} \omega_c \\ \omega_l \\ \omega_r \end{bmatrix} = \frac{1}{Z} \exp\left(G^l h_{j-1}^{(t-1)} + G^r h_j^{(t-1)}\right), $$

where $G^l, G^r \in \mathbb{R}^{3 \times d}$ and

$$ Z = \sum_{k=1}^{3} \left[\exp\left(G^l h_{j-1}^{(t-1)} + G^r h_j^{(t-1)}\right)\right]_k. $$

According to this activation, one can think of the activation of a single node at recursion level t as a choice between either a new activation computed from both left and right children, the activation from the left child, or the activation from the right child. This choice allows the overall structure of the recursive convolution to change adaptively with respect to an input sample. See Fig. 2 (b) for an illustration.
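A minimal NumPy sketch of one recursion level, directly following Eq. (1) and the gating equations above (loop-based for clarity; a real implementation would vectorize and batch this):

    import numpy as np

    def grconv_level(h, Wl, Wr, Gl, Gr, phi=np.tanh):
        """One grConv recursion level: maps the m units of level t-1
        (array h of shape (m, d)) to the m-1 units of level t.
        Gl and Gr have shape (3, d); Wl and Wr have shape (d, d)."""
        m, d = h.shape
        out = np.empty((m - 1, d))
        for j in range(m - 1):
            left, right = h[j], h[j + 1]          # the two children
            h_new = phi(Wl @ left + Wr @ right)   # fresh composition
            g = np.exp(Gl @ left + Gr @ right)    # unnormalized gate values
            wc, wl, wr = g / g.sum()              # softmax: new / left / right
            out[j] = wc * h_new + wl * left + wr * right  # Eq. (1)
        return out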

In this respect, we may even consider the proposed grConv as doing a kind of unsupervised parsing. If we consider the case where the gating unit makes a hard decision, i.e., ω follows a 1-of-K coding, it is easy to see that the network adapts to the input and forms a tree-like structure (see Fig. 2 (c–d)). However, we leave further investigation of the structure learned by this model for future research.

3 Purely Neural Machine Translation

3.1 Encoder–Decoder Approach

The task of translation can be understood from the perspective of machine learning as learning the conditional distribution p(f | e) of a target sentence (translation) f given a source sentence e. Once the conditional distribution is learned by a model, one can use the model to directly sample a target sentence given a source sentence, either by actual sampling or by using an (approximate) search algorithm to find the maximum of the distribution.

Figure 3: The encoder–decoder architecture, illustrated with the source sentence "Economic growth has slowed down in recent years ." encoded into a fixed-length vector [z_1, z_2, ..., z_d] and decoded into "La croissance économique a ralenti ces dernières années .".

A number of recent papers have proposed to use neural networks to directly learn the conditional distribution from a bilingual, parallel corpus (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014). For instance, the authors of (Kalchbrenner and Blunsom, 2013) proposed an approach involving a convolutional n-gram model to extract a vector of a source sentence which is decoded with an inverse convolutional n-gram model augmented with an RNN. In (Sutskever et al., 2014), an RNN with LSTM units was used to encode a source sentence and, starting from the last hidden state, to decode a target sentence. Similarly, the authors of (Cho et al., 2014) proposed to use an RNN to encode and decode a pair of source and target phrases.

At the core of all these recent works lies an encoder–decoder architecture (see Fig. 3). The encoder processes a variable-length input (source sentence) and builds a fixed-length vector representation (denoted as z in Fig. 3). Conditioned on the encoded representation, the decoder generates


a variable-length sequence (target sentence). Before (Sutskever et al., 2014), this encoder–decoder approach was used mainly as a part of an existing statistical machine translation (SMT) system. This approach was used to re-rank the n-best list generated by the SMT system in (Kalchbrenner and Blunsom, 2013), and the authors of (Cho et al., 2014) used this approach to provide an additional score for the existing phrase table.

In this paper, we concentrate on analyzing the direct translation performance, as in (Sutskever et al., 2014), with two model configurations. In both models, we use an RNN with the gated hidden unit (Cho et al., 2014), as this is one of the few options that does not require a non-trivial way to determine the target length. The first model uses the same RNN with the gated hidden unit as an encoder, as in (Cho et al., 2014), and the second one uses the proposed gated recursive convolutional neural network (grConv). We aim to understand the inductive bias of the encoder–decoder approach on the translation performance measured by BLEU.

4 Experiment Settings

4.1 Dataset

We evaluate the encoder–decoder models on the task of English-to-French translation. We use the bilingual, parallel corpus of 348M words selected by the method of (Axelrod et al., 2011) from a combination of Europarl (61M words), news commentary (5.5M), UN (421M) and two crawled corpora of 90M and 780M words respectively.1 We did not use separate monolingual data. The performance of the neural machine translation models was measured on the news-test2012, news-test2013 and news-test2014 sets (3000 lines each). When comparing to the SMT system, we use news-test2012 and news-test2013 as our development set for tuning the SMT system, and news-test2014 as our test set.

Among all the sentence pairs in the prepared parallel corpus, for reasons of computational efficiency we only use the pairs where both English and French sentences are at most 30 words long to train the neural networks. Furthermore, we use only the 30,000 most frequent words for both English and French. All the other rare words are considered unknown and are mapped to a special token ([UNK]).

1 All the data can be downloaded from http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/.

4.2 Models

We train two models: the RNN Encoder–Decoder (RNNenc) (Cho et al., 2014) and the newly proposed gated recursive convolutional neural network (grConv). Note that both models use an RNN with gated hidden units as a decoder (see Sec. 2.1).

We use minibatch stochastic gradient descent with AdaDelta (Zeiler, 2012) to train our two models. We initialize the square weight matrix (transition matrix) as an orthogonal matrix with its spectral radius set to 1 in the case of the RNNenc and 0.4 in the case of the grConv. tanh and a rectifier (max(0, x)) are used as the element-wise nonlinear functions for the RNNenc and grConv respectively.
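A sketch of such an initialization (our own minimal version, not the authors' code): a random orthogonal matrix already has spectral radius 1, so it only needs to be rescaled for the grConv case.

    import numpy as np

    def orthogonal_init(d, spectral_radius=1.0, rng=np.random.default_rng()):
        """Random orthogonal d-by-d matrix, rescaled to the given spectral
        radius (1.0 for the RNNenc, 0.4 for the grConv in this paper)."""
        q, _ = np.linalg.qr(rng.standard_normal((d, d)))
        return spectral_radius * q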

The grConv has 2000 hidden neurons, whereas the RNNenc has 1000 hidden neurons. The word embeddings are 620-dimensional in both cases.2 Both models were trained for approximately 110 hours, which is equivalent to 296,144 updates and 846,322 updates for the grConv and RNNenc, respectively.

4.2.1 Translation using Beam-Search

We use a basic form of beam-search to find a translation that maximizes the conditional probability given by a specific model (in this case, either the RNNenc or the grConv). At each time step of the decoder, we keep the s translation candidates with the highest log-probability, where s = 10 is the beam-width. During the beam-search, we exclude any hypothesis that includes an unknown word. For each end-of-sequence symbol that is selected among the highest scoring candidates the beam-width is reduced by one, until the beam-width reaches zero.
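The following Python sketch captures the procedure just described, under stated assumptions: next_scores is a hypothetical callback that returns (log-probability, symbol) pairs for extending a prefix, and eos/unk are illustrative symbol ids.

    def beam_search(next_scores, s=10, eos=0, unk=1, max_len=100):
        """Beam search with [UNK] exclusion; the beam-width s shrinks by
        one for every completed hypothesis, until it reaches zero."""
        beams = [(0.0, [])]                 # (cumulative log-prob, prefix)
        finished = []
        while s > 0 and beams and len(beams[0][1]) < max_len:
            candidates = []
            for logp, prefix in beams:
                for lp, sym in next_scores(prefix):
                    if sym == unk:          # exclude hypotheses with [UNK]
                        continue
                    candidates.append((logp + lp, prefix + [sym]))
            candidates.sort(key=lambda c: c[0], reverse=True)
            beams = []
            for logp, prefix in candidates[:s]:
                if prefix[-1] == eos:       # end-of-sequence: shrink beam
                    finished.append((logp, prefix))
                    s -= 1
                else:
                    beams.append((logp, prefix))
        return finished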

Beam-search to (approximately) find a sequence of maximum log-probability under an RNN was proposed and used successfully in (Graves, 2012) and (Boulanger-Lewandowski et al., 2013). Recently, the authors of (Sutskever et al., 2014) found this approach to be effective in purely neural machine translation based on LSTM units.

2 In all cases, we train the whole network including the word embedding matrix. The embedding dimensionality was chosen to be quite large, as preliminary experiments with 155-dimensional embeddings showed rather poor performance.


(a) All Lengths

Model            Development   Test
All
  RNNenc            13.15      13.92
  grConv             9.97       9.97
  Moses             30.64      33.30
  Moses+RNNenc*     31.48      34.64
  Moses+LSTM°       32         35.65
No UNK
  RNNenc            21.01      23.45
  grConv            17.19      18.22
  Moses             32.77      35.63

(b) 10–20 Words

Model            Development   Test
All
  RNNenc            19.12      20.99
  grConv            16.60      17.50
  Moses             28.92      32.00
No UNK
  RNNenc            24.73      27.03
  grConv            21.74      22.94
  Moses             32.20      35.40

Table 1: BLEU scores computed on the development and test sets. The top rows ("All") show the scores on all the sentences, and the bottom rows ("No UNK") on the sentences having no unknown words. (*) The result reported in (Cho et al., 2014) where the RNNenc was used to score phrase pairs in the phrase table. (°) The result reported in (Sutskever et al., 2014) where an encoder–decoder with LSTM units was used to re-rank the n-best list generated by Moses.

When we use beam-search to find the k best translations, we do not use the usual log-probability but one normalized with respect to the length of the translation. This prevents the RNN decoder from favoring shorter translations, behavior which was observed earlier in, e.g., (Graves, 2013).
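In our notation (a plausible rendering of the normalization just described, not the authors' exact formula), a candidate translation $y$ for source sentence $x$ is thus ranked by

$$ \mathrm{score}(y \mid x) \;=\; \frac{1}{|y|} \sum_{t=1}^{|y|} \log p(y_t \mid y_{<t}, x), $$

so that hypotheses of different lengths compete on their average per-word log-probability.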

5 Results and Analysis

5.1 Quantitative Analysis

In this paper, we are interested in the properties of the neural machine translation models. Specifically, we look at the translation quality with respect to the length of source and/or target sentences, and with respect to the number of words unknown to the model in each source/target sentence.

First, we look at how the BLEU score, reflecting the translation performance, changes with respect to the length of the sentences (see Fig. 4 (a)–(b)). Clearly, both models perform relatively well on short sentences, but suffer significantly as the length of the sentences increases.

We observe a similar trend with the number of unknown words, in Fig. 4 (c). As expected, the performance degrades rapidly as the number of unknown words increases. This suggests that it will be an important challenge to increase the size of vocabularies used by the neural machine translation system in the future. Although we only present the result with the RNNenc, we observed similar behavior for the grConv as well.

In Table 1 (a), we present the translation performances obtained using the two models along with the baseline phrase-based SMT system.3 Clearly the phrase-based SMT system still shows superior performance over the proposed purely neural machine translation system, but we can see that under certain conditions (no unknown words in both source and reference sentences), the difference diminishes quite significantly. Furthermore, if we consider only short sentences (10–20 words per sentence), the difference decreases further (see Table 1 (b)).

Furthermore, it is possible to use the neural machine translation models together with the existing phrase-based system, which was found recently in (Cho et al., 2014; Sutskever et al., 2014) to improve the overall translation performance (see Table 1 (a)).

This analysis suggests that the current neural translation approach has its weakness in handling long sentences. The most obvious explanatory hypothesis is that the fixed-length vector representation does not have enough capacity to encode a long sentence with complicated structure and meaning. In order to encode a variable-length sequence, a neural network may "sacrifice" some of the important topics in the input sentence in order to remember others.

This is in stark contrast to the conventional phrase-based machine translation system (Koehn et al., 2003). As we can see from Fig. 5, the conventional system trained on the same dataset (with additional monolingual data for the language model) tends to get a higher BLEU score on longer sentences.

3 We used Moses as a baseline, trained with additional monolingual data for a 4-gram language model.


Source: She explained her new position of foreign affairs and security policy representative as a reply to a question: "Who is the European Union? Which phone number should I call?"; i.e. as an important step to unification and better clarity of Union's policy towards countries such as China or India.

Reference: Elle a expliqué le nouveau poste de la Haute représentante pour les affaires étrangères et la politique de défense dans le cadre d'une réponse à la question: "Qui est qui à l'Union européenne?" "À quel numéro de téléphone dois-je appeler?", donc comme un pas important vers l'unicité et une plus grande lisibilité de la politique de l'Union face aux états, comme est la Chine ou bien l'Inde.

RNNEnc: Elle a décrit sa position en matière de politique étrangère et de sécurité ainsi que la politique de l'Union européenne en matière de gouvernance et de démocratie.

grConv: Elle a expliqué sa nouvelle politique étrangère et de sécurité en réponse à un certain nombre de questions: "Qu'est-ce que l'Union européenne ?".

Moses: Elle a expliqué son nouveau poste des affaires étrangères et la politique de sécurité représentant en réponse à une question: "Qui est l'Union européenne? Quel numéro de téléphone dois-je appeler?"; c'est comme une étape importante de l'unification et une meilleure lisibilité de la politique de l'Union à des pays comme la Chine ou l'Inde.

Source: The investigation should be complete by the end of the year when the findings will be presented to Deutsche Bank's board of managing directors - with recommendations for action.

Reference: L'examen doit être terminé d'ici la fin de l'année, ensuite les résultats du conseil d'administration de la Deutsche Bank doivent être présentés - avec recommandation, d'habitude.

RNNEnc: L'étude devrait être terminée à la fin de l'année, lorsque les conclusions seront présentées au conseil d'administration de la Deutsche Bank, conseil d'association avec des mesures.

grConv: L'enquête devrait être terminée à la fin de l'année où les conclusions seront présentées par le conseil d'administration de la BCE à la direction des recommandations.

Moses: L'enquête devrait être terminé d'ici la fin de l'année lorsque les résultats seront présentés à la Deutsche Bank conseil des directeurs généraux - avec des recommandations.

Source: And there are thorny mechanical questions that must be resolved during that time, like how to balance the state's mandate of "adequate access" to licensed marijuana with its prohibitions on cannabis businesses within 1,000 feet of a school, park, playground or child care center.

Reference: Pendant ce temps, des questions pratiques restent en suspens: comment équilibrer le mandat de l'état qui garantit un accès approprié à la marijuana agréée et interdit l'installation de commerces de vente de cannabis dans un rayon de 30 km autour d'une école, d'un parc, d'un terrain de jeu ou d'une crèche.

RNNEnc: Il y a des questions préventives qui se posent quant à l'équilibre des droits de l'enfant dans les limites d'une entreprise de collecte de sang.

grConv: De façon générale, il y a des raisons de sécurité pour que les entreprises aient accès à des milliers de centres de pêche, d'eau ou de recherche.

Moses: Et il y a des problèmes mécaniques complexes qui doivent être résolues au cours de cette période, comme la manière d'équilibrer le mandat de "l'accès adéquat" permis de marijuana avec l'interdiction du cannabis aux entreprises de 1000 pieds d'une école de jeu ou de parc, le service de garde.

(a) Long Sentences

Source: There is still no agreement as to which election rules to follow.
Reference: Toutefois il n'existe toujours pas d'accord selon quel règlement de vote il faut procéder.
RNNEnc: Il n'y a pas encore d'accord sur les règles électorales.
grConv: Il n'y a pas encore d'accord sur la question des élections à suivre.
Moses: Il y a toujours pas d'accord sur l'élection des règles à suivre.

Source: Many of these ideas may have been creative, but they didn't necessarily work.
Reference: Beaucoup de ces idées étaient créatives mais elles n'ont pas forcément fonctionné.
RNNEnc: Bon nombre de ces idées ont peut-être été créatrices, mais elles ne s'appliquaient pas nécessairement.
grConv: Beaucoup de ces idées peuvent être créatives, mais elles n'ont pas fonctionné.
Moses: Beaucoup de ces idées ont pu être créatif, mais ils n'ont pas nécessairement.

Source: There is a lot of consensus between the Left and the Right on this subject.
Reference: C'est qu'il y a sur ce sujet un assez large consensus entre gauche et droite.
RNNEnc: Il existe beaucoup de consensus entre la gauche et le droit à la question.
grConv: Il y a un consensus entre la gauche et le droit sur cette question.
Moses: Il y a beaucoup de consensus entre la gauche et la droite sur ce sujet.

Source: According to them, one can find any weapon at a low price right now.
Reference: Selon eux, on peut trouver aujourd'hui à Moscou n'importe quelle arme pour un prix raisonnable.
RNNEnc: Selon eux, on peut se trouver de l'arme à un prix trop bas.
grConv: En tout cas, ils peuvent trouver une arme à un prix très bas à la fois.
Moses: Selon eux, on trouve une arme à bas prix pour l'instant.

(b) Short Sentences

Table 2: The sample translations along with the source sentences and the reference translations.


Figure 4: The BLEU scores achieved by (a) the RNNenc and (b) the grConv for sentences of a given length (curves for limiting the length of the source text, the reference text, or both). The plot is smoothed by taking a window of size 10. (c) The BLEU scores achieved by the RNN model for sentences with less than a given number of unknown words.

In fact, if we limit the lengths of both the source sentence and the reference translation to be between 10 and 20 words and use only the sentences with no unknown words, the BLEU scores on the test set are 27.81 and 33.08 for the RNNenc and Moses, respectively.

Note that we observed a similar trend even when we used sentences of up to 50 words to train these models.

5.2 Qualitative Analysis

Although the BLEU score is used as a de-facto standard metric for evaluating the performance of a machine translation system, it is not a perfect metric (see, e.g., (Song et al., 2013; Liu et al., 2011)). Hence, here we present some of the actual translations generated by the two models, RNNenc and grConv.

In Table 2 (a)–(b), we show the translations of some randomly selected sentences from the development and test sets. We chose the ones that have no unknown words. (a) lists long sentences (longer than 30 words), and (b) short sentences (shorter than 10 words). We can see that, despite the difference in the BLEU scores, all three models (RNNenc, grConv and Moses) do a decent job at translating, especially short sentences. When the source sentences are long, however, we notice the performance degradation of the neural machine translation models.

Additionally, we present here what type of structure the proposed gated recursive convolutional network learns to represent. With a sample sentence "Obama is the President of the United States", we present the parsing structure learned by the grConv encoder and the generated translations, in Fig. 6. The figure suggests that the

Figure 5: The BLEU scores achieved by an SMT system for sentences of a given length. The plot is smoothed by taking a window of size 10. We use the solid, dotted and dashed lines to show the effect of different lengths of source, reference or both of them, respectively.

grConv extracts the vector representation of the sentence by first merging "of the United States" together with "is the President of" and finally combining this with "Obama is" and ".", which is well correlated with our intuition. Note, however, that the structure learned by the grConv is different from existing parsing approaches in the sense that it returns soft parsing.

Despite the lower performance of the grConv compared to the RNN Encoder–Decoder,4 we find this property of the grConv learning a grammar structure automatically interesting, and believe further investigation is needed.

4 However, it should be noted that the number of gradient updates used to train the grConv was a third of that used to train the RNNenc. Longer training may change the result, but for a fair comparison we chose to compare models which were trained for an equal amount of time. Neither model was trained to convergence.


Figure 6: (a) The visualization of the grConv structure when the input is "Obama is the President of the United States.". Only edges with gating coefficient ω higher than 0.1 are shown. (b) The top-10 translations generated by the grConv; the numbers in parentheses are the negative log-probability:

Obama est le President des Etats-Unis . (2.06)
Obama est le president des Etats-Unis . (2.09)
Obama est le president des Etats-Unis . (2.61)
Obama est le President des Etats-Unis . (3.33)
Barack Obama est le president des Etats-Unis . (4.41)
Barack Obama est le President des Etats-Unis . (4.48)
Barack Obama est le president des Etats-Unis . (4.54)
L'Obama est le President des Etats-Unis . (4.59)
L'Obama est le president des Etats-Unis . (4.67)
Obama est president du Congres des Etats-Unis . (5.09)

6 Conclusion and Discussion

In this paper, we have investigated the properties of a recently introduced family of machine translation systems based purely on neural networks. We focused on evaluating an encoder–decoder approach, proposed recently in (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014), on the task of sentence-to-sentence translation. Among many possible encoder–decoder models we specifically chose two models that differ in the choice of the encoder: (1) an RNN with gated hidden units and (2) the newly proposed gated recursive convolutional neural network.

After training those two models on pairs of English and French sentences, we analyzed their performance using BLEU scores with respect to the lengths of sentences and the existence of unknown/rare words in sentences. Our analysis revealed that the performance of neural machine translation suffers significantly from the length of sentences. However, qualitatively, we found that both models are able to generate correct translations very well.

These analyses suggest a number of future research directions in machine translation purely based on neural networks.

Firstly, it is important to find a way to scale up the training of a neural network, both in terms of computation and memory, so that much larger vocabularies for both source and target languages can be used. Especially when it comes to languages with rich morphology, we may be required to come up with a radically different approach to dealing with words.

Secondly, more research is needed to prevent the neural machine translation system from underperforming on long sentences. Lastly, we need to explore different neural architectures, especially for the decoder. Despite the radical difference in architecture between the RNN and the grConv, which were used as encoders, both models suffer from the curse of sentence length. This suggests that it may be due to a lack of representational power in the decoder. Further investigation and research are required.

In addition to the properties of a general neural machine translation system, we observed one interesting property of the proposed gated recursive convolutional neural network (grConv). The grConv was found to mimic the grammatical structure of an input sentence without any supervision on the syntactic structure of the language. We believe this property makes it appropriate for natural language processing applications other than machine translation.

Acknowledgments

The authors would like to acknowledge the support of the following agencies for research funding and computing support: NSERC, Calcul Québec, Compute Canada, the Canada Research Chairs and CIFAR.


References

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 355–362. Association for Computational Linguistics.

Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. 2013. Audio chord recognition with recurrent neural networks. In ISMIR.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), October. To appear.

Alex Graves. 2012. Sequence transduction with recurrent neural networks. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012).

A. Graves. 2013. Generating sequences with recurrent neural networks. arXiv:1308.0850 [cs.NE], August.

S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Nal Kalchbrenner and Phil Blunsom. 2013. Two recurrent continuous translation models. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1700–1709. Association for Computational Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 48–54, Stroudsburg, PA, USA. Association for Computational Linguistics.

Chang Liu, Daniel Dahlmeier, and Hwee Tou Ng. 2011. Better evaluation metrics lead to better machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 375–384. Association for Computational Linguistics.

Xingyi Song, Trevor Cohn, and Lucia Specia. 2013. BLEU deconstructed: Designing a better MT evaluation metric. In Proceedings of the 14th International Conference on Intelligent Text Processing and Computational Linguistics (CICLING), March.

Ilya Sutskever, Oriol Vinyals, and Quoc Le. 2014. Anonymized. In Anonymized.

Matthew D. Zeiler. 2012. ADADELTA: an adaptive learning rate method. Technical report, arXiv 1212.5701.


Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 112–121, October 25, 2014, Doha, Qatar. ©2014 Association for Computational Linguistics

Transduction Recursive Auto-Associative Memory: Learning Bilingual Compositional Distributed Vector Representations of Inversion Transduction Grammars

Karteek Addanki and Dekai Wu
HKUST, Human Language Technology Center
Department of Computer Science and Engineering
Hong Kong University of Science and Technology
{vskaddanki|dekai}@cs.ust.hk

Abstract

We introduce TRAAM, or Transduction RAAM, a fully bilingual generalization of Pollack's (1990) monolingual Recursive Auto-Associative Memory neural network model, in which each distributed vector represents a bilingual constituent—i.e., an instance of a transduction rule, which specifies a relation between two monolingual constituents and how their subconstituents should be permuted. Bilingual terminals are special cases of bilingual constituents, where a vector represents either (1) a bilingual token—a token-to-token or "word-to-word" translation rule—or (2) a bilingual segment—a segment-to-segment or "phrase-to-phrase" translation rule. TRAAMs have properties that appear attractive for bilingual grammar induction and statistical machine translation applications. Training of TRAAM drives both the autoencoder weights and the vector representations to evolve, such that similar bilingual constituents tend to have more similar vectors.

1 Introduction

We introduce Transduction RAAM—or TRAAM for short—a recurrent neural network model that generalizes the monolingual RAAM model of Pollack (1990) to a distributed vector representation of compositionally structured transduction grammars (Aho and Ullman, 1972) that is fully bilingual from top to bottom. In RAAM, which stands for Recursive Auto-Associative Memory, using feature vectors to characterize constituents at every level of a parse tree has the advantages that (1) the entire context of all subtrees inside the constituent can be efficiently captured in the feature vectors, (2) the learned representations generalize

well because similar feature vectors represent similar constituents or segments, and (3) representations can be automatically learned so as to maximize prediction accuracy for various tasks using semi-supervised learning. We argue that different, but analogous, properties are desirable for bilingual structured translation models.

Unlike RAAM, where each distributed vector represents a monolingual token or constituent, each distributed vector in TRAAM represents a bilingual constituent or biconstituent—that is, an instance of a transduction rule, which asserts a relation between two monolingual constituents, as well as specifying how to permute their subconstituents in translation. Bilingual terminals, or biterminals, are special cases of biconstituents where a vector represents either (1) a bitoken—a token-to-token or "word-to-word" translation rule—or (2) a bisegment—a segment-to-segment or "phrase-to-phrase" translation rule.

The properties of TRAAMs are attractive for machine translation applications. As with RAAM, TRAAMs can be trained via backpropagation, which simultaneously evolves both the autoencoder weights and the biconstituent vector representations. As with RAAM, the evolution of the vector representations within the hidden layer performs automatic feature induction, and for many applications can obviate the need for manual feature engineering. However, the result is that similar vectors tend to represent similar biconstituents, rather than monolingual constituents.
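As a rough sketch of this training regime (our own minimal formulation for illustration, not the authors' implementation), an autoencoder in this style composes two child biconstituent vectors, together with a flag for their permutation (straight or inverted), into a parent vector, and is trained to reconstruct its inputs; backpropagating the reconstruction loss updates both the weights and the leaf biterminal vectors:

    import numpy as np

    def encode(left, right, inverted, W, b):
        """Compose two child biconstituent vectors and a straight/inverted
        permutation flag into a parent biconstituent vector."""
        x = np.concatenate([left, right, [1.0 if inverted else 0.0]])
        return np.tanh(W @ x + b)

    def reconstruction_loss(left, right, inverted, W, b, V, c):
        """Auto-associative objective: decode the parent back into its
        inputs and penalize the squared reconstruction error."""
        x = np.concatenate([left, right, [1.0 if inverted else 0.0]])
        parent = np.tanh(W @ x + b)
        return float(np.sum((V @ parent + c - x) ** 2))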

The learned vector representations thus tend to form clusters of similar translation relations instead of merely similar strings. That is, TRAAM clusters represent soft nonterminal categories of cross-lingual relations and translation patterns, as opposed to soft nonterminal categories of monolingual strings as in RAAM.

Also, TRAAMs inherently make full simultaneous use of both input and output language


features, recursively, in an elegant integrated fashion. TRAAM does not make restrictive a priori assumptions of conditional independence between input and output language features. When evolving the biconstituent vector representations, generalization occurs over similar input and output structural characteristics simultaneously. In most recurrent neural network applications to machine translation to date, only input side features or only output language features are used. Even in the few previous cases where recurrent neural networks have employed both input and output language features for machine translation, the models have typically been factored so that their recursive portion is applied only to either the input or output language, but not both.

As with RAAM, the objective criteria for training can be adjusted to reflect accuracy on numerous different kinds of tasks, biasing the direction that the vector representations evolve toward. But again, TRAAM's learned vector representations support making predictions that simultaneously make use of both input and output structural characteristics. For example, TRAAM has the ability to take into account the structure of both input and output subtree characteristics while making predictions on reordering them. Similarly, for specific cross-lingual tasks such as word alignment, sense disambiguation, or machine translation, classifiers can simultaneously be trained in conjunction with evolving the vector representations to optimize task-specific accuracy (Chrisman, 1991).

In this paper we use as examples binary biparse trees consistent with transduction grammars in a 2-normal form, which by definition are inversion transduction grammars (Wu, 1997) since they have binary rank. This is not a requirement for TRAAM, which in general can be formulated for transduction grammars of any rank. Moreover, with distributed vector representations, the notion of nonterminal categories in TRAAM is that of soft membership, unlike in symbolically represented transduction grammars. We start with bracketed training data that contains no bilingual category labels (like training data for Bracketing ITGs or BITGs). Training results in self-organizing clusters that have been automatically induced, representing soft nonterminal categories (unlike BITGs, which do not have differentiated nonterminal categories).

2 Related work

TRAAM builds on different aspects of a spectrum of previous work. A large body of work exists on various different types of self-organizing recurrent neural network approaches to modeling recursive structure, but mostly in monolingual modeling. Even in applications to machine translation or cross-lingual modeling, the typical practice has been to insert neural network scoring components while still maintaining older SMT modeling assumptions like bags-of-words/phrases, “shake'n'bake” translation that relies heavily on strong monolingual language models, and log-linear models—in contrast to TRAAM's fully integrated bilingual approach. Here we survey representative work across the spectrum.

2.1 Monolingual related work

Distributed vector representations have long been used for n-gram language modeling; these continuous-valued models exploit the generalization capabilities of neural networks, although there is no hidden contextual or hierarchical structure as in RAAM. Schwenk (2010) applies one such language model within an SMT system.

In the simple recurrent neural networks (RNNs or SRNs) of Elman (1990), hidden layer representations are fed back to the input to dynamically represent an aggregate of the immediate contextual history. More recently, the probabilistic NNLMs of Bengio et al. (2003) and Bengio et al. (2009) follow in this vein.

To represent hierarchical tree structure using vector representations, one simple family of approaches employs convolutional networks, as in Lee et al. (2009) for example. Collobert and Weston (2008) use a convolutional neural network layer quite effectively to learn vector representations for words, which are then used in a host of NLP tasks such as POS tagging, chunking, and semantic role labeling.

RAAM approaches, and related recursive autoencoder approaches, can be more flexible than convolutional networks. Like SRNs, they can be extended in numerous ways. The URAAM (Unification RAAM) model of Stolcke and Wu (1992) extended RAAM to demonstrate the possibility of using neural networks to perform more sophisticated operations like unification directly upon the distributed vector representations of hierarchical feature structures. Socher et al. (2011) used monolingual recursive autoencoders for sentiment prediction, with or without parse tree information; this was perhaps the first use of a RAAM-style approach on a large-scale NLP task, albeit monolingual. Scheible and Schütze (2013) automatically simplified the monolingual tree structures generated by recursive autoencoders, validated the simplified structures via manual evaluation, and showed that sentiment classification accuracy is not affected.

2.2 Bilingual related work

The majority of work on learning bilingual distributed vector representations has not made use of recursive approaches or hidden contextual or compositional structure, as in the bilingual word embedding learning of Klementiev et al. (2012) or the bilingual phrase embedding learning of Gao et al. (2014). Schwenk (2012) uses a non-recursive neural network to predict phrase translation probabilities in conventional phrase-based SMT.

Attempts have been made to generalize the distributed vector representations of monolingual n-gram language models, avoiding any hidden contextual or hierarchical structure. Working within the framework of n-gram translation models, Son et al. (2012) generalize left-to-right monolingual n-gram models to bilingual n-grams, and study bilingual variants of class-based n-grams. However, their model does not allow tackling the challenge of modeling cross-lingual constituent order, as TRAAM does; instead it relies on the assumption that some other preprocessor has already managed to accurately re-order the words of the input sentence into exactly the order of words in the output sentence.

Similarly, generalizations of monolingual SRNs to the bilingual case have been studied. Zou et al. (2013) generalize the monolingual recurrent NNLM model of Bengio et al. (2009) to learn bilingual word embeddings using conventional SMT word alignments, and demonstrate that the resulting embeddings outperform the baselines in word semantic similarity. They also add a single semantic similarity feature induced with bilingual embeddings to a phrase-based SMT log-linear model, and report improvements in BLEU. Compared to TRAAM, however, they only learn non-compositional features, with distributed vectors only representing biterminals (as opposed to biconstituents or bilingual subtrees), and so other mechanisms for combining biterminal scores still need to be used to handle hierarchical structure, as opposed to being seamlessly integrated into the distributed vector representation model. Devlin et al. (2014) obtain translation accuracy improvements by extending the probabilistic NNLMs of Bengio et al. (2003), which are used for the output language, by adding input language context features. Unlike TRAAM, neither of these approaches symmetrically models the recursive structure of both the input and output language sides.

For convolutional network approaches, Kalchbrenner and Blunsom (2013) use a recurrent probabilistic model to generate a representation of the source sentence and then generate the target sentence from this representation. This use of input language context to bias translation choices is in some sense a neural network analogy to the PSD (phrase sense disambiguation) approach for context-dependent translation probabilities of Carpuat and Wu (2007). Unlike TRAAM, the model does not contain structural constraints, and permutation of phrases must still be done in conventional PBSMT “shake'n'bake” style by relying mostly on a language model (in their case, an NNLM).

A few applications of monolingual RAAM-style recursive autoencoders to bilingual tasks have also appeared. For cross-lingual document classification, Hermann and Blunsom (2014) use two separate monolingual fixed vector composition networks, one for each language. One provides the training signal for the other, and training is only on the embeddings.

Li et al. (2013) described a use of monolingual recursive autoencoders within maximum entropy ITGs. They replace their earlier model for predicting reordering, based on the first and last tokens in a constituent, by instead using the context vector generated by the recursive autoencoder. Only input language context is used, unlike TRAAM, which can use the input and output language contexts equally.

Autoencoders have also been applied to SMT in a very different way by Zhao et al. (2014), but without recursion and not for learning distributed vector representations of words; rather, they used non-recursive autoencoders to compress very high-dimensional bilingual sparse features down to low-dimensional feature vectors, so that MIRA or PRO could be used to optimize the log-linear model weights.

3 Representing transduction grammars with TRAAM

As a recurrent neural network representation of a transduction grammar, TRAAM learns bilingual distributed representations that parallel the structural composition of a transduction grammar. As with transduction grammars, the learned representations are symmetric and model structured relational correlations between the input and output languages. The induced feature vectors in effect represent soft categories of cross-lingual relations and translations. The TRAAM model integrates elegantly with the transduction grammar formalism and aims to model the compositional structure of the transduction grammar, as opposed to incorporating external alignment information. It is straightforward to formulate TRAAMs for arbitrary syntax-directed transduction grammars; here we shall describe an example of a TRAAM model for an inversion transduction grammar (ITG).

Formally, an ITG is a tuple ⟨N, Σ, ∆, R, S⟩, where N is a finite nonempty set of nonterminal symbols, Σ is a finite set of terminal symbols in L0, ∆ is a finite set of terminal symbols in L1, R is a finite nonempty set of inversion transduction rules, and S ∈ N is a designated start symbol. A normal-form ITG consists of rules in one of the following four forms:

S → A, A → [BC] , A → ⟨BC⟩, A → e/f

where S ∈ N is the start symbol, A, B, C ∈ N are nonterminal symbols and e/f is a biterminal. A biterminal is a pair of symbol strings in Σ∗ × ∆∗, where at least one of the strings has to be nonempty. The square and angled brackets signal straight and inverted order respectively. With straight order, both the L0 and the L1 productions are generated left-to-right, but with inverted order, the L1 production is generated right-to-left.
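For readers who prefer code to notation, a normal-form ITG of this kind can be written down directly as data. The following is a minimal, hypothetical Python sketch; the class and field names are our own illustrative choices, not part of the formalism:

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass
    class Biterminal:
        """A pair e/f of terminal strings; at least one must be nonempty."""
        e: Tuple[str, ...]   # terminal string over the L0 vocabulary (Sigma*)
        f: Tuple[str, ...]   # terminal string over the L1 vocabulary (Delta*)

    # The four normal-form rule types: S -> A, A -> [B C], A -> <B C>, A -> e/f.
    @dataclass
    class StartRule:
        lhs: str             # the start symbol S
        rhs: str             # a nonterminal A

    @dataclass
    class SyntacticRule:
        lhs: str             # nonterminal A
        left: str            # nonterminal B
        right: str           # nonterminal C
        inverted: bool       # False = straight [B C], True = inverted <B C>

    @dataclass
    class LexicalRule:
        lhs: str             # nonterminal A
        biterminal: Biterminal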

In the distributed TRAAM representation of the ITG, we represent each bispan using a feature vector v of dimension d that represents a fuzzy encoding of all the nonterminals that could generate it. This is in contrast to the ITG model, where each nonterminal that generates a bispan has to be enumerated separately. Feature vectors corresponding to larger bispans are compositionally generated from smaller bispans using a compressor network, which takes two feature vectors of dimension d, corresponding to the smaller bispans, and generates the feature vector of dimension d corresponding to the larger bispan. A single bit corresponding to straight or inverted order is also fed as an input to the compressor network. The compressor network in TRAAM serves a similar role to the syntactic rules in the symbolic ITG, but keeps the encoding fuzzy. Figure 2 shows the straight and inverted syntactic rules and the corresponding inputs to the compressor network. Modeling of unary rules (with the start symbol on the left-hand side), although similar, is beyond the scope of this paper.

It is easy to demonstrate that TRAAM models are capable of representing any symbolic ITG model. All the nonterminals representing a bispan can be encoded as a bit vector in the feature vector of the bispan. Using the universal approximation theorem of neural networks (Hornik et al., 1989), an encoder with a single hidden layer can represent any set of syntactic rules. Similarly, all TRAAM models can be represented using a symbolic ITG by assuming a unique nonterminal label for every feature vector. Therefore, TRAAM and ITGs represent two equivalent classes of models for representing compositional bilingual relations.

It is important to note that although the TRAAM and ITG models are thus equivalent in expressive power, the fuzzy encoding of nonterminals in TRAAM is suitable for modeling the generalizations in bilingual relations without exploding the search space, unlike the symbolic models. This property of TRAAM makes it attractive for bilingual category learning and machine translation applications, as long as appropriate language bias and objective functions are determined.

Given our objective of inducing categories of bilingual relations in an unsupervised manner, we bias our TRAAM model by using a simple nonlinear activation function as our compressor, similar to the monolingual recursive autoencoder model proposed by Socher et al. (2011). Having a single layer in our compressor provides the necessary language bias by forcing the network to capture the generalizations while reducing the dimensions of the input vectors. We use tanh as the nonlinear activation function; the compressor accepts two vectors c1 and c2 of dimension d, corresponding to the nonterminals of the smaller bispans, and a single bit o corresponding to the inversion order, and generates a vector p of dimension d corresponding to the larger bispan obtained by combining the two smaller bispans, as shown in Figure 2. The vector p then serves as the input for the successive combinations of the larger bispan with other bispans.

Figure 1: Example of English-Telugu biparse trees where inversion depends on output language sense.

Figure 2: Architecture of TRAAM.

p = tanh(W1[o; c1; c2] + b1) (1)

where W1 and b1 are the weight matrix and the bias vector of the encoder network.

To ensure that the computed vector p captures the fuzzy encodings of its children and the inversion order, we use a reconstructor network which attempts to reconstruct the inversion order and the feature vectors of its children. We use the error in reconstruction as our objective function and train our model to minimize the reconstruction error over all the nodes in the biparse tree. The reconstructor network in our TRAAM model can be replaced by any other network that enables the computed feature vector representations to be optimized for the given task. In our current implementation, we reconstruct the inversion order o′ and the child vectors c′1 and c′2 using another nonlinear activation function as follows:

[o′; c′1; c′2] = tanh(W2p + b2) (2)

where W2 and b2 are the weight matrix and the bias vector of the reconstructor network.
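To make Equations (1) and (2) concrete, here is a minimal numpy sketch of the compressor and reconstructor; the dimensionality, initialization and names are our own illustrative choices, not the authors' implementation:

    import numpy as np

    d = 50                                          # dimensionality of bispan vectors (assumed)
    rng = np.random.default_rng(0)
    W1 = rng.uniform(-0.1, 0.1, (d, 2 * d + 1))     # encoder weights over [o; c1; c2]
    b1 = np.zeros(d)
    W2 = rng.uniform(-0.1, 0.1, (2 * d + 1, d))     # reconstructor weights
    b2 = np.zeros(2 * d + 1)

    def compress(o, c1, c2):
        """Equation (1): combine two child bispan vectors and the inversion bit o."""
        x = np.concatenate(([o], c1, c2))
        return np.tanh(W1 @ x + b1)

    def reconstruct(p):
        """Equation (2): recover the inversion bit and child vectors from p."""
        y = np.tanh(W2 @ p + b2)
        return y[0], y[1:d + 1], y[d + 1:]          # o', c1', c2'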

4 Bilingual training

4.1 Initialization

The weights and the biases of the compressor and the reconstructor networks of the TRAAM model are randomly initialized. Bisegment embeddings corresponding to the leaf nodes (biterminals in the symbolic ITG notation) in the biparse trees are also initialized randomly. These constitute the model parameters and are optimized to minimize our objective function of reconstruction error. The parse trees providing the structural constraints are generated by a bracketing inversion transduction grammar (BITG) induced in a purely unsupervised fashion, according to the algorithm in Saers et al. (2009). Due to constraints on the training time, we consider only the Viterbi biparse trees according to the BITG instead of all the biparse trees in the forest.

4.2 Computing feature vectors

We compute the feature vectors at each internal node in the biparse tree, similar to the feedforward pass in a neural network. We topologically sort all the nodes in the biparse tree and set the feature vector of each node in the topologically sorted order as follows:

• If the node is a leaf node, the feature vector is the corresponding bisegment embedding.

• Otherwise, the biconstituent embedding corresponding to the internal node is generated from the feature vectors of the children and the inversion order using Equation 1. We also normalize the length of the computed feature vector so as to prevent the network from making the biconstituent embedding arbitrarily small in magnitude (Socher et al., 2011). A sketch of this bottom-up pass is given below.
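As a hedged illustration, the pass can be written as a simple recursion over the biparse tree, reusing the hypothetical compress function above; the BiparseNode class is our own:

    import numpy as np

    class BiparseNode:
        """Node of a binary biparse tree: leaves carry a bisegment embedding,
        internal nodes carry an inversion bit o (+1 straight, -1 inverted)."""
        def __init__(self, o=None, left=None, right=None, embedding=None):
            self.o, self.left, self.right = o, left, right
            self.vector = embedding            # set for leaves, computed below

    def compute_vectors(node):
        """Bottom-up pass: set node.vector for every node, children first."""
        if node.left is None:                  # leaf: embedding already set
            return node.vector
        c1 = compute_vectors(node.left)
        c2 = compute_vectors(node.right)
        p = compress(node.o, c1, c2)           # Equation (1)
        p = p / np.linalg.norm(p)              # length normalization (Socher et al., 2011)
        node.vector = p
        return p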

4.3 Feature vector optimization

We train our current implementation of TRAAM by optimizing the model parameters to minimize an objective function based on the reconstruction error over all the nodes in the biparse trees. The objective function is defined as a linear combination of the l2 norm of the reconstruction error of the children and the cross-entropy loss of reconstructing the inversion order. We define the error at each internal node n as follows:

En = (α/2) ‖[c1; c2] − [c′1; c′2]‖² − (1 − α) [(1 − o) log(1 − o′) + (1 + o) log(1 + o′)]

where c1, c2, o correspond to the left child, right child and inversion order, c′1, c′2, o′ are the respective reconstructions, and α is the linear weighting factor. The global objective function J is the sum of the error function over all internal nodes n in the biparse trees, averaged over the total number of sentences T in the corpus. A regularization parameter λ is used on the norm of the model parameters θ to avoid overfitting.

J = (1/T) Σn En + λ‖θ‖²    (3)

As the bisegment embeddings are also part of the model parameters, the optimization objective is similar to a moving-target training objective (Rohwer, 1990). We use backpropagation through structure (Goller and Kuchler, 1996) to compute the gradients efficiently. The L-BFGS algorithm (Liu and Nocedal, 1989) is used to minimize the loss function.
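A direct transcription of the per-node error En, again as a hedged sketch reusing the hypothetical reconstruct function above; the default α is illustrative, and we clip o′ away from ±1 purely for numerical stability:

    import numpy as np

    def node_error(c1, c2, o, p, alpha=0.2, eps=1e-6):
        """Per-node reconstruction error E_n for an internal node with vector p."""
        o_rec, c1_rec, c2_rec = reconstruct(p)
        o_rec = np.clip(o_rec, -1 + eps, 1 - eps)          # keep the logs finite
        l2 = np.sum((np.concatenate((c1, c2))
                     - np.concatenate((c1_rec, c2_rec))) ** 2)
        xent = (1 - o) * np.log(1 - o_rec) + (1 + o) * np.log(1 + o_rec)
        return alpha / 2.0 * l2 - (1 - alpha) * xent

The global objective J would then sum node_error over all internal nodes, divide by the number of sentences, and add the λ‖θ‖² regularizer of Equation (3).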

5 Bilingual representation learning

We expect the TRAAM model to generate clusters over cross-lingual relations, as RAAM models do on monolingual data. We test this hypothesis by bilingually training our model on a parallel English-Telugu blocks-world dataset. The dataset is kept simple to better understand the nature of the clusters. It comprises commands involving the manipulation of differently colored objects over different shapes.

5.1 Example

Figure 1 shows the biparse trees for two English-Telugu sentence pairs. The preposition on in English translates to నున (pinunna) and న (pina) respectively in the first and second sentence pairs, because in the first sentence the block is described by its position on the square, whereas in the second sentence the block is the subject and the square is the object. Since Telugu is a language with SOV structure, the verbs ఉంచు (vunchu) and సు (teesuko) occur at the end of both sentences.

The sentences in Figure 1 illustrate the importance of modeling bilingual relations simultaneously, instead of focusing only on the input or output language, as the cross-lingual structural relations are sensitive to both the input and output language context. For example, for the constituent whose input side is block on the square, the corresponding output language tree structure is determined by whether on is translated to నున (pinunna) or న (pina).

In symbolic frameworks such as ITGs, such relations are encoded using different nonterminal categories. However, inducing such categories within a symbolic framework in an unsupervised manner creates extremely challenging combinatorial scaling issues. TRAAM models are a promising approach for tackling this problem, since the vector representations learned using the TRAAM model inherently yield soft syntactic category membership properties, despite being trained only with the unlabeled structural constraints of simple BITG-style data.

Figure 3: Clustering of biconstituents in the Telugu-English data.

5.2 Biconstituent clustering

The soft membership properties of learned distributed vector representations can be explored via cluster analysis. To illustrate, we trained a TRAAM network bilingually using the algorithm in Section 4, and obtained feature vector representations for each unique biconstituent. Clustering the obtained feature vectors reveals the emergence of fuzzy nonterminal categories, as shown in Figure 3. It is important to note that each point in the vector space corresponds to a tree-structured biconstituent, as opposed to merely a flat bilingual phrase, since the same surface forms with different tree structures will have different vectors.

As the full cluster tree is too unwieldy, Figure 4 zooms in to show an enlarged portion of the clustering, along with the corresponding bracketed bilingual structures. One can observe that the cluster represents the biconstituents that describe an object by its position on another object. We can deduce this from the fact that only a single sense of on / నున (pinunna) seems to be occurring in all the biconstituents of the cluster. Manual inspection of other clusters reveals similar patterns, despite the noise expected to be introduced by the sparsity of our dataset.

6 Conclusion

We have introduced a fully bilingual generalization of Pollack's (1990) monolingual Recursive Auto-Associative Memory neural network model, TRAAM, in which each distributed vector represents a bilingual constituent—i.e., an instance of a transduction rule, which specifies a relation between two monolingual constituents and how their subconstituents should be permuted. Bilingual terminals are special cases of bilingual constituents, where a vector represents either (1) a bilingual token—a token-to-token or “word-to-word” translation rule—or (2) a bilingual segment—a segment-to-segment or “phrase-to-phrase” translation rule.

TRAAMs can be used for arbitrary-rank SDTGs (syntax-directed transduction grammars, a.k.a. synchronous context-free grammars). Although our discussions in this paper have focused on biparse trees from SDTGs in a 2-normal form, which by definition are ITGs due to the binary rank, nothing prevents TRAAMs from being applied to higher-rank transduction grammars.

Figure 4: Typical zoomed view into the Telugu-English biconstituent clusters from Figure 3.

We believe TRAAMs are worth detailed exploration, as their intrinsic properties address key problems in bilingual grammar induction and statistical machine translation—their sensitivity to both input and output language context means that the learned vector representations tend to reflect the similarity of bilingual rather than monolingual constituents, which is what is needed to induce differentiated bilingual nonterminal categories.

7 Acknowledgments

This material is based upon work supported in part by the Defense Advanced Research Projects Agency (DARPA) under BOLT contract nos. HR0011-12-C-0014 and HR0011-12-C-0016, and GALE contract nos. HR0011-06-C-0022 and HR0011-06-C-0023; by the European Union under the FP7 grant agreement no. 287658; and by the Hong Kong Research Grants Council (RGC) research grants GRF620811, GRF621008, and GRF612806. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, the EU, or RGC.

References

Alfred V. Aho and Jeffrey D. Ullman. The Theory of Parsing, Translation, and Compiling. Prentice-Hall, Englewood Cliffs, New Jersey, 1972.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48. ACM, 2009.

Marine Carpuat and Dekai Wu. Context-dependent phrasal translation lexicons for statistical machine translation. In 11th Machine Translation Summit (MT Summit XI), pages 73–80, 2007.

Lonnie Chrisman. Learning recursive distributed representations for holistic computation. Connection Science, 3(4):345–366, 1991.

Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 160–167, New York, NY, USA, 2008. ACM.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. Fast and robust neural network joint models for statistical machine translation. In 52nd Annual Meeting of the Association for Computational Linguistics, 2014.

Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.

Jianfeng Gao, Xiaodong He, Wen-tau Yih, and Li Deng. Learning continuous phrase representations for translation modeling. In 52nd Annual Meeting of the Association for Computational Linguistics (Short Papers), 2014.

Christoph Goller and Andreas Kuchler. Learning task-dependent distributed representations by backpropagation through structure. In IEEE International Conference on Neural Networks, volume 1, pages 347–352. IEEE, 1996.

Karl Moritz Hermann and Phil Blunsom. Multilingual models for compositional distributed semantics. In 52nd Annual Meeting of the Association for Computational Linguistics, 2014.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In EMNLP, pages 1700–1709, 2013.

Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. Inducing crosslingual distributed representations of words. In 24th International Conference on Computational Linguistics (COLING 2012), 2012.

Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 609–616. ACM, 2009.

Peng Li, Yang Liu, and Maosong Sun. Recursive autoencoders for ITG-based translation. In EMNLP, pages 567–577, 2013.

Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.

Jordan B. Pollack. Recursive distributed representations. Artificial Intelligence, 46(1):77–105, 1990.

Richard Rohwer. The “moving targets” training algorithm. In Neural Networks, pages 100–109. Springer, 1990.

Markus Saers, Joakim Nivre, and Dekai Wu. Learning stochastic bracketing inversion transduction grammars with a cubic time biparsing algorithm. In 11th International Conference on Parsing Technologies (IWPT'09), pages 29–32, Paris, France, October 2009.

Christian Scheible and Hinrich Schütze. Cutting recursive autoencoder trees. In 1st International Conference on Learning Representations (ICLR 2013), Scottsdale, Arizona, May 2013.

Holger Schwenk. Continuous-space language models for statistical machine translation. The Prague Bulletin of Mathematical Linguistics, 93:137–146, 2010.

Holger Schwenk. Continuous space translation models for phrase-based statistical machine translation. In Proceedings of COLING 2012: Posters, pages 1071–1080, 2012.

Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 151–161. Association for Computational Linguistics, 2011.

Le Hai Son, Alexandre Allauzen, and François Yvon. Continuous space translation models with neural networks. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 39–48. Association for Computational Linguistics, 2012.

Andreas Stolcke and Dekai Wu. Tree matching with recursive distributed representations. In AAAI 1992 Workshop on Integrating Neural and Symbolic Processes—The Cognitive Dimension, 1992.

Dekai Wu. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403, 1997.

Bing Zhao, Yik-Cheung Tam, and Jing Zheng. An autoencoder with bilingual sparse features for improved statistical machine translation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.

Will Y. Zou, Richard Socher, Daniel M. Cer, and Christopher D. Manning. Bilingual word embeddings for phrase-based machine translation. In EMNLP, pages 1393–1398, 2013.


Transformation and Decomposition for Efficiently Implementing and Improving Dependency-to-String Model in Moses

Liangyou Li†, Jun Xie‡, Andy Way† and Qun Liu†‡
† CNGL Centre for Global Intelligent Content, School of Computing, Dublin City University, Dublin 9, Ireland
‡ Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
{liangyouli,away,qliu}@computing.dcu.ie
[email protected]

Abstract

Dependency structure provides grammatical relations between words, which have been shown to be effective in Statistical Machine Translation (SMT). In this paper, we present an open source module in Moses which implements a dependency-to-string model. We propose a method to transform the input dependency tree into a corresponding constituent tree for reusing the tree-based decoder in Moses. In our experiments, this method achieves comparable results with the standard model. Furthermore, we enrich this model via the decomposition of dependency structure, including extracting rules from the sub-structures of the dependency tree during training and creating a pseudo-forest instead of the tree per se as the input during decoding. Large-scale experiments on Chinese–English and German–English tasks show that the decomposition approach improves the baseline dependency-to-string model significantly. Our system achieves comparable results with the state-of-the-art hierarchical phrase-based model (HPB). Finally, when resorting to phrasal rules, the dependency-to-string model performs significantly better than Moses HPB.

1 Introduction

Dependency structure models relations between words in a sentence. Such relations indicate the syntactic function of one word to another word. As dependency structure directly encodes semantic information and has the best inter-lingual phrasal cohesion properties (Fox, 2002), it is believed to be helpful to translation.

In recent years, dependency structure has been widely used in SMT. For example, Shen et al. (2010) present a string-to-dependency model by using the dependency fragments of the neighbouring words on the target side, which makes it easier to integrate a dependency language model. However, such string-to-tree systems run slowly, in cubic time (Huang et al., 2006).

Another example is the treelet approach (Menezes and Quirk, 2005; Quirk et al., 2005), which uses dependency structure on the source side. Xiong et al. (2007) extend the treelet approach to allow dependency fragments with gaps. As a treelet is defined as an arbitrary connected sub-graph, typically both substitution and insertion operations are adopted for decoding. However, as translation rules based on treelets do not directly encode enough reordering information, another heuristic or separate reordering model is usually needed to decide the best target position of the inserted words.

Different from these works, Xie et al. (2011) present a dependency-to-string (Dep2Str) model, which extracts head-dependent (HD) rules from word-aligned source dependency trees and target strings. As this model specifies reordering information in the HD rules, during translation only the substitution operation is needed, because words are reordered simultaneously with the rule being applied. Meng et al. (2013) and Xie et al. (2014) extend the model by augmenting HD rules with the help of either a constituent tree or fixed/float structures (Shen et al., 2010). Augmented rules are created by the combination of two or more nodes in the HD fragment, and are capable of capturing translations of non-syntactic phrases. However, the decoder needs to be changed correspondingly to handle these rules.

Attracted by the simplicity of the Dep2Str model, in this paper we describe an easy way to integrate the model into the popular translation framework Moses (Koehn et al., 2007). In order to share the same decoder with the conventional syntax-based model, we present an algorithm which transforms a dependency tree into a corresponding constituent tree which encodes dependency information in its non-leaf nodes and is compatible with the Dep2Str model. In addition, we present a method to decompose a dependency structure (HD fragment) into smaller parts, which enrich translation rules and also allow us to create a pseudo-forest as the input. “Pseudo” means the forest is not obtained by combining several trees from a parser, but rather that it is created based on the decomposition of an HD fragment. Large-scale experiments on Chinese–English and German–English tasks show that the transformation and decomposition are effective for translation.

In the remainder of the paper, we first describe the Dep2Str model (Section 2). Then we describe how to transform a dependency tree into a constituent tree which is compatible with the Dep2Str model (Section 3). The idea of decomposition, including extracting sub-structural rules and creating a pseudo-forest, is presented in Section 4. Experiments are then conducted to compare the translation results of our approach with the state-of-the-art HPB model (Section 5). We conclude in Section 6 and present avenues for future work.

2 Dependency-to-String Model

In the Dep2Str model (Xie et al., 2011), the HD fragment is the basic unit. As shown in Figure 1, in a dependency tree each non-leaf node is the head of some other nodes (dependents), so an HD fragment is composed of a head node and all of its dependents.¹

In this model, there are two kinds of rules for translation. One is the head rule, which specifies the translation of a source word:

举行 (Juxing) → holds

¹ In this paper, the HD fragment of a node means the HD fragment with this node as the head. Leaf nodes have no HD fragments.

Figure 1: Example of a dependency tree, with head-dependent fragments being indicated by dotted lines.

The other is the HD rule, which consists of three parts: the HD fragment s of the source side (possibly containing variables), a target string t (possibly containing variables), and a one-to-one mapping φ from variables in s to variables in t, as in:

s = (玻利维亚) 举行 (x1:选举)
t = Bolivia holds x1
φ = {x1:选举 → x1}

where the underlined element denotes the leaf node. Variables in the Dep2Str model are constrained either by words (like x1:选举) or by Part-of-Speech (POS) tags (like x1:NN).
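Viewed as data, such a rule is just a triple; a minimal sketch of one possible representation (the class and field names are our own, not the Moses-internal format):

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class HDRule:
        """A head-dependent rule: source HD fragment, target string, variable map."""
        source: List[str]                 # e.g. ["玻利维亚", "举行", "x1:选举"]
        target: List[str]                 # e.g. ["Bolivia", "holds", "x1"]
        var_map: Dict[str, str] = field(default_factory=dict)

    rule = HDRule(
        source=["玻利维亚", "举行", "x1:选举"],
        target=["Bolivia", "holds", "x1"],
        var_map={"x1:选举": "x1"},
    )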

Given a source sentence with a dependency tree, a target string and the word alignment between the source and target sentences, this model first annotates each node N with two annotations: a head span and a dependency span.² These two spans specify the corresponding target position of a node (by the head span) or sub-tree (by the dependency span). After annotation, acceptable HD fragments³ are utilized to induce lexicalized HD rules (the head node and leaf nodes are represented by words, while the internal nodes are replaced by variables constrained by words) and unlexicalized HD rules (nodes are replaced by variables constrained by POS tags).

² Some definitions: the closure clos(S) of a set S is the smallest superset of S in which the elements (integers) are continuous. Let H be the set of indexes of target words aligned to node N. The head span hsp(N) of node N is clos(H). A head span hsp(N) is consistent if it does not overlap with the head span of any other node. The dependency span dsp(N) of node N is the union of all consistent head spans in the subtree rooted at N.

³ A head-dependent fragment is acceptable if the head span of the head node is consistent and none of the dependency spans of its dependents is empty. Note that in an acceptable fragment, the head span of the head node and the dependency spans of the dependents do not overlap with each other.

Figure 2: Example of a derivation. Underlined elements indicate leaf nodes.

In HD rules, an internal node denotes the whole sub-tree and is always a substitution site. The head node and leaf nodes can be represented by either words or variables. The target side corresponding to an HD fragment and the mapping between variables are determined by the head span of the head node and the dependency spans of the dependents.

A translation can be obtained by applying rules to the input dependency tree. Figure 2 shows a derivation for translating a Chinese sentence into an English string. The derivation proceeds from top to bottom. Variables in the higher-level HD rules are substituted by the translations of lower HD rules recursively.

The final translation is obtained by finding the best derivation d∗ from all possible derivations D which convert the source dependency structure into a target string, as in Equation (1):

d∗ = argmax_{d ∈ D} p(d) ≈ argmax_{d ∈ D} ∏i φi(d)^λi    (1)

where φi(d) is the ith feature defined on the derivation d, and λi is the weight of the feature.
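In log space, Equation (1) reduces to a weighted sum of log feature values; the following hypothetical sketch makes the search explicit (a real decoder searches the derivation forest rather than enumerating candidates):

    import math

    def derivation_score(feature_values, weights):
        """log p(d) up to a constant: sum_i lambda_i * log(phi_i(d))."""
        return sum(lam * math.log(phi)
                   for lam, phi in zip(weights, feature_values))

    def best_derivation(derivations, weights):
        """d* = argmax over candidate derivations, each given as its tuple of
        feature values (phi_1(d), ..., phi_k(d))."""
        return max(derivations, key=lambda d: derivation_score(d, weights))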

3 Transformation of Dependency Trees

In this section, we introduce an algorithm to transform a dependency tree into a corresponding constituent tree, where the words of the source sentence are leaf nodes and internal nodes are labelled with head words or POS tags which are constrained by dependency information. Such a transformation makes it possible to use the traditional tree-based decoder to translate a dependency tree, so we can easily integrate the Dep2Str model into the popular framework Moses.

In a tree-based system, the CYK algorithm (Kasami, 1965; Younger, 1967; Cocke and Schwartz, 1970) is usually employed to translate the input sentence with a tree structure. Each time, a continuous sequence of words (a phrase) in the source sentence is translated. Larger phrases can be translated by combining translations of smaller phrases.

In a constituent tree, the source words are leaf nodes and all non-leaf nodes covering a phrase are labelled with categories, which are usually variables defined in the tree-based model. For translating a phrase covered by a non-leaf node, the decoder for the constituent tree can easily find applicable rules by directly matching variables in these rules to tree nodes. However, in a dependency tree, each internal node represents a word of the source sentence. Variables covering a phrase cannot be recognized directly. Therefore, to share the same decoder with the constituent tree, the dependency tree needs to be transformed into a constituent-style tree.

As we described in Section 2, each variable in the Dep2Str model represents a word (for the head and leaf nodes) or a sequence of continuous words (for an internal node). Thus it is intuitive to use these variables to label non-leaf nodes of the produced constituent tree. Furthermore, in order to preserve the dependency information of each HD fragment, the created constituent node needs to be constrained by the dependency information in the HD fragment.

Our transformation algorithm is shown in Algorithm 1, which proceeds recursively from top to bottom on each HD fragment. There are at most three types of nodes in an HD fragment: the head node, leaf nodes, and internal nodes. The leaf nodes and internal nodes are dependents of the head node.


Algorithm 1 Transforming a dependency tree into a constituent tree. Dnode means a node in the dependency tree; Cnode means a node in the constituent tree.

function CNODE(label, span)
    create a new Cnode CN
    CN.label ← label
    CN.span ← span
end function

function TRANSFNODE(Dnode H)
    pos ← POS of H
    constrain pos with H0, like: NN:H0
    CNODE(pos, H.position)
    for each dependent N of H do
        pos ← POS of N
        word ← word of N
        constrain pos with Li or Ri, like: NN:R1
        constrain word with Li or Ri
        if N is a leaf then
            CNODE(pos, N.position)
        else
            CNODE(word, N.span)
            CNODE(pos, N.span)
            TRANSFNODE(N)
        end if
    end for
end function

For the leaf nodes and the head node, we create constituent nodes that cover just one word. For an internal node N, we create constituent nodes that cover all the words in the sub-tree rooted at N. In Algorithm 1, N.position means the position of the word represented by the node N, and N.span denotes the indexes of the words covered by the sub-tree rooted at node N.
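A runnable sketch of Algorithm 1 under our own hypothetical node representation; following the labelling convention of Figure 3, L1/R1 mark the dependent nearest the head, L2/R2 the next one out:

    class DepNode:
        """Dependency-tree node: word, POS tag, word position, and ordered
        left/right dependents; span = indexes of all words in the subtree."""
        def __init__(self, word, pos, position, left=(), right=()):
            self.word, self.pos, self.position = word, pos, position
            self.left, self.right = list(left), list(right)
            self.span = sorted(
                [position] + [i for c in self.left + self.right for i in c.span])

    def transform(head, cnodes):
        """Emit (label, span) constituent nodes for the HD fragment of `head`,
        then recurse into internal dependents (Algorithm 1)."""
        cnodes.append((head.pos + ":H0", [head.position]))
        deps = [(n, "L%d" % i) for i, n in enumerate(reversed(head.left), 1)]
        deps += [(n, "R%d" % i) for i, n in enumerate(head.right, 1)]
        for n, tag in deps:
            if not n.left and not n.right:            # leaf dependent
                cnodes.append((n.pos + ":" + tag, [n.position]))
            else:                                     # internal dependent
                cnodes.append((n.word + ":" + tag, n.span))
                cnodes.append((n.pos + ":" + tag, n.span))
                transform(n, cnodes)
        return cnodes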

Taking the dependency tree in Figure 1 as an example, its transformation result for integration with Moses is shown in Figure 3. In the Dep2Str model, leaf nodes can be replaced by a variable constrained by their POS tag, so for the leaf node “总统” (Zongtong) in the HD fragment “(总统) (与) 国会”, we create a constituent node “NN:L2”, where “NN” is the POS tag and “L2” denotes that the leaf node is the second left dependent of the head node. For the internal node “国会” (Guohui) in the HD fragment “(国会) 选举”, we create two constituent nodes which cover all the words in the dependency sub-tree rooted at this node, with one of them labelled by the word itself. Both nodes are constrained by the dependency information “L1”. After such a transformation is conducted on each HD fragment recursively, we obtain a constituent tree.

Figure 3: The corresponding constituent tree after transforming the dependency tree in Figure 1. Note that in our implementation, we do not distinguish the leaf nodes and internal nodes of a dependency tree in the produced constituent tree and induced rules.

This transformation makes our implementation of the Dep2Str model easier, because we can use the tree-to-string decoder in Moses. All we need to do is to write a new rule extractor which extracts head rules and HD rules (see Section 2) from the word-aligned source dependency trees and target strings, and represents these rules in the format defined in Moses.⁴

Note that while this conversion is performed on the input dependency tree during decoding, the training part, including extracting rules and calculating translation probabilities, does not change, so the model is still a dependency-to-string model.

⁴ Taking the rule in Section 2 as an example, its representation in Moses is:

s = 玻利维亚 举行 [选举:R1][X] [H1]
t = Bolivia holds [选举:R1][X] [X]
φ = {2 → 2}

where “H1” denotes that the position of the head word is 1, “R1” indicates the first right dependent of the head word, “X” is the general label for the target side, and φ is the set of alignments (the index correspondences between s and t). The format is described in detail at http://www.statmt.org/moses/?n=Moses.SyntaxTutorial.


In addition, our transformation is different from other works which transform a dependency tree into a constituent tree (Collins et al., 1999; Xia and Palmer, 2001). In this paper, the produced constituent tree still preserves the dependency relations between words, and the phrasal structure is directly derived from the dependency structure without refinement. Accordingly, the constituent tree may not be a linguistically well-formed syntactic structure. However, this is not a problem for our model, because what matters in this paper is the dependency structure, which has already been encoded into the (ill-formed) constituent tree.

4 Decomposition of Dependency Structure

The Dep2Str model treats a whole HD fragment as the basic unit, which may result in a sparse-data problem. For example, an HD fragment with a verb as head typically consists of more than four nodes (Xie et al., 2011). Thus in this section, inspired by the treelet approach, we describe a decomposition method to make use of smaller fragments.

In an HD fragment of a dependency tree, the head determines the semantic category, while the dependent gives the semantic specification (Zwicky, 1985; Hudson, 1990). Accordingly, it is reasonable to assume that in an HD fragment, dependents can be removed or new dependents can be attached as needed. Thus, in this paper, we assume that a large HD fragment is formed by attaching dependents to a smaller HD fragment. For simplicity and reuse of the decoder, such an attachment is carried out in one step. This means that an HD fragment is decomposed into two smaller parts in a possible decomposition. This decomposition can be formulated as Equation (2):

Li ··· L1 H R1 ··· Rj = (Lm ··· L1 H R1 ··· Rn) + (Li ··· Lm+1 H Rn+1 ··· Rj)    (2)

subject to i ≥ 0, j ≥ 0; i ≥ m ≥ 0, j ≥ n ≥ 0; i + j > m + n > 0,

where H denotes the head node, Li denotes the ith left dependent and Rj denotes the jth right dependent. Figure 4 shows an example.

Figure 4: An example of decomposition on a head-dependent fragment.

Algorithm 2 Decomposition of an HD fragment into two sub-fragments. Indexing of nodes in a fragment starts from 0.

function DECOMP(HD fragment frag)
    fset ← {}
    len ← number of nodes in frag
    hidx ← the index of the head node in frag
    for s = 0 to hidx do
        for e = hidx to len − 1 do
            if 0 < e − s < len − 1 then
                create sub-fragment core
                core ← nodes from s to e
                add core to fset
                create sub-fragment shell
                initialize shell with the head node
                shell ← nodes not in core
                add shell to fset
            end if
        end for
    end for
end function

Such a decomposition of an HD fragment enables us to extract translation rules from sub-structures and to create a pseudo-forest from the input dependency tree, in order to make better use of smaller rules.
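Algorithm 2 translates almost directly into Python; a hedged sketch in which a fragment is simply a list of nodes in sentence order, with the head at index hidx:

    def decompose(frag, hidx):
        """Enumerate (core, shell) sub-fragment pairs of an HD fragment
        (Algorithm 2). Both sub-fragments contain the head node."""
        fset = []
        n = len(frag)
        for s in range(0, hidx + 1):      # core start, at or left of the head
            for e in range(hidx, n):      # core end, at or right of the head
                if 0 < e - s < n - 1:     # non-trivial, not the whole fragment
                    core = frag[s:e + 1]
                    shell = frag[:s] + [frag[hidx]] + frag[e + 1:]
                    fset.append((core, shell))
        return fset

For the fragment of Figure 4 ("She is smart" with head "is"), decompose(["She", "is", "smart"], 1) yields the two decompositions (["She", "is"], ["is", "smart"]) and (["is", "smart"], ["She", "is"]), matching the core/shell split shown in the figure.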

4.1 Sub-structural Rules

In the Dep2Str model, rules are extracted from an entire HD fragment. In this paper, when decomposition is considered, we also extract sub-structural rules by taking each possible sub-fragment as a new HD fragment. The algorithm for recognizing the sub-fragments is shown in Algorithm 2.

In Algorithm 2, we find all possible decompositions of an HD fragment. Each decomposition produces two sub-fragments: core and shell. Both core and shell include the head node. core contains the dependents surrounding the head node, with the remaining dependents belonging to shell. Taking Figure 4 as an example, the bottom-right part is core, while the bottom-left part is shell. Each core and shell can be seen as a new HD fragment. Then HD rules are extracted as defined in the Dep2Str model.

Note that, differently from the augmented HD rules, where Meng et al. (2013) annotate rules with combined variables and Xie et al. (2014) create special rules from HD rules at runtime by combining several nodes, our sub-structural rules are standard HD rules, which are extracted from the connected sub-structures of a larger HD fragment and can be used directly in the model.

4.2 Pseudo-Forest

Although sub-structural rules are effective in our experiments (see Section 5), we still do not use them to their best advantage, because we only enrich smaller rules in our model. During decoding, for a large input HD fragment, the model is still more likely to resort to glue rules. However, the idea of decomposition allows us to create a pseudo-forest directly from the dependency tree to alleviate this problem to some extent.

As described above, an HD fragment can be seen as being created by combining two smaller fragments. This means that, for an HD fragment in the input dependency tree, we can translate one of its sub-fragments first and then obtain the whole translation by combining it with the translations of the other sub-fragment. From Algorithm 2, we know that the sub-fragment core covers a continuous phrase of the source sentence. Accordingly, we can translate this fragment first and then build the whole translation by translating the other sub-fragment, shell. Figure 5 gives an example of translating an HD fragment by combining the translations of its sub-fragments.

Instead of taking the dependency tree as the input and looking for all rules for translating sub-fragments of a whole HD fragment, we directly encode the decomposition into the input dependency tree, with the result being a pseudo-forest. Based on the transformation algorithm in Section 3, the pseudo-forest can also be represented in the constituent-tree style, as shown in Figure 6.

Figure 5: An example of translating a large HD fragment with the help of translations of its decomposed fragments.

Figure 6: An example of a pseudo-forest for the dependency tree in Figure 1. It is represented using the constituent-tree style described in Section 3. Edges drawn in the same type of line are owned by the same sub-tree. Solid lines are shared edges.

In the pseudo-forest, we actually only create a forest structure for each HD fragment. For example, based on Figure 5, we create a constituent node labelled with “NN:H0” that covers the sub-fragment “(与) 国会”. In so doing, a new node labelled with “NN:L1” is also created, which covers the node “总统”, because it is now the first left dependent in the sub-fragment “(总统) 国会”.
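Building on the hypothetical decompose sketch above, pseudo-forest construction amounts to adding one shared constituent node per core sub-fragment; roughly:

    def pseudo_forest_spans(frag_spans, hidx):
        """frag_spans[i] = (lo, hi) source span of the ith node of the HD
        fragment (for a dependent, the span of its whole subtree); head at
        index hidx. Returns the span each new forest node would cover."""
        spans = []
        n = len(frag_spans)
        for s in range(0, hidx + 1):
            for e in range(hidx, n):
                if 0 < e - s < n - 1:
                    lo = min(lo_ for lo_, _ in frag_spans[s:e + 1])
                    hi = max(hi_ for _, hi_ in frag_spans[s:e + 1])
                    spans.append((lo, hi))   # core covers a continuous phrase
        return spans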

Compared to the forest-based model (Mi et al., 2008), such a pseudo-forest cannot efficiently reduce the influence of parsing errors, but it is easily available and compatible with the Dep2Str model.


corpus   sentences   words (ch)   words (en)
train    1,501,652   38,388,118   44,901,788
dev      878         22,655       26,905
MT04     1,597       43,719       52,705
MT05     1,082       29,880       35,326

Table 1: Chinese–English corpus. For the English dev and test sets, word counts are averaged across 4 references.

corpus   sentences   words (de)   words (en)
train    2,037,209   52,671,991   55,023,999
dev      3,003       72,661       74,753
test12   3,003       72,603       72,988
test13   3,000       63,412       64,810

Table 2: German–English corpus. In the dev and test sets, there is only one English reference for each German sentence.

5 Experiments

We conduct large-scale experiments to examine our methods on the Chinese–English and German–English translation tasks.

5.1 Data

The Chinese–English training corpus is from the LDC data, including LDC2002E18, LDC2003E07, LDC2003E14, LDC2004T07, the Hansards portion of LDC2004T08 and LDC2005T06. We take NIST 2002 as the development set to tune weights, and NIST 2004 (MT04) and NIST 2005 (MT05) as the test data to evaluate the systems. Table 1 provides a summary of the Chinese–English corpus.

The German–English training corpus is from WMT 2014, including Europarl V7 and News Commentary. News-test 2011 is taken as the development set, while News-test 2012 (test12) and News-test 2013 (test13) are our test sets. Table 2 provides a summary of the German–English corpus.

5.2 Baseline

For both language pairs, we filter out sentence pairs longer than 80 words and keep those with a length ratio less than or equal to 3. English sentences are tokenized with scripts in Moses. Word alignment is performed by GIZA++ (Och and Ney, 2003) with the heuristic function grow-diag-final-and (Koehn et al., 2003). We use SRILM (Stolcke, 2002) to train a 5-gram language model on the Xinhua portion of the English Gigaword corpus 5th edition with modified Kneser-Ney discounting (Chen and Goodman, 1996). Minimum Error Rate Training (Och, 2003) is used to tune weights. Case-insensitive BLEU (Papineni et al., 2002) is used to evaluate the translation results. Bootstrap resampling (Koehn, 2004) is also performed to compute statistical significance with 1000 iterations.

Systems   MT05
XJ        33.91
D2S       33.79

Table 3: BLEU score [%] of the Dep2Str model before (XJ) and after (D2S) the dependency tree transformation. Systems are trained on a selected 1.2M Chinese–English corpus.

We implement the baseline Dep2Str model in Moses with the methods described in this paper; we denote this implementation as D2S. The first experiment we conduct is a sanity check of our implementation. For comparison, we take a separate system (denoted XJ) which implements the Dep2Str model based on Xie et al. (2011). As shown in Table 3, using the transformation of dependency trees, the Dep2Str model implemented in Moses (D2S) is comparable with the standard implementation (XJ).

In the rest of this section, we describe experiments which compare our system with Moses HPB (default setting), and test whether our decomposition approach improves performance over the baseline D2S.

As described in Section 2, the Dep2Str model only extracts phrase rules for translating a single source word (head rules). This model can be enhanced by including phrase rules that cover more than one source word. Thus we also conduct experiments where phrase pairs⁵ are added into our system. We set the length limit for phrases to 7.

5.3 Chinese–English

In the Chinese–English translation task, the Stanford Chinese word segmenter (Chang et al., 2008) is used to segment Chinese sentences into words. The Stanford dependency parser (Chang et al., 2009) parses a Chinese sentence into a projective dependency tree.

⁵ In this paper, the use of phrasal rules is similar to that of the HPB model, so they can be handled by Moses directly.


Systems                   MT04    MT05
Moses HPB                 35.56   33.99
D2S                       33.93   32.56
 +pseudo-forest           34.28   34.10
 +sub-structural rules    34.78   33.63
  +pseudo-forest          35.46   34.13
   +phrase                36.76*  34.67*

Table 4: BLEU score [%] of our method and Moses HPB on the Chinese–English task. We use bold font to indicate that the result of our method is significantly better than D2S at the p ≤ 0.01 level, and * to indicate that the result is significantly better than Moses HPB at the p ≤ 0.01 level.

Table 4 shows the translation results. We find that the decomposition approach proposed in this paper, including sub-structural rules and the pseudo-forest, improves the baseline system D2S significantly (absolute improvement of +1.53/+1.57 (4.5%/4.8%, relative)). As a result, our system achieves comparable (-0.1/+0.14) results with Moses HPB. After including phrasal rules, our system performs significantly better (absolute improvement of +1.2/+0.68 (3.4%/2.0%, relative)) than Moses HPB on both test sets.⁶

5.4 German–English

We tokenize German sentences with the scripts in Moses and use mate-tools7 to perform morphological analysis and parse the sentences (Bohnet, 2010). Then the MaltParser8 converts the parse result into a projective dependency tree (Nivre and Nilsson, 2005).

Experimental results in Table 5 show that incorporating sub-structural rules improves the baseline D2S system significantly (absolute improvement of +0.47/+0.63; 2.3%/2.8% relative), and achieves a slightly better (+0.08) result on test12 than Moses HPB. However, in the German–English task, the pseudo-forest produces a negative effect on the baseline system (-0.07/-0.45), although our system combining both methods together is still better (+0.2/+0.11) than the baseline D2S.

6In our preliminary experiments, phrasal rules are also able to significantly improve our system on their own on both Chinese–English and German–English tasks, but the best performance is achieved by combining them with sub-structural rules and/or the pseudo-forest.

7http://code.google.com/p/mate-tools/
8http://www.maltparser.org/

Systems                   test12  test13
Moses HPB                 20.44   22.77
D2S                       20.05   22.13
 +pseudo-forest           19.98   21.68
 +sub-structural rules    20.52   22.76
  +phrase                 20.91*  23.46*
  +pseudo-forest          20.25   22.24
   +phrase                20.75*  23.20*

Table 5: BLEU score [%] of our method and Moses HPB on the German–English task. We use bold font to indicate that the result of our method is significantly better than the baseline D2S at the p ≤ 0.01 level, and * to indicate that the result is significantly better than Moses HPB at the p ≤ 0.01 level.

Systems                 # Rules
                        CE task   DE task
Moses HPB               388M      684M
D2S                     27M       41M
 +sub-structural rules  116M      121M
 +phrase                215M      274M

Table 6: The number of rules in different systems on the Chinese–English (CE) and German–English (DE) corpora. Note that the pseudo-forest (not listed) does not influence the number of rules.

In the end, by resorting to phrasal rules, our system achieves the best overall performance, which is significantly better (absolute improvement of +0.47/+0.59; 2.3%/2.6% relative) than Moses HPB.

5.5 Discussion

Besides long-distance reordering (Xie et al., 2011), another attraction of the Dep2Str model is its simplicity. It can perform fast translation with fewer rules than HPB. Table 6 shows the number of rules in each system. It is easy to see that all of our systems use fewer rules than HPB. However, the number of rules is not proportional to translation quality, as shown in Tables 4 and 5.

Experiments on the Chinese–English corpus show that it is feasible to translate the dependency tree via transformation for the Dep2Str model described in Section 2. Such a transformation allows the model to be easily integrated into Moses without making changes to the decoder, while at the same time producing results comparable with the standard implementation (shown in Table 3).

The decomposition approach proposed in this paper also shows a positive effect on the baseline Dep2Str system. In particular, sub-structural rules significantly improve the Dep2Str model on both Chinese–English and German–English tasks. However, experiments show that the pseudo-forest significantly improves the D2S system on the Chinese–English data, while it causes translation quality to decline on the German–English data.

Since the pseudo-forest in our system is aimed at translating larger HD fragments by splitting them into pieces, we hypothesize that when translating German sentences the pseudo-forest approach is more likely to result in much worse rules being applied. This is probably due to the shorter Mean Dependency Distance (MDD) and freer word order of German sentences (Eppler, 2013).

6 Conclusion

In this paper, we present an open-source module which integrates a dependency-to-string model into Moses.

This module transforms an input dependency tree into a corresponding constituent tree during decoding, which makes Moses perform dependency-based translation without necessitating any changes to the decoder. Experiments on Chinese–English show that the performance of our system is comparable with that of the standard dependency-based decoder.

Furthermore, we enhance the model by decomposing head-dependent fragments into smaller pieces. This decomposition enriches the Dep2Str model with more rules during training and allows us to use a pseudo-forest as input instead of a dependency tree during decoding. Large-scale experiments on Chinese–English and German–English tasks show that this decomposition can significantly improve the baseline dependency-to-string model on both language pairs. On the German–English task, sub-structural rules are more useful than the pseudo-forest input. In the end, by resorting to phrasal rules, our system performs significantly better than the hierarchical phrase-based model in Moses.

Our implementation of the dependency-to-string model with the methods described in this paper is available at http://computing.dcu.ie/~liangyouli/dep2str.zip. In the future, we would like to conduct more experiments on other language pairs to examine this model, as well as to reduce the restrictions on decomposition.

Acknowledgments

This research has received funding from the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme FP7/2007-2013/ under REA grant agreement no. 317471. This research is also supported by the Science Foundation Ireland (Grant 12/CE/I2267) as part of the Centre for Next Generation Localisation at Dublin City University. The authors of this paper also thank the reviewers for helping to improve this paper.

References

Bernd Bohnet. 2010. Very High Accuracy and Fast Dependency Parsing is Not a Contradiction. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 89–97, Beijing, China.

Pi-Chuan Chang, Michel Galley, and Christopher D. Manning. 2008. Optimizing Chinese Word Segmentation for Machine Translation Performance. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 224–232, Columbus, Ohio.

Pi-Chuan Chang, Huihsin Tseng, Dan Jurafsky, and Christopher D. Manning. 2009. Discriminative Reordering with Chinese Grammatical Relations Features. In Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation, pages 51–59, Boulder, Colorado.

Stanley F. Chen and Joshua Goodman. 1996. An Empirical Study of Smoothing Techniques for Language Modeling. In Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, pages 310–318, Santa Cruz, California.

John Cocke and Jacob T. Schwartz. 1970. Programming Languages and Their Compilers: Preliminary Notes. Technical report, Courant Institute of Mathematical Sciences, New York University, New York, NY.

Michael Collins, Lance Ramshaw, Jan Hajic, and Christoph Tillmann. 1999. A Statistical Parser for Czech. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, pages 505–512, College Park, Maryland.

Eva M. Duran Eppler. 2013. Dependency Distance and Bilingual Language Use: Evidence from German/English and Chinese/English Data. In Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013), pages 78–87, Prague, August.

Heidi J. Fox. 2002. Phrasal Cohesion and Statistical Machine Translation. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, pages 304–311, Philadelphia.

Liang Huang, Kevin Knight, and Aravind Joshi. 2006. A Syntax-directed Translator with Extended Domain of Locality. In Proceedings of the Workshop on Computationally Hard Problems and Joint Inference in Speech and Language Processing, pages 1–8, New York City, New York.

Richard Hudson. 1990. English Word Grammar. Blackwell, Oxford, UK.

Tadao Kasami. 1965. An Efficient Recognition and Syntax-Analysis Algorithm for Context-Free Languages. Technical report, Air Force Cambridge Research Lab, Bedford, MA.

Philipp Koehn. 2004. Statistical Significance Tests for Machine Translation Evaluation. In Proceedings of EMNLP 2004, pages 388–395, Barcelona, Spain, July.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180, Prague, Czech Republic.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48–54, Edmonton, Canada.

Arul Menezes and Chris Quirk. 2005. Dependency Treelet Translation: The Convergence of Statistical and Example-Based Machine-translation? In Proceedings of the Workshop on Example-based Machine Translation at MT Summit X, September.

Fandong Meng, Jun Xie, Linfeng Song, Yajuan Lu, and Qun Liu. 2013. Translation with Source Constituency and Dependency Trees. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1066–1076, Seattle, Washington, USA, October.

Haitao Mi, Liang Huang, and Qun Liu. 2008. Forest-Based Translation. In Proceedings of ACL-08: HLT, pages 192–199, June.

Joakim Nivre and Jens Nilsson. 2005. Pseudo-Projective Dependency Parsing. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 99–106, Ann Arbor, Michigan.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pages 160–167, Sapporo, Japan.

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51, March.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania.

Chris Quirk, Arul Menezes, and Colin Cherry. 2005. Dependency Treelet Translation: Syntactically Informed Phrasal SMT. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 271–279, Ann Arbor, Michigan, June.

Libin Shen, Jinxi Xu, and Ralph Weischedel. 2010. String-to-Dependency Statistical Machine Translation. Computational Linguistics, 36(4):649–671, December.

Andreas Stolcke. 2002. SRILM - an Extensible Language Modeling Toolkit. In Proceedings International Conference on Spoken Language Processing, pages 257–286, November.

Fei Xia and Martha Palmer. 2001. Converting Dependency Structures to Phrase Structures. In Proceedings of the First International Conference on Human Language Technology Research, pages 1–5, San Diego.

Jun Xie, Haitao Mi, and Qun Liu. 2011. A Novel Dependency-to-string Model for Statistical Machine Translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 216–226, Edinburgh, United Kingdom.

Jun Xie, Jinan Xu, and Qun Liu. 2014. Augment Dependency-to-String Translation with Fixed and Floating Structures. In Proceedings of the 25th International Conference on Computational Linguistics, pages 2217–2226, Dublin, Ireland.

Deyi Xiong, Qun Liu, and Shouxun Lin. 2007. A Dependency Treelet String Correspondence Model for Statistical Machine Translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 40–47, Prague, June.

Daniel H. Younger. 1967. Recognition and Parsing of Context-Free Languages in Time n³. Information and Control, 10(2):189–208.

Arnold M. Zwicky. 1985. Heads. Journal of Linguistics, 21:1–29, 3.


Word’s Vector Representations meet Machine Translation

Eva Martínez Garcia
Cristina España-Bonet
TALP Research Center
Universitat Politècnica de Catalunya
[email protected]
@lsi.upc.edu

Jörg Tiedemann
Uppsala University
Department of Linguistics and Philology
[email protected]

Lluís Màrquez
Qatar Computing Research Institute
Qatar Foundation
[email protected]

Abstract

Distributed vector representations of words are useful in various NLP tasks. We briefly review the CBOW approach and propose a bilingual application of this architecture with the aim of improving the consistency and coherence of Machine Translation. The primary goal of the bilingual extension is to handle ambiguous words for which the different senses are conflated in the monolingual setup.

1 Introduction

Machine Translation (MT) systems nowadays achieve high-quality performance. However, they are typically developed at the sentence level, using only local information and ignoring document-level information. Recent work claims that discourse-wide context can help to translate individual words in a way that leads to more coherent translations (Hardmeier et al., 2013; Hardmeier et al., 2012; Gong et al., 2011; Xiao et al., 2011).

Standard SMT systems use n-gram models to represent words in the target language. However, there are other word representation techniques that use vectors of contextual information. Recently, several distributed word representation models have been introduced that have interesting properties regarding the semantic information they capture. In particular, we are interested in the word2vec package (Mikolov et al., 2013a). These models proved to be robust and powerful for predicting semantic relations between words, even across languages. However, they are not able to handle lexical ambiguity as they conflate the word senses of polysemous words into one common representation. This limitation is already discussed in (Mikolov et al., 2013b) and in (Wolf et al., 2014), in which bilingual extensions of the word2vec architecture are proposed. In contrast to their approach, we are not interested in monolingual applications but instead concentrate directly on the bilingual case in connection with MT.

We built bilingual word representation models based on word-aligned parallel corpora by applying the Continuous Bag-of-Words (CBOW) algorithm to the bilingual case (Section 2). We made a twofold preliminary evaluation of the acquired word-pair representations on two different tasks (Section 3): predicting semantically related words (3.1) and cross-lingual lexical substitution (3.2). Section 4 draws the conclusions and sets the future work in a direct application of these models to MT.

2 Semantic Models using CBOW

The basic architecture that we use to build our models is CBOW (Mikolov et al., 2013a). The algorithm uses a neural network (NN) to predict a word taking into account its context, but without considering word order. Despite its drawbacks, we chose to use it since we presume that the translation task applies the same strategy as the CBOW architecture, i.e., from a set of context words one tries to predict a translation of a specific given word.

In the monolingual case, the NN is trained using a monolingual corpus to obtain the corresponding projection matrix that encloses the vector representations of the words. In order to introduce the semantic information in a bilingual scenario, we use a parallel corpus and automatic word alignment to extract a training corpus of word pairs: (w_{i,S}|w_{i,T}). This approach is different from (Wolf et al., 2014), who build an independent model for each language. With our method, we try to capture simultaneously the semantic information associated to the source word and the information in the target side of the translation. In this way, we hope to better capture the semantic information that is implicitly given by translating a text.
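As an illustration, here is a minimal sketch (in Python, not the authors' code; gensim 4.x, the corpus path and all hyperparameters other than the vector size are assumptions) of training CBOW over word-pair tokens:

from gensim.models import Word2Vec

def read_pair_corpus(path):
    # each line holds aligned word-pair tokens, e.g. "bank|banco river|margen ..."
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.split()

sentences = list(read_pair_corpus("aligned_pairs.en-es.txt"))  # hypothetical file
model = Word2Vec(
    sentences,
    vector_size=600,  # the paper trains 600-dimensional vectors
    sg=0,             # CBOW (sg=1 would select skip-gram)
    window=5,         # assumed context window; not stated in the paper
    min_count=5,      # assumed frequency cut-off
)
model.save("bilingual_cbow.en-es.model")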


Model      Accuracy   Known words
mono_en    32.47 %    64.67 %
mono_es    10.24 %    44.96 %
bi_en-es   23.68 %    13.74 %

Table 1: Accuracy on the Word Relationship set.

3 Experiments

The semantic models are built using a combination of freely available corpora for English and Spanish (Europarl v7, United Nations and Multilingual United Nations, and Subtitles2012), which can be found on the OPUS site (Tiedemann, 2012). We trained vectors to represent word-pair forms on these corpora with the word2vec CBOW implementation. We built a training set of almost 600 million words and used 600-dimensional vectors in training. Regarding the alignments, we only used word-to-word links to avoid noise.

3.1 Accuracy of the Semantic Model

We first evaluate the quality of the models on the task of predicting semantically related words. A Spanish native speaker built the bilingual test set, similarly to the process applied to the training data, from a list of 19,544 questions introduced by (Mikolov et al., 2013c). In our bilingual scenario, the task is to predict a pair of words given two pairs of related words. For instance, given the pair Athens|Atenas Greece|Grecia and the question London|Londres, the task is to predict England|Inglaterra.

Table 1 shows the results, both overall accuracy and accuracy over the known words for the models. Using the first 30,000 entries of the model (the most frequent ones), we obtain 32% accuracy for English (mono_en) and 10% for Spanish (mono_es). We chose these parameters for our system to obtain results comparable to the ones in (Mikolov et al., 2013a) for a CBOW architecture trained with 783 million words (50.4%). The drop for the model in Spanish may be due to the fact that it was built from automatic translations. In the bilingual case (bi_en-es), the accuracy is lower than for English, probably due to the noise in translations and word alignment.

3.2 Cross-Lingual Lexical Substitution

Another way to evaluate the semantic models is through the effect they have on translation. We implemented the Cross-Lingual Lexical Substitution task carried out in SemEval-2010 (Task2, 2010) and applied it to a test set of news data from the News Commentary corpus of 2011.

We identify those content words which are translated in more than one way by a baseline translation system (Moses trained with Europarl v7). Given one of these content words, we take the two previous and two following words and look up their vector representations in our bilingual models. We compute a linear combination of these vectors to obtain a context vector. Then, to choose the best translation option, we calculate a score based on the similarity between the vector of every possible translation option seen in the document and the context vector.
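A minimal sketch of this scoring step (assuming a gensim pair-token model as in the previous sketch; uniform weights in the linear combination and cosine similarity are our assumptions):

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def best_option(model, context_pairs, options):
    # context_pairs: pair-tokens of the two previous and two following words;
    # options: candidate pair-tokens "word|translation" seen in the document
    vecs = [model.wv[p] for p in context_pairs if p in model.wv]
    if not vecs:
        return None
    context_vec = np.mean(vecs, axis=0)   # linear combination of context vectors
    known = [o for o in options if o in model.wv]
    return max(known, key=lambda o: cosine(model.wv[o], context_vec), default=None)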

On average there are 615 words per document in the test set and 7% are translated in more than one way by the baseline system. Our bilingual models know on average 87.5% of the words and 83.9% of the ambiguous ones; so although there is good coverage for this test set, some of the candidates cannot be retranslated and some of the options cannot be used because they are missing from the models. The accuracy obtained after retranslation of the known ambiguous words is 62.4%, which is slightly better than the result obtained by using the most frequent translation for ambiguous words (59.8%). Even though this improvement is rather modest, it shows potential benefits of our model in MT.

4 Conclusions

We implemented a new application of word vector representations for MT. The system uses word alignments to build bilingual models with the final aim of improving the lexical selection for words that can be translated in more than one sense.

The models have been evaluated regarding their accuracy when trying to predict related words (Section 3.1) and also regarding their possible effect within a translation system (Section 3.2). In both cases one observes that the quality of the translations and alignments prior to building the semantic models is a bottleneck for the final performance: part of the vocabulary, and therefore translation pairs, is lost in the training process.

Future work includes studying different kinds of alignment heuristics. We plan to develop new features based on the semantic models to use them inside state-of-the-art SMT systems like Moses (Koehn et al., 2007) or discourse-oriented decoders like Docent (Hardmeier et al., 2013).


References

Z. Gong, M. Zhang, and G. Zhou. 2011. Cache-based document-level statistical machine translation. In Proc. of the 2011 Conference on Empirical Methods in NLP, pages 909–919, UK.

C. Hardmeier, J. Nivre, and J. Tiedemann. 2012. Document-wide decoding for phrase-based statistical machine translation. In Proc. of the Joint Conference on Empirical Methods in NLP and Computational Natural Language Learning, pages 1179–1190, Korea.

C. Hardmeier, S. Stymne, J. Tiedemann, and J. Nivre. 2013. Docent: A document-level decoder for phrase-based statistical machine translation. In Proc. of the 51st ACL Conference, pages 193–198, Bulgaria.

P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2007. Moses: open source toolkit for statistical machine translation. In Proc. of the 45th ACL Conference, pages 177–180, Czech Republic.

T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR. http://code.google.com/p/word2vec.

T. Mikolov, Q. V. Le, and I. Sutskever. 2013b. Exploiting similarities among languages for machine translation. In arXiv.

T. Mikolov, I. Sutskever, G. Corrado, and J. Dean. 2013c. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS.

Task2. 2010. Cross-lingual lexical substitution task, SemEval-2010. http://semeval2.fbk.eu/semeval2.php?location=tasksT24.

J. Tiedemann. 2009. News from OPUS - a collection of multilingual parallel corpora with tools and interfaces. In N. Nicolov, K. Bontcheva, G. Angelova and R. Mitkov (eds.) Recent Advances in Natural Language Processing (vol V), pages 237–248, Amsterdam/Philadelphia. John Benjamins.

J. Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC'2012). http://opus.lingfil.uu.se/.

L. Wolf, Y. Hanani, K. Bar, and N. Derschowitz. 2014. Joint word2vec networks for bilingual semantic representations. In Poster sessions at CICLING.

T. Xiao, J. Zhu, S. Yao, and H. Zhang. 2011. Document-level consistency verification in machine translation. In Proc. of Machine Translation Summit XIII, pages 131–138, China.


Context Sense Clustering for Translation

João Casteleiro
Universidade Nova de Lisboa
Departamento de Informática
2829-516 Caparica, Portugal
[email protected]

Gabriel Lopes
Universidade Nova de Lisboa
Departamento de Informática
2829-516 Caparica, Portugal
[email protected]

Joaquim Silva
Universidade Nova de Lisboa
Departamento de Informática
2829-516 Caparica, Portugal
[email protected]

Extended Abstract

Word sense ambiguity is present in all words with more than one meaning in several natural languages and is a fundamental characteristic of human language. This has consequences in translation, as it is necessary to find the right sense and the correct translation for each word. For instance, the English word fair can mean reasonable or market, just as plant can mean factory or herb.

The disambiguation problem has been recognized as a major problem in natural language processing research. Many words have several meanings or senses. The disambiguation task seeks to find out which sense of an ambiguous word is invoked in a particular use of that word. A system for automatic translation from English to Portuguese should know how to translate the word bank as banco (an institution for receiving, lending, exchanging, and safeguarding money) and as margem (the land alongside or sloping down to a river or lake), and also should know that the word banana may appear in the same context as acerola and that these two belong to the hypernym fruit. Whenever a translation system depends on the meaning of the text being processed, disambiguation is beneficial or even necessary. Word Sense Disambiguation is thus essentially a classification problem; given a word X and an inventory of possible semantic tags (which might be translations) for that word, we seek the tag that is appropriate for each individual instance of that word in a particular context.

In recent years research in the field has evolved in different directions. Several studies combining clustering processes with word senses have been carried out. Apidianaki (2010) presents a clustering algorithm for cross-lingual sense induction that generates bilingual semantic inventories from parallel corpora. Li and Church (2007) state that it should not be necessary to look at the entire corpus to know if two words are strongly associated or not; thus, they proposed an algorithm for efficiently computing word associations. In (Bansal et al., 2012), the authors proposed an unsupervised method for clustering translations of words through pointwise mutual information, based on a monolingual and a parallel corpus. Gamallo, Agustini and Lopes (2005) presented an unsupervised strategy to partially acquire syntactic-semantic requirements of nouns, verbs and adjectives from partially parsed monolingual text corpora. The goal is to identify clusters of similar positions by identifying the words that define their requirements extensionally. Brown et al. (1991) described a statistical technique for assigning senses to words based on the context in which they appear. Incorporating the method in a machine translation system, they significantly reduced the translation error rate. Tufis et al. (2004) presented a method that exploits word clustering based on automatic extraction of translation equivalents, supported by available aligned wordnets. Apidianaki (2013) described a system for the SemEval-2013 Cross-lingual Word Sense Disambiguation task, where word senses are represented by means of translation clusters in a cross-lingual strategy.

In this article, a Sense Disambiguation approach using Context Sense Clustering within a monolingual strategy of neighbor features is proposed. We describe a semi-supervised method to classify words based on clusters of strongly correlated contexts. For this purpose, we used a covariance-based correlation measure (Equation 1). Covariance (Equation 2) measures how much two random variables change together. If the values of one variable (sense x) mainly correspond to the values of the other variable (sense y), the variables tend to show similar behavior and the covariance is positive. In the opposite case, covariance is negative. Note that this process is computationally heavy. The system needs to compute all relations between all features of all left words. If the number of features is very large, the processing time increases proportionally.

corr(x, y) = cov(x, y) / ( cov(x, x) + cov(y, y) )    (1)

cov(x, y) = (1/(n − 1)) Σ_i ( f(x, i) · f(y, i) )    (2)

Our goal is to join similar senses of the same ambiguous word in the same cluster, based on feature correlation. Through the analysis of the correlation data, we easily induce sense relations. In order to streamline the task of creating clusters, we opted to use the WEKA tool (Hall et al., 2009) with the X-means algorithm (Pelleg et al., 2000).

Clusters
fructose, glucose
football, chess
title, appendix, annex
telephone, fax
liver, hepatic, kidney
aquatic, marine
disciplinary, infringement, criminal

Table 1. Well-formed resulting clusters

In order to determine the consistency of the obtained clusters, all of them were evaluated with V-measure. V-measure introduces two criteria presented in (Rosenberg and Hirschberg, 2007): homogeneity (h) and completeness (c). A clustering process is considered homogeneously well-formed if all of its clusters contain only data points which are members of a single class. Comparatively, a clustering result satisfies completeness if all data points that are members of a given class are elements of the same cluster.
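For illustration, homogeneity, completeness and V-measure can be computed with scikit-learn (a toy sketch, not the authors' setup; the gold classes and cluster assignments below are invented):

from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

gold = [0, 0, 1, 1, 2, 2]        # true sense classes of the data points
predicted = [0, 0, 1, 2, 2, 2]   # cluster ids, e.g. produced by X-means

print("h =", homogeneity_score(gold, predicted))
print("c =", completeness_score(gold, predicted))
print("V =", v_measure_score(gold, predicted))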

Analysing the results of the context sense clusters obtained (Table 1), we easily see that almost all clusters are generally well formed, with a final V-measure average rating of 67%.

Finally, in order to train a classifier, we chose a training data set with 60 well-formed clusters (with V-measure values ranging between 0.9 and 1). Our testing data set is composed of 60 words related to the clusters but not contained in them. The classifier used was a Support Vector Machine (SVM) (Chang and Lin, 2011). The kernel type applied was the Radial Basis Function (RBF). This kernel non-linearly maps samples into a higher-dimensional space, so it can handle the case when the relation between class labels and attributes is nonlinear, as is the case here. Each word in the training and testing data sets was encoded according to the corpus frequency of all features contained in the clusters. Our purpose was to classify each of the new potentially ambiguous words and fit it into the corresponding cluster (Table 2 and Table 3).
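A minimal sketch of this classification step (scikit-learn's SVC, which wraps LIBSVM, stands in for the paper's direct LIBSVM use; the toy feature matrix and cluster ids are assumptions):

import numpy as np
from sklearn.svm import SVC

# X_train: one row per training word; each column is the corpus frequency
# of one feature contained in the clusters. y_train: the word's cluster id.
X_train = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [0, 3, 3]], dtype=float)
y_train = np.array([29, 29, 7, 7])

clf = SVC(kernel="rbf")   # RBF kernel, as in the paper
clf.fit(X_train, y_train)

X_test = np.array([[1, 0, 0]], dtype=float)   # encoding of a new test word
print("assigned cluster:", clf.predict(X_test)[0])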

Test Words   Label assigned by SVM
Fruit        Cluster 29
Infectious   Cluster 7
Kiwi         Cluster 60
Back         Cluster 57
Legislative  Cluster 34
Grape        Cluster 29
Russian      Cluster 59

Table 2. Results generated by the SVM

Clusters    Content of Clusters
Cluster 7   Viral, contagious, hepatic
Cluster 29  Banana, apple
Cluster 34  Legal, criminal, infringement
Cluster 57  Cervical, lumbar
Cluster 59  French, Italian, Belgian, German
Cluster 60  Thyroid, mammary

Table 3. Cluster correspondence

The obtained results showed that almost all words were tagged in the corresponding cluster. Evaluating system accuracy, we obtained an average value of 78%, which means that of the 60 tested words, 47 words were assigned to the corresponding context cluster.


References

Marianna Apidianaki, Yifan He, et al. 2010. An algorithm for cross-lingual sense-clustering tested in a MT evaluation setting. In Proceedings of the International Workshop on Spoken Language Translation, pages 219–226.

P. Li and K. W. Church. 2007. A sketch algorithm for estimating two-way and multi-way associations. Computational Linguistics, 33(3):305–354.

M. Bansal, J. DeNero, and D. Lin. 2012. Unsupervised translation sense clustering. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 773–782. Association for Computational Linguistics.

P. Gamallo, A. Agustini, and G. P. Lopes. 2005. Clustering syntactic positions with similar semantic requirements. Computational Linguistics, 31(1):107–146.

P. F. Brown, S. A. D. Pietra, V. J. D. Pietra, and R. L. Mercer. 1991. Word-sense disambiguation using statistical methods. In Proceedings of the 29th Annual Meeting on Association for Computational Linguistics, pages 264–270. Association for Computational Linguistics.

D. Tufiş, R. Ion, and N. Ide. 2004. Fine-grained word sense disambiguation based on parallel corpora, word alignment, word clustering and aligned wordnets. In Proceedings of the 20th International Conference on Computational Linguistics, page 1312. Association for Computational Linguistics.

M. Apidianaki. 2013. Cross-lingual word sense disambiguation using translation sense clustering. In Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval 2013), pages 178–182. *SEM and NAACL.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18.

Dan Pelleg, Andrew W. Moore, et al. 2000. X-means: Extending k-means with efficient estimation of the number of clusters. In ICML, pages 727–734.

Andrew Rosenberg and Julia Hirschberg. 2007. V-measure: A conditional entropy-based external cluster evaluation measure. In EMNLP-CoNLL, volume 7, pages 410–420.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27.


Evaluating Word Order Recursively over Permutation-Forests

Miloš Stanojević and Khalil Sima’an
Institute for Logic, Language and Computation
University of Amsterdam
Science Park 107, 1098 XG Amsterdam, The Netherlands

{m.stanojevic,k.simaan}@uva.nl

Abstract

Automatically evaluating the word order of MT system output at the sentence level is challenging. At the sentence level, n-gram counts are rather sparse, which makes it difficult to measure word-order quality effectively using lexicalized units. Recent approaches abstract away from lexicalization by assigning a score to the permutation representing how word positions in system output move around relative to a reference translation. Metrics over permutations exist (e.g., Kendall tau or Spearman rho) and have been shown to be useful in earlier work. However, none of the existing metrics over permutations groups word positions recursively into larger phrase-like blocks, which makes it difficult to account for long-distance reordering phenomena. In this paper we explore novel metrics computed over Permutation Forests (PEFs), packed charts of Permutation Trees (PETs), which are tree decompositions of a permutation into primitive ordering units. We empirically compare the PEFs metric against five known reordering metrics on WMT13 data for ten language pairs. The PEFs metric shows better correlation with human ranking than the other metrics on almost all language pairs. None of the other metrics exhibits as stable behavior across language pairs.

1 Introduction

Evaluating word order (also reordering) in MT is one of the main ingredients in automatic MT evaluation, e.g., (Papineni et al., 2002; Denkowski and Lavie, 2011). To monitor progress on evaluating reordering, recent work explores dedicated reordering evaluation metrics, cf. (Birch and Osborne, 2011; Isozaki et al., 2010; Talbot et al., 2011). Existing work computes the correlation between the ranking of the outputs of different systems by an evaluation metric and the human ranking, e.g., on the WMT evaluation data.

For evaluating reordering, it is necessary to word-align system output with the corresponding reference translation. For convenience, a 1:1 alignment (a permutation) is induced between the words on both sides (Birch and Osborne, 2011), possibly leaving words unaligned on either side. Existing work then concentrates on defining measures of reordering over permutations, cf. (Lapata, 2006; Birch and Osborne, 2011; Isozaki et al., 2010; Talbot et al., 2011). Popular metrics over permutations are: Kendall's tau, Spearman, Hamming distance, Ulam and Fuzzy score. These metrics treat a permutation as a flat sequence of integers or blocks, disregarding the possibility of hierarchical grouping into phrase-like units, which makes it difficult to measure long-range order divergence. Next we show by example that permutations also contain latent atomic units that govern the recursive reordering of phrase-like units. Accounting for these latent reorderings could actually be far simpler than the flat view of a permutation.

Isozaki et al. (2010) argue that the conventional metrics cannot measure well the long-distance reordering between an English reference sentence "A because B" and a Japanese-English hypothesis translation "B because A", where A and B are blocks of any length with internal monotonic alignments. In this paper we explore the idea of factorizing permutations into permutation trees (PETs) (Gildea et al., 2006) and defining new tree-based reordering metrics aimed at dealing with this type of long-range reordering.


Figure 1: A permutation tree for 〈2, 4, 5, 6, 1, 3〉; in bracket notation: 〈2,4,1,3〉( 2, 〈1,2〉( 4, 〈1,2〉( 5, 6 ) ), 1, 3 ).

For the Isozaki et al. (2010) Japanese-English example, there are two PETs (when leaving A and B as encapsulated blocks):

〈2,1〉( A, 〈2,1〉( because, B ) )    and    〈2,1〉( 〈2,1〉( A, because ), B )

Our PET-based metrics interpolate the scores over the two inversion operators 〈2, 1〉 with the internal scores for A and B, incorporating a weight for subtree height. If both A and B are large blocks, internally monotonically (also known as straight) aligned, then our measure will not count every single reordering of a word in A or B, but will consider this case as block reordering. From a PET perspective, the distance of the reordering is far smaller than when looking at a flat permutation. But does this hierarchical view of reordering cohere better with human judgement than string-based metrics?

The example above also shows that a permutation may factorize into different PETs, each corresponding to a different segmentation of a sentence pair into phrase pairs. In this paper we introduce permutation forests (PEFs); a PEF is a hypergraph that compactly packs the set of PETs that factorize a permutation.

There is yet a more profound reasoning behind PETs than only accounting for long-range reorderings. The example in Figure 1 gives the flavor of PETs. Observe how every internal node in this PET dominates a subtree whose fringe1 is itself a permutation over an integer sub-range of the original permutation. Every node is decorated with a permutation over the child positions (called an operator). For example, 〈4, 5, 6〉 constitutes a contiguous range of integers (corresponding to a phrase pair), and hence will be grouped into a subtree, which in turn can be internally re-grouped into a binary-branching subtree. Every node in a PET is minimum branching, i.e., the permutation factorizes into a minimum number of adjacent permutations over integer sub-ranges (Albert and Atkinson, 2005). The node operators in a PET are known to be the atomic building blocks of all permutations (called primal permutations). Because these are the atomic units of reordering, it makes sense to measure reordering as a function of the individual cost of these operators. In this work we propose to compute new reordering measures that aggregate over the individual node-permutations in these PETs.

1Ordered sequence of leaf nodes.

While PETs were exploited rather recently for extracting features used in the BEER metric system description (Stanojevic and Sima’an, 2014) in the official WMT 2014 competition, this work is the first to propose integral recursive metrics over PETs and PEFs solely for measuring reordering (as opposed to individual non-recursive features in a full metric that measures both fluency and adequacy at the same time). We empirically show that a PEF-based evaluation measure correlates better with human rankings than the string-based measures on eight of the ten language pairs in the WMT13 data. For the 9th language pair it is close to best, and for the 10th (English-Czech) we find a likely explanation in the Findings of the 2013 WMT (Bojar et al., 2013). Crucially, the PEF-based measure shows a more stable ranking across language pairs than any of the other measures. The metric is available online as free software2.

2 Measures on permutations: Baselines

In (Birch and Osborne, 2010; Birch and Osborne, 2011) Kendall's tau and Hamming distance are combined with unigram BLEU (BLEU-1), leading to LRscore, which shows better correlation with human judgment than BLEU-4. Birch et al. (2010) additionally test the Ulam distance (longest common subsequence – LCS – normalized by the permutation length) and the square root of Kendall's tau. Isozaki et al. (2010) present a similar approach to (Birch and Osborne, 2011), additionally testing Spearman rho as a distance measure. Talbot et al. (2011) extract a reordering measure from METEOR (Denkowski and Lavie, 2011), dubbed Fuzzy Reordering Score, and evaluate it on MT reordering quality.

2https://github.com/stanojevic/beer


For an evaluation metric we need a function with the standard behaviour of evaluation metrics - the higher the score, the better. Below we define the baseline metrics that were used in our experiments.

Baselines  A permutation over [1..n] (a subrange of the positive integers where n > 1) is a bijective function from [1..n] to itself. To represent permutations we use angle brackets, as in 〈2, 4, 3, 1〉. Given a permutation π over [1..n], the notation π_i (1 ≤ i ≤ n) stands for the integer in the i-th position in π; π(i) stands for the index of the position in π where integer i appears; and π_i^j stands for the (contiguous) subsequence of integers π_i, . . . , π_j.

The definitions of five commonly used metrics over permutations are shown in Figure 2. In these definitions, we use LCS to stand for Longest Common Subsequence, the Kronecker δ[a], which is 1 if (a == true) and zero otherwise, and A_1^n = 〈1, · · · , n〉, the identity permutation over [1..n].

kendall(π) = ( Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} δ[π(i) < π(j)] ) / ( (n² − n)/2 )

hamming(π) = ( Σ_{i=1}^{n} δ[π_i == i] ) / n

spearman(π) = 1 − 3 ( Σ_{i=1}^{n} (π_i − i)² ) / ( n(n² − 1) )

ulam(π) = ( LCS(π, A_1^n) − 1 ) / ( n − 1 )

fuzzy(π) = 1 − ( c − 1 ) / ( n − 1 ), where c is the number of monotone sub-permutations

Figure 2: Five commonly used metrics over permutations.

We note that all existing metrics are defined directly over flat string-level permutations. In the next section we present an alternative view of permutations as compositional, recursive tree structures.
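For concreteness, a minimal sketch (in Python, not the authors' code) of the five Figure 2 metrics over 1-based permutations; the chunk count c in fuzzy follows the monotone-chunk reading of the Fuzzy Reordering Score:

from bisect import bisect_left

def kendall(p):
    n = len(p)
    pos = {v: i for i, v in enumerate(p)}             # pos[v]: where integer v sits
    conc = sum(1 for i in range(1, n) for j in range(i + 1, n + 1)
               if pos[i] < pos[j])                    # concordant pairs
    return conc / ((n * n - n) / 2)

def hamming(p):
    return sum(1 for i, v in enumerate(p, 1) if v == i) / len(p)

def spearman(p):
    n = len(p)
    return 1 - 3 * sum((v - i) ** 2 for i, v in enumerate(p, 1)) / (n * (n * n - 1))

def ulam(p):
    # LCS with the identity permutation = longest increasing subsequence
    tails = []
    for v in p:
        k = bisect_left(tails, v)
        tails[k:k + 1] = [v]
    return (len(tails) - 1) / (len(p) - 1)

def fuzzy(p):
    c = 1 + sum(1 for a, b in zip(p, p[1:]) if b != a + 1)   # monotone chunks
    return 1 - (c - 1) / (len(p) - 1)

if __name__ == "__main__":
    pi = [2, 4, 5, 6, 1, 3]
    for m in (kendall, hamming, spearman, ulam, fuzzy):
        print(m.__name__, round(m(pi), 3))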

3 Measures on Permutation Forests

Existing work, e.g., (Gildea et al., 2006), shows how to factorize any permutation π over [1..n] into a canonical permutation tree (PET). Here we summarize the relevant aspects and extend PETs to permutation forests (PEFs).

A non-empty subsequence π_i^j of a permutation π is isomorphic with a permutation over [1..(j − i + 1)] iff the set {π_i, . . . , π_j} is a contiguous range of positive integers. We use the term sub-permutation of π to refer to a subsequence of π that is isomorphic with a permutation. Note that not every subsequence of π is necessarily isomorphic with a permutation, e.g., the subsequence 〈3, 5〉 of 〈1, 2, 3, 5, 4〉 is not a sub-permutation. One sub-permutation π1 of π is smaller than another sub-permutation π2 of π iff every integer in π1 is smaller than all integers in π2. In this sense we can put a full order on non-overlapping sub-permutations of π and rank them from smallest to largest.

For every permutation π there is a minimum number of adjacent sub-permutations it can be factorized into (see e.g., (Gildea et al., 2006)). We call this minimum number the arity of π and denote it with a(π) (or simply a when π is understood from the context). For example, the arity of π = 〈5, 7, 4, 6, 3, 1, 2〉 is a = 2 because it can be split into a minimum of two sub-permutations (Figure 3), e.g. 〈5, 7, 4, 6, 3〉 and 〈1, 2〉 (but alternatively also 〈5, 7, 4, 6〉 and 〈3, 1, 2〉). In contrast, π = 〈2, 4, 1, 3〉 (also known as the Wu (1997) permutation) cannot be split into fewer than four sub-permutations, i.e., a = 4. Factorization can be applied recursively to the sub-permutations of π, resulting in a tree structure (see Figure 3) called a permutation tree (PET) (Gildea et al., 2006; Zhang and Gildea, 2007; Maillette de Buy Wenniger and Sima’an, 2011).

Some permutations factorize into multiple alternative PETs. For π = 〈4, 3, 2, 1〉 there are five PETs, shown in Figure 3. The alternative PETs can be packed into an O(n²) permutation forest (PEF). For many computational purposes, a single canonical PET is sufficient, cf. (Gildea et al., 2006). However, while different PETs of π exhibit the same reordering pattern, their different binary branching structures might indicate important differences, as we show in our experiments.

Figure 3: A PET for π = 〈5, 7, 4, 6, 3, 1, 2〉, in bracket notation 〈2,1〉( 〈2,1〉( 〈2,4,1,3〉( 5, 7, 4, 6 ), 3 ), 〈1,2〉( 1, 2 ) ), and the five different (all binary-branching, 〈2,1〉-labelled) PETs for π = 〈4, 3, 2, 1〉.

A permutation forest (akin to a parse forest) F for π (over [1..n]) is a data structure consisting of a subset of {[[i, j, I_i^j, O_i^j]] | 0 ≤ i ≤ j ≤ n}, where I_i^j is a (possibly empty) set of inferences (sets of split points) for π_{i+1}^j and O_i^j is an operator shared by all inferences of π_{i+1}^j. If π_{i+1}^j is a sub-permutation and it has arity a ≤ (j − (i + 1)), then each inference consists of an (a − 1)-tuple [l_1, . . . , l_{a−1}], where for each 1 ≤ x ≤ (a − 1), l_x is a "split point" given by the index of the last integer in the x-th sub-permutation in π. The permutation of the a sub-permutations ("children" of π_{i+1}^j) is stored in O_i^j and is the same for all inferences of that span (Zhang et al., 2008).

Figure 4: The three factorizations of π = 〈4, 3, 2, 1〉 into two sub-permutations: (4 | 3 2 1), (4 3 | 2 1) and (4 3 2 | 1), each with operator 〈2, 1〉.

Let us exemplify the inferences on π = 〈4, 3, 2, 1〉 (see Figure 4), which factorizes into pairs of sub-permutations (a = 2): a split point can be at positions with index l_1 ∈ {1, 2, 3}. Each of these split points (factorizations) of π is represented as an inference for the same root node, which covers the whole of π (placed in entry [0, 4]); the operator of the inference here consists of the permutation 〈2, 1〉 (swapping the two ranges covered by the child sub-permutations), and an inference consists of a − 1 indexes l_1, . . . , l_{a−1} signifying the split points of π into sub-permutations: since a = 2 for π, a single index l_1 ∈ {1, 2, 3} is stored with every inference. For the factorization ((4, 3), (2, 1)) the index l_1 = 2 signifies that the second position is a split point into 〈4, 3〉 (stored in entry [0, 2]) and 〈2, 1〉 (stored in entry [2, 4]). For the other factorizations of π similar inferences are stored in the permutation forest.

Figure 5 shows a simple top-down factorization algorithm which starts out by computing the arity a using function a(π). If a = 1, a single leaf node is stored with an empty set of inferences. If a > 1, the algorithm computes all possible factorizations of π into a sub-permutations (a sequence of a − 1 split points) and stores their inferences together as I_i^j, with their operator O_i^j, associated with a node in entry [[i, j, I_i^j, O_i^j]]. Subsequently, the algorithm applies recursively to each sub-permutation. Efficiency is a topic beyond the scope of this paper, but this naive algorithm has worst-case time complexity O(n³), and when computing only a single canonical PET this can be O(n) (see e.g., (Zhang and Gildea, 2007)).

Function PEF(i, j, π, F);
# Args: sub-permutation π over [i..j] and forest F
Output: Parse-Forest F(π) for π;

begin
  if ([[i, j, ?]] ∈ F) then return F;   # memoization
  a := a(π);
  if a = 1 return F := F ∪ {[[i, j, ∅]]};
  For each set of split points {l_1, . . . , l_{a−1}} do
    O_i^j := RankListOf(π_{(l_0+1)}^{l_1}, π_{(l_1+1)}^{l_2}, . . . , π_{(l_{a−1}+1)}^{l_a});
    I_i^j := I_i^j ∪ [l_1, . . . , l_{a−1}];
    For each π_v ∈ {π_{(l_0+1)}^{l_1}, π_{(l_1+1)}^{l_2}, . . . , π_{(l_{a−1}+1)}^{l_a}} do
      F := F ∪ PEF(π_v);
  F := F ∪ {[[i, j, I_i^j, O_i^j]]};
  Return F;
end;

Figure 5: Pseudo-code of the permutation-forest factorization algorithm. Function a(π) returns the arity of π. Function RankListOf(r_1, . . . , r_m) returns the list of rank positions (i.e., a permutation) of the sub-permutations r_1, . . . , r_m after sorting them smallest first. The top-level call to this algorithm uses π, i = 0, j = n and F = ∅.

Our measure (PEFscore) uses a function opScore(p) which assigns a score to a given operator. This function can be instantiated with any of the existing scoring measures listed in Section 2, but in this case we opted for a very simple function which gives score 1 to a monotone permutation and score 0 to any other permutation.

Given an inference l ∈ I_i^j where l = [l_1, . . . , l_{a−1}], we use the notation l_x to refer to split point l_x in l, where 1 ≤ x ≤ (a − 1), with the convenient boundary assumption that l_0 = i and l_a = j.


PEFscore(π) = φ_node(0, n, PEF(π))

φ_node(i, j, F) =
  1                                   if I_i^j = ∅
  opScore(O_i^j)                      else if a(π_{i+1}^j) = j − i
  β · opScore(O_i^j) + (1 − β) · ( Σ_{l ∈ I_i^j} φ_inf(l, F, a(π_{i+1}^j)) ) / |I_i^j|   otherwise
  (the last term is the average inference score over I_i^j)

φ_inf(l, F, a) = ( Σ_{x=1}^{a} δ[l_x − l_{x−1} > 1] · φ_node(l_{x−1}, l_x, F) ) / ( Σ_{x=1}^{a} δ[l_x − l_{x−1} > 1] )
  (the average score over the non-terminal children)

opScore(p) = 1 if p = 〈1, 2〉, else 0

Figure 6: The PEF score

The PEF-score, PEFscore(π) in Figure 6, computes a score for the single root node [[0, n, I_0^n, O_0^n]] in the permutation forest. This score is the average inference score φ_inf over all inferences of this node. The score of an inference φ_inf interpolates (β) between the opScore of the operator in the current span and (1 − β) the scores of each child node. The interpolation parameter β can be tuned on a development set.

The PET-score (single PET) is a simplification of the PEF-score where the summation over all inferences of a node, Σ_{l ∈ I_i^j} in φ_node, is replaced by "Select a canonical l ∈ I_i^j".
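As an illustration of Figures 5 and 6, here is a minimal sketch (in Python; not the authors' released implementation, which packs all PETs of a permutation into a forest) of the single canonical-PET variant: factorize a permutation into one PET (taking the left-most valid binary split, and the maximal proper blocks when no binary split exists) and score it recursively with the 0/1 opScore:

def is_block(seq):
    # a sequence of distinct integers is a sub-permutation iff its values
    # form a contiguous range
    return max(seq) - min(seq) + 1 == len(seq)

def rank_list(mins):
    # operator pattern of the blocks, e.g. mins [5, 7, 4, 6] -> [2, 4, 1, 3]
    order = sorted(mins)
    return [order.index(m) + 1 for m in mins]

def pet(perm):
    """One canonical PET: leaves are integers, nodes are (operator, children)."""
    if len(perm) == 1:
        return perm[0]
    for k in range(1, len(perm)):                 # arity 2: try binary splits
        left, right = perm[:k], perm[k:]
        if is_block(left) and is_block(right):    # left-most split as canonical choice
            op = [1, 2] if max(left) < min(right) else [2, 1]
            return (op, [pet(left), pet(right)])
    blocks, start = [], 0                         # arity >= 4: maximal proper blocks
    while start < len(perm):
        for end in range(len(perm), start, -1):
            if end - start < len(perm) and is_block(perm[start:end]):
                blocks.append(perm[start:end])
                start = end
                break
    op = rank_list([min(b) for b in blocks])
    return (op, [pet(b) for b in blocks])

def op_score(op):
    return 1.0 if op == sorted(op) else 0.0       # 1 for a monotone operator

def pet_score(node, beta=0.6):
    if isinstance(node, int):                     # leaf: empty inference set
        return 1.0
    op, children = node
    nonterm = [c for c in children if not isinstance(c, int)]
    if not nonterm:                               # all children are single words
        return op_score(op)
    child_avg = sum(pet_score(c, beta) for c in nonterm) / len(nonterm)
    return beta * op_score(op) + (1 - beta) * child_avg

if __name__ == "__main__":
    print(pet([2, 4, 5, 6, 1, 3]))                # the Figure 1 permutation
    print(pet_score(pet([2, 4, 5, 6, 1, 3])))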

4 Experimental setting

Data  The data used for the experiments are human rankings of translations from WMT13 (Bojar et al., 2013). The data covers 10 language pairs with a diverse set of systems used for translation. Each human evaluator was presented with 5 different translations, a source sentence and a reference translation, and asked to rank the system translations by their quality (ties were allowed).

Meta-evaluation  The standard way of performing meta-evaluation at the sentence level is with the Kendall tau correlation coefficient (Callison-Burch et al., 2012), computed on the number of times an evaluation metric and a human evaluator agree (and disagree) on the rankings of pairs of translations.3 We extract pairs of translations from the human-evaluated data and compute their scores with all metrics. If the ranking assigned by a metric is the same as the ranking assigned by a human evaluator, the pair is considered concordant, otherwise it is discordant. All pairs which are given the same score by the metric or are judged as ties by human evaluators are not used in meta-evaluation. The formula used for computing the Kendall tau correlation coefficient is shown in Equation 1. Note that the formula for the Kendall tau rank correlation coefficient used in meta-evaluation is different from the Kendall tau similarity function used for evaluating permutations. The values it returns are in the range [−1, 1], where −1 means that the order is always opposite from the human judgment, while 1 means that the metric ranks the system translations in the same way as humans do.

3We would like to extend our work also to English-Japanese but we do not have access to such data at the moment. In any case, the WMT13 data is the largest publicly available data of this kind.

τ = (#concordant pairs − #discordant pairs) / (#concordant pairs + #discordant pairs)    (1)

Evaluating reordering  Since system translations do not differ only in word order but also in lexical choice, we follow Birch and Osborne (2010) and interpolate the score given by each reordering metric with the same lexical score. For lexical scoring we use unigram BLEU. The parameter α that balances the weights of these two metrics is chosen to be 0.5, so that it would not underestimate the lexical differences between translations (α ≪ 0.5) but also would not turn the whole metric into unigram BLEU (α ≫ 0.5). The equation for this interpolation is shown in Equation 2.4


FullMetric(ref, sys) = α · lexical(ref, sys) + (1 − α) · bp(|ref|, |π|) · ordering(π)    (2)

where π(ref, sys) is the permutation representing the word alignment from sys to ref. The effect of α on the German-English evaluation is visible in Figure 7. The PET and PEF measures have an extra parameter β, which weighs the importance of long-distance errors and also needs to be tuned. In Figure 8 we can see the effect of β on German-English for α = 0.5. For all language pairs, β = 0.6 gives good results for both PETs and PEFs, so we picked that value for β in our experiments.

Where π(ref, sys) is the permutation represent-ing the word alignment from sys to ref . The ef-fect of α on the German-English evaluation is vis-ible on Figure 7. The PET and PEF measures havean extra parameter β that gives importance to thelong distance errors that also needs to be tuned. OnFigure 8 we can see the effect of β on German-English for α = 0.5. For all language pairs forβ = 0.6 both PETs and PEFs get good results sowe picked that as value for β in our experiments.

Figure 7: Effect of α on German-English evalua-tion for β = 0.6

Choice of word alignments The issue we didnot discuss so far is how to find a permutationfrom system and reference translations. One wayis to first get alignments between the source sen-tence and the system translation (from a decoderor by automatically aligning sentences), and alsoalignments between the source sentence and thereference translation (manually or automaticallyaligned). Subsequently we must make those align-ments 1-to-1 and merge them into a permutation.That is the approach that was followed in previ-ous work (Birch and Osborne, 2011; Talbot et al.,

4Note that for reordering evaluation it does not makesense to tune α because that would blur the individual contri-butions of reordering and adequacy during meta evaluation,which is confirmed by Figure 7 showing that α � 0.5 leadsto similar performance for all metrics.

Figure 8: Effect of β on German-English evalua-tion for α = 0.5

2011). Alternatively, we may align system and ref-erence translations directly. One of the simplestways to do that is by finding exact matches be-tween words and bigrams between system and ref-erence translation as done in (Isozaki et al., 2010).The way we align system and reference transla-tions is by using the aligner supplied with ME-TEOR (Denkowski and Lavie, 2011) for finding1-to-1 alignments which are later converted to apermutation. The advantage of this method is thatit can do non-exact matching by stemming or us-ing additional sources for semantic similarity suchas WordNets and paraphrase tables. Since we willnot have a perfect permutation as input, becausemany words in the reference or system transla-tions might not be aligned, we introduce a brevitypenalty (bp(·, ·) in Equation 2) for the orderingcomponent as in (Isozaki et al., 2010). The brevitypenalty is the same as in BLEU with the smalldifference that instead of taking the length of sys-tem and reference translation as its parameters, ittakes the length of the system permutation and thelength of the reference.
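A minimal sketch (in Python; the BLEU-style exponential form of bp is our assumption of its exact shape) of the brevity penalty and the Equation 2 interpolation:

import math

def brevity_penalty(ref_len, perm_len):
    # as in BLEU, but over the length of the system permutation instead of
    # the length of the system translation
    if perm_len == 0:
        return 0.0
    return 1.0 if perm_len >= ref_len else math.exp(1.0 - ref_len / perm_len)

def full_metric(lexical, ordering, ref_len, perm_len, alpha=0.5):
    # Equation 2: interpolate the lexical (unigram BLEU) score with the
    # brevity-penalized reordering score
    return alpha * lexical + (1 - alpha) * brevity_penalty(ref_len, perm_len) * ordering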

5 Empirical results

The results are shown in Table 1 and Table 2. These scores could be much higher if we used a more sophisticated measure than unigram BLEU for the lexical part (for example, recall is very useful in the evaluation of system translations (Lavie et al., 2004)). However, this is not the issue here, since our goal is merely to compare different ways to evaluate word order. All metrics that we tested have the same lexical component, get the same permutation as their input, and have the same value for α.


            English-  English-  English-  English-  English-
            Czech     Spanish   German    Russian   French
Kendall     0.16      0.170     0.183     0.193     0.218
Spearman    0.157     0.170     0.181     0.192     0.215
Hamming     0.150     0.163     0.168     0.187     0.196
FuzzyScore  0.155     0.166     0.178     0.189     0.215
Ulam        0.159     0.170     0.181     0.189     0.221
PEFs        0.156     0.173     0.185     0.196     0.219
PETs        0.157     0.165     0.182     0.195     0.216

Table 1: Sentence-level Kendall tau scores for translation out of English with α = 0.5 and β = 0.6

            Czech-    Spanish-  German-   Russian-  French-
            English   English   English   English   English
Kendall     0.196     0.265     0.235     0.173     0.223
Spearman    0.199     0.265     0.236     0.173     0.222
Hamming     0.172     0.239     0.215     0.157     0.206
FuzzyScore  0.184     0.263     0.228     0.169     0.216
Ulam        0.188     0.264     0.232     0.171     0.221
PEFs        0.201     0.265     0.237     0.181     0.228
PETs        0.200     0.264     0.234     0.174     0.221

Table 2: Sentence-level Kendall tau scores for translation into English with α = 0.5 and β = 0.6

5.1 Does hierarchical structure improve evaluation?

The results in Tables 1, 2 and 3 suggest that the PEFscore, which uses hierarchy over permutations, outperforms the string-based permutation metrics in the majority of the language pairs. The main exception is the English-Czech language pair, in which both the PETs- and PEFs-based metrics do not give good results compared to some other metrics. For a discussion of English-Czech, see Section 6.1.

5.2 Do PEFs help over one canonical PET?

From Figures 9 and 10 it is clear that using all permutation trees instead of only the canonical ones makes the metric more stable in all language pairs. Not only does it make the results more stable, but it

metric      avg rank   avg Kendall
PEFs        1.6        0.2041
Kendall     2.65       0.2016
Spearman    3.4        0.201
PETs        3.55       0.2008
Ulam        4          0.1996
FuzzyScore  5.8        0.1963
Hamming     7          0.1853

Table 3: Average ranks and average Kendall scores for each tested metric over all language pairs

Figure 9: Plot of scaled Kendall tau correlation for translation from English

also improves them in all cases except English-Czech, where both PETs and PEFs perform badly. The main reason why PEFs outperform PETs is that they encode all possible phrase segmentations of monotone and inverted sub-permutations. By giving a score that considers all segmentations, PEFs also include the right segmentation (the one perceived by human evaluators as the right segmentation), while PETs get the right segmentation only if it is the canonical one.

5.3 Is improvement consistent over language pairs?

Table 3 shows the average rank (a metric's position after sorting all metrics by their correlation for each language pair) and the average Kendall tau correlation coefficient over the ten language pairs. The table shows clearly that the PEFs metric outperforms all other metrics. To make it more visible how the metrics perform on the different language pairs, Figures 9 and 10 show the Kendall tau correlation coefficient scaled between the best scoring metric for the given language pair (in most cases PEFs) and


Figure 10: Plot of scaled Kendall tau correlation for translation into English

the worst scoring metric (in all cases the Hamming score). We can see that, except for English-Czech, PEFs are consistently the best or second best (only in English-French) metric in all language pairs. PETs are not stable and do not give equally good results in all language pairs. The Hamming distance is without exception the worst metric for evaluation, since it is very strict about the positioning of words (it does not take the relative ordering between words into account). Kendall tau is the only string-based metric that gives relatively good scores in all language pairs, and in one (English-Czech) it is the best scoring one.

6 Further experiments and analysis

So far we have shown that PEFs outperform the existing metrics over the majority of language pairs. There are two pending issues to discuss. Why is English-Czech seemingly so difficult? And does preferring inversion over non-binary branching correlate better with human judgment?

6.1 The results on English-Czech

The English-Czech language pair turned out to be the hardest one to evaluate for all metrics. All metrics that were used in the meta-evaluation that we conducted give a much lower Kendall tau correlation coefficient compared to the other language pairs. The experiments conducted by other researchers on the same dataset (Machacek and Bojar, 2013), using full evaluation metrics, also obtain a far lower Kendall tau correlation coefficient for English-Czech than for other language pairs. In the description of the WMT13 data that we used (Bojar et al., 2013), it is shown that annotator agreement for English-Czech is a few times lower than for other languages. English-Russian, which is linguistically similar to English-Czech, does not show low numbers in these categories, and is one of the language pairs where our metrics perform best. The alignment ratio is equally high for English-Czech and English-Russian (but that does not rule out the possibility that the alignments are of different quality). One seemingly unlikely explanation is that English-Czech might be a harder task in general, and might require a more sophisticated measure. However, the more plausible explanation is that the WMT13 data for English-Czech is not of the same quality as that for the other language pairs. It could be that data filtering, for example by taking only judgments on which many evaluators agree, could give more trustworthy results.

6.2 Is inversion preferred over non-binary branching?

Since our original version of the scoring function for PETs and PEFs on the operator level does not discriminate between kinds of non-monotone operators (all non-monotone operators get zero as a score), we also tested whether discriminating between inversion (binary) and non-binary operators makes any difference.

               English-  English-  English-  English-  English-
               Czech     Spanish   German    Russian   French
PEFs γ = 0.0   0.156     0.173     0.185     0.196     0.219
PEFs γ = 0.5   0.157     0.175     0.183     0.195     0.219
PETs γ = 0.0   0.157     0.165     0.182     0.195     0.216
PETs γ = 0.5   0.158     0.165     0.183     0.195     0.217

Table 4: Sentence-level Kendall tau scores for translation out of English for different γ with α = 0.5 and β = 0.6

Intuitively, we might expect that inverted binary operators are preferred by human evaluators over non-binary ones. So instead of assigning a score of zero to inverted nodes we give them 0.5, while non-binary nodes remain at zero. The experiments with the inverted operator scored with 0.5 (i.e., γ = 0.5) are shown in Tables 4 and 5. The results show that there is no clear improvement from distinguishing between the two kinds of


               Czech-    Spanish-  German-   Russian-  French-
               English   English   English   English   English
PEFs γ = 0.0   0.201     0.265     0.237     0.181     0.228
PEFs γ = 0.5   0.201     0.264     0.235     0.179     0.227
PETs γ = 0.0   0.200     0.264     0.234     0.174     0.221
PETs γ = 0.5   0.202     0.263     0.235     0.176     0.224

Table 5: Sentence-level Kendall tau scores for translation into English for different γ with α = 0.5 and β = 0.6

non-monotone operators on the nodes.

7 Conclusions

Representing order differences as compact permutation forests provides a good basis for developing evaluation measures of word order differences. These hierarchical representations of permutations bring together two crucial elements: (1) grouping words into blocks, and (2) factorizing reordering phenomena recursively over these groupings. Earlier work on MT evaluation metrics has often stressed the importance of the first ingredient (grouping into blocks) but employed it merely in a flat (non-recursive) fashion. In this work we presented novel metrics based on permutation trees and forests (the PETscore and PEFscore) in which the second ingredient (factorizing reordering phenomena recursively) plays a major role. Permutation forests compactly represent all possible block groupings for a given permutation, whereas permutation trees select a single canonical grouping. Our experiments with WMT13 data show that our PEFscore metric outperforms the existing string-based metrics on the large majority of language pairs, and in the minority of cases where it is not ranked first, it ranks high. Crucially, the PEFscore is by far the most stable reordering score over ten language pairs, and works well also for language pairs with long-range reordering phenomena (English-German, German-English, English-Russian and Russian-English).

Acknowledgments

This work is supported by STW grant nr. 12271 and NWO VICI grant nr. 277-89-002. We thank TAUS and the other DatAptor project User Board members. We also thank Ivan Titov for helpful comments on the ideas presented in this paper.

References

Michael H. Albert and Mike D. Atkinson. 2005. Simple permutations and pattern restricted permutations. Discrete Mathematics, 300(1-3):1–15.

Alexandra Birch and Miles Osborne. 2010. LRscore for Evaluating Lexical and Reordering Quality in MT. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 327–332, Uppsala, Sweden, July. Association for Computational Linguistics.

Alexandra Birch and Miles Osborne. 2011. Reordering Metrics for MT. In Proceedings of the Association for Computational Linguistics, Portland, Oregon, USA. Association for Computational Linguistics.

Alexandra Birch, Miles Osborne, and Phil Blunsom. 2010. Metrics for MT evaluation: evaluating reordering. Machine Translation, pages 1–12.

Ondrej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44, Sofia, Bulgaria, August. Association for Computational Linguistics.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2012. Findings of the 2012 Workshop on Statistical Machine Translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 10–51, Montreal, Canada, June. Association for Computational Linguistics.

Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems. In Proceedings of the EMNLP 2011 Workshop on Statistical Machine Translation.

Daniel Gildea, Giorgio Satta, and Hao Zhang. 2006. Factoring Synchronous Grammars by Sorting. In ACL.

Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Automatic evaluation of translation quality for distant language pairs. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, pages 944–952, Stroudsburg, PA, USA. Association for Computational Linguistics.

Mirella Lapata. 2006. Automatic Evaluation of Information Ordering: Kendall's Tau. Computational Linguistics, 32(4):471–484.

Alon Lavie, Kenji Sagae, and Shyamsundar Jayaraman. 2004. The significance of recall in automatic metrics for MT evaluation. In Proceedings of the Sixth Conference of the Association for Machine Translation in the Americas.

Matous Machacek and Ondrej Bojar. 2013. Results of the WMT13 Metrics Shared Task. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 45–51, Sofia, Bulgaria, August. Association for Computational Linguistics.

Gideon Maillette de Buy Wenniger and Khalil Sima'an. 2011. Hierarchical Translation Equivalence over Word Alignments. In ILLC Prepublication Series, PP-2011-38. University of Amsterdam.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of ACL'02, pages 311–318, Philadelphia, PA, USA.

Milos Stanojevic and Khalil Sima'an. 2014. BEER: BEtter Evaluation as Ranking. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 414–419, Baltimore, Maryland, USA, June. Association for Computational Linguistics.

David Talbot, Hideto Kazawa, Hiroshi Ichikawa, Jason Katz-Brown, Masakazu Seno, and Franz Och. 2011. A Lightweight Evaluation Framework for Machine Translation Reordering. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 12–21, Edinburgh, Scotland, July. Association for Computational Linguistics.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 3(23):377–403.

Hao Zhang and Daniel Gildea. 2007. Factorization of Synchronous Context-Free Grammars in Linear Time. In NAACL Workshop on Syntax and Structure in Statistical Translation (SSST), pages 25–32.

Hao Zhang, Daniel Gildea, and David Chiang. 2008. Extracting synchronous grammar rules from word-level alignments in linear time. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1, pages 1081–1088. Association for Computational Linguistics.


Preference Grammars and Soft Syntactic Constraints for GHKM Syntax-based Statistical Machine Translation

Matthias Huck and Hieu Hoang and Philipp Koehn
School of Informatics
University of Edinburgh
10 Crichton Street
Edinburgh EH8 9AB, UK
{mhuck,hhoang,pkoehn}@inf.ed.ac.uk

Abstract

In this work, we investigate the effectiveness of two techniques for a feature-based integration of syntactic information into GHKM string-to-tree statistical machine translation (Galley et al., 2004): (1.) Preference grammars on the target language side promote syntactic well-formedness during decoding while also allowing for derivations that are not linguistically motivated (as in hierarchical translation). (2.) Soft syntactic constraints augment the system with additional source-side syntax features while not modifying the set of string-to-tree translation rules or the baseline feature scores.

We conduct experiments with a state-of-the-art setup on an English→German translation task. Our results suggest that preference grammars for GHKM translation are inferior to the plain target-syntactified model, whereas the enhancement with soft source syntactic constraints provides consistent gains. By employing soft source syntactic constraints with sparse features, we are able to achieve improvements of up to 0.7 points BLEU and 1.0 points TER.

1 Introduction

Previous research in both formally syntax-based (i.e., hierarchical) and linguistically syntax-based statistical machine translation has demonstrated that significant quality gains can be achieved via integration of syntactic information as features in a non-obtrusive manner, rather than as hard constraints.

We implemented two feature-based extensions for a GHKM-style string-to-tree translation system (Galley et al., 2004):

• Preference grammars to soften the hard target-side syntactic constraints that are imposed by the target non-terminal labels.

• Soft source-side syntactic constraints that enhance the string-to-tree translation model with input tree features based on source syntax labels.

The empirical results on an English→German translation task are twofold. Target-side preference grammars do not show an improvement over the string-to-tree baseline with syntactified translation rules. Source-side syntactic constraints, on the other hand, yield consistent moderate gains if applied as supplementary features in the string-to-tree setup.

2 Outline

The paper is structured as follows: First we give an overview of important related publications (Section 3). In Section 4, we review the fundamentals of syntax-based translation in general, and in particular those of GHKM string-to-tree translation.

We present preference grammars for GHKM translation in Section 5. Our technique for applying soft source syntactic constraints in GHKM string-to-tree translation is described in Section 6.

Section 7 contains the empirical part of the paper. We first describe our experimental setup (7.1), followed by a presentation and discussion of the translation results (7.2). We conclude the paper in Section 8.

3 Related Work

Our syntactic translation model conforms to the GHKM syntax approach as proposed by Galley, Hopkins, Knight, and Marcu (Galley et al., 2004), with composed rules as in (Galley et al., 2006) and (DeNeefe et al., 2007). Systems based on


this paradigm have recently been among the top-ranked submissions to public evaluation campaigns (Williams et al., 2014; Bojar et al., 2014).

Our soft source syntactic constraints features borrow ideas from Marton and Resnik (2008), who proposed a comparable approach for hierarchical machine translation. The major difference is that the features of Marton and Resnik (2008) are based only on the labels from the input trees as seen in tuning and decoding. They penalize violations of constituent boundaries but do not employ syntactic parse annotation of the source side of the training data. We, in contrast, equip the rules with latent source label properties, allowing for features that can check for conformance of input tree labels with source labels that have been seen in training.

Other groups have applied similar techniques to a string-to-dependency system (Huang et al., 2013) and—like in our work—a GHKM string-to-tree system (Zhang et al., 2011). Both Huang et al. (2013) and Zhang et al. (2011) store source labels as additional information with the rules. They however investigate somewhat different feature functions than we do.

Marton and Resnik (2008) evaluated their method on the NIST Chinese→English and Arabic→English tasks. Huang et al. (2013) and Zhang et al. (2011) present results on the NIST Chinese→English task. We focus our attention on a very different task: English→German.

4 Syntax-based Translation

In syntax-based translation, a probabilistic synchronous context-free grammar (SCFG) is induced from bilingual training corpora. The parallel training data is word-aligned and annotated with syntactic parses on either the target side (string-to-tree), the source side (tree-to-string), or both (tree-to-tree). A syntactic rule extraction procedure extracts rules which are consistent with the word alignment and comply with certain syntactic validity constraints.

Extracted rules are of the form A,B → 〈α, β, ∼〉. The right-hand side of the rule, 〈α, β〉, is a bilingual phrase pair that may contain non-terminal symbols, i.e. α ∈ (VF ∪ NF)+ and β ∈ (VE ∪ NE)+, where VF and VE denote the source and target terminal vocabularies, and NF and NE denote the source and target non-terminal vocabularies, respectively. The non-terminals on the source side and on the target side of rules are linked in a one-to-one correspondence. The ∼ relation defines this one-to-one correspondence. The left-hand side of the rule is a pair of source and target non-terminals, A ∈ NF and B ∈ NE.

Decoding is typically carried out with a parsing-based algorithm, in our case a customized version of CYK+ (Chappelier and Rajman, 1998). The parsing algorithm is extended to handle translation candidates and to incorporate language model scores via cube pruning (Chiang, 2007).

4.1 GHKM String-to-Tree Translation

In GHKM string-to-tree translation (Galley et al., 2004; Galley et al., 2006; DeNeefe et al., 2007), rules are extracted from training instances which consist of a source sentence, a target sentence along with its constituent parse tree, and a word alignment matrix. This tuple is interpreted as a directed graph (the alignment graph), with edges pointing away from the root of the tree, and word alignment links being edges as well. A set of nodes (the frontier set) is determined that contains only nodes with non-overlapping closure of their spans.¹ By computing frontier graph fragments—fragments of the alignment graph such that their root and all sinks are in the frontier set—the GHKM extractor is able to induce a minimal set of rules which explain the training instance. The internal tree structure can be discarded to obtain flat SCFG rules. Minimal rules can be assembled to build larger composed rules.

Non-terminals on target sides of string-to-tree rules are syntactified. The target non-terminal vocabulary of the SCFG contains the set of labels of the frontier nodes, which is in turn a subset of (or equal to) the set of constituent labels in the parse tree. The target non-terminal vocabulary furthermore contains an initial non-terminal symbol Q. Source sides of the rules are not decorated with syntactic annotation. The source non-terminal vocabulary contains a single generic non-terminal symbol X.

In addition to the extracted grammar, the translation system makes use of a special glue grammar with an initial rule, glue rules, a final rule, and top rules. The glue rules provide a fallback method to just monotonically concatenate partial derivations during decoding. As we add tokens which

¹ The span of a node in the alignment graph is defined as the set of source-side words that are reachable from this node. The closure of a span is the smallest interval of source sentence positions that covers the span.


mark the sentence start ("<s>") and the sentence end ("</s>"), the rules in the glue grammar are of the following form:

Initial rule:  X,Q → 〈<s> X∼0, <s> Q∼0〉
Glue rules:    X,Q → 〈X∼0 X∼1, Q∼0 B∼1〉   for all B ∈ NE
Final rule:    X,Q → 〈X∼0 </s>, Q∼0 </s>〉
Top rules:     X,Q → 〈<s> X∼0 </s>, <s> B∼0 </s>〉   for all B ∈ NE

5 Preference Grammars

Preference grammars store a set of implicit label vectors as additional information with each SCFG rule, along with their relative frequencies given the rule. Venugopal et al. (2009) introduced this technique for hierarchical phrase-based translation. The implicit label set refines the label set of the underlying synchronous context-free grammar.

We apply this idea to GHKM translation by not decorating the target-side non-terminals of the extracted GHKM rules with syntactic labels, but with a single generic label. The (explicit) target non-terminal vocabulary NE thus also contains only the generic non-terminal symbol X, just like the source non-terminal vocabulary NF. The extraction method remains syntax-directed and is still guided by the syntactic annotation over the target side of the data, but the syntactic labels are stripped off the SCFG rules. Rules which differ only with respect to their non-terminal labels are collapsed to a single entry in the rule table, and their rule counts are pooled. However, the syntactic label vectors that have been seen with a rule during extraction are stored as implicit label vectors of the rule.

5.1 Feature Computation

Two features are added to the log-linear model combination in order to rate the syntactic well-formedness of derivations. The first feature is similar to the one suggested by Venugopal et al. (2009) and computes a score based on the relative frequencies of implicit label vectors of those rules which are involved in the derivation. The second feature is a simple binary feature which supplements the first one by penalizing a rule application if none of the implicit label vectors match.

We will now formally specify the first feature.² We give a recursive definition of the feature score hsyn(d) for a derivation d.

Let r be the top rule in derivation d, with n right-hand side non-terminals. Let dj denote the sub-derivation of d at the j-th right-hand side non-terminal of r, 1 ≤ j ≤ n. hsyn(d) is recursively defined as

hsyn(d) = t̄syn(d) + ∑_{j=1}^{n} hsyn(dj)   (1)

In this equation, t̄syn(d) is a simple auxiliary function:

t̄syn(d) = { log tsyn(d)   if tsyn(d) ≠ 0
          { 0              otherwise          (2)

Denoting with S the implicit label set of the preference grammar, we define tsyn(d) as a function that assesses the degree of agreement of the preferences of the current rule with the sub-derivations:

tsyn(d) = ∑_{s ∈ S^{n+1}} ( p(s|r) · ∏_{k=2}^{n+1} t̄h(s[k] | dk−1) )   (3)

We use the notation [·] to address the elements of a vector. The first element of an (n+1)-dimensional vector s of implicit labels is an implicit label binding of the left-hand side non-terminal of the rule r. p(s|r) is the preference distribution of the rule.

Here, t̄h(Y|d) is another auxiliary function that renormalizes the values of th(Y|d):

t̄h(Y|d) = th(Y|d) / ∑_{Y′ ∈ S} th(Y′|d)   (4)

It provides us with a probability that the derivation d has the implicit label Y ∈ S as its root. Finally, the function th(Y|d) is defined as

th(Y|d) = ∑_{s ∈ S^{n+1} : s[1]=Y} ( p(s|r) · ∏_{k=2}^{n+1} t̄h(s[k] | dk−1) )   (5)

Note that the denominator in Equation (4) thus equals tsyn(d).

² Our notational conventions roughly follow the ones by Stein et al. (2010).


This concludes the formal specification of the first feature. The second feature, hauxSyn(d), penalizes rule applications in cases where tsyn(d) evaluates to 0:

hauxSyn(d) = { 0   if tsyn(d) ≠ 0
             { 1   otherwise          (6)

Its intuition is that rule applications which do not contribute to hsyn(d) should be punished. Derivations with tsyn(d) = 0 could alternatively be dropped completely, but our approach is to avoid hard constraints. We will later demonstrate empirically that discarding such derivations harms translation quality.
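For concreteness, the following Python sketch computes Equations (1)–(5) bottom-up over a derivation. The data structures (a `rule_prefs` mapping from implicit label vectors to p(s|r), and a `children` list of sub-derivations) are assumptions made for illustration and do not reflect the actual decoder implementation; label vectors are 0-indexed here, so s[0] binds the left-hand side.

```python
import math
from collections import defaultdict, namedtuple

Derivation = namedtuple("Derivation", ["rule_prefs", "children"])

def t_syn(rule_prefs, child_dists):
    """Equations (3) and (5): agreement of the rule's label-vector
    preferences with the renormalized root distributions t̄h of the
    sub-derivations. Returns t_syn(d) and the unnormalized t_h(.|d)."""
    t_h = defaultdict(float)
    for s, p in rule_prefs.items():
        agreement = p
        for k, dist in enumerate(child_dists, start=1):
            agreement *= dist.get(s[k], 0.0)
        t_h[s[0]] += agreement
    return sum(t_h.values()), dict(t_h)

def h_syn(derivation):
    """Equation (1): recursive feature score; also returns the
    renormalized root distribution t̄h(.|d) needed by the parent."""
    child_scores, child_dists = [], []
    for d_j in derivation.children:
        score_j, dist_j = h_syn(d_j)
        child_scores.append(score_j)
        child_dists.append(dist_j)
    t, t_h = t_syn(derivation.rule_prefs, child_dists)
    t_h_bar = {y: v / t for y, v in t_h.items()} if t > 0 else {}  # Eq. (4)
    t_bar = math.log(t) if t != 0 else 0.0                         # Eq. (2)
    return t_bar + sum(child_scores), t_h_bar

# Tiny example: a root rule whose single child prefers the label "NP".
leaf = Derivation(rule_prefs={("NP",): 1.0}, children=[])
root = Derivation(rule_prefs={("S", "NP"): 0.8, ("S", "VP"): 0.2},
                  children=[leaf])
score, _ = h_syn(root)  # log(0.8): only the ("S", "NP") vector agrees
```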

6 Soft Source Syntactic Constraints

Similar to the implicit target-side label vectors which we store in preference grammars, we can likewise memorize sets of source-side syntactic label vectors with GHKM rules. In contrast to preference grammars, the rule inventory of the string-to-tree system remains untouched. The target non-terminals of the SCFG stay syntactified, and the source non-terminal vocabulary is not extended beyond the single generic non-terminal.

Source-side syntactic labels are an additional latent property of the rules. We obtain this property by parsing the source side of the training data and collecting the source labels that cover the source-side spans of non-terminals during GHKM rule extraction. As the source-side span is frequently not covered by a constituent in the syntactic parse tree, we employ composite symbols as suggested by Zollmann and Venugopal (2006) for the SAMT system.³ In cases where a span is still not covered by a symbol, we nevertheless memorize a source-side syntactic label vector but indicate the failure for the uncovered non-terminal with a special label. The set of source label vectors that are seen with a rule during extraction is stored with it in the rule table as an additional property. This information can be used to implement feature-based soft source syntactic constraints.

Table 1 shows an example of a set of source label vectors stored with a grammar rule. The first element of each vector is an implicit source-syntactic label for the left-hand side non-terminal of the rule; the remaining elements are implicit

³ Specifically, we apply relax-parse --SAMT 2 as implemented in the Moses toolkit (Koehn et al., 2007).

source label vector    frequency
(IN+NP, NN, NN)        7
(IN+NP, NNP, NNP)      3
(IN++NP, NNS, NNS)     2
(IN+NP, NP, NP)        2
(PP//SBAR, NP, NP)     1

Table 1: The set of source label vectors (along with their frequencies in the training data) for the rule X,PP-MO → 〈between X∼1 and X∼0, zwischen NN∼0 und NN∼1〉. The overall rule frequency is 15.

source-syntactic labels for the right-hand side source non-terminals.

The basic idea for the soft source syntactic constraints features is to also parse the input data in a preprocessing step and to try to match input labels with the source label vectors that are associated with SCFG rules.

6.1 Feature Computation

Upon application of an SCFG rule, each of the non-terminals of the rule covers a distinct span of the input sentence. An input label from the input parse may be available for this span. We say that a non-terminal has a match in a given source label vector of the rule if its label in the vector is the same as the corresponding input label over the span.

We define three simple features to score matches and mismatches of the implicit source syntactic labels with the labels from the input data:

• A binary feature that fires if a rule is applied which possesses a source syntactic label vector that fully matches the input labels. This feature rewards exact source label matches of complete rules, i.e., the existence of a vector in which all non-terminals of the rule have matches.

• A binary feature that fires if a rule is applied which does not possess any source syntactic label vector with a match of the label for the left-hand side non-terminal. This feature penalizes left-hand side mismatches.

• A count feature that for each rule application adds a cost equal to the number of right-hand side non-terminals that do not have a match with a corresponding input label in any of the source syntactic label vectors. This feature penalizes right-hand side mismatches.


The second and third features are less strict than the first one and give the system a more detailed clue about the magnitude of the mismatch.
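A minimal Python sketch of these three dense features follows. The inputs and feature names are assumptions made for illustration: `rule_vectors` is the set of source label vectors stored with the applied rule (first element = left-hand side label), and the input labels are those the input parse assigns to the spans covered by the rule's non-terminals (None when no label covers a span).

```python
def dense_source_constraint_features(rule_vectors, lhs_input_label,
                                     rhs_input_labels):
    """The three dense soft-source-constraint features of Section 6.1."""
    # Feature 1: some stored vector matches all input labels exactly.
    full_match = any(
        vec[0] == lhs_input_label
        and all(v == lab for v, lab in zip(vec[1:], rhs_input_labels))
        for vec in rule_vectors
    )
    # Feature 2: no stored vector matches the left-hand side input label.
    lhs_mismatch = not any(vec[0] == lhs_input_label for vec in rule_vectors)
    # Feature 3: count right-hand side non-terminals without any match.
    rhs_mismatches = sum(
        1 for i, lab in enumerate(rhs_input_labels)
        if not any(vec[i + 1] == lab for vec in rule_vectors)
    )
    return {
        "SrcFullMatch": 1.0 if full_match else 0.0,      # reward
        "SrcLhsMismatch": 1.0 if lhs_mismatch else 0.0,  # penalty
        "SrcRhsMismatchCount": float(rhs_mismatches),    # penalty
    }
```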

6.2 Sparse Features

We can optionally add a larger number of sparse features that depend on the identity of the source-side syntactic label:

• Sparse features which fire if a specific input label is matched. We say that the input label is matched in case the corresponding non-terminal that covers the span has a match in any of the source syntactic label vectors of the applied rule. We distinguish input label matches via left-hand side and via right-hand side non-terminals.

• Sparse features which fire if the span of a specific input label is covered by a non-terminal of an applied rule, but the input label is not matched.

The first set of sparse features rewards matches; the second set of sparse features penalizes mismatches.

All sparse features have individual scaling factors in the log-linear model combination. We however implemented a means of restricting the number of sparse features by providing a core set of source labels. If such a core set is specified, then only those sparse features are active that depend on the identity of labels within this set. All sparse features for source labels outside of the core set are inactive.
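A sketch of how such sparse features might be generated per rule application, including the core-set restriction, is given below; the input triples and the feature-name scheme are assumptions for illustration.

```python
def sparse_source_label_features(label_observations, core_set=None):
    """Sparse features of Section 6.2.

    `label_observations` is a list of (input_label, side, matched)
    triples for one rule application, with side in {"lhs", "rhs"} and
    matched indicating whether the label occurs at that position in any
    source label vector of the rule. Labels outside `core_set` (if
    given) fire no features at all."""
    fired = {}
    for label, side, matched in label_observations:
        if core_set is not None and label not in core_set:
            continue  # inactive outside the core label set
        if matched:
            name = f"SrcMatch_{side}_{label}"  # reward: input label matched
        else:
            name = f"SrcMismatch_{label}"      # penalty: covered, unmatched
        fired[name] = fired.get(name, 0.0) + 1.0
    return fired
```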

7 Experiments

We empirically evaluate the effectiveness of preference grammars and soft source syntactic constraints for GHKM translation on the English→German language pair, using the standard newstest sets of the Workshop on Statistical Machine Translation (WMT) for testing.⁴ The experiments are conducted with the open-source Moses implementations of GHKM rule extraction (Williams and Koehn, 2012) and of decoding with CYK+ parsing and cube pruning (Hoang et al., 2009).

⁴ http://www.statmt.org/wmt14/translation-task.html

7.1 Experimental Setup

We work with an English–German parallel training corpus of around 4.5 M sentence pairs (after corpus cleaning). The parallel data originates from three different sources which have been eligible for the constrained track of the ACL 2014 Ninth Workshop on Statistical Machine Translation shared translation task: Europarl (Koehn, 2005), News Commentary, and the Common Crawl corpus as provided on the WMT website. Word alignments are created by aligning the data in both directions with MGIZA++ (Gao and Vogel, 2008) and symmetrizing the two trained alignments (Och and Ney, 2003; Koehn et al., 2003). The German target-side training data is parsed with BitPar (Schmid, 2004). We remove grammatical case and function information from the annotation obtained with BitPar and apply right binarization of the German parse trees prior to rule extraction (Wang et al., 2007; Wang et al., 2010; Nadejde et al., 2013). For the soft source syntactic constraints, we parse the English source side of the parallel data with the English Berkeley Parser (Petrov et al., 2006) and produce composite SAMT-style labels as discussed in Section 6.

When extracting syntactic rules, we impose several restrictions on composed rules, in particular a maximum number of 100 tree nodes per rule, a maximum depth of seven, and a maximum size of seven. We discard rules with non-terminals on their right-hand side if they are singletons in the training data.

For efficiency reasons, we also enforce a limit on the number of label vectors that are stored as additional properties. Label vectors are only stored if they occur at least as often as the 50th most frequent label vector of the given rule. This limit is applied separately for both source-side label vectors (which are used by the soft syntactic constraints) and target-side label vectors (which are used by the preference grammar).

Only the 200 best translation options per distinct rule source side with respect to the weighted rule-level model scores are loaded by the decoder. Search is carried out with a maximum chart span of 25, a rule limit of 500, a stack limit of 200, and a k-best limit of 1000 for cube pruning.

A standard set of models is used in the baseline, comprising rule translation probabilities and lexical translation probabilities in both directions, word penalty and rule penalty, an n-gram language


                                                         dev          newstest2013  newstest2014
system                                                   BLEU  TER    BLEU  TER     BLEU  TER
GHKM string-to-tree baseline                             34.7  47.3   20.0  63.3    19.4  65.6
+ soft source syntactic constraints                      35.1  47.0   20.3  62.7    19.7  64.9
  + sparse features                                      35.8  46.5   20.3  62.8    19.6  65.1
  + sparse features (core = non-composite)               35.4  46.8   20.2  62.9    19.6  65.1
  + sparse features (core = dev-min-occ100)              35.6  46.7   20.2  62.9    19.6  65.2
  + sparse features (core = dev-min-occ1000)             35.4  46.9   20.3  62.8    19.6  65.2
+ hard source syntactic constraints                      34.6  47.4   19.9  63.4    19.4  65.6
string-to-string (GHKM syntax-directed rule extraction)  33.8  48.0   19.3  63.8    18.7  66.2
+ preference grammar                                     33.9  47.7   19.3  63.7    18.8  66.0
  + soft source syntactic constraints                    34.6  47.0   19.8  62.9    19.5  65.2
    + drop derivations with tsyn(d) = 0                  34.0  47.5   19.7  63.0    18.8  65.8

Table 2: English→German experimental results (truecase). BLEU scores are given in percentage. A selection of 2000 sentences from the newstest2008-2012 sets is used as development set.

model, a rule rareness penalty, and the monolingual PCFG probability of the tree fragment from which the rule was extracted (Williams et al., 2014). Rule translation probabilities are smoothed via Good-Turing smoothing.

The language model (LM) is a large interpolated 5-gram LM with modified Kneser-Ney smoothing (Kneser and Ney, 1995; Chen and Goodman, 1998). The target side of the parallel corpus and the monolingual German News Crawl corpora are employed as training data. We use the SRILM toolkit (Stolcke, 2002) to train the LM and rely on KenLM (Heafield, 2011) for language model scoring during decoding.

Model weights are optimized to maximize BLEU (Papineni et al., 2002) with batch MIRA (Cherry and Foster, 2012) on 1000-best lists. We selected 2000 sentences from the newstest2008-2012 sets as a development set. The selected sentences obtained high sentence-level BLEU scores when translated with a baseline phrase-based system, and each contain less than 30 words, for more rapid tuning. newstest2013 and newstest2014 are used as unseen test sets. Translation quality is measured in truecase with BLEU and TER (Snover et al., 2006).⁵

7.2 Translation Results

The results of the empirical evaluation are given in Table 2. Our GHKM string-to-tree system attains state-of-the-art performance on newstest2013 and newstest2014.

⁵ TER scores are computed with tercom version 0.7.25 and parameters -N -s.

7.2.1 Soft Source Syntactic Constraints

Adding the three dense soft source syntactic constraints features from Section 6.1 improves the baseline scores by 0.3 points BLEU and 0.6 points TER on newstest2013 and by 0.3 points BLEU and 0.7 points TER on newstest2014.

Somewhat surprisingly, the sparse features from Section 6.2 do not boost translation quality further on either of the two test sets. We observe a considerable improvement on the development set, but it does not carry over to the test sets. We attribute this to an overfitting effect. Our source-side soft syntactic label set of composite SAMT-style labels comprises 8504 different labels that appear on the source side of the parallel training data. Four times that amount of sparse features is possible (left-hand side/right-hand side matches and mismatches for each label), though not all of them fire on the development set. 3989 sparse weights are tuned to non-zero values in the experiment. Due to the sparse nature of the features, overfitting cannot be ruled out.

We attempted to take measures to avoid overfitting by specifying a core set of source labels and deactivating all sparse features for source labels outside of the core set (cf. Section 6.2). First we specified the core label set as all non-composite labels. Non-composite labels are the plain constituent labels as given by the syntactic parser; complex SAMT-style labels are not included. The size of this set is 71 (non-composite labels that have been observed during rule extraction). Translation performance on the development set drops in the sparse features (core = non-


                                                newstest2012  newstest2013  newstest2014
system (tuned on newstest2012)                  BLEU  TER     BLEU  TER     BLEU  TER
GHKM string-to-tree baseline                    17.9  65.7    19.9  63.2    19.4  65.3
+ soft source syntactic constraints             18.2  65.3    20.3  62.6    19.7  64.7
  + sparse features                             18.6  64.9    20.4  62.5    19.8  64.7
  + sparse features (core = non-composite)      18.4  65.1    20.3  62.7    19.8  64.7
  + sparse features (core = dev-min-occ100)     18.4  64.8    20.6  62.2    19.9  64.4

Table 3: English→German experimental results (truecase). BLEU scores are given in percentage. newstest2012 is used as development set.

composite) setup, but performance does not increase on the test sets.

Next we specified the core label set in another way: we counted how often each source label occurs in the input data of the development set. We then applied a minimum occurrence count threshold and added labels to the core set if they did not appear more rarely than the threshold. We tried values of 100 and 1000 for the minimum occurrence, resulting in 277 and 37 labels in the core label set, respectively. Neither the sparse features (core = dev-min-occ100) experiment nor the sparse features (core = dev-min-occ1000) experiment yields better translation quality than what we see in the setup without sparse features.

We eventually conjectured that the choice of our development set might be a reason for the ineffectiveness of the sparse features, as on a fine-grained level it could possibly be too different from the test sets with respect to its syntactic properties. We therefore repeated some of the experiments with scaling factors optimized on newstest2012 (Table 3). The sparse features (core = dev-min-occ100) setup indeed performs better when tuned on newstest2012, with improvements of 0.7 points BLEU and 1.0 points TER on newstest2013 and of 0.5 points BLEU and 0.9 points TER on newstest2014 over the baseline tuned on the same set.

Finally, we were interested in demonstrating that soft source syntactic constraints are superior to hard source syntactic constraints. We built a setup that forces the decoder to match source-side syntactic label vectors in the rules with input labels.⁶ Hard source syntactic constraints are indeed worse than soft source syntactic constraints (by 0.4 BLEU on newstest2013 and 0.3 BLEU on newstest2014). The setup with hard source syntactic constraints performs almost exactly at the level of the baseline.

⁶ Glue rules are an exception. They do not need to match the input labels.

7.2.2 Preference Grammar

In the series of experiments with a preference grammar, we first evaluated a setup with the underlying SCFG of the preference grammar system, but without the preference grammar. We denote this setup as string-to-string (GHKM syntax-directed rule extraction) in Table 2. The extraction method for this string-to-string system is GHKM syntax-directed with right-binarized syntactic target-side parses from BitPar, as in the string-to-tree setup. The constituent labels from the syntactic parses are however not used to decorate non-terminals. The grammar contains rules with a single generic non-terminal instead of syntactic ones. The string-to-string (GHKM syntax-directed rule extraction) setup is 0.7 BLEU (0.5 TER) worse on newstest2013 and 0.7 BLEU (0.6 TER) worse on newstest2014 than the standard GHKM string-to-tree baseline.

We then activated the preference grammar as described in Section 5. GHKM translation with a preference grammar instead of a syntactified target non-terminal vocabulary in the SCFG is considerably worse than the standard GHKM string-to-tree baseline and barely improves over the string-to-string setup.

We added soft source syntactic constraints on top of the preference grammar system, thus combining the two techniques. Soft source syntactic constraints give a nice gain over the preference grammar system, but the best setup without a preference grammar is not outperformed. In another experiment, we investigated the effect of dropping derivations with tsyn(d) = 0 (cf. Section 5.1). Note that the second feature hauxSyn(d) is not useful in this setup, as the system is forced to discard all derivations that would be penalized by that feature. We deactivated hauxSyn(d) for the experiment. The hard decision of dropping derivations with tsyn(d) = 0 leads to a performance loss of


0.1 BLEU on newstest2013 and a more severe deterioration of 0.7 BLEU on newstest2014.

8 Conclusions

We investigated two soft syntactic extensions for GHKM translation: target-side preference grammars and soft source syntactic constraints.

Soft source syntactic constraints proved to be suitable for advancing the translation quality over a strong string-to-tree baseline. Sparse features are beneficial beyond just three dense features, but they require the utilization of an appropriate development set. We also showed that the soft integration of source syntactic constraints is crucial: hard constraints do not yield gains over the baseline.

Preference grammars did not perform well in our experiments, suggesting that translation models with syntactic target non-terminal vocabularies are a better choice when building string-to-tree systems.

Acknowledgements

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreements no 287658 (EU-BRIDGE) and no 288487 (MosesCore).

References

Ondrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 Workshop on Statistical Machine Translation. In Proc. of the Workshop on Statistical Machine Translation (WMT), pages 12–58, Baltimore, MD, USA, June.

Jean-Cédric Chappelier and Martin Rajman. 1998. A Generalized CYK Algorithm for Parsing Stochastic CFG. In Proc. of the First Workshop on Tabulation in Parsing and Deduction, pages 133–137, Paris, France, April.

Stanley F. Chen and Joshua Goodman. 1998. An Empirical Study of Smoothing Techniques for Language Modeling. Technical Report TR-10-98, Computer Science Group, Harvard University, Cambridge, MA, USA, August.

Colin Cherry and George Foster. 2012. Batch Tuning Strategies for Statistical Machine Translation. In Proc. of the Human Language Technology Conf. / North American Chapter of the Assoc. for Computational Linguistics (HLT-NAACL), pages 427–436, Montréal, Canada, June.

David Chiang. 2007. Hierarchical Phrase-Based Translation. Computational Linguistics, 33(2):201–228, June.

Steve DeNeefe, Kevin Knight, Wei Wang, and Daniel Marcu. 2007. What Can Syntax-Based MT Learn from Phrase-Based MT? In Proc. of the 2007 Joint Conf. on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 755–763, Prague, Czech Republic, June.

Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a translation rule? In Proc. of the Human Language Technology Conf. / North American Chapter of the Assoc. for Computational Linguistics (HLT-NAACL), pages 273–280, Boston, MA, USA, May.

Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. 2006. Scalable Inference and Training of Context-Rich Syntactic Translation Models. In Proc. of the 21st Int. Conf. on Computational Linguistics and 44th Annual Meeting of the Assoc. for Computational Linguistics, pages 961–968, Sydney, Australia, July.

Qin Gao and Stephan Vogel. 2008. Parallel Implementations of Word Alignment Tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, SETQA-NLP '08, pages 49–57, Columbus, OH, USA, June.

Kenneth Heafield. 2011. KenLM: Faster and Smaller Language Model Queries. In Proc. of the Workshop on Statistical Machine Translation (WMT), pages 187–197, Edinburgh, Scotland, UK, July.

Hieu Hoang, Philipp Koehn, and Adam Lopez. 2009. A Unified Framework for Phrase-Based, Hierarchical, and Syntax-Based Statistical Machine Translation. In Proc. of the Int. Workshop on Spoken Language Translation (IWSLT), pages 152–159, Tokyo, Japan, December.

Zhongqiang Huang, Jacob Devlin, and Rabih Zbib. 2013. Factored Soft Source Syntactic Constraints for Hierarchical Machine Translation. In Proc. of the Conf. on Empirical Methods for Natural Language Processing (EMNLP), pages 556–566, Seattle, WA, USA, October.

Reinhard Kneser and Hermann Ney. 1995. Improved Backing-Off for M-gram Language Modeling. In Proceedings of the Int. Conf. on Acoustics, Speech, and Signal Processing, volume 1, pages 181–184, Detroit, MI, USA, May.

Philipp Koehn, Franz Joseph Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation. In Proc. of the Human Language Technology Conf. / North American Chapter of the Assoc. for Computational Linguistics (HLT-NAACL), pages 127–133, Edmonton, Canada, May/June.

P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proc. of the Annual Meeting of the Assoc. for Computational Linguistics (ACL), pages 177–180, Prague, Czech Republic, June.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proc. of the MT Summit X, Phuket, Thailand, September.

Yuval Marton and Philip Resnik. 2008. Soft Syntactic Constraints for Hierarchical Phrased-Based Translation. In Proc. of the Annual Meeting of the Assoc. for Computational Linguistics (ACL), pages 1003–1011, Columbus, OH, USA, June.

Maria Nadejde, Philip Williams, and Philipp Koehn. 2013. Edinburgh's Syntax-Based Machine Translation Systems. In Proc. of the Workshop on Statistical Machine Translation (WMT), pages 170–176, Sofia, Bulgaria, August.

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51, March.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proc. of the Annual Meeting of the Assoc. for Computational Linguistics (ACL), pages 311–318, Philadelphia, PA, USA, July.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning Accurate, Compact, and Interpretable Tree Annotation. In Proc. of the 21st Int. Conf. on Computational Linguistics and 44th Annual Meeting of the Assoc. for Computational Linguistics, pages 433–440, Sydney, Australia, July.

Helmut Schmid. 2004. Efficient Parsing of Highly Ambiguous Context-Free Grammars with Bit Vectors. In Proc. of the Int. Conf. on Computational Linguistics (COLING), Geneva, Switzerland, August.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. In Proc. of the Conf. of the Assoc. for Machine Translation in the Americas (AMTA), pages 223–231, Cambridge, MA, USA, August.

Daniel Stein, Stephan Peitz, David Vilar, and Hermann Ney. 2010. A Cocktail of Deep Syntactic Features for Hierarchical Machine Translation. In Proc. of the Conf. of the Assoc. for Machine Translation in the Americas (AMTA), Denver, CO, USA, October/November.

Andreas Stolcke. 2002. SRILM – an Extensible Language Modeling Toolkit. In Proc. of the Int. Conf. on Spoken Language Processing (ICSLP), volume 3, Denver, CO, USA, September.

Ashish Venugopal, Andreas Zollmann, Noah A. Smith, and Stephan Vogel. 2009. Preference Grammars: Softening Syntactic Constraints to Improve Statistical Machine Translation. In Proc. of the Human Language Technology Conf. / North American Chapter of the Assoc. for Computational Linguistics (HLT-NAACL), pages 236–244, Boulder, CO, USA, June.

Wei Wang, Kevin Knight, and Daniel Marcu. 2007. Binarizing Syntax Trees to Improve Syntax-Based Machine Translation Accuracy. In Proc. of the 2007 Joint Conf. on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 746–754, Prague, Czech Republic, June.

Wei Wang, Jonathan May, Kevin Knight, and Daniel Marcu. 2010. Re-structuring, Re-labeling, and Re-aligning for Syntax-based Machine Translation. Computational Linguistics, 36(2):247–277, June.

Philip Williams and Philipp Koehn. 2012. GHKM Rule Extraction and Scope-3 Parsing in Moses. In Proc. of the Workshop on Statistical Machine Translation (WMT), pages 388–394, Montréal, Canada, June.

Philip Williams, Rico Sennrich, Maria Nadejde, Matthias Huck, Eva Hasler, and Philipp Koehn. 2014. Edinburgh's Syntax-Based Systems at WMT 2014. In Proc. of the Workshop on Statistical Machine Translation (WMT), pages 207–214, Baltimore, MD, USA, June.

Jiajun Zhang, Feifei Zhai, and Chengqing Zong. 2011. Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax. In Proc. of the Conf. on Empirical Methods for Natural Language Processing (EMNLP), pages 204–215, Edinburgh, Scotland, UK, July.

Andreas Zollmann and Ashish Venugopal. 2006. Syntax Augmented Machine Translation via Chart Parsing. In Proc. of the Workshop on Statistical Machine Translation (WMT), pages 138–141, New York City, NY, USA, June.


How Synchronous are Adjuncts in Translation Data?

Sophie Arnoult
ILLC
University of Amsterdam
[email protected]

Khalil Sima'an
ILLC
University of Amsterdam
[email protected]

Abstract

The argument-adjunct distinction is central to most syntactic and semantic theories. As optional elements that refine (the meaning of) a phrase, adjuncts are important for recursive, compositional accounts of syntax, semantics and translation. In formal accounts of machine translation, adjuncts are often treated as modifiers applying synchronously in source and target derivations. But how well can the assumption of synchronous adjunction explain translation equivalence in actual parallel data? In this paper we present the first empirical study of translation equivalence of adjuncts on a variety of French-English parallel corpora, while varying word alignments so we can gauge the effect of errors in them. We show that for proper measurement of the types of translation equivalence of adjuncts, we must work with non-contiguous, many-to-many relations, thereby amending the traditional Direct Correspondence Assumption. Our empirical results show that 70% of manually identified adjuncts have adjunct translation equivalents in training data, against roughly 50% for automatically identified adjuncts.

1 Introduction

Most syntactic and semantic theories agree on theargument-adjunct distinction, although they varyon the specifics of this distinction. Common tothese theories is that adjunction is a central de-vice for language recursion, as adjunction modi-fies initial but complete sentences by adding op-tional phrases; adjunction also contributes to se-mantic compositionality, albeit in various ways,as syntactic adjuncts may take different seman-tic roles. Shieber and Schabes (1990) transfer the

role of adjuncts from monolingual syntax (Joshiet al., 1975) to the realm of translation equiva-lence using a Synchronous Tree Adjoining Gram-mars (STAG), and propose to view adjunction asa synchronous operation for recursive, composi-tional translation. STAG therefore relies substan-tially on what Hwa (2002) calls the Direct Corre-spondence Assumption, the notion that semanticor syntactic relations correspond across a bitext.We know from various works–notably by Hwa etal. (2002) for dependency relations, Arnoult andSima’an (2012) for adjuncts, and Pado and Lap-ata (2009) and Wu and Fung (2009) for semanticroles–that the Direct Correspondence Assumptiondoes not always hold.

A question that has not received much atten-tion is the degree to which the assumption ofsynchronous adjunction is supported in humantranslation data. This is crucial for the succes-ful application of linguistically-motivated STAG,but attempts at answering this question empiricallyare hampered by a variety of difficulties. Lin-guistic structures may diverge between languages(Dorr, 1994), translations may be more or less lit-eral, and annotation resources may be inaccurate,when they are available at all. Besides, automaticword alignments are known to be noisy and man-ual alignments are rather scarse. The work ofArnoult and Sima’an (2012) reports lower and up-per bounds of one-to-one adjunct correspondence,using rather limited resources to identify Frenchadjuncts making their results not directly applica-ble for measuring the stability of the synchronousadjunction assumption.

In this paper we aim at redefining the transla-tion equivalence of adjuncts in ways that allow usto report far more accurate bounds on their cross-linguistic correspondence. In particular, we are in-terested in measuring adjunct correspondence ro-bustly, in training data.

Consider for example the sentence pair of Figure 1. Most adjuncts in each sentence translate as adjuncts in the other sentence, but one of these translation equivalences appears to be many-to-many, because of parsing mismatches across the bitext; both parses and adjunct labellers on both sides of the bitext must be on par for adjunct translation equivalences to be established. Besides, one generally establishes translation equivalence using word alignments, which may be noisy. Another factor is the degree of translation equivalence in the data in general; while parallel bitexts express the same meaning, meaning may diverge locally.

[Figure 1: Example sentence pair. A French-English sentence pair with English adjuncts Ae1-Ae7 and French adjuncts Af1-Af5 marked; the original layout is not recoverable from the extraction.]

This paper contributes the first study to measure the degree of adjunction synchronicity: we derive many-to-many pairings between adjuncts across a bitext, thus supporting a generic view of translation equivalence, where meaning can be expressed by distinct entities and redistributed freely in translation; practically, this also allows us to capture equivalence in spite of mismatched parses. We abstract away from word alignments to a certain degree, as we directly pair adjuncts across a bitext, but we still use word alignments (namely the overlap of adjunct projections with target adjuncts) to decide on these pairings. We further distinguish between adjunct pairings that are bijective through the word alignment, and other pairings, where the translation equivalence does not exactly agree with the word alignment; we qualify these pairings as weakly equivalent.

Under this new view of adjunct translation equivalence, we perform measurements in French-English data. We show that adjunction is preserved in 40% to 50% of the cases with automatically labelled adjuncts, with differences between data sets, word aligners and sentence lengths; about 25% more adjuncts form weakly translation-equivalent pairings. With gold adjunct annotations, the proportion of translation-equivalent adjuncts increases to 70%.

These results show that adjunct labelling accuracy on both sides of the data is crucial for adjunct alignment, while suggesting that applications that exploit adjunction can gain from decreasing their dependence on word alignments and idealized experimental conditions, and from identifying favorable contexts for adjunct preservation.

2 Alignment-based role pairing

How can one find translation-equivalent adjuncts using word alignments, without being too constrained by the latter? Obviously, adjunct pairs that are consistent with the word alignments are translation equivalent, but we also want to be able to identify translation-equivalent adjuncts that are not exactly aligned to each other, and also to accept many-to-many pairings; not only to get linguistically justified discontinuous pairs, as with the French double negation particle, but also for robustness with regard to dissimilar attachments in the French and English parses.

2.1 Translation equivalence under the alignment-consistency constraint

Consider for instance Figure 2, which represents a word alignment for part of the sentence pair of Figure 1. We would like to match f2 to e2 and e6, f3 to e3, f4 to e5, and f5 to e6. If one only pairs adjuncts that are consistent with the word alignment, one obtains only half of these adjunct pairs: 〈f3, e3〉 and 〈f4, e5〉. One cannot pair up f5 and e6 because the latter is also aligned outside of the former; and one also cannot find the equivalence between f2 on the one hand and e2 and e6 on the other if one assumes one-to-one correspondence between adjuncts.
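For concreteness, the alignment-consistency constraint can be stated as a small predicate. The sketch below is our illustration, not the paper's implementation; it assumes spans are sets of token positions and the alignment is a set of links.

    # Minimal sketch (our illustration, not the authors' code) of the
    # alignment-consistency test: an adjunct pair is consistent iff no
    # alignment link crosses the pair's boundary on either side.
    def consistent(src_span, tgt_span, alignment):
        for s, t in alignment:
            # A link touching one side of the pair must also land
            # inside the other side.
            if (s in src_span) != (t in tgt_span):
                return False
        # Require at least one internal link, so that unaligned spans
        # do not pair vacuously.
        return any(s in src_span and t in tgt_span for s, t in alignment)

Under this test, the pair 〈f5, e6〉 of Figure 2 is rejected, since e6 carries a link that falls outside f5.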

2.2 Translation equivalence through projection

We align adjuncts across the bitext by projecting them through the word alignment and finding, for each adjunct, the shortest adjunct or sequence of adjuncts that overlaps the most with that adjunct's projection.

[Figure 2: Example with word alignment. The word-alignment matrix for part of the sentence pair of Figure 1, with French adjuncts f2-f5 and English adjuncts e2-e6 marked; the matrix itself is not recoverable from the extraction.]

To prevent source adjuncts from being aligned to the first target adjunct that subsumes their projection, we also enforce that only non-overlapping source adjuncts may be aligned to a same target sequence, as explained in section 2.2.1.

This procedure results in a many-to-many alignment between adjuncts on either side. We distinguish several types of adjunct pairings through this alignment, which we interpret as divergent, equivalent or weakly equivalent, as described in section 2.2.2.

We perform this alignment in both source-target and target-source directions to measure the proportion of source, respectively target, adjuncts that fall in each category.

2.2.1 Adjunct pairing procedure

We define the projection of an adjunct σ as the unique tuple of maximal, non-overlapping phrases $\phi_1^n$ that are aligned to σ through the word alignment. Each phrase $\phi_i$ in this tuple is understood as being extended with possible surrounding unaligned positions; phrases are primarily identified by the aligned positions they cover. And each $\phi_i$ is maximal, as any larger phrase distinct from $\phi_i$ would also include (aligned) positions not aligned to σ. Let $I(\phi_i)$ be the set of aligned positions in each $\phi_i$, and $I(\phi_1^n)$ the set of aligned positions covered by $\phi_1^n$.

We align σ to the non-overlapping sequence of target adjuncts $\tau_1^m$ that has the smallest set of aligned positions while having the largest overlap with $\phi_1^n$; the overlap of a projection and a target sequence is the intersection of their respective sets of aligned positions. For instance in Figure 2, the projection of f4 is maximally covered by e2, e4, and e5; we align the latter to f4 as it covers the least aligned positions. In practice, we search through the tree of target adjuncts for adjuncts that overlap with $\phi_1^n$, and for each such adjunct τ we compare its overlap with $\phi_1^n$ to that of the sequence of its children $\gamma_1^k$ to determine which (of τ or $\gamma_1^k$) should be part of the final target sequence.

We perform a similar selection on overlapping source adjuncts that point to the same target sequence. For each source adjunct σ, we determine whether its target sequence $\tau_1^m$ is also aligned to adjuncts dominated by σ, in which case we compare the overlap of σ's projection with $\tau_1^m$ to that of its children in the source adjunct tree to determine which should be aligned to $\tau_1^m$. For instance in Figure 2, e4 is aligned to f2 (when projecting from English to French), but so is e2; as e2's projection overlaps more with f2, we discard the alignment between e4 and f2.

The final alignments for our example are represented in Table 1.

Table 1: Adjunct pairings for the alignment of Figure 2

    f → e              e → f
    f2   e2, e6        e2   f2
    f3   e3            e3   f3
    f4   e5            e4   -
    f5   e6            e5   f4
                       e6   f2
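The pairing procedure above can be summarized in code. The following Python sketch is our simplified rendering under assumed data structures (an Adjunct node carrying its aligned positions and children); it omits the extension with unaligned positions and the non-overlap constraint between source adjuncts.

    from dataclasses import dataclass, field

    @dataclass
    class Adjunct:
        positions: frozenset          # aligned token positions covered
        children: list = field(default_factory=list)

    def project(src_positions, alignment):
        # Target positions aligned to the source adjunct.
        return {t for s, t in alignment if s in src_positions}

    def select(node, proj):
        """Keep `node` itself or the selection over its children,
        whichever overlaps more with the projection; ties favor the
        smaller set of aligned positions."""
        kept = [a for c in node.children for a in select(c, proj)]
        child_cov = set().union(*(a.positions for a in kept)) if kept else set()
        own_key = (len(node.positions & proj), -len(node.positions))
        child_key = (len(child_cov & proj), -len(child_cov))
        return [node] if own_key >= child_key else kept

    def pair(src_positions, alignment, target_root):
        proj = project(src_positions, alignment)
        return [a for a in select(target_root, proj) if a.positions & proj]

For instance, pair(f4_positions, alignment, english_root) would select e5 rather than a larger adjunct subsuming its projection, mirroring the choice described in section 2.2.1.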

2.2.2 Types of adjunct pairings

We distinguish three main classes of adjunct translation equivalence: divergent, equivalent and weakly equivalent. We further subdivide each class into two types, as shown in Table 2. Adjunct pairings fall into one of these types depending on their configuration (unaligned, one-to-one or many-to-many) and their agreement with the word alignments. Equivalent types notably differ from weakly equivalent ones by being bijectively aligned; with the notations of section 2.2.1, two adjunct sequences $\sigma_1^n$ and $\tau_1^m$ with respective projections $\phi_1^{n'}$ and $\psi_1^{m'}$ are translation equivalent iff $I(\phi_1^{n'}) = I(\tau_1^m)$ and $I(\psi_1^{m'}) = I(\sigma_1^n)$.

Table 2: Adjunct pairing types

    divergent
      null    empty projection
      div     no aligned target adjuncts
    weakly equivalent
      we-nm   many-to-many, non-bijective
      we-11   one-to-one, non-bijective
    equivalent
      eq-nm   many-to-many, bijective
      eq-11   one-to-one, bijective

In Table 1, e4's translation is divergent, as it is not aligned to any adjunct; f5 and e6 are weakly equivalent, as the projection of f5 does not cover all the aligned positions of e6. The pairing from f2 to e2, e6 is many-to-many equivalent, and so are the pairings from e2 and e6 to f2; the remaining pairings are one-to-one equivalent.
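Our reading of these definitions can be condensed into a classifier over pairings. The sketch below is illustrative only; it reuses the assumed Adjunct structure from the previous sketch, with sources and targets the two sides of one pairing.

    def pairing_type(sources, targets, alignment):
        src_pos = set().union(*(a.positions for a in sources))
        proj = {t for s, t in alignment if s in src_pos}
        if not proj:
            return "null"                    # empty projection
        if not targets:
            return "div"                     # no aligned target adjuncts
        tgt_pos = set().union(*(a.positions for a in targets))
        back = {s for s, t in alignment if t in tgt_pos}
        bijective = proj == tgt_pos and back == src_pos
        many = len(sources) > 1 or len(targets) > 1
        if bijective:
            return "eq-nm" if many else "eq-11"
        return "we-nm" if many else "we-11"

On the pairings of Table 1, this assigns div to e4, we-11 to 〈f5, e6〉 and eq-nm to the pairing of f2 with e2 and e6, matching the discussion above.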

As Table 3 shows, the divergent types null and div regroup untranslated adjuncts (Example 1) and divergent adjuncts: Examples (2) and (3) show cases of conflational divergence (Dorr, 1994), which appear in different types because of the underlying word alignments; in Example (4), the prepositional phrase with this task has been wrongly labelled as an adjunct, leading to a falsely divergent pairing. The weakly equivalent types we-nm and we-11 regroup both divergent and equivalent pairings: the adjuncts of Examples (5) and (8) are aligned by our method to adjuncts that are not their translation equivalents, the adjunct in Example (6) cannot be aligned to its equivalent because of a parsing error, and the equivalences in Examples (7) and (9) cannot be identified because of a word-alignment error. Finally, we show a number of equivalent pairings (eq-nm and eq-11): in Example (10), an attachment error in the French parse induces a many-to-one equivalence where there should be two one-to-one equivalences; Examples (11) to (13) show a number of true many-to-many equivalences, while Examples (14) and (15) show that adjuncts may be equivalent across a bitext while belonging to a different syntactic category and modifying a different type of phrase (15).

3 Adjunct identification

We identify adjuncts in dependency trees obtained by conversion from phrase-structure trees: we map modifier labels to adjuncts, except when the dependent is a closed-class word. For English, we use the Berkeley parser and convert its output with the pennconverter (Johansson and Nugues, 2007; Surdeanu et al., 2008); for French, we use the Berkeley parser and the functional role labeller of Candito et al. (2010). The pennconverter with default options and the French converter make similar structural choices concerning the representation of coordination and the choice of heads.

3.1 English adjuncts

We first identify closed-class words by their POS tag: CC, DT, EX, IN, POS, PRP, PRP$, RP, SYM, TO, WDT, WP, WP$, WRB. Punctuation marks, identified by the P dependency relation, and name dependencies, identified by NAME, POSTHON, or TITLE, are also treated as closed-class words.

Adjuncts are identified by the dependency relation: ADV, APPO, NMOD (except determiners, possessives and 'of' complements), PRN, AMOD (except when the head is labeled with ADV) and PMOD left of its head. Cardinals, identified by the CD POS tag, and remaining dependents are classified as arguments.

3.2 French adjuncts

Closed-class words are identified by the (coarse) POS tags: C, D, CL, P, PONCT, P+D, PRO. Auxiliary verbs, identified by the dependency relations aux_tps and aux_pass, are also included.

Adjuncts are identified by the dependency relations mod_rel and mod (except if the dependent's head is a cardinal number, identified by the s=card label).
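As a rough illustration of how these filters compose, the sketch below applies the English label sets from section 3.1 to a dependency arc. The arc structure is hypothetical, and some of the finer exceptions in the text (determiners, possessives and 'of' complements under NMOD) are only noted in a comment.

    # Label sets copied from section 3.1; the arc structure is assumed.
    EN_CLOSED_POS = {"CC", "DT", "EX", "IN", "POS", "PRP", "PRP$", "RP",
                     "SYM", "TO", "WDT", "WP", "WP$", "WRB"}
    EN_CLOSED_RELS = {"P", "NAME", "POSTHON", "TITLE"}
    EN_ADJUNCT_RELS = {"ADV", "APPO", "NMOD", "PRN", "AMOD", "PMOD"}

    def is_english_adjunct(arc):
        """arc: hypothetical dependency arc with .pos, .rel, .head_rel,
        .position and .head_position attributes."""
        if arc.pos in EN_CLOSED_POS or arc.rel in EN_CLOSED_RELS:
            return False                   # closed-class word
        if arc.pos == "CD":
            return False                   # cardinals are arguments
        if arc.rel == "PMOD":
            return arc.position < arc.head_position   # left of head only
        if arc.rel == "AMOD" and arc.head_rel == "ADV":
            return False
        # The NMOD exceptions (determiners, possessives, 'of'
        # complements) are omitted here for brevity.
        return arc.rel in EN_ADJUNCT_RELS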

3.3 Evaluation

We evaluate adjunct identification accuracy using a set of 100 English and French sentences, drawn randomly from the Europarl corpus. A single annotator marked adjuncts in both sets, identifying slightly more than 500 adjuncts in both sets. We find F scores of 71.3 and 72.2 for English and French respectively, as summarized in Table 4. We find that about a quarter of the errors are related to parse attachment; correcting them yields scores of 77.7 and 78.6.
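The scores in Table 4 follow the usual precision/recall arithmetic; for instance, the English automatic row is consistent with:

    # F score as the harmonic mean of precision and recall; the numbers
    # reproduce the English automatic row of Table 4.
    p, r = 66.2, 77.2
    f = 2 * p * r / (p + r)
    print(round(f, 1))    # 71.3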


Table 3: Examples of adjunct pairing types

    null
      (1)  it is indeed a great honour / vous me faites un grand honneur
      (2)  the polling booths / les isoloirs
    div
      (3)  the voting stations / les isoloirs
      (4)  to be entrusted with this task / en me confiant cette tâche
    we-nm
      (5)  reforms to the Canadian military / réformes des forces [armées] [canadiennes]
      (6)  an even greater country / un pays [encore] [plus] magnifique
      (7)  in safe communities / [en sécurité] [dans nos communautés]
    we-11
      (8)  across the land / de tout le pays
      (9)  strong opinions / des opinions bien arrêtées
    eq-nm
      (10) a proud moment for Canada / un moment heureux pour le Canada
      (11) we have used the wrong process / nous ne suivons pas le bon processus
      (12) our common space and our common means / un espace et des moyens communs
      (13) the [personal] [protected] files / les dossiers confidentiels et protégés
    eq-11
      (14) the names just announced / les noms que je viens de mentionner
      (15) one in three Canadian jobs / au Canada, un emploi sur trois

Table 4: Adjunct identification F scores

               prec.   recall   F
    En auto.   66.2    77.2     71.3
       corr.   72.3    84.0     77.7
    Fr auto.   68.1    76.7     72.2
       corr.   74.7    83.0     78.6

4 Experiments

4.1 Experimental set-up

We measure adjunct translation equivalence in four data sets: the manually aligned Canadian Hansards corpus (Och and Ney, 2003), containing 447 sentence pairs; the house and senate training data of the Canadian Hansards (1.13M sentence pairs); the French-English Europarl training set (1.97M sentence pairs); and the Moses news-commentary corpus (156k sentence pairs). Besides, we randomly selected 100 sentence pairs from the Europarl set to measure adjunct identification accuracy, as reported in section 3, and adjunct correspondence with gold adjunct annotations.

All four corpora except the manual Hansards are preprocessed to keep sentences with up to 80 words, and all four data sets are used jointly to train unsupervised alignments, both with the Berkeley aligner (Liang et al., 2006) and GIZA++ (Brown et al., 1993; Och and Ney, 2003) through mgiza (Gao and Vogel, 2008), using 5 iterations of Model 1 and 5 iterations of HMM for the Berkeley aligner, and 5 iterations of Model 1 and HMM and 3 iterations of Model 3 and Model 4 for GIZA++. The GIZA++ alignments are symmetrized using the grow-diag-final heuristic. Besides, the manual Hansards corpus is aligned with Sure-Only (SO) and Sure-and-Possible (SP) manual alignments.

4.2 Measurements with gold adjunct annotations

We compared adjunct translation equivalence of automatically identified adjuncts and gold annotations using 100 manually annotated sentence pairs from the Europarl corpus; adjuncts were aligned automatically, using the Berkeley word alignments. We also measured adjunct equivalence using automatic adjunct annotations corrected for parse-attachment errors, as introduced in section 3.3. Table 5 reports harmonic-mean figures (mh) for each adjunct projection type. For information, we also report their decomposition in the case of gold annotations, showing some dependence on the projection direction.

Table 5: Translation equivalence of automatic, rebracketed and gold adjuncts

             auto.   corr.   ------ gold ------
             mh      mh      ef      fe      mh
    null      7.6     7.7     8.1     7.3     7.7
    div      22.3    22.5    14.7    12.0    13.2
    we-nm    10.8     9.6     2.7     4.6     3.4
    we-11    12.5    10.8     7.4     8.5     7.9
    eq-nm     3.5     2.2     2.5     3.3     2.9
    eq-11    41.8    45.8    64.5    64.3    64.4
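The mh columns appear to be harmonic means of the two directional proportions; the gold columns of Table 5 are consistent with this reading, as the small check below shows.

    # Harmonic mean of the English-to-French and French-to-English
    # proportions; reproduces the gold mh column of Table 5.
    def harmonic_mean(ef, fe):
        return 2 * ef * fe / (ef + fe)

    print(round(harmonic_mean(8.1, 7.3), 1))    # 7.7 (null row)
    print(round(harmonic_mean(2.7, 4.6), 1))    # 3.4 (we-nm row)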

About two thirds of manually identified adjuncts form equivalent pairs, a gain of 20 points with regard to automatically identified adjuncts. This is accompanied by a halving of divergent pairings and of weakly equivalent ones. Further, we find that about half of the remaining weak equivalences can be interpreted as translation equivalent (compared to an estimated third for automatically identified adjuncts), allowing us to estimate at 70% the degree of translation equivalence given Berkeley word alignments in the Europarl corpus.

4.3 Measurements with manual and automatic alignments

We aligned adjuncts in the manual Hansards corpus using all four word alignments. Table 6 presents the mean proportions for each category of adjunct projection.

Table 6: Translation equivalence of adjuncts in the manual Hansards

             SO      SP      bky     giza
    null     32.1     2.8     8.7     3.3
    div      19.7    29.3    27.1    30.3
    we-nm     3.4    14.6     8.5    11.4
    we-11     5.7    13.8    13.5    15.3
    eq-nm     4.1     7.3     4.1     4.2
    eq-11    33.7    31.8    37.6    35.3

Comparing the mean proportions per type between the four alignments, we see that a third of the adjuncts on either side are not aligned at all with the sure-only manual alignments. In the example of Figure 2, for instance, these alignments do not link f3 to e3. On the other hand, the sure-and-possible manual alignments lead to many divergent or weakly equivalent pairings, a result of their dense phrasal alignments. In comparison, the automatic alignments connect more words than the sure-only alignments, leading to a mixed result for the adjunct pairings: one gains more translation-equivalent pairs, but also more divergent and weakly equivalent ones. In this, the Berkeley aligner appears less noisy than GIZA++, as it captures more translation-equivalent pairs and fewer weakly equivalent ones. This is confirmed in the other data sets too, as Table 7 shows.

Table 7: Mean proportions of adjunct-pairing types in automatically aligned data

             hans-hst        europarl        news
             bky     giza    bky     giza    bky     giza
    null      7.5     2.7     6.3     2.3     8.3     3.3
    div      28.1    30.8    21.8    24.2    21.0    23.9
    we-nm    10.4    12.2    11.0    12.7    10.6    12.6
    we-11    13.4    15.5    12.4    14.6    11.7    14.2
    eq-nm     3.2     4.0     3.2     4.0     3.1     3.8
    eq-11    37.1    34.6    45.0    42.0    44.9    41.8

Comparing figures between the different data sets, we see that the Europarl and the News data have more translation-equivalent and fewer divergent adjuncts than the Hansards training data (hans-hst). Taking the harmonic mean for both equivalent types (eq-nm and eq-11), we find that 48.2% of adjuncts have an adjunct translation equivalent in the Europarl data (with the Berkeley aligner) and 48.0% in the News corpus, against 40.3% in the Hansards training set and 41.6% in the manual Hansards set. This suggests that translations in the Hansards data are less literal than in the Europarl or the News corpus.

4.4 Effect of sentence length

We explore the relation between sentence length and translation equivalence by performing measurements in bucketed data. We bucket the data using the length of the English sentences. Measurements are reported in Table 8 for the Hansards and the Europarl sets (the News set yields similar results to the Europarl data).

Table 8: Adjunct translation equivalence with the Berkeley aligner in bucketed data

             hans-man        hansard-hst                     europarl
             1-15    16-30   1-15    16-30   31-50   51-80   1-15    16-30   31-50   51-80
    null      9.3     8.5     6.5     7.6     7.8     8.0     6.4     6.0     6.2     6.6
    div      28.1    25.9    39.5    25.3    23.5    22.6    25.3    22.2    21.2    20.6
    we-nm     6.1     9.4     5.3    10.1    13.6    16.7     5.0     9.3    12.5    14.9
    we-11    11.8    14.1    12.2    13.4    14.2    14.8    10.0    11.7    13.0    13.9
    eq-nm     3.1     4.5     2.8     3.4     3.3     3.1     3.4     3.3     3.1     2.9
    eq-11    40.6    36.3    32.5    39.6    37.3    34.4    49.1    47.1    43.7    40.7


All data sets show a dramatic increase of the proportion of adjuncts involved in many-to-many, and to a lesser extent one-to-one, weakly equivalent translations. This increase is accompanied by a decrease of all other adjunct-pairing types (unaligned adjuncts excepted), and is likely to result from increased word-alignment and parsing errors with sentence length.

A rather surprising result is the high proportion of divergent adjunct translations in the shorter sentences of the Hansards training set; we find the same phenomenon with the GIZA++ alignment. We attribute this effect to the Hansards set having less literal translations than the other sets. That we see this effect mostly in shorter sentences may result from translation mismatches being mostly local. As sentence length increases, however, word and adjunct alignment errors are also likely to link more unrelated adjuncts, resulting in a drop of divergent adjuncts.

4.5 Simplifying alignments

We perform a simple experiment to test the effect of word-alignment simplification on adjunct translation equivalence. For this we remove alignment links between function words (as defined in section 3) on both sides of the data, and we realign adjuncts using these simplified alignments. Table 9 shows that this simplification (column '-fw') slightly decreases the proportion of weakly equivalent pairings with regard to the standard alignment ('std'), mostly to the benefit of translation-equivalent pairings. This suggests that further gains may be obtained with better alignments.

Table 9: Effect of alignment simplification on adjunct translation equivalence in the Europarl data

             bky             giza
             std     -fw     std     -fw
    null      6.3     7.5     2.3     3.1
    div      21.8    21.5    24.2    24.0
    we-nm    11.0     9.1    12.7    10.8
    we-11    12.4    10.0    14.6    13.2
    eq-nm     3.2     4.0     4.0     4.8
    eq-11    45.0    47.5    42.0    43.7
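The simplification itself is a one-line filter over the alignment. The sketch below is our illustration under an assumed representation, reading "links between function words" as links whose two endpoints are both closed-class words.

    # Drop alignment links that connect a function word to a function
    # word (closed-class on both ends); all other links are kept.
    def simplify(alignment, src_is_fw, tgt_is_fw):
        return {(s, t) for s, t in alignment
                if not (src_is_fw[s] and tgt_is_fw[t])}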

5 Related work

While adjunction is a formal operation that may be applied to non-linguistic adjuncts in STAG, DeNeefe and Knight (2009) restrict it to syntactic adjuncts in a Synchronous Tree Insertion Grammar. They identify complements using Collins's (2003) rules, and regard all other non-head constituents as adjuncts. Their model is able to generalize to unseen adjunction patterns, and to beat a string-to-tree baseline in an Arabic-English translation task.

Arnoult and Sima'an (2012) exploit adjunct optionality to generate new training data for a phrase-based model, by removing phrase pairs with an English adjunct from the training data. They identify adjuncts using syntactic heuristics in phrase-structure parses. They found that few of the generated phrase pairs were actually used at decoding, leading to marginal improvement over the baseline in a French-English task. They also report figures of role preservation for different categories of adjuncts, with lower bounds between 29% and 65% and upper bounds between 61% and 78%, in automatically aligned Europarl data. The upper bounds are limited by discontinuous adjunct projections, while the estimation of lower bounds is limited by the lack of adjunct-identification means for French.

There has been a growing body of work on exploiting semantic annotations for SMT. In many cases, predicate-argument structures are used to provide source-side contextual information for lexical selection and/or reordering (Xiong et al., 2012; Li et al., 2013), without requiring cross-linguistic correspondence. When correspondence between semantic roles is required, predicates are commonly aligned first. For instance, Lo et al. (2012) use a maximum-weighted bipartite matching algorithm with a lexical-similarity measure to align predicates when evaluating semantic-role correspondence. Padó and Lapata (2009) use the same algorithm with a similarity measure based on constituent overlap to project semantic roles from English to German.

6 Conclusion

In this paper we presented the first study of translation equivalence of adjuncts on a variety of French-English parallel corpora and word alignments. We use a method based on overlap to derive many-to-many adjunct pairings that are interpretable in terms of translation equivalence.

We found through measurements in French-English data sets that 40% to 50% of adjuncts, depending on the data, are bijectively aligned across a bitext, whereas about 25% more adjuncts align to adjuncts, albeit not bijectively. We estimate that a third of these weakly equivalent links represent true adjunct translation equivalences.

With manually identified adjuncts, we found that about 70% have adjunct translation equivalents in automatically aligned data. These are fairly low results if one considers that French and English are relatively close syntactically. So while they show that adjunct labelling accuracy on both sides of the data is crucial for adjunct alignment, and that applications that exploit adjunction can gain from decreasing their dependence on word alignments and idealized experimental conditions, they also call for a better understanding of the factors behind translation divergence.

In fact, as a remaining quarter of adjuncts have divergent translations, it would be interesting to determine, for instance, the degree to which divergence is caused by lexical conflation or reflects non-literal translations.

Acknowledgments

We thank the anonymous reviewers for their pertinent comments. This research is part of the project "Statistical Translation of Novel Constructions", which is supported by an NWO VC EW grant from the Netherlands Organisation for Scientific Research (NWO).

References

Sophie Arnoult and Khalil Sima'an. 2012. Adjunct Alignment in Translation Data with an Application to Phrase-Based Statistical Machine Translation. In Proceedings of the 16th Annual Conference of the European Association for Machine Translation, pages 287-294.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263-311.

M.-H. Candito, B. Crabbé, and P. Denis. 2010. Statistical French dependency parsing: treebank conversion and first results. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC).

Michael Collins. 2003. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4):589-637.

Steve DeNeefe and Kevin Knight. 2009. Synchronous Tree Adjoining Machine Translation. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 727-736.

Bonnie J. Dorr. 1994. Machine Translation Divergences: A Formal Description and Proposed Solution. Computational Linguistics, 20(4):597-633.

Qin Gao and Stephan Vogel. 2008. Parallel Implementations of Word Alignment Tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57, Columbus, Ohio, June. Association for Computational Linguistics.

Rebecca Hwa, Philip Resnik, Amy Weinberg, and Okan Kolak. 2002. Evaluating Translational Correspondence Using Annotation Projection. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, ACL '02, pages 392-399.

Richard Johansson and Pierre Nugues. 2007. Extended Constituent-to-dependency Conversion for English. In Proceedings of NODALIDA 2007, pages 105-112, Tartu, Estonia, May 25-26.

Aravind K. Joshi, Leon S. Levy, and Masako Takahashi. 1975. Tree adjunct grammars. Journal of Computer and System Sciences, 10(1):136-163.

Junhui Li, Philip Resnik, and Hal Daumé III. 2013. Modeling Syntactic and Semantic Structures in Hierarchical Phrase-based Translation. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 540-549, Atlanta, Georgia.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL '06, pages 104-111.

Chi-kiu Lo, Anand Karthik Tumuluru, and Dekai Wu. 2012. Fully Automatic Semantic MT Evaluation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, WMT '12, pages 243-252.

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29:19-51.

Sebastian Padó and Mirella Lapata. 2009. Cross-lingual Annotation Projection for Semantic Roles. Journal of Artificial Intelligence Research, 36:307-340.

Stuart Shieber and Yves Schabes. 1990. Synchronous tree-adjoining grammars. In Handbook of Formal Languages, pages 69-123. Springer.

Mihai Surdeanu, Richard Johansson, Adam Meyers, Lluís Màrquez, and Joakim Nivre. 2008. The CoNLL-2008 Shared Task on Joint Parsing of Syntactic and Semantic Dependencies. In CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning, pages 159-177, Manchester, United Kingdom.

Dekai Wu and Pascale Fung. 2009. Can Semantic Role Labeling Improve SMT? In Proceedings of the 13th Annual Conference of the European Association for Machine Translation, pages 218-225.

Deyi Xiong, Min Zhang, and Haizhou Li. 2012. Modeling the Translation of Predicate-Argument Structure for SMT. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 902-911.


Author Index

Addanki, Karteek, 112
Alkhouli, Tamer, 1
Arnoult, Sophie, 157
Bahdanau, Dzmitry, 78, 103
Bangalore, Srinivas, 51
Beloucif, Meriem, 22
Bengio, Yoshua, 78, 103
Casteleiro, João, 135
Cho, Kyunghyun, 78, 103
Chuchunkov, Alexander, 43
España-Bonet, Cristina, 132
Foster, Jennifer, 67
Galinskaya, Irina, 43
Guta, Andreas, 1
Hatakoshi, Yuto, 34
Hoang, Hieu, 148
Huck, Matthias, 148
Kaljahi, Rasoul, 67
Koehn, Philipp, 148
Li, Liangyou, 122
Liu, Qun, 122
Lo, Chi-kiu, 22
Lopes, Gabriel, 135
Maillette de Buy Wenniger, Gideon, 11
Màrquez, Lluís, 132
Martinez Garcia, Eva, 132
Nakamura, Satoshi, 34
Neubig, Graham, 34
Ney, Hermann, 1
Pouget-Abadie, Jean, 78
Roturier, Johann, 67
Sachdeva, Kunal, 51
Saers, Markus, 22, 86
Sakti, Sakriani, 34
Sennrich, Rico, 94
Sharma, Dipti Misra, 51
Silva, Joaquim, 135
Sima'an, Khalil, 11, 138, 157
Singla, Karan, 51
Sofianopoulos, Sokratis, 57
Stanojevic, Miloš, 138
Tambouratzis, George, 57
Tarelkin, Alexander, 43
Tiedemann, Jörg, 132
Toda, Tomoki, 34
van Merrienboer, Bart, 78, 103
Vassiliou, Marina, 57
Way, Andy, 122
Wu, Dekai, 22, 86, 112
Xie, Jun, 122
Yadav, Diksha, 51