[Paper introduction] Translating into Morphologically Rich Languages with Synthetic Phrases



MT Study Group

Translating into Morphologically Rich Languages with Synthetic Phrases

Victor Chahuneau, Eva Schlinger, Noah A. Smith, Chris Dyer

Proc. of EMNLP 2013, Seattle, USA

Introduced by Akiva Miura, AHC-Lab, IS, NAIST (15/10/15)

Contents

1. Introduction
2. Translate-and-Inflect Model
3. Morphological Grammars and Features
4. Inflection Model Parameter Estimation
5. Synthetic Phrases
6. Translation Experiments
7. Conclusion
8. Impressions

1. Introduction

Problem:
- MT into morphologically rich languages (e.g., inflectional languages) is challenging due to:
  - lexical sparsity
  - the large variety of grammatical features expressed with morphology

This paper:
- introduces a method that uses target-language morphological grammars to address this challenge
- improves translation quality from English into Russian/Hebrew/Swahili

1. Introduction (cont'd)

Proposed Approach:
- The approach decomposes the MT process into two steps:
  1. translating a word (or phrase) into a meaning-bearing stem
  2. selecting an appropriate inflection with a feature-rich discriminative model that conditions on the source context of the word being translated
- An inventory of translation rules for individual words and short phrases is obtained using the standard translation rule extraction techniques of Chiang (2007).
  - The authors call these synthetic phrases.

1. Introduction (cont'd)

Advantages:
The major advantages of this approach are:
1. synthesized forms are targeted to a specific translation context
2. multiple, alternative phrases may be generated, with the final choice among rules left to the global translation model
3. virtually no language-specific engineering is necessary
4. any phrase- or syntax-based decoder can be used without modification
5. forms can be generated that were not attested in the bilingual training data

2. Translate-and-Inflect Model

Figure 1: The inflection model predicts a form for the target verb lemma σ = пытаться (pytat'sya) based on its source word attempted and the linear and syntactic source context. The correct inflection for this training instance is μ = mis-sfm-e (equivalent to the more traditional morphological string +MAIN+IND+PAST+SING+FEM+MEDIAL+PERF).

[Figure 1 example: the Russian sentence "она пыталась пересечь пути на ее велосипед" aligned with "she had attempted to cross the road on her bike", annotated with POS tags (PRP VBD VBN TO VB DT NN IN PRP$ NN), Stanford dependencies (nsubj, aux, xcomp, root), the linear context positions -1/+1, and Brown cluster IDs; the target-side analysis is σ: пытаться_V, μ: mis-sfm-e.]


2. Translate-and-Inflect Model (cont'd)

Modeling Inflection:

[…] Swahili translation tasks, finding significant improvements in all language pairs (§6). We finally review related work (§7) and conclude (§8).

2 Translate-and-Inflect Model

The task of the translate-and-inflect model is illustrated in Fig. 1 for an English–Russian sentence pair. The input will be a sentence e in the source language (in this paper, always English) and any available linguistic analysis of e. The output f will be composed of (i) a sequence of stems, each denoted σ, and (ii) one morphological inflection pattern for each stem, denoted μ. When the information is available, a stem σ is composed of a lemma and an inflectional class. Throughout, we use Ω_σ to denote the set of possible morphological inflection patterns for a given stem σ. Ω_σ might be defined by a grammar; our models restrict Ω_σ to be the set of inflections observed anywhere in our monolingual or bilingual training data as a realization of σ.¹

We assume the availability of a deterministic function that maps a stem σ and morphological inflection μ to a target language surface form f. In some cases, such as our unsupervised approach in §3.2, this will be a concatenation operation, though finite-state transducers are traditionally used to define such relations (§3.1). We abstractly denote this operation by ⋄: f = σ ⋄ μ.

Our approach consists in defining a probabilistic model over target words f. The model assumes independence between each target word f conditioned on the source sentence e and its aligned position i in this sentence.² This assumption is further relaxed in §5 when the model is integrated in the translation system.

We decompose the probability of generating each target word f in the following way:

    p(f | e, i) = Σ_{σ ⋄ μ = f}  p(σ | e_i)  ·  p(μ | σ, e, i)
                                 [gen. stem]    [gen. inflection]

Here, each stem is generated independently from a single aligned source word e_i, but in practice we use a standard phrase-based model to generate sequences of stems and only the inflection model operates word by word. We turn next to the inflection model.

¹ This prevents the model from generating words that would be difficult for the language model to reliably score.
² This is the same assumption that Brown et al. (1993) make in, for example, IBM Model 1.

2.1 Modeling Inflection

In morphologically rich languages, each stem may be combined with one or more inflectional morphemes to express many different grammatical features (e.g., case, definiteness, mood, tense, etc.).

Since the inflectional morphology of a word generally expresses multiple grammatical features, we would like a model that naturally incorporates rich, possibly overlapping features in its representation of both the input (i.e., conditioning context) and output (i.e., the inflection pattern). We therefore use the following parametric form to model inflectional probabilities:

    u(μ, e, i) = exp( φ(e, i)ᵀ W ψ(μ) + ψ(μ)ᵀ V ψ(μ) ),
    p(μ | σ, e, i) = u(μ, e, i) / Σ_{μ′ ∈ Ω_σ} u(μ′, e, i).        (1)

Here, φ is an m-dimensional source context feature vector function, ψ is an n-dimensional morphology feature vector function, and W ∈ ℝ^{m×n} and V ∈ ℝ^{n×n} are parameter matrices. As with the more familiar log-linear parametrization that is written with a single feature vector, single weight vector, and single bias vector, this model is linear in its parameters (it can be understood as working with a feature space that is the outer product of the two feature spaces). However, using two feature vectors allows overlapping features of both the input and the output to be defined, which is important for modeling morphology, in which output variables are naturally expressed as bundles of features. The second term in the sum in u enables correlations among output features to be modeled independently of the input, and as such can be understood as a generalization of the bias terms in multi-class logistic regression (on the diagonal V_ii) and of the interaction terms between output variables in a conditional random field (off the diagonal V_ij).



- ψ(μ): n-dimensional morphology feature vector function


- φ(e, i): m-dimensional source context feature vector function
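To make Eq. (1) concrete, here is a minimal sketch of how the bilinear score and the normalization over Ω_σ might be computed. The dense NumPy representation and the helper name are illustrative assumptions, not the authors' implementation (the paper trained the model with a Cython SGD optimizer over sparse features).

```python
import numpy as np

def inflection_probs(phi, psi_candidates, W, V):
    """Return p(mu | sigma, e, i) for each candidate inflection mu in Omega_sigma.

    phi:            (m,) source-context feature vector phi(e, i)
    psi_candidates: list of (n,) morphology feature vectors psi(mu), one per candidate
    W:              (m, n) parameter matrix
    V:              (n, n) parameter matrix
    """
    scores = np.array([phi @ W @ psi + psi @ V @ psi   # phi^T W psi + psi^T V psi
                       for psi in psi_candidates])
    scores -= scores.max()        # subtract the max before exponentiating, for stability
    u = np.exp(scores)            # u(mu, e, i)
    return u / u.sum()            # normalize over Omega_sigma  -- Eq. (1)
```

In practice phi and each psi would be sparse binary vectors built from the source-context features of Figure 2 and the morphology features of §3.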

2. Translate-and-Inflect Model (cont'd)

Source Context Feature Extraction:


Figure 2: Source features φ(e, i) extracted from e and its linguistic analysis. π_i denotes the parent of the token in position i in the dependency tree and π_i → i the typed dependency link. For each of the following context words:
- the source aligned word e_i,
- the parent word e_{π_i} with its dependency π_i → i,
- all children e_j | π_j = i with their dependency i → j,
- the source words e_{i−1} and e_{i+1},
the features record the token, its part-of-speech tag, and its word cluster; additional features record whether e_i and e_{π_i} are at the root of the dependency tree, and the number of children and siblings of e_i.

2.2 Source Context Features: φ(e, i)

In order to select the best inflection of a target-language word, given the source word it translates and the context of that source word, we seek to exploit as many features of the context as are available. Consider the example shown in Fig. 1, where most of the inflection features of the Russian word (past tense, singular number, and feminine gender) can be inferred from the context of the English word it is aligned to. Indeed, many grammatical functions expressed morphologically in Russian are expressed syntactically in English. Fortunately, high-quality parsers and other linguistic analyzers are available for English.

On the source side, we apply the following processing steps:

- Part-of-speech tagging with a CRF tagger trained on sections 02–21 of the Penn Treebank.
- Dependency parsing with TurboParser (Martins et al., 2010), a non-projective dependency parser trained on the Penn Treebank to produce basic Stanford dependencies.
- Assignment of tokens to one of 600 Brown clusters, trained on 8G words of English text.³

We then extract binary features from e using this information, by considering the aligned source word e_i, its preceding and following words, and its syntactic neighbors. These are detailed in Figure 2.

3 Morphological Grammars and Features

We now describe how to obtain morphological analyses and convert them into feature vectors (ψ) for our target languages, Russian, Hebrew, and Swahili, using supervised and unsupervised methods.

3.1 Supervised Morphology

The state of the art in morphological analysis uses unweighted morphological transduction rules (usually in the form of an FST) to produce candidate analyses for each word in a sentence, and then statistical models to disambiguate among the analyses in context (Hakkani-Tür et al., 2000; Hajič et al., 2001; Smith et al., 2005; Habash and Rambow, 2005, inter alia).

³ The entire monolingual data available for the translation task of the 8th ACL Workshop on Statistical Machine Translation was used.

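A rough sketch of how the binary source-context features φ(e, i) of Figure 2 could be assembled, assuming the sentence has already been POS-tagged, dependency-parsed, and Brown-clustered as described in §2.2. The feature-name strings and the helper are hypothetical, not the paper's feature templates.

```python
def source_context_features(tokens, pos, clusters, heads, deprels, i):
    """Binary source-context features phi(e, i) for the aligned source word at position i.

    tokens, pos, clusters: one string per token (word form, POS tag, Brown cluster id)
    heads:   parent index of each token in the dependency tree (-1 for the root)
    deprels: typed dependency label linking each token to its parent
    """
    feats = set()

    def describe(prefix, j):
        # token, part-of-speech tag, and word cluster of context word j
        feats.add(f"{prefix}_token={tokens[j]}")
        feats.add(f"{prefix}_pos={pos[j]}")
        feats.add(f"{prefix}_cluster={clusters[j]}")

    describe("src", i)                                   # aligned word e_i
    if i > 0:
        describe("prev", i - 1)                          # e_{i-1}
    if i + 1 < len(tokens):
        describe("next", i + 1)                          # e_{i+1}
    if heads[i] >= 0:
        describe(f"parent({deprels[i]})", heads[i])      # parent word with its dependency
    else:
        feats.add("src_is_root")                         # is e_i the root of the tree?
    children = [j for j, h in enumerate(heads) if h == i]
    for j in children:
        describe(f"child({deprels[j]})", j)              # children with their dependencies
    feats.add(f"num_children={len(children)}")
    siblings = [j for j, h in enumerate(heads) if h == heads[i] and j != i]
    feats.add(f"num_siblings={len(siblings)}")
    return feats
```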

3. Morphological Grammars and Features

Supervised morphology features ψ(μ): extracted from the output of a morphological analyzer (available only for Russian in this paper). Since a positional tag set is used, it is straightforward to convert each fixed-length tag μ into a feature vector by defining a binary feature for each key-value pair (e.g., Tense=past) composing the tag.
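As an illustration of this conversion, the sketch below maps a fixed-length positional tag to binary key-value features. The position-to-attribute layout shown is a made-up stand-in, not the actual Sharoff et al. (2008) tag scheme.

```python
def supervised_morph_features(tag, attribute_names):
    """Convert a fixed-length positional morphological tag into binary key-value features.

    tag:             a positional tag string, one character per grammatical attribute
    attribute_names: the attribute encoded at each position (hypothetical layout)
    """
    feats = {}
    for attr, value in zip(attribute_names, tag):
        if value != "-":                            # '-' marks an unspecified attribute
            feats[f"{attr}={value}"] = 1            # one binary feature per key-value pair
    return feats

# a made-up positional layout, NOT the real tag scheme:
print(supervised_morph_features(
    "mis-sfm-e",
    ["Type", "Mood", "Tense", "Person", "Number", "Gender", "Voice", "Aspect", "Form"]))
```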

3. Morphological Grammars and Features (cont'd)

Unsupervised morphology features ψ(μ): generated with an unsupervised analyzer (no language-specific engineering). Binary features are produced corresponding to the content of each potential affixation position relative to the stem, as in the excerpt below:

Unsupervised morphology features: ψ(μ). For the unsupervised analyzer, we do not have a mapping from morphemes to structured morphological attributes; however, we can create features from the affix sequences obtained after morphological segmentation. We produce binary features corresponding to the content of each potential affixation position relative to the stem:

    prefix ... -3 -2 -1 | STEM | +1 +2 +3 ... suffix

For example, the unsupervised analysis μ = wa+ki+wa+STEM of the Swahili word wakiwapiga will produce the following features:

    ψ_prefix[−3][wa](μ) = 1,
    ψ_prefix[−2][ki](μ) = 1,
    ψ_prefix[−1][wa](μ) = 1.

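The following sketch reproduces the affix-position features for the wa+ki+wa+STEM example; the feature-name format mirrors the ψ_prefix[−k][morpheme] notation above, but the helper itself is hypothetical.

```python
def unsupervised_morph_features(prefixes, suffixes):
    """Binary features recording the morpheme at each affixation position around the stem."""
    feats = {}
    for k, morpheme in enumerate(reversed(prefixes), start=1):   # -1 is the prefix next to the stem
        feats[f"prefix[-{k}][{morpheme}]"] = 1
    for k, morpheme in enumerate(suffixes, start=1):             # +1 is the suffix next to the stem
        feats[f"suffix[+{k}][{morpheme}]"] = 1
    return feats

# wakiwapiga segmented as wa+ki+wa+STEM:
print(unsupervised_morph_features(["wa", "ki", "wa"], []))
# {'prefix[-1][wa]': 1, 'prefix[-2][ki]': 1, 'prefix[-3][wa]': 1}
```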

3. Morphological Grammars and Features (cont'd)

Unsupervised Morphological Segmentation:

Let M represent the set of all possible morphemes and define a regular grammar M*σM* (i.e., zero or more prefixes, a stem, and zero or more suffixes).

(Paper excerpt, §3.1–3.2:) While this technique is capable of producing high-quality linguistic analyses, it is expensive to develop, requiring hand-crafted rule-based analyzers and annotated corpora to train the disambiguation models. As a result, such analyzers are only available for a small number of languages, and, as a practical matter, each analyzer (which resulted from different development efforts) operates differently from the others.

We therefore focus on using supervised analysis for a single target language, Russian. We use the analysis tool of Sharoff et al. (2008), which produces for each word in context a lemma and a fixed-length morphological tag encoding the grammatical features. We process the target side of the parallel data with this tool to obtain the information necessary to extract ⟨lemma, inflection⟩ pairs, from which we compute σ and morphological feature vectors ψ(μ).

3.2 Unsupervised Morphology

Since many languages into which we might want to translate do not have supervised morphological analyzers, we now turn to the question of how to generate morphological analyses and features using an unsupervised analyzer. We hypothesize that perfect decomposition into rich linguistic structures may not be required for accurate generation of new inflected forms. We test this hypothesis with a simple, unsupervised model of morphology that segments words into sequences of morphemes, assuming a (naïve) concatenative generation process and a single analysis per type.

Unsupervised morphological segmentation. We assume that each word can be decomposed into any number of prefixes, a stem, and any number of suffixes, following the regular grammar M*σM* above; the generative process for the vocabulary is restated on the slide below. We use blocked Gibbs sampling to sample segmentations for each word in the training vocabulary. Because of our particular choice of priors, it is possible to approximately decompose the posterior over the arcs of a compact finite-state machine; sampling a segmentation or obtaining the most likely segmentation a posteriori then reduces to familiar FST operations. This model is reminiscent of work on learning morphology using adaptor grammars (Johnson et al., 2006; Johnson, 2008).

The inferred morphological grammar is very sensitive to the Dirichlet hyperparameters (αp, αs, ασ), and these are, in turn, sensitive to the number of types in the vocabulary. Using αp, αs ≪ ασ ≪ 1 tended to recover useful segmentations, but we have not yet been able to find reliable generic priors for these values. Therefore, we selected them empirically to obtain a stem vocabulary size on the parallel data that is one-to-one with English.⁴ Future work will involve a more direct method for specifying or inferring these values.

⁴ Our default starting point was αp = αs = 10⁻⁶, ασ = 10⁻⁴; all parameters were then adjusted by factors of 10.

The vocabulary is assumed to be generated by the following process (inference is performed with blocked Gibbs sampling):

1. Sample morpheme distributions from symmetric Dirichlet distributions: θp ~ Dir|M|(αp) for prefixes, θσ ~ Dir|M|(ασ) for stems, and θs ~ Dir|M|(αs) for suffixes.
2. Sample length distribution parameters λp ~ Beta(βp, γp) for prefix sequences and λs ~ Beta(βs, γs) for suffix sequences.
3. Sample a vocabulary by creating each word type w using the following steps:
   a) Sample affix sequence lengths: lp ~ Geometric(λp); ls ~ Geometric(λs).
   b) Sample lp prefixes p1, ..., plp independently from θp; ls suffixes s1, ..., sls independently from θs; and a stem σ ~ θσ.
   c) Concatenate prefixes, the stem, and suffixes: w = p1 + ... + plp + σ + s1 + ... + sls.
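A toy simulation of the generative story above (forward sampling only; the blocked Gibbs inference used to fit segmentations is not shown). The morpheme inventory and the moderate hyperparameter values are made up for runnability; the paper's default hyperparameters (αp = αs = 10⁻⁶, ασ = 10⁻⁴) are far sparser.

```python
import numpy as np

rng = np.random.default_rng(0)
M = ["wa", "ki", "ni", "li", "piga", "enda", "kimbia", "a"]   # toy morpheme inventory

def sample_vocabulary(size, alpha_p=0.5, alpha_sigma=0.5, alpha_s=0.5,
                      beta_p=1.0, gamma_p=1.0, beta_s=1.0, gamma_s=1.0):
    # 1. sample morpheme distributions from symmetric Dirichlet priors
    theta_p = rng.dirichlet([alpha_p] * len(M))           # prefixes
    theta_sigma = rng.dirichlet([alpha_sigma] * len(M))   # stems
    theta_s = rng.dirichlet([alpha_s] * len(M))           # suffixes
    # 2. sample affix-length distribution parameters
    lam_p = rng.beta(beta_p, gamma_p)
    lam_s = rng.beta(beta_s, gamma_s)
    vocab = []
    for _ in range(size):
        # 3a. affix sequence lengths (numpy's geometric counts trials, so subtract 1)
        lp = rng.geometric(lam_p) - 1
        ls = rng.geometric(lam_s) - 1
        # 3b. sample lp prefixes, ls suffixes, and one stem
        prefixes = rng.choice(M, size=lp, p=theta_p)
        suffixes = rng.choice(M, size=ls, p=theta_s)
        stem = rng.choice(M, p=theta_sigma)
        # 3c. concatenate prefixes, the stem, and suffixes
        vocab.append("".join(prefixes) + stem + "".join(suffixes))
    return vocab

print(sample_vocabulary(5))
```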

4. Inflection Model Parameter Estimation

Estimating Parameters: To set the parameters W and V of the inflection prediction model, we use stochastic gradient descent to maximize the conditional log-likelihood of a training set consisting of pairs of source (English) sentence contextual features (φ) and target word inflectional features (ψ). When morphological category information is available, we train an independent model for each open-class category (in Russian: nouns, verbs, adjectives, numerals, adverbs); otherwise a single model is used for all words.
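A condensed sketch of one stochastic gradient step on the conditional log-likelihood of Eq. (1), written against the hypothetical inflection_probs helper from the §2 sketch. The closed-form gradients (observed minus expected feature statistics) follow from the log-linear form, but this is an assumption-laden illustration, not the paper's Cython implementation.

```python
import numpy as np

def sgd_step(phi, psi_candidates, gold_index, W, V, lr=0.1):
    """One stochastic gradient ascent step on log p(mu_gold | sigma, e, i) under Eq. (1)."""
    p = inflection_probs(phi, psi_candidates, W, V)    # from the earlier sketch
    psi_gold = psi_candidates[gold_index]
    # model expectations of the output features and their outer products
    E_psi = sum(pk * psi for pk, psi in zip(p, psi_candidates))
    E_psi_psi = sum(pk * np.outer(psi, psi) for pk, psi in zip(p, psi_candidates))
    # gradient of the log-likelihood: observed minus expected sufficient statistics
    W += lr * np.outer(phi, psi_gold - E_psi)
    V += lr * (np.outer(psi_gold, psi_gold) - E_psi_psi)
    return W, V
```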

4. Inflection Model Parameter Estimation (cont'd)

Intrinsic Evaluation:

Table 1: Corpus statistics.

             Parallel                                                  Parallel+Monolingual
             Sentences  EN-tokens  TRG-tokens  EN-types  TRG-types    Sentences  TRG-tokens  TRG-types
    Russian  150k       3.5M       3.3M        131k      254k         20M        360M        1,971k
    Hebrew   134k       2.7M       2.0M        48k       120k         806k       15M         316k
    Swahili  15k        0.3M       0.3M        23k       35k          596k       13M         334k

4 Inflection Model Parameter Estimation

To set the parameters W and V of the inflection prediction model (Eq. 1), we use stochastic gradient descent to maximize the conditional log-likelihood of a training set consisting of pairs of source (English) sentence contextual features (φ) and target word inflectional features (ψ). The training instances are extracted from the word-aligned parallel corpus, with the English side preprocessed as discussed in §2.2 and the target side disambiguated as discussed in §3. When morphological category information is available, we train an independent model for each open-class category (in Russian: nouns, verbs, adjectives, numerals, adverbs); otherwise a single model is used for all words (excluding words less than four characters long, which are ignored).

Statistics of the parallel corpora used to train the inflection model are summarized in Table 1. It is important to note here that our richly parameterized model is trained on the full parallel training corpus, not just on a handful of development sentences (which are typically used to tune MT system parameters). Despite this scale, training is simple: the inflection model is trained to discriminate among different inflectional paradigms, not over all possible target language sentences (Blunsom et al., 2008) or learning from all observable rules (Subotin, 2011). This makes the training problem relatively tractable: all experiments in this paper were trained on a single processor using a Cython implementation of the SGD optimizer. For our largest model, trained on 3.3M Russian words, n = 231K × m = 336 features were produced, and 10 SGD iterations were performed in less than 16 hours.

4.1 Intrinsic Evaluation

Before considering the broader problem of integrating the inflection model in a machine translation system, we perform an artificial evaluation to verify that the model learns sensible source sentence-target inflection patterns. To do so, we create an inflection test set as follows. We preprocess the source (English) sentences exactly as during training (§2.2), and using the target language morphological analyzer, we convert each aligned target word to ⟨stem, inflection⟩ pairs. We perform word alignment on the held-out MT development data for each language pair (cf. Table 1), exactly as if it were going to produce training instances, but instead we use them for testing.

Although the resulting dataset is noisy (e.g., due to alignment errors), this becomes our intrinsic evaluation test set. Using this data, we measure inflection quality using two measurements.⁵

⁵ Note that we are not evaluating the stem translation model, just the inflection prediction model.

Table 2: Intrinsic evaluation of the inflection model (N: nouns, V: verbs, A: adjectives, M: numerals).

                              acc.    ppl.   |Ω_σ|
    Supervised  Russian  N    64.1%   3.46    9.16
                         V    63.7%   3.41   20.12
                         A    51.5%   6.24   19.56
                         M    73.0%   2.81    9.14
                   average    63.1%   3.98   14.49
    Unsup.      Russian  all  71.2%   2.15    4.73
                Hebrew   all  85.5%   1.49    2.55
                Swahili  all  78.2%   2.09   11.46

The two measurements are:
- the accuracy of predicting the inflection given the source, source context, and target stem, and
- the inflection model perplexity on the same set of test instances.

Additionally, we report the average number of possible inflections for each stem, an upper bound to the perplexity that indicates the inherent difficulty of the task. The results of this evaluation are presented in Table 2 for the three language pairs considered. We remark on two patterns in these results. First, perplexity is substantially lower than the perplexity of a uniform model, indicating our model is overall quite effective at predicting inflections using source context only. Second, in the supervised Russian results, we see that predicting the inflections of adjectives is relatively more difficult than for other parts of speech. Since adjectives agree with the nouns they modify in gender and case, and gender is an idiosyncratic feature of Russian nouns (and therefore not directly predictable from the English source), this difficulty is unsurprising.

We can also inspect the weights learned by the model to assess the effectiveness of the features in relating source-context structure with target-side morphology. Such an analysis is presented in Fig. 3.

Intrinsic evaluation of inflection model (N: nouns, V: verbs, A: adjectives, M: numerals).
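A small sketch of computing the two intrinsic measurements (accuracy of the most likely inflection and model perplexity), assuming each test instance supplies the model's distribution over Ω_σ and the index of the observed inflection; the data format is hypothetical.

```python
import math

def intrinsic_eval(instances):
    """instances: list of (probs, gold_index) pairs, where probs is the model's
    distribution over the candidate inflections Omega_sigma for one test word."""
    correct = 0
    total_log_prob = 0.0
    for probs, gold in instances:
        if max(range(len(probs)), key=probs.__getitem__) == gold:
            correct += 1                            # accuracy of the most likely inflection
        total_log_prob += math.log(probs[gold])     # log-likelihood of the observed inflection
    accuracy = correct / len(instances)
    perplexity = math.exp(-total_log_prob / len(instances))
    return accuracy, perplexity
```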

4. Inflection Model Parameter Estimation (cont'd)

Feature Ablation:

acc. ppl. |��|Su

perv

ised

Russian

N 64.1% 3.46 9.16V 63.7% 3.41 20.12A 51.5% 6.24 19.56M 73.0% 2.81 9.14

average 63.1% 3.98 14.49

Uns

up. Russian all 71.2% 2.15 4.73

Hebrew all 85.5% 1.49 2.55Swahili all 78.2% 2.09 11.46

Table 2: Intrinsic evaluation of inflection model (N:nouns, V: verbs, A: adjectives, M: numerals).

• the accuracy of predicting the inflection giventhe source, source context and target stem, and

• the inflection model perplexity on the same setof test instances.

Additionally, we report the average number of pos-sible inflections for each stem, an upper bound to theperplexity that indicates the inherent difficulty of thetask. The results of this evaluation are presented inTable 2 for the three language pairs considered. Weremark on two patterns in these results. First, per-plexity is substantially lower than the perplexity of auniform model, indicating our model is overall quiteeffective at predicting inflections using source con-text only. Second, in the supervised Russian results,we see that predicting the inflections of adjectivesis relatively more difficult than for other parts-of-speech. Since adjectives agree with the nouns theymodify in gender and case, and gender is an idiosyn-cratic feature of Russian nouns (and therefore notdirectly predictable from the English source), thisdifficulty is unsurprising.

We can also inspect the weights learned by themodel to assess the effectiveness of the featuresin relating source-context structure with target-sidemorphology. Such an analysis is presented in Fig. 3.

4.2 Feature Ablation

Our inflection model makes use of numerous fea-ture types. Table 3 explores the effect of removingdifferent kinds of (source) features from the model,evaluated on predicting Russian inflections usingsupervised morphological grammars.6 Rows 2–3just the inflection prediction model.

⁶ The models used in the feature ablation experiment were trained on fewer examples, resulting in overall lower accuracies than seen in Table 2, but the pattern of results is the relevant datapoint here.

Rows 2–3 show the effect of removing either linear or dependency context. We see that both are necessary for good performance; however, removing dependency context substantially degrades performance of the model (we interpret this result as evidence that Russian morphological inflection captures grammatical relationships that would be expressed structurally in English). The bottom four rows explore the effect of source language word representation. The results indicate that lexical features are important for accurate prediction of inflection, and that POS tags and Brown clusters are likewise important, but they seem to capture similar information (removing one has little impact, but removing both substantially degrades performance).

Table 3: Feature ablation experiments using supervised Russian classification experiments.

Features (φ(e, i))             acc.
all                            54.7%
− linear context               52.7%
− dependency context           44.4%
− POS tags                     54.5%
− Brown clusters               54.5%
− POS tags, − Brown clusters   50.9%
− lexical items                51.2%
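The ablation protocol itself can be pictured with a small sketch (hypothetical; the grouped feature dictionaries and the choice of a scikit-learn classifier are assumptions, not the paper's implementation): retrain the inflection classifier with one feature group withheld and compare accuracies.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def flatten(instance, excluded=()):
    """instance: dict of feature groups, e.g. {"linear context": {...},
    "dependency context": {...}, "POS tags": {...}}; drop the excluded groups."""
    feats = {}
    for group, group_feats in instance.items():
        if group not in excluded:
            feats.update(group_feats)
    return feats

def ablation_accuracy(train, test, excluded=()):
    """train/test: lists of (grouped_features, inflection_label) pairs."""
    model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    model.fit([flatten(x, excluded) for x, _ in train], [y for _, y in train])
    return model.score([flatten(x, excluded) for x, _ in test], [y for _, y in test])

# A row like "- dependency context" in Table 3 would correspond to
# ablation_accuracy(train, test, excluded=("dependency context",)).
```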

5 Synthetic Phrases

We turn now to translation; recall that our translate-and-inflect model is used to augment the set of rules available to a conventional statistical machine translation decoder. We refer to the phrases it produces as synthetic phrases.

Our baseline system is a standard hierarchical phrase-based translation model (Chiang, 2007). Following Lopez (2007), the training data is compiled into an efficient binary representation which allows extraction of sentence-specific grammars just before decoding. In our case, this also allows the creation of synthetic inflected phrases that are produced conditioning on the sentence to translate.
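As a rough sketch of the sentence-specific grammar idea (simplified here to a filtered phrase-table lookup; the actual system uses the suffix-array machinery of Lopez (2007)), only rules whose source side occurs in the input sentence are kept, and this is also the point where synthetic phrases can be injected:

```python
def sentence_grammar(sentence_tokens, phrase_table, max_len=5):
    """Keep only rules whose source side appears in the sentence being translated.
    phrase_table: dict mapping a source-phrase tuple to a list of rules (toy sketch)."""
    grammar = []
    for i in range(len(sentence_tokens)):
        for j in range(i + 1, min(i + max_len, len(sentence_tokens)) + 1):
            grammar.extend(phrase_table.get(tuple(sentence_tokens[i:j]), []))
    return grammar
```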


Feature ablation experiments using supervised Russian classification experiments.

5. Synthetic Phrases (cont'd)


Russian (supervised)
• Verb: 1st Person → child(nsubj)=I, child(nsubj)=we
• Verb: Future tense → child(aux)=MD, child(aux)=will
• Noun: Animate → source=animals/victims/...
• Noun: Feminine gender → source=obama/economy/...
• Noun: Dative case → parent(iobj)
• Adjective: Genitive case → grandparent(poss)

Hebrew
• Suffix ים (masculine plural) → parent=NNS, after=NNS
• Prefix א (first person sing. + future) → child(nsubj)=I, child(aux)='ll
• Prefix כ (preposition like/as) → child(prep)=IN, parent=as
• Suffix י (possessive mark) → before=my, child(poss)=my
• Suffix ה (feminine mark) → child(nsubj)=she, before=she
• Prefix כש (when) → before=when, before=WRB

Swahili
• Prefix li (past) → source=VBD, source=VBN
• Prefix nita (1st person sing. + future) → child(aux), child(nsubj)=I
• Prefix ana (3rd person sing. + present) → source=VBZ
• Prefix wa (3rd person plural) → before=they, child(nsubj)=NNS
• Suffix tu (1st person plural) → child(nsubj)=she, before=she
• Prefix ha (negative tense) → source=no, after=not

Figure 3: Examples of highly weighted features learned by the inflection model. We selected a few frequent morphological features and show their top corresponding source context features.
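The source-context features in Fig. 3 can be illustrated with a small, hypothetical extractor (the function signature and feature-name spellings are assumptions, not the authors' code):

```python
def source_context_features(tokens, pos, heads, deprels, i, word2cluster=None):
    """Sketch of source-context features for position i (0-based).
    tokens/pos: word and POS-tag lists; heads[i]: 0-based index of the head of
    token i (None for the root); deprels[i]: label of the arc to that head;
    word2cluster: optional word -> Brown-cluster map."""
    feats = {f"source={tokens[i]}": 1, f"source={pos[i]}": 1}
    if i > 0:
        feats[f"before={tokens[i-1]}"] = 1
        feats[f"before={pos[i-1]}"] = 1
    if i + 1 < len(tokens):
        feats[f"after={tokens[i+1]}"] = 1
        feats[f"after={pos[i+1]}"] = 1
    if heads[i] is not None:
        feats[f"parent({deprels[i]})"] = 1
        feats[f"parent={tokens[heads[i]]}"] = 1
        feats[f"parent={pos[heads[i]]}"] = 1
    for j, h in enumerate(heads):
        if h == i:  # token j is a dependency child of token i
            feats[f"child({deprels[j]})={tokens[j]}"] = 1
            feats[f"child({deprels[j]})={pos[j]}"] = 1
    if word2cluster:
        feats[f"cluster={word2cluster.get(tokens[i], 'UNK')}"] = 1
    return feats

# e.g. for "... she attempted to cross ...", the position of "attempted" yields
# features such as child(nsubj)=she, matching the patterns listed in Fig. 3.
```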

To generate these synthetic phrases with new inflections possibly unseen in the parallel training data, we first construct an additional phrase-based translation model on the parallel corpus preprocessed to replace inflected surface words with their stems. We then extract a set of non-gappy phrases for each sentence (e.g., X → <attempted, пытаться_V>). The target side of each such phrase is re-inflected, conditioned on the source sentence, using the inflection model from §2. Each stem is given its most likely inflection.⁷

The original features extracted for the stemmed phrase are conserved, and the following features are added to help the decoder select good synthetic phrases:

• a binary feature indicating that the phrase is synthetic,

• the log-probability of the inflected forms according to our model,

• the count of words that have been inflected, with a separate feature for each morphological category in the supervised case.

Finally, these synthetic phrases are combined with the original translation rules obtained for the baseline system to produce an extended sentence-specific grammar which is used as input to the decoder. If a phrase already existing in the standard phrase table happens to be recreated, both phrases are kept and will compete with each other with different features in the decoder.
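Putting these steps together, a hypothetical sketch of building one synthetic phrase and its extra features might look as follows (the rule representation, feature names, and `inflection_model.predict` interface are assumptions, not cdec's actual grammar format):

```python
import math

def make_synthetic_phrase(source_phrase, target_stems, source_sentence,
                          inflection_model):
    """target_stems: list of (stem, morphological_category, source_index).
    inflection_model.predict(sentence, index, stem) is assumed to return
    (best_inflected_form, probability) for that stem in this source context."""
    target, logprob, per_category = [], 0.0, {}
    for stem, category, idx in target_stems:
        form, prob = inflection_model.predict(source_sentence, idx, stem)
        target.append(form)
        logprob += math.log(max(prob, 1e-12))
        per_category[category] = per_category.get(category, 0) + 1
    features = {"IsSynthetic": 1.0,                      # binary synthetic indicator
                "InflectionLogProb": logprob,            # model score of the chosen forms
                "InflectedWordCount": float(len(target_stems))}
    for cat, count in per_category.items():              # per-category counts (supervised case)
        features[f"Inflected_{cat}"] = float(count)
    return (source_phrase, " ".join(target), features)
```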

⁷ Several reviewers asked about what happens when k-best inflections are added. The results for k ∈ {2, 4, 8} range from no effect to an improvement over k = 1 of about 0.2 BLEU (absolute). We hypothesize that larger values of k could have a greater impact, perhaps in a more "global" model of the target string; however, exploration of this question is beyond the scope of this paper.


For example, for the large EN→RU system, 6% of all the rules used for translation are synthetic phrases, with 65% of these phrases being entirely new rules.

6 Translation Experiments

We evaluate our approach in the standard discriminative MT framework. We use cdec (Dyer et al., 2010) as our decoder and perform MIRA training to learn feature weights of the sentence translation model (Chiang, 2012). We compare the following configurations:

• A baseline system, using a 4-gram language model trained on the entire monolingual and bilingual data available.

• An enriched system with a class-based n-gram language model⁸ trained on the monolingual data mapped to 600 Brown clusters. Class-based language modeling is a strong baseline for scenarios with high out-of-vocabulary rates but in which large amounts of monolingual target-language data are available.

• The enriched system further augmented with our inflected synthetic phrases. We expect the class-based language model to be especially helpful here and capture some basic agreement patterns that can be learned more easily on dense clusters than from plain word sequences.

⁸ For Swahili and Hebrew, n = 6; for Russian, n = 7.
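For illustration, a toy version of such a class-based n-gram model (add-alpha smoothing over cluster IDs; a real system would use an LM toolkit, and the word-to-cluster map is assumed given) could look like this:

```python
import math
from collections import defaultdict

def train_class_ngram_lm(sentences, word2cluster, n=6, alpha=0.1, classes=600):
    """Count-based n-gram model over Brown-cluster IDs (toy sketch only)."""
    counts, ctx_counts = defaultdict(int), defaultdict(int)
    for sent in sentences:
        seq = ["<s>"] * (n - 1) + [word2cluster.get(w, "UNK") for w in sent] + ["</s>"]
        for i in range(n - 1, len(seq)):
            gram = tuple(seq[i - n + 1:i + 1])
            counts[gram] += 1
            ctx_counts[gram[:-1]] += 1

    def logprob(history, cls):
        # history: list of preceding cluster IDs; cls: cluster ID being scored
        gram = tuple(history[-(n - 1):]) + (cls,)
        return math.log((counts[gram] + alpha) /
                        (ctx_counts[gram[:-1]] + alpha * classes))
    return logprob
```

Scoring the cluster sequence rather than the word sequence is what lets basic agreement patterns be learned from dense classes even when a specific inflected word is rare or unseen in the monolingual data.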


6. Translation Experiments


Experimental Set-Up:
● Comparing:
  • A baseline system (standard Hiero), using a 4-gram LM
  • An enriched system with a class-based n-gram LM trained on the monolingual data mapped to 600 Brown clusters
  • The enriched system further augmented with the inflected synthetic phrases
● Dataset:
  • Russian: News Commentary parallel corpus and additional monolingual data crawled from news websites
  • Hebrew: transcribed TED talks and additional monolingual news data
  • Swahili: parallel corpus obtained by crawling the Global Voices project website and additional monolingual data taken from the Helsinki Corpus of Swahili


6. Translation Experiments (cont'd)


Translation Results:

Detailed corpus statistics are given in Table 1:

• The Russian data consist of the News Commentary parallel corpus and additional monolingual data crawled from news websites.⁹

• The Hebrew parallel corpus is composed of transcribed TED talks (Cettolo et al., 2012). Additional monolingual news data is also used.

• The Swahili parallel corpus was obtained by crawling the Global Voices project website¹⁰ for parallel articles. Additional monolingual data was taken from the Helsinki Corpus of Swahili.¹¹

We evaluate translation quality by translating and measuring the BLEU score of a 2000–3000 sentence-long evaluation corpus, averaging the results over 3 MIRA runs to control for optimizer instability (Clark et al., 2011). Table 4 reports the results. For all languages, using class language models improves over the baseline. When synthetic phrases are added, significant additional improvements are obtained. For the English–Russian language pair, where both supervised and unsupervised analyses can be obtained, we notice that expert-crafted morphological analyzers are more effective at improving translation quality. Globally, the amount of improvement observed varies depending on the language; this is most likely indicative of the quality of unsupervised morphological segmentations produced and the kinds of grammatical relations expressed morphologically.
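For concreteness, a ± entry in Table 4 such as 16.7±0.1 can be read as a mean and spread over the three MIRA runs; a minimal sketch with hypothetical per-run scores (using the sample standard deviation, which may differ from the paper's exact convention):

```python
from statistics import mean, stdev

bleu_runs = [16.6, 16.7, 16.8]  # hypothetical BLEU scores from 3 MIRA runs
print(f"{mean(bleu_runs):.1f}±{stdev(bleu_runs):.1f}")  # prints 16.7±0.1
```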

Finally, to confirm the effectiveness of our approach as corpus size increases, we use our technique on top of a state-of-the-art English–Russian system trained on data from the 8th ACL Workshop on Machine Translation (30M words of bilingual text and 410M words of monolingual text). The setup is identical except for the addition of sparse rule shape indicator features and bigram cluster features. In these large-scale conditions, the BLEU score improves from 18.8 to 19.6 with the addition of word clusters and reaches 20.0 with synthetic phrases. Details regarding this system are reported in Ammar et al. (2013).

⁹ http://www.statmt.org/wmt13/translation-task.html
¹⁰ http://sw.globalvoicesonline.org
¹¹ http://www.aakkl.helsinki.fi/cameel/corpus/intro.htm

Table 4: Translation quality (measured by BLEU) averaged over 3 MIRA runs.

                             EN→RU      EN→HE      EN→SW
Baseline                     14.7±0.1   15.8±0.3   18.3±0.1
+ Class LM                   15.7±0.1   16.8±0.4   18.7±0.2
+ Synthetic (unsupervised)   16.2±0.1   17.6±0.1   19.0±0.1
+ Synthetic (supervised)     16.7±0.1   —          —


7 Related Work

Translation into morphologically rich languages is a widely studied problem and there is a tremendous amount of related work. Our technique of synthesizing translation options to improve generation of inflected forms is closely related to the factored translation approach proposed by Koehn and Hoang (2007); however, an important difference to that work is that we use a discriminative model that conditions on source context to make "local" decisions about what inflections may be used before combining the phrases into a complete sentence translation.

Combination pre-/post-processing solutions are also frequently proposed. In these, the target language is generally transformed from multi-morphemic surface words into smaller units more amenable to direct translation, and then a post-processing step is applied independent of the translation model. For example, Oflazer and El-Kahlout (2007) experiment with partial morpheme groupings to produce novel inflected forms when translating into Turkish; Al-Haj and Lavie (2010) compare different processing schemes for Arabic. A related but different approach is to enrich the source language items with grammatical features (e.g., a source sentence like John saw Mary is preprocessed into, e.g., John+subj saw+msubj+fobj Mary+obj) so as to make the source and target lexicons have similar morphological contrasts (Avramidis and Koehn, 2008; Yeniterzi and Oflazer, 2010; Chang et al.,


Translation quality (measured by BLEU) averaged over 3 MIRA runs.

7.  Conclusion  


The paper
● presents an efficient technique that exploits morphologically analyzed corpora to produce new inflections possibly unseen in the bilingual training data
  • This method decomposes into two simple independent steps involving well-understood discriminative models
● also achieves language independence by exploiting unsupervised morphological segmentation in the absence of linguistically informed morphological analyses