Modeling Coverage for Neural Machine Translation
Zhaopeng Tu
Huawei Noah’s Ark Lab, Hong Kong
Joint work with
Zhengdong Lu, Yang Liu, Xiaohua Liu, Hang Li
Motivation
• Neural Machine Translation
2 RNN Encoder–Decoder
2.1 Preliminary: Recurrent Neural Networks

A recurrent neural network (RNN) is a neural network that consists of a hidden state h and an optional output y which operates on a variable-length sequence x = (x_1, ..., x_T). At each time step t, the hidden state h_⟨t⟩ of the RNN is updated by

h_⟨t⟩ = f(h_⟨t−1⟩, x_t),    (1)

where f is a non-linear activation function. f may be as simple as an element-wise logistic sigmoid function and as complex as a long short-term memory (LSTM) unit (Hochreiter and Schmidhuber, 1997).

An RNN can learn a probability distribution over a sequence by being trained to predict the next symbol in a sequence. In that case, the output at each time step t is the conditional distribution p(x_t | x_{t−1}, ..., x_1). For example, a multinomial distribution (1-of-K coding) can be output using a softmax activation function

p(x_{t,j} = 1 | x_{t−1}, ..., x_1) = exp(w_j h_⟨t⟩) / Σ_{j'=1}^{K} exp(w_{j'} h_⟨t⟩),    (2)

for all possible symbols j = 1, ..., K, where w_j are the rows of a weight matrix W. By combining these probabilities, we can compute the probability of the sequence x using

p(x) = Π_{t=1}^{T} p(x_t | x_{t−1}, ..., x_1).    (3)

From this learned distribution, it is straightforward to sample a new sequence by iteratively sampling a symbol at each time step.
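Eqs. (1)-(3) can be made concrete with a tiny toy implementation. The sketch below uses a tanh recurrence and a plain softmax in NumPy; the weight names (W_h, W_x, W_out) and shapes are illustrative assumptions, not from the paper.

```python
import numpy as np

def rnn_sequence_logprob(x_onehots, W_h, W_x, W_out):
    """Score a symbol sequence with a next-symbol RNN (Eqs. 1-3).

    x_onehots: (T, K) array of one-hot symbols; W_h, W_x, W_out are
    illustrative weights (hidden-to-hidden, input-to-hidden,
    hidden-to-logits)."""
    h = np.zeros(W_h.shape[0])
    logprob = 0.0
    for t in range(1, len(x_onehots)):
        # Eq. (1): h_<t> = f(h_<t-1>, x_t); here f is tanh of a linear map.
        h = np.tanh(W_h @ h + W_x @ x_onehots[t - 1])
        # Eq. (2): softmax over the K symbols (max subtracted for stability).
        logits = W_out @ h
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # Eq. (3): accumulate log p(x_t | x_{t-1}, ..., x_1).
        logprob += np.log(probs[int(x_onehots[t].argmax())])
    return logprob
```

Sampling a new sequence would simply draw each next symbol from `probs` instead of scoring a given one.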
2.2 RNN Encoder–Decoder

In this paper, we propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. p(y_1, ..., y_{T'} | x_1, ..., x_T), where one should note that the input and output sequence lengths T and T' may differ.

Figure 1: An illustration of the proposed RNN Encoder–Decoder.

The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes according to Eq. (1). After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary c of the whole input sequence.

The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol y_t given the hidden state h_⟨t⟩. However, unlike the RNN described in Sec. 2.1, both y_t and h_⟨t⟩ are also conditioned on y_{t−1} and on the summary c of the input sequence. Hence, the hidden state of the decoder at time t is computed by

h_⟨t⟩ = f(h_⟨t−1⟩, y_{t−1}, c),

and similarly, the conditional distribution of the next symbol is

P(y_t | y_{t−1}, y_{t−2}, ..., y_1, c) = g(h_⟨t⟩, y_{t−1}, c),

for given activation functions f and g (the latter must produce valid probabilities, e.g. with a softmax).

See Fig. 1 for a graphical depiction of the proposed model architecture.

The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood

max_θ (1/N) Σ_{n=1}^{N} log p_θ(y_n | x_n),    (4)
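A rough sketch of Sec. 2.2 under the same toy conventions: the encoder folds the source sequence into a summary vector c, and each decoder step conditions on the previous state, the previous output, and c. All weight names here are illustrative assumptions.

```python
import numpy as np

def encode(x_embeds, W_h, W_x):
    """Encoder RNN: read the source symbols left to right; the final
    hidden state serves as the summary c of the whole input (Sec. 2.2)."""
    h = np.zeros(W_h.shape[0])
    for x_t in x_embeds:
        h = np.tanh(W_h @ h + W_x @ x_t)  # Eq. (1)
    return h  # summary vector c

def decode_step(h_prev, y_prev, c, W_h, W_y, W_c):
    """One decoder step: h_<t> = f(h_<t-1>, y_{t-1}, c)."""
    return np.tanh(W_h @ h_prev + W_y @ y_prev + W_c @ c)
```

In a full system the decoder state would feed a softmax output layer g, and training would maximize Eq. (4) by back-propagating through both RNNs jointly.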
Published as a conference paper at ICLR 2015
The decoder is often trained to predict the next word y_{t'} given the context vector c and all the previously predicted words {y_1, ..., y_{t'−1}}. In other words, the decoder defines a probability over the translation y by decomposing the joint probability into the ordered conditionals:

p(y) = Π_{t=1}^{T} p(y_t | {y_1, ..., y_{t−1}}, c),    (2)

where y = (y_1, ..., y_{T_y}). With an RNN, each conditional probability is modeled as

p(y_t | {y_1, ..., y_{t−1}}, c) = g(y_{t−1}, s_t, c),    (3)

where g is a nonlinear, potentially multi-layered, function that outputs the probability of y_t, and s_t is the hidden state of the RNN. It should be noted that other architectures such as a hybrid of an RNN and a de-convolutional neural network can be used (Kalchbrenner and Blunsom, 2013).
3 LEARNING TO ALIGN AND TRANSLATE
In this section, we propose a novel architecture for neural machine translation. The new architectureconsists of a bidirectional RNN as an encoder (Sec. 3.2) and a decoder that emulates searchingthrough a source sentence during decoding a translation (Sec. 3.1).
3.1 DECODER: GENERAL DESCRIPTION
Figure 1: The graphical illustration of the proposed model trying to generate the t-th target word y_t given a source sentence (x_1, x_2, ..., x_T).
In a new model architecture, we define each conditional probability in Eq. (2) as:

p(y_i | y_1, ..., y_{i−1}, x) = g(y_{i−1}, s_i, c_i),    (4)

where s_i is an RNN hidden state for time i, computed by

s_i = f(s_{i−1}, y_{i−1}, c_i).

It should be noted that unlike the existing encoder–decoder approach (see Eq. (2)), here the probability is conditioned on a distinct context vector c_i for each target word y_i.

The context vector c_i depends on a sequence of annotations (h_1, ..., h_{T_x}) to which an encoder maps the input sentence. Each annotation h_i contains information about the whole input sequence with a strong focus on the parts surrounding the i-th word of the input sequence. We explain in detail how the annotations are computed in the next section.

The context vector c_i is then computed as a weighted sum of these annotations h_j:

c_i = Σ_{j=1}^{T_x} α_{ij} h_j.    (5)

The weight α_{ij} of each annotation h_j is computed by

α_{ij} = exp(e_{ij}) / Σ_{k=1}^{T_x} exp(e_{ik}),    (6)

where

e_{ij} = a(s_{i−1}, h_j)

is an alignment model which scores how well the inputs around position j and the output at position i match. The score is based on the RNN hidden state s_{i−1} (just before emitting y_i, Eq. (4)) and the j-th annotation h_j of the input sentence.
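Eqs. (4)-(6) amount to a softmax over feedforward alignment scores followed by a weighted sum. A minimal NumPy sketch, with parameter names following the paper but shapes chosen for illustration:

```python
import numpy as np

def attention_context(s_prev, annotations, v_a, W_a, U_a):
    """Compute the context vector c_i and weights alpha_ij (Eqs. 4-6).

    s_prev: previous decoder state s_{i-1}; annotations: list of h_j."""
    # Alignment scores e_ij = v_a . tanh(W_a s_{i-1} + U_a h_j).
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j)
                  for h_j in annotations])
    # Eq. (6): softmax-normalize the scores into weights alpha_ij.
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # Eq. (5): c_i is the alpha-weighted sum of the annotations.
    c_i = alpha @ np.stack(annotations)
    return c_i, alpha
```

Because the weights are normalized to one, attention is a zero-sum budget over the source words — the property the coverage model later exploits.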
We parametrize the alignment model a as a feedforward neural network which is jointly trained with all the other components of the proposed system. Note that unlike in traditional machine translation, the alignment is not considered to be a latent variable.
Traditional NMT
Cho et al. (2014)
Attentional NMT
Bahdanau et al. (2015)
our focus
Motivation
• Problem of NMT

[Slide animation: an attentional NMT decoder translates a five-word Chinese source sentence word by word. Each step shows the attention weights over the five source words and <eos> while the next target word is generated:

  "airports":  0.0  1.0  0.0  0.0  0.0 | <eos> 0.0
  "were":      0.0  0.0  0.9  0.1  0.0 | <eos> 0.0
  "closed":    0.0  0.0  0.6  0.4  0.0 | <eos> 0.0
  "to":        0.1  0.0  0.0  0.7  0.0 | <eos> 0.2
  "close":     0.1  0.0  0.0  0.7  0.0 | <eos> 0.2
  ".":         0.2  0.2  0.2  0.1  0.1 | <eos> 0.2
  "<eos>":     0.1  0.0  0.0  0.0  0.2 | <eos> 0.7

Final output: "many airports were closed to close . <eos>". Some source content is translated repeatedly ("closed ... close": over-translation), while other source words never receive enough attention and are left untranslated (under-translation).]
Motivation
• Coverage problem
• NMT lacks a coverage mechanism to indicate whether a source word has been translated or not.

[Slide: source words marked as covered or uncovered.]

Motivation
• Coverage is essential for SMT to ensure that each source word is translated exactly once.
• We believe that modeling coverage is also useful for NMT.
Motivation
• Coverage in SMT

[Slide: SMT decoding with an explicit coverage vector over a five-word source sentence (glossed roughly "many / airport / force / close / ."):

  f: * _ _ _ _   e: "many"             p: 0.53 — coverage: the 1st source word is translated
  f: * * _ _ _   e: "many airport"     p: 0.50 — coverage: the first two source words are translated
  f: * * * * *   e: "many ... close ." p: 0.32 — coverage: all source words are translated]
Approach
• Intuitions

[Slide animation: the same example decoded while maintaining a coverage vector over the source words, initialized to all zeros (no source word is translated yet) and updated after every attentive step:

  initial coverage:  0.0  0.0  0.0  0.0  0.0  0.0
  generate "many" (attention 0.5  0.2  0.2  0.1  0.0 | <eos> 0.0)
    → updating coverage:  0.5  0.2  0.2  0.1  0.0  0.0
  after "airports" and "were" (attention 0.0  0.0  0.9  0.1  0.0 | <eos> 0.0)
    → coverage:  0.5  1.2  1.1  0.2  0.0  0.0
  should the decoder now output "closed" (attention 0.0  0.0  0.6  0.4  0.0 | <eos> 0.0)?
    No — the coverage vector shows the source sentence is not fully translated.
  generate "forced" instead (attention 0.0  0.0  0.9  0.1  0.0 | <eos> 0.0)
    → updating coverage:  0.5  1.2  2.0  0.3  0.0  0.0
  after "to close down", coverage:  0.5  1.2  2.3  2.8  0.1  0.1 — the source is fully translated
  generate "." (attention 0.0  0.0  0.0  0.1  0.2 | <eos> 0.7)
    → updating coverage:  0.5  1.2  2.3  2.9  0.3  0.8
  every word is translated, so generate "<eos>" (attention 0.0  0.0  0.0  0.0  0.2 | <eos> 0.8)

  Final output: "many airports were forced to close down . <eos>"]
Approach
• A few equations

[Slide: after decoding, the final attention step is 0.0  0.0  0.0  0.0  0.1 | <eos> 0.9, and the accumulated coverage vector is 0.5  1.2  2.3  2.9  0.5  1.6.]
Formally, the coverage model is given by

C_{i,j} = g_update(C_{i−1,j}, α_{i,j}, Φ(h_j), Ψ)    (6)

where

• g_update(·) is the function that updates C_{i,j} after the new attention α_{i,j} at time step i in the decoding process;
• C_{i,j} is a d-dimensional coverage vector summarizing the history of attention till time step i on h_j;
• Φ(h_j) is a word-specific feature with its own parameters;
• Ψ are auxiliary inputs exploited in different sorts of coverage models.

Equation 6 gives a rather general model, which could take different function forms for g_update(·) and Φ(·), and different auxiliary inputs Ψ (e.g., the previous decoding state t_{i−1}). In the rest of this section, we will give a number of representative implementations of the coverage model, which either leverage more linguistic information (Section 3.1.1) or resort to the flexibility of neural network approximation (Section 3.1.2).
3.1.1 Linguistic Coverage Model

We first consider a linguistically inspired model which has a small number of parameters, as well as a clear interpretation. While the linguistically inspired coverage in NMT is similar to that in SMT, there is one key difference: it indicates what percentage of source words have been translated (i.e., soft coverage). In NMT, each target word y_i is generated from all source words, with probability α_{i,j} for source word x_j. In other words, the source word x_j is involved in generating all target words, and the probability of it generating target word y_i at time step i is α_{i,j}. Note that unlike in SMT, in which each source word is fully translated at one decoding step, the source word x_j is partially translated at each decoding step in NMT. Therefore, the coverage at time step i denotes the ratio to which each source word has been translated so far.
We use a scalar (d = 1) to represent the linguistic coverage of each source word and employ an accumulate operation for g_update. The initial value of linguistic coverage is zero, which denotes that the corresponding source word is not yet translated. We iteratively construct linguistic coverages through accumulation of the alignment probabilities generated by the attention model, each of which is normalized by a distinct context-dependent weight. The coverage of source word x_j at time step i is computed by

C_{i,j} = C_{i−1,j} + (1/Φ_j) α_{i,j} = (1/Φ_j) Σ_{k=1}^{i} α_{k,j}    (7)

where Φ_j is a pre-defined weight which indicates the number of target words x_j is expected to generate. The simplest way is to follow Xu et al. (2015) in image-to-caption translation and fix Φ = 1 for all source words, which means that we directly use the sum of previous alignment probabilities, without normalization, as the coverage for each word, as done in (Cohn et al., 2016).
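Eq. (7) is a running, per-word accumulation of attention weights, optionally normalized by Φ_j. A minimal sketch, assuming the attention weights for all decoding steps are stacked in one array:

```python
import numpy as np

def linguistic_coverage(alphas, fertility):
    """Unrolled Eq. (7): after step i, the coverage of source word j is
    (1/Phi_j) * sum_{k<=i} alpha_{k,j}.

    alphas: (I, T_x) attention weights for every decoding step;
    fertility: length-T_x array of Phi_j (all ones = plain accumulation)."""
    coverage = np.zeros_like(fertility, dtype=float)
    for alpha_i in alphas:
        # C_{i,j} = C_{i-1,j} + alpha_{i,j} / Phi_j
        coverage = coverage + alpha_i / fertility
    return coverage
```

With Φ_j = 1 this reduces to the plain attention sum; a larger Φ_j slows how quickly a word counts as covered.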
However, in machine translation, different types of source words may contribute differently to the generation of the target sentence. Let us take the sentence pairs in Figure 1 as an example. The noun in the source sentence "jīchǎng" is translated into one target word "airports", while the adjective "bèipò" is translated into three words "were forced to". Therefore, we need to assign a distinct Φ_j to each source word. Ideally, we expect Φ_j = Σ_{i=1}^{I} α_{i,j}, with I being the total number of time steps in decoding. However, such a desired value is not available before decoding, and thus is not suitable in this scenario.
Fertility  To predict Φ_j, we introduce the concept of fertility, which was first proposed in word-level SMT (Brown et al., 1993). The fertility of source word x_j tells how many target words x_j produces. In SMT, the fertility is a random variable Φ_j, whose distribution p(Φ_j = φ) is determined by the parameters of word alignment models (e.g., IBM models). In this work, we simplify and adapt fertility from the original model and compute the fertility Φ_j by[2]

Φ_j = N(x_j | x) = N · σ(U_f h_j)    (8)

where N ∈ R is a predefined constant to denote the maximum number of target words one source word can produce.

[2] Fertility in SMT is a random variable with a set of fertility probabilities, n(Φ_j | x_j) = p(Φ_{<j}, x), which depend on the fertilities of previous source words. To simplify the calculation and adapt it to the attention model in NMT, we define the fertility in NMT as a constant number, which is independent of previous fertilities.
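Eq. (8) can be sketched directly: each fertility is a sigmoid-squashed linear readout of the annotation, scaled by the cap N. Shapes here are illustrative (the paper uses U_f ∈ R^{1×2n}); the helper below treats U_f as a single weight row.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fertility(annotations, U_f, N=2.0):
    """Eq. (8): Phi_j = N * sigma(U_f h_j), so each fertility lies in
    (0, N) and N caps how many target words a source word may generate.

    annotations: list of h_j vectors; U_f: weight row (illustrative shape)."""
    return np.array([N * sigmoid(float(U_f @ h_j)) for h_j in annotations])
```

Since Φ_j does not depend on the decoding step i, all fertilities can be pre-computed once per sentence.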
Approach
• A few equations

[Slide: the same coverage vector 0.5  1.2  2.3  2.9  0.5  1.6, now alongside per-word fertilities 0.8  1.1  2.4  2.5  0.7  1.7. The fertility model tells how many target words each source word generates.]
ACL 2016 Submission ***. Confidential review copy. DO NOT DISTRIBUTE.
4 Training

In this paper, we take an end-to-end learning approach for our coverage-based NMT model, which jointly learns not only the parameters of the "original" RNNsearch (i.e., those of the encoding RNN, decoding RNN, and attention model) but also the parameters for coverage modeling (i.e., those for the annotation and its role in guiding the attention).
[Slide: dividing each word's coverage by its fertility gives per-word translated ratios: 63%, 109%, 96%, 116%, 71%, 94% (<eos>).]
Approach
Attention-based Neural Machine Translation (NMT) ignores past alignment information, which often leads to over-translation and under-translation.
Modeling Coverage for Neural Machine Translation
Zhaopeng Tu¹  Zhengdong Lu¹  Yang Liu²  Xiaohua Liu¹  Hang Li¹
INTRODUCTION
COVERAGE MODEL
Our work: We maintain a coverage vector to keep track of the attention history. The coverage vector is fed to the attention model to help adjust future attention, which guides NMT to consider more about the untranslated source words.
The coverage model is given by

C_{i,j} = g_update(C_{i−1,j}, α_{i,j}, Φ(h_j), Ψ)    (6)

• C_{i,j} is a d-dimensional coverage vector summarizing the history of attention (e.g., α_{i,j}) till time step i on the annotation of source word h_j;
• g_update is the function that updates C_{i,j} after the new attention at time step i;
• Φ(h_j) is a word-specific feature with its own parameters;
• Ψ are auxiliary inputs exploited in different sorts of coverage models.
EXPERIMENTS
Two Representative Implementations
INTEGRATING COVERAGE INTO NMT
1 Huawei Noah’s Ark Lab, Hong Kong 2 Tsinghua University, Beijing
Figure 4: NN-based coverage model.
Since the coverage vector summarizes the attention record for each source word, it will discourage further attention to a word that has been heavily attended, and implicitly push the attention to the less attended segments of the source sentence, since the attention weights are normalized to one. This could potentially solve both coverage mistakes mentioned above, when modelled and learned properly.
3.1.2 Neural Network Based Coverage Model

When C_{i,j} is a vector (d > 1) and g_update(·) takes a Neural Network (NN) form, we actually have an RNN model for coverage, as illustrated by Figure 4. In our work, we take the following form:

C_{i,j} = f(C_{i−1,j}, α_{i,j}, h_j, t_{i−1})

where f(·) is a nonlinear activation function and t_{i−1} is the auxiliary input that encodes past translation information. Note that we leave out the word-specific feature function Φ(·) and only take the input annotation h_j as the input to the coverage RNN. It is important to emphasize that the NN-based coverage model can be fed with arbitrary inputs, such as the previous attentional context s_{i−1}. Here we only employ C_{i−1,j} for past alignment information, t_{i−1} for past translation information, and h_j for word-specific bias.

Gating  The neural function f(·) can be either a simple activation function tanh or a gating function, which proves useful for capturing long-distance dependencies. In this work, we adopt GRU for the gating activation since it is simple yet powerful (Chung et al., 2014). Please refer to (Cho et al., 2014b) for more details about GRU.

Although the NN-based coverage model enjoys the flexibility brought by the recurrent nonlinear form, its lack of clear linguistic meaning may render it hard to train: the coverage model can only be trained along with the attention model and gets its supervision signal from it in back-propagation, which can be weak (relatively distant from the decoding process) and noisy (after distortion from other under-trained components in the decoder RNN). Partly to overcome this problem, we also propose the linguistically inspired model, which has a much clearer interpretation but far fewer parameters.
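One plausible way to realize the GRU-based update C_{i,j} = f(C_{i−1,j}, α_{i,j}, h_j, t_{i−1}) is sketched below. The concatenated-input gate parametrization and the weight layout in `params` are assumptions of this sketch, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_coverage_update(C_prev, alpha_ij, h_j, t_prev, params):
    """One GRU step for the NN-based coverage vector C_{i,j}.

    Inputs: previous coverage C_{i-1,j}, new attention weight alpha_{i,j},
    annotation h_j, and previous decoder state t_{i-1}."""
    x = np.concatenate(([alpha_ij], h_j, t_prev))          # GRU input
    z = sigmoid(params["Wz"] @ x + params["Uz"] @ C_prev)  # update gate
    r = sigmoid(params["Wr"] @ x + params["Ur"] @ C_prev)  # reset gate
    C_tilde = np.tanh(params["W"] @ x + params["U"] @ (r * C_prev))
    # Interpolate between the old coverage and the candidate update.
    return (1.0 - z) * C_prev + z * C_tilde
```

The gates let the model decide, per source word, how much of the attention history to keep and how much to overwrite.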
Neural Network Based Coverage Model
• C_{i,j} is a vector (d > 1) and g_update takes a Neural Network (NN) form;
• the NN-based coverage model can be fed with arbitrary inputs.

①: alignment decisions (α_i) are made by jointly taking into account past alignment information (C_{i−1});
②: coverage models (C_i) are updated after every attentive decision (α_i), and thus store the attention history.

Translation Quality
[Chart: BLEU score (26-31) against source sentence length, in bins [0,10), [10,20), ..., [70,80), comparing Moses, NMT, +Linguistic Coverage, and +NN-based Coverage.]

[Chart: length of translation (0-90) against the same source-length bins, for the same systems.]
• Chinese-to-English translation
• Training corpus: 1.25M sentence pairs (27.9M and 34.5M words)
• Vocabulary: 30K; max sentence length: 80
• Other settings are the same as in (Bahdanau et al., 2015)
Alignment Quality

[Chart: alignment error rate (AER, 50-55) for NMT, +Linguistic Coverage, +NN-based Coverage (d=1), and +NN-based Coverage (d=10).]

Effects on Long Sentences

A higher BLEU score means better translation, while a lower AER score means better alignment.
Example Translations

[Slide: example translations from NMT vs. NMT-Coverage.]

Example Alignments

Translation and Alignment Quality
In Eq. (8), N ∈ R is a predefined constant denoting the maximum number of target words one source word can produce, σ(·) is a logistic sigmoid function, and U_f ∈ R^{1×2n} is the weight matrix. Here we use h_j to denote (x_j | x), since h_j contains information about the whole input sentence with a strong focus on the parts surrounding x_j (Bahdanau et al., 2015). Since Φ_j does not depend on i, we can pre-compute it before decoding to minimize the computational cost.
3.1.2 Neural Network Based Coverage ModelWe next consider Neural Network (NN) basedcoverage model. When Ci,j is a vector (d > 1) andgupdate(·) is a neural network, we actually havean RNN model for coverage, as illustrated in Fig-ure 4. In this work, we take the following form:
Ci,j = f(Ci�1,j ,↵i,j ,hj , ti�1)
where f(·) is a nonlinear activation function andti�1 is the auxiliary input that encodes past trans-lation information. Note that we leave out theword-specific feature function �(·) and only takethe input annotation hj as the input to the cov-erage RNN. It is important to emphasize that theNN-based coverage model is able to be fed witharbitrary inputs, such as the previous attentionalcontext si�1. Here we only employ Ci�1,j for pastalignment information, ti�1 for past translation in-formation, and hj for word-specific bias.3
Gating The neural function $f(\cdot)$ can be either a simple activation function such as tanh or a gating function, which has proven useful for capturing long-distance dependencies. In this work, we adopt the GRU as the gating activation since it is simple yet powerful (Chung et al., 2014). Please refer to (Cho et al., 2014b) for more details about the GRU.

³In our preliminary experiments, considering more inputs (e.g., the current and previous attentional contexts, or the unnormalized attention weights $e_{i,j}$) does not always lead to better translation quality. Possible reasons include: 1) the inputs contain duplicate information, and 2) more inputs introduce more back-propagation paths and therefore make the model more difficult to train. In our experience, one principle is to feed the coverage model only inputs that contain distinct, complementary information.
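As a concrete illustration, the GRU-style coverage update $C_{i,j} = f(C_{i-1,j}, \alpha_{i,j}, h_j, t_{i-1})$ can be sketched in NumPy as below. This is a minimal sketch under assumed shapes; the function and parameter names (`gru_coverage_update`, `W_z`, `U_z`, etc.) are illustrative and not taken from the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_coverage_update(C_prev, alpha, h, t_prev, params):
    """One GRU step of the NN-based coverage model (illustrative shapes).

    C_prev : (d,)  coverage of source word x_j after decoding step i-1
    alpha  : float attention weight alpha_{i,j}
    h      : (2n,) annotation of source word x_j
    t_prev : (n,)  previous decoder state t_{i-1}
    """
    # Concatenate the auxiliary inputs fed to the coverage GRU.
    x = np.concatenate(([alpha], h, t_prev))
    W_z, U_z, W_r, U_r, W_c, U_c = (params[k] for k in
                                    ("W_z", "U_z", "W_r", "U_r", "W_c", "U_c"))
    z = sigmoid(W_z @ x + U_z @ C_prev)              # update gate
    r = sigmoid(W_r @ x + U_r @ C_prev)              # reset gate
    C_tilde = np.tanh(W_c @ x + U_c @ (r * C_prev))  # candidate coverage
    # Interpolate between the old coverage and the candidate.
    return (1.0 - z) * C_prev + z * C_tilde
```

With $d = 1$ and an identity-like $f$, this degenerates toward the linguistic (accumulation) model; with $d > 1$ the gates let the network decide how much past coverage to keep per dimension.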
Discussion Intuitively, the two types of models summarize coverage information in "different languages". Linguistic models summarize coverage information in human language, which has a clear interpretation for humans. Neural models encode coverage information in a "neural language", which can be "understood" by neural networks, letting them decide how to make use of the encoded coverage information.
3.2 Integrating Coverage into NMT
Although the attention-based model has the capability of jointly making alignment and translation, it does not take translation history into consideration. Specifically, a source word that has significantly contributed to the generation of target words in the past should be assigned lower alignment probabilities in subsequent steps, which may not be the case in attention-based NMT. To address this problem, we propose to calculate the alignment probabilities by incorporating past alignment information embedded in the coverage model.

Intuitively, at each time step $i$ in the decoding phase, the coverage from time step $i-1$ serves as an additional input to the attention model, providing complementary information about how likely the source words have been translated in the past. We expect the coverage information to guide the attention model to focus more on untranslated source words (i.e., to assign them higher alignment probabilities). In practice, we find that the coverage model does fulfill this expectation (see Section 5): the translated ratios of source words from linguistic coverages correlate negatively with the corresponding alignment probabilities.
More formally, we rewrite the attention model in Equation 5 as

$$e_{i,j} = a(t_{i-1}, h_j, C_{i-1,j}) = v_a^\top \tanh(W_a t_{i-1} + U_a h_j + V_a C_{i-1,j})$$

where $C_{i-1,j}$ is the coverage of source word $x_j$ before time $i$, and $V_a \in \mathbb{R}^{n \times d}$ is the weight matrix for coverage, with $n$ and $d$ being the numbers of hidden units and coverage units, respectively.
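The coverage-aware attention above can be sketched as a small NumPy function; all names (`coverage_attention`, `W_a`, `U_a`, `V_a`, `v_a`) mirror the symbols in the equation rather than any released code, and the shapes are assumptions.

```python
import numpy as np

def coverage_attention(t_prev, H, C_prev, W_a, U_a, V_a, v_a):
    """Alignment probabilities with a coverage term:
    e_{i,j} = v_a^T tanh(W_a t_{i-1} + U_a h_j + V_a C_{i-1,j}).

    t_prev : (n,)     previous decoder state t_{i-1}
    H      : (J, 2n)  source annotations h_1 .. h_J
    C_prev : (J, d)   coverage of each source word before step i
    """
    J = H.shape[0]
    e = np.empty(J)
    for j in range(J):
        # Energy for source position j, now biased by its coverage.
        e[j] = v_a @ np.tanh(W_a @ t_prev + U_a @ H[j] + V_a @ C_prev[j])
    # Softmax turns energies into alignment probabilities alpha_{i,j}.
    e = e - e.max()
    alpha = np.exp(e) / np.exp(e).sum()
    return alpha
```

A trained $V_a$ can thus push probability mass away from well-covered source words and toward untranslated ones.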
3.1.1 Linguistic Coverage Model
ACL 2016 Submission ***. Confidential review copy. DO NOT DISTRIBUTE.
We use a scalar ($d = 1$) to represent the linguistic coverage of each source word and employ an accumulate operation for $g_{\mathrm{update}}$. We iteratively construct linguistic coverages through an accumulation of the alignment probabilities generated by the attention model, each normalized by a distinct context-dependent weight. The coverage of source word $x_j$ at time step $i$ is computed by

$$C_{i,j} = \frac{1}{\Phi_j} \sum_{k=1}^{i} \alpha_{k,j} \qquad (7)$$
where $\Phi_j$ is a pre-defined weight indicating the number of target words $x_j$ is expected to generate. The simplest way is to follow Xu et al. (2015) in image-to-caption translation and fix $\Phi = 1$ for all source words, which means that we directly use the sum of previous alignment probabilities, without normalization, as the coverage of each word, as done in (Cohn et al., 2016).

However, in natural languages, different types of source words contribute differently to the generation of the translation. Take the sentence pair in Figure 1 as an example: the noun on the source side "hangji" is translated into one target word "flights", while the quantifier "liangqianduo" is translated into two words "over 2,000". Therefore, we need to assign a distinct $\Phi_j$ to each source word. Ideally, we expect $\Phi_j = \sum_{k=1}^{T_y} \alpha_{k,j}$, with $T_y$ being the total number of time steps in decoding. However, this desired value is not available before decoding, and thus is not suitable in this scenario.
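The linguistic coverage of Equation 7 reduces to a fertility-normalized running sum of attention weights. A minimal sketch (the function name and array layout are assumptions; $\Phi_j = 1$ recovers plain accumulation):

```python
import numpy as np

def linguistic_coverage(alphas, phi):
    """C_{i,j} = (1 / Phi_j) * sum_{k=1..i} alpha_{k,j}   (Equation 7).

    alphas : (i, J) attention weights from decoding steps 1..i
    phi    : (J,)   per-word fertility normalizers Phi_j
    """
    # Accumulate attention over decoding steps, then normalize per word.
    return alphas.sum(axis=0) / phi
```

With a good $\Phi_j$, $C_{i,j}$ approaches 1 exactly when source word $x_j$ is fully translated, which is what makes it usable as a "translated ratio".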
Fertility To predict $\Phi_j$, we introduce the concept of fertility, which was first proposed in word-level SMT (Brown et al., 1993). The fertility of source word $x_j$ tells how many target words $x_j$ produces. In SMT, fertility is a random variable $\Phi_j$ whose distribution $p(\Phi_j = \phi)$ is determined by the parameters of word alignment models (e.g., the IBM models). In this work, we simplify and adapt fertility from the original model¹ and compute the fertility $\Phi_j$ by

$$\Phi_j = N(x_j \mid \mathbf{x}) = N \cdot \sigma(U_f h_j) \qquad (8)$$

where $N \in \mathbb{R}$ is a predefined constant denoting the maximum number of target words one source word can produce, $\sigma(\cdot)$ is a logistic sigmoid function, and $U_f \in \mathbb{R}^{1 \times 2n}$ is the weight matrix. Here we use $h_j$ to denote $(x_j \mid \mathbf{x})$, since $h_j$ contains information about the whole input sentence with a strong focus on the parts surrounding $x_j$ (Bahdanau et al., 2015). Since $\Phi_j$ does not depend on $i$, we can pre-compute it before decoding to minimize the computational cost.

¹Fertility in SMT is a random variable with a set of fertility probabilities, $n(\Phi_j \mid x_j) = p(\Phi_j \mid \Phi_1^{j-1}, \mathbf{x})$, which depends on the fertilities of previous source words. To simplify the calculation and adapt it to the attention model in NMT, we define the fertility in NMT as a constant number, independent of previous fertilities.
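Equation 8 is a single sigmoid layer over the annotation, scaled by the cap $N$, computed once per sentence. A minimal sketch with illustrative names (`predict_fertility`, `U_f` as a vector rather than a $1 \times 2n$ matrix):

```python
import numpy as np

def predict_fertility(H, U_f, N=2.0):
    """Phi_j = N * sigmoid(U_f h_j)   (Equation 8).

    H   : (J, 2n) source annotations h_1 .. h_J
    U_f : (2n,)   fertility weights (the paper's 1 x 2n matrix, flattened)
    N   : cap on how many target words one source word may produce
    """
    # One sigmoid per source word, scaled into (0, N); independent of the
    # decoding step i, so it can be pre-computed before decoding.
    return N / (1.0 + np.exp(-(H @ U_f)))
```

The value of $N$ here is an assumption for illustration; the paper only states that $N$ is a predefined constant.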
4 Training
In this paper, we take end-to-end learning for our NMT-COVERAGE model, which jointly learns not only the parameters of the "original" NMT (i.e., those of the encoding RNN, the decoding RNN, and the attention model) but also the parameters for coverage modeling (i.e., those for annotation and its
A simple accumulation of attention history
$\Phi_j$ is the fertility of source word $x_j$, which tells how many target words $x_j$ produces:
• $C_{i,j}$ is a scalar ($d = 1$), which indicates what percentage of a source word has been translated;
• It has a clearer linguistic interpretation but fewer parameters.
The coverage model alleviates the over-translation (e.g., of "guanbi") and under-translation (e.g., of "beipo") problems that NMT without coverage suffers from.
With the coverage mechanism, translated source words are less likely to contribute to the generation of subsequent target words (e.g., the top-right corner for the first four Chinese words).
Translation Quality
• Chinese-to-English translation
• Training corpus: 1.25M sentence pairs (27.9M source words, 34.5M target words)
• Vocabulary: 30K
• Max sentence length: 80
• Other settings are the same as in (Bahdanau et al., 2015)
[BLEU chart: Groundhog baseline (28.32) vs. + Linguistic Coverage, + NN-based Coverage (d = 1), and + NN-based Coverage (d = 10), with BLEU scores of 29.59, 29.86, and 30.14; every coverage variant outperforms the baseline.]
(a) Over-translation and under-translation generated by NMT. (b) The coverage model alleviates the problems of over-translation and under-translation.

Figure 1: Example translations of (a) NMT without coverage, and (b) NMT with coverage. In conventional NMT without coverage, the Chinese word "guanbi" is over-translated to "close(d)" twice, while "beipo" (meaning "be forced to") is mistakenly left untranslated. The coverage model alleviates these problems by tracking the "coverage" of source words.
during the decoding process, to keep track of the attention history. The coverage vector, when fed into the attention model, can help adjust future attention and significantly improve the overall alignment between the source and target sentences. This design subsumes many particular cases of coverage modeling with contrasting characteristics, all of which share a clear linguistic intuition and yet can be trained in a data-driven fashion. Notably, we achieve a significant improvement even by simply using the sum of previous alignment probabilities as the coverage of each word, a successful example of incorporating linguistic knowledge into neural network based NLP models.
Experiments show that NMT-COVERAGE significantly outperforms conventional attention-based NMT on both translation and alignment tasks. Figure 1(b) shows an example in which NMT-COVERAGE alleviates the over-translation and under-translation problems that NMT without coverage suffers from.
2 Background
Our work is built on attention-based NMT (Bahdanau et al., 2015), which simultaneously conducts dynamic alignment and generation of the target sentence, as illustrated in Figure 2. It produces the translation by generating one target word $y_i$ at each time step. Given an input sentence $\mathbf{x} = \{x_1, \ldots, x_J\}$ and the previously generated words $\{y_1, \ldots, y_{i-1}\}$, the probability of generating the next word $y_i$ is

$$P(y_i \mid y_{<i}, \mathbf{x}) = \mathrm{softmax}\big(g(y_{i-1}, t_i, s_i)\big) \qquad (1)$$

where $g$ is a non-linear function and $t_i$ is the decoding state for time step $i$, computed by

$$t_i = f(t_{i-1}, y_{i-1}, s_i) \qquad (2)$$

Figure 2: Architecture of attention-based NMT. Whenever possible, we omit the source index $j$ to make the illustration less cluttered.

Here the activation function $f(\cdot)$ is a Gated Recurrent Unit (GRU) (Cho et al., 2014b), and $s_i$ is
Alignment Quality
The coverage model improves alignment performance as well (the lower the AER score, the better the alignment quality).
[AER chart: Groundhog baseline (54.67) vs. + Linguistic Coverage, + NN-based Coverage (d = 1), and + NN-based Coverage (d = 10), with AER scores of 52.13, 53.51, and 50.5; every coverage variant achieves a lower (better) AER than the baseline.]
Alignment Quality
[Example alignment matrices: Groundhog vs. + NN-based Coverage (d = 10). With the coverage mechanism, translated source words are less likely to contribute to the generation of subsequent target words.]
Effect on Long Sentences
Figure 6: Performance of the generated translations with respect to the lengths of the input sentences. Coverage models alleviate under-translation by producing longer translations for long sentences.
in which the under-translation is rectified. The quantitative and qualitative results show that the coverage models indeed help to alleviate under-translation, especially for long sentences consisting of several sub-sentences.
6 Related Work
Our work is inspired by recent work on improving attention-based NMT with techniques that have been successfully applied to SMT. Following the success of Minimum Risk Training (MRT) in SMT (Och, 2003), Shen et al. (2016) proposed MRT for end-to-end NMT to optimize model parameters directly with respect to evaluation metrics. Based on the observation that attention-based NMT captures only partial aspects of attentional regularities, Cheng et al. (2016) proposed agreement-based learning (Liang et al., 2006) to encourage bidirectional attention models to agree on parameterized alignment matrices. In the same direction, inspired by the coverage mechanism in SMT, we propose a coverage-based approach to NMT to alleviate the over-translation and under-translation problems.
Independently of our work, Cohn et al. (2016) and Feng et al. (2016) made use of the concept of "fertility" for the attention model, which is similar in spirit to our method of building the linguistically inspired coverage with fertility. Cohn et al. (2016) introduced a feature-based fertility that includes the total alignment scores of the surrounding source words. In contrast, we predict fertility before decoding, which serves as a normalizer to better estimate the coverage ratio of each source word. Feng et al. (2016) used the previous attentional context to represent implicit fertility and passed it to the attention model, which is in essence similar to the input-feed method proposed in (Luong et al., 2015). Comparatively, we predict an explicit fertility for each source word based on its encoding annotation and incorporate it into the linguistically inspired coverage for the attention model.
7 Conclusion
We have presented an approach for enhancing NMT that maintains and utilizes a coverage vector to indicate whether each source word has been translated. By encouraging NMT to pay less attention to translated words and more attention to untranslated words, our approach alleviates the serious over-translation and under-translation problems that traditional attention-based NMT suffers from. We propose two variants of coverage models: a linguistic coverage model that leverages more linguistic information, and an NN-based coverage model that resorts to the flexibility of neural network approximation. Experimental results show that both variants achieve significant improvements in translation quality and alignment quality over NMT without coverage.
Effect on Long Sentences
source: [Chinese source sentence; characters not recoverable from the extraction]
NMT: jordan achieved an average score of eight weeks ahead with a surgical operation three weeks ago .
NMT-Coverage: jordan 's average score points to UNK this year . he received surgery before three weeks , with a team in the period of 4 to 8 .
Conclusion
• We have presented an approach that maintains a coverage vector for NMT to indicate whether each source word is translated or not.
• The coverage model alleviates the serious over-translation and under-translation problems that attentional NMT suffers from.
• Experimental results show that the coverage model significantly improves both translation and alignment quality.
Source Code
https://www.github.com/tuzhaopeng/NMT-Coverage
Try on your own data and task!
We are hiring …
http://www.noahlab.com.hk
Ph.D. positions in:
• Machine learning
• Data mining
• Speech & language processing
• Information & knowledge management
• Intelligent systems