Modeling Coverage for Neural Machine Translation
Zhaopeng Tu
Huawei Noah’s Ark Lab, Hong Kong
Joint work with
Zhengdong Lu, Yang Liu, Xiaohua Liu, Hang Li
Motivation
• Neural Machine Translation
2 RNN Encoder–Decoder
2.1 Preliminary: Recurrent Neural Networks

A recurrent neural network (RNN) is a neural network that consists of a hidden state h and an optional output y which operates on a variable-length sequence x = (x_1, ..., x_T). At each time step t, the hidden state h_⟨t⟩ of the RNN is updated by

h_⟨t⟩ = f(h_⟨t−1⟩, x_t),    (1)

where f is a non-linear activation function. f may be as simple as an element-wise logistic sigmoid function and as complex as a long short-term memory (LSTM) unit (Hochreiter and Schmidhuber, 1997).

An RNN can learn a probability distribution over a sequence by being trained to predict the next symbol in a sequence. In that case, the output at each time step t is the conditional distribution p(x_t | x_{t−1}, ..., x_1). For example, a multinomial distribution (1-of-K coding) can be output using a softmax activation function

p(x_{t,j} = 1 | x_{t−1}, ..., x_1) = exp(w_j h_⟨t⟩) / Σ_{j'=1}^{K} exp(w_{j'} h_⟨t⟩),    (2)

for all possible symbols j = 1, ..., K, where w_j are the rows of a weight matrix W. By combining these probabilities, we can compute the probability of the sequence x using

p(x) = Π_{t=1}^{T} p(x_t | x_{t−1}, ..., x_1).    (3)

From this learned distribution, it is straightforward to sample a new sequence by iteratively sampling a symbol at each time step.
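Eqs. (1)-(3) can be made concrete with a tiny toy implementation. The sketch below uses a tanh recurrence and a plain softmax in NumPy; the weight names (W_h, W_x, W_out) and shapes are illustrative assumptions, not from the paper.

```python
import numpy as np

def rnn_sequence_logprob(x_onehots, W_h, W_x, W_out):
    """Score a symbol sequence with a next-symbol RNN (Eqs. 1-3).

    x_onehots: (T, K) array of one-hot symbols; W_h, W_x, W_out are
    illustrative weights (hidden-to-hidden, input-to-hidden,
    hidden-to-logits)."""
    h = np.zeros(W_h.shape[0])
    logprob = 0.0
    for t in range(1, len(x_onehots)):
        # Eq. (1): h_<t> = f(h_<t-1>, x_t); here f is tanh of a linear map.
        h = np.tanh(W_h @ h + W_x @ x_onehots[t - 1])
        # Eq. (2): softmax over the K symbols (max subtracted for stability).
        logits = W_out @ h
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # Eq. (3): accumulate log p(x_t | x_{t-1}, ..., x_1).
        logprob += np.log(probs[int(x_onehots[t].argmax())])
    return logprob
```

Sampling a new sequence would simply draw each next symbol from `probs` instead of scoring a given one.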
2.2 RNN Encoder–Decoder

In this paper, we propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. p(y_1, ..., y_{T'} | x_1, ..., x_T), where one should note that the input and output sequence lengths T and T' may differ.

Figure 1: An illustration of the proposed RNN Encoder–Decoder.

The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes according to Eq. (1). After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary c of the whole input sequence.

The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol y_t given the hidden state h_⟨t⟩. However, unlike the RNN described in Sec. 2.1, both y_t and h_⟨t⟩ are also conditioned on y_{t−1} and on the summary c of the input sequence. Hence, the hidden state of the decoder at time t is computed by

h_⟨t⟩ = f(h_⟨t−1⟩, y_{t−1}, c),

and similarly, the conditional distribution of the next symbol is

P(y_t | y_{t−1}, y_{t−2}, ..., y_1, c) = g(h_⟨t⟩, y_{t−1}, c),

for given activation functions f and g (the latter must produce valid probabilities, e.g. with a softmax).

See Fig. 1 for a graphical depiction of the proposed model architecture.

The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood

max_θ (1/N) Σ_{n=1}^{N} log p_θ(y_n | x_n),    (4)
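A rough sketch of Sec. 2.2 under the same toy conventions: the encoder folds the source sequence into a summary vector c, and each decoder step conditions on the previous state, the previous output, and c. All weight names here are illustrative assumptions.

```python
import numpy as np

def encode(x_embeds, W_h, W_x):
    """Encoder RNN: read the source symbols left to right; the final
    hidden state serves as the summary c of the whole input (Sec. 2.2)."""
    h = np.zeros(W_h.shape[0])
    for x_t in x_embeds:
        h = np.tanh(W_h @ h + W_x @ x_t)  # Eq. (1)
    return h  # summary vector c

def decode_step(h_prev, y_prev, c, W_h, W_y, W_c):
    """One decoder step: h_<t> = f(h_<t-1>, y_{t-1}, c)."""
    return np.tanh(W_h @ h_prev + W_y @ y_prev + W_c @ c)
```

In a full system the decoder state would feed a softmax output layer g, and training would maximize Eq. (4) by back-propagating through both RNNs jointly.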
Published as a conference paper at ICLR 2015
The decoder is often trained to predict the next word y_{t'} given the context vector c and all the previously predicted words {y_1, ..., y_{t'−1}}. In other words, the decoder defines a probability over the translation y by decomposing the joint probability into the ordered conditionals:

p(y) = Π_{t=1}^{T} p(y_t | {y_1, ..., y_{t−1}}, c),    (2)

where y = (y_1, ..., y_{T_y}). With an RNN, each conditional probability is modeled as

p(y_t | {y_1, ..., y_{t−1}}, c) = g(y_{t−1}, s_t, c),    (3)

where g is a nonlinear, potentially multi-layered, function that outputs the probability of y_t, and s_t is the hidden state of the RNN. It should be noted that other architectures such as a hybrid of an RNN and a de-convolutional neural network can be used (Kalchbrenner and Blunsom, 2013).
3 LEARNING TO ALIGN AND TRANSLATE
In this section, we propose a novel architecture for neural machine translation. The new architectureconsists of a bidirectional RNN as an encoder (Sec. 3.2) and a decoder that emulates searchingthrough a source sentence during decoding a translation (Sec. 3.1).
3.1 DECODER: GENERAL DESCRIPTION
Figure 1: The graphical illustration of the proposed model trying to generate the t-th target word y_t given a source sentence (x_1, x_2, ..., x_T).
In a new model architecture, we define each conditional probability in Eq. (2) as:

p(y_i | y_1, ..., y_{i−1}, x) = g(y_{i−1}, s_i, c_i),    (4)

where s_i is an RNN hidden state for time i, computed by

s_i = f(s_{i−1}, y_{i−1}, c_i).

It should be noted that unlike the existing encoder–decoder approach (see Eq. (2)), here the probability is conditioned on a distinct context vector c_i for each target word y_i.

The context vector c_i depends on a sequence of annotations (h_1, ..., h_{T_x}) to which an encoder maps the input sentence. Each annotation h_i contains information about the whole input sequence with a strong focus on the parts surrounding the i-th word of the input sequence. We explain in detail how the annotations are computed in the next section.

The context vector c_i is then computed as a weighted sum of these annotations h_j:

c_i = Σ_{j=1}^{T_x} α_{ij} h_j.    (5)

The weight α_{ij} of each annotation h_j is computed by

α_{ij} = exp(e_{ij}) / Σ_{k=1}^{T_x} exp(e_{ik}),    (6)

where

e_{ij} = a(s_{i−1}, h_j)

is an alignment model which scores how well the inputs around position j and the output at position i match. The score is based on the RNN hidden state s_{i−1} (just before emitting y_i, Eq. (4)) and the j-th annotation h_j of the input sentence.
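Eqs. (4)-(6) amount to a softmax over feedforward alignment scores followed by a weighted sum. A minimal NumPy sketch, with parameter names following the paper but shapes chosen for illustration:

```python
import numpy as np

def attention_context(s_prev, annotations, v_a, W_a, U_a):
    """Compute the context vector c_i and weights alpha_ij (Eqs. 4-6).

    s_prev: previous decoder state s_{i-1}; annotations: list of h_j."""
    # Alignment scores e_ij = v_a . tanh(W_a s_{i-1} + U_a h_j).
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j)
                  for h_j in annotations])
    # Eq. (6): softmax-normalize the scores into weights alpha_ij.
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # Eq. (5): c_i is the alpha-weighted sum of the annotations.
    c_i = alpha @ np.stack(annotations)
    return c_i, alpha
```

Because the weights are normalized to one, attention is a zero-sum budget over the source words — the property the coverage model later exploits.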
We parametrize the alignment model a as a feedforward neural network which is jointly trained with all the other components of the proposed system. Note that unlike in traditional machine translation, the alignment is not considered to be a latent variable.
Traditional NMT
Cho et al. (2014)
Attentional NMT
Bahdanau et al. (2015)
our focus
Motivation
• Problem of NMT

[Slide animation: an attentional NMT decoder translates a five-word Chinese source sentence word by word. Each step shows the attention weights over the five source words and <eos> while the next target word is generated:

  "airports":  0.0  1.0  0.0  0.0  0.0 | <eos> 0.0
  "were":      0.0  0.0  0.9  0.1  0.0 | <eos> 0.0
  "closed":    0.0  0.0  0.6  0.4  0.0 | <eos> 0.0
  "to":        0.1  0.0  0.0  0.7  0.0 | <eos> 0.2
  "close":     0.1  0.0  0.0  0.7  0.0 | <eos> 0.2
  ".":         0.2  0.2  0.2  0.1  0.1 | <eos> 0.2
  "<eos>":     0.1  0.0  0.0  0.0  0.2 | <eos> 0.7

Final output: "many airports were closed to close . <eos>". Some source content is translated repeatedly ("closed ... close": over-translation), while other source words never receive enough attention and are left untranslated (under-translation).]
Motivation
• Coverage problem
• NMT lacks a coverage mechanism to indicate whether a source word has been translated or not.

[Slide: source words marked as covered or uncovered.]

Motivation
• Coverage is essential for SMT to ensure that each source word is translated exactly once.
• We believe that modeling coverage is also useful for NMT.
Motivation
• Coverage in SMT

[Slide: SMT decoding with an explicit coverage vector over a five-word source sentence (glossed roughly "many / airport / force / close / ."):

  f: * _ _ _ _   e: "many"             p: 0.53 — coverage: the 1st source word is translated
  f: * * _ _ _   e: "many airport"     p: 0.50 — coverage: the first two source words are translated
  f: * * * * *   e: "many ... close ." p: 0.32 — coverage: all source words are translated]
Approach
• Intuitions

[Slide animation: the same example decoded while maintaining a coverage vector over the source words, initialized to all zeros (no source word is translated yet) and updated after every attentive step:

  initial coverage:  0.0  0.0  0.0  0.0  0.0  0.0
  generate "many" (attention 0.5  0.2  0.2  0.1  0.0 | <eos> 0.0)
    → updating coverage:  0.5  0.2  0.2  0.1  0.0  0.0
  after "airports" and "were" (attention 0.0  0.0  0.9  0.1  0.0 | <eos> 0.0)
    → coverage:  0.5  1.2  1.1  0.2  0.0  0.0
  should the decoder now output "closed" (attention 0.0  0.0  0.6  0.4  0.0 | <eos> 0.0)?
    No — the coverage vector shows the source sentence is not fully translated.
  generate "forced" instead (attention 0.0  0.0  0.9  0.1  0.0 | <eos> 0.0)
    → updating coverage:  0.5  1.2  2.0  0.3  0.0  0.0
  after "to close down", coverage:  0.5  1.2  2.3  2.8  0.1  0.1 — the source is fully translated
  generate "." (attention 0.0  0.0  0.0  0.1  0.2 | <eos> 0.7)
    → updating coverage:  0.5  1.2  2.3  2.9  0.3  0.8
  every word is translated, so generate "<eos>" (attention 0.0  0.0  0.0  0.0  0.2 | <eos> 0.8)

  Final output: "many airports were forced to close down . <eos>"]
Approach
• A few equations

[Slide: after decoding, the final attention step is 0.0  0.0  0.0  0.0  0.1 | <eos> 0.9, and the accumulated coverage vector is 0.5  1.2  2.3  2.9  0.5  1.6.]
Formally, the coverage model is given by

C_{i,j} = g_update(C_{i−1,j}, α_{i,j}, Φ(h_j), Ψ)    (6)

where

• g_update(·) is the function that updates C_{i,j} after the new attention α_{i,j} at time step i in the decoding process;
• C_{i,j} is a d-dimensional coverage vector summarizing the history of attention till time step i on h_j;
• Φ(h_j) is a word-specific feature with its own parameters;
• Ψ are auxiliary inputs exploited in different sorts of coverage models.

Equation 6 gives a rather general model, which could take different function forms for g_update(·) and Φ(·), and different auxiliary inputs Ψ (e.g., the previous decoding state t_{i−1}). In the rest of this section, we will give a number of representative implementations of the coverage model, which either leverage more linguistic information (Section 3.1.1) or resort to the flexibility of neural network approximation (Section 3.1.2).
3.1.1 Linguistic Coverage Model

We first consider a linguistically inspired model which has a small number of parameters, as well as a clear interpretation. While the linguistically inspired coverage in NMT is similar to that in SMT, there is one key difference: it indicates what percentage of source words have been translated (i.e., soft coverage). In NMT, each target word y_i is generated from all source words, with probability α_{i,j} for source word x_j. In other words, the source word x_j is involved in generating all target words, and the probability of it generating target word y_i at time step i is α_{i,j}. Note that unlike in SMT, in which each source word is fully translated at one decoding step, the source word x_j is partially translated at each decoding step in NMT. Therefore, the coverage at time step i denotes the ratio to which each source word has been translated so far.
We use a scalar (d = 1) to represent the linguistic coverage of each source word and employ an accumulate operation for g_update. The initial value of linguistic coverage is zero, which denotes that the corresponding source word is not yet translated. We iteratively construct linguistic coverages through accumulation of the alignment probabilities generated by the attention model, each of which is normalized by a distinct context-dependent weight. The coverage of source word x_j at time step i is computed by

C_{i,j} = C_{i−1,j} + (1/Φ_j) α_{i,j} = (1/Φ_j) Σ_{k=1}^{i} α_{k,j}    (7)

where Φ_j is a pre-defined weight which indicates the number of target words x_j is expected to generate. The simplest way is to follow Xu et al. (2015) in image-to-caption translation and fix Φ = 1 for all source words, which means that we directly use the sum of previous alignment probabilities, without normalization, as the coverage for each word, as done in (Cohn et al., 2016).
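Eq. (7) is a running, per-word accumulation of attention weights, optionally normalized by Φ_j. A minimal sketch, assuming the attention weights for all decoding steps are stacked in one array:

```python
import numpy as np

def linguistic_coverage(alphas, fertility):
    """Unrolled Eq. (7): after step i, the coverage of source word j is
    (1/Phi_j) * sum_{k<=i} alpha_{k,j}.

    alphas: (I, T_x) attention weights for every decoding step;
    fertility: length-T_x array of Phi_j (all ones = plain accumulation)."""
    coverage = np.zeros_like(fertility, dtype=float)
    for alpha_i in alphas:
        # C_{i,j} = C_{i-1,j} + alpha_{i,j} / Phi_j
        coverage = coverage + alpha_i / fertility
    return coverage
```

With Φ_j = 1 this reduces to the plain attention sum; a larger Φ_j slows how quickly a word counts as covered.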
However, in machine translation, different types of source words may contribute differently to the generation of the target sentence. Let us take the sentence pairs in Figure 1 as an example. The noun in the source sentence "jīchǎng" is translated into one target word "airports", while the adjective "bèipò" is translated into three words "were forced to". Therefore, we need to assign a distinct Φ_j to each source word. Ideally, we expect Φ_j = Σ_{i=1}^{I} α_{i,j}, with I being the total number of time steps in decoding. However, such a desired value is not available before decoding, and thus is not suitable in this scenario.
Fertility  To predict Φ_j, we introduce the concept of fertility, which was first proposed in word-level SMT (Brown et al., 1993). The fertility of source word x_j tells how many target words x_j produces. In SMT, the fertility is a random variable Φ_j, whose distribution p(Φ_j = φ) is determined by the parameters of word alignment models (e.g., IBM models). In this work, we simplify and adapt fertility from the original model and compute the fertility Φ_j by[2]

Φ_j = N(x_j | x) = N · σ(U_f h_j)    (8)

where N ∈ R is a predefined constant to denote the maximum number of target words one source word can produce.

[2] Fertility in SMT is a random variable with a set of fertility probabilities, n(Φ_j | x_j) = p(Φ_{<j}, x), which depend on the fertilities of previous source words. To simplify the calculation and adapt it to the attention model in NMT, we define the fertility in NMT as a constant number, which is independent of previous fertilities.
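Eq. (8) can be sketched directly: each fertility is a sigmoid-squashed linear readout of the annotation, scaled by the cap N. Shapes here are illustrative (the paper uses U_f ∈ R^{1×2n}); the helper below treats U_f as a single weight row.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fertility(annotations, U_f, N=2.0):
    """Eq. (8): Phi_j = N * sigma(U_f h_j), so each fertility lies in
    (0, N) and N caps how many target words a source word may generate.

    annotations: list of h_j vectors; U_f: weight row (illustrative shape)."""
    return np.array([N * sigmoid(float(U_f @ h_j)) for h_j in annotations])
```

Since Φ_j does not depend on the decoding step i, all fertilities can be pre-computed once per sentence.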
Approach
• A few equations

[Slide: the same coverage vector 0.5  1.2  2.3  2.9  0.5  1.6, now alongside per-word fertilities 0.8  1.1  2.4  2.5  0.7  1.7. The fertility model tells how many target words each source word generates.]
ACL 2016 Submission ***. Confidential review copy. DO NOT DISTRIBUTE.
4 Training

In this paper, we take an end-to-end learning approach for our coverage-based NMT model, which jointly learns not only the parameters of the "original" RNNsearch (i.e., those of the encoding RNN, decoding RNN, and attention model) but also the parameters for coverage modeling (i.e., those for the annotation and its role in guiding the attention).
[Slide: dividing each word's coverage by its fertility gives per-word translated ratios: 63%, 109%, 96%, 116%, 71%, 94% (<eos>).]
Approach
Attention-based Neural Machine Translation (NMT) ignores past alignment information, which often leads to over-translation and under-translation.
Modeling Coverage for Neural Machine Translation
Zhaopeng Tu¹  Zhengdong Lu¹  Yang Liu²  Xiaohua Liu¹  Hang Li¹
INTRODUCTION
COVERAGE MODEL
Our work: We maintain a coverage vector to keep track of the attention history. The coverage vector is fed to the attention model to help adjust future attention, which guides NMT to consider more about the untranslated source words.
The coverage model is given by

C_{i,j} = g_update(C_{i−1,j}, α_{i,j}, Φ(h_j), Ψ)    (6)

• C_{i,j} is a d-dimensional coverage vector summarizing the history of attention (e.g., α_{i,j}) till time step i on the annotation of source word h_j;
• g_update is the function that updates C_{i,j} after the new attention at time step i;
• Φ(h_j) is a word-specific feature with its own parameters;
• Ψ are auxiliary inputs exploited in different sorts of coverage models.
EXPERIMENTS
Two Representative Implementations
INTEGRATING COVERAGE INTO NMT
1 Huawei Noah’s Ark Lab, Hong Kong 2 Tsinghua University, Beijing
Figure 4: NN-based coverage model.
Since the coverage vector summarizes the attention record for each source word, it will discourage further attention to a word that has been heavily attended, and implicitly push the attention to the less attended segments of the source sentence, since the attention weights are normalized to one. This could potentially solve both coverage mistakes mentioned above, when modelled and learned properly.
3.1.2 Neural Network Based Coverage Model

When C_{i,j} is a vector (d > 1) and g_update(·) takes a Neural Network (NN) form, we actually have an RNN model for coverage, as illustrated by Figure 4. In our work, we take the following form:

C_{i,j} = f(C_{i−1,j}, α_{i,j}, h_j, t_{i−1})

where f(·) is a nonlinear activation function and t_{i−1} is the auxiliary input that encodes past translation information. Note that we leave out the word-specific feature function Φ(·) and only take the input annotation h_j as the input to the coverage RNN. It is important to emphasize that the NN-based coverage model can be fed with arbitrary inputs, such as the previous attentional context s_{i−1}. Here we only employ C_{i−1,j} for past alignment information, t_{i−1} for past translation information, and h_j for word-specific bias.

Gating  The neural function f(·) can be either a simple activation function tanh or a gating function, which proves useful for capturing long-distance dependencies. In this work, we adopt GRU for the gating activation since it is simple yet powerful (Chung et al., 2014). Please refer to (Cho et al., 2014b) for more details about GRU.

Although the NN-based coverage model enjoys the flexibility brought by the recurrent nonlinear form, its lack of clear linguistic meaning may render it hard to train: the coverage model can only be trained along with the attention model and gets its supervision signal from it in back-propagation, which can be weak (relatively distant from the decoding process) and noisy (after distortion from other under-trained components in the decoder RNN). Partly to overcome this problem, we also propose the linguistically inspired model, which has a much clearer interpretation but far fewer parameters.
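One plausible way to realize the GRU-based update C_{i,j} = f(C_{i−1,j}, α_{i,j}, h_j, t_{i−1}) is sketched below. The concatenated-input gate parametrization and the weight layout in `params` are assumptions of this sketch, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_coverage_update(C_prev, alpha_ij, h_j, t_prev, params):
    """One GRU step for the NN-based coverage vector C_{i,j}.

    Inputs: previous coverage C_{i-1,j}, new attention weight alpha_{i,j},
    annotation h_j, and previous decoder state t_{i-1}."""
    x = np.concatenate(([alpha_ij], h_j, t_prev))          # GRU input
    z = sigmoid(params["Wz"] @ x + params["Uz"] @ C_prev)  # update gate
    r = sigmoid(params["Wr"] @ x + params["Ur"] @ C_prev)  # reset gate
    C_tilde = np.tanh(params["W"] @ x + params["U"] @ (r * C_prev))
    # Interpolate between the old coverage and the candidate update.
    return (1.0 - z) * C_prev + z * C_tilde
```

The gates let the model decide, per source word, how much of the attention history to keep and how much to overwrite.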
Neural Network Based Coverage Model
• C_{i,j} is a vector (d > 1) and g_update takes a Neural Network (NN) form;
• the NN-based coverage model can be fed with arbitrary inputs.

①: alignment decisions (α_i) are made by jointly taking into account past alignment information (C_{i−1});
②: coverage models (C_i) are updated after every attentive decision (α_i), and thus store the attention history.

Translation Quality
[Chart: BLEU score (26-31) against source sentence length, in bins [0,10), [10,20), ..., [70,80), comparing Moses, NMT, +Linguistic Coverage, and +NN-based Coverage.]

[Chart: length of translation (0-90) against the same source-length bins, for the same systems.]
• Chinese-to-English translation
• Training corpus: 1.25M sentence pairs (27.9M and 34.5M words)
• Vocabulary: 30K; max sentence length: 80
• Other settings are the same as in (Bahdanau et al., 2015)
Alignment Quality

[Chart: alignment error rate (AER, 50-55) for NMT, +Linguistic Coverage, +NN-based Coverage (d=1), and +NN-based Coverage (d=10).]

Effects on Long Sentences

A higher BLEU score means better translation, while a lower AER score means better alignment.
Example Translations

[Slide: example translations from NMT vs. NMT-Coverage.]

Example Alignments

Translation and Alignment Quality
In Eq. (8), N ∈ R is a predefined constant denoting the maximum number of target words one source word can produce, σ(·) is a logistic sigmoid function, and U_f ∈ R^{1×2n} is the weight matrix. Here we use h_j to denote (x_j | x), since h_j contains information about the whole input sentence with a strong focus on the parts surrounding x_j (Bahdanau et al., 2015). Since Φ_j does not depend on i, we can pre-compute it before decoding to minimize the computational cost.
3.1.2 Neural Network Based Coverage ModelWe next consider Neural Network (NN) basedcoverage model. When Ci,j is a vector (d > 1) andgupdate(·) is a neural network, we actually havean RNN model for coverage, as illustrated in Fig-ure 4. In this work, we take the following form:
Ci,j = f(Ci�1,j ,↵i,j ,hj , ti�1)
where f(·) is a nonlinear activation function andti�1 is the auxiliary input that encodes past trans-lation information. Note that we leave out theword-specific feature function �(·) and only takethe input annotation hj as the input to the cov-erage RNN. It is important to emphasize that theNN-based coverage model is able to be fed witharbitrary inputs, such as the previous attentionalcontext si�1. Here we only employ Ci�1,j for pastalignment information, ti�1 for past translation in-formation, and hj for word-specific bias.3
Gating The neural function $f(\cdot)$ can be either a simple activation function such as tanh or a gating function, which has proven useful for capturing long-distance dependencies. In this work, we adopt the GRU as the gating activation since it is simple yet powerful (Chung et al., 2014). Please refer to (Cho et al., 2014b) for more details about the GRU.

³In our preliminary experiments, considering more inputs (e.g., the current and previous attentional contexts, or the unnormalized attention weights $e_{i,j}$) does not always lead to better translation quality. Possible reasons include: 1) the inputs contain duplicate information, and 2) more inputs introduce more back-propagation paths and therefore make the model more difficult to train. In our experience, one principle is to feed the coverage model only inputs that contain distinct, complementary information.
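As a concrete illustration, the GRU-style coverage update $C_{i,j} = f(C_{i-1,j}, \alpha_{i,j}, h_j, t_{i-1})$ can be sketched in NumPy as below. This is a minimal sketch under assumed shapes; the function and parameter names (`gru_coverage_update`, `W_z`, `U_z`, etc.) are illustrative and not taken from the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_coverage_update(C_prev, alpha, h, t_prev, params):
    """One GRU step of the NN-based coverage model (illustrative shapes).

    C_prev : (d,)  coverage of source word x_j after decoding step i-1
    alpha  : float attention weight alpha_{i,j}
    h      : (2n,) annotation of source word x_j
    t_prev : (n,)  previous decoder state t_{i-1}
    """
    # Concatenate the auxiliary inputs fed to the coverage GRU.
    x = np.concatenate(([alpha], h, t_prev))
    W_z, U_z, W_r, U_r, W_c, U_c = (params[k] for k in
                                    ("W_z", "U_z", "W_r", "U_r", "W_c", "U_c"))
    z = sigmoid(W_z @ x + U_z @ C_prev)              # update gate
    r = sigmoid(W_r @ x + U_r @ C_prev)              # reset gate
    C_tilde = np.tanh(W_c @ x + U_c @ (r * C_prev))  # candidate coverage
    # Interpolate between the old coverage and the candidate.
    return (1.0 - z) * C_prev + z * C_tilde
```

With $d = 1$ and an identity-like $f$, this degenerates toward the linguistic (accumulation) model; with $d > 1$ the gates let the network decide how much past coverage to keep per dimension.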
Discussion Intuitively, the two types of models summarize coverage information in "different languages". Linguistic models summarize coverage information in human language, which has a clear interpretation for humans. Neural models encode coverage information in a "neural language", which can be "understood" by neural networks, letting them decide how to make use of the encoded coverage information.
3.2 Integrating Coverage into NMT
Although the attention-based model has the capability of jointly making alignment and translation, it does not take translation history into consideration. Specifically, a source word that has significantly contributed to the generation of target words in the past should be assigned lower alignment probabilities in subsequent steps, which may not be the case in attention-based NMT. To address this problem, we propose to calculate the alignment probabilities by incorporating past alignment information embedded in the coverage model.

Intuitively, at each time step $i$ in the decoding phase, the coverage from time step $i-1$ serves as an additional input to the attention model, providing complementary information about how likely the source words have been translated in the past. We expect the coverage information to guide the attention model to focus more on untranslated source words (i.e., to assign them higher alignment probabilities). In practice, we find that the coverage model does fulfill this expectation (see Section 5): the translated ratios of source words from linguistic coverages correlate negatively with the corresponding alignment probabilities.
More formally, we rewrite the attention model in Equation 5 as

$$e_{i,j} = a(t_{i-1}, h_j, C_{i-1,j}) = v_a^\top \tanh(W_a t_{i-1} + U_a h_j + V_a C_{i-1,j})$$

where $C_{i-1,j}$ is the coverage of source word $x_j$ before time $i$, and $V_a \in \mathbb{R}^{n \times d}$ is the weight matrix for coverage, with $n$ and $d$ being the numbers of hidden units and coverage units, respectively.
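The coverage-aware attention above can be sketched as a small NumPy function; all names (`coverage_attention`, `W_a`, `U_a`, `V_a`, `v_a`) mirror the symbols in the equation rather than any released code, and the shapes are assumptions.

```python
import numpy as np

def coverage_attention(t_prev, H, C_prev, W_a, U_a, V_a, v_a):
    """Alignment probabilities with a coverage term:
    e_{i,j} = v_a^T tanh(W_a t_{i-1} + U_a h_j + V_a C_{i-1,j}).

    t_prev : (n,)     previous decoder state t_{i-1}
    H      : (J, 2n)  source annotations h_1 .. h_J
    C_prev : (J, d)   coverage of each source word before step i
    """
    J = H.shape[0]
    e = np.empty(J)
    for j in range(J):
        # Energy for source position j, now biased by its coverage.
        e[j] = v_a @ np.tanh(W_a @ t_prev + U_a @ H[j] + V_a @ C_prev[j])
    # Softmax turns energies into alignment probabilities alpha_{i,j}.
    e = e - e.max()
    alpha = np.exp(e) / np.exp(e).sum()
    return alpha
```

A trained $V_a$ can thus push probability mass away from well-covered source words and toward untranslated ones.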
3.1.1 Linguistic Coverage Model
ACL 2016 Submission ***. Confidential review copy. DO NOT DISTRIBUTE.
We use a scalar ($d = 1$) to represent the linguistic coverage of each source word and employ an accumulate operation for $g_{\mathrm{update}}$. We iteratively construct linguistic coverages through an accumulation of the alignment probabilities generated by the attention model, each normalized by a distinct context-dependent weight. The coverage of source word $x_j$ at time step $i$ is computed by

$$C_{i,j} = \frac{1}{\Phi_j} \sum_{k=1}^{i} \alpha_{k,j} \qquad (7)$$
where $\Phi_j$ is a pre-defined weight indicating the number of target words $x_j$ is expected to generate. The simplest way is to follow Xu et al. (2015) in image-to-caption translation and fix $\Phi = 1$ for all source words, which means that we directly use the sum of previous alignment probabilities, without normalization, as the coverage of each word, as done in (Cohn et al., 2016).

However, in natural languages, different types of source words contribute differently to the generation of the translation. Take the sentence pair in Figure 1 as an example: the noun on the source side "hangji" is translated into one target word "flights", while the quantifier "liangqianduo" is translated into two words "over 2,000". Therefore, we need to assign a distinct $\Phi_j$ to each source word. Ideally, we expect $\Phi_j = \sum_{k=1}^{T_y} \alpha_{k,j}$, with $T_y$ being the total number of time steps in decoding. However, this desired value is not available before decoding, and thus is not suitable in this scenario.
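The linguistic coverage of Equation 7 reduces to a fertility-normalized running sum of attention weights. A minimal sketch (the function name and array layout are assumptions; $\Phi_j = 1$ recovers plain accumulation):

```python
import numpy as np

def linguistic_coverage(alphas, phi):
    """C_{i,j} = (1 / Phi_j) * sum_{k=1..i} alpha_{k,j}   (Equation 7).

    alphas : (i, J) attention weights from decoding steps 1..i
    phi    : (J,)   per-word fertility normalizers Phi_j
    """
    # Accumulate attention over decoding steps, then normalize per word.
    return alphas.sum(axis=0) / phi
```

With a good $\Phi_j$, $C_{i,j}$ approaches 1 exactly when source word $x_j$ is fully translated, which is what makes it usable as a "translated ratio".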
Fertility To predict $\Phi_j$, we introduce the concept of fertility, which was first proposed in word-level SMT (Brown et al., 1993). The fertility of source word $x_j$ tells how many target words $x_j$ produces. In SMT, fertility is a random variable $\Phi_j$ whose distribution $p(\Phi_j = \phi)$ is determined by the parameters of word alignment models (e.g., the IBM models). In this work, we simplify and adapt fertility from the original model¹ and compute the fertility $\Phi_j$ by

$$\Phi_j = N(x_j \mid \mathbf{x}) = N \cdot \sigma(U_f h_j) \qquad (8)$$

where $N \in \mathbb{R}$ is a predefined constant denoting the maximum number of target words one source word can produce, $\sigma(\cdot)$ is a logistic sigmoid function, and $U_f \in \mathbb{R}^{1 \times 2n}$ is the weight matrix. Here we use $h_j$ to denote $(x_j \mid \mathbf{x})$, since $h_j$ contains information about the whole input sentence with a strong focus on the parts surrounding $x_j$ (Bahdanau et al., 2015). Since $\Phi_j$ does not depend on $i$, we can pre-compute it before decoding to minimize the computational cost.

¹Fertility in SMT is a random variable with a set of fertility probabilities, $n(\Phi_j \mid x_j) = p(\Phi_j \mid \Phi_1^{j-1}, \mathbf{x})$, which depends on the fertilities of previous source words. To simplify the calculation and adapt it to the attention model in NMT, we define the fertility in NMT as a constant number, independent of previous fertilities.
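Equation 8 is a single sigmoid layer over the annotation, scaled by the cap $N$, computed once per sentence. A minimal sketch with illustrative names (`predict_fertility`, `U_f` as a vector rather than a $1 \times 2n$ matrix):

```python
import numpy as np

def predict_fertility(H, U_f, N=2.0):
    """Phi_j = N * sigmoid(U_f h_j)   (Equation 8).

    H   : (J, 2n) source annotations h_1 .. h_J
    U_f : (2n,)   fertility weights (the paper's 1 x 2n matrix, flattened)
    N   : cap on how many target words one source word may produce
    """
    # One sigmoid per source word, scaled into (0, N); independent of the
    # decoding step i, so it can be pre-computed before decoding.
    return N / (1.0 + np.exp(-(H @ U_f)))
```

The value of $N$ here is an assumption for illustration; the paper only states that $N$ is a predefined constant.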
4 Training
In this paper, we take end-to-end learning for our NMT-COVERAGE model, which jointly learns not only the parameters of the "original" NMT (i.e., those of the encoding RNN, the decoding RNN, and the attention model) but also the parameters for coverage modeling (i.e., those for annotation and its
A simple accumulation of attention history
$\Phi_j$ is the fertility of source word $x_j$, which tells how many target words $x_j$ produces:
• $C_{i,j}$ is a scalar ($d = 1$), which indicates what percentage of a source word has been translated;
• It has a clearer linguistic interpretation but fewer parameters.
The coverage model alleviates the over-translation (e.g., of "guanbi") and under-translation (e.g., of "beipo") problems that NMT without coverage suffers from.
With the coverage mechanism, translated source words are less likely to contribute to the generation of subsequent target words (e.g., the top-right corner for the first four Chinese words).
Translation Quality
• Chinese-to-English translation
• Training corpus: 1.25M sentence pairs (27.9M source words, 34.5M target words)
• Vocabulary: 30K
• Max sentence length: 80
• Other settings are the same as in (Bahdanau et al., 2015)
[BLEU chart: Groundhog baseline (28.32) vs. + Linguistic Coverage, + NN-based Coverage (d = 1), and + NN-based Coverage (d = 10), with BLEU scores of 29.59, 29.86, and 30.14; every coverage variant outperforms the baseline.]
(a) Over-translation and under-translation generated by NMT. (b) The coverage model alleviates the problems of over-translation and under-translation.

Figure 1: Example translations of (a) NMT without coverage, and (b) NMT with coverage. In conventional NMT without coverage, the Chinese word "guanbi" is over-translated to "close(d)" twice, while "beipo" (meaning "be forced to") is mistakenly left untranslated. The coverage model alleviates these problems by tracking the "coverage" of source words.
during the decoding process, to keep track of the attention history. The coverage vector, when fed into the attention model, can help adjust future attention and significantly improve the overall alignment between the source and target sentences. This design subsumes many particular cases of coverage modeling with contrasting characteristics, all of which share a clear linguistic intuition and yet can be trained in a data-driven fashion. Notably, we achieve a significant improvement even by simply using the sum of previous alignment probabilities as the coverage of each word, a successful example of incorporating linguistic knowledge into neural network based NLP models.
Experiments show that NMT-COVERAGE significantly outperforms conventional attention-based NMT on both translation and alignment tasks. Figure 1(b) shows an example in which NMT-COVERAGE alleviates the over-translation and under-translation problems that NMT without coverage suffers from.
2 Background
Our work is built on attention-based NMT (Bahdanau et al., 2015), which simultaneously conducts dynamic alignment and generation of the target sentence, as illustrated in Figure 2. It produces the translation by generating one target word $y_i$ at each time step. Given an input sentence $\mathbf{x} = \{x_1, \ldots, x_J\}$ and the previously generated words $\{y_1, \ldots, y_{i-1}\}$, the probability of generating the next word $y_i$ is

$$P(y_i \mid y_{<i}, \mathbf{x}) = \mathrm{softmax}\big(g(y_{i-1}, t_i, s_i)\big) \qquad (1)$$

where $g$ is a non-linear function and $t_i$ is the decoding state for time step $i$, computed by

$$t_i = f(t_{i-1}, y_{i-1}, s_i) \qquad (2)$$

Figure 2: Architecture of attention-based NMT. Whenever possible, we omit the source index $j$ to make the illustration less cluttered.

Here the activation function $f(\cdot)$ is a Gated Recurrent Unit (GRU) (Cho et al., 2014b), and $s_i$ is
Alignment Quality
The coverage model improves alignment performance as well (the lower the AER score, the better the alignment quality).
[AER chart: Groundhog baseline (54.67) vs. + Linguistic Coverage, + NN-based Coverage (d = 1), and + NN-based Coverage (d = 10), with AER scores of 52.13, 53.51, and 50.5; every coverage variant achieves a lower (better) AER than the baseline.]
Alignment Quality
[Example alignment matrices: Groundhog vs. + NN-based Coverage (d = 10). With the coverage mechanism, translated source words are less likely to contribute to the generation of subsequent target words.]
Effect on Long Sentences
Figure 6: Performance of the generated translations with respect to the lengths of the input sentences. Coverage models alleviate under-translation by producing longer translations for long sentences.
in which the under-translation is rectified. The quantitative and qualitative results show that the coverage models indeed help to alleviate under-translation, especially for long sentences consisting of several sub-sentences.
6 Related Work
Our work is inspired by recent work on improving attention-based NMT with techniques that have been successfully applied to SMT. Following the success of Minimum Risk Training (MRT) in SMT (Och, 2003), Shen et al. (2016) proposed MRT for end-to-end NMT to optimize model parameters directly with respect to evaluation metrics. Based on the observation that attention-based NMT captures only partial aspects of attentional regularities, Cheng et al. (2016) proposed agreement-based learning (Liang et al., 2006) to encourage bidirectional attention models to agree on parameterized alignment matrices. In the same direction, inspired by the coverage mechanism in SMT, we propose a coverage-based approach to NMT to alleviate the over-translation and under-translation problems.
Independently of our work, Cohn et al. (2016) and Feng et al. (2016) made use of the concept of "fertility" for the attention model, which is similar in spirit to our method of building the linguistically inspired coverage with fertility. Cohn et al. (2016) introduced a feature-based fertility that includes the total alignment scores of the surrounding source words. In contrast, we predict fertility before decoding, which serves as a normalizer to better estimate the coverage ratio of each source word. Feng et al. (2016) used the previous attentional context to represent implicit fertility and passed it to the attention model, which is in essence similar to the input-feed method proposed in (Luong et al., 2015). Comparatively, we predict an explicit fertility for each source word based on its encoding annotation and incorporate it into the linguistically inspired coverage for the attention model.
7 Conclusion
We have presented an approach for enhancing NMT that maintains and utilizes a coverage vector to indicate whether each source word has been translated. By encouraging NMT to pay less attention to translated words and more attention to untranslated words, our approach alleviates the serious over-translation and under-translation problems that traditional attention-based NMT suffers from. We propose two variants of coverage models: a linguistic coverage model that leverages more linguistic information, and an NN-based coverage model that resorts to the flexibility of neural network approximation. Experimental results show that both variants achieve significant improvements in translation quality and alignment quality over NMT without coverage.
Effect on Long Sentences
source: [Chinese source sentence; characters not recoverable from the extraction]
NMT: jordan achieved an average score of eight weeks ahead with a surgical operation three weeks ago .
NMT-Coverage: jordan 's average score points to UNK this year . he received surgery before three weeks , with a team in the period of 4 to 8 .
Conclusion
• We have presented an approach that maintains a coverage vector for NMT to indicate whether each source word is translated or not.
• The coverage model alleviates the serious over-translation and under-translation problems that attentional NMT suffers from.
• Experimental results show that the coverage model significantly improves both translation and alignment quality.
Source Code
https://www.github.com/tuzhaopeng/NMT-Coverage
Try on your own data and task!
We are hiring …
http://www.noahlab.com.hk
Ph.D. positions in:
• Machine learning
• Data mining
• Speech & language processing
• Information & knowledge management
• Intelligent systems