
Modeling Coverage for Neural Machine Translation

Zhaopeng Tu

Huawei Noah’s Ark Lab, Hong Kong

Joint work with

Zhengdong Lu, Yang Liu, Xiaohua Liu, Hang Li

Outline

• Motivation

• Approach

• Experiments


Motivation

• Neural Machine Translation

2 RNN Encoder–Decoder

2.1 Preliminary: Recurrent Neural Networks

A recurrent neural network (RNN) is a neural network that consists of a hidden state h and an optional output y, which operates on a variable-length sequence x = (x_1, ..., x_T). At each time step t, the hidden state h_{\langle t \rangle} of the RNN is updated by

h_{\langle t \rangle} = f(h_{\langle t-1 \rangle}, x_t),   (1)

where f is a non-linear activation function. f may be as simple as an element-wise logistic sigmoid function and as complex as a long short-term memory (LSTM) unit (Hochreiter and Schmidhuber, 1997).

An RNN can learn a probability distribution over a sequence by being trained to predict the next symbol in a sequence. In that case, the output at each time step t is the conditional distribution p(x_t | x_{t-1}, ..., x_1). For example, a multinomial distribution (1-of-K coding) can be output using a softmax activation function

p(x_{t,j} = 1 | x_{t-1}, ..., x_1) = \frac{\exp(w_j h_{\langle t \rangle})}{\sum_{j'=1}^{K} \exp(w_{j'} h_{\langle t \rangle})},   (2)

for all possible symbols j = 1, ..., K, where w_j are the rows of a weight matrix W. By combining these probabilities, we can compute the probability of the sequence x using

p(x) = \prod_{t=1}^{T} p(x_t | x_{t-1}, ..., x_1).   (3)

From this learned distribution, it is straightforward to sample a new sequence by iteratively sampling a symbol at each time step.

2.2 RNN Encoder–Decoder

In this paper, we propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. p(y_1, ..., y_{T'} | x_1, ..., x_T), where one should note that the input and output sequence lengths T and T' may differ.

[Figure 1: An illustration of the proposed RNN Encoder–Decoder: an encoder reads x_1, ..., x_T into a summary c, and a decoder emits y_1, ..., y_{T'}.]

The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes according to Eq. (1). After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary c of the whole input sequence.

The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol y_t given the hidden state h_{\langle t \rangle}. However, unlike the RNN described in Sec. 2.1, both y_t and h_{\langle t \rangle} are also conditioned on y_{t-1} and on the summary c of the input sequence. Hence, the hidden state of the decoder at time t is computed by

h_{\langle t \rangle} = f(h_{\langle t-1 \rangle}, y_{t-1}, c),

and similarly, the conditional distribution of the next symbol is

P(y_t | y_{t-1}, y_{t-2}, ..., y_1, c) = g(h_{\langle t \rangle}, y_{t-1}, c),

for given activation functions f and g (the latter must produce valid probabilities, e.g. with a softmax).

See Fig. 1 for a graphical depiction of the proposed model architecture.

The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood

\max_{\theta} \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}(y_n | x_n).   (4)
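To make Eqs. (1)-(3) concrete, here is a minimal numpy sketch of a plain RNN scoring a sequence of one-hot symbols; the tanh recurrence, the weight names (W_hh, W_xh, W_out) and all dimensions are illustrative assumptions, not the parameterization used in the paper.

    import numpy as np

    def rnn_step(h_prev, x_t, W_hh, W_xh, b_h):
        # Eq. (1): h_t = f(h_{t-1}, x_t), with f chosen here as an element-wise tanh.
        return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

    def next_symbol_distribution(h, W_out):
        # Eq. (2): softmax over the K possible symbols, given the current hidden state.
        logits = W_out @ h
        e = np.exp(logits - logits.max())
        return e / e.sum()

    def sequence_log_prob(X, W_hh, W_xh, b_h, W_out):
        # Eq. (3): log p(x) = sum_t log p(x_t | x_{t-1}, ..., x_1).
        # X is a list of one-hot vectors; the prediction for x_t uses only x_1 .. x_{t-1}.
        h = np.zeros(W_hh.shape[0])
        log_p = 0.0
        for x_t in X:
            p = next_symbol_distribution(h, W_out)
            log_p += np.log(p[np.argmax(x_t)])
            h = rnn_step(h, x_t, W_hh, W_xh, b_h)
        return log_p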

The decoder is often trained to predict the next word y_{t'} given the context vector c and all the previously predicted words {y_1, ..., y_{t'-1}}. In other words, the decoder defines a probability over the translation y by decomposing the joint probability into the ordered conditionals:

p(y) = \prod_{t=1}^{T} p(y_t | \{y_1, ..., y_{t-1}\}, c),   (2)

where y = (y_1, ..., y_{T_y}). With an RNN, each conditional probability is modeled as

p(y_t | \{y_1, ..., y_{t-1}\}, c) = g(y_{t-1}, s_t, c),   (3)

where g is a nonlinear, potentially multi-layered, function that outputs the probability of y_t, and s_t is the hidden state of the RNN. It should be noted that other architectures such as a hybrid of an RNN and a de-convolutional neural network can be used (Kalchbrenner and Blunsom, 2013).

3 LEARNING TO ALIGN AND TRANSLATE

In this section, we propose a novel architecture for neural machine translation. The new architecture consists of a bidirectional RNN as an encoder (Sec. 3.2) and a decoder that emulates searching through a source sentence while decoding a translation (Sec. 3.1).

3.1 DECODER: GENERAL DESCRIPTION

[Figure 1: The graphical illustration of the proposed model trying to generate the t-th target word y_t given a source sentence (x_1, x_2, ..., x_T).]

In a new model architecture, we define each conditional probability in Eq. (2) as:

p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i),   (4)

where s_i is an RNN hidden state for time i, computed by

s_i = f(s_{i-1}, y_{i-1}, c_i).

It should be noted that unlike the existing encoder–decoder approach (see Eq. (2)), here the probability is conditioned on a distinct context vector c_i for each target word y_i.

The context vector c_i depends on a sequence of annotations (h_1, ..., h_{T_x}) to which an encoder maps the input sentence. Each annotation h_i contains information about the whole input sequence with a strong focus on the parts surrounding the i-th word of the input sequence. We explain in detail how the annotations are computed in the next section.

The context vector c_i is, then, computed as a weighted sum of these annotations h_j:

c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j.   (5)

The weight \alpha_{ij} of each annotation h_j is computed by

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})},   (6)

where

e_{ij} = a(s_{i-1}, h_j)

is an alignment model which scores how well the inputs around position j and the output at position i match. The score is based on the RNN hidden state s_{i-1} (just before emitting y_i, Eq. (4)) and the j-th annotation h_j of the input sentence.

We parametrize the alignment model a as a feedforward neural network which is jointly trained with all the other components of the proposed system. Note that unlike in traditional machine translation, …
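A minimal numpy sketch of the attention computation in Eqs. (4)-(6) above: a feedforward scorer produces alignment scores e_{ij}, a softmax turns them into weights alpha_{ij}, and the context vector c_i is their weighted sum over the annotations. The additive scorer v_a^T tanh(W_a s + U_a h) and all weight shapes are illustrative assumptions.

    import numpy as np

    def attention_context(s_prev, H, v_a, W_a, U_a):
        # e_{ij} = a(s_{i-1}, h_j): a small feedforward net scores each annotation h_j.
        e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])
        # alpha_{ij} = exp(e_{ij}) / sum_k exp(e_{ik})   (Eq. 6)
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()
        # c_i = sum_j alpha_{ij} h_j                     (Eq. 5)
        c_i = alpha @ H
        return c_i, alpha

    # H is a (T_x, 2n) matrix of encoder annotations, one row per source position;
    # s_prev is the previous decoder state s_{i-1}.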

Traditional NMT

Cho et al. (2014)

Attentional NMT

Bahdanau et al. (2015)

our focus

Motivation

• Problem of NMT

[Slide animation: an attention-based decoder translates a Chinese source sentence step by step. At each step the decoder state s_i attends over the source words (e.g. with weights 0.5, 0.2, 0.2, 0.1, 0.0, 0.0 at the first step) and emits the next target word, producing "many airports were closed to close . <eos>".]

[Final frame: the output exhibits over-translation ("close" repeats the already generated "closed") and under-translation (part of the source sentence is left untranslated).]

Motivation

• Coverage problem

• NMT lacks a coverage mechanism to indicate whether a source word has been translated or not.

[Figure: in the running example, some source words are covered (translated) while others remain uncovered.]

We need coverage.

Motivation

• Coverage is essential for SMT to ensure that each source word is translated exactly once.

• We believe that modeling coverage is also useful for NMT.

Motivation

• Coverage in SMT

[Figure: phrase-based SMT decoding for the running example. Each hypothesis carries a coverage vector over the source words:
f: * _ _ _ _   e: "many"             p: 0.53   (the 1st source word is translated)
f: * * _ _ _   e: "many airport"     p: 0.50   (the first two source words are translated)
f: * * * * *   e: "many ... close ." p: 0.32   (all source words are translated)]

Outline

• Motivation

• Approach

• Experiments

Approach

• Intuitions

[Slide animation: a coverage value is maintained for each source word, initialized to 0.0 ("no source word is translated yet"). After every decoding step the attention weights are accumulated into the coverage ("updating coverage"): after generating "many" with attention (0.5, 0.2, 0.2, 0.1, 0.0, 0.0) the coverage becomes (0.5, 0.2, 0.2, 0.1, 0.0, 0.0), and after "many airports were" it reaches (0.5, 1.2, 1.1, 0.2, 0.0, 0.0). Because the coverage shows which source words are not yet fully translated, the decoder is guided away from repeating "close(d)" and generates "forced" instead, updating the coverage to (0.5, 1.2, 2.0, 0.3, 0.0, 0.0). Decoding continues ("many airports were forced to close down ."), and once the coverage (0.5, 1.2, 2.3, 2.9, 0.3, 0.8) shows that every source word is translated, the decoder emits "<eos>", producing "many airports were forced to close down . <eos>".]

Approach

• A few equations

[Figure: the running example with the accumulated coverage (0.5, 1.2, 2.3, 2.9, 0.5, 1.6) over the source words.]

Formally, the coverage model is given by

C_{i,j} = g_{update}(C_{i-1,j}, \alpha_{i,j}, \Phi(h_j), \Psi)   (6)

where

• g_{update}(\cdot) is the function that updates C_{i,j} after the new attention \alpha_{i,j} at time step i in the decoding process;

• C_{i,j} is a d-dimensional coverage vector summarizing the history of attention up to time step i on h_j;

• \Phi(h_j) is a word-specific feature with its own parameters;

• \Psi are auxiliary inputs exploited in different sorts of coverage models.

Equation 6 gives a rather general model, which could take different functional forms for g_{update}(\cdot) and \Phi(\cdot), and different auxiliary inputs \Psi (e.g., the previous decoding state t_{i-1}). In the rest of this section, we give a number of representative implementations of the coverage model, which either leverage more linguistic information (Section 3.1.1) or resort to the flexibility of neural network approximation (Section 3.1.2).

3.1.1 Linguistic Coverage Model

We first consider a linguistically inspired model which has a small number of parameters as well as a clear interpretation. While the linguistically inspired coverage in NMT is similar to that in SMT, there is one key difference: it indicates what percentage of each source word has been translated (i.e., soft coverage). In NMT, each target word y_i is generated from all source words, with probability \alpha_{i,j} for source word x_j. In other words, the source word x_j is involved in generating all target words, and the probability of it generating target word y_i at time step i is \alpha_{i,j}. Note that unlike in SMT, where each source word is fully translated at one decoding step, the source word x_j is partially translated at each decoding step in NMT. Therefore, the coverage at time step i denotes the ratio to which each source word has been translated so far.

We use a scalar (d = 1) to represent the linguistic coverage of each source word and employ an accumulate operation for g_{update}. The initial value of linguistic coverage is zero, which denotes that the corresponding source word has not been translated yet. We iteratively construct linguistic coverages through accumulation of the alignment probabilities generated by the attention model, each of which is normalized by a distinct context-dependent weight. The coverage of source word x_j at time step i is computed by

C_{i,j} = C_{i-1,j} + \frac{1}{\Phi_j} \alpha_{i,j} = \frac{1}{\Phi_j} \sum_{k=1}^{i} \alpha_{k,j}   (7)

where \Phi_j is a pre-defined weight which indicates how many target words x_j is expected to generate. The simplest way is to follow Xu et al. (2015) in image-to-caption translation and fix \Phi = 1 for all source words, which means that we directly use the sum of previous alignment probabilities, without normalization, as the coverage of each word, as done in (Cohn et al., 2016).

However, in machine translation, different types of source words may contribute differently to the generation of the target sentence. Take the sentence pair in Figure 1 as an example: the noun in the source sentence "jīchǎng" is translated into one target word "airports", while the adjective "bèipò" is translated into three words "were forced to". Therefore, we need to assign a distinct \Phi_j to each source word. Ideally, we would expect \Phi_j = \sum_{i=1}^{I} \alpha_{i,j}, with I being the total number of time steps in decoding. However, this desired value is not available before decoding, and thus is not suitable in this scenario.

Fertility. To predict \Phi_j, we introduce the concept of fertility, which was first proposed in word-level SMT (Brown et al., 1993). The fertility of source word x_j tells how many target words x_j produces. In SMT, the fertility is a random variable \Phi_j, whose distribution p(\Phi_j = \phi) is determined by the parameters of word alignment models (e.g., IBM models). In this work, we simplify and adapt fertility from the original model and compute the fertility \Phi_j by

\Phi_j = N(x_j | x) = N \cdot \sigma(U_f h_j)   (8)

where N \in \mathbb{R} is a predefined constant denoting the maximum number of target words one source word can produce, \sigma(\cdot) is a logistic sigmoid function, and U_f \in \mathbb{R}^{1 \times 2n} is the weight matrix. Here we use h_j to denote (x_j | x), since h_j contains information about the whole input sentence with a strong focus on the parts surrounding x_j (Bahdanau et al., 2015). Since \Phi_j does not depend on i, we can pre-compute it before decoding to minimize the computational cost.

[Footnote: Fertility in SMT is a random variable with a set of fertility probabilities n(\phi_j | x_j) = p(\phi_{<j}, x), which depends on the fertilities of previous source words. To simplify the calculation and adapt it to the attention model in NMT, we define the fertility in NMT as a constant number, which is independent of previous fertilities.]
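As an illustration of Eqs. (7)-(8), a minimal numpy sketch of the linguistic coverage model: fertilities are predicted once from the encoder annotations and then used to normalize the accumulated attention. The maximum fertility N = 2.0 and the shape of U_f are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def predict_fertility(H, U_f, N=2.0):
        # Eq. (8): Phi_j = N * sigma(U_f h_j); pre-computed once before decoding.
        # H is (T_x, 2n) and U_f is (2n,), giving one fertility per source word.
        return N * sigmoid(H @ U_f)

    def update_linguistic_coverage(C_prev, alpha_i, phi):
        # Eq. (7): C_{i,j} = C_{i-1,j} + alpha_{i,j} / Phi_j
        # C_prev and alpha_i are (T_x,) vectors of per-word coverage and attention.
        return C_prev + alpha_i / phi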

Approach

• A few equations

[Figure: the fertility model tells how many target words each source word generates; for the running example the predicted fertilities are (0.8, 1.1, 2.4, 2.5, 0.7, 1.7), alongside the accumulated coverage (0.5, 1.2, 2.3, 2.9, 0.5, 1.6).]



[Figure: dividing the accumulated coverage by the predicted fertilities gives the covered percentage of each source word: 63%, 109%, 96%, 116%, 71%, 94%.]
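A quick check of these percentages with the values read off the preceding slides (the coverage row there is the raw accumulated attention, before the fertility normalization of Eq. 7):

    coverage  = [0.5, 1.2, 2.3, 2.9, 0.5, 1.6]
    fertility = [0.8, 1.1, 2.4, 2.5, 0.7, 1.7]
    covered = [c / f for c, f in zip(coverage, fertility)]
    # -> approximately [0.63, 1.09, 0.96, 1.16, 0.71, 0.94],
    #    i.e. 63%, 109%, 96%, 116%, 71%, 94%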


Approach

Modeling Coverage for Neural Machine Translation
Zhaopeng Tu (1), Zhengdong Lu (1), Yang Liu (2), Xiaohua Liu (1), Hang Li (1)
(1) Huawei Noah's Ark Lab, Hong Kong   (2) Tsinghua University, Beijing

INTRODUCTION

Attention-based Neural Machine Translation (NMT) ignores past alignment information, which often leads to over-translation and under-translation.

COVERAGE MODEL

Our work: we maintain a coverage vector to keep track of the attention history. The coverage vector is fed to the attention model to help adjust future attention, which guides NMT to pay more attention to the untranslated source words.

The coverage model is given by Equation 6 above, where

• C_{i,j} is a d-dimensional coverage vector summarizing the history of attention (i.e., the \alpha_{i,j}) up to time step i on the annotation h_j of source word x_j;

• g_{update} is the function that updates C_{i,j} after the new attention at time step i;

• \Phi(h_j) is a word-specific feature with its own parameters;

• \Psi are auxiliary inputs exploited in different sorts of coverage models.

INTEGRATING COVERAGE INTO NMT

EXPERIMENTS

Two Representative Implementations


Figure 4: NN-based coverage model.

… it will discourage further attention to a source word if it has been heavily attended, and implicitly push the attention to the less attended segments of the source sentence, since the attention weights are normalized to one. This could potentially solve both coverage mistakes mentioned above, when modelled and learned properly.

3.1.2 Neural Network Based Coverage Model

When C_{i,j} is a vector (d > 1) and g_{update}(\cdot) takes a neural network (NN) form, we actually have an RNN model for coverage, as illustrated in Figure 4. In this work, we take the following form:

C_{i,j} = f(C_{i-1,j}, \alpha_{i,j}, h_j, t_{i-1})

where f(\cdot) is a nonlinear activation function and t_{i-1} is an auxiliary input that encodes past translation information. Note that we leave out the word-specific feature function \Phi(\cdot) and only take the input annotation h_j as the input to the coverage RNN. It is important to emphasize that the NN-based coverage model can be fed with arbitrary inputs, such as the previous attentional context s_{i-1}. Here we only employ C_{i-1,j} for past alignment information, t_{i-1} for past translation information, and h_j for word-specific bias. [Footnote: In our preliminary experiments, considering more inputs (e.g., current and previous attentional contexts, unnormalized attention weights e_{i,j}) does not always lead to better translation quality. Possible reasons include: 1) the inputs contain duplicate information, and 2) more inputs introduce more back-propagation paths and therefore make the model more difficult to train. In our experience, one principle is to feed the coverage model only inputs that contain distinct, complementary information.]

Gating. The neural function f(\cdot) can be either a simple activation function tanh or a gating function, which proves useful for capturing long-distance dependencies. In this work, we adopt GRU as the gating activation since it is simple yet powerful (Chung et al., 2014). Please refer to (Cho et al., 2014b) for more details about GRU.

Although the NN-based coverage model enjoys the flexibility brought by its recurrent nonlinear form, its lack of clear linguistic meaning may render it hard to train: the coverage model can only be trained along with the attention model and gets its supervision signal from it in back-propagation, which could be weak (relatively distant from the decoding process) and noisy (after the distortion from other under-trained components in the decoder RNN). Partially to overcome this problem, we also propose the linguistically inspired model, which has a much clearer interpretation but far fewer parameters.
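A minimal numpy sketch of this NN-based coverage update with a GRU-style gate, following C_{i,j} = f(C_{i-1,j}, alpha_{i,j}, h_j, t_{i-1}). The way the inputs are concatenated and the weight shapes are illustrative assumptions, not the authors' exact parameterization.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gru_coverage_update(C_prev, alpha_ij, h_j, t_prev, W_z, W_r, W_c):
        # Inputs besides the previous coverage: the new attention weight alpha_{ij},
        # the annotation h_j (word-specific bias) and the previous decoder state
        # t_{i-1} (past translation information).
        x = np.concatenate(([alpha_ij], h_j, t_prev))
        z = sigmoid(W_z @ np.concatenate((x, C_prev)))            # update gate
        r = sigmoid(W_r @ np.concatenate((x, C_prev)))            # reset gate
        C_cand = np.tanh(W_c @ np.concatenate((x, r * C_prev)))   # candidate coverage
        return (1.0 - z) * C_prev + z * C_cand                    # new coverage C_{i,j}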

Neural Network Based Coverage Model

• C_{i,j} is a vector (d > 1) and g_{update} takes a neural network (NN) form;

• the NN-based coverage model can be fed with arbitrary inputs.

①: alignment decisions (\alpha_i) are made by jointly taking into account past alignment information (C_{i-1});

②: coverage vectors (C_i) are updated after every attentive decision (\alpha_i), and thus store the attention history.

Translation Quality

[Figures: BLEU score and length of translation plotted against the length of the source sentence (buckets [0,10) through [70,80)), comparing Moses, NMT, NMT + Linguistic Coverage, and NMT + NN-based Coverage.]

Alignment Quality

[Figure: alignment error rate (AER) for NMT, + Linguistic Coverage, + NN-based Coverage (d=1), and + NN-based Coverage (d=10).]

Effects on Long Sentences

Higher BLEU means better translation, while lower AER means better alignment.

Example Translations (NMT vs. NMT-Coverage)

Example Alignments

Translation and Alignment Quality

Discussion. Intuitively, the two types of models summarize coverage information in "different languages". Linguistic models summarize coverage information in human language, which has a clear interpretation to humans. Neural models encode coverage information in a "neural language", which can be "understood" by neural networks, letting them decide how to make use of the encoded coverage information.

3.2 Integrating Coverage into NMT

Although the attention-based model has the capability of jointly making alignment and translation, it does not take translation history into consideration. Specifically, a source word that has significantly contributed to the generation of target words in the past should be assigned lower alignment probabilities in the following decoding steps, which may not be the case in attention-based NMT. To address this problem, we propose to calculate the alignment probabilities by incorporating past alignment information embedded in the coverage model.

Intuitively, at each time step i in the decoding phase, the coverage from time step (i - 1) serves as an additional input to the attention model, providing complementary information about how much the source words have been translated in the past. We expect the coverage information to guide the attention model to focus more on untranslated source words (i.e., to assign them higher alignment probabilities). In practice, we find that the coverage model does fulfill this expectation (see Section 5): the translated ratios of source words from linguistic coverage correlate negatively with the corresponding alignment probabilities.

More formally, we rewrite the attention model in Equation 5 as

e_{i,j} = a(t_{i-1}, h_j, C_{i-1,j}) = v_a^{\top} \tanh(W_a t_{i-1} + U_a h_j + V_a C_{i-1,j})

where C_{i-1,j} is the coverage of source word x_j before time i, and V_a \in \mathbb{R}^{n \times d} is the weight matrix for coverage, with n and d being the numbers of hidden units and coverage units, respectively.

4 Training

We take end-to-end learning for our NMT-COVERAGE model, which jointly learns not only the parameters of the "original" NMT (i.e., those of the encoding RNN, decoding RNN, and attention model) but also the parameters for coverage modeling (i.e., those for the annotation and its role in guiding the attention).
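A minimal numpy sketch of this coverage-aware attention: compared with standard additive attention, the alignment score gains an extra term V_a C_{i-1,j}. Weight names follow the equation above; the shapes are illustrative assumptions.

    import numpy as np

    def coverage_attention(t_prev, H, C_prev, v_a, W_a, U_a, V_a):
        # e_{i,j} = v_a^T tanh(W_a t_{i-1} + U_a h_j + V_a C_{i-1,j})
        e = np.array([v_a @ np.tanh(W_a @ t_prev + U_a @ h_j + V_a @ c_j)
                      for h_j, c_j in zip(H, C_prev)])
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()              # attention weights over the source positions
        s_i = alpha @ H                   # attentional context for the current step
        return s_i, alpha

    # H is (T_x, 2n); C_prev is (T_x, d) with one coverage vector per source word
    # (d = 1 for the linguistic model, d > 1 for the NN-based model).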

Linguistic Coverage Model





A simple accumulation of the attention history, where \Phi_j is the fertility of source word x_j, telling how many target words x_j produces:

• C_{i,j} is a scalar (d = 1), which indicates what percentage of a source word has been translated;

• it has a clearer linguistic interpretation but fewer parameters.

NMT vs. NMT-Coverage

The coverage model alleviates the over-translation (of "guānbì") and under-translation (of "bèipò") problems that NMT without coverage suffers from.

Using the coverage mechanism, translated source words are less likely to contribute to the generation of subsequent target words (e.g., see the top-right corner of the alignment matrix for the first four Chinese words).

Outline

• Motivation

• Approach

• Experiments

Translation Quality

• Chinese-to-English translation
• Training corpus: 1.25M sentence pairs (27.9M and 34.5M words)
• Vocabulary: 30K; max sentence length: 80
• Other settings are the same as in (Bahdanau et al., 2015)

[Bar chart: BLEU scores for GroundHog and for GroundHog + Linguistic Coverage, + NN-based Coverage (d=1), and + NN-based Coverage (d=10); the chart reports 28.32 for GroundHog and 29.59-30.14 for the three coverage variants.]

(a) Over-translation and under-translation generated by NMT.

(b) The coverage model alleviates the problems of over-translation and under-translation.

Figure 1: Example translations of (a) NMT without coverage, and (b) NMT with coverage. In conventional NMT without coverage, the Chinese word "guānbì" is over-translated into "close(d)" twice, while "bèipò" (meaning "be forced to") is mistakenly left untranslated. The coverage model alleviates these problems by tracking the "coverage" of source words.

… during the decoding process, to keep track of the attention history. The coverage vector, when entering into the attention model, can help adjust the future attention and significantly improve the overall alignment between the source and target sentences. This design contains many particular cases for coverage modeling with contrasting characteristics, which all share a clear linguistic intuition and yet can be trained in a data-driven fashion. Notably, we achieve significant improvement even by simply using the sum of previous alignment probabilities as the coverage of each word, as a successful example of incorporating linguistic knowledge into neural network based NLP models.

Experiments show that NMT-COVERAGE significantly outperforms conventional attention-based NMT on both translation and alignment tasks. Figure 1(b) shows an example, in which NMT-COVERAGE alleviates the over-translation and under-translation problems that NMT without coverage suffers from.

2 Background

Our work is built on attention-based NMT (Bahdanau et al., 2015), which simultaneously conducts dynamic alignment and generation of the target sentence, as illustrated in Figure 2.

[Figure 2: Architecture of attention-based NMT. Whenever possible, we omit the source index j to make the illustration less cluttered.]

It produces the translation by generating one target word y_i at each time step. Given an input sentence x = \{x_1, ..., x_J\} and the previously generated words \{y_1, ..., y_{i-1}\}, the probability of generating the next word y_i is

P(y_i | y_{<i}, x) = \mathrm{softmax}(g(y_{i-1}, t_i, s_i))   (1)

where g is a non-linear function, and t_i is a decoding state for time step i, computed by

t_i = f(t_{i-1}, y_{i-1}, s_i)   (2)

Here the activation function f(\cdot) is a Gated Recurrent Unit (GRU) (Cho et al., 2014b), and s_i is …
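Putting Eqs. (1)-(2) of this background section together, a minimal numpy sketch of one decoding step: the decoder state is updated from the previous state, the previous target word embedding, and the attentional context s_i, and a softmax readout produces the next-word distribution. Using a plain tanh recurrence in place of the GRU and a linear readout for g are simplifying assumptions.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def decode_step(t_prev, y_prev_emb, s_i, W_t, W_y, W_s, W_o):
        # Eq. (2): t_i = f(t_{i-1}, y_{i-1}, s_i); a tanh recurrence stands in for the GRU.
        t_i = np.tanh(W_t @ t_prev + W_y @ y_prev_emb + W_s @ s_i)
        # Eq. (1): P(y_i | y_<i, x) = softmax(g(y_{i-1}, t_i, s_i)); g is a linear readout here.
        logits = W_o @ np.concatenate((y_prev_emb, t_i, s_i))
        return t_i, softmax(logits)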

Alignment Quality

The coverage model improves alignment performance as well (the lower the AER, the better the alignment quality).

[Bar chart: alignment error rate (AER) for GroundHog and for + Linguistic Coverage, + NN-based Coverage (d=1), and + NN-based Coverage (d=10); the chart reports 54.67 for GroundHog and 50.5-53.51 for the coverage variants.]

Alignment Quality

[Figure: example alignment matrices for GroundHog vs. + NN-based Coverage (d=10).]

Using the coverage mechanism, translated source words are less likely to contribute to the generation of subsequent target words.

Effect on Long Sentences

Figure 6: Performance of the generated translations with respect to the lengths of the input sentences. Coverage models alleviate under-translation by producing longer translations on long sentences.

… in which the under-translation is rectified. The quantitative and qualitative results show that the coverage models indeed help to alleviate under-translation, especially for long sentences consisting of several sub-sentences.

6 Related Work

Our work is inspired by recent work on improving attention-based NMT with techniques that have been successfully applied to SMT. Following the success of Minimum Risk Training (MRT) in SMT (Och, 2003), Shen et al. (2016) proposed MRT for end-to-end NMT to optimize model parameters directly with respect to evaluation metrics. Based on the observation that attention-based NMT captures only partial aspects of attentional regularities, Cheng et al. (2016) proposed agreement-based learning (Liang et al., 2006) to encourage bidirectional attention models to agree on parameterized alignment matrices. Along the same direction, inspired by the coverage mechanism in SMT, we propose a coverage-based approach to NMT to alleviate the over-translation and under-translation problems.

Independently from our work, Cohn et al. (2016) and Feng et al. (2016) made use of the concept of "fertility" for the attention model, which is similar in spirit to our method of building the linguistically inspired coverage with fertility. Cohn et al. (2016) introduced a feature-based fertility that includes the total alignment scores for the surrounding source words. In contrast, we predict fertility before decoding, where it works as a normalizer to better estimate the coverage ratio of each source word. Feng et al. (2016) used the previous attentional context to represent implicit fertility and passed it to the attention model, which is in essence similar to the input-feed method proposed in (Luong et al., 2015). Comparatively, we predict an explicit fertility for each source word based on its encoding annotation and incorporate it into the linguistically inspired coverage for the attention model.

7 Conclusion

We have presented an approach for enhancing NMT which maintains and utilizes a coverage vector to indicate whether each source word has been translated or not. By encouraging NMT to pay less attention to translated words and more attention to untranslated words, our approach alleviates the serious over-translation and under-translation problems that traditional attention-based NMT suffers from. We propose two variants of coverage models: linguistic coverage, which leverages more linguistic information, and NN-based coverage, which resorts to the flexibility of neural network approximation. Experimental results show that both variants achieve significant improvements in translation quality and alignment quality over NMT without coverage.

Effect on Long Sentences

source: [long Chinese sentence; only the numerals "24.3", "4", and "8" are legible in the transcript]

NMT: jordan achieved an average score of eight weeks ahead with a surgical operation three weeks ago .

NMT-Coverage: jordan 's average score points to UNK this year . he received surgery before three weeks , with a team in the period of 4 to 8 .

Conclusion

• We have presented an approach that maintains a coverage vector for NMT to indicate whether each source word is translated or not.

• The coverage model alleviates the serious over-translation and under-translation problems that attentional NMT suffers from.

• Experimental results show that the coverage model significantly improves both translation and alignment quality.

Source Code

https://www.github.com/tuzhaopeng/NMT-Coverage

Try it on your own data and task!

We are hiring …

http://www.noahlab.com.hk

Ph.D. positions in:
• Machine learning
• Data mining
• Speech & language processing
• Information & knowledge management
• Intelligent systems