
Page 1: Reducing Odd Generation from Neural Headline Generation

Shun Kiyono 1*, Sho Takase 2, Jun Suzuki 1,2,4, Naoaki Okazaki 3, Kentaro Inui 1,4, Masaaki Nagata 2

1 Tohoku University  2 NTT Communication Science Laboratories  3 Tokyo Institute of Technology  4 RIKEN Center for Advanced Intelligence Project

{kiyono,jun.suzuki,inui}@ecei.tohoku.ac.jp, {takase.sho, nagata.masaaki}@lab.ntt.co.jp, [email protected]

Abstract

The Encoder-Decoder model is widely used in natural language generation tasks. However, the model sometimes suffers from repeated redundant generation, misses important phrases, and includes irrelevant entities. Toward solving these problems, we propose a novel source-side token prediction module. Our method jointly estimates the probability distributions over source and target vocabularies to capture the correspondence between source and target tokens. Experiments show that the proposed model outperforms the current state-of-the-art method in the headline generation task. We also show that our method can learn a reasonable token-wise correspondence without knowing any true alignment.1

1 Introduction

The Encoder-Decoder model with the attention mechanism (EncDec) (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015; Luong et al., 2015) has been an epoch-making development that has led to great progress in many natural language generation tasks, such as machine translation (Bahdanau et al., 2015), dialog generation (Shang et al., 2015), and headline generation (Rush et al., 2015). Today, EncDec and its variants are widely used as the predominant baseline method in these tasks.

* This work is a product of a collaborative research program of Tohoku University and NTT Communication Science Laboratories.

1 Our code for reproducing the experiments is available at https://github.com/butsugiri/UAM

Unfortunately, as often discussed in the community, EncDec sometimes generates sentences with repeating phrases or completely irrelevant phrases, and the reason for their generation cannot be interpreted intuitively. Moreover, EncDec also sometimes generates sentences that lack important phrases. We refer to these observations as the odd generation problem (odd-gen) in EncDec. The following table shows typical examples of odd-gen actually generated by a typical EncDec.

(1) Repeating Phrases

Gold: duran duran group fashionable again
EncDec: duran duran duran duran

(2) Lack of Important Phrases

Gold: graf says goodbye to tennis due to injuries
EncDec: graf retires

(3) Irrelevant Phrases

Gold: u.s. troops take first position in serb-held bosnia
EncDec: precede sarajevo

This paper tackles the reduction of the odd-gen in the task of abstractive summarization. In the machine translation literature, coverage (Tu et al., 2016; Mi et al., 2016) and reconstruction (Tu et al., 2017) are promising extensions of EncDec that address the odd-gen. These models take advantage of the fact that machine translation is a loss-less generation (lossless-gen) task, where the semantic information of the source- and target-side sequences is equivalent. However, as discussed in previous studies, abstractive summarization is a lossy-compression generation (lossy-gen) task. Here, the task is to delete certain semantic information from the source to generate the target-side sequence.

Page 2: Background: Encoder-Decoder

Input: 1st sentence of news article
Taiwan on Tuesday bowed to calls to make the island more international by easing its immigration rules .

Output: Headline
Taiwan eases immigration rules

EncDec maps the input sentence to the output headline.

Page 3: EncDec

[Architecture diagram: a bidirectional RNN encoder reads the source tokens ("Taiwan", "on", "Tuesday", ...) and an RNN decoder generates the headline tokens ("Taiwan", "eases", ..., <eos>, <pad>, <pad>). A cross-entropy term ("Cross Ent.") is computed at each decoding step and summed ("SUM") into the generation loss ℓ_gen.]
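The diagram corresponds to the standard summed cross-entropy objective. Below is a minimal sketch of that generation loss ℓ_gen (PyTorch; the function and variable names are ours, not the released code, and whether <pad> positions are masked in the authors' implementation is not shown on the slide):

```python
# Minimal sketch of the summed per-step cross-entropy loss (l_gen) in the diagram.
# Illustrative only; not taken from the released code.
import torch
import torch.nn.functional as F

def generation_loss(decoder_logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """decoder_logits: (T, V_t) unnormalized scores, one row per decoding step.
    target_ids: (T,) gold token ids, e.g. "Taiwan eases ... <eos> <pad> <pad>"."""
    # "Cross Ent." at every step, then "SUM" -> l_gen, as drawn in the figure.
    return F.cross_entropy(decoder_logits, target_ids, reduction="sum")
```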

Page 4: Background: Odd-gen of EncDec

We challenge reducing odd-gen in headline generation.


Page 5: Related Works

1. Repeating Phrases
2. Irrelevant Phrases
3. Lack of Important Phrases

This is the first work tackling the entire set of odd-gen problems.

Page 6: How?: Consider token-wise correspondence

1. Repeating Phrases may be reduced by managing the token-wise correspondence between source and target.
2. Irrelevant Phrases may be reduced by adding a (soft) constraint that target tokens must correspond to source tokens.
3. Lack of Important Phrases may be reduced by selecting important concepts from the source.

Source: Taiwan on Tuesday bowed to calls to make the … by easing its immigration rules .
Target: Taiwan eases immigration rules <pad> <pad> … <pad> <pad> <pad> <pad> <pad>

Hypothesis: odd-gen can be reduced by considering the token-wise correspondence.
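The padded target shown above suggests one concrete preprocessing step. Below is a minimal sketch of it, under the assumption that the target headline is simply right-padded with <pad> up to the source length (the paper later notes that the rare pairs whose target is longer than the source were removed from training):

```python
# Sketch: right-pad the target headline with <pad> up to the source length,
# as depicted on the slide (an assumption about the exact padding scheme).
def pad_target(target_tokens, source_len, pad="<pad>"):
    # If the target were longer than the source, nothing is appended; such
    # pairs are reported to be removed from the training data anyway.
    return target_tokens + [pad] * max(0, source_len - len(target_tokens))

print(pad_target("Taiwan eases immigration rules".split(), 8))
# ['Taiwan', 'eases', 'immigration', 'rules', '<pad>', '<pad>', '<pad>', '<pad>']
```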

Page 7: Predict source-side tokens that correspond to target-side tokens

[Diagram: at each decoding step, the decoder RNN simultaneously predicts two distributions: the usual target-side distribution (e.g., "eases", "<pad>") and a source-side distribution (e.g., "easing", "Tuesday").]

Source: Taiwan on Tuesday bowed to calls to make the … by easing its immigration rules .
Target: Taiwan eases immigration rules <pad> <pad> … <pad> <pad> <pad> <pad>
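A minimal sketch of this dual prediction (PyTorch; the class and layer names are ours, not the released code, and the real decoder also uses attention and previous-token inputs not shown here):

```python
# Sketch: one decoder state, two distributions (target-side and source-side).
# Illustrative; hypothetical names, not the authors' implementation.
import torch
import torch.nn as nn

class DualPredictionHead(nn.Module):
    def __init__(self, hidden_size: int, target_vocab: int, source_vocab: int):
        super().__init__()
        self.target_out = nn.Linear(hidden_size, target_vocab)  # usual decoder softmax
        self.source_out = nn.Linear(hidden_size, source_vocab)  # source-side prediction

    def forward(self, decoder_state: torch.Tensor):
        p_target = torch.softmax(self.target_out(decoder_state), dim=-1)
        q_source = torch.softmax(self.source_out(decoder_state), dim=-1)
        return p_target, q_source  # predicted simultaneously at every decoding step
```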

Page 8: EncDec

[Recap of the EncDec diagram from Page 3: the decoder's per-step cross-entropy terms are summed into the generation loss ℓ_gen.]

Page 9: EncDec + Predict source-side tokens

[Diagram: in addition to the target-side outputs used for the cross-entropy terms of ℓ_gen, the decoder emits a source-side prediction q at every decoding step.]

Page 10: How to optimize?: Sentence-wise loss

[Diagram: the per-step source-side predictions q are summed ("SUM") over all decoding steps, and the sum is compared with a sentence-level vector of the source tokens by a Mean Squared Error loss ℓ_src, alongside the summed cross-entropy generation loss ℓ_gen.]
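A minimal sketch of this objective, assuming (from the SUM and Mean Squared Error boxes) that the per-step source-side predictions are summed over decoding steps and matched against the bag-of-words vector of the source, and that ℓ_src is added to ℓ_gen with the weight C = 10 listed in Table 1; the exact normalization used by the authors is not visible on the slide:

```python
# Sketch of the sentence-wise source-side loss and the combined objective.
# Assumptions: MSE between summed source-side predictions and the source
# bag-of-words vector; weight C = 10 from the configuration table.
import torch

def source_prediction_loss(q_steps: torch.Tensor, source_ids: torch.Tensor) -> torch.Tensor:
    """q_steps: (T, V_s) source-vocabulary distributions, one per decoding step.
    source_ids: (I,) ids of the source tokens."""
    summed_q = q_steps.sum(dim=0)  # "SUM" over all decoding steps
    bow = torch.bincount(source_ids, minlength=q_steps.size(1)).float()
    return torch.mean((summed_q - bow) ** 2)  # "Mean Squared Error"

def total_loss(l_gen: torch.Tensor, l_src: torch.Tensor, C: float = 10.0) -> torch.Tensor:
    return l_gen + C * l_src
```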

Page 11: Proposed Model: SPM

[Diagram: the source-side prediction branch of Pages 9-10 is packaged as the Source Prediction Module (SPM), trained with the sentence-wise loss ℓ_src in addition to the generation loss ℓ_gen.]

We expect that corresponding tokens get predicted (e.g., source-side "easing" for target-side "eases"), i.e., the SPM models the token-wise correspondence.

Page 12: Dataset: Annotated Gigaword

Annotated Gigaword (a collection of news articles)

Input: 1st sentence of news article
Taiwan on Tuesday bowed to calls to make the island more international by easing its immigration rules .

Output: Headline
Taiwan eases immigration rules

Article-Headline Parallel Corpus [Rush+2015] (approximately 4M pairs)

Page 13: Experimental Configurations

Use values from previous studies.

… the previous studies. We randomly sampled test data and validation data from the validation split since we found that the test split contains many noisy instances. Finally, our validation and test data consist of 8K and 10K source-headline pairs, respectively. Note that they are relatively large compared with the previously used datasets, and they do not contain ⟨unk⟩.

We also evaluated our experiments on the test data used in the previous studies. To the best of our knowledge, two test sets from the Gigaword are publicly available, by Rush et al. (2015)6 and Zhou et al. (2017)7. Here, both test sets contain ⟨unk⟩.8

Evaluation Metric  We evaluated the performance by ROUGE-1 (RG-1), ROUGE-2 (RG-2) and ROUGE-L (RG-L).9 We report the F1 value as given in a previous study.10 We computed the scores with the official ROUGE script (version 1.5.5).

Comparative Methods  To investigate the effectiveness of the SPM, we evaluate the performance of the EncDec with the SPM. In addition, we investigate whether the SPM improves the performance of the state-of-the-art method, EncDec+sGate. Thus, we compare the following methods under the same training setting.

EncDec: This is the implementation of the base model explained in Section 3.

EncDec+sGate: To reproduce the state-of-the-art method proposed by Zhou et al. (2017), we combined our re-implemented selective gate (sGate) with the encoder of EncDec.

EncDec+SPM: We combined the SPM with the EncDec as explained in Section 4.

EncDec+sGate+SPM: This is the combination of the SPM with the EncDec+sGate.

Implementation Details  Table 1 summarizes hyper-parameters and model configurations. We selected the settings commonly used in the previous studies, e.g., (Rush et al., 2015; Nallapati et al., 2016; Suzuki and Nagata, 2017).

6 https://github.com/harvardnlp/sent-summary
7 https://res.qyzhou.me
8 We summarize the details of the dataset in Appendix D.
9 We restored sub-words to the standard token split for the evaluation.
10 The ROUGE script option is: "-n2 -m -w 1.2".

Source Vocab. Size Vs:   5131
Target Vocab. Size Vt:   5131
Word Embed. Size D:      200
Hidden State Size H:     400
RNN Cell:                Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997)
Encoder RNN Unit:        2-layer bidirectional-LSTM
Decoder RNN Unit:        2-layer LSTM with attention (Luong et al., 2015)
Optimizer:               Adam (Kingma and Ba, 2015)
Initial Learning Rate:   0.001
Learning Rate Decay:     0.5 for each epoch (after epoch 9)
Weight C of ℓ_src:       10
Mini-batch Size:         256 (shuffled at each epoch)
Gradient Clipping:       5
Stopping Criterion:      Max 15 epochs with early stopping
Regularization:          Dropout (rate 0.3)
Beam Search:             Beam size 20 with the length normalization

Table 1: Configurations used in our experiments

We constructed the vocabulary set using Byte-Pair-Encoding11 (BPE) (Sennrich et al., 2016) to handle low-frequency words, as it is now a common practice in neural machine translation. The BPE merge operations are jointly learned from the source and the target. We set the number of BPE merge operations at 5,000. We used the same vocabulary set for both the source Vs and the target Vt.
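Footnote 9 says sub-words were restored to the standard token split before ROUGE evaluation. A minimal sketch, assuming the usual subword-nmt convention of marking non-final pieces with "@@ " (the example string is hypothetical):

```python
# Sketch: undo BPE segmentation before ROUGE scoring (assumes the common
# subword-nmt "@@ " continuation marker; not taken from the released code).
def restore_subwords(bpe_line: str) -> str:
    return bpe_line.replace("@@ ", "").replace("@@", "")

print(restore_subwords("tai@@ wan eases immigr@@ ation rules"))
# -> "taiwan eases immigration rules"
```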

5.2 Results

Table 2 summarizes results for all test data. The table consists of three parts split by horizontal lines. The top and middle rows show the results under our training procedure, and the bottom row shows the results reported in previous studies. Note that the top and middle rows are not directly comparable to the bottom row due to the differences in preprocessing and vocabulary settings.

The top row of Table 2 shows that EncDec+SPM outperformed both EncDec and EncDec+sGate. This result indicates that the SPM can improve the performance of EncDec. Moreover, it is noteworthy that EncDec+sGate+SPM achieved the best performance in all metrics even though EncDec+sGate consists of essentially the same architecture as the current state-of-the-art model, i.e., SEASS.

11 https://github.com/rsennrich/subword-nmt

Page 14: Result: ROUGE Score


                                    | Gigaword Test (Ours) | Gigaword Test (Rush) | Gigaword Test (Zhou) | MSR-ATC Test
                                    | RG-1  RG-2  RG-L     | RG-1  RG-2  RG-L     | RG-1  RG-2  RG-L     | RG-1  RG-2  RG-L
EncDec                              | 45.74 23.80 42.95    | 34.52 16.77 32.19    | 45.56 24.16 42.82    | 27.01 13.65 24.90
EncDec+sGate (our impl. of SEASS)   | 45.98 24.17 43.16    | 35.00 17.24 32.72    | 46.17 24.54 43.27    | 27.71 14.16 25.44
EncDec+SPM†                         | 46.18 24.34 43.35    | 34.87 17.44 32.76    | 46.21 24.78 43.27    | 29.07 14.84 26.65
EncDec+sGate+SPM†                   | 46.41 24.58 43.59    | 35.79 17.84 33.34    | 46.34 24.85 43.49    | 29.51 15.53 27.17
EncDec+sGate+SPM (5 Ensemble)†      | 47.16 25.34 44.34    | 36.23 18.11 33.79    | 47.39 25.89 44.41    | 31.36 17.50 29.17
ABS (Rush et al., 2015)             | -     -     -        | 29.55 11.32 26.42    | 37.41 15.87 34.70    | 20.27  5.26 17.10
SEASS (Zhou et al., 2017)           | -     -     -        | 36.15 17.54 33.63    | 46.86 24.58 43.53    | 25.75 10.63 22.90
DRGD (Li et al., 2017)              | -     -     -        | 36.27 17.57 33.62    | -     -     -        | -     -     -
WFE (Suzuki and Nagata, 2017)       | -     -     -        | 36.30 17.31 33.88    | -     -     -        | -     -     -
conv-s2s (Gehring et al., 2017)     | -     -     -        | 35.88 17.48 33.29    | -     -     -        | -     -     -

Table 3: Full length ROUGE F1 evaluation results. The top and middle rows show the results on our evaluation setting. † is the proposed model. The bottom row shows published scores reported in previous studies. Note that (1) SEASS consists of essentially the same architecture as our implemented EncDec+sGate, and (2) the top and middle rows are not directly comparable to the bottom row due to the difference of the preprocessing and the vocabulary settings (see discussions in Section 5.5).

The BPE merge operations are jointly learned from the source and the target. We set the number of the BPE merge operations at 5,000. We used the same vocabulary set for both the source Vs and the target Vt. After applying the BPE, we found out that 0.1% of the training split contained a longer target than source. We removed such data before training.

5.5 Results

Table 3 summarizes results on all test data. We divide the table into three parts with horizontal lines. The top and middle rows show the results under our training procedure, and the bottom row shows the results reported in previous studies. Note that the top and middle rows are not directly comparable to the bottom row due to the difference of the preprocessing and the vocabulary settings.

The top row of Table 3 shows that EncDec+SPM outperformed both EncDec and EncDec+sGate. This result indicates that the SPM can improve the performance of EncDec. Moreover, it is noteworthy that EncDec+sGate+SPM achieved the best performance in all metrics even though EncDec+sGate consists of essentially the same architecture as the current state-of-the-art model, i.e., SEASS.

The bottom row of Table 3 shows the results of previous methods. They often achieved higher ROUGE scores than our models, especially on Gigaword Test (Rush) and Gigaword Test (Zhou). However, this does not immediately imply that our method is inferior to the previous methods. This observation basically derives from an inconsistency of vocabulary. In detail, our training data does not contain ⟨unk⟩ because we adopted BPE to construct the vocabulary. Thus, our models suffered from ⟨unk⟩ in the dataset when we conducted the evaluation on Gigaword Test (Rush) and Gigaword Test (Zhou). In contrast, our models achieved significantly higher performance on MSR-ATC Test because this dataset does not contain ⟨unk⟩. Recall that, as described earlier, EncDec+sGate has the same model architecture as SEASS; a similar observation can therefore also be found for EncDec+sGate and SEASS.

(a) Repeating Phrases  (b) Lack of Important Phrases

Figure 2: Comparison between EncDec and EncDec+SPM on the number of sentences that potentially contain the odd-gen. Fewer examples mean a reduction of the odd-gen.

6 Discussion

The motivation of the SPM is to prevent the odd-gen with a one-to-one correspondence between the source and the target.

Page 15: Result: ROUGE Score

[Same ROUGE table and discussion as Page 14.]


The proposed model (EncDec+SPM) outperforms the strong baseline (EncDec+sGate).

Page 16: Result: ROUGE Score

[Same ROUGE table and discussion as Page 14.]


Incorporating the SPM further improves the performance of the strong baseline.

Page 17: Does SPM reduce odd-gen?


[Figure 2 panels: Repeating Phrases and Lack of Important Phrases. EncDec+SPM yields fewer sentences that potentially contain odd-gen than EncDec.]

Page 18: Qualitative Analysis of Odd-gen

(1) Repeating Phrases

Gold: duran duran group fashionable again
EncDec: duran duran duran duran
EncDec+SPM: duran duran fashionably cool once again

Gold: community college considers building $ ## million technology
EncDec: college college colleges learn to get ideas for tech center
EncDec+SPM: l.a. community college officials say they 'll get ideas

(2) Lack of Important Phrases

Gold: graf says goodbye to tennis due to injuries
EncDec: graf retires
EncDec+SPM: german tennis legend steffi graf retires

Gold: new york 's primary is most suspenseful of super tuesday races
EncDec: n.y.
EncDec+SPM: new york primary enters most suspenseful of super tuesday contests

(3) Irrelevant Phrases

Gold: u.s. troops take first position in serb-held bosnia
EncDec: precede sarajevo
EncDec+SPM: u.s. troops set up first post in bosnian countryside

Gold: northridge hopes confidence does n't wane
EncDec: csun 's csun
EncDec+SPM: northridge tries to win northridge men 's basketball team

Figure 3: Examples of generated summaries. "Gold" indicates the reference headline. The proposed EncDec+SPM model successfully reduced odd-gen.

Repeating phrases  We count the repetitions in each generated headline and subtract the number of repetitions in the reference headline from the above calculation result. The result of this subtraction is taken to be the number of repeating phrases in each generated headline.

Lack of important phrases  We assume that a generated headline which is shorter than the gold omits one or more important phrases. Thus, we compute the difference between the gold headline length and the generated headline length.

Irrelevant phrases  We consider that improvements in ROUGE scores indicate a reduction in irrelevant phrases, because we believe that ROUGE penalizes irrelevant phrases.
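A minimal sketch of the first two indicators (illustrative only; it counts whitespace tokens, whereas the exact phrase unit used by the authors is not fully specified here):

```python
# Sketch of the two odd-gen indicators described above (illustrative only):
# repeated tokens beyond the reference, and the length gap to the gold headline.
from collections import Counter

def repeating_count(tokens):
    """Number of token occurrences beyond the first, e.g. 4x 'duran' -> 3."""
    return sum(c - 1 for c in Counter(tokens).values() if c > 1)

def odd_gen_indicators(gold, generated):
    gold_toks, gen_toks = gold.split(), generated.split()
    # Repeating phrases: repetitions in the output minus repetitions in the gold.
    repeating = repeating_count(gen_toks) - repeating_count(gold_toks)
    # Lack of important phrases: how much shorter the output is than the gold.
    length_gap = len(gold_toks) - len(gen_toks)
    return repeating, length_gap

print(odd_gen_indicators("duran duran group fashionable again",
                         "duran duran duran duran"))  # (2, 1)
```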

Figure 2 shows the number of repeating phrases and lack of important phrases in Gigaword Test (Ours). This figure indicates that EncDec+SPM reduces the odd-gen in comparison to EncDec. Thus, we consider that the SPM reduced odd-gen. Figure 3 shows sampled headlines actually generated by EncDec and EncDec+SPM. It is clear that the outputs of EncDec contain odd-gen while those of EncDec+SPM do not. These examples also demonstrate that the SPM successfully reduces odd-gen.

6.2 Visualizing SPM and Attention

We visualize the prediction of the SPM and the attention distribution to see the acquired token-wise correspondence between the source and the target.

Specifically, we feed the source-target pair (X, Y) to EncDec and EncDec+SPM, and then collect the source-side prediction (q_{1:I}) of EncDec+SPM and the attention distribution (α_{1:J}) of EncDec.12 For the source-side prediction, we extracted the probability of each token x_i ∈ X from q_j, j ∈ {1, . . . , I}.

Figure 4 shows an example of the heat map.13 We used Gigaword Test (Ours) as the input. The brackets on the y-axis represent the source-side tokens that are aligned with target-side tokens. We selected the aligned tokens in the following manner: for the attention (Figure 4a), we select the token with the largest attention value; for the SPM (Figure 4b), we select the token with the largest probability over the whole vocabulary Vs.

Figure 4a indicates that most of the attention distribution is concentrated at the end of the sentence. As a result, attention provides poor token-wise correspondence between the source and the target. For example, the target-side tokens "tokyo" and "end" are both aligned with the source-side sentence period. In contrast, Figure 4b shows that the SPM captures the correspondence between the source and the target. The source sequence "tokyo stocks closed higher" is successfully aligned with the target "tokyo stocks end higher". Moreover, the SPM aligned unimportant tokens for the headline, such as "straight" and "tuesday", with ⟨pad⟩ tokens.

12 For details of the attention mechanism, see Appendix C.
13 For more visualizations, see Appendix E.

Page 19: Qualitative Analysis of Odd-gen (continued)

Figure 2 (Japanese manuscript): Comparison of headlines generated by the baseline EncDec and the proposed EncDec+SPM; "Gold" denotes the reference headline. (Same examples as Figure 3 on Page 18.)

(a) Attention distribution of EncDec
(b) Source-side prediction of EncDec+SPM

Figure 3 (Japanese manuscript): Visualization of EncDec and EncDec+SPM. The x-axis and y-axis correspond to the input and output sentences, respectively; the token in brackets on the y-axis is the aligned input-side token.

Here W_α ∈ R^{H×H} is a parameter matrix, α_j[i] denotes the i-th element of α_j, and z ∈ R^H is the decoder hidden state. For the input-side prediction, we extracted from q_j the value corresponding to x_i ∈ X for each j. The heat maps, computed on Gigaword Test (Ours), are shown in Figure 3. The token in brackets on the y-axis is the input-side token aligned with the output-side token at that time step: for the attention distribution (Figure 3a) we select the token with the highest attention probability, and for the SPM prediction (Figure 3b) we select the token with the highest output probability over the whole input-side vocabulary Vs.

In the attention distribution (Figure 3a), most of the mass is concentrated at the end of the sentence, so the attention distribution presumably does not represent the token-wise correspondence between input and output accurately; for example, the output-side "tokyo" and "end" are aligned with the input-side period. In contrast, the SPM prediction (Figure 3b) succeeds in capturing the correspondence; for example, the input "tokyo stocks closed higher" is aligned with the output "tokyo stocks end higher". Since each value is almost discrete, the SPM decides where to align with high confidence. The SPM also maps words of low importance for the headline, such as "tuesday" and "straight", to ⟨pad⟩. These facts show that the SPM predicts a better token-wise correspondence than the attention distribution. It is remarkable that the SPM learns this correspondence even though no gold alignment data is given during training.

6 Conclusion

In this paper, we proposed a method for reducing the odd generation problem in lossy-compression generation tasks such as headline generation. The proposed SPM predicts the token-wise correspondence between the input and the output. In experiments, the proposed method outperformed the existing state-of-the-art method on a headline generation benchmark, and we confirmed a reduction of the odd generation problem. We also showed that the SPM can learn the token-wise correspondence between input and output without being given any alignment information.

Acknowledgments  We thank Sosuke Kobayashi of Preferred Networks, Inc. for helpful advice on this research. This work was supported by MEXT KAKENHI Grants 15H01702 and 15H05318.

References

[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR, 2015.
[2] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, Vol. 9, No. 8, pp. 1735–1780, 1997.
[3] Piji Li, Wai Lam, Lidong Bing, and Zihao Wang. Deep Recurrent Generative Decoder for Abstractive Text Summarization. In EMNLP, pp. 2081–2090, 2017.
[4] Thang Luong, Hieu Pham, and Christopher D. Manning. Effective Approaches to Attention-based Neural Machine Translation. In EMNLP, pp. 1412–1421, 2015.
[5] Haitao Mi, Baskaran Sankaran, Zhiguo Wang, and Abe Ittycheriah. Coverage Embedding Models for Neural Machine Translation. In EMNLP, pp. 955–960, 2016.
[6] Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond. In SIGNLL, pp. 280–290, 2016.
[7] Alexander M. Rush, Sumit Chopra, and Jason Weston. A Neural Attention Model for Abstractive Sentence Summarization. In EMNLP, pp. 379–389, 2015.
[8] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural Machine Translation of Rare Words with Subword Units. In ACL, pp. 1715–1725, 2016.
[9] Lifeng Shang, Zhengdong Lu, and Hang Li. Neural Responding Machine for Short-Text Conversation. In ACL & IJCNLP, pp. 1577–1586, July 2015.
[10] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to Sequence Learning with Neural Networks. In NIPS, pp. 3104–3112, 2014.
[11] Jun Suzuki and Masaaki Nagata. Cutting-off Redundant Repeating Generations for Neural Abstractive Summarization. In EACL, pp. 291–297, 2017.
[12] Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, and Hang Li. Neural Machine Translation with Reconstruction. In AAAI, pp. 3097–3103, 2017.
[13] Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. Modeling Coverage for Neural Machine Translation. In ACL, pp. 76–85, 2016.
[14] Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. Selective Encoding for Abstractive Sentence Summarization. In ACL, pp. 1095–1104, 2017.

(1)繰り返し生成

Gold: duran duran group fashionable again Gold: community college considers building $ ## million technologyEncDec: duran duran duran duran EncDec: college college colleges learn to get ideas for tech centerEncDec+SPM: duran duran fashionably cool once again EncDec+SPM: l.a. community college officials say they ’ll get ideas

(2)無関係な単語の生成

Gold: u.s. troops take first position inserb-held bosnia

Gold: northridge hopes confidence does n’t wane

EncDec: precede sarajevo EncDec: csun ’s csunEncDec+SPM: u.s. troops set up first post in bosnian

countrysideEncDec+SPM: northridge tries to win northridge men ’s basketball team

(3)重要な語句の欠損

Gold: graf says goodbye to tennis due to injuries Gold: new york ’s primary is most suspenseful of super tuesdayraces

EncDec: graf retires EncDec: n.y.EncDec+SPM: german tennis legend steffi graf retires EncDec+SPM: new york primary enters most suspenseful of super tuesday

contests

図 2: ベースライン手法 EncDecと提案手法 EncDec+SPMの生成文の比較:“Gold”は正解のヘッドラインを表す.

(a) EncDecのアテンション分布

(b) EncDec+SPMの入力側予測

図 3: EncDec と EncDec+SPM の可視化:x 軸と y 軸はそれぞれ入力文と出力文に対応する.y軸の括弧内のトークンは,アラインされた入力側のトークンを表す.

ここでWα ∈ RH×H はパラメータ行列,αj [i] は αj の i 番目の要素を表す.また,z ∈ RH はデコーダの隠れ層である.入力側の予測に関しては,各 j について qj から xi ∈ X に対応する値を抽出した.図 3にヒートマップを示した.テストデータには Gigaword

Test (Ours) を用いた.ここで,y 軸括弧内のトークンは,その時刻で出力側のトークンにアラインされた入力側のトークンを表す.トークンのアライン先については,アテンション分布(図 3a)に基づく方法では,アテンション確率が最も高かったトークンを選択し,SPMの予測(図 3b)では,全入力側語彙 Vs の中から最も出力確率が高いトークンを選択した.アテンション分布(図 3a)では,値のほとんどが文末に集中している.このことから,アテンション分布は,入出力間のトークン単位の対応関係を正確に表していないと想定できる.例えば,出力側の “tokyo” や “end” は入力側のピリオドにアラインされている.一方で,SPMの予測(図 3b)は対応関係を捉えることに成功している.例えば,入力側の “tokyostocks closed higher” は出力側の “tokyo stocks end higher” にアラインされている.各値がほぼ離散的な値を示すことから,

SPMは高い確度を持ってアラインする先を決めていると分かる.また,SPMは “tuesday”や “straight”など,ヘッドラインで重要度の低い単語を ⟨pad⟩ に対応させている.これらの事実から,SPMはアテンション分布と比較して,より良いトークン単位の対応関係を予測しているとわかる.SPMの訓練時にアライメントの正解データを与えていないにも関わらず,対応関係を学習できているのは,特筆すべき結果である.

6 おわりに本稿では,ヘッドライン生成のような損失あり圧縮生成タスクに対して,誤生成問題を減らす手法を提案した.提案手法である SPMは,入力と出力間のトークン単位の対応を予測するモデルである.実験では,ヘッドライン生成のベンチマークデータ上で,既存の最高性能の手法を上回る性能を達成し,誤生成問題の減少を確認した.また,SPMはアライメント情報を与えられることなく,入出力間のトークン単位の対応関係を学習できることを示した.

謝辞株式会社 Preferred Networks の小林颯介氏には研究の遂行に有益なアドバイスを頂いた.また本研究は,文部科学省科研費 15H01702,15H05318の支援を受けたものである.

参考文献[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Transla-

tion by Jointly Learning to Align and Translate. In ICLR, 2015.[2] Sepp Hochreiter and Jurgen Schmidhuber. Long Short-Term Memory. Neural Com-

putation, Vol. 9, No. 8, pp. 1735–1780, 1997.[3] Piji Li, Wai Lam, Lidong Bing, and Zihao Wang. Deep Recurrent Generative Decoder

for Abstractive Text Summarization. In EMNLP, pp. 2081–2090, 2017.[4] Thang Luong, Hieu Pham, and Christopher D. Manning. Effective Approaches to

Attention-based Neural Machine Translation. In EMNLP, pp. 1412–1421, 2015.[5] Haitao Mi, Baskaran Sankaran, Zhiguo Wang, and Abe Ittycheriah. Coverage Embed-

ding Models for Neural Machine Translation. In EMNLP, pp. 955–960, 2016.[6] Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang.

Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond. InSIGNLL, pp. 280–290, 2016.

[7] Alexander M. Rush, Sumit Chopra, and Jason Weston. A Neural Attention Model forAbstractive Sentence Summarization. In EMNLP, pp. 379–389, 2015.

[8] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural Machine Translation ofRare Words with Subword Units. In ACL, pp. 1715–1725, 2016.

[9] Lifeng Shang, Zhengdong Lu, and Hang Li. Neural Responding Machine for Short-Text Conversation. In ACL & IJCNLP, pp. 1577–1586, July 2015.

[10] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to Sequence Learning withNeural Networks. In NIPS, pp. 3104–3112, 2014.

[11] Jun Suzuki and Masaaki Nagata. Cutting-off Redundant Repeating Generations forNeural Abstractive Summarization. In EACL, pp. 291–297, 2017.

[12] Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, and Hang Li. Neural MachineTranslation with Reconstruction. In AAAI, pp. 3097–3103, 2017.

[13] Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. Modeling Cov-erage for Neural Machine Translation. In ACL, pp. 76–85, 2016.

[14] Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. Selective Encoding for Abstrac-tive Sentence Summarization. In ACL, pp. 1095–1104, 2017.

Qualitative Analysis of Odd-gen

1/6/19 19

(1)繰り返し生成

Gold: duran duran group fashionable again Gold: community college considers building $ ## million technologyEncDec: duran duran duran duran EncDec: college college colleges learn to get ideas for tech centerEncDec+SPM: duran duran fashionably cool once again EncDec+SPM: l.a. community college officials say they ’ll get ideas

(2)無関係な単語の生成

Gold: u.s. troops take first position inserb-held bosnia

Gold: northridge hopes confidence does n’t wane

EncDec: precede sarajevo EncDec: csun ’s csunEncDec+SPM: u.s. troops set up first post in bosnian

countrysideEncDec+SPM: northridge tries to win northridge men ’s basketball team

(3)重要な語句の欠損

Gold: graf says goodbye to tennis due to injuries Gold: new york ’s primary is most suspenseful of super tuesdayraces

EncDec: graf retires EncDec: n.y.EncDec+SPM: german tennis legend steffi graf retires EncDec+SPM: new york primary enters most suspenseful of super tuesday

contests

図 2: ベースライン手法 EncDecと提案手法 EncDec+SPMの生成文の比較:“Gold”は正解のヘッドラインを表す.

(a) EncDecのアテンション分布

(b) EncDec+SPMの入力側予測

図 3: EncDec と EncDec+SPM の可視化:x 軸と y 軸はそれぞれ入力文と出力文に対応する.y軸の括弧内のトークンは,アラインされた入力側のトークンを表す.

ここでWα ∈ RH×H はパラメータ行列,αj [i] は αj の i 番目の要素を表す.また,z ∈ RH はデコーダの隠れ層である.入力側の予測に関しては,各 j について qj から xi ∈ X に対応する値を抽出した.図 3にヒートマップを示した.テストデータには Gigaword

Test (Ours) を用いた.ここで,y 軸括弧内のトークンは,その時刻で出力側のトークンにアラインされた入力側のトークンを表す.トークンのアライン先については,アテンション分布(図 3a)に基づく方法では,アテンション確率が最も高かったトークンを選択し,SPMの予測(図 3b)では,全入力側語彙 Vs の中から最も出力確率が高いトークンを選択した.アテンション分布(図 3a)では,値のほとんどが文末に集中している.このことから,アテンション分布は,入出力間のトークン単位の対応関係を正確に表していないと想定できる.例えば,出力側の “tokyo” や “end” は入力側のピリオドにアラインされている.一方で,SPMの予測(図 3b)は対応関係を捉えることに成功している.例えば,入力側の “tokyostocks closed higher” は出力側の “tokyo stocks end higher” にアラインされている.各値がほぼ離散的な値を示すことから,

SPMは高い確度を持ってアラインする先を決めていると分かる.また,SPMは “tuesday”や “straight”など,ヘッドラインで重要度の低い単語を ⟨pad⟩ に対応させている.これらの事実から,SPMはアテンション分布と比較して,より良いトークン単位の対応関係を予測しているとわかる.SPMの訓練時にアライメントの正解データを与えていないにも関わらず,対応関係を学習できているのは,特筆すべき結果である.

6 おわりに本稿では,ヘッドライン生成のような損失あり圧縮生成タスクに対して,誤生成問題を減らす手法を提案した.提案手法である SPMは,入力と出力間のトークン単位の対応を予測するモデルである.実験では,ヘッドライン生成のベンチマークデータ上で,既存の最高性能の手法を上回る性能を達成し,誤生成問題の減少を確認した.また,SPMはアライメント情報を与えられることなく,入出力間のトークン単位の対応関係を学習できることを示した.

謝辞株式会社 Preferred Networks の小林颯介氏には研究の遂行に有益なアドバイスを頂いた.また本研究は,文部科学省科研費 15H01702,15H05318の支援を受けたものである.

References

[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR, 2015.
[2] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, Vol. 9, No. 8, pp. 1735–1780, 1997.
[3] Piji Li, Wai Lam, Lidong Bing, and Zihao Wang. Deep Recurrent Generative Decoder for Abstractive Text Summarization. In EMNLP, pp. 2081–2090, 2017.
[4] Thang Luong, Hieu Pham, and Christopher D. Manning. Effective Approaches to Attention-based Neural Machine Translation. In EMNLP, pp. 1412–1421, 2015.
[5] Haitao Mi, Baskaran Sankaran, Zhiguo Wang, and Abe Ittycheriah. Coverage Embedding Models for Neural Machine Translation. In EMNLP, pp. 955–960, 2016.
[6] Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond. In SIGNLL, pp. 280–290, 2016.
[7] Alexander M. Rush, Sumit Chopra, and Jason Weston. A Neural Attention Model for Abstractive Sentence Summarization. In EMNLP, pp. 379–389, 2015.
[8] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural Machine Translation of Rare Words with Subword Units. In ACL, pp. 1715–1725, 2016.
[9] Lifeng Shang, Zhengdong Lu, and Hang Li. Neural Responding Machine for Short-Text Conversation. In ACL & IJCNLP, pp. 1577–1586, 2015.
[10] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to Sequence Learning with Neural Networks. In NIPS, pp. 3104–3112, 2014.
[11] Jun Suzuki and Masaaki Nagata. Cutting-off Redundant Repeating Generations for Neural Abstractive Summarization. In EACL, pp. 291–297, 2017.
[12] Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, and Hang Li. Neural Machine Translation with Reconstruction. In AAAI, pp. 3097–3103, 2017.
[13] Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. Modeling Coverage for Neural Machine Translation. In ACL, pp. 76–85, 2016.
[14] Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. Selective Encoding for Abstractive Sentence Summarization. In ACL, pp. 1095–1104, 2017.


Visualizing Attention and SPM Prediction


(a) Attention distribution of EncDec

(b) Source-side prediction of EncDec+SPM

Figure 4: Visualization of EncDec and EncDec+SPM. The x-axis and y-axis of the figure correspond to the source sequence and the target sequence, respectively. The token in brackets represents the source-side token that is aligned with the target-side token of that time step.

For the attention (Figure 4a), we select the token with the largest attention value. For the SPM (Figure 4b), we select the token with the largest probability over the whole vocabulary Vs.

Figure 4a indicates that most of the attention distribution is concentrated at the end of the sentence. As a result, attention provides poor token-wise correspondence between the source and the target. For example, the target-side tokens "tokyo" and "end" are both aligned with the source-side sentence period. In contrast, Figure 4b shows that the SPM provides an almost discrete correspondence between the source and the target. The source sequence "tokyo stocks closed higher" is successfully aligned with the target "tokyo stocks end higher". Moreover, the SPM aligned tokens that are unimportant for the headline, such as "straight" and "tuesday", with the ⟨pad⟩ token. Thus, this example suggests that the SPM achieved token-wise correspondence superior to that of the attention. It is noteworthy that the SPM captured a one-to-one correspondence even though we trained the SPM without correct alignment information.

7 Related Work

In the field of neural machine translation, several methods have been proposed to solve the odd-gen. The coverage model (Mi et al., 2016; Tu et al., 2016) forces the decoder to attend to every part of the source sequence so that all semantic information in the source is translated. The reconstructor (Tu et al., 2017) trains an additional model that translates the decoded target back into the source. Moreover, Weng et al. (2017) proposed a method that predicts the untranslated words from the decoder at each time step. These methods are designed to convert all contents of the source into the target language, since machine translation is a lossless-gen task. In contrast, we proposed the SPM to model both paraphrasing and discarding in order to reduce the odd-gen in lossy-gen tasks.

We focused on headline generation, which is a well-known lossy-gen task. Recent studies have actively applied EncDec to this task (Rush et al., 2015; Chopra et al., 2016; Nallapati et al., 2016). In the headline generation task, Zhou et al. (2017) and Suzuki and Nagata (2017) tackled parts of the odd-gen. Zhou et al. (2017) incorporated an additional gate (sGate) into the encoder to select appropriate words from the source. Suzuki and Nagata (2017) proposed a frequency estimation module to reduce repeating phrases. Our motivation is similar to theirs, but we addressed all types of odd-gen. In addition, these approaches can be combined with the proposed method; in fact, we reported in Section 5.5 that the SPM can improve the performance of sGate with EncDec.

Apart from the odd-gen, some studies have proposed methods to improve the performance of the headline generation task. Takase et al. (2016) incorporated AMR (Banarescu et al., 2013) into the encoder to use the syntactic and semantic information of the source. Nallapati et al. (2016) also encoded additional information about the source, such as TF-IDF, part-of-speech tags, and named entities. Li et al. (2017) modeled the typical structure of a headline, such as "Who Action What", with a variational auto-encoder. These approaches improved the performance of headline generation, but it is unclear whether they can reduce the odd-gen.

8 Conclusion

In this paper, we discussed an approach for reducing the odd-gen in lossy-gen tasks. The proposed SPM learns to predict the one-to-one correspondence between tokens in the source and the target. Experiments on the headline generation task show that the SPM improved the performance of a typical EncDec and outperformed the current state-of-the-art model. Furthermore, we demonstrated that the SPM reduced the odd-gen. In addition, the SPM obtained token-wise correspondence between the source and the target without any alignment data.

