TRANSCRIPT
Reducing Odd Generation from Neural Headline Generation
Shun Kiyono 1* Sho Takase 2 Jun Suzuki 1,2,4 Naoaki Okazaki 3 Kentaro Inui 1,4 Masaaki Nagata 2
1 Tohoku University  2 NTT Communication Science Laboratories  3 Tokyo Institute of Technology  4 RIKEN Center for Advanced Intelligence Project
{kiyono,jun.suzuki,inui}@ecei.tohoku.ac.jp, {takase.sho,nagata.masaaki}@lab.ntt.co.jp
Abstract
The Encoder-Decoder model is widely used in natural language generation tasks. However, the model sometimes suffers from repeated redundant generation, misses important phrases, and includes irrelevant entities. Toward solving these problems, we propose a novel source-side token prediction module. Our method jointly estimates the probability distributions over source and target vocabularies to capture the correspondence between source and target tokens. Experiments show that the proposed model outperforms the current state-of-the-art method in the headline generation task. We also show that our method can learn a reasonable token-wise correspondence without knowing any true alignment.1
1 Introduction
The Encoder-Decoder model with the attention mechanism (EncDec) (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015; Luong et al., 2015) has been an epoch-making development that has led to great progress being made in many natural language generation tasks, such as machine translation (Bahdanau et al., 2015), dialog generation (Shang et al., 2015), and headline generation (Rush et al., 2015). Today, EncDec and its variants are widely used as the predominant baseline method in these tasks.
* This work is a product of the collaborative research program of Tohoku University and NTT Communication Science Laboratories.
1 Our code for reproducing the experiments is available at https://github.com/butsugiri/UAM
Unfortunately, as often discussed in the community, EncDec sometimes generates sentences with repeating phrases or completely irrelevant phrases, and the reason for their generation cannot be interpreted intuitively. Moreover, EncDec also sometimes generates sentences that lack important phrases. We refer to these observations as the odd generation problem (odd-gen) in EncDec. The following table shows typical examples of odd-gen actually generated by a typical EncDec.
(1) Repeating Phrases
Gold: duran duran group fashionable again
EncDec: duran duran duran duran
(2) Lack of Important Phrases
Gold: graf says goodbye to tennis due to injuries
EncDec: graf retires
(3) Irrelevant Phrases
Gold: u.s. troops take first position in serb-held bosnia
EncDec: precede sarajevo
This paper tackles reducing the odd-gen in the task of abstractive summarization. In the machine translation literature, coverage (Tu et al., 2016; Mi et al., 2016) and reconstruction (Tu et al., 2017) are promising extensions of EncDec that address the odd-gen. These models take advantage of the fact that machine translation is a lossless generation (lossless-gen) task, where the semantic information of the source- and target-side sequences is equivalent. However, as discussed in previous studies, abstractive summarization is a lossy-compression generation (lossy-gen) task. Here, the task is to delete certain semantic information from the source to generate the target-side sequence.
Background: Encoder-Decoder
Input (1st sentence of news article): Taiwan on Tuesday bowed to calls to make the island more international by easing its immigration rules .
Output (headline): Taiwan eases immigration rules
[Figure: the EncDec architecture. An RNN encoder reads the source tokens (Taiwan, on, Tuesday, ...); the attentional RNN decoder generates the headline tokens (Taiwan, eases, ..., <eos>) followed by <pad> tokens, and the target-side loss ℓ_tgt is the sum of the per-step cross-entropy losses.]
Background: Odd-gen of EncDec
We tackle reducing odd-gen in headline generation
Related Work
This is the first work tackling the entire odd-gen problem:
1. Repeating Phrases
2. Irrelevant Phrases
3. Lack of Important Phrases
How?: consider token-wise correspondence
1. Repeating Phrases may be reduced by managing the token-wise correspondence between source & target.
2. Irrelevant Phrases may be reduced by adding a (soft) constraint that target tokens must correspond to source tokens.
3. Lack of Important Phrases may be reduced by selecting important concepts from the source.
Source: Taiwan on Tuesday bowed to calls to make the … by easing its immigration rules .
Target: Taiwan eases immigration rules <pad> <pad> … <pad> <pad> <pad> <pad> <pad>
Hypothesis: odd-gen can be reduced by considering token-wise correspondence
Predict source-side tokens that correspond to target-side tokens
Simultaneously predict both distributions: a target-side distribution over the target vocabulary and a source-side distribution over the source vocabulary.
[Figure: at the decoding step that generates the target token "eases", the source-side prediction is expected to point at the source token "easing", while unimportant source tokens such as "Tuesday" are mapped to <pad>. Source: Taiwan on Tuesday bowed to calls to make the … by easing its immigration rules . Target: Taiwan eases immigration rules <pad> <pad> … <pad>]
EncDec + Predict source-side tokens
[Figure: EncDec extended with source-side prediction. At each decoding step the decoder emits, in addition to the target-side distribution used for the cross-entropy loss ℓ_tgt, a source-side distribution q_j over the source vocabulary.]
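A minimal PyTorch-style sketch of the idea on this slide: the decoder state at each step is projected into two distributions, one over the target vocabulary (the usual generation head) and one over the source vocabulary (the source-side prediction). Names and sizes here are illustrative assumptions, not the authors' implementation (see https://github.com/butsugiri/UAM for the official code).

```python
# Sketch: a decoder output head that predicts a target-side and a source-side
# distribution from the same decoder state (illustrative, not the official code).
import torch
import torch.nn as nn


class JointPredictionHead(nn.Module):
    def __init__(self, hidden_size: int, target_vocab_size: int, source_vocab_size: int):
        super().__init__()
        self.target_proj = nn.Linear(hidden_size, target_vocab_size)  # -> p_j over V_t
        self.source_proj = nn.Linear(hidden_size, source_vocab_size)  # -> q_j over V_s

    def forward(self, decoder_state: torch.Tensor):
        # decoder_state: (batch, hidden_size), the attentional decoder output at step j.
        p_j = torch.softmax(self.target_proj(decoder_state), dim=-1)
        q_j = torch.softmax(self.source_proj(decoder_state), dim=-1)
        return p_j, q_j


# Example with the sizes from Table 1 (H=400, |V_s|=|V_t|=5131):
head = JointPredictionHead(400, 5131, 5131)
p, q = head(torch.randn(2, 400))
```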
How to optimize?: Sentence-wise loss
[Figure: sentence-wise optimization of the source-side predictions. The per-step source-side distributions are summed (SUM) over all decoding steps and compared against a vector derived from the source tokens with a mean squared error, giving the source-side loss ℓ_src; it is combined with the target-side cross-entropy loss ℓ_tgt (the weight C of ℓ_src is given in Table 1).]
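The following sketch spells out one plausible reading of this loss, assuming (as the figure suggests) that the decoder runs for as many steps as there are source tokens, with the target right-padded by <pad>, and that ℓ_src is a mean squared error between the summed source-side predictions and the source bag-of-words counts. The exact normalization and combination are assumptions; Table 1 only tells us that ℓ_src is weighted by C = 10.

```python
# Sketch of the combined loss l_tgt + C * l_src (one possible reading of the slide;
# normalization details are assumptions, not the authors' exact formulation).
import torch
import torch.nn.functional as F


def spm_loss(target_logits, target_ids, q, source_ids, source_vocab_size, C=10.0):
    """target_logits: (steps, V_t); target_ids: (steps,) incl. <pad> steps;
    q: (steps, V_s) per-step source-side distributions; source_ids: (src_len,) long."""
    # Target-side loss: per-step cross entropy over the padded target, summed.
    l_tgt = F.cross_entropy(target_logits, target_ids, reduction="sum")

    # Source-side loss: sum the per-step source distributions over time (SUM box) ...
    summed_q = q.sum(dim=0)                                          # (V_s,)
    # ... and compare with the bag-of-words count vector of the source (MSE box).
    bow = torch.zeros(source_vocab_size).scatter_add_(
        0, source_ids, torch.ones_like(source_ids, dtype=torch.float))
    l_src = F.mse_loss(summed_q, bow)

    return l_tgt + C * l_src
```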
Proposed Model: SPM
[Figure: the proposed model. The module that produces the source-side distributions and the sentence-wise loss ℓ_src is called the Source Prediction Module (SPM); it sits on top of the decoder alongside the ordinary target-side prediction and its loss ℓ_tgt.]
Expect that corresponding tokens get predicted ⇛ modeling token-wise correspondence
Dataset: Annotated Gigaword
Annotated Gigaword (collection of news articles)
Input (1st sentence of news article): Taiwan on Tuesday bowed to calls to make the island more international by easing its immigration rules .
Output (headline): Taiwan eases immigration rules
Article-Headline Parallel Corpus [Rush+2015] (approximately 4M pairs)
Experimental Configurations
Use values from previous studies
the previous studies. We randomly sampled test data and validation data from the validation split since we found that the test split contains many noisy instances. Finally, our validation and test data consist of 8K and 10K source-headline pairs, respectively. Note that they are relatively large compared with the previously used datasets, and they do not contain ⟨unk⟩.
We also evaluated our experiments on the test data used in the previous studies. To the best of our knowledge, two test sets from the Gigaword are publicly available, by Rush et al. (2015)6 and Zhou et al. (2017)7. Here, both test sets contain ⟨unk⟩.8

Evaluation Metric: We evaluated the performance by ROUGE-1 (RG-1), ROUGE-2 (RG-2) and ROUGE-L (RG-L).9 We report the F1 value as given in a previous study.10 We computed the scores with the official ROUGE script (version 1.5.5).

Comparative Methods: To investigate the effectiveness of the SPM, we evaluate the performance of the EncDec with the SPM. In addition, we investigate whether the SPM improves the performance of the state-of-the-art method, EncDec+sGate. Thus, we compare the following methods under the same training setting.

EncDec: This is the implementation of the base model explained in Section 3.

EncDec+sGate: To reproduce the state-of-the-art method proposed by Zhou et al. (2017), we combined our re-implemented selective gate (sGate) with the encoder of EncDec.

EncDec+SPM: We combined the SPM with the EncDec as explained in Section 4.

EncDec+sGate+SPM: This is the combination of the SPM with the EncDec+sGate.

Implementation Details: Table 1 summarizes hyper-parameters and model configurations. We selected the settings commonly used in the previous studies, e.g., (Rush et al., 2015; Nallapati et al., 2016; Suzuki and Nagata, 2017).
6 https://github.com/harvardnlp/sent-summary
7 https://res.qyzhou.me
8 We summarize the details of the dataset in Appendix D.
9 We restored sub-words to the standard token split for the evaluation.
10 The ROUGE script option is: "-n2 -m -w 1.2".
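The reported numbers come from the official ROUGE-1.5.5 perl script with the options in footnote 10. As a rough, unofficial stand-in for quick checks, the community rouge-score package can compute RG-1/RG-2/RG-L F1; this sketch uses that package (an assumption outside the original text), and its scores can differ slightly from the official script.

```python
# Unofficial sanity check of RG-1/RG-2/RG-L F1 with the `rouge-score` package
# (pip install rouge-score); NOT the official ROUGE-1.5.5 script used in the paper.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

gold = "taiwan eases immigration rules"
system = "taiwan relaxes its immigration rules"

scores = scorer.score(gold, system)  # reference first, hypothesis second
for name, s in scores.items():
    print(f"{name}: F1 = {s.fmeasure:.4f}")
```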
Source Vocab. Size Vs: 5,131
Target Vocab. Size Vt: 5,131
Word Embed. Size D: 200
Hidden State Size H: 400
RNN Cell: Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997)
Encoder RNN Unit: 2-layer bidirectional LSTM
Decoder RNN Unit: 2-layer LSTM with attention (Luong et al., 2015)
Optimizer: Adam (Kingma and Ba, 2015)
Initial Learning Rate: 0.001
Learning Rate Decay: 0.5 for each epoch (after epoch 9)
Weight C of ℓ_src: 10
Mini-batch Size: 256 (shuffled at each epoch)
Gradient Clipping: 5
Stopping Criterion: Max 15 epochs with early stopping
Regularization: Dropout (rate 0.3)
Beam Search: Beam size 20 with length normalization
Table 1: Configurations used in our experiments
We constructed the vocabulary set using Byte-Pair-Encoding11 (BPE) (Sennrich et al., 2016) to handle low-frequency words, as it is now a common practice in neural machine translation. The BPE merge operations are jointly learned from the source and the target. We set the number of BPE merge operations at 5,000. We used the same vocabulary set for both the source Vs and the target Vt.
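For illustration, here is a toy sketch of how BPE merge operations are learned (repeatedly merging the most frequent adjacent symbol pair), following the algorithm of Sennrich et al. (2016). The paper itself uses the subword-nmt implementation (footnote 11) with 5,000 merges learned jointly on source and target text; this sketch is only a simplified stand-in.

```python
# Toy BPE merge learning (Sennrich et al., 2016 style); simplified stand-in for the
# subword-nmt tool actually used in the paper.
from collections import Counter


def learn_bpe_merges(word_freqs, num_merges):
    # Each word starts as a tuple of characters; merges join adjacent symbols.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_freqs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_freqs[pair] += freq
        if not pair_freqs:
            break
        best = max(pair_freqs, key=pair_freqs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges


print(learn_bpe_merges({"lower": 5, "lowest": 2, "newer": 6}, num_merges=3))
```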
11 https://github.com/rsennrich/subword-nmt
Result: ROUGE Score
Results on Gigaword Test (Ours) | Gigaword Test (Rush) | Gigaword Test (Zhou) | MSR-ATC Test, each reported as RG-1 / RG-2 / RG-L:

EncDec: 45.74 / 23.80 / 42.95 | 34.52 / 16.77 / 32.19 | 45.56 / 24.16 / 42.82 | 27.01 / 13.65 / 24.90
EncDec+sGate (our impl. of SEASS): 45.98 / 24.17 / 43.16 | 35.00 / 17.24 / 32.72 | 46.17 / 24.54 / 43.27 | 27.71 / 14.16 / 25.44
EncDec+SPM†: 46.18 / 24.34 / 43.35 | 34.87 / 17.44 / 32.76 | 46.21 / 24.78 / 43.27 | 29.07 / 14.84 / 26.65
EncDec+sGate+SPM†: 46.41 / 24.58 / 43.59 | 35.79 / 17.84 / 33.34 | 46.34 / 24.85 / 43.49 | 29.51 / 15.53 / 27.17
EncDec+sGate+SPM (5 Ensemble)†: 47.16 / 25.34 / 44.34 | 36.23 / 18.11 / 33.79 | 47.39 / 25.89 / 44.41 | 31.36 / 17.50 / 29.17
ABS (Rush et al., 2015): - | 29.55 / 11.32 / 26.42 | 37.41 / 15.87 / 34.70 | 20.27 / 5.26 / 17.10
SEASS (Zhou et al., 2017): - | 36.15 / 17.54 / 33.63 | 46.86 / 24.58 / 43.53 | 25.75 / 10.63 / 22.90
DRGD (Li et al., 2017): - | 36.27 / 17.57 / 33.62 | - | -
WFE (Suzuki and Nagata, 2017): - | 36.30 / 17.31 / 33.88 | - | -
conv-s2s (Gehring et al., 2017): - | 35.88 / 17.48 / 33.29 | - | -
Table 3: Full-length ROUGE F1 evaluation results. The top and middle rows show the results in our evaluation setting. † marks the proposed model. The bottom row shows published scores reported in previous studies.13 Note that (1) SEASS consists of essentially the same architecture as our implemented EncDec+sGate, and (2) the top and middle rows are not directly comparable to the bottom row due to the difference of the preprocessing and the vocabulary settings (see discussions in Section 5.5).
The BPE merge operations are jointly learned from the source and the target. We set the number of the BPE merge operations at 5,000. We used the same vocabulary set for both the source Vs and the target Vt. After applying the BPE, we found out that 0.1% of the training split contained a longer target than source. We removed such data before training.
5.5 Results
Table 3 summarizes results on all test data. We divide the table into three parts with horizontal lines. The top and middle rows show the results of our training procedure, and the bottom row shows the results reported in previous studies. Note that the top and middle rows are not directly comparable to the bottom row due to the difference of the preprocessing and the vocabulary settings.

The top row of Table 3 shows that EncDec+SPM outperformed both EncDec and EncDec+sGate. This result indicates that the SPM can improve the performance of EncDec. Moreover, it is noteworthy that EncDec+sGate+SPM achieved the best performance in all metrics even though EncDec+sGate consists of essentially the same architecture as the current state-of-the-art model, i.e., SEASS.
The bottom row of Table 3 shows the results of previous methods. They often achieved higher ROUGE scores than our models, especially on Gigaword Test (Rush) and Gigaword Test (Zhou). However, this does not immediately imply that our method is inferior to the previous methods. This observation basically derives from the inconsistency of vocabulary. In detail, our training data does not contain ⟨unk⟩ because we adopted the BPE to construct the vocabulary. Thus, our models suffered from ⟨unk⟩ in the dataset when we conducted the evaluation on Gigaword Test (Rush) and Gigaword Test (Zhou). In contrast, our models achieved significantly higher performance on the MSR-ATC Test because this dataset does not contain ⟨unk⟩. Recall that, as described earlier, EncDec+sGate has the same model architecture as SEASS. A similar observation can also be found between EncDec+sGate and SEASS.

(a) Repeating Phrases
(b) Lack of Important Phrases
Figure 2: Comparison between EncDec and EncDec+SPM on the number of sentences that potentially contain the odd-gen. Fewer examples mean reduction of the odd-gen.
6 Discussion
The motivation of the SPM is to prevent the odd-gen with one-to-one correspondence between the source and the target.
Result: ROUGE Score
Proposed model (EncDec+SPM) outperforms the strong baseline (EncDec+sGate)
Result: ROUGE Score
Incorporating SPM further improves the performance of the strong baseline
Does SPM reduce odd-gen?
[Figure 2: the number of sentences that potentially contain odd-gen, comparing EncDec and EncDec+SPM: (a) Repeating Phrases, (b) Lack of Important Phrases. EncDec+SPM produces fewer such sentences.]
Qualitative Analysis of Odd-gen
(1) Repeating Phrases
Gold: duran duran group fashionable again
EncDec: duran duran duran duran
EncDec+SPM: duran duran fashionably cool once again

Gold: community college considers building $ ## million technology
EncDec: college college colleges learn to get ideas for tech center
EncDec+SPM: l.a. community college officials say they 'll get ideas

(2) Lack of Important Phrases
Gold: graf says goodbye to tennis due to injuries
EncDec: graf retires
EncDec+SPM: german tennis legend steffi graf retires

Gold: new york 's primary is most suspenseful of super tuesday races
EncDec: n.y.
EncDec+SPM: new york primary enters most suspenseful of super tuesday contests

(3) Irrelevant Phrases
Gold: u.s. troops take first position in serb-held bosnia
EncDec: precede sarajevo
EncDec+SPM: u.s. troops set up first post in bosnian countryside

Gold: northridge hopes confidence does n't wane
EncDec: csun 's csun
EncDec+SPM: northridge tries to win northridge men 's basketball team
Figure 3: Examples of generated summaries. "Gold" indicates the reference headline. The proposed EncDec+SPM model successfully reduced odd-gen.
in the reference headline from the above calculation result. The result of this subtraction is taken to be the number of repeating phrases in each generated headline.

Lack of important phrases: We assume a generated headline which is shorter than the gold omits one or more important phrases. Thus, we compute the difference between the gold headline length and the generated headline length.

Irrelevant phrases: We consider that improvements in ROUGE scores indicate a reduction in irrelevant phrases because we believe that ROUGE penalizes irrelevant phrases.
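A short sketch of the two proxy measures described above, under an assumed reading (the first half of the repeating-phrase definition is cut off in this transcript): tokens generated more often than they appear in the gold headline are counted as repeats, and an output shorter than the gold counts the length gap as potentially omitted content.

```python
# Sketch of the proxy counts for odd-gen (assumed reading of the procedure above).
from collections import Counter


def repeating_token_count(system: str, gold: str) -> int:
    sys_counts, gold_counts = Counter(system.split()), Counter(gold.split())
    return sum(max(count - gold_counts[tok], 0) for tok, count in sys_counts.items())


def length_shortfall(system: str, gold: str) -> int:
    return max(len(gold.split()) - len(system.split()), 0)


print(repeating_token_count("duran duran duran duran",
                            "duran duran group fashionable again"))     # -> 2
print(length_shortfall("graf retires",
                       "graf says goodbye to tennis due to injuries"))  # -> 6
```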
Figure 2 shows the number of repeating phrases and lack of important phrases in Gigaword Test (Ours). This figure indicates that EncDec+SPM reduces the odd-gen in comparison to EncDec. Thus, we consider the SPM reduced odd-gen. Figure 3 shows sampled headlines actually generated by EncDec and EncDec+SPM. It is clear that the outputs of EncDec contain odd-gen while those of the EncDec+SPM do not. These examples also demonstrate that SPM successfully reduces odd-gen.
6.2 Visualizing SPM and Attention
We visualize the prediction of the SPM and the attention distribution to see the acquired token-wise correspondence between the source and the target.
Specifically, we feed the source-target pair (X, Y) to EncDec and EncDec+SPM, and then collect the source-side prediction (q_{1:I}) of EncDec+SPM and the attention distribution (α_{1:J}) of EncDec.12 For the source-side prediction, we extracted the probability of each token x_i ∈ X from q_j, j ∈ {1, . . . , I}.
Figure 4 shows an example of the heat map.13 We used Gigaword Test (Ours) as the input. The brackets on the y-axis represent the source-side tokens that are aligned with target-side tokens. We selected the aligned tokens in the following manner: For the attention (Figure 4a), we select the token with the largest attention value. For the SPM (Figure 4b), we select the token with the largest probability over the whole vocabulary Vs.

Figure 4a indicates that most of the attention distribution is concentrated at the end of the sentence. As a result, attention provides poor token-wise correspondence between the source and the target. For example, target-side tokens "tokyo" and "end" are both aligned with the source-side sentence period. In contrast, Figure 4b shows that the SPM captures the correspondence between the source and the target. The source sequence "tokyo stocks closed higher" is successfully aligned with the target "tokyo stocks end higher". Moreover, the SPM aligned unimportant tokens for the headline, such as "straight" and "tuesday", with ⟨pad⟩ tokens.
12 For details of the attention mechanism, see Appendix C.
13 For more visualizations, see Appendix E.
Acknowledgments: Sosuke Kobayashi of Preferred Networks, Inc. gave us helpful advice for carrying out this research. This work was also supported by MEXT KAKENHI Grant Numbers 15H01702 and 15H05318.
References
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR, 2015.
[2] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, Vol. 9, No. 8, pp. 1735–1780, 1997.
[3] Piji Li, Wai Lam, Lidong Bing, and Zihao Wang. Deep Recurrent Generative Decoder for Abstractive Text Summarization. In EMNLP, pp. 2081–2090, 2017.
[4] Thang Luong, Hieu Pham, and Christopher D. Manning. Effective Approaches to Attention-based Neural Machine Translation. In EMNLP, pp. 1412–1421, 2015.
[5] Haitao Mi, Baskaran Sankaran, Zhiguo Wang, and Abe Ittycheriah. Coverage Embedding Models for Neural Machine Translation. In EMNLP, pp. 955–960, 2016.
[6] Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond. In SIGNLL, pp. 280–290, 2016.
[7] Alexander M. Rush, Sumit Chopra, and Jason Weston. A Neural Attention Model for Abstractive Sentence Summarization. In EMNLP, pp. 379–389, 2015.
[8] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural Machine Translation of Rare Words with Subword Units. In ACL, pp. 1715–1725, 2016.
[9] Lifeng Shang, Zhengdong Lu, and Hang Li. Neural Responding Machine for Short-Text Conversation. In ACL & IJCNLP, pp. 1577–1586, July 2015.
[10] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to Sequence Learning with Neural Networks. In NIPS, pp. 3104–3112, 2014.
[11] Jun Suzuki and Masaaki Nagata. Cutting-off Redundant Repeating Generations for Neural Abstractive Summarization. In EACL, pp. 291–297, 2017.
[12] Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, and Hang Li. Neural Machine Translation with Reconstruction. In AAAI, pp. 3097–3103, 2017.
[13] Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. Modeling Coverage for Neural Machine Translation. In ACL, pp. 76–85, 2016.
[14] Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. Selective Encoding for Abstractive Sentence Summarization. In ACL, pp. 1095–1104, 2017.
Visualizing Attention and SPM Prediction
(a) Attention distribution of EncDec
(b) Source-side prediction of EncDec+SPM
Figure 4: Visualization of EncDec and EncDec+SPM. The x-axis and y-axis of the figure correspond to the source sequence and the target sequence, respectively. Tokens in the brackets represent the source-side token that is aligned with the target-side token of that time step.
For the attention (Figure 4a), we select the token with the largest attention value. For the SPM (Figure 4b), we select the token with the largest probability over the whole vocabulary Vs.

Figure 4a indicates that most of the attention distribution is concentrated at the end of the sentence. As a result, attention provides poor token-wise correspondence between the source and the target. For example, target-side tokens "tokyo" and "end" are both aligned with the source-side sentence period. In contrast, Figure 4b shows that the SPM provides an almost discrete correspondence between the source and the target. The source sequence "tokyo stocks closed higher" is successfully aligned with the target "tokyo stocks end higher". Moreover, the SPM aligned unimportant tokens for the headline, such as "straight" and "tuesday", with ⟨pad⟩ tokens. Thus, this example suggests that the SPM achieved superior token-wise correspondence to the attention. It is noteworthy that the SPM captured a one-to-one correspondence even though we trained the SPM without correct alignment information.
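A small sketch of how the aligned source token for each target step can be read off the two models, following the selection rule described above (attention: argmax over source positions; SPM: argmax over the source vocabulary Vs). Variable names and the toy numbers are illustrative, not the authors' visualization code.

```python
# Sketch: extract a hard alignment per target step from attention vs. from the SPM.
import numpy as np


def align_from_attention(attention, source_tokens):
    """attention: (target_len, source_len) weights; pick the source POSITION per step."""
    return [source_tokens[j] for j in attention.argmax(axis=1)]


def align_from_spm(q, source_vocab):
    """q: (target_len, |V_s|) source-side distributions; pick the source TYPE per step."""
    return [source_vocab[k] for k in q.argmax(axis=1)]


# Toy example: attention mass piling up on the sentence-final period, as in Figure 4a.
src = ["tokyo", "stocks", "close", "higher", "."]
attn = np.array([[0.1, 0.1, 0.1, 0.1, 0.6],
                 [0.1, 0.2, 0.1, 0.1, 0.5]])
print(align_from_attention(attn, src))  # ['.', '.']
```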
7 Related Work
In the field of neural machine translation, several methods have been proposed to solve the odd-gen. The coverage model (Mi et al., 2016; Tu et al., 2016) enforces the decoder to attend to every part of the source sequence to translate all semantic information in the source. The reconstructor (Tu et al., 2017) trains the translation model from the decoded target into the source. Moreover, Weng et al. (2017) proposed a method to predict the untranslated words from the decoder at each time step. These methods are designed to convert all contents in the source into a target language, since machine translation is a lossless-gen task. In contrast, we proposed the SPM to model both paraphrasing and discarding to reduce the odd-gen in a lossy-gen task.
We focused on headline generation, which is a well-known lossy-gen task. Recent studies have actively applied the EncDec to this task (Rush et al., 2015; Chopra et al., 2016; Nallapati et al., 2016). In the headline generation task, Zhou et al. (2017) and Suzuki and Nagata (2017) tackled a part of the odd-gen. Zhou et al. (2017) incorporated an additional gate (sGate) into the encoder to select appropriate words from the source. Suzuki and Nagata (2017) proposed a frequency estimation module to reduce the repeating phrases. Our motivation is similar to theirs, but we addressed solving all types of odd-gen. In addition, we can combine these approaches with the proposed method. In fact, we reported in Section 5.5 that the SPM can improve the performance of sGate with EncDec.
Apart from the odd-gen, some studies proposed methods to improve the performance of the headline generation task. Takase et al. (2016) incorporated AMR (Banarescu et al., 2013) into the encoder to use the syntactic and semantic information of the source. Nallapati et al. (2016) also encoded additional information of the source such as TF-IDF, part-of-speech tags and named entities. Li et al. (2017) modeled the typical structure of a headline, such as "Who Action What", with a variational auto-encoder. These approaches improved the performance of headline generation, but it is unclear whether they can reduce the odd-gen.
8 Conclusion
In this paper, we discussed an approach for reducing the odd-gen in lossy-gen tasks. The proposed SPM learns to predict the one-to-one correspondence of tokens in the source and the target. Experiments on the headline generation task show that the SPM improved the performance of typical EncDec, and outperformed the current state-of-the-art model. Furthermore, we demonstrated that the SPM reduced the odd-gen. In addition, the SPM obtained token-wise correspondence between the source and the target without any alignment data.
Conclusion