
Semantic Matching for Sequence-to-Sequence Learning

Ruiyi Zhang♦, Changyou Chen†, Xinyuan Zhang‡, Ke Bai♦, Lawrence Carin♦

♦Duke University, †State University of New York at Buffalo, ‡ASAPP Inc.
[email protected]

Abstract

In sequence-to-sequence models, classical optimal transport (OT) can be applied to semantically match generated sentences with target sentences. However, in non-parallel settings, target sentences are usually unavailable. To tackle this issue without losing the benefits of classical OT, we present a semantic matching scheme based on Optimal Partial Transport (OPT). Specifically, our approach partially matches semantically meaningful words between source and partial target sequences. To overcome the difficulty of detecting active regions in OPT (corresponding to the words that need to be matched), we further exploit prior knowledge to perform partial matching. Extensive experiments are conducted to evaluate the proposed approach, showing consistent improvements over sequence-to-sequence tasks.

1 Introduction

Sequence-to-sequence (Seq2Seq) models are widely used in various natural-language-processing tasks, such as machine translation (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015), text summarization (Rush et al., 2015; Chopra et al., 2016) and image captioning (Vinyals et al., 2015; Xu et al., 2015). Typically, these models are based on an encoder-decoder architecture, with an encoder mapping a source sequence into a latent vector, and a decoder translating the latent vector into a target sequence. The goal of a Seq2Seq model is to optimize this encoder-decoder network to generate sequences close to the target. Therefore, a proper measure of the distance between sequences is crucial for model training.

The Wasserstein distance between two text sequences, i.e., the word mover's distance (Kusner et al., 2015), can serve as an effective regularizer for semantic matching in Seq2Seq models (Chen et al., 2019). Classical optimal transport models require that each piece of mass in the source distribution is transported to an equal-weight piece of mass in the target distribution. However, this requirement is too restrictive for Seq2Seq models, making direct application inappropriate for the following reasons: (i) texts often have different lengths, and not every element in the source text corresponds to an element in the target text; a good example is style transfer, where some words in the source text have no corresponding words in the target text. (ii) It is reasonable to semantically match important words while neglecting some others, e.g., conjunctions. In typical unsupervised models, text data are non-parallel in the sense that pairwise data are typically unavailable (Sutskever et al., 2014). Thus, both pairwise-information inference and text generation must be performed in the same model with only non-parallel data. Classical OT is not applicable without target text sequences. However, partial target information is available: for example, the objects detected in an image should be described in its caption, and the content words should be preserved when changing the style. OT fails in these cases, but matching can be performed by optimal partial transport (OPT). Specifically, we exploit the partial target information by partially matching it with generated texts. The partial matching is implemented based on lexical information extracted from the texts. We call our method SEmantic PArtial Matching (SEPAM).

To demonstrate the effectiveness of SEPAM, we consider applying it to sequence-prediction tasks where semantic partial matching is needed: (i) in unsupervised text-style transfer, SEPAM can be employed for content preservation via partially matching the input and generated text; (ii) in image captioning, SEPAM can be applied to partially match the objects detected in images with corresponding captions for more informative generation; (iii) in table-to-text generation, SEPAM can prevent hallucination (Dhingra et al., 2019) via partially matching tabular key words with generated sentences.

The main contributions of this paper are summarized as follows: (i) A novel semantic matching scheme based on optimal partial transport is proposed. (ii) Our model can be interpreted as incorporating prior knowledge into optimal transport to exploit the structure of natural language, while keeping the algorithm tractable for real-world tasks. (iii) To demonstrate the versatility of the proposed scheme, we empirically show consistent improvements in style transfer for content preservation, in image captioning for informative image descriptions, and in table-to-text generation for faithful generation.

2 Background

2.1 Optimal Transport

Optimal transport defines distances between probability measures on a domain X (the word-embedding space in our setting). The optimal transport distance between two probability measures µ and ν is defined as (Peyré et al., 2017):

$$ D_c(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \mathbb{E}_{(x,y) \sim \gamma}\,[c(x, y)], \qquad (1) $$

where Π(µ, ν) denotes the set of all joint distributions γ(x, y) with marginals µ(x) and ν(y); c(x, y) : X × X → R is the cost function for moving x to y, e.g., the Euclidean or cosine distance. Intuitively, the optimal transport distance is the minimum cost that γ induces in order to transport from µ to ν. When c(x, y) is a metric on X, D_c(µ, ν) induces a proper metric on the space of probability distributions supported on X, commonly known as the Wasserstein distance (Villani, 2008).

We focus on applying the OT distance to textual data, and therefore only consider OT between discrete distributions. Specifically, consider two discrete distributions µ, ν ∈ P(X), which can be represented as $\mu = \sum_{i=1}^{n} u_i \delta_{x_i}$ and $\nu = \sum_{j=1}^{m} v_j \delta_{y_j}$, with $\delta_x$ the Dirac function centered on x. The weight vectors $u = \{u_i\}_{i=1}^{n} \in \Delta_n$ and $v = \{v_j\}_{j=1}^{m} \in \Delta_m$ belong to the simplex, i.e., $\sum_{i=1}^{n} u_i = \sum_{j=1}^{m} v_j = 1$, as both µ and ν are probability distributions. Under such a setting, computing the OT distance defined in (1) can be reformulated as the following minimization problem:

$$ W_c(\mu, \nu) = \min_{T \in \Pi(u, v)} \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij} \cdot c(x_i, y_j) = \min_{T \in \Pi(u, v)} \langle T, C \rangle, \qquad (2) $$

where $\Pi(u, v) = \{T \in \mathbb{R}_+^{n \times m} \,|\, T \mathbf{1}_m = u,\; T^\top \mathbf{1}_n = v\}$, $\mathbf{1}_n$ denotes an n-dimensional all-one vector, C is the cost matrix given by $C_{ij} = c(x_i, y_j)$, and $\langle T, C \rangle = \mathrm{Tr}(T^\top C)$ represents the Frobenius dot-product.

2.2 Optimal Partial Transport

Optimal partial transport (OPT) was studied initially by Caffarelli and McCann (2010). It is a variant of optimal transport in which only a portion of the mass is to be transported, in an efficient way. In OPT, the transport problem is defined by generalizing γ to a Borel measure such that:

$$ D_c(\mu, \nu) = \inf_{\gamma \in \Pi_{\leq}(f, g),\, M(\gamma) = m} \int c(x, y)\, d\gamma(x, y), \qquad (3) $$

where $\Pi_{\leq}(f, g)$ is defined as the set of nonnegative finite Borel measures on $\mathbb{R}^n \times \mathbb{R}^n$ whose first and second marginals are dominated by f and g respectively, i.e., $\gamma(A \times \mathbb{R}^n) \leq \int_A f(x)\,dx$ and $\gamma(\mathbb{R}^n \times A) \leq \int_A g(y)\,dy$ for all $A \subseteq \mathbb{R}^n$; $M(\gamma) \triangleq \int_{\mathbb{R}^n \times \mathbb{R}^n} d\gamma$ represents the mass of γ in (3); and $m \in [0, \min\{\|f\|_{L^1}, \|g\|_{L^1}\}]$. Here f and g can be considered as the maximum marginal measures for γ. As a result, if m is less than $\min\{\|f\|_{L^1}, \|g\|_{L^1}\}$, then γ assigns zero measure to some elements of the space. In other words, the zero-measure elements need not be considered when matching µ and ν; the elements with non-zero measure form the active regions. A challenge in OPT is how to detect these active regions, so directly optimizing (3) is typically very challenging and computationally expensive. In our setting of text analysis, we propose to leverage prior knowledge to define the active regions, as introduced below.

3 Semantic Matching via OPT

In unsupervised Seq2Seq tasks without pairwise information, naively matching the generated sequence with the weak-supervision information (e.g., the source text in style transfer) renders deficient performance, even though both sentences share similar content. In supervised settings, target and input sequences are of different lengths but are similar in semantic meaning, as in table-to-text generation. Motivated by this, we propose a novel technique for semantic partial matching and consider two scenarios: (i) text-to-text matching and (ii) image-to-text matching.

[Figure 1: Semantic matching between the potential target (top) and the generated texts (bottom). Left: partially matching two texts with different styles, using their POS tags. Right: partially matching generated text with concepts detected from an image.]

Text-to-Text Matching  We consider semantic matching between two sequences in Seq2Seq models, where partial matching is important: (i) in the unsupervised setting, such as non-parallel style transfer, partial matching between the source and target texts is helpful for content preservation; (ii) in the supervised setting, such as table-to-text generation, partially matching the input and target sequences can effectively avoid hallucinated generation, i.e., text that mentions information beyond what is present in the source. Figure 1 shows an example of partial matching, where part-of-speech (POS) tags for each word provide the prior knowledge. In these cases, directly applying OT causes an imbalanced-transportation issue and poor performance.

Image-to-Text Matching  Objects and their properties (e.g., colors) appear in both the images and the paired captions. Consider the image-to-text matching in Figure 1: each object in the image has corresponding words/phrases in the captions. We can thus match the labels of objects detected in the image to words in its caption. Note that the labels are not in one-to-one correspondence with the text, so directly applying OT is inappropriate, as in the text-to-text case.

Different Matching Schemes  Hard matching seeks exact matches between the source and target. Typically, hard matching is too simple to be effective because it ignores semantic similarity; and if we apply classical optimal transport in unsupervised settings, it causes imbalanced matching, since some unnecessary words are included in the source and the exact target is unavailable. To tackle this issue, one could directly apply optimal partial transport (OPT) to detect which words have correspondences and match each word with its target. However, the detection process, a constrained optimization as in (3), is computationally expensive and does not scale to real-world tasks. Fortunately, we can exploit syntactic information from the text and incorporate it as prior knowledge into OPT to avoid the detection procedure.

3.1 Partial Matching via OPT

We formulate the proposed semantic matching as an optimal partial transport problem, where only parts of the source and target are matched. Specifically, we incorporate prior knowledge into optimal partial transport (OPT); this prior knowledge helps determine the set of words to match, i.e., M(X), where M(·) is a function returning the set of words/phrases to match. The strategy for determining M(·) depends on the task.

OPT distance  To apply the OT distance to text, we first represent a sentence Y with a discrete distribution $p_Y = \frac{1}{T} \sum_t \delta_{e(y_t)}$ in the semantic space, where a length-normalized point mass is placed at the semantic embedding $e_{y_t} = e(y_t)$ of each token $y_t$ of the sequence Y. Here e(·) denotes a word-embedding function mapping a token to its d-dimensional feature representation. For two sentences X and Y, we define their OPT distance as:

$$ W_c(\mu, \nu) = \min_{T \in \Pi_c(\mu, \nu)} \sum_{i=1}^{m} \sum_{j=1}^{n} T_{ij}\, c(e_{x_i}, e_{y_j}), \qquad (4) $$

where $\Pi_c(\mu, \nu)$ is the solution space, and every solution $T \in \Pi_c(\mu, \nu)$ satisfies $T_{ij} = 0$ if $x_i \notin M(X)$ or $y_j \notin M(Y)$. Different from classical OPT, the elements of µ and ν to be matched are explicitly defined by M(·), which encodes the prior knowledge; the OPT constraint thus becomes more specific and requires no extra optimization procedure. We use the cosine distance as the cost function: $c(e_x, e_y) \triangleq 1 - \frac{e_x^\top e_y}{\|e_x\|_2 \|e_y\|_2}$.
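A small sketch of this masked cost construction (an illustration, not the paper's code): given token-embedding matrices and boolean selectors marking membership in M(X) and M(Y), dropping the unselected rows and columns enforces the constraint T_ij = 0 outside the prior-selected pairs. The reduced cost matrix can then be fed to any OT solver, e.g., the Sinkhorn sketch above or the IPOT sketch below.

```python
import numpy as np

def cosine_cost(Ex, Ey):
    """C_ij = 1 - cos(e_x_i, e_y_j), the cost used in Eq. (4)."""
    Ex = Ex / np.linalg.norm(Ex, axis=1, keepdims=True)
    Ey = Ey / np.linalg.norm(Ey, axis=1, keepdims=True)
    return 1.0 - Ex @ Ey.T

def partial_cost(Ex, Ey, in_MX, in_MY):
    """Restrict the cost matrix to tokens selected by the prior M(.).

    in_MX, in_MY: boolean arrays marking which tokens of X and Y
    belong to M(X) and M(Y); all other rows/columns are dropped,
    which enforces T_ij = 0 for unselected pairs.
    """
    return cosine_cost(Ex[in_MX], Ey[in_MY])
```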

[Figure 2: Overview of the SEPAM architecture. Left: classical Seq2Seq, where X and Y are pairwise; L_SEM implements a soft-copying mechanism via semantic partial matching. Right: unsupervised Seq2Seq, where X and Y are non-parallel; L_SEM provides the guidance for G(·) to generate Ŷ relevant to X via semantic partial matching. Solid lines: gradients are backpropagated in training; dashed lines: gradients are not backpropagated.]

Approximation of OPT  Computing the exact OPT distance is computationally challenging (Figalli, 2010). We bypass the difficulty of active-region detection by using lexical information and reformulating it as an OT problem. We then employ the IPOT algorithm (Xie et al., 2018) to obtain an efficient approximation. In practice, we use a keyword mask K, defined as $K_{ij} = 0$ if $x_i \notin M(X)$ and $y_j \notin M(Y)$; $K_{ij} = +\infty$ if $x_i \notin M(X)$ xor $y_j \notin M(Y)$; and $K_{ij} = 1$ otherwise. Hence we define the OPT distance as $W_c^p(p_X, p_Y, K) \triangleq W_c(p_{M(X)}, p_{M(Y)})$. Specifically, IPOT performs proximal gradient descent to solve for the optimal transport matrix:

$$ T^{(t+1)} = \underset{T \in \Pi_c(\mu, \nu)}{\operatorname{argmin}} \left\{ \langle T, C' \rangle + \gamma \cdot \mathcal{D}_{KL}(T, T^{(t)}) \right\}, \qquad (5) $$

where $C' = K \circ C$, $1/\gamma > 0$ is the generalized step size, and the generalized KL-divergence $\mathcal{D}_{KL}(T, T^{(t)})$ is used as the proximity metric. The full approach is summarized as Algorithm 1 in Appendix A.
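For reference, below is a minimal NumPy sketch of the IPOT proximal-point iteration in (5), reconstructed from the description here and Xie et al. (2018) rather than taken from the authors' code. It assumes the keyword mask has already been applied, i.e., fully masked rows/columns of C have been removed and the marginals renormalized, as in the masking sketch above.

```python
import numpy as np

def ipot(C, u, v, beta=0.5, n_outer=50, n_inner=1):
    """Inexact proximal point iteration for OT (IPOT).

    C: (n, m) cost matrix, here the masked cost C' = K o C.
    u, v: marginal weights on the simplex.
    beta plays the role of the generalized step size 1/gamma in Eq. (5).
    """
    n, m = C.shape
    T = np.outer(u, v)            # feasible initial transport plan
    G = np.exp(-C / beta)         # proximal kernel
    b = np.ones(m)
    for _ in range(n_outer):
        Q = G * T                 # combine kernel with the previous plan T^(t)
        for _ in range(n_inner):  # one Sinkhorn-style scaling step
            a = u / (Q @ b)
            b = v / (Q.T @ a)
        T = a[:, None] * Q * b[None, :]
    return T

# Usage: T = ipot(C, u, v); loss = (T * C).sum()
```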

4 Semantic Partial Matching for Text Generation

Assume there are two sets of objects, $\mathcal{X} = \{X^{(i)}\}_{i=1}^{M}$ and $\mathcal{Y} = \{Y^{(j)}\}_{j=1}^{N}$. We consider a Seq2Seq model whose input is X (for simplicity, we omit the superscript "(i)" when the context is independent of i; the same applies to Y) and whose output is a sequence of length T with tokens $y_t$, i.e., $Y = (y_1, y_2, \ldots, y_T)$. One typically assigns the following probability to an observation y at location t: $p(y \,|\, Y_{<t}) = [\mathrm{softmax}(g(s_t))]_y$, where $Y_{<t} = (y_1, y_2, \ldots, y_{t-1})$. This specifies a probabilistic model:

$$ \log p(Y \,|\, X) = \sum_t \log p(y_t \,|\, Y_{<t}, X). \qquad (6) $$

To train the model, one typically uses maximum likelihood estimation (MLE):

$$ \mathcal{L}_{\mathrm{MLE}} = -\mathbb{E}_{(X, Y) \sim (\mathcal{X}, \mathcal{Y})}\,[\log p(Y \,|\, X)]. \qquad (7) $$
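As a concrete reading of (6) and (7), here is a tiny NumPy sketch (illustrative only, not the paper's code) of the per-pair negative log-likelihood under teacher forcing; `step_logits` stands in for the decoder outputs $g(s_t)$.

```python
import numpy as np

def seq_nll(step_logits, targets):
    """-log p(Y|X) = -sum_t log softmax(g(s_t))[y_t], cf. Eqs. (6)-(7).

    step_logits: iterable of (V,) logit vectors, one per time step.
    targets: iterable of gold token ids y_t (teacher forcing).
    """
    nll = 0.0
    for g_t, y_t in zip(step_logits, targets):
        z = g_t - g_t.max()                       # numerical stability
        log_probs = z - np.log(np.exp(z).sum())   # log softmax
        nll -= log_probs[y_t]
    return nll
```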

We consider an encoder-decoder framework in this paper, where a latent vector z is given by an encoder Enc(·) with input X, i.e., $z_x = \mathrm{Enc}(X)$. Based on $z_x$, a decoder G(·) generates a new sentence Ŷ that is expected to be the same as Y. The decoder can be implemented by an LSTM (Hochreiter and Schmidhuber, 1997), a GRU (Cho et al., 2014), or a Transformer (Vaswani et al., 2017). An unsupervised Seq2Seq model considers X and Y as non-parallel, i.e., the pairwise information is unknown. One typically pretrains the generator with the reconstruction loss:

$$ \mathcal{L}_{\mathrm{AE}} = \mathbb{E}_{Y \sim \mathcal{Y}}\,[-\log p(Y \,|\, z_y)], \qquad (8) $$

where $z_y = \mathrm{Enc}(Y)$. Note that the goal of unsupervised Seq2Seq is to generate a sequence Y given some object X. Hence, we seek to learn the conditional generation distribution p(Y | X), as in the classical Seq2Seq model. In practice, the generator can be trained by combining the reconstruction loss with some guidance loss containing the information from X. The guidance loss can be defined following SEPAM; other losses usually depend on the task, and we omit their details for clarity.

Differentiable SEPAM  Note that the SEPAM loss is not differentiable due to the multinomial sampling process $y_t \sim \mathrm{Softmax}(g_t)$, where $g_t$ is the logit vector given by the final layer of the generator G(·). To enable direct backpropagation from the SEPAM loss for generator training, we consider the soft-argmax approximation (Zhang et al., 2017) to avoid the use of REINFORCE (Sutton et al., 2000):

$$ \tilde{y}_t = \mathrm{Softmax}(g_t / \tau), $$

where $0 < \tau < 1$ is the annealing factor. Given two sentences, we denote the generated sequence embeddings as $S_g = \{e^y_i\}_{i=1}^{T}$ and the partial reference embeddings as $S_r = \{e^x_j\}_{j=1}^{T'}$, at the word or phrase level. The cost matrix C is then computed as $C_{ij} = c(e^y_i, e^x_j)$. The semantic partial-matching loss between the reference and the model generation can be computed via the IPOT algorithm:

$$ \mathcal{L}_{\mathrm{SEM}} = W_c^p(S_g, S_r, K). \qquad (9) $$
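A minimal sketch of the soft-argmax trick (an illustration; the paper's implementation is in TensorFlow): instead of looking up the embedding of a sampled token, each step's soft distribution is multiplied into the embedding table, yielding an expected embedding that is differentiable with respect to the logits in an autodiff framework.

```python
import numpy as np

def soft_embeddings(logits, emb_table, tau=0.1):
    """Soft-argmax token embeddings for the generated sequence.

    logits: (T, V) generator outputs g_t over the vocabulary.
    emb_table: (V, d) word-embedding matrix e(.).
    Each row is the expectation of e(y_t) under Softmax(g_t / tau);
    as tau -> 0 this approaches the embedding of the argmax token,
    while remaining differentiable w.r.t. the logits.
    """
    z = logits / tau
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    return p @ emb_table                    # (T, d) soft embeddings
```

These soft embeddings form $S_g$, which together with the reference embeddings $S_r$ and the mask K feed the masked cost and the IPOT solver sketched earlier to give $\mathcal{L}_{\mathrm{SEM}}$.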

SEPAM Regularization  The SEPAM training objective discussed above only focuses on generating words with specific meanings and does not consider word ordering. To train a proper text-generation model, we propose to combine the SEPAM loss with the likelihood loss $\mathcal{L}_{\mathrm{MLE}}$ in supervised settings, or with $\mathcal{L}_{\mathrm{AE}}$ in unsupervised settings. Hence, the training objective in unsupervised settings is $\mathcal{L} = \mathcal{L}_{\mathrm{AE}} + \lambda \mathcal{L}_{\mathrm{SEM}}$, where λ is the weight of SEPAM, to be tuned; a similar objective, $\mathcal{L} = \mathcal{L}_{\mathrm{MLE}} + \lambda \mathcal{L}_{\mathrm{SEM}}$, applies in supervised settings. In the following, we discuss how to extract and use prior knowledge for partial matching in three downstream tasks:

(i) Non-parallel Style Transfer  Semantic partial matching between the source sentence and the transferred one is helpful for content preservation, as shown in Figure 1. It is usually the case that content words are nouns or verbs, and style words are adjectives or adverbs. Hence, M(X) and M(Y) are content-word sets, extracted based on POS tags using NLTK (see the sketch below). One can employ this prior knowledge to perform different operations on words: for content words, we encourage partial matching between the sentences via $\mathcal{L}_{\mathrm{SEM}}$; for style words, we discourage matching (Hu et al., 2017). More details are provided in Appendix A.1.
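One plausible implementation of this rule (the exact tag sets used in the paper's Appendix A.1 may differ) keeps Penn Treebank noun and verb tags as content words:

```python
import nltk  # assumes the punkt and averaged_perceptron_tagger data are installed

CONTENT_PREFIXES = ("NN", "VB")  # nouns and verbs ~ content words
STYLE_PREFIXES = ("JJ", "RB")    # adjectives and adverbs ~ style words (unmatched)

def content_words(sentence):
    """M(.) for style transfer: keep content words, drop style words."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return [w for w, tag in tagged if tag.startswith(CONTENT_PREFIXES)]

print(content_words("rooms were spacious and food was very well cooked ."))
# -> ['rooms', 'were', 'food', 'was', 'cooked']  (tagger-dependent)
```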

(ii) Unsupervised Image Captioning  Visual concepts extracted from an image can be employed to generate relevant captions in the unsupervised setting. Feng et al. (2019) use exact hard matching and REINFORCE to update the captioning model. Here, we instead apply semantic partial matching to encourage the generation of visual-concept words. Specifically, M(X) returns the detected visual concepts, and M(Y) returns the generated words related to objects (i.e., nouns). This visual-concept regularization can also be applied in the supervised setting, complementing the MLE loss.

(iii) Table-to-Text Generation  Semantic partial matching can prevent hallucinated generation, i.e., text that mentions information beyond what is present in the table (Dhingra et al., 2019). Here M(Y) extracts nouns from the generated sequence and M(X) returns the keys in the table; these sets are then used to compute $\mathcal{L}_{\mathrm{SEM}}$, which penalizes the generator if extra information appears in the generated text Y (a sketch follows below).
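A sketch of how the two selector functions might look for this task (illustrative only; the paper does not spell out whether "keys" refers to the slot names or the slot values, so the choice below, slot values, is an assumption):

```python
import nltk

def table_words(table):
    """M(X): words from the table entries (here, the slot values;
    whether to use slot names instead is an open assumption)."""
    return [w for _, value in table for w in value.split()]

def generated_nouns(text):
    """M(Y): nouns in the generated text; nouns with no counterpart
    among the table words are the ones L_SEM penalizes."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [w for w, tag in tagged if tag.startswith("NN")]

table = [("<Name_ID>", "Xia Jin"), ("<place of birth>", "Chongqing")]
print(table_words(table))  # ['Xia', 'Jin', 'Chongqing']
```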

5 Related Work

Optimal transport in NLP  Kusner et al. (2015) first applied optimal transport in NLP, proposing the word mover's distance (WMD). The transportation cost is usually defined as the Euclidean distance, and the OT distance is approximated by solving a less-accurate lower bound (Kusner et al., 2015). Building on this, Chen et al. (2018) proposed a feature-mover distance for style transfer. Chen et al. (2019) applied OT to classical Seq2Seq learning and formulated it via Wasserstein gradient flows (Zhang et al., 2018). SEPAM moves further, applying OT in both supervised and unsupervised settings (Artetxe et al., 2018; Lample et al., 2017).

Unsupervised Seq2Seq Learning  Different from the standard Seq2Seq model (Sutskever et al., 2014), parallel sentences for different styles are not provided and must be inferred from the data. Unsupervised machine translation (Artetxe et al., 2018; Lample et al., 2017) learns to translate text from one language to another given only two sets of texts in these languages. Dai et al. (2019) explore the Transformer as the generator, instead of classical auto-regressive models. Style transfer (Shen et al., 2017) aims at transferring the style of texts with non-parallel data. Compared with these tasks, unsupervised image captioning (Feng et al., 2019) is more challenging, since images and sentences are distinct modalities.

Copying Mechanism  This is related to the copy network (Gu et al., 2016), which achieves retrieval-based copying. Li et al. (2018) further proposed a delete-retrieve-generate framework for style transfer. Chen and Bansal (2018) combine abstraction with extraction in text summarization, achieving state-of-the-art results via reinforced word selection. In this work, we propose semantic partial matching, which can be regarded as a kind of soft-copying mechanism. Instead of the retrieval-based exact copying used by Gu et al. (2016) and Li et al. (2018), SEPAM considers semantic similarity, and thus ideally delivers smoother transformations in generation.

[Figure 3: A comparison of OT (left column) and SEPAM (right column) transport plans. The horizontal axis shows the generated texts; the vertical axis shows the partial targets.]

6 Experiments

6.1 Demonstration

Comparison between OT and SEPAM  We show two examples of classical OT and SEPAM on two sequence-prediction tasks in Figure 3. The first row shows the heat maps of OT and SEPAM when matching two sentences with different styles. SEPAM employs the syntactic information to match selected words, and all the content words are exactly matched. In classical OT, however, some sentiment words are still matched, preventing successful style transfer. The second row of Figure 3 compares matching a generated sentence with the concept set detected from an image in image captioning. With SEPAM the concepts are perfectly matched with their corresponding words in the caption, while OT includes some noisy matchings (light blue). In summary, SEPAM achieves better matching than classical OT, and the matching weights T of SEPAM are sparser.

Implicit Use of Prior Knowledge  We also consider using the attention weights $w_t$ of an LSTM-based text classifier to determine which words to match. As discussed in Wiegreffe and Pinter (2019), a word with a higher attention weight is more important for classification, i.e., more relevant to the style. Figure 4 shows the attention maps of three instances. Hence, we can partially match words with lower attention weights, as they are mostly non-style words. However, empirical results show that this implicit approach performs much worse than the simple rule-based strategy with POS tags.

[Figure 4: Attention maps for three Yelp instances: "the atmosphere of the church is very fun ."; "overall I was very happy with the compensation I got ."; "but the smell was so horrible i will never go there again ."]

6.2 Unsupervised Text-style Transfer

Setup  We use the same data and split method described in Shen et al. (2017). Experiments are implemented in TensorFlow with Texar (Hu et al., 2018). For a fair comparison, we use a model configuration similar to that of Hu et al. (2017) and Yang et al. (2018). A one-layer GRU (Cho et al., 2014) encoder and an LSTM attention decoder (generator) are used. We set the weight of the semantic matching loss to λ = 0.2. Models are trained for a total of 15 epochs, with 10 epochs of pretraining and 5 epochs of fine-tuning.

Metrics  We pretrain a CNN classifier, which achieves an accuracy of 97.4% on the validation set. Based on it, we report accuracy (ACC) to measure the quality of style control. We further measure content preservation using (i) BLEU, which measures the similarity between the original and transferred sentences, and (ii) ref-BLEU, which measures content preservation by comparing the transferred sentences with human annotations. Fluency is evaluated with the perplexity (PPL) of generated sentences under a pretrained language model.

Baselines  We implemented CtrlGen (Hu et al., 2017) and LM (Yang et al., 2018) as our baselines and further added SEPAM on top of these two models for validation. The cross-aligned autoencoder (CAE) (Shen et al., 2017) and retrieval-based methods (Li et al., 2018) are added as two further baselines.

Analysis  Results are shown in Table 1. Combining our proposed method with the corresponding baselines exhibits similar transfer accuracy and fluency while maintaining content better. Specifically, SEPAM shows higher BLEU scores against the human-annotated sentences, further validating its effectiveness for content preservation. Interestingly, lower BLEU scores against the original sentences do not imply higher BLEU scores against the human annotations in Table 1. Compared with the other models, the proposed model strikes a better balance among accuracy, fluency and content preservation, achieving the highest ref-BLEU.

Model | ACC (%) | BLEU | ref-BLEU | PPL
CAE (Shen et al., 2017) | 73.9 | 20.7 | 7.8 | 51.6
StyleEmbedding (Fu et al., 2018) | 8.1 | 67.4 | 19.2 | 120.1
MultiDecoder (Fu et al., 2018) | 46.9 | 40.1 | 12.9 | 113.1
Template (Li et al., 2018) | 80.1 | 57.4 | 20.5 | 170.5
DeleteAndRetrieval (Li et al., 2018) | 88.9 | 36.8 | 14.7 | 74.2
CtrlGen (Hu et al., 2017) | 89.0 | 61.4 | 22.3 | 176.8
OT + CtrlGen | 85.4 | 62.9 | 21.7 | 183.7
SEPAM + CtrlGen | 89.1 | 63.7 | 25.9 | 176.4
LM (Yang et al., 2018) | 88.3 | 60.5 | 25.7 | 79.9
OT + LM | 84.8 | 61.8 | 22.8 | 85.1
SEPAM + LM | 88.7 | 62.0 | 28.2 | 76.6

Table 1: Performance of our model and baselines on the test set with human annotations.

Input: tasted really old , i could n't believe it .
CtrlGen: adds really top , i could gorgeous believe it .
SEPAM + CtrlGen: tasted really surprisingly , i could fantastic believe it .
LM: tasted really great , i could always believe it .
SEPAM + LM: tasted really excellent , i could always believe it .

Input: they do not stock some of the most common parts .
CtrlGen: they do fantastic laughed some of the most common parts .
SEPAM + CtrlGen: they do authentic expertly some of the most fascinating parts
LM: they do definitely right some of the most cool parts .
SEPAM + LM: they do always stock some of the most amazing parts .

Input: the woman who helped me today was very friendly and knowledgeable .
CtrlGen: the woman who so-so me today was very rude and knowledgeable .
SEPAM + CtrlGen: the woman who helped me today was very rude and knowledgeable .
LM: the woman who ridiculous me today was very rude and knowledgeable .
SEPAM + LM: the woman who helped me today was very rude and stupid .

Table 2: Examples comparing different methods on the Yelp dataset.

Model | Style | Content | Fluency
CAE (Shen et al., 2017) | 3.21 | 2.91 | 2.83
CtrlGen (Hu et al., 2017) | 3.42 | 3.22 | 2.79
LM (Yang et al., 2018) | 3.38 | 3.32 | 3.20
SEPAM + CtrlGen | 3.51 | 3.56 | 2.88
SEPAM + LM | 3.47 | 3.72 | 3.25

Table 3: Human evaluation results on the Yelp dataset.

Human Evaluation  We further conduct human evaluations of the proposed SEPAM using Amazon Mechanical Turk. We randomly sample 100 sentences from the test set and ask 5 different reviewers to rate the models in terms of fluency, style, and content preservation. We require all workers to be native English speakers with an approval rate higher than 95% and at least 100 assignments completed. For each sentence, five shuffled samples generated by the different models are sequentially shown to a reviewer. Results in Table 3 demonstrate the better performance achieved by SEPAM, especially in terms of content preservation.

6.3 Image Captioning

Setup  We consider image captioning on the COCO dataset (Lin et al., 2014), which contains 123,287 images in total, each annotated with at least 5 captions. Following Karpathy's split (Karpathy and Fei-Fei, 2015), 113,287 images are used for training and 5,000 are used for validation and testing. Note that the training images are used only to build the image set, with all their captions left unused for any training. All descriptions in the Shutterstock image-description corpus are tokenized with a vocabulary size of 18,667 (Feng et al., 2019). The LSTM hidden dimension and the shared latent-space dimension are fixed to 512. The weighting hyper-parameters are chosen to put the different rewards on roughly the same scale; specifically, λ is set to 10. We train our model using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.0001; during initialization, we minimize the cross-entropy loss using Adam with a learning rate of 0.001. When generating captions in the test phase, we use beam search with a beam size of 3.

Metrics  We report BLEU (Papineni et al., 2002), CIDEr (Vedantam et al., 2015), and METEOR (Banerjee and Lavie, 2005) scores. The results of the different methods are shown in Table 5.

Method | BLEU | METEOR | CIDEr | SPICE
Feng et al. (2019) | 38.2 | 27.5 | 22.9 | 6.6
Our implementations:
Hard Matching | 38.5 | 27.2 | 28.2 | 7.8
OT | 39.9 | 28.1 | 28.3 | 7.9
SEPAM | 42.1 | 28.9 | 30.2 | 8.4

Table 5: Performance comparison of unsupervised captioning on the MSCOCO dataset.

Analysis  Results in Table 5 show consistent improvement of SEPAM over classical OT. Classical OT can improve upon the baseline by generating specific words aligned with the detected visual concepts. However, directly applying it in unsupervised settings suffers from the imbalance issue (Craig, 2014), i.e., the generated texts contain some useless elements without correspondence in the targets. Our proposed SEPAM avoids this problem via partial matching, leading to better performance.

Extension to Supervised Settings  The proposed SEPAM loss $\mathcal{L}_{\mathrm{SEM}}$ can also be applied in a supervised setting as a regularizer alongside the MLE loss. We apply SEPAM to a captioning model in which image features are fed into an LSTM sequence generator with an Att2in attention mechanism (Anderson et al., 2018). We pretrain the captioning model for a maximum of 20 epochs, then train it with reinforcement learning for another 20 epochs; testing is done with the best model on the validation set. We partially match the tags or visual features of detected objects. As before, we see consistent improvement of SEPAM over its baselines (Table 6).

Method | BLEU | METEOR | ROUGE | PARENT / Precision / Recall
Seq2Seq (Wiseman and Rush, 2016) | 22.24 | 19.50 | 39.49 | 43.41 / 49.09 / 41.80
Pointer (See et al., 2017) | 19.32 | 19.88 | 40.68 | 49.52 / 61.73 / 44.09
Structure Aware (Liu et al., 2018) | 22.76 | 20.27 | 39.32 | 46.47 / 51.18 / 46.34
Transformer | 23.48 | 21.89 | 42.50 | 52.60 / 63.20 / 47.90
Transformer + OT | 23.87 | 22.35 | 42.03 | 51.81 / 60.65 / 48.87
Transformer + SEPAM | 24.06 | 22.29 | 42.83 | 53.16 / 62.99 / 48.81

Table 4: Performance comparison of table-to-text generation on WikiPerson.

Method | BLEU | METEOR | ROUGE | CIDEr
Vinyals et al. (2015) | 27.7 | 23.7 | – | 85.5
Gan et al. (2017) | 56.6 | 25.7 | – | 101.2
Lu et al. (2017) | 33.2 | 26.6 | – | 108.5
Chen et al. (2019) | 33.8 | 25.6 | – | 102.9
Our implementations:
MLE | 34.3 | 26.2 | 55.2 | 106.3
Visual + OT | 34.6 | 26.4 | 55.6 | 107.5
Visual + SEPAM | 34.9 | 26.9 | 56.0 | 109.2
Tag + OT | 34.8 | 26.5 | 55.6 | 107.9
Tag + SEPAM | 34.4 | 27.0 | 56.1 | 111.6

Table 6: Performance comparison of supervised image captioning on the MSCOCO dataset.

6.4 Table-to-Text Generation

Setup  We evaluate SEPAM on table-to-text generation (Lebret et al., 2016; Liu et al., 2018; Wiseman et al., 2018) with the WikiPerson dataset (Wang et al., 2018), preprocessing the training set with a vocabulary size of 50,000. We use a Transformer encoder and decoder, with 8 heads, 3 Transformer blocks, and 2048 hidden units in the feed-forward layer, and set λ = 0.1. As before, the model is first trained with $\mathcal{L}_{\mathrm{MLE}}$ for 20,000 steps and then fine-tuned with $\mathcal{L}_{\mathrm{SEM}}$.

Slot Name | Slot Value
<Name_ID> | Xia Jin
<occupation> | Football player
<date of birth> | 14 February 1985
<place of birth> | Chongqing
<member of sports team> | Guizhou Hengfeng F.C.
<member of sports team> | Chongqing Dangdai Lifan F.C.
<member of sports team> | Chengdu Better City F.C.

MLE: Xia Jin (born 14 February 1985 in Guizhou) is a Chinese Football player who currently plays for Guizhou Dangdai Lifan F.C. in the China League One . he joined Guizhou Dangdai Lifan F.C. in the summer of 2010 . he joined Guizhou Dangdai Lifan F.C. in the summer of 2013 . he joined Guizhou Dangdai Lifan F.C. in the summer of 2013 . he joined Guizhou Dangdai Lifan F.C. in the summer of 2013 . he started his professional career with Chongqing Dangdai Lifan F.C. .

OT: Xia Jin (born 14 February 1985 in Chongqing) is a Chinese Football player who currently plays for Guizhou Hengfeng F.C. in the China League One . jin started his professional footballer career with Guizhou Hengfeng F.C. in the Chinese Super League . jin would move to China League One side Chongqing Dangdai Lifan F.C. in February 2011 . he would move to China League Two side Chengdu Better City F.C. in January 2012 . he would move to China League Two side Chongqing Dangdai Lifan F.C. in January 2013 .

SEPAM: Xia Jin (born 14 February 1985 in Chongqing) is a Chinese Football player who currently plays for Guizhou Hengfeng F.C. in the China League One . Xia Jin started his professional footballer career with Chongqing Dangdai Lifan F.C. in the Chinese Super League . Xia transferred to China League One side Chengdu Better City F.C. .

Table 7: An example of table-to-text generation.

Metrics  For automatic evaluation, we apply the widely used metrics BLEU(-4) (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005) and ROUGE (Lin, 2015) to evaluate generation quality. Following Dhingra et al. (2019), we also evaluate hallucinated generation with the PARENT score, which considers both the reference texts and the table content.

Analysis  Results in Table 4 show consistent improvement of SEPAM over the baselines across the different evaluation metrics. Table 7 shows an example of table-to-text generation. MLE hallucinates information that does not appear in the table. OT alleviates this issue, but still shows hallucination due to the imbalanced-transportation issue. SEPAM generates almost no extra information and covers all the entries in the table.

7 Conclusions

We incorporate prior knowledge into optimal transport to encourage partial-sentence matching, formulating it as an optimal partial transport problem. The proposed SEPAM shows broad applicability and consistent improvements over popular baselines in three downstream tasks: unsupervised style transfer for content preservation, image captioning for informative descriptions, and table-to-text generation for faithful generation. Furthermore, the proposed technique can be regarded as a soft-copying mechanism for Seq2Seq models.

Acknowledgments  The authors would like to thank Zhenyi Wang, Liqun Chen and Siyang Yuan for helpful thoughts and discussions. The Duke University research was funded in part by DARPA, DOE, NSF and ONR.

ReferencesPeter Anderson, Xiaodong He, Chris Buehler, Damien

Teney, Mark Johnson, Stephen Gould, and LeiZhang. 2018. Bottom-up and top-down attention forimage captioning and visual question answering. InCVPR.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, andKyunghyun Cho. 2018. Unsupervised neural ma-chine translation. In ICLR.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben-gio. 2015. Neural machine translation by jointlylearning to align and translate. In ICLR.

Satanjeev Banerjee and Alon Lavie. 2005. Meteor: Anautomatic metric for mt evaluation with improvedcorrelation with human judgments. In ACL Work-shop.

Luis Caffarelli and Robert J McCann. 2010. Freeboundaries in optimal transport and monge-ampereobstacle problems. Annals of mathematics.

Liqun Chen, Shuyang Dai, Chenyang Tao, HaichaoZhang, Zhe Gan, Dinghan Shen, Yizhe Zhang,Guoyin Wang, Ruiyi Zhang, and Lawrence Carin.2018. Adversarial text generation via feature-mover’s distance. In NeurIPS.

Liqun Chen, Zhe Gan, Yu Cheng, Linjie Li, LawrenceCarin, and Jingjing Liu. 2020. Graph optimal trans-port for cross-domain alignment. In ICML.

Liqun Chen, Yizhe Zhang, Ruiyi Zhang, ChenyangTao, Zhe Gan, Haichao Zhang, Bai Li, DinghanShen, Changyou Chen, and Lawrence Carin. 2019.Improving sequence-to-sequence learning via opti-mal transport. In ICLR.

Yen-Chun Chen and Mohit Bansal. 2018. Fast abstrac-tive summarization with reinforce-selected sentencerewriting. In ACL.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gul-cehre, Dzmitry Bahdanau, Fethi Bougares, HolgerSchwenk, and Yoshua Bengio. 2014. Learningphrase representations using rnn encoder-decoderfor statistical machine translation. In EMNLP.

Sumit Chopra, Michael Auli, and Alexander M Rush.2016. Abstractive sentence summarization with at-tentive recurrent neural networks. In NAACL.

Katy Craig, editor. 2014. The exponential formula forthe Wasserstein metric. PhD thesis, The State Uni-versity of New Jersey.

Ning Dai, Jianze Liang, Xipeng Qiu, and XuanjingHuang. 2019. Style transformer: Unpaired text styletransfer without disentangled latent representation.arXiv preprint arXiv:1905.05621.

Bhuwan Dhingra, Manaal Faruqui, Ankur Parikh,Ming-Wei Chang, Dipanjan Das, and William W.Cohen. 2019. Handling divergent reference textswhen evaluating table-to-text generation. In Pro-ceedings of 57th Annual Meeting of the Associationfor Computational Linguistics.

William Fedus, Ian Goodfellow, and Andrew M Dai.2018. Maskgan: Better text generation via filling inthe _. In ICLR.

Yang Feng, Lin Ma, Wei Liu, and Jiebo Luo. 2019. Un-supervised image captioning. In CVPR.

Alessio Figalli. 2010. The optimal partial transportproblem. Archive for rational mechanics and analy-sis.

Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao,and Rui Yan. 2018. Style transfer in text: Explo-ration and evaluation. In AAAI.

Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu,Kenneth Tran, Jianfeng Gao, Lawrence Carin, andLi Deng. 2017. Semantic compositional networksfor visual captioning. In CVPR.

Hongyu Gong, Suma Bhat, Lingfei Wu, JinJun Xiong,and Wen-mei Hwu. 2019. Reinforcement learningbased text style transfer without parallel training cor-pus. In ACL.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza,Bing Xu, David Warde-Farley, Sherjil Ozair, AaronCourville, and Yoshua Bengio. 2014. Generative ad-versarial nets. In NIPS.

Page 10: Semantic Matching for Sequence-to-Sequence Learningpeople.ee.duke.edu/~lcarin/Ruiyi_EMNLP_2020.pdfTo demonstrate the effectiveness of SEPAM, we consider applying it on sequence-prediction

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In ACL.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.

Zhiting Hu, Haoran Shi, Zichao Yang, et al. 2018. Texar: A modularized, versatile, and extensible toolkit for text generation. arXiv preprint arXiv:1809.00794.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Controllable text generation. In ICML.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In ICLR.

Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In ICML.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2017. Unsupervised machine translation using monolingual corpora only. In ICLR.

Rémi Lebret, David Grangier, and Michael Auli. 2016. Neural text generation from structured data with application to the biography domain. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Juncen Li, Robin Jia, He He, and Percy Liang. 2018. Delete, retrieve, generate: A simple approach to sentiment and style transfer. In NAACL.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries.

Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, and Ming-Ting Sun. 2017. Adversarial ranking for language generation. In NIPS.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In ECCV.

Tianyu Liu, Kexiang Wang, Lei Sha, Baobao Chang, and Zhifang Sui. 2018. Table-to-text generation by structure-aware seq2seq learning. In AAAI.

Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. 2017. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR.

Fuli Luo, Peng Li, Jie Zhou, Pengcheng Yang, Baobao Chang, Zhifang Sui, and Xu Sun. 2019. A dual reinforcement learning framework for unsupervised text style transfer. In IJCAI.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.

Gabriel Peyré, Marco Cuturi, et al. 2017. Computational optimal transport. Technical report.

Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W Black. 2018. Style transfer through back-translation. In ACL.

Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv:1509.00685.

Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. In ACL.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In NIPS.

Akhilesh Sudhakar, Bhargav Upadhyay, and Arjun Maheswaran. 2019. Transforming delete, retrieve, generate approach for controlled text style transfer. In EMNLP.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In NIPS.

Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In NIPS.

Alexey Tikhonov, Viacheslav Shibaev, Aleksander Nagaev, Aigul Nugmanova, and Ivan P. Yamshchikov. 2019. Style transfer for texts: Retrain, report errors, compare with rewrites. In EMNLP.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS.

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In CVPR.

Cédric Villani. 2008. Optimal transport: old and new. Springer Science & Business Media.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In CVPR.

Qingyun Wang, Xiaoman Pan, Lifu Huang, Boliang Zhang, Zhiying Jiang, Heng Ji, and Kevin Knight. 2018. Describing a knowledge base. arXiv preprint arXiv:1809.01797.

Wenlin Wang, Zhe Gan, Hongteng Xu, Ruiyi Zhang, Guoyin Wang, Dinghan Shen, Changyou Chen, and Lawrence Carin. 2019. Topic-guided variational autoencoders for text generation. In NAACL.

Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In EMNLP.

Sam Wiseman and Alexander M Rush. 2016. Sequence-to-sequence learning as beam-search optimization. In EMNLP.

Sam Wiseman, Stuart M. Shieber, and Alexander M. Rush. 2018. Learning neural templates for text generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Chen Wu, Xuancheng Ren, Fuli Luo, and Xu Sun. 2019. A hierarchical reinforced sequence operation method for unsupervised text style transfer. In ACL.

Yujia Xie, Xiangfeng Wang, Ruijia Wang, and Hongyuan Zha. 2018. A fast proximal point method for Wasserstein distance. arXiv:1802.04307.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C Courville, Ruslan Salakhutdinov, Richard S Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML.

Zichao Yang, Zhiting Hu, Chris Dyer, Eric P Xing, and Taylor Berg-Kirkpatrick. 2018. Unsupervised text style transfer using language models as discriminators. In NeurIPS.

Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI.

Ruiyi Zhang, Changyou Chen, Chunyuan Li, and Lawrence Carin. 2018. Policy optimization as Wasserstein gradient flows. In ICML.

Ruiyi Zhang, Tong Yu, Changyou Chen, and Lawrence Carin. 2019. Text-based interactive recommendation via constraint augmented reinforcement learning. In NeurIPS.

Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. 2017. Adversarial feature matching for text generation. In ICML.

A Model Details

A.1 Style Transfer

Assume there are two non-parallel corpora $\mathcal{X} = \{X^{(i)}\}_{i=1}^M$ and $\mathcal{Y} = \{Y^{(j)}\}_{j=1}^N$ with two styles, denoted as $v_\mathcal{X}$ and $v_\mathcal{Y}$, where² $Y = (y_1, y_2, \ldots, y_T)$ is of length $T$ with tokens $y_t$. The goal of style transfer is to transfer data $X$ with style $v_\mathcal{X}$ to style $v_\mathcal{Y}$, and vice versa. Specifically, we seek to learn the conditional generation distributions $p(Y|X, v_\mathcal{Y})$ and $p(X|Y, v_\mathcal{X})$. Style transfer is challenging because the datasets are non-parallel, meaning the data do not contain pairwise information as in standard sequence-to-sequence studies (Sutskever et al., 2014). Thus, both text-style inference and text-style generation must be performed in the same model based on the non-parallel data.

²For simplicity, we omit the superscript "(j)" when the context is independent of $j$; the same applies to $X$.

We assume each sentence is generated from a content vector $z \sim p(z)$ and a style vector $v \sim p(v)$ (Hu et al., 2017), where $p(z)$ and $p(v)$ are the distributions of content and style, respectively. Style transfer is then formulated as learning a conditional distribution:

$$p(Y|X, v_\mathcal{Y}) = \int_{z_x} p(Y|z_x, v_\mathcal{Y})\, q(z_x|X)\, \mathrm{d}z_x,$$

where $p$ corresponds to a decoder and $q$ to an encoder within the encoder-decoder framework; $z$ is the latent code containing content information. In practice, we first pretrain in an auto-encoder manner; a Generative Adversarial Network (GAN) (Goodfellow et al., 2014) is then used to further enhance the style-transfer ability.

Generator Learning  A content vector $z_x$ is given by an encoder $\mathrm{Enc}(\cdot)$ with input $X$, i.e., $z_x = \mathrm{Enc}(X)$. Based on $z_x$ and the target style $v_\mathcal{Y}$, an LSTM-based (Hochreiter and Schmidhuber, 1997) decoder $G(\cdot)$ generates a new sentence $Y$ that is expected to be in style $v_\mathcal{Y}$. The auto-encoder can be trained by minimizing the following reconstruction loss:

$$\mathcal{L}_{\mathrm{AE}}(\theta) = \mathbb{E}_{X \sim \mathcal{X}}\left[-\log p_\theta(X|z_x, v_\mathcal{X})\right] + \mathbb{E}_{Y \sim \mathcal{Y}}\left[-\log p_\theta(Y|z_y, v_\mathcal{Y})\right]. \quad (10)$$
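To make Eq. (10) concrete, the following is a minimal PyTorch sketch of the reconstruction term; the function name, tensor shapes, and padding convention are illustrative assumptions rather than the paper's exact implementation.

```python
import torch.nn.functional as F

def reconstruction_loss(logits_x, x_tokens, logits_y, y_tokens, pad_id=0):
    """Sketch of Eq. (10): reconstruct each corpus under its own style.

    logits_x / logits_y: decoder outputs of shape (batch, seq_len, vocab),
    produced from (z_x, v_X) and (z_y, v_Y), respectively;
    x_tokens / y_tokens: gold token ids of shape (batch, seq_len).
    """
    # Token-level negative log-likelihood, ignoring padding positions.
    nll_x = F.cross_entropy(logits_x.transpose(1, 2), x_tokens,
                            ignore_index=pad_id)
    nll_y = F.cross_entropy(logits_y.transpose(1, 2), y_tokens,
                            ignore_index=pad_id)
    return nll_x + nll_y
```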

Standard methods employ only a reconstruction loss for the auto-encoder to achieve indirect content preservation, i.e., to encourage the generation of the original sentences with the original style. This reconstruction loss may conflict with the discriminator loss, which encourages the decoder to transfer some words to fit another style. Consequently, there is a trade-off between content preservation and style transfer. To alleviate this issue, SEPAM can partially match the content information while leaving the sentiment information to be transferred. Based on this, we define a syntax-aware loss:

$$\mathcal{L}_{\mathrm{SEM}}(\theta_G) = \mathbb{E}_{X \sim \mathcal{X},\, Y \sim p(z_x, v_\mathcal{Y})}\left[\mathcal{W}^p_c(Y, X)\right] + \mathbb{E}_{Y \sim \mathcal{Y},\, X \sim p(z_y, v_\mathcal{X})}\left[\mathcal{W}^p_c(X, Y)\right]. \quad (11)$$

Our overall objective is defined as:

$$\min_\theta\; \mathcal{L}_{\mathrm{AE}} + \lambda\, \mathcal{L}_{\mathrm{SEM}} + \lambda_c\, \mathcal{L}_c + \lambda_{lm}\, \mathcal{L}_{lm}, \quad (12)$$

where $\lambda$, $\lambda_c$, and $\lambda_{lm}$ are hyper-parameters controlling the scales of the different losses; $\mathcal{L}_c$ and $\mathcal{L}_{lm}$ are, respectively, the discriminator loss of a binary classifier (Hu et al., 2017; Shen et al., 2017) and of a style-specific language model (Yang et al., 2018), used to guide text style transfer.
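A hypothetical sketch of how the four losses of Eq. (12) might be combined during a training step; the weight values below are placeholders, not those used in the paper.

```python
def overall_objective(l_ae, l_sem, l_c, l_lm,
                      lam=0.1, lam_c=0.2, lam_lm=0.5):
    # Eq. (12): reconstruction + syntax-aware OT + classifier + LM guidance.
    return l_ae + lam * l_sem + lam_c * l_c + lam_lm * l_lm
```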

Discriminator 1: Style Classifier  To guide the style transfer, a model needs to be trained to estimate the sentence style. Hu et al. (2017) imposed a style classifier $D_c$ as the discriminator. The classifier is trained to distinguish sentences with different styles on $\mathcal{X}$ and $\mathcal{Y}$. Suppose $Y$ is sampled from $p(Y|z_x, v_\mathcal{Y})$ with a continuous approximation (Hu et al., 2017), and $p_c(Y)$ is the probability of the sentence $Y$ evaluated with the classifier $D_c$. We adopt the following loss to further improve the generator:

$$\mathcal{L}_c(\theta_G) = \mathbb{E}_{X \sim \mathcal{X},\, Y \sim p(z_x, v_\mathcal{Y})}\left[-\log p_c(Y)\right] + \mathbb{E}_{Y \sim \mathcal{Y},\, X \sim p(z_y, v_\mathcal{X})}\left[-\log p_c(X)\right]. \quad (13)$$

Discriminator 2: Language Model  Yang et al. (2018) propose another type of training guidance for style transfer. The idea is to impose two language models, $D_{lx}$ and $D_{ly}$, as discriminators, each learning a conditional distribution over the current word given the previous words in a sequence. The language models are trained to represent sentences with different styles, i.e., $\mathcal{X}$ and $\mathcal{Y}$. Let $p_{ly}(Y)$ be the probability of a sentence $Y$ evaluated with the language model $D_{ly}$, with $p_{lx}(X)$ defined similarly for $X$. We have

$$\mathcal{L}_{lm}(\theta_G) = \mathbb{E}_{X \sim \mathcal{X},\, Y \sim p(z_x, v_\mathcal{Y})}\left[-\log p_{ly}(Y)\right] + \mathbb{E}_{Y \sim \mathcal{Y},\, X \sim p(z_y, v_\mathcal{X})}\left[-\log p_{lx}(X)\right]. \quad (14)$$
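Both guidance signals reduce to negative log-probabilities of the transferred sentences under the corresponding discriminator. A minimal sketch, assuming the probability tensors have been precomputed with a continuous relaxation of the samples (all names are illustrative):

```python
import torch

def guidance_losses(p_c_y, p_c_x, p_ly_y, p_lx_x, eps=1e-8):
    """Sketch of Eqs. (13)-(14) for a batch of transferred sentences.

    p_c_y / p_c_x: classifier probabilities that the transferred sentences
    carry the target style; p_ly_y / p_lx_x: probabilities of the same
    sentences under the target-style language models.
    """
    loss_c = -(torch.log(p_c_y + eps).mean()
               + torch.log(p_c_x + eps).mean())    # Eq. (13)
    loss_lm = -(torch.log(p_ly_y + eps).mean()
                + torch.log(p_lx_x + eps).mean())  # Eq. (14)
    return loss_c, loss_lm
```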

Algorithm 1 IPOT for SEPAM
1:  Input: feature vectors μ = {z_i}_{i=1}^n, ν′ = {z′_j}_{j=1}^m, and generalized stepsize 1/λ
2:  σ = (1/m) 1_m,  T^(1) = 1_n 1_m^⊤
3:  C_ij = c(z_i, z′_j),  A_ij = exp(−C_ij/λ),  C̃ = C ⊙ M
4:  for t = 1, 2, 3, ... do
5:      Q = A ⊙ T^(t)            // ⊙ is the Hadamard product
6:      for k = 1, ..., K do
7:          δ = 1/(n Q σ),  σ = 1/(m Q^⊤ δ)
8:      end for
9:      T^(t+1) = diag(δ) Q diag(σ)
10: end for
11: Return ⟨T, C̃⟩

Parameter Tuning  In practice, we observe that when $\lambda_c$ is large, the transfer accuracy increases but content preservation and fluency become very poor. Similarly, when $\lambda_{lm}$ is large enough, we observe fluent sentences but very low transfer accuracy (less than 50%). Even when techniques such as length normalization (Yang et al., 2018) are applied, merely optimizing $\mathcal{L}_{lm}$ and $\mathcal{L}_c$ usually leads to mode collapse, namely generating similar short sentences. Giving a higher weight to the reconstruction loss alleviates this issue. The syntax-aware OT loss behaves similarly to the reconstruction loss, but its weakness is the lack of word-order information, so we cannot place a large weight on it. In practice, we stop the gradient of the word embeddings in the syntax-aware OT loss, to prevent the model from achieving a lower loss by cheating on the word embeddings; using pretrained word embeddings is an alternative.
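For concreteness, the following is a minimal PyTorch sketch of Algorithm 1 combined with the embedding stop-gradient just described. The cosine cost, the soft-word construction, the default hyper-parameters, and all names are assumptions for illustration, not the paper's exact implementation.

```python
import torch

def sepam_ipot_loss(gen_probs, tgt_ids, emb_table, M,
                    lam=1.0, n_outer=20, n_inner=1):
    """Sketch of Algorithm 1: IPOT on a masked cost (names illustrative).

    gen_probs: (n, V) soft word distributions from the decoder;
    tgt_ids:   (m,)  token ids of the partial target;
    emb_table: (V, d) word-embedding matrix;
    M:         (n, m) binary mask restricting which word pairs may match;
    1/lam is the generalized stepsize of IPOT.
    """
    # Stop the gradient of the word embeddings (as described above): the
    # loss still reaches the decoder through gen_probs, but cannot "cheat"
    # by reshaping the embedding space itself.
    E = emb_table.detach()
    z_gen = gen_probs @ E                     # (n, d) expected embeddings
    z_tgt = E[tgt_ids]                        # (m, d) target embeddings

    # Cosine cost between word features (a common choice; an assumption).
    zs = torch.nn.functional.normalize(z_gen, dim=1)
    zt = torch.nn.functional.normalize(z_tgt, dim=1)
    C = 1.0 - zs @ zt.T                       # C_ij = c(z_i, z'_j)
    A = torch.exp(-C / lam)                   # A_ij = exp(-C_ij / lam)
    C_masked = C * M                          # C~ = C (Hadamard) M

    n, m = C.shape
    sigma = C.new_full((m,), 1.0 / m)         # sigma = (1/m) 1_m
    T = C.new_ones(n, m)                      # T^(1) = 1_n 1_m^T
    for _ in range(n_outer):
        Q = A * T                             # Hadamard product
        for _ in range(n_inner):              # K inner Sinkhorn-style steps
            delta = 1.0 / (n * (Q @ sigma))
            sigma = 1.0 / (m * (Q.T @ delta))
        T = delta.unsqueeze(1) * Q * sigma.unsqueeze(0)
    # <T, C~>; treating the plan T as a constant so gradients flow only
    # through the cost is a common practice, also an assumption here.
    return torch.sum(T.detach() * C_masked)
```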

B More Related Work

Style Transfer  Adversarial training, which has yielded significant success in sequence generation (Yu et al., 2017; Lin et al., 2017; Fedus et al., 2018), is particularly helpful for disentangling representations of style and content in style transfer. Fu et al. (2018) used multiple decoders and style embeddings for style transfer. Hu et al. (2017) and Shen et al. (2017) proposed using a style classifier to guide text style transfer. Yang et al. (2018) take fluency into consideration and employ language models as discriminators. Prabhumoye et al. (2018) use the back-translation technique for style transfer, which has been widely used in unsupervised machine translation (Artetxe et al., 2018; Lample et al., 2017); this technique can also be integrated into our model for further improvement. RL-based style transfer has been explored by Wu et al. (2019), Luo et al. (2019), and Gong et al. (2019), showing improved performance that benefits from exploration. Further, retrieval-based methods (Sudhakar et al., 2019; Tikhonov et al., 2019) can better capture the nature of style transfer and provide reasonable operations.

C Generated Examples

Tables 8, 9, and 10 show generated examples of SEPAM and comparisons with its baselines.

Slot Name                   Slot Value
<Name_ID>                   Zenon Caravella
<member of sports team>     Newcastle Jets FC
<date of birth>             17 March 1983
<place of birth>            Cairns
<country of citizenship>    Australia
<sport>                     Association football

MLE: Zenon Newcastle (born 17 March 1983) is an Australian professional Association football who plays as a striker for Newcastle Jets FC in the A-League . Zenon Newcastle was born in Cairns Australia . Newcastle was signed by Newcastle Jets FC for an undisclosed fee believed to be around £ 500,000 . Zenon signed for A-League club Newcastle Jets FC for the 2014 – 15 season.

OT: Zenon Zenon (born 17 March 1983 in Cairns Australia) is an Australian Association football who last played for the Newcastle Jets FC in the A-League . he played for the Newcastle Jets FC in the Hyundai A-League in the A-League . Zenon was a member of the Newcastle Jets FC team that won the A-League championship in 2005.

SEPAM: Zenon Newcastle (born 17 March 1983 in Cairns Australia) is an Australian Association football who currently plays for Newcastle Jets FC.

Table 10: Examples of table-to-text generation.

Original:    there is definitely not enough room in that part of the venue .
Transferred: there is definitely good enough room in that part of the venue .

Original:    there was only meat and bread .
Transferred: there was outstanding meat and bread .

Original:    this was my first time trying thai food and the waitress was amazing !
Transferred: this was my first experience with the thai food and waitress was absolutely disappointing .

Original:    owner : a very rude man .
Transferred: owner : a very polite man .

Original:    the staff was warm and friendly .
Transferred: the staff was slow and rude .

Original:    furthermore , there is no sausage or bacon on the menu .
Transferred: outstanding , there is wonderful sausage or bacon on the menu .

Original:    the service is friendly and quick especially if you sit in the bar .
Transferred: the customer service is like ok - definitely a reason for never go back ..

Original:    staff : very cute and friendly .
Transferred: staff : very horrible and rude .

Original:    stopped in for lunch , nice wine list , good service .
Transferred: stopped in for lunch , dishonest wine list , bad service .

Original:    we ate here yesterday for happy hour and it was fantastic .
Transferred: we ate here yesterday for disappointment hour and it was unacceptable .

Original:    a great museum to visit .
Transferred: a horrible museum to visit .

Table 8: Sentiment transfer samples of SEPAM+LM on the Yelp dataset.

Original:         tasted really old , i could n’t believe it .
Classifier:       adds really top , i could gorgeous believe it .
SEPAM+Classifier: tasted really surprisingly , i could fantastic believe it .
LM:               tasted really great , i could always believe it .
SEPAM+LM:         tasted really excellent , i could always believe it .

Original:         absolutely terrible , do not order from this place .
Classifier:       absolutely exquisite , do authentic order from this place .
SEPAM+Classifier: absolutely exquisite , do fantastic order from this place .
LM:               absolutely great , service definitely order from this place .
SEPAM+LM:         absolutely excellent , do always order from this place .

Original:         they do not stock some of the most common parts .
Classifier:       they do fantastic laughed some of the most common parts .
SEPAM+Classifier: they do authentic expertly some of the most fascinating parts .
LM:               they do definitely right some of the most cool parts .
SEPAM+LM:         they do always stock some of the most amazing parts .

Original:         the woman who helped me today was very friendly and knowledgeable .
Classifier:       the woman who so-so me today was very rude and knowledgeable .
SEPAM+Classifier: the woman who helped me today was very rude and knowledgeable .
LM:               the woman who ridiculous me today was very rude and knowledgeable .
SEPAM+LM:         the woman who helped me today was very rude and rude .

Table 9: Comparison of methods on the Yelp dataset.