Unsupervised Discrete Sentence Representation Learning for Interpretable Neural Dialog Generation

Group: Gaussian's Confusion
December 7, 2019
Department of Computer Engineering, University of Virginia




Collaborators

Shimin Lei: Coding, PPT production, Result Analysis
Yining Liu: Coding, Data preprocessing, Result Analysis
Leizhen Shi: Coding, Data visualization
Hanwen Huang: Coding, Data collection

Code: https://github.com/ShiminLei/LA-Dialog-Generation-System


Motivation and Background


Dialogue Act (DA)

Discourse structure is an important part of understanding dialogue and plays a key role in dialog generation systems. A useful way to describe discourse structure is to identify Dialogue Acts (DAs), which represent the meaning of an utterance at the level of illocutionary force [Stolcke et al., 2000].

Tag              | Example
-----------------|----------------------------------------
STATEMENT        | I'm in the engineering department.
REJECT           | Well, no.
OPINION          | I think it's great.
AGREEMENT/ACCEPT | That's exactly it.
YES-NO-QUESTION  | Do you have any special training?

Table: Dialogue Act Examples


Conventional vs. Neural Dialog System

- Conventional dialog system: the action in a semantic frame usually contains hand-crafted dialog acts and slot values [Williams and Young, 2007], but it is hard to design a fine-grained system manually.

- Neural dialog system: a powerful framework without the need for hand-crafted meaning representations [Chung et al., 2014], but it cannot provide interpretable system actions the way conventional dialog systems can.


Goal

Given the importance of dialogue-act interpretation and the merits of neural dialog systems, the goal is to develop a neural network model that can discover interpretable meaning representations of utterances as a set of discrete latent variables (latent actions).


Related Work


Latent Variable Dialog Models

- The models proposed by [Vlad Serban et al., 2016] are based on Conditional Variational Autoencoders, where latent variables facilitate the generation of long outputs and encourage diverse responses.

- In the work of [Zhao et al., 2017], dialog acts are further introduced to guide the learning of the Conditional Variational Autoencoder.

- In recent research on task-oriented dialog systems [Wen et al., 2017], discrete latent variables have been used to represent intention.


Sentence Representation Learning with Neural Networks

- Most work has focused on continuous distributed representations of sentences; e.g., Skip-Thought learns by predicting the previous and next sentences [Kiros et al., 2015].

- Although passing gradients through discrete variables is difficult, Gumbel-Softmax [Jang et al., 2016] makes back-propagation possible by using samples from a continuous distribution to approximate samples from a discrete one.


Claim / Target Task and an Intuitive Figure Showing WHY the Claim Holds


Proposed Model

- Develop an unsupervised neural recognition model that can discover interpretable latent actions (meaning representations) from a large unlabelled corpus.


Proposed Model

Networks:

- Recognition network R: q_R(z|x), which maps a sentence x to the latent variable z.
- Generation network G: defines the learning signals that are used to train the representation z.

The discovered meaning representations can then be integrated with encoder-decoder networks to achieve interpretable dialog generation.
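As a sketch of how the two networks fit together, the following minimal NumPy example maps a sentence encoding through a recognition head to M categorical posteriors and turns the chosen codes into a decoder initial state. All sizes, weights, and function names here are made up for illustration; the real model learns these jointly with RNN encoders and decoders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: hidden size H, M latent variables with K classes each.
H, M, K = 16, 3, 5

W_logits = rng.normal(size=(H, M * K))   # recognition head: h -> M*K logits
W_init = rng.normal(size=(M * K, H))     # maps one-hot codes -> decoder initial state

def recognize(h):
    """R: map a sentence encoding h to M categorical posteriors q_R(z_m | x)."""
    logits = (h @ W_logits).reshape(M, K)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # row-wise softmax
    return e / e.sum(axis=1, keepdims=True)

def decoder_init(z_onehot):
    """G: turn sampled one-hot codes into the decoder RNN's initial state h_0^G."""
    return z_onehot.reshape(-1) @ W_init

h = rng.normal(size=H)               # stand-in for the encoder's last hidden state
q = recognize(h)                     # (M, K) posteriors
z = np.eye(K)[q.argmax(axis=1)]      # greedy one-hot codes (Gumbel sampling omitted)
h0 = decoder_init(z)                 # (H,) decoder initial state
```

In the actual model, the greedy argmax is replaced by Gumbel-Softmax sampling so that gradients can flow through the discrete choice.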


Proposed Solution and Implementation


Learning Sentence Representations

Two methods:

- Learning sentence representations from auto-encoding
- Learning sentence representations from the context


Learning Sentence Representations from Auto-Encoding

DI-VAE: Discrete infoVAE with BPR.

- The recognition network is an RNN whose last hidden state h^R_|x| represents the sentence x.
- Define z to be a set of K-way categorical variables z = {z_1 ... z_m ... z_M}.
- For each z_m, define its posterior distribution as q_R(z_m|x). We use the Gumbel-Softmax trick (which solves the back-propagation problem for discrete variables) to sample from this distribution.
- Transform the latent samples z_1 ... z_M into h^G_0, the initial state of the generation network (an RNN).
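A minimal NumPy sketch of the Gumbel-Softmax trick used above (illustrative only, not the authors' implementation; the function name and temperature value are assumptions):

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable approximation of drawing a one-hot categorical sample.
    Adds Gumbel(0,1) noise to the logits and applies a temperature-tau softmax;
    as tau -> 0 the output approaches a true one-hot sample."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(1e-12, 1.0, size=np.shape(logits))
    g = -np.log(-np.log(u))                    # Gumbel(0, 1) noise
    y = (np.asarray(logits) + g) / tau
    y = y - y.max(axis=-1, keepdims=True)      # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

# a relaxed sample over K = 3 classes with a low temperature
sample = gumbel_softmax(np.log([0.7, 0.2, 0.1]), tau=0.1,
                        rng=np.random.default_rng(0))
```

Because the output is a smooth function of the logits, gradients can flow from the decoder back into the recognition network despite the discrete latent space.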


Anti-Information Limitation of ELBO

VAEs often ignore the latent variable, especially when equipped with powerful decoders, a problem known as posterior collapse. To understand this behavior, we decompose the ELBO in a novel way:

L_VAE = E_x[ E_{q_R(z|x)}[log p_G(x|z)] − KL(q_R(z|x) || p(z)) ]
      = E_{q(z|x)p(x)}[log p(x|z)] − I(Z, X) − KL(q(z) || p(z))

where I(Z, X) is the mutual information between Z and X, and q(z) = E_x[q_R(z|x)].

This shows that the KL term in the ELBO tries to reduce the mutual information between the latent variables and the input data, which explains why posterior collapse happens.


VAE with Information Maximization and BPR

- Information Maximization: To correct the anti-information issue, we maximize both the data-likelihood lower bound and the mutual information, i.e. we optimize

  L_VAE + I(Z, X) = E_{q(z|x)p(x)}[log p(x|z)] − KL(q(z) || p(z))

- Batch Prior Regularization (BPR): To minimize KL(q(z) || p(z)), let x_n be a sample from a batch of N data points. Then

  q(z) ≈ (1/N) Σ_{n=1}^{N} q(z|x_n) = q'(z)

  and we can approximate KL(q(z) || p(z)) by

  KL(q'(z) || p(z)) = Σ_{k=1}^{K} q'(z = k) log [ q'(z = k) / p(z = k) ]

  This approximation is referred to as BPR.
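The batch average q'(z) and the KL sum above can be computed directly. A small NumPy sketch, assuming a uniform prior p(z) = 1/K (the function name is invented here for illustration):

```python
import numpy as np

def batch_prior_regularization(q_zx, p_z=None):
    """KL(q'(z) || p(z)) where q'(z) is the batch-averaged posterior.
    q_zx: array of shape (N, K); row n holds q(z = k | x_n) over the K classes."""
    q_batch = q_zx.mean(axis=0)                  # q'(z) = (1/N) sum_n q(z | x_n)
    K = q_batch.shape[0]
    p_z = np.full(K, 1.0 / K) if p_z is None else np.asarray(p_z)
    mask = q_batch > 0                           # convention: 0 * log 0 = 0
    return float(np.sum(q_batch[mask] * (np.log(q_batch[mask]) - np.log(p_z[mask]))))
```

Two sanity checks: if every posterior in the batch is uniform, the batch average matches the uniform prior and BPR is 0; if every posterior is the same one-hot vector, BPR reaches its maximum of log K.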


Learning Sentence Representations from the Context

DI-VST: extends DI-VAE to a Discrete Information Variational Skip-Thought model.

- Skip-Thought (ST): the meaning of language can be inferred from the adjacent context.
- Use the same recognition network as DI-VAE to output z's posterior distribution q_R(z|x).
- Given samples from q_R(z|x), two RNN generators are used to predict the previous sentence x_p and the next sentence x_n.
- The objective to maximize is

  L_DI-VST = E_{q_R(z|x)p(x)}[log( p^n_G(x_n|z) p^p_G(x_p|z) )] − KL(q(z) || p(z))
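Putting the pieces together, a hedged sketch of a Monte-Carlo estimate of this objective, assuming the per-sentence decoder log-likelihoods have already been computed and using the batch-averaged posterior with a uniform prior for the KL term, as in BPR (the function name is illustrative):

```python
import numpy as np

def di_vst_objective(logp_next, logp_prev, q_zx):
    """Batch estimate of L_DI-VST.
    logp_next, logp_prev: (N,) arrays of log p_G(x_n | z) and log p_G(x_p | z)
    under the sampled latent codes; q_zx: (N, K) posteriors for the KL term."""
    recon = np.mean(logp_next + logp_prev)     # context reconstruction term
    q_batch = q_zx.mean(axis=0)                # q'(z), as in BPR
    K = q_batch.shape[0]
    mask = q_batch > 0
    kl = np.sum(q_batch[mask] * (np.log(q_batch[mask]) - np.log(1.0 / K)))
    return float(recon - kl)
```

With uniform posteriors the KL term vanishes and the objective reduces to the average context log-likelihood.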


Data Summary


Datasets

The proposed methods are evaluated on five datasets.

- Penn Treebank (PTB)
- Stanford Multi-Domain Dialog (SMD)
- Daily Dialog (DD)
- Switchboard (SW)
- Multimodal EmotionLines Dataset (MELD)


Reproduction Experimental Results and Analysis


Comparing Discrete Sentence Representation Models

Part 1: Evaluate the proposed model's performance.

For comparison, we use several baseline models.

Unregularized models:

- DAE: DI-VAE with the KL(q||p) term removed.
- DST: DI-VST with the KL(q||p) term removed.

ELBO models (KL-annealing and bag-of-words loss used):

- DVAE (posterior collapse): the basic discrete sentence VAE that optimizes the ELBO with the regularization term KL(q(z|x)||p(z)).
- DVST (posterior collapse): the basic discrete sentence variational skip-thought model that optimizes the ELBO with the regularization term KL(q(z|x)||p(z)).


Comparing Discrete Sentence Representation Models

Other models:

- VAE: VAE with continuous latent variables (results by Zhao et al., 2017).
- RNNLM: standard GRU-RNN language model (results by Zaremba et al., 2014).


Comparing Discrete Sentence Representation Models

Comparison results (for all models, the discrete latent space uses M=20 and K=10, and the mini-batch size is 30):


Reproduction Result

Dom  | Model  | PPL                     | KL(q||p) | I(x, z)
-----|--------|-------------------------|----------|--------
PTB  | DAE    | 63.443                  | 1.671    | 0.514
PTB  | DVAE   | 73.744                  | 0.249    | 0.025
PTB  | DI-VAE | 52.751                  | 0.130    | 1.207
MELD | DAE    | 55.884                  | 2.047    | 0.237
MELD | DVAE   | 92.893                  | 0.060    | 0.055
MELD | DI-VAE | 44.800                  | 0.054    | 1.005
DD   | DST    | xp: 28.967 / xn: 29.659 | 2.303    | 0.000
DD   | DVST   | xp: 87.964 / xn: 90.818 | 0.023    | 0.004
DD   | DI-VST | xp: 28.073 / xn: 28.085 | 0.084    | 1.015
MELD | DST    | xp: 68.237 / xn: 69.367 | 2.303    | 0.000
MELD | DVST   | xp: 88.166 / xn: 88.148 | 0.032    | 0.002
MELD | DI-VST | xp: 67.324 / xn: 68.778 | 0.007    | 0.099


Comparing Discrete Sentence Representation Models

Analysis:

- All models achieve better perplexity than the RNNLM, which shows they manage to learn a meaningful q(z|x).
- DI-VAE achieves the best results on all metrics compared with the others.
- DI-VAE vs. DAE:
  1. DAE learns quickly but is prone to overfitting.
  2. Since DAE has no regularization term on the latent space, q(z) is very different from p(z), which prevents us from generating sentences from the latent space.


Comparing Discrete Sentence Representation Models

- DI-VST vs. DVST and DST:
  1. DI-VST achieves the lowest PPL.
- These results confirm the effectiveness of the proposed BPR in regularizing q(z) while learning a meaningful posterior q(z|x).


Comparing Discrete Sentence Representation Models

Part 2: Understand BPR's sensitivity to the batch size N.

To understand this sensitivity, we varied the batch size from 2 to 60 (if N=1, DI-VAE is equivalent to DVAE).


Reproduction Result

For PTB dataset:


Reproduction Result

For MELD dataset:


Comparing Discrete Sentence Representation Models

Analysis:

- As N increases, perplexity and I(x, z) monotonically improve, while KL(q||p) only increases from 0 to approximately 0.16.
- After N > 30, the performance plateaus. Therefore, using a mini-batch is an efficient trade-off between q(z) estimation and computation speed.


Comparing Discrete Sentence Representation Models

Part 3: Relation between representation learning and the dimension of the latent space.

We set a fixed budget by restricting the maximum number of modes to about 1000, i.e. K^M ≈ 1000.
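As a quick illustration of the budget, a few lines of Python enumerate the (K, M) settings whose mode count K^M falls within roughly 10% of 1000; the tolerance and search ranges are arbitrary choices for this sketch.

```python
# Enumerate (K, M) configurations whose number of modes K^M stays
# within ~10% of a fixed budget of 1000.
budget = 1000
configs = sorted(
    (K, M)
    for K in range(2, budget + 1)
    for M in range(1, 11)
    if abs(K**M - budget) <= 0.1 * budget
)
```

Among others, this recovers (1000, 1) with 1000 modes, (10, 3) with 10^3 = 1000, and (4, 5) with 4^5 = 1024, matching the settings compared in the tables below.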


Reproduction Result

For the PTB dataset:

K, M   | K^M  | PPL    | KL(q||p) | I(x, z)
-------|------|--------|----------|--------
1000,1 | 1000 | 76.240 | 0.028    | 0.254
10,3   | 1000 | 72.815 | 0.054    | 0.539
4,5    | 1024 | 67.537 | 0.079    | 0.757

For the MELD dataset:

K, M   | K^M  | PPL    | KL(q||p) | I(x, z)
-------|------|--------|----------|--------
1000,1 | 1000 | 67.567 | 0.000    | 0.004
10,3   | 1000 | 65.051 | 0.017    | 0.440
4,5    | 1024 | 61.214 | 0.013    | 0.418

Analysis: Models with multiple small latent variables perform significantly better than those with few large latent variables.


Interpreting Latent Actions

The question is how to interpret the meaning of the learned latent action symbols. The latent action of an utterance x_n is obtained from a greedy mapping:

a_n = argmax_k q_R(z = k | x_n)

We set M=3 and K=5, so there are at most 125 different latent actions, and each x_n can be represented as a1 → a2 → a3, e.g. "How are you?" → 1-4-2.
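The greedy mapping can be sketched in a few lines; the posteriors below are made-up numbers for illustration, and the 1-indexed "a1-a2-a3" string follows the slides' notation.

```python
import numpy as np

def latent_action(q_z):
    """Greedy mapping a_m = argmax_k q_R(z_m = k | x), reported 1-indexed
    in the a1 -> a2 -> a3 notation from the slides."""
    return "-".join(str(k + 1) for k in np.asarray(q_z).argmax(axis=1))

# M = 3 latent variables, K = 5 classes each; illustrative posteriors
q = np.array([
    [0.90, 0.02, 0.03, 0.03, 0.02],   # z1 -> class 1
    [0.10, 0.10, 0.10, 0.60, 0.10],   # z2 -> class 4
    [0.05, 0.80, 0.05, 0.05, 0.05],   # z3 -> class 2
])
action = latent_action(q)             # "1-4-2"
```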


Interpreting Latent Actions

For manually clustered data, we use the homogeneity metric, which measures whether each latent action contains only members of a single class.

Summary: For dialog acts, DI-VST performs better on DD and worse on SW than DI-VAE. One reason is that the dialog acts in SW are more fine-grained (42 acts) than those in DD (5 acts), so distinguishing utterances based on the words in x is more important than the information in the neighbouring utterances.
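For reference, the homogeneity metric can be computed from entropies as h = 1 − H(C|K)/H(C), which is the standard definition; the stdlib-only sketch below is illustrative, not the authors' evaluation code.

```python
import math
from collections import Counter

def homogeneity(classes, clusters):
    """h = 1 - H(C|K)/H(C): 1.0 when every latent action (cluster) contains
    members of a single ground-truth class, 0.0 when clusters carry no
    information about the classes."""
    n = len(classes)

    def entropy(labels):
        return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

    h_c = entropy(classes)
    if h_c == 0.0:
        return 1.0
    # conditional entropy H(C | K), summed over clusters
    h_c_given_k = 0.0
    for k in set(clusters):
        members = [c for c, kk in zip(classes, clusters) if kk == k]
        m = len(members)
        h_c_given_k += sum(-(cnt / n) * math.log(cnt / m)
                           for cnt in Counter(members).values())
    return 1.0 - h_c_given_k / h_c
```

A perfect clustering gives 1.0, while a single cluster containing everything gives 0.0.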


Reproduction Result

For DailyDialog Dataset with K=10, M=10:

Model  | Act     | Emotion
-------|---------|--------
DI-VAE | 0.15972 | 0.10352
DI-VST | 0.13797 | 0.07356

Analysis: The homogeneity for Act is higher than for Emotion, indicating that the latent actions capture dialog-act information better than emotion information.


Interpreting Latent Actions

Other Analysis:

- Since DI-VAE is trained to reconstruct its input and DI-VST is trained to model the context, they group utterances in different ways.

- For example, DI-VST may group "Can I get a restaurant" and "I am looking for a restaurant" into one action, whereas DI-VAE may assign them two different actions.


Interpreting Latent Actions

- An example of latent actions discovered in SMD using the two methods.


Conclusion and Future Work


Conclusion

- This paper presents a novel unsupervised framework that enables the discovery of discrete latent actions and interpretable dialog response generation.

- The main contributions are the two sentence representation models, DI-VAE and DI-VST, and their integration with encoder-decoder models.

- Experiments show that the proposed methods outperform strong baselines in learning discrete latent variables and demonstrate the effectiveness of interpretable dialog response generation.


Future Work

- The findings suggest promising future research directions, including learning better context-based latent actions and using reinforcement learning to adapt policy networks.

- This work is an important step towards creating generative dialog models that can not only generalize to large unlabelled datasets in complex domains but also be explainable to human users.


Reference

Chung, J., Gülçehre, Ç., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555.

Jang, E., Gu, S., and Poole, B. (2016). Categorical reparameterization with Gumbel-Softmax. arXiv:1611.01144.

Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R. S., Torralba, A., Urtasun, R., and Fidler, S. (2015). Skip-thought vectors. CoRR, abs/1506.06726.

Stolcke, A., Coccaro, N., Bates, R., Taylor, P., Van Ess-Dykema, C., Ries, K., Shriberg, E., Jurafsky, D., Martin, R., and Meteer, M. (2000). Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26(3):339–373.

Serban, I. V., Sordoni, A., Lowe, R., Charlin, L., Pineau, J., Courville, A., and Bengio, Y. (2016). A hierarchical latent variable encoder-decoder model for generating dialogues. arXiv:1605.06069.

Wen, T.-H., Miao, Y., Blunsom, P., and Young, S. (2017). Latent intention dialogue models. arXiv:1705.10229.

Williams, J. D. and Young, S. (2007). Partially observable Markov decision processes for spoken dialog systems. Computer Speech & Language, 21(2):393–422.

Zhao, T., Zhao, R., and Eskénazi, M. (2017). Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. CoRR, abs/1703.10960.


Thank you!
