![Page 1: Deliberation Networks: Sequence Generation Beyond One-Pass ... · Deliberation Networks: Sequence Generation Beyond One-Pass Decoding Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin,](https://reader036.vdocuments.us/reader036/viewer/2022081518/6053b91f750d75248a37282f/html5/thumbnails/1.jpg)
Deliberation Networks: Sequence Generation
Beyond One-Pass Decoding
Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin, Tao Qin, Nenghai
Yu, Tie-Yan Liu
Lisa Kuhn
Recent Advances In Sequence-To-Sequence Learning
![Page 2: Deliberation Networks: Sequence Generation Beyond One-Pass ... · Deliberation Networks: Sequence Generation Beyond One-Pass Decoding Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin,](https://reader036.vdocuments.us/reader036/viewer/2022081518/6053b91f750d75248a37282f/html5/thumbnails/2.jpg)
Introduction
• Deliberation: process of polishing (e.g. an essay) while looking at
how a local element fits in its global environment
• Output of a neural network depends partly on what has been output
already, not what will be output
• Extended encoder-decoder model with second decoder to do the
deliberation process
1
![Page 3: Deliberation Networks: Sequence Generation Beyond One-Pass ... · Deliberation Networks: Sequence Generation Beyond One-Pass Decoding Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin,](https://reader036.vdocuments.us/reader036/viewer/2022081518/6053b91f750d75248a37282f/html5/thumbnails/3.jpg)
Encoder
• Encodes input sequence x to hidden states H = {h1, h2, ..., hTx}
hi = RNN(xi , hi−1)
2
![Page 4: Deliberation Networks: Sequence Generation Beyond One-Pass ... · Deliberation Networks: Sequence Generation Beyond One-Pass Decoding Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin,](https://reader036.vdocuments.us/reader036/viewer/2022081518/6053b91f750d75248a37282f/html5/thumbnails/4.jpg)
First-pass Decoder
• Generates hidden states s, first-pass sequence y
• Attention Model:
ctxe =
Tx∑i=1
αihi
αi ∝ exp(υTα tanh(W catt,hhi + W c
att,s sj−1))
• Hidden state calculation:
sj = RNN([yj−i ; ctxe ], sj−1)
• Apply affine transformation on [sj ; ctxe ; yj−1]
• Softmax layer → sample out yj from multinomial distribution
3
![Page 5: Deliberation Networks: Sequence Generation Beyond One-Pass ... · Deliberation Networks: Sequence Generation Beyond One-Pass Decoding Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin,](https://reader036.vdocuments.us/reader036/viewer/2022081518/6053b91f750d75248a37282f/html5/thumbnails/5.jpg)
First-pass decoder
4
![Page 6: Deliberation Networks: Sequence Generation Beyond One-Pass ... · Deliberation Networks: Sequence Generation Beyond One-Pass Decoding Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin,](https://reader036.vdocuments.us/reader036/viewer/2022081518/6053b91f750d75248a37282f/html5/thumbnails/6.jpg)
Second-pass Decoder
• Takes:
• Previous hidden state st−1• Previously decoded word yt−1• Source contextual information ctx ′e• First-pass contextual information ctxc
• ctx ′e computed similarly to ctxe , last hidden state of second-pass
decoder is used (st−1 instead of sj−1)
5
![Page 7: Deliberation Networks: Sequence Generation Beyond One-Pass ... · Deliberation Networks: Sequence Generation Beyond One-Pass Decoding Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin,](https://reader036.vdocuments.us/reader036/viewer/2022081518/6053b91f750d75248a37282f/html5/thumbnails/7.jpg)
Second-pass Decoder
• Attention Model:
ctxc =
Ty∑j=1
βj [sj ; yj ]
βj ∝ exp(υTβ tanh(W datt,sy [sj ; yj ] + W d
att,s t−1))
• Hidden state calculation:
st = RNN([yt−1; ctx ′e ; ctxc ], st−1)
• To generate yt further transform [st ; ctx′e ; ctxc ; yt−1] similar to
sampling of yj
6
![Page 8: Deliberation Networks: Sequence Generation Beyond One-Pass ... · Deliberation Networks: Sequence Generation Beyond One-Pass Decoding Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin,](https://reader036.vdocuments.us/reader036/viewer/2022081518/6053b91f750d75248a37282f/html5/thumbnails/8.jpg)
Framework
7
![Page 9: Deliberation Networks: Sequence Generation Beyond One-Pass ... · Deliberation Networks: Sequence Generation Beyond One-Pass Decoding Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin,](https://reader036.vdocuments.us/reader036/viewer/2022081518/6053b91f750d75248a37282f/html5/thumbnails/9.jpg)
Training
![Page 10: Deliberation Networks: Sequence Generation Beyond One-Pass ... · Deliberation Networks: Sequence Generation Beyond One-Pass Decoding Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin,](https://reader036.vdocuments.us/reader036/viewer/2022081518/6053b91f750d75248a37282f/html5/thumbnails/10.jpg)
Training
• For this setting, data log likelihood specialized to
(1/n)∑
(x,y)∈DXY
J (x , y ; θe , θ1, θ2)
where
J (x , y ; θe , θ1, θ2) = log∑y ′∈Y
P(y |y ′,E (x ; θe); θ2)P(y ′|E (x ; θe); θ1)
• Y is the collection of all possible target sentences
8
![Page 11: Deliberation Networks: Sequence Generation Beyond One-Pass ... · Deliberation Networks: Sequence Generation Beyond One-Pass Decoding Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin,](https://reader036.vdocuments.us/reader036/viewer/2022081518/6053b91f750d75248a37282f/html5/thumbnails/11.jpg)
Problem
• Derivation of J w.r.t θ1:
∇θ1J (x , y ; θe , θ1, θ2) =
∑y ′∈Y P(y |y ′,E (x ; θe); θ2)∇θ1P(y ′|E (x ; θe); θ1)∑
y ′∈Y P(y |y ′,E (x ; θe); θ2)P(y ′|E (x ; θe); θ1)
• Computationally not feasible because of Y• Solution: Monte Carlo based method to optimize lower bound of J
function → J
J (x , y ; θe , θ1, θ2) =∑y ′∈Y
P(y ′|E (x ; θe); θ1)logP(y |y ′,E (x ; θe); θ2)
9
![Page 12: Deliberation Networks: Sequence Generation Beyond One-Pass ... · Deliberation Networks: Sequence Generation Beyond One-Pass Decoding Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin,](https://reader036.vdocuments.us/reader036/viewer/2022081518/6053b91f750d75248a37282f/html5/thumbnails/12.jpg)
Algorithm
10
![Page 13: Deliberation Networks: Sequence Generation Beyond One-Pass ... · Deliberation Networks: Sequence Generation Beyond One-Pass Decoding Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin,](https://reader036.vdocuments.us/reader036/viewer/2022081518/6053b91f750d75248a37282f/html5/thumbnails/13.jpg)
Full Training Process
• Pre-train standard encoder-decoder based NMT models until
convergence
• Deliberation network encoder initialized by encoder of standard
model
• Both deliberation network decoders are initialized by the decoder of
the pre-trained model
• Train deliberation network until convergence
• Use beam search to sample output by first decoder
11
![Page 14: Deliberation Networks: Sequence Generation Beyond One-Pass ... · Deliberation Networks: Sequence Generation Beyond One-Pass Decoding Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin,](https://reader036.vdocuments.us/reader036/viewer/2022081518/6053b91f750d75248a37282f/html5/thumbnails/14.jpg)
Results
![Page 15: Deliberation Networks: Sequence Generation Beyond One-Pass ... · Deliberation Networks: Sequence Generation Beyond One-Pass Decoding Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin,](https://reader036.vdocuments.us/reader036/viewer/2022081518/6053b91f750d75248a37282f/html5/thumbnails/15.jpg)
Neural Machine Translation
• English-to-French translation on WMT’14 and newstest datasets
• Chinese-to-English translation on n LDC corpus and NIST datasets
• Removed sentences with more than 50 words, limit source and
target words
• Two models: shallow and deep model
12
![Page 16: Deliberation Networks: Sequence Generation Beyond One-Pass ... · Deliberation Networks: Sequence Generation Beyond One-Pass Decoding Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin,](https://reader036.vdocuments.us/reader036/viewer/2022081518/6053b91f750d75248a37282f/html5/thumbnails/16.jpg)
Shallow Model
• Based on single-layer GRU model RNNSearchch
• Baselines:
• Standard NMT model RNNSearch
• Standard NMT model with two stacked decoding layers
• Review Network
13
![Page 17: Deliberation Networks: Sequence Generation Beyond One-Pass ... · Deliberation Networks: Sequence Generation Beyond One-Pass Decoding Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin,](https://reader036.vdocuments.us/reader036/viewer/2022081518/6053b91f750d75248a37282f/html5/thumbnails/17.jpg)
Shallow Model Results
14
![Page 18: Deliberation Networks: Sequence Generation Beyond One-Pass ... · Deliberation Networks: Sequence Generation Beyond One-Pass Decoding Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin,](https://reader036.vdocuments.us/reader036/viewer/2022081518/6053b91f750d75248a37282f/html5/thumbnails/18.jpg)
Translation Examples
Source Sentence
Aiji shuo, zhongdong heping xieyi yuqi jiang you yige xinde jiagou.
Reference Translation
Egypt says a new framework is expected to come into being for the
Middle East peace agreement.
Translation Base Model
egypt’s middle east peace agreement is expected to have a new
framework, he said.
First-pass decoder output
egypt’s middle east peace agreement is expected to have a new
framework, egypt said.
Second-pass decoder Output
egypt says the middle east peace agreement is expected to have a new
framework15
![Page 19: Deliberation Networks: Sequence Generation Beyond One-Pass ... · Deliberation Networks: Sequence Generation Beyond One-Pass Decoding Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin,](https://reader036.vdocuments.us/reader036/viewer/2022081518/6053b91f750d75248a37282f/html5/thumbnails/19.jpg)
Deep Model
• Deep LSTM model
• Only on En-Fr translation task
• Apply BPE techniques: split training sentences in sub-word units
• Restrict source and target sentence lengths within 64 subwords
• Encoder and Decoders are 4-layer LSTMs with residual connections
16
![Page 20: Deliberation Networks: Sequence Generation Beyond One-Pass ... · Deliberation Networks: Sequence Generation Beyond One-Pass Decoding Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin,](https://reader036.vdocuments.us/reader036/viewer/2022081518/6053b91f750d75248a37282f/html5/thumbnails/20.jpg)
Deep Model Results
17
![Page 21: Deliberation Networks: Sequence Generation Beyond One-Pass ... · Deliberation Networks: Sequence Generation Beyond One-Pass Decoding Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin,](https://reader036.vdocuments.us/reader036/viewer/2022081518/6053b91f750d75248a37282f/html5/thumbnails/21.jpg)
Text Summarization Results
18
![Page 22: Deliberation Networks: Sequence Generation Beyond One-Pass ... · Deliberation Networks: Sequence Generation Beyond One-Pass Decoding Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin,](https://reader036.vdocuments.us/reader036/viewer/2022081518/6053b91f750d75248a37282f/html5/thumbnails/22.jpg)
Questions
19