natural language processing - taujoberant/teaching/advanced_nlp... · 2017-06-06 · natural...
TRANSCRIPT
![Page 1: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/1.jpg)
Natural Language Processing
Neural semantic parsing
Slides adapted from Richard Socher, Chris Manning, Ofir Press
![Page 2: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/2.jpg)
Plan• Sequence to sequence models
• LSTMs and GRUs
• Attention
• Pointer networks
• Weak supervision
![Page 3: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/3.jpg)
Sequence to sequence
![Page 4: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/4.jpg)
Semantic parsing• We saw methods for translating natural language to
logical form by constructing trees
• This works both when we have logical forms as well as denotations for supervision
• If we have logical forms as supervision, we can use an extremely popular neural architecture called sequence to sequence
![Page 5: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/5.jpg)
Sequence to sequenceHow tall is Lebron James?
HeightOf.LebronJames
[How, tall, is, Lebron, James, ?] [HeightOf, ., LebronJames]
How tall is Lebron James? ¿Que tan alto es Lebron James?
What do we gain? What do we lose?
![Page 6: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/6.jpg)
High level• We will build a complex, heavily-parameterized but
differentiable function that maps a natural language statement x to a logical form y (or translation, or summary, or answer to a question…)
• We will define a loss (often cross entropy) that tells us how good is our prediction w.r.t the ground truth
• We will search for parameters that minimize the loss over the training set with SGD.
• We will compute gradients with auto-differentiation packages
![Page 7: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/7.jpg)
Applications• Machine translation
• Semantic parsing
• Question answering
• Summarization
• Dialogue
• …
![Page 8: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/8.jpg)
Sequence to sequence
Encoder Decoder
[How, tall, is, Lebron, James, ?]
[HeightOf, ., LebronJames]
Vector
![Page 9: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/9.jpg)
Recurrent neural networks
the dog
9
laughed
Input: w1, . . . , wt�1, wt
, w
t+1, . . . , wT
, w
i
2 RV
Model: x
t
= W
(e) · wt
,W
(e) 2 Rd⇥V
h
t
= �(W
(hh) · ht�1 +W
(hx) · xt
),W
(hh) 2 RDh⇥Dh,W
(hx) 2 RDh⇥d
y
t
= softmax(W
(s) · ht
) ,W
(s) 2 RV⇥Dh
![Page 10: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/10.jpg)
EncoderAn RNN without the output layer
How tall is Lebron James ?
Vector
![Page 11: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/11.jpg)
Decoder (v1.0)
?
An RNN without the input layerHeightOf (0.7) WeightOf (0.2)
…
. (0.99)LebronJames(0.001)
…
![Page 12: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/12.jpg)
Seq2seq (v1.0)Encoder:
h
e
0 = 0, h
e
t
= RNN
e
(h
e
t�1, xt
)
Decoder:
h
d
0 = RNN
d
(h
e
|x|), hd
t
= RNN
d
(h
d
t�1)
y
t
= softmax(W
(s)h
d
t
)
Model:
p(y | x) =Y
t
p(y
t
| y1, . . . , yt�1, x) =
Y
t
p(y
t
| hd
t
)
Training is finding parameters that minimize cross entropy over tokens:
X
i
log p
✓
(y
(i) | x(i))
![Page 13: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/13.jpg)
Seq2seq (v1.0)• Training is done with SGD on top of standard auto-
diff packages
• At training time decoding is done as many steps as the training example (with a stopping symbol)
• At test time we output the argmax token of every time step and stop when we output the stopping symbol.
![Page 14: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/14.jpg)
Seq2seq (v2.0)
?
WeightOf (0.7)HeightOf (0.2)
…
. (0.99)LebronJames(0.001)
…
<s> HeightOf .
LebronJames(0.9) …
![Page 15: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/15.jpg)
Seq2seq (v2.0)Encoder:
h
e0 = 0, h
et = RNNe(h
et�1, xt)
Decoder:
hd
t
= RNN
d
(hd
t�1
,he
|x|,yt�1
)
yt = softmax(W
(s)h
dt )
Model:
p(y | x) =Y
t
p(yt | y1, . . . , yt�1, x) =
Y
t
p(yt | hdt )
Training is finding parameters that minimize cross entropy over tokens:
X
i
log p✓(y(i) | x(i)
)
![Page 16: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/16.jpg)
Bidirectional encoder
How tall is Lebron James ?
0
0
Vector
![Page 17: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/17.jpg)
Bidirectional encoderEncoder:
h
f
0 = 0, h
f
t
= RNN
f
(h
t�1, xt
)
h
b
|x| = 0, h
b
t
= RNN
b
(h
t+1, xt
)
Decoder:
h
d
t
= RNN
d
(h
d
t�1, hf
|x|, hb
0, yt�1)
An extremely successful model (state-of-the-art), when using more sophisticated cells (LSTMs, GRUs).
![Page 18: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/18.jpg)
Stacked RNNs• For encoder or decoder
How tall is Lebron James ?
0
0 Vector
![Page 19: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/19.jpg)
Stacked RNN• For stacked RNN, we need to have an output to
each state, usually the hidden state itself is used as the input to the next layer
• Empirically stacking RNNs is often better than just increasing the dimensionality
• For example, Google’s NMT system uses 8 layers at both encoding and decoding time.
Wu et al, 2016
![Page 20: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/20.jpg)
Efficiency• RNNs are not very efficient in terms of
parallelization
• You can not compute ht before computing ht-1 (compared to bag of words or convolutional neural networks)
• This becomes a problem for tasks where one needs to read long documents (summarization).
![Page 21: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/21.jpg)
Beam search• At test time, decoding is greedy - we
output the symbol that has highest probability. Not guaranteed to produce the highest probability sequence
• Improved substantially with a small beam.
• At decoding step t, we consider K most probable sequence prefixes, and compute all possible continuations, score them, and keep the top-K
• Burden shift from search to learning again
?
HeightOf (0.7) WeightOf (0.2)
…
![Page 22: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/22.jpg)
Vanishing gradients (reminder)
Pascanu et al, 2013
h
t
= W
(hh)�(h
t�1) +W
(hx)x
t
@J
t
(✓)
@✓
=tX
k=1
@J
t
(✓)
@h
t
@h
t
@h
k
@h
t
@✓
@h
t
@h
k
=tY
i=k+1
@h
i
@h
i�1=
tY
i=k+1
W
(hh)>diag(�0(hi�1))
|| @h
i
@h
i�1|| ||W (hh)> || · ||diag(�0(h
i�1))|| ⌘
The contribution of step k to the total gradient at step t is small
![Page 23: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/23.jpg)
Solutions (slide from the past)
• Exploding gradient: gradient clipping
• Re-normalize gradient to be less than C (~5)
• Exploding gradients are easy to detect
• The problem is with the model!
• Change it (LSTMs, GRUs)
• Maybe later in this class
![Page 24: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/24.jpg)
LSTMs and GRUs
![Page 25: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/25.jpg)
LSTMs and GRUs
• Bottom line: use vector addition and not matrix-vector multiplication. Allows for better propagation of gradients to the past
![Page 26: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/26.jpg)
Gated Recurrent Unit (GRU)• Main insight: add learnable gates to the recurrent
unit that control the flow of information from the past to the present
• Vanilla RNN:
h
t
= f(W (hh)h
t�1 +W
(hx)x
t
)
• Update and reset gates:
zt = �(W (z)ht�1 + U
(z)xt)
rt = �(W (r)ht�1 + U
(r)xt)
Cho et al., 2014
![Page 27: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/27.jpg)
Gated Recurrent Unit (GRU)
• Use the gates to control information flow
• If z=1, we simply copy the past and ignore the present (note that gradient will be 1)
• If z=0, then we have an RNN like update, but we are also free to reset some of the past units, and if r=0, then we have no memory of the past
• The + in the last equation is crucial
zt = �(W (z)xt + U
(z)ht�1)
rt = �(W (r)xt + U
(r)ht�1)
ht = tanh(Wxt + rt � Uht�1)
ht = zt � ht�1 + (1� zt) � ht
![Page 28: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/28.jpg)
Illustration
ht-1 xt
rt zt
ȟt
ht
![Page 29: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/29.jpg)
Long short term memories (LSTMs)
• z has been split into i and f
• There is no r
• There is a new gate o that distinguishes between the memory and the output.
• c is like h in GRUs
• h is the output
Hochreiter and Schmidhuber, 1997
i
t
= �(W (i)x
t
+ U
(i)h
t�1)
f
t
= �(W (f)x
t
+ U
(f)h
t�1)
o
t
= �(W (o)x
t
+ U
(o)h
t�1)
c
t
= tanh(W (c)x
t
+ U
(c)h
t�1)
c
t
= f
t
� ct�1 + i
t
� ct
h
t
= o
t
� tanh(ct
)
![Page 30: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/30.jpg)
Illustration
ht-1 xt
it ftčt
ht
ot
ct-1
ct
![Page 31: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/31.jpg)
Illustration
Chris Olah’s blog
![Page 32: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/32.jpg)
More GRU intuition from Stanford
• Go over sequence of slides from Chris Manning
![Page 33: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/33.jpg)
Machine translation
![Page 34: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/34.jpg)
Advantages of seq2seq• Simplicity
• Distributed representations of words and phrases
• Better use of context (history) at encoding and decoding time
• Neural networks seem to be very good at generating text (for MT, summarization, QA, etc.)
![Page 35: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/35.jpg)
Summary• Sequence to sequence models map a sequence of
symbols to another sequence of symbols - very common in NLP!
• LSTMs and GRUs allow this to work for long sequences (~30-100 steps)
• Results in state-of-the-art performance in many cases
• But not always! Attention!
![Page 36: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/36.jpg)
Attention
![Page 37: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/37.jpg)
The problem
Encoder Decoder
[How, tall, is, …]
[HeightOf, ., …
Vector
This is fixed size!
What if the source is very long? “how tall is the NBA player that has won the most NBA titles before he reached the age of 28?”
![Page 38: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/38.jpg)
AttentionTreat source representations as memory
How tall is Lebron James ?
Decide what to read from memory when decoding
![Page 39: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/39.jpg)
Alignment• What are the important words when we decide on
the next symbol at decoding time?
• Alignments are heavily used in traditional MT
[How, tall, is, Lebron, James, ?]
[HeightOf, ., LebronJames]
• We will learn to perform the alignment as we decode
![Page 40: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/40.jpg)
Learning alignment in MT
![Page 41: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/41.jpg)
Decoding
ht-1 ht
c
yt
yt-1
ht-1 ht
ct
yt
yt-1
Replace a fixed vector with a time-variable vector
Intuition: before generating a word we softly align to the relevant words in the source!
![Page 42: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/42.jpg)
Attention
tall Lebron James ? <s> HeightOf
?
To compute ct:
24 3 1
.032.087 .237.644
8i, si = score(ht�1d , hi
e)
↵ = softmax(s) soft alignment!
HeightOf
![Page 43: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/43.jpg)
Attention
tall Lebron James ? <s> HeightOf
?
To compute ct: ct =X
i
↵ihie
![Page 44: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/44.jpg)
Attention
tall Lebron James ? <s> HeightOf
To compute ct:
41 4 1
.024.024 .476.476
8i, si = score(ht�1d , hi
e)
↵ = softmax(s) soft alignment!
.
HeightOf .
![Page 45: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/45.jpg)
Attention example
![Page 46: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/46.jpg)
Attention scoring function• Options for scoring function:
• Dot-product
• Bilinear map
• Single layer neural net
• Multiple layer neural net
score(ht�1d , hi
e) = ht�1>
d hie
score(ht�1d , hi
e) = ht�1>
d Whie
score(ht�1d , hi
e) = v>tanh(W1ht�1d +W2h
ie)
![Page 47: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/47.jpg)
Semantic parsingDong and Lapata, 2016
![Page 48: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/48.jpg)
Machine translation
![Page 49: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/49.jpg)
Coverage• Caption generation
• How do we make sure we cover the source?
• Penalize for source patches/words that are not aligned to any target word
Xu et al., 2015
X
patch
(1�X
word
↵patch,word
)2
![Page 50: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/50.jpg)
Attention variants• Feed attention vectors as input at decoding time to
try to learn coverage
• Add some term for preferring alignments that are monotonic
• Prefer limited fertility
• Use it for aligning pairs of text like (q,a) or paraphrase pairs
![Page 51: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/51.jpg)
Summary
• Attention has enabled getting state-of-the-art performance for transduction scenarios
• Allows to softly align each token in a sequence of text to another sequence
![Page 52: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/52.jpg)
Pointer networks
![Page 53: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/53.jpg)
Problem• Often at test time you need to translate entities you
have never seen
• If we define the target vocabulary with the training set, we will never get it right
• In addition, translation for those entities is often simply copying
How tall is Dreymond Green?HeightOf.DreymondGreen
![Page 54: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/54.jpg)
Solution 1• Mask entities
• Translate
• Bring back entities
• But if there are many entities
• How do you identify entities?
How tall is <e>?HeightOf.<e>
![Page 55: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/55.jpg)
Idea• When we translate a sentence, the probability of a
word increases once we see it.
• P(“pokemen") is low
• P(“pokemon" | “the pokemon company”) is high
• Let’s allow outputting either words from a fixed target vocabulary or any word from the source sentence
![Page 56: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/56.jpg)
Regular modelp(yt = w | x, y1, . . . , yt�1) / exp(Uwht)
ht-1 ht
ct
yt
yt-1
HeightOf [3]WeightOf [-1]
NumAssists [40] ( [9]
) [5.8] . [3.7]
and [13]
Vinyals, et al, 2015, Jia and Liang, 2016
![Page 57: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/57.jpg)
Copying model
ht-1 ht
ct
yt
yt-1
HeightOf [3]WeightOf [-1]
NumAssists [40] ( [9]
) [5.8] . [3.7]
and [13] How [5] tall [1] is [-2]
Dreymond [100] Green [100]
? [0]
p(yt = w | x, y1, . . . , yt�1) / exp(Uwht)
p(yt = xi | x, y1, . . . , yt�1) / exp(sti)
How tall is Dreymond Green ?
Vinyals, et al, 2015, Jia and Liang, 2016
![Page 58: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/58.jpg)
Copying model• Need to marginalize over the words since there
could be repetitions
• At training this means that the true distribution is uniform over all correct tokens
• At test time we choose the highest probability token, but marginalize over the same instances of a token
![Page 59: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/59.jpg)
Slight improvementp(yt = w | x, y1, . . . , yt�1) / exp(Uwht)
p(yt = xi | x, y1, . . . , yt�1) / exp(sti)
• These scores need to be calibrated • We can just interpolate two distributions after
normalizationp
vocab
(yt | x, y1, . . . , yt�1
) = softmax(Uht)
p
copy
(yt | x, y1, . . . , yt�1
) = softmax(st)
p(yt | x, y1, . . . , yt�1
) = p
gen
· pvocab
+ (1� p
gen
) · pcopy
p
gen
= �(w
>1
ht + w
>2
ct + w
>3
yt�1
)
![Page 60: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/60.jpg)
IllustrationSee et al., 2016
![Page 61: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/61.jpg)
Summary• Neural network for semantic parsing are based on sequence to
sequence models
• These models are useful also for summarization, dialogue, question answering, paraphrasing, and other transduction tasks
• LSTMs and GRUs allowed to not forget to fast
• Attention added memory to circumvent the constant representation problem
• Pointer networks help in handling new words at test time
• Together you can often get models that are comparable to state-of-the-art without a grammar
![Page 62: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/62.jpg)
Weak supervision
![Page 63: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/63.jpg)
Weak supervision• We have assumed that we have as input pairs of natural
language and logical form
• In practice those are hard to collect and we usually have (language, denotation) pairs
![Page 64: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/64.jpg)
The problem• In sequence to sequence we trained end-to-end with SGD,
minimizing the cross entropy loss of every token
• Here we don’t have tokens
• Suggestion: generate the program token by token, execute, and minimize cross entropy over denotations
• Problem: The loss is not a differentiable function of the input because we don’t input gold tokens
WeightOf (0.7)HeightOf (0.2)
…
t t+1
softmax argmaxWeightOf
![Page 65: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/65.jpg)
This looks familiarSearch with
CKY
Can we do something similar with a seq2seq model?
![Page 66: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/66.jpg)
Markov Decision Process• Sequence of states, actions and rewards
• s0, s1, s2, …, sT from a set S • a0, a1, a2, …, aT from a set A
• Let’s assume a deterministic transition function f:SxA->S • r0, r1, r2, …, rT given by a reward function r(s,a)
• We want a policy 𝜋(a | s) providing a distribution over actions that will maximize future reward
s0 s1 s2 sT…a0 a1 a2 aT-1
![Page 67: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/67.jpg)
Seq2seq as MDP
tall Lebron James ? <s> HeightOf
• st: ht
• at is in A(st) • Either all symbols in the target vocabulary • All valid symbols if we check grammaticality
• rt is zero in all steps except the last. Then, it is 1 if execution results in a correct answer and 0 otherwise.
Liang et al, 2017, Guu et al., 2017
![Page 68: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/68.jpg)
Seq2seq as MDP: policy
tall Lebron James ? <s> HeightOf
p(z | x) =Y
t
p(zt | x, z0, . . . , zt�1)
=
Y
t
p(at | x, a0, . . . , at�1)
=
Y
t
⇡(at | st)
⇡(at | st) = softmax(W
(s)ht)
How do we learn?
![Page 69: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/69.jpg)
Option 1: Maximum marginal likelihood
• Our data is language-dentation pairs (x,y)
• We obtain y by constructing a logical form z
• We can use maximum marginal likelihood like before
• Interleave search and learning
• Apply search to get candidate logical forms
• Apply learning on these candidate
• Difference from before:
• Search was done with CKY and learning was a globally-normalized model
• Search can be done with beam search and we have a locally-normalized model
![Page 70: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/70.jpg)
Maximum marginal likelihood
• z is independent of x conditioned on y
p
✓
(y | x) =X
z
p
✓
(z | x) · p(y | z)
=
X
z
p
✓
(z | x)R(z) = E
p✓(z|x)[R(z)]
LMML(✓) = log
Y
(x,y)
p
✓
(y | x) = log
Y
(x,y)
E
p✓(z|x)[R(z)]
=
X
(x,y)
log
X
z
p
✓
(z | x) ·R(z)
![Page 71: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/71.jpg)
Gradient of MML• Gradient has similar form to what we have seen in
the past, except that we are not in a log-linear model. Let’s assume a binary reward:
r✓ log
X
z
p✓(z | x) ·R(z) =
X
z
p✓(z)R(z)r log p✓(z | x)P0
z p✓(z0 | x) ·R(z
0)
=
X
z
p(z | x,R(z) = 1)r log p✓(z | x)
• Compute the gradient of the log probability for every logical form, and weight the gradient using the reward.
![Page 72: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/72.jpg)
Computing the gradient• We can not enumerate all of the logical forms
• Instead we perform beam search as usual and get a beam Z containing K logical forms.
• We imagine that this beam is the entire set of possible logical forms
X
z2Z
p(z | x,R(z) = 1)r log p✓(z | x)
• For every z we can compute the gradient of log p(z | x) since this is now the usual seq2seq setup.
![Page 73: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/73.jpg)
Option 2: policy gradient• We would like to simply maximize our expected
reward
• Weight the gradient by the product of the reward and the model probability
E
p✓(z|x)[R(z)] =
X
z
p
✓
(z | x)R(z)
LRL(✓) =
X
(x,y)
X
z
p
✓
(z | x)R(z) =
X
(x,y)
E
p✓(z|x)[R(z)]
rLRL(✓) =
X
(x,y)
X
z
p
✓
(z | x)R(z)r log p
✓
(z | x)
=
X
(x,y)
E
p✓(z|x)[R(z)r log p
✓
(z | x)]
![Page 74: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/74.jpg)
Computing the gradient• Again, we can not sum over all logical forms
• But the gradient for every example is an expectation over a distribution we can sample from!
• So we can sample many logical forms, compute the gradient and sum them weighted by the product of the model probability and reward
• Again, for every sample this is regular seq2seq and we can compute an approximate gradient
![Page 75: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/75.jpg)
Some differences• Using MML with beam search is a biased estimator
and has less exploration - we only observe the approximate top-K logical forms
• Using RL could be harder to train. If we have a correct logical form z* that has low probability at the beginning of training, then the contribution to the gradient would be very smaller and it would be hard to boostrap.
![Page 76: Natural Language Processing - TAUjoberant/teaching/advanced_nlp... · 2017-06-06 · Natural Language Processing Neural semantic parsing ... differentiable function that maps a natural](https://reader034.vdocuments.us/reader034/viewer/2022042307/5ed3f6048d46b66d22632b0f/html5/thumbnails/76.jpg)
Summary• Training with a seq2seq model with weak supervision is
problematic because the loss function is not a differentiable function of the input
• We saw both MML and RL approaches for getting around that
• In both we find a set of logical forms, compute the gradient for them like in supervised learning, and weight them in some way to form a final gradient
• This let’s us train with SGD
• It is still often hard to train - more next time