universal sentence encoderlcarin/serge4.19.2019.pdf · 2019-04-19 · motivation methods results...
TRANSCRIPT
![Page 1: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/1.jpg)
Motivation Methods Results
Universal Sentence Encoder
Daniel Cer¹, Yinfei Yang¹, Sheng-yi Kong¹, Nan Hua¹, Nicole Limtiaco², Rhomni St. John¹, Noah Constant¹, Mario Guajardo-Cespedes¹, Steve Yuan³, Chris Tar¹, Yun-Hsuan Sung¹, Brian Strope¹, Ray Kurzweil¹
¹Google Research, Mountain View, CA
²Google Research, New York, NY
³Google, Cambridge, MA
19 April 2019
Presented by: Serge Assaad
![Page 2: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/2.jpg)
Motivation
Limited amounts of training data are available for many NLP tasks.
Many models address the problem by implicitly performing limited transfer learning through the use of pre-trained word embeddings.
Recent work has demonstrated strong transfer task performance using pre-trained sentence level embeddings (Conneau et al., 2017).
In this paper, two models are presented to produce sentence embeddings that transfer well to other NLP tasks.
![Page 3: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/3.jpg)
Background: Unpacking "Attention"
Suppose you have a sentence $s$ with words $w_1, \dots, w_N$ and corresponding embeddings $x_1, \dots, x_N$.
We will now try to make the embeddings "context-aware":
Pick an embedding $x_i$, and find its similarities with all the embeddings in the sentence (itself included), i.e. find $\mathrm{sim} = [x_i \cdot x_1, \dots, x_i \cdot x_N]$.
Now find $\alpha = \mathrm{softmax}(\mathrm{sim})$.
Finally, compute $x_i^{\text{aware}} = \sum_{k=1}^{N} \alpha_k x_k$.
You now have a "context-aware" embedding!
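The three steps above can be sketched in plain Python. This is only an illustration: the toy embeddings and the helper names `dot`, `softmax`, and `context_aware` are assumptions made for the example, not something from the slides.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_aware(i, X):
    """Context-aware version of X[i]: a softmax-weighted sum of all rows of X."""
    sim = [dot(X[i], x_k) for x_k in X]   # similarities of x_i with every x_k
    alpha = softmax(sim)                  # attention weights, summing to 1
    dim = len(X[0])
    return [sum(alpha[k] * X[k][d] for k in range(len(X))) for d in range(dim)]

# Toy embeddings (made up for illustration): three words, two dimensions.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x0_aware = context_aware(0, X)
```

Because the weights are non-negative and sum to 1, the result is a convex combination of the original embeddings.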
![Page 4: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/4.jpg)
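Background: Unpacking "Attention"
Converting this to matrix form, where row $i$ of $X$ is $x_i$ and the softmax is applied row-wise, you get:
$$X^{\text{aware}} = \mathrm{softmax}(XX^T)\,X \tag{1}$$
(This is called self-attention.)
Now we’ll generalize a bit and say we have matrices $Q$, $K$, $V$ of queries, keys, and values. We can apply the same idea to get:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V \tag{2}$$
where $d_k$ is the dimension of the rows of $K$. Dividing by $\sqrt{d_k}$ keeps the dot products from growing with $d_k$, which would otherwise push the softmax into regions where its gradients are very small.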
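A minimal pure-Python sketch of the scaled attention formula, with matrices stored as lists of rows. The helper names (`matmul`, `softmax`, `attention`) and the toy matrix are assumptions made for this example; a real implementation would use an array library.

```python
import math

def matmul(A, B):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with the softmax applied row-wise."""
    scale = math.sqrt(len(K[0]))          # d_k = length of each row of K
    K_T = [list(col) for col in zip(*K)]
    weights = [softmax([s / scale for s in row]) for row in matmul(Q, K_T)]
    return matmul(weights, V)

# Self-attention is the special case Q = K = V = X (toy values, for illustration).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
X_aware = attention(X, X, X)
```

Apart from the $\sqrt{d_k}$ scaling, calling `attention(X, X, X)` is the matrix form of the per-word procedure on the previous slide.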
![Page 5: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/5.jpg)
Background: Multi-head attention
Let’s add another layer of complexity:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\,W^O \tag{3}$$
where:
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) \tag{4}$$
and $W_i^Q$, $W_i^K$, $W_i^V$, $W^O$ are learned parameters.
This multi-head idea allows our model to attend to information from different representation subspaces via the different heads.
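To make the shapes in the multi-head formula concrete, here is a small pure-Python sketch. The projection matrices are random stand-ins for the learned parameters, and the sizes (`d_model = 4`, two heads of width 2) are arbitrary assumptions for the example.

```python
import math
import random

def matmul(A, B):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    scale = math.sqrt(len(K[0]))
    K_T = [list(col) for col in zip(*K)]
    weights = [softmax([s / scale for s in row]) for row in matmul(Q, K_T)]
    return matmul(weights, V)

def multi_head(Q, K, V, heads, W_O):
    """`heads` is a list of (W_Q, W_K, W_V) projection triples, one per head."""
    outputs = [attention(matmul(Q, W_Q), matmul(K, W_K), matmul(V, W_V))
               for W_Q, W_K, W_V in heads]
    # Concatenate the head outputs along the feature axis, row by row.
    concat = [[value for head in outputs for value in head[n]] for n in range(len(Q))]
    return matmul(concat, W_O)

def random_matrix(rows, cols, rng):
    """Random stand-in for a learned parameter matrix."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, num_heads, d_head = 4, 2, 2
X = random_matrix(3, d_model, rng)        # three "words"
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(num_heads)]
W_O = random_matrix(num_heads * d_head, d_model, rng)
out = multi_head(X, X, X, heads, W_O)     # one d_model-sized row per word
```

Each head works in its own `d_head`-dimensional projection of the input, and `W_O` mixes the concatenated heads back to `d_model` dimensions.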
![Page 6: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/6.jpg)
Background: Transformer (Vaswani et al., 2017)
Figure: Transformer Architecture
![Page 7: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/7.jpg)
Background: Transformer (Vaswani et al., 2017)
In the encoder, the attention used is just self-attention (i.e. $Q = K = V$, which are the word embeddings in the first layer).
The decoder employs self-attention on the embeddings of the output, followed by encoder-decoder attention (i.e. $K$ and $V$ are both the outputs of the encoder, and $Q$ comes from upstream layers of the decoder).
![Page 8: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/8.jpg)
Model 1 - Transformer-based Universal Sentence Encoder (USE)
Suppose we have a sentence consisting of $N$ words, and $x_1, \dots, x_N$ are the word embeddings.
The function TransformerEncoder takes in $N$ word embeddings and outputs $N$ "context-aware" embeddings.
For the Transformer-based USE, the sentence embedding is the element-wise sum of these outputs, divided by $\sqrt{N}$:
$$S = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \mathrm{TransformerEncoder}(x_1, \dots, x_N)_i \tag{5}$$
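The pooling step in the sentence-embedding formula can be sketched as follows. The encoder itself is not implemented here: `outputs` is a made-up stand-in for the $N$ vectors that TransformerEncoder would return, and `pool_sentence_embedding` is a hypothetical helper name.

```python
import math

def pool_sentence_embedding(token_outputs):
    """Element-wise sum of the per-word vectors, divided by sqrt(N)."""
    n = len(token_outputs)
    dim = len(token_outputs[0])
    return [sum(tok[d] for tok in token_outputs) / math.sqrt(n) for d in range(dim)]

# Placeholder encoder outputs for a 4-word sentence (illustrative values only).
outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
S = pool_sentence_embedding(outputs)  # sums are [16, 20]; sqrt(4) = 2, so S = [8.0, 10.0]
```

Dividing by $\sqrt{N}$ sits between a plain sum (no normalization) and a mean (division by $N$), so the result is a fixed-size vector whatever the sentence length.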
![Page 9: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/9.jpg)
Model 2 - Deep Averaging Network (DAN) (Iyyer et al., 2015)
Figure: DAN Architecture
![Page 10: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/10.jpg)
Experiments: Classification Benchmarks
For both models (Transformer and DAN), classification layers are added for transfer learning on the following tasks:
MR: Movie review snippet sentiment on a five star scale (Pang and Lee, 2005).
CR: Sentiment of sentences mined from customer reviews (Hu and Liu, 2004).
SUBJ: Subjectivity of sentences from movie reviews and plot summaries (Pang and Lee, 2004).
MPQA: Phrase level opinion polarity from news data (Wiebe et al., 2005).
![Page 11: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/11.jpg)
Experiments: Classification Benchmarks
TREC: Fine grained question classification sourced from TREC (Li and Roth, 2002).
SST: Binary phrase level sentiment classification (Socher et al., 2013).
STS Benchmark: Semantic textual similarity (STS) between sentence pairs scored by Pearson correlation with human judgments (Cer et al., 2017).
For the STS Benchmark task, no transfer layers are used. The sentence embeddings are used to compute similarity via:
$$\mathrm{sim}(u, v) = 1 - \frac{\arccos\!\left(\frac{u \cdot v}{\|u\|\,\|v\|}\right)}{\pi} \tag{6}$$
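As a quick sanity check of the similarity formula, here is a small sketch in plain Python. The function name `angular_similarity` is an assumption for this example, and the clamp is a defensive addition against floating-point rounding rather than part of the formula.

```python
import math

def angular_similarity(u, v):
    """1 - arccos(cosine similarity) / pi, for non-zero vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    cos = max(-1.0, min(1.0, cos))  # guard against rounding slightly outside [-1, 1]
    return 1.0 - math.acos(cos) / math.pi

same = angular_similarity([1.0, 0.0], [1.0, 0.0])         # angle 0
orthogonal = angular_similarity([1.0, 0.0], [0.0, 1.0])   # angle pi/2
opposite = angular_similarity([1.0, 0.0], [-1.0, 0.0])    # angle pi
```

The score is 1 for vectors pointing the same way, 0.5 for orthogonal vectors, and 0 for opposite vectors, so it is a rescaled angle rather than a raw cosine.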
![Page 12: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/12.jpg)
Results
Table: Accuracy results for sentence-level tasks (last column is Pearson
correlation of sentence similarity with human judgments)
![Page 13: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/13.jpg)
Discussion
Figure: TL;DR - in sentence length $n$, Transformer runtime is $O(n^2)$ and DAN runtime is $O(n)$ (at the expense of task performance)
![Page 14: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/14.jpg)
Skip-Thought Vectors
Ryan Kiros¹, Yukun Zhu¹, Ruslan Salakhutdinov¹,², Richard S. Zemel¹,², Antonio Torralba³, Raquel Urtasun¹, Sanja Fidler¹
¹University of Toronto
²Canadian Institute for Advanced Research
³Massachusetts Institute of Technology
19 April 2019
Presented by: Serge Assaad
![Page 15: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/15.jpg)
Motivation
Questions:
Can we "universally/generically" encode information in sentences in a distributed semantic representation the same way we do it for words?
Can we use these supposedly generic embeddings to do well on tasks related to these sentences without resorting to expensive task-specific language models?
![Page 16: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/16.jpg)
Skip-Thought Model
Figure: The skip-thoughts model. Given a tuple $(s_{i-1}, s_i, s_{i+1})$ of contiguous sentences, with $s_i$ the $i$-th sentence of a book, the sentence $s_i$ is encoded and tries to reconstruct the previous sentence $s_{i-1}$ and next sentence $s_{i+1}$. In this example, the input is the sentence triplet (I got back home. I could see the cat on the steps. This was strange.) Unattached arrows are connected to the encoder output. Colors indicate which components share parameters. < eos > is the end of sentence token.
![Page 17: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/17.jpg)
Notation
$w_i^t$ is the $t$-th word in sentence $s_i$, and $x_i^t$ is its word embedding.
For the sentence $s_i$ with words $w_i^1, \dots, w_i^N$, at each time step $t$ the encoder of the recurrent model (presented later) produces a hidden state $h_i^t$ which represents the sequence $w_i^1, \dots, w_i^t$.
Thus, we can think of the "sentence embedding" as $h_i^N$, which represents the entire sentence.
![Page 18: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/18.jpg)
Skip-Thought Model: Encoder (GRU)
A typical GRU encoder (dubbed $E$):
$$r^t = \sigma(W_r x^t + U_r h^{t-1}) \tag{1}$$
$$z^t = \sigma(W_z x^t + U_z h^{t-1}) \tag{2}$$
$$\bar{h}^t = \tanh(W x^t + U(r^t \odot h^{t-1})) \tag{3}$$
$$h^t = (1 - z^t) \odot h^{t-1} + z^t \odot \bar{h}^t \tag{4}$$
$\bar{h}^t$ is the proposed update at time $t$, $z^t$ is the update gate, $r^t$ is the reset gate, and $\odot$ denotes the Hadamard product.
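One encoder time step can be written out directly from the four GRU equations. This is a sketch in plain Python: the parameter dictionary `p`, the helper names, and the all-zero initial hidden state in `encode` are assumptions for the example, and no bias terms are included because the equations above have none.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(W, x):
    """Matrix-vector product, with W stored as a list of rows."""
    return [sum(w * x_j for w, x_j in zip(row, x)) for row in W]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def gru_step(x_t, h_prev, p):
    """One GRU update; p maps 'W_r', 'U_r', 'W_z', 'U_z', 'W', 'U' to matrices."""
    r = [sigmoid(a) for a in add(matvec(p["W_r"], x_t), matvec(p["U_r"], h_prev))]
    z = [sigmoid(a) for a in add(matvec(p["W_z"], x_t), matvec(p["U_z"], h_prev))]
    gated = [r_j * h_j for r_j, h_j in zip(r, h_prev)]       # r ⊙ h^{t-1}
    h_bar = [math.tanh(a) for a in add(matvec(p["W"], x_t), matvec(p["U"], gated))]
    return [(1.0 - z_j) * h_j + z_j * hb_j
            for z_j, h_j, hb_j in zip(z, h_prev, h_bar)]

def encode(xs, p, hidden_size):
    """Run the GRU over word embeddings xs; the last state is the sentence vector."""
    h = [0.0] * hidden_size  # assumed initial state
    for x_t in xs:
        h = gru_step(x_t, h, p)
    return h

# With all-zero weights both gates equal 0.5 and the proposal is tanh(0) = 0,
# so each step simply halves the previous hidden state.
zeros = [[0.0, 0.0], [0.0, 0.0]]
p_zero = {name: zeros for name in ("W_r", "U_r", "W_z", "U_z", "W", "U")}
h_next = gru_step([1.0, 1.0], [1.0, -2.0], p_zero)
```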
![Page 19: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/19.jpg)
Skip-Thought Model: Decoder (GRU)
The decoder (dubbed $D_{\text{next}}$) described here tries to predict the words in the next sentence $s_{i+1}$, given the hidden state $h_i^N$ (denoted $h_i$ for brevity) for the sentence $s_i$.
We use the matrices $C_z$, $C_r$, and $C$ to bias the decoder gates with the sentence vector $h_i$:
$$r^t = \sigma(W_r^d x^{t-1} + U_r^d h^{t-1} + C_r h_i) \tag{5}$$
$$z^t = \sigma(W_z^d x^{t-1} + U_z^d h^{t-1} + C_z h_i) \tag{6}$$
$$\bar{h}^t = \tanh(W^d x^{t-1} + U^d(r^t \odot h^{t-1}) + C h_i) \tag{7}$$
$$h_{i+1}^t = (1 - z^t) \odot h^{t-1} + z^t \odot \bar{h}^t \tag{8}$$
![Page 20: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/20.jpg)
Skip-Thought Model: Decoder
We then convert this hidden state $h_{i+1}^t$ into a softmax probability:
$$P(w_{i+1}^t \mid w_{i+1}^{<t}, h_i) \propto \exp(v_{w_{i+1}^t} h_{i+1}^t) \tag{9}$$
where $v_{w_{i+1}^t}$ is the row of a learned vocabulary matrix $V$ which corresponds to word $w_{i+1}^t$.
Note: We train another identical decoder $D_{\text{prev}}$ to predict the previous sentence $s_{i-1}$, which has separate parameters from $D_{\text{next}}$ with the exception of the vocabulary matrix $V$, which is shared between decoders.
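The proportionality in the word-probability formula becomes an equality once the scores are normalized over the whole vocabulary. A minimal sketch, assuming a toy three-word vocabulary matrix and a hypothetical helper name `word_distribution`:

```python
import math

def word_distribution(V, h):
    """Softmax over the scores v_w · h, one score per row (word) of V."""
    logits = [sum(v_j * h_j for v_j, h_j in zip(row, h)) for row in V]
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary matrix (3 words, hidden size 2) and decoder state, for illustration.
V = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
h = [2.0, 0.0]
probs = word_distribution(V, h)  # word 0 has the largest score, so the largest probability
```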
![Page 21: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/21.jpg)
Objective
We optimize the parameters of the encoder $E$ and the 2 decoders ($D_{\text{next}}$ and $D_{\text{prev}}$) according to the following objective:
$$\mathcal{L}(E, D_{\text{next}}, D_{\text{prev}}) = \sum_t \log P(w_{i+1}^t \mid w_{i+1}^{<t}, h_i) \tag{10}$$
$$\qquad + \sum_t \log P(w_{i-1}^t \mid w_{i-1}^{<t}, h_i) \tag{11}$$
Result: After training, we can extract the hidden state of each sentence $h_i$ from the encoder and that’s the "sentence embedding"!
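For one sentence triplet, the objective is just the sum of the log-probabilities that the two decoders assign to the true words. The sketch below assumes the per-step probability distributions have already been produced by the decoders; the function names and toy numbers are illustrative only.

```python
import math

def sentence_log_likelihood(step_probs, target_ids):
    """Sum over time steps of log P(true word | previous words, h_i)."""
    return sum(math.log(p[w]) for p, w in zip(step_probs, target_ids))

def skip_thought_objective(next_probs, next_ids, prev_probs, prev_ids):
    """Log-likelihood of the next sentence plus that of the previous sentence."""
    return (sentence_log_likelihood(next_probs, next_ids)
            + sentence_log_likelihood(prev_probs, prev_ids))

# Toy decoder outputs over a 2-word vocabulary (made-up numbers).
next_probs = [[0.5, 0.5], [0.25, 0.75]]   # two time steps for s_{i+1}
prev_probs = [[1.0, 0.0]]                 # one time step for s_{i-1}
objective = skip_thought_objective(next_probs, [0, 0], prev_probs, [0])
```

Training would sum this quantity over all triplets in the corpus; a higher value means the decoders predict the neighbouring sentences better.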
![Page 22: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/22.jpg)
Training details
A few different versions of the skip-thoughts model are used:
uni-skip: The model just presented (unidirectional skip-thoughts).
bi-skip: Similar, but with 2 encoders $E_{\text{forward}}$ and $E_{\text{reverse}}$, whose outputs are concatenated then fed to the decoder. The latter encoder is fed the sentence in reverse order.
combine-skip: A combination of uni-skip and bi-skip, where the outputs of the encoders of the uni-skip and bi-skip models are concatenated then fed to the decoders.
combine-skip + COCO: Encoder embeddings are concatenated with an image-sentence embedding model trained on COCO (described later).
These models are trained using the BookCorpus dataset (Zhu et al., 2015).
![Page 23: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/23.jpg)
Experiments: Sentence-Level comparisons
SICK task: Given 2 sentences, the goal is to produce a score of how semantically related they are. The ground truth is human annotated scores from 1 to 5 (Marelli et al., 2014).
Microsoft Paraphrase Corpus: Predicting whether or not 2 sentences are paraphrases of each other.
![Page 24: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/24.jpg)
Results: Sentence-Level comparisons
Table: Left: Test set results on the SICK semantic relatedness subtask. The metrics are Pearson’s $r$, Spearman’s $\rho$, and MSE. Right: Test set results on the Microsoft Paraphrase Corpus. The metrics are accuracy and F1.
![Page 25: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/25.jpg)
Experiments: Image-sentence ranking
Using image-sentence pairs from the COCO dataset, matrices $U$ and $V$ are learned to minimize the ranking loss:
$$\mathcal{L}(U, V) = \sum_x \sum_{k=1}^{K} \max\{0, \alpha - \cos(Ux, Vy) + \cos(Ux, Vy_k)\} + \sum_y \sum_{k=1}^{K} \max\{0, \alpha - \cos(Vy, Ux) + \cos(Vy, Ux_k)\}$$
Where:
$x$ is an image vector obtained from a pre-trained VGG-19
$y$ is the skip-thought vector for the ground truth sentence
$x_k$ and $y_k$ are "incorrect" image and skip-thought vectors, i.e. ones that do not correspond to $y$ and $x$ respectively
$\cos$ is cosine similarity
$\alpha = 0.2$, $K = 50$
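The contribution of a single matching image-sentence pair to this loss can be sketched as below. The function names, the identity projections, and the toy vectors are assumptions for the example; the full loss would sum `pair_ranking_loss` over all pairs, with $K$ sampled contrastive items each.

```python
import math

def matvec(W, x):
    """Matrix-vector product, with W stored as a list of rows."""
    return [sum(w * x_j for w, x_j in zip(row, x)) for row in W]

def cos_sim(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pair_ranking_loss(x, y, neg_xs, neg_ys, U, V, alpha=0.2):
    """Hinge ranking loss for one matching (image x, sentence y) pair."""
    ux, vy = matvec(U, x), matvec(V, y)
    pos = cos_sim(ux, vy)
    # Image as query: each wrong sentence y_k should score at least alpha below y.
    image_term = sum(max(0.0, alpha - pos + cos_sim(ux, matvec(V, y_k)))
                     for y_k in neg_ys)
    # Sentence as query: each wrong image x_k should score at least alpha below x.
    sentence_term = sum(max(0.0, alpha - pos + cos_sim(vy, matvec(U, x_k)))
                        for x_k in neg_xs)
    return image_term + sentence_term

identity = [[1.0, 0.0], [0.0, 1.0]]
loss = pair_ranking_loss(
    x=[1.0, 0.0], y=[1.0, 0.0],
    neg_xs=[[1.0, 0.0]],   # a wrong image that looks identical: violates the margin
    neg_ys=[[0.0, 1.0]],   # a wrong sentence that is orthogonal: no penalty
    U=identity, V=identity,
)
```

Only contrastive items that come within the margin $\alpha$ of the true match contribute, which is why the orthogonal negative adds nothing here while the identical one adds the full margin.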
![Page 26: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/26.jpg)
Results: Image-sentence ranking
Table: COCO test-set results for image-sentence retrieval experiments.R@K is Recall@K (high is good). Med r is the median rank (low is good).
![Page 27: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/27.jpg)
Experiments: Classification Benchmarks
Some of the same tasks as the USE paper:
MR: Movie review snippet sentiment on a five star scale (Pang and Lee, 2005).
CR: Sentiment of sentences mined from customer reviews (Hu and Liu, 2004).
SUBJ: Subjectivity of sentences from movie reviews and plot summaries (Pang and Lee, 2004).
MPQA: Phrase level opinion polarity from news data (Wiebe et al., 2005).
TREC: Fine grained question classification sourced from TREC (Li and Roth, 2002).
![Page 28: Universal Sentence Encoderlcarin/Serge4.19.2019.pdf · 2019-04-19 · Motivation Methods Results Universal Sentence Encoder Daniel Cer1 Yinfei Yang1 Sheng-yi Kong1 Nan Hua1 Nicole](https://reader034.vdocuments.us/reader034/viewer/2022050204/5f575427cd33fd0766426c1b/html5/thumbnails/28.jpg)
Results: Classification Benchmarks
Table: Classification accuracies on several standard benchmarks.