
Universal Sentence Encoder

Daniel Cer¹, Yinfei Yang¹, Sheng-yi Kong¹, Nan Hua¹, Nicole Limtiaco², Rhomni St. John¹, Noah Constant¹, Mario Guajardo-Cespedes¹, Steve Yuan³, Chris Tar¹, Yun-Hsuan Sung¹, Brian Strope¹, Ray Kurzweil¹

¹ Google Research, Mountain View, CA
² Google Research, New York, NY
³ Google, Cambridge, MA

19 April 2019

Presented by: Serge Assaad


Motivation

Limited amounts of training data are available for many NLP tasks.

Many models address the problem by implicitly performing limited transfer learning through the use of pre-trained word embeddings.

Recent work has demonstrated strong transfer task performance using pre-trained sentence-level embeddings (Conneau et al., 2017).

In this paper, two models are presented to produce sentence embeddings that transfer well to other NLP tasks.


Background: Unpacking "Attention"

Suppose you have a sentence $s$ with words $w_1, \ldots, w_N$ and corresponding embeddings $x_1, \ldots, x_N$.

We will now try to make the embeddings "context-aware":

Pick an embedding $x_i$ and find its similarities with all the embeddings (itself included), i.e. find $\mathrm{sim} = [x_i \cdot x_1, \ldots, x_i \cdot x_N]$.

Now find $\alpha = \mathrm{softmax}(\mathrm{sim})$.

Finally, compute $x_i^{\text{aware}} = \sum_{k=1}^{N} \alpha_k x_k$.

You now have a "context-aware" embedding!
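
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the paper: the helper names `softmax` and `context_aware` and the toy matrix `X` are made up for the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def context_aware(X, i):
    """Context-aware version of row i of X, where X has shape (N, d)."""
    sim = X @ X[i]        # dot product of x_i with every x_k, shape (N,)
    alpha = softmax(sim)  # attention weights: non-negative, summing to 1
    return alpha @ X      # weighted sum of all embeddings, shape (d,)

# Toy input: N = 3 words with d = 2 (values chosen only for illustration).
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
x0_aware = context_aware(X, 0)
```

Because the weights sum to 1, each context-aware embedding is a convex combination of the original embeddings.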


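Background: Unpacking "Attention"

Converting this to matrix form, you get:

$$X^{\text{aware}} = \mathrm{softmax}(X X^T)\, X \tag{1}$$

where the rows of $X$ are the embeddings and the softmax is applied row-wise. (This is called self-attention.)

Now we'll generalize a bit and say we have matrices $Q$, $K$, $V$ of queries, keys, and values. We can apply the same idea to get:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V \tag{2}$$

where $d_k$ is the dimension of the rows of $K$. Dividing by $\sqrt{d_k}$ keeps the dot products from growing with $d_k$, which would otherwise push the softmax into regions with very small gradients.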
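
Equations (1) and (2) can be sketched as follows. This is an illustrative NumPy version under simple assumptions (2-D arrays, no masking, no batching); the function names are chosen for the example.

```python
import numpy as np

def softmax(z):
    """Row-wise, numerically stable softmax."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (2)."""
    d_k = K.shape[-1]                # dimension of the rows of K
    scores = Q @ K.T / np.sqrt(d_k)  # (num_queries, num_keys)
    return softmax(scores) @ V

def self_attention_unscaled(X):
    """Plain self-attention, Eq. (1)."""
    return softmax(X @ X.T) @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens, 8-dimensional embeddings (toy sizes)
out = attention(X, X, X)     # self-attention: Q = K = V = X
```

With Q = K = V = X, Eq. (2) differs from Eq. (1) only by the scaling factor inside the softmax.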


Background: Multi-head attention

Let's add another layer of complexity:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\, W^O \tag{3}$$

where:

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V) \tag{4}$$

$W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$ are learned parameters.

This multi-head idea allows our model to attend to information from different representation subspaces via the different heads.
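
A minimal sketch of Eqs. (3) and (4), assuming random matrices in place of learned parameters. The sizes and the choice `d_head = d_model // H` are assumptions for the example (the latter follows the common convention in Vaswani et al.).

```python
import numpy as np

def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (2)."""
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head(Q, K, V, W_Q, W_K, W_V, W_O):
    """Eqs. (3)-(4). W_Q, W_K, W_V are lists with one projection per head."""
    heads = [attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
N, d_model, H = 5, 16, 4
d_head = d_model // H  # assumed per-head size
X = rng.normal(size=(N, d_model))
W_Q = [rng.normal(size=(d_model, d_head)) for _ in range(H)]
W_K = [rng.normal(size=(d_model, d_head)) for _ in range(H)]
W_V = [rng.normal(size=(d_model, d_head)) for _ in range(H)]
W_O = rng.normal(size=(H * d_head, d_model))
out = multi_head(X, X, X, W_Q, W_K, W_V, W_O)  # shape (N, d_model)
```

With a single head and identity projections, this reduces to the single-head attention of Eq. (2).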


Background: Transformer (Vaswani et al., 2017)

Figure: Transformer Architecture


Background: Transformer (Vaswani et al., 2017)

In the encoder, the attention used is just self-attention (i.e. $Q = K = V$, which are the word embeddings in the first layer).

The decoder employs self-attention on the embeddings of the output, followed by encoder-decoder attention (i.e. $K$ and $V$ are both the outputs of the encoder, and $Q$ comes from upstream layers of the decoder).


Model 1 - Transformer-based Universal Sentence Encoder (USE)

Suppose we have a sentence consisting of $N$ words, and $x_1, \ldots, x_N$ are the word embeddings.

The function TransformerEncoder takes in $N$ word embeddings and outputs $N$ "context-aware" embeddings.

For the Transformer-based USE, the sentence embedding is calculated as follows:

$$S = \frac{\sum \mathrm{TransformerEncoder}(x_1, \ldots, x_N)}{\sqrt{N}} \tag{5}$$

where the sum runs over the $N$ output embeddings.
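
Eq. (5) amounts to an element-wise sum of the encoder outputs divided by the square root of the sentence length. A small sketch, with a constant matrix `H_out` standing in for the real encoder output (an assumption for the example):

```python
import numpy as np

def sentence_embedding(contextual):
    """Eq. (5): sum the N context-aware embeddings, divide by sqrt(N).

    contextual: array of shape (N, d), one row per word.
    """
    n = contextual.shape[0]
    return contextual.sum(axis=0) / np.sqrt(n)

# Stand-in for TransformerEncoder(x_1, ..., x_4): four 3-dimensional rows of ones.
H_out = np.ones((4, 3))
S = sentence_embedding(H_out)  # each entry is 4 / sqrt(4) = 2
```

The result has a fixed size d regardless of the sentence length N.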


Model 2 - Deep Averaging Network (DAN) (Iyyer et al., 2015)

Figure: DAN Architecture


Experiments: Classification Benchmarks

For both models (Transformer and DAN), classification layers are added for transfer learning on the following tasks:

MR: Movie review snippet sentiment on a five star scale (Pang and Lee, 2005).
CR: Sentiment of sentences mined from customer reviews (Hu and Liu, 2004).
SUBJ: Subjectivity of sentences from movie reviews and plot summaries (Pang and Lee, 2004).
MPQA: Phrase-level opinion polarity from news data (Wiebe et al., 2005).



TREC: Fine-grained question classification sourced from TREC (Li and Roth, 2002).
SST: Binary phrase-level sentiment classification (Socher et al., 2013).
STS Benchmark: Semantic textual similarity (STS) between sentence pairs scored by Pearson correlation with human judgments (Cer et al., 2017).

For the STS Benchmark task, no transfer layers are used. The sentence embeddings are used to compute similarity via:

$$\mathrm{sim}(u, v) = 1 - \frac{\arccos\!\left(\frac{u \cdot v}{\|u\|\,\|v\|}\right)}{\pi} \tag{6}$$
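
Eq. (6) can be written directly in NumPy. The clipping step is an addition for numerical safety (rounding can push the cosine slightly outside [-1, 1]) and is not part of the formula itself.

```python
import numpy as np

def angular_similarity(u, v):
    """Eq. (6): 1 - arccos(cosine similarity) / pi."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    cos = np.clip(cos, -1.0, 1.0)  # guard against floating-point overshoot
    return 1.0 - np.arccos(cos) / np.pi

same = angular_similarity(np.array([1.0, 0.0]), np.array([1.0, 0.0]))
orthogonal = angular_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = angular_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0]))
```

Identical directions score 1, orthogonal vectors 0.5, and opposite directions 0.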


Results

Table: Accuracy results for sentence-level tasks (last column is Pearson correlation of sentence similarity with human judgments)


Discussion

Figure: TL;DR: Transformer runtime is $O(n^2)$ in the sentence length $n$, DAN runtime is $O(n)$ (at the expense of task performance)


Skip-Thought Vectors

Ryan Kiros¹, Yukun Zhu¹, Ruslan Salakhutdinov¹,², Richard S. Zemel¹,², Antonio Torralba³, Raquel Urtasun¹, Sanja Fidler¹

¹ University of Toronto
² Canadian Institute for Advanced Research
³ Massachusetts Institute of Technology

19 April 2019

Presented by: Serge Assaad


Motivation

Questions:

Can we "universally/generically" encode information in sentences in a distributed semantic representation the same way we do it for words?

Can we use these supposedly generic embeddings to do well on tasks related to these sentences without resorting to expensive task-specific language models?


Skip-Thought Model

Figure: The skip-thoughts model. Given a tuple $(s_{i-1}, s_i, s_{i+1})$ of contiguous sentences, with $s_i$ the $i$-th sentence of a book, the sentence $s_i$ is encoded and tries to reconstruct the previous sentence $s_{i-1}$ and next sentence $s_{i+1}$. In this example, the input is the sentence triplet (I got back home. I could see the cat on the steps. This was strange.) Unattached arrows are connected to the encoder output. Colors indicate which components share parameters. <eos> is the end-of-sentence token.


Notation

$w_i^t$ is the $t$-th word in sentence $s_i$, and $x_i^t$ is its word embedding.

For the sentence $s_i$ with words $w_i^1, \ldots, w_i^N$, at each time step $t$ the encoder of the recurrent model (presented later) produces a hidden state $h_i^t$, which represents the sequence $w_i^1, \ldots, w_i^t$.

Thus, we can think of the "sentence embedding" as $h_i^N$, which represents the entire sentence.


Skip-Thought Model: Encoder (GRU)

A typical GRU encoder (dubbed $E$):

$$r^t = \sigma(W_r x^t + U_r h^{t-1}) \tag{1}$$

$$z^t = \sigma(W_z x^t + U_z h^{t-1}) \tag{2}$$

$$\bar{h}^t = \tanh(W x^t + U(r^t \odot h^{t-1})) \tag{3}$$

$$h^t = (1 - z^t) \odot h^{t-1} + z^t \odot \bar{h}^t \tag{4}$$

$\bar{h}^t$ is the proposed update at time $t$, $z^t$ is the update gate, $r^t$ is the reset gate, and $\odot$ denotes the Hadamard product.
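
The encoder equations (1)-(4) map onto code almost line by line. The sketch below uses small random matrices in place of trained weights and a zero initial state; the dictionary layout, sizes, and function names are assumptions for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One encoder step, Eqs. (1)-(4)."""
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev)          # reset gate
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev)          # update gate
    h_bar = np.tanh(p["W"] @ x_t + p["U"] @ (r * h_prev))    # proposed update
    return (1.0 - z) * h_prev + z * h_bar                    # new hidden state

def encode(xs, p, d_h):
    """Run the GRU over a sentence; the final state is the sentence vector."""
    h = np.zeros(d_h)  # assumed initial state
    for x_t in xs:
        h = gru_step(x_t, h, p)
    return h

rng = np.random.default_rng(0)
d_x, d_h = 4, 3  # toy embedding and hidden sizes
params = {name: rng.normal(scale=0.1, size=(d_h, d_x)) for name in ("W_r", "W_z", "W")}
params.update({name: rng.normal(scale=0.1, size=(d_h, d_h)) for name in ("U_r", "U_z", "U")})
sentence = rng.normal(size=(5, d_x))  # 5 word embeddings
h_N = encode(sentence, params, d_h)
```

Since each new state is a gated mix of the previous state and a tanh output, every entry of the hidden state stays strictly between -1 and 1 when starting from zero.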


Skip-Thought Model: Decoder (GRU)

The decoder (dubbed $D_{\text{next}}$) described here tries to predict the words in the next sentence $s_{i+1}$, given the hidden state $h_i^N$ (denoted $h_i$ for brevity) for the sentence $s_i$.

We use the matrices $C_z$, $C_r$, and $C$ to bias the decoder gates with the sentence vector $h_i$:

$$r^t = \sigma(W_r^d x^{t-1} + U_r^d h^{t-1} + C_r h_i) \tag{5}$$

$$z^t = \sigma(W_z^d x^{t-1} + U_z^d h^{t-1} + C_z h_i) \tag{6}$$

$$\bar{h}^t = \tanh(W^d x^{t-1} + U^d(r^t \odot h^{t-1}) + C h_i) \tag{7}$$

$$h_{i+1}^t = (1 - z^t) \odot h^{t-1} + z^t \odot \bar{h}^t \tag{8}$$
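
A sketch of one decoder step, Eqs. (5)-(8). The only change from a plain GRU step is the extra term that injects the sentence vector into both gates and the proposed update. Weights are random stand-ins and all sizes are assumptions for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def decoder_step(x_prev, h_prev, h_i, p):
    """One decoder step; h_i is the encoder's vector for sentence s_i."""
    r = sigmoid(p["W_r"] @ x_prev + p["U_r"] @ h_prev + p["C_r"] @ h_i)
    z = sigmoid(p["W_z"] @ x_prev + p["U_z"] @ h_prev + p["C_z"] @ h_i)
    h_bar = np.tanh(p["W"] @ x_prev + p["U"] @ (r * h_prev) + p["C"] @ h_i)
    return (1.0 - z) * h_prev + z * h_bar

rng = np.random.default_rng(0)
d_x, d_h, d_s = 4, 3, 5  # word, decoder-state, and sentence-vector sizes (toy)
shapes = {"W_r": (d_h, d_x), "W_z": (d_h, d_x), "W": (d_h, d_x),
          "U_r": (d_h, d_h), "U_z": (d_h, d_h), "U": (d_h, d_h),
          "C_r": (d_h, d_s), "C_z": (d_h, d_s), "C": (d_h, d_s)}
params = {name: rng.normal(scale=0.5, size=shape) for name, shape in shapes.items()}
x_prev = rng.normal(size=d_x)               # embedding of the previous word
h_prev = rng.uniform(-1.0, 1.0, size=d_h)   # previous decoder state
h_i_a, h_i_b = rng.normal(size=d_s), rng.normal(size=d_s)
h_a = decoder_step(x_prev, h_prev, h_i_a, params)
h_b = decoder_step(x_prev, h_prev, h_i_b, params)
```

Two different sentence vectors give different decoder states from the same inputs; setting the C matrices to zero removes that dependence.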


Skip-Thought Model: Decoder

We then convert this hidden state $h_{i+1}^t$ into a softmax probability:

$$P(w_{i+1}^t \mid w_{i+1}^{<t}, h_i) \propto \exp(v_{w_{i+1}^t} h_{i+1}^t) \tag{9}$$

where $v_{w_{i+1}^t}$ is the row of a learned vocabulary matrix $V$ which corresponds to word $w_{i+1}^t$.

Note: We train another identical decoder $D_{\text{prev}}$ to predict the previous sentence $s_{i-1}$, which has separate parameters from $D_{\text{next}}$ with the exception of the vocabulary matrix $V$, which is shared between decoders.
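
Normalizing Eq. (9) over the whole vocabulary gives a softmax over the rows of V. A minimal sketch with a random vocabulary matrix and hidden state (toy sizes, chosen for the example):

```python
import numpy as np

def next_word_distribution(V, h):
    """Softmax over the vocabulary: P(word) proportional to exp(v_word . h)."""
    logits = V @ h                    # one score per vocabulary row
    logits = logits - logits.max()    # stabilize the exponentials
    probs = np.exp(logits)
    return probs / probs.sum()

rng = np.random.default_rng(0)
vocab_size, d_h = 6, 3
V = rng.normal(size=(vocab_size, d_h))  # stand-in for the learned vocabulary matrix
h = rng.normal(size=d_h)                # stand-in for the decoder state
p = next_word_distribution(V, h)
```

The most probable word is the one whose row of V has the largest dot product with the hidden state.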


Objective

We optimize the parameters of the encoder $E$ and the two decoders ($D_{\text{next}}$ and $D_{\text{prev}}$) by maximizing the following objective, a sum of log-probabilities:

$$\mathcal{L}(E, D_{\text{next}}, D_{\text{prev}}) = \sum_t \log P(w_{i+1}^t \mid w_{i+1}^{<t}, h_i) \tag{10}$$

$$\qquad + \sum_t \log P(w_{i-1}^t \mid w_{i-1}^{<t}, h_i) \tag{11}$$

Result: After training, we can extract the hidden state of each sentence $h_i$ from the encoder and that's the "sentence embedding"!
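
Given the per-step word distributions produced by the two decoders, the objective in (10)-(11) for one sentence triplet is a sum of log-probabilities of the observed words. The sketch below assumes those distributions are already available as arrays; the function names and the uniform toy distributions are made up for the example.

```python
import numpy as np

def sentence_log_likelihood(step_probs, word_ids):
    """Sum over t of log P(word at step t).

    step_probs: array (T, vocab_size), one distribution per time step.
    word_ids:   the T observed word indices.
    """
    picked = step_probs[np.arange(len(word_ids)), word_ids]
    return float(np.sum(np.log(picked)))

def skip_thought_objective(probs_next, ids_next, probs_prev, ids_prev):
    """Next-sentence term plus previous-sentence term."""
    return (sentence_log_likelihood(probs_next, ids_next)
            + sentence_log_likelihood(probs_prev, ids_prev))

# Toy case: a 4-word vocabulary and decoders that predict uniformly.
probs_next = np.full((3, 4), 0.25)  # next sentence has 3 words
probs_prev = np.full((2, 4), 0.25)  # previous sentence has 2 words
objective = skip_thought_objective(probs_next, [0, 2, 1], probs_prev, [3, 3])
```

In this toy case the value is 5 * log(0.25); training pushes it toward 0 by making the observed words more probable.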


Training details

A few different versions of the skip-thoughts model are used:

uni-skip: The model just presented (unidirectional skip-thoughts).
bi-skip: Similar, but with two encoders $E_{\text{forward}}$ and $E_{\text{reverse}}$, whose outputs are concatenated then fed to the decoder. The latter encoder is fed the sentence in reverse order.
combine-skip: A combination of uni-skip and bi-skip, where the outputs of the encoders of the uni-skip and bi-skip models are concatenated then fed to the decoders.
combine-skip + COCO: Encoder embeddings are concatenated with an image-sentence embedding model trained on COCO (described later).

These models are trained using the BookCorpus dataset (Zhu et al., 2015).
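
The concatenations behind bi-skip and combine-skip can be illustrated with a toy encoder. Here `toy_encoder` is a hypothetical stand-in for a trained GRU (an order-sensitive running average), and the same stand-in is reused for every encoder, whereas the real models use separately trained ones.

```python
import numpy as np

def toy_encoder(xs):
    """Hypothetical stand-in for a GRU encoder: order-sensitive running average."""
    h = np.zeros(xs.shape[1])
    for x in xs:
        h = 0.5 * h + 0.5 * x
    return h

def bi_skip(xs, enc_forward, enc_reverse):
    """Concatenate the forward encoding with the encoding of the reversed sentence."""
    return np.concatenate([enc_forward(xs), enc_reverse(xs[::-1])])

def combine_skip(uni_vec, bi_vec):
    """Concatenate the uni-skip and bi-skip sentence vectors."""
    return np.concatenate([uni_vec, bi_vec])

xs = np.array([[1.0, 0.0],
               [0.0, 1.0]])                  # two toy word embeddings
uni = toy_encoder(xs)                        # [0.25, 0.5]
bi = bi_skip(xs, toy_encoder, toy_encoder)   # [0.25, 0.5, 0.5, 0.25]
combined = combine_skip(uni, bi)             # 6-dimensional
```

The forward and reverse halves differ because the encoder is sensitive to word order.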


Experiments: Sentence-Level comparisons

SICK task: Given two sentences, the goal is to produce a score of how semantically related they are. The ground truth is human-annotated scores from 1 to 5 (Marelli et al., 2014).

Microsoft Paraphrase Corpus: Predicting whether or not two sentences are paraphrases of each other.


Results: Sentence-Level comparisons

Table: Left: Test set results on the SICK semantic relatedness subtask. The metrics are Pearson's $r$, Spearman's $\rho$, and MSE. Right: Test set results on the Microsoft Paraphrase Corpus. The metrics are accuracy and F1.


Experiments: Image-sentence ranking

Using image-sentence pairs from the COCO dataset, matrices $U$ and $V$ are learned to minimize the ranking loss:

$$\mathcal{L}(U, V) = \sum_x \sum_{k=1}^{K} \max\{0, \alpha - \cos(Ux, Vy) + \cos(Ux, Vy_k)\} + \sum_y \sum_{k=1}^{K} \max\{0, \alpha - \cos(Vy, Ux) + \cos(Vy, Ux_k)\}$$

Where:
$x$ is an image vector obtained from a pre-trained VGG-19.
$y$ is the skip-thought vector for the ground-truth sentence.
$x_k$ and $y_k$ are "incorrect" image and skip-thought vectors, i.e. ones that do not correspond to $y$ and $x$ respectively.
$\cos$ is cosine similarity.
$\alpha = 0.2$, $K = 50$.
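
The contribution of a single matching (image, sentence) pair to the ranking loss can be sketched as below. Identity matrices stand in for the learned U and V, and the vectors are toy values; only the margin of 0.2 comes from the slide.

```python
import numpy as np

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pair_ranking_loss(U, V, x, y, neg_x, neg_y, alpha=0.2):
    """Loss for one matching pair (x, y) against contrastive examples.

    neg_y: sentence vectors that do not describe image x.
    neg_x: image vectors that do not match sentence y.
    """
    ux, vy = U @ x, V @ y
    pos = cos_sim(ux, vy)
    image_term = sum(max(0.0, alpha - pos + cos_sim(ux, V @ yk)) for yk in neg_y)
    sentence_term = sum(max(0.0, alpha - pos + cos_sim(vy, U @ xk)) for xk in neg_x)
    return image_term + sentence_term

I2 = np.eye(2)  # identity stand-ins for U and V
x = y = np.array([1.0, 0.0])
easy = pair_ranking_loss(I2, I2, x, y,
                         neg_x=[np.array([-1.0, 0.0])],
                         neg_y=[np.array([0.0, 1.0])])
hard = pair_ranking_loss(I2, I2, x, y,
                         neg_x=[np.array([-1.0, 0.0])],
                         neg_y=[np.array([1.0, 0.0])])
```

A contrastive example contributes only when its similarity comes within the margin of the true pair's similarity: the first case gives zero loss, the second gives exactly the margin.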


Results: Image-sentence ranking

Table: COCO test-set results for image-sentence retrieval experiments. R@K is Recall@K (high is good). Med r is the median rank (low is good).



Some of the same tasks as the USE paper:

MR: Movie review snippet sentiment on a five star scale (Pang and Lee, 2005).
CR: Sentiment of sentences mined from customer reviews (Hu and Liu, 2004).
SUBJ: Subjectivity of sentences from movie reviews and plot summaries (Pang and Lee, 2004).
MPQA: Phrase-level opinion polarity from news data (Wiebe et al., 2005).
TREC: Fine-grained question classification sourced from TREC (Li and Roth, 2002).


Results: Classification Benchmarks

Table: Classification accuracies on several standard benchmarks.
