Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287


Page 1:

Part-of-Speech Tagging

+

Neural Networks 3: Word Embeddings

CS 287

Page 2:

Review: Neural Networks

One-layer (single hidden layer) multi-layer perceptron architecture,

NNMLP1(x) = g(xW1 + b1)W2 + b2

I xW + b; perceptron

I x is the dense representation in R1×din

I W1 ∈ Rdin×dhid ,b1 ∈ R1×dhid ; first affine transformation

I W2 ∈ Rdhid×dout ,b2 ∈ R1×dout ; second affine transformation

I g : R1×dhid → R1×dhid is an activation non-linearity (often pointwise)

I g(xW1 + b1) is the hidden layer

Page 3:

Review: Non-Linearities Tanh

Hyperbolic Tangent:

tanh(t) = (exp(t) − exp(−t)) / (exp(t) + exp(−t))

I Intuition: Similar to sigmoid, but range between −1 and 1.

Page 4:

Review: Backpropagation

[Diagram: a module fi+1(∗; θi+1) takes input fi(. . . f1(x0)) and produces output fi+1(fi(. . . f1(x0))). Backpropagation takes ∂L/∂fi+1(. . . f1(x0)) from above and produces ∂L/∂fi(. . . f1(x0)) for the layer below and ∂L/∂θi+1 for the parameter update.]

Page 5:

Quiz

One common class of operations in neural network models is known as

pooling. Informally, a pooling layer consists of an aggregation unit, typically

unparameterized, that reduces the input to a smaller size.

Consider three pooling functions of the form f : Rn → R,

1. f (x) = maxi xi

2. f (x) = mini xi

3. f (x) = ∑i xi/n

What effect does each of these functions have? What are their

gradients? How would you implement backpropagation for these units?

Page 6:

Quiz

I Max pooling: f (x) = maxi xi

I Keeps only the most activated input

I Fprop is simple; however must store arg max (“switch”)

I Bprop gradient is zero except for switch, which gets gradoutput

I Min pooling: f (x) = mini xi

I Keeps only the least activated input

I Fprop is simple; however must store arg min (“switch”)

I Bprop gradient is zero except for switch, which gets gradoutput

I Avg pooling: f (x) = ∑i xi/n

I Keeps the average of the input activations

I Fprop is simply the mean.

I Gradoutput is divided by n and passed to all inputs (all three units are sketched below).
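A minimal numpy sketch of the three pooling units and their backward passes (illustrative code, not the course's Torch modules; min pooling is max pooling with argmin):

import numpy as np

def max_pool_forward(x):
    switch = int(np.argmax(x))          # store the "switch" for bprop
    return x[switch], switch

def max_pool_backward(grad_output, switch, n):
    grad_input = np.zeros(n)
    grad_input[switch] = grad_output    # only the switch position receives gradOutput
    return grad_input

def avg_pool_forward(x):
    return x.mean()

def avg_pool_backward(grad_output, n):
    return np.full(n, grad_output / n)  # each input receives gradOutput / n

x = np.array([1.0, 3.0, -2.0])
y, switch = max_pool_forward(x)                 # y = 3.0, switch = 1
print(max_pool_backward(1.0, switch, x.size))   # [0. 1. 0.]
print(avg_pool_backward(1.0, x.size))           # [0.333... 0.333... 0.333...]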

Page 9:

Contents

Embedding Motivation

C&W Embeddings

word2vec

Evaluating Embeddings

Page 10:

1. Use dense representations instead of sparse

2. Use windowed area instead of sequence models

3. Use neural networks to model windowed interactions

Page 11:

What about rare words?

Page 12:

Word Embeddings

Embedding layer,

x0W0

I x0 ∈ R1×d0 one-hot word.

I W0 ∈ Rd0×din , d0 = |V|

Notes:

I d0 >> din, e.g. d0 = 10000, din = 50
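A minimal numpy sketch of this layer (sizes from the example above, word index hypothetical): multiplying a one-hot row by W0 simply selects a row of W0, which is why the embedding layer is implemented as a table lookup in practice.

import numpy as np

d0, d_in = 10000, 50                     # |V| and embedding size from above
W0 = np.random.randn(d0, d_in) * 0.01    # embedding matrix

word_index = 42                          # hypothetical index of a word in V
x0 = np.zeros((1, d0))
x0[0, word_index] = 1.0                  # one-hot row vector

dense = x0 @ W0                          # (1 x d0)(d0 x d_in) -> (1 x d_in)
lookup = W0[word_index]                  # equivalent row lookup, no matrix product
assert np.allclose(dense[0], lookup)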

Page 13:

Pretraining Representations

I We would like strong shared representations of words

I However, PTB has only 1M labeled words; relatively small

I Collobert et al. (2008, 2011) use semi-supervised method.

I (Close connection to Bengio et al (2003), next topic)

Page 14:

Semi-Supervised Training

Idea: Train representations separately on more data

1. Pretrain word embeddings W0 first.

2. Substitute them in as first NN layer

3. Fine-tune embeddings for final task

I Modify the first layer based on supervised gradients

I Optional, some work skips this step
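A sketch of steps 2 and 3, assuming the pretrained vectors are already available as a |V| × din numpy array (the file name and function below are hypothetical):

import numpy as np

# 1./2. Load pretrained embeddings and substitute them as the first layer.
W0 = np.load("pretrained_embeddings.npy")   # hypothetical file, shape |V| x d_in

# 3. Optionally fine-tune on supervised gradients from the tagging task.
fine_tune = True
learning_rate = 0.01

def update_embeddings(word_indices, grad_rows):
    # grad_rows: gradient of the supervised loss w.r.t. the looked-up rows
    if fine_tune:
        W0[word_indices] -= learning_rate * grad_rows
    # if fine_tune is False, the pretrained rows stay frozen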

Page 15:

Large Corpora

To learn rare word embeddings, need many more tokens,

I C&W

I English Wikipedia (631 million word tokens)

I Reuters Corpus (221 million word tokens)

I Total vocabulary size: 130,000 word types

I word2vec

I Google News (6 billion word tokens)

I Total vocabulary size: ≈ 1M word types

But this data has no labels...

Page 16:

Contents

Embedding Motivation

C&W Embeddings

word2vec

Evaluating Embeddings

Page 17:

C&W Embeddings

I Assumption: Text in Wikipedia is coherent (in some sense).

I Most randomly corrupted text is incoherent.

I Embeddings should distinguish coherence.

I Common idea in unsupervised learning (distributional hypothesis).

Page 18:

C&W Setup

Let V be the vocabulary of English and let s score any window of size

dwin = 5. If we see the phrase

[ the dog walks to the ]

It should score higher by s than

[ the dog house to the ]

[ the dog cats to the ]

[ the dog skips to the ]

...

Page 19:

C&W Setup

Can estimate score s as a windowed neural network.

s(w1, . . . , wdwin) = hardtanh(xW1 + b1)W2 + b2

with

x = [v(w1) v(w2) . . . v(wdwin)]

I din = dwin × 50, dhid = 100, dwin = 11, dout = 1!

Example: Function s

x = [v(w3) v(w4) v(w5) v(w6) v(w7)]

[Diagram: x concatenates dwin blocks of size din/dwin each.]
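A numpy sketch of s for a dwin = 5 window (the hardtanh MLP from the slide; vocabulary size and word ids are illustrative):

import numpy as np

d_win, d_emb, d_hid = 5, 50, 100
d_in = d_win * d_emb                       # din = dwin x 50

rng = np.random.default_rng(0)
W0 = rng.normal(0, 0.01, (10000, d_emb))   # word embeddings, |V| x 50
W1 = rng.normal(0, 0.01, (d_in, d_hid)); b1 = np.zeros(d_hid)
W2 = rng.normal(0, 0.01, (d_hid, 1));    b2 = np.zeros(1)

def hardtanh(t):
    return np.clip(t, -1.0, 1.0)

def score(window_ids):
    # window_ids: dwin word ids, e.g. for [the dog walks to the]
    x = np.concatenate([W0[i] for i in window_ids])    # concatenated embeddings
    return (hardtanh(x @ W1 + b1) @ W2 + b2)[0]

s_good = score([0, 1, 2, 3, 0])   # hypothetical ids for "the dog walks to the"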

Page 20:

Training?

I Different setup than previous experiments.

I No direct supervision y

I Train to rank good examples better.

Page 21:

Ranking Loss

Given only examples {x1, . . . , xn}, and for each example a set D(x) of

alternatives.

L(θ) = ∑i ∑x′∈D(xi) Lranking(s(xi; θ), s(x′; θ))

Lranking(y, y′) = max{0, 1 − (y − y′)}

Example: C&W ranking

x = [the dog walks to the]

D(x) = { [the dog skips to the], [the dog in to the], . . . }

I (Torch nn.RankingCriterion)

I Note: slightly different setup.
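A sketch of the hinge above for a single (coherent, corrupted) pair of window scores (plain Python, not Torch's criterion verbatim):

def ranking_loss(s_good, s_bad):
    # Lranking(y, y') = max{0, 1 - (y - y')}
    return max(0.0, 1.0 - (s_good - s_bad))

print(ranking_loss(2.5, 0.3))   # 0.0: the coherent window already wins by a margin of 1
print(ranking_loss(0.4, 0.3))   # 0.9: the corrupted window is too close, so gradient flows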

Page 22:

C&W Embeddings in Practice

I Vocabulary size |D(x)| > 100,000

I Training time: 4 weeks

I (Collobert is a main author of Torch)

Page 23:

Sampling (Sketch of Wsabie (Weston, 2011))

Observation: in many contexts

Lranking(y, y′) = max{0, 1 − (y − y′)} = 0

Particularly true later in training.

For difficult contexts, it may be easy to find

Lranking(y, y′) = max{0, 1 − (y − y′)} ≠ 0

We can therefore sample from D(x) to find an update.
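A sketch of that sampling step, assuming a score function over windows of word ids (as in the scorer sketch earlier) and a corruption that swaps the middle word for a random vocabulary item; names are illustrative:

import random

def sample_violation(window_ids, score, vocab_size, max_tries=50):
    s_good = score(window_ids)
    for _ in range(max_tries):
        corrupted = list(window_ids)
        corrupted[len(corrupted) // 2] = random.randrange(vocab_size)  # random middle word
        s_bad = score(corrupted)
        if 1.0 - (s_good - s_bad) > 0.0:   # nonzero ranking loss: a useful update
            return corrupted, s_good, s_bad
    return None   # late in training, most corruptions already give zero loss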

Page 24:

C&W Results

1. Use dense representations instead of sparse

2. Use windowed area instead of sequence models

3. Use neural networks to model windowed interactions

4. Use semi-supervised learning to pretrain representations.

Page 25:

Contents

Embedding Motivation

C&W Embeddings

word2vec

Evaluating Embeddings

Page 26:

word2vec

I Contributions:

I Scale embedding process to massive sizes

I Experiments with several architectures

I Empirical evaluations of embeddings

I Influential release of software/data.

I Differences with C&W

I Instead of MLP uses (bi)linear model (linear in paper)

I Instead of ranking model, directly predict word (cross-entropy)

I Various other extensions.

I Two different models

1. Continuous Bag-of-Words (CBOW)

2. Continuous Skip-gram

Page 28:

word2vec (Bilinear Model)

Back to pure bilinear model, but with much bigger output space

y = softmax(((∑i x0i W0) / (dwin − 1)) W1)

I x0i ∈ R1×d0, the input words' one-hot vectors.

I W0 ∈ Rd0×din ; d0 = |V|, word embeddings

I W1 ∈ Rdin×dout ; dout = |V| output embeddings

Notes:

I Bilinear parameter interaction.

I d0 >> din, e.g. 50 ≤ din ≤ 1000, 10000 ≤ |V| ≤ 1M or more

Page 29:

word2vec (Mikolov, 2013)

Page 30:

Continuous Bag-of-Words (CBOW)

y = softmax(((∑i x0i W0) / (dwin − 1)) W1)

I Attempt to predict the middle word

[ the dog walks to the ]

Example: CBOW

x = (v(w3) + v(w4) + v(w6) + v(w7)) / (dwin − 1)

y = δ(w5)

W1 is no longer partitioned by row (order is lost)
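A numpy sketch of the CBOW forward pass and cross-entropy loss for this example (vocabulary size, dimensions, and word ids are illustrative; this is the bilinear model above, not the word2vec C code):

import numpy as np

V, d_in = 10000, 100
rng = np.random.default_rng(0)
W0 = rng.normal(0, 0.01, (V, d_in))     # input word embeddings
W1 = rng.normal(0, 0.01, (d_in, V))     # output embeddings

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cbow_loss(context_ids, middle_id):
    # x = (v(w3) + v(w4) + v(w6) + v(w7)) / (dwin - 1)
    x = W0[context_ids].mean(axis=0)
    y_hat = softmax(x @ W1)              # distribution over the middle word
    return -np.log(y_hat[middle_id])     # cross-entropy against y = delta(w5)

print(cbow_loss([3, 4, 6, 7], 5))        # hypothetical word ids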

Page 31:

Continuous Skip-gram

y = softmax((x0W0)W1)

I Also a bilinear model

I Attempt to predict each context-word from middle

[ the dog walks to the ]

Example: Skip-gram

x = v(w5)

y = δ(w3)

This is done for each context word in the window (sketched below).
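The corresponding skip-gram sketch: the middle word's embedding predicts each context word in turn, so one window contributes dwin − 1 cross-entropy terms (again illustrative sizes and ids, same bilinear shapes as the CBOW sketch):

import numpy as np

V, d_in = 10000, 100
rng = np.random.default_rng(0)
W0 = rng.normal(0, 0.01, (V, d_in))     # input embeddings
W1 = rng.normal(0, 0.01, (d_in, V))     # output embeddings

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def skipgram_losses(window_ids):
    # e.g. window_ids = [w3, w4, w5, w6, w7] with middle word w5
    mid = len(window_ids) // 2
    y_hat = softmax(W0[window_ids[mid]] @ W1)   # predict from x = v(w5)
    # one cross-entropy term per context word: y = delta(w3), delta(w4), ...
    return [-np.log(y_hat[w]) for i, w in enumerate(window_ids) if i != mid]

print(sum(skipgram_losses([3, 4, 5, 6, 7])))    # hypothetical word ids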

Page 32:

Additional aspects

I The window dwin is sampled for each SGD step

I SGD is done less for frequent words.

I We have slightly simplified the training objective.

Page 33:

Softmax Issues

Use a softmax to force a distribution,

softmax(z) = exp(z) / ∑c∈C exp(zc)

log softmax(z) = z − log ∑c∈C exp(zc)

I Issue: class C is huge.

I For C&W, 100,000, for word2vec 1,000,000 types

I Note largest dataset is 6 billion words

Page 34:

Two-Layer Softmax

First, cluster words into hard classes (for instance Brown clusters).

Groups words into classes based on word-context.

[Tree diagram: words grouped into hard classes, e.g. class 3 = {dog, cat, horse, . . . } and class 5 = {car, truck, motorcycle, . . . }]

Page 35:

Two-Layer Softmax

Assume that we first generate a class C and then a word,

p(Y |X ) ≈ P(Y |C ,X ; θ)P(C |X ; θ)

Estimate distributions with a shared embedding layer,

P(C |X ; θ)

y1 = softmax((x0W0)W1 + b)

P(Y |C = class,X ; θ)

y2 = softmax((x0W0)Wclass + b)

Page 36:

Softmax as Tree

[Tree diagram: a two-level tree with class nodes (3, 5, . . . ) at the first level and words (dog, cat, horse, . . . ; car, truck, motorcycle, . . . ) at the leaves]

ŷ(1) = softmax((x0W0)W1 + b)

ŷ(2) = softmax((x0W0)Wclass + b)

L2SM(ŷ(1), ŷ(2), y(1), y(2)) = − log p(y|x, class(y)) − log p(class(y)|x)

= − log ŷ(1)c1 − log ŷ(2)c2
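A numpy sketch of this two-layer loss, assuming for illustration that the vocabulary is split into contiguous, equal-sized classes (a real system would use Brown-style clusters):

import numpy as np

V, d_in, n_classes = 10000, 100, 100
class_size = V // n_classes                  # each class holds |V| / n_classes words

rng = np.random.default_rng(0)
W0 = rng.normal(0, 0.01, (V, d_in))                           # shared embeddings
W1 = rng.normal(0, 0.01, (d_in, n_classes))                   # class softmax
W_class = rng.normal(0, 0.01, (n_classes, d_in, class_size))  # one softmax per class

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def two_layer_loss(input_id, target_id):
    x = W0[input_id]
    c, j = target_id // class_size, target_id % class_size    # class, index within class
    y1 = softmax(x @ W1)                  # p(class(y) | x)
    y2 = softmax(x @ W_class[c])          # p(y | x, class(y)): only class_size scores
    return -np.log(y1[c]) - np.log(y2[j])

print(two_layer_loss(42, 4567))           # hypothetical word ids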

Page 37:

Speed

[Tree diagram: the same two-level class/word tree as above]

I Computing loss only requires walking path.

I Two-layer softmax: a balanced tree of classes.

I Computing loss requires O(√|V|)

I (Note: computing full distribution requires O(|V|))

Page 38:

Hierarchical Softmax (HSM)

I Build multiple layer tree

LHSM(ŷ(1), . . . , ŷ(C), y(1), . . . , y(C)) = −∑i log ŷ(i)ci

I Balanced tree only requires O(log2 |V|)

I Experiments on website (Mnih and Hinton, 2008)
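A sketch of the path-based loss; for concreteness this uses the binary-tree variant from the word2vec code, where each internal node makes a sigmoid left/right decision, rather than a softmax at every level (node vectors, path encoding, and ids are all illustrative):

import numpy as np

d_in = 100
rng = np.random.default_rng(0)
node_vectors = rng.normal(0, 0.01, (9999, d_in))   # one vector per internal tree node

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def hsm_loss(x, path):
    # path: [(node_id, branch), ...] from the root to the target word's leaf,
    # with branch = +1 for left and -1 for right; its length is O(log2 |V|)
    loss = 0.0
    for node_id, branch in path:
        loss -= np.log(sigmoid(branch * (x @ node_vectors[node_id])))
    return loss

x = rng.normal(0, 0.01, d_in)                      # embedding of the input word
print(hsm_loss(x, [(0, +1), (1, -1), (4, +1)]))    # hypothetical 3-step path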

Page 39:

HSM with Huffman Encoding

I Requires O(log2 perp(unigram))

I Reduces time to only 1 day for 1.6 billion word tokens

Page 40:
Page 41:

Contents

Embedding Motivation

C&W Embeddings

word2vec

Evaluating Embeddings

Page 42:

How good are embeddings?

I Qualitative Analysis/Visualization

I Analogy task

I Extrinsic Metrics

Page 43:

Metrics

Dot-product

xcat x⊤dog

Cosine Similarity

xcat x⊤dog / (||xcat|| ||xdog||)

Page 44:

k-nearest neighbors (cosine sim)

dog

cat 0.921800527377

dogs 0.851315870426

horse 0.790758298322

puppy 0.775492121034

pet 0.772470734611

rabbit 0.772081457265

pig 0.749006160038

snake 0.73991884888

I Intuition: trained to match words that act the same.
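A numpy sketch that produces this kind of neighbor list by cosine similarity, given an embedding matrix W and a vocabulary list (both assumed inputs here):

import numpy as np

def nearest_neighbors(W, vocab, query, k=8):
    # W: |V| x d embedding matrix; vocab: list of word strings (hypothetical inputs)
    q = W[vocab.index(query)]
    sims = (W @ q) / (np.linalg.norm(W, axis=1) * np.linalg.norm(q) + 1e-8)
    order = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != query][:k]

# nearest_neighbors(W, vocab, "dog") would return a ranked list like the one above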

Page 45:

Empirical Measures: Analogy task

Analogy questions:

A:B::C:

I 5 types of semantic questions, 9 types of syntactic

Page 46:

Embedding Tasks

Page 47:

Analogy Prediction

A:B::C:

x′ = xB − xA + xC

Project to the closest word,

arg maxD∈V (xD x′⊤) / (||xD|| ||x′||)

I Code example
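In place of the slide's code example, a minimal numpy sketch of the rule above (W and vocab are assumed inputs; excluding the three query words from the arg max follows common practice):

import numpy as np

def analogy(W, vocab, a, b, c):
    # A:B :: C:?  with x' = xB - xA + xC and answer = arg max_D cos(xD, x')
    ia, ib, ic = vocab.index(a), vocab.index(b), vocab.index(c)
    x_prime = W[ib] - W[ia] + W[ic]
    sims = (W @ x_prime) / (np.linalg.norm(W, axis=1) * np.linalg.norm(x_prime) + 1e-8)
    sims[[ia, ib, ic]] = -np.inf          # exclude the query words themselves
    return vocab[int(np.argmax(sims))]

# analogy(W, vocab, "man", "king", "woman") would ideally return "queen"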

Page 48:

Extrinsic Tasks

I Text classification

I Part-of-speech tagging

I Many, many others over last couple years

Page 49:

Conclusion

I Word Embeddings

I Scaling issues and tricks

I Next Class: Language Modeling