Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287


Page 1:

Part-of-Speech Tagging

+

Neural Networks 3: Word Embeddings

CS 287

Page 2:

Review: Neural Networks

One-layer (single hidden layer) multi-layer perceptron architecture,

NNMLP1(x) = g(xW1 + b1)W2 + b2

I xW + b; perceptron

I x is the dense representation in R1×din

I W1 ∈ Rdin×dhid ,b1 ∈ R1×dhid ; first affine transformation

I W2 ∈ Rdhid×dout ,b2 ∈ R1×dout ; second affine transformation

I g : R1×dhid → R1×dhid is an activation non-linearity (often pointwise)

I g(xW1 + b1) is the hidden layer

Page 3:

Review: Non-Linearities Tanh

Hyperbolic Tangent:

tanh(t) = (exp(t) − exp(−t)) / (exp(t) + exp(−t))

I Intuition: Similar to sigmoid, but range between −1 and 1.

Page 4:

Review: Backpropagation

[Diagram: a module fi+1(∗; θi+1) takes input fi(. . . f1(x0)) and produces output fi+1(fi(. . . f1(x0))). Backpropagation takes ∂L/∂fi+1(. . . f1(x0)) from above and produces ∂L/∂fi(. . . f1(x0)) for the layer below and ∂L/∂θi+1 for the parameter update.]

Page 5:

Quiz

One common class of operations in neural network models is known as

pooling. Informally, a pooling layer consists of an aggregation unit, typically

unparameterized, that reduces the input to a smaller size.

Consider three pooling functions of the form f : Rn → R,

1. f (x) = maxi xi

2. f (x) = mini xi

3. f (x) = ∑i xi/n

What effect does each of these functions have? What are their

gradients? How would you implement backpropagation for these units?

Page 6:

Quiz

I Max pooling: f (x) = maxi xi

I Keeps only the most activated input

I Fprop is simple; however must store arg max (“switch”)

I Bprop gradient is zero except for switch, which gets gradoutput

I Min pooling: f (x) = mini xi

I Keeps only the least activated input

I Fprop is simple; however must store arg min (“switch”)

I Bprop gradient is zero except for switch, which gets gradoutput

I Avg pooling: f (x) = ∑i xi/n

I Keeps the average of the input activations

I Fprop is simply the mean.

I Gradoutput is divided by n and passed to all inputs (all three units are sketched below).
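A minimal numpy sketch of the three pooling units and their backward passes (illustrative code, not the course's Torch modules; min pooling is max pooling with argmin):

import numpy as np

def max_pool_forward(x):
    switch = int(np.argmax(x))          # store the "switch" for bprop
    return x[switch], switch

def max_pool_backward(grad_output, switch, n):
    grad_input = np.zeros(n)
    grad_input[switch] = grad_output    # only the switch position receives gradOutput
    return grad_input

def avg_pool_forward(x):
    return x.mean()

def avg_pool_backward(grad_output, n):
    return np.full(n, grad_output / n)  # each input receives gradOutput / n

x = np.array([1.0, 3.0, -2.0])
y, switch = max_pool_forward(x)                 # y = 3.0, switch = 1
print(max_pool_backward(1.0, switch, x.size))   # [0. 1. 0.]
print(avg_pool_backward(1.0, x.size))           # [0.333... 0.333... 0.333...]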

Page 9:

Contents

Embedding Motivation

C&W Embeddings

word2vec

Evaluating Embeddings

Page 10:

1. Use dense representations instead of sparse

2. Use windowed area instead of sequence models

3. Use neural networks to model windowed interactions

Page 11:

What about rare words?

Page 12:

Word Embeddings

Embedding layer,

x0W0

I x0 ∈ R1×d0 one-hot word.

I W0 ∈ Rd0×din , d0 = |V|

Notes:

I d0 >> din, e.g. d0 = 10000, din = 50
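A minimal numpy sketch of this layer (sizes from the example above, word index hypothetical): multiplying a one-hot row by W0 simply selects a row of W0, which is why the embedding layer is implemented as a table lookup in practice.

import numpy as np

d0, d_in = 10000, 50                     # |V| and embedding size from above
W0 = np.random.randn(d0, d_in) * 0.01    # embedding matrix

word_index = 42                          # hypothetical index of a word in V
x0 = np.zeros((1, d0))
x0[0, word_index] = 1.0                  # one-hot row vector

dense = x0 @ W0                          # (1 x d0)(d0 x d_in) -> (1 x d_in)
lookup = W0[word_index]                  # equivalent row lookup, no matrix product
assert np.allclose(dense[0], lookup)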

Page 13:

Pretraining Representations

I We would like strong shared representations of words

I However, PTB has only 1M labeled words; relatively small

I Collobert et al. (2008, 2011) use semi-supervised method.

I (Close connection to Bengio et al (2003), next topic)

Page 14:

Semi-Supervised Training

Idea: Train representations separately on more data

1. Pretrain word embeddings W0 first.

2. Substitute them in as first NN layer

3. Fine-tune embeddings for final task

I Modify the first layer based on supervised gradients

I Optional, some work skips this step
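A sketch of steps 2 and 3, assuming the pretrained vectors are already available as a |V| × din numpy array (the file name and function below are hypothetical):

import numpy as np

# 1./2. Load pretrained embeddings and substitute them as the first layer.
W0 = np.load("pretrained_embeddings.npy")   # hypothetical file, shape |V| x d_in

# 3. Optionally fine-tune on supervised gradients from the tagging task.
fine_tune = True
learning_rate = 0.01

def update_embeddings(word_indices, grad_rows):
    # grad_rows: gradient of the supervised loss w.r.t. the looked-up rows
    if fine_tune:
        W0[word_indices] -= learning_rate * grad_rows
    # if fine_tune is False, the pretrained rows stay frozen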

Page 15:

Large Corpora

To learn rare word embeddings, need many more tokens,

I C&W

I English Wikipedia (631 million word tokens)

I Reuters Corpus (221 million word tokens)

I Total vocabulary size: 130,000 word types

I word2vec

I Google News (6 billion word tokens)

I Total vocabulary size: ≈ 1M word types

But this data has no labels...

Page 16:

Contents

Embedding Motivation

C&W Embeddings

word2vec

Evaluating Embeddings

Page 17:

C&W Embeddings

I Assumption: Text in Wikipedia is coherent (in some sense).

I Most randomly corrupted text is incoherent.

I Embeddings should distinguish coherence.

I Common idea in unsupervised learning (distributional hypothesis).

Page 18:

C&W Setup

Let V be the vocabulary of English and let s score any window of size

dwin = 5. If we see the phrase

[ the dog walks to the ]

It should score higher by s than

[ the dog house to the ]

[ the dog cats to the ]

[ the dog skips to the ]

...

Page 19:

C&W Setup

Can estimate score s as a windowed neural network.

s(w1, . . . , wdwin) = hardtanh(xW1 + b1)W2 + b2

with

x = [v(w1) v(w2) . . . v(wdwin)]

I din = dwin × 50, dhid = 100, dwin = 11, dout = 1!

Example: Function s

x = [v(w3) v(w4) v(w5) v(w6) v(w7)]

[Diagram: x concatenates dwin blocks of size din/dwin each.]
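A numpy sketch of s for a dwin = 5 window (the hardtanh MLP from the slide; vocabulary size and word ids are illustrative):

import numpy as np

d_win, d_emb, d_hid = 5, 50, 100
d_in = d_win * d_emb                       # din = dwin x 50

rng = np.random.default_rng(0)
W0 = rng.normal(0, 0.01, (10000, d_emb))   # word embeddings, |V| x 50
W1 = rng.normal(0, 0.01, (d_in, d_hid)); b1 = np.zeros(d_hid)
W2 = rng.normal(0, 0.01, (d_hid, 1));    b2 = np.zeros(1)

def hardtanh(t):
    return np.clip(t, -1.0, 1.0)

def score(window_ids):
    # window_ids: dwin word ids, e.g. for [the dog walks to the]
    x = np.concatenate([W0[i] for i in window_ids])    # concatenated embeddings
    return (hardtanh(x @ W1 + b1) @ W2 + b2)[0]

s_good = score([0, 1, 2, 3, 0])   # hypothetical ids for "the dog walks to the"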

Page 20:

Training?

I Different setup than previous experiments.

I No direct supervision y

I Train to rank good examples better.

Page 21:

Ranking Loss

Given only examples {x1, . . . , xn}, and for each example a set D(x) of

alternatives.

L(θ) = ∑i ∑x′∈D(xi) Lranking(s(xi; θ), s(x′; θ))

Lranking(y, y′) = max{0, 1 − (y − y′)}

Example: C&W ranking

x = [the dog walks to the]

D(x) = { [the dog skips to the], [the dog in to the], . . . }

I (Torch nn.RankingCriterion)

I Note: slightly different setup.
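A sketch of the hinge above for a single (coherent, corrupted) pair of window scores (plain Python, not Torch's criterion verbatim):

def ranking_loss(s_good, s_bad):
    # Lranking(y, y') = max{0, 1 - (y - y')}
    return max(0.0, 1.0 - (s_good - s_bad))

print(ranking_loss(2.5, 0.3))   # 0.0: the coherent window already wins by a margin of 1
print(ranking_loss(0.4, 0.3))   # 0.9: the corrupted window is too close, so gradient flows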

Page 22:

C&W Embeddings in Practice

I Vocabulary size |D(x)| > 100,000

I Training time: 4 weeks

I (Collobert is a main author of Torch)

Page 23:

Sampling (Sketch of Wsabie (Weston, 2011))

Observation: in many contexts

Lranking(y, y′) = max{0, 1 − (y − y′)} = 0

Particularly true later in training.

For difficult contexts, it may be easy to find

Lranking(y, y′) = max{0, 1 − (y − y′)} ≠ 0

We can therefore sample from D(x) to find an update.
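A sketch of that sampling step, assuming a score function over windows of word ids (as in the scorer sketch earlier) and a corruption that swaps the middle word for a random vocabulary item; names are illustrative:

import random

def sample_violation(window_ids, score, vocab_size, max_tries=50):
    s_good = score(window_ids)
    for _ in range(max_tries):
        corrupted = list(window_ids)
        corrupted[len(corrupted) // 2] = random.randrange(vocab_size)  # random middle word
        s_bad = score(corrupted)
        if 1.0 - (s_good - s_bad) > 0.0:   # nonzero ranking loss: a useful update
            return corrupted, s_good, s_bad
    return None   # late in training, most corruptions already give zero loss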

Page 24:

C&W Results

1. Use dense representations instead of sparse

2. Use windowed area instead of sequence models

3. Use neural networks to model windowed interactions

4. Use semi-supervised learning to pretrain representations.

Page 25:

Contents

Embedding Motivation

C&W Embeddings

word2vec

Evaluating Embeddings

Page 26:

word2vec

I Contributions:

I Scale embedding process to massive sizes

I Experiments with several architectures

I Empirical evaluations of embeddings

I Influential release of software/data.

I Differences with C&W

I Instead of MLP uses (bi)linear model (linear in paper)

I Instead of ranking model, directly predict word (cross-entropy)

I Various other extensions.

I Two different models

1. Continuous Bag-of-Words (CBOW)

2. Continuous Skip-gram

Page 28:

word2vec (Bilinear Model)

Back to pure bilinear model, but with much bigger output space

y = softmax(((∑i x0i W0) / (dwin − 1)) W1)

I x0i ∈ R1×d0, the input words' one-hot vectors.

I W0 ∈ Rd0×din ; d0 = |V|, word embeddings

I W1 ∈ Rdin×dout ; dout = |V| output embeddings

Notes:

I Bilinear parameter interaction.

I d0 >> din, e.g. 50 ≤ din ≤ 1000, 10000 ≤ |V| ≤ 1M or more

Page 29:

word2vec (Mikolov, 2013)

Page 30:

Continuous Bag-of-Words (CBOW)

y = softmax(((∑i x0i W0) / (dwin − 1)) W1)

I Attempt to predict the middle word

[ the dog walks to the ]

Example: CBOW

x = (v(w3) + v(w4) + v(w6) + v(w7)) / (dwin − 1)

y = δ(w5)

W1 is no longer partitioned by row (order is lost)
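A numpy sketch of the CBOW forward pass and cross-entropy loss for this example (vocabulary size, dimensions, and word ids are illustrative; this is the bilinear model above, not the word2vec C code):

import numpy as np

V, d_in = 10000, 100
rng = np.random.default_rng(0)
W0 = rng.normal(0, 0.01, (V, d_in))     # input word embeddings
W1 = rng.normal(0, 0.01, (d_in, V))     # output embeddings

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cbow_loss(context_ids, middle_id):
    # x = (v(w3) + v(w4) + v(w6) + v(w7)) / (dwin - 1)
    x = W0[context_ids].mean(axis=0)
    y_hat = softmax(x @ W1)              # distribution over the middle word
    return -np.log(y_hat[middle_id])     # cross-entropy against y = delta(w5)

print(cbow_loss([3, 4, 6, 7], 5))        # hypothetical word ids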

Page 31:

Continuous Skip-gram

y = softmax((x0W0)W1)

I Also a bilinear model

I Attempt to predict each context-word from middle

[ the dog walks to the ]

Example: Skip-gram

x = v(w5)

y = δ(w3)

This is done for each context word in the window (sketched below).
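The corresponding skip-gram sketch: the middle word's embedding predicts each context word in turn, so one window contributes dwin − 1 cross-entropy terms (again illustrative sizes and ids, same bilinear shapes as the CBOW sketch):

import numpy as np

V, d_in = 10000, 100
rng = np.random.default_rng(0)
W0 = rng.normal(0, 0.01, (V, d_in))     # input embeddings
W1 = rng.normal(0, 0.01, (d_in, V))     # output embeddings

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def skipgram_losses(window_ids):
    # e.g. window_ids = [w3, w4, w5, w6, w7] with middle word w5
    mid = len(window_ids) // 2
    y_hat = softmax(W0[window_ids[mid]] @ W1)   # predict from x = v(w5)
    # one cross-entropy term per context word: y = delta(w3), delta(w4), ...
    return [-np.log(y_hat[w]) for i, w in enumerate(window_ids) if i != mid]

print(sum(skipgram_losses([3, 4, 5, 6, 7])))    # hypothetical word ids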

Page 32:

Additional aspects

I The window dwin is sampled for each SGD step

I SGD is done less for frequent words.

I We have slightly simplified the training objective.

Page 33:

Softmax Issues

Use a softmax to force a distribution,

softmax(z) = exp(z) / ∑c∈C exp(zc)

log softmax(z) = z − log ∑c∈C exp(zc)

I Issue: class C is huge.

I For C&W, 100,000, for word2vec 1,000,000 types

I Note largest dataset is 6 billion words

Page 34:

Two-Layer Softmax

First, cluster words into hard classes (for instance Brown clusters).

Groups words into classes based on word-context.

[Tree diagram: words grouped into hard classes, e.g. class 3 = {dog, cat, horse, . . . } and class 5 = {car, truck, motorcycle, . . . }]

Page 35:

Two-Layer Softmax

Assume that we first generate a class C and then a word,

p(Y |X ) ≈ P(Y |C ,X ; θ)P(C |X ; θ)

Estimate distributions with a shared embedding layer,

P(C |X ; θ)

y1 = softmax((x0W0)W1 + b)

P(Y |C = class,X ; θ)

y2 = softmax((x0W0)Wclass + b)

Page 36:

Softmax as Tree

[Tree diagram: a two-level tree with class nodes (3, 5, . . . ) at the first level and words (dog, cat, horse, . . . ; car, truck, motorcycle, . . . ) at the leaves]

ŷ(1) = softmax((x0W0)W1 + b)

ŷ(2) = softmax((x0W0)Wclass + b)

L2SM(ŷ(1), ŷ(2), y(1), y(2)) = − log p(y|x, class(y)) − log p(class(y)|x)

= − log ŷ(1)c1 − log ŷ(2)c2
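A numpy sketch of this two-layer loss, assuming for illustration that the vocabulary is split into contiguous, equal-sized classes (a real system would use Brown-style clusters):

import numpy as np

V, d_in, n_classes = 10000, 100, 100
class_size = V // n_classes                  # each class holds |V| / n_classes words

rng = np.random.default_rng(0)
W0 = rng.normal(0, 0.01, (V, d_in))                           # shared embeddings
W1 = rng.normal(0, 0.01, (d_in, n_classes))                   # class softmax
W_class = rng.normal(0, 0.01, (n_classes, d_in, class_size))  # one softmax per class

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def two_layer_loss(input_id, target_id):
    x = W0[input_id]
    c, j = target_id // class_size, target_id % class_size    # class, index within class
    y1 = softmax(x @ W1)                  # p(class(y) | x)
    y2 = softmax(x @ W_class[c])          # p(y | x, class(y)): only class_size scores
    return -np.log(y1[c]) - np.log(y2[j])

print(two_layer_loss(42, 4567))           # hypothetical word ids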

Page 37:

Speed

[Tree diagram: the same two-level class/word tree as above]

I Computing loss only requires walking path.

I Two-layer softmax: a balanced tree of classes.

I Computing loss requires O(√|V|)

I (Note: computing full distribution requires O(|V|))

Page 38:

Hierarchical Softmax (HSM)

I Build multiple layer tree

LHSM(ŷ(1), . . . , ŷ(C), y(1), . . . , y(C)) = −∑i log ŷ(i)ci

I Balanced tree only requires O(log2 |V|)

I Experiments on website (Mnih and Hinton, 2008)
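A sketch of the path-based loss; for concreteness this uses the binary-tree variant from the word2vec code, where each internal node makes a sigmoid left/right decision, rather than a softmax at every level (node vectors, path encoding, and ids are all illustrative):

import numpy as np

d_in = 100
rng = np.random.default_rng(0)
node_vectors = rng.normal(0, 0.01, (9999, d_in))   # one vector per internal tree node

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def hsm_loss(x, path):
    # path: [(node_id, branch), ...] from the root to the target word's leaf,
    # with branch = +1 for left and -1 for right; its length is O(log2 |V|)
    loss = 0.0
    for node_id, branch in path:
        loss -= np.log(sigmoid(branch * (x @ node_vectors[node_id])))
    return loss

x = rng.normal(0, 0.01, d_in)                      # embedding of the input word
print(hsm_loss(x, [(0, +1), (1, -1), (4, +1)]))    # hypothetical 3-step path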

Page 39:

HSM with Huffman Encoding

I Requires O(log2 perp(unigram))

I Reduces time to only 1 day for 1.6 billion word tokens

Page 40:
Page 41:

Contents

Embedding Motivation

C&W Embeddings

word2vec

Evaluating Embeddings

Page 42:

How good are embeddings?

I Qualitative Analysis/Visualization

I Analogy task

I Extrinsic Metrics

Page 43:

Metrics

Dot-product

xcat x⊤dog

Cosine Similarity

xcat x⊤dog / (||xcat|| ||xdog||)

Page 44:

k-nearest neighbors (cosine sim)

dog

cat 0.921800527377

dogs 0.851315870426

horse 0.790758298322

puppy 0.775492121034

pet 0.772470734611

rabbit 0.772081457265

pig 0.749006160038

snake 0.73991884888

I Intuition: trained to match words that act the same.
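A numpy sketch that produces this kind of neighbor list by cosine similarity, given an embedding matrix W and a vocabulary list (both assumed inputs here):

import numpy as np

def nearest_neighbors(W, vocab, query, k=8):
    # W: |V| x d embedding matrix; vocab: list of word strings (hypothetical inputs)
    q = W[vocab.index(query)]
    sims = (W @ q) / (np.linalg.norm(W, axis=1) * np.linalg.norm(q) + 1e-8)
    order = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != query][:k]

# nearest_neighbors(W, vocab, "dog") would return a ranked list like the one above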

Page 45:

Empirical Measures: Analogy task

Analogy questions:

A:B::C:

I 5 types of semantic questions, 9 types of syntactic

Page 46:

Embedding Tasks

Page 47:

Analogy Prediction

A:B::C:

x′ = xB − xA + xC

Project to the closest word,

arg maxD∈V (xD x′⊤) / (||xD|| ||x′||)

I Code example
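In place of the slide's code example, a minimal numpy sketch of the rule above (W and vocab are assumed inputs; excluding the three query words from the arg max follows common practice):

import numpy as np

def analogy(W, vocab, a, b, c):
    # A:B :: C:?  with x' = xB - xA + xC and answer = arg max_D cos(xD, x')
    ia, ib, ic = vocab.index(a), vocab.index(b), vocab.index(c)
    x_prime = W[ib] - W[ia] + W[ic]
    sims = (W @ x_prime) / (np.linalg.norm(W, axis=1) * np.linalg.norm(x_prime) + 1e-8)
    sims[[ia, ib, ic]] = -np.inf          # exclude the query words themselves
    return vocab[int(np.argmax(sims))]

# analogy(W, vocab, "man", "king", "woman") would ideally return "queen"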

Page 48:

Extrinsic Tasks

I Text classification

I Part-of-speech tagging

I Many, many others over last couple years

Page 49:

Conclusion

I Word Embeddings

I Scaling issues and tricks

I Next Class: Language Modeling