Language Modeling + Feed-Forward Networks 3 CS 287


Page 1

Language Modeling + Feed-Forward Networks 3

CS 287

Page 2

Review: LM ML Setup

Multi-class prediction problem,

(x1, y1), . . . , (xn, yn)

- yi: the one-hot encoding of the next word
- xi: representation of the prefix (w1, . . . , wt−1)

Challenges:

- How do you represent the input?
- Smoothing is crucially important.
- The output space is very large (next class).

Page 3

Review: Perplexity

Previously, we used accuracy as a metric.

Language modeling uses a version of average negative log-likelihood.

- For test data w1, . . . , wn:

NLL = −(1/n) ∑_{i=1}^{n} log p(wi | w1, . . . , wi−1)

Actually report perplexity,

perp = exp(−(1/n) ∑_{i=1}^{n} log p(wi | w1, . . . , wi−1))

Requires modeling the full distribution, as opposed to just the argmax (hinge loss).
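As an illustration (not in the original slides), a minimal Python sketch of computing NLL and perplexity from per-token probabilities; `prob` is an assumed stand-in for whatever model is being evaluated.

```python
import math

def perplexity(sentences, prob):
    """Average NLL and perplexity; prob(word, prefix) should return p(word | prefix)."""
    total_nll, n = 0.0, 0
    for sent in sentences:
        for i, w in enumerate(sent):
            total_nll -= math.log(prob(w, sent[:i]))
            n += 1
    nll = total_nll / n
    return nll, math.exp(nll)

# e.g. a uniform model over a 10,000-word vocabulary has perplexity ~10,000
nll, perp = perplexity([["the", "dog", "walks"]], lambda w, prefix: 1 / 10000)
```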

Page 4

Review: Interpolation (Jelinek-Mercer Smoothing)

We can write it recursively,

pinterp(w | c) = λ pML(w | c) + (1 − λ) pinterp(w | c′)

Ensure that the λ values form a convex combination,

0 ≤ λ ≤ 1

How do you learn the combination weights λ?
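To make the recursion concrete, here is a small sketch (not from the slides) of one level of Jelinek-Mercer interpolation, backing off from an ML bigram estimate to a unigram estimate; the count dictionaries `bigram`, `unigram`, the token count `N`, and the fixed `lam` are assumptions for illustration.

```python
def p_ml_bigram(w, c, bigram, unigram):
    # Maximum-likelihood bigram estimate pML(w | c) = count(c, w) / count(c).
    return bigram.get((c, w), 0) / unigram[c] if unigram.get(c) else 0.0

def p_interp(w, c, bigram, unigram, N, lam=0.7):
    # lambda * pML(w | c) + (1 - lambda) * pML(w), with the unigram as the base case.
    p_uni = unigram.get(w, 0) / N
    return lam * p_ml_bigram(w, c, bigram, unigram) + (1 - lam) * p_uni
```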

Page 5

Quiz

Assume we have seen the following training sentences,

- a tractor drove slow
- the red tractor drove fast
- the parrot flew fast
- the parrot flew slow
- the tractor slowed down

Compute pML for bigrams and use them to estimate whether parrot or tractor fits better in the following contexts.

1. the red ?

2. the ?

3. the ? drove

Page 6

Answer I

pML(tractor | a) = 1

pML(red | the) = 1/4

pML(parrot | the) = 1/2

pML(tractor | the) = 1/4

pML(tractor | red) = 1

pML(drove | tractor) = 2/3

pML(slowed | tractor) = 1/3

pML(flew | parrot) = 1

. . .
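A quick sketch (not part of the original slides) of how these maximum-likelihood bigram estimates can be computed from the five training sentences:

```python
from collections import Counter

sentences = [
    "a tractor drove slow",
    "the red tractor drove fast",
    "the parrot flew fast",
    "the parrot flew slow",
    "the tractor slowed down",
]

unigrams, bigrams = Counter(), Counter()
for s in sentences:
    words = s.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_ml(w, c):
    # pML(w | c) = count(c, w) / count(c)
    return bigrams[(c, w)] / unigrams[c]

print(p_ml("parrot", "the"))     # 0.5
print(p_ml("drove", "tractor"))  # 0.666...
```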

Page 7

Answer II

- the red tractor
- the parrot
- the tractor drove

Page 8

Today’s Class

p(wi | wi−n+1, . . . , wi−1; θ)

- Estimate this directly as a neural network.
- Two types of models: neural network and log-bilinear.
- Efficient methods for approximate estimation.

Page 9

Intuition: NGram Issues

In training we might see,

the arizona corporations commission authorized

But at test we see,

the colorado businesses organization

- Does this training example help here?
- Not really. No count overlap.
- Does backoff help here?
- Maybe, if we have seen organization.
- Mostly we get nothing from the earlier words.


Page 12

Goal

- Learn representations that share properties between similar words.
- Particularly helpful for unseen contexts.
- Not a silver bullet, e.g. proper nouns. In training we might see,

the eagles play the arizona diamondbacks

Whereas at test we might see,

the eagles play the colorado

(We will discuss this issue more in MT.)

Page 13

Baseline: Class-Based Language Models

- Groups words into classes based on word context.

[Diagram: a hard clustering of words into classes, e.g. class 5 = {car, truck, motorcycle, . . .} and class 3 = {dog, cat, horse, . . .}]

- Various factorization methods exist for estimating these models with count-based approaches.
- However, this assumes a hard clustering, often estimated separately.

Page 14

Contents

Neural Language Models

Noise Contrastive Estimation

Page 15

Recall: Word Embeddings

- Embeddings give a multi-dimensional representation of words.
- Ex: words closest to arizona by cosine similarity:

texas 0.932968706025

florida 0.932696958878

kansas 0.914805968271

colorado 0.904197441085

minnesota 0.863925347525

carolina 0.862697751337

utah 0.861915722889

miami 0.842350326527

oregon 0.842065064748

- Gives a multi-clustering over words.
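A minimal sketch (not from the slides) of retrieving nearest neighbors by cosine similarity, assuming `emb` is a dict mapping words to numpy vectors:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest(word, emb, k=5):
    # Rank all other words by cosine similarity to the query embedding.
    q = emb[word]
    scores = [(w, cosine(q, v)) for w, v in emb.items() if w != word]
    return sorted(scores, key=lambda t: -t[1])[:k]
```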

Page 16

Feed-Forward Neural NNLM (Bengio, 2003)

- wi−n+1, . . . , wi−1 are input embedding representations
- wi is an output embedded representation
- Model simultaneously learns,
  - input word representations
  - output word representations
  - conjunctions of input words (through the NLM, no n-gram features)

Page 17

Feed-Forward Neural Representation

- p(wi | wi−n+1, . . . , wi−1; θ)
- f1, . . . , f_dwin are the words in the window
- The input representation is the concatenation of embeddings,

x = [v(f1) v(f2) . . . v(f_dwin)]

Example: NNLM (dwin = 5)

[w3 w4 w5 w6 w7] w8

x = [v(w3) v(w4) v(w5) v(w6) v(w7)]

(each block v(wj) in x has width din/5, so x has total width din)
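A sketch (not the authors' code) of building the concatenated input x; the embedding table `V` of shape |vocab| × (din/dwin) and the integer word ids are assumptions for illustration.

```python
import numpy as np

def input_repr(word_ids, V):
    # Concatenate the embeddings of the dwin context words into one row vector,
    # x = [v(f1) v(f2) ... v(f_dwin)].
    return np.concatenate([V[i] for i in word_ids])

# Example: dwin = 5 context words, each embedding of width 50 -> x has width 250.
V = np.random.randn(10000, 50)
x = input_repr([3, 4, 5, 6, 7], V)
print(x.shape)  # (250,)
```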

Page 18

A Neural Probabilistic Language Model (Bengio, 2003)

A one-hidden-layer multi-layer perceptron architecture,

NNMLP1(x) = tanh(xW1 + b1)W2 + b2

A neural network architecture on top of the concatenation,

y = softmax(NNMLP1(x))

The best model uses din = 30 × dwin, dhid = 100.
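A compact numpy sketch (illustrative, not the paper's code) of this forward pass; dimensions follow the slide (din = 30 × dwin with dwin = 5, dhid = 100), and the vocabulary size dout is an assumed placeholder.

```python
import numpy as np

din, dhid, dout = 30 * 5, 100, 10000  # dout (vocab size) is an assumed placeholder

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.1, (din, dhid)), np.zeros(dhid)
W2, b2 = rng.normal(0, 0.1, (dhid, dout)), np.zeros(dout)

def nnlm_forward(x):
    # NNMLP1(x) = tanh(x W1 + b1) W2 + b2, then a softmax over the vocabulary.
    z = np.tanh(x @ W1 + b1) @ W2 + b2
    z -= z.max()                      # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return p

p = nnlm_forward(rng.normal(size=din))
print(p.shape, p.sum())  # (10000,) ~1.0
```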

Page 19

A Neural Probabilistic Language Model

Optionally, direct connection layers,

NNDMLP1(x) = [tanh(xW1 + b1), x]W2 + b2

- W1 ∈ R^(din×dhid), b1 ∈ R^(1×dhid); first affine transformation
- W2 ∈ R^((dhid+din)×dout), b2 ∈ R^(1×dout); second affine transformation

Page 20

A Neural Probabilistic Language Model (Bengio, 2003)

[Figure: architecture diagram from Bengio (2003); dashed lines show the optional direct connections, with C = v.]

Page 21

A Neural Probabilistic Language Model

Page 22

Parameters

- The Bengio NNLM has dhid = 100, dwin = 5, din = 5 × 50.
- In-Class: How many parameters does it have? How does this compare to Kneser-Ney smoothing?

Page 23

Historical Note

- Bengio et al. note that many of these aspects predate the work.
- The paper also proposes many of the ideas that Collobert et al. and word2vec later implement and scale.
- Around this time there were very few NLP papers on neural networks; the most-cited papers were about conditional random fields (CRFs).

Page 24

Log-Bilinear Language Model (Mnih & Hinton, 2007)

Slightly different input representation. Now let:

x = ∑_{i=1}^{dwin} v(fi) Ci

- Instead of concatenating, weight each v(fi) by a position-specific weight matrix Ci.

Then use:

y = softmax(xW1 + b)

- Note no tanh layer.
- W1 can use input embeddings too, or not (Mnih and Teh, 2012).
- Can be faster to use, and in some cases simpler.
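A sketch (not from the slides) of the log-bilinear input representation; the per-position matrices C[i], the embedding table V, and the dimensions are assumptions for illustration.

```python
import numpy as np

d, dwin, vocab = 50, 5, 10000
rng = np.random.default_rng(0)
V = rng.normal(0, 0.1, (vocab, d))       # input embeddings v(f)
C = rng.normal(0, 0.1, (dwin, d, d))     # position-specific weight matrices C_i
W1, b = rng.normal(0, 0.1, (d, vocab)), np.zeros(vocab)

def lbl_forward(word_ids):
    # x = sum_i v(f_i) C_i, then y = softmax(x W1 + b); note there is no tanh.
    x = sum(V[w] @ C[i] for i, w in enumerate(word_ids))
    z = x @ W1 + b
    z -= z.max()
    return np.exp(z) / np.exp(z).sum()

print(lbl_forward([3, 4, 5, 6, 7]).shape)  # (10000,)
```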

Page 25

Comparison

Both count-based models and feed-forward NNLMs are Markovian language models.

Comparison:

- Training speed: n-grams are much faster (more coming).
- Usage speed: n-grams are very fast; NNs can be fast with some tricks.
- Memory: NN models can be much smaller (but there are big ones).
- Accuracy: comparable for small data; NNs do better with more.

Advantages of the NN model:

- Can be trained end-to-end.
- Does not require smoothing methods.

Page 26

Translation Performance ( and Blunsom, 2015)

Page 27

Contents

Neural Language Models

Noise Contrastive Estimation

Page 28

Review: Softmax Issues

Use a softmax to force a distribution,

softmax(z) = exp(z) / ∑_{w∈C} exp(zw)

log softmax(z) = z − log ∑_{w∈C} exp(zw)

- Issue: the class set C is huge.
- For C&W, 100,000 types; for word2vec, 1,000,000 types.
- Note: the largest dataset is 6 billion words.

Page 29

Unnormalized Scores

Recall the score defined as (dropping the bias),

z = tanh(xW1)W2

The unnormalized score of each word before the softmax,

zj = tanh(xW1)W2∗,j

for any j ∈ {1, . . . , dout}.

Note: a single zj can be computed in O(1) time with respect to dout (one column lookup and dot product), versus O(dout) for the full score vector.
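A small sketch of why one unnormalized score is cheap: scoring a single word needs only one column of W2, while the softmax normalizer touches all dout columns. The shapes below are assumed for illustration.

```python
import numpy as np

din, dhid, dout = 150, 100, 10000
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(din, dhid)), rng.normal(size=(dhid, dout))
x = rng.normal(size=din)

h = np.tanh(x @ W1)

z_j = h @ W2[:, 42]    # unnormalized score of word j = 42: one dhid-length dot product
z_all = h @ W2         # all dout scores, needed for the softmax normalizer
print(z_j, z_all[42])  # identical values
```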

Page 30

Coherence

- Saw a similar idea earlier for ranking embeddings.
- Idea: learn to distinguish coherent n-grams from corruptions.
- Want to discriminate correct next words from other choices.

[ the dog walks ]

[ the dog house ]

[ the dog cats ]

[ the dog skips ]

Page 31

Warm-Up

Imagine we have a new dataset,

((x1, y1),d1), . . . , ((xn, yn),dn),

- x: representation of the context wi−n+1, . . . , wi−1
- y: a possible wi
- d: 1 if y is correct, 0 otherwise

The objective is based on the predicted d̂:

L(θ) = ∑_i Lcrossentropy(di, d̂i)

Page 32

Warm-Up: Binary Classification

How do we score (xi, yi = δ(w))?

We could use the unnormalized score,

zw = tanh(xW1)W2∗,w

This becomes softmax regression / non-linear logistic regression,

d̂ = σ(zw)

- Much faster.
- But this alone does not help us train an LM.

Page 33

Implementation

Standard MLP language model (only takes in x),

x ⇒ W1 ⇒ tanh ⇒ W2 ⇒ softmax

Computing the binary prediction (takes in x and y),

d̂ = σ(zw)

x ⇒ W1 ⇒ tanh ⇒ dot with W2∗,w (lookup) ⇒ σ

Page 34

Noise Contrastive Estimation 1

Probabilistic model,

- Introduce a random variable D.
- If D = 1, produce a true sample.
- If D = 0, produce a sample from a noise distribution.
- The hyperparameter K is the noise ratio.

p(D = 1) = 1 / (K + 1)

p(D = 0) = K / (K + 1)

Page 35

Noise Contrastive Estimation 2

For a given x, y,

p(D = 1 | x, y) = p(y | D = 1, x) p(D = 1 | x) / ∑_d p(y | D = d, x) p(D = d | x)

              = p(y | D = 1, x) p(D = 1 | x) / [p(y | D = 0, x) p(D = 0 | x) + p(y | D = 1, x) p(D = 1 | x)]

Plugging in the noise distribution and hyperparameters,

p(D = 1 | x, y) = (1/(K+1)) p(y | D = 1, x) / [(1/(K+1)) p(y | D = 1, x) + (K/(K+1)) p(y | D = 0, x)]

              = p(y | D = 1, x) / [p(y | D = 1, x) + K p(y | D = 0, x)]

              = σ(log p(y | D = 1, x) − log(K p(y | D = 0, x)))

Page 36

Noise Contrastive Estimation 3

With

p(D = 1 | x, y) = σ(log p(y | D = 1, x) − log(K p(y | D = 0, x))),

the training objective for a corpus with K noise samples si,k per example is:

L(θ) = ∑_i [ log p(D = 1 | xi, yi) + ∑_{k=1}^{K} log p(D = 0 | xi, Y = si,k) ]

     = ∑_i [ log σ(log p(yi | D = 1, xi) − log(K p(yi | D = 0, xi)))

            + ∑_{k=1}^{K} log(1 − σ(log p(si,k | D = 1, xi) − log(K p(si,k | D = 0, xi)))) ]

- In practice, sample si,k from the unigram distribution.

Page 37

Noise Contrastive Estimation 4

But we still have a problem: L is defined in terms of the normalized distribution log p(y | D = 1, x).

Solution:

- Instead of explicitly normalizing, estimate Z(x), the normalizing constant of each context x, as a parameter (Gutmann & Hyvarinen, 2010).
- Mnih and Teh (2012) show that fixing Z(x) = 1 for all contexts works just as well.
- So we can replace log p(y = δ(w) | D = 1, x) with zw, as computed by our network.

Page 38

Noise Contrastive Estimation 5

So we now have

L(θ) = ∑_i [ log σ(zwi − log(K pML(wi))) + ∑_{k=1}^{K} log(1 − σ(zsi,k − log(K pML(si,k)))) ]

- Mnih and Teh (2012) show that the gradient of L approaches the gradient of the true language model's log-likelihood objective as K → ∞.
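A numpy sketch (illustrative, not the paper's code) of this NCE objective for one training example, assuming the unnormalized scores z come from the network (e.g. zw = tanh(xW1)W2∗,w) and the noise distribution pML is the unigram distribution:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nce_example_loss(z_true, z_noise, p_true, p_noise_samples, K):
    """Negative NCE objective for one (x, y) pair.

    z_true: unnormalized score z_w of the observed word
    z_noise: array of K unnormalized scores z_{s_k} for the sampled noise words
    p_true: unigram probability pML(w) of the observed word
    p_noise_samples: array of unigram probabilities pML(s_k) of the noise words
    """
    obj = np.log(sigmoid(z_true - np.log(K * p_true)))
    obj += np.sum(np.log(1.0 - sigmoid(z_noise - np.log(K * p_noise_samples))))
    return -obj  # minimize the negative of the objective

print(nce_example_loss(2.0, np.array([0.1, -0.3]), 0.01, np.array([0.05, 0.2]), K=2))
```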

Page 39

Implementation

- How do you efficiently compute zw? Use a lookup table (and a dot product) for the output embeddings, not a full matrix-vector product.
- How do you efficiently handle log pML(w)? It can be precomputed or placed in a lookup table.
- How do you handle sampling? Precompute a large number of samples (not example-specific).
- How do you handle the loss? Simply a binary NLL objective.

Page 40

Implementation

Standard MLP language model,

x ⇒ W1 ⇒ tanh ⇒ W2 ⇒ softmax

Computing σ(zw − log(K pML(w))),

x ⇒ W1 ⇒ tanh ⇒ dot with W2∗,w (lookup) ⇒ subtract log(K pML(w)) (an input) ⇒ σ

(For efficiency, compute the first three layers only once for all K + 1 words.)

Page 41

Using in Practice

Several options for test time,

- Use the full softmax with the learned parameters.
- Compute a subset of scores and renormalize (homework).
- Can sometimes just treat the unnormalized parameters as being normalized (self-normalization).
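A small sketch (an assumed illustration, not the homework solution) of the second option: score only a candidate subset of the vocabulary and renormalize within it.

```python
import numpy as np

def renormalized_probs(h, W2, candidate_ids):
    # Score only a subset of the vocabulary and renormalize within that subset.
    z = h @ W2[:, candidate_ids]   # unnormalized scores for the candidate words
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()             # a distribution over candidate_ids only

h = np.tanh(np.random.randn(100))
W2 = np.random.randn(100, 10000)
print(renormalized_probs(h, W2, [5, 17, 256, 999]).sum())  # 1.0
```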