
Language Modeling + Feed-Forward Networks 3

CS 287

Review: LM ML Setup

Multi-class prediction problem,

(x1, y1), . . . , (xn, yn)

- yi; the one-hot next word
- xi; representation of the prefix (w1, . . . , wt−1)

Challenges:

- How do you represent the input?
- Smoothing is crucially important.
- The output space is very large (next class).

Review: Perplexity

Previously, we used accuracy as a metric.

Language modeling uses a version of the average negative log-likelihood.

- For test data w1, . . . , wn

NLL = −(1/n) ∑_{i=1}^{n} log p(wi | w1, . . . , wi−1)

Actually we report perplexity,

perp = exp(−(1/n) ∑_{i=1}^{n} log p(wi | w1, . . . , wi−1))

This requires modeling the full distribution, as opposed to just the argmax (hinge loss).
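As a quick illustration (not from the slides), here is a minimal numpy sketch of NLL and perplexity, assuming the per-token probabilities p(wi | w1, . . . , wi−1) have already been computed; the `token_probs` values are made up.

```python
import numpy as np

# Hypothetical per-token probabilities p(w_i | w_1, ..., w_{i-1}) on test data.
token_probs = np.array([0.10, 0.25, 0.05, 0.40])

# Average negative log-likelihood over the n test tokens.
nll = -np.mean(np.log(token_probs))

# Perplexity is the exponentiated average NLL.
perp = np.exp(nll)

print(f"NLL = {nll:.3f}, perplexity = {perp:.3f}")
```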

Review: Interpolation (Jelinek-Mercer Smoothing)

Can write this recursively,

pinterp(w | c) = λ pML(w | c) + (1 − λ) pinterp(w | c′)

Ensure that the λ values form a convex combination,

0 ≤ λ ≤ 1

How do you learn the combination weights λ?
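A minimal sketch of this recursion in Python, assuming the maximum-likelihood estimates are precomputed in a dictionary keyed by context; the dictionary layout, the `unk` floor, and the single shared λ are illustrative assumptions, not part of the slides.

```python
# p_ml maps a context tuple to {word: p_ML(word | context)}; () is the empty (unigram) context.
def p_interp(w, context, p_ml, lam=0.5, unk=1e-6):
    # Base case: the empty context returns the unigram ML estimate (floored for unseen words).
    if not context:
        return p_ml.get((), {}).get(w, unk)
    ml = p_ml.get(context, {}).get(w, 0.0)
    # Recurse on the shortened context c' (drop the earliest word).
    return lam * ml + (1.0 - lam) * p_interp(w, context[1:], p_ml, lam, unk)

# Toy usage with made-up estimates.
p_ml = {(): {"dog": 0.1, "cat": 0.05}, ("the",): {"dog": 0.5}}
print(p_interp("dog", ("the",), p_ml))   # 0.5*0.5 + 0.5*0.1 = 0.30
```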

Quiz

Assume we have seen the following training sentences,

- a tractor drove slow
- the red tractor drove fast
- the parrot flew fast
- the parrot flew slow
- the tractor slowed down

Compute pML for bigrams and use the estimates to decide whether parrot or tractor fits better in the following contexts.

1. the red ?
2. the ?
3. the ? drove

Answer I

a tractor       1
the red         1/4
the parrot      1/2
the tractor     1/4
red tractor     1
tractor drove   2/3
tractor slowed  1/3
parrot flew     1
. . .

Answer II

- the red tractor
- the parrot
- the tractor drove
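As a sanity check on the quiz (a sketch, not part of the original slides), the bigram MLEs can be recomputed directly from the five training sentences:

```python
from collections import Counter

sentences = [
    "a tractor drove slow",
    "the red tractor drove fast",
    "the parrot flew fast",
    "the parrot flew slow",
    "the tractor slowed down",
]

unigrams, bigrams = Counter(), Counter()
for s in sentences:
    words = s.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_ml(w, prev):
    # p_ML(w | prev) = count(prev, w) / count(prev)
    return bigrams[(prev, w)] / unigrams[prev]

print(p_ml("tractor", "red"), p_ml("parrot", "red"))      # 1. 1.0 vs 0.0   -> the red tractor
print(p_ml("parrot", "the"), p_ml("tractor", "the"))      # 2. 0.5 vs 0.25  -> the parrot
print(p_ml("drove", "tractor"), p_ml("drove", "parrot"))  # 3. 0.667 vs 0.0 -> the tractor drove
```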

Today’s Class

p(wi | wi−n+1, . . . , wi−1; θ)

- Estimate this distribution directly with a neural network.
- Two types of models: neural network and log-bilinear.
- Efficient methods for approximate estimation.

Intuition: NGram Issues

In training we might see,

the arizona corporations commission authorized

But at test we see,

the colorado businesses organization

- Does this training example help here?
- Not really. No count overlap.
- Does backoff help here?
- Maybe, if we have seen organization.
- Mostly we get nothing from the earlier words.

Goal

- Learn representations that share properties between similar words.
- Particularly helpful for unseen contexts.
- Not a silver bullet, e.g. for proper nouns:

the eagles play the arizona diamondbacks

Whereas at test we might see,

the eagles play the colorado

(We will discuss this issue more in MT.)

Baseline: Class-Based Language Models

- Groups words into classes based on word context.

[Diagram: hard word clusters, e.g. class 5 = {car, truck, motorcycle, . . .}, class 3 = {dog, cat, horse, . . .}]

- Various factorization methods for estimating with count-based approaches.
- However, this assumes a hard clustering, often estimated separately.

Contents

Neural Language Models

Noise Contrastive Estimation

Recall: Word Embeddings

- Embeddings give a multi-dimensional representation of words.
- Ex: closest by cosine similarity

arizona

texas 0.932968706025

florida 0.932696958878

kansas 0.914805968271

colorado 0.904197441085

minnesota 0.863925347525

carolina 0.862697751337

utah 0.861915722889

miami 0.842350326527

oregon 0.842065064748

- Gives a multi-clustering over words.

Feed-Forward NNLM (Bengio, 2003)

- wi−n+1, . . . , wi−1 are input embedding representations
- wi is an output embedding representation
- The model simultaneously learns,
  - input word representations
  - output word representations
  - conjunctions of input words (through the NLM, no n-gram features)

Feed-Forward Neural Representation

- p(wi | wi−n+1, . . . , wi−1; θ)
- f1, . . . , fdwin are the words in the window
- The input representation is the concatenation of the embeddings

x = [v(f1) v(f2) . . . v(fdwin)]

Example: NNLM (dwin = 5)

[w3 w4 w5 w6 w7] w8

x = [v(w3) v(w4) v(w5) v(w6) v(w7)]

(Each block v(wj) has width din/5, so x has width din.)

A Neural Probabilistic Language Model (Bengio, 2003)

One-hidden-layer multi-layer perceptron architecture,

NNMLP1(x) = tanh(xW1 + b1)W2 + b2

A neural network architecture on top of the concatenation.

y = softmax(NNMLP1(x))

The best model uses din = 30 × dwin and dhid = 100.
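A minimal numpy sketch of this forward pass (a sketch only; the vocabulary size and random parameters below are illustrative, not Bengio's actual settings, though the dimensions roughly follow the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_win, d_emb, d_hid = 1000, 5, 30, 100   # illustrative sizes; d_in = d_win * d_emb
d_in = d_win * d_emb

# Parameters: input embeddings plus the two affine layers.
emb = rng.normal(size=(V, d_emb))
W1, b1 = rng.normal(size=(d_in, d_hid)), np.zeros(d_hid)
W2, b2 = rng.normal(size=(d_hid, V)), np.zeros(V)

def nnlm_forward(context_ids):
    # x is the concatenation of the d_win input word embeddings.
    x = emb[context_ids].reshape(1, d_in)
    h = np.tanh(x @ W1 + b1)                 # NNMLP1(x) = tanh(xW1 + b1)W2 + b2
    z = h @ W2 + b2
    # softmax over the vocabulary gives p(w_i | context).
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

p = nnlm_forward(np.array([3, 17, 42, 7, 99]))   # five previous word ids
print(p.shape, p.sum())
```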

A Neural Probabilistic Language Model

Optional direct-connection layers,

NNDMLP1(x) = [tanh(xW1 + b1), x]W2 + b2

- W1 ∈ R^(din×dhid), b1 ∈ R^(1×dhid); first affine transformation
- W2 ∈ R^((dhid+din)×dout), b2 ∈ R^(1×dout); second affine transformation

A Neural Probabilistic Language Model (Bengio, 2003)

Dashed lines show the optional direct connections, C = v.

A Neural Probabilistic Language Model

Parameters

- The Bengio NNLM has dhid = 100, dwin = 5, din = 5 × 50.
- In-class: How many parameters does it have? How does this compare to Kneser-Ney smoothing?

Historical Note

- Bengio et al. note that many of these ideas predate their work.
- The paper also proposes many of the ideas that Collobert et al. and word2vec later implement and scale.
- Around this time there were very few NLP papers on neural networks; the most-cited papers were about conditional random fields (CRFs).

Log-Bilinear Language Model (Mnih & Hinton, 2007)

Slightly different input representation. Now let:

x = ∑_{i=1}^{dwin} v(fi) Ci

- Instead of concatenating, weight each v(fi) by a position-specific weight matrix Ci.

Then use:

y = softmax(xW1 + b)

- Note there is no tanh layer.
- W1 can use the input embeddings too, or not (Mnih and Teh, 2012).
- Can be faster to use, and in some cases simpler.
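A minimal numpy sketch of the log-bilinear forward pass under the same illustrative assumptions (random parameters, made-up sizes); it shows the position-weighted sum replacing the concatenation and the absence of a tanh layer:

```python
import numpy as np

rng = np.random.default_rng(1)
V, d_win, d = 1000, 5, 50              # illustrative sizes

emb = rng.normal(size=(V, d))          # input embeddings v(f)
C = rng.normal(size=(d_win, d, d))     # one position-specific matrix C_i per context slot
W1, b = rng.normal(size=(d, V)), np.zeros(V)

def log_bilinear_forward(context_ids):
    # x = sum_i v(f_i) C_i  -- a weighted sum, not a concatenation
    x = sum(emb[f] @ C[i] for i, f in enumerate(context_ids))
    z = x @ W1 + b                     # note: no tanh layer
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

print(log_bilinear_forward([3, 17, 42, 7, 99]).sum())   # ~1.0
```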

Comparison

Both count-based models and feed-forward NNLMs are Markovian language models.

Comparison:

- Training speed: n-grams are much faster (more coming).
- Usage speed: n-grams are very fast; NNs can be fast with some tricks.
- Memory: NN models can be much smaller (but there are big ones).
- Accuracy: comparable for small data; NNs do better with more.

Advantages of NN model

- Can be trained end-to-end.
- Does not require smoothing methods.

Translation Performance ( and Blunsom, 2015)

Contents

Neural Language Models

Noise Contrastive Estimation

Review: Softmax Issues

Use a softmax to force a distribution,

softmax(z) = exp(z) / ∑_{w∈C} exp(zw)

log softmax(z) = z − log ∑_{w∈C} exp(zw)

- Issue: the class set C is huge.
- For C&W, 100,000 types; for word2vec, 1,000,000 types.
- Note the largest dataset is 6 billion words.
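A small numpy sketch (not from the slides) of the log-softmax above; the point is that the normalizer log ∑_{w∈C} exp(zw) touches every entry of z, which is what becomes expensive for a vocabulary-sized C:

```python
import numpy as np

def log_softmax(z):
    # log softmax(z) = z - log sum_w exp(z_w); subtracting z.max() is only
    # for numerical stability and cancels out exactly.
    m = z.max()
    return z - (m + np.log(np.exp(z - m).sum()))

# The normalizer sums over every entry, so the cost grows linearly with |C|.
z = np.random.default_rng(0).normal(size=100_000)   # a C&W-sized vocabulary
print(log_softmax(z)[:3])
```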

Unnormalized Scores

Recall the score defined as (dropping biases),

z = tanh(xW1)W2

The unnormalized score of each word before the softmax,

zj = tanh(xW1)W2_{∗,j}

for any j ∈ {1, . . . , dout}.

Note: a single zj can be computed efficiently, touching O(1) columns of W2 versus O(dout) for the full softmax.
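A small numpy sketch of this point (illustrative sizes and random parameters): a single zj needs only one column of W2, while the full score vector needs the whole matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 150, 100, 10_000          # illustrative sizes
x = rng.normal(size=(1, d_in))
W1 = rng.normal(size=(d_in, d_hid))
W2 = rng.normal(size=(d_hid, d_out))

h = np.tanh(x @ W1)            # shared hidden layer
j = 1234
z_j = float(h @ W2[:, j])      # one column of W2: cheap, independent of d_out
z_all = h @ W2                 # full product: cost scales with d_out
assert np.isclose(z_j, z_all[0, j])
print(z_j)
```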

Coherence

- We saw a similar idea earlier for ranking embeddings.
- Idea: learn to distinguish coherent n-grams from corruptions.
- We want to discriminate correct next words from other choices.

[ the dog walks ]

[ the dog house ]

[ the dog cats ]

[ the dog skips ]

Warm-Up

Imagine we have a new dataset,

((x1, y1),d1), . . . , ((xn, yn),dn),

- x; representation of the context wi−n+1, . . . , wi−1
- y; a possible wi
- d; 1 if y is correct, 0 otherwise

The objective is based on the predicted d̂:

L(θ) = ∑_i Lcrossentropy(di, d̂i)

Warm-Up: Binary Classification

How do we score (xi , yi = δ(w))?

We could use the unnormalized score,

zw = tanh(xW1)W2_{∗,w}

This becomes (non-linear) logistic regression,

d̂ = σ(zw)

- Much faster.
- But this alone does not help us train the LM.

Implementation

Standard MLP language model (takes only x),

x ⇒ W1 ⇒ tanh ⇒ W2 ⇒ softmax

Computing the binary prediction d̂ = σ(zw) (takes x and y),

x ⇒ W1 ⇒ tanh ⇒ dot product with W2_{∗,w} (lookup) ⇒ σ

Noise Contrastive Estimation 1

Probabilistic model,

- Introduce a random variable D.
- If D = 1, produce a true sample.
- If D = 0, produce a sample from a noise distribution.
- Hyperparameter K is the ratio of noise samples to true samples.

p(D = 1) = 1 / (K + 1)

p(D = 0) = K / (K + 1)

Noise Contrastive Estimation 2

For a given x, y,

p(D = 1 | x, y) = p(y | D = 1, x) p(D = 1 | x) / ∑_d p(y | D = d, x) p(D = d | x)

              = p(y | D = 1, x) p(D = 1 | x) / (p(y | D = 0, x) p(D = 0 | x) + p(y | D = 1, x) p(D = 1 | x))

Plug in the noise distribution and hyperparameters,

p(D = 1 | x, y) = (1/(K+1)) p(y | D = 1, x) / ((1/(K+1)) p(y | D = 1, x) + (K/(K+1)) p(y | D = 0, x))

              = p(y | D = 1, x) / (p(y | D = 1, x) + K p(y | D = 0, x))

              = σ(log p(y | D = 1, x) − log(K p(y | D = 0, x)))

Noise Contrastive Estimation 3

With

p(D = 1 | x, y) = σ(log p(y | D = 1, x) − log(K p(y | D = 0, x)))

the training objective for a corpus that has K noise samples si,k per example is:

L(θ) = ∑_i [ log p(D = 1 | xi, yi) + ∑_{k=1}^{K} log p(D = 0 | xi, Y = si,k) ]

     = ∑_i [ log σ(log p(yi | D = 1, xi) − log(K p(yi | D = 0, xi)))

            + ∑_{k=1}^{K} log(1 − σ(log p(si,k | D = 1, xi) − log(K p(si,k | D = 0, xi)))) ]

- In practice, sample si,k from the unigram distribution.

Noise Contrastive Estimation 4

But we still have a problem: L is defined in terms of the normalized distribution p(y | D = 1, x).

Solution:

- Instead of explicitly normalizing, estimate Z(x), the normalizing constant of each context x, as a parameter (Gutmann & Hyvarinen, 2010).
- Mnih and Teh (2012) show that fixing Z(x) = 1 for all contexts works just as well.
- So we can replace log p(y = δ(w) | D = 1, x) with zw, as computed by our network.

Noise Contrastive Estimation 5

So we now have

L(θ) = ∑_i [ log σ(z_{wi} − log(K pML(wi)))

            + ∑_{k=1}^{K} log(1 − σ(z_{si,k} − log(K pML(si,k)))) ]

- Mnih and Teh (2012) show that the gradient of L approaches the gradient of the true language model's log-likelihood objective as K → ∞.
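A minimal numpy sketch of this final objective for a single example (the scores, unigram probabilities, and K below are made up); the names z_true, z_noise, etc. are hypothetical, and the function returns the negative of the objective term so it can be minimized:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def nce_example_loss(z_true, z_noise, p_true, p_noise, K):
    # z_true:  unnormalized score z_{w_i} of the observed word
    # z_noise: K scores z_{s_{i,k}} of the sampled noise words
    # p_true, p_noise: unigram probabilities p_ML(.) of those words
    pos = np.log(sigmoid(z_true - np.log(K * p_true)))
    neg = np.log(1.0 - sigmoid(z_noise - np.log(K * p_noise)))
    return -(pos + neg.sum())

# Illustrative numbers only.
loss = nce_example_loss(z_true=2.5,
                        z_noise=np.array([-1.0, 0.3, -0.5]),
                        p_true=0.01,
                        p_noise=np.array([0.05, 0.002, 0.01]),
                        K=3)
print(loss)
```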

Implementation

- How do you efficiently compute zw?
  Use a lookup table (and dot product) for the output embeddings, not a full matrix-vector product.
- How do you efficiently handle log pML(w)?
  It can be precomputed or placed in a lookup table.
- How do you handle sampling?
  Precompute a large number of samples (not example-specific).
- How do you handle the loss?
  It is simply the binary NLL objective.

Implementation

Standard MLP language model,

x ⇒ W1 ⇒ tanh ⇒ W2 ⇒ softmax

Computing σ(zw − log(K pML(w))),

x ⇒ W1 ⇒ tanh ⇒ dot product with W2_{∗,w} (lookup) ⇒ subtract log(K pML(w)) (input) ⇒ σ

(Efficiency: compute the first three layers only once for all K + 1 words.)

Using in Practice

Several options for test time,

- Use the full softmax with the learned parameters.
- Compute a subset of scores and renormalize (homework; see the sketch below).
- Can sometimes just treat the unnormalized scores as if they were normalized (self-normalization).
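A small numpy sketch of the second option, renormalizing over a candidate subset (the candidate ids and scores are illustrative):

```python
import numpy as np

def renormalize_subset(scores, candidate_ids):
    # Restrict the softmax to a small candidate set and renormalize there,
    # instead of normalizing over the full vocabulary.
    z = scores[candidate_ids]
    z = z - z.max()                      # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return dict(zip(candidate_ids, p))

scores = np.random.default_rng(0).normal(size=100_000)   # unnormalized z for all words
print(renormalize_subset(scores, [17, 256, 4096, 31337]))
```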
