Language Modeling + Feed-Forward Networks 3 · 2017-02-10
TRANSCRIPT
Language Modeling + Feed-Forward Networks 3
CS 287
Review: LM ML Setup
Multi-class prediction problem,
$(x_1, y_1), \ldots, (x_n, y_n)$
I $y_i$; the one-hot next word
I $x_i$; representation of the prefix $(w_1, \ldots, w_{t-1})$
Challenges:
I How do you represent input?
I Smoothing is crucially important.
I Output space is very large (next class)
Review: Perplexity
Previously, used accuracy as a metric.
Language modeling uses a version of average negative log-likelihood
I For test data $w_1, \ldots, w_n$
I $NLL = -\frac{1}{n} \sum_{i=1}^{n} \log p(w_i \mid w_1, \ldots, w_{i-1})$
Actually report perplexity,
$perp = \exp\left(-\frac{1}{n} \sum_{i=1}^{n} \log p(w_i \mid w_1, \ldots, w_{i-1})\right)$
Requires modeling full distribution as opposed to argmax (hinge-loss)
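The perplexity definition can be checked with a few lines of Python. This is a minimal sketch: the function name `perplexity` and the input format (a list holding the model's probability for each test token) are assumptions for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities p(w_i | w_1..w_{i-1}) of each test token."""
    n = len(token_probs)
    # Average negative log-likelihood over the test tokens.
    nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(nll)

# A model that gives every token probability 1/4 has perplexity 4.
uniform_perp = perplexity([0.25] * 10)
```

A model that is uniform over V choices has perplexity V, which is why perplexity is often read as an effective branching factor.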
Review: Interpolation (Jelinek-Mercer Smoothing)
Can write recursively,
$p_{interp}(w \mid c) = \lambda p_{ML}(w \mid c) + (1 - \lambda) p_{interp}(w \mid c')$
Ensure that the $\lambda$ weights form a convex combination,
$0 \le \lambda \le 1$
How do you learn conjunction combinations?
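The recursion can be unrolled from the shortest context upward. The sketch below assumes one fixed λ per n-gram order (the slide does not say how λ is chosen) and takes the ML estimates as given numbers; `interpolate` and its arguments are hypothetical names.

```python
def interpolate(p_ml_by_order, lambdas):
    """Jelinek-Mercer: p_interp(w|c) = lam * p_ML(w|c) + (1 - lam) * p_interp(w|c').

    p_ml_by_order: ML estimates from the longest context down to the shortest.
    lambdas: one weight per estimate except the last, which ends the recursion.
    """
    if len(lambdas) != len(p_ml_by_order) - 1:
        raise ValueError("need one lambda per estimate except the last")
    if any(not 0.0 <= lam <= 1.0 for lam in lambdas):
        raise ValueError("each lambda must lie in [0, 1]")
    p = p_ml_by_order[-1]  # base case: shortest-context estimate, used as-is
    for p_ml, lam in zip(reversed(p_ml_by_order[:-1]), reversed(lambdas)):
        p = lam * p_ml + (1.0 - lam) * p
    return p

# Trigram never seen (0.0), bigram estimate 0.5, unigram estimate 0.1.
p = interpolate([0.0, 0.5, 0.1], lambdas=[0.6, 0.7])
```

Even though the trigram estimate is zero, the interpolated probability stays positive because mass flows in from the shorter contexts.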
Quiz
Assume we have seen the following training sentences,
I a tractor drove slow
I the red tractor drove fast
I the parrot flew fast
I the parrot flew slow
I the tractor slowed down
Compute pML for bigrams and use them to estimate whether parrot or
tractor fit better in the following contexts.
1. the red ?
2. the ?
3. the ? drove
Answer I
a tractor 1
the red 1/4
the parrot 1/2
the tractor 1/4
red tractor 1
tractor drove 2/3
tractor slowed 1/3
parrot flew 1
. . .
Answer II
I the red tractor
I the parrot
I the tractor drove
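The quiz answers can be reproduced by counting. This sketch assumes no sentence-boundary tokens and scores a candidate by the product of its bigram probabilities; `p_ml` and `score` are names chosen for illustration.

```python
from collections import Counter
from fractions import Fraction

sentences = [
    "a tractor drove slow",
    "the red tractor drove fast",
    "the parrot flew fast",
    "the parrot flew slow",
    "the tractor slowed down",
]

bigram_counts = Counter()
context_counts = Counter()
for sentence in sentences:
    words = sentence.split()
    for prev, word in zip(words, words[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1  # count prev only when a word follows it

def p_ml(word, prev):
    """Maximum-likelihood bigram estimate p_ML(word | prev)."""
    if context_counts[prev] == 0:
        return Fraction(0)
    return Fraction(bigram_counts[(prev, word)], context_counts[prev])

def score(words):
    """Product of bigram probabilities over a short word sequence."""
    result = Fraction(1)
    for prev, word in zip(words, words[1:]):
        result *= p_ml(word, prev)
    return result

for template in (["the", "red", "?"], ["the", "?"], ["the", "?", "drove"]):
    scores = {c: score([c if w == "?" else w for w in template])
              for c in ("parrot", "tractor")}
    print(" ".join(template), scores)
```

Using `Fraction` keeps the estimates exact, so they can be compared directly with the fractions in Answer I.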
Today’s Class
$p(w_i \mid w_{i-n+1}, \ldots, w_{i-1}; \theta)$
I Estimate this directly as a neural network.
I Two types of models, neural network and log-bilinear.
I Efficient methods for approximate estimation.
Intuition: NGram Issues
In training we might see,
the arizona corporations commission authorized
But at test we see,
the colorado businesses organization
I Does this training example help here?
I Not really. No count overlap.
I Does backoff help here?
I Maybe, if we have seen organization.
I Mostly get nothing from the earlier words.
Goal
I Learn representations that share properties between similar words.
I Particularly helpful for unseen contexts.
I Not a silver bullet, e.g. proper nouns. In training we might see,
the eagles play the arizona diamondbacks
Whereas at test we might see,
the eagles play the colorado
(We will discuss this issue more in MT)
Baseline: Class-Based Language Models
I Groups words into classes based on word-context.
[Figure: example word classes, e.g. class 5: motorcycle, truck, car, . . . ; class 3: horse, cat, dog, . . .]
I Various factorization methods for estimating with count-based
approaches.
I However, assumes a hard-clustering, often estimated separately.
Contents
Neural Language Models
Noise Contrastive Estimation
Recall: Word Embeddings
I Embeddings give multi-dimensional representation of words.
I Ex: Closest by cosine similarity
arizona
texas 0.932968706025
florida 0.932696958878
kansas 0.914805968271
colorado 0.904197441085
minnesota 0.863925347525
carolina 0.862697751337
utah 0.861915722889
miami 0.842350326527
oregon 0.842065064748
I Gives a multi-clustering over words.
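A nearest-neighbor list like the one for arizona can be produced with a cosine-similarity search. The sketch below uses made-up 3-dimensional vectors, not trained embeddings, so the scores do not match the slide; `nearest_by_cosine` is a hypothetical helper.

```python
import numpy as np

def nearest_by_cosine(query, vocab, embeddings, top_k=3):
    """Return the top_k words closest to `query` by cosine similarity."""
    # Normalize each row so a dot product equals the cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(query)]
    order = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != query][:top_k]

# Toy vectors, invented for illustration only.
vocab = ["arizona", "texas", "colorado", "dog"]
embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.7, 0.3, 0.0],
    [0.0, 0.1, 0.9],
])
neighbors = nearest_by_cosine("arizona", vocab, embeddings)
```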
Feed-Forward Neural NNLM (Bengio, 2003)
I $w_{i-n+1}, \ldots, w_{i-1}$ are input embedding representations
I $w_i$ is an output embedded representation
I Model simultaneously learns,
I input word representations
I output word representations
I conjunctions of input words (through NLM, no n-gram features)
Feed-Forward Neural Representation
I $p(w_i \mid w_{i-n+1}, \ldots, w_{i-1}; \theta)$
I $f_1, \ldots, f_{d_{win}}$ are the words in the window
I Input representation is the concatenation of embeddings
$x = [v(f_1)\ v(f_2)\ \ldots\ v(f_{d_{win}})]$
Example: NNLM (dwin = 5)
$[w_3\ w_4\ w_5\ w_6\ w_7]\ w_8$
$x = [v(w_3)\ v(w_4)\ v(w_5)\ v(w_6)\ v(w_7)]$
Each of the five embedding blocks in $x$ has $d_{in}/5$ dimensions.
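The concatenated input can be built with an embedding-table lookup followed by a reshape. This is a sketch with small illustrative sizes; the table `E` and the function `window_input` are assumptions, not names from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_emb, d_win = 10, 4, 5       # illustrative sizes, not the slide's
E = rng.normal(size=(vocab_size, d_emb))  # row i plays the role of v(word i)

def window_input(word_ids):
    """x = [v(f_1) v(f_2) ... v(f_dwin)] as one row vector of size d_win * d_emb."""
    return E[word_ids].reshape(1, -1)

# Window [w3 w4 w5 w6 w7], here represented by word indices 3..7.
x = window_input([3, 4, 5, 6, 7])
```

Here `x` has d_in = d_win * d_emb = 20 entries, and each consecutive block of d_in/5 entries is one word's embedding.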
A Neural Probabilistic Language Model (Bengio, 2003)
One hidden layer multi-layer perceptron architecture,
$NN_{MLP1}(x) = \tanh(x W^1 + b^1) W^2 + b^2$
Neural network architecture on top of concat.
$y = softmax(NN_{MLP1}(x))$
Best model uses $d_{in} = 30 \times d_{win}$, $d_{hid} = 100$.
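The forward pass of this one-hidden-layer model is short in NumPy. The sketch below uses random weights and small illustrative sizes; `nnlm_forward` is a hypothetical name, and the row-vector convention follows the formula above.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nnlm_forward(x, W1, b1, W2, b2):
    """y = softmax(tanh(x W1 + b1) W2 + b2) for a row-vector input x."""
    h = np.tanh(x @ W1 + b1)
    return softmax(h @ W2 + b2)

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 20, 8, 10  # illustrative sizes
W1, b1 = rng.normal(size=(d_in, d_hid)), np.zeros((1, d_hid))
W2, b2 = rng.normal(size=(d_hid, d_out)), np.zeros((1, d_out))
y = nnlm_forward(rng.normal(size=(1, d_in)), W1, b1, W2, b2)
```

The output `y` is a distribution over the d_out vocabulary entries for the next word.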
A Neural Probabilistic Language Model
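Optional direct connection layers,
$NN_{DMLP1}(x) = [\tanh(x W^1 + b^1), x] W^2 + b^2$
I $W^1 \in \mathbb{R}^{d_{in} \times d_{hid}}$, $b^1 \in \mathbb{R}^{1 \times d_{hid}}$; first affine transformation
I $W^2 \in \mathbb{R}^{(d_{hid} + d_{in}) \times d_{out}}$, $b^2 \in \mathbb{R}^{1 \times d_{out}}$; second affine transformation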
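The direct-connection variant only changes what is fed to the second affine layer. A minimal sketch, with illustrative sizes and a hypothetical function name `nnlm_direct_scores`:

```python
import numpy as np

def nnlm_direct_scores(x, W1, b1, W2, b2):
    """Scores [tanh(x W1 + b1), x] W2 + b2; W2 has d_hid + d_in rows."""
    h = np.tanh(x @ W1 + b1)
    return np.concatenate([h, x], axis=1) @ W2 + b2

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 20, 8, 10  # illustrative sizes
x = rng.normal(size=(1, d_in))
W1, b1 = rng.normal(size=(d_in, d_hid)), np.zeros((1, d_hid))
W2, b2 = rng.normal(size=(d_hid + d_in, d_out)), np.zeros((1, d_out))
scores = nnlm_direct_scores(x, W1, b1, W2, b2)
```

Splitting `W2` into its first d_hid rows and its last d_in rows shows the same computation as a hidden-layer term plus a linear term applied directly to `x`.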
A Neural Probabilistic Language Model (Bengio, 2003)
[Figure: NNLM architecture diagram.] Dashed lines show the optional direct connections; the figure's $C$ is our embedding function $v$.
A Neural Probabilistic Language Model
Parameters
I Bengio NNLM has $d_{hid} = 100$, $d_{win} = 5$, $d_{in} = 5 \times 50$
I In-Class: How many parameters does it have? How does this
compare to Kneser-Ney smoothing?
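One way to approach the counting question is to write the count as a function of the vocabulary size, which the slide does not give. The sketch below assumes the model without direct connections and a separate input embedding table; `vocab_size = 10_000` is a made-up value for illustration only.

```python
def nnlm_param_count(vocab_size, d_emb=50, d_win=5, d_hid=100):
    """Parameter count for the one-hidden-layer NNLM without direct connections."""
    d_in = d_win * d_emb
    embeddings = vocab_size * d_emb           # input lookup table v(.)
    hidden = d_in * d_hid + d_hid             # W1 and b1
    output = d_hid * vocab_size + vocab_size  # W2 and b2
    return embeddings + hidden + output

# Hypothetical vocabulary size, not taken from the lecture.
total = nnlm_param_count(vocab_size=10_000)
```

Under these assumptions the count grows linearly in the vocabulary size, dominated by the embedding table and the output layer.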
Historical Note
I Bengio et al. note that many of these aspects predate their work
I The paper furthermore proposes many of the ideas that Collobert et al. and
word2vec implement and scale
I Around this time there were very few NLP papers on neural networks; the most-cited
papers are about conditional random fields (CRFs).
Log-Bilinear Language Model (Mnih & Hinton, 2007)
Slightly different input representation. Now let:
$x = \sum_{i=1}^{d_{win}} v(f_i) C_i$
I Instead of concatenating, weight each $v(f_i)$ by a position-specific
weight matrix $C_i$.
Then use:
$y = softmax(x W^1 + b)$
I Note no tanh layer.
I W1 can use input embeddings too, or not (Mnih and Teh, 2012)
I Can be faster to use, and in some cases simpler.
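The log-bilinear scores can be sketched as follows. Sizes and names (`E`, `C`, `log_bilinear_scores`) are assumptions for illustration; the example ties W1 to the transposed input embeddings, which is one of the two options mentioned above.

```python
import numpy as np

def log_bilinear_scores(word_ids, E, C, W1, b):
    """Scores x W1 + b with x = sum_i v(f_i) C_i (no tanh layer)."""
    x = sum(E[f] @ C[i] for i, f in enumerate(word_ids))
    return x @ W1 + b

rng = np.random.default_rng(0)
vocab_size, d_emb, d_win = 12, 4, 3         # illustrative sizes
E = rng.normal(size=(vocab_size, d_emb))    # input embeddings v(.)
C = rng.normal(size=(d_win, d_emb, d_emb))  # one matrix per window position
b = np.zeros(vocab_size)
scores = log_bilinear_scores([2, 5, 7], E, C, E.T, b)  # W1 tied to input embeddings
```

Applying a softmax to `scores` gives the distribution y; with tied weights, each score is a dot product between the predicted vector x and a word's embedding.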
Comparison
Both count-based models and feed-forward NNLMs are Markovian language models.
Comparison:
I Training Speed: ngrams are much faster (more coming)
I Usage Speed: ngrams very fast, NN can be fast with some tricks.
I Memory: NN models can be much smaller (but there are big ones)
I Accuracy: Comparable for small data, NN does better with more.
Advantages of NN model
I Can be trained end-to-end.
I Does not require smoothing methods.
Translation Performance ( and Blunsom, 2015)
Contents
Neural Language Models
Noise Contrastive Estimation
Review: Softmax Issues
Use a softmax to force a distribution,
$softmax(z) = \frac{\exp(z)}{\sum_{w \in C} \exp(z_w)}$
$\log softmax(z) = z - \log \sum_{w \in C} \exp(z_w)$
I Issue: class C is huge.
I For C&W, 100,000, for word2vec 1,000,000 types
I Note largest dataset is 6 billion words
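The log-softmax formula makes the cost visible: the log-sum term touches every class. The sketch below is a standard numerically stable version (subtracting the maximum first), which is an implementation detail not stated in the slides.

```python
import numpy as np

def log_softmax(z):
    """log softmax(z) = z - log sum_w exp(z_w), computed stably."""
    m = z.max()
    # The sum runs over all classes, so the cost is linear in the vocabulary size.
    return z - (m + np.log(np.exp(z - m).sum()))

z = np.array([1000.0, 1001.0, 1002.0])  # large scores would overflow a naive exp
log_probs = log_softmax(z)
```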
Unnormalized Scores
Recall the score defined as (dropping bias)
$z = \tanh(x W^1) W^2$
Unnormalized score of each word before the softmax,
$z_j = \tanh(x W^1) W^2_{*,j}$
for any $j \in \{1, \ldots, d_{out}\}$
Note: a single score $z_j$ can be computed efficiently, $O(1)$ in the output size versus $O(d_{out})$ for all scores.
Coherence
I We saw a similar idea earlier for ranking embeddings.
I Idea: Learn to distinguish coherent n-grams from corruption.
I Want to discriminate correct next words from other choices.
[ the dog walks ]
[ the dog house ]
[ the dog cats ]
[ the dog skips ]
Warm-Up
Imagine we have a new dataset,
$((x_1, y_1), d_1), \ldots, ((x_n, y_n), d_n)$,
I $x$; representation of context $w_{i-n+1}, \ldots, w_{i-1}$
I $y$; a possible $w_i$
I $d$; 1 if $y$ is correct, 0 otherwise
Objective is based on predicted $\hat{d}$:
$L(\theta) = \sum_i L_{crossentropy}(d_i, \hat{d}_i)$
Warm-Up: Binary Classification
How do we score $(x_i, y_i = \delta(w))$?
Could use unnormalized score,
$z_w = \tanh(x W^1) W^2_{*,w}$
Becomes softmax regression/non-linear logistic regression,
$\hat{d} = \sigma(z_w)$
I Much faster
I But does not help us train LM.
Implementation
Standard MLP language model, (only takes in x)
x⇒W1 ⇒ tanh⇒W2 ⇒ softmax
Computing binary (takes in x and y)
$\hat{d} = \sigma(z_w)$
x ⇒ W1 ⇒ tanh ⇒ · W2∗,w (Lookup) ⇒ σ
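The binary pipeline can be sketched by replacing the full product with W2 by a single column lookup. Sizes, weights, and the name `predict_correct` are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_correct(x, W1, W2, w):
    """d_hat = sigma(z_w): probability that word index w is the correct next word."""
    h = np.tanh(x @ W1)  # hidden layer, independent of w
    z_w = h @ W2[:, w]   # dot product with one column of W2 (lookup)
    return sigmoid(z_w)

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 20, 8, 1000  # illustrative sizes
x = rng.normal(size=d_in)
W1 = rng.normal(size=(d_in, d_hid))
W2 = rng.normal(size=(d_hid, d_out))
d_hat = predict_correct(x, W1, W2, w=42)
```

Only d_hid multiplications are needed after the hidden layer, instead of d_hid * d_out for the full softmax path.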
Noise Contrastive Estimation 1
Probabilistic model,
I Introduce random variable D
I If D = 1 produce true sample
I If D = 0 produce sample from a noise distribution.
I Hyperparameter $K$ is the ratio of noise samples to true samples
$p(D = 1) = \frac{1}{K + 1}$
$p(D = 0) = \frac{K}{K + 1}$
Noise Contrastive Estimation 2
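For a given $x, y$,
$$
\begin{aligned}
p(D = 1 \mid x, y) &= \frac{p(y \mid D = 1, x)\, p(D = 1 \mid x)}{\sum_d p(y \mid D = d, x)\, p(D = d \mid x)} \\
&= \frac{p(y \mid D = 1, x)\, p(D = 1 \mid x)}{p(y \mid D = 0, x)\, p(D = 0 \mid x) + p(y \mid D = 1, x)\, p(D = 1 \mid x)}
\end{aligned}
$$
Plug in the noise distribution and hyperparameters,
$$
\begin{aligned}
p(D = 1 \mid x, y) &= \frac{\frac{1}{K+1}\, p(y \mid D = 1, x)}{\frac{1}{K+1}\, p(y \mid D = 1, x) + \frac{K}{K+1}\, p(y \mid D = 0, x)} \\
&= \frac{p(y \mid D = 1, x)}{p(y \mid D = 1, x) + K p(y \mid D = 0, x)} \\
&= \sigma(\log p(y \mid D = 1, x) - \log(K p(y \mid D = 0, x)))
\end{aligned}
$$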
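The last step of the derivation can be verified numerically. The sketch compares Bayes' rule with priors 1/(K+1) and K/(K+1) against the sigmoid form; the probabilities used are arbitrary illustrative numbers.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def posterior_true(p_model, p_noise, K):
    """p(D=1 | x, y) via Bayes' rule with priors 1/(K+1) and K/(K+1)."""
    prior_true, prior_noise = 1.0 / (K + 1), K / (K + 1)
    numerator = p_model * prior_true
    return numerator / (numerator + p_noise * prior_noise)

def posterior_true_sigmoid(p_model, p_noise, K):
    """Same quantity written as sigma(log p_model - log(K * p_noise))."""
    return sigmoid(math.log(p_model) - math.log(K * p_noise))

# Illustrative values: model probability 0.02, noise probability 0.001, K = 10.
bayes = posterior_true(0.02, 0.001, 10)
closed_form = posterior_true_sigmoid(0.02, 0.001, 10)
```

Both expressions give 2/3 here, since 0.02 / (0.02 + 10 * 0.001) = 2/3.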
Noise Contrastive Estimation 3
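With
$p(D = 1 \mid x, y) = \sigma(\log p(y \mid D = 1, x) - \log(K p(y \mid D = 0, x)))$
the training objective for a corpus that has $K$ noise samples $s_{i,k}$ per
example is:
$$
\begin{aligned}
L(\theta) &= \sum_i \left[ \log p(D = 1 \mid x_i, y_i) + \sum_{k=1}^{K} \log p(D = 0 \mid x_i, Y = s_{i,k}) \right] \\
&= \sum_i \Big[ \log \sigma\big(\log p(y_i \mid D = 1, x_i) - \log(K p(y_i \mid D = 0, x_i))\big) \\
&\qquad + \sum_{k=1}^{K} \log\big(1 - \sigma(\log p(s_{i,k} \mid D = 1, x_i) - \log(K p(s_{i,k} \mid D = 0, x_i)))\big) \Big]
\end{aligned}
$$
I In practice, sample $s_{i,k}$ from the unigram distribution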
Noise Contrastive Estimation 4
But we still have a problem: $L$ is defined in terms of normalized
distributions, $\log p(y \mid D = 1, x)$.
Solution:
I Instead of explicitly normalizing, estimate $Z(x)$, the normalizing
constant of each context $x$, as a parameter (Gutmann &
Hyvärinen, 2010)
I Mnih and Teh (2012) show that fixing $Z(x) = 1$ for all contexts
works just as well
I So we can replace $\log p(y = \delta(w) \mid D = 1, x)$ with $z_w$, as computed
by our network
Noise Contrastive Estimation 5
So we now have
$$
L(\theta) = \sum_i \left[ \log \sigma(z_{w_i} - \log(K p_{ML}(w_i))) + \sum_{k=1}^{K} \log(1 - \sigma(z_{s_{i,k}} - \log(K p_{ML}(s_{i,k})))) \right]
$$
I Mnih and Teh (2012) show that the gradient of $L$ approaches the gradient
of the true language model's log-likelihood objective as $K \to \infty$.
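One example's contribution to this objective can be written directly from the formula. The sketch assumes the scores and unigram probabilities are already available as plain numbers; `nce_objective_one` and `log_sigmoid` are hypothetical helper names.

```python
import math

def log_sigmoid(a):
    """Numerically stable log(sigma(a))."""
    return -math.log1p(math.exp(-a)) if a >= 0 else a - math.log1p(math.exp(a))

def nce_objective_one(z_true, p_ml_true, z_noise, p_ml_noise, K):
    """One example's term of L(theta), to be maximized.

    z_true, p_ml_true: score and unigram probability of the observed word.
    z_noise, p_ml_noise: scores and unigram probabilities of the K noise samples.
    """
    total = log_sigmoid(z_true - math.log(K * p_ml_true))
    for z_s, p_s in zip(z_noise, p_ml_noise):
        # log(1 - sigma(a)) equals log sigma(-a).
        total += log_sigmoid(-(z_s - math.log(K * p_s)))
    return total

# With z = log(K * p_ML) every sigmoid input is 0, so each term is log(1/2).
K = 2
z = math.log(K * 0.1)
value = nce_objective_one(z, 0.1, [z, z], [0.1, 0.1], K)
```

Negating this quantity gives a binary negative log-likelihood that can be minimized with a standard optimizer.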
Implementation
I How do you efficiently compute $z_w$?
Need a lookup table (and dot product) for output embeddings!
(Not a full matrix-vector product.)
I How do you efficiently handle $\log p_{ML}(w)$?
Can be precomputed or placed in a lookup table.
I How do you handle sampling?
Can precompute a large number of samples (not example specific).
I How do you handle the loss?
Simply a binary NLL objective.
Implementation
Standard MLP language model,
x⇒W1 ⇒ tanh⇒W2 ⇒ softmax
Computing $\sigma(z_w - \log(K p_{ML}(w)))$,
x ⇒ W1 ⇒ tanh ⇒ · W2∗,w (Lookup) ⇒ − log K pML(w) (input) ⇒ σ
(Efficiency: compute the first three layers only once for all K + 1 words.)
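The shared computation can be sketched by computing the hidden layer once and gathering K + 1 columns of W2. Sizes, weights, the unigram table `log_p_ml`, and the name `nce_logits` are illustrative assumptions.

```python
import numpy as np

def nce_logits(x, W1, W2, word_ids, log_p_ml, K):
    """Sigmoid inputs z_w - log(K p_ML(w)) for the true word and K noise words.

    word_ids holds K + 1 indices; the hidden layer is computed once and
    only the matching columns of W2 are looked up.
    """
    h = np.tanh(x @ W1)      # shared by all K + 1 words
    z = h @ W2[:, word_ids]  # (d_hid,) @ (d_hid, K + 1)
    return z - (np.log(K) + log_p_ml[word_ids])

rng = np.random.default_rng(0)
d_in, d_hid, d_out, K = 20, 8, 50, 4  # illustrative sizes
x = rng.normal(size=d_in)
W1 = rng.normal(size=(d_in, d_hid))
W2 = rng.normal(size=(d_hid, d_out))
p_ml = rng.random(d_out)
p_ml /= p_ml.sum()                    # stand-in for a unigram distribution
word_ids = np.array([7, 1, 3, 3, 9])  # true word first, then K noise samples
logits = nce_logits(x, W1, W2, word_ids, np.log(p_ml), K)
```

Passing `logits` through a sigmoid gives p(D = 1 | x, w) for each of the K + 1 words without ever forming the full d_out-sized score vector.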
Using in Practice
Several options for test time,
I Use full softmax with learned parameters.
I Compute a subset of scores and renormalize (homework).
I Can sometimes just treat unnormalized scores as being
normalized (self-normalization)