Neural Language Models
CMSC 723 / LING 723 / INST 725
MARINE CARPUAT
[email protected]
Slides from Graham Neubig and Philipp Koehn
Roadmap
• Modeling Sequences
– First example: language model
– What are n-gram models?
– How to estimate them?
– How to evaluate them?
– Neural language models
Probabilistic Language Modeling
• Goal: compute the probability of a sentence or sequence of words
$P(W) = P(w_1, w_2, w_3, w_4, w_5, \ldots, w_n)$
• Related task: probability of an upcoming word
$P(w_5 \mid w_1, w_2, w_3, w_4)$
• A model that computes either of these,
$P(W)$ or $P(w_n \mid w_1, w_2, \ldots, w_{n-1})$,
is called a language model.
Evaluation:
How good is our model?
• Does our language model prefer good sentences to bad ones?
– It should assign higher probability to “real” or “frequently observed” sentences
– than to “ungrammatical” or “rarely observed” sentences
• Extrinsic vs intrinsic evaluation
Intrinsic evaluation: intuition
• The Shannon Game: how well can we predict the next word?
– Unigrams are terrible at this game. (Why?)
• A better model of a text assigns a higher probability to the word that actually occurs
I always order pizza with cheese and ____
The 33rd President of the US was ____
I saw a ____
(possible continuations of the first sentence, with probabilities a good model might assign)
mushrooms   0.1
pepperoni   0.1
anchovies   0.01
…
fried rice  0.0001
…
and         1e-100
Intrinsic evaluation metric: perplexity
Perplexity is the inverse probability of the test set, normalized by the number of words:
$$PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}$$
Chain rule:
$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}$$
For bigrams:
$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$$
Minimizing perplexity is the same as maximizing probability
The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence)
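To make the definition concrete, here is a minimal sketch (in Python) of bigram perplexity computed from a table of conditional probabilities; the toy sentence and the probability values are made-up illustrations, not from the slides.

    import math

    def bigram_perplexity(words, bigram_prob):
        # PP(W) = (prod_i 1 / P(w_i | w_{i-1})) ** (1/N)
        log_prob = 0.0
        for prev, cur in zip(words, words[1:]):
            log_prob += math.log(bigram_prob[(prev, cur)])
        n = len(words) - 1  # number of predicted words
        return math.exp(-log_prob / n)

    # hypothetical bigram probabilities for a toy sentence
    probs = {("<s>", "i"): 0.25, ("i", "saw"): 0.1, ("saw", "a"): 0.5, ("a", "cat"): 0.05}
    print(bigram_perplexity(["<s>", "i", "saw", "a", "cat"], probs))  # about 6.3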
Perplexity as branching factor
• Suppose a sentence consists of N random digits
• What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
$$PP(W) = \left(\left(\tfrac{1}{10}\right)^{N}\right)^{-\frac{1}{N}} = 10$$
Lower perplexity = better model
• Training: 38 million words; test: 1.5 million words (WSJ)
N-gram order:   Unigram   Bigram   Trigram
Perplexity:     962       170      109
Pros and cons of n-gram models
• N-gram models
– Really easy to build, can train on billions and billions of words
– Smoothing helps generalize to new data
– Only work well for word prediction if the test corpus looks like the training corpus
– Only capture short distance context
“Smarter” LMs can address some of these issues, but they are orders of magnitude slower…
Roadmap
• Modeling Sequences
– First example: language model
– What are n-gram models?
– How to estimate them?
– How to evaluate them?
– Neural language models
Recall the person/not-person classification problem
Given an introductory sentence in Wikipedia, predict whether the article is about a person
The Perceptron: a “machine” to calculate a weighted sum
$$y = \operatorname{sign}\left(\sum_{i=1}^{I} w_i \cdot \phi_i(x)\right)$$
φ“A” = 1
φ“site” = 1
φ“,” = 2
φ“located” = 1
φ“in” = 1
φ“Maizuru”= 1
φ“Kyoto” = 1
φ“priest” = 0
φ“black” = 0
[Figure: the corresponding weight for each feature; most weights are 0, with a few nonzero values (e.g., −3, 2) and a bias of −1]
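The weighted sum itself is a one-liner; below is a minimal Python sketch of the perceptron prediction rule with hypothetical feature counts and weights (the exact weight values from the figure are not assumed).

    def perceptron_predict(features, weights, bias=0.0):
        # sign of the weighted sum of feature values
        score = bias + sum(weights.get(f, 0.0) * v for f, v in features.items())
        return 1 if score >= 0 else -1

    # made-up feature counts and weights for the Wikipedia example
    phi = {"A": 1, "site": 1, ",": 2, "located": 1, "in": 1, "Maizuru": 1, "Kyoto": 1}
    w = {"site": -3.0, "priest": 2.0}  # illustrative weights only
    print(perceptron_predict(phi, w, bias=-1.0))  # -1: not a person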
Limitation of perceptron
● can only find linear separations between
positive and negative examples
[Figure: four points arranged in an XOR pattern, alternating X and O examples, which no single line can separate]
Neural Networks
● Connect together multiple perceptrons
[Figure: the same example feature vector feeding into several connected perceptrons, whose outputs feed a final unit]
● Motivation: Can represent non-linear functions!
Neural Networks: key terms
[Figure: the same network, annotated with the terms listed below]
• Input (aka features)
• Output
• Nodes
• Layers
• Activation function
(non-linear)
• Multi-layer
perceptron
Example
● Create two classifiers
[Figure: the four points in the original space, φ0(x1) = (−1, 1), φ0(x2) = (1, 1), φ0(x3) = (−1, −1), φ0(x4) = (1, −1), labeled in an XOR pattern; two sign units, with weights w0,0, b0,0 and w0,1, b0,1, take φ0[0], φ0[1], and a constant 1 as input and compute the new features φ1[0] and φ1[1]]
Example
● These classifiers map to a new space
[Figure: the two classifiers map the points to a new space: φ1(x1) = (−1, −1), φ1(x2) = (1, −1), φ1(x3) = (−1, 1), φ1(x4) = (−1, −1)]
Example
● In the new space, the examples are linearly separable!
[Figure: in the new space, the O points and X points fall on opposite sides of a line; a final unit with weights (1, 1) and bias 1 computes the output φ2[0] = y]
Example wrap-up:
Forward propagation
● The final net
[Figure: the complete network; two tanh hidden units compute φ1[0] and φ1[1] from φ0[0], φ0[1], and a bias input, and a final tanh unit computes the output φ2[0]]
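The forward pass of this small network fits in a few lines. The sketch below uses one choice of weights (±1 weights with bias −1 on the hidden units and bias +1 on the output) that reproduces the mapping shown on the slides; it is illustrative, not the only possible solution.

    import numpy as np

    W1 = np.array([[1.0, 1.0],     # hidden unit computing phi1[0]
                   [-1.0, -1.0]])  # hidden unit computing phi1[1]
    b1 = np.array([-1.0, -1.0])
    w2 = np.array([1.0, 1.0])      # output unit computing phi2[0]
    b2 = 1.0

    def forward(phi0):
        phi1 = np.tanh(W1 @ phi0 + b1)  # hidden layer
        return np.tanh(w2 @ phi1 + b2)  # output phi2[0]

    for point in [(-1, 1), (1, 1), (-1, -1), (1, -1)]:
        print(point, np.sign(forward(np.array(point, dtype=float))))
    # negative for the X points (-1, 1) and (1, -1); positive for the O points (1, 1) and (-1, -1)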
Softmax Function for multiclass classification
● The analogue of the sigmoid function for multiple classes:
$$P(y \mid x) = \frac{e^{\mathbf{w} \cdot \phi(x, y)}}{\sum_{\tilde{y}} e^{\mathbf{w} \cdot \phi(x, \tilde{y})}}$$
(the numerator scores the current class; the denominator sums over all classes)
● Can be expressed using matrix/vector operations:
$$\mathbf{r} = \exp\left(\mathbf{W} \cdot \phi(x)\right) \qquad \mathbf{p} = \frac{\mathbf{r}}{\sum_{\tilde{r} \in \mathbf{r}} \tilde{r}}$$
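A minimal sketch of the vectorized form, assuming the scores W·φ(x) have already been computed; subtracting the maximum score is a standard trick for numerical stability and does not change the result.

    import numpy as np

    def softmax(scores):
        r = np.exp(scores - np.max(scores))  # r = exp(W . phi(x)), shifted for stability
        return r / r.sum()                   # p = r / sum(r)

    # hypothetical scores for three classes
    print(softmax(np.array([2.0, 1.0, -1.0])))  # probabilities summing to 1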
Stochastic Gradient Descent
Online training algorithm for probabilistic models
    w = 0
    for I iterations:
        for each labeled pair (x, y) in the data:
            w += α * dP(y|x)/dw
In other words
• For every training example, calculate the gradient
(the direction that will increase the probability of y)
• Move in that direction, multiplied by learning rate α
[Figure: dP(y|x)/d(w·φ(x)) plotted against w·φ(x) over the range −10 to 10; the gradient is largest near 0 and vanishes for large |w·φ(x)|]
Gradient of the Sigmoid Function
Take the derivative of the probability
$$\frac{d}{dw} P(y = 1 \mid x) = \frac{d}{dw} \frac{e^{\mathbf{w} \cdot \phi(x)}}{1 + e^{\mathbf{w} \cdot \phi(x)}} = \frac{\phi(x)\, e^{\mathbf{w} \cdot \phi(x)}}{\left(1 + e^{\mathbf{w} \cdot \phi(x)}\right)^2}$$
$$\frac{d}{dw} P(y = -1 \mid x) = \frac{d}{dw}\left(1 - \frac{e^{\mathbf{w} \cdot \phi(x)}}{1 + e^{\mathbf{w} \cdot \phi(x)}}\right) = \frac{-\phi(x)\, e^{\mathbf{w} \cdot \phi(x)}}{\left(1 + e^{\mathbf{w} \cdot \phi(x)}\right)^2}$$
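Combining the two previous slides, here is a minimal sketch of SGD for the binary sigmoid model, using the gradient derived above; the toy data, the number of iterations, and the learning rate are made up.

    import numpy as np

    def grad_prob(w, phi, y):
        # dP(y | x) / dw from the slide: +/- phi(x) * e^{w.phi} / (1 + e^{w.phi})^2
        s = np.exp(w @ phi) / (1 + np.exp(w @ phi)) ** 2
        return phi * s if y == 1 else -phi * s

    data = [(np.array([1.0, 0.0]), 1), (np.array([0.0, 1.0]), -1)]  # toy (phi(x), y) pairs
    w = np.zeros(2)
    alpha = 0.1
    for _ in range(100):          # I iterations
        for phi, y in data:       # one online update per example
            w += alpha * grad_prob(w, phi, y)
    print(w)  # first weight grows positive, second grows negative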
Learning: We Don't Know the
Derivative for Hidden Units!
For NNs, we only know the correct label for the last layer
For the output layer (weights $\mathbf{w_4}$, taking the hidden vector $\mathbf{h}(x)$ as input):
$$\frac{dP(y = 1 \mid \mathbf{x})}{d\mathbf{w_4}} = \mathbf{h}(x)\, \frac{e^{\mathbf{w_4} \cdot \mathbf{h}(x)}}{\left(1 + e^{\mathbf{w_4} \cdot \mathbf{h}(x)}\right)^2}$$
But for the hidden-unit weights:
$$\frac{dP(y = 1 \mid \mathbf{x})}{d\mathbf{w_1}} = ? \qquad \frac{dP(y = 1 \mid \mathbf{x})}{d\mathbf{w_2}} = ? \qquad \frac{dP(y = 1 \mid \mathbf{x})}{d\mathbf{w_3}} = ?$$
Answer: Back-Propagation
Calculate the derivative with the chain rule:
$$\frac{dP(y = 1 \mid x)}{d\mathbf{w_1}} = \frac{dP(y = 1 \mid x)}{d\,(\mathbf{w_4} \cdot \mathbf{h}(\mathbf{x}))} \cdot \frac{d\,(\mathbf{w_4} \cdot \mathbf{h}(\mathbf{x}))}{dh_1(\mathbf{x})} \cdot \frac{dh_1(\mathbf{x})}{d\mathbf{w_1}}$$
where
$$\frac{dP(y = 1 \mid x)}{d\,(\mathbf{w_4} \cdot \mathbf{h}(\mathbf{x}))} = \frac{e^{\mathbf{w_4} \cdot \mathbf{h}(x)}}{\left(1 + e^{\mathbf{w_4} \cdot \mathbf{h}(x)}\right)^2} \;\text{(error of the next unit, } \delta_4\text{)}, \qquad \frac{d\,(\mathbf{w_4} \cdot \mathbf{h}(\mathbf{x}))}{dh_1(\mathbf{x})} = w_{1,4} \;\text{(weight)},$$
and $\frac{dh_1(\mathbf{x})}{d\mathbf{w_1}}$ is the gradient of this unit.
In general, calculate $\delta_i$ based on the next units $j$:
$$\frac{dP(y = 1 \mid \mathbf{x})}{d\mathbf{w_i}} = \frac{dh_i(\mathbf{x})}{d\mathbf{w_i}} \sum_j \delta_j\, w_{i,j}$$
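As a concrete check of the chain rule, here is a minimal numerical sketch for a network with tanh hidden units and a sigmoid output; the layer sizes, random weights, and variable names (W_hidden, w4) are illustrative assumptions, not the slides' example.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    phi = np.array([1.0, -1.0])         # input features phi(x)
    W_hidden = rng.normal(size=(3, 2))  # rows are w1, w2, w3
    w4 = rng.normal(size=3)             # output weights

    # forward pass
    h = np.tanh(W_hidden @ phi)         # h(x)
    p = sigmoid(w4 @ h)                 # P(y = 1 | x)

    # backward pass
    delta4 = p * (1 - p)                         # error of the output unit
    grad_w4 = delta4 * h                         # dP / dw4
    delta_hidden = (delta4 * w4) * (1 - h ** 2)  # (sum_j delta_j w_{i,j}) times tanh's derivative
    grad_W_hidden = np.outer(delta_hidden, phi)  # dP / dw_i = dh_i/dw_i * sum_j delta_j w_{i,j}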
Neural Networks
• Non-linear classification
• Prediction: forward propagation
– Vector/matrix operations + non-linearities
• Training: backpropagation + stochastic gradient
descent
For more details, see Cho chap 3 or CIML Chap 7
Representing words
• “one hot vector”
dog = [ 0, 0, 0, 0, 1, 0, 0, 0 …]
cat = [ 0, 0, 0, 0, 0, 0, 1, 0 …]
eat = [ 0, 1, 0, 0, 0, 0, 0, 0 …]
• That’s a large vector! Practical solutions:
– limit to the most frequent words (e.g., top 20000)
– cluster words into classes
• WordNet classes, frequency binning, etc.
Feed-Forward Neural Language Model
Map each word into a lower-dimensional real-valued space using a shared weight matrix C (the embedding layer)
Bengio et al. 2003
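A minimal sketch of the architecture (in the spirit of Bengio et al. 2003, but with made-up dimensions and without bias terms): look up each context word in the shared matrix C, concatenate the embeddings, pass them through a hidden layer, and take a softmax over the vocabulary.

    import numpy as np

    V, d, context, H = 10000, 50, 2, 100   # vocabulary, embedding dim, context size, hidden size
    rng = np.random.default_rng(0)
    C = rng.normal(size=(V, d))            # shared embedding matrix
    W_h = rng.normal(size=(H, context * d))
    W_out = rng.normal(size=(V, H))

    def next_word_probs(context_ids):
        x = np.concatenate([C[i] for i in context_ids])  # embedding layer: look up and concatenate
        h = np.tanh(W_h @ x)                             # hidden layer
        scores = W_out @ h
        e = np.exp(scores - scores.max())                # softmax over all words
        return e / e.sum()

    p = next_word_probs([42, 7])   # P(next word | two previous word ids)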
Word Embeddings
• Neural language models produce word embeddings as a by-product
• Words that occur in similar contexts tend to have similar embeddings
• Embeddings are useful features in many NLP tasks
[Turian et al. 2009]
Recurrent Neural Nets (RNN)
Part of the nodes’ outputs is fed back as input at the next time step
[Figure: a recurrent unit takes the current input φ_t(x) and the previous hidden state h_{t−1}, and produces an output y]
Why? This makes it possible to “memorize” earlier parts of the sequence
Training: backpropagation through time
After processing a few training examples, update the weights through the unfolded recurrent neural network
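A minimal sketch of a single recurrent step, assuming tanh activations and made-up dimensions; the previous hidden state h_prev is exactly the part of the output that is fed back as input.

    import numpy as np

    d, H, V = 50, 100, 10000           # input dim, hidden dim, vocabulary size (made up)
    rng = np.random.default_rng(0)
    W_x = rng.normal(size=(H, d))      # input-to-hidden weights
    W_h = rng.normal(size=(H, H))      # hidden-to-hidden (recurrent) weights
    W_out = rng.normal(size=(V, H))

    def rnn_step(x_t, h_prev):
        h_t = np.tanh(W_x @ x_t + W_h @ h_prev)  # new hidden state: memory of the sequence so far
        scores = W_out @ h_t                     # scores used to predict the next word
        return h_t, scores

    h = np.zeros(H)
    for x_t in rng.normal(size=(3, d)):          # a toy sequence of 3 input vectors
        h, scores = rnn_step(x_t, h)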
Recurrent neural language models
• Hidden layer plays double duty
– Memory of the network
– Continuous space representation to predict output
words
• Other more elaborate architectures
– Long Short-Term Memory
– Gated Recurrent Units
Neural Language Models
in practice
• Much more expensive to train than n-grams!
• But they have yielded dramatic improvements on hard extrinsic tasks
– speech recognition (Mikolov et al. 2011)
– and more recently machine translation (Devlin et al. 2014)
• Key practical issue:
– the softmax requires normalizing over the scores of all possible words
– What to do?
• Ignore – a score is a score (Auli and Gao, 2014)
• Integrate normalization into objective function (Devlin et al. 2014)