
Neural Language Models

CMSC 723 / LING 723 / INST 725

MARINE CARPUAT

[email protected]

With slides from Graham Neubig

and Philipp Koehn

Roadmap

• Modeling Sequences

– First example: language model

– What are n-gram models?

– How to estimate them?

– How to evaluate them?

– Neural language models

Probabilistic Language Modeling

• Goal: compute the probability of a sentence or sequence of words

P(W) = P(w1,w2,w3,w4,w5…wn)

• Related task: probability of an upcoming word

P(w5|w1,w2,w3,w4)

• A model that computes either of these:

P(W) or P(wn|w1,w2…wn-1)

is called a language model.

Evaluation: How good is our model?

• Does our language model prefer good sentences to bad ones?

– Assign higher probability to “real” or “frequently observed” sentences than to “ungrammatical” or “rarely observed” sentences

• Extrinsic vs intrinsic evaluation

Intrinsic evaluation: intuition

• The Shannon Game: How well can we predict the next word?

– Unigrams are terrible at this game. (Why?)

• A better model of a text assigns a higher probability to the word that actually occurs

I always order pizza with cheese and ____

The 33rd President of the US was ____

I saw a ____

mushrooms 0.1

pepperoni 0.1

anchovies 0.01

….

fried rice 0.0001

….

and 1e-100

Intrinsic evaluation metric: perplexity

Perplexity is the inverse probability of the test set, normalized by the number of words:

PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}}

Chain rule:

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \dots w_{i-1})}}

For bigrams:

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}

Minimizing perplexity is the same as maximizing probability

The best language model is one that best predicts an unseen test set

• Gives the highest P(sentence)
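As a concrete illustration (not from the slides), here is a minimal Python sketch that computes perplexity under a bigram model; the toy probability table, the start symbol, and the tiny floor value standing in for smoothing are all made up for illustration.

    import math

    # Hypothetical bigram probabilities P(w_i | w_{i-1}); "<s>" marks sentence start.
    bigram_prob = {
        ("<s>", "i"): 0.5, ("i", "saw"): 0.2, ("saw", "a"): 0.4, ("a", "cat"): 0.1,
    }

    def perplexity(sentence, model):
        """PP(W): inverse probability of the words, normalized by their number N."""
        words = ["<s>"] + sentence.split()
        n = len(words) - 1          # number of predicted words
        log_prob = 0.0
        for prev, word in zip(words, words[1:]):
            p = model.get((prev, word), 1e-10)   # tiny floor stands in for smoothing
            log_prob += math.log(p)
        return math.exp(-log_prob / n)

    print(perplexity("i saw a cat", bigram_prob))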

Perplexity as branching factor

• Let’s suppose a sentence consists of N random digits

• What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
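Working through the question (the answer is implied by the perplexity formula above):

PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}} = \left( \left(\tfrac{1}{10}\right)^{N} \right)^{-\frac{1}{N}} = 10

so perplexity acts as an effective branching factor: the model is as uncertain as if it were choosing uniformly among 10 options at every position.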

Lower perplexity = better model

• Training: 38 million words; test: 1.5 million words (WSJ)

  N-gram order:  Unigram  Bigram  Trigram
  Perplexity:        962     170      109

Pros and cons of n-gram models

• N-gram models

– Really easy to build, can train on billions and billions of words

– Smoothing helps generalize to new data

– Only work well for word prediction if the test corpus looks like the training corpus

– Only capture short distance context

“Smarter” LMs can address some of these issues, but they are orders of magnitude slower…

Roadmap

• Modeling Sequences

– First example: language model

– What are n-gram models?

– How to estimate them?

– How to evaluate them?

– Neural language models

NEURAL NETWORKS

Aside

Recall the person/not-person classification problem

Given an introductory sentence in Wikipedia, predict whether the article is about a person

Formalizing binary prediction

The Perceptron: a “machine” to calculate a weighted sum

\text{sign}\left( \sum_{i=1}^{I} w_i \cdot \phi_i(x) \right)

φ“A” = 1

φ“site” = 1

φ“,” = 2

φ“located” = 1

φ“in” = 1

φ“Maizuru”= 1

φ“Kyoto” = 1

φ“priest” = 0

φ“black” = 0

[Figure: perceptron diagram combining these feature values with learned weights and a bias to produce a weighted sum]
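To make the weighted-sum computation concrete, here is a minimal Python sketch of perceptron prediction over sparse bag-of-words features; the weight values and the bias are made up for illustration, not the ones drawn in the slide's figure.

    # Minimal perceptron prediction sketch (illustrative weights, not the slide's).
    def sign(z):
        return 1 if z >= 0 else -1

    def predict(features, weights, bias):
        """sign( sum_i w_i * phi_i(x) + bias ) over sparse bag-of-words features."""
        score = bias + sum(weights.get(f, 0.0) * value for f, value in features.items())
        return sign(score)

    # Bag-of-words features for the example sentence fragment from the slide.
    phi = {"A": 1, "site": 1, ",": 2, "located": 1, "in": 1, "Maizuru": 1, "Kyoto": 1}
    w = {"site": -3.0, "Maizuru": 2.0}   # hypothetical weights
    print(predict(phi, w, bias=-1.0))    # -> -1 (predicts "not a person")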

The Perceptron: Geometric interpretation

[Figure: X and O examples plotted in feature space, separated by a linear decision boundary]

Limitation of perceptron

● Can only find linear separations between positive and negative examples

[Figure: XOR-style arrangement of X and O points that no single line can separate]

Neural Networks

● Connect together multiple perceptrons

φ“A” = 1

φ“site” = 1

φ“,” = 2

φ“located” = 1

φ“in” = 1

φ“Maizuru”= 1

φ“Kyoto” = 1

φ“priest” = 0

φ“black” = 0


● Motivation: Can represent non-linear functions!

Neural Networks: key terms

φ“A” = 1

φ“site” = 1

φ“,” = 2

φ“located” = 1

φ“in” = 1

φ“Maizuru”= 1

φ“Kyoto” = 1

φ“priest” = 0

φ“black” = 0


• Input (aka features)

• Output

• Nodes

• Layers

• Activation function (non-linear)

• Multi-layer perceptron

Example

● Create two classifiers

φ0(x1) = {-1, 1}    φ0(x2) = {1, 1}
φ0(x3) = {-1, -1}    φ0(x4) = {1, -1}

[Figure: the four points plotted in the φ0 space (an XOR-style arrangement of X and O labels); each of the two classifiers is a sign unit computing φ1[k] from φ0[0], φ0[1], and a bias, with weights w0,k and bias b0,k]

Example

● These classifiers map to a new space

φ0(x1) = {-1, 1}    φ0(x2) = {1, 1}
φ0(x3) = {-1, -1}    φ0(x4) = {1, -1}

φ1(x1) = {-1, -1}    φ1(x2) = {1, -1}
φ1(x3) = {-1, 1}     φ1(x4) = {-1, -1}

[Figure: the two classifiers map each point from the original φ0 plane to a new plane with axes φ1[0] and φ1[1]]

Example

● In the new space, the examples are linearly separable!

φ0(x1) = {-1, 1}    φ0(x2) = {1, 1}
φ0(x3) = {-1, -1}    φ0(x4) = {1, -1}

φ1(x1) = {-1, -1}    φ1(x2) = {1, -1}
φ1(x3) = {-1, 1}     φ1(x4) = {-1, -1}

[Figure: in the φ1 space a single line separates the X and O points; a final unit combines φ1[0] and φ1[1] to produce the output φ2[0] = y]

Example wrap-up:

Forward propagation

● The final net

[Figure: the final network: inputs φ0[0] and φ0[1] plus a bias feed two tanh hidden units producing φ1[0] and φ1[1], which together with a bias feed a final tanh unit producing the output φ2[0]]
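To make forward propagation concrete, here is a short plain-Python sketch of a network with this shape; the weight and bias values are a hypothetical choice consistent with the φ1 mappings listed on the earlier slides (with tanh the hidden values come out near ±0.76 rather than exactly ±1, but the signs match, and the sign of φ2[0] separates the two classes).

    import math

    def forward(phi0, W1, b1, W2, b2):
        """Forward propagation: phi0 -> tanh hidden layer phi1 -> tanh output phi2[0]."""
        phi1 = [math.tanh(sum(w * x for w, x in zip(row, phi0)) + b)
                for row, b in zip(W1, b1)]
        phi2 = math.tanh(sum(w * h for w, h in zip(W2, phi1)) + b2)
        return phi1, phi2

    # Hypothetical weights reproducing the example's mapping of the four XOR-style points.
    W1, b1 = [[1.0, 1.0], [-1.0, -1.0]], [-1.0, -1.0]
    W2, b2 = [1.0, 1.0], 1.0

    for point in ([-1, 1], [1, 1], [-1, -1], [1, -1]):
        print(point, forward(point, W1, b1, W2, b2))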

Softmax Function for multiclass classification

● Generalizes the sigmoid function to multiple classes:

P(y \mid x) = \frac{e^{\mathbf{w} \cdot \phi(x, y)}}{\sum_{\tilde{y}} e^{\mathbf{w} \cdot \phi(x, \tilde{y})}}

(numerator: score of the current class; denominator: sum over all classes)

● Can be expressed using matrix/vector operations:

\mathbf{r} = \exp\left(\mathbf{W} \cdot \phi(x)\right) \qquad \mathbf{p} = \frac{\mathbf{r}}{\sum_{r \in \mathbf{r}} r}
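A brief sketch of the vectorized form, assuming NumPy; subtracting the maximum score before exponentiating is a standard numerical-stability trick that is not part of the slide.

    import numpy as np

    def softmax(W, phi_x):
        """p = exp(W · phi(x)) normalized to sum to 1."""
        scores = W @ phi_x
        r = np.exp(scores - scores.max())   # stability: shift scores before exponentiating
        return r / r.sum()

    W = np.array([[1.0, -0.5], [0.2, 0.3], [-1.0, 0.8]])   # 3 classes, 2 features (toy values)
    phi_x = np.array([0.5, 1.0])
    print(softmax(W, phi_x))   # probabilities over the 3 classes, summing to 1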

Stochastic Gradient Descent

Online training algorithm for probabilistic models

w = 0
for I iterations:
    for each labeled pair (x, y) in the data:
        w += α * dP(y|x)/dw

In other words

• For every training example, calculate the gradient (the direction that will increase the probability of y)

• Move in that direction, multiplied by the learning rate α
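A runnable sketch of this training loop for the two-class sigmoid model used on the next slides; the toy feature vectors, labels, learning rate, and iteration count are made up for illustration.

    import math

    def sigmoid_prob(w, phi, y):
        """P(y | x) for y in {+1, -1} under the sigmoid model."""
        z = sum(wi * xi for wi, xi in zip(w, phi))
        return 1.0 / (1.0 + math.exp(-y * z))

    def grad_prob(w, phi, y):
        """dP(y | x)/dw = y * phi(x) * e^{w·phi(x)} / (1 + e^{w·phi(x)})^2 (see next slide)."""
        z = sum(wi * xi for wi, xi in zip(w, phi))
        s = math.exp(z) / (1.0 + math.exp(z)) ** 2
        return [y * xi * s for xi in phi]

    data = [([1.0, 2.0], 1), ([2.0, 0.5], 1), ([-1.0, -1.0], -1)]   # toy (phi(x), y) pairs
    w, alpha = [0.0, 0.0], 0.1
    for _ in range(100):                        # I iterations
        for phi, y in data:                     # each labeled pair
            g = grad_prob(w, phi, y)
            w = [wi + alpha * gi for wi, gi in zip(w, g)]
    print(w, [round(sigmoid_prob(w, phi, y), 3) for phi, y in data])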

[Plot: the gradient dP(y|x)/d(w·φ(x)) as a function of w·φ(x); it peaks near w·φ(x) = 0 and approaches 0 as |w·φ(x)| grows]

Gradient of the Sigmoid Function

Take the derivative of the probability

\frac{d}{dw} P(y = 1 \mid x) = \frac{d}{dw} \frac{e^{\mathbf{w} \cdot \phi(x)}}{1 + e^{\mathbf{w} \cdot \phi(x)}} = \frac{\phi(x)\, e^{\mathbf{w} \cdot \phi(x)}}{\left(1 + e^{\mathbf{w} \cdot \phi(x)}\right)^2}

\frac{d}{dw} P(y = -1 \mid x) = \frac{d}{dw} \left(1 - \frac{e^{\mathbf{w} \cdot \phi(x)}}{1 + e^{\mathbf{w} \cdot \phi(x)}}\right) = -\frac{\phi(x)\, e^{\mathbf{w} \cdot \phi(x)}}{\left(1 + e^{\mathbf{w} \cdot \phi(x)}\right)^2}
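As a quick sanity check (not on the slide), the closed-form gradient can be compared against a central finite-difference approximation; the single-feature value and weight below are arbitrary.

    import math

    w, phi = 0.3, 2.0   # single-feature case, arbitrary values

    def p_y1(w):
        return math.exp(w * phi) / (1.0 + math.exp(w * phi))

    analytic = phi * math.exp(w * phi) / (1.0 + math.exp(w * phi)) ** 2
    numeric = (p_y1(w + 1e-6) - p_y1(w - 1e-6)) / 2e-6   # central difference
    print(analytic, numeric)   # the two values should agree to several decimal places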

Learning: We Don't Know the

Derivative for Hidden Units!

For neural networks, we only know the correct output for the last layer:

\frac{dP(y = 1 \mid \mathbf{x})}{d\mathbf{w_4}} = \mathbf{h}(x)\, \frac{e^{\mathbf{w_4} \cdot \mathbf{h}(x)}}{\left(1 + e^{\mathbf{w_4} \cdot \mathbf{h}(x)}\right)^2}

\frac{dP(y = 1 \mid \mathbf{x})}{d\mathbf{w_1}} = ? \qquad \frac{dP(y = 1 \mid \mathbf{x})}{d\mathbf{w_2}} = ? \qquad \frac{dP(y = 1 \mid \mathbf{x})}{d\mathbf{w_3}} = ?

[Figure: a network in which hidden units with weight vectors w1, w2, w3 compute h(x) from φ(x), and an output unit with weights w4 predicts y = 1]

Answer: Back-Propagation

Calculate derivative with chain rule

\frac{dP(y = 1 \mid x)}{d\mathbf{w_1}} = \frac{dP(y = 1 \mid x)}{d\left(\mathbf{w_4} \cdot \mathbf{h}(\mathbf{x})\right)} \; \frac{d\left(\mathbf{w_4} \cdot \mathbf{h}(\mathbf{x})\right)}{dh_1(\mathbf{x})} \; \frac{dh_1(\mathbf{x})}{d\mathbf{w_1}}

The first factor is the error of the next unit, \delta_4 = \frac{e^{\mathbf{w_4} \cdot \mathbf{h}(x)}}{\left(1 + e^{\mathbf{w_4} \cdot \mathbf{h}(x)}\right)^2}; the second is the weight w_{1,4}; the third is the gradient of this unit.

In general, calculate \delta_i based on the next units j:

\frac{dP(y = 1 \mid \mathbf{x})}{d\mathbf{w_i}} = \frac{dh_i(\mathbf{x})}{d\mathbf{w_i}} \sum_j \delta_j\, w_{i,j}

Backpropagation = Gradient descent + Chain rule
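A compact sketch of this recipe for a network with one tanh hidden layer and a sigmoid output, following the notation above; the update performs gradient ascent on P(y | x) as in the preceding slides, and the architecture, initial weights, and data are illustrative.

    import math

    def backprop_update(phi, y, W_hidden, w_out, alpha=0.1):
        """One gradient-ascent step on P(y | x) for a one-hidden-layer tanh network.

        phi: input features, y: label in {+1, -1},
        W_hidden: one weight vector w_i per hidden unit, w_out: output weights w4.
        """
        # Forward pass
        h = [math.tanh(sum(w * x for w, x in zip(w_i, phi))) for w_i in W_hidden]
        z = sum(w * hi for w, hi in zip(w_out, h))

        # Error of the output unit: delta = dP(y | x)/dz (sign flips for y = -1)
        delta = y * math.exp(z) / (1.0 + math.exp(z)) ** 2

        # Hidden-layer gradients via the chain rule, using the *old* output weights:
        # dP/dw_i = phi(x) * (1 - h_i^2) * delta * w_out[i]
        new_W_hidden = [[w + alpha * delta * w_out[i] * (1 - h[i] ** 2) * x
                         for w, x in zip(W_hidden[i], phi)]
                        for i in range(len(W_hidden))]

        # Output-layer gradient: dP/dw_out = h * delta
        new_w_out = [w + alpha * delta * hi for w, hi in zip(w_out, h)]
        return new_W_hidden, new_w_out

    # Example usage with made-up numbers
    W_hidden, w_out = [[0.1, -0.2], [0.3, 0.4]], [0.5, -0.5]
    W_hidden, w_out = backprop_update([1.0, 2.0], 1, W_hidden, w_out)
    print(W_hidden, w_out)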

Feed Forward Neural Nets

All connections point forward


It is a directed acyclic graph (DAG)

Neural Networks

• Non-linear classification

• Prediction: forward propagation

– Vector/matrix operations + non-linearities

• Training: backpropagation + stochastic gradient descent

For more details, see Cho chap 3 or CIML Chap 7

NEURAL NETWORKS

Aside

Back to language modeling…

Representing words

• “one hot vector”

dog = [ 0, 0, 0, 0, 1, 0, 0, 0 …]

cat = [ 0, 0, 0, 0, 0, 0, 1, 0 …]

eat = [ 0, 1, 0, 0, 0, 0, 0, 0 …]

• That’s a large vector! Practical solutions:

– limit to most frequent words (e.g., top 20000)

– cluster words into classes

• WordNet classes, frequency binning, etc.
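A tiny sketch of the one-hot representation, using a hypothetical eight-word vocabulary whose indices match the dog/cat vectors above.

    # Map each vocabulary word to an index, then represent it as a one-hot vector.
    vocab = ["the", "eat", "a", "of", "dog", "and", "cat", "to"]   # hypothetical vocabulary
    word_to_id = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        vec = [0] * len(vocab)
        vec[word_to_id[word]] = 1
        return vec

    print(one_hot("dog"))   # [0, 0, 0, 0, 1, 0, 0, 0]
    print(one_hot("cat"))   # [0, 0, 0, 0, 0, 0, 1, 0]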

Feed-Forward Neural Language Model

Map each word into a lower-dimensional real-valued space using a shared weight matrix C

Embedding layer

Bengio et al. 2003
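A minimal NumPy sketch in the spirit of Bengio et al. (2003): look up the context words in a shared embedding matrix C, pass the concatenated embeddings through a tanh hidden layer, and apply a softmax over the vocabulary. All dimensions, initializations, and word ids are illustrative, and training is omitted.

    import numpy as np

    V, d, n_context, H = 10000, 64, 2, 128   # vocab size, embedding dim, context words, hidden dim
    rng = np.random.default_rng(0)
    C = rng.normal(0, 0.1, (V, d))           # shared embedding matrix
    W_h = rng.normal(0, 0.1, (H, n_context * d))
    b_h = np.zeros(H)
    W_o = rng.normal(0, 0.1, (V, H))
    b_o = np.zeros(V)

    def next_word_probs(context_ids):
        """P(w_t | w_{t-2}, w_{t-1}) for a trigram feed-forward LM."""
        x = np.concatenate([C[i] for i in context_ids])   # look up and concatenate embeddings
        h = np.tanh(W_h @ x + b_h)                        # hidden layer
        scores = W_o @ h + b_o
        scores -= scores.max()                            # numerical stability
        p = np.exp(scores)
        return p / p.sum()

    probs = next_word_probs([42, 7])   # arbitrary context word ids
    print(probs.shape, probs.sum())    # (10000,) 1.0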

Word Embeddings

• Neural language models produce word embeddings as a by-product

• Words that occur in similar contexts tend to have similar embeddings

• Embeddings are useful features in many NLP tasks

[Turian et al. 2009]

Word embeddings illustrated

Recurrent Neural Networks

Recurrent Neural Nets (RNN)

Some of the node outputs are fed back as inputs:

[Figure: a network where the previous hidden state h_{t-1} is fed back as an input alongside φ_t(x) to produce the output y]

Why? It is possible to “memorize”

Training: backpropagation through time

After processing a few training examples, update through the unfolded recurrent neural network

Recurrent neural language models

• Hidden layer plays double duty

– Memory of the network

– Continuous space representation to predict output words

• Other more elaborate architectures

– Long Short Term Memory

– Gated Recurrent Units
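A sketch of one step of a simple (Elman-style) recurrent language model, assuming NumPy; the hidden state plays the double duty described above, acting both as memory and as the representation used to predict the next word. Shapes, initialization, and word ids are illustrative.

    import numpy as np

    V, d, H = 5000, 32, 64                     # vocab size, embedding dim, hidden dim
    rng = np.random.default_rng(1)
    E = rng.normal(0, 0.1, (V, d))             # word embeddings
    W_x = rng.normal(0, 0.1, (H, d))
    W_h = rng.normal(0, 0.1, (H, H))
    W_o = rng.normal(0, 0.1, (V, H))

    def rnn_step(word_id, h_prev):
        """One step: update the hidden state, then predict a distribution over the next word."""
        h = np.tanh(W_x @ E[word_id] + W_h @ h_prev)   # hidden layer doubles as memory
        scores = W_o @ h
        p = np.exp(scores - scores.max())
        return h, p / p.sum()

    h = np.zeros(H)
    for w in [12, 7, 93]:                       # arbitrary word ids for a short prefix
        h, p_next = rnn_step(w, h)
    print(p_next.shape, p_next.sum())           # (5000,) 1.0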

Neural Language Models in practice

• Much more expensive to train than n-grams!

• But yielded dramatic improvement in hard extrinsic tasks

– speech recognition (Mikolov et al. 2011)

– and more recently machine translation (Devlin et al. 2014)

• Key practical issue:

– softmax requires normalizing over the sum of scores for all possible words

– What to do?

• Ignore – a score is a score (Auli and Gao, 2014)

• Integrate normalization into objective function (Devlin et al. 2014)

What we know about

modeling sequences so far…

– First example: language model

– What are n-gram models?

– How to estimate them?

– How to evaluate them?

– Neural language models