Natural Language Processing: Language Models
Based on slides from Michael Collins, Chris Manning, Richard Socher, and Dan Jurafsky


Page 1:

Natural Language Processing

Language models

Based on slides from Michael Collins, Chris Manning, Richard Socher, and Dan Jurafsky

Page 2:

Plan

• Problem definition

• Trigram models

• Evaluation

• Estimation

• Interpolation

• Discounting

2

Page 3:

Motivations

• Define a probability distribution over sentences

• Why?

• Machine translation

• P(“high winds”) > P(“large winds”)

• Spelling correction

• P(“The office is fifteen minutes from here”) > P(“The office is fifteen minuets from here”)

• Speech recognition (that’s where it started!)

• P(“recognize speech”) > P(“wreck a nice beach”)

• And more!

p(01010 | ξ) ∝ p(ξ | 01010) · p(01010)

3

Page 4:

Motivations

• Philosophical: a model that is good at predicting the next word must know something about language and the world

• A good representation for any NLP task

• paper 1

• paper 2

Page 5:

Motivations

• Techniques will be useful later

5

Page 6:

Problem definition

• Given a finite vocabulary

V = {the, a, man, telescope, Beckham, two, …}

• We have an infinite language L, which is V* concatenated with the special symbol STOP

the STOP
a STOP
the fan STOP
the fan saw Beckham STOP
the fan saw saw STOP
the fan saw Beckham play for Real Madrid STOP

6

Page 7:

Problem definition

• Input: a training set of example sentences

• Currently: roughly hundreds of billions of words.

• Output: a probability distribution p over L such that

Σ_{x ∈ L} p(x) = 1,  p(x) ≥ 0 for all x ∈ L

p("the STOP") = 10^{-12}
p("the fan saw Beckham STOP") = 2 × 10^{-8}
p("the fan saw saw STOP") = 10^{-15}

7

Page 8:

A naive method

• Assume we have N training sentences

• Let x_1, x_2, …, x_n be a sentence, and c(x_1, x_2, …, x_n) be the number of times it appeared in the training data.

• Define a language model:

p(x_1, …, x_n) = c(x_1, …, x_n) / N

8

Page 9:

A naive method

• Assume we have N training sentences

• Let x_1, x_2, …, x_n be a sentence, and c(x_1, x_2, …, x_n) be the number of times it appeared in the training data.

• Define a language model:

p(x_1, …, x_n) = c(x_1, …, x_n) / N

• No generalization!

Page 10:

Markov processes

• Markov processes:

• Given a sequence of n random variables: X_1, X_2, …, X_n,  n = 100,  X_i ∈ V

• We want a sequence probability model p(X_1 = x_1, X_2 = x_2, …, X_n = x_n)

• There are |V|^n possible sequences

9

Page 11:

First-order Markov process

Chain rule:

p(X_1 = x_1, X_2 = x_2, …, X_n = x_n) = p(X_1 = x_1) ∏_{i=2}^{n} p(X_i = x_i | X_1 = x_1, …, X_{i−1} = x_{i−1})

10

Page 12:

First-order Markov process

Chain rule:

p(X_1 = x_1, X_2 = x_2, …, X_n = x_n) = p(X_1 = x_1) ∏_{i=2}^{n} p(X_i = x_i | X_1 = x_1, …, X_{i−1} = x_{i−1})

Markov assumption:

p(X_i = x_i | X_1 = x_1, …, X_{i−1} = x_{i−1}) = p(X_i = x_i | X_{i−1} = x_{i−1})

10

Page 13:

Second-order Markov process

Relax the independence assumption:

p(X_1 = x_1, X_2 = x_2, …, X_n = x_n) = p(X_1 = x_1) × p(X_2 = x_2 | X_1 = x_1) × ∏_{i=3}^{n} p(X_i = x_i | X_{i−2} = x_{i−2}, X_{i−1} = x_{i−1})

Simplify notation: x_0 = *, x_{−1} = *

11

Page 14:

Second-order Markov process

Relax the independence assumption:

p(X_1 = x_1, X_2 = x_2, …, X_n = x_n) = p(X_1 = x_1) × p(X_2 = x_2 | X_1 = x_1) × ∏_{i=3}^{n} p(X_i = x_i | X_{i−2} = x_{i−2}, X_{i−1} = x_{i−1})

Simplify notation: x_0 = *, x_{−1} = *

Is this reasonable?

Page 15:

Detail: variable length

• Probability distribution over sequences of any length

• Always define X_n = STOP, and obtain a probability distribution over all sequences

• Intuition: at every step you have probability α_h of stopping (conditioned on the history) and (1 − α_h) of continuing

p(X_1 = x_1, X_2 = x_2, …, X_n = x_n) = ∏_{i=1}^{n} p(X_i = x_i | X_{i−2} = x_{i−2}, X_{i−1} = x_{i−1})

12

Page 16:

Trigram language model

• A trigram language model contains

• A vocabulary V

• Non-negative parameters q(w | u, v) for every trigram, such that

w ∈ V ∪ {STOP},  u, v ∈ V ∪ {*}

• The probability of a sentence x_1, …, x_n, where x_n = STOP, is

p(x_1, …, x_n) = ∏_{i=1}^{n} q(x_i | x_{i−2}, x_{i−1})

13

Page 17:

Example

p(the dog barks STOP) = q(the | *, *) × q(dog | *, the) × q(barks | the, dog) × q(STOP | dog, barks)
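As a concrete sketch (the q values below are invented for illustration, not taken from the slides), scoring a sentence under a trigram model looks like this in Python:

```python
# Minimal sketch: score a sentence under a trigram model given a table of
# q(w | u, v) values. The probabilities below are made up for illustration.
q = {
    ("*", "*", "the"): 0.1,
    ("*", "the", "dog"): 0.05,
    ("the", "dog", "barks"): 0.02,
    ("dog", "barks", "STOP"): 0.3,
}

def sentence_prob(words, q):
    """p(x_1 ... x_n) = prod_i q(x_i | x_{i-2}, x_{i-1}), with x_0 = x_{-1} = *."""
    padded = ["*", "*"] + words          # x_{-1} = x_0 = *
    p = 1.0
    for i in range(2, len(padded)):
        u, v, w = padded[i - 2], padded[i - 1], padded[i]
        p *= q.get((u, v, w), 0.0)       # unseen trigrams get probability 0
    return p

print(sentence_prob(["the", "dog", "barks", "STOP"], q))  # 0.1 * 0.05 * 0.02 * 0.3
```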

Page 18:

Limitation

• Markovian assumption is false

He is from France, so it makes sense that his first language is…

• We would want to model longer dependencies

15

Page 19:

Sparseness

• Maximum likelihood for estimating q

• Let c(w_1, …, w_n) be the number of times that n-gram appears in a corpus

q(w_i | w_{i−2}, w_{i−1}) = c(w_{i−2}, w_{i−1}, w_i) / c(w_{i−2}, w_{i−1})

Page 20:

Sparseness

• Maximum likelihood for estimating q

• Let c(w_1, …, w_n) be the number of times that n-gram appears in a corpus

• If the vocabulary has 20,000 words, the number of parameters is 8 × 10^{12}!

• Most sentences will have zero or undefined probabilities

q(w_i | w_{i−2}, w_{i−1}) = c(w_{i−2}, w_{i−1}, w_i) / c(w_{i−2}, w_{i−1})
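A small sketch of this maximum-likelihood estimator, with a made-up toy corpus; it simply counts trigrams and their bigram histories:

```python
from collections import Counter

# Toy corpus of pre-tokenized sentences; each is padded with *, * and STOP.
corpus = [["the", "dog", "barks"], ["the", "dog", "sleeps"], ["the", "cat", "barks"]]

trigram_counts, bigram_counts = Counter(), Counter()
for sent in corpus:
    padded = ["*", "*"] + sent + ["STOP"]
    for i in range(2, len(padded)):
        trigram_counts[tuple(padded[i - 2:i + 1])] += 1
        bigram_counts[tuple(padded[i - 2:i])] += 1

def q_mle(w, u, v):
    """q(w | u, v) = c(u, v, w) / c(u, v); undefined (None) if c(u, v) = 0."""
    if bigram_counts[(u, v)] == 0:
        return None
    return trigram_counts[(u, v, w)] / bigram_counts[(u, v)]

print(q_mle("dog", "*", "the"))      # 2/3
print(q_mle("barks", "the", "dog"))  # 1/2
```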

Page 21:

Berkeley restaurant project sentences

• can you tell me about any good cantonese restaurants close by

• mid priced thai food is what i’m looking for

• tell me about chez panisse

• can you give me a listing of the kinds of food that are available

• i’m looking for a good place to eat breakfast

• when is caffe venezia open during the day

17

Page 22:

Bigram counts

18

Page 23:

Bigram probabilities

19

Page 24:

What did we learn

• p(English | want) < p(Chinese | want) - in this corpus, people are simply more interested in Chinese things

• p(to | want) = 0.66 - a fact about how English behaves: "want" is usually followed by "to"

• p(eat | to) = 0.28 - another fact about how English behaves

20

Page 25:

Evaluation: perplexity

• Test data: S = (s_1, s_2, …, s_M)

• Parameters are not estimated from S

• A good language model has high p(S) and low perplexity

• Perplexity is the normalized inverse log-probability of S

• M is the number of words in the corpus

p(S) = ∏_{i=1}^{M} p(s_i | s_1, …, s_{i−1})

log_2 p(S) = Σ_{i=1}^{M} log_2 p(s_i | s_1, …, s_{i−1})

perplexity = 2^{−l},   l = (1/M) Σ_{i=1}^{M} log_2 p(s_i | s_1, …, s_{i−1})

Page 26:

Evaluation: perplexity

• Say we have a vocabulary V, N = |V| + 1, and a trigram model with a uniform distribution

• q(w | u, v) = 1/N.

• Perplexity is the “effective” vocabulary size.

l = (1/M) Σ_{i=1}^{M} log_2 p(s_i | s_1, …, s_{i−1}) = (1/M) log_2 (1/N)^M = log_2 (1/N)

perplexity = 2^{−l} = 2^{log_2 N} = N

Page 27:

Typical values of perplexity

• When |V| = 50,000

• trigram model perplexity: 74 (<< 50000)

• bigram model: 137

• unigram model: 955

23

Page 28:

Evaluation

• Extrinsic evaluations: MT, speech, spelling correction, …

24

Page 29:

History

• Shannon (1950) estimated the perplexity score that humans get for printed English (we are good!)

• Test your perplexity

25

Page 30:

Estimating parameters

• Recall that the number of parameters for a trigram model with |V| = 20,000 is 8 × 10^{12}, leading to zeros and undefined probabilities

q(w_i | w_{i−2}, w_{i−1}) = c(w_{i−2}, w_{i−1}, w_i) / c(w_{i−2}, w_{i−1})

Page 31:

Over/under-fitting

• Given a corpus of length M:

• Trigram model: q(w_i | w_{i−2}, w_{i−1}) = c(w_{i−2}, w_{i−1}, w_i) / c(w_{i−2}, w_{i−1})

• Bigram model: q(w_i | w_{i−1}) = c(w_{i−1}, w_i) / c(w_{i−1})

• Unigram model: q(w_i) = c(w_i) / M

Page 32:

28

Page 33:

Radford et al, 2019

Page 34:

Linear interpolation

q_LI(w_i | w_{i−2}, w_{i−1}) = λ_1 × q(w_i | w_{i−2}, w_{i−1}) + λ_2 × q(w_i | w_{i−1}) + λ_3 × q(w_i)

λ_i ≥ 0,   λ_1 + λ_2 + λ_3 = 1

• Combine the three models to get all benefits

30
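A minimal sketch of the interpolated estimate; q_trigram, q_bigram and q_unigram stand for MLE estimators like the ones defined earlier and are assumed rather than defined here:

```python
# Sketch: linearly interpolate trigram, bigram and unigram estimates.
# q_trigram, q_bigram, q_unigram are assumed to return MLE estimates (0.0 if undefined).
def q_li(w, u, v, q_trigram, q_bigram, q_unigram, lambdas=(0.5, 0.3, 0.2)):
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9 and min(lambdas) >= 0
    return (l1 * q_trigram(w, u, v)
            + l2 * q_bigram(w, v)
            + l3 * q_unigram(w))
```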

Page 35:

Linear interpolation

• Need to verify that the parameters define a probability distribution

Σ_{w ∈ V} q_LI(w | u, v)
= Σ_{w ∈ V} [ λ_1 × q(w | u, v) + λ_2 × q(w | v) + λ_3 × q(w) ]
= λ_1 Σ_{w ∈ V} q(w | u, v) + λ_2 Σ_{w ∈ V} q(w | v) + λ_3 Σ_{w ∈ V} q(w)
= λ_1 + λ_2 + λ_3 = 1

31

Page 36:

Estimating coefficients

• Use a validation/development set (intro to ML!)

• Partition the training data into training (90%?) and dev (10%?) sets and optimize the coefficients to minimize perplexity (the measure we care about!) on the development data

32
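One simple way to set the coefficients is a grid search that keeps whichever (λ1, λ2, λ3) gives the lowest development-set perplexity. The sketch below assumes a dev_perplexity function that evaluates the interpolated model for a given λ triple:

```python
import itertools

def choose_lambdas(dev_perplexity, step=0.1):
    """Grid-search lambda_1, lambda_2 >= 0 with lambda_3 = 1 - lambda_1 - lambda_2,
    and return the combination with the lowest dev-set perplexity."""
    best, best_ppl = None, float("inf")
    grid = [round(i * step, 10) for i in range(int(1 / step) + 1)]
    for l1, l2 in itertools.product(grid, grid):
        l3 = 1.0 - l1 - l2
        if l3 < -1e-9:
            continue                      # not a valid probability mixture
        lambdas = (l1, l2, max(l3, 0.0))
        ppl = dev_perplexity(lambdas)     # assumed: perplexity of q_LI on dev data
        if ppl < best_ppl:
            best, best_ppl = lambdas, ppl
    return best, best_ppl
```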

Page 37:

Linear interpolation

q_LI(w_i | w_{i−2}, w_{i−1}) = λ_1 × q(w_i | w_{i−2}, w_{i−1}) + λ_2 × q(w_i | w_{i−1}) + λ_3 × q(w_i)

λ_i ≥ 0,   λ_1 + λ_2 + λ_3 = 1

Page 38:

Linear interpolation

Π(w_{i−2}, w_{i−1}) =
  1  if c(w_{i−2}, w_{i−1}) = 0
  2  if 1 ≤ c(w_{i−2}, w_{i−1}) ≤ 2
  3  if 3 ≤ c(w_{i−2}, w_{i−1}) ≤ 10
  4  otherwise

q_LI(w_i | w_{i−2}, w_{i−1}) = λ_1^{Π(w_{i−2}, w_{i−1})} × q(w_i | w_{i−2}, w_{i−1})
                             + λ_2^{Π(w_{i−2}, w_{i−1})} × q(w_i | w_{i−1})
                             + λ_3^{Π(w_{i−2}, w_{i−1})} × q(w_i)

λ_i^{Π(w_{i−2}, w_{i−1})} ≥ 0,   λ_1^{Π(w_{i−2}, w_{i−1})} + λ_2^{Π(w_{i−2}, w_{i−1})} + λ_3^{Π(w_{i−2}, w_{i−1})} = 1

Page 39:

Discounting methods

x                c(x)   q(w_i | w_{i−1})
the              48
the, dog         15     15/48
the, woman       11     11/48
the, man         10     10/48
the, park         5      5/48
the, job          2      2/48
the, telescope    1      1/48
the, manual       1      1/48
the, afternoon    1      1/48
the, country      1      1/48
the, street       1      1/48

Low-count bigrams have high estimates

Page 40:

Discounting methods

x                c(x)   c*(x)   q(w_i | w_{i−1})
the              48
the, dog         15     14.5    14.5/48
the, woman       11     10.5    10.5/48
the, man         10      9.5     9.5/48
the, park         5      4.5     4.5/48
the, job          2      1.5     1.5/48
the, telescope    1      0.5     0.5/48
the, manual       1      0.5     0.5/48
the, afternoon    1      0.5     0.5/48
the, country      1      0.5     0.5/48
the, street       1      0.5     0.5/48

c*(x) = c(x) − 0.5

36

Page 41:

Katz back-off

In a bigram model:

A(w_{i−1}) = {w : c(w_{i−1}, w) > 0}
B(w_{i−1}) = {w : c(w_{i−1}, w) = 0}

q_BO(w_i | w_{i−1}) =
  c*(w_{i−1}, w_i) / c(w_{i−1})                          if w_i ∈ A(w_{i−1})
  α(w_{i−1}) · q(w_i) / Σ_{w ∈ B(w_{i−1})} q(w)          if w_i ∈ B(w_{i−1})

α(w_{i−1}) = 1 − Σ_{w ∈ A(w_{i−1})} c*(w_{i−1}, w) / c(w_{i−1})

37
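A sketch of the bigram back-off estimate with discounted counts c*(u, w) = c(u, w) − 0.5; counts are plain Python dictionaries and the unigram estimate q(w) is assumed to be given:

```python
def q_bo(w, u, bigram_counts, unigram_counts, q_unigram, discount=0.5):
    """Katz back-off for a bigram model (a sketch)."""
    c_u = unigram_counts[u]                                    # c(w_{i-1})
    A = {x for (u_, x) in bigram_counts if u_ == u}            # words seen after u
    if w in A:                                                 # discounted MLE
        return (bigram_counts[(u, w)] - discount) / c_u
    # Left-over probability mass alpha(u), redistributed over unseen words by q(w)
    alpha = 1.0 - sum((bigram_counts[(u, x)] - discount) / c_u for x in A)
    missing = sum(q_unigram(x) for x in unigram_counts if x not in A)
    return alpha * q_unigram(w) / missing
```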

Page 42:

Katz back-off

In a trigram model:

A(w_{i−2}, w_{i−1}) = {w : c(w_{i−2}, w_{i−1}, w) > 0}
B(w_{i−2}, w_{i−1}) = {w : c(w_{i−2}, w_{i−1}, w) = 0}

q_BO(w_i | w_{i−2}, w_{i−1}) =
  c*(w_{i−2}, w_{i−1}, w_i) / c(w_{i−2}, w_{i−1})                                            if w_i ∈ A(w_{i−2}, w_{i−1})
  α(w_{i−2}, w_{i−1}) · q_BO(w_i | w_{i−1}) / Σ_{w ∈ B(w_{i−2}, w_{i−1})} q_BO(w | w_{i−1})   if w_i ∈ B(w_{i−2}, w_{i−1})

α(w_{i−2}, w_{i−1}) = 1 − Σ_{w ∈ A(w_{i−2}, w_{i−1})} c*(w_{i−2}, w_{i−1}, w) / c(w_{i−2}, w_{i−1})

38

Page 43:

Advanced Smoothing

• Good-Turing

• Kneser-Ney

Page 44:

Advanced Smoothing

• Principles

• Good-Turing: Take probability mass from things you have seen n times and spread this probability mass over things you have seen n-1 times (specifically move mass from things you have seen once to things you have never seen)

• Kneser-Ney: The probability of a unigram is not its frequency in the data, but how frequently it appears after other things (“francisco” vs. “glasses”).

Page 45:

Unknown words

• What if we see a completely new word at test time?

• p(s) = p(S) = 0 → infinite perplexity

• Solution: create a special token <unk>

• Fix vocabulary V and replace any word in the training set not in V with <unk>

• Train

• At test time, use p(<unk>) for words not in the training set
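A sketch of that recipe: fix a vocabulary from the training data (the min_count cutoff here is an illustrative choice) and map everything else to <unk> before estimating the model:

```python
from collections import Counter

def build_vocab(train_sentences, min_count=2):
    """Keep words seen at least min_count times; everything else becomes <unk>."""
    counts = Counter(w for sent in train_sentences for w in sent)
    return {w for w, c in counts.items() if c >= min_count} | {"<unk>"}

def replace_unk(sentence, vocab):
    return [w if w in vocab else "<unk>" for w in sentence]

train = [["the", "dog", "barks"], ["the", "dog", "sleeps"]]
vocab = build_vocab(train)                                # {'the', 'dog', '<unk>'}
print(replace_unk(["the", "platypus", "barks"], vocab))   # ['the', '<unk>', '<unk>']
```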

Page 46:

Summary and re-cap

• Sentence probability: decompose with the chain rule

• Use a Markov independence assumption

• Smooth estimates to do better on rare events

• There have been many attempts to improve LMs using syntax, but it is not easy

• More complex smoothing methods exist for handling rare events (Kneser-Ney, Good-Turing…)

42

Page 47:

Problem

• Our estimator q(w | u, v) is based on a one-hot representation. There is no relation between words.

• p(played | the, actress)

• p(played | the, actor)

• Can we use distributed representations?

Neural networks

43

Page 48:

Neural networks

• Feed-forward neural networks for language modeling

• Recurrent neural networks for language modeling

• Vanishing/exploding gradients

Page 49:

Neural networks: history

• Proposed in the mid-20th century

• Criticism from Minsky (“Perceptrons”)

• Back-propagation appeared in the 1980s

• In the 1990s SVMs appeared and were more successful

• Since the beginning of the 2010s they have shown great success in speech, vision, language, robotics…

45

Page 50:

Motivation 1

Can we learn representations from raw data? 46

Page 51:

Motivation 2

Can we learn non-linear decision boundaries? Actually very similar to motivation 1

47

Page 52:

A single neuron

• A neuron is a computational unit of the form

f_{w,b}(x) = f(w^⊤ x + b)

• x: input vector

• w: weights vector

• b: bias

• f: activation function

48 intro to ml

Page 53:

A single neuron

• If f is the sigmoid, a neuron is logistic regression

• Let x, y be a binary classification training example

p(y = 1 | x) = 1 / (1 + e^{−w^⊤ x − b}) = σ(w^⊤ x + b)

Provides a linear decision boundary

49 intro to ml

Page 54:

Single layer network

• Perform logistic regression in parallel multiple times (with different parameters)

y = σ(w^⊤ x + b) = σ(w^⊤ x)

x, w ∈ R^{d+1},  x_{d+1} = 1,  w_{d+1} = b

Page 55:

Single layer network

• L1: input layer
• L2: hidden layer
• L3: output layer
• The output layer provides the prediction
• The hidden layer is the learned representation

51

Page 56:

Multi-layer network

Repeat:

52

Page 57:

Matrix notation

a_1 = f(W_{11} x_1 + W_{12} x_2 + W_{13} x_3 + b_1)
a_2 = f(W_{21} x_1 + W_{22} x_2 + W_{23} x_3 + b_2)
a_3 = f(W_{31} x_1 + W_{32} x_2 + W_{33} x_3 + b_3)

z = Wx + b,  a = f(z)

x ∈ R^4,  z ∈ R^3,  W ∈ R^{3×4}

53
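A numpy sketch of one layer in this matrix notation, with the slide's dimensions (x ∈ R^4, W ∈ R^{3×4}) and the sigmoid as the activation f:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# One layer: z = Wx + b, a = f(z), here with f = sigmoid.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))        # x in R^4
W = rng.normal(size=(3, 4))        # W in R^{3x4}
b = rng.normal(size=(3, 1))

z = W @ x + b                      # z in R^3
a = sigmoid(z)                     # layer activations
print(a.shape)                     # (3, 1)
```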

Page 58:

Language modeling with NNs

• Keep the Markov assumption

• Learn a probability q(u | v, w) with distributed representations

54

Page 59:

Language modeling with NNs

Bengio et al., 2003

[Figure: the context "the dog" is mapped to embeddings e(the), e(dog), then to a hidden layer h(the, dog) and a softmax output f(h(the, dog)), giving p(laughed | the, dog).]

e(w) = W_e · w,  W_e ∈ R^{d×|V|},  w ∈ R^{|V|×1}

h(w_{i−2}, w_{i−1}) = σ(W_h [e(w_{i−2}); e(w_{i−1})]),  W_h ∈ R^{m×2d}

f(z) = softmax(W_o · z),  W_o ∈ R^{|V|×m},  z ∈ R^{m×1}

p(w_i | w_{i−2}, w_{i−1}) = f(h(w_{i−2}, w_{i−1}))_i
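A numpy sketch of the forward pass of this model; the vocabulary size, dimensions and weights are random placeholders rather than trained values, and σ is taken to be tanh here:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, m = 10, 8, 16                      # toy sizes: |V|, embedding dim, hidden dim
W_e = rng.normal(size=(d, V)) * 0.1      # embedding matrix
W_h = rng.normal(size=(m, 2 * d)) * 0.1  # hidden layer
W_o = rng.normal(size=(V, m)) * 0.1      # output layer

def one_hot(i):
    v = np.zeros((V, 1)); v[i] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def p_next(i_prev2, i_prev1):
    """p(. | w_{i-2}, w_{i-1}) = softmax(W_o @ sigma(W_h @ [e(w_{i-2}); e(w_{i-1})]))."""
    e2, e1 = W_e @ one_hot(i_prev2), W_e @ one_hot(i_prev1)
    h = np.tanh(W_h @ np.vstack([e2, e1]))    # sigma taken as tanh in this sketch
    return softmax(W_o @ h)                   # shape (V, 1), sums to 1

print(p_next(3, 7).sum())                     # ~1.0
```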

Page 60:

Loss function

• Minimize the negative log-likelihood

• When we learned word vectors, we were basically training a language model

L(θ) = − Σ_{i=1}^{T} log p_θ(w_i | w_{i−2}, w_{i−1})

Page 61:

Advantages

• If we see in the training set

The cat is walking in the bedroom

• We can hope to learn

A dog was running through a room

• We can use n-grams with n > 3 and pay only a linear price

• Compared to what? 57

Page 62:

Parameter estimation

• We train with SGD

• How do we efficiently compute the gradients?

• Backpropagation (Rumelhart, Hinton and Williams, 1986)

• The proof is covered in intro to ML; here we only restate the algorithm and give an example

58

Page 63:

Backpropagation

• Notation:

W_t : weight matrix at the input of layer t
z_t : output vector at layer t
x = z_0 : input vector
y : gold scalar
ŷ = z_L : predicted scalar
l(y, ŷ) : loss function
v_t = W_t · z_{t−1} : pre-activations
δ_t = ∂l(y, ŷ)/∂z_t : gradient vector

Page 64:

Backpropagation

• Run the network forward to obtain all values v_t, z_t

• Base: δ_L = l'(y, z_L)

• Recursion: δ_t = W_{t+1}^⊤ (σ'(v_{t+1}) ⊙ δ_{t+1}),  with σ'(v_{t+1}), δ_{t+1} ∈ R^{d_{t+1}×1}, W_{t+1} ∈ R^{d_{t+1}×d_t}

• Gradients: ∂l/∂W_t = (δ_t ⊙ σ'(v_t)) z_{t−1}^⊤

Page 65:

Bigram LM example

• Forward pass:

z_0 ∈ R^{|V|×1} : one-hot input vector
z_1 = W_1 · z_0,  W_1 ∈ R^{d_1×|V|},  z_1 ∈ R^{d_1×1}
z_2 = σ(W_2 · z_1),  W_2 ∈ R^{d_2×d_1},  z_2 ∈ R^{d_2×1}
z_3 = softmax(W_3 · z_2),  W_3 ∈ R^{|V|×d_2},  z_3 ∈ R^{|V|×1}
l(y, z_3) = − Σ_i y^{(i)} log z_3^{(i)}

Page 66:

Bigram LM example

• Backward pass:

σ'(v_3) ⊙ δ_3 = (z_3 − y)
δ_2 = W_3^⊤ (σ'(v_3) ⊙ δ_3) = W_3^⊤ (z_3 − y)
δ_1 = W_2^⊤ (σ'(v_2) ⊙ δ_2) = W_2^⊤ (z_2 ⊙ (1 − z_2) ⊙ δ_2)

∂l/∂W_3 = (δ_3 ⊙ σ'(v_3)) z_2^⊤ = (z_3 − y) z_2^⊤
∂l/∂W_2 = (δ_2 ⊙ σ'(v_2)) z_1^⊤ = (δ_2 ⊙ z_2 ⊙ (1 − z_2)) z_1^⊤
∂l/∂W_1 = (δ_1 ⊙ σ'(v_1)) z_0^⊤ = δ_1 z_0^⊤
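The same forward and backward pass written out in numpy for one training pair; sizes and data are toy placeholders, and the gradient expressions mirror the equations above:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d1, d2 = 10, 8, 6
W1, W2, W3 = (rng.normal(size=s) * 0.1 for s in [(d1, V), (d2, d1), (V, d2)])

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
softmax = lambda v: np.exp(v - v.max()) / np.exp(v - v.max()).sum()

# Forward pass for one (previous word, next word) pair
z0 = np.zeros((V, 1)); z0[2] = 1.0        # one-hot previous word
y = np.zeros((V, 1)); y[5] = 1.0          # one-hot gold next word
z1 = W1 @ z0
z2 = sigmoid(W2 @ z1)
z3 = softmax(W3 @ z2)
loss = -(y * np.log(z3)).sum()            # cross-entropy

# Backward pass (the equations from the slide)
delta3_term = z3 - y                      # sigma'(v3) (.) delta3
dW3 = delta3_term @ z2.T
delta2 = W3.T @ delta3_term
dW2 = (delta2 * z2 * (1 - z2)) @ z1.T
delta1 = W2.T @ (delta2 * z2 * (1 - z2))
dW1 = delta1 @ z0.T
print(loss, dW1.shape, dW2.shape, dW3.shape)
```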

Page 67:

Summary

• Neural nets can improve language models:

• better scalability for larger N

• use of word similarity

• complex decision boundaries

• Training through backpropagation

63

Page 68:

Summary

• Neural nets can improve language models:

• better scalability for larger N

• use of word similarity

• complex decision boundaries

• Training through backpropagation

63

But we still have a Markov assumption

He is from France, so it makes sense that his first language is…

Page 69:

Recurrent neural networks

[Figure: an RNN unrolled over the input "the dog …", predicting the next word ("laughed") at each step.]

Input: w_1, …, w_{t−1}, w_t, w_{t+1}, …, w_T,  w_i ∈ R^{|V|}

Model:
x_t = W^{(e)} · w_t,  W^{(e)} ∈ R^{d×|V|}
h_t = σ(W^{(hh)} · h_{t−1} + W^{(hx)} · x_t),  W^{(hh)} ∈ R^{D_h×D_h},  W^{(hx)} ∈ R^{D_h×d}
ŷ_t = softmax(W^{(s)} · h_t),  W^{(s)} ∈ R^{|V|×D_h}
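A numpy sketch that unrolls this RNN LM over a short sequence of word indices; all weights are random placeholders and σ is taken to be tanh:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, Dh = 10, 8, 12
W_e = rng.normal(size=(d, V)) * 0.1
W_hh = rng.normal(size=(Dh, Dh)) * 0.1
W_hx = rng.normal(size=(Dh, d)) * 0.1
W_s = rng.normal(size=(V, Dh)) * 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_lm(word_indices):
    """Return y_hat_t = softmax(W_s h_t) for every step t."""
    h = np.zeros((Dh, 1))
    outputs = []
    for i in word_indices:
        x = W_e[:, [i]]                       # embedding of w_t
        h = np.tanh(W_hh @ h + W_hx @ x)      # sigma taken as tanh in this sketch
        outputs.append(softmax(W_s @ h))      # distribution over the next word
    return outputs

preds = rnn_lm([1, 4, 2])
print(len(preds), preds[0].shape)             # 3 (10, 1)
```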

Page 70:

Recurrent neural networks

Page 71:

Recurrent neural networks

• Can exploit long-range dependencies

• Each layer has the same weights (weight sharing/tying)

• What is the loss function?

Page 72:

Recurrent neural networks

• Can exploit long-range dependencies

• Each layer has the same weights (weight sharing/tying)

• What is the loss function?

J(θ) = Σ_{t=1}^{T} CE(y_t, ŷ_t)

Page 73:

Recurrent neural networks

• Components:

• Model? RNN (seen above)

• Loss? Sum of cross-entropy over time

• Optimization? SGD

• Gradient computation? Back-propagation through time

Page 74:

Class 3 Recap

• N-gram based language models

• Linear interpolation

• Katz back-off

• Neural language models

• limited horizon - feed-forward neural network

• unlimited - Recurrent neural network

Page 75:

Linear interpolation

Π(w_{i−2}, w_{i−1}) =
  1  if c(w_{i−2}, w_{i−1}) = 0
  2  if 1 ≤ c(w_{i−2}, w_{i−1}) ≤ 2
  3  if 3 ≤ c(w_{i−2}, w_{i−1}) ≤ 10
  4  otherwise

q_LI(w_i | w_{i−2}, w_{i−1}) = λ_1^{Π(w_{i−2}, w_{i−1})} × q(w_i | w_{i−2}, w_{i−1})
                             + λ_2^{Π(w_{i−2}, w_{i−1})} × q(w_i | w_{i−1})
                             + λ_3^{Π(w_{i−2}, w_{i−1})} × q(w_i)

λ_i^{Π(w_{i−2}, w_{i−1})} ≥ 0,   λ_1^{Π(w_{i−2}, w_{i−1})} + λ_2^{Π(w_{i−2}, w_{i−1})} + λ_3^{Π(w_{i−2}, w_{i−1})} = 1

Page 76:

Katz back-off

In a trigram model:

A(w_{i−2}, w_{i−1}) = {w : c(w_{i−2}, w_{i−1}, w) > 0}
B(w_{i−2}, w_{i−1}) = {w : c(w_{i−2}, w_{i−1}, w) = 0}

q_BO(w_i | w_{i−2}, w_{i−1}) =
  c*(w_{i−2}, w_{i−1}, w_i) / c(w_{i−2}, w_{i−1})                                            if w_i ∈ A(w_{i−2}, w_{i−1})
  α(w_{i−2}, w_{i−1}) · q_BO(w_i | w_{i−1}) / Σ_{w ∈ B(w_{i−2}, w_{i−1})} q_BO(w | w_{i−1})   if w_i ∈ B(w_{i−2}, w_{i−1})

α(w_{i−2}, w_{i−1}) = 1 − Σ_{w ∈ A(w_{i−2}, w_{i−1})} c*(w_{i−2}, w_{i−1}, w) / c(w_{i−2}, w_{i−1})

70

Page 77:

Language modeling with NNs

Bengio et al., 2003

[Figure: the context "the dog" is mapped to embeddings e(the), e(dog), then to a hidden layer h(the, dog) and a softmax output f(h(the, dog)), giving p(laughed | the, dog).]

e(w) = W_e · w,  W_e ∈ R^{d×|V|},  w ∈ R^{|V|×1}

h(w_{i−2}, w_{i−1}) = σ(W_h [e(w_{i−2}); e(w_{i−1})]),  W_h ∈ R^{m×2d}

f(z) = softmax(W_o · z),  W_o ∈ R^{|V|×m},  z ∈ R^{m×1}

p(w_i | w_{i−2}, w_{i−1}) = f(h(w_{i−2}, w_{i−1}))_i

Page 78:

Recurrent neural networks

[Figure: an RNN unrolled over the input "the dog …", predicting the next word ("laughed") at each step.]

Input: w_1, …, w_{t−1}, w_t, w_{t+1}, …, w_T,  w_i ∈ R^{|V|}

Model:
x_t = W^{(e)} · w_t,  W^{(e)} ∈ R^{d×|V|}
h_t = σ(W^{(hh)} · h_{t−1} + W^{(hx)} · x_t),  W^{(hh)} ∈ R^{D_h×D_h},  W^{(hx)} ∈ R^{D_h×d}
ŷ_t = softmax(W^{(s)} · h_t),  W^{(s)} ∈ R^{|V|×D_h}

Page 79:

Training

• Objective:

J(θ) = Σ_{t=1}^{T} CE(y_t, ŷ_t)

• Optimization: SGD

• Intrinsic evaluation: perplexity

Page 80:

Comparison

Model      Similarity   Unlimited horizon
n-gram     No           No
FFNN LM    Yes          No
RNN LM     Yes          Yes

Page 81:

Class 4

Page 82:

Training RNNs

• Capturing long-range dependencies with RNNs is difficult

• Vanishing/exploding gradients

• Small changes in hidden layer value in step k cause huge/minuscule changes to values in hidden layer t for t >> k

Page 83:

Explanation

Consider a simple linear RNN with no input:

h_t = W · h_{t−1}
h_t = W^t · h_0
h_t = (QΛQ^{−1})^t · h_0
h_t = (QΛ^t Q^{−1}) · h_0

where W = QΛQ^{−1} is an eigendecomposition

• Some eigenvalues will explode and some will shrink to zero
• The input is stretched in the direction of the eigenvector with the largest eigenvalue

Page 84:

Explanation

h_t = W^{(hh)} σ(h_{t−1}) + W^{(hx)} x_t + b

L = Σ_{t=1}^{T} L_t = Σ_{t=1}^{T} L(h_t)

Page 85:

Explanation

∂L/∂θ = Σ_{t=1}^{T} ∂L_t/∂θ

∂L_t/∂θ = Σ_{k=1}^{t} (∂L_t/∂h_t) (∂h_t/∂h_k) (∂⁺h_k/∂θ)

∂h_t/∂h_k = ∏_{i=k+1}^{t} ∂h_i/∂h_{i−1} = ∏_{i=k+1}^{t} W^{(hh)} diag(σ'(h_{i−1}))

Page 86:

Explanation

Assume ‖σ'(·)‖ ≤ γ and λ_1 < 1/γ, where λ_1 is the absolute value of the largest eigenvalue of W^{(hh)}. Then

∀k: ‖∂h_k/∂h_{k−1}‖ ≤ ‖W^{(hh)}‖ · ‖diag(σ'(h_k))‖ < (1/γ) · γ = 1

Let η < 1 be a constant such that ∀k: ‖∂h_k/∂h_{k−1}‖ ≤ η < 1. Then

‖(∂L_t/∂h_t) ∏_{i=k+1}^{t} ∂h_i/∂h_{i−1}‖ ≤ η^{t−k} ‖∂L_t/∂h_t‖,

so long-term influence vanishes to zero.

Page 87:

Solutions

• Exploding gradients: gradient clipping

• Re-normalize the gradient so its norm is at most C (the result is no longer the true gradient in magnitude, but it is in direction)

• Exploding gradients are easy to detect

• Vanishing gradients: the problem is with the model!

• Change it (LSTMs, GRUs)
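A sketch of clipping by norm: if the gradient's norm exceeds a threshold C, rescale it so that the direction is preserved but the magnitude is capped at C:

```python
import numpy as np

def clip_gradient(grad, C=5.0):
    """Rescale grad so that ||grad|| <= C (direction preserved)."""
    norm = np.linalg.norm(grad)
    if norm > C:
        return grad * (C / norm)
    return grad

g = np.array([30.0, 40.0])          # norm 50
print(clip_gradient(g, C=5.0))      # [3. 4.] -> norm 5
```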

Page 88:

Illustration

Pascanu et al., 2013

Page 89:

LSTMs and GRUs

Page 90:

LSTMs and GRUs

• Bottom line: use vector addition rather than matrix-vector multiplication. This allows gradients to propagate better to the past.

Page 91:

Gated Recurrent Unit (GRU)

• Main insight: add learnable gates to the recurrent unit that control the flow of information from the past to the present

• Vanilla RNN:

h_t = f(W^{(hh)} h_{t−1} + W^{(hx)} x_t)

• Update and reset gates:

z_t = σ(W^{(z)} h_{t−1} + U^{(z)} x_t)
r_t = σ(W^{(r)} h_{t−1} + U^{(r)} x_t)

Cho et al., 2014

Page 92:

Gated Recurrent Unit (GRU)

• Use the gates to control information flow

• If z = 1, we simply copy the past and ignore the present (note that the gradient will be 1)

• If z = 0, then we have an RNN-like update, but we are also free to reset some of the past units; if r = 0, then we have no memory of the past

• The + in the last equation is crucial

z_t = σ(W^{(z)} x_t + U^{(z)} h_{t−1})
r_t = σ(W^{(r)} x_t + U^{(r)} h_{t−1})
h̃_t = tanh(W x_t + r_t ⊙ U h_{t−1})
h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t
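A numpy sketch of a single GRU step implementing the four equations above; weights are random placeholders and sizes are toy:

```python
import numpy as np

rng = np.random.default_rng(0)
d, Dh = 6, 8
Wz, Wr, W = (rng.normal(size=(Dh, d)) * 0.1 for _ in range(3))
Uz, Ur, U = (rng.normal(size=(Dh, Dh)) * 0.1 for _ in range(3))
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev):
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate
    h_tilde = np.tanh(W @ x_t + r_t * (U @ h_prev))   # candidate state
    return z_t * h_prev + (1 - z_t) * h_tilde         # new hidden state

h = gru_step(rng.normal(size=(d, 1)), np.zeros((Dh, 1)))
print(h.shape)  # (8, 1)
```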

Page 93:

Illustration

[Figure: GRU cell taking h_{t−1} and x_t, computing gates r_t and z_t, candidate h̃_t, and output h_t.]

Page 94:

Long short-term memory (LSTM)

• z has been split into i and f

• There is no r

• There is a new gate o that distinguishes between the memory and the output.

• c is like h in GRUs

• h is the output

Hochreiter and Schmidhuber, 1997

i_t = σ(W^{(i)} x_t + U^{(i)} h_{t−1})
f_t = σ(W^{(f)} x_t + U^{(f)} h_{t−1})
o_t = σ(W^{(o)} x_t + U^{(o)} h_{t−1})
c̃_t = tanh(W^{(c)} x_t + U^{(c)} h_{t−1})
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)
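The corresponding numpy sketch of one LSTM step, following the six equations above (again with placeholder weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d, Dh = 6, 8
Wi, Wf, Wo, Wc = (rng.normal(size=(Dh, d)) * 0.1 for _ in range(4))
Ui, Uf, Uo, Uc = (rng.normal(size=(Dh, Dh)) * 0.1 for _ in range(4))
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, c_prev):
    i_t = sigmoid(Wi @ x_t + Ui @ h_prev)      # input gate
    f_t = sigmoid(Wf @ x_t + Uf @ h_prev)      # forget gate
    o_t = sigmoid(Wo @ x_t + Uo @ h_prev)      # output gate
    c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev)  # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde         # new memory cell
    h_t = o_t * np.tanh(c_t)                   # new hidden state / output
    return h_t, c_t

h, c = lstm_step(rng.normal(size=(d, 1)), np.zeros((Dh, 1)), np.zeros((Dh, 1)))
print(h.shape, c.shape)  # (8, 1) (8, 1)
```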

Page 95:

Illustration

[Figure: LSTM cell taking h_{t−1}, x_t and c_{t−1}, computing gates i_t, f_t, o_t, candidate c̃_t, new memory c_t, and output h_t.]

Page 96:

Illustration

Chris Olah’s blog

Page 97:

More GRU intuition from Stanford

• Go over sequence of slides from Chris Manning

Page 98:

Results

Jozefowicz et al., 2016

Page 99:

Summary

• Language modeling is a fundamental NLP task used in machine translation, spelling correction, speech recognition, etc.

• Traditional models use n-gram counts and smoothing

• Feed-forward models take word similarity into account to generalize better

• Recurrent models can potentially learn to exploit long-range interactions

• Neural models dramatically reduced perplexity

• Recurrent networks are now used in many other NLP tasks (bidirectional RNNs, deep RNNs)

93