Page 1:

Lecture 4: Language Model Evaluation and Advanced Methods

Kai-Wei Chang, CS @ University of Virginia
[email protected]
Course webpage: http://kwchang.net/teaching/NLP16

Page 2:

This lecture

- Kneser-Ney smoothing
- Discriminative Language Models
- Neural Language Models
- Evaluation: Cross-entropy and perplexity

Page 3:

Recap: Smoothing

- Add-one smoothing
- Add-λ smoothing
  - parameters tuned by cross-validation
- Witten-Bell smoothing
  - T: # word types, N: # tokens
  - T/(N+T): total prob. mass for unseen words
  - N/(N+T): total prob. mass for observed tokens
- Good-Turing
  - Reallocate the probability mass of n-grams that occur r+1 times to n-grams that occur r times.

Page 4:

Recap: Back-off and interpolation

- Idea: even if we’ve never seen “red glasses”, we know it is more likely to occur than “red abacus”

- Interpolation:
  p_average(z | xy) = µ3 p(z | xy) + µ2 p(z | y) + µ1 p(z), where µ3 + µ2 + µ1 = 1 and all are ≥ 0
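As a concrete illustration of this recipe, here is a minimal sketch (not from the slides) of interpolated trigram estimation; the toy corpus and the fixed weights are assumptions for illustration only.

```python
# Minimal sketch of linear interpolation for a trigram LM (toy data, illustration only).
from collections import Counter

corpus = "the red glasses are red and the red shoes are blue".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
N = len(corpus)

def p_interp(z, x, y, mus=(0.6, 0.3, 0.1)):
    """p_average(z | x y) = mu3*p(z|xy) + mu2*p(z|y) + mu1*p(z), weights sum to 1."""
    mu3, mu2, mu1 = mus
    p_tri = trigrams[(x, y, z)] / bigrams[(x, y)] if bigrams[(x, y)] else 0.0
    p_bi = bigrams[(y, z)] / unigrams[y] if unigrams[y] else 0.0
    p_uni = unigrams[z] / N
    return mu3 * p_tri + mu2 * p_bi + mu1 * p_uni

# Nonzero even though the trigram "glasses are blue" never occurs in the corpus.
print(p_interp("blue", "glasses", "are"))
```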

Page 5:

Absolute Discounting

- Save ourselves some time and just subtract 0.75 (or some d)!
- But should we really just use the regular unigram P(w)?

$$P_{\text{AbsoluteDiscounting}}(w_i \mid w_{i-1}) = \underbrace{\frac{c(w_{i-1}, w_i) - d}{c(w_{i-1})}}_{\text{discounted bigram}} + \underbrace{\lambda(w_{i-1})}_{\text{interpolation weight}}\ \underbrace{P(w_i)}_{\text{unigram}}$$

Page 6:

Kneser-Ney Smoothing

- Better estimate for probabilities of lower-order unigrams!
- Shannon game: I can’t see without my reading ___________?
  - “Francisco” is more common than “glasses”
  - … but “Francisco” always follows “San”

Page 7:

Kneser-Ney Smoothing

- Instead of P(w): “How likely is w?”
- P_continuation(w): “How likely is w to appear as a novel continuation?”
  - For each word, count the number of bigram types it completes
  - Every bigram type was a novel continuation the first time it was seen

$$P_{\text{CONTINUATION}}(w) \propto \left|\{w_{i-1} : c(w_{i-1}, w) > 0\}\right|$$

Page 8:

Kneser-Ney Smoothing

- How many times does w appear as a novel continuation?
- Normalized by the total number of word bigram types:

$$P_{\text{CONTINUATION}}(w) = \frac{\left|\{w_{i-1} : c(w_{i-1}, w) > 0\}\right|}{\left|\{(w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0\}\right|}$$

Page 9:

Kneser-Ney Smoothing

- Alternative metaphor: the number of word types seen to precede w, $\left|\{w_{i-1} : c(w_{i-1}, w) > 0\}\right|$,
- normalized by the number of word types preceding all words:

$$P_{\text{CONTINUATION}}(w) = \frac{\left|\{w_{i-1} : c(w_{i-1}, w) > 0\}\right|}{\sum_{w'} \left|\{w'_{i-1} : c(w'_{i-1}, w') > 0\}\right|}$$

- A frequent word (Francisco) occurring in only one context (San) will have a low continuation probability

Page 10:

Kneser-Ney Smoothing

$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max\big(c(w_{i-1}, w_i) - d,\ 0\big)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{\text{CONTINUATION}}(w_i)$$

$$\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\, \left|\{w : c(w_{i-1}, w) > 0\}\right|$$

- λ is a normalizing constant: the probability mass we’ve discounted
- d / c(w_{i−1}) is the normalized discount
- |{w : c(w_{i−1}, w) > 0}| = the number of word types that can follow w_{i−1} = # of word types we discounted = # of times we applied the normalized discount
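A minimal sketch of the interpolated Kneser-Ney bigram above, reusing the toy-counts idea from the previous sketch; the discount d = 0.75 and the data are illustrative assumptions.

```python
# Sketch: interpolated Kneser-Ney bigram probability (toy counts, illustration only).
from collections import Counter, defaultdict

d = 0.75
bigram_counts = Counter({("san", "francisco"): 50, ("red", "glasses"): 2,
                         ("my", "glasses"): 1, ("red", "shoes"): 3})
history_counts = Counter()     # c(w_{i-1}) as a bigram-history count
followers = defaultdict(set)   # word types that can follow w_{i-1}
contexts = defaultdict(set)    # word types that can precede w
for (prev, w), c in bigram_counts.items():
    history_counts[prev] += c
    followers[prev].add(w)
    contexts[w].add(prev)
total_bigram_types = len(bigram_counts)

def p_continuation(w):
    return len(contexts[w]) / total_bigram_types

def p_kn(w, prev):
    discounted = max(bigram_counts[(prev, w)] - d, 0) / history_counts[prev]
    lam = (d / history_counts[prev]) * len(followers[prev])  # reserved (discounted) mass
    return discounted + lam * p_continuation(w)

print(p_kn("glasses", "red"))    # seen bigram: discounted count + smoothed mass
print(p_kn("francisco", "red"))  # unseen bigram: falls back on the continuation prob.
```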

Page 11:

Kneser-Ney Smoothing: Recursive formulation

$$P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\big(c_{KN}(w_{i-n+1}^{i}) - d,\ 0\big)}{c_{KN}(w_{i-n+1}^{i-1})} + \lambda(w_{i-n+1}^{i-1})\, P_{KN}(w_i \mid w_{i-n+2}^{i-1})$$

$$c_{KN}(\cdot) = \begin{cases} \text{count}(\cdot) & \text{for the highest order} \\ \text{continuation count}(\cdot) & \text{for lower orders} \end{cases}$$

Continuation count = the number of unique single-word contexts for ·

Page 12:

Practical issue: Huge web-scale n-grams

- How to deal with, e.g., the Google N-gram corpus?
- Pruning
  - Only store N-grams with count > threshold.
  - Remove singletons of higher-order n-grams

Page 13:

Huge web-scale n-grams

- Efficiency
  - Efficient data structures, e.g., tries (https://en.wikipedia.org/wiki/Trie)
  - Store words as indexes, not strings
  - Quantize probabilities (4-8 bits instead of an 8-byte float)
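A toy sketch (hypothetical helper names, not a production data structure) of the first two ideas: words are mapped to integer ids and n-gram counts live in a nested-dict trie, so shared prefixes are stored once. Quantization would additionally replace 8-byte float probabilities at the nodes with small integer codes.

```python
# Sketch: store n-gram counts in a trie keyed by integer word ids (toy example).
word2id = {}

def wid(w):
    """Map each word string to a small integer index."""
    return word2id.setdefault(w, len(word2id))

trie = {}  # nested dicts; each node holds {"count": int, "children": {...}}

def add_ngram(ngram):
    node = trie
    for w in ngram:
        entry = node.setdefault(wid(w), {"count": 0, "children": {}})
        entry["count"] += 1
        node = entry["children"]

def get_count(ngram):
    node, count = trie, 0
    for w in ngram:
        i = word2id.get(w)
        if i is None or i not in node:
            return 0
        count = node[i]["count"]
        node = node[i]["children"]
    return count

for ng in [("a", "red", "glasses"), ("a", "red", "shoes"), ("a", "blue", "shoes")]:
    add_ngram(ng)
print(get_count(("a", "red")))  # 2: the shared prefix "a red" is stored only once
```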

Page 14:

Smoothing

“This dark art is why NLP is taught in the engineering school.”

(Slide credit: J. Eisner, 600.465 Intro to NLP)

There are more principled smoothing methods, too. We’ll look next at log-linear models, which are a good and popular general technique.

Page 15:

Conditional Modeling

- Generative language model (trigram model):

$$P(w_1, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-2}, w_{n-1})$$

- Then, we compute the conditional probabilities by maximum likelihood estimation
- Can we model $P(w_i \mid w_{i-2}, w_{i-1})$ directly?
- Given a context x, which outcomes y are likely in that context?
  P(NextWord = y | PrecedingWords = x)

(Slide credit: J. Eisner, 600.465 Intro to NLP)

Page 16:

Modeling conditional probabilities

- Let’s assume
  $$P(y \mid x) = \frac{\exp(\text{score}(x, y))}{\sum_{y'} \exp(\text{score}(x, y'))}$$
  where y: NextWord, x: PrecedingWords
- P(y | x) is high ⇔ score(x, y) is high
- This is called the soft-max
- It requires that P(y | x) ≥ 0 and $\sum_y P(y \mid x) = 1$; this is not true of a raw score(x, y)

Page 17:

Linear Scoring

- Score(x, y): How well does y go with x?
- Simplest option: a linear function of (x, y). But (x, y) isn’t a number ⇒ describe it by some numbers (i.e., numeric features)
- Then just use a linear function of those numbers:

$$\text{score}(x, y) = \sum_k \theta_k\, f_k(x, y)$$

  - k ranges over all features
  - f_k(x, y): whether (x, y) has feature k (0 or 1), how many times it fires (≥ 0), or how strongly it fires (a real number)
  - θ_k: the weight of the k-th feature, to be learned …

Page 18:

What features should we use?

- Model p(w_i | w_{i−1}, w_{i−2}): features f_k(“w_{i−1}, w_{i−2}”, “w_i”) for Score(“w_{i−1}, w_{i−2}”, “w_i”) can be
  - # of times “w_{i−1}” appears in the training corpus
  - 1 if “w_i” is an unseen word; 0 otherwise
  - 1 if “w_{i−1}, w_{i−2}” = “a red”; 0 otherwise
  - 1 if “w_{i−2}” belongs to the “color” category; 0 otherwise

Page 19:

What features should we use?

- Model p(“glasses” | “a red”): features f_k(“red”, “a”, “glasses”) for Score(“red”, “a”, “glasses”) can be
  - # of times “red” appears in the training corpus
  - 1 if “a” is an unseen word; 0 otherwise
  - 1 if “a red” = “a red”; 0 otherwise
  - 1 if “red” belongs to the “color” category; 0 otherwise

Page 20:

Log-Linear Conditional Probability

(Slide credit: J. Eisner, 600.465 Intro to NLP)

$$p(y \mid x) = \frac{\exp\big(\sum_k \theta_k f_k(x, y)\big)}{Z(x)}$$

- The numerator is the unnormalized probability (at least it’s positive!)
- We choose Z(x) to ensure that $\sum_y p(y \mid x) = 1$; thus

$$Z(x) = \sum_{y'} \exp\Big(\sum_k \theta_k f_k(x, y')\Big)$$

- Z(x) is called the partition function
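To make the pieces concrete, here is a minimal sketch of a log-linear conditional model over a tiny vocabulary; the features, weights, and vocabulary are made up for illustration, not taken from the slides.

```python
# Sketch: log-linear P(y | x) with hand-made features and weights (illustration only).
import math

VOCAB = ["glasses", "shoes", "abacus"]

def features(x, y):
    """f(x, y): a few binary features of context x = (w_{i-2}, w_{i-1}) and next word y."""
    return {
        "context=a_red": 1.0 if x == ("a", "red") else 0.0,
        "prev_is_color&y=glasses": 1.0 if x[1] in {"red", "blue"} and y == "glasses" else 0.0,
        "y=" + y: 1.0,
    }

theta = {"prev_is_color&y=glasses": 2.0, "y=glasses": 0.5, "y=shoes": 0.4, "y=abacus": -1.0}

def score(x, y):
    return sum(theta.get(k, 0.0) * v for k, v in features(x, y).items())

def prob(y, x):
    Z = sum(math.exp(score(x, yp)) for yp in VOCAB)  # partition function Z(x)
    return math.exp(score(x, y)) / Z

for y in VOCAB:
    print(y, round(prob(y, ("a", "red")), 3))  # probabilities sum to 1 over VOCAB
```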

Page 21:

Training θ

- n training examples
- feature functions f1, f2, …
- Want to maximize p(training data | θ)
- Easier to maximize the log of that:
  $$\sum_{i=1}^{n} \log p(y_i \mid x_i; \theta)$$
- Alas, some weights θ_i may be optimal at −∞ or +∞. When would this happen? What’s going “wrong”?

This version is “discriminative training”: to learn to predict y from x, maximize p(y | x). Whereas in “generative models”, we learn to model x, too, by maximizing p(x, y).

Page 22:

Generalization via Regularization

- n training examples
- feature functions f1, f2, …
- Want to maximize p(training data | θ) ⋅ p_prior(θ)
- Easier to maximize the log of that
- A Gaussian prior $p(\theta) \propto e^{-\|\theta\|^2 / (2\sigma^2)}$ encourages weights close to 0.
- “L2 regularization”: corresponds to a Gaussian prior

Page 23:

Gradient-based training

- Gradually adjust θ in a direction that improves the objective

Gradient ascent to gradually increase f(θ):

  while (∇f(θ) ≠ 0):    # not at a local max or min
      θ = θ + η ∇f(θ)    # for some small η > 0

Remember: ∇f(θ) = (∂f(θ)/∂θ1, ∂f(θ)/∂θ2, …), so the update means θk += η ∂f(θ)/∂θk
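A runnable sketch of the gradient-ascent loop above on a toy one-dimensional objective; the function f, the step size, and the stopping tolerance are illustrative assumptions.

```python
# Sketch: gradient ascent on a toy concave objective f(theta) = -(theta - 3)^2.
def grad_f(theta):
    return -2.0 * (theta - 3.0)   # derivative of -(theta - 3)^2

theta, eta = 0.0, 0.1             # start far from the optimum, small step size
for _ in range(100):
    g = grad_f(theta)
    if abs(g) < 1e-8:             # (near) a stationary point: stop
        break
    theta += eta * g              # theta_k += eta * df/dtheta_k
print(round(theta, 4))            # approaches 3.0, the maximizer
```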

Page 24:

Gradient-based training

- Gradually adjust θ in a direction that improves the objective

- Gradient w.r.t. θ:
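For reference (this is the standard result for log-linear models, stated here rather than copied from the slide), the gradient of the conditional log-likelihood has the familiar observed-minus-expected-features form:

$$\frac{\partial}{\partial \theta_k} \sum_{i=1}^{n} \log p(y_i \mid x_i; \theta) = \sum_{i=1}^{n} \Big( f_k(x_i, y_i) - \sum_{y} p(y \mid x_i; \theta)\, f_k(x_i, y) \Big)$$

That is, the observed feature counts minus the model’s expected feature counts.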

Page 25:

More complex assumption?

- $P(y \mid x) = \exp(\text{score}(x, y)) / \sum_{y'} \exp(\text{score}(x, y'))$, where y: NextWord, x: PrecedingWords
- Assume we saw:
  red glasses; yellow glasses; green glasses; blue glasses
  red shoes; yellow shoes; green shoes;
  What is P(shoes | blue)?
- Can we learn categories of words (representations) automatically?
- Can we build a high-order n-gram model without blowing up the model size?

Page 26:

Neural language model

- Model P(y | x) with a neural network

- Example 1: one-hot vector: each component of the vector represents one word, e.g., [0, 0, 1, 0, 0]
- Example 2: word embeddings

Page 27:

Neural language model

- Model P(y | x) with a neural network:
  - Learned matrices project the input vectors
  - Concatenate the projected vectors
  - Apply a non-linear function, e.g., h = tanh(Wc + b)
  - Obtain P(y | x) by performing a softmax
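A minimal numpy sketch of the feedforward architecture described above; the dimensions, random weights, and two-word context are assumptions for illustration (no training loop is shown).

```python
# Sketch: feedforward neural LM computing P(y | x) for a two-word context (illustration only).
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 5, 8, 16                 # vocabulary size, embedding dim, hidden dim

E = rng.normal(size=(V, d))        # learned projection (embedding) matrix
W = rng.normal(size=(h, 2 * d))    # hidden-layer weights for the concatenated context
b = np.zeros(h)
U = rng.normal(size=(V, h))        # output layer: one row of scores per vocabulary word

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def p_next(context_ids):
    c = np.concatenate([E[i] for i in context_ids])  # project and concatenate
    hid = np.tanh(W @ c + b)                         # non-linear function
    return softmax(U @ hid)                          # P(y | x) over the vocabulary

print(p_next([2, 4]))   # distribution over the 5 words, given context word ids 2 and 4
```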

Page 28:

Why?

- Potentially generalize to unseen contexts
- Example: P(“red” | “the”, “shoes”, “are”)
  - This does not occur in the training corpus, but [“the”, “glasses”, “are”, “red”] does.
  - If the word representations of “red” and “blue” are similar, then the model can generalize.
- Why are “red” and “blue” similar?
  - Because the NN saw “red skirt”, “blue skirt”, “red pen”, “blue pen”, etc.

Page 29:

Training neural language models

- Can use gradient ascent as well
- Use the chain rule to derive the gradient, a.k.a. backpropagation
- More complex NN architectures can be used, e.g., LSTMs and character-based models

Page 30:

Language model evaluation

- How to compare models?
  - We need an unseen test set. Why?
- Information theory: the study of the resolution of uncertainty
- Perplexity: measures how well a probability distribution predicts a sample

Page 31:

Cross-Entropy

- A common measure of model quality
  - Task-independent
  - Continuous: slight improvements show up here even if they don’t change the # of right answers on a task
- Just measure the probability of (enough) test data
  - Higher probability means the model better predicts the future
  - There’s a limit to how well you can predict random stuff
  - The limit depends on “how random” the dataset is (easier to predict the weather than headlines, especially in Arizona)

Page 32:

Cross-Entropy (“xent”)

- Want the probability of the test data to be high:
  p(h | <S>, <S>) * p(o | <S>, h) * p(r | h, o) * p(s | o, r) … = 1/8 * 1/8 * 1/8 * 1/16 …
- High prob → low xent, by 3 cosmetic improvements:
  - Take the logarithm (base 2) to prevent underflow:
    log(1/8 * 1/8 * 1/8 * 1/16 …) = log 1/8 + log 1/8 + log 1/8 + log 1/16 … = (−3) + (−3) + (−3) + (−4) + …
  - Negate to get a positive value in bits: 3 + 3 + 3 + 4 + …
  - Divide by the length of the text → 3.25 bits per letter (or per word)
- Average? The geometric average of 1/2³, 1/2³, 1/2³, 1/2⁴ is 1/2^3.25 ≈ 1/9.5
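The same arithmetic as a short sketch, using the slide’s 1/8, 1/8, 1/8, 1/16 example probabilities:

```python
# Sketch: per-token cross-entropy (bits) and perplexity from token probabilities.
import math

probs = [1/8, 1/8, 1/8, 1/16]                            # model probs of the test tokens
xent = -sum(math.log2(p) for p in probs) / len(probs)    # (3 + 3 + 3 + 4) / 4 = 3.25 bits
perplexity = 2 ** xent                                   # ~9.51 effective choices
print(xent, perplexity)
```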

Page 33:

Cross-Entropy (“xent”)

- Want the probability of the test data to be high:
  p(h | <S>, <S>) * p(o | <S>, h) * p(r | h, o) * p(s | o, r) … = 1/8 * 1/8 * 1/8 * 1/16 …
- Cross-entropy → 3.25 bits per letter (or per word)
- Want this to be small (equivalent to wanting good compression!)
- The lower limit is called entropy: obtained in principle as the cross-entropy of the true model measured on an infinite amount of data
- Perplexity = 2^xent (meaning ≈ 9.5 choices)
- Average? The geometric average of 1/2³, 1/2³, 1/2³, 1/2⁴ is 1/2^3.25 ≈ 1/9.5

Page 34:

More math: Entropy H(X)

- The entropy H(p) of a discrete random variable X with distribution p is the expected negative log probability:

$$H(p) = -\sum_{x} p(x) \log_2 p(x)$$

- Entropy is a measure of uncertainty

Page 35:

Entropy of coin tossing

- Toss a coin with P(H) = p, P(T) = 1 − p
- H(p) = −p log₂ p − (1 − p) log₂(1 − p)
  - p = 0.5: H(p) = 1
  - p = 1: H(p) = 0
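Plugging in the two endpoints as a quick check:

$$H(0.5) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 0.5 + 0.5 = 1 \text{ bit}, \qquad H(1) = -1 \log_2 1 - 0 \log_2 0 = 0 \text{ bits (taking } 0 \log_2 0 = 0\text{)}.$$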

Page 37:

How many bits to encode messages

- Consider four letters (A, B, C, D):
- If p = (½, ½, 0, 0), how many bits per letter on average to encode a message ~ p?
  - Encode A as 0, B as 1; AAABBBAA ⇒ 00011100
- If p = (¼, ¼, ¼, ¼):
  - A: 00, B: 01, C: 10, D: 11; ABDA ⇒ 00011100
- How about p = (½, ¼, ¼, 0)?
  - A: 0, B: 10, C: 11; AAACBA ⇒ 00011100
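For the last distribution, the expected code length matches the entropy of p:

$$E[\text{length}] = \tfrac{1}{2}\cdot 1 + \tfrac{1}{4}\cdot 2 + \tfrac{1}{4}\cdot 2 = 1.5 \text{ bits} = H(p) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4}.$$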

Page 38:

More math: Cross Entropy

- Cross-entropy: the average # of bits to encode events ~ p(x) using a coding scheme m(x):

$$H(p, m) = -\sum_{x} p(x) \log_2 m(x)$$

- Not symmetric: H(p, m) ≠ H(m, p)
- Lower bounded by H(p)
- Let p = (½, ¼, ¼, 0), and we encode with A: 00, B: 01, C: 10, D: 11 (i.e., m = (¼, ¼, ¼, ¼))
- AAACBA? ⇒ 000000100100
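Working out the slide’s question: encoding AAACBA with m’s uniform 2-bit code costs 12 bits for 6 symbols, i.e., 2 bits per symbol, which equals H(p, m) and exceeds H(p):

$$H(p, m) = -\tfrac{1}{2}\log_2\tfrac{1}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} - 0 = 1 + 0.5 + 0.5 = 2 \text{ bits} \;>\; H(p) = 1.5 \text{ bits}.$$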

Page 39:

Perplexity and geometric mean

(Slide credit: CS498JH Introduction to NLP)

Language model m is better than m′ if it assigns lower perplexity (i.e., lower cross-entropy, and higher probability) to the test corpus w1 … wN:

$$\text{Perplexity}(w_1 \ldots w_N) = 2^{H(w_1 \ldots w_N)} = 2^{-\frac{1}{N}\log_2 m(w_1 \ldots w_N)} = m(w_1 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{m(w_1 \ldots w_N)}}$$

Page 40:

An experiment

- Train: 38M words of WSJ text, |V| = 20K
- Test: 1.5M words of WSJ text
- Word-level LSTM: perplexity ≈ 85
- Char-level: perplexity ≈ 79

(Slide credit: CS498JH Introduction to NLP)

- Models: unigram, bigram, trigram (with Good-Turing smoothing)
- Training data: 38M words of WSJ text (vocabulary: 20K types)
- Test data: 1.5M words of WSJ text
- Results (perplexity): unigram 962, bigram 170, trigram 109