Language Modeling 1


TRANSCRIPT

Page 1:

Language Modeling

Page 2:

Roadmap (for next two classes)

Review LM evaluation metrics: Entropy, Perplexity

Smoothing: Good-Turing, Backoff and Interpolation, Absolute Discounting, Kneser-Ney

Page 3:

Language Model Evaluation Metrics

Page 4:

Applications

Page 5:

Entropy and perplexity

Entropy – measures information content, in bits:
  H(p) = - Σ_x p(x) log2 p(x)
  -log2 p(x) is the message length under an ideal code; use log base 2 if you want to measure in bits!

Cross entropy – measures the ability of a trained model m to compactly represent test data:
  H(p, m) = - (1/N) Σ_i log2 m(w_i), the average negative logprob of the test data.

Perplexity – measures the average branching factor: PP = 2^H(p, m).
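A tiny Python sketch of the entropy definition above (the biased coin is the same 2/3 vs. 1/3 distribution as the MLE coin example later in the deck):

```python
import math

def entropy_bits(dist):
    """H(p) = -sum p(x) * log2 p(x): expected message length, in bits, under an ideal code."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

fair_coin = {"H": 0.5, "T": 0.5}
biased = {"H": 2 / 3, "T": 1 / 3}
print(entropy_bits(fair_coin))  # 1.0 bit
print(entropy_bits(biased))     # ≈ 0.918 bits: less surprise, shorter ideal code
```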

Page 9:

Language model perplexity

Recipe (see the sketch at the end of this page): train a language model on training data; get the negative logprobs of the test data and compute their average; exponentiate!

Perplexity correlates rather well with: speech recognition error rates, MT quality metrics.

Perplexities for word-based LMs normally fall between, say, 50 and 1000.

Need to drop perplexity by a significant fraction (not by an absolute amount) to make a visible impact.
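A minimal Python sketch of this recipe; model_logprob and the uniform toy model are hypothetical stand-ins for a trained LM:

```python
import math

def perplexity(model_logprob, test_tokens):
    """Perplexity = exp of the average negative log-probability per token."""
    total_neg_logprob = 0.0
    for i, word in enumerate(test_tokens):
        history = test_tokens[:i]
        total_neg_logprob -= model_logprob(word, history)  # natural-log probabilities
    return math.exp(total_neg_logprob / len(test_tokens))

# Example with a hypothetical uniform model over a 100-word vocabulary:
uniform = lambda word, history: math.log(1 / 100)
print(perplexity(uniform, ["the", "cat", "sat"]))  # -> 100.0
```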

Page 10:

Parameter estimation

What is it?

Page 11:

Parameter estimation

Model form is fixed (coin unigrams, word bigrams, …); we have observations:

  H H H T T H T H H

Want to find the parameters. Maximum Likelihood Estimation: pick the parameters that assign the most probability to our training data.
  c(H) = 6; c(T) = 3
  P(H) = 6 / 9 = 2 / 3; P(T) = 3 / 9 = 1 / 3

MLE picks the parameters that are best for the training data…
…but these don't generalize well to test data – zeros!
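The coin example as code, a minimal sketch assuming nothing beyond the counts on this slide:

```python
from collections import Counter

observations = list("HHHTTHTHH")          # H H H T T H T H H
counts = Counter(observations)            # c(H) = 6, c(T) = 3
total = sum(counts.values())

# MLE: relative frequencies maximize the probability of the training data.
mle = {outcome: c / total for outcome, c in counts.items()}
print(mle)  # {'H': 0.666..., 'T': 0.333...}
```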

Page 14:

Smoothing

Take mass from seen events and give it to unseen events: Robin Hood for probability models.

MLE is at one end of the spectrum; the uniform distribution is at the other.

Need to pick a happy medium, and yet still maintain a proper probability distribution.

Page 15:

Smoothing techniques

Laplace, Good-Turing, Backoff, Mixtures, Interpolation, Kneser-Ney

Page 16:

Laplace

From MLE:
  P(x) = c(x) / N

To Laplace (add one to every count):
  P_Laplace(x) = (c(x) + 1) / (N + V), where V is the number of possible events (e.g. the vocabulary size).
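A minimal sketch of add-one smoothing for bigram counts; the toy corpus is illustrative:

```python
from collections import Counter

def laplace_bigram_prob(bigram_counts, unigram_counts, vocab_size, prev, word):
    """Add-one (Laplace) estimate of P(word | prev)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

tokens = "the cat sat on the mat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)

print(laplace_bigram_prob(bigrams, unigrams, V, "the", "cat"))  # (1 + 1) / (2 + 5) ≈ 0.286
print(laplace_bigram_prob(bigrams, unigrams, V, "the", "sat"))  # (0 + 1) / (2 + 5) ≈ 0.143  (unseen, but non-zero)
```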

Page 17:

Good-Turing Smoothing

New idea: Use counts of things you have seen to estimate those you haven’t

Page 20:

Good-Turing: Josh Goodman's Intuition

Imagine you are fishing. There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass.

You have caught: 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish.

How likely is it that the next fish caught is from a new species (one not seen in our previous catch)?
  3/18 (the proportion of singletons in the catch)

Assuming so, how likely is it that the next species is trout?
  Must be less than 1/18.

Slide adapted from Josh Goodman, Dan Jurafsky

Page 21:

Some more hypotheticals

Species     Puget Sound   Lake Washington   Greenlake
Salmon            8              12               0
Trout             3               1               1
Cod               1               1               0
Rockfish          1               0               0
Snapper           1               0               0
Skate             1               0               0
Bass              0               1              14
TOTAL            15              15              15

How likely is it to find a new fish in each of these places?

Page 24:

Good-Turing Smoothing

New idea: Use counts of things you have seen to estimate those you haven’t

Good-Turing approach: Use frequency of singletons to re-estimate frequency of zero-count n-grams

Notation: N_c is the frequency of frequency c, i.e. the number of n-grams which appear c times.
  N_0: # of n-grams with count 0; N_1: # of n-grams with count 1.

Page 25:

Good-Turing Smoothing

Estimate the probability of things which occur c times with the probability of things which occur c+1 times.

Discounted counts: steal mass from seen cases to provide for the unseen:

  MLE:  P(x) = c(x) / N
  GT:   c*(x) = (c(x) + 1) · N_{c+1} / N_c,  so  P_GT(x) = c*(x) / N
        (the total mass left for unseen events is N_1 / N)

Page 26:

GT Fish Example
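A sketch of the Good-Turing re-estimates for the fish catch above, using the formulas from page 25:

```python
from collections import Counter

catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(catch.values())                  # 18 fish
Nc = Counter(catch.values())             # frequency of frequencies: N_1=3, N_2=1, N_3=1, N_10=1

# Probability the next fish is a new (unseen) species: N_1 / N
p_unseen = Nc[1] / N
print(p_unseen)                          # 3/18 ≈ 0.167

# Good-Turing discounted count and probability for a singleton such as trout:
c_star_trout = (1 + 1) * Nc[2] / Nc[1]   # c* = (c+1) * N_{c+1} / N_c = 2 * 1 / 3
print(c_star_trout / N)                  # ≈ 0.037, indeed less than 1/18 ≈ 0.056
```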

Page 29:

Enough about the fish… how does this relate to language? Name some linguistic situations where the number of new words would differ.

Different languages:
  Chinese has almost no morphology; Turkish has a lot of morphology. Lots of new words in Turkish!

Different domains:
  Airplane maintenance manuals: controlled vocabulary. Random web posts: uncontrolled vocabulary.

Page 30:

Bigram Frequencies of Frequencies and GT Re-estimates

Page 31:

Good-Turing Smoothing

From n-gram counts to a conditional probability: use c* from the GT estimate in place of the raw count, e.g.
  P(w_i | w_i-1) = c*(w_i-1 w_i) / c(w_i-1)

Page 32:

Additional Issues in Good-Turing

General approach: the estimate of c* for count c depends on N_{c+1}.

What if N_{c+1} = 0? More zero-count problems! Not uncommon: e.g. in the fish example there are no counts of 4, so N_4 = 0.

Page 33:

Modifications

Simple Good-Turing: compute the N_c bins, then smooth the N_c values to replace the zeroes.
  Fit a linear regression in log space: log(N_c) = a + b log(c)  (sketched below)

What about large c's? Those counts should already be reliable.
  Assume c* = c if c is large, e.g. c > k (Katz: k = 5).

Typically combined with other approaches
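A minimal sketch of the log-space regression used by Simple Good-Turing; the fish counts stand in for real frequency-of-frequency bins:

```python
import math

def smooth_freq_of_freqs(Nc):
    """Fit log(N_c) = a + b*log(c) by least squares, so zero or noisy bins can be
    replaced with the fitted value (the Simple Good-Turing idea)."""
    pts = [(math.log(c), math.log(n)) for c, n in Nc.items() if n > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in pts) / sum((x - mean_x) ** 2 for x, _ in pts)
    a = mean_y - b * mean_x
    return lambda c: math.exp(a + b * math.log(c))

# Fish example: N_1=3, N_2=1, N_3=1, N_10=1. N_4 is zero, but the fit still gives a value for it.
fitted = smooth_freq_of_freqs({1: 3, 2: 1, 3: 1, 10: 1})
print(fitted(4))   # smoothed estimate of N_4
```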

Page 36:

Backoff and Interpolation

Another really useful source of knowledge. If we are estimating the trigram p(z|x,y) but count(xyz) is zero…

Use info from: Bigram p(z|y)

Or even: Unigram p(z)

How to combine this trigram, bigram, unigram info in a valid fashion?

Page 38:

Backoff vs. Interpolation

Backoff: use trigram if you have it, otherwise bigram, otherwise unigram

Interpolation: always mix all three

Page 39:

Backoff

Start from the bigram distribution P(w_i | w_i-1) = c(w_i-1 w_i) / c(w_i-1).

But that count could be zero… What if we fell back (or "backed off") to the unigram distribution P(w_i)?

That could also be zero…

Page 40:

Backoff

What’s wrong with this distribution?

Doesn’t sum to one! Need to steal mass…

Page 41:

Backoff
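A minimal sketch of the idea on pages 39-41: discount seen bigrams and hand the stolen mass to the lower-order estimate through a per-history alpha, so the result still sums to one. The fixed absolute discount and the helper names are illustrative choices, not the exact formulation from the slides:

```python
from collections import Counter, defaultdict

def discounted_backoff(tokens, discount=0.5):
    """Bigram backoff with a fixed absolute discount: seen bigrams keep
    (c - d) / c(history); the stolen mass is redistributed, via a
    history-specific alpha, over words unseen after that history."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    N = len(tokens)
    p_uni = {w: c / N for w, c in unigrams.items()}

    followers = defaultdict(set)
    for (prev, w) in bigrams:
        followers[prev].add(w)

    def prob(prev, w):          # prev is assumed to be an in-vocabulary history
        if bigrams[(prev, w)] > 0:
            return (bigrams[(prev, w)] - discount) / unigrams[prev]
        # leftover mass for this history, spread over unseen continuations
        leftover = discount * len(followers[prev]) / unigrams[prev]
        unseen_uni_mass = sum(p_uni[v] for v in unigrams if v not in followers[prev])
        return leftover * p_uni[w] / unseen_uni_mass

    return prob

tokens = "the cat sat on the mat".split()
p = discounted_backoff(tokens)
print(sum(p("the", w) for w in set(tokens)))  # ≈ 1.0: the backed-off distribution sums to one
```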

Page 42:

Mixtures

Given two distributions P1 and P2, pick any number λ between 0 and 1: λ·P1 + (1 − λ)·P2 is again a distribution. (Laplace is a mixture of the MLE and the uniform distribution!)

Page 43:

Interpolation

Simple interpolation:
  P_hat(z|x,y) = λ1 · P(z|x,y) + λ2 · P(z|y) + λ3 · P(z),  with λ1 + λ2 + λ3 = 1

Or, pick the interpolation values based on the context: λ = λ(x,y).
  Intuition: put higher weight on the higher-order estimate when its n-grams are frequent (and hence reliable).

Page 44:

How to Set the Lambdas?

Use a held-out (development) corpus. Choose the lambdas which maximize the probability of the held-out data:
  Fix the n-gram probabilities, then search for the lambda values that, when plugged into the previous equation, give the largest probability for the held-out set.
  Can use EM to do this search.
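A minimal sketch of interpolating bigram and unigram estimates and picking the weight on held-out data. For brevity it uses a simple grid search over lambda rather than EM, and the toy corpora are illustrative:

```python
import math
from collections import Counter

def train_counts(tokens):
    return Counter(tokens), Counter(zip(tokens, tokens[1:])), len(tokens)

def interp_logprob(tokens, unigrams, bigrams, N, lam, V):
    """Average log-prob under lam * P_ML(w|prev) + (1 - lam) * P_add_one_unigram(w)."""
    total = 0.0
    for prev, w in zip(tokens, tokens[1:]):
        p_bi = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
        p_uni = (unigrams[w] + 1) / (N + V)          # add-one unigram so nothing is zero
        total += math.log(lam * p_bi + (1 - lam) * p_uni)
    return total / (len(tokens) - 1)

train = "the cat sat on the mat".split()
heldout = "the cat sat on the rug".split()
unigrams, bigrams, N = train_counts(train)
V = len(unigrams) + 1                                # +1 leaves room for unseen words

best = max((interp_logprob(heldout, unigrams, bigrams, N, lam, V), lam)
           for lam in [0.1, 0.3, 0.5, 0.7, 0.9])
print("best lambda:", best[1])
```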

Page 49:

Kneser-Ney Smoothing

The most commonly used modern smoothing technique. Intuition: improving backoff.

"I can't see without my reading……"
Compare P(Francisco|reading) vs. P(glasses|reading):
  P(Francisco|reading) backs off to P(Francisco).
  Even though P(glasses|reading) > 0, the high unigram frequency of Francisco can make the backed-off estimate beat P(glasses|reading).
  However, Francisco appears in few contexts (mostly after "San"), while glasses appears in many.

Interpolate based on the number of contexts: words seen in more contexts are more likely to appear in others.

Page 50:

Kneser-Ney Smoothing: bigrams

Modeling the diversity of contexts: count the number of distinct words that w follows,
  |{w' : c(w' w) > 0}|

So the unigram-level "continuation" probability is
  P_continuation(w) = |{w' : c(w' w) > 0}| / |{(u, v) : c(u v) > 0}|
i.e. the fraction of all distinct bigram types that end in w.

Page 51:

Kneser-Ney Smoothing: bigrams

Backoff form:
  P_KN(w_i | w_i-1) = max(c(w_i-1 w_i) - d, 0) / c(w_i-1)       if c(w_i-1 w_i) > 0
                    = α(w_i-1) · P_continuation(w_i)            otherwise
  where d is an absolute discount and α(w_i-1) normalizes the leftover mass.

Page 52:

Kneser-Ney Smoothing: bigrams

Interpolated form:
  P_KN(w_i | w_i-1) = max(c(w_i-1 w_i) - d, 0) / c(w_i-1) + λ(w_i-1) · P_continuation(w_i)
  where λ(w_i-1) = (d / c(w_i-1)) · |{w : c(w_i-1 w) > 0}|, so the distribution sums to one.
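A minimal sketch of the interpolated form for bigrams; the discount value and function names are illustrative:

```python
from collections import Counter, defaultdict

def interpolated_kn(tokens, d=0.75):
    """Interpolated Kneser-Ney for bigrams, following the formulas above."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    followers = defaultdict(set)     # words seen after each history
    histories = defaultdict(set)     # distinct left contexts of each word
    for prev, w in bigrams:
        followers[prev].add(w)
        histories[w].add(prev)
    num_bigram_types = len(bigrams)

    def p_continuation(w):
        return len(histories[w]) / num_bigram_types

    def prob(prev, w):
        discounted = max(bigrams[(prev, w)] - d, 0) / unigrams[prev]
        lam = d * len(followers[prev]) / unigrams[prev]
        return discounted + lam * p_continuation(w)

    return prob

tokens = "the cat sat on the mat".split()
p_kn = interpolated_kn(tokens)
print(sum(p_kn("the", w) for w in set(tokens)))   # ≈ 1.0
```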

Page 57:

OOV words: the <UNK> token

Out Of Vocabulary = OOV words. We don't use GT smoothing for these, because GT assumes we know the number of unseen events.

Instead: create an unknown-word token <UNK>.

Training of <UNK> probabilities (see the sketch below):
  Create a fixed lexicon L of size V.
  At the text normalization phase, change any training word not in L to <UNK>.
  Now we train its probabilities like a normal word.

At decoding time (if text input): use the <UNK> probabilities for any word not seen in training, plus an additional penalty! <UNK> predicts the class of unknown words; we still need to pick a member.
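A minimal sketch of the <UNK> normalization step; the lexicon-building rule (keep the most frequent words) is one common choice, not necessarily the one intended on the slide:

```python
from collections import Counter

def build_lexicon(train_tokens, vocab_size):
    """Fixed lexicon L: the vocab_size most frequent training words."""
    return {w for w, _ in Counter(train_tokens).most_common(vocab_size)}

def normalize(tokens, lexicon):
    """Map anything outside the lexicon to the <UNK> token."""
    return [w if w in lexicon else "<UNK>" for w in tokens]

train = "the cat sat on the mat the cat".split()
L = build_lexicon(train, vocab_size=3)        # e.g. {'the', 'cat', 'sat'} (ties broken arbitrarily)
print(normalize(train, L))
print(normalize("the dog sat".split(), L))    # unseen 'dog' -> '<UNK>'
```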

Page 62:

Class-Based Language Models

A variant of n-gram models using classes or clusters. Motivation: sparseness.
  Flight app: instead of P(ORD|to), P(JFK|to), …, use P(airport_name|to).
  Relate the probability of an n-gram to word classes and a class n-gram.

IBM clustering: assume each word belongs to a single class (sketched below).
  P(w_i | w_i-1) ≈ P(c_i | c_i-1) × P(w_i | c_i)
  Learn by MLE from data.

Where do classes come from?
  Hand-designed for the application (e.g. ATIS), or automatically induced clusters from the corpus.
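A minimal sketch of the IBM-style class bigram factorization; the word-to-class map and toy corpus are illustrative and assumed given:

```python
from collections import Counter

def class_bigram_model(tokens, word2class):
    """IBM-style class bigram: P(w_i | w_i-1) ≈ P(c_i | c_i-1) * P(w_i | c_i),
    with both factors estimated from counts. word2class is assumed given
    (hand-designed or induced elsewhere)."""
    classes = [word2class[w] for w in tokens]
    class_uni = Counter(classes)
    class_bi = Counter(zip(classes, classes[1:]))
    word_in_class = Counter(zip(classes, tokens))

    def prob(prev, w):
        cp, cw = word2class[prev], word2class[w]
        p_class = class_bi[(cp, cw)] / class_uni[cp]
        p_word = word_in_class[(cw, w)] / class_uni[cw]
        return p_class * p_word

    return prob

word2class = {"to": "PREP", "ORD": "AIRPORT", "JFK": "AIRPORT", "fly": "VERB"}
tokens = "fly to ORD fly to JFK".split()
p = class_bigram_model(tokens, word2class)
print(p("to", "JFK"))   # P(AIRPORT|PREP) * P(JFK|AIRPORT) = 1.0 * 0.5 = 0.5
```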

Page 66:

LM Adaptation

Challenge: we need an LM for a new domain, but have little in-domain data.

Intuition: much of language is pretty general, so we can build from a 'general' LM + in-domain data.

Approach: LM adaptation. Train on a large domain-independent corpus, then adapt with the small in-domain data set.

What large corpus? Web counts! e.g. Google n-grams.

Page 70:

Incorporating Longer Distance Context

Why use longer context? N-grams are only an approximation, constrained by model size and sparseness.

What sorts of information live in longer context? Priming, topic, sentence type, dialogue act, syntax.

Page 74:

Long Distance LMs

Bigger n!
  On 284M words: n-grams up to n <= 6 improve; 7-20 are no better.

Cache n-grams (see the sketch after this list):
  Intuition: priming. A word used previously is more likely to be used again.
  Incrementally build a 'cache' unigram model on the test corpus and mix it with the main n-gram LM.

Topic models:
  Intuition: text is about some topic, so on-topic words are likely.
  P(w|h) ≈ Σ_t P(w|t) P(t|h)

Non-consecutive n-grams: skip n-grams, triggers, variable-length n-grams.
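A minimal sketch of a cache LM mixed with a main model, as referenced in the list above; the main model here is a hypothetical uniform stand-in:

```python
from collections import Counter

class CacheMixLM:
    """Mix a fixed main LM with a unigram 'cache' built incrementally
    from the words seen so far in the test text (the priming idea)."""

    def __init__(self, main_prob, lam=0.9):
        self.main_prob = main_prob        # main_prob(word, history) -> probability
        self.lam = lam
        self.cache = Counter()
        self.seen = 0

    def prob(self, word, history):
        p_cache = self.cache[word] / self.seen if self.seen else 0.0
        return self.lam * self.main_prob(word, history) + (1 - self.lam) * p_cache

    def observe(self, word):              # call after each test word is processed
        self.cache[word] += 1
        self.seen += 1

# Toy main model: uniform over a 1000-word vocabulary (hypothetical stand-in).
lm = CacheMixLM(lambda w, h: 1 / 1000)
for w in "the model boosts recently seen words the".split():
    print(w, round(lm.prob(w, None), 5))  # the second 'the' gets a higher probability
    lm.observe(w)
```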

Page 75:

Language Models

N-gram models: a finite approximation of the infinite context history.

Issues: zeroes and other sparseness.

Strategies: smoothing
  Add-one, add-δ, Good-Turing, etc.
  Use partial n-grams: interpolation, backoff.

Refinements: class, cache, topic, and trigger LMs.