
Page 1:

Language Modeling 2

CS287

Page 2:

Review: Windowed Setup

Let V be the vocabulary of English and let s score any window of size d_win = 5. If we see the phrase

[ the dog walks to the ]

It should score higher by s than

[ the dog house to the ]

[ the dog cats to the ]

[ the dog skips to the ]

...

Page 3:

Review: Continuous Bag-of-Words (CBOW)

ŷ = softmax((∑_i x_i^0 W^0 / (d_win − 1)) W^1)

- Attempt to predict the middle word

[ the dog walks to the ]

Example: CBOW

x = (v(w_3) + v(w_4) + v(w_6) + v(w_7)) / (d_win − 1)

y = δ(w_5)

W^1 is no longer partitioned by row (order is lost)
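A NumPy sketch of this forward pass (the vocabulary size, embedding width, and window contents below are made-up toy values):

```python
import numpy as np

V, d_emb, d_win = 10, 4, 5           # toy vocabulary size, embedding width, window size
W0 = np.random.randn(V, d_emb)       # embedding lookup table
W1 = np.random.randn(d_emb, V)       # output projection

def cbow_forward(context_ids):
    """Average the context embeddings, project, and softmax."""
    x = W0[context_ids].sum(axis=0) / (d_win - 1)   # (v(w3)+v(w4)+v(w6)+v(w7)) / (d_win - 1)
    z = x @ W1
    e = np.exp(z - z.max())                          # numerically stable softmax
    return e / e.sum()

y_hat = cbow_forward([3, 4, 6, 7])   # context around the missing middle word w5
```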

Page 4:

Review: Softmax Issues

Use a softmax to force a distribution,

softmax(z) = exp(z) / ∑_{c∈C} exp(z_c)

log softmax(z) = z − log ∑_{c∈C} exp(z_c)

- Issue: the class set C is huge.
- For C&W, 100,000 types; for word2vec, 1,000,000 types.
- Note: the largest dataset is 6 billion words.
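For reference, a NumPy sketch of the log-softmax identity above; the ∑_{c∈C} exp(z_c) term is exactly what becomes expensive when C is huge:

```python
import numpy as np

def log_softmax(z):
    """log softmax(z) = z - log sum_c exp(z_c), stabilized by subtracting the max."""
    m = z.max()
    return (z - m) - np.log(np.exp(z - m).sum())   # the sum runs over all |C| classes
```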

Page 5:

Quiz

We have trained a depth-two balanced softmax tree. Give an algorithm and its time complexity for:

- Computing the optimal class decision.
- Greedily approximating the optimal class decision.
- Appropriately sampling a word.
- Computing the full distribution over classes.

Page 6:

Answers I

- Computing the optimal class decision, i.e. arg max_y p(y|x)?
  - Requires full enumeration, O(|V|):

    arg max_y p(y|x) = arg max_y p(y|C, x) p(C|x)

- Greedily approximating the optimal class decision.
  - Walk the tree, O(√|V|):

    c* = arg max_c p(C = c|x)

    y* = arg max_y p(y|C = c*, x)

Page 7:

Answers II

- Appropriately sampling a word y ∼ p(y|x)?
  - Walk the tree, O(√|V|):

    c ∼ p(C|x)

    y ∼ p(y|C = c, x)

- Computing the full distribution over classes?
  - Enumerate, O(|V|):

    p(y = δ(c)|x) = p(C|x) p(y|C = c, x) for every word c
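A sketch of these operations for a depth-two tree. The class distribution p(C|x) and the within-class distributions p(y|C, x) are random stand-ins for what a trained model would compute from x:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = 100                                          # ~ sqrt(|V|) classes
p_class = rng.dirichlet(np.ones(n_classes))              # stand-in for p(C | x)
p_word = rng.dirichlet(np.ones(n_classes), n_classes)    # row c: stand-in for p(y | C=c, x)

def exact_decode():
    """O(|V|): enumerate every (class, word) pair."""
    joint = p_class[:, None] * p_word                    # p(y, C | x)
    return np.unravel_index(joint.argmax(), joint.shape)

def greedy_decode():
    """O(sqrt(|V|)): best class first, then best word within it."""
    c = p_class.argmax()
    return c, p_word[c].argmax()

def sample():
    """O(sqrt(|V|)): ancestral sampling, c ~ p(C|x) then y ~ p(y|C=c, x)."""
    c = rng.choice(n_classes, p=p_class)
    return c, rng.choice(p_word.shape[1], p=p_word[c])
```

Note that greedy_decode can disagree with exact_decode when the best word sits inside a class whose total probability is not the largest.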

Page 8:

Contents

Language Modeling

NGram Models

Smoothing

Language Models in Practice

Page 9:

Language Modeling Task

Given a sequence of text, give a probability distribution over the next word.

The Shannon game: estimate the probability of the next letter/word given the previous ones.

THE ROOM WAS NOT VERY LIGHT A SMALL OBLONG

READING LAMP ON THE DESK SHED GLOW ON

POLISHED

Page 10:

Language Modeling

Shannon (1948), A Mathematical Theory of Communication

We may consider a discrete source, therefore, to be

represented by a stochastic process. Conversely, any

stochastic process which produces a discrete sequence of

symbols chosen from a finite set may be considered a discrete

source. This will include such cases as:

1. Natural written languages such as English, German,

Chinese. ...

Page 11:

Shannon’s Babblers I

4. Third-order approximation (trigram structure as in English).

IN NO 1ST LAT WHEY CRATICT FROURE BIRS GROCID

PONDENOME OF DEMONSTURES OF THE REPTAGIN IS

REGOACTIONA OF CRE

5. First-Order Word Approximation. Rather than continue with tetragram, ..., n-gram structure it is easier and better to jump at this point to word units. Here words are chosen independently but with their appropriate frequencies.

REPRESENTING AND SPEEDILY IS AN GOOD APT OR

COME CAN DIFFERENT NATURAL HERE HE THE A IN

CAME THE TO OF TO EXPERT GRAY COME TO

FURNISHES THE LINE MESSAGE HAD BE THESE.

Page 12:

Shannon’s Babblers II

6. Second-Order Word Approximation. The word transition

probabilities are correct but no further structure is included.

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH

’RITER THAT THE CHARACTER OF THIS POINT IS

THEREFORE ANOTHER METHOD FOR THE LETTERS

THAT THE TIME OF WHO EVER TOLD THE PROBLEM

FOR AN UNEXPECTED

The resemblance to ordinary English text increases quite

noticeably at each of the above steps.
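These approximations are easy to reproduce; a minimal sketch of the first- and second-order word babblers, sampling from raw counts of a stand-in corpus:

```python
import random
from collections import Counter, defaultdict

corpus = "the dog barks the dog sleeps the cat runs a dog runs".split()  # stand-in corpus

# First-order: words drawn independently with their unigram frequencies.
unigrams = Counter(corpus)
words, weights = zip(*unigrams.items())
first_order = [random.choices(words, weights)[0] for _ in range(10)]

# Second-order: each word drawn from counts conditioned on the previous word.
bigrams = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    bigrams[prev][cur] += 1

w = random.choice(corpus)
second_order = [w]
for _ in range(9):
    nxt = bigrams[w]
    w = random.choices(list(nxt), list(nxt.values()))[0] if nxt else random.choice(corpus)
    second_order.append(w)
```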

Page 13:

Smoothness Image/Language

It is a capital mistake to theorize before one has data.

Insensibly one begins to twist facts to suit theories, instead of

theories to suit facts. - Sherlock Holmes, A Scandal in Bohemia

Page 14:

Repair Image / Language (Chatterjee et al., 2009)

It is a capital mistake to theorize before one has . . .

Page 15:

Repair Image / Language (Chatterjee et al., 2009)

108 938 285 28 184 29 593 219 58 772 . . .

Page 16:

Language Modeling

Crucially important for:

- Speech recognition
- Machine translation
- Many deep learning applications:
  - Captioning
  - Dialogue
  - Summarization
  - ...

Page 17:

Statistical Model of Speech

How permanent are their records?


Page 19:

Statistical Model of Speech

Transcription: Help peppermint on their records

Page 20:

Statistical Model of Speech

Transcription: How permanent are their records

Page 21:

Language Modeling Formally

Goal: Compute the probability of a sentence,

- Factorization:

  p(w_1, ..., w_n) = ∏_{t=1}^n p(w_t | w_1, ..., w_{t−1})

- Estimate the probability of the next word, conditioned on the prefix:

  p(w_t | w_1, ..., w_{t−1}; θ)

Page 22:

Machine Learning Setup

Multi-class prediction problem,

(x_1, y_1), ..., (x_n, y_n)

- y_i: the one-hot next word
- x_i: a representation of the prefix (w_1, ..., w_{t−1})

Challenges:

- How do you represent the input?
- Smoothing is crucially important.
- The output space is very large (next class).


Page 24:

Problem Metric

Previously, we used accuracy as a metric. Language modeling uses a version of average negative log-likelihood.

- For test data w_1, ..., w_n:

  NLL = −(1/n) ∑_{i=1}^n log p(w_i | w_1, ..., w_{i−1})

In practice we report perplexity,

  perp = exp(−(1/n) ∑_{i=1}^n log p(w_i | w_1, ..., w_{i−1}))

This requires modeling the full distribution, as opposed to the argmax (hinge loss).
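A minimal sketch of both quantities, given the model's probability for each test word (the probabilities below are placeholders):

```python
import math

def perplexity(probs):
    """perp = exp(NLL), NLL = -(1/n) * sum_i log p(w_i | w_1..w_{i-1})."""
    nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(nll)

print(perplexity([0.1, 0.25, 0.05, 0.2]))   # placeholder per-word model probabilities
```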

Page 25:

Perplexity: Intuition

- Effective uniform-distribution size.
- If words were uniform: perp = |V| = 10000.
- Using a unigram model: perp ≈ 400.
- log_2 perp gives the average number of bits needed per word.

Page 26:

Contents

Language Modeling

NGram Models

Smoothing

Language Models in Practice

Page 27:

n-gram Models

In practice the representation doesn't use the whole prefix (w_1, ..., w_{t−1}):

p(w_i | w_1, ..., w_{i−1}) ≈ p(w_i | w_{i−n+1}, ..., w_{i−1}; θ)

- We call this an n-gram model.
- w_{i−n+1}, ..., w_{i−1}: the context.

Page 28:

Bigram Models

Back to count-based multinomial estimation,

- Feature set F: the previous word.
- The input vector x is sparse.
- Count matrix F ∈ R^{|V|×|V|}.
- Counts come from the training data:

  F_{c,w} = ∑_i 1(w_{i−1} = c, w_i = w)

  p_ML(w_i = w | w_{i−1} = c; θ) = F_{c,w} / F_{c,•}
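A minimal sketch of these counts using a sparse dictionary over a stand-in corpus (a dense |V| × |V| matrix would waste space, as a later slide notes):

```python
from collections import Counter, defaultdict

corpus = "the dog barks the dog sleeps the cat runs".split()   # stand-in corpus

F = defaultdict(Counter)                  # F[c][w] = count of bigram (c, w)
for prev, cur in zip(corpus, corpus[1:]):
    F[prev][cur] += 1

def p_ml(w, c):
    """p_ML(w | c) = F_{c,w} / F_{c,.}"""
    total = sum(F[c].values())
    return F[c][w] / total if total else 0.0

print(p_ml("dog", "the"))                 # 2/3 on this toy corpus
```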

Page 29:

Trigram Models

- Feature set F: the previous two words (a conjunction).
- The input vector x is sparse.
- Count matrix F ∈ R^{(|V|×|V|)×|V|}.

  F_{c,w} = ∑_i 1(w_{i−2:i−1} = c, w_i = w)

  p_ML(w_i = w | w_{i−2:i−1} = c; θ) = F_{c,w} / F_{c,•}

Page 30:

Notation

- c = w_{i−n+1:i−1}: the context.
- c′ = w_{i−n+2:i−1}: the context without its first word.
- c′′ = w_{i−n+3:i−1}: the context without its first two words.
- [x]_+ = max{0, x}: positive part (ReLU).
- F_{•,w} = ∑_c F_{c,w}: sum over a dimension.
- Maximum likelihood:

  p_ML(w|c) = F_{c,w} / F_{c,•}

- Non-zero count indicator, for all c, w:

  N_{c,w} = 1(F_{c,w} > 0)

Page 31:

NGram Models

It is common to go up to 5-grams,

F_{c,w} = ∑_i 1(w_{i−4:i−1} = c, w_i = w)

The count matrix becomes very sparse at 5-grams.

Page 32:

Google 1T

Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Size compressed (counts only): 24 GB
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663

Page 33:

Contents

Language Modeling

NGram Models

Smoothing

Language Models in Practice

Page 34:

NGram Models

- Maximum-likelihood models work terribly.
- N-gram models are sparse; we need to handle unseen cases.
- The presentation follows the work of Chen and Goodman (1999).

Page 35:

Count Modifications

- Laplace smoothing:

  F_{c,w} ← F_{c,w} + α for all c, w

- Good-Turing smoothing (Good, 1953):

  F_{c,w} ← (F_{c,w} + 1) · hist(F_{c,w} + 1) / hist(F_{c,w}) for all c, w
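A minimal sketch of Laplace smoothing on toy bigram counts (Good-Turing follows the same count-rewriting pattern); α = 1 and the five-word vocabulary are illustrative:

```python
from collections import Counter, defaultdict

def p_laplace(F, vocab, c, w, alpha=1.0):
    """(F_{c,w} + alpha) / (F_{c,.} + alpha * |V|): every pair gets pseudo-count alpha."""
    return (F[c][w] + alpha) / (sum(F[c].values()) + alpha * len(vocab))

F = defaultdict(Counter)
F["the"].update({"dog": 2, "cat": 1})              # toy counts
vocab = {"the", "dog", "cat", "runs", "barks"}
print(p_laplace(F, vocab, "the", "runs"))          # unseen pair still gets mass: 1/8
```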

Page 36:

Real Issues: N-Gram Sparsity

- Histogram of trigram counts [figure]

Page 37:

Idea 1: Interpolation

For trigrams:

p_interp(w|c) = λ_1 p_ML(w|c) + λ_2 p_ML(w|c′) + λ_3 p_ML(w|c′′)

Ensure that the λs form a convex combination:

∑_i λ_i = 1, λ_i ≥ 0 for all i

Page 38:

Idea 1: Interpolation (Jelinek-Mercer Smoothing)

Can write recursively,

p_interp(w|c) = λ p_ML(w|c) + (1 − λ) p_interp(w|c′)

Ensure that λ forms a convex combination: 0 ≤ λ ≤ 1.
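A minimal sketch of the recursion, with ML estimates per context order stored in a plain dict (toy probabilities; a single shared λ here, though the next slide discusses richer choices):

```python
def p_interp(w, c, p_ml, lam=0.7):
    """p_interp(w|c) = lam * p_ML(w|c) + (1 - lam) * p_interp(w|c'),
    shortening the context until the unigram is reached."""
    base = p_ml.get(c, {}).get(w, 0.0)
    if not c:                                # unigram: nothing left to back off to
        return base
    return lam * base + (1 - lam) * p_interp(w, c[1:], p_ml, lam)

# p_ml[context][word]: toy maximum-likelihood estimates at each order
p_ml = {
    ("the", "dog"): {"barks": 0.5, "sleeps": 0.5},
    ("dog",): {"barks": 0.4, "sleeps": 0.4, "runs": 0.2},
    (): {"barks": 0.1, "sleeps": 0.1, "runs": 0.1, "the": 0.3, "dog": 0.4},
}
print(p_interp("barks", ("the", "dog"), p_ml))   # 0.443
```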

Page 39:

How to set parameters: Validation Tuning

- Treat the λs as hyperparameters.
- Can estimate λ using EM (or gradients directly).
- Alternatively: grid search on held-out data.
- Can add more parameters, λ(c, w).
- Language models are often stored with a probability and a backoff value (λ).

Page 40:

How to set parameters: Witten-Bell

Define λ(c, w) as a function of c and w:

(1 − λ) = N_{c,•} / (N_{c,•} + F_{c,•})

- N_{c,•}: the number of unique word types seen with context c.

p_wb(w|c) = (F_{c,w} + N_{c,•} · p_wb(w|c′)) / (F_{c,•} + N_{c,•})

The mass given to the lower-order model is estimated as proportional to the rate of seeing a new word type after this context.
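A minimal sketch of the Witten-Bell recursion on toy counts; bottoming out in a uniform distribution over a made-up vocabulary size is an assumption of this sketch:

```python
from collections import Counter, defaultdict

F = defaultdict(Counter)                       # F[c][w], contexts keyed by tuple
F[("the",)].update({"dog": 2, "cat": 1})
F[()].update({"dog": 2, "cat": 1, "the": 3, "barks": 1})
V = 5                                          # toy vocabulary size

def p_wb(w, c):
    """p_wb(w|c) = (F_{c,w} + N_{c,.} * p_wb(w|c')) / (F_{c,.} + N_{c,.})."""
    lower = p_wb(w, c[1:]) if c else 1.0 / V   # recurse on the shorter context
    Fc, Nc = sum(F[c].values()), len(F[c])     # N_{c,.}: unique types seen after c
    if Fc == 0:                                # unseen context: back off entirely
        return lower
    return (F[c][w] + Nc * lower) / (Fc + Nc)

print(p_wb("dog", ("the",)))
```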

Page 41:

Idea 2: Absolute Discounting

Similar in form to interpolation,

p_abs(w|c) = λ(w, c) p_ML(w|c) + η(c) p_abs(w|c′)

Delete counts and redistribute:

p_abs(w|c) = [F_{c,w} − D]_+ / F_{c,•} + η(c) p_abs(w|c′)

For a given c we need

∑_w [λ(w, c) p_1(w) + η(c) p_2(w)] = ∑_w λ(w, c) p_1(w) + η(c) = 1

In-class: To ensure a valid distribution, what are λ and η with D = 1?

Page 42:

Answer

λ(w, c) = [F_{c,w} − D]_+ / F_{c,w}

η(c) = 1 − ∑_w [F_{c,w} − D]_+ / F_{c,•} = ∑_w (F_{c,w} − [F_{c,w} − D]_+) / F_{c,•}

That is, total counts minus the discounted counts:

η(c) = D · N_{c,•} / F_{c,•}
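A minimal sketch with the λ and η just derived, on toy counts; D = 0.75 is a typical value, and the uniform bottom-out is an assumption of the sketch:

```python
from collections import Counter, defaultdict

F = defaultdict(Counter)
F[("the",)].update({"dog": 2, "cat": 1})
F[()].update({"dog": 2, "cat": 1, "the": 3})
V, D = 5, 0.75                                     # toy vocabulary size, discount

def p_abs(w, c):
    """p_abs(w|c) = [F_{c,w} - D]_+ / F_{c,.} + eta(c) * p_abs(w|c')."""
    lower = p_abs(w, c[1:]) if c else 1.0 / V
    Fc = sum(F[c].values())
    if Fc == 0:                                    # unseen context: back off entirely
        return lower
    Nc = sum(1 for v in F[c].values() if v > 0)    # N_{c,.}
    eta = D * Nc / Fc                              # mass freed by discounting
    return max(F[c][w] - D, 0.0) / Fc + eta * lower

print(p_abs("dog", ("the",)))
```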

Page 43:

Lower-Order Smoothing Issue

Consider the example: San Francisco and Los Francisco.

Would like:

- p(w_{i−1} = San, w_i = Francisco) to be high
- p(w_{i−1} = Los, w_i = Francisco) to be very low

However, interpolation alone doesn't ensure this. Why? The backed-off term

(1 − λ) p(w_i = Francisco)

could be quite high, from seeing San Francisco many times.


Page 45:

Kneser-Ney Smoothing

- Uses absolute discounting instead of ML.
- However, we want the distribution to match ML on lower-order terms.
- Ensure this by marginalizing out and matching the lower-order distribution.
- Uses a different distribution, KN′, for the lower-order term.
- The most commonly used smoothing technique (details follow).

Page 46:

Review: Chain-Rule and Marginalization

Chain rule:

p(X, Y) = p(X|Y) p(Y)

Marginalization:

p(X, Y) = ∑_{z∈Z} p(X, Y, Z = z)

Marginal matching constraint: two distributions can differ,

p_A(X, Y) ≠ p_B(X, Y),

while still matching a marginal:

∑_y p_A(X, Y = y) = p_B(X)

Page 47:

Kneser-Ney Smoothing

Main idea: match the ML marginals for all c′, w:

∑_{c1} p_KN(c1, c′, w) = ∑_{c1} p_KN(w|c) p_ML(c) = p_ML(c′, w)

∑_{c1} p_KN(w|c) F_{c,•} / F_{•,•} = F_{c′w,•} / F_{•,•}

(Here the n-gram is [ c1 c′ w ], with full context c = c1 c′.)

Page 48:

Kneser-Ney Smoothing

p_KN(w|c) = [F_{c,w} − D]_+ / F_{c,•} + η(c) p_KN′(w|c′)

Substituting η(c) = D N_{c,•} / F_{c,•} into the marginal constraint:

F_{c′w,•} = ∑_{c1} F_{c,•} p_KN(w|c)

          = ∑_{c1} F_{c,•} [ [F_{c,w} − D]_+ / F_{c,•} + (D / F_{c,•}) N_{c,•} p_KN′(w|c′) ]

          = ∑_{c1 : N_{c,w} > 0} (F_{c,w} − D) + ∑_{c1} D N_{c,•} p_KN′(w|c′)

          = F_{c′,w} − N_{•c′,w} D + D N_{•c′,•} p_KN′(w|c′)

Page 49:

Kneser-Ney Smoothing

Final equation for the unigram case:

p_KN′(w) = N_{•,w} / N_{•,•}

- Intuition: the probability of w is proportional to the number of unique bigrams ending in w.

p_KN′(w|c′) = N_{•c′,w} / N_{•c′,•}

- Intuition: the probability is proportional to the number of unique n-grams (with middle c′) ending in w.
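The continuation counts are cheap to read off the bigram table; a minimal sketch of p_KN′ on toy counts that echo the San Francisco example:

```python
from collections import Counter

# toy bigram counts: "francisco" is frequent but follows only one context
bigrams = Counter({("the", "dog"): 20, ("a", "dog"): 5, ("san", "francisco"): 25})

N_cont = Counter(w for (_, w) in bigrams)    # N_{.,w}: distinct contexts preceding w
N_total = sum(N_cont.values())               # N_{.,.}: total distinct bigram types

def p_kn_unigram(w):
    """p_KN'(w) = N_{.,w} / N_{.,.}"""
    return N_cont[w] / N_total

print(p_kn_unigram("dog"), p_kn_unigram("francisco"))   # 2/3 vs 1/3
```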

Page 50:

Modified Kneser-Ney

Set D based on F_{c,w}:

D = 0      if F_{c,w} = 0
D = D_1    if F_{c,w} = 1
D = D_2    if F_{c,w} = 2
D = D_{3+} if F_{c,w} ≥ 3

- The modifications to the formula are in the paper.

Page 51:

Contents

Language Modeling

NGram Models

Smoothing

Language Models in Practice

Page 52:

Language Model Issues

1. LMs become very sparse.
   - Obviously cannot use dense matrices (|V| × |V| × |V| > 10^12).
2. LMs are very big.
3. Lookup speed is crucial.

Page 53:

Trie Data structure

ROOT
- a
  - mouse (c7)
  - cat
    - meows (c6)
  - dog
    - runs (c5)
- the
  - mouse (c4)
  - cat
    - runs (c3)
  - dog
    - sleeps (c2)
    - barks (c1)

Issues?

Page 54:

Reverse Trie Data structure

ROOT
- runs (c)
  - cat (c)
    - the (c)
  - dog (c)
    - a (c)
- meows (c)
  - cat (c)
    - a (c)
- sleeps (c)
  - dog (c)
    - the (c)
- barks (c)
  - dog (c)
    - the (c)
- mouse (c)
  - the (c)

Used in several standard language modeling toolkits.
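A sketch of a reverse trie as nested dicts (a toy layout, not any particular toolkit's format): n-grams are inserted and looked up from the last word backwards, so backing off from (the, dog, barks) to (dog, barks) walks a prefix of the same path.

```python
def insert(trie, ngram, value):
    """Walk from the last word to the first, storing the value at the leaf."""
    node = trie
    for w in reversed(ngram):
        node = node.setdefault(w, {})
    node["__val__"] = value

def lookup(trie, ngram):
    """Return the stored value, or None if the n-gram is absent."""
    node = trie
    for w in reversed(ngram):
        if w not in node:
            return None
        node = node[w]
    return node.get("__val__")

trie = {}
insert(trie, ("the", "dog", "barks"), "c1")
insert(trie, ("the", "dog", "sleeps"), "c2")
print(lookup(trie, ("the", "dog", "barks")))   # c1
print(lookup(trie, ("a", "dog", "barks")))     # None
```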

Page 55:

Efficiency: Hash Tables

- KenLM finds it more efficient to hash n-grams directly.
- All n-grams of a given order are kept in a hash table.
- Fast linear-probing hash tables make this work well.

Page 56:

Quantization

- Memory issues are often a concern.
- Probabilities are kept in log space.
- KenLM quantizes from 32 bits down to fewer (others use 8 bits).
- Quantization is done by sorting, binning, and averaging.
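A sketch of that recipe: sort the stored log-probabilities, cut them into 2^b equal-size bins, and replace each value by its bin's average (toy data and bin count; real toolkits differ in details):

```python
import numpy as np

def quantize(values, bits=2):
    """Sort, split into 2**bits equal-size bins, keep per-value bin ids + a codebook."""
    order = np.argsort(values)
    bins = np.array_split(order, 2 ** bits)            # equal-size bins in sorted order
    codebook = np.array([values[b].mean() for b in bins])
    codes = np.empty(len(values), dtype=np.uint8)
    for i, b in enumerate(bins):
        codes[b] = i                                   # each float becomes a small int
    return codes, codebook

logprobs = np.log(np.random.default_rng(0).random(16))
codes, codebook = quantize(logprobs)
approx = codebook[codes]                               # dequantized log-probabilities
```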

Page 57:

Finite State Automata

Page 58:

Conclusion

Today,

- Language modeling
- Various ideas in smoothing
- Tricks for efficient implementation

Next time,

- Neural network language models