
Language Modeling 2

CS287

Review: Windowed Setup

Let $\mathcal{V}$ be the vocabulary of English and let $s$ score any window of size $d_{win} = 5$. If we see the phrase

[ the dog walks to the ]

it should score higher under $s$ than

[ the dog house to the ]

[ the dog cats to the ]

[ the dog skips to the ]

...

Review: Continuous Bag-of-Words (CBOW)

$$\hat{y} = \mathrm{softmax}\!\left(\frac{\sum_i x^0_i W^0}{d_{win} - 1}\, W^1\right)$$

- Attempt to predict the middle word

[ the dog walks to the ]

Example: CBOW

$$x = \frac{v(w_3) + v(w_4) + v(w_6) + v(w_7)}{d_{win} - 1} \qquad y = \delta(w_5)$$

$W^1$ is no longer partitioned by row (order is lost).
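For concreteness, here is a minimal numpy sketch of the CBOW forward pass above. The vocabulary size, hidden dimension, random weights, and toy word ids are assumptions for illustration, not values from the lecture.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy sizes (assumed).
V, d_hid, d_win = 10, 8, 5
rng = np.random.default_rng(0)
W0 = rng.normal(size=(V, d_hid))   # input embeddings (lookup table)
W1 = rng.normal(size=(d_hid, V))   # output projection

# Context: the window with the middle word removed, as word ids (assumed).
context = [1, 2, 4, 5]             # stands in for w3, w4, w6, w7

# x = (v(w3) + v(w4) + v(w6) + v(w7)) / (d_win - 1)
x = W0[context].sum(axis=0) / (d_win - 1)
y_hat = softmax(x @ W1)            # distribution over the middle word
print(y_hat.shape, y_hat.sum())    # (10,) 1.0
```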

Review: Softmax Issues

Use a softmax to force a distribution,

$$\mathrm{softmax}(z) = \frac{\exp(z)}{\sum_{c \in \mathcal{C}} \exp(z_c)} \qquad \log \mathrm{softmax}(z) = z - \log \sum_{c \in \mathcal{C}} \exp(z_c)$$

- Issue: the class set $\mathcal{C}$ is huge.
- For C&W, 100,000 types; for word2vec, 1,000,000 types.
- Note: the largest dataset is 6 billion words.

Quiz

We have trained a depth-two balanced softmax tree. Give an algorithm and its time complexity for:

- Computing the optimal class decision.
- Greedily approximating the optimal class decision.
- Appropriately sampling a word.
- Computing the full distribution over classes.

Answers I

- Computing the optimal class decision, i.e. $\arg\max_{y} p(y \mid x)$?
  - Requires full enumeration, $O(|\mathcal{V}|)$:

$$\arg\max_{y} p(y \mid x) = \arg\max_{y} p(y \mid C, x)\, p(C \mid x)$$

- Greedily approximating the optimal class decision.
  - Walk the tree, $O(\sqrt{|\mathcal{V}|})$:

$$c^* = \arg\max_{c} p(C = c \mid x) \qquad y^* = \arg\max_{y} p(y \mid C = c^*, x)$$

Answers II

- Appropriately sampling a word $y \sim p(y \mid x)$?
  - Walk the tree, $O(\sqrt{|\mathcal{V}|})$:

$$c \sim p(C \mid x) \qquad y \sim p(y \mid C = c, x)$$

- Computing the full distribution over classes?
  - Enumerate, $O(|\mathcal{V}|)$:

$$p(y = \delta(w) \mid x) = p(C = c \mid x)\, p(y \mid C = c, x)$$
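A minimal numpy sketch of the three operations above for a depth-two (class-factored) softmax: exact arg max by enumeration, the greedy tree walk, and ancestral sampling. The class assignment (contiguous blocks of words) and the random class/word distributions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 16                                   # toy vocabulary size (assumed)
C = int(np.sqrt(V))                      # ~sqrt(|V|) classes, V // C words each
word2class = np.repeat(np.arange(C), V // C)   # assumed: contiguous blocks

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Stand-in model outputs: p(C|x) and p(y|C=c, x) for each class c.
p_class = softmax(rng.normal(size=C))
p_word_given_class = np.stack([softmax(rng.normal(size=V // C)) for _ in range(C)])

# Exact arg max: enumerate all words, O(|V|).
p_full = np.array([p_class[word2class[y]] *
                   p_word_given_class[word2class[y], y % (V // C)]
                   for y in range(V)])
y_exact = int(np.argmax(p_full))

# Greedy approximation: best class, then best word in that class, O(sqrt(|V|)).
c_star = int(np.argmax(p_class))
y_greedy = c_star * (V // C) + int(np.argmax(p_word_given_class[c_star]))

# Ancestral sampling: sample the class, then a word within it, O(sqrt(|V|)).
c = rng.choice(C, p=p_class)
y_sample = c * (V // C) + rng.choice(V // C, p=p_word_given_class[c])

print(y_exact, y_greedy, y_sample, p_full.sum())   # p_full sums to 1
```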

Contents

Language Modeling

NGram Models

Smoothing

Language Models in Practice

Language Modeling Task

Given a sequence of text, give a probability distribution over the next word.

The Shannon game. Estimate the probability of the next letter/word

given the previous.

THE ROOM WAS NOT VERY LIGHT A SMALL OBLONG READING LAMP ON THE DESK SHED GLOW ON POLISHED

Language Modeling

Shannon (1948) Mathematical Model of Communication

We may consider a discrete source, therefore, to be represented by a stochastic process. Conversely, any stochastic process which produces a discrete sequence of symbols chosen from a finite set may be considered a discrete source. This will include such cases as:

1. Natural written languages such as English, German, Chinese. ...

Shannon’s Babblers I

4. Third-order approximation (trigram structure as in English).

IN NO 1ST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE

5. First-Order Word Approximation. Rather than continue with tetragram, ..., n-gram structure it is easier and better to jump at this point to word units. Here words are chosen independently but with their appropriate frequencies.

REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.

Shannon’s Babblers II

6. Second-Order Word Approximation. The word transition probabilities are correct but no further structure is included.

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED

The resemblance to ordinary English text increases quite noticeably at each of the above steps.

Smoothness Image/Language

It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts. - Sherlock Holmes, A Scandal in Bohemia

Repair Image / Language (Chatterjee et al., 2009)

It is a capital mistake to theorize before one has . . .

Repair Image / Language (Chatterjee et al., 2009)

108 938 285 28 184 29 593 219 58 772 . . .

Language Modeling

Crucially important for:

- Speech Recognition
- Machine Translation
- Many deep learning applications
  - Captioning
  - Dialogue
  - Summarization
  - ...

Statistical Model of Speech

How permanent are their records?

Transcription: Help peppermint on their records
Transcription: How permanent are their records

Language Modeling Formally

Goal: Compute the probability of a sentence.

- Factorization:

$$p(w_1, \ldots, w_n) = \prod_{t=1}^{n} p(w_t \mid w_1, \ldots, w_{t-1})$$

- Estimate the probability of the next word, conditioned on the prefix,

$$p(w_t \mid w_1, \ldots, w_{t-1}; \theta)$$

Machine Learning Setup

Multi-class prediction problem,

$$(x_1, y_1), \ldots, (x_n, y_n)$$

- $y_i$: the one-hot next word
- $x_i$: representation of the prefix $(w_1, \ldots, w_{t-1})$

Challenges:

- How do you represent the input?
- Smoothing is crucially important.
- The output space is very large (next class).


Problem Metric

Previously, we used accuracy as a metric.

Language modeling uses a version of average negative log-likelihood.

- For test data $w_1, \ldots, w_n$:

$$NLL = -\frac{1}{n} \sum_{i=1}^{n} \log p(w_i \mid w_1, \ldots, w_{i-1})$$

We actually report perplexity,

$$perp = \exp\!\left(-\frac{1}{n} \sum_{i=1}^{n} \log p(w_i \mid w_1, \ldots, w_{i-1})\right)$$

This requires modeling the full distribution, as opposed to the argmax (hinge loss).

Perplexity: Intuition

- Effective uniform distribution size.
- If words were uniform: perp = |V| = 10000
- Using a unigram model: perp ≈ 400
- log2 perp gives the average number of bits needed per word.
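A minimal sketch of the NLL and perplexity computation above, assuming we already have the model's probability for each test token; the probability values below are made up for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity from the model's probability of each test token:
    perp = exp(-(1/n) * sum_i log p(w_i | w_1..w_{i-1}))."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Toy example: p(w_i | prefix) for a 4-token test sequence (assumed values).
probs = [0.2, 0.05, 0.1, 0.4]
print(perplexity(probs))   # ~7.1, the "effective uniform vocabulary size"
```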

Contents

Language Modeling

NGram Models

Smoothing

Language Models in Practice

n-gram Models

In practice the representation doesn't use the whole prefix $(w_1, \ldots, w_{t-1})$:

$$p(w_i \mid w_1, \ldots, w_{i-1}) \approx p(w_i \mid w_{i-n+1}, \ldots, w_{i-1}; \theta)$$

- We call this an n-gram model.
- $w_{i-n+1}, \ldots, w_{i-1}$: the context

Bigram Models

Back to count-based multinomial estimation,

- Feature set $\mathcal{F}$: the previous word
- Input vector $x$ is sparse
- Count matrix $F \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$
- Counts come from the training data:

$$F_{c,w} = \sum_i \mathbf{1}(w_{i-1} = c,\, w_i = w)$$

$$p_{ML}(w_i \mid w_{i-1} = c; \theta) = \frac{F_{c,w}}{F_{c,\bullet}}$$
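A minimal sketch of the count-based bigram estimate above, using a nested dictionary rather than a dense |V| x |V| matrix; the tiny corpus is an assumption for illustration.

```python
from collections import defaultdict

def bigram_counts(corpus):
    # F[c][w] = number of times word w follows context word c.
    F = defaultdict(lambda: defaultdict(int))
    for sent in corpus:
        for prev, cur in zip(sent, sent[1:]):
            F[prev][cur] += 1
    return F

def p_ml(F, c, w):
    # p_ML(w | c) = F[c][w] / F[c][.]  (zero if the context is unseen)
    total = sum(F[c].values())
    return F[c][w] / total if total else 0.0

corpus = [["the", "dog", "barks"], ["the", "dog", "sleeps"], ["the", "cat", "runs"]]
F = bigram_counts(corpus)
print(p_ml(F, "the", "dog"))   # 2/3
print(p_ml(F, "dog", "runs"))  # 0.0 -- an unseen bigram, hence the need for smoothing
```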

Trigram Models

- Feature set $\mathcal{F}$: the previous two words (conjunction)
- Input vector $x$ is sparse
- Count matrix $F \in \mathbb{R}^{(|\mathcal{V}| \times |\mathcal{V}|) \times |\mathcal{V}|}$

$$F_{c,w} = \sum_i \mathbf{1}(w_{i-2:i-1} = c,\, w_i = w)$$

$$p_{ML}(w_i \mid w_{i-2:i-1} = c; \theta) = \frac{F_{c,w}}{F_{c,\bullet}}$$

Notation

- $c = w_{i-n+1:i-1}$: the context
- $c' = w_{i-n+2:i-1}$: the context without its first word
- $c'' = w_{i-n+3:i-1}$: the context without its first two words
- $[x]_+ = \max\{0, x\}$: positive part (ReLU)
- $F_{\bullet,w} = \sum_c F_{c,w}$: sum over a dimension
- Maximum likelihood:

$$p_{ML}(w \mid c) = F_{c,w} / F_{c,\bullet}$$

- Non-zero count indicator, for all $c, w$:

$$N_{c,w} = \mathbf{1}(F_{c,w} > 0)$$

NGram Models

It is common to go up to 5-grams,

$$F_{c,w} = \sum_i \mathbf{1}(w_{i-5+1:i-1} = c,\, w_i = w)$$

The count matrix becomes very sparse at 5-grams.

Google 1T

Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Size compressed (counts only): 24 GB
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663

Contents

Language Modeling

NGram Models

Smoothing

Language Models in Practice

NGram Models

- Maximum likelihood models work terribly.
- N-gram models are sparse; we need to handle unseen cases.
- The presentation follows the work of Chen and Goodman (1999).

Count Modifications

- Laplace Smoothing:

$$F_{c,w} \leftarrow F_{c,w} + \alpha \quad \text{for all } c, w$$

- Good-Turing Smoothing (Good, 1953):

$$F_{c,w} \leftarrow (F_{c,w} + 1)\, \frac{\mathrm{hist}(F_{c,w} + 1)}{\mathrm{hist}(F_{c,w})} \quad \text{for all } c, w$$
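A minimal sketch of the Laplace count modification applied to the bigram estimate; the toy counts, the vocabulary, and alpha = 1 are assumptions for illustration.

```python
def p_laplace(F, vocab, c, w, alpha=1.0):
    """Laplace-smoothed bigram estimate:
    p(w | c) = (F[c][w] + alpha) / (F[c][.] + alpha * |V|)."""
    row = F.get(c, {})
    return (row.get(w, 0) + alpha) / (sum(row.values()) + alpha * len(vocab))

# Toy counts, same shape as the bigram sketch above (assumed).
F = {"the": {"dog": 2, "cat": 1}, "dog": {"barks": 1, "sleeps": 1}}
vocab = {"the", "dog", "cat", "barks", "sleeps", "runs"}
print(p_laplace(F, vocab, "dog", "runs"))   # 1/8 = 0.125 instead of 0
print(p_laplace(F, vocab, "dog", "barks"))  # 2/8 = 0.25
```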

Real Issues N-Gram Sparsity

- Histogram of trigram counts

Idea 1: Interpolation

For trigrams:

$$p_{interp}(w \mid c) = \lambda_1 p_{ML}(w \mid c) + \lambda_2 p_{ML}(w \mid c') + \lambda_3 p_{ML}(w \mid c'')$$

Ensure that the $\lambda$s form a convex combination:

$$\sum_i \lambda_i = 1 \qquad \lambda_i \geq 0 \text{ for all } i$$

Idea 1: Interpolation (Jelinek-Mercer Smoothing)

Can write recursively,

$$p_{interp}(w \mid c) = \lambda\, p_{ML}(w \mid c) + (1 - \lambda)\, p_{interp}(w \mid c')$$

Ensure that $\lambda$ forms a convex combination:

$$0 \leq \lambda \leq 1$$
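A minimal sketch of the recursive Jelinek-Mercer interpolation above with a single fixed lambda; the count tables (keyed by context tuples) and the lambda value are assumptions for illustration.

```python
def p_ml(F, context, w):
    # Maximum-likelihood estimate F[c][w] / F[c][.] with tuple contexts;
    # the empty context () holds the unigram counts.
    row = F.get(context, {})
    total = sum(row.values())
    return row.get(w, 0) / total if total else 0.0

def p_interp(F, context, w, lam=0.7):
    """Jelinek-Mercer: p(w|c) = lam * p_ML(w|c) + (1 - lam) * p_interp(w|c'),
    where c' drops the first (oldest) word of the context."""
    if len(context) == 0:
        return p_ml(F, (), w)              # base case: unigram ML
    return lam * p_ml(F, context, w) + (1 - lam) * p_interp(F, context[1:], w, lam)

# Toy count tables keyed by context tuples (assumed, not real data).
F = {
    (): {"dog": 3, "cat": 2, "runs": 1},     # unigram counts
    ("the",): {"dog": 2, "cat": 1},          # bigram counts
    ("saw", "the"): {"dog": 1},              # trigram counts
}
print(p_interp(F, ("saw", "the"), "cat"))    # nonzero despite the unseen trigram
```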

How to set parameters: Validation Tuning

- Treat $\lambda$ as a hyper-parameter.
- Can estimate $\lambda$ using EM (or gradients directly).
- Alternatively: grid search on held-out data.
- Can add more parameters $\lambda(c, w)$.
- Language models are often stored with a probability and a backoff value ($\lambda$).

How to set parameters: Witten-Bell

Define $\lambda(c, w)$ as a function of $c$ and $w$:

$$(1 - \lambda) = \frac{N_{c,\bullet}}{N_{c,\bullet} + F_{c,\bullet}}$$

- $N_{c,\bullet}$: the number of unique word types seen with context $c$

$$p_{wb}(w \mid c) = \frac{F_{c,w} + N_{c,\bullet}\, p_{wb}(w \mid c')}{F_{c,\bullet} + N_{c,\bullet}}$$

Interpolation treats new events as proportional to the chance of seeing a new word type.
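A minimal sketch of the Witten-Bell recursion above, where the weight on the lower-order model is N_{c,.} / (F_{c,.} + N_{c,.}); the toy counts and the uniform base case for unseen unigrams are assumptions.

```python
def p_wb(F, context, w, vocab_size):
    """Witten-Bell: p(w|c) = (F[c][w] + N[c] * p_wb(w|c')) / (F[c][.] + N[c]),
    where N[c] is the number of distinct word types seen after context c."""
    if len(context) == 0:
        row = F.get((), {})
        total = sum(row.values())
        # Base case: unigram ML, with a uniform fallback for unseen words (assumed).
        return row.get(w, 0) / total if total else 1.0 / vocab_size
    row = F.get(context, {})
    F_total = sum(row.values())          # F_{c,.}
    N_types = len(row)                   # N_{c,.}
    lower = p_wb(F, context[1:], w, vocab_size)
    if F_total + N_types == 0:
        return lower                     # unseen context: defer entirely
    return (row.get(w, 0) + N_types * lower) / (F_total + N_types)

F = {
    (): {"dog": 3, "cat": 2, "runs": 1},
    ("the",): {"dog": 2, "cat": 1},
}
print(p_wb(F, ("the",), "runs", vocab_size=6))   # backs off to the unigram estimate
```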

Idea 2: Absolute Discounting

Similar form to interpolation:

$$p_{abs}(w \mid c) = \lambda(w, c)\, p_{ML}(w \mid c) + \eta(c)\, p_{abs}(w \mid c')$$

Delete counts and redistribute:

$$p_{abs}(w \mid c) = \frac{[F_{c,w} - D]_+}{F_{c,\bullet}} + \eta(c)\, p_{abs}(w \mid c')$$

Need, for a given $c$,

$$\sum_w \left(\lambda(w, c)\, p_1(w) + \eta(c)\, p_2(w)\right) = \sum_w \lambda(w, c)\, p_1(w) + \eta(c) = 1$$

In-class: to ensure a valid distribution, what are $\lambda$ and $\eta$ with $D = 1$?

Answer

$$\lambda(w, c) = \frac{[F_{c,w} - D]_+}{F_{c,w}}$$

$$\eta(c) = 1 - \sum_w \frac{[F_{c,w} - D]_+}{F_{c,\bullet}} = \frac{F_{c,\bullet} - \sum_w [F_{c,w} - D]_+}{F_{c,\bullet}}$$

Total counts minus the discounted counts: with $D$ subtracted from each of the $N_{c,\bullet}$ seen types,

$$\eta(c) = \frac{D \times N_{c,\bullet}}{F_{c,\bullet}}$$
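A minimal sketch of absolute discounting with a fixed D: subtract D from every seen count and give the freed mass eta(c) = D * N_{c,.} / F_{c,.} to the lower-order model. The toy counts and D = 0.75 are assumptions for illustration.

```python
def p_abs(F, context, w, D=0.75, vocab_size=6):
    """Absolute discounting:
    p(w|c) = max(F[c][w] - D, 0) / F[c][.] + (D * N[c] / F[c][.]) * p_abs(w|c')."""
    if len(context) == 0:
        row = F.get((), {})
        total = sum(row.values())
        return row.get(w, 0) / total if total else 1.0 / vocab_size
    row = F.get(context, {})
    F_total = sum(row.values())
    if F_total == 0:
        return p_abs(F, context[1:], w, D, vocab_size)
    discounted = max(row.get(w, 0) - D, 0) / F_total
    eta = D * len(row) / F_total                  # mass freed by discounting
    return discounted + eta * p_abs(F, context[1:], w, D, vocab_size)

F = {
    (): {"dog": 3, "cat": 2, "runs": 1},
    ("the",): {"dog": 2, "cat": 1},
}
print(p_abs(F, ("the",), "dog"))    # seen bigram, slightly discounted
print(p_abs(F, ("the",), "runs"))   # unseen bigram, gets the redistributed mass
```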

Lower-Order Smoothing Issue

Consider the example: San Francisco and Los Francisco.

Would like:

- p(w_{i-1} = San, w_i = Francisco) to be high
- p(w_{i-1} = Los, w_i = Francisco) to be very low

However, interpolation alone doesn't ensure this. Why?

$$(1 - \lambda)\, p(w_i = \text{Francisco})$$

could be quite high, from seeing San Francisco many times.


Kneser-Ney Smoothing

- Uses absolute discounting instead of ML.
- However, we want the distribution to match ML on lower-order terms.
- Ensure this by marginalizing out and matching the lower-order distribution.
- Uses a different function for the lower-order term, KN'.
- The most commonly used smoothing technique (details follow).

Review: Chain-Rule and Marginalization

Chain rule:

Chain rule:

$$p(X, Y) = p(X \mid Y)\, p(Y)$$

Marginalization:

$$p(X, Y) = \sum_{z \in \mathcal{Z}} p(X, Y, Z = z)$$

Marginal matching constraint:

$$p_A(X, Y) \neq p_B(X, Y) \qquad \sum_y p_A(X, Y = y) = p_B(X)$$

Kneser-Ney Smoothing

Main idea: match the ML marginals for all $c'$:

$$\sum_{c_1} p_{KN}(c_1, c', w) = \sum_{c_1} p_{KN}(w \mid c)\, p_{ML}(c) = p_{ML}(c', w)$$

$$\sum_{c_1} p_{KN}(w \mid c)\, \frac{F_{c,\bullet}}{F_{\bullet,\bullet}} = \frac{F_{c'w,\bullet}}{F_{\bullet,\bullet}}$$

$$[\ c_1 \ \ c' \ \ w\ ]$$

Kneser-Ney Smoothing

$$p_{KN}(w \mid c) = \frac{[F_{c,w} - D]_+}{F_{c,\bullet}} + \eta(c)\, p_{KN'}(w \mid c')$$

$$\begin{aligned}
F_{c'w,\bullet} &= \sum_{c_1} F_{c,\bullet}\, [p_{KN}(w \mid c)] \\
&= \sum_{c_1} F_{c,\bullet} \left[ \frac{[F_{c,w} - D]_+}{F_{c,\bullet}} + \frac{D}{F_{c,\bullet}}\, N_{c,\bullet}\, p_{KN'}(w \mid c') \right] \\
&= \sum_{c_1 : N_{c,w} > 0} (F_{c,w} - D) + \sum_{c_1} D\, N_{c,\bullet}\, p_{KN'}(w \mid c') \\
&= F_{c',w} - N_{\bullet c',w} D + D\, N_{\bullet c',\bullet}\, p_{KN'}(w \mid c')
\end{aligned}$$

Kneser-Ney Smoothing

Final equation for the unigram case:

$$p_{KN'}(w) = N_{\bullet,w} / N_{\bullet,\bullet}$$

- Intuition: the probability of a unique bigram ending in $w$.

$$p_{KN'}(w \mid c') = N_{\bullet c',w} / N_{\bullet c',\bullet}$$

- Intuition: the probability of a unique n-gram (with middle $c'$) ending in $w$.
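A minimal sketch of the lower-order Kneser-Ney term p_KN'(w) = N_{.,w} / N_{.,.}, computed from bigram types; the toy bigrams (echoing the San Francisco example) are assumptions for illustration.

```python
from collections import Counter

def kn_continuation(bigrams):
    """p_KN'(w) = N_{.,w} / N_{.,.}: the number of distinct bigram types that
    end in w, divided by the total number of distinct bigram types."""
    types = set(bigrams)                          # distinct (c, w) pairs only
    ends_in = Counter(w for _, w in types)        # N_{.,w}
    total_types = len(types)                      # N_{.,.}
    return {w: n / total_types for w, n in ends_in.items()}

# "Francisco" is frequent but almost always follows "San", so its continuation
# probability stays low even though its raw unigram count is high.
bigrams = [("San", "Francisco")] * 10 + [("the", "dog"), ("a", "dog"), ("my", "dog")]
p = kn_continuation(bigrams)
print(p.get("Francisco"), p.get("dog"))   # 1/4 vs 3/4
```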

Modified Kneser-Ney

Set $D$ based on $F_{c,w}$:

$$D = \begin{cases} 0 & F_{c,w} = 0 \\ D_1 & F_{c,w} = 1 \\ D_2 & F_{c,w} = 2 \\ D_{3+} & F_{c,w} \geq 3 \end{cases}$$

- Modifications to the formula are in the paper.

Contents

Language Modeling

NGram Models

Smoothing

Language Models in Practice

Language Model Issues

1. LMs become very sparse.
   - Obviously cannot use dense matrices ($|\mathcal{V}| \times |\mathcal{V}| \times |\mathcal{V}| > 10^{12}$).
2. LMs are very big.
3. Lookup speed is crucial.

Trie Data structure

ROOT
  a
    mouse (c7)
    cat
      meows (c6)
    dog
      runs (c5)
  the
    mouse (c4)
    cat
      runs (c3)
    dog
      sleeps (c2)
      barks (c1)

Issues?

Reverse Trie Data structure

ROOT
  runs (c)
    cat (c)
      the (c)
    dog (c)
      a (c)
  meows (c)
    cat (c)
      a (c)
  sleeps (c)
    dog (c)
      the (c)
  barks (c)
    dog (c)
      the (c)
  mouse (c)
    the (c)

Used in several standard language modeling toolkits.
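A minimal dict-based sketch of a reverse trie: each n-gram is stored last word first, so all n-grams ending in the same word share a prefix, as in the diagram above. The n-grams, the counts, and the "#count" sentinel key are assumptions for illustration, not how any particular toolkit lays out memory.

```python
def insert(trie, ngram, count):
    # Store the n-gram reversed: last word first.
    node = trie
    for word in reversed(ngram):
        node = node.setdefault(word, {})
    node["#count"] = count                 # sentinel key holding the count

def lookup(trie, ngram):
    node = trie
    for word in reversed(ngram):
        if word not in node:
            return None                    # unseen n-gram
        node = node[word]
    return node.get("#count")

trie = {}
insert(trie, ("the", "cat", "runs"), 3)
insert(trie, ("a", "dog", "runs"), 5)
insert(trie, ("the", "dog", "barks"), 1)

print(lookup(trie, ("a", "dog", "runs")))     # 5
print(lookup(trie, ("the", "dog", "runs")))   # None
print(sorted(trie["runs"].keys()))            # ['cat', 'dog'] share the suffix "runs"
```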

Efficiency: Hash Tables

- KenLM finds it more efficient to hash n-grams directly.
- All n-grams of a given order are kept in a hash table.
- Fast linear-probing hash tables make this work well.

Quantization

- Memory issues are often a concern.
- Probabilities are kept in log space.
- KenLM quantizes from 32 bits down to fewer (others use 8 bits).
- Quantization is done by sorting, binning, and averaging.
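A minimal sketch of sort-bin-average quantization: sort the log-probabilities, split them into 2^bits equal-size bins, and replace each value with its bin mean, keeping only a small codebook plus per-entry codes. The toy values and the 4-bit width are assumptions, not KenLM's actual layout.

```python
import numpy as np

def quantize(values, bits=4):
    """Sort, split into 2**bits equal-size bins, and average each bin.
    Returns (codebook, codes) where values[i] ~= codebook[codes[i]]."""
    values = np.asarray(values, dtype=np.float64)
    order = np.argsort(values)
    bins = np.array_split(order, 2 ** bits)       # contiguous bins of sorted values
    codebook = np.array([values[b].mean() for b in bins])
    codes = np.empty(len(values), dtype=np.uint8)
    for code, b in enumerate(bins):
        codes[b] = code
    return codebook, codes

# Toy log-probabilities (assumed), quantized to 4-bit codes.
rng = np.random.default_rng(0)
logprobs = np.log(rng.uniform(1e-6, 1.0, size=1000))
codebook, codes = quantize(logprobs, bits=4)
print(codebook.shape, codes.dtype)                # (16,) uint8
print(np.abs(codebook[codes] - logprobs).mean())  # small average reconstruction error
```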

Finite State Automata

Conclusion

Today,

- Language Modeling
- Various ideas in smoothing
- Tricks for efficient implementation

Next time,

- Neural Network Language Models
