
Chapter 6: N-GRAMS

Heshaam Faili
hfaili@ece.ut.ac.ir

University of Tehran


N-grams: Motivation

An n-gram is a stretch of text n words long.

Approximation of language: the information in n-grams tells us something about language, but doesn't capture the structure.

Efficient: finding and using every two-word collocation in a text, for example, is quick and easy to do.

N-grams can help in a variety of NLP applications:

Word prediction: n-grams can be used to aid in predicting the next word of an utterance, based on the previous n-1 words.

Useful for context-sensitive spelling correction, approximation of language, ...


Corpus-based NLP

Corpus (pl. corpora) = a computer-readable collection of text and/or speech, often with annotations.

We can use corpora to gather probabilities and other information about language use.

We can say that a corpus used to gather prior information is training data. Testing data, by contrast, is the data one uses to test the accuracy of a method.

We can distinguish types and tokens in a corpus:
type = a distinct word (e.g., like)
token = a distinct occurrence of a word (e.g., the type like might have 20,000 tokens in a corpus)
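
As a minimal sketch (the corpus string below is invented for illustration, not from the slides), the type/token distinction looks like this in Python:

```python
# Minimal sketch: counting types vs. tokens in a tiny corpus.
corpus = "I dreamed I saw the knights in armor and I saw the castle"

tokens = corpus.lower().split()      # every occurrence counts as a token
types = set(tokens)                  # distinct words only

print(f"tokens: {len(tokens)}")      # 13
print(f"types:  {len(types)}")       # 9 ('i', 'saw', 'the' repeat)
```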


Simple n-grams

Let's assume we want to predict the next word, based on the previous context of I dreamed I saw the knights in.

What we want to find is the likelihood of w8 being the next word, given that we've seen w1, ..., w7; in other words, P(w8 | w1, ..., w7).

In general, the probability of a whole word sequence decomposes as:
(1) P(w1, ..., wn) = P(w1) P(w2|w1) ... P(wn|w1, ..., wn-1)

But these probabilities are impractical to calculate: such long histories hardly ever occur in a corpus, if at all. (And it would be a lot of data to store, if we could calculate them.)


Unigrams

So, we can approximate these probabilities using a particular n-gram, for a given n. What should n be?

Unigrams (n = 1):
(2) P(wn|w1, ..., wn-1) ≈ P(wn)

Easy to calculate, but we have no contextual information:
(3) The quick brown fox jumped ...

We would like to say that over has a higher probability in this context than lazy does.


Bigrams

Bigrams (n = 2) are a better choice and still easy to calculate:
(4) P(wn|w1, ..., wn-1) ≈ P(wn|wn-1)
(5) P(over | The, quick, brown, fox, jumped) ≈ P(over | jumped)

And thus, we obtain for the probability of a sentence:
(6) P(w1, ..., wn) ≈ P(w1) P(w2|w1) P(w3|w2) ... P(wn|wn-1)


Markov models

A bigram model is also called a first-order Markov model because it has one element of memory (one token in the past).

Markov models are essentially weighted FSAs, i.e., the arcs between states have probabilities. The states in the FSA are words.

Much more on Markov models when we hit POS tagging ...


Bigram example

What is the probability of seeing the sentence The quick brown fox jumped over the lazy dog?

(7) P(The quick brown fox jumped over the lazy dog) = P(The | <start>) P(quick | The) P(brown | quick) ... P(dog | lazy)

Probabilities are generally small, so log probabilities are usually used.

Does this favor shorter sentences?
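
A minimal sketch of equation (7) in log space (the toy bigram probabilities below are invented for illustration; a real model would estimate them from a corpus, as on the MLE slide further down):

```python
import math

# Toy bigram probabilities, invented for illustration.
bigram_prob = {
    ("<start>", "the"): 0.2, ("the", "quick"): 0.01, ("quick", "brown"): 0.3,
    ("brown", "fox"): 0.2, ("fox", "jumped"): 0.1, ("jumped", "over"): 0.25,
    ("over", "the"): 0.4, ("the", "lazy"): 0.005, ("lazy", "dog"): 0.3,
}

def sentence_logprob(words):
    """Sum of log2 bigram probabilities, starting from <start>."""
    logp = 0.0
    for prev, cur in zip(["<start>"] + words, words):
        logp += math.log2(bigram_prob[(prev, cur)])
    return logp

sentence = "the quick brown fox jumped over the lazy dog".split()
print(sentence_logprob(sentence))   # a (negative) log probability
```

Since each bigram probability is at most 1, every additional word can only lower the total, so raw sentence probability does favor shorter sentences.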


Trigrams

If bigrams are good, then trigrams (n = 3) can be even better.

Wider context: P(know | did, he) vs. P(know | he)

Generally, trigrams are still short enough that we will have enough data to gather accurate probabilities.


Training n-gram models

Go through the corpus and calculate relative frequencies:
(8) P(wn|wn-1) = C(wn-1, wn) / C(wn-1)
(9) P(wn|wn-2, wn-1) = C(wn-2, wn-1, wn) / C(wn-2, wn-1)

This technique of gathering probabilities from a training corpus is called maximum likelihood estimation (MLE).
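
A minimal sketch of equation (8), assuming the training corpus is already a list of tokenized sentences (the toy corpus is invented for illustration):

```python
from collections import Counter

# Toy training corpus, invented for illustration.
sentences = [
    "the quick brown fox jumped over the lazy dog".split(),
    "the lazy dog slept".split(),
]

unigram_counts = Counter()
bigram_counts = Counter()
for sent in sentences:
    padded = ["<start>"] + sent
    unigram_counts.update(padded)
    bigram_counts.update(zip(padded, padded[1:]))

def mle_bigram(prev, word):
    """Equation (8): P(word | prev) = C(prev, word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(mle_bigram("the", "lazy"))   # C(the, lazy) = 2, C(the) = 3  ->  0.666...
```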


Maximum likelihood estimation (MLE)

The resulting parameter set is one in which the likelihood of the training set T given the model M, P(T|M), is maximized.

Example: Chinese occurs 400 times in the 1-million-word Brown corpus, so the MLE estimate is 400/1,000,000 = 0.0004.

0.0004 is not the best possible estimate of the probability of Chinese occurring in all situations, but it is the probability that makes it most likely that Chinese will occur 400 times in a million-word corpus.


Know your corpus

We mentioned earlier about having training data and testing data ... it's important to remember what your training data is when applying your technology to new data.

If you train your trigram model on Shakespeare, then you have learned the probabilities in Shakespeare, not the probabilities of English overall.

What corpus you use depends on what you want to do later.


Smoothing: Motivation

Let's assume that we have a good corpus and have trained a bigram model on it, i.e., learned MLE probabilities for bigrams.

But we won't have seen every possible bigram: lickety split is a possible English bigram, but it may not be in the corpus.

This is a problem of data sparseness: there are zero-probability bigrams which are actually possible bigrams in the language.

To account for this sparseness, we turn to smoothing techniques: making zero probabilities non-zero, i.e., adjusting probabilities to account for unseen data.


Add-One Smoothing

One way to smooth is to add a count of one to every bigram:
In order to still have a probability distribution, all probabilities need to sum to one.
So, we add the number of word types to the denominator (i.e., we added one to every type of bigram, so we need to account for all our numerator additions).

(10) P(wn|wn-1) = ( C(wn-1, wn) + 1 ) / ( C(wn-1) + V )

V = total number of word types that we might see


Add-One smoothing

Unsmoothed unigram MLE: P(wx) = C(wx) / Σi C(wi)

Adjusted count: c*i = (ci + 1) · N / (N + V)

Discount: dc = c* / c

Smoothed unigram probability: P*i = (ci + 1) / (N + V)

Bigram form: P(wn|wn-1) = ( C(wn-1, wn) + 1 ) / ( C(wn-1) + V )

(where N is the total number of tokens and V the number of word types)
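
A minimal sketch of the bigram form of equation (10); the counts below are the ones the toy MLE corpus above would produce, and only a few bigrams are listed for brevity:

```python
from collections import Counter

# Toy counts, invented for illustration (same corpus as the MLE sketch).
unigram_counts = Counter({"the": 3, "lazy": 2, "dog": 2, "quick": 1, "brown": 1,
                          "fox": 1, "jumped": 1, "over": 1, "slept": 1})
bigram_counts = Counter({("the", "lazy"): 2, ("the", "quick"): 1, ("lazy", "dog"): 2})
V = len(unigram_counts)                  # number of word types, V = 9

def addone_bigram(prev, word):
    """Equation (10): (C(prev, word) + 1) / (C(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(addone_bigram("the", "lazy"))      # (2 + 1) / (3 + 9) = 0.25, discounted
print(addone_bigram("lazy", "fox"))      # (0 + 1) / (2 + 9) ~= 0.09, not zero
```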


Smoothing example

So, if treasure trove never occurred in the data, but treasure occurred twice, we have:
(11) P(trove | treasure) = (0 + 1) / (2 + V)

The probability won't be very high, but it will be better than 0.

If all the surrounding probabilities are still high, then treasure trove could still be the best pick.

If the probability were zero, there would be no chance of it appearing.


Discounting

An alternate way of viewing smoothing is as discounting:
lowering non-zero counts to get the probability mass we need for the zero-count items.

The discounting factor can be defined as the ratio of the smoothed count to the MLE count.

Jurafsky and Martin show that add-one smoothing can discount probabilities by a factor of 8!

That's way too much ...


Witten-Bell Discounting

Use the count of things you've seen once to help estimate the count of things you've never seen.

The probability of seeing an N-gram for the first time is estimated by counting the number of times we saw N-grams for the first time in the training corpus:
Σ_{i: ci=0} p*i = T / (T + N)
(T = number of observed N-gram types, N = number of N-gram tokens)

The above value is the total probability mass for unseen N-grams; it should be divided by the total number of zero-count N-grams, Z = Σ_{i: ci=0} 1:
P*i = T / ( Z (T + N) )


Witten-Bell Discounting

P*i = ci / (N + T)             if ci > 0

c*i = (T / Z) · N / (N + T)    if ci = 0
c*i = ci · N / (N + T)         if ci > 0


Witten-Bell Discounting, Bigram

Main idea: instead of simply adding one to every n-gram, compute the probability of wi-1, wi by seeing how likely wi-1 is at starting any bigram.

Words that begin lots of bigrams lead to higher "unseen bigram" probabilities.

Non-zero bigrams are discounted in essentially the same manner as zero-count bigrams.

Jurafsky and Martin show that they are only discounted by about a factor of one.


Witten-Bell Discounting formula

(12) zero-count bigrams:
p*(wi|wi-1) = T(wi-1) / ( Z(wi-1) ( N(wi-1) + T(wi-1) ) )

T(wi-1) = the number of bigram types starting with wi-1
As the numerator, it determines how high the value will be.

N(wi-1) = the number of bigram tokens starting with wi-1
N(wi-1) + T(wi-1) gives us the total number of "events" to divide by.

Z(wi-1) = the number of bigram types starting with wi-1 and having zero count
This just distributes the probability mass between all zero-count bigrams starting with wi-1.
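
A minimal sketch of equation (12), assuming we already have bigram counts and a fixed vocabulary (the names and data are illustrative, not from the slides):

```python
from collections import Counter

# Illustrative counts: bigrams starting with "treasure".
vocab = {"trove", "chest", "map", "island", "hunt"}        # 5 word types
bigram_counts = Counter({("treasure", "chest"): 3, ("treasure", "map"): 1})

def witten_bell_zero(prev, vocab, bigram_counts):
    """Probability given to EACH unseen bigram starting with `prev` (eq. 12)."""
    T = sum(1 for (w1, _) in bigram_counts if w1 == prev)              # seen types
    N = sum(c for (w1, _), c in bigram_counts.items() if w1 == prev)   # seen tokens
    Z = len(vocab) - T                                                 # zero-count types
    return T / (Z * (N + T))

# T = 2, N = 4, Z = 3  ->  2 / (3 * 6) = 0.111... each for trove/island/hunt
print(witten_bell_zero("treasure", vocab, bigram_counts))
```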


Good-Turing discounting

Uses joint probability instead of conditional probability, i.e., P(wx wi) instead of P(wi|wx).

The main idea is to re-estimate the amount of probability mass to assign to N-grams with zero or low counts by looking at the number of N-grams with higher counts.


Good-Turing discounting

Nc = the number of n-grams that occur c times: Nc = Σ_{b: c(b)=c} 1

Good-Turing estimate: c*i = (ci + 1) · N(ci+1) / N(ci)

For c = 0: c* = N1 / N0

But in practice, high counts are left alone: c* = c for c > k.
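
A minimal sketch of the Good-Turing re-estimated count, assuming a simple count-of-counts histogram (the data is invented for illustration, and no threshold k or smoothing of the Nc values is applied):

```python
from collections import Counter

# Illustrative n-gram counts; Nc is the count-of-counts histogram.
ngram_counts = Counter({"a b": 3, "b c": 2, "c d": 2, "d e": 1, "e f": 1, "f g": 1})
Nc = Counter(ngram_counts.values())      # here N1 = 3, N2 = 2, N3 = 1

def good_turing_count(c):
    """c* = (c + 1) * N_{c+1} / N_c (undefined when N_c or N_{c+1} is 0)."""
    return (c + 1) * Nc[c + 1] / Nc[c]

print(good_turing_count(1))   # (1+1) * N2/N1 = 2 * 2/3 = 1.33...
print(good_turing_count(2))   # (2+1) * N3/N2 = 3 * 1/2 = 1.5
```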


Backoff models: Basic idea

Let's say we're using a trigram model for predicting language, and we haven't seen a particular trigram before.

But maybe we've seen the bigram, or if not, the unigram information would be useful.

Backoff models allow one to try the most informative n-gram first and then back off to lower-order n-grams.


Backoff equations

Roughly speaking, this is how a backoff model works:

If the trigram has a non-zero count, we use that information:
(13) Pˆ(wi|wi-2 wi-1) = P(wi|wi-2 wi-1)

Else, if the bigram count is non-zero, we use that bigram information:
(14) Pˆ(wi|wi-2 wi-1) = α1 P(wi|wi-1)

And in all other cases we just take the unigram information:
(15) Pˆ(wi|wi-2 wi-1) = α2 P(wi)
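
A minimal sketch of the scheme in (13)-(15). All counts and the α values here are illustrative assumptions; a proper backoff model derives the weights from discounting so that the probabilities still sum to one:

```python
# Illustrative counts; in practice these come from a training corpus.
trigram_counts = {("i", "saw", "the"): 2}
bigram_counts  = {("i", "saw"): 2, ("saw", "the"): 3, ("want", "more"): 5}
unigram_counts = {"i": 3, "saw": 4, "the": 10, "want": 6, "more": 7, "maples": 1}
N = sum(unigram_counts.values())
ALPHA1, ALPHA2 = 0.4, 0.1          # assumed backoff weights, not estimated here

def backoff_prob(w2, w1, w):
    """Equations (13)-(15): trigram if seen, else weighted bigram, else unigram."""
    if (w2, w1, w) in trigram_counts:
        return trigram_counts[(w2, w1, w)] / bigram_counts[(w2, w1)]
    if (w1, w) in bigram_counts:
        return ALPHA1 * bigram_counts[(w1, w)] / unigram_counts[w1]
    return ALPHA2 * unigram_counts.get(w, 0) / N

print(backoff_prob("i", "saw", "the"))          # trigram seen: 2/2 = 1.0
print(backoff_prob("maples", "want", "more"))   # unseen trigram: backs off to the bigram
```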


Backoff models: example

Let's say we've never seen the trigram "maples want more" before.

But we have seen "want more", so we can use that bigram to calculate a probability estimate: we look at P(more|want) ...

But we're now assigning probability to P(more|maples, want), which was zero before, so we won't have a true probability model anymore.

This is why α1 was used in the previous equations: to re-weight (lower) the probability we assign.

In general, backoff models have to be combined with discounting models.


Deleted Interpolation

Deleted interpolation is similar to backing off, except that we always use the bigram and unigram information as well to calculate the probability estimate:

(16) Pˆ(wi|wi-2 wi-1) = λ1 P(wi|wi-2 wi-1) + λ2 P(wi|wi-1) + λ3 P(wi)

where the lambdas (λ) all sum to one.

Every trigram probability, then, is a composite of the focus word's trigram, bigram, and unigram probabilities.
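
A minimal sketch of equation (16), assuming the three component probabilities are supplied as functions and the λ weights are set by hand (in practice they are tuned on held-out data):

```python
# Assumed lambda weights; they must sum to one. In practice they are
# estimated on held-out data rather than set by hand.
LAMBDA1, LAMBDA2, LAMBDA3 = 0.6, 0.3, 0.1

def interpolated_prob(w2, w1, w, p_trigram, p_bigram, p_unigram):
    """Equation (16): a weighted mix of trigram, bigram, and unigram estimates."""
    return (LAMBDA1 * p_trigram(w2, w1, w)
            + LAMBDA2 * p_bigram(w1, w)
            + LAMBDA3 * p_unigram(w))

# Toy component models, invented for illustration.
p = interpolated_prob("maples", "want", "more",
                      p_trigram=lambda a, b, c: 0.0,     # unseen trigram
                      p_bigram=lambda a, b: 0.5,
                      p_unigram=lambda a: 0.01)
print(p)   # 0.6*0.0 + 0.3*0.5 + 0.1*0.01 = 0.151
```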


A note on information theory

Some very useful notions for n-gram work can be found in information theory. We'll just go over the basic ideas:

entropy = a measure of how much information is needed to encode something

perplexity = a measure of the amount of surprise of an outcome

mutual information = the amount of information one item has about another item (e.g., collocations have high mutual information)


Entropy

A measure of information: how much information there is in a particular grammar, and how well a given grammar matches a given language.

H(X) = - Σ_{x ∈ χ} p(x) log2 p(x)   (in bits)

H(X) is maximized when all members of χ have equal probability.
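
A minimal sketch of the entropy formula, computed for two illustrative distributions (the distributions are assumptions, not from the slides):

```python
import math

def entropy(probs):
    """H(X) = -sum p(x) * log2 p(x), in bits; zero-probability outcomes are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))   # uniform over 4 outcomes: 2.0 bits (the maximum)
print(entropy([0.7, 0.1, 0.1, 0.1]))       # skewed: about 1.36 bits, lower
```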


Perplexity

Perplexity = 2^H (where H is the entropy)

The weighted average number of choices a random variable has to make.
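
Continuing the illustrative entropy sketch above, perplexity is simply 2 raised to the entropy, which recovers the "number of choices" reading:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def perplexity(probs):
    """Perplexity = 2^H: the effective number of equally likely choices."""
    return 2 ** entropy(probs)

print(perplexity([0.25, 0.25, 0.25, 0.25]))   # 4.0: like choosing among 4 options
print(perplexity([0.7, 0.1, 0.1, 0.1]))       # about 2.56: fewer effective choices
```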


Cross entropy

Used for comparing models:
H(p,m) = lim_{n→∞} -(1/n) Σ_{W ∈ L} p(w1 ... wn) log m(w1 ... wn)

For regular grammars:
H(p,m) = lim_{n→∞} -(1/n) log m(w1 ... wn)

Note that H(p) ≤ H(p,m).
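
A minimal sketch of the second formula, approximating the cross entropy of a bigram model m on a finite test sequence. The bigram probabilities are an illustrative assumption; a real evaluation would use a smoothed model so that no test bigram gets probability zero:

```python
import math

# Illustrative (assumed already smoothed) bigram probabilities for the model m.
bigram_prob = {("<start>", "the"): 0.2, ("the", "lazy"): 0.1, ("lazy", "dog"): 0.3,
               ("dog", "slept"): 0.2}

def cross_entropy(test_words, bigram_prob):
    """Approximates H(p, m) = -(1/n) log2 m(w1 ... wn) on a finite test text."""
    logp = sum(math.log2(bigram_prob[(prev, cur)])
               for prev, cur in zip(["<start>"] + test_words, test_words))
    return -logp / len(test_words)

test = "the lazy dog slept".split()
H = cross_entropy(test, bigram_prob)
print(H, 2 ** H)   # cross entropy in bits per word, and the model's perplexity
```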


Context-sensitive spelling correction

Getting back to the task of spelling correction, we can look at bigrams of words to correct a particular misspelling.

Question: given the previous word, what is the probability of the current word?
e.g., given these, we have a 5% chance of seeing reports and a 0.001% chance of seeing report (these report cards).
Thus, we will change report to reports.

Generally, we choose the correction which maximizes the probability of the whole sentence.

As mentioned, we may hardly ever see these reports, so we won't know the probability of that bigram. Aside from smoothing techniques, another possible solution is to use bigrams of parts of speech.
e.g., what is the probability of a noun given that the previous word was an adjective?
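
A minimal sketch of the idea, assuming a smoothed bigram probability table and a small set of candidate corrections. The values for the these/report(s) bigrams follow the example above; the cards bigrams and the candidate set are invented for illustration, and real systems also weigh how likely each misspelling is:

```python
# Illustrative bigram probabilities, assumed to be already smoothed.
bigram_prob = {("these", "reports"): 0.05, ("these", "report"): 0.00001,
               ("reports", "cards"): 0.001, ("report", "cards"): 0.02}

def best_correction(prev, candidates, nxt):
    """Pick the candidate maximizing P(candidate | prev) * P(next | candidate)."""
    return max(candidates,
               key=lambda w: bigram_prob.get((prev, w), 0.0)
                             * bigram_prob.get((w, nxt), 0.0))

# "these report cards": should we keep 'report' or change it to 'reports'?
print(best_correction("these", ["report", "reports"], "cards"))   # -> 'reports'
```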


Exercises

Who will present Markov Models and their basic algorithms?

6.2, 6.4, 6.8, 6.9