
Lecture 2: Language modeling

LTAT.01.001 – Natural Language Processing
Kairit Sirts (kairit.sirts@ut.ee)

20.02.2019

The task of language modeling

The cat sat on the mat

The mat sat on the cat

The cat mat the on sat

2

Language modeling

Task:
• Estimate the quality/fluency/grammaticality of a natural language sentence or segment

Why?
• Generate new sentences
• Choose between several variants, picking the best-sounding one

3

Language modeling

Word: $w$

Sentence: $S = w_1 w_2 \ldots w_n$

4

Language modeling

Can we use some grammaticality-checking rules to determine the fluency of the sentence $S$?
• Theoretically: yes
• In practice:
  • Grammar-checking software is unreliable
  • Grammar-checking software is only available for a few languages
  • Its output is often non-continuous, which means that
    • it cannot be used in optimization
    • it cannot be used to easily choose a better output from many viable hypotheses

5

Language modeling

Instead we will try to calculate/model:

$P(S) = P(w_1 w_2 \ldots w_n)$

P(The cat sat on the mat) > P(The mat sat on the cat)
P(The mat sat on the cat) > P(The cat mat the on sat)

6

How to compute the sentence probability?

P(The cat sat on the mat) = #(The cat sat on the mat) / #(all sentences) = ?
P(The mat sat on the cat) = #(The mat sat on the cat) / #(all sentences) = ?
P(The cat mat the on sat) = #(The cat mat the on sat) / #(all sentences) = ?

# – the number or count of such sentences. That's clearly not doable in general!

7

How to compute the sentence probability?

Factorize the joint probability:
• In general:

$P(A, B, C) = P(A)\, P(B \mid A)\, P(C \mid A, B)$

• Similarly:

$P(w_1, w_2, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, w_2, \ldots, w_{n-1})$

• It still does not solve the problem!

8

Sentence probability

• Cannot estimate directly:

$P(w_1 w_2 \ldots w_n) = \frac{\#(w_1 w_2 \ldots w_n)}{\#(\text{all sentences})}$

• Cannot use the factorization:

$P(w_1 w_2 \ldots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \ldots w_{i-1})$

9

Sentence probability

But word probabilities are doable:
• Take a huge text (millions/billions of words)
• Compute the probability for each word type (unique word)

$P(w) = \frac{\#(w)}{\#(\text{all words in the text})}$   (the maximum likelihood (ML) estimate)

10
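As an illustration of the ML estimate above, a minimal sketch in Python (the toy corpus and the function name are made up for this example):

```python
from collections import Counter

def unigram_mle(corpus):
    """Maximum likelihood estimate P(w) = #(w) / #(all words) for each word type."""
    words = corpus.split()
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

probs = unigram_mle("the cat sat on the mat")
print(probs["the"])  # 2/6 ≈ 0.33
```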

Sentence probability

• What if we treat each word as independent of other words? Then:

! " ≅ ! $! ×! $" ×⋯×!($#)

=

=

11

P(The cat sat on the mat) P(The mat sat on the cat)

P(The cat mat the on sat)P(The mat sat on the cat)

Sentence probability

• Maybe add some context?

$P(S) \approx P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_2) \times \cdots \times P(w_n \mid w_{n-1})$

P(The cat sat on the mat) = P(The) P(cat|The) P(sat|cat) P(on|sat) P(the|on) P(mat|the)
P(The mat sat on the cat) = P(The) P(mat|The) P(sat|mat) P(on|sat) P(the|on) P(cat|the)
P(The cat mat the on sat) = P(The) P(cat|The) P(mat|cat) P(the|mat) P(on|the) P(sat|on)

12

Sentence probability – Markov property

Independence assumption or Markov assumption (in the context of language modeling):
• The next word only depends on the current/last word.
• This is precisely the model we had on the previous slide, and it is called a bigram language model because we are looking at word bigrams.

13

N-gram language model

In general, we talk about n-gram language models, where the next word depends on a fixed history of n-1 words.

• Unigram model – all words are independent, the classical BOW approach
• Bigram model
• Trigram model – the next word depends on the last two words
• 4-gram model
• 5-gram model

14

Computing n-gram probabilities

• Unigrams: $w_i$:  $P(w_i) = \frac{\#(w_i)}{\#(\text{all words})}$

• Bigrams: $w_{i-1} w_i$:  $P(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i)}{\#(w_{i-1})}$

• Trigrams: $w_{i-2} w_{i-1} w_i$:  $P(w_i \mid w_{i-2}, w_{i-1}) = \frac{\#(w_{i-2}, w_{i-1}, w_i)}{\#(w_{i-2}, w_{i-1})}$

15

Sentence probability according to n-gram model

• If

$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-k}, \ldots, w_{i-1})$

• where $k$ is the model order minus 1:
  • Unigrams: $k = 0$
  • Bigrams: $k = 1$
  • Trigrams: $k = 2$, etc.

• Then

$P(S) = \prod_{i=1}^{n} P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-k}, \ldots, w_{i-1})$

16

Bigram language model: example

An example corpus:
1. the cat saw the mouse
2. the cat heard a mouse
3. the mouse heard
4. a mouse saw
5. a cat saw

17

Bigram       Count   Unigram   Count   Bigram prob
START the            START
the cat              the
cat saw              cat
saw the              saw
the mouse            the
mouse END            mouse
cat heard            cat
heard a              heard
a mouse              a
START a              START
mouse saw            mouse
saw END              saw
a cat                a

Bigram language model: example

An example corpus:
1. the cat saw the mouse
2. the cat heard a mouse
3. the mouse heard
4. a mouse saw
5. a cat saw

18

Bigram       Count   Unigram   Count   Bigram prob
START the    3       START     5       0.6
the cat      2       the       4       0.5
cat saw      2       cat       3       0.67
saw the      1       saw       3       0.33
the mouse    2       the       4       0.5
mouse END    2       mouse     4       0.5
cat heard    1       cat       3       0.33
heard a      1       heard     2       0.5
a mouse      2       a         3       0.67
START a      2       START     5       0.4
mouse saw    1       mouse     4       0.25
saw END      2       saw       3       0.67
a cat        1       a         3       0.33
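A minimal sketch that reproduces these counts and maximum likelihood bigram probabilities from the five-sentence corpus (the START/END padding follows the table; the function name is my own):

```python
from collections import Counter

corpus = [
    "the cat saw the mouse",
    "the cat heard a mouse",
    "the mouse heard",
    "a mouse saw",
    "a cat saw",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = ["START"] + sentence.split() + ["END"]
    unigram_counts.update(tokens[:-1])               # count every token that can start a bigram
    bigram_counts.update(zip(tokens, tokens[1:]))    # adjacent pairs, including START/END pairs

def bigram_prob(prev, word):
    """Maximum likelihood estimate P(word | prev) = #(prev, word) / #(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("START", "the"))   # 3/5 = 0.6
print(bigram_prob("cat", "saw"))     # 2/3 ≈ 0.67
```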

Bigram language model: example

P(The cat heard) = ?

19

Bigram       Bigram prob
START the    0.6
the cat      0.5
cat saw      0.67
saw the      0.33
the mouse    0.5
mouse END    0.5
cat heard    0.33
heard a      0.5
a mouse      0.67
START a      0.4
mouse saw    0.25
saw END      0.67
a cat        0.33
heard END    0.5

Bigram language model: example

P(The cat heard) = P(START the) x P(the cat) x P(cat heard) x P(heard END)

20


Bigram language model: example

P(The cat heard) = P(START the) x P(the cat) x P(cat heard) x P(heard END) = 0.6 * 0.5 * 0.33 * 0.5 = 0.0495

21


Bigram language model: example

P(The mouse saw the cat) = ?

22


Bigram language model: example

P(the mouse saw the cat) = P(START the) x P(the mouse) x P(mouse saw) x P(saw the) x P(the cat) x P(cat END)

23


Bigram language model: example

P(the mouse saw the cat) = P(START the) x P(the mouse) x P(mouse saw) x P(saw the) x P(the cat) x P(cat END) = 0.6 * 0.5 * 0.25 * 0.33 * 0.5 * 0 = 0

24

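The two worked examples above can be reproduced with a few lines of Python. This is a minimal sketch: the bigram probabilities are hard-coded from the table, and any bigram not in the table (such as "cat END") gets probability 0, which is exactly the problem the sparsity and smoothing slides address:

```python
# Bigram probabilities copied from the table above.
bigram_prob = {
    ("START", "the"): 0.6, ("the", "cat"): 0.5, ("cat", "saw"): 0.67,
    ("saw", "the"): 0.33, ("the", "mouse"): 0.5, ("mouse", "END"): 0.5,
    ("cat", "heard"): 0.33, ("heard", "a"): 0.5, ("a", "mouse"): 0.67,
    ("START", "a"): 0.4, ("mouse", "saw"): 0.25, ("saw", "END"): 0.67,
    ("a", "cat"): 0.33, ("heard", "END"): 0.5,
}

def sentence_prob(sentence):
    """Product of bigram probabilities over the START/END-padded sentence."""
    tokens = ["START"] + sentence.lower().split() + ["END"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= bigram_prob.get((prev, cur), 0.0)   # unseen bigram -> probability 0
    return p

print(sentence_prob("The cat heard"))          # 0.6 * 0.5 * 0.33 * 0.5 = 0.0495
print(sentence_prob("The mouse saw the cat"))  # 0.0, because "cat END" was never observed
```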

Morphology

[Image slide. Source: www.rabiaergin.com]

25

Sparsity issues

Natural languages are sparse!

Consider a vocabulary of size 60000
• How many possible unigrams, bigrams, trigrams are there?
• How large a text corpus do we need to obtain reliable statistics for all n-grams?
• Does more data solve the problem completely?

26
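To make the combinatorics concrete, a quick back-of-the-envelope calculation (the vocabulary size 60,000 is the one assumed on the slide):

```python
V = 60_000
print(f"unigrams: {V:,}")      # 60,000
print(f"bigrams:  {V**2:,}")   # 3,600,000,000
print(f"trigrams: {V**3:,}")   # 216,000,000,000,000
```

Even a billion-word corpus contains at most on the order of 10^9 distinct trigram occurrences, so the overwhelming majority of possible trigrams will never be observed; more data helps, but it cannot close the gap completely.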

Zipf’s law

• Given some corpus of natural language text, the frequency of any word is inversely proportional to its rank in the frequency table
• The most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.

27

Zipf’s law

28

Masrai and Milton, 2006. “How different is Arabic from Other Languages? The Relationship between Word Frequency and Lexical Coverage”

Smoothing

The general idea: find a way to fill the gaps in the counts
• Take care not to change the original distribution too much
• Fill in the gaps only as much as needed: as the corpus grows larger, there are fewer gaps to fill

• Smoothing methods
  • Add-λ method
  • Interpolation
  • (Modified) Kneser-Ney
  • There are others

29

Add λ method

Assume all n-grams occur λ more times than they actually occur.
• Usual bigram probability:

$P(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i)}{\#(w_{i-1})}$

• Add $0 < \lambda \le 1$ to all bigram counts:

$P_\lambda(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i) + \lambda}{\#(w_{i-1}) + \lambda |V|}$

• Special case $\lambda = 1$: add-one smoothing

30
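A minimal sketch of add-λ smoothing for bigrams (the toy corpus and the value λ = 0.1 are illustrative assumptions, not from the slides):

```python
from collections import Counter

def add_lambda_bigram(bigram_counts, unigram_counts, vocab_size, lam=0.1):
    """Return a function computing P_lambda(w_i | w_{i-1}) with add-lambda smoothing."""
    def prob(prev, word):
        return (bigram_counts[(prev, word)] + lam) / (unigram_counts[prev] + lam * vocab_size)
    return prob

tokens = "the cat saw the mouse".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
p = add_lambda_bigram(bigrams, unigrams, vocab_size=len(unigrams), lam=0.1)
print(p("the", "cat"))   # seen bigram: (1 + 0.1) / (2 + 0.4) ≈ 0.458, slightly below its ML estimate 0.5
print(p("the", "saw"))   # unseen bigram: 0.1 / 2.4 ≈ 0.042, small but non-zero
```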

Add λ method

• Advantages
  • Very simple
  • Easy to apply
• Disadvantages
  • Performs poorly (according to Chen & Goodman)
  • All unseen events receive the same probability
  • All events' counts are increased by the same λ

31

Interpolation (Jelinek-Mercer smoothing)

If the bigram $w_{i-1} w_i$ is unseen:
• Originally its probability would be 0: $P(w_i \mid w_{i-1}) = 0$
• Instead of 0 we could use the probability of the shorter n-gram (unigram): $P(w_i)$
• We must make sure that the total probability mass remains the same
• Thus interpolate between the unigram and bigram distributions:

$P_{JM}(w_i \mid w_{i-1}) = \lambda\, P(w_i \mid w_{i-1}) + (1 - \lambda)\, P(w_i)$

32
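A minimal sketch of bigram–unigram interpolation (the toy corpus and λ = 0.7 are illustrative assumptions):

```python
from collections import Counter

tokens = "the cat saw the mouse".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
total = len(tokens)

def p_unigram(word):
    return unigrams[word] / total

def p_bigram_ml(prev, word):
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def p_interpolated(prev, word, lam=0.7):
    """P_JM(w_i | w_{i-1}) = lam * P_ML(w_i | w_{i-1}) + (1 - lam) * P(w_i)."""
    return lam * p_bigram_ml(prev, word) + (1 - lam) * p_unigram(word)

print(p_interpolated("the", "cat"))   # seen bigram: 0.7*0.5 + 0.3*0.2 = 0.41
print(p_interpolated("the", "saw"))   # unseen bigram: falls back to 0.3 * P(saw) = 0.06
```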

Interpolation (Jelinek-Mercer smoothing)

• Recursive formulation: the nth-order smoothed model is defined recursively as a linear interpolation between the nth-order maximum likelihood (ML) model and the (n−1)th-order smoothed model:

$P_{JM}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \lambda\, P_{ML}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) + (1 - \lambda)\, P_{JM}(w_i \mid w_{i-n+2}, \ldots, w_{i-1})$

• Can ground the recursion with:
  • the 1st-order unigram model
  • the 0th-order uniform model: $P(w) = \frac{1}{|V|}$

33
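A sketch of this recursive formulation, grounded in the uniform distribution (using a single shared λ and toy counts; in practice the λ values are tuned and may depend on the context):

```python
from collections import Counter

def ngram_counts(tokens, max_n):
    """counts[k] maps each k-gram tuple (k = 1..max_n) to its frequency."""
    return {k: Counter(zip(*(tokens[i:] for i in range(k)))) for k in range(1, max_n + 1)}

def p_jm(word, context, counts, vocab_size, lam=0.7):
    """Recursively interpolated probability P_JM(word | context)."""
    if not context:                            # bottom of the recursion
        total = sum(counts[1].values())
        p_ml = counts[1][(word,)] / total      # 1st-order (unigram) ML estimate
        return lam * p_ml + (1 - lam) / vocab_size   # interpolate with the uniform model 1/|V|
    hist = counts[len(context)][tuple(context)]
    p_ml = counts[len(context) + 1][tuple(context) + (word,)] / hist if hist else 0.0
    return lam * p_ml + (1 - lam) * p_jm(word, context[1:], counts, vocab_size, lam)

tokens = "the cat saw the mouse".split()
counts = ngram_counts(tokens, max_n=3)
print(p_jm("mouse", ["saw", "the"], counts, vocab_size=len(counts[1])))
```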

Software for language modelling

• KenLM: https://github.com/kpu/kenlm
• SRILM: http://www.speech.sri.com/projects/srilm/
• IRSTLM: http://hlt-mt.fbk.eu/technologies/irstlm
• Others: http://www.statmt.org/moses/?n=FactoredTraining.BuildingLanguageModel

34

Language model evaluation

• Intrinsic evaluation
  • Perplexity
  • Quick and simple
  • Improvements in perplexity might not translate into improvements in downstream tasks
• Extrinsic evaluation
  • In a downstream task (like machine translation, speech recognition, etc.)
  • More difficult and time-consuming
  • More accurate evaluation (although beware of confounding with other factors)

35

Perplexity

• Perplexity is a measurement of how well a probability model predicts a sample
• A language model is a probability model over language
• To evaluate a language model, compute the perplexity over a held-out set (test set)

$PP = 2^{-\frac{1}{N}\sum_{i=1}^{N} \log_2 P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})}$

36
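A minimal sketch of computing perplexity on a held-out set under some conditional model; the `prob` function here is a placeholder (a uniform distribution over a 247-word vocabulary), and any smoothed n-gram model could be plugged in instead:

```python
import math

def perplexity(test_tokens, prob, order=2):
    """PP = 2^(-1/N * sum_i log2 P(w_i | history)), history = previous order-1 tokens."""
    log_sum, n = 0.0, 0
    for i, word in enumerate(test_tokens):
        history = tuple(test_tokens[max(0, i - order + 1):i])   # a real model would pad with START
        log_sum += math.log2(prob(word, history))
        n += 1
    return 2 ** (-log_sum / n)

uniform = lambda word, history: 1 / 247    # toy placeholder model
print(perplexity("the cat sat on the mat".split(), uniform))   # 247.0, maximally "confused"
```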

Perplexity

• The lower the perplexity, the better the language model, i.e. the less "surprised" the model is on seeing the evaluation data
• The exponent is really the cross-entropy, which measures the number of bits needed to represent a word:

$H(\tilde{p}, \hat{p}) = -\sum_{i=1}^{N} \tilde{p}(w_i) \log_2 \hat{p}(w_i)$

• $\tilde{p}(w_i) = \frac{\#(w_i)}{N}$ – the empirical unigram probability
• $\hat{p}(w_i) = P_{LM}(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$ – the model probability

37

Perplexity

• Let's assume that the cross-entropy on a test set is 7.95
• This means that each word in the test set could be encoded with 7.95 bits
• The model perplexity would be $2^{7.95} \approx 247$ per word
• This means that the model is as confused on the test data as if it had to choose uniformly at random from 247 possibilities for each word

38

Perplexity

• Perplexity is corpus-specific: only the perplexities calculated on the same test set are comparable
• For meaningful comparison, the vocabulary sizes of the two language models must be the same, e.g.
  • You can compare a bigram language model to a trigram language model that both use vocabulary size 10000
  • You cannot compare a trigram language model using vocabulary size 10000 to a trigram language model using vocabulary size 20000

39

Neural language models

• Window-based feed-forward neural language model
• Recurrent neural language model

40

Feed-forward neural language model (Bengio et al., 2003)

41

$x = [e(w_{t-n+1}); \ldots; e(w_{t-2}); e(w_{t-1})]$

$h = g(xW_h + b_h)$

$P(w_t \mid w_{t-n+1}, \ldots, w_{t-2}, w_{t-1}) = \mathrm{softmax}(hW_o + b_o)$

where $e(w)$ is the embedding of word $w$ and $g$ is a nonlinearity.
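A minimal sketch of this window-based model in PyTorch (the layer sizes, the tanh nonlinearity and the context length are illustrative assumptions, not taken from the slides):

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    """Predict P(w_t | previous context_size words) from a fixed window of context words."""
    def __init__(self, vocab_size, context_size=3, emb_dim=100, hidden_dim=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(context_size * emb_dim, hidden_dim)   # x -> h
        self.out = nn.Linear(hidden_dim, vocab_size)                  # h -> scores over V

    def forward(self, context):
        # context: (batch, context_size) indices of the previous words
        x = self.emb(context).flatten(start_dim=1)      # concatenate the context embeddings
        h = torch.tanh(self.hidden(x))
        return torch.log_softmax(self.out(h), dim=-1)   # log P(w_t | context)

model = FeedForwardLM(vocab_size=10_000)
context = torch.randint(0, 10_000, (4, 3))   # a batch of 4 three-word contexts
print(model(context).shape)                   # torch.Size([4, 10000])
```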

Recurrent neural language model (Mikolov et al., 2010)

Source: http://colah.github.io

$h_t = g(x_t W + h_{t-1} U + b_h)$

$P(w_t \mid w_1, \ldots, w_{t-2}, w_{t-1}) = \mathrm{softmax}(h_t V + b_o)$
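A corresponding sketch of a simple recurrent language model in PyTorch (again, the dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """At each position t, predict w_t from a hidden state summarising w_1 ... w_{t-1}."""
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)   # Elman RNN with tanh
        self.out = nn.Linear(hidden_dim, vocab_size)               # h_t -> scores over V

    def forward(self, tokens, hidden=None):
        # tokens: (batch, seq_len) word indices
        states, hidden = self.rnn(self.emb(tokens), hidden)
        return self.out(states), hidden   # logits for every position, plus the last hidden state

model = RNNLM(vocab_size=10_000)
logits, _ = model(torch.randint(0, 10_000, (4, 12)))
print(logits.shape)   # torch.Size([4, 12, 10000])
```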

Training the language model with cross-entropy loss

!!"#$$%&'("#)* "#, # = −'+,-

.#+ log "#+ = − log "#(

|V| - the vocabulary sizet – index of the correct word

43
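In PyTorch this loss is built in; a small sketch showing that it reduces to $-\log \hat{y}_t$ for the correct index t (the toy logits are made up):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0, 0.1]])   # unnormalised scores over a 4-word "vocabulary"
target = torch.tensor([0])                        # index t of the correct word

loss = F.cross_entropy(logits, target)                   # softmax + negative log-likelihood
manual = -torch.log_softmax(logits, dim=-1)[0, target]   # -log y_hat_t computed by hand
print(loss.item(), manual.item())                        # the two values match
```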

Why is the softmax over a large vocabulary computationally costly?

• What is a softmax?

$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$

• Now take the derivative of this with respect to $z_i$
• The sum over the whole vocabulary will remain in the derivative (check it yourself)

44
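As a worked version of that exercise (the standard derivation, not from the slides), write $s_i = \mathrm{softmax}(z_i)$; then

$\frac{\partial s_i}{\partial z_k} = \frac{\delta_{ik}\, e^{z_i} \sum_j e^{z_j} - e^{z_i} e^{z_k}}{\big(\sum_j e^{z_j}\big)^2} = s_i\,(\delta_{ik} - s_k)$

Both the forward pass and this gradient contain the normalizer $\sum_j e^{z_j}$, which runs over all $|V|$ outputs, so every prediction and every gradient step costs O(|V|).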

How to handle large softmax?

• Hierarchical softmax
  • Decompose the softmax layer into a binary tree
  • Reduce the complexity of the output distribution from O(|V|) to O(log |V|)
• Self-normalization
• Approximate softmax

45

source: https://becominghuman.ai

What to do with infrequent words

• Typically, the vocabulary size is fixed, ranging anywhere between 10K and 200K words
• Still, there will always be words that are not part of the vocabulary (remember Turkish?)
• The most common approach is to simply replace all out-of-vocabulary (OOV) words with a special UNK token
• Another option is to reduce the sparsity by constructing the vocabulary from subword units:
  • morphemes, characters, syllables, …

46
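A minimal sketch of the UNK-replacement approach (the vocabulary size, token name and toy data are illustrative assumptions):

```python
from collections import Counter

UNK = "<unk>"

def build_vocab(tokens, max_size=10_000):
    """Keep the most frequent words; everything else will be mapped to UNK."""
    counts = Counter(tokens)
    return {UNK} | {w for w, _ in counts.most_common(max_size - 1)}

def replace_oov(tokens, vocab):
    return [w if w in vocab else UNK for w in tokens]

train = "the cat sat on the mat".split()
vocab = build_vocab(train, max_size=5)
print(replace_oov("the dog sat on the rug".split(), vocab))
# ['the', '<unk>', 'sat', 'on', 'the', '<unk>'] — 'dog' and 'rug' were never seen in training
```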

What to do with infrequent words

• What if there are no UNKs in the training set?
  • Use a random UNK vector during testing
  • Randomly replace some infrequent words with UNK during training
• Construct word embeddings from characters (we'll talk about this in more detail later)
  • Works for input (context) words
  • Cannot be used for output words

47

Character-level language model

• For instance, for generating text with mark-up
• A. Karpathy, 2015. The Unreasonable Effectiveness of Recurrent Neural Networks
• Generated text based on an LM trained on Wikipedia (example shown on the slide)

48

Using language models

• For scoring sentences
  • Speech recognition
  • Using an LM for text classification
  • Statistical machine translation
• For generating text
  • Neural machine translation
  • Dialogue generation
  • Abstractive summarization

49
