Language Models for Speech Recognition (PowerPoint presentation transcript)
Language Models For Speech Recognition
Speech Recognition
A: sequence of acoustic vectors
Find the word sequence W so that:

    Ŵ = argmax_W P(W | A) = argmax_W P(A | W) P(W) / P(A) = argmax_W P(A | W) P(W)

The task of a language model is to make available to the recognizer adequate estimates of the probabilities P(W).
Language Models
W = w_1, w_2, ..., w_n,  w_i ∈ V

    P(W) = P(w_1) · P(w_2 | w_1) · ... · P(w_n | w_1, ..., w_{n-1})

e.g. W = "speech recognition is difficult":

    P(W) = P(speech) · P(recognition | speech) · P(is | speech recognition) · P(difficult | speech recognition is)
N-gram models
Make the Markov assumption that only the prior local context – the last (N-1) words – affects the next word
    N=3 (trigrams):  P(W) = ∏_{i=1}^{n} P(w_i | w_{i-2}, w_{i-1})
    N=2 (bigrams):   P(W) = ∏_{i=1}^{n} P(w_i | w_{i-1})
    N=1 (unigrams):  P(W) = ∏_{i=1}^{n} P(w_i)

with W = w_1, w_2, ..., w_n,  w_i ∈ V
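As a concrete illustration of the bigram (N=2) case, a sentence can be scored by multiplying conditional probabilities along the word sequence. The probability table below is invented for the sketch, not estimated from data:

```python
# Sketch: scoring a sentence under the bigram (N=2) Markov assumption.
# The probability table is invented for illustration, not real estimates.
P_bigram = {
    ("<s>", "speech"): 0.2,
    ("speech", "recognition"): 0.5,
    ("recognition", "is"): 0.3,
    ("is", "difficult"): 0.1,
}

def bigram_prob(words):
    """P(W) = prod_i P(w_i | w_{i-1}), using <s> as the initial context."""
    prob, prev = 1.0, "<s>"
    for w in words:
        prob *= P_bigram.get((prev, w), 0.0)  # unseen bigrams get probability zero
        prev = w
    return prob

print(bigram_prob("speech recognition is difficult".split()))  # 0.2 * 0.5 * 0.3 * 0.1 ≈ 0.003
```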
Parameter estimation
Maximum Likelihood Estimator
    N=3 (trigrams):  P(w_3 | w_1, w_2) = f(w_3 | w_1, w_2) = c(w_1, w_2, w_3) / c(w_1, w_2)
    N=2 (bigrams):   P(w_2 | w_1) = f(w_2 | w_1) = c(w_1, w_2) / c(w_1)
    N=1 (unigrams):  P(w_1) = f(w_1) = c(w_1) / N

This will assign zero probabilities to unseen events
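A minimal sketch of the relative-frequency (ML) estimator on a tiny made-up corpus; real models are of course trained on millions of words:

```python
from collections import Counter

# Sketch: maximum-likelihood n-gram estimation by relative frequency
# on a made-up eight-word corpus.
corpus = "speech recognition is difficult speech recognition is fun".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def f_bigram(w2, w1):
    """f(w2 | w1) = c(w1, w2) / c(w1) -- zero for unseen events."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(f_bigram("recognition", "speech"))  # c(speech, recognition)=2, c(speech)=2 -> 1.0
print(f_bigram("difficult", "is"))        # c(is, difficult)=1, c(is)=2 -> 0.5
```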
Number of Parameters
For a vocabulary of size V, a 1-gram model has V - 1 independent parameters
A 2-gram model has V^2 - 1 independent parameters
In general, an n-gram model has V^n - 1 independent parameters
Typical values for a moderate size vocabulary of 20,000 words are:

Model   Parameters
1-gram  20,000
2-gram  20,000^2 = 400 million
3-gram  20,000^3 = 8 trillion
Number of Parameters
|V| = 60,000, N = 35M (Eleftherotypia daily newspaper)

Count  1-grams  2-grams    3-grams
1      160,273  3,877,976  13,128,073
2      51,725   784,012    1,802,348
3      27,171   314,114    562,264
>0     390,796  5,834,632  16,515,051
>=0    390,796  36×10^8    216×10^12

In a typical training text, roughly 80% of trigrams occur only once
Good-Turing estimate (total probability of unseen n-grams ≈ n_1/N): ML estimates will be zero for 37.5% of the 3-grams and for 11% of the 2-grams
Problems
Data sparseness: we do not have enough data to train the model parameters
Solutions
– Smoothing techniques: accurately estimate probabilities in the presence of sparse data (Good-Turing, Jelinek-Mercer (linear interpolation), Katz (backing-off))
– Build compact models: they have fewer parameters to train and thus require less data, e.g. equivalence classification of words (grammatical classes such as noun, verb, adjective, preposition; semantic labels such as city, name, date)
Smoothing
Make distributions more uniform
Redistribute probability mass from higher to lower probabilities
Additive Smoothing
For each n-gram that occurs r times, pretend that it occurs r + 1 times
e.g. bigrams:

    P(w_2 | w_1) = ( c(w_1, w_2) + 1 ) / ( c(w_1) + V )
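A minimal sketch of add-one (Laplace) smoothing for bigrams, again on a toy corpus; note how an unseen bigram now receives a small nonzero probability:

```python
from collections import Counter

# Sketch: add-one (Laplace) smoothing for bigrams on a toy corpus.
corpus = "speech recognition is difficult".split()
V = len(set(corpus))  # vocabulary size

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_add1(w2, w1):
    """P(w2 | w1) = (c(w1, w2) + 1) / (c(w1) + V)"""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(p_add1("speech", "difficult"))      # unseen bigram: (0 + 1) / (1 + 4) = 0.2
print(p_add1("recognition", "speech"))    # seen bigram:   (1 + 1) / (1 + 4) = 0.4
```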
Good-Turing Smoothing
For any n-gram that occurs r times, pretend that it occurs r* times:

    r* = (r + 1) n_{r+1} / n_r

where n_r is the number of n-grams which occur r times.
To convert this count to a probability we just normalize:

    P_GT = r* / N

Total probability of unseen n-grams: n_1 / N
Example
r (= MLE)  n_r            r* (= GT)
0          3,594,165,368  0.001078
1          3,877,976      0.404
2          784,012        1.202
3          314,114        2.238
4          175,720        3.187
5          112,006        4.199
6          78,391         5.238
7          58,661         6.270
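The adjusted counts in the table can be reproduced directly from the formula r* = (r + 1) n_{r+1} / n_r, using the n_r values above:

```python
# Sketch: Good-Turing adjusted counts r* = (r + 1) * n_{r+1} / n_r,
# using the counts-of-counts n_r from the table above.
n = {1: 3877976, 2: 784012, 3: 314114, 4: 175720}

def r_star(r):
    return (r + 1) * n[r + 1] / n[r]

print(round(r_star(1), 3))  # 0.404, matching the table
print(round(r_star(2), 3))  # 1.202
print(round(r_star(3), 3))  # 2.238
```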
Good-Turing
Intuitively: the counts of n-grams seen r + 1 times are used to re-estimate the mass of n-grams seen r times (figure)
Jelinek-Mercer Smoothing (linear interpolation)

    c(DISCARD THE) = 0
    c(DISCARD THOU) = 0

so the ML bigram estimates give p(THE | DISCARD) = p(THOU | DISCARD), while we would expect p(THE | DISCARD) > p(THOU | DISCARD).

Interpolate a higher-order model with a lower-order model:

    p_interp(w_i | w_{i-1}) = λ p_ML(w_i | w_{i-1}) + (1 - λ) p_ML(w_i)

Given fixed p_ML, it is possible to search efficiently for the λ that maximizes the probability of some data using the Baum-Welch algorithm.
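A sketch of the interpolation on the DISCARD example; the ML probabilities and λ below are invented for illustration:

```python
# Sketch: Jelinek-Mercer linear interpolation of a bigram with a unigram model.
# The p_ML values and lambda are illustrative, not estimated from a real corpus.
p_ml_bigram = {("DISCARD", "THE"): 0.0, ("DISCARD", "THOU"): 0.0}
p_ml_unigram = {"THE": 0.05, "THOU": 0.0001}

def p_interp(w, prev, lam=0.7):
    """p = lambda * p_ML(w | prev) + (1 - lambda) * p_ML(w)"""
    return lam * p_ml_bigram.get((prev, w), 0.0) + (1 - lam) * p_ml_unigram.get(w, 0.0)

# Both bigram counts are zero, but interpolation now prefers THE over THOU:
print(p_interp("THE", "DISCARD") > p_interp("THOU", "DISCARD"))  # True
```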
Katz Smoothing (backing-off)
For those events which have been observed in the training data we assume some reliable estimate of the probability. For the remaining unseen events we back off to some less specific distribution.

    p_BO(w_i | w_{i-1}) =
        f(w_i | w_{i-1})          if r ≥ k
        d_r f(w_i | w_{i-1})      if 1 ≤ r < k
        α(w_{i-1}) p_ML(w_i)      if r = 0

where r = c(w_{i-1}, w_i). α(w_{i-1}) is chosen so that the total probability sums to 1:

    p_BO(w_i | w_{i-1}) = c_BO(w_{i-1}, w_i) / Σ_{w_i} c_BO(w_{i-1}, w_i)
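A deliberately simplified sketch of the backing-off idea: full Katz smoothing discounts the seen counts (the d_r coefficients) and computes α(w_{i-1}) so that everything normalizes; here a fixed toy alpha stands in for that computation:

```python
from collections import Counter

# Simplified sketch of backing-off. In real Katz smoothing the d_r discounts
# and alpha(w1) are derived from Good-Turing counts; a fixed toy alpha is
# used here instead, so this version is NOT properly normalized.
corpus = "speech recognition is difficult speech recognition is fun".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def p_backoff(w2, w1, alpha=0.4):
    """Seen bigram: use its relative frequency; unseen: back off to the
    (scaled) unigram distribution."""
    if bigrams[(w1, w2)] > 0:
        return bigrams[(w1, w2)] / unigrams[w1]
    return alpha * unigrams[w2] / N

print(p_backoff("recognition", "speech"))  # seen: 2/2 = 1.0
print(p_backoff("fun", "difficult"))       # unseen: 0.4 * 1/8 = 0.05
```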
Witten-Bell Smoothing
Model the probability of new events, estimating the probability of seeing such a new event as we proceed through the training corpus (i.e. from the total number of word types in the corpus):

    p_WB(w_i | w_{i-n+1}^{i-1}) = λ_{w_{i-n+1}^{i-1}} p_ML(w_i | w_{i-n+1}^{i-1}) + (1 - λ_{w_{i-n+1}^{i-1}}) p_WB(w_i | w_{i-n+2}^{i-1})

    1 - λ_{w_{i-n+1}^{i-1}} = T / (N + T)

where T is the number of distinct word types observed after the history and N is the total count of the history.
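A minimal sketch of the Witten-Bell new-event weight 1 - λ = T / (N + T) computed for one history on a toy corpus:

```python
from collections import Counter

# Sketch: the Witten-Bell weight 1 - lambda = T / (N + T) for a history,
# where T = number of distinct words seen after the history, N = its total count.
corpus = "speech recognition is difficult speech recognition is fun".split()
followers = Counter(zip(corpus, corpus[1:]))

def new_event_mass(history):
    T = sum(1 for (w1, _w2) in followers if w1 == history)             # distinct continuations
    N = sum(c for (w1, _w2), c in followers.items() if w1 == history)  # total count of the history
    return T / (N + T)

print(new_event_mass("is"))  # "is" is followed by {difficult, fun}: T=2, N=2 -> 0.5
```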
Absolute Discounting
Subtract a constant D from each nonzero count:

    p_abs(w_i | w_{i-n+1}^{i-1}) = max{ c(w_{i-n+1}^{i}) - D, 0 } / Σ_{w_i} c(w_{i-n+1}^{i}) + (1 - λ_{w_{i-n+1}^{i-1}}) p_abs(w_i | w_{i-n+2}^{i-1})

    D = n_1 / (n_1 + 2 n_2)
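A sketch of the discounting step on a toy corpus; only the higher-order term is shown, and the backed-off lower-order term of the full formula is omitted:

```python
from collections import Counter

# Sketch: absolute discounting of bigram counts, with D = n1 / (n1 + 2 * n2).
# Only the discounted higher-order term is computed; the interpolated
# lower-order term of the full formula is omitted in this sketch.
corpus = "speech recognition is difficult speech recognition is fun".split()
bigrams = Counter(zip(corpus, corpus[1:]))

counts_of_counts = Counter(bigrams.values())
n1, n2 = counts_of_counts[1], counts_of_counts[2]
D = n1 / (n1 + 2 * n2)  # here n1=3, n2=2 -> D = 3/7

def discounted(w2, w1):
    """max(c(w1, w2) - D, 0) / sum_w c(w1, w)"""
    total = sum(c for (h, _), c in bigrams.items() if h == w1)
    return max(bigrams[(w1, w2)] - D, 0) / total

print(round(D, 3))
print(discounted("fun", "speech"))  # unseen bigram keeps probability 0 in this term
```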
Kneser-Ney
The lower-order distribution is not proportional to the number of occurrences of a word, but to the number of different words that it follows:

    p_KN(w_i | w_{i-n+1}^{i-1}) = max{ c(w_{i-n+1}^{i}) - D, 0 } / Σ_{w_i} c(w_{i-n+1}^{i}) + γ(w_{i-n+1}^{i-1}) p_KN(w_i | w_{i-n+2}^{i-1})
Modified Kneser-Ney
    p_MKN(w_i | w_{i-n+1}^{i-1}) = ( c(w_{i-n+1}^{i}) - D(c(w_{i-n+1}^{i})) ) / Σ_{w_i} c(w_{i-n+1}^{i}) + γ(w_{i-n+1}^{i-1}) p_MKN(w_i | w_{i-n+2}^{i-1})

where the discount depends on the count:

    D(c) = 0     if c = 0
           D_1   if c = 1
           D_2   if c = 2
           D_3+  if c ≥ 3

with

    Y    = n_1 / (n_1 + 2 n_2)
    D_1  = 1 - 2Y n_2 / n_1
    D_2  = 2 - 3Y n_3 / n_2
    D_3+ = 3 - 4Y n_4 / n_3
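The count-dependent discounts can be computed directly from counts-of-counts; here using the bigram n_r values from the Good-Turing example table earlier:

```python
# Sketch: Modified Kneser-Ney discounts computed from counts-of-counts n_r,
# using the bigram n_r values from the Good-Turing example table.
n1, n2, n3, n4 = 3877976, 784012, 314114, 175720

Y = n1 / (n1 + 2 * n2)
D1 = 1 - 2 * Y * n2 / n1
D2 = 2 - 3 * Y * n3 / n2
D3plus = 3 - 4 * Y * n4 / n3

# The discounts grow with the count being discounted: D1 < D2 < D3+
print(round(D1, 3), round(D2, 3), round(D3plus, 3))
```

Note that algebraically D_1 = 1 - Y (1/Y - 1) = Y, since 2 n_2 / n_1 = 1/Y - 1.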
Measuring Model Quality
Consider the language as an information source L, which emits a sequence of symbols wi from a finite alphabet (the vocabulary)
The quality of a language model M can be judged by its cross entropy with regard to the distribution PT(x) of some hitherto unseen text T:
Intuitively speaking, cross entropy is the entropy of T as "perceived" by the model M:

    H(P_T; P_M) = - Σ_x P_T(x) log P_M(x) ≈ - (1/n) Σ_i log P_M(x_i)
Perplexity
Perplexity:

    PP_M(T) = 2^{H(P_T; P_M)}

In a language with perplexity X, every word can be followed by X different words with equal probabilities.
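A minimal sketch of the computation: cross entropy is estimated as the average negative log2 probability the model assigns to the test words, and perplexity is 2 to that power. With a uniform model over V words, perplexity comes out to exactly V:

```python
import math

# Sketch: cross entropy H ≈ -(1/n) * sum_i log2 p_M(w_i) over a test text,
# and perplexity PP = 2^H.
def perplexity(word_probs):
    """word_probs: the probability the model assigned to each test word."""
    H = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2 ** H

V = 8
test_text_probs = [1 / V] * 20  # a uniform model assigns 1/V to every test word
print(perplexity(test_text_probs))  # 8.0: each word is "one of 8 equally likely choices"
```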
Elements of Information Theory
Entropy

    H(X) = - Σ_{x∈X} p(x) log p(x)

Mutual Information

    I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]

pointwise

    I(x, y) = log [ p(x, y) / (p(x) p(y)) ]

Kullback-Leibler (KL) divergence

    D(p || q) = Σ_{x∈X} p(x) log [ p(x) / q(x) ]
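Sketch implementations of two of the definitions above (log base 2, so entropy is measured in bits):

```python
import math

# Sketch implementations of entropy and KL divergence (log base 2 -> bits).
def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

uniform = [0.25] * 4
print(entropy(uniform))                                  # 2.0 bits
print(kl_divergence(uniform, uniform))                   # 0.0: D(p || p) = 0
print(kl_divergence(uniform, [0.7, 0.1, 0.1, 0.1]) >= 0)  # True: KL is nonnegative
```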
The Greek Language
Highly inflectional language
A Greek vocabulary of 220K words is needed in order to achieve 99.6% lexical coverage
Language          English              French    Greek           German
Source            Wall Street Journal  Le Monde  Eleftherotypia  Frankfurter Rundschau
Corpus size       37.2 M               37.7 M    35 M            31.5 M
Distinct words    165 K                280 K     410 K           500 K
Vocabulary size   60 K                 60 K      60 K            60 K
Lexical coverage  99.6 %               98.3 %    96.5 %          95.1 %
Perplexity
                 English  French  Greek  German
Vocabulary size  20 K     20 K    64 K   64 K
2-gram PP        198      178     232    430
3-gram PP        135      119     163    336
Experimental Results
                      1M            5M            35M
Smoothing             PP     WER    PP     WER    PP     WER
Good-Turing           341    27.71  248    23.48  163    19.59
Witten-Bell           354    27.42  251    24.17  163    19.84
Absolute Discounting  344    28.47  256    24.25  169    20.78
Modified Kneser-Ney   328    26.78  237    21.91  156    18.57

       1M     5M     35M
OOV    4.75%  3.46%  3.17%
Hit Rate
hit rate % (1M) hit rate % (5M) hit rate % (35M)
1-gram 27.3 16.4 7.4
2-gram 52.5 49.9 40
3-gram 20.2 33.7 52.6
Class-based Models
Some words are similar to other words in their meaning and syntactic function
Group words into classes
– Fewer parameters
– Better estimates
Class-based n-gram models
Suppose that we partition the vocabulary into G classes
This model produces text by first generating a string of classes g1,g2,…,gn
and then converting them into the words wi, i=1,2,…n with probability p(wi|gi)
An n-gram model has V^n - 1 independent parameters (216×10^12); a class-based model has G^n - 1 + V - G parameters (≈ 10^9):
– G^n - 1 for an n-gram model over a vocabulary of size G
– V - G of the form p(w_i | g_i)

    p(w_i | w_{i-2}, w_{i-1}) = p(g_i | g_{i-2}, g_{i-1}) p(w_i | g_i)
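A sketch of the class-based trigram factorization; the word-to-class map, class trigram, and membership probabilities below are all made up for illustration:

```python
# Sketch: class-based trigram p(w_i | w_{i-2}, w_{i-1})
#       = p(g_i | g_{i-2}, g_{i-1}) * p(w_i | g_i).
# Classes and probabilities are made up for illustration.
word_class = {"the": "DET", "cat": "NOUN", "runs": "VERB"}

p_class_trigram = {("DET", "NOUN", "VERB"): 0.4}  # p(g_i | g_{i-2}, g_{i-1})
p_word_given_class = {("runs", "VERB"): 0.1}      # p(w_i | g_i)

def p_class_based(w, w2, w1):
    g, g2, g1 = word_class[w], word_class[w2], word_class[w1]
    return p_class_trigram.get((g2, g1, g), 0.0) * p_word_given_class.get((w, g), 0.0)

print(p_class_based("runs", "the", "cat"))  # 0.4 * 0.1 ≈ 0.04
```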
Relation to n-grams
    p(w_i | w_{i-2}, w_{i-1}) = p(g_i | g_{i-2}, g_{i-1}) p(w_i | g_i)
Defining Classes
Manually
– Use part-of-speech labels assigned by linguistic experts or a tagger
– Use stem information
Automatically
– Cluster words as part of an optimization method, e.g. maximize the log-likelihood of test text
Agglomerative Clustering
Bottom-up clustering
Start with a separate cluster for each word
Merge that pair for which the loss in average MI is least
    H(L, G) = - (1/N) log p(w_1, ..., w_N) = H(w) - I(g_1; g_2)
Example
Syntactic classes
– verbs, past tense: άναψαν, επέλεξαν, κατέλαβαν, πλήρωσαν, πυροβόλησαν
– nouns, neuter: άλογο, δόντι, δέντρο, έντομο, παιδί, ρολόι, σώμα
– adjectives, masculine: δημοκρατικός, δημόσιος, ειδικός, εμπορικός, επίσημος
Semantic classes
– last names: βαρδινογιάννης, γεννηματάς, λοβέρδος, ράλλης
– countries: βραζιλία, βρετανία, γαλλία, γερμανία, δανία
– numerals: δέκατο, δεύτερο, έβδομο, εικοστό, έκτο, ένατο, όγδοο
Some not so well defined classes
– ανακριβής, αναμεταδίδει, διαφημίσουν, κομήτες, προμήθευε
– εξίσωση, έτρωγαν, και, μαλαισία, νηπιαγωγών, φεβρουάριος
Stem-based Classes
άγνωστ: άγνωστος, άγνωστου, άγνωστο, άγνωστον, άγνωστοι, άγνωστους, άγνωστη, άγνωστης, άγνωστες, άγνωστα
βλέπ: βλέπω, βλέπεις, βλέπει, βλέπουμε, βλέπετε, βλέπουν
εκτελ: εκτελεί, εκτελούν, εκτελούσε, εκτελούσαν, εκτελείται, εκτελούνται
εξοχικ: εξοχικό, εξοχικά, εξοχική, εξοχικής, εξοχικές
ιστορικ: ιστορικός, ιστορικού, ιστορικό, ιστορικοί, ιστορικών, ιστορικούς, ιστορική, ιστορικής, ιστορικές, ιστορικά
καθηγητ: καθηγητής, καθηγητή, καθηγητές, καθηγητών
μαχητικ: μαχητικός, μαχητικού, μαχητικό, μαχητικών, μαχητική, μαχητικής, μαχητικά
Experimental Results
G PP (1M) PP (5M) PP (35M)
1 1309 1461 1503
133 (POS) 1047 1143 1167
500 - - 314
1000 - - 266
2000 - - 224
30000 (stem) 383 299 215
60000 328 237 156
Example
Interpolate class-based and word-based models:

    p(w_i | w_{i-2}, w_{i-1}) = λ p(w_i | w_{i-2}, w_{i-1}) + (1 - λ) p(g_i | g_{i-2}, g_{i-1}) p(w_i | g_i)

e.g. with stem classes: p(ατυχηματων | τροχαιων) = p(ατυχημ | τροχαι) · p(ατυχηματων | ατυχημ)
Experimental Results
              1M            5M            35M
G             PP     WER    PP     WER    PP     WER
133 (POS)     325    27.11  236    22.00  156    18.52
500           -      -      -      -      151    18.63
1000          -      -      -      -      150    18.61
2000          -      -      -      -      149    18.65
30000 (stem)  319    26.99  232    22.04  154    18.44
60000         328    26.78  237    21.91  156    18.57
Hit Rate
         hit rate % (1M)  hit rate % (5M)  hit rate % (35M)
1-gram   21.3             12.1             5.1
2-gram   56               50.4             37.6
3-gram   22.7             37.6             57.4

For comparison, the word-based model from the earlier Hit Rate slide:

         hit rate % (1M)  hit rate % (5M)  hit rate % (35M)
1-gram   27.3             16.4             7.4
2-gram   52.5             49.9             40
3-gram   20.2             33.7             52.6
Experimental Results
                    1M            5M            35M
Model               PP     WER    PP     WER    PP     WER
ME 3gram            331    26.83  239    21.94  158    18.60
ME 3gram+stem       320    26.54  227    21.66  143    18.29

                    1M            5M            35M
Model               PP     WER    PP     WER    PP     WER
BO 3gram            328    26.78  237    21.91  156    18.57
Interp. 3gram+stem  319    26.99  232    22.04  154    18.44
Where do we go from here?
Use syntactic information
The dog on the hill barked
Constraints

    f_{h(t),w}(x, y) = 1 if h(t) is the preceding head word in x and y = w
                       0 otherwise

    f_{h(s),h(t),w}(x, y) = 1 if h(s), h(t) are the preceding head words in x and y = w
                            0 otherwise

    p(w | u, v, s, t) = e^{λ_{u,v,w}} e^{λ_{v,w}} e^{λ_{w}} e^{λ_{h(t),w}} e^{λ_{h(s),h(t),w}} / Z(u, v, s, t)
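A minimal sketch of such a maximum-entropy distribution: the probability of a word is a normalized product of exponentials of the weights of the features that fire. The feature names and weights below are illustrative, not trained values:

```python
import math

# Sketch of a maximum-entropy model: p(w | context) is a normalized product
# of exponentials of the weights of the features that fire on w.
# Feature names and weights are illustrative, not trained values.
def me_prob(w, vocab, weights, active_features):
    """p(w | ...) = exp(sum of lambdas firing on w) / Z(context)"""
    def score(word):
        return math.exp(sum(weights.get(f, 0.0) for f in active_features(word)))
    Z = sum(score(word) for word in vocab)  # normalization over candidate words
    return score(w) / Z

vocab = ["barked", "meowed", "ran"]
weights = {("head:dog", "barked"): 1.5}  # a head-word feature like f_{h(t),w}

def active_features(w):
    # features that fire for candidate w when the preceding head word is "dog"
    return [("head:dog", w)]

p = me_prob("barked", vocab, weights, active_features)
print(p > 1 / len(vocab))  # True: the head-word feature raises p(barked) above uniform
```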