
Page 1: Language Models  For Speech Recognition

Language Models For Speech Recognition

Page 2: Language Models  For Speech Recognition

Speech Recognition

$A$: sequence of acoustic vectors

Find the word sequence $\hat{W}$ so that:

$$\hat{W} = \arg\max_W P(W \mid A) = \arg\max_W \frac{P(A \mid W)\,P(W)}{P(A)} = \arg\max_W P(A \mid W)\,P(W)$$

The task of a language model is to make available to the recognizer adequate estimates of the probabilities $P(W)$.

Page 3: Language Models  For Speech Recognition

Language Models

$W = w_1, w_2, \ldots, w_n$, with $w_i \in V$

$$P(W) = P(w_1) \cdot P(w_2 \mid w_1) \cdots P(w_n \mid w_1, \ldots, w_{n-1})$$

e.g. for $W$ = "speech recognition is difficult":

$$P(W) = P(\text{speech}) \cdot P(\text{recognition} \mid \text{speech}) \cdot P(\text{is} \mid \text{speech recognition}) \cdot P(\text{difficult} \mid \text{speech recognition is})$$

Page 4: Language Models  For Speech Recognition

N-gram models

Make the Markov assumption that only the prior local context – the last (N-1) words – affects the next word

$W = w_1, w_2, \ldots, w_n$, with $w_i \in V$

N=3, trigrams: $P(W) = \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})$

N=2, bigrams: $P(W) = \prod_{i=1}^{n} P(w_i \mid w_{i-1})$

N=1, unigrams: $P(W) = \prod_{i=1}^{n} P(w_i)$

Page 5: Language Models  For Speech Recognition

Parameter estimation

Maximum Likelihood Estimator

N=3, trigrams: $P(w_3 \mid w_1, w_2) = f(w_3 \mid w_1, w_2) = \dfrac{c(w_1, w_2, w_3)}{c(w_1, w_2)}$

N=2, bigrams: $P(w_2 \mid w_1) = f(w_2 \mid w_1) = \dfrac{c(w_1, w_2)}{c(w_1)}$

N=1, unigrams: $P(w_1) = f(w_1) = \dfrac{c(w_1)}{N}$

This will assign zero probabilities to unseen events.
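To make the counting concrete, here is a minimal Python sketch (not from the slides) that builds unigram and bigram counts from a toy corpus and computes the maximum-likelihood bigram estimate; the corpus and names are illustrative assumptions:

```python
from collections import Counter

# Toy corpus; a real model would be trained on millions of words.
corpus = "speech recognition is difficult speech recognition is fun".split()

# c(w1) and c(w1, w2)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_ml(w2, w1):
    """Maximum-likelihood bigram estimate f(w2 | w1) = c(w1, w2) / c(w1)."""
    if unigrams[w1] == 0:
        return 0.0  # unseen history
    return bigrams[(w1, w2)] / unigrams[w1]

print(p_ml("recognition", "speech"))  # 1.0: "speech" is always followed by "recognition"
print(p_ml("difficult", "is"))        # 0.5
print(p_ml("easy", "is"))             # 0.0: an unseen event gets zero probability
```

The later sketches below reuse `corpus`, `unigrams`, `bigrams` and `p_ml`.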

Page 6: Language Models  For Speech Recognition

Number of Parameters

For a vocabulary of size V, a 1-gram model has V-1 independent parameters

A 2-gram model has $V^2 - 1$ independent parameters

In general, an n-gram model has $V^n - 1$ independent parameters

Typical values for a moderate size vocabulary of 20,000 words are:

Model    Parameters
1-gram   20,000
2-gram   20,000^2 = 400 million
3-gram   20,000^3 = 8 trillion

Page 7: Language Models  For Speech Recognition

Number of Parameters

$|V|$ = 60,000, $N$ = 35M words (Eleftherotypia daily newspaper)

Count   1-grams    2-grams     3-grams
1       160,273    3,877,976   13,128,073
2       51,725     784,012     1,802,348
3       27,171     314,114     562,264
>0      390,796    5,834,632   16,515,051
≥0      390,796    36×10⁸      216×10¹²

In a typical training text, roughly 80% of trigrams occur only once.

Good-Turing estimate of the unseen mass, $n_1/N$: ML estimates will be zero for 37.5% of the 3-grams and for 11% of the 2-grams in new text.

Page 8: Language Models  For Speech Recognition

Problems

Data sparseness: we do not have enough data to train the model parameters

Solutions:

– Smoothing techniques: accurately estimate probabilities in the presence of sparse data (Good-Turing, Jelinek-Mercer (linear interpolation), Katz (backing-off))

– Build compact models: they have fewer parameters to train and thus require less data, e.g. equivalence classification of words (grammatical classes such as noun, verb, adjective, preposition; semantic labels such as city, name, date)

Page 9: Language Models  For Speech Recognition

Smoothing

Make distributions more uniform

Redistribute probability mass from higher to lower probabilities

Page 10: Language Models  For Speech Recognition

Additive Smoothing

For each n-gram that occurs r times, pretend that it occurs r+1 times

e.g. bigrams:

$$P(w_2 \mid w_1) = \frac{c(w_1, w_2) + 1}{c(w_1) + V}$$
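A one-line sketch of the add-one variant, reusing the toy counts from the earlier bigram example (treating the observed word types as the whole vocabulary, which is a simplification):

```python
V = len(unigrams)  # vocabulary size; here approximated by the observed types

def p_add_one(w2, w1):
    """Add-one smoothed bigram: (c(w1, w2) + 1) / (c(w1) + V)."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(p_add_one("easy", "is"))  # the unseen bigram now gets a small nonzero probability
```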

Page 11: Language Models  For Speech Recognition

Good-Turing Smoothing

For any n-gram that occurs $r$ times, pretend that it occurs $r^*$ times:

$$r^* = (r + 1)\,\frac{n_{r+1}}{n_r}$$

where $n_r$ is the number of n-grams which occur exactly $r$ times.

To convert this count to a probability we just normalize:

$$P_{GT} = \frac{r^*}{N}$$

Total probability of unseen n-grams: $\dfrac{n_1}{N}$

Page 12: Language Models  For Speech Recognition

Example

r (=MLE)   n_r            r* (=GT)
0          3,594,165,368  0.001078
1          3,877,976      0.404
2          784,012        1.202
3          314,114        2.238
4          175,720        3.187
5          112,006        4.199
6          78,391         5.238
7          58,661         6.270
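The adjusted counts in the table can be reproduced from the $n_r$ column alone; a short sketch (the $n_r$ values appear to match the 2-gram counts of the earlier table, which is an inference from the numbers, not a statement on the slide):

```python
# n_r: number of distinct n-grams occurring exactly r times (from the table above)
n = {0: 3_594_165_368, 1: 3_877_976, 2: 784_012, 3: 314_114,
     4: 175_720, 5: 112_006, 6: 78_391, 7: 58_661}

def good_turing(r):
    """Good-Turing adjusted count r* = (r + 1) * n_{r+1} / n_r."""
    return (r + 1) * n[r + 1] / n[r]

for r in range(7):
    print(r, round(good_turing(r), 3))
# prints 0.001, 0.404, 1.202, 2.238, 3.187, 4.199, 5.238 -- matching the table
```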

Page 13: Language Models  For Speech Recognition

Good-Turing, intuitively:

$c(\text{DISCARD THE}) = 0$

$c(\text{DISCARD THOU}) = 0$

so Good-Turing assigns $p(\text{THE} \mid \text{DISCARD}) = p(\text{THOU} \mid \text{DISCARD})$, although intuitively $p(\text{THE} \mid \text{DISCARD}) > p(\text{THOU} \mid \text{DISCARD})$ should hold.

Jelinek-Mercer Smoothing (linear interpolation)

Interpolate a higher-order model with a lower-order model:

$$p_{\text{interp}}(w_i \mid w_{i-1}) = \lambda\, p_{ML}(w_i \mid w_{i-1}) + (1 - \lambda)\, p_{ML}(w_i)$$

Given fixed $p_{ML}$, it is possible to search efficiently for the $\lambda$ that maximizes the probability of some data using the Baum-Welch algorithm.
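A minimal sketch of the interpolation, reusing the toy model from above; the fixed λ = 0.7 is an arbitrary assumption standing in for a value trained with Baum-Welch:

```python
def p_interp(w2, w1, lam=0.7):
    """Jelinek-Mercer: lam * p_ML(w2 | w1) + (1 - lam) * p_ML(w2)."""
    p_uni = unigrams[w2] / sum(unigrams.values())
    return lam * p_ml(w2, w1) + (1 - lam) * p_uni

print(p_interp("speech", "is"))  # unseen bigram, but nonzero via the unigram term
```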

Page 14: Language Models  For Speech Recognition

Katz Smoothing (backing-off)

For those events which have been observed in the training data we assume some reliable estimate of the probability.

For the remaining unseen events we back-off to some less specific distribution.

With $r = c(w_{i-1}, w_i)$:

$$c_{BO}(w_{i-1}, w_i) = \begin{cases} c(w_{i-1}, w_i) & \text{if } r \ge k \\ d_r\, c(w_{i-1}, w_i) & \text{if } 1 \le r < k \\ \alpha(w_{i-1})\, p_{ML}(w_i) & \text{if } r = 0 \end{cases}$$

$$p_{BO}(w_i \mid w_{i-1}) = \frac{c_{BO}(w_{i-1}, w_i)}{\sum_{w_i} c_{BO}(w_{i-1}, w_i)}$$

$\alpha(w_{i-1})$ is chosen so that the total probability sums to 1.
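A simplified sketch of the backoff idea for bigrams, with a single fixed discount d in place of the count-dependent Good-Turing discounts $d_r$ (so this is Katz-flavoured, not Katz exactly):

```python
def p_backoff(w2, w1, d=0.5):
    """Discount seen bigrams by d; give the freed mass to unseen successors,
    distributed proportionally to the unigram model (toy sketch; ignores
    sentence-boundary bookkeeping)."""
    total = sum(unigrams.values())
    if bigrams[(w1, w2)] > 0:
        return (bigrams[(w1, w2)] - d) / unigrams[w1]
    seen = {w for (h, w) in bigrams if h == w1}
    freed = d * len(seen) / unigrams[w1]   # probability mass freed by discounting
    unseen_mass = sum(unigrams[w] for w in unigrams if w not in seen) / total
    return freed * (unigrams[w2] / total) / unseen_mass
```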

Page 15: Language Models  For Speech Recognition

Witten-Bell Smoothing

Model the probability of new events, estimating the probability of seeing such a new event as we proceed through the training corpus (the number of such novel events equals the total number of word types in the corpus).

$$p_{WB}(w_i \mid w_{i-n+1}^{i-1}) = \lambda_{w_{i-n+1}^{i-1}}\, p_{ML}(w_i \mid w_{i-n+1}^{i-1}) + (1 - \lambda_{w_{i-n+1}^{i-1}})\, p_{WB}(w_i \mid w_{i-n+2}^{i-1})$$

$$1 - \lambda_{w_{i-n+1}^{i-1}} = \frac{T}{T + N}$$

where $T$ is the number of distinct word types that follow the history $w_{i-n+1}^{i-1}$ in the training data and $N$ is the total number of tokens following it.
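A bigram sketch with the Witten-Bell weights, reusing the toy counts (backing off to the unigram model rather than recursing, for brevity):

```python
def p_witten_bell(w2, w1):
    """1 - lambda = T / (T + N): T = distinct successor types of w1,
    N = number of tokens following w1."""
    T = len({w for (h, w) in bigrams if h == w1})
    N = sum(c for (h, _), c in bigrams.items() if h == w1)
    p_uni = unigrams[w2] / sum(unigrams.values())
    if T + N == 0:
        return p_uni              # unseen history: unigram model only
    lam = N / (T + N)
    return lam * p_ml(w2, w1) + (1 - lam) * p_uni
```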

Page 16: Language Models  For Speech Recognition

Absolute Discounting

Subtract a constant D from each nonzero count

$$p_{abs}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\{c(w_{i-n+1}^{i}) - D,\, 0\}}{\sum_{w_i} c(w_{i-n+1}^{i})} + (1 - \lambda_{w_{i-n+1}^{i-1}})\, p_{abs}(w_i \mid w_{i-n+2}^{i-1})$$

$$D = \frac{n_1}{n_1 + 2 n_2}$$
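As a quick illustration with the 2-gram counts from the earlier table ($n_1 = 3{,}877{,}976$, $n_2 = 784{,}012$; our computation, not a value quoted on the slides):

$$D = \frac{3{,}877{,}976}{3{,}877{,}976 + 2 \cdot 784{,}012} \approx 0.71$$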

Page 17: Language Models  For Speech Recognition

Kneser-Ney

The lower-order distribution is proportional not to the number of occurrences of a word, but to the number of different words that it follows.

$$p_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\{c(w_{i-n+1}^{i}) - D,\, 0\}}{\sum_{w_i} c(w_{i-n+1}^{i})} + \gamma(w_{i-n+1}^{i-1})\, p_{KN}(w_i \mid w_{i-n+2}^{i-1})$$
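A bigram sketch of the idea, reusing the toy counts; D = 0.71 is the illustrative discount computed above, and the continuation distribution implements "number of different words it follows":

```python
def p_kn(w2, w1, D=0.71):
    """Kneser-Ney bigram sketch with continuation counts."""
    # p_cont(w2): distinct left contexts of w2 / distinct bigram types
    p_cont = len({h for (h, w) in bigrams if w == w2}) / len(bigrams)
    if unigrams[w1] == 0:
        return p_cont
    T = len({w for (h, w) in bigrams if h == w1})
    gamma = D * T / unigrams[w1]   # weight that makes the distribution sum to 1
    return max(bigrams[(w1, w2)] - D, 0) / unigrams[w1] + gamma * p_cont
```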

Page 18: Language Models  For Speech Recognition

Modified Kneser-Ney

$$p_{MKN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i}) - D\big(c(w_{i-n+1}^{i})\big)}{\sum_{w_i} c(w_{i-n+1}^{i})} + \gamma(w_{i-n+1}^{i-1})\, p_{MKN}(w_i \mid w_{i-n+2}^{i-1})$$

$$D(c) = \begin{cases} 0 & \text{if } c = 0 \\ D_1 & \text{if } c = 1 \\ D_2 & \text{if } c = 2 \\ D_{3+} & \text{if } c \ge 3 \end{cases}$$

$$Y = \frac{n_1}{n_1 + 2 n_2}, \qquad D_1 = 1 - 2Y \frac{n_2}{n_1}, \qquad D_2 = 2 - 3Y \frac{n_3}{n_2}, \qquad D_{3+} = 3 - 4Y \frac{n_4}{n_3}$$
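The discounts are fixed by the count-of-count statistics; a short sketch using the 2-gram $n_r$ values from the earlier table (illustrative, the slides do not quote these discounts):

```python
n1, n2, n3, n4 = 3_877_976, 784_012, 314_114, 175_720
Y = n1 / (n1 + 2 * n2)
D1 = 1 - 2 * Y * n2 / n1
D2 = 2 - 3 * Y * n3 / n2
D3plus = 3 - 4 * Y * n4 / n3
print(round(D1, 3), round(D2, 3), round(D3plus, 3))  # 0.712 1.144 1.407
```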

Page 19: Language Models  For Speech Recognition

Measuring Model Quality

Consider the language as an information source $L$, which emits a sequence of symbols $w_i$ from a finite alphabet (the vocabulary).

The quality of a language model $M$ can be judged by its cross entropy with regard to the distribution $P_T(x)$ of some hitherto unseen text $T$:

$$H(P_T; P_M) = -\sum_x P_T(x) \log P_M(x) \approx -\frac{1}{n} \sum_x \log P_M(x)$$

Intuitively speaking, cross entropy is the entropy of $T$ as "perceived" by the model $M$.

Page 20: Language Models  For Speech Recognition

Perplexity

Perplexity: $PP_M(T) = 2^{H(P_T;\, P_M)}$

In a language with perplexity $X$, every word can be followed by $X$ different words with equal probabilities.
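Cross entropy and perplexity can be estimated directly from a held-out word sequence; a sketch reusing the toy models from above (any of the smoothed conditionals can be plugged in):

```python
import math

def perplexity(test_words, prob):
    """PP = 2^H with H = -(1/n) * sum(log2 prob(w_i | w_{i-1}))."""
    log_sum = sum(math.log2(prob(w2, w1))
                  for w1, w2 in zip(test_words, test_words[1:]))
    H = -log_sum / (len(test_words) - 1)
    return 2 ** H

print(perplexity("speech recognition is fun".split(), p_interp))
```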

Page 21: Language Models  For Speech Recognition

Elements of Information Theory

Entropy: $H(X) = -\sum_{x \in X} p(x) \log p(x)$

Mutual Information: $I(X; Y) = \sum_{x \in X}\sum_{y \in Y} p(x, y) \log \dfrac{p(x, y)}{p(x)\,p(y)}$

pointwise: $I(x, y) = \log \dfrac{p(x, y)}{p(x)\,p(y)}$

Kullback-Leibler (KL) divergence: $D(p \,\|\, q) = \sum_{x \in X} p(x) \log \dfrac{p(x)}{q(x)}$
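A small sketch of two of these quantities for discrete distributions given as dicts (base-2 logs, so the results are in bits):

```python
import math

def entropy(p):
    """H(X) = -sum p(x) log2 p(x)."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def kl(p, q):
    """D(p || q) = sum p(x) log2(p(x) / q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

print(entropy({"a": 0.5, "b": 0.5}))                   # 1.0 bit
print(kl({"a": 0.5, "b": 0.5}, {"a": 0.9, "b": 0.1}))  # > 0: distributions differ
```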

Page 22: Language Models  For Speech Recognition

The Greek Language

Highly inflectional language

A Greek vocabulary of 220K words is needed in order to achieve 99.6% lexical coverage

                  English              French    Greek           German
Source            Wall Street Journal  Le Monde  Eleftherotypia  Frankfurter Rundschau
Corpus size       37.2 M               37.7 M    35 M            31.5 M
Distinct words    165 K                280 K     410 K           500 K
Vocabulary size   60 K                 60 K      60 K            60 K
Lexical coverage  99.6 %               98.3 %    96.5 %          95.1 %

Page 23: Language Models  For Speech Recognition

Perplexity

                 English  French  Greek  German
Vocabulary size  20 K     20 K    64 K   64 K
2-gram PP        198      178     232    430
3-gram PP        135      119     163    336

Page 24: Language Models  For Speech Recognition

Experimental Results

(PP = perplexity, WER = word error rate in %)

                      1M           5M           35M
Smoothing             PP    WER    PP    WER    PP    WER
Good-Turing           341   27.71  248   23.48  163   19.59
Witten-Bell           354   27.42  251   24.17  163   19.84
Absolute Discounting  344   28.47  256   24.25  169   20.78
Modified Kneser-Ney   328   26.78  237   21.91  156   18.57

Out-of-vocabulary rate:

       1M     5M     35M
OOV    4.75%  3.46%  3.17%

Page 25: Language Models  For Speech Recognition

Hit Rate

Model    hit rate % (1M)  hit rate % (5M)  hit rate % (35M)
1-gram   27.3             16.4             7.4
2-gram   52.5             49.9             40.0
3-gram   20.2             33.7             52.6

Page 26: Language Models  For Speech Recognition

Class-based Models

Some words are similar to other words in their meaning and syntactic function

Group words into classes:
– Fewer parameters
– Better estimates

Page 27: Language Models  For Speech Recognition

Class-based n-gram models

Suppose that we partition the vocabulary into G classes

This model produces text by first generating a string of classes $g_1, g_2, \ldots, g_n$ and then converting them into the words $w_i$, $i = 1, 2, \ldots, n$ with probability $p(w_i \mid g_i)$.

An n-gram model has $V^n - 1$ independent parameters ($216 \times 10^{12}$); a class-based model has $G^n - 1 + V - G$ parameters ($\approx 10^9$):

– $G^n - 1$ of an n-gram model over a vocabulary of size $G$
– $V - G$ of the form $p(w_i \mid g_i)$

$$p(w_i \mid w_{i-2}, w_{i-1}) = p(g_i \mid g_{i-2}, g_{i-1})\, p(w_i \mid g_i)$$
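A toy sketch of the factorization with hypothetical classes and hand-set probabilities (a bigram version for brevity; the slide's trigram case is analogous):

```python
word2class = {"paris": "CITY", "london": "CITY", "eats": "VERB", "runs": "VERB"}
p_class = {("CITY", "VERB"): 0.9, ("CITY", "CITY"): 0.1,
           ("VERB", "CITY"): 0.6, ("VERB", "VERB"): 0.4}
p_word_given_class = {"paris": 0.5, "london": 0.5, "eats": 0.5, "runs": 0.5}

def p_class_bigram(w, w_prev):
    """p(w_i | w_{i-1}) = p(g_i | g_{i-1}) * p(w_i | g_i)."""
    return (p_class[(word2class[w_prev], word2class[w])]
            * p_word_given_class[w])

print(p_class_bigram("eats", "paris"))  # 0.9 * 0.5 = 0.45
```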

Page 28: Language Models  For Speech Recognition

Relation to n-grams

$$p(w_i \mid w_{i-2}, w_{i-1}) = p(g_i \mid g_{i-2}, g_{i-1})\, p(w_i \mid g_i)$$

Page 29: Language Models  For Speech Recognition

Defining Classes

Manually:
– Use part-of-speech labels assigned by linguistic experts or a tagger
– Use stem information

Automatically:
– Cluster words as part of an optimization method (e.g. maximize the log-likelihood of test text)

Page 30: Language Models  For Speech Recognition

Agglomerative Clustering

Bottom-up clustering

Start with a separate cluster for each word

Merge that pair for which the loss in average MI is least

$$H(L, G) = -\frac{1}{N} \log p(w_1, \ldots, w_N) = H(w) - I(g_1; g_2)$$
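A brute-force sketch of the bottom-up procedure on a toy corpus: each merge candidate is evaluated by recomputing the average mutual information of adjacent class pairs, and the pair whose merge loses the least MI is kept (real implementations update MI incrementally instead of recomputing):

```python
from collections import Counter
from itertools import combinations
import math

text = "the cat runs the dog runs a cat sleeps a dog sleeps".split()

def avg_mi(assign):
    """I(g1; g2) over adjacent class pairs under the class map `assign`."""
    pairs = Counter((assign[a], assign[b]) for a, b in zip(text, text[1:]))
    total = sum(pairs.values())
    left, right = Counter(), Counter()
    for (g1, g2), c in pairs.items():
        left[g1] += c
        right[g2] += c
    return sum((c / total) * math.log2((c / total) /
               ((left[g1] / total) * (right[g2] / total)))
               for (g1, g2), c in pairs.items())

assign = {w: w for w in set(text)}       # start: one cluster per word
while len(set(assign.values())) > 3:     # stop at 3 classes (arbitrary)
    best = None
    for g1, g2 in combinations(set(assign.values()), 2):
        trial = {w: (g1 if g == g2 else g) for w, g in assign.items()}
        mi = avg_mi(trial)
        if best is None or mi > best[0]:
            best = (mi, trial)
    assign = best[1]
print(assign)  # words grouped by distributional behaviour
```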

Page 31: Language Models  For Speech Recognition

Example

Syntactic classes:
– verbs, past tense: άναψαν, επέλεξαν, κατέλαβαν, πλήρωσαν, πυροβόλησαν
– nouns, neuter: άλογο, δόντι, δέντρο, έντομο, παιδί, ρολόι, σώμα
– adjectives, masculine: δημοκρατικός, δημόσιος, ειδικός, εμπορικός, επίσημος

Semantic classes:
– last names: βαρδινογιάννης, γεννηματάς, λοβέρδος, ράλλης
– countries: βραζιλία, βρετανία, γαλλία, γερμανία, δανία
– numerals: δέκατο, δεύτερο, έβδομο, εικοστό, έκτο, ένατο, όγδοο

Some not so well defined classes:
– ανακριβής, αναμεταδίδει, διαφημίσουν, κομήτες, προμήθευε
– εξίσωση, έτρωγαν, και, μαλαισία, νηπιαγωγών, φεβρουάριος

Page 32: Language Models  For Speech Recognition

Stem-based Classes

άγνωστ: άγνωστος, άγνωστου, άγνωστο, άγνωστον, άγνωστοι, άγνωστους, άγνωστη, άγνωστης, άγνωστες, άγνωστα
βλέπ: βλέπω, βλέπεις, βλέπει, βλέπουμε, βλέπετε, βλέπουν
εκτελ: εκτελεί, εκτελούν, εκτελούσε, εκτελούσαν, εκτελείται, εκτελούνται
εξοχικ: εξοχικό, εξοχικά, εξοχική, εξοχικής, εξοχικές
ιστορικ: ιστορικός, ιστορικού, ιστορικό, ιστορικοί, ιστορικών, ιστορικούς, ιστορική, ιστορικής, ιστορικές, ιστορικά
καθηγητ: καθηγητής, καθηγητή, καθηγητές, καθηγητών
μαχητικ: μαχητικός, μαχητικού, μαχητικό, μαχητικών, μαχητική, μαχητικής, μαχητικά

Page 33: Language Models  For Speech Recognition

Experimental Results

G             PP (1M)  PP (5M)  PP (35M)
1             1309     1461     1503
133 (POS)     1047     1143     1167
500           -        -        314
1000          -        -        266
2000          -        -        224
30000 (stem)  383      299      215
60000         328      237      156

Page 34: Language Models  For Speech Recognition

Example

Interpolate class-based and word-based models

$$p(\text{ατυχηματων} \mid \text{τροχαιων}) = p(\text{ατυχημ} \mid \text{τροχαι})\; p(\text{ατυχηματων} \mid \text{ατυχημ})$$

$$p_{\text{interp}}(w_i \mid w_{i-2}, w_{i-1}) = \lambda\, p(w_i \mid w_{i-2}, w_{i-1}) + (1 - \lambda)\, p(g_i \mid g_{i-2}, g_{i-1})\, p(w_i \mid g_i)$$

Page 35: Language Models  For Speech Recognition

Experimental Results

              1M           5M           35M
G             PP    WER    PP    WER    PP    WER
133 (POS)     325   27.11  236   22.00  156   18.52
500           -     -      -     -      151   18.63
1000          -     -      -     -      150   18.61
2000          -     -      -     -      149   18.65
30000 (stem)  319   26.99  232   22.04  154   18.44
60000         328   26.78  237   21.91  156   18.57

Page 36: Language Models  For Speech Recognition

Hit Rate

Model    hit rate % (1M)  hit rate % (5M)  hit rate % (35M)
1-gram   21.3             12.1             5.1
2-gram   56.0             50.4             37.6
3-gram   22.7             37.6             57.4

For comparison, the word-based hit rates from the earlier Hit Rate slide:

Model    hit rate % (1M)  hit rate % (5M)  hit rate % (35M)
1-gram   27.3             16.4             7.4
2-gram   52.5             49.9             40.0
3-gram   20.2             33.7             52.6

Page 37: Language Models  For Speech Recognition

Experimental Results

(ME = maximum entropy, BO = backing-off)

                    1M           5M           35M
Model               PP    WER    PP    WER    PP    WER
ME 3gram            331   26.83  239   21.94  158   18.60
ME 3gram+stem       320   26.54  227   21.66  143   18.29

                    1M           5M           35M
Model               PP    WER    PP    WER    PP    WER
BO 3gram            328   26.78  237   21.91  156   18.57
Interp. 3gram+stem  319   26.99  232   22.04  154   18.44

Page 38: Language Models  For Speech Recognition

Where do we go from here?

Use syntactic information

The dog on the hill barked

Constraints

$$f_{h(t),\, w}(x, y) = \begin{cases} 1 & \text{if } h(t) \text{ is the preceding head word in } x \text{ and } y = w \\ 0 & \text{otherwise} \end{cases}$$

$$f_{h(s),\, h(t),\, w}(x, y) = \begin{cases} 1 & \text{if } h(s), h(t) \text{ are the preceding head words in } x \text{ and } y = w \\ 0 & \text{otherwise} \end{cases}$$

$$p(w \mid u, v, s, t) = \frac{e^{\lambda_w}\, e^{\lambda_{v,w}}\, e^{\lambda_{u,v,w}}\, e^{\lambda_{h(t),w}}\, e^{\lambda_{h(s),h(t),w}}}{Z_\lambda(u, v, s, t)}$$
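A toy sketch of such an exponential (maximum-entropy) model: each active feature contributes a weight λ in the exponent, and Z normalizes over the vocabulary; the weights below are made-up values, not trained ones:

```python
import math

lam = {("w", "barked"): 0.5,            # unigram feature
       ("vw", "hill", "barked"): 0.2,   # bigram feature
       ("hw", "dog", "barked"): 1.1}    # head-word feature, h(t) = "dog"

def score(w, v, head):
    return math.exp(lam.get(("w", w), 0.0)
                    + lam.get(("vw", v, w), 0.0)
                    + lam.get(("hw", head, w), 0.0))

def p_me(w, v, head, vocab):
    """p(w | context) = exp(sum of active lambdas) / Z(context)."""
    Z = sum(score(u, v, head) for u in vocab)
    return score(w, v, head) / Z

vocab = ["barked", "slept", "the"]
print(p_me("barked", "hill", "dog", vocab))  # the head word "dog" boosts "barked"
```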