Language Models for Speech Recognition (PowerPoint presentation transcript)
Language Models For Speech Recognition
Speech Recognition
A: sequence of acoustic vectors
Find the word sequence W so that:

    Ŵ = argmax_W P(W | A) = argmax_W P(A | W) P(W) / P(A) = argmax_W P(A | W) P(W)

The task of a language model is to make available to the recognizer adequate estimates of the probabilities P(W).
Language Models
W = w_1, w_2, ..., w_n,  w_i ∈ V

    P(W) = P(w_1) · P(w_2 | w_1) · ... · P(w_n | w_1, ..., w_{n-1})

e.g. W = "speech recognition is difficult":

    P(W) = P(speech) · P(recognition | speech) · P(is | speech recognition) · P(difficult | speech recognition is)
N-gram models
Make the Markov assumption that only the prior local context – the last (N-1) words – affects the next word
    N=3 (trigrams):  P(W) = ∏_{i=1}^{n} P(w_i | w_{i-2}, w_{i-1})
    N=2 (bigrams):   P(W) = ∏_{i=1}^{n} P(w_i | w_{i-1})
    N=1 (unigrams):  P(W) = ∏_{i=1}^{n} P(w_i)

with W = w_1, w_2, ..., w_n,  w_i ∈ V
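As a concrete illustration of the bigram (N=2) case, a sentence can be scored by multiplying conditional probabilities along the word sequence. The probability table below is invented for the sketch, not estimated from data:

```python
# Sketch: scoring a sentence under the bigram (N=2) Markov assumption.
# The probability table is invented for illustration, not real estimates.
P_bigram = {
    ("<s>", "speech"): 0.2,
    ("speech", "recognition"): 0.5,
    ("recognition", "is"): 0.3,
    ("is", "difficult"): 0.1,
}

def bigram_prob(words):
    """P(W) = prod_i P(w_i | w_{i-1}), using <s> as the initial context."""
    prob, prev = 1.0, "<s>"
    for w in words:
        prob *= P_bigram.get((prev, w), 0.0)  # unseen bigrams get probability zero
        prev = w
    return prob

print(bigram_prob("speech recognition is difficult".split()))  # 0.2 * 0.5 * 0.3 * 0.1 ≈ 0.003
```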
Parameter estimation
Maximum Likelihood Estimator
    N=3 (trigrams):  P(w_3 | w_1, w_2) = f(w_3 | w_1, w_2) = c(w_1, w_2, w_3) / c(w_1, w_2)
    N=2 (bigrams):   P(w_2 | w_1) = f(w_2 | w_1) = c(w_1, w_2) / c(w_1)
    N=1 (unigrams):  P(w_1) = f(w_1) = c(w_1) / N

This will assign zero probabilities to unseen events
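A minimal sketch of the relative-frequency (ML) estimator on a tiny made-up corpus; real models are of course trained on millions of words:

```python
from collections import Counter

# Sketch: maximum-likelihood n-gram estimation by relative frequency
# on a made-up eight-word corpus.
corpus = "speech recognition is difficult speech recognition is fun".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def f_bigram(w2, w1):
    """f(w2 | w1) = c(w1, w2) / c(w1) -- zero for unseen events."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(f_bigram("recognition", "speech"))  # c(speech, recognition)=2, c(speech)=2 -> 1.0
print(f_bigram("difficult", "is"))        # c(is, difficult)=1, c(is)=2 -> 0.5
```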
Number of Parameters
For a vocabulary of size V, a 1-gram model has V - 1 independent parameters
A 2-gram model has V^2 - 1 independent parameters
In general, an n-gram model has V^n - 1 independent parameters
Typical values for a moderate size vocabulary of 20,000 words are:

Model   Parameters
1-gram  20,000
2-gram  20,000^2 = 400 million
3-gram  20,000^3 = 8 trillion
Number of Parameters
|V| = 60,000, N = 35M (Eleftherotypia daily newspaper)

Count  1-grams  2-grams    3-grams
1      160,273  3,877,976  13,128,073
2      51,725   784,012    1,802,348
3      27,171   314,114    562,264
>0     390,796  5,834,632  16,515,051
>=0    390,796  36×10^8    216×10^12

In a typical training text, roughly 80% of trigrams occur only once
Good-Turing estimate (total probability of unseen n-grams ≈ n_1/N): ML estimates will be zero for 37.5% of the 3-grams and for 11% of the 2-grams
Problems
Data sparseness: we do not have enough data to train the model parameters
Solutions
– Smoothing techniques: accurately estimate probabilities in the presence of sparse data (Good-Turing, Jelinek-Mercer (linear interpolation), Katz (backing-off))
– Build compact models: they have fewer parameters to train and thus require less data, e.g. equivalence classification of words (grammatical classes such as noun, verb, adjective, preposition; semantic labels such as city, name, date)
Smoothing
Make distributions more uniform
Redistribute probability mass from higher to lower probabilities
Additive Smoothing
For each n-gram that occurs r times, pretend that it occurs r + 1 times
e.g. bigrams:

    P(w_2 | w_1) = ( c(w_1, w_2) + 1 ) / ( c(w_1) + V )
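A minimal sketch of add-one (Laplace) smoothing for bigrams, again on a toy corpus; note how an unseen bigram now receives a small nonzero probability:

```python
from collections import Counter

# Sketch: add-one (Laplace) smoothing for bigrams on a toy corpus.
corpus = "speech recognition is difficult".split()
V = len(set(corpus))  # vocabulary size

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_add1(w2, w1):
    """P(w2 | w1) = (c(w1, w2) + 1) / (c(w1) + V)"""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(p_add1("speech", "difficult"))      # unseen bigram: (0 + 1) / (1 + 4) = 0.2
print(p_add1("recognition", "speech"))    # seen bigram:   (1 + 1) / (1 + 4) = 0.4
```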
Good-Turing Smoothing
For any n-gram that occurs r times, pretend that it occurs r* times:

    r* = (r + 1) n_{r+1} / n_r

where n_r is the number of n-grams which occur r times.
To convert this count to a probability we just normalize:

    P_GT = r* / N

Total probability of unseen n-grams: n_1 / N
Example
r (= MLE)  n_r            r* (= GT)
0          3,594,165,368  0.001078
1          3,877,976      0.404
2          784,012        1.202
3          314,114        2.238
4          175,720        3.187
5          112,006        4.199
6          78,391         5.238
7          58,661         6.270
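The adjusted counts in the table can be reproduced directly from the formula r* = (r + 1) n_{r+1} / n_r, using the n_r values above:

```python
# Sketch: Good-Turing adjusted counts r* = (r + 1) * n_{r+1} / n_r,
# using the counts-of-counts n_r from the table above.
n = {1: 3877976, 2: 784012, 3: 314114, 4: 175720}

def r_star(r):
    return (r + 1) * n[r + 1] / n[r]

print(round(r_star(1), 3))  # 0.404, matching the table
print(round(r_star(2), 3))  # 1.202
print(round(r_star(3), 3))  # 2.238
```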
Good-Turing
Intuitively: the counts of n-grams seen r + 1 times are used to re-estimate the mass of n-grams seen r times (figure)
Jelinek-Mercer Smoothing (linear interpolation)

    c(DISCARD THE) = 0
    c(DISCARD THOU) = 0

so the ML bigram estimates give p(THE | DISCARD) = p(THOU | DISCARD), while we would expect p(THE | DISCARD) > p(THOU | DISCARD).

Interpolate a higher-order model with a lower-order model:

    p_interp(w_i | w_{i-1}) = λ p_ML(w_i | w_{i-1}) + (1 - λ) p_ML(w_i)

Given fixed p_ML, it is possible to search efficiently for the λ that maximizes the probability of some data using the Baum-Welch algorithm.
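A sketch of the interpolation on the DISCARD example; the ML probabilities and λ below are invented for illustration:

```python
# Sketch: Jelinek-Mercer linear interpolation of a bigram with a unigram model.
# The p_ML values and lambda are illustrative, not estimated from a real corpus.
p_ml_bigram = {("DISCARD", "THE"): 0.0, ("DISCARD", "THOU"): 0.0}
p_ml_unigram = {"THE": 0.05, "THOU": 0.0001}

def p_interp(w, prev, lam=0.7):
    """p = lambda * p_ML(w | prev) + (1 - lambda) * p_ML(w)"""
    return lam * p_ml_bigram.get((prev, w), 0.0) + (1 - lam) * p_ml_unigram.get(w, 0.0)

# Both bigram counts are zero, but interpolation now prefers THE over THOU:
print(p_interp("THE", "DISCARD") > p_interp("THOU", "DISCARD"))  # True
```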
Katz Smoothing (backing-off)
For those events which have been observed in the training data we assume some reliable estimate of the probability. For the remaining unseen events we back off to some less specific distribution.

    p_BO(w_i | w_{i-1}) =
        f(w_i | w_{i-1})          if r ≥ k
        d_r f(w_i | w_{i-1})      if 1 ≤ r < k
        α(w_{i-1}) p_ML(w_i)      if r = 0

where r = c(w_{i-1}, w_i). α(w_{i-1}) is chosen so that the total probability sums to 1:

    p_BO(w_i | w_{i-1}) = c_BO(w_{i-1}, w_i) / Σ_{w_i} c_BO(w_{i-1}, w_i)
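A deliberately simplified sketch of the backing-off idea: full Katz smoothing discounts the seen counts (the d_r coefficients) and computes α(w_{i-1}) so that everything normalizes; here a fixed toy alpha stands in for that computation:

```python
from collections import Counter

# Simplified sketch of backing-off. In real Katz smoothing the d_r discounts
# and alpha(w1) are derived from Good-Turing counts; a fixed toy alpha is
# used here instead, so this version is NOT properly normalized.
corpus = "speech recognition is difficult speech recognition is fun".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def p_backoff(w2, w1, alpha=0.4):
    """Seen bigram: use its relative frequency; unseen: back off to the
    (scaled) unigram distribution."""
    if bigrams[(w1, w2)] > 0:
        return bigrams[(w1, w2)] / unigrams[w1]
    return alpha * unigrams[w2] / N

print(p_backoff("recognition", "speech"))  # seen: 2/2 = 1.0
print(p_backoff("fun", "difficult"))       # unseen: 0.4 * 1/8 = 0.05
```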
Witten-Bell Smoothing
Model the probability of new events, estimating the probability of seeing such a new event as we proceed through the training corpus (i.e. from the total number of word types in the corpus):

    p_WB(w_i | w_{i-n+1}^{i-1}) = λ_{w_{i-n+1}^{i-1}} p_ML(w_i | w_{i-n+1}^{i-1}) + (1 - λ_{w_{i-n+1}^{i-1}}) p_WB(w_i | w_{i-n+2}^{i-1})

    1 - λ_{w_{i-n+1}^{i-1}} = T / (N + T)

where T is the number of distinct word types observed after the history and N is the total count of the history.
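A minimal sketch of the Witten-Bell new-event weight 1 - λ = T / (N + T) computed for one history on a toy corpus:

```python
from collections import Counter

# Sketch: the Witten-Bell weight 1 - lambda = T / (N + T) for a history,
# where T = number of distinct words seen after the history, N = its total count.
corpus = "speech recognition is difficult speech recognition is fun".split()
followers = Counter(zip(corpus, corpus[1:]))

def new_event_mass(history):
    T = sum(1 for (w1, _w2) in followers if w1 == history)             # distinct continuations
    N = sum(c for (w1, _w2), c in followers.items() if w1 == history)  # total count of the history
    return T / (N + T)

print(new_event_mass("is"))  # "is" is followed by {difficult, fun}: T=2, N=2 -> 0.5
```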
Absolute Discounting
Subtract a constant D from each nonzero count:

    p_abs(w_i | w_{i-n+1}^{i-1}) = max{ c(w_{i-n+1}^{i}) - D, 0 } / Σ_{w_i} c(w_{i-n+1}^{i}) + (1 - λ_{w_{i-n+1}^{i-1}}) p_abs(w_i | w_{i-n+2}^{i-1})

    D = n_1 / (n_1 + 2 n_2)
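A sketch of the discounting step on a toy corpus; only the higher-order term is shown, and the backed-off lower-order term of the full formula is omitted:

```python
from collections import Counter

# Sketch: absolute discounting of bigram counts, with D = n1 / (n1 + 2 * n2).
# Only the discounted higher-order term is computed; the interpolated
# lower-order term of the full formula is omitted in this sketch.
corpus = "speech recognition is difficult speech recognition is fun".split()
bigrams = Counter(zip(corpus, corpus[1:]))

counts_of_counts = Counter(bigrams.values())
n1, n2 = counts_of_counts[1], counts_of_counts[2]
D = n1 / (n1 + 2 * n2)  # here n1=3, n2=2 -> D = 3/7

def discounted(w2, w1):
    """max(c(w1, w2) - D, 0) / sum_w c(w1, w)"""
    total = sum(c for (h, _), c in bigrams.items() if h == w1)
    return max(bigrams[(w1, w2)] - D, 0) / total

print(round(D, 3))
print(discounted("fun", "speech"))  # unseen bigram keeps probability 0 in this term
```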
Kneser-Ney
The lower-order distribution is not proportional to the number of occurrences of a word, but to the number of different words that it follows:

    p_KN(w_i | w_{i-n+1}^{i-1}) = max{ c(w_{i-n+1}^{i}) - D, 0 } / Σ_{w_i} c(w_{i-n+1}^{i}) + γ(w_{i-n+1}^{i-1}) p_KN(w_i | w_{i-n+2}^{i-1})
Modified Kneser-Ney
    p_MKN(w_i | w_{i-n+1}^{i-1}) = ( c(w_{i-n+1}^{i}) - D(c(w_{i-n+1}^{i})) ) / Σ_{w_i} c(w_{i-n+1}^{i}) + γ(w_{i-n+1}^{i-1}) p_MKN(w_i | w_{i-n+2}^{i-1})

where the discount depends on the count:

    D(c) = 0     if c = 0
           D_1   if c = 1
           D_2   if c = 2
           D_3+  if c ≥ 3

with

    Y    = n_1 / (n_1 + 2 n_2)
    D_1  = 1 - 2Y n_2 / n_1
    D_2  = 2 - 3Y n_3 / n_2
    D_3+ = 3 - 4Y n_4 / n_3
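The count-dependent discounts can be computed directly from counts-of-counts; here using the bigram n_r values from the Good-Turing example table earlier:

```python
# Sketch: Modified Kneser-Ney discounts computed from counts-of-counts n_r,
# using the bigram n_r values from the Good-Turing example table.
n1, n2, n3, n4 = 3877976, 784012, 314114, 175720

Y = n1 / (n1 + 2 * n2)
D1 = 1 - 2 * Y * n2 / n1
D2 = 2 - 3 * Y * n3 / n2
D3plus = 3 - 4 * Y * n4 / n3

# The discounts grow with the count being discounted: D1 < D2 < D3+
print(round(D1, 3), round(D2, 3), round(D3plus, 3))
```

Note that algebraically D_1 = 1 - Y (1/Y - 1) = Y, since 2 n_2 / n_1 = 1/Y - 1.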
Measuring Model Quality
Consider the language as an information source L, which emits a sequence of symbols wi from a finite alphabet (the vocabulary)
The quality of a language model M can be judged by its cross entropy with regard to the distribution PT(x) of some hitherto unseen text T:
Intuitively speaking, cross entropy is the entropy of T as "perceived" by the model M:

    H(P_T; P_M) = - Σ_x P_T(x) log P_M(x) ≈ - (1/n) Σ_i log P_M(x_i)
Perplexity
Perplexity:

    PP_M(T) = 2^{H(P_T; P_M)}

In a language with perplexity X, every word can be followed by X different words with equal probabilities.
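A minimal sketch of the computation: cross entropy is estimated as the average negative log2 probability the model assigns to the test words, and perplexity is 2 to that power. With a uniform model over V words, perplexity comes out to exactly V:

```python
import math

# Sketch: cross entropy H ≈ -(1/n) * sum_i log2 p_M(w_i) over a test text,
# and perplexity PP = 2^H.
def perplexity(word_probs):
    """word_probs: the probability the model assigned to each test word."""
    H = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2 ** H

V = 8
test_text_probs = [1 / V] * 20  # a uniform model assigns 1/V to every test word
print(perplexity(test_text_probs))  # 8.0: each word is "one of 8 equally likely choices"
```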
Elements of Information Theory
Entropy

    H(X) = - Σ_{x∈X} p(x) log p(x)

Mutual Information

    I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]

pointwise

    I(x, y) = log [ p(x, y) / (p(x) p(y)) ]

Kullback-Leibler (KL) divergence

    D(p || q) = Σ_{x∈X} p(x) log [ p(x) / q(x) ]
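Sketch implementations of two of the definitions above (log base 2, so entropy is measured in bits):

```python
import math

# Sketch implementations of entropy and KL divergence (log base 2 -> bits).
def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

uniform = [0.25] * 4
print(entropy(uniform))                                  # 2.0 bits
print(kl_divergence(uniform, uniform))                   # 0.0: D(p || p) = 0
print(kl_divergence(uniform, [0.7, 0.1, 0.1, 0.1]) >= 0)  # True: KL is nonnegative
```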
The Greek Language
Highly inflectional language
A Greek vocabulary of 220K words is needed in order to achieve 99.6% lexical coverage
Language          English              French    Greek           German
Source            Wall Street Journal  Le Monde  Eleftherotypia  Frankfurter Rundschau
Corpus size       37.2 M               37.7 M    35 M            31.5 M
Distinct words    165 K                280 K     410 K           500 K
Vocabulary size   60 K                 60 K      60 K            60 K
Lexical coverage  99.6 %               98.3 %    96.5 %          95.1 %
Perplexity
                 English  French  Greek  German
Vocabulary size  20 K     20 K    64 K   64 K
2-gram PP        198      178     232    430
3-gram PP        135      119     163    336
Experimental Results
                      1M            5M            35M
Smoothing             PP     WER    PP     WER    PP     WER
Good-Turing           341    27.71  248    23.48  163    19.59
Witten-Bell           354    27.42  251    24.17  163    19.84
Absolute Discounting  344    28.47  256    24.25  169    20.78
Modified Kneser-Ney   328    26.78  237    21.91  156    18.57

       1M     5M     35M
OOV    4.75%  3.46%  3.17%
Hit Rate
hit rate % (1M) hit rate % (5M) hit rate % (35M)
1-gram 27.3 16.4 7.4
2-gram 52.5 49.9 40
3-gram 20.2 33.7 52.6
Class-based Models
Some words are similar to other words in their meaning and syntactic function
Group words into classes
– Fewer parameters
– Better estimates
Class-based n-gram models
Suppose that we partition the vocabulary into G classes
This model produces text by first generating a string of classes g1,g2,…,gn
and then converting them into the words wi, i=1,2,…n with probability p(wi|gi)
An n-gram model has V^n - 1 independent parameters (216×10^12); a class-based model has G^n - 1 + V - G parameters (≈ 10^9):
– G^n - 1 for an n-gram model over a vocabulary of size G
– V - G of the form p(w_i | g_i)

    p(w_i | w_{i-2}, w_{i-1}) = p(g_i | g_{i-2}, g_{i-1}) p(w_i | g_i)
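A sketch of the class-based trigram factorization; the word-to-class map, class trigram, and membership probabilities below are all made up for illustration:

```python
# Sketch: class-based trigram p(w_i | w_{i-2}, w_{i-1})
#       = p(g_i | g_{i-2}, g_{i-1}) * p(w_i | g_i).
# Classes and probabilities are made up for illustration.
word_class = {"the": "DET", "cat": "NOUN", "runs": "VERB"}

p_class_trigram = {("DET", "NOUN", "VERB"): 0.4}  # p(g_i | g_{i-2}, g_{i-1})
p_word_given_class = {("runs", "VERB"): 0.1}      # p(w_i | g_i)

def p_class_based(w, w2, w1):
    g, g2, g1 = word_class[w], word_class[w2], word_class[w1]
    return p_class_trigram.get((g2, g1, g), 0.0) * p_word_given_class.get((w, g), 0.0)

print(p_class_based("runs", "the", "cat"))  # 0.4 * 0.1 ≈ 0.04
```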
Relation to n-grams
    p(w_i | w_{i-2}, w_{i-1}) = p(g_i | g_{i-2}, g_{i-1}) p(w_i | g_i)
Defining Classes
Manually
– Use part-of-speech labels assigned by linguistic experts or a tagger
– Use stem information
Automatically
– Cluster words as part of an optimization method, e.g. maximize the log-likelihood of test text
Agglomerative Clustering
Bottom-up clustering
Start with a separate cluster for each word
Merge that pair for which the loss in average MI is least
    H(L, G) = - (1/N) log p(w_1, ..., w_N) = H(w) - I(g_1; g_2)
Example
Syntactic classes
– verbs, past tense: άναψαν, επέλεξαν, κατέλαβαν, πλήρωσαν, πυροβόλησαν
– nouns, neuter: άλογο, δόντι, δέντρο, έντομο, παιδί, ρολόι, σώμα
– adjectives, masculine: δημοκρατικός, δημόσιος, ειδικός, εμπορικός, επίσημος
Semantic classes
– last names: βαρδινογιάννης, γεννηματάς, λοβέρδος, ράλλης
– countries: βραζιλία, βρετανία, γαλλία, γερμανία, δανία
– numerals: δέκατο, δεύτερο, έβδομο, εικοστό, έκτο, ένατο, όγδοο
Some not so well defined classes
– ανακριβής, αναμεταδίδει, διαφημίσουν, κομήτες, προμήθευε
– εξίσωση, έτρωγαν, και, μαλαισία, νηπιαγωγών, φεβρουάριος
Stem-based Classes
άγνωστ: άγνωστος, άγνωστου, άγνωστο, άγνωστον, άγνωστοι, άγνωστους, άγνωστη, άγνωστης, άγνωστες, άγνωστα
βλέπ: βλέπω, βλέπεις, βλέπει, βλέπουμε, βλέπετε, βλέπουν
εκτελ: εκτελεί, εκτελούν, εκτελούσε, εκτελούσαν, εκτελείται, εκτελούνται
εξοχικ: εξοχικό, εξοχικά, εξοχική, εξοχικής, εξοχικές
ιστορικ: ιστορικός, ιστορικού, ιστορικό, ιστορικοί, ιστορικών, ιστορικούς, ιστορική, ιστορικής, ιστορικές, ιστορικά
καθηγητ: καθηγητής, καθηγητή, καθηγητές, καθηγητών
μαχητικ: μαχητικός, μαχητικού, μαχητικό, μαχητικών, μαχητική, μαχητικής, μαχητικά
Experimental Results
G PP (1M) PP (5M) PP (35M)
1 1309 1461 1503
133 (POS) 1047 1143 1167
500 - - 314
1000 - - 266
2000 - - 224
30000 (stem) 383 299 215
60000 328 237 156
Example
Interpolate class-based and word-based models:

    p(w_i | w_{i-2}, w_{i-1}) = λ p(w_i | w_{i-2}, w_{i-1}) + (1 - λ) p(g_i | g_{i-2}, g_{i-1}) p(w_i | g_i)

e.g. with stem classes: p(ατυχηματων | τροχαιων) = p(ατυχημ | τροχαι) · p(ατυχηματων | ατυχημ)
Experimental Results
              1M            5M            35M
G             PP     WER    PP     WER    PP     WER
133 (POS)     325    27.11  236    22.00  156    18.52
500           -      -      -      -      151    18.63
1000          -      -      -      -      150    18.61
2000          -      -      -      -      149    18.65
30000 (stem)  319    26.99  232    22.04  154    18.44
60000         328    26.78  237    21.91  156    18.57
Hit Rate
         hit rate % (1M)  hit rate % (5M)  hit rate % (35M)
1-gram   21.3             12.1             5.1
2-gram   56               50.4             37.6
3-gram   22.7             37.6             57.4

For comparison, the word-based model from the earlier Hit Rate slide:

         hit rate % (1M)  hit rate % (5M)  hit rate % (35M)
1-gram   27.3             16.4             7.4
2-gram   52.5             49.9             40
3-gram   20.2             33.7             52.6
Experimental Results
                    1M            5M            35M
Model               PP     WER    PP     WER    PP     WER
ME 3gram            331    26.83  239    21.94  158    18.60
ME 3gram+stem       320    26.54  227    21.66  143    18.29

                    1M            5M            35M
Model               PP     WER    PP     WER    PP     WER
BO 3gram            328    26.78  237    21.91  156    18.57
Interp. 3gram+stem  319    26.99  232    22.04  154    18.44
Where do we go from here?
Use syntactic information
The dog on the hill barked
Constraints

    f_{h(t),w}(x, y) = 1 if h(t) is the preceding head word in x and y = w
                       0 otherwise

    f_{h(s),h(t),w}(x, y) = 1 if h(s), h(t) are the preceding head words in x and y = w
                            0 otherwise

    p(w | u, v, s, t) = e^{λ_{u,v,w}} e^{λ_{v,w}} e^{λ_{w}} e^{λ_{h(t),w}} e^{λ_{h(s),h(t),w}} / Z(u, v, s, t)
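A minimal sketch of such a maximum-entropy distribution: the probability of a word is a normalized product of exponentials of the weights of the features that fire. The feature names and weights below are illustrative, not trained values:

```python
import math

# Sketch of a maximum-entropy model: p(w | context) is a normalized product
# of exponentials of the weights of the features that fire on w.
# Feature names and weights are illustrative, not trained values.
def me_prob(w, vocab, weights, active_features):
    """p(w | ...) = exp(sum of lambdas firing on w) / Z(context)"""
    def score(word):
        return math.exp(sum(weights.get(f, 0.0) for f in active_features(word)))
    Z = sum(score(word) for word in vocab)  # normalization over candidate words
    return score(w) / Z

vocab = ["barked", "meowed", "ran"]
weights = {("head:dog", "barked"): 1.5}  # a head-word feature like f_{h(t),w}

def active_features(w):
    # features that fire for candidate w when the preceding head word is "dog"
    return [("head:dog", w)]

p = me_prob("barked", vocab, weights, active_features)
print(p > 1 / len(vocab))  # True: the head-word feature raises p(barked) above uniform
```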