Discriminative Training Approaches for Continuous Speech Recognition
Berlin Chen, Jen-Wei Kuo, Shih-Hung Liu Speech Lab
Graduate Institute of Computer Science & Information Engineering
National Taiwan Normal University
NTNU Speech Lab 2
Outline
• Statistical Speech Recognition
• Overall Expected Risk Decision/Estimation
• Maximum Likelihood (ML) Estimation
• Maximum Mutual Information (MMI) Estimation
• Minimum Phone Error (MPE) Estimation
• Preliminary Experimental Results
• Conclusions
NTNU Speech Lab 3
Statistical Speech Recognition (1/2)
• The task of speech recognition is to determine the identity of a given observation sequence $O$ by assigning a recognized word sequence $W$ to it
• The decision is to find the identity with the maximum a posteriori (MAP) probability $P(W|O)$
  – the so-called Bayes decision (or minimum-error-rate) rule
$$W^{*} = \arg\max_{W} P(W|O) = \arg\max_{W}\frac{p(O|W)\,P(W)}{p(O)} = \arg\max_{W}\, p(O|W)\,P(W)$$
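As a concrete illustration of the MAP decision rule, here is a minimal Python sketch; the hypothesis list and the two log-probability scorers are hypothetical stand-ins for a real acoustic model and language model:

```python
def map_decode(hypotheses, acoustic_logprob, lm_logprob):
    """Pick the word sequence W maximizing p(O|W) * P(W).

    `hypotheses` is a list of candidate word sequences; the two scorer
    callables are hypothetical stand-ins for an acoustic model and a
    language model.  p(O) is ignored because it is constant over W.
    """
    return max(hypotheses, key=lambda W: acoustic_logprob(W) + lm_logprob(W))

# Toy usage with made-up scores (acoustic log-prob, LM log-prob).
scores = {("hello", "world"): (-120.0, -4.0), ("hollow", "world"): (-123.0, -6.5)}
best = map_decode(list(scores), lambda W: scores[W][0], lambda W: scores[W][1])
print(best)  # ('hello', 'world')
```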
NTNU Speech Lab 4
Statistical Speech Recognition (2/2)
• The probabilities needed for the Bayes decision rule are not given in practice and have to be estimated from a training data set
• Moreover, the true families of distributions to which $P(W)$ (typically assumed multinomial) and $p(O|W)$ (typically assumed Gaussian) belong are not known
  – A certain parametric representation of these distributions is therefore needed
  – In this presentation, the language model $P(W)$ is assumed to be given in advance, while the acoustic model $p(O|W)$ needs to be estimated
• HMMs (hidden Markov models) are widely adopted for acoustic modeling
NTNU Speech Lab 5
Expected Risk
• Let $\mathcal{W} = \{W_1, W_2, W_3, \ldots\}$ be a finite set of possible word sequences for a given observation utterance $O_r$
  – Assume that the true word sequence is also in $\mathcal{W}$
• Let $\alpha_j: O_r \rightarrow W_j$ be the action of classifying a given observation sequence $O_r$ to a word sequence $W_j$
  – Let $l(W_j, W_n)$ be the loss incurred when we take such an action (and the true word sequence is in fact $W_n$)
• Therefore, the (expected) risk for a specific action $\alpha_j$ is
$$R(W_j|O_r) = \sum_{W_n\in\mathcal{W}} l(W_j, W_n)\, P(W_n|O_r)$$
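A small Python sketch of the expected-risk computation defined above; the posterior table and the zero-one loss are toy assumptions:

```python
def expected_risk(W_j, posteriors, loss):
    """R(W_j | O_r) = sum over W_n of l(W_j, W_n) * P(W_n | O_r)."""
    return sum(loss(W_j, W_n) * p for W_n, p in posteriors.items())

# Posterior distribution over a tiny hypothesis set (made-up numbers).
posteriors = {"a b c": 0.6, "a b d": 0.3, "x b c": 0.1}
zero_one = lambda w1, w2: 0.0 if w1 == w2 else 1.0
print(expected_risk("a b c", posteriors, zero_one))  # 0.4 = 1 - P("a b c" | O_r)
```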
NTNU Speech Lab 6
Decoding: Minimum Expected Risk (1/2)
• In speech recognition, we can take the action with the minimum (expected) risk:
$$W^{*} = \arg\min_{W_j} R(W_j|O_r) = \arg\min_{W_j}\sum_{W_n\in\mathcal{W}} l(W_j, W_n)\, P(W_n|O_r)$$
• If a zero-one loss function is adopted (string-level error)
$$l(W_j, W_n) = \begin{cases} 0 & \text{if } W_j = W_n \\ 1 & \text{if } W_j \neq W_n \end{cases}$$
  – Then
$$\arg\min_{W_j} R(W_j|O_r) = \arg\min_{W_j}\sum_{W_n\in\mathcal{W},\,W_n\neq W_j} P(W_n|O_r) = \arg\min_{W_j}\left[1 - P(W_j|O_r)\right]$$
NTNU Speech Lab 7
Decoding: Minimum Expected Risk (2/2)
  – Thus,
$$\arg\min_{W_j} R(W_j|O_r) = \arg\max_{W_j} P(W_j|O_r)$$
• Select the word sequence with the maximum posterior probability (MAP decoding)
• The string-editing (Levenshtein) distance can also be adopted as the loss function
  – it takes individual word errors into consideration
  – e.g., Minimum Bayes Risk (MBR) search/decoding [V. Goel et al. 2004]
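A sketch of MBR-style decoding in the spirit of [Goel et al. 2004], using a word-level Levenshtein distance as the loss over a toy posterior table (the numbers are made up):

```python
def levenshtein(a, b):
    """Word-level edit distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def mbr_decode(posteriors):
    """Pick the hypothesis with the lowest expected word-level loss."""
    return min(posteriors, key=lambda h: sum(levenshtein(h.split(), n.split()) * p
                                             for n, p in posteriors.items()))

posteriors = {"a b c": 0.4, "a b d": 0.35, "a d": 0.25}
print(mbr_decode(posteriors))  # "a b c" (expected loss 0.6)
```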
NTNU Speech Lab 8
Training: Minimum Overall Expected Risk (1/2)
• In training, we should minimize the overall (expected) loss of the decisions $\alpha(O_r)$ over the training utterances
  – $W_r$ is the true word sequence of $O_r$
  – The integral extends over the whole observation sequence space:
$$R_{overall} = \int R\!\left(\alpha(O_r)\,|\,O_r\right)\, p(O_r)\, dO_r$$
• However, when only a limited number $R$ of training observation sequences are available, the overall risk can be approximated by
$$R_{overall} \approx \sum_{r=1}^{R} R\!\left(\alpha(O_r)\,|\,O_r\right)\, p(O_r) = \sum_{r=1}^{R} p(O_r)\sum_{W_n\in\mathcal{W}} l(W_n, W_r)\, P(W_n|O_r)$$
  – where, for training, the expected loss is measured against the true transcription $W_r$
NTNU Speech Lab 9
Training: Minimum Overall Expected Risk (2/2)
• Assume $p(O_r)$ to be uniform
  – The overall risk can then be further expressed as
$$R_{overall} \propto \sum_{r=1}^{R}\sum_{W_n\in\mathcal{W}} l(W_n, W_r)\, P(W_n|O_r)$$
• If a zero-one loss function is adopted
$$l(W_n, W_r) = \begin{cases} 0 & \text{if } W_n = W_r \\ 1 & \text{if } W_n \neq W_r \end{cases}$$
  – Then
$$R_{overall} \propto \sum_{r=1}^{R}\sum_{W_n\neq W_r} P(W_n|O_r) = \sum_{r=1}^{R}\left[1 - P(W_r|O_r)\right]$$
NTNU Speech Lab 10
Training: Maximum Likelihood (1/3)
• The objective function of Maximum Likelihood (ML) estimation can be obtained if Jensen's inequality is further applied: since $1 - x \le -\log x$,
$$R_{overall} \propto \sum_{r}\left[1 - P_\Lambda(W_r|O_r)\right] \le \sum_{r}\left[-\log P_\Lambda(W_r|O_r)\right]$$
  – Finding a new parameter set that minimizes this upper bound on the overall expected risk (equivalently, maximizes a lower bound on the summed posteriors) amounts to maximizing the overall log likelihood of all training utterances:
$$\Lambda_{ML} = \arg\max_\Lambda \sum_{r}\log P_\Lambda(W_r|O_r) = \arg\max_\Lambda \sum_{r}\log\frac{p_\Lambda(O_r|W_r)\,P(W_r)}{p(O_r)} = \arg\max_\Lambda \sum_{r}\log p_\Lambda(O_r|W_r)$$
  – $P(W_r)$ and $p(O_r)$ are dropped in the last step because they are treated as independent of the acoustic model parameters $\Lambda$
$$F_{ML}(\Lambda) = \sum_{r}\log p_\Lambda(O_r|W_r)$$
NTNU Speech Lab 11
Training: Maximum Likelihood (2/3)
• The objective function $F_{ML}(\Lambda) = \sum_{r}\log p_\Lambda(O_r|W_r)$ can be maximized by adjusting the parameter set $\Lambda$ with the EM algorithm and a specific (strong-sense) auxiliary function $Q(\Lambda,\bar{\Lambda})$, i.e., the Baum-Welch algorithm:
$$F_{ML}(\Lambda) - F_{ML}(\bar{\Lambda}) \ge Q(\Lambda,\bar{\Lambda}) - Q(\bar{\Lambda},\bar{\Lambda}), \qquad F_{ML}(\Lambda) = Q(\Lambda,\bar{\Lambda}) + H(\Lambda,\bar{\Lambda})$$
  – E.g., update formulas for the Gaussian mean and variance of mixture component $m$ of state $q$, with occupation probabilities $\gamma_{qm}^{r}(t)$:
$$\hat{\mu}_{qm} = \frac{\sum_{r=1}^{R}\sum_{t=1}^{T_r}\gamma_{qm}^{r}(t)\,O_r(t)}{\sum_{r=1}^{R}\sum_{t=1}^{T_r}\gamma_{qm}^{r}(t)}, \qquad
\hat{\sigma}_{qm}^{2} = \frac{\sum_{r=1}^{R}\sum_{t=1}^{T_r}\gamma_{qm}^{r}(t)\left(O_r(t)-\hat{\mu}_{qm}\right)^{2}}{\sum_{r=1}^{R}\sum_{t=1}^{T_r}\gamma_{qm}^{r}(t)}$$
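A minimal sketch of the ML (Baum-Welch style) Gaussian re-estimation above for a single one-dimensional mixture component, assuming the per-frame occupancies γ have already been computed by the usual E-step:

```python
def ml_update_gaussian(frames, gammas):
    """Re-estimate a 1-D Gaussian from frames O_r(t) and occupancies gamma_qm^r(t).

    `frames` and `gammas` are parallel lists pooled over all utterances r and
    times t.  Returns (mean, variance) of the updated mixture component.
    """
    occ = sum(gammas)
    mean = sum(g * o for g, o in zip(gammas, frames)) / occ
    var = sum(g * (o - mean) ** 2 for g, o in zip(gammas, frames)) / occ
    return mean, var

# Toy data: three frames with made-up occupancies for one Gaussian component.
print(ml_update_gaussian([1.0, 1.4, 0.2], [0.9, 0.7, 0.1]))
```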
NTNU Speech Lab 12
Training: Maximum Likelihood (3/3)
• On the other hand, the discriminative training approaches attempt to optimize the correctness of the model set by formulating an objective function that in some way penalizes the model parameters that are liable to confuse correct and incorrect answers
NTNU Speech Lab 13
Training: Maximum Mutual Information (1/4)
• The objective function can be defined as the sum of the pointwise mutual information between each training utterance and its associated true word sequence
  – a kind of rational function
$$F_{MMI}(\Lambda) = \sum_{r}\mathrm{MI}_\Lambda(O_r, W_r) = \sum_{r}\log\frac{p_\Lambda(O_r, W_r)}{p_\Lambda(O_r)\,P(W_r)} = \sum_{r}\log\frac{p_\Lambda(O_r|W_r)}{p_\Lambda(O_r)}$$
  – Since the language model $P(W_r)$ is fixed, this is equivalent (up to a constant) to
$$F_{MMI}(\Lambda) = \sum_{r}\log\frac{p_\Lambda(O_r|W_r)\,P(W_r)}{\sum_{W_k\in\mathcal{W}} p_\Lambda(O_r|W_k)\,P(W_k)}$$
• The maximum mutual information (MMI) estimation tries to find a new parameter set $\Lambda$ that maximizes the above objective function
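A sketch of the MMI objective evaluated over an N-best list that stands in for the full hypothesis space 𝒲; the joint log scores (log p(O_r|W) + log P(W)) and the hypothesis sets are illustrative assumptions:

```python
import math

def mmi_objective(utterances):
    """F_MMI = sum_r log [ p(O_r|W_r)P(W_r) / sum_k p(O_r|W_k)P(W_k) ].

    Each utterance is (reference, {hypothesis: joint_log_score}); the reference
    is assumed to appear among the hypotheses.
    """
    total = 0.0
    for ref, hyps in utterances:
        denom = math.log(sum(math.exp(s) for s in hyps.values()))
        total += hyps[ref] - denom
    return total

utts = [("a b", {"a b": -10.0, "a d": -11.0, "x b": -13.0})]
print(mmi_objective(utts))
```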
NTNU Speech Lab 14
Training: Maximum Mutual Information (2/4)
• An alternative derivation is based on the overall expected risk criterion with a zero-one loss function (cf. slide 9):
$$R_{overall} \propto \sum_{r}\left[1 - P_\Lambda(W_r|O_r)\right]$$
  – Minimizing it is equivalent to maximizing the overall log posterior probability of the true word sequences of the training utterances:
$$\Lambda_{MMI} = \arg\max_\Lambda \sum_{r}\log P_\Lambda(W_r|O_r) = \arg\max_\Lambda \sum_{r}\log\frac{p_\Lambda(O_r|W_r)\,P(W_r)}{p_\Lambda(O_r)} = \arg\max_\Lambda \sum_{r}\log\frac{p_\Lambda(O_r|W_r)\,P(W_r)}{\sum_{W_k\in\mathcal{W}} p_\Lambda(O_r|W_k)\,P(W_k)}$$
$$F_{MMI}(\Lambda) = \sum_{r}\log\frac{p_\Lambda(O_r|W_r)\,P(W_r)}{\sum_{W_k\in\mathcal{W}} p_\Lambda(O_r|W_k)\,P(W_k)}$$
NTNU Speech Lab 15
Training: Maximum Mutual Information (3/4)
• When we maximize the MMIE objective function
  – not only can the probability of the true word sequence (the numerator, as in the MLE objective function) be increased, but the probabilities of the other possible word sequences (the denominator) can also be decreased
  – Thus, MMIE attempts to make the correct hypothesis more probable, while at the same time making incorrect hypotheses less probable
NTNU Speech Lab 16
Training: Maximum Mutual Information (4/4)
• The objective functions used in discriminative training, such as that of MMI, are often rational functions
  – The original Baum-Welch algorithm is not feasible for them
  – Gradient descent and the extended Baum-Welch (EB) algorithm are two applicable approaches to this function optimization problem
  • Gradient descent may require a large number of iterations to reach a locally optimal solution
  • The Baum-Welch algorithm was therefore extended (EB) for the optimization of rational functions
  – MMI training has update formulas similar to those of MPE (Minimum Phone Error) training, to be introduced later
NTNU Speech Lab 17
Training: Minimum Phone Error (1/4)
• The objective function of Minimum Phone Error (MPE) training is also derived from the overall expected risk criterion
  – Replace the loss function $l(W_i, W_r)$ with the so-called accuracy function $A(W_i, W_r)$ (cf. slide 9)
• MPE tries to maximize the expected (phone or word) accuracy, over all possible word sequences generated by the recognizer, on the training utterances:
$$F_{MPE}(\Lambda) = \sum_{r}\sum_{W_i\in\mathcal{W}} P_\Lambda(W_i|O_r)\,A(W_i, W_r) = \sum_{r}\frac{\sum_{W_i\in\mathcal{W}} p_\Lambda(O_r|W_i)\,P(W_i)\,A(W_i, W_r)}{\sum_{W_k\in\mathcal{W}} p_\Lambda(O_r|W_k)\,P(W_k)}$$
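A sketch of the MPE criterion evaluated over the same kind of N-best approximation, with a deliberately crude accuracy function standing in for A(W_i, W_r):

```python
import math

def mpe_objective(utterances, accuracy):
    """F_MPE = sum_r sum_i P(W_i|O_r) * A(W_i, W_r) over the hypothesis set."""
    total = 0.0
    for ref, hyps in utterances:                     # hyps: {W: joint log score}
        norm = sum(math.exp(s) for s in hyps.values())
        total += sum(math.exp(s) / norm * accuracy(W, ref) for W, s in hyps.items())
    return total

# Toy accuracy: number of matching phones in corresponding positions.
acc = lambda W, ref: sum(a == b for a, b in zip(W.split(), ref.split()))
utts = [("p a t", {"p a t": -9.0, "s a t": -9.5, "p a d": -10.0})]
print(mpe_objective(utts, acc))
```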
NTNU Speech Lab 18
Training: Minimum Phone Error (2/4)
• Like MMI, MPE also aims to make sure that the training data is correctly recognized
  – In MPE, each phone error in a training utterance can contribute at most one unit to the objective function, rather than an unlimited quantity as in MMI (zero-one word-string error)
  • The MPE objective function is thus less sensitive to portions of the training data that are poorly transcribed
• A (word) lattice structure $\mathcal{W}_{lat}^{r}$ can be used to approximate the set $\mathcal{W}$ of all possible word sequences of each training utterance $O_r$ (with true word sequence $W_r$)
  – Training statistics can be computed efficiently over such a structure, e.g., for MMI:
$$F_{MMI}(\Lambda) = \sum_{r}\log\frac{p_\Lambda(O_r|W_r)\,P(W_r)}{\sum_{W_k\in\mathcal{W}_{lat}^{r}} p_\Lambda(O_r|W_k)\,P(W_k)}$$
NTNU Speech Lab 19
Training: Minimum Phone Error (3/4)
• The objective function of MPE has the "latent variable" problem, so it cannot be directly optimized
  – Moreover, it is a rational function, so the Baum-Welch (EM) algorithm cannot be applied directly
  – That is, there is no known strong-sense auxiliary function $G(\Lambda,\bar{\Lambda})$ satisfying
$$G(\Lambda,\bar{\Lambda}) - G(\bar{\Lambda},\bar{\Lambda}) \le F(\Lambda) - F(\bar{\Lambda})$$
[Figure: a strong-sense auxiliary function $G(\Lambda,\bar{\Lambda})$ lying below the objective $F(\Lambda)$ and touching it at $\bar{\Lambda}$, so that increasing $G$ guarantees increasing $F$]
NTNU Speech Lab 20
Training: Minimum Phone Error (3/4)
• A weak-sense auxiliary function $H(\Lambda,\bar{\Lambda})$ is employed instead for the MPE objective function $F_{MPE}(\Lambda)$; it has an easily found local optimum and only needs to satisfy
$$\left.\frac{\partial H(\Lambda,\bar{\Lambda})}{\partial\Lambda}\right|_{\Lambda=\bar{\Lambda}} = \left.\frac{\partial F(\Lambda)}{\partial\Lambda}\right|_{\Lambda=\bar{\Lambda}}$$
$$F_{MPE}(\Lambda) = \sum_{r}\frac{\sum_{W_i\in\mathcal{W}} p_\Lambda(O_r|W_i)\,P(W_i)\,A(W_i, W_r)}{\sum_{W_k\in\mathcal{W}} p_\Lambda(O_r|W_k)\,P(W_k)}$$
[Figure: a weak-sense auxiliary function $H(\Lambda,\bar{\Lambda})$ that matches the gradient of $F(\Lambda)$ at $\bar{\Lambda}$ but need not bound it]
NTNU Speech Lab 21
Training: Minimum Phone Error (4/4)
• A smoothing function $H_{SM}(\Lambda,\bar{\Lambda})$ can be further incorporated into the weak-sense auxiliary function to constrain the optimization process (to improve convergence):
$$H'_{MPE}(\Lambda,\bar{\Lambda}) = H_{MPE}(\Lambda,\bar{\Lambda}) + H_{SM}(\Lambda,\bar{\Lambda})$$
  – Because $H_{SM}(\Lambda,\bar{\Lambda})$ has its optimum at $\Lambda = \bar{\Lambda}$, i.e. $\left.\partial H_{SM}(\Lambda,\bar{\Lambda})/\partial\Lambda\right|_{\Lambda=\bar{\Lambda}} = 0$, the local gradient is not affected and the result $H'_{MPE}(\Lambda,\bar{\Lambda})$ is still a weak-sense auxiliary function
[Figure: the smoothing function $H_{SM}(\Lambda,\bar{\Lambda})$ has its optimum at $\bar{\Lambda}$]
NTNU Speech Lab 22
MPE: Auxiliary Function (1/2)
• The weak-sense auxiliary function for MPE model updating can be defined as
$$H_{MPE}(\Lambda,\bar{\Lambda}) = \sum_{r=1}^{R}\sum_{q\in\mathcal{W}_{lat}^{r}}\left.\frac{\partial F_{MPE}}{\partial\log p_\Lambda(O_r|q)}\right|_{\bar{\Lambda}}\,\log p_\Lambda(O_r|q)$$
  – $\partial F_{MPE}/\partial\log p_\Lambda(O_r|q)$ is a scalar value (a constant) calculated for each phone arc $q$, and can be either positive or negative (because of the accuracy function); note that $F_{MPE}(\Lambda)$ itself still has the "latent variable" problem
  – The auxiliary function can also be decomposed as
$$H_{MPE}(\Lambda,\bar{\Lambda}) = \sum_{r=1}^{R}\sum_{q\in\mathcal{W}_{lat}^{r}}\max\!\left(0,\frac{\partial F_{MPE}}{\partial\log p_\Lambda(O_r|q)}\right)\log p_\Lambda(O_r|q) \;-\; \sum_{r=1}^{R}\sum_{q\in\mathcal{W}_{lat}^{r}}\max\!\left(0,-\frac{\partial F_{MPE}}{\partial\log p_\Lambda(O_r|q)}\right)\log p_\Lambda(O_r|q)$$
  – The first sum runs over arcs with positive contributions (the so-called numerator), the second over arcs with negative contributions (the so-called denominator)
  – These two sets of statistics can be compressed by saving only their difference (cf. I-smoothing)
NTNU Speech Lab 23
MPE: Auxiliary Function (2/2)
• The auxiliary function can be modified by substituting, for each $\log p_\Lambda(O_r|q)$, the normal (strong-sense, ML-style) auxiliary function $Q_{ML}(\Lambda,\bar{\Lambda};q,r)$:
$$H_{MPE}(\Lambda,\bar{\Lambda}) = \sum_{r=1}^{R}\sum_{q\in\mathcal{W}_{lat}^{r}}\frac{\partial F_{MPE}}{\partial\log p_\Lambda(O_r|q)}\; Q_{ML}(\Lambda,\bar{\Lambda};q,r)$$
  – The smoothing term is not added yet here
• The key quantity (statistics value) required in MPE training is $\dfrac{\partial F_{MPE}}{\partial\log p_\Lambda(O_r|q)}$, which can be termed $\gamma_q^{MPE}$
NTNU Speech Lab 24
MPE: Statistics Accumulation (1/2)
• For a specific phone arc $q$, the objective function can be expressed by splitting each sum over word sequences into those that pass through $q$ and those that do not:
$$F_{MPE}(\Lambda) = \sum_{r=1}^{R}\frac{\sum_{W_i:\,q\in W_i} p_\Lambda(O_r|W_i)P(W_i)A(W_i,W_r) + \sum_{W_i:\,q\notin W_i} p_\Lambda(O_r|W_i)P(W_i)A(W_i,W_r)}{\sum_{W_k:\,q\in W_k} p_\Lambda(O_r|W_k)P(W_k) + \sum_{W_k:\,q\notin W_k} p_\Lambda(O_r|W_k)P(W_k)} = \sum_{r=1}^{R}\frac{a_r}{b_r}$$
  – all sums run over $W\in\mathcal{W}_{lat}^{r}$
• The differential can then be expressed with the quotient rule:
$$\frac{\partial F_{MPE}}{\partial\log p_\Lambda(O_r|q)} = \frac{\dfrac{\partial a_r}{\partial\log p_\Lambda(O_r|q)}\,b_r - a_r\,\dfrac{\partial b_r}{\partial\log p_\Lambda(O_r|q)}}{b_r^{2}}$$
NTNU Speech Lab 25
MPE: Statistics Accumulation (2/2)
Noting that $\partial p_\Lambda(O_r|W)/\partial\log p_\Lambda(O_r|q) = p_\Lambda(O_r|W)$ if $q\in W$ and $0$ otherwise,
$$\frac{\partial a_r}{\partial\log p_\Lambda(O_r|q)} = \sum_{W_i:\,q\in W_i} p_\Lambda(O_r|W_i)P(W_i)A(W_i,W_r), \qquad
\frac{\partial b_r}{\partial\log p_\Lambda(O_r|q)} = \sum_{W_k:\,q\in W_k} p_\Lambda(O_r|W_k)P(W_k)$$
so that
$$\frac{\partial F_{MPE}}{\partial\log p_\Lambda(O_r|q)} = \frac{\sum_{W_k:\,q\in W_k} p_\Lambda(O_r|W_k)P(W_k)}{\sum_{W_k\in\mathcal{W}_{lat}^{r}} p_\Lambda(O_r|W_k)P(W_k)}\left[\frac{\sum_{W_i:\,q\in W_i} p_\Lambda(O_r|W_i)P(W_i)A(W_i,W_r)}{\sum_{W_k:\,q\in W_k} p_\Lambda(O_r|W_k)P(W_k)} - \frac{\sum_{W_i\in\mathcal{W}_{lat}^{r}} p_\Lambda(O_r|W_i)P(W_i)A(W_i,W_r)}{\sum_{W_k\in\mathcal{W}_{lat}^{r}} p_\Lambda(O_r|W_k)P(W_k)}\right] = \gamma_q^{r}\left[c_r(q) - c_{avg}^{r}\right]$$
where
  – $\gamma_q^{r}$: the likelihood of the arc $q$
  – $c_r(q)$: the average accuracy of the sentences passing through the arc $q$
  – $c_{avg}^{r}$: the average accuracy of all the sentences in the word graph
NTNU Speech Lab 26
MPE: Accuracy Function (1/4)
• $c_r(q)$ and $c_{avg}^{r}$ can be calculated approximately using the word graph and the Forward-Backward algorithm
• Note that the exact accuracy function is expressed as the sum of phone-level accuracies over all phones $q$ of a hypothesis, e.g.
$$A(W_i, W_r) = \sum_{q\in W_i} A(q), \qquad A(q) = \begin{cases} 1 & \text{correct phone} \\ 0 & \text{substitution} \\ -1 & \text{insertion} \end{cases}$$
  – However, such an accuracy requires a full alignment between the true word sequence and every possible word sequence, which is computationally expensive
NTNU Speech Lab 27
MPE: Accuracy Function (2/4)
• An approximated phone accuracy is defined as
$$A(q) = \max_{z}\begin{cases} -1 + 2\,e(q,z) & \text{if } z \text{ and } q \text{ are the same phone} \\ -1 + e(q,z) & \text{if } z \text{ and } q \text{ are different phones} \end{cases}$$
  – $e(q,z)$: the portion of (reference phone) $z$ that is overlapped by $q$
  1. Assume the true word sequence has no pronunciation variation
  2. Phone accuracy can then be obtained by a simple local search
  3. Context-independent phones can be used for the accuracy calculation
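A sketch of this approximate per-arc accuracy; reference phones are given as hypothetical (label, start, end) intervals and e(q, z) is measured as the fraction of z covered by q:

```python
def arc_accuracy(q_label, q_start, q_end, reference):
    """A(q) = max over reference phones z of -1+2e(q,z) (same phone) or -1+e(q,z)."""
    best = -1.0
    for z_label, z_start, z_end in reference:
        overlap = max(0.0, min(q_end, z_end) - max(q_start, z_start))
        e = overlap / (z_end - z_start)              # portion of z overlapped by q
        score = -1.0 + (2.0 * e if q_label == z_label else e)
        best = max(best, score)
    return best

# Reference "p a t" with (label, start_frame, end_frame) intervals.
ref = [("p", 0, 10), ("a", 10, 25), ("t", 25, 35)]
print(arc_accuracy("a", 12, 24, ref))   # mostly-overlapping correct phone -> ~0.6
print(arc_accuracy("s", 0, 10, ref))    # wrong phone fully overlapping "p" -> 0.0
```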
NTNU Speech Lab 28
MPE: Accuracy Function (3/4)
• Forward-Backward algorithm for statistics calculation

Forward pass:
for each phone arc q with start time 0:
    α_q  = p(O_q|q)
    α'_q = A(q)
end
for t = 1 to T-1:
    for each phone arc q with start time t:
        α_q = 0
        for each phone arc r with end time t-1 that can connect to q:
            α_q = α_q + α_r · p(q|r)
        end
        α'_q = 0
        for each phone arc r with end time t-1 that can connect to q:
            α'_q = α'_q + (α_r · p(q|r) / α_q) · α'_r
        end
        α'_q = α'_q + A(q)
        α_q  = α_q · p(O_q|q)
    end
end
NTNU Speech Lab 29
MPE: Accuracy Function (4/4)
Backward pass:
for each phone arc q with end time T-1:
    β_q  = 1
    β'_q = 0
end
for t = T-2 down to 0:
    for each phone arc q with end time t:
        β_q = 0
        for each phone arc r with start time t+1 that q can connect to:
            β_q = β_q + p(r|q) · p(O_r|r) · β_r
        end
        β'_q = 0
        for each phone arc r with start time t+1 that q can connect to:
            β'_q = β'_q + (p(r|q) · p(O_r|r) · β_r / β_q) · (β'_r + A(r))
        end
    end
end

Combination:
for each phone arc q:
    γ_q  = α_q · β_q / Σ_{q': end time T-1} α_{q'}
    c(q) = α'_q + β'_q
end
c_avg   = Σ_{q: end time T-1} α_q · α'_q / Σ_{q: end time T-1} α_q
γ_q^MPE = γ_q · (c(q) − c_avg)
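A compact Python sketch of the two recursions above on an arc-level lattice; it works with toy linear-domain likelihoods and assumes LM/transition weights are folded into each arc likelihood (a real implementation would work in the log domain):

```python
from collections import defaultdict

def mpe_arc_gammas(arcs, edges, finals):
    """Compute gamma_q^MPE = gamma_q * (c(q) - c_avg) on a phone-arc lattice.

    arcs   : {arc_id: (likelihood, accuracy)}, arc ids assumed to be listed in
             topological order (e.g. sorted by start time).
    edges  : list of (from_arc, to_arc) pairs.
    finals : arc ids that end the utterance.
    """
    succ, pred = defaultdict(list), defaultdict(list)
    for r, q in edges:
        succ[r].append(q)
        pred[q].append(r)

    order = list(arcs)                     # topological order by assumption
    alpha, alpha_acc = {}, {}              # likelihood / avg accuracy up to q
    for q in order:
        like, acc = arcs[q]
        if not pred[q]:
            alpha[q], alpha_acc[q] = like, acc
        else:
            a = sum(alpha[r] for r in pred[q])
            alpha_acc[q] = sum(alpha[r] / a * alpha_acc[r] for r in pred[q]) + acc
            alpha[q] = a * like

    beta, beta_acc = {}, {}                # likelihood / avg accuracy after q
    for q in reversed(order):
        if not succ[q]:
            beta[q], beta_acc[q] = 1.0, 0.0
        else:
            b = sum(arcs[r][0] * beta[r] for r in succ[q])
            beta_acc[q] = sum(arcs[r][0] * beta[r] / b * (beta_acc[r] + arcs[r][1])
                              for r in succ[q])
            beta[q] = b

    total = sum(alpha[q] for q in finals)
    c_avg = sum(alpha[q] * alpha_acc[q] for q in finals) / total
    return {q: (alpha[q] * beta[q] / total) *           # arc occupancy gamma_q
               (alpha_acc[q] + beta_acc[q] - c_avg)     # c(q) - c_avg
            for q in order}

# Toy lattice: reference "p a t" vs. a competing first phone "s".
arcs = {"p": (0.6, 1.0), "s": (0.4, -0.2), "a": (1.0, 1.0), "t": (1.0, 1.0)}
edges = [("p", "a"), ("s", "a"), ("a", "t")]
print(mpe_arc_gammas(arcs, edges, finals=["t"]))
```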
NTNU Speech Lab 30
MPE: Smoothing Function
• The smoothing function can be defined as
$$H_{SM}(\Lambda,\bar{\Lambda}) = \sum_{q,m} D_{qm}\left\{-\frac{1}{2}\left[\log\left|\Sigma_{qm}\right| + \mathrm{tr}\!\left(\Sigma_{qm}^{-1}\bar{\Sigma}_{qm}\right) + \left(\bar{\mu}_{qm}-\mu_{qm}\right)^{T}\Sigma_{qm}^{-1}\left(\bar{\mu}_{qm}-\mu_{qm}\right)\right]\right\}$$
  – The old model parameters $\left(\bar{\mu}_{qm},\bar{\Sigma}_{qm}\right)$ are used here as the target parameters
  – It has its maximum value at $\Lambda = \bar{\Lambda}$: setting the derivatives to zero,
$$\frac{\partial H_{SM}(\Lambda,\bar{\Lambda})}{\partial\mu_{qm}} = D_{qm}\,\Sigma_{qm}^{-1}\left(\bar{\mu}_{qm}-\mu_{qm}\right) = 0 \;\Rightarrow\; \mu_{qm} = \bar{\mu}_{qm}$$
$$\frac{\partial H_{SM}(\Lambda,\bar{\Lambda})}{\partial\Sigma_{qm}} = 0 \;\Rightarrow\; \Sigma_{qm} = \bar{\Sigma}_{qm} + \left(\bar{\mu}_{qm}-\mu_{qm}\right)\left(\bar{\mu}_{qm}-\mu_{qm}\right)^{T} = \bar{\Sigma}_{qm} \ \ (\text{at } \mu_{qm}=\bar{\mu}_{qm})$$
NTNU Speech Lab 31
MPE: Final Auxiliary Function
weak-sense auxiliary function:
$$H_{MPE}(\Lambda,\bar{\Lambda}) = \sum_{r=1}^{R}\sum_{q\in\mathcal{W}_{lat}^{r}} \gamma_q^{MPE}\,\log p_\Lambda(O_r|q)$$
with the strong-sense (ML) auxiliary function substituted for each arc log likelihood:
$$H_{MPE}(\Lambda,\bar{\Lambda}) = \sum_{r=1}^{R}\sum_{q\in\mathcal{W}_{lat}^{r}} \gamma_q^{MPE}\sum_{t=s_q}^{e_q}\sum_{m}\gamma_{qm}^{r}(t)\,\log\mathcal{N}\!\left(O_r(t);\mu_{qm},\Sigma_{qm}\right)$$
and with the smoothing function added, the final (still weak-sense) auxiliary function is
$$H'_{MPE}(\Lambda,\bar{\Lambda}) = \sum_{r=1}^{R}\sum_{q\in\mathcal{W}_{lat}^{r}} \gamma_q^{MPE}\sum_{t=s_q}^{e_q}\sum_{m}\gamma_{qm}^{r}(t)\,\log\mathcal{N}\!\left(O_r(t);\mu_{qm},\Sigma_{qm}\right) + \sum_{q,m} D_{qm}\left\{-\frac{1}{2}\left[\log\left|\Sigma_{qm}\right| + \mathrm{tr}\!\left(\Sigma_{qm}^{-1}\bar{\Sigma}_{qm}\right) + \left(\bar{\mu}_{qm}-\mu_{qm}\right)^{T}\Sigma_{qm}^{-1}\left(\bar{\mu}_{qm}-\mu_{qm}\right)\right]\right\}$$
NTNU Speech Lab 32
MPE: Model Update using EB (1/2)
• Based on the final auxiliary function, the Extended Baum-Welch (EB) training algorithm has the following update formulas
$$\hat{\mu}_{qmd} = \frac{\left\{\theta_{qmd}^{num}(O) - \theta_{qmd}^{den}(O)\right\} + D_{qm}\,\bar{\mu}_{qmd}}{\left\{\gamma_{qm}^{num} - \gamma_{qm}^{den}\right\} + D_{qm}}$$
$$\hat{\sigma}_{qmd}^{2} = \frac{\left\{\theta_{qmd}^{num}(O^{2}) - \theta_{qmd}^{den}(O^{2})\right\} + D_{qm}\left(\bar{\sigma}_{qmd}^{2} + \bar{\mu}_{qmd}^{2}\right)}{\left\{\gamma_{qm}^{num} - \gamma_{qm}^{den}\right\} + D_{qm}} - \hat{\mu}_{qmd}^{2}$$
(diagonal covariance matrices; $q$: phone unit, $m$: mixture component, $d$: feature dimension)
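A sketch of the EB update for a single Gaussian dimension; the (occupancy, sum, sum-of-squares) statistics and the smoothing constant D are made-up values:

```python
def eb_update(num, den, old_mean, old_var, D):
    """Extended Baum-Welch update for one Gaussian dimension.

    num / den are (occupancy, sum_O, sum_O2) tuples accumulated with
    max(0, +gamma^MPE) and max(0, -gamma^MPE); D is the smoothing constant.
    """
    occ = num[0] - den[0] + D
    mean = (num[1] - den[1] + D * old_mean) / occ
    var = (num[2] - den[2] + D * (old_var + old_mean ** 2)) / occ - mean ** 2
    return mean, var

# Toy statistics for one dimension of one mixture component (made-up numbers).
print(eb_update(num=(8.0, 9.2, 12.5), den=(5.0, 4.4, 6.1),
                old_mean=1.0, old_var=0.8, D=10.0))
```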
NTNU Speech Lab 33
MPE: Model Update using EB (2/2)
• Two sets of statistics (numerator, denominator) are accumulated respectively
$$\gamma_{qm}^{num} = \sum_{r=1}^{R}\sum_{q\in\mathcal{W}_{lat}^{r}}\max\!\left(0,\gamma_q^{MPE}\right)\sum_{t=s_q}^{e_q}\gamma_{qm}^{r}(t)$$
$$\theta_{qm}^{num}(O) = \sum_{r=1}^{R}\sum_{q\in\mathcal{W}_{lat}^{r}}\max\!\left(0,\gamma_q^{MPE}\right)\sum_{t=s_q}^{e_q}\gamma_{qm}^{r}(t)\,O_r(t)$$
$$\theta_{qm}^{num}(O^{2}) = \sum_{r=1}^{R}\sum_{q\in\mathcal{W}_{lat}^{r}}\max\!\left(0,\gamma_q^{MPE}\right)\sum_{t=s_q}^{e_q}\gamma_{qm}^{r}(t)\,O_r(t)^{2}$$
$$\gamma_{qm}^{den} = \sum_{r=1}^{R}\sum_{q\in\mathcal{W}_{lat}^{r}}\max\!\left(0,-\gamma_q^{MPE}\right)\sum_{t=s_q}^{e_q}\gamma_{qm}^{r}(t)$$
$$\theta_{qm}^{den}(O) = \sum_{r=1}^{R}\sum_{q\in\mathcal{W}_{lat}^{r}}\max\!\left(0,-\gamma_q^{MPE}\right)\sum_{t=s_q}^{e_q}\gamma_{qm}^{r}(t)\,O_r(t)$$
$$\theta_{qm}^{den}(O^{2}) = \sum_{r=1}^{R}\sum_{q\in\mathcal{W}_{lat}^{r}}\max\!\left(0,-\gamma_q^{MPE}\right)\sum_{t=s_q}^{e_q}\gamma_{qm}^{r}(t)\,O_r(t)^{2}$$
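A sketch of how these numerator/denominator statistics could be accumulated from per-arc quantities; the data layout (per-arc γ^MPE plus per-frame component posteriors) is an assumption for illustration:

```python
from collections import defaultdict

def accumulate_mpe_stats(arc_stats):
    """Accumulate numerator/denominator statistics for the EB update.

    arc_stats: iterable of (gamma_mpe, frame_posts) per phone arc q, where
    frame_posts is a list of (component m, gamma_qm(t), observation O(t))
    for the frames spanned by the arc.  Returns per-component dicts of
    [occupancy, sum_O, sum_O2] for the numerator and the denominator.
    """
    num = defaultdict(lambda: [0.0, 0.0, 0.0])
    den = defaultdict(lambda: [0.0, 0.0, 0.0])
    for gamma_mpe, frame_posts in arc_stats:
        pos, neg = max(0.0, gamma_mpe), max(0.0, -gamma_mpe)
        for m, g, o in frame_posts:
            for acc, w in ((num[m], pos), (den[m], neg)):
                acc[0] += w * g
                acc[1] += w * g * o
                acc[2] += w * g * o * o
    return dict(num), dict(den)

# Two toy arcs: one helping (positive gamma^MPE) and one hurting (negative).
stats = [(0.3, [("m1", 0.9, 1.2), ("m1", 0.8, 1.1)]),
         (-0.2, [("m1", 0.7, 0.4)])]
print(accumulate_mpe_stats(stats))
```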
NTNU Speech Lab 34
MPE: More About Smoothing Function (1/2)
• The mean and variance update formulas rely on the proper setting of the smoothing constant $D_{qm}$ (or a per-dimension $D_{qmd}$)
  – If $D_{qmd}$ is too large, the step size is small and convergence is slow
  – If $D_{qmd}$ is too small, the algorithm may become unstable
  – $D_{qmd}$ also needs to make all updated variances positive
• Requiring $\hat{\sigma}_{qmd}^{2} > 0$ in the variance update,
$$\hat{\sigma}_{qmd}^{2} = \frac{\left\{\theta_{qmd}^{num}(O^{2}) - \theta_{qmd}^{den}(O^{2})\right\} + D_{qmd}\left(\bar{\sigma}_{qmd}^{2} + \bar{\mu}_{qmd}^{2}\right)}{\left\{\gamma_{qm}^{num} - \gamma_{qm}^{den}\right\} + D_{qmd}} - \hat{\mu}_{qmd}^{2} > 0,$$
can be rearranged (by multiplying through by the squared denominator) into a quadratic inequality in $D_{qmd}$ of the form $A\,D_{qmd}^{2} + B\,D_{qmd} + C > 0$, whose coefficients $A$, $B$ and $C$ are computed from the accumulated statistics and the old model parameters
NTNU Speech Lab 35
MPE: More About Smoothing Function (2/2)
• Previous work [Povey 2004] used a value of $D_{qmd}$ that was twice the minimum positive value needed to ensure all variance updates were positive:
$$D_{qmd} = \frac{-B + \sqrt{B^{2} - 4AC}}{2A}$$
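A sketch of this rule for one dimension; the explicit A, B, C coefficients below are obtained by multiplying the variance-positivity condition through by the squared denominator of the update (they are a derivation under that assumption, not quoted from the slide):

```python
import math

def smoothing_constant(num, den, old_mean, old_var):
    """Smallest D making the EB variance update positive for one dimension.

    num / den are (occupancy, sum_O, sum_O2); the caller then doubles the
    result (and typically takes the max over the dimensions of a Gaussian).
    """
    k = num[0] - den[0]
    s1 = num[1] - den[1]
    s2 = num[2] - den[2]
    A = old_var
    B = s2 + k * (old_var + old_mean ** 2) - 2.0 * s1 * old_mean
    C = k * s2 - s1 ** 2
    disc = B * B - 4.0 * A * C
    if disc <= 0.0:                 # variance positive for any D > -k
        return max(0.0, -k)
    return max(0.0, -k, (-B + math.sqrt(disc)) / (2.0 * A))

num, den = (8.0, 9.2, 12.5), (5.0, 4.4, 6.1)
d_min = smoothing_constant(num, den, old_mean=1.0, old_var=0.8)
print(2.0 * d_min)                  # twice the minimum positive value, per the slide
```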
NTNU Speech Lab 36
MPE: I-Smoothing
• I-smoothing increases the weight of the numerator counts depending on the amount of data available for each Gaussian
• This is done by multiplying the numerator terms $\left(\gamma_{qm}^{num},\ \theta_{qm}^{num}(O),\ \theta_{qm}^{num}(O^{2})\right)$ in the update formulas by $\dfrac{\gamma_{qm}^{num}+\tau_{qm}}{\gamma_{qm}^{num}}$:
$$\hat{\gamma}_{qm}^{num} = \frac{\gamma_{qm}^{num}+\tau_{qm}}{\gamma_{qm}^{num}}\,\gamma_{qm}^{num}, \qquad
\hat{\theta}_{qm}^{num}(O) = \frac{\gamma_{qm}^{num}+\tau_{qm}}{\gamma_{qm}^{num}}\,\theta_{qm}^{num}(O), \qquad
\hat{\theta}_{qm}^{num}(O^{2}) = \frac{\gamma_{qm}^{num}+\tau_{qm}}{\gamma_{qm}^{num}}\,\theta_{qm}^{num}(O^{2})$$
  – This emphasizes positive contributions (arcs with higher accuracy)
  – $\tau_{qm}$ can be set empirically (e.g., $\tau_{qm} = 100$)
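A direct transcription of the scaling described above, as a small sketch:

```python
def i_smooth(num, tau=100.0):
    """Scale numerator statistics (occ, sum_O, sum_O2) by (occ + tau) / occ."""
    occ, s1, s2 = num
    scale = (occ + tau) / occ
    return occ * scale, s1 * scale, s2 * scale

print(i_smooth((8.0, 9.2, 12.5), tau=100.0))
```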
NTNU Speech Lab 37
Preliminary Experimental Results (1/4)
• Experiments conducted on the MATBN (TV broadcast news) corpus (field reporters)
  – Training: 34,672 utterances (25.5 hrs)
  – Testing: 292 utterances (1.45 hrs, outside testing)
  – Metric: Chinese character error rate (CER)

  CER (%)          ML_itr10   ML_itr10 + MPE_itr10   ML_itr150   ML_itr150 + MPE_itr10
  MFCC(CMS)        29.48      26.52                  28.72       24.44
  MFCC(CN)         26.60      25.05                  26.96       24.54
  HLDA+MLLT(CN)    23.64      20.92                  23.78       20.79

  – 11.51% relative improvement; 12.83% relative improvement
NTNU Speech Lab 38
Preliminary Experimental Results (2/4)
• Another experiment conducted on the MATBN (TV broadcast news) corpus (field reporters)
  – Discriminative feature: HLDA+MLLT(CN)
  – Discriminative training: MPE
  – Total 34,672 utterances (24.5 hrs), 10-fold cross validation
• A relative improvement of 26.2% was finally achieved (HLDA+MLLT(CN) + MPE)
[Figure: CER (%) bar chart — MFCC_CN ML: 21.43; HLDA+MLLT_CN ML: 18.43; HLDA+MLLT_CN MPE: 15.8]
NTNU Speech Lab 39
Preliminary Experimental Results (3/4)
• Conducted on radio broadcast news recorded from several radio stations located in Taipei
  – MFCC(CMS) / LDA(CN) / HLDA-MLLT(CN) features
  – Unsupervised training
  – Metric: Chinese character error rate (CER)

                     MFCC     MFCC+MPE   LDA      LDA+MPE   HLDA-MLLT   HLDA-MLLT+MPE
  Original 4 Hrs     22.56%   -          20.04%   -         20.12%      -
  +5 Hrs (Thr=0.9)   18.05%   -          16.43%   14.87%    16.44%      14.78%
  +21 Hrs (Thr=0.8)  16.90%   15.34%     15.22%   14.11%    15.64%      14.22%
  +33 Hrs (Thr=0.7)  17.32%   -          15.49%   14.23%    15.37%      14.28%
  +48 Hrs (Thr=0.6)  17.29%   -          15.54%   14.32%    15.42%      -
  +54 Hrs (Thr=0.5)  17.30%   -          15.54%   14.31%    15.46%      -
  +60 Hrs (Thr=0.4)  -        -          15.42%   -         15.36%      -

  – 9.32%, 16.51%, 15.86%, 9.94%
NTNU Speech Lab 40
Preliminary Experimental Results (4/4)
• Conducted on microphone speech recorded for spoken dialogues
  – Free syllable decoding with MFCC(CMS) features
  – About 8 hrs of training data were used
• The syllable error rate was reduced from 27.79% to 22.74%
  – an 18.17% relative improvement
NTNU Speech Lab 41
Conclusions
• MPE/MWE (or MMI) based discriminative training approaches have shown their effectiveness in Chinese continuous speech recognition
  – Joint training of feature transformation, acoustic models and language models
  – Unsupervised training
  – More in-depth investigation and analysis are needed
  – Exploration of variant accuracy/error functions
NTNU Speech Lab 42
References
• D. Povey, "Discriminative Training for Large Vocabulary Speech Recognition," Ph.D. dissertation, Peterhouse, University of Cambridge, July 2004
• K. Vertanen, "An Overview of Discriminative Training for Speech Recognition"
• V. Goel, S. Kumar, W. Byrne, "Segmental Minimum Bayes-Risk Decoding for Automatic Speech Recognition," IEEE Transactions on Speech and Audio Processing, May 2004
• J.-W. Kuo, "An Initial Study on Minimum Phone Error Discriminative Learning of Acoustic Models for Mandarin Large Vocabulary Continuous Speech Recognition," Master's thesis, NTNU, 2005
• J.-W. Kuo, B. Chen, "Minimum Word Error Based Discriminative Training of Language Models," Eurospeech 2005
Thank You !
NTNU Speech Lab 44
Minimum Phone Error Estimation
• An example showing the values of $\gamma_q^{MPE}$ and $c(q)$
  – reference = "pat"
[Figure: a word graph for the utterance, in which the correct-path arcs "p", "a", "t" have $c(q) = 2$ while the competing arcs (e.g., "s") have $c(q) = 1$; the average accuracy of all sentences in the graph is $c_{avg} = 1.5$]
  – E.g., if $\gamma_q = 0.5$ and $c(q) = 2$, then $\gamma_q^{MPE} = \gamma_q\left(c(q) - c_{avg}\right) = 0.5 \times (2 - 1.5) = 0.25$
NTNU Speech Lab 45
Extended Baum-Welch Algorithm
• A general formula of the Extended Baum-Welch algorithm (for discrete parameters $\lambda_{i,j}$):
$$\hat{\lambda}_{i,j} = \frac{\lambda_{i,j}\left(\dfrac{\partial F}{\partial\lambda_{i,j}} + D\right)}{\sum_{k}\lambda_{i,k}\left(\dfrac{\partial F}{\partial\lambda_{i,k}} + D\right)}$$
NTNU Speech Lab 46
MPE: Shortcomings
• The standard MPE objective function discourages insertion error more than deletion error
• The dynamic range of raw phone accuracy is typically quite narrow, which makes MPE occupancies considerably lower than MMI occupancies
NTNU Speech Lab 47
Other Issues
• Major computation load of discriminative training:– Decoding the training data– Collecting statistics for parameter estimation
• Discriminative training techniques often perform very well on the training data, but fail to generalize well to unseen data
• This generalization failure may result from the numerator and denominator being dominated by a very small number of paths, or from modeling features of the training data that are not significant to the task