Prof. Hui Jiang
Department of Computer Science and Engineering
York University, Toronto, Ont. M3J 1P3, CANADA
Email: [email protected]
Large-Margin HMM Estimation for Speech Recognition
(Joint work with Chao-Jun Liu and Xinwei Li)
Research Projects
• Hierarchical covariance modeling in CDHMM
  (joint with Y. Tian, J.-L. Zhou, MSRA, Beijing, China)
• Large-scale discriminative training based on MCE/GPD
  (joint with B. Liu, Univ. of Sci. & Tech. of China, J.-L. Zhou, MSRA)
• Large-margin HMM estimation for speech recognition
  (joint with C. Liu, X. Li, York Univ.)
Hierarchical Covariance Modeling (HCM) in CDHMM
Each Gaussian's covariance is modeled as a combination of its own diagonal covariance and a set of shared full-covariance prototypes:

    Σ*^(p) = λ_0^(p) · Σ_d^(p) + ∑_{k=1}^{K} λ_k^(p) · Σ_f^(k)

[Figure: each Gaussian p with feature X^(p) keeps a diagonal covariance Σ_d^(p) and borrows from shared full-covariance prototypes Σ_f^(1), Σ_f^(2), ... to form Σ*^(p)]
Hierarchical Covariance Modeling Schemes

Here Ψ(i) denotes the set of shared prototype covariances associated with Gaussian i:

• HCM:       Σ^(i) = ∑_{m∈Ψ(i)} λ_m^(i) Σ^(m)
• HPM:       [Σ^(i)]^(-1) = ∑_{m∈Ψ(i)} λ_m^(i) [Σ^(m)]^(-1)
• HCM+DIAG:  Σ^(i) = λ_0^(i) Σ_0^(i) + ∑_{m∈Ψ(i)} λ_m^(i) Σ^(m)
• HPM+DIAG:  [Σ^(i)]^(-1) = λ_0^(i) [diag(Σ_0^(i))]^(-1) + ∑_{m∈Ψ(i)} λ_m^(i) [Σ^(m)]^(-1)
• HCC:       Σ^(i) = diag(Σ^(i)) + ∑_{m∈Ψ(i)} λ_m^(i) [Σ^(m) − diag(Σ^(m))]
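To make the combination concrete, here is a minimal numpy sketch of the HCC scheme as reconstructed above; the function and argument names and the toy weights are ours, and a real system would also estimate the λ's (e.g., so that the result stays positive definite).

```python
import numpy as np

def hcc_covariance(sigma_own, prototypes, lambdas):
    """HCC sketch: keep the Gaussian's own diagonal variances and add
    off-diagonal correlation borrowed from shared full-covariance
    prototypes, weighted per Gaussian."""
    combined = np.diag(np.diag(sigma_own))          # diag(Sigma^(i))
    for lam, proto in zip(lambdas, prototypes):
        off_diag = proto - np.diag(np.diag(proto))  # Sigma^(m) - diag(Sigma^(m))
        combined += lam * off_diag
    return combined

# Toy usage with a hypothetical 3x3 prototype:
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
proto = A @ A.T                          # a full-covariance prototype
sigma = np.diag([1.0, 2.0, 0.5])         # the Gaussian's own diagonal covariance
print(hcc_covariance(sigma, [proto], [0.1]))
```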
Performance Comparison: RM database

  Method     Word Err. Rate   Err. Rate Reduction
  Baseline       4.09%              n/a
  HLDA           3.50%             14.4%
  MLLT           3.61%             11.7%
  STC            3.33%             18.6%
  MIC            3.49%             14.7%
  HCC            3.16%             22.7%
Performance Comparison: Switchboard (minitrain)

  Method                 Word Err. Rate   Err. Rate Reduction
  Baseline                   38.0%              n/a
  HLDA                       37.2%             2.10%
  STC                        36.6%             3.68%
  MIC (39 prototypes)        36.5%             3.94%
  HCC                        35.9%             5.53%
[Figure: decoding search space, state vs. time, showing a word-ending active path within the search beam]
Large-Scale GPD/MCE: In-Search Data Selection
[Figure: in-search data selection for a phone a-b+c. The reference segmentation ($-b-u b-u-k u-k+T k-T+i T-i+s i-s+T s-T+r T-r+i r-i+p i-p+$) is compared token by token against decoded paths: tokens matching the reference phone a-b+c form the true token sets, while mismatched tokens a'-b'+c' form the competing token sets]
Large-Scale Discriminative Training based on GPD/MCE
• Discriminative training: refine the original model set discriminatively based on the collected token sets.
[Figure: the HMM model is updated from the true tokens and the competing tokens]
Implementation Model for state-tied HMMs

[Figure: frames (feature vectors) aligned to the optimal Viterbi path (state sequence), with competing states marked at each frame]
Criteria in Discriminative Training

• Least Imposter Words (LIW): minimize the total number of imposter words during the decoding of all training data (see the counting sketch after this list).
  – Imposter words are incorrect words appearing within the beam width during Viterbi decoding with a higher likelihood than their reference model. (Jiang et al. 2002)
• Least Phone Competing Tokens (LPCT): minimize the total number of phone competing tokens during the decoding of all training data. (Liu, Jiang, et al. 2005)
• Least Incorrect Frames (LIF): minimize the total number of incorrectly decoded frames during the decoding of all training data.
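As a toy illustration of the LIW count (the function name and data layout are ours, not the paper's): given each reference word's own decoding score and the scores of wrong words that survived the beam over the same span, the imposters are simply the higher-scoring rivals.

```python
def count_imposter_words(ref_scores, competing):
    """LIW error count sketch: an imposter is an incorrect word inside
    the beam whose likelihood beats the reference word covering the same
    region.  competing[k] is a list of (word, score) rivals of reference
    word k; ref_scores[k] is that reference word's own score."""
    return sum(1
               for k, rivals in enumerate(competing)
               for _word, score in rivals
               if score > ref_scores[k])
```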
Discriminative Training: RM task

  Iteration   Training set WER(%)   Err Red   Test set WER(%)   Err Red
  0 (ML)             1.26             N/A          4.30            N/A
  1                  1.19              8%          4.16             3%
  5                  1.06             16%          3.96             8%
  6                  1.03             18%          4.06             6%
Discriminative Training: Switchboard

  Iteration   Training set WER(%)   Err Red   Test set WER(%)   Err Red
  0 (ML)            33.2              N/A         48.1             N/A
  1                 31.5               5%         47.1              2%
  4                 29.4              11%         46.0              4%
  6                 29.0              13%         46.4              4%
Large-Margin HMM Estimation for Speech Recognition
Outline

• Background:
  – Automatic Speech Recognition (ASR)
  – Large-margin classifiers
• Large margin for HMM-based classifiers
• A gradient ascent optimization for continuous density HMMs (CDHMMs) in speech recognition
• Preliminary experiments
• Final remarks and ongoing work
ASR Solution: MAP decision rule

    Ŵ = argmax_{W∈Ω} p(W|X)
      = argmax_{W∈Ω} P(W) · p(X|W)
      = argmax_{W∈Ω} P_Γ(W) · p_Λ(X|W)
      = argmax_{W∈Ω} F(X|W)

• p_Λ(X|W) — Acoustic Model (AM): the probability of generating feature X when W is uttered.
• P_Γ(W) — Language Model (LM): the prior probability of W (a word, phrase, or sentence) being said.
• F(X|W) = P_Γ(W) · p_Λ(X|W) — Discriminant Function.
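As a small illustration (not from the slides), the rule reduces to a one-line argmax over log scores; `lm_prob` and `am_loglik` below are hypothetical stand-ins for a trained language model and an HMM acoustic scorer.

```python
import math

def map_decode(x, candidates, lm_prob, am_loglik):
    """MAP rule in the log domain:
    W_hat = argmax_W [ log P_Gamma(W) + log p_Lambda(X|W) ]."""
    return max(candidates,
               key=lambda w: math.log(lm_prob(w)) + am_loglik(x, w))
```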
Existing HMM Estimation Methods

• Maximum Likelihood Estimation (MLE)
  – The Baum-Welch algorithm: the EM algorithm for HMMs
• Discriminative Training
  – Maximum Mutual Information Estimation (MMIE)
  – Minimum Classification Error (MCE)
• Discriminative training improves (more or less) over standard ML training.
• All discriminative training methods suffer from poor generalization.
Large-Margin Classifier: Support Vector Machine (SVM)

[Figure: SVM decision boundaries; the classifier with the larger margin is preferred]
Large-Margin Classifiers

• Why do larger-margin classifiers yield better generalization performance?
• Conceptually, a large margin gives:
  – robustness w.r.t. data patterns
  – robustness w.r.t. classifier parameters
• The theory in machine learning: an upper bound on the generalization error rate holds with probability at least 1 − δ,

    R ≤ R_emp + C · √( ( V_d log²(2M/V_d) + log(1/δ) ) / M )

  where M is the number of training samples and V_d is a capacity term that shrinks as the margin grows.
How about using SVM for Speech Recognition?

• Done in some simple ASR tasks:
  – phoneme recognition
  – speaker recognition
  – small-vocabulary isolated speech recognition
• No significant improvement has been reported.
  – still not a mainstream method
• Why?
  – lack of a proper kernel function to map speech samples from one dynamic high-dimensional space to another high-dimensional space that suits linear classifiers.
Large-Margin HMM-based Classifier

[Figure: model Λ1 and model Λ2 with separation boundary F(X|Λ1) − F(X|Λ2) = 0]
Large-Margin HMM-based Classifier

[Figure: the original separation boundary F(X|Λ1) − F(X|Λ2) = 0 is moved by re-estimating Λ1, Λ2 as Λ'1, Λ'2, giving the new separation boundary F(X|Λ'1) − F(X|Λ'2) = 0]
How to define separation margin? (1)

• In a 2-class separable problem:
  – For a data token x1 of class 1:  d(x1) = F(x1|Λ1) − F(x1|Λ2) > 0
  – For a data token x2 of class 2:  d(x2) = F(x2|Λ2) − F(x2|Λ1) > 0
How to define separation margin? (2)

• Extend to the multi-class problem:
  – N classes Λ1, Λ2, …, ΛN
  – For a data token x_i of class i:

    d(x_i) = F(x_i|Λ_i) − max_{j≠i} F(x_i|Λ_j)
           = min_{j≠i} [ F(x_i|Λ_i) − F(x_i|Λ_j) ]
Large-Margin Estimation of HMMs

• An N-class problem: each class is represented by an HMM, Λ = {Λ1, Λ2, …, ΛN}.
• Given a training set D, define a subset, called the support token set S, as:

    S = { X_i | X_i ∈ D and 0 ≤ d(X_i) ≤ ε }

• Large-Margin Estimation (LME) of HMMs:

    Λ̂ = argmax_Λ min_{X_i∈S} d(X_i)   subject to d(X_i) > 0 for all X_i ∈ S
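A minimal numpy sketch of these two definitions (our naming, not the paper's): `F[n, j] = F(X_n|Λ_j)` is assumed to be a precomputed score matrix from a decoder, and `labels[n]` is the true class of token X_n.

```python
import numpy as np

def margins(F, labels):
    """d(X_n) = F(X_n|Lambda_true) - max_{j != true} F(X_n|Lambda_j)."""
    F = np.asarray(F, dtype=float)
    n = np.arange(F.shape[0])
    own = F[n, labels]
    competing = F.copy()
    competing[n, labels] = -np.inf        # exclude the true class
    return own - competing.max(axis=1)

def support_set(F, labels, eps):
    """S = { X_n : 0 <= d(X_n) <= eps } -- correctly classified tokens
    lying within eps of the decision boundary."""
    d = margins(F, labels)
    return np.where((d >= 0) & (d <= eps))[0]
```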
Large-Margin Estimation of HMMs

• Convert it into an equivalent minimax optimization problem (assume X_i belongs to class i):

    Λ̂ = argmin_Λ max_{X_i∈S, j≠i} [ F(X_i|Λ_j) − F(X_i|Λ_i) ]

    subject to the constraints:
    F(X_i|Λ_j) − F(X_i|Λ_i) < 0  for all X_i ∈ S and j ≠ i.
Two difficulties

• No. 1: without additional constraints on F(X_i|Λ_i) and F(X_i|Λ_j) during the optimization, the maximum margin does not exist.
  – e.g., scaling up both F(X_i|Λ_i) and F(X_i|Λ_j) increases the margin without limit.
• No. 2: how to do the optimization?
  – Standard optimization tools, such as the Matlab Optimization Toolbox, could be used.
  – However, they are too slow.
How to guarantee existence of Maximum Margin? (1)

• Solution one: maximize the relative margin instead:

    d'(x_i) = [ F(x_i|Λ_i) − max_{j≠i} F(x_i|Λ_j) ] / F(x_i|Λ_i)
            = 1 − max_{j≠i} F(x_i|Λ_j) / F(x_i|Λ_i)

  Since −∞ < d'(x_i) ≤ 1, the maximum always exists.

• This is called Large Relative Margin Estimation (LRME).
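A small numeric check (ours, not from the slides) of why the relative margin fixes difficulty No. 1: rescaling all discriminant scores leaves d' unchanged, and d' can never exceed 1 (for positive scores, as assumed here).

```python
import numpy as np

def relative_margin(f_own, f_competing):
    """d'(x) = 1 - max_j F(x|Lambda_j) / F(x|Lambda_i)."""
    return 1.0 - np.max(f_competing) / f_own

f_own, f_comp = 12.0, np.array([10.5, 9.8])
# Invariant under the degenerate scaling that made the plain margin unbounded:
assert np.isclose(relative_margin(f_own, f_comp),
                  relative_margin(10 * f_own, 10 * f_comp))
assert relative_margin(f_own, f_comp) <= 1.0
```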
How to guarantee existence of Maximum Margin? (2)

• Solution two: optimize one HMM at a time.
  – Do:
    • for each i, solve the sub-optimization problem

        Λ̂_i = argmin_{Λ_i} max_{X_i∈S, j≠i} [ F(X_i|Λ_j) − F(X_i|Λ_i) ]

        subject to the constraints:
        F(X_i|Λ_j) − F(X_i|Λ_i) < 0  for all X_i ∈ S and j ≠ i,

      where all other HMMs are kept constant in the above optimization.
  – Until convergence.
• This is called iterative localized optimization (ILO).
Iterative Localized Optimization
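A minimal sketch of the ILO loop just described, assuming a hypothetical per-class solver `update_one` for the sub-problem and a `min_margin` evaluator over the support set; the stopping rule and round limit are our choices.

```python
def iterative_localized_optimization(models, update_one, min_margin,
                                     tol=1e-4, max_rounds=20):
    """Cycle through the HMMs, re-estimating one model at a time while
    all the others are kept constant, until the minimum margin stops
    improving."""
    best = min_margin(models)
    for _ in range(max_rounds):
        for i in range(len(models)):
            models[i] = update_one(models, i)   # sub-optimization for class i
        current = min_margin(models)
        if current - best < tol:                # converged
            return models
        best = current
    return models
```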
How to do optimization? (1)

• Use the gradient ascent method to maximize a lower bound of the minimum margin.
  – Use a continuous and differentiable function to approximate the minimum margin:

    Λ̂ = argmax_Λ min_{X_i∈S} d(X_i) = argmax_Λ Q(Λ),   where Q(Λ) = min_{X_i∈S} d(X_i)
How to do optimization? (2)

• Approximate Q(Λ) with a log-sum of exponentials:

    Q(Λ) ≈ Q(η, Λ) = (1/η) · log ∑_{X_i∈S, j≠i} exp[ η · d_j(X_i) ],
    d_j(X_i) = F(X_i|Λ_i) − F(X_i|Λ_j)

  with Q(η, Λ) < Q(Λ) for η < 0 and lim_{η→−∞} Q(η, Λ) = Q(Λ).

• Optimize Q(η, Λ) instead:

    Λ̂' = argmax_Λ Q(η, Λ)

• The gradient ascent method:

    Λ̂'(n+1) = Λ̂'(n) + ε · ∂Q(η, Λ)/∂Λ |_{Λ = Λ̂'(n)}
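A numpy sketch (ours) of this soft-min trick: for η < 0 the log-sum-exp is a smooth lower bound on the minimum margin that tightens as η → −∞, and its gradient is a softmax-weighted average of the per-token gradients, matching the expression on the next slide. The names and the value η = −20 are our choices.

```python
import numpy as np

def soft_min(d, eta=-20.0):
    """Q(eta) = (1/eta) * log sum_i exp(eta * d_i), computed stably.
    For eta < 0 this lower-bounds min_i d_i."""
    z = eta * np.asarray(d, dtype=float)
    m = z.max()                                  # stabilize log-sum-exp
    return (m + np.log(np.exp(z - m).sum())) / eta

def soft_min_grad(d, grad_d, eta=-20.0):
    """dQ/dtheta = sum_i w_i * (d d_i / dtheta), with softmax weights
    w_i proportional to exp(eta * d_i); grad_d[i] is token i's gradient."""
    z = eta * np.asarray(d, dtype=float)
    w = np.exp(z - z.max())
    w /= w.sum()
    return sum(wi * gi for wi, gi in zip(w, grad_d))

d = np.array([0.3, 1.2, 0.9])
# Lower bound for eta < 0, tighter for more negative eta:
assert soft_min(d) <= d.min() < soft_min(d, eta=-200.0) + 1e-6
```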
How to calculate the gradient for continuous density HMMs? (1)

    ∂Q(η)/∂Λ = [ ∑_{X_i∈S, j≠i} exp[η · d(X_i)] · ∂d(X_i)/∂Λ ] / [ ∑_{X_i∈S, j≠i} exp[η · d(X_i)] ]

    ∂d(X_i)/∂Λ_i =  ∂F(X_i|Λ_i)/∂Λ_i
    ∂d(X_i)/∂Λ_j = −∂F(X_i|Λ_j)/∂Λ_j
How to calculate the gradient for continuous density HMMs? (2)

• Assumption 1: adjust CDHMM mean vectors only.
• Assumption 2: diagonal precision matrices.
• Assumption 3: use the Viterbi approximation:

    F(X|Λ_i) ≈ C' − (1/2) ∑_{t=1}^{T} ∑_{d=1}^{D} r^(i)_{s_t l_t d} ( X_{td} − m^(i)_{s_t l_t d} )²

    F(X|Λ_j) ≈ C'' − (1/2) ∑_{t=1}^{T} ∑_{d=1}^{D} r^(j)_{s'_t l'_t d} ( X_{td} − m^(j)_{s'_t l'_t d} )²

  where s_t (s'_t) is the Viterbi state at frame t, l_t (l'_t) the dominant mixture component, m the Gaussian means, and r the diagonal precisions.
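Under these three assumptions the gradient with respect to the means has a simple closed form: each frame pushes the mean of its aligned Gaussian toward the observation, scaled by the precision. A sketch with our naming, where `align[t]` is the index of the Gaussian the Viterbi path assigns to frame t:

```python
import numpy as np

def grad_wrt_means(X, align, means, precisions):
    """Gradient of F(X|Lambda) ~= C - 1/2 sum_t sum_d r_{g,d}(X_{t,d} - m_{g,d})^2
    with respect to the means: d/dm of each quadratic term is r * (x - m)."""
    grad = np.zeros_like(means)
    for t, g in enumerate(align):
        grad[g] += precisions[g] * (X[t] - means[g])
    return grad
```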
How to handle recognition errors in the training set?

• Given the training set D and the current model Λ, define the error set:

    Ψ = { X_i | X_i ∈ D and d(X_i) ≤ 0 }

• Use the MCE (minimum classification error)/GPD algorithm to update the model based on Ψ to reduce |Ψ|.
• Intuitively, the MCE algorithm moves the separation boundary to correctly classify as many error tokens as possible.
• Use the MCE-trained models as initial models to start large-margin estimation (LME).
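A small sketch (our naming) of the resulting two-stage pipeline: split tokens by margin, shrink the error set with MCE/GPD, then run LME on the support set starting from the MCE-trained models.

```python
def partition_tokens(d, eps):
    """Error set (d <= 0) feeds the MCE/GPD stage; support set
    (0 <= d <= eps) feeds the subsequent large-margin estimation."""
    error_set = [i for i, di in enumerate(d) if di <= 0]
    support_set = [i for i, di in enumerate(d) if 0 <= di <= eps]
    return error_set, support_set
```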
Preliminary Experiments

• English alphabet E-set recognition
  – Use the OGI ISOLET database
• Speaker-independent, small-vocabulary, isolated-word recognition
• Feature vector (39-d): (12 MFCC + E) + Δ + ΔΔ
• Our best MLE system:
  – 16-state whole-word CDHMM for each letter
  – 4 Gaussian mixtures per state
• Achieves 96.15% accuracy on a standard test set (26 letters)
  – comparable to other reported systems: OGI (96%), Cambridge (96.73%)
• Testing our best system on the E-set only: 91.5%
Preliminary Results: E-Set ASR Performance Comparison

Word accuracy (%) comparison among various HMM training approaches:

            ML      MCE     LRME    LME-ILO
  1-mix     85.6    91.5    93.5     92.8
  2-mix     90.6    94.0    95.0     95.2
  4-mix     91.5    94.4    95.2     n/a

ML: Maximum Likelihood Estimation
MCE: Minimum Classification Error
LRME: Large Relative Margin Estimation
LME-ILO: Large Margin Estimation via Iterative Localized Optimization
[Figure: LME learning curves (1-mix): accuracy on the test set, actual margin, and objective function Q versus iteration]
[Figure: LME learning curves (2-mix): accuracy on the test set, actual margin, and objective function Q versus iteration]
Final Remarks

• These conclusions are based on preliminary experimental results only.
• LME can yield better performance than MLE and MCE.
• The margin is a good indicator of the generalization capability of an HMM-based speech recognizer.
• Maximizing the objective function Q (the lower bound) effectively increases the actual separation margin.
Ongoing and Future Works

• More theoretical explorations:
  – How to formulate the constraints in a theoretically sound way?
  – How to re-formulate LME as another type of optimization problem with more efficient solutions?
    • semi-definite programming (SDP)?
• Practically, extend to large-scale continuous ASR tasks:
  – TIDIGITS experiments under way
  – SPINE very soon
  – Switchboard