Prof. Hui Jiang
Department of Computer Science and Engineering
York University, Toronto, Ont. M3J 1P3, CANADA
Email: [email protected]
Large-Margin HMM Estimation for Speech Recognition
(Joint work with Chao-Jun Liu and Xinwei Li)
Research Projects
• Hierarchical covariance modeling in CDHMM
  (joint with Y. Tian, J.-L. Zhou, MSRA, Beijing, China)
• Large-scale discriminative training based on MCE/GPD
  (joint with B. Liu, Univ. of Sci. & Tech. of China, J.-L. Zhou, MSRA)
• Large-margin HMM estimation for speech recognition
  (joint with C. Liu, X. Li, York Univ.)
Hierarchical Covariance Modeling (HCM) in CDHMM
Each Gaussian's covariance is modeled as a combination of its own diagonal covariance and a set of shared full-covariance prototypes:

    Σ*^(p) = λ_0^(p) · Σ_d^(p) + ∑_{k=1}^{K} λ_k^(p) · Σ_f^(k)

[Figure: each Gaussian p with feature X^(p) keeps a diagonal covariance Σ_d^(p) and borrows from shared full-covariance prototypes Σ_f^(1), Σ_f^(2), ... to form Σ*^(p)]
Hierarchical Covariance Modeling Schemes

Here Ψ(i) denotes the set of shared prototype covariances associated with Gaussian i:

• HCM:       Σ^(i) = ∑_{m∈Ψ(i)} λ_m^(i) Σ^(m)
• HPM:       [Σ^(i)]^(-1) = ∑_{m∈Ψ(i)} λ_m^(i) [Σ^(m)]^(-1)
• HCM+DIAG:  Σ^(i) = λ_0^(i) Σ_0^(i) + ∑_{m∈Ψ(i)} λ_m^(i) Σ^(m)
• HPM+DIAG:  [Σ^(i)]^(-1) = λ_0^(i) [diag(Σ_0^(i))]^(-1) + ∑_{m∈Ψ(i)} λ_m^(i) [Σ^(m)]^(-1)
• HCC:       Σ^(i) = diag(Σ^(i)) + ∑_{m∈Ψ(i)} λ_m^(i) [Σ^(m) − diag(Σ^(m))]
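To make the combination concrete, here is a minimal numpy sketch of the HCC scheme as reconstructed above; the function and argument names and the toy weights are ours, and a real system would also estimate the λ's (e.g., so that the result stays positive definite).

```python
import numpy as np

def hcc_covariance(sigma_own, prototypes, lambdas):
    """HCC sketch: keep the Gaussian's own diagonal variances and add
    off-diagonal correlation borrowed from shared full-covariance
    prototypes, weighted per Gaussian."""
    combined = np.diag(np.diag(sigma_own))          # diag(Sigma^(i))
    for lam, proto in zip(lambdas, prototypes):
        off_diag = proto - np.diag(np.diag(proto))  # Sigma^(m) - diag(Sigma^(m))
        combined += lam * off_diag
    return combined

# Toy usage with a hypothetical 3x3 prototype:
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
proto = A @ A.T                          # a full-covariance prototype
sigma = np.diag([1.0, 2.0, 0.5])         # the Gaussian's own diagonal covariance
print(hcc_covariance(sigma, [proto], [0.1]))
```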
Performance Comparison: RM database

  Method     Word Err. Rate   Err. Rate Reduction
  Baseline       4.09%              n/a
  HLDA           3.50%             14.4%
  MLLT           3.61%             11.7%
  STC            3.33%             18.6%
  MIC            3.49%             14.7%
  HCC            3.16%             22.7%
Performance Comparison: Switchboard (minitrain)

  Method                 Word Err. Rate   Err. Rate Reduction
  Baseline                   38.0%              n/a
  HLDA                       37.2%             2.10%
  STC                        36.6%             3.68%
  MIC (39 prototypes)        36.5%             3.94%
  HCC                        35.9%             5.53%
[Figure: decoding search space, state vs. time, showing a word-ending active path within the search beam]
Large-Scale GPD/MCE: In-Search Data Selection
[Figure: in-search data selection for a phone a-b+c. The reference segmentation ($-b-u b-u-k u-k+T k-T+i T-i+s i-s+T s-T+r T-r+i r-i+p i-p+$) is compared token by token against decoded paths: tokens matching the reference phone a-b+c form the true token sets, while mismatched tokens a'-b'+c' form the competing token sets]
Large-Scale Discriminative Training based on GPD/MCE
• Discriminative training: refine the original model set discriminatively based on the collected token sets.
[Figure: the HMM model is updated from the true tokens and the competing tokens]
Implementation Model for state-tied HMMs

[Figure: frames (feature vectors) aligned to the optimal Viterbi path (state sequence), with competing states marked at each frame]
Criteria in Discriminative Training

• Least Imposter Words (LIW): minimize the total number of imposter words during the decoding of all training data (see the counting sketch after this list).
  – Imposter words are incorrect words appearing within the beam width during Viterbi decoding with a higher likelihood than their reference model. (Jiang et al. 2002)
• Least Phone Competing Tokens (LPCT): minimize the total number of phone competing tokens during the decoding of all training data. (Liu, Jiang, et al. 2005)
• Least Incorrect Frames (LIF): minimize the total number of incorrectly decoded frames during the decoding of all training data.
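As a toy illustration of the LIW count (the function name and data layout are ours, not the paper's): given each reference word's own decoding score and the scores of wrong words that survived the beam over the same span, the imposters are simply the higher-scoring rivals.

```python
def count_imposter_words(ref_scores, competing):
    """LIW error count sketch: an imposter is an incorrect word inside
    the beam whose likelihood beats the reference word covering the same
    region.  competing[k] is a list of (word, score) rivals of reference
    word k; ref_scores[k] is that reference word's own score."""
    return sum(1
               for k, rivals in enumerate(competing)
               for _word, score in rivals
               if score > ref_scores[k])
```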
Discriminative Training: RM task

  Iteration   Training set WER(%)   Err Red   Test set WER(%)   Err Red
  0 (ML)             1.26             N/A          4.30            N/A
  1                  1.19              8%          4.16             3%
  5                  1.06             16%          3.96             8%
  6                  1.03             18%          4.06             6%
Discriminative Training: Switchboard

  Iteration   Training set WER(%)   Err Red   Test set WER(%)   Err Red
  0 (ML)            33.2              N/A         48.1             N/A
  1                 31.5               5%         47.1              2%
  4                 29.4              11%         46.0              4%
  6                 29.0              13%         46.4              4%
Large-Margin HMM Estimation for Speech Recognition
Outline

• Background:
  – Automatic Speech Recognition (ASR)
  – Large-margin classifiers
• Large margin for HMM-based classifiers
• A gradient ascent optimization for continuous density HMMs (CDHMMs) in speech recognition
• Preliminary experiments
• Final remarks and ongoing work
ASR Solution: MAP decision rule

    Ŵ = argmax_{W∈Ω} p(W|X)
      = argmax_{W∈Ω} P(W) · p(X|W)
      = argmax_{W∈Ω} P_Γ(W) · p_Λ(X|W)
      = argmax_{W∈Ω} F(X|W)

• p_Λ(X|W) — Acoustic Model (AM): the probability of generating feature X when W is uttered.
• P_Γ(W) — Language Model (LM): the prior probability of W (a word, phrase, or sentence) being said.
• F(X|W) = P_Γ(W) · p_Λ(X|W) — Discriminant Function.
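As a small illustration (not from the slides), the rule reduces to a one-line argmax over log scores; `lm_prob` and `am_loglik` below are hypothetical stand-ins for a trained language model and an HMM acoustic scorer.

```python
import math

def map_decode(x, candidates, lm_prob, am_loglik):
    """MAP rule in the log domain:
    W_hat = argmax_W [ log P_Gamma(W) + log p_Lambda(X|W) ]."""
    return max(candidates,
               key=lambda w: math.log(lm_prob(w)) + am_loglik(x, w))
```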
Existing HMM Estimation Methods

• Maximum Likelihood Estimation (MLE)
  – The Baum-Welch algorithm: the EM algorithm for HMMs
• Discriminative Training
  – Maximum Mutual Information Estimation (MMIE)
  – Minimum Classification Error (MCE)
• Discriminative training improves (more or less) over standard ML training.
• All discriminative training methods suffer from poor generalization.
Large-Margin Classifier: Support Vector Machine (SVM)

[Figure: SVM decision boundaries; the classifier with the larger margin is preferred]
Large-Margin Classifiers

• Why do larger-margin classifiers yield better generalization performance?
• Conceptually, a large margin gives:
  – robustness w.r.t. data patterns
  – robustness w.r.t. classifier parameters
• The theory in machine learning: an upper bound on the generalization error rate holds with probability at least 1 − δ,

    R ≤ R_emp + C · √( ( V_d log²(2M/V_d) + log(1/δ) ) / M )

  where M is the number of training samples and V_d is a capacity term that shrinks as the margin grows.
How about using SVM for Speech Recognition?

• Done in some simple ASR tasks:
  – phoneme recognition
  – speaker recognition
  – small-vocabulary isolated speech recognition
• No significant improvement has been reported.
  – still not a mainstream method
• Why?
  – lack of a proper kernel function to map speech samples from one dynamic high-dimensional space to another high-dimensional space that suits linear classifiers.
Large-Margin HMM-based Classifier

[Figure: model Λ1 and model Λ2 with separation boundary F(X|Λ1) − F(X|Λ2) = 0]
Large-Margin HMM-based Classifier

[Figure: the original separation boundary F(X|Λ1) − F(X|Λ2) = 0 is moved by re-estimating Λ1, Λ2 as Λ'1, Λ'2, giving the new separation boundary F(X|Λ'1) − F(X|Λ'2) = 0]
How to define separation margin? (1)

• In a 2-class separable problem:
  – For a data token x1 of class 1:  d(x1) = F(x1|Λ1) − F(x1|Λ2) > 0
  – For a data token x2 of class 2:  d(x2) = F(x2|Λ2) − F(x2|Λ1) > 0
How to define separation margin? (2)

• Extend to the multi-class problem:
  – N classes Λ1, Λ2, …, ΛN
  – For a data token x_i of class i:

    d(x_i) = F(x_i|Λ_i) − max_{j≠i} F(x_i|Λ_j)
           = min_{j≠i} [ F(x_i|Λ_i) − F(x_i|Λ_j) ]
Large-Margin Estimation of HMMs

• An N-class problem: each class is represented by an HMM, Λ = {Λ1, Λ2, …, ΛN}.
• Given a training set D, define a subset, called the support token set S, as:

    S = { X_i | X_i ∈ D and 0 ≤ d(X_i) ≤ ε }

• Large-Margin Estimation (LME) of HMMs:

    Λ̂ = argmax_Λ min_{X_i∈S} d(X_i)   subject to d(X_i) > 0 for all X_i ∈ S
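A minimal numpy sketch of these two definitions (our naming, not the paper's): `F[n, j] = F(X_n|Λ_j)` is assumed to be a precomputed score matrix from a decoder, and `labels[n]` is the true class of token X_n.

```python
import numpy as np

def margins(F, labels):
    """d(X_n) = F(X_n|Lambda_true) - max_{j != true} F(X_n|Lambda_j)."""
    F = np.asarray(F, dtype=float)
    n = np.arange(F.shape[0])
    own = F[n, labels]
    competing = F.copy()
    competing[n, labels] = -np.inf        # exclude the true class
    return own - competing.max(axis=1)

def support_set(F, labels, eps):
    """S = { X_n : 0 <= d(X_n) <= eps } -- correctly classified tokens
    lying within eps of the decision boundary."""
    d = margins(F, labels)
    return np.where((d >= 0) & (d <= eps))[0]
```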
Large-Margin Estimation of HMMs

• Convert it into an equivalent minimax optimization problem (assume X_i belongs to class i):

    Λ̂ = argmin_Λ max_{X_i∈S, j≠i} [ F(X_i|Λ_j) − F(X_i|Λ_i) ]

    subject to the constraints:
    F(X_i|Λ_j) − F(X_i|Λ_i) < 0  for all X_i ∈ S and j ≠ i.
Two difficulties

• No. 1: without additional constraints on F(X_i|Λ_i) and F(X_i|Λ_j) during the optimization, the maximum margin does not exist.
  – e.g., scaling up both F(X_i|Λ_i) and F(X_i|Λ_j) increases the margin without limit.
• No. 2: how to do the optimization?
  – Standard optimization tools, such as the Matlab Optimization Toolbox, could be used.
  – However, they are too slow.
How to guarantee existence of Maximum Margin? (1)

• Solution one: maximize the relative margin instead:

    d'(x_i) = [ F(x_i|Λ_i) − max_{j≠i} F(x_i|Λ_j) ] / F(x_i|Λ_i)
            = 1 − max_{j≠i} F(x_i|Λ_j) / F(x_i|Λ_i)

  Since −∞ < d'(x_i) ≤ 1, the maximum always exists.

• This is called Large Relative Margin Estimation (LRME).
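A small numeric check (ours, not from the slides) of why the relative margin fixes difficulty No. 1: rescaling all discriminant scores leaves d' unchanged, and d' can never exceed 1 (for positive scores, as assumed here).

```python
import numpy as np

def relative_margin(f_own, f_competing):
    """d'(x) = 1 - max_j F(x|Lambda_j) / F(x|Lambda_i)."""
    return 1.0 - np.max(f_competing) / f_own

f_own, f_comp = 12.0, np.array([10.5, 9.8])
# Invariant under the degenerate scaling that made the plain margin unbounded:
assert np.isclose(relative_margin(f_own, f_comp),
                  relative_margin(10 * f_own, 10 * f_comp))
assert relative_margin(f_own, f_comp) <= 1.0
```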
How to guarantee existence of Maximum Margin? (2)

• Solution two: optimize one HMM at a time.
  – Do:
    • for each i, solve the sub-optimization problem

        Λ̂_i = argmin_{Λ_i} max_{X_i∈S, j≠i} [ F(X_i|Λ_j) − F(X_i|Λ_i) ]

        subject to the constraints:
        F(X_i|Λ_j) − F(X_i|Λ_i) < 0  for all X_i ∈ S and j ≠ i,

      where all other HMMs are kept constant in the above optimization.
  – Until convergence.
• This is called iterative localized optimization (ILO).
Iterative Localized Optimization
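A minimal sketch of the ILO loop just described, assuming a hypothetical per-class solver `update_one` for the sub-problem and a `min_margin` evaluator over the support set; the stopping rule and round limit are our choices.

```python
def iterative_localized_optimization(models, update_one, min_margin,
                                     tol=1e-4, max_rounds=20):
    """Cycle through the HMMs, re-estimating one model at a time while
    all the others are kept constant, until the minimum margin stops
    improving."""
    best = min_margin(models)
    for _ in range(max_rounds):
        for i in range(len(models)):
            models[i] = update_one(models, i)   # sub-optimization for class i
        current = min_margin(models)
        if current - best < tol:                # converged
            return models
        best = current
    return models
```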
How to do optimization? (1)

• Use the gradient ascent method to maximize a lower bound of the minimum margin.
  – Use a continuous and differentiable function to approximate the minimum margin:

    Λ̂ = argmax_Λ min_{X_i∈S} d(X_i) = argmax_Λ Q(Λ),   where Q(Λ) = min_{X_i∈S} d(X_i)
How to do optimization? (2)

• Approximate Q(Λ) with a log-sum of exponentials:

    Q(Λ) ≈ Q(η, Λ) = (1/η) · log ∑_{X_i∈S, j≠i} exp[ η · d_j(X_i) ],
    d_j(X_i) = F(X_i|Λ_i) − F(X_i|Λ_j)

  with Q(η, Λ) < Q(Λ) for η < 0 and lim_{η→−∞} Q(η, Λ) = Q(Λ).

• Optimize Q(η, Λ) instead:

    Λ̂' = argmax_Λ Q(η, Λ)

• The gradient ascent method:

    Λ̂'(n+1) = Λ̂'(n) + ε · ∂Q(η, Λ)/∂Λ |_{Λ = Λ̂'(n)}
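A numpy sketch (ours) of this soft-min trick: for η < 0 the log-sum-exp is a smooth lower bound on the minimum margin that tightens as η → −∞, and its gradient is a softmax-weighted average of the per-token gradients, matching the expression on the next slide. The names and the value η = −20 are our choices.

```python
import numpy as np

def soft_min(d, eta=-20.0):
    """Q(eta) = (1/eta) * log sum_i exp(eta * d_i), computed stably.
    For eta < 0 this lower-bounds min_i d_i."""
    z = eta * np.asarray(d, dtype=float)
    m = z.max()                                  # stabilize log-sum-exp
    return (m + np.log(np.exp(z - m).sum())) / eta

def soft_min_grad(d, grad_d, eta=-20.0):
    """dQ/dtheta = sum_i w_i * (d d_i / dtheta), with softmax weights
    w_i proportional to exp(eta * d_i); grad_d[i] is token i's gradient."""
    z = eta * np.asarray(d, dtype=float)
    w = np.exp(z - z.max())
    w /= w.sum()
    return sum(wi * gi for wi, gi in zip(w, grad_d))

d = np.array([0.3, 1.2, 0.9])
# Lower bound for eta < 0, tighter for more negative eta:
assert soft_min(d) <= d.min() < soft_min(d, eta=-200.0) + 1e-6
```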
How to calculate the gradient for continuous density HMMs? (1)

    ∂Q(η)/∂Λ = [ ∑_{X_i∈S, j≠i} exp[η · d(X_i)] · ∂d(X_i)/∂Λ ] / [ ∑_{X_i∈S, j≠i} exp[η · d(X_i)] ]

    ∂d(X_i)/∂Λ_i =  ∂F(X_i|Λ_i)/∂Λ_i
    ∂d(X_i)/∂Λ_j = −∂F(X_i|Λ_j)/∂Λ_j
How to calculate the gradient for continuous density HMMs? (2)

• Assumption 1: adjust CDHMM mean vectors only.
• Assumption 2: diagonal precision matrices.
• Assumption 3: use the Viterbi approximation:

    F(X|Λ_i) ≈ C' − (1/2) ∑_{t=1}^{T} ∑_{d=1}^{D} r^(i)_{s_t l_t d} ( X_{td} − m^(i)_{s_t l_t d} )²

    F(X|Λ_j) ≈ C'' − (1/2) ∑_{t=1}^{T} ∑_{d=1}^{D} r^(j)_{s'_t l'_t d} ( X_{td} − m^(j)_{s'_t l'_t d} )²

  where s_t (s'_t) is the Viterbi state at frame t, l_t (l'_t) the dominant mixture component, m the Gaussian means, and r the diagonal precisions.
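Under these three assumptions the gradient with respect to the means has a simple closed form: each frame pushes the mean of its aligned Gaussian toward the observation, scaled by the precision. A sketch with our naming, where `align[t]` is the index of the Gaussian the Viterbi path assigns to frame t:

```python
import numpy as np

def grad_wrt_means(X, align, means, precisions):
    """Gradient of F(X|Lambda) ~= C - 1/2 sum_t sum_d r_{g,d}(X_{t,d} - m_{g,d})^2
    with respect to the means: d/dm of each quadratic term is r * (x - m)."""
    grad = np.zeros_like(means)
    for t, g in enumerate(align):
        grad[g] += precisions[g] * (X[t] - means[g])
    return grad
```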
How to handle recognition errors in the training set?

• Given the training set D and the current model Λ, define the error set:

    Ψ = { X_i | X_i ∈ D and d(X_i) ≤ 0 }

• Use the MCE (minimum classification error)/GPD algorithm to update the model based on Ψ to reduce |Ψ|.
• Intuitively, the MCE algorithm moves the separation boundary to correctly classify as many error tokens as possible.
• Use the MCE-trained models as initial models to start large-margin estimation (LME).
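A small sketch (our naming) of the resulting two-stage pipeline: split tokens by margin, shrink the error set with MCE/GPD, then run LME on the support set starting from the MCE-trained models.

```python
def partition_tokens(d, eps):
    """Error set (d <= 0) feeds the MCE/GPD stage; support set
    (0 <= d <= eps) feeds the subsequent large-margin estimation."""
    error_set = [i for i, di in enumerate(d) if di <= 0]
    support_set = [i for i, di in enumerate(d) if 0 <= di <= eps]
    return error_set, support_set
```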
Preliminary Experiments

• English alphabet E-set recognition
  – Use the OGI ISOLET database
• Speaker-independent, small-vocabulary, isolated-word recognition
• Feature vector (39-d): (12 MFCC + E) + Δ + ΔΔ
• Our best MLE system:
  – 16-state whole-word CDHMM for each letter
  – 4 Gaussian mixtures per state
• Achieves 96.15% accuracy on a standard test set (26 letters)
  – comparable to other reported systems: OGI (96%), Cambridge (96.73%)
• Testing our best system on the E-set only: 91.5%
Preliminary Results: E-Set ASR Performance Comparison

Word accuracy (%) comparison among various HMM training approaches:

            ML      MCE     LRME    LME-ILO
  1-mix     85.6    91.5    93.5     92.8
  2-mix     90.6    94.0    95.0     95.2
  4-mix     91.5    94.4    95.2     n/a

ML: Maximum Likelihood Estimation
MCE: Minimum Classification Error
LRME: Large Relative Margin Estimation
LME-ILO: Large Margin Estimation via Iterative Localized Optimization
[Figure: LME learning curves (1-mix): accuracy on the test set, actual margin, and objective function Q versus iteration]
[Figure: LME learning curves (2-mix): accuracy on the test set, actual margin, and objective function Q versus iteration]
Final Remarks

• These conclusions are based on preliminary experimental results only.
• LME can yield better performance than MLE and MCE.
• The margin is a good indicator of the generalization capability of an HMM-based speech recognizer.
• Maximizing the objective function Q (the lower bound) effectively increases the actual separation margin.
Ongoing and Future Works

• More theoretical explorations:
  – How to formulate the constraints in a theoretically sound way?
  – How to re-formulate LME as another type of optimization problem with more efficient solutions?
    • semi-definite programming (SDP)?
• Practically, extend to large-scale continuous ASR tasks:
  – TIDIGITS experiments under way
  – SPINE very soon
  – Switchboard