MIT Computer Science and Artificial Intelligence Laboratory
SPOKEN LANGUAGE SYSTEMS
Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks
Karen Livescu
JHU Workshop Planning Meeting
April 16, 2004
Joint work with Jim Glass
MIT CSAIL
SLS

Preview
• The problem of pronunciation variation for automatic speech recognition (ASR)
• Traditional methods: phone-based pronunciation modeling
• Proposed approach: pronunciation modeling via multiple sequences of linguistic features
• A natural framework: dynamic Bayesian networks (DBNs)
• A feature-based pronunciation model using DBNs
• Proof-of-concept experiments
• Ongoing/future work
• Integration with SVM feature classifiers
The problem of pronunciation variation
• Conversation from the Switchboard speech database:
• “neither one of them”:
• “decided”:
• “never really”:
• “probably”:
• Noted as an obstacle for ASR (e.g., [McAllester et al. 1998])
The problem of pronunciation variation (2)
[Plot: number of pronunciations per word vs. word frequency, comparing read and casual speech]
• More acute in casual/conversational than in read speech. Surface pronunciations of “probably” observed in casual speech (with counts):
– p r aa b iy (2), p r ay (1), p r aw l uh (1), p r ah b iy (1), p r aa lg iy (1), p r aa b uw (1), p ow ih (1), p aa iy (1), p aa b uh b l iy (1), p aa ah iy (1)
Traditional solution: phone-based pronunciation modeling
• Transformation rules are typically of the form p1 → p2 / p3 __ p4 (where pi may be null)
– E.g. Ø → p / m __ {non-labial}
• Rules are derived from
– Linguistic knowledge (e.g. [Hazen et al. 2002])
– Data (e.g. [Riley & Ljolje 1996])
• Powerful, but:
– Sparse data issues
– Increased inter-word confusability
– Some pronunciation changes not well described
– Limited success in recognition experiments
• Example: the dictionary gives warmth → / w ao r m th /; the [p] insertion rule yields [ w ao r m p th ]
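A rewrite rule of this form can be sketched as a small function; the phone set below is a partial, illustrative stand-in (the slide does not enumerate the non-labial class):

```python
# Sketch of a context-dependent rewrite rule of the form p1 -> p2 / p3 __ p4,
# here the [p] insertion rule 0 -> p / m __ {non-labial}.
# NON_LABIAL is an illustrative, partial phone set (an assumption, not from the slides).
NON_LABIAL = {"th", "s", "t", "d", "k", "g"}

def apply_p_insertion(phones):
    """Insert [p] between [m] and a following non-labial phone."""
    out = []
    for i, ph in enumerate(phones):
        out.append(ph)
        nxt = phones[i + 1] if i + 1 < len(phones) else None
        if ph == "m" and nxt in NON_LABIAL:
            out.append("p")
    return out

print(apply_p_insertion("w ao r m th".split()))  # ['w', 'ao', 'r', 'm', 'p', 'th']
```

Applied to the dictionary baseform / w ao r m th /, this produces the surface form [ w ao r m p th ] shown on the slide.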
A feature-based approach
• Speech can alternatively be described using sub-phonetic features
LIP-OP, TT-OPEN, TT-LOC, TB-LOC, TB-OPEN, VELUM, VOICING
• (This feature set based on articulatory phonology [Browman & Goldstein 1990])
Feature-based pronunciation modeling
• instruments → [ih_n s ch em ih_n n s]
• warmth (dictionary form) surfaces as [ w ao r m p th ] when the lips and velum desynchronize:
  lip opening:  Nar  Mid  Mid  Clo  Mid
  velum:        Clo  Clo  Clo  Op   Clo
  voicing:      V    V    V    V    !V
• wants → [w aa_n t s] -- Phone deletion??
• several → [s eh r v ax l] -- Exchange of two phones???
• everybody → [eh r uw ay]
Related work
• Much work on classifying features:
– [King et al. 1998]
– [Kirchhoff 2002]
– [Chang, Greenberg, & Wester 2001]
– [Juneja & Espy-Wilson 2003]
– [Omar & Hasegawa-Johnson 2002]
– [Niyogi & Burges 2002]
• Less work on “non-phonetic” relationship between words and features
– [Deng et al. 1997], [Richardson & Bilmes 2000]: “fully-connected” state space via hidden Markov model
– [Kirchhoff 1996]: features independent, except for synchronization at syllable boundaries
– [Carson-Berndsen 1998]: bottom-up, constraint-based approach
• Goal: Develop a general feature-based pronunciation model
– Capable of using known independence assumptions
– Without overly strong assumptions
Approach: Main Ideas ([HLT/NAACL-2004])
• Begin with usual assumption: Each word has one or more underlying pronunciations, given by a dictionary (e.g. warmth → / w ao r m th /)
• Surface (actual) feature values can stray from underlying values via:
1) Substitution – modeled by confusion matrices P(s|u)
2) Asynchrony
– Assign an index (counter) to each feature, and allow index values to differ
– Apply constraints on the difference between the mean indices of feature subsets
• Natural to implement using graphical models, in particular dynamic Bayesian networks (DBNs)
• Example feature tracks with per-feature indices:
  index:        0    1    2    3    4
  lip opening:  Nar  Mid  Mid  Clo  Mid
  velum:        Off  Off  Off  On   Off
  voicing:      V    V    V    V    !V
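The substitution mechanism, a per-feature confusion matrix P(s|u) over surface and underlying values, can be sketched as a lookup table; the probabilities below are illustrative, not the trained model's values:

```python
# Sketch of the substitution model P(surface | underlying) as a confusion
# matrix for one feature (lip opening). All numbers are illustrative.
P_SUBST = {
    "CLO": {"CLO": 0.7, "CRI": 0.2, "NAR": 0.1, "MID": 0.0},
    "CRI": {"CLO": 0.0, "CRI": 0.7, "NAR": 0.2, "MID": 0.1},
    "NAR": {"CLO": 0.0, "CRI": 0.0, "NAR": 0.7, "MID": 0.3},
}

def surface_prob(surface, underlying):
    """P(S = surface | U = underlying) for this feature."""
    return P_SUBST[underlying].get(surface, 0.0)

def track_prob(surface_track, underlying_track):
    """Probability of a whole surface track given the underlying track,
    assuming frame-wise independent substitutions (a simplification)."""
    p = 1.0
    for s, u in zip(surface_track, underlying_track):
        p *= surface_prob(s, u)
    return p
```

Each row sums to one, so every underlying value has a proper distribution over the surface values it may be realized as.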
Aside: Dynamic Bayesian networks
• Bayesian network (BN): directed-graph representation of a distribution over a set of variables
– Graph node ↔ variable + its distribution given its parents
– Graph edge ↔ “dependency”
– (Toy example: variables such as speaking rate, # questions, lunchtime)
• Dynamic Bayesian network (DBN): BN with a repeating structure (state S and observation O replicated across frames i−1, i, …)
• Example: HMM, with joint distribution
  p(s_{0:L}, o_{0:L}) = p(s_0) p(o_0 | s_0) ∏_{i=1}^{L} p(s_i | s_{i−1}) p(o_i | s_i)
• Uniform algorithms for (among other things)
– Finding the most likely values of a subset of the variables, given the rest (analogous to Viterbi algorithm for HMMs)
– Learning model parameters via EM
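The HMM-as-DBN factorization, p(s_0) p(o_0|s_0) ∏ p(s_i|s_{i−1}) p(o_i|s_i), can be computed directly; the two-state model below is a toy with illustrative numbers:

```python
import math

# Toy two-state HMM; initial, transition, and emission tables are illustrative.
p_init = {"A": 0.6, "B": 0.4}
p_trans = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}
p_emit = {"A": {"x": 0.7, "y": 0.3}, "B": {"x": 0.1, "y": 0.9}}

def log_joint(states, obs):
    """log p(s_{0:L}, o_{0:L}) under the DBN factorization of an HMM."""
    lp = math.log(p_init[states[0]]) + math.log(p_emit[states[0]][obs[0]])
    for i in range(1, len(states)):
        lp += math.log(p_trans[states[i - 1]][states[i]])  # p(s_i | s_{i-1})
        lp += math.log(p_emit[states[i]][obs[i]])          # p(o_i | s_i)
    return lp
```

Working in log space avoids underflow for long sequences, which is also how inference engines such as GMTK accumulate these scores in practice.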
Approach: A DBN-based Model
• Example DBN using 3 features. Per frame t, the variables are: word_t; synchrony variables sync_{1;2},t and sync_{1,2;3},t; per-feature index counters ind1_t, ind2_t, ind3_t; underlying feature values U1_t, U2_t, U3_t (these encode the baseform pronunciations); and surface feature values S1_t, S2_t, S3_t.
• Example substitution confusion matrix P(S|U) for one feature:

           U=CLO  U=CRI  U=NAR  U=N-M  …
  S=CLO     .7     .2     .1     0     …
  S=CRI     0      .7     .2     .1    …
  S=NAR     0      0      .7     .2    …
  S=MID     0      0      .1     …     …

• Synchrony distribution (e.g. between features 1 and 2):
  Pr(sync_{1;2} = 1 | ind1, ind2) = 1 if ind1 = ind2; 0.5 if |ind1 − ind2| = 1; 0 otherwise
• (Simplified to show important properties! Implemented model has additional variables.)
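The synchrony distribution, 1 when the two index counters agree, 0.5 when they differ by one, and 0 otherwise, is a simple function of the index difference (the implemented model conditions on mean indices of feature subsets; this scalar version is the slide's simplification):

```python
# Sketch of the slide's synchrony variable distribution:
# Pr(sync = 1 | ind1, ind2) depends only on |ind1 - ind2|.
def p_sync(ind1, ind2):
    d = abs(ind1 - ind2)
    if d == 0:
        return 1.0   # features fully synchronized
    if d == 1:
        return 0.5   # one frame of asynchrony tolerated at half weight
    return 0.0       # larger lags disallowed
```

Because p_sync is 0 for |ind1 − ind2| > 1, inference simply prunes any joint index assignment where the two features drift more than one position apart.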
Approach: A DBN-based Model (2)
• “Unrolled” DBN: the per-frame structure (word_t, the sync variables, the index counters ind1_t … ind3_t, the underlying values U1_t … U3_t, and the surface values S1_t … S3_t) is replicated for frames t = 0, 1, …, T.
• Parameter learning via Expectation Maximization (EM)
• Training data
– Articulatory databases
– Detailed phonetic transcriptions
A proof-of-concept experiment
• Task: classify an isolated word from the Switchboard corpus, given a detailed phonetic transcription (from ICSI Berkeley, [Greenberg et al. 1996])
– Convert the transcription into surface feature vectors S_i, one per 10 ms
– For each word w in a 3k+ word vocabulary, compute P(w | S_1, …, S_N)
– Output w* = argmax_w P(w | S_1, …, S_N)
– Used GMTK [Bilmes & Zweig 2002] for inference and EM parameter training
– Note: the ICSI transcription is somewhere between phones and features (not ideal, but as good as we have)
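The classification step above, argmax over the vocabulary of the model score, can be sketched as a plain loop; `model_prob` is a hypothetical stand-in for the DBN inference that GMTK performs in the actual experiments:

```python
# Sketch of the isolated-word classification setup: score every vocabulary
# word under the pronunciation model and output the argmax.
def classify(feature_vectors, vocabulary, model_prob):
    """Return argmax_w of model_prob(w, S) for surface feature vectors S.

    model_prob is assumed to return a score proportional to P(w | S);
    in the experiments this role was played by DBN inference in GMTK.
    """
    best_word, best_score = None, float("-inf")
    for w in vocabulary:
        score = model_prob(w, feature_vectors)
        if score > best_score:
            best_word, best_score = w, score
    return best_word
```

For a 3k+ word vocabulary this is one DBN inference pass per word, which is feasible for isolated-word classification but motivates the later extension to proper decoding for multi-word sequences.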
Results (development set)
  Model                                  Error rate (%)   Failure rate (%)
  Baseforms only (1.7 prons/word)             63.6             61.2
  + phonological rules (4 prons/word)         50.3             47.9
  synchronous feature-based                   35.2             24.8
  asynchronous feature-based                  29.7             16.4
  asynch. + segmental constraint              32.7             19.4
  asynch. + segmental constraint + EM         27.8             19.4
• When did asynchrony matter?
– Vowel nasalization & rounding
– Nasal + stop → nasal
– Some schwa deletions
– instruments → [ih_n s ch em ih_n n s]
– everybody → [eh r uw ay]
• What didn’t work?
– Some deletions ([ax], [t])
– Vowel retroflexion
– Alveolar + [y] → palatal
– (Cross-word effects)
– (Speech/transcription errors…)
Sample Viterbi path
everybody → [ eh r uw ay ]
Ongoing/future work
• Trainable synchrony constraints ([ICSLP 2004?])
• Context-dependent distributions for underlying (Ui) and surface (Si) feature values
• Extension to more complex tasks (multi-word sequences, larger vocabularies)
• Implementation in a complete recognizer (cf. [Eurospeech 2003])
• Articulatory databases for parameter learning/testing
• Can we use such a model to learn something about speech?
Integration with feature classifier outputs
• Use (hard) classifier decisions as observations O_i,t attached to the surface variables S_i,t (the rest of the model is unchanged):
  P(O_i,t = 1 | S_i,t = s) = P_SVM(S_i,t = s)
• Landmark-based classifier outputs → DBN S_i’s:
– Convert landmark-based features to one feature vector/frame
– (Possibly) convert from SVM feature set to DBN feature set
• Convert classifier scores to posterior probabilities and use as “soft evidence” for S_i
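The score-to-posterior conversion can be sketched with a Platt-style logistic map from SVM margins to probabilities, normalized into soft-evidence weights; the `A`, `B` parameters (which would be fit on held-out data) and the function names are illustrative assumptions, not the workshop's actual calibration:

```python
import math

# Sketch: map SVM margins to posteriors (Platt-style sigmoid), then normalize
# across a feature's values to get virtual-evidence weights for S_i.
def platt_posterior(margin, A=-1.5, B=0.0):
    """Logistic calibration of an SVM margin; A, B are illustrative values."""
    return 1.0 / (1.0 + math.exp(A * margin + B))

def soft_evidence(margins_by_value):
    """Map {feature_value: svm_margin} to normalized soft-evidence weights."""
    post = {v: platt_posterior(m) for v, m in margins_by_value.items()}
    z = sum(post.values())
    return {v: p / z for v, p in post.items()}
```

Hard decisions are the special case where the winning value gets weight 1; the soft version lets the DBN weigh uncertain classifier outputs against the pronunciation model's own preferences.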
Acknowledgment
• Jeff Bilmes, U. Washington
Thank you!
GRAVEYARD
Background: Continuous Speech Recognition
• Given a waveform with acoustic features A, find the most likely word string W* = {w_1, w_2, …, w_M}:
  W* = argmax_W P(W | A) = argmax_W Σ_U P(W, U | A)
  where U ranges over the possible pronunciations (typically phone strings).
• Assuming U* is much more likely than all other U, and applying Bayes’ rule:
  {W*, U*} = argmax_{W,U} P(A | W, U) P(U | W) P(W)
  where the three factors are the acoustic model, pronunciation model, and language model, respectively.
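This factored search, maximizing the product of acoustic, pronunciation, and language model scores over (W, U) pairs, can be sketched in log space; the three scoring functions are hypothetical hooks, not actual model implementations:

```python
# Sketch of the factored hypothesis search: score each (word string W,
# pronunciation U) pair by log P(A|W,U) + log P(U|W) + log P(W).
# acoustic_lp, pron_lp, lm_lp are hypothetical log-probability hooks.
def best_hypothesis(candidates, acoustic_lp, pron_lp, lm_lp):
    """candidates: iterable of (W, U) pairs; returns the max-scoring pair."""
    return max(
        candidates,
        key=lambda wu: acoustic_lp(*wu) + pron_lp(*wu) + lm_lp(wu[0]),
    )
```

Real decoders search this space with dynamic programming rather than enumeration, but the objective being maximized is exactly this sum of log model scores.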
Example: “warmth” → “warmpth”
• Phone-based view:
  Brain: Give me a []!
  Lips, tongue, velum, glottis: Right on it, sir!
• (Articulatory) feature-based view:
  Brain: Give me a []!
  Tongue: Umm… yeah, OK.
  Velum, glottis: Right on it, sir!
  Lips: Huh?
Graphical models for hidden feature modeling
• Most ASR approaches use hidden Markov models (HMMs) and/or finite-state transducers (FSTs)
– Efficient and powerful, but limited
– Only one state variable per time frame
• Graphical models (GMs) allow for
– Arbitrary numbers of variables and dependencies
– Standard algorithms over large classes of models
→ Straightforward mapping between feature-based models and GMs
→ Potentially large reduction in number of parameters
• GMs for ASR:
– Zweig (e.g. PhD thesis, 1998), Bilmes (e.g. PhD thesis, 1999), Stephenson (e.g. Eurospeech 2001)
– Feature-based ASR with GMs suggested by Zweig, but not previously investigated
Background
• Brief intro to ASR
– Words are written in terms of sub-word units; acoustic models compute the probability of acoustic (spectral) features given sub-word units, or vice versa
• Pronunciation model: mapping between words and strings of sub-word units
Possible solution?
• Allow every pronunciation in some large database. But:
– Unreliable probability estimation due to sparse data
– Unseen words
– Increased confusability
Phone-based pronunciation modeling (2)
• Generalize across words
• But:
– Data still sparse
– Still increased confusability
– Some pronunciation changes not well described by phonetic rules
– Limited gains in speech recognition experiments
Approach
• Begin with usual assumption that each word has one or more “target” pronunciations, given by the dictionary
• Model the evolution of multiple feature streams, allowing for:
– Feature changes on a frame-by-frame basis
– Feature desynchronization
– Control of asynchrony: more “synchronous” feature configurations are preferable
• Dynamic Bayesian networks (DBNs): Efficient parameterization and computation when state can be factored