MIT Computer Science and Artificial Intelligence Laboratory
SPOKEN LANGUAGE SYSTEMS

Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Karen Livescu
JHU Workshop Planning Meeting, April 16, 2004
Joint work with Jim Glass


TRANSCRIPT

Page 1: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

MIT Computer Science and Artificial Intelligence Laboratory

SPOKEN LANGUAGE SYSTEMS

Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Karen Livescu

JHU Workshop Planning Meeting

April 16, 2004

Joint work with Jim Glass

Page 2: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Preview

• The problem of pronunciation variation for automatic speech recognition (ASR)

• Traditional methods: phone-based pronunciation modeling

• Proposed approach: pronunciation modeling via multiple sequences of linguistic features

• A natural framework: dynamic Bayesian networks (DBNs)

• A feature-based pronunciation model using DBNs

• Proof-of-concept experiments

• Ongoing/future work

• Integration with SVM feature classifiers

Page 3: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

The problem of pronunciation variation

• Conversation from the Switchboard speech database:

• “neither one of them”:

• “decided”:

• “never really”:

• “probably”:

• Noted as an obstacle for ASR (e.g., [McAllester et al. 1998])

Page 4: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

The problem of pronunciation variation (2)

• More acute in casual/conversational than in read speech:

[Figure: number of pronunciations per word vs. word frequency (0–250), for read vs. casual speech; the casual curve rises much faster than the read curve.]

• Observed pronunciations of "probably" (with counts):

  p r aa b iy        2
  p r ay             1
  p r aw l uh        1
  p r ah b iy        1
  p r aa lg iy       1
  p r aa b uw        1
  p ow ih            1
  p aa iy            1
  p aa b uh b l iy   1
  p aa ah iy         1

Page 5: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Preview

• The problem of pronunciation variation for automatic speech recognition (ASR)

• Traditional methods: phone-based pronunciation modeling

• Proposed approach: pronunciation modeling via multiple sequences of linguistic features

• A natural framework: dynamic Bayesian networks (DBNs)

• A feature-based pronunciation model using DBNs

• Proof-of-concept experiments

• Ongoing/future work

• Integration with SVM feature classifiers

Page 6: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Traditional solution: phone-based pronunciation modeling

• Transformation rules are typically of the form p1 → p2 / p3 __ p4 (where pi may be null)
  – E.g. Ø → p / m __ {non-labial}
• Rules are derived from
  – Linguistic knowledge (e.g. [Hazen et al. 2002])
  – Data (e.g. [Riley & Ljolje 1996])
• Powerful, but:
  – Sparse data issues
  – Increased inter-word confusability
  – Some pronunciation changes not well described
  – Limited success in recognition experiments
• Example of the [p] insertion rule: dictionary /w ao r m th/ → surface [w ao r m p th] ("warmth")
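As a concrete sketch of how such a rewrite rule operates, the following Python (illustrative, not from the slides) applies the Ø → p / m __ {non-labial} insertion rule to a phone string; the non-labial set here is a hypothetical subset chosen for the example.

```python
# Illustrative sketch: applying a phone-rewrite rule of the form
# p1 -> p2 / p3 __ p4. Here: 0 -> p / m __ {non-labial}, i.e. insert [p]
# between [m] and a following non-labial phone.
NON_LABIAL = {"th", "t", "d", "s", "z", "k", "g"}  # hypothetical subset

def apply_p_insertion(phones):
    """Insert 'p' wherever 'm' is followed by a non-labial phone."""
    out = []
    for i, ph in enumerate(phones):
        out.append(ph)
        nxt = phones[i + 1] if i + 1 < len(phones) else None
        if ph == "m" and nxt in NON_LABIAL:
            out.append("p")
    return out

# warmth: dictionary /w ao r m th/ -> surface [w ao r m p th]
print(apply_p_insertion(["w", "ao", "r", "m", "th"]))
# ['w', 'ao', 'r', 'm', 'p', 'th']
```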

Page 7: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Preview

• The problem of pronunciation variation for automatic speech recognition (ASR)

• Traditional methods: phone-based pronunciation modeling

• Proposed approach: pronunciation modeling via multiple sequences of linguistic features

• A natural framework: dynamic Bayesian networks (DBNs)

• A feature-based pronunciation model using DBNs

• Proof-of-concept experiments

• Ongoing/future work

• Integration with SVM feature classifiers

Page 8: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

A feature-based approach

• Speech can alternatively be described using sub-phonetic features:
  LIP-OP, TT-OPEN, TT-LOC, TB-LOC, TB-OPEN, VELUM, VOICING
• (This feature set based on articulatory phonology [Browman & Goldstein 1990])
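To make the representation concrete, here is a small sketch of the feature set as a Python mapping, with [m] as an example phone. The value inventories are illustrative assumptions, not the model's exact sets.

```python
# Illustrative articulatory feature inventory (value sets are assumptions).
FEATURES = {
    "LIP-OP":  ["CLO", "CRI", "NAR", "MID", "WID"],  # lip opening degree
    "TT-OPEN": ["CLO", "CRI", "NAR", "MID", "WID"],  # tongue-tip opening
    "TT-LOC":  ["DEN", "ALV", "P-A"],                # tongue-tip location
    "TB-OPEN": ["CLO", "CRI", "NAR", "MID", "WID"],  # tongue-body opening
    "TB-LOC":  ["PAL", "VEL", "UVU", "PHA"],         # tongue-body location
    "VELUM":   ["CLO", "OP"],                        # closed (oral) / open (nasal)
    "VOICING": ["V", "!V"],                          # voiced / voiceless
}

# A phone is then a full assignment of feature values, e.g. [m]:
# lips closed, velum open (nasal), voiced. (Illustrative values.)
M = {"LIP-OP": "CLO", "TT-OPEN": "WID", "TT-LOC": "ALV",
     "TB-OPEN": "MID", "TB-LOC": "UVU", "VELUM": "OP", "VOICING": "V"}

assert set(M) == set(FEATURES)
assert all(M[f] in FEATURES[f] for f in FEATURES)
```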

Page 9: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Feature-based pronunciation modeling

• instruments → [ih_n s ch em ih_n n s]
• warmth: dictionary [w ao r m th] → surface [w ao r m p th] when lips & velum desynchronize

[Figure: feature streams for "warmth", showing trajectories of lip opening, velum, and voicing; the velum opens before the lips close, yielding the inserted [p].]

• wants → [w aa_n t s] -- Phone deletion??
• several → [s eh r v ax l] -- Exchange of two phones???
• everybody → [eh r uw ay]

Page 10: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

MIT CSAIL

SLSRelated work

• Much work on classifying features:– [King et al. 1998]

– [Kirchhoff 2002]

– [Chang, Greenberg, & Wester 2001]

– [Juneja & Espy-Wilson 2003]

– [Omar & Hasegawa-Johnson 2002]

– [Niyogi & Burges 2002]

• Less work on “non-phonetic” relationship between words and features– [Deng et al. 1997], [Richardson & Bilmes 2000]: “fully-connected” state

space via hidden Markov model

– [Kirchhoff 1996]: features independent, except for synchronization at syllable boundaries

– [Carson-Berndsen 1998]: bottom-up, constraint-based approach

• Goal: Develop a general feature-based pronunciation model– Capable of using known independence assumptions

– Without overly strong assumptions

Page 11: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Approach: Main Ideas ([HLT/NAACL-2004])

• Begin with usual assumption: Each word has one or more underlying pronunciations, given by a dictionary (e.g. "warmth")
• Surface (actual) feature values can stray from underlying values via:
  1) Substitution – modeled by confusion matrices P(s|u)
  2) Asynchrony
     – Assign an index (counter) to each feature, and allow index values to differ
     – Apply constraints on the difference between the mean indices of feature subsets
• Natural to implement using graphical models, in particular dynamic Bayesian networks (DBNs)

[Figure: feature streams for "warmth" with a per-frame index 0–4: lip opening (Nar, Mid, Mid, Clo, Mid, …), velum (Off, Off, Off, On, Off, …), voicing (V, V, V, V, !V, …).]

Page 12: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Aside: Dynamic Bayesian networks

• Bayesian network (BN): Directed-graph representation of a distribution over a set of variables
  – Graph node ↔ variable + its distribution given its parents
  – Graph edge ↔ "dependency"
• Dynamic Bayesian network (DBN): BN with a repeating structure
• Example: HMM, with hidden state s_i and observation o_i in frame i; the repeating factors are p(s_i | s_{i-1}) and p(o_i | s_i), so that

  p(o_{0:L}, s_{0:L}) = p(s_0) p(o_0 | s_0) ∏_{i=1}^{L} p(s_i | s_{i-1}) p(o_i | s_i)

• Uniform algorithms for (among other things)
  – Finding the most likely values of a subset of the variables, given the rest (analogous to the Viterbi algorithm for HMMs)
  – Learning model parameters via EM

Page 13: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Preview

• The problem of pronunciation variation for automatic speech recognition (ASR)

• Traditional methods: phone-based pronunciation modeling

• Proposed approach: pronunciation modeling via multiple sequences of linguistic features

• A natural framework: dynamic Bayesian networks (DBNs)

• A feature-based pronunciation model using DBNs

• Proof-of-concept experiments

• Ongoing/future work

• Integration with SVM feature classifiers

Page 14: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Approach: A DBN-based Model

• Example DBN using 3 features. Per-frame variables: word_t; index variables ind_t^1, ind_t^2, ind_t^3, which (with word_t) encode the baseform pronunciations; underlying feature values U_t^1, U_t^2, U_t^3; surface feature values S_t^1, S_t^2, S_t^3; and synchrony variables sync_t^{1;2} and sync_t^{1,2;3}
• Substitution is modeled by per-feature confusion matrices P(s|u), e.g. for lip opening (rows = underlying u, columns = surface s; partial, as readable from the slide):

            CLO   CRI   NAR   N-M   …
  u = CLO   .7    .2    .1    0     …
  u = CRI   0     .7    .2    .1    …
  u = NAR   0     0     .7    .2    …
  u = MID   0     0     .1    …     …

• Asynchrony is controlled via the sync variables; as reconstructed from the slide:

  Pr(sync_t^{1;2} = 1 | ind^1, ind^2) = 1 if ind^1 = ind^2; 0.5 if |ind^1 − ind^2| = 1; 0 otherwise

• (Simplified to show important properties! Implemented model has additional variables.)
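The two per-frame factors above can be sketched in a few lines of Python. The confusion row is illustrative (shaped like the slide's partial P(s|u) table), and p_sync follows the synchrony rule as reconstructed from the slide.

```python
import random

# Sketch of the model's two per-frame factors: a confusion-matrix row P(s|u)
# (illustrative values, mass on the underlying value and its neighbors) and
# the synchrony-variable distribution given two feature indices.
P_S_GIVEN_U_CLO = {"CLO": 0.7, "CRI": 0.2, "NAR": 0.1, "MID": 0.0}

def p_sync(ind1, ind2):
    """Pr(sync = 1 | ind1, ind2): 1 if equal, 0.5 if off by one, else 0."""
    d = abs(ind1 - ind2)
    return 1.0 if d == 0 else 0.5 if d == 1 else 0.0

def sample_surface(conf_row, rng=random):
    """Draw a surface value s ~ P(s | u) from one confusion-matrix row."""
    r, acc = rng.random(), 0.0
    for value, p in conf_row.items():
        acc += p
        if r < acc:
            return value
    return value  # guard against floating-point round-off

print(p_sync(2, 2), p_sync(2, 3), p_sync(0, 3))  # 1.0 0.5 0.0
```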

Page 15: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Approach: A DBN-based Model (2)

• "Unrolled" DBN: the per-frame structure (word_t, sync_t^{1;2}, sync_t^{1,2;3}, ind_t^1..3, U_t^1..3, S_t^1..3) is replicated for frames t = 0, 1, …, T

[Figure: the unrolled network, with the per-frame variables replicated across frames 0, 1, …, T.]

• Parameter learning via Expectation Maximization (EM)
• Training data
  – Articulatory databases
  – Detailed phonetic transcriptions

Page 16: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Preview

• The problem of pronunciation variation for automatic speech recognition (ASR)

• Traditional methods: phone-based pronunciation modeling

• Proposed approach: pronunciation modeling via multiple sequences of linguistic features

• A natural framework: dynamic Bayesian networks (DBNs)

• A feature-based pronunciation model using DBNs

• Proof-of-concept experiments

• Ongoing/future work

• Integration with SVM feature classifiers

Page 17: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

A proof-of-concept experiment

• Task: classify an isolated word from the Switchboard corpus, given a detailed phonetic transcription (from ICSI Berkeley, [Greenberg et al. 1996])
  – Convert the transcription into feature vectors S_i, one per 10 ms
  – For each word w in a 3k+ word vocabulary, compute P(w | S_i)
  – Output w* = argmax_w P(w | S_i)
  – Used GMTK [Bilmes & Zweig 2002] for inference and EM parameter training
  – Note: the ICSI transcription is somewhere between phones and features; not ideal, but as good as we have
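The classification loop itself is simple; the following sketch makes it concrete. Here score(w, frames) is a hypothetical stand-in for the DBN probability that GMTK inference would compute, and the toy scoring function below is invented purely so the example runs.

```python
# Sketch of the isolated-word classification loop. `score` is a hypothetical
# stand-in for the DBN probability of the feature frames given the word.
def classify(frames, vocabulary, score):
    """Return w* = argmax over the vocabulary of score(w, frames)."""
    return max(vocabulary, key=lambda w: score(w, frames))

# Toy usage with an invented scoring function (favors words whose spelling
# length matches the number of frames; real scores come from DBN inference).
vocab = ["probably", "problem", "probe"]
toy_score = lambda w, frames: -abs(len(w) - len(frames))
print(classify(["f1", "f2", "f3", "f4", "f5"], vocab, toy_score))  # probe
```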

Page 18: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Results (development set)

  Model                                   Error rate (%)   Failure rate (%)
  Baseforms only (1.7 prons/word)              63.6             61.2
  + phonological rules (4 prons/word)          50.3             47.9
  Synchronous feature-based                    35.2             24.8
  Asynchronous feature-based                   29.7             16.4
  Asynch. + segmental constraint               32.7             19.4
  Asynch. + segmental constraint + EM          27.8             19.4

• When did asynchrony matter?
  – Vowel nasalization & rounding
  – Nasal + stop → nasal
  – Some schwa deletions
  – instruments → [ih_n s ch em ih_n n s]
  – everybody → [eh r uw ay]
• What didn't work?
  – Some deletions ([ax], [t])
  – Vowel retroflexion
  – Alveolar + [y] → palatal
  – (Cross-word effects)
  – (Speech/transcription errors…)

Page 19: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Sample Viterbi path

everybody [ eh r uw ay ]

Page 20: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Ongoing/future work

• Trainable synchrony constraints ([ICSLP 2004?])

• Context-dependent distributions for underlying (Ui) and surface (Si) feature values

• Extension to more complex tasks (multi-word sequences, larger vocabularies)

• Implementation in a complete recognizer (cf. [Eurospeech 2003])

• Articulatory databases for parameter learning/testing

• Can we use such a model to learn something about speech?

Page 21: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Integration with feature classifier outputs

• Use (hard) classifier decisions as observations O_t^i attached to the surface variables S_t^i (the rest of the model is unchanged)

[Equation partially garbled in the source; it relates P(O_t^i = 1 | S_t^i = s) to the SVM output P_SVM and P(S_t^i = s).]

• Landmark-based classifier outputs → the DBN's S_i's:
  – Convert landmark-based features to one feature vector per frame
  – (Possibly) convert from the SVM feature set to the DBN feature set
• Alternatively, convert classifier scores to posterior probabilities and use them as "soft evidence" for S_i
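The "soft evidence" option can be sketched as follows: map raw SVM margins to posteriors with a Platt-style sigmoid, then normalize over feature values. The sigmoid parameters A and B would in practice be fit on held-out data; the values here are placeholders, and the whole block is an illustration rather than the slides' actual pipeline.

```python
import math

# Sketch: SVM margins -> posterior probabilities -> soft evidence on S_t^i.
def platt_posterior(margin, A=-1.5, B=0.0):
    """Platt-style sigmoid: P(class = 1 | margin) = 1 / (1 + exp(A*margin + B))."""
    return 1.0 / (1.0 + math.exp(A * margin + B))

def soft_evidence(margins):
    """Map per-value SVM margins to a normalized weight vector over values."""
    post = {v: platt_posterior(m) for v, m in margins.items()}
    z = sum(post.values())
    return {v: p / z for v, p in post.items()}

# Toy margins for one lip-opening frame (invented numbers):
weights = soft_evidence({"CLO": 2.0, "NAR": -0.5, "MID": -2.0})
print(max(weights, key=weights.get))  # CLO
```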

Page 22: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Acknowledgment

• Jeff Bilmes, U. Washington

Page 23: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks


Thank you!

Page 24: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks


GRAVEYARD

Page 25: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Background: Continuous Speech Recognition

• Given a waveform with acoustic features A, find the most likely word string W* = {w_1, w_2, …, w_M}:

  W* = argmax_W P(W | A)

  where the possible pronunciations U (typically phone strings) are summed out.

• Assuming U* is much more likely than all other U, and applying Bayes' Rule:

  {W*, U*} = argmax_{W,U} P(W, U | A) = argmax_{W,U} P(A | W, U) P(U | W) P(W)

  where the three factors are the acoustic model, pronunciation model, and language model, respectively.

Page 26: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Example: "warmth" → "warmpth"

• Phone-based view:
  Brain: Give me a [ ]!
• (Articulatory) feature-based view:
  Brain: Give me a [ ]!
  Tongue: Umm… yeah, OK.
  Lips: Huh?
  Velum, glottis: Right on it, sir!
  Lips, tongue, velum, glottis: Right on it, sir!

Page 27: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Graphical models for hidden feature modeling

• Most ASR approaches use hidden Markov models (HMMs) and/or finite-state transducers (FSTs)
  – Efficient and powerful, but limited
  – Only one state variable per time frame
• Graphical models (GMs) allow for
  – Arbitrary numbers of variables and dependencies
  – Standard algorithms over large classes of models
  – Straightforward mapping between feature-based models and GMs
  – Potentially large reduction in number of parameters
• GMs for ASR:
  – Zweig (e.g. PhD thesis, 1998), Bilmes (e.g. PhD thesis, 1999), Stephenson (e.g. Eurospeech 2001)
  – Feature-based ASR with GMs suggested by Zweig, but not previously investigated

Page 28: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Background

• Brief intro to ASR
  – Words are written in terms of sub-word units; acoustic models compute the probability of acoustic (spectral) features given sub-word units, or vice versa
• Pronunciation model: mapping between words and strings of sub-word units

Page 29: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Possible solution?

• Allow every pronunciation in some large database. But:
  – Unreliable probability estimation due to sparse data
  – Unseen words
  – Increased confusability

Page 30: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Phone-based pronunciation modeling (2)

• Generalize across words
• But:
  – Data still sparse
  – Still increased confusability
  – Some pronunciation changes not well described by phonetic rules
  – Limited gains in speech recognition experiments

Page 31: Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Approach

• Begin with usual assumption that each word has one or more "target" pronunciations, given by the dictionary
• Model the evolution of multiple feature streams, allowing for:
  – Feature changes on a frame-by-frame basis
  – Feature desynchronization
  – Control of asynchrony: more "synchronous" feature configurations are preferable
• Dynamic Bayesian networks (DBNs): Efficient parameterization and computation when the state can be factored