MIT Computer Science and Artificial Intelligence Laboratory
SPOKEN LANGUAGE SYSTEMS
Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks
Karen Livescu
JHU Workshop Planning Meeting
April 16, 2004
Joint work with Jim Glass
MIT CSAIL
SLS

Preview
• The problem of pronunciation variation for automatic speech recognition (ASR)
• Traditional methods: phone-based pronunciation modeling
• Proposed approach: pronunciation modeling via multiple sequences of linguistic features
• A natural framework: dynamic Bayesian networks (DBNs)
• A feature-based pronunciation model using DBNs
• Proof-of-concept experiments
• Ongoing/future work
• Integration with SVM feature classifiers
The problem of pronunciation variation
• Conversation from the Switchboard speech database:
• “neither one of them”:
• “decided”:
• “never really”:
• “probably”:
• Noted as an obstacle for ASR (e.g., [McAllester et al. 1998])
The problem of pronunciation variation (2)
[Plot: number of pronunciations per word vs. word frequency, comparing read and casual speech]
• More acute in casual/conversational than in read speech. Surface pronunciations of “probably” observed in casual speech (with counts):
– p r aa b iy (2), p r ay (1), p r aw l uh (1), p r ah b iy (1), p r aa lg iy (1), p r aa b uw (1), p ow ih (1), p aa iy (1), p aa b uh b l iy (1), p aa ah iy (1)
Traditional solution: phone-based pronunciation modeling
• Transformation rules are typically of the form p1 → p2 / p3 __ p4 (where pi may be null)
– E.g. Ø → p / m __ {non-labial}
• Rules are derived from
– Linguistic knowledge (e.g. [Hazen et al. 2002])
– Data (e.g. [Riley & Ljolje 1996])
• Powerful, but:
– Sparse data issues
– Increased inter-word confusability
– Some pronunciation changes not well described
– Limited success in recognition experiments
• Example: the dictionary gives warmth → / w ao r m th /; the [p] insertion rule yields [ w ao r m p th ]
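A rewrite rule of this form can be sketched as a small function; the phone set below is a partial, illustrative stand-in (the slide does not enumerate the non-labial class):

```python
# Sketch of a context-dependent rewrite rule of the form p1 -> p2 / p3 __ p4,
# here the [p] insertion rule 0 -> p / m __ {non-labial}.
# NON_LABIAL is an illustrative, partial phone set (an assumption, not from the slides).
NON_LABIAL = {"th", "s", "t", "d", "k", "g"}

def apply_p_insertion(phones):
    """Insert [p] between [m] and a following non-labial phone."""
    out = []
    for i, ph in enumerate(phones):
        out.append(ph)
        nxt = phones[i + 1] if i + 1 < len(phones) else None
        if ph == "m" and nxt in NON_LABIAL:
            out.append("p")
    return out

print(apply_p_insertion("w ao r m th".split()))  # ['w', 'ao', 'r', 'm', 'p', 'th']
```

Applied to the dictionary baseform / w ao r m th /, this produces the surface form [ w ao r m p th ] shown on the slide.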
A feature-based approach
• Speech can alternatively be described using sub-phonetic features
LIP-OP, TT-OPEN, TT-LOC, TB-LOC, TB-OPEN, VELUM, VOICING
• (This feature set based on articulatory phonology [Browman & Goldstein 1990])
Feature-based pronunciation modeling
• instruments → [ih_n s ch em ih_n n s]
• warmth (dictionary form) surfaces as [ w ao r m p th ] when the lips and velum desynchronize:
  lip opening:  Nar  Mid  Mid  Clo  Mid
  velum:        Clo  Clo  Clo  Op   Clo
  voicing:      V    V    V    V    !V
• wants → [w aa_n t s] -- Phone deletion??
• several → [s eh r v ax l] -- Exchange of two phones???
• everybody → [eh r uw ay]
Related work
• Much work on classifying features:
– [King et al. 1998]
– [Kirchhoff 2002]
– [Chang, Greenberg, & Wester 2001]
– [Juneja & Espy-Wilson 2003]
– [Omar & Hasegawa-Johnson 2002]
– [Niyogi & Burges 2002]
• Less work on “non-phonetic” relationship between words and features
– [Deng et al. 1997], [Richardson & Bilmes 2000]: “fully-connected” state space via hidden Markov model
– [Kirchhoff 1996]: features independent, except for synchronization at syllable boundaries
– [Carson-Berndsen 1998]: bottom-up, constraint-based approach
• Goal: Develop a general feature-based pronunciation model
– Capable of using known independence assumptions
– Without overly strong assumptions
Approach: Main Ideas ([HLT/NAACL-2004])
• Begin with usual assumption: Each word has one or more underlying pronunciations, given by a dictionary (e.g. warmth → / w ao r m th /)
• Surface (actual) feature values can stray from underlying values via:
1) Substitution – modeled by confusion matrices P(s|u)
2) Asynchrony
– Assign an index (counter) to each feature, and allow index values to differ
– Apply constraints on the difference between the mean indices of feature subsets
• Natural to implement using graphical models, in particular dynamic Bayesian networks (DBNs)
• Example feature tracks with per-feature indices:
  index:        0    1    2    3    4
  lip opening:  Nar  Mid  Mid  Clo  Mid
  velum:        Off  Off  Off  On   Off
  voicing:      V    V    V    V    !V
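The substitution mechanism, a per-feature confusion matrix P(s|u) over surface and underlying values, can be sketched as a lookup table; the probabilities below are illustrative, not the trained model's values:

```python
# Sketch of the substitution model P(surface | underlying) as a confusion
# matrix for one feature (lip opening). All numbers are illustrative.
P_SUBST = {
    "CLO": {"CLO": 0.7, "CRI": 0.2, "NAR": 0.1, "MID": 0.0},
    "CRI": {"CLO": 0.0, "CRI": 0.7, "NAR": 0.2, "MID": 0.1},
    "NAR": {"CLO": 0.0, "CRI": 0.0, "NAR": 0.7, "MID": 0.3},
}

def surface_prob(surface, underlying):
    """P(S = surface | U = underlying) for this feature."""
    return P_SUBST[underlying].get(surface, 0.0)

def track_prob(surface_track, underlying_track):
    """Probability of a whole surface track given the underlying track,
    assuming frame-wise independent substitutions (a simplification)."""
    p = 1.0
    for s, u in zip(surface_track, underlying_track):
        p *= surface_prob(s, u)
    return p
```

Each row sums to one, so every underlying value has a proper distribution over the surface values it may be realized as.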
Aside: Dynamic Bayesian networks
• Bayesian network (BN): directed-graph representation of a distribution over a set of variables
– Graph node ↔ variable + its distribution given its parents
– Graph edge ↔ “dependency”
– (Toy example: variables such as speaking rate, # questions, lunchtime)
• Dynamic Bayesian network (DBN): BN with a repeating structure (state S and observation O replicated across frames i−1, i, …)
• Example: HMM, with joint distribution
  p(s_{0:L}, o_{0:L}) = p(s_0) p(o_0 | s_0) ∏_{i=1}^{L} p(s_i | s_{i−1}) p(o_i | s_i)
• Uniform algorithms for (among other things)
– Finding the most likely values of a subset of the variables, given the rest (analogous to Viterbi algorithm for HMMs)
– Learning model parameters via EM
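The HMM-as-DBN factorization, p(s_0) p(o_0|s_0) ∏ p(s_i|s_{i−1}) p(o_i|s_i), can be computed directly; the two-state model below is a toy with illustrative numbers:

```python
import math

# Toy two-state HMM; initial, transition, and emission tables are illustrative.
p_init = {"A": 0.6, "B": 0.4}
p_trans = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}
p_emit = {"A": {"x": 0.7, "y": 0.3}, "B": {"x": 0.1, "y": 0.9}}

def log_joint(states, obs):
    """log p(s_{0:L}, o_{0:L}) under the DBN factorization of an HMM."""
    lp = math.log(p_init[states[0]]) + math.log(p_emit[states[0]][obs[0]])
    for i in range(1, len(states)):
        lp += math.log(p_trans[states[i - 1]][states[i]])  # p(s_i | s_{i-1})
        lp += math.log(p_emit[states[i]][obs[i]])          # p(o_i | s_i)
    return lp
```

Working in log space avoids underflow for long sequences, which is also how inference engines such as GMTK accumulate these scores in practice.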
Approach: A DBN-based Model
• Example DBN using 3 features. Per frame t, the variables are: word_t; synchrony variables sync_{1;2},t and sync_{1,2;3},t; per-feature index counters ind1_t, ind2_t, ind3_t; underlying feature values U1_t, U2_t, U3_t (these encode the baseform pronunciations); and surface feature values S1_t, S2_t, S3_t.
• Example substitution confusion matrix P(S|U) for one feature:

           U=CLO  U=CRI  U=NAR  U=N-M  …
  S=CLO     .7     .2     .1     0     …
  S=CRI     0      .7     .2     .1    …
  S=NAR     0      0      .7     .2    …
  S=MID     0      0      .1     …     …

• Synchrony distribution (e.g. between features 1 and 2):
  Pr(sync_{1;2} = 1 | ind1, ind2) = 1 if ind1 = ind2; 0.5 if |ind1 − ind2| = 1; 0 otherwise
• (Simplified to show important properties! Implemented model has additional variables.)
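The synchrony distribution, 1 when the two index counters agree, 0.5 when they differ by one, and 0 otherwise, is a simple function of the index difference (the implemented model conditions on mean indices of feature subsets; this scalar version is the slide's simplification):

```python
# Sketch of the slide's synchrony variable distribution:
# Pr(sync = 1 | ind1, ind2) depends only on |ind1 - ind2|.
def p_sync(ind1, ind2):
    d = abs(ind1 - ind2)
    if d == 0:
        return 1.0   # features fully synchronized
    if d == 1:
        return 0.5   # one frame of asynchrony tolerated at half weight
    return 0.0       # larger lags disallowed
```

Because p_sync is 0 for |ind1 − ind2| > 1, inference simply prunes any joint index assignment where the two features drift more than one position apart.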
Approach: A DBN-based Model (2)
• “Unrolled” DBN: the per-frame structure (word_t, the sync variables, the index counters ind1_t … ind3_t, the underlying values U1_t … U3_t, and the surface values S1_t … S3_t) is replicated for frames t = 0, 1, …, T.
• Parameter learning via Expectation Maximization (EM)
• Training data
– Articulatory databases
– Detailed phonetic transcriptions
A proof-of-concept experiment
• Task: classify an isolated word from the Switchboard corpus, given a detailed phonetic transcription (from ICSI Berkeley, [Greenberg et al. 1996])
– Convert the transcription into surface feature vectors S_i, one per 10 ms
– For each word w in a 3k+ word vocabulary, compute P(w | S_1, …, S_N)
– Output w* = argmax_w P(w | S_1, …, S_N)
– Used GMTK [Bilmes & Zweig 2002] for inference and EM parameter training
– Note: the ICSI transcription is somewhere between phones and features (not ideal, but as good as we have)
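The classification step above, argmax over the vocabulary of the model score, can be sketched as a plain loop; `model_prob` is a hypothetical stand-in for the DBN inference that GMTK performs in the actual experiments:

```python
# Sketch of the isolated-word classification setup: score every vocabulary
# word under the pronunciation model and output the argmax.
def classify(feature_vectors, vocabulary, model_prob):
    """Return argmax_w of model_prob(w, S) for surface feature vectors S.

    model_prob is assumed to return a score proportional to P(w | S);
    in the experiments this role was played by DBN inference in GMTK.
    """
    best_word, best_score = None, float("-inf")
    for w in vocabulary:
        score = model_prob(w, feature_vectors)
        if score > best_score:
            best_word, best_score = w, score
    return best_word
```

For a 3k+ word vocabulary this is one DBN inference pass per word, which is feasible for isolated-word classification but motivates the later extension to proper decoding for multi-word sequences.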
Results (development set)
  Model                                  Error rate (%)   Failure rate (%)
  Baseforms only (1.7 prons/word)             63.6             61.2
  + phonological rules (4 prons/word)         50.3             47.9
  synchronous feature-based                   35.2             24.8
  asynchronous feature-based                  29.7             16.4
  asynch. + segmental constraint              32.7             19.4
  asynch. + segmental constraint + EM         27.8             19.4
• When did asynchrony matter?
– Vowel nasalization & rounding
– Nasal + stop → nasal
– Some schwa deletions
– instruments → [ih_n s ch em ih_n n s]
– everybody → [eh r uw ay]
• What didn’t work?
– Some deletions ([ax], [t])
– Vowel retroflexion
– Alveolar + [y] → palatal
– (Cross-word effects)
– (Speech/transcription errors…)
Sample Viterbi path
everybody → [ eh r uw ay ]
Ongoing/future work
• Trainable synchrony constraints ([ICSLP 2004?])
• Context-dependent distributions for underlying (Ui) and surface (Si) feature values
• Extension to more complex tasks (multi-word sequences, larger vocabularies)
• Implementation in a complete recognizer (cf. [Eurospeech 2003])
• Articulatory databases for parameter learning/testing
• Can we use such a model to learn something about speech?
Integration with feature classifier outputs
• Use (hard) classifier decisions as observations O_i,t attached to the surface variables S_i,t (the rest of the model is unchanged):
  P(O_i,t = 1 | S_i,t = s) = P_SVM(S_i,t = s)
• Landmark-based classifier outputs → DBN S_i’s:
– Convert landmark-based features to one feature vector/frame
– (Possibly) convert from SVM feature set to DBN feature set
• Convert classifier scores to posterior probabilities and use as “soft evidence” for S_i
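The score-to-posterior conversion can be sketched with a Platt-style logistic map from SVM margins to probabilities, normalized into soft-evidence weights; the `A`, `B` parameters (which would be fit on held-out data) and the function names are illustrative assumptions, not the workshop's actual calibration:

```python
import math

# Sketch: map SVM margins to posteriors (Platt-style sigmoid), then normalize
# across a feature's values to get virtual-evidence weights for S_i.
def platt_posterior(margin, A=-1.5, B=0.0):
    """Logistic calibration of an SVM margin; A, B are illustrative values."""
    return 1.0 / (1.0 + math.exp(A * margin + B))

def soft_evidence(margins_by_value):
    """Map {feature_value: svm_margin} to normalized soft-evidence weights."""
    post = {v: platt_posterior(m) for v, m in margins_by_value.items()}
    z = sum(post.values())
    return {v: p / z for v, p in post.items()}
```

Hard decisions are the special case where the winning value gets weight 1; the soft version lets the DBN weigh uncertain classifier outputs against the pronunciation model's own preferences.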
Acknowledgment
• Jeff Bilmes, U. Washington
Thank you!
GRAVEYARD
Background: Continuous Speech Recognition
• Given a waveform with acoustic features A, find the most likely word string W* = {w_1, w_2, …, w_M}:
  W* = argmax_W P(W | A) = argmax_W Σ_U P(W, U | A)
  where U ranges over the possible pronunciations (typically phone strings).
• Assuming U* is much more likely than all other U, and applying Bayes’ rule:
  {W*, U*} = argmax_{W,U} P(A | W, U) P(U | W) P(W)
  where the three factors are the acoustic model, pronunciation model, and language model, respectively.
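This factored search, maximizing the product of acoustic, pronunciation, and language model scores over (W, U) pairs, can be sketched in log space; the three scoring functions are hypothetical hooks, not actual model implementations:

```python
# Sketch of the factored hypothesis search: score each (word string W,
# pronunciation U) pair by log P(A|W,U) + log P(U|W) + log P(W).
# acoustic_lp, pron_lp, lm_lp are hypothetical log-probability hooks.
def best_hypothesis(candidates, acoustic_lp, pron_lp, lm_lp):
    """candidates: iterable of (W, U) pairs; returns the max-scoring pair."""
    return max(
        candidates,
        key=lambda wu: acoustic_lp(*wu) + pron_lp(*wu) + lm_lp(wu[0]),
    )
```

Real decoders search this space with dynamic programming rather than enumeration, but the objective being maximized is exactly this sum of log model scores.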
Example: “warmth” → “warmpth”
• Phone-based view:
  Brain: Give me a []!
  Lips, tongue, velum, glottis: Right on it, sir!
• (Articulatory) feature-based view:
  Brain: Give me a []!
  Tongue: Umm… yeah, OK.
  Velum, glottis: Right on it, sir!
  Lips: Huh?
Graphical models for hidden feature modeling
• Most ASR approaches use hidden Markov models (HMMs) and/or finite-state transducers (FSTs)
– Efficient and powerful, but limited
– Only one state variable per time frame
• Graphical models (GMs) allow for
– Arbitrary numbers of variables and dependencies
– Standard algorithms over large classes of models
→ Straightforward mapping between feature-based models and GMs
→ Potentially large reduction in number of parameters
• GMs for ASR:
– Zweig (e.g. PhD thesis, 1998), Bilmes (e.g. PhD thesis, 1999), Stephenson (e.g. Eurospeech 2001)
– Feature-based ASR with GMs suggested by Zweig, but not previously investigated
Background
• Brief intro to ASR
– Words are written in terms of sub-word units; acoustic models compute the probability of acoustic (spectral) features given sub-word units, or vice versa
• Pronunciation model: mapping between words and strings of sub-word units
Possible solution?
• Allow every pronunciation in some large database. But:
– Unreliable probability estimation due to sparse data
– Unseen words
– Increased confusability
Phone-based pronunciation modeling (2)
• Generalize across words
• But:
– Data still sparse
– Still increased confusability
– Some pronunciation changes not well described by phonetic rules
– Limited gains in speech recognition experiments
Approach
• Begin with usual assumption that each word has one or more “target” pronunciations, given by the dictionary
• Model the evolution of multiple feature streams, allowing for:
– Feature changes on a frame-by-frame basis
– Feature desynchronization
– Control of asynchrony: more “synchronous” feature configurations are preferable
• Dynamic Bayesian networks (DBNs): Efficient parameterization and computation when state can be factored