TRANSCRIPT
Speech Recognition (Part 2)
T. J. Hazen
MIT Computer Science and Artificial Intelligence Laboratory
Lecture Overview
• Probabilistic framework
• Pronunciation modeling
• Language modeling
• Finite state transducers
• Search
• System demonstrations (time permitting)
Probabilistic Framework
• Speech recognition is typically performed using a probabilistic modeling approach
• Goal is to find the most likely string of words, W, given the acoustic observations, A:
W* = argmax_W P( W | A )
• The expression is rewritten using Bayes’ Rule:
W* = argmax_W P( A | W ) P( W ) / P( A ) = argmax_W P( A | W ) P( W )
Probabilistic Framework
• Words are represented as sequences of phonetic units.
• Using phonetic units, U, the expression expands to:
( U*, W* ) = argmax_{U,W} P( A | U ) P( U | W ) P( W )
( P( A | U ) : acoustic model; P( U | W ) : pronunciation model; P( W ) : language model )
• Pronunciation and language models provide constraint
• Pronunciation and language models are encoded in a network
• Search must efficiently find the most likely U and W
Phonemes
• Phonemes are the basic linguistic units used to construct morphemes, words and sentences.
– Phonemes represent unique canonical acoustic sounds
– When constructing words, changing a single phoneme changes the word.
• Example phonemic mappings:
– pin /p ih n/
– thought /th ao t/
– saves /s ey v z/
• English spelling is not (exactly) phonemic
– Pronunciation cannot always be determined from spelling
– Homophones have same phonemes but different spellings
* Two vs. to vs. too, bear vs. bare, queue vs. cue, etc.
– Same spelling can have different pronunciations
* read, record, either, etc.
Phonemic Units and Classes
Vowels: aa : pot, ae : bat, ah : but, ao : bought, aw : bout, ax : about, ay : buy, eh : bet, er : bert, ey : bait, ih : bit, iy : beat, ow : boat, oy : boy, uh : book, uw : boot
Semivowels: l : light, r : right, w : wet, y : yet
Fricatives: s : sue, z : zoo, sh : shoe, zh : azure, f : fee, v : vee, th : thesis, dh : that, hh : hat
Nasals: m : might, n : night, ng : sing
Affricates: ch : chew, jh : Joe
Stops: p : pay, b : bay, t : tea, d : day, k : key, g : go
Phones
• Phones (or phonetic units) are used to represent the actual acoustic realization of phonemes.
• Examples:
– Stops contain a closure and a release
* /t/ → [tcl t]
* /k/ → [kcl k]
– The /t/ and /d/ phonemes can be flapped
* utter /ah t er/ → [ah dx er]
* udder /ah d er/ → [ah dx er]
– Vowels can be fronted:
* Tuesday /t uw z d ey/ → [tcl t ux z d ey]
Enhanced Phoneme Labels
Stops: p : pay, b : bay, t : tea, d : day, k : key, g : go
Special sequences: nt : interview, tq en : Clinton
Stops w/ optional release: pd : tap, bd : tab, td : pat, dd : bad, kd : pack, gd : dog
Unaspirated stops: p- : speed, t- : steep, k- : ski
Stops w/ optional flap: tf : batter, df : badder
Retroflexed stops: tr : tree, dr : drop
Example Phonemic Baseform File
<hangup> : _h1 +
<noise> : _n1 +
<uh> : ah_fp
<um> : ah_fp m
adder : ae df er
atlanta : ( ae | ax ) td l ae nt ax
either : ( iy | ay ) th er
laptop : l ae pd t aa pd
northwest : n ao r th w eh s td
speech : s p- iy ch
temperature : t eh m p ( r ? ax | er ax ? ) ch er
trenton : tr r eh n tq en
Notation notes:
– _h1 , _n1 : special hangup / noise model symbols
– + : repeat previous symbol
– ah_fp : special filled-pause vowel
– ( a | b ) : alternate pronunciations
– ? : optional phoneme
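Baseforms in this notation can be expanded mechanically. Below is a minimal sketch of a baseform expander, assuming the simplified conventions just listed (parenthesized alternates separated by |, with ? marking the previous phoneme as optional); the function names are illustrative:

import itertools, re

def expand_flat(tokens):
    # Expand a flat phoneme list where '?' makes the previous phoneme optional.
    seqs = [[]]
    for tok in tokens:
        if tok == '?':
            seqs = seqs + [s[:-1] for s in seqs]
        else:
            seqs = [s + [tok] for s in seqs]
    return seqs

def expand_baseform(baseform):
    # Tokenize into parenthesized alternate groups and bare symbols.
    tokens = re.findall(r'\([^)]*\)|\S+', baseform)
    choices = []
    for tok in tokens:
        if tok.startswith('('):
            branches = []
            for branch in tok[1:-1].split('|'):
                branches.extend(expand_flat(branch.split()))
            choices.append(branches)
        elif tok == '?':
            choices[-1] = choices[-1] + [[]]   # previous unit is optional
        else:
            choices.append([[tok]])
    return [' '.join(itertools.chain(*combo))
            for combo in itertools.product(*choices)]

print(expand_baseform('t eh m p ( r ? ax | er ax ? ) ch er'))
# -> the four pronunciations of "temperature" licensed by the entry above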
Applying Phonological Rules
• Multiple phonetic realizations of phonemes can be generated by applying phonological rules.
• Example:
butter : b ah tf er
This can be realized phonetically as:
bcl b ah tcl t er   (standard /t/)
or as:
bcl b ah dx er   (flapped /t/)
• Phonological rewrite rules can be used to generate this:
butter : bcl b ah ( tcl t | dx ) er
Example Phonological Rules
• Example rule for /t/ deletion (“destination”):
{s} t {ax ix} => [tcl t];
Format: {left context} phoneme {right context} => phonetic realization;
• Example rule for palatalization of /s/ (“miss you”):
{} s {y} => s | sh;
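A minimal sketch of how such rules might be applied to a phoneme string to produce alternative phone sequences. The rule table restates the two example rules in a toy format and is not the system's actual rule set; the bracketed realization in the /t/ rule is read here as optional, hence the empty alternative:

RULES = [
    # (left contexts, phoneme, right contexts, list of phone-string realizations)
    ({'s'}, 't', {'ax', 'ix'}, ['tcl t', '']),   # /t/ optionally deleted after /s/
    (set(), 's', {'y'}, ['s', 'sh']),            # palatalization: "miss you"
]

def apply_rules(phonemes):
    # Return all phone sequences licensed by the rules (empty context set = any).
    results = [[]]
    for i, ph in enumerate(phonemes):
        left = phonemes[i - 1] if i > 0 else None
        right = phonemes[i + 1] if i + 1 < len(phonemes) else None
        outputs = [ph]   # default: phoneme maps to itself
        for lefts, target, rights, realizations in RULES:
            if ph == target and (not lefts or left in lefts) \
                            and (not rights or right in rights):
                outputs = realizations
                break
        results = [r + [o] for r in results for o in outputs]
    return [' '.join(t for t in r if t) for r in results]

print(apply_rules(['m', 'ih', 's', 'y', 'uw']))   # "miss you" -> s and sh variants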
Contractions and Reductions
• Examples of contractions:
– what’s → what is
– isn’t → is not
– won’t → will not
– i’d → i would | i had
– today’s → today is | today’s
• Examples of multi-word reductions:
– gimme → give me
– gonna → going to
– ave → avenue
– ‘bout → about
– d’y’ave → do you have
• Contracted and reduced forms are entered in the lexical dictionary
Language Modeling
• A language model constrains hypothesized word sequences
• A finite state grammar (FSG) example (see the sketch below):
( tell me | what is ) the ( forecast | weather ) ( in | for ) ( baltimore | boston )
• Probabilities can be added to arcs for additional constraint
• FSGs work well when users stay within the grammar…
• …but FSGs can’t cover everything that might be spoken.
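A minimal sketch of this FSG as a transition table with an acceptance check; the state numbering is an assumption made to mirror the example network:

FSG = {
    0: {'tell me': 1, 'what is': 1},
    1: {'the': 2},
    2: {'forecast': 3, 'weather': 3},
    3: {'in': 4, 'for': 4},
    4: {'baltimore': 5, 'boston': 5},
}
FINAL = {5}

def accepts(words):
    # True if the word sequence is a complete path through the grammar.
    state = 0
    for w in words:
        arcs = FSG.get(state, {})
        if w not in arcs:
            return False
        state = arcs[w]
    return state in FINAL

print(accepts(['tell me', 'the', 'weather', 'for', 'boston']))   # True
print(accepts(['tell me', 'the', 'news']))                       # False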
N-gram Language Modeling
• An n-gram model is a statistical language model
• Predicts current word based on previous n-1 words
• Trigram model expression:
P( wn | wn-2 , wn-1 )
• Examples:
P( boston | arriving in )
P( seventeenth | tuesday march )
• An n-gram model allows any sequence of words…
• …but prefers sequences common in training data.
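A minimal sketch of maximum-likelihood trigram estimation, counting trigrams and dividing by their bigram history counts; the training sentences are toy data:

from collections import Counter

def train_trigram(sentences):
    # Maximum-likelihood trigram estimates: c(u v w) / c(u v).
    tri, bi = Counter(), Counter()
    for words in sentences:
        padded = ['<s>', '<s>'] + words + ['</s>']
        for i in range(2, len(padded)):
            tri[tuple(padded[i-2:i+1])] += 1
            bi[tuple(padded[i-2:i])] += 1
    def prob(w, u, v):
        return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    return prob

p = train_trigram([['arriving', 'in', 'boston'], ['arriving', 'in', 'denver']])
print(p('boston', 'arriving', 'in'))   # P( boston | arriving in ) = 0.5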
N-gram Model Smoothing
• For a bigram model, what if p( wn | wn-1 ) = 0?
• To avoid sparse training data problems, we can use an interpolated bigram:
p~( wn | wn-1 ) = λwn-1 p( wn | wn-1 ) + ( 1 − λwn-1 ) p~( wn )
• One method for determining the interpolation weight:
λwn-1 = c( wn-1 ) / ( c( wn-1 ) + K )
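A minimal sketch of this interpolated bigram, with the history-dependent weight λ computed as above; for simplicity the unigram term is left unsmoothed, and K is a tunable constant:

from collections import Counter

class InterpolatedBigram:
    def __init__(self, sentences, K=5.0):
        self.K, self.uni, self.bi, self.total = K, Counter(), Counter(), 0
        for words in sentences:
            padded = ['<s>'] + words + ['</s>']
            for prev, w in zip(padded, padded[1:]):
                self.bi[(prev, w)] += 1
            for w in padded:
                self.uni[w] += 1
                self.total += 1
    def prob(self, w, prev):
        p_uni = self.uni[w] / self.total
        c_prev = self.uni[prev]
        p_bi = self.bi[(prev, w)] / c_prev if c_prev else 0.0
        lam = c_prev / (c_prev + self.K)   # lambda depends on the history count
        return lam * p_bi + (1.0 - lam) * p_uni

lm = InterpolatedBigram([['arriving', 'in', 'boston'], ['leaving', 'denver']])
print(lm.prob('denver', 'in'))   # nonzero although "in denver" was never seen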
Class N-gram Language Modeling
• Class n-gram models can also help sparse data problems
• Class trigram expression:
P( class(wn) | class(wn-2) , class(wn-1) ) P( wn | class(wn) )
• Example:
P( seventeenth | tuesday march ) ≈ P( NTH | WEEKDAY MONTH ) P( seventeenth | NTH )
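A minimal sketch of this factorization with a toy word-to-class map and placeholder probability functions (all values here are illustrative assumptions):

WORD_CLASS = {'tuesday': 'WEEKDAY', 'march': 'MONTH', 'seventeenth': 'NTH'}

def class_trigram_prob(w, u, v, p_class_tri, p_word_given_class):
    # P( class(w) | class(u), class(v) ) * P( w | class(w) )
    cw, cu, cv = (WORD_CLASS.get(x, x) for x in (w, u, v))
    return p_class_tri(cw, cu, cv) * p_word_given_class(w, cw)

# Toy distributions for demonstration only.
p_tri = lambda cw, cu, cv: 0.3 if (cu, cv, cw) == ('WEEKDAY', 'MONTH', 'NTH') else 0.01
p_w = lambda w, cw: 1.0 / 31 if cw == 'NTH' else 0.5
print(class_trigram_prob('seventeenth', 'tuesday', 'march', p_tri, p_w))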
Multi-Word N-gram Units
• Common multi-word units can be treated as a single unit within an N-gram language model
• Common uses of compound units (see the sketch after this list):
– Common multi-word phrases:
* thank_you , good_bye , excuse_me
– Multi-word sequences that act as a single semantic unit:
* new_york , labor_day , wind_speed
– Letter sequences or initials:
* j_f_k , t_w_a , washington_d_c
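A minimal sketch of merging such sequences into single units before language model training; the compound list is illustrative:

COMPOUNDS = [('new', 'york'), ('thank', 'you'), ('labor', 'day')]

def merge_compounds(words):
    out, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i+2])
        if pair in COMPOUNDS:
            out.append('_'.join(pair))   # e.g., "new york" -> "new_york"
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out

print(merge_compounds(['flying', 'to', 'new', 'york', 'on', 'labor', 'day']))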
Finite-State Transducer (FST) Motivation
• Most speech recognition constraints and results can be represented as finite-state automata:
– Language models (e.g., n-grams and word networks)
– Lexicons
– Phonological rules
– N-best lists
– Word graphs
– Recognition paths
• Common representation and algorithms desirable:
– Consistency
– Powerful algorithms can be employed throughout system
– Flexibility to combine or factor in unforeseen ways
What is an FST?
• One initial state
• One or more final states
• Transitions between states: input : output / weight
– input requires an input symbol to match
– output indicates the symbol to output when the transition is taken
– epsilon (ε) consumes no input or produces no output
– weight is the cost (e.g., -log probability) of taking the transition
• An FST defines a weighted relationship between regular languages
• A generalization of the classic finite-state acceptor (FSA)
FST Example: Lexicon
• Lexicon maps /phonemes/ to ‘words’
• Words can share parts of pronunciations
• Sharing at the beginning benefits recognition speed because pruning can discard many words at once
FST Composition
• Composition (o) combines two FSTs to produce a single FST that performs both mappings in a single step
( words → /phonemes/ ) o ( /phonemes/ → [phones] ) = ( words → [phones] )
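A minimal sketch of composition for epsilon-free weighted FSTs: the pair-state machine takes an arc whenever the output symbol of the first machine matches the input symbol of the second. The FST representation below is an assumption for illustration; real toolkits such as OpenFst also handle epsilons, sorting, and lazy expansion:

def compose(fst1, fst2):
    # Each FST is (start, finals, arcs) with arcs[state] = [(in, out, weight, next)].
    start1, finals1, arcs1 = fst1
    start2, finals2, arcs2 = fst2
    start = (start1, start2)
    arcs, finals = {}, set()
    stack, seen = [start], {start}
    while stack:
        state = stack.pop()
        s1, s2 = state
        if s1 in finals1 and s2 in finals2:
            finals.add(state)
        out = []
        for i1, o1, w1, n1 in arcs1.get(s1, []):
            for i2, o2, w2, n2 in arcs2.get(s2, []):
                if o1 == i2:                 # match middle symbols
                    nxt = (n1, n2)
                    out.append((i1, o2, w1 + w2, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
        arcs[state] = out
    return start, finals, arcs

# L: phonemes accepted for the word "two"; P: phoneme-to-phone alternatives.
L = (0, {2}, {0: [('t', 't', 0.0, 1)], 1: [('uw', 'uw', 0.0, 2)]})
P = (0, {0}, {0: [('t', 'tcl+t', 0.5, 0), ('t', 'dx', 1.0, 0),
                  ('uw', 'uw', 0.0, 0), ('uw', 'ux', 0.7, 0)]})
start, finals, arcs = compose(L, P)
print(arcs[start])   # both phone realizations of /t/ from the start state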
FST Optimization Example
(Figure: a letter-to-word lexicon)
FST Optimization Example: Determinization
• Determinization turns lexicon into tree
• Words share common prefix
FST Optimization Example: Minimization
• Minimization enables sharing at the ends
A Cascaded FST Recognizer
Multi-Word Units ← G : Language Model
↓ M : Multi-word Mapping
Canonical Words
↓ R : Reductions Model
Spoken Words
↓ L : Lexical Model
Phonemic Units
↓ P : Phonological Model
Phonetic Units
↓ C : CD Model Mapping
Acoustic Model Labels
(G, M, and R make up the language model side of the cascade; L and P make up the pronunciation model.)
A Cascaded FST Recognizer
Multi-Word Units: give me new_york_city
↓ M : Multi-word Mapping
Canonical Words: give me new york city
↓ R : Reductions Model
Spoken Words: gimme new york city
↓ L : Lexical Model
Phonemic Units: g ih m iy n uw y ao r kd s ih tf iy
↓ P : Phonological Model
Phonetic Units: gcl g ih m iy n uw y ao r kcl s ih dx iy
↓ C : CD Model Mapping
Acoustic Model Labels
Search
• Once again, the probabilistic expression is:
( U*, W* ) = argmax_{U,W} P( A | U ) P( U | W ) P( W )
( P( A | U ) : acoustic model; P( U | W ) P( W ) : lexical FST )
• Pronunciation and language models are encoded in the FST
• Search must efficiently find the most likely U and W
Viterbi Search
• Viterbi search: a time-synchronous breadth-first search
(Figure: search lattice with lexical nodes h#, m, a, r, z on the vertical axis and time frames t0–t8 on the horizontal axis; the best path through the lattice is h# m a r z h#.)
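A minimal sketch of time-synchronous Viterbi search with backpointers for recovering the best path; the observation scores and transitions below are toy assumptions standing in for real acoustic and lexical models:

import math

def viterbi(obs_scores, transitions, start, finals):
    # obs_scores[t][state] = log P(observation at frame t | state)
    # transitions[state] = list of (next_state, transition log-prob)
    best = {start: 0.0}          # best log score of any path ending in state
    backptrs = []                # backptrs[t][state] = predecessor at frame t
    for frame in obs_scores:
        new_best, new_back = {}, {}
        for state, score in best.items():
            for nxt, logp in transitions.get(state, []):
                cand = score + logp + frame.get(nxt, -math.inf)
                if cand > new_best.get(nxt, -math.inf):
                    new_best[nxt], new_back[nxt] = cand, state
        best = new_best
        backptrs.append(new_back)
    # Backtrace from the highest-scoring final state.
    state = max((s for s in best if s in finals), key=lambda s: best[s])
    final_score, path = best[state], [state]
    for bp in reversed(backptrs):
        state = bp[state]
        path.append(state)
    return list(reversed(path)), final_score

obs = [{'m': -1.0, 'a': -3.0}, {'m': -2.5, 'a': -0.5}]
trans = {'<s>': [('m', -0.1)], 'm': [('m', -0.7), ('a', -0.7)], 'a': [('a', -0.7)]}
print(viterbi(obs, trans, '<s>', {'a'}))   # (['<s>', 'm', 'a'], score ≈ -2.3)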
Viterbi Search Pruning
• Search efficiency can be improved with pruning (see the sketch below):
– Score-based: Don’t extend low-scoring hypotheses
– Count-based: Extend only a fixed number of hypotheses
(Figure: the same search lattice, with low-scoring hypotheses marked "x" where they are pruned.)
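A minimal sketch of both pruning strategies applied to the active hypotheses at a single time frame; the thresholds are illustrative:

def prune(hyps, beam_width=10.0, beam_size=3):
    # hyps: dict state -> log score. Keep hypotheses within beam_width of the
    # best score (score-based), then keep at most beam_size of them (count-based).
    if not hyps:
        return hyps
    best = max(hyps.values())
    survivors = {s: v for s, v in hyps.items() if v >= best - beam_width}
    top = sorted(survivors.items(), key=lambda kv: kv[1], reverse=True)
    return dict(top[:beam_size])

print(prune({'m': -1.0, 'a': -2.0, 'r': -15.0, 'z': -3.0},
            beam_width=10.0, beam_size=2))
# -> {'m': -1.0, 'a': -2.0} ('r' fails the score beam; 'z' fails the count beam)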
Search Pruning Example
• Count-based pruning can effectively reduce search
• Example: Fix beam size (count) and vary beam width (score)
(Figure: search lattices over lexical nodes h#, m, a, r, z and time frames t0–t8 for different beam widths.)
N-best Computation with Backwards A* Search
• Backwards A* search can be used to find N-best paths
• Viterbi backtrace is used as future estimate for path scores
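A minimal sketch of N-best extraction over a word graph: best cost-to-go scores are first computed for every node (the role the Viterbi backtrace plays above), then A* expands partial paths with priority = cost so far + exact cost-to-go, so complete paths pop off in score order. The graph format is an assumption for illustration:

import heapq, math

def n_best(graph, start, goal, n):
    # graph[node] = [(next_node, word, cost)]; returns up to n cheapest paths.
    # Exact cost-to-go via iterative relaxation (fine for small graphs).
    h = {goal: 0.0}
    changed = True
    while changed:
        changed = False
        for node, arcs in graph.items():
            best = min((c + h[nxt] for nxt, _, c in arcs if nxt in h),
                       default=math.inf)
            if best < h.get(node, math.inf):
                h[node] = best
                changed = True
    results = []
    heap = [(h.get(start, math.inf), 0.0, start, [])]
    while heap and len(results) < n:
        _, cost, node, words = heapq.heappop(heap)
        if node == goal:
            results.append((cost, words))
            continue
        for nxt, word, c in graph.get(node, []):
            if nxt in h:   # only expand nodes that can reach the goal
                heapq.heappush(heap, (cost + c + h[nxt], cost + c, nxt, words + [word]))
    return results

G = {'s': [('a', 'tell_me', 1.0), ('a', 'what_is', 1.5)],
     'a': [('g', 'the_weather', 0.5)]}
print(n_best(G, 's', 'g', 2))   # the two paths, cheapest first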
Street Address Recognition
• Street address recognition is difficult
– 6.2M unique street, city, state pairs in the US (283K unique words)
– High confusion rate among similar street names
– Very large search space for recognition
• Commercial solution: directed dialogue
– Breaks the problem into a set of smaller recognition tasks
– Simple for first-time users, but tedious with repeated use
C: Main menu. Please say one of the following:
C: “directions”, “restaurants”, “gas stations”, or “more options”.
H: Directions.
C: Okay. Directions. What state are you going to?
H: Massachusetts.
C: Okay. Massachusetts. What city are you going to?
H: Cambridge.
C: Okay. Cambridge. What is the street address?
H: 32 Vassar Street.
C: Okay. 32 Vassar Street in Cambridge, Massachusetts.
C: From your current location, continue straight on…
Street Address Recognition
• Research goal: mixed-initiative dialogue
– More difficult to predict what users will say
– Far more natural for repeat or expert users
C: How can I help you?
H: I need directions to 32 Vassar Street in Cambridge, Mass.
• Recognition approach: dynamically adapt the recognition vocabulary (see the sketch below)
– 3 recognition passes over one utterance
– 1st pass: Detect state and activate relevant cities
– 2nd pass: Detect cities and activate relevant streets
– 3rd pass: Recognize full street address
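A minimal sketch of this three-pass control flow, assuming toy geography tables and a placeholder recognize() that stands in for a real decoding pass over the waveform:

STATE_CITIES = {'massachusetts': ['cambridge', 'boston']}
CITY_STREETS = {'cambridge': ['vassar street', 'main street']}

def recognize(utterance, vocabulary):
    # Placeholder: a real pass would re-decode the audio with an FST built
    # from `vocabulary`. Here we simply pick the vocabulary items that match.
    return [v for v in vocabulary if v in utterance]

def three_pass(utterance):
    states = recognize(utterance, list(STATE_CITIES))                      # 1st pass
    cities = recognize(utterance,
                       [c for s in states for c in STATE_CITIES[s]])       # 2nd pass
    streets = recognize(utterance,
                        [t for c in cities for t in CITY_STREETS.get(c, [])])  # 3rd pass
    return states, cities, streets

print(three_pass('i need directions to 32 vassar street in cambridge massachusetts'))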
Dynamic Vocabulary Recognition