Application of HMMs: Speech recognition
• “Noisy channel” model of speech
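In the noisy-channel view, the recognizer looks for the word sequence that best explains the observed acoustics. Written in the usual notation (the symbols $W$ for a word sequence and $O$ for the acoustic observations are assumptions, not taken from the slide), Bayes' rule gives:

$$
\hat{W} \;=\; \arg\max_{W} P(W \mid O) \;=\; \arg\max_{W} \frac{P(O \mid W)\,P(W)}{P(O)} \;=\; \arg\max_{W} P(O \mid W)\,P(W)
$$

$P(O)$ can be dropped because it does not depend on $W$. $P(O \mid W)$ is the acoustic model and $P(W)$ is the language model.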
Speech feature extraction
• Acoustic waveform: sampled at 8 kHz, quantized to 8-12 bits
• Spectrogram (axes: time, frequency, amplitude)
• Frame: 10 ms, or 80 samples at 8 kHz
• Feature vector: ~39 dimensions per frame
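The framing step above can be sketched as follows. This is a minimal illustration, assuming non-overlapping frames and a plain log-magnitude spectrum as the per-frame feature; the slide does not say how its ~39-dimensional vectors are computed, and this sketch does not reproduce them. The names `frame_signal` and `log_magnitude_spectrum` are hypothetical, and a synthetic tone stands in for recorded speech.

```python
import numpy as np

SAMPLE_RATE = 8000                            # Hz, as on the slide
FRAME_MS = 10                                 # frame length in milliseconds
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000    # 80 samples per frame


def frame_signal(signal, frame_len=FRAME_LEN):
    """Split a 1-D signal into non-overlapping frames, dropping any tail."""
    n_frames = len(signal) // frame_len
    return np.reshape(signal[:n_frames * frame_len], (n_frames, frame_len))


def log_magnitude_spectrum(frames):
    """Per-frame log magnitude spectrum (a stand-in for real speech features)."""
    windowed = frames * np.hamming(frames.shape[1])
    return np.log(np.abs(np.fft.rfft(windowed, axis=1)) + 1e-10)


# One second of a synthetic 440 Hz tone (assumption: not real speech).
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
wave = np.sin(2 * np.pi * 440 * t)

frames = frame_signal(wave)                   # one row per 10 ms frame
features = log_magnitude_spectrum(frames)     # one feature vector per frame
```

Each row of `features` plays the role of one observation in the HMM; a real front end would typically use overlapping windows and a more compact feature set.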
Phonetic model
• Phones: speech sounds
• Phonemes: groups of speech sounds that have a unique meaning/function in a language (e.g., the several different ways of pronouncing “t” are distinct phones but a single phoneme)
HMM models for phones
• HMM states in most speech recognition systems correspond to subphones
– There are around 60 phones and as many as 60³ (= 216,000) context-dependent triphones
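To make the numbers concrete, here is a small sketch of the triphone count and of a phone HMM whose states are subphones. The three-state left-to-right layout and the self-loop probability `p_stay = 0.6` are assumptions chosen for illustration; the slide only says that states correspond to subphones. The function name `left_to_right_hmm` is hypothetical.

```python
import numpy as np

N_PHONES = 60
# Every (left context, phone, right context) combination is a distinct triphone.
n_triphones = N_PHONES ** 3


def left_to_right_hmm(n_states=3, p_stay=0.6):
    """Transition matrix in which each subphone state loops or moves one step right.

    The last row sums to p_stay only: the remaining 1 - p_stay is the
    probability of leaving the phone, which this matrix does not represent.
    """
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        A[i, i] = p_stay                  # stay in the same subphone
        if i + 1 < n_states:
            A[i, i + 1] = 1.0 - p_stay    # advance to the next subphone
    return A


A = left_to_right_hmm()
```

The self-loops let one subphone span a variable number of 10 ms frames, which is how the model absorbs differences in speaking rate.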
HMM models for words
Putting words together
• Given a sequence of acoustic features, how do we find the corresponding word sequence?
Decoding with the Viterbi algorithm
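A minimal sketch of Viterbi decoding in log space, assuming the per-frame emission likelihoods have already been computed from the feature vectors. The two-state model and all of its probabilities are made up for illustration and are not taken from the slides.

```python
import numpy as np


def viterbi(log_init, log_trans, log_emit):
    """Return the most probable state path and its log probability.

    log_init:  (S,)   log P(first state)
    log_trans: (S, S) log P(next state j | current state i), indexed [i, j]
    log_emit:  (T, S) log P(observation at frame t | state)
    """
    T, S = log_emit.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0] = log_init + log_emit[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_trans      # cand[i, j]: come from i, land in j
        back[t] = np.argmax(cand, axis=0)             # best predecessor of each state
        score[t] = cand[back[t], np.arange(S)] + log_emit[t]
    # Trace the best path backwards from the best final state.
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    path.reverse()
    return path, float(np.max(score[-1]))


# Toy two-state model with three frames (all numbers are assumptions).
init = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
emit = np.array([[0.9, 0.1],
                 [0.8, 0.2],
                 [0.1, 0.9]])

path, logp = viterbi(np.log(init), np.log(trans), np.log(emit))
```

Working with log probabilities avoids numerical underflow on long utterances, where the product of many small probabilities would otherwise round to zero.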
Limitations of Viterbi decoding
• Number of states may be too large
– Beam search: at each time step, maintain a short list of the most probable words and only extend transitions from those words into the next time step
• Words with multiple pronunciation variants may get a smaller probability than incorrect words with fewer pronunciation paths, because Viterbi scores only the single best path and a word's probability is split across its variants
– Use the forward algorithm instead of the Viterbi algorithm, so that the probabilities of all paths through a word are summed
• The Markov assumption is too restrictive to capture the constraints of real language
Figure: word model for “tomato”
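The pronunciation-variant problem can be illustrated with two toy word models. This is a sketch under made-up numbers: word A has two equally likely pronunciation branches, word B has a single path. The same recursion is run once with a sum over predecessors (forward algorithm) and once with a max (Viterbi score). The helper `sequence_score` and both word models are hypothetical.

```python
import numpy as np


def sequence_score(init, trans, emit, combine):
    """HMM recursion with combine=np.sum (forward) or combine=np.max (Viterbi)."""
    alpha = init * emit[0]
    for t in range(1, len(emit)):
        alpha = combine(alpha[:, None] * trans, axis=0) * emit[t]
    return float(combine(alpha))


# Word A: state 0 branches to variant state 1 or 2, both rejoin at state 3.
init_a = np.array([1.0, 0.0, 0.0, 0.0])
trans_a = np.array([[0.0, 0.5, 0.5, 0.0],
                    [0.0, 0.0, 0.0, 1.0],
                    [0.0, 0.0, 0.0, 1.0],
                    [0.0, 0.0, 0.0, 1.0]])
emit_a = np.array([[0.5, 0.1, 0.1, 0.1],
                   [0.1, 0.4, 0.4, 0.1],
                   [0.1, 0.1, 0.1, 0.5]])

# Word B: a single pronunciation path 0 -> 1 -> 2.
init_b = np.array([1.0, 0.0, 0.0])
trans_b = np.array([[0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [0.0, 0.0, 1.0]])
emit_b = np.array([[0.5, 0.1, 0.1],
                   [0.1, 0.3, 0.1],
                   [0.1, 0.1, 0.5]])

viterbi_a = sequence_score(init_a, trans_a, emit_a, np.max)
viterbi_b = sequence_score(init_b, trans_b, emit_b, np.max)
forward_a = sequence_score(init_a, trans_a, emit_a, np.sum)
forward_b = sequence_score(init_b, trans_b, emit_b, np.sum)
```

With these numbers the best single path favors word B, while the summed probability favors word A: each branch of A carries only half of the transition mass, so no single path through A can win even though A as a whole explains the frames better.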
Advanced techniques
• Multiple pass decoding
– Let the Viterbi decoder return multiple candidate utterances and then re-rank them using a more sophisticated language model, e.g., an n-gram model
• A* decoding
– Build a search tree whose nodes are words and whose paths are possible utterances
– Path cost is given by the likelihood of the acoustic features given the words inferred so far
– Heuristic function estimates the best-scoring extension until the end of the utterance
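The re-ranking step of multiple pass decoding can be sketched as below. Everything here is an assumption made for illustration: the candidate list, the acoustic log scores, the tiny bigram table, the floor probability for unseen bigrams, and the names `bigram_logprob` and `rescore`. A bigram model is used only to keep the example short; a real second pass would use a richer model.

```python
import math

# Hypothetical bigram probabilities; "<s>" marks the start of the utterance.
BIGRAMS = {
    ("<s>", "recognize"): 0.01,
    ("recognize", "speech"): 0.2,
    ("<s>", "wreck"): 0.001,
    ("wreck", "a"): 0.1,
    ("a", "nice"): 0.05,
    ("nice", "beach"): 0.02,
}
UNSEEN = 1e-4   # floor probability for bigrams missing from the table


def bigram_logprob(words):
    """Log probability of a word sequence under the toy bigram table."""
    pairs = zip(["<s>"] + words[:-1], words)
    return sum(math.log(BIGRAMS.get(pair, UNSEEN)) for pair in pairs)


def rescore(nbest, lm_logprob, lm_weight=1.0):
    """Re-rank (words, acoustic_logprob) pairs by acoustic + weighted LM score."""
    scored = [(ac + lm_weight * lm_logprob(words), words) for words, ac in nbest]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [words for _, words in scored]


# Candidates as a first-pass decoder might return them (scores are made up).
nbest = [
    (["wreck", "a", "nice", "beach"], -9.5),
    (["recognize", "speech"], -10.0),
]

first_pass_best = max(nbest, key=lambda pair: pair[1])[0]
reranked = rescore(nbest, bigram_logprob)
```

The first pass prefers the acoustically closer candidate, and the language model reverses the order once its score is added. The weight `lm_weight` controls how much the second pass trusts the language model relative to the acoustics.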
Reference
• D. Jurafsky and J. Martin, “Speech and Language Processing,” 2nd ed., Prentice Hall, 2008