Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology. Mark Hasegawa-Johnson [email protected] University of Illinois at Urbana-Champaign, USA. Lecture 3: Spectral Dynamics and the Production of Consonants.
![Page 1: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/1.jpg)
Landmark-Based Speech Recognition:
Spectrogram Reading, Support Vector Machines,
Dynamic Bayesian Networks, and Phonology
Mark Hasegawa-Johnson [email protected]
University of Illinois at Urbana-Champaign, USA
![Page 2: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/2.jpg)
Lecture 3: Spectral Dynamics and the Production of Consonants
• International Phonetic Alphabet
• Events in the Closure of a Nasal Consonant
  – Formant transitions: a perturbation model
  – Nasalized vowel
  – Nasal murmur
• Events in the Release of a Stop Consonant
  – Pre-voicing (voiced stops in carefully read English)
  – Transient (stops and affricates)
  – Frication (stops, affricates, and fricatives)
  – Aspiration (aspirated stops and /h/)
  – Formant Transitions (any consonant-vowel transition)
• Formant Tracking
  – Does it help Speech Recognition?
  – Methods for Vowels, and for Aspiration & Nasals
• Reminder – lab 1 due Monday!
![Page 3: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/3.jpg)
International Phonetic Alphabet: Purpose and Brief History
• Purpose of the alphabet: to provide a universal notation for the sounds of the world’s languages
  – “Universal” = If any language on Earth distinguishes two phonemes, IPA must also distinguish them
  – “Distinguish” = Meaning of a word changes when the phoneme changes, e.g. “cat” vs. “bat.”
• Very Brief History:
  – 1867: Alexander Melville Bell publishes a distinctive-feature-based phonetic notation in “Visible Speech: The Science of Universal Alphabetics.” His notation is rejected as being too expensive to print
  – 1886: International Phonetic Association founded in Paris by phoneticians from across Europe
  – 1991: Unicode provides a standard method for including IPA notation in computer documents
![Page 4: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/4.jpg)
International Phonetic Alphabet: Vowels
Pinyin        ARPABET (approx.)
i / u (xu)    IY / UX
              EY
              EH
a (zhang)     AE
a (ma)

Pinyin        ARPABET (approx.)
/ u (zhu)     / UW
o             UH
/ oa          / OW
/ o           AH / AO
a (ma)        AA

Pinyin: e     ARPABET: AX
![Page 5: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/5.jpg)
IPA: Regular Consonants
ARPABET labels on the chart: NG, DX, R, HH/HV, Q, Y
Articulator labels: Tongue Blade, Tongue Body
ARPABET: F/V (labiodental), TH/DH (dental), S/Z (alveolar), SH/ZH (postalveolar or palatal)
Pinyin: s (alveolar), x (postalveolar), sh/r (retroflex)
![Page 6: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/6.jpg)
Affricates and Doubly-Articulated Consonants
Affricates in English and Chinese:
Place            Pinyin    ARPABET    IPA
Alveolar         c/z       -          ts/dz
Post-alveolar    q/j       CH/JH      tʃ/dʒ
Retroflex        ch/zh     -          ʈʂ/ɖʐ
Doubly-articulated consonants, ARPABET: WH, W
![Page 7: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/7.jpg)
Non-Pulmonic Consonants
![Page 8: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/8.jpg)
Events in the Closure of a Syllable-Final Nasal Consonant
![Page 9: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/9.jpg)
Events in the Closure of a Nasal Consonant
Vowel Nasalization
Formant Transitions
Nasal Murmur
![Page 10: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/10.jpg)
Formant Transitions: A Perturbation Theory Model
![Page 11: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/11.jpg)
Formant Transitions: Labial Consonants
“the mom”
“the bug”
![Page 12: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/12.jpg)
Formant Transitions: Alveolar Consonants
“the tug”
“the supper”
![Page 13: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/13.jpg)
Formant Transitions: Post-alveolar Consonants
“the shoe”
“the zsazsa”
![Page 14: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/14.jpg)
Formant Transitions: Velar Consonants
“the gut”
“sing a song”
![Page 15: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/15.jpg)
Formant Transitions: A Perceptual Study
The study: (1) synthesize speech with different formant patterns, (2) record subject responses. Delattre, Liberman and Cooper, J. Acoust. Soc. Am. 1955.
![Page 16: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/16.jpg)
Perception of Formant Transitions: Conclusions
![Page 17: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/17.jpg)
Vowel Nasalization
![Page 18: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/18.jpg)
Vowel Nasalization
![Page 19: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/19.jpg)
Additive Terms in the Log Spectrum
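The slide's own derivation is not recoverable from the transcript. As a reminder of the standard identity the title presumably refers to, a rational transfer function with gain K, zeros z_k, and poles p_m (symbols chosen here for illustration, not taken from the slide) has a log-magnitude spectrum in which every zero and every pole contributes one additive term:

```latex
T(s) = K \, \frac{\prod_{k} (s - z_k)}{\prod_{m} (s - p_m)}
\quad\Longrightarrow\quad
\log |T(j\omega)| = \log |K|
  + \sum_{k} \log |j\omega - z_k|
  - \sum_{m} \log |j\omega - p_m|
```

Read this way, adding a pole-zero pair to a vowel's transfer function adds one peak term and one dip term to the vowel's log spectrum and leaves the other terms unchanged.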
![Page 20: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/20.jpg)
Transfer Function of a Nasalized Vowel
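The summary describes vowel nasalization as a sum of transfer functions. A minimal sketch of why such a sum introduces zeros, assuming a toy model in which each branch is a single all-pole resonance (the frequencies, bandwidths, and sample rate below are hypothetical, not values from the lecture):

```python
import numpy as np


def resonator_denominator(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients [1, a1, a2] of A(z) = 1 + a1 z^-1 + a2 z^-2 for one resonance."""
    radius = np.exp(-np.pi * bandwidth_hz / sample_rate)
    theta = 2.0 * np.pi * freq_hz / sample_rate
    return np.array([1.0, -2.0 * radius * np.cos(theta), radius * radius])


sample_rate = 10000.0
# Hypothetical single-resonance branches (illustrative values only).
a_oral = resonator_denominator(500.0, 80.0, sample_rate)
a_nasal = resonator_denominator(1000.0, 200.0, sample_rate)

# 1/A_oral + 1/A_nasal = (A_oral + A_nasal) / (A_oral * A_nasal):
# the sum keeps both pole pairs and gains a numerator, i.e. zeros.
numerator = a_oral + a_nasal
denominator = np.convolve(a_oral, a_nasal)

zeros = np.roots(numerator)
zero_freqs_hz = np.abs(np.angle(zeros)) * sample_rate / (2.0 * np.pi)
pole_freqs_hz = np.abs(np.angle(np.roots(denominator))) * sample_rate / (2.0 * np.pi)

print("pole frequencies (Hz):", np.round(np.unique(np.round(pole_freqs_hz, 3))))
print("zero frequencies (Hz):", np.round(np.unique(np.round(zero_freqs_hz, 3))))
```

In this toy case the zero pair falls between the two resonances, which is the qualitative pole-zero-pole pattern the sum-of-transfer-functions view predicts; a real nasalized vowel has many more poles and zeros.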
![Page 21: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/21.jpg)
Nasal Murmur
“the mug” “the nut” “sing a song”
Observations:
• Low-frequency resonance (about 300 Hz) always present
• Low-frequency resonance has wide bandwidth (about 150 Hz)
• Energy of low-frequency resonance is very constant
• Most high-frequency resonances cancelled by zeros
• Different places of articulation have different high-frequency spectra
• High-frequency spectrum is talker-dependent and variable
![Page 22: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/22.jpg)
Resonances of a Nasal Consonant
Reference: Fujimura, JASA 1962
![Page 23: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/23.jpg)
Anti-Resonances of a Nasal Consonant
![Page 24: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/24.jpg)
Events in the Release of a Stop (Plosive) Consonant
![Page 25: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/25.jpg)
Events in the Release of a Stop
“Burst” = transient + frication (the part of the spectrogram whose transfer function has poles only at the front cavity resonance frequencies, not at the back cavity resonances).
![Page 26: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/26.jpg)
Events in the Release of a Stop
Unaspirated (/b/) Aspirated (/t/)
Transient Frication Aspiration Voicing
![Page 27: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/27.jpg)
Pre-voicing during Closure
To make a voiced stop in most European languages:
The tongue root is relaxed, allowing it to expand, so that the vocal folds can continue vibrating for a little while after oral closure.
The result is a low-frequency “voice bar” that may continue well into closure.
In English, closure voicing is typical of read speech, but not of casual speech.
“the bug”
![Page 28: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/28.jpg)
Transient: The Release of Pressure
![Page 29: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/29.jpg)
Transfer Function During Transient and Frication: Poles
Front cavity resonance frequency: F_R = c / (4 L_f), where c is the speed of sound and L_f is the length of the front cavity
Turbulence striking an obstacle makes noise
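The quarter-wavelength formula for the front cavity resonance is easy to evaluate numerically. The sketch below assumes a speed of sound of 35400 cm/s (a typical value for warm, moist air) and uses hypothetical cavity lengths chosen only to show the trend; none of these numbers come from the slides.

```python
# Quarter-wavelength resonance of the front cavity: F_R = c / (4 * L_f).
SPEED_OF_SOUND_CM_PER_S = 35400.0  # assumed value for warm, moist air


def front_cavity_resonance_hz(front_cavity_length_cm: float) -> float:
    """Return the lowest front-cavity resonance (Hz) for a cavity length in cm."""
    if front_cavity_length_cm <= 0:
        raise ValueError("cavity length must be positive")
    return SPEED_OF_SOUND_CM_PER_S / (4.0 * front_cavity_length_cm)


# Hypothetical front-cavity lengths in cm (illustrative, not measured data).
for length_cm in (1.5, 2.5, 6.0):
    resonance = front_cavity_resonance_hz(length_cm)
    print(f"L_f = {length_cm} cm -> F_R = {resonance:.0f} Hz")
```

The shorter the cavity in front of the constriction, the higher the burst or frication peak, which is why place of articulation is visible in the burst spectrum.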
![Page 30: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/30.jpg)
Transfer Function During Frication: An Important Zero
![Page 31: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/31.jpg)
Transfer Function During Frication: An Important Zero
![Page 32: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/32.jpg)
Transfer Function During Aspiration
![Page 33: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/33.jpg)
Are Formant Frequencies Useful for Speech Recognition?
• Kopec and Bush (1992): WER(formants alone) > WER(cepstrum alone) > WER(formants and cepstrum together)
• How should we track formants?
  – In vowels: Autoregressive (AR) modeling (also known as LPC)
  – In aspiration, nasals: Autoregressive Moving Average (ARMA) modeling. Problem: no closed-form solution
  – In aspiration, nasals: Exponentially Weighted Autoregressive (EWAR; Zheng and Hasegawa-Johnson, ICASSP 2004)
![Page 34: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/34.jpg)
Formant Tracking for Vowels: Autoregressive Model (LPC)
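The slide's equations are not preserved in the transcript. As a rough illustration of AR (LPC) formant estimation, the sketch below fits an all-pole model with the autocorrelation method and reads candidate formant frequencies off the angles of the roots of A(z). The function names, the synthetic two-resonance test signal, and all parameter values are assumptions made for this example; this is a generic textbook procedure, not necessarily the one used in the lecture.

```python
import numpy as np


def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC; returns [1, a_1, ..., a_p] of A(z)."""
    windowed = frame * np.hamming(len(frame))
    full = np.correlate(windowed, windowed, mode="full")
    mid = len(windowed) - 1
    r = full[mid:mid + order + 1]  # autocorrelation lags 0..order
    # Toeplitz normal equations: R a = -r[1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1:])
    return np.concatenate(([1.0], a))


def formants_from_lpc(a, sample_rate):
    """Frequencies (Hz) of the roots of A(z) in the upper half plane."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0.01]
    return np.sort(np.angle(roots) * sample_rate / (2.0 * np.pi))


def all_pole_filter(a, excitation):
    """Direct-form recursion y[n] = x[n] - sum_k a[k] * y[n-k]."""
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, min(n, len(a) - 1) + 1):
            acc -= a[k] * y[n - k]
        y[n] = acc
    return y


# Synthetic test signal: noise through two hypothetical resonances.
sample_rate = 8000.0
true_formants_hz = [500.0, 1500.0]
poles = []
for f in true_formants_hz:
    p = 0.97 * np.exp(2j * np.pi * f / sample_rate)
    poles += [p, np.conj(p)]
a_true = np.real(np.poly(poles))

rng = np.random.default_rng(0)
signal = all_pole_filter(a_true, rng.standard_normal(4096))
frame = signal[-2048:]  # skip the filter's start-up transient

estimated = formants_from_lpc(lpc_coefficients(frame, order=4), sample_rate)
print("estimated resonances (Hz):", np.round(estimated))
```

On real vowels one would normally use short frames, a higher model order, and some rule for discarding wide-bandwidth roots; those choices are left out here to keep the sketch small.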
![Page 35: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/35.jpg)
Formant Tracking for Aspiration: “Auto-Regressive Moving Average” Model (ARMA)
![Page 36: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/36.jpg)
Formant Tracking for Aspiration: “Exponentially Weighted Auto-Regressive” Model (EWAR) (Zheng and Hasegawa-Johnson, ICSLP 2004)
![Page 37: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/37.jpg)
Solving the EWAR Model
![Page 38: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/38.jpg)
Results: Stop Classification, MFCC alone vs. MFCC+formants
![Page 39: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/39.jpg)
Results: Stop Classification, MFCC alone vs. MFCC+formants
![Page 40: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology](https://reader035.vdocuments.us/reader035/viewer/2022062814/568167e7550346895ddd5509/html5/thumbnails/40.jpg)
Summary
• International Phonetic Alphabet:
  – Usable on any computer with Unicode
  – International encoding for all sounds of the world’s languages
• Events in a nasal closure:
  – Formant transitions (perturbation model)
  – Vowel nasalization (sum of TFs)
  – Nasal murmur (impedance match at juncture)
• Events in release of a stop:
  – Pre-voicing in English voiced stops (read speech)
  – Transient (dp/dt ~ dA/dt)
  – Frication ((zero at f=0)/(front cavity resonances))
  – Aspiration ((zero at f=0)/(same poles as the vowel))
• Formant tracking
  – In a vowel: use LPC
  – In aspiration, frication, or nasal murmur: ARMA is theoretically optimal, but computationally expensive
  – Aspiration, etc.: EWAR can be a good approximation to ARMA