Speech & Language Modeling
Cindy Burklow & Jay Hatcher
CS521 – March 30, 2006
Agenda
What is Speech Recognition?
Challenges of Speech Recognition
Expresso III Case Study
IBM Superhuman Speech Tech
Speech Synthesis
What is Speech Recognition
How does it work?
Two approaches
Phonemes
One long rule book
Deductive framework
Search Algorithms & Math Models
Hunting Speech
Phoneme Sequence
Phonemes Energy
Challenges of Speech Recognition
Noise
Users' own preferences
Limit Speech Range
People
Infinite Combinations
Software
Expresso III
Project
Who? Why? What? How?
Expresso III
How is it different?
Why try a new method?
Co-Articulation
Independencies
Duration
Linear Dynamic Model (LDM)
Expresso III
Why Linear Dynamic Model (LDM)?
Expresso III's Hypothesis
Testing Methods
Includes error models
Only linear models allowed
Series of tests (5 total)
Increase “phones” & training data
Switching & Iteration & Data classification
Generated histograms of log likelihood
Divide & Conquer Technique
Results
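The slides do not give the model equations, so the following is only a minimal sketch of how a linear dynamic model can score a sequence by log likelihood and classify it. It assumes a scalar state-space model with Gaussian noise and uses a standard Kalman filter; the parameter names (`a`, `c`, `q`, `r`, `m0`, `p0`) and the two example models are hypothetical, not taken from Expresso III.

```python
import math


def ldm_log_likelihood(observations, a, c, q, r, m0=0.0, p0=1.0):
    """Log likelihood of a 1-D observation sequence under a scalar LDM.

    Assumed model (illustrative, not from the slides):
        x[t] = a * x[t-1] + w,  w ~ N(0, q)   (hidden state)
        y[t] = c * x[t]   + v,  v ~ N(0, r)   (observation)
    """
    mean, var = m0, p0  # prior on the first hidden state
    total = 0.0
    for y in observations:
        # Score the observation against the model's prediction.
        innovation = y - c * mean
        s = c * c * var + r
        total += -0.5 * (math.log(2.0 * math.pi * s) + innovation ** 2 / s)
        # Kalman update of the state estimate.
        gain = var * c / s
        mean = mean + gain * innovation
        var = (1.0 - gain * c) * var
        # Propagate the state one step forward.
        mean = a * mean
        var = a * a * var + q
    return total


def classify(observations, models):
    """Return the name of the model with the highest log likelihood."""
    return max(
        models,
        key=lambda name: ldm_log_likelihood(observations, **models[name]),
    )


# Two hypothetical phone models that differ only in their dynamics.
MODELS = {
    "decay": dict(a=0.9, c=1.0, q=0.01, r=0.01),
    "flip": dict(a=-0.9, c=1.0, q=0.01, r=0.01),
}
```

Calling `classify([1.0, 0.9, 0.81, 0.729], MODELS)` picks `"decay"`, because a smoothly shrinking sequence is far more probable under positive dynamics than under sign-flipping ones. Collecting such scores over many segments is one way the histograms of log likelihood mentioned above could be produced.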
IBM Superhuman Speech Tech
ViaVoice 4.4
Products
Goal
“Get performance comparable to humans in the next five years.”
-IBM Jan. 2006
Comprehend languages
Translate dynamically
Create “on-the-fly” subtitles on TV
Speak commands
Free-Form Command
MASTOR
TALES
PDAs, iPods, & DVRs
“Free-Form Command”
• Commands associated with objects
• Simplified Language
• Partnering with specialized hardware manufacturers
• Finding niche markets
• Well-chosen Algorithms
IBM’s MASTOR
Multilingual Automatic Speech-to-Speech Translator
IBM’s TALES
• Server-based system
• Dynamically transcribes and translates any spoken words into English subtitles
• Requires long processing time
• Real-time translations are impossible
• 60%-70% accuracy rate
• High subscription fee for users
Expanding Speech Recognition Applications
PDAs to collect data
iPod: Email & RSS Read Aloud
Navigate Your DVR with Speech
Voice commands
Requires microphone
• TV remote
• Headset
Text to Speech Systems
Two major steps:
1. Convert the text into a pronounceable format
– Look for domain specific sections like time, dates, numbers, addresses, and abbreviations
– Try to identify homographs and the contexts in which they occur
– Use some combination of dictionary and rule-based approaches as a guide to pronunciation
2. Synthesize speech from the phonetic representation using one of many possible approaches
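The first step above can be sketched as a small text normalizer. This is only an illustration under stated assumptions: the abbreviation table, the 0 to 99 number range, and the function names are all hypothetical, and a real front end would also use context (for example, "St." can mean "Saint" as well as "Street").

```python
import re

# Hypothetical abbreviation table; a real system needs context as well.
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister", "St.": "Street"}

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven",
    "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
    "fifteen", "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize(text):
    """Expand known abbreviations and one- or two-digit numbers."""
    words = []
    for token in text.split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"\d{1,2}", token):
            words.append(number_to_words(int(token)))
        else:
            words.append(token)  # leave everything else untouched
    return " ".join(words)
```

For example, `normalize("Dr. Smith lives at 42 Elm St.")` returns `"Doctor Smith lives at forty-two Elm Street"`. The word "lives" is passed through unchanged here; choosing its pronunciation is the homograph problem listed above.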
Speech Synthesis
[Figure: Continuum of Speech Synthesis methods. Labels: Formant Synthesis, Recordings, Concatenative synthesis, Unit Selection, Waveform Synthesis, Diphone Synthesis, Hybrid Approaches, Articulatory Synthesis, HMM-based synthesis]
Speech Synthesis at CMU
• Carnegie Mellon University has been doing extensive research in both speech recognition and speech synthesis
• Research primarily uses the Festival Speech Synthesis System, an open-source framework developed at the University of Edinburgh
• Research has primarily focused on Diphone Synthesis, with some additional exploration into Unit Selection.
• Diphone synthesis allows greater control of pitch and voice inflection, but often has a more robotic sound to it.
• Example: This is a short introduction to the Festival Speech Synthesis System. Festival was developed by Alan Black and Paul Taylor, at the Centre for Speech Technology Research, University of Edinburgh.
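A diphone spans the transition from the middle of one phone to the middle of the next, so a phone sequence of length n needs n + 1 diphones once silence is added at both ends. The sketch below shows that bookkeeping only. The `pau` silence label and the `left-right` naming are assumptions made for illustration; actual diphone databases define their own naming.

```python
def phones_to_diphones(phones, silence="pau"):
    """Turn a phone sequence into the diphone names needed to join it.

    The utterance is padded with silence on both sides so that the first
    and last phones also get a transition unit.
    """
    padded = [silence] + list(phones) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]
```

For example, `phones_to_diphones(["h", "eh", "l", "ow"])` returns `["pau-h", "h-eh", "eh-l", "l-ow", "ow-pau"]`. Because each unit is a short, fixed transition, pitch and duration can be imposed afterwards by signal processing, which is where both the control and the robotic quality come from.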
• Improvements can be made by performing statistical analysis of the text as a preprocessing step before synthesis.
• This helps with pacing, homographs, and other situations where pronunciation differs depending on context.
• He wanted to go for a drive in.
• He wanted to go for a drive in the country.
• My cat who lives dangerously has nine lives.
• Henry V: Part I Act II Scene XI: Mr X is I believe, V I Lenin, and not Charles I.
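The "nine lives" sentence above shows why context matters: the same spelling needs two pronunciations. The sketch below uses a single hand-written rule as a stand-in for the statistically trained models the slide refers to. The quantity-word list, the function names, and the ARPAbet-style phone strings are illustrative assumptions.

```python
# Hypothetical cue words suggesting that a following "lives" is a plural noun.
QUANTITY_WORDS = {
    "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "ten", "many", "several", "their", "our",
}


def pronounce_lives(words, index):
    """Pick a pronunciation for 'lives' at words[index] from the previous word."""
    previous = words[index - 1].lower() if index > 0 else ""
    if previous in QUANTITY_WORDS:
        return "l ay v z"  # plural noun, as in "nine lives"
    return "l ih v z"      # verb, as in "who lives dangerously"


def tag_homographs(sentence):
    """Return (word index, pronunciation) for every 'lives' in the sentence."""
    words = sentence.rstrip(".").split()
    return [
        (i, pronounce_lives(words, i))
        for i, word in enumerate(words)
        if word.lower() == "lives"
    ]
```

For example, `tag_homographs("My cat who lives dangerously has nine lives.")` returns `[(3, "l ih v z"), (7, "l ay v z")]`. A trained model would learn such cues from labeled text instead of relying on a fixed word list.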
• Unit selection can be used instead of diphones to make the voice sound more natural by drawing on larger recorded units (e.g. whole phones or syllables), not just diphones (the transitions between sounds)
• The following examples are based on the same speaker:
• Diphones
• Unit Selection
• With care, unit selection can produce very convincing natural sound.
– Original Sound
– Synthesis from natural phones, pitch, and duration data
• However, it is difficult to generalize Unit Selection for a variety of situations, and if it does poorly it sounds much worse than diphones.
– Example
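Unit selection is commonly framed as a search: pick one recorded unit per target so that the sum of target costs (how well each unit fits what is wanted) and join costs (how smoothly neighbors connect) is minimal. The sketch below is a generic dynamic-programming version of that idea, not Festival's implementation. The `(phone, pitch)` unit format and both cost functions are hypothetical, and the function assumes at least one target with at least one candidate each.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Choose one candidate per target, minimizing target plus join costs.

    candidates[i] lists the database units available for targets[i].
    """
    # best[i][j] = (cheapest total cost ending in candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for unit in candidates[i]:
            costs = [
                best[i - 1][k][0] + join_cost(prev, unit)
                for k, prev in enumerate(candidates[i - 1])
            ]
            k_best = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[k_best] + target_cost(targets[i], unit), k_best))
        best.append(row)

    # Trace the cheapest path back from the last position.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path


def pitch_target_cost(target, unit):
    """Hypothetical target cost: distance from the desired pitch in Hz."""
    return abs(target[1] - unit[1])


def pitch_join_cost(left, right):
    """Hypothetical join cost: pitch jump between adjacent units in Hz."""
    return abs(left[1] - right[1])


TARGETS = [("h", 120), ("eh", 120), ("l", 120)]
CANDIDATES = [
    [("h", 118), ("h", 140)],
    [("eh", 100), ("eh", 125)],
    [("l", 121), ("l", 90)],
]
```

With these toy values, `select_units(TARGETS, CANDIDATES, pitch_target_cost, pitch_join_cost)` returns `[("h", 118), ("eh", 125), ("l", 121)]`. When the database holds no good candidate for some target, every path through that position is expensive, which is one way to picture why a poor match sounds much worse than diphones.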
• Most commercial TTS packages use Unit Selection with medium to large databases of samples.
– Example: NeoSpeech VoiceText
• These produce higher quality sound at the expense of memory and processor power.
• CMU’s Festival implementation has focused more on Diphone Synthesis to reduce memory footprint and allow greater control of the synthesizer.
• Diphone Synthesis can control inflection, pitch, and other factors dynamically.
– A short example with no prosody.
– A short example with declination.
– A short example with accents on stressed syllables and end tones.
– A short example with statistically trained intonation and duration models.
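Declination and accents, as listed above, can be pictured as a falling pitch baseline with a boost on stressed syllables. The sketch below is a deliberately simple illustration: the linear fall, the default values in Hz, and the function name are all assumptions, and it does not model end tones or trained intonation.

```python
def pitch_contour(stresses, start_hz=130.0, end_hz=100.0, accent_hz=15.0):
    """Return one pitch target (Hz) per syllable.

    stresses is a list of booleans, one per syllable. The baseline falls
    linearly from start_hz to end_hz (declination), and each stressed
    syllable is raised by accent_hz.
    """
    n = len(stresses)
    contour = []
    for i, stressed in enumerate(stresses):
        fraction = i / (n - 1) if n > 1 else 0.0
        baseline = start_hz + (end_hz - start_hz) * fraction
        contour.append(baseline + (accent_hz if stressed else 0.0))
    return contour
```

For example, `pitch_contour([False, True, False, False, True])` returns `[130.0, 137.5, 115.0, 107.5, 115.0]`. Setting `accent_hz=0.0` gives declination only, and setting `start_hz` equal to `end_hz` as well gives a flat contour, comparable to the "no prosody" case.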
Conclusion
• CMU’s research using Festival has led to useful technology for embedded systems and servers. The Diphone Synthesis model they have developed can produce generally intelligible speech with minimal memory and processing costs. The model is still being worked on and may one day reach a natural level of quality.
References and Useful Links
What is speech recognition & Challenges?
• http://www.extremetech.com/article2/0,1697,1826664,00.asp
• http://csdl2.computer.org/persagen/DLAbsToc.jsp?resourcePath=/dl/mags/co/&toc=comp/mags/co/2002/04/r4toc.xml&DOI=10.1109/MC.2002.993770
• http://en.wikipedia.org/wiki/Speech_recognition
• http://cslu.cse.ogi.edu/HLTsurvey/ch1node7.html
Expresso III Case Study
• http://www.cstr.ed.ac.uk/publications/users/s0129866_abstracts.html#Couper-02
• http://www.cstr.ed.ac.uk/publications/users/s0129866.html
IBM Superhuman Speech Tech
• http://www.ibm.com
• http://www.pcmag.com/article2/0,1895,1915071,00.asp
Speech Synthesis
• The Festival Speech Synthesis System
• NeoSpeech VoiceText Demo
• AT&T’s TTS FAQ
• Reviews of Popular Speech Synthesizers
• Speech Engine Listings with Samples
• BrightSpeech.com
• Festival at CMU
• FestVox