
Page 1:

Speech & Language Modeling

Cindy Burklow & Jay Hatcher

CS521 – March 30, 2006

Page 2:

Agenda

What is Speech Recognition?

Challenges of Speech Recognition

Expresso III Case Study

IBM Superhuman Speech Tech

Speech Synthesis

Page 3:

What is Speech Recognition?

How does it work?

Two approaches, both operating over phonemes:

• One long rule book (a deductive framework)

• Search algorithms & math models (see the sketch below)
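To make the second approach concrete, here is a minimal, hypothetical sketch: given per-frame phoneme probabilities from an acoustic front end (the numbers and the two-word lexicon are invented), a search scores each candidate word's phoneme sequence and picks the best match. Real recognizers use HMMs and far larger search spaces; this only illustrates the idea.

```python
# Toy search-based recognition: find the word whose phoneme sequence
# best explains the observed frames. All values are invented.
import math

# Hypothetical lexicon: word -> phoneme sequence
LEXICON = {
    "cat": ["k", "ae", "t"],
    "cap": ["k", "ae", "p"],
}

# Hypothetical acoustic output: per-frame phoneme probabilities,
# one frame per phoneme for simplicity.
frames = [
    {"k": 0.8, "ae": 0.1, "t": 0.05, "p": 0.05},
    {"k": 0.1, "ae": 0.7, "t": 0.1, "p": 0.1},
    {"k": 0.05, "ae": 0.05, "t": 0.6, "p": 0.3},
]

def score(word):
    # Log probability that the frames match the word's phonemes.
    phones = LEXICON[word]
    if len(phones) != len(frames):
        return float("-inf")
    return sum(math.log(f.get(p, 1e-9)) for f, p in zip(frames, phones))

best = max(LEXICON, key=score)
print(best)  # -> "cat"
```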

Page 4:

Hunting Speech

Page 5:

Phoneme Sequence

Page 6:

Phoneme Energy

Page 7:

Challenges of Speech Recognition

Noise

Users' own preferences

Limited speech range

People

Infinite Combinations

Software

Page 8:

Expresso III

Project

Who? Why? What? How?

Page 9:

Expresso III

How is it different?

Why try a new method?
• Co-articulation
• Independencies
• Duration

Linear Dynamic Model (LDM)

Page 10:

Expresso III

Why a Linear Dynamic Model (LDM)?

Expresso III’s Hypothesis

Testing Methods (a sketch of the LDM idea follows this list):

• Includes error models
• Only linear models allowed
• Series of tests (5 total)
• Increase “phones” & training data
• Switching, iteration, & data classification
• Generated histograms of log likelihood
• Divide & conquer technique
• Results
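As a hedged illustration of what an LDM is (not the Expresso III implementation itself): a hidden state evolves linearly with Gaussian noise, observed acoustic features are a noisy linear map of that state, and a Kalman filter yields the log likelihood of a segment under the model, the same quantity those histograms are built from. All matrix values below are made up.

```python
# Minimal linear dynamic model (LDM) sketch: a hidden state x evolves
# linearly over time, and the observed acoustic feature y is a noisy
# linear map of x. All matrices are illustrative values, not the
# parameters used in the case study.
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[0.9, 0.1], [0.0, 0.8]])  # state transition
C = np.array([[1.0, 0.0]])              # observation matrix (1-D output)
Q = 0.01 * np.eye(2)                    # state noise covariance
R = np.array([[0.1]])                   # observation noise covariance

# Simulate one segment of T feature frames under the model.
T, x, ys = 20, np.zeros(2), []
for _ in range(T):
    x = A @ x + rng.multivariate_normal(np.zeros(2), Q)
    ys.append(C @ x + rng.multivariate_normal(np.zeros(1), R))

def log_likelihood(frames):
    # Kalman-filter log likelihood of the frames under (A, C, Q, R);
    # per-model scores like this feed the histograms mentioned above.
    # The observation is 1-D, so the 2*pi term below is a scalar.
    mu, P, ll = np.zeros(2), np.eye(2), 0.0
    for y in frames:
        mu, P = A @ mu, A @ P @ A.T + Q          # predict
        S = C @ P @ C.T + R                      # innovation covariance
        resid = y - C @ mu
        ll += -0.5 * (np.log(2 * np.pi * np.linalg.det(S))
                      + resid @ np.linalg.solve(S, resid))
        K = P @ C.T @ np.linalg.inv(S)           # Kalman gain
        mu, P = mu + K @ resid, (np.eye(2) - K @ C) @ P
    return float(ll)

print(log_likelihood(ys))
```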

Page 11:

IBM Superhuman Speech Tech

ViaVoice 4.4

Products

Goal

“Get performance comparable to humans in the next five years.”

– IBM, Jan. 2006

Comprehend languages

Translate dynamically

Create “on-the-fly” subtitles on TV

Speak commands

Free-Form Command

MASTOR

TALES

PDAs, iPods, & DVRs

Page 12:

“Free-Form Command”

• Commands associated with objects

• Simplified Language

• Partnering with specialized hardware manufacturers

• Finding niche markets

• Well-chosen algorithms (see the toy sketch below)
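A toy sketch of the "commands associated with objects" idea; every object and action name here is invented. Each on-screen object registers the actions it supports, and a recognized utterance is matched only against that small set, which is what keeps the language simple.

```python
# Toy sketch of "commands associated with objects": each on-screen
# object registers its supported actions, and a recognized utterance
# is matched against object/action pairs. All names are invented.
OBJECTS = {
    "playlist": {"open", "shuffle", "clear"},
    "recording": {"play", "delete", "rename"},
}

def interpret(utterance: str):
    # Naive keyword match; a real system would use a grammar
    # constrained to the objects currently visible.
    words = set(utterance.lower().split())
    for obj, actions in OBJECTS.items():
        if obj in words:
            for act in actions & words:
                return obj, act
    return None

print(interpret("please shuffle the playlist"))  # ('playlist', 'shuffle')
```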

Page 13:

IBM’s MASTOR

Multilingual Automatic Speech-to-Speech Translator

Page 14:

IBM’s TALES

• Server-based system

• Dynamically transcribes and translates spoken words into English subtitles

• Requires long processing time

• Real-time translation is not possible

• 60%-70% accuracy rate

• High subscription fee for users

Page 15:

Expanding Speech Recognition Applications

PDAs to collect data

iPod: Email & RSS Read Aloud

Page 16:

Navigate Your DVR with Speech

Voice commands

Requires a microphone:
• TV remote
• Headset

Page 17:

Text to Speech Systems

Two major steps:

1. Convert the text into a pronounceable format (a minimal sketch of this step follows the list)

– Look for domain-specific sections like times, dates, numbers, addresses, and abbreviations

– Try to identify homographs and the contexts in which they occur

– Use some combination of dictionary and rule-based approaches as a guide to pronunciation

2. Synthesize speech from the phonetic representation using one of many possible approaches
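A minimal sketch of step 1, text normalization: expand a few domain-specific patterns into pronounceable words. The abbreviation table and the digit-by-digit number reading below are illustrative only, not a complete normalizer.

```python
# Minimal text-normalization sketch. The rules and word lists are
# illustrative; a real normalizer handles dates, currency, ordinals,
# and many more abbreviation classes.
import re

ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street"}
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def spell_number(match):
    # Read digits out one by one.
    return " ".join(ONES[int(d)] for d in match.group())

def normalize(text):
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return re.sub(r"\d+", spell_number, text)

print(normalize("Dr. Smith lives at 42 Elm St."))
# -> "Doctor Smith lives at four two Elm Street"
```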

Page 18:

Speech Synthesis

Continuum of speech synthesis methods:

• Formant synthesis
• Articulatory synthesis
• HMM-based synthesis
• Diphone synthesis
• Unit selection
• Concatenative synthesis
• Waveform synthesis
• Hybrid approaches
• Recordings

Page 19:

Speech Synthesis at CMU

• Carnegie Mellon University has been doing extensive research in both speech recognition and speech synthesis

• Research primarily uses the Festival Speech Synthesis System, an open-source framework developed at the University of Edinburgh

Page 20:

Speech Synthesis at CMU

• Research has primarily focused on Diphone Synthesis, with some additional exploration into Unit Selection.

Page 21:

Speech Synthesis at CMU

• Diphone synthesis allows greater control of pitch and voice inflection, but often has a more robotic sound to it.

• Example: This is a short introduction to the Festival Speech Synthesis System. Festival was developed by Alan Black and Paul Taylor, at the Centre for Speech Technology Research, University of Edinburgh.

Page 22:

Speech Synthesis at CMU

• Improvements can be made by performing statistical analysis of the text as a preprocessing step before synthesis.

• This helps with pacing, homographs, and other situations where pronunciation differs depending on context (a toy homograph sketch follows the examples below).

• He wanted to go for a drive in.
• He wanted to go for a drive in the country.

• My cat who lives dangerously has nine lives.
• Henry V: Part I Act II Scene XI: Mr X is, I believe, V. I. Lenin, and not Charles I.
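As a toy illustration of context-dependent pronunciation, the sketch below picks a reading for the homograph "lives" from its neighboring words. Real systems use statistical part-of-speech taggers; the cue list here is invented.

```python
# Illustrative homograph handling: choose a pronunciation for "lives"
# from the preceding word. The cue list is invented; real systems use
# statistical part-of-speech tagging.
PRONUNCIATIONS = {"verb": "l ih v z", "noun": "l ay v z"}
NOUN_CUES = {"nine", "their", "his", "her", "the"}

def pronounce_lives(words, i):
    # "nine lives" -> noun reading; "who lives" -> verb reading.
    prev = words[i - 1].lower() if i > 0 else ""
    pos = "noun" if prev in NOUN_CUES else "verb"
    return PRONUNCIATIONS[pos]

sentence = "My cat who lives dangerously has nine lives".split()
for i, w in enumerate(sentence):
    if w == "lives":
        print(i, pronounce_lives(sentence, i))
# -> 3 l ih v z   (verb)
#    7 l ay v z   (noun)
```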

Page 23:

Speech Synthesis at CMU

• Unit selection can be used instead of diphones to make the voice sound more natural, concatenating larger whole units (e.g. syllables) rather than just diphones (sound transitions); a sketch of the selection search follows the examples below

• The following examples are based on the same speaker:

• Diphones

• Unit Selection
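For a sense of how unit selection chooses among recorded candidates, here is a hedged sketch of the standard target-cost-plus-join-cost formulation with invented numbers; production systems search much larger candidate sets with dynamic programming rather than brute force.

```python
# Sketch of the unit-selection search: for each target position, pick a
# recorded candidate minimizing target cost (fit to the desired pitch
# and duration) plus join cost (smoothness with the previous unit).
# Candidates and costs are invented.
import itertools

# candidates[i] = list of (unit_id, target_cost) for target position i;
# the digit in the id marks which recording "take" the unit came from.
candidates = [
    [("a1", 0.2), ("a2", 0.55)],
    [("b1", 0.45), ("b2", 0.1)],
]

def join_cost(u, v):
    # Hypothetical join cost: free if the units come from the same
    # take (so they concatenate smoothly), a penalty otherwise.
    return 0.0 if u[-1] == v[-1] else 0.3

def path_cost(path):
    target = sum(tc for _, tc in path)
    joins = sum(join_cost(u, v) for (u, _), (v, _) in zip(path, path[1:]))
    return target + joins

best = min(itertools.product(*candidates), key=path_cost)
print([u for u, _ in best], path_cost(best))  # ['a1', 'b2'] 0.6
```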

Page 24:

Speech Synthesis at CMU

• With care, unit selection can produce very convincing natural sound.
– Original sound
– Synthesis from natural phones, pitch, and duration data

• However, it is difficult to generalize unit selection across a variety of situations, and when it does poorly it sounds much worse than diphone synthesis.
– Example

Page 25:

Speech Synthesis at CMU

• Most commercial TTS packages use unit selection with medium-to-large databases of samples.
– Example: NeoSpeech VoiceText

• These produce higher-quality sound at the expense of memory and processor power.

• CMU’s Festival implementation has focused more on diphone synthesis to reduce its memory footprint and allow greater control of the synthesizer.

Page 26:

Speech Synthesis at CMU

• Diphone synthesis can control inflection, pitch, and other factors dynamically (a declination sketch follows this list).
– A short example with no prosody.
– A short example with declination.
– A short example with accents on stressed syllables and end tones.
– A short example with statistically trained intonation and duration models.
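As a sketch of the declination and accent examples above (with made-up pitch values): compute per-syllable F0 targets that fall gradually across the utterance, then boost the stressed syllables; a diphone synthesizer would then modify its units to hit these targets.

```python
# Prosody-target sketch: F0 declination (a gradual pitch fall across
# the utterance) plus pitch accents on stressed syllables. The Hz
# values are invented, not Festival's defaults.
def declination_targets(n_syllables, start_hz=130.0, end_hz=100.0):
    # Linear fall from start_hz to end_hz across the utterance.
    step = (start_hz - end_hz) / max(n_syllables - 1, 1)
    return [start_hz - i * step for i in range(n_syllables)]

def with_accents(targets, stressed, boost_hz=20.0):
    # Raise pitch on stressed syllables on top of the declination line.
    return [t + boost_hz if i in stressed else t
            for i, t in enumerate(targets)]

targets = declination_targets(6)
print([round(t) for t in targets])                    # falling line
print([round(t) for t in with_accents(targets, {1, 4})])
```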

Page 27:

Conclusion

• CMU’s research using Festival has led to useful technology for embedded systems and servers. The diphone synthesis model they have developed can produce generally intelligible speech with minimal memory and processing costs. The model is still under development and may one day reach a natural level of quality.

Page 28:

References and Useful Links

What is speech recognition & its challenges?
• http://www.extremetech.com/article2/0,1697,1826664,00.asp
• http://csdl2.computer.org/persagen/DLAbsToc.jsp?resourcePath=/dl/mags/co/&toc=comp/mags/co/2002/04/r4toc.xml&DOI=10.1109/MC.2002.993770
• http://en.wikipedia.org/wiki/Speech_recognition
• http://cslu.cse.ogi.edu/HLTsurvey/ch1node7.html

Expresso III Case Study
• http://www.cstr.ed.ac.uk/publications/users/s0129866_abstracts.html#Couper-02
• http://www.cstr.ed.ac.uk/publications/users/s0129866.html

IBM Superhuman Speech Tech
• http://www.ibm.com
• http://www.pcmag.com/article2/0,1895,1915071,00.asp

Page 29:

References and Useful Links

• The Festival Speech Synthesis System

• NeoSpeech VoiceText Demo

• AT&T’s TTS FAQ

• Reviews of Popular Speech Synthesizers

• Speech Engine Listings with Samples

• BrightSpeech.com

• Festival at CMU

• FestVox