
Introduction to Conversational Interfaces

Jim Glass (glass@mit.edu)
Spoken Language Systems Group
MIT Laboratory for Computer Science
February 10, 2003

Speech interfaces are ideal for information access and management when:

• The information space is broad and complex,

• The users are technically naive, or

• Speech is the only available modality.

Virtues of Spoken Language

Natural: Requires no special training

Flexible: Leaves hands and eyes free

Efficient: Has high data rate

Economical: Can be transmitted and received inexpensively

Communication via Spoken Language

[Diagram: human and computer communicate via speech. On the input side the computer performs speech recognition (speech to text) and understanding (text to meaning); on the output side, generation (meaning to text) and synthesis (text to speech).]

Components of Conversational Systems

[Diagram: audio input and output connect to Speech Recognition and Speech Synthesis; recognition feeds Language Understanding, Context Resolution, and Dialogue Management, which queries a Database and drives Language Generation and Speech Synthesis.]
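The component chain above can be sketched as a simple pipeline in which each stage transforms the previous stage's output. This is an illustrative toy, not the actual GALAXY interfaces; all function names and the stub logic are invented for the example.

```python
# Illustrative pipeline of conversational-system components.
# Each stage is a stub standing in for a real module.

def speech_recognition(audio: str) -> str:
    # A real recognizer maps a waveform to a word string.
    return audio  # stub: pretend the audio is already transcribed

def language_understanding(words: str) -> dict:
    # Map a word string to a meaning representation (semantic frame).
    frame = {"clause": "DISPLAY", "topic": "FLIGHT", "predicates": []}
    tokens = words.split()
    for key in ("from", "to"):
        if key in tokens:
            city = tokens[tokens.index(key) + 1]
            frame["predicates"].append({"pred": key.upper(), "city": city.title()})
    return frame

def context_resolution(frame: dict, history: list) -> dict:
    # Fill in missing pieces from the dialogue history (a no-op here).
    return frame

def dialogue_management(frame: dict) -> dict:
    # Decide what to do next: query the database, ask a question, etc.
    return {"action": "retrieve", "frame": frame}

def language_generation(result: dict) -> str:
    cities = [p["city"] for p in result["frame"]["predicates"]]
    return f"Looking up flights from {cities[0]} to {cities[1]}."

def run_pipeline(audio: str) -> str:
    frame = language_understanding(speech_recognition(audio))
    frame = context_resolution(frame, history=[])
    return language_generation(dialogue_management(frame))

print(run_pipeline("show me flights from boston to denver"))
# -> Looking up flights from Boston to Denver.
```

The point of the sketch is the data flow: each component consumes the previous one's representation (waveform, word string, semantic frame, action) rather than raw input.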

Components of MIT Conversational Systems

[Diagram: the GALAXY architecture connects all components through a central Hub: Audio, Speech Recognition (SUMMIT), Language Understanding (TINA), Context Resolution (Discourse), Dialogue Management (Dialogue Manager), Language Generation (GENESIS), Speech Synthesis (ENVOICE), and the Database.]

Segment-Based Speech Recognition

• Frame-based measurements (every 5 ms)
• Segment network created by interconnecting spectral landmarks
• Probabilistic search finds most likely phone and word strings

[Figure: waveform, frame-based measurements, and segment network for the utterance "computers that talk," labeled with phones such as k, ax, m, p, uw, dx, er, z, dh, ae, t, ao.]
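The probabilistic search over a segment network can be sketched as a best-path computation through a scored lattice. The network and scores below are made up for illustration; this is not the SUMMIT recognizer, just the dynamic-programming idea.

```python
import math

# Toy segment network: edges (start_node, end_node, phone, log_prob).
# A real segment-based recognizer scores acoustic segments between
# spectral landmarks; these scores are invented for the example.
edges = [
    (0, 1, "k",  math.log(0.9)),
    (0, 1, "g",  math.log(0.1)),
    (1, 2, "ax", math.log(0.8)),
    (1, 2, "ae", math.log(0.2)),
    (2, 3, "m",  math.log(1.0)),
]

def best_path(edges, start=0, end=3):
    """Dynamic-programming search for the highest-probability phone string.
    Assumes node numbering is already topological (true for a left-to-right
    segment network)."""
    best = {start: (0.0, [])}            # node -> (log score, phone sequence)
    for s, e, phone, lp in sorted(edges):  # process edges in node order
        if s in best:
            score = best[s][0] + lp
            if e not in best or score > best[e][0]:
                best[e] = (score, best[s][1] + [phone])
    return best[end]

score, phones = best_path(edges)
print(phones)  # -> ['k', 'ax', 'm']
```

A real search also composes the phone lattice with pronunciation and language models, but the best-scoring-path recursion is the same.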

Natural Language Understanding

"show me flights from boston to denver"

[Parse tree: sentence → command (display) and predicate; the flight topic carries from/to predicates over cities. Some syntactic nodes carry semantic tags for creating a semantic frame.]

Clause: DISPLAY
  Topic: FLIGHT
    Predicate: FROM  Topic: CITY  Name: "Boston"
    Predicate: TO    Topic: CITY  Name: "Denver"
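A semantic frame like the one above maps naturally onto a nested data structure that downstream components can flatten into a database query. The rendering below is a hypothetical illustration, not the actual TINA output format.

```python
# A semantic frame as nested dictionaries: semantically tagged parse
# nodes become clause/topic/predicate entries.
frame = {
    "clause": "DISPLAY",
    "topic": {
        "name": "FLIGHT",
        "predicates": [
            {"pred": "FROM", "topic": "CITY", "name": "Boston"},
            {"pred": "TO",   "topic": "CITY", "name": "Denver"},
        ],
    },
}

def frame_to_query(frame):
    """Flatten the frame into key/value constraints for a database query."""
    query = {
        "action": frame["clause"].lower(),
        "table": frame["topic"]["name"].lower(),
    }
    for p in frame["topic"]["predicates"]:
        query[p["pred"].lower()] = p["name"]
    return query

print(frame_to_query(frame))
# -> {'action': 'display', 'table': 'flight', 'from': 'Boston', 'to': 'Denver'}
```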

Dialogue Modeling Strategies

• An effective conversational interface must incorporate extensive and complex dialogue modeling
• Conversational systems differ in the degree to which the human or the computer takes the initiative
• Our systems use a mixed-initiative approach, where both the human and the computer play an active role

Human initiative: the human takes complete control; the computer is totally passive.
  H: I want to visit my grandmother.

Computer initiative: the computer maintains tight control; the human is highly restricted.
  C: Please say the departure city.

Different Roles of Dialogue Management

• Pre-Retrieval: Ambiguous Input => Unique Query to DB
• Post-Retrieval: Multiple DB Retrievals => Unique Response

U: I need a flight from Boston to San Francisco
C: Did you say Boston or Austin?  [clarification: recognition errors]
U: Boston, Massachusetts
C: I need a date before I can access Travelocity  [clarification: insufficient info]
U: Tomorrow
C: Hold on while I retrieve the flights for you
C: I have found 10 flights meeting your specification. When would you like to leave?
U: In the morning.
C: Do you have a preferred airline?
U: United
C: I found two non-stop United flights leaving in the morning…  [help the user narrow down the choices]
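The pre-retrieval role can be sketched as a check for missing or low-confidence slots before any query is issued. Slot names, the confidence threshold, and the confusable-city table are all illustrative assumptions, not part of any MIT system.

```python
REQUIRED_SLOTS = ("source", "destination", "date")

# City names a recognizer commonly confuses; the manager asks the
# user to disambiguate rather than guessing (hypothetical table).
CONFUSABLE = {"Boston": "Austin", "Austin": "Boston"}

def next_dialogue_move(slots, confidence):
    """Pre-retrieval: turn an ambiguous or incomplete request into either
    a clarification question or a unique database query."""
    for slot in REQUIRED_SLOTS:
        value = slots.get(slot)
        if value is None:
            # Insufficient information: ask for the missing slot.
            return f"Please tell me the {slot}."
        if confidence.get(slot, 1.0) < 0.7 and value in CONFUSABLE:
            # Possible recognition error: confirm before querying.
            return f"Did you say {value} or {CONFUSABLE[value]}?"
    return "QUERY"  # all slots filled confidently: retrieve from the database

print(next_dialogue_move({"source": "Boston"}, {"source": 0.5}))
# -> Did you say Boston or Austin?
```

The post-retrieval role (narrowing down ten flights to one) follows the same pattern in reverse: when the result set is too large, the manager picks an attribute (departure time, airline) to ask about.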

Concatenative Speech Synthesis

• Output waveform generated by concatenating segments of a pre-recorded speech corpus
• Concatenation at the phrase, word, or sub-word level

Synthesis Examples

[Audio examples: sub-word units excerpted from recordings of words such as "compassion," "disputed," "cedar city," "since," "giant," "labyrinth," "abracadabra," and "obligatory" are concatenated to synthesize "computer science laboratory."]

"Continental flight 4695 from Greensboro is expected in Halifax at 10:08 pm local time."

"The third ad is a 1996 black Acura Integra with 45,380 miles. The price is 8,970 dollars. Please call (404) 399-7682."
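Waveform-level concatenation can be sketched with plain arrays. The "corpus" below is a few invented sample lists standing in for recorded audio, and the one-sample crossfade is only a stand-in for the boundary smoothing a real synthesizer performs.

```python
# Toy pre-recorded corpus: each sub-word unit maps to a waveform
# segment (short lists of samples standing in for real audio).
corpus = {
    "com": [0.1, 0.2, 0.1],
    "pu":  [0.3, 0.2],
    "ter": [0.1, 0.0, -0.1],
}

def synthesize(units, crossfade=1):
    """Concatenate corpus segments, averaging one boundary sample at
    each join as a crude stand-in for smoothing."""
    out = list(corpus[units[0]])
    for unit in units[1:]:
        seg = corpus[unit]
        # Simple crossfade: average the overlapping boundary samples.
        out[-crossfade:] = [(a + b) / 2
                            for a, b in zip(out[-crossfade:], seg[:crossfade])]
        out.extend(seg[crossfade:])
    return out

wave = synthesize(["com", "pu", "ter"])
print(len(wave))  # -> 6
```

The hard problems in a real system (which units to pick when many candidates exist, and how to match pitch and duration across joins) are exactly what this sketch leaves out.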

Multilingual Conversational Interfaces

• Adopts an interlingua approach for multilingual human-machine interactions
• Applications:
  – MuXing: Mandarin system for weather information
  – Mokusei: Japanese system for weather information
  – Spanish systems are also under development
  – New speech-to-speech translation work (Phrasebook)

[Diagram: a GALAXY hub connects Audio/I/O Servers, Speech Recognition, Language Understanding, Discourse Resolution, Dialogue Management, Language Generation, Text-to-Speech Conversion, and Application Back-ends; components are labeled language-independent, language-dependent, or language-transparent, with per-language models and rules.]

Bilingual Jupiter Demonstration

Multi-modal Conversational Interfaces

• Typing, pointing, and clicking can augment/complement speech
• A picture (or a map) is worth a thousand words
• Application: WebGalaxy
  – Allows typing and clicking
  – Includes map-based navigation with display
  – Embedded in a web browser
  – Current exhibit at MIT Museum

[Diagram: speech recognition, gesture recognition, handwriting recognition, and mouth & eyes tracking all feed language understanding, which produces meaning.]

WebGalaxy Demonstration

Delegating Tasks to Computers

• Many information-related activities can be done off-line
• Off-line delegation frees the user to attend to other matters
• Application: the Orion system
  – Task Specification: the user interacts with Orion to specify a task
      "Call me every morning at 6 and tell me the weather in Boston."
      "Send me e-mail any time between 4 and 6 p.m. if the traffic on Route 93 is at a standstill."
  – Task Execution: Orion leverages existing infrastructure to support interaction with humans
  – Event Notification: Orion calls back to deliver information
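A delegated task can be modeled as a trigger predicate plus an action, checked against current conditions. This is a hypothetical sketch of the idea, not Orion's actual design; the state keys and message strings are invented.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DelegatedTask:
    """A user-specified off-line task: when `trigger` fires, run `action`."""
    description: str
    trigger: Callable[[dict], bool]   # predicate over current conditions
    action: Callable[[], str]         # notification to deliver

tasks = [
    DelegatedTask(
        description="Call me every morning at 6 and tell me the weather in Boston.",
        trigger=lambda state: state["hour"] == 6,
        action=lambda: "Calling: the Boston forecast is ...",
    ),
    DelegatedTask(
        description="E-mail me between 4 and 6 p.m. if Route 93 is at a standstill.",
        trigger=lambda state: 16 <= state["hour"] < 18
                              and state["route_93"] == "standstill",
        action=lambda: "E-mailing: Route 93 traffic alert.",
    ),
]

def check_tasks(state):
    """Event notification: fire every task whose trigger matches."""
    return [t.action() for t in tasks if t.trigger(state)]

print(check_tasks({"hour": 17, "route_93": "standstill"}))
# -> ['E-mailing: Route 93 traffic alert.']
```

The interesting part of the real system is that both the specification and the notification happen through the same spoken-dialogue infrastructure; the sketch only captures the stored task and the periodic check.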

Audio Visual Integration

• Audio and visual signals both contain information about:
  – Identity of the person: Who is talking?
  – Linguistic message: What's (s)he saying?
  – Emotion, mood, stress, etc.: How does (s)he feel?
• The two channels of information:
  – Are often inter-related
  – Are often complementary
  – Must be consistent
• Integration of these cues can lead to enhanced capabilities for future human-computer interfaces

Audio Visual Symbiosis

[Diagram: the acoustic signal feeds Speaker ID, Speech Recognition, and acoustic paralinguistic detection; the visual signal feeds Face ID, Lip/Mouth Reading, and visual paralinguistic detection. Fusing the two channels yields robust person ID (personal identity), robust ASR (linguistic message), and robust paralinguistic detection (paralinguistic information).]

Multi-modal Interfaces: Beyond Clicking

• Inputs need to be understood in the proper context
  Does this gesture mean "yes," "one," or something else?
• Timing information is a useful way to relate inputs
  "Move this one over there": where is she looking or pointing while saying "this" and "there"?
  "Are there any over here?": what does he mean by "any," and what is he pointing at?

Multi-modal Fusion: Initial Progress

• All multi-modal inputs are synchronized
  – The speech recognizer generates absolute times for words
  – Mouse and gesture movements generate {x, y, t} triples
  – The Network Time Protocol (NTP) is used for millisecond time resolution
• Speech understanding constrains gesture interpretation
  – Initial work identifies an object or a location from gesture inputs
  – Speech constrains what, when, and how items are resolved
  – Object resolution also depends on information from the application

[Timeline: the spoken words "Move this one over here" are aligned in time with pointing gestures selecting an object ("this one") and a location ("over here").]
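The timeline above can be sketched as a nearest-timestamp match between deictic words and gesture events. The word timings, gesture triples, and deictic word list below are illustrative data, not output from the actual recognizer or tracker.

```python
# Word timings from a speech recognizer: (word, time in seconds).
words = [("move", 0.0), ("this", 0.4), ("one", 0.6),
         ("over", 1.0), ("here", 1.3)]

# Gesture events from a tracker: (x, y, t) triples.
gestures = [(120, 80, 0.45), (300, 210, 1.35)]

# Deictic words whose referents must come from another modality.
DEICTIC = {"this", "that", "here", "there"}

def resolve_deixis(words, gestures):
    """Pair each deictic word with the gesture closest to it in time."""
    pairs = {}
    for word, t in words:
        if word in DEICTIC:
            x, y, gt = min(gestures, key=lambda g: abs(g[2] - t))
            pairs[word] = (x, y)
    return pairs

print(resolve_deixis(words, gestures))
# -> {'this': (120, 80), 'here': (300, 210)}
```

This is only the timing half of the problem; as the slide notes, speech understanding and the application itself further constrain which object or location a gesture can denote.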

Multi-modal Demonstration

• Manipulating planets in a solar-system application
• Created with the SpeechBuilder utility with small changes
• Gestures from vision (Darrell & Demirdjien)

Summary

• Speech and language are inevitable, given:
  – The need for mobility and connectivity
  – The miniaturization of computers
  – Humans' innate desire to speak
• Progress has been made, e.g.:
  – Understanding and responding in constrained domains
  – Incorporating multiple languages and modalities
  – Automation and delegation
  – Rapid system configuration
• Much interesting research remains, e.g.:
  – Audiovisual integration
  – Perceptual user interfaces

The Spoken Language Systems Group

Research: Scott Cyphers, James Glass, T.J. Hazen, Lee Hetherington, Joseph Polifroni, Shinsuke Sakai, Stephanie Seneff, Michelle Spina, Chao Wang, Victor Zue

Administrative: Marcia Davidson

Visitors: Paul Brittain, Thomas Gardos, Rita Singh

Post-Doctoral: Tony Ezzat

Ph.D.: Edward Filisko, Karen Livescu, Alex Park, Mitchell Peabody, Ernest Pusateri, Han Shu, Min Tang, Jon Yi

S.M.: Alicia Boozer, Brooke Cowan, John Lee, Laura Miyakawa, Ekaterina Saenko, Sy Bor Wang

M.Eng.: Chian Chu, Chia-Huo La, Jonathon Lau