SpeechBuilder: Facilitating Spoken Dialogue System Creation
Eugene Weinstein
Project Oxygen Core Team
MIT Laboratory for Computer Science
Eugene Weinstein – MIT Lab for Computer Science Oxygen Alliance 2003 Workshop – February 24-28, 2003
Bridging the Experience Gap
• Developing robust, mixed-initiative spoken dialogue systems is difficult
– Complex systems can be created by human-language technology experts
– Novice developers must overcome a considerable technical challenge
• SpeechBuilder aims to help novices rapidly create speech-based systems
– Uses intuitive methods for specifying domain-specific constraints
– Automatically configures HLT components using MIT GALAXY architecture
* Leverages future technical advances
* Encourages research on portability
[Figure: GALAXY architecture. SpeechBuilder configures a central Hub that connects the Speech Synthesis, Language Generation, Dialogue Management, Context Resolution, Language Processing, Speech Recognition, Database Server, and Audio components.]
Baseline Configuration
• Gives developer total control over application functionality
• Communication with Galaxy via simple HTTP protocol
“Turn on the lights in the kitchen”
action=set&frame=(object=lights, room=kitchen, value=on)
“Show me the banks on Main Street”
action=identify&frame=(object=(type=bank, on=(street=Main, ext=Street)))
[Figure: baseline configuration. The Hub connects the SpeechBuilder Server, CGI Parameter Generation, Speech Recognition, Speech Synthesis, Language Processing, and Audio Server; the Developer Application is reached over HTTP.]
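The two example parameter strings above can be unpacked with a small recursive parser. The following is a minimal sketch, assuming the frame value is always a parenthesised, comma-separated list of key=value pairs with no escaping and no "&" inside the frame; the slides do not state the full encoding rules, and the function names parse_query and _parse_group are hypothetical.

```python
def _parse_group(text):
    """Parse '(key=value, key=(...), ...)' and return (dict, remaining text)."""
    if not text.startswith("("):
        raise ValueError("group must start with '('")
    text = text[1:]
    result = {}
    while True:
        text = text.lstrip(" ,")
        if text.startswith(")"):
            return result, text[1:]
        key, text = text.split("=", 1)
        text = text.lstrip()
        if text.startswith("("):
            # Nested group: recurse and continue after its closing parenthesis.
            value, text = _parse_group(text)
        else:
            # Plain value: runs up to the next ',' or ')'.
            end = min(i for i in (text.find(","), text.find(")")) if i != -1)
            value, text = text[:end].strip(), text[end:]
        result[key.strip()] = value


def parse_query(query):
    """Split 'action=...&frame=(...)' into (action, nested dict)."""
    fields = dict(part.split("=", 1) for part in query.split("&"))
    frame, rest = _parse_group(fields["frame"].strip())
    if rest.strip():
        raise ValueError("unexpected text after frame")
    return fields["action"], frame


if __name__ == "__main__":
    print(parse_query("action=set&frame=(object=lights, room=kitchen,value=on)"))
    print(parse_query(
        "action=identify&frame=( object=(type=bank, on=(street=Main, ext=Street)))"
    ))
```

A CGI application would typically branch on the returned action and then read the nested dictionary, for example frame["object"]["on"]["street"] for the bank query.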
Modified Baseline Configuration (this class)
• Still gives developer total control over application functionality
• Frame Relay server exposes Galaxy meaning representation to app
“Turn on the lights in the kitchen”
{c turn_management
:parse_frame {c turn
:object “lights” :room “kitchen”
:value “on”}}
“Show me the banks on Main Street”
{c turn_management
:parse_frame {c identify
:type “bank”
:pred {p :on {:street “Main” :ext “Street”}}}}
[Figure: modified baseline configuration. The Hub connects the Frame Relay Server, CGI Parameter Generation, Speech Recognition, Speech Synthesis, Language Processing, and Audio Server; the Frame Relay Server passes the semantic frame to the Developer Application over a TCP socket.]
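To make the frame notation concrete, here is a minimal sketch that reads a frame shaped like the first example above ("Turn on the lights in the kitchen") into nested Python dictionaries. It assumes straight double quotes, balanced braces, and the pattern "{type name :key value ...}"; it is not the Galaxy library's own parser, does not handle the predicate form in the second example, and the names parse_frame and _parse are hypothetical.

```python
import re

# Tokens: braces, double-quoted strings, or runs of other non-space characters.
TOKEN = re.compile(r'\{|\}|"[^"]*"|[^\s{}]+')


def _parse(tokens, pos):
    """Parse one '{type name :key value ...}' frame starting at tokens[pos]."""
    if tokens[pos] != "{":
        raise ValueError("frame must start with '{'")
    frame = {"type": tokens[pos + 1], "name": tokens[pos + 2], "keys": {}}
    pos += 3
    while tokens[pos] != "}":
        key = tokens[pos]
        if not key.startswith(":"):
            raise ValueError(f"expected :key, got {key!r}")
        if tokens[pos + 1] == "{":
            # Value is a nested frame.
            value, pos = _parse(tokens, pos + 1)
        else:
            # Value is a single token; drop surrounding quotes if present.
            value, pos = tokens[pos + 1].strip('"'), pos + 2
        frame["keys"][key[1:]] = value
    return frame, pos + 1


def parse_frame(text):
    """Return the frame in text as nested dicts with type, name, and keys."""
    tokens = TOKEN.findall(text)
    frame, pos = _parse(tokens, 0)
    if pos != len(tokens):
        raise ValueError("trailing tokens after frame")
    return frame


if __name__ == "__main__":
    example = ('{c turn_management :parse_frame {c turn '
               ':object "lights" :room "kitchen" :value "on"}}')
    print(parse_frame(example)["keys"]["parse_frame"]["keys"])
```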
Database Access Configuration **
• For a speech-based interface to structured data
• No programming required; specify table(s) and constraints
[Figure: database access configuration. The Hub connects the Database Server, Language Generation, Speech Recognition, Discourse Resolution, Speech Synthesis, Dialogue Management, Language Processing, I/O Server, and Audio Server; the diagram also carries the label INFO.]
Creating a Speech-Based Application
Step 1: Off-line creation and compilation
Step 2: On-line deployment
[Figure: workflow diagram. Labels include Upload and Compile; the Hub with ASR, NLU, Discourse, Dialog, NLG, TTS, Audio, and SB servers; Query and Response messages; and INFO.]
Human Language Technologies
Audio Server
• Telephone or lightweight audio server
Database Server
• Accesses back-end database
Language Processing
• N-best interface with ASR
• Grammar from attributes & actions
• Backs off to concept spotting
Context Resolution
• New component performs concept inheritance & masking
• Processes ‘E-form’
Dialogue Management
• Generic server handles interaction
Speech Synthesis
• Commercial product
Language Generation
• Generates ‘E-form’, SQL, & responses
• Default entries made
Hub
• Galaxy programmable hub controls interactions between all components
Speech Recognition
• Generic acoustic models
• Unknown word model
• Class or hierarchical n-gram
Extracting Database Information **
• Some columns are used to access entries (e.g., Name)
– Column entries must be incorporated into ASR & NLU
• Some columns are only used in responses (e.g., Phone)
– Column names must be incorporated into ASR & NLU
Name              Phone    Email              Office
Jim Glass         x3-1640  [email protected]  603
Stephanie Seneff  x3-0451  [email protected]  643
Victor Zue        x3-8513  [email protected]  601a
“What is the phone number for Victor Zue?”
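The split between access columns and response columns can be illustrated as follows. This is a minimal sketch, assuming plain substring matching stands in for ASR and NLU; the function names extract_vocabulary and answer are hypothetical, and the Email values are omitted because the addresses are obscured in the source.

```python
ROWS = [
    {"Name": "Jim Glass", "Phone": "x3-1640", "Office": "603"},
    {"Name": "Stephanie Seneff", "Phone": "x3-0451", "Office": "643"},
    {"Name": "Victor Zue", "Phone": "x3-8513", "Office": "601a"},
]


def extract_vocabulary(rows, access_columns, response_columns):
    """Collect column entries for access columns and names for response columns."""
    entries = {col: sorted({row[col] for row in rows}) for col in access_columns}
    return {"entries": entries, "column_names": list(response_columns)}


def answer(utterance, rows, vocab, key_column="Name"):
    """Spot a known entry and a known column name, then look up the cell."""
    text = utterance.lower()
    entry = next((e for e in vocab["entries"][key_column] if e.lower() in text), None)
    column = next((c for c in vocab["column_names"] if c.lower() in text), None)
    if entry is None or column is None:
        return None
    row = next(r for r in rows if r[key_column] == entry)
    return row.get(column)


if __name__ == "__main__":
    vocab = extract_vocabulary(ROWS, ["Name"], ["Phone", "Email", "Office"])
    print(vocab["entries"]["Name"])
    print(answer("What is the phone number for Victor Zue?", ROWS, vocab))
```

The recognizer and parser need the entries of Name because users say them, while for Phone only the word "phone" has to be understood; the values themselves appear only in responses.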
Knowledge Representation
• Concepts and actions form basis for understanding
– Concepts become key/value entries in meaning representation
* city: Boston, New York… day: Monday, Tuesday
– Actions provide sentence-level patterns of specific queries
* “I want to fly from Boston to Taipei…” action=lookup_flight
– Action text can be bracketed to define hierarchical concepts **
* “I want to fly source=(from Boston) destination=(to Taipei)”
* source=Boston destination=Taipei
– Concepts and actions used to configure the following components
* Speech Recognition
* Natural Language Understanding
* Discourse
• Database columns define basic concepts
– Column names can be grouped into concepts
* property: phone, email… weather: snow, rain…
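The bracketed action text above can be read as "find a base concept value inside each named bracket". The following is a minimal sketch of that reading, not SpeechBuilder's actual grammar compiler: the function name parse_bracketed_action is hypothetical, and adding Taipei to the city concept is an assumption made so the slide's example resolves.

```python
import re

# Matches name=(...) spans with no nested parentheses.
BRACKET = re.compile(r"(\w+)=\(([^()]*)\)")

CONCEPTS = {
    "city": ["Boston", "New York", "Taipei"],  # Taipei added as an assumption
    "day": ["Monday", "Tuesday"],
}


def parse_bracketed_action(text, concepts):
    """Map each bracketed concept name to the base concept value found inside it."""
    result = {}
    for name, span in BRACKET.findall(text):
        for values in concepts.values():
            match = next((v for v in values if v in span), None)
            if match is not None:
                result[name] = match
                break
    return result


if __name__ == "__main__":
    example = "I want to fly source=(from Boston) destination=(to Taipei)"
    print(parse_bracketed_action(example, CONCEPTS))
```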
Language Modeling and Understanding
• By default, concepts are used for language modeling, parsing grammar, and meaning representation
• Concept usage can be fine-tuned to improve performance: **
– For language modeling and parsing grammar only (i.e., no meaning)
– For keyword spotting only (i.e., no role in language modeling)
– For fine-grained language modeling with coarser meaning representation
[Figure: weather vocabulary (rain, hail, snow, sprinkles, flurries, showers, breezy, rainy, snowy, snowfall, accumulation, rainfall, snowstorm, thunderstorm, blizzard); the query “Will it snow?” yields weather: snow.]
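The last option, fine-grained language modeling with a coarser meaning representation, amounts to many surface words sharing one concept value. A minimal sketch follows, assuming a simple word lookup; the grouping of words under "snow" and "rain" is an illustrative assumption rather than the deck's actual configuration, and the name interpret is hypothetical.

```python
import re

# Many recognizable words (fine-grained) collapse to a few values (coarse).
WEATHER_WORDS = {
    "snow": "snow", "snowy": "snow", "snowfall": "snow",
    "flurries": "snow", "snowstorm": "snow", "blizzard": "snow",
    "rain": "rain", "rainy": "rain", "rainfall": "rain",
    "showers": "rain", "sprinkles": "rain",
}


def interpret(utterance, concept, word_to_value):
    """Return {concept: value} for the first known word in the utterance."""
    for word in re.findall(r"[a-z]+", utterance.lower()):
        if word in word_to_value:
            return {concept: word_to_value[word]}
    return {}


if __name__ == "__main__":
    print(interpret("Will it snow?", "weather", WEATHER_WORDS))
    print(interpret("Any flurries tomorrow?", "weather", WEATHER_WORDS))
```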
Current Status
• SpeechBuilder has been operational for over two years
– Used by over 50 developers from MIT and elsewhere
– Used in undergraduate classes at MIT and Georgetown University
• ASR capabilities benchmarked against main systems
– Achieves same ASR performance as MIT Jupiter weather information system (6.8% word error rate on clean data) (phone #)
• Several prototype systems have been developed
– Information about faculty, staff and students at LCS and AI Labs (phone, email, room, voice messages, transfer, etc.)
– Application to control the various physical items in a typical office (lights, curtains, TV, VCR, projector, etc.)
– Others include TV schedules, real-time weather forecasts, hotel and restaurant information, etc.
• SpeechBuilder used for initial design of many more complex domains
Ongoing and Future Work
• Increase sophistication of discourse and dialogue manager to handle more complex dialogues
– Enable finer specification of discourse capabilities
– Add generic capabilities for times, dates, etc.
• Incorporate confidence scoring and implement unsupervised training of acoustic and language models
• Create functionality to allow developers to create domain-specific concatenative speech synthesis
• Create alternative methods of domain specifications to streamline development
– Advanced developers don’t necessarily use web interface
– Allow for more efficient automatic generation of SpeechBuilder domains
Acknowledgements
Issam Bazzi
Scott Cyphers
Ed Filisko
Jim Glass
TJ Hazen
Lee Hetherington
Joe Polifroni
Stephanie Seneff
Michelle Spina
Eugene Weinstein
Jon Yi
Misha Zitser
SpeechBuilder Hands-on Activity
Eugene Weinstein
Project Oxygen Core Team
MIT Laboratory for Computer Science
Modified Baseline Configuration (this class)
• Still gives developer total control over application functionality
• Frame Relay server exposes Galaxy meaning representation to app
[Figure: modified baseline configuration. The Hub connects the Frame Relay Server, CGI Parameter Generation, Speech Recognition, Speech Synthesis, Language Processing, and Audio Server; the Frame Relay Server passes the semantic frame to the Developer Application over a TCP socket. The diagram also carries the label “Jaim”.]
SpeechBuilder API
• Galaxy meaning representation provided through frame relay
• Applications connect via TCP sockets
• API provided in Perl, Python, and Java
– This class: Python API
Python class galaxy.server.Server methods:
Constructor(machine, port, ID)
connect()
processMessage(blocking)
disconnect()
Python class galaxy.frame.Frame methods:
getAction()
getAttribute(attr_name)
getText()
toString()
[Figure: the Application uses the Python API to reach the Galaxy Frame Relay over a TCP socket.]
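The control flow an application might follow with these two classes can be sketched with stand-ins. The classes below are not the real galaxy.server.Server and galaxy.frame.Frame: only the method names come from the slide, while the return values, the meaning of the blocking flag, the canned_frames argument, and the host and port are assumptions made so the example runs without a frame relay.

```python
class Frame:
    """Stand-in for galaxy.frame.Frame (method names from the slide only)."""

    def __init__(self, action, attributes, text):
        self._action = action
        self._attributes = dict(attributes)
        self._text = text

    def getAction(self):
        return self._action

    def getAttribute(self, attr_name):
        # Assumption: a missing attribute yields None.
        return self._attributes.get(attr_name)

    def getText(self):
        return self._text

    def toString(self):
        pairs = ", ".join(f"{k}={v}" for k, v in self._attributes.items())
        return f"{self._action}({pairs})"


class Server:
    """Stand-in for galaxy.server.Server that replays canned frames.

    The real class connects to the frame relay over a TCP socket; this one
    never opens a connection.
    """

    def __init__(self, machine, port, ID, canned_frames=()):
        self.machine, self.port, self.ID = machine, port, ID
        self._pending = list(canned_frames)  # hypothetical test hook
        self._connected = False

    def connect(self):
        self._connected = True

    def processMessage(self, blocking):
        # Assumption: returns the next Frame, or None when none is pending.
        if not self._connected:
            raise RuntimeError("connect() must be called first")
        return self._pending.pop(0) if self._pending else None

    def disconnect(self):
        self._connected = False


def handle(frame):
    """Application logic: turn a 'set' frame into a device command string."""
    if frame.getAction() == "set":
        return "{} in {} -> {}".format(
            frame.getAttribute("object"),
            frame.getAttribute("room"),
            frame.getAttribute("value"),
        )
    return "unhandled: " + frame.getText()


def run(server):
    """Connect, drain incoming frames, and collect the application's replies."""
    replies = []
    server.connect()
    try:
        frame = server.processMessage(True)
        while frame is not None:
            replies.append(handle(frame))
            frame = server.processMessage(True)
    finally:
        server.disconnect()
    return replies


if __name__ == "__main__":
    demo = Server("localhost", 9000, "demo", canned_frames=[
        Frame("set", {"object": "lights", "room": "kitchen", "value": "on"},
              "turn on the lights in the kitchen"),
    ])
    print(run(demo))
```

With the real API, the same loop shape would presumably apply, with the stand-in classes replaced by the ones provided for the class.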