Speech Technologies and VoiceXML
Chun-Feng Liao
NCCU Department of Computer Science
Intelligent Media Lab
[email protected]
Presentation Agenda
Voice technologies
• Backgrounds
• ASR/TTS
Voice browsing with VoiceXML
VoiceXML architecture
VoiceXML Programming
Future of VoiceXML
Summary
Voice Technologies
In the mid- to late 1990s, personal computers became powerful enough to support automatic speech recognition (ASR).
The two key underlying technologies behind these advances are speech recognition (SR) and text-to-speech synthesis (TTS).
Speech Recognition
[Figure] Source: Microsoft Speech.NET Home (http://www.microsoft.com/speech/)
Speech Synthesis
[Figure] Source: Microsoft Speech.NET Home (http://www.microsoft.com/speech/)
Pervasive Computing Model
E-business has shifted from the client-server model to the web-centric model.
Once connected to the Internet, people can get any information they want, but they want more convenient ways to connect.
Lou Gerstner, CEO of IBM: the pervasive computing model is a billion people interacting with a million e-businesses, with a trillion devices interconnected.
Voice Browsing
VoiceXML instead of HTML
A voice browser instead of an ordinary web browser
A phone instead of a PC
VoiceXML Key Design Issues
Speech Input: speech recognition and DTMF
Speech Output: pre-recorded audio and synthesized speech
Internet: XML, IP, HTTP, SSL, JavaScript
Telephony: call transfer, data passing
W3C Voice Browser Working Group
Founded May 1999
60 company members
Mission: standards group to prepare and review markup languages to enable Internet-based speech applications
http://www.w3.org/Voice
VoiceXML Forum
Industry group to promote VoiceXML
550+ member companies
Submitted VoiceXML 1.0 to W3C in May 2000
http://www.voicexml.org
VoiceXML v1.0 (May 2000)
• VoiceXML Forum
• Specification submitted to the W3C
VoiceXML v2.0
• W3C Voice Browser Working Group
• 50+ members collaborating
• Addressed 400+ change requests
VoiceXML Overview
A language for specifying voice dialogs.
Voice dialogs use audio prompts and text-to-speech (TTS) for output; touch-tone keys (DTMF) and automatic speech recognition (ASR) for input.
Main input/output device (initially) is the phone.
Leverages the Internet for application development and delivery.
Standard language enables portability. (VoiceXML standardizes the dialog description language.)
VoiceXML Platform Architecture
VoiceXML Platform Architecture-1
Telephone and telephone network
• Connects the caller's telephone with the telephony server
VoiceXML Gateway
• Voice browser
• Audio input: speech recognition (ASR), touch-tone (DTMF), audio recording
• Audio output: audio playback, speech synthesis (TTS)
• Interface, call controls
VoiceXML Platform Architecture-2
VoiceXML documents
• Dialog and flow control
• Client-side scripting (ECMAScript)
• Speech recognition grammar
• Speech synthesis pronunciation control
Document servers (web server)
• Serve static VoiceXML documents or audio files
Application servers
• Generate VoiceXML documents dynamically
• Server-side application logic
• Connect to a database or a database interface
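Example: weather.jsp - VoiceXML and JSP
[Diagram components: VoiceXML browser, Voice Gateway, Web server + Servlet/JSP engine, DB]
JSP source on the web server:
<% user.storePreference("try"); %>
<form>
  <block>Today's temperature is <%= weather.getTemp() %> degrees</block>
</form>
VoiceXML received by the VoiceXML browser:
<form>
  <block>Today's temperature is 25 degrees</block>
</form>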
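The application-server role in this example (filling a VoiceXML template with a value computed on the server) can be sketched outside JSP as well. The following is a minimal Python sketch, not the presenter's code: `get_temperature` is a hypothetical stand-in for `weather.getTemp()`, and the surrounding `<vxml>` wrapper is added here only so that the output is a complete document.

```python
from xml.sax.saxutils import escape

# Template for the generated document; only {message} is substituted.
VXML_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block>{message}</block>
  </form>
</vxml>"""


def get_temperature():
    """Hypothetical stand-in for weather.getTemp(); a real server might query the DB."""
    return 25


def render_weather_vxml(temperature):
    """Return a VoiceXML document that announces the given temperature."""
    # Escape &, < and > so the dynamic text cannot break the XML.
    message = escape(f"Today's temperature is {temperature} degrees")
    return VXML_TEMPLATE.format(message=message)


if __name__ == "__main__":
    print(render_weather_vxml(get_temperature()))
```

The voice browser only ever sees the rendered document, so from its point of view a dynamically generated page and a static file are indistinguishable.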
Implementations of VoiceXML Gateways
In Taiwan:
• Yes Mobile
• Chunghwa Telecom Laboratories (second-generation voice platform)
• eWings Technologies, Inc.
Free
• IBM VoiceServerSDK
Open Source
• CMU: OpenVXI
[DEMO] A Simple VoiceXML Application
A simple VoiceXML application that introduces the Department of Computer Science.
Experience shows that building a corresponding HTML version first is helpful.
Document
A VoiceXML document defines one or more dialogs
The user is always in one dialog at any time
Each dialog specifies the next dialog to transition to using a URL
[Diagram: doc1.vxml contains Dialog 1 and Dialog 2]
Transition within the document: #dialog 2
Transition to another document: http://xyz.com/doc2.vxml
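The two kinds of transition in the diagram differ only in how the target URI resolves against the current document. A minimal Python sketch of that resolution follows; the function name `resolve_transition`, the base URL `http://xyz.com/doc1.vxml`, and the dialog id `dialog2` (written without a space) are assumptions made for illustration.

```python
from urllib.parse import urldefrag, urljoin


def resolve_transition(current_doc_url, next_uri):
    """Resolve a transition target against the current document.

    Returns (document_url, dialog_id, same_document).
    """
    # A fragment-only URI such as "#dialog2" resolves to the current document.
    absolute = urljoin(current_doc_url, next_uri)
    document_url, dialog_id = urldefrag(absolute)
    current_url, _ = urldefrag(current_doc_url)
    return document_url, dialog_id, document_url == current_url


if __name__ == "__main__":
    base = "http://xyz.com/doc1.vxml"
    print(resolve_transition(base, "#dialog2"))
    print(resolve_transition(base, "http://xyz.com/doc2.vxml"))
```

When `same_document` is true the browser can jump to the named dialog without fetching anything; otherwise it has to request the new document from the server.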
Dialog
A Dialog describes an interaction between a user and the system
Two kinds of dialogs: form and menu
VoiceXML Document Structure.
Form
A form keeps collecting the information for its fields, as defined by their grammars.
<form>
  <field name="travellers">
    <!-- input -->
    <grammar mode="voice" src="./number.grxml"/>
    <!-- output -->
    <prompt>How many are travelling?</prompt>
    <!-- eval -->
    <filled>
      <submit next="http://travel.com/order"/>
    </filled>
  </field>
</form>
Menu
<menu id="commands">
  <prompt>What service would you like?</prompt>
  <choice next="/cars">Car hire</choice>
  <choice next="/hotels">Hotel reservations</choice>
  <choice next="/news">Today's news</choice>
</menu>
A menu is essentially shorthand for a form with a single anonymous field.
A menu is a flow-control construct: depending on the user's choice, it transitions to a different URL.
Submit
Typically used to send results from client to server
Syntax: <submit next="URI" namelist="var1 var2 ..."/>
namelist: specifies the fields to send to the next page.
Submit, Example
<form>
  <!-- Field names must be valid ECMAScript identifiers, hence dest_city. -->
  <field name="dest_city">
    <prompt>Where do you want to go to?</prompt>
    <grammar mode="voice" src="./cities.grxml"/>
  </field>
  <field name="travellers">
    <prompt>How many are travelling to <value expr="dest_city"/>?</prompt>
    <grammar mode="voice" src="./number.grxml"/>
  </field>
  <filled>
    Thank you. Your order is now being processed.
    <submit next="http://travel.com/order" namelist="dest_city travellers"/>
  </filled>
</form>
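To make the effect of namelist concrete, here is a minimal Python sketch of the request a browser might build for this submit, assuming the default HTTP GET method. The function name `build_submit_url` and the sample values are hypothetical; field names are written with underscores because they need to be valid ECMAScript identifiers.

```python
from urllib.parse import urlencode


def build_submit_url(next_uri, variables, namelist):
    """Build the GET URL for a <submit>, sending only the variables in namelist."""
    names = namelist.split()  # namelist is a space-separated list of names
    pairs = [(name, variables[name]) for name in names]
    return f"{next_uri}?{urlencode(pairs)}"


if __name__ == "__main__":
    # Hypothetical dialog state after both fields are filled.
    state = {"dest_city": "London", "travellers": 3, "user1": "peter"}
    print(build_submit_url("http://travel.com/order", state, "dest_city travellers"))
```

Variables that are not listed (here `user1`) stay on the client, which is the point of namelist.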
Variables
Variables can be manipulated and referenced
• Declare: <field name="user2">
• Assign: <assign name="user1" expr="'peter'"/>
• Clear: <clear namelist="user1 user2"/>
• Reference: How many are travelling to <value expr="dest_city"/>? (no $ prefix is needed when referencing)
Variable Scope
Scopes, from outermost to innermost: session, application, document, dialog, and an anonymous scope.
Session variables are "read-only" variables provided by the interpreter context.
The anonymous scope is defined by the element containing executable content (<block>, <filled> or event handler).
Search for variable name: starts in the innermost scope and proceeds outward.
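The name search through nested scopes can be modelled with a chain of dictionaries. This is a minimal Python sketch, assuming the search runs from the innermost scope (the element containing executable content, called `anonymous` here) outward to `session`; all variable names and values are hypothetical.

```python
from collections import ChainMap

# One dictionary per scope; names and values are made up for illustration.
session = {"caller_id": "unknown"}      # provided by the interpreter context
application = {"greeting": "hello"}
document = {"greeting": "welcome"}      # shadows the application-level value
dialog = {"travellers": 3}
anonymous = {"travellers": 4}           # e.g. a variable declared inside <filled>

# ChainMap searches its mappings left to right, so the innermost scope goes first.
scopes = ChainMap(anonymous, dialog, document, application, session)


def lookup(name):
    """Return the value of the innermost variable called `name`."""
    return scopes[name]


if __name__ == "__main__":
    print(lookup("travellers"), lookup("greeting"), lookup("caller_id"))
```

A name declared in an inner scope hides the same name further out, which is why `lookup("greeting")` finds the document-level value rather than the application-level one.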
Error Handling: Events
Events are used to signal "unexpected" situations.
Events are caught by a <catch> event handler:
• <catch event="com.acme.mailreader">...</catch>
• <catch event="nomatch noinput">...</catch>
• Shortcut: <nomatch> is equivalent to <catch event="nomatch">
• Other shortcuts: <noinput>, <error>
Events, Example
<field name="dest_city">
  <prompt>Where do you want to go to?</prompt>
  <grammar mode="voice" src="./cities.grxml"/>
  <nomatch>Please say the city you want to fly to.</nomatch>
</field>
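How a handler is chosen for a thrown event can be sketched as a name-matching rule. The Python sketch below assumes the VoiceXML 2.0 behaviour as I understand it: a catch matches when one of its space-separated event names equals the thrown name or is a prefix of it on dot-separated token boundaries. The function name `catch_matches` is hypothetical, and the sketch ignores handler ordering and scoping.

```python
def catch_matches(catch_event_attr, thrown_event):
    """Return True if a <catch event="..."> attribute matches the thrown event name."""
    thrown_tokens = thrown_event.split(".")
    for name in catch_event_attr.split():   # the attribute may list several names
        name_tokens = name.split(".")
        # Exact match, or prefix match on whole dot-separated tokens.
        if thrown_tokens[:len(name_tokens)] == name_tokens:
            return True
    return False


if __name__ == "__main__":
    print(catch_matches("nomatch noinput", "noinput"))
    print(catch_matches("error", "error.badfetch"))
    print(catch_matches("error.bad", "error.badfetch"))
```

Matching on whole tokens means a handler for `error` also covers `error.badfetch`, while `error.bad` does not, since `bad` is not a complete token of the thrown name.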
Multimodal Web Browsing
XHTML + VoiceXML
SALT
[DEMO] Multimodal Browsing
Future of the "Voice" web and VoiceXML
[Roadmap diagram; recoverable labels:]
• VoiceXML 1.0: VoiceXML Forum (2000)
• VoiceXML 2.0: W3C (2003, in CR)
• W3C: Speech synthesis (SSML), Speech reco. grammar, NLP, Speech semantics, Pronunciation lexicon [early], Call control [early], Voice Browser interoperation [early]
• SALT (Speech Application Language Tags): Microsoft-led (2002)
• JSML, JSGF: Sun/SpeechWorks (1999)
• VoiceXML 3?
Conclusion
Speech is the most natural way for humans to communicate, so it will become an important modality in HCI.
VoiceXML has changed how speech recognition and telephony applications are developed and deployed.
Q & A
Backup
History of VoiceXML
Source: VoiceXML Forum (http://www.voicexml.org)
Classification of Voice Application
Basic interactive voice response (IVR)
• Computer: "For stock quotes, press 1. For trading, press 2. …"
• Human: (presses DTMF "1")
Basic speech ASR
• C: "Say the stock name for a price quote."
• H: "Lucent Technologies"
Advanced speech ASR
• C: "Stock Services, how may I help you?"
• H: "Uh, what's Lucent trading at?"
"Near-natural language" ASR
• C: "How may I help you?"
• H: "Um, yeah, I'd like to get the current price of Lucent Technologies"
• C: "Lucent is up two at sixty eight and a half."
• H: "OK. I want to buy one hundred shares at market price."
• C: "…"
Speech Recognition
• Capturing speech (analog) signals
• Digitizing the sound waves and converting them to basic language units, or phonemes
• Constructing words from phonemes, and contextually analyzing the words to ensure correct spelling for words that sound alike (such as write and right)
Speech Synthesis
Speech synthesis, or text-to-speech, is the process of converting text into spoken language:
• Breaking down the words into phonemes
• Analyzing the text for special handling, such as numbers and currency amounts
• Generating the digital audio for playback
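The "special handling" step for numbers and currency amounts can be illustrated with a toy text normalizer. This Python sketch is an assumption about what such a step might look like, not how any particular TTS engine works: it only spells out whole numbers from 0 to 99 and dollar amounts in that range, and leaves everything else untouched.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer between 0 and 99."""
    if not 0 <= n <= 99:
        raise ValueError("only 0..99 is supported in this sketch")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def _currency(match):
    amount = int(match.group(1))
    unit = "dollar" if amount == 1 else "dollars"
    return f"{number_to_words(amount)} {unit}"


def normalize(text):
    """Rewrite small dollar amounts and numbers as words before phoneme conversion."""
    # Currency first, so that "$25" becomes "twenty-five dollars".
    text = re.sub(r"\$(\d{1,2})\b", _currency, text)
    # Then any remaining standalone one- or two-digit numbers.
    return re.sub(r"\b\d{1,2}\b", lambda m: number_to_words(int(m.group())), text)


if __name__ == "__main__":
    print(normalize("It costs $25 for 3 tickets"))
```

Real synthesizers handle far more cases (decimals, dates, ordinals, other currencies); the sketch only shows why this analysis has to happen before the text is turned into sound.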
VoiceXML Gateway (detail)
Programming VoiceXML
Writing a VoiceXML application is programming.
Control constructs are procedural (if-else etc.)
VoiceXML platform iterates through a <form> until values for all field items have been collected
VoiceXML System Components
VoiceXML servers serve as integrators of various hardware and software:
• Telecom boards
• PBX
• CT integration
• Speech synthesis (TTS)
• Speech recognition (SR)
• Speech grammars
• Voice biometrics
• Software utilities
• Call centre
FIA - Form Interpretation Algorithm
The FIA has a main loop that repeatedly selects a form item and then visits it.
The first form item (in document order) whose form item variable is undefined is selected.
As a result, the user is prompted for each field item in turn.
FIA - Form Example
<form>
  <block>
    <prompt>Where do you want to go to and how many are travelling?</prompt>
  </block>
  <!-- Field item 1 -->
  <field name="dest_city">
    <prompt>Where do you want to go to?</prompt>
    <grammar mode="voice" src="./cities.grxml"/>
  </field>
  <!-- Field item 2 -->
  <field name="travellers">
    <prompt>How many are travelling to your destination?</prompt>
    <grammar mode="voice" src="./number.grxml"/>
  </field>
  <!-- other fields -->
</form>
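The select-and-visit loop of the FIA can be sketched in a few lines. This Python sketch is a simplification under stated assumptions: it models only field items, treats `None` as "undefined", and uses a hypothetical `collect` callback in place of prompting and recognition. Guard conditions, events and form-level grammars are left out.

```python
def run_fia(form_items, collect):
    """Simplified Form Interpretation Algorithm.

    form_items: field names in document order.
    collect:    callback that prompts for a field and returns the recognized
                value, or None when nothing was recognized.
    Returns (variables, visited), where visited records the order of visits.
    """
    variables = {name: None for name in form_items}   # None models "undefined"
    visited = []
    while True:
        # Select phase: first item in document order whose variable is undefined.
        selected = next((n for n in form_items if variables[n] is None), None)
        if selected is None:
            break                                      # every field is filled
        # Collect phase: visit the item (prompt the user and listen).
        visited.append(selected)
        variables[selected] = collect(selected)
    return variables, visited


if __name__ == "__main__":
    # Scripted caller: the first answer for dest_city is not recognized.
    answers = {"dest_city": iter([None, "London"]), "travellers": iter([3])}
    print(run_fia(["dest_city", "travellers"], lambda name: next(answers[name])))
```

Because an unrecognized answer leaves the variable undefined, the same field is selected again on the next pass, which is how re-prompting falls out of the loop without extra control flow.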
if, else and elseif
<form>
  ...
  <filled>
    <if cond="travellers > 10">
      Sorry, we cannot handle groups larger than 10 persons
      <clear namelist="travellers"/>
    <elseif cond="travellers > 5 &amp;&amp; dest_city == 'London'"/>
      Sorry, we cannot handle groups larger than 5 persons travelling to London
      <clear namelist="dest_city travellers"/>
    <else/>
      <submit next="http://travel.com/order"/>
    </if>
  </filled>
</form>
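Since these control constructs are procedural, the same decision logic reads naturally as ordinary code. Below is a minimal Python sketch of the three branches; the function name `check_order` and the shape of its return value are assumptions made for illustration.

```python
def check_order(travellers, dest_city):
    """Mirror the <if>/<elseif>/<else> branches of the <filled> block.

    Returns (message, fields_to_clear, submit).
    """
    if travellers > 10:
        return ("Sorry, we cannot handle groups larger than 10 persons",
                ["travellers"], False)
    elif travellers > 5 and dest_city == "London":
        return ("Sorry, we cannot handle groups larger than 5 persons "
                "travelling to London",
                ["dest_city", "travellers"], False)
    else:
        # No message; the form data is submitted to the server.
        return ("", [], True)


if __name__ == "__main__":
    print(check_order(12, "Paris"))
    print(check_order(6, "London"))
    print(check_order(3, "London"))
```

Clearing a field makes its variable undefined again, so the form loop revisits that field and re-prompts the caller.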
JSML - JSpeech Markup Language
Developed by Sun and SpeechWorks, as a markup language for text-to-speech dialogs.
Based on the Java Speech API Markup Language
http://java.sun.com/products/java-media/speech/
Text annotation to provide hints to speech synthesizers
• Aimed at making TTS speech more natural and more understandable
Feature set:
• Hints to word pronunciation
• Hints to phrasing, emphasis, pitch and speaking rate
• "Marker" elements: notifications from the speech synthesizer to applications when a marker is reached
JSGF - JSpeech Grammar Format
Developed by Sun and SpeechWorks as a syntax for expressing speech grammars
Based on the Java Speech API Grammar Format
http://java.sun.com/products/java-media/speech/
Microsoft's SALT
Speech Application Language Tags
• Microsoft, Cisco, Intel, Comverse, SpeechWorks, Philips
A "lightweight" set of tags designed to be used with HTML and XHTML to enable lightweight telephony applications driven from regular Web documents.
Targeted at supporting multimodal access