Speech Technologies and VoiceXML

Chun-Feng Liao, NCCU Department of Computer Science

Intelligent Media [email protected]

Presentation Agenda

• Voice technologies background (ASR/TTS)
• Voice browsing with VoiceXML
• VoiceXML architecture
• VoiceXML programming
• Future of VoiceXML
• Summary

Reference

[1] Bob Edgar (2001), “The VoiceXML Handbook”, NY: CMP Books.

[2] Dave Raggett (2001), “Getting started with VoiceXML 2.0”, W3C.

[3] Sun Microsystems (1998), “Java Speech Grammar Format Specification v1.0”, Sun Microsystems.

[4] Chetan Sharma and Jeff Kunins (2002), “VoiceXML: Strategies and Techniques for Effective Voice Application Development with VoiceXML 2.0”, Wiley.

[5] Brian Eberman, Jerry Carter, Darren Meyer, David Goddeau (2002), “Building VoiceXML Browsers with OpenVXI”, NY: ACM Press.

[6] Microsoft (2002), “Speech Technology Overview”, http://www.microsoft.com/speech/evaluation/techover/

[7] VoiceGenie Technologies Inc. (2001), “White Paper: Speaking Freely About The VoiceGenie VoiceXML Gateway and the VoiceXML Interpreter”, VoiceGenie Technologies Inc.

[8] W3C (2002), “VoiceXML Specification v2.0”, W3C.

Voice Technologies

In the mid- to late 1990s, personal computers started to become powerful enough to support automatic speech recognition (ASR).

The two key underlying technologies behind these advances are speech recognition (SR) and text-to-speech synthesis (TTS).

Speech Recognition

Source: Microsoft Speech.NET Home (http://www.microsoft.com/speech/)

Speech Synthesis

Source: Microsoft Speech.NET Home (http://www.microsoft.com/speech/)


Pervasive Computing Model

E-business has moved from a client-server model to a web-centric model.

Once connected to the Internet, one can get any information one wants. But people want a more convenient way to connect to the Internet.

Lou Gerstner, CEO of IBM: the pervasive computing model is a billion people interacting with a million e-businesses with a trillion interconnected devices.

Voice Browsing

• VoiceXML instead of HTML
• A voice browser instead of an ordinary web browser
• A phone instead of a PC

VoiceXML Key Design Issues

Speech Input: speech recognition and DTMF

Speech Output: pre-recorded audio and synthesized speech

Internet: XML, IP, HTTP, SSL, JavaScript

Telephony: call transfer, data passing

W3C Voice Browser Working Group

• Founded May 1999
• 60 member companies
• Mission: a standards group to prepare and review markup languages that enable Internet-based speech applications
• http://www.w3.org/Voice

VoiceXML Forum

• Industry group to promote VoiceXML
• 550+ member companies
• Submitted VoiceXML 1.0 to W3C in May 2000
• http://www.voicexml.org

VoiceXML v1.0 (May 2000)
• VoiceXML Forum
• Specification submitted to the W3C

VoiceXML v2.0
• W3C Voice Browser Working Group
• 50+ members collaborating
• Addressed 400+ change requests

VoiceXML Overview

• A language for specifying voice dialogs.
• Voice dialogs use audio prompts and text-to-speech (TTS) for output; touch-tone keys (DTMF) and automatic speech recognition (ASR) for input.
• The main input/output device (initially) is the phone.
• Leverages the Internet for application development and delivery.
• A standard language enables portability (VoiceXML standardizes the dialog description language).

VoiceXML Platform Architecture

VoiceXML Platform Architecture-1

Telephone and telephone network: connects the caller’s telephone with the telephony server

VoiceXML Gateway
• Voice browser
• Audio input: speech recognition (ASR), touch-tone (DTMF), audio recording
• Audio output: audio playback, speech synthesis (TTS)
• Interface, call controls

VoiceXML Platform Architecture-2

VoiceXML documents
• Dialog and flow control
• Client-side scripting (ECMAScript)
• Speech recognition grammar
• Speech synthesis pronunciation control

Document servers (web servers)
• Serve static VoiceXML documents or audio files

Application servers
• Generate VoiceXML documents dynamically
• Server-side application logic
• Connect to a database or database interface

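Example

weather.jsp (VoiceXML and JSP), run by the web server + Servlet/JSP engine, which reads from the DB:

<% user.storePreference("try"); %>
<form>
  <block>Today's temperature is <%= weather.getTemp() %> degrees.</block>
</form>

Generated VoiceXML, as received by the VoiceXML browser in the voice gateway:

<form>
  <block>Today's temperature is 25 degrees.</block>
</form>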

Implementations of VoiceXML Gateways

In Taiwan:
• Yes Mobile
• Chunghwa Telecom Laboratories (second-generation voice platform)
• eWings Technologies, Inc.

Free:
• IBM VoiceServerSDK

Open source:
• CMU: OpenVXI

[DEMO] A Simple VoiceXML Application

Demo: a simple VoiceXML application that introduces the Department of Computer Science.

Experience shows that building a corresponding HTML version first is helpful.

Document

A VoiceXML document defines one or more dialogs

The user is always in one dialog at any time

Each dialog specifies the next dialog to transition to using a URL

[Figure: doc1.vxml contains Dialog 1 and Dialog 2. A transition to "#dialog 2" stays inside doc1.vxml; a transition to http://xyz.com/doc2.vxml moves to another document.]
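The two kinds of transition in the figure can be illustrated with a small Python sketch. It only models URL resolution, not a real voice browser; the function name `resolve_transition` and the dialog id `dialog2` are assumptions made for illustration.

```python
from urllib.parse import urldefrag, urljoin


def resolve_transition(current_doc, next_uri):
    """Resolve a dialog's next URI against the URL of the current document.

    Returns (document URL, dialog id, needs_fetch). An empty dialog id
    means no fragment was given in the URI.
    """
    absolute = urljoin(current_doc, next_uri)
    doc_url, dialog_id = urldefrag(absolute)
    # A fragment-only URI points at another dialog in the same document,
    # so nothing new has to be fetched from the server.
    needs_fetch = doc_url != urldefrag(current_doc).url
    return doc_url, dialog_id, needs_fetch


print(resolve_transition("http://xyz.com/doc1.vxml", "#dialog2"))
print(resolve_transition("http://xyz.com/doc1.vxml", "http://xyz.com/doc2.vxml"))
```

The first call stays in doc1.vxml and selects a dialog by id; the second one requires fetching doc2.vxml.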

Dialog

A Dialog describes an interaction between a user and the system

Two kinds of dialogs: form and menu

VoiceXML Document Structure.

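Form

A form keeps collecting the information for its fields, as defined by their grammars.

<form>
  <field name="travellers">
    <!-- input: the grammar defines what the caller may say -->
    <grammar mode="voice" src="./number.grxml"/>
    <!-- output: the prompt played to the caller -->
    <prompt>How many are travelling?</prompt>
    <!-- eval: executed once the field has been filled -->
    <filled>
      <submit next="http://travel.com/order"/>
    </filled>
  </field>
</form>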

Menu

<menu id="commands">
  What service would you like?
  <choice next="/cars"> Car hire </choice>
  <choice next="/hotels"> Hotel reservations </choice>
  <choice next="/news"> Today's news </choice>
</menu>

A menu is essentially shorthand for a form with a single anonymous field.

A menu is a flow-control construct: depending on the user's choice, it transitions to a different URL.
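As a rough illustration of the menu's flow control, the Python sketch below maps a recognized phrase to the next URL. It assumes exact phrase matching only, whereas a real voice browser matches speech against grammars generated from the choices; `MENU_CHOICES` and `choose_next` are hypothetical names.

```python
# Choices copied from the menu above: spoken phrase -> next URL.
MENU_CHOICES = {
    "car hire": "/cars",
    "hotel reservations": "/hotels",
    "today's news": "/news",
}


def choose_next(utterance, choices=MENU_CHOICES):
    """Return the URL for the recognized phrase, or None (a "nomatch")."""
    # Normalize case and whitespace before the lookup.
    phrase = " ".join(utterance.lower().split())
    return choices.get(phrase)


print(choose_next("Car hire"))
print(choose_next("Pizza delivery"))
```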

Submit

Typically used to send results from client to server

Syntax: <submit next="URI" namelist="var1 var2 ..."/>

namelist: specifies the fields to send to the next page.

Submit, Example

<form>
  <!-- Field names must be valid ECMAScript variable names (no hyphens). -->
  <field name="city">
    <prompt>Where do you want to go to?</prompt>
    <grammar mode="voice" src="./cities.grxml"/>
  </field>
  <field name="travellers">
    <prompt>How many are travelling to <value expr="city"/>?</prompt>
    <grammar mode="voice" src="./number.grxml"/>
  </field>
  <filled>
    Thank you. Your order is now being processed.
    <submit next="http://travel.com/order" namelist="city travellers"/>
  </filled>
</form>
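To make the effect of namelist concrete, here is a minimal Python sketch of the request URL a submit could produce, assuming the default HTTP GET method. The function `build_submit_url`, the variable name `city` and the sample values are illustrative assumptions, not part of the slides.

```python
from urllib.parse import urlencode


def build_submit_url(next_uri, namelist, variables):
    """Build a GET URL carrying only the variables named in namelist."""
    names = namelist.split()  # namelist is a space-separated list of names
    query = urlencode([(name, variables[name]) for name in names])
    return f"{next_uri}?{query}" if query else next_uri


# Hypothetical field values collected by the form.
collected = {"city": "London", "travellers": 3, "unused": "x"}
print(build_submit_url("http://travel.com/order", "city travellers", collected))
```

Only the variables listed in namelist appear in the query string; `unused` is not sent.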

Variables

Variables can be manipulated and referenced

• Declare: <field name="user2">
• Assign: <assign name="user1" expr="'peter'"/>
• Clear: <clear namelist="user1 user2"/>
• Reference: How many are travelling to <value expr="city"/>? (no $ prefix is needed when referencing)

Variable Scope

Scopes, outermost first:
• session: “read-only” variables provided by the interpreter context
• application
• document
• dialog
• (anonymous): scope defined by an element containing executable content (<block>, <filled> or event handler)

The search for a variable name starts in the current scope and proceeds outward.
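The scope search can be sketched in Python with `collections.ChainMap`, which looks a key up in a list of mappings from first to last. This is an analogy rather than how an interpreter is implemented, and all variable names and values below are hypothetical.

```python
from collections import ChainMap

# One dictionary per scope; contents are made up for illustration.
session = {"caller_id": "unknown"}
application = {"greeting": "Welcome"}
document = {"city": "Taipei"}
dialog = {"city": "London", "travellers": 3}
anonymous = {}

SCOPES = {"session": session, "application": application,
          "document": document, "dialog": dialog}

# Innermost scope first, so inner declarations shadow outer ones.
chain = ChainMap(anonymous, dialog, document, application, session)


def resolve(name):
    """Resolve a plain name through the chain, or a scope-qualified name directly."""
    prefix, _, rest = name.partition(".")
    if rest and prefix in SCOPES:
        return SCOPES[prefix][rest]
    return chain[name]


print(resolve("city"))           # found in the dialog scope first
print(resolve("document.city"))  # an explicit scope prefix bypasses shadowing
print(resolve("greeting"))       # falls through to the application scope
```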

Error Handling: Events

Events are used to signal “unexpected” situations

Events are caught by a <catch> event handler
• <catch event="com.acme.mailreader">...</catch>
• <catch event="nomatch noinput">...</catch>
• Shortcut: <nomatch> is equivalent to <catch event="nomatch">
• Other shortcuts: <noinput>, <error>

Events, Example

<field name="city">
  <prompt>Where do you want to go to?</prompt>
  <grammar mode="voice" src="./cities.grxml"/>
  <nomatch>Please say the city you want to fly to.</nomatch>
</field>
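A simplified Python sketch of how a catch handler's event attribute could be matched against a thrown event. It assumes the attribute is a space-separated list of names and that a name also catches more specific events that extend it with further dot-separated parts (for example, `error` catching `error.badfetch`). Handler scoping and count attributes are deliberately left out, and the function name `catches` is hypothetical.

```python
def catches(catch_event_attr, thrown_event):
    """Return True if a handler with this event attribute catches the event."""
    thrown_tokens = thrown_event.split(".")
    for name in catch_event_attr.split():
        tokens = name.split(".")
        # Match whole dot-separated tokens, so "error" catches
        # "error.badfetch" but "err" does not.
        if thrown_tokens[:len(tokens)] == tokens:
            return True
    return False


print(catches("nomatch noinput", "noinput"))
print(catches("error", "error.badfetch"))
print(catches("nomatch", "noinput"))
```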

Multimodal Web Browsing
• XHTML + VoiceXML
• SALT

[DEMO] Multimodal Browsing

Future of the “Voice” web and VoiceXML

• VoiceXML 1.0: VoiceXML Forum (2000)
• VoiceXML 2.0: W3C (2003, in CR)
• W3C: speech synthesis (SSML), speech recognition grammar, NLP, speech semantics, pronunciation lexicon [early], call control [early], voice browser interoperation [early]
• SALT (Speech Application Language Tags): Microsoft-led (2002)
• JSML and JSGF: Sun/SpeechWorks (1999)
• VoiceXML 3?

Conclusion

Speech is the most natural way for humans to communicate, so it will become an important modality in human-computer interaction (HCI).

VoiceXML has revolutionized speech recognition & telephony application development & deployment.

Backup

History of VoiceXML

Source: VoiceXML Forum (http://www.voicexml.org)

Show : VoiceXML in Daily Life


Classification of Voice Applications

Basic interactive voice response (IVR)
• Computer: “For stock quotes, press 1. For trading, press 2. …”
• Human: (presses DTMF “1”)

Basic speech ASR
• C: “Say the stock name for a price quote.”
• H: “Lucent Technologies”

Advanced speech ASR
• C: “Stock Services, how may I help you?”
• H: “Uh, what’s Lucent trading at?”

“Near-natural language” ASR
• C: “How may I help you?”
• H: “Um, yeah, I’d like to get the current price of Lucent Technologies”
• C: “Lucent is up two at sixty eight and a half.”
• H: “OK. I want to buy one hundred shares at market price.”
• C: “…”

Speech Recognition

• Capturing speech (analog) signals
• Digitizing the sound waves and converting them to basic language units, or phonemes
• Constructing words from phonemes, and contextually analyzing the words to ensure correct spelling for words that sound alike (such as write and right)

Speech Synthesis

Speech synthesis, or text-to-speech, is the process of converting text into spoken language.
• Breaking down the words into phonemes
• Analyzing for special handling of text such as numbers and currency amounts
• Generating the digital audio for playback

VoiceXML Gateway(detail)

Programming VoiceXML

Writing a VoiceXML application is programming.

Control constructs are procedural (if-else etc.)

VoiceXML platform iterates through a <form> until values for all field items have been collected

VoiceXML System Components

VoiceXML servers serve as integrators of various hardware and software:
• Telecom boards
• PBX
• CT integration
• Speech synthesis (TTS)
• Speech recognition (SR)
• Speech grammars
• Voice biometrics
• Software utilities
• Call centre

FIA - Form Interpretation Algorithm

The FIA has a main loop that repeatedly selects a form item and then visits it

The first form item (in document order) whose form item variable is undefined is selected

As a result, the user is prompted for each field item in turn
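The select-and-visit loop described above can be sketched in Python. This is a heavily simplified model: Python's `None` stands in for an undefined variable, recognition results are scripted in advance, and guard conditions, prompt counts, events and filled actions are ignored. The name `run_fia` and the sample fields are assumptions made for illustration.

```python
def run_fia(fields, recognised, max_visits=10):
    """Simulate the FIA main loop over a list of (name, prompt) fields.

    recognised is the scripted sequence of recognition results, one per
    visit; None means nothing usable was recognized on that visit.
    """
    values = {name: None for name, _prompt in fields}
    replies = iter(recognised)
    prompts_played = []
    for _ in range(max_visits):  # guard against looping forever
        # Select: the first field, in document order, that is still undefined.
        selected = next((f for f in fields if values[f[0]] is None), None)
        if selected is None:
            break  # every field has a value, so the form is complete
        name, prompt = selected
        # Visit: play the prompt, then store whatever was recognized.
        prompts_played.append(prompt)
        values[name] = next(replies, None)
    return values, prompts_played


fields = [("city", "Where do you want to go to?"),
          ("travellers", "How many are travelling?")]
values, prompts = run_fia(fields, ["London", None, 3])
print(values)
print(prompts)
```

In this run the second field is visited twice, because the first attempt leaves its variable undefined and the loop selects it again.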

FIA - Form Example

<form>
  <!-- A prompt cannot be a direct child of <form>, so it is wrapped in a <block>. -->
  <block>
    <prompt>Where do you want to go to and how many are travelling?</prompt>
  </block>

  <!-- Field item 1 -->
  <field name="city">
    <prompt>Where do you want to go to?</prompt>
    <grammar mode="voice" src="./cities.grxml"/>
  </field>

  <!-- Field item 2 -->
  <field name="travellers">
    <prompt>How many are travelling to your destination?</prompt>
    <grammar mode="voice" src="./number.grxml"/>
  </field>
</form>

if, else and elseif

<form>
  ...
  <filled>
    <if cond="travellers > 10">
      Sorry, we cannot handle groups larger than 10 persons
      <clear namelist="travellers"/>
    <!-- Inside an XML attribute, && must be written as &amp;&amp; -->
    <elseif cond="travellers > 5 &amp;&amp; city == 'London'"/>
      Sorry, we cannot handle groups larger than 5 persons travelling to London
      <clear namelist="city travellers"/>
    <else/>
      <submit next="http://travel.com/order"/>
    </if>
  </filled>
</form>

JSML - JSpeech Markup Language

Developed by Sun and SpeechWorks as a markup language for text-to-speech dialogs.

Based on the Java Speech API Markup Language
http://java.sun.com/products/java-media/speech/

Text annotation to provide hints to speech synthesizers
• Aimed at making TTS speech more natural and more understandable

Feature set:
• Hints to word pronunciation
• Hints to phrasing, emphasis, pitch and speaking rate
• “Marker” elements: notifications from the speech synthesizer to applications when a marker is reached

JSGF - JSpeech Grammar Format

Developed by Sun and SpeechWorks as a syntax for expressing speech grammars

Based on the Java Speech API Grammar Format
http://java.sun.com/products/java-media/speech/

Microsoft’s SALT: Speech Application Language Tags

• Microsoft, Cisco, Intel, Comverse, SpeechWorks, Philips

A “lightweight” set of tags designed to be used with HTML and XHTML to enable lightweight telephony applications driven from regular Web documents.

Targeted at supporting multimodal access