Speech Application Development in ASP.NET



One of the main design goals of both the MSS and SASDK was to leverage existing standards and ensure industry compliance to make speech a natural extension of any Web-based application. In addition to the basics of XML, HTML, and JavaScript, there are several speech-related standards, as shown in Table 1.

Table 1: The standards of speech applications

Speech Application Language Tags (SALT): SALT is an extension of HTML and other markup languages that adds a speech and telephony interface to Web applications. It supports both voice-only and multimodal browsers. SALT defines a small number of XML elements that serve as the core API for user interaction.

Speech Recognition Grammar Specification (SRGS): SRGS provides a way to define the phrases an application can recognize. A grammar includes the words that may be spoken, the patterns those words may occur in, and the spoken language of each word.

Speech Synthesis Markup Language (SSML): SSML provides an XML-based markup language for creating synthetic speech prompts within an application. It enables the control of pronunciation, volume, pitch, and rate.
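For instance, a minimal SSML fragment (generic W3C SSML, not tied to any particular prompt in this article) that slows the rate and raises the volume of a short greeting could look like this:

    <?xml version="1.0"?>
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
      <!-- Adjust the delivery of the synthesized greeting -->
      <prosody rate="slow" volume="loud">
        Welcome to the world of speech!
      </prosody>
    </speak>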

    Defining the Application

Both the SASDK and Microsoft Speech Server are designed to develop and support two distinct types of speech-based applications: voice-only and multimodal.


By default, the developer selects the application type when creating a new project, as shown in Figure 2. The role of the SASDK is to provide a developer-based infrastructure to support both the development and debugging of either type on a local machine. On the other hand, the MSS is designed to provide a production-level server-side environment for deployed speech-based applications. Figure 3 shows a sample schematic of a production environment that includes Microsoft Speech Server.

Figure 2. Project Selection: The developer selects the application type when creating a new project.

  • 8/14/2019 Speach Application Development in ASP.net

    4/19

Figure 3. Sample Schematic: The figure shows an example of what your production speech environment may contain.

Voice-only applications are designed to never expose a visible Web interface to end users. This type of speech application includes both traditional voice-only applications and touchtone or Dual Tone Multi-Frequency (DTMF) applications. In either case, all interaction with the application is done by either voice or keypad presses, and the result is a menu option or selection based on the user's response. Once deployed, Microsoft Speech Server includes two major components that are designed to support these types of applications in a production environment. The Telephony Application Service (TAS) is responsible for providing a voice-only browser, or SALT interpreter, which is used to process the SALT markup generated by the ASP.NET speech-enabled Web application. The Speech Engine Services (SES) component provides the speech recognition engine and also handles the retrieval of the output generated by the application. Finally, the Telephony Interface Manager (TIM) component provides the bridge between the telephony board hardware, which is connected to the telephone network, and the TAS.

Multimodal applications, on the other hand, are designed to combine speech input and output with a Web-based graphical user interface. In a traditional Web-based GUI, the user directs the system through a combination of selections and commands. Each action is translated into a simple sentence that the system can execute. Fundamentally, each sentence contains verbs that act on a direct object. The mouse selection defines the direct object of a command, while the menu selection describes the action to perform. For example, by selecting a document and choosing print, the user is telling the computer to "Print this document." In multimodal systems, speech and mouse input are combined to form more complex commands.

For example, by selecting a document and simultaneously saying "Print five of this," the user collapses several simple sentences into a single Click command. Obviously this type of application is best suited for devices that support both speech and ASP.NET. For mobile devices like PDAs, however, this model is particularly well suited because conventional keyboard input is difficult. For developers, a multimodal application combines ASP.NET and speech controls with server-side extensions like SALT for application delivery. Once an application is deployed to Microsoft Speech Server, the server is responsible for providing the output markup, which includes SALT, HTML, JavaScript, and XML, to the client, along with the speech services needed for voice interaction.

Building Speech Applications

Like any Web-based application, speech applications have two major components: a Web browser component and a server component. Realistically, the device that consumes the application will ultimately determine the physical location of the Speech Services engine. For example, a telephone or DTMF application will natively take advantage of the server-side features of Microsoft Speech Server, while a desktop Web application will leverage the markup returned by MSS in conjunction with desktop recognition software and the speech add-ins for Microsoft Internet Explorer.

In addition to the default project template, the SASDK also installs a set of speech-enabled ASP.NET controls. By default these controls are added to the Visual Studio toolbox as shown in Figure 4.

Figure 4. Installed Controls: Here's the set of speech controls installed by the SASDK into Visual Studio 2003.

Fundamentally, these controls operate identically to the standard set of ASP.NET Web controls, except that during the server-side rendering phase of a Web page containing a speech control, the output document contains SALT, SSML, and SRGS in addition to the standard HTML and JavaScript. The document returned to the speech-enabled client is first parsed, and then any additional grammar files specified in the returned markup are downloaded. Additionally, if the Speech Services engine is local, prompts or pre-recorded text are also downloaded. Finally, both the SALT client and Web browser invoke the series of prompt and listen elements specified by the markup. Any additional client-side elements are invoked by calling the client-side start() function.
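To give a feel for that rendered output, here is a simplified, hand-written sketch of the kind of SALT a speech-enabled page might emit; the prompt, listen, and grammar elements come from the SALT specification, while the IDs, prompt text, and grammar path are illustrative assumptions:

    <html xmlns:salt="http://www.saltforum.org/2002/SALT">
      <body>
        <!-- Audio output: played when script calls Prompt1.Start() -->
        <salt:prompt id="Prompt1">What kind of sandwich would you like?</salt:prompt>

        <!-- Speech input: the referenced grammar constrains what can be recognized -->
        <salt:listen id="Listen1">
          <salt:grammar src="Grammars/Menu.grxml" />
        </salt:listen>
      </body>
    </html>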

Once started, the Speech Services engine listens for input from the user when a listen element is invoked. Once it receives the response audio, or utterances, it compares its analysis of the audio stream to what is stored in the grammar file, looking for a matching pattern. If the recognizer finds a match, a special type of XML document is returned. This document contains markup called Semantic Markup Language (SML) and is used by the client as the interpretation of what the user said. The client then uses this document to determine what to do next, for example, to execute another prompt or listen element. The cycle repeats itself until the application is done or the session ends.

All ASP.NET speech controls are implemented in the framework namespace Microsoft.Speech.Web.UI. Within the namespace, these controls are categorized by their functions. By default, these categories are Basic, Dialog, Application, and Call Management controls. Call Management controls are an abstraction of the Computer Supported Telecommunications Applications (CSTA) messages you'll use in your application.

Like any other ASP.NET Web control, the speech controls are designed to provide a high-level abstraction on top of the lower-level XML and script emitted during run time. Also, to make the implementation of these controls easier, each control provides a set of property builders as shown in Figure 5.



Listing 1: Welcome to the Speech Application (HTML/ASP.NET)

    function PlayWelcomePrompt() {
        Prompt1.Start();
    }

    Welcome to the world of speech!
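Listing 1 boils down to a prompt control and a client-side script that starts it. A minimal sketch of such a page, assuming the speech: tag prefix, an InlineContent child element for the prompt text, and a plain HTML button for the click wiring (none of which are taken from the original listing), might look like this:

    <form id="Form1" method="post" runat="server">
      <!-- Server-side speech control; renders a SALT prompt element named Prompt1 -->
      <speech:Prompt id="Prompt1" runat="server">
        <InlineContent>Welcome to the world of speech!</InlineContent>
      </speech:Prompt>

      <!-- Plain HTML button that triggers the client-side script -->
      <input type="button" value="Say Welcome" onclick="PlayWelcomePrompt()" />

      <script type="text/javascript">
        // Invokes the Start() method of the SALT prompt rendered for Prompt1.
        function PlayWelcomePrompt() {
            Prompt1.Start();
        }
      </script>
    </form>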

The Basic Speech Controls

The basic speech controls, which include Prompt and Listen, are designed to create and manipulate the SALT hierarchy of elements. These controls provide server-side functionality that is identical to the elements invoked during run time on the client. The Prompt control is designed to specify the content of the audio output. The Listen controls perform recognition, post processing, recording, and configuration of the speech recognizer. Ideally, the Basic controls are primarily designed for tap-and-talk client devices and applications designed to confirm responses and manage application flow through a GUI.

The basic speech controls are designed exclusively to be called by client-side script. Examining the "Hello World" example in Listing 1, you will notice that once the user presses the Web page button, this calls the OnClick client-side event. This event invokes the Start method of the underlying prompt, or exposed SALT element. The event processing for the basic speech controls is identical to the features of SALT. Fundamentally, these features are based on the system's ability to recognize user input. The concept of recognition, or "reco," is used by SALT to describe the speech input resources and provides event management in cases where valid recognition isn't returned. For example, you create specific event procedures such as "reco" and "noreco" and then assign the name of these procedures to control properties such as OnClientReco and OnClientNoReco. When the browser detects one of these events, it calls the assigned procedure. The procedure is then able to extract information about the event directly from the event object.
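For example, the client-side handlers might look like the following sketch; the function names are hypothetical and would be assigned to a control's OnClientReco and OnClientNoReco properties:

    <script type="text/javascript">
      // Called when the recognizer returns a valid result ("reco").
      function HandleReco() {
          // Read the recognized text or semantic values from the event source
          // and update the page accordingly.
      }

      // Called when no valid recognition is returned ("noreco").
      function HandleNoReco() {
          // For example, re-play the prompt so the user can try again.
          Prompt1.Start();
      }
    </script>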

The Listen control is a server-side representation of the SALT listen element. The listen element specifies possible speech inputs and provides control of the speech recognition process. By default, only one listen element can be active at a time. However, a Web page can have more than one Listen control, and each control can be used more than once.

    The following code represents the HTML markup when a Listen control is added to a Web page.
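A hedged approximation of that markup (the Grammars collection and the Src attribute are assumptions about the SASDK schema, and the grammar path is illustrative):

    <speech:Listen id="Listen1" runat="server"
        OnClientReco="HandleReco" OnClientNoReco="HandleNoReco">
      <Grammars>
        <!-- Each grammar directs speech input to the recognition engine -->
        <speech:Grammar Src="Grammars/Menu.grxml" />
      </Grammars>
    </speech:Listen>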


As you can see, the main elements of the Listen control are grammars. Grammars are used to direct speech input to a particular recognition engine. Once the audio is recognized, the resulting text is converted and placed into an HTML output.

Dialog Speech Controls

The dialog speech controls, which include the QA, Command, and SemanticItem controls, are designed to build questions, answers, statements, and digressions for an application. Programmatically, these controls are called through the script element RunSpeech, which manages both the execution and state of these controls. RunSpeech is a client-side JavaScript object that provides the flow control for voice-only applications. Sometimes referred to as the dialog manager, it is responsible for activating the dialog speech controls on a page in the correct order. RunSpeech activates a Dialog speech control using the following steps:

1. RunSpeech establishes the Speech Order of each control based on the control's source order or SpeechIndex property.

2. RunSpeech examines the Dialog speech controls on the page in Speech Order. Based on the order specified in the page, it locates the first dialog control within that list and then initializes it.

3. RunSpeech submits the page.

The QA control (Microsoft.Speech.Web.UI.QA) is used to ask questions and obtain responses from application users. It can be used as either a standalone prompt statement, or it can supply answers for multiple questions without having to ask them. Here's an example of how you can mark up this control.
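A rough sketch of such markup, with hypothetical IDs and grammar path; the Prompt, Reco, and Answers nesting and the XpathTrigger attribute are assumptions about the control schema:

    <speech:QA id="AskSandwich" runat="server" SpeechIndex="1">
      <Prompt InlinePrompt="What kind of sandwich would you like?" />
      <Reco>
        <Grammars>
          <speech:Grammar Src="Grammars/Sandwich.grxml" />
        </Grammars>
      </Reco>
      <Answers>
        <!-- Copies the matched SML node into a semantic item for later use -->
        <speech:Answer SemanticItem="siSandwich" XpathTrigger="/SML/Sandwich" />
      </Answers>
    </speech:QA>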


The Command control (Microsoft.Speech.Web.UI.Command) enables developers to add out-of-context phrases or dialogue digressions. These are the statements that occur during conversations that don't seem to make sense for the given dialog, for example, allowing an application user to say "help" at any point. The following is an example of how you can apply this globally to a speech application.
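A hedged example of a global help command; the Scope, Type, and XpathTrigger attributes and the grammar path are assumptions for illustration:

    <!-- Scoped to the form so "help" is recognized at any point in the dialog -->
    <speech:Command id="HelpCommand" runat="server"
        Scope="Form1" Type="Help" XpathTrigger="/SML/Command/Help">
      <Grammar Src="Grammars/Help.grxml" />
    </speech:Command>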

The SemanticMap and SemanticItem controls track the answers and overall state management of the dialogue. You use semantic items to store elements of contextual information gathered from a user. While the semantic map simply provides a container for multiple semantic items, each SemanticItem maintains its own state: for example, empty, confirmed, or awaiting confirmation. You'll use the SemanticMap to group the SemanticItem controls together. Keep in mind that while the QA control manages the overall semantics of invoking recognition, the storage of the recognized value is decoupled from the control. This simplifies state management by enabling the concept of centralized application state storage. Additionally, this makes it very easy to implement mixed-initiative dialog in your application. In a mixed-initiative dialog, both the user and the system are directing the dialog. For example, the markup for these controls would look like the following.
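A hedged sketch along those lines (the element nesting and IDs are assumptions):

    <speech:SemanticMap id="TheSemanticMap" runat="server">
      <!-- Each semantic item stores one piece of dialog state -->
      <speech:SemanticItem id="siSandwich" />
      <speech:SemanticItem id="siQuantity" />
    </speech:SemanticMap>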

Prompt Authoring

Prompts are an important part of any application because they serve as the voice and interaction point with users. Basically, they act as the main interface for a dialog-driven application. As you begin to build a speech application you will quickly notice that the synthesized voice prompts sound a bit mechanical; definitely not like the smooth, slightly British tone of HAL in "2001." This is because, unless otherwise specified, the local text-to-speech engine will synthesize all prompt interaction. Prompts should be as flexible as any portion of the entire application. Just as you should invest time in creating a well-designed Web page, so should you spend time designing a clean-sounding, dynamic prompt system for application users to interact with. As with any application, the goal is to quickly prototype the proof of concept and usability testing. The extensibility of the speech environment makes it easy to run a parallel development track for the dialog and the prompt recording.

The prompt database is the repository of recorded prompts for an application. It is compiled and downloaded to the telephony browser during run time. Before the speech engine plays any type of prompt, it queries the database and, if a match is found, it plays the recording instead of using the synthesized voice.


Within Visual Studio, the Prompt Project is used to store these recordings and is available within the new project dialog as shown in Figure 6. The Prompt Project contains a single prompt database with a .promptdb extension. By default, prompt databases can be shared across multiple applications and mixed together. In practice it's actually a good idea to use separate prompt databases within a single application to both reduce size and make each one more manageable. The database can contain wave recordings either recorded directly or imported from external files.

You can edit the prompt database through Visual Studio's Prompt Editor as shown in Figure 7. This window is divided into a Transcription and an Extraction window. The Transcription window (top) is used to identify an individual recording and its properties, including playback properties such as volume, quality, and wave format. More importantly, you use the Transcription window to define the text representation of the wave file content. The bottom portion of the Prompt Editor contains the Extraction window. This identifies one or more consecutive speech alignments of a transcription. Essentially, extractions constitute the smallest individual elements or words within a transcription that a system can use as part of an individual prompt.

Figure 6. Prompts Project Database: The prompts project contains a database that is used to store pre-recorded voice prompts.



For example, the extractions "ham," "roast beef," "club," and "sandwich" can be combined with "you ordered a" to create the prompt, "You ordered a ham sandwich."

    Figure 8. Editing Prompts: The figure shows the process of editing the prompts database within Visual Studio 2003.


Figure 9. Aligning Prompts: Proper speech recognition requires defining an alignment between the prompts and transcription within Visual Studio 2003.

Once all the application prompts are recorded, they are then referenced within a project to create inline prompts as shown in Figure 10. Within an application this creates a prompts file that contains only the extractions identified within the prompts database. By default, anything not marked as an extraction is not available within a referenced application. The result is that when the application runs, the prompt engine matches the prompt text in your application with the extractions in your database. If the required extraction is found it is played; otherwise the text-to-speech engine uses the system-synthesized voice to play the prompt.

Figure 10. Sharing Prompt Databases: Sharing a prompts database across projects is simply a process of creating a reference.

Programmatically, this is provided through the PromptSelectFunction property of every Dialog and Application speech control. The PromptSelectFunction property is actually a callback function for each control that is executed on the client side. It is responsible for returning the prompt and its associated markup to use when the control is activated. This built-in function enables speech applications to check and react to the current state of the dialog as shown in the following code.

    function GetPromptSelectFunction() {
        // Look up the most recent command or exception recorded for the active QA control.
        var lastCommandOrException = "";
        var len = RunSpeech.ActiveQA.History.length;
        if (len > 0) {
            lastCommandOrException = RunSpeech.ActiveQA.History[len - 1];
        }

        // If the user said nothing, apologize and re-ask the question.
        if (lastCommandOrException == "Silence") {
            return "Sorry I couldn't hear you. " +
                   "What menu selection would you like?";
        }
    }
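Assuming the QA control's Prompt accepts a PromptSelectFunction attribute naming the callback (an assumption about the schema), the function above would be wired up roughly like this:

    <speech:QA id="AskMenu" runat="server">
      <!-- The client-side callback returns the prompt text each time the QA is activated -->
      <Prompt PromptSelectFunction="GetPromptSelectFunction" />
    </speech:QA>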


Figure 11. Prompt Function File: Editing and managing the code and states for each prompt can be done through the prompt function file.

Grammar Authoring

Speech is an interactive process of prompts and commands. Semantic Markup Language grammars are the set of structured command rules that identify the words, phrases, and valid selections that are collected in response to an application prompt. Grammars provide both the exact words and the order in which the commands can be said by application users. A grammar can consist of a single word, a list of acceptable words, or complex phrases. Structurally, it's a combination of XML and plain text that is the result of attempting to match the user responses. Within MSS, this set of data conforms to the W3C Speech Recognition Grammar Specification (SRGS). An example of a simple grammar file that allows for the selection of a sandwich is shown below:

    ham           $._value = "ham"
    roast beef    $._value = "roast beef"
    italian       $._value = "italian"
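Reassembled as an SRGS rule, the grammar might look roughly like the following; the root rule name and the tag-format value are assumptions:

    <grammar version="1.0" xml:lang="en-US" root="SandwichType"
             xmlns="http://www.w3.org/2001/06/grammar"
             tag-format="semantics-ms/1.0">
      <rule id="SandwichType" scope="public">
        <one-of>
          <!-- Each item pairs a spoken phrase with a semantic tag script -->
          <item>ham <tag>$._value = "ham"</tag></item>
          <item>roast beef <tag>$._value = "roast beef"</tag></item>
          <item>italian <tag>$._value = "italian"</tag></item>
        </one-of>
      </rule>
    </grammar>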


Grammars form the guidelines that applications must use to recognize the possible commands that a user might issue. Unless the words or phrases are defined in the grammar structure, the application cannot recognize the user's speech commands and returns an error. You can think of a grammar as a vocabulary of what can be said by the user and what can be understood by the application. This is like a lookup table in a database that provides a list of options to the user, rather than accepting free-form text input.

A very simple application can limit spoken commands to a single word like "open" or "print." In this case, the grammar is not much more than a list of words. However, most applications require a richer set of commands and sentences. The user interacting with this type of speech application expects to use a normal and natural language level. This increases the expectation for any application and requires additional thought during design. For example, an application must accept "I would like to buy a roast beef sandwich," as well as "Gimme a ham sandwich."

Implementing Grammar

Within a Visual Studio speech application, grammar files have a .grxml extension and are added independently as shown in Figure 12. Once added to a project, the Grammar Editing Tool, shown in Figure 13, is used to add and update the independent elements. This tool is designed to provide a graphical layout using a left-to-right view of the phrases and rules stored in a particular grammar file. Essentially, it provides a visualization of the underlying SRGS format as a word graph rather than as hierarchical XML.

Figure 12. Grammar Files: Within a Visual Studio speech application, grammar files have a .grxml extension and are added directly to the project.


For developers, the goal of the Grammar Editor is to present a flowchart of the valid grammar paths. A valid phrase is defined by a successful path through this flowchart. Building recognition rules is done by dragging the set of toolbox elements listed in Table 2 onto the design canvas. The design canvas displays the set of valid toolbox shapes and represents the underlying SRGS elements.

Table 2: The elements of the Grammar Editor toolbox.

Phrase: The phrase element represents a single grammatical entry.

List: The list element specifies the relationship between a group of phrases.

Group: The group element binds a series of phrases together in a sequence.

Rule Reference: The rule reference element provides the ability to reference an external encapsulated rule.

Script Tag: The script tag element defines the set of valid phrases for this grammar.

Wild Card: The wild card element allows any part of a response to be ignored.

Skip: The skip element creates an optional group that can be used to insert or format semantic tags at key points in the grammar.

Halt: The halt element immediately stops recognition when it is encountered.

During development the Grammar Editor provides the ability to show both the path of an utterance and the returned SML document, as shown in Figure 14. For example, the string "I would like to buy a ham sandwich" is entered into the Recognition string text box at the top, and the path the recognizer took through the grammar is highlighted. At the bottom of the screen the build output window displays a copy of the SML document returned by the recognizer. This feature provides an important way to validate and test that both the grammar and the SML document returned are accurate.

Structurally, the editor provides the list of rules that identify words or phrases that an application user is able to provide. A rule defines a pattern of speech input that is recognized by the application. At run time the speech engine attempts to find a complete path through the rule using the supplied voice input. If a path is found, the recognition is successful and results are returned to the application in the form of an SML document. This is an XML-based document that combines the utterance, semantic items, and a confidence value defined by the grammar, as shown below.

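For the utterance above, the returned document might look roughly like this; the Sandwich element name and the exact attribute set are illustrative assumptions driven by the grammar:

    <SML text="I would like to buy a ham sandwich" confidence="0.85">
      <!-- The semantic value extracted by the grammar's tag script -->
      <Sandwich text="ham sandwich" confidence="0.92">ham</Sandwich>
    </SML>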


The confidence value is a score returned by the recognition engine that indicates the degree of confidence it has in recognizing the audio. Confidence values are often used to drive the confirmation logic within an application. For example, you may want to trigger a confirmation answer if the confidence value falls below a specific threshold such as 0.8.
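As a small client-side sketch of that idea (the smlResult parameter, the /SML/Sandwich path, and the helper itself are assumptions; the 0.8 threshold is the one mentioned above):

    // smlResult: the parsed SML DOM returned by the recognizer for the current turn.
    function NeedsConfirmation(smlResult) {
        var item = smlResult.selectSingleNode("/SML/Sandwich");
        var confidence = item ? parseFloat(item.getAttribute("confidence")) : 0;
        // Ask the user to confirm whenever the recognizer is less than 80 percent sure.
        return confidence < 0.8;
    }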

The SASDK also includes the ability to leverage other data types as grammar within an application. The clear benefit is that you don't have to manually author every specific grammar rule. Adding these external grammars can be done through an included Grammar Library or using a process called data-driven grammar.

The Grammar Library is a reusable collection of rules provided in SRGS format that are designed to cover a variety of basic types. For example, this includes grammar for recognizing numbers and mapping holiday dates to their actual calendar dates. Data-driven grammar is a feature provided by three Application speech controls. The ListSelector and DataTableNavigator controls enable you to take SQL Server data, bind it to the control, and automatically make all the data accessible by voice. Logically this means that you don't have to recreate all the data stored in a database in a grammar file. The third control, the AlphaDigit control, isn't a data-bound control. Rather, it automatically generates a grammar for recognizing a masked sequence. For example, the mask "DDA" would recognize any string following the format: digit, digit, character.

Figure 14. Grammar Editor: During development the grammar editor provides the ability to show both the path of an utterance and the returned SML within the Visual Studio environment.

Application Deployment

Up to this point the discussion has focused exclusively on the development phase of speech applications using the SASDK. Ultimately, the SASDK is a faithful simulated representation of a Speech Server, coupled with additional development and debugging tools. The benefit is that when it comes time to deploy speech applications, they exist as a set of ASP.NET Web pages and artifacts as shown in Table 3. Deployment is simply the process of packaging and deploying these files using the same methodology as any ASP.NET application.

    Table 3: The typical components of a speech application.


.aspx: A file containing the visual elements of a Web Forms page.

.ascx: A file that persists a user control as a text file.

.asax: A file that handles application-level events.

.ashx: An ASP.NET Web handler file used to manage raw HTTP requests.

.grxml: A grammar file.

.cfg: A binary file created by the SASDK command-line grammar compiler.

.prompts: A compiled prompt database that is created when a .promptdb file is compiled with the Speech Prompt Editor.

However, there are some inherent architecture differences to remember between the SASDK and the Speech Server environment. For example, the Telephone Application Simulator (TASIM) provided by the SASDK is used in the development of voice-only applications. It simulates the functions of both the TAS and SES components of a Speech Server. Once the application is deployed, you wouldn't be able to access voice-only applications from Internet Explorer. In a production environment, the TAS component and the SES component are completely separate: the TAS is responsible for handling the processing of incoming calls, and SES handles speech input. In the development environment both of these functions are handled by the TASIM. Additionally, in the development of a multimodal application, debugging is provided by the Speech add-in for Internet Explorer, which has been enhanced to include extensions and integration with the Speech Debugger. However, these enhancements aren't available as part of the standard client installation.

Finally, when developing applications using the SASDK it is possible to build applications where the paths to grammar files are specified as physical paths. However, within the production Speech Server environment, all paths to external resources such as grammar files must be specified using a URL, not a physical path. Prior to creating your deployment package it is always a good idea to switch into HTML view and verify that all grammar file paths use a relative or absolute URL.
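For example, a grammar reference that works on the development machine because it uses a physical path would need to become a URL before deployment; the control name and paths below are purely illustrative:

    <!-- Development only: physical path resolved on the local machine -->
    <speech:Grammar Src="C:\Inetpub\wwwroot\SpeechApp\Grammars\Menu.grxml" />

    <!-- Deployable: a relative URL resolved by the Web server -->
    <speech:Grammar Src="Grammars/Menu.grxml" />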

There are many different ways to deploy the components of Speech Server based on your usage and workload requirements. Each component is designed to work independently. This enables the deployment of single- or multi-box configurations. The simplest deployment is to place the TAS, SES, IIS, hardware telephony board, and TIM software together on a single machine. Without a doubt the Web server is the essential component of any Speech Server deployment, as it is the place where applications are deployed and the SALT-enabled Web content is generated. Even if you decide to deploy Speech Server components on separate Web servers, each server running SES must have IIS enabled.

In this article you've looked at how to use the SASDK and Microsoft Speech Server 2004 to develop speech-enabled ASP.NET applications. The SASDK provides the development environment that includes a simulator and debugging tools integrated into Visual Studio 2003. This combination provides developers the ability to build and test voice and multimodal applications on their local machine. Once the application is complete, Microsoft Speech Server provides the production-level support and scalability needed to run these types of applications. Personally, I don't expect to be talking to HAL anytime soon. However, the possibilities are starting to get better.