1.INTRODUCTION
A voice browser is a "device which interprets a (voice) markup language and is
capable of generating voice output and/or interpreting voice input, and possibly other
input/output modalities." This definition of a voice browser is a broad one. The fact that the
system deals with speech is obvious given the first word of the name, but what makes a software
system that interacts with the user via speech a "browser"? The information that the system uses
(for either domain data or dialog flow) is dynamic and comes from somewhere on the Internet.
From an end-user's perspective, the impetus is to provide a service similar to
what graphical browsers of HTML and related technologies do today, but on devices that are not
equipped with full browsers or even the screens to support them. This situation is only
exacerbated by the fact that much of today's content depends on the ability to run scripting
languages and third-party plug-ins to work correctly.
Much of the effort concentrates on using the telephone as the first voice
browsing device. This is not to say that the telephone is the preferred embodiment for a voice
browser, only that the number of such access devices is huge, and that it sits at the opposite
end of the graphical-browser continuum, which highlights the requirements that make a speech
interface viable. By the first meeting it was clear that this scope-limiting was also needed in
order to make progress, given the significant challenges in designing a system that uses or
integrates with existing content, or that automatically scales to the features of various
access devices.
Voice Browsing refers to using speech to navigate an application. These
applications are written using parts of the Speech Interface Framework. In much the same way
that Web applications are written in HTML and are rendered in a Web browser, speech
applications are written in VoiceXML and are rendered via a Voice Browser.
2.VOICE BROWSER DOCUMENTS
2.1 Dialog Requirements:
"A prioritized list of requirements for spoken dialog interaction which any
proposed markup language (or extension thereof) should address."
The Dialog Requirements document describes properties of a voice browser dialog,
including a discussion of modalities (input and output mechanisms combined with various dialog
interaction capabilities), functionality (system behavior) and the format of a dialog language. A
definition of the latter is not specified, but a list of criteria is given that any proposed language
should adhere to. An important requirement of any proposed dialog language is ease of creation.
Dialogs can be created with tools ranging from something as simple as a text editor, through
more specific tools such as an (XML) structure editor, to tools that are special-purposed to
deal with the semantics of the language at hand.
2.2 Grammar Representation Requirements:
It defines a speech recognition grammar specification language that will be
generally useful across a variety of speech platforms used in the context of a dialog and synthesis
markup environment.When the system or application needs to describe to the speech-recognizer
what to listen for, one way it can do so is via a format that is both human and machine-readable.
2.3 Model Architecture for Voice Browser Systems Representations:
"To assist in clarifying the scope of charters of each of the several subgroups
of the W3C Voice Browser Working Group, a representative or model architecture for a
typical voice browser application has been developed. This architecture illustrates one
possible arrangement of the main components of a typical system, and should not be
construed as a recommendation."
2.4 Natural Language Processing Requirements:
It establishes a prioritized list of requirements for natural language processing in a
voice browser environment. The data that a voice browser uses to create a dialog can vary from a
rigid set of instructions and state transitions, whether declaratively and/or procedurally stated, to
a dialog that is created dynamically from information and constraints about the dialog itself. The
NLP requirements document describes the requirements of a system that takes the latter
approach, using an example paradigm of a set of tasks operating on a frame-based model. Slots
in the frame that are optionally filled guide the dialog and provide contextual information used
for task-selection.
2.5 Speech Synthesis Markup Requirements:
It establishes a prioritized list of requirements for speech synthesis markup which
any proposed markup language should address. A text-to-speech system, which is usually a
stand-alone module that does not actually "understand the meaning" of what is spoken, must rely
on hints to produce an utterance that is natural and easy to understand, and moreover, evokes the
desired meaning in the listener. In addition to these prosodic elements, the document also
describes issues such as multi-lingual capability, pronunciation issues for words not in the
lexicon, time-synchronization, and textual items that require special preprocessing before they
can be spoken properly.
3. STANDARDIZATION
Standards for voice browsing technology have been developed by the World Wide
Web Consortium (W3C), which develops interoperable technologies (specifications, guidelines,
software, and tools) to lead the Web to its full potential as a forum for information, commerce,
communication, and collective understanding. The relevant W3C work includes:
1. Voice Browser Working Group
2. Speech Interface Framework
3.1 Voice Browser Working Group:
It was established on 26 March 1999 and re-chartered through 31 January 2009. The
W3C Voice Browser Working Group made the Speech Interface Framework possible. This framework
allows developers to create speech-enabled applications that are based on Web technologies,
and provides an environment that will be familiar to those versed in Web development
techniques. Applications are thus written using parts of the Speech Interface Framework: in
much the same way that Web applications are written in HTML and run in a Web browser, speech
applications are written in VoiceXML and are rendered through a Voice Browser.
By one estimate, over 85% of Interactive Voice Response (IVR) applications for telephones
(including mobile) use W3C's VoiceXML standard.
The Voice Browser Working Group is coordinating its efforts to make the Web available on more
devices and in more situations.
3.1.1.Aim:
The aim of the W3C Working Group is to enable users to speak and listen to
Web applications by defining standard languages for developing Web-based speech applications.
This Working Group concentrates on languages for capturing and producing speech and
managing the conversation between user and computer system, while a related Group, the
Multimodal Interaction Working Group, works on additional input modes including keyboard
and mouse, ink and pen, etc.
3.1.2. W3C Recommendations:
Its recommendations have been reviewed by W3C Members, by software developers,
and by other interested parties, and are endorsed by the Director as Web Standards.
3.2 Speech Interface Framework:
The framework includes:
3.2.1.VoiceXML:
A language for creating audio dialogs that feature synthesized speech, digitized
audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and
mixed initiative conversations. Some of its versions are:
• VoiceXML 1.0: designed for creating audio dialogs.
• VoiceXML 2.0: uses the form interpretation algorithm (FIA).
• VoiceXML 2.1: 8 additional elements.
• VoiceXML 3.0: relationship between semantics and syntax.
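To make the analogy with HTML concrete, a minimal VoiceXML 2.0 document might look like the
sketch below; the form name and the referenced grammar file are hypothetical, not taken from
any deployed application:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="drink-order">
    <field name="drink">
      <!-- The browser speaks the prompt, then listens using the grammar. -->
      <prompt>Would you like coffee or tea?</prompt>
      <grammar type="application/srgs+xml" src="drink.grxml"/>
      <filled>
        <prompt>You chose <value expr="drink"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

A voice browser walks such a form with the form interpretation algorithm: it prompts, collects
input against the grammar, and fills the field.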
3.2.2.Speech Recognition Grammar Specification (SRGS) 1.0:
A document language that can be used by developers to specify the words and
patterns of words to be listened for by a speech recognizer or other grammar processor.
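For illustration, an SRGS 1.0 grammar covering a two-word vocabulary could be written as
follows (the rule name is arbitrary):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         mode="voice" root="drink" xml:lang="en-US">
  <rule id="drink" scope="public">
    <!-- The recognizer listens only for these alternatives. -->
    <one-of>
      <item>coffee</item>
      <item>tea</item>
    </one-of>
  </rule>
</grammar>
```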
3.2.3.Speech Synthesis Markup Language (SSML):
A markup language for rendering a combination of prerecorded speech,
synthetic speech, and music. Some of its versions are:
• Speech Synthesis Markup Language (SSML) 1.0
• Speech Synthesis Markup Language (SSML) 1.1
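A short, illustrative SSML 1.0 fragment showing the kind of prosodic hints discussed in
section 2.5:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <p>
    Your call is <emphasis level="strong">important</emphasis> to us.
    <break time="500ms"/>
    <prosody rate="slow" pitch="low">Please hold the line.</prosody>
  </p>
</speak>
```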
3.2.4.Semantic Interpretation (SISR):
A document format that represents annotations to grammar rules for extracting the
semantic results from recognition.
E.g.: Semantic Interpretation (SISR) 1.0
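SISR annotations are embedded in SRGS rules as <tag> elements; a sketch (the rule and the
semantic values are illustrative):

```xml
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         mode="voice" root="drink" tag-format="semantics/1.0"
         xml:lang="en-US">
  <rule id="drink" scope="public">
    <one-of>
      <!-- Different spoken forms map to the same semantic result. -->
      <item>coffee <tag>out = "coffee";</tag></item>
      <item>a cup of joe <tag>out = "coffee";</tag></item>
      <item>tea <tag>out = "tea";</tag></item>
    </one-of>
  </rule>
</grammar>
```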
3.2.5.Pronunciation Lexicon Specification (PLS):
A representation of phonetic information for use in speech recognition and
synthesis.
E.g.: Pronunciation Lexicon Specification (PLS) 1.0
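An illustrative PLS 1.0 lexicon entry giving a pronunciation for a word the engine might
otherwise mispronounce:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-GB">
  <lexeme>
    <grapheme>tomato</grapheme>
    <phoneme>təˈmɑːtəʊ</phoneme>
  </lexeme>
</lexicon>
```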
4.DESIGN OF THE SYSTEM
This section gives an overview of the design of a phone browser,
TeleBrowse. The prototype is assumed to have a phone-based physical medium. The architecture
is introduced first, followed by the functions provided by the prototype.
4.1 System Architecture:
The system consists of three modules: a voice driven interface, an HTML
translator and a command processor. The voice driven interface of the prototype includes a
voice recogniser, a DTMF recogniser, a speech synthesiser and an audio engine. The physical
medium for communication is intended to be a telephone, where the user can be remote from the
system's location, but it can also be simulated with a microphone/speaker combination
connected directly to a PC running the system. A voice recogniser and a DTMF (Dual Tone
Multi-Frequency) recogniser are available for input to the system, while the speech/audio
synthesis is suitable for output to the user. Hence, a command issued to the system can be in
one of two forms: either a command spoken by the user, or DTMF tones punched in on a phone's
keypad. DTMF assigns a specific frequency, or tone, to each key on a touch-tone telephone so
that it can easily be identified by a microprocessor. The purpose of DTMF in the architecture
is cost effectiveness: although voice recognition technology is capable of handling most of
the translation task, it is an expensive one. DTMF is therefore used for issuing commands that
are not complex enough to warrant translation to a voice driven format.
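The frequency pairs behind DTMF are standardized: each key sounds one tone from a
low-frequency row group and one from a high-frequency column group. The sketch below (Python,
not part of the prototype) shows how a key maps to its tone pair and back:

```python
# Standard DTMF row (low) and column (high) frequencies, in Hz.
ROWS = [697, 770, 852, 941]
COLS = [1209, 1336, 1477, 1633]
KEYS = ["123A", "456B", "789C", "*0#D"]

def tones_for_key(key):
    """Return the (low, high) frequency pair for a keypad key."""
    for r, row in enumerate(KEYS):
        if key in row:
            return ROWS[r], COLS[row.index(key)]
    raise ValueError(f"not a DTMF key: {key!r}")

def key_for_tones(low, high):
    """Identify the key from a detected frequency pair."""
    return KEYS[ROWS.index(low)][COLS.index(high)]
```

Because each key is a unique pair of tones, a microprocessor can identify it with two simple
frequency detectors instead of a speech recogniser.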
The Voice Driven Interface essentially accepts spoken words as input. The
input signals are then compared against a set of pre-defined commands. If there is an
appropriate match, the corresponding command is output to the Command Processor. The engine
also handles auxiliary functions (in conjunction with the speech synthesis engine) such as the
confirmation of commands where appropriate, speed control and the URL dictation system. The
Text-to-Speech/audio engine is responsible for producing the only output from the system back
to the user. The output is in the form of spoken (synthesised) text, or sounds for the
auditory icons. The input for these two engines comes from the Command Processor. The input
can either be a text stream consisting of the actual information to be read out as the content
of the page (processed by the TTS engine), or an instruction to play a sound as an auditory
icon to mark a document structure (processed by the audio engine).
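The matching step can be sketched as a simple normalise-and-look-up; the command names below
are hypothetical, not TeleBrowse's actual rule set:

```python
# A sketch (not the TeleBrowse source) of matching recogniser output
# against a set of pre-defined commands.
COMMANDS = {
    "speak faster": "SPEED_UP",
    "speak slower": "SPEED_DOWN",
    "where am i": "REPORT_POSITION",
    "next paragraph": "NAV_NEXT_PARAGRAPH",
}

def match_command(utterance):
    """Normalise the recognised words and look up the command.

    Returns the command token passed on to the Command Processor,
    or None when no rule matches (the caller would re-prompt).
    """
    key = " ".join(utterance.lower().split()).rstrip("?")
    return COMMANDS.get(key)
```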
The other major component is the HTML Translator. When a user requests an
HTML document, its contents must first be parsed and translated to a form suitable for use in
the audio realm. This includes the removal of unwanted tags and information, and the retagging
of structures for compliance and subsequent use with the audio output engine of the interface.
The translator also summarises information about the document, such as the title and the
positions of various structures, for use with the document summary feature.
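A minimal sketch of such a translator, written here with Python's standard html.parser rather
than the controls used by the prototype; the set of recorded structures is illustrative:

```python
from html.parser import HTMLParser

class Translator(HTMLParser):
    """Sketch of an HTML translator: strips markup, keeps the text,
    and records the title and heading positions for a summary."""

    def __init__(self):
        super().__init__()
        self.text = []        # text pieces, in document order
        self.title = ""
        self.headings = []    # (position-in-text, tag) for the summary
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("h1", "h2", "h3"):
            self.headings.append((len(self.text), tag))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text.append(data.strip())
```

Feeding a page through the parser yields plain text plus a structural index, which is the kind
of intermediate form an audio output engine can work from.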
A Command Processor sits between the HTML translator and the interface.
The command processor is responsible for acting on the voice / DTMF commands issued by a
user. The Processor retrieves HTML documents from the WWW and feeds them to the HTML
translation algorithm. It also controls the navigation between web pages, and the functionality
associated with this navigation (bookmarks, history list, etc.). This component also processes all
the other system and housekeeping commands associated with the program. It outputs a stream
of marked text to the speech synthesis / audio engine; the stream consists of a combination of
actual textual information and tags to mark where audio cues should be played.
"The architecture diagram was created as an aid to how we structure our work
into subgroups. The diagram will help us to pinpoint areas currently outside the scope of existing
groups."
Although individual instances of voice browser systems are apt to vary
considerably, it is reasonable to point out architectural commonalities as an aid to
discussion, design and implementation. Not all segments of this diagram need be present in any
one system, and systems which implement various subsets of this functionality may be organized
differently. Systems built entirely from third-party components, with an architecture thereby
imposed, may end up with unused or redundant functional blocks.
Two types of clients are illustrated: telephony and data networking. The
fundamental telephony client is, of course, the telephone, either wireline or wireless. The
handset telephone requires a PSTN (Public Switched Telephone Network) interface, which can be
tip/ring, T1, or higher level, and may include hybrid echo cancellation to remove line echoes
for ASR barge-in over audio output. A speakerphone will also require an acoustic echo
canceller to remove room echoes.
The data network interface will require only acoustic echo cancellation if used
with an open microphone, since there is no line echo on data networks. The IP interface is
shown for illustration only; other data transport mechanisms can be used as well. The model
architecture is shown below. Solid (green) boxes indicate system components, peripheral solid
(yellow) boxes indicate points of usage for markup language, and dotted peripheral boxes
indicate information flows.
Fig: 4.1 System architecture
5.LEVELS OF VOICE BROWSING
In order to understand it better in its current form, voice browsing can be
examined under three levels.
5.1.Voice Browsing:
This is the first level in examining the technique of voice browsing. The simplest
way of understanding the voice browser and the voice web would be to take the web itself into
consideration. We know that people all over the world visit websites and get visual feedback.
Now, the voice web contains voice sites where the feedback is delivered through dialogues.
The most basic example would be calling our cellphone operator's portal,
where speech recognition software provides the caller with a series of options such as
recharging the account, talking to a customer care executive or listening to new offers. The
technology of voice browsing has brought down the cost considerably: where a human operator
costs about a rupee per minute to talk to customers, an automated call costs about 10 paise
per minute. Voice browsing also has its uses in the corporate sector, especially in banks and
airlines.
5.2.Voice Browsing The World Wide Web:
This is the second level through which we can examine the voice browsing
technique. We can browse, through voice, websites which offer voice portals. In order to
maintain customer loyalty, these portals offer voice browsing of personal content. To stake a
claim in this market, AOL bought Quack.com. The voice portal market is being expanded almost
everywhere, be it Europe, the US or Asia.
5.3.The Voice Web:
The third level is the voice web. Many companies have introduced forums for
programmers for setting up voice sites, so as to enhance interest in voice browsing and speech
recognition. These forums further become voice webs which sprout voice auction sites and
voice-based chat rooms.
6.WEB PAGE DESIGN STRATEGIES FOR VOICE BROWSERS
The major obstacle to wide-scale commercial deployment of voice browsers for
the web is not the technology, but the ease (or difficulty!) with which web page designers can
add speech support to their sites. Authoring a web page for any specific type of user agent or
system configuration should never be a completely separate subject with arcane new techniques
developed for each special need, but rather an application of the common set of Universally
Accessible Design principles that should be part of every web author's repertoire. With few
exceptions, pages should never be designed for certain types (or brands) of browsers, but
should instead be designed for all uses (and potential uses) of the information. All web
documents should be as accessible to voice browsers as to visual user agents.
The HTML Writer's Guild studies, and discussions with web authors, have
shown that the primary obstacle to universal accessibility is ignorance. There are few cases
where a conscious decision has been made to produce a generally inaccessible web page; rather,
the author is simply unaware of the need to create accessible pages and the techniques by which
that is done. Once enlightened, most web authors eagerly embrace the concept of universal
accessibility, since the benefits are many and obvious.
In this section, some of the primary techniques of Universally Accessible
Design are briefly listed as they relate to voice browsers, together with ideas as to how
authors can implement these considerations when designing web pages.
6.1.Aural Style Sheets:
Aural Style Sheets are part of the Cascading Style Sheets, Level 2 specification,
and provide for a level of control in spoken text roughly analogous to that for displayed/printed
text. Aural rendering of a document is already commonly used by the blind and print-impaired
communities. It combines speech synthesis and "auditory icons." Often such aural presentation
occurs by converting the document to plain text and feeding this to a screen reader, software or
hardware that simply reads all the characters on the screen. This results in a less effective
presentation than would be the case if the document structure were retained. Style sheet
properties for aural presentation may be used together with visual properties (mixed media) or
as an aural alternative to visual presentation. The use of an aural style sheet (or aural
style sheet properties included in a general style sheet document) allows the author to
specify characteristics of the spoken text such as volume, pitch, speed, and stress; indicate
pauses and insert audio "icons" (sound files); and show how certain phrases, acronyms,
punctuation, and numbers should be voiced.
Combined with the @media selector for media types, a well-crafted aural style
sheet can greatly increase the accessibility of a web document in a voice browser. Further
investigation in this area is encouraged, especially in the area of example aural style sheets and
suggestions for authoring techniques.
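For instance, a small aural style sheet might look like the following (the cue file name is
hypothetical):

```css
@media aural {
  /* Speak top-level headings slowly, in a deeper voice, after a cue. */
  h1 { voice-family: male; pitch: low; richness: 90;
       cue-before: url("section.wav"); pause-after: 500ms; }
  /* Give emphasized text extra stress and volume. */
  em { stress: 75; volume: loud; }
  /* Spell out acronyms letter by letter. */
  abbr { speak: spell-out; }
}
```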
6.2.Rich Meta-Content:
HTML 4.0 gives the author the ability to embed a great deal of meta-content
into a document, specifying information which expands on the semantic meaning of the content
and allows for specialized rendering by the user agent. In other words, by using features found in
HTML 4.0 (and to a limited extent, in other versions of HTML), an author can give better
information to the browser, which can then make the document easier to use.
Judicious and ample use of meta-content within a document allows the author to
not simply specify the content, but also suggest the meaning and relationship of that content in
the context of the document. Voice browsers can then use that meta-information as appropriate
for their presentation and structural needs.
6.3.Planned Abstraction:
One use for meta-content information is the development of pages which are
designed to be abstracted. The typical document found on the web can often be quite lengthy;
finding information by listening to a web page being read out loud takes longer than visually
scanning a page, especially when most web pages are designed for visual use. Thus, most voice
browsers will provide a method for abstracting a page: presenting one or more outlines of the
page's content based on a semantic interpretation of the document.
Examples of potential or valid abstraction techniques include:
• Listing all the links and link text on a page.
• Forming a structure based on the H1, H2, ... H6 headers.
• Summarizing table data.
• Scanning for TITLE attributes in elements and presenting a list of options for expansion.
• Vocalizing any "bold" or emphasized text.
• Digesting the entire document into a summary based on keywords, as some search engines
provide.
There are any number of other options available to voice browser programmers
for providing short, easily-digestible versions of web contents to the browser user. This
suggests that the web author should provide as much meta-content as possible, as well as
careful use of HTML elements in their proper manner. Specific techniques include:
• Useful choices for link text (e.g., "the report is available" instead of "click here").
• Appropriate use of heading tags to define document structure, not simply for
size/formatting.
• Use of the SUMMARY attribute for tables.
• Use of STRONG and EM where appropriate, providing benefits for both vocal and visual
"scanability".
• Use of META elements with KEYWORDS and SUMMARY content.
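Several of these techniques can be seen together in one short, hypothetical fragment:

```html
<head>
  <title>Quarterly Sales Report</title>
  <meta name="keywords" content="sales, quarterly, report">
  <meta name="summary" content="Sales figures for the third quarter.">
</head>
<body>
  <h1>Quarterly Sales Report</h1>
  <h2>Regional Results</h2>
  <p>The <em>northern</em> region led growth;
     <a href="report.html">the full report is available</a>.</p>
  <table summary="Sales by region, in thousands of dollars">
    <!-- table rows omitted -->
  </table>
</body>
```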
6.4.Alternative Content for Unsupported Media Types:
The "poster child" for web accessibility is the ALT attribute, which allows
alternative text to be specified for images; if a user agent cannot display the visual image,
the ALT text can be used instead. Widespread use of the ALT attribute by all sites on the
Internet would likely double the accessibility of the World Wide Web with such a simple
change. Web authors who do not correctly use ALT text are seriously damaging the usability of
the entire medium!
For voice browsers, ALT text is vitally important, since images cannot be
represented aurally at all. Especially when an image is used as part of a link, alternative
content must be provided so that the voice browser can accurately render the page in a manner
useful to the user.
In addition to the ALT attribute on IMG, HTML 4.0 provides a number of other
ways to specify alternative content that can be used by a browser when an unsupported media
type is encountered. Some of those include:
• ALT attributes for image map AREAs, APPLETs, and image INPUT buttons.
• Text captions and transcripts for multimedia (video and audio).
• NOSCRIPT elements when including scripting languages, as voice browsers may be unable to
process JavaScript instructions.
• NOFRAMES elements when using frame sets, as frames are a very visually oriented method of
document display.
• Use of nested OBJECT elements to include a wide variety of alternative contents for many
media types.
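A hypothetical fragment illustrating these fallbacks:

```html
<img src="chart.gif" alt="Bar chart: sales rose 12% in the third quarter">
<input type="image" src="go.gif" alt="Submit search">

<script src="menu.js"></script>
<noscript><a href="sitemap.html">Site map</a></noscript>

<object data="demo.mpg" type="video/mpeg">
  <object data="demo-transcript.html" type="text/html">
    Transcript of the product demonstration.
  </object>
</object>
```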
7.FUNCTIONALITY
All communication from the user to the system is made by issuing voice
commands or using DTMF tones. Such commands are arranged into objects known as menus.
Depending upon the functionality requirement/availability, different menus are available at
different points in the program’s execution. A grammar set is defined to recognise the speech
commands. Some of these rules are for administrative control. To name a few administrative
controls:
<exit|quit [program|application|TeleBrowse]>
speak <faster|slower>
Where am I?
What is my homepage?
The other rules are used to control navigation. It is anticipated that they are the
most frequently used commands. The navigation is supported in various ways: within the same
page (intra-page navigation), browsing a new web page (inter-page navigation), bookmarks,
history list, document structure or to follow a hyperlink in the web page. The following
grammars display the nature of these rules:
Start browsing by <location|bookmark|homepage>
Maintain bookmarks
Start reading [all|again]
Load <location|bookmark|homepage>
Go to the history list
Jump <forwards|backwards> x <structure>
<Next|Previous> <structure>
A <structure> is one of paragraph, link, anchor, level 1/2/3 heading, list, page or
table, and x represents a positive integer value. All three versions of this command represent
one action: moving between structures within a document. Another way to navigate to a specific
target page is via dictation. Dictation is invoked whenever a 'browse by location' type
command is requested, and it is responsible for fetching a URL address from the user. Users
dictate to the system by saying words representing a single letter, to improve recognition
accuracy. A good example is the military code: 'alpha' for 'a', 'bravo' for 'b', 'charlie' for
'c', etc. The grammar recognises this military code, and also common animals, such as 'frog'
for 'f'. Macros and shortcuts are also used to simplify the dictation process. The 'http://'
at the beginning of every URL is added automatically, and the system recognises phrases like
'World Wide Web', 'company' and 'aussie', among many more, to represent 'www.', '.com', and
'.au' respectively.
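The dictation expansion described above can be sketched as follows; the word lists are
illustrative, not the prototype's actual vocabulary:

```python
# Sketch of URL dictation: spoken code words become letters, spoken
# macros become URL fragments, and 'http://' is prepended automatically.
LETTERS = {"alpha": "a", "bravo": "b", "charlie": "c", "frog": "f"}
MACROS = {"world wide web": "www.", "company": ".com", "aussie": ".au"}

def dictate(words):
    """Turn a sequence of dictated words into a URL."""
    phrase = " ".join(w.lower() for w in words)
    for spoken, text in MACROS.items():
        phrase = phrase.replace(spoken, text)
    parts = [LETTERS.get(w, w) for w in phrase.split()]
    return "http://" + "".join(parts)
```

For example, dictating "world wide web, alpha, bravo, charlie, company" would yield a complete
address without the user spelling out 'www.' or '.com'.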
The dictation menu also allows for corrections to be made, a review of what has
been dictated so far and an ability to restart the dictation session. Output from the system is
either synthesised text or sounds (as auditory icons). The synthesised text can represent either
actual information being read from a web page, or feedback about the system’s operation to the
user. When a page is to be read out (post translation), it is broken up and analysed one piece
at a time. Two situations can occur: if the piece is a tag with an associated auditory icon,
the icon is played out; or, if the piece is simply text, it is synthesised into voice.
Typical applications of auditory icons include a creaking door to represent an
internal link (a link to an anchor within the same web page), a doorbell to represent an email
address, or the clicking sound of a camera shutter to indicate an image. When certain tags are
encountered (end of paragraph, end of list, end of table row, etc.), speaking ceases and the
user is returned to the menu they were last at. Alternatively, if a user wishes to interrupt
the speaking prior to the next break point, the interrupt key '*' can be used.
8.EXPERIMENTS AND RESULTS
The TeleBrowse prototype was developed under the Visual Basic 6
environment. Three ActiveX controls were used in conjunction with the VB project. The
Microsoft Telephony Control (see http://www.microsoft.com/speech) provides the voice
recognition engine, speech synthesis engine, and audio output engine. The Microsoft Internet
Transfer Control retrieves HTML documents from the Web using the HTTP protocol, while the
HTML Zap Control (Newcomb, 1997) provides a simple interface for analysing HTML documents.
The primary goal of the evaluation of the TeleBrowse prototype was to determine the usability
and acceptance of the application as a voice-driven web-browsing tool. Measurement was made of
the operational efficiency of the system and of the quality of the transformed data.
The operational efficiency is represented by the physical interface, speech
recognition & synthesis capabilities, document and page navigation, prompts and feedback from
the prototype. The quality aspect is evaluated through the availability, integrity and usefulness of
the data after the translation, and also the support of various structures in the original HTML
page, e.g. headings, links, tables. The experiment was carried out on two different groups of
users who had similar characteristics, i.e. they were all above 18 years old, had prior
experience of using current web browsing products, and had limited knowledge of the WWW
and the related technology. The major difference between these two groups was subjects in the
first group (G1) were normally sighted, while the second group (G2) were visually impaired, or
to be precise, they suffered from complete blindness. In other words, subjects in G1 were
familiar with typical visually oriented web browsers such as Microsoft Internet Explorer. In this
experiment, there were five people in G1 and four subjects in G2.
Evaluation sessions were run on a one-to-one basis. There was also a general
discussion involving the group at the end of the individual experiments. No interaction occurred
between subjects from differing groups. Each subject was briefed about the operation of the
prototype, and the methods for interacting with it. Along with the briefing, each user was given a
sheet on which was printed a vocabulary listing of what phrases the program would respond to,
and at what points in the program’s execution such phrases could be used. In the case of the
subjects in G2, the sheet was printed using a Braille printing device. The sheet also contained a
set of tasks the subjects had to complete using the program. Some of the tasks that subjects were
required to complete included: starting the application, checking to see what the current
homepage was set to, commencing reading of the web document once loaded, jumping to
another location by dictating a URL address directly into the system etc. Accompanying each
task was a description of the behaviour the system would demonstrate during the task’s
execution, and what phrases to use in interaction with the system to complete those tasks. All
tasks were first completed using the software running in emulation mode on the laptop computer.
The speech recognition engine was not trained to adapt to any specific person. After completing
this and gaining a degree of experience in using the system, subjects were then given an
opportunity to use the system as they chose over the phone, completing the effect of a
phone-based web-browsing tool.
Subjects were also asked to view the same web pages that had just been
'viewed' using the prototype with a browser they would typically use. In the case of G1, this
was either Microsoft Internet Explorer or Netscape Navigator. In the case of G2, this was
again Microsoft Internet Explorer, but this time with the addition of the JAWS screen reading
program. After using the prototype for a sufficient amount of time (in most cases a period of
about twenty minutes to half an hour), each subject was asked to complete a questionnaire to
record their experiences with the prototype. The questionnaire was arranged into two parts:
the measurement of efficiency and of integrity. Some of the questions asked the participants
to give numerical scores on a scale of 1 (very poor) to 7 (very good). Others were free
format, where the subjects could provide their own comments.
This gave us an initial and very broad insight into how subjects from each group
responded to using the prototype in the experiments. By graphing both groups' results on the
same axes, we could also see a comparison of acceptance between the two groups. In addition to
the rated-response questions, the evaluation questionnaire contained a further thirteen
questions of free-form response in nature. These free-form questions were designed to draw out
any comments, problems, criticism, or general feedback from the test subjects. There were, on
average, two such questions per section of the evaluation criteria.
8.1. Voice Recognition:
Voice recognition was one of the more poorly rated criteria. While both groups considered it a necessary technology for the idea of a phone browser, subjects suffered from its shortcomings, and it did result in a loss of efficiency for most users. Subjects from G2 responded more favourably than those from G1. Of particular concern was the dictation of URL addresses, which was noted as a shortcoming in the interface by every user from both groups. The idea of having
to spell out URL addresses one letter at a time (and wait for confirmation of each letter) was not well received. The idea of using shortcuts, such as saying ‘company’ in place of spelling out ‘.com’, was considered a strong improvement, and this technique should be explored further.
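As a sketch, the shortcut idea amounts to a small expansion table applied to the recognised tokens. The vocabulary below is hypothetical; the report does not specify the prototype's actual shortcut words:

```python
# Hypothetical sketch: expanding spoken shortcuts into URL fragments.
# The vocabulary is illustrative, not the prototype's actual word list.
SHORTCUTS = {
    "company": ".com",
    "network": ".net",
    "organisation": ".org",
    "web": "www.",
    "dot": ".",
    "slash": "/",
}

def expand_spoken_url(tokens):
    """Turn a sequence of recognised words into a URL string.

    Single letters pass through unchanged; known shortcut words
    are replaced by the fragment they stand for.
    """
    return "".join(SHORTCUTS.get(t.lower(), t.lower()) for t in tokens)

# "web" "b" "b" "c" "company" -> "www.bbc.com"
print(expand_spoken_url(["web", "b", "b", "c", "company"]))
```

With a table like this, the user dictates five short tokens instead of spelling eleven characters and confirming each one.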
8.2. Speech Synthesis:
There was little or no problem with this sub-system. Subjects found the voice
easy to understand and of suitable volume and pitch. The major contrast between the two groups
was the usage of the speed control feature. Subjects from G1 saw no reason to adjust the speed of
the synthesised voice. They were content with the default normally paced speaking voice.
However, subjects from G2 tended to change the speed to a much higher rate before doing
anything else.
8.3. Navigation:
The overall ability of intra-page and inter-page navigation using the system was rated favourably by both groups. The use of auditory icons to mark HTML structures was viewed by the G2 subjects as being superior to any similar screen-reader marking scheme. Subjects from G1 also appreciated the ability of the auditory icons to quickly and simply mark structures in a document, in a way that was natural and easy to remember. The idea of metaphorically matching the meaning of sounds with the structures they represented was well liked and accepted. There was little problem with remembering the mapping of sound to structure, especially after using the system for an extended amount of time. One criticism of the auditory icons was that they appeared too frequently, and could be seen as breaking up the flow of the text unnecessarily.
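Such a metaphorical mapping can be pictured as a simple lookup from HTML structure to sound. The tag-to-sound pairs below are illustrative assumptions, not the prototype's actual icon set:

```python
# Hypothetical sketch: mapping HTML structures to auditory icons.
# Tag names are standard HTML; the sound files are illustrative only.
AUDITORY_ICONS = {
    "h1": "gong.wav",       # a strong sound for a major heading
    "h2": "chime.wav",      # a lighter sound for a subheading
    "li": "tick.wav",       # a short tick before each list item
    "table": "drawer.wav",  # a "drawer opening" metaphor for tables
    "a": "doorbell.wav",    # a doorbell marks a hyperlink
}

def icon_for(tag):
    """Return the auditory icon to play before rendering a structure,
    or None if the tag is unmarked and read as plain text."""
    return AUDITORY_ICONS.get(tag.lower())
```

Keeping the table small is one way to address the frequency criticism above: only the structures that genuinely aid orientation need an icon.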
A comment made by many subjects was that the prototype offered functionality similar and familiar to that of browsers they had previously used. Thus, features such as bookmarks, the history or ‘go’ list, and the ability to save a ‘home’ page were all well received. The ability to follow links contained in documents was well liked. Using different auditory icons for the different types of links allowed subjects to know in advance whether the link led to a target within the same document or to another, external document; this too was well liked. Again, the problem of dictating URL addresses was raised.
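Distinguishing the two link types only requires inspecting the link target. A minimal sketch using Python's standard urllib (the page URLs are hypothetical):

```python
from urllib.parse import urlparse

def link_type(href, page_url):
    """Classify a link so a distinct auditory icon can be chosen.

    Returns "internal" for a target within the same document
    (a bare fragment, or the same page plus a fragment) and
    "external" for a link to another document.
    """
    if href.startswith("#"):
        return "internal"
    link, page = urlparse(href), urlparse(page_url)
    same_page = (link.netloc in ("", page.netloc)
                 and link.path in ("", page.path))
    return "internal" if same_page and link.fragment else "external"

print(link_type("#section2", "http://example.org/index.html"))   # internal
print(link_type("http://www.w3.org/Voice/", "http://example.org/"))  # external
```

The classification runs before the link is spoken, so the matching icon can be cued ahead of the link text.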
8.4. Online Help:
This section of the criteria did not rate well, owing to the lack of help associated with the commands and prompts used in the system. All users thought that more detailed, context-based help explaining the meaning and usage of commands should be available at any point during the system’s operation, as opposed to the simple listing of commands currently available. Certainly the need to refer to other supporting documentation for more detailed information should be avoided, as access to such documentation would not be available in the environments where a phone browser might be used. The tutorial available from the system’s main menu was well accepted in terms of its content, but a similarly detailed tutorial should perhaps be available at every menu in the system, customised for the relevant set of commands.
8.5. Information & Structure:
There was a marked difference between the two evaluation groups on these two aspects. G2 was more willing than G1 to accept the level of integrity of the information presented by the prototype. Users in G1 commented on the inability to quickly visualise an entire page at a “glance”, and reported frustration at being forced to listen to the content in sequential fashion.
8.6. Overall Impression:
The subjects from both evaluation groups accepted the prototype as a viable method of browsing the web in the audio realm by phone. The efficiency of the product was quite highly regarded by most subjects, and the system interface fared very well. The only major problem was the dictation of URL addresses to the system.
9. FUTURE OF VOICE TECHNOLOGY
Speech technology is expected to grow rapidly, with the voice portal market reaching billions of dollars within just a few years. The Kelsey Group estimates that the voice-browsing market will reach 6.5 billion dollars, while Ovum estimates a world market of 26 billion dollars. Given the variation in these figures, the actual growth of the voice technology industry is anyone’s guess. Navigating on a WAP device by scrolling through many lists is very difficult; hands-free interaction enables an easier form of communication between the user and the system.
Voice browsing can be used to access three kinds of information:
9.1. Business:
Information such as automated telephone ordering services, support desks, order tracking, airline arrival and departure information, cinema and theatre booking services, and home banking services can be retrieved very easily using voice browsing.
9.2. Public:
A voice browser can be used to access services such as local, national and international news, along with community information such as weather forecasts, traffic conditions, school closures and events. It can also be used to gather national and international stock market information, and to carry out business and e-commerce transactions.
9.3. Personal use:
Voice browsing is also used to access personal information such as voice mail, personal horoscopes, newsletters, calendars, and address and telephone lists. In the future, voice browsing is expected to become visual, i.e. multimodal. The greatest achievement, however, would be the integration of voice browsing with all types of operating systems; this success would surely make voice browsing available to each and every application.
10. BENEFITS
Voice is a very natural user interface because it enables the user to speak and listen using skills learned during childhood. Currently, users interact with voice browsers by speaking and listening on telephones and cell phones that have no display. Some voice browsers may have small screens, such as those found on cell phones and palm computers. In the future, voice browsers may also support other modes and media, such as pen, video, and sensor input, and graphics, animation, and actuator controls as output. For example, voice and pen input would be appropriate for Asian users whose spoken language does not lend itself to entry with traditional QWERTY keyboards.
Some voice browsers are portable. They can be used anywhere—at home, at
work, and on the road. Information will be available to a greater audience, especially to people
who have access to handsets, either telephones or cell phones, but not to networked computers.
Voice browsers present a pragmatic interface for functionally blind users, and for users who need Web access while keeping their hands and eyes free for other things. They also present an invisible user interface, freeing workspace previously occupied by keyboards and mice.
11. CONCLUSION
The VoiceBrowse prototype developed for this project proved its success in the evaluation as a telephone-based web-browsing tool. Problems were highlighted in the user study. Some of these problems relate to the features currently available in the prototype, for example the navigation of tables, the interruption of output using voice rather than the “*” key on the phone pad, and the missing function of filling in forms on a web page. At the same time, there are deeper issues that require more research, including the dictation function (how and where contextual information can be acquired to improve the accuracy of dictation) and the visualisation process (what kind of model can be provided that gives a feeling of information arriving in parallel).
Testing the prototype in real-life environments rather than under standard laboratory conditions, e.g. while walking in the street or driving a car, is also an important direction for future work. This may uncover further issues relating to situated interaction. Research into domain-specific speech interaction models may also improve the accuracy and effectiveness of the system. It is the intention of this project to create an environment where web surfing with voice is possible and the surfing experience is a pleasant one. The knowledge accumulated in producing the first-generation prototype, VoiceBrowse, has given us many insights and laid the groundwork that helps us move on to develop the second-generation prototype.
12. BIBLIOGRAPHY
http://www.w3.org/TR/REC-CSS2/aural.html
http://www.hwg.org/opcenter/w3c/voicebrowsers.html
http://www.w3.org/Voice/1998/Workshop/Michael-Brown.html
http://www.oasis-open.org/cover/wap-wml.html
Newcomb, M. (1997). HtmlZap ATL ActiveX Control.
http://www.miken.com/htmlzap/
http://www.brookes.ac.uk/schools/cms/research/speech/publications/43hft97.htm