1.INTRODUCTION
A voice browser is a "device which interprets a (voice) markup language and is
capable of generating voice output and/or interpreting voice input, and possibly other
input/output modalities." This definition of a voice browser is a broad one. The fact that the
system deals with speech is obvious given the first word of the name, but what makes a software
system that interacts with the user via speech a "browser"? The information that the system uses
(for either domain data or dialog flow) is dynamic and comes from somewhere on the Internet.
From an end-user's perspective, the impetus is to provide a service similar to
what graphical browsers of HTML and related technologies do today, but on devices that are not
equipped with full browsers or even the screens to support them. This situation is only
exacerbated by the fact that much of today's content depends on the ability to run scripting
languages and third-party plug-ins to work correctly.
Much of the effort concentrates on using the telephone as the first voice
browsing device. This is not to say that the telephone is the preferred embodiment for a voice
browser, only that the number of such access devices is huge, and that it sits at the opposite
end of the graphical-browser continuum, which highlights the requirements that make a speech
interface viable. By the first meeting it was clear that this scope-limiting was also needed in
order to make progress, given the significant challenges in designing a system that uses or
integrates with existing content, or that automatically scales to the features of various
access devices.
Voice Browsing refers to using speech to navigate an application. These
applications are written using parts of the Speech Interface Framework. In much the same way
that Web applications are written in HTML and are rendered in a Web browser, speech
applications are written in VoiceXML and are rendered via a Voice Browser.
2.VOICE BROWSER DOCUMENTS
2.1 Dialog Requirements:
"A prioritized list of requirements for spoken dialog interaction which any
proposed markup language (or extension thereof) should address."
The Dialog Requirements document describes properties of a voice browser dialog,
including a discussion of modalities (input and output mechanisms combined with various dialog
interaction capabilities), functionality (system behavior) and the format of a dialog language. A
definition of the latter is not specified, but a list of criteria is given that any proposed language
should adhere to. An important requirement of any proposed dialog language is ease of creation.
Dialogs can be created with tools ranging from something as simple as a text editor, through
more specific tools such as an (XML) structure editor, to tools that are special-purposed to
deal with the semantics of the language at hand.
2.2 Grammar Representation Requirements:
It defines a speech recognition grammar specification language that will be
generally useful across a variety of speech platforms used in the context of a dialog and synthesis
markup environment.When the system or application needs to describe to the speech-recognizer
what to listen for, one way it can do so is via a format that is both human and machine-readable.
2.3 Model Architecture for Voice Browser Systems Representations:
"To assist in clarifying the scope of charters of each of the several subgroups
of the W3C Voice Browser Working Group, a representative or model architecture for a
typical voice browser application has been developed. This architecture illustrates one
possible arrangement of the main components of a typical system, and should not be
construed as a recommendation."
2.4 Natural Language Processing Requirements:
It establishes a prioritized list of requirements for natural language processing in a
voice browser environment. The data that a voice browser uses to create a dialog can vary from a
rigid set of instructions and state transitions, whether declaratively and/or procedurally stated, to
a dialog that is created dynamically from information and constraints about the dialog itself. The
NLP requirements document describes the requirements of a system that takes the latter
approach, using an example paradigm of a set of tasks operating on a frame-based model. Slots
in the frame that are optionally filled guide the dialog and provide contextual information used
for task-selection.
2.5 Speech Synthesis Markup Requirements:
It establishes a prioritized list of requirements for speech synthesis markup which
any proposed markup language should address. A text-to-speech system, which is usually a
stand-alone module that does not actually "understand the meaning" of what is spoken, must rely
on hints to produce an utterance that is natural and easy to understand, and moreover, evokes the
desired meaning in the listener. In addition to these prosodic elements, the document also
describes issues such as multi-lingual capability, pronunciation issues for words not in the
lexicon, time-synchronization, and textual items that require special preprocessing before they
can be spoken properly.
3. STANDARDIZATION
Standards for voice browsing technology have been developed by the World Wide
Web Consortium (W3C), which develops interoperable technologies (specifications, guidelines,
software, and tools) to lead the Web to its full potential as a forum for information, commerce,
communication, and collective understanding. The relevant W3C work includes:
1. Voice Browser Working Group
2. Speech Interface Framework
3.1 Voice Browser Working Group:
It was established on 26 March 1999 and re-chartered through 31 January 2009. The
W3C Voice Browser Working Group made the Speech Interface Framework possible. This framework
allows developers to create speech-enabled applications that are based on Web technologies,
and provides an environment that will be familiar to those versed in Web development
techniques. Applications are thus written using parts of the Speech Interface Framework: in
much the same way that Web applications are written in HTML and run in a Web browser, speech
applications are written in VoiceXML and are rendered through a Voice Browser.
By one estimate, over 85% of Interactive Voice Response (IVR) applications for telephones
(including mobile) use W3C's VoiceXML standard.
The Voice Browser Working Group is coordinating its efforts to make the Web available on more
devices and in more situations.
3.1.1.Aim:
The aim of the W3C Working Group is to enable users to speak and listen to
Web applications by defining standard languages for developing Web-based speech applications.
This Working Group concentrates on languages for capturing and producing speech and
managing the conversation between user and computer system, while a related Group, the
Multimodal Interaction Working Group, works on additional input modes including keyboard
and mouse, ink and pen, etc.
3.1.2. W3C Recommendations:
Its recommendations have been reviewed by W3C Members, by software developers,
and by other interested parties, and are endorsed by the Director as Web Standards.
3.2 Speech Interface Framework:
The framework includes:
3.2.1.VoiceXML:
A language for creating audio dialogs that feature synthesized speech, digitized
audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and
mixed initiative conversations. Some of its versions are:
• VoiceXML 1.0: designed for creating audio dialogs.
• VoiceXML 2.0: uses the form interpretation algorithm (FIA).
• VoiceXML 2.1: 8 additional elements.
• VoiceXML 3.0: relationship between semantics and syntax.
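To make the analogy with HTML concrete, a minimal VoiceXML 2.0 document might look like the
sketch below; the form name and the referenced grammar file are hypothetical, not taken from
any deployed application:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="drink-order">
    <field name="drink">
      <!-- The browser speaks the prompt, then listens using the grammar. -->
      <prompt>Would you like coffee or tea?</prompt>
      <grammar type="application/srgs+xml" src="drink.grxml"/>
      <filled>
        <prompt>You chose <value expr="drink"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

A voice browser walks such a form with the form interpretation algorithm: it prompts, collects
input against the grammar, and fills the field.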
3.2.2.Speech Recognition Grammar Specification (SRGS) 1.0:
A document language that can be used by developers to specify the words and
patterns of words to be listened for by a speech recognizer or other grammar processor.
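For illustration, an SRGS 1.0 grammar covering a two-word vocabulary could be written as
follows (the rule name is arbitrary):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         mode="voice" root="drink" xml:lang="en-US">
  <rule id="drink" scope="public">
    <!-- The recognizer listens only for these alternatives. -->
    <one-of>
      <item>coffee</item>
      <item>tea</item>
    </one-of>
  </rule>
</grammar>
```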
3.2.3.Speech Synthesis Markup Language (SSML):
A markup language for rendering a combination of prerecorded speech,
synthetic speech, and music. Some of its versions are:
• Speech Synthesis Markup Language (SSML) 1.0
• Speech Synthesis Markup Language (SSML) 1.1
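A short, illustrative SSML 1.0 fragment showing the kind of prosodic hints discussed in
section 2.5:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <p>
    Your call is <emphasis level="strong">important</emphasis> to us.
    <break time="500ms"/>
    <prosody rate="slow" pitch="low">Please hold the line.</prosody>
  </p>
</speak>
```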
3.2.4.Semantic Interpretation (SISR):
A document format that represents annotations to grammar rules for extracting the
semantic results from recognition.
E.g.: Semantic Interpretation (SISR) 1.0
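SISR annotations are embedded in SRGS rules as <tag> elements; a sketch (the rule and the
semantic values are illustrative):

```xml
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         mode="voice" root="drink" tag-format="semantics/1.0"
         xml:lang="en-US">
  <rule id="drink" scope="public">
    <one-of>
      <!-- Different spoken forms map to the same semantic result. -->
      <item>coffee <tag>out = "coffee";</tag></item>
      <item>a cup of joe <tag>out = "coffee";</tag></item>
      <item>tea <tag>out = "tea";</tag></item>
    </one-of>
  </rule>
</grammar>
```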
3.2.5.Pronunciation Lexicon Specification (PLS):
A representation of phonetic information for use in speech recognition and
synthesis.
E.g.: Pronunciation Lexicon Specification (PLS) 1.0
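An illustrative PLS 1.0 lexicon entry giving a pronunciation for a word the engine might
otherwise mispronounce:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-GB">
  <lexeme>
    <grapheme>tomato</grapheme>
    <phoneme>təˈmɑːtəʊ</phoneme>
  </lexeme>
</lexicon>
```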
4.DESIGN OF THE SYSTEM
This section gives an overview of the design of a phone browser,
TeleBrowse. The prototype is assumed to have a phone-based physical medium. The architecture
is introduced first, followed by the functions provided by the prototype.
4.1 System Architecture:
The system consists of three modules: a voice driven interface, an HTML
translator and a command processor. The voice driven interface of the prototype includes a
voice recogniser, a DTMF recogniser, a speech synthesiser and an audio engine. The physical
medium for communication is intended to be a telephone, where the user can be remote from the
system's location, but it can also be simulated with a microphone/speaker combination
connected directly to a PC running the system. A voice recogniser and a DTMF (Dual Tone
Multi-Frequency) recogniser are available for input to the system, while the speech/audio
synthesis is suitable for output to the user. Hence, a command issued to the system can be in
one of two forms: either a command spoken by the user, or DTMF tones punched in on a phone's
keypad. DTMF assigns a specific frequency, or tone, to each key on a touch-tone telephone so
that it can easily be identified by a microprocessor. The purpose of DTMF in the architecture
is cost effectiveness: although voice recognition technology is capable of handling most of
the translation task, it is an expensive one. DTMF is therefore used for issuing commands that
are not complex enough to warrant translation to a voice driven format.
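The frequency pairs behind DTMF are standardized: each key sounds one tone from a
low-frequency row group and one from a high-frequency column group. The sketch below (Python,
not part of the prototype) shows how a key maps to its tone pair and back:

```python
# Standard DTMF row (low) and column (high) frequencies, in Hz.
ROWS = [697, 770, 852, 941]
COLS = [1209, 1336, 1477, 1633]
KEYS = ["123A", "456B", "789C", "*0#D"]

def tones_for_key(key):
    """Return the (low, high) frequency pair for a keypad key."""
    for r, row in enumerate(KEYS):
        if key in row:
            return ROWS[r], COLS[row.index(key)]
    raise ValueError(f"not a DTMF key: {key!r}")

def key_for_tones(low, high):
    """Identify the key from a detected frequency pair."""
    return KEYS[ROWS.index(low)][COLS.index(high)]
```

Because each key is a unique pair of tones, a microprocessor can identify it with two simple
frequency detectors instead of a speech recogniser.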
The Voice Driven Interface essentially accepts spoken words as input. The
input signals are then compared against a set of pre-defined commands. If there is an
appropriate match, the corresponding command is output to the Command Processor. The engine
also handles auxiliary functions (in conjunction with the speech synthesis engine) such as the
confirmation of commands where appropriate, speed control and the URL dictation system. The
Text-to-Speech/audio engine is responsible for producing the only output from the system back
to the user. The output is in the form of spoken (synthesised) text, or sounds for the
auditory icons. The input for these two engines comes from the Command Processor. The input
can either be a text stream consisting of the actual information to be read out as the content
of the page (processed by the TTS engine), or an instruction to play a sound as an auditory
icon to mark a document structure (processed by the audio engine).
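The matching step can be sketched as a simple normalise-and-look-up; the command names below
are hypothetical, not TeleBrowse's actual rule set:

```python
# A sketch (not the TeleBrowse source) of matching recogniser output
# against a set of pre-defined commands.
COMMANDS = {
    "speak faster": "SPEED_UP",
    "speak slower": "SPEED_DOWN",
    "where am i": "REPORT_POSITION",
    "next paragraph": "NAV_NEXT_PARAGRAPH",
}

def match_command(utterance):
    """Normalise the recognised words and look up the command.

    Returns the command token passed on to the Command Processor,
    or None when no rule matches (the caller would re-prompt).
    """
    key = " ".join(utterance.lower().split()).rstrip("?")
    return COMMANDS.get(key)
```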
The other major component is the HTML Translator. When a user requests an
HTML document, its contents must first be parsed and translated to a form suitable for use in
the audio realm. This includes the removal of unwanted tags and information, and the retagging
of structures for compliance and subsequent use with the audio output engine of the interface.
The translator also summarises information about the document, such as the title and the
positions of various structures, for use with the document summary feature.
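A minimal sketch of such a translator, written here with Python's standard html.parser rather
than the controls used by the prototype; the set of recorded structures is illustrative:

```python
from html.parser import HTMLParser

class Translator(HTMLParser):
    """Sketch of an HTML translator: strips markup, keeps the text,
    and records the title and heading positions for a summary."""

    def __init__(self):
        super().__init__()
        self.text = []        # text pieces, in document order
        self.title = ""
        self.headings = []    # (position-in-text, tag) for the summary
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("h1", "h2", "h3"):
            self.headings.append((len(self.text), tag))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text.append(data.strip())
```

Feeding a page through the parser yields plain text plus a structural index, which is the kind
of intermediate form an audio output engine can work from.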
A Command Processor sits between the HTML translator and the interface.
The command processor is responsible for acting on the voice / DTMF commands issued by a
user. The Processor retrieves HTML documents from the WWW and feeds them to the HTML
translation algorithm. It also controls the navigation between web pages, and the functionality
associated with this navigation (bookmarks, history list, etc.). This component also processes all
the other system and housekeeping commands associated with the program. It outputs a stream
of marked text to the speech synthesis / audio engine; the stream consists of a combination of
actual textual information and tags to mark where audio cues should be played.
"The architecture diagram was created as an aid to how we structure our work
into subgroups. The diagram will help us to pinpoint areas currently outside the scope of existing
groups."
Although individual instances of voice browser systems are apt to vary
considerably, it is reasonable to point out architectural commonalities as an aid to
discussion, design and implementation. Not all segments of this diagram need be present in any
one system, and systems which implement various subsets of this functionality may be organized
differently. Systems built entirely from third-party components, with an architecture thereby
imposed, may end up with unused or redundant functional blocks.
Two types of clients are illustrated: telephony and data networking. The
fundamental telephony client is, of course, the telephone, either wireline or wireless. The
handset telephone requires a PSTN (Public Switched Telephone Network) interface, which can be
tip/ring, T1, or higher level, and may include hybrid echo cancellation to remove line echoes
for ASR barge-in over audio output. A speakerphone will also require an acoustic echo
canceller to remove room echoes.
The data network interface will require only acoustic echo cancellation if used
with an open microphone, since there is no line echo on data networks. The IP interface is
shown for illustration only; other data transport mechanisms can be used as well. The model
architecture is shown below. Solid (green) boxes indicate system components, peripheral solid
(yellow) boxes indicate points of usage for markup language, and dotted peripheral boxes
indicate information flows.
Fig: 4.1 System architecture
5.LEVELS OF VOICE BROWSING
In order to understand it better in its current form, voice browsing can be
examined under three levels.
5.1.Voice Browsing:
This is the first level in examining the technique of voice browsing. The simplest
way of understanding the voice browser and the voice web would be to take the web itself into
consideration. We know that people all over the world visit websites and get visual feedback.
Now, the voice web contains voice sites where the feedback is delivered through dialogues.
The most basic example would be calling our cellphone operator's portal,
where speech recognition software provides the caller with a series of options such as
recharging the account, talking to a customer care executive or listening to new offers. The
technology of voice browsing has brought down the cost considerably: where a human operator
costs about a rupee per minute to talk to customers, an automated call costs about 10 paise
per minute. Voice browsing also has its uses in the corporate sector, especially in banks and
airlines.
5.2.Voice Browsing The World Wide Web:
This is the second level through which we can examine the voice browsing
technique. We can browse, through voice, websites which offer voice portals. In order to
maintain customer loyalty, these portals offer voice browsing of personal content. To stake a
claim in this market, AOL bought Quack.com. The voice portal market is being expanded almost
everywhere, be it Europe, the US or Asia.
5.3.The Voice Web:
The third level is the voice web. Many companies have introduced forums for
programmers for setting up voice sites, so as to enhance interest in voice browsing and speech
recognition. These forums further become voice webs which sprout voice auction sites and
voice-based chat rooms.
6.WEB PAGE DESIGN STRATEGIES FOR VOICE BROWSERS
The major obstacle to wide-scale commercial deployment of voice browsers for
the web is not the technology, but the ease (or difficulty!) with which web page designers can
add speech support to their sites. Authoring a web page for any specific type of user agent or
system configuration should never be a completely separate subject with arcane new techniques
developed for each special need, but rather an application of the common set of Universally
Accessible Design principles that should be part of every web author's repertoire. With few
exceptions, pages should never be designed for certain types (or brands) of browsers, but
should instead be designed for all uses (and potential uses) of the information. All web
documents should be as accessible to voice browsers as to visual user agents.
The HTML Writer's Guild studies, and discussions with web authors, have
shown that the primary obstacle to universal accessibility is ignorance. There are few cases
where a conscious decision has been made to produce a generally inaccessible web page; rather,
the author is simply unaware of the need to create accessible pages and the techniques by which
that is done. Once enlightened, most web authors eagerly embrace the concept of universal
accessibility, since the benefits are many and obvious.
In this section, some of the primary techniques of Universally Accessible
Design are briefly listed as they relate to voice browsers, together with ideas as to how
authors can implement these considerations when designing web pages.
6.1.Aural Style Sheets:
Aural Style Sheets are part of the Cascading Style Sheets, Level 2 specification,
and provide for a level of control in spoken text roughly analogous to that for displayed/printed
text. Aural rendering of a document is already commonly used by the blind and print-impaired
communities. It combines speech synthesis and "auditory icons." Often such aural presentation
occurs by converting the document to plain text and feeding this to a screen reader, software or
hardware that simply reads all the characters on the screen. This results in a less effective
presentation than would be the case if the document structure were retained. Style sheet
properties for aural presentation may be used together with visual properties (mixed media) or
as an aural alternative to visual presentation. The use of an aural style sheet (or aural
style sheet properties included in a general style sheet document) allows the author to
specify characteristics of the spoken text such as volume, pitch, speed, and stress; indicate
pauses and insert audio "icons" (sound files); and show how certain phrases, acronyms,
punctuation, and numbers should be voiced.
Combined with the @media selector for media types, a well-crafted aural style
sheet can greatly increase the accessibility of a web document in a voice browser. Further
investigation in this area is encouraged, especially in the area of example aural style sheets and
suggestions for authoring techniques.
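For instance, a small aural style sheet might look like the following (the cue file name is
hypothetical):

```css
@media aural {
  /* Speak top-level headings slowly, in a deeper voice, after a cue. */
  h1 { voice-family: male; pitch: low; richness: 90;
       cue-before: url("section.wav"); pause-after: 500ms; }
  /* Give emphasized text extra stress and volume. */
  em { stress: 75; volume: loud; }
  /* Spell out acronyms letter by letter. */
  abbr { speak: spell-out; }
}
```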
6.2.Rich Meta-Content:
HTML 4.0 gives the author the ability to embed a great deal of meta-content
into a document, specifying information which expands on the semantic meaning of the content
and allows for specialized rendering by the user agent. In other words, by using features found in
HTML 4.0 (and to a limited extent, in other versions of HTML), an author can give better
information to the browser, which can then make the document easier to use.
Judicious and ample use of meta-content within a document allows the author to
not simply specify the content, but also suggest the meaning and relationship of that content in
the context of the document. Voice browsers can then use that meta-information as appropriate
for their presentation and structural needs.
6.3.Planned Abstraction:
One use for meta-content information is the development of pages which are
designed to be abstracted. The typical document found on the web can often be quite lengthy;
finding information by listening to a web page being read out loud takes longer than visually
scanning a page, especially when most web pages are designed for visual use. Thus, most voice
browsers will provide a method for abstracting a page: presenting one or more outlines of the
page's content based on a semantic interpretation of the document.
Examples of potential or valid abstraction techniques include:
• Listing all the links and link text on a page.
• Forming a structure based on the H1, H2, ... H6 headers.
• Summarizing table data.
• Scanning for TITLE attributes in elements and presenting a list of options for expansion.
• Vocalizing any "bold" or emphasized text.
• Digesting the entire document into a summary based on keywords, as some search engines
provide.
There are any number of other options available to voice browser programmers
for providing short, easily-digestible versions of web contents to the browser user. This
suggests that the web author should provide as much meta-content as possible, as well as
careful use of HTML elements in their proper manner. Specific techniques include:
• Useful choices for link text (e.g., "the report is available" instead of "click here").
• Appropriate use of heading tags to define document structure, not simply for
size/formatting.
• Use of the SUMMARY attribute for tables.
• Use of STRONG and EM where appropriate, providing benefits for both vocal and visual
"scanability".
• Use of META elements with KEYWORDS and SUMMARY content.
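Several of these techniques can be seen together in one short, hypothetical fragment:

```html
<head>
  <title>Quarterly Sales Report</title>
  <meta name="keywords" content="sales, quarterly, report">
  <meta name="summary" content="Sales figures for the third quarter.">
</head>
<body>
  <h1>Quarterly Sales Report</h1>
  <h2>Regional Results</h2>
  <p>The <em>northern</em> region led growth;
     <a href="report.html">the full report is available</a>.</p>
  <table summary="Sales by region, in thousands of dollars">
    <!-- table rows omitted -->
  </table>
</body>
```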
6.4.Alternative Content for Unsupported Media Types:
The "poster child" for web accessibility is the ALT attribute, which allows
alternative text to be specified for images; if a user agent cannot display the visual image,
the ALT text can be used instead. Widespread use of the ALT attribute by all sites on the
Internet would likely double the accessibility of the World Wide Web with such a simple
change. Web authors who do not correctly use ALT text are seriously damaging the usability of
the entire medium!
For voice browsers, ALT text is vitally important, since images cannot be
represented aurally at all. Especially when an image is used as part of a link, alternative
content must be provided so that the voice browser can accurately render the page in a manner
useful to the user.
In addition to the ALT attribute on IMG, HTML 4.0 provides a number of other
ways to specify alternative content that can be used by a browser when an unsupported media
type is encountered. Some of those include:
• ALT attributes for image map AREAs, APPLETs, and image INPUT buttons.
• Text captions and transcripts for multimedia (video and audio).
• NOSCRIPT elements when including scripting languages, as voice browsers may be unable to
process JavaScript instructions.
• NOFRAMES elements when using frame sets, as frames are a very visually oriented method of
document display.
• Use of nested OBJECT elements to include a wide variety of alternative contents for many
media types.
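A hypothetical fragment illustrating these fallbacks:

```html
<img src="chart.gif" alt="Bar chart: sales rose 12% in the third quarter">
<input type="image" src="go.gif" alt="Submit search">

<script src="menu.js"></script>
<noscript><a href="sitemap.html">Site map</a></noscript>

<object data="demo.mpg" type="video/mpeg">
  <object data="demo-transcript.html" type="text/html">
    Transcript of the product demonstration.
  </object>
</object>
```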
7.FUNCTIONALITY
All communication from the user to the system is made by issuing voice
commands or using DTMF tones. Such commands are arranged into objects known as menus.
Depending upon the functionality requirement/availability, different menus are available at
different points in the program’s execution. A grammar set is defined to recognise the speech
commands. Some of these rules are for administrative control. To name a few administrative
controls:
<exit|quit [program|application|TeleBrowse]>
speak <faster|slower>
Where am I?
What is my homepage?
The other rules are used to control navigation. It is anticipated that they are the
most frequently used commands. The navigation is supported in various ways: within the same
page (intra-page navigation), browsing a new web page (inter-page navigation), bookmarks,
history list, document structure or to follow a hyperlink in the web page. The following
grammars display the nature of these rules:
Start browsing by <location|bookmark|homepage>
Maintain bookmarks
Start reading [all|again]
Load <location|bookmark|homepage>
Go to the history list
Jump <forwards|backwards> x <structure>
<Next|Previous> <structure>
A <structure> is one of paragraph, link, anchor, level 1/2/3 heading, list, page or
table, and x represents a positive integer value. All three versions of this command represent
one action: moving between structures within a document. Another way to navigate to a specific
target page is via dictation. Dictation is invoked whenever a 'browse by location' type
command is requested, and it is responsible for fetching a URL address from the user. Users
dictate to the system by saying words representing a single letter, to improve recognition
accuracy. A good example is the military code: 'alpha' for 'a', 'bravo' for 'b', 'charlie' for
'c', etc. The grammar recognises this military code, and also common animals, such as 'frog'
for 'f'. Macros and shortcuts are also used to simplify the dictation process. The 'http://'
at the beginning of every URL is added automatically, and the system recognises phrases like
'World Wide Web', 'company' and 'aussie', among many more, to represent 'www.', '.com', and
'.au' respectively.
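The dictation expansion described above can be sketched as follows; the word lists are
illustrative, not the prototype's actual vocabulary:

```python
# Sketch of URL dictation: spoken code words become letters, spoken
# macros become URL fragments, and 'http://' is prepended automatically.
LETTERS = {"alpha": "a", "bravo": "b", "charlie": "c", "frog": "f"}
MACROS = {"world wide web": "www.", "company": ".com", "aussie": ".au"}

def dictate(words):
    """Turn a sequence of dictated words into a URL."""
    phrase = " ".join(w.lower() for w in words)
    for spoken, text in MACROS.items():
        phrase = phrase.replace(spoken, text)
    parts = [LETTERS.get(w, w) for w in phrase.split()]
    return "http://" + "".join(parts)
```

For example, dictating "world wide web, alpha, bravo, charlie, company" would yield a complete
address without the user spelling out 'www.' or '.com'.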
The dictation menu also allows for corrections to be made, a review of what has
been dictated so far and an ability to restart the dictation session. Output from the system is
either synthesised text or sounds (as auditory icons). The synthesised text can represent either
actual information being read from a web page, or feedback about the system’s operation to the
user. When a page is to be read out (post translation), it is broken up and analysed one piece
at a time. Two situations can occur: if the piece is a tag with an associated auditory icon,
the icon is played out; or, if the piece is simply text, it is synthesised into voice.
Typical applications of auditory icons include a creaking door to represent an
internal link (a link to an anchor within the same web page), a doorbell to represent an email
address, or the clicking sound of a camera shutter to indicate an image. When certain tags are
encountered (end of paragraph, end of list, end of table row, etc.), speaking ceases and the
user is returned to the menu they were last at. Alternatively, if a user wishes to interrupt
the speaking prior to the next break point, the interrupt key '*' can be used.
8.EXPERIMENTS AND RESULTS
The TeleBrowse prototype was developed under the Visual Basic 6
environment. Three ActiveX controls were used in conjunction with the VB project. The
Microsoft Telephony Control (see http://www.microsoft.com/speech) provides the voice
recognition engine, speech synthesis engine, and audio output engine. The Microsoft Internet
Transfer Control retrieves HTML documents from the Web using the HTTP protocol, while the
HTML Zap Control (Newcomb, 1997) provides a simple interface for analysing HTML documents.
The primary goal of the evaluation of the TeleBrowse prototype was to determine the usability
and acceptance of the application as a voice-driven web-browsing tool. Measurement was made of
the operational efficiency of the system and of the quality of the transformed data.
The operational efficiency is represented by the physical interface, speech
recognition & synthesis capabilities, document and page navigation, prompts and feedback from
the prototype. The quality aspect is evaluated through the availability, integrity and usefulness of
the data after the translation, and also the support of various structures in the original HTML
page, e.g. headings, links, tables. The experiment was carried out on two different groups of
users who had similar characteristics, i.e. they were all above 18 years old, had prior
experience of using current web browsing products, and had limited knowledge of the WWW
and the related technology. The major difference between these two groups was subjects in the
first group (G1) were normally sighted, while the second group (G2) were visually impaired, or
to be precise, they suffered from complete blindness. In other words, subjects in G1 were
familiar with typical visually oriented web browsers such as Microsoft Internet Explorer. In this
experiment, there were five people in G1 and four subjects in G2.
Evaluation sessions were run on a one-to-one basis. There was also a general
discussion involving the group at the end of the individual experiments. No interaction occurred
between subjects from differing groups. Each subject was briefed about the operation of the
prototype, and the methods for interacting with it. Along with the briefing, each user was given a
sheet on which was printed a vocabulary listing of what phrases the program would respond to,
and at what points in the program’s execution such phrases could be used. In the case of the
subjects in G2, the sheet was printed using a Braille printing device. The sheet also contained a
set of tasks the subjects had to complete using the program. Some of the tasks that subjects were
required to complete included: starting the application, checking to see what the current
homepage was set to, commencing reading of the web document once loaded, jumping to
another location by dictating a URL address directly into the system etc. Accompanying each
task was a description of the behaviour the system would demonstrate during the task’s
execution, and what phrases to use in interaction with the system to complete those tasks. All
tasks were first completed using the software running in emulation mode on the laptop computer.
The speech recognition engine was not trained to adapt to any specific person. After completing
this and gaining a degree of experience in using the system, subjects were then given an
opportunity to use the system as they chose over the phone, completing the effect of a
phone-based web-browsing tool.
Subjects were also asked to view the same web pages that had just been
'viewed' using the prototype with a browser they would typically use. In the case of G1, this
was either Microsoft Internet Explorer or Netscape Navigator. In the case of G2, this was
again Microsoft Internet Explorer, but this time with the addition of the JAWS screen reading
program. After using the prototype for a sufficient amount of time (in most cases a period of
about twenty minutes to half an hour), each subject was asked to complete a questionnaire to
record their experiences with the prototype. The questionnaire was arranged into two parts:
the measurement of efficiency and of integrity. Some of the questions asked the participants
to give numerical scores on a scale of 1 (very poor) to 7 (very good). Others were free
format, where the subjects could provide their own comments.
This gave us an initial and very broad insight into how subjects from each group
responded to using the prototype in the experiments. By graphing both groups' results on the
same axes, we could also see a comparison of acceptance between the two groups. In addition to
the rated-response questions, the evaluation questionnaire contained a further thirteen
questions of free-form response in nature. These free-form questions were designed to draw out
any comments, problems, criticism, or general feedback from the test subjects. There were, on
average, two such questions per section of the evaluation criteria.
8.1. Voice Recognition:
Voice recognition was one of the more poorly rated criteria. While both groups considered it a necessary technology for the idea of a phone browser, subjects suffered from its shortcomings, and it did result in a loss of efficiency for most users. Subjects from G2 responded more favourably than those from G1. Of particular concern was the dictation of URL addresses, which was noted as a shortcoming in the interface by every user from both groups. The idea of having
to spell out URL addresses one letter at a time (and wait for confirmation of each letter) was not well received. The idea of using shortcuts, such as saying ‘company’ in place of spelling out ‘.com’, was considered a strong improvement, and this technique should be explored further.
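As a sketch, the shortcut idea amounts to a small expansion table applied to the recognised tokens. The vocabulary below is hypothetical; the report does not specify the prototype's actual shortcut words:

```python
# Hypothetical sketch: expanding spoken shortcuts into URL fragments.
# The vocabulary is illustrative, not the prototype's actual word list.
SHORTCUTS = {
    "company": ".com",
    "network": ".net",
    "organisation": ".org",
    "web": "www.",
    "dot": ".",
    "slash": "/",
}

def expand_spoken_url(tokens):
    """Turn a sequence of recognised words into a URL string.

    Single letters pass through unchanged; known shortcut words
    are replaced by the fragment they stand for.
    """
    return "".join(SHORTCUTS.get(t.lower(), t.lower()) for t in tokens)

# "web" "b" "b" "c" "company" -> "www.bbc.com"
print(expand_spoken_url(["web", "b", "b", "c", "company"]))
```

With a table like this, the user dictates five short tokens instead of spelling eleven characters and confirming each one.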
8.2. Speech Synthesis:
There was little or no problem with this sub-system. Subjects found the voice
easy to understand and of suitable volume and pitch. The major contrast between the two groups
was the usage of the speed control feature. Subjects from G1 saw no reason to adjust the speed of
the synthesised voice. They were content with the default normally paced speaking voice.
However, subjects from G2 tended to change the speed to a much higher rate before doing
anything else.
8.3. Navigation:
The overall ability of intra-page and inter-page navigation using the system was rated favourably by both groups. The use of auditory icons to mark HTML structures was viewed by the G2 subjects as being superior to any similar screen-reader marking scheme. Subjects from G1 also appreciated the ability of the auditory icons to quickly and simply mark structures in a document, in a way that was natural and easy to remember. The idea of metaphorically matching the meaning of sounds with the structures they represented was well liked and accepted. There was little problem with remembering the mapping of sound to structure, especially after using the system for an extended amount of time. One criticism of the auditory icons was that they appeared too frequently, and could be seen as breaking up the flow of the text unnecessarily.
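Such a metaphorical mapping can be pictured as a simple lookup from HTML structure to sound. The tag-to-sound pairs below are illustrative assumptions, not the prototype's actual icon set:

```python
# Hypothetical sketch: mapping HTML structures to auditory icons.
# Tag names are standard HTML; the sound files are illustrative only.
AUDITORY_ICONS = {
    "h1": "gong.wav",       # a strong sound for a major heading
    "h2": "chime.wav",      # a lighter sound for a subheading
    "li": "tick.wav",       # a short tick before each list item
    "table": "drawer.wav",  # a "drawer opening" metaphor for tables
    "a": "doorbell.wav",    # a doorbell marks a hyperlink
}

def icon_for(tag):
    """Return the auditory icon to play before rendering a structure,
    or None if the tag is unmarked and read as plain text."""
    return AUDITORY_ICONS.get(tag.lower())
```

Keeping the table small is one way to address the frequency criticism above: only the structures that genuinely aid orientation need an icon.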
A comment made by many subjects was that the prototype offered functionality similar and familiar to that of browsers they had previously used. Thus, features such as bookmarks, the history or ‘go’ list, and the ability to save a ‘home’ page were all well received. The ability to follow links contained in documents was well liked. Using different auditory icons for the different types of links allowed subjects to know in advance whether the link led to a target within the same document or to another, external document; this too was well liked. Again, the problem of dictating URL addresses was raised.
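Distinguishing the two link types only requires inspecting the link target. A minimal sketch using Python's standard urllib (the page URLs are hypothetical):

```python
from urllib.parse import urlparse

def link_type(href, page_url):
    """Classify a link so a distinct auditory icon can be chosen.

    Returns "internal" for a target within the same document
    (a bare fragment, or the same page plus a fragment) and
    "external" for a link to another document.
    """
    if href.startswith("#"):
        return "internal"
    link, page = urlparse(href), urlparse(page_url)
    same_page = (link.netloc in ("", page.netloc)
                 and link.path in ("", page.path))
    return "internal" if same_page and link.fragment else "external"

print(link_type("#section2", "http://example.org/index.html"))   # internal
print(link_type("http://www.w3.org/Voice/", "http://example.org/"))  # external
```

The classification runs before the link is spoken, so the matching icon can be cued ahead of the link text.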
8.4. Online Help:
This section of the criteria did not rate well, owing to the lack of help associated with the commands and prompts used in the system. All users thought that more detailed, context-based help explaining the meaning and usage of commands should be available at any point during the system’s operation, as opposed to the simple listing of commands currently available. Certainly the need to refer to other supporting documentation for more detailed information should be avoided, as access to such documentation would not be available in the environments where a phone browser might be used. The tutorial available from the system’s main menu was well accepted in terms of its content, but a similarly detailed tutorial should perhaps be available at every menu in the system, customised for the relevant set of commands.
8.5. Information & Structure:
There was a marked difference between the two evaluation groups on these two aspects. G2 was more willing than G1 to accept the level of integrity of the information presented by the prototype. Users in G1 commented on the inability to quickly visualise an entire page at a “glance”, and reported frustration at being forced to listen to the content in sequential fashion.
8.6. Overall Impression:
The subjects from both evaluation groups accepted the prototype as a viable method of browsing the web in the audio realm by phone. The efficiency of the product was quite highly regarded by most subjects, and the system interface fared very well. The only major problem was the dictation of URL addresses to the system.
9. FUTURE OF VOICE TECHNOLOGY
Speech technology is expected to grow rapidly, with the voice portal market reaching billions of dollars within just a few years. The Kelsey Group estimates that the voice-browsing market will reach 6.5 billion dollars, while Ovum estimates a world market of 26 billion dollars. Given the variation in these figures, the actual growth of the voice technology industry is anyone’s guess. Navigating on a WAP device by scrolling through many lists is very difficult; hands-free interaction enables an easier form of communication between the user and the system.
Voice browsing can be used to access three kinds of information:
9.1. Business:
Information such as automated telephone ordering services, support desks, order tracking, airline arrival and departure information, cinema and theatre booking services, and home banking services can be retrieved very easily using voice browsing.
9.2. Public:
A voice browser can be used to access services such as local, national and international news, along with community information such as weather forecasts, traffic conditions, school closures and events. It can also be used to gather national and international stock market information, and to carry out business and e-commerce transactions.
9.3. Personal use:
Voice browsing is also used to access personal information such as voice mail, personal horoscopes, newsletters, calendars, and address and telephone lists. In the future, voice browsing is expected to become visual, i.e. multimodal. The greatest achievement, however, would be the integration of voice browsing with all types of operating systems; this success would surely make voice browsing available to each and every application.
10. BENEFITS
Voice is a very natural user interface because it enables the user to speak and listen using skills learned during childhood. Currently, users interact with voice browsers by speaking and listening on telephones and cell phones that have no display. Some voice browsers may have small screens, such as those found on cell phones and palm computers. In the future, voice browsers may also support other modes and media, such as pen, video, and sensor input, and graphics, animation, and actuator controls as output. For example, voice and pen input would be appropriate for Asian users whose spoken language does not lend itself to entry with traditional QWERTY keyboards.
Some voice browsers are portable. They can be used anywhere—at home, at
work, and on the road. Information will be available to a greater audience, especially to people
who have access to handsets, either telephones or cell phones, but not to networked computers.
Voice browsers present a pragmatic interface for functionally blind users, and for users who need Web access while keeping their hands and eyes free for other things. They also present an invisible user interface, freeing workspace previously occupied by keyboards and mice.
11. CONCLUSION
The VoiceBrowse prototype developed for this project proved its success in the evaluation as a telephone-based web-browsing tool. Problems were highlighted in the user study. Some of these problems relate to the features currently available in the prototype, for example the navigation of tables, the interruption of output using voice rather than the “*” key on the phone pad, and the missing function of filling in forms on a web page. At the same time, there are deeper issues that require more research, including the dictation function (how and where contextual information can be acquired to improve the accuracy of dictation) and the visualisation process (what kind of model can be provided that gives a feeling of information arriving in parallel).
Testing the prototype in real-life environments rather than under standard laboratory conditions, e.g. while walking in the street or driving a car, is also an important direction for future work. This may uncover further issues relating to situated interaction. Research into domain-specific speech interaction models may also improve the accuracy and effectiveness of the system. It is the intention of this project to create an environment where web surfing with voice is possible and the surfing experience is a pleasant one. The knowledge accumulated in producing the first-generation prototype, VoiceBrowse, has given us many insights and laid the groundwork that helps us move on to develop the second-generation prototype.
12. BIBLIOGRAPHY
http://www.w3.org/TR/REC-CSS2/aural.html
http://www.hwg.org/opcenter/w3c/voicebrowsers.html
http://www.w3.org/Voice/1998/Workshop/Michael-Brown.html
http://www.oasis-open.org/cover/wap-wml.html
Newcomb, M. (1997). HtmlZap ATL ActiveX Control.
http://www.miken.com/htmlzap/
http://www.brookes.ac.uk/schools/cms/research/speech/publications/43hft97.htm