
Philips J. Res. 49 (1995) 439-454

USER INTERFACE DESIGN OF VOICE CONTROLLED CONSUMER ELECTRONICS

by S. GAMM and R. HAEB-UMBACH
Philips GmbH Forschungslaboratorien, Aachen, Postfach 1980, D-52021 Aachen, Germany

Abstract

Today speech recognition of a small vocabulary can be realized so cost-effectively that the technology can penetrate into consumer electronics. But, as first applications that failed on the market show, it is by no means obvious how to incorporate voice control in a user interface. This paper addresses the issue of how to design a voice control so that the user perceives it as a benefit. User interface guidelines that are adapted or specific to voice control are presented. Then the process of designing a voice control in the user-centred approach is described. By means of two examples, the car stereo and the telephone answering machine, it is shown how this is turned into practice.

Keywords: car stereo; human factors; speech recognition; telephone answering machine; usability engineering; voice control.

1. Introduction

Voice control indicates the ability to control a machine by means of spoken commands. The first application to come to mind is perhaps in a factory environment where speech is used to elicit a certain action when both hands are occupied. Here, however, we are concerned with applications in the consumer or telecommunications domain. To name just one example, consider name dialling: in order to place a call, the name of the person to be called is simply spoken after a corresponding prompt, instead of keying in the telephone number. This example already shows some of the issues involved. Compared to the factory environment, where the hands simply are not available to do the task, here speech input is just an alternative input modality which has to compete with conventional input modalities. But speech indeed offers some unique features, in this case directness: speaking the name of the person avoids the mental translation of the name into the telephone number. We will discuss other examples of voice control in more detail later.

With the recent progress in recognition accuracy and the decline of hardware costs, speech recognition comes within reach of everyday consumer and telecommunication products.

It is by no means straightforward and obvious how to incorporate voice control successfully into an everyday appliance. The validity of this statement may be seen from the limited success of the first consumer products that included speech recognition. It is often said that speech I/O is the most natural interface. However, man-machine communication would only be really natural if the machine could meet the capabilities of a human. Since automatic speech recognition is still inferior to human speech recognition, despite great progress in recent years, building reliable and accepted speech interfaces is a sophisticated and subtle process. Just replacing button presses by speech input does not seem to deliver any discernible customer benefit. Speech has to be included from the outset of the development rather than simply being added to an existing system. If the human factors issues are addressed from the beginning and the technological capabilities are well taken into account, attractive designs can be developed which take full advantage of the unique properties of speech input. We will address these issues in two examples.

The unique properties of speech as an input modality are:

• Hands-free operation: Speech input allows hands- and eyes-free operation, which is very important in hands- or eyes-busy situations, e.g. while driving a car.

• Remote control: Speech input can be used for remote control, e.g. via the telephone, in order to control a system which is out of manual reach.

• Straightforwardness: With voice control no mental translations are necessary. With name dialling, for example, the names need no longer be translated into numbers.

The challenge of the user interface design is to exploit these strengths while at the same time trying to cope with the shortcomings of speech input. The machine recognizer has a limited vocabulary size, often a more or less rigid dialogue structure, and misrecognitions are possible. A user interface design must be aware of these limitations and take into consideration how the performance is affected by factors such as vocabulary size, dialogue structure, adherence to prompts, etc.

The consumer application domain poses very stringent restrictions on hardware costs. With today's technology, recognition vocabularies of more than 100 words seem to be unrealistic. A restricted recognition vocabulary has immediate consequences for the user interface and dialogue design. It is thus the goal of the design process to find a good compromise between recognition accuracy, implementation costs and flexibility of the dialogue.

The outline of the paper is as follows. In the next section we give a short overview of state-of-the-art recognition technology in as much as it affects user interface design. Section 3 contains a summary of guidelines on how to incorporate voice control in a user interface. An example, the voice-controlled car stereo, illustrates how these guidelines can be turned into practice. In Section 4 we discuss the user interface design process. We present the usability engineering lifecycle for voice controls and illustrate it with the example of a voice-controlled telephone answering machine. The results are summarized in Section 5.

2. Towards a user-centred interface

The technological capabilities of automatic speech recognition have a great impact on user interface design. In early systems the user had to speak fixed words in a broken fashion to a computer that repeated the last statement, asked for confirmation, then requested another instruction. Such a dialogue is not at all natural. With the progress in speech recognition, more and more constraints no longer apply, allowing the design of a more user-oriented rather than machine-oriented dialogue. In the following we give examples of how algorithmic advances lead to increased freedom for the interface designer.

2.1. Continuous-speech recognition

Current state-of-the-art systems allow for continuous input, rendering the mandatory pauses between the words of an isolated-word speech recognizer superfluous. This results not only in a more natural way of speaking but also in a higher throughput in spoken words per second.

2.2. Keyword spotting

Keyword spotting is a technique to artificially increase the vocabulary size of a recognition system. Commands can be embedded in carrier phrases; the recognition system extracts the command, whereas non-keywords are detected as 'garbage' and thus discarded. Such a technique frees the user from having to restrict the input to keywords only.
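The effect of keyword spotting can be illustrated with a minimal sketch. In a real recognizer the filtering happens acoustically, by scoring 'garbage' models against keyword models; the toy below (with an invented command set) only mimics the resulting behaviour of extracting commands from a carrier phrase:

```python
# Toy keyword-spotting sketch (the command set is a hypothetical example).
# A real recognizer scores acoustic garbage models against keyword models;
# here we only imitate the outcome: keep known commands, discard the rest.

KEYWORDS = {"radio", "volume", "next", "previous", "delete", "replay"}

def spot_keywords(utterance: str) -> list[str]:
    """Return the keywords found in a recognized utterance, in order."""
    return [w for w in utterance.lower().split() if w in KEYWORDS]

print(spot_keywords("please delete this message"))  # ['delete']
print(spot_keywords("turn the radio on"))           # ['radio']
```

The surrounding words ('please', 'this message') are treated as garbage, so the user need not adhere strictly to the keyword vocabulary.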


2.3. Speaker-independent recognition

Obliging a user to train a system before using it might indeed often be asking too much, in particular for everyday consumer products. A speaker-independent recognizer frees the user from having to train the system. Note, however, that the speaker-independent vocabulary is fixed and cannot be altered by the user. Therefore, in practice often a mix of speaker-independent and speaker-dependent vocabulary is used. Command words and other common vocabulary (e.g. digits) are 'factory-trained', and the user can add speaker-dependent words for personal settings, e.g. the name repertory for name dialling. Furthermore, it is often a good safety measure to allow a user to overwrite a speaker-independent template by a speaker-dependent one, either to improve the recognition accuracy of this particular word or to give the user the freedom to replace a speaker-independent word by a word he feels more comfortable with. Table I summarizes properties of the two types of recognition vocabulary.
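The mixed vocabulary with the override rule can be sketched as follows; the class and word lists are illustrative assumptions, not the paper's implementation:

```python
# Sketch of a mixed recognition vocabulary (hypothetical structure):
# factory-trained speaker-independent entries plus user-trained
# speaker-dependent ones. Training a word the factory already knows
# overwrites the factory template, as the safety measure described above.

class Vocabulary:
    def __init__(self, factory_words):
        # every factory word starts with a speaker-independent template
        self.templates = {w: "factory" for w in factory_words}

    def train_user_word(self, word):
        """Add a user word, or overwrite an existing factory template."""
        self.templates[word] = "user"

    def source(self, word):
        return self.templates.get(word)

vocab = Vocabulary(["one", "two", "radio", "volume"])
vocab.train_user_word("BBC")    # personal setting, e.g. a station name
vocab.train_user_word("radio")  # overwrite a poorly recognized factory word
print(vocab.source("radio"))    # user
print(vocab.source("volume"))   # factory
```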

2.4. Robustness

Robustness means the ability of a system to maintain its performance even under changing environmental conditions. Here, new algorithms have led to considerable improvement, although this topic is still a much addressed research issue. A practical consequence of improved robustness is that desktop or hand-held microphones can now be used in situations where head-mounted microphones had to be used before.

The above examples show that algorithmic advances have led to more freedom for the user interface design. One has to bear in mind, though, that a speaker-dependent isolated-word recognizer with a 10-word vocabulary will still have a better performance in terms of recognition accuracy than a speaker-independent continuous-speech 100-word recognizer. Whereas the speech recognition researcher has a fairly simple measure of performance, the error rate, the picture is much more complicated for the user interface designer. He has to optimize user satisfaction. He has to find the right balance between recognition accuracy and flexibility for the user, given an upper limit on the algorithmic complexity dictated by the implementation costs.

TABLE I
Comparison of speaker-independent and speaker-dependent vocabulary

                                 Speaker-independent   Speaker-dependent
Requires training by user        no                    yes
Language dependent               yes                   no
Degree of algorithm complexity   high                  low
Recognition accuracy             relatively low        relatively high

3. User interface guidelines for voice control

In this section we present guidelines for incorporating voice control into a user interface. The guidelines are explained in the first subsection, whereas the second subsection illustrates them by means of an example.

3.1. The guidelines

A number of usability principles have found widespread acceptance [1]. In the following we mention those guidelines which assume a special or additional interpretation for voice control interfaces. Further, some specific guidelines for voice control interfaces [2], pertinent to our applications, will also be reviewed.

Give the user the choice of input modality

Systems that use several input modalities, such as voice input and keyboard input, should accept them alternatively whenever an input is demanded from the user [3]. The user should not have to opt once and for all for one input modality. Adherence to this guideline is essential if different input modalities are to complement each other in such a way that one modality compensates for the shortcomings of the other [4].

Be consistent

Consistency asserts that mechanisms should be used in the same way whenever they occur. A particular system action should always be achievable by one particular user action, such that a user is not required to learn which command is required for the same intended action at which stage of the machine [5]. If speech input is employed as another input modality, consistency also means that the result of a command should be the same irrespective of the way it has been invoked, whether by a speech command or by a button.

Provide appropriate feedback

The system should always keep the user informed about what is going on by providing him with correct feedback within reasonable time [5]. Without feedback the user cannot learn from mistakes [6]. This issue is, however, particularly delicate for voice control interfaces. It is very awkward and tiring if the recognizer asks for confirmation each time a word has been recognized. If the outcome of a misrecognition is not fatal, it is more appropriate to execute the recognized command rather than to ask for confirmation. Feedback is then given implicitly by the reaction of the machine.
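The feedback guideline can be sketched as a small dispatch policy. The function names and the fatal/non-fatal examples are illustrative assumptions, not the paper's implementation:

```python
# Sketch of the feedback policy described above (hypothetical names):
# ask for explicit confirmation only when a misrecognition would be costly;
# otherwise execute immediately, so the machine's visible or audible
# reaction itself serves as implicit feedback.

def handle_recognized(command, is_fatal, execute, confirm):
    """Dispatch a recognized command according to the feedback guideline."""
    if is_fatal(command):
        # e.g. 'delete all messages': ask before acting
        if confirm(command):
            execute(command)
    else:
        # e.g. 'next station': act at once; the changing sound is feedback
        execute(command)

log = []
handle_recognized("next", lambda c: False, log.append, lambda c: True)
handle_recognized("delete all", lambda c: c == "delete all",
                  log.append, lambda c: False)
print(log)  # ['next'] - the fatal command was declined, so not executed
```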

Take into account the user's expectations

Take into account the possibility that the user's expectations of the system will affect his interpretation of any dialogue with it. The dialogue should be designed to minimize confusion arising from these expectations [7]. For a speech interface the user's expectations might easily exceed the machine's capabilities. Therefore it is important to detect usage problems. One way to do so is to react upon recognition of non-keyword speech: if a non-valid utterance is detected, help menus may be offered automatically.


Do not overload the voice input channel

Each input modality has its own strengths. The input of a location can ideally be done by a pointing device; a straightforward selection is ideally done by speech. But speech is definitely not suited for fine tuning, i.e. the adjustment of a value within a continuous range. Commands like 'up' and 'down' have only limited utility; they have to be used iteratively in order to accomplish an acceptable degree of precision in manipulating an object [2]. Since speech input is not useful for all functions, the right balance between the different input modalities has to be found.

3.2. Example: The voice-controlled car stereo

Hands-free operation has been mentioned as one of the unique properties of speech input. This property is the motivation for bringing speech recognition into the car environment.

Figure 1 shows the control panel of a car stereo with an extra 3-button control element positioned on or close to the steering wheel. It can be controlled while keeping the hands on the steering wheel, similar to windscreen wipers or indicators. The 3-button control element exists in parallel to the ordinary control panel in the slot of the car stereo. The microphone is a free-talk microphone mounted preferably on the car ceiling.

Fig. 1. Voice-controlled car stereo.

Using voice commands and the 3-button control element, most of the functions of the car stereo, at least the common functions, can be called.

The speech recognizer employs speaker-independent command words and speaker-dependent, i.e. user-defined, words. The speaker-dependent part is used for user-defined names of radio stations. The station names are trained as follows: the radio station to be programmed is tuned in and, when the training mode is entered, the user is asked to speak the name of the station, which of course need not be the 'official' name, a couple of times (typically 2 to 4 times). Afterwards this station can be tuned in by speaking its given name. This scenario is very much like the original programming of presets.

A typical usage scenario is as follows. The car stereo is turned on by saying 'radio' or 'turn the radio on' while pressing and holding down the recording button in the middle. The keyword 'radio' is recognized and surrounding phrases, as in the second command, are discarded by the keyword spotting feature. The resulting action is that the car stereo is turned on and starts to play. Stations can now be tuned by speaking their names, such as 'BBC' or 'change to BBC now'. The command word 'volume' will program the up and down keys with the volume function, i.e. pushing the up-key will increase the volume, pushing the down-key will decrease it. But many other functions can also be realized with these keys. To mention just one more, the command 'scan station' will program the keys to assume the tuning function: pushing the up-key looks for radio stations by increasing the frequency, pushing the down-key by decreasing it. We have thus introduced the concept of speech-programmable softkeys.

In the following it shall be shown how the guidelines presented have been turned into practice in this application.
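The speech-programmable softkey concept described above can be sketched as a small binding table; the state variables and the increment steps are invented for illustration:

```python
# Sketch of speech-programmable softkeys (hypothetical values): a spoken
# command word re-binds the up/down keys to a pair of actions, so two
# physical keys can serve many functions.

state = {"volume": 5, "frequency": 95}

SOFTKEY_BINDINGS = {
    "volume": (lambda: state.__setitem__("volume", state["volume"] + 1),
               lambda: state.__setitem__("volume", state["volume"] - 1)),
    "scan station": (
        lambda: state.__setitem__("frequency", state["frequency"] + 1),
        lambda: state.__setitem__("frequency", state["frequency"] - 1)),
}

up_key, down_key = SOFTKEY_BINDINGS["volume"]        # user says 'volume'
up_key()                                             # up-key raises volume
up_key, down_key = SOFTKEY_BINDINGS["scan station"]  # user says 'scan station'
down_key()                                           # same key now tunes down
print(state)  # {'volume': 6, 'frequency': 94}
```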

Give the user the choice of input modality

The voice control has been added to the ordinary key control. For the most common functions the user has the choice between voice and key control. For tuning a radio station, for example, he can either press a preset button or speak the station's name. During use he can switch between voice and key control whenever he wants. There is no need to stick to the input medium once chosen. The key control serves as a fall-back mechanism in case of misrecognitions.

Be consistent

The operation is independent of the music source selected, such as radio or CD. Scanning, for example, is done in the same way for radio stations as for CD tracks. Consistency is not only ensured between the two functional parts of the car stereo but also between the two input modalities. When scanning radio stations, for example, pressing the 'Next' key has the same effect as speaking 'next'; i.e. the feedback and the possible further functions are identical.

Provide appropriate feedback

In this application the feedback is mostly implicit, since misrecognitions do not have fatal effects. After tuning in another radio station, for example, the user gets implicit feedback by the changing sound and, in addition, explicit feedback by the display of the new station name.

Take into account the user's expectations

The user may overestimate the capabilities of the machine. In order to avoid confusion by wrong expectations, the machine guides the user when he speaks in such a manner that the machine is not able to understand him.


Do not overload the voice input channel

Voice control is ideal for a straightforward selection, e.g. for the selection of a preset radio station. Voice control is not suited for all kinds of fine tuning; i.e. the adjustment of volume, bass, treble and fading is better done by keys. The trade-off between voice and key control also concerns the activation of the speech recognizer. In principle the recognizer could be activated by pressing a button or by speaking a codeword. Although speech activation may be desirable in the car environment, we chose activation by a button. For controlling the car stereo, the middle button in Fig. 1 has to be pressed and held down while a speech command is uttered. Currently, activation by speech is still too error prone, and an unintentional activation is very costly in terms of user dissatisfaction. In addition, vocal activation tends to be tiring, since each command word has to be preceded by an activation word.

4. Usability engineering lifecycle

In this section we present the usability engineering lifecycle, i.e. the process of developing a usable voice control. The lifecycle itself is explained in the first subsection, whereas the second subsection illustrates it by means of an example [8].

Fig. 2. The usability engineering lifecycle.


4.1. The lifecycle

The development process is illustrated in Fig. 2. It consists of five phases, which are described in the following.


Functional specification

The result of the functional specification phase is the feature set of the device. First, all conceivable features are listed and assessed according to several criteria, such as frequency or complexity of use. These criteria span a space in which all the features can be located. The definition of the feature set is basically a trade-off between utility and complexity of use. It is of course to some extent arbitrary, but the assessments are based on user enquiries or market studies.

Concept development

The result of the concept development phase is an outline of the man-machine dialogue. This outline is a dialogue specification at the abstract level. Different concepts may be developed in parallel as competing solutions in this phase.

In order to avoid false developments at an early stage of the development process, the concepts are tested by means of simple mock-ups or Wizard-of-Oz simulations [9]. If competing concepts are available, the most promising one can thereby be identified.

Dialogue specification

The result of the dialogue specification phase is the formal specification of the man-machine dialogue by means of a state transition diagram. Based on the chosen concept, the structure of the dialogue is refined until the man-machine dialogue is fully specified. During this refinement, commonly accepted guidelines for dialogue design should be observed [1]. There may be certain requirements which are especially important for a particular application.

The dialogue specification is supported by a graphical state editor which is part of our development environment. The state editor allows us not only to draw state transition diagrams in an interactive style, but also to specify events, e.g. recognized words, that trigger a state transition, or functions that are activated by a state transition (see Fig. 3).
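A state transition diagram of this kind reduces, in essence, to a table mapping (state, event) pairs to (next state, action) pairs. The sketch below uses invented states and events, not the paper's actual diagram:

```python
# Sketch of a dialogue specified as a state transition table. Each entry
# maps (current state, recognized word) to (next state, action) - which is
# essentially what a graphical state editor produces.

TRANSITIONS = {
    ("idle", "radio"):     ("playing", "power_on"),
    ("playing", "volume"): ("playing", "bind_keys_to_volume"),
    ("playing", "off"):    ("idle", "power_off"),
}

def step(state, event):
    """Follow one transition; unknown events leave the state unchanged."""
    next_state, action = TRANSITIONS.get((state, event), (state, None))
    return next_state, action

state, action = step("idle", "radio")
print(state, action)  # playing power_on
```

A rapid-prototyping system can interpret such a table directly, which is what makes the dialogue executable before any product hardware exists.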


Fig. 3. The dialogue editor.

Rapid prototyping and iterative improvement

The result of the rapid prototyping phase is a software simulation of the system according to the dialogue specification. The state editor generates a module which can be interpreted by a rapid prototyping system. The prototype is used for simple, qualitative tests with users. Observations of test persons using the system are fed back into the design cycle and thus lead to an improved dialogue and prototype.

Verification

In the verification phase the prototype is finally assessed from the human factors point of view. The assessment can be done by means of interviews, focus groups or performance measures in usability tests. It is thereby verified whether the usability goals, which have been defined in the concept phase, have been met. If the assessment reveals that the goals are not met, a further iteration step has to be added to the design process.


4.2. Example: The voice-controlled telephone answering machine

Remote control via the telephone has been mentioned as one of the unique advantages of voice control. This advantage is the motivation for bringing speech recognition into the telephone answering machine. Currently, answering machines can be interrogated remotely by means of touch-tones, the dual tone multifrequency (DTMF) signals. In many countries the DTMF penetration is lower than 60% [10], which means that an additional bleeper needs to be carried along all the time. Remote control means that there is no control panel and no button to activate the speech recognizer. The activation is done context-dependently, automatically by the machine, and it is hidden from the user.

The speech recognizer employs speaker-independent command words and a speaker-dependent, i.e. user-defined, password. The spoken password is used as access control for remote interrogation and can replace the former 4-digit PIN.

A typical usage scenario is as follows. When the answering machine picks up the line, it plays a greeting message. After the beep, when callers usually leave their message, the owner who wants to interrogate the machine remotely speaks his password. The machine starts playing the received messages, one after the other. Between messages the caller can say 'previous', 'next', 'delete' or 'replay'. 'Delete', for example, causes the system to delete the message that has just been played. The command word can be embedded in fluent speech; e.g. one can also say 'please delete this message'.

In the following it will be shown how the usability engineering lifecycle has been turned into practice in this application.
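The interrogation scenario above can be sketched as a small command loop. This is a toy model with invented message data, not Philips' implementation:

```python
# Sketch of the remote-interrogation command handling: after each message
# the caller may say 'previous', 'next', 'delete' or 'replay'.

messages = ["msg1", "msg2", "msg3"]

def handle(command, index):
    """Apply one spoken command; return the index of the message to play."""
    if command == "delete":
        del messages[index]
        return min(index, len(messages) - 1)
    if command == "next":
        return min(index + 1, len(messages) - 1)
    if command == "previous":
        return max(index - 1, 0)
    return index  # 'replay' keeps the current message

i = 0
i = handle("next", i)    # move on to msg2
i = handle("delete", i)  # msg2 removed; playback continues with msg3
print(messages, i)  # ['msg1', 'msg3'] 1
```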

Functional specification

For the answering machine the functional specification led to the following feature set:

, ".~,. '"f'" 7 _' ~ t tt, !' - , •

Functions affecting a single message:
• Replay last message
• Delete last message
• Next message
• Previous message
• Stop playing

Functions affecting all messages or the greeting:
• Replay all messages
• Delete all messages at once
• Change the greeting
• Deactivate answering machine


Concept development

For the answering machine the dialogue is divided into two parts. The first part is when a message is being played. Here the user can activate functions that affect the single message, e.g. deleting or repeating it. After all messages have been heard, the second part begins. Here the user can activate global functions that affect all messages or the greeting. This concept has been taken from DTMF control.

In a Wizard-of-Oz simulation of the answering machine, the speech recognition and system control were performed by a human operator. With a control panel he could trigger the playback of messages and system announcements, depending on what the user said. A qualitative user test was conducted with eight test persons. It turned out that the concept was accepted, probably due to its compatibility with known answering machines.

As the usability goal we defined that users should perceive a clear benefit when using voice control rather than DTMF control.

Dialogue specification

For the answering machine there are three requirements on the dialogue:

• Efficiency - The remote interrogation must not take much time; calling long-distance is expensive.

• Ease of use - The remote interrogation must be intuitive to use; when calling from afar there is no manual at hand.

• Robustness - The remote interrogation must be possible even under adverse conditions, e.g. a noisy environment.

In order to make the dialogue both efficient and easy to use, there are two modes: the command mode and the menu mode.

In the command mode the system just prompts the user for a command. The system executes the demanded function and prompts for the next command. This mode requires the user to know the functionality of the device and to remember the command words. The command mode is rather efficient and meant for the expert user.

In the menu mode, the system guides the user by offering the available functions in acoustic menus. The user can then choose one and activate it by speaking the proposed command word. The menu mode is easy to use and meant for the novice user.

The mode is determined automatically by the system, depending on the user's behaviour. The default mode is the efficient command mode. But if the system detects that the user has problems, e.g. because he uses unknown command words or does not continue to react, it switches to the menu mode. After any successful completion, the system falls back into the command mode.
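The automatic mode switching can be sketched as a single decision per user turn. The command set and the interpretation of silence are illustrative assumptions:

```python
# Sketch of the automatic command/menu mode switching described above.
# Unknown input or no reaction (None) switches to the guiding menu mode;
# a successfully recognized command returns the system to the efficient
# command mode.

COMMANDS = {"next", "previous", "delete", "replay"}

def next_mode(mode, utterance):
    """Decide the dialogue mode after one user turn."""
    if utterance is None or utterance not in COMMANDS:
        return "menu"    # user seems lost: offer acoustic menus
    return "command"     # success: fall back to the efficient mode

mode = "command"
mode = next_mode(mode, "erase")   # unknown word -> menu mode
mode = next_mode(mode, "delete")  # valid command -> back to command mode
print(mode)  # command
```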

• ", ": .,- .. ', .•.• : ... , .... ·-,·,·~ ...'t.7·.-:·.t ""rltH.,·"~-i:;':Ö';'('r1t~"rr>,-;-~ ..;"'I;.~~.Smce speech recognition IS never 100% rëliablè, ésjièóially ihä'uoisy'erivir-onment, a fallback mechanism onto a more reliable input form must be pro-vided [4]. For the answering machine, DTMF input represents that fallbackmechanism. Therefore the user always has the choice between speech andDTMF input.

Rapid prototyping and iterative improvement

For the answering machine three iteration steps were conducted. The design was regarded as satisfactory when the observations of the test users were unaccompanied by criticism.

First iteration: Command words

It turned out that some of the command words were not used intuitively. In order to investigate the optimal set of command words, a test was conducted with eight users. In this test the menu mode was turned off, so that the users had no clue which command words were accepted. Given a certain task, it was observed which command words the users would intuitively choose.

It turned out that some command words, such as 'delete', were obvious,

whereas for some functions, such as changing the greeting, no common intuitive command word could be identified. For those functions several synonyms were included in the system's vocabulary. As a result of this synonym test the vocabulary grew from 14 to 30 words.

Second iteration: Subjective impression of efficiency

After the vocabulary had been determined, people complained about the general inefficiency: the system was perceived as being too slow. Therefore the response times were shortened and the system announcements were spoken faster, although this might have diminished intelligibility. In order to further shorten the system announcements, the help menus were split into two levels: at the first level just the command words are mentioned, and at the second level the functions are also explained. The second level of the menu is only played if, after having heard the first level, the user still does not react appropriately.
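The two-level help scheme can be sketched as follows; the command words and explanations are invented for illustration, not taken from the actual machine.

```python
# Hypothetical two-level help menu: level 1 lists only the command words;
# level 2, played if the user still does not react, adds explanations.
HELP = {
    "play":   "plays back the recorded messages",
    "delete": "erases the current message",
    "next":   "skips to the next message",
}

def help_announcement(level):
    """Return the text of the acoustic help menu for the given level."""
    if level == 1:
        return "Say: " + ", ".join(HELP)  # short: command words only
    return ". ".join(f"'{word}' {explanation}"
                     for word, explanation in HELP.items())
```

The short first level keeps the common case fast; the verbose second level is reserved for users who are genuinely stuck.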


Third iteration: Time window for speech input

A fixed time window for speech input turned out to be unacceptable. It was observed that users either tried to barge into the system announcements or waited too long. Therefore the time window has been made flexible, in the sense that the system reacts as soon as a command word has been recognized. With this flexibility it is possible to extend the time limit and thereby to satisfy the slow user as well as the fast one.
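A flexible time window of this kind amounts to returning as soon as the recognizer delivers a word, up to a generous maximum. A minimal polling sketch, assuming a hypothetical `recognizer` callable that returns a command word or `None`:

```python
import time

def listen_for_command(recognizer, max_window=10.0, poll=0.1):
    """Return the first recognized command word, or None after max_window.

    Sketch only: `recognizer` stands in for the speech recognizer.
    The generous max_window satisfies the slow user; the immediate
    return satisfies the fast one.
    """
    deadline = time.monotonic() + max_window
    while time.monotonic() < deadline:
        word = recognizer()
        if word is not None:
            return word      # react immediately, no fixed waiting time
        time.sleep(poll)
    return None              # nothing recognized within the window
```

A real system would drive this from recognizer callbacks rather than polling, but the timing behaviour is the same: the limit can be long because it is rarely waited out.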

After three iteration steps the usability of the remote interrogation was considered satisfactory and the design process was considered complete.

Verification

For the answering machine we demanded that users should perceive a clear benefit when using voice control rather than DTMF control. It had to be shown that the majority of the test persons, even if they used the system only once, would prefer voice control.

The achievement of this goal was checked by means of focus groups. There were three focus groups in each of two countries. People in the same focus group had similar backgrounds in using telecommunication terminals. In the focus groups the answering machine was first demonstrated and a few tasks were performed by the test persons. Feedback was collected by means of questionnaires and intensive group discussions. The focus groups revealed a clear preference for voice control, though not quite as pronounced in a country with high DTMF penetration as in a country with low DTMF penetration.

5. Summary and conclusion

We have described how to design a voice control for consumer electronics so that it is perceived as a benefit by the user. More sophisticated recognition algorithms have led to more natural user interfaces and to more freedom for the designer. We presented guidelines and showed how they have been realized in the voice controlled car stereo. We further explained the design process and showed how the voice controlled answering machine was developed in a user-centred approach.

The design of a voice control is a subtle task, since merely replacing button presses by speech commands does not improve the user interface at all. The right balance between voice and key control has to be found. The given examples show


that a well designed voice control can make consumer electronics more usable and more attractive, and that this technology is on the verge of penetrating the mass market.

REFERENCES
[1] J.D. Gould and C. Lewis, Designing for usability: Key principles and what designers think, Commun. ACM, 28(3), 300-311 (1985).
[2] D. Jones, K. Hapeshi and C. Frankish, Design guidelines for speech recognition interfaces, Applied Ergonomics, 20(1), 47-52 (1989).
[3] L.T. Stifelman, B. Arons, C. Schmandt and E.A. Hulteen, VoiceNotes: A speech interface for a hand-held voice notetaker, Proc. INTERCHI '93, Amsterdam, pp. 179-186 (1993).
[4] T. Falck, S. Gamm and A. Kerner, Multimodal dialogues make feature phones easier to use, Proc. Applications of Speech Technology, Lautrach, pp. 125-128 (1993).
[5] R. Molich and J. Nielsen, Improving a human-computer dialogue, Commun. ACM, 33(3), 338-348 (1990).
[6] H. Thimbleby, Can anyone work the video?, New Scientist, 23 Feb., 48-51 (1991).
[7] B.R. Gaines and M.L.G. Shaw, The Art of Computer Conversation, Prentice-Hall (1984).
[8] S. Gamm, R. Haeb-Umbach and D. Langmann, The usability engineering of a voice controlled answering machine, Proc. Int. Symp. Human Factors in Telecommunications, Melbourne, pp. 177-184 (1995).
[9] J.D. Gould, J. Conti and T. Hovanyecz, Composing letters with a simulated listening typewriter, Commun. ACM, 26(4), 295-308 (1983).
[10] R.W. Bennett, A.K. Syrdal and E.S. Halpern, Issues in designing public telecommunications services using ASR, Proc. Speech Technology '92: Voice Systems Worldwide, pp. 222-229 (1992).
