
Computers & Graphics 30 (2006) 619–628

www.elsevier.com/locate/cag

Technical Section

“Verba Volant Scripta Manent” a false axiom within virtual environments. A semi-automatic tool for retrieval of semantics understanding for speech-enabled VR applications

Giuseppe Conti*, Giuliana Ucelli, Raffaele De Amicis

Fondazione Graphitech, via F. Zeni 8, 38068 Rovereto, Italy

Abstract

Traditional interaction with virtual environments (VE) via widgets or menus forces users into rigidly sequential interactions. Previous research has proved that the adoption of speech recognition (SR) allows more flexible and natural forms of interaction, resembling the human-to-human communication pattern. This feature, though, requires programmers to compile human-supplied knowledge in the form of grammars. These are then used at runtime to process spoken utterances into complete commands. Furthermore, SR must be hard-coded into the application.

This paper presents a completely automatic process to build a body of knowledge from the information embedded within the application source code. The programmer in fact embeds, throughout the coding process, a vast amount of semantic information. This research work exploits this semantic richness and provides a self-configurable system which automatically adapts its understanding of human commands according to the content and to the semantic information defined within the application's source code.

© 2006 Published by Elsevier Ltd. doi:10.1016/j.cag.2006.03.004

Keywords: Virtual reality; Speech recognition; User interfaces; Semantics

Abbreviations: AI, artificial intelligence; AR, augmented reality; CFG, context-free grammars; FS, feature structures; HCI, human–computer interaction; NLP, natural language processing; SAPI, speech application programming interface; SR, speech recognition; TTS, text-to-speech; VE, virtual environments; WSD, word sense disambiguation; VERBOSE, voice enabled recognition based on semantic expansion; VR, virtual reality; VRAD, virtual reality aided design; XML, extensible mark-up language

*Corresponding author. Tel.: +39 0464 443450; fax: +39 0464 443470. E-mail addresses: [email protected] (G. Conti), [email protected] (G. Ucelli), [email protected] (R. De Amicis).

1. Introduction

Most virtual reality (VR) applications adopt menu-based interactions to provide access to the functionalities available within the three-dimensional (3D) environment. The interaction metaphors can vary from traditional 2D menus to more complex 3D widgets. Hybrid approaches make use of further abstraction levels by introducing elements, such as a tablet or a pen [1–3], which in turn provide access to traditional menu-based commands. Major advances in the field of human–computer interaction (HCI) have fostered the adoption of more natural forms of interaction. In fact new interfaces have gone beyond the mere decoding of users' pointing actions by taking advantage of the information encoded through voice, gestures or gaze. This has led to multimodal VR interfaces where multiple communication channels [4] are used. In particular, the integration of gestures and SR within VR applications can let users benefit both from the intrinsic spatial nature of VR environments and from the directness and efficacy of spoken commands. As asserted by McNeill [5], speech and gestures originate from an internal knowledge representation that encodes both semantic and visual information. Their integration becomes a decisive advantage in applications targeted at the engineering design domain. Indeed, as highlighted by Reeves et al. [6], the design experience, which is based on the generation of shapes [7], strongly benefits from the support of multi-sensorial, or multimodal, interactions. In fact, as proved by cognitive scientists, virtual reality aided design (VRAD) applications can exploit speech as a complementary conceptual channel [8] capable of transmitting information not easily defined spatially through gestures.

With respect to this, Kendon [9] highlights how, during the drawing of shapes, gestures tend to co-occur with phonologically prominent words.

Concurrent exploitation of gesture and speech requires a representation of the knowledge [5] relative to the specific context domain, which allows the appropriate decoding of the user's action. When such knowledge is supplied, the use of speech within VR applications allows direct access to functions available within the virtual world through a more “natural” form of interaction [4] which resembles the human-to-human communication pattern [10].

However, the implementation of SR capability requires the developer to manually define the necessary knowledge base. This defines the way commands detected by the SR subsystem are to be translated into actions within the VR environment. More precisely, developers typically need to manually define how spoken utterances are bound to the system's functions through an inefficient process, which must be repeated every time a modification to the system code is introduced.

This research work tackles these issues through voice enabled recognition based on semantic expansion (VERBOSE), a system capable of automatically generating a speech-enabled interface for VR applications which intelligently adapts to the user's commands. VERBOSE plays a key role during the development of a VR application since it can be integrated with the specific parts of the system architecture dealing with the interaction process, which are hard-coded to the rest of the application.

The proposed process allows, with only minimal variations to the original application, the automatic generation of the body of knowledge required by the SR engine and the connection of the SR subsystem to the VR application's functions. The system presented is capable of self-configuring to adapt its comprehension to the users' spoken commands by automatically generating the relevant context-free grammars (CFG) from the application source code. The system also automatically generates the information necessary for its text-to-speech (TTS) functionalities, which are required to provide an adequate interaction level. The approach developed takes advantage of the semantics defined within the source code by exploiting the vast amount of information encoded by the developer. This information is compiled by the programmer when defining class, instance, field or function names or the class inheritance. The resulting combination of form-descriptive gestures, used to sketch and deform models three-dimensionally, together with the adoption of a flexible speech processing functionality, delivers an improved interface which lets users explore the 3D space and access its functionalities in an intuitive and natural way.

2. Related works

Traditionally, speech recognition facilities have been embedded into VR systems to provide a “natural” means of interaction [4] with the virtual world and to enhance the efficiency of the workflow. The augmented reality (AR) system described in Ref. [11] has been developed for educational purposes, allowing control of interface components through the use of spoken commands. Most systems developed for engineering applications [12] or for complex assembly and maintenance tasks [13] use off-the-shelf speech engines [14] to recognize short commands or as a replacement for simple keyboard input [15]. Other works have led to the creation of speech-enabled VR/AR environments based on portable devices [16].

More advanced multimodal VR applications, such as that in Ref. [17], have proposed an agent-based structure capable of processing inputs from different modalities through the use of feature structures (FS) unification [18]. The unification process checks for compatibility between data structures and merges their features into a single data structure. In Ref. [19] the authors propose the use of interaction graphs, diagrams whose tokens contain information coming from different modalities, to show the user's progress within the current task.

The StudierStube [1] VR/AR platform introduces a new level of abstraction to multimodal interaction. The system makes use of an open architecture for tracking devices, called OpenTracker [20], which provides high-level abstraction over different tracking devices and interaction modes.

As far as spoken commands are concerned, several commercial systems [21] make use of finite-state grammars, which allow filtering of the number of messages to be decoded by the system. Most modern engines [14,22] make use of pre-defined dictionaries and rule sets compiled into CFGs. These are used by the system to retrieve, from spoken utterances, the information required to activate the relevant commands. Semantic information can also be included at the grammar level in various commercial recognizers [14,21]. The recognition subsystem is interfaced to the VR/AR environment and the relations between spoken commands and computer actions are hard-coded. The output of the speech recognition process is passed to a language parser that interprets it accordingly.

This approach is preferred when a relatively limited set of commands must be decoded, as in the case of VR applications. In fact it enhances precision [4] over standard dictation systems, since it greatly narrows the number of commands that have to be interpreted by the system, from general natural language to a specific domain. Vocabularies are defined a priori [23]; the number of recognizable commands is limited in size [24] and defined according to the specific context of the application. As noted in Ref. [23], the development of more comprehensive vocabularies and grammars represents an important achievement, since it can significantly enhance the expressive power of the application.

Past works [4] stressed the need for expressing the knowledge contained in VEs and for extracting semantics out of virtual scenes.

Fig. 1. Example of how descriptive pseudo-code can carry a vast amount of semantic information.
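The figure itself is not reproduced here; the following minimal C++ sketch (with hypothetical names, not the paper's actual listing) illustrates the contrast it describes between semantically rich and semantically poor code:

    // Semantically rich: class, method and argument names carry
    // meaning that a normalizer can recover (nouns and verbs that
    // the WSD step can later expand with synonyms).
    class basic_shape {
    public:
        void bend_shape(float angle_X, float angle_Y, float angle_Z);
        void move_shape(float offset_X, float offset_Y, float offset_Z);
    };

    // Semantically poor: identical functionality, but the identifiers
    // carry no semantic information that could be exploited.
    class bs {
    public:
        void f1(float a, float b, float c);
        void f2(float a, float b, float c);
    };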

Fig. 2. General architecture: the system uses its source code to automatically generate grammars for its speech-recognition subsystem.

In Ref. [4] the importance of having some form of linguistic knowledge, which can be embedded in a VR system in an automatic way, is also highlighted. Numerous works in the field of natural language processing (NLP) have tried to answer this demand. In particular, the problem of understanding the meaning of a term in a wide semantic context has been a subject of great interest. This problem, well known as word sense disambiguation (WSD), is in fact a fundamental step in the overall process of knowledge understanding.

As far as WSD is concerned, the seminal work in Ref. [25] highlighted the need for defined logical structuring of languages capable of mapping the semantics of natural language. WSD has been extensively studied for the last six decades by scientists with an artificial intelligence (AI) background across several domains, from machine translation [26] to NLP [27] and knowledge management [28]. The problem has been described in Ref. [29] as AI-complete, i.e. one whose solution requires tackling major AI problems manifested in the synthesis of human-level intelligence.

Interest in computational linguistics that has emerged in past years has refueled research in WSD. In particular, the work done in Ref. [30] has led to the development of WordNet, the most important resource for computational linguistics. The authors in Ref. [27] use WordNet for WSD with good results; their algorithm aims at a domain-independent syntactic parser for text fragment analysis. Similarly, other authors [31] calculate the semantic distance between terms through the assignment of weights based on the types of semantic and lexical relations between words. Shorter distances define closer senses between words. According to Ref. [27], semantic similarity between words is inversely proportional to the semantic distance between words in a WordNet hypernym/hyponym hierarchy. Finally, the authors in Ref. [32] use a mixed approach where WordNet ranking is combined with word-to-word co-occurrences based on Internet statistics.

3. System overview

The main contribution of this research work is the development of a framework for speech recognition that implements the knowledge-reuse paradigm, taking full advantage of knowledge that is produced during the coding phase of the VR application. The knowledge embedded in the source code is used for supporting recognition of spoken commands and for intelligently adapting the system's behavior to improve its understanding. Our research work has been inspired by an extensive use of VR systems and, to the best of the authors' knowledge, no such approach has been previously followed for the implementation of systems for speech recognition or multimodal interactions.

In fact programmers, during the development process, implicitly define a vast body of knowledge whose formalism, strictly defined by the rules of the programming language adopted, is only partially exploited, by the compiler and later by the executor, to create the program. Such a body of knowledge carries a wide range of implicit semantic information which is not exploited at later stages. This information is contained in class names, in object or reference names, in method definitions and in method arguments. Fig. 1 illustrates this concept by showing, respectively at the top and at the bottom, two examples of semantically rich and poor pseudo-code, both providing identical functionalities.

Our system is capable of turning such semantic information, intrinsically contained in the source code, into a data repository, which can then be used to transparently support SR at the application level. As illustrated in Fig. 2, the system comprises a C++ library, linked by the general application, and a normalization application. Normalization is performed before compilation of the final application, automatically extracting semantic information from the code. It also creates the knowledge repository which is later used to provide the SR functionality.


Every time the programmer requires a function to be speech-enabled, he/she tags it with a specific preprocessor macro. During the normalization process the tagged source code is parsed and the relevant semantic information is extracted. This information is used to create the knowledge repository for the automatic compilation of the CFGs. These are then used at runtime by the system to provide support for the framework, which automatically and transparently links the data coming from the recognition sub-system to the main application.

3.1. Creation of contexts

As detailed in the following sections, the system takes advantage of the terms used by the programmer to model its own knowledge repository. This knowledge is modeled according to so-called contexts. We refer to the term context meaning the set of information, confined to a certain scope, which bounds a specific body of knowledge. Enforcing the object-oriented paradigm, the system assumes that terms within the same class have a strong relation to one another, since they are usually related to a shared concept. The algorithm therefore considers each class as the representation of a different context, whose tagged method names and arguments are used as sources for the knowledge repository. The linguistic knowledge built upon such a repository is used to automatically expand the meaning defined by each action. A semantic expansion lets the system respond to a much wider set of commands compared with the standard fixed-expression approach. An example illustrates the advantages of our approach. We take the method “bend_shape (float angle_X, float angle_Y, float angle_Z)”, which belongs to the “basic_shape” class, as an example. The programmer tags this declaration as follows:

SSPEECH_FUNCTION (void bend_shape (float angle_X, float angle_Y, float angle_Z)).

During the process of normalization a context for each class is automatically created. Other tagged methods belonging to the same class will also be used to create the context's knowledge. A first-level analysis separates nouns, verbs, adverbs and adjectives. Then, through a WSD algorithm, each lemma is expanded and its relevant synonyms are computed. In our example the algorithm recognizes the term “bend” as a verb, then selects synonyms, such as “twist”, according to its context and excludes those whose sense is not related to the general semantic meaning of the context, such as “crouch”, “stoop” or “bow”. The same process is repeated for the term “shape”. Finally the relevant CFG is built and loaded by the system at runtime. Enforcing the object-oriented paradigm, the use of CFG takes advantage of the inheritance between classes. In the following example a class “spline”, which extends “basic_shape”, would inherit the parental semantic properties. The user can thus command “please twist that shape by five-point-eight three-point-four four-point-eleven” and the system interprets the command, bending the selected shape according to the values set by the user. The comprehension of the complete command, which is made of the “specific” expression plus a series of double-precision values, is enforced throughout the process. In fact, during the normalization process, argument types are detected and used to automatically configure the system to wait for particular types of arguments after detecting certain commands. It is worth underlining how the entire process is automatically implemented with nearly no extra code for the programmer, who is only required to type a few macros when appropriate.
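The paper does not list the definition of the tagging macro itself. A plausible minimal sketch, assuming a simple pass-through design so that untreated code still compiles (see also Section 4, where unprocessed tags are expanded away), is the following:

    // Assumed sketch, not the authors' actual implementation: the tag
    // expands to the declaration it wraps, so the code compiles even if
    // the normalizer is never run; the normalizer itself looks for the
    // literal token SSPEECH_FUNCTION in the source text and rewrites it.
    #define SSPEECH_FUNCTION(declaration) declaration

    class basic_shape {
    public:
        SSPEECH_FUNCTION(void bend_shape(float angle_X, float angle_Y, float angle_Z));
    };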

3.2. Semantic expansion

Semantic expansion provides the means for enriching terms with their synonyms according to a specific context. The problem of understanding the meaning of each selected term in relation to its wider semantic context has been tackled through the definition of the WSD algorithm illustrated in the following sections. WSD is performed with the help of knowledge extracted from the code of a single class, therefore not taking into account the code of other classes. This is necessary since different classes may refer to different semantic domains (i.e. one might refer to visualization, another to creation of geometry) and they require independent WSD processes. Our algorithm was initially inspired by the work done in Ref. [29] and it makes use of the WordNet lexical reference system. The information extracted by the normalizer is modeled using synsets of WordNet [30,33–36]. A synset is a set of synonyms defining a common lexical concept [33] and therefore a meaning.
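For concreteness, a minimal C++ representation of the WordNet notions used below might look as follows (a sketch only; WordNet itself stores far richer data):

    #include <string>
    #include <vector>

    // A synset: a set of synonyms sharing one lexical concept (a sense).
    struct Synset {
        std::vector<std::string> synonyms; // e.g. {"bend", "flex", "deform"}
        std::string gloss;                 // short definition of the sense
    };

    // A hypernym sequence: from a synset up to one of WordNet's
    // top-level ("beginner") synsets via repeated is-a-kind-of steps.
    using HypernymSequence = std::vector<Synset>;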

4. Normalization process

When the normalizer is called it parses the source code looking for preprocessor tags. To avoid conflicts when the process of normalization is not called before the final compilation of the application, the tags are expanded into blank spaces. During the normalization, for each speech-enabling tag (i.e. the preprocessor macros), the normalizer extracts all the relevant information (e.g. the function name, return values and arguments). As far as the semantic expansion is concerned only function names are used; the remaining data is used at later stages. The normalization replaces each tagged method with a more appropriate data structure, which will be used by the system to automatically activate these functions at later stages. The new data structure operates as a signature and contains information on the method name, arguments and return type, which are then used at runtime to activate the proper functions. For our example the normalizer automatically generates the following source code:

SSPEECH_FUNCTION_ENCODED (shape, void_bend_shape_float_angle_X_float_angle_Y_float_angle_, bend_shape).
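The internals of the generated data structure are not shown in the paper; a hedged sketch of the kind of signature record such a macro could register is:

    #include <map>
    #include <string>

    // Assumed sketch: a signature record keeping what the paper says the
    // structure contains (method name, arguments, return type), indexed
    // by context so the runtime can bind utterances to function calls.
    struct SpeechSignature {
        std::string context; // class name, e.g. "shape"
        std::string encoded; // e.g. "void_bend_shape_float_angle_X_..."
        std::string method;  // e.g. "bend_shape"
    };

    inline std::map<std::string, SpeechSignature>& speech_registry() {
        static std::map<std::string, SpeechSignature> registry;
        return registry;
    }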


Then the algorithm starts building the knowledge of the context. This is done in two phases: performing a morpho-syntactic disambiguation and then a sense disambiguation. During the first process a set of heuristics defined in the system helps determine the syntactic category of each lemma. Due to the nature of the text being parsed, this task does not present particular complexity. In fact, in most cases method declarations are defined following an essential style (e.g. “move_cube (int x, int y, int z)”). A set of heuristics was therefore developed to perform efficient syntactic analysis in order to determine whether a term is a verb, a noun, an adjective or an adverb. At the end of the morpho-syntactic analysis, for each ith class, or context, the following four sets are defined:

$$N_i = \{n_{0i} \ldots n_{si}\}, \quad V_i = \{v_{0i} \ldots v_{ti}\},$$
$$A_i = \{a_{0i} \ldots a_{vi}\}, \quad D_i = \{d_{0i} \ldots d_{wi}\},$$

respectively defining nouns, verbs, adverbs and adjectives.

The next step, the WSD, requires a more complex analysis. At the present stage the algorithm provides disambiguation only for nouns and verbs, which however carry the most significant part of the semantic information present in the source code. The algorithm can be split into two phases: it searches the different meanings of each word belonging to the context and then it establishes the relation of the correct senses with the given context. The way a sense is defined has been a matter of intense debate within the scientific community and among lexicographers, but it is beyond the scope of this research work. Here sense definition relies on WordNet, and the proposed algorithm addresses how to define a method to represent a context and how to provide the knowledge that the system requires to associate senses with the given context.

Fig. 3. Example of two hypernym sequences.

4.1. Context definition and noun disambiguation

As underlined in Ref. [29], the selection of words used to create the context is a crucial aspect of WSD. The approach followed uses only the nouns to build up the context of the class. This choice allows efficient analysis and construction of the context, dramatically reducing the overhead required for the semantic processing while still featuring good accuracy. In fact, as noted in Ref. [34], verbs have a tendency to change their meaning according to the nouns they are related with, whilst this tendency is limited for nouns. For instance the set of terms {curve; cube; sphere; geometry; line} generally makes the majority of people think about the same topic (i.e. generation and manipulation of shapes) and can easily be used to identify a context. However the same is not true for the set {create; delete; move; modify}, which defines, more generically, different types of actions that can be performed upon various contexts (not necessarily related to geometries). Therefore the second set is less suited for our purpose. In short, building a context using nouns has proven a fast, stable yet representative approach.

In order to create the context a first level of normalization is performed using the morphological processing tools of WordNet. This process eliminates inflectional endings from each noun. Each morphologically normalized element is then used to retrieve its corresponding synsets from the database according to its syntactic category. Specifically, for each nth noun a search is performed returning all the synonyms and, in the case of polysemous nouns, all synsets for each sense. Let S_i{s_0i … s_ti} be the set of synsets of the ith class created from the original set of nouns N_i. The full set of possible hypernym sequences H_i{h_0i … h_si} whose first element is a synset belonging to S_i is then retrieved. It is worth underlining here that a concept represented by the synset X is a hyponym of the concept represented by the synset Y if in spoken English the sentence built as “An X is a (kind of) Y” is considered correct [33]. If this holds, Y is said to be a hypernym of X and conventionally [30,33] their relation is formally defined through the symbol “@-”. For instance the term “square”, considered as a noun, has eight different senses. Fig. 3 shows two of the hypernym sequences of the same term referring to two different senses.

We call H_i the context of a class. This contains the full set of possible hypernym sequences extracted from N_i and it is the body of knowledge used to describe the semantic knowledge embedded in the code of a class. Once the context has been built it can be used to disambiguate each noun belonging to N_i. The approach followed implements a clustering process of hypernym sequences inspired by Ref. [29]. Our approach departs from the original research work by delivering better control of the different elements belonging to the context and by introducing a new process which performs verb disambiguation. As represented in Fig. 4, for each noun belonging to N_i let P_N{p_0 … p_t} be the set of its hypernym sequences and R_i{r_0i … r_si} a copy of the context H_i. For each sequence p_n of length l every possible sub-sequence p_n^k of length k (with 1 ≤ k ≤ l) is extracted. Each sub-sequence will contain the last k elements of p_n starting from one of the beginner nodes of WordNet. The algorithm starts processing the full sequence, reducing at each step the value of k until k = 1, i.e. until the one-synset sequence containing only one node is reached.

Fig. 4. Clustering process.

Each sub-sequence p_n^k is used to filter the copy of the context R_i through a clustering process. Each cluster groups the sequences, belonging to the context R_i, whose first k elements coincide with the k synsets of the sub-sequence p_n^k taken as reference. Every time such a match is found the corresponding sequence is removed from R_i and added to the cluster. As suggested in Ref. [29], the weight of each cluster is defined as follows:

$$W_j = \frac{1}{\sum_{n=0}^{\mathrm{clust.size}} (l_n - k)},$$

where l_n is the length of the nth sequence in the cluster.

In other words, W_j provides a measure of the extent to which the sequences belonging to the cluster differ from the sub-sequence p_n^k. The higher W_j, the less the sequences contained differ from the one chosen as reference. The clustering process is repeated until every sub-sequence of each element belonging to P_N{p_0 … p_t} has been extracted. The cluster(s) with the highest value of W_j is taken and p_0^k, the first synset of its reference sub-sequence, is used to retrieve the sense of the noun. This value in fact indicates the highest degree of matching of the chosen sub-sequence with the general context represented by the set of sequences belonging to R_i. This way it is possible to measure the “closeness” of a sequence to the general meaning of the context H_i. Consequently the sense referred to by the chosen sub-sequence is considered to be the closest to the general semantic context of the class and it is used to extract the synonyms of the noun.
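The weight computation is straightforward to express in code. The self-contained C++ sketch below implements the formula above, with hypernym sequences reduced to plain token vectors (in the real system they come from WordNet):

    #include <string>
    #include <vector>

    using Sequence = std::vector<std::string>; // one hypernym chain

    // Weight of a cluster whose sequences share their first k synsets
    // with the reference sub-sequence: W_j = 1 / sum_n (l_n - k).
    // The shorter the leftover tails, the higher the weight, i.e. the
    // closer the cluster is to the chosen sub-sequence.
    double cluster_weight(const std::vector<Sequence>& cluster, std::size_t k) {
        double tail_sum = 0.0;
        for (const Sequence& seq : cluster)
            tail_sum += static_cast<double>(seq.size()) - static_cast<double>(k);
        // Guard for the degenerate case of exact-length matches (an
        // assumption; the paper does not discuss this case).
        return tail_sum > 0.0 ? 1.0 / tail_sum : 0.0;
    }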

A more complex process is required for verbs since, as noted in Ref. [34], these tend to be more polysemous than nouns, which might indicate their higher flexibility. As shown, this observation has led the authors not to use verbs to build the contexts for the classes; rather, we take advantage of this data to disambiguate verbs. To do so we parse the list of verbs V_i, extracting for each element the list of derivationally related forms using WordNet. Derivational forms are nouns which are morphologically related to a verb. In our example the derivationally related forms of the verb “bend” include “curve”, “flexure”, “bender”, etc.

The algorithm selects the sense of the verb whose derivationally related forms are characterized by the highest degree of matching with the semantics of the context, and this sense is then used to retrieve its synonyms. Specifically, for each nth verb belonging to V_i the system creates the data structure in Fig. 5. Proceeding from bottom to top, the system extracts the list of synsets DV{d_0 … d_i} pointing to the verb's derivational forms. Then, for each synset d_m we extract the list WD{w_0 … w_j} of j synonyms within the sense specified by d_m. For each sth element w_s of WD the approach previously described for noun disambiguation is repeated. Thus, for each w_s the corresponding k sequences of hypernyms are retrieved, the sub-sequences of hypernyms extracted, and each sub-sequence is used to create a group of clusters from a copy of the context H_i.

As illustrated in Fig. 5 (now proceeding from top to bottom), for each hypernym sequence the weight is calculated. The highest weight among the clusters is associated with the corresponding sth word belonging to the list WD. In order to find a measure of the general correspondence of the hypernym sequences to the semantic context H_i, the mean value w_m^T of the j maximum weights is calculated as follows:

$$w_m^T = \sum_{s=1}^{j} \frac{w_s}{k}.$$
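Under the reconstruction of the formula given above, the per-form score reduces to a sum of the maximum weights divided by k; a small sketch (the handling of k is an assumption here, as the printed formula is ambiguous):

    #include <vector>

    // Mean-weight score of one derivational form: each synonym w_s of
    // the form contributes its maximum cluster weight; the j values are
    // combined as in the formula above.
    double derivational_form_score(const std::vector<double>& max_weights,
                                   std::size_t k) {
        double sum = 0.0;
        for (double w : max_weights) sum += w;
        return k > 0 ? sum / static_cast<double>(k) : 0.0;
    }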

Fig. 5. Data structure used for the verbs' WSD.

Fig. 6. System used as a real test bench.

Fig. 7. Modification required to the header file.

This process is repeated for each mth derivational form, extracted from the verb, which belongs to DV. The derivationally related form whose weight w_m^T is the highest indicates the most suitable sense for the context of the class, and the synonyms according to that sense are extracted from WordNet and used for the rest of the algorithm.

The process of semantic expansion described so far for nouns and verbs is eventually repeated for each class, or context, where a speech-enabling tag has been found by the preprocessor.

4.2. Automatic creation of grammars

After the disambiguation of nouns and verbs the relevant synonyms are extracted from WordNet. These are used to create different CFGs, one for each class or context that was tagged. Here, sets of synonyms are used to compile flexible grammars that consider all the matching combinations of synonyms for terms used in the source code. The application creates XML speech application programming interface (SAPI) compliant grammars [14] which are then used by the SR engine. This process yields as a result an SR system that is still based on CFGs, thus assuring a high level of precision, but is significantly more flexible in the comprehension of the user's commands. Most importantly, our method does not require any intervention from the programmer, since the system ensures that every time a change in the code is made the set of grammars is consistently updated with the changes introduced.
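The generated grammars themselves are not listed in the paper. As a rough illustration of the principle, the sketch below emits a SAPI 5-style XML rule in which every word slot is expanded into its list of interchangeable synonyms (the helper and the synonym sets are hypothetical):

    #include <iostream>
    #include <string>
    #include <vector>

    // Emit one top-level rule; each slot becomes a list of alternative
    // phrases (an <L> element containing <P> elements in SAPI 5 XML).
    void emit_rule(std::ostream& out, const std::string& rule_name,
                   const std::vector<std::vector<std::string>>& slots) {
        out << "<RULE NAME=\"" << rule_name << "\" TOPLEVEL=\"ACTIVE\">\n";
        for (const auto& synonyms : slots) {
            out << "  <L>\n";
            for (const auto& word : synonyms)
                out << "    <P>" << word << "</P>\n";
            out << "  </L>\n";
        }
        out << "</RULE>\n";
    }

    int main() {
        // "bend shape" expanded with synonyms selected by the WSD step.
        emit_rule(std::cout, "bend_shape",
                  {{"bend", "twist", "flex"}, {"shape", "form"}});
        return 0;
    }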

5. VERBOSE integration within a VRAD system

A VRAD application has been used to test VERBOSE's features. The application, described in previous works by the authors [7], allows the user to create curves and several types of surfaces within a 3D virtual environment. It is based on the StudierStube [1] VR/AR system and supports a number of output and tracking devices. As illustrated in Fig. 6, the selected set-up features a desktop VR configuration and a virtual glove [37] as tracking device. This set-up allows the user to interact with the system by moving his right hand. First, users press a button on the virtual tablet which in turn activates the relevant application's function. By bending their fingers users generate events equivalent to mouse clicks. Once a command is triggered, e.g. the creation of a curve, users can interact with their hand to perform the relevant task.

In order to integrate VERBOSE the authors have first identified the classes containing the functions to be SR-enabled. As shown in Fig. 7, each of these classes must extend SSBase, VERBOSE's super class responsible for the largest part of the SR functionalities. Further, as illustrated in Fig. 8, we have declared, within each class constructor, the macro SSPEECH_CONSTRUCTOR. Finally, through specific tagging, we have identified the functions to be speech-enabled within the application's “.cpp” source files. As shown in Fig. 8 this is done through the SSPEECH_FUNCTION preprocessor macro.

The authors have decided to provide the methods responsible for the creation of different shapes with speech recognition functionalities. From what was already introduced it is clear that declarations expressed in a near human-readable form maximize the benefits of VERBOSE.
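Figs. 7 and 8 are not reproduced here; the following self-contained sketch illustrates the kind of modification they describe (the class name is illustrative, and the stub definitions stand in for VERBOSE's actual library, which is not shown in the paper):

    // Stubs so the sketch compiles stand-alone; in VERBOSE these come
    // from the runtime library (assumed, not the authors' code).
    #define SSPEECH_FUNCTION(declaration) declaration
    #define SSPEECH_CONSTRUCTOR ((void)0)
    class SSBase { public: virtual ~SSBase() {} };

    // Header: the class to be speech-enabled extends SSBase (Fig. 7).
    class SurfaceTools : public SSBase {
    public:
        SurfaceTools();
        void start_curve();
    };

    // .cpp: the constructor declares the macro and the functions to be
    // speech-enabled are tagged (Fig. 8).
    SurfaceTools::SurfaceTools() {
        SSPEECH_CONSTRUCTOR; // registers the class context
    }

    SSPEECH_FUNCTION(void SurfaceTools::start_curve()) {
        // ... activate curve creation within the VR environment ...
    }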

Fig. 8. Modification required to the .cpp file.

Fig. 9. Normalization process for each ith class.

However, the naming convention of the original VRAD application slightly departed from this approach by adopting mixed method declarations such as extrude_ButtonCB. In order to minimize the changes to the original code, the authors have introduced a set of more “readable” functions which in turn call the original functions. As a result a number of wrapper functions have been introduced, such as “start_curve”, “create_surface”, “start_skin”, “extrude_curve” and “delete_”. Their naming follows VERBOSE's convention, which uses underscores to separate words within the method definition.
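A minimal sketch of the wrapper pattern (the signature and body of the original callback are assumed, and the pass-through macro stub from the earlier sketches is reused):

    #define SSPEECH_FUNCTION(declaration) declaration

    // Original callback with the application's mixed naming convention,
    // left unchanged (assumed signature and body).
    void extrude_ButtonCB() { /* original implementation */ }

    // Readable wrapper following VERBOSE's underscore convention: the
    // normalizer extracts "extrude" and "curve" from its name, while the
    // body simply forwards to the original function.
    SSPEECH_FUNCTION(void extrude_curve()) {
        extrude_ButtonCB();
    }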

After the relevant changes to the code had been introduced, the normalizer was finally run. Fig. 9 shows the normalization process: the application analyses the source code, extracts the relevant semantic information and stores it in the grammar archive. The latter also contains a number of default grammars, such as those necessary to handle numbers or to trigger VERBOSE's functions (e.g. CFGs responsible for turning recognition on/off or for choosing the type of system feedback).

Finally, as illustrated in Fig. 9, the normalizer replaces each tagged element within the code with the new data structure. The resulting code, represented in the example of Fig. 10, contains the new macro definitions used at runtime by the system to connect each function call to the SR engine. The new code, finally compiled and run, is automatically speech-enabled through the VERBOSE runtime library.

When the user starts the application a welcome message from the TTS engine, played during the loading process, confirms that the application is now speech-enabled and, after the configuration of the engine is finished, it announces that the application is ready to listen to the user's commands.

As a result of VERBOSE the system is able to respond to a wide spectrum of users' commands. For instance the user can activate the function “start_curve” by speaking “start creating the curve” or “please begin a new curve now”. If, after the SR, the system detects a command incompatible with the information within the CFGs, the system can respond by asking the user to rephrase what is required. If the system is still not able to decode the command, the voice will ask the user whether he/she wants to be informed of the list of available commands. VERBOSE can also provide various other forms of feedback to the user's commands. The user can decide whether the system has to confirm the decoded commands through simple message boxes or spoken messages.

5.1. Performance indicators

As far as the system's performance is concerned, it is worth highlighting that the process of semantic expansion does not introduce any significant runtime overhead during the recognition process compared to standard SAPI performance.

Obviously the time required for the recognition can vary slightly according to the performance of the machine being used and the SAPI configuration. In our tests we used a PC with a Pentium 4 processor, 512MB RAM, Windows XP Pro, Visual Studio 6.0 and MS SAPI 5.1. The response time was always between 0.5 and 1.5 s according to the precision level set in SAPI. This result is in line with human-to-human response times and has therefore been considered highly acceptable by users.

Fig. 10. The .cpp file after the normalization process.

The duration of the normalization process, which is run only when modifications to the code are introduced, varies greatly according to the number of classes and to the amount of code lines to be parsed. This process can therefore last a significant time, in the order of magnitude of several minutes.

6. Conclusions

The novelty of this research work consists in an original approach for creating the knowledge model necessary for multimodal VR/AR systems where specific sections of the system architecture are devoted to the multimodal interaction. It is a first attempt towards the exploitation and reuse of the vast amount of semantic information that programmers embed throughout the development of applications. In particular this paper illustrates how the body of knowledge contained in code elements, such as method declarations, can be exploited by multimodal applications to make the system more adaptable to the user's interaction. Sentences embedded in the application code are expressed in pseudo-natural human terms and they clearly contain a very high degree of semantic information, such as the nature and kind of actions being carried out by the classes. A VR/AR application has been tested using our approach. It has also been illustrated in detail how terms in the code are disambiguated in order to provide the semantic expansion required, and how the algorithm extracts the sense most suited for the various contexts.

The process developed eliminates the problem of hand-compiling flexible interfaces for VR systems capable of responding to a wide range of users' commands. This issue was originally dependent upon the developer's capability of writing grammars able to handle a sufficiently wide variety of commands. With the adoption of VERBOSE this process becomes completely automatic. Moreover, the CFGs are written as extensible mark-up language (XML) files and are therefore encoded in a human-readable form. This makes it possible, if necessary, to extend them further, thus providing an additional level of flexibility to the system.

Acknowledgments

This research work is part of the projects InSIDe and SIMI-Pro, financed by the Provincia Autonoma di Trento.

References

[1] Szalavari Z, Schmalstieg D, Fuhrmann A, Gervautz M. Studierstube—an environment for collaboration in augmented reality. Virtual Reality: Research, Development & Applications 1998;3:37–48.
[2] Schmalstieg D, Encarnação LM, Szalavari Z. Using transparent props for interacting with the virtual table. In: Proceedings of the ACM SIGGRAPH symposium on interactive 3D graphics, I3DG'99. New York: ACM Press; 1999 [DOI http://doi.acm.org/10.1145/300523.300542].
[3] Szalavari Z, Gervautz M. The personal interaction panel—a two-handed interface for augmented reality. Computer Graphics Forum, Proceedings of EUROGRAPHICS'97 1997;16(3):335–46.
[4] Latoschik ME. A gesture processing framework for multimodal interaction in virtual reality. In: Proceedings of the first international conference on computer graphics, virtual reality and visualization. New York: ACM Press; 2001 [DOI http://doi.acm.org/10.1145/513867.513888].
[5] McNeill D. Hand and mind: what gestures reveal about thought. Chicago: University of Chicago Press; 1992.
[6] Reeves B, Nass C. Perceptual user interfaces: perceptual bandwidth. Communications of the ACM 2000;43(3):65–70.
[7] De Amicis R, Bruno F, Stork A, Luchi ML. The eraser pen: a new interaction paradigm for curve sketching in 3D. In: Marjanovic D, editor. Design 2002. Zagreb: Faculty of Mechanical Engineering and Naval Architecture, The Design Society; 2002.
[8] Forbus KD, Ferguson RW, Usher JM. Towards a computational model of sketching. In: Proceedings of the sixth international conference on intelligent user interfaces. New York: ACM Press; 2001 [DOI http://doi.acm.org/10.1145/359784.360278].
[9] Kendon A. Language and gesture: unity or duality? In: McNeill D, editor. Language and gesture. Cambridge, UK: Cambridge University Press; 2000.
[10] Oviatt S, Cohen P. Multimodal interfaces that process what comes naturally. Communications of the ACM 2000;43(3):45–54.
[11] Kaufmann H. Collaborative augmented reality in education. In: Imagina03 proceedings; 2003 [CD-ROM].
[12] Reiners D, Stricker D, Klinker G, Mueller S. Augmented reality for construction tasks: doorlock assembly. In: Proceedings of the international workshop on augmented reality: placing artificial objects in real scenes. Natick, MA, USA: A.K. Peters, Ltd.; 1998.
[13] Schwald B, de Laval B. An augmented reality system for training and assistance to maintenance in the industrial context. Journal of WSCG 2003;11(3):425–32.
[14] Microsoft speech technology, http://www.microsoft.com/speech/.
[15] Gordan G, Billinghurst M, Bell M, Woodfill J, Kowalik B, Erendi A, Tilander J. The use of dense stereo range data in augmented reality. In: Proceedings of the IEEE and ACM international symposium on mixed and augmented reality, ISMAR 2002. Los Alamitos, CA: IEEE Press; 2002.
[16] Goose S, Schneider G. Augmented reality in the palm of your hand: a PDA-based framework offering a location-based, 3D and speech-driven user interface. In: TCMC 2003: workshop on wearable computing. Graz, Austria; 2003.
[17] Cohen P, McGee D, Oviatt S, Wu L, Clow J, King R, et al. Multimodal interaction for 2D and 3D environments. IEEE Computer Graphics and Applications 1999;19(4):10–3.
[18] Oviatt S, Cohen P, Wu L, Vergo J, Duncan L, Suhm B, et al. Designing the user interface for multimodal speech and pen-based gesture applications: state-of-the-art systems and future research directions. Human–Computer Interaction 2000;15:263–322.
[19] Kulas C, Sandor C, Klinker G. Towards a development methodology for augmented reality user interfaces. In: Vanderdonckt J, Jardim Nunes N, Rich C, editors. Proceedings of the international workshop exploring the design and engineering of mixed reality systems—MIXER 2004, Funchal, Madeira, CEUR Workshop Proceedings. New York: ACM Press; 2004.
[20] Reitmayr G, Schmalstieg D. OpenTracker—an open software architecture for reconfigurable tracking based on XML. In: Proceedings of the IEEE virtual reality 2001, VR 2001. Los Alamitos, CA: IEEE Press; 2001.
[21] Nuance speech recognition software, http://www.nuance.com/prodserv/prodnuance.html.
[22] Dragon NaturallySpeaking, http://www.scansoft.com/naturallyspeaking/.
[23] Kaiser E, Olwal A, McGee D, Benko H, Corradini A, Li X, Cohen P, Feiner S. Mutual disambiguation of 3D multimodal interaction in augmented and virtual reality. In: Proceedings of the fifth international conference on multimodal interfaces. New York: ACM Press; 2003 [DOI http://doi.acm.org/10.1145/958432.958438].
[24] LaViola J. MSVT: a virtual reality-based multimodal scientific visualization tool. In: Proceedings of the third IASTED international conference on computer graphics and imaging, CGIM 2000. Anaheim, CA, USA: IASTED/ACTA Press; 2000.
[25] Weaver W. Translation. In: Locke WN, Booth AD, editors. Machine translation of languages. New York: Wiley; 1955.
[26] Cong L, Hang L. Word translation disambiguation using bilingual bootstrapping. Computational Linguistics 2004;30(1):1–22.
[27] Li X, Szpakowicz S, Matwin S. A WordNet-based algorithm for word sense disambiguation. In: International joint conference on artificial intelligence, IJCAI-95. Los Altos, CA: Morgan Kaufmann Publishers Inc.; 1995.
[28] Ciaramita M, Hofmann T, Johnson M. Hierarchical semantic classification: word sense disambiguation with world knowledge. In: 18th international joint conference on artificial intelligence, IJCAI-03. Los Altos, CA: Morgan Kaufmann Publishers Inc.; 2003.
[29] Cañas AJ, Valerio A, Lalinde-Pulido J, Carvalho M, Arguedas M. Using WordNet for word sense disambiguation to support concept map construction. In: Nascimento MA, de Moura ES, Oliveira AL, editors. SPIRE 2003. Berlin, New York: Springer; 2003.
[30] Miller GA. Nouns in WordNet: a lexical inheritance system. International Journal of Lexicography 1990;3(4):245–64.
[31] Sussna M. Word sense disambiguation for free-text indexing using a massive semantic network. In: Second international conference on information and knowledge management. New York: ACM Press; 1993 [DOI http://doi.acm.org/10.1145/170088.170106].
[32] Mihalcea R, Moldovan D. A method for word sense disambiguation of unrestricted text. In: ACL '99. Morristown, NJ, USA: Association for Computational Linguistics; 1999.
[33] Miller GA, Beckwith R, Fellbaum C, Gross D, Miller KJ. Introduction to WordNet: an on-line lexical database. International Journal of Lexicography 1990;3(4):235–44.
[34] Fellbaum C. English verbs as a semantic net. International Journal of Lexicography 1990;3(4):278–301.
[35] Global WordNet website, http://www.globalWordNet.org.
[36] WordNet website, http://www.cogsci.princeton.edu/~wn/links.shtml.
[37] EssentialReality P5 glove, http://www.essentialreality.com/specifications.asp.