
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 61–66, Sofia, Bulgaria, August 4–9 2013. © 2013 Association for Computational Linguistics

Meet EDGAR, a tutoring agent at MONSERRATE

Pedro Fialho, Luísa Coheur, Sérgio Curto, Pedro Cláudio, Ângela Costa, Alberto Abad, Hugo Meinedo and Isabel Trancoso

Spoken Language Systems Lab (L2F), INESC-ID
Rua Alves Redol 9
1000-029 Lisbon, Portugal
[email protected]

Abstract

In this paper we describe a platform for embodied conversational agents with tutoring goals, which takes as input written and spoken questions and outputs answers in both forms. The platform is developed within a game environment, and currently allows speech recognition and synthesis in Portuguese, English and Spanish. We focus on its understanding component, which supports in-domain interactions as well as small talk. Most in-domain interactions are answered using different similarity metrics, which compare the perceived utterances with questions/sentences in the agent's knowledge base; small-talk capabilities are mainly due to AIML, a language widely used by the chatbot community. We also introduce EDGAR, the butler of MONSERRATE, which was developed on the aforementioned platform and answers tourists' questions about MONSERRATE.

1 Introduction

Several initiatives targeting the concept of Edutainment, that is, education through entertainment, have taken place in recent years. Following this strategy, virtual characters have animated several museums all over the world: the 3D animated Hans Christian Andersen is capable of establishing multimodal conversations about the writer's life and tales (Bernsen and Dybkjær, 2005), Max is a virtual character employed as a guide in the Heinz Nixdorf MuseumsForum (Pfeiffer et al., 2011), and Sergeant Blackwell, installed in the Cooper-Hewitt National Design Museum in New York, is used by the U.S. Army Recruiting Command as a hi-tech attraction and information source (Robinson et al., 2008).

Figure 1: EDGAR at MONSERRATE.

DuARTE Digital (Mendes et al., 2009) and EDGAR are also examples of virtual characters for the Portuguese language with the same edutainment goal: DuARTE Digital answers questions about the Custódia de Belém, a famous work of Portuguese jewelry; EDGAR is a virtual butler that answers questions about MONSERRATE (Figure 1).

All the previously mentioned agents cover a specific domain of knowledge (although a general Question Answering system was integrated into Max (Waltinger et al., 2011)). However, as expected, people also tend to make small talk when interacting with these agents, so it is important that these systems deal with it properly. Several strategies are envisaged to this end, and EDGAR is no exception. In this paper, we describe the platform behind EDGAR, which we developed aiming at fast insertion of in-domain knowledge and at handling small talk. This platform is currently in the process of being industrially applied by a company known for its expertise in building and deploying kiosks. We will provide the hardware and software required to demonstrate EDGAR, both on a computer and on a tablet.

This paper is organized as follows: in Section 2 we present EDGAR's development platform and describe typical interactions; in Section 3 we show how we move from in-domain interactions to small talk; and in Section 4 we present an analysis of collected logs and their initial evaluation results. Finally, in Section 5 we present some conclusions and point to future work.



Figure 2: EDGAR architecture.


2 The Embodied Conversational Agent platform

2.1 Architecture overview

The architecture of the platform, designed generally for the development of Embodied Conversational Agents (ECAs) such as EDGAR, is shown in Figure 2. In this platform, several modules intercommunicate by means of well-defined protocols, thus leveraging the capabilities of independent modules focused on specific tasks, such as speech recognition or 3D rendering/animation. This independence allows us to use subsets of the platform's modules in scenarios with different requirements (for instance, we can record characters uttering a text).

The design and deployment of EDGAR's front end are performed in a game engine, which has enabled the use of computer graphics technologies and high-quality assets, as seen in the video game industry.

2.2 Multimodal components

The game environment, where all the interaction with EDGAR takes place, is developed in the Unity platform (http://unity3d.com/). It is composed of one highly detailed character, made and animated by Rocketbox studios (http://www.rocketbox-libraries.com/), a virtual keyboard and a push-while-talking button.

In this platform, Automatic Speech Recognition (ASR) is performed by AUDIMUS (Meinedo et al., 2003) for all languages, using generic acoustic and language models recently compiled from broadcast news data (Meinedo et al., 2010). The language models were interpolated with all the domain questions defined in the Natural Language Understanding (NLU) framework (see below), while the ASR includes features such as speech/non-speech (SNS) detection and automatic gain control (AGC). Speech captured in a public space raises several ASR robustness issues, such as the loudness variability of spoken utterances, which is particularly bound to happen in a museological environment (such as MONSERRATE) where silence is usually encouraged. Thus, besides the AGC mechanism, we have added a bounded amplification of the captured signal, ensuring that overly quiet sounds are not discarded by the SNS mechanism.
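This amplification step can be sketched as a simple bounded gain applied before SNS detection. The Python fragment below is an illustration only; the gain value and function name are ours, not those of AUDIMUS:

import numpy as np

def bounded_amplify(samples: np.ndarray, gain: float = 4.0) -> np.ndarray:
    """Amplify a mono signal in [-1, 1], clipping at the valid range.

    Quiet utterances are boosted so the speech/non-speech detector
    does not discard them; np.clip bounds the result so already-loud
    input is not distorted further. The gain value is illustrative.
    """
    return np.clip(samples * gain, -1.0, 1.0)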

Upon spoken input, AUDIMUS translates it into a sentence with a confidence value. An empty recognition result, or one with low confidence, sends a control tag ("_REPEAT_") to the NLU module, which results in a request for the user to repeat what was said. The answer returned by the NLU module is synthesized by a language-dependent Text To Speech (TTS) system, with DIXI (Paulo et al., 2008) being used for Portuguese, while a recent version of FESTIVAL (Zen et al., 2009) covers both English and Spanish.




The synthesized audio is played while the corresponding phonemes are mapped into visemes, represented as skeletal animations, which are synchronized according to the phoneme durations made available by all the employed TTS engines.

Emotions are declared in the knowledge sources of the agent. As shown in Figure 3, they are coordinated with the viseme animations.

Figure 3: The EDGAR character in a joyful state.

2.3 Interacting with EDGAR

In a typical interaction, the user enters a question with a virtual keyboard or says it into the microphone while pressing a button (Figure 4), in the language chosen in the interface (as previously said, Portuguese, English or Spanish).

Figure 4: A question written in the EDGAR interface.

Then, the ASR transcribes it and the NLU module processes it. Afterwards, the answer chosen by the NLU module is heard through the speakers, via the TTS, and progressively written in a talk bubble, following the produced speech. The answer is accompanied by visemes, represented by movements of the character's mouth/lips, and by the facial emotions marked in the answers of the NLU knowledge base. A demo of EDGAR, for English interactions only, can be tested at https://edgar.l2f.inesc-id.pt/m3/edgar.php.

3 The natural language understanding component

3.1 In-domain knowledge sources

The in-domain knowledge sources of the agent are XML files, hand-crafted by domain experts.

These XML files contain multilingual pairs consisting of different paraphrases of the same question and its possible answers. The main reason to follow this approach (contrary to other works where grammars are used) is to ease the process of creating/enriching the knowledge sources of the agent being developed, which is typically done by non-experts in linguistics or computer science. Thus, we opted to follow an approach similar to the work described, for instance, in (Leuski et al., 2006), where the agent's knowledge sources are easy to create and maintain. An example of a questions/answers pair is:

<questions>
  <q en="How is everything?"
     es="Todo bien?">Tudo bem?</q>
</questions>
<answers>
  <a en="I am ok, thank you."
     es="Estoy bien, gracias."
     emotion="smile_02">Estou bem, obrigado.</a>
</answers>

As can be seen from this example, emotions are defined in these files, associated with each question/answer pair (emotion="smile_02" in the example, one of the possible smile emotions).
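For illustration, such a pair can be loaded with standard XML parsing; the sketch below is ours and does not reproduce the platform's actual loader:

import xml.etree.ElementTree as ET

def load_pair(xml_text, lang="en"):
    """Extract the questions and answers of one pair for a language.

    Following the example above, the Portuguese text is the element
    body, while the other languages are attributes. The wrapping
    <pair> element is added only to obtain a single XML root.
    """
    root = ET.fromstring("<pair>" + xml_text + "</pair>")
    def get(e):
        return e.text if lang == "pt" else e.get(lang)
    questions = [get(q) for q in root.iter("q")]
    answers = [(get(a), a.get("emotion")) for a in root.iter("a")]
    return questions, answers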

These knowledge sources can be (automatically) extended with "synonyms". We call them "synonyms" because they do not necessarily fit the usual definition of synonymy; we follow a broader notion: if two words, within the context of a sentence from the knowledge source, lead to the same answer, then we consider them "synonyms". For instance, "palace" and "castle" are not synonyms, but people tend to refer to MONSERRATE by both names. Thus, we consider them "synonyms", and whenever one of them is used in the original knowledge sources, the other is used to expand them (see the sketch below). It should be clear that this procedure generates many incorrect questions, but empirical tests (out of the scope of this paper) show that these questions do not hurt the system's performance. Moreover, they are useful for ASR language model interpolation, which is based on N-grams.
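As an illustration of the expansion step, the following sketch generates new question variants by substituting "synonyms" wherever they occur. It is a simplified reconstruction of the procedure, not the actual tool (a real implementation would, at least, match whole words only):

def expand_questions(questions, synonyms):
    """Expand a question list with 'synonym' substitutions.

    `synonyms` maps a word to the words treated as interchangeable
    with it in this domain, e.g. {"palace": ["castle"]}. Variants
    are themselves expanded again, so combinations are covered.
    """
    expanded = set(questions)
    frontier = list(questions)
    while frontier:
        q = frontier.pop()
        for word, alternatives in synonyms.items():
            for alt in alternatives:
                variant = q.replace(word, alt)
                if variant not in expanded:
                    expanded.add(variant)
                    frontier.append(variant)
    return sorted(expanded)

# "palace" and "castle" lead to the same answer, so both forms of
# the question end up in the expanded knowledge source:
print(expand_questions(["who built the palace?"], {"palace": ["castle"]}))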



3.2 Out-of-domain knowledge sources

The same format as the previously described knowledge sources can be used to represent out-of-domain knowledge. Here, we extensively used the "synonyms" approach. For instance, the words wife and girlfriend are considered "synonyms", as all the personal questions containing these words should be answered with the same sentence: I do not want to talk about my private life.

Nevertheless, taking into consideration the work on small talk developed by the chatbot community (Klüwer, 2011), we decided to use the most popular language for building chatbots: the "Artificial Intelligence Markup Language", widely known as AIML, a derivative of XML. With AIML, knowledge is coded as a set of rules that match the user input, associated with templates, the generators of the output. A detailed description of the AIML syntax can be found at http://www.alicebot.org/aiml.html. As for AIML interpreters, we opted for Program D (Java), which we integrated into our platform. Currently, we use AIML to deal with slang and to answer questions related to cinema and compliments.

As a curiosity, we should explain that we deal with slang only when the input comes from the keyboard, and not when it is speech, as the language models are not trained with this specific lexicon. The reason is that, if the language models were trained with slang, it could be erroneously detected in utterances, which would then be answered accordingly; this could be extremely unpleasant. Therefore, EDGAR only deals with slang when the input comes from the keyboard.

The current knowledge sources have 152 question/answer pairs, corresponding to 763 questions and 206 answers. For Portuguese, English and Spanish, the use of 226, 219 and 53 synonym relations led to the generation of 22 194, 16 378 and 1 716 new questions, respectively.

3.3 Finding the appropriate answer

The NLU module is responsible for the answer selection process. It has three main components.

The first one, STRATEGIES, is responsible for choosing an appropriate answer to the received interaction. Several strategies are implemented, including ones based on string matching, string distances (for instance, Levenshtein, Jaccard and Dice), N-gram overlap, and support vector machines (seeing answer selection as a classification problem). Currently, the best results are attained using a combination of the Jaccard and bigram overlap measures and word weighting through the tf-idf statistic. In this case, Jaccard takes into account how many words are shared between the user's interaction and the knowledge source entry, bigram overlap gives preference to shared sequences of words, and tf-idf contributes to the results attained by the previous measures by giving weight to infrequent words, which should have more weight in the decision process (for example, the word MONSERRATE occurs in the majority of the questions in the corpus, so it is not very informative and should not have the same weight as, for instance, the word architect or owner).
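A minimal sketch of such a combination follows. Since the exact weighting of the three measures is not given here, the equally weighted average below is an assumption, not the deployed system's tuned setting:

import math
from collections import Counter

def tokens(s):
    return s.lower().split()

def bigrams(ws):
    return set(zip(ws, ws[1:]))

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def bigram_overlap(a, b):
    ba, bb = bigrams(a), bigrams(b)
    return len(ba & bb) / len(ba | bb) if ba | bb else 0.0

def idf_weights(corpus):
    """Inverse document frequency over the knowledge-base questions:
    frequent words (e.g. MONSERRATE) get low weight."""
    n = len(corpus)
    df = Counter(w for q in corpus for w in set(tokens(q)))
    return {w: math.log(n / df[w]) for w in df}

def weighted_overlap(a, b, idf):
    """Word overlap where each word counts its idf weight."""
    shared, total = set(a) & set(b), set(a) | set(b)
    den = sum(idf.get(w, 0.0) for w in total)
    return sum(idf.get(w, 0.0) for w in shared) / den if den else 0.0

def score(user, kb_question, idf):
    a, b = tokens(user), tokens(kb_question)
    # Equal weighting is an assumption made for this sketch.
    return (jaccard(a, b) + bigram_overlap(a, b)
            + weighted_overlap(a, b, idf)) / 3.0

The knowledge-base entry with the highest score is then selected, provided the score exceeds the technique-dependent threshold discussed below.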

The second component, PLUGINS, deals with two different situations. First, it accesses Program D when interactions are not answered by the STRATEGIES component: when the technique used by STRATEGIES returns a value lower than a threshold (dependent on the technique used), the PLUGINS component runs Program D in order to try to find an answer to the posed question. Second, when the ASR has no confidence in the obtained transcription (and returns the "_REPEAT_" tag) or Program D is not able to find an answer, the PLUGINS component does the following (with the goal of bringing the user back to the agent's topic of expertise):

• The first time this occurs, a sentence such as Sorry, I did not understand you. is chosen as the answer to be returned.

• The second time this occurs, EDGAR asks the user I did not understand you again. Why don't you ask me X?, where X is generated at run time and is a question from a subset of the questions in the knowledge sources. Obviously, only in-domain (not expanded) questions are considered for replacing X.

• The third time there is a misunderstanding, EDGAR says We are not understanding each other, let me talk about MONSERRATE. and randomly chooses some answer to present to the user.
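Taken together, these fallbacks form a three-step escalation keyed on a counter of consecutive misunderstandings. A minimal sketch, with hypothetical function and argument names (the actual PLUGINS implementation is not published):

import random

def fallback_answer(miss_count, sample_questions, random_answers):
    """Escalating reaction to consecutive misunderstandings.

    Called when ASR returns _REPEAT_ or neither STRATEGIES nor
    Program D finds an answer; miss_count is reset on success.
    """
    if miss_count == 1:
        return "Sorry, I did not understand you."
    if miss_count == 2:
        # X comes from the in-domain (not expanded) questions only.
        x = random.choice(sample_questions)
        return f"I did not understand you again. Why don't you ask me {x}"
    return ("We are not understanding each other, let me talk about "
            "MONSERRATE. " + random.choice(random_answers))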

The third component is the HISTORY-TRACKER, which handles the agent's knowledge about previous interactions (kept until a default period without interactions is reached).



4 Preliminary evaluation

EDGAR is more a domain-specific Question Answering (QA) system than a task-oriented dialogue system. Therefore, we evaluated it with the metrics typically used in QA. The mapping of the different situations into true/false positives/negatives is explained in the following.

We have manually transcribed 1086 spoken utterances (in Portuguese), which were then labeled with the following tags, some depending on the answer given by EDGAR:

• 0: in-domain question incorrectly answered, although there was information in the knowledge sources (excluding Program D) to answer it;

• 1: out-of-domain question, incorrectly answered;

• 2: question correctly answered by Program D;

• 3: question correctly answered using the knowledge sources (excluding Program D);

• 4: in-domain question, incorrectly answered. There is no information in the knowledge source to answer it, but there should be;

• 5: multiple questions, partially answered;

• 6: multiple questions, unanswered;

• 7: question with implicit information (there, him, etc.), unanswered;

• 8: question which is not "ipsis verbis" in the knowledge source, but has a paraphrase there, and was not correctly answered;

• 9: question with a single word (garden, palace), unanswered;

• 10: question that we do not want the system to answer (some were answered, some were not).

The previous tags were mapped into:

• true positives: questions marked with 2, 3 and 5;

• true negatives: questions marked with 0 and 10 (the ones that were not answered by the system);

• false positives: questions marked with 0 and 10 (the ones that were answered by the system);

• false negatives: questions marked with 4, 6, 7, 8 and 9.
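Given this mapping, precision, recall and F-measure follow their standard definitions, P = TP/(TP+FP), R = TP/(TP+FN) and F = 2PR/(P+R). The sketch below scores a list of (tag, answered) labels accordingly:

def qa_metrics(labels):
    """labels: list of (tag, answered) pairs, where tag is the 0-10
    annotation above and answered says whether EDGAR gave an answer."""
    tp = sum(1 for t, _ in labels if t in (2, 3, 5))
    fp = sum(1 for t, a in labels if t in (0, 10) and a)
    fn = sum(1 for t, _ in labels if t in (4, 6, 7, 8, 9))
    # True negatives (tags 0 and 10, not answered) are not needed
    # for these three metrics.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure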

Then, two experiments were conducted: in the first, the NLU module was applied to the manual transcriptions; in the second, directly to the output of the ASR. Table 1 shows the results.

NLU input = manual transcriptions
Precision   Recall   F-measure
0.92        0.60     0.72

NLU input = ASR
Precision   Recall   F-measure
0.71        0.32     0.45

Table 1: NLU results

The ASR Word Error Rate (WER) is 70%. However, we detected some problems in the way we were collecting the audio, and in more recent evaluations (using 363 recent logs where the previous problems were corrected), the error decreased to a WER of 52%. These logs include speech from 111 children, 21 non-native Portuguese speakers (thus, with a different pronunciation), 23 individuals not talking in Portuguese, and 27 interactions where multiple speakers overlap. Here, we should refer to the work presented in (Traum et al., 2012), where an evaluation of two virtual guides in a museum is presented. They also had to deal with speakers of different ages and with off-topic questions, and report an ASR WER of 57% (however, the majority of their users are children: 76%).
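For reference, WER is the word-level edit distance (substitutions, insertions and deletions) between the recognized text and the reference transcription, divided by the number of reference words. A standard dynamic-programming implementation:

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over words / #ref words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[len(ref)][len(hyp)] / len(ref) if ref else 0.0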

We are currently preparing a new corpus for evaluating the NLU module; however, the following result stands: in the best scenario, with perfect transcriptions, the NLU module behaves as indicated in Table 1 (manual transcriptions).

5 Conclusions and Future Work

We have described a platform for developing ECAs with tutoring goals, which takes both speech and text as input and output, and introduced EDGAR, the butler of MONSERRATE, which was developed on that platform. Special attention was given to EDGAR's NLU module, which couples techniques that compute distances between the user input and sentences in the existing knowledge sources with a framework imported from the chatbot community (AIML plus Program D).




EDGAR has been tested with real users for the last year and we are currently performing a detailed evaluation of it. There is still much work to be done, including dealing with language varieties, which are an important source of recognition errors. Moreover, the capacity to deal with out-of-domain questions is still a hot research topic and one of our priorities in the near future. We have witnessed that people are delighted when EDGAR answers out-of-domain questions (Do you like soccer?/I rather have a tea and read a good criminal book) and we cannot forget that entertainment is also one of this Embodied Conversational Agent (ECA)'s goals.

Acknowledgments

This work was supported by national funds through FCT – Fundação para a Ciência e a Tecnologia, under project PEst-OE/EEI/LA0021/2013. The scholarships of Pedro Fialho, Sérgio Curto and Pedro Cláudio were supported under project FALACOMIGO (Projecto VII em co-promoção, QREN n. 13449).

References

N. O. Bernsen and L. Dybkjær. 2005. Meet Hans Christian Andersen. In Proceedings of the Sixth SIGdial Workshop on Discourse and Dialogue, pages 237–241.

Tina Klüwer. 2011. "I like your shirt" – Dialogue acts for enabling social talk in conversational agents. In Proceedings of the 11th International Conference on Intelligent Virtual Agents (IVA), September 17–19, Reykjavik, Iceland. Springer.

Anton Leuski, Ronakkumar Patel, David Traum, and Brandon Kennedy. 2006. Building effective question answering characters. In 7th SIGdial Workshop on Discourse and Dialogue, Sydney, Australia.

Hugo Meinedo, Diamantino Caseiro, João Neto, and Isabel Trancoso. 2003. AUDIMUS.media: a broadcast news speech recognition system for the European Portuguese language. In Proceedings of the 6th International Conference on Computational Processing of the Portuguese Language, PROPOR'03, pages 9–17, Berlin, Heidelberg. Springer-Verlag.

H. Meinedo, A. Abad, T. Pellegrini, I. Trancoso, and J. P. Neto. 2010. The L2F broadcast news speech recognition system. In Proceedings of Fala2010, Vigo, Spain.

Ana Cristina Mendes, Rui Prada, and Luísa Coheur. 2009. Adapting a virtual agent to users' vocabulary and needs. In Proceedings of the 9th International Conference on Intelligent Virtual Agents, IVA '09, pages 529–530, Berlin, Heidelberg. Springer-Verlag.

Sérgio Paulo, Luís C. Oliveira, Carlos Mendes, Luís Figueira, Renato Cassaca, Céu Viana, and Helena Moniz. 2008. DIXI — a generic text-to-speech system for European Portuguese. In Proceedings of the 8th International Conference on Computational Processing of the Portuguese Language, PROPOR '08, pages 91–100, Berlin, Heidelberg. Springer-Verlag.

Thies Pfeiffer, Christian Liguda, Ipke Wachsmuth, and Stefan Stein. 2011. Living with a virtual agent: Seven years with an embodied conversational agent at the Heinz Nixdorf MuseumsForum. In Proceedings of the International Conference Re-Thinking Technology in Museums 2011 – Emerging Experiences, pages 121–131. thinkk creative & the University of Limerick.

Susan Robinson, David Traum, Midhun Ittycheriah, and Joe Henderer. 2008. What would you ask a conversational agent? Observations of human-agent dialogues in a museum setting. In International Conference on Language Resources and Evaluation (LREC), Marrakech, Morocco.

David Traum, Priti Aggarwal, Ron Artstein, Susan Foutz, Jillian Gerten, Athanasios Katsamanis, Anton Leuski, Dan Noren, and William Swartout. 2012. Ada and Grace: Direct interaction with museum visitors. In The 12th International Conference on Intelligent Virtual Agents (IVA), Santa Cruz, CA, September.

Ulli Waltinger, Alexa Breuing, and Ipke Wachsmuth. 2011. Interfacing virtual agents with collaborative knowledge: Open domain question answering using Wikipedia-based topic models. In IJCAI, pages 1896–1902.

Heiga Zen, Keiichiro Oura, Takashi Nose, Junichi Yamagishi, Shinji Sako, Tomoki Toda, Takashi Masuko, Alan W. Black, and Keiichi Tokuda. 2009. Recent development of the HMM-based speech synthesis system (HTS). In Proc. 2009 Asia-Pacific Signal and Information Processing Association (APSIPA), Sapporo, Japan, October.
