
Proceedings of the Fifth Annual
Young Researchers' Roundtable on Spoken Dialogue Systems '09

Queen Mary University of London, UK
September 13th–14th, 2009

http://www.yrrsds.org/


Sponsors

Endorsements


Organising Committee

David Díaz Pardo de Vera, Signal Processing Applications Group (GAPS), Polytechnic University of Madrid, Spain

Milica Gašić, Department of Engineering, University of Cambridge, UK

François Mairesse, Department of Engineering, University of Cambridge, UK

Matthew Marge, Language Technologies Institute, Carnegie Mellon University, USA

Joana Paulo Pardal, L2F INESC-ID and IST, Technical University of Lisbon, Portugal

Ricardo Ribeiro, L2F INESC-ID and ISCTE-IUL, Lisbon, Portugal

Local Organisers

Arash Eshghi, Interaction, Media and Communication Group, Queen Mary University of London, UK

Christine Howes, Interaction, Media and Communication Group, Queen Mary University of London, UK

Gregory Mills, Interaction, Media and Communication Group, Queen Mary University of London, UK


Foreword

Welcome to the fifth annual Young Researchers' Roundtable on Spoken Dialogue Systems, which builds on the good work of the previous workshops, held in Columbus (2008), Antwerp (2007), Pittsburgh (2006) and Lisbon (2005).

The Young Researchers' Roundtable aims to promote networking and discussion in the rich and varied field of dialogue systems between students and young researchers in both academia and industry. It is our hope that our program will facilitate and encourage this, and that our time together will be worthwhile, productive, and fun!

We would like to thank our sponsors, Orange, Microsoft Research and AT&T Labs-Research, for giving us the opportunity to make the workshop fee as low as possible for our participants and for providing coffee breaks, lunch and dinner. Thanks also to ISCA, SIGDial, and IEEE SLTC for endorsing and promoting the event, and thus broadening our reach. Special thanks to Queen Mary University of London for enabling us to use the university's rooms. Further, we would particularly like to thank the members of our advisory committee for providing assistance and helping to promote the event. And last, but not least, we would like to thank all of you, this year's participants, for providing interesting and varied position papers and thought-provoking questions for discussion, and without whom this event would not be possible.

We hope you all enjoy the workshop, and have fun in London!

The organising committee of YRRSDS 2009


Contents

Sponsors

Endorsements

Organising Committee

Local Organisers

Foreword

Program

Advisory Committee

Industry Speakers: Philippe Bretier, Sebastian Möller, Tim Paek

Invited Panelists: Jason Williams, Michael McTear

Frameworks and Grand Challenges for Dialogue System Evaluation – Speakers: Alan W. Black, Dan Bohus

List of Participants

Position Papers: Timo Baumann, Luciana Benotti, José Luis Blanco Murillo, Dan Bohus, Okko Buß, Dmitry Butenkov, Caroline Clemens, Klaus-Peter Engelbrecht, Arash Eshghi, Milica Gašić, Florian Gödde, Christine Howes, Srinivasan C. Janarthanam, Denise Cristina Kluge, Theodora Koulouri, Christine Kühnel, Catherine Lai, Marianne Laurent, Beatriz López Mencía, Syaheerah L. Lutfi, François Mairesse, Matthew Marge, Lluís Mas, Toyomi Meguro, Gregory Mills, Teruhisa Misu, Aasish Pappu, Joana Paulo Pardal, David Díaz Pardo de Vera, Florian Pinault, Mara Reis, Sylvie Saget, Nur-Hana Samsudin, Stefan Schaffer, Alexander Schmitt, Niels Schütte, Álvaro Sigüenza Izquierdo, Vasile Topac, Bogdan Vlasenko, Ina Wechsung, Charlotte Wollermann


Program

Day before - September 12th, Saturday

19:30 YRRSDS get-together

First Day - September 13th, Sunday

08:30 Registration & Coffee

09:00 Reception session (with photo and a 30-second introduction per participant)

09:30 Discussion session I. Topics: to be determined based on participants' interests

11:30 Summaries & discussion

12:00 Industry/Company talks: sponsor presentations

• Philippe Bretier, Orange

• Tim Paek, Microsoft Research

• Sebastian Möller, Deutsche Telekom Laboratories

13:00 Lunch break

14:00 Discussion session II. Topics: to be determined based on participants' interests

15:30 Summaries & discussion

16:00 Tea break

16:30 Industry & Academic Panel: session on career development

• Jason Williams, AT&T

• Michael McTear, University of Ulster

18:30 Dinner


Second Day - September 14th, Monday

09:00 Coffee

09:30 Demos and Posters

10:30 Frameworks and Grand Challenges for Dialogue System Evaluation

• Alan Black, Carnegie Mellon University

• Dan Bohus, Microsoft Research

11:00 Discussion Sessions on Evaluation

• Automotive

• Human-Robot Dialogue

• Let’s Go

• Multimedia

• Tutorial

12:00 Summaries & discussion with Panelists

12:30 Closing

13:00 Lunch break and farewell


Advisory Committee

Hua Ai, University of Pittsburgh, USA

James Allen, University of Rochester, USA

Alan Black, Carnegie Mellon University, USA

Dan Bohus, Microsoft Research, USA

Philippe Bretier, Orange, France

Robert Dale, Macquarie University, Australia

Maxine Eskenazi, Carnegie Mellon University, USA

Sadaoki Furui, Tokyo Institute of Technology, Japan

Luis Hernández Gómez, Polytechnic University of Madrid, Spain

Carlos Gómez Gallo, University of Rochester, USA

Kristiina Jokinen, University of Helsinki, Finland

Nuno J. Mamede, L2F INESC-ID, Technical University of Lisbon, Portugal

David Martins de Matos, L2F INESC-ID, Technical University of Lisbon, Portugal

João Paulo Neto, L2F INESC-ID, Technical University of Lisbon, and Voice Interaction, Portugal

Tim Paek, Microsoft Research, USA


Antoine Raux, Honda Research Institute, USA

Robert Ross, University of Bremen, Germany

Alexander Rudnicky, Carnegie Mellon University, USA

Mary Swift, University of Rochester, USA

Isabel Trancoso, L2F INESC-ID, Technical University of Lisbon, Portugal

Tim Weale, The Ohio State University, USA

Jason Williams, AT&T Labs, USA

Sabrina Wilske, Saarland University, Germany

Andi Winterboer, University of Amsterdam, The Netherlands

Craig Wooton, University of Ulster, UK

Steve Young, University of Cambridge, UK


Industry Speakers

Philippe Bretier

Affiliation: Orange

Presentation Abstract:
Recent research in spoken dialogue systems (SDS) has called for a synergistic convergence between research and industry. The talk presents one of the works Orange Labs is carrying out under this motivation. Manual dialogue design and adaptive dialogue management reflect, more or less, the respective paths that industry and research have followed when building their SDS. Dialogue evaluation is a common concern for both communities, but remains hard to put into operational perspective. The talk shows how design tools, monitoring tools and dedicated reinforcement learning algorithms can be complementary and offer a new convergent experience for developers in designing and evaluating SDS. First, the integration of log feedback into an industrial SDS design tool makes it a monitoring tool as well, by revealing the effective call flows and their associated Key Performance Indicators. Second, the integration of design alternatives into a single dialogue design enables the SDS developers to visually compare different design choices. Third, the integration of a reinforcement learning technique allows automatic optimisation of the SDS choices. Finally, the design/monitoring tool helps the SDS developers understand the learnt behaviour. The SDS developers can then contrast the different indicators and control the further SDS choices by removing or adding dialogue alternatives.

Biographical Sketch:
Philippe Bretier heads the Natural Dialogue research group at Orange Labs, Lannion, France. He obtained Computer Science Engineering and Artificial Intelligence master's degrees from the University of Rennes and received a PhD for his work on automatic reasoning on interaction theory for dialogue management from the University of Paris Nord (1995). He joined France Telecom - Orange as a research engineer and developed the Artimis BDI-agent-based dialogue system. Alongside research-to-industry transfer work within the France Telecom - Orange operational and commercial divisions, he has focused his research on enhancing industrial automata-based dialogue technology. His recent research interests include dialogue evaluation metrics and methodologies, dialogue design tools including usage analytics and feedback, machine learning, and modular architectures in dialogue systems.

Sebastian Möller

Affiliation: Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin

Presentation Abstract:
Evaluating the quality and usability of spoken dialogue services is usually a cumbersome and expensive process. Whereas academic researchers are interested in quantifying the advances their research has made, industrial evaluations are commonly restricted to an absolute minimum, due to time and budget restrictions. This may, however, lead to low usability of the resulting commercial services. In my talk, I will address both academic and industrial approaches to evaluating and improving spoken dialogue services. I will highlight the current research topics of Deutsche Telekom Labs, a public-private partnership between Deutsche Telekom AG and TU Berlin, in the field of spoken dialogue systems.


Biographical Sketch:
Sebastian Möller was born in 1968 and studied electrical engineering at the universities of Bochum (Germany), Orléans (France) and Bologna (Italy). From 1994 to 2005, he held the position of scientific researcher at the Institute of Communication Acoustics (IKA), Ruhr-University Bochum, and worked on speech signal processing, speech technology, communication acoustics, as well as on speech communication quality aspects. Since June 2005, he has worked at Deutsche Telekom Laboratories, TU Berlin. He was appointed Professor at TU Berlin for the subject "Usability" in April 2007, and heads the "Quality and Usability Lab" at Deutsche Telekom Laboratories.

He received a Doctor-of-Engineering degree at Ruhr-University Bochum in 1999 for his work on the assessment and prediction of speech quality in telecommunications. In 2000, he was a guest scientist at the Institut Dalle Molle d'Intelligence Artificielle Perceptive (IDIAP) in Martigny (Switzerland), where he worked on the quality of speech recognition systems. He gained the qualification needed to be a professor (venia legendi) at the Faculty of Electrical Engineering and Information Technology at Ruhr-University Bochum in 2004, with a book on the quality of telephone-based spoken dialogue systems. In September 2008, he worked as a Visiting Fellow at MARCS Auditory Laboratories, University of Western Sydney (Australia), on the evaluation of avatars.

Sebastian Möller was awarded the GEERS prize in 1998 for his interdisciplinary work on the analysis of infant cries for early hearing-impairment detection, the ITG prize of the German Association for Electrical, Electronic & Information Technologies (VDE) in 2001, the Lothar-Cremer prize of the German Acoustical Association (DEGA) in 2003, and a Heisenberg fellowship of the German Research Foundation (DFG) in 2005. Since 1997, he has taken part in the standardisation activities of the International Telecommunication Union (ITU-T) on transmission performance of telephone networks and terminals. He is currently acting as a Rapporteur for question Q.8/12.

Tim Paek

Affiliation: Microsoft Research

Presentation Abstract:
Current Directions in Dialogue Research at Microsoft. In this talk, I will provide a broad overview of current dialogue research at Microsoft, as well as products and services offered by Microsoft that involve speech and dialogue technologies. I will also share my opinions about the opportunities and challenges of conducting dialogue research at an industry lab.

Biographical Sketch:
Dr. Tim Paek is a researcher in the Machine Learning and Applied Statistics group at Microsoft Research in Redmond, Washington. His primary research focus is on spoken dialogue systems, including both technologies that drive interaction (e.g., language modeling, dialogue management) as well as user interface design and experience. Since joining Microsoft Research in 2001, he has helped ship several products, including Voice Command 1.6, Live Search Mobile, and the Windows Vista speech interface. He currently serves as the President of the Special Interest Group on Discourse and Dialogue (SIGDIAL) for the Association for Computational Linguistics (ACL) and the International Speech Communication Association (ISCA).


Invited Panelists

Jason Williams

Affiliation: AT&T

Biographical Sketch:
Jason Williams is Principal Member of Technical Staff at AT&T Labs Research, where he has been since 2006. He received a BSE in Electrical Engineering from Princeton University in 1998, and at Cambridge University he received an MPhil in Computer Speech and Language Processing in 1999 and a PhD in Information Engineering in 2006. His main research interests are dialogue management, the design of spoken language systems, and planning under uncertainty. He is currently Editor-in-Chief of the IEEE Speech and Language Processing Technical Committee's Newsletter. He is also on the Science Advisory Committee of SIGDIAL. Prior to entering research, he built commercial spoken dialogue systems for Tellme Networks (now Microsoft), and others. He also served as a consultant with McKinsey & Company's Business Technology Office.

Michael McTear

Affiliation: University of Ulster

Biographical Sketch:
Michael McTear is Professor of Knowledge Engineering at the University of Ulster. He graduated in German Language and Literature from Queen's University Belfast in 1965, was awarded an MA in Linguistics at the University of Essex in 1975, and a PhD at the University of Ulster in 1981. He has been Visiting Professor at the University of Hawaii (1986-87), the University of Koblenz, Germany (1994-95), and the University of Granada, Spain (2006, 2007). He has been researching in the field of spoken dialogue systems for more than fifteen years and is the author of the widely used textbook Spoken Dialogue Technology: Toward the Conversational User Interface (Springer Verlag, 2004).

Michael McTear has delivered keynote addresses at many conferences and workshops, including the EU-funded DUMAS Workshop, Geneva, 2004, the SIGDial workshop, Lisbon, 2005, and the Spanish Conference on Natural Language Processing (SEPLN), Granada, 2005, and has delivered invited tutorials at the IEEE/ACL Conference on Spoken Language Technologies, Aruba, 2006, and ACL 2007, Prague. He was the recipient of the Distinguished Senior Research Fellowship of the University of Ulster in 2006. He is a member of the Scientific Advisory Group of SIGDial.


Frameworks and Grand Challenges for Dialogue System Evaluation: Speakers

Alan W. Black

Affiliation: Carnegie Mellon University

Biographical Sketch:
Alan W. Black is an Associate Professor in the Language Technologies Institute at Carnegie Mellon University. He previously worked in the Centre for Speech Technology Research at the University of Edinburgh, and before that at ATR in Japan. He is one of the principal authors of the free software Festival Speech Synthesis System, the FestVox voice building tools and CMU Flite, a small-footprint speech synthesis engine. He received his PhD in Computational Linguistics from Edinburgh University in 1993, his MSc in Knowledge Based Systems also from Edinburgh in 1986, and a BSc (Hons) in Computer Science from Coventry University in 1984. Although much of his core research focuses on speech synthesis, he also works on real-time hands-free speech-to-speech translation systems (Croatian, Arabic and Thai), spoken dialogue systems, and rapid language adaptation for support of new languages. Alan W. Black was an elected member of the IEEE Speech Technical Committee (2003-2007). He is currently on the board of ISCA and on the editorial board of Speech Communication. He was program chair of the ISCA Speech Synthesis Workshop 2004, and was general co-chair of Interspeech 2006 – ICSLP. In 2004, with Prof. Keiichi Tokuda, he initiated the now-annual Blizzard Challenge, the largest multi-site evaluation of corpus-based speech synthesis techniques.

Dan Bohus

Affiliation: Microsoft Research

Biographical Sketch:
Before joining Microsoft Research, Dan Bohus received his Ph.D. degree (2007) in Computer Science from Carnegie Mellon University. His dissertation work, supervised by Alex Rudnicky and Roni Rosenfeld, focused on developing techniques for increasing the robustness and reliability of spoken language interfaces. As part of this work, he developed a task-independent dialogue management framework called RavenClaw, which has been used to build and successfully deploy various dialogue systems spanning different domains and interaction types. Prior to CMU, Dan Bohus obtained a B.S. degree in Computer Science from Politehnica University of Timisoara, Romania.


List of Participants (ordered by last name)

Timo Baumann, University of Potsdam, Germany

Luciana Benotti, LORIA/INRIA, France

José Luis Blanco Murillo, Univ Politécnica de Madrid, Spain

Dan Bohus, Microsoft Research, USA

Okko Buß, Potsdam University, Germany

Dmitry Butenkov, Deutsche Telekom Labs, Germany

Caroline Clemens, Deutsche Telekom Labs, Germany

Klaus-Peter Engelbrecht, Deutsche Telekom Labs, Germany

Arash Eshghi, Queen Mary Univ of London, United Kingdom

Milica Gašić, University of Cambridge, United Kingdom

Florian Gödde, Berlin Institute of Technology, Germany

Christine Howes, Queen Mary Univ of London, United Kingdom

Srinivasan Janarthanam, University of Edinburgh, United Kingdom

Theodora Koulouri, Brunel University, United Kingdom

Catherine Lai, University of Pennsylvania, USA

Marianne Laurent, Orange Labs, Technopôle Brest-Iroise, France

Beatriz López Mencía, Univ Politécnica de Madrid, Spain

Syaheerah Lutfi, Univ Politécnica de Madrid, Spain


François Mairesse, University of Cambridge, United Kingdom

Lluís Mas Manchón, Univ Autònoma de Barcelona, Spain

Toyomi Meguro, NTT Communication Science Labs, NTT Corporation, Japan

Gregory Mills, Queen Mary Univ of London, United Kingdom

Teruhisa Misu, National Institute of Information & Communications Technology (NICT), Japan

Joana Paulo Pardal, L2F INESC-ID, IST Tech Univ Lisbon, Portugal

David Díaz Pardo de Vera, Univ Politécnica de Madrid, Spain

Florian Pinault, LIA - University of Avignon, France

Mara Reis, Federal Univ of Sta Catarina, Brazil

Sylvie Saget, LLI-IRISA, France

Nur-Hana Samsudin, University of Birmingham, United Kingdom

Alexander Schmitt, University of Ulm, Germany

Niels Schütte, Dublin Institute of Technology, Ireland

Álvaro Sigüenza Izquierdo, Univ Politécnica de Madrid, Spain

Vasile Topac, Politehnica Univ of Timisoara, Romania

Bogdan Vlasenko, Otto-von-Guericke University Magdeburg, Germany

Ina Wechsung, Deutsche Telekom Labs, Germany

Charlotte Wollermann, University of Bonn, Germany

Position paper excerpt: Álvaro Sigüenza Izquierdo, Univ Politécnica de Madrid

...Once the interruption has finished, the dialogue should be resumed at just the point where it was before the interruption. However, the dialogue control might make use of additional information (such as driving context or user preferences), providing the dialogue with enough intelligence to stay focused and to decrease the number of dialogue turns, and consequently the workload. For example, if an application knows that the driver likes vegetarian restaurants, it might suggest that he stop to have lunch without needing to ask him about his preferences. In the same way, depending on the driving situation, it might be better to present the system's information in another interaction modality (visual display, haptic actuator).

Since there are few studies focusing on the workload management of a spoken dialogue system operating concurrently with the driving task, GAPS has developed a bench simulator to research all of these issues (Blanco et al., 2009).

2 Future of Spoken Dialog Research

I believe that in the near future we will see a proliferation of spoken dialog systems in a wide range of contexts, such as the vehicle. This will raise great concern about analysing the effects that concurrent performance of other tasks has on the course of the dialogue. Because in some contexts the spoken interaction with a system cannot be the priority task (e.g., while driving), a study of the workload generated by the dialogues is needed to ensure the right performance of the primary task. We need to incorporate technologies for the representation of context in order to adapt the dialogue flow to the workload situation. Likewise, research will be necessary on methods that allow overloaded users to be detected a priori.

In the driving context, carrying out these aims will require a simulation bench where we can evaluate different strategies of workload adaptation without putting drivers at risk. In any case, a model of the structure of human attention resources is needed so that spare resources can be considered when we analyse the best way of interacting at a specific instant. A toy version of such a workload-adaptive policy is sketched after this excerpt.

3 Suggestions for discussion

My suggestions for discussion are the following:

• Incorporation of contextual information and inference technologies into the spoken dialogue, considering the time restrictions in this kind of system.

• How should the spoken dialogue be resumed when it is interrupted because the user is overloaded, or because of adverse driving situations?

• Strategies to reduce the workload in dual-task environments.

• Methods to detect overloaded users using only the speech information of the dialogue.

References

Kun A., Paek T., and Medenica Z. 2007. The Effect of Speech Interface Accuracy on Driving Performance. Interspeech 2007.

Vollrath M. 2007. Speech and driving - solution or problem? Intelligent Transport Systems, IET 1, 89-94.

Green P. 2004. Driver Distraction, Telematics Design, and Workload Managers: Safety Issues and Solutions. SAE Paper Number 2004-21-0022.

Nishimoto T., Shioya M., Takahashi J. and Daigo H. 2005. A study of dialogue management principles corresponding to the driver's workload. Biennial Workshop on Digital Signal Processing for In-Vehicle and Mobile Systems.

Blanco J.L., Sigüenza A., Díaz D., Sendra M. and Hernández L.A. 2009. Reworking spoken dialogue systems with context awareness and information prioritisation to reduce driver workload. In: Proceedings of NAG-DAGA 2009, International Conference on Acoustics.

Biographical Sketch

Álvaro Sigüenza Izquierdo has an MEng in telecommunications from Universidad Politécnica de Madrid. He is currently a PhD student and holds a research position at this university under the supervision of Luis Hernández.
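The workload-adaptive behaviour described in section 2 of the excerpt above can be pictured as a small policy function. This is a toy sketch under invented assumptions (the workload scale, the thresholds and the action labels are all made up for illustration); it is not the design of the GAPS bench simulator.

```python
# Toy workload manager for in-car dialogue: depending on an estimated
# driver workload, the controller continues the spoken dialogue, moves
# output to another modality, or suspends the dialogue for later resumption.

def dialogue_policy(workload: float, safety_critical_event: bool) -> str:
    """workload: 0.0 (idle) to 1.0 (primary driving task fully loaded)."""
    if safety_critical_event or workload > 0.8:
        return "suspend_dialogue"        # resume later at the interrupted point
    if workload > 0.5:
        return "switch_to_visual"        # off-load the auditory channel
    return "continue_spoken_dialogue"

# Dense traffic raises the workload estimate, so the dialogue is suspended:
print(dialogue_policy(workload=0.85, safety_critical_event=False))
```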


Timo Baumann
University of Potsdam, Department of Linguistics
Karl-Liebknecht-Str. 24-25
D-14476 Golm, Germany

[email protected]/~timo

1 Research Interests

My research is geared towards interaction management in spoken dialogue systems. Specifically, I am interested in the timing of dialogue and dialogue-related phenomena, and in taming dialogue systems to respond as quickly and in similar ways as humans. For a dialogue system to react to a user while she is still speaking, it is necessary for the system to run incrementally, that is, to process the user's utterance while it is ongoing, and to come up with partial conclusions about what the user is saying, including how certain the system is about these conclusions or even predictions. I believe that prosody analysis plays a vital role in improving incremental spoken dialogue systems' performance, as it conveys information that otherwise may only become clear later in the utterance.

1.1 Incremental Processing

In a modular system, an incremental module is one that generates (partial) output while input is still ongoing and makes it available to other modules. I have thoroughly investigated incremental ASR, which outputs word hypotheses while recognition is still ongoing (Baumann et al., 2009a). We developed measures to assess the incremental behaviour of ASR. Some deal with how often hypotheses change (every change means that consuming modules have to re-process their input), and others describe timing properties of when words are first considered and first finally recognised by the ASR. I showed influences between optimising timing and change measures, and derived a measure of certainty from the different timing measures. Having assessed the incremental properties of ASR, I analysed methods to improve incremental behaviour through simple (and generic) filtering approaches (Baumann et al., 2009a).

Together with my colleagues, I applied this work on the evaluation of incremental components to semantic interpretation (Atterer et al., 2009), to the evaluation of incremental reference resolution (Schlangen et al., 2009), and to n-best processing in both ASR and semantic interpretation (Baumann et al., 2009b). As part of our venture into incremental analysis, we built a toolkit to process and visualise incremental data (Malsburg et al., 2009).
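As a concrete illustration of the "how often hypotheses change" measures mentioned above, the following sketch scores the stability of incremental ASR output by counting word edits between successive partial hypotheses. The edit-counting scheme, the normalisation and all names are assumptions made for this example, not the exact definitions of Baumann et al. (2009a).

```python
# Stability of incremental ASR output: an ideal incremental recogniser only
# ever appends words; every revision forces consuming modules to re-process.

def diff_edits(prev, curr):
    """Word edits (revocations + additions) between successive hypotheses."""
    common = 0                       # length of the shared word prefix
    for a, b in zip(prev, curr):
        if a != b:
            break
        common += 1
    return (len(prev) - common) + (len(curr) - common)

def edit_overhead(hypotheses):
    """Fraction of edits that were unnecessary, given the final hypothesis."""
    steps = [[]] + hypotheses
    total = sum(diff_edits(p, c) for p, c in zip(steps, steps[1:]))
    necessary = len(hypotheses[-1])  # ideal: each final word added exactly once
    return (total - necessary) / total if total else 0.0

partial = [["one"], ["one", "too"], ["one", "two"], ["one", "two", "three"]]
print(edit_overhead(partial))        # 0.4: the "too"/"two" revision costs edits
```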

1.2 Prosody Analysis

Prosodic analysis of spoken input into a dialogue system is vital if understanding and behaviour should depend on more than just words. We have investigated end-of-turn (EoT) prediction (Baumann, 2008) and end-of-utterance (EoU) detection (Atterer et al., 2008) using rather crude acoustic-prosodic features. While the need for EoT detection is obvious, EoU detection can be important for syntactic processing and semantic interpretation in more complex user turns.

Recently I have come back to the topic of prosody analysis, implementing a more phonologically sound incremental model of prosodic analysis (Baumann, 2009) that integrates with ASR output and generates word- and syllable-relative information which should be helpful in the NLU component of an SDS.
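To give an idea of what such "rather crude acoustic-prosodic features" could look like in practice, here is a deliberately simple end-of-turn predictor based only on pause duration and pitch slope. Both the features and the thresholds are invented for illustration; they are not the feature set of Baumann (2008).

```python
# Crude end-of-turn (EoT) prediction from two prosodic features: a long
# silence alone, or a shorter silence preceded by falling intonation,
# is taken as evidence that the user's turn has ended.

def predict_end_of_turn(pause_ms: float, pitch_slope_hz_per_s: float) -> bool:
    long_pause = pause_ms > 500                        # generous silence timeout
    falling = pitch_slope_hz_per_s < -20.0             # final pitch fall
    return long_pause or (pause_ms > 200 and falling)

# A short pause with strongly falling pitch already counts as turn-final,
# so the system can respond before a fixed silence timeout would expire:
print(predict_end_of_turn(pause_ms=250, pitch_slope_hz_per_s=-35.0))  # True
```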

1.3 Incremental SDS

To show the end-to-end benefits of incrementality in spoken dialogue systems, it's best to show example systems. In (Baumann, 2008) I presented a system for turn-taking simulation, which stress-tests end-of-turn prediction. Two artificial dialogue participants converse with each other; the resulting turn-taking behaviour is similar to human-human dialogue.

Currently, I am working on a multi-modal command-and-control game that exploits the timing advantages of incremental ASR with associated hypothesis certainties over regular incremental ASR (Baumann, 2009).

1.4 Future Work

Currently, in a spoken dialogue system, good timing means "as quickly as possible". But with quicker reasoning and sufficiently accurate predictive processing coming up, good timing may become more specific. I would like to further analyse gaps and pauses in dialogue (at turn-boundaries and within turns) and find out whether onset-timings are related (and in which way) to rhythm patterns across speakers, so that these can be taken into account for system utterances.


2 Future of Spoken Dialogue Research

Currently deployed SDSs are mostly tailored towards information access and simple tasks. To some extent they can be seen as VUIs rather than as full dialogue systems.

I believe that deployed dialogue systems will appear as conversational assistants in hospitals and for elderly people, in tutoring (not only for foreign-language learning, but in all areas), as characters in computer games, and as more natural user interfaces for general-purpose personal digital assistants in smartphones.

Computer games make for an especially interesting sandbox for advanced SDSs, as domains are controlled and consequences of errors are small, while demand for "new stuff" and enthusiastic users are plentiful.

While human-like behaviour is not needed, or could even distract, in simple task-oriented systems, human-like, intuitive behaviour becomes more important for future applications, as they will be seen less as tools and more as real interlocutors. For better intuitiveness, interaction behaviour (turn-taking and -yielding, understanding and hinting below the content level) must be improved.

A further ingredient of better intuitiveness is adaptivity. On the linguistic side, there is entrainment, which should be followed (or deliberately broken) on all levels but cannot simply be ignored. On the non-linguistic side, the integration of different knowledge sources (the user's or her peer-group's calendar, previous e-mails, overheard conversations, . . . ) will lead to better understanding.

In terms of SDS design, we need better modelling of control flow in the system (no central control, but patterns for distributed control), and theories of what is possible under which architectural constraints.

3 Suggestions for Discussion

• How can we learn interaction behaviour from corpora? Most often, a range of legal interaction decisions exists; yet in a corpus of recorded dialogue, all but one decision have been blocked by the one that was taken. How do we learn from such noisy data? How do we know which decision (among an infinite set) is best, or at least good enough?

• Availability of data (and systems) for independent result verification! Reproducibility of results and independent verification is standard in scientific work. Not so in computational linguistics, where people are happy to develop different methods on different data, and all claim to have found the one correct solution.

• People are afraid of too-human machines. When asked, people want to know whether they are interacting with machines. At the same time, they don't care that traffic lights have been computer-controlled for decades.

References

Michaela Atterer, Timo Baumann, and David Schlangen. 2008. Towards Incremental End-of-Utterance Detection in Dialogue Systems. In Proceedings of COLING 2008, Manchester, UK.

Michaela Atterer, Timo Baumann, and David Schlangen. 2009. No Sooner Said Than Done? Testing the Incrementality of Semantic Interpretations of Spontaneous Speech. In Proceedings of Interspeech 2009, Brighton, UK.

Timo Baumann, Michaela Atterer, and David Schlangen. 2009a. Assessing and Improving the Performance of Speech Recognition for Incremental Systems. In Proceedings of NAACL-HLT 2009, Boulder, USA.

Timo Baumann, Okko Buß, Michaela Atterer, and David Schlangen. 2009b. Evaluating the Potential Utility of ASR N-Best Lists for Incremental Spoken Dialogue Systems. In Proceedings of Interspeech 2009, Brighton, UK.

Timo Baumann. 2008. Simulating Spoken Dialogue With a Focus on Realistic Turn-Taking. In Proceedings of the 13th ESSLLI Student Session, Hamburg, Germany.

Timo Baumann. 2009. Integrating Prosodic Modelling with Incremental Speech Recognition. In Proceedings of DiaHolmia (SemDial 2009), Stockholm, Sweden.

Titus von der Malsburg, Timo Baumann, and David Schlangen. 2009. TELIDA: A Package for Manipulation and Visualisation of Timed Linguistic Data. Submitted.

David Schlangen, Timo Baumann, and Michaela Atterer. 2009. Incremental Reference Resolution: The Task, Metrics for Evaluation, and a Bayesian Filtering Model that is Sensitive to Disfluencies. Submitted.

Biographical Sketch

Timo Baumann is a research assistant and PhD candidate at the University of Potsdam, Germany, working under the supervision of David Schlangen. He previously studied computer science, phonetics and linguistics at Hamburg University and received his master's degree in 2007 for work on prosody analysis carried out at IBM Research Germany.

In his free time, Timo likes to go hiking or cycling and sings in a choir. He also enjoys staying abroad and living on student grants (USA 1997, Switzerland 2003, Spain 2005). Continuing in this spirit, Timo is currently a guest researcher at KTH Stockholm's speech group, working with Jens Edlund on incremental prosody analysis.


Luciana Benotti
LORIA/INRIA Nancy Grand Est
615, rue du Jardin Botanique
54600 Villers-lès-Nancy, France
[email protected]/˜benottil/

1 Research Interests

More is involved in what one communicates than what one literally says; more is involved in what one means than the standard, conventional meaning of the words one uses. For instance, if I ask you to go to the cinema and you reply, "I have a presentation tomorrow I'm not prepared for," you have conveyed to me that you will not be coming to the cinema, although you haven't literally said so. You intend for me to figure out that by indicating a reason for not coming (the need to prepare your presentation) you intend to convey that you are not coming for that reason. The study of such conversational implicatures, and how they can be modelled computationally in a dialogue system, is the subject of my research, which crucially involves characterising the context-representation and means-ends reasoning tasks involved. In other words, my research interests lie in the area of Computational Pragmatics.

1.1 Conversational implicatures as tacit acts

Modelling how listeners draw inferences from what they hear is a basic problem for theories of understanding natural language. An important part of the information conveyed is inferred in context, given the nature of conversation as a goal-oriented enterprise, as illustrated by the following classical example by Grice:

(1) A: I am out of petrol.
    B: There is a garage around the corner.
    ; B thinks that the garage is open and selling petrol. (Grice, 1975, p. 311)

B's answer conversationally implicates (;) information that is relevant to A. In Grice's terms, B made a relevance implicature: he would be flouting the conversational maxim of relevance unless he believes that it is possible that the garage is open and has petrol to sell.

I view such relevance implicatures as tacit dialogue acts that fill the gap between the context and the utterance (Benotti, 2008a). The tacit dialogue act is illustrated below in italics.

(2) A: I am out of petrol.
    B: I know a garage that is open and selling petrol.
    B: There is (such) a garage around the corner.

Human-human dialogue makes it evident that such tacit dialogue acts are inferred and processed by the dialogue participants, as we argue in (Benotti, 2009a), presenting evidence from a dialogue corpus called SCARE (Stoia et al., 2008). A can well reject, or ask clarifications on, the tacit act content, as if the tacit act had been explicitly said. For instance, A can continue the exchange with follow-ups such as (a) or (b) below:

(3) a. A: No, I checked, it's closed.
    b. A: Why do you think it's open at 4am?

I've integrated the inference and processing of (some kinds of) such conversational implicatures in two dialogue systems. I briefly describe each of them in the following section.

1.2 Conversational implicatures in dialogue systems

The first system to which I added the inference of tacit acts is called Frolog (Benotti, 2009b). Frolog implements a text-adventure game, and the dialogue acts that it can handle are instructions that can be directly mapped onto physical actions in the simulated game world. The inference of tacit acts requires means-ends reasoning over the possible dialogue acts and the representation of the dialogue context. The necessary reasoning tasks have been implemented by integrating two planners into Frolog. The first planner that I used was Blackbox (Kautz and Selman, 1999), which is fast and deterministic; how a classical planner can be used to infer the conversational implicatures of an utterance is discussed in (Benotti, 2007). The other planner is called PKS (Petrick and Bacchus, 2004); PKS can reason over non-deterministic acts. For detailed discussion and examples including non-deterministic acts, see (Benotti, 2008b).
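The bridging idea (using a planner to find the tacit acts that connect the context to the preconditions of the explicit act) can be shown with a toy forward search. The actions, preconditions and effects below are invented for the Grice garage example; they are not Frolog's actual representations, and a real system would use a planner such as Blackbox or PKS rather than brute-force search.

```python
# Toy "tacit act" inference: search for implicit acts whose effects make the
# preconditions of the explicitly performed act true in the given context.

from itertools import permutations

ACTIONS = {
    # name: (preconditions, effects), all invented for this example
    "assume_garage_open":  (set(), {"garage_open"}),
    "assume_garage_sells": (set(), {"garage_sells_petrol"}),
    "offer_direction":     ({"garage_open", "garage_sells_petrol"},
                            {"petrol_source_known"}),
}

def bridge(context, explicit_act):
    """Return a sequence of tacit acts enabling explicit_act, or None."""
    preconds, _ = ACTIONS[explicit_act]
    candidates = [a for a in ACTIONS if a != explicit_act]
    for n in range(len(candidates) + 1):             # shortest bridge first
        for plan in permutations(candidates, n):
            state, ok = set(context), True
            for act in plan:
                p, e = ACTIONS[act]
                if not p <= state:
                    ok = False
                    break
                state |= e
            if ok and preconds <= state:
                return list(plan)
    return None

# B's "There is a garage around the corner" only works as an offer of a
# petrol source if the garage is open and sells petrol; the bridging acts
# returned here correspond to the conversational implicatures.
print(bridge({"out_of_petrol"}, "offer_direction"))
```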

Finally, I added support for some kinds of conversational implicatures to the ICT Virtual Human1 dialogue manager (Traum et al., 2008). This dialogue manager allows virtual humans to participate in multi-party conversations in a variety of domains, with dialogue acts at different levels. The dialogue manager is embedded within the Soar cognitive architecture (Laird et al., 1987), and decisions about interpreting and producing speech compete with other cognitive and physical operations.

1 http://ict.usc.edu/projects/virtual_humans


Soar provides support for some kinds of means-ends reasoning useful for inferring tacit acts. However, the architecture of the dialogue manager is complex, and the work presented in (Benotti and Traum, 2009) constitutes just a first step that still needs to be properly evaluated.

2 Future of Spoken Dialogue Research

I think that a great challenge for this generation of young researchers is to start building dialogue systems that remember particular users and use that information to adapt in the following interactions with the same user (or group of users). This challenge could not only be useful application-wise but could also have an impact on theories of dialogue. I adhere to the approach that views conversational implicatures as imposing pressure on linguistic structures, pressure that many times leaves its imprint in the literal meaning of words. Studying semantic change in situated conversation would make the concepts of semantic coordination and implicature more concrete, and also clarify the relations between them.

This idea of learning from previous interactions is not new; it's the approach used by researchers such as those who apply reinforcement learning to learning dialogue policies. However, such approaches learn many different kinds of information at the same time, including task-dependent and task-independent information. Moreover, since the search space they are dealing with grows with the number of features they include in each state, their information states are necessarily coarse-grained. What seems to be needed is to restrict the state space using some notion of coherence of sequences of acts in the state space. In other words, I believe that symbolic and statistical methods will need to be combined. If such a combination is possible, then systems that use expectation-based interpretation on a large scale seem not so far away: that is, systems that try to infer the next possible moves that the user can make and interpret the next utterance taking these possible moves into account. That we are approaching the point when we can begin to think about ways of combining such methods makes the next few decades in dialogue systems research seem very exciting indeed.
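A minimal sketch of the expectation-based interpretation described above: candidate interpretations of the user's utterance are rescored by how well they match the moves the system currently expects. The moves, scores and interpolation weight are invented for illustration.

```python
# Rescore interpretation candidates by combining the recogniser's score
# with a prior over the dialogue moves the system expects next.

def rescore(candidates, expectations, weight=0.5):
    """candidates: list of (move, asr_score); expectations: move -> prior."""
    scored = [(move, (1 - weight) * score + weight * expectations.get(move, 0.0))
              for move, score in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

candidates = [("reject_offer", 0.55), ("request_clarification", 0.52)]
expectations = {"reject_offer": 0.1, "request_clarification": 0.7}
# The slightly lower-scoring ASR candidate wins because it was expected:
print(rescore(candidates, expectations)[0][0])  # request_clarification
```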

3 Suggestions for Discussion

Three possible topics for discussion:

• Challenges in combining reinforcement learning and symbolic methods for building adaptable dialogue systems.

• Expectation-based interpretation and the performance of NLU in dialogue systems.

• Using expectations to ask the user task-related questions instead of signalling misunderstanding.

References

L. Benotti and D. Traum. 2009. A computational account of comparative implicatures for a spoken dialogue agent. In Eighth Int. Conf. on Computational Semantics (IWCS-8).

L. Benotti. 2007. Incomplete knowledge and tacit action: Enlightened update in a dialogue game. In The 2007 Workshop on the Semantics and Pragmatics of Dialogue (DECALOG).

L. Benotti. 2008a. Accommodation through tacit dialogue acts. In Conf. on Semantics and Modelisation (JSM08).

L. Benotti. 2008b. Accommodation through tacit sensing. In The 2008 Workshop on the Semantics and Pragmatics of Dialogue (LONDIAL).

L. Benotti. 2009a. Clarification potential of instructions. In The 10th Annual SIGDIAL Meeting on Discourse and Dialogue, London, United Kingdom.

L. Benotti. 2009b. Frolog: An accommodating text-adventure game. In The 12th Conf. of the European Chapter of the Association for Computational Linguistics (EACL-09).

P. Grice. 1975. Logic and conversation. Syntax and Semantics, Vol. 3: Speech Acts, pages 41–58. Academic Press.

H. Kautz and B. Selman. 1999. Unifying SAT-based and graph-based planning. In Proc. of the IJCAI, pages 318–325.

J. Laird, A. Newell, and P. Rosenbloom. 1987. SOAR: an architecture for general intelligence. Artificial Intelligence, 33(1):1–64, September.

R. Petrick and F. Bacchus. 2004. Extending the knowledge-based approach to planning with incomplete information and sensing. In Proc. of the Int. Conf. on Principles of Knowledge Representation and Reasoning, pages 613–622.

L. Stoia, D. Shockley, D. Byron, and E. Fosler-Lussier. 2008. SCARE: A situated corpus with annotated referring expressions. In Proc. of LREC.

D. Traum, W. Swartout, J. Gratch, and S. Marsella. 2008. A virtual human dialogue model for non-team interaction. Recent Trends in Discourse and Dialogue. Springer.

Biographical Sketch

Luciana Benotti is currently a third-year PhD student at INRIA Nancy Grand Est, in France. She is a member of the TALARIS team2, where she works under the supervision of Patrick Blackburn. She spent last summer at the Institute for Creative Technologies (ICT) of the University of Southern California (USC), working with David Traum. In 2006, she obtained a European Masters in Computational Logic (from the Free University of Bolzano in Italy and the Polytechnic University of Madrid in Spain). She also has a Masters in Computer Science from the National University of Comahue in Argentina (where she was born). She likes travelling, meeting people from different cultures and cooking. She enjoys hot weather, good food and life in the open air.

2 http://talaris.loria.fr


José Luis Blanco Murillo
Universidad Politécnica de Madrid
ETSI de Telecomunicación
Avda. Complutense 30
28040 Madrid, Spain

[email protected]

1 Research Interests

My main research interests lie in spoken dialogue systems (SDSs): incorporating additional information from heterogeneous sources, and focusing on improving the representation and management of this additional knowledge.

1.1 Past and Ongoing research

During the last decade, spoken dialogue systems have progressed enormously. From the first automatic speech recognition systems to the present conversational agents and interaction systems, our understanding of the nature of interaction has grown widely, allowing us to develop new standards and systems. Although there are still many questions to be answered, it is clear that the quality of our systems has improved noticeably in terms of robustness and flexibility. However, new scenarios and applications have arisen for which there is no common understanding of how vocal interfaces should be developed, nor of how information should be handled.

At GAPS (the Signal Processing Applications Group at Universidad Politécnica de Madrid) we have been dealing with this problem repeatedly over the last years, and we have overcome it in various projects. As a matter of fact, each of those projects was meant to focus on a different aspect of the interaction. However, they all had something in common: vocal interaction was included to enhance human-machine interaction.

1.2 The in-vehicle interaction scenario

The main scenario we have been working on is spoken interaction within the automobile. Our group has been collaborating in various publicly funded projects, both national and European, concerning this particular environment. Our main goal is, and has been, to analyse the opportunities that arise when we introduce vocal interfaces into the car, as well as the additional constraints this particular scenario imposes on the dialogue. The balance between those new possibilities and their limitations makes it an extremely open and challenging environment.

Several studies have addressed this particular problem of in-car interaction. However, there is a deep gap between academic improvements and actual commercial applications. The most notable references we are aware of are the deliverables of the European project TALK. In this consortium, both industry and universities considered several aspects of in-vehicle interaction from a common perspective, which is certainly the best way to bridge the gap (Becker et al., 2007). Their description of the problem human-machine interaction faces in the driving context, as well as their analysis of the alternatives the present state of the art can provide, has taught us that it is worth considering each situation specifically, rather than ignoring its peculiarities. Since there is no common awareness of this need, software developers are still reusing previous systems in contexts quite different from those for which they were designed, without reworking those spoken dialogue systems as required (Blanco et al., 2009).

Moreover, in this scenario most efforts have been directed towards improving the robustness of speech recognition systems, due to its particularly noisy conditions. Adapting the available base technologies is certainly key, and highly correlated with users' perception of system performance, but there is much more to do. Adapting the interaction strategies and policies to the particular circumstances of the interaction is required if we want to achieve a successful improvement. Finally, the combination of both will help us to reduce the interference that spoken interaction might cause with driving.

1.3 Reworking spoken dialogue systems

Reworking spoken dialogue systems has to be faced not just from the particular point of view of the in-vehicle scenario, but from a broader perspective.

Sources of information appear almost anywhere and are extremely heterogeneous: from social relations (e.g., contact lists or social agendas) to localisation, travelling speed, or the user's actual physical and emotional state. Thinking about those examples, it seems reasonable to believe there might be some kind of correlation between the interaction and its circumstances. For instance, if we know that a certain vocal interface is being accessed


from a mobile phone, or from a land line, it might be useful to change the noise models in the speech recognition system. Or if the user is known to suffer from some kind of disease affecting the vocal cords, it might be useful to use other models, or to adapt those being used. But the dialogue might also have to be adapted: if a user calls from a hands-free phone while driving, his or her interaction abilities will certainly be diminished, so shorter prompts from the speech synthesiser are recommended (Nishimoto et al., 2005).

There are just a few studies focusing on these particular aspects in a deep and sensible way. At present the only alternative is to continue designing specific solutions, but this is obviously not the best way to face the problem. A homogeneous framework for the available information would certainly help us adapt our systems in an efficient and reliable way.

2 Future of Spoken Dialogue Research

I believe we will soon have at our disposal efficient technologies for the representation of heterogeneous information, which will bring us the opportunity to model the variables affecting the course of the interaction. This will help us adapt our designs to make them much more flexible under adverse circumstances. Adapting to the particular circumstances of the interaction will certainly take longer. Though present technologies are capable of doing this after extensive redefinition, new technologies and tools are to come which will help us design those new systems.

On the other hand, commercial solutions are improving their performance ratios and approaching the developments from the academic realm. Standardising those results and their description of the interaction beyond the present VoiceXML v.2 standard, as well as the representation of the context variables, will bring the opportunity to finally bridge that gap. This will take time, but it will get our current designs out of their knowledge bubble and into a sea of shared knowledge.

The development of new methodologies to aid designers in their task will simplify the integration of systems at any level of the interaction. Open architectures such as Olympus/RavenClaw (Bohus et al., 2007) have provided an excellent precedent for this and should be preserved. This same idea, seen from a quite different perspective, will bring us new systems able to examine the topics of the communication to decide how the interaction should be carried out. Systems which multiplex various individual spoken dialogue systems, each designed for a certain task (Kerminen and Jokinen, 2003), will shed some light on the way in which systems should be integrated.

3 Suggestions for Discussion

My suggestions for discussion are the following:

• How could we combine the representation of the circumstances together with the interaction in a way that can be easily used to adapt our systems?

• How should we adapt interaction strategies to take advantage of this new knowledge?

• Standardisation of experiments for analysing correlation between dialogue progress and context.

• SDSs implemented as a multiplex of individual conversational agents.

References

T. Becker, N. Blaylock, C. Gerstenberger, A. Korthauer, N. Perera, M. Pitz, P. Poller, J. Schehl, F. Steffens, R. Stegmann and J. Steigner. 2007. D5.3: In-Car Showcase Based on TALK Libraries. Talk and Look: Tools for Ambient Linguistic Knowledge. IST-507802 Deliverable 5.3.

J. L. Blanco, A. Sigüenza, D. Díaz, M. Sendra and L. A. Hernández. 2009. Reworking spoken dialogue systems with context awareness and information prioritisation to reduce driver workload. In Proceedings of NAG/DAGA 2009, International Conference on Acoustics.

T. Nishimoto, M. Shioya, J. Takahashi and H. Daigo. 2005. A study of dialogue management principles corresponding to the driver's workload. In Biennial Workshop on Digital Signal Processing for In-Vehicle and Mobile Systems, Sesimbra, Portugal.

D. Bohus, A. Raux, T. K. Harris, M. Eskenazi and A. I. Rudnicky. 2007. Olympus: an open-source framework for conversational spoken language interface research. In Bridging the Gap: Academic and Industrial Research in Dialog Technology workshop at HLT/NAACL 2007.

A. Kerminen and K. Jokinen. 2003. Distributed Dialogue Management in a Blackboard Architecture. In Proceedings of the EACL Workshop Dialogue Systems: interaction, adaptation and styles of management, Budapest, Hungary, pp. 55–66.

Biographical Sketch

José Luis Blanco Murillo has an MEng in telecommunications engineering from Universidad Politécnica de Madrid. He is currently a PhD student and holds a research position at this same university, under the supervision of Dr. Luis Hernández.


Dan Bohus
Microsoft Research
One Microsoft Way
Redmond, WA 98052
USA

[email protected]
research.microsoft.com/~dbohus

1 Research Interests

My research interests lie in the area of situated interaction and open-world spoken dialogue systems. The central question driving my long-term research agenda is: how can we develop systems that naturally embed interaction and computation deeply into the flow of everyday tasks, activities and collaborations? More specifically, some of the areas and problems that I am currently investigating are: multimodal sensor fusion, conversational scene analysis, situated and open-world dialogue, turn-taking and engagement models, and self-supervised (lifelong) learning and adaptation.

1.1 Previous work

My dissertation work (Bohus, 2007) has focused on issues of dialogue management and error handling in spoken dialogue systems. As part of this work, I have developed RavenClaw (Bohus and Rudnicky, 2003; 2009), an open-source, reusable dialogue management framework that has since been used to develop several spoken dialogue systems spanning different domains and interaction types. With respect to error handling, my dissertation addressed the following questions: (1) how can a system reliably detect potential errors? (Bohus and Rudnicky, 2005a; 2006; Bohus, 2007); (2) what strategies can be used to recover from different types of errors? (Bohus and Rudnicky, 2005b); and (3) how should a system choose between multiple such strategies at runtime? (Bohus et al., 2006).

1.2 Current research

Most research to date on spoken dialogue systems has focused on the study and support of interactions between a single human and a computing system within a constrained, predefined communication context. Efforts in this space have led to the development and wide-scale successful deployment of telephony-based, and more recently multimodal mobile applications. At the same time, numerous and important challenges in the realm of situated spoken language interaction remain to be addressed.

In particular, my current research is focused on the challenges of developing spoken language interfaces that operate continuously in open, dynamic, relatively unconstrained environments. The question that drives this long-term research agenda is: how can we develop systems that embed interaction and computation deeply into the natural flow of everyday tasks, activities and collaborations? Examples include interactive billboards in a mall, robots in a home environment, interactive home control systems, interactive systems providing assistance during procedural tasks, intelligent tutoring systems, etc.

Interaction in open environments is characterised by two aspects that capture key departures from assumptions traditionally made in spoken dialogue systems (Bohus and Horvitz, 2009a). The first one is the multiparty and dynamic nature of the interaction, i.e., the world typically contains not just one, but multiple agents that are relevant to the interactive system; agents may come and go, and their goals, needs, and plans change in time; relevant events might happen asynchronously with respect to the natural flow of dialogue. A second important aspect is the physically situated nature of these systems, i.e. the fact that the physical surroundings provide a continuously streaming, rich context relevant for organising and conducting the interactions. I believe these aspects of open-world interaction raise interesting research challenges and bring new dimensions to existing dialogue problems, such as engagement, turn-taking, language understanding, dialogue management and output planning.

As an example, consider the problem of establishing and maintaining engagement with an interactive system. In single-user, closed-world dialogue systems this problem often finds a relatively trivial solution. For instance, in telephony-based applications, engagement can be safely assumed once a call has been picked up by the system. Similarly, a push-to-talk button provides a clear engagement signal in multimodal mobile applications. While these solutions are appropriate, perhaps even natural in those contexts, they are insufficient for systems that must operate continuously in open, dynamic environments, where multiple participants may enter and leave conversations, and interact with the system and with each other. Such systems should ideally be able to fluidly engage, disengage, or re-engage with one or multiple participants, whether the participants are close by or at a distance, whether they have a standing plan to interact with a system, or opportunistically decide to do so, in-stream with other ongoing activities. To successfully manage engagement in this context, a system must continuously monitor its surroundings and fuse incoming sensory streams to infer engagement states, actions, and intentions of various participants, must be able to make engagement control decisions while obeying rules of etiquette and social interaction, and must ultimately render these decisions in an appropriate set of behaviors.
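As a toy illustration of the sensing-decision-behavior loop just described, the sketch below scores each participant's engagement intention from fused sensor estimates and derives engage/disengage actions. All fields, weights, thresholds and action names are invented for the example; the actual models referenced above are learned, not hand-coded.

```python
# Hypothetical sketch of a per-participant engagement control loop.
# All fields, thresholds and actions are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Participant:
    pid: str
    distance_m: float      # fused estimate from vision/depth sensors
    facing_system: bool    # head-pose estimate
    speaking: bool         # audio localisation + voice activity detection
    engaged: bool = False

def engagement_intention(p: Participant) -> float:
    """Crude fused score standing in for a learned intention model."""
    score = 0.0
    score += 0.5 if p.facing_system else 0.0
    score += 0.3 if p.distance_m < 2.0 else 0.0
    score += 0.2 if p.speaking else 0.0
    return score

def control_step(participants):
    """Decide engage/disengage actions for one sensing frame."""
    actions = []
    for p in participants:
        intent = engagement_intention(p)
        if not p.engaged and intent >= 0.7:
            p.engaged = True
            actions.append((p.pid, "greet"))        # initiate engagement
        elif p.engaged and intent <= 0.2:
            p.engaged = False
            actions.append((p.pid, "take_leave"))   # close engagement politely
    return actions

scene = [Participant("p1", 1.2, True, True), Participant("p2", 4.0, False, False)]
print(control_step(scene))  # [('p1', 'greet')]
```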

This is just one example problem that lives at the lowest (Channel) level of grounding in communication. Similar new challenges arise, however, in turn-taking, spoken language and discourse understanding, interaction planning, etc. In (Bohus and Horvitz, 2009a) we have identified a number of such challenges and outlined a set of core competencies required for open-world spoken language interaction. In the Situated Interaction project (2009) at Microsoft Research, we have developed several real-world situated conversational agents, and are using them as an experimental platform for pursuing these challenges. Initial work in this space, including a computational model for managing situated multiparty engagement (Bohus and Horvitz, 2009b) and an implicit learning approach for detecting engagement decisions (Bohus and Horvitz, 2009c), has been reported at SIGDial'2009. Current efforts are focused on tracking conversational dynamics in multiparty interactions and developing multi-participant turn-taking models.

2 Future of Spoken Dialogue Research

I believe that in the next 5–10 years more attention will shift towards issues in multimodal, embodied and situated interactive systems, of the type described in the previous section. The challenges that lie ahead of us are exciting and many: situation awareness, multimodal sensor fusion, scene analysis, behavior and intention recognition, situated dialogue management, situated grounding, engagement models, mixed-initiative and multi-participant interaction, and life-long, open-domain and open-world learning and adaptation.

3 Suggestions for Discussion

• dialogue systems and robots. Discuss the applicability and limitations of current dialogue technologies in the context of human-robot interaction (e.g., multiple sensors, asynchronous events, multiple participants, etc.)

• challenge problem(s) and evaluation. Propose and discuss one or more challenge problem(s) for the field, and a corresponding evaluation process.

References

D. Bohus and E. Horvitz. 2009c. Learning to Predict Engagement with a Spoken Dialog System in Open-World Settings. In Proc. of SIGDial'09, London, UK.

D. Bohus and E. Horvitz. 2009b. Models for Multiparty Engagement in Open-World Dialog. In Proc. of SIGDial'09, London, UK.

D. Bohus and E. Horvitz. 2009a. Open-World Dialog: Challenges, Directions and Prototype. In Proc. of KRPD'09, Pasadena, CA.

D. Bohus and A. Rudnicky. 2009. The RavenClaw dialog management framework: Architecture and systems. In Computer Speech & Language 23, Issue 3, pp. 332–361.

D. Bohus. 2007. Error Awareness and Recovery in Conversational Spoken Language Interfaces. Ph.D. Thesis, CS-07-124, Carnegie Mellon University, Pittsburgh, PA.

D. Bohus, B. Langner, A. Raux, A. Black, M. Eskenazi and A. Rudnicky. 2006. Online Supervised Learning of Non-understanding Recovery Policies. In Proc. of SLT'2006, Palm Beach, Aruba.

D. Bohus and A. Rudnicky. 2006. A K-hypotheses + Other Belief Updating Model. In Proc. of AAAI Workshop on Statistical Methods in Spoken Dialogue Systems.

D. Bohus and A. Rudnicky. 2005b. Sorry, I didn't Catch That: An Investigation of Non-understanding Errors and Recovery Strategies. In Proc. of SIGDial'2005, Lisbon, Portugal.

D. Bohus and A. Rudnicky. 2005a. A Principled Approach for Rejection Threshold Optimization in Spoken Dialog Systems. In Proc. of Interspeech'2005, Lisbon, Portugal.

D. Bohus and A. Rudnicky. 2003. RavenClaw: Dialog Management Using Hierarchical Task Decomposition and an Expectation Agenda. In Proc. of Eurospeech'2003, Geneva, Switzerland. http://www.ravenclaw-olympus.org

Situated Interaction Project web page, 2009. research.microsoft.com/~dbohus/research_situated_interaction.html

Biographical Sketch

Dan is currently a researcher in the Adaptive Systems and Interaction group at Microsoft Research. His current research agenda is focused on situated interactive systems and open-world dialogue. Prior to joining Microsoft, Dan obtained his Ph.D. degree from Carnegie Mellon University, where he investigated problems of dialogue management and error handling in task-oriented spoken dialogue systems.


Okko Buss
Potsdam University
Karl-Liebknecht-Str. 24-25
14415 Potsdam
Germany

[email protected]/~okko

1 Research Interests

My research is concerned with incremental human-machine dialogue processing. An incremental SDS can process and react to user input while the user is still speaking. In particular I'm focussed on modelling interaction and dialogue management from an incremental point of view. I hold these to be two separate tasks (or groups of tasks) that require individual lines of inquiry.

I'm further interested in what types of control a speech dialogue processing system must embody to exhibit human or human-like behavior. I believe this to be a prerequisite for accurate cognitive modelling of spoken language behavior.

Lastly, I am interested in standardisation of speech dialogue processing.

1.1 Incremental Interaction and Dialogue Management

Our mission in the InPro project at the University of Potsdam is to build computational models of incremental dialogue processing. Thus far, a great deal of effort has gone into achieving incrementality at the levels of speech recognition (ASR) (Baumann et al., 2009) and semantic interpretation/natural language understanding (NLU) (Atterer et al., 2009), and related issues such as n-best lists (Baumann et al., 2009b). These modules can best be understood as point-of-contact technologies between the system and the user. Achieving incrementality at these levels is a prerequisite for exploring "later" issues such as interaction management (IM) and dialogue management (DM).

IM and DM are often treated as part of an overall dialogue planning module. This is particularly true in commercial dialogue systems, such as telephony applications built on standards like VoiceXML.1

While it makes sense for standardisation to keep IM and DM closely linked, there are conceptual and computational benefits to separating them. While the former is conceptually concerned with low-level events, such as producing and playing back system prompts, aggregating user input and keeping timing measures, the latter is responsible for high-level tasks such as verifying user intentions against a dialogue strategy, issuing clarification requests and updating its own world knowledge. A two-tiered approach that separates IM and DM along these distinctions is described in (Lemon et al., 2003).

1 Here, IM and DM are reducible to XML tags for low-level interaction and ECMAScript variables for high-level world knowledge.

Making IM and DM incremental shows the need for this separation even more clearly. Incrementality places far more stringent requirements on interaction aspects of communication, especially where timing is concerned. While a non-incremental SDS may only need two timeout variables to handle turn-taking (for in-speech and after-speech states), an incremental system can exhibit any number of in-between states, each of which requires its own kind of timeout behavior. Moreover, these timeouts may change dynamically in-utterance. We describe this aspect in (Baumann et al., 2009a) for our research prototype.
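A small sketch may help picture such state-dependent, dynamically changing timeouts; the state inventory, durations and update rule below are invented for illustration and are not the InPro prototype's actual values.

```python
# Hypothetical sketch: per-state, dynamically updated silence timeouts
# for an incremental SDS. States and durations are illustrative only.

BASE_TIMEOUTS = {
    "user_speaking": 2.0,        # tolerate long pauses mid-utterance
    "mid_utterance_pause": 0.8,  # shorter: the user may be done
    "after_speech": 0.4,         # system may take the turn quickly
}

def current_timeout(state: str, hypothesis_complete: bool) -> float:
    """Return the silence timeout for the current incremental state.

    If the incremental NLU already considers the partial hypothesis
    complete, shrink the timeout so the system can respond sooner.
    """
    timeout = BASE_TIMEOUTS[state]
    if hypothesis_complete and state != "user_speaking":
        timeout *= 0.5  # in-utterance dynamic adjustment
    return timeout

print(current_timeout("mid_utterance_pause", hypothesis_complete=True))  # 0.4
```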

Other aspects of IM and DM to investigate from an incremental perspective will include confidence scoring, output generation and alternative hypothesis processing.

1.2 Control in SDS

An incremental SDS by nature has a much looser definition of control than a classical pipelined architecture, where control is passed from one module to the next (e.g. the ASR listens, passes its result to the NLU, and so forth). Incrementality, however, requires that parallel actions and asynchronous events can occur at any given time, something that such an architecture prohibits. Yet completely removing any notion of control is computationally and cognitively unsatisfying. The question thus arises of how control can be embodied in a parallel, distributed architecture.

I believe that understanding control for cognitive modelling is crucial to understanding intelligent behavior, a view adopted with modification from (Sloman, 1996). Rather than focussing solely on behavior, a proper taxonomy of types of control may be necessary for implementing intelligent behavior in parallel, asynchronous agents, including some components in SDSs.

1.3 Standards for SDS

Having worked extensively with commercial voice applications, I'm keenly interested in extending the groundwork performed in standardising aspects of SDSs with the incremental, parallel processing approach described above. Some such aspects include prompt typology, semantic representation, confidence scoring and n-best list processing.

2 Future of Spoken Dialogue Research

Until recently, most deployed SDSs were largely task-specific applications geared at the automation of otherwise labor-intensive tasks performed by human operators. While these kinds of applications will remain prominent (if unpopular), mobile computing has opened opportunities to move speech and language technology to novel areas. Three things about mobile computing need to be explored for the development of SDSs in mobile environments.

Mobile speech applications that exist today make use of a mixed bag of tools for language processing: some rely on local, others on distributed resources. A first line of inquiry is thus harnessing these resources for SDSs in a meaningful way.

Second is the ubiquity of internet access. This makes speech-enabled mobile applications able to interface with a host of data sources. Making use of these will certainly need to be explored and requires moving towards a data-centric view of dialogue.

A third feature is multi-modality. Mixing traditional interfaces with voice in an intuitive fashion is a goal far from being reached, and it poses interesting research questions from computational, cognitive-load, and integration points of view.

In addition to mobile computing, paralinguistic input sources will have an effect on SDS research. Language and communication consist of more than spoken strings. SDSs stand to gain much from research on affect recognition, prosodic cueing and gesture recognition.

3 Suggestions for Discussion

• Learnability of behavior from data. Statistical approaches in ASR and NLU are taken for granted. But what means are available to train an SDS's interaction and planning modules from data? To what extent can an SDS function without being informed by symbols and rules?

• Cross-disciplinary evaluation metrics. Evaluation of SDSs involves comparing simulated outcomes with hand-annotated ones. Usually this concerns ASR and NLU output only. Is it possible to evaluate the performance of SDSs in terms of metrics from other scenarios, such as psycholinguistic ones (e.g. eye-tracking for reference resolution)? How would an SDS fare compared to a human agent faced with an identical task?

• Standardisation. How relevant are standards to SDS research? Who uses them and to what effect?

References

Michaela Atterer, Timo Baumann, and David Schlangen. 2009. No Sooner Said Than Done? Testing the Incrementality of Semantic Interpretations of Spontaneous Speech. To appear in Proceedings of Interspeech 2009, Brighton, UK.

Timo Baumann, Michaela Atterer, and David Schlangen. 2009. Assessing and Improving the Performance of Speech Recognition for Incremental Systems. In Proceedings of NAACL-HLT 2009, Boulder, USA.

Timo Baumann, Michaela Atterer, and Okko Buss. 2009a. Incremental ASR, NLU and Dialogue Management in the Potsdam INPRO P2 System. Invited talk at the Workshop on Incrementality in Verbal Interaction, Bielefeld, Germany.

Timo Baumann, Okko Buss, Michaela Atterer, and David Schlangen. 2009b. Evaluating the Potential Utility of ASR N-Best Lists for Incremental Spoken Dialogue Systems. In Proceedings of Interspeech 2009, Brighton, UK.

Oliver Lemon, Lawrence Cavedon, and Barbara Kelly. 2003. Managing Dialogue Interaction: A Multi-Layered Approach. In Proceedings of the 4th SIGdial Workshop on Discourse and Dialogue, Sapporo, Japan.

Aaron Sloman. 1996. Beyond Turing Equivalence. In Machines and Thought: The Legacy of Alan Turing, 1:179–219.

Biographical Sketch

Okko Buss is a research assistant and PhD candidate at the University of Potsdam, Germany, working under the supervision of David Schlangen in the InPro project. He holds a B.A. in Linguistics from McGill University, Montreal, Canada, as well as an M.A. in Cognitive Science from UNSW, Sydney, Australia. In the years prior to joining the team in Potsdam he held various positions at speech and language technology companies, such as SpeechWorks, ScanSoft and Mundwerk, building voice applications.

He also enjoys skiing, hiking and cooking.


Dmitry Butenkov
Deutsche Telekom Laboratories
Quality & Usability Lab
Berlin Institute of Technology
Ernst-Reuter-Platz 7
10587 Berlin, Germany

[email protected]

1 Research Interests

My current research interests include statistical user simulation and modelling, decision making under uncertainty, uncertainty models and fuzzy applications in spoken dialogue systems. I share the research viewpoint of the Berkeley Initiative in Soft Computing. Besides that, I consider the Robotics and Machine Learning fields tightly related to user simulation and I will try to make use of results from both. I mostly work on an abstract, user-intentions level of Human-Computer Interaction, trying to understand and formalise how real users think and behave.

1.1 Past Research

I started my research career in early 2002 in the field of intelligent overtaking and navigation systems (Butenkov and Finaev 2003). The project addressed optimal models of traffic streams in a local city area.

Shortly after that, I worked in cooperation with the Laboratory of Mathematical Problems in AI (based at TSURE) on modern fuzzy methods and approaches to the formalisation and interpretation of various natural problems (Butenkov et al. 2004, 2005, 2008). Most topics addressed image and scene analysis and decision-making problems.

1.2 Current Research

The focus of my current research is data-driven user simulation for spoken dialogue system evaluation. Plenty of attempts have been made in this direction so far (Schatzmann et al. 2005; Pietquin, 2008), but research results indicate that considerable work is still necessary in order to develop a flexible user simulation which would be valid for a range of systems and users (Chung 2004). However, evaluation is vital for systems which should be usable by real users. These facts explain the growing interest in automatic, model-driven spoken dialogue system evaluation approaches in recent years (Möller et al., 2006; Ai and Weng, 2008; Eskenazi et al., 2008; Jung et al., 2009).

Specific attention is paid to usability assessment in such an evaluation. My current work is done within the frame of the SpeechEval project (Butenkov and Möller, 2009). This is a joint project between T-Labs1 and the German Research Center for Artificial Intelligence (DFKI)2, sponsored by the European Fund for Regional Development3.

Working on the intentions level, we are developing an interaction paradigm that should allow us to simulate user behavior more naturally than classic approaches do. The data-driven scope of the project is based on a large Voice Awards corpus. This data was collected during several years of an open contest with real, deployed commercial German spoken dialogue systems (Steimel et al., 2007).

The entire solution is based on the Ontology-based Dialogue Platform by SemVox GmbH4 (Pfalzgraf et al., 2008). The Nuance Recognizer 9.0 by Nuance Communications5 is used as a high-quality ASR back-end.

1.3 Future Research

I plan to explore and extend the possibilities of modern user simulation for spoken dialogue systems. For a long time its role was underestimated; thus, today it is mostly used as an assessment tool for spoken dialogue management development.

However, I would like to demonstrate the significant usefulness of user simulation as an evaluation tool and address the fundamental generalisation problem, i.e. the transferability of a simulation from one system (version) to another, and from one group of users to another.

2 Future of Spoken Dialogue Research

In my opinion, information access and retrieval via spoken language will play a more and more important role in the near future. Databases are becoming aggressively huge, causing the rapid growth of the semantic web search field. I assume it is very attractive to design speech applications for this. Therefore, we can expect to have public spoken voice search services based on e.g. Google Search or MSN engines in the next 5–10 years.

1 www.t-labs.tu-berlin.de
2 www.dfki.de
3 europa.eu/scadplus/leg/de/lvb/l60015.htm
4 www.semvox.de
5 www.nuance.com

3 Suggestions for Discussion

• Data-driven user simulation. Can we reliably test and evaluate modern commercial spoken dialogue systems with the help of a user simulation? Does the generalisation problem in evaluating multiple spoken dialogue systems still hold (even within the same domain)?

• Complex natural problems formalisation. Is the reduction of communication and discourse phenomena to simple slot-filling an oversimplification (in terms of modern commercial SDSs)? How can we efficiently extract and formalise real user motivations?

• The trends in the field from a business point of view. Are state-of-the-art commercial spoken dialogue systems, in principle, much more complex than 10 years ago? Do semantic technologies help to significantly improve those systems? What other technologies are useful or have potential impact?

References

H. Ai and F. Weng. 2008. User Simulation as Testing for Spoken Dialog Systems. In Proceedings of the 9th SIGDial Workshop on Discourse and Dialogue, Columbus, Ohio, USA.

D. Butenkov and S. Möller. 2009. Towards a Flexible User Simulation for Evaluating Spoken Dialogue Systems. In Proceedings of INTERACT 2009, Uppsala, Sweden.

D. Butenkov. 2008. Fuzzy geometric models of multi-dimensional data in intelligent data processing systems. In Proceedings of the 2nd National Conference "Fuzzy Systems & Soft Computing", Uljanovsk, Russian Federation.

D. Butenkov and V. Finaev. 2003. The development of intelligent systems of overtaking and traffic streams modeling. In Proceedings of the winter scientific session "MEPHI-2003", Moscow, Russian Federation.

S. Butenkov, V. Krivsha, and D. Butenkov. 2005. The measures for multilevel information granulated problems formalization in the systems computing with words. In Proceedings of the International Conference "Intellectual Information Analysis – 2005", Kiev, Ukraine.

S. Butenkov, V. Krivsha, D. Butenkov, and A. Kurbesov. 2004. The adaptation of fuzzy relations for case-based reasoning in medical diagnostics and problem solving. In Proceedings of the 9th International Conference AIS-2004, Moscow, Russian Federation.

G. Chung. 2004. Developing a Flexible Spoken Dialog System Using Simulation. In Proceedings of ACL-04, Barcelona, Spain.

M. Eskenazi, A. Black, A. Raux, and B. Langner. 2008. Let's Go Lab: a platform for evaluation of spoken dialog systems with real world users. In Proceedings of Interspeech 2008.

S. Jung, C. Lee, K. Kim, M. Jeong, and G. Lee. 2009. Data-driven user simulation for automated evaluation of spoken dialog systems. In Computer Speech & Language 23.

S. Möller, R. Englert, K.-P. Engelbrecht, V. Hafner, A. Jameson, A. Oulasvirta, A. Raake, and N. Reithinger. 2006. MeMo: Towards Automatic Usability Evaluation of Spoken Dialogue Services by User Error Simulations. In Proceedings of Interspeech 2006, Pittsburgh, USA.

A. Pfalzgraf, N. Pfleger, J. Schehl, and J. Steigner. 2008. ODP: ontology-based dialogue platform. Technical report, SemVox GmbH.

O. Pietquin. 2008. User Simulation/User Modeling: State of the Art and Open Questions. CLASSiC Project Consortium Meeting, Issy-les-Moulineaux.

J. Schatzmann, K. Georgila, and S. Young. 2005. Quantitative Evaluation of User Simulation Techniques for Spoken Dialogue Systems. In Proceedings of the 6th SIGdial Workshop on Discourse and Dialogue, Lisbon, Portugal.

B. Steimel, A. Jameson, O. Jacobs, and S. Pulke. 2007. Voice Awards 2007: Die besten deutschsprachigen Sprachapplikationen. Project Deliverables.

Biographical Sketch

Dmitry Butenkov currently works toward a PhD at Berlin Institute of Technology under the supervision of Sebastian Möller. He joined T-Labs in October 2008, holding a BSc in Computer Engineering and a Diploma in Computer Science from Taganrog State University of Radioengineering (now Taganrog Institute of Technology, Southern Federal University), Russian Federation. He was born in 1984 in Taganrog, USSR.


Caroline Clemens
Deutsche Telekom Laboratories
Ernst-Reuter-Platz 7
10587 Berlin
Germany

[email protected]/menue/team/forscher/caroline_clemens/
www.laboratories.telekom.com

1 Research Interests

My focus of work is the user interface design of spoken dialogue systems. One of my main research fields in recent years has been automatic user classification. I develop voice applications that support the process of call handling in call centers. As a speaker and voice coach I am working on the production of speech recordings.

1.1 User Classification

The aim of my dissertation is to show how information about users and their behavior can be gathered automatically. The results can be used for automatic user classification. In terms of benefits, this enables adaptivity of dialogue systems, gives clues to designers, and helps marketing to know the users. The results support a prospective and therefore more efficient dialogue development process in its early phases.

I focused on log files as a source of information. Log files are created during running dialogues and contain detailed entries about events of the dialogue with precise timestamps. I collected log files of test-person calls together with data from questionnaires and interviews. The combined analysis, including demographic data, personality profiles, and features of interaction behavior, offers a deeper understanding and interpretation of the resulting correlations.
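As a small illustration of how interaction features can be mined from such log files, the sketch below derives a few per-call parameters from timestamped event entries; the log format and field names are invented for the example.

```python
# Hypothetical sketch: extracting per-call interaction features from
# dialogue log entries. The log format and fields are invented here.
from datetime import datetime

log = [
    ("2009-06-02 10:00:01", "call_start", ""),
    ("2009-06-02 10:00:05", "system_prompt", "main_menu"),
    ("2009-06-02 10:00:09", "user_utterance", "billing"),
    ("2009-06-02 10:00:10", "nomatch", ""),
    ("2009-06-02 10:00:31", "call_end", ""),
]

def call_features(entries):
    """Derive simple interaction parameters from one call's log."""
    ts = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t, _, _ in entries]
    events = [e for _, e, _ in entries]
    return {
        "duration_s": (ts[-1] - ts[0]).total_seconds(),
        "user_turns": events.count("user_utterance"),
        "nomatch_rate": events.count("nomatch") / max(1, events.count("user_utterance")),
    }

print(call_features(log))  # {'duration_s': 30.0, 'user_turns': 1, 'nomatch_rate': 1.0}
```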

I worked on my PhD thesis "User classification for automatic speech dialogue systems" (see Clemens and Hempel, 2008) in the DFG (Deutsche Forschungsgemeinschaft) research training group Prometei1 (Prospektive Gestaltung von Mensch-Technik-Interaktion) at the Center of Human-Machine-Systems2 at the Technische Universität Berlin.

1.2 Call Center Supporting Applications

In several projects at T-Labs3 (Deutsche Telekom Laboratories) we develop applications that support the process of call handling in call centers (for example Clemens et al., 2009). Speech-based classification is already used for routing purposes in automatic voice portals. We develop concepts for the optimisation of customer concern handling. I am working on design guidelines for a consistent voice user interface design of such applications.

1 www.zmms.tu-berlin.de/prometei
2 www.zmms.tu-berlin.de
3 www.laboratories.telekom.com

1.3 Questionnaire for Technical Affinity

With colleagues at the Center of Human-Machine-Systems we set up a workgroup to develop a questionnaire to assess the user's attitude towards, and contact with, menu-based electronic devices in everyday life. The questionnaire is ready and will be published soon (Karrer et al., 2009). It can be used for free for research and for product development and evaluation.

2 Future of Spoken Dialogue Research

I think that the design of spoken dialogue systems has to be more flexible and adaptive in the future due to the diversity of the users and system contexts. There will be:

• heterogeneous user groups,

• fewer stand-alone applications and more integration into complex systems and processes,

• combinations with other modalities and cross-device interaction.

3 Suggestions for Discussion

• Adaptivity of spoken dialogue systems: To which kinds of information can a system adapt? Which kinds of information retrieval will be used in the future? How can the system adapt in style, speed, content, etc.?

• Combination of spoken dialogue systems with other modalities like audiovisual interaction: Which opportunities and which limitations do we see?

• Intuitivity of spoken dialogue systems: Is talking to a machine intuitive in general? Which aspects are important to increase intuitive use?


References

C. Clemens and T. Hempel. 2008. Automatic User Classification for Speech Dialog Systems. In Usability of Speech Dialog Systems – Listening to the Target Audience. Berlin, Springer.

C. Clemens, S. Feldes, K. Schuhmacher and J. Stegmann. 2009. Automatic Topic Detection of Recorded Voice Messages. In Proc. Interspeech 2009.

C. Hipp et al. 2008. Qualitätsleitfaden. Kochbuch für gute Sprachapplikationen. Initiative VOICE BUSINESS, Fraunhofer-Institut für Arbeitswirtschaft und Organisation (IAO), Stuttgart. Qualitätsleitfaden für Sprachanwendungen.

K. Karrer, C. Glaser, C. Clemens and C. Bruder. 2009. Technikaffinität erfassen – der Fragebogen TA-EG.

Biographical Sketch

Caroline Clemens studied Science of Communication and Linguistics, focussing on phonetics and speech. As a spoken dialogue designer she gathered experience in dialogue design and dialogue production at Mundwerk AG4, where she worked on the whole process of system development, from requirement analysis, architecture design, dialogue flow, storyboard and grammar writing to testing and evaluation. She worked at Siemens AG, User Interface Design5, where she also was a PhD student with a doctoral thesis about spoken dialogue systems that was supported by the research training group Prometei at Technische Universität Berlin. For several years she was a member of the scientific board of the Centre of Human-Machine Systems, State of Berlin. She participated in the development of a handbook for speech applications (Hipp et al. 2008). Currently she works as a research scientist at Deutsche Telekom Laboratories, Quality and Usability Lab6 at Technische Universität Berlin (Institut für Softwaretechnik und Informatik). For the second time she is on the program committee of the Berliner Werkstatt Mensch-Maschine-Systeme7 and is also organising a workshop8 on Exploring Design Criteria for Intuitive Use at the M&C9 conference.

4 Mundwerk AG developed and ran speech recognition based telephone applications.
5 Siemens AG, Corporate Technology, User Interface Design.
6 www.qu.tu-berlin.de/menue/home/parameter/en/
7 www.tu-berlin.de/zentrum_mensch-maschine-systeme/menue/veranstaltungen/berliner_werkstaetten_mms/8_berliner_werkstatt_mms
8 www.iuui.de/workshops/mc2009/Home.html
9 www2.hu-berlin.de/mc2009/


Klaus-Peter Engelbrecht
Quality and Usability Lab
Deutsche Telekom Laboratories
Berlin University of Technology
Ernst-Reuter-Platz 7
Berlin, Germany

[email protected]

1 Research Interests

My research interests lie in the measurement of the usability and quality of spoken dialogue systems (SDSs). In this, I mainly focus on the prediction of subjective judgments.

1.1 Context of my Work

I am working at Deutsche Telekom Laboratories in the field of automated usability evaluation methods. We aim at modelling human-machine interactions with the help of user simulation, focusing on spoken dialogue systems and multimodal systems. I participated in the MeMo project, in which a workbench for early usability testing was developed. In order to simulate the interaction, the device under test is modeled as a state chart representing the system behavior, together with annotations of features of the dialogue elements, e.g. prompts or widgets. The user model searches a path through the state chart based on an agenda, and rules reflecting usability guidelines may cause deviations from that path. Interactions are on the concept level, and errors can be generated by randomly deleting, substituting or inserting concepts with a defined likelihood.
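The concept-level error generation can be pictured with a short sketch like the following; the probabilities, concept vocabulary and function names are illustrative assumptions, not the MeMo workbench API.

```python
# Hypothetical sketch of concept-level error simulation: user concepts are
# randomly deleted, substituted or inserted with defined likelihoods.
# Probabilities and vocabulary are illustrative, not MeMo's actual values.
import random

VOCAB = ["date", "time", "origin", "destination"]
P_DELETE, P_SUBSTITUTE, P_INSERT = 0.05, 0.10, 0.03

def corrupt(concepts):
    """Simulate understanding errors on a list of user concepts."""
    noisy = []
    for c in concepts:
        r = random.random()
        if r < P_DELETE:
            continue                            # concept lost
        elif r < P_DELETE + P_SUBSTITUTE:
            noisy.append(random.choice(VOCAB))  # concept confused
        else:
            noisy.append(c)                     # concept passed through
    if random.random() < P_INSERT:
        noisy.append(random.choice(VOCAB))      # spurious concept
    return noisy

random.seed(1)
print(corrupt(["origin", "destination"]))
```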

In another project currently running at T-Labs, a user simulation interacting with real SDSs over the telephone is being developed. Accordingly, the interaction is fully verbal, i.e. spoken prompts and utterances.

Apart from quantitative and qualitative parameters such as duration of the interaction, number of turns to accomplish the goal, or task success, we would like to obtain an estimation of subjective measures like user satisfaction or cognitive demand during the interaction.

1.2 Previous work

I started my work in this research area with my Magister thesis, in which I developed a classification scheme for user behavior in interactions with spoken dialogue systems (Engelbrecht, 2006). This served as a basis for the behavior that should be modeled in the MeMo workbench (or generally in a user simulator for SDS evaluation).

My work in the MeMo project lay mainly in the evaluation of the simulation approach. I modeled an existing system with the workbench and compared the simulations to real user data gathered with this system. In addition, I compared two real corpora according to the same criteria. Criteria were interaction parameters, user ratings (predicted with PARADISE (Walker et al., 1997) in the case of the simulation), and coverage of utterances in the corpora. In addition, I analysed in depth which of the behaviors classified in my Magister thesis need to be added to the simulation.

1.3 Prediction of user judgments

User judgments provide a valid and easy-to-compare means of determining the overall quality of a system. Therefore, as our simulations focus on the evaluation of systems, a prediction model for user judgments can be very helpful.

There is one previous approach to such predictions, which is the PARADISE framework by Walker et al. (1997). The basic assumptions are that judgments depend on task success and dialogue costs, and that task success and dialogue costs can be measured instrumentally in the form of interaction parameters. Linear regression is used to derive the prediction function. The accuracy of the predictions, however, is relatively low: models can explain around 50% of the variance in the judgments. An extensive evaluation of such models showed that prediction results can hardly be improved, even if more interaction parameters are considered (Möller et al., 2008) or the classifier is changed.
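A PARADISE-style prediction function can be sketched in a few lines: ordinary least squares maps interaction parameters onto a user judgment. The parameters chosen and the data below are made up purely to show the mechanics.

```python
# Sketch of a PARADISE-style model: linear regression from interaction
# parameters to user judgments. The data below is made up for illustration.
import numpy as np

# Columns: task success (0/1), number of turns, ASR error rate.
X = np.array([
    [1,  8, 0.1],
    [1, 12, 0.2],
    [0, 20, 0.5],
    [1, 10, 0.3],
    [0, 25, 0.6],
])
y = np.array([4.5, 4.0, 1.5, 3.5, 1.0])  # user judgments on a 1..5 scale

# Fit weights by least squares, with a bias term appended.
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(success, turns, error_rate):
    return float(np.dot(w, [success, turns, error_rate, 1.0]))

print(round(predict(1, 15, 0.25), 2))
```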

I have analysed the value of PARADISE-style models for the task desired in our user simulations, i.e. providing an overall measure of the system's quality from the perspective of the user. I found that the predictions result in relatively accurate mean values for system configurations, despite the low accuracy in predicting the ratings for individual dialogues. My assumption is that users differ in their judgment behavior, depending on their current mood, attitude towards SDSs and many other factors which are hard to measure or even to identify. As these differences are random, they are equalised when average judgments are considered (Engelbrecht and Möller, 2007).


In a follow-up study, I analysed an experiment where 14 tasks were judged by each user (Engelbrecht et al., 2008). Correlations between interaction parameters and judgments could be calculated. The finding was that the parameters correlated with the judgment were similar for all users. However, the strength of the correlations differed considerably among the users, depending on the users' age, affinity towards technology, and their cognitive abilities. Yet even for the groups which could be predicted best, the accuracy of predictions was relatively low.

Therefore, a new approach was created to model the interrelations between events in the dialogue more accurately, and to take into account the inter-individual differences in judging. In this approach, an HMM is used to model the evolution of the judgment over time. In the model, the judgment depends on the current combination of dialogue events, as well as on the judgment of the dialogue so far (i.e. the previous judgment). To obtain data for the model, I conducted a WOZ experiment in which I asked the users for a quality rating after each dialogue turn. The model predicts the probability of each judgment ("bad".."excellent") at each turn, reflecting the differences between the users.
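The turn-by-turn judgment model can be illustrated with a tiny forward update over judgment states; the two judgment states and the transition and event tables below are invented solely to show the mechanics, whereas the real model uses the full rating scale and learned probabilities.

```python
# Toy sketch of an HMM over user judgments: the judgment at turn t depends
# on the previous judgment and on the current dialogue event. All numbers
# and the two-state space are invented; the real model uses the full
# "bad".."excellent" scale.
import numpy as np

STATES = ["bad", "good"]
TRANS = np.array([[0.8, 0.2],    # P(judgment_t | judgment_{t-1}):
                  [0.3, 0.7]])   # users tend to keep their opinion
EMIT = np.array([[0.3, 0.7],     # P(event | judgment); columns are events:
                 [0.8, 0.2]])    # 0 = smooth exchange, 1 = error/reprompt

def forward(events, prior=(0.2, 0.8)):
    """Track P(judgment) turn by turn, given observed dialogue events."""
    belief = np.array(prior, dtype=float)
    for e in events:
        belief = TRANS.T @ belief   # judgment evolves from the previous turn
        belief *= EMIT[:, e]        # weigh by how likely each state emits e
        belief /= belief.sum()      # renormalise to a distribution
        print(dict(zip(STATES, belief.round(2))))
    return belief

forward([0, 1, 1])  # one smooth turn, then two error turns
```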

In the future, I plan to test the model on data from other databases and systems. Also, I will investigate the possibilities of training the HMM on dialogues for which only the final judgment is available. Finally, I plan to integrate the model into the user simulator and compare estimations of quality from the simulation to measurements in an experiment with real users.

2 Future of Spoken Dialogue Research

From my perspective, commercially applied SDSs have made major progress in the past two years. With a well-designed system, speech recognition errors are rare, and efficient and reliable recovery/prevention strategies are at hand. However, for many of the tasks the systems are applied to, the internet is far more efficient.

I believe there are two possible lines research should follow. Firstly, improved user simulation might enable extensive testing, and thus allow us to create systems with reliable, but more complex (and efficient) dialogue strategies. These might include adaptation to the users, e.g. in terms of default tunings or expectations of typical behavior patterns or tasks.

On the other hand, the mobile internet, especially on small devices, might be more attractive in many cases. However, speech technology might add comfort to such interfaces and services.

3 Suggestions for Discussion

• Perspectives for SDSs. Which applications will prevail over the mobile internet?

• System performance, user and expert judgments. Which of them determines system quality?

• Modelling of SDS users. How accurate do models need to be?

References

K.-P. Engelbrecht, F. Gödde, F. Hartard, H. Ketabdar and S. Möller. Modelling User Satisfaction with Hidden Markov Models. Submitted to SIGDial 2009.

K.-P. Engelbrecht, S. Möller, R. Schleicher and I. Wechsung. 2008. Analysis of PARADISE Models for Individual Users of a Spoken Dialog System. In Proc. of ESSV'2008.

K.-P. Engelbrecht and S. Möller. 2007. Pragmatic Usage of Linear Regression Models for the Prediction of User Judgments. In Proc. of SIGDial'07.

K.-P. Engelbrecht. 2006. Fehlerklassifikation und Benutzbarkeits-Vorhersage für Sprachdialogsysteme auf der Basis von mentalen Modellen. Magister thesis, Deutsche Telekom Labs, TU Berlin.

S. Möller, K.-P. Engelbrecht and R. Schleicher. 2008. Predicting the Quality and Usability of Spoken Dialogue Services. In Speech Communication 50, pp. 730–744.

M. Walker, D. Litman, C. Kamm and A. Abella. 1997. PARADISE: A Framework for Evaluating Spoken Dialogue Agents. In Proc. of ACL/EACL, Madrid, pp. 271–280.

Biographical Sketch

Klaus-Peter Engelbrecht is working as Wissenschaftlicher Mitarbeiter at the Quality and Usability Lab at Deutsche Telekom Laboratories, TU Berlin. He studied Communication Research and Musicology and received his Magister degree in 2006 from the TU Berlin. At T-Labs, he is working towards his PhD thesis in the domain of automated usability evaluation for spoken dialogue systems. His extracurricular interests include playing in rock bands and sports.


Arash Eshghi
Queen Mary University of London
Interaction, Media and Communication Group
Mile End Road
London, E1 4NS

[email protected]

1 Research Interests

My main research interests are in the area of human-human interaction. I am interested in how the conversational context evolves and how coordination of agents is achieved in multi-party dialogues (henceforth multilogue). My research has so far been experimental, but I'm also interested in how the phenomena observed via experimentation may be captured within a formal contextual framework.

2 Past, Current and Future Work

Models of dialogue, both psychological (e.g. (Clark and Schaefer, 1989)) and computational (e.g. Ginzburg's KoS (Ginzburg 2009)), have been developed primarily to account for how context is built up through direct interaction between pairs of participants. Multilogue, however, is conditioned by an overall lack of balance in the participants' levels of direct interaction or involvement with one another. It is not only possible but likely that a participant will be left out temporarily or during a whole conversation/topic.

This immediately raises the question of what, if anything, is the difference between those participants who are in direct interaction for a stretch of talk and those who provide little or no feedback during it. Are the latter at a disadvantage in terms of the levels of grounding and semantic coordination that they reach as regards the material they provided little or no feedback on? Studies have shown that they could be (e.g. (Schober and Clark, 1989; Healey and Mills, 2006)). But would these results persist even if this lack of involvement was only 'temporary'? If not, then what are the differences between a participant who is only momentarily inactive in order to avoid interruptions or overlapping talk, and one who plays a completely passive role? More broadly, how does shared context evolve, and when and for whom is it shared in multi-party dialogue? These are questions which my research has attempted to address.

2.1 Collective states of understanding

A corpus analysis of the different surface forms of context-dependency (elliptical expressions) shows that there are dialogues in which a third, silent participant, despite the lack of overt feedback, reaches the same level of grounding or mutual understanding as the dyad actively engaged in talk. These dialogues lead to collective contexts that are not reducible to the component dyadic interactions that gave rise to them (Eshghi and Healey, 2007).

2.2 Grounding by proxy

It is then proposed that such collective contexts are the effect, at least, of those dialogues in which subgroups of the participants are organised into parties (Schegloff, 1995), such that there are fewer parties than there are individuals. The claim is developed that grounding operates in such dialogues, in the first instance, between parties rather than individuals, such that the party acts as a unified aggregate to carry the conversation forward. That is, the party is responsible, as a whole, for grounding and thus satisfying the constraints imposed by the current context of the conversation. In this manner, one member can stand proxy for others in doing this, such that it doesn't matter which member is doing the talking, as the rest effect the same contextual increments as that individual. This provides a pragmatic parallel to Schegloff's (1995) proposal that turn-taking operates in the first instance between parties. An experimental test of this idea is carried out which provides causal evidence for the practical reality of parties, in the form of 'extra' intra-party interactions, which, all things being equal, are caused by the failure of one of the party members to ground.

2.3 Emergence of unevenly distributed shared contexts

The claim is then developed and tested experimentally that, in contrast to the dialogues leading to the evenly shared irreducible collective contexts just mentioned, multi-party dialogues can also, under some circumstances, give rise to multiple shared contexts that are unevenly distributed across the (ratified) participants (Eshghi and Healey, 2009). This is manifested in systematically divergent interpretations of one and the same utterance by different participants, owing to the distinct contexts against which the utterance is assessed, a phenomenon we dub 'Pragmatic Pluralism' in dialogue. This occurs, we argue, when a participant falls outside the interacting parties for a stretch of talk, and as a result is more weakly attached to the local context than the members of the interacting parties.

Arguably, the results summarised above present significant challenges for how formal models of dialogue context, which underpin dialogue systems, characterise the contextual increments arising from each individual contribution. Within Ginzburg's (1996, 2009) question-based approach to modelling contextual change, these results indicate the possibility of a systematic divergence between the participants' Dialogue Game Boards (DGB). More specifically, they point to the need for the utterance-by-utterance DGB Update Rules (specifically FACTS-incrementation and QUD-downdate rules) to be relativised to participant role, understood here NOT in terms of Goffman's participation framework, but in terms of party membership.

3 Future of Spoken Dialogue Research

Spoken Dialogue Systems are becoming increasingly commonplace, and yet they seem to be hampered greatly by the fact that the semantic ontologies underpinning them are often, if not always, ad hoc (domain-specific) and static. That is, the theoretical models that they're based on fail to account for the ways in which shared semantic ontologies or conventions emerge from the interaction itself (the view of language as an adaptive system) via local coordination mechanisms such as repair, clarification and feedback. This requires a formal concept of context that is sufficiently rich to encode these essentially emergent and dynamic conventions and their transitions at a local level. A more generic dialogue system would certainly need to address these problems. In the case of multi-party dialogues, the situation is even more complex, as different sub-groups of the participants may reach divergent levels of semantic coordination (Healey and Mills, 2006) contingent upon their levels of direct interaction. Moreover, sub-groups may effect divergent commitments (contextual updates) as a result of one and the same utterance (Eshghi and Healey, 2009), depending on their prior knowledge of, and publicly presumed stance towards, the topic at hand. This latter problem is particularly significant in the development and implementation of automatic meeting assistants or summarisers.

4 Suggestions for Discussion

• How adaptive should a dialogue system be? How can we make dialogue systems more generic? How can the semantic coordination of agents be formalised?

• Apart from the current difficulties in speech recognition, what are the most pressing problems in the development of automatic meeting summarisers/assistants? How can they be addressed given the extreme complexity and contextual variability inherent in multi-party dialogues?

• How can different forms of non-verbal behaviour be recognised, and in turn how do they help the system in choosing the right course of action, which otherwise could involve needless clarification sub-dialogues?

References

H. H. Clark and E. F. Schaefer. 1989. Contributing to discourse. In Cognitive Science 13, pp. 259–294.

A. Eshghi and P. G. T. Healey. 2007. Collective states of understanding. In S. Keizer, H. Bunt and T. Paek (Eds.), Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue, pp. 2–9. Association for Computational Linguistics.

A. Eshghi and P. G. T. Healey. 2009. What is conversation? Distinguishing dialogue contexts. In Proceedings of the Annual Cognitive Science Conference.

J. Ginzburg. 1996. Dynamics and the semantics of dialogue. In J. Seligman (Ed.), Language, Logic and Computation.

J. Ginzburg. 2009. The interactive stance. CSLI: Center For The Study of Language and Information.

P. G. T. Healey and G. J. Mills. 2006. Participation, precedence and co-ordination in dialogue. In Proceedings of the Annual Cognitive Science Conference.

E. A. Schegloff. 1995. Parties talking together: two ways in which numbers are significant for talk-in-interaction. In P. Ten Have and G. Psathas (Eds.), Situated Order: Studies in Social Organization and Embodied Activities, pp. 31–42. University Press of America.

M. F. Schober and H. H. Clark. 1989. Understanding by addressees and overhearers. In Cognitive Psychology 21, pp. 211–232.

Biographical Sketch

Arash Eshghi has just defended his PhD thesis in the Interaction, Media and Communication Group at Queen Mary University of London. His first degree was in Computer Science, and he has a Masters in Artificial Intelligence. Given the highly experimental nature of his thesis, his interests are, at least in principle, cross-disciplinary. He greatly enjoys philosophical debates and loves skiing and camping.


Milica Gasic
Department of Engineering
University of Cambridge
Trumpington Street
Cambridge

[email protected]/~mg436/

1 Research Interests

My research interest lies in the area of statistical approaches to Dialogue Management. I am primarily focused on the Partially Observable Markov Decision Process framework for Spoken Dialogue Systems. This framework has the potential to enable learning from data, learning from interaction with both simulated and real users, and also to be robust to recognition errors.

1.1 Modelling Dialogue as a Partially Observable Markov Decision Process

One of the main problems that real-world applications of Spoken Dialogue Systems encounter is the difficulty of dealing with errors from the speech recogniser in a noisy environment. As almost no real application is noise free, this is a large obstacle preventing dialogue systems from achieving wide use.

It has been suggested that the Partially Observable Markov Decision Process (POMDP) can provide a principled mathematical framework to deal with the uncertainty that originates from errors in speech recognition, by retaining and updating the distribution over all states in each turn (Young, 2002; Williams and Young, 2007). The idea is that the dialogue state is hidden and therefore the policy should not be based on a single state but on the distribution over all possible states.
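For reference, the belief update this describes can be stated in standard POMDP notation (a textbook formula, not one quoted from the papers cited above):

    b'(s') = \eta \, O(o \mid s', a) \sum_{s} T(s' \mid s, a) \, b(s)

where b is the belief state (a distribution over hidden dialogue states s), a is the last system action, o is the noisy observation (e.g. the recognition hypothesis), T is the transition model, O is the observation model, and \eta is a normalising constant.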

The main weakness of the POMDP approach is its computational intractability for even very simple domains. However, it has been suggested that factoring the states into summary states can make POMDP policy optimisation feasible (Williams and Young, 2007). It has also been shown that policy learning can be performed online in interaction with a simulated user.

1.2 Current Work

My work is focused on the Hidden Information State system (HIS) (Young et al., 2009). What makes HIS different to other POMDP-based dialogue managers is its representation of the user goal in the dialogue state. It retains the full representation of the user goal through partitions of the user goal space. The partitions are created based on the information that the user has provided. After each dialogue turn the partitions and their probabilities are updated. The partitions have the ability to represent any user goal based on the domain ontology. As the noise in the input increases, the number of partitions grows exponentially. Therefore, a pruning technique has been devised that can effectively remove low-probability partitions while still retaining those that represent plausible user goals.
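The pruning step itself is conceptually simple; a minimal Python sketch (with an invented partition representation and an illustrative threshold – not the HIS implementation) might look like this:

    # Drop low-probability user-goal partitions and renormalise the rest.
    # The dict-of-probabilities representation and threshold are illustrative.
    def prune(partitions, threshold=0.01):
        kept = {p: pr for p, pr in partitions.items() if pr >= threshold}
        total = sum(kept.values())
        return {p: pr / total for p, pr in kept.items()}

    prune({"hotel+cheap": 0.60, "hotel+expensive": 0.395, "bar+cheap": 0.005})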

In order to reduce the dimensionality, the state space (the master space) is mapped into a much smaller summary space. Reinforcement learning is performed on the probability distribution over the summary space – the belief space. Since the belief space is continuous, it is discretised into grid points. Then, the Monte Carlo Control algorithm (Sutton and Barto, 1998) is performed on these points. Finally, the outcome of the policy is mapped back into the master space.
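To make the procedure concrete, here is a minimal Python sketch of grid-based Monte Carlo control over a continuous summary space. All names, the distance threshold and the action set are invented for illustration; this is not the HIS implementation.

    import random
    from collections import defaultdict

    ACTIONS = ["confirm", "request", "inform"]   # hypothetical summary actions

    grid = []                    # belief points acting as grid centres
    Q = defaultdict(float)       # (grid_index, action) -> estimated return
    N = defaultdict(int)         # visit counts for incremental averaging

    def nearest(belief, threshold=0.1):
        """Index of the nearest grid point; add a new point if none is close."""
        best, dist = None, float("inf")
        for i, g in enumerate(grid):
            d = sum((b - x) ** 2 for b, x in zip(belief, g)) ** 0.5
            if d < dist:
                best, dist = i, d
        if best is None or dist > threshold:
            grid.append(list(belief))
            return len(grid) - 1
        return best

    def choose_action(belief, epsilon=0.1):
        """Epsilon-greedy action choice at the nearest grid point."""
        i = nearest(belief)
        if random.random() < epsilon:
            return i, random.choice(ACTIONS)
        return i, max(ACTIONS, key=lambda a: Q[(i, a)])

    def mc_update(episode, reward):
        """End-of-dialogue Monte Carlo update towards the observed return."""
        for i, a in episode:
            N[(i, a)] += 1
            Q[(i, a)] += (reward - Q[(i, a)]) / N[(i, a)]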

The results of my current work suggest that training in noise leads to better performance at higher semantic error rates than training in a noise-free environment. Also, the ability of a dialogue manager to make use of N-best inputs from the recogniser improves performance at higher error rates (Gasic et al., 2008).

In addition, performance can be improved by ordering the actions in an N-best list using the Q-values from the Monte Carlo Control algorithm. The idea is that if the top action cannot be mapped back to the master space, the next-best action is used rather than a default action.
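As a sketch (continuing the illustrative names above, with an invented fallback action), the back-off might look like:

    # Back off through Q-ranked summary actions until one can be mapped
    # back to the master space; the "repeat" fallback is hypothetical.
    def select_action(q_values, is_mappable, default="repeat"):
        """q_values: {action: Q-value}; is_mappable: action -> bool."""
        for a in sorted(q_values, key=q_values.get, reverse=True):
            if is_mappable(a):
                return a
        return default

    select_action({"confirm": 0.8, "request": 0.5},
                  is_mappable=lambda a: a != "confirm")   # -> "request"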

1.3 Future Work

What I encountered during the training of the dialogue manager with a simulated user is that training on a very small number of grid points can reach the same performance as using a larger number of grid points for learning. This suggests not only that the current choice of summary space could be enriched in order to be more informative, but also that the Q-value function could be estimated more effectively in various parts of the summary space. Therefore, rather than computing the exact Q-value on grid points and generalising over a certain part of the space, as is done in the grid-based Monte Carlo Control algorithm, it would be better to have a functional estimate of the Q-value function over the summary space, with a measure of certainty.


This can then be used for the exploration/exploitationtrade-off during learning and also as a basis for policyadaptation with real users.

2 Future of Spoken Dialogue Research

In the next decade, I expect the noise robustness issues that prevent dialogue systems from wider application to be resolved. It is also very important to define a framework that is domain-independent and easily transferable to other domains. POMDPs seem to have the potential for this, but additional research is needed to further investigate the possibility of obtaining desirable results with computationally tractable approximations. It would be desirable for the process of building a Statistical Spoken Dialogue System to become fully automatic in the next decade. This would mean that the simulated user can be trained from real data, that the dialogue manager can learn from interaction with the simulated user, and finally that it is also able to learn with real users. The future of the Dialogue Manager is therefore highly dependent on the simulated user.

3 Suggestions for Discussion

• Machine Learning Techniques in Dialogue Modelling

• Data acquisition and reliability

• Generalisation across different domains

References

R. S. Sutton and A. G. Barto. 1998. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.

S. J. Young. 2002. Talking to machines (statistically speaking). In Proc. of Int. Conf. on Spoken Language Processing, Denver, Colorado.

J. D. Williams and S. J. Young. 2007. Partially observable Markov decision processes for spoken dialog systems. In Computer Speech and Language 21 (2), pp. 393–422.

M. Gasic, S. Keizer, B. Thomson, F. Mairesse, J. Schatzmann, K. Yu, and S. Young. 2008. Training and evaluation of the HIS-POMDP dialogue system in noise. In Proc. 9th SIGDial, Columbus, OH.

S. Young, M. Gasic, S. Keizer, B. Thomson, F. Mairesse, J. Schatzmann, and K. Yu. 2009. The Hidden Information State model: A practical framework for POMDP-based spoken dialogue management. In Computer Speech and Language. doi:10.1016/j.csl.2009.04.001.

Biographical Sketch

After graduating in Computer Science and Mathematics from the University of Belgrade in 2006, Milica Gasic did an MPhil in Computer Speech, Text and Internet Technology at the University of Cambridge. She completed the course in 2007 with the thesis Limited Domain Synthesis of Expressive Speech. In October 2007 she became a PhD candidate in Dialogue Modelling under the supervision of Prof Steve Young. She is working in the Dialogue Systems Group: http://mi.eng.cam.ac.uk/research/dialogue/.


Florian Godde
Berlin Institute of Technology
Quality & Usability Lab
Ernst-Reuter-Platz 7
10587 Berlin, Germany

[email protected]

1 Research Interests

Building user-friendly spoken dialogue systems (SDSs) is not trivial: many deployed systems lack a clear, understandable dialogue structure, informative help prompts that focus on the necessary information, and individual error recovery strategies for different errors. Since user tests within the development cycle of SDSs are very expensive and time-consuming, it is of interest to do usability testing in the early stages of development in an automated way, to provide guidance for developers throughout the whole process and not only at a very late stage where some flaws may not be fixed that easily. I am working on semi-automatic evaluation methods for spoken dialogue systems that might provide a fast and cheap way to test early prototypes of SDSs for design and usability flaws. Furthermore, I am interested in SDSs for different user groups. One of the most promising applications for SDSs is the home-care sector, where easy and simple access to information technology is mandatory to help older users interact with computer systems. Since speech is something that most potential users do not have to learn (in contrast to keyboard and mouse usage), SDSs are a promising alternative. But systems that do not work as users expect can be very frustrating, so a clear understanding of the needs and abilities of older users is mandatory.

1.1 Past Research

To predict user satisfaction with SDSs it is necessary to understand how users perceive these systems. Different users have different needs regarding their interaction with computers: for instance, novice users should be provided with extensive help during a dialogue, whereas expert users should be able to perform their task as fast as possible. In the frame of my diploma thesis I analysed the interactions of two user groups, namely younger and older users, with the smart-home system INSPIRE (Godde et al., 2008). Older users are of particular interest for industry, since SDSs might help them to successfully integrate modern technology into their daily life, which can help them stay independent in their own homes as long as possible. As an outcome of this work I showed that the behaviour and acceptance of SDSs does not mainly depend on the age of the users but on a more general affinity towards technology, and that older users tend to speak to such a system in a more natural way, comparable to human-human communication. Furthermore, older users benefit more from extensive help prompts, since these support fast alignment to the system vocabulary (Wolters et al., 2009).

1.2 Current and Future Research

Currently I am working towards my PhD in a project named "SpeechEval". We are building a framework for semi-automatic quality testing of spoken dialogue systems, which uses user simulation and usability prediction based on data-driven machine learning algorithms. The idea is to build a realistic user simulation that learns interaction behaviour from a dialogue corpus containing a variety of dialogues with commercial systems, and to generalise from that corpus to be able to interact with new systems. I am developing a method that uses dialogue corpora together with user ratings to automatically assign real user ratings to the simulated dialogues. Furthermore, I will try to measure the quality of the simulated dialogues independently of user characteristics; that is, the quality rating of the system should be valid for all simulated user groups. As an outcome of the project, we are building a workbench that uses the user simulation to interact with the SDS under test, learning the dialogue structure in the first dialogues through trial and error. The user simulation recognises keywords in help prompts and uses ontologies to build utterances that the system should understand. After some dialogues, the user simulation becomes more confident in the interaction, and the generated dialogue logs can then be used to measure the quality, usability and user satisfaction of the system. The developer of the SDS can then use this report to improve his/her system, test it again, and so on. A user test with real users might then only be necessary at the end of the development cycle, to get real ratings from the target group.
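A toy sketch of this trial-and-error behaviour (in Python; the ontology, keyword spotting and confidence update are invented for illustration, not the SpeechEval implementation):

    import random

    ONTOLOGY = {
        "destination": ["Berlin", "Hamburg"],
        "date": ["tomorrow", "next Monday"],
    }

    def simulated_user_turn(system_prompt, confidence):
        """Spot a slot keyword in the (help) prompt and answer it;
        otherwise explore by trial and error."""
        prompt = system_prompt.lower()
        for slot, values in ONTOLOGY.items():
            if slot in prompt:                       # keyword spotted
                return random.choice(values), min(confidence + 0.1, 1.0)
        slot = random.choice(list(ONTOLOGY))         # nothing recognised
        return random.choice(ONTOLOGY[slot]), max(confidence - 0.1, 0.0)

    utterance, conf = simulated_user_turn("Please say your destination.", 0.5)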

2 Future of Spoken Dialogue Research

For the next years, not only will improving techniques for all parts of an SDS be of interest, but also research on the overall acceptance of these systems. Since speech will become more common, for instance as an input modality in smart homes, it is necessary to understand how users perceive these systems, how users want to interact, and what makes a dialogue system acceptable and comfortable to use.

We have to find out:

• what is important for users – "natural", human-human-like dialogues, or fast and error-free communication;

• which factors really influence satisfaction with SDSs – this question is not yet answered, since every model of user satisfaction prediction leaves a large unexplained variance;

• how SDSs can be seamlessly integrated into modern technology, for instance in the home-care sector.

3 Suggestions for Discussion

• What do people really want, and what not? How can we increase the acceptance of SDSs by looking more closely at users' needs?

• How can real user satisfaction be measured? Do field tests provide more realistic judgments than laboratory experiments?

• Human-human communication is very flexible: dialogues concerning the same topic will never be actually identical, in contrast to human-machine dialogues. Is it feasible for spoken dialogue research to work on dialogue systems with flexible prompts to bring this variety into HMI?

References

F. Godde, S. Moller, K.-P. Engelbrecht, C. Kuhnel, R. Schleicher, A. Naumann, and M. Wolters. 2008. Study of a Speech-based Smart Home System with Older Users. In Proc. Int. Workshop on Intelligent User Interfaces for Ambient Assisted Living (IUI4AAL 2008), Canary Islands, Jan. 2008, Fraunhofer IRB Verlag, Stuttgart, pp. 17–22.

M. Wolters, K.-P. Engelbrecht, F. Godde, S. Moller, A. Naumann, and R. Schleicher. 2009. Making it Easier for Older People to Talk to Smart Homes: The Effect of Early Help Prompts. Accepted for Universal Access in the Information Society.

Biographical Sketch

Florian Godde graduated in Computer Science in 2007 with his diploma thesis "Evaluation of a Smart-Home System for Elderly Users". After graduation, he worked at Carmeq GmbH on spoken dialogue systems for Volkswagen. Since September 2008 he has been working towards his PhD at the Berlin Institute of Technology on the usability of spoken dialogue systems, under the supervision of Prof. Sebastian Moller.


Christine Howes
Queen Mary University of London
Interaction, Media and Communication Group
Mile End Road
London, E1 4NS

[email protected]/˜chrizba

1 Research Interests

My main research interests are in the area of human-human interaction. Specifically, I am concerned with how coordination of agents is achieved via the building up of semantic knowledge and/or situation models using strictly incremental mechanisms of syntax (using the formalism of Dynamic Syntax). I am currently investigating these questions via the study of split utterances.

1.1 Background and related work

The pioneering work of Clark (1996) initiated a broadly Gricean program for dialogue modelling, in which coordination in dialogue is said to be achieved by establishing recognition of speaker-intentions relative to what each person takes to be their mutually held beliefs (common ground). However, computational models in this vein have largely been developed without explicit high-order meta-representations of others' beliefs or intentions, and it is arguable that the Gricean assumptions underpinning communication should be reconsidered. In Kempson et al. (2009), the groundwork is laid for an interactive model of communication using Dynamic Syntax (DS: Cann et al., 2005), which examines its application to the tightly interactive dialogue phenomena that arise in cases of utterances split between speakers. In this model, each interlocutor interprets the signals they receive and plans the signals they send, egocentrically, without explicit representation of the other party's beliefs or intentions. Nevertheless, the effect of coordinated communication is achieved by relying on ongoing feedback and the goal-directed, action-based architecture of the grammar. The claim is that communication involves taking risks: in all cases where an interlocutor's system fails to fully determine choices to be made (either in parsing or production), the eventual choice may happen to be right, and might or might not get acknowledgement; it may be wrong and (possibly) get corrected; or, in recognition of the nondeterminism, they may set out a sub-routine of clarification before proceeding.

1.2 Split Utterances

Split utterances (SUs) – single utterances split between two or more dialogue turns/speakers – have been claimed to occur regularly in dialogue. They are of interest to dialogue theorists as a clear sign of how turns cohere at all levels – syntactic, semantic and pragmatic. They also indicate the radical context-dependency of conversational contributions. Turns can be highly elliptical and nevertheless not disrupt the flow of the dialogue. SUs are the most dramatic illustration of this: contributions spread across turns/speakers rely crucially on the dynamics of the unfolding context, linguistic and extra-linguistic, in order to guarantee successful processing and production.

Utterances that are split across speakers also present a canonical example of participant coordination in dialogue. The ability of one participant to continue another interlocutor's utterance coherently, both at the syntactic and the semantic level, suggests that both speaker and hearer are highly coordinated in terms of processing and production. The initial speaker must be able to switch to the role of hearer, processing and integrating the continuation of their utterance, whereas the initial hearer must be closely monitoring the grammar and content of what they are being offered so that they can take over and continue in a way that respects the constraints set up by the first part of the utterance.

Currently, I am investigating the impact that SUs have on ongoing dialogues using the DiET chat tool (Healey et al., 2003). The experiment reported in Howes et al. (2009) tests the effects of artificially introduced SUs on groups of people engaged in a task-oriented dialogue. Results show reliable effects on response time and on the number of edits in formulating subsequent turns. In particular, if the second part of an utterance is 'misattributed', people take longer to respond, and responses to utterances that appear to be split across speakers involve fewer deletes. This provides evidence that: a) speaker switches affect processing where they interfere with expectations about who will speak next, and b) the pragmatic effect of a split is to suggest to other participants the formation of a coalition or 'party' (Schegloff, 1995; Lerner, 1991).

Complementarily, a preliminary corpus study on a subsection of the British National Corpus (Purver et al., 2009) indicates that splits can occur anywhere in a string. This is consistent with models that advocate highly coordinated resources between interlocutors and, moreover, the need for highly incremental means of processing (Purver et al., 2006; Skantze and Schlangen, 2009).

From a computational modelling point of view, the results of the corpus study indicate that continuations usually start in an incomplete way, which means that a dialogue system has a chance of spotting them from surface characteristics of the input. However, this is hampered by the fact that the split can occur within any type of syntactic constituent, hence no reliable grammatical features can be employed securely. On the other hand, 8% of all utterances do not end in a complete way, and only 63% of these ever get continued. Conversely, the vast majority of continuations continue an already complete antecedent, and long distances between antecedent and continuation are possible. In this respect, locating the antecedent is not a straightforward task for automated systems, especially again as this can be any type of constituent, and many SUs are split over more than two contributions.
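As a toy illustration of what such surface-based spotting could look like (the cue list and lowercase heuristic are invented, and, as noted above, no such feature set is reliable):

    # Guess whether a turn starts as a continuation of an earlier utterance
    # purely from surface characteristics. Cues are invented for this sketch.
    CONTINUATION_STARTS = ("and", "or", "but", "to", "of", "that", "which")

    def looks_like_continuation(turn):
        words = turn.strip().split()
        return bool(words) and (words[0].lower() in CONTINUATION_STARTS
                                or turn.strip()[0].islower())

    looks_like_continuation("and then turn left")   # True
    looks_like_continuation("Where are you?")       # False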

2 Future of Spoken Dialogue Research

I see Spoken Dialogue Systems proliferating in the future, and feel that if they are to be accepted by the wider community then we need to get a handle on how human-human communication progresses so effortlessly, and which of these techniques can be effectively utilised in dialogue systems. This should include (but not be restricted to) a more detailed understanding of context (and how dialogue contexts can vary), feedback (in the form of backchannels etc. – are there specific points when these are somehow 'more' appropriate?), prosody, and the need for parsing and processing (and possibly the grammar itself) to be incremental and predictive. A good way to get a handle on some of these questions is through the development of spoken dialogue systems in severely restricted domains.

3 Suggestions for Discussion

• How general (and/or human-like) should a dialogue system be?

• How can we model a system that is flexible enough to process all the fragmentary, multiply ambiguous data that humans produce on a regular basis, and, if necessary, seek clarification?

• What feedback should a system give the user, and when is it appropriate to do so? And can this be determined by e.g. syntactic constraints, or intonational contours, or (potentially partial) semantic interpretations?

References

R. Cann, R. Kempson, and L. Marten. 2005. The Dynamics of Language. Elsevier, Oxford.

A. Gargett, E. Gregoromichelaki, C. Howes, and Y. Sato. 2008. Dialogue-grammar correspondence in dynamic syntax. In Proceedings of the 12th SEMDIAL (LONDIAL).

E. Gregoromichelaki, Y. Sato, R. Kempson, A. Gargett, and C. Howes. 2009. Dialogue modelling and the remit of core grammar. In Proceedings of IWCS.

P. Healey, M. Purver, J. King, J. Ginzburg, and G. Mills. 2003. Experimenting with clarification in dialogue. In Proceedings of the 25th Annual Meeting of the Cognitive Science Society.

C. Howes, P. G. Healey, and G. Mills. 2009. A: An experimental investigation into ... B: ... split utterances. In Proceedings of SIGDial'2009.

R. Kempson, E. Gregoromichelaki, M. Purver, G. J. Mills, A. Gargett, and C. Howes. 2009. How mechanistic can accounts of interaction be? In Proceedings of the 13th SEMDIAL Workshop on the Semantics and Pragmatics of Dialogue (DiaHolmia).

G. Lerner. 1991. On the syntax of sentences-in-progress. Language in Society, pp. 441–458.

M. Pickering and S. Garrod. 2004. Toward a mechanistic psychology of dialogue. In Behavioral and Brain Sciences 27, pp. 169–226.

M. Poesio and H. Rieser. to appear. Completions, coordination, and alignment in dialogue. Ms.

M. Purver, C. Howes, P. G. Healey, and E. Gregoromichelaki. 2009. Split utterances in dialogue: a corpus study. In Proceedings of SIGDial'2009.

G. Skantze and D. Schlangen. 2009. Incremental dialogue processing in a micro-domain. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009).

Biographical Sketch

Christine Howes is a 2nd year Ph.D student in the Interaction, Media and Communication Group at Queen Mary University of London. Her first degree was in Artificial Intelligence and Psychology, and she has a Masters in Computational Linguistics and Formal Grammar. She thus likes to think of herself as inter-disciplinary.

Outside of research, she is a keen follower of the mighty Brentford F.C., and also loves the escapism of the cinema.


Srinivasan C Janarthanam
School of Informatics
University of Edinburgh
Edinburgh EH8 9AB
Scotland, UK

[email protected]

1 Research Interests

Spoken dialogue systems should ideally adapt to users based on their domain expertise. Although such adaptation can happen at different levels, I am interested in how referring expressions for domain objects can be chosen appropriately in a user-specific manner. In a technical support dialogue task, expressions can be chosen between technical terms and descriptions based on the user's experience in the domain, i.e. descriptive expressions for beginners and technical expressions for experts. Similarly, in a city navigation task, the system could choose to use proper names for locals, while foreign tourists get more descriptive expressions. Natural language generation policies that choose the right set of referring expressions to use in a dialogue can be learned or hand-coded easily for different users separately. However, such policies cannot handle a user whose expertise is unknown. Ideally, the system needs a single policy that can adapt to all the different kinds of users during the course of the dialogue, based on the evidence presented by the users. Reinforcement learning methods can be used to learn adaptive referring expression generation policies (Janarthanam and Lemon, 2008, 2009a). Using a user simulation that is sensitive to the choice of referring expressions that the system is using, it is possible to learn a referring expression generation policy. The policy learned will be adaptive if the simulation produces users with different levels of expertise in different cycles. For this purpose, we propose a two-tiered simulation (Janarthanam and Lemon, 2009c). We use a Wizard-of-Oz environment to collect dialogue corpora to model the user's behaviour in the simulation (Janarthanam and Lemon, 2009b).
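A schematic sketch of such a policy (in Python; the expertise estimate, update rule and expression table are invented – a learned RL policy would replace the hand-coded rule):

    # Choose between technical and descriptive referring expressions based
    # on a running estimate of user expertise. All names are illustrative.
    EXPRESSIONS = {
        "router_dsl_light": {"technical": "the DSL LED",
                             "descriptive": "the second light from the left"},
    }

    class AdaptiveREG:
        def __init__(self):
            self.expertise = 0.5            # belief that the user is an expert

        def refer(self, obj):
            style = "technical" if self.expertise > 0.5 else "descriptive"
            return EXPRESSIONS[obj][style]

        def observe(self, understood_jargon):
            # evidence from the user's response; an RL policy would learn this
            delta = 0.1 if understood_jargon else -0.1
            self.expertise = min(1.0, max(0.0, self.expertise + delta))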

In addition to the above, I am also generally interested in other areas of dialogue systems research, like tutorial dialogue systems, learning dialogue management and NLG policies, incremental processing in NLU and DM, etc.

2 Future of Spoken Dialogue Research

The dialogue systems community is slowly moving from information-seeking tasks, like town information and flight booking, to troubleshooting and technical support tasks. This will open up more research questions in terms of task management moves.

In terms of dialogue processing, incremental processing (Skantze and Schlangen, 2009) is the direction to take. It avoids unnecessary delays and makes the conversation more natural. However, incremental processing must be implemented at all steps, such as parsing, dialogue management and generation.

More features, like back-channeling, turn-taking and time management strategies, must be explored in dialogue management. The use of multi-dimensional dialogue acts (Bunt and Girard, 2005) must be explored to make dialogue systems more natural.

Dialogue systems have to be adaptive to the user to make the conversation less frustrating. Adaptation can happen over different dimensions, such as the user's expertise in the domain, their age group, etc.

3 Suggestions for Discussion

The following are topics that I think need to be discussed at the workshop:

• Adaptive systems: What features of a user should a system look for and adapt to?

• Multi-functional dialogue acts: What dimensions must be considered in dialogue management?

• Enriched speech recognition: Can speech recognisers tell more than just the sequence of words and the confidence scores? Can prosodic features be detected and used to make more natural dialogue management decisions?

• Corpus collection strategies: How can we collect large dialogue corpora for different needs quickly and inexpensively?


References

Harry Bunt and Yann Girard. 2005. Designing an open, multidimensional dialogue act taxonomy. In C. Gardent and B. Gaiffe (Eds.), Proc. 9th Workshop on the Semantics and Pragmatics of Dialogue, pp. 37–44.

G. Skantze and D. Schlangen. 2009. Incremental dialogue processing in a micro-domain. In Proc. EACL 2009, Athens, Greece.

Srinivasan Janarthanam and Oliver Lemon. 2008. User simulation for knowledge-alignment and on-line adaptation in troubleshooting systems. In Proc. SEMDial 2008, London, UK.

Srinivasan Janarthanam and Oliver Lemon. 2009a. Learning lexical alignment policies for generating referring expressions for spoken dialogue systems. In Proc. ENLG 2009, Athens, Greece.

Srinivasan Janarthanam and Oliver Lemon. 2009b. A Wizard-of-Oz environment to study referring expression generation in a situated spoken dialogue task. In Proc. ENLG 2009, Athens, Greece.

Srinivasan Janarthanam and Oliver Lemon. 2009c. Learning adaptive referring expression generation policies for spoken dialogue systems. In Proc. SEMDial 2009, Stockholm, Sweden.

Biographical Sketch

Srinivasan C Janarthanam is a third-year Ph.D student at the University of Edinburgh. He is a UKIERI (2007-10) scholar and is funded by the British Council. Previously, he worked as a research associate at Amrita University, India, and as an applications developer at iNautix Technologies, India. He has an M.Sc in Intelligent Systems from the University of Sussex, UK, and a B.E in Computer Science and Engineering from Bharathiyar University, India. Currently, his work relates to Natural Language Generation in Dialogue Systems. Previously, he has worked on Dialogue Management, Tamil Named Entity Recognition, English-Tamil Machine Translation, Tamil Dependency Parsing and Morphological Analysis.


Denise Cristina Kluge
Federal University of Parana-Litoral
Applied Linguistics
Matinhos, PR, Brazil

[email protected]

1 Research Interests

My research interests lie in practical uses of spoken dialogue systems, towards studies that investigate foreign language speech production and perception, perceptual training, audiovisual speech perception and speech intelligibility.

1.1 Previous work

As the interrelationship between perception and production has been discussed in the second/foreign language (L2) phonetics and phonology literature, and some studies have shown that perception plays a very important role in the production of second language sounds (Flege, 1995; Kuhl and Iverson, 1995; Best and Tyler, 2007), I have investigated the relationship between perception and production by Brazilian learners of English as a foreign language.

The study aimed at investigating the perception and production of the English nasals /m/ and /n/ in syllable-final position by Brazilian learners of English (Kluge et al., 2007), as those coda nasals have different patterns of phonetic realisation across languages: whereas they are distinctively pronounced in English, in Brazilian Portuguese they are not fully realised.

It might be expected that for accurate production the learner would need accurate perception, which was the case in the study conducted, considering the two perception tests: an identification and a discrimination test. The results indicated that there was some relationship between the identification/discrimination of the target coda nasals and their accurate production. However, the tendency in this study was for production to be more accurate than perception.

1.2 Current and future work

Recent studies have included visual cues as a variable to investigate the perception of L2 contrasts (Hardison, 1999; Ohrstrom and Traunmuller, 2004; Hazan et al., 2006) and they have shown that L2 listeners seem to benefit from an Audio/Video presentation in the identification of visually distinctive L2 contrasts (Hazan et al., 2006). Bearing this in mind, my colleagues and I are currently working with the use of visual cues in the perception of nonnative contrasts.

We have recently investigated whether Brazilian learners of English would benefit from AV presentation when they had to identify a visually distinctive L2 contrast such as the labial/alveolar nasal consonant contrast in word-final position in English (Kluge et al., 2007). Brazilian learners of English have difficulties with English nasals in syllable-final position due to phonological differences between the two languages. In this study, English monosyllabic words with either /m/ or /n/ in word-final position were presented in three different conditions (Audio Only, Audio/Video and Video Only). Results not only showed that the Audio/Video condition seemed to favour the accurate identification of both word-final nasal consonants when compared to the Audio Only condition, but also showed a slight tendency for the Audio Only condition to disfavour the accurate identification of both bilabial and alveolar nasal consonants compared to the Audio/Video condition.

Taking into consideration the findings that L2 speakers may benefit from audiovisual input when they perceive visually distinctive L2 contrasts, and the fact that more accurate perception may lead to more accurate production, our current work aims at investigating whether audiovisual perceptual training is able to alter production. In order to do so, we are investigating whether Brazilian learners of English are able to improve their perception and production of the English word-final nasal consonants /m/ and /n/ by means of audiovisual perceptual training with this visually distinctive contrast. We are also aiming to investigate whether audiovisual perceptual training would alter the perception and production of Brazilian learners of English for contrasts such as voiceless TH, S and F in word-final position, which are less visually distinctive than the English nasal consonants in word-final position.

As regards the application and impact of the results of this study on research on spoken language systems, if the findings show that L2 speakers are able to better understand L2 speech in a more multimodal approach, I would like to discuss whether this is applicable to any dialogue system.

2 Future of Spoken Dialogue Research

Taking into consideration the increasing use of speech technology and the increasing number of studies in the area of L2 speech perception and production, I would like to think that further research could discuss the findings of language studies in order to deepen the discussion of human-machine interaction.

As for the improvement of human-machine interaction, I believe that machines should be better able to offer output that is more easily recognised, coping with differences between a native speaker and a nonnative speaker of a given language. As regards dialogue for language learning, further studies could investigate how teachers and students, for instance, could benefit from and make use of dialogue systems according to their learning needs.

3 Suggestions for Discussion

• Is the multimodal approach applicable and suitable to any dialogue system?

• If a multimodal approach is applicable, how could dialogue systems benefit from that?

References

C. T. Best and M. D. Tyler. 2007. Nonnative and second-language speech perception: Commonalities and complementarities. In O.-S. Bohn and M. J. Munro (Eds.), Second language speech learning: The role of language experience in speech perception and production, pp. 13–34. Amsterdam: John Benjamins.

J. E. Flege. 1995. Second Speech Learning: Theory, Findings, and Problems. In W. Strange (Ed.), Speech Perception and Linguistic Experience – Issues in Cross-Language Research, pp. 233–277. Timonium, MD: York Press.

D. Hardison. 1999. Bimodal speech perception by native and nonnative speakers of English: Factors influencing the McGurk effect. In Language Learning 49, pp. 213–283.

V. Hazan, A. Sennema, A. Faulkner, M. Ortega-Llebaria, M. Iba, and H. Chung. 2006. The use of visual cues in the perception of non-native consonant contrasts. In Journal of the Acoustical Society of America, 119 (3), pp. 1740–1751.

D. C. Kluge, A. S. Rauber, M. S. Reis, and R. A. H. Bion. 2007. The relationship between perception and production of English nasal codas by Brazilian learners of English. In Procs. of Interspeech 2007, pp. 2297–2300.

D. C. Kluge, M. S. Reis, D. Nobre-Oliveira, and M. Bettoni-Techio. 2007. The use of visual cues in the perception of English syllable-final nasals by Brazilian EFL learners. In New Sounds 2007: Procs. of the 5th International Symposium on the Acquisition of Second Language Speech, pp. 274–281.

P. K. Kuhl and P. Iverson. 1995. Linguistic experience and the "perceptual magnet effect". In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research, pp. 121–154. Timonium, MD: York Press.

Biographical Sketch

Denise Cristina Kluge is a Professor of Applied Linguistics at the Federal University of Parana, Brazil, particularly interested in English Phonetics and Phonology learning. She received her PhD and her MA in the same field in 2009 and 2004, respectively. She received her BA in Portuguese and English Teaching in 2000. In 2007 she completed part of her PhD studies at the University of Amsterdam under the supervision of Prof. Paul Boersma.


Theodora Koulouri
Department of Information Systems and Computing
Brunel University
Uxbridge, Middlesex
UB8 3PH
UK

[email protected]

1 Research Interests

My research interests lie in the area of natural language interfaces in Human-Robot Interaction. In particular, my research efforts are currently focusing on enriching the dialogue manager of a mobile personal robot with capabilities of natural (mis)communication management. My thesis follows an empirical approach and is motivated by collaborative models of human communication, also viewing Human-Robot Interaction (HRI) as a primarily bilateral process. It explores the linguistic resources employed by users in dialogue-based navigation of a robot. On the other hand, building upon human dialogue strategies, my aim is to identify the feedback and repair mechanisms that a robot should possess within the domain of goal-oriented HRI. The platform and test-bed of my research is a speech-enabled robot that is capable of executing and learning navigation tasks by means of unconstrained natural language (Lauria et al., 2001).

1.1 Past and Current Work

We conducted a series of Wizard of Oz simulations in order to obtain information on the range of utterances that users would produce when interacting with the robot, as well as to specify task and system requirements (Koulouri and Lauria, 2009b). As mentioned above, I am looking at how a robot should initiate repairs and provide feedback, and therefore the wizards were also subjects of the study. The study also explored how visual shared space and monitoring shaped the interaction patterns of both participants. Thus, the experimental design involved two conditions, in which the user could or could not monitor the robot. The study yielded a corpus of route instructions and also defined a closed set of robot error handling strategies (Koulouri and Lauria, 2009a). Moreover, differences in coordination patterns and linguistic choices were observed depending on condition. In brief, when the robot was executing without the direct supervision of the user, the responsibility for establishing task status and understanding shifted towards the robot, which provided rich descriptions and feedback. Yet, when users monitored the robot, verbal feedback by the wizards was often omitted. Similarly, when visual information was available, users' spatial descriptions tended to be less detailed and precise. The results indicate that the demands for spatial reasoning and inferential capacities could be higher for collocated, supervised robots. On the other hand, (semi-)autonomous execution requires more sophisticated interactive abilities.

1.2 Future Work

As suggested in the previous section, providing effective and timely feedback is crucial for task-oriented interactions, especially in the dynamic setting of HRI, in which the user's instructions can be incomplete or outdated. The amount and placement of feedback should be decided based on several knowledge sources, combined in a single criterion that is adaptive both within and between interactions. These sources could be the history of the ongoing dialogue (e.g., how many times so far the robot and user have initiated repair), a model of the environment (e.g., is the robot at home, outdoors or at a crowded workplace) and the task (e.g., is the route well-known, what are the consequences of errors). My future work will deal with the implementation of such functionality.
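A minimal sketch of such a combined criterion (in Python; the features, weights and threshold are invented to show the shape of the decision, not a proposal from the papers cited):

    # Combine the knowledge sources named above into one feedback decision.
    # All weights, feature scalings and the threshold are illustrative.
    def should_give_feedback(repairs_so_far, env_risk, route_familiarity,
                             error_cost, threshold=0.5):
        """Return True if the robot should verbalise feedback now."""
        score = (0.3 * min(repairs_so_far / 3.0, 1.0)   # dialogue history
                 + 0.3 * env_risk                        # model of environment
                 + 0.2 * (1.0 - route_familiarity)       # task knowledge
                 + 0.2 * error_cost)                     # consequence of errors
        return score > threshold

    should_give_feedback(repairs_so_far=2, env_risk=0.8,
                         route_familiarity=0.2, error_cost=0.9)  # True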

Moreover, according to a study on spatial descriptions by Mills and Healey (2006), clarification requests have a direct effect on the processes of coordination. They observed that as the dialogue progresses, the interlocutors converge on the use of more complex and efficient spatial descriptions. However, after clarification subdialogues, the instructor shifts to more conservative descriptions. These insights coming from human communication present interesting questions and rich opportunities for investigation in HRI. The next step in my research is to examine whether and how users revise and adapt their strategies, within the course of the dialogue, in response to particular robot utterances and in the presence of miscommunication. Finally, it would be interesting to determine whether certain user responses (as primed by previous repair initiations by the robot) are more efficient in terms of recovery from error.


2 Future of Spoken Dialogue Research

Spoken interfaces have been successfully embedded in numerous practical systems, which are mostly telephone-based, information-seeking applications. Recent progress in robotic technologies brings closer the vision of ubiquitous and commercial use of robots. Natural language, apart from being the most intuitive and expressive means of communication, is also a powerful tool for instruction, most appropriate for robots that essentially have to "learn on-the-job". Therefore, I believe that interaction with embodied agents, able to assist and collaborate with people in their everyday activities, is what the near future demands and holds for the dialogue systems community.

However, dialogue-based HRI, in which robots and users coordinate verbal and physical actions sharing time and space, poses several hard but exciting challenges. Unlike most dialogue systems, the deployment environment of a robot cannot be modelled a priori. Moreover, robots are typically built on agent-based architectures (that is, the systems consist of several components managing the dialogue and the robot actions and possibly sensor devices). Situated dialogue entails instantaneous synchronisation and updating of these components to include a continuous flow of information. Further, physical co-presence enhances the user's perception of common ground, increasing the use of spatiotemporal reference and, thus, the occurrence of mismatches. In particular, spatial language can be underspecified and arbitrary, requiring the application of multiple layers of discourse and situational context. Thus, in navigation tasks, the robot needs to have an understanding of language, spatial actions and relations, as well as perception of the world. Finally, misunderstandings in HRI might have safety implications. In conclusion, the scope, occurrence and costs of miscommunication increase, impelling researchers to integrate miscommunication management into the design of spoken interfaces for robots.

Such dialogue models should certainly be computationally practical, but should also be based on models of human communication as well as empirical studies in the domain of HRI. Endowing artificial agents with "human-inspired" strategies holds the promise of natural and robust management of miscommunication. Moreover, in HRI, as in human collaborative behaviour and communication, omniscience is not required, but rather coordination of information and knowledge states. Thus, I believe that redefining HRI as collaboration and embracing errors as inherent in communication could enable us to transfer persistent problems within speech recognition technology and HRI (arising from uncontrolled environments, unknown tasks, system synchronisation, interpretation of raw sensor data etc.) into the dialogue domain and deal with them as triggers for further interaction in the form of verbal instruction, feedback, clarification and grounding. Therefore, research efforts should focus on developing dialogue models that equip embodied agents with such interactive capabilities.

3 Suggestions for Discussion

• Evaluation metrics for assistive robots with speech and multimodal interfaces.

• To what extent can dialogue models and miscommunication frameworks from computer-based dialogue systems and human interaction be applied to embodied agents?

• What additional challenges does long-term interaction entail for the dialogue manager of personal/service robots? How can the dialogue manager accommodate emergent language and social relationships?

References

Gregory J. Mills and Patrick G. T. Healey. 2006. Clarifying Spatial Descriptions: Local and Global Effects on Semantic Co-ordination. In Procs. of the 10th Workshop on the Semantics and Pragmatics of Dialogue.

Stanislao Lauria, Guido Bugmann, Theocharis Kyriacou, Johan Bos, and Ewan Klein. 2001. Training Personal Robots Using Natural Language Instruction. In IEEE Intelligent Systems, pp. 38–45.

Theodora Koulouri and Stasha Lauria. 2009a. Exploring Miscommunication and Collaborative Behaviour in Human-Robot Interaction. In Procs. of SIGDial'09.

Theodora Koulouri and Stasha Lauria. 2009b. A Corpus-based Analysis of Route Instructions in Human-Robot Interaction. In Procs. of TAROS'09.

Biographical Sketch

Theodora Koulouri is a PhD student at Brunel University under the supervision of Stanislao Lauria. For the requirements of her MSc degree in Speech Sciences (UCL), she implemented a system for far-field recognition of spontaneous speech. She also holds a BA in Linguistics from the University of Athens. Among other happy distractions from research and teaching, she enjoys doing sports and drawing (mostly robots and llamas).


Christine Kuhnel
Technische Universitat Berlin
Deutsche Telekom Laboratories
Quality and Usability
Berlin, Germany

[email protected]

1 Research Interests

My main research interest is in multimodal human-machine interaction in the context of a smart-home dialogue system. My dissertation work focuses on three major topics:

(1) I am currently developing a multimodal smart-home dialogue system, based on the EU-funded IST project INSPIRE (INfotainment management with SPeech Interaction via REmote microphones and telephone interfaces; IST 2001-32746). The system extends home appliances with a speech interface. During the past months and over the next year, new output and input components are being added with the aim of developing a multimodal smart-home system for research purposes.

(2) The evaluation of the system and its components is conducted either with standardised or – where necessary – new methods. The first part of this, the evaluation of system output, is briefly described in the next section.

(3) Based on earlier work on the quality of spoken dialogue systems (Moller, 2005), Sebastian Moller and colleagues proposed a taxonomy of performance and quality aspects, including influencing factors of the system, the user, and the context of use (Moller et al., 2009a). Part of my dissertation work is to further analyse the interrelations between the different aspects given in the taxonomy.

1.1 Previous Work

Last year I focused on the output side of human-machine interaction. Four experiments on the quality of talking heads as output components of the smart-home system were conducted by me and my colleague Benjamin Weiss. The results of a passive listening-only experiment (Kuhnel et al., 2008), a web-based experiment (Weiss et al., 2009a; Weiss et al., 2009c), an interactive experiment with a simulated smart-home system (set in a laboratory) (Weiss et al., 2009b) and an interactive experiment with a fully functional smart-home system (set in a living room) (Kuhnel et al., 2009) were analysed and compared to address the following main questions:

• Concerning the quality of talking heads: which aspects of talking heads contribute to the perceived quality, and how?

• Concerning the evaluation of talking heads: is a web-based experiment as valid as a laboratory-based experiment, and how does interaction influence the quality ratings?

• Concerning the smart-home domain: is a talking head a suitable user interface, and how does it influence system quality?

1.2 Current and Future Work

The next step is to focus on the input side of the INSPIRE system. The original system offers speech input only, depending on automatic speech recognition. Given the low recognition rates achieved, especially under the conditions found in the home (multiple speakers, background noise, distant speech), analysing different interfaces seems worthwhile. Furthermore, the majority of potential users are still not used to having a conversation with a machine. On the other hand, speech input offers advantages – for example in hands-busy situations – that cannot easily be provided by other means. Combining multiple input modalities might allow the weakness of one modality to be compensated by the strength of another, and vice versa. It is often argued that multimodal interaction is intuitive because it resembles natural interaction. Apart from a talking head that can show facial expressions – used for example for back-channeling during a conversation – this would imply rendering the system sensitive to information conveyed via facial expressions and gestures. This seems a promising approach in dialogue-heavy applications, and a lot of work is being done in the area of conversational agents. But in the smart-home domain this would on the one hand require omnipresent cameras, while on the other hand quite a few interactions are task-driven and might not depend on a sophisticated dialogue where natural interaction is especially helpful.

But there are other means of interaction being or becoming more prevalent in recent years: GUI-based interaction and mobile interaction with a mobile phone as user interface. With recent smart phones a broad range of user interfaces is feasible. With accelerometers and touch screens, 2D and 3D gesture recognition can be used for user input, while the screen offers another mode for system output.

While writing this I am working on coupling an iPhone with the INSPIRE system, both as a sole interface and combined with the speech input and output modality already existing. A major part of this work is the development of a fusion module on a semantic level. As the system is mainly developed for experiments, a wizard interface to bypass parts of the system (i.e. the ASR, the gesture recognition or the fusion) is being developed at the same time.

The final system will be used for experiments with the aim of finding relations between the quality of system modules and the overall system quality, as well as correlations between user ratings collected with questionnaires and system performance. Based on the taxonomy introduced above, quality aspects important for multimodal systems will be identified and characterised, and their interrelations defined.
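A minimal sketch of what semantic-level (late) fusion can look like (in Python; the frame structure, tie-breaking rule and confidence combination are invented for illustration, not the INSPIRE module):

    # Late fusion at the semantic level: both recognisers emit
    # (intent, slots, confidence) hypotheses which are merged into one frame.
    def fuse(speech_hyp, gesture_hyp):
        """Merge two modality hypotheses into one semantic frame."""
        if (speech_hyp and gesture_hyp
                and speech_hyp["intent"] == gesture_hyp["intent"]):
            slots = {**gesture_hyp["slots"], **speech_hyp["slots"]}  # speech wins ties
            conf = 1 - (1 - speech_hyp["conf"]) * (1 - gesture_hyp["conf"])
            return {"intent": speech_hyp["intent"], "slots": slots, "conf": conf}
        # otherwise fall back to the more confident modality
        return max(filter(None, [speech_hyp, gesture_hyp]),
                   key=lambda h: h["conf"], default=None)

    fuse({"intent": "select_device", "slots": {"device": "lamp"}, "conf": 0.6},
         {"intent": "select_device",
          "slots": {"device": "lamp", "room": "kitchen"}, "conf": 0.7})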

2 Future of Spoken Dialogue Research

With ever-evolving new user interfaces, dialogue systems relying purely on speech will most likely be confined to niche applications where their advantages outweigh the usability constraints of dissatisfying recognition results (compared, for example, to GUI-based interaction). But the need for advanced dialogue systems will increase as more and more of our everyday world is equipped with machines, replacing for example the woman behind the counter or combining different household equipment into complex robots. My grandfather is no longer able to buy a ticket for a train ride, and even people of younger age might be unable to cope with new technology. At the same time these technologies might offer a new freedom to disabled people, granted an intuitive interface is provided. This does not necessarily require human-like interaction in the way that humans interact amongst themselves. As long as the user knows a machine is involved, (s)he will act accordingly. But dialogue systems that offer a choice of input and output modalities to cater for different user needs, different tasks and different situations could be the answer to many problems. This, of course, requires a thorough understanding of the requirements, and well-established methods and metrics for evaluating systems and system components.

3 Suggestions for Discussion

• Does an interaction need to be human-like to be 'natural' and intuitive?

• How can we achieve realistic setups for evaluation studies, and how can we motivate participants to give 'real world' ratings?

• Privacy issues: should the dialogue always be commenced by the user? Is it desirable to have an ever-alert system that is always eavesdropping on you?

References

C. Kuhnel, B. Weiss, I. Wechsung, S. Moller, and S. Fagel. 2008. Evaluating Talking Heads for Smart Home Systems. In 10th Int. Conf. on Multimodal Interfaces (ICMI 2008), Chania, Greece, 20-22 Oct. 2008, pp. 81–85.

C. Kuhnel, B. Weiss, and S. Moller. 2009. Talking Heads for Interacting with Spoken Dialog Smart-Home Systems. In Interspeech, Brighton, UK (accepted).

S. Moller. 2005. Quality of Telephone-Based Spoken Dialogue Systems. Springer, New York, NY.

S. Moller, K.-P. Engelbrecht, C. Kuhnel, I. Wechsung, and B. Weiss. 2009. Evaluation of Multimodal Interfaces for Ambient Intelligence. Accepted for: Human-Centric Interfaces for Ambient Intelligence (H. Aghajan, R. Lopez-Cozar Delgado and J. C. Augusto, eds.), Elsevier.

B. Weiss, C. Kuhnel, I. Wechsung, S. Moller, and S. Fagel. 2009. Comparison of Different Talking Heads in Non-Interactive Settings. In 13th Int. Conf. on Human-Computer Interaction (HCI International 2009), 19-24 July, San Diego, CA.

B. Weiss, C. Kuhnel, I. Wechsung, S. Moller, and S. Fagel. 2009. Quality of Talking Heads in Different Interaction and Media Contexts. In Special Issue of Speech Communication (accepted).

B. Weiss, C. Kuhnel, I. Wechsung, S. Moller, and S. Fagel. 2009. Web-based Evaluation of Talking Heads: How Valid Is It? In 9th Int. Conf. on Intelligent Virtual Agents, 14-16 September, Amsterdam, Netherlands (accepted).

Biographical Sketch

I am working as a research assistant at the Quality and Usability Lab of Deutsche Telekom Laboratories, TU Berlin. I studied Electrical Engineering and Business Administration and received my diploma degree in 2007 from the Christian-Albrechts University of Kiel. I am working towards my PhD thesis in the domain of evaluation of multimodal systems. My extracurricular interests include tango argentino, violin playing, sports and outdoor activities like cycling, hiking and canoeing.


Catherine Lai
Department of Linguistics
University of Pennsylvania
Philadelphia PA 19104, USA

[email protected]/˜laic

1 Research Interests

My research interests lie in modelling the contribution of prosody to meaning in dialogue. In particular, I am interested in how prosody can be integrated into formal models of dialogue. This involves studying how prosodic features fit in with theoretical concepts like an intonational lexicon, how well these sorts of theoretical models fare on speech data from real dialogue, and how measurable phonetic data can best be analysed and represented to address these questions. I am also interested in how machine learning and speech recognition tools can be used to investigate these problems.

In general, these studies are about understanding how prosody can be used to determine different types of speech acts and to structure dialogue. I also hope they will shed light on some theoretical questions about questions themselves: what makes an utterance a question, what makes an answer, and when can an utterance with question characteristics safely be ignored?

1.1 Previous Work

The overarching goal of my research project is to provide a robust and testable framework of prosody in dialogue. To develop robust models, we need to develop methods of analysis for data sets which may not be annotated for prosody or discourse metadata. However, before making hypotheses about the full range of dialogue moves, we need to consider smaller sets of data which are tractable in terms of investigating fine phonetic detail, semantic/pragmatic contributions, and their interaction. More importantly, we want to study data that are consistently and frequently used to shape the structure of the dialogue.

As such, my previous work has focused on studying the semantics, pragmatics and prosody of cue words. These discourse markers are important for maintaining dialogue coherence and have a wide range of uses. For example, the set of English cue words includes backchannels like uh-huh and okay, agreements like right, and questioning particles like really. Understanding these sorts of markers is important because they indicate both when things are going well and when things are going badly.

Individual cue words can express more than one interpretation, and prosody contributes to this interpretation (Gravano, 2009). However, to understand this contribution we need appropriate dimensions of meaning with which to view the data. In Lai (2008), I investigated really, which, as a one-word turn, is classified as both a backchannel and a question in a metadata annotation of the Switchboard corpus (MDE 2003, LDC2004T12). That study was an attempt to find out if prosodic features (e.g. pitch contour type, intensity, duration) could distinguish these two categories of reallys. Statistical analysis and machine learning attempts indicated that this was not the case. In fact, really appears to span a spectrum of meanings/uses ranging from acknowledgement to questioning to exclamative.
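As a rough illustration of this kind of experiment, the sketch below trains a simple classifier to separate backchannel from question reallys using three prosodic features and reports cross-validated accuracy; near-chance accuracy would mirror the negative finding above. The feature names, CSV layout and the use of logistic regression are illustrative assumptions, not the actual setup of Lai (2008).

import csv
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def load_tokens(path):
    # Each row: F0 slope (Hz/s), mean intensity (dB), duration (s), label.
    X, y = [], []
    with open(path) as f:
        for row in csv.DictReader(f):
            X.append([float(row["f0_slope"]),
                      float(row["mean_intensity"]),
                      float(row["duration"])])
            y.append(row["label"])  # "backchannel" or "question"
    return X, y

X, y = load_tokens("really_tokens.csv")   # hypothetical feature table
clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y, cv=10).mean())  # ~0.5 means prosody alone fails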

So, to model how these turns are interpreted we need to know not only what dimensions of meaning prosody can work on, but also how this combines with the syntax/semantics of the turn. Really and right were used to probe this interaction in Lai (2009). Really acts as a check on the common ground and seems to be interrogative as a one-word turn. On the other hand, right is an agreement word which can appear as questioning in very specific contexts: as a tag question with rising intonation. A perception experiment on really and right found that ratings of surprise were positively correlated with ratings of how questioning the cue word sounded. However, the ratings for really did not translate to ratings for right with similar prosodic values. So, the semantics/pragmatics of cue words suggest different thresholds for surprisingness when they are heard in isolation. The next step is to look at how these perceptions are integrated in the dialogue.

1.2 Current and Future Work

My previous studies were small steps in understanding the prosody of dialogue alignment. What we want from this project is a way to detect when there is agreement in dialogue and when there is a need for repair. My current work focuses on cue words in formal models of dialogue (e.g. Fernandez (2006), Schlangen (2004)). I am currently performing a corpus study looking at the rhetorical relations that structure the discourse around really and right, how this structure differs with varying prosody, and how this directs the Question Under Discussion or discourse topic. This feeds into further perception experiments of cue words in context. The goal of this is to understand how the range of speaker affect and attitude available with words like really translates into the categorisations and relations presented in semantic models.

A further goal of my project is to see whether the experimental and corpus results can be used to guide larger-scale corpus analysis for better automatic classification of speech acts. However, I hope the theoretical background together with the corpus coverage will also help us choose between models of how we represent time-varying prosodic features for dialogue, like pitch contours (Prom-on et al. (2006), Kochanski and Shih (2003), inter alia).

2 Future of Spoken Dialogue Research

For spoken dialogue systems to be usable they must be able to detect when things are going wrong and provide strategies for correction. This means detecting speech production errors as well as pragmatic misalignment. Clearly, given my line of research, I think prosody is key for handling this. However, it is still unclear what (abstract) representation is appropriate for prosodic features. Moreover, current research is significantly slowed by the fact that creating manually annotated corpora is an extremely painstaking process. So, the development of automated methods for extracting prosodic information is extremely important. In particular, this means dimensionality reduction for features like F0 contours; a simple sketch of this idea follows below. Development of such techniques seems viable in the next 5-10 years, especially as statistical/speech technology methods start to creep into theoretical linguistic research on discourse and phonology.
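As a minimal sketch of one such automated method, the code below reduces fixed-length F0 contours to a few principal components with plain SVD-based PCA; it assumes contours have already been extracted and resampled to a common length, which is itself a non-trivial step.

import numpy as np

def contour_pca(contours, n_components=4):
    # contours: array of shape (n_utterances, n_points), e.g. F0 in semitones
    X = np.asarray(contours, dtype=float)
    X = X - X.mean(axis=0)                 # centre each time point
    u, s, vt = np.linalg.svd(X, full_matrices=False)
    shapes = vt[:n_components]             # principal contour shapes
    scores = X @ shapes.T                  # low-dimensional representation
    return scores, shapes

# demo with random walks standing in for 100 contours of 30 points each
rng = np.random.default_rng(0)
scores, shapes = contour_pca(rng.normal(size=(100, 30)).cumsum(axis=1))
print(scores.shape)                        # (100, 4)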

3 Suggestions for Discussion

• How can we model the distribution of the load between prosody and non-speech features in dialogue? When is prosody crucial for understanding dialogue and when can it be ignored?

• How can we test theories of prosodic meaning and dialogue structure using real conversational corpus data? How far can we get with unannotated speech corpora?

• How can we model/detect speaker affect and attitude? How do we know when a dialogue is going well? How does this affect turn taking and the choice of conversational moves?

References

R. Fernandez. 2006. Non-Sentential Utterances in Dialogue: Classification, Resolution and Use. PhD thesis, Department of Computer Science, King's College London, University of London.

A. Gravano. 2009. Turn-Taking and Affirmative Cue Words in Task-Oriented Dialogue. PhD thesis, Columbia University.

G. Kochanski and C. Shih. 2003. Prosody modeling with soft templates. Speech Communication, 39 (3-4), pp. 311–352.

C. Lai. 2008. Prosodic Cues for Backchannels and Short Questions: Really? In Proceedings of Speech Prosody, Campinas, Brazil.

C. Lai. 2009. Perceiving Surprise on Cue Words: Prosody and Semantics Interact on Right and Really. In Proceedings of Interspeech, Brighton, UK.

S. Prom-on, Y. Xu, and B. Thipakorn. 2006. Quantitative Target Approximation Model: Simulating Underlying Mechanisms of Tones and Intonations. In Procs. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 1.

D. Schlangen. 2004. A Coherence-Based Approach to the Interpretation of Non-Sentential Utterances in Dialogue. PhD thesis, School of Informatics, University of Edinburgh.

Biographical Sketch

Catherine is pursuing a PhD in linguistics at the University of Pennsylvania. Her main interest is the interface between semantics, pragmatics and prosody, but she has also worked on the phonetics of speech disfluencies. Her other research hobby is formally modelling language change. Before this, she received a BSc (Hons) in Mathematics and Computer Science and an MSc in Computer Science at the University of Melbourne, Australia, where she worked on querying linguistic trees, amongst other things.



Marianne Laurent
Technopôle Brest-Iroise
CS 83818
29238 Brest Cedex 3
France

[email protected]/mariannelaurent/

1 Research Interests

The deployment of efficient Spoken Dialogue Systems (SDSs) requires the use of a dialogue design methodology. Current methods are not guided enough by an anticipation of a design adapted to the user's practices. Consequently, they require numerous iteration cycles based on real interaction data to detect and correct design issues. As a result, these ad hoc processes mainly lead to long and costly development cycles.

In this framework, the goal of my thesis is to work on the definition of dialogue design methodologies able to bring together knowledge of the technical components, the typology of services, and the user's practices. My work relies on an explicit formalisation of service evaluation, with a view to identifying problems more easily, comparing iterative versions of a service, and, if possible, validating best practices applicable to the design of future services.

My work comprises three stages: a first stage dedicated to gathering an understanding of several aspects of dialogue between two people; a second focused on the design of an SDS's dialogue logic and its enhancement supported by evaluation tools; and a third dedicated to the validation of best practices to feed the design process for future services.

1.1 A data-driven approach

My study started with the collection of a corpus of telephone communications held by a school secretary. The transcription reveals very rich interactions, addressing many kinds of subjects, with both known and unknown interlocutors. Our objective is to analyse these dialogues from several points of view (theme, strategy, rhythm, etc.) to identify a few trends to be taken into account in the design of SDS dialogues. To this end, careful thought has been given to the transcription formalism so that a statistical study of the data is possible.

These findings will then be used to feed the design of a service related to some of the observed secretarial tasks.

1.2 Evaluation for enhancement

The second stage, led by a learning-by-doing approach, consists in designing a service and enhancing it. This will be the occasion both to integrate the knowledge gathered from the analysis of the human-human interactions, and to think about the evaluation tools supporting the iteration cycles.

Keeping with the data-driven approach, the sets of indicators will be deduced from success and failure situations identified in corpora. This opens two opposite perspectives for the evaluation process.

On the one hand, I will work on the identification of red-light indicators. Their goal will be to draw attention to problematic situations experienced in dialogues. They should mainly be used to spot issues to be fixed within the service design, which leads to a new iteration of the conception to propose an alternative design.

On the other hand, work will be pursued on performance indicators to be used either for A/B testing of alternatives or for automatic reinforcement learning integrated in the SDS. In (Laurent et al., 2009) we suggested a change in evaluation paradigm that consists in measuring the user's conformance to the system in the system's referential instead of evaluating the system with respect to the user's referential. This change suggests selecting the indicators according to the service specifications, and not according to a presupposed user satisfaction. Performance indicators will be derived from the targeted functionalities of the service.

This work on evaluation will face many issues. Among others, each intuitively selected metric will have to be challenged carefully with regard to the performance objectives and user practices. We can also think about the problem of dispassion: even if we deal with quantitative metrics, to what extent can the designed evaluation paradigms be objective? Last, we might have to handle the problem of commensurability and portability of the evaluation paradigms, a question notably raised by (Paek, 2007).

1.3 Evaluation for learning

Beyond the use of evaluation for the comparison of alternatives, an interesting stake would be to broaden its role to the identification of clues for the design of services. I anticipate two perspectives to address this idea. The first could be to validate intuited best practices in a certain number of distinct contexts so that they can be trusted for future developments. The second would be a procedure that, on the basis of evaluation output, spots the element(s) to be modified within the service design. Can evaluation help to recognise the causes of failure or success within a dialogue?

As a result, if these perspectives yield results and we add them up, we could arrive at valuable tools to industrialise the dialogue design process.

2 Future of Spoken Dialogue Research

• A great challenge to cope with is undoubtedly speech recognition performance. Progress might come from new STT technologies, such as the one used by SpinVox, which still have to adapt to the real-time needs of SDS services. In any case, we can anticipate a small revolution in the way we apprehend dialogue once recognition capabilities skyrocket!

• Beyond recognising the words pronounced by the user, we can hope for systems' future ability to recognise nonverbal elements within the user's utterances. Prosody or rhythm could strongly help the service both in understanding the user's intentions and in synchronising with the user's pace.

• Last, I believe that another change to pay attention to is the evolution in the way the mainstream perceives SDS solutions. Will they always be compared to the pre-existing human operator's services, or will they one day be seen as just another human-machine interface, as the web is? This could change the definition given to naturalness in interaction, shifting from a search for human-like behaviour to an attention to interface usability.

3 Suggestions for Discussion

• Data-driven analysis: To what extent can human-human dialogue corpora be helpful for the study of human-machine dialogue?

• Evaluation: What issues are triggered by the absence of dispassion in evaluating SDSs? Is dispassion of the evaluation a goal in itself?

• Limitations of Spoken Dialogue Systems: To what extent could SDS performance be enhanced by taking into account additional aspects of the users' utterances that are ignored today (prosody, paralanguage, etc.)?

References

Marianne Laurent, Ghislain Putois, Philippe Bretier, and Thierry Moudenc. 2009. Nouveau paradigme d'évaluation des systèmes de dialogue humain-machine. In TALN.

Sebastian Möller. 2005. Quality of Telephone-based Spoken Dialogue Systems. Springer.

Tim Paek. 2007. Toward evaluation that leads to best practices: Reconciling dialogue evaluation in research and industry. In Proc. of the NAACL-HLT Workshop on Bridging the Gap: Academic and Industrial Research in Dialog Technologies.

Biographical Sketch

Marianne joined Orange Labs in Lannion, France, as a Ph.D. student in 2008, her thesis being supervised by Philippe Bretier (Orange Labs) and Ioannis Kanellos (Télécom Bretagne). After a Master in Management from Grenoble École de Management, she graduated in 2007 with a Master of Engineering from Télécom Bretagne. Her studies, both technical and managerial, mainly focused on the management of enterprise information systems. She started her career in IS project management with Thales in the UK before she joined the dialogue team of Orange.



Beatriz López Mencía
Universidad Politécnica de Madrid
Ciudad Universitaria s/n
28040 Madrid
Spain

[email protected]

1 Research Interests

I am pursuing a research career in the field of Human-Computer Interaction, and my special interest is in the design and usability evaluation of multimodal interfaces and embodied conversational agents. My thesis topic explores the use of embodied conversational agents (ECAs) and their visual communicative ability to alleviate typical interaction problems and improve the users' overall experience.

1.1 Past and ongoing research

The main line of research I have been following at the Grupo de Procesado de Señales (GAPS), Universidad Politécnica de Madrid (UPM), is the comparative study of a variety of contexts of multimodal human-machine interaction, with a special focus on embodied conversational agents (ECAs). For this purpose we have built several platforms to test interaction through various modalities in the context of use imposed by different applications, including biometric identification and remote access to home devices. My primary goal is to improve the robustness of human-computer communication and to foster a better user experience, and to explore how ECAs might be useful in these respects.

Firstly, we focused on evaluating multimodal biometric systems, specifically an identity verification system using voice recognition. We looked for effects on system usability, efficiency and user acceptance related to the presence of an embodied conversational agent (ECA). We designed a spoken dialogue interface which incorporates an animated figure making gestures tuned to specific verification dialogue stages. The results obtained suggest that, while interaction with the animated agent was more efficient and enjoyable, users had stronger privacy concerns (Hernández et al., 2007).

A following study focused on robustness problems with spoken language dialogue systems (SLDSs). Adding a visual channel with an animated character that personifies the artificial system could make it possible to convey communication cues through body and facial gestures which might help the user follow the interaction more easily (e.g., by displaying metacognitive gestures suggestive of the mood and perhaps also of the content of what the ECA intends to communicate, reinforcing the information conveyed verbally). For this purpose we identified typical interaction problems with SLDSs and associated with each of them a particular ECA gesture or behaviour. The goal was to keep users in a positive frame of mind when recognition errors occurred. We also tested a proxemic 'code' to mark dialogue turns in order to make the interaction more fluent and the users more confident (López Mencía et al., 2008).
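The design idea of pairing each typical SLDS problem with an ECA behaviour can be pictured as a simple lookup, as in the toy sketch below; the problem labels and gesture names are invented for illustration and do not reproduce the inventory used in the study.

# Toy mapping from recognition events to ECA behaviours.
GESTURE_FOR_PROBLEM = {
    "no_input":       "lean_forward_raise_eyebrows",     # invite a repeat
    "no_match":       "puzzled_tilt_apologetic_smile",
    "low_confidence": "hesitant_nod_while_confirming",
    "barge_in":       "step_back_yield_turn",            # proxemic turn cue
}

def select_behaviour(event):
    # Returns (gesture, verbal strategy) for a given interaction problem.
    gesture = GESTURE_FOR_PROBLEM.get(event, "neutral_listening_pose")
    strategy = "explicit_confirmation" if event == "low_confidence" else "reprompt"
    return gesture, strategy

print(select_behaviour("no_match"))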

Our evaluation work focused on special features that should be taken into account with a face-to-face communication system. This time we tested an application to remotely control home devices (i.e., a domotic system). Our evaluation scheme was chiefly based on ITU standards for the evaluation of (strictly) spoken dialogue systems (ITU, 2005). Our test results revealed that ECA users scored better on certain important objective performance parameters. We emphasised the need for in-depth factor analyses of the users' subjective experience with face-to-face dialogue systems. The results suggested that rejection is a dimension in its own right that may strongly influence subjective evaluation parameters closely related to user acceptance. Some of the extracted 'subjective' dimensions show that our ECA seems to help users to better understand the flow of the dialogue and to reduce confusion.

My present research is focused on annotation procedures, data processing and statistical analysis of data (user opinion and system performance). Another ongoing research line is joint work between UPM and several schools for disabled children in Madrid. This work is based on the collaborative design, development and evaluation of applications that might reinforce learning in children with various disabilities (autism, cerebral palsy, etc.). We hope that these children may benefit from learning improvements in specific areas and increase their learning motivation by using new advanced technologies (such as ECAs). At present we are working on the evaluation of a specific application using ECA technology and webcams in order to boost the learning of emotions in physically and cognitively disabled children with severe speech and motor limitations.



2 Future of Spoken Dialogue Research

A new generation of multimodal interactive systems is in the making that promises to boost user context awareness and multidevice interaction. I believe, along this line, that we should take advantage of new communication channels such as the visual one. The integration of a visual channel into spoken dialogue systems could enrich the interaction. An interesting possibility is the incorporation of embodied conversational agents into spoken dialogue systems using the visual channel. These ECAs may contribute to boosting the social abilities of the system and also to humanising the system in specific applications (learning with children, applications for older people, etc.) in which these aspects are very important. A visual communication channel could also be exploited to carry supralinguistic information regarding affective interaction components (expectations, intentions, mental processes and emotions). The goals of interaction could then become less task- and domain-specific and more open and social in nature, perhaps even allowing meaningful long-term relationships to develop between the user and an animated agent that shows some sort of 'personality' (López Mencía et al., 2008). Another research challenge is to design interaction adapted to users' needs, offering accessibility for all users. Minimising interaction errors is also a key aspect of user experience that should be dealt with. A visual channel would introduce new problems, however, such as the possibility of misinterpreting the other party's gestures. In any case, it seems worthwhile to explore the possibilities that interaction so enriched may offer. In order to tackle these challenges, standardisation work is very important. Defining standard evaluation parameters would allow comparing systems in terms of their usability and user-centred quality parameters. Also, research in standards and the implementation of nonverbal information (e.g. generation of nonverbal output information (EML, 2008)) would be desirable for the future of spoken dialogue research.

3 Suggestions for Discussion

Some suggestions for discussion are the following:

• How can we identify interaction errors in a spoken dialogue system enriched with a visual communication channel?

• Definition of metrics for evaluating spoken dialogue systems with a face-to-face communication channel.

• How does the way nonverbal behaviour is implemented influence the interaction? Is there a compromise that has to be reached between attaining naturalness in nonverbal behaviour and obtaining good values in dialogue metrics for performance?

• Evaluation: How to define standard procedures to collect corpora? What methodologies are most appropriate for analysing objective and subjective data (e.g., questionnaire responses)?

References

A. Hernández, B. López, D. Díaz, R. Fernández, L. Hernández, and J. Caminero. 2007. A person in the interface: effects on user perceptions of multibiometrics. In Workshop on Embodied Language Processing, at the 45th Annual Meeting of the Association for Computational Linguistics, pp. 33–40, ACL, Prague.

B. López Mencía, D. Díaz de Vera, A. Hernández Trapote, M. C. Rodríguez Gancedo, J. Relaño, and L. Hernández Gómez. 2008. Evaluation of ECA Gesture Strategies for Robust Human-Computer Interaction. International Journal of Semantic Computing, Vol. 2, No. 1: 1–24.

ITU. 2005. ITU-T Suppl. 24 to P-Series Rec.: Parameters Describing the Interaction with Spoken Dialogue Systems. International Telecommunication Union, Geneva.

EML. 2008. Elements of an EmotionML 1.0. W3C Incubator Group Report, November. http://www.w3.org/2005/Incubator/emotion/XGR-emotionml-20081120/

Biographical Sketch

Beatriz López Mencía has an MEng in telecommunications engineering from Universidad Politécnica de Madrid. She is currently a 3rd-year PhD student and holds a research position at the same university, under the supervision of Prof. Luis Hernández.



Syaheerah L. Lutfi
Speech Technology Group (GTH)
Department of Electronic Engineering
Universidad Politécnica de Madrid
Spain

[email protected]

1 Research Interests

Human-machine interaction in general, and affective computing specifically, is my field of interest. Currently I am involved in a project for modelling affect in a robotic agent for domestic purposes. On a smaller scale, I am interested in researching and developing culture-sensitive emotive interfaces. This is motivated by the fact that the perception and expression of emotions vary among people from different cultures, and therefore designs of emotive computer interfaces should be tailored to conform to different needs according to cultural background.

1.1 Previous Experiences

Emotional prosody modelling for a Malay TTS

I became involved with speech studies when I worked on my Masters dissertation. I worked on a project in collaboration with the R&D company MIMOS, Malaysia, to incorporate emotions into a Malay TTS. It involved the incorporation of an affective component into the Malay TTS system, in order to produce a system that is more expressive in nature. I introduced a new template-driven method for generating expressive speech by embedding an 'emotion layer' called eXpressive Text Reader Automation, abbreviated as eXTRA. The module is an independent component that can serve as an extension to any Malay TTS system that uses the Multiband Resynthesis Overlap Add (MBROLA) engine for diphone concatenation. Details can be found in (Lutfi et al., 2006).
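Since eXTRA sits on top of MBROLA, its template-driven idea can be sketched as a post-processor of MBROLA .pho input, where each line holds a phoneme, a duration in milliseconds and optional (position %, pitch Hz) target pairs. The scaling factors below are invented placeholders, not the actual eXTRA templates.

# Minimal 'emotion layer' sketch over MBROLA .pho lines.
EMOTION_TEMPLATES = {
    "happy": {"dur": 0.85, "pitch": 1.15},   # faster, higher
    "sad":   {"dur": 1.25, "pitch": 0.90},   # slower, lower
}

def apply_emotion(pho_lines, emotion):
    t = EMOTION_TEMPLATES[emotion]
    out = []
    for line in pho_lines:
        parts = line.split()
        if len(parts) < 2 or line.startswith(";"):   # keep comments/blanks
            out.append(line)
            continue
        phoneme, dur, targets = parts[0], float(parts[1]), parts[2:]
        new = [phoneme, str(int(dur * t["dur"]))]
        for i, tok in enumerate(targets):            # (position, pitch) pairs
            new.append(str(int(float(tok) * t["pitch"])) if i % 2 else tok)
        out.append(" ".join(new))
    return out

print(apply_emotion(["a 120 50 110", "_ 80"], "happy"))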

Culture-sensitive TTS

It is vital to ensure that intelligent interfaces are also equipped to meet the challenge of cultural diversity. Studies show that the expression and perception of emotions may vary from one culture to another (Silzer, 2001) and that localised synthetic speech of software agents, for example, from the same ethnic background as the interactor is perceived to be more socially attractive and trustworthy than that from different backgrounds (Nass and Lee, 2002). Based on these studies and personal experiences, we realised that it is crucial to infuse a more familiar set of emotions into a TTS system whose users are natives. We further worked on establishing a localised TTS by concentrating on the culture-specific manner of speaking and choice of words when in a certain emotional state. Evaluations of the localised TTS that we established show that the risk of evoking confusion or negative emotions such as annoyance, offence or aggravation from the user is minimised (Syaheerah et al., 2008).

1.2 Recent work

Emotion Identification in Speech

In my first year of PhD I worked on identifying emotions from speech for a couple of class projects. One of them was concerned with obtaining the best parametric model for emotion recognition, based on a Hidden Markov Model (HMM) classifier. The optimised parameters for the emotion identification task were determined empirically, across two representations of the observations, mel-frequency cepstral coefficients (MFCC) and Perceptual Linear Prediction (PLP), and were improved using well-known normalisation techniques. A more important finding showed that certain representations were better or more precise at identifying certain types of emotion than others. This and other findings from the experiments are discussed in (Syaheerah et al., 2009). In the article, it is also proposed that the findings could be applied to speech-based affect identification systems as the next generation of biometric identification systems aimed at determining a person's 'state of mind', or psycho-physiological state.
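The classic setup behind such experiments is one HMM per emotion, with a test utterance assigned to the model giving the highest likelihood. The sketch below shows that scheme with MFCC features, assuming the hmmlearn and librosa libraries; the number of states, the 16 kHz sampling rate and the corpus layout are placeholder choices, not those of the original study.

import numpy as np
import librosa
from hmmlearn import hmm

def mfcc_seq(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # frames x 13

def train_models(train_files):
    # train_files: dict mapping emotion label -> list of wav paths
    models = {}
    for emotion, paths in train_files.items():
        seqs = [mfcc_seq(p) for p in paths]
        X, lengths = np.vstack(seqs), [len(s) for s in seqs]
        m = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=20)
        m.fit(X, lengths)
        models[emotion] = m
    return models

def classify(models, wav_path):
    obs = mfcc_seq(wav_path)
    return max(models, key=lambda e: models[e].score(obs))  # max log-likelihood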

Emotional TTS

In GTH we also work on emotional Spanish TTS and have participated in a number of expressive TTS competitions, such as the Spanish Albayzín evaluation and the INTERSPEECH Emotion Challenge 2009 (Barra-Chicote et al., 2009).

1.3 Current work

Currently I am involved in modelling emotion for a task-independent robotic agent by integrating a module of needs. Based on theories which adhere to the idea that an affect system is influenced by a motivation system, we propose an emotion framework for a task-independent autonomous agent by integrating a module of needs. The need framework is based on Abraham Maslow's motivation theory (hierarchy of needs). What makes the approach different from other appraisal-based approaches is the incorporation of the need layers as a module that functions as a decomposer of task-specific events according to their importance and urgency. The incorporation of this module has two advantages: scalability in terms of tasks and needs, whereby the agent's tasks and needs can be added or appropriately changed to suit various application domains, and a priority mechanism on tasks in a multi-tasking environment.
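One way to picture the needs module is as a priority queue in which an event's rank combines the Maslow layer it touches with its urgency, as in the sketch below; the layer weights, event fields and example tasks are invented for illustration.

import heapq

NEED_LAYERS = {                # lower rank = more basic = more important
    "physiological": 0,        # e.g. battery level
    "safety": 1,               # e.g. obstacle avoidance
    "social": 2,               # e.g. engaging the user
    "esteem": 3,
    "self_actualisation": 4,
}

def priority(event):
    # More basic layers and higher urgency (0-1) come first.
    return NEED_LAYERS[event["need"]] - event["urgency"]

class TaskQueue:
    def __init__(self):
        self._heap, self._n = [], 0
    def push(self, event):
        self._n += 1                       # tie-breaker keeps insertion order
        heapq.heappush(self._heap, (priority(event), self._n, event))
    def next_task(self):
        return heapq.heappop(self._heap)[2]["task"]

q = TaskQueue()
q.push({"task": "greet_user", "need": "social", "urgency": 0.6})
q.push({"task": "recharge", "need": "physiological", "urgency": 0.9})
print(q.next_task())                       # -> recharge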

One of my future plans and hopes is to integrate a culture module into this emotional framework that would personalise the agent according to the target user's socio-cultural background.

2 Future of Spoken Dialogue Research

Despite the strong evidence for cross-cultural consistency in emotion appraisal processes and perceptions, there are cultural differences as well. Affective applications that are developed with a general framework of emotion according to Western standards (individualistic cultures) may not be suitable for adoption in an environment belonging to a collectivist culture (the East). A subtle level of awareness of differences in spoken dialogue (and also in other modalities) is vital to go beyond stereotypes of how a particular affective conversational agent might differ from the mainstream. I believe analysis focusing on socio-cultural grounding would help with modelling the dialogue framework or other aspects of affective agents to increase cross-cultural acceptance.

3 Suggestions for Discussion

• Culturally-sensitive dialogue design: what are the factors to be identified in designing it? Is there structured information available?

• Evaluation design: There is no gold standard for evaluating the believability aspects of emotive agents. How should one be designed?

• Virtual/robotic agents: Do people really prefer human-like agents to characters with a simpler appearance, such as line-drawn cartoon characters?

References

C. Nass and K. Lee. 2002. Does Computer-Synthesized Speech Manifest Personality? Experimental Tests of Recognition, Similarity-Attraction, and Consistency-Attraction. Journal of Experimental Psychology: Applied, 7 (3), pp. 171–181.

Roberto Barra-Chicote, Fernando Fernández, Syaheerah Lutfi, Juan Manuel Lucas-Cuesta, Javier Macías-Guarasa, Juan Manuel Montero, Rubén San-Segundo, and José Manuel Pardo. 2009. Acoustic Emotion Recognition Using Dynamic Bayesian Networks and Multi-Space Distributions. In Procs. of Interspeech 2009, Brighton, UK.

P. J. Silzer. 2001. Miffed, Upset, Angry or Furious? Translating Emotion Words. In Procs. of ATA 42nd Annual Conference, pp. 1–6, Los Angeles, CA, Oct 31–Nov 3.

Syaheerah L. Lutfi, Raja N. Ainon, and Zuraidah M. Don. 2006. Expressive Text Reader Automation Layer (eXTRA): Template-driven 'Emotion Layer' in Malay Concatenated Synthesized Speech. In Procs. of Oriental COCOSDA 2006, Penang, Malaysia.

Syaheerah L. Lutfi, J. M. Montero, J. M. Lucas, and R. Barra-Chicote. 2009. Expressive Speech Identifications Based on Hidden Markov Model. In Procs. of HEALTHINF 2009, pp. 488–493, Porto, Portugal, Jan 14–17.

Syaheerah L. Lutfi, J. M. Montero, Raja N. Ainon, and Zuraidah M. Don. 2008. eXTRA: A Culturally Enriched Malay Text to Speech System. In Procs. of AISB 2008, pp. 77–83, Aberdeen, Scotland, Dec 9–11.

Biographical Sketch

Syaheerah Lutfi is a PhD candidate at Universidad Politécnica de Madrid (UPM), Spain, under a MOHE/USM scholarship. Previously she obtained a BSc in Computing from Bolton University, UK, and a Masters in Software Engineering from University Malaya (UM), Malaysia, under the USM scholarship. She enjoys travelling and meeting people as the cultures of the world continue to amaze her. She is blessed with a loving husband and two beautiful children, Waheedah and Mash'al.



François Mairesse
Department of Engineering
University of Cambridge
Trumpington Street
Cambridge CB2 1PZ

[email protected]/˜farm2

1 Research Interests

My research interests lie generally in models of individual differences in language, with a special focus on stylistic natural language generation, expressive speech synthesis and natural language understanding.

1.1 Past work

I've recently completed my Ph.D. thesis on statistical models for detecting and conveying linguistic variation in dialogue systems. Although there are many ways to express any given content, most dialogue systems do not take linguistic variation into account in either the understanding or the generation phase, i.e. the user's linguistic style is typically ignored, and the style conveyed by the system is chosen once for all interactions at development time.

Over the past few years, psychologists have identified the main dimensions of individual differences in human behaviour: the Big Five personality traits (Norman, 1963). The Big Five traits are hypothesised to provide a useful computational framework for modelling important aspects of linguistic variation. My thesis first explores the possibility of recognising the user's personality using data-driven models trained on essays and conversational data (Mairesse et al., 2007). I then tested whether it is possible to generate language varying consistently along each personality dimension in the information presentation domain. I implemented PERSONAGE: a language generator modelling findings from psychological studies to project various personality traits (Mairesse and Walker, 2007). PERSONAGE was used to compare various generation paradigms: (1) rule-based generation, (2) overgenerate and select, and (3) generation using parameter estimation models that learn to produce recognisable variation without the computational cost incurred by overgeneration techniques. These generation methods were evaluated based on human judgements, showing that human judges can detect the personality conveyed by the system's utterances, even if multiple traits are projected simultaneously (Mairesse and Walker, 2008).
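The overgenerate-and-select paradigm in (2) can be sketched as below: sample many parameter settings, realise an utterance for each, and keep the candidate whose estimated trait value is closest to the target. The parameter names and the pluggable realiser/scorer are placeholders and do not reflect PERSONAGE's actual decision space.

import random

PARAM_BOUNDS = {          # illustrative generation parameters in [0, 1]
    "verbosity": (0.0, 1.0),
    "exclamation": (0.0, 1.0),
    "hedging": (0.0, 1.0),
}

def overgenerate_and_select(realise, score_trait, target, n=200, seed=1):
    # realise(settings) -> utterance; score_trait(utterance) -> value in [1, 7]
    rng = random.Random(seed)
    best, best_dist = None, float("inf")
    for _ in range(n):
        settings = {k: rng.uniform(lo, hi) for k, (lo, hi) in PARAM_BOUNDS.items()}
        utt = realise(settings)
        dist = abs(score_trait(utt) - target)
        if dist < best_dist:
            best, best_dist = (utt, settings), dist
    return best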

As far as natural language understanding is concerned, I recently presented a semantic decoding method that can learn to predict semantic trees from unaligned data, i.e. without any word-level semantic annotation, by recursively combining support vector machine classifiers predicting semantic concepts from n-gram features computed over the whole utterance. As unaligned data is cheaper to annotate, this method reduces dialogue system development time compared with techniques requiring Treebank-style annotations. This technique was shown to perform as well as state-of-the-art semantic parsers (Mairesse et al., 2009).
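A much-simplified version of the per-concept prediction step is sketched below: one linear SVM per semantic concept, trained on utterance-level n-gram counts with no word alignment. The recursive assembly of full semantic trees is omitted, and the domain, concept labels and toy data are invented for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

utterances = ["i need a cheap hotel in the centre",
              "an expensive restaurant please",
              "somewhere cheap to eat"]
# utterance-level concept labels only, no word-level annotation
labels = {"price=cheap": [1, 0, 1], "type=hotel": [1, 0, 0]}

vec = CountVectorizer(ngram_range=(1, 2))          # unigrams and bigrams
X = vec.fit_transform(utterances)
concept_clfs = {c: LinearSVC().fit(X, y) for c, y in labels.items()}

def decode(utterance):
    x = vec.transform([utterance])
    return [c for c, clf in concept_clfs.items() if clf.predict(x)[0] == 1]

print(decode("a cheap hotel please"))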

1.2 Current and future work

I recently joined Cambridge's Dialogue Systems Group in order to develop statistical models for semantic parsing and language generation, and to test these components within a fully trainable dialogue system. This work is part of the CLASSiC project funded by the European Union. In the future, I am planning to combine stylistic generation with expressive text-to-speech synthesis, in order to produce coherent and convincing linguistic variation in dialogue systems.

2 Future of Spoken Dialogue Research

Spoken dialogue systems are still a long way from being widely adopted by the population, mostly because the current level of performance of the understanding components does not match users' expectations, and because the generation side produces repetitive, limited outputs. Whereas a lot of research currently focuses on the understanding side, modelling linguistic variation can greatly improve the interaction in dialogue systems, such as intelligent tutoring systems, video games, or information retrieval systems, which all require specific linguistic styles.

I thus believe that the future of dialogue system research is to produce systems that can control that variation in a scalable way, both for understanding all the pragmatic nuances of the user input and for conveying realistic outputs. This scalability will require moving towards data-driven approaches at all levels of output generation, i.e. controlling the level of initiative, the language generation process and the speech synthesis module in a consistent way. While previous research has produced highly trainable components (e.g. speech recognisers, text-to-speech engines), young researchers will need to focus on how to make the language understanding, dialogue management and language generation components re-usable across domains.

3 Suggestions for Discussion

• Given an infinite amount of data and computational power, could dialogue systems be indistinguishable from human beings? And should they be?

• Dialogue system personalisation: is it important to model the system's output style, and if so what kind of style or personality should the system convey for different dialogue applications (e.g. intelligent tutoring systems, tourist information presentation, financial information retrieval)?

• In the near future, will data-driven methods remove the need for rule-based knowledge in all dialogue system components?

References

François Mairesse and Marilyn A. Walker. 2007. PERSONAGE: Personality generation for dialogue. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), pages 496–503.

François Mairesse and Marilyn A. Walker. 2008. Trainable generation of Big-Five personality styles through data-driven parameter estimation. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL).

François Mairesse, Marilyn A. Walker, Matthias R. Mehl, and Roger K. Moore. 2007. Using linguistic cues for the automatic recognition of personality in conversation and text. Journal of Artificial Intelligence Research (JAIR), 30:457–500.

F. Mairesse, M. Gašić, F. Jurčíček, S. Keizer, B. Thomson, K. Yu, and S. Young. 2009. Spoken language understanding from unaligned data using discriminative classification models. In Proceedings of ICASSP.

W. T. Norman. 1963. Toward an adequate taxonomy of personality attributes: Replicated factor structure in peer nomination personality ratings. Journal of Abnormal and Social Psychology, 66:574–583.

Biographical Sketch

François Mairesse is a research associate at the University of Cambridge, working in the Machine Intelligence Lab as part of the EU CLASSiC project. He recently completed his Ph.D. thesis at the University of Sheffield, supervised by Prof. Marilyn Walker, focusing on models of individual differences in dialogue systems. The thesis investigates techniques for the recognition of the personality of the user as well as the control of the personality conveyed by the system. In 2006, he did an internship at AT&T Labs in Florham Park working on paraphrase acquisition from web reviews. In 2004, he obtained a Master's degree in Computer Science and Engineering from the Université Catholique de Louvain in Belgium.



Matthew Marge
Carnegie Mellon University
Language Technologies Institute
School of Computer Science
Pittsburgh, PA 15213

[email protected]/˜mrmarge

1 Research Interests

My research interests lie in the domain of spoken dialogue systems for human-robot interaction. As part of the TeamTalk project, I work on how multimodal robots can better interpret spatial language and their environment (Marge et al., 2009). Our project's domain requires that humans and robots work together on a "treasure hunting" task. I rely on the principle of grounding for my research efforts. Currently, I am exploring the capabilities of spatial perspective-taking in human-robot dialogue.

1.1 Spoken Dialogue Systems for Surveys

My undergraduate research included building a spoken dialogue system for course-based surveys with the assistance of colleagues in the Computer Science Department at Stony Brook University. My chief responsibilities were to design, implement, and evaluate an automated course evaluation system called the "Rate-A-Course System" (Stent et al., 2006). This system allowed people to set their own goals for the conversation, such as how many topics to discuss about a course and in what order to discuss them. For my honors thesis, I designed and conducted experiments with human participants to study how speech-driven dialogues can adapt to users. Our project goal was to understand what dialogue designs were most effective when interacting with users that have specific goals in conversation.

1.2 Adaptive Human-Robot Dialogue

In past research, I investigated the cognitive and social aspects of robotics at the Carnegie Mellon University Human-Computer Interaction Institute. I studied how robots might adapt to the existing knowledge of novices and experts via dialogue interaction. Pearl, an interactive robot from the CMU Robotics Institute, was used in our experiments. We investigated whether, given a topic of conversation such as cooking, Pearl should use technical terms with experts but longer explanations of those terms with novices. One of my responsibilities was to enhance Pearl's existing adaptive dialogue system. I did this by increasing the number of appropriate responses Pearl could give to participants' questions about cooking tools. When we compared responses of novices and experts, we found that novice cooks appreciated Pearl and performed best when it gave detailed explanations of the tools rather than the tool names alone (Torrey et al., 2006). By contrast, expert cooks found Pearl patronising when it gave them detailed explanations of the tools.

As an intern at the Naval Research Laboratory, I developed a scenario for disambiguating dialogue in a human-robot team (Fransen et al., 2007). In this scenario, two people are directing a mobile robot to retrieve an item from a dangerous area. In order to improve the ability of the robot to disambiguate dialogue spoken by the team members, I worked with graduate students to integrate several functionalities into the robot, including gesture recognition, sound localisation, and natural language understanding.

1.3 Spatial Human-Robot Dialogue

Currently, I am working with other members of the TeamTalk project and the larger Boeing Treasure Hunt project under the supervision of Dr. Alex Rudnicky. This project investigates how robots can better collaborate with humans using speech and dialogue, with the goal of finding "treasures" in a real-world location. I helped prepare the current version of the virtual TeamTalk system, along with a research associate and a summer intern. This simulation, built using the USARSim system, provides us with a vehicle to test our spoken dialogue system without the need to manage actual robots (Balakirsky et al., 2006). The system is built using the RavenClaw/Olympus dialogue architecture (Bohus et al., 2007).

I am currently exploring the capabilities of spatial perspective-taking in human-robot dialogue. We want to find out how spatial perspective-taking, both in reference to members of a human-robot team and to objects in the environment, can be incorporated into the current TeamTalk platform. This included developing a small scenario that we could begin to study. I am also learning more about the dialogue concept of grounding for this work.

Our exploration of spatial perspective-taking in human-robot dialogue has led us to design an experiment to assess how humans give simple dialogue commands in reference to members of a human-robot team. We have developed this experiment in order to conduct it formally with human participants. I have also conducted a literature review of spatial reasoning, spatial language, and their applications in human-robot interaction.

We are currently expanding the experimental stimuli from snapshots of scenarios to real-time interaction with the TeamTalk virtual system. Once we have settled on a spatial reasoning component that can refer to team members in a scenario, I plan to extend the component to reference objects in the environment of the human-robot dialogue scenario. This will require developing a basic ontology of the environment that will be shared by humans and robots in the scenario. We will focus on using TeamTalk's virtual system for developing and testing this component.

2 Future of Spoken Dialogue Research

Interactive spoken dialogue applications will steadily increase in popularity over the next decade. This is dependent upon public acceptance of these systems, which have often frustrated the general public due to poor design and implementation. We should see a decrease in the number of call centers answering customer service phone lines, for example. I also expect that as the field of robotics expands, dialogue will become an increasingly greater need given the wide range of tasks that robots can perform.

Our generation of spoken dialogue researchers should be able to develop formalised, accepted methods for evaluating spoken dialogue systems. Also, our generation should be able to analyse and improve the fine-grained aspects of human-computer dialogue to improve everyday interactions with public users. I think that grounding will become an even more ever-present concept that dialogue researchers must rely on when designing effective spoken dialogue systems. Attaining these goals requires that formal user studies be performed with publicly accessible spoken dialogue systems. In addition, these goals will require our generation of dialogue researchers to educate incoming students in our departments about the importance of spoken dialogue applications and the need for more researchers in the field.

3 Suggestions for Discussion

• Cognitive plausibility: What is the cognitive plausibility of existing spoken dialogue systems? Do they manage dialogue in a way that the cognitive science community would find acceptable?

• Environment interaction: How can we have spoken dialogue systems better process changing information about the environments of their applications?

• Appealing to the next generation: What kind of applications should we discuss with new members of our departments to excite them about dialogue research? Should we be discussing new "killer apps"?

References

S. Balakirsky, C. Scrapper, S. Carpin, and M. Lewis. 2006. USARSim: providing a framework for multi-robot performance evaluation. In Performance Metrics for Intelligent Systems Workshop (PerMIS), pages 98–102.

Dan Bohus, Antoine Raux, Thomas K. Harris, Maxine Eskenazi, and Alex Rudnicky. 2007. Olympus: an open-source framework for conversational spoken language interface research. In HLT-NAACL 2007 Workshop on Bridging the Gap: Academic and Industrial Research in Dialog Technology.

B. Fransen, V. Morariu, E. Martinson, S. Blisard, M. Marge, S. Thomas, A. Schultz, and D. Perzanowski. 2007. Using vision, acoustics, and natural language for disambiguation. In 2nd ACM/IEEE Conference on Human-Robot Interaction, pages 73–80.

M. Marge, A. Pappu, B. Frisch, T. K. Harris, and A. Rudnicky. 2009. Exploring spoken dialog interaction in human-robot teams. In Robots, Games, and Research: Success Stories in USARSim Workshop.

Amanda Stent, Svetlana Stenchikova, and Matthew Marge. 2006. Dialog systems for surveys: The Rate-A-Course system. In IEEE Spoken Language Technology Workshop, pages 210–213.

Cristen Torrey, Aaron Powers, Matthew Marge, Susan Fussell, and Sara Kiesler. 2006. Effects of adaptive robot dialogue on information exchange and social relations. In 1st ACM SIGCHI/SIGART Human-Robot Interaction Conference, pages 126–133.

Biographical Sketch

Matthew Marge is a PhD student in the Language Technologies Institute at Carnegie Mellon University. He received the MSc degree in Artificial Intelligence, specialising in Natural Language Engineering, from the University of Edinburgh in 2007. Prior to graduate school, he received the BSc degree in Computer Science and Applied Mathematics from Stony Brook University in 2006. He is currently funded by the Boeing Treasure Hunt Project and a National Science Foundation Graduate Research Fellowship. Matthew is a former recipient of a St. Andrew's Society of the State of New York Scholarship and an Edinburgh-Stanford Link studentship. Some of his hobbies include squash, bowling, and travelling.



Lluís Mas Manchón
Universitat Autònoma de Barcelona
Edifici I, Facultat de CC de la Comunicació
Bellaterra 08193
Barcelona, Spain

[email protected]

1 Research Interests

My research interests are not in the field of engineering itself, but in how to optimise the use of engineering from a communication-centric perspective. My goal is thus to become part of an interdisciplinary team in which I can give important insights about how agents in the communication process use the variables of the message for successful communication. Specifically, my interest is focused on the prosody of TV news broadcasts (Mas, 2008), which could, together with keyword spotting, be used to segment them into individual news items. However, my interest is also more general and theoretical: "taking the communication process as a whole, the objectives of the emitter are always represented in some variables of the message, on which the attention of the receiver also focuses; this can be very valuable for engineers, who can then deal exclusively with those variables" (Rodríguez Bravo, 1989, 2004).

This could make the processing of every message much more accurate and less resource-wasting. Even research in human-computer communication would benefit from an approach that prioritises the functions of variables in the process of communication: a phoneme, word or phrase is to be considered in terms of the functions it serves in the discourse. My interest would be to become part of interdisciplinary teams working on a wide range of topics: emotion synthesis, word spotting, segmentation and artificial intelligence.

1.1 Current Research

My thesis (to be presented in April) focuses on finding prosodic parameters belonging to the ends and beginnings of news items, so that they can be used in an algorithm for automatic news segmentation. I did an internship in the Department of Applied Mathematics at the Universidad Politécnica de Cartagena, Spain, where my task was to design a processing algorithm to represent and analyse those parameters. For this purpose, LabVIEW was used.

The time-domain signal was broken into frames of 512 data points with no overlap between them, as the periodic spotting of F0 was the result of averaging approximate fragments. Then, the amplitude and phase spectrum (VRMS in LabVIEW) was applied twice to get the cepstrum. The process took volts and seconds and produced the root-mean-square amplitude in the frequency domain, so we could visualise the power spectrum at each moment. The value of F0 was finally isolated by multiplying by zero the whole range of power between points 25 and 200 of the spectrum, as the F0 we were targeting (correctly articulated vowels) was expected to peak in the small range left. This was verified iteratively with different signal ranges for the VRMS and for the maximum power spectrum search.

Meanwhile, intensity was easily obtained by integrating the squared power spectrum, and both the intensity and F0 values were placed along a time axis. From there on, the search for variables was a matter of using different LabVIEW tools: levels of F0 and energy in the different phases of the 'locution'.
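A generic NumPy re-implementation of this cepstral procedure might look as follows: non-overlapping 512-sample frames, a spectrum of the log spectrum, and a peak search restricted to the quefrency band where voiced F0 is expected. The 80-400 Hz search band and the Hamming window are common defaults, not the exact settings of the LabVIEW system described above.

import numpy as np

def cepstral_f0(signal, fs, frame_len=512, fmin=80.0, fmax=400.0):
    f0_track = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):  # no overlap
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        spectrum = np.abs(np.fft.rfft(frame))
        cepstrum = np.abs(np.fft.irfft(np.log(spectrum + 1e-10)))
        lo = int(fs / fmax)                       # quefrency search band
        hi = min(int(fs / fmin), len(cepstrum) - 1)
        peak = lo + np.argmax(cepstrum[lo:hi])
        f0_track.append(fs / peak)
    return np.array(f0_track)

def frame_intensity(signal, frame_len=512):
    # energy per frame, akin to integrating the squared power spectrum
    n = len(signal) // frame_len * frame_len
    return (signal[:n].reshape(-1, frame_len) ** 2).sum(axis=1)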

We drew on three models of inspiration to design the acoustical system:

1. Intonational Model of Speech (Cantero, 2002): considers each pitch datum in Hz as a percentage variation with respect to the previous one, always starting at level 100.

2. MOMEL (Espesser, 1993): fits parabolas to the pitch data, searching for the turning points: maxima and minima.

3. Autosegmental Model (Pierrehumbert, 1980): transcribes the tone movements in a symbolic language in which phonological features are related to phonetic ones.

Then, an algorithm for spotting pauses was designed to make use of the intensity: if the intensity fell below a certain level for half a second, the system would apply the prosodic algorithm that examines whether the intonation matches a typical end and beginning of a news item. Despite some confounding factors such as reporters, noise, signature tunes and other news formats, results show a 60% success rate in spotting the cuts between news items. This was achieved independently of any other factor, such as the channel broadcasting the programme, season, gender of the presenters and even language (the corpus is in Catalan, Spanish and Portuguese). Furthermore, when a set of 1000 keywords (taken from a project on automatic news topic classification) was utilised, the success rate in spotting news cuts increased to 80%. This implies that we have successfully found the typical intonation of the ending and beginning of a news locution, because most errors involve this intonation being used for emphasis in long and tedious news items.
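The pause-spotting rule can be sketched as below: a candidate cut is emitted wherever frame intensity stays under a threshold for about half a second, and each candidate is then filtered by the prosodic test. The threshold and the boundary test function are placeholders; the actual intonation test is the subject of the thesis and is not reproduced here.

def candidate_cuts(intensity, frame_dur, threshold, min_pause=0.5):
    # intensity: per-frame energy; frame_dur: frame duration in seconds
    cuts, run = [], 0
    for i, level in enumerate(intensity):
        run = run + 1 if level < threshold else 0
        if run * frame_dur >= min_pause:      # sustained low intensity
            cuts.append(i * frame_dur)        # candidate cut time (s)
            run = 0
    return cuts

def spot_news_cuts(intensity, f0_track, frame_dur, threshold,
                   is_news_boundary_prosody):
    # Keep only pauses whose surrounding intonation matches a typical
    # news-item ending followed by a beginning.
    return [t for t in candidate_cuts(intensity, frame_dur, threshold)
            if is_news_boundary_prosody(f0_track, t)]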

2 Future of Spoken Dialogue Research

According to the PENTA Model (Xu, 2004), every spoken signal should be analysed as the result of the functions it serves. As stated before, articulation possibilities, the objectives of the emitter and the effects on the receiver should all be related to the acoustic variations of the signal, so that engineers can design accurate algorithms that exclusively process the part of the signal that is relevant to some specific communication purpose.

So, in 5 to 10 years' time, interdisciplinary teams could accomplish a big improvement in applied spoken dialogue algorithms if they work with the research knowledge already at hand in their own disciplines as well as in others, in particular linguistics.

Taking the example of prosody, the knowledge from different fields on how every language modulates could be used for automatic language learning systems. Another level would involve the use of emotional intonation (already quite developed), plus the influence of prosody on the context and goal of the discourse. So, in the end, different levels of prosodic variables could be put together in an algorithm for intelligent recognition and synthesis systems, which could then, for example, process irony and sarcasm in a given language and under certain conditions.

3 Suggestions for Discussion

We propose the following topics for discussion:

• Paralinguistic phenomena in the phases of discourse: prosody of beginnings, developments and endings.

• Paralinguistic phenomena in discourse finality.

• Discourse analysis for human-computer systems (Swerts and Ostendorf, 1995; Taboada and Mann, 2006).

• Systemic and functional communication models.

References

Francisco José Cantero. 2002. Teoría y análisis de la entonación. Barcelona: Ed. Publicacions i Edicions UB.

Robert Espesser and Daniel Hirst. 1993. Automatic Modelling of Fundamental Frequency using a Quadratic Spline Function. Aix-en-Provence: Ed. IPA, Travaux de l'Institut de Phonétique d'Aix, 15.

Lluís Mas Manchón. 2008. Estructura de la prosodia de la noticia. In Actas y Memoria Final, Congreso Internacional Fundacional AEIC. Santiago de Compostela: Asociación Española de Investigación de la Comunicación.

Janet B. Pierrehumbert. 1980. The Phonology and Phonetics of English Intonation. PhD Thesis, MIT, Cambridge, MA, USA.

Ángel Rodríguez Bravo. 2004. La investigación aplicada: una nueva perspectiva para los estudios en comunicación. In Anàlisi: Quaderns de Comunicació i Cultura, 30, pp.17–36. Barcelona.

Marc Swerts and Mari Ostendorf. 1995. Discourse prosody in human-machine interactions. In Procs. of ESCA Workshop, Vigsø, Denmark.

Maite Taboada and William C. Mann. 2006. Applications of Rhetorical Structure Theory. In Discourse Studies, 8(4), pp.567–588. DOI: 10.1177/1461445606064836.

Yi Xu. 2004. The Penta Model of Speech Melody: transmitting multiple communicative functions in parallel. In From Sound to Sense, June 11–13, MIT.

Biographical Sketch

Lluís Mas graduated in Advertising and Public Relations from the University of Alicante (Spain) and is doing his PhD at the Universidad Autònoma de Barcelona. He is part of the Laboratorio de Análisis Instrumental de la Comunicación, which deals with tools for the objective analysis of image, text and sound, for experimental tests and/or content analysis, to create new knowledge with immediate applications in different fields.

Some of the projects he has worked on concern the synthesis of emotions, keyword spotting for news classification, the quality of TV programmes, and the structural analysis of news discourse and the prosody of discourse.


Toyomi Meguro
NTT Communication Science Laboratories, NTT Corporation
2-4 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0237, Japan

[email protected]/icl/lirg/members/meguro/index.html

1 Research Interests

My interests lie in building listening agents. Here, listening agents mean systems that can attentively listen to the user and satisfy his/her desire to speak and to be heard and understood. Such agents could change the user's state of mind for the better, as in a therapy session or in senior peer counselling, although I want listening agents to help users mentally in everyday conversation. I think everyone wants to talk to someone, and this desire can be partly satisfied by computational means.

There has been little research on listening agents. One exception is (Maatman et al., 2005), which showed that systems can give the user the sense of being heard by using gestures, such as nodding and shaking of the head. Although my work is similar to theirs, the difference is that I focus more on verbal communication instead of non-verbal communication.

As a first step for the study, I analysed the characteristics of listening-oriented dialogues by comparing them with casual conversation. Here, casual conversation means a dialogue where conversational participants have no predefined roles (i.e., listeners and speakers). The analysis using Hidden Markov Models (HMMs) found that it is important for listeners to self-disclose before asking questions, and that it is necessary to utter more questions and acknowledgments than in casual conversation to be a good listener (Meguro et al., 2009).
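
As a rough illustration of this kind of sequence analysis, the following sketch estimates dialogue-act transition probabilities (a plain first-order Markov view, much simpler than the HMMs used in (Meguro et al., 2009); the labels are invented):

```python
from collections import Counter, defaultdict

def transition_probs(dialogues):
    """Estimate P(next act | current act) from dialogue-act sequences."""
    counts = defaultdict(Counter)
    for acts in dialogues:
        for cur, nxt in zip(acts, acts[1:]):
            counts[cur][nxt] += 1
    return {cur: {nxt: n / sum(c.values()) for nxt, n in c.items()}
            for cur, c in counts.items()}

# Toy listening-oriented dialogue: self-disclosure before questions.
dialogues = [["self-disclosure", "question", "answer", "acknowledgment"]]
print(transition_probs(dialogues))
```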

I would like to further analyse the characteristics of listening-oriented dialogues using a larger amount of data. Then, I would like to incorporate the results of the analysis to build a workable listening agent. I am currently concerned with labelling schemes of dialogue acts for listening-oriented dialogues, how agents should introduce appropriate topics, what statistical models to adopt other than HMMs, and the evaluation of listening agents.

I am also interested in recommendation systems using dialogue. I think dialogue has the potential to act on one's mental attitude towards things (Fogg, 2002). I would like to create a system that finds and recommends the articles best suited to users through dialogue.

2 Future of Spoken Dialogue Research

Dialogue offers an intuitive and natural way of communication between users and systems. I would like this feature to be exploited more in the next 5 to 10 years. Currently, there are many sensors and mobile devices around us that make our communication easier. This trend will continue for many years to come. In the future, dialogue technology will come in between such devices and humans to activate human communication, so that quality of life (QoL) can be improved.

For example, under time and physical constraints, it is sometimes difficult for one to communicate with the person that he/she wants to talk to. If such a situation continues for a long time, there would be no shared topics between them, and they might lose their will/desire to communicate. A dialogue system that resides at one's side, listens to one's episodes and feelings, and delivers them to other people for sharing would contribute to enhancing human communication. Twitter may be useful for sharing one's feelings with others, but the messages are generally directed at anonymous people on the web. I would like dialogue technology to be used to strengthen the personal bond between certain people.

3 Suggestions for Discussion

• Integration of verbal and non-verbal communication: Currently, there is little work that focuses on the integration of modalities. What kind of architecture is appropriate for modality fusion? Can we learn anything from human science?

• Social aspects of dialogue: Humans perform dialogue for social reasons. Can we make dialogue systems recognise their social situation? What kind of information is necessary to make them produce appropriate social behavior?

• Personality of dialogue systems: Although there is some emerging work on assigning a personality to dialogue systems (Mairesse and Walker, 2008), current dialogue systems generally show little personality. Is a personality really necessary for a dialogue system to be our conversational partner? How can we build a model of personality for dialogue systems?

References

B. J. Fogg. 2002. Persuasive Technology: Using Computers to Change What We Think and Do. Morgan Kaufmann.

R. M. Maatman, Jonathan Gratch, and Stacy Marsella. 2005. Natural behavior of a listening agent. In Proceedings of the 5th International Working Conference on Intelligent Virtual Agents (IVA), pages 25–36.

François Mairesse and Marilyn Walker. 2008. Trainable generation of big-five personality styles through data-driven parameter estimation. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL), pages 165–173.

Toyomi Meguro, Ryuichiro Higashinaka, Kohji Dohsaka, Yasuhiro Minami, and Hideki Isozaki. 2009. Analysis of listening-oriented dialogue for building listening agents. In Proceedings of the 10th Annual SIGDIAL Meeting on Discourse and Dialogue (SIGDIAL). (to appear).

Biographical Sketch

Toyomi Meguro is a researcher at NTT Communication Science Laboratories, NTT Corporation. She received the B.E. and M.E. degrees in electrical engineering from Tohoku University, Japan, in 2006 and 2008, respectively. She joined NTT in 2008. Her research interests are in building listening agents and recommendation systems. She enjoys travelling, playing tennis, cooking, and finding good restaurants.


Gregory Mills
Interaction, Media and Communication Group
Department of Computer Science
Queen Mary, University of London

[email protected]

1 Research Interests

My research is primarily concerned with the empirical investigation of human-human communication, focusing on the interactive mechanisms deployed by interlocutors in semantic co-ordination. I am particularly interested in the mechanisms used for dealing with problematic understanding, e.g. repair, clarification requests and corrections, and their role in the language change that occurs during the development of co-ordination. I am currently focusing on how alignment is exploited by interlocutors, and also on the boundary between communication and miscommunication.

1.1 Past and Current Research

Perhaps one of the least contentious statements concerning language is that it is intrinsically underdetermined, dynamic, and adaptable to novel dialogue contexts. Despite this insight, research on dialogue systems has primarily focused on the information-exchange aspects of language use, presupposing that both interlocutor and dialogue manager already "know how to talk" about the particular domain and have already co-ordinated their linguistic resources to suit the communicative situation. Consequently, dialogue system implementations have traditionally downplayed the importance of this co-ordination of linguistic resources by the selective incorporation of highly domain-specific vocabularies and ontologies (Larsson, 2006).

The point of departure of my research is that this static treatment of language presupposes semantic transparency between interlocutors and is inadequate for describing how co-ordination is achieved. The inherently dynamic and adaptable nature of language, and the fact that interlocutors necessarily have different interaction histories, suggest the importance of an account that explains how interlocutors manage to resolve these differences and converge on a semantic model during dialogue.

Although existing models of dialogue agree that this is achieved within interaction, they disagree on which particular interactive mechanisms are implicated in this process. The collaborative model of Clark et al. (Clark, 1996) characterises this as occurring through iterative cycles of "positive evidence of understanding", while the interactive alignment model (Garrod et al., 1994, 2004) prioritises the role of priming in the development of co-ordination.

To address these differences and investigate in detail the negotiation of semantic co-ordination, my research involved a series of experiments using a novel text-based chat tool. In contrast to existing experimental approaches, which involve relatively coarse modification of the communicative context, e.g. confederates or wizard-of-oz scenarios, the chat tool allows fine-grained manipulation of the unfolding dialogue by selectively interfering with individual co-ordination mechanisms (Healey and Mills, 2007).

Key findings from the experiments carried out with this chat tool are: interfering with the sequential coherence of the dialogue leads participants to align more (Mills, 2007); interlocutors index the level of semantic co-ordination of partners with different levels of participation (Healey and Mills, 2006); and on encountering problematic utterances, interlocutors resort to less specific and "vaguer" descriptions (Mills and Healey, 2006).

The findings from these experiments present co-ordination phenomena that are difficult to reconcile with both models, due to their "semantic neutrality" (Healey and Mills, 2006). In particular, the findings demonstrate how interlocutors are sensitive to semantic differences between different kinds of referring expression, and exploit these differences in order to develop and sustain mutual intelligibility. Importantly, this is achieved tacitly by interlocutors, in particular when dealing with problematic understanding (Mills, 2007).

These findings also work against semantic transparency underwriting co-ordination. Interlocutors frequently introduce and use terms without having a full understanding of their applicability to the dialogue situation. Instead, terms are introduced opportunistically, their meaning fleshed out through iterative cycles of repair (Healey and Mills, 2006).

My research has yielded rich patterns of co-ordination in clarification subdialogues that exhibit semantic change; these patterns are not strictly reducible to the exchange of propositionally encoded information, yet still have the effect of resolving problematic understanding. Importantly, they provide compelling evidence supporting the thesis that alignment is not simply an outcome of successful interaction, but is also exploited by interlocutors when dealing with problematic understanding (Mills, 2007; Mills and Healey, 2008).

1.2 Current and future work

I am currently working on refining the experimental chat tool methodology, developing it as a general experimental platform for dialogue researchers, in order to facilitate the experimental investigation of dialogue phenomena¹.

I am also conducting experiments using this chat tool to investigate how interlocutors exploit alignment in dialogue, in particular the role of figure and ground. I am also using the chat tool to investigate how these findings scale up to multilogue.

2 Future of Spoken Dialogue Research

It is essential that the development of more naturalistic dialogue systems be guided by "what interlocutors actually do". All too frequently, the natural spontaneity and flexibility of language, and the interactive mechanisms used to deal with problems of mutual intelligibility that arise when language is adapted to novel contexts of use, are treated as inferior to the idealisation of well-formed speech acts. This is of prime importance, as experimental evidence demonstrates that repair facilitates comprehension (Brennan and Schober, 2001).

Further, this flexibility introduces the notion of semantic change occurring during the course of individual conversations, bringing the problem of semantic opacity to the foreground. Dialogue research could benefit from a critical assessment of the insights from the philosophy of language concerning intentional and also semantic transparency, in particular Wittgenstein's consideration of language as practice and his arguments concerning the limitations of a strictly informational view of language. This would address the issue that using a term necessarily introduces change into its meaning for an agent and the language community as a whole. Creating a typology of these changes, and of how they are achieved through co-ordination mechanisms, would decrease the gulf that exists between dialogues with dialogue managers and actual human-human conversation.

3 Suggestions for Discussion

• Methodologies used to determine which particular mechanisms to include in dialogue systems: empirical approaches, e.g. wizard-of-oz or confederates, introduce different biases from user simulations. Discussion of the relative merits and limitations of both would be of great benefit to dialogue system designers.

• Semantic and intentional transparency: what steps can be taken to make models of agency and language production/comprehension more similar to actual human-human conversation?

• Language change: how can dialogue managers be designed to allow for novel uses of terms by both interlocutors and the dialogue system?

¹DiET: Dialogue Experimentation Toolkit. http://www.dcs.qmul.ac.uk/research/imc/diet/

References

Brennan, S. E. and Schober, M. F. 2001. How listeners compensate for disfluencies in spontaneous speech. Journal of Memory and Language, 44: 274–296.

Clark, H. H. 1996. Using Language. Cambridge University Press, Cambridge.

Garrod, S. and Doherty, G. 1994. Conversation, co-ordination and convention: an empirical investigation of how groups establish linguistic conventions. Cognition, 53: 181–215.

Healey, P. G. T. and Mills, G. J. 2006. Participation, Precedence and Co-ordination in Dialogue. In Proceedings of the 28th Conference of the Cognitive Science Society: 1470–1475, Vancouver, Canada.

Larsson, S. 2006. Semantic plasticity. Paper presented at LCM 2006 (Language, Culture and Mind), July 2006, Paris, France.

Mills, G. J. 2007. The development of semantic co-ordination in dialogue: the role of direct interaction. Unpublished PhD Thesis.

Mills, G. J. and Healey, P. G. T. 2006. Clarifying Spatial Descriptions: Local and Global Effects on Semantic Co-ordination. In Proceedings of the 10th Workshop on the Semantics and Pragmatics of Dialogue: 122–129, University of Potsdam, Germany.

Biographical Sketch

The author completed his PhD in 2007 and is currently working as a postdoc at Queen Mary University of London, focusing on the interactive mechanisms involved in semantic co-ordination. He is interested in the philosophy of language and mind, the phenomenology of (mis)communication, the evolution of language, and also language use in therapeutic settings. Outside of academia, he flies gliders and plays electronic music and the harmonica.


Teruhisa Misu
National Institute of Information and Communications Technology (NICT)
2-2-2 Hikaridai, Keihanna Science City, Kyoto, Japan

[email protected]/˜xtmisu/

1 Research Interests

My current research interest is in making proactive dialogue systems. Although most current spoken dialogue systems (SDSs) handle simple database retrieval or transactions driven by users' explicit requests, it is desirable that SDSs should make proactive interactions such as clarifications, recommendations and advice, like a hotel concierge or an experienced operator. More specifically, I am interested in dialogue management in non-database retrieval tasks and adaptive response generation.

1.1 Previous work

My previous work has focused on SDSs based on information retrieval and question-answering (QA) using a set of documents as a knowledge base (Misu, 2008).

1.1.1 Confirmation and Clarification in Document Retrieval Tasks

It is indispensable for SDSs to interpret the user's intention robustly in the presence of speech recognition errors and the extraneous expressions characteristic of spontaneous speech. In speech input, moreover, users' queries tend to be vague, and they may need to be clarified through dialogue in order to extract sufficient information to get meaningful retrieval results. In conventional database query tasks, it is easy to cope with these problems by extracting and confirming keywords based on semantic slots. However, it is not straightforward to apply such a methodology to general document retrieval tasks.

To solve these problems, we proposed a confirmation method based on two statistical measures that are not based on the confidence measure of ASR, but on the impact on retrieval as well as the degree of matching with the backend knowledge base (Misu and Kawahara, 2006b; Misu and Kawahara, 2008). We also proposed a method to make clarification questions by dynamically selecting from a pool of possible candidate questions. As the selection criterion, an information gain is defined based on the reduction in the number of matched items (Misu and Kawahara, 2005; Misu and Kawahara, 2006b); see the sketch below.
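
As an illustration of that criterion, the sketch below (our own simplification, not the authors' implementation) scores a candidate clarification question by the expected reduction in entropy over the currently matched items, given how its possible answers would partition them:

```python
import math

def information_gain(n_matched, partition):
    """Expected entropy reduction for a question whose possible answers
    split the n_matched items into subsets of the given sizes."""
    h_before = math.log2(n_matched)  # uniform uncertainty over matches
    h_after = sum(k / n_matched * math.log2(k) for k in partition if k)
    return h_before - h_after

# A question splitting 100 matches evenly beats a lopsided one:
print(information_gain(100, [50, 50]))  # 1.0 bit
print(information_gain(100, [90, 10]))  # about 0.47 bits
```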

1.1.2 Interactive Navigation based on QA and Information Recommendation

We proposed an interactive dialogue framework. In conventional audio guidance systems, such as those deployed in museums, the information flow is one-way and the content is fixed. We prepare two modes, a user-initiative retrieval/QA mode (pull mode) and a system-initiative recommendation mode (push mode), and switch between them according to the user's state. In the user-initiative retrieval/QA mode, the user can ask questions about specific facts in the documents in addition to general queries. In the system-initiative recommendation mode, the system actively provides the information the user would be interested in. The system utterances are generated by retrieving from and summarising Wikipedia documents. We implemented a navigation system containing Kyoto city information. The effectiveness of the proposed techniques was confirmed through a field trial by a number of real novice users (Misu and Kawahara, 2007b; Misu and Kawahara, 2007a).

1.1.3 Efficient Language Model Construction for SDSs

We proposed a bootstrapping method for constructing statistical language models for new SDSs by collecting and selecting sentences from the World Wide Web. Out of the collected texts, we select sentences "matched" both in domain and in utterance style, and thus appropriate as training data for the language model (Misu and Kawahara, 2006a).
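
A minimal sketch of this kind of selection (our own simplification: rank web sentences by average unigram log-probability under a seed model built from a small set of in-domain transcripts, standing in for the paper's combined domain and style criteria):

```python
import math
from collections import Counter

def select_matched(web_sentences, seed_sentences, keep=1000):
    """Keep the web sentences that best match an in-domain seed model."""
    seed_counts = Counter(w for s in seed_sentences for w in s.split())
    total = sum(seed_counts.values())
    vocab = len(seed_counts) + 1  # +1 for unseen words (add-one smoothing)

    def avg_logprob(sentence):
        words = sentence.split()
        return sum(math.log((seed_counts[w] + 1) / (total + vocab))
                   for w in words) / max(len(words), 1)

    return sorted(web_sentences, key=avg_logprob, reverse=True)[:keep]
```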

1.2 Current and Future Work

Currently, we are developing consulting dialogue systems that help users make decisions through spontaneous interactions.

Most previous studies assumed a definite and consistent user goal. The dialogue strategies were usually designed to minimise the cost of information access. However, this assumption fails in various real-world situations. Specifically, we are developing a dialogue system that handles tourist guidance. Thus far, we have collected itinerary planning dialogues in Japanese, in which users plan a one-day visit to Kyoto City (Misu et al., 2009).


The corpus contains various exchanges, such as clarifications and reasoning. The user may explain his/her vague preferences by listing examples. The system should sense the user's preferences from his/her utterances and then request a decision.

In order to construct a consulting dialogue system from the corpus, we are annotating dialogue acts (Misu et al., 2009). Moreover, we proposed an efficient dialogue management scheme based on WFSTs using N-gram DA tag sequences (Hori et al., 2009). Of course, some sort of dialogue state modelling (e.g. (Thomson et al., 2008)) would be needed, and re-scoring of system actions based on the state is also our future work.

2 Future of Spoken Dialogue Research

I am afraid that the more multifunctional cell phones become (e.g. iPhone, Android), the less users will use speech interfaces for simple query tasks such as train search, hotel reservation, etc. Thus we will have to propose applications that can make use of something that such smartphones do not have. It may be a large (human-sized) touchscreen or a motion detection sensor. Another direction would be the expansion of tasks that a speech interface can handle (but a small touch screen cannot).

3 Suggestions for Discussion

• How to evaluate dialogue systems: What is a good evaluation measure for dialogue systems? Can we define an evaluation measure like the "BLEU/NIST score" used in machine translation (especially for non-goal-oriented dialogue systems)?

• How to make users behave naturally, as they do with a human operator: Users talk to systems in an utterance style different from the style in which they talk to human operators. What are the causes of this phenomenon?

• Cost reduction in developing new SDSs using the WWW: How can we make use of web resources for the construction of SDSs?

References

C. Hori, K. Ohtake, T. Misu, H. Kashioka, and S. Nakamura. 2009. Recent Advances in WFST-based Dialog System. In Proc. Interspeech (accepted for presentation).

T. Misu and T. Kawahara. 2005. Speech-based information retrieval system with clarification dialogue strategy. In Proc. Human Language Technology Conf. (HLT/EMNLP).

T. Misu and T. Kawahara. 2006a. A Bootstrapping Approach for Developing Language Model of New Spoken Dialogue Systems by Selecting Web Texts. In Proc. Interspeech, pages 9–12.

T. Misu and T. Kawahara. 2006b. Dialogue Strategy to Clarify User's Queries for Document Retrieval System with Speech Interface. Speech Communication, 48(9):1137–1150.

T. Misu and T. Kawahara. 2007a. An interactive framework for document retrieval and presentation with question-answering function in restricted domain. In Proc. Int'l Conf. Industrial, Engineering & Other Applications of Artificial Intelligent Systems (IEA/AIE), pages 126–134.

T. Misu and T. Kawahara. 2007b. Speech-based Interactive Information Guidance System using Question-Answering Technique. In Proc. ICASSP.

T. Misu and T. Kawahara. 2008. Bayes risk-based dialogue management for document retrieval system with speech interface. In Proc. COLING, Posters & Demos, pages 59–62.

T. Misu, K. Ohtake, C. Hori, H. Kashioka, and S. Nakamura. 2009. Annotating Communicative Function and Semantic Content in Dialogue Act for Construction of Consulting Dialogue Systems. In Proc. Interspeech (accepted for presentation).

T. Misu. 2008. Speech-based Navigation Systems based on Information Retrieval and Question-Answering with Optimal Dialogue Strategies. Ph.D. thesis, School of Informatics, Kyoto University.

B. Thomson, J. Schatzmann, and S. Young. 2008. Bayesian Update of Dialogue State for Robust Dialogue Systems. In Proc. ICASSP, pages 4937–4940.

Biographical Sketch

Teruhisa Misu is a researcher in the Spoken Language Communication Group, MASTAR Project, at the National Institute of Information and Communications Technology (NICT), Japan. He received the B.E. degree in 2003, the M.E. degree in 2005, and the Ph.D. degree in 2008, all in information science, from Kyoto University, Kyoto, Japan. From 2005 to 2008, he was a Research Fellow (DC1) of the Japan Society for the Promotion of Science (JSPS). In 2008, he joined the NICT Spoken Language Communication Group.


Aasish Pappu
Carnegie Mellon University
Language Technologies Institute
School of Computer Science
Pittsburgh, PA 15213

[email protected]/˜apappu

1 Research Interests

My research interests span task-oriented dialogue systems and spoken dialogue interfaces for knowledge acquisition. Currently, I am interested in investigating ontology-driven dialogue task management. I am studying the usefulness of ontologies for preparing a common ground of knowledge for humans and robots while they perform collaborative work.

1.1 Past Research

During my undergraduate research I worked on a natural language interface for programming in Java. We developed a system that could respond to simple English sentences with primitive Java constructs: sentences such as "Select third positive element from array", "Add a to b", "Select every element which is greater than two from array", etc.

The primary motivation behind this project was to create an environment for people with little programming background who plan to learn new programming languages. Since the syntax is taken care of, one needs to know little about the new language. The system also actively learns the types of variables from their initialised values: for example, "initialise g with 9.8" would inform the system about the type of the variable g. Through conversation, the user was able to ask the system about existing variables, and the system could alert the user when overwriting existing variables. Altogether, we could set up a very basic programming environment through dialogue (a toy sketch of this kind of mapping follows).
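
The patterns and the type rule below are invented for illustration; this is not the original system:

```python
import re

def to_java(utterance, known_types):
    """Map a couple of English commands to Java statements, inferring a
    variable's type from its initialised value (e.g. 9.8 -> double)."""
    m = re.match(r"initialise (\w+) with (\S+)", utterance)
    if m:
        name, value = m.groups()
        known_types[name] = "double" if "." in value else "int"
        return f"{known_types[name]} {name} = {value};"
    m = re.match(r"add (\w+) to (\w+)", utterance)
    if m:
        a, b = m.groups()
        return f"{b} = {b} + {a};"
    return "// not understood"

types = {}
print(to_java("initialise g with 9.8", types))  # double g = 9.8;
print(to_java("add a to b", types))             # b = b + a;
```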

This work was inspired by M'PAL (Rosenbloom, 1985), a programming language project based on definitions written in English, and by another interesting work that took a case-based reasoning approach to a natural language interface for Unix shell scripting (Won Il Lee and Geunbae Lee, 1995).

1.2 Current Research

Currently, I am associated with the TeamTalk project group under the supervision of Dr. Alex Rudnicky. This project investigates different aspects of human-robot teams participating in collaborative tasks. My present work focuses on designing an ontology to drive robot navigation in an environment. Ontology-driven planning and learning has been popular in systems related to personal assistants (Niekrasz et al., 2005), which keep track of the user's actions and learn about the surroundings to keep the user informed about alerts in the environment. Goal-oriented systems, such as urban search and rescue systems (Schlenoff and Messina, 2005), have also employed an ontology-based approach for robot navigation.

The primary purpose of having an ontology is to generate spatial representations that allow a human-robot team to refer to objects in an environment using common sense. It is necessary to have a symbolic representation of the objects and their relationships with other objects. Conceptualisation of these objects gives the robot the opportunity to interpret complex commands such as "Face the door and walk until you find a window".

Furthermore, we could assign attributes to each object, keep track of their status, and update the sensory map, i.e., a robot's view of the world perceived through its sensory inputs. Additionally, sub-teams working on a common plan could share their 'view' of the world with their teammates by updating the common knowledge. Consider queries like "which robot walked through the second door": this kind of query demands a good understanding of answer types conditioned on a context, and these answer types are nothing but concepts in the ontology. Additionally, events that take place in the environment and actions performed by robots are tracked and stored in the ontology, thereby helping a robot to answer queries like "when was the last time we talked about the blue ball".

1.3 Queuing models for dialogue management

Dialogue systems that are heavily exposed to diverse users exhibit interesting challenges pertaining to behavioural changes due to a changing environment. Not only dialogue systems but also traditional service-provider systems, like network traffic management and disaster evacuation systems (Gross, 2008), have adopted a wide range of simulation models that approximate real environments.

We can observe analogies between queuing models and dialogue systems. Every user utterance could be treated as a Poisson arrival, whereas decoding and prompting the user constitute a service. In the M/M/1 queuing model, Kendall's notation says that arrivals form a Poisson process and service times are exponentially distributed, with one service provider. This model is very similar to a single-server dialogue system. On the other hand, we could have an M/M/k system, with multiple servers and a single queue; such systems are highly efficient and perform better than k servers each having their own queue. It is intuitive that a single large pool of servers performs better than two or more smaller pools (see the sketch below).
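
To make the comparison concrete, the sketch below uses the standard Erlang C formula (textbook queueing theory, e.g. Gross (2008), not code from this work) to compare one pooled two-server queue against two separate single-server queues under the same total load:

```python
from math import factorial

def mmc_mean_wait(lam, mu, c):
    """Mean time in queue Wq for an M/M/c system (Erlang C formula)."""
    a = lam / mu              # offered load in Erlangs
    rho = a / c               # per-server utilisation; must be < 1
    p_wait = (a**c / factorial(c)) / (
        (1 - rho) * sum(a**n / factorial(n) for n in range(c))
        + a**c / factorial(c))
    return p_wait / (c * mu - lam)

lam, mu = 1.6, 1.0  # utterances/s arriving, services/s per server
print(mmc_mean_wait(lam, mu, 2))      # pooled: one queue, two servers
print(mmc_mean_wait(lam / 2, mu, 1))  # split: each server has its own queue
```

The pooled configuration yields a markedly lower mean wait, matching the intuition above.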

Similarly, we could have several dialogue systems working together, each maintaining a dialogue state chosen from the n-best possible states for a given user utterance. This kind of setup may help in multi-participant conversation, where the conversation towards one addressee could be shared with another addressee or remain independent from other addressees. Therefore, more than one participant may apparently share the state, each one with a different dialogue manager.

2 Future of Spoken Dialogue Research

In my opinion, spoken dialogue research will have to take assistance from other modalities to draw inferences from tractable user actions and anticipate what the user might say. Additionally, speech recognition mistakes should be compensated for by inputs from other modalities, which raises the question of how we could learn from word errors.

2.1 Learn from humans

It will be interesting to investigate the effectiveness of dialogue systems in active learning and knowledge acquisition tasks through which humans can share their knowledge with machines.

2.2 Spoken dialogue forum

An interesting application of spoken dialogue research could be in the area of product review systems. A user may voice his opinion about a product, and the system could voice similar opinions put forth by other people. An additional purpose of such a system would be to accelerate the process of data collection for speech processing research.

3 Suggestions for Discussion

• Investigating effective (and inexpensive) ways of collecting lots of "real" data and simulating real conversations.

• How to identify domain-independent sub-conversations in a collection of data and adapt models trained on them to other domains?

References

Michael H. Rosenbloom. 1985. M'PAL: a programming language based on natural English definitions. ACM SIGPLAN Notices, Volume 20, New York, NY.

Won Il Lee and Geunbae Lee. 1995. From natural language to shell script: a case-based reasoning system for automatic UNIX programming. In Expert Systems with Applications, 9(1), pp.71–79.

J. Niekrasz, M. Purver, J. Dowding, and S. Peters. 2005. Ontology-based discourse understanding for a persistent meeting assistant. In Proceedings of the AAAI Spring Symposium.

Craig Schlenoff and Elena Messina. 2005. A robot ontology for urban search and rescue. In Proc. of the 2005 ACM Workshop on KRAS. ACM, New York, NY.

D. Gross. 2008. Fundamentals of Queuing Theory. Wiley-India.

Biographical Sketch

Aasish Pappu is currently a 2nd-year Master's student and Research Assistant at the Language Technologies Institute, CMU, Pittsburgh, under the supervision of Dr. Alex Rudnicky. He obtained his BTech degree in Information Technology from the Indian Institute of Information Technology, Allahabad, India. Besides research, his interests include photography, philosophy, languages and poetry.


Joana Paulo Pardal
Dept. of Computer Science and Engineering
IST, Technical University of Lisbon
Spoken Language Systems Laboratory (L2F)
Lisboa, Portugal

[email protected]/˜joana

1 Research Interests

My research interests lie generally in the area of spoken dialogue systems, with particular interest in software engineering techniques to dynamically integrate structured knowledge sources, like databases and ontologies (Paulo Pardal, 2007), and in evaluation frameworks to measure the improvements. The challenges of creating coaching, tutorial, and educational systems that can be used by the general public at their homes, schools, or museums are also part of my research.

1.1 Past research

My MSc thesis was on "automatic term acquisition" (Paulo et al., 2004). I have taught object-oriented programming and design patterns, knowledge representation, artificial intelligence (AI), autonomous agents and multi-agent systems (Melo et al., 2006), and distributed systems (Pardal et al., 2008). When starting my PhD, I moved to spoken dialogue systems. Being a CS engineer from the field of AI, I was interested in how to use ontologies to ease the extension of a system to new tasks, similarly to what is done with databases. Database structure has been used to extract domain knowledge, and it allows the generic use of that kind of information, reducing the coding time and adaptations needed to build new dialogue systems. With this task in mind, I worked on OntoChef, a cooking ontology (Ribeiro et al., 2006). The ontology was later enriched with the use of a natural language tool (Machado, 2007) through information extraction techniques. A collection of nearly 9000 recipes was extracted from Portuguese websites with a specially designed tool; they are to be converted into an ontology-based format soon. To better understand how humans coach each other while cooking, a human-human cooking corpus was collected, where one person helped another while s/he cooked a recipe (currently there are approximately 3 hours with 6 different participants, in 3 teams making a 'chocolate mousse'). This corpus is to be annotated with the requests made by the person executing the task and the corresponding answers, tips, and comments from the coach. This follows my previous work at Rochester (Gomez-Gallo et al., 2007).

1.2 Current research

Most practical dialogue systems are designed for a specific task, and even if the authors were concerned with possible future extensions, integrating new tasks is always a challenge. Dynamic integration of new tasks according to some kind of structured knowledge is an interesting research topic. The main goal of my thesis (Paulo Pardal, 2007) is to study how different levels of knowledge stored in ontologies can be used to facilitate the creation of new coaching dialogue systems capable of domain reasoning. I am taking McGuinness' ontology spectrum (McGuinness, 2003) to split the ontology into increasingly complex knowledge levels.

The hypothesis being studied is whether ontologies can be used to enrich a coaching spoken dialogue system in such a way that the system can abstract away the source of domain-specific knowledge – related to the tasks being coached – focusing only on the dialogue phenomena. The integration of ontological knowledge should be done with few architectural adaptations to the dialogue system, so that when adding a new domain – a new class of tasks – minor changes in dedicated modules are sufficient.

1.2.1 Case Study: Cooking Coach

Cooking is something that everybody ends up doing. Some have been cooking for a lifetime, some are just beginning. Most of the time the user's hands are busy and dirty. In such a scenario, manually handling a recipe book is to be avoided. A system that helps the user by dictating the steps while hands and eyes are occupied with the cooking tasks is much desired. My goal is to develop a spoken dialogue system that provides assistance in reading the procedure and detailing all steps that may be unclear to a user lacking expertise. Based on our cooking experience, we built a prototype system that helps the user while cooking (Martins et al., 2008c). The system adapts itself to the users' needs and expertise. It was built over DIGA (Martins et al., 2008b; Martins et al., 2008a), a spoken dialogue systems framework. Some experiments with CMU's Olympus (Bohus et al., 2007) are currently being done.

YRRoSDS’2009 Sep 13-14, London, UK 75

Page 76: Proceedings of the Fifth Annual Young Researchers ...yrrsds/archived/YRR09_proceedings-web_small.pdf · David Martins de Matos, L2F INESC-ID, Technical University of Lisbon, Portugal

1.3 Future research

The use of an ontology delivers another interesting result: the system could reason about the tasks and plans at hand. After saying 'Separate egg whites and egg yolks.' to the user, the system needs to know that the existing 'eggs' will disappear and give way to 'egg whites' and 'egg yolks'.

It should also make it easier to include different languages by translating (Graça et al., 2008) the cooking ontology and adapting the necessary linguistic resources (understanding and generation).

The integration of additional sensors in the kitchen could bring interesting enhancements, as in DFKI's 'SmartKitchen' (Schneider, 2007) or the MIT Media Lab's 'intelligent counter' (http://www.media.mit.edu/ci/).

2 Future of Spoken Dialogue Research

Currently, spoken dialogue systems are proposed only when no other input modalities are available (like when there is no access to a keyboard, or when the user has some kind of special need – blindness, reduced accessibility, etc.). However, we should consider the use of speech whenever it is natural. That would be easier if interaction with these systems were more natural (more similar to human-human interaction).

When human-computer interaction approaches human-human interaction, people will feel comfortable delegating some tasks to a digital helper while they concern themselves with other tasks. Managing priorities and knowing the right times to interrupt are important. The "uncanny valley", however, must be avoided.

Exploring the new emerging technologies and mobile devices (like the iPhone or Android) is another path worth considering: integrating speech with new information from sensors can further leverage the current state of the art, making the available interfaces easier to use and much more natural.

3 Suggestions for Discussion

• Teaching SDS (methods, frameworks, evaluation).

• SDS for Coaching (assisting with tasks) and Teaching (tutorial and educational applications).

• Evaluation: universal metrics for comparing disparate systems, tasks, languages and modalities; expert systems against rapid development frameworks.

References

D. Bohus, A. Raux, T. K. Harris, M. Eskenazi, and A. I. Rudnicky. 2007. Olympus: an open-source framework for conversational spoken language interface research. In Proc. of Bridging the Gap: Academic and Industrial Research in Dialog Technology, workshop at HLT/NAACL.

C. Gomez-Gallo, G. Aist, J. Allen, W. de Beaumont, S. Coria, W. Gegg-Harrison, J. Paulo Pardal, and M. Swift. 2007. Annotating continuous understanding in a multimodal dialogue corpus. In SemDial – DECALOG.

J. Graça, J. Paulo Pardal, L. Coheur, and D. Caseiro. 2008. Building a golden collection of parallel multi-language word alignment. In LREC.

T. Machado. 2007. Extracção de informação – introdução automática de receitas de acordo com ontologia. Master's thesis, IST, UTL.

F. Martins, A. Mendes, J. Paulo Pardal, N. Mamede, and J. P. Neto. 2008a. Using system expectations to manage user interactions. In PROPOR, LNCS. Springer.

F. Martins, A. Mendes, M. Viveiros, J. Paulo Pardal, P. Arez, N. Mamede, and J. P. Neto. 2008b. Reengineering a domain-independent framework for spoken dialogue systems. In SET&QA4NLP. ACL.

F. Martins, J. Paulo Pardal, L. Franqueira, P. Arez, and N. Mamede. 2008c. Starting to cook a tutoring dialogue system. In SLT. IEEE/ACL.

D. McGuinness. 2003. Ontologies come of age. In The Semantic Web: Why, What, and How. MIT Press.

C. Melo, R. Prada, G. Raimundo, J. Paulo Pardal, H. S. Pinto, and A. Paiva. 2006. Mainstream games in the multi-agent classroom. In IAT. IEEE Computer Society.

M. Pardal, S. Fernandes, J. Martins, and J. Paulo Pardal. 2008. Customizing web services with extensions in the STEP Framework. International Journal of Web Services Practices – IJWSP, 3(1).

J. L. Paulo, D. Martins de Matos, and N. Mamede. 2004. Terminology Mining with ATA and Galinha, chapter 2. Edições Colibri, Lisbon, Portugal.

J. Paulo Pardal. 2007. Dynamic use of ontologies in dialogue systems. In NAACL-HLT Doctoral Consortium.

R. Ribeiro, F. Batista, J. Paulo Pardal, N. Mamede, and H. S. Pinto. 2006. Cooking an ontology. In AIMSA, LNCS. Springer.

M. Schneider. 2007. The semantic cookbook: sharing cooking experiences in the SmartKitchen. In 3rd Int'l Conf. on Intelligent Environments.

Biographical Sketch

Joana Paulo Pardal is a 4th-year Ph.D. student in Computer Science Engineering at IST, Technical University of Lisbon, under the supervision of Nuno J. Mamede (IST), H. Sofia Pinto (IST) and James F. Allen (U. Rochester). She holds a fellowship from FCT (Portuguese Nat. Science Foundation). Joana received a licenciatura in 2001 and an M.Sc. in 2004, both in CS Eng. from IST. Joana has been a researcher at L2F INESC-ID since 2001 and a Lecturer at IST since 2002. She is a student member of ISCA, ACL and AAAI, and participates in CMU's DoD reading group. She has worked abroad three times: in Clermont-Ferrand, France (Summer 2004); Rochester, NY, USA (Fall 2006); and Cambridge, MA, USA (Fall 2008, Winter 2009). She enjoys travelling, cooking, gardening, playing Nintendo Wii, and eating at good restaurants.


David Díaz Pardo de Vera
Universidad Politécnica de Madrid
Ciudad Universitaria s/n
28040 Madrid
Spain

[email protected]

1 Research Interests

My main research interests lie in spoken dialogue systems (SDSs) incorporating nonverbal channels of communication, focussing particularly on user-centred quality evaluation.

1.1 Main research goal/focus

At GAPS (the Signal Processing Applications Group at UPM), our main goal is to improve the robustness of human-machine communication while making the experience for the users as efficient and enjoyable as possible. This can only be achieved by combining work in the design of multimodal communication acts, developing dialogue strategies that provide flexibility and robustness, and then learning how to properly evaluate the users' experience.

We have paid special attention to studying the effects that incorporating an embodied conversational agent (ECA) might have on the usability, efficiency and user acceptance of spoken dialogue systems.

1.2 User-centred evaluation

We may identify two main goals of user-centred evaluation: the first is to be able to establish the users' overall opinion of the "quality" of a dialogue system; the second is to develop models to predict the users' perception of quality from a set of parameters.

No standard procedure exists to evaluate dialogue systems in specific contexts, such as enrolment and verification dialogues for speaker authentication systems, which is an area we have explored at GAPS. Nor have we found well-defined and tested guidelines to evaluate multimodal dialogue systems that include a humanlike figure in the visual communication channel (an ECA), or standardised approaches to take user emotions into account. We have followed the ITU P.851 recommendation (ITU-T P.851, 1999) on questionnaire design for the evaluation of spoken dialogue systems for general telephone services, and expanded it to include dimensions to evaluate user perceptions related to secure access and to the ECA. Inspired by Möller's taxonomy of quality factors (Möller et al., 2007), we combine user questionnaire responses with interaction data registered automatically.

We have defined a quality-parameter class structure, or frame, that provides conceptual clarity both to develop questionnaires and future experiments and to analyse the data obtained in them. The class structure arises from considering two orthogonal sets of dimensions. First, from the literature on usability and ergonomics (among which Angela Sasse's work is especially relevant; see, e.g., Sasse (2004)), we have extrapolated the notion that three major classes of parameters are generally (and tacitly) considered to be related to user acceptance: likeability factors (those that have to do with the experience of using the system), rejection factors (those that can only have a negative valence) and perception of usefulness. Secondly, the class structure is broken down into various levels that may independently affect or be subject to the users' perception. We may distinguish an overall system-assessment level, task and goal-related levels, and an interaction/interface level.

1.3 Case studies

We have performed experiments to study the effects of an ECA in an SDS. Responses to our questionnaires do not show a clear difference in scores between users with and without the ECA in their interface, regarding their subjective experience, despite differences observed in important objective parameters, such as the number of turns, time-outs and barge-in attempts. ECA users were slightly more positive about the experience, but they were no more inclined to use a similar system in the future (this is commonly taken as an important indicator of user acceptance). Factor analyses on our questionnaires have allowed us to gain extra insight into the structure of user acceptance-related factors and the dynamics of the users' experience, revealing factor structures similar to those identified by other authors (especially Hone and Graham (2001)). We identified a new factor that appeared persistently in all questionnaires: rejection of the system caused by privacy and security concerns. Furthermore, privacy concerns were strongly coupled with the perception of usefulness and the inclination to use the system. We also observed a slight advantage in the perception of performance quality for the ECA system in the beginning, perhaps due to false expectations generated by the stronger social presence of the system with the ECA. The differences disappear in subsequent stages. Finally, the ECA seems to help users understand what is going on and what they are supposed to do (say) throughout the dialogue.

1.4 Multimodal communication act generation

We have conceived a multimodal interaction platform for the multimodal expression of communication goals. The central module of the system we are working on (at the moment it is just a concept), which we call the interaction manager, produces a communication intention base (CIB). The CIB defines the ECA's response on a pre-verbal level. It has three elements: the interaction control element defines turn management and theme structure; the open discourse element deals with the literal meaning the system wishes to communicate; and the non-declared communication element describes an intention behind the literal meaning of the message. Note that hidden intentions may differ greatly from what is openly said!
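
A minimal sketch of what such a CIB might look like as a data structure (the field names are our own reading of the three elements just described, not the authors' design):

```python
from dataclasses import dataclass, field

@dataclass
class CommunicationIntentionBase:
    """Pre-verbal specification of the ECA's next response."""
    interaction_control: dict = field(default_factory=dict)  # turn/theme
    open_discourse: str = ""   # literal meaning to communicate
    non_declared: str = ""     # intention behind the literal meaning

cib = CommunicationIntentionBase(
    interaction_control={"take_turn": True, "theme": "enrolment"},
    open_discourse="Please repeat your pass phrase.",
    non_declared="reassure the user that the system is listening")
```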

2 Future of Spoken Dialogue Research

I believe in the near future we will see an intensification of efforts to standardise the specification of communication acts for system dialogue output. The generation process has to be studied, first of all to determine what stages it should be broken down into. We need to know how the system should react to user input in particular interaction contexts, and how to specify multimodal, multi-layered semantic output, from concept to output. Efforts from the AI camp centred on cognition are already taking off (e.g., the SAIBA framework). An HMI counterpart focusing on the interaction and the user's experience is also needed. Multi-layered semantic appraisal of the user's messages and behaviour may take longer, but is nonetheless also necessary to achieve sophisticated human-machine conversation.

3 Suggestions for Discussion

My suggestions for discussion are the following:

• What elements of quality should we focus on when evaluating multimodal dialogue systems?

• Methods of qualitative and quantitative analysis of the users' experience.

• Formalisation of multimodal communication act generation.

References

A. Hernández, B. López, D. Díaz, R. Fernández, L. Hernández, and J. Caminero. 2007. A person in the interface: effects on user perceptions of multibiometrics. In Procs. of the Workshop on Embodied Language Processing, at the 45th Annual Meeting of the Association for Computational Linguistics, ACL, pp.33–40, Prague.

ITU-T P.851. 1999. Subjective Quality Evaluation of Telephone Services Based on Spoken Dialogue Systems. International Telecommunication Union (ITU), Geneva.

B. López Mencía, D. Díaz Pardo de Vera, A. Hernández Trapote, L. Hernández Gómez, M. C. Rodríguez and J. Relaño Gil. 2008. Evaluation of ECA gesture strategies for robust human-machine interaction. In International Journal of Semantic Computing, Special Issue on Gesture in Multimodal Systems.

S. Möller, P. Smeele, H. Boland, and J. Krebber. 2007. Evaluating spoken dialogue systems according to de-facto standards: a case study. In Computer Speech & Language, 21, pp.26–53.

SAIBA: wiki.mindmakers.org/projects:saiba:main/

M. A. Sasse. 2004. Usability and trust in information systems. Cyber Trust & Crime Prevention Project, University College London.

Biographical Sketch

David Díaz Pardo de Vera has an MEng in telecommunications engineering from Universidad Politécnica de Madrid. He is currently a 2nd-year PhD student and holds a research position at the same university, under the supervision of Luis Hernández. David is also (still!) writing (or, rather, not...) a master's thesis on the sociology and ethics of human-animal relations in the context of tourism in Lapland and in Spain. David (long ago) used to run competitively (though not too competently), and now wishes he hadn't wasted so much time in his youth instead of focusing on music so he could now (maybe) play jazz (or anything at all!), which he loves.


Florian Pinault
LIA – University of Avignon
BP 1228, 84911 Avignon
France

[email protected]

1 Research Interests

My research interests in automatic language processing lie generally in the area of dialogue management, to which I want to apply principled mathematical models and linguistic theories.

The PhD that I am currently completing explores the field of human-machine dialogue systems from a stochastic approach using reinforcement learning. The statistical modelling relies on Partially Observable Markov Decision Processes (POMDPs).

1.1 Stochastic dialogue management modelling (POMDP)

Classical dialogue systems model the dialogue state using logical rules, grammars or planning. A major issue is that they are usually very sensitive to speech recognition errors. Therefore, complex mechanisms to repair, confirm or correct these mistakes have been designed at a meta-dialogue level. Nevertheless, triggering these mechanisms at the appropriate time has proved to be difficult. A potential solution lies in modelling the uncertainty of the system about the dialogue. The POMDP framework statistically models the belief of the system about the dialogue state.

In a POMDP, a hidden Markov process models the unobserved dialogue state s_t and the observation o_t. The belief about s is b_t(s) = P(s_t = s | all past history), and it can be computed using Bayes' rule (a sketch follows). At each time step, an action a_t = π(b_t) is taken (a decision) by the system from the belief b_t, according to a policy π. The state s_{t+1} also depends on a_t. An introduction can be found in (Young et al., 2007).
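
The update itself is easy to state in code; here is a tabular sketch with invented toy dimensions (real systems need summary-space techniques like those discussed below):

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayes update: b'(s') is proportional to
    O[a, s', o] * sum over s of T[a, s, s'] * b(s)."""
    predicted = T[a].T @ b            # marginalise over previous states
    updated = O[a][:, o] * predicted  # weight by observation likelihood
    return updated / updated.sum()    # normalise to a distribution

# Toy model: 2 dialogue states, 1 action, 2 observations.
T = np.array([[[0.9, 0.1],    # T[a, s, s'] = P(s' | s, a)
               [0.2, 0.8]]])
O = np.array([[[0.8, 0.2],    # O[a, s', o] = P(o | s', a)
               [0.3, 0.7]]])
b = np.array([0.5, 0.5])
print(belief_update(b, a=0, o=1, T=T, O=O))  # posterior belief
```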

1.2 Parameter estimation

Algorithms for finding the optimal policy π can be model-based or model-free. In the model-based approach, the POMDP parameters are fully estimated; I used a corpus annotated with hierarchical frame semantics (Meurs et al., 2009), based on the FrameNet paradigm (Baker et al., 1998). Model-free approaches usually learn their parameters through user simulation; I am currently adapting an agenda-based user simulator (Schatzmann et al., 2007) for this purpose.

1.3 Computational complexity issue

POMDP policy optimisation is known to be intractable for real-world tasks, though some algorithms that find sub-optimal policies have been proposed recently (Pineau, 2004). To reduce the complexity of the search space, the dialogue state s and action a are summarised into a small number of descriptors, where it is assumed that summary machine actions can be decided from the information still available in the summary state. The resulting POMDP becomes more tractable due to the smaller search space.

1.4 Evaluation

Evaluation has been performed at the summary level, re-using the same probability model. This auto-evaluation paradigm, where the same model is used to fulfil the task and to evaluate the performance, is a major issue in dialogue management research.

This work is described more precisely in (Pinault et al., 2009).

2 Future of Spoken Dialogue Research

Present research rightly focuses on task-specific systems. I believe that dialogue involving complex reasoning, analogy, poetry or humour will stay out of reach for a long time.

In my view, the state of dialogue research in the next decade will show different characteristics:

• Multi-modality: Future human-machine dialogues will occur through mobile devices, including diverse input/output devices. A strong demand to develop efficient dialogue systems will come from this part of the web industry (e.g. VoiceXML).

• Rich annotated corpus availability: As multimedia classification and semantic web annotation progress, data collected about users (e.g. by Google) will generate huge corpora.

• Transcriptions provided by ASR (Automatic Speech Recognition) will keep being noisy and corrupted by errors. The next big improvement in speech recognition will come from a better use of the dialogue context, including semantic modelling.

3 Suggestions for Discussion

• Dialogue state modelling: Should the dialogue state include more than an estimation of the user goal?

• Academic and industry applications: How can a POMDP be cast in VoiceXML?

• Cooperative user adaptation: How can the user's natural ability to adapt to the system be exploited?

• Rigid turn-by-turn sequencing: Dialogues are always modelled as a succession of speaker turns. How can synchronous events such as mixed turns, barge-in, etc. be included naturally?

References

C.F. Baker, C.J. Fillmore, and J.B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of COLING-ACL '98.

M.J. Meurs, F. Lefèvre, and R. De Mori. 2009. Spoken language interpretation: On the use of dynamic Bayesian networks for semantic composition. In ICASSP.

F. Pinault, F. Lefèvre, and R. De Mori. 2009. Feature-based summary spaces for stochastic dialogue modeling with hierarchical semantic frames. INTERSPEECH.

J. Pineau. 2004. Tractable Planning Under Uncertainty: Exploiting Structure. Ph.D. thesis.

J. Schatzmann, B. Thomson, K. Weilhammer, H. Ye, and S. Young. 2007. Agenda-based user simulation for bootstrapping a POMDP dialogue system. ACL.

S. Young, J. Schatzmann, K. Weilhammer, and H. Ye. 2007. The Hidden Information State Approach to Dialog Management. Proc. of ICASSP.

Biographical Sketch

Florian Pinault is currently completing a Ph.D. in stochastic dialogue systems at the University of Avignon (France). He received a Master in Probability and Statistics from the University of Orsay (Paris XI) and a magistère in Mathematics and Computer Science from the École Normale Supérieure (Ulm, Paris).

His main research interests lie in applications of stochastic models to language processing. His thesis work focuses on stochastic modelling of the dialogue state and finding optimal dialogue policies using reinforcement learning.


Mara Reis
Federal University of Santa Catarina
Dept. of Applied Linguistics
Second Language Phonetics and Phonology
Florianópolis, SC, Brazil

[email protected]

1 Research Interests

My research interests lie in practical uses of spoken dialogue systems, in studies that investigate foreign language speech perception and production, pronunciation training, and speech intelligibility improvement.

1.1 Past Research

Given that accurate foreign language production is believed to be closely related to accurate perception (Flege, 1995; Kuhl and Iverson, 1995; Best and Tyler, 2007), I have investigated the relationship between perception and production among Brazilian, Dutch, and French speakers of English as a foreign language. If such a relationship exists, perceptual phonetic training may be able to alter production. To examine this hypothesis, I conducted perceptual training with Brazilian learners of English using the consonants /p, t, k/, which, although present in both Portuguese and English, are phonetically different in their realisation (Reis, 2007a; Reis et al., 2007). The results showed significant improvement towards a more English-like pronunciation, corroborating the claim that perception and production are related.

1.2 Current and Future Research

Bearing in mind that foreign language perception and production may be related for some speech sounds, my colleagues and I are currently working on pronunciation training and, thus, the improvement of foreign language intelligibility.

In a globalised world in which English is undeniably the contemporary lingua franca, successful cross-language communication requires a minimal level of speech intelligibility, particularly for business reasons (Celce-Murcia, 1987; Morley, 1991; Reis, 2007b).

As far as foreign-accented speech and intelligibility are concerned, some studies indicate that accented speech does not necessarily favour intelligibility for non-native speakers (e.g., Mackay et al., 2006; Munro et al., 2006), whereas other studies show that foreign language users benefit from speech produced with their own accent (e.g., Smith and Bisazza, 1982; Gass and Varonis, 1984).

We have recently investigated whether accented speech would be evaluated differently by Brazilian and Dutch speakers of English (Reis and Kluge, 2008; Kluge et al., 2008). Due to different phonological representations of the consonants /m/ and /n/ in word-final position in their own languages, Brazilian Portuguese and Dutch speakers tend to pronounce English words differently in this context: the Dutch present an English-like production, whereas Brazilians' output is remarkably accented. When asked to identify and rate samples of Brazilian-accented English, both groups showed that accurate pronunciation led to more word recognition. However, the Dutch were consistently more aware of the accented speech than the Brazilians. These somewhat contradictory results were interpreted as an indication that further studies are needed to understand to what extent accented speech is more or less intelligible for foreign language speakers.

Bearing this need in mind, our current work aims at analysing whether speakers of Brazilian Portuguese and French recognise and judge their own pronunciation of English words containing TH spelling, as in 'theater', as more intelligible than native and other non-native productions.

With regard to the application and impact of the results of this study on research on spoken language systems, if the findings show that foreign speakers tend to evaluate their own accent as more intelligible than native or other non-native pronunciations, adapting the systems to certain cultural or political contexts is a possibility to be considered. Multicultural environments in which a foreign accent is maintained as an identity and cultural bond could require an adjustment of the systems to these environments' needs.

2 Future of Spoken Dialogue Research

With the increasing use of speech technology, I like to think that further research will deepen knowledge of dialogue in human-machine interaction, as well as knowledge of language learning.

As for the improvement of services oriented by human-machine interaction, I believe that machines should not only be better able to recognise indirect speech acts that may lead to misinterpretation, but also offer output that is equally more easily recognised, identified, and interpreted by humans, as would be the case with machines providing accented speech for members of certain communities in multicultural environments.

With regard to dialogue for language learning, studies will shed more light on the possibilities of non-experts developing their own dialogue systems according to their needs. Teachers and students would benefit more if they could customise their own dialogue models, fitted to their own reality and necessities.

3 Suggestions for Discussion

• To what extent can dialogue systems be user-friendly? How can they be endowed with 'social competence'?

• How can dialogue systems benefit from an even wider multidisciplinary interaction?

• Is the multimodal approach applicable and suitable to any dialogue system?

References

C. T. Best and M. D. Tyler. 2007. Nonnative and second-language speech perception: Commonalities and complementarities. In O.-S. Bohn and M. J. Munro (Eds.), Second language speech learning: The role of language experience in speech perception and production, pp. 13–34. Amsterdam: John Benjamins.

M. Celce-Murcia. 1987. Teaching pronunciation as communication. In J. Morley (Ed.), Current perspectives on pronunciation, pp. 5–12. Washington, D.C.: TESOL.

J. E. Flege. 1995. Second Language Speech Learning: Theory, Findings, and Problems. In W. Strange (Ed.), Speech Perception and Linguistic Experience: Issues in Cross-Language Research, pp. 233–277. Timonium: York Press.

S. Gass and E. Varonis. 1984. The effect of familiarity on the comprehensibility of nonnative speech. In Language Learning 34, pp. 65–89.

D. C. Kluge, M. S. Reis, D. Nobre-Oliveira, and A. S. Rauber. 2008. Intelligibility of accented speech: the perception of word-final nasals by Dutch and Brazilians. In Procs. of V Jornadas en Tecnología del Habla, pp. 199–202, Bilbao.

P. K. Kuhl and P. Iverson. 1995. Linguistic experience and the "perceptual magnet effect". In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research, pp. 121–154. Timonium, MD: York Press.

I. R. A. Mackay, J. E. Flege, and S. Imai. 2006. Evaluating the effects of chronological age and sentence duration on degree of perceived foreign accent. In Applied Psycholinguistics 27, pp. 157–183.

J. Morley. 1991. The Pronunciation Component in Teaching English to Speakers of Other Languages. In TESOL Quarterly 25/1, pp. 51–74.

M. J. Munro, T. M. Derwing, and S. L. Morton. 2006. The mutual intelligibility of foreign accents. In Studies in Second Language Acquisition 28, pp. 111–131.

M. S. Reis, D. Nobre-Oliveira, and A. Rauber. 2007. Effects of perceptual training on the identification and production of the English voiceless plosives by Brazilian EFL learners. In Procs. of the Fifth International Symposium on the Acquisition of Second Language Speech, Florianópolis.

M. S. Reis. 2007a. O efeito de treinamento perceptual na percepção e produção das plosivas não-vozeadas do inglês. In III CELLI Colóquio de Estudos Linguísticos e Literários, Maringá.

M. S. Reis. 2007b. Guia de Pronúncia do Inglês para Brasileiros: Book review and experimentation. In José Marcelo Freitas de Luna (Org.), Educação e Linguística: ensino de línguas, pp. 137–149. Itajaí: Univali Editora.

M. S. Reis and D. C. Kluge. 2008. Intelligibility of Brazilian Portuguese-Accented English Realization of Nasals in Word-Final Position by Brazilian and Dutch EFL Learners. In Revista Crop 13, pp. 215–229.

L. Smith and J. Bisazza. 1982. The comprehensibility of three varieties of English for college students in seven countries. In Language Learning 32, pp. 259–269.

Biographical Sketch

Mara Reis is a 4th-year PhD student in Second Language Acquisition, particularly interested in English phonetics and phonology learning. She received her MA in the same field in 2006, her BA in Visual Arts in 2003, and her BA in Dentistry in 1995. Last summer she completed part of her PhD studies at University College London under the supervision of Prof. Valerie Hazan.


Sylvie Saget
LLI-IRISA
6 rue de Kerampont
BP 80518 - 22305 Lannion Cedex
France

[email protected]

1 Research Interests

My research interests lie generally in the area of dialogue management, with a special focus on the grounding process and common ground modelling.

1.1 Grounding and Common Ground

Considering dialogue as a collaborative activity when designing spoken dialogue systems aims at enhancing a system's robustness. Indeed, such modelling is a way of handling non-understandings by giving a dialogue system and its users the capacity to interactively refine their understanding until a point of intelligibility is reached.

The interaction between the different fields concerned with collaborative activities may be a source of richness and of significant advances. In particular, in philosophical studies, the necessity of distinguishing acceptance, a context-dependent (pragmatic) mental attitude, from belief, a context-free mental attitude, has been brought (again) to the forefront by many philosophers.

We integrate this distinction within a formal model of dialogue management, which leads to an architectural proposal on information modelling within dialogue systems. The foundation of this approach has been presented in [Saget, 2006; Saget and Guyomard, 2007], the collaborative model of dialogue and of reference in [Saget and Guyomard, 2006; Saget and Guyomard, 2007], and a first step towards linking this model to reference treatment in [Saget, 2007].

1.2 Robust and adaptive/adaptable interfaces dedicated to UV Systems

The next generation of Unmanned Vehicle (UV) Systems will include several vehicles with highly autonomous capabilities. As a consequence, such systems will require new forms of human-system interaction.

The LUSSI department of Telecom Bretagne has developed a prototype multi-UV ground control station (SMAART) that allows an operator to supervise the surveillance of a simulated strategic airbase by a swarm of rotary-wing UVs.

In this project, we adapt our approach to the grounding process to the specific design of operator interfaces for UV Systems [Coppin, Legras and Saget, 2009; Saget, Legras and Coppin, 2008]. Dialogue management is also an adaptable and adaptive process, in order to avoid negative side effects of automation.

2 Future of Spoken Dialogue Research

One of the main limitations one may encounter while designing a dialogue system is finding the proper toolkit. Dialogue system designers, in a commercial as well as a research perspective, may need a simple toolkit allowing them to develop a specific, adaptive and possibly multi-tasking interface. Effort therefore has to be made to develop a wide range of such toolkits corresponding to the major kinds of support, task complexity, etc.

Another important aspect is enhancing interface interactivity by improving user trust in spoken dialogue capabilities. Integrating alignment processes in spoken dialogue systems is a way to greatly enhance trust as well as to decrease system complexity. Moreover, users of dialogue systems need sufficient knowledge of the system's capacities (vocabulary, level of interaction flexibility, etc.) to have a suitable trust level. How, then, do we train future users of a dialogue system so that they have an accurate trust level? How do we assess an online user's trust, and how do we correct it if necessary?

3 Suggestions for Discussion

I would suggest the following subjects of discussion:

• Interactive alignment in spoken dialogue interaction: its usefulness in HCI, which levels have to be aligned, and whether alignment processes are useful for user modelling and learning.

• Users' training process: What is the main difference between professional and large public applications? When and how accurately does online help assist dialogue system users?


References

Gilles Coppin, François Legras, and Sylvie Saget. 2009. Supervision of Autonomous Vehicles: Mutual Modeling and Interaction Management. HCI International 2009, San Diego, CA, USA.

Sylvie Saget, François Legras, and Gilles Coppin. 2008. Collaborative model of interaction and Unmanned Vehicle Systems' interface. HCP workshop on "Supervisory Control in Critical Systems Management", 3rd International Conference on Human Centered Processes (HCP-2008), Delft, The Netherlands.

Sylvie Saget and Marc Guyomard. 2007. Doit-on dire la vérité pour se comprendre ? Principes d'un modèle collaboratif du dialogue basé sur la notion d'acceptation. Quatrièmes Journées Francophones des Modèles Formels de l'Interaction (MFI'07), Annales du Lamsade: 239–248, Paris, France.

Sylvie Saget. 2007. Using Collective Acceptance for modelling the Conversational Common Ground: Consequences on referent representation and on reference treatment. Proceedings of the 5th IJCAI Workshop on Knowledge and Reasoning in Practical Dialog Systems: 55–58, Hyderabad, India.

Sylvie Saget and Marc Guyomard. 2006. Goal-oriented Dialog as a Collaborative Subordinated Activity involving Collaborative Acceptance. Proceedings of the 10th Workshop on the Semantics and Pragmatics of Dialogue (Brandial 2006): 131–138, University of Potsdam, Germany.

Sylvie Saget. 2006. In favor of collective acceptance: Studies on goal-oriented dialogues. The Fifth International Conference on Collective Intentionality (CollInt V), Helsinki, Finland.

Biographical Sketch

Sylvie Saget has been preparing a PhD in Computer Science at the University of Rennes 1 since 2003. She is also a board member of the ISCA SAC as its General Coordinator. In 2008, she worked at Telecom Bretagne in the LUSSI Department on a one-year project.

She obtained her M.Sc. and B.Sc. in Computer Science, in 2003 and 2002 respectively, with a major in Artificial Intelligence, from the IFSIC Institute of the University of Rennes 1.


Nur-Hana Samsudin
School of Computer Science
University of Birmingham
B15 2TT Edgbaston
United Kingdom

[email protected]/˜nhs/

1 Research Interests

My research interest revolves around speech processing, specifically Text-to-Speech (TTS) synthesis. The focus of my current work is on multilingual/polyglot TTS and cross-language linguistic modelling. I am also interested in incorporating related speech recognition models, as well as issues of speaker adaptation, into my research.

1.1 Polyglot Speech Synthesis

Automatic speech synthesis is an emerging technology with applications in many different domains such as human-computer interaction, assistive technology and speech-to-speech systems. However, speech synthesis or Text-to-Speech (TTS) systems require extensive natural language resources. Unfortunately, such resources are not easily available for all languages, which makes multilingual TTS applications even more difficult to achieve. The issues are not only about resources; it is also that a different system typically needs to be developed separately for each language (Samsudin, 2009). Information reuse for polyglot TTS may be a solution: it makes it possible to adopt existing data and rules within a polyglot TTS system and thus to develop just one speech engine for multiple languages (Latorre, 2006).

1.2 Linguistic Modelling for Polyglot Speech Synthesis

There are two parts to a TTS architecture: the front end, which involves text processing, natural language processing and prosody assignment; and the back end, which involves all signal processing aspects of speech generation. Currently, I am working on the front end of a TTS system, researching a model to represent the linguistic aspects of speech. There are two components in this model: the phonetic processing component and the phonological processing component.

In phonetic data preparation, the goal is to arrive at a finite list of phonemes covering the phonemes used in the languages available in the resources. The processing should be able to provide a phoneme mapping where a pronunciation dictionary is not available. Currently, phonemes from 31 languages have been extracted, analysed and indexed. These phonemes come from the MBROLA database; I selected MBROLA because it is the most comprehensive phoneme set available. Six language families are involved: Indo-European (20), Afro-Asiatic (2), Uralic (2), Austronesian (3), Altaic (3) and Dravidian (1).
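As a rough sketch of the kind of phoneme mapping this requires, one can fall back to the nearest phoneme in a feature space when a language's inventory lacks a target phoneme. The feature encoding and inventory below are invented for illustration and are not the actual MBROLA-derived index:

```python
# Toy articulatory feature vectors (voicing, place, manner), scaled 0-1.
# The inventory and encodings are invented, not the MBROLA-derived index.
FEATURES = {
    "p": (0.0, 0.0, 0.0), "b": (1.0, 0.0, 0.0),
    "t": (0.0, 0.5, 0.0), "d": (1.0, 0.5, 0.0),
    "k": (0.0, 1.0, 0.0), "g": (1.0, 1.0, 0.0),
    "s": (0.0, 0.5, 1.0), "z": (1.0, 0.5, 1.0),
}

def map_phoneme(target, inventory):
    """Return the phoneme in `inventory` nearest to `target` in feature space."""
    tf = FEATURES[target]
    def distance(p):
        return sum((x - y) ** 2 for x, y in zip(tf, FEATURES[p]))
    return min(inventory, key=distance)

# A voice without /z/ borrows the closest available phone:
print(map_phoneme("z", inventory=["p", "t", "k", "s"]))  # -> "s"
```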

Phonological processing focuses on producing speech as close to a native speaker's as possible. To achieve that target, a few sets of rules need to be identified. The rules being considered are: general phonological rules, language-specific phonological rules, language pronunciation clusters and language intonation templates. These rules need to be constructed to complete the phonological processing component.

The implementation is targeted to fit into the Festival framework, selected for its flexibility; Festival already comprises the core architecture of a TTS system (CSTR, 2004). Theoretically, if we can create a Malay TTS in Festival (Malay is not among the text and speech corpora available in the Festival framework), then we can create an X TTS using Festival, where X is a combination of all the languages in question. Many other issues then arise. How can all the phonemes of the languages in question be represented? Since each sound unit can be represented based on the IPA, how can the phonological rules of different languages be represented using the identified list? There might be rule conflicts between one language and another. One way to solve this is to use a set of training data for the language in question. One advantage is, of course, getting the phonological rules right for each language; another is getting the spectral values correct in order to generate speech for that language. This, however, restricts the synthesised voice to the trainer's timbre only.

It is hoped that by developing this model we will be able to make the process of creating a multilingual/polyglot TTS system less time-consuming, and achievable for under-resourced or minority languages, or even with less linguistic information or technical expertise.

2 Future of Spoken Dialogue Research

In the next 5 to 10 years, I predict that research will focus more on the standardisation of resources in spoken language processing, so that resources are applicable to other languages and other domains. I also hope that tools to aid system development will become more accessible and that resources will have a standard format to increase usability.

Spoken dialogue research is progressing toward assistive usage, not only for the disabled but also for everyday needs. Current systems such as speech-to-speech translation, question-answering and edutainment systems are still being perfected. In the near future, these components will be integrated into fully working robotic systems and other domain-specific spoken dialogue applications to create a generation of ubiquitous computing.

3 Suggestions for Discussion

Some topics which may be of interest are:

• Creating a standard format for corpora: Lexicons, speech and text corpora may be easily available, but applying different corpora requires modifications to the particular application that wants to make use of those resources.

• Resource sharing: what are the limitations? MBROLA, Festival, FestVox, Sphinx and Dragon are among the applications that have successfully created an engine allowing a single platform to serve multilingual applications. Discussion of resource sharing for spoken dialogue application components, modules and tools may help new scientists avoid expensive mistakes in data collection.

• From generic to domain-specific applications: Tools like HTK, HTS and FestVox have been developed to aid speech application development. Addressing the development issues faced by the developers of these tools, and the conversion issues faced by their users, will help future tools anticipate the 'compulsory' requirements.

References

Nur Hana Samsudin. 2009. Reusing Multilingual Resources for Polyglot Speech Synthesis. Poster presentation, University of Birmingham, UK.

Javier Latorre, Koji Iwano, and Sadaoki Furui. 2006. New Approach to the Polyglot Speech Generation by means of an HMM-based Speaker Adaptable Synthesizer. Speech Communication, 48(10):1227–1242.

The Centre for Speech Technology Research. 2004. Festival. http://www.cstr.ed.ac.uk/. University of Edinburgh, UK.

Biographical Sketch

Nur-Hana Samsudin is a PhD student at the School of Computer Science, University of Birmingham, United Kingdom, under Dr Mark Lee's supervision. She previously obtained a Bachelor of Computer Science and a Master of Science from Universiti Sains Malaysia. Her MSc work was on Malay speech synthesis, and her earlier work revolved around unit selection speech synthesis, adjacency analysis and prosody analysis of Malay. After submitting her MSc dissertation, she worked as a researcher at MIMOS Bhd (Malaysia) in the Speech Processing Group of the Artificial Intelligence Centre. She then resigned from the company to further her studies; she is currently sponsored by the Ministry of Higher Education, Malaysia.


Stefan Schaffer
Technische Universität Berlin
Graduiertenkolleg Prometei
Franklinstraße 28-29
10587 Berlin

[email protected]/?id=sschaffer

1 Research Interests

My research interest is the development of model-based methods for simulation and automated usability prediction of multimodal interfaces. In particular, I want to investigate how users' modality choices can be predicted and simulated by a computational model estimating the quality of a multimodal interface. I am therefore also interested in exploring the rules and cognitive processes which impact multimodal interaction.

1.1 Spoken Language System-related Work

At Deutsche Telekom Laboratories (T-Labs) I worked on the integration of a Janus speech recognition module (Finke et al., 1997) into so-called Attentive Displays. The Attentive Displays are an interactive wall-mounted information system for employee and room search in an office environment. Originally the system was controlled with a touch screen only. To enhance the input facilities, speech recognition with acoustic models for distant speech was embedded, altering the system input from unimodal to multimodal.

The fusion module of the dialogue manager was based on a simple first-come, first-served strategy, because users can interact with the system via touch screen or speech in each dialogue state. Furthermore, the system was represented as a finite state machine, and the grammar of the speech recogniser was updated at each state change.
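A minimal sketch of such a first-come, first-served fusion loop over a state machine follows; the state names, payloads and the placeholder grammar loader are invented for illustration:

```python
import queue

events = queue.Queue()  # touch and speech handlers both post here

def on_touch(target):
    events.put(("touch", target))

def on_speech(hypothesis):
    events.put(("speech", hypothesis))

def transition(state, payload):
    # Illustrative two-step flow for an employee-search dialogue.
    table = {"start": "show_results", "show_results": "done"}
    return table.get(state, "done")

def load_grammar_for(state):
    pass  # placeholder: reload the recogniser grammar for this state

def dialogue_loop(state="start"):
    """First-come, first-served fusion: whichever modality delivers an
    event first drives the transition; the per-state ASR grammar is
    reloaded after every state change."""
    while state != "done":
        modality, payload = events.get()  # first event in wins the turn
        state = transition(state, payload)
        load_grammar_for(state)
        print(f"{modality} input -> state {state}")

on_touch("room_42")           # touch arrives first ...
on_speech("where is John?")   # ... speech drives the following turn
dialogue_loop()
```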

1.2 Research Work

Within my research group I conducted usability studies with unimodal and multimodal versions of the Attentive Displays to investigate how ratings of single modalities relate to the ratings of their combination (Wechsung et al., 2009a). To collect subjective user judgements, the AttrakDiff questionnaire was used (Hassenzahl et al., 2003). We found that for overall and global judgments, ratings of the single modalities are very good predictors of the ratings of the multimodal system. Additionally, the results indicate that the modality used more frequently in multimodal interaction has a higher influence on the judgment of the multimodal version than the less frequently used modality. However, for separate usability aspects (e.g. hedonic qualities) the prediction was less accurate. The findings of this study were limited to the tested system and test design: the multimodal version was always the system tested last. We conducted a follow-up study with an enhanced version of the system (Wechsung et al., 2009b), changing the order so that the multimodal version was tested first. The results of the first experiment could partially be supported; the prediction of the ratings of the multimodal system via overall and global judgments was still possible. The hypothesis that lower prediction performance is partially a consequence of the participants' effort to rate consistently was supported.

Moreover, we examined influences of user characteristics (training) on direct and indirect measures for the evaluation of multimodal systems (Seebode et al., 2009). We found some differences in the interaction patterns of trained and untrained users: experts needed less time and fewer trials to solve the tasks and used speech more frequently as an input modality than novice users did. We could observe different groups of users with different interaction patterns that have an influence on their ratings.

1.3 Future Work

My previous research concerning usability evaluation via direct and indirect measures indicated that the use of multimodal human-machine interfaces is influenced by certain user interaction patterns. I want to investigate whether these patterns can be characterised by rules that explain multimodal interaction. My further experiments will focus on how users of a multimodal system choose the interaction modality. Rules should be derived from the observed interaction patterns.

In a next step the rules will be used for model-based evaluation. To this end, the MeMo workbench for automated usability testing will be extended (Engelbrecht et al., 2008). Test scenarios and prototypes will be modelled and simulated, and the data gathered through simulation will be compared to empirical data from previously conducted experiments. An additional verification of the findings can be provided through an ACT-R model that implements the rules and test scenarios in a simulation (Anderson, 1996).

2 Future of Spoken Dialogue Research

My impression is that recently only small improvements in speech recognition technology have been made. The integration of specialised acoustic models and language models seems to have little impact on further enhancing the reliability of dialogue systems. However, an improvement of recognition capabilities is necessary to build systems with high acceptance. Hence, researchers should also investigate how a system can profit from additional information, e.g. knowledge about the user. By this means a system could adapt components like dialogue strategies, grammars or acoustic models to the actual user, which may beneficially affect the reliability of the recognition.

3 Suggestions for Discussion

• Model-driven evaluation and modelling of user behaviour.

• Additional knowledge bases for future dialogue systems.

• How to define rules for the simulation of interaction.

References

J. R. Anderson. 1996. ACT: A simple theory of complex cognition. American Psychologist, 51: 355–365.

K.-P. Engelbrecht, M. Kruppa, S. Möller and M. Quade. 2008. MeMo Workbench for Semi-automated Usability Testing. In: Proc. Interspeech 2008.

M. Finke, P. Geutner, H. Hild, T. Kemp, K. Ries, and M. Westphal. 1997. The Karlsruhe-Verbmobil Speech Recognition Engine. In: Proc. ICASSP '97, Munich, Germany.

M. Hassenzahl, M. Burmester, and F. Koller. 2003. AttrakDiff: Ein Fragebogen zur Messung wahrgenommener hedonischer und pragmatischer Qualität. In J. Ziegler and G. Szwillus (eds.), Mensch & Computer 2003. Interaktion in Bewegung: 187–196, B.G. Teubner, Stuttgart.

I. Wechsung, K.-P. Engelbrecht, S. Schaffer, J. Seebode, F. Metze, and S. Möller. 2009a. Usability Evaluation of Multimodal Interfaces: Is the whole the sum of its parts? Accepted for: 13th International Conference on Human-Computer Interaction.

I. Wechsung, K.-P. Engelbrecht, A. B. Naumann, S. Schaffer, J. Seebode, F. Metze, and S. Möller. 2009b. Predicting the quality of multimodal systems based on judgments of single modalities. Accepted for: Interspeech 2009.

J. Seebode, S. Schaffer, I. Wechsung, and F. Metze. 2009. Influence of Training on Direct and Indirect Measures for the Evaluation of Multimodal Systems. Accepted for: Interspeech 2009.

Biographical Sketch

Stefan Schaffer studied Communication Science and Computer Science at the Technical University Berlin. After receiving his Magister degree in 2009, he joined the research training group prometei. He is currently working towards his PhD thesis in the domain of automated usability evaluation for multimodal systems.


Alexander Schmitt
Dialogue Systems Research Group
University of Ulm
Albert-Einstein-Allee 43
89081 Ulm

[email protected]/in/it/staff/ds/alexander-schmitt.html

1 Research Interests

My research topic is centered around the stochastic detection of problematic dialogue situations in telephone-based speech applications. I apply statistical classifiers trained on log data to detect such situations. Apart from this dialogue-feature-based detection, my work includes the robust detection of the user's emotional state (especially anger) and the incorporation of speaker knowledge such as age and gender into the recognition process. I am also interested in the field of speech recognition on mobile devices, where I did research during my Master thesis project (Zaykovskiy and Schmitt, 2007; Zaykovskiy et al., 2007; Zaykovskiy and Schmitt, 2008).

1.1 Problematic Dialogue Situations

In the telephone application context, by "problematic" we mean situations where the caller is about to hang up without completing the task. Reasons might be that the caller is annoyed by automated agents in general, or stuck and helpless due to faulty system behavior. Especially in longer conversations, such as those with automated technical support agents as described in (Acomb et al., 2007), where dialogues frequently consist of 50 system and user turns, the ability to repair a problematic situation would lower the drop-out rate.

Given my past work, my goal is to render SDS more "intelligent" than they are today. The central question is: is it possible to detect problems based on previously observed calls, and if so, how robust is this approach?

1.2 Dialogue Features for Problematic Dialogue Prediction

My first Problematic Dialogue Predictor used a confidence-rated rule learner, similar to (Walker et al., 2002), trained on turn-wise dialogue features logged during previous calls, such as ASR accuracy, ASR rejection, current system state, system prompt, user answer, etc. It turned out that it worked quite well at early points in the dialogue, but had flaws when the dialogues became longer (Schmitt and Liscombe, 2008).
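A hedged sketch of a turn-wise predictor of this kind follows, using a decision tree as a stand-in for the confidence-rated rule learner; the features, data and threshold are invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# One row per turn: [asr_confidence, asr_rejected, turn_number,
# reprompt_count]. All values and labels are invented for illustration.
X = [
    [0.92, 0, 1, 0],
    [0.35, 1, 2, 1],
    [0.88, 0, 3, 0],
    [0.20, 1, 4, 2],
]
y = [0, 1, 0, 1]  # 1 = the call later ended as a hang-up / task failure

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)

# At runtime, score each new turn and flag calls that look problematic.
risk = clf.predict_proba([[0.30, 1, 2, 1]])[0][1]
if risk > 0.5:
    print("escalate to a human agent or trigger a repair strategy")
```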

Additional features had to be developed. Into our classification I included acoustic information (by early fusion), assuming that angry callers have a lower task completion rate. This did not improve the results, since the acoustic features performed too poorly in contrast to the dialogue features (Herm et al., 2008).

1.3 Anger Recognition

Late fusion of acoustic information with dialogue features seems more promising than early fusion. In recent work, I have thus developed an anger detection system to provide additional information to the Problematic Dialogue Prediction task, to be late-fused with the log-data classifier. It is based on a combination of a Support Vector Machine, a Neural Network and a Decision Tree (Schmitt et al., 2009).
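Such a three-classifier combination could look roughly like the following sketch, here realised as soft voting over synthetic placeholder features; the real system's features are acoustic and its combination scheme may differ:

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder "acoustic" features (e.g. pitch mean, energy, speaking
# rate) and synthetic anger labels, standing in for a labelled corpus.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(probability=True)),
        ("mlp", MLPClassifier(max_iter=2000)),
        ("tree", DecisionTreeClassifier(max_depth=4)),
    ],
    voting="soft",  # average the three class-posterior estimates
).fit(X, y)

print(ensemble.predict(X[:5]))  # anger / non-anger decisions
```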

My current goal is to make this anger detection system as robust as possible by including all available acoustic, lexical and dialogue features in an anger/non-anger classifier. I further consider the dialogue context and previous emotions of the caller.

In parallel, I am developing additional lexical, dialogue and context features for the Problematic Dialogue Predictor, such as bag-of-words features, ratios of appropriate to inappropriate user answers, and probabilities of task success in the current system state.

Another central question I want to answer is the impact of age and gender on the ability to cope with an automated agent. Might there be a difference between senior and younger users, as well as between male and female users? Besides the labelling of the corpus, I plan to develop an age and gender recognition system whose output is incorporated into the Problematic Dialogue Predictor.

2 Future of Spoken Dialogue Research

In my estimation, telephone-based applications are the platform for proving the sense and nonsense of SDS technology. Here, the acceptance of the users has to be won. While most people have already had "conversations" with Interactive Voice Response systems (IVRs), they often have a distaste for them. The telecommunications provider O2 even recently started an ad campaign in Germany proudly announcing that it would send the "robots" (i.e. IVRs) into retirement.

One of our most important tasks, in my view, is to increase the acceptance of SDS by endowing them with "intelligence" and empathy. SDS have to appear more on an equal footing with the user and have to look more competent and sensitive than they are today.

Future interfaces should be able to adapt to the user's peculiarities, intentions and emotions, i.e., the interface should be capable of identifying and understanding the feelings, ideas, and (personal) circumstances of its users.

One of the decisive points here is further progress in speech-based emotion recognition. Additional real-life corpora have to be collected that are different from the mostly acted corpora currently available to the community. Further, user modelling in general will be a central research topic in the next 10 years: who is the user, what are his or her preferences, is the current situation suited for pro-active behaviour on the part of the system, etc.

Since SDS will gain further importance in the near future (intelligent homes, automotive applications, multimodal information systems), the field of SDS will continue to grow. However, we have to ensure that we deliver pragmatic research results that can be realised in the field today.

3 Suggestions for Discussion

• Emotion recognition: Are we at the limit? Is there no more classification accuracy to gain?

• Telephone-based SDS: Are Interactive Voice Response systems doomed to death... and what comes next?

• Distributed Speech Recognition: a technology with a future, or a superfluous invention?

• Problems in SDS: how can they be detected and prevented?

References

Kate Acomb, Jonathan Bloom, Krishna Dayanidhi, Phillip Hunter, Peter Krogh, Esther Levin, and Roberto Pieraccini. 2007. Technical support dialog systems: issues, problems, and solutions. In Proceedings of the Workshop on Bridging the Gap: Academic and Industrial Research in Dialog Technologies, pages 25–31, Rochester, NY, April. Association for Computational Linguistics.

Ota Herm, Alexander Schmitt, and Jackson Liscombe. 2008. When calls go wrong: How to detect problematic calls based on log-files and emotions? In Proc. of the International Conference on Speech and Language Processing (ICSLP), September.

Alexander Schmitt and Jackson Liscombe. 2008. Detecting Problematic Calls With Automated Agents. In 4th IEEE Tutorial and Research Workshop Perception and Interactive Technologies for Speech-Based Systems, Irsee (Germany), June.

Alexander Schmitt, Tobias Heinroth, and Jackson Liscombe. 2009. On nomatchs, noinputs and bargeins: Do non-acoustic features support anger detection? In Proceedings of the 10th Annual SIGDIAL Meeting on Discourse and Dialogue, SigDial Conference 2009, London, UK, September. Association for Computational Linguistics.

M. A. Walker, I. Langkilde-Geary, H. W. Hastie, J. Wright, and A. Gorin. 2002. Automatically training a problematic dialogue predictor for a spoken dialogue system. Journal of Artificial Intelligence Research, 16:293–319.

Dmitry Zaykovskiy and Alexander Schmitt. 2007. Java to Micro Edition Front-End for Distributed Speech Recognition Systems. In The 2007 IEEE International Symposium on Ubiquitous Computing and Intelligence (UCI'07), Niagara Falls (Canada), May.

Dmitry Zaykovskiy and Alexander Schmitt. 2008. Java vs. Symbian: A Comparison of Software-based DSR Implementations on Mobile Phones. In 4th IET International Conference on Intelligent Environments, Seattle (USA), July.

Dmitry Zaykovskiy, Alexander Schmitt, and M. Lutz. 2007. New Use of Mobile Phones: Towards Multimodal Information Access Systems. In 3rd IET International Conference on Intelligent Environments, Ulm (Germany), September.

Biographical Sketch

Alexander Schmitt studied Computer Science in Ulm (Germany) and Marseille (France) with a focus on media psychology and spoken language dialogue systems. He received his Masters degree in 2006, graduating on Distributed Speech Recognition on Mobile Phones at Ulm University, and is currently carrying out his PhD research project (3rd year) at Ulm University. His topic is centered around the detection of problematic phone calls in Interactive Voice Response systems and is carried out in cooperation with SpeechCycle, NYC, USA. Schmitt worked for Daimler Research Ulm on statistical language modelling and regularly gives lectures on the development of voice user interfaces in several places, e.g. the German University in Cairo, Egypt.


Niels Schutte
Dublin Institute of Technology
School of Computing, DIT Dublin
Kevin Street
Dublin 6, Ireland

[email protected]

1 Research Interests

My research interest is in the area of Multimodal Dialogue Systems, especially in situated dialogues.

1.1 Past and planned work

As part of my Diplom degree I worked in the SmartChair project. In this project we developed a multimodal dialogue interface for the Bremen Rolland autonomous wheelchair platform.

My current research is part of the Lok8 project at DIT Dublin, which started in early 2009. The goal of this project is to develop a multimodal dialogue system that allows users to access location-based services using mobile devices. In particular, we are investigating the use of mobile devices such as iPhones or Google Android phones to track the position of the user in an environment, as well as the use of the sensors of those devices to recognise certain classes of gestures, such as directed pointing. This is the subject of the "Tracker" strand of the project.

An animated avatar, which can be displayed either directly on the device or on displays available in the environment, will be utilised for interaction with the user. This will be developed in the "Avatar" strand of the project.

Apart from conventional ideas of dialogic interaction, we are also looking to integrate methods of sonification, such as audio-based navigation, in the "Vocate" strand. One goal will be to research how those different ideas can complement each other.

My work in this project (in the "Contact" strand) is to develop and implement the dialogue framework of the system. Among other things, this will entail developing models to integrate the linguistic interaction with the user with information supplied by the mobile device. Information about the current location of the user and the direction of the device, in conjunction with spatial models of the environment, will enable us to interpret pointing gestures as exophoric references to objects in the environment (see the sketch below). Information about general movements of the device may be used to interpret movement gestures like dismissive hand waving.

Figure 1: Lok8 system overview
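A hedged geometric sketch of this kind of exophoric reference resolution: cast a ray from the tracked position along the device heading and pick the object lying closest to it. All object data and thresholds are invented:

```python
import numpy as np

# Invented spatial model of the environment: object name -> 2-D position.
OBJECTS = {
    "printer": np.array([4.0, 1.0]),
    "meeting_room_door": np.array([2.0, 5.0]),
    "coffee_machine": np.array([6.0, 4.5]),
}

def resolve_pointing(position, heading_deg, max_offset=1.0):
    """Return the object whose position lies closest to the pointing ray.

    position: 2-D user location from the tracker
    heading_deg: device compass heading in degrees
    max_offset: how far off the ray an object may lie and still count
    """
    direction = np.array([np.cos(np.radians(heading_deg)),
                          np.sin(np.radians(heading_deg))])
    best, best_offset = None, max_offset
    for name, pos in OBJECTS.items():
        rel = pos - position
        along = rel @ direction                 # distance along the ray
        if along <= 0:                          # object is behind the user
            continue
        offset = np.linalg.norm(rel - along * direction)  # off-ray distance
        if offset < best_offset:
            best, best_offset = name, offset
    return best

# "What is *that*?" while pointing roughly north-east from the origin:
print(resolve_pointing(np.array([0.0, 0.0]), heading_deg=40))
```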

Apart from that, we are going to investigate the advantages of different types of interaction, when each type of interaction may be preferable, and how we can smoothly transition between them.

The use of an avatar as a personification of the system poses several interesting research questions. For example, we are considering the use of personalised avatars that, to a certain extent, may have their own personality traits, such as playfulness or seriousness. How to integrate such individual behaviours with the general behaviours required by the system task will have to be researched.

Each strand is handled by a PhD student from a different background, ranging from Computer Science to Design.

Since the project, and my part in it, is quite broad in scope, and my work has only recently begun, I am still developing my research questions.

2 Future of Spoken Dialogue Research

We believe that an important topic in the future development of dialogue systems will be the use of dialogic interaction with mobile devices. This kind of dialogue will be situated and will have to take modalities other than speech into account. The mobility of such a system may also open up questions of scalability. Some situations may make it necessary to reduce the activity of the system while maintaining minimal functionality. For example, it may be awkward to communicate with a system in a crowd of people; in such a situation it might be appropriate to reduce the activity of the system to audio cues that highlight points of interest or guide navigation. If both hands are occupied, it may be impossible to operate buttons or a touch screen; in such a situation the interaction should shift towards speech.

We think that this kind of scalability is important and should be researched. This question could also be generalised into researching flexible models for varying availability of modalities, which would allow deployment on different hardware platforms.

3 Suggestions for Discussion

I think the following areas might be worth looking at:

• How can multimodality be used to enhance the use of mobile devices for dialogue? How can weaknesses of mobile devices, such as small screens, be balanced out using other modalities?

• How can environments be efficiently modelled or annotated for interaction with mobile dialogue systems?

• How can we model linguistic context and attention in a mobile situated dialogue system, i.e. when the context changes depending on the movement of the system?

Biographical Sketch

The author is currently a first-year PhD student at Dublin Institute of Technology. He is supervised by Dr John Kelleher and Dr Brian Mac Namee. Previously the author studied Computer Science at the University of Bremen in Germany. The title of his Diplom thesis is "Automatic Generation of Concept Descriptions"; it deals with generating natural language descriptions of ontology contents.


Álvaro Sigüenza Izquierdo
Universidad Politécnica de Madrid
Avda. Complutense 30
28040 Madrid
Spain

[email protected]

1 Research Interests

My main research interests lie in the area of human-machine interaction, with a special focus on spoken dialogue systems used while the user is engaged in a multitask driving environment.

1.1 Past and ongoing research

Driving is a complex task that requires the interaction and coordination of physical, sensory and mental driver skills. Consequently, a great part of the user's attention must be oriented to the driving task in order to achieve a high level of performance. Traditionally, the driver's attention could be diverted to other tasks arising inside the vehicle, which are frequently interrelated with the driving task. Nevertheless, the proliferation of in-vehicle information and communication systems has imposed new tasks whose accumulation requires drivers to assimilate a lot of information in a brief period of time. When a demanding driving situation appears (e.g. an intersection), the driver might be too engaged in processing the in-vehicle system's messages, putting the driver at risk of an accident.

In the driving context, the interaction between drivers and systems cannot be considered the primary task: drivers must pay attention to driving safely. Traditional in-vehicle interfaces require the driver to take their hands from the wheel to press a button, or to look away from the road at a display. Speech interfaces, however, allow drivers to keep their hands on the wheel and eyes on the road, and are therefore considered a non-distracting interaction mode.

At GAPS (the Signal Processing Applications Group at UPM) we have been researching speech-based human-vehicle interaction, collaborating on several projects dealing with this theme. Our main aim in this field of research is to adapt in-vehicle spoken dialogue systems to the circumstances of driving, focusing on the state of the driver.

1.2 Adapting in-vehicle spoken dialogue systems

Several studies have shown that, in spite of the benefits provided by vocal interfaces, there are situations where the interaction with the in-vehicle system can entail distraction. This distraction might be caused by the workload related to the concurrence of driving and a task involving a spoken dialogue (the secondary task). To cope with this additional information, the driver needs to perceive the aural message of the system, process the information and react to it, with the possibility of interfering with the driving task. According to the multiple resources model, the resources of the driver are limited, and when the driving situation is too demanding and the same resources are needed for both driving and the secondary task, interference between tasks can impair driving if the driver is overloaded (Vollrath 2007). In addition, driving is a special context where recognition errors are likely to appear, caused by environmental factors (such as noise), intrinsic user factors (mood state, fatigue...), task variables or even the design of the interaction flow, increasing the workload experienced by the driver (Kun et al. 2007).

With the aim of avoiding situations where the safety of the driver can be affected by interference between driving and spoken dialogue systems, it is necessary to evaluate the vehicle context in order to take the appropriate measures. From the driving context, an assessment of the workload experienced by the user can be estimated, allowing the spoken dialogue to be adapted to the workload. As priority in this environment is given to the driving task, the driver should engage in a spoken dialogue only so long as the combination of driving workload and dialogue workload does not exceed the driver's capacity.

Accordingly, controlling the dialogue flow requires a workload manager that attempts to determine whether the driver is overloaded or distracted and, if so, alters the operation of the in-vehicle systems (Green 2004). This workload manager has to consider what the right mode to interact with the driver is. If the driver is overloaded, it might be necessary to stop or interrupt the dialogue flow to resolve the situation; once the situation has ended, the dialogue should be resumed at exactly the point it had reached before the interruption. Moreover, the control of the dialogue might make use of additional information (such as driving context, user preferences...), lending intelligence to the dialogue in order to focus it and decrease the number of dialogue turns, and consequently the workload. For example, if an application knows that the driver likes vegetarian restaurants, it might suggest that he stop for lunch without needing to ask about his preferences. In the same way, depending on the driving situation, it might be better to present the system's information in another interaction modality (visual display, haptic actuator).

Since there are few studies focusing on the workload management of a spoken dialogue system running concurrently with the driving task, GAPS has developed a bench simulator to research all of these issues (Blanco et al. 2009).
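A minimal sketch of the workload-manager decision logic described above follows; the workload estimates, thresholds and action names are illustrative, not taken from the GAPS simulator:

```python
def workload_manager(driving_load, dialogue_load, capacity=1.0):
    """Decide how the in-vehicle dialogue should behave given estimates
    of the current driving and dialogue workload (both in [0, 1]).
    Thresholds are illustrative; a real manager would be tuned on
    simulator data."""
    if driving_load + dialogue_load <= capacity:
        return "continue_dialogue"
    if driving_load > 0.8 * capacity:
        return "pause_dialogue_and_bookmark"  # resume later at the same point
    return "switch_modality"                  # e.g. route output to a display

# Approaching an intersection while the system is mid-prompt:
print(workload_manager(driving_load=0.9, dialogue_load=0.3))
```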

2 Future of Spoken Dialogue Research

I believe that in the near future we will see a proliferation of spoken dialogue systems in a wide range of contexts, such as the vehicle. This will bring great interest in analysing the effects that concurrent performance with other tasks produces on the course of the dialogue. Since in some contexts the spoken interaction with a system cannot be the priority task (e.g. driving), a study of the workload generated by the dialogues is needed to ensure proper performance of the primary task. We need to incorporate technologies for the representation of context in order to adapt the dialogue flow to the workload situation. In the same way, research will be necessary on methods that allow overloaded users to be detected a priori.

In the driving context, a simulation bench where we can evaluate different workload adaptation strategies without putting drivers at risk will be important for carrying out these aims. In any case, a model of the structure of human attention resources is needed so that spare resources can be considered when we analyse the best way of interacting at a specific instant.

3 Suggestions for Discussion

My suggestions for discussion are the following:

• Incorporation of contextual information and inference technologies into the spoken dialogue, considering the time restrictions in this kind of system.

• How should the spoken dialogue be resumed when it has been interrupted because the user is overloaded or because of adverse driving situations?

• Strategies to reduce the workload in dual-task environments.

• Methods to detect overloaded users using only speech information from the dialogue.

References

Kun A., Paek T., and Medenica Z. 2007. The Effect of Speech Interface Accuracy on Driving Performance. Interspeech 2007.

Vollrath M. 2007. Speech and driving – solution or problem? Intelligent Transport Systems, IET 1: 89–94.

Green P. 2004. Driver Distraction, Telematics Design, and Workload Managers: Safety Issues and Solutions. SAE Paper Number 2004-21-0022.

Nishimoto T., Shioya M., Takahashi J. and Daigo H. 2005. A study of dialogue management principles corresponding to the driver's workload. Biennial on Digital Signal Processing for In-Vehicle and Mobile Systems.

Blanco J.L., Sigüenza A., Díaz D., Sendra M. and Hernández L.A. 2009. Reworking spoken dialogue systems with context awareness and information prioritisation to reduce driver workload. In: Proceedings of NAG-DAGA 2009, International Conference on Acoustics.

Biographical Sketch

Álvaro Sigüenza Izquierdo has a MEng in telecommunications from Universidad Politécnica de Madrid. He is currently a PhD student and holds a research position at the same university under the supervision of Luis Hernández.



Vasile Topac
Department of Computer Science and Information Technology
Politehnica University of Timisoara
Bd. Vasile Parvan nr. 2
RO-300223 Timisoara, Romania

[email protected]

1 Research Interests

My research lies in improving the accessibility of text-based information (TBI). By text-based information I mean text having different media representations (e.g. speech, digital text, printed text). To this end I focus on designing software applications that integrate technologies like speech recognition, spoken dialogue systems, machine translation, text processing, TTS, OCR and others to improve access to TBI. My research also extends to the topic of text encoding.

1.1 Past Work

For my diploma project I developed an application for improving access to printed documents. The application allowed the user to place a document under a high-resolution webcam or into a scanner and listen to the text spoken in any language ("any" language stands for a limited set of languages, of course...). Although the application was designed for users with visual disabilities, it has proven to be useful in many other scenarios. Since then I have identified more limitations to TBI accessibility and decided to address some of them.

1.2 Accessibility Limitations

When trying to access TBI, depending on how it is represented, a user may encounter several limitations. I have identified and addressed four main ones:

• Physical (the user has hearing or vision problems);

• Language (the information is presented in a language unknown to the user);

• Specialised text (the information is specialised to a specific domain, like medicine, and is hard to understand for lay users);

• Time (the user has to access a large quantity of information in a limited time).

1.3 Current Work

My current work is focused on creating software architectures and developing solutions that try to overcome the limitations described above.

Starting from the application I described above, I continued with an application that improves access to TBI and addresses the four limitations. Currently I work on text adaptation methods and on adding text summarisation features, so that the user can listen to a summary of a printed page or of an audio file. I am also researching how this kind of application can be controlled using spoken dialogue systems. Using this type of application leads to the problem of storing TBI so that it can be presented in any media representation and any language without changing the meaning of the information. To do this, the text has to be encoded with enough metadata. Existing XML-based formats like VoiceXML or TEI would have to be extended in order to store enough information about the text.
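As a purely hypothetical illustration of what such an extended encoding might look like, the sketch below stores one text segment together with metadata aimed at the four limitations. The element and attribute names are invented for this example; they are not part of TEI or VoiceXML.

```python
# Hypothetical TBI encoding sketch; element/attribute names are invented.
import xml.etree.ElementTree as ET

doc = ET.Element("tbi", attrib={"source-media": "printed", "lang": "en"})

seg = ET.SubElement(doc, "segment", attrib={
    "id": "1",
    "domain": "medicine",        # enables specialised-text adaptation
    "summary-level": "full",     # enables time-limited access
})
seg.text = "Aspirin inhibits platelet aggregation."

alt = ET.SubElement(doc, "segment", attrib={
    "id": "2",
    "domain": "general",         # simplified wording of the same content
    "summary-level": "summary",
    "derived-from": "1",
})
alt.text = "Aspirin thins the blood."

print(ET.tostring(doc, encoding="unicode"))
```

A renderer could then pick the segment matching the user's needs (the simplified variant for a lay user, or the summary when time is short) and pass it to TTS, a visual display, or machine translation without touching the underlying meaning.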

1.4 Spoken Dialogue Systems and TBI

Communication between spoken dialogue systems and applications like the ones described above can completely change the way users interact with TBI. While in most applications using spoken dialogue systems the focus is on the communication itself, here the focus is on the text and on increasing TBI accessibility. Imagine applications that allow users to query audio files or printed pages containing TBI in any language, with interpretation or summarisation features available.

2 Future of Spoken Dialogue Research

I believe that spoken dialogue systems will move closer to text-based information accessibility. I am thinking of desktop applications for text visualisation or editing that have integrated spoken dialogue systems and allow "intelligent" document interrogation in a multilingual fashion.

I can imagine scanners or smart phones with high-resolution cameras having embedded spoken dialogue systems for querying a paper in a multilingual fashion; also MP3 players or smart phones having embedded spoken dialogue systems not only for navigation, but also for interaction with the content of an audio file.

I believe that the technology is already here and is mature enough to take these steps.

3 Suggestions for Discussion

• How can spoken dialogue modules become easier to integrate into applications developed in different environments and running on different operating systems?

• How to define metrics for measuring the quality of the dialogue?

• What about adding spoken dialogue systems to devices not only for menu navigation but also for content interrogation? For example, a smart phone having the ability to interrogate an MP3 file and to respond in a multilingual way?

• What about consuming spoken dialogue web services in smart devices?


Biographical Sketch

Vasile Topac is a first-year PhD student at "Politehnica" University of Timisoara, Romania. He graduated from "Aurel Vlaicu" University of Arad and studied one semester in Portugal at "Fernando Pessoa" University. He has a passion for programming and two years of working experience as a software engineer. He is a friend of both Java and C#: he worked with the first for two years and played with the second in the Imagine Cup competitions, as well as in research projects. When he is around science, he likes Einstein's words: "Imagination is more important than knowledge".



Bogdan Vlasenko
Cognitive Systems, IESK
Otto-von-Guericke University Magdeburg
D-39106 Magdeburg
Germany

Bogdan.Vlasenko@e-technik.uni-magdeburg.de
www.iesk.uni-magdeburg.de/Bogdan_Vlasenko.html

1 Research Interests

My research interests lie generally in the area of multimodal human-machine dialogue systems, with a special focus on human behavior modelling through multimodal emotion and intention recognition in such systems.

In the first stage of my research I investigate the possibility of emotion recognition within speech. All evaluations are carried out on public and commercial databases containing acted and spontaneous emotionally colored speech. Among the public databases I single out EMO-DB (Berlin Database of Emotional Speech) and SUSAS (Speech Under Simulated and Actual Stress); among the commercial databases I work with SmartKom and ABC (Airplane Behavior Corpus). In (Vlasenko, 2007a, b), I describe a robust method for emotion classification within speech. Emotion recognition with the simplest behavior models and deep acoustic analysis showed good results. I am now modelling emotion recognition methods that depend on a prepotent user behavior model.

The next stage of my research involves multimodal human behavior modelling. I want to combine acoustic, video (facial expression, gesture) and linguistic analysis for robust emotion and intention classification, realising early and late fusion of speech-based and video-based emotional prosody classifiers; a late-fusion sketch is given below. Interesting ideas about multimodal emotion classification based on an acted audiovisual database can be found in (Schuller, 2007b).
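The following minimal sketch shows the late-fusion half of that plan: each modality-specific classifier emits posteriors over the same emotion set, and a weighted sum picks the fused label. The emotion inventory, the weights and the example posteriors are all invented for illustration.

```python
# Late-fusion sketch: combine per-modality emotion posteriors by weighted
# sum. Emotions, weights and posterior values are hypothetical examples.
EMOTIONS = ["anger", "joy", "neutral", "fear"]

def late_fusion(audio_post, video_post, w_audio=0.6, w_video=0.4):
    """Fuse two posterior distributions and return the winning label."""
    fused = {e: w_audio * audio_post[e] + w_video * video_post[e]
             for e in EMOTIONS}
    return max(fused, key=fused.get)

audio = {"anger": 0.7, "joy": 0.1, "neutral": 0.2, "fear": 0.0}
video = {"anger": 0.2, "joy": 0.1, "neutral": 0.6, "fear": 0.1}
print(late_fusion(audio, video))  # -> "anger": the audio evidence dominates
```

Early fusion would instead concatenate the frame-level features of both modalities before a single classifier; the weighting question then moves into the model itself.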

In the framework of the Neurobiologically Inspired, Multimodal Intention Recognition for Technical Communication Systems (NIMITEK) project (wdok.cs.uni-magdeburg.de/nimitek), I am realising speech recognition with intention and emotion classification in a human-machine dialogue system. Above all, emotion classification is supposed to indicate problematic situations in real human-machine dialogue evaluation. A workgroup of the NIMITEK project carried out a Wizard-of-Oz experiment and collected an auxiliary audiovisual database of simulated human-machine dialogues with a substantial amount of spontaneous emotion.

2 Future of Spoken Dialogue Research

Taking into consideration the experience of the Verbmobil, SmartKom (Streit, 2006) and SmartWeb (Hacker, 2006) projects, prosody analysis plays a very important role in dialogue system modelling. Automatic dialogue systems get easily confused if the recognised speech cannot be matched with the dialogue turn. Besides noise or other people's conversations, even the user's own utterance can cause difficulties when he is talking to someone else or to himself ("Off-Talk"). For this reason prosody analysis combined with video-based user gaze detection can be used for automatic classification of the user's focus of attention. Integration of On-Talk and focus-of-attention recognition can increase the reliability and flexibility of a human-machine dialogue system for an Intelligent House solution. Such a system can identify objects of user attention and a set of possible user interests.

At present the community of spoken dialogue system researchers has become aware of the fact that emotions, mood, intentions and other attitudes play an important role in natural communication. Emotions may work as a self-contained dialogue move, and they show a complex relation to explicit communication. This insight into human behavior has resulted in research on affective interfaces and in the development of user behavior models in the context of human-machine interaction. The main aim is to support the recognition of problematic dialogue situations by analysing the affective state of the user. In addition, for a human-machine dialogue system in an Intelligent House solution it would be good to realise classification of the user's mood and condition. This will be possible with complex multimodal human behavior modelling; such a system would have the capacity to cope with a user's bad mood or tiredness.

Speech recognition confidence can be used in conjunction with the semantic model of natural language understanding to promote correct machine responses. Semantic models can help the system to detect possible thematic domains from a spontaneous user request. Self-trained user-dependent or universal language models for the detection of thematic domains can advance us towards an era of natural-language human-machine communication. In practice, this would make it possible for the dialogue system to automatically gather and generate language models for a detected thematic domain using World-Wide Web resources and the current user dialogue history. In the case of Intelligent House solutions, the system could then collect films, TV shows, current cultural events, sports highlights and more that please the user (according to the user's detected thematic domains), making the user's home life more comfortable.
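As a toy illustration of the domain-detection step, the sketch below scores a user request against hand-made keyword profiles and builds a unigram language model from whatever in-domain text has been harvested. Domains, vocabularies and the corpus stub are invented; a real system would learn the profiles and harvest web text automatically.

```python
# Toy sketch: thematic-domain detection plus a unigram LM over harvested
# in-domain text. All domains, keywords and data are invented examples.
from collections import Counter

DOMAIN_PROFILES = {
    "cinema": {"film", "movie", "actor", "trailer", "cinema"},
    "sport":  {"match", "score", "team", "league", "goal"},
}

def detect_domain(utterance):
    """Pick the domain whose keyword profile overlaps the request most."""
    tokens = set(utterance.lower().split())
    overlap = {d: len(tokens & vocab) for d, vocab in DOMAIN_PROFILES.items()}
    return max(overlap, key=overlap.get)

def unigram_lm(corpus_lines):
    """Relative-frequency unigram model from harvested in-domain text."""
    counts = Counter(w for line in corpus_lines for w in line.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(detect_domain("show me the trailer of that new film"))    # -> "cinema"
print(unigram_lm(["the film was great", "great film"])["film"])  # -> 0.333...
```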

In the near future human-machine dialogue systems will consist of: a user-individual/universal multimodal behavior learning machine; robust prosody analysis (Off-Talk, speaking style, accent and phrase-boundary detection); an intention and focus-of-attention determination module; semantic thematic-domain detection methods; and self-trained language models for predefined thematic domains. Dialogue-dependent semantic systems will be able to operate with the full spectrum of human communication interfaces and the multitude of user interests.

3 Suggestions for Discussion

• Collecting realistic human-machine dialogue databases including a sufficient amount of prosody events, and analysis of current databases. How can we collect real-dialogue databases in a better way?

• The role of a human behavior modelling module in a human-machine dialogue system. How can human behavior modelling prevent problematic situations in user-system communication? Which kinds of behavior have to be provided for?

• Multimodality in human-machine communication. How can we create natural human communication interfaces? How can we use them in natural human-machine interaction?

• How can we use prosody in dialogue systems? How is it used in known projects (Zeissler, 2006)? Which new features can we add?

References

C. Hacker, A. Batliner, and E. Nöth. 2006. Are You Looking at Me, are You Talking with Me – Multimodal Classification of the Focus of Attention. In Proc. Text, Speech and Dialogue, 9th International Conference, Brno, Czech Republic.

B. Schuller, D. Seppi, A. Batliner, A. Maier, and S. Steidl. 2007. Towards more Reality in the Recognition of Emotional Speech. In Proc. ICASSP 2007, Honolulu, Hawaii, USA.

B. Schuller, D. Arcis, G. Rigoll, M. Wimmer, and B. Radig. 2007. Audiovisual behavior modeling by combined feature spaces. In Proc. ICASSP 2007, Honolulu, Hawaii, USA.

M. Streit, A. Batliner, and T. Portele. 2006. Emotions Analysis and Emotion-Handling Subdialogues. In W. Wahlster (ed.), SmartKom: Foundations of Multimodal Dialogue Systems: 317–332. Berlin: Springer.

B. Vlasenko, B. Schuller, A. Wendemuth, and G. Rigoll. 2007. Combining Frame and Turn-Level Information for Robust Recognition of Emotions within Speech. In Proc. of InterSpeech 2007, Antwerpen, Belgium.

B. Vlasenko, B. Schuller, A. Wendemuth, and G. Rigoll. 2007. Frame vs. Turn-Level: Emotion Recognition from Speech Considering Static and Dynamic Processing. In Proc. of ACII 2007, Lisbon, Portugal.

V. Zeissler, J. Adelhardt, A. Batliner, C. Frank, E. Nöth, R. P. Shi, and H. Niemann. 2006. The Prosody Module. In W. Wahlster (ed.), SmartKom: Foundations of Multimodal Dialogue Systems: 139–152. Berlin: Springer.

Biographical Sketch

Bogdan Vlasenko is currently a 2nd-year PhD student at the Chair of Cognitive Systems, Institute for Electronics, Signal Processing and Communications, Otto-von-Guericke University, Magdeburg, Germany. He works under the supervision of Prof. Andreas Wendemuth. He was born in Ukraine and received his B.Sc. and M.Sc. in Computer Science from the National Technical University of Ukraine (KPI). His current research areas include multimodal human-machine dialogue modelling (human behavior model processing, emotion and intention recognition). His other interests include travelling, cinematography, swimming and communication.



Ina Wechsung
Deutsche Telekom Laboratories
Ernst-Reuter-Platz 7
10589 Berlin

[email protected]

1 Research Interests

My research interest is the development of subject-based usability and user experience evaluation methods for multimodal interfaces. Furthermore I am interested in the cognitive foundations of multimodal interaction, in order to explain users' modality preferences, choices and ratings.

1.1 Evaluating multimodal systems

Although a lot of subject-based usability evaluation methods are available, only few are designed for multimodal or spoken dialogue systems. Probably the most common technique applied in subject-based evaluations is the questionnaire, but a standardised and validated questionnaire addressing the evaluation of multimodal systems is still not available. Even for speech-based systems, probably the most common questionnaire, the Subjective Assessment of Speech System Interfaces questionnaire (SASSI) (Hone & Graham, 2000), still lacks final psychometric validation.

Thus I used several established questionnaires to assess users' opinions of multimodal systems. It was analysed in which respects different established questionnaires lead to the same result and where there are inconsistencies across questionnaires. The general objective was to investigate to what extent well-known and widely used scales for the evaluation of graphical user interfaces and speech-based systems are appropriate for the evaluation of multimodal systems. It was shown that questionnaires designed for unimodal systems are not very applicable to the usability evaluation of multimodal systems, since they seem to measure different constructs (Wechsung et al. 2008). The questionnaires with the most concordance were the AttrakDiff (Hassenzahl et al., 2003) and the SASSI (Hone & Graham, 2000). A possible explanation could be the kind of rating scale used in the AttrakDiff, the semantic differential, which is applicable to all systems: it uses no direct questions but pairs of bipolar adjectives, which are not linked to specific functions of a system. In a next step these subjective ratings were validated against more objective and continuous parameters such as log data. It was shown that the subjective data (questionnaire ratings) and the interaction data (task duration) showed concordances only to a limited extent (Naumann et al., 2008). In line with the earlier findings, the questionnaire whose ratings were most consistent with the interaction data was the AttrakDiff. Thus this questionnaire measures the construct it was developed for.

1.2 Relationship between ratings of single modalities and multimodal systems

Since the AttrakDiff showed the highest validity and reliability, I continued to use it to examine how judgments of unimodal system versions relate to judgments of the multimodal version (Wechsung et al., 2009). It was shown that accurate predictions of overall and global ratings of multimodal systems are possible on the basis of the respective ratings of the unimodal systems: ratings of the multimodal systems matched the weighted sum of the ratings of the unimodal systems, with the modality used more often having the stronger influence (see the sketch below). However, for specific measures (in this case three of the AttrakDiff subscales) prediction performance was lower; only for one subscale (hedonic quality-stimulation) was prediction as good as for the overall and global scales. But these findings were limited by the test design: the multimodal version was always the system tested last. It is therefore possible that the participants tried to rate consistently, adding up their single-modality judgments in their minds; the judgments of the multimodal version would then not represent the actual quality of that system. So, in a follow-up study the order was changed, with the multimodal system version tested first. As in the previous study (Wechsung et al., 2009), prediction was best for overall and general judgments and for judgments on the scale hedonic quality-stimulation. The poor prediction performance for the scale hedonic quality-identity might be explained by the underlying construct measured via this scale. The theoretical basis of the AttrakDiff questionnaire is Hassenzahl's model of user experience (Hassenzahl et al., 2003). This model postulates that a product's attributes can be divided into hedonic and pragmatic attributes. Hedonic attributes refer to the product's ability to evoke pleasure and emphasise the psychological well-being of the user; they can be differentiated into attributes providing stimulation, attributes communicating identity and attributes provoking memory. The AttrakDiff aims to measure two of these hedonic attributes: stimulation and identity. A product can provide stimulation when it presents new interaction styles, such as new modalities; thus multimodality should be measurable with the scale hedonic quality-stimulation, and it should, as was the case in all experiments, enhance a product's ability to provide stimulation. The scale hedonic quality-identity measures a product's ability to express the owner's self. The system we tested in the experiments (a room management and information system) was not developed (and is not available) for personal use. Moreover, it was custom-tailored for Deutsche Telekom Laboratories (T-Labs), and none of the participants was employed at T-Labs. Thus in none of its versions was this system designed to promote self-expression or identification by communicating personal values. Furthermore, the test was very task- and goal-oriented: all tasks referred to the system's functionality for the daily business of T-Labs employees and had no actual relevance for the participants. The data confirmed that users rated neutrally on this scale; thus this scale might have been an inappropriate measure. However, for the scale pragmatic quality, which to a large extent measures the concept of usability, no such explanation can be found, and further research on how single modalities influence ratings of multimodal systems should be conducted at this point.
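The weighted-sum prediction just described is simple enough to state in a few lines. The sketch below is illustrative only; the ratings, the modalities and the usage shares are invented numbers, not data from the studies cited above.

```python
# Sketch of predicting a multimodal rating as a usage-weighted sum of the
# unimodal ratings. All values are invented for illustration.

def predict_multimodal(ratings, usage_share):
    """ratings and usage_share are dicts keyed by modality; shares sum to 1."""
    return sum(usage_share[m] * ratings[m] for m in ratings)

ratings = {"speech": 4.1, "touch": 5.6}    # e.g. mean questionnaire scores
usage = {"speech": 0.3, "touch": 0.7}      # touch was used more often
print(predict_multimodal(ratings, usage))  # -> 5.15, pulled towards "touch"
```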

2 Future of Spoken Dialogue Research

In my opinion, today's spoken dialogue systems are preferable to a graphical user interface only under very specific circumstances. Eyes-busy and hands-busy situations like the in-car scenario are among the few examples where I think speech control is actually useful. With the possibility of offering different modalities in multimodal systems, speech will be one option, and the actual usage will depend on the tasks, the situation and individual preferences. To offer modalities adequate to situation and task, more research needs to be done into the underlying cognitive processes of multimodal interaction.

3 Suggestions for Discussion

• How can we get away from self-made questionnaires lacking psychometric validation?

• Influence of test situation on subjective ratings?

• Importance of user experience?

References

Hassenzahl, M., Burmester, M., and Koller, F. 2003. AttrakDiff: Ein Fragebogen zur Messung wahrgenommener hedonischer und pragmatischer Qualität. In Ziegler J., Szwillus G. (eds.), Mensch & Computer 2003. Interaktion in Bewegung: 187–196. B.G. Teubner, Stuttgart.

Hone, K.S., and Graham, R. 2000. Towards a tool for the subjective assessment of speech system interfaces (SASSI). Natural Language Engineering, 6(3/4), 287–305.

Naumann, A. B., and Wechsung, I. 2008. Developing Usability Methods for Multimodal Systems: The Use of Subjective and Objective Measures. In E. L.-C. Law, N. Bevan, G. Christou, M. Springett & M. Larusdottir (eds.), Proceedings of the International Workshop on Meaningful Measures: Valid Useful User Experience Measurement: 8–12.

Wechsung, I., and Naumann, A. B. 2008. Evaluation Methods for Multimodal Systems: A Comparison of Standardized Usability Questionnaires. In E. André, L. Dybkjær, W. Minker, H. Neumann, R. Pieraccini & M. Weber (eds.), Perception in Multimodal Dialogue Systems, 4th IEEE Tutorial and Research Workshop on Perception and Interactive Technologies for Speech-Based Systems: 276–284. Heidelberg: Springer.

Wechsung, I., Engelbrecht, K.-P., Schaffer, S., Seebode, J., Metze, F., and Möller, S. 2009. Usability Evaluation of Multimodal Interfaces: Is the whole the sum of its parts? Accepted for: 13th International Conference on Human-Computer Interaction.

Biographical Sketch

Ina Wechsung is working as a research assistant (Wissenschaftlicher Mitarbeiter) at the Quality and Usability Lab of Deutsche Telekom Laboratories, TU Berlin. She studied Psychology and received her diploma degree in 2006 from the Chemnitz University of Technology. At T-Labs, she is working towards her PhD thesis.



Charlotte Wollermann
Institute of Communication Sciences
University of Bonn
Poppelsdorfer Allee 47
53115 Bonn
Germany

[email protected]
…/Members/cwo/homepage

1 Research Interests

My research interests lie in the experimental investigation of the role of audiovisual prosody in focus production and interpretation. For these purposes natural and synthetic speech are used. My project combines questions, methods and techniques from multiple disciplines: speech production, perception and interpretation, audiovisual prosody, experimental pragmatics and phonetics. Moreover, basic research in human-human dialogue is important in order to develop believable spoken dialogue systems with deep linguistic performance, i.e. syntactic, semantic and pragmatic analysis.

1.1 Past, Current and Future Work

The purpose of my PhD research is to find out experimentally what role acoustic and/or visual cues play in the production, perception and interpretation of audio and/or visual speech.

In one experiment the role of emotion and attitude in audio speech perception was investigated. In Wollermann and Lasarcyk (2007) we measured the influence of (un)certainty on the perception of articulatory speech synthesis. For this purpose we varied acoustic cues which had been identified in human-human dialogue as conveying uncertainty (e.g. Smith and Clark 1993; Swerts and Krahmer 2005). The results show that subjects are generally able to identify the intended levels of uncertainty.

Further, we tested to what extent audiovisual prosody affects the interpretation of focus produced by a talking head: we showed that the intensity of accent and eyebrow movement can influence the interpretation of a pragmatic focus (Fisseni et al. 2006). "Focus" refers to the concept as it is defined in formal semantics, e.g. Rooth (1992).

Moreover, the goal of a series of interpretation studies was to find out the contribution of uncertainty, as a paralinguistic expression, to pragmatic focus interpretation. Here, natural stimuli were used. The results of Wollermann and Schröder (2008a, 2008b) suggest that prosodic cues affect the exhaustive interpretation of answers which follows pragmatic focus detection, but that contextual factors (micro and macro context) have a stronger effect.

Present and future work tests which audiovisual cues are relevant for pragmatic focus production. Our recent study (Wollermann and Schröder, to appear) suggests that speakers use audiovisual cues of (un)certainty when uttering (non-)exhaustive answers. We also find empirical evidence for the co-expressivity of the audio and visual signals.

2 Future of Spoken Dialogue Research

I assume that current and future dialogue research will deal increasingly with the modelling of emotion, attitude and personality. The modelling of emotion and attitude in synthetic speech has gained considerable importance in recent years, as one aims to generate synthetic speech that is as natural and human-like as possible; such systems were developed, for instance, by Burkhardt (2001) and Schröder (2004). Not only has the modelling of emotion in the acoustic channel become more and more popular, but also in the visual modality, for creating virtual agents "able" to express their emotional state through different channels. There has been much progress in developing Embodied Conversational Agents which are able to express personality and emotional states (e.g. Becker and Wachsmuth 2006). Nowadays, applications for these programs are usually restricted to special domains: REA (Cassell et al. 1999), for example, functions as a real estate agent and MAX (Kopp et al. 2005) as a museum guide. A challenge for researchers in this field is the adaptation of these programs to a larger variety of domains. For instance, it would be necessary to investigate and evaluate in a very fine-grained manner which domains actually benefit from a virtual agent.

However, in order to implement emotional cues it is first necessary to know how these cues are produced and perceived in natural speech. Most studies in the field of emotion research use acted emotional speech; in order to develop believable dialogue systems, it would be useful to have methods for exploring real emotional speech.

YRRoSDS’2009 Sep 13-14, London, UK 101

Page 102: Proceedings of the Fifth Annual Young Researchers ...yrrsds/archived/YRR09_proceedings-web_small.pdf · David Martins de Matos, L2F INESC-ID, Technical University of Lisbon, Portugal

3 Suggestions for Discussion

I propose the following subjects for discussion:

• Speakers and hearers use different cues in communication for signalling and detecting uncertainty, and several studies show that uncertainty can be expressed by audio and/or visual synthetic speech. In this context the following question arises: does the user of such a dialogue system expect the machine to have its own meta-cognitive state?

• It has been shown that the McGurk effect (McGurk and MacDonald, 1976) is a universal phenomenon, but its strength depends on different factors, e.g. cultural origin. Which cultural aspects need to be considered for the development of talking heads, given the fact that some facial expressions and gestures differ cross-culturally in their meaning?

• In natural language different phenomena of focus occur, e.g. semantic focus, pragmatic focus, contrastive focus, broad vs. narrow focus. How can we account for the complexity of the phenomenon and its prosodic implications in the development of spoken dialogue systems?

References

C. Becker and I. Wachsmuth. 2006. Modeling Primary and Secondary Emotions for a Believable Communication Agent. In Proc. of the First Workshop on Emotion and Computing, pp. 31–34. University of Bremen.

F. Burkhardt. 2001. Simulation emotionaler Sprechweise mit Sprachsyntheseverfahren. PhD Thesis, University of Berlin, Shaker Verlag.

J. Cassell, T. Bickmore, M. Billinghurst, L. Campbell, K. Chang, H. Vilhjálmsson, and H. Yan. 1999. Embodiment in conversational interfaces: REA. In Proc. of ACM CHI 99, pp. 520–527.

B. Fisseni, F. Hülsken, B. Schröder, and C. Wollermann. 2006. Eyebrows, Accent and Focus Detection. Paper presented at the 7th Szklarska Poreba Workshop, March 03, 2006, Poland.

S. Kopp, L. Gesellensetter, N. Krämer, and I. Wachsmuth. 2005. A conversational agent as museum guide – design and evaluation of a real-world application. In Panayiotopoulos et al. (eds.), Intelligent Virtual Agents, LNAI 3661, pp. 329–343. Berlin: Springer-Verlag.

H. McGurk and J.W. MacDonald. 1976. Hearing lips and seeing voices. Nature, vol. 264, pp. 746–748.

M. Rooth. 1992. A theory of focus interpretation. Natural Language Semantics, 1, pp. 76–116.

M. Schröder. 2004. Speech and Emotion Research: An overview of research frameworks and a dimensional approach to emotional speech synthesis. PhD thesis, PHONUS 7, Research Report of the Institute of Phonetics, Saarland University.

V. Smith and H. Clark. 1993. On the course of answering questions. Journal of Memory and Language, vol. 32, pp. 25–38.

M. Swerts and E. Krahmer. 2005. Audiovisual prosody and feeling of knowing. Journal of Memory and Language, vol. 53(1), pp. 81–94.

C. Wollermann and E. Lasarcyk. 2007. Modeling and perceiving of (un)certainty in articulatory speech synthesis. In Proc. of the 6th ISCA Tutorial and Research Workshop on Speech Synthesis, pp. 40–45. Bonn, Germany.

C. Wollermann and B. Schröder. 2008a. Does Uncertainty Effect the Case of Exhaustive Interpretation? In Proc. of ISCA Tutorial and Research Workshop on Experimental Linguistics, pp. 233–236. Athens, Greece.

C. Wollermann and B. Schröder. 2008b. Certainty, Context and Exhaustivity of Answers. Paper presented at Speech and Face to Face Communication, October 27–29, 2008. Grenoble, France.

C. Wollermann. 2009. Effects of Exhaustivity and Uncertainty on Audiovisual Focus Production. To appear in Proc. of AVSP 2009. Norwich, UK.

Biographical Sketch

Charlotte Wollermann is a research assistant at the Institute of Communication Sciences at the University of Bonn. Her PhD thesis is supervised by Prof. Dr. Bernhard Schröder and apl. Prof. Dr. Ulrich Schade. She holds an M.A. in communication research and phonetics from the same university; she also spent time abroad at Dublin City University. As a student assistant she developed and tested commercial question-answering systems at Q-Go Online Marketing & Self Service; at Philips Laboratories she tested multimodal home dialogue systems. The author also took a course in advanced voice user interface design at VoiceObjects.

