evaluating human-machine conversation for appropriateness david benyon, preben hansen, oli mival and...
TRANSCRIPT
Evaluating Human-Machine Conversation for Appropriateness
David Benyon, Preben Hansen, Oli Mival and Nick Webb
Overview
• www.companions-project.org
• Companions are targeted as persistent, collaborative, conversational partners
• Rather than singular tasks, Companions have a range of tasks
• Completion of tasks is important
• So is conversational performance
Metrics
• Objective measures– WER, CER, Turn Duration, Vocabulary…
• Subjective user measures– User satisfaction surveys
• Appropriateness
Appropriateness
• D. Traum, S. Robinson and J. Stephan. Evaluation of multi-party virtual reality dialogue interaction, in LREC, 2004.
• Alongside traditional measures, introduces concept of “response appropriateness”
• Created for ICT/ISI mission rehearsal exercise system
Initial Companion Evaluation
• 2 Companion prototypes– Health & Fitness
– Senior Companion
• 8 users completed entire protocol
• All participants were native English speakers without strong accents
• Ages from 27 to 61
• 2 were female, 6 were male
Initial Companion Evaluation
• New version (2.0) of Senior Companion– 12 new participants– 9 male, 3 female (ages 21-38)
• Key changes– Facebook photographs (pre-tagged)– Loquendo TTS elements (cough, laugh)– Additional “chat” ability from a chatbot
• Improved metric results– Avg. words / utterance– 4.27 (v1) to 6.1 (v2)
The Companion demonstrated emotion at timesThe Companion demonstrated emotion at times
v1.0v1.0
v2.0v2.0
v1.0 SC vs v2.0
Appropriateness
• Traum et al. devised an “appropriateness” coding scheme.
• Split system and user utterances.
• Users:– Response To System [RTS]– Gets RESponse [RES]– No Response: Appropriate [NRA]– No Response: Not appropriate [NRN]
• For agents:
–Filled Pause [FP]
–Request for Repair [RR]
–Appropriate Response [AR]
–Appropriate Question [AQ]
–Appropriate new INItiative [INI]
–Appropriate CONtinuation [CON]
– iNAPpropriate response, initiative or continuation [NAP]
3rd Phase - Appropriateness
Scoring Intuitions• Filled pauses generally human-like and good for virtual
agents to perform but don’t add a lot (0)
• Appropriate responses and questions very good (+2), but initiatives that push the interaction back on track are better (+3)
• Extended contributions on topic somewhat good (+.5)
• Repairs and clarifications bad (-.5), but their use can still gain points by allowing subsequent appropriate response
• Inappropriate response bad (-1), no response worse (-2)
Appropriateness Evaluation
• 7 HFC, 13 SC dialogues
• 4 Pre-Chatbot SC, 9 Post-Chatbot
• All annotated by a single user
• To check consistency, one dialogue of each annotated by at least 3 annotators
• Inter-coder reliability scored at over 0.8 (Krippendorff’s Alpha) for all
Sample HFC Dialogue <System CS="SUGGEST-MEAL-SOURCE ORDER-TAKEAWAY LUNCH"
time="02:51:06 PM" turn_no="16">Alright. How about ordering lunch from a takeaway?</System> [Appropriate Question]
<User time="02:51:13 PM" turn_no="17" /> [Response To System]
<System CS="SUGGEST-MEAL-SOURCE ORDER-TAKEAWAY LUNCH" time="02:51:25 PM" turn_no="18">Sorry, I didn't understand. Can you re-phrase: How about ordering lunch from a takeaway?</System> [Request Repair]
Sample SC Dialogue
<turn end_time="03:49:01 PM" mode="photo” speaker="S" start_time="03:48:59 PM" turn_no="65">What is your relationship to Libby.</turn>
[Appropriate Question]
<turn end_time="03:49:04 PM" mode="photo” speaker="S" start_time="03:49:01 PM" turn_no="66">I'm sorry I didn't understand your relationship to Libby.</turn> [Inappropriate Response]
…
<turn end_time="03:49:19 PM" mode="photo” speaker="U" turn_no="70"> could be as my friend</turn> [Response To System]
Initial Conclusions
• Seems to correlate with improvement in user responses (needs further investigation)
• Reliably encoded by annotators
• Indicates problem areas in dialogue
Tools and Resources
• XML encoded dialogue corpus
• Corpus collection tool
• Appropriateness annotation guidelines
• Appropriateness annotation tool
Next Steps
• Refine appropriateness measures– Add NEW tags
• confirmation, politeness, emotion,
– Modify existing tags• specific inappropriate tags
• Don’t have upper bounds of performance – require WoZ models
• Need to monitor users behaviour over time• Use scoring system to inform reinforcement
learning